The aim of this practical is to introduce you to word embeddings and to deepen your understanding of sentiment analysis through dictionary-based sentiment analysis.
library(tidyverse) # as always :)
library(tidytext)  # for text mining
library(text2vec)  # our dataset comes from this package
Word Embedding
In this part of the practical we will use word embeddings. To work with text, we need to transform it into numerical values. One way is to count the words and weight their frequency (e.g. with TF-IDF). Another approach is word embedding. Word embedding techniques such as word2vec and GloVe learn dense word vectors from the contexts in which words occur: word2vec trains a shallow neural network, while GloVe factorizes a word co-occurrence matrix. With these techniques, words that appear in similar contexts are represented by similar numerical vectors, as the toy example below illustrates.
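As a toy illustration (using made-up vectors, not the practical's data), two words that occur in similar contexts end up with vectors pointing in a similar direction, which we can quantify with cosine similarity:
# hypothetical 3-dimensional word vectors, for illustration only
king  <- c(0.9, 0.1, 0.8)
queen <- c(0.8, 0.2, 0.9)
apple <- c(0.1, 0.9, 0.0)
# cosine similarity: dot product divided by the product of the vector lengths
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(king, queen) # close to 1: similar vectors
cosine(king, apple) # much lower: dissimilar vectors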
We will use the Harry Potter books as our data. Let’s start by installing the harrypotter package from GitHub using the remotes package.
remotes::install_github("bradleyboehmke/harrypotter")
library(harrypotter) # Not to be confused with the CRAN palettes package
The harrypotter package supplies the first seven novels in the Harry Potter series. Here is some info about this data set:
A data set with all Harry Potter books.
This data set contains the full texts of the first seven Harry Potter books (see the list below). Each book is stored as a character vector, with each element representing a single chapter. The data are provided by the harrypotter package written by Bradley Boehmke.
philosophers_stone: Harry Potter and the Philosopher's Stone, published in 1997
chamber_of_secrets: Harry Potter and the Chamber of Secrets, published in 1998
prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban, published in 1999
goblet_of_fire: Harry Potter and the Goblet of Fire, published in 2000
order_of_the_phoenix: Harry Potter and the Order of the Phoenix, published in 2003
half_blood_prince: Harry Potter and the Half-Blood Prince, published in 2005
deathly_hallows: Harry Potter and the Deathly Hallows, published in 2007
1. Use the code below to load the first seven novels in the Harry Potter series.
hp_books <- c(
  "philosophers_stone", "chamber_of_secrets",
  "prisoner_of_azkaban", "goblet_of_fire",
  "order_of_the_phoenix", "half_blood_prince",
  "deathly_hallows"
)

hp_words <- list(
  philosophers_stone,
  chamber_of_secrets,
  prisoner_of_azkaban,
  goblet_of_fire,
  order_of_the_phoenix,
  half_blood_prince,
  deathly_hallows
) |>
  # name each list element
  set_names(hp_books) |>
  # convert each book to a data frame and merge into a single data frame
  map_df(as_tibble, .id = "book") |>
  # convert book to a factor
  mutate(book = factor(book, levels = hp_books)) |>
  # remove empty chapters
  filter(!is.na(value)) |>
  # create a chapter id column
  group_by(book) |>
  mutate(chapter = row_number(book))

head(hp_words)
# A tibble: 6 x 3
# Groups: book [1]
book value chapter
<fct> <chr> <int>
1 philosophers_stone "THE BOY WHO LIVED Mr. and Mrs. Dursley, of num~ 1
2 philosophers_stone "THE VANISHING GLASS Nearly ten years had passe~ 2
3 philosophers_stone "THE LETTERS FROM NO ONE The escape of the Braz~ 3
4 philosophers_stone "THE KEEPER OF THE KEYS BOOM. They knocked agai~ 4
5 philosophers_stone "DIAGON ALLEY Harry woke early the next morning~ 5
6 philosophers_stone "THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS~ 6
2. Use the unnest_tokens function from the tidytext package to tokenize the data frame and name your object hp_words.
# tokenize the data frame
hp_words <- hp_words |>
  unnest_tokens(word, value)

head(hp_words)
# A tibble: 6 x 2
# Groups: book [1]
book word
<fct> <chr>
1 philosophers_stone the
2 philosophers_stone boy
3 philosophers_stone who
4 philosophers_stone lived
5 philosophers_stone mr
6 philosophers_stone and
3. Remove the stop words from the tokenized data frame.
Hint: Use the anti_join function to filter out the stop_words provided by the tidytext package.
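One possible answer, a minimal sketch using the stop_words data frame that comes with tidytext:
# remove stop words from the tokenized data frame
hp_words <- hp_words |>
  anti_join(stop_words, by = "word")

head(hp_words)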
# A tibble: 6 x 2
# Groups: book [1]
book word
<fct> <chr>
1 philosophers_stone boy
2 philosophers_stone lived
3 philosophers_stone dursley
4 philosophers_stone privet
5 philosophers_stone drive
6 philosophers_stone proud
4. Create a vocabulary of unique terms using the create_vocabulary function from the text2vec package and remove the words that appear fewer than 5 times.
Hint: The code is given below, make sure you understand it.
Step 1: Create a list of words from hp_words using the list function (this is the input for the iterator).
Step 2: Use the itoken function on the word list to create index-tokens.
Step 3: Use the create_vocabulary function on the itoken object to collect the unique terms.
Step 4: Use prune_vocabulary on the data frame of unique terms and specify term_count_min = 5 to filter out the infrequent terms.
# make it a list (iterator)
hp_words_ls <- list(hp_words$word)
# create index-tokens
it <- itoken(hp_words_ls, progressbar = FALSE)
# collect unique terms
hp_vocab <- create_vocabulary(it)
# filter the infrequent terms (fewer than 5 occurrences)
hp_vocab <- prune_vocabulary(hp_vocab, term_count_min = 5)
# show the resulting vocabulary object
hp_vocab
# We've just created word counts, that's all the vocabulary object is!
5. The next step is to create a token co-occurrence matrix (TCM). First, we need to apply the vocab_vectorizer function to transform the list of tokens into a vector space. Then, use the create_tcm function to create a TCM with a window of 5 for the context words.
Hint: The code is given below, make sure you understand it.
Step 1: Map the words to indices with vocab_vectorizer(vocabulary object from Question 4).
Step 2: Create a TCM with create_tcm(it, vectorizer function from Step 1, skip_grams_window = 5), where it is the iterator over tokens created by itoken.
# maps words to indices
vectorizer <- vocab_vectorizer(hp_vocab)
# use window of 5 for context words
hp_tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)
# Note that such a matrix will be extremely sparse.
# Most words do not go with other words in the grand
# scheme of things. So when they do, it usually matters.
6. Use GlobalVectors as given in the code below to fit the word vectors on our data set. Set the embedding size (the rank argument) to 50 and the maximum number of co-occurrences (x_max) to 10. Train the word vectors for 20 iterations. You can check the full set of input arguments of the fit_transform function from here.
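A minimal sketch of the fitting step, assuming the hp_tcm matrix from Question 5 and the object names glove and hp_wv_main that are used later in this practical:
# create the GloVe model: embedding size (rank) of 50, maximum co-occurrence count (x_max) of 10
glove <- GlobalVectors$new(rank = 50, x_max = 10)
# fit the model on the TCM and store the main word vectors; train for 20 iterations
hp_wv_main <- glove$fit_transform(hp_tcm, n_iter = 20)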
INFO [16:14:44.462] epoch 1, loss 0.1199
INFO [16:14:44.865] epoch 2, loss 0.0781
INFO [16:14:45.269] epoch 3, loss 0.0677
INFO [16:14:45.635] epoch 4, loss 0.0615
INFO [16:14:45.997] epoch 5, loss 0.0572
INFO [16:14:46.456] epoch 6, loss 0.0541
INFO [16:14:46.812] epoch 7, loss 0.0516
INFO [16:14:47.189] epoch 8, loss 0.0496
INFO [16:14:47.555] epoch 9, loss 0.0480
INFO [16:14:47.956] epoch 10, loss 0.0466
INFO [16:14:48.329] epoch 11, loss 0.0455
INFO [16:14:48.670] epoch 12, loss 0.0445
INFO [16:14:49.013] epoch 13, loss 0.0436
INFO [16:14:49.349] epoch 14, loss 0.0428
INFO [16:14:49.690] epoch 15, loss 0.0421
INFO [16:14:50.032] epoch 16, loss 0.0415
INFO [16:14:50.408] epoch 17, loss 0.0410
INFO [16:14:50.783] epoch 18, loss 0.0405
INFO [16:14:51.210] epoch 19, loss 0.0400
INFO [16:14:51.593] epoch 20, loss 0.0396
7. The GloVe model learns two sets of word vectors: main and context. Essentially they are the same, since the model is symmetric. In general, combining the two sets of word vectors leads to higher-quality embeddings (read more here). The best practice is to combine the main and context word vectors into one matrix. Extract both sets of word vectors and save their sum for the following questions.
Hint: Follow the steps below.
Step 1: Extract the context word vectors with glove$components.
Step 2: Sum the two sets of word vectors (e.g., hp_wv_main + t(hp_wv_context)).
# extract context word vectors
hp_wv_context <- glove$components
# check the dimensions of both matrices
dim(hp_wv_main); dim(hp_wv_context)
[1] 8628 50
[1] 50 8628
# Either word-vector matrix could work, but the developers of the technique
# suggest the sum/mean may work better
hp_word_vectors <- hp_wv_main + t(hp_wv_context) # transpose one matrix to perform matrix addition
8. Find the 10 most similar words to each of the words "harry", "death", and "love".
Hint: Follow the steps below.
Step 1: Extract the row of the corresponding word from the word vector matrix (e.g., matrix[“harry”, , drop = FALSE]).
Step 2: Use the sim2 function with the cosine similarity measure to calculate the pairwise similarities between the chosen row vector (from Step 1) and the rest of the words: sim2(x = whole word vector matrix, y = chosen row vector, method = "cosine", norm = "l2").
Step 3: Sort the resulting column vector of similarities in descending order and present the first 10 values. For example, you can do this with head(sort(similarity vector, decreasing = TRUE), 10).
Step 4: Repeat Step 1 - Step 3 for the other words.
# extract the row of "harry"
harry <- hp_word_vectors["harry", , drop = FALSE]
# calculates pairwise similarities between "harry" and the rest of words
cos_sim_harry <- sim2(x = hp_word_vectors, y = harry, method = "cosine", norm = "l2")
# the top 10 words with the highest similarities
head(sort(cos_sim_harry[,1], decreasing = T), 10)
harry ron hermione time moment left looked told
1.0000000 0.8860322 0.8702359 0.8326799 0.8273744 0.7741125 0.7699056 0.7223630
happened found
0.7162410 0.7099754
# extract the row of "death"
death <- hp_word_vectors["death", , drop = FALSE]
# calculates pairwise similarities between "death" and the rest of words
cos_sim_death <- sim2(x = hp_word_vectors, y = death, method = "cosine", norm = "l2")
# the top 10 words with the highest similarities
head(sort(cos_sim_death[,1], decreasing = T), 10)
# extract the row of "love"
love <- hp_word_vectors["love", , drop = FALSE]
# calculates pairwise similarities between "love" and the rest of words
cos_sim_love <- sim2(x = hp_word_vectors, y = love, method = "cosine", norm = "l2")
# top 10 words with the highest similarities
head(sort(cos_sim_love[,1], decreasing = T), 10)
love potion truth potions dad telling lucky polyjuice
1.0000000 0.5577526 0.5039527 0.5024910 0.4958414 0.4927394 0.4826604 0.4614335
remember who'd
0.4587588 0.4537236
9. Now you can play with word vectors! For example, add the word vector of "harry" to the word vector of "love" and subtract the word vector of "death" from the result. What are the top terms in your result?
Hint: You can literally add/subtract the word vectors (e.g., harry word vector + love word vector - death word vector). Once you have the resulting vector, calculate the similarities as you did in Question 8.
# add/subtract word vectors
test <- harry + love - death
# calculates pairwise similarities between the resulting vector and the rest of words
cos_sim_test <- sim2(x = hp_word_vectors, y = test, method = "cosine", norm = "l2")
# top 10 words with the highest similarities
head(sort(cos_sim_test[,1], decreasing = T), 10)
ron hermione harry added thinking time quickly sighed
0.6980152 0.6796295 0.6794213 0.5845963 0.5726407 0.5434684 0.5378614 0.5341729
ginny told
0.5297617 0.5279977
Dictionary-based Sentiment Analysis
In this part of the practical, we are going to apply dictionary-based sentiment analysis methods on the movie_review data set.
Text data set
We are going to use the movie_review data set: 5,000 IMDB movie reviews available from the text2vec package, labeled according to their IMDB rating. The sentiment of the reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews.
The objective of the practical is to understand if the reviews are predictive of the IMDB rating.
Let’s load the data set and convert it to a tibble.
# load the movie review dataset from the text2vec package
data("movie_review")
movie_review <- as_tibble(movie_review)
head(movie_review)
# A tibble: 6 x 3
id sentiment review
<chr> <int> <chr>
1 5814_8 1 "With all this stuff going down at the moment with MJ i've s~
2 2381_9 1 "\\\"The Classic War of the Worlds\\\" by Timothy Hines is a~
3 7759_3 0 "The film starts with a manager (Nicholas Bell) giving welco~
4 3630_4 0 "It must be assumed that those who praised this film (\\\"th~
5 9495_8 1 "Superbly trashy and wondrously unpretentious 80's exploitat~
6 8196_8 1 "I dont know why people think this is such a bad movie. Its ~
Dictionary-based sentiment analysis
The tidytext package provides access to 4 general-purpose lexicons via its get_sentiments function (a quick look at these lexicons is sketched after the list below):
afinn: list of English words rated for valence between -5 and +5
bing: list of positive and negative sentiment
nrc: list of English words and their associations with 8 emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and 2 sentiments (negative and positive); binary
loughran: list of sentiment words for accounting and finance by category (Negative, Positive, Uncertainty, Litigious, Strong Modal, Weak Modal, Constraining)
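A quick way to inspect a lexicon is sketched below; note that the afinn, nrc, and loughran lexicons are downloaded through the textdata package the first time they are requested, which asks for an interactive confirmation:
# the bing lexicon ships with tidytext
get_sentiments("bing") |> head()
# the other lexicons need the textdata package and a one-time download, e.g.:
# get_sentiments("afinn") |> head()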
We are going to use the labMT dictionary (Dodds et al., 2011), one of the best-performing dictionaries for sentiment analysis (see e.g. this paper). Before turning to labMT, the sketch below illustrates the general dictionary-based workflow.
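This is a minimal sketch of the dictionary-based approach, assuming the movie_review tibble loaded above; it uses the bing lexicon purely as a stand-in (not labMT), tokenizes the reviews, joins the tokens with the lexicon, and computes a net sentiment score per review:
# score each review by joining its tokens with a sentiment lexicon (bing used as a stand-in)
review_scores <- movie_review |>
  rename(rating = sentiment) |> # avoid clashing with the lexicon's sentiment column
  unnest_tokens(word, review) |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(id, rating, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(score = positive - negative)

head(review_scores)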