The aim of this practical is to introduce you to word embedding and to deepen your understanding of sentiment analysis by performing a dictionary-based sentiment analysis.
library(tidyverse) # as always :)
library(tidytext)  # for text mining
library(text2vec)  # our dataset comes from this package
Word Embedding
In this part of the practical we will use word embedding. To work with text, we need to transform it into numerical values. One way is to count the words and weight their frequency (e.g. with TF-IDF). Another method is word embedding. Word embedding techniques such as word2vec and GloVe use neural networks to construct word vectors. With these techniques, words in similar contexts are represented by similar numerical vectors.
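To make this concrete, here is a toy sketch with made-up three-dimensional vectors (the real embeddings we fit below have 50 dimensions): words used in similar contexts get vectors that point in similar directions, which we can quantify with cosine similarity, the same measure we later use via sim2.

# toy, made-up "embeddings" for illustration only
wizard <- c(0.8, 0.1, 0.3)
witch  <- c(0.7, 0.2, 0.4)
kettle <- c(0.1, 0.9, 0.0)

# cosine similarity: close to 1 for vectors pointing in the same direction
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(wizard, witch)  # high: similar contexts
cosine(wizard, kettle) # low: different contexts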
We will use the Harry Potter books as our data. Let’s start by installing the harrypotter package from github using remotes.
remotes::install_github("bradleyboehmke/harrypotter")
library(harrypotter) # Not to be confused with the CRAN palettes package
The harrypotter package supplies the first seven novels in the Harry Potter series. Here is some info about this data set:
A data set with all Harry Potter books.
This data set contains the full texts of the first seven Harry Potter books (see the list below). Each text is stored as a character vector, with each element representing a single chapter. It is provided by the harrypotter package written by Bradley Boehmke.
philosophers_stone: Harry Potter and the Philosophers Stone, published in 1997
chamber_of_secrets: Harry Potter and the Chamber of Secrets, published in 1998
prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban, published in 1999
goblet_of_fire: Harry Potter and the Goblet of Fire, published in 2000
order_of_the_phoenix: Harry Potter and the Order of the Phoenix, published in 2003
half_blood_prince: Harry Potter and the Half-Blood Prince, published in 2005
deathly_hallows: Harry Potter and the Deathly Hallows, published in 2007
1. Use the code below to load the first seven novels in the Harry Potter series.
hp_books <- c("philosophers_stone", "chamber_of_secrets",
              "prisoner_of_azkaban", "goblet_of_fire",
              "order_of_the_phoenix", "half_blood_prince",
              "deathly_hallows")

hp_words <- list(
  philosophers_stone,
  chamber_of_secrets,
  prisoner_of_azkaban,
  goblet_of_fire,
  order_of_the_phoenix,
  half_blood_prince,
  deathly_hallows
) |>
  # name each list element
  set_names(hp_books) |>
  # convert each book to a data frame and merge into a single data frame
  map_df(as_tibble, .id = "book") |>
  # convert book to a factor
  mutate(book = factor(book, levels = hp_books)) |>
  # remove empty chapters
  filter(!is.na(value)) |>
  # create a chapter id column
  group_by(book) |>
  mutate(chapter = row_number(book))

head(hp_words)
# A tibble: 6 x 3
# Groups: book [1]
book value chapter
<fct> <chr> <int>
1 philosophers_stone "THE BOY WHO LIVED Mr. and Mrs. Dursley, of num~ 1
2 philosophers_stone "THE VANISHING GLASS Nearly ten years had passe~ 2
3 philosophers_stone "THE LETTERS FROM NO ONE The escape of the Braz~ 3
4 philosophers_stone "THE KEEPER OF THE KEYS BOOM. They knocked agai~ 4
5 philosophers_stone "DIAGON ALLEY Harry woke early the next morning~ 5
6 philosophers_stone "THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS~ 6
2. Use the unnest_tokens function from the tidytext package to tokenize the data frame and name your object hp_words.
# tokenize the data frame
hp_words <- hp_words |>
  unnest_tokens(word, value)

head(hp_words)
# A tibble: 6 x 2
# Groups: book [1]
book word
<fct> <chr>
1 philosophers_stone the
2 philosophers_stone boy
3 philosophers_stone who
4 philosophers_stone lived
5 philosophers_stone mr
6 philosophers_stone and
3. Remove the stop words from the tokenized data frame.
Hint: Use the anti_join function to filter out the stop_words data set from the tidytext package.
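A possible solution, assuming the tokenized hp_words from Question 2 and the stop_words data set that ships with tidytext:

# remove stop words from the tokenized data frame
hp_words <- hp_words |>
  anti_join(stop_words, by = "word")

head(hp_words)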
# A tibble: 6 x 2
# Groups: book [1]
book word
<fct> <chr>
1 philosophers_stone boy
2 philosophers_stone lived
3 philosophers_stone dursley
4 philosophers_stone privet
5 philosophers_stone drive
6 philosophers_stone proud
4. Create a vocabulary of unique terms using the create_vocabulary function from the text2vec package and remove the words that appear fewer than 5 times.
Hint: The code is given below, make sure you understand it.
Step 1. Create a list of words from hp_words (iterator object) using list function.
Step 2. Use the itoken function on the word list to create index-tokens.
Step 3: Use the create_vocabulary function on the itoken object to collect unique terms.
Step 4: Use the prune_vocabulary function on the data frame of unique terms and specify term_count_min = 5 to filter out the infrequent terms.
# make it a list (iterator)
hp_words_ls <- list(hp_words$word)
# create index-tokens
it <- itoken(hp_words_ls, progressbar = FALSE)
# collect unique terms
hp_vocab <- create_vocabulary(it)
# filter the infrequent terms (number of occurrences is less than 5)
hp_vocab <- prune_vocabulary(hp_vocab, term_count_min = 5)
# show the resulting vocabulary object
hp_vocab
# We've just created word counts, that's all the vocabulary object is!
5. The next step is to create a token co-occurrence matrix (TCM). First, we need to apply the vocab_vectorizer function to transform the list of tokens into a vector space. Then, use the create_tcm function to create a TCM with a window of 5 for context words.
Hint: The code is given below, make sure you understand it.
Step 1: Map the words to indices with vocab_vectorizer(vocabulary object from Question 4).
Step 2: Create a TCM with create_tcm(it, vectorizer function from Step 1, skip_grams_window = 5), where it is the iterator over tokens created by itoken.
# map words to indices
vectorizer <- vocab_vectorizer(hp_vocab)
# use window of 5 for context words
hp_tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)
# Note that such a matrix will be extremely sparse.
# Most words do not go with other words in the grand
# scheme of things. So when they do, it usually matters.
6. Use GlobalVectors as given in the code below to fit the word vectors on our data set. Choose the embedding size (rank variable) equal to 50 and the maximum number of co-occurrences equal to 10. Train the word vectors in 20 iterations. You can check the full input arguments of the fit_transform function here.
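A sketch of the fitting step that produces the training log below, using text2vec's GlobalVectors class (the object names glove and hp_wv_main are chosen to match their use in the next question):

# create the GloVe model: embedding size (rank) 50, maximum co-occurrences 10
glove <- GlobalVectors$new(rank = 50, x_max = 10)

# fit the model on the TCM and return the main word vectors (20 iterations)
hp_wv_main <- glove$fit_transform(hp_tcm, n_iter = 20)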
INFO [16:14:44.462] epoch 1, loss 0.1199
INFO [16:14:44.865] epoch 2, loss 0.0781
INFO [16:14:45.269] epoch 3, loss 0.0677
INFO [16:14:45.635] epoch 4, loss 0.0615
INFO [16:14:45.997] epoch 5, loss 0.0572
INFO [16:14:46.456] epoch 6, loss 0.0541
INFO [16:14:46.812] epoch 7, loss 0.0516
INFO [16:14:47.189] epoch 8, loss 0.0496
INFO [16:14:47.555] epoch 9, loss 0.0480
INFO [16:14:47.956] epoch 10, loss 0.0466
INFO [16:14:48.329] epoch 11, loss 0.0455
INFO [16:14:48.670] epoch 12, loss 0.0445
INFO [16:14:49.013] epoch 13, loss 0.0436
INFO [16:14:49.349] epoch 14, loss 0.0428
INFO [16:14:49.690] epoch 15, loss 0.0421
INFO [16:14:50.032] epoch 16, loss 0.0415
INFO [16:14:50.408] epoch 17, loss 0.0410
INFO [16:14:50.783] epoch 18, loss 0.0405
INFO [16:14:51.210] epoch 19, loss 0.0400
INFO [16:14:51.593] epoch 20, loss 0.0396
7. The GloVe model learns two sets of word vectors: main and context. Essentially they are the same, since the model is symmetric. In general, combining the two sets of word vectors leads to higher quality embeddings (read more here). The best practice is to combine both the main word vectors and the context word vectors into one matrix. Extract both sets of word vectors and save their sum for the following questions.
Hint: Follow the steps below.
Step 1: Extract context word vectors by glove$components.
Step 2: Sum the two sets of word vectors (e.g., hp_wv_main + t(hp_wv_context)).
# extract context word vectors
hp_wv_context <- glove$components
# check the dimensions of both matrices
dim(hp_wv_main); dim(hp_wv_context)
[1] 8628 50
[1] 50 8628
# Either word-vectors matrix could work, but the developers of the technique
# suggest the sum/mean may work better
hp_word_vectors <- hp_wv_main + t(hp_wv_context) # transpose one matrix to perform matrix addition
8. Find the most similar 10 words to each of the words: “harry”, “death”, and “love”.
Hint: Follow the steps below.
Step 1: Extract the row of the corresponding word from the word vector matrix (e.g., matrix[“harry”, , drop = FALSE]).
Step 2: Use sim2 function with the cosine similarity measure to calculate the pairwise similarities between the chosen row vector (from Step 1) and the rest of words: sim2(x = whole word vector matrix, y = chosen row vector, method = "cosine", norm = "l2").
Step 3: Sort the resulting column vector of similarities in descending order and present the first 10 values. For example, you can do this by head(sort(similarity vector, decreasing = TRUE), 10).
Step 4: Repeat Step 1 - Step 3 for the other words.
# extract the row of "harry"harry <- hp_word_vectors["harry", , drop =FALSE]# calculates pairwise similarities between"harry" and the rest of wordscos_sim_harry <-sim2(x = hp_word_vectors, y = harry, method ="cosine", norm ="l2")# the top 10 words with the highest similaritieshead(sort(cos_sim_harry[,1], decreasing = T), 10)
harry ron hermione time moment left looked told
1.0000000 0.8860322 0.8702359 0.8326799 0.8273744 0.7741125 0.7699056 0.7223630
happened found
0.7162410 0.7099754
# extract the row of "death"death <- hp_word_vectors["death", , drop =FALSE]# calculates pairwise similarities between"harry" and the rest of wordscos_sim_death <-sim2(x = hp_word_vectors, y = death, method ="cosine", norm ="l2")# the top 10 words with the highest similaritieshead(sort(cos_sim_death[,1], decreasing = T), 10)
# extract the row of "love"love <- hp_word_vectors["love", , drop =FALSE]# calculates pairwise similarities between"harry" and the rest of wordscos_sim_love <-sim2(x = hp_word_vectors, y = love, method ="cosine", norm ="l2")# top 10 words with the highest similaritieshead(sort(cos_sim_love[,1], decreasing = T), 10)
love potion truth potions dad telling lucky polyjuice
1.0000000 0.5577526 0.5039527 0.5024910 0.4958414 0.4927394 0.4826604 0.4614335
remember who'd
0.4587588 0.4537236
9. Now you can play with the word vectors! For example, add the word vector of “harry” to the word vector of “love” and subtract the word vector of “death” from the sum. What are the top terms in your result?
Hint: You can literally add/subtract the word vectors to each other (e.g., harry word vector + love word vector - death word vector). Once you have the resulting vector, calculate similarities as you did previously in Question 8.
# add/subtract word vectors
test <- harry + love - death
# calculate pairwise similarities between the resulting vector and the rest of the words
cos_sim_test <- sim2(x = hp_word_vectors, y = test, method = "cosine", norm = "l2")
# top 10 words with the highest similarities
head(sort(cos_sim_test[, 1], decreasing = TRUE), 10)
ron hermione harry added thinking time quickly sighed
0.6980152 0.6796295 0.6794213 0.5845963 0.5726407 0.5434684 0.5378614 0.5341729
ginny told
0.5297617 0.5279977
Dictionary-based Sentiment Analysis
In this part of the practical, we are going to apply dictionary-based sentiment analysis methods on the movie_review data set.
Text data set
We are going to use the data set movie_review. This is a data set with 5,000 IMDB movie reviews, available from the text2vec package, labeled according to their IMDB rating. The sentiment of the reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews.
The objective of the practical is to understand if the reviews are predictive of the IMDB rating.
Let’s load the data set and convert it to a data frame.
# load the movie review data set from the text2vec package
data("movie_review")
movie_review <- as_tibble(movie_review)
head(movie_review)
# A tibble: 6 x 3
id sentiment review
<chr> <int> <chr>
1 5814_8 1 "With all this stuff going down at the moment with MJ i've s~
2 2381_9 1 "\\\"The Classic War of the Worlds\\\" by Timothy Hines is a~
3 7759_3 0 "The film starts with a manager (Nicholas Bell) giving welco~
4 3630_4 0 "It must be assumed that those who praised this film (\\\"th~
5 9495_8 1 "Superbly trashy and wondrously unpretentious 80's exploitat~
6 8196_8 1 "I dont know why people think this is such a bad movie. Its ~
Dictionary-based sentiment analysis
The tidytext package contains 4 general purpose lexicons in the sentiments dataset (see the example after the list).
afinn: list of English words rated for valence between -5 and +5
bing: list of positive and negative sentiment
nrc: list of English words and their associations with 8 emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and 2 sentiments (negative and positive); binary
loughran: list of sentiment words for accounting and finance by category (Negative, Positive, Uncertainty, Litigious, Strong Modal, Weak Modal, Constraining)
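For reference, each of these lexicons can be inspected with tidytext's get_sentiments() function (the afinn, nrc, and loughran lexicons are downloaded via the textdata package on first use):

# look up one of the general-purpose lexicons
get_sentiments("bing")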
We are going to use the labMT dictionary (Dodds et al., 2011), one of the best-performing dictionaries for sentiment analysis (see e.g. this paper).
10. Run the code below to download the labMT dictionary.
we <-"https://raw.githubusercontent.com/andyreagan/labMT-simple/master/labMTsimple/data/LabMT/data/labMTwords-english.csv"se <-"https://raw.githubusercontent.com/andyreagan/labMT-simple/master/labMTsimple/data/LabMT/data/labMTscores-english.csv"labMT <-bind_cols(read_csv(we, col_names ="word"),read_csv(se, col_names ="value"))
11. Use the unnest_tokens function from the tidytext package to break the movie_review data set into individual tokens, then use the head function to see its first several rows.
# tokenize the reviews
tidy_review <- movie_review |>
  unnest_tokens(word, review)

head(tidy_review)
# A tibble: 6 x 3
id sentiment word
<chr> <int> <chr>
1 5814_8 1 with
2 5814_8 1 all
3 5814_8 1 this
4 5814_8 1 stuff
5 5814_8 1 going
6 5814_8 1 down
12. Remove the stop words from the tokenized data frame.
Hint: Use the anti_join function to filter out the stop_words data set from the tidytext package.
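A possible solution: because the output below also contains a labMT value column, the stop-word filtering is combined here with a join against the labMT dictionary; the object name review_sentiment is an assumption, chosen because it is used again in Question 14.

# remove stop words and attach the labMT sentiment scores
review_sentiment <- tidy_review |>
  anti_join(stop_words, by = "word") |>
  inner_join(labMT, by = "word")

head(review_sentiment)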
# A tibble: 6 x 4
id sentiment word value
<chr> <int> <chr> <dbl>
1 5814_8 1 stuff 5.58
2 5814_8 1 moment 5.68
3 5814_8 1 mj 5.06
4 5814_8 1 started 6
5 5814_8 1 listening 6.28
6 5814_8 1 music 8.02
14. Calculate the average sentiment score for each review. Which three reviews are the most positive and which three are the most negative (i.e., have the highest and lowest average sentiment scores)? Save the results with the name sorted_review_sentiment.
Hint: Follow the steps below.
Step 1: Group the data by id using group_by function.
Step 2: Use summarize function to compute (1) the average sentiment score (mean(value)) (2) the average sentiment from the original review (mean(sentiment)).
Step 3: Use the arrange function to sort the average sentiment scores in descending order. Alternatively, you can use the slice_max function to select the rows with the highest sentiment scores.
Step 4: Use the IDs of the reviews with the highest and lowest average sentiments to filter the movie_review data set and see if your results make sense.
sorted_review_sentiment <- review_sentiment |>
  # group by review id
  group_by(id) |>
  # compute the average sentiment score
  summarize(average_sentiment = mean(value),
            sentiment = mean(sentiment)) |>
  # arrange the average sentiment score in descending order
  arrange(desc(average_sentiment))

sorted_review_sentiment
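The review texts printed below correspond to Step 4 of the hint; a sketch of how they could be retrieved, assuming sorted_review_sentiment is ordered from highest to lowest average sentiment:

# the three reviews with the highest average sentiment scores
movie_review |>
  filter(id %in% head(sorted_review_sentiment$id, 3)) |>
  pull(review)

# the three reviews with the lowest average sentiment scores
movie_review |>
  filter(id %in% tail(sorted_review_sentiment$id, 3)) |>
  pull(review)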
[1] "I can see why Laurel and Hardy purists might be offended by this rather gentle 're-enactment', but this film would be an excellent way to introduce children to the pleasures of classic L & H. Bronson Pinchot and Gailard Sartain acquit themselves reasonably as the comedy duo and there's a reasonably good supporting cast. I enjoyed it."
[2] "Far by my most second favourite cartoon Spielberg did, after Animaniacs. Even if the ratings were low, so what, I still enjoyed it and loved it, was so funny and I adored the cast, wow Jess Harnell and Tress Macneille were in there and were just fantastic, the whole cast were brilliant, especially the legendary Frank Welker.<br /><br />I'd love to see this cartoon again, was so awesome and the jokes were brilliant. Also I can remember the hilarious moment where Brain cameos in it, you hear his voice and it played the PATB theme instrumental, that was just fantastic, I love it in those cartoons when cameos pop in. I wish this cartoon and Animaniacs came back, i loved them"
[3] "Very smart, sometimes shocking, I just love it. It shoved one more side of David's brilliant talent. He impressed me greatly! David is the best. The movie captivates your attention for every second."
[1] "This isn't the worst movie I've ever seen, but I really can't recall when I've seen a worse one. I thought this would be about an aircraft accident investigation. What it really was is a soap opera, and a bad one at that. They overplayed the 'conflict' card to the extreme. The first hour or so seems like a shouting match, with some implausible scenes thrown in.<br /><br />*Possible spoiler*<br /><br />The 40-or-so minute 'memorial' scene (with requisite black umbrellas and rain) to fictitious crash victims was lame, and I thought it would never end. <br /><br />Avoid this one at all costs, unless you revel in 'conflict'.<br /><br />"
[2] "This tear-teaser, written by Steve Martin himself, is so unbelievably bad, it makes you sick to your stomach!<br /><br />The plot is pathetic, the acting awful, and the dialogue is even more predictable than the ending.<br /><br />Avoid at all costs!"
[3] "The characters are unlikeable and the script is awful. It's a waste of the talents of Deneuve and Auteuil."
15. Plot a bar chart of these average sentiment scores across the ids. Use color to show the original sentiment.
sorted_review_sentiment |>
  # create the coloring variable
  mutate(sentiment = ifelse(sentiment > 0, "pos", "neg")) |>
  # reorder `id` such that average sentiment scores are presented in descending order
  ggplot(aes(x = reorder(id, -average_sentiment),
             y = average_sentiment - mean(average_sentiment),
             fill = sentiment)) +
  # plot the bar chart using `geom_col`
  geom_col() +
  labs(x = "id", title = "Average sentiment scores") +
  theme_minimal() +
  theme(axis.text.x = element_blank())
16. Create a predicted_sentiment column, such that an average score higher than 5.75 is positive (= 1) and an average score lower than or equal to 5.75 is negative (= 0). Then, use a confusion matrix to compare this predicted_sentiment to the original sentiment. What is the accuracy of our results?
Hint: You can use the ifelse function to create predicted_sentiment (i.e., dichotomize the average sentiment score). Then, use the table function to create the confusion matrix.
Note that some rows are removed when we inner_join the reviews with the labMT lexicon, so make sure to account for any dropped reviews before comparing the predicted sentiment to the original sentiment.
review_sentiment_agg <- sorted_review_sentiment |>
  # Dichotomize the average sentiment score to match the original sentiment scores
  # 1: scores higher than 5.75; 0: scores lower than or equal to 5.75
  mutate(predicted_sentiment = ifelse(average_sentiment > 5.75, 1, 0))

## not too bad
table(true = review_sentiment_agg$sentiment,
      predicted = review_sentiment_agg$predicted_sentiment)
predicted
true 0 1
0 1729 754
1 864 1653
# Accuracy = (1729 + 1653) / 5000 = 67.64%
17. Follow the same steps, but only for the reviews in the top or bottom 25% of the distribution of average sentiment (there are likely many neutral reviews). Do our results improve? What is the accuracy of our results?
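The quartiles shown below can be obtained by summarizing the average sentiment scores, for example:

# distribution of the average sentiment scores across reviews
summary(sorted_review_sentiment$average_sentiment)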
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.280 5.528 5.736 5.736 5.945 6.987
review_sentiment_agg <- sorted_review_sentiment |>
  # keep only the top and bottom 25% of the average sentiment scores
  filter((average_sentiment > 5.945) | (average_sentiment < 5.528)) |>
  # Dichotomize the average sentiment score to match the original sentiment scores
  # 1: scores higher than 5.75; 0: scores lower than or equal to 5.75
  mutate(predicted_sentiment = ifelse(average_sentiment > 5.75, 1, 0))

## not too bad
table(true = review_sentiment_agg$sentiment,
      predicted = review_sentiment_agg$predicted_sentiment)
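To answer the accuracy question, the proportion of correct predictions can be computed directly from the confusion matrix; a minimal sketch:

# accuracy on the filtered subset of reviews
conf <- table(true = review_sentiment_agg$sentiment,
              predicted = review_sentiment_agg$predicted_sentiment)
sum(diag(conf)) / sum(conf)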