The aim of this practical is to introduce you to word embedding and to deepen your understanding of sentiment analysis by performing a dictionary-based sentiment analysis.
library(tidyverse) # as always :)
library(tidytext)  # for text mining
library(text2vec)  # our dataset comes from this package
Word Embedding
In this part of the practical we will use word embedding. To work with text, we need to transform it into numerical values. One way is to count the words and weight their frequency (e.g. with TF-IDF). Another method is word embedding. Word embedding techniques such as word2vec and GloVe use neural networks to construct word vectors. With these techniques, words in similar contexts are represented by similar numerical vectors.
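To make this concrete, here is a toy sketch with made-up three-dimensional vectors (the real embeddings we fit below have 50 dimensions): words used in similar contexts get vectors that point in similar directions, which we can quantify with cosine similarity, the same measure we later use via sim2.

# toy, made-up "embeddings" for illustration only
wizard <- c(0.8, 0.1, 0.3)
witch  <- c(0.7, 0.2, 0.4)
kettle <- c(0.1, 0.9, 0.0)

# cosine similarity: close to 1 for vectors pointing in the same direction
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(wizard, witch)  # high: similar contexts
cosine(wizard, kettle) # low: different contexts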
We will use the Harry Potter books as our data. Let’s start by installing the harrypotter package from github using remotes.
remotes::install_github("bradleyboehmke/harrypotter")
library(harrypotter) # Not to be confused with the CRAN palettes package
The harrypotter package supplies the first seven novels in the Harry Potter series. Here is some info about this data set:
A data set with all Harry Potter books.
This data set contains the full texts of the first seven Harry Potter books (see the list below). Each text is stored as a character vector, with each element representing a single chapter. It is provided by the harrypotter package written by Bradley Boehmke.
philosophers_stone: Harry Potter and the Philosophers Stone, published in 1997
chamber_of_secrets: Harry Potter and the Chamber of Secrets, published in 1998
prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban, published in 1999
goblet_of_fire: Harry Potter and the Goblet of Fire, published in 2000
order_of_the_phoenix: Harry Potter and the Order of the Phoenix, published in 2003
half_blood_prince: Harry Potter and the Half-Blood Prince, published in 2005
deathly_hallows: Harry Potter and the Deathly Hallows, published in 2007
1. Use the code below to load the first seven novels in the Harry Potter series.
hp_books <- c("philosophers_stone", "chamber_of_secrets",
              "prisoner_of_azkaban", "goblet_of_fire",
              "order_of_the_phoenix", "half_blood_prince",
              "deathly_hallows")

hp_words <- list(
  philosophers_stone,
  chamber_of_secrets,
  prisoner_of_azkaban,
  goblet_of_fire,
  order_of_the_phoenix,
  half_blood_prince,
  deathly_hallows
) |>
  # name each list element
  set_names(hp_books) |>
  # convert each book to a data frame and merge into a single data frame
  map_df(as_tibble, .id = "book") |>
  # convert book to a factor
  mutate(book = factor(book, levels = hp_books)) |>
  # remove empty chapters
  filter(!is.na(value)) |>
  # create a chapter id column
  group_by(book) |>
  mutate(chapter = row_number(book))

head(hp_words)
# A tibble: 6 x 3
# Groups: book [1]
book value chapter
<fct> <chr> <int>
1 philosophers_stone "THE BOY WHO LIVED Mr. and Mrs. Dursley, of num~ 1
2 philosophers_stone "THE VANISHING GLASS Nearly ten years had passe~ 2
3 philosophers_stone "THE LETTERS FROM NO ONE The escape of the Braz~ 3
4 philosophers_stone "THE KEEPER OF THE KEYS BOOM. They knocked agai~ 4
5 philosophers_stone "DIAGON ALLEY Harry woke early the next morning~ 5
6 philosophers_stone "THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS~ 6
2. Use the unnest_tokens function from the tidytext package to tokenize the data frame and name your object hp_words.
# tokenize the data frame
hp_words <- hp_words |>
  unnest_tokens(word, value)

head(hp_words)
# A tibble: 6 x 2
# Groups: book [1]
book word
<fct> <chr>
1 philosophers_stone the
2 philosophers_stone boy
3 philosophers_stone who
4 philosophers_stone lived
5 philosophers_stone mr
6 philosophers_stone and
3. Remove the stop words from the tokenized data frame.
Hint: Use the anti_join function to filter out the stop_words data set from the tidytext package.
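A possible solution, assuming the tokenized hp_words from Question 2 and the stop_words data set that ships with tidytext:

# remove stop words from the tokenized data frame
hp_words <- hp_words |>
  anti_join(stop_words, by = "word")

head(hp_words)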
# A tibble: 6 x 2
# Groups: book [1]
book word
<fct> <chr>
1 philosophers_stone boy
2 philosophers_stone lived
3 philosophers_stone dursley
4 philosophers_stone privet
5 philosophers_stone drive
6 philosophers_stone proud
4. Create a vocabulary of unique terms using the create_vocabulary function from the text2vec package and remove the words that appear fewer than 5 times.
Hint: The code is given below, make sure you understand it.
Step 1. Create a list of words from hp_words (iterator object) using list function.
Step 2. Use the itoken function on the word list to create index-tokens.
Step 3: Use the create_vocabulary function on the itoken object to collect unique terms.
Step 4: Use the prune_vocabulary function on the data frame of unique terms and specify term_count_min = 5 to filter out the infrequent terms.
# make it a list (iterator)
hp_words_ls <- list(hp_words$word)
# create index-tokens
it <- itoken(hp_words_ls, progressbar = FALSE)
# collect unique terms
hp_vocab <- create_vocabulary(it)
# filter the infrequent terms (number of occurrences is less than 5)
hp_vocab <- prune_vocabulary(hp_vocab, term_count_min = 5)
# show the resulting vocabulary object
hp_vocab
# We've just created word counts, that's all the vocabulary object is!
5. The next step is to create a token co-occurrence matrix (TCM). First, we need to apply the vocab_vectorizer function to transform the list of tokens into a vector space. Then, use the create_tcm function to create a TCM with a window of 5 for context words.
Hint: The code is given below, make sure you understand it.
Step 1: Map the words to indices with vocab_vectorizer(vocabulary object from Question 4).
Step 2: Create a TCM with create_tcm(it, vectorizer function from Step 1, skip_grams_window = 5), where it is the iterator over tokens created by itoken.
# map words to indices
vectorizer <- vocab_vectorizer(hp_vocab)
# use window of 5 for context words
hp_tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)
# Note that such a matrix will be extremely sparse.
# Most words do not go with other words in the grand
# scheme of things. So when they do, it usually matters.
6. Use GlobalVectors as given in the code below to fit the word vectors on our data set. Choose the embedding size (rank variable) equal to 50 and the maximum number of co-occurrences equal to 10. Train the word vectors in 20 iterations. You can check the full input arguments of the fit_transform function here.
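A sketch of the fitting step that produces the training log below, using text2vec's GlobalVectors class (the object names glove and hp_wv_main are chosen to match their use in the next question):

# create the GloVe model: embedding size (rank) 50, maximum co-occurrences 10
glove <- GlobalVectors$new(rank = 50, x_max = 10)

# fit the model on the TCM and return the main word vectors (20 iterations)
hp_wv_main <- glove$fit_transform(hp_tcm, n_iter = 20)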
INFO [16:14:44.462] epoch 1, loss 0.1199
INFO [16:14:44.865] epoch 2, loss 0.0781
INFO [16:14:45.269] epoch 3, loss 0.0677
INFO [16:14:45.635] epoch 4, loss 0.0615
INFO [16:14:45.997] epoch 5, loss 0.0572
INFO [16:14:46.456] epoch 6, loss 0.0541
INFO [16:14:46.812] epoch 7, loss 0.0516
INFO [16:14:47.189] epoch 8, loss 0.0496
INFO [16:14:47.555] epoch 9, loss 0.0480
INFO [16:14:47.956] epoch 10, loss 0.0466
INFO [16:14:48.329] epoch 11, loss 0.0455
INFO [16:14:48.670] epoch 12, loss 0.0445
INFO [16:14:49.013] epoch 13, loss 0.0436
INFO [16:14:49.349] epoch 14, loss 0.0428
INFO [16:14:49.690] epoch 15, loss 0.0421
INFO [16:14:50.032] epoch 16, loss 0.0415
INFO [16:14:50.408] epoch 17, loss 0.0410
INFO [16:14:50.783] epoch 18, loss 0.0405
INFO [16:14:51.210] epoch 19, loss 0.0400
INFO [16:14:51.593] epoch 20, loss 0.0396
7. The GloVe model learns two sets of word vectors: main and context. Essentially they are the same, since the model is symmetric. In general, combining the two sets of word vectors leads to higher quality embeddings (read more here). The best practice is to combine both the main word vectors and the context word vectors into one matrix. Extract both sets of word vectors and save their sum for the following questions.
Hint: Follow the steps below.
Step 1: Extract context word vectors by glove$components.
Step 2: Sum the two sets of word vectors (e.g., hp_wv_main + t(hp_wv_context)).
# extract context word vectors
hp_wv_context <- glove$components
# check the dimensions of both matrices
dim(hp_wv_main); dim(hp_wv_context)
[1] 8628 50
[1] 50 8628
# Either word-vectors matrix could work, but the developers of the technique
# suggest the sum/mean may work better
hp_word_vectors <- hp_wv_main + t(hp_wv_context) # transpose one matrix to perform matrix addition
8. Find the most similar 10 words to each of the words: “harry”, “death”, and “love”.
Hint: Follow the steps below.
Step 1: Extract the row of the corresponding word from the word vector matrix (e.g., matrix[“harry”, , drop = FALSE]).
Step 2: Use sim2 function with the cosine similarity measure to calculate the pairwise similarities between the chosen row vector (from Step 1) and the rest of words: sim2(x = whole word vector matrix, y = chosen row vector, method = "cosine", norm = "l2").
Step 3: Sort the resulting column vector of similarities in descending order and present the first 10 values. For example, you can do this by head(sort(similarity vector, decreasing = TRUE), 10).
Step 4: Repeat Step 1 - Step 3 for the other words.
# extract the row of "harry"harry <- hp_word_vectors["harry", , drop =FALSE]# calculates pairwise similarities between"harry" and the rest of wordscos_sim_harry <-sim2(x = hp_word_vectors, y = harry, method ="cosine", norm ="l2")# the top 10 words with the highest similaritieshead(sort(cos_sim_harry[,1], decreasing = T), 10)
harry ron hermione time moment left looked told
1.0000000 0.8860322 0.8702359 0.8326799 0.8273744 0.7741125 0.7699056 0.7223630
happened found
0.7162410 0.7099754
# extract the row of "death"death <- hp_word_vectors["death", , drop =FALSE]# calculates pairwise similarities between"harry" and the rest of wordscos_sim_death <-sim2(x = hp_word_vectors, y = death, method ="cosine", norm ="l2")# the top 10 words with the highest similaritieshead(sort(cos_sim_death[,1], decreasing = T), 10)
# extract the row of "love"love <- hp_word_vectors["love", , drop =FALSE]# calculates pairwise similarities between"harry" and the rest of wordscos_sim_love <-sim2(x = hp_word_vectors, y = love, method ="cosine", norm ="l2")# top 10 words with the highest similaritieshead(sort(cos_sim_love[,1], decreasing = T), 10)
love potion truth potions dad telling lucky polyjuice
1.0000000 0.5577526 0.5039527 0.5024910 0.4958414 0.4927394 0.4826604 0.4614335
remember who'd
0.4587588 0.4537236
9. Now you can play with the word vectors! For example, add the word vector of “harry” to the word vector of “love” and subtract the word vector of “death” from the sum. What are the top terms in your result?
Hint: You can literally add/subtract the word vectors to each other (e.g., harry word vector + love word vector - death word vector). Once you have the resulting vector, calculate similarities as you did previously in Question 8.
# add/subtract word vectors
test <- harry + love - death
# calculate pairwise similarities between the resulting vector and the rest of the words
cos_sim_test <- sim2(x = hp_word_vectors, y = test, method = "cosine", norm = "l2")
# top 10 words with the highest similarities
head(sort(cos_sim_test[, 1], decreasing = TRUE), 10)
ron hermione harry added thinking time quickly sighed
0.6980152 0.6796295 0.6794213 0.5845963 0.5726407 0.5434684 0.5378614 0.5341729
ginny told
0.5297617 0.5279977
Dictionary-based Sentiment Analysis
In this part of the practical, we are going to apply dictionary-based sentiment analysis methods on the movie_review data set.
Text data set
We are going to use the data set movie_review. This is a data set with 5,000 IMDB movie reviews, available from the text2vec package, labeled according to their IMDB rating. The sentiment of the reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews.
The objective of the practical is to understand if the reviews are predictive of the IMDB rating.
Let’s load the data set and convert it to a data frame.
# load the movie review data set from the text2vec package
data("movie_review")
movie_review <- as_tibble(movie_review)
head(movie_review)
# A tibble: 6 x 3
id sentiment review
<chr> <int> <chr>
1 5814_8 1 "With all this stuff going down at the moment with MJ i've s~
2 2381_9 1 "\\\"The Classic War of the Worlds\\\" by Timothy Hines is a~
3 7759_3 0 "The film starts with a manager (Nicholas Bell) giving welco~
4 3630_4 0 "It must be assumed that those who praised this film (\\\"th~
5 9495_8 1 "Superbly trashy and wondrously unpretentious 80's exploitat~
6 8196_8 1 "I dont know why people think this is such a bad movie. Its ~
Dictionary-based sentiment analysis
The tidytext package contains 4 general purpose lexicons in the sentiments dataset (see the example after the list).
afinn: list of English words rated for valence between -5 and +5
bing: list of positive and negative sentiment
nrc: list of English words and their associations with 8 emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and 2 sentiments (negative and positive); binary
loughran: list of sentiment words for accounting and finance by category (Negative, Positive, Uncertainty, Litigious, Strong Modal, Weak Modal, Constraining)
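For reference, each of these lexicons can be inspected with tidytext's get_sentiments() function (the afinn, nrc, and loughran lexicons are downloaded via the textdata package on first use):

# look up one of the general-purpose lexicons
get_sentiments("bing")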
We are going to use the labMT dictionary (Dodds et al., 2011), one of the best-performing dictionaries for sentiment analysis (see e.g. this paper).
10. Run the code below to download the labMT dictionary.
we <-"https://raw.githubusercontent.com/andyreagan/labMT-simple/master/labMTsimple/data/LabMT/data/labMTwords-english.csv"se <-"https://raw.githubusercontent.com/andyreagan/labMT-simple/master/labMTsimple/data/LabMT/data/labMTscores-english.csv"labMT <-bind_cols(read_csv(we, col_names ="word"),read_csv(se, col_names ="value"))
11. Use the unnest_tokens function from the tidytext package to break the movie_review data set into individual tokens, then use the head function to see its first several rows.
# tokenize the reviews
tidy_review <- movie_review |>
  unnest_tokens(word, review)

head(tidy_review)
# A tibble: 6 x 3
id sentiment word
<chr> <int> <chr>
1 5814_8 1 with
2 5814_8 1 all
3 5814_8 1 this
4 5814_8 1 stuff
5 5814_8 1 going
6 5814_8 1 down
12. Remove the stop words from the tokenized data frame.
Hint: Use the anti_join function to filter out the stop_words data set from the tidytext package.
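A possible solution: because the output below also contains a labMT value column, the stop-word filtering is combined here with a join against the labMT dictionary; the object name review_sentiment is an assumption, chosen because it is used again in Question 14.

# remove stop words and attach the labMT sentiment scores
review_sentiment <- tidy_review |>
  anti_join(stop_words, by = "word") |>
  inner_join(labMT, by = "word")

head(review_sentiment)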
# A tibble: 6 x 4
id sentiment word value
<chr> <int> <chr> <dbl>
1 5814_8 1 stuff 5.58
2 5814_8 1 moment 5.68
3 5814_8 1 mj 5.06
4 5814_8 1 started 6
5 5814_8 1 listening 6.28
6 5814_8 1 music 8.02
14. Calculate the average sentiment score for each review. Which three reviews are the most positive and which three are the most negative (i.e., have the highest and lowest average sentiment scores)? Save the results with the name sorted_review_sentiment.
Hint: Follow the steps below.
Step 1: Group the data by id using group_by function.
Step 2: Use summarize function to compute (1) the average sentiment score (mean(value)) (2) the average sentiment from the original review (mean(sentiment)).
Step 3: Use the arrange function to sort the average sentiment scores in descending order. Alternatively, you can use the slice_max function to select the rows with the highest sentiment scores.
Step 4: Use the IDs of the reviews with the highest and lowest average sentiments to filter the movie_review data set and see if your results make sense.
sorted_review_sentiment <- review_sentiment |>
  # group by review id
  group_by(id) |>
  # compute the average sentiment score
  summarize(average_sentiment = mean(value),
            sentiment = mean(sentiment)) |>
  # arrange the average sentiment score in descending order
  arrange(desc(average_sentiment))

sorted_review_sentiment
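The review texts printed below correspond to Step 4 of the hint; a sketch of how they could be retrieved, assuming sorted_review_sentiment is ordered from highest to lowest average sentiment:

# the three reviews with the highest average sentiment scores
movie_review |>
  filter(id %in% head(sorted_review_sentiment$id, 3)) |>
  pull(review)

# the three reviews with the lowest average sentiment scores
movie_review |>
  filter(id %in% tail(sorted_review_sentiment$id, 3)) |>
  pull(review)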
[1] "I can see why Laurel and Hardy purists might be offended by this rather gentle 're-enactment', but this film would be an excellent way to introduce children to the pleasures of classic L & H. Bronson Pinchot and Gailard Sartain acquit themselves reasonably as the comedy duo and there's a reasonably good supporting cast. I enjoyed it."
[2] "Far by my most second favourite cartoon Spielberg did, after Animaniacs. Even if the ratings were low, so what, I still enjoyed it and loved it, was so funny and I adored the cast, wow Jess Harnell and Tress Macneille were in there and were just fantastic, the whole cast were brilliant, especially the legendary Frank Welker.<br /><br />I'd love to see this cartoon again, was so awesome and the jokes were brilliant. Also I can remember the hilarious moment where Brain cameos in it, you hear his voice and it played the PATB theme instrumental, that was just fantastic, I love it in those cartoons when cameos pop in. I wish this cartoon and Animaniacs came back, i loved them"
[3] "Very smart, sometimes shocking, I just love it. It shoved one more side of David's brilliant talent. He impressed me greatly! David is the best. The movie captivates your attention for every second."
[1] "This isn't the worst movie I've ever seen, but I really can't recall when I've seen a worse one. I thought this would be about an aircraft accident investigation. What it really was is a soap opera, and a bad one at that. They overplayed the 'conflict' card to the extreme. The first hour or so seems like a shouting match, with some implausible scenes thrown in.<br /><br />*Possible spoiler*<br /><br />The 40-or-so minute 'memorial' scene (with requisite black umbrellas and rain) to fictitious crash victims was lame, and I thought it would never end. <br /><br />Avoid this one at all costs, unless you revel in 'conflict'.<br /><br />"
[2] "This tear-teaser, written by Steve Martin himself, is so unbelievably bad, it makes you sick to your stomach!<br /><br />The plot is pathetic, the acting awful, and the dialogue is even more predictable than the ending.<br /><br />Avoid at all costs!"
[3] "The characters are unlikeable and the script is awful. It's a waste of the talents of Deneuve and Auteuil."
15. Plot a bar chart of these average sentiment scores across the ids. Use color to show the original sentiment.
sorted_review_sentiment |>
  # create the coloring variable
  mutate(sentiment = ifelse(sentiment > 0, "pos", "neg")) |>
  # reorder `id` such that average sentiment scores are presented in descending order
  ggplot(aes(x = reorder(id, -average_sentiment),
             y = average_sentiment - mean(average_sentiment),
             fill = sentiment)) +
  # plot the bar chart using `geom_col`
  geom_col() +
  labs(x = "id", title = "Average sentiment scores") +
  theme_minimal() +
  theme(axis.text.x = element_blank())
16. Create a predicted_sentiment column, such that an average score higher than 5.75 is positive (= 1) and an average score lower than or equal to 5.75 is negative (= 0). Then, use a confusion matrix to compare this predicted_sentiment to the original sentiment. What is the accuracy of our results?
Hint: You can use the ifelse function to create predicted_sentiment (i.e., dichotomize the average sentiment score). Then, use the table function to create the confusion matrix.
Note that some rows are removed when we inner_join the reviews with the labMT lexicon, so make sure to account for any dropped reviews before comparing the predicted sentiment to the original sentiment.
review_sentiment_agg <- sorted_review_sentiment |>
  # Dichotomize the average sentiment score to match the original sentiment scores
  # 1: scores higher than 5.75; 0: scores lower than or equal to 5.75
  mutate(predicted_sentiment = ifelse(average_sentiment > 5.75, 1, 0))

## not too bad
table(true = review_sentiment_agg$sentiment,
      predicted = review_sentiment_agg$predicted_sentiment)
predicted
true 0 1
0 1729 754
1 864 1653
# Accuracy = (1729 + 1653) / 5000 = 67.64%
17. Follow the same steps, but only for the reviews in the top or bottom 25% of the distribution of average sentiment (there are likely many neutral reviews). Do our results improve? What is the accuracy of our results?
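The quartiles shown below can be obtained by summarizing the average sentiment scores, for example:

# distribution of the average sentiment scores across reviews
summary(sorted_review_sentiment$average_sentiment)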
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.280 5.528 5.736 5.736 5.945 6.987
review_sentiment_agg <- sorted_review_sentiment |>
  # keep only the top and bottom 25% of the average sentiment scores
  filter((average_sentiment > 5.945) | (average_sentiment < 5.528)) |>
  # Dichotomize the average sentiment score to match the original sentiment scores
  # 1: scores higher than 5.75; 0: scores lower than or equal to 5.75
  mutate(predicted_sentiment = ifelse(average_sentiment > 5.75, 1, 0))

## not too bad
table(true = review_sentiment_agg$sentiment,
      predicted = review_sentiment_agg$predicted_sentiment)
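To answer the accuracy question, the proportion of correct predictions can be computed directly from the confusion matrix; a minimal sketch:

# accuracy on the filtered subset of reviews
conf <- table(true = review_sentiment_agg$sentiment,
              predicted = review_sentiment_agg$predicted_sentiment)
sum(diag(conf)) / sum(conf)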