library(tidyverse) # as always :)
library(tidytext) # for text mining
library(tm) # needed for wordcloud
library(wordcloud) # to create pretty word clouds
library(SnowballC) # for Porter's stemming algorithm
Text Mining I
October 23, 2023
Introduction
Welcome to the second practical of the week!
The aim of this practical is to introduce you to text pre-processing and regular expressions.
In this practical, you will learn how to:
- Visualize the most frequent words in a data set using a word cloud and a bar plot.
- Prepare and pre-process your text data.
- Use regular expressions to retrieve structured information from text.
Preparation & Pre-processing
Text data set
We are going to use the following data set in this practical.
- A data set with reviews of computers.
This data set is annotated using aspect-based sentiment analysis. Aspect-based sentiment analysis is a text analysis technique that categorizes data by specific aspects and identifies the sentiment attributed to each one. You can download the data as a text file here. The original source is here.
Each row of the file has the following format:
screen[-1], picture quality[-1] ## When the screen was n't contracting or glitching the overall picture quality was poor to fair .
The aspects are shown before ##, and the review is shown after ##.
Visualization
Use the read_delim function from the readr package to read the data from the computer.txt file.
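The reading chunk itself is not shown above; a minimal sketch consistent with the output below (using "\001", a byte that never occurs in the file, as delimiter so that each line lands in a single column named review):

# read each line of computer.txt into a single column called "review"
computer_531 <- read_delim("computer.txt", delim = "\001", col_names = "review")

# inspect the first rows
head(computer_531)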
Rows: 531 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\001"
chr (1): review
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 1
review
<chr>
1 ## I purchased this monitor because of budgetary concerns .
2 inexpensive[+1][a] ## This item was the most inexpensive 17 inch monitor avai…
3 monitor[-1] ## My overall experience with this monitor was very poor .
4 screen[-1], picture quality[-1] ## When the screen was n't contracting or gli…
5 monitor[-1], picture quality[-1] ## I 've viewed numerous different monitor m…
6 screen[-1] ## A week out of the box and I began to see slight contractions of…
wordcloud is a function from the wordcloud package, which plots cool word clouds based on word frequencies in a given data set. Use this function to plot the 50 most frequent words with a minimum frequency of 5 using your data set. wordcloud will directly tokenize your documents. Check the help file of the wordcloud function (e.g., ?wordcloud) to see how to specify the arguments.
wordcloud(
  words = computer_531$review,
  min.freq = 5,
  max.words = 50,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)
Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
drops documents
Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
tm::stopwords())): transformation drops documents
Another way to visualize the term frequencies is using a bar plot. For this we need to convert the data to a tidy format (where each word is a different row) and count the number of occurrences of each word.
Use the unnest_tokens function from the tidytext package to break the text into individual tokens (a process called tokenization) and use the head function to see its first several rows.
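The tokenization chunk is not shown; a sketch of what it likely looked like, given that the later chunks expect an object called comp_words with a column named word:

# one row per token; unnest_tokens also lowercases and strips punctuation
comp_words <- computer_531 |>
  unnest_tokens(output = word, input = review)

# first several rows
head(comp_words)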
Then use functions from the dplyr package (e.g., count, arrange) to select the 30 most frequent tokens and plot a bar chart using ggplot.
comp_words |>
  # count the frequency of each word
  count(word) |>
  # arrange the words by their frequency in descending order
  arrange(desc(n)) |>
  # select the top 30 most frequent words
  head(30) |>
  # make a bar plot (reorder words by their frequencies)
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col(fill = "gray") +
  labs(x = "frequency", y = "words") +
  theme_minimal()
Hint: Use the anti_join function to filter out the stop words, using the stop_words data set from the tidytext package. stop_words is a data set with stop words. By using anti_join, you keep only the words in the first data set that are not in the stop_words data set. Check the help file if you want further information on either one (e.g., ?anti_join, ?stop_words).
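The chunk creating comp_words_no_stop is not shown; a sketch that leaves the by argument unspecified, which is what produces the join message below:

# drop every token that appears in the stop_words data set
comp_words_no_stop <- comp_words |>
  anti_join(stop_words)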
Joining with `by = join_by(word)`
comp_words_no_stop |>
  # count the frequency of each word
  count(word) |>
  # arrange the words by their frequency in descending order
  arrange(desc(n)) |>
  # select the top 30 most frequent words
  head(30) |>
  # make a bar plot (reorder words by their frequencies)
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col(fill = "gray") +
  labs(x = "frequency", y = "words") +
  theme_minimal()
Create your own stop-word data set df_stopwords by modifying stop_words: add new words (e.g., ‘buy’) and remove some of the words currently listed (e.g., ‘order’). Then use df_stopwords in the anti_join function and check the plot.
df_stopwords <-
  stop_words |>
  # keep "order" and its variants in the data
  filter(!word %in% c("order", "ordered", "ordering", "orders")) |>
  # add two custom stop words under our own lexicon label
  rbind(tibble(word = c("buy", "online"), lexicon = c("Me", "Me")))

# remove stop words using the modified list
comp_words_no_stop_modified <-
  comp_words |>
  anti_join(df_stopwords)
Joining with `by = join_by(word)`
comp_words_no_stop_modified |>
  # count the frequency of each word
  count(word) |>
  # arrange the words by their frequency in descending order
  arrange(desc(n)) |>
  # select the top 30 most frequent words
  head(30) |>
  # make a bar plot (reorder words by their frequencies)
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col(fill = "gray") +
  labs(x = "frequency", y = "words") +
  theme_minimal()
The SnowballC package implements Porter's stemming algorithm. Use the getStemLanguages function to find out how many languages this package supports.
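The call itself takes no arguments:

# languages supported by the Snowball stemmers
getStemLanguages()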
[1] "arabic" "basque" "catalan" "danish" "dutch"
[6] "english" "finnish" "french" "german" "greek"
[11] "hindi" "hungarian" "indonesian" "irish" "italian"
[16] "lithuanian" "nepali" "norwegian" "porter" "portuguese"
[21] "romanian" "russian" "spanish" "swedish" "tamil"
[26] "turkish"
Now use the wordStem function to stem the words, and plot the 30 most frequent stems as before.

comp_words_stemming <-
  comp_words_no_stop_modified |>
  # add the stemmed form of each word
  mutate(stem = wordStem(word))
comp_words_stemming |>
  # count the frequency of each stem
  count(stem) |>
  # arrange the stems by their frequency in descending order
  arrange(desc(n)) |>
  # select the top 30 most frequent stems
  head(30) |>
  # make a bar plot (reorder stems by their frequencies)
  ggplot(aes(x = n, y = reorder(stem, n))) +
  geom_col(fill = "gray") +
  labs(x = "frequency", y = "stems") +
  theme_minimal()
Regular expressions
Find the reviews in the computer_531 data set which contain the words “Monitor” or “monitor”, “memory” or “Memory”, and “Delivery” or “delivery”. See how many reviews contain each pair of words.
Hint: One of many different ways to achieve this is as follows:
- Use the str_detect function from the stringr package to find the presence of patterns of your interest. See the lecture slide if you are unsure how to specify the regex. You can try different regexes here: https://regexr.com/
- Use the filter function to only retain the ones that contain the patterns. You can write it in one line, such as filter(str_detect(input vector, regex)).
- Count the number of reviews that contain each pattern.
# reviews containing "Monitor" or "monitor"
reviews_mon <- computer_531 |>
  filter(str_detect(review, "[Mm]onitor"))
# reviews containing "memory" or "Memory"
reviews_mem <- computer_531 |>
  filter(str_detect(review, "[Mm]emory"))
# reviews containing "Delivery" or "delivery"
reviews_del <- computer_531 |>
  filter(str_detect(review, "[Dd]elivery"))
# compare the occurrence of each pair of words
kwords <- c("Monitor", "Memory", "Delivery")
nr_kwords <- c(nrow(reviews_mon), nrow(reviews_mem), nrow(reviews_del))
tibble(kwords, nr_kwords)
# A tibble: 3 × 2
kwords nr_kwords
<chr> <int>
1 Monitor 105
2 Memory 5
3 Delivery 3
Use the str_detect, str_extract, str_subset, and str_match functions from the stringr package to check which words appear before “monitor” in the reviews from the computer_531 data set.
What does the regular expression (\\w+) ([Mm]onitor) match? You can explore it using https://regexr.com
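The chunk is not shown; the outputs below are consistent with calls along these lines (a sketch over the first six reviews):

pattern <- "(\\w+) ([Mm]onitor)"
# TRUE/FALSE: is there a word followed by "monitor"/"Monitor"?
str_detect(head(computer_531$review), pattern)
# the first match per review (NA when there is none)
str_extract(head(computer_531$review), pattern)
# the matching reviews themselves
head(str_subset(computer_531$review, pattern))
# the match split into its capture groups
str_match(head(computer_531$review), pattern)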
[1] TRUE TRUE TRUE FALSE TRUE FALSE
[1] "this monitor" "inch monitor" "this monitor"
[4] NA "different monitor" NA
[1] "## I purchased this monitor because of budgetary concerns ."
[2] "inexpensive[+1][a] ## This item was the most inexpensive 17 inch monitor available to me at the time I made the purchase ."
[3] "monitor[-1] ## My overall experience with this monitor was very poor ."
[4] "monitor[-1], picture quality[-1] ## I 've viewed numerous different monitor models since I 'm a college student and this particular monitor had as poor of picture quality as any I 've seen ."
[5] "## I sent it back and now will search for a better quality monitor ."
[6] "monitor[-1], colors[-1] ## I used this monitor for 2 years and now it shows a mixture of various colors and de-focused text ."
[,1] [,2] [,3]
[1,] "this monitor" "this" "monitor"
[2,] "inch monitor" "inch" "monitor"
[3,] "this monitor" "this" "monitor"
[4,] NA NA NA
[5,] "different monitor" "different" "monitor"
[6,] NA NA NA
# The regular expression `(\\w+) ([Mm]onitor)` matches any word,
# followed by a space, followed by either "Monitor" or "monitor".
# - str_detect: returns TRUE/FALSE depending on whether a match is found
# - str_extract: returns the match if it is found (NA otherwise)
# - str_subset: returns the observations for which a match has been found
# - str_match: returns the match (split into capture groups) and NA otherwise
You can use the str_extract_all and str_match_all functions to extract all matches, not just the first one. Use either of these functions to see all of the fully capitalized words in the first 10 reviews.
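A sketch of one possible call; the pattern assumes a fully capitalized word is a run of capital letters between word boundaries, and simplify = TRUE pads the result into the matrix shown below:

# all fully capitalized words in the first 10 reviews
str_extract_all(head(computer_531$review, 10), "\\b[A-Z]+\\b", simplify = TRUE)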
[,1] [,2] [,3]
[1,] "I" "" ""
[2,] "I" "" ""
[3,] "" "" ""
[4,] "" "" ""
[5,] "I" "I" "I"
[6,] "A" "I" ""
[7,] "" "" ""
[8,] "I" "I" ""
[9,] "I" "" ""
[10,] "I" "" ""
To retrieve structured information from the computer_531 data, let's first use a regular expression to extract the characters at the beginning of each line until ##. Do this for only the first 20 reviews.
Hint: Use the str_extract function such that str_extract(first 20 reviews, regex).
Hint 2: The symbol ^ matches the start of a line. The symbol . matches any character. The symbol * indicates that the previous character can appear any number of times (including zero times). For example, .* matches all characters (until the end of the line); .*R matches all characters until a capital R is found.
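A sketch of the call that produces the output below:

# everything from the start of each line up to and including "##"
str_extract(head(computer_531$review, 20), "^.*##")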
[1] "##" "inexpensive[+1][a] ##"
[3] "monitor[-1] ##" "screen[-1], picture quality[-1] ##"
[5] "monitor[-1], picture quality[-1] ##" "screen[-1] ##"
[7] "Display[-1] ##" "monitor[-1] ##"
[9] "##" "##"
[11] "##" "##"
[13] "##" "monitor[-1], colors[-1] ##"
[15] "##" "size[+1] ##"
[17] "computer[+1], quality[+1] ##" "##"
[19] "keyboard[+1], color[+1] ##" "speed[+1], memory[+1] ##"
Add a new column to the computer_531 data frame with the name cleaned_review, which contains only the review text, and another column with the name aspect_sentiment, which contains the aspects and sentiment scores (i.e., the part at the beginning of each line, before ##).
Hint: Use mutate() and str_extract() (e.g., mutate(cleaned_review = str_extract(...), aspect_sentiment = str_extract(...))).
computer_531 <- computer_531 |>
  mutate(cleaned_review = str_extract(review, "##.*$"),
         aspect_sentiment = str_extract(review, "^.*##"))

# check the newly added columns
head(computer_531[-1])
# A tibble: 6 × 2
cleaned_review aspect_sentiment
<chr> <chr>
1 ## I purchased this monitor because of budgetary concerns . ##
2 ## This item was the most inexpensive 17 inch monitor availa… inexpensive[+1]…
3 ## My overall experience with this monitor was very poor . monitor[-1] ##
4 ## When the screen was n't contracting or glitching the over… screen[-1], pic…
5 ## I 've viewed numerous different monitor models since I 'm… monitor[-1], pi…
6 ## A week out of the box and I began to see slight contracti… screen[-1] ##
# for an even cleaner version you can use regex's lookbehind and lookahead
# features to drop the ## parts:
computer_531 <- computer_531 |>
  mutate(cleaned_review = str_extract(review, "(?<=##).*$"),
         aspect_sentiment = str_extract(review, "^.*(?=##)"))

# check the newly added columns
head(computer_531[-1])
# A tibble: 6 × 2
cleaned_review aspect_sentiment
<chr> <chr>
1 " I purchased this monitor because of budgetary concerns ." ""
2 " This item was the most inexpensive 17 inch monitor availab… "inexpensive[+1…
3 " My overall experience with this monitor was very poor ." "monitor[-1] "
4 " When the screen was n't contracting or glitching the overa… "screen[-1], pi…
5 " I 've viewed numerous different monitor models since I 'm … "monitor[-1], p…
6 " A week out of the box and I began to see slight contractio… "screen[-1] "
Finally, the code below creates a new column sentiment with the values positive, negative, and neutral. It assigns neutral when there is no aspect in the corresponding column or when the sum of the scores is equal to zero, and it assigns negative (positive) when the sum of the scores is lower (higher) than zero.
# define sum_list function which adds the scores
sum_list <- function(list_values) {
  sum(as.numeric(list_values))
}

computer_531 <- computer_531 |>
  # sentiment_score: extract the signed scores and sum them using sum_list
  mutate(sentiment_score = map_dbl(str_extract_all(aspect_sentiment, "[-+]\\d+"), sum_list),
         # assign negative when score < 0
         sentiment = case_when(sentiment_score < 0 ~ "negative",
                               # assign positive when score > 0
                               sentiment_score > 0 ~ "positive",
                               # assign neutral otherwise
                               TRUE ~ "neutral"))
head(computer_531)
# A tibble: 6 × 5
review cleaned_review aspect_sentiment sentiment_score sentiment
<chr> <chr> <chr> <dbl> <chr>
1 ## I purchased this… " I purchased… "" 0 neutral
2 inexpensive[+1][a] … " This item w… "inexpensive[+1… 1 positive
3 monitor[-1] ## My o… " My overall … "monitor[-1] " -1 negative
4 screen[-1], picture… " When the sc… "screen[-1], pi… -2 negative
5 monitor[-1], pictur… " I 've viewe… "monitor[-1], p… -2 negative
6 screen[-1] ## A wee… " A week out … "screen[-1] " -1 negative
[1] " After much delay I received my Acer Netbook Aspire One and I can teell you I am very satisfied with this computer , is very pretty , well finished , high quality , so quiet , not heated than other netbook or at least I think it is within in normal ranges , the keyboard is very smooth and very well laid out , the screen functions are very practical in a nutshell is an excellent buy and I am very pleased with the purchase , I recommend it ."