library(tidyverse) # as always :)
library(tidytext) # for text mining
library(tm) # needed for wordcloud
library(wordcloud) # to create pretty word clouds
library(SnowballC) # for Porter's stemming algorithm
Text Mining I
October 23, 2023
Introduction
Welcome to the second practical of the week!
The aim of this practical is to introduce you to text pre-processing and regular expressions.
In this practical, you will learn how to:
- Visualize the most frequent words in a data set using a word cloud and a bar plot.
- Prepare and pre-process your text data.
- Use regular expressions to retrieve structured information from text.
Preparation & Pre-processing
Text data set
We are going to use the following data set in this practical.
- A data set with reviews of computers.
This data set is annotated using aspect-based sentiment analysis. Aspect-based sentiment analysis is a text analysis technique that categorizes data by specific aspects and identifies the sentiment attributed to each one. You can download the data as a text file here. The original source is here.
Each row of the file has the following format:
screen[-1], picture quality[-1] ## When the screen was n't contracting or glitching the overall picture quality was poor to fair .
The aspects are shown before ##, and the review is shown after ##.
Visualization
Use the read_delim function from the readr package to read the data from the computer.txt file.
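The reading chunk itself is not shown above; a minimal sketch consistent with the output below (using "\001", a byte that never occurs in the file, as delimiter so that each line lands in a single column named review):

# read each line of computer.txt into a single column called "review"
computer_531 <- read_delim("computer.txt", delim = "\001", col_names = "review")

# inspect the first rows
head(computer_531)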
Rows: 531 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\001"
chr (1): review
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 1
review
<chr>
1 ## I purchased this monitor because of budgetary concerns .
2 inexpensive[+1][a] ## This item was the most inexpensive 17 inch monitor avai…
3 monitor[-1] ## My overall experience with this monitor was very poor .
4 screen[-1], picture quality[-1] ## When the screen was n't contracting or gli…
5 monitor[-1], picture quality[-1] ## I 've viewed numerous different monitor m…
6 screen[-1] ## A week out of the box and I began to see slight contractions of…
wordcloud is a function from the wordcloud package, which plots cool word clouds based on word frequencies in a given data set. Use this function to plot the 50 most frequent words with a minimum frequency of 5 using your data set. wordcloud will directly tokenize your documents. Check the help file of the wordcloud function (e.g., ?wordcloud) to see how to specify the arguments.
wordcloud(
  words = computer_531$review,
  min.freq = 5,
  max.words = 50,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)
Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
drops documents
Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
tm::stopwords())): transformation drops documents
Another way to visualize the term frequencies is using a bar plot. For this we need to convert the data to a tidy format (where each word is a different row) and count the number of occurrences of each word.
Use the unnest_tokens function from the tidytext package to break the text into individual tokens (a process called tokenization) and use the head function to see its first several rows.
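The tokenization chunk is not shown; a sketch of what it likely looked like, given that the later chunks expect an object called comp_words with a column named word:

# one row per token; unnest_tokens also lowercases and strips punctuation
comp_words <- computer_531 |>
  unnest_tokens(output = word, input = review)

# first several rows
head(comp_words)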
Then use functions from the dplyr package (e.g., count, arrange) to select the 30 most frequent tokens and plot a bar chart using ggplot.
comp_words |>
  # count the frequency of each word
  count(word) |>
  # arrange the words by their frequency in descending order
  arrange(desc(n)) |>
  # select the top 30 most frequent words
  head(30) |>
  # make a bar plot (reorder words by their frequencies)
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col(fill = "gray") +
  labs(x = "frequency", y = "words") +
  theme_minimal()
Hint: Use the anti_join function to filter out the stop words, using the stop_words data set from the tidytext package. stop_words is a data set with stop words. By using anti_join, you keep only the words in the first data set that are not in the stop_words data set. Check the help file if you want further information on either one (e.g., ?anti_join, ?stop_words).
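The chunk creating comp_words_no_stop is not shown; a sketch that leaves the by argument unspecified, which is what produces the join message below:

# drop every token that appears in the stop_words data set
comp_words_no_stop <- comp_words |>
  anti_join(stop_words)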
Joining with `by = join_by(word)`
comp_words_no_stop |>
  # count the frequency of each word
  count(word) |>
  # arrange the words by their frequency in descending order
  arrange(desc(n)) |>
  # select the top 30 most frequent words
  head(30) |>
  # make a bar plot (reorder words by their frequencies)
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col(fill = "gray") +
  labs(x = "frequency", y = "words") +
  theme_minimal()
Create your own stop-word data set df_stopwords by modifying stop_words: add new words (e.g., ‘buy’) and remove some of the words currently listed (e.g., ‘order’). Then use df_stopwords in the anti_join function and check the plot.
df_stopwords <-
  stop_words |>
  # keep "order" and its variants in the data
  filter(!word %in% c("order", "ordered", "ordering", "orders")) |>
  # add two custom stop words under our own lexicon label
  rbind(tibble(word = c("buy", "online"), lexicon = c("Me", "Me")))

# remove stop words using the modified list
comp_words_no_stop_modified <-
  comp_words |>
  anti_join(df_stopwords)
Joining with `by = join_by(word)`
comp_words_no_stop_modified |>
  # count the frequency of each word
  count(word) |>
  # arrange the words by their frequency in descending order
  arrange(desc(n)) |>
  # select the top 30 most frequent words
  head(30) |>
  # make a bar plot (reorder words by their frequencies)
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col(fill = "gray") +
  labs(x = "frequency", y = "words") +
  theme_minimal()
The SnowballC package implements Porter's stemming algorithm. Use the getStemLanguages function to find out how many languages this package supports.
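The call itself takes no arguments:

# languages supported by the Snowball stemmers
getStemLanguages()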
[1] "arabic" "basque" "catalan" "danish" "dutch"
[6] "english" "finnish" "french" "german" "greek"
[11] "hindi" "hungarian" "indonesian" "irish" "italian"
[16] "lithuanian" "nepali" "norwegian" "porter" "portuguese"
[21] "romanian" "russian" "spanish" "swedish" "tamil"
[26] "turkish"
Now use the wordStem function to stem the words, and plot the 30 most frequent stems as before.

comp_words_stemming <-
  comp_words_no_stop_modified |>
  # add the stemmed form of each word
  mutate(stem = wordStem(word))
comp_words_stemming |>
  # count the frequency of each stem
  count(stem) |>
  # arrange the stems by their frequency in descending order
  arrange(desc(n)) |>
  # select the top 30 most frequent stems
  head(30) |>
  # make a bar plot (reorder stems by their frequencies)
  ggplot(aes(x = n, y = reorder(stem, n))) +
  geom_col(fill = "gray") +
  labs(x = "frequency", y = "stems") +
  theme_minimal()
Regular expressions
Find the reviews in the computer_531 data set which contain the words “Monitor” or “monitor”, “memory” or “Memory”, and “Delivery” or “delivery”. See how many reviews contain each pair of words.
Hint: One of many different ways to achieve this is as follows:
- Use the str_detect function from the stringr package to find the presence of patterns of your interest. See the lecture slide if you are unsure how to specify the regex. You can try different regexes here: https://regexr.com/
- Use the filter function to only retain the ones that contain the patterns. You can write it in one line, such as filter(str_detect(input vector, regex)).
- Count the number of reviews that contain each pattern.
# reviews containing "Monitor" or "monitor"
reviews_mon <- computer_531 |>
  filter(str_detect(review, "[Mm]onitor"))
# reviews containing "memory" or "Memory"
reviews_mem <- computer_531 |>
  filter(str_detect(review, "[Mm]emory"))
# reviews containing "Delivery" or "delivery"
reviews_del <- computer_531 |>
  filter(str_detect(review, "[Dd]elivery"))
# compare the occurrence of each pair of words
kwords <- c("Monitor", "Memory", "Delivery")
nr_kwords <- c(nrow(reviews_mon), nrow(reviews_mem), nrow(reviews_del))
tibble(kwords, nr_kwords)
# A tibble: 3 × 2
kwords nr_kwords
<chr> <int>
1 Monitor 105
2 Memory 5
3 Delivery 3
Use the str_detect, str_extract, str_subset, and str_match functions from the stringr package to check which words appear before “monitor” in the reviews from the computer_531 data set.
What does the regular expression (\\w+) ([Mm]onitor) match? You can explore it using https://regexr.com
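The chunk is not shown; the outputs below are consistent with calls along these lines (a sketch over the first six reviews):

pattern <- "(\\w+) ([Mm]onitor)"
# TRUE/FALSE: is there a word followed by "monitor"/"Monitor"?
str_detect(head(computer_531$review), pattern)
# the first match per review (NA when there is none)
str_extract(head(computer_531$review), pattern)
# the matching reviews themselves
head(str_subset(computer_531$review, pattern))
# the match split into its capture groups
str_match(head(computer_531$review), pattern)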
[1] TRUE TRUE TRUE FALSE TRUE FALSE
[1] "this monitor" "inch monitor" "this monitor"
[4] NA "different monitor" NA
[1] "## I purchased this monitor because of budgetary concerns ."
[2] "inexpensive[+1][a] ## This item was the most inexpensive 17 inch monitor available to me at the time I made the purchase ."
[3] "monitor[-1] ## My overall experience with this monitor was very poor ."
[4] "monitor[-1], picture quality[-1] ## I 've viewed numerous different monitor models since I 'm a college student and this particular monitor had as poor of picture quality as any I 've seen ."
[5] "## I sent it back and now will search for a better quality monitor ."
[6] "monitor[-1], colors[-1] ## I used this monitor for 2 years and now it shows a mixture of various colors and de-focused text ."
[,1] [,2] [,3]
[1,] "this monitor" "this" "monitor"
[2,] "inch monitor" "inch" "monitor"
[3,] "this monitor" "this" "monitor"
[4,] NA NA NA
[5,] "different monitor" "different" "monitor"
[6,] NA NA NA
# The regular expression `(\\w+) ([Mm]onitor)` matches any word,
# followed by a space, followed by either "Monitor" or "monitor".
# - str_detect: returns TRUE/FALSE depending on whether a match is found
# - str_extract: returns the match if it is found (NA otherwise)
# - str_subset: returns the observations for which a match has been found
# - str_match: returns the match (split into capture groups) and NA otherwise
You can use the str_extract_all and str_match_all functions to extract all matches, not just the first one. Use either of these functions to see all of the fully capitalized words in the first 10 reviews.
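A sketch of one possible call; the pattern assumes a fully capitalized word is a run of capital letters between word boundaries, and simplify = TRUE pads the result into the matrix shown below:

# all fully capitalized words in the first 10 reviews
str_extract_all(head(computer_531$review, 10), "\\b[A-Z]+\\b", simplify = TRUE)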
[,1] [,2] [,3]
[1,] "I" "" ""
[2,] "I" "" ""
[3,] "" "" ""
[4,] "" "" ""
[5,] "I" "I" "I"
[6,] "A" "I" ""
[7,] "" "" ""
[8,] "I" "I" ""
[9,] "I" "" ""
[10,] "I" "" ""
To retrieve structured information from the computer_531 data, let's first use a regular expression to extract the characters at the beginning of each line until ##. Do this for only the first 20 reviews.
Hint: Use the str_extract function such that str_extract(first 20 reviews, regex).
Hint 2: The symbol ^ matches the start of a line. The symbol . matches any character. The symbol * indicates that the previous character can appear any number of times (including zero times). For example, .* matches all characters (until the end of the line); .*R matches all characters until a capital R is found.
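A sketch of the call that produces the output below:

# everything from the start of each line up to and including "##"
str_extract(head(computer_531$review, 20), "^.*##")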
[1] "##" "inexpensive[+1][a] ##"
[3] "monitor[-1] ##" "screen[-1], picture quality[-1] ##"
[5] "monitor[-1], picture quality[-1] ##" "screen[-1] ##"
[7] "Display[-1] ##" "monitor[-1] ##"
[9] "##" "##"
[11] "##" "##"
[13] "##" "monitor[-1], colors[-1] ##"
[15] "##" "size[+1] ##"
[17] "computer[+1], quality[+1] ##" "##"
[19] "keyboard[+1], color[+1] ##" "speed[+1], memory[+1] ##"
Add a new column to the computer_531 data frame with the name cleaned_review, which contains only the review text, and another column with the name aspect_sentiment, which contains the aspects and sentiment scores (i.e., the part at the beginning of each line, before ##).
Hint: Use mutate() and str_extract() (e.g., mutate(cleaned_review = str_extract(...), aspect_sentiment = str_extract(...))).
computer_531 <- computer_531 |>
  mutate(cleaned_review = str_extract(review, "##.*$"),
         aspect_sentiment = str_extract(review, "^.*##"))

# check the newly added columns
head(computer_531[-1])
# A tibble: 6 × 2
cleaned_review aspect_sentiment
<chr> <chr>
1 ## I purchased this monitor because of budgetary concerns . ##
2 ## This item was the most inexpensive 17 inch monitor availa… inexpensive[+1]…
3 ## My overall experience with this monitor was very poor . monitor[-1] ##
4 ## When the screen was n't contracting or glitching the over… screen[-1], pic…
5 ## I 've viewed numerous different monitor models since I 'm… monitor[-1], pi…
6 ## A week out of the box and I began to see slight contracti… screen[-1] ##
# for an even cleaner version you can use regex's lookbehind and lookahead
# features to drop the ## parts:
computer_531 <- computer_531 |>
  mutate(cleaned_review = str_extract(review, "(?<=##).*$"),
         aspect_sentiment = str_extract(review, "^.*(?=##)"))

# check the newly added columns
head(computer_531[-1])
# A tibble: 6 × 2
cleaned_review aspect_sentiment
<chr> <chr>
1 " I purchased this monitor because of budgetary concerns ." ""
2 " This item was the most inexpensive 17 inch monitor availab… "inexpensive[+1…
3 " My overall experience with this monitor was very poor ." "monitor[-1] "
4 " When the screen was n't contracting or glitching the overa… "screen[-1], pi…
5 " I 've viewed numerous different monitor models since I 'm … "monitor[-1], p…
6 " A week out of the box and I began to see slight contractio… "screen[-1] "
Finally, the code below creates a new column sentiment with the values positive, negative, and neutral. It assigns neutral when there is no aspect in the corresponding column or when the sum of the scores is equal to zero, and it assigns negative (positive) when the sum of the scores is lower (higher) than zero.
# define sum_list function which adds the scores
sum_list <- function(list_values) {
  sum(as.numeric(list_values))
}

computer_531 <- computer_531 |>
  # sentiment_score: extract the signed scores and sum them using sum_list
  mutate(sentiment_score = map_dbl(str_extract_all(aspect_sentiment, "[-+]\\d+"), sum_list),
         # assign negative when score < 0
         sentiment = case_when(sentiment_score < 0 ~ "negative",
                               # assign positive when score > 0
                               sentiment_score > 0 ~ "positive",
                               # assign neutral otherwise
                               TRUE ~ "neutral"))
head(computer_531)
# A tibble: 6 × 5
review cleaned_review aspect_sentiment sentiment_score sentiment
<chr> <chr> <chr> <dbl> <chr>
1 ## I purchased this… " I purchased… "" 0 neutral
2 inexpensive[+1][a] … " This item w… "inexpensive[+1… 1 positive
3 monitor[-1] ## My o… " My overall … "monitor[-1] " -1 negative
4 screen[-1], picture… " When the sc… "screen[-1], pi… -2 negative
5 monitor[-1], pictur… " I 've viewe… "monitor[-1], p… -2 negative
6 screen[-1] ## A wee… " A week out … "screen[-1] " -1 negative
[1] " After much delay I received my Acer Netbook Aspire One and I can teell you I am very satisfied with this computer , is very pretty , well finished , high quality , so quiet , not heated than other netbook or at least I think it is within in normal ranges , the keyboard is very smooth and very well laid out , the screen functions are very practical in a nutshell is an excellent buy and I am very pleased with the purchase , I recommend it ."