The aim of this practical is to introduce you to text pre-processing and regular expressions.
In this practical, you will learn how to:
Prepare and pre-process your text data.
Visualize the most frequent words in a data set using a word cloud and a bar plot.
Use regular expressions to retrieve structured information from text.
library(tidyverse) # as always :)
library(tidytext)  # for text mining
library(tm)        # needed for wordcloud
library(wordcloud) # to create pretty word clouds
library(SnowballC) # for Porter's stemming algorithm
Preparation & Pre-processing
Text data set
We are going to use a data set with reviews of computers in this practical.
This data set is annotated using aspect-based sentiment analysis. Aspect-based sentiment analysis is a text analysis technique that categorizes data by specific aspects and identifies the sentiment attributed to each one. You can download the data as a text file here. The original source is here.
Each row of the file has the following format:
screen[-1], picture quality[-1] ## When the screen was n't contracting or glitching the overall picture quality was poor to fair .
The aspects appear before the ## separator; the review text appears after it. The number in square brackets is the sentiment score of the corresponding aspect.
Visualization
1. Use the read_delim function from the readr package to read the data from the computer.txt file, and store it in an object called computer_531 (the name used in the rest of this practical).
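A minimal sketch of one possible call, assuming computer.txt sits in your working directory; the tab delimiter and the column name review are choices made here, not fixed requirements. Because the file contains no tabs, each full line ends up in a single column:
computer_531 <- read_delim("computer.txt",
                           delim = "\t",         # no tabs in the file, so each line becomes one field
                           quote = "",           # reviews may contain quote characters
                           col_names = "review") # a single column called review
head(computer_531)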
2. wordcloud is a function from the wordcloud package, which plots cool word clouds based on word frequencies in a given data set. Use this function to plot the 50 most frequent words with a minimum frequency of 5 using your data set. wordcloud will directly tokenize your documents.
Check the help file of wordcloud function (e.g., ?wordcloud) to see how to specify the arguments.
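A minimal sketch, assuming the review column created above; max.words and min.freq are the two arguments the question asks about:
wordcloud(computer_531$review,
          max.words = 50,        # plot at most the 50 most frequent words
          min.freq = 5,          # only words occurring at least 5 times
          random.order = FALSE)  # put the most frequent words in the centre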
Another way to visualize the term frequencies is using a barplot. For this we need to convert the data to a tidy format (where each word is a different row) and count the number of occurrences of each word. See the diagram below.
3. Use the unnest_tokens function from the tidytext package to break the text into individual tokens (a process called tokenization) and use the head function to see its first several rows.
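A minimal sketch; the name computer_tokens is just a placeholder for the tokenized data frame:
computer_tokens <- computer_531 %>%
  unnest_tokens(output = word, input = review)  # one row per token
head(computer_tokens)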
4. Use functions from the dplyr package (e.g., count, arrange) to select the 30 most frequent tokens and plot a bar chart using ggplot.
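A minimal sketch, building on the hypothetical computer_tokens from the previous step:
computer_tokens %>%
  count(word, sort = TRUE) %>%  # frequency of each token
  head(30) %>%                  # the 30 most frequent tokens
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Frequency", y = NULL)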
5. Remove the stop words from the tokenized data frame and plot a new bar chart.
Hint: Use the anti_join function to filter out the words in stop_words, a data set of stop words from the tidytext package. With anti_join, you keep only the words in the first data set that are not in the stop_words data set. Check the help files if you want more information on either (e.g., ?anti_join, ?stop_words).
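A minimal sketch; only the anti_join step is new, the counting and plotting are the same as before:
computer_tokens %>%
  anti_join(stop_words, by = "word") %>%  # drop rows whose word is a stop word
  count(word, sort = TRUE) %>%
  head(30)
# ... then pipe into the same ggplot call as in the previous question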
6. You can always adjust your stop-word list. Create a custom list of stop words, df_stopwords, by modifying stop_words: add new words (e.g., ‘buy’) and remove some of the current ones (e.g., ‘order’). Then use df_stopwords in the anti_join function and check the plot.
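A minimal sketch of one way to build such a list; stop_words has the columns word and lexicon:
df_stopwords <- stop_words %>%
  filter(word != "order") %>%                          # 'order' is no longer treated as a stop word
  bind_rows(tibble(word = "buy", lexicon = "custom"))  # 'buy' is now treated as a stop word

computer_tokens %>%
  anti_join(df_stopwords, by = "word")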
7. When we deal with text, documents often contain different inflected forms of one base word, called a stem. SnowballC is a package implementing the famous Porter stemming algorithm, which collapses these forms into their stem. Use the getStemLanguages function to find out how many languages this package supports.
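The check itself is a one-liner:
getStemLanguages()          # the supported languages
length(getStemLanguages())  # how many there are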
8. Apply stemming to your data frame and save the results in a new column. Show the 30 most frequent stemmed words in a bar plot.
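A minimal sketch using SnowballC’s wordStem, again assuming the hypothetical computer_tokens and df_stopwords from the earlier steps:
computer_tokens %>%
  anti_join(df_stopwords, by = "word") %>%
  mutate(stem = wordStem(word, language = "english")) %>%  # new column with the stems
  count(stem, sort = TRUE) %>%
  head(30)
# ... then plot as before, mapping y to reorder(stem, n)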
Regular expressions
9. Use regular expressions (regex) to find the reviews in computer_531 that contain the words “Monitor” or “monitor”, “memory” or “Memory”, and “Delivery” or “delivery”. Count how many reviews contain each pair of words.
Hint: One of many ways to achieve this is as follows (a sketch is given after this list):
Use the str_detect function from the stringr package to detect the presence of a pattern of interest. See the lecture slides if you are unsure how to specify the regex. You can try out regexes here: https://regexr.com/
Use the filter function to retain only the reviews that contain the pattern. You can write this in one line, such as filter(str_detect(column, regex)).
Count the number of reviews that contain each pattern.
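A minimal sketch covering all three pairs at once, assuming the review column; summing a logical vector counts its TRUE values:
computer_531 %>%
  summarise(monitor  = sum(str_detect(review, "[Mm]onitor")),
            memory   = sum(str_detect(review, "[Mm]emory")),
            delivery = sum(str_detect(review, "[Dd]elivery")))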
10. Compare the str_detect, str_extract, str_subset, and str_match functions from the stringr package to check which words appear before “monitor” in the reviews from the computer_531 data set.
What does the regular expression (\\w+) ([Mm]onitor) match to? You can explore it using https://regexr.com
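A minimal sketch that runs all four functions on the same pattern, so you can compare the shapes of their outputs; assuming the review column as before:
pattern <- "(\\w+) ([Mm]onitor)"
str_detect(computer_531$review, pattern) %>% head()   # logical: is there a match?
str_subset(computer_531$review, pattern) %>% head()   # the reviews that match
str_extract(computer_531$review, pattern) %>% head()  # the matched substring (or NA)
str_match(computer_531$review, pattern) %>% head()    # matrix: full match plus the capture groups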
11. You can also use str_extract_all and str_match_all to extract all matches. Use either of these functions to see all of the fully capitalized words in the first 10 reviews.
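A minimal sketch, assuming the review column; requiring at least two capital letters avoids matching single-letter words such as “I”:
str_extract_all(computer_531$review[1:10], "\\b[A-Z]{2,}\\b")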
12. In order to separate aspects and sentiments in the reviews from the computer_531 data, let’s first use a regular expression to extract the characters at the beginning of each line until ##. Do this for only the first 20 reviews.
Hint: Use the str_extract function such that str_extract(first 20 reviews, regex).
Hint 2: The symbol ^ matches the start of a line. The symbol . matches any character. The symbol * indicates that the previous character can appear any number of times (including zero). E.g., .* matches all characters up to the end of the line; because * is greedy, .*R matches everything up to the last capital R on the line.
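Putting the hints together, a minimal sketch (assuming the review column):
str_extract(computer_531$review[1:20], "^.*##")  # everything up to and including ##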
13. Add a new column to the computer_531 data frame with the name cleaned_review, which contains only the review text. Also add another column with the name aspect_sentiment, which contains the aspects and sentiment scores (i.e., the part at the beginning of each line, before ##).
Hint: Use mutate() and str_extract() (e.g., mutate(cleaned_review = str_extract(...), aspect_sentiment = str_extract(...))).
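A minimal sketch using lookarounds, assuming the raw lines are in the review column; (?<=##) and (?=##) match the positions just after and just before the separator without including it in the match:
computer_531 <- computer_531 %>%
  mutate(cleaned_review   = str_extract(review, "(?<=##).*"),  # text after ##
         aspect_sentiment = str_extract(review, "^.*(?=##)"))  # text before ##
You may want to wrap both extractions in str_trim() to drop the spaces around the separator.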
14. Run the code below to create a new column sentiment with the values positive, negative, and neutral. The code assigns neutral when there is no aspect in the corresponding column or when the scores sum to zero, and it assigns negative (positive) when the scores sum to less (more) than zero.