Text Mining I

Published

October 23, 2023

Introduction

Welcome to the second practical of the week!

The aim of this practical is to introduce you to text pre-processing and regular expressions.

In this practical, you will learn:

  • How to visualize the most frequent words in a data set using a word cloud and a bar plot.
  • How to prepare and pre-process your text data.
  • How to use regular expressions to retrieve structured information from text.

library(tidyverse) # as always :)
library(tidytext)  # for text mining
library(tm)        # needed for wordcloud
library(wordcloud) # to create pretty word clouds
library(SnowballC) # for Porter's stemming algorithm

Preparation & Pre-processing

Text data set

We are going to use the following data set in this practical.

  • A data set with reviews of computers.
    This data set is annotated using aspect-based sentiment analysis. Aspect-based sentiment analysis is a text analysis technique that categorizes data by specific aspects and identifies the sentiment attributed to each one. You can download the data as a text file here. The original source is here.

Each row of the file has the following format:

screen[-1], picture quality[-1] ## When the screen was n't contracting or glitching the overall picture quality was poor to fair .

The aspects appear before the ## separator; the review text appears after it.
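
For instance, a single row can be split on the ## separator with base R (a minimal sketch; the variable names are just illustrative):

row <- "screen[-1], picture quality[-1] ## When the screen was n't contracting or glitching the overall picture quality was poor to fair ."
parts   <- strsplit(row, "##", fixed = TRUE)[[1]]  # split into two parts
aspects <- trimws(parts[1])  # "screen[-1], picture quality[-1]"
review  <- trimws(parts[2])  # the review sentence itself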

Visualization

1. Use the read_delim function from the readr package to read data from the computer.txt file.
computer_531 <- read_delim(
  file = "data/computer.txt", 
  delim = "\n", 
  col_names = "review"
)
Rows: 531 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\001"
chr (1): review

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(computer_531)
# A tibble: 6 × 1
  review                                                                        
  <chr>                                                                         
1 ## I purchased this monitor because of budgetary concerns .                   
2 inexpensive[+1][a] ## This item was the most inexpensive 17 inch monitor avai…
3 monitor[-1] ## My overall experience with this monitor was very poor .        
4 screen[-1], picture quality[-1] ## When the screen was n't contracting or gli…
5 monitor[-1], picture quality[-1] ## I 've viewed numerous different monitor m…
6 screen[-1] ## A week out of the box and I began to see slight contractions of…
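As the message above suggests, you can quiet it by specifying the column types up front or by setting show_col_types = FALSE. A minimal sketch of the first option:

computer_531 <- read_delim(
  file      = "data/computer.txt",
  delim     = "\n",
  col_names = "review",
  col_types = cols(review = col_character())  # or add show_col_types = FALSE
)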
2. wordcloud is a function from the wordcloud package, which plots cool word clouds based on word frequencies in a given data set. Use this function to plot the 50 most frequent words in your data set, with a minimum frequency of 5. wordcloud tokenizes your documents for you directly.

Check the help file of wordcloud function (e.g., ?wordcloud) to see how to specify the arguments.

wordcloud(
  words = computer_531$review,
  min.freq = 5,
  max.words = 50,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)
Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
drops documents
Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
tm::stopwords())): transformation drops documents

Another way to visualize term frequencies is with a bar plot. For this, we need to convert the data to a tidy format (where each word is a separate row) and count the number of occurrences of each word. See the sketch and the diagram below.
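
In code, this tidy-and-count step looks roughly like the following (a minimal sketch; the name words_531 and the output column word are illustrative, not part of the practical):

words_531 <- computer_531 %>%
  unnest_tokens(output = word, input = review) %>% # tokenize: one row per word
  count(word, sort = TRUE)                         # count occurrences of each word

The result is a two-column tibble (word, n) that can be passed straight to ggplot2 for a bar plot.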