Text Mining I

Published

October 23, 2023

Introduction

Welcome to the second practical of the week!

The aim of this practical is to introduce you to text pre-processing and regular expressions.

In this practical, you will learn how to:

  • Visualize the most frequent words in a data set using a word cloud and a bar plot.
  • Prepare and pre-process your text data.
  • Use regular expressions to retrieve structured information from text.

This practical uses the following packages:

library(tidyverse) # as always :)
library(tidytext)  # for text mining
library(tm)        # needed for wordcloud
library(wordcloud) # to create pretty word clouds
library(SnowballC) # for Porter's stemming algorithm

Preparation & Pre-processing

Text data set

We are going to use the following data set in this practical.

  • A data set with reviews of computers.
    This data set is annotated using aspect-based sentiment analysis. Aspect-based sentiment analysis is a text analysis technique that categorizes data by specific aspects and identifies the sentiment attributed to each one. You can download the data as a text file here. The original source is here.

Each row of the file has the following format:

screen[-1], picture quality[-1] ## When the screen was n't contracting or glitching the overall picture quality was poor to fair .

The aspects are shown before ##, the review is shown after ##.
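
To make the format concrete, here is a minimal sketch in base R that splits the example row above into its aspects and its review text. The variable names are illustrative, and the sketch assumes that each row contains exactly one ## separator and that every aspect ends in a bracketed integer score.

```r
# The example row shown above, copied verbatim.
line <- "screen[-1], picture quality[-1] ## When the screen was n't contracting or glitching the overall picture quality was poor to fair ."

# Split on the literal "##" separator (assumed to occur once per row).
parts <- strsplit(line, "##", fixed = TRUE)[[1]]
aspects_raw <- trimws(parts[1])
review <- trimws(parts[2])

# Split the aspect part on commas, then separate each name from its score.
aspect_items <- trimws(strsplit(aspects_raw, ",", fixed = TRUE)[[1]])
aspect_name <- sub("\\[.*$", "", aspect_items)
aspect_score <- as.integer(sub("^.*\\[([+-]?[0-9]+)\\]$", "\\1", aspect_items))

# One row per aspect, with its sentiment score.
aspects <- data.frame(aspect = aspect_name, score = aspect_score)
print(aspects)
print(review)
```

The same idea can be applied to every row of the file once it has been read in; the regular expressions used here are only one possible way to pull the name and the score apart.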

Visualization

1. Use the read_delim function from the readr package to read data from the computer.txt file.
2. wordcloud is a function from the wordcloud package that plots cool word clouds based on word frequencies in a given data set. Use this function to plot the 50 most frequent words in your data set, keeping only words with a minimum frequency of 5. wordcloud tokenizes your documents itself, so you can pass it the raw text directly.

Check the help file of the wordcloud function (e.g., ?wordcloud) to see how to specify the arguments.
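
As a rough sketch of how the two steps above could fit together, the code below reads a file with one review per row and passes the text to wordcloud. It is one possible approach, not the reference solution. To stay self-contained it uses a temporary stand-in file built from the example row instead of computer.txt, and it assumes the rows contain no tab characters, so that a tab delimiter keeps each row in a single column. The names path, reviews, and text are made up for this illustration.

```r
library(readr)
library(tm)        # needed by wordcloud when it is given raw text
library(wordcloud)

# Stand-in for computer.txt: the example row, repeated so that its words
# reach the minimum frequency. Point `path` at your own file instead.
example_row <- "screen[-1], picture quality[-1] ## When the screen was n't contracting or glitching the overall picture quality was poor to fair ."
path <- tempfile(fileext = ".txt")
writeLines(rep(example_row, 6), path)

# Assumption: no tabs in the data, so each row ends up in one column.
# quote = "" stops readr from treating quotation marks as field quoting.
reviews <- read_delim(path, delim = "\t", col_names = "text", quote = "")

# Plot at most 50 words, keeping only words that occur at least 5 times.
set.seed(123) # the layout is random, so fix the seed for reproducibility
wordcloud(words = reviews$text, max.words = 50, min.freq = 5,
          random.order = FALSE)
```

Setting random.order = FALSE places the most frequent words in the centre of the cloud, which is a presentation choice rather than a requirement of the exercise.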

Another way to visualize the term frequencies is using a bar plot. For this, we need to convert the data to a tidy format (where each word is a different row) and count the number of occurrences of each word. See the diagram below.
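
The conversion described above can be sketched as follows, using unnest_tokens from tidytext to get one word per row and count from dplyr to tally the words. The toy data is an assumption for illustration only: the first review is the example row from earlier, and the other two are made up so that some words repeat. The names reviews and word_counts are likewise illustrative.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Toy data: the first review comes from the example row; the other two
# are made-up sentences so that some words occur more than once.
reviews <- tibble(text = c(
  "When the screen was n't contracting or glitching the overall picture quality was poor to fair .",
  "The screen is bright and the keyboard feels solid .",
  "Battery life is poor but the screen is sharp ."
))

# Tidy format: one lower-cased word per row, then count each word.
word_counts <- reviews %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

# Bar plot of the ten most frequent words, ordered by frequency.
word_counts %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Frequency", y = "Word")
```

On real data you would replace the toy tibble with the reviews read from the file and pick a cut-off (here ten words) that keeps the plot readable.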