library(tidyverse) # as always :)
library(tidytext) # for text mining
library(tm) # needed for wordcloud
library(wordcloud) # to create pretty word clouds
library(SnowballC) # for Porter's stemming algorithm
Text Mining I
Introduction
Welcome to the second practical of the week!
The aim of this practical is to introduce you to text pre-processing and regular expressions.
In this practical, you will learn how to:
- Visualize the most frequent words in a data set using a word cloud and a bar plot.
- Prepare and pre-process your text data.
- Use regular expressions to retrieve structured information from text.
Preparation & Pre-processing
Text data set
We are going to use the following data set in this practical.
- A data set with reviews of computers.
This data set is annotated using aspect-based sentiment analysis. Aspect-based sentiment analysis is a text analysis technique that categorizes data by specific aspects and identifies the sentiment attributed to each one. You can download the data as a text file here. The original source is here.
Each row of the file has the following format:
screen[-1], picture quality[-1] ## When the screen was n't contracting or glitching the overall picture quality was poor to fair .
The aspects are shown before ##, and the review is shown after ##.
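To make the row format concrete, here is a minimal sketch of how one such row could be split into its aspects and its review text. It uses only the example row shown above; the variable names (row, aspect_part, aspects, review) are made up for illustration, and the pattern assumes every aspect is followed by a signed integer in square brackets, which may not hold for every row of the real file.

```r
library(stringr)  # part of the tidyverse

# Example row copied from the text above
row <- "screen[-1], picture quality[-1] ## When the screen was n't contracting or glitching the overall picture quality was poor to fair ."

# Split once on the "##" separator: aspects on the left, review on the right
parts <- str_split_fixed(row, "##", n = 2)
aspect_part <- str_trim(parts[1])
review <- str_trim(parts[2])

# Assumption: each aspect looks like "name[score]" with a signed integer score.
# Capture the name (group 1) and the score (group 2) for every aspect.
matches <- str_match_all(aspect_part, "([^,\\[]+)\\[([+-]?\\d+)\\]")[[1]]

aspects <- data.frame(
  aspect = str_trim(matches[, 2]),
  sentiment = as.integer(matches[, 3])
)

aspects
review
```

Splitting with n = 2 keeps any later occurrence of ## inside the review text intact, so only the first separator is treated as the boundary.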
Visualization
Another way to visualize the term frequencies is using a bar plot. For this we need to convert the data to a tidy format (where each word is a different row) and count the number of occurrences of each word. See the diagram below.
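The conversion to a tidy format and the counting step can be sketched as follows. The three reviews here are invented toy data, not rows from the real data set, and the object names (reviews, word_counts, bar_plot) are only illustrative; treat this as one possible approach rather than the intended solution.

```r
library(tidyverse)
library(tidytext)

# Hypothetical toy data: one review per row (not the real data set)
reviews <- tibble(
  review = c(
    "the screen is great",
    "the screen was poor",
    "great battery and great keyboard"
  )
)

# Tidy format: one word per row, then count the occurrences of each word
word_counts <- reviews %>%
  unnest_tokens(word, review) %>%
  count(word, sort = TRUE)

# Bar plot of the five most frequent words, ordered by frequency
bar_plot <- word_counts %>%
  slice_max(n, n = 5, with_ties = FALSE) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency")

print(bar_plot)
```

Note that unnest_tokens() lowercases the text and strips punctuation by default, so some pre-processing already happens at this step. Common words such as "the" still appear in the counts unless stop words are removed first.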