library(tidyverse) # as always :)
library(tidytext) # for text mining
library(text2vec) # our dataset comes from this package
Text Mining II
Introduction
Welcome to the third practical of the week!
The aim of this practical is to introduce you to word embedding and to deepen your understanding of sentiment analysis by doing dictionary-based sentiment analysis.
Word Embedding
In this part of the practical we will use word embedding. To work with text, we need to transform it into numerical values. One way is to count the words and weight their frequency (e.g. with TF-IDF). Another method is word embedding. Techniques such as word2vec and GloVe learn a dense numerical vector for each word: word2vec trains a shallow neural network on local context windows, while GloVe fits a model to global word co-occurrence counts. With these techniques, words that occur in similar contexts are represented by similar numerical vectors.
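To make "similar numerical vectors" concrete, here is a minimal sketch in base R that compares toy word vectors with cosine similarity. The three-dimensional vectors and the helper `cosine_similarity()` are invented for illustration; real embeddings are learned from data and typically have tens to hundreds of dimensions.

```r
# Toy "embeddings": the numbers are made up for illustration only
vectors <- rbind(
  wizard = c(0.9, 0.8, 0.1),
  witch  = c(0.8, 0.9, 0.2),
  car    = c(0.1, 0.2, 0.9)
)

# Cosine similarity: dot product divided by the product of the vector norms
# (hypothetical helper, not part of any package used in this practical)
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

sim_wizard_witch <- cosine_similarity(vectors["wizard", ], vectors["witch", ])
sim_wizard_car   <- cosine_similarity(vectors["wizard", ], vectors["car", ])

# Words used in similar contexts should end up with the higher similarity
sim_wizard_witch > sim_wizard_car
```

Cosine similarity is a common choice here because it compares the direction of two vectors and ignores their length.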
We will use the Harry Potter books as our data. Let’s start by installing the harrypotter package from GitHub using remotes.
remotes::install_github("bradleyboehmke/harrypotter")
library(harrypotter) # Not to be confused with the CRAN palettes package
The harrypotter package supplies the first seven novels in the Harry Potter series. Here is some info about this data set:
A data set with all Harry Potter books.
This data set contains the full texts of the first seven Harry Potter books (listed below). Each text is a character vector in which each element represents a single chapter. The data come from the harrypotter package written by Bradley Boehmke.
- philosophers_stone: Harry Potter and the Philosopher's Stone, published in 1997
- chamber_of_secrets: Harry Potter and the Chamber of Secrets, published in 1998
- prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban, published in 1999
- goblet_of_fire: Harry Potter and the Goblet of Fire, published in 2000
- order_of_the_phoenix: Harry Potter and the Order of the Phoenix, published in 2003
- half_blood_prince: Harry Potter and the Half-Blood Prince, published in 2005
- deathly_hallows: Harry Potter and the Deathly Hallows, published in 2007
Dictionary-based Sentiment Analysis
In this part of the practical, we are going to apply dictionary-based sentiment analysis methods to the movie_review data set.
Text data set
We are going to use the data set movie_review. This is a data set with 5,000 IMDB movie reviews available from the text2vec package, labeled according to their IMDB rating. The sentiment label is binary: an IMDB rating < 5 gives a sentiment score of 0, and a rating >= 7 gives a sentiment score of 1. No individual movie has more than 30 reviews.
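The labeling rule can be sketched in base R as follows. The `ratings` vector is hypothetical, and the treatment of ratings 5 and 6 is an assumption: the rule described above does not cover them, so this sketch marks them as missing.

```r
# Hypothetical IMDB ratings on a 1-10 scale (not taken from movie_review)
ratings <- c(2, 4, 6, 7, 10)

# Rating < 5 -> 0 (negative), rating >= 7 -> 1 (positive).
# Assumption: ratings of 5 or 6 fall outside the rule and become NA.
sentiment <- ifelse(ratings < 5, 0L,
                    ifelse(ratings >= 7, 1L, NA_integer_))

data.frame(rating = ratings, sentiment = sentiment)
```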
The objective of the practical is to understand if the reviews are predictive of the IMDB rating.
Let’s load the data set and convert it to a dataframe.
# load the movie review dataset from the text2vec package
data("movie_review")
movie_review <- as_tibble(movie_review)
head(movie_review)
Dictionary-based sentiment analysis
The tidytext package contains 4 general purpose lexicons in the sentiments dataset.
- afinn: list of English words rated for valence between -5 and +5
- bing: list of positive and negative sentiment
- nrc: list of English words and their associations with 8 emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and 2 sentiments (negative and positive); binary
- loughran: list of sentiment words for accounting and finance by category (Negative, Positive, Uncertainty, Litigious, Strong Modal, Weak Modal, Constraining)
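As a minimal sketch of how a dictionary lookup works, the example below scores two made-up reviews with the bing lexicon. The `toy_reviews` tibble and the scoring scheme (+1 per positive word, -1 per negative word) are assumptions chosen for illustration, not the method used later in this practical.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two invented reviews, used only to illustrate the lookup
toy_reviews <- tibble(
  id = c(1, 2),
  review = c("A wonderful film with a great cast",
             "A boring and terrible plot")
)

# The bing lexicon: one row per word, labeled "positive" or "negative"
bing <- get_sentiments("bing")

toy_scores <- toy_reviews %>%
  unnest_tokens(word, review) %>%      # one lowercase word per row
  inner_join(bing, by = "word") %>%    # keep only words found in the lexicon
  mutate(value = if_else(sentiment == "positive", 1, -1)) %>%
  group_by(id) %>%
  summarise(score = sum(value))        # net sentiment per review

toy_scores
```

Words that are not in the lexicon are dropped by the join, so a review without any lexicon words gets no score at all. This is one reason the coverage of a dictionary matters.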
We are going to use the labMT dictionary (Dodds et al., 2011), one of the best dictionaries for sentiment analysis (see e.g. this paper).