Data visualization using ggplot2

Published

September 11, 2023

Introduction

In this practical, we will learn how to create data visualisations using the grammar of graphics. For the visualisations, we will be using a package that implements the grammar of graphics: ggplot2. This package is part of the tidyverse, so we only need to load tidyverse. We will use datasets from the ISLR package.

library(ISLR)
library(tidyverse)

An excellent reference manual for ggplot2 can be found on the tidyverse website: https://ggplot2.tidyverse.org/reference/

What is ggplot?

Plots can be made in R without ggplot2, using base R functions such as plot(), hist(), or barplot(). Here is an example of each on the Hitters dataset from ISLR:

# Get an idea of what the Hitters dataset looks like
head(Hitters)
                  AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
-Andy Allanson      293   66     1   30  29    14     1    293    66      1
-Alan Ashby         315   81     7   24  38    39    14   3449   835     69
-Alvin Davis        479  130    18   66  72    76     3   1624   457     63
-Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
-Andres Galarraga   321   87    10   39  42    30     2    396   101     12
-Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
                  CRuns CRBI CWalks League Division PutOuts Assists Errors
-Andy Allanson       30   29     14      A        E     446      33     20
-Alan Ashby         321  414    375      N        W     632      43     10
-Alvin Davis        224  266    263      A        W     880      82     14
-Andre Dawson       828  838    354      N        E     200      11      3
-Andres Galarraga    48   46     33      N        E     805      40      4
-Alfredo Griffin    501  336    194      A        W     282     421     25
                  Salary NewLeague
-Andy Allanson        NA         A
-Alan Ashby        475.0         N
-Alvin Davis       480.0         A
-Andre Dawson      500.0         N
-Andres Galarraga   91.5         N
-Alfredo Griffin   750.0         A
# histogram of the distribution of salary
hist(Hitters$Salary, xlab = "Salary in thousands of dollars")

# barplot of the number of players in each league
barplot(table(Hitters$League))

# Number of hits versus number of home runs in the 1986 season
# (Hits and HmRun are season totals; CHits and CHmRun hold the career totals)
plot(x = Hitters$Hits, y = Hitters$HmRun,
     xlab = "Hits", ylab = "Home runs")

These plots are informative and useful for visually inspecting the dataset, but each function has its own specific syntax. ggplot2 takes a more unified approach to plotting, where you build up a plot layer by layer using the + operator:

homeruns_plot <- 
  ggplot(Hitters, aes(x = Hits, y = HmRun)) +
  geom_point() +
  labs(x = "Hits", y = "Home runs")

homeruns_plot
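
The base R histogram and bar plot shown earlier can be written in the same layered style. The following is a minimal sketch rather than part of the original practical: the object names salary_hist and league_bar, the axis labels for the bar plot, and the choice of bins = 10 are assumptions made for illustration.

```r
library(ISLR)
library(ggplot2)

# Histogram of salary. Rows with a missing Salary are dropped first,
# so ggplot2 has no non-finite values to warn about.
salary_hist <-
  ggplot(subset(Hitters, !is.na(Salary)), aes(x = Salary)) +
  geom_histogram(bins = 10) +  # bins = 10 is an arbitrary illustrative choice
  labs(x = "Salary in thousands of dollars")

# Bar plot of the number of players in each league.
# geom_bar() counts the rows per level of League itself,
# so no call to table() is needed.
league_bar <-
  ggplot(Hitters, aes(x = League)) +
  geom_bar() +
  labs(x = "League", y = "Number of players")

salary_hist
league_bar
```

Only the geom changes between the two plots; the ggplot() call, the aes() mapping, and the labs() call keep the same form as in the scatter plot.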

As introduced in the lecture, a ggplot object is built up in different layers:

  1. input the dataset to a ggplot() function call
  2. construct aesthetic mappings
  3. add (geometric) components to your plot that use these mappings
  4. add labels, themes, visuals.
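
The four steps above can be sketched one at a time by storing each intermediate result. This is an illustrative breakdown of the scatter plot, not a recommended way to write plots in practice; the names step_1 to step_4 are hypothetical.

```r
library(ISLR)
library(ggplot2)

# 1. Input the dataset to a ggplot() function call
step_1 <- ggplot(data = Hitters)

# 2. Construct aesthetic mappings (which variable goes on which axis)
step_2 <- step_1 + aes(x = Hits, y = HmRun)

# 3. Add a geometric component that uses these mappings
step_3 <- step_2 + geom_point()

# 4. Add labels and a theme
step_4 <- step_3 +
  labs(x = "Hits", y = "Home runs") +
  theme_minimal()

step_4
```

Printing step_1 or step_2 gives an empty canvas, because the data and mappings are known but no geom has been added yet; the points only appear from step_3 onwards.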

Because of this layered syntax, it is then easy to add elements like these fancy density lines, a title, and a different theme:

homeruns_plot + 
  geom_density_2d() +
  labs(title = "Cool density and scatter plot of baseball data") +
  theme_minimal()
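
Note that the + operator returns a new plot and leaves homeruns_plot itself unchanged, so the same base plot can be reused for several variations. A small sketch of this, assuming the result is stored under the hypothetical name density_plot:

```r
library(ISLR)
library(ggplot2)

# The scatter plot from above
homeruns_plot <-
  ggplot(Hitters, aes(x = Hits, y = HmRun)) +
  geom_point() +
  labs(x = "Hits", y = "Home runs")

# Store the extended plot under a new (hypothetical) name
density_plot <-
  homeruns_plot +
  geom_density_2d() +
  labs(title = "Cool density and scatter plot of baseball data") +
  theme_minimal()

# The original object still has only its point layer;
# the new object has the point layer and the density layer
length(homeruns_plot$layers)
length(density_plot$layers)
```

Printing homeruns_plot after this still shows the plain scatter plot, while density_plot shows the version with density lines, title, and minimal theme.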