Data visualization using ggplot2

Published

September 11, 2023

Introduction

In this practical, we will learn how to create data visualisations using the grammar of graphics. For the visualisations, we will be using a package that implements the grammar of graphics: ggplot2. This package is part of the tidyverse, so we only need to load tidyverse. We will use datasets from the ISLR package.

library(ISLR)
library(tidyverse)

An excellent reference manual for ggplot can be found on the tidyverse website: https://ggplot2.tidyverse.org/reference/

What is ggplot?

Plots can be made in R without the use of ggplot using plot(), hist() or barplot() and related functions. Here is an example of each on the Hitters dataset from ISLR:

# Get an idea of what the Hitters dataset looks like
head(Hitters)
                  AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
-Andy Allanson      293   66     1   30  29    14     1    293    66      1
-Alan Ashby         315   81     7   24  38    39    14   3449   835     69
-Alvin Davis        479  130    18   66  72    76     3   1624   457     63
-Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
-Andres Galarraga   321   87    10   39  42    30     2    396   101     12
-Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
                  CRuns CRBI CWalks League Division PutOuts Assists Errors
-Andy Allanson       30   29     14      A        E     446      33     20
-Alan Ashby         321  414    375      N        W     632      43     10
-Alvin Davis        224  266    263      A        W     880      82     14
-Andre Dawson       828  838    354      N        E     200      11      3
-Andres Galarraga    48   46     33      N        E     805      40      4
-Alfredo Griffin    501  336    194      A        W     282     421     25
                  Salary NewLeague
-Andy Allanson        NA         A
-Alan Ashby        475.0         N
-Alvin Davis       480.0         A
-Andre Dawson      500.0         N
-Andres Galarraga   91.5         N
-Alfredo Griffin   750.0         A
# histogram of the distribution of salary
hist(Hitters$Salary, xlab = "Salary in thousands of dollars")

# barplot of how many members in each league
barplot(table(Hitters$League))

# Number of hits versus number of home runs (1986 season)
plot(x = Hitters$Hits, y = Hitters$HmRun, 
     xlab = "Hits", ylab = "Home runs")

These plots are informative and useful for visually inspecting the dataset, and they each have a specific syntax associated with them. ggplot has a more unified approach to plotting, where you build up a plot layer by layer using the + operator:

homeruns_plot <- 
  ggplot(Hitters, aes(x = Hits, y = HmRun)) +
  geom_point() +
  labs(x = "Hits", y = "Home runs")

homeruns_plot

As introduced in the lecture, a ggplot object is built up in different layers:

  1. input the dataset to a ggplot() function call
  2. construct aesthetic mappings
  3. add (geometric) components to your plot that use these mappings
  4. add labels, themes, visuals.

Because of this layered syntax, it is then easy to add elements like these fancy density lines, a title, and a different theme:

homeruns_plot + 
  geom_density_2d() +
  labs(title = "Cool density and scatter plot of baseball data") +
  theme_minimal()
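
One way to see this layering at work is to inspect the plot object itself. The sketch below is illustrative and not part of the practical: the object names are made up, and it relies on the layers element of a ggplot object, which is an implementation detail that may change between ggplot2 versions.

```r
library(ISLR)
library(ggplot2)

# Start from data and aesthetic mappings only: no geoms have been added yet
base_plot <- ggplot(Hitters, aes(x = Hits, y = HmRun))

# Each `+ geom_*()` call appends one layer to the plot object
with_points   <- base_plot + geom_point()
with_contours <- with_points + geom_density_2d()

# Count the layers stored in each object
layer_counts <- c(
  base     = length(base_plot$layers),
  points   = length(with_points$layers),
  contours = length(with_contours$layers)
)
layer_counts
```

The counts should increase by one with every geom that is added, while labels and themes do not add layers.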

In conclusion, ggplot objects are easy to manipulate and they force a principled approach to data visualisation. In this practical, we will learn how to construct them.

1. Name the aesthetics, geoms, scales, and facets of the above visualisation. Also name any statistical transformations or special coordinate systems.
# Aesthetics
#   number of hits mapped to x-position
#   number of home runs mapped to y-position
# Geoms: points and contour lines
# Scales:
#   x-axis: continuous
#   y-axis: continuous
# Facets: None
# Statistical transformations: 2D density estimation (for the contour lines)
# Special coordinate system: None (just cartesian)

Aesthetics and data preparation

The first step in constructing a ggplot is the preparation of your data and the mapping of variables to aesthetics. In the homeruns_plot, we used an existing data frame, the Hitters dataset.

The data frame needs to have proper column names and the types of the variables in the data frame need to be correctly specified. Numbers should be numerics, categories should be factors, and names or identifiers should be character variables. ggplot() always expects a data frame, which may feel awfully strict, but it allows for excellent flexibility in the remaining plotting steps.

2. Run the code below to generate data.
set.seed(1234)
student_grade  <- rnorm(32, 7)
student_number <- round(runif(32) * 2e6 + 5e6)
programme      <- sample(c("Science", "Social Science"), 32, replace = TRUE)

There will be three vectors in your environment. Put them in a data frame for entering it in a ggplot() call using either the data.frame() or the tibble() function. Give informative names and make sure the types are correct (use the as.<type>() functions). Name the result gg_students.

gg_students <- tibble(
  number = as.character(student_number), # an identifier
  grade  = student_grade,                # already the correct type.
  prog   = as.factor(programme)          # categories should be factors.
)

head(gg_students)
# A tibble: 6 × 3
  number  grade prog          
  <chr>   <dbl> <fct>         
1 5478051  5.79 Social Science
2 6412989  7.28 Science       
3 5616190  8.08 Social Science
4 6017095  4.65 Social Science
5 5103293  7.43 Social Science
6 6129140  7.51 Science       
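# note that in R versions before 4.0.0, data.frame() converts character
# columns to factors by default, so you need to set stringsAsFactors = FALSE
# to keep the student number a character. Since R 4.0.0 this is the default,
# and tibble() never converts characters to factors.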

(we will use this dataset later in the practical)
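
For comparison, a minimal sketch of the same data frame built with data.frame() instead of tibble(). The name gg_students_df is made up for this illustration; stringsAsFactors = FALSE is spelled out so that the result does not depend on the R version.

```r
# Regenerate the three vectors from the exercise
set.seed(1234)
student_grade  <- rnorm(32, 7)
student_number <- round(runif(32) * 2e6 + 5e6)
programme      <- sample(c("Science", "Social Science"), 32, replace = TRUE)

# Base R equivalent of the tibble() call (hypothetical object name)
gg_students_df <- data.frame(
  number = as.character(student_number), # identifier, kept as character
  grade  = student_grade,                # numeric
  prog   = as.factor(programme),         # categories as a factor
  stringsAsFactors = FALSE               # explicit for R versions before 4.0.0
)

# Check the type of every column
sapply(gg_students_df, class)
```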

Mapping aesthetics is usually done in the main ggplot() call. Aesthetic mappings are the second argument to the function, after the data.

3. Plot the first homeruns_plot again, but map the Hits to the y-axis and the HmRun to the x-axis instead.
ggplot(Hitters, aes(x = HmRun, y = Hits)) +
  geom_point() +
  labs(y = "Hits", x = "Home runs")

4. Recreate the same plot once more, but now also map the variable League to the colour aesthetic and the variable Salary to the size aesthetic.
ggplot(Hitters, aes(x = HmRun, y = Hits, colour = League, size = Salary)) +
  geom_point() +
  labs(y = "Hits", x = "Home runs")
Warning: Removed 59 rows containing missing values (`geom_point()`).

Examples of aesthetics are:

  • x
  • y
  • alpha (transparency)
  • colour
  • fill
  • group
  • shape
  • size
  • stroke
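
Aesthetics can either be mapped to a variable inside aes() or set to a fixed value outside it. The sketch below shows the difference, assuming the Hitters data from ISLR; the object name league_plot is made up, and the mapping and aes_params elements it inspects are implementation details of ggplot2 objects.

```r
library(ISLR)
library(ggplot2)

league_plot <-
  ggplot(Hitters, aes(x = HmRun, y = Hits, colour = League)) + # mapped to variables
  geom_point(alpha = 0.5)                                      # set to a constant

# Aesthetics that vary with the data
names(league_plot$mapping)

# Aesthetics fixed for the whole layer
league_plot$layers[[1]]$aes_params
```

Mapped aesthetics get a scale and a legend; fixed values do not.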

Geoms

Up until now we have used two geoms: contour lines and points. The geoms in ggplot2 are added via the geom_<geomtype>() functions. Each geom requires certain aesthetic mappings to work. For example, geom_point() needs at least an x and a y position mapping, as listed in its help file (?geom_point). You can check the required aesthetic mappings for each geom via ?geom_<geomtype>.
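
Besides the help files, the required aesthetics can also be looked up from within R. This is a small sketch that relies on the required_aes field of ggplot2's internal geom objects; the variable names are made up, and the help file remains the authoritative source.

```r
library(ggplot2)

# Required aesthetics of the point geom
point_required <- GeomPoint$required_aes
point_required

# The same lookup, starting from a layer created by a geom_*() function
contour_required <- geom_density_2d()$geom$required_aes
contour_required
```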

5. Look at the many different geoms on the reference website.

There are two types of geoms:

  • geoms which perform a transformation of the data beforehand, such as geom_density_2d() which calculates contour lines from x and y positions.
  • geoms which do not transform data beforehand, but use the aesthetic mapping directly, such as geom_point().
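
The difference between the two types can be made visible with layer_data(), which returns the data a layer actually draws. The sketch below uses simulated data and made-up object names (sim, point_data, contour_data), so it is only an illustration.

```r
library(ggplot2)

# Simulated data, only for illustration
set.seed(1)
sim <- data.frame(x = rnorm(100), y = rnorm(100))

point_plot   <- ggplot(sim, aes(x = x, y = y)) + geom_point()
contour_plot <- ggplot(sim, aes(x = x, y = y)) + geom_density_2d()

# geom_point() draws the data as is: one row per observation
point_data <- layer_data(point_plot)
nrow(point_data)

# geom_density_2d() draws computed contour paths, not the raw observations
contour_data <- layer_data(contour_plot)
head(contour_data)
```

The contour data contain a level column with the estimated density of each contour line, a variable that does not exist in the original data frame.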

Histogram

6. Use geom_histogram() to create a histogram of the grades of the students in the gg_students dataset.

Play around with the binwidth argument of the geom_histogram() function.

gg_students |>
  ggplot(aes(x = grade)) +
  geom_histogram(binwidth = .5)
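
To get a feeling for what binwidth does without comparing plots by eye, you can count the bins that geom_histogram() computes. This is a sketch under the assumption that the grades are generated as in exercise 2; count_bins() is a hypothetical helper written for this illustration.

```r
library(ggplot2)

# Same draw as student_grade in exercise 2
set.seed(1234)
grades <- data.frame(grade = rnorm(32, 7))

# Hypothetical helper: number of bins computed for a given binwidth
count_bins <- function(binwidth) {
  p <- ggplot(grades, aes(x = grade)) +
    geom_histogram(binwidth = binwidth)
  nrow(layer_data(p)) # one row per bin
}

bins_by_width <- sapply(c(0.25, 0.5, 1), count_bins)
names(bins_by_width) <- c("0.25", "0.5", "1")
bins_by_width
```

A smaller binwidth gives more, narrower bins and a noisier histogram; a larger binwidth gives fewer bins and a smoother but coarser picture.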

Density

The continuous equivalent of the histogram is the density estimate.

7. Use geom_density() to create a density plot of the grades of the students in the gg_students dataset. Add the argument fill = "light seagreen" to geom_density().
gg_students |> 
  ggplot(aes(x = grade)) +
  geom_density(fill = "light seagreen")

The downside of only looking at the density or histogram is that it is an abstraction from the raw data, thus it might alter interpretations. For example, it could be that a grade between 8.5 and 9 is in fact impossible. We do not see this in the density estimate. To counter this, we can add a raw data display in the form of rug marks.
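
The following sketch illustrates this point with base R's density() function, used here as a stand-in for the estimate that geom_density() draws. It looks up the widest gap between two consecutive observed grades and evaluates the estimated density in the middle of that gap. The variable names are made up for this illustration.

```r
# Same draw as student_grade in exercise 2
set.seed(1234)
student_grade <- rnorm(32, 7)

# Kernel density estimate with the default Gaussian kernel
grade_density <- density(student_grade)

# Find the widest gap between two consecutive observed grades
sorted_grades <- sort(student_grade)
widest        <- which.max(diff(sorted_grades))
gap_mid       <- (sorted_grades[widest] + sorted_grades[widest + 1]) / 2

# Interpolate the estimated density at the middle of that gap
density_at_gap <- approx(grade_density$x, grade_density$y, xout = gap_mid)$y

c(gap_midpoint = gap_mid, estimated_density = density_at_gap)
```

The estimate is positive in the gap even though no grade was observed there, which is exactly the information that a raw data display adds back.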

8. Add rug marks to the density plot through geom_rug(). You can edit the colour and line width of the rug marks using the colour and linewidth arguments of geom_rug(). (Before ggplot2 3.4.0, the line width was set with size, which is now deprecated for lines.)
gg_students |> 
  ggplot(aes(x = grade)) +
  geom_density(fill = "light seagreen") +
  geom_rug(linewidth = 1, colour = "light seagreen")

9. Increase the data-to-ink ratio by removing the y-axis label, setting the theme to theme_minimal(), and removing the border of the density polygon.

Also set the limits of the x-axis to go from 0 to 10 using the xlim() function, because those are the plausible values for a student grade.

gg_students |> 
  ggplot(aes(x = grade)) +
  geom_density(fill = "light seagreen", colour = NA) +
  geom_rug(linewidth = 1, colour = "light seagreen") +
  theme_minimal() +
  labs(y = "") +
  xlim(0, 10)

Boxplot

A common task is to compare distributions across groups. A classic example of a visualisation that performs this is the boxplot, accessible via geom_boxplot(). It allows for visual comparison of the distribution of two or more groups through their summary statistics.

10. Create a boxplot of student grades per programme in the gg_students dataset you made earlier: map the programme variable to the x position and the grade to the y position.

For extra visual aid, you can additionally map the programme variable to the fill aesthetic.

gg_students |> 
  ggplot(aes(x = prog, y = grade, fill = prog)) +
  geom_boxplot() +
  theme_minimal()

11. What does each of the horizontal lines in the boxplot mean? What do the vertical lines (whiskers) mean?
# From the help file of geom_boxplot:

# The middle line indicates the median, the outer horizontal
# lines are the 25th and 75th percentile.

# The upper whisker extends from the hinge to the largest value 
# no further than 1.5 * IQR from the hinge (where IQR is the 
# inter-quartile range, or distance between the first and third 
# quartiles). The lower whisker extends from the hinge to the 
# smallest value at most 1.5 * IQR of the hinge. Data beyond 
# the end of the whiskers are called "outlying" points and are 
# plotted individually.
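
The definitions above can be reproduced by hand. The sketch below does so in base R for all grades pooled together (the exercise draws one box per programme), with variable names made up for this illustration.

```r
# Same draw as student_grade in exercise 2
set.seed(1234)
student_grade <- rnorm(32, 7)

# Hinges and median: the three horizontal lines of the box
hinges <- quantile(student_grade, probs = c(0.25, 0.5, 0.75))
iqr    <- hinges[["75%"]] - hinges[["25%"]]

# Whiskers may reach at most 1.5 * IQR beyond the hinges
upper_fence <- hinges[["75%"]] + 1.5 * iqr
lower_fence <- hinges[["25%"]] - 1.5 * iqr

# They end at the most extreme observations inside those fences
upper_whisker <- max(student_grade[student_grade <= upper_fence])
lower_whisker <- min(student_grade[student_grade >= lower_fence])

# Anything beyond the fences is plotted as an individual point
outlying <- student_grade[student_grade > upper_fence |
                          student_grade < lower_fence]

c(lower_whisker = lower_whisker, hinges, upper_whisker = upper_whisker)
outlying
```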

Bar plot

We can display amounts or proportions as a bar plot to compare group sizes of a factor.

12. Create a bar plot of the variable Years from the Hitters dataset.
Hitters |> 
  ggplot(aes(x = Years)) + 
  geom_bar() +
  theme_minimal()

geom_bar() automatically transforms variables to counts (see ?stat_count), similar to how the function table() works:

table(Hitters$Years)

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 23 24 
22 25 29 36 30 30 21 16 15 14 10 14 12 13  9  7  7  7  1  2  1  1 
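
The correspondence can be checked directly by comparing the counts that geom_bar() computes with the output of table(). This is a sketch with made-up object names; layer_data() returns the data that the layer draws.

```r
library(ISLR)
library(ggplot2)

years_bar <- ggplot(Hitters, aes(x = Years)) + geom_bar()

# Bar heights computed by stat_count
bar_counts <- layer_data(years_bar)$count

# Frequencies computed by table()
table_counts <- as.vector(table(Hitters$Years))

all.equal(as.numeric(bar_counts), as.numeric(table_counts))
```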

Faceting

13. Create a data frame called baseball based on the Hitters dataset.

In this data frame, create a factor variable which splits players’ salary range into 3 categories. Tip: use the filter() function to remove the missing values, and then use the cut() function and assign nice labels to the categories. In addition, create a variable which indicates the proportion of career hits that was a home run.

baseball <-
  Hitters |> 
  filter(!is.na(Salary)) |> 
  mutate(
    Salary_range = cut(Salary, breaks = 3, 
                       labels = c("Low salary", "Mid salary", "High salary")),
    Career_hmrun_proportion = CHmRun/CHits
  )
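
When breaks is a single number, cut() splits the range of the variable into that many intervals of equal width, so the resulting groups need not be of equal size. A short sketch to inspect this, with variable names made up for this illustration:

```r
library(ISLR)

# Salaries without missing values
salary <- Hitters$Salary[!is.na(Hitters$Salary)]

# Without labels, cut() shows the interval boundaries it chose
levels(cut(salary, breaks = 3))

# With labels, as in the answer above
salary_range <- cut(salary, breaks = 3,
                    labels = c("Low salary", "Mid salary", "High salary"))

# Equal-width intervals do not imply equal group sizes
table(salary_range)
```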

14. Create a scatter plot where you map CWalks to the x position and the proportion you calculated in the previous exercise to the y position.

Fix the y axis limits to (0, 0.4) and the x axis to (0, 1600) using ylim() and xlim(). Add nice x and y axis titles using the labs() function. Save the plot as the variable baseball_plot.

baseball_plot <-   
  baseball |> 
  ggplot(aes(x = CWalks, y = Career_hmrun_proportion)) +
  geom_point() +
  ylim(0, 0.4) +
  xlim(0, 1600) + 
  theme_minimal() +
  labs(y = "Proportion of home runs",
       x = "Career number of walks")

baseball_plot

15. Split up this plot into three parts based on the salary range variable you calculated. Use the facet_wrap() function for this; look at the examples in the help file for tips.
baseball_plot + facet_wrap(~Salary_range)

Faceting can help interpretation. In this case, we can see that high-salary earners are far away from the point (0, 0) on average, but that there are low-salary earners which are even further away. Faceting should preferably be done using a factor variable. The order of the facets is taken from the levels() of the factor. Changing the order of the facets can be done using fct_relevel() if needed.
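
A minimal sketch of such a reordering, using a small made-up factor rather than the baseball data:

```r
library(forcats)

# Toy factor with the same labels as Salary_range (values are made up)
salary_range <- factor(
  c("Low salary", "High salary", "Mid salary", "Low salary"),
  levels = c("Low salary", "Mid salary", "High salary")
)
levels(salary_range)

# Move the levels into a new order; facets would follow this order
reordered <- fct_relevel(salary_range,
                         "High salary", "Mid salary", "Low salary")
levels(reordered)
```

Only the order of the levels changes; the values themselves stay the same.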

OPTIONAL: Final exercise

If you have time left, do this final exercise!

16. Create an interesting data visualisation based on the Carseats data from the ISLR package.
# an example answer could be:
Carseats |> 
  mutate(Competition = Price/CompPrice,
         ShelveLoc   = fct_relevel(ShelveLoc, "Bad", "Medium", "Good")) |> 
  ggplot(aes(x = Competition, y = Sales, colour = Age)) +
  geom_point() +
  theme_minimal() +
  scale_colour_viridis_c() + # add a custom colour scale
  facet_wrap(vars(ShelveLoc))