library(ISLR)
library(tidyverse)
Data visualization using ggplot2
Introduction
In this practical, we will learn how to create data visualisations using the grammar of graphics. For the visualisations, we will be using a package that implements the grammar of graphics: ggplot2. This package is part of the tidyverse, so we only need to load tidyverse. We will use datasets from the ISLR package.
An excellent reference manual for ggplot can be found on the tidyverse website: https://ggplot2.tidyverse.org/reference/
What is ggplot?
Plots can be made in R without the use of ggplot using plot(), hist() or barplot() and related functions. Here is an example of each on the Hitters dataset from ISLR:
# Get an idea of what the Hitters dataset looks like
head(Hitters)
                  AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
-Andy Allanson      293   66     1   30  29    14     1    293    66      1
-Alan Ashby         315   81     7   24  38    39    14   3449   835     69
-Alvin Davis        479  130    18   66  72    76     3   1624   457     63
-Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
-Andres Galarraga   321   87    10   39  42    30     2    396   101     12
-Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
                  CRuns CRBI CWalks League Division PutOuts Assists Errors
-Andy Allanson       30   29     14      A        E     446      33     20
-Alan Ashby         321  414    375      N        W     632      43     10
-Alvin Davis        224  266    263      A        W     880      82     14
-Andre Dawson       828  838    354      N        E     200      11      3
-Andres Galarraga    48   46     33      N        E     805      40      4
-Alfredo Griffin    501  336    194      A        W     282     421     25
                  Salary NewLeague
-Andy Allanson        NA         A
-Alan Ashby        475.0         N
-Alvin Davis       480.0         A
-Andre Dawson      500.0         N
-Andres Galarraga   91.5         N
-Alfredo Griffin   750.0         A
# histogram of the distribution of salary
hist(Hitters$Salary, xlab = "Salary in thousands of dollars")
# barplot of how many members in each league
barplot(table(Hitters$League))
# Number of hits versus number of home runs in the 1986 season
# (the career totals are in the CHits and CHmRun columns)
plot(x = Hitters$Hits, y = Hitters$HmRun,
     xlab = "Hits", ylab = "Home runs")
These plots are informative and useful for visually inspecting the dataset, and they each have a specific syntax associated with them. ggplot has a more unified approach to plotting, where you build up a plot layer by layer using the + operator:
homeruns_plot <- ggplot(Hitters, aes(x = Hits, y = HmRun)) +
  geom_point() +
  labs(x = "Hits", y = "Home runs")
homeruns_plot
As introduced in the lecture, a ggplot object is built up in different layers:
- input the dataset to a ggplot() function call
- construct aesthetic mappings
- add (geometric) components to your plot that use these mappings
- add labels, themes, visuals.
Because of this layered syntax, it is then easy to add elements like these fancy density lines, a title, and a different theme:
homeruns_plot +
  geom_density_2d() +
  labs(title = "Cool density and scatter plot of baseball data") +
  theme_minimal()
In conclusion, ggplot objects are easy to manipulate and they force a principled approach to data visualisation. In this practical, we will learn how to construct them.
Aesthetics and data preparation
The first step in constructing a ggplot is the preparation of your data and the mapping of variables to aesthetics. In the homeruns_plot, we used an existing data frame, the Hitters dataset.
The data frame needs to have proper column names, and the types of the variables in the data frame need to be correctly specified. Numbers should be numerics, categories should be factors, and names or identifiers should be character variables. ggplot() always expects a data frame, which may feel awfully strict, but it allows for excellent flexibility in the remaining plotting steps.
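Before plotting, it can help to check that the column types are what you expect. The following is a minimal sketch using the Hitters data; the object name hitters_named and the column name player are made up for this illustration and do not appear elsewhere in the practical.

```r
library(ISLR)

# Check the class of a few columns: counts and salaries should be
# numeric (or integer), categories should be factors
sapply(Hitters[, c("Hits", "Salary", "League", "Division")], class)

# The player names are stored as row names. To use them in a plot they
# must be a proper column. `player` is a hypothetical column name.
hitters_named <- Hitters
hitters_named$player <- rownames(Hitters)
class(hitters_named$player)
```

Storing identifiers as a character column (rather than a factor) avoids ggplot treating each name as a category level unless you explicitly ask for it.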
Mapping aesthetics is usually done in the main ggplot() call. Aesthetic mappings are the second argument to the function, after the data.
Examples of aesthetics are:
- x
- y
- alpha (transparency)
- colour
- fill
- group
- shape
- size
- stroke
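As an illustration of mapping more than x and y, the sketch below maps the League factor to the colour aesthetic. The choice of League and the object name league_plot are assumptions made for this example.

```r
library(ISLR)
library(ggplot2)

# x and y are mapped to numeric variables, colour is mapped to a factor
league_plot <- ggplot(Hitters, aes(x = Hits, y = HmRun, colour = League)) +
  geom_point(alpha = 0.6) +  # alpha is set to a constant, not mapped to data
  labs(x = "Hits", y = "Home runs", colour = "League")

league_plot
```

Note the difference between mapping an aesthetic to a variable inside aes() and setting it to a fixed value outside aes(), as done here for alpha.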
Geoms
Up until now we have used two geoms: contour lines and points. The geoms in ggplot2 are added via the geom_<geomtype>() functions. Each geom requires certain aesthetic mappings to work. For example, geom_point() needs at least an x and a y position mapping, as described on its help page. You can check the required aesthetic mappings for each geom via ?geom_<geomtype>.
There are two types of geoms:
- geoms which perform a transformation of the data beforehand, such as geom_density_2d() which calculates contour lines from x and y positions.
- geoms which do not transform data beforehand, but use the aesthetic mapping directly, such as geom_point().
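One way to see the difference between the two types is to inspect the data each layer actually draws, using ggplot2's layer_data() helper. This is only a sketch; the object names base_plot, points_data and contour_data are made up for the example.

```r
library(ISLR)
library(ggplot2)

base_plot <- ggplot(Hitters, aes(x = Hits, y = HmRun))

# geom_point() uses the mapped values directly: one drawn point per row
points_data <- layer_data(base_plot + geom_point())
nrow(points_data) == nrow(Hitters)

# geom_density_2d() first computes contour lines from x and y, so the
# rows it draws are contour coordinates, not rows of Hitters
contour_data <- layer_data(base_plot + geom_density_2d())
head(contour_data)
```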
Histogram
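The ggplot2 counterpart of the hist() call shown earlier is geom_histogram(). The sketch below is one possible version for the Salary variable; the bin width of 100 and the object names hitters_paid and salary_hist are assumptions chosen for this example.

```r
library(ISLR)
library(ggplot2)

# Salary contains missing values; drop those rows to avoid a warning
hitters_paid <- Hitters[!is.na(Hitters$Salary), ]

salary_hist <- ggplot(hitters_paid, aes(x = Salary)) +
  geom_histogram(binwidth = 100, colour = "white") +  # assumed bin width
  labs(x = "Salary in thousands of dollars", y = "Count")

salary_hist
```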
Density
The continuous equivalent of the histogram is the density estimate.
The downside of looking only at a density or histogram is that it abstracts away from the raw data, which can mislead interpretation. For example, a grade between 8.5 and 9 might in fact be impossible, yet the density estimate would not show this. To counter this, we can add a raw data display in the form of rug marks.
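A density estimate with rug marks can be sketched as follows. Because the grade data mentioned above is not included here, this example uses the Salary variable from Hitters instead; that substitution and the object names are assumptions.

```r
library(ISLR)
library(ggplot2)

# Drop rows with missing Salary so both layers use the same observations
hitters_paid <- Hitters[!is.na(Hitters$Salary), ]

salary_density <- ggplot(hitters_paid, aes(x = Salary)) +
  geom_density(fill = "grey70", alpha = 0.4) +  # smoothed distribution
  geom_rug(alpha = 0.5) +                       # one tick per raw observation
  labs(x = "Salary in thousands of dollars", y = "Density")

salary_density
```

The rug marks show where observations actually fall, so gaps in the raw data remain visible even where the density curve is smooth.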
Boxplot
A common task is to compare distributions across groups. A classic example of a visualisation that performs this is the boxplot, accessible via geom_boxplot(). It allows for visual comparison of the distribution of two or more groups through their summary statistics.
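As a minimal sketch, the salary distributions of the two leagues can be compared like this. The choice of Salary and League, and the object names, are assumptions made for this example.

```r
library(ISLR)
library(ggplot2)

# Drop rows with missing Salary before plotting
hitters_paid <- Hitters[!is.na(Hitters$Salary), ]

# The grouping factor goes on x, the numeric variable on y
salary_box <- ggplot(hitters_paid, aes(x = League, y = Salary)) +
  geom_boxplot() +
  labs(x = "League", y = "Salary in thousands of dollars")

salary_box
```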
Bar plot
We can display amounts or proportions as a bar plot to compare group sizes of a factor. geom_bar() automatically transforms variables to counts (see ?stat_count), similar to how the function table() works:
table(Hitters$Years)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 23 24
22 25 29 36 30 30 21 16 15 14 10 14 12 13  9  7  7  7  1  2  1  1
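The same counts can be drawn with geom_bar(), which only needs an x mapping. The sketch below is one way to do this; the axis labels and the object name years_bar are assumptions.

```r
library(ISLR)
library(ggplot2)

# geom_bar() counts the rows for each value of Years, like table() does
years_bar <- ggplot(Hitters, aes(x = Years)) +
  geom_bar() +
  labs(x = "Years in the major leagues", y = "Number of players")

years_bar
```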
Faceting
Faceting can help interpretation. In this case, we can see that high-salary earners are far away from the point (0, 0) on average, but that there are low-salary earners which are even further away. Faceting should preferably be done using a factor variable. The order of the facets is taken from the levels() of the factor. Changing the order of the facets can be done using fct_relevel() if needed.
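A faceted plot of this kind could be sketched as follows. The three-way split of Salary, the labels "low", "mid" and "high", and the names salary_group and facet_plot are assumptions made for this illustration and may differ from the grouping used for the plot described above.

```r
library(ISLR)
library(ggplot2)
library(forcats)

# Drop rows with missing Salary before grouping
hitters_paid <- Hitters[!is.na(Hitters$Salary), ]

# Hypothetical grouping: cut Salary at its tertiles to get a factor
hitters_paid$salary_group <- cut(
  hitters_paid$Salary,
  breaks = quantile(hitters_paid$Salary, probs = c(0, 1/3, 2/3, 1)),
  labels = c("low", "mid", "high"),
  include.lowest = TRUE
)

# Move "high" to the front so its facet is drawn first
hitters_paid$salary_group <- fct_relevel(hitters_paid$salary_group, "high")

facet_plot <- ggplot(hitters_paid, aes(x = Hits, y = HmRun)) +
  geom_point() +
  facet_wrap(vars(salary_group)) +
  labs(x = "Hits", y = "Home runs")

facet_plot
```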
OPTIONAL: Final exercise
If you have time left, do this final exercise!