library(ISLR)
library(tidyverse)
Data visualization using ggplot2
Introduction
In this practical, we will learn how to create data visualisations using the grammar of graphics. For the visualisations, we will be using a package that implements the grammar of graphics: ggplot2. This package is part of the tidyverse, so we only need to load tidyverse. We will use datasets from the ISLR package.
An excellent reference manual for ggplot can be found on the tidyverse website: https://ggplot2.tidyverse.org/reference/
What is ggplot?
Plots can be made in R without the use of ggplot using plot(), hist() or barplot() and related functions. Here is an example of each on the Hitters dataset from ISLR:
# Get an idea of what the Hitters dataset looks like
head(Hitters)
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
-Andy Allanson 293 66 1 30 29 14 1 293 66 1
-Alan Ashby 315 81 7 24 38 39 14 3449 835 69
-Alvin Davis 479 130 18 66 72 76 3 1624 457 63
-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
-Andres Galarraga 321 87 10 39 42 30 2 396 101 12
-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
CRuns CRBI CWalks League Division PutOuts Assists Errors
-Andy Allanson 30 29 14 A E 446 33 20
-Alan Ashby 321 414 375 N W 632 43 10
-Alvin Davis 224 266 263 A W 880 82 14
-Andre Dawson 828 838 354 N E 200 11 3
-Andres Galarraga 48 46 33 N E 805 40 4
-Alfredo Griffin 501 336 194 A W 282 421 25
Salary NewLeague
-Andy Allanson NA A
-Alan Ashby 475.0 N
-Alvin Davis 480.0 A
-Andre Dawson 500.0 N
-Andres Galarraga 91.5 N
-Alfredo Griffin 750.0 A
# histogram of the distribution of salary
hist(Hitters$Salary, xlab = "Salary in thousands of dollars")
# barplot of how many members in each league
barplot(table(Hitters$League))
# Number of hits versus number of home runs in the 1986 season
plot(x = Hitters$Hits, y = Hitters$HmRun,
     xlab = "Hits", ylab = "Home runs")
These plots are informative and useful for visually inspecting the dataset, and they each have a specific syntax associated with them. ggplot has a more unified approach to plotting, where you build up a plot layer by layer using the + operator:
homeruns_plot <- ggplot(Hitters, aes(x = Hits, y = HmRun)) +
  geom_point() +
  labs(x = "Hits", y = "Home runs")
homeruns_plot
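For comparison, the base-R histogram and bar plot shown earlier can be written in the same layered style. The following is a minimal sketch, not the only way to do it: the object names salary_hist and league_bar are made up for illustration, the choice of 30 bins is an assumption, and players with a missing Salary are dropped explicitly before plotting.

```r
library(ISLR)
library(ggplot2)

# Histogram of salary; rows with a missing Salary are removed first
# (salary_hist is a hypothetical name, bins = 30 is an arbitrary choice)
salary_hist <- ggplot(subset(Hitters, !is.na(Salary)), aes(x = Salary)) +
  geom_histogram(bins = 30) +
  labs(x = "Salary in thousands of dollars")

# Bar plot of the number of players in each league
# (league_bar is a hypothetical name)
league_bar <- ggplot(Hitters, aes(x = League)) +
  geom_bar() +
  labs(x = "League", y = "Number of players")

salary_hist
league_bar
```

In both cases only the geom changes; the ggplot() call, the aes() mapping and the labs() layer keep the same shape as in the scatter plot.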
As introduced in the lecture, a ggplot object is built up in different layers:
- input the dataset to a ggplot() function call
- construct aesthetic mappings
- add (geometric) components to your plot that use these mappings
- add labels, themes, visuals.
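The four steps above can be sketched one at a time, storing each intermediate result so that it can be inspected on its own. The intermediate names (p_data, p_mapped, p_points, p_final) are made up for this illustration; in practice the steps are usually chained in a single expression.

```r
library(ISLR)
library(ggplot2)

# 1. Input the dataset to a ggplot() call; on its own this draws an empty canvas
p_data <- ggplot(Hitters)

# 2. Construct aesthetic mappings: which columns go on which axes
p_mapped <- ggplot(Hitters, aes(x = Hits, y = HmRun))

# 3. Add a geometric component that uses these mappings
p_points <- p_mapped + geom_point()

# 4. Add labels and a theme
p_final <- p_points +
  labs(x = "Hits", y = "Home runs") +
  theme_minimal()

p_final
```

Printing p_mapped shows axes but no points, because no geometric component has been added yet; the points only appear once geom_point() is added in step 3.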
Because of this layered syntax, it is then easy to add elements like these fancy density lines, a title, and a different theme:
homeruns_plot +
  geom_density_2d() +
  labs(title = "Cool density and scatter plot of baseball data") +
  theme_minimal()