library(tidyverse)
library(GGally)
Data visualization using ggplot2
Introduction
In this practical, you will conduct an exploratory data analysis to investigate the relationship between portion size and nutritional value of the items sold at McDonald’s. To do this, you will need to use ggplot2 and tidyverse, as you did in the previous practical, and we will also use the GGally package.
The dataset used contains information on the McDonald’s menu and can be downloaded here. Download the file and put it in your data folder. The original source of the data is on Kaggle.
Loading the data into R
To load the data from a separate file, use read_csv() from the tidyverse. We assign the data to a variable called menu.
menu <- read_csv("data/menu.csv")
You can also load data directly from a URL instead of from a local file. You can try it with this URL: https://infomdwr.nl/labs/week_5/1_data_visualization_2/data/menu.csv
A first impression of the data
Before you start the analysis, a good first step is to get a sense of what the dataset looks like, especially when you import data from a separate file. You want to make sure that the dataset is imported correctly.
# Get an idea of what the menu dataset looks like
head(menu)
glimpse(menu)
You can see that the variable indicating the serving size is stored in a very inconvenient format. To transform this variable into a numeric variable indicating the serving size in grams (food) or milliliters (drinks), you can run the code below. You can also try to recode it yourself, of course!
# Transformation drinks
drink_fl <- menu |>
  filter(str_detect(`Serving Size`, " fl oz.*")) |>
  mutate(`Serving Size` = str_remove(`Serving Size`, " fl oz.*")) |>
  mutate(`Serving Size` = as.numeric(`Serving Size`) * 29.5735)

drink_carton <- menu |>
  filter(str_detect(`Serving Size`, "carton")) |>
  mutate(`Serving Size` = str_extract(`Serving Size`, "[0-9]{2,3}")) |>
  mutate(`Serving Size` = as.numeric(`Serving Size`))

# Transformation food
food <- menu |>
  filter(str_detect(`Serving Size`, "g")) |>
  mutate(`Serving Size` = str_extract(`Serving Size`, "(?<=\\()[0-9]{2,4}")) |>
  mutate(`Serving Size` = as.numeric(`Serving Size`))

# Add Type variable indicating whether an item is food or a drink
menu_tidy <-
  bind_rows(drink_fl, drink_carton, food) |>
  mutate(
    Type = case_when(
      as.character(Category) == 'Beverages' ~ 'Drinks',
      as.character(Category) == 'Coffee & Tea' ~ 'Drinks',
      as.character(Category) == 'Smoothies & Shakes' ~ 'Drinks',
      TRUE ~ 'Food'
    )
  )
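To see what the three regular expressions in this transformation do, it can help to try them on a few strings first. The sketch below applies the same patterns to three example strings. These strings are assumptions about what the Serving Size formats look like, not rows copied from menu.csv, and the variable names are made up for illustration.

```r
library(stringr)

# Illustrative strings in the three formats the patterns target
# (assumed examples, not rows copied from menu.csv)
serving <- c("4.8 oz (136 g)", "16 fl oz cup", "1 carton (236 ml)")

# Drinks in fluid ounces: drop everything from " fl oz" onward,
# then convert fluid ounces to milliliters
is_fl_oz <- str_detect(serving, " fl oz.*")
fl_oz_ml <- as.numeric(str_remove(serving[is_fl_oz], " fl oz.*")) * 29.5735

# Cartons: take the first run of 2 to 3 digits (the milliliter amount)
is_carton <- str_detect(serving, "carton")
carton_ml <- as.numeric(str_extract(serving[is_carton], "[0-9]{2,3}"))

# Food: take the 2 to 4 digits that directly follow an opening parenthesis
is_food <- str_detect(serving, "g")
food_g <- as.numeric(str_extract(serving[is_food], "(?<=\\()[0-9]{2,4}"))
```

The lookbehind `(?<=\\()` in the food pattern is what skips the ounce value at the start of the string and picks up the gram value inside the parentheses.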
Variation
The first type of question in visual EDA concerns the variation in the data. As you can see, the items are categorized into different product groups. To get an idea of the variation among item categories and nutritional value, several visualisations can be made.
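As one possible starting point, the sketch below draws a bar chart of the number of items per category and a histogram of Calories. It uses a handful of made-up rows (`toy_menu`, a hypothetical stand-in) so that it runs on its own; replacing `toy_menu` with `menu_tidy` would show the real menu. The number of bins is an arbitrary choice.

```r
library(ggplot2)

# Made-up rows for illustration only; use menu_tidy for the real analysis
toy_menu <- data.frame(
  Category = c("Breakfast", "Breakfast", "Beverages", "Desserts", "Desserts"),
  Calories = c(300, 450, 140, 250, 340)
)

# Bar chart: how many items fall into each category
category_bars <- ggplot(toy_menu, aes(x = Category)) +
  geom_bar() +
  coord_flip()  # horizontal bars keep long category names readable

# Histogram: how calories are spread over all items
calorie_hist <- ggplot(toy_menu, aes(x = Calories)) +
  geom_histogram(bins = 10)

# Run category_bars or calorie_hist to draw the plots
```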
Association
The second type of question in visual EDA concerns the association between different variables. The histogram you created in the previous question gives insight into how Calories are distributed over the whole menu. It may also be interesting to see how Calories are distributed at the Category level.
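One way to look at Calories per Category is to split the histogram into one panel per category with facet_wrap(). This is a minimal sketch on made-up rows (`toy_menu` is hypothetical); with `menu_tidy` in its place the same code would apply to the real data.

```r
library(ggplot2)

# Made-up rows for illustration only; use menu_tidy for the real analysis
toy_menu <- data.frame(
  Category = c("Breakfast", "Breakfast", "Breakfast",
               "Beverages", "Beverages", "Desserts"),
  Calories = c(300, 450, 520, 140, 210, 340)
)

# One histogram panel per category, sharing the same x-axis
calories_by_category <- ggplot(toy_menu, aes(x = Calories)) +
  geom_histogram(bins = 10) +
  facet_wrap(~Category)

# Run calories_by_category to draw the plot
```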
An easier way to spot outliers is by creating a boxplot, because the geom is constructed in a way that emphasizes the outliers.
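By default, geom_boxplot() draws observations that lie more than 1.5 times the interquartile range beyond the box as separate points. The sketch below shows this on made-up rows in which one value is deliberately extreme; `toy_menu` and its values are hypothetical, and `menu_tidy` would take its place in the real analysis.

```r
library(ggplot2)

# Made-up rows for illustration only; the value 1500 is a planted outlier
toy_menu <- data.frame(
  Category = c(rep("Breakfast", 6), rep("Desserts", 4)),
  Calories = c(300, 320, 340, 360, 380, 1500, 200, 250, 300, 350)
)

# One box per category; the planted value appears as a separate point
calorie_boxes <- ggplot(toy_menu, aes(x = Category, y = Calories)) +
  geom_boxplot() +
  coord_flip()  # horizontal boxes keep long category names readable

# Run calorie_boxes to draw the plot
```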
Now that you know which category the outlier falls in, you can check which item has such a high energy value.
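A minimal sketch of such a check uses slice_max() from dplyr to pull out the row with the most calories. The rows below are made up, and the column name `Item` is an assumption about how the item names are stored; check the output of glimpse(menu) for the actual column name.

```r
library(dplyr)

# Made-up rows for illustration only; use menu_tidy for the real analysis
toy_menu <- data.frame(
  Category = c("Breakfast", "Breakfast", "Desserts", "Desserts"),
  Item     = c("Item A", "Item B", "Item C", "Item D"),  # assumed column name
  Calories = c(300, 1500, 250, 340)
)

# Keep only the row with the highest number of calories
top_item <- toy_menu |>
  slice_max(Calories, n = 1) |>
  select(Category, Item, Calories)
```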
For the next plot,
Now that you know what causes the outlier, you can check whether there is an association between the serving size and the number of calories.
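A scatter plot is a natural way to look at this association. Because serving size is measured in grams for food and in milliliters for drinks, the sketch below puts the two types in separate panels. It runs on made-up rows (`toy_menu` is hypothetical); with `menu_tidy` in its place the same code would apply to the real data. The transparency setting is an arbitrary choice.

```r
library(ggplot2)

# Made-up rows for illustration only; use menu_tidy for the real analysis
toy_menu <- data.frame(
  Type = c("Food", "Food", "Food", "Drinks", "Drinks", "Drinks"),
  `Serving Size` = c(100, 200, 300, 350, 470, 600),
  Calories = c(250, 480, 700, 100, 150, 210),
  check.names = FALSE  # keep the space in the column name
)

# Scatter plot of serving size against calories, one panel per type
size_vs_calories <- ggplot(toy_menu, aes(x = `Serving Size`, y = Calories)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~Type, scales = "free_x")  # grams and milliliters get their own axis

# Run size_vs_calories to draw the plot
```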
So far, you’ve only looked at the association between two variables at a time. However, when you are exploring associations, you may want to look at the association of many different variables at the same time. An easy way to do this is to make use of the ggpairs() function from the GGally package.
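A minimal sketch of a ggpairs() call is shown below. The `columns` argument selects which variables go into the plot matrix. The rows are made up, and `Protein` is an assumed column name used only for illustration; pick the variables you are interested in from `menu_tidy`.

```r
library(GGally)

# Made-up rows for illustration only; use menu_tidy for the real analysis
toy_menu <- data.frame(
  Calories = c(300, 450, 140, 250, 340, 520),
  `Serving Size` = c(130, 210, 350, 110, 160, 240),
  Protein = c(12, 20, 1, 5, 9, 25),  # assumed column name
  check.names = FALSE  # keep the space in the column name
)

# Plot matrix with one row and one column per selected variable
pair_plot <- ggpairs(
  toy_menu,
  columns = c("Calories", "Serving Size", "Protein")
)

# Run pair_plot to draw the plot matrix
```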