Data visualization using ggplot2

Published

September 30, 2024

Introduction

In this practical, you will conduct an exploratory data analysis to investigate the relationship between portion size and nutritional value of the items sold at McDonald’s. To do this, you will use ggplot2 and the rest of the tidyverse, as you did in the previous practical, together with the GGally package.

library(tidyverse)
library(GGally)

The dataset used contains information on the McDonald’s menu and can be downloaded here. Download the file and put it in your data folder. The original source of the data is on Kaggle.

Loading the data into R

To load the data from a separate file, use read_csv() from the tidyverse. We assign the data to a variable called menu.

menu <- read_csv("data/menu.csv")

You can also load data directly from a URL instead of from a local file! You can try it with this URL: https://infomdwr.nl/labs/week_5/1_data_visualization_2/data/menu.csv
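
read_csv() takes the file location as its first argument, and that location can be a local path or a URL. The sketch below is a minimal offline illustration: it writes two made-up rows (the items and values are hypothetical, not taken from the real menu) to a temporary file and reads them back. Passing the URL above as a string in place of demo_path should work the same way, provided you have an internet connection.

```r
library(tidyverse)

# Hypothetical two-row stand-in for menu.csv, written to a temporary file
demo_path <- tempfile(fileext = ".csv")
write_lines(
  c(
    "Category,Item,Serving Size,Calories",
    "Breakfast,Demo Muffin,4.8 oz (136 g),300",
    "Beverages,Demo Cola,16 fl oz cup,140"
  ),
  demo_path
)

# read_csv() receives a local path here; a URL string is passed in the same way
menu_demo <- read_csv(demo_path, show_col_types = FALSE)
menu_demo
```

Note that column names containing a space, such as Serving Size, are kept as they are, which is why they need backticks in later code.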

A first impression of the data

Before you start the analysis, a good first step is to get a sense of what the dataset looks like, especially when you import data from a separate file. You want to make sure that the dataset is imported correctly.

# Get an idea of what the menu dataset looks like
head(menu)
glimpse(menu)

You can see that the variable indicating the serving size is stored as text, in a form that is very inconvenient for analysis. To transform it into a numeric variable indicating the serving size in grams (food) or milliliters (drinks), you can run the code below. You can also try to recode it yourself, of course!

# Transformation drinks
# Drinks given in fluid ounces: keep the leading number and convert fl oz to ml
drink_fl <- menu |>
  filter(str_detect(`Serving Size`, " fl oz.*")) |>
  mutate(`Serving Size` = str_remove(`Serving Size`, " fl oz.*")) |>
  mutate(`Serving Size` = as.numeric(`Serving Size`) * 29.5735)

# Drinks sold per carton: extract the first 2-3 digit number, which is the volume in ml
drink_carton <- menu |>
  filter(str_detect(`Serving Size`, "carton")) |>
  mutate(`Serving Size` = str_extract(`Serving Size`, "[0-9]{2,3}")) |>
  mutate(`Serving Size` = as.numeric(`Serving Size`))

# Transformation food
# Food items: extract the digits that directly follow "(", which is the weight in grams
food <- menu |>
  filter(str_detect(`Serving Size`, "g")) |>
  mutate(`Serving Size` = str_extract(`Serving Size`, "(?<=\\()[0-9]{2,4}")) |>
  mutate(`Serving Size` = as.numeric(`Serving Size`))

# Add Type variable indicating whether an item is food or a drink
menu_tidy <-
  bind_rows(drink_fl, drink_carton, food) |>
  mutate(
    Type = case_when(
      Category %in% c("Beverages", "Coffee & Tea", "Smoothies & Shakes") ~ "Drinks",
      TRUE ~ "Food"
    )
  )

1. After running the code, check the structure of the data once again. What type of variable is `Serving Size` now, and what was it before you ran the code?
# `Serving Size` is now a numeric variable; before running the code it was a character variable.
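
To see what the regular expressions in the transformation code do, it can help to try them on a few strings first. The sketch below uses three hypothetical serving-size strings, written in the three formats the code above assumes; the exact values are made up for illustration.

```r
library(tidyverse)

# Hypothetical serving sizes, one per format handled by the transformation code
sizes <- c("16 fl oz cup", "1 carton (236 ml)", "4.8 oz (136 g)")

# Fluid ounces: remove everything from " fl oz" onward, leaving only the number
fl_oz <- as.numeric(str_remove(sizes[1], " fl oz.*"))
# Convert fluid ounces to milliliters
fl_oz_in_ml <- fl_oz * 29.5735

# Carton: the first run of 2-3 digits is the volume; the single-digit "1" is skipped
carton_ml <- as.numeric(str_extract(sizes[2], "[0-9]{2,3}"))

# Food: the lookbehind (?<=\\() only matches digits that directly follow "("
food_g <- as.numeric(str_extract(sizes[3], "(?<=\\()[0-9]{2,4}"))

c(fl_oz_in_ml = fl_oz_in_ml, carton_ml = carton_ml, food_g = food_g)
```

Testing a pattern on a handful of strings like this is a quick way to check that it extracts the number you intend before applying it to a whole column.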

Variation

The first type of question in visual EDA concerns the variation in the data. As you can see, the items are categorized into different product groups. To get an idea of the variation among item categories and nutritional value, several visualizations can be made.

2. Create a graph that gives insight into the number of items in each Category. Use geom_bar() and the menu_tidy data frame to make the graph.
menu_tidy |> 
  ggplot(aes(y = Category)) +
  geom_bar()
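
geom_bar() counts the rows in each group for you. If you want to see the numbers behind the bars, count() from dplyr gives the same tally as a table. The sketch below uses a hypothetical toy_menu tibble instead of the real data; ordering the bars with fct_infreq() and fct_rev() from forcats is an optional extra, not something the exercise requires.

```r
library(tidyverse)

# Hypothetical toy data: one row per menu item
toy_menu <- tibble(
  Category = c("Breakfast", "Breakfast", "Desserts",
               "Beverages", "Beverages", "Beverages")
)

# The table behind the bar chart: number of rows per Category, largest first
category_counts <- toy_menu |>
  count(Category, sort = TRUE)
category_counts

# Same information as a bar chart, with the most frequent Category at the top
bar_plot <- toy_menu |>
  ggplot(aes(y = fct_rev(fct_infreq(Category)))) +
  geom_bar() +
  labs(y = "Category")
bar_plot
```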

3. Plot the distribution of Calories using geom_histogram(). Describe the distribution. Do you see anything notable?
menu_tidy |> 
  ggplot(aes(x = Calories)) +
  geom_histogram(binwidth = 30) 

# The distribution is slightly skewed to the right; most items contain fewer than 750 calories.
# There is one outlier containing almost 2000 calories.
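
A rough numerical check of what the histogram suggests is to compare the mean with the median: in a right-skewed distribution, a few large values tend to pull the mean above the median. The sketch below illustrates this on a hypothetical vector of calorie values, not on the real menu, and it is a rule of thumb rather than a formal test of skewness.

```r
library(tidyverse)

# Hypothetical calorie values with one very large item
toy_calories <- tibble(
  Calories = c(150, 250, 300, 350, 400, 450, 500, 1900)
)

# The single large value raises the mean but barely affects the median
calorie_summary <- toy_calories |>
  summarise(
    mean   = mean(Calories),
    median = median(Calories),
    max    = max(Calories)
  )
calorie_summary
```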

Association

The second type of question in visual EDA concerns the association between different variables. The histogram you created in the previous question gives insight into how the Calories are distributed over the whole menu. It may also be interesting to see how Calories are distributed at the Category level.

4. Plot the distribution of Calories for each Category using geom_density() in combination with facet_wrap(). Can you see in which Category the outlier from the histogram falls?
menu_tidy |> 
  ggplot(aes(x = Calories)) +
  geom_density() +
  facet_wrap(vars(Category))

# The Chicken & Fish density plot shows a little bump at the far right:
# the outlier of almost 2000 calories is a Chicken & Fish item.
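
If the bump is hard to read from the plot, you can also look up the item with the most calories directly. The sketch below uses slice_max() from dplyr on a hypothetical toy_menu tibble; the item names and calorie values are made up for illustration.

```r
library(tidyverse)

# Hypothetical toy data with one item that is far higher in calories
toy_menu <- tibble(
  Category = c("Breakfast", "Chicken & Fish", "Desserts"),
  Item     = c("Demo Muffin", "Demo Nuggets (large box)", "Demo Cone"),
  Calories = c(300, 1900, 170)
)

# Keep only the row with the highest Calories value
top_item <- toy_menu |>
  slice_max(Calories, n = 1)
top_item
```

Applied to menu_tidy, the same call should return the row behind the outlier, including its Category.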

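An easier way to spot outliers is to create a boxplot: by default, geom_boxplot() draws every observation that lies more than 1.5 times the interquartile range beyond the box as a separate point, so outliers stand out immediately.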

5. Create a boxplot of the Calories for each Category. Is the outlier in the Category you expected based on the previous question?
menu_tidy |> 
  ggplot(aes(y = Category, x = Calories)) +
  geom_boxplot()
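
The points that a boxplot draws separately can also be found by hand with the 1.5 times IQR rule. The sketch below applies that rule per Category to a hypothetical toy_menu tibble with made-up values. It uses quantile() for the quartiles, whereas geom_boxplot() uses hinges, so on real data the two can disagree slightly for observations close to the cut-off.

```r
library(tidyverse)

# Hypothetical toy data: two categories, one item far above the rest
toy_menu <- tibble(
  Category = rep(c("Chicken & Fish", "Desserts"), each = 6),
  Calories = c(200, 250, 300, 350, 400, 1900,
               100, 120, 140, 160, 180, 200)
)

# Flag values more than 1.5 * IQR below Q1 or above Q3 within each Category
outliers <- toy_menu |>
  group_by(Category) |>
  mutate(
    q1  = quantile(Calories, 0.25, names = FALSE),
    q3  = quantile(Calories, 0.75, names = FALSE),
    iqr = q3 - q1,
    is_outlier = Calories < q1 - 1.5 * iqr | Calories > q3 + 1.5 * iqr
  ) |>
  ungroup() |>
  filter(is_outlier)
outliers
```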