Data visualization using ggplot2

Published

September 29, 2023

Introduction

In this practical, you will conduct an exploratory data analysis, to investigate the relationship between portion size and nutritional value of the items sold at McDonald’s. To do this, you will need to use ggplot2 and tidyverse, as you did in the previous practical, and we will also use the GGally package.

library(tidyverse)
library(GGally)

The dataset used contains information on the McDonald’s menu and can be downloaded here. Download the file and put it in your data folder. The original source of the data is on Kaggle.

Loading the data into R

To load the data from a separate file, use read_csv() from the tidyverse. We assign the data to a variable called menu.

menu <- read_csv("data/menu.csv")

You can also load data directly from a URL instead of from a local file! You can try it with this url: https://infomdwr.nl/labs/week_5/1_data_visualization_2/data/menu.csv

A first impression of the data

Before you start the analysis, a good first step is to get a sense of what the dataset looks like, especially when you import data from a separate file. You want to make sure that the dataset is imported correctly.

# Get an idea of what the menu dataset looks like
head(menu)
glimpse(menu)

You can see that the variable indicating the serving size is of a very inconvenient form. To transform this variable into a numeric variable indicating the serving size in grams (food) or milliliters (drinks), you can run the code below. You can also try to recode it yourself of course!

# Transformation drinks
drink_fl <- menu |> 
  filter(str_detect(`Serving Size`, " fl oz.*")) |> 
  mutate(`Serving Size` = str_remove(`Serving Size`, " fl oz.*")) |> 
  mutate(`Serving Size` = as.numeric(`Serving Size`) * 29.5735)

drink_carton <- menu |> 
  filter(str_detect(`Serving Size`, "carton")) |> 
  mutate(`Serving Size` = str_extract(`Serving Size`, "[0-9]{2,3}")) |> 
  mutate(`Serving Size` = as.numeric(`Serving Size`))

# Transformation food
food <-  menu |> 
  filter(str_detect(`Serving Size`, "g")) |> 
  mutate(`Serving Size` = str_extract(`Serving Size`, "(?<=\\()[0-9]{2,4}")) |> 
  mutate(`Serving Size` = as.numeric(`Serving Size`))

# Add Type variable indicating whether an item is food or a drink 
menu_tidy <-  
  bind_rows(drink_fl, drink_carton, food) |> 
  mutate(
   Type = case_when(
     as.character(Category) == 'Beverages' ~ 'Drinks',
     as.character(Category) == 'Coffee & Tea' ~ 'Drinks',
     as.character(Category) == 'Smoothies & Shakes' ~ 'Drinks',
     TRUE ~ 'Food'
   )
  )
1. After you ran the code, check the structure of the data once again, what type of variable is ‘Serving Size’ now, and what was it before you ran the code?

Variation

The first type of questions in visual EDA concern the variation in the data. As you can see, the items are categorized into different product groups. To get an idea of the variation among item categories and nutritional value, several visualisations can be made.

2. Create a graph that gives insight in the number of items in each Category, use geom_bar() to make the graph. Use the menu_tidy dataframe to do this.
3. Plot the distribution of Calories using geom_histogram(). Describe the distribution, do you see anything notable?

Association

The second type of question in visual EDA concerns the association between different variables. The histogram you created in the previous question gives insight in how the Calories are distributed over the whole menu. It may also be interesting to see how Calories are distributed on Category level.

4. Plot the distribution of Calories for each Category using geom_density() in combination with facet_wrap(), can you see in which Category the outlier that can be seen in the histogram falls?

An easier way to spot outliers is by creating a boxplot, because the geom is constructed in a way that emphasizes the outliers.

5. Create a boxplot of the Calories for each Category. Is the outlier in the same Category as you thought it was in based on the previous question?

Now you know in which category the outlier falls, you can check which item it is that has such a high energetic value.

For the next plot,

6. Create a plot using geom_col() to visualise the number of Calories of each item in the Chicken & Fish category.

Use filter() in combination with a pipe (|>) to select the information that you need for this plot. What is the item that contains so many calories? Why do you need to use geom_col() here instead of geom_bar()?

Use the pipe %>% to filter the data and then enter it as the first argument of the ggplot() function. More information on pipes can be found here.

Now you know what causes the outlier, you can check whether there is an association between the serving size and the number of calories.

7. Create a scatter plot to visualise the association between serving size and calories.

Are serving size and energetic value related to each other? Did you expect this outcome? Use the alpha argument in geom_point() to adjust the transparency. NB: to deal with spaces in variable names in ggplot(), you can surround your variable by backticks, e.g., Serving Size.

8. Create a new scatter plot to visualise the association between serving size and calories, but now use the colour argument to make a distinction between Type (see ?aes for more info).

Use geom_smooth() to add a regression line to the plot. Does this alter your conclusion about the relationship between serving size and calories? What do you conclude now?

9. Why do you think the relationship is so much affected by food type, is there a way to investigate this further using visualizations?

So far, you’ve only looked at the association between two variables at a time. However, when you are exploring associations, you may want to look at the association of many different variables at the same time. An easy way to do this is to make use of the ggpairs() function from the GGally package.

  1. Create a plot using ggpairs() where the association between at least 4 different variables is visualized. Are there differences in the associations between those variables based on item Type?