Data visualization using ggplot2

Published

September 30, 2024

Introduction

In this practical, you will conduct an exploratory data analysis, to investigate the relationship between portion size and nutritional value of the items sold at McDonald’s. To do this, you will need to use ggplot2 and tidyverse, as you did in the previous practical, and we will also use the GGally package.

library(tidyverse)
library(GGally)

The dataset used contains information on the McDonald’s menu and can be downloaded here. Download the file and put it in your data folder. The original source of the data is on Kaggle.

Loading the data into R

To load the data from a separate file, use read_csv() from the tidyverse. We assign the data to a variable called menu.

menu <- read_csv("data/menu.csv")

You can also load data directly from a URL instead of from a local file! You can try it with this url: https://infomdwr.nl/labs/week_5/1_data_visualization_2/data/menu.csv

A first impression of the data

Before you start the analysis, a good first step is to get a sense of what the dataset looks like, especially when you import data from a separate file. You want to make sure that the dataset is imported correctly.

# Get an idea of what the menu dataset looks like
head(menu)
glimpse(menu)

You can see that the variable indicating the serving size is of a very inconvenient form. To transform this variable into a numeric variable indicating the serving size in grams (food) or milliliters (drinks), you can run the code below. You can also try to recode it yourself of course!

# Transformation drinks
drink_fl <- menu |> 
  filter(str_detect(`Serving Size`, " fl oz.*")) |> 
  mutate(`Serving Size` = str_remove(`Serving Size`, " fl oz.*")) |> 
  mutate(`Serving Size` = as.numeric(`Serving Size`) * 29.5735)

drink_carton <- menu |> 
  filter(str_detect(`Serving Size`, "carton")) |> 
  mutate(`Serving Size` = str_extract(`Serving Size`, "[0-9]{2,3}")) |> 
  mutate(`Serving Size` = as.numeric(`Serving Size`))

# Transformation food
food <-  menu |> 
  filter(str_detect(`Serving Size`, "g")) |> 
  mutate(`Serving Size` = str_extract(`Serving Size`, "(?<=\\()[0-9]{2,4}")) |> 
  mutate(`Serving Size` = as.numeric(`Serving Size`))

# Add Type variable indicating whether an item is food or a drink 
menu_tidy <-  
  bind_rows(drink_fl, drink_carton, food) |> 
  mutate(
   Type = case_when(
     as.character(Category) == 'Beverages' ~ 'Drinks',
     as.character(Category) == 'Coffee & Tea' ~ 'Drinks',
     as.character(Category) == 'Smoothies & Shakes' ~ 'Drinks',
     TRUE ~ 'Food'
   )
  )

1. After you ran the code, check the structure of the data once again, what type of variable is ‘Serving Size’ now, and what was it before you ran the code?

# The variable type of `Serving Size` is numeric, before running the code it was a character variable.

Variation

The first type of questions in visual EDA concern the variation in the data. As you can see, the items are categorized into different product groups. To get an idea of the variation among item categories and nutritional value, several visualisations can be made.

2. Create a graph that gives insight in the number of items in each Category, use geom_bar() to make the graph. Use the menu_tidy dataframe to do this.

menu_tidy |> 
  ggplot(aes(y = Category)) +
  geom_bar()

3. Plot the distribution of Calories using geom_histogram(). Describe the distribution, do you see anything notable?

menu_tidy |> 
  ggplot(aes(x = Calories)) +
  geom_histogram(binwidth = 30)

# The distribution is slightly skewed to the right, most items contain less than 750 calories.
# There is one outlier containing almost 2000 calories.

Association

The second type of question in visual EDA concerns the association between different variables. The histogram you created in the previous question gives insight in how the Calories are distributed over the whole menu. It may also be interesting to see how Calories are distributed on Category level.

4. Plot the distribution of Calories for each Category using geom_density() in combination with facet_wrap(), can you see in which Category the outlier that can be seen in the histogram falls?

menu_tidy |> 
  ggplot(aes(x = Calories)) +
  geom_density() +
  facet_wrap(vars(Category))

# The Chicken & Fish density plot shows a little bump at the end, 
# the outlier of almost 2000 calories is a Chicken & Fish item.

An easier way to spot outliers is by creating a boxplot, because the geom is constructed in a way that emphasizes the outliers.

5. Create a boxplot of the Calories for each Category. Is the outlier in the same Category as you thought it was in based on the previous question?

menu_tidy |> 
  ggplot(aes(y = Category, x = Calories)) +
  geom_boxplot()

# Yes, very easy to see now that the outlier is in the 
# Chicken & Fish category.

Now you know in which category the outlier falls, you can check which item it is that has such a high energetic value.

For the next plot,

6. Create a plot using geom_col() to visualise the number of Calories of each item in the Chicken & Fish category.

Use filter() in combination with a pipe (|>) to select the information that you need for this plot. What is the item that contains so many calories? Why do you need to use geom_col() here instead of geom_bar()?

Use the pipe |> to filter the data and then enter it as the first argument of the ggplot() function. More information on pipes can be found here.

menu_tidy |> 
  filter(Category == "Chicken & Fish") |> 
  ggplot(aes(x = Item, y = Calories)) +
  geom_col() +
  coord_flip()

# The item containing the stunning amount of almost 
# 2000 calories is the 40 piece Chicken McNuggets.
# Geom_bar() makes the height of the bar proportional 
# to the number of cases in each category, geom_col() 
# uses values in the data to determine the height of 
# the bars. Since in this plot, the height of the bars 
# should indicate the energetic value (calories), 
# geom_col() should be used. See ?geom_bar() for more.

Now you know what causes the outlier, you can check whether there is an association between the serving size and the number of calories.

7. Create a scatter plot to visualise the association between serving size and calories.

Are serving size and energetic value related to each other? Did you expect this outcome? Use the alpha argument in geom_point() to adjust the transparency. NB: to deal with spaces in variable names in ggplot(), you can surround your variable by backticks, e.g., Serving Size.

menu_tidy |> 
  ggplot(aes(x = `Serving Size`, y = Calories)) +
  geom_point(alpha = 0.5)

# There doesn't seem to be a strong relationship 
# between the portion size and energetic value 
# based on this plot. However, this seems highly 
# counter intuitive since you'd expect that the 
# larger the portion, the more calories it contains.

8. Create a new scatter plot to visualise the association between serving size and calories, but now use the colour argument to make a distinction between Type (see ?aes for more info).

Use geom_smooth() to add a regression line to the plot. Does this alter your conclusion about the relationship between serving size and calories? What do you conclude now?

menu_tidy |> 
  ggplot(aes(x = `Serving Size`, y = Calories, colour = Type)) +
  geom_point() +
  geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

# When splitting up the data in drinks and food, 
# you see that there is actually a very strong 
# relation between serving size and calories for 
# the food items, where the association is very 
# weak for the drink items.

9. Why do you think the relationship is so much affected by food type, is there a way to investigate this further using visualizations?

# A part of the drink-type items are items with 
# no calories at all (coffee/tea, light sodas), 
# so for these items the correlation between 
# calories and serving size is zero. This items 
# are likely to deflate the correlation for 
# drink-type items a lot. A way of investigating 
# this hypothesis using a visualisation is this:

menu_tidy %>% 
  filter(Type == 'Drinks') %>% 
    ggplot(mapping = aes(x = `Serving Size`, y = Calories, colour = Category)) +
    geom_point() +
    geom_smooth(method = 'lm')

`geom_smooth()` using formula = 'y ~ x'

# As can be seen from the plot, the association 
# between categories containing 0 calory items 
# (Beverages and Coffee& Tea) is far weaker than 
# the category without 0 calorie items (Smoothies & 
# Shakes)

So far, you’ve only looked at the association between two variables at a time. However, when you are exploring associations, you may want to look at the association of many different variables at the same time. An easy way to do this is to make use of the ggpairs() function from the GGally package.

Create a plot using ggpairs() where the association between at least 4 different variables is visualized. Are there differences in the associations between those variables based on item Type?

# In this plot, the association between `Serving Size`, 
# `Calories` and the three macro nutrients are 
# visualised. Because GGally is an extension of 
# ggplot2, additional aesthetics arguments can be 
# added to make a distinction in type.

menu_tidy |> 
  ggpairs(
    mapping = aes(colour = Type),
    columns = c("Serving Size", "Calories", "Total Fat", 
                "Carbohydrates", "Protein")
  )

Warning in geom_point(): All aesthetics have length 1, but the data has 25 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

# Using this plot, you can see that the correlations 
# change a lot when you take into account the item 
# type. This phenomenon, where the correlation 
# disappears or even changes of direction when 
# taking groups into account, is called Simpson's 
# paradox.