Model-based clustering using mclust

Published

October 20, 2023

Introduction

In this practical, we will apply model-based clustering on a data set of bank note measurements.

We use the following packages:

library(mclust)
library(tidyverse)
library(patchwork)

The data is built into the mclust package and can be loaded as a tibble by running the following code:

df <- as_tibble(banknote)

Data exploration

1. Read the help file of the banknote data set to understand what it’s all about.
?banknote
2. Create a scatter plot of the left (x-axis) and right (y-axis) measurements on the data set. Map the Status column to colour. Jitter the points to avoid overplotting. Are the classes easy to distinguish based on these features?
df |> 
  ggplot(aes(x = Left, y = Right, colour = Status)) +
  geom_point(position = position_jitter())

# the classes are not easy to distinguish: there is considerable overlap.
3. From now on, we will assume that we don’t have the labels. Remove the Status column from the data set.
df <- df |> select(-Status)
4. Create density plots for all columns in the data set. Which single feature is likely to be best for clustering?
# big patchwork of density plots
df |> ggplot(aes(x = Length))   + geom_density() + theme_minimal() +
df |> ggplot(aes(x = Left))     + geom_density() + theme_minimal() +
df |> ggplot(aes(x = Right))    + geom_density() + theme_minimal() +
df |> ggplot(aes(x = Bottom))   + geom_density() + theme_minimal() +
df |> ggplot(aes(x = Top))      + geom_density() + theme_minimal() +
df |> ggplot(aes(x = Diagonal)) + geom_density() + theme_minimal()

# the Diagonal feature looks good! Look at the two bumps in its density plot.

# Colourful alternative:
library(ggridges)
df |> 
  mutate(across(everything(), scale)) |> 
  pivot_longer(everything(), names_to = "feature", values_to = "value") |> 
  ggplot(aes(x = value, y = feature, fill = feature)) + 
  geom_density_ridges() +
  scale_fill_viridis_d(guide = "none") +
  theme_minimal()
Picking joint bandwidth of 0.31

Univariate model-based clustering

5. Use Mclust to perform model-based clustering with 2 clusters on the feature you chose. Assume equal variances. Name the model object fit_E_2. What are the means and variances of the clusters?
fit_E_2 <- Mclust(df$Diagonal, G = 2, modelNames = "E")
# means:
fit_E_2$parameters$mean
       1        2 
139.4464 141.5221 
# variances:
fit_E_2$parameters$variance$sigmasq
[1] 0.244004
6. Use the formula from the slides and the model’s log-likelihood (fit_E_2$loglik) to compute the BIC for this model. Compare it to the BIC stored in the model object (fit_E_2$bic). Explain how many parameters (m) you used and which parameters these are.
# This model has 4 parameters: 1 class probability pi, 2 means, and 1 variance
- 2 * fit_E_2$loglik + log(nrow(df)) * 4 
[1] 569.4667
fit_E_2$bic
[1] -569.4667
# note: the BIC from the Mclust package is (wrongly) multiplied by -1!
7. Plot the model-implied density using the plot() function. Afterwards, add rug marks of the original data to the plot using the rug() function from the base graphics system.
# plot the density using base R plots
plot(fit_E_2, "density", xlab = "Diagonal measurement")
# add the observations using rug marks
rug(df$Diagonal)