library(mclust)
library(tidyverse)
library(patchwork)
Model-based clustering using mclust
Published
October 20, 2023
Introduction
In this practical, we will apply model-based clustering on a data set of bank note measurements.
We use the following packages:
The data is built into the mclust
package and can be loaded as a tibble
by running the following code:
Data exploration
2. Create a scatter plot of the left (x-axis) and right (y-axis) measurements on the data set. Map the
Status
column to colour. Jitter the points to avoid overplotting. Are the classes easy to distinguish based on these features?
3. From now on, we will assume that we don’t have the labels. Remove the
Status
column from the data set.
4. Create density plots for all columns in the data set. Which single feature is likely to be best for clustering?
# big patchwork of density plots
df |> ggplot(aes(x = Length)) + geom_density() + theme_minimal() +
df |> ggplot(aes(x = Left)) + geom_density() + theme_minimal() +
df |> ggplot(aes(x = Right)) + geom_density() + theme_minimal() +
df |> ggplot(aes(x = Bottom)) + geom_density() + theme_minimal() +
df |> ggplot(aes(x = Top)) + geom_density() + theme_minimal() +
df |> ggplot(aes(x = Diagonal)) + geom_density() + theme_minimal()
# the Diagonal feature looks good! Look at the two bumps in its density plot.
# Colourful alternative:
library(ggridges)
df |>
mutate(across(everything(), scale)) |>
pivot_longer(everything(), names_to = "feature", values_to = "value") |>
ggplot(aes(x = value, y = feature, fill = feature)) +
geom_density_ridges() +
scale_fill_viridis_d(guide = "none") +
theme_minimal()
Picking joint bandwidth of 0.31
Univariate model-based clustering
5. Use
Mclust
to perform model-based clustering with 2 clusters on the feature you chose. Assume equal variances. Name the model object fit_E_2
. What are the means and variances of the clusters?
6. Use the formula from the slides and the model’s log-likelihood (
fit_E_2$loglik
) to compute the BIC for this model. Compare it to the BIC stored in the model object (fit_E_2$bic
). Explain how many parameters (m) you used and which parameters these are.
7. Plot the model-implied density using the
plot()
function. Afterwards, add rug marks of the original data to the plot using the rug()
function from the base graphics system.