Model-based clustering using mclust

Published October 20, 2023

Introduction

In this practical, we will apply model-based clustering on a data set of bank note measurements.

We use the following packages:

library(mclust)
library(tidyverse)
library(patchwork)

The data is built into the mclust package and can be loaded as a tibble by running the following code:
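A minimal sketch of that step, assuming the standard banknote data set shipped with mclust and df as the object name used in the rest of the tutorial:

```r
library(mclust)     # provides the banknote data
library(tidyverse)  # as_tibble()

# load the banknote measurements and convert the data frame to a tibble
data("banknote", package = "mclust")
df <- as_tibble(banknote)
```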
Data exploration
2. Create a scatter plot of the left (x-axis) and right (y-axis) measurements in the data set. Map the Status column to colour. Jitter the points to avoid overplotting. Are the classes easy to distinguish based on these features?
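One possible way to build this plot, using geom_jitter for the jittered point layer; the jitter amounts are an assumption chosen to keep points close to their true positions:

```r
library(mclust)     # provides the banknote data
library(tidyverse)

data("banknote", package = "mclust")
df <- as_tibble(banknote)

# jittered scatter plot of the two side measurements, coloured by true class
df |>
  ggplot(aes(x = Left, y = Right, colour = Status)) +
  geom_jitter(width = 0.05, height = 0.05) +
  theme_minimal()
```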
3. From now on, we will assume that we don’t have the labels. Remove the Status column from the data set.
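A sketch of this step, assuming df holds the banknote measurements:

```r
library(mclust)     # provides the banknote data
library(tidyverse)  # as_tibble(), select()

data("banknote", package = "mclust")

# drop the class labels so the remaining analysis is fully unsupervised
df <- as_tibble(banknote) |> select(-Status)
```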
4. Create density plots for all columns in the data set. Which single feature is likely to be best for clustering?
# big patchwork of density plots
df |> ggplot(aes(x = Length)) + geom_density() + theme_minimal() +
  df |> ggplot(aes(x = Left)) + geom_density() + theme_minimal() +
  df |> ggplot(aes(x = Right)) + geom_density() + theme_minimal() +
  df |> ggplot(aes(x = Bottom)) + geom_density() + theme_minimal() +
  df |> ggplot(aes(x = Top)) + geom_density() + theme_minimal() +
  df |> ggplot(aes(x = Diagonal)) + geom_density() + theme_minimal()
# the Diagonal feature looks good! Look at the two bumps in its density plot.
# Colourful alternative:
library(ggridges)
df |>
mutate(across(everything(), scale)) |>
pivot_longer(everything(), names_to = "feature", values_to = "value") |>
ggplot(aes(x = value, y = feature, fill = feature)) +
geom_density_ridges() +
scale_fill_viridis_d(guide = "none") +
theme_minimal()
Univariate model-based clustering
5. Use Mclust to perform model-based clustering with 2 clusters on the feature you chose. Assume equal variances. Name the model object fit_E_2. What are the means and variances of the clusters?
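Assuming Diagonal is the feature chosen in question 4, the fit could look like this ("E" is mclust's univariate equal-variance model):

```r
library(mclust)

data("banknote", package = "mclust")

# univariate, equal-variance ("E") Gaussian mixture with 2 components
fit_E_2 <- Mclust(banknote$Diagonal, G = 2, modelNames = "E")

fit_E_2$parameters$mean              # the two cluster means
fit_E_2$parameters$variance$sigmasq  # the single shared variance
```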
6. Use the formula from the slides and the model’s log-likelihood (fit_E_2$loglik) to compute the BIC for this model. Compare it to the BIC stored in the model object (fit_E_2$bic). Explain how many parameters (m) you used and which parameters these are.
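mclust defines the BIC as 2·loglik − m·log(n), so larger values are better. For this model m = 4: two cluster means, one shared variance, and one free mixing proportion (the other follows because proportions sum to 1). A sketch of the comparison, refitting the model from question 5:

```r
library(mclust)

data("banknote", package = "mclust")
fit_E_2 <- Mclust(banknote$Diagonal, G = 2, modelNames = "E")

m <- 4           # 2 means + 1 shared variance + 1 free mixing proportion
n <- fit_E_2$n   # number of observations

bic_manual <- 2 * fit_E_2$loglik - m * log(n)

# should match the BIC stored in the model object
c(manual = bic_manual, stored = fit_E_2$bic)
```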
7. Plot the model-implied density using the plot() function. Afterwards, add rug marks of the original data to the plot using the rug() function from the base graphics system.
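A sketch of the plotting step, using the fitted model from question 5; what = "density" asks plot.Mclust for the model-implied density:

```r
library(mclust)

data("banknote", package = "mclust")
fit_E_2 <- Mclust(banknote$Diagonal, G = 2, modelNames = "E")

# model-implied mixture density, with the raw data as rug marks underneath
plot(fit_E_2, what = "density")
rug(banknote$Diagonal)
```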
8. Use Mclust to perform model-based clustering with 2 clusters on this feature again, but now assume unequal variances. Name the model object fit_V_2. What are the means and variances of the clusters? Plot the density again and note the differences.
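Analogous to question 5, now with mclust's univariate unequal-variance model "V":

```r
library(mclust)

data("banknote", package = "mclust")

# univariate, unequal-variance ("V") Gaussian mixture with 2 components
fit_V_2 <- Mclust(banknote$Diagonal, G = 2, modelNames = "V")

fit_V_2$parameters$mean              # the two cluster means
fit_V_2$parameters$variance$sigmasq  # one variance per cluster

plot(fit_V_2, what = "density")
rug(banknote$Diagonal)
```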
9. How many parameters does this model have? Name them.
10. According to the deviance, which model fits better?
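The deviance is −2 × log-likelihood, so lower values indicate better fit; because the equal-variance model is nested within the unequal-variance one, the V model typically attains the lower deviance. A sketch of the comparison:

```r
library(mclust)

data("banknote", package = "mclust")
fit_E_2 <- Mclust(banknote$Diagonal, G = 2, modelNames = "E")
fit_V_2 <- Mclust(banknote$Diagonal, G = 2, modelNames = "V")

# deviance = -2 * log-likelihood; smaller is better
c(E = -2 * fit_E_2$loglik, V = -2 * fit_V_2$loglik)
```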
Multivariate model-based clustering
We will now use all available information in the data set to cluster the observations.
12. Use Mclust with all 6 features to perform clustering. Allow all model types (shapes), and from 1 to 9 potential clusters. What is the optimal model based on the BIC?
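The call behind the summary below can be sketched as follows; fit is an assumed object name. Mclust's defaults already consider all 14 multivariate model shapes, and G = 1:9 requests 1 to 9 clusters:

```r
library(mclust)
library(tidyverse)

data("banknote", package = "mclust")
df <- as_tibble(banknote) |> select(-Status)

# all model shapes, 1 to 9 components; Mclust picks the best model by BIC
fit <- Mclust(df, G = 1:9)
summary(fit)
```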
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VVE (ellipsoidal, equal orientation) model with 3 components:
log-likelihood n df BIC ICL
-663.3814 200 53 -1607.574 -1607.71
Clustering table:
1 2 3
18 98 84
13. How many mean parameters does this model have?
[,1] [,2] [,3]
Length 215.023017 214.971360 214.78091
Left 130.499684 129.929686 130.26435
Right 130.304813 129.701143 130.17988
Bottom 8.775842 8.301115 10.85719
Top 11.173812 10.162379 11.10798
Diagonal 138.731602 141.541673 139.62387
14. Run a 2-component VVV model on this data. Create a matrix of bivariate contour (“density”) plots using the plot() function. Which features provide good component separation? Which do not?
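A sketch, with fit_VVV_2 as an assumed name; for multivariate fits, plot(..., what = "density") draws a matrix of bivariate contour plots, one panel per pair of features:

```r
library(mclust)
library(tidyverse)

data("banknote", package = "mclust")
df <- as_tibble(banknote) |> select(-Status)

# 2-component mixture with fully unconstrained covariances ("VVV")
fit_VVV_2 <- Mclust(df, G = 2, modelNames = "VVV")

plot(fit_VVV_2, what = "density")
```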
15. Create a scatter plot just like the first scatter plot in this tutorial, but map the estimated class assignments to the colour aesthetic.
Map the uncertainty (part of the fitted model list) to the size aesthetic, such that larger points indicate more uncertain class assignments. Jitter the points to avoid overplotting. What do you notice about the uncertainty?
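One possible implementation, reusing the 2-component VVV fit from question 14; the jitter amounts are again an assumption:

```r
library(mclust)
library(tidyverse)

data("banknote", package = "mclust")
df <- as_tibble(banknote) |> select(-Status)

fit_VVV_2 <- Mclust(df, G = 2, modelNames = "VVV")

# colour by estimated class, size by classification uncertainty
df |>
  mutate(
    class       = factor(fit_VVV_2$classification),
    uncertainty = fit_VVV_2$uncertainty
  ) |>
  ggplot(aes(x = Left, y = Right, colour = class, size = uncertainty)) +
  geom_jitter(width = 0.05, height = 0.05) +
  theme_minimal()
```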