library(MASS)
library(tidyverse)
library(patchwork)
library(ggdendro)
Hierarchical and k-means clustering
Published
October 18, 2023
Introduction
In this practical, we will apply hierarchical and k-means clustering to two synthetic datasets.
First, we load the packages required for this lab.
Make sure to load MASS before tidyverse, otherwise MASS::select() will mask dplyr::select().
Before we start, set a seed for reproducibility (we use 123) and call options(scipen = 999) to suppress scientific notation, which makes results easier to compare and interpret later in the session.
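A minimal setup chunk along these lines, using the seed value and option mentioned above, could be:
# reproducibility and readable (non-scientific) number formatting
set.seed(123)
options(scipen = 999)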
Data processing
1. The data can be generated by running the code below. Try to understand what is happening as you run each line of the code below.
# randomly generate bivariate normal data
set.seed(123)
sigma <- matrix(c(1, .5, .5, 1), 2, 2)
sim_matrix <- mvrnorm(n = 100, mu = c(5, 5), Sigma = sigma)
colnames(sim_matrix) <- c("x1", "x2")
# change to a data frame (tibble) and add a cluster label column
sim_df <-
sim_matrix |>
as_tibble() |>
mutate(class = sample(c("A", "B", "C"), size = 100, replace = TRUE))
# Move the clusters to generate separation
sim_df_small <-
sim_df |>
mutate(x2 = case_when(class == "A" ~ x2 + .5,
class == "B" ~ x2 - .5,
class == "C" ~ x2 + .5),
x1 = case_when(class == "A" ~ x1 - .5,
class == "B" ~ x1 - 0,
class == "C" ~ x1 + .5))
sim_df_large <-
sim_df |>
mutate(x2 = case_when(class == "A" ~ x2 + 2.5,
class == "B" ~ x2 - 2.5,
class == "C" ~ x2 + 2.5),
x1 = case_when(class == "A" ~ x1 - 2.5,
class == "B" ~ x1 - 0,
class == "C" ~ x1 + 2.5))
2. Prepare two unsupervised datasets by removing the class feature.
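One way to do this; the names df_s and df_l below are the ones the later code in this practical expects:
# remove the class label so the data are unsupervised
df_s <- sim_df_small |> select(-class)
df_l <- sim_df_large |> select(-class)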
3. For each of these datasets, create a scatterplot. Combine the two plots into a single frame (look up the “patchwork” package to see how to do this!) What is the difference between the two datasets?
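A possible solution, assuming the df_s and df_l objects from the previous step; patchwork lets you combine two ggplots with +:
# scatterplots of both datasets, combined into one frame with patchwork
p_s <- ggplot(df_s, aes(x = x1, y = x2)) + geom_point() + ggtitle("Small differences")
p_l <- ggplot(df_l, aes(x = x1, y = x2)) + geom_point() + ggtitle("Large differences")
p_s + p_l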
Hierarchical clustering
4. Run a hierarchical clustering on these datasets and display the result as dendrograms. Use Euclidean distances and the complete agglomeration method. Make sure the two plots have the same y-scale. What is the difference between the dendrograms?
Hint: functions you will need are hclust(), ggdendrogram(), and ylim().
# Euclidean distance matrices for both datasets
dist_s <- dist(df_s, method = "euclidean")
dist_l <- dist(df_l, method = "euclidean")
# complete-linkage hierarchical clustering
res_s <- hclust(dist_s, method = "complete")
res_l <- hclust(dist_l, method = "complete")
# dendrograms side by side on the same y-scale
ggdendrogram(res_s, labels = FALSE) + ggtitle("Small") + ylim(0, 10) +
ggdendrogram(res_l, labels = FALSE) + ggtitle("Large") + ylim(0, 10)
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
5. For the dataset with small differences, also run a complete agglomeration hierarchical clustering with Manhattan distance.
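A sketch of this step; res_s2 is the name used in the comparison code below:
# Manhattan distance, complete linkage, on the small-difference data
dist_s2 <- dist(df_s, method = "manhattan")
res_s2 <- hclust(dist_s2, method = "complete")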
6. Use the cutree() function to obtain the cluster assignments for three clusters and compare them to the 3-cluster Euclidean solution. Do this comparison by creating two scatter plots with cluster assignment mapped to the colour aesthetic. What difference do you see?
# 3-cluster assignments for the Euclidean and Manhattan solutions
clus_1 <- as_factor(cutree(res_s, 3))
clus_2 <- as_factor(cutree(res_s2, 3))
p1 <- df_s |>
  ggplot(aes(x = x1, y = x2, colour = clus_1)) +
  geom_point() +
  ggtitle("Euclidean") +
  theme(legend.position = "none") +
  coord_fixed()
p2 <- df_s |>
  ggplot(aes(x = x1, y = x2, colour = clus_2)) +
  geom_point() +
  ggtitle("Manhattan") +
  theme(legend.position = "none") +
  coord_fixed()
# combine the two panels with patchwork
p1 + p2
K-means clustering
7. Perform k-means clustering with 2, 3, 4, and 6 clusters on the large-difference data. Again, create coloured scatter plots for these clustering results.
# Let's set the seed for reproducibility
set.seed(45)
# we can do it in a single pipeline and with faceting
# (of course there are many ways to do this, though)
df_l |>
mutate(
class_2 = as_factor(kmeans(df_l, 2)$cluster),
class_3 = as_factor(kmeans(df_l, 3)$cluster),
class_4 = as_factor(kmeans(df_l, 4)$cluster),
class_6 = as_factor(kmeans(df_l, 6)$cluster)
) |>
pivot_longer(cols = c(class_2, class_3, class_4, class_6),
names_to = "class_num", values_to = "cluster") |>
ggplot(aes(x = x1, y = x2, colour = cluster)) +
geom_point() +
scale_colour_brewer(type = "qual") + # use easy to distinguish scale
facet_wrap(~class_num)
8. Do the same thing again a few times. Do you see the same results every time? Where do you see differences?
# Let's set the seed for reproducibility
set.seed(46)
df_l |>
mutate(
class_2 = as_factor(kmeans(df_l, 2)$cluster),
class_3 = as_factor(kmeans(df_l, 3)$cluster),
class_4 = as_factor(kmeans(df_l, 4)$cluster),
class_6 = as_factor(kmeans(df_l, 6)$cluster)
) |>
pivot_longer(cols = c(class_2, class_3, class_4, class_6),
names_to = "class_num", values_to = "cluster") |>
ggplot(aes(x = x1, y = x2, colour = cluster)) +
geom_point() +
scale_colour_brewer(type = "qual") + # use easy to distinguish scale
facet_wrap(~class_num)
9. Find a way online to perform bootstrap stability assessment for the 3- and 6-cluster solutions.
# For this we decided to use the function clusterboot from the fpc package.
# NB: this package needs to be installed first!
# install.packages("fpc")
library(fpc)
# You can read the documentation ?clusterboot to figure out how to use this
# function. This can take a while but is something you will need to do a lot in
# the real world!
boot_3 <- clusterboot(df_l, B = 1000, clustermethod = kmeansCBI, k = 3,
count = FALSE)
boot_6 <- clusterboot(df_l, B = 1000, clustermethod = kmeansCBI, k = 6,
count = FALSE)
# the average stability is much lower for 6 means than for 3 means:
boot_3$bootmean
[1] 0.9844165 0.9730219 0.9739694
boot_6$bootmean
[1] 0.7844248 0.5383414 0.7593547 0.7230086 0.6897734 0.7091287