```
library(MASS)
library(tidyverse)
library(patchwork)
library(ggdendro)
```

# Hierarchical and k-means clustering

Published: October 18, 2023

# Introduction

In this practical, we will apply hierarchical and k-means clustering to two synthetic datasets.

First, we load the packages required for this lab. Note that we load `MASS` before `tidyverse`; otherwise `MASS::select()` would mask `dplyr::select()`.

Before we start, set a seed for reproducibility (we use `123`) and set `options(scipen = 999)` to suppress scientific notation, making it easier to compare and interpret results later in the session.

# Data processing

1. The data can be generated by running the code below. Try to understand what happens as you run each line.

```
# randomly generate bivariate normal data
set.seed(123)
sigma <- matrix(c(1, .5, .5, 1), 2, 2)
sim_matrix <- mvrnorm(n = 100, mu = c(5, 5), Sigma = sigma)
colnames(sim_matrix) <- c("x1", "x2")

# change to a data frame (tibble) and add a cluster label column
sim_df <-
  sim_matrix |>
  as_tibble() |>
  mutate(class = sample(c("A", "B", "C"), size = 100, replace = TRUE))

# move the clusters to generate separation
sim_df_small <-
  sim_df |>
  mutate(x2 = case_when(class == "A" ~ x2 + .5,
                        class == "B" ~ x2 - .5,
                        class == "C" ~ x2 + .5),
         x1 = case_when(class == "A" ~ x1 - .5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + .5))

sim_df_large <-
  sim_df |>
  mutate(x2 = case_when(class == "A" ~ x2 + 2.5,
                        class == "B" ~ x2 - 2.5,
                        class == "C" ~ x2 + 2.5),
         x1 = case_when(class == "A" ~ x1 - 2.5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + 2.5))
```

2. Prepare two unsupervised datasets by removing the class feature.
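One way to do this is with `dplyr::select()`. The names `df_s` and `df_l` below are a choice made here, matching the objects used in the clustering code further down:

```
# drop the class label to obtain unsupervised versions of the data
df_s <- sim_df_small |> select(-class)
df_l <- sim_df_large |> select(-class)
```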

3. For each of these datasets, create a scatterplot. Combine the two plots into a single frame (look up the “patchwork” package to see how to do this!) What is the difference between the two datasets?
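With patchwork loaded, placing two ggplots side by side is as simple as adding them together with `+`. A minimal sketch (the plot object names `p_s` and `p_l` are made up here; `df_s` and `df_l` are the unsupervised datasets from the previous step):

```
p_s <- ggplot(df_s, aes(x = x1, y = x2)) +
  geom_point() +
  ggtitle("Small differences")

p_l <- ggplot(df_l, aes(x = x1, y = x2)) +
  geom_point() +
  ggtitle("Large differences")

# patchwork overloads `+` to combine the plots into one frame
p_s + p_l
```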

# Hierarchical clustering

4. Run a hierarchical clustering on these datasets and display the results as dendrograms. Use Euclidean distances and the complete agglomeration method. Make sure the two plots have the same y-scale. What is the difference between the dendrograms?

Hint: functions you will need are `hclust`, `ggdendrogram`, and `ylim`.

```
# df_s and df_l are the label-free datasets prepared in step 2
dist_s <- dist(df_s, method = "euclidean")
dist_l <- dist(df_l, method = "euclidean")
res_s <- hclust(dist_s, method = "complete")
res_l <- hclust(dist_l, method = "complete")

ggdendrogram(res_s, labels = FALSE) + ggtitle("Small") + ylim(0, 10) +
  ggdendrogram(res_l, labels = FALSE) + ggtitle("Large") + ylim(0, 10)
```

```
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
```

5. For the dataset with small differences, also run a complete agglomeration hierarchical clustering with the Manhattan distance.
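This only requires changing the `method` argument of `dist()`. A sketch, with the object name `res_s2` chosen to match the comparison code in the next step:

```
# manhattan (city-block) distances, complete linkage as before
dist_s2 <- dist(df_s, method = "manhattan")
res_s2 <- hclust(dist_s2, method = "complete")
```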

6. Use the `cutree()` function to obtain the cluster assignments for three clusters, and compare these assignments to the 3-cluster Euclidean solution. Do this comparison by creating two scatter plots with cluster assignment mapped to the colour aesthetic. What difference do you see?
```
clus_1 <- as_factor(cutree(res_s, 3))  # euclidean solution from step 4
clus_2 <- as_factor(cutree(res_s2, 3)) # manhattan solution from step 5

p1 <- df_s |>
  ggplot(aes(x = x1, y = x2, colour = clus_1)) +
  geom_point() +
  ggtitle("Euclidean") +
  theme(legend.position = "none") +
  coord_fixed()

p2 <- df_s |>
  ggplot(aes(x = x1, y = x2, colour = clus_2)) +
  geom_point() +
  ggtitle("Manhattan") +
  theme(legend.position = "none") +
  coord_fixed()

p1 + p2
```

# K-means clustering

7. Perform k-means clustering with 2, 3, 4, and 6 clusters on the large-difference data. Again, create coloured scatter plots of the clustering results.

```
# set the seed for reproducibility: kmeans uses random starting values
set.seed(45)

# we can do it in a single pipeline and with faceting
# (of course there are many ways to do this, though)
df_l |>
  mutate(
    class_2 = as_factor(kmeans(df_l, 2)$cluster),
    class_3 = as_factor(kmeans(df_l, 3)$cluster),
    class_4 = as_factor(kmeans(df_l, 4)$cluster),
    class_6 = as_factor(kmeans(df_l, 6)$cluster)
  ) |>
  pivot_longer(cols = c(class_2, class_3, class_4, class_6),
               names_to = "class_num", values_to = "cluster") |>
  ggplot(aes(x = x1, y = x2, colour = cluster)) +
  geom_point() +
  scale_colour_brewer(type = "qual") + # easy-to-distinguish qualitative scale
  facet_wrap(~class_num)
```