Hierarchical and k-means clustering

Published

October 18, 2023

Introduction

In this practical, we will apply hierarchical and k-means clustering to two synthetic datasets.

First, we load the packages required for this lab.

library(MASS)
library(tidyverse)
library(patchwork)
library(ggdendro)

Make sure to load MASS before tidyverse; otherwise, MASS::select() will mask dplyr::select().
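
A quick way to confirm that the load order worked is to check which select() R finds first. The sketch below is only a sanity check and assumes MASS and dplyr (attached as part of the tidyverse) are installed; the small data frame is made up for illustration.

```r
library(MASS)
library(dplyr)  # attached after MASS, so its select() comes first on the search path

# TRUE when the unqualified name resolves to dplyr's version
identical(select, dplyr::select)

# Namespacing the call removes any dependence on the load order
dplyr::select(data.frame(x1 = 1:3, x2 = 4:6, class = c("A", "B", "C")), -class)
```

If the check returns FALSE, restarting the R session and loading the packages in the order shown is usually the simplest fix.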

Before we start, set a seed for reproducibility (we use 123). We also set options(scipen = 999) to suppress scientific notation, which makes results easier to compare and interpret later in the session.

set.seed(123)
options(scipen = 999)
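
To see what the scipen option changes, the following minimal sketch formats the same small number under R's default penalty and under the large penalty used here. The value 0.00001 is an arbitrary example.

```r
old <- options(scipen = 0)        # R's default penalty; remember the previous setting
sci_version <- format(0.00001)    # formatted in scientific notation

options(scipen = 999)             # strongly prefer fixed notation
fixed_version <- format(0.00001)  # formatted in fixed notation

c(sci_version, fixed_version)
options(old)                      # restore whatever was set before
```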

Data processing

1. Generate the data by running the code below. Try to understand what each line does as you run it.

# randomly generate bivariate normal data
set.seed(123)
sigma      <- matrix(c(1, .5, .5, 1), 2, 2)
sim_matrix <- mvrnorm(n = 100, mu = c(5, 5), Sigma = sigma)
colnames(sim_matrix) <- c("x1", "x2")

# change to a data frame (tibble) and add a cluster label column
sim_df <- 
  sim_matrix |> 
  as_tibble() |>
  mutate(class = sample(c("A", "B", "C"), size = 100, replace = TRUE))

# Move the clusters to generate separation
sim_df_small <- 
  sim_df |>
  mutate(x2 = case_when(class == "A" ~ x2 + .5,
                        class == "B" ~ x2 - .5,
                        class == "C" ~ x2 + .5),
         x1 = case_when(class == "A" ~ x1 - .5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + .5))
sim_df_large <- 
  sim_df |>
  mutate(x2 = case_when(class == "A" ~ x2 + 2.5,
                        class == "B" ~ x2 - 2.5,
                        class == "C" ~ x2 + 2.5),
         x1 = case_when(class == "A" ~ x1 - 2.5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + 2.5))

2. Prepare two unsupervised datasets by removing the class feature.
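
One possible approach is to drop the label column with dplyr::select(). The sketch below uses a tiny made-up tibble (toy_df, a stand-in with the same column names as sim_df_small and sim_df_large) so that it runs on its own; the same call should carry over to the real datasets.

```r
library(dplyr)

# Stand-in for sim_df_small / sim_df_large (assumption: same column names)
toy_df <- tibble(
  x1    = c(4.2, 5.1, 6.3, 4.8),
  x2    = c(5.5, 4.4, 5.9, 5.2),
  class = c("A", "B", "C", "A")
)

# Drop the label so that only the features remain
toy_unlabelled <- toy_df |> dplyr::select(-class)
names(toy_unlabelled)
```

Writing dplyr::select() explicitly keeps the code working even if MASS happens to be attached last.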

3. For each of these datasets, create a scatterplot. Combine the two plots into a single frame (look up the “patchwork” package to see how to do this). What is the difference between the two datasets?
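
A minimal sketch of the plotting step, assuming ggplot2 and patchwork are installed. The data frames toy_small and toy_large are simplified stand-ins (three groups of 20 points with shifted means), not the datasets generated above.

```r
library(ggplot2)
library(patchwork)

set.seed(123)
# Stand-in data (assumption): three groups whose means differ a little or a lot
toy_small <- data.frame(
  x1 = rnorm(60, mean = rep(c(4.5, 5, 5.5), each = 20)),
  x2 = rnorm(60, mean = rep(c(5.5, 4.5, 5.5), each = 20))
)
toy_large <- data.frame(
  x1 = rnorm(60, mean = rep(c(2.5, 5, 7.5), each = 20)),
  x2 = rnorm(60, mean = rep(c(7.5, 2.5, 7.5), each = 20))
)

p_small <- ggplot(toy_small, aes(x = x1, y = x2)) +
  geom_point() +
  ggtitle("Small differences")
p_large <- ggplot(toy_large, aes(x = x1, y = x2)) +
  geom_point() +
  ggtitle("Large differences")

# patchwork overloads `+` so that two ggplot objects are placed side by side
combined <- p_small + p_large
```

Printing combined draws both panels in one frame.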

Hierarchical clustering

4. Run hierarchical clustering on these datasets and display the results as dendrograms. Use Euclidean distances and the complete agglomeration (complete linkage) method. Make sure the two plots have the same y-scale. What is the difference between the dendrograms?

Hint: functions you will need are dist, hclust, ggdendrogram, and ylim.
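
The hinted functions could be combined roughly as follows. This is a sketch on stand-in data (toy_small and toy_large are assumptions, not the datasets generated above), and it assumes ggplot2, ggdendro, and patchwork are installed. The shared upper limit is taken from the tallest merge in either tree.

```r
library(ggplot2)
library(ggdendro)
library(patchwork)

set.seed(123)
# Stand-in data (assumption): three groups with small or large mean shifts
toy_small <- data.frame(
  x1 = rnorm(60, mean = rep(c(4.5, 5, 5.5), each = 20)),
  x2 = rnorm(60, mean = rep(c(5.5, 4.5, 5.5), each = 20))
)
toy_large <- data.frame(
  x1 = rnorm(60, mean = rep(c(2.5, 5, 7.5), each = 20)),
  x2 = rnorm(60, mean = rep(c(7.5, 2.5, 7.5), each = 20))
)

# Euclidean distance and complete linkage are the defaults of dist() and
# hclust(), but both are spelled out here for clarity
hc_small <- hclust(dist(toy_small, method = "euclidean"), method = "complete")
hc_large <- hclust(dist(toy_large, method = "euclidean"), method = "complete")

# A shared upper limit keeps the two height axes comparable
top <- max(hc_small$height, hc_large$height)

dend_small <- ggdendrogram(hc_small, labels = FALSE) + ylim(0, top) + ggtitle("Small")
dend_large <- ggdendrogram(hc_large, labels = FALSE) + ylim(0, top) + ggtitle("Large")
dendrograms <- dend_small + dend_large
```

Printing dendrograms shows both trees on the same height scale. Adding ylim() replaces the y-scale that ggdendrogram() sets, so ggplot2 may print a message about an existing scale being replaced.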

5. For the dataset with small differences, also run hierarchical clustering with complete agglomeration and Manhattan distance.

6. Use the cutree() function to obtain the cluster assignments for three clusters and compare them to the 3-cluster Euclidean solution. Do this comparison by creating two scatter plots with cluster assignment mapped to the colour aesthetic. What differences do you see?
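
As a rough sketch of the comparison, the code below clusters a stand-in dataset (toy_small, an assumption) with both distance measures, cuts each tree into three clusters, and cross-tabulates the assignments. It uses only base R.

```r
set.seed(123)
# Stand-in for the small-difference data (assumption)
toy_small <- data.frame(
  x1 = rnorm(60, mean = rep(c(4.5, 5, 5.5), each = 20)),
  x2 = rnorm(60, mean = rep(c(5.5, 4.5, 5.5), each = 20))
)

hc_euc <- hclust(dist(toy_small, method = "euclidean"), method = "complete")
hc_man <- hclust(dist(toy_small, method = "manhattan"), method = "complete")

# One integer label (1, 2, or 3) per observation
clus_euc <- cutree(hc_euc, k = 3)
clus_man <- cutree(hc_man, k = 3)

# Cluster numbers are arbitrary, so compare memberships rather than labels
agreement <- table(euclidean = clus_euc, manhattan = clus_man)
agreement
```

For the scatter plots, the label vectors can be converted with factor() and mapped to the colour aesthetic, one plot per distance measure.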

K-means clustering

7. Run k-means clustering with 2, 3, 4, and 6 clusters on the large-difference data. Again, create coloured scatter plots for these clustering results.
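
One way to organise this is to fit all four models in a loop and reuse a single plotting function. The sketch below assumes ggplot2 and patchwork are installed; toy_large is stand-in data, and plot_clusters() is a hypothetical helper written for this example.

```r
library(ggplot2)
library(patchwork)

set.seed(123)
# Stand-in for the large-difference data (assumption)
toy_large <- data.frame(
  x1 = rnorm(60, mean = rep(c(2.5, 5, 7.5), each = 20)),
  x2 = rnorm(60, mean = rep(c(7.5, 2.5, 7.5), each = 20))
)

# Fit one k-means model per requested number of clusters
ks   <- c(2, 3, 4, 6)
fits <- lapply(ks, function(k) kmeans(toy_large, centers = k))
names(fits) <- paste0("k", ks)

# Hypothetical helper: scatter plot coloured by cluster assignment
plot_clusters <- function(data, fit) {
  ggplot(data, aes(x = x1, y = x2, colour = factor(fit$cluster))) +
    geom_point() +
    labs(colour = "cluster", title = paste(nrow(fit$centers), "clusters"))
}

# wrap_plots() arranges a list of ggplot objects in a grid
panels <- wrap_plots(lapply(fits, plot_clusters, data = toy_large))
```

Printing panels shows the four solutions together.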

8. Do the same thing again a few times. Do you see the same results every time? Where do you see differences?
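
When centers is given as a number, kmeans() starts from randomly chosen observations, so repeated runs can end in different solutions. The sketch below, on stand-in data (an assumption), records the total within-cluster sum of squares over five runs and compares it with a run that uses multiple random starts via the nstart argument.

```r
set.seed(123)
# Stand-in for the large-difference data (assumption)
toy_large <- data.frame(
  x1 = rnorm(60, mean = rep(c(2.5, 5, 7.5), each = 20)),
  x2 = rnorm(60, mean = rep(c(7.5, 2.5, 7.5), each = 20))
)

# Each call draws new starting centres, so the fit quality can vary
runs <- replicate(5, kmeans(toy_large, centers = 6)$tot.withinss)
runs

# Trying many random starts and keeping the best one reduces this dependence
multi_start <- kmeans(toy_large, centers = 6, nstart = 25)$tot.withinss
multi_start
```

Differences between runs tend to be more visible when the requested number of clusters does not match the structure in the data.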

9. Find a way online to perform bootstrap stability assessment for the 3- and 6-cluster solutions.
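
As one illustration of the idea (not necessarily the method you will find online), the base R sketch below resamples the rows with replacement, reclusters each resample, and records for every original cluster its best Jaccard similarity with a bootstrap cluster. The data are a stand-in, and jaccard() and boot_stability() are hypothetical helpers written for this example.

```r
set.seed(123)
# Stand-in for the large-difference data (assumption)
toy_large <- data.frame(
  x1 = rnorm(60, mean = rep(c(2.5, 5, 7.5), each = 20)),
  x2 = rnorm(60, mean = rep(c(7.5, 2.5, 7.5), each = 20))
)

# Jaccard similarity between two sets of row indices
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Hypothetical helper: mean bootstrap Jaccard similarity per original cluster
boot_stability <- function(data, k, B = 50) {
  original <- kmeans(data, centers = k, nstart = 10)$cluster
  sims <- matrix(NA_real_, nrow = B, ncol = k)
  for (b in seq_len(B)) {
    idx  <- sample(nrow(data), replace = TRUE)      # resample rows
    boot <- kmeans(data[idx, ], centers = k, nstart = 10)$cluster
    rows <- unique(idx)                             # rows present in this resample
    boot_lab <- boot[match(rows, idx)]              # their bootstrap cluster labels
    for (j in seq_len(k)) {
      orig_j <- rows[original[rows] == j]           # original cluster j, restricted
      sims[b, j] <- max(sapply(seq_len(k), function(l) {
        jaccard(orig_j, rows[boot_lab == l])
      }))
    }
  }
  colMeans(sims)
}

stab_3 <- boot_stability(toy_large, k = 3)
stab_6 <- boot_stability(toy_large, k = 6)
```

Values close to 1 indicate clusters that are recovered in most resamples. The clusterboot() function in the fpc package offers a packaged implementation of this kind of assessment.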