```
library(MASS)
library(tidyverse)
library(patchwork)
library(ggdendro)
```

# Hierarchical and k-means clustering

Published: October 18, 2023

# Introduction

In this practical, we will apply hierarchical and k-means clustering to two synthetic datasets.

First, we load the packages required for this lab. Note that we load `MASS` before `tidyverse`; otherwise `MASS::select()` would mask `dplyr::select()`.

Before we start, set a seed for reproducibility (we use `123`) and set `options(scipen = 999)` to suppress scientific notation, making it easier to compare and interpret results later in the session.

# Data processing

1. The data can be generated by running the code below. Try to understand what happens as you run each line.

```
# randomly generate bivariate normal data
set.seed(123)
sigma <- matrix(c(1, .5, .5, 1), 2, 2)
sim_matrix <- mvrnorm(n = 100, mu = c(5, 5), Sigma = sigma)
colnames(sim_matrix) <- c("x1", "x2")

# change to a data frame (tibble) and add a cluster label column
sim_df <-
  sim_matrix |>
  as_tibble() |>
  mutate(class = sample(c("A", "B", "C"), size = 100, replace = TRUE))

# move the clusters to generate separation
sim_df_small <-
  sim_df |>
  mutate(x2 = case_when(class == "A" ~ x2 + .5,
                        class == "B" ~ x2 - .5,
                        class == "C" ~ x2 + .5),
         x1 = case_when(class == "A" ~ x1 - .5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + .5))

sim_df_large <-
  sim_df |>
  mutate(x2 = case_when(class == "A" ~ x2 + 2.5,
                        class == "B" ~ x2 - 2.5,
                        class == "C" ~ x2 + 2.5),
         x1 = case_when(class == "A" ~ x1 - 2.5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + 2.5))
```

2. Prepare two unsupervised datasets by removing the class feature.
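One way to do this is with `dplyr::select()`. The names `df_s` and `df_l` below are a choice made here, matching the objects used in the clustering code further down:

```
# drop the class label to obtain unsupervised versions of the data
df_s <- sim_df_small |> select(-class)
df_l <- sim_df_large |> select(-class)
```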

3. For each of these datasets, create a scatterplot. Combine the two plots into a single frame (look up the “patchwork” package to see how to do this!) What is the difference between the two datasets?
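With patchwork loaded, placing two ggplots side by side is as simple as adding them together with `+`. A minimal sketch (the plot object names `p_s` and `p_l` are made up here; `df_s` and `df_l` are the unsupervised datasets from the previous step):

```
p_s <- ggplot(df_s, aes(x = x1, y = x2)) +
  geom_point() +
  ggtitle("Small differences")

p_l <- ggplot(df_l, aes(x = x1, y = x2)) +
  geom_point() +
  ggtitle("Large differences")

# patchwork overloads `+` to combine the plots into one frame
p_s + p_l
```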

# Hierarchical clustering

4. Run a hierarchical clustering on these datasets and display the results as dendrograms. Use Euclidean distances and the complete agglomeration method. Make sure the two plots have the same y-scale. What is the difference between the dendrograms?

Hint: functions you will need are `hclust`, `ggdendrogram`, and `ylim`.

```
# df_s and df_l are the label-free datasets prepared in step 2
dist_s <- dist(df_s, method = "euclidean")
dist_l <- dist(df_l, method = "euclidean")
res_s <- hclust(dist_s, method = "complete")
res_l <- hclust(dist_l, method = "complete")

ggdendrogram(res_s, labels = FALSE) + ggtitle("Small") + ylim(0, 10) +
  ggdendrogram(res_l, labels = FALSE) + ggtitle("Large") + ylim(0, 10)
```

```
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
```

5. For the dataset with small differences, also run a complete agglomeration hierarchical clustering with the Manhattan distance.
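This only requires changing the `method` argument of `dist()`. A sketch, with the object name `res_s2` chosen to match the comparison code in the next step:

```
# manhattan (city-block) distances, complete linkage as before
dist_s2 <- dist(df_s, method = "manhattan")
res_s2 <- hclust(dist_s2, method = "complete")
```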

6. Use the `cutree()` function to obtain the cluster assignments for three clusters, and compare these assignments to the 3-cluster Euclidean solution. Do this comparison by creating two scatter plots with cluster assignment mapped to the colour aesthetic. What difference do you see?
```
clus_1 <- as_factor(cutree(res_s, 3))  # euclidean solution from step 4
clus_2 <- as_factor(cutree(res_s2, 3)) # manhattan solution from step 5

p1 <- df_s |>
  ggplot(aes(x = x1, y = x2, colour = clus_1)) +
  geom_point() +
  ggtitle("Euclidean") +
  theme(legend.position = "none") +
  coord_fixed()

p2 <- df_s |>
  ggplot(aes(x = x1, y = x2, colour = clus_2)) +
  geom_point() +
  ggtitle("Manhattan") +
  theme(legend.position = "none") +
  coord_fixed()

p1 + p2
```

# K-means clustering

7. Perform k-means clustering with 2, 3, 4, and 6 clusters on the large-difference data. Again, create coloured scatter plots of the clustering results.

```
# set the seed for reproducibility: kmeans uses random starting values
set.seed(45)

# we can do it in a single pipeline and with faceting
# (of course there are many ways to do this, though)
df_l |>
  mutate(
    class_2 = as_factor(kmeans(df_l, 2)$cluster),
    class_3 = as_factor(kmeans(df_l, 3)$cluster),
    class_4 = as_factor(kmeans(df_l, 4)$cluster),
    class_6 = as_factor(kmeans(df_l, 6)$cluster)
  ) |>
  pivot_longer(cols = c(class_2, class_3, class_4, class_6),
               names_to = "class_num", values_to = "cluster") |>
  ggplot(aes(x = x1, y = x2, colour = cluster)) +
  geom_point() +
  scale_colour_brewer(type = "qual") + # easy-to-distinguish qualitative scale
  facet_wrap(~class_num)
```