Make sure to load MASS before the tidyverse; otherwise MASS::select() will mask dplyr::select().
Before we start, set a seed for reproducibility (we use 123) and set options(scipen = 999) to suppress scientific notation, making it easier to compare and interpret results later in the session.
set.seed(123)
options(scipen = 999)
Data processing
1. The data can be generated by running the code below. Try to understand what is happening as you run each line.
# randomly generate bivariate normal data
set.seed(123)
sigma <- matrix(c(1, .5, .5, 1), 2, 2)
sim_matrix <- mvrnorm(n = 100, mu = c(5, 5), Sigma = sigma)
colnames(sim_matrix) <- c("x1", "x2")

# change to a data frame (tibble) and add a cluster label column
sim_df <- sim_matrix |>
  as_tibble() |>
  mutate(class = sample(c("A", "B", "C"), size = 100, replace = TRUE))

# Move the clusters to generate separation
sim_df_small <- sim_df |>
  mutate(
    x2 = case_when(class == "A" ~ x2 + .5, class == "B" ~ x2 - .5, class == "C" ~ x2 + .5),
    x1 = case_when(class == "A" ~ x1 - .5, class == "B" ~ x1 - 0, class == "C" ~ x1 + .5)
  )

sim_df_large <- sim_df |>
  mutate(
    x2 = case_when(class == "A" ~ x2 + 2.5, class == "B" ~ x2 - 2.5, class == "C" ~ x2 + 2.5),
    x1 = case_when(class == "A" ~ x1 - 2.5, class == "B" ~ x1 - 0, class == "C" ~ x1 + 2.5)
  )
2. Prepare two unsupervised datasets by removing the class feature.
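A minimal sketch, assuming the data frames from step 1 exist; the `_u` ("unsupervised") suffix is our own naming choice:

```r
# Drop the class label so the clustering methods cannot "see" it
library(dplyr)

sim_df_small_u <- sim_df_small |> select(-class)
sim_df_large_u <- sim_df_large |> select(-class)
```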
3. For each of these datasets, create a scatterplot. Combine the two plots into a single frame (look up the “patchwork” package to see how to do this!) What is the difference between the two datasets?
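One way to do this, assuming the unlabelled data frames `sim_df_small_u` / `sim_df_large_u` from the previous step: patchwork overloads `+` so that two ggplots placed side by side form a single frame.

```r
library(ggplot2)
library(patchwork)

# Scatterplot of each unlabelled dataset
p_small <- ggplot(sim_df_small_u, aes(x = x1, y = x2)) +
  geom_point() +
  ggtitle("Small separation")
p_large <- ggplot(sim_df_large_u, aes(x = x1, y = x2)) +
  geom_point() +
  ggtitle("Large separation")

# patchwork: combine the two plots into one frame
p_small + p_large
```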
Hierarchical clustering
4. Run a hierarchical clustering on these datasets and display the result as dendrograms. Use Euclidean distances and the complete agglomeration method. Make sure the two plots have the same y-scale. What is the difference between the dendrograms?
Hint: functions you will need are hclust(), ggdendrogram(), and ylim().
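A sketch of the hint above: hclust() and dist() are in base R (stats), while ggdendrogram() comes from the ggdendro package. The upper ylim bound of 10 is an assumption; pick one large enough to fit both dendrograms so their y-scales match.

```r
library(ggplot2)
library(ggdendro)

# Complete-linkage hierarchical clustering with Euclidean distances
hc_small <- hclust(dist(sim_df_small_u, method = "euclidean"), method = "complete")
hc_large <- hclust(dist(sim_df_large_u, method = "euclidean"), method = "complete")

# Same y-scale on both dendrograms for a fair comparison
ggdendrogram(hc_small) + ylim(0, 10)
ggdendrogram(hc_large) + ylim(0, 10)
```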
5. For the dataset with small differences, also run a complete-agglomeration hierarchical clustering with the Manhattan distance.
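Same data and linkage as before; only the distance metric changes (the object name `hc_small_man` is our own choice):

```r
# Complete linkage, Manhattan (city-block) distances
hc_small_man <- hclust(dist(sim_df_small_u, method = "manhattan"),
                       method = "complete")
```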
6. Use the cutree() function to obtain the cluster assignments for three clusters and compare the cluster assignments to the 3-cluster Euclidean solution. Do this comparison by creating two scatter plots with cluster assignment mapped to the colour aesthetic. What difference do you see?
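A sketch, assuming the `hc_small` (Euclidean) and `hc_small_man` (Manhattan) objects from the previous steps; cutree() returns an integer vector of cluster memberships, which we wrap in factor() so ggplot treats it as discrete.

```r
library(ggplot2)
library(patchwork)

# Cut both trees at three clusters
clus_euc <- cutree(hc_small, k = 3)
clus_man <- cutree(hc_small_man, k = 3)

# Colour the points by cluster assignment
p_euc <- ggplot(sim_df_small_u, aes(x1, x2, colour = factor(clus_euc))) +
  geom_point() +
  ggtitle("Euclidean, complete linkage")
p_man <- ggplot(sim_df_small_u, aes(x1, x2, colour = factor(clus_man))) +
  geom_point() +
  ggtitle("Manhattan, complete linkage")

p_euc + p_man
```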
K-means clustering
7. Create k-means clusterings with 2, 3, 4, and 6 clusters on the large-difference data. Again, create coloured scatter plots for these clustering results.
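One possible approach, assuming `sim_df_large_u` from step 2. kmeans() is in base R (stats); with the default single random start (nstart = 1), the result depends on the random seed, which matters for the next exercise.

```r
library(ggplot2)

set.seed(123)
for (k in c(2, 3, 4, 6)) {
  km <- kmeans(sim_df_large_u, centers = k)
  # print() is needed to render ggplots inside a loop
  print(
    ggplot(sim_df_large_u, aes(x1, x2, colour = factor(km$cluster))) +
      geom_point() +
      ggtitle(paste(k, "clusters"))
  )
}
```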
8. Do the same thing again a few times. Do you see the same results every time? Where do you see differences?
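A quick way to compare two runs, assuming `sim_df_large_u` as before: without resetting the seed, two k-means runs can yield different partitions (not just permuted labels), because k-means only finds a local optimum.

```r
# Two independent runs with the default random initialisation
km_a <- kmeans(sim_df_large_u, centers = 6)
km_b <- kmeans(sim_df_large_u, centers = 6)

# Cross-tabulate the two assignments: a pure relabelling shows one
# non-zero cell per row/column; a genuinely different partition does not
table(km_a$cluster, km_b$cluster)
```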
9. Find a way online to perform a bootstrap stability assessment for the 3- and 6-cluster solutions.
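One option, sketched here under the assumption that the fpc package is installed, is fpc::clusterboot() with its k-means interface kmeansCBI. The bootmean element holds the mean Jaccard similarity per cluster across bootstrap resamples; values above roughly 0.75 are commonly read as stable.

```r
library(fpc)

# Bootstrap stability of the 3- and 6-cluster k-means solutions
boot3 <- clusterboot(sim_df_large_u, B = 100,
                     clustermethod = kmeansCBI, krange = 3, seed = 123)
boot6 <- clusterboot(sim_df_large_u, B = 100,
                     clustermethod = kmeansCBI, krange = 6, seed = 123)

boot3$bootmean  # mean Jaccard similarity per cluster
boot6$bootmean
```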