Supervised learning: classification

Published

October 6, 2023

Introduction

In this practical, we will learn to use three different classification methods: K-nearest neighbours, logistic regression, and linear discriminant analysis.

One of the packages we are going to use is class. If it is not yet installed on your machine, run install.packages("class") before running the library() calls below.
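As an optional check before loading anything, the sketch below lists which of the packages used in this practical are not yet installed. The helper name missing_packages() is hypothetical and not part of any package; it only compares names against installed.packages() and installs nothing by itself.

```r
# Packages loaded in this practical.
needed <- c("MASS", "class", "ISLR", "tidyverse")

# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(needed)
to_install

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that running the block never changes your library without you choosing to.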

library(MASS)
library(class)
library(ISLR)
library(tidyverse)

Make sure to load MASS before tidyverse; otherwise, the function MASS::select() will mask dplyr::select().
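To check which select() is active in your session, something like the following sketch may help. It assumes MASS and dplyr are installed, and it loads dplyr directly rather than the whole tidyverse only to keep the example small. The small data frame is made up for illustration.

```r
library(MASS)
library(dplyr)  # also attached by library(tidyverse)

# find() lists every attached package that provides `select`, in search-path order.
# The first entry is the version a bare select() call would use.
select_sources <- find("select")
select_sources

# Whatever the load order, a namespace prefix always picks the intended function.
toy <- data.frame(balance = c(100, 200), income = c(10, 20))  # illustrative data
balance_only <- dplyr::select(toy, balance)
balance_only
```

If "package:MASS" comes first in select_sources, either reload the packages in the right order or write dplyr::select() explicitly.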

Default dataset

The Default dataset contains credit card loan data for 10 000 people. The goal is to classify each person as Yes or No, according to whether they will default on their loan.
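Before plotting, it can be useful to look at the structure of the data and at how many people fall into each class. The sketch below assumes the ISLR package is installed and uses only base R functions on the Default data frame.

```r
library(ISLR)

# Structure of the data: one row per person, with the outcome stored in `default`.
str(Default)

# Class counts for the outcome, as raw counts and as proportions.
default_counts <- table(Default$default)
default_counts
prop.table(default_counts)
```

The proportions are worth keeping in mind later: when one class is much rarer than the other, overall accuracy can look good even for a classifier that rarely predicts the rare class.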

1. Create a scatterplot of the Default dataset, where balance is mapped to the x position, income is mapped to the y position, and default is mapped to the colour. Can you see any interesting patterns already?
Default |>  
  arrange(default) |> # sort so the "Yes" points (yellow) are drawn last, on top of the darker "No" points
  ggplot(aes(x = balance, y = income, colour = default)) +
  geom_point(size = 1.3) +
  theme_minimal() +
  scale_colour_viridis_d() # optional custom colour scale