Supervised learning: regression in R

Published

October 3, 2023

Introduction

In this practical, you will learn how to perform regression analysis in R, create predictions, plot fits with confidence and prediction intervals, calculate the mean squared error (MSE), perform train-test splits, and write a function for cross-validation.

Just like in the practical at the end of chapter 3 of the ISLR book, we will use the Boston dataset, which is in the MASS package that comes with R.

library(ISLR)
library(MASS)
library(tidyverse)

Make sure to load MASS before tidyverse, otherwise MASS::select() will mask dplyr::select().
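If the packages do end up loaded in the other order, you can always call the dplyr version explicitly; a minimal sketch:

# explicitly use dplyr's select(), regardless of package loading order
Boston %>% dplyr::select(medv, lstat) %>% head()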

Regression in R

Regression is performed through the lm() function. It requires two arguments: a formula and data. A formula is a specific type of object that can be constructed like so:

some_formula <- outcome ~ predictor_1 + predictor_2 

You can read it as “the outcome variable is a function of predictors 1 and 2”. As with other objects, you can check its class and even convert it to other classes, such as a character vector:

class(some_formula)
as.character(some_formula)
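If you run these two lines, you should see output along these lines (shown here as comments; the exact formatting may differ):

class(some_formula)
# [1] "formula"
as.character(some_formula)
# [1] "~"  "outcome"  "predictor_1 + predictor_2"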

You can estimate a linear model using lm() by specifying the outcome variable and the predictors in a formula and by inputting the dataset these variables should be taken from.

1. Create a linear model object called lm_ses using the formula medv ~ lstat and the Boston dataset.
lm_ses <- lm(formula = medv ~ lstat, data = Boston)

You have now trained a regression model with medv (median housing value) as the outcome/dependent variable and lstat (proportion of low socio-economic status households) as the predictor/independent variable.

Remember that a regression estimates \(\beta_0\) (the intercept) and \(\beta_1\) (the slope) in the following equation:

\[\boldsymbol{y} = \beta_0 + \beta_1\cdot \boldsymbol{x}_1 + \boldsymbol{\epsilon}\]

2. Use the function coef() to extract the intercept and slope from the lm_ses object. Interpret the slope coefficient.
coef(lm_ses)
(Intercept)       lstat 
 34.5538409  -0.9500494 
# for each percentage point increase in lstat, the predicted median
# housing value drops by 0.95 (i.e., $950, since medv is in units of $1000)
3. Use summary() to get a summary of the lm_ses object. What do you see? You can use the help file ?summary.lm.
summary(lm_ses)

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

We now have a model object lm_ses that represents the formula

\[\text{medv}_i = 34.55 - 0.95 * \text{lstat}_i + \epsilon_i\]
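
For example, for a town where lstat = 10, the model predicts

\[\widehat{\text{medv}} = 34.55 - 0.95 \cdot 10 \approx 25.05,\]

i.e. a median house value of about $25,000 (medv is recorded in units of $1000).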

With this object, we can predict a new medv value by inputting its lstat value. The predict() method enables us to do this for the lstat values in the original dataset.

4. Save the predicted y values to a variable called y_pred.
y_pred <- predict(lm_ses)
5. Create a scatter plot with y_pred mapped to the x position and the true y value (Boston$medv) mapped to the y position. What do you see? What would this plot look like if the fit were perfect?
tibble(pred = y_pred, 
       obs  = Boston$medv) %>% 
  ggplot(aes(x = pred, y = obs)) +
  geom_point() +
  theme_minimal() +
  geom_abline(slope = 1)

# I've added an ideal line on which all the points would lie if the 
# fit were perfect.

We can also generate predictions from new data using the newdata argument in the predict() method. For that, we need to prepare a data frame with new values for the original predictors.
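
For example, a prediction for a single hypothetical town with lstat = 10 (the value 10 is arbitrary, just for illustration) would look like this:

predict(lm_ses, newdata = data.frame(lstat = 10))
# roughly 25.05, matching the calculation by hand above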

6. Use the seq() function to generate a sequence of 1000 equally spaced values from 0 to 40. Store this vector in a data frame (using data.frame() or tibble()) with lstat as the column name. Name the data frame pred_dat.
pred_dat <- tibble(lstat = seq(0, 40, length.out = 1000))
7. Use the newly created data frame as the newdata argument to a predict() call for lm_ses. Store it in a variable named y_pred_new.
y_pred_new <- predict(lm_ses, newdata = pred_dat)

Plotting lm() in ggplot

A good way of understanding your model is by visualizing it. We are going to walk through the construction of a plot with a fit line and prediction / confidence intervals from an lm object.

8. Create a scatter plot from the Boston dataset with lstat mapped to the x position and medv mapped to the y position. Store the plot in an object called p_scatter.
p_scatter <- 
  Boston %>% 
  ggplot(aes(x = lstat, y = medv)) +
  geom_point() +
  theme_minimal()

p_scatter

Now we’re going to add a prediction line to this plot.

9. Add the vector y_pred_new to the pred_dat data frame with the name medv.
# this can be done in several ways. Here are two possibilities:
# pred_dat$medv <- y_pred_new
pred_dat <- pred_dat %>% mutate(medv = y_pred_new)
10. Add a geom_line() to p_scatter, with pred_dat as the data argument. What does this line represent?
p_scatter + geom_line(data = pred_dat)

# This line represents the predicted values of medv for the values of lstat in pred_dat.
11. The interval argument can be used to generate confidence or prediction intervals. Create a new object called y_pred_95 using predict() (again with the pred_dat data) with the interval argument set to “confidence”. What is in this object?
y_pred_95 <- predict(lm_ses, newdata = pred_dat, interval = "confidence")

head(y_pred_95)
       fit      lwr      upr
1 34.55384 33.44846 35.65922
2 34.51580 33.41307 35.61853
3 34.47776 33.37768 35.57784
4 34.43972 33.34229 35.53715
5 34.40168 33.30690 35.49646
6 34.36364 33.27150 35.45578
# it's a matrix with three columns: the fitted value (fit) and the
# lower (lwr) and upper (upr) bounds of the 95% confidence interval.
12. Create a data frame with 4 columns: lstat, medv, lower, and upper using the pred_dat data and y_pred_95.
gg_pred <- tibble(
  lstat = pred_dat$lstat,
  medv  = y_pred_95[, 1],
  lower = y_pred_95[, 2],
  upper = y_pred_95[, 3]
)

gg_pred
# A tibble: 1,000 x 4
    lstat  medv lower upper
    <dbl> <dbl> <dbl> <dbl>
 1 0       34.6  33.4  35.7
 2 0.0400  34.5  33.4  35.6
 3 0.0801  34.5  33.4  35.6
 4 0.120   34.4  33.3  35.5
 5 0.160   34.4  33.3  35.5
 6 0.200   34.4  33.3  35.5
 7 0.240   34.3  33.2  35.4
 8 0.280   34.3  33.2  35.4
 9 0.320   34.2  33.2  35.3
10 0.360   34.2  33.1  35.3
# ℹ 990 more rows
13. Add a geom_ribbon() to the plot with the data frame you just made. The ribbon geom requires three aesthetics: x (lstat, already mapped), ymin (lower), and ymax (upper).

Add the ribbon first, then the geom_line() and the geom_point() from before to make sure they remain visible. Give it a nice colour and clean up the plot, too!

# Create the plot
Boston %>% 
  ggplot(aes(x = lstat, y = medv)) + 
  geom_ribbon(aes(ymin = lower, ymax = upper), data = gg_pred, fill = "#00008b44") +
  geom_point(colour = "#883321") + 
  geom_line(data = pred_dat, colour = "#00008b", linewidth = 1) +
  theme_minimal() + 
  labs(x    = "Proportion of low SES households",
       y    = "Median house value",
       title = "Boston house prices")

14. Explain in your own words what the ribbon represents.
# The ribbon represents the 95% confidence interval of the fit line.
# The uncertainty in the estimates of the coefficients is taken into
# account with this ribbon. 

# You can think of it as follows:
# upon repeated sampling of data from the same population, about 95% of
# the ribbons will contain the true fit line at any given value of lstat.
15. Do the same thing, but now with the prediction interval instead of the confidence interval.
# pred with pred interval
y_pred_95 <- predict(lm_ses, newdata = pred_dat, interval = "prediction")


# create the df
gg_pred <- tibble(
  lstat = pred_dat$lstat,
  medv  = y_pred_95[, 1],
  l95   = y_pred_95[, 2],
  u95   = y_pred_95[, 3]
)

# Create the plot
Boston %>% 
  ggplot(aes(x = lstat, y = medv)) + 
  geom_ribbon(aes(ymin = l95, ymax = u95), data = gg_pred, fill = "#00008b44") +
  geom_point(colour = "#883321") + 
  geom_line(data = pred_dat, colour = "#00008b", linewidth = 1) +
  theme_minimal() + 
  labs(x     = "Proportion of low SES households",
       y     = "Median house value",
       title = "Boston house prices")

# You can look at ISLR p.82-83 for a discussion of prediction intervals

Mean squared error

16. Write a function called mse() that takes in two vectors (true y values and predicted y values) and outputs the mean squared error.

Start like so:

mse <- function(y_true, y_pred) {
  # your function here
}

Wikipedia may help for the formula.
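
For reference, the mean squared error of predictions \(\hat{y}_i\) for true values \(y_i\), with \(i = 1, \dots, n\), is

\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]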

# there are many ways of doing this.
mse <- function(y_true, y_pred) {
  mean((y_true - y_pred)^2)
}
17. Make sure your mse() function works correctly by running the following code.
mse(1:10, 10:1)
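
If your function is correct, this call should return 33: the differences between the two vectors are -9, -7, ..., 7, 9, and their squares (81, 49, 25, 9, 1, 1, 9, 25, 49, 81) average to 33.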

You have now calculated the mean squared length of the dashed lines below.

18. Calculate the mean squared error of the lm_ses model. Use the medv column as y_true and use the predict() method to generate y_pred.
mse(Boston$medv, predict(lm_ses))
[1] 38.48297

You have calculated the mean squared length of the dashed lines in the plot below.

Train-validation-test split

Now we will use the sample() function to randomly assign the observations of the Boston dataset to a training, validation, and test set. The training set will be used to fit our model, the validation set will be used to calculate the out-of-sample prediction error during model building (also known as model selection or hyperparameter tuning), and the test set will be used to estimate the true out-of-sample MSE.
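
Because sample() draws randomly, your split (and therefore the exact numbers below) will differ from run to run. If you want a reproducible split, you can set a seed before sampling; the seed value below is arbitrary:

# optional: make the random split reproducible
set.seed(45)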

19. The Boston dataset has 506 observations. Use c() and rep() to create a vector with 253 times the word “train”, 152 times the word “validation”, and 101 times the word “test”. Call this vector splits.
splits <- c(rep("train", 253), rep("validation", 152), rep("test", 101))
20. Use the function sample() to randomly order this vector and add it to the Boston dataset using mutate(). Assign the newly created dataset to a variable called boston_master.
boston_master <- Boston %>% mutate(splits = sample(splits))
21. Now use filter() to create a training, validation, and test set from the boston_master data. Call these datasets boston_train, boston_valid, and boston_test.
boston_train <- boston_master %>% filter(splits == "train")
boston_valid <- boston_master %>% filter(splits == "validation")
boston_test  <- boston_master %>% filter(splits == "test")
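
A quick sanity check: the three subsets should contain exactly 253, 152, and 101 rows, respectively.

nrow(boston_train)  # 253
nrow(boston_valid)  # 152
nrow(boston_test)   # 101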

We will set aside the boston_test dataset for now.

22. Train a linear regression model called model_1 using the training dataset. Use the formula medv ~ lstat like in the first lm() exercise. Use summary() to check that this object is as you expect.
model_1 <- lm(medv ~ lstat, data = boston_train)
summary(model_1)

Call:
lm(formula = medv ~ lstat, data = boston_train)

Residuals:
   Min     1Q Median     3Q    Max 
-9.769 -4.230 -1.498  1.921 24.245 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 35.46800    0.79695   44.51   <2e-16 ***
lstat       -1.01923    0.05501  -18.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.396 on 251 degrees of freedom
Multiple R-squared:  0.5777,    Adjusted R-squared:  0.576 
F-statistic: 343.3 on 1 and 251 DF,  p-value: < 2.2e-16
23. Calculate the MSE with this object. Save this value as model_1_mse_train.
model_1_mse_train <- mse(y_true = boston_train$medv, y_pred = predict(model_1))
24. Now calculate the MSE on the validation set and assign it to variable model_1_mse_valid. Hint: use the newdata argument in predict().
model_1_mse_valid <- mse(y_true = boston_valid$medv, 
                         y_pred = predict(model_1, newdata = boston_valid))

This is the estimated out-of-sample mean squared error.

25. Create a second model model_2 for the train data which includes age and tax as predictors. Calculate the train and validation MSE.
model_2 <- lm(medv ~ lstat + age + tax, data = boston_train)
model_2_mse_train <- mse(y_true = boston_train$medv, y_pred = predict(model_2))
model_2_mse_valid <- mse(y_true = boston_valid$medv, 
                         y_pred = predict(model_2, newdata = boston_valid))
26. Compare model 1 and model 2 in terms of their training and validation MSE. Which would you choose and why?
# If you are interested in out-of-sample prediction, the 
# answer may depend on the random split of the rows over the
# sets: everyone has a different split. However, it is likely
# that model_2 has both a lower training and a lower validation MSE.
27. Calculate the test MSE for the model of your choice in the previous question. What does this number tell you?
model_2_mse_test <- mse(y_true = boston_test$medv, 
                        y_pred = predict(model_2, newdata = boston_test))

# The estimated expected prediction error when predicting the median
# house value of a previously unseen town with this model (the root
# mean squared error, on the scale of medv, i.e. $1000s) is:

sqrt(model_2_mse_test)
[1] 6.027941

OPTIONAL Programming exercise: cross-validation

This is an advanced exercise. Some components we have seen before in this and previous practicals, but some things will be completely new. Try to complete it by yourself, but don’t worry if you get stuck. If you don’t know about for loops in R, read up on those before you start the exercise.
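
As a refresher, a for loop repeats a block of code for each element of a vector; a minimal example:

# fill a vector with the squares of the numbers 1 to 5
squares <- rep(0, 5)
for (i in 1:5) {
  squares[i] <- i^2
}
squares
# [1]  1  4  9 16 25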

Use help in this order:

  • R help files
  • Internet search & stack exchange
  • Your peers
  • The answer, which shows one solution

You may also just read the answer and try to understand what happens in each step.

28. Create a function that performs k-fold cross-validation for linear models.

Inputs:

  • formula: a formula just as in the lm() function
  • dataset: a data frame
  • k: the number of folds for cross validation
  • any other arguments you need

Outputs:

  • Mean squared error averaged over folds
# Just for reference, here is the mse() function once more
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)

cv_lm <- function(formula, dataset, k) {
  # We can do some error checking before starting the function
  stopifnot(is_formula(formula))       # formula must be a formula
  stopifnot(is.data.frame(dataset))    # dataset must be data frame
  stopifnot(is.integer(as.integer(k))) # k must be convertible to int
  
  # first, add a selection column to the dataset as before
  n_samples  <- nrow(dataset)
  select_vec <- rep(1:k, length.out = n_samples)
  data_split <- dataset %>% mutate(folds = sample(select_vec))
  
  # initialise an output vector of k mse values, which we 
  # will fill by using a _for loop_ going over each fold
  mses <- rep(0, k)
  
  # start the for loop
  for (i in 1:k) {
    # split the data in train and validation set
    data_train <- data_split %>% filter(folds != i)
    data_valid <- data_split %>% filter(folds == i)
    
    # calculate the model on this data
    model_i <- lm(formula = formula, data = data_train)
    
    # Extract the y column name from the formula
    y_column_name <- as.character(formula)[2]
    
    # calculate the mean squared error and assign it to mses
    mses[i] <- mse(y_true = data_valid[[y_column_name]],
                   y_pred = predict(model_i, newdata = data_valid))
  }
  
  # now we have a vector of k mse values. All we need is to
  # return the mean mse!
  mean(mses)
}
29. Use your function to perform 9-fold cross-validation with a linear model with medv ~ lstat + age + tax as its formula. Compare it to a model with medv ~ lstat + I(lstat^2) + age + tax as its formula.
cv_lm(formula = medv ~ lstat + age + tax, dataset = Boston, k = 9)
[1] 37.71756
cv_lm(formula = medv ~ lstat + I(lstat^2) + age + tax, dataset = Boston, k = 9)
[1] 28.18794
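
With this (random) assignment of observations to folds, the model that includes the quadratic lstat term has a markedly lower cross-validated MSE, suggesting that the relationship between lstat and medv is better described by a curve than by a straight line.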