Supervised learning: regression in R

Published

October 3, 2023

Introduction

In this practical, you will learn how to perform regression analysis in R, create predictions, plot with confidence and prediction intervals, calculate the mean squared error (MSE), perform train-test splits, and write a function for cross-validation.

Just like in the practical at the end of chapter 3 of the ISLR book, we will use the Boston dataset, which is in the MASS package that comes with R.

library(ISLR)
library(MASS)
library(tidyverse)

Make sure to load MASS before tidyverse; otherwise, MASS::select() will mask dplyr::select().

Regression in R

Regression is performed through the lm() function. It requires two main arguments: a formula and a dataset. A formula is a specific type of object that can be constructed like so:

some_formula <- outcome ~ predictor_1 + predictor_2 

You can read it as “the outcome variable is a function of predictors 1 and 2”. As with other objects, you can check its class and even convert it to other classes, such as a character vector:

class(some_formula)
as.character(some_formula)
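As a small illustration (using placeholder variable names, not the Boston data), here is what those two calls return for a two-sided formula:

```r
# A formula is stored as a language object; for a two-sided formula,
# as.character() splits it into the tilde, the left-hand side,
# and the right-hand side.
f <- y ~ x1 + x2
class(f)         # "formula"
as.character(f)  # "~" "y" "x1 + x2"
```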

You can estimate a linear model using lm() by specifying the outcome variable and the predictors in a formula and by inputting the dataset these variables should be taken from.

1. Create a linear model object called lm_ses using the formula medv ~ lstat and the Boston dataset.
lm_ses <- lm(formula = medv ~ lstat, data = Boston)

You have now trained a regression model with medv (median housing value) as the outcome/dependent variable and lstat (the percentage of the population with lower socio-economic status) as the predictor/independent variable.

Remember that a regression estimates \(\beta_0\) (the intercept) and \(\beta_1\) (the slope) in the following equation:

\[\boldsymbol{y} = \beta_0 + \beta_1\cdot \boldsymbol{x}_1 + \boldsymbol{\epsilon}\]

2. Use the function coef() to extract the intercept and slope from the lm_ses object. Interpret the slope coefficient.
coef(lm_ses)
(Intercept)       lstat 
 34.5538409  -0.9500494 
# for each point increase in lstat, the predicted median housing value
# drops by 0.95 (i.e., $950, since medv is measured in units of $1000)
3. Use summary() to get a summary of the lm_ses object. What do you see? You can use the help file ?summary.lm.
summary(lm_ses)

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

We now have a model object lm_ses that represents the formula

\[\text{medv}_i = 34.55 - 0.95 * \text{lstat}_i + \epsilon_i\]

With this object, we can predict a new medv value by inputting its lstat value. The predict() method enables us to do this for the lstat values in the original dataset.

4. Save the predicted y values to a variable called y_pred
y_pred <- predict(lm_ses)
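The introduction mentions calculating the MSE; as a preview, here is a minimal sketch of turning these in-sample predictions into a mean squared error (the model fit is repeated so the block is self-contained):

```r
library(MASS)  # for the Boston data

# refit the model and recreate the predictions from the previous steps
lm_ses <- lm(medv ~ lstat, data = Boston)
y_pred <- predict(lm_ses)

# In-sample mean squared error: the average squared distance between
# the observed outcomes and the model's predictions.
mse <- mean((Boston$medv - y_pred)^2)
mse
```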
5. Create a scatter plot with y_pred mapped to the x position and the true y value (Boston$medv) mapped to the y position. What do you see? What would this plot look like if the fit were perfect?
tibble(pred = y_pred, 
       obs  = Boston$medv) %>% 
  ggplot(aes(x = pred, y = obs)) +
  geom_point() +
  theme_minimal() +
  geom_abline(slope = 1)

# I've added an ideal line on which all the points would lie if the
# fit were perfect.

We can also generate predictions from new data using the newdata argument in the predict() method. For that, we need to prepare a data frame with new values for the original predictors.

6. Use the seq() function to generate a sequence of 1000 equally spaced values from 0 to 40. Store this vector in a data frame (using data.frame() or tibble()) with column name lstat. Name the data frame pred_dat.
pred_dat <- tibble(lstat = seq(0, 40, length.out = 1000))
7. Use the newly created data frame as the newdata argument to a predict() call for lm_ses. Store it in a variable named y_pred_new.
y_pred_new <- predict(lm_ses, newdata = pred_dat)
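Looking ahead to plotting confidence and prediction intervals, the interval argument of predict() for lm objects returns the interval bounds alongside the fit; a minimal self-contained sketch:

```r
library(MASS)    # for the Boston data
library(tibble)

lm_ses   <- lm(medv ~ lstat, data = Boston)
pred_dat <- tibble(lstat = seq(0, 40, length.out = 1000))

# interval = "confidence" returns a matrix with columns fit, lwr, and upr;
# interval = "prediction" works the same way but yields wider bounds,
# since it accounts for the error variance of individual observations.
conf_int <- predict(lm_ses, newdata = pred_dat, interval = "confidence")
head(conf_int)
```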

Plotting lm() in ggplot

A good way of understanding your model is by visualizing it. We are going to walk through the construction of a plot with a fit line and prediction / confidence intervals from an lm object.

8. Create a scatter plot from the Boston dataset with lstat mapped to the x position and medv mapped to the y position. Store the plot in an object called p_scatter.
p_scatter <- 
  Boston %>% 
  ggplot(aes(x = lstat, y = medv)) +
  geom_point() +
  theme_minimal()

p_scatter

Now we’re going to add a prediction line to this plot.

9. Add the vector y_pred_new to the pred_dat data frame with the name medv.
# this can be done in several ways. Here are two possibilities:
# pred_dat$medv <- y_pred_new
pred_dat <- pred_dat %>% mutate(medv = y_pred_new)
10. Add a geom_line() to p_scatter, with pred_dat as the data argument. What does this line represent?
p_scatter + geom_line(data = pred_dat)

# This line represents the predicted medv values for the lstat values
# in pred_dat, i.e., the regression line of the fitted model.
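As a cross-check (not part of the exercise), ggplot2 can fit and draw the same least-squares line itself with geom_smooth(method = "lm"); it should coincide with the line predicted manually above:

```r
library(MASS)    # for the Boston data
library(ggplot2)

# geom_smooth(method = "lm") fits lm(y ~ x) on the plotted data and
# draws the fitted line; se = FALSE suppresses the confidence band.
ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
```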