```
library(ISLR)
library(MASS)
library(tidyverse)
```

# Supervised learning: regression in R

# Introduction

In this practical, you will learn how to perform regression analysis in R, create predictions, plot with confidence and prediction intervals, calculate the MSE, perform train-test splits, and write a function for cross-validation.

Just like in the practical at the end of chapter 3 of the ISLR book, we will use the `Boston` dataset, which is in the `MASS` package that comes with `R`.

Make sure to load `MASS` before `tidyverse`; otherwise, the function `MASS::select()` will mask `dplyr::select()`.

# Regression in `R`

Regression is performed through the `lm()` function. It requires two arguments: a `formula` and `data`. A `formula` is a specific type of object that can be constructed like so:

`some_formula <- outcome ~ predictor_1 + predictor_2`

You can read it as “the outcome variable is a function of predictors 1 and 2”. As with other objects, you can check its class and even convert it to other classes, such as a character vector:

```
class(some_formula)
as.character(some_formula)
```

You can estimate a linear model using `lm()` by specifying the outcome variable and the predictors in a formula and by inputting the dataset these variables should be taken from.
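
As a minimal sketch of such a call, the block below fits `medv` on `lstat` using the `Boston` data and stores the result in `lm_ses`, the object name used in the rest of this section. The calls to `coef()` and `summary()` are added here for inspection only and are not required by the practical.

```r
library(MASS)  # provides the Boston dataset

# Fit a simple linear regression: median home value as a function of lstat
lm_ses <- lm(formula = medv ~ lstat, data = Boston)

# Estimated intercept (beta_0) and slope (beta_1)
coef(lm_ses)

# Full model summary: standard errors, t-values, R-squared
summary(lm_ses)
```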

You have now trained a regression model with `medv` (median housing value) as the outcome / dependent variable and `lstat` (the percentage of the population with lower socio-economic status) as the predictor / independent variable.

Remember that a regression estimates \(\beta_0\) (the intercept) and \(\beta_1\) (the slope) in the following equation:

\[\boldsymbol{y} = \beta_0 + \beta_1\cdot \boldsymbol{x}_1 + \boldsymbol{\epsilon}\]

We now have a model object `lm_ses` that represents the formula

\[\text{medv}_i = 34.55 - 0.95 \cdot \text{lstat}_i + \epsilon_i\]

With this object, we can predict a new `medv` value by inputting its `lstat` value. The `predict()` method enables us to do this for the `lstat` values in the original dataset.
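
A short sketch of this step follows. The model is refit so the block runs on its own, and the name `y_pred` is an assumption chosen for illustration. Called without `newdata`, `predict()` returns one fitted value per row of the data used to fit the model.

```r
library(MASS)  # provides the Boston dataset

# Refit the model so this block is self-contained
lm_ses <- lm(medv ~ lstat, data = Boston)

# Predictions for the lstat values in the original dataset
y_pred <- predict(lm_ses)

# First few predicted medv values
head(y_pred)

# Put observed and predicted values side by side for a quick comparison
head(data.frame(observed = Boston$medv, predicted = y_pred))
```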

We can also generate predictions from new data using the `newdata` argument in the `predict()` method. For that, we need to prepare a data frame with new values for the original predictors.
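
The sketch below shows one way to do this. The names `pred_dat`, `y_pred_new`, and `y_conf`, as well as the grid of 1000 `lstat` values between 0 and 45, are assumptions made for illustration. The column in the new data frame must have the same name as the predictor in the formula.

```r
library(MASS)  # provides the Boston dataset

# Refit the model so this block is self-contained
lm_ses <- lm(medv ~ lstat, data = Boston)

# New data: an evenly spaced grid of lstat values (illustrative choice)
pred_dat <- data.frame(lstat = seq(0, 45, length.out = 1000))

# Point predictions for the new lstat values
y_pred_new <- predict(lm_ses, newdata = pred_dat)

# The same predictions with a 95% confidence interval (columns fit, lwr, upr)
y_conf <- predict(lm_ses, newdata = pred_dat, interval = "confidence")
head(y_conf)
```

Setting `interval = "prediction"` instead returns the wider interval for individual new observations.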

# Plotting `lm()` in `ggplot`

A good way of understanding your model is by visualizing it. We are going to walk through the construction of a plot with a fit line and prediction / confidence intervals from an `lm` object.

Now we’re going to add a prediction line to this plot.
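
One possible way to build such a plot is sketched below: a scatter plot of the observed data with the model's predictions drawn as a line on top. The object names `pred_dat` and `p_fit`, the prediction grid, and the styling choices are assumptions, not part of the original practical.

```r
library(MASS)     # provides the Boston dataset
library(ggplot2)  # plotting (also attached by tidyverse)

# Refit the model so this block is self-contained
lm_ses <- lm(medv ~ lstat, data = Boston)

# Predictions over an evenly spaced grid of lstat values (illustrative choice)
pred_dat <- data.frame(lstat = seq(0, 45, length.out = 1000))
pred_dat$medv <- predict(lm_ses, newdata = pred_dat)

# Scatter plot of the observed data plus the prediction line
p_fit <- ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point() +
  geom_line(data = pred_dat, colour = "blue") +  # reuses the x and y mapping
  theme_minimal()

p_fit
```

Because `pred_dat` uses the same column names as `Boston`, the line layer can inherit the aesthetic mapping from the main `ggplot()` call.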