```
library(tidyverse)
library(mice)
library(ggmice)
library(ISLR)
```

# Missing data and imputation II

# Introduction

The aim of this lab is to learn more about ad hoc imputation methods as well as multiple imputation, all using the `mice` package.

In this lab, we perform the following steps:

- Data processing.
- Inspection of the missingness pattern.
- Apply ad hoc imputation methods using the `mice` package.
- Apply multiple imputation using the `mice` package.
- Perform a linear regression and compare the output when different imputation approaches are used.

First, we load the packages required for this lab.

Before we start, set a seed for reproducibility (we use `45`), and also use `options(scipen = 999)` to suppress scientific notation, making it easier to compare and interpret results later in the session.

```
set.seed(45)
options(scipen = 999)
```

# Data processing

For this part of the practical, you can directly copy and paste the code as given.

From the `ISLR` package, we will use the `College` dataset. Check out `help(College)` to find out more about this dataset. We will only use the following four variables:

- Outstate: Out-of-state tuition
- PhD: Pct. of faculty with Ph.D.’s
- Terminal: Pct. of faculty with terminal degree
- Expend: Instructional expenditure per student

Hence, we create a new data frame containing only these four variables:

```
data_complete <- College |>
  select(Outstate, PhD, Terminal, Expend)
summary(data_complete)
```

```
Outstate PhD Terminal Expend
Min. : 2340 Min. : 8.00 Min. : 24.0 Min. : 3186
1st Qu.: 7320 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.: 6751
Median : 9990 Median : 75.00 Median : 82.0 Median : 8377
Mean :10441 Mean : 72.66 Mean : 79.7 Mean : 9660
3rd Qu.:12925 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:10830
Max. :21700 Max. :103.00 Max. :100.0 Max. :56233
```

During the practical, we focus on how different imputation approaches affect the linear relationship between `Outstate` and `Expend`. Therefore, we first inspect what this relationship looks like on the complete dataset.

```
lm_complete <- lm(Expend ~ Outstate, data = data_complete)
summary(lm_complete)
```

```
Call:
lm(formula = Expend ~ Outstate, data = data_complete)
Residuals:
Min 1Q Median 3Q Max
-6131 -2022 -637 1027 39273
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 542.86969 385.92915 1.407 0.16
Outstate 0.87325 0.03449 25.315 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3866 on 775 degrees of freedom
Multiple R-squared: 0.4526, Adjusted R-squared: 0.4519
F-statistic: 640.9 on 1 and 775 DF, p-value: < 0.00000000000000022
```

We now generate missingness in the variables `Expend` and `Terminal` under a Seen Data-Dependent (SDD) mechanism, assuming that the missingness in these variables depends on the values of `Outstate` and `PhD`, respectively.

```
logistic <- function(x) exp(x) / (1 + exp(x))           # logistic function
N <- nrow(data_complete)                                # number of observations

misprob_out <- logistic(scale(data_complete$Outstate))  # predictor Outstate
misprob_phd <- logistic(scale(data_complete$PhD))       # predictor PhD

data_missing <- data_complete |>
  mutate(
    R_out = 1 - rbinom(N, 1, misprob_out),        # response indicator Outstate
    R_phd = 1 - rbinom(N, 1, misprob_phd),        # response indicator PhD
    m_Expend = ifelse(R_out == 0, NA, Expend),    # missingness in Expend
    m_Terminal = ifelse(R_phd == 0, NA, Terminal) # and in Terminal
  ) |>
  select(Outstate, m_Expend, m_Terminal) # select the variables that we will use
```

# Inspection of missingness
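
As a sketch of what this inspection could look like (assuming the `data_missing` object created above), the `mice` and `ggmice` helpers below tabulate and visualize the missingness pattern:

```r
# Tabulate the missingness patterns (mice)
md.pattern(data_missing)

# Visualize the same pattern as a ggplot (ggmice)
plot_pattern(data_missing)

# Scatterplot in which missing values are displayed on the axes (ggmice)
ggmice(data_missing, aes(Outstate, m_Expend)) +
  geom_point()
```

The pattern plot shows how many rows are complete and how many miss `m_Expend`, `m_Terminal`, or both.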

# Ad Hoc imputation methods

## Listwise deletion
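
A minimal sketch, assuming the `data_missing` object created above: `lm()` applies listwise deletion by default (via `na.action = na.omit`), so fitting the model directly on the incomplete data silently drops every row with a missing value on the model variables.

```r
# lm() drops incomplete rows by default, i.e. listwise deletion
lm_listwise <- lm(m_Expend ~ Outstate, data = data_missing)
summary(lm_listwise)

# Compare the number of rows actually used to the full sample size
sum(complete.cases(data_missing[, c("Outstate", "m_Expend")]))
nrow(data_missing)
```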

## Regression imputation

The `with(data, expr)` function allows us to fit `lm()` and `glm()` models on imputed data sets (`mids` objects, to be precise) without requiring any further data handling. Check how the function works by running `?with.mids`.
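
One way to carry out regression imputation with `mice` is its built-in `norm.predict` method, which imputes the predicted values from a linear regression. A sketch, assuming the `data_missing` object created above; the object names are illustrative:

```r
# Deterministic regression imputation: impute predicted values only
imp_reg <- mice(data_missing,
                method = "norm.predict",
                m = 1, maxit = 1,
                print = FALSE)

# Fit the analysis model on the imputed data via with()
fit_reg <- with(imp_reg, lm(m_Expend ~ Outstate))
summary(fit_reg)
```

Because every imputation lies exactly on the regression line, this method understates the variability in the imputed values.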

## Stochastic regression imputation

We will now impute the missing data with stochastic regression imputation, adding an error term to the predicted values such that the imputations show variation around the regression line. The errors follow a normal distribution with a mean equal to zero and a variance equal to the residual variance.
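
In `mice`, this corresponds to the `norm.nob` method (normal linear regression without accounting for parameter uncertainty). A sketch, assuming the `data_missing` object created above:

```r
# Stochastic regression imputation: predicted value + normally distributed noise
imp_stoch <- mice(data_missing,
                  method = "norm.nob",
                  m = 1, maxit = 1,
                  print = FALSE)

fit_stoch <- with(imp_stoch, lm(m_Expend ~ Outstate))
summary(fit_stoch)
```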

# Multiple imputation

We will now impute the missing data with multiple imputation.
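
A sketch of the standard impute-analyze-pool workflow in `mice`, assuming the `data_missing` object created above (default settings; object names are illustrative):

```r
# Step 1: create m = 5 imputed data sets (default method is pmm)
imp_mi <- mice(data_missing, m = 5, print = FALSE)

# Step 2: fit the analysis model in each imputed data set
fit_mi <- with(imp_mi, lm(m_Expend ~ Outstate))

# Step 3: pool the m estimates with Rubin's rules
pooled <- pool(fit_mi)
summary(pooled)
```

The pooled standard errors combine the within-imputation and between-imputation variance, which is what the single-imputation approaches above fail to do.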