Missing data and imputation II
The aim of this lab is to learn more about ad hoc imputation methods as well as multiple imputation, all using the mice package.
In this lab, we perform the following steps:
- Data processing.
- Inspection of the missingness pattern.
- Apply ad hoc imputation methods using the mice package.
- Apply multiple imputation using the mice package.
- Perform a linear regression and compare the output when different imputation approaches are used.
First, we load the packages required for this lab.
Before we start, we set a seed for reproducibility (we use 123) and call options(scipen = 999) to suppress scientific notation, which makes it easier to compare and interpret results later in the session.
set.seed(123)
options(scipen = 999)
Data processing
For this part of the practical, you can directly copy and paste the code as given.
From the ISLR package, we will use the College dataset. Check out help(College) to find out more about this dataset. We will only use the following four variables:
- Outstate: Out-of-state tuition
- PhD: Pct. of faculty with Ph.D.’s
- Terminal: Pct. of faculty with terminal degree
- Expend: Instructional expenditure per student
Hence, we create a new data frame containing only these four variables:
data_complete <- College |>
  select(Outstate, PhD, Terminal, Expend)
summary(data_complete)
    Outstate          PhD            Terminal         Expend
 Min.   : 2340   Min.   :  8.00   Min.   : 24.0   Min.   : 3186
 1st Qu.: 7320   1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.: 6751
 Median : 9990   Median : 75.00   Median : 82.0   Median : 8377
 Mean   :10441   Mean   : 72.66   Mean   : 79.7   Mean   : 9660
 3rd Qu.:12925   3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:10830
 Max.   :21700   Max.   :103.00   Max.   :100.0   Max.   :56233
During the practical, we focus on how different imputation approaches affect the linear relationship between Outstate and Expend. Therefore, we first inspect what this relationship looks like in the complete dataset.
lm_complete <- lm(Expend ~ Outstate, data = data_complete)
summary(lm_complete)
Call:
lm(formula = Expend ~ Outstate, data = data_complete)
Residuals:
   Min     1Q Median     3Q    Max
 -6131  -2022   -637   1027  39273
Coefficients:
             Estimate Std. Error t value            Pr(>|t|)
(Intercept) 542.86969  385.92915   1.407                0.16
Outstate      0.87325    0.03449  25.315 <0.0000000000000002 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3866 on 775 degrees of freedom
Multiple R-squared:  0.4526, Adjusted R-squared:  0.4519
F-statistic: 640.9 on 1 and 775 DF,  p-value: < 0.00000000000000022
We now generate missingness in the variables Expend and Terminal under a Seen Data-Dependent (SDD) mechanism, assuming that the missingness in these variables depends on the values of Outstate and PhD, respectively.
logistic <- function(x) exp(x) / (1 + exp(x))               # logistic function
N <- nrow(data_complete)                                    # number of observations
misprob_out <- logistic(scale(data_complete$Outstate))      # missingness probability, predictor Outstate
misprob_phd <- logistic(scale(data_complete$PhD))           # missingness probability, predictor PhD
data_missing <- data_complete |>
  mutate(
    R_out = 1 - rbinom(N, 1, misprob_out),                  # response indicator driven by Outstate
    R_phd = 1 - rbinom(N, 1, misprob_phd),                  # response indicator driven by PhD
    m_Expend = ifelse(R_out == 0, NA, Expend),              # missing values in Expend
    m_Terminal = ifelse(R_phd == 0, NA, Terminal)           # and in Terminal
  ) |>
  select(Outstate, m_Expend, m_Terminal)                    # select the variables that we will use
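To see what this kind of mechanism does, the following base R sketch repeats the same recipe on simulated data. The values and the lower-case names (outstate, expend, m_expend) are made up for illustration and are not the College data. Because the missingness probability rises with the predictor, rows with a missing outcome should have a higher predictor mean than rows with an observed outcome.

```r
# Toy stand-in for the College data (simulated values, hypothetical names)
set.seed(123)
n        <- 1000
outstate <- rnorm(n, mean = 10000, sd = 4000)
expend   <- 500 + 0.9 * outstate + rnorm(n, sd = 3000)

logistic <- function(x) exp(x) / (1 + exp(x))
misprob  <- logistic(as.numeric(scale(outstate)))  # P(missing) rises with outstate
r        <- 1 - rbinom(n, 1, misprob)              # 1 = observed, 0 = missing
m_expend <- ifelse(r == 0, NA, expend)

mean(is.na(m_expend))                    # overall share of missing values
tapply(outstate, is.na(m_expend), mean)  # mean outstate: observed (FALSE) vs missing (TRUE)
```

Since the predictor is standardised before the logistic transform, the expected share of missing values is about one half.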
Inspection of missingness
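One possible way to inspect the missingness pattern is md.pattern() from the mice package. The sketch below is an illustration on a small simulated data frame called toy (a hypothetical stand-in that reuses the lab's column names), not the lab's own solution.

```r
library(mice)

# Toy stand-in (simulated values); only m_Expend contains missing values
set.seed(123)
n   <- 500
toy <- data.frame(Outstate = rnorm(n, 10000, 4000))
toy$m_Expend <- 500 + 0.9 * toy$Outstate + rnorm(n, sd = 3000)
toy$m_Expend[rbinom(n, 1, plogis(as.numeric(scale(toy$Outstate)))) == 1] <- NA

# One row per missingness pattern; 1 = observed, 0 = missing
pat <- md.pattern(toy, plot = FALSE)
pat
colSums(is.na(toy))  # number of missing values per variable
```

The bottom row of the returned matrix gives the number of missing values per variable, and its last entry gives the total.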
Ad Hoc imputation methods
Listwise deletion
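As a minimal sketch of listwise deletion, assuming a simulated data frame toy in place of the lab data: lm() drops incomplete rows by default (R's default na.action is na.omit), so fitting the model directly on incomplete data already amounts to a complete-case analysis.

```r
# Toy stand-in (simulated values, hypothetical data frame name)
set.seed(123)
n   <- 500
toy <- data.frame(Outstate = rnorm(n, 10000, 4000))
toy$m_Expend <- 500 + 0.9 * toy$Outstate + rnorm(n, sd = 3000)
toy$m_Expend[rbinom(n, 1, plogis(as.numeric(scale(toy$Outstate)))) == 1] <- NA

# Rows with NA in any model variable are dropped before fitting
fit_lw <- lm(m_Expend ~ Outstate, data = toy)

nobs(fit_lw)              # number of rows actually used
sum(complete.cases(toy))  # number of complete rows, the same count
coef(fit_lw)
```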
Regression imputation
The with(data, expr) function lets you fit lm() and glm() models on imputed data sets (mids objects, to be precise) without requiring any further data handling. Check how the function works by running ?with.mids
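A small sketch of how with() can be combined with regression imputation, assuming the mice method "norm.predict" and a simulated data frame toy as a stand-in for the lab data. The object names imp_reg and fit_reg are illustrative.

```r
library(mice)

# Toy stand-in (simulated values, hypothetical data frame name)
set.seed(123)
n   <- 500
toy <- data.frame(Outstate = rnorm(n, 10000, 4000))
toy$m_Expend <- 500 + 0.9 * toy$Outstate + rnorm(n, sd = 3000)
toy$m_Expend[rbinom(n, 1, plogis(as.numeric(scale(toy$Outstate)))) == 1] <- NA

# Regression imputation: a single imputed data set filled with predicted values
imp_reg <- mice(toy, method = "norm.predict", m = 1, maxit = 1, printFlag = FALSE)

# with() evaluates the model on the imputed data stored in the mids object
fit_reg <- with(imp_reg, lm(m_Expend ~ Outstate))
summary(fit_reg$analyses[[1]])
```

The result of with() is a mira object whose analyses element holds one fitted model per imputed data set, here just one.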
Stochastic regression imputation
We will now impute the missing data with stochastic regression imputation, adding an error term to the predicted values such that the imputations show variation around the regression line. The errors follow a normal distribution with a mean equal to zero and a variance equal to the residual variance.
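To make the idea concrete, here is a by-hand sketch in base R on simulated data (the data frame toy is a hypothetical stand-in). It fits the regression on the observed cases, predicts the missing values, and adds normal noise with standard deviation equal to the residual standard error. In mice, the corresponding built-in method is "norm.nob"; the code below is an illustration of the principle, not the package's implementation.

```r
# Toy stand-in (simulated values, hypothetical data frame name)
set.seed(123)
n   <- 500
toy <- data.frame(Outstate = rnorm(n, 10000, 4000))
toy$m_Expend <- 500 + 0.9 * toy$Outstate + rnorm(n, sd = 3000)
toy$m_Expend[rbinom(n, 1, plogis(as.numeric(scale(toy$Outstate)))) == 1] <- NA

obs <- !is.na(toy$m_Expend)                       # TRUE where m_Expend is observed
fit <- lm(m_Expend ~ Outstate, data = toy[obs, ]) # regression on observed cases

pred      <- predict(fit, newdata = toy[!obs, , drop = FALSE])  # points on the regression line
sigma_hat <- summary(fit)$sigma                                 # residual standard error

imputed <- toy
# Prediction plus a normal error term with mean zero and the residual variance
imputed$m_Expend[!obs] <- pred + rnorm(sum(!obs), mean = 0, sd = sigma_hat)

sd(imputed$m_Expend[!obs] - pred)  # spread of the imputations around the line
```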
Multiple imputation
We will now impute the missing data with multiple imputation.
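A minimal sketch of the usual mice workflow (impute, analyse, pool), again on a simulated data frame toy rather than the lab data. The choice of m = 5 and of predictive mean matching ("pmm") is an assumption made for illustration.

```r
library(mice)

# Toy stand-in (simulated values, hypothetical data frame name)
set.seed(123)
n   <- 500
toy <- data.frame(Outstate = rnorm(n, 10000, 4000))
toy$m_Expend <- 500 + 0.9 * toy$Outstate + rnorm(n, sd = 3000)
toy$m_Expend[rbinom(n, 1, plogis(as.numeric(scale(toy$Outstate)))) == 1] <- NA

# 1. Impute: create m = 5 completed data sets
imp <- mice(toy, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# 2. Analyse: fit the same regression on each completed data set
fit <- with(imp, lm(m_Expend ~ Outstate))

# 3. Pool: combine the five sets of estimates with Rubin's rules
pooled <- pool(fit)
summary(pooled)
```

The pooled standard errors reflect both the sampling variance within each completed data set and the variance between imputations.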