Missing data and imputation II

Published

October 13, 2023

Introduction

The aim of this lab is to learn more about ad hoc imputation methods as well as multiple imputation, all using the mice package.

In this lab, we perform the following steps:

  • Data processing.
  • Inspection of the missingness pattern.
  • Application of ad hoc imputation methods using the mice package.
  • Application of multiple imputation using the mice package.
  • Linear regression on the imputed data, comparing the output across imputation approaches.

First, we load the packages required for this lab.

library(tidyverse)
library(mice)
library(ggmice)
library(ISLR)

Before we start, we set a seed for reproducibility (we use 123) and use options(scipen = 999) to suppress scientific notation, making it easier to compare and interpret results later in the session.

set.seed(123)
options(scipen = 999)

Data processing

For this part of the practical, you can directly copy and paste the code as given.

From the ISLR package, we will use the College dataset. Check out help(College) to find out more about this dataset. We will only use the following four variables:

  • Outstate: Out-of-state tuition
  • PhD: Pct. of faculty with Ph.D.’s
  • Terminal: Pct. of faculty with terminal degree
  • Expend: Instructional expenditure per student

Hence, we create a new data frame containing only these four variables:

data_complete <- 
  College |> 
  select(Outstate, PhD, Terminal, Expend) 

summary(data_complete)
    Outstate          PhD            Terminal         Expend     
 Min.   : 2340   Min.   :  8.00   Min.   : 24.0   Min.   : 3186  
 1st Qu.: 7320   1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.: 6751  
 Median : 9990   Median : 75.00   Median : 82.0   Median : 8377  
 Mean   :10441   Mean   : 72.66   Mean   : 79.7   Mean   : 9660  
 3rd Qu.:12925   3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:10830  
 Max.   :21700   Max.   :103.00   Max.   :100.0   Max.   :56233  

During the practical, we focus on how different imputation approaches affect the linear relationship between Outstate and Expend. Therefore, we first inspect what this relationship looks like in the complete dataset.

lm_complete <- lm(Expend ~ Outstate, 
                  data = data_complete)

summary(lm_complete)

Call:
lm(formula = Expend ~ Outstate, data = data_complete)

Residuals:
   Min     1Q Median     3Q    Max 
 -6131  -2022   -637   1027  39273 

Coefficients:
             Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 542.86969  385.92915   1.407                0.16    
Outstate      0.87325    0.03449  25.315 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3866 on 775 degrees of freedom
Multiple R-squared:  0.4526,    Adjusted R-squared:  0.4519 
F-statistic: 640.9 on 1 and 775 DF,  p-value: < 0.00000000000000022

We now generate missingness in the variables Expend and Terminal under a Seen Data-Dependent (SDD) mechanism, assuming that the missingness in these variables depends on the values of Outstate and PhD, respectively.

logistic <- function(x) (exp(x) / (1 + exp(x))) # logistic function

N <- nrow(data_complete) # number of observations

misprob_out <- logistic(scale(data_complete$Outstate)) # predictor Outstate
misprob_phd <- logistic(scale(data_complete$PhD)) # predictor PhD

data_missing <- 
  data_complete |> 
  mutate(
    R_out = 1 - rbinom(N, 1, misprob_out), # response indicator for Expend
    R_phd = 1 - rbinom(N, 1, misprob_phd), # response indicator for Terminal
    m_Expend = ifelse(R_out == 0, NA, Expend),    # Missing in expend
    m_Terminal = ifelse(R_phd == 0, NA, Terminal) # and in terminal
  ) |> 
  select(Outstate, m_Expend, m_Terminal) # Select variables that we will use
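As a quick sanity check (not part of the lab exercises themselves), we can verify how much missingness the mechanism induced. Since the logistic of a standardized predictor averages around 0.5, roughly half of the values in each affected variable should be missing:

```r
# Proportion of missing values per variable; because the missingness
# probabilities are logistic(scale(x)), we expect roughly 50% missingness
# in m_Expend and m_Terminal, and none in Outstate
colMeans(is.na(data_missing))
```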

Inspection of missingness

1. Inspect the missingness pattern of the dataset using the function plot_pattern() from the ggmice package.
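A possible solution sketch for this exercise, using the data_missing object created above:

```r
# plot_pattern() shows which combinations of observed and missing
# values occur in the data, and how often each pattern appears
plot_pattern(data_missing)
```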

Ad hoc imputation methods

Listwise deletion

2. Create the object lm_listwise by fitting a regression model with m_Expend as the outcome variable, predicted by Outstate, using the data containing missing values.

Hint: By default, lm() removes observations that have missing values on any of the variables in the model.

3. Inspect the regression output by using the summary() function on the lm_listwise object. How many observations are excluded from this analysis? Is there a significant effect of Outstate on m_Expend?
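A sketch of how these two exercises could be approached (not the official solution):

```r
# lm() uses na.action = na.omit by default, so rows with a missing
# m_Expend are dropped automatically (listwise deletion)
lm_listwise <- lm(m_Expend ~ Outstate, data = data_missing)

# the summary reports the number of observations deleted due to
# missingness below the coefficient table
summary(lm_listwise)
```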

Regression imputation

4. Use mice() to impute data_missing with regression imputation (method = "norm.predict"), using a single imputation (m = 1) and a single iteration (maxit = 1), and name the output object imp_regression.

NB: m = 1 implies that only a single imputed data set is created, whereas maxit = 1 implies that mice() does not iterate towards a converged imputation distribution.

5. Extract the complete data from the mids object using the complete() function, and confirm that you have a fully imputed dataset.
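One way these two exercises could look (a sketch; the argument values follow the exercise text):

```r
# single regression imputation: deterministic predictions from a
# linear model, one imputed dataset, one iteration
imp_regression <- mice(data_missing,
                       method = "norm.predict",
                       m = 1,
                       maxit = 1)

# complete() fills the missing cells with the imputed values;
# anyNA() should now return FALSE
data_regression <- complete(imp_regression)
anyNA(data_regression)
```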

The with(data, expr) function allows you to fit lm() and glm() models on imputed data sets (mids objects, to be precise), without requiring any further data handling. Check how the function works by running ?with.mids.

6. Use the with() function to fit the regression model lm(m_Expend ~ Outstate) on the imputed data using the imp_regression as the data argument, and assign the output to an object called lm_regression.
7. Inspect the results of the regression model. What happens to the regression coefficient and the corresponding standard error?
8. Use ggmice() to create a scatterplot that shows the relationship between m_Expend (on the y-axis) and Outstate (on the x-axis). What do you notice?

Hint: ggmice() works much the same way as ggplot(), but accepts mids objects. Check ?ggmice for help.
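These three exercises could be sketched as follows, assuming the imp_regression object from the previous step:

```r
# fit the regression model within the imputed data
lm_regression <- with(imp_regression, lm(m_Expend ~ Outstate))
summary(lm_regression)

# ggmice() colours the imputed values, which makes it visible that
# norm.predict places every imputation exactly on the regression line
ggmice(imp_regression, aes(x = Outstate, y = m_Expend)) +
  geom_point()
```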

Stochastic regression imputation

We will now impute the missing data with stochastic regression imputation, adding an error term to the predicted values such that the imputations show variation around the regression line. The errors follow a normal distribution with a mean equal to zero and a variance equal to the residual variance.

9. Impute data_missing with the stochastic regression method using mice() with method = "norm.nob", again using a single imputation (m = 1) and a single iteration (maxit = 1) and assign the outcome to an object called imp_stochastic.
10. Fit the same regression model as before using the with() function, assign the outcome to an object called lm_stochastic, and inspect the results of the fitted model using summary(). What do you notice?
11. Use ggmice() to create a scatterplot that shows the relationship between m_Expend (on the y-axis) and Outstate (on the x-axis) in the stochastic imputed data. What do you notice?
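A sketch for these exercises, analogous to the regression imputation steps above:

```r
# stochastic regression imputation: predictions plus normally
# distributed noise with variance equal to the residual variance
imp_stochastic <- mice(data_missing,
                       method = "norm.nob",
                       m = 1,
                       maxit = 1)

lm_stochastic <- with(imp_stochastic, lm(m_Expend ~ Outstate))
summary(lm_stochastic)

# the imputed values now scatter around the regression line
ggmice(imp_stochastic, aes(x = Outstate, y = m_Expend)) +
  geom_point()
```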

Multiple imputation

We will now impute the missing data with multiple imputation.

12. Impute the missing data by calling the mice() function with default settings and assign the result to an object called imp_multiple.
13. Inspect the output object imp_multiple. What information does it give you?
14. Again, perform the linear regression on the imputed data using the with() function and assign it to an object called lm_multiple. What do you notice when you inspect this object?
15. Use the pool() function to combine the regression analyses and assign the result to a new object called pool_multiple. Inspect pool_multiple and summary(pool_multiple): what information do these objects provide?
16. Use ggmice() to create a scatterplot that shows the relationship between m_Expend (on the y-axis) and Outstate (on the x-axis) in the multiply imputed data. What do you notice?
17. Now compare the results from the different methods applied. What is your conclusion?
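The multiple imputation exercises could be sketched as follows; by default, mice() creates m = 5 imputed datasets with maxit = 5 iterations:

```r
# multiple imputation with default settings (pmm, m = 5, maxit = 5)
imp_multiple <- mice(data_missing)

# fit the model in each of the five imputed datasets
lm_multiple <- with(imp_multiple, lm(m_Expend ~ Outstate))

# pool() combines the five sets of estimates using Rubin's rules
pool_multiple <- pool(lm_multiple)
pool_multiple
summary(pool_multiple)

# the plot shows the imputed values from all five datasets
ggmice(imp_multiple, aes(x = Outstate, y = m_Expend)) +
  geom_point()
```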