Before we start, we set a seed for reproducibility (we use 123) and set options(scipen = 999) to suppress scientific notation, making it easier to compare and interpret results later in the session.
set.seed(123)
options(scipen = 999)
Data processing
For this part of the practical, you can directly copy and paste the code as given.
From the ISLR package, we will use the College dataset. Check out help(College) to find out more about this dataset. We will only use the following four variables:
Outstate: Out-of-state tuition
PhD: Pct. of faculty with Ph.D.’s
Terminal: Pct. of faculty with terminal degree
Expend: Instructional expenditure per student
Hence, we create a new data frame containing only these four variables:
data_complete <- College |>
  select(Outstate, PhD, Terminal, Expend)
summary(data_complete)
Outstate PhD Terminal Expend
Min. : 2340 Min. : 8.00 Min. : 24.0 Min. : 3186
1st Qu.: 7320 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.: 6751
Median : 9990 Median : 75.00 Median : 82.0 Median : 8377
Mean :10441 Mean : 72.66 Mean : 79.7 Mean : 9660
3rd Qu.:12925 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:10830
Max. :21700 Max. :103.00 Max. :100.0 Max. :56233
During the practical, we focus on how different imputation approaches affect the linear relationship between Outstate and Expend. Therefore, we first inspect what this relationship looks like in the complete dataset.
lm_complete <- lm(Expend ~ Outstate, data = data_complete)
summary(lm_complete)
Call:
lm(formula = Expend ~ Outstate, data = data_complete)
Residuals:
Min 1Q Median 3Q Max
-6131 -2022 -637 1027 39273
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 542.86969 385.92915 1.407 0.16
Outstate 0.87325 0.03449 25.315 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3866 on 775 degrees of freedom
Multiple R-squared: 0.4526, Adjusted R-squared: 0.4519
F-statistic: 640.9 on 1 and 775 DF, p-value: < 0.00000000000000022
We now generate missingness in the variables Expend and Terminal under a Seen Data-Dependent (SDD) mechanism, assuming that the missingness in these variables depends on the values of Outstate and PhD, respectively.
logistic <- function(x) exp(x) / (1 + exp(x))           # logistic function
N <- nrow(data_complete)                                # number of observations
misprob_out <- logistic(scale(data_complete$Outstate))  # predictor Outstate
misprob_phd <- logistic(scale(data_complete$PhD))       # predictor PhD
data_missing <- data_complete |>
  mutate(R_out = 1 - rbinom(N, 1, misprob_out),         # response indicator Outstate
         R_phd = 1 - rbinom(N, 1, misprob_phd),         # response indicator PhD
         m_Expend = ifelse(R_out == 0, NA, Expend),     # missing in Expend
         m_Terminal = ifelse(R_phd == 0, NA, Terminal)  # and in Terminal
  ) |>
  select(Outstate, m_Expend, m_Terminal)                # select variables that we will use
Inspection of missingness
1. Inspect the missingness pattern of the dataset using the function plot_pattern() from the ggmice package.
plot_pattern(data_missing)
Ad Hoc imputation methods
Listwise deletion
2. Create the object lm_listwise by fitting a regression model with m_Expend as the outcome variable, predicted by Outstate, using the data containing missing values.
Hint: By default, lm() removes the observations that have missings on one of the variables in the model.
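One way to do this (a sketch; it matches the model call shown in the output below):
lm_listwise <- lm(m_Expend ~ Outstate, data = data_missing)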
3. Inspect the regression output by using the summary() function on the lm_listwise object. How many observations are excluded from this analysis? Is there a significant effect of Outstate on m_Expend?
summary(lm_listwise)
Call:
lm(formula = m_Expend ~ Outstate, data = data_missing)
Residuals:
Min 1Q Median 3Q Max
-4986.2 -1490.9 -398.1 736.4 30053.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2006.74472 396.63965 5.059 0.000000641 ***
Outstate 0.70211 0.04153 16.904 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2758 on 402 degrees of freedom
(373 observations deleted due to missingness)
Multiple R-squared: 0.4155, Adjusted R-squared: 0.414
F-statistic: 285.8 on 1 and 402 DF, p-value: < 0.00000000000000022
# The coefficient is a bit smaller and the standard error somewhat larger, so we
# introduced some bias and reduced the power. Both could have been expected,
# given how we generated the missingness and since we lost 373 cases.
Regression imputation
4. Use mice() to impute data_missing with regression imputation (method = "norm.predict"), using a single imputation (m = 1), and a single iteration (maxit = 1), and name the output object imp_regression.
NB: m = 1 implies that only a single imputed data set is created, whereas maxit = 1 implies that mice() does not iterate to converge to a distribution.
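A possible call, following the settings described above (a sketch, not necessarily the only valid specification):
imp_regression <- mice(data_missing, method = "norm.predict", m = 1, maxit = 1)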
5. Extract the complete data from the mids object using the complete() function, and confirm that you have a fully imputed dataset.
complete(imp_regression) |>  # mice::complete() takes an object of class mids,
  is.na() |>                 # fills in the missing data, and returns the
  sum()                      # completed data in a specified format
[1] 0
The with(data, expr) function allows us to fit lm() and glm() models on imputed data sets (mids objects, to be precise) without requiring any further data handling. Check how the function works by running ?with.mids.
6. Use the with() function to fit the regression model lm(m_Expend ~ Outstate) on the imputed data using the imp_regression as the data argument, and assign the output to an object called lm_regression.
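A possible solution (a sketch; summary() is added here to inspect the fitted model discussed below):
lm_regression <- with(imp_regression, lm(m_Expend ~ Outstate))
summary(lm_regression)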
# The regression coefficient is almost identical to the coefficient from the
# model with listwise deletion, which is not surprising, because we used this
# relationship to impute the missing data. The standard error is substantially
# smaller though, because we treat the imputed values as the true values, and
# given that all are on the regression line, we are overly certain about the
# estimated regression coefficient.
# From an inferential statistics viewpoint, this approach would only be valid
# if we had definitive proof that the unobserved values would exactly conform
# to the observed data. If this assumption does not hold, `norm.predict`
# creates too little variation in our dataset and we cannot trust the
# resulting standard errors and p-values.
8. Use ggmice() to create a scatterplot that shows the relationship between m_Expend (on the y-axis) and Outstate (on the x-axis). What do you notice?
Hint: ggmice() works much the same way as ggplot(), but accepts mids objects. Check ?ggmice for help.
imp_regression |>
  ggmice(aes(x = Outstate, y = m_Expend)) +
  geom_point()
# Almost all imputed values lie on an (almost) straight regression line (which
# is how we specified the imputation model). The variation around this line
# comes from the other variables that are also used in calculating the
# predictions.
Stochastic regression imputation
We will now impute the missing data with stochastic regression imputation, adding an error term to the predicted values such that the imputations show variation around the regression line. The errors follow a normal distribution with a mean equal to zero and a variance equal to the residual variance.
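To make this concrete, the sketch below imputes m_Expend by hand under these assumptions (for illustration only; the norm.nob method in mice differs in its details, for instance by using all other variables as predictors):
fit_obs  <- lm(m_Expend ~ Outstate, data = data_missing)           # fit on the observed cases
mis      <- is.na(data_missing$m_Expend)                           # which cases are missing
pred     <- predict(fit_obs, newdata = data_missing[mis, ])        # predicted values for the missing cases
imp_vals <- pred + rnorm(sum(mis), mean = 0, sd = sigma(fit_obs))  # add normal noise with the residual SD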
9. Impute data_missing with the stochastic regression method using mice() with method = "norm.nob", again using a single imputation (m = 1) and a single iteration (maxit = 1) and assign the outcome to an object called imp_stochastic.
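A possible call (a sketch following the settings in the question):
imp_stochastic <- mice(data_missing, method = "norm.nob", m = 1, maxit = 1)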
10. Fit the same regression model as before using the with() function, and assign the outcome to an object called lm_stochastic, and inspect the results of the fitted model using summary(). What do you notice?
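A possible solution (a sketch):
lm_stochastic <- with(imp_stochastic, lm(m_Expend ~ Outstate))
summary(lm_stochastic)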
# When comparing the results of stochastic regression imputation with those of
# regression imputation, we see that the regression coefficient is still
# biased, but that the standard error has increased a little bit.
11. Use ggmice() to create a scatterplot that shows the relationship between m_Expend (on the y-axis) and Outstate (on the x-axis) in the stochastic imputed data. What do you notice?
imp_stochastic |>
  ggmice(aes(x = Outstate, y = m_Expend)) +
  geom_point()
# There is considerably more variation in the imputed values as compared to
# the regression imputations. However, the current approach still treats the
# estimated regression coefficients as fixed, although there is also uncertainty
# with respect to these coefficients. This uncertainty is still not taken into
# account with stochastic regression imputation.
Multiple imputation
We will now impute the missing data with multiple imputation.
12. Impute the missing data by calling the mice() function with default settings and assign the result to an object called imp_multiple.
imp_multiple <- mice(data_missing)  # mids: multiply imputed data set
# The imputations are now done. As you can see, the algorithm ran for 5
# iterations (the default: `maxit = 5`) and presented us with 5 imputations
# (the default: `m = 5`) for each missing datum.
13. Inspect the output object imp_multiple. What information does it give you?
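One way to inspect it (a sketch, in line with the answer below):
imp_multiple
attributes(imp_multiple)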
# We have created m = 5 imputed data sets, using the default method, predictive
# mean matching. Moreover, all variables are used in the imputation models, as
# shown by the variables in the columns that are used as predictors for the
# variable in each row.
# In addition, the imp_multiple object contains the data with missings, the
# imputations, etc.
# To obtain a full overview of the information stored in the object
# `imp_multiple`, we can use the `attributes()` function.
14. Again, perform the linear regression on the imputed data using the with() function and assign it to an object called lm_multiple. What do you notice when you inspect this object?
lm_multiple <- with(imp_multiple, lm(m_Expend ~ Outstate))  # mira: Multiply Imputed Repeated Analyses
# The `lm_multiple` object contains the regression summaries for each of the 5
# imputed datasets. The new object `lm_multiple` is actually of class `mira`,
# which stands for Multiply Imputed Repeated Analyses.
15. Use the pool() function to combine the regression analyses and assign these results to a new object called pool_multiple. Inspect pool_multiple and summary(pool_multiple). What information do these objects provide?
pool_multiple <- pool(lm_multiple)  # mipo: multiply imputed pooled object
summary(pool_multiple)
# These results show the regression coefficient and corresponding standard
# error obtained after pooling the estimates from the five regression models
# applied to the five imputed datasets.
# The complete pooled object gives the relevant pooled regression coefficients
# and parameters, as well as the fraction of information about the coefficients
# missing due to nonresponse (`fmi`) and the proportion of the variation
# attributable to the missing data (`lambda`). The pooled fit object is of
# class `mipo`, which stands for multiply imputed pooled object.
16. Use ggmice() to create a scatterplot that shows the relationship between m_Expend (on the y-axis) and Outstate (on the x-axis) in the multiply imputed data. What do you notice?
imp_multiple |>
  ggmice(aes(x = Outstate, y = m_Expend)) +
  geom_point()
# Here we take all uncertainty with respect to the imputations into account
# (i.e., uncertainty with respect to the values and with respect to the
# coefficients). Moreover, we impute multiple times, so that we can assess the
# variability between different data sets (on the level of the estimated model
# parameters). This is often the best approach when interest is in making
# inferences with respect to the imputed data.
17. Now compare the results from the different methods applied. What is your conclusion?
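One way to put the Outstate estimates and standard errors side by side (a sketch; it assumes the objects from the previous exercises are available, with lm_regression and lm_stochastic each containing a single analysis):
# Outstate coefficient and standard error under each approach
data.frame(
  method    = c("complete data", "listwise deletion", "regression imputation",
                "stochastic imputation", "multiple imputation"),
  estimate  = unname(c(coef(lm_complete)["Outstate"],
                       coef(lm_listwise)["Outstate"],
                       coef(lm_regression$analyses[[1]])["Outstate"],
                       coef(lm_stochastic$analyses[[1]])["Outstate"],
                       summary(pool_multiple)$estimate[2])),
  std.error = unname(c(coef(summary(lm_complete))["Outstate", "Std. Error"],
                       coef(summary(lm_listwise))["Outstate", "Std. Error"],
                       coef(summary(lm_regression$analyses[[1]]))["Outstate", "Std. Error"],
                       coef(summary(lm_stochastic$analyses[[1]]))["Outstate", "Std. Error"],
                       summary(pool_multiple)$std.error[2]))
)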
# When comparing the results from the different approaches, it can be
# concluded that, in case of a dataset where the missingness follows the
# Seen Data Dependent mechanism, the estimate obtained after applying multiple
# imputation is closest to the original estimate obtained from the complete
# dataset. In addition, the standard error obtained using multiple imputation
# is somewhat larger than the standard error obtained from the complete
# dataset. This is because, when the standard errors are pooled over the
# different imputations, not only the average within-imputation variance is
# taken, but the between-imputation variance is also incorporated (check the
# pooling rules for more detail), i.e., the standard error also incorporates
# uncertainty due to the missing values. In contrast, the results obtained
# using listwise deletion, regression imputation and stochastic imputation all
# show a biased estimate, while regression imputation and stochastic
# imputation also show a standard error that is too low.