library(MASS)
library(tidyverse)
Missing data and imputation I
Introduction
The aim of this practical is to enhance your understanding of different missingness mechanisms and how they affect your estimates of interest. In addition, we will briefly go over mean imputation and illustrate why it does not solve the missing data problem.
In this practical, we will perform the following steps:
- Generate missing values
- Visualize missing data
- Perform mean imputation
Make sure to load MASS before tidyverse; otherwise the function MASS::select() will mask dplyr::select().
Before we start, set a seed for reproducibility; we use 45.
set.seed(45)
Income dataset
In this practical, we will manually create missingness to get an idea of how different missingness mechanisms affect the outcomes of analyses. To this end, we start with a complete dataset, income.rds, which can be downloaded here. After downloading, load the data in R as follows:
income <- readRDS("data/income.rds")
Generating missing values
The general expression of the missing data model is \(Pr(R = 0|Y_{\text{obs}}, Y_{\text{mis}}, \psi)\). Let’s go through this formula step by step.
\(R\) is a matrix with response indicators, which contains the locations of the missing values. For each person and each variable, \(R_{i,j}\) indicates whether person \(i\) has a missing value (\(R_{i,j} = 0\)) or an observed value (\(R_{i,j} = 1\)) on variable \(j\). In our case, we have \(R_{\text{age}}\), \(R_{\text{gender}}\) and \(R_{\text{income}}\).
\(Y_{\text{obs}}\) denotes the observed data.
\(Y_{\text{mis}}\) denotes the missing data. Note that we don’t have these data, but they might influence whether or not a person has missing values on a variable.
\(\psi\) denotes the parameters for the missing data models. These parameters relate the observed and missing values to the response indicator.
Now that we have a binary response indicator and some additional variables, we can build a classification model to predict whether or not someone has a missing value on some variable. Normally, we would use the data to evaluate whether the missingness is related to someone’s values on other variables. However, we can also reverse the process and impose a missingness structure on the observed data that depends on the variables.
In this practical, we will create missingness in the variable income, and we will do that using a logistic regression model. The logistic regression model predicts the probability of a missing value, given a set of indicators.
Note that a logistic regression model for the missingness can be defined as \[ \Pr(R_{\text{income}} = 0 | Y_{\text{obs}}, Y_{\text{mis}}, \psi) = \frac{\exp\{\psi_0 + \psi_1 * \text{age} + \psi_2 * (\text{gender}=\text{male}) + \psi_3 * \text{income}\}}{1 + \exp\{\psi_0 + \psi_1 * \text{age} + \psi_2 * (\text{gender} = \text{male}) + \psi_3 * \text{income}\}}. \] In this equation, \(\psi_0\) is a parameter for the baseline probability of missingness in \(\text{income}\), \(\psi_1\) and \(\psi_2\) relate the observed data to the missingness in \(\text{income}\), and \(\psi_3\) relates unobserved information to the missingness in \(\text{income}\).
If we translate this equation into R code, we get
prob_mis <- function(psi0, psi1, psi2, psi3, age, gender, income) {
  exp(psi0 + psi1 * age + psi2 * (gender == "Male") + psi3 * income) /
    (1 + exp(psi0 + psi1 * age + psi2 * (gender == "Male") + psi3 * income))
}
Not Data-Dependent (NDD)
The data are considered to be NDD if \(Pr(R = 0|Y_{obs}, Y_{mis}, \psi) = Pr(R=0|\psi)\), indicating that the missingness probability is unrelated to the data \(Y\) and only depends on some parameters \(\psi\).
Using these probabilities, we can randomly draw the observations that will have a missing value on \(\text{income}\) from a binomial distribution.
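As a minimal sketch of the NDD case, the code below sets every parameter except the intercept to zero, so each person gets the same missingness probability, and then draws the response indicator with rbinom(). The data frame `dat` is a simulated, hypothetical stand-in for the income data, and the 25% baseline missingness is an assumed value, not one taken from the practical.

```r
set.seed(45)
n <- 1000

# Hypothetical stand-in for the income data (column names follow the text)
dat <- data.frame(
  age    = round(runif(n, 18, 80)),
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  income = rlnorm(n, meanlog = 10, sdlog = 0.5)
)

# NDD: only the intercept psi0 is non-zero, so the probability is constant
psi0  <- qlogis(0.25)            # assumed baseline: 25% missingness
p_ndd <- rep(plogis(psi0), n)

# Response indicator as in the text: 1 = observed, 0 = missing
R_ndd <- rbinom(n, size = 1, prob = 1 - p_ndd)
dat$income_ndd <- ifelse(R_ndd == 1, dat$income, NA)

# Realized proportion of missing values; should be near the baseline
mean(is.na(dat$income_ndd))
```

Because the probability does not vary with age, gender or income, the observed incomes are a simple random subsample of all incomes.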
Seen Data-Dependent (SDD)
So far, there was a baseline probability of having missing values, but this was the same for all observations (hence the term Not Data-Dependent). Data are considered to be Seen Data-Dependent if the probability of having a missing value depends on the observed variables. Formally, we can define this as \(Pr(R = 0|Y_{obs}, Y_{mis}, \psi) = Pr(R=0|Y_{obs},\psi)\).
In this example, we let the missingness in \(\text{income}\) depend on the observed values of \(\text{age}\) and \(\text{gender}\), in such a way that older people have a higher probability of having a missing value, and that males have a higher probability of having a missing value than females.
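One way to sketch SDD missingness is shown below. The data frame `dat` is again a simulated, hypothetical stand-in for the income data, and the values of psi0, psi1 and psi2 are assumptions chosen only so that older people and males get higher missingness probabilities; psi3 is zero, so income itself plays no role.

```r
set.seed(45)
n <- 1000

# Hypothetical stand-in for the income data (column names follow the text)
dat <- data.frame(
  age    = round(runif(n, 18, 80)),
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  income = rlnorm(n, meanlog = 10, sdlog = 0.5)
)

# Assumed parameters: psi1 > 0 (older => more missing),
# psi2 > 0 (males => more missing), psi3 = 0 (no dependence on income)
psi0 <- -3
psi1 <- 0.03
psi2 <- 0.5
p_sdd <- plogis(psi0 + psi1 * dat$age + psi2 * (dat$gender == "Male"))

# Response indicator: 1 = observed, 0 = missing
R_sdd <- rbinom(n, size = 1, prob = 1 - p_sdd)
dat$income_sdd <- ifelse(R_sdd == 1, dat$income, NA)

# Average missingness probability per gender
tapply(p_sdd, dat$gender, mean)
```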
Unseen Data-Dependent (UDD)
We now take the missingness one step further, by making it dependent on unobserved data. In practice, it is often the case that those who have a higher income are less willing to share their income, which is an example of unseen data-dependent missingness. We cannot observe whether those who have a missing value on income have a higher income, because this is exactly the information that we miss. The missingness could also depend on variables that we have not observed, for example because people who are self-employed might be less willing to answer survey questions about their income than people who have an employer. In both situations, we call the missingness Unseen Data-Dependent. Formally, this is defined as \(Pr(R = 0|Y_{obs}, Y_{mis}, \psi) = Pr(R=0|Y_{obs},Y_{mis},\psi)\).
Because we have no unobserved variables in this example, we let the missingness in \(\text{income}\) depend on the income values themselves, such that higher incomes have a higher probability of being missing.
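A minimal sketch of UDD missingness follows, again on a simulated, hypothetical stand-in for the income data. The values of psi0 and psi3 are assumptions, scaled to the simulated incomes so that the probabilities stay away from 0 and 1. Since higher incomes are more likely to go missing, the mean of the observed incomes should fall below the mean of the complete data.

```r
set.seed(45)
n <- 1000

# Hypothetical stand-in for the income data (column names follow the text)
dat <- data.frame(
  age    = round(runif(n, 18, 80)),
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  income = rlnorm(n, meanlog = 10, sdlog = 0.5)
)

# Assumed parameters: only the intercept and psi3 (income) are non-zero
psi0 <- -2.5
psi3 <- 0.00005
p_udd <- plogis(psi0 + psi3 * dat$income)

# Response indicator: 1 = observed, 0 = missing
R_udd <- rbinom(n, size = 1, prob = 1 - p_udd)
dat$income_udd <- ifelse(R_udd == 1, dat$income, NA)

# Compare the complete-data mean with the mean of the observed values
c(complete = mean(dat$income),
  observed = mean(dat$income_udd, na.rm = TRUE))
```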
Now that we have generated a model, we can use the predict() method to output the estimated probabilities for each point in the training dataset. By default, predict() outputs the log-odds, but we can transform these back to probabilities using the inverse logit function from before or by setting the argument type = "response" within predict().
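The two routes to predicted probabilities can be compared directly. The sketch below assumes the model is a logistic regression fitted with glm(); the data, the indicator `is_mis` and the coefficients used to simulate it are hypothetical. It uses plogis(), the inverse logit function in base R.

```r
set.seed(45)
n <- 1000

# Hypothetical data with a simulated missingness indicator (1 = missing)
dat <- data.frame(
  age    = round(runif(n, 18, 80)),
  gender = sample(c("Male", "Female"), n, replace = TRUE)
)
p_true <- plogis(-3 + 0.03 * dat$age + 0.5 * (dat$gender == "Male"))
dat$is_mis <- rbinom(n, size = 1, prob = p_true)

# Logistic regression for the probability of a missing value
fit <- glm(is_mis ~ age + gender, family = binomial, data = dat)

log_odds <- predict(fit)                     # default: linear predictor (log-odds)
probs    <- predict(fit, type = "response")  # probabilities directly

# The inverse logit of the log-odds should match the response-scale predictions
all.equal(unname(plogis(log_odds)), unname(probs))
```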
Visualizing missing data
Mean imputation
We will now show why mean imputation is, in general, a bad idea.
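As a rough illustration on simulated, hypothetical data (not the income dataset), the sketch below removes about 30% of the incomes completely at random and fills them in with the observed mean. Even in this most favourable setting, the mean is preserved but the standard deviation shrinks and the correlation with age is weakened, because every imputed value sits exactly at the mean.

```r
set.seed(45)
n <- 1000

# Hypothetical data in which income increases with age
age    <- round(runif(n, 18, 80))
income <- 20000 + 500 * age + rnorm(n, sd = 8000)

# Remove roughly 30% of the incomes completely at random (assumed rate)
income_mis <- ifelse(rbinom(n, size = 1, prob = 0.3) == 1, NA, income)

# Mean imputation: replace every missing value by the observed mean
income_imp <- ifelse(is.na(income_mis),
                     mean(income_mis, na.rm = TRUE),
                     income_mis)

# Compare observed-data and imputed-data summaries
c(mean_obs = mean(income_mis, na.rm = TRUE), mean_imp = mean(income_imp))
c(sd_obs   = sd(income_mis, na.rm = TRUE),   sd_imp   = sd(income_imp))
c(cor_obs  = cor(age, income_mis, use = "complete.obs"),
  cor_imp  = cor(age, income_imp))
```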