Missing data and imputation I

Published

October 13, 2023

Introduction

The aim of this practical is to enhance your understanding in different missingness mechanisms and how they affect your estimates of interest. In addition, we will go over the mean imputation briefly and illustrate why it does not solve the missing data problem.

In this practical, we will perform the following steps:

  • Generating missing values
  • Visualize missing data
  • Perform mean imputation
library(MASS)
library(tidyverse)

Make sure to load MASS before tidyverse otherwise the function MASS::select() will overwrite dplyr::select()

Before we start, set a seed for reproducibility, we use 45.

set.seed(45)

Income dataset

In this practical, we will manually create missingness to get an idea how different missingness mechanisms affect the outcomes of analyses. To this end, we start with a complete dataset, income.rds which can be downloaded here. Load the data in R after downloading as follows:

income <- readRDS("data/income.rds")

Generating missing values

The general expression of the missing data model is \(Pr(R = 0|Y_{\text{obs}}, Y_{\text{mis}}, \psi)\). Let’s go through this formula step by step.

  • \(R\) is a matrix with response indicators, which contains the locations of the missing values. For each person and each variable, \(R_{i,j}\) indicates whether person \(i\) has a missing value (\(R_{i,j} = 0\)) or an observed value (\(R_{i,j} = 1\)) on variable \(j\). In our case, we have \(R_{\text{age}}\), \(R_{\text{gender}}\) and \(R_{\text{income}}\).

  • \(Y_{\text{obs}}\) denotes the observed data.

  • \(Y_{\text{mis}}\) denotes the missing data. Note, we don’t have these data, but they might influence whether or not a person has missings on a variable.

  • \(\psi\) denotes the parameters for the missing data models. These parameters relate the observed and missing values to the response indicator.

Now we have a binary response indicator, and some additional variables, we can build a classification model to predict whether or not someone has a missing value on some variable. Normally, we would use the data to evaluate whether the missingness is related to someone’s values on other variables. However, we can also reverse the process, and impose a missingness structure on the observed data that depends on the variables.

In this practical, we will create missingness in the variable income, and we will do that using a logistic regression model. The logistic regression model predicts the probability of a missing value, given a set of indicators.

Note that a logistic regression model for the missingness can be defined as \[ \Pr(R_{\text{income}} = 0 | Y_{\text{obs}}, Y_{\text{mis}}, \psi) = \frac{\exp\{\psi_0 + \psi_1 * \text{age} + \psi_2 * (\text{gender}=\text{male}) + \psi_3 * \text{income}\}}{1 + \exp\{\psi_0 + \psi_1 * \text{age} + \psi_2 * (\text{gender} = \text{male}) + \psi_3 * \text{income}\}}. \] In this equation, \(\psi_0\) is a parameter for the baseline probability of missingness in \(\text{income}\), \(\psi_1\) and \(\psi_2\) relate the observed data to the missingness in \(\text{income}\), and \(\psi_3\) relates unobserved information to the missingness in \(\text{income}\).

If we would translate this equation into R-code, we would have

prob_mis <- function(psi0, psi1, psi2, psi3, age, gender, income) {
  exp(psi0 + psi1 * age + psi2 * (gender == "Male") + psi3 * income) /
    (1 + exp(psi0 + psi1 * age + psi2 * (gender == "Male") + psi3 * income))
}

Not Data-Dependent (NDD)

The data are considered to be NDD if \(Pr(R = 0|Y_{obs}, Y_{mis}, \psi) = Pr(R=0|\psi)\), indicating that the missingness probability is unrelated to the data \(Y\) and only depends on some parameters \(\psi\).

1. Use the function above to add a variable with the missingness probabilities misprob_ndd to the income data, in which you specify all missingness parameters to be 0 (i.e., \(\psi_0 = \psi_1 = \psi_2 = \psi_3\) = 0), such that the probability of having a missing value equals \(0.5\) for everyone.

Tip: use the mutate() function to create the new variable.

income <-
  income |>
  mutate(misprob_ndd = prob_mis(
    psi0   = 0,
    psi1   = 0,
    psi2   = 0,
    psi3   = 0,
    age    = age,
    gender = gender,
    income = income
  ))

Using these probabilities, we can randomly draw the observations that will have a missing value on \(\text{income}\) from a binomial distribution.

2. Use rbinom() to create a missingness indicator R_ndd using the misprob_ndd probabilities, and add this indicator to the income data.

NB: The function prob_mis gives the probability of having a missing, while the missingness indicator must give zero if income is not observed, so use 1 - rbinom() to create a proper missingness indicator.

income <- 
  income |>
  mutate(R_ndd = 1 - rbinom(n(), 1, misprob_ndd))
3. Use the function ifelse() to add a new variable income_ndd to the income data, that equals NA if R_ndd equals 0, and the observed income value otherwise.
income <- 
  income |>
  mutate(income_ndd = ifelse(R_ndd == 0, NA, income))
4. Verify that approximately \(50\%\) of the observations have a missing value on income_ndd.
table(is.na(income$income_ndd))

FALSE  TRUE 
  263   237 

Seen Data-Dependent (SDD)

So far, there was a baseline probability to have missing values, but this was the same for all observations (hence the term, Not Data-Dependent). Data are considered to be Seen Data-Dependent if the probability of having a missing value depends on the observed variables. Formally, we can define this as \(Pr(R = 0|Y_{obs}, Y_{mis}, \psi) = Pr(R=0|Y_{obs},\psi)\).

In this example, we let the missingness in \(\text{income}\) depend on the observed values of \(\text{age}\) and \(\text{gender}\), in such a way that older people have a higher probability of having a missing, and that males have a higher probability of having a missing than females.

5. Use the prob_mis() function to create a vector of predicted probabilities called misprob.sdd, using psi0 = -3, psi1 = 0.05, psi2 = 0.5 and psi3 = 0.
income <-
  income |>
  mutate(misprob_sdd = prob_mis(
    psi0   = -3,
    psi1   = 0.05,
    psi2   = 0.5,
    psi3   = 0,
    age    = age,
    gender = gender,
    income = income
  ))
6. Create a plot that shows how the probability of having a missing value (on the y-axis) depends on \(\text{age}\) (on the x-axis) and \(\text{gender}\) (as color aesthetic).
income |> 
  ggplot(aes(x = age, y = misprob_sdd, colour = gender)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Age", 
    y = "Missingness probability", 
    colour = "Gender",
    title = "Seen data dependent missingness"
  )