```
library(MASS)
library(tidyverse)
```

# Missing data and imputation I

# Introduction

The aim of this practical is to enhance your understanding in different missingness mechanisms and how they affect your estimates of interest. In addition, we will go over the mean imputation briefly and illustrate why it does not solve the missing data problem.

In this practical, we will perform the following steps:

- Generating missing values
- Visualize missing data
- Perform mean imputation

Make sure to load `MASS`

before `tidyverse`

otherwise the function `MASS::select()`

will overwrite `dplyr::select()`

Before we start, set a seed for reproducibility, we use `45`

.

`set.seed(45)`

# Income dataset

In this practical, we will manually create missingness to get an idea how different missingness mechanisms affect the outcomes of analyses. To this end, we start with a complete dataset, `income.rds`

which can be downloaded here. Load the data in R after downloading as follows:

`<- readRDS("data/income.rds") income `

# Generating missing values

The general expression of the missing data model is \(Pr(R = 0|Y_{\text{obs}}, Y_{\text{mis}}, \psi)\). Let’s go through this formula step by step.

\(R\) is a matrix with response indicators, which contains the locations of the missing values. For each person and each variable, \(R_{i,j}\) indicates whether person \(i\) has a missing value (\(R_{i,j} = 0\)) or an observed value (\(R_{i,j} = 1\)) on variable \(j\). In our case, we have \(R_{\text{age}}\), \(R_{\text{gender}}\) and \(R_{\text{income}}\).

\(Y_{\text{obs}}\) denotes the observed data.

\(Y_{\text{mis}}\) denotes the missing data. Note, we don’t have these data, but they might influence whether or not a person has missings on a variable.

\(\psi\) denotes the parameters for the missing data models. These parameters relate the observed and missing values to the response indicator.

Now we have a binary response indicator, and some additional variables, we can build a classification model to predict whether or not someone has a missing value on some variable. Normally, we would use the data to evaluate whether the missingness is related to someone’s values on other variables. However, we can also reverse the process, and impose a missingness structure on the observed data that depends on the variables.

In this practical, we will create missingness in the variable `income`

, and we will do that using a logistic regression model. The logistic regression model predicts the probability of a missing value, given a set of indicators.

Note that a logistic regression model for the missingness can be defined as \[ \Pr(R_{\text{income}} = 0 | Y_{\text{obs}}, Y_{\text{mis}}, \psi) = \frac{\exp\{\psi_0 + \psi_1 * \text{age} + \psi_2 * (\text{gender}=\text{male}) + \psi_3 * \text{income}\}}{1 + \exp\{\psi_0 + \psi_1 * \text{age} + \psi_2 * (\text{gender} = \text{male}) + \psi_3 * \text{income}\}}. \] In this equation, \(\psi_0\) is a parameter for the baseline probability of missingness in \(\text{income}\), \(\psi_1\) and \(\psi_2\) relate the observed data to the missingness in \(\text{income}\), and \(\psi_3\) relates unobserved information to the missingness in \(\text{income}\).

If we would translate this equation into R-code, we would have

```
<- function(psi0, psi1, psi2, psi3, age, gender, income) {
prob_mis exp(psi0 + psi1 * age + psi2 * (gender == "Male") + psi3 * income) /
1 + exp(psi0 + psi1 * age + psi2 * (gender == "Male") + psi3 * income))
( }
```

# Not Data-Dependent (NDD)

The data are considered to be *NDD* if \(Pr(R = 0|Y_{obs}, Y_{mis}, \psi) = Pr(R=0|\psi)\), indicating that the missingness probability is unrelated to the data \(Y\) and only depends on some parameters \(\psi\).

Using these probabilities, we can randomly draw the observations that will have a missing value on \(\text{income}\) from a binomial distribution.

# Seen Data-Dependent (SDD)

So far, there was a baseline probability to have missing values, but this was the same for all observations (hence the term, Not Data-Dependent). Data are considered to be Seen Data-Dependent if the probability of having a missing value depends on the observed variables. Formally, we can define this as \(Pr(R = 0|Y_{obs}, Y_{mis}, \psi) = Pr(R=0|Y_{obs},\psi)\).

In this example, we let the missingness in \(\text{income}\) depend on the observed values of \(\text{age}\) and \(\text{gender}\), in such a way that older people have a higher probability of having a missing, and that males have a higher probability of having a missing than females.