Multiple Imputation Overview

As we see in Traditional Method for Handling Missing Data, single imputation approaches produce inaccurate estimates of the mean, variance, or covariance matrix, depending on the specific technique used. Multiple imputation (MI) provides a way around these difficulties by generating multiple imputations, each with a random component, and then combining the results. In this way, MI creates values for the missing data that preserve the inherent characteristics of the variables (means, variances, covariances, etc.).

Steps

In simple terms, the steps used are as follows:

  1. Impute the missing data m times to create m plausible complete data sets
  2. Run the appropriate analysis on each of the m complete data sets
  3. Combine (pool) the m analyses into a single set of results

We use an iterative approach in which each iteration consists of two steps. The first step, the imputation step, is similar to stochastic regression imputation: the missing values are filled in using the current estimates of the population parameters (such as means, variances, and covariances) plus a random component. The second step, the posterior step, uses Bayesian estimation techniques to draw new estimates of the population parameters based on the completed data from the imputation step.
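To make the two steps concrete, here is a minimal Python/NumPy sketch for a single variable y with missing values and one fully observed predictor x. The simulated data, the normal regression model, and the choice of 20 iterations are illustrative assumptions, not taken from any particular software package.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y depends on x; about 30% of the y values are then "lost"
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
miss = rng.random(n) < 0.3                    # indicator of missing y values

X = np.column_stack([np.ones(n), x])          # design matrix with intercept
y_imp = np.where(miss, 0.0, y)                # working copy; missing slots filled below

# Initial parameter estimates (beta, sigma^2) from the complete cases only
beta, *_ = np.linalg.lstsq(X[~miss], y[~miss], rcond=None)
sigma2 = np.mean((y[~miss] - X[~miss] @ beta) ** 2)

for _ in range(20):                           # fixed number of iterations
    # Imputation step: fill in the missing values by stochastic regression,
    # i.e. the predicted value from the current parameters plus random noise
    y_imp[miss] = X[miss] @ beta + rng.normal(scale=np.sqrt(sigma2),
                                              size=miss.sum())

    # Posterior step: draw new population parameters (beta, sigma^2) from an
    # approximate Bayesian posterior given the completed data
    beta_hat, *_ = np.linalg.lstsq(X, y_imp, rcond=None)
    resid = y_imp - X @ beta_hat
    sigma2 = resid @ resid / rng.chisquare(n - X.shape[1])
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(X.T @ X))

# y_imp is now one plausible completed version of y; repeating the whole loop
# with different random draws yields the other m - 1 imputed data sets
```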

These iterations are repeated a fixed number of times to create one imputed version of the missing data. A fixed number of such imputed data sets (generally 5, 10, or 20) is produced, and the desired analysis is run on each of them. The results of these analyses are then combined to produce a single analysis.
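As an illustration of the full procedure in software, the sketch below uses the mice module from Python's statsmodels package, which produces the repeated imputations, fits the chosen analysis model to each completed data set, and pools the results into a single summary. The simulated data frame, the column names, and the choice of 20 imputations are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

rng = np.random.default_rng(0)

# Simulated data frame in which about 25% of the x2 values are missing
n = 300
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - x2 + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})
df.loc[rng.random(n) < 0.25, "x2"] = np.nan

# Step 1: set up the chained-equations imputer for the incomplete data
imp = mice.MICEData(df)

# Steps 2-3: fit the analysis model (here an ordinary regression of y on x1
# and x2) to each of the 20 imputed data sets and pool the results
analysis = mice.MICE("y ~ x1 + x2", sm.OLS, imp)
results = analysis.fit(n_burnin=10, n_imputations=20)
print(results.summary())                      # pooled coefficients and std errors
```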

Assumptions

For our purposes we will make the following assumptions:

  • All the variables that will be used in the ultimate analysis are included in the imputation, even those that don’t contain any missing data (although not all the variables used in the MI procedure need to appear in the final analysis)
  • A specific distribution of the parameters we are trying to estimate must be known or assumed for the Bayesian techniques to work properly
  • The missing data are MAR (missing at random)

Note that wherever possible, any variable that can predict the missingness of other variables should be included in the imputation model, even when it is not needed in the final analysis. Such variables are called auxiliary variables.
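A hypothetical sketch of this idea, again using the statsmodels mice module: the auxiliary variable (here named aux, an assumption for the example) is included in the data frame passed to the imputation step, but left out of the formula for the final regression.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

rng = np.random.default_rng(1)

# Simulated data: "aux" is correlated with x2 and drives its missingness (MAR),
# but is not of interest in the final analysis
n = 300
x1 = rng.normal(size=n)
aux = rng.normal(size=n)
x2 = 0.6 * aux + 0.4 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - x2 + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "aux": aux})
df.loc[aux > 0.5, "x2"] = np.nan              # missingness depends on the observed aux

imp = mice.MICEData(df)                            # aux helps impute x2 ...
analysis = mice.MICE("y ~ x1 + x2", sm.OLS, imp)   # ... but is excluded from the analysis
print(analysis.fit(n_burnin=10, n_imputations=20).summary())
```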

