We begin by considering samples {x1j, …, xnj} of size n for each of the k random variables xj where j = 1, …, k. Thus the data X = [xij] can be viewed as an n × k matrix where there is the possibility that some of the elements are missing. In line with the usual classification, we describe three types of missing data:
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Not missing at random (NMAR)
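As a minimal sketch of the data layout described above, the matrix X can be held as a list of rows in which `None` marks a missing element. The values and the names `R` and `missing_fraction` are made up for illustration and are not part of the original text.

```python
# Hypothetical 4 x 3 data matrix X (n = 4 rows, k = 3 variables).
# None marks a missing element; the numbers are arbitrary.
X = [
    [23, 1.5, None],
    [31, None, 0],
    [45, 2.1, 1],
    [None, 1.9, 1],
]

n, k = len(X), len(X[0])

# Missingness indicator matrix: R[i][j] is 1 if X[i][j] is missing, else 0.
R = [[1 if value is None else 0 for value in row] for row in X]

# Fraction of missing elements in each of the k variables (columns).
missing_fraction = [sum(R[i][j] for i in range(n)) / n for j in range(k)]
print(missing_fraction)
```

The three missing data types differ in what the pattern of 1s in `R` is allowed to depend on.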
MCAR
With data missing completely at random (MCAR), whether an observation is missing is unrelated both to the values of the other variables and to the values, observed or not, of the variable in question. In other words, the missing elements in the data matrix X are located completely at random.
Thus data from a questionnaire on drug use would not be MCAR if people who take drugs are more likely not to answer all the questions. Similarly, if people from a poorer family are more likely not to complete the questionnaire then the data would not be MCAR (since missingness would be correlated with family income).
Data is seldom MCAR in practice, although if you design an experiment in which you randomly eliminate 10% of the data elements, then the MCAR condition would be met.
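Randomly eliminating a fixed share of the data elements can be sketched as follows. The function name `make_mcar`, the seed, and the toy 20 × 5 matrix are assumptions chosen for illustration only.

```python
import random

def make_mcar(X, fraction, seed=0):
    """Return a copy of X with `fraction` of its elements set to None.

    The cells are chosen uniformly at random, without looking at any
    data value, so the resulting missingness is MCAR by construction.
    """
    rng = random.Random(seed)
    n, k = len(X), len(X[0])
    cells = [(i, j) for i in range(n) for j in range(k)]
    to_drop = rng.sample(cells, round(fraction * len(cells)))
    Y = [row[:] for row in X]  # copy so the original data is untouched
    for i, j in to_drop:
        Y[i][j] = None
    return Y

# Toy 20 x 5 data matrix with no missing values.
X = [[i + j for j in range(5)] for i in range(20)]
Y = make_mcar(X, 0.10)
n_missing = sum(value is None for row in Y for value in row)
print(n_missing)  # 10 of the 100 elements
```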
MAR
A more realistic assumption is that data is missing at random (MAR). Data is considered to be MAR if missingness does not depend on the value of xj after controlling for the other variables. For example, people who come from poorer families might be less inclined to answer questions about drug use, so missingness on the drug use questions is related to family income. If, within each family income group, the probability of answering the questions about drug use is unrelated to the level of drug use, then the data would be considered to be MAR.
The key characteristic of MAR is that missingness can be explained by other variables being studied, so the values of the missing data can to some extent be predicted from those variables.
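The MAR condition can be illustrated with a small simulation. All probabilities below are invented for the sketch: income group is always observed, and the chance of skipping the drug use question depends only on income group, never on drug use itself.

```python
import random

rng = random.Random(1)
rows = []
for _ in range(20000):
    low_income = rng.random() < 0.5
    # Arbitrary illustrative rates linking drug use to income group.
    uses_drugs = rng.random() < (0.3 if low_income else 0.1)
    # MAR mechanism: skipping depends only on the observed income group.
    missing = rng.random() < (0.4 if low_income else 0.1)
    rows.append((low_income, uses_drugs, missing))

def missing_rate(rows, low_income=None, uses_drugs=None):
    """Share of rows with a missing answer, optionally within a subgroup."""
    group = [m for inc, d, m in rows
             if (low_income is None or inc == low_income)
             and (uses_drugs is None or d == uses_drugs)]
    return sum(group) / len(group)

# Ignoring income, missingness looks related to drug use ...
print(missing_rate(rows, uses_drugs=True), missing_rate(rows, uses_drugs=False))
# ... but within an income group the two rates are about the same.
print(missing_rate(rows, True, True), missing_rate(rows, True, False))
```

Ignoring income, users appear to skip the question more often; once income group is held fixed, that difference largely disappears, which is what MAR requires.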
NMAR
Data that is not MAR is called not missing at random (NMAR). Here missingness depends on the missing value itself, e.g. if students skipped a questionnaire item asking whether or not they used drugs because they feared that they would be expelled from school.
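A rough simulation of this situation, with made-up probabilities, shows why NMAR is troublesome: when drug users are more likely to skip the question, the rate computed from the answers that were given understates the true rate.

```python
import random

rng = random.Random(2)
n = 20000

# True (partly unobserved) answers; the 20% usage rate is arbitrary.
uses_drugs = [rng.random() < 0.2 for _ in range(n)]

# NMAR mechanism: the chance of skipping depends on the answer itself.
# Users skip with probability 0.6, non-users with probability 0.1.
observed = [d for d in uses_drugs if rng.random() >= (0.6 if d else 0.1)]

true_rate = sum(uses_drugs) / n
observed_rate = sum(observed) / len(observed)
print(true_rate, observed_rate)  # the observed rate is noticeably lower
```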
While there are some tests to determine whether data is MCAR, there are no definitive tests for MAR or NMAR since any such test would depend on unobserved data. In any case, it is common to assume that data is MAR unless there is good reason to believe otherwise. Also, most of the procedures to handle missing data (Multiple Imputation, Expectation-Maximization, etc.) depend on the MAR assumption.
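One informal check on the MCAR assumption (a simple stand-in, not one of the formal MCAR tests) is to compare a fully observed variable between the rows where another variable is missing and the rows where it is present. The sketch below uses a Welch t statistic on simulated income data; the names `welch_t` and `t_for` and all of the numbers are assumptions made for illustration.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch two-sample t statistic for the difference in means."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

rng = random.Random(3)
income = [rng.gauss(50, 10) for _ in range(5000)]  # fully observed

# Missingness indicator for a second variable under two mechanisms.
mcar_mask = [rng.random() < 0.2 for _ in income]                    # MCAR
dep_mask = [rng.random() < (0.4 if x < 50 else 0.1) for x in income]  # not MCAR

def t_for(mask):
    """Compare income where the other variable is missing vs. present."""
    miss = [x for x, m in zip(income, mask) if m]
    obs = [x for x, m in zip(income, mask) if not m]
    return welch_t(miss, obs)

t_mcar = t_for(mcar_mask)
t_dep = t_for(dep_mask)
print(t_mcar, t_dep)
```

A large absolute t value is evidence against MCAR. A small one does not establish MCAR, and no comparison of this kind can separate MAR from NMAR, since that distinction depends on the unobserved values.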
References
Raghunathan, T. (2016) Missing data analysis in practice. CRC Press
https://www.taylorfrancis.com/books/mono/10.1201/b19428/missing-data-analysis-practice-trivellore-raghunathan
Howell, D. (2008) The treatment of missing data
https://www.uvm.edu/~statdhtx/StatPages/Missing_Data/MissingDataFinal.pdf
Hi,
Thank you for all this information, very useful!
Can you (or somebody else) tell me if multiple regression FIML is a good method for nondetect values (<LDM)?
Thanks again.
Sebastian
Juan,
By LDM do you mean linkage disequilibrium mapping?
Charles