In this part of the website, we explore how to deal with missing data. We begin by describing the various types of missing data and then describe some traditional approaches for dealing with missing data, including the shortcomings of these approaches.
Next, we describe some more advanced approaches, including Multiple Imputation (MI) and Full Information Maximum Likelihood (FIML), and show how to use them in performing multiple regression. We also explore the EM algorithm.
Topics
- Types of Missing Data
- Traditional Approaches for Dealing with Missing Data
- Multiple Imputation (MI)
- Full Information Maximum Likelihood (FIML)
- EM Algorithm
References
Raghunathan, T. (2016)Â Missing data analysis in practice. CRC Press
https://www.taylorfrancis.com/books/mono/10.1201/b19428/missing-data-analysis-practice-trivellore-raghunathan
Howell, D. (2008)Â The treatment of missing data
https://www.uvm.edu/~statdhtx/StatPages/Missing_Data/MissingDataFinal.pdf
Hi Charles,
I am looking to run a correlation in excel. I have a lot of missing data for this question because there were several different paths respondents could take in the survey. Should these be blank when I run the correlation? I have the option of exporting the CSV with 9s for missing data or blank. I get different results when I run it with 9s vs. blank. Am assuming the 9s are impacting the results and indicating a higher correlation. Thanks for the help.
Best,
Caitlin
Caitlin,
Yes, 9s will impact the value since they will be treated as the number 9 instead of as a blank.
If, say, your data is inserted in the ranges A1:A20 and B1:B20, then CORREL(A1:A20,B1:B20) will calculate the correlation between the data in the two ranges. Any row with one or two blanks (missing data) will be ignored).
Charles
Charles,
I also want to report that the array function GRID_HDI does not work. When I use the function for the example 2 provided in the web, it came back with “#NAME?”.
-Sun
Hello Sun,
I just checked and GRID_HDI is working in the latest release, Rel 7.1. I understand that you have installed the latest release but check that Excel is using the latest release by inserting the formula =VER() in any cell.
Charles
As usual, thank you so much for your “prompt” response to my question.
I re-downloaded the software and it worked like a charm. However, I just wanted to bring your attention about the inconsistent link that led me to download a wrong version on the web at the first place. When I followed the link shown in ” If you are using Excel 2007 click here. “, this made me to download “RealStats-2007” rather than the correct version. When I checked the version for this software, it said “5.11Excel2007”. So, I thought it was atypical version name.
Well, now I have the correct version, I am planning to indulge all examples you have on the web.
Cheers,
-Sun
Charles,
As there is no comment field in the Bayesian analysis section, I am using this section to ask the following question:
Based on the web instruction, it stated that “press Ctrl-m and select the Creating a Grid for Bayesian Analysis option (from the Misc tab if using the MultiPage user interface)”. However, there is no such analysis option under the “Misc” or any other tabs. Please advise how I can access to this tool.
Note that I have downloaded the most current version just to ensure that I am operating based on the most current version of the Real Stat.
Thanks,
-Sun
Hello Sun,
I just checked and Creating a Grid for Bayesian Analysis is working in the latest release, Rel 7.1. Please see my response to your previous comment.
Charles
Hi Charles
Your website seems very helpful. I was wondering whether there are any specific techniques to handle missing data in time series.
For example, I downloaded the Chicago Beach dataset (https://data.cityofchicago.org/Parks-Recreation/Beach-Weather-Stations-Automated-Sensors/k7hf-8y75)
which is supposed to contain hourly sensor data.
However, I noticed that some hours are missing.
I scanned through the techniques on your website for handling missing data but could not find one that corresponds to this situation.
Would appreciate your guidance
Thanks in advance.
Chuck
Hi Chuck,
I have not yet implemented any techniques for handling missing data in time series. The following webpages explain various ways to handle missing data in time series. Linear interpolation is probably the easiest to implement among the better approaches.
https://www.kaggle.com/juejuewang/handle-missing-values-in-time-series-for-beginners
https://cloud.r-project.org/web/packages/imputeTS/vignettes/imputeTS-Time-Series-Missing-Value-Imputation-in-R.pdf
Charles
Dr. Zaiontz,
I have scattered missing data cells throughout my dataset but do not plan on any systematic imputation/prediction to fill them in – so it’s unbalanced here and there. I’m running consistency tests (cronbach’s alpha) but the issue may apply to other functions.
If a data input range includes cells with no data (noData) it will effect the test outcome compared to selecting a subset input range with complete (balanced) data. After trial and error my conclusion is the algorithm does not ignore noData cells and I wish it would – is there an option to do that?
Thanks,
Andrew K.
Andrew,
This is a long topic and the answer really depends on the data and what you are trying to accomplish, but here is a short answer assuming that you can’t or don’t want to do any imputation.
1. Remove any samples with missing data (listwise deletion)
2. If you want to create a balanced model you will need to randomly delete non-missing data from the samples with more elements.
Charles
Sir,
Are there times when it is preferable to take no action on missing data? Just go ahead and analyse results without the missing data.
Thanks.
Abiola,
The problem is that most of the tests don’t work if there is missing data.
Charles