Handling Missing Data

In this part of the website, we explore how to deal with missing data. We begin by describing the various types of missing data and then describe some traditional approaches for dealing with missing data, including the shortcomings of these approaches.

Next, we describe some more advanced approaches, including Multiple Imputation (MI) and Full Information Maximum Likelihood (FIML), and show how to use them in performing multiple regression. We also explore the EM algorithm.

Topics

References

Raghunathan, T. (2016) Missing data analysis in practice. CRC Press
https://www.taylorfrancis.com/books/mono/10.1201/b19428/missing-data-analysis-practice-trivellore-raghunathan

Howell, D. (2008) The treatment of missing data
https://www.uvm.edu/~statdhtx/StatPages/Missing_Data/MissingDataFinal.pdf

15 thoughts on “Handling Missing Data”

  1. Hi Charles,

    I am looking to run a correlation in excel. I have a lot of missing data for this question because there were several different paths respondents could take in the survey. Should these be blank when I run the correlation? I have the option of exporting the CSV with 9s for missing data or blank. I get different results when I run it with 9s vs. blank. Am assuming the 9s are impacting the results and indicating a higher correlation. Thanks for the help.

    Best,
    Caitlin

    Reply
    • Caitlin,
      Yes, 9s will impact the value since they will be treated as the number 9 instead of as a blank.
      If, say, your data is inserted in the ranges A1:A20 and B1:B20, then CORREL(A1:A20,B1:B20) will calculate the correlation between the data in the two ranges. Any row with one or two blanks (missing data) will be ignored).
      Charles

      Reply
  2. Charles,
    I also want to report that the array function GRID_HDI does not work. When I use the function for the example 2 provided in the web, it came back with “#NAME?”.

    -Sun

    Reply
    • Hello Sun,
      I just checked and GRID_HDI is working in the latest release, Rel 7.1. I understand that you have installed the latest release but check that Excel is using the latest release by inserting the formula =VER() in any cell.
      Charles

      Reply
      • As usual, thank you so much for your “prompt” response to my question.

        I re-downloaded the software and it worked like a charm. However, I just wanted to bring your attention about the inconsistent link that led me to download a wrong version on the web at the first place. When I followed the link shown in ” If you are using Excel 2007 click here. “, this made me to download “RealStats-2007” rather than the correct version. When I checked the version for this software, it said “5.11Excel2007”. So, I thought it was atypical version name.

        Well, now I have the correct version, I am planning to indulge all examples you have on the web.

        Cheers,
        -Sun

        Reply
  3. Charles,
    As there is no comment field in the Bayesian analysis section, I am using this section to ask the following question:
    Based on the web instruction, it stated that “press Ctrl-m and select the Creating a Grid for Bayesian Analysis option (from the Misc tab if using the MultiPage user interface)”. However, there is no such analysis option under the “Misc” or any other tabs. Please advise how I can access to this tool.

    Note that I have downloaded the most current version just to ensure that I am operating based on the most current version of the Real Stat.

    Thanks,
    -Sun

    Reply
    • Hello Sun,
      I just checked and Creating a Grid for Bayesian Analysis is working in the latest release, Rel 7.1. Please see my response to your previous comment.
      Charles

      Reply
  4. Hi Charles

    Your website seems very helpful. I was wondering whether there are any specific techniques to handle missing data in time series.

    For example, I downloaded the Chicago Beach dataset (https://data.cityofchicago.org/Parks-Recreation/Beach-Weather-Stations-Automated-Sensors/k7hf-8y75)

    which is supposed to contain hourly sensor data.
    However, I noticed that some hours are missing.
    I scanned through the techniques on your website for handling missing data but could not find one that corresponds to this situation.
    Would appreciate your guidance

    Thanks in advance.

    Chuck

    Reply
  5. Dr. Zaiontz,

    I have scattered missing data cells throughout my dataset but do not plan on any systematic imputation/prediction to fill them in – so it’s unbalanced here and there. I’m running consistency tests (cronbach’s alpha) but the issue may apply to other functions.

    If a data input range includes cells with no data (noData) it will effect the test outcome compared to selecting a subset input range with complete (balanced) data. After trial and error my conclusion is the algorithm does not ignore noData cells and I wish it would – is there an option to do that?

    Thanks,
    Andrew K.

    Reply
    • Andrew,
      This is a long topic and the answer really depends on the data and what you are trying to accomplish, but here is a short answer assuming that you can’t or don’t want to do any imputation.
      1. Remove any samples with missing data (listwise deletion)
      2. If you want to create a balanced model you will need to randomly delete non-missing data from the samples with more elements.
      Charles

      Reply
  6. Sir,
    Are there times when it is preferable to take no action on missing data? Just go ahead and analyse results without the missing data.
    Thanks.

    Reply

Leave a Comment