Dealing with Missing Data

Basic Concepts

One problem faced when collecting data is that some of the data may be missing. For example, in conducting a survey with ten questions, perhaps some of the people who take the survey don’t answer all ten questions. In Identifying Outliers and Missing Data we show how to identify missing data using a data analysis tool provided in the Real Statistics Resource Pack.

A simple approach for dealing with missing data is to throw out all the data for any sample missing one or more data elements (called listwise deletion). One problem with this approach is that the sample size will be reduced. This is particularly relevant when the reduced sample size is too small to obtain significant results in the analysis. In this case, additional sample data elements may need to be collected. This problem is bigger than it might appear at first. E.g. if a questionnaire with 5 questions is randomly missing 10% of the data, then on average about 41% of the sample will have at least one question missing.

Also, it is often the case that the missing data is not randomly distributed. E.g. people filling out a long questionnaire may give up at some point and not answer any further questions, or they may be offended or embarrassed by a particular question and choose not to answer it. These are characteristics that might be relevant to the analysis being conducted.

In general, the following are typical remedies for dealing with missing data:

  • Delete the samples with a missing data element
  • Impute the value of the missing data
  • Remove a variable (e.g. a particular question in the case of a questionnaire or survey) that has a high incidence of missing data, especially if there are other variables (e.g. questions) that measure similar aspects of the characteristics being studied.

We explore the first two of these techniques on the remainder of this web page.

Deleting Missing Data

Of particular importance is the randomness of the missing data. E.g. suppose a lot of people didn’t answer question 5 but everyone answered question 7. If the frequency of the responses to question 7 changes significantly when samples that are missing responses to question 5 are dropped, then the missing data is not random, and so dropping samples can bias the results of the analysis. In this case, either another remedy should be employed or the analysis should be run twice: once with samples with missing data included (e.g. by adding a “no response” for any missing data) and once with these samples dropped.

Worksheet Functions

Missing data can be removed by using the following array functions provided in the Real Statistics Resource Pack.

Real Statistics Functions

DELBLANK(R1, s) – fills the highlighted range with the data in range R1 (by columns) omitting any empty cells

DELNonNum(R1, s) – fills the highlighted range with the data in range R1 (by columns) omitting any non-numeric cells

DELROWBLANK(R1, head, s) – fills the highlighted range with the data in range R1 omitting any row that has one or more empty cells; if head is TRUE then the first row of R1 (presumably containing column headings) is always copied (even if it contains an empty cell); this argument is optional and defaults to head = FALSE.

DELROWNonNum(R1, head, s) – fills the highlighted range with the data in range R1 omitting any row that has one or more non-numeric cells; if head is TRUE then the first row of R1 (presumably containing column headings) is always copied (even if it contains a non-numeric cell); this argument is optional and defaults to head = FALSE.

The string s is used as a filler in case the output range has more cells/rows than needed. This argument is optional and defaults to the error value #N/A. See Data Conversion and Reformatting for an example of the use of these functions.

The following additional function is useful with dynamic arrays (see Dynamic Array Formulas).

DELROWS(R1, head, blank): outputs an array with the data in R1 omitting any row that has one or more blank elements if blank = TRUE or one or more non-numeric elements if blank = FALSE (default); if head = TRUE, then the first row is always included in the output; otherwise (default), the first row is treated like any other row.

Related Functions

In addition, there is the function CountFullRows(R1, blank) where blank = TRUE (default) or FALSE. 

CountFullRows(R1, TRUE) = the number of rows in range R1 which don’t have any empty cells

CountFullRows(R1, FALSE) = the number of rows in range R1 which don’t have any non-numeric cells

There is also the related Real Statistics function CountPairs(R1, R2, blank) where blank = TRUE (default) or FALSE. Here we look at pairs of cells from R1 and R2: the ith cell in R1 is paired with the ith cell in R2

CountPairs(R1, R2, TRUE) = the number of pairs for which neither cell in the pair is empty

CountPairs(R1, R2, FALSE) = the number of pairs for which neither cell in the pair is empty or non-numeric

Note that in standard Excel the equivalent of CountPairs(R1, R2, TRUE) can be calculated by

=SUMPRODUCT(1-ISBLANK(R1),1-ISBLANK(R1))

CountPairs(R1, R2, FALSE) can be calculated by

=SUMPRODUCT(–ISNUMBER(R1),–ISNUMBER(R2))

To calculate the number of pairs with equal numeric entries, we can use the formula

=SUMPRODUCT(–ISNUMBER(R1),–ISNUMBER(R2),–(R1=R2))

Examples

Example 1: Delete any missing data listwise (indicated by an empty cell) from the sample in range A3:G22 of Figure 1.

Listwise deletion missing data

Figure 1 – Listwise deletion of missing data

Since we want to delete any row that contains one or more empty cells (except the first row which contains column headings), we highlight range I3:O22, enter the array formula =DELROWBLANK(A3:G22,TRUE), and press the key sequence Ctrl-Shft-Enter to produce the output shown on the right side of Figure 1.

Users of Excel 2021 or 365 can use the following Excel formula to achieve the same result:

=FILTER(A3:G22,(LEN(B3:B22)>0)*(LEN(C3:C22)>0)*(LEN(D3:D22)>0)*(LEN(E3:E22)>0)*(LEN(F3:F22)>0)*(LEN(G3:G22)>0))

Cell M24 contains the formula =CountFullRows(A4:G22).

Data Analysis Tool

Real Statistics Analysis Tool: The Real Statistics Resource Pack supplies the Reformatting a Data Range by Rows data analysis tool which provides easier-to-use versions of the DELROWBLANK and DELROWNonNum functions described above.

We can also use the Reformatting a Data Range data analysis tool as a substitute for the DELBLANK and DELNonNum functions. We won’t demonstrate this tool here, but see Data Conversion and Reformatting for more information about how to use that tool.

Example 2: Repeat Example 1 using the Reformatting a Data Range by Rows data analysis tool.

To use this data analysis tool press Ctrl-m and choose the Reformatting a Data Range by Rows option. A dialog box will appear as in Figure 2. Fill in the dialog box as indicated and click on OK. The output will be identical to that shown in range I3:O22 of Figure 1.

Reformat range by rows

Figure 2 – Dialog box for Reformat Data Range by Rows

The data analysis tool will output the same number of rows as in the input data range, but any extra rows will be filled in with the values #N/A. Since four rows had at least one empty cell, four rows are deleted from the output (those for Arkansas, Colorado, Idaho, and Indiana) and so the last four rows of the output need to be filled with #N/A.

Actually, all the cells in the output range I3:O22 will contain the array formula =DELROWBLANK(A3:G22,TRUE) and so if we change the value of cell B15 to say 10.2, the row for Idaho would now automatically appear in the output and there would be one less row with values #N/A.

Filler field

If we had entered an asterisk in the Filler field of Figure 2, then the output would be the same as we saw in Figure 1 except that this time all the cells in range I19:O22 would contain an asterisk instead of #N/A.

If we had entered the number 0 in the Filler field then all the cells in the output range would contain the array formula

=DELROWBLANK(A3:G22,True,””)

although the values of all the cells in the range I19:O22 would be empty. As before if we change the contents of cell B15 to 10.2, then the row for Idaho would appear in the output and only three rows with empty cells would appear. All the cells in the output range would still have the same array formula, namely

=DELROWBLANK(A3:G22,TRUE,””)

Freeze output range size field

If we had checked the Freeze output range size element then the data analysis tool would determine that four rows have missing data and so it would output a range with four fewer rows, namely the range I3:O18. Although the output would be displayed exactly as in the case described in the previous paragraph, this time only the range I3:O18 would contain the formula =DELROWBLANK(A3:G22,TRUE). This time if cell B15 is changed to 10.2, then Idaho would be added to the output range, but since the output range only goes down to row 18, the last input row (that for Maine) would not be displayed, which is probably not what we want.

In conclusion, the Freeze output range size option makes the output cleaner (since all the rows contain data), but should not be used if there is the possibility that some missing data may be added later.

Imputing the values for missing data

Some techniques for imputing values for missing data include:

  • Substituting the missing data with another observation that is considered similar, either taken from another sample or a previous study
  • Using the mean of all the non-missing data elements for that variable. This might be acceptable in cases with a small number of missing data elements, but otherwise, it can distort the distribution of the data (e.g. by reducing the variance) or by lowering the observed correlations (see Basic Concepts of Correlation).
  • Using regression techniques. In this approach regression (as described in Regression and Multiple Regression) is used to predict the value of the missing data element based on the relationship between that variable and other variables. This approach reinforces existing relationships and makes it more likely that the analysis will characterize the sample and not the general population.

For a more complete description of approaches that can be used to deal with missing data, see Handling Missing Data.

Examples Workbook

Click here to download the Excel workbook with the examples described on this webpage.

References

Wikipedia (2012) Missing data
https://en.wikipedia.org/wiki/Missing_data

Wikipedia (2012) Imputation (statistics)
https://en.wikipedia.org/wiki/Imputation_(statistics)

51 thoughts on “Dealing with Missing Data”

  1. Hi
    Thank you for Supporting me!
    I have the missing Clinical data in the Previous Year that is weakily data from 52 weak 5 weak no data at the Middle and at the end at the seam time no data from 47 – 52 weakily data in 2020 but i have fool data up to 2016- 2019
    ,haw to fill the gap ? thank you!

    Reply
  2. Charles, could you explain me how you got 60% on your example below.
    E.g. if a questionnaire with 5 questions is randomly missing 10% of the data, then on average almost 60% of the sample will have at least one question missing.

    Reply
    • Hello Gerson,
      I can’t remember which source provided this figure. In any case, the value is not correct. Given that the probability that an answer for any question is missing is 10%, the probability that it is not missing is 90%. The probability that none of the 5 questions on any questionnaire is missing is .9 ^ 5 = .59049. Thus the probability that any questionnaire will have at least one question missing is 1-.59049 = .40951. Thus, the sentence should read, “if a questionnaire with 5 questions is randomly missing 10% of the data, then on average about 41% of the sample will have at least one question missing.”
      Thanks for making me aware of this error. I have just corrected the error on the webpage.
      Charles

      Reply
  3. Hi,
    I need to impute missing data in a dataset with minimum value of each row. how do I do it computationally on excel. I went through some videos but did not find what I need.
    Thank you

    Reply
    • This could be accomplished via =IF(C1″”,C1,MIN(R1)) where C1 is the cell and R1 is the row.
      To use this for data in the range A1:D10 with the output placed in range F1:I10, you would insert the formula =IF(A1″”,A1,MIN($A1:$D1)) in cell F1. Then highlight the range AF1:I10 and press Ctrl-D and Ctrl-R.
      Charles

      Reply
  4. Charles – greta website! Kudus.

    i think there is a typo in the M-estimator page,when looking for convergance.

    shouldn’t it be X(k+1) – X(k) < Epsilon ?

    Reply
  5. Thank you for covering an important but often overlooked part of data analysis. That is Missing Data. The approaches you have mentioned are surely going to be helpful in data analysis.

    Reply
  6. Hi,

    I’m trying to run a two-way ANOVA test, but my rows do not match up. In this example, I’m comparing the scores from the entry exams, midterm exams, and final exams between the males and females of the class, but, after removing the data points that did not have all three grades, there are more female data points than male. Because the ANOVA test requires an equal number of rows, should I go about removing female data points or adding male data points? In addition, would I do so arbitrarily or should there be a certain methodology behind it?

    Thank you!

    Reply
  7. Hi Charles,

    Thanks for these resources, and your willingness to help people with their problems. I’ll be sure to recommend your resources to my fellow excel statisticians.

    I have a question regarding covariance matrices.
    Currently I am using your ‘=DELROWNONNUM’ function to omit rows with missing data, and this provides a handy table for my covariance matrix.

    However, my data are a sample, and omitting entire rows due to one missing cell is not ideal. I may be omitting good data which express a covariance, purely because an unrelated datum is missing.

    Is there a way I can include all rows with at least 2 values in a table palatable for the covariance matrix?

    I realise I could achieve this manually (aided by your functions), but I have many variables, and that defeats the spirit of programming!

    I’d greatly appreciate any help.

    Many thanks,
    Tom

    Reply
  8. Dear Charles,

    Hopefully you can help me with my problem:

    I have a subscale of 8 questions (Likert scores 1-7), where only one answer can be missing: respondents with 2 or more missing values are dismissed for this subscale.
    I want to use the mean of the non-missing values for that person, on that subscale. So, NOT the sMEAN, because that uses the mean of all the respondents.

    Until now, I only have the mean of the scores with 7 and 8 questions answered.

    COMPUTE Friend_mean=MEAN.7(V1,V2,V3,V4,V5,V6,V7,V8).
    EXECUTE.

    How do I imput this mean score into the missing values? ONLY for those with only one missing?

    I tried Recode Missing Values and some IF (SYSMIS) etc. but I can’t figure it out 🙁

    Can you help me with the syntax I need?

    All the best,

    Lynn.

    Reply
    • Lynn,
      You are clearly not using Excel as your statistical analysis tool. This website is about Excel.
      I don’t know which tool you are using and am not familiar with the syntax of SPSS, SAS and the other tools. Sorry.
      Charles

      Reply
  9. I have rows of data and some of them have missing data. I am trying to sort out all rows with missing data down while all rows with full data will be up. Please how do I get this done?
    Thank you

    Reply
    • Segun,
      Suppose your data is in range A1:D100. Place the formula =COUNT(A1:D1) in cell E1, highlight the range E1:E100 and press Ctrl-D. If a cell in column E has value 4 then the corresponding data row is full; otherwise it is missing data. Now sort range A1:E100 from largest to smallest using column E as the sort key. You can do this using Excel’s sort capability (select Data > Sort) or by using the Real Statistics QSORTROWS function.
      Charles

      Reply
  10. hi,

    is there an easy way where i can automatically distribute the weights of my metrics if my actual data is missing(not available)?

    15.00% 10.00% 10.00% 15.00%
    prod nps nps reso attendance
    5.00 3.50 4.00 4.50

    Reply
  11. I have missing residual errors and a regression F missing but I do have the other info. How do i put my numbers I have into excel to get my missing information? I appreciate any help you can give me.

    Reply
  12. I have data in the form:

    +———-+——–+——-+
    | ID#| Name| Age|
    +———-+——–+——-+
    | 1 | John | 20 |
    | | Jill | 23 |
    | 2 | Mary | 50 |
    | | Todd | 5 |
    | | Alex | 6 |
    | 3 | Bob | 35 |
    +———-+——–+——-+

    I wish to replace the blanks with the first non blank value found above it to obtain:

    +———-+——–+——-+
    | House#| Name| Age|
    +———-+——–+——-+
    | 1 | John | 20 |
    | 1 | Jill | 23 |
    | 2 | Mary | 50 |
    | 2 | Todd | 5 |
    | 2 | Alex | 6 |
    | 3 | Bob | 35 |
    +———-+——–+——-+

    I have hundreds of cases so doing it with copy & paste would be very time consuming. Is there a “command” I can use to do this?

    Reply
    • I don’t know of a simple command that will do this, but here is a pretty simple procedure that might help. This looks more complicated than it really is. It took me only a few seconds to do this and it did the job well.

      Let’s suppose that the data is contained in range A1:C500 where column A has some blank cells (and we will assume that cell A1 is not blank).

      1. Insert the formula =A1 in cell E1 (or copy cell A1 into cell E1)
      2. Insert the formula =IF(A2=””,E1,A2) in cell E2
      3. Highlight the range E2:E500 and press Ctrl-D
      4. Press Ctrl-C (i.e. copy range E2:E500)
      5. Click on cell A2 and press Ctrl-Alt-V and then select the Paste Values option and press OK
      6. Highlight range E1:E500 and press the Delete key (to erase column E)

      Charles

      Reply
    • I know this is an old post but it is a common question and there actually is an easier way! Try using the Fill feature in PowerQuery. This is not the same function as Fill on the Home tab of Excel, that will not work and will overwrite your data.

      If you have not heard of PowerQuery, or it’s superhero cousin PowerPivot, then you should know they are Excel add-ins that Microsoft released free of charge for some versions of Excel. You can find the details on Microsoft’s website.

      Using Office 365:

      1) Select your data range
      2) Go to the Data tab and click “From Table/Range” on the ribbon. It’s on the left side in the Get and Transform section.
      3) Select the column you want to change in the new window.
      4) Go to the Transform tab of the ribbon, click Fill, select Down (it’s in roughly the same location on the ribbon as “From Table/Range”)
      5) In the file menu, click Close and Load to close the editor and load your data back into Excel as a new table. Alternatively, you can copy the column to paste over your original data and close/discard the query.

      Easy!!

      Side note: Creating references and changing the structure of your data is something PowerQuery excels at (pardon the pun). It also records every step in a query so it can be automatically recreated by simply refreshing the source.

      Reply
  13. Hi Charles,

    I am going a regression analysis with ‘Facebook liking’ as the independent variable and ‘intention to donate money’ as dependent variable. I included some questions in my survey that were only showed to those participants that ‘liked’ a charity’s Facebook page. Now I want to include these variables in my regression analysis, but since only the people that liked a FB page answered the question, SPSS automatically doesn’t take into account the rest of my sample anymore. However, the inclusion of these participants is important for testing my main effect of liking on intention. How can I include my whole sample in regression, even if some people did not answer all questions?

    Reply
  14. Hello, I’m trying to do a statistic analysis by performing a linear regression model in excel. I actually collect data (FDI) from 10 counties over the 20 years period of time. But the problem is that there are many missing datas like there is an available number in some years but none in another years. Moreover, in some year there is a data for one country but the other countries may not have. For example, FDI in 1995 may have in France but it did not have in Germany and like from 1995 to 2015, there are only data for 1995, 1998, 2000, 2006, 2009, 2013, the rest is missing.
    At first I tried to generate a mean over that 20 year period within each country. But the analysis was not significant at all and it was way much bias as I had actually reduced from 20 years variables to only one year variable for each country.
    Currently, I’m really out of idea. I don’t know how to solve this missing problem at all. Could the imputation method be possible to apply or there could be other ways?

    Reply
  15. I am running a regression analysis in which I want to predict housing prices based upon variables in previously sold homes (number of bedrooms, bathrooms, year built, amenities, etc) In many cases the historical data is missing information on one or more variables. If I delete the row of data in which a missing value occurs, I’m going to lost a LOT of cases. Will Excel not run regression if a cell is left blank? If that is the case, is there a work around?

    Reply
  16. How can I replace missing data for questions that didn’t appear to respondents.
    I have logic skip questions in my survey, so if for example Q 2 has a scale from 1-6 and the respondent gave me an answer on that scale from 1-3. Q 3 shouldn’t be displayed to the respondent. I have 40 cases missing in my data for that question only because the Q3 wasn’t displayed to the respondents

    Reply
    • Shaimaa,
      How you deal with missing data depends on what you plan to do next. Assuming that there was no missing data, what sort of analyses are you planning to do?
      Charles

      Reply
  17. I am very elementary at statistics. I am working with total population numbers for each year from 2000 to 2010. I am missing the total population for the year 2003. The total population for year 2002 is 1012 and for year 2004 is 1146. What is the best method to to replace the missing population number for year 2003? What is the name of this best method? How valid will this method be?

    Reply
  18. I am trying to find an appropriate way to handle missing data due to the death of the subject when that death is influenced by treatement.

    i.e. Treatment A is supposed to alleviate the symptoms of a disease and to delay death in animal subjects relative to treatment B.
    Regular measurements of weight and clinical score are taken from subjects. The study is continued till many of the subjects have become moribund (when they are euthenized)

    Analysis of date of death is straightforward, but how can I analyze or impute data for body weight and clinical score when the worst affected animals are gone?

    Reply
    • Ambrose,
      The typical approaches to imputing values to missing data are based on the assumption that such data are missing at random (with various definitions of what this means). Since this is not your case, you can’t use any of these techniques.
      I would need some more information about what sorts of data you are referring to. If it is weight, then perhaps you could use the weight just before or at death. If it is heart rate, perhaps you could use the heart rate just before death or the value zero. In other words, the values assigned to “missing” data probably don’t rely on mathematical techniques, but approaches based on your domain knowledge.
      Charles

      Reply
  19. I would like to compute the reduction percentage only of certain data (A and B) and wherein I have values as eg:
    1. A=1 and B=2 it resulted to 200% decreased, in which it has corresponding points of 75 for every 75-100% decreased;
    2. A=2 and B=1 it resulted to 50% decreased with same corresponding pts of 50 for every 50-74% increased;
    3. A=0 and B=0, or maintained values like A=5 and B=5; which has no reduction nor increased. Wherein the corresponding pts if there is reduction from 0-25% is 20;
    4. A=1 and B=0 which resulted to an error;
    5. A=0 and B=1 with 0 result which conflicted to number 3 example (A=0 and B=0);
    And also with blank/empty cell.

    Reply
    • Sorry Bryan, but I don’t understand your question. Are you looking for a single value which gives the overall percentage of reduction?
      Charles

      Reply
  20. Hi Sir Charles,

    I have a degree in statistics but it’s not enough for me to face common challenges particularly in analyzing survey results which commonly use only frequency and percentage.

    However, I have a question. I know it is very simple for you but I am just not sure what is the right way of doing so. This is just frequency and percentage, not really advance stat analysis etc. But how should I specifically handle this case: we ask the sample if they have business, if yes it will proceed to question like “did you already already accessed loan?”. Say having business is 256, but next question shows that accessed loan=100 and did not accessed=150 which means there is 6 is missing. How this case should be dealt particularly on computing the percent frequency? should it be accessed loan: 100/250 or 100/256??

    Reply
    • It really depends on how you will use the information subsequently, but based on what I have heard so far, I would say 100/250.
      Charles

      Reply
    • Amy,

      One way to do it is to create a copy of the data in column D with the missing data filled in. Assuming missing data is an empty cell, this can be accomplished by

      1. Highlight the range D2:D73
      2. Insert the formula =IF(B2<>“”,B2,AVERAGE(B$2:B$73)) in cell D2
      3. Press Ctrl-D

      Charles

      Reply
  21. This function is not found in normal excel function for Microsoft Office 2007 and 2010. But I do not know if it is contained in other advanced Microsoft Office. This is very useful especially with very huge data.

    Reply
    • Not sure which function you are referring to, but the functions described on this part of the website are part of the Real Statistics Resource Pack, and are not contained standard Excel.
      Charles

      Reply
  22. I’m calculating the min, median, max, mean, standard deviation of a set of data with missing values in it using excel. How do I replace those missing data by mean or by median in excel? And when is replacement by median suitable?

    Reply
    • First note that when calculating the min, median, max, mean and standard deviation Excel ignores any missing data.

      Suppose your data is in range A1:D10 and the mean or median is in cell A12 and missing data is indicated by an empty cell. Now place the formula =IF(A1=””,A12,A1) in cell F1 and then highlight the range F1:I10 and press Ctrl-R and Ctrl-D. Range F1:I10 will now contain the desired values.

      As explained on the website, replacing missing data by the mean (or median) will reduce the variance of the data and so is undesirable. About the only time you should even consider doing this is if only a very small percentage of the data is missing.

      Charles

      Reply
  23. I am trying to calculate individuals’ variances on multiple measures with missing data coded as -99. I would like my formula to calculate the variance for these measures while ignoring/skipping/omitting the -99 values.

    I do not want to change all of the -99 values to 0 or blank because it is a an indicator of how many questions they saw but chose to skip in a survey with lots of branching. If I wanted to get rid of them I would just use ctrl+f and replace them with “”. Any suggestions other than using SPSS or other statistical softwares?

    Currently my formula is: =IF(ISERROR(VAR(DJ4:EH4)),””,VAR(DJ4:EH4))

    But, it could be simplified to =VAR(DJ4:EH4) to figure out how to do this first.

    I would like it to be something like: calculate the var(DJ4:EH4) except for cells = -99.

    I tried =Var(DJ4:EH4, <"-99"), but that didn't work.

    Reply
    • Tiffany,

      The the data is in range R1 then the following formula should work. Keep in mind that this is an array formula and so you need to press Ctrl-Shft-Enter.

      =VAR(IF(R1=-99,””,R1))

      Charles

      Reply

Leave a Comment