Contingency tables with missing elements

In Independence Testing we show how to test independence using a chi-square test. We now show how to perform such a test when there is missing data (provided the data is missing at random).

Example

Example 1: Impute the missing data in the contingency table shown in Figure 1.


Figure 1 – Contingency Table with missing data

We partition our sample into subsets A, B, and C, where A consists of those elements for which both the row and column components are observed, B consists of those elements for which only the row component is observed, and C consists of those elements for which only the column component is observed. There is also a fourth subset D, consisting of those elements for which neither the row nor the column component is observed (for Example 1 there are 25 such elements, i.e. the value in cell E4). The elements in D are excluded from the analysis.
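The partition can be sketched in Python as follows. This is only an illustration: the function name `partition_counts`, the use of `None` to mark an unobserved component, and the sample records are all assumptions made for the example, not data from Figure 1.

```python
def partition_counts(records, rows, cols):
    """Tally (row, col) records into the subsets A, B, C and D.

    A component equal to None is treated as unobserved (an assumption
    of this sketch). Returns the A counts per cell, the B counts per row,
    the C counts per column, and the number of elements in D.
    """
    x_a = {(i, j): 0 for i in rows for j in cols}  # both components observed
    x_b = {i: 0 for i in rows}                     # only the row observed
    x_c = {j: 0 for j in cols}                     # only the column observed
    n_d = 0                                        # neither observed
    for row, col in records:
        if row is not None and col is not None:
            x_a[(row, col)] += 1
        elif row is not None:
            x_b[row] += 1
        elif col is not None:
            x_c[col] += 1
        else:
            n_d += 1
    return x_a, x_b, x_c, n_d


# Hypothetical records, not the data of Example 1
records = [("r1", "c1"), ("r1", "c2"), ("r2", None),
           (None, "c3"), (None, None), ("r1", "c1")]
x_a, x_b, x_c, n_d = partition_counts(records, ["r1", "r2"], ["c1", "c2", "c3"])
print(x_a[("r1", "c1")], x_b["r2"], x_c["c3"], n_d)
```

Only the three returned tallies for A, B, and C enter the analysis; the count for D is reported but not used.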

The completed contingency table, therefore, contains entries $x_{ij}$ where

$$x_{ij} = x_{ij}^A + x_{ij}^B + x_{ij}^C$$

Here $x_{ij}^A$ is observed, but $x_{ij}^B$ and $x_{ij}^C$ are not. The marginal totals are $x_{i\cdot} = \sum_{j=1}^c x_{ij}$ and $x_{\cdot j} = \sum_{i=1}^r x_{ij}$, where the completed contingency table has r rows and c columns (r = 2 and c = 3 for Example 1).

We also define $x_i^B = \sum_{j=1}^c x_{ij}^B$ and $x_j^C = \sum_{i=1}^r x_{ij}^C$. Note that the values $x_{ij}^A$, $x_i^B$, and $x_j^C$ are observed.

EM Algorithm

Essentially we have a multinomial distribution with parameters p = {pij: 1 ≤ i ≤ r and 1 ≤ j ≤ c}. Our goal is to find the values of the pij that maximize the log-likelihood LL based on the observed data. Initially, we set pij = 1/(rc) (initial M step). The algorithm then alternates between the following two steps:

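E step: estimate the values of the $x_{ij}$ based on the observed data and the current estimates of the $p_{ij}$ (from the previous M step):

$$x_{ij} = x_{ij}^A + x_i^B \frac{p_{ij}}{p_{i\cdot}} + x_j^C \frac{p_{ij}}{p_{\cdot j}}$$

where $p_{i\cdot} = \sum_{j=1}^c p_{ij}$ and $p_{\cdot j} = \sum_{i=1}^r p_{ij}$.

M step: estimate the values of the $p_{ij}$ based on the observed data and the current estimates of the $x_{ij}$ (from the previous E step):

$$p_{ij} = \frac{x_{ij}}{n}, \qquad n = \sum_{i=1}^r \sum_{j=1}^c x_{ij}$$

The two steps can be combined so that new values of the $p_{ij}$ (at step k+1) can be estimated from the previous values of the $p_{ij}$ (at step k) by

$$p_{ij}^{(k+1)} = \frac{1}{n} \left( x_{ij}^A + x_i^B \frac{p_{ij}^{(k)}}{p_{i\cdot}^{(k)}} + x_j^C \frac{p_{ij}^{(k)}}{p_{\cdot j}^{(k)}} \right)$$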
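The E and M steps can be sketched in plain Python as follows. This is a minimal illustration rather than the worksheet implementation. The E step shares each row-only count across its row, and each column-only count across its column, in proportion to the current probabilities, and the M step divides the completed counts by their total. The function names, the stopping tolerance `tol`, and the counts used below are assumptions introduced for the example; they are not the data of Example 1.

```python
def em_step(p, x_a, x_b, x_c):
    """Apply one E step followed by one M step.

    p   : r x c list of current cell probabilities
    x_a : r x c list of fully observed counts (subset A)
    x_b : length-r list of row-only counts (subset B)
    x_c : length-c list of column-only counts (subset C)
    Returns (x, p_new): the completed counts and the updated probabilities.
    Assumes every row and column of p has a positive sum.
    """
    r, c = len(x_a), len(x_a[0])
    row_p = [sum(p[i]) for i in range(r)]                       # row sums of p
    col_p = [sum(p[i][j] for i in range(r)) for j in range(c)]  # column sums of p
    # E step: apportion the partially observed counts to the cells
    x = [[x_a[i][j]
          + x_b[i] * p[i][j] / row_p[i]
          + x_c[j] * p[i][j] / col_p[j]
          for j in range(c)] for i in range(r)]
    n = sum(sum(row) for row in x)
    # M step: re-estimate the probabilities from the completed counts
    p_new = [[x[i][j] / n for j in range(c)] for i in range(r)]
    return x, p_new


def em(x_a, x_b, x_c, tol=1e-10, max_iter=1000):
    """Iterate from pij = 1/(rc) until no pij changes by more than tol."""
    r, c = len(x_a), len(x_a[0])
    p = [[1.0 / (r * c)] * c for _ in range(r)]
    for iterations in range(1, max_iter + 1):
        x, p_new = em_step(p, x_a, x_b, x_c)
        change = max(abs(p_new[i][j] - p[i][j])
                     for i in range(r) for j in range(c))
        p = p_new
        if change < tol:
            break
    return x, p, iterations


# Hypothetical 2 x 3 counts (not the values in Figure 1)
x_a = [[20, 30, 10], [15, 25, 40]]
x_b = [12, 18]       # row observed, column missing
x_c = [8, 6, 10]     # column observed, row missing
x, p, iterations = em(x_a, x_b, x_c)
print(iterations)
print([[round(v, 2) for v in row] for row in x])
```

Note that the total of the completed counts is the same at every iteration, since each E step only redistributes the B and C counts among the cells.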

EM Algorithm for Example 1

For Example 1, the analysis begins as shown in Figure 2.


Figure 2 – EM algorithm (iterations 1 and 2)

The six cells in range B9:D10 contain the initial M step values of pij = 1/6 (for a total of 1). Cell H9 (representing the initial data value x11) contains the worksheet formula =$B$2+$E$2*B9/E9+$B$4*B9/B11 (and similarly for the other cells in range H9:J10, for the first E step). Range B15:D16 contains the array formula =H9:J10/K11 (second M step).
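As a check on the first E step, it may help to see what the update reduces to under the uniform starting values. The derivation below is a sketch based on the E-step formula implied by the worksheet formula in cell H9; the dot subscripts, denoting the row and column sums of the pij, are notation introduced here.

```latex
p_{ij}^{(0)} = \frac{1}{rc}
\;\Rightarrow\;
p_{i\cdot}^{(0)} = \frac{1}{r}, \quad p_{\cdot j}^{(0)} = \frac{1}{c}
\;\Rightarrow\;
x_{ij}^{(1)} = x_{ij}^A + \frac{x_i^B}{c} + \frac{x_j^C}{r}
```

For Example 1 (r = 2, c = 3), B9/E9 = (1/6)/(1/2) = 1/3 and B9/B11 = (1/6)/(1/3) = 1/2, so the first E step spreads each row-only count evenly over the three columns and each column-only count evenly over the two rows.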

The subsequent E and M steps are calculated in the same way. After 14 steps we arrive at the results shown in Figure 3, demonstrating convergence.


Figure 3 – Convergence of EM algorithm

Thus, we see that we can apportion the missing data as shown in range H81:J82. We can also obtain estimated pij values as shown in range B81:D82.

We now calculate the maximum log-likelihood value based on this result. This value is based on the partition consisting of A, B, and C, as follows:

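$$LL = \sum_{i=1}^r \sum_{j=1}^c x_{ij}^A \ln p_{ij} + \sum_{i=1}^r x_i^B \ln p_{i\cdot} + \sum_{j=1}^c x_j^C \ln p_{\cdot j}$$

where $p_{i\cdot} = \sum_{j=1}^c p_{ij}$ and $p_{\cdot j} = \sum_{i=1}^r p_{ij}$.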

We obtain the maximum log-likelihood estimate of -1,559.6, as calculated in Figure 4.


Figure 4 – Maximum log-likelihood estimate
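The log-likelihood calculation can be sketched in Python as follows. The sketch assumes the log-likelihood is the sum of three parts: the A counts weighted by the log of the cell probabilities, the B counts weighted by the log of the row sums of the pij, and the C counts weighted by the log of the column sums. It also tracks the value over a few EM updates to illustrate that each update does not decrease it. The function names and the counts are assumptions made for the example, so the output will not match the value in Figure 4.

```python
import math


def marginals(p):
    """Return the row sums and column sums of the probability table p."""
    r, c = len(p), len(p[0])
    row_p = [sum(p[i]) for i in range(r)]
    col_p = [sum(p[i][j] for i in range(r)) for j in range(c)]
    return row_p, col_p


def log_likelihood(p, x_a, x_b, x_c):
    """Observed-data log-likelihood for the A, B and C subsets."""
    row_p, col_p = marginals(p)
    ll = 0.0
    for i, row in enumerate(x_a):
        for j, count in enumerate(row):
            if count > 0:                      # skip empty cells to avoid log(0)
                ll += count * math.log(p[i][j])
    ll += sum(n * math.log(row_p[i]) for i, n in enumerate(x_b) if n > 0)
    ll += sum(n * math.log(col_p[j]) for j, n in enumerate(x_c) if n > 0)
    return ll


def em_update(p, x_a, x_b, x_c):
    """One combined E and M step, returning the updated probabilities."""
    row_p, col_p = marginals(p)
    x = [[x_a[i][j] + x_b[i] * p[i][j] / row_p[i] + x_c[j] * p[i][j] / col_p[j]
          for j in range(len(p[0]))] for i in range(len(p))]
    n = sum(sum(row) for row in x)
    return [[v / n for v in row] for row in x]


# Hypothetical 2 x 3 counts (not the values in Figure 1)
x_a = [[20, 30, 10], [15, 25, 40]]
x_b = [12, 18]
x_c = [8, 6, 10]

p = [[1.0 / 6] * 3 for _ in range(2)]          # initial M step
trace = [log_likelihood(p, x_a, x_b, x_c)]
for _ in range(10):
    p = em_update(p, x_a, x_b, x_c)
    trace.append(log_likelihood(p, x_a, x_b, x_c))
print([round(v, 3) for v in trace])
```

Watching the trace level off is another way to judge convergence, in addition to comparing successive pij values.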


