EM Algorithm Bivariate Normal Data

Basic Concepts

We now show how to estimate the parameters of a bivariate normal distribution based on a sample from this distribution even when some of the data is missing. We use the Expectation-Maximization (EM) algorithm to accomplish this.

In this method, missing data is estimated using an iterative process in which each iteration consists of two steps: (1) an M step (maximization), where the parameters are calculated from the data as completed by the previous E step (or by an initial guess in the first iteration), and (2) an E step (expectation), where each missing data element is re-estimated from the parameters found in the preceding M step.

Missing y values

Example 1: Suppose that we have a sample consisting of 25 pairs of the form (x, y) randomly selected from a bivariate normal population, as shown in Figure 1. Note that the y values of the last 5 of the pairs are missing (set to zero in the figure). Estimate the parameters of the bivariate normal distribution.

Figure 1 – Bivariate normal data with missing y values

The mean, standard deviation and correlation coefficient for the completely observed data are shown in range D3:F6 (e.g. cell E4 contains the formula =AVERAGE(A2:A23)), while the same parameters for all 25 sample elements are shown in range D24:F27 where the missing data is arbitrarily filled in with zeros (e.g. cell E25 contains the formula =AVERAGE(A2:A28)). This is the first M step.

Note that we can choose to initialize the missing y values in many different ways (e.g. as the mean of the observed y values).
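As a rough illustration of this first M step, the following Python sketch fills the missing y values with a placeholder (zero or the observed mean) and then computes the five parameters. The function names `fill_missing` and `estimate_params` are hypothetical, the data is made up rather than taken from Figure 1, and the sample (n-1) standard deviation is an assumption, since the text does not say which spreadsheet function was used.

```python
import math

def fill_missing(values, method="zero"):
    """Replace None entries with 0 or with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    placeholder = 0.0 if method == "zero" else sum(observed) / len(observed)
    return [placeholder if v is None else v for v in values]

def estimate_params(xs, ys):
    """M step: return (mean_x, mean_y, sd_x, sd_y, corr) for complete pairs.

    Assumption: sample standard deviations (divisor n - 1).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sxx / (n - 1))
    sd_y = math.sqrt(syy / (n - 1))
    corr = sxy / math.sqrt(sxx * syy)
    return mean_x, mean_y, sd_x, sd_y, corr

# Illustrative data only (not the values from Figure 1); None marks a missing y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, None, None]

params_zero = estimate_params(xs, fill_missing(ys, "zero"))
params_mean = estimate_params(xs, fill_missing(ys, "mean"))
print("zero fill:", params_zero)
print("mean fill:", params_mean)
```

The two starting points give different initial parameters, and the later iterations then refine whichever one is chosen.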

E and M steps

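We now use these parameters to compute new estimates for the missing values (E step). Each missing y value is replaced by its conditional expectation given the corresponding x value:

$$\hat{y}_i = \mu_y + \rho \frac{\sigma_y}{\sigma_x} (x_i - \mu_x)$$

where $\mu_x$ and $\mu_y$ are the means, $\sigma_x$ and $\sigma_y$ are the standard deviations, and $\rho$ is the correlation coefficient from the previous M step.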

The results are shown on the left side of Figure 2.

Figure 2 – Continuation of EM algorithm

Here cell I24 contains the formula =F25+E27*F26/E26*(H24-E25), and similarly for the other missing y values (e.g. the same formula is used for cell I25 with H24 replaced by H25 in the formula).
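The cell formula above can be mirrored in Python as a small sketch. The function name `impute_y` and the parameter values are hypothetical; each argument is annotated with the cell it corresponds to in the formula for I24.

```python
def impute_y(x, mean_x, mean_y, sd_x, sd_y, corr):
    """E step for one missing y: mean_y + corr * sd_y / sd_x * (x - mean_x).

    Cell mapping for =F25+E27*F26/E26*(H24-E25):
    F25 -> mean_y, E27 -> corr, F26 -> sd_y, E26 -> sd_x, H24 -> x, E25 -> mean_x
    """
    return mean_y + corr * sd_y / sd_x * (x - mean_x)

# Hypothetical parameter values, chosen only to make the arithmetic easy to follow.
print(impute_y(x=3.0, mean_x=2.0, mean_y=10.0, sd_x=1.0, sd_y=2.0, corr=0.5))  # prints 11.0
```

When x equals the mean of x, the estimate reduces to the mean of y, which is a quick sanity check on the formula.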

This results in a new set of x and y values in range H2:I28 where the observed values (range H2:I23) are identical to those shown in range A2:B23 of Figure 1. New parameter values in range K24:M27 are now calculated from the complete data in range H2:I28 (the next M step). We continue in this way as shown on the right side of Figure 2 until we get convergence of the parameter values. These values are shown in Figure 3.
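The alternation of E and M steps described here can be sketched as a loop. This is a minimal illustration under stated assumptions, not the workbook's implementation: the names `estimate_params` and `em_missing_y`, the zero initialization, the tolerance, and the sample data are all hypothetical. The returned `history` list plays the role of the rows in a convergence table, one entry per M step.

```python
import math

def estimate_params(xs, ys):
    """M step: (mean_x, mean_y, sd_x, sd_y, corr), assuming n - 1 divisors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (mean_x, mean_y, math.sqrt(sxx / (n - 1)),
            math.sqrt(syy / (n - 1)), sxy / math.sqrt(sxx * syy))

def em_missing_y(xs, ys, tol=1e-10, max_iter=10000):
    """Iterate E and M steps until the parameters change by less than tol."""
    missing = [i for i, y in enumerate(ys) if y is None]
    filled = [0.0 if y is None else y for y in ys]   # initial guess: zeros
    history = [estimate_params(xs, filled)]          # first M step
    for _ in range(max_iter):
        mean_x, mean_y, sd_x, sd_y, corr = history[-1]
        # E step: re-estimate only the missing y values
        for i in missing:
            filled[i] = mean_y + corr * sd_y / sd_x * (xs[i] - mean_x)
        # M step: recompute the parameters from the completed data
        params = estimate_params(xs, filled)
        change = max(abs(a - b) for a, b in zip(history[-1], params))
        history.append(params)
        if change < tol:
            break
    return history, filled

# Illustrative data only: the observed points lie exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, None, None]
history, filled = em_missing_y(xs, ys)
print(len(history), "M steps; imputed y values:", filled[3:])
```

Because the observed points in this toy data set fall on a straight line, the imputed values settle on that same line (close to 8 and 10), which makes the loop easy to check by eye.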

Figure 3 – Convergence Rate

Row 13 (or row 14) contains the estimated parameter values for the original data with 5 missing values.

Missing x and y data

Example 2: Repeat Example 1 for the data in Figure 4. This time there is both missing x data and y data.

Figure 4 – EM algorithm with missing x and y data

In this case, we calculate missing y values as before and missing x values in a similar way, namely:

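$$\hat{y}_i = \mu_y + \rho \frac{\sigma_y}{\sigma_x} (x_i - \mu_x)$$

$$\hat{x}_i = \mu_x + \rho \frac{\sigma_x}{\sigma_y} (y_i - \mu_y)$$

with all parameters taken from the most recent M step.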
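A sketch of how the two-directional E step might look in Python follows. It assumes that each pair is missing at most one of its two values, that missing entries start at zero, and that sample (n-1) standard deviations are used; none of these details is stated in the text, and the names `estimate_params` and `em_missing_xy` and the data are hypothetical.

```python
import math

def estimate_params(xs, ys):
    """M step: (mean_x, mean_y, sd_x, sd_y, corr), assuming n - 1 divisors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (mean_x, mean_y, math.sqrt(sxx / (n - 1)),
            math.sqrt(syy / (n - 1)), sxy / math.sqrt(sxx * syy))

def em_missing_xy(pairs, tol=1e-10, max_iter=10000):
    """pairs: list of (x, y); None marks a missing value (at most one per pair)."""
    xs = [0.0 if x is None else x for x, _ in pairs]
    ys = [0.0 if y is None else y for _, y in pairs]
    miss_x = [i for i, (x, _) in enumerate(pairs) if x is None]
    miss_y = [i for i, (_, y) in enumerate(pairs) if y is None]
    params = estimate_params(xs, ys)                 # first M step
    for _ in range(max_iter):
        mean_x, mean_y, sd_x, sd_y, corr = params
        for i in miss_y:                             # y in terms of x
            ys[i] = mean_y + corr * sd_y / sd_x * (xs[i] - mean_x)
        for i in miss_x:                             # x in terms of y
            xs[i] = mean_x + corr * sd_x / sd_y * (ys[i] - mean_y)
        new_params = estimate_params(xs, ys)
        converged = max(abs(a - b) for a, b in zip(params, new_params)) < tol
        params = new_params
        if converged:
            break
    return params, xs, ys

# Illustrative data only (not the values from Figure 4).
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, None), (None, 10.0)]
params, xs_filled, ys_filled = em_missing_xy(pairs)
print(params)
print(xs_filled, ys_filled)
```

Only the missing entries are ever overwritten, so the observed values stay fixed across iterations, just as the observed cells do in the spreadsheet.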

The convergence is as shown in Figure 5.

Figure 5 – EM Convergence


References

Efron, B., Hastie, T. (2016) Computer age statistical inference. Cambridge University Press.

Walczak, B., Massart, D. L. (2001) Dealing with missing data: Part II. Chemometrics and Intelligent Laboratory Systems, 58, 29–42.
https://www.academia.edu/59642526/Dealing_with_missing_data
