Saturated model for two-way contingency tables

We begin by investigating the saturated model, which includes every possible term: each main effect and the interaction between them. We do this by reexamining Example 2 of Independence Testing using a log-linear approach.

Example

Example 1: Create a saturated log-linear model for the data in Example 2 of Independence Testing

The data for the 150 patients are again summarized in the contingency table in Figure 1.

Two-way contingency table

Figure 1 – Contingency table

Define the following coding of the categorical variables:

t1 = 1 if therapy 1 and = -1 if therapy 2
t2 = 1 if cured and = -1 if not cured

Based on this coding the data can be expressed as in Figure 2.

Saturated log-linear 2D model

Figure 2 – Fitting the data to the saturated model

The log-linear model takes the form:

ln yi = β0 + β1ti1 + β2ti2 + β3ti1ti2 + ln εi

Here all the variables are included, including the interaction terms. This is called the saturated model. We now find the values of the population coefficients β0, β1, β2, β3. As usual, using the sample data we find the estimates of these coefficients b0, b1, b2, b3, where

ln yi = b0 + b1ti1 + b2ti2 + b3ti1ti2

It then follows (using the data in Figure 2) that:

3.434 = ln 31 = ln y1 = b0 + b1⋅1 + b2⋅1 + b3⋅1 = b0 + b1 + b2 + b3

2.398 = ln 11 = ln y2 = b0 + b1⋅1 + b2(-1) + b3(-1) = b0 + b1 – b2 – b3

4.043 = ln 57 = ln y3 = b0 + b1(-1) + b2⋅1 + b3(-1) = b0 – b1 + b2 – b3

3.932 = ln 51 = ln y4 = b0 + b1(-1) + b2(-1) + b3⋅1 = b0 – b1 – b2 + b3

Adding all four equations and dividing by 4 we get

b0 = (ln 31 + ln 11 + ln 57 + ln 51)/4 = (3.434 + 2.398 + 4.043 + 3.932)/4 = 3.452

Adding the first two equations and dividing by 2 we get

b1 = (ln 31 + ln 11)/2 – b0 = (3.434 + 2.398)/2 – 3.452 = -.536

Now, adding the first and third equations and dividing by 2 we get

b2 = (ln 31 + ln 57)/2 – b0 = (3.434 + 4.043)/2 – 3.452 = .287

Adding the first and last and dividing by 2 we get

b3 = (ln 31 + ln 51)/2 – b0 = (3.434 + 3.932)/2 – 3.452 = .231
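The elimination steps above can be checked numerically. The following is a minimal Python sketch, not part of the original Excel workbook; the dictionary `logs` and the variable names are illustrative assumptions, and the cell counts are the ones appearing in the four equations above.

```python
import math

# Observed counts keyed by (t1, t2), using the coding from the text:
# t1 = 1 for therapy 1, -1 for therapy 2; t2 = 1 for cured, -1 for not cured.
counts = {(1, 1): 31, (1, -1): 11, (-1, 1): 57, (-1, -1): 51}
logs = {cell: math.log(y) for cell, y in counts.items()}

# Add all four equations and divide by 4.
b0 = sum(logs.values()) / 4
# Add the first two equations, divide by 2, subtract b0.
b1 = (logs[(1, 1)] + logs[(1, -1)]) / 2 - b0
# Add the first and third equations, divide by 2, subtract b0.
b2 = (logs[(1, 1)] + logs[(-1, 1)]) / 2 - b0
# Add the first and last equations, divide by 2, subtract b0.
b3 = (logs[(1, 1)] + logs[(-1, -1)]) / 2 - b0

# The saturated model should reproduce every log count exactly.
for (t1, t2), observed_log in logs.items():
    fitted_log = b0 + b1 * t1 + b2 * t2 + b3 * t1 * t2
    assert math.isclose(fitted_log, observed_log)

print(round(b0, 3), round(b1, 3), round(b2, 3), round(b3, 3))
# 3.452 -0.536 0.287 0.231
```

The loop substitutes the coefficients back into all four equations, which confirms that the four estimates solve the system and not just the combinations used to derive them.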

Thus the model is

ln y = 3.452 – .536 t1 + .287 t2 + .231 t1t2

which is equivalent to

y = exp(3.452 – .536 t1 + .287 t2 + .231 t1t2)

 

which, in turn, is equivalent to

y = exp(3.452) ⋅ exp(–.536 t1) ⋅ exp(.287 t2) ⋅ exp(.231 t1t2)

Using τi = exp(βi), the log-linear model takes the form (dropping the error term):

y = τ0 ⋅ τ1^(t1) ⋅ τ2^(t2) ⋅ τ3^(t1t2)
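As a rough check of the multiplicative form, the sketch below exponentiates each estimated coefficient and multiplies the resulting factors to recover the cell counts. It is a Python illustration under stated assumptions: the function name `fitted_count` and the list `tau` are hypothetical, and the coefficients are the three-decimal values reported above, so the fitted counts agree with the observed ones only up to rounding.

```python
import math

# Coefficient estimates b0..b3 as reported in the text (rounded).
b = [3.452, -0.536, 0.287, 0.231]

# Multiplicative factors, one per coefficient: tau_i = exp(b_i).
tau = [math.exp(bi) for bi in b]

def fitted_count(t1, t2):
    """Fitted count in multiplicative form for the +1/-1 coding."""
    return tau[0] * tau[1] ** t1 * tau[2] ** t2 * tau[3] ** (t1 * t2)

# Observed counts keyed by (t1, t2).
observed = {(1, 1): 31, (1, -1): 11, (-1, 1): 57, (-1, -1): 51}

for (t1, t2), y in observed.items():
    # Each fitted value should match the observed count up to rounding.
    print((t1, t2), y, round(fitted_count(t1, t2), 2))
```

With the +1/-1 coding, each factor either multiplies or divides the baseline exp(b0), depending on the sign of the corresponding code.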

Marginal averages

In Figure 3 we provide the contingency table for the logs of the original data in range S13:T14, but this time instead of calculating the marginal totals, we calculate the marginal averages.

Marginal averages log-linear 2D

Figure 3 – Marginal averages

Thus, for example, the marginal average for the Cured row (cell U13) contains the formula =AVERAGE(S13:T13) and the marginal average for the Therapy 1 column (cell S15) contains the formula =AVERAGE(S13:S14).

Note that b0 = the grand mean (cell U15), b1 = the mean for Therapy 1 (cell S15) minus the grand mean, b2 = the mean for Cured (cell U13) minus the grand mean and b3 = Cured × Therapy 1 (cell S13) minus the mean for Cured minus the mean for Therapy 1 plus the grand mean. For example, b2 = 3.739 – 3.452 = .287, which matches the value calculated above.

We now map the log values back into the original contingency table (range R5:U8) by using the exponential function. Thus the marginal average for the Cured row in the original contingency table (cell U6) is EXP(U13) = EXP(3.738519) = 42.0357. Note that this is not the arithmetic mean of 31 and 57, which is 44. Instead, 42.0357 is the geometric mean of 31 and 57. Thus we could also put the formula =GEOMEAN(S6:T6) in cell U6 and get the same value of 42.0357. This relationship also holds for the other marginal averages.
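The link between averaging logs and the geometric mean can be confirmed outside Excel. This is a small Python sketch that mirrors the AVERAGE, EXP and GEOMEAN steps; the variable names are illustrative, and `statistics.geometric_mean` assumes Python 3.8 or later.

```python
import math
from statistics import geometric_mean

# Cured counts for therapy 1 and therapy 2.
cured = [31, 57]

# Average the logs (the marginal average in the log table) ...
mean_of_logs = sum(math.log(y) for y in cured) / len(cured)
# ... then map back to the count scale with the exponential function.
back_transformed = math.exp(mean_of_logs)

arithmetic_mean = sum(cured) / len(cured)

print(round(mean_of_logs, 6))            # 3.738519
print(round(back_transformed, 4))        # 42.0357
print(round(geometric_mean(cured), 4))   # 42.0357
print(arithmetic_mean)                   # 44.0
```

The identity holds in general: exp of the mean of ln y equals the n-th root of the product of the y values, which is the definition of the geometric mean.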

Observation

The saturated model is an exact fit for the data (i.e. the error terms are zero), and simply provides a new way of looking at the observed data.

The exact values of the coefficients depend on the coding used for the dummy variables. E.g., if we use the coding

t1 = 0 if therapy 1 and = 1 if therapy 2
t2 = 0 if not cured and = 1 if cured

then the log-linear regression model becomes:

ln y = 2.398 + 1.534 t1 + 1.036 t2 – .925 t1t2
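The coefficients for the 0/1 coding can be reproduced with the sketch below. It is an illustrative Python calculation, not the method used in the workbook: with 0/1 coding the intercept is the log count of the reference cell (therapy 1, not cured) and the remaining coefficients are differences of log counts. The dictionary layout and variable names are assumptions.

```python
import math

# Observed counts keyed by (t1, t2) under the 0/1 coding:
# t1 = 0 for therapy 1, 1 for therapy 2; t2 = 0 for not cured, 1 for cured.
counts = {(0, 0): 11, (1, 0): 51, (0, 1): 31, (1, 1): 57}
logs = {cell: math.log(y) for cell, y in counts.items()}

# Reference cell (t1 = t2 = 0) gives the intercept directly.
b0 = logs[(0, 0)]
# Switching only t1 to 1 adds b1; switching only t2 to 1 adds b2.
b1 = logs[(1, 0)] - b0
b2 = logs[(0, 1)] - b0
# The interaction is whatever remains in the cell where both codes are 1.
b3 = logs[(1, 1)] - b0 - b1 - b2

print(round(b0, 3), round(b1, 3), round(b2, 3), round(b3, 3))
# 2.398 1.534 1.036 -0.925
```

Both codings describe the same four counts exactly; only the interpretation of the individual coefficients changes.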



11 thoughts on “Saturated model for two-way contingency tables”

  1. Dear Charles,
    maybe I found two small typos in your example 1 description:
    1) In the set of equations for solving b coefficients, there should be ln y1, ln y2, ln y3, ln y4; instead of ln y1, ln y2, ln y2, ln y2
    2) In the summary model for y (and ln y), there is additional index /1/ after the coefficient in the term for t_1; the same is also in the Marginal example.

    Your big fan Jirka

    • As explained on the webpage, the saturated model is an exact fit for the data (i.e. the error terms are zero), and simply provides a new way of looking at the observed data. The saturated model is not only the “best fit”, it is an “exact fit”, in that it simply re-expresses the data exactly.
      Charles

  2. Thank you for your website – it is most helpful! I am having trouble understanding the log-linear regression model with the alternate coding at the bottom of the page. I would think that b0 = 3.932 (ln 51, when t1 = t2 = 0), which gave me the following model: ln y = 3.932 – 1.534 t1 + 0.111 t2 + 0.925 t1t2.

    • Thanks for your comment. The coding I actually used is

      t1 = 0 if therapy 1 and = 1 if therapy 2 (instead of 1 for therapy 1 and 0 for therapy 2 which is how it was stated)
      t2 = 0 if not cured and = 1 if cured

      That probably accounts for the difference. I have now corrected the webpage.

      Charles
