Autocorrelation Introduction

Basic Concepts

When performing multiple linear regression using the data in a sample of size n, we have n error terms, estimated by the residuals ei = yi – ŷi. One of the assumptions of linear regression is that there is no autocorrelation between the residuals, i.e. for all i ≠ j, cov(ei, ej) = 0.

Definition 1: The autocorrelation (aka serial correlation) between the data is cov(ei, ej). We say that the data is autocorrelated (or there exists autocorrelation) if cov(ei, ej) ≠ 0 for some i ≠ j.

First-order autocorrelation occurs when consecutive residuals are correlated. In general, p-order autocorrelation occurs when residuals p units apart are correlated.

Observation: Since another assumption for linear regression is that the mean of the residuals is 0, it follows that

cov(ei, ej) = E[(ei–0)(ej–0)] = E[eiej]

and so data is autocorrelated if E[eiej] ≠ 0 for some i ≠ j.
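The lag-p autocorrelation of a series of residuals can be computed directly by correlating the series with itself shifted p positions. A minimal sketch in Python with NumPy (the function name `lag_autocorrelation` is ours, not from the source):

```python
import numpy as np

def lag_autocorrelation(residuals, p=1):
    """Correlation between residuals that are p positions apart."""
    e = np.asarray(residuals, dtype=float)
    # Pair each residual e[i] with e[i+p] and correlate the two series
    return np.corrcoef(e[:-p], e[p:])[0, 1]

# A perfectly alternating residual series has lag-1 autocorrelation of -1
print(lag_autocorrelation([1, -1, 1, -1, 1, -1, 1, -1], p=1))
```

This mirrors the Excel approach used below, where CORREL is applied to a range and the same range shifted by one row.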

Example 1: Find the first-order autocorrelation for the regression of rainfall and temperature on crop yield for the data in range A3:D14 of Figure 1.

Figure 1 – First-order autocorrelation

The predicted Yield values are shown in column F and the Residuals are shown in column G. The predicted values in range F4:F14 are calculated by the array formula =TREND(D4:D14,B4:C14) and the residuals in range G4:G14 are calculated by the array formula =D4:D14-F4:F14. The first-order autocorrelation is .58987 (cell G16) as calculated by the formula =CORREL(G4:G13,G5:G14).
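The same computation can be reproduced outside Excel. The sketch below uses Python with NumPy; the rainfall, temperature, and yield numbers are invented for illustration (the actual Figure 1 data is not reproduced in the text). A least-squares fit plays the role of TREND, and a lagged correlation plays the role of CORREL:

```python
import numpy as np

# Illustrative data only -- not the values from Figure 1
rain = np.array([5.2, 6.1, 4.8, 7.0, 5.5, 6.3, 4.9, 5.8, 6.6, 5.0, 7.2])
temp = np.array([20.0, 22.5, 19.8, 24.0, 21.1, 23.2, 20.5, 21.8, 23.9, 20.2, 24.5])
yld  = np.array([40.1, 45.3, 38.7, 50.2, 42.0, 47.1, 39.5, 43.8, 48.6, 40.8, 51.0])

# Equivalent of =TREND(D4:D14,B4:C14): fit yld = b0 + b1*rain + b2*temp
X = np.column_stack([np.ones_like(rain), rain, temp])
coeffs, *_ = np.linalg.lstsq(X, yld, rcond=None)
predicted = X @ coeffs

# Equivalent of =D4:D14-F4:F14
residuals = yld - predicted

# Equivalent of =CORREL(G4:G13,G5:G14): first-order autocorrelation
r1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(r1)
```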

Sources of Autocorrelation

We now give some of the reasons for the existence of autocorrelation.

In the case of stock market prices, there are psychological reasons why prices might continue to rise day after day until some unexpected event occurs. Then after some bad news, prices may continue to fall. Thus, it is common for time series data to exhibit autocorrelation.

Sometimes an event takes time to have an effect. Prices for oil may rise due to under-supply or increased demand, which results in increased production, which has a delayed effect in price reductions, which in turn may result in decreased production, etc. Oil price data, in this case, will show p-order autocorrelation where p is the lag time for this effect.
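Assuming such a delayed effect exists, the lag p can be estimated by scanning the lag-by-lag correlations of the series. The helper below (Python with NumPy; the function name `strongest_lag` is ours) returns the lag with the largest autocorrelation:

```python
import numpy as np

def strongest_lag(series, max_lag):
    """Return the lag in 1..max_lag with the highest autocorrelation."""
    e = np.asarray(series, dtype=float)
    corrs = [np.corrcoef(e[:-p], e[p:])[0, 1] for p in range(1, max_lag + 1)]
    return int(np.argmax(corrs)) + 1

# A series that repeats every 4 steps has its strongest autocorrelation at lag 4
cycle = [0.0, 1.0, 0.0, -1.0] * 10
print(strongest_lag(cycle, 6))  # → 4
```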

Autocorrelation may also be caused by an incorrectly specified regression model. E.g. suppose the true regression model is

yi = b0 + b1x1i + b2x2i + ei

using two independent variables x1 and x2. Here b2 ≠ 0 and, based on the usual regression assumptions, cov(xi, ej) = 0 and cov(ei, ej) = 0 for i ≠ j. Also, for any random variable u and constant c, cov(u, c) = 0.

But now suppose that you use a model instead which leaves out the x2 variable, say

yi = a0 + a1x1i + ui

The residual term ui in the deficient model can be expressed as ui = c0 + c1x1i + b2x2i + ei where c0 = b0 – a0 and c1 = b1 – a1. It follows that

cov(ui, uj) = cov(c0 + c1x1i + b2x2i + ei, c0 + c1x1j + b2x2j + ej)

Thus, since the covariance terms involving constants, the cov(xi, ej) terms, and the cov(ei, ej) terms all vanish,

cov(ui, uj) = c1²cov(x1i, x1j) + b2²cov(x2i, x2j) + c1b2[cov(x1i, x2j) + cov(x2i, x1j)]

Since it is quite likely that there is correlation between the data elements, it is quite likely that cov(ui, uj) ≠ 0, i.e. there is autocorrelation.
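This effect is easy to demonstrate by simulation. In the sketch below (Python with NumPy; all coefficients and the smooth x2 series are invented for illustration), data is generated from a two-variable model, the deficient one-variable model is fitted instead, and the residuals are checked for first-order autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.arange(n)

x1 = rng.normal(size=n)              # included predictor
x2 = np.sin(2 * np.pi * t / 50)      # omitted predictor: smooth, hence autocorrelated
e = rng.normal(scale=0.5, size=n)    # well-behaved noise, no autocorrelation

# True model: y = 1 + 2*x1 + 3*x2 + e
y = 1 + 2 * x1 + 3 * x2 + e

# Deficient model: regress y on x1 alone
X = np.column_stack([np.ones(n), x1])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ coeffs                   # residuals absorb the omitted 3*x2 term

# The residuals inherit x2's autocorrelation
r1 = np.corrcoef(u[:-1], u[1:])[0, 1]
print(round(r1, 3))                  # substantially above 0
```

Because the omitted 3·x2 term varies smoothly over time, consecutive residuals of the deficient model move together, even though the underlying noise e is uncorrelated.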
