Method of Least Squares

In Correlation we study the linear correlation between two random variables x and y. We now look at the line in the xy plane that best fits the data (x1, y1), …, (xn, yn).

Recall that the equation for a straight line is y = bx + a, where

b = the slope of the line
a = y-intercept, i.e. the value of y where the line intersects with the y-axis

For our purposes, we write the equation of the best fit line as

image1667

and so the y-intercept is
image1668

For each i, we define ŷi as the y-value of xi on this line, and so

image1670

The best fit line is the line for which the sum of the distances between each of the n data points and the line is as small as possible. A mathematically useful approach is therefore to find the line with the property that the sum of the following squares is minimum.

image1671

Theorem 1: The best fit line for the points (x1, y1), …, (xn, yn) is given by

image1673

where
image1674

Click here for the proof of Theorem 1. Two proofs are given, one of which does not use calculus.

Definition 1: The best fit line is called the regression line.

Observation: The theorem shows that the regression line passes through the point (, ȳ) and has the equation

image1673

where the slope is
image1674

and the y-intercept is
image1676

Note too that b = cov(x,y)/var(x). Since the terms involving n cancel out, this can be viewed as either the population covariance and variance or the sample covariance and variance. Thus a and b can be calculated in Excel as follows where R1 = the array of y values and R2 = the array of x values:

b = SLOPE(R1, R2) = COVAR(R1, R2) / VARP(R2)

a = INTERCEPT(R1, R2) = AVERAGE(R1) – b * AVERAGE(R2)

Property 1image1679

Proof: By Definition 2 of Correlation,

image1680

and so by the above observation we have

image1681

Excel Functions: Excel provides the following functions for forecasting the value of y for any x based on the regression line. Here R1 = the array of y data values and R2 = the array of x data values:

SLOPE(R1, R2) = slope of the regression line as described above

INTERCEPT(R1, R2) = y-intercept of the regression line as described above

FORECAST(x, R1, R2) calculates the predicted value y for the given value of x. Thus FORECAST(x, R1, R2) = a + b * x where a = INTERCEPT(R1, R2) and b = SLOPE(R1, R2).

TREND(R1, R2) = array function which produces an array of predicted y values corresponding to x values stored in array R2, based on the regression line calculated from x values stored in array R2 and y values stored in array R1.

TREND(R1, R2, R3) = array function which predicts the y values corresponding to the x values in R3 based on the regression line based on the x values stored in array R2 and y values stored in array R1.

To use TREND(R1, R2), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for x (array R2) followed by a right parenthesis. Finally press Crtl-Shft-Enter.

To use TREND(R1, R2, R3), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for x (array R2) followed by another comma and highlight the array R3 containing the values for x for which you want to predict y values based on the regression line. Now enter a right parenthesis and press Crtl-Shft-Enter.

Excel 2016 Function: Excel 2016 introduces a new function FORECAST.LINEAR, which is equivalent to FORECAST.

Example 1: Calculate the regression line for the data in Example 1 of One Sample Hypothesis Testing for Correlation and plot the results.

Regression line fit Excel

Figure 1 – Fitting a regression line to the data in Example 1

Using Theorem 1 and the observation following it, we can calculate the slope b and y-intercept a of the regression line that best fits the data as in Figure 1 above. Using Excel’s charting capabilities we can plot the scatter diagram for the data in columns A and B above and then select Layout > Analysis|Trendline and choose a Linear Trendline from the list of options. This will display the regression line given by the equation y = bx + a (see Figure 1).

42 thoughts on “Method of Least Squares”

  1. Thank you Charles for this excellent site. Excel’s LINEST function is another very useful function for performing regression analysis. It allows the analyst to capture the same regression variables that are calculated and / or displayed in a chart trendline equation in worksheet cells. LINEST can also display certain statistics that the Analysis Toolpak would generate in worksheet cells too.

    Reply
  2. Find the intercept (B0) and slope (B1) of the shifted line l1 from the data x 1 − x̄, . . . , x N − x̄, and y 1 − ȳ, . . . , y N − ȳ. How do we obtain the intercept and slope of l1 from those of the shifted line l1 ?

    Reply
  3. Sir, to my teacher wife and me the clarity of your instruction is MOST refreshing – so much so that I’m both move to express gratitude and to model my own instruction of certain propositions after yours. Thanks!

    Reply
        • Jonathan,
          Yes, you can view y as representing the vector consisting of the elements yi. Alternatively y can be viewed as a random variable.
          The same is true for x, except that now in addition to being viewed as a vector consisting of the elements xi, it can also be viewed as a matrix with values xij (this is the multiple linear regression case).
          Charles

          Reply
  4. I don’t understand nothing that write here where I should begin to study this subject to understand this(Some free internet basic Course) ?

    Reply
  5. Dear Charles
    First, I would like to thank you for you great page. Second, my problem is; I have 3 input data (time, speed, acceleration) and 1 output data ( emissions). I would like to establish the relitionship between input and output data .
    Can you help me what method that I can used it. Finally, thank you for your kind support in advance
    Ima

    Reply
  6. Hi, great page.

    Forgive this 101 question.

    I’ve been using the stats add-on for Excel than includes the regression function.

    I want to know which box to read to see what the trend/slope is.

    Here’s what I do in pictures, to make it clearer.

    First I open the statistics add-on package and choose Regression: Picture 1
    Then I select Labels, Residuals, Residual plots, Standardised residuals and Line fit plots, plug in the X and Y variables in their respective dialogue boxes: Picture 2
    When I click OK the results appear in a new spreadsheet: Picture 3

    Is the slope given in one of the boxes? Or do I need to make another calculation?

    Not being a regular user of Excel, I’ve simply subtracted the first point of the line fit from the last and divided by the number of data points to get the trend (in this case, divide by number of months).

    Does one of the boxes in Picture 3 give me the slope?

    Reply
    • Thanks very much!

      I noticed Knowledgeless lady (below) was working with monthly global lower tropospheric temperature data from Remote Sensing Systems. My examples were based on another source that provides data for the same metric – University of Huntsville, Alabama.

      I would not be surprised if you got a few drop-ins following/participating in the online climate debates. Global temperatures are a hot-button issue in the semi-popular debate, and there are a few online apps that give trends (and some give the confidence intervals, such as at the link she provided) for various climate data just by pressing a few buttons. Many regulars, including me, have no training in stats at all, and some of us are trying to get to grips with it conceptually, even if the math is beyond us.

      Don’t know if you know of it, but there’s an excellent blog devoted to analysing climate trends and educating people on how statistics work, including common – and uncommon – pitfalls. The author is a high-end statistical analyst and his posts (if you ignore the biffo) are wonderfully clear and concise. If you’ve a few minutes and it sounds interesting, to you, I’d recommend checking it out.

      This is the blog:

      https://tamino.wordpress.com/

      His older posts have been archived here.

      Once again, thanks for your help.

      Reply
  7. Nice page!

    But I’m looking for more specific info more concerning trends in existing time series (e.g. temperature) than in estimations of the future.

    Having in an excel table a column with dates and one with temperature values (or whatever else) , I can easily construct a chart giving a line linking all values, and then, by selecting that line, produce a trend line with the info:
    y = ax + b; R² = .

    (About the exact significance of R², there are about as many meanings as web pages talking about it.)

    But this is still not quite what I expect: I would like Excel computing the trend value for the data series, possibly with a standard error associated to the trend, like done here:

    http://www.ysbl.york.ac.uk/~cowtan/applets/trend/trend.html

    Is it possible to invoke in Excel a function computing the trend as understood here?
    The Excel trend function is certainly not the right candidate!

    Thanks in advance

    Reply
    • Glad you liked the referenced webpage.

      As you probably know, you can add a linear trendline to an Excel scatter chart. This will provide the trendline, but not the standard error.

      If you know the standard error and so can compute the equations of the upper and lower lines (as in the site you referenced), then you can add these lines manually to the Excel chart. In fact for any line once you know two points on the line you can create a line through these points using Excel’s Scatter with Straight Lines chart capability.

      Charles

      Reply
      • Thanks for the quick answer, Charles, but… it is exactly what I already know and did not want to to.

        Maybe you misunderstood me: I’m not interested in incorporating the lines in Kevin’s charts, which seem to denote some kind of “uncertainty surface” encompassing the value deviations from the computed trend.

        I even don’t need the error deviation he computes, as e.g. for RSS data between 1979 and today:

        Trend: 0.126 ±0.063 °C/decade

        Solely the trend would be enough.

        Thus my surprise when trying, in an Excel table I read RSS data in, to obtain the same trend info in a cell where I invoke the trend function with a year sequence as xes and the anomaly sequence as ys.

        The result is bare nonsense, what tells me no more than that I don’t use that function properly.

        I can’t imagine Excel displaying a trend line in a chart but refusing to put in a cell what it itself had computed before drawing!

        Reply
          • That’s a pity indeed! Here’s some stuff that hopefully might help.

            The example above you can see displayed by Excel in a chart here (in pdf form):

            http://fs5.directupload.net/images/160317/3zuwxkzk.pdf

            You see
            – a curve linking 444 data points together (stored in a table column)
            – the trend line together with
            – its specification
            — y = 0.001 x – 0.1183
            — R2 = 0.3029

            What I miss here: the trend value itself along these values, which is
            — Trend: 0.126
            when the x-axis runs in months from 1979 to 2015.

            And, Knowledgeless lady as I am, I naively thought that invoking in a cell on the table the f(x)

            TREND(C1:C444; A1:A444)

            would give exactly that value. However, I get as result

            -0,113

            Thus I don’t invoke the TREND fonction properly. Where is the mistake?

          • Assuming that C1:C444 contains the y values of your data and A1:A444 contains the x values, =TREND(C1:C444,A1:A444) returns the forecasted y value for the first x value. If you treat =TREND(C1:C444,A1:A444) as an array formula, then you need to highlight a column range with 444 cells enter the formula =TREND(C1:C444,A1:A444) and press Ctrl-Shft-Enter (not just Enter) and in this case you would get the forecasted values corresponding to all 444 data elements. This is equivalent to the trend line.
            Charles

          • Many thanks for “March 19, 2016 at 6:59 pm”.

            Inbetween I found a pretty good alternative (“linest”) giving trend, standard deviation and R^2 in one step.

  8. First, this is great stuff. Thanks for putting this out there! One question, the phrase at the top: “the value of y where the line intersects with the x-axis”…isn’t this always zero (0)? I think we mean “[the value of y] when x=0”. Or am I missing something?

    Reply
    • LP,
      You are correct. Thank you very much for catching this error. I have just revised the webpage to reflect this change.
      Charles

      Reply
      • hey charles,
        it is common in the UK, taught in schools that ‘m’ represents the gradient where you used ‘b’.

        the version used is y = mx + c
        where y is the equation of the straight line. m is the gradient and c is the y-intercept.

        Reply
  9. How do you balance the accuracy of the trendline showed with its r2?

    I am studying very similar trends in a proyect and i have my doubts regarding how cautious one must be, specially with values like 50, a bit far away from the central data

    In this case i see that there isn´t a regression that tends to average ( like in studies of heights in families ). How can you increase the likeliness of this doesn´t happening in your study?

    thanks!

    Reply
  10. Hello there Sir,

    The first three equations doesn’t make sense to me yet. Can you elaborate on the meaning of each symbol, like where does “c” and “x-bar”come from and what is the reason of introducing them into the original linear equation? Can you provide me references for further understanding these equations?

    Suggestion: Is it possible for you to put equation references like (Eq. 2.11).

    Thank you so much!

    Reply
    • x-bar is the mean of the x sample values. This is standard notation and is used throughout the website.

      An equation of a straight line takes the form y = b*x + a (slope b and y-intercept a). An alternative form (from high school algebra) is y – y0 = b(x – x0) where (x0, y0) is any point on the line (a straight is determined by any point on the line and its slope). I am choosing to use a point who x-value is x-bar and whose y-value is an unknown value c. Thus y – c = b*(x – x-bar). By algebra y = b*x – b*x-bar + c. But y = b*x + a, and so b*x – b*x-bar + c = b*x + a, from which it follows that -b*x-bar + c = a, and so c = a + b*x-bar: i.e. (x-bar, a + b*x-bar) lies on the line.

      Charles

      Reply
    • Ryan,
      When R2 contains a single column (simple linear regression) then FORECAST(x, R1, R2) is equivalent to TREND(R1, R2, x) and FORECAST(R3, R1, R2) is equivalent to TREND(R1, R2, R3). TREND can be used when R2 contains more than one column (multiple regression) while FORECAST cannot.
      Charles

      Reply
  11. Nice website!

    Is there a function for the slope of a regression line, when forced to have an intercept of zero?

    I know I can plot the data, fit a trend line, and then print the equation, but is there a more direct way?

    Thanks.

    Reply
    • Charles,
      For the case where there is only one independent variable x, the formula for the slope is b = ∑x_iy_i/∑x_i^2.
      I plan to add information about this situation to the website in the future.
      Charles

      Reply
    • Remember that the intercept plays a role as well as the slope. E.g. in y = x/2 + 1000 (slope .5) as long as x < 2000 x will be less than y. Charles

      Reply

Leave a Comment