Method of Least Squares

In Correlation we study the linear correlation between two random variables x and y. We now look at the line in the xy plane that best fits the data (x₁, y₁), …, (x_n, y_n).

Recall that the equation for a straight line is y = bx + a, where

b = the slope of the line
a = y-intercept, i.e. the value of y where the line intersects with the y-axis

For our purposes, we write the equation of the best fit line as

and so the y-intercept is

For each i, we define ŷ_i as the y-value of x_i on this line, and so

The best fit line is the line for which the sum of the distances between each of the n data points and the line is as small as possible. A mathematically useful approach is therefore to find the line with the property that the sum of the following squares is minimum.

Theorem 1: The best fit line for the points (x₁, y₁), …, (x_n, y_n) is given by

where

Click here for the proof of Theorem 1. Two proofs are given, one of which does not use calculus.

Definition 1: The best fit line is called the regression line.

Observation: The theorem shows that the regression line passes through the point (x̄, ȳ) and has the equation

where the slope is

and the y-intercept is

Note too that b = cov(x,y)/var(x). Since the terms involving n cancel out, this can be viewed as either the population covariance and variance or the sample covariance and variance. Thus a and b can be calculated in Excel as follows where R1 = the array of y values and R2 = the array of x values:

b = SLOPE(R1, R2) = COVAR(R1, R2) / VARP(R2)

a = INTERCEPT(R1, R2) = AVERAGE(R1) – b * AVERAGE(R2)

Property 1:

Proof: By Definition 2 of Correlation,

and so by the above observation we have

Excel Functions: Excel provides the following functions for forecasting the value of y for any x based on the regression line. Here R1 = the array of y data values and R2 = the array of x data values:

SLOPE(R1, R2) = slope of the regression line as described above

INTERCEPT(R1, R2) = y-intercept of the regression line as described above

FORECAST(x, R1, R2) calculates the predicted value y for the given value of x. Thus FORECAST(x, R1, R2) = a + b * x where a = INTERCEPT(R1, R2) and b = SLOPE(R1, R2).

TREND(R1, R2) = array function which produces an array of predicted y values corresponding to x values stored in array R2, based on the regression line calculated from x values stored in array R2 and y values stored in array R1.

TREND(R1, R2, R3) = array function which predicts the y values corresponding to the x values in R3 based on the regression line based on the x values stored in array R2 and y values stored in array R1.

To use TREND(R1, R2), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for x (array R2) followed by a right parenthesis. Finally press Crtl-Shft-Enter.

To use TREND(R1, R2, R3), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for x (array R2) followed by another comma and highlight the array R3 containing the values for x for which you want to predict y values based on the regression line. Now enter a right parenthesis and press Crtl-Shft-Enter.

Excel 2016 Function: Excel 2016 introduces a new function FORECAST.LINEAR, which is equivalent to FORECAST.

Example 1: Calculate the regression line for the data in Example 1 of One Sample Hypothesis Testing for Correlation and plot the results.

Figure 1 – Fitting a regression line to the data in Example 1

Using Theorem 1 and the observation following it, we can calculate the slope b and y-intercept a of the regression line that best fits the data as in Figure 1 above. Using Excel’s charting capabilities we can plot the scatter diagram for the data in columns A and B above and then select Layout > Analysis|Trendline and choose a Linear Trendline from the list of options. This will display the regression line given by the equation y = bx + a (see Figure 1).

42 thoughts on “Method of Least Squares”

John M.

January 14, 2021 at 1:38 pm

Thank you Charles for this excellent site. Excel’s LINEST function is another very useful function for performing regression analysis. It allows the analyst to capture the same regression variables that are calculated and / or displayed in a chart trendline equation in worksheet cells. LINEST can also display certain statistics that the Analysis Toolpak would generate in worksheet cells too.
Reply
- Charles
  
  January 14, 2021 at 3:35 pm
  
  John,
  LINEST is reviewed at https://www.real-statistics.com/multiple-regression/multiple-regression-analysis/multiple-regression-analysis-excel/
  Charles
  Reply
John

January 7, 2021 at 3:15 pm

Find the intercept (B0) and slope (B1) of the shifted line l1 from the data x 1 − x̄, . . . , x N − x̄, and y 1 − ȳ, . . . , y N − ȳ. How do we obtain the intercept and slope of l1 from those of the shifted line l1 ?
Reply
- Charles
  
  January 9, 2021 at 9:46 am
  
  John,
  Sorry, but I don’t understand the shifted line.
  Charles
  Reply
Hal von Luebbert

May 16, 2019 at 6:12 pm

Sir, to my teacher wife and me the clarity of your instruction is MOST refreshing – so much so that I’m both move to express gratitude and to model my own instruction of certain propositions after yours. Thanks!
Reply
- Charles
  
  May 16, 2019 at 10:16 pm
  
  Hello, I am very pleased that you found the instructions helpful.
  Charles
  Reply
Jonathan Bechtel

October 30, 2017 at 4:45 pm

Charles,

For the equation of the best fine line, what does c stand for in:

y-c=b(x-xavg)??
Reply
- Charles
  
  October 30, 2017 at 5:52 pm
  
  Jonathan,
  c is the value of y when x is the average of the x values.
  Charles
  Reply
  - Jonathan Bechtel
    
    November 2, 2017 at 5:46 pm
    
    Thank you. Do y and x represent the vector of values for X and Y when not denoted by Xi and Yi?
    Reply
    - Charles
      
      November 5, 2017 at 3:38 pm
      
      Jonathan,
      Yes, you can view y as representing the vector consisting of the elements yi. Alternatively y can be viewed as a random variable.
      The same is true for x, except that now in addition to being viewed as a vector consisting of the elements xi, it can also be viewed as a matrix with values xij (this is the multiple linear regression case).
      Charles
      Reply
Shahar Dagan

August 13, 2017 at 2:54 pm

I don’t understand nothing that write here where I should begin to study this subject to understand this(Some free internet basic Course) ?
Reply
- Charles
  
  August 13, 2017 at 9:37 pm
  
  Shahar,
  Sorry, but we don-t offer free courses.
  Charles
  Reply
Ima

May 4, 2017 at 7:08 pm

Dear Charles
First, I would like to thank you for you great page. Second, my problem is; I have 3 input data (time, speed, acceleration) and 1 output data ( emissions). I would like to establish the relitionship between input and output data .
Can you help me what method that I can used it. Finally, thank you for your kind support in advance
Ima
Reply
- Charles
  
  May 5, 2017 at 6:42 am
  
  Ima,
  In this case, you use multiple regression. See Multiple Regression.
  Charles
  Reply
barry

March 10, 2017 at 11:53 am

Hi, great page.

Forgive this 101 question.

I’ve been using the stats add-on for Excel than includes the regression function.

I want to know which box to read to see what the trend/slope is.

Here’s what I do in pictures, to make it clearer.

First I open the statistics add-on package and choose Regression: Picture 1
Then I select Labels, Residuals, Residual plots, Standardised residuals and Line fit plots, plug in the X and Y variables in their respective dialogue boxes: Picture 2
When I click OK the results appear in a new spreadsheet: Picture 3

Is the slope given in one of the boxes? Or do I need to make another calculation?

Not being a regular user of Excel, I’ve simply subtracted the first point of the line fit from the last and divided by the number of data points to get the trend (in this case, divide by number of months).

Does one of the boxes in Picture 3 give me the slope?
Reply
- Charles
  
  March 11, 2017 at 9:07 am
  
  Barry,
  Picture 3 shows the slope. It is the coefficient (below intercept).
  Charles
  Reply
- barry
  
  March 11, 2017 at 2:30 pm
  
  Thanks very much!
  
  I noticed Knowledgeless lady (below) was working with monthly global lower tropospheric temperature data from Remote Sensing Systems. My examples were based on another source that provides data for the same metric – University of Huntsville, Alabama.
  
  I would not be surprised if you got a few drop-ins following/participating in the online climate debates. Global temperatures are a hot-button issue in the semi-popular debate, and there are a few online apps that give trends (and some give the confidence intervals, such as at the link she provided) for various climate data just by pressing a few buttons. Many regulars, including me, have no training in stats at all, and some of us are trying to get to grips with it conceptually, even if the math is beyond us.
  
  Don’t know if you know of it, but there’s an excellent blog devoted to analysing climate trends and educating people on how statistics work, including common – and uncommon – pitfalls. The author is a high-end statistical analyst and his posts (if you ignore the biffo) are wonderfully clear and concise. If you’ve a few minutes and it sounds interesting, to you, I’d recommend checking it out.
  
  This is the blog:
  
  https://tamino.wordpress.com/
  
  His older posts have been archived here.
  
  Once again, thanks for your help.
  Reply
  - Charles
    
    March 11, 2017 at 5:25 pm
    
    Thanks Barry for sharing this information with us. I have actually seen this blog before.
    Charles
    Reply
Pingback: some of linear regression – philosopher's cooking club
Knowledgeless lady

March 14, 2016 at 2:09 am

Nice page!

But I’m looking for more specific info more concerning trends in existing time series (e.g. temperature) than in estimations of the future.

Having in an excel table a column with dates and one with temperature values (or whatever else) , I can easily construct a chart giving a line linking all values, and then, by selecting that line, produce a trend line with the info:
y = ax + b; R² = .

(About the exact significance of R², there are about as many meanings as web pages talking about it.)

But this is still not quite what I expect: I would like Excel computing the trend value for the data series, possibly with a standard error associated to the trend, like done here:

http://www.ysbl.york.ac.uk/~cowtan/applets/trend/trend.html

Is it possible to invoke in Excel a function computing the trend as understood here?
The Excel trend function is certainly not the right candidate!

Thanks in advance
Reply
- Charles
  
  March 14, 2016 at 10:59 pm
  
  Glad you liked the referenced webpage.
  
  As you probably know, you can add a linear trendline to an Excel scatter chart. This will provide the trendline, but not the standard error.
  
  If you know the standard error and so can compute the equations of the upper and lower lines (as in the site you referenced), then you can add these lines manually to the Excel chart. In fact for any line once you know two points on the line you can create a line through these points using Excel’s Scatter with Straight Lines chart capability.
  
  Charles
  Reply
  - Knowledgeless lady
    
    March 16, 2016 at 6:41 pm
    
    Thanks for the quick answer, Charles, but… it is exactly what I already know and did not want to to.
    
    Maybe you misunderstood me: I’m not interested in incorporating the lines in Kevin’s charts, which seem to denote some kind of “uncertainty surface” encompassing the value deviations from the computed trend.
    
    I even don’t need the error deviation he computes, as e.g. for RSS data between 1979 and today:
    
    Trend: 0.126 ±0.063 °C/decade
    
    Solely the trend would be enough.
    
    Thus my surprise when trying, in an Excel table I read RSS data in, to obtain the same trend info in a cell where I invoke the trend function with a year sequence as xes and the anomaly sequence as ys.
    
    The result is bare nonsense, what tells me no more than that I don’t use that function properly.
    
    I can’t imagine Excel displaying a trend line in a chart but refusing to put in a cell what it itself had computed before drawing!
    Reply
    - Charles
      
      March 16, 2016 at 8:45 pm
      
      Sorry, but I don’t understand.
      Charles
      Reply
      - Knowledgeless lady
        
        March 17, 2016 at 9:34 pm
        
        That’s a pity indeed! Here’s some stuff that hopefully might help.
        
        The example above you can see displayed by Excel in a chart here (in pdf form):
        
        http://fs5.directupload.net/images/160317/3zuwxkzk.pdf
        
        You see
        – a curve linking 444 data points together (stored in a table column)
        – the trend line together with
        – its specification
        — y = 0.001 x – 0.1183
        — R2 = 0.3029
        
        What I miss here: the trend value itself along these values, which is
        — Trend: 0.126
        when the x-axis runs in months from 1979 to 2015.
        
        And, Knowledgeless lady as I am, I naively thought that invoking in a cell on the table the f(x)
        
        TREND(C1:C444; A1:A444)
        
        would give exactly that value. However, I get as result
        
        -0,113
        
        Thus I don’t invoke the TREND fonction properly. Where is the mistake?
      - Charles
        
        March 19, 2016 at 6:59 pm
        
        Assuming that C1:C444 contains the y values of your data and A1:A444 contains the x values, =TREND(C1:C444,A1:A444) returns the forecasted y value for the first x value. If you treat =TREND(C1:C444,A1:A444) as an array formula, then you need to highlight a column range with 444 cells enter the formula =TREND(C1:C444,A1:A444) and press Ctrl-Shft-Enter (not just Enter) and in this case you would get the forecasted values corresponding to all 444 data elements. This is equivalent to the trend line.
        Charles
      - Knowledgeless lady
        
        March 27, 2016 at 12:12 am
        
        Many thanks for “March 19, 2016 at 6:59 pm”.
        
        Inbetween I found a pretty good alternative (“linest”) giving trend, standard deviation and R^2 in one step.
LP

March 9, 2016 at 8:29 pm

First, this is great stuff. Thanks for putting this out there! One question, the phrase at the top: “the value of y where the line intersects with the x-axis”…isn’t this always zero (0)? I think we mean “[the value of y] when x=0”. Or am I missing something?
Reply
- Charles
  
  March 10, 2016 at 7:14 am
  
  LP,
  You are correct. Thank you very much for catching this error. I have just revised the webpage to reflect this change.
  Charles
  Reply
SHoxha

March 2, 2016 at 6:16 pm

Can you tell me the whole steeps finding m and c
Reply
- Charles
  
  March 2, 2016 at 7:30 pm
  
  I don’t see any “m” on the referenced webpage. What are you referring to_
  Charles
  Reply
  - Shakib
    
    December 3, 2019 at 4:16 pm
    
    hey charles,
    it is common in the UK, taught in schools that ‘m’ represents the gradient where you used ‘b’.
    
    the version used is y = mx + c
    where y is the equation of the straight line. m is the gradient and c is the y-intercept.
    Reply
    - Charles
      
      December 3, 2019 at 4:36 pm
      
      Yes, this is what I learned in school as well. For regression there will be many slope values b1, b2, etc. and commonly b is used.
      Charles
      Reply
Andrés

February 5, 2016 at 7:10 am

How do you balance the accuracy of the trendline showed with its r2?

I am studying very similar trends in a proyect and i have my doubts regarding how cautious one must be, specially with values like 50, a bit far away from the central data

In this case i see that there isn´t a regression that tends to average ( like in studies of heights in families ). How can you increase the likeliness of this doesn´t happening in your study?

thanks!
Reply
- Charles
  
  February 5, 2016 at 8:27 am
  
  Sorry Andrés, but I don’t understand your comment.
  Charles
  Reply
Dyadee

May 26, 2015 at 5:43 am

Hello there Sir,

The first three equations doesn’t make sense to me yet. Can you elaborate on the meaning of each symbol, like where does “c” and “x-bar”come from and what is the reason of introducing them into the original linear equation? Can you provide me references for further understanding these equations?

Suggestion: Is it possible for you to put equation references like (Eq. 2.11).

Thank you so much!
Reply
- Charles
  
  May 26, 2015 at 3:35 pm
  
  x-bar is the mean of the x sample values. This is standard notation and is used throughout the website.
  
  An equation of a straight line takes the form y = b*x + a (slope b and y-intercept a). An alternative form (from high school algebra) is y – y0 = b(x – x0) where (x0, y0) is any point on the line (a straight is determined by any point on the line and its slope). I am choosing to use a point who x-value is x-bar and whose y-value is an unknown value c. Thus y – c = b*(x – x-bar). By algebra y = b*x – b*x-bar + c. But y = b*x + a, and so b*x – b*x-bar + c = b*x + a, from which it follows that -b*x-bar + c = a, and so c = a + b*x-bar: i.e. (x-bar, a + b*x-bar) lies on the line.
  
  Charles
  Reply
Ryan

February 18, 2015 at 5:36 pm

What is the difference between the FORECAST(x, R1, R2) and TREND(R1, R2, R3) functions?
Reply
- Charles
  
  February 18, 2015 at 7:27 pm
  
  Ryan,
  When R2 contains a single column (simple linear regression) then FORECAST(x, R1, R2) is equivalent to TREND(R1, R2, x) and FORECAST(R3, R1, R2) is equivalent to TREND(R1, R2, R3). TREND can be used when R2 contains more than one column (multiple regression) while FORECAST cannot.
  Charles
  Reply
Charles Steinmeyer

February 5, 2015 at 1:56 pm

Nice website!

Is there a function for the slope of a regression line, when forced to have an intercept of zero?

I know I can plot the data, fit a trend line, and then print the equation, but is there a more direct way?

Thanks.
Reply
- Charles
  
  February 6, 2015 at 2:15 pm
  
  Charles,
  For the case where there is only one independent variable x, the formula for the slope is b = ∑x_iy_i/∑x_i^2.
  I plan to add information about this situation to the website in the future.
  Charles
  Reply
shade may

September 4, 2014 at 7:02 pm

if my observed values of y are greater than the values of x how can the slope be .9 less than one?
Reply
- Charles
  
  September 9, 2014 at 5:04 pm
  
  Remember that the intercept plays a role as well as the slope. E.g. in y = x/2 + 1000 (slope .5) as long as x < 2000 x will be less than y. Charles
  Reply

42 thoughts on “Method of Least Squares”

Leave a Comment Cancel reply