Definition 1: We now reformulate the least-squares model using matrix notation (see Basic Concepts of Matrices and Matrix Operations for more details about matrices and how to operate with matrices in Excel).
We start with a sample {y1, …, yn} of size n for the dependent variable y and samples {x1j, x2j, …, xnj} for each of the independent variables xj for j = 1, 2, …, k.
Let Y = an n × 1 column vector with the entries y1, …, yn. Let X be the n × (k+1) matrix (called the design matrix):
The n equations
 can now be expressed as the single matrix equation
where β is the (k+1) × 1 column vector with entries β0, β1, …, βk and ε is the n × 1 column vector with entries ε1, …, εn.
Let B be a (k+1) × 1 column vector consisting of the coefficients b0, b1, …, bk. and let Y-hat be the (k+1) × 1 column vector consisting of the entries ŷ1, …, ŷn. Then the least-squares model can be expressed as
Furthermore, we define the n × n hat matrix H as
Definition 2: We can extend the definition of expectation to vectors as follows. Let A = [aij] be an m × n matrix. Then the expectation of A is the m × n matrix whose elements are E[aij].
If X is an n × 1 column vector then the covariance matrix X is the n × n matrix
E[(X−E[X])(X−E[X])T]
Observation: The linearity assumption for multiple linear regression can be restated in matrix terminology as
E[ε] = 0
From the independence and homogeneity of variances assumptions, we know that the n × n covariance matrix can be expressed as
Note too that the covariance matrix for Y is also σ2I.
Property 3: B is an unbiased estimator of β, i.e. E[B] = β
Property 4:Â The covariance matrix of B can be represented by
Observation: Click here for proofs of the above four properties.
Definition 3: Let X, Y and B be defined as in Definition 1. Define the residuals vector E to be the n × 1 column vector with entries e1 , …, en such that ei = yi − ŷi .
Observation: The regression model can be expressed as
 Y = XB + E
where B can be expressed as in Property 1.
Hi Charles,
I have a question, hope you can guide me.
For example, if we have 6 predictor variables and 3 dependent variables (having different probability known probability distributions Weibull, exponential and logistic). Is there a way to form an equation representing dependent variables as a function of predicator variables.
when using multivariate linear regression, I am getting the coefficients but it is unable to validate the prediction by Multiple linear regression.
I do not have that well background of Statistics but, if you suggest something relevant i will check.
Hi Jamil,
Unfortunately, Real Statistics doesn’t yet support multivariate regression (i.e. where there is more than one dependent variable). In any case, the usual assumption is that variables follow a normal distribution, and so I don’t know how to deal with the case where there is an assortment of distributions. Perhaps a bootstrapping approach would be appropriate.
Charles
Charles,
love your work… thank you so much…
I have wondered about this question for awhile… doesn’t really relate to multiple regression but thought I’d throw it here.
if I add an asset (say BABA) to an index (say S&P 500, SPY), what is then BABA’s correlation to the new index?… assuming we have the 4 relevant stats (stdev of BABA, SPY.. correlation of the 2, % weight of Baba in new portfolio)?
surprised I don’t see
Hello Stephen,
Glad to read that you appreciate the Real Statistics project and thank you for your support.
Correlation is calculated based on pairs of data in two data sets. What are these pairs in the case of BABA vs SPY? You reference stdev of BABA, but what BABA data are you taking the std deviation of? You also reference the % weight of BABA; is the corresponding data element in SPY 100%?
Charles
Hi Charles,
I have a scenario which I would describe as multi variate, non linear regression ….. I wanted to maximize the profit(o/p variable) and hence get the values for the inputs (freshness percentage, quantity, expenditure on advertisement) — I am doing it by getting the trend line from the past data(in excel I am able to get trend line of only one input vs output– do not know if we can get it as function of two independent variables together too), fetching the equation from it and then taking first derivative of the equation, equating it to zero and getting the values of inputs, and then choosing the new sets of input which maximize the o/p from a given range. Would want to know if we have any method in excel to get the best fit equation for output involving all inputs, so that when i solve for all variables while maximizing the output, I can get it… Thanks in advance. 🙂
and also some method through which we can calculate the derivative of the trend line and get the set of values which maximize the output…. Later we can choose the set of inputs as per my requirement eg . Investing $5 will give me more profit compared to investing $10 or $2, but I have only $2 budget , hence would choose $2 in this case as investment, so my range becomes $0 to $2, where $2 had highest profit as output.. I am trying to make this sound as simple as possible … Apologies for the long text… But I am really stuck and need some help..
Thanks once again 🙂
Solver won’t calculate the derivative of the trend line, but it will provide the optimization capabilities that you are probably looking for. It will also allow you to specify constraints (such as a $2 budget).
Charles
For these sorts of problems, using Solver is usually a good approach.
I have shown how to do this in a number of places on the website. In particular, see
Exponential Regression using Solver
Charles
Hello, Charles.
A long time ago I found a real estate related linear regression on my old mac computer:
The inputs were Sold Price, Living Area, Days on Market (DOM)
The purpose was to predict the optimum price and DOM for various floor areas.
Can you point me in the right direction please. I’d really appreciate it.
Loren,
It sounds like a fit for multiple linear regression. I don’t understand the part about predicting DOM when DOM is one of the inputs though.
Charles
Hello again Charles,
I hope you are well. When you take the inverse of XtX (i.e. MMULT(TRANSPOSE(X),X)), what happens if the XtX is not invertible?
Thanks!
Fred
Fred,
You will get error values. There are techniques to deal with this situation, including Ridge Regression and LASSO Regression. These will be covered in the next release of the Real Statistics software. This release should be available in a few days.
Charles
Great timing then I guess 🙂 this situation occurs more often with categorical variables as they are encoded as 0s and 1s and I noticed that in many instances they generated matrices with “duplicated” columns or rows.
Again, thank you!
Fred
Fred,
I am also adding a new option to the Multiple Linear Regression data analysis tool that can be useful when you have a lot of independent variables.
Stay tuned.
Charles
As Charles says, you need the correlation matrix to include Y. If so, then the partial correlations are related to the T-statistics for each X-variable (you just need to know the residual degrees of freedom n-p-1. But to get the actual regression coefficients, I think you need to raw data, not just the correlation data.
Hello,
How do I make a least square regression analysis on a correlation matrix? I already have the matrix set up I am just not sure about which values would be inserted for x and y in the regression data analysis option on excel. Thank you!
Meredith,
The correlation matrix is for what data? If it only relates to the X data then you will missing something since you need to take the Y data into account to perform regression. Also you need to be able to take the means of the X data into account.
Charles