Multiple Regression using Matrices

Definition 1: We now reformulate the least-squares model using matrix notation (see Basic Concepts of Matrices and Matrix Operations for more details about matrices and how to operate with matrices in Excel).

We start with a sample {y1, …, yn} of size n for the dependent variable y and samples {x1j, x2j, …, xnj} for each of the independent variables xj for j = 1, 2, …, k.

Let Y be the n × 1 column vector with entries y1, …, yn. Let X be the n × (k+1) matrix (called the design matrix):

X =
⎡ 1   x11   x12   …   x1k ⎤
⎢ 1   x21   x22   …   x2k ⎥
⎢ ⋮     ⋮     ⋮           ⋮  ⎥
⎣ 1   xn1   xn2   …   xnk ⎦

The n equations

yi = β0 + β1xi1 + β2xi2 + ⋯ + βkxik + εi        for i = 1, 2, …, n

can now be expressed as the single matrix equation

Y = Xβ + ε

where β is the (k+1) × 1 column vector with entries β0, β1, …, βk and ε is the n × 1 column vector with entries ε1, …, εn.

Let B be the (k+1) × 1 column vector consisting of the coefficients b0, b1, …, bk, and let Ŷ be the n × 1 column vector consisting of the entries ŷ1, …, ŷn. Then the least-squares model can be expressed as

Ŷ = XB

Furthermore, we define the n × n hat matrix H as

H = X(XᵀX)⁻¹Xᵀ
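To make the matrix construction concrete, here is a minimal numpy sketch (an illustration only, not the Real Statistics Excel implementation) that builds the design matrix from raw predictor data and forms the hat matrix; the data and variable names are made up for the example.

```python
import numpy as np

# Made-up sample: n = 5 observations on k = 2 independent variables.
x_data = np.array([[1.0, 2.0],
                   [2.0, 1.0],
                   [3.0, 4.0],
                   [4.0, 3.0],
                   [5.0, 6.0]])          # n x k matrix of the x_ij values
n, k = x_data.shape

# Design matrix X: a leading column of 1s followed by the k predictor columns,
# giving the n x (k+1) matrix of Definition 1.
X = np.column_stack([np.ones(n), x_data])

# Hat matrix H = X (X^T X)^(-1) X^T, an n x n matrix.
H = X @ np.linalg.inv(X.T @ X) @ X.T

# H is symmetric and idempotent (H @ H == H), standard properties of the hat matrix.
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)
```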

Definition 2: We can extend the definition of expectation to vectors as follows. Let A = [aij] be an m × n matrix. Then the expectation of A is the m × n matrix whose elements are E[aij].

If X is an n × 1 column vector, then the covariance matrix of X is the n × n matrix

E[(X − E[X])(X − E[X])ᵀ]

Observation: The linearity assumption for multiple linear regression can be restated in matrix terminology as

E[ε] = 0

From the independence and homogeneity of variances assumptions, we know that the n × n covariance matrix of ε can be expressed as

σ²I

where I is the n × n identity matrix.

Note too that the covariance matrix for Y is also σ²I.

Property 1:
B = (XᵀX)⁻¹XᵀY

Property 2:
Ŷ = HY

Property 3: B is an unbiased estimator of β, i.e. E[B] = β

Property 4: The covariance matrix of B can be represented by

σ²(XᵀX)⁻¹

Observation: Click here for proofs of the above four properties.
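As a numerical illustration of Properties 1, 2, and 4, here is a numpy sketch with made-up data (the Real Statistics add-in computes these quantities in Excel; this is only an illustration of the matrix formulas). The residual-mean-square estimate s² = SSE/(n − k − 1) used in place of σ² below is standard but is not derived in this section.

```python
import numpy as np

# Made-up sample: n = 6 observations, k = 2 independent variables.
X = np.column_stack([np.ones(6),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]])    # n x (k+1) design matrix
Y = np.array([3.1, 4.2, 6.8, 7.9, 10.2, 11.1])           # observed y values

XtX_inv = np.linalg.inv(X.T @ X)

# Property 1: B = (X^T X)^(-1) X^T Y
B = XtX_inv @ X.T @ Y

# Property 2: Y-hat = H Y, which agrees with Y-hat = X B
H = X @ XtX_inv @ X.T
Y_hat = H @ Y
assert np.allclose(Y_hat, X @ B)

# Property 4: Cov(B) = sigma^2 (X^T X)^(-1).  sigma^2 is unknown, so this sketch
# plugs in the usual estimate s^2 = SSE / (n - k - 1).
n, p = X.shape                                           # p = k + 1
s2 = np.sum((Y - Y_hat) ** 2) / (n - p)
cov_B = s2 * XtX_inv                                     # estimated covariance matrix of B
print("B =", B)
print("standard errors =", np.sqrt(np.diag(cov_B)))
```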

Definition 3: Let X, Y and B be defined as in Definition 1. Define the residuals vector E to be the n × 1 column vector with entries e1, …, en such that ei = yi − ŷi.

Observation: The regression model can be expressed as

 Y = XB + E

where B can be expressed as in Property 1.
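Continuing the sketch above (same made-up data), the residuals vector E of Definition 3 gives the exact decomposition Y = XB + E, and because B = (XᵀX)⁻¹XᵀY, the residuals are orthogonal to every column of the design matrix:

```python
import numpy as np

X = np.column_stack([np.ones(6),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]])
Y = np.array([3.1, 4.2, 6.8, 7.9, 10.2, 11.1])
B = np.linalg.inv(X.T @ X) @ X.T @ Y         # Property 1

# Definition 3: e_i = y_i - y-hat_i, so Y = XB + E exactly.
E = Y - X @ B
assert np.allclose(Y, X @ B + E)

# X^T E = X^T Y - (X^T X)(X^T X)^(-1) X^T Y = 0, so the residuals are
# orthogonal to the columns of the design matrix.
assert np.allclose(X.T @ E, 0)
```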

17 thoughts on “Multiple Regression using Matrices”

  1. Hi Charles,

    I have a question, hope you can guide me.

    For example, suppose we have 6 predictor variables and 3 dependent variables (having different known probability distributions: Weibull, exponential, and logistic). Is there a way to form an equation representing the dependent variables as a function of the predictor variables?

    When using multivariate linear regression, I get the coefficients, but I am unable to validate the predictions with multiple linear regression.

    I do not have a strong background in statistics, but if you suggest something relevant I will check it.

    • Hi Jamil,
      Unfortunately, Real Statistics doesn’t yet support multivariate regression (i.e. where there is more than one dependent variable). In any case, the usual assumption is that variables follow a normal distribution, and so I don’t know how to deal with the case where there is an assortment of distributions. Perhaps a bootstrapping approach would be appropriate.
      Charles

  2. Charles,

    love your work… thank you so much…

    I have wondered about this question for a while… it doesn't really relate to multiple regression, but I thought I'd throw it here.

    If I add an asset (say BABA) to an index (say the S&P 500, SPY), what then is BABA's correlation to the new index, assuming we have the 4 relevant stats (stdev of BABA and SPY, correlation of the two, and the % weight of BABA in the new portfolio)?

    surprised I don’t see

    • Hello Stephen,
      Glad to read that you appreciate the Real Statistics project and thank you for your support.
      Correlation is calculated based on pairs of data in two data sets. What are these pairs in the case of BABA vs SPY? You reference stdev of BABA, but what BABA data are you taking the std deviation of? You also reference the % weight of BABA; is the corresponding data element in SPY 100%?
      Charles

  3. Hi Charles,
    I have a scenario which I would describe as multivariate, nonlinear regression. I want to maximize the profit (output variable) and hence get the values for the inputs (freshness percentage, quantity, expenditure on advertisement). I am doing it by getting the trend line from the past data (in Excel I am able to get the trend line of only one input vs. the output; I do not know if we can get it as a function of two independent variables together), fetching the equation from it, taking the first derivative of the equation, equating it to zero to get the values of the inputs, and then choosing the new set of inputs which maximizes the output within a given range. I would like to know whether there is any method in Excel to get the best-fit equation for the output involving all inputs, so that when I solve for all variables while maximizing the output, I can get it… Thanks in advance. 🙂

    • …and also some method through which we can calculate the derivative of the trend line and get the set of values which maximizes the output. Later we can choose the set of inputs as per my requirements, e.g. investing $5 will give me more profit compared to investing $10 or $2, but I only have a $2 budget, hence I would choose $2 in this case as the investment, so my range becomes $0 to $2, where $2 had the highest profit as output. I am trying to make this sound as simple as possible… Apologies for the long text… but I am really stuck and need some help.

      Thanks once again 🙂

      • Solver won’t calculate the derivative of the trend line, but it will provide the optimization capabilities that you are probably looking for. It will also allow you to specify constraints (such as a $2 budget).
        Charles

  4. Hello, Charles.
    A long time ago I found a real-estate-related linear regression on my old Mac computer:
    The inputs were Sold Price, Living Area, and Days on Market (DOM).
    The purpose was to predict the optimum price and DOM for various floor areas.
    Can you point me in the right direction, please? I'd really appreciate it.

    • Loren,
      It sounds like a fit for multiple linear regression. I don’t understand the part about predicting DOM when DOM is one of the inputs though.
      Charles

  5. Hello again Charles,
    I hope you are well. When you take the inverse of XtX (i.e. MMULT(TRANSPOSE(X),X)), what happens if XtX is not invertible?
    Thanks!
    Fred

    • Fred,
      You will get error values. There are techniques to deal with this situation, including Ridge Regression and LASSO Regression. These will be covered in the next release of the Real Statistics software. This release should be available in a few days.
      Charles

      • Great timing then, I guess 🙂 This situation occurs more often with categorical variables, as they are encoded as 0s and 1s, and I noticed that in many instances they generated matrices with "duplicated" columns or rows.
        Again, thank you!
        Fred

        • Fred,
          I am also adding a new option to the Multiple Linear Regression data analysis tool that can be useful when you have a lot of independent variables.
          Stay tuned.
          Charles

  6. As Charles says, you need the correlation matrix to include Y. If so, then the partial correlations are related to the t-statistics for each X variable (you just need to know the residual degrees of freedom, n − p − 1). But to get the actual regression coefficients, I think you need the raw data, not just the correlation data.

  7. Hello,

    How do I make a least-squares regression analysis on a correlation matrix? I already have the matrix set up; I am just not sure which values would be inserted for x and y in the regression data analysis option in Excel. Thank you!

    • Meredith,
      The correlation matrix is for what data? If it only relates to the X data, then you will be missing something, since you need to take the Y data into account to perform regression. Also, you need to be able to take the means of the X data into account.
      Charles

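The thread on a non-invertible XᵀX (comment 5 above) mentions Ridge Regression as one remedy, for example when dummy-coded categorical variables produce duplicated columns. The following is a minimal numpy sketch of that idea only, not the Real Statistics implementation; the penalty lam is an arbitrary illustrative value, and for simplicity the intercept is penalized too, which a production implementation normally would not do.

```python
import numpy as np

def ridge_coefficients(X, Y, lam=0.1):
    """Ridge estimate B = (X^T X + lam*I)^(-1) X^T Y.

    Adding lam*I makes the matrix nonsingular even when X^T X is not
    invertible, e.g. when two columns of X are identical.  lam = 0.1 is
    an arbitrary choice for this sketch.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Example with a duplicated column, so X^T X is singular and the ordinary
# formula (X^T X)^(-1) X^T Y would fail with an error.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones(5), x1, x1])    # third column duplicates the second
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(ridge_coefficients(X, Y))
```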

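Comment 6 above notes that the partial correlations are related to the t-statistics of the regression coefficients. For reference, the standard relationship is r = t / √(t² + df) with df = n − p − 1; the sketch below checks it numerically on simulated data (illustrative only, with made-up variable names).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])     # design matrix with intercept
df = n - 2 - 1                                # residual degrees of freedom (n - p - 1)

# t-statistic for the x1 coefficient from the full regression.
XtX_inv = np.linalg.inv(X.T @ X)
B = XtX_inv @ X.T @ y
resid = y - X @ B
s2 = resid @ resid / df
t1 = B[1] / np.sqrt(s2 * XtX_inv[1, 1])

# Partial correlation of y and x1 given x2: correlate the residuals of
# y-on-(1, x2) with the residuals of x1-on-(1, x2).
Z = np.column_stack([np.ones(n), x2])
ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
rx = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
partial_r = (ry @ rx) / np.sqrt((ry @ ry) * (rx @ rx))

# Agreement with r = t / sqrt(t^2 + df)
assert np.isclose(partial_r, t1 / np.sqrt(t1**2 + df))
```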