Linear Regression

The goal of linear regression analysis is to describe the relationship between two variables based on observed data and to predict the value of the dependent variable based on the value of the independent variable.


73 thoughts on “Linear Regression”

  1. Hello! Do I need to find the missing data for this question? If yes, how do I go about it in Excel? (A video will help.) Below is the data and what I want to get.
    Use Excel to run this data. Generate summary statistics, an ANOVA table, and descriptive statistics.
    Employees   Talent_Mgt   Performance
    A           12           56
    B           X            67
    C           16           X
    D           X            X
    E           20           X
    F           X            X
    G           X            89
    H           30           X
    I           32           100

  2. I have the data, which is as below:
    x = [1, 2, 3, 4, 5]
    y1 = [26.5, 139, 269, 337, 409]
    y2 = [26.2, 138, 273, 343, 404]
    Using the raw data, I want to obtain the following Results of Regression Analysis as per the web attachment. Thanks for the help.
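
    For illustration only, a minimal Excel sketch of a simple regression on data laid out like this, assuming x is in A2:A6 and y1 in B2:B6; the cell addresses are assumptions, not part of the question:
    =SLOPE(B2:B6, A2:A6)        slope of the fitted line y1 = intercept + slope*x
    =INTERCEPT(B2:B6, A2:A6)    intercept
    =RSQ(B2:B6, A2:A6)          R-square of the fit
    The same three formulas applied to y2 give the second regression; the full ANOVA table can be produced with Excel's Data Analysis > Regression tool or the Real Statistics Linear Regression data analysis tool.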

  3. Dear Sir,

    Please look at my work below. I am not sure whether this method is acceptable.
    I want to generate synthetic data for testing or teaching linear regression.
    I use Excel's random functions to generate it.

    The linear regression is modeled as
    y(i) = a + b*x(i) + e(i)

    a and b are constants, thus for all i:
    a = RAND()*(U_a - L_a) + L_a
    b = RAND()*(U_b - L_b) + L_b

    U_a and L_a are the upper and lower bounds for a
    U_b and L_b are the upper and lower bounds for b

    x(i) is a variable, thus for all i:
    x(1) = RAND()*(U_x - L_x) + L_x
    x(2) = RAND()*(U_x - L_x) + L_x
    ...
    x(n-1) = RAND()*(U_x - L_x) + L_x
    x(n) = RAND()*(U_x - L_x) + L_x

    U_x and L_x are the upper and lower bounds for x

    e(i) is a variable, normally distributed with mean 0, thus for all i:
    e(1) = NORM.INV(RAND(), m, s)
    e(2) = NORM.INV(RAND(), m, s)
    ...
    e(n-1) = NORM.INV(RAND(), m, s)
    e(n) = NORM.INV(RAND(), m, s)

    m is the mean, which equals 0
    s is the standard deviation

    Therefore, the variable y(i) for all i:
    y(1) = a + b*x(1) + e(1)
    y(2) = a + b*x(2) + e(2)
    ...
    y(n-1) = a + b*x(n-1) + e(n-1)
    y(n) = a + b*x(n) + e(n)

    Please note that a and b are constants, so each has only one random value.
    So, when dragging the formulas down, use absolute cell references (anchoring) to avoid a and b generating a different value for each i.
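
    For illustration, a minimal worksheet layout for this scheme; the cell addresses, bounds, and sample size are arbitrary assumptions, not part of the method:
    B1: =RAND()*(10-2)+2               constant a, with L_a = 2 and U_a = 10
    B2: =RAND()*(5-1)+1                constant b, with L_b = 1 and U_b = 5
    A5: =RAND()*(100-0)+0              x(1), with L_x = 0 and U_x = 100; drag down to A104 for n = 100
    B5: =NORM.INV(RAND(), 0, 3)        e(1), with m = 0 and s = 3; drag down to B104
    C5: =$B$1 + $B$2*A5 + B5           y(1); drag down to C104
    The $B$1 and $B$2 anchors keep a and b fixed while dragging.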

  4. Hello Charles,
    I have conducted a survey of patients with a chronic medical condition and have received 39 responses to the questionnaire. I need to identify whether there is a relation between their mobility level (independent variable; mobility is classified into 5 levels, level 1 to level 5, with decreasing mobility) and their receiving a transition plan (dependent variable; yes or no).
    What statistical test will serve my purpose best? I think logistic regression is not feasible due to the small sample size.
    Thank you so much for taking the time to reply.

    Regards
    Bhushan

    • Hello Bhushan,
      You can divide the 39 people who answered the questionnaire into two groups: group A, who received a transition plan, and group B, who didn't. Now you can perform a two-sample t-test between the two samples based on the mobility value. Since it is likely that the normality assumption will fail, you can instead perform a Mann-Whitney test.
      Charles
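
      For illustration, a minimal Excel sketch of the t-test step, assuming group A's mobility levels are in A2:A20 and group B's in B2:B21; the ranges are assumptions:
      =T.TEST(A2:A20, B2:B21, 2, 3)    two-tailed p-value from an unequal-variance (type 3) t-test
      The Mann-Whitney alternative is described on the Real Statistics website.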

  5. RegCoeff(x,y) causes an error when there are missing values in x.
    Is there any method to ignore missing values in x and obtain the regression coefficients from the non-missing data?

    • Yes, you need to first eliminate any rows with missing data. The usual approach is to use listwise deletion. This means that any row that is missing any element of X or Y data is removed. If, for example, your X data is contained in range A1:C10 and your Y data is contained in range D1:D10, then you can use the Real Statistics array formula =DELROWS(A1:D10) to remove any rows with missing data. Next, you can use RegCoeff to find the regression coefficients.
      Listwise deletion may be appropriate when you have very few missing data entries. There are other techniques, however, for dealing with missing data. These consist of imputing values for the missing data based on the data that is not missing. These are described at
      https://www.real-statistics.com/handling-missing-data/
      Charles
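
      For illustration, a rough sketch of the sequence described above, assuming X data in A1:C10, Y data in D1:D10, and the cleaned output placed in F1:I10; the output location is an assumption:
      F1:I10: =DELROWS(A1:D10)         entered as an array formula; rows with missing data are removed
      =RegCoeff(F1:H10, I1:I10)        regression coefficients on the cleaned X and Y ranges
      If fewer than 10 rows survive the deletion, restrict the RegCoeff ranges to the rows that actually contain data.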

  6. Hello Charles;
    I am stuck trying to test the homoscedasticity and normality assumptions; my simple linear regression model does not satisfy both. The Pearson coefficient is about 0.95.
    My data is the following:
    Respirable Dust, Crystalline Silica
    1.2 0.084
    0.76 0.039
    7.9 0.33
    14 0.7
    27 2.6
    4 0.11
    3.1 0.25
    4.1 0.35
    3.3 0.3
    7.1 0.47
    Can I keep working with my model even if those assumptions are not satisfied? How valid will my model be?
    Thank you for your help

  7. Hi Charles,

    Would you please help me out with the following doubts?

    Do I need to have normally distributed y and x values for regression analysis? I am not talking about the errors but the variables. Or can I start without checking the normality of the available data? Awaiting your kind response.

    • Normality is not a requirement for regression. The normality of the residuals is required for the various tests (e.g. whether or not a regression coefficient is significantly different from zero). You can use the regression model to make predictions even if the normality assumption is not met.
      Charles

  8. Dear Charles,

    Thank you so much for this helpful tool!

    I need to compare two measurement methods, but their units are different. The "gold standard" ranges from 0.7 to 1 (variance of 0.03) and the new method from 3 to 6 (variance of 0.2).
    I was thinking of model II regression and ICC, but I'm not sure which to use or how to interpret the results of both methods.

    Could you please expand on their interpretation and if I’m doing the right thing with this type of analysis?

    Many thanks

    • Hello Shlomi,
      I would need more information before I could determine whether either of these is appropriate and which is better.
      Other approaches for comparison with a gold standard are Bland Altman, Lin’s Concordance Correlation Coefficient and Passing-Bablok Regression. All of these are described on the Real Statistics website.
      Charles

      • Hi Charles,

        Thank you for your reply.

        Since I'm comparing 2 measurements with different units, I believe I can only use one of your suggestions if I normalize the data somehow (min-max or z-scores) so I can compare both.
        Otherwise, I thought about Deming regression and ICC since, as far as I know, I can use them even if my units are different.

        I have one array (n=33) of scores from the gold standard, between 0.7 and 1, and another (also n=33) from the new device, between 3 and 6.

        Many thanks for your help

  9. To begin, I am not a statistician, so please excuse my ignorance. I'm trying to determine the effects of several factors on the results of a finite element analysis. I have several categorical variables and some continuous ones. I've read about and have completed the categorical coding for regression and the linear regression analysis using Real Statistics Using Excel. What I don't understand is how to use the results of the analysis. Can you tell me where to find this information?

  10. Hi Charles,
    Thanks for providing a wonderful resource! I have a question regarding linear regression. Assuming normality, I think the mean and the standard error of the y-estimate are functions of x. Is there a function currently available to find the standard error of the y-estimate at each value of x? I am assuming mean(y-hat) = a + bx and Var(y-hat) = Var(a) + x^2*Var(b) + 2*x*Cov(a,b). Thanks for your help!
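
    For illustration: in simple linear regression the standard error of the mean prediction at a given x0 is s_e*SQRT(1/n + (x0 - mean(x))^2/SSx), where s_e is the residual standard error and SSx is the sum of squared deviations of x. A minimal Excel sketch, assuming x values in A2:A20, y values in B2:B20, and x0 in D1; the ranges are assumptions:
    =STEYX(B2:B20, A2:A20) * SQRT(1/COUNT(A2:A20) + (D1-AVERAGE(A2:A20))^2/DEVSQ(A2:A20))
    For the standard error of an individual predicted value rather than the mean, add 1 inside the square root.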

  11. Hi Charles,
    I am trying to compare the regressions of the responses of 4 mutants against a control. All are independent. I need to test whether the intercepts are significantly different. Could you point me in the right direction? Many thanks.

    • Steven
      Are you saying that you have 4 linear regression models, each consisting of one of the mutants vs. the control, and you want to test whether the intercepts are equal (or show a significant difference)?
      Are the 4 regression lines parallel?
      Charles

  12. Good afternoon,
    I have the following question:
    I am doing an analysis with continuous variables.
    The correlation coefficient is -0.496630749,
    which means that the linear relationship between the variables is negative and very weak; however, as x increases, y decreases.
    How do I obtain the p-value (probability)?
    Thank you very much!!
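
    For illustration, the two-tailed p-value for a correlation coefficient r based on n pairs can be computed from t = r*SQRT(n-2)/SQRT(1-r^2) with n-2 degrees of freedom. A minimal Excel sketch, assuming r is in B1 and n in B2; the cells are assumptions:
    =T.DIST.2T(ABS(B1)*SQRT(B2-2)/SQRT(1-B1^2), B2-2)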

  13. Hello, the computed r is 0.201566193, which means that there is a weak correlation between the variables, while the p-value is 6.63411E-10, which tells me that I have to reject the null hypothesis. Is this result really possible?

    • Mary Ann,
      This is certainly possible. If you are using simple linear regression, then the p-value being so low only means that there is a significant difference between the population correlation and zero. It doesn’t mean that the population value of r is high; it just means that it is not likely to be zero.
      Charles
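
      A worked illustration with hypothetical numbers: with r = 0.2 and n = 900 pairs, t = r*SQRT(n-2)/SQRT(1-r^2) ≈ 0.2*30/0.98 ≈ 6.1, which corresponds to a two-tailed p-value on the order of 10^-9, e.g. =T.DIST.2T(6.1, 898) in Excel, even though the correlation itself is weak.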

  14. Charles,
    Thanks for all the interesting information you have available here.
    I’m relatively new to regressions and I’m hoping you can give me your thoughts on the following:
    I’m trying to roughly estimate/predict what the hourly energy prices ($/MWh), at a certain grid point, will be going forward, out 12 months.
    The issue I'm having is that the price I'm trying to estimate, let's call it "Price A", is relatively new, with only 6 months of hourly historical prices available. Knowing that this price is highly correlated with a different price, let's call it "Price B" (r = 0.98 and r-squared = 95%), and that Price B does have historical data going back multiple years, here's what I've done: using the equation y = m*x + a (where y = Price A and x = Price B), with the parameters m and a calculated in Excel, I estimated what the prices would have been at point A for, say, the last 12 months.
    Essentially, I would appreciate it if you could tell me whether or not this is a valid approach and what I should be doing next to estimate the prices for the next 12 months.
    I hope this isn’t too confusing, please let me know otherwise.
    Thank you in advance.
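
    For illustration, a minimal Excel sketch of this back-casting idea, assuming the overlapping hourly Price B values are in A2:A4381, the corresponding Price A values are in B2:B4381, and a Price B value for some other hour is in D1; all ranges are assumptions:
    =SLOPE(B2:B4381, A2:A4381)                 m
    =INTERCEPT(B2:B4381, A2:A4381)             a
    =FORECAST.LINEAR(D1, B2:B4381, A2:A4381)   estimated Price A for the Price B value in D1
    Dragging the FORECAST.LINEAR formula over a column of historical or projected Price B values gives the corresponding Price A estimates; forecasting the next 12 months this way still requires a forecast of Price B for that period.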

  15. Hi Charles

    For linear regression, if R-square is .209 (about 21%) and the ANOVA p-value is 0.01, what does this imply? Shouldn't the R-square value be greater than 30%?

  16. Hi Charles,
    Thanks so much for helping on here, first off. My question is that I am attempting to compare the number of OB-GYNs per capita for each state to the percentage of total deliveries that are via cesarean section, to see if there is a correlation between the two. So basically, fewer OB-GYNs might be associated with higher rates of cesarean section, for instance. Would this be a simple linear regression? Would it be proper to enter data for all 50 states to arrive at a conclusion? Thanks again

    • Hi Will,
      Yes, this could be a way to test this hypothesis. You don’t know a priori whether the relationship is linear (or quadratic or something else), but it makes sense to start with simple linear regression.
      Charles

      • Thanks. After obtaining the result, is there some other step to progress to based on the result? Statistics is not my strong point.

        • Will,
          The next step depends on why you decided to use regression in the first place. E.g., you can use the regression for forecasting.
          Charles

          • I'm merely trying to determine whether the two variables are correlated in some way at a statistical significance level of p = .05. Does this make sense or not?

          • Will,
            To test whether there is a significant correlation between two variables, please look at the following webpage
            Correlation Testing
            There are a number of different tests that can be used, as explained on the website.
            Charles

  17. Hello Charles.

    If a model has an R-square of 55%, is it doing a good job?

    Some people say it is good; others say it is weak.

    I would appreciate it if you helped me.

  18. Hello!

    I need to run regressions using this software for my BUS252 class. However, when I click "Linear Regression" and it prompts me to highlight the X and Y values, I'm unable to do so. I'm working from a MacBook but can't seem to get it to work. Any help would be much appreciated!

    • Hello Connor,
      For the Mac version of the software, you can’t highlight the range, but need to enter the ranges manually. E.g. for range D5:E24, you need to type in the value D5:E24 in the Input Range field. Good luck with your course.
      Charles

  19. hi Charles,

    I have a result where one of the independent variables shows the highest standardized beta, but its p-value is not significant.
    So is it accepted or rejected as the most influential variable?
    How do I interpret this result?

    Please help me.
    Thank you!

    • Hi Hani,

      Standardizing the variables doesn’t necessarily lead to identifying the most important/influential variable in linear regression. For example, see the webpage
      http://www.jerrydallal.com/lhsp/importnt.htm.

      Depending on what you mean by most influential, you might want to compare partial correlation coefficients or compare the R-square values when each variable is dropped from the model.

      Charles
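
      For illustration, a minimal Excel sketch of the R-square comparison mentioned above, assuming y is in A2:A50 and three contiguous independent variables x1-x3 are in B2:D50; the ranges are assumptions:
      =INDEX(LINEST(A2:A50, B2:D50, TRUE, TRUE), 3, 1)    R-square of the full model
      =INDEX(LINEST(A2:A50, B2:C50, TRUE, TRUE), 3, 1)    R-square with x3 dropped
      The drop in R-square when a variable is removed gives one rough measure of that variable's contribution; dropping a non-adjacent column requires rearranging the data first.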

  20. You have created a wonderful resource in this website and the Excel add-in. I’m wondering if you’re planning to add standardized coefficients to the linear regression routine. Is it already there and I missed it?

    Keep up the great work!
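
    For illustration: in simple linear regression the standardized coefficient equals the raw slope times sx/sy, which is also the correlation coefficient. A minimal Excel sketch, assuming x in A2:A50 and y in B2:B50; the ranges are assumptions:
    =SLOPE(B2:B50, A2:A50) * STDEV.S(A2:A50) / STDEV.S(B2:B50)
    For multiple regression the same scaling b_j*s_xj/s_y applies to each coefficient, or the regression can be rerun on variables standardized with STANDARDIZE.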

  21. When running a regression in Excel, what do you use to determine whether the null hypothesis is accepted or rejected? What parts of the regression output are used in the regression equation?

    • Gustavo,

      Q. When running a regression in Excel what do you use to determine if the null hypothesis is accepted or rejected?
      A. Significance F

      Q. What parts of the regression output are used in the regression equation?
      A. The coefficients
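
      A worked illustration with made-up numbers: if Significance F = .003 and alpha = .05, then .003 < .05, so the null hypothesis that all the regression coefficients are zero is rejected. If the output shows Intercept = 2.3 and an X coefficient of 0.8, the regression equation is predicted y = 2.3 + 0.8*x.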

      Charles

  22. I've loaded Real Statistics as an add-in, but I can't see it in the Data Analysis icon area. How do I use it? Where do I access it in the Excel page?
    Dan

  23. Thank you so much for this site. It is very helpful. I have to do a project for my bachelor's degree, but I don't know how to find data so that I can do the regression analysis. Can you tell me what variables to use for the regression analysis? My theme is agriculture and the environment. Thank you

    Regards
    Kalina

    • Kalina,
      You haven't provided enough information for me to tell you what variables to use, but think about what hypothesis in the area of agriculture and the environment you want to test. From this will naturally follow which variables you need to study.
      Charles

  24. Dear Charles:

    Thank you for the explanation. This information will be very useful for my future calculations.

    Regards.

    William Agurto.

  25. Dear Charles:

    I found a little bug in the residuals and Cook's D sections (when those options are selected in the linear regression analysis):
    1. There's a section to the right of the Cook's D table that calculates dfE (degrees of freedom of the errors) by subtracting (k+1) instead of subtracting k.
    2. Because of that, there's a difference between the Excel Data Analysis add-in results and the Real Statistics 2.16.1 results in the calculation of Standard Residuals (SResidual in the Cook's D table).
    3. Also, for that reason, there are some mistakes in the calculation of the ModMSE, RStudent, T-Test, Cook's D, and DFFITS columns in the Cook's D table.

    I hope these comments will be helpful.

    Regards.

    William Agurto.

    • Dear Charles:

      The bug in the SResidual calculation is still unfixed in Real Statistics 2.17.
      I present an example to make the explanation simpler:

      1. Data for analysis:
      X Y
      9 0.98
      14 0.96
      25 0.94
      30 0.92
      59 0.9
      122 0.88
      128 0.86
      138 0.84
      144 0.82

      2. Results obtained in Excel 2010 (using Data Analysis Add-in) for RESIDUAL OUTPUT:

      Observation Predicted Y Residuals Standard Residuals
      1 0.959434128 0.020565872 1.305567744
      2 0.954885598 0.005114402 0.324673749
      3 0.944878831 -0.004878831 -0.309719156
      4 0.940330301 -0.020330301 -1.290613152
      5 0.913948826 -0.013948826 -0.885502786
      6 0.856637346 0.023362654 1.483113757
      7 0.851179109 0.008820891 0.559969944
      8 0.842082049 -0.002082049 -0.132173143
      9 0.836623813 -0.016623813 -1.055316957

      3. Results obtained in Real Statistics 2.17 for Cook’s D Table (only columns 1, and 4 to 7 are shown):

      Obs Pred Y Residual Leverage SResidual
      1 0.959434128 0.020565872 0.271566883 1.221246799
      2 0.954885598 0.005114402 0.247947105 0.303704483
      3 0.944878831 -0.004878831 0.202599637 -0.289715742
      4 0.940330301 -0.020330301 0.184994445 -1.207258058
      5 0.913948826 -0.013948826 0.11994921 -0.82831201
      6 0.856637346 0.023362654 0.1965224 1.387325886
      7 0.851179109 0.008820891 0.219377825 0.523803919
      8 0.842082049 -0.002082049 0.263484784 -0.123636654
      9 0.836623813 -0.016623813 0.293557711 -0.987158622

      4. As you can see, the standard residuals obtained by the Data Analysis add-in differ from those obtained in Real Statistics 2.17. As I commented in the prior message, this is because Real Statistics 2.17 calculates dfE (degrees of freedom of the errors) by subtracting (k+1) instead of subtracting k (in the example, 7 instead of 8). When I manually correct this difference (I write 8 instead of 7), all the standard residuals are OK.
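
      For illustration, a minimal Excel sketch that reproduces the Data Analysis add-in's Standard Residuals from a column of residuals, assuming the residuals are in C2:C10; the range is an assumption:
      =C2/SQRT(SUMSQ($C$2:$C$10)/(COUNT($C$2:$C$10)-1))
      dragged down the column. Replacing the -1 with -2, i.e. dividing by n-k-1 rather than n-k with k = 1, reproduces the Real Statistics 2.17 SResidual values shown above.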

      I hope this comment will be helpful.

      Regards.

      William Agurto.

