The goal of linear regression analysis is to describe the relationship between two variables based on observed data and to predict the value of the dependent variable based on the value of the independent variable.
Topics
- Method of Least Squares
- Regression Analysis
- Testing if the regression line is a good fit
- Testing the significance of the slope of the regression line
- Confidence and prediction intervals for regression
- Exponential Regression Model
- Power Regression Model
- Linear regression models for comparing means
- Total least squares
- Deming Regression
- Passing-Bablok Regression
- LOESS Regression
- Kendall-Theil-Sen Regression
Hello! Do I need to find the missing data for this question? If yes, how do I do that in Excel? (A video would help.) Below are the data and what I want to get.
Use Excel to run this data. Generate summary statistics, an ANOVA table, and descriptive statistics.
Employees   Talent_Mgt   Performance
A           12           56
B           X            67
C           16           X
D           X            X
E           20           X
F           X            X
G           X            89
H           30           X
I           32           100
Hello Jojo,
Standard Excel doesn’t provide such a capability, but you can use the approaches described on this website using Excel.
See https://real-statistics.com/handling-missing-data/
Charles
I have the data, which is as below:
x = [1, 2, 3, 4, 5]
y1 = [26.5, 139, 269, 337, 409]
y2 = [26.2, 138, 273, 343, 404]
Using the raw data, I want to obtain the following results of the regression analysis, as per the web attachment. Thanks for the help.
image web address is https://ibb.co/r7WFrzm
Dear Sir,
Please look at my work below. I am not sure whether this method is acceptable.
I want to generate synthetic data for testing or teaching linear regression.
I use the Excel random function to generate it.
Linear regression is modeled as:
y(i) = a + b*x(i) + e(i)
a and b are constants; thus, for all i:
a = RAND()*(U_a - L_a) + L_a
b = RAND()*(U_b - L_b) + L_b
U_a and L_a are the upper and lower bounds for a
U_b and L_b are the upper and lower bounds for b
x(i) is a variable; thus, for all i:
x(1) = RAND()*(U_x - L_x) + L_x
x(2) = RAND()*(U_x - L_x) + L_x
…
x(n-1) = RAND()*(U_x - L_x) + L_x
x(n) = RAND()*(U_x - L_x) + L_x
U_x and L_x are the upper and lower bounds for x
e(i) is a variable, normally distributed with mean 0; thus, for all i:
e(1) = NORM.INV(RAND(), m, s)
e(2) = NORM.INV(RAND(), m, s)
…
e(n-1) = NORM.INV(RAND(), m, s)
e(n) = NORM.INV(RAND(), m, s)
m is the mean, which is equal to 0
s is the standard deviation
Therefore, the variable y(i) for all i:
y(1) = a + b*x(1) + e(1)
y(2) = a + b*x(2) + e(2)
…
y(n-1) = a + b*x(n-1) + e(n-1)
y(n) = a + b*x(n) + e(n)
Please note that a and b are constants; therefore, each has only one random value.
So, when dragging, use absolute cell references (anchors) to avoid a and b generating multiple values across i.
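The Excel procedure above can be cross-checked with a short Python sketch; the bounds, s, and n below are arbitrary illustration values, not taken from the original post:

```python
import random

# Sketch of the scheme above; bounds, s, and n are made-up illustration values.
L_a, U_a = 0.0, 5.0       # lower/upper bounds for intercept a
L_b, U_b = 1.0, 3.0       # lower/upper bounds for slope b
L_x, U_x = 0.0, 10.0      # lower/upper bounds for x
s, n = 0.5, 20            # noise standard deviation and sample size

a = random.uniform(L_a, U_a)   # drawn once: a is constant for all i
b = random.uniform(L_b, U_b)   # drawn once: b is constant for all i
x = [random.uniform(L_x, U_x) for _ in range(n)]   # RAND()*(U_x - L_x) + L_x
e = [random.gauss(0, s) for _ in range(n)]         # NORM.INV(RAND(), 0, s)
y = [a + b * xi + ei for xi, ei in zip(x, e)]
```

Drawing a and b once before the list comprehensions plays the same role as anchoring the cell references in Excel.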
Hello Maaruf,
If I understand what you are doing correctly, your approach does seem to be correct.
Charles
Hello Charles,
I have conducted a survey of patients with a chronic medical condition and received 39 responses to the questionnaire. I need to identify whether there is a relation between their mobility level (independent variable; mobility is classified into 5 levels, level 1 to level 5, with decreasing mobility) and whether they received a transition plan (dependent variable; yes or no).
What statistical test will serve my purpose best? I think logistic regression is not feasible due to the small sample size.
Thank you so much for taking the time to reply.
Regards
Bhushan
Hello Bhushan,
You can divide the 39 people who answered the questionnaire into two groups: Group A who received a transition plan and group B who didn’t receive a transition plan. Now you can perform a two sample t-test between the two samples based on the mobility value. Since it is likely that the normality assumption will fail, you can instead perform a Mann-Whitney test.
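As a sketch of the suggested Mann-Whitney test, here is a pure-Python version using the normal approximation without a tie correction; the mobility levels below are made-up illustration values (in practice a library routine such as scipy.stats.mannwhitneyu handles ties and exact p-values):

```python
import math
from statistics import NormalDist

def mann_whitney(a, b):
    """Mann-Whitney U with a normal approximation (no tie correction)."""
    pooled = sorted(a + b)
    # assign average ranks to tied values
    rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2   # mean of 1-based ranks i+1 .. j
        i = j
    n1, n2 = len(a), len(b)
    r1 = sum(rank[v] for v in a)            # rank sum of group a
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = 2 * NormalDist().cdf((u - mu) / sigma)   # u <= mu, so this is two-tailed
    return u, min(p, 1.0)

# e.g. mobility levels of Group A (received a plan) vs Group B (did not)
u, p = mann_whitney([1, 2, 2, 3, 1], [4, 5, 3, 5, 4])
```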
Charles
RegCoeff(x,y) causes an error when there are missing values in x.
Is there any way to ignore the missing values in x and obtain regression coefficients from the non-missing data?
Yes, you need to first eliminate any rows with missing data. The usual approach is to use listwise deletion. This means that any row that is missing any element of X or Y data is removed. If, for example, your X data is contained in range A1:C10 and your Y data is contained in range D1:D10, then you can use the Real Statistics array formula =DELROWS(A1:D10) to remove any rows with missing data. Next, you can use RegCoeff to find the regression coefficients.
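The effect of DELROWS (listwise deletion) can be sketched in plain Python, treating None as a missing cell; the data values below are made up:

```python
# Drop any row that is missing an X or Y entry, then keep the rest.
# Data values are made-up illustration values.
rows = [
    (1.0, 2.0, 3.1),
    (2.0, None, 4.9),    # removed: missing X value
    (3.0, 4.0, 7.2),
    (4.0, 5.0, None),    # removed: missing Y value
    (5.0, 6.0, 11.0),
]
complete = [r for r in rows if None not in r]
X = [r[:-1] for r in complete]   # predictor columns
Y = [r[-1] for r in complete]    # response column
```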
Listwise elimination may be appropriate when you have very few missing data entries. There are other techniques, however, for dealing with missing data. These consist of imputing values for the missing data based on the data that is not missing. These are described at
https://www.real-statistics.com/handling-missing-data/
Charles
What if the p-value of the Hosmer-Lemeshow test is greater than 0.05?
Hello Genamo,
A significant result (i.e. when the p-value < .05) indicates that the logistic regression model is not a good fit for the data; thus a p-value greater than .05 means the test does not detect a lack of fit. See the following webpage for details: https://www.real-statistics.com/logistic-regression/hosmer-lemeshow-test/
Charles
Hello Charles;
I am stuck trying to test the homoscedasticity and normality assumptions, but my simple linear regression model does not satisfy either. The Pearson coefficient is about 0.95.
My data are the following:
Respirable Dust (Polvo Respirable), Crystalline Silica (Sílice Cristalino)
1.2 0.084
0.76 0.039
7.9 0.33
14 0.7
27 2.6
4 0.11
3.1 0.25
4.1 0.35
3.3 0.3
7.1 0.47
Can I keep working with my model even if those assumptions are not met? How valid will my model be?
Thank you for your help
Carlos,
See the following webpages for the answers to your questions:
1. Homoscedasticity: https://www.real-statistics.com/multiple-regression/heteroskedasticity/
2. Normality: You need to check whether the residuals are normally distributed:
https://www.real-statistics.com/multiple-regression/residuals/
https://www.real-statistics.com/tests-normality-and-symmetry/
3. You can use the regression model for forecasting even when the assumptions are not met. p-values and confidence intervals may not be accurate though.
Charles
Thank you Charles, you helped me a lot.
God bless you and your family
Charles
I click Reg, but it doesn't show me the Breusch-Pagan test; maybe it's because I have Office 2007.
Carlos,
Enter the formula =VER() to see whether you are using the Excel 2007 version of the software.
Charles
Thanks Charles,
When I enter =VER(), it shows me
5.11 Excel 2007
I did the test for homoscedasticity manually using the following three methods:
1. NWRSR – Neter-Wasserman / Ramsey / Spearman rho test,
2. NWGQ – Neter-Wasserman / Goldfeld-Quandt, and
3. Breusch-Pagan-Godfrey.
When I use a significance level of alpha = 0.05, homoscedasticity is rejected, but when I use alpha = 0.01, the data pass all three tests for homoscedasticity.
Is it valid to use alpha = 0.01 for my model?
Hello Carlos,
Yes, this seems to be a perplexing development. This is a consequence of the fact that usually, the alternative hypothesis is what we are trying to demonstrate, but in this case, the null hypothesis is the desirable state.
I don’t have an easy answer for you. I simply conclude that homoscedasticity is questionable and move forward with the ordinary regression. You might want to correct for heteroscedasticity and see whether the results are different.
Charles
Can demand forecasts be made with 4 base years of data?
Sorry, but I don’t understand your question. What do the 4 base years refer to and what sort of forecast do you want to make?
Charles
Dear Charles,
Does RealStats allow one to run mixed model linear regression?
Thanks,
Ian Moffit
Hello Ian,
RealStats doesn’t currently support mixed model linear regression. I do plan to add this capability in the future.
Charles
Hi Charles,
Would you please help me out on the following doubts?
Do I need to have normal y and x values for regression analysis? I am not talking about the errors but the variables themselves. Or can I start without checking the normality of the available data? Awaiting your kind response.
Normality is not a requirement for regression. The normality of the residuals is required for the various tests (e.g. whether or not a regression coefficient is significantly different from zero). You can use the regression model to make predictions even if the normality assumption is not met.
Charles
Many thanks
Dear Charles,
Thank you so much for this helpful tool!
I need to compare two measurement methods, but their units are different. The “gold standard” ranges from 0.7 to 1 (variance of 0.03) and the new method ranges from 3 to 6 (variance of 0.2).
I was thinking of model ii regression and ICC, but I’m not sure which to use and how to interpret the results of both methods.
Could you please expand on their interpretation and if I’m doing the right thing with this type of analysis?
Many thanks
Hello Shlomi,
I would need more information before I could determine whether either of these is appropriate and which is better.
Other approaches for comparison with a gold standard are Bland Altman, Lin’s Concordance Correlation Coefficient and Passing-Bablok Regression. All of these are described on the Real Statistics website.
Charles
Hi Charles,
Thank you for your reply.
Since I’m comparing 2 measurements with different units, I believe I can only use one of your suggestions if I normalize the data somehow (min-max or z) so I can compare both.
Otherwise, I thought about Deming and ICC since, as much as I know, I can use them even if my units are different.
I have one array (n=33) from the gold-standard of scores between 0.7 to 1, and the other (also 33) from the new device of scores between 3-6.
Many thanks for your help
To begin, I am not a statistician, so please excuse my ignorance. I’m trying to determine the effects of several factors on the results of a finite element analysis. I have several categorical variables and some continuous ones. I’ve read about and have completed the categorical coding for regression and the linear regression analysis using Real Statistics Using Excel. What I don’t understand is how to use the results of the analysis. Can you tell me where to find this information?
Hi Tim,
This is explained on the website. Since you have several independent variables, you should look at
Multiple Regression
See especially Multiple Regression Analysis
Usually, regression is used to make forecasts or predictions (although there are other motivations). This can be done using the TREND function, as explained at:
https://real-statistics.com/multiple-regression/multiple-regression-analysis/multiple-regression-analysis-excel/
Charles
Hi Charles,
Thanks for providing a wonderful resource! I have a question regarding linear regression. Assuming normality, I think the mean and the standard error of the y-estimate are functions of x. Is there a function currently available to find the standard error of the y-estimate at each value of x? I am assuming mean(y-hat) = a + bx and Var(y-hat) = Var(a) + x^2*Var(b) + 2*x*Cov(a,b). Thanks for your help!
Lina,
See https://real-statistics.com/regression/confidence-and-prediction-intervals/
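Lina's expression Var(ŷ) = Var(a) + x²·Var(b) + 2x·Cov(a,b) simplifies algebraically to the textbook form MSE·(1/n + (x − x̄)²/Sxx). A minimal check with made-up data:

```python
# Standard error of the mean prediction y-hat at a point x0:
#   se(x0) = sqrt(MSE * (1/n + (x0 - xbar)^2 / Sxx))
# which equals Var(a) + x0^2*Var(b) + 2*x0*Cov(a,b) after expanding.
# The data below are made-up illustration values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
mse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

def se_yhat(x0):
    # standard error of the fitted mean at x0
    return (mse * (1 / n + (x0 - xbar) ** 2 / sxx)) ** 0.5
```

The standard error is smallest at x̄ and grows as x0 moves away from it, which is why confidence bands flare at the ends of the data range.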
Charles
Thank you Charles!
Hi Charles,
I am trying to compare the regressions of the responses of 4 mutants against a control. All are independent. I need to test whether the intercepts are significantly different. Could you point me in the right direction. Many thanks.
Steven
Are you saying that you have 4 linear regression models, each one consists of one of the mutants vs the control, and want to test whether the intercepts are equal (or show a significant difference)?
Are the 4 regression lines parallel?
Charles
Hi. Does your Excel add-in support quantile regression, as well as supporting linear models for the mean? Thanks. Adam
Adam,
The Real Statistics add-in supports quantile regression in the case where the quantile is the median (i.e. LAD regression), but not the general case.
Charles
Thanks.
Good afternoon,
I have the following question:
I am working with continuous variables.
The correlation coefficient is -0.496630749.
This means that the linear relationship between the variables is negative and very weak; however, when x increases, y decreases.
How do I obtain the p-value (probability)?
Thank you very much!!
Diego,
Please look at the following webpage
One sample hypothesis testing for correlation
Charles
Hello, the computed r is 0.201566193, which means that there is a weak correlation between the variables, while the p-value is 6.63411E-10, which tells me that I have to reject the null hypothesis. Is this result really possible?
Mary Ann,
This is certainly possible. If you are using simple linear regression, then the p-value being so low only means that there is a significant difference between the population correlation and zero. It doesn’t mean that the population value of r is high; it just means that it is not likely to be zero.
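Charles's point can be made concrete: the test statistic t = r·√((n − 2)/(1 − r²)) grows with n, so a small r can still be highly significant when the sample is large. The sample sizes below are hypothetical illustrations, since the original n was not given:

```python
import math

def t_stat(r, n):
    # t statistic for testing H0: rho = 0 (df = n - 2)
    return r * math.sqrt((n - 2) / (1 - r * r))
```

For r = 0.201566, t is about 1.1 at n = 30 (not significant) but above 6 at n = 900, consistent with a p-value on the order of 1e-10.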
Charles
Charles,
Thanks for all the interesting information you have available here.
I’m relatively new to regressions and I’m hoping you can give me your thoughts on the following:
I’m trying to roughly estimate/predict what the hourly energy prices ($/MWh), at a certain grid point, will be going forward, out 12 months.
The issue I’m having is that the price I’m trying to estimate, let’s call it “Price A”, is relatively new, with only 6 months of hourly historical prices available. Knowing that this price is highly correlated with a different price (r = 0.98 and r-squared = 95%), let’s call it “Price B”, and that Price B does have historical data going back multiple years, here’s what I’ve done: using the equation y = m*x + a (where y = Price A and x = Price B), with the parameters “m” and “a” calculated in Excel, I estimated what the prices at point A would have been, say, for the last 12 months.
Essentially, I would appreciate it if you could tell me whether or not this is a valid approach, and also what I should be doing next to estimate the prices for the next 12 months.
I hope this isn’t too confusing, please let me know otherwise.
Thank you in advance.
Marcos,
It seems like a reasonable approach. You should be aware of issues such as seasonality and autocorrelation.
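The backcasting step Marcos describes can be sketched as follows; all prices below are made-up illustration values:

```python
# Fit Price A against Price B on the overlapping months, then apply the
# fitted line to older Price B history to estimate past Price A values.
# All numbers are made-up illustration values ($/MWh).
price_b_overlap = [30.0, 35.0, 28.0, 40.0, 33.0, 37.0]   # months with both prices
price_a_overlap = [32.5, 38.0, 30.0, 43.5, 35.5, 40.0]
n = len(price_b_overlap)
xbar = sum(price_b_overlap) / n
ybar = sum(price_a_overlap) / n
m = sum((x - xbar) * (y - ybar)
        for x, y in zip(price_b_overlap, price_a_overlap)) \
    / sum((x - xbar) ** 2 for x in price_b_overlap)
a = ybar - m * xbar

price_b_history = [25.0, 31.0, 29.0]                     # older months, B only
price_a_backcast = [a + m * x for x in price_b_history]  # estimated past Price A
```

The same fitted line can then be applied to forecasts of Price B to project Price A forward, subject to the seasonality and autocorrelation caveats above.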
Charles
Hi Charles
For linear regression, if R-square is .209 (about 20%) and the ANOVA p = .01, what does this imply? Shouldn't the R-square value be greater than 30%?
Sandy,
p = .01 means that the slope is significantly different from zero. I don’t know of any rule that R square needs to be greater than 30%.
Charles
Thank you Charles
Hi Charles,
Thanks so much for helping on here, first off. My question is that I am attempting to compare the “OB-GYNs per capita” for each state to the “percent of total deliveries that are via cesarean section” to see if there is a correlation between the two. So, basically, fewer OB-GYNs might be associated with higher rates of cesarean section, for instance. Would this be a simple linear regression? Would it be proper to enter data for all 50 states to arrive at a conclusion? Thanks again
Hi Will,
Yes, this could be a way to test this hypothesis. You don’t know a priori whether the relationship is linear (or quadratic or something else), but it makes sense to start with simple linear regression.
Charles
Thanks. After obtaining the result, is there some further step to take based on the result? Statistics is not my strong point.
Will,
The next step depends on why you decided to use regression in the first place. E.g., you can use the regression for forecasting.
Charles
I’m merely trying to determine whether the two variables are correlated in some way at statistical significance p = .05. Does this make sense or no?
Will,
To test whether there is a significant correlation between two variables, please look at the following webpage
Correlation Testing
There are a number of different tests that can be used, as explained on the website.
Charles
Hello Charles.
If a model has an R-square of 55%, is it doing a good job?
Some people say it is good; others say it is weak.
I would appreciate it if you helped me.
Alexander,
There is no set answer to this question. It really depends on a number of factors. I certainly wouldn’t say that 55% necessarily means that the model is a bad fit. See the following webpage for more details
http://people.duke.edu/~rnau/rsquared.htm
Charles
Hello!
I need to run regressions using this software for my BUS252 class. However, when I click “Linear Regression” and it prompts me to highlight the X and Y values, I’m unable to do so. I’m working from a MacBook but can’t seem to get it to work. Any help would be much appreciated!
Hello Connor,
For the Mac version of the software, you can’t highlight the range, but need to enter the ranges manually. E.g. for range D5:E24, you need to type in the value D5:E24 in the Input Range field. Good luck with your course.
Charles
hi Charles,
I have a result where one of the independent variables shows the highest standardized beta, but its p-value is not significant.
So is it accepted or rejected as an explanation of which variable is most influential?
How do I interpret this result?
Please help me.
Thank you!
Hi Hani,
Standardizing the variables doesn’t necessarily lead to identifying the most important/influential variable in linear regression. For example, see the webpage
http://www.jerrydallal.com/lhsp/importnt.htm.
Depending on what you mean by most influential, you might want to compare partial correlation coefficients or compare the R-square values when each variable is dropped from the model.
Charles
You have created a wonderful resource in this website and the Excel add-in. I’m wondering if you’re planning to add standardized coefficients to the linear regression routine. Is it already there and I missed it?
Keep up the great work!
Michael,
I have not added standardized coefficients yet, but they will eventually be added.
Charles
When running a regression in Excel what do you use to determine if the null hypothesis is accepted or rejected? What parts of the regression output are used in the regression equation?
Gustavo,
Q. When running a regression in Excel what do you use to determine if the null hypothesis is accepted or rejected?
A. Significance F
Q. What parts of the regression output are used in the regression equation?
A. The coefficients
Charles
Charles,
Can you help me understand the difference between errors and residuals?
Thanks
Ryan,
Strictly speaking, errors are deviations from the (unknown) population regression line, while residuals are deviations from the fitted line; in this context, though, the two terms are used interchangeably.
Charles
I’ve loaded Real Stats as an add-in, but I can’t see it in the Data Analysis icon area. How do I use it, and where do I access it in Excel?
Dan
Thank you so much for this site. It is very helpful. I have to do a project for my bachelor’s degree, but I don’t know how to find data for the regression analysis. Can you tell me what variables to use for the regression analysis? My theme is agriculture and the environment. Thank you
Regards
Kalina
Kalina,
You haven’t provided enough information for me to tell you what variables to use, but think about what hypothesis in the area of agriculture and the environment you want to test. From this will naturally follow which variables you need to study.
Charles
I’m so glad I found this site… thanks so much, Professor. You explain all the concepts so well!!!
Dear Charles:
Thank you for the explanation. This information will be very useful for my future calculations.
Regards.
William Agurto.
Dear Charles:
I found a little bug in the residuals and Cook’s D sections (when those options are selected in linear regression analysis):
1. There’s a section to the right of the Cook’s D table that calculates dfE (degrees of freedom of the errors) by subtracting (k+1) instead of subtracting k.
2. Because of that, there’s a difference between the Excel Data Analysis add-in results and the Real Statistics 2.16.1 results in the calculation of standard residuals (SResidual in the Cook’s D table).
3. Also, for that reason, there are some mistakes in the calculation of the ModMSE, RStudent, T-Test, Cook’s D, and DFFITS columns of the Cook’s D table.
I hope these comments will be helpful.
Regards.
William Agurto.
Dear Charles:
The bug in the SResidual calculation is still unfixed in Real Statistics 2.17.
I present an example to make the explanation simpler:
1. Data for analysis:
X Y
9 0.98
14 0.96
25 0.94
30 0.92
59 0.9
122 0.88
128 0.86
138 0.84
144 0.82
2. Results obtained in Excel 2010 (using Data Analysis Add-in) for RESIDUAL OUTPUT:
Observation Predicted Y Residuals Standard Residuals
1 0.959434128 0.020565872 1.305567744
2 0.954885598 0.005114402 0.324673749
3 0.944878831 -0.004878831 -0.309719156
4 0.940330301 -0.020330301 -1.290613152
5 0.913948826 -0.013948826 -0.885502786
6 0.856637346 0.023362654 1.483113757
7 0.851179109 0.008820891 0.559969944
8 0.842082049 -0.002082049 -0.132173143
9 0.836623813 -0.016623813 -1.055316957
3. Results obtained in Real Statistics 2.17 for Cook’s D Table (only columns 1, and 4 to 7 are shown):
Obs Pred Y Residual Leverage SResidual
1 0.959434128 0.020565872 0.271566883 1.221246799
2 0.954885598 0.005114402 0.247947105 0.303704483
3 0.944878831 -0.004878831 0.202599637 -0.289715742
4 0.940330301 -0.020330301 0.184994445 -1.207258058
5 0.913948826 -0.013948826 0.11994921 -0.82831201
6 0.856637346 0.023362654 0.1965224 1.387325886
7 0.851179109 0.008820891 0.219377825 0.523803919
8 0.842082049 -0.002082049 0.263484784 -0.123636654
9 0.836623813 -0.016623813 0.293557711 -0.987158622
4. As you can see, the standard residuals obtained by the Data Analysis add-in differ from those obtained in Real Statistics 2.17. As I commented in my prior message, this is because Real Statistics 2.17 calculates dfE (degrees of freedom of the errors) by subtracting (k+1) instead of subtracting k (in the example, 7 instead of 8). When I manually correct this difference (entering 8 instead of 7), all the standard residuals are OK.
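The Excel Data Analysis figures above can be reproduced with a short script; note that the divisor for Excel's Standard Residuals uses n − 1 (here 8), the value William refers to:

```python
# Reproduce Excel Data Analysis "Standard Residuals": e_i / sqrt(SSE / (n - 1))
# using the X, Y data from the example above.
x = [9, 14, 25, 30, 59, 122, 128, 138, 144]
y = [0.98, 0.96, 0.94, 0.92, 0.90, 0.88, 0.86, 0.84, 0.82]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e * e for e in resid)
divisor = (sse / (n - 1)) ** 0.5          # note n - 1 (= 8), not n - 2 (= 7)
std_resid = [e / divisor for e in resid]
```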
I hope this comment will be helpful.
Regards.
William Agurto.