Objective
On this webpage, we explore the concepts of a confidence interval and prediction interval associated with simple linear regression, i.e. a linear regression with one independent variable x (and dependent variable y), based on sample data of the form (x1, y1), …, (xn, yn). We also show how to calculate these intervals in Excel. In Confidence and Prediction Intervals we extend these concepts to multiple linear regression, where there may be more than one independent variable.
Confidence Interval
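The 95% confidence interval for the forecasted value ŷ at a given x is
$$\hat{y} \pm t_{crit} \cdot s.e. \qquad \text{where} \qquad s.e. = s_{y \cdot x} \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{SS_x}}$$
Here, s_y⋅x is the standard error of the estimate, as defined in Definition 3 of Regression Analysis, SS_x is the sum of the squared deviations of the x-values in the sample from their mean x̄ (see Measures of Variability), n is the sample size, and t_crit is the critical value of the t distribution with n − 2 degrees of freedom for the specified significance level α divided by 2 (i.e. α/2 in each tail). How to calculate these values is described in Example 1, below.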
The 95% confidence interval is commonly interpreted to mean that there is a 95% probability that the true linear regression line of the population lies within the confidence interval of the regression line calculated from the sample data. This is not quite accurate, as explained in Confidence Interval, but it will do for now.
Figure 1 – Confidence interval
In the graph on the left of Figure 1, a linear regression line is calculated to fit the sample data points. The confidence interval consists of the space between the two curves (dotted lines). Thus there is a 95% probability that the true best-fit line for the population lies within the confidence interval (e.g. any of the lines in the figure on the right above).
Prediction Interval
There is also a concept called a prediction interval. Here we look at any specific value of x, x0, and find an interval around the predicted value ŷ0 for x0 such that there is a 95% probability that the real value of y (in the population) corresponding to x0 is within this interval (see the graph on the right side of Figure 1). Again, this is not quite accurate, but it will do for now.
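The 95% prediction interval of the forecasted value ŷ0 for x0 is
$$\hat{y}_0 \pm t_{crit} \cdot s.e.$$
where the standard error of the prediction is
$$s.e. = s_{y \cdot x} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_x}}$$
Here SS_x is the sum of the squared deviations of the sample x-values from their mean x̄.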
For any specific value x0 the prediction interval is more meaningful than the confidence interval.
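As a rough illustration of the two standard error formulas, the following Python sketch computes both intervals from raw (x, y) data. The function name `regression_intervals`, the sample data and the hard-coded critical value are assumptions made for this sketch; they are not part of Excel or of the Real Statistics software. The critical value t_crit is passed in by the caller (e.g. taken from a t table or from Excel's TINV).

```python
import math

def regression_intervals(xs, ys, x0, t_crit):
    """Return (y0_hat, confidence_interval, prediction_interval) at x0.

    t_crit is the two-tailed critical value of the t distribution with
    n - 2 degrees of freedom; it is passed in rather than computed here.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / ss_x
    intercept = y_bar - slope * x_bar
    y0_hat = intercept + slope * x0

    # Standard error of the estimate, s_yx.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(ss_res / (n - 2))

    # The term shared by both standard errors.
    h = 1 / n + (x0 - x_bar) ** 2 / ss_x
    se_conf = s_yx * math.sqrt(h)
    se_pred = s_yx * math.sqrt(1 + h)
    ci = (y0_hat - t_crit * se_conf, y0_hat + t_crit * se_conf)
    pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
    return y0_hat, ci, pi

# Made-up data, for illustration only (n = 5, so df = 3).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT_DF3 = 3.182  # two-tailed 5% critical value for 3 df, from a t table

y0_hat, ci, pi = regression_intervals(xs, ys, x0=3, t_crit=T_CRIT_DF3)
print(f"forecast = {y0_hat:.3f}")
print(f"confidence interval = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"prediction interval = ({pi[0]:.3f}, {pi[1]:.3f})")
```

The prediction interval always comes out wider than the confidence interval at the same x0, since its standard error contains the extra 1 under the square root.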
Example
Example 1: Find the 95% confidence and prediction intervals for the forecasted life expectancy for men who smoke 20 cigarettes in Example 1 of Method of Least Squares.
Figure 2 – Confidence and prediction intervals
Referring to Figure 2, we see that the forecasted value for 20 cigarettes is given by FORECAST(20,B4:B18,A4:A18) = 73.16. The confidence interval, calculated using the standard error of 2.06 (found in cell E12), is (68.70, 77.61).
The prediction interval is calculated in a similar way using the prediction standard error of 8.24 (found in cell J12). Thus life expectancy of men who smoke 20 cigarettes is in the interval (55.36, 90.95) with 95% probability.
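The arithmetic behind these two intervals can be checked with a few lines of Python. This is only a sketch: the forecast and the two standard errors are the rounded values quoted in Example 1, and T_CRIT is an assumed value, namely the two-tailed 5% critical value of the t distribution with n − 2 = 13 degrees of freedom (the sample in Figure 2 has 15 rows).

```python
# Rounded values reported in Example 1.
Y_HAT = 73.16
SE_CONF = 2.06
SE_PRED = 8.24
T_CRIT = 2.1604  # assumption: two-tailed 5% critical value of t, df = 13

def interval(center, se, t_crit):
    """Return (lower, upper) = center -/+ t_crit * se."""
    half_width = t_crit * se
    return center - half_width, center + half_width

ci = interval(Y_HAT, SE_CONF, T_CRIT)
pi = interval(Y_HAT, SE_PRED, T_CRIT)
print(f"confidence interval: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"prediction interval: ({pi[0]:.2f}, {pi[1]:.2f})")
```

Because the inputs are rounded to two decimals, the endpoints may differ from those in Figure 2 by about 0.01.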
Graphical representation
You can create charts of the confidence interval or prediction interval for a regression model. This is demonstrated at Charts of Regression Intervals. You can also use the Real Statistics Confidence and Prediction Interval Plots data analysis tool to do this, as described on that webpage.
Testing the y-intercept
Example 2: Test whether the y-intercept is 0.
We use the same approach as that used in Example 1 to find the confidence interval of ŷ when x = 0 (this is the y-intercept). The result is given in column M of Figure 2. Here the standard error is
$$s.e. = s_{y \cdot x} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{SS_x}}$$
and so the confidence interval is
$$\hat{y}(0) \pm t_{crit} \cdot s.e.$$
Since 0 is not in this interval, the null hypothesis that the y-intercept is zero is rejected.
Reference
Howell, D. C. (2009) Statistical methods for psychology, 7th ed. Cengage.
https://labs.la.utexas.edu/gilden/files/2016/05/Statistics-Text.pdf
Hello!
I’m using a simple linear regression to predict the content of certain amino acids (aa) in a solution that I could not determine experimentally from the aas I could determine. I want to place all the results in a table, both the predicted and experimentally determined, with their corresponding uncertainties. By replicating the experiments, the standard deviations of the experimental results were determined, but I’m not sure how to calculate the uncertainty of the predicted values. I could calculate the 95% prediction interval, but I feel like it would be strange since the interval of the experimentally determined values is calculated differently. In the end I want to sum up the concentrations of the aas to determine the total amount, and I also want to know the uncertainty of this value. How do you recommend that I calculate the uncertainty of the predicted values in this case?
The prediction intervals, as described on this webpage, are one way to describe the uncertainty.
Charles
Hi Charles, thanks for getting back to me again. I’m trying to establish the confidence level in an upper bound prediction (at p=97.5%, single sided) . I used Monte Carlo analysis with 5000 runs to draw sample sizes of 15 from N(0,1). In order to be 90% confident that a bound drawn to any single sample of 15 exceeds the 97.5% upper bound of the underlying Normal population (at x =1.96), I find I need to apply a statistic of 2.72 to the prediction error. As I’m doing this generically, the 97.5/90 interval/confidence level would be the mean +2.72 times std dev, i.e. x =2.72. As far as I can see, an upper bound prediction at the 97.5% level (single sided) for the t-distribution would require a statistic of 2.15 (for 14 degrees of freedom) to be applied. A prediction upper bound (such as at 97.5%) made using the t-distribution does not seem to have a confidence level associated with it. So my concern is that a prediction based on the t-distribution may not be as conservative as one may think….
Ian,
Since the sample size is 15, the t-statistic is more suitable than the z-statistic.
I don’t understand why you think that the t-distribution does not seem to have a confidence interval.
Charles
Hello Charles,
Your material has been very helpful in enhancing my understanding of statistics, for which I’m very grateful.
Does the following interpretation of CI & PI with the context of sampling distribution seem appropriate to you?
Sampling Distribution of the Mean of Y
If all possible samples, m, of sample size n were taken for a specific value of X (i.e. X = h), and mean (μy) calculated for the Ys observed against those Xs, then :
1) The distribution of such means will be normally distributed.
2) 95% of the confidence intervals (μy ± 1.96σ) calculated from such estimated means will contain the true population mean of Y.
3) Mean of all the m means will be equal to the overall population mean of Y.
But since it’s not practical to take all possible samples of size n, we take 1 sample of size n, and calculate the mean of y.
Confidence Interval for Y
We expect that, μy obtained from this one sample is one of those 95% that will yield a confidence interval that will contain the true population mean of Y.
Prediction Interval for Y
The 95% prediction interval (that has a much wider reach than the confidence interval) calculated using the μy obtained from this one sample will contain 95% of all the possible population values of Y.
Have I got the interpretation for CI & PI right ?
Rahul,
1. A 95% confidence interval for any parameter would mean that if the same experiment is repeated a large number of times, then 95% of the confidence intervals calculated from these experiments would contain the true (population) value of the parameter. The true value of the parameter is not guaranteed to be in the confidence interval of the one sample that is drawn. Thus, the phrase “…this one sample is one of those 95%…” in your definition of the Confidence Interval for Y is not correct.
2. For the confidence interval of a regression model the parameter is the mean μy.
3. For the prediction interval of a regression model the parameter is any one of the predicted y values.
Charles
Thank you for the explanation.
Follow-on question: is there a way to calculate the confidence interval bounds if I were to force my trendline to go through the origin? Currently, the trendline has a y-intercept value.
David,
See https://stats.libretexts.org/Bookshelves/Computing_and_Modeling/Supplemental_Modules_(Computing_and_Modeling)/Regression_Analysis/Simple_linear_regression/Regression_through_the_origin
Charles
Dear Charles,
Thanks for the nice explanations on this website. If the prediction bands and CI bands of a couple of regression lines with unequal slopes (interaction effect) are known, is it then possible to calculate the SE of the difference between two regression lines at the mean x?
The data I want to analyse usually gives divergent regression lines and I want to do post hoc tests with the interpolated values at the mean x. I can do this analysis with Minitab, but I do not understand how they calculate the SE of difference. Do you know where I can find a method to do this?
Thank you very much,
Bart
Hello Bart,
Sorry, but I am not familiar with the Minitab analysis, and so I am not able to answer your question. In fact, when you reference SE I am not sure whether you are referring to the standard error of the slope or something else.
Charles
Hi Charles,
Really appreciate your quality lecture note.
May I ask the derivation of s.e.(standard error of the prediction) formula?
Thank you!
See schwert.ssb.rochester.edu › a425_pred
Charles
Dear Charles,
You discussed here how to predict the (Y) value from (X) and how to estimate its confidence interval. In my studies, I use the regression equation to predict the (X) value from (Y), as in the following example:
Conc (X) Response % (Y)
10 20
15 27
20 35
25 46
30 55
35 70
40 85
Regression line equation: Response% = Slope * Conc. + Intercept
= (2.15) * Conc. + (-5.46)
Then the concentration that causes a 50% response will be:
Concentration = (50 + 5.46) / 2.15 = 25.79
My question is how to calculate the Confidence interval of this predicted concentration to get the lower and upper predicted values (X ± ???).
Please help me and give your answer in steps with an example.
Sincerely,
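For readers with the same question, here is one possible sketch in Python. It refits the line to the data listed in this comment and then applies a common first-order approximation for inverse prediction (classical calibration). That approximation is not a method described on this webpage, and the critical value T_CRIT_DF5 is an assumed table value for df = n − 2 = 5, so treat the interval as indicative only.

```python
import math

conc = [10, 15, 20, 25, 30, 35, 40]        # x values from the comment
response = [20, 27, 35, 46, 55, 70, 85]    # y values from the comment
Y_TARGET = 50
T_CRIT_DF5 = 2.5706  # assumption: two-tailed 5% critical value of t, df = 5

n = len(conc)
x_bar = sum(conc) / n
y_bar = sum(response) / n
ss_x = sum((x - x_bar) ** 2 for x in conc)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(conc, response)) / ss_x
intercept = y_bar - slope * x_bar

# Point estimate of the concentration giving a 50% response.
x_hat = (Y_TARGET - intercept) / slope

# Standard error of the estimate (residual standard deviation).
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(conc, response))
s_yx = math.sqrt(ss_res / (n - 2))

# Approximate standard error of x_hat for a single new response reading.
se_x = (s_yx / abs(slope)) * math.sqrt(
    1 + 1 / n + (Y_TARGET - y_bar) ** 2 / (slope ** 2 * ss_x)
)
lower = x_hat - T_CRIT_DF5 * se_x
upper = x_hat + T_CRIT_DF5 * se_x
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, x_hat = {x_hat:.2f}")
print(f"approximate 95% interval for x: ({lower:.2f}, {upper:.2f})")
```

The fitted slope and intercept agree with the 2.15 and -5.46 quoted in the comment. The approximation works best when the slope is clearly different from zero.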
Hello,
Thanks for the article. I have a question, which might not be so related to this article, but still about confidence and prediction intervals.
I’m using lmfit for python. It provides function for confidence interval, but not prediction interval. Its documentation cites the code it is based on, and it has prediction interval calculation.
https://www.astro.rug.nl/software/kapteyn/kmpfittutorial.html#confidence-and-prediction-intervals
It uses a different formula for the standard error. I don’t know whether it gives the same value as the formula here, or how the two are related; it is too complicated for me. Still, I want to have a prediction interval, so I want to modify the function in the lmfit module to support it.
There, the difference between confidence and prediction interval is that err*err is added to the standard error for the prediction intervals. However, in the code, that err is the noise added to smooth model to simulate real data. I don’t think I can get the real noise from real data. I can only get residuals, but the calculation of residuals (and derivation) is also weird there (they’re divided by err), which adds more confusion for me.
Below the definition of err in the code, there’s a commented out line of err that makes err a straight 1 for the length of the dataset. If err is 1, then err*err is also 1, so it will just add 1 to the standard errors before they are sqrt-ed. I notice that here in your article, the difference of standard error formula for confidence interval and prediction interval is that 1 is added there before it is sqrt-ed. This doesn’t seem like coincidence for me. So, my question is, can I just change err*err with 1?
Thank you
Hello Rizqi,
I am not familiar with the lmfit module, and so I can’t comment on it.
As you can see from the referenced Real Statistics webpage, the only difference in the calculation between a prediction interval and a confidence interval is an extra 1 under the square root symbol.
Charles
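One way to read that extra 1: because the whole square root is multiplied by s_y⋅x, the amount added to the squared standard error is s_y⋅x squared (the residual variance), not the number 1 itself. The short Python sketch below illustrates this with the two rounded standard errors from Example 1; the helper name `se_pred_from` is made up for the illustration and says nothing about how lmfit works.

```python
import math

SE_CONF = 2.06   # standard error for the confidence interval (Example 1)
SE_PRED = 8.24   # standard error for the prediction interval (Example 1)

# se_pred^2 = s_yx^2 * (1 + h) and se_conf^2 = s_yx^2 * h, where
# h = 1/n + (x0 - x_bar)^2 / SS_x, so the difference is s_yx^2.
s_yx = math.sqrt(SE_PRED ** 2 - SE_CONF ** 2)

def se_pred_from(se_conf, s_yx):
    """Prediction standard error rebuilt from se_conf and s_yx."""
    return math.sqrt(s_yx ** 2 + se_conf ** 2)

print(f"implied s_yx = {s_yx:.2f}")
print(f"rebuilt se_pred = {se_pred_from(SE_CONF, s_yx):.2f}")
```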
Thank you for the response. Sorry, I knew this was kind of unrelated.
I tried adding 1 there, and the prediction interval is now very wide compared to confidence interval. It encompasses all data points. Also, the width of prediction interval when the confidence interval is or close to 0 is the same or almost the same as the sigma (there I use sigma=3). If I change sigma to 2 it’d be 2.
https://ibb.co/x273fbD
Is that just how it is? Or is this just wrong?
Hi Charles,
I ran a regression model and added confidence and prediction interval. While most of the data points stay within the prediction band, there are some outside the prediction band. I’m having trouble interpreting those points, in general, do those points mean that they are outliers?
Leonardo,
Yes, in some sense these points are potential outliers for the regression model, but if you have a lot of points you can expect that some will lie outside the prediction band.
Charles
Hi Charles,
Thank you so much for the post.
I have a question: can we have a prediction interval for the independent variable if the dependent variable is already given? I mean, if I am using the regression line to predict the x value when the y value is given, can we calculate a prediction interval for the x value also?
Yes, you can do this.
Charles
Why would you be using tinv(.05,13) instead of tinv(.025,13). Does not seem correct.
TINV is the two-tailed inverse.
Charles
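To make the two-tailed convention concrete, here is a small Python sketch that finds the value t with P(|T| > t) = α, which is what TINV(α, df) returns. It uses only the standard library (Simpson's rule plus bisection); the function names and the search range are choices made for this sketch, not Excel's implementation.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_area_0_to(t, df, steps=2000):
    """Area under the t density between 0 and t (composite Simpson's rule)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return total * h / 3

def tinv_two_tailed(alpha, df):
    """Value t with P(|T| > t) = alpha, found by bisection."""
    target = 0.5 - alpha / 2      # area between 0 and the critical value
    lo, hi = 0.0, 50.0            # search range; assumes the answer is below 50
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_area_0_to(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(tinv_two_tailed(0.05, 13), 3))   # 2.5% in each tail
print(round(tinv_two_tailed(0.10, 13), 3))   # 5% in each tail
```

With α = .05 the function already places 2.5% in each tail, which is why α is not halved again before calling TINV for a 95% interval.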
This is very interesting! Could you please suggest reference literature to go deeper in the subject?
Thanks
Hello Maurizio,
There are lots of references. Two such references are Zar’s textbook and Howell’s textbook.
See Bibliography for details.
Charles
Thanks a lot!
very interesting. thanks for putting this on a page. Are the sample excel files available for download somewhere? thanks!
Yes, see Real Statistics Examples Workbooks
Charles
Is this statement true: ”If we estimate a regression slope, its confidence interval represents a plausible range for the slope’s true population value.”?
This is a reasonable way to look at the confidence interval, although technically it is not literally true (at least when you remove the word “plausible”). See https://real-statistics.com/hypothesis-testing/confidence-interval/
Charles
Hi Charles,
Is it the same formula to calculate prediction intervals for non-linear functions too (i.e. Weibull, Gompertz)?
Thanks,
Hello David,
The situation will be different. Here is an article about the prediction interval for the Weibull distribution.
lib.dr.iastate.edu/cgi/viewcontent.cgi?article=1011&context=stat_las_preprints
Charles
Hi Charles,
In a question that asks for a model relating the annual number of Australian wines sold to the annual growth rate of Canada, what would be x and what would be y?
Albert
Albert,
Either is possible. If you want to predict the wines sold from the growth rate, then y = wines sold and x = growth rate.
Charles
Thanks for a very useful post!
Just wondering, how would you go about calculating a random set of best-fit lines with those confidence intervals? I guess I would have to determine a distribution for the set of possible slope and a distribution for the set of possible intercept values?
David
David,
How would you measure what a best fit line is?
Charles
No idea… I was just referring to the plot on the right of the first figure, which shows a selection of regression lines within the confidence intervals, and wondering how one would go about getting a random sample of those regression lines?
Something like bootstrapping resampling with a t-test to get the ones that are within the 95% interval?
David,
Here’s my rough idea for how you might go about doing this. Let’s look at a specific example, namely Example 1 on
Plot of Confidence and Prediction Interval
First look at the mean of the X values, namely x = 19.4. Next calculate the upper and lower bounds for the y value corresponding to this value of x. Suppose for a minute that the mean of the x values is 20 (i.e. it is one of the sample data values for x) instead of 19.4. Then, as we can see from Figure 1, the lower and upper bounds for the corresponding y value are 68.70 and 77.61. Now pick a value between 68.70 and 77.61 at random, e.g. by using the formula =68.70+RAND()*(77.61-68.70). Say this value is 71.22. Next we pick any other sample x data value, say 0. The corresponding y value ranges from 77.28 to 94.16. Once again pick a value at random from this range. Say this value is 88.42. Now this random line connects the points (20, 71.22) and (0, 88.42). We can easily express the equation for this line in the form y = mx + b.
It was important to start with the mean value for x so that the two randomly assigned points would stay within the confidence interval. Thus, the only thing left is to figure out the range of y values corresponding to the mean value of x when this value of x doesn’t correspond to any of the points in the sample. I guess some sort of curve fitting approach could be used.
Charles
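The rough idea above can be sketched in Python as follows. The function names are made up for this illustration, and picking the two y values uniformly is just the simple approach described in the reply, not a statistically rigorous way of sampling regression lines.

```python
import random

def line_through(p1, p2):
    """Return (m, b) for the line y = m*x + b through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)
    b = y1 - m * x1
    return m, b

def random_line(band_a, band_b, rng=random):
    """Pick one y uniformly inside each confidence band and join the points.

    Each band is a tuple (x, lower, upper).
    """
    xa, lo_a, hi_a = band_a
    xb, lo_b, hi_b = band_b
    ya = rng.uniform(lo_a, hi_a)
    yb = rng.uniform(lo_b, hi_b)
    return line_through((xa, ya), (xb, yb))

# Confidence bounds quoted in the reply for x = 20 and x = 0.
band_20 = (20, 68.70, 77.61)
band_0 = (0, 77.28, 94.16)

m, b = line_through((20, 71.22), (0, 88.42))   # the worked example
print(f"y = {m:.2f}x + {b:.2f}")

m, b = random_line(band_20, band_0)
print(f"random line: y = {m:.3f}x + {b:.3f}")
```

Calling `random_line` repeatedly gives a bundle of candidate lines; each one is only guaranteed to lie inside the band at the two chosen x values.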
Hi Charles,
I’m just starting to use your Excel add-in and it works great for graphically presenting the confidence and prediction intervals for a simple linear regression. Thank you! I’m trying to calculate and graphically display tolerance intervals with 95% confidence and 75% coverage of the population. I’m not seeing those options in your confidence and prediction interval tool. Do you offer these options in your add-in?
Troy,
You can change the confidence level for the Confidence and Prediction Interval Plots data analysis tool. Simply change the Alpha field: .05 corresponds to a 95% confidence interval, .01 corresponds to a 99% confidence interval, etc. See the following webpage:
https://real-statistics.com/regression/confidence-and-prediction-intervals/plots-regression-confidence-prediction-intervals/
I don’t know what you mean by the 75% coverage of the population.
Charles
For tolerance intervals there are two variables your confidence level (alpha 95%) and your population coverage or portion of population contained with the tolerance interval (75%). Is there a way in your program to input the confidence level and portion of population contained within a tolerance interval and have it graphically displayed like the confidence and prediction intervals?
Troy,
Tolerance intervals are not yet covered by Real Statistics. I will add this shortly, probably without the graphics though
Charles
Hello,
I am evaluating the concentration of a chemical in blood in relation to its concentration in the food. I derived a linear regression and the respective prediction intervals. Now, I would like to calculate how many samples I have to analyse to predict the right concentration (with e.g. 95% confidence) in the food. How can I do this?
Best regards
Suse
Susanne,
You have created a linear regression model based on one sample with a number of elements. This model can be used to make predictions. If you create a new sample you will get a different (hopefully not very different) regression model. Thus, it is not clear what you mean by “how many samples…”
One way to interpret your question is to assume that there is some commonality between the samples; e.g. you fix the values of the x mean, x standard deviation and the correlation coefficient between the y and x values. As can be seen from the formula for the prediction standard error on this webpage, s.e. depends on the x mean, S_x (related to the x standard deviation) and s_y.x (related to the correlation coefficient and sample size n). The size of the prediction interval is twice s.e times t-crit. Here t-crit depends on df, which is n-2. Thus for fixed values of x mean, x standard deviation and correlation between y and x (and x_0), the size of the prediction interval only depends on the sample size n (using the formulas for s.e. and df). If you fill in the values of these three values in Figure 2 on the webpage (e.g. based on the linear regression you already developed), the only remaining variable is the sample size n. You can then manually repeatedly change the value of n (cell E5) until you get the prediction interval size (i.e. the value of =J15-J14) that you desire. You can also use Excel’s Goal Seek to calculate this for you.
Charles
Dear Charles
Thank you for your useful text. I would really appreciate it if you could advise me on how to plot these confidence intervals for multiple linear regression with more than one variable. I have one more question: could you kindly tell me how I can obtain the equations of these confidence interval lines, for use in prediction models?
Best Regards
Mojgan
Mojgan,
See https://real-statistics.com/multiple-regression/confidence-and-prediction-intervals/
Charles
Hello Dr. Zaiontz,
I have a doubt regarding the prediction interval calculation. Does the calculation depend on the method of forecasting used (e.g. ARIMA, linear regression, ETS), or is it independent of the method used for forecasting? Please give me some references to look at in support of your answer.
Thanks in advance.
Arka,
This webpage specifically addresses forecasts using linear regression (with one independent variable). The techniques will vary depending on which model you use (ETS, ARIMA, etc.)
Charles
Hello Dr. Zaiontz,
I am developing a correlation between sample lab results (x) and field instrument readings (y) in order to establish a cleanup level equivalent for my field instrument. I’ve calculated 95% upper and lower prediction limits for the field instrument cleanup value equivalent and would like to know what I can say precisely about each of these values. Is it accurate to say of the lower prediction limit that there’s a 95% (or 90%?) chance that the true field instrument cleanup value equivalent is greater than the lower limit and that this value assures minimal false negatives (null hypothesis is that contamination exceeds cleanup level)? Similarly, is there a 95% (or 90%?) chance that the true field instrument cleanup value equivalent is less than the upper limit? If the latter is accurate, what would be any potential value of using the upper limit, other than it being the least conservative value (i.e. allowing minimize cleanup costs) without regard for false negatives?
Thanks
Keith,
In general, if you have a 95% confidence interval (a, b) for some parameter, it means that if you reran the experiment 1,000 times and calculated the interval each time, then approximately 950 (i.e. 95%) of these intervals would contain the true value of the parameter.
Charles
Charles Zaiontz,
I need to say that you are such an absolutely amazing person. All your instructions are very easy to follow. I even like statistics now. Thank you very much!
Amanda,
That is very kind of you. I am very pleased that I have been able to help.
Charles
Hello Charles-
I am being asked to use the slope and intercept to create a nonlinear model, after I have identified my most linear graph which is the “Power Regression” model. With my regression summary statistics, how am I creating the formula using slope and the intercept. I am missing something rather simple I believe.
Thanks,
Cristopher
Cristopher,
If you look at the following webpage, you see that power regression is just a transformation of the nonlinear expression into the form of a linear regression. Using the terminology on this webpage, the intercept is alpha-prime and the slope is beta-prime.
https://real-statistics.com/regression/power-regression/
Charles
Hi Charles,
Thanks for maintaining this site with very helpful information.
Is it a good idea to use the prediction intervals as a check for outliers in the measured data when doing a linear regression between y and x?
Best regards
Tomas
Tomas,
You should identify possible outliers. The approach that you are suggesting could be a good way to do this.
Charles
Hi Charles,
I would like to know how to calculate the “Standard Error of intercept” or “Standard deviation of intercept” by excel formula rather than LINEST function?
Thanks for your informative helps
Chester
Chester,
This is shown in Example 2 of the referenced webpage.
Charles
Hi Charles,
What is the equation for the standard error of the prediction when the y-intercept is zero? The physics of our data requires linear regressions where y=0 when x=0. Something similar to http://stackoverflow.com/questions/36459996/confidence-interval-for-independent-variable-of-a-linear-regression-model-in-r, but in excel.
Thanks,
Mark
Mark,
I believe that the type of linear regression that you are referring to is described on the following webpage:
Multiple Regression without Intercept. The value for the matrix B is given on that webpage.
The calculation of the prediction interval is shown on the following webpage:
Confidence and Prediction Intervals for Multiple Regression. You can use the formula on that webpage for the prediction interval substituting the value B given above.
Charles
Sorry if I wasn’t specific enough, but I am using a simple linear regression through the origin (zero-intercept). What is the equation for the confidence interval for the forecasted values ŷ of x? Wouldn’t it be that s.e.=0 when x=0, such that any and all possible true best fits still go through the origin and do not rotate around the data set mean (curved confidence interval)? Thanks again.
Mark,
Clearly at x = 0, s.e. would be zero, but this is not the case for other values of x.
The information I gave you previously will enable you to calculate a prediction interval at other values of x.
Charles
Hi Charles,
I’m fitting a regression model to some lifing data (Strain vs Number of Cycles) and would like to make sure I’m making the right assertions.
When regression isn’t involved, I’ve seen several references to tolerance intervals. (I think this is what Tristan was referring to when he spoke about 98% of the population with 95% reliability.)
I’m still struggling slightly with the differences between prediction and tolerance intervals (explained in https://en.wikipedia.org/wiki/Tolerance_interval), particularly in the context of a regression model. I was hoping you might be able to explain and have some pointers about how to modify the calculations to include it.
Ideally I’d like to say for a certain value of strain (X) for a design we can be A% confident that B% of the population will survive beyond Y cycles.
Many thanks for your help.
David,
It sounds like you are speaking about survival analysis. Please look at the following webpage, especially the part about Cox Regression, and see if this is helpful.
https://real-statistics.com/survival-analysis/
Charles
Good Day
I am using this methodology for my MSc dissertation. How do you suggest that I reference it?
Is this appropriate?
Zaiontz, C. (2016). Confidence and Prediction intervals for forecasted values. Retrieved June 16, 2016, from https://real-statistics.com/regression/confidence-and-prediction-intervals/
That looks good. See also Citation.
Charles
Can you extend your Fig 1 to a pooled within-group regression slope as obtained from ANOVA?
Roger,
Sorry, but what do you consider to be the pooled within-group regression slope as obtained from ANOVA?
Charles
Can you extend this to a pooled within-group slope as obtained from ANCOVA
Roger,
Sorry, but what do you consider to be the pooled within-group regression slope as obtained from ANCOVA?
Charles
Hi Charles
I am not a statistician – please bear this in mind.
When you have two variables (x,y) and several groups, one can use an analysis of covariance to test the difference between groups for y when regressed against x. This does not use the regression line with the data pooled irrespective of group, but the pooled within-group regression line. For example, see Statistical Methods, Snedecor and Cochran, Chapter 14, fig 14:6:1 (6th ed). I want the prediction intervals around this pooled within-group regression line. I hope this is clear.
Roger
Sorry Roger, but I don’t have access to Statistical Methods, Snedecor and Cochran, Chapter 14, fig 14:6:1 (6th ed).
Charles
Thank you so much! This example was extremely helpful in finding prediction intervals in Excel and exactly what I was looking for!
Charles,
I am trying to determine a confidence interval around a natural exponential function (y = ae^bx or ln(y) = ln(a)+bx) using Excel. I have reviewed this website but am unsure about the confidence interval calculation.
http://www.tushar-mehta.com/publish_train/data_analysis/16.htm
Thank you!
Steve,
Since you are treating ln(y) = ln(a) + bx as a linear regression z = bx + c where z = ln(y) and c = ln(a), one approach to creating a confidence interval is to use the confidence interval for z = bx + c, as described on the webpage
https://real-statistics.com/regression/confidence-and-prediction-intervals/
This will give you a confidence interval for z of the form [h, k], which you then need to convert into a confidence interval for y. The simplest approach is to use the confidence interval [e^h, e^k].
Charles
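A minimal Python sketch of this back-transformation is shown below. The function name `exp_fit_interval`, the data and the critical value are all made up for illustration; the interval is computed on the log scale using the confidence interval formula from this webpage and then exponentiated.

```python
import math

def exp_fit_interval(xs, ys, x0, t_crit):
    """Confidence interval for y at x0 under the model ln(y) = ln(a) + b*x."""
    zs = [math.log(y) for y in ys]          # z = ln(y)
    n = len(xs)
    x_bar = sum(xs) / n
    z_bar = sum(zs) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs)) / ss_x
    c = z_bar - b * x_bar                   # c = ln(a)
    ss_res = sum((z - (c + b * x)) ** 2 for x, z in zip(xs, zs))
    s_zx = math.sqrt(ss_res / (n - 2))
    se = s_zx * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_x)
    z0 = c + b * x0
    h, k = z0 - t_crit * se, z0 + t_crit * se   # interval on the log scale
    return math.exp(z0), (math.exp(h), math.exp(k))

# Made-up data, roughly y = 2 * e^(0.5 x); n = 5, so df = 3.
xs = [0, 1, 2, 3, 4]
ys = [2.1, 3.2, 5.6, 8.8, 15.1]
T_CRIT_DF3 = 3.182  # two-tailed 5% critical value for 3 df, from a t table

y0, (low, high) = exp_fit_interval(xs, ys, x0=2, t_crit=T_CRIT_DF3)
print(f"fitted y at x=2: {y0:.2f}, interval: ({low:.2f}, {high:.2f})")
```

Note that the resulting interval is not symmetric around the fitted y value, since exponentiation stretches the upper half more than the lower half.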
Everything seems to follow in my software (i wanted to transfer confidence intervals into a bespoke dashboard from my software) for a single X variable
If i have two x variables how do i calculate the SE of the overall confidence interval?
I.e what does the Equation in cell G12 look like?
Alex,
For multiple regression you use a slightly different approach. See the following webpage for details:
https://real-statistics.com/multiple-regression/confidence-and-prediction-intervals/
Charles
Thank you, I’ve been after this for ages! My statistics software doesn’t make it clear what is going on in its calculations for the confidence intervals, but seeing this, I can reverse engineer it AND understand it.
I wasn’t even aware Excel had these capabilities; I had twigged at MMULT.
Very pleased to find a site that isnt filled with mathematics requiring a mathematics degree to comprehend!
Hello charles,
Thank you for the explanation. I have a question though:
what if I have x and I would like to predict y instead but with a confidence interval? is it the same principle and the same formula?
Hello Sma,
In this case you use the prediction interval.
Charles
Hello Charles,
Could you tell me what the first part in cell G12 is (it is kind of blurry on my monitor), for the calculation of s.e. 2.06?
Also, is there a shortcut (function) in Excel for the s.e. calculation?
Thank you,
John,
Cell G12 contains the following formula: s_Res * SQRT(1/n + (x_0 – k)^2/SS_x)
There is no built-in Excel function for calculating this. The Real Statistics function REGPRED will automatically calculate the s.e. in the prediction interval case, but not the confidence interval case. I will add a similar function for the confidence interval shortly. See the following webpage for more details:
Prediction and Confidence Intervals
Charles
Thank you, Dr. Zaiontz. Pardon my ignorance, but could you tell me what s_Res is and how I obtain it?
John
Looks like it’s the standard error of y and x, the 7.97 in cell E10. Is this correct? The underscore in sRes threw me off a little bit.
Hi Charles,
If I understand you correctly, you describe the confidence interval as the range of possible values for model parameters, e.g. the range of values slope and intercept may be for a linear regression for a given set of data and specified confidence interval. You describe prediction interval as the interval around a predicted Y for a specific X0.
However, in your example you calculate confidence interval for a specific X0 and do the same for a prediction interval. I am confused as the example does not appear to match the discussion.
Obviously I don’t understand you correctly!
On a related matter, when one does a linear regression with Excel, Excel reports the Lower and Upper confidence intervals for “intercept” and “X Variable”, i.e. values for the slope and intercept. How does this Excel output relate to your discussion above?
Thank you
Gary,
I will respond to your first question shortly.
Regarding your second question: the confidence intervals for the intercept and x variable are really for the intercept and x coefficient (not for the prediction or confidence interval of data elements).
Charles
Gary,
Here is my response to your first question:
The confidence interval focuses on the population mean. If you create many random samples that are normally distributed and for each sample you calculate a confidence interval for the mean, then about 95% of those intervals will contain the true value of the population mean.
The prediction interval focuses on the true y value for any set of x values. If you create many random samples that are normally distributed and for each sample you calculate a prediction interval for the y value corresponding to some set of x values, then about 95% of those intervals will contain the true y value.
Charles
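The repeated-sampling interpretation above can be checked with a small simulation. The following is a minimal sketch, assuming a hypothetical true line y = 2 + 0.5x with normal errors of standard deviation 1, n = 15 fixed x values and x0 = 8. The critical value is hard-coded as the two-tailed 5% value for df = 13, i.e. what T.INV.2T(0.05, 13) is expected to return.

```python
import math
import random

T_CRIT = 2.160369  # two-tailed 5% critical value of t for df = 13

def fit(xs, ys):
    """Least-squares fit; returns intercept, slope, s_res, x_bar, ss_x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, math.sqrt(ss_res / (n - 2)), x_bar, ss_x

def coverage(trials=4000, seed=0):
    """Fraction of samples whose CI covers the true mean and whose PI covers a new y."""
    rng = random.Random(seed)
    xs = list(range(1, 16))                   # n = 15, so df = n - 2 = 13
    a, b, sigma, x0 = 2.0, 0.5, 1.0, 8.0      # hypothetical population values
    true_mean = a + b * x0
    ci_hits = pi_hits = 0
    for _ in range(trials):
        ys = [a + b * x + rng.gauss(0, sigma) for x in xs]
        intercept, slope, s_res, x_bar, ss_x = fit(xs, ys)
        y_hat = intercept + slope * x0
        core = 1 / len(xs) + (x0 - x_bar) ** 2 / ss_x
        se_ci = s_res * math.sqrt(core)
        se_pi = s_res * math.sqrt(1 + core)
        y_new = true_mean + rng.gauss(0, sigma)   # a fresh observation at x0
        ci_hits += abs(y_hat - true_mean) <= T_CRIT * se_ci
        pi_hits += abs(y_hat - y_new) <= T_CRIT * se_pi
    return ci_hits / trials, pi_hits / trials

ci_cov, pi_cov = coverage()
print(ci_cov, pi_cov)
```

Both fractions should come out close to 0.95: the confidence interval is judged against the true mean of y at x0, while the prediction interval is judged against a single new observation at x0.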
Hi, Charles. Can you make a video on plotting a 95% confidence interval? The above instructions were really confusing. I would be grateful.
Ang,
Thanks for your input. I will try to improve the explanation, but first I need to finish up the work I am doing on time series analysis.
Charles
Hi Charles,
I would like to plot the 95% Confidence Interval curves just like they are shown within Figure 1 (e.g. the dotted lines). How would you recommend doing this in Excel? I was informed that other programs may provide this feature, but I prefer to continue working in Excel if at all possible.
Thank you in advance,
Andy
Yes, you can do this in Excel. E.g. to create the dotted line for the upper confidence interval curve, fill in a range of x and y values (say A1:B100) as follows: (1) In column A insert x values of the appropriate scale. If for example you are looking at values between 0 and 10, insert 0 in cell A1 and the formula =A1+.1 in cell A2, and then highlight the range A2:A100 and press Ctrl-D. (2) In column B insert the y values for the upper confidence interval. E.g. in cell B1 insert the Excel formula for the upper confidence interval value corresponding to the x value in cell A1 (this is as described in cell E15 of Figure 2 of the referenced webpage), then highlight range B1:B100 and press Ctrl-D. (Make sure that you use absolute addressing for all the parts of the formula in B1 that don’t depend on the x value in cell A1.) (3) Finally, highlight range A1:B100 and select Insert > Charts|Scatter.
Charles
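The same grid-of-points idea can be sketched outside Excel. The example below is an illustration only: the function name `ci_band` and the data are hypothetical, and the critical value is hard-coded for df = 4 (what T.INV.2T(0.05, 4) is expected to return for this six-point sample). It produces rows of (x, lower, upper) that could be pasted into a worksheet and charted as a scatter plot.

```python
import math

def ci_band(xs, ys, grid, t_crit):
    """Return (x0, lower, upper) confidence limits for each x0 in grid."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    s_res = math.sqrt(ss_res / (n - 2))
    rows = []
    for x0 in grid:
        y_hat = intercept + slope * x0
        half = t_crit * s_res * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_x)
        rows.append((x0, y_hat - half, y_hat + half))
    return rows

# Hypothetical sample data
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
grid = [i * 0.1 for i in range(100)]            # 0, 0.1, ..., 9.9, like A1:A100
rows = ci_band(xs, ys, grid, t_crit=2.776445)   # two-tailed 5% value, df = 4
for x0, lower, upper in rows[:3]:
    print(f"{x0:.1f}\t{lower:.3f}\t{upper:.3f}")
```

The band is narrowest at the mean of the x values and widens on either side, which is what gives the dotted lines in Figure 1 their curved shape.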
Dear Charles
I just want to thank you for the wealth of information you put out here freely. It takes a lot of effort and dedication, and you have made the ins and outs of statistics a lot easier. I hope to help others in the same way.
Kind regards
PG
Thank you very much.
Charles
Hi Charles,
I wanted to point out a common misunderstanding about confidence intervals: CIs say nothing about the probability that the true value of the population parameter lies within them – it either does or it doesn’t. A 95% CI just tells you that, if you were to repeat your experiment (sampling) an infinite number of times and run the statistics on each sample, the true parameter will lie within 95% of those confidence intervals. What you described as a CI in the first section of your post is actually a Bayesian credible interval, which is a bit more complicated to calculate, but it does tell you the probability that your population parameter lies within the interval.
Cheers,
Phil
Phil,
Thanks for correcting my somewhat appealing, but, in the end, inaccurate statement. Shortly I will update the website with a more accurate characterization of the confidence interval.
I appreciate your help in making the site more accurate.
Charles
Hello! Thank you very much for this incredibly helpful guide! Do you have a tutorial on how to find the C.I. and P.I. using multiple regression?
Mari,
Not yet. I plan to include this shortly. I am working on adding information about Survival Analysis now. After that the next thing I will do will include CI/PI for multiple regression.
Charles
Hi Charles. Was wondering if you’ve added the section covering CIs/PIs for Multi Linear Regression yet.
Really love the site and it has helped out tremendously. Keep up the good work!
Hi Thomas,
Thanks for your continued support.
The section covering CIs/PIs for Multi Linear Regression is located on the following webpage:
https://real-statistics.com/multiple-regression/confidence-and-prediction-intervals/
Charles
Hi Charles,
I notice that you use n-2 for degrees of freedom, whilst other publications, for example UKAS M3003, use n-1. Although only a small difference is seen in the calculation of the degrees of freedom, it can greatly affect the interval magnitude. Could you give some advice as to which calculation should be used for the df value? Thank you in advance for your reply.
By the way, your website has been an extremely useful aid, Thanks
Paul,
I am not familiar with UKAS M3003, but I just checked a few other publications and they all show df = n-2 when calculating the confidence interval for regression.
Charles
Charles,
What’s the rationale for the added 1 under the square root of the standard error of the prediction? Theoretically, why 1?
Thank you
Ryan,
The standard error for a prediction interval adds the variability of the points around the predicted mean. Mathematically, this is where the 1 comes from.
I plan to eventually show the proofs of the formulas for the confidence and prediction intervals. You will then be able to see from the proofs more precisely where the 1 comes from.
Charles
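A brief sketch of where the 1 comes from, assuming the usual model in which the observations have independent, normally distributed errors with common variance \sigma^2 (the symbol \sigma is notation introduced here, not taken from the page). The new observation y_0 at x_0 is independent of the sample used to compute \hat{y}_0, so the variance of the prediction error is the sum of the two variances:

```latex
\begin{align*}
\operatorname{Var}(\hat{y}_0) &= \sigma^2\left(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{SS_x}\right) \\
\operatorname{Var}(y_0-\hat{y}_0) &= \operatorname{Var}(y_0) + \operatorname{Var}(\hat{y}_0)
  = \sigma^2\left(1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{SS_x}\right)
\end{align*}
```

Replacing \sigma with its estimate s_{y \cdot x} and taking square roots gives the two standard errors; the leading 1 is the contribution of \operatorname{Var}(y_0) = \sigma^2, the scatter of an individual point around the regression line.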
Are the proofs available now? I’m also interested in understanding where the 1 comes from!
Hello Hana,
I will add the proof to the website shortly.
Charles
Hi Charles,
This post has helped me so much already, really very insightful and easy to follow!
I am trying to construct a prediction interval for some metal fatigue test data, but I need 90% confidence in the 98% reliability data. To do this, is it sufficient to use 0.1 as the probability in TINV to change the method to the 90% level, or do I need to make more changes?
Thanks for your help!
Tristan
Tristan,
TINV(.1,df) (i.e. alpha = .1) is part of the formula for a 90% prediction interval. But I don’t understand what the 98% reliability data means.
Charles
Hi all,
It would be good to point out that the function TINV returns the two-sided critical value, whereas the newer function T.INV (with a dot) returns a one-sided (left-tail) value. When using the newer function, you should divide \alpha by 2 and pass the probability 1-\alpha/2; thus, for \alpha = 0.05, use T.INV(0.975,df).
Emiel
Emiel,
Actually, more simply you should use the T.INV.2T function for the two-sided critical value. It is equivalent to TINV.
This is explained on the webpage https://real-statistics.com/excel-capabilities/built-in-statistical-functions/
I will shortly update this information to better explain the various usages of the functions.
Charles
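The relationship between the one-tailed and two-tailed functions discussed above, TINV(α, df) = T.INV.2T(α, df) = T.INV(1 − α/2, df), can be illustrated numerically. The sketch below is not how Excel computes these values; it is a simple stand-in that integrates the t density with Simpson's rule and inverts the result by bisection. The Python names `t_inv` and `t_inv_2t` are hypothetical analogues of the Excel functions.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    return c / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=1000):
    """P(T <= x), using Simpson's rule on [0, |x|] (steps must be even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_inv(p, df):
    """Left-tailed quantile, analogous to T.INV(p, df); found by bisection."""
    lo, hi = -100.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_inv_2t(alpha, df):
    """Two-tailed critical value, analogous to T.INV.2T(alpha, df) and TINV."""
    return t_inv(1 - alpha / 2, df)

print(t_inv_2t(0.05, 13), t_inv(0.975, 13))
```

Both calls print the same critical value, which is why T.INV.2T(0.05, df) and T.INV(0.975, df) can be used interchangeably for a 95% interval.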
Hi Charles,
Your post has been of tremendous help to me.
Although it took me some time I have applied most of the solutions and they have worked just fine.
I need clarification on example 2. Why did you have to eliminate the # 1 before 1/15?
Would really appreciate your prompt response.
Thanks.
The test uses the confidence interval and not the prediction interval. This is why a 1 is not inserted before 1/15. Shortly I will update the webpage to explain better when the prediction interval is used and when the confidence interval is used.
Charles
Hi Charles,
I am using an exponential/hyperbolic function to fit my data. The model is (a*X+b)-SQRT[(a*X+b)^2-((4*a*X*b*c)/(2*c))].
I wonder if I can calculate prediction intervals in the way you show, or if there is any parameter that is different for this type of models. If so, can you tell me how I should calculate it?
Thank you very much
Elisa
Elisa,
I can’t think of a way of doing this. Perhaps someone else has a suggestion.
Charles
Hi Charles,
Is Syx = Sres = STEYX(Y,X)? Is it the same as Syx = SQRT(SUM((yi – Yi)^2)/(degrees of freedom)), where (xi,yi) are the given data and Y is any nonlinear model (not a straight line, say a sigmoidal or logistic curve) which fits the data?
Rizwan,
Sorry for the long delay in responding to you. I just realized that I overlooked responding to you.
It is true that Syx = Sres = STEYX(Y,X) = SQRT(SUM((yi – Yi)^2)/df) for linear models. I am not sure what these terms would even mean for nonlinear models such as logistic regression models.
Charles
Hi Charles,
Thanks for your contributions on this site. I’m a bit confused by your base formula, though. Where you use the sum of squared deviations of x (SSx, calculated as DEVSQ(x) or DEVSQ(A4:A18)), I’ve learned to use the standard deviation of x times (n-1), or STDEV.S(A4:A18)*(n-1) in Excel speak. This would yield a CI SE of 2.090695467 and a PI SE of 8.244184143. The difference is small in your dataset, but where deviations are larger, the use of sums of squared deviations instead of the method I’ve heard of will yield very different results. That said, I’m not at all certain which method is correct. Can you point to some references for your formula, please?
Thanks in advance for taking the time to clarify this issue for me.
David
Hi David,
Note that for any range R1, DEVSQ(R1) is STDEV.S(R1)^2*(n-1), and so the square root of DEVSQ(R1) is STDEV.S(R1)*SQRT(n-1). In fact, in the formula for cell E12 in Figure 2 of the referenced page I do take the square root of SSx, and so it seems that we should get the same answer. Can you send me an example of your calculation so that I can see why the results are not the same?
Charles
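The relationship between the sum of squared deviations and the sample standard deviation can be checked numerically. The following is a small sketch with hypothetical data; `devsq` is a hand-written analogue of Excel's DEVSQ, and `statistics.stdev` is assumed to play the role of STDEV.S (both use the n-1 divisor).

```python
import math
import statistics

def devsq(values):
    """Sum of squared deviations from the mean, like Excel's DEVSQ."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

# Hypothetical x values, not the data from the webpage's example
xs = [12, 15, 19, 23, 30, 34, 41]
n = len(xs)
s = statistics.stdev(xs)          # sample standard deviation

print(devsq(xs))                  # SS_x
print(s ** 2 * (n - 1))           # sample variance times (n-1): matches SS_x
print(s * (n - 1))                # standard deviation times (n-1): does not match
```

The sum of squared deviations equals the sample variance times (n-1), so substituting the standard deviation times (n-1) for SSx changes the result.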
Thanks for your contribution. I would like to know how to calculate CI and PI if there are two independent variables. Thanks.
Zhang,
Sorry but I haven’t had enough time to figure out or find an answer to your question.
Charles
Please help: how did you get the value of SSx, which I suppose to be 271.6?
Anu,
SSx (cell E11) is calculated by the formula =DEVSQ(A4:A18). It has the value 2171.6.
Charles