Confidence and prediction intervals for forecasted values

Objective

On this webpage, we explore the concepts of a confidence interval and prediction interval associated with simple linear regression, i.e. a linear regression with one independent variable x (and dependent variable y), based on sample data of the form (x₁, y₁), …, (x_n, y_n). We also show how to calculate these intervals in Excel. In Confidence and Prediction Intervals we extend these concepts to multiple linear regression, where there may be more than one independent variable.

Confidence Interval

The 95% confidence interval for the forecasted values ŷ of x is

ŷ ± t_crit · s.e.

where

s.e. = s_y⋅x · √(1/n + (x − x̄)² / S_x)

Here, s_y⋅x is the standard error of the estimate, as defined in Definition 3 of Regression Analysis, S_x is the sum of the squared deviations of the x-values in the sample from their mean x̄ (see Measures of Variability), n is the sample size, and t_crit is the two-tailed critical value of the t distribution with n − 2 degrees of freedom for the specified significance level α (i.e. α/2 in each tail). How to calculate these values is described in Example 1, below.

The 95% confidence interval is commonly interpreted to mean that there is a 95% probability that the true linear regression line of the population lies within the confidence interval of the regression line calculated from the sample data. This is not quite accurate, as explained in Confidence Interval, but it will do for now.

Figure 1 – Confidence interval

In the graph on the left of Figure 1, a linear regression line is calculated to fit the sample data points. The confidence interval consists of the space between the two curves (dotted lines). Thus there is a 95% probability that the true best-fit line for the population lies within the confidence interval (e.g. any of the lines in the figure on the right above).

Prediction Interval

There is also a concept called a prediction interval. Here we look at any specific value of x, x₀, and find an interval around the predicted value ŷ₀ for x₀ such that there is a 95% probability that the real value of y (in the population) corresponding to x₀ is within this interval (see the graph on the right side of Figure 1). Again, this is not quite accurate, but it will do for now.

The 95% prediction interval of the forecasted value ŷ₀ for x₀ is

ŷ₀ ± t_crit · s.e.

where the standard error of the prediction is

s.e. = s_y⋅x · √(1 + 1/n + (x₀ − x̄)² / S_x)

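For any specific value x₀, the prediction interval is more meaningful than the confidence interval when the goal is to bound an individual value of y at x₀, since the confidence interval only bounds the mean value of y at x₀.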
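
As a rough illustration of how the two standard errors relate, the following Python sketch computes both intervals from raw data. The function name regression_intervals, the sample data and the value t_crit = 3.1824 (the two-tailed 5% critical value for n − 2 = 3 degrees of freedom) are assumptions made for this illustration and are not taken from the example on this webpage.

```python
import math

def regression_intervals(xs, ys, x0, t_crit):
    """Return (y0_hat, confidence interval, prediction interval) at x0.

    t_crit is supplied by the caller, e.g. from Excel's TINV(alpha, n - 2).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)  # S_x in the text
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
    intercept = y_bar - slope * x_bar
    # s_y.x: standard error of the estimate, with n - 2 degrees of freedom
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(sse / (n - 2))
    y0_hat = intercept + slope * x0
    core = 1 / n + (x0 - x_bar) ** 2 / ss_x
    se_conf = s_yx * math.sqrt(core)      # s.e. for the confidence interval
    se_pred = s_yx * math.sqrt(1 + core)  # s.e. of the prediction (extra 1)
    ci = (y0_hat - t_crit * se_conf, y0_hat + t_crit * se_conf)
    pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
    return y0_hat, ci, pi

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
y0_hat, ci, pi = regression_intervals(xs, ys, x0=3.5, t_crit=3.1824)
print(f"forecast {y0_hat:.3f}, CI {ci}, PI {pi}")
```

Passing t_crit in as an argument keeps the sketch independent of any particular statistics library. The prediction interval it returns is always wider than the confidence interval at the same x₀, because of the extra 1 under the square root.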

Example

Example 1: Find the 95% confidence and prediction intervals for the forecasted life expectancy for men who smoke 20 cigarettes in Example 1 of Method of Least Squares.

Figure 2 – Confidence and prediction intervals

Referring to Figure 2, we see that the forecasted value for 20 cigarettes is given by FORECAST(20,B4:B18,A4:A18) = 73.16. The confidence interval, calculated using the standard error of 2.06 (found in cell E12), is (68.70, 77.61).

The prediction interval is calculated in a similar way using the prediction standard error of 8.24 (found in cell J12). Thus the life expectancy of a man who smokes 20 cigarettes is in the interval (55.36, 90.95) with 95% probability.
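
The arithmetic behind these two intervals can be checked with a few lines of Python. This is only a sketch: it starts from the rounded values quoted above (73.16, 2.06 and 8.24) and assumes t_crit ≈ 2.1604, the two-tailed 5% critical value for 13 degrees of freedom (15 data rows in A4:A18, so n − 2 = 13). Because the inputs are rounded, the last digit can differ slightly from the values shown in Figure 2.

```python
def interval(center, std_err, t_crit):
    """Return (lower, upper) = center -/+ t_crit * std_err."""
    margin = t_crit * std_err
    return center - margin, center + margin

forecast = 73.16  # FORECAST(20,B4:B18,A4:A18), as quoted in the text
t_crit = 2.1604   # assumed value of TINV(0.05, 13)

conf_int = interval(forecast, 2.06, t_crit)  # standard error from cell E12
pred_int = interval(forecast, 8.24, t_crit)  # standard error from cell J12

print("confidence interval: (%.2f, %.2f)" % conf_int)
print("prediction interval: (%.2f, %.2f)" % pred_int)
```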

Graphical representation

You can create charts of the confidence interval or prediction interval for a regression model. This is demonstrated at Charts of Regression Intervals. You can also use the Real Statistics Confidence and Prediction Interval Plots data analysis tool to do this, as described on that webpage.

Testing the y-intercept

Example 2: Test whether the y-intercept is 0.

We use the same approach as that used in Example 1 to find the confidence interval of ŷ when x = 0 (this is the y-intercept). The result is given in column M of Figure 2. Here the standard error is

s.e. = s_y⋅x · √(1/n + x̄² / S_x)

and so the confidence interval is

ŷ ± t_crit · s.e. = (77.28, 94.16)

Since 0 is not in this interval, the null hypothesis that the y-intercept is zero is rejected.

Reference

Howell, D. C. (2009) Statistical methods for psychology, 7th ed. Cengage.
https://labs.la.utexas.edu/gilden/files/2016/05/Statistics-Text.pdf

185 thoughts on “Confidence and prediction intervals for forecasted values”

masterstudent

December 3, 2020 at 4:04 pm

Hello!
I’m using a simple linear regression to predict the content of certain amino acids (aa) in a solution that I could not determine experimentally from the aas I could determine. I want to place all the results in a table, both the predicted and experimentally determined, with their corresponding uncertainties. By replicating the experiments, the standard deviations of the experimental results were determined, but I’m not sure how to calculate the uncertainty of the predicted values. I could calculate the 95% prediction interval, but I feel like it would be strange since the interval of the experimentally determined values is calculated differently. In the end I want to sum up the concentrations of the aas to determine the total amount, and I also want to know the uncertainty of this value. How do you recommend that I calculate the uncertainty of the predicted values in this case?
Reply
- Charles
  
  December 7, 2020 at 11:26 pm
  
  The prediction intervals, as described on this webpage, are one way to describe the uncertainty.
  Charles
  Reply
Ian O'Donnell

November 29, 2020 at 2:00 pm

Hi Charles, thanks for getting back to me again. I’m trying to establish the confidence level in an upper bound prediction (at p=97.5%, single sided) . I used Monte Carlo analysis with 5000 runs to draw sample sizes of 15 from N(0,1). In order to be 90% confident that a bound drawn to any single sample of 15 exceeds the 97.5% upper bound of the underlying Normal population (at x =1.96), I find I need to apply a statistic of 2.72 to the prediction error. As I’m doing this generically, the 97.5/90 interval/confidence level would be the mean +2.72 times std dev, i.e. x =2.72. As far as I can see, an upper bound prediction at the 97.5% level (single sided) for the t-distribution would require a statistic of 2.15 (for 14 degrees of freedom) to be applied. A prediction upper bound (such as at 97.5%) made using the t-distribution does not seem to have a confidence level associated with it. So my concern is that a prediction based on the t-distribution may not be as conservative as one may think….
Reply
- Charles
  
  December 9, 2020 at 8:42 pm
  
  Ian,
  Since the sample size is 15, the t-statistic is more suitable than the z-statistic.
  I don’t understand why you think that the t-distribution does not seem to have a confidence interval.
  Charles
  Reply
Rahul

November 6, 2020 at 2:58 pm

Hello Charles,
Your material has been very helpful in enhancing my understanding of statistics, for which I’m very grateful.

Does the following interpretation of CI & PI with the context of sampling distribution seem appropriate to you?

Sampling Distribution of the Mean of Y
If all possible samples, m, of sample size n were taken for a specific value of X (i.e. X = h), and mean (μy) calculated for the Ys observed against those Xs, then :
1) The distribution of such means will be normally distributed.
2) 95% of the confidence intervals (μy ± 1.96σ) calculated from such estimated means will contain the true population mean of Y.
3) Mean of all the m means will be equal to the overall population mean of Y.

But since it’s not practical to take all possible samples of size n, we take 1 sample of size n, and calculate the mean of y.

Confidence Interval for Y
We expect that, μy obtained from this one sample is one of those 95% that will yield a confidence interval that will contain the true population mean of Y.

Prediction Interval for Y
The 95% prediction interval (that has a much wider reach than the confidence interval) calculated using the μy obtained from this one sample will contain 95% of all the possible population values of Y.

Have I got the interpretation for CI & PI right ?
Reply
- Charles
  
  November 12, 2020 at 5:43 pm
  
  Rahul,
  1. A 95% confidence interval for any parameter would mean that if the same experiment is repeated a large number of times, then 95% of the confidence intervals calculated from these experiments would contain the true (population) value of the parameter. The true value of the parameter is not guaranteed to be in the confidence interval of the one sample that is drawn. Thus, the phrase “…this one sample is one of those 95%…” in your definition of the Confidence Interval for Y is not correct.
  2. For the confidence interval of a regression model the parameter is the mean μy.
  3. For the prediction interval of a regression model the parameter is any one of the predicted y values.
  Charles
  Reply
David Miller

October 21, 2020 at 6:34 pm

Thank you for the explanation.

Follow-on question: is there a way to calculate the confidence interval bounds if I were to force my trendline to go through the origin? Currently, the trendline has a y-intercept value.
Reply
- Charles
  
  October 27, 2020 at 9:26 am
  
  David,
  See https://stats.libretexts.org/Bookshelves/Computing_and_Modeling/Supplemental_Modules_(Computing_and_Modeling)/Regression_Analysis/Simple_linear_regression/Regression_through_the_origin
  Charles
  Reply
Bart

September 15, 2020 at 11:54 am

Dear Charles,

Thanks for the nice explanations on this website. If the prediction bands and CI bands of a couple of regression lines with unequal slopes (interaction effect) are known, is it then possible to calculate the SE of the difference between two regression lines at the mean x?

The data I want to analyse usually gives divergent regression lines and I want to do post hoc tests with the interpolated values at the mean x. I can do this analysis with Minitab, but I do not understand how they calculate the SE of difference. Do you know where I can find a method to do this?

Thank you very much,

Bart
Reply
- Charles
  
  September 16, 2020 at 4:10 pm
  
  Hello Bart,
  Sorry, but I am not familiar with the Minitab analysis, and so I am not able to answer your question. In fact, when you reference SE I am not sure whether you are referring to the standard error of the slope or something else.
  Charles
  Reply
Bin

September 1, 2020 at 3:43 am

Hi Charles,
Really appreciate your quality lecture note.
May I ask the derivation of s.e.(standard error of the prediction) formula?
Thank you!
Reply
- Charles
  
  September 2, 2020 at 9:09 am
  
  See schwert.ssb.rochester.edu › a425_pred
  Charles
  Reply
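
For readers who want the outline of the argument, a brief sketch of the standard derivation follows. It assumes the usual regression model y = α + βx + ε with independent errors of common variance σ², and a new observation y₀ at x₀ that is independent of the sample. The symbols α, β, ε, σ and b (the fitted slope) are introduced only for this sketch and are not used elsewhere on this page.

```latex
\begin{align*}
\hat{y}_0 &= \bar{y} + b\,(x_0 - \bar{x}), \qquad
\operatorname{Var}(\bar{y}) = \frac{\sigma^2}{n}, \quad
\operatorname{Var}(b) = \frac{\sigma^2}{S_x}, \quad
\operatorname{Cov}(\bar{y}, b) = 0 \\
\operatorname{Var}(\hat{y}_0) &= \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_x} \right) \\
\operatorname{Var}(y_0 - \hat{y}_0) &= \sigma^2 + \operatorname{Var}(\hat{y}_0)
  = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_x} \right)
\end{align*}
```

Replacing σ² by its estimate s_y⋅x² and taking the square root gives the standard error of the prediction. Dropping the leading 1 gives the standard error used for the confidence interval.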
Abdel-Rahman Gamal

July 25, 2020 at 9:42 am

Dear Charles,
You discussed here how to predict (Y) value from (X) and how to estimate its confidence interval. In my studies, I use the regression equation to predict (X) value from (Y) as in the following example:
Conc (X) Response % (Y)
10 20
15 27
20 35
25 46
30 55
35 70
40 85
Regression line equation: Response% = Slope * Conc. + Intercept
= (2.15) * Conc. + (-5.46)
Then the concentration that causes a 50% response will be:
Concentration = (50 + 5.46) / 2.15 = 25.79
My question is how to calculate the Confidence interval of this predicted concentration to get the lower and upper predicted values (X ± ???).
Please help me and give your answer in steps with an example.
Sincerely,
Reply
Rizqi

July 22, 2020 at 3:56 pm

Hello,

Thanks for the article. I have a question, which might not be so related to this article, but still about confidence and prediction intervals.

I’m using lmfit for Python. It provides a function for the confidence interval, but not for the prediction interval. Its documentation cites the code it is based on, which does include a prediction interval calculation.

https://www.astro.rug.nl/software/kapteyn/kmpfittutorial.html#confidence-and-prediction-intervals

It uses a different formula for the standard error. I don’t know whether it gives the same value as the formula here, or how the two are related; it is too complicated for me. Still, I want a prediction interval, so I want to modify the function in the lmfit module to support it.

There, the difference between confidence and prediction interval is that err*err is added to the standard error for the prediction intervals. However, in the code, that err is the noise added to smooth model to simulate real data. I don’t think I can get the real noise from real data. I can only get residuals, but the calculation of residuals (and derivation) is also weird there (they’re divided by err), which adds more confusion for me.

Below the definition of err in the code, there’s a commented out line of err that makes err a straight 1 for the length of the dataset. If err is 1, then err*err is also 1, so it will just add 1 to the standard errors before they are sqrt-ed. I notice that here in your article, the difference of standard error formula for confidence interval and prediction interval is that 1 is added there before it is sqrt-ed. This doesn’t seem like coincidence for me. So, my question is, can I just change err*err with 1?

Thank you
Reply
- Charles
  
  July 22, 2020 at 7:55 pm
  
  Hello Rizqi,
  I am not familiar with the lmfit module, and so I can’t comment on it.
  As you can see from the referenced Real Statistics webpage, the only difference in the calculation between a prediction interval and a confidence interval is an extra 1 under the square root symbol.
  Charles
  Reply
  - Rizqi
    
    July 23, 2020 at 4:12 pm
    
    Thank you for the response. Sorry, I knew this was kind of unrelated.
    
    I tried adding 1 there, and the prediction interval is now very wide compared to the confidence interval. It encompasses all data points. Also, the width of the prediction interval where the confidence interval is at or close to 0 is the same or almost the same as the sigma (there I use sigma=3). If I change sigma to 2 it’d be 2.
    
    https://ibb.co/x273fbD
    
    Is that just how it is? Or is this just wrong?
    Reply
Leonardo Hernandez

July 6, 2020 at 7:40 pm

Hi Charles,

I ran a regression model and added confidence and prediction interval. While most of the data points stay within the prediction band, there are some outside the prediction band. I’m having trouble interpreting those points, in general, do those points mean that they are outliers?
Reply
- Charles
  
  July 6, 2020 at 8:31 pm
  
  Leonardo,
  Yes, in some sense these points are potential outliers for the regression model, but if you have a lot of points you can expect that some will lie outside the prediction band.
  Charles
  Reply
Zahoor Malik

June 23, 2020 at 3:56 pm

Hi Charles,
Thank you so much for the post.
I have a question: can we have a prediction interval for the independent variable if the dependent variable is already given? I mean, if I’m using the regression line to predict the x value when the y value is given, can we also calculate a prediction interval for the x value?
Reply
- Charles
  
  June 23, 2020 at 4:49 pm
  
  Yes, you can do this.
  Charles
  Reply
Chazs

May 19, 2020 at 6:46 am

Why would you be using tinv(.05,13) instead of tinv(.025,13)? That does not seem correct.
Reply
- Charles
  
  May 19, 2020 at 2:44 pm
  
  TINV is the two-tailed inverse, so TINV(.05,13) already puts .025 in each tail.
  Charles
  Reply
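
To make the two-tailed convention concrete, here is a small Python sketch that approximates the t critical value numerically using only the standard library. The helper names (t_pdf, t_cdf, t_quantile, tinv_two_tailed) are made up for this illustration, and the numerical integration is a simple approximation, not a description of how Excel computes TINV.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def t_cdf(t, df, steps=2000):
    """P(T <= t), via Simpson's rule on [0, |t|] and symmetry about 0."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

def t_quantile(p, df):
    """Value t with P(T <= t) = p, by bisection; assumes 0.5 < p < 1."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:  # widen the bracket until it contains the answer
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def tinv_two_tailed(alpha, df):
    """Two-tailed critical value: alpha is split equally between the tails."""
    return t_quantile(1 - alpha / 2, df)

# alpha = 0.05 with 13 degrees of freedom corresponds to the 0.975 quantile
print(round(tinv_two_tailed(0.05, 13), 4))
```

Under this convention, tinv_two_tailed(0.05, 13) is the value used for a 95% interval, and it should come out close to 2.16. Calling it with 0.025 would instead leave only 0.0125 in each tail and give a larger critical value.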
maurizio

March 29, 2020 at 5:16 pm

This is very interesting! Could you please suggest reference literature to go deeper in the subject?
Thanks
Reply
- Charles
  
  March 29, 2020 at 6:34 pm
  
  Hello Maurizio,
  There are lots of references. Two such references are Zar’s textbook and Howell’s textbook.
  See Bibliography for details.
  Charles
  Reply
  - maurizio
    
    April 12, 2020 at 5:39 pm
    
    Thanks a lot!
    Reply
Jeroen

January 21, 2020 at 1:11 pm

Very interesting. Thanks for putting this on a page. Are the sample Excel files available for download somewhere? Thanks!
Reply
- Charles
  
  January 21, 2020 at 8:41 pm
  
  Yes, see Real Statistics Examples Workbooks
  Charles
  Reply
shabnam

September 5, 2019 at 8:15 pm

Is this statement true: “If we estimate a regression slope, its confidence interval represents a plausible range for the slope’s true population value.”?
Reply
- Charles
  
  September 6, 2019 at 8:44 am
  
  This is a reasonable way to look at the confidence interval, although technically this is not literally true (at least when you remove the word “plausible”). See https://real-statistics.com/hypothesis-testing/confidence-interval/
  Charles
  Reply
David

July 15, 2019 at 8:25 pm

Hi Charles,
Is it the same formula to calculate prediction intervals for non-linear functions too (i.e. Weibull, Gompertz)?

Thanks,
Reply
- Charles
  
  July 17, 2019 at 2:28 pm
  
  Hello David,
  The situation will be different. Here is an article about the prediction interval for the Weibull distribution.
  lib.dr.iastate.edu/cgi/viewcontent.cgi?article=1011&context=stat_las_preprints
  Charles
  Reply
Albert

November 27, 2018 at 9:37 pm

Hi Charles,

In a question where they would like a model that relates the annual number of Australian wines sold to the annual growth rate of Canada, what would be x and what would be y?

Albert
Reply
- Charles
  
  November 28, 2018 at 8:18 am
  
  Albert,
  Either is possible. If you want to predict the wines sold from the growth rate, then y = wines sold and x = growth rate.
  Charles
  Reply
David

November 4, 2018 at 7:16 pm

Thanks for a very useful post!

Just wondering, how would you go about calculating a random set of best-fit lines with those confidence intervals? I guess I would have to determine a distribution for the set of possible slope and a distribution for the set of possible intercept values?

David
Reply
- Charles
  
  November 5, 2018 at 7:09 am
  
  David,
  How would you measure what a best fit line is?
  Charles
  Reply
  - David
    
    November 5, 2018 at 1:26 pm
    
    No idea… I was just referring to the plot on the right of the first figure that shows a selection of regression lines within the confidence intervals, and wondering how one would go about getting a random sample of those regression lines?
    
    Something like bootstrapping resampling with a t-test to get the ones that are within the 95% interval?
    Reply
    - Charles
      
      November 6, 2018 at 6:19 pm
      
      David,
      Here’s my rough idea for how you might go about doing this. Let’s look at a specific example, namely Example 1 on
      Plot of Confidence and Prediction Interval
      First look at the mean of the X values, namely x = 19.4. Next calculate the upper and lower bound for the y value corresponding to this value of x. Suppose for a minute that the mean of the x values is 20 (i.e. it is one of the sample data values for x) instead of 19.4. Then, as we can see from Figure 1, the lower and upper bounds for the corresponding y value are 68.70 and 77.61. Now pick a value between 68.70 and 77.61 at random, e.g. by using the formula =68.70+RAND()*(77.61-68.70). Say this value is 71.22. Next we pick any other sample x data value, say 0. The corresponding y value ranges from 77.28 to 94.16. Once again pick a value at random from this range. Say this value is 88.42. Now this random line connects the points (20, 71.22) and (0, 88.42). We can easily express the equation for this line in the form y = mx + b.
      It was important to start with the mean value for x so that the two randomly assigned points would stay within the confidence interval. Thus, the only thing left is to figure out the range of y values corresponding to the mean value of the x when this value of x doesn’t correspond to any of the points in the sample. I guess some sort of curve fitting approach could be used.
      Charles
      Reply
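
Read literally, the procedure described in the reply above could be sketched in Python as follows. The function names are hypothetical, the bounds are the ones quoted in the reply, and drawing each y value uniformly is just the rough idea from the reply, not an established way of sampling regression lines.

```python
import random

def line_through(p1, p2):
    """Slope m and intercept b of the line y = m*x + b through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)
    return m, y1 - m * x1

def random_line(x1, bounds1, x2, bounds2, rng=random):
    """Pick a y value uniformly inside each interval and join the two points."""
    y1 = rng.uniform(*bounds1)
    y2 = rng.uniform(*bounds2)
    return line_through((x1, y1), (x2, y2))

# The worked numbers from the reply: points (20, 71.22) and (0, 88.42)
m, b = line_through((20, 71.22), (0, 88.42))
print(f"y = {m:.2f}x + {b:.2f}")

# A random line using the quoted confidence bounds at x = 20 and x = 0
m_rand, b_rand = random_line(20, (68.70, 77.61), 0, (77.28, 94.16))
print(f"y = {m_rand:.2f}x + {b_rand:.2f}")
```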
Troy

November 2, 2018 at 12:08 am

Hi Charles,

I’m just starting to use your Excel add-in and it works great for graphically presenting the confidence and prediction intervals for a simple linear regression. Thank you! I’m trying to calculate and graphically display tolerance intervals with 95% confidence and 75% coverage of the population. I’m not seeing those options in your confidence and prediction interval tool. Do you offer these options in your add-in?
Reply
- Charles
  
  November 2, 2018 at 7:41 am
  
  Troy,
  You can change the confidence level for the Confidence and Prediction Interval Plots data analysis tool. Simply change the Alpha field: .05 corresponds to a 95% confidence interval, .01 corresponds to a 99% confidence interval, etc. See the following webpage:
  https://real-statistics.com/regression/confidence-and-prediction-intervals/plots-regression-confidence-prediction-intervals/
  I don’t know what you mean by the 75% coverage of the population.
  Charles
  Reply
  - Troy Pittenger
    
    November 20, 2018 at 5:17 pm
    
    For tolerance intervals there are two variables: your confidence level (95%) and your population coverage, i.e. the portion of the population contained within the tolerance interval (75%). Is there a way in your program to input the confidence level and the portion of the population contained within a tolerance interval and have it graphically displayed like the confidence and prediction intervals?
    Reply
    - Charles
      
      November 20, 2018 at 6:27 pm
      
      Troy,
      Tolerance intervals are not yet covered by Real Statistics. I will add this shortly, probably without the graphics though
      Charles
      Reply
Suse

June 5, 2018 at 3:30 pm

Hello,

I am evaluating the concentration of a chemical in blood in relation to its concentration in the food. I derived a linear regression and the respective prediction intervals. Now, I would like to calculate how many samples I have to analyse to predict the right concentration (with e.g. 95% confidence) in the food. How can I do this?

Best regards
Suse
Reply
- Charles
  
  June 6, 2018 at 7:21 am
  
  Susanne,
  You have created a linear regression model based on one sample with a number of elements. This model can be used to make predictions. If you create a new sample you will get a different (hopefully not very different) regression model. Thus, it is not clear what you mean by “how many samples…”
  One way to interpret your question is to assume that there is some commonality between the samples; e.g. you fix the values of the x mean, x standard deviation and the correlation coefficient between the y and x values. As can be seen from the formula for the prediction standard error on this webpage, s.e. depends on the x mean, S_x (related to the x standard deviation) and s_y.x (related to the correlation coefficient and sample size n). The size of the prediction interval is twice s.e. times t-crit. Here t-crit depends on df, which is n-2. Thus for fixed values of x mean, x standard deviation and correlation between y and x (and x_0), the size of the prediction interval only depends on the sample size n (using the formulas for s.e. and df). If you fill in these three values in Figure 2 on the webpage (e.g. based on the linear regression you already developed), the only remaining variable is the sample size n. You can then manually change the value of n (cell E5) repeatedly until you get the prediction interval size (i.e. the value of =J15-J14) that you desire. You can also use Excel’s Goal Seek to calculate this for you.
  Charles
  Reply
mojgan

May 30, 2018 at 4:17 pm

Dear Charles
Thank you for your useful text, but I would greatly appreciate it if you could advise me how I can plot these confidence intervals for multiple linear regression with more than one variable. I have one more question: could you kindly tell me how I can obtain the equations of these confidence interval lines for use in prediction models?

Best Regards

Mojgan
Reply
- Charles
  
  May 30, 2018 at 9:54 pm
  
  Mojgan,
  See https://real-statistics.com/multiple-regression/confidence-and-prediction-intervals/
  Charles
  Reply
Arka Bhattacharjee

February 28, 2018 at 2:04 pm

Hello Dr. Zaiontz,
I have a doubt regarding prediction interval calculation. Does the calculation depend on the method of forecasting used (e.g. ARIMA, linear regression, ETS), or is it independent of the method used for forecasting? Please give me some references to look at in support of your answer.
Thanks in advance.
Reply
- Charles
  
  February 28, 2018 at 2:30 pm
  
  Arka,
  This webpage specifically addresses forecasts using linear regression (with one independent variable). The techniques will vary depending on which model you use (ETS, ARIMA, etc.)
  Charles
  Reply
Keith

February 15, 2018 at 4:25 pm

Hello Dr. Zaiontz,
I am developing a correlation between sample lab results (x) and field instrument readings (y) in order to establish a cleanup level equivalent for my field instrument. I’ve calculated 95% upper and lower prediction limits for the field instrument cleanup value equivalent and would like to know what I can say precisely about each of these values. Is it accurate to say of the lower prediction limit that there’s a 95% (or 90%?) chance that the true field instrument cleanup value equivalent is greater than the lower limit and that this value assures minimal false negatives (null hypothesis is that contamination exceeds cleanup level)? Similarly, is there a 95% (or 90%?) chance that the true field instrument cleanup value equivalent is less than the upper limit? If the latter is accurate, what would be any potential value of using the upper limit, other than it being the least conservative value (i.e. allowing cleanup costs to be minimized) without regard for false negatives?
Thanks
Reply
- Charles
  
  February 16, 2018 at 10:16 am
  
  Keith,
  In general if you have a 95% confidence interval for some parameter, it means that if you reran the experiment 1,000 times and calculated the interval (a,b) each time, then the true value of the parameter would lie in the calculated interval in approximately 950 (i.e. 95%) of these experiments.
  Charles
  Reply
Amanda Cardoso

January 25, 2018 at 2:52 am

Charles Zaiontz,

I need to say that you are such an absolutely amazing person. All your instructions are very easy to follow. I even like statistics now. Thank you very much!
Reply
- Charles
  
  January 25, 2018 at 8:08 am
  
  Amanda,
  That is very kind of you. I am very pleased that I have been able to help.
  Charles
  Reply
Cristopher Mallory-Coble

November 30, 2017 at 3:44 am

Hello Charles-

I am being asked to use the slope and intercept to create a nonlinear model, after I have identified my most linear graph which is the “Power Regression” model. With my regression summary statistics, how do I create the formula using the slope and the intercept? I am missing something rather simple, I believe.

Thanks,
Cristopher
Reply
- Charles
  
  November 30, 2017 at 2:05 pm
  
  Cristopher,
  If you look at the following webpage, you see that power regression is just a transformation of the nonlinear expression into the form of a linear regression. Using the terminology on this webpage, the intercept is alpha-prime and the slope is beta-prime.
  https://real-statistics.com/regression/power-regression/
  Charles
  Reply
Tomas Alexandersson

November 8, 2017 at 4:05 pm

Hi Charles,
Thanks for maintaining this site with very helpful information.

Is it a good idea to use the prediction intervals as a check for outliers in the measured data when doing a linear regression between y and x?

Best regards
Tomas
Reply
- Charles
  
  November 8, 2017 at 8:00 pm
  
  Tomas,
  You should identify possible outliers. The approach that you are suggesting could be a good way to do this.
  Charles
  Reply
Chester

May 31, 2017 at 8:13 pm

Hi Charles,

I would like to know how to calculate the “Standard Error of intercept” or “Standard deviation of intercept” with an Excel formula rather than the LINEST function.

Thanks for your informative help
Chester
Reply
- Charles
  
  June 2, 2017 at 4:55 pm
  
  Chester,
  This is shown in Example 2 of the referenced webpage.
  Charles
  Reply
Mark Stafford

November 20, 2016 at 9:48 pm

Hi Charles,

What is the equation for the standard error of the prediction when the y-intercept is zero? The physics of our data requires linear regressions where y=0 when x=0. Something similar to http://stackoverflow.com/questions/36459996/confidence-interval-for-independent-variable-of-a-linear-regression-model-in-r, but in excel.

Thanks,
Mark
Reply
- Charles
  
  November 27, 2016 at 8:51 am
  
  Mark,
  
  I believe that the type of linear regression that you are referring to is described on the following webpage:
  Multiple Regression without Intercept. The value for the matrix B is given on that webpage.
  
  The calculation of the prediction interval is shown on the following webpage:
  Confidence and Prediction Intervals for Multiple Regression. You can use the formula on that webpage for the prediction interval substituting the value B given above.
  
  Charles
  Reply
  - Mark Stafford
    
    November 29, 2016 at 4:18 am
    
    Sorry if I wasn’t specific enough, but I am using a simple linear regression through the origin (zero-intercept). What is the equation for the confidence interval for the forecasted values ŷ of x? Wouldn’t it be that s.e.=0 when x=0 such that any and all possible true best fits still go through the origin and do not rotate around the data set mean (curved confidence interval)? Thanks again.
    Reply
    - Charles
      
      November 29, 2016 at 9:36 am
      
      Mark,
      Clearly at x = 0, s.e. would be zero, but this is not the case for other values of x.
      The information I gave you previously will enable you to calculate a prediction interval at other values of x.
      Charles
      Reply
David Dinnarr

September 13, 2016 at 10:22 am

Hi Charles,

I’m fitting a regression model to some lifing data (Strain vs Number of Cycles) and would like to make sure I’m making the right assertions.

When regression isn’t involved, I’ve seen several references to tolerance intervals. (I think this is what Tristan was referring to when he spoke about 98% of the population with 95% reliability.)
I’m still struggling slightly with the differences between prediction and tolerance intervals (explained in https://en.wikipedia.org/wiki/Tolerance_interval), particularly in the context of a regression model. I was hoping you might be able to explain and have some pointers about how to modify the calculations to include it.

Ideally I’d like to say for a certain value of strain (X) for a design we can be A% confident that B% of the population will survive beyond Y cycles.

Many thanks for your help.
Reply
- Charles
  
  September 13, 2016 at 11:13 am
  
  David,
  It sounds like you are speaking about survival analysis. Please look at the following webpage, especially the part about Cox Regression, and see if this is helpful.
  https://real-statistics.com/survival-analysis/
  Charles
  Reply
DGRoberts

September 8, 2016 at 1:42 pm

Good Day

I am using this methodology for my MSc dissertation. How do you suggest that I reference it?

Is this appropriate?

Zaiontz, C. (2016). Confidence and Prediction intervals for forecasted values. Retrieved June 16, 2016, from https://real-statistics.com/regression/confidence-and-prediction-intervals/
Reply
- Charles
  
  September 8, 2016 at 6:12 pm
  
  That looks good. See also Citation.
  Charles
  Reply
Roger

June 28, 2016 at 9:02 am

Can you extend your Fig 1 to a pooled within-group regression slope as obtained from ANOVA?
Reply
- Charles
  
  June 28, 2016 at 6:26 pm
  
  Roger,
  Sorry, but what do you consider to be the pooled within-group regression slope as obtained from ANOVA?
  Charles
  Reply
Roger

June 28, 2016 at 9:01 am

Can you extend this to a pooled within-group slope as obtained from ANCOVA?
Reply
- Charles
  
  June 28, 2016 at 6:26 pm
  
  Roger,
  Sorry, but what do you consider to be the pooled within-group regression slope as obtained from ANCOVA?
  Charles
  Reply
  - Roger
    
    June 29, 2016 at 8:30 am
    
    Hi Charles
    I am not a statistician – please bear this in mind.
    
    When you have two variables (x,y) and several groups, one can use an analysis of covariance to test the difference between groups for y when regressed against x. This does not use the regression line with the data pooled irrespective of group, but the pooled within-group regression line. For example, Statistical Methods, Snedecor and Cochran, Chapter 14, fig 14:6:1 (6th ed). I want the prediction intervals around this pooled within-group regression line. I hope this is clear.
    Roger
    Reply
    - Charles
      
      July 1, 2016 at 5:23 pm
      
      Sorry Roger, but I don’t have access to Statistical Methods, Snedecor and Cochran, Chapter 14, fig 14:6:1 (6th ed).
      Charles
      Reply
Adam Scarchilli

April 11, 2016 at 9:28 pm

Thank you so much! This example was extremely helpful in finding prediction intervals in Excel and exactly what I was looking for!
Reply
Steve

March 10, 2016 at 8:25 pm

Charles,
I am trying to determine a confidence interval around a natural exponential function (y = ae^(bx) or ln(y) = ln(a) + bx) using Excel. I have reviewed this website but am unsure about the confidence interval calculation.
http://www.tushar-mehta.com/publish_train/data_analysis/16.htm
Thank you!
Reply
- Charles
  
  March 21, 2016 at 10:43 pm
  
  Steve,
  
  Since you are treating ln(y) = ln(a) + bx as a linear regression z = bx + c where z = ln(y) and c = ln(a), one approach to creating a confidence interval is to use the confidence interval for z = bx + c, as described on the webpage
  https://real-statistics.com/regression/confidence-and-prediction-intervals/
  
  This will give you a confidence interval for z of the form [h, k]. You then need to convert this into a confidence interval for y. The simplest approach for this is to use the confidence interval [e^h, e^k].
  
  Charles
  Reply
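The log-transform approach described in the reply above can be sketched in Python as follows. This is only an illustration under stated assumptions: the data values, the helper name `exp_fit_confidence_interval`, and the choice of x0 are made up, and the critical value 2.776 is the standard two-tailed 5% t value for 4 degrees of freedom (n - 2 with n = 6).

```python
import math

def exp_fit_confidence_interval(xs, ys, x0, t_crit):
    """Confidence interval at x0 for y = a*exp(b*x), via the linear fit z = b*x + c."""
    n = len(xs)
    zs = [math.log(y) for y in ys]              # z = ln(y)
    x_bar = sum(xs) / n
    z_bar = sum(zs) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)    # same quantity as DEVSQ in Excel
    b = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs)) / ss_x
    c = z_bar - b * x_bar                       # c = ln(a)
    sse = sum((z - (b * x + c)) ** 2 for x, z in zip(xs, zs))
    s_res = math.sqrt(sse / (n - 2))            # standard error of the estimate
    se = s_res * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_x)
    z0 = b * x0 + c
    h, k = z0 - t_crit * se, z0 + t_crit * se   # interval [h, k] for z
    return math.exp(h), math.exp(z0), math.exp(k)  # back-transformed [e^h, e^k]

# Hypothetical sample, roughly exponential in x
xs = [1, 2, 3, 4, 5, 6]
ys = [2.7, 4.1, 6.2, 8.8, 13.5, 20.1]
T_CRIT = 2.776  # two-tailed 5% critical value of t with 4 degrees of freedom

lower, fitted, upper = exp_fit_confidence_interval(xs, ys, 3.5, T_CRIT)
print(lower, fitted, upper)
```

Note that the back-transformed interval is not symmetric around the fitted value, and strictly speaking it is an interval for exp of the mean of ln(y) rather than for the arithmetic mean of y.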
Alex

March 10, 2016 at 6:23 pm

Everything seems to follow in my software (I wanted to transfer confidence intervals into a bespoke dashboard from my software) for a single X variable.
If I have two x variables, how do I calculate the SE of the overall confidence interval?
I.e. what does the equation in cell G12 look like?
Reply
- Charles
  
  March 12, 2016 at 12:29 pm
  
  Alex,
  For multiple regression you use a slightly different approach. See the following webpage for details:
  https://real-statistics.com/multiple-regression/confidence-and-prediction-intervals/
  Charles
  Reply
  - Alex
    
    March 15, 2016 at 3:56 pm
    
    Thank you, I’ve been after this for ages! My statistics software doesn’t make it clear what is going on in its calculations for the confidence intervals, but seeing this, I can reverse engineer it AND understand it.
    I wasn’t even aware Excel had these capabilities; I had twigged at MMULT.
    Very pleased to find a site that isn’t filled with mathematics requiring a mathematics degree to comprehend!
    Reply
Sma Perkings

March 10, 2016 at 12:00 am

Hello Charles,

Thank you for the explanation. I have a question though:
What if I have x and I would like to predict y instead, but with a confidence interval? Is it the same principle and the same formula?
Reply
- Charles
  
  March 10, 2016 at 7:17 am
  
  Hello Sma,
  In this case you use the prediction interval.
  Charles
  Reply
John Hart

February 19, 2016 at 6:11 am

Hello Charles,

Could you tell me what the first part in cell G12 is (it is kind of blurry on my monitor), for the calculation of s.e. 2.06?

Also, is there a shortcut (function) in Excel for the s.e. calculation?

Thank you,
Reply
- Charles
  
  February 19, 2016 at 9:22 am
  
  John,
  
  Cell G12 contains the following formula: s_Res * SQRT(1/n + (x_0 - x̄)^2/SS_x)
  
  There is no built-in Excel function for calculating this. The Real Statistics function REGPRED will automatically calculate the s.e. in the prediction interval case, but not the confidence interval case. I will add a similar function for the confidence interval shortly. See the following webpage for more details:
  Prediction and Confidence Intervals
  
  Charles
  Reply
  - John Hart
    
    February 19, 2016 at 2:34 pm
    
    Thank you, Dr. Zaiontz. Pardon my ignorance, but could you tell me what s_Res is and how I obtain it?
    
    John
    Reply
    - John Hart
      
      February 19, 2016 at 2:43 pm
      
      Looks like it’s the standard error of y and x, the 7.97 in cell E10. Is this correct? The underscore in s_Res threw me off a little bit.
      Reply
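As a rough sketch of the standard error calculation discussed in this thread, the usual simple-regression formulas can be written in Python as below. The values 7.97 (s_Res) and 2171.6 (SS_x) are the ones quoted in these comments, and n = 15 is what the range A4:A18 implies; x0 and the mean of x are hypothetical placeholders, since they are not stated here, and the function name is made up.

```python
import math

def interval_standard_error(s_res, n, x0, x_bar, ss_x, prediction=False):
    """Standard error at x0 for the confidence interval (default) or prediction interval."""
    extra = 1.0 if prediction else 0.0  # the prediction interval adds 1 under the root
    return s_res * math.sqrt(extra + 1.0 / n + (x0 - x_bar) ** 2 / ss_x)

S_RES = 7.97    # standard error of the estimate quoted in the comments (cell E10)
N = 15          # sample size implied by the range A4:A18
SS_X = 2171.6   # DEVSQ of the x values quoted in the comments (cell E11)
X0 = 20.0       # hypothetical x value to forecast at
X_BAR = 21.0    # hypothetical mean of the x values

se_ci = interval_standard_error(S_RES, N, X0, X_BAR, SS_X)
se_pi = interval_standard_error(S_RES, N, X0, X_BAR, SS_X, prediction=True)
print(se_ci, se_pi)
```

The two results differ only by the extra 1 under the square root, so the squared prediction s.e. exceeds the squared confidence s.e. by exactly s_Res squared.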
Gary

January 11, 2016 at 1:42 pm

Hi Charles,

If I understand you correctly, you describe the confidence interval as the range of possible values for model parameters, e.g. the range of values slope and intercept may be for a linear regression for a given set of data and specified confidence interval. You describe prediction interval as the interval around a predicted Y for a specific X0.

However, in your example you calculate confidence interval for a specific X0 and do the same for a prediction interval. I am confused as the example does not appear to match the discussion.

Obviously I don’t understand you correctly!

On a related matter, when one does a linear regression with Excel, Excel reports the Lower and Upper confidence intervals for “intercept” and “X Variable”, i.e. values for the slope and intercept. How does this Excel output relate to your discussion above?

Thank you
Reply
- Charles
  
  January 11, 2016 at 7:16 pm
  
  Gary,
  I will respond to your first question shortly.
  Regarding your second question: the confidence intervals for the intercept and x variable are really for the intercept and x coefficient (not for the prediction or confidence interval of data elements).
  Charles
  Reply
- Charles
  
  February 5, 2016 at 10:23 am
  
  Gary,
  
  Here is my response to your first question:
  
  The confidence interval focuses on the population mean. If you create many random samples that are normally distributed and for each sample you calculate a confidence interval for the mean, then about 95% of those intervals will contain the true value of the population mean.
  
  The prediction interval focuses on the true y value for any set of x values. If you create many random samples that are normally distributed and for each sample you calculate a prediction interval for the y value corresponding to some set of x values, then about 95% of those intervals will contain the true y value.
  
  Charles
  Reply
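The repeated-sampling interpretation in the reply above can be checked with a small simulation. This is a minimal sketch, assuming a made-up true line y = 2 + 0.5x with normal errors of standard deviation 1, ten fixed x values, and the standard critical value 2.306 (two-tailed 5%, 8 degrees of freedom); none of these numbers come from the webpage.

```python
import math
import random

T_CRIT = 2.306  # two-tailed 5% critical value of t with n - 2 = 8 degrees of freedom

def simulate_coverage(trials=2000, seed=1):
    """Fraction of samples whose 95% CI covers the true mean and whose 95% PI covers a new y."""
    rng = random.Random(seed)
    xs = list(range(1, 11))
    n = len(xs)
    x_bar = sum(xs) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    a, b, sigma, x0 = 2.0, 0.5, 1.0, 7.0   # hypothetical population parameters
    true_mean = a + b * x0
    ci_hits = pi_hits = 0
    for _ in range(trials):
        ys = [a + b * x + rng.gauss(0, sigma) for x in xs]
        y_bar = sum(ys) / n
        slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
        intercept = y_bar - slope * x_bar
        sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
        s_res = math.sqrt(sse / (n - 2))
        y_hat = intercept + slope * x0
        leverage = 1 / n + (x0 - x_bar) ** 2 / ss_x
        ci_half = T_CRIT * s_res * math.sqrt(leverage)
        pi_half = T_CRIT * s_res * math.sqrt(1 + leverage)
        new_y = true_mean + rng.gauss(0, sigma)   # a fresh observation at x0
        ci_hits += abs(y_hat - true_mean) <= ci_half
        pi_hits += abs(y_hat - new_y) <= pi_half
    return ci_hits / trials, pi_hits / trials

ci_coverage, pi_coverage = simulate_coverage()
print(ci_coverage, pi_coverage)
```

Both fractions should come out close to 0.95, which is the sense in which the intervals are "95%" intervals.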
Ang

January 9, 2016 at 9:54 am

Hi, Charles. Can you make a video on plotting a 95% confidence interval? The above instructions were really confusing. I would be grateful.
Reply
- Charles
  
  January 14, 2016 at 2:55 pm
  
  Ang,
  Thanks for your input. I will try to improve the explanation, but first I need to finish up the work I am doing on time series analysis.
  Charles
  Reply
Andy

September 29, 2015 at 6:46 pm

Hi Charles,
I would like to plot the 95% Confidence Interval curves just like they are shown within Figure 1 (e.g. the dotted lines). How would you recommend doing this in Excel? I was informed that other programs may provide this feature, but I prefer to continue working in Excel if at all possible.
Thank you in advance,
Andy
Reply
- Charles
  
  September 30, 2015 at 9:55 am
  
  Yes, you can do this in Excel. E.g. to create the dotted line for the upper confidence interval curve, fill in a range of x and y values (say A1:B100) as follows: (1) In column A insert x values of the appropriate scale. If, for example, you are looking at values between 0 and 10, insert 0 in cell A1 and the formula =A1+.1 in cell A2, and then highlight the range A2:A100 and press Ctrl-D. (2) In column B insert the y values for the upper confidence interval. E.g. in cell B1 insert the Excel formula for the upper confidence interval value corresponding to the x value in cell A1 (this is as described in cell E15 of Figure 2 of the referenced webpage), then highlight range B1:B100 and press Ctrl-D. (Make sure that you use absolute addressing for all the parts of the formula in B1 that don’t depend on the x value in cell A1.) (3) Finally, highlight range A1:B100 and select Insert > Charts|Scatter.
  Charles
  Reply
  - PG
    
    January 21, 2020 at 8:56 am
    
    Dear Charles
    
    I just want to thank you for the wealth of information you put out here freely. It takes a lot of effort and dedication, and you have made the ins and outs of statistics a lot easier. I hope to help others in the same way.
    
    Kind regards
    PG
    Reply
    - Charles
      
      January 21, 2020 at 9:02 pm
      
      Thank you very much.
      Charles
      Reply
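The plotting recipe given earlier in this thread (a column of evenly spaced x values and a column of interval limits) can be mirrored outside Excel. The sketch below assumes a made-up sample of seven points and the standard critical value 2.571 (two-tailed 5%, 5 degrees of freedom); the function name `confidence_band` is hypothetical. It produces rows of (x, lower, fitted, upper) that could be pasted into a scatter chart.

```python
import math

def confidence_band(xs, ys, grid, t_crit):
    """Return (x, lower, fitted, upper) rows of the confidence band for each x in grid."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s_res = math.sqrt(sse / (n - 2))
    rows = []
    for g in grid:
        y_hat = intercept + slope * g
        half = t_crit * s_res * math.sqrt(1 / n + (g - x_bar) ** 2 / ss_x)
        rows.append((g, y_hat - half, y_hat, y_hat + half))
    return rows

# Hypothetical sample data
xs = [1, 2, 4, 5, 7, 8, 9]
ys = [1.8, 3.1, 4.9, 6.2, 8.1, 8.8, 10.3]
grid = [i / 10 for i in range(101)]       # 0, 0.1, ..., 10, like column A in the recipe
band = confidence_band(xs, ys, grid, 2.571)

# The band is narrowest at the grid point closest to the mean of the x values
narrowest = min(band, key=lambda row: row[3] - row[1])
print(narrowest[0])
```

The curved shape of the dotted lines in Figure 1 comes from the (x - mean)^2 term: the further x is from the sample mean, the wider the band.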
Phil

July 31, 2015 at 4:05 pm

Hi Charles,
I wanted to point out a common misunderstanding about confidence intervals: CIs say nothing about the probability that the true value of the population parameter lies within them; it either does or it doesn’t. A 95% CI just tells you that, if you were to repeat your experiment (sampling) an infinite number of times and run the statistics on each sample, 95% of those confidence intervals would contain the true parameter. What you described as a CI in the first section of your post is actually a Bayesian credible interval, which is a bit more complicated to calculate, but it does tell you the probability that your population parameter lies within the interval.
Cheers,
Phil
Reply
- Charles
  
  August 13, 2015 at 10:47 am
  
  Phil,
  
  Thanks for correcting my somewhat appealing, but, in the end, inaccurate statement. Shortly I will update the website with a more accurate characterization of the confidence interval.
  
  I appreciate your help in making the site more accurate.
  
  Charles
  Reply
Mari

July 28, 2015 at 11:18 pm

Hello! Thank you very much for this incredibly helpful guide! Do you have a tutorial on how to find the C.I and P.I using multiple regression?
Reply
- Charles
  
  July 29, 2015 at 5:37 am
  
  Mari,
  Not yet. I plan to include this shortly. I am working on adding information about Survival Analysis now. After that the next thing I will do will include CI/PI for multiple regression.
  Charles
  Reply
  - Thomas Knoll
    
    November 20, 2015 at 6:13 pm
    
    Hi Charles. Was wondering if you’ve added the section covering CIs/PIs for Multi Linear Regression yet.
    
    Really love the site and it has helped out tremendously. Keep up the good work!
    Reply
    - Charles
      
      November 25, 2015 at 9:16 pm
      
      Hi Thomas,
      
      Thanks for your continued support.
      
      The section covering CIs/PIs for Multi Linear Regression is located on the following webpage:
      
      https://real-statistics.com/multiple-regression/confidence-and-prediction-intervals/
      
      Charles
      Reply
Paul

July 24, 2015 at 10:23 am

Hi Charles,
I notice that you use n-2 for degrees of freedom, whilst other publications, for example UKAS M3003, use n-1. Although the difference in the degrees of freedom is small, it can greatly affect the interval magnitude. Could you give some advice as to which calculation should be used for the df value? Thank you in advance for your reply.

By the way, your website has been an extremely useful aid, Thanks
Reply
- Charles
  
  July 30, 2015 at 3:37 pm
  
  Paul,
  I am not familiar with UKAS M3003, but I just checked a few other publications and they all show df = n-2 when calculating the confidence interval for regression.
  Charles
  Reply
Ryan

June 11, 2015 at 5:35 pm

Charles,

What’s the rationale for the added 1 under the square root of the standard error of the prediction? Theoretically, why 1?

Thank you
Reply
- Charles
  
  June 25, 2015 at 7:09 pm
  
  Ryan,
  The standard error for a prediction interval adds the variability of the points around the predicted mean. Mathematically, this is where the 1 comes from.
  I plan to eventually show the proofs of the formulas for the confidence and prediction intervals. You will then be able to see from the proofs more precisely where the 1 comes from.
  Charles
  Reply
  - Hana
    
    January 24, 2020 at 5:11 pm
    
    Are the proofs available now? I’m also interested in understanding where the 1 comes from!
    Reply
    - Charles
      
      January 25, 2020 at 8:02 pm
      
      Hello Hana,
      I will add the proof to the website shortly.
      Charles
      Reply
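Pending a full proof, the origin of the 1 can be sketched with the usual variance argument. This assumes the standard simple linear regression model, in which a new observation at x_0 has an error term with variance σ² that is independent of the sample used to fit the line; it is an outline, not the proof promised above.

```latex
\begin{aligned}
\operatorname{Var}(y_0 - \hat{y}_0)
  &= \operatorname{Var}(y_0) + \operatorname{Var}(\hat{y}_0) \\
  &= \sigma^2 + \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_x}\right) \\
  &= \sigma^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_x}\right)
\end{aligned}
```

The first term is the variability of an individual point around the true line (the 1), and the second is the uncertainty in the fitted line itself. Replacing σ by its estimate s_Res and taking the square root gives the prediction standard error; dropping the first term gives the confidence interval version.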
Tristan

February 10, 2015 at 12:13 pm

Hi Charles,

This post has helped me so much already, really very insightful and easy to follow!

I am trying to construct a prediction interval for some metal fatigue test data, but I want 90% confidence in the 98% reliability data. To do this, is it sufficient to use 0.1 as the probability in TINV so that the method gives the 90% probability, or do I need to make more changes?

Thanks for your help!

Tristan
Reply
- Charles
  
  February 10, 2015 at 10:20 pm
  
  Tristan,
  TINV(.1,df) (i.e. alpha = .1) is part of the formula for a 90% prediction interval. But I don’t understand what the 98% reliability data means.
  Charles
  Reply
Emiel

November 26, 2014 at 9:46 am

Hi all,

It would be good to point out that the function TINV takes a two-tailed probability and so gives the critical value for a two-sided interval. The new function T.INV (with a dot) takes a left-tailed (one-sided) probability. When using the new function for a two-sided interval, you should put \alpha/2 in each tail and pass the 1-\alpha/2 value; thus, for \alpha = 0.05, use T.INV(0.975,df).

Emiel
Reply
- Charles
  
  November 26, 2014 at 2:39 pm
  
  Emiel,
  
  Actually, more simply you should use the T.INV.2T function for the two-sided critical value. It is equivalent to TINV.
  
  This is explained on the webpage https://real-statistics.com/excel-capabilities/built-in-statistical-functions/
  
  I will shortly update this information to better explain the various usages of the functions.
  
  Charles
  Reply
  - Omamz
    
    December 12, 2014 at 5:57 pm
    
    Hi Charles,
    
    Your post has been of tremendous help to me.
    Although it took me some time I have applied most of the solutions and they have worked just fine.
    I need clarification on example 2. Why did you have to eliminate the # 1 before 1/15?
    Would really appreciate your prompt response.
    Thanks.
    Reply
    - Charles
      
      December 12, 2014 at 7:28 pm
      
      The test uses the confidence interval and not the prediction interval. This is why a 1 is not inserted before 1/15. Shortly I will update the webpage to explain better when the prediction interval is used and when the confidence interval is used.
      Charles
      Reply
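The relationship between the one-tailed and two-tailed inverse t functions discussed at the top of this thread can be illustrated without Excel. The sketch below is an assumption-laden stand-in, not Excel's implementation: `t_inv` and `t_inv_2t` are hypothetical names that mimic T.INV and T.INV.2T (or TINV), and the quantile is found by numerically integrating the t density and bisecting.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """Left-tail probability via Simpson's rule on [0, |t|] plus symmetry."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

def t_inv(p, df):
    """Left-tailed inverse (the role played by T.INV), found by bisection."""
    lo, hi = -100.0, 100.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_inv_2t(alpha, df):
    """Two-tailed inverse (the role played by TINV and T.INV.2T)."""
    return t_inv(1 - alpha / 2, df)

two_sided_95 = t_inv_2t(0.05, 13)   # same value as t_inv(0.975, 13)
two_sided_90 = t_inv_2t(0.10, 13)   # same value as t_inv(0.95, 13)
print(two_sided_95, two_sided_90)
```

With 13 degrees of freedom (n = 15 in a simple regression), standard t tables give about 2.160 and 1.771 for these two critical values.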
Elisa

September 3, 2014 at 11:30 am

Hi Charles,
I am using an exponential/hyperbolic function to fit my data. The model is (a*X+b)-SQRT[(a*X+b)^2-((4*a*X*b*c)/(2*c))].
I wonder if I can calculate prediction intervals in the way you show, or if there is any parameter that is different for this type of models. If so, can you tell me how I should calculate it?
Thank you very much
Elisa
Reply
- Charles
  
  September 9, 2014 at 5:09 pm
  
  Elisa,
  I can’t think of a way of doing this. Perhaps someone else has a suggestion.
  Charles
  Reply
Rizwan

June 4, 2014 at 4:51 pm

Hi Charles,

Is Syx = Sres = STEYX(Y,X)? Is it the same as Syx = SQRT(SUM((yi – Yi)^2)/(degrees of freedom)), where (xi,yi) are the given data and Y is any nonlinear model (not a straight line, say a sigmoidal or logistic curve) which fits the data?
Reply
- Charles
  
  July 2, 2014 at 10:35 pm
  
  Rizwan,
  Sorry for the long delay in responding to you. I just realized that I overlooked responding to you.
  It is true that Syx = Sres = STEYX(Y,X) = SQRT(SUM((yi – Yi)^2)/df) for linear models. I am not sure what these terms would even mean for nonlinear models such as logistic regression models.
  Charles
  Reply
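For the linear case, the identity stated in the reply above can be sketched in Python. The function name `steyx` is borrowed from the Excel function only for readability, the data are made up, and df is taken to be n - 2 as used elsewhere on this page. The second return value uses the equivalent shortcut based on sums of squares, as a cross-check.

```python
import math

def steyx(ys, xs):
    """Standard error of the estimate for a straight-line fit, computed two ways."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sp_xy / ss_x
    intercept = y_bar - slope * x_bar
    # Direct definition: root of the sum of squared residuals over df = n - 2
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    from_residuals = math.sqrt(sse / (n - 2))
    # Equivalent shortcut that avoids computing the fitted values
    from_sums = math.sqrt((ss_y - sp_xy ** 2 / ss_x) / (n - 2))
    return from_residuals, from_sums

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
s_direct, s_shortcut = steyx(ys, xs)
print(s_direct, s_shortcut)
```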
David

May 8, 2014 at 3:17 am

Hi Charles,

Thanks for your contributions on this site. I’m a bit confused by your base formula, though. Where you use the sum of squared deviations of x (SSx, calculated as DEVSQ(x) or DEVSQ(A4:A18)), I’ve learned to use the standard deviation of x times (n-1), or STDEV.S(A4:A18)*(n-1) in Excel speak. This would yield a CI SE of 2.090695467 and a PI SE of 8.244184143. The difference is small in your dataset, but where deviations are larger, the use of sums of squared deviations instead of the method I’ve heard will yield very different results. That said, I’m not at all certain which method is correct. Can you point to some references for your formula, please?

Thanks in advance for taking the time to clarify this issue for me.

David
Reply
- Charles
  
  May 12, 2014 at 9:05 am
  
  Hi David,
  
  Note that for any range R1 with n elements, DEVSQ(R1) equals STDEV.S(R1)^2*(n-1), so the square root of DEVSQ(R1) is STDEV.S(R1)*SQRT(n-1). In fact, in the formula for cell E12 in Figure 2 of the referenced page I do take the square root of SSx, and so the two approaches should give the same answer once the standard deviation is squared. Can you send me an example of your calculation so that I can see why the results are not the same?
  
  Charles
  Reply
  - Zhang
    
    June 14, 2014 at 11:21 am
    
    Thanks for your contribution. I would like to know how to calculate CI and PI if there are two independent variables. Thanks.
    Reply
    - Charles
      
      June 26, 2014 at 9:51 pm
      
      Zhang,
      Sorry but I haven’t had enough time to figure out or find an answer to your question.
      Charles
      Reply
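The relationship between the sum of squared deviations and the sample standard deviation raised at the start of this thread is easy to check numerically. The sketch below uses made-up x values and Python's `statistics` module, where `variance` and `stdev` are the sample (n - 1) versions and so play the roles of VAR.S and STDEV.S.

```python
import math
import statistics

xs = [12, 15, 19, 23, 28, 31, 36]   # hypothetical x values
n = len(xs)
mean_x = statistics.mean(xs)

devsq = sum((x - mean_x) ** 2 for x in xs)          # what DEVSQ computes
from_variance = statistics.variance(xs) * (n - 1)    # sample variance times (n - 1)
from_stdev = statistics.stdev(xs) * (n - 1)          # standard deviation times (n - 1)

print(devsq, from_variance, from_stdev)
```

The first two numbers agree, while the third does not: it is the variance, not the standard deviation, that must be multiplied by n - 1 to recover the sum of squared deviations.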
Anu

April 4, 2014 at 10:12 am

Please help: how did you get the value of SSx, which I suppose to be 271.6?
Reply
- Charles
  
  April 5, 2014 at 9:09 am
  
  Anu,
  SSx (cell E11) is calculated by the formula =DEVSQ(A4:A18). It has the value 2171.6.
  Charles
  Reply