Bootstrapping

Overview

Assume that we have a random sample S = {x1, …, xn} and that θ̂ = f(S) is the estimate of some parameter θ based on this sample, where f is the function used to calculate the estimate.

The bootstrap estimate θ* of this parameter is obtained by creating a large number of bootstrap samples S1, …, Sm where each Sj consists of n elements selected randomly from S with replacement. This results in m values θ1*, …, θm* where each θj* = f(Sj).

The bootstrap estimate θ* of θ is simply the mean of the θj*:

\[ \theta^* = \frac{1}{m} \sum_{j=1}^{m} \theta_j^* \]

We can also obtain an estimate of the standard error of the estimate of θ by using the standard deviation of the θj*:

\[ s.e. = \sqrt{ \frac{1}{m-1} \sum_{j=1}^{m} \left( \theta_j^* - \theta^* \right)^2 } \]
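The resampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Real Statistics implementation: the function name bootstrap_replicates, the data values, and the seed are made up for the example, and the standard error is taken to be the sample (m − 1) standard deviation of the replicates.

```python
import random
import statistics

def bootstrap_replicates(sample, f, m=2000, seed=None):
    """Return [f(S_1), ..., f(S_m)], where each S_j holds n draws from sample with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [f(rng.choices(sample, k=n)) for _ in range(m)]

# Hypothetical data (NOT the values shown in Figure 1).
sample = [23, 31, 35, 36, 38, 40, 41, 42, 44, 45, 47, 52]
reps = bootstrap_replicates(sample, statistics.mean, m=2000, seed=1)

theta_hat = statistics.mean(sample)   # estimate from the original sample
theta_star = statistics.mean(reps)    # bootstrap estimate: mean of the replicates
std_err = statistics.stdev(reps)      # bootstrap standard error (m - 1 denominator)
print(theta_hat, theta_star, std_err)
```

Passing a seed makes the run repeatable. Without one, each run gives slightly different values, just as recalculating the worksheet does.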

Percentile Confidence Interval

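We can use (C_lower, C_upper) as an estimate of the 1 – α confidence interval of θ, where

\[ C_{lower} = \text{the } (\alpha/2)\,m\text{-th smallest of the values } \theta_1^*, \ldots, \theta_m^* \]

\[ C_{upper} = \text{the } (1-\alpha/2)\,m\text{-th smallest of the values } \theta_1^*, \ldots, \theta_m^* \]

Alternatively, we can use the following value for the upper limit:

\[ C_{upper} = \text{the } (\alpha/2)\,m\text{-th largest of the values } \theta_1^*, \ldots, \theta_m^* \]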

We will call this the percentile estimate of the confidence interval.
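As a rough sketch of the percentile estimate, the limits can be read directly off the sorted bootstrap values. The function name percentile_ci and the data below are hypothetical. The sketch uses the "largest value" form of the upper limit and rounds the rank (α/2)·m to the nearest whole number, which is a simplifying assumption.

```python
import random
import statistics

def percentile_ci(reps, alpha=0.05):
    """Percentile interval: k-th smallest and k-th largest replicate, with k = (alpha/2) * m."""
    ordered = sorted(reps)
    m = len(ordered)
    k = max(1, round(alpha / 2 * m))   # 1-based rank, rounded to a whole number
    lower = ordered[k - 1]             # k-th smallest value
    upper = ordered[m - k]             # k-th largest value
    return lower, upper

# Hypothetical data and bootstrap replicates of the mean.
rng = random.Random(2)
sample = [23, 31, 35, 36, 38, 40, 41, 42, 44, 45, 47, 52]
reps = [statistics.mean(rng.choices(sample, k=len(sample))) for _ in range(2000)]
print(percentile_ci(reps, alpha=0.05))
```

With m = 2,000 and α = .05, this picks the 50th smallest and the 50th largest bootstrap values.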

BCa Confidence Interval

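Another estimate of the confidence interval, called the bootstrap bias-corrected and accelerated (BCa) confidence interval, can often produce less biased results. To obtain this confidence interval, we first need to define the median bias z0 and the acceleration a. The median bias is defined from the bootstrap using the inverse of the standard normal distribution, namely

\[ z_0 = \Phi^{-1}\left( \frac{\#\{\, j : \theta_j^* < \hat{\theta} \,\}}{m} \right) \]

where Φ^{-1} is the inverse of the standard normal cumulative distribution function (NORM.S.INV in Excel). Thus, z0 is the standard normal quantile of the proportion of bootstrap values θj* that are smaller than the sample estimate θ̂.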

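The acceleration is defined using the jackknife sample of sample S (see Jackknife), as follows

\[ a = \frac{ \sum_{i=1}^{n} \left( \bar{\theta}_{(\cdot)} - \theta_{(i)} \right)^3 }{ 6 \left[ \sum_{i=1}^{n} \left( \bar{\theta}_{(\cdot)} - \theta_{(i)} \right)^2 \right]^{3/2} } \]

where θ(i) is the value of f on S with xi removed (the i-th jackknife estimate) and θ̄(·) is the mean of θ(1), …, θ(n).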
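The two ingredients of the BCa interval can be sketched as follows, using the usual definitions from the bootstrap literature: the median bias is the standard normal quantile of the share of bootstrap values below the sample estimate, and the acceleration is computed from the n leave-one-out (jackknife) estimates. The function names and the data are hypothetical, and this is not claimed to match the Real Statistics code.

```python
from statistics import NormalDist, mean

def median_bias(reps, theta_hat):
    """z0: inverse standard normal cdf of the proportion of replicates below theta_hat."""
    p0 = sum(r < theta_hat for r in reps) / len(reps)
    return NormalDist().inv_cdf(p0)   # raises an error unless 0 < p0 < 1

def acceleration(sample, f):
    """a: skewness-type ratio built from the leave-one-out estimates of f."""
    jack = [f(sample[:i] + sample[i + 1:]) for i in range(len(sample))]
    jbar = mean(jack)
    num = sum((jbar - t) ** 3 for t in jack)
    den = 6 * sum((jbar - t) ** 2 for t in jack) ** 1.5
    return num / den

# Hypothetical, left-skewed data (a plain list, so that slicing works).
sample = [23, 31, 35, 36, 38, 40, 41, 42, 44, 45, 47, 52]
print(acceleration(sample, mean))
```

For a perfectly symmetric sample and f = mean, the cubed deviations cancel and the acceleration is zero.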

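We now define the BCa confidence interval as the percentile confidence interval (C_α-lower, C_α-upper), whose limits are the α_lower·m-th and α_upper·m-th smallest of the values θj*, where

\[ \alpha_{lower} = \Phi\left( z_0 + \frac{z_0 + z_{\alpha/2}}{1 - a\,(z_0 + z_{\alpha/2})} \right) \]

\[ \alpha_{upper} = \Phi\left( z_0 + \frac{z_0 + z_{1-\alpha/2}}{1 - a\,(z_0 + z_{1-\alpha/2})} \right) \]

and Φ is the standard normal cumulative distribution function, with z_p = Φ^{-1}(p) the corresponding quantile. When z0 = 0 and a = 0, these reduce to α/2 and 1 – α/2, i.e. to the ordinary percentile interval.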
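A minimal sketch of the BCa adjustment, assuming the standard formulas for the adjusted levels, is shown below. The names bca_levels and bca_ci are hypothetical, as are the values of z0 and a in the usage lines. Fractional ranks are simply rounded to the nearest whole rank here, which is a simplification compared with the SMALLExact approach used in the worked example.

```python
from statistics import NormalDist

def bca_levels(z0, a, alpha=0.05):
    """Return the adjusted levels (alpha_lower, alpha_upper) for the BCa interval."""
    nd = NormalDist()

    def adjust(p):
        z = nd.inv_cdf(p)   # standard normal quantile z_p
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    return adjust(alpha / 2), adjust(1 - alpha / 2)

def bca_ci(reps, z0, a, alpha=0.05):
    """Percentile interval of the replicates at the BCa-adjusted levels."""
    ordered = sorted(reps)
    m = len(ordered)
    lower_level, upper_level = bca_levels(z0, a, alpha)

    def rank(p):
        # Nearest whole rank, clamped to 1..m (an assumption of this sketch)
        return min(m, max(1, round(p * m)))

    return ordered[rank(lower_level) - 1], ordered[rank(upper_level) - 1]

# Hypothetical median bias and acceleration values.
print(bca_levels(z0=0.1, a=0.02, alpha=0.05))
```

Multiplying the two adjusted levels by the number of bootstrap samples gives the ranks at which the interval limits are read off.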

Worksheet Functions

Real Statistics Functions: The Real Statistics Resource Pack provides the following lambda array function.

BOOTSTRAP(R1, expression, iter, ref): returns a column array with a bootstrap sample for the data in R1 based on the function f(arr) on a variable arr that takes array values; expression is used to specify f and ref is an optional reference to arr.

R1, expression, and ref are as for JACKKNIFE. iter is the number of bootstrap samples returned (default 2,000).

If R1 is a column array with elements S = {x1, …, xn}, then bootstrapping works by creating iter data sets S1, …, Siter where each Sj is formed by taking n elements from S with replacement. The bootstrap sample is then a column array containing the elements f(S1), …, f(Siter).

If R1 is an array with multiple columns then the same approach is used except that now the xi represent rows in R1.
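For multi-column data, the key point is that whole rows are resampled, so paired values stay together. The following sketch illustrates this with a correlation statistic. The names pearson and bootstrap_rows and the eight data pairs are hypothetical, and the code only mimics the behavior described for BOOTSTRAP rather than reproducing it.

```python
import math
import random

def pearson(rows):
    """Pearson correlation between the first and second column of rows."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan   # degenerate resample: correlation is undefined
    return sxy / math.sqrt(sxx * syy)

def bootstrap_rows(rows, f, iters=2000, seed=None):
    """Apply f to iters resamples, each made of len(rows) rows drawn with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    return [f(rng.choices(rows, k=n)) for _ in range(iters)]

# Hypothetical paired data (8 rows, 2 columns).
rows = [(1.2, 2.0), (2.5, 2.9), (3.1, 4.2), (4.0, 3.8),
        (5.3, 6.1), (6.1, 5.5), (7.4, 8.0), (8.2, 7.1)]
reps = bootstrap_rows(rows, pearson, iters=2000, seed=7)
print(pearson(rows), len(reps))
```

Resampling the two columns independently would destroy the pairing and push the bootstrap correlations toward zero, which is why rows are the sampling unit.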

See Real Statistics Lambda Capabilities for additional information about lambda functions.

Confidence Intervals

In addition, the Real Statistics Resource Pack provides the following lambda worksheet function.

CI_BOOTSTRAP(R1, expression, lab, iter, alpha, ref): returns various statistics resulting from a bootstrap sample for the data in R1 based on the function f(arr) on a variable arr that takes array values; expression is used to specify f and ref is an optional reference to arr.

R1, expression, ref, and iter are as for BOOTSTRAP. alpha takes a value between 0 and .5 (default .05).

This function returns the following statistics:

  • parameter estimate = f(arr) where arr is the R1 array
  • bootstrap estimate = the average of the f(S1), …, f(Siter) for bootstraps Sj
  • standard error = the standard deviation of the f(S1), …, f(Siter)
  • percentile confidence interval (% lower, % upper) where % lower = the (alpha/2)*iter-th smallest value of f(Sj) and % upper = the (alpha/2)*iter-th largest value of f(Sj)
  • BCa (bias-corrected and accelerated) confidence interval (BCa lower, BCa upper) = the BCa adjusted value of the percentile confidence interval

If lab = FALSE (default), then the output consists of a column array with the above 7 entries. If lab = TRUE, then an extra column is appended to the output consisting of labels.

Example

Example 1: Use bootstrapping to estimate the 95% confidence interval for the population mean based on the data in range B1:M1 of Figure 1. Note this is the same data as shown in column B in Figure 1 of Jackknife.

We start by creating 2,000 bootstrap samples, one in each row of range B2:M2001 of Figure 1 (only the first 15 samples are displayed). This is done by placing the array formula =RANDOMIZES(B1:M1) in range B2:M2, highlighting range B2:M2001, and pressing Ctrl-D. We then place the formula =AVERAGE(B1:M1) in cell N1, highlight the range N1:N2001, and press Ctrl-D.

Figure 1 – Bootstrap sample

Here N1 contains the mean of the original sample and N2:N2001 contains the means from the 2,000 bootstrap samples. Alternatively, we can obtain the bootstrap sample shown in N2:N2001 by using the lambda formula

=BOOTSTRAP(TRANSPOSE(B1:M1),"=AVERAGE($arr)")

Confidence Intervals

We now obtain the 95% confidence intervals based on the bootstrap, as displayed in Figure 2. The percentile confidence interval of (33.41667, 43.16667) is shown in range Q8:Q9 and the BCa confidence interval of (32.83333, 42.83333) is shown in range Q18:Q19.

Figure 2 – Confidence intervals

We use the Real Statistics SMALLExact function in cells Q18 and Q19 since the values in cells Q16 and Q17 are not whole numbers. Since the 28th and 29th smallest bootstrap means are equal, we could have used the Excel SMALL function instead of SMALLExact in cell Q18. Similarly, we could have used the SMALL function in cell Q19 since the 1922nd and 1923rd smallest bootstrap means are equal.

The sample data is approximately normally distributed, and so we expect that the confidence interval will take the form x̄ ± se ⋅ crit. This confidence interval, (33.50597, 43.49403), is shown in range V6:V7 of Figure 3 and is quite similar to the percentile confidence interval described previously. The BCa confidence interval probably better reflects the skewness of the sample data (skewness = -.84475).

Figure 3 – More confidence intervals

We can also use the CI_BOOTSTRAP function to calculate both the percentile and BCa confidence intervals, as shown in the lower part of Figure 3.

One final note: the bootstrap mean shown in Figure 2 is 0.01104 higher than the sample mean. This could motivate us to shift the percentile confidence interval 0.01104 units to the left, obtaining the interval (33.40563, 43.15563). It is not clear whether this is a better estimate, but some may recommend this adjustment.
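The shift is a simple subtraction applied to both limits. The short check below uses only the figures quoted in the text and should reproduce the shifted interval; the variable names are chosen for illustration.

```python
percentile_ci = (33.41667, 43.16667)   # percentile interval quoted from Figure 2
bias = 0.01104                          # bootstrap mean minus sample mean

# Move both limits to the left by the estimated bias.
shifted = tuple(round(limit - bias, 5) for limit in percentile_ci)
print(shifted)
```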

Correlation Example

Example 2: Use bootstrapping to estimate the 95% confidence interval of the population correlation coefficient based on the sample of size 8 in range B2:C9 of Figure 2 of Jackknife (Example 2 of that webpage).

We use the lambda formula

=CI_BOOTSTRAP(B2:C9,"=CORREL(INDEX($arr,,1),INDEX($arr,,2))",TRUE)

to obtain the results shown in Figure 4.

Figure 4 – Bootstrap confidence intervals for correlation

References

DiCiccio, T. J., Efron, B. (1996) Bootstrap confidence intervals. Statistical Science, 11(3), 189–228
http://staff.ustc.edu.cn/~zwp/teach/Stat-Comp/Efron_Bootstrap_CIs.pdf

Efron, B., Tibshirani, R. J. (1993) An introduction to the bootstrap. Springer
https://books.google.it/books?hl=en&lr=&id=gLlpIUxRntoC&oi=fnd&pg=PR14&dq=Efron,+B.,+Tibshirani,+R.+J.,+(1993)+An+introduction+to+the+bootstrap.+Springer&ots=AaBr-7Kcy0&sig=M1FW6pzIvh1jgNfOXVKOE5G6uIk
