Lecture Notes



Steven Holland

Resampling Methods

There are several ways we can run into problems by using traditional parametric and non-parametric statistical methods. For example, our sample size may be too small for the central limit theorem to ensure that sample means are normally distributed, so classically calculated confidence limits may not be accurate. We may not wish to resort to non-parametric tests, which have low power. We may be interested in a statistic for which the underlying theoretical distribution is unknown. Without a distribution, we cannot calculate critical values, p-values, or confidence limits.

Resampling methods are one solution to these problems. Resampling methods allow us to use our data to build a distribution from which we can obtain critical values, calculate p-values, or construct confidence intervals.

Resampling methods have several advantages. They make no assumptions about the shape of the parent distribution, other than that the sample is a good reflection of that distribution, and this is just the assumption of random sampling that underlies all statistics. Resampling methods can be applied to any problem, even when there is no theoretical distribution of the statistic. They are less sensitive to outliers than parametric methods. Ties (observations having the same value) don’t pose the problem they do in non-parametric methods. Resampling methods often have greater power than non-parametric methods, often approaching or even exceeding the power of parametric methods. They are simple to use and program. They are flexible, and they are intuitive.

There are few drawbacks to resampling methods. The main one is that they can be more laborious to perform than a traditional parametric or non-parametric test. They are also less familiar to other scientists, which can hinder their acceptance, although this is becoming less of an issue. The methods are somewhat less standardized, and sometimes they are computationally expensive.

There are four main types of resampling methods: randomization, Monte Carlo, bootstrap, and jackknife. In some situations, these are used to build the distribution of the null hypothesis, which can be used to generate critical values or p-values. In other situations, these methods are used to build the distribution based on our data, which can be used to generate confidence intervals on an estimate of a parameter.

Monte Carlo

Monte Carlo methods are typically used in simulations where the parent distribution is either known or assumed. Without saying so, I have used Monte Carlo methods all semester to illustrate how statistical techniques work.

For example, suppose we wanted to simulate the distribution of an F-statistic for two samples drawn from a normal distribution. We would specify the mean and variance of the normal distribution, generate two samples of a given size, and calculate the ratio of their variances. We could repeat this process many times to produce a frequency distribution of the F-statistic, from which we could calculate p-values or critical values.

Here is an example of a Monte Carlo in R.

# Declare the sizes of the two samples
n1 <- 12
n2 <- 25

# Function to simulate one F ratio for two samples
# of known size
simulateOneF <- function(sampleSize1, sampleSize2) {
  data1 <- rnorm(sampleSize1)
  data2 <- rnorm(sampleSize2)
  var(data1) / var(data2)
}

# Call that function many times to build the distribution
F <- replicate(10000, simulateOneF(n1, n2))

We can calculate a p-value as we did for randomization, by finding the proportion of Monte Carlo values that equal or exceed our observed F-statistic. In this example, we will assume the F-statistic for our data was 3.5. We are calculating a one-tailed (right) p-value in this case, so we do not multiply the p-value by two. If we wanted a two-tailed p-value, we would multiply by two.

Fobserved <- 3.5
pvalue <- length(F[F>=Fobserved])/length(F)

Once we have the expected distribution of a statistic, the p-value is calculated the same way. Because this step is so general, it is best to turn it into a function that can be tested. Once we are sure that the function works, we can simply call it and minimize coding errors.

pvalue <- function(x, observed) {
  nExtreme <- length(x[x >= observed])
  nTotal <- length(x)
  nExtreme / nTotal
}

Calling this is simple: the first argument is the distribution provided by the Monte Carlo (or another resampling method), and the second argument is the observed statistic.

pvalue(F, Fobserved)

To summarize, there are three essential steps in a Monte Carlo:

1. Specify the parent distribution, and draw simulated samples from it.
2. Calculate the statistic of interest on those simulated samples.
3. Repeat this many times to build the distribution of the statistic, from which critical values and p-values can be calculated.


Randomization

Randomization is a simple resampling method, and a simple thought model illustrates the approach. Suppose we have two samples, A and B, on which we have measured a statistic (the mean, for example). One way to ask if the two samples have significantly different means is to ask how likely we would be to observe their difference of means if the observations had been randomly assigned to the two groups. We could simulate that, randomly assigning each observation to group A or B, then calculating the difference of means, and repeating this process until we built up a distribution. By doing this, we have effectively built the distribution of the difference in means under the null hypothesis that the two samples came from one population.

From this distribution, we could find critical values against which we could compare our observed difference in means. We could also use this distribution to calculate the p-value of the observed statistic.

The essential steps in randomization are:

1. Pool the observations from all of the groups.
2. Randomly reassign the observations to groups of the original sizes, and calculate the statistic of interest.
3. Repeat this many times to build the distribution of the statistic under the null hypothesis.

Here is an example of performing a randomization to evaluate the difference in means of two samples.

# The data
x <- c(0.36, 1.07, 1.81, 1.67, 1.16, 0.18, 1.26, 0.61, 0.08, 1.29)
y <- c(1.34, 0.32, 0.34, 0.29, 0.64, 0.20, 0.50, 0.85, 0.38, 0.31, 2.50, 0.69, 0.41, 1.73, 0.62)

# Function for calculating one randomized difference in means
randomizedDifferenceInMeans <- function(x, y) {
  nx <- length(x)
  ny <- length(y)

  # combine the data
  combined <- c(x, y)
  ncom <- length(combined)
  indices <- 1:ncom

  # initially assign all observations to y group
  group <- rep('y', ncom)

  # assign a random subsample to the x group
  xsub <- sample(indices, nx)
  group[xsub] <- 'x'

  # calculate the difference in means
  meanX <- mean(combined[group == 'x'])
  meanY <- mean(combined[group == 'y'])
  meanX - meanY
}

# Repeat that function many times to build the distribution
diffMeans <- replicate(10000, randomizedDifferenceInMeans(x,y))

At this point, we could do several things. Because we have a distribution for the null hypothesis, we could compare the observed difference in means to the two-tailed critical values. From this, we could accept or reject the null hypothesis of no difference in means. The quantile() function makes finding the critical values easy.

observedDifference <- mean(x) - mean(y)
alpha <- 0.05
quantile(diffMeans, alpha/2)
quantile(diffMeans, 1-alpha/2)

Sometimes, it is helpful to show the distribution with the critical values and the observed value. This allows readers to understand the shape of the distribution and to visualize where the observed value fell relative to the critical values.

hist(diffMeans, breaks=50, col='gray', xlab='Difference of Means', main='')

observedDifference <- mean(x) - mean(y)
abline(v=observedDifference, lwd=4, col='black')

alpha <- 0.05
abline(v=quantile(diffMeans, alpha/2), col='red', lwd=2)
abline(v=quantile(diffMeans, 1-alpha/2), col='red', lwd=2)

We could also calculate the two-tailed p-value for our observed difference in means. To do this, we count the number of randomized values equal to or more extreme than our observed difference. We then calculate the proportion of all randomized values this represents, and we multiply it by two, since this is a two-tailed test.

observedDifference <- mean(x) - mean(y)
extreme <- length(diffMeans[diffMeans>=observedDifference])
total <- length(diffMeans)
pvalue <- 2 * extreme/total


Bootstrap

One of the most powerful methods for generating confidence limits is the bootstrap. The name bootstrap comes from the idea of “lifting yourself up by your own bootstraps”. In this case, it means treating the data as the best description of the frequency distribution of the population. In a sense, this is equivalent to the assumption of random sampling that underlies all statistical tests.

The principle behind a bootstrap is to draw elements from the data to build up a simulated sample of the same size as the original sample. This sampling is done with replacement, meaning that a particular observation might get randomly sampled more than once. The statistic is then calculated on this bootstrapped sample, and this process is repeated many times (thousands of times or more) to build a distribution for the statistic. The mean of these bootstrapped values is the estimate of the parameter.

This distribution of bootstrapped values is also used to construct a confidence interval on the estimate of the parameter. To do this, the statistics in the distribution are sorted from lowest to highest. The α/2 quantile (e.g., 0.025) and the 1-α/2 quantile (e.g., 0.975) are the lower and upper confidence limits. For example, if we calculated our statistic 10,000 times and wanted 95% confidence limits, we would find the elements at rows 250 and 9750 of our sorted vector of statistics. The quantile() function does this automatically, saving us the trouble of sorting and finding the cutoffs.
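As a quick illustration, the sorted-element approach and quantile() give essentially the same limits. Here, rnorm() is only a stand-in for a vector of 10,000 bootstrapped statistics:

```r
# Simulated stand-in for 10,000 bootstrapped statistics
set.seed(42)
stats <- rnorm(10000)

# Manual approach: sort, then take rows 250 and 9750
sorted <- sort(stats)
lower <- sorted[250]
upper <- sorted[9750]

# quantile() gives nearly the same limits, without the manual sorting
quantile(stats, 0.025)
quantile(stats, 0.975)
```

The two approaches can differ trivially because quantile() interpolates between adjacent sorted values.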

The reliability of the bootstrap improves with the number of bootstrap replicates. Increasing the number of replicates increases the computation time, but this is rarely the problem that it was in the past.

One-variable bootstrap

This example illustrates a bootstrap of a sample standard deviation. This could be modified easily for any statistic calculated on a single variable.

# The data
x <- c(1.61, 0.84, 0.01, 0.35, 0.37, 1.13, 0.25, 1.25, 3.24, 0.68)

# Function that calculates the standard deviation of one bootstrapped sample
sdOfOneBootstrap <- function(x) {
  bootstrappedSample <- sample(x, size=length(x), replace=TRUE)
  sd(bootstrappedSample)
}

# Call that function many times to build the distribution
sds <- replicate(10000, sdOfOneBootstrap(x))

From this distribution of bootstrapped samples, we can calculate our estimate of the parameter (the population standard deviation in this example) and place a confidence interval on that estimate.

parameterEstimate <- mean(sds)
alpha <- 0.05
quantile(sds, alpha/2)
quantile(sds, 1-alpha/2)

Two-variable bootstrap

We can also use a bootstrap when our statistic is calculated based on two variables, such as a correlation coefficient or a slope. This is only slightly more complicated than the previous example, because we need to sample rows of data, that is, to randomly draw both variables that correspond to a single data point or observation. Independently selecting an element for each variable is a common mistake when constructing confidence intervals, because it destroys the correlation structure in the data set. If elements of each variable are selected independently, this builds the distribution for the null hypothesis (one lacking the observed correlations among the variables), which could be used to calculate critical values or p-values.

A two-variable bootstrap is illustrated for a correlation coefficient:

# The data, plotted
x <- c(3.37, 2.04, 0.67, 0.99, 0.91, 2.25, 5.54, 2.67, 2.43, 2.00)
y <- c(12.30, 9.91, 5.92, 8.56, 6.51, 10.53, 19.22, 13.52, 10.47, 9.03)
plot(x, y, pch=16)

# Function to calculate r on one bootstrapped sample
bootstrapR <- function(x, y) {
  i <- 1:length(x)
  theSample <- sample(i, length(x), replace=TRUE)
  cor(x[theSample], y[theSample])
}

# Repeat that function many times to build distribution of r
r <- replicate(10000, bootstrapR(x,y))

The parameter estimate and its confidence interval are found as in the single-variable case.

parameterEstimate <- mean(r)
alpha <- 0.05
quantile(r, alpha/2)
quantile(r, 1-alpha/2)

Multivariate bootstrap

The two-variable approach can be generalized to multivariate data. For constructing confidence intervals based on multivariate data, randomly select a row as in the two-variable case, and use all of the observations for that case as the bootstrapped sample is built. If observations are instead selected independently for each variable (as opposed to selecting all the variables for a random case), the distribution of the null hypothesis is constructed, which can be used for finding p-values and critical values.
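A minimal sketch of this row-based resampling, using a made-up data matrix and the mean of the first variable as a placeholder statistic:

```r
# Made-up multivariate data: 10 cases (rows), 3 variables (columns)
set.seed(1)
dataMatrix <- matrix(rnorm(30), nrow=10, ncol=3)

# Resample whole rows so that each bootstrapped case keeps all of its
# variables together, preserving the correlation structure
bootstrapRows <- function(m) {
  rows <- sample(1:nrow(m), nrow(m), replace=TRUE)
  m[rows, ]
}

# Bootstrap any multivariate statistic; here, the mean of variable 1
boots <- replicate(1000, mean(bootstrapRows(dataMatrix)[, 1]))
quantile(boots, c(0.025, 0.975))
```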

Some methods (such as non-metric multidimensional scaling) do not permit cases that are identical, and this will cause a problem for a multivariate bootstrap. One common solution is to add a very small random number to each value as that case is selected, and this will prevent the problem of identical cases. For example, if data values range from 1–10, the random number could be on the scale of 0.001, enough to prevent values from being identical, but not enough to substantially alter the data.
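One way this tie-breaking might be sketched; the noise scale (0.001) is an assumption suited to data ranging from about 1–10:

```r
# Resample rows, then add a tiny random offset to every value so that
# repeated cases are no longer exactly identical
bootstrapRowsJittered <- function(m, noiseScale=0.001) {
  rows <- sample(1:nrow(m), nrow(m), replace=TRUE)
  resampled <- m[rows, ]
  resampled + rnorm(length(resampled), sd=noiseScale)
}
```

Because the offsets are far smaller than the data values, they break ties without materially altering the statistic.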


To summarize, a bootstrap has three steps:

1. Sample the data with replacement to create a bootstrapped sample of the same size as the original sample.
2. Calculate the statistic of interest on the bootstrapped sample.
3. Repeat this many times to build a distribution of the statistic, from which the parameter estimate and its confidence interval are calculated.

The bootstrap should be used with caution when sample size is small. At small sample sizes, the data may not adequately represent the population’s distribution, violating the single assumption of the bootstrap.

The bootstrap is an incredibly versatile technique. All scientists should know how to implement one.


Jackknife

The jackknife is another technique for estimating a parameter and placing confidence limits on that estimate. The name is a reference to cutting the data, because one removes a single observation, calculates the statistic without that one value, then repeats that same process in turn for each observation (remove just one value, calculate the statistic). This produces a distribution of jackknifed values of our statistic, from which a parameter estimate and confidence interval are calculated.

Specifically, suppose we are interested in parameter K, but we have only an estimate of it (the statistic k) that is based on the entire sample of n observations.

To form the first jackknifed sample consisting of n-1 observations, remove the first observation in the data. Calculate the statistic (k-i) on that sample, where the -i subscript means all of the observations except the ith. From this statistic, calculate the pseudovalue κi as k - (n-1)(k-i - k).

Repeat this process by replacing the first observation and removing the second observation, to make a new sample of size n-1. Calculate the statistic and pseudovalue as before. Repeat this for every possible observation until there are n pseudovalues, each based on a sample of size n-1. From these n pseudovalues, we can estimate the parameter, its standard error, and the confidence interval.

The jackknifed estimate of the parameter K is equal to the mean of the pseudovalues.

The jackknifed estimate of the standard error of K is equal to sd(κ) / sqrt(n), that is, the standard deviation of the pseudovalues divided by the square root of sample size (n).

The jackknifed confidence limits are equal to K ± t * SE, where K is the jackknifed estimate of the parameter, t is the two-tailed critical t-score with n-1 degrees of freedom, and SE is the jackknifed estimate of the standard error.

There are two caveats for the jackknife. First, the confidence intervals assume that the jackknife estimate is normally distributed, which is justified by the central limit theorem for large samples. If our sample size is small, the central limit theorem may not assure normality. Second, the pseudovalues are not independent; they are necessarily correlated.

Here is a demonstration of a jackknife, using the standard deviation as the statistic of interest. Although it is not necessary for a problem as simple as this, I wrap it inside the function calculateStatistic() to emphasize that if we wanted to jackknife a different statistic, all we would have to do is change the contents of the calculateStatistic() and the rest of the code would be unchanged.

As usual, we start with the data and our function.

x <- c(0.12, 0.97, 1.03, 0.10, 0.20, 0.22, 0.63, 0.21, 2.36, 3.07, 0.91, 0.65, 0.71, 0.13, 2.37, 1.44, 0.55, 1.31, 0.29, 0.25, 2.32, 1.84, 1.03, 0.42, 0.34, 0.37, 0.02, 1.46, 0.62, 0.73, 1.01, 0.37, 2.23, 0.10, 0.53, 0.31, 0.38, 0.20, 5.12, 2.02, 2.35, 0.07, 0.03, 0.13, 2.24, 0.70, 0.19, 1.07, 1.88, 2.61)

calculateStatistic <- function(x) {
  sd(x)
}

Next, calculate the pseudovalues. First, make a vector to hold pseudovalues, then loop through the data, removing each observation one at a time.

pseudovalues <- vector(length=length(x), mode="numeric")
theEstimate <- calculateStatistic(x)
for (i in 1:length(x)) {
  subset <- x[-i]
  subsetEstimate <- calculateStatistic(subset)
  pseudovalues[i] <- theEstimate - (length(x) - 1)*(subsetEstimate - theEstimate)
}

Note: many argue that loops should be avoided in R at all costs. What follows in the next block is a way to calculate the pseudovalues without loops. As usual, the solution is to use apply() or one of its kin. Here, we use sapply(), because it can be applied to a vector, as opposed to apply(), which works on a matrix. We supply sapply() with an anonymous (unnamed) function (in the curly braces), which applies those operations element by element along x. As a closure, this function captures y and theEstimate from its environment. This is fairly advanced, so if the loop makes more sense to you, then use it.

y <- x
theEstimate <- calculateStatistic(x)
pseudovalues <- sapply(x, function(x) {
  i <- min(which(y == x))
  subset <- y[-i]
  subsetEstimate <- calculateStatistic(subset)
  theEstimate - length(subset)*(subsetEstimate - theEstimate)
})

Regardless of which way we calculate the pseudovalues, we can calculate our estimate, standard error, and confidence limits.

estimate <- mean(pseudovalues)
n <- length(pseudovalues)
stdError <- sd(pseudovalues)/sqrt(n)

# confidence limits
alpha <- 0.05
estimate + qt(p=alpha/2, df=n-1) * stdError  # lower limit: qt() is negative here
estimate - qt(p=alpha/2, df=n-1) * stdError  # upper limit

To summarize, the jackknife has three main steps:

1. Remove one observation, and calculate the statistic on the remaining n-1 observations.
2. Convert that statistic to a pseudovalue, then repeat for every observation in turn.
3. Use the mean and standard deviation of the n pseudovalues to calculate the parameter estimate, its standard error, and its confidence limits.

