Lecture Notes



Steven Holland


Understanding and characterizing variation in samples is an important part of statistics. Variation can be measured with several different statistics. Range is the difference between the largest and smallest values in a distribution. Because range tends to increase with sample size, and because it is highly sensitive to outliers, it is often not a desirable measure of variation. In R, the range() function returns the smallest and largest values; wrapping it in diff(), as in diff(range(x)), gives the range itself.
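
As a minimal sketch of the range calculation, the vector shells below holds made-up measurements used only for illustration:

```r
# Hypothetical shell lengths in mm (illustrative values, not real data)
shells <- c(12.1, 15.3, 9.8, 14.2, 11.7)

limits <- range(shells)      # smallest and largest values: 9.8 15.3
shellRange <- diff(limits)   # largest minus smallest: 5.5
```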

Variance is a much more useful measure of variation. Variance of a population is equal to the average squared deviation of every observation from the population mean. It is symbolized by a lowercase Greek sigma squared (σ²).

population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

For samples, the population mean is unknown, so variance is calculated as the sum of squared deviations of every observation from the sample mean, divided by the degrees of freedom (n - 1). Dividing by n - 1 rather than n is the penalty for using the sample mean as an estimate of the population mean. Sample variance is symbolized by a lowercase Roman s squared (s²).

sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

In R, sample variance is calculated with the var() function.
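
To connect the definition to the function, here is a small sketch that computes sample variance by hand and compares it with var(). The data vector x is invented for illustration.

```r
# Hypothetical measurements (illustrative values only)
x <- c(12.1, 15.3, 9.8, 14.2, 11.7)
n <- length(x)

# Sum of squared deviations from the sample mean, divided by n - 1
manualVariance <- sum((x - mean(x))^2) / (n - 1)

# Built-in sample variance; should agree with the manual calculation
builtInVariance <- var(x)
all.equal(manualVariance, builtInVariance)
```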

Because variance is the average squared deviation from the mean, the units of variance are the square of the original measurements. For example, if shells are measured in millimeters, variance has units of mm². Taking the square root of variance gives standard deviation, which has the same units as the original measurements, making it a more easily understood quantity. In R, sample standard deviation is calculated with the sd() function.

A normal distribution is scaled by the standard deviation, with 68% of the observations within one standard deviation of the mean, 95% of the observations within two standard deviations of the mean, and 99.7% of the observations within three standard deviations. These are good rules of thumb to remember, particularly that plus or minus two standard deviations encompasses 95% of the distribution.
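
These rules of thumb can be checked against the normal cumulative distribution function, pnorm(). The sketch below is only a verification aid, and the variable names are arbitrary.

```r
# Proportion of a normal distribution within k standard deviations of the mean
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)
round(coverage, 4)   # 0.6827 0.9545 0.9973

# The multiplier that encloses exactly 95% is slightly less than two
exact95 <- qnorm(0.975)   # about 1.96
```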

Less commonly used is the coefficient of variation, which is the standard deviation divided by the mean and therefore a dimensionless number.

coefficient of variation: $CV = \frac{s}{\bar{x}}$

There is no built-in function for the coefficient of variation in R, but a function is straightforward:

CV <- function(x) {sd(x) / mean(x)}
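
A brief usage sketch, with made-up measurements: because the coefficient of variation is dimensionless, converting the data to different units should leave it unchanged.

```r
CV <- function(x) {sd(x) / mean(x)}

# Hypothetical shell lengths (illustrative values only)
lengthsMm <- c(12.1, 15.3, 9.8, 14.2, 11.7)
lengthsCm <- lengthsMm / 10   # same shells, expressed in centimeters

cvMm <- CV(lengthsMm)
cvCm <- CV(lengthsCm)
all.equal(cvMm, cvCm)   # the unit change cancels out
```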

Testing for equal variances in two samples

We often want to compare two sample variances to test whether their populations have the same variance (this is another example of the so-called zero-difference null hypothesis). Testing for a difference in variance is generally done with an F statistic, the ratio of the two sample variances. The F statistic is named for its discoverer, the biostatistician R.A. Fisher (of p-value and Modern Synthesis fame), and it is therefore sometimes called Fisher’s F.

F statistic: $F = \frac{s_1^2}{s_2^2}$

As with any test, we must have a distribution of expected values of the statistic to make inferences or test any hypotheses. For the F statistic, it is easy to simulate what the expected distribution would be if two random samples were drawn from a population that was normally distributed.

sampleSize1 <- 12
sampleSize2 <- 25
numTrials <- 10000
F <- replicate(numTrials, var(rnorm(sampleSize1))/var(rnorm(sampleSize2)))
hist(F, breaks=50, col='salmon', main=paste('n1=',sampleSize1,', n2=',sampleSize2))

[Figure: histogram of the simulated F distribution]

This distribution should make sense. The mean is near one, which is expected if two samples come from the same population: if the two variances ought to be nearly equal, their ratio should be near one. Because variance must be positive, the ratio can have only positive values. In the extreme case, the variance in the numerator might be very large and the one in the denominator might be close to zero, so the right tail of the distribution should be long. The likelihood of such extreme values must depend on the sample size of the two samples, so the shape of an F distribution reflects the degrees of freedom for the numerator and denominator.

In practice, F distributions come from analytic solutions, not from simulations. These analytic solutions assume that both samples come from normal distributions.
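
One way to see that the simulation and the analytic solution agree is to compare a quantile from each. This is a sketch under the same sample sizes used above; the seed and the name simulatedF are arbitrary choices made here, and the two values should be close but will not match exactly.

```r
set.seed(42)   # arbitrary seed, only so the sketch is repeatable
sampleSize1 <- 12
sampleSize2 <- 25
numTrials <- 10000

# Simulated F ratios from two samples drawn from the same normal population
simulatedF <- replicate(numTrials, var(rnorm(sampleSize1)) / var(rnorm(sampleSize2)))

# 95th percentile from the simulation and from the analytic F distribution
simulated95 <- quantile(simulatedF, 0.95)
analytic95 <- qf(0.95, df1 = sampleSize1 - 1, df2 = sampleSize2 - 1)
c(simulated = unname(simulated95), analytic = analytic95)
```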

As with all distributions, R comes with functions to explore the F distribution, which include df() to find the height of the distribution from F and the two degrees of freedom; pf() to obtain a p-value from an observed F and the two degrees of freedom; qf() to obtain a critical value from alpha and the two degrees of freedom; and rf() to obtain random variates from an F distribution from the two degrees of freedom. Here is the same distribution that we simulated above, but calculated analytically with df().
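
The following sketch shows pf() and qf() in use. The observed F of 2.5 is a hypothetical value chosen for illustration. Note that pf() returns the lower-tail probability by default, so lower.tail=FALSE is used here to get the right-tail area.

```r
df1 <- 12 - 1   # numerator degrees of freedom
df2 <- 25 - 1   # denominator degrees of freedom
Fobserved <- 2.5   # hypothetical observed F ratio

# Right-tailed p-value: probability of an F this large or larger
pRight <- pf(Fobserved, df1, df2, lower.tail = FALSE)

# Right-tailed critical value for alpha = 0.05
criticalF <- qf(0.05, df1, df2, lower.tail = FALSE)

# Round trip: the tail area beyond the critical value is alpha
pf(criticalF, df1, df2, lower.tail = FALSE)   # 0.05
```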

F <- seq(from=0, to=5, by=0.01)
density <- df(F, df1=12-1, df2=25-1)
plot(F, density, type='l', lwd=2, las=1)

[Figure: analytic F distribution for 11 and 24 degrees of freedom]

Once we have a statistic and have the distribution of expected values, we can perform statistical tests on the null hypothesis. In this case, the null hypothesis is usually that the two samples were drawn from populations with the same variance (i.e., the zero-difference hypothesis). The F test assumes random sampling (like all tests), and it assumes that the populations are normally distributed, making this a parametric test. This last assumption must be verified, as many types of data are non-normally distributed. Use a data transformation to bring the data closer to normality; if that doesn’t work, you will need to use a nonparametric test.
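
The notes do not say how normality should be verified. One possible approach, offered here only as an assumption, is the Shapiro-Wilk test in shapiro.test(), where a small p-value suggests a departure from normality. The samples below are simulated and their names are arbitrary.

```r
set.seed(7)   # arbitrary seed, only so the sketch is repeatable

# Simulated samples standing in for real data
sample1 <- rnorm(50, mean = 3, sd = 2)
sample2 <- rnorm(44, mean = 2, sd = 1.4)

# Shapiro-Wilk test on each sample (one possible check, not the only one)
normality1 <- shapiro.test(sample1)
normality2 <- shapiro.test(sample2)
normality1$p.value
normality2$p.value
```

A normal quantile plot made with qqnorm() and qqline() gives a visual check of the same assumption.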

An F test is performed in R with the var.test() function. Here is a demonstration using two simulated data sets:

mydata1 <- rnorm(50, mean=3, sd=2)
mydata2 <- rnorm(44, mean=2, sd=1.4)
var.test(mydata1, mydata2)

The output provides the F statistic, a p-value, and a 95% confidence interval on the ratio of variances. The default test is two-tailed, so the p-value is the same regardless of which variance you put in the numerator and which in the denominator.
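
The pieces of the output can also be pulled out of the object that var.test() returns. The sketch below uses freshly simulated data with an arbitrary seed, recomputes the two-tailed p-value from pf(), and checks that swapping the samples leaves the p-value unchanged.

```r
set.seed(3)   # arbitrary seed, only so the sketch is repeatable
mydata1 <- rnorm(50, mean = 3, sd = 2)
mydata2 <- rnorm(44, mean = 2, sd = 1.4)

result <- var.test(mydata1, mydata2)
result$statistic   # F, the ratio of the two sample variances
result$parameter   # numerator and denominator degrees of freedom
result$p.value     # two-tailed p-value
result$conf.int    # 95% confidence interval on the ratio of variances

# The same two-tailed p-value computed by hand from the F distribution
Fobserved <- var(mydata1) / var(mydata2)
pLower <- pf(Fobserved, length(mydata1) - 1, length(mydata2) - 1)
manualP <- 2 * min(pLower, 1 - pLower)

# Swapping numerator and denominator gives the same two-tailed p-value
swappedP <- var.test(mydata2, mydata1)$p.value
```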

You can specify alternative hypotheses in cases in which you suspect variance in one sample ought to be larger than in another. Remember, you cannot examine your data to decide if you want to test whether one variance is larger than another. As always, the number of tails of a test is never determined by looking at the data.
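
In var.test(), a one-tailed alternative is set with the alternative argument. The sketch below assumes that, before collecting any data, there was reason to expect the first sample to be the more variable one. The data are simulated and the variable names are illustrative.

```r
set.seed(5)   # arbitrary seed, only so the sketch is repeatable
moreVariable <- rnorm(50, mean = 3, sd = 2)
lessVariable <- rnorm(44, mean = 2, sd = 1.4)

# Alternative hypothesis: variance of the first sample exceeds that of the second
oneTailed <- var.test(moreVariable, lessVariable, alternative = "greater")
oneTailed$p.value

# The same right-tailed p-value computed directly from the F distribution
Fobserved <- var(moreVariable) / var(lessVariable)
manualP <- pf(Fobserved, length(moreVariable) - 1, length(lessVariable) - 1, lower.tail = FALSE)
```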

An important note about all statistical tests

Both the t-test and the F test have assumptions. For example, both require random sampling. The F-test requires that the data be normally distributed, but the t-test requires that the sample statistic (the sample means) be normally distributed. These assumptions are necessary to generate the distribution of expected outcomes based on a null hypothesis.

When you perform a statistical test, you are actually testing the null hypothesis and the assumptions of the test. If you verify that the assumptions are upheld prior to conducting the test, the decision to accept or reject applies only to the null hypothesis. However, if you do not check the assumptions, a rejection of the null hypothesis may mean instead that the assumptions are not valid. This is not good, because it means that you cannot interpret the results. For example, if your results suggest that you should reject the null, they may instead be evidence that you did not sample randomly. The bottom line is that you must verify the assumptions of a test prior to performing it.

Testing multiple variances: Bartlett test

In some cases, you may have multiple variances that you wish to compare. To do this, use Bartlett’s test for the homogeneity of variances. Bartlett’s test is a parametric test: it assumes that the data are normally distributed and that you have random sampling. In R, a Bartlett’s test is run with bartlett.test(). Your data should be set up with the measurement variable in one vector and a grouping variable in a second vector. The test is run like this:

bartlett.test(myMeasurementVariable, myGroupingVariable)
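
Here is a sketch of that data layout with three simulated groups. The group labels, sample sizes, and standard deviations are invented for illustration.

```r
set.seed(9)   # arbitrary seed, only so the sketch is repeatable

# Measurement variable: three groups of 30 with different standard deviations
myMeasurementVariable <- c(rnorm(30, mean = 10, sd = 1),
                           rnorm(30, mean = 10, sd = 1.5),
                           rnorm(30, mean = 10, sd = 3))

# Grouping variable: one label per measurement (hypothetical site names)
myGroupingVariable <- factor(rep(c("siteA", "siteB", "siteC"), each = 30))

result <- bartlett.test(myMeasurementVariable, myGroupingVariable)
result$statistic   # Bartlett's K-squared
result$parameter   # degrees of freedom: number of groups minus one
result$p.value
```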

The non-parametric Ansari-Bradley test

If your data are not normally distributed, you can try a data transformation (such as the log transformation) to see if the distributions can be made normal. If a data transformation does not produce normality, you will need to use a non-parametric test, such as the Ansari-Bradley test.
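
As a sketch of the log transformation, the example below simulates right-skewed data and checks it before and after transforming. The Shapiro-Wilk test used for the check is an assumption made here, not something the notes prescribe.

```r
set.seed(11)   # arbitrary seed, only so the sketch is repeatable

# Simulated right-skewed (log-normal) data standing in for real measurements
skewed <- rlnorm(200, meanlog = 1, sdlog = 1)

# Check normality of the raw data, then of the log-transformed data
pRaw <- shapiro.test(skewed)$p.value
transformed <- log(skewed)
pLog <- shapiro.test(transformed)$p.value
c(raw = pRaw, logTransformed = pLog)
```

If the transformed data still depart strongly from normality, that is the point at which a non-parametric test becomes the better choice.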

The Ansari-Bradley test only assumes random sampling. It is called in R like this:

ansari.test(mydata1, mydata2)

The output consists of the AB statistic and a p-value. As with all non-parametric tests, confidence limits on a population parameter are by definition not possible.
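
The sketch below shows how those two pieces of output can be extracted. The samples are simulated with the same center but different spreads, since the Ansari-Bradley test is usually described as a comparison of scale between samples that share a location. The variable names are illustrative.

```r
set.seed(13)   # arbitrary seed, only so the sketch is repeatable

# Two simulated samples with the same center but different spread
group1 <- rnorm(50, mean = 0, sd = 2)
group2 <- rnorm(44, mean = 0, sd = 1.4)

result <- ansari.test(group1, group2)
result$statistic   # the AB statistic
result$p.value
```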