One of the most common problems in data analysis is to evaluate central tendency, the expected value in a distribution. Central tendency is most commonly measured with the mean or the median. We will learn how to test hypotheses and construct confidence intervals for these.

The arithmetic mean is the sum of observations divided by sample size (n). The statistic mean is symbolized by an X with a bar over the top (x̄), and the parameter mean is symbolized by the Greek letter μ (mu).
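For example (a quick sketch with made-up numbers), the longhand calculation and R's built-in mean() agree:

```r
x <- c(3, 5, 7, 9)     # four hypothetical observations
sum(x) / length(x)     # longhand: sum divided by n, giving 6
mean(x)                # built-in function, giving the same result
```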

The sample mean is an **unbiased estimator** of the population mean, meaning that it has no systematic tendency to overestimate or underestimate the population mean: its expected value is the population mean itself. The sample mean is also an **efficient statistic** compared to the sample median, meaning that sample means tend to lie closer to the population mean than sample medians lie to the population median.
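A short simulation sketch can demonstrate this efficiency: draw many samples from a normal population and compare the spread of the sample means with the spread of the sample medians (the sample size of 25 and the number of trials here are arbitrary choices).

```r
trials <- 10000
means   <- replicate(trials, mean(rnorm(25)))     # many sample means
medians <- replicate(trials, median(rnorm(25)))   # many sample medians
sd(means)     # spread of the sample means
sd(medians)   # larger: medians scatter more widely around their target
```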

To make statistical inferences, we must know how a sample statistic is distributed when sampling from a population with given parameters (that is, the null hypothesis). To understand how the sample mean is distributed, it is critically important to understand the **central limit theorem**, which states:

*The means of samples drawn from a population of any distribution will be approximately normally distributed, and this approximation gets better as sample size increases.*

In other words, as long as sample size is reasonably large, we do not need to be concerned about whether the distribution of the *data* is normal because the distribution of *sample means* will be normally distributed, regardless of how the data are distributed. The central limit theorem applies to any parent distribution, but the distribution of sample means more quickly approximates a normal distribution for parent distributions that more closely resemble a normal distribution.

The normal distribution of sample means has two important characteristics. First, the mean of the sample means is equal to the population mean. Stated differently, the central tendency of the distribution of sample means is the population mean; that is, if we sample a population, our expectation is that our sample mean will equal the population mean. Put succinctly, the **expected value** of a sample mean is the population mean.

Second, the standard deviation of the sample mean distribution equals the standard deviation of the parent population (σ) divided by the square root of sample size (n). This quantity is called the **standard error** of the mean:

σ_{X̄} = σ / √n

A standard error describes the standard deviation of a *sample statistic*, but it is called a standard error to avoid confusing it with the standard deviation of the *data*. Because the standard error of the mean is calculated by dividing the standard deviation by the square root of n, increasing the sample size quickly shrinks the standard error of the mean. In other words, increasing sample size reduces the uncertainty in the mean. As we have seen before and will continue to see, increasing sample size brings many benefits.
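For example, with a population standard deviation of 2, quadrupling the sample size halves the standard error (a simple sketch of the σ/√n relationship; the numbers are arbitrary):

```r
sigma <- 2
n <- c(10, 40, 160)   # each sample size is four times the previous one
sigma / sqrt(n)       # each standard error is half the previous one
```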

We can simulate how the central limit theorem works. For the first simulation, we will assume that what we are measuring is normally distributed, that is, the parent distribution is normal.

dev.new(width=3, height=3)

x <- rnorm(10000)

hist(x, breaks=50, ylab='', xlab='', main='Normal Distribution', col='gray', border='gray')

abline(v=mean(x), col='red', lwd=4)

dev.new()

par(mfrow=c(2,2))

trials <- 1000

lowerLimit <- -4 # Adjust these limits if the code produces an 'Error in hist.default'

upperLimit <- 4 # Adjust these limits if the code produces an 'Error in hist.default'

myBreaks <- seq(lowerLimit, upperLimit, by=0.1)

sampleSize <- c(5, 10, 20, 50)

for (j in 1:length(sampleSize)) {
	sampleMean <- replicate(n=trials, mean(sample(x, size=sampleSize[j])))
	par(mar=c(1, 4, 4, 2) + 0.1)
	hist(sampleMean, ylab='', xlab='', xlim=c(lowerLimit, upperLimit), main=paste('Sample Means, n=',sampleSize[j]), col='red', border='red', breaks=myBreaks)
}

This simulation generates two plots. The first shows the distribution of the data (again, called the parent distribution). The mean of the data, that is, the population mean, is shown by the vertical red line.

The second plot shows the distribution of sample means for sample sizes ranging from 5 (top left) to 50 (bottom right). All are plotted at the same scale.

Notice three things:

- The sample means are normally distributed.
- The mean of the sample means is the same as the mean of the parent population.
- As sample size increases, the variation in the sample means decreases.

We can perform this same simulation, but starting this time from a decidedly non-normal distribution. The exponential distribution is a good choice because it is strongly asymmetric: it has no left tail whatsoever (all of the distribution is in the right tail).

dev.new(width=3, height=3)

x <- rexp(10000)

hist(x, breaks=50, ylab='', xlab='', main='Exponential Distribution', col='gray', border='gray')

abline(v=mean(x), col='red', lwd=4)

dev.new()

par(mfrow=c(2,2))

trials <- 1000

lowerLimit <- 0 # Adjust these limits if the code produces an 'Error in hist.default'

upperLimit <- 4 # Adjust these limits if the code produces an 'Error in hist.default'

myBreaks <- seq(lowerLimit, upperLimit, by=0.1)

sampleSize <- c(5, 10, 20, 50)

for (j in 1:length(sampleSize)) {
	sampleMean <- replicate(n=trials, mean(sample(x, size=sampleSize[j])))
	par(mar=c(1, 4, 4, 2) + 0.1)
	hist(sampleMean, ylab='', xlab='', xlim=c(lowerLimit, upperLimit), main=paste('Sample Means, n=',sampleSize[j]), col='red', border='red', breaks=myBreaks)
}

As before, this simulation generates two plots, and the first one shows the parent distribution and its mean.

The second plot shows the distribution of sample means for sample sizes ranging from 5 (top left) to 50 (bottom right). All are plotted at the same scale.

Again, notice three things:

- As sample size increases, the distribution of sample means better approximates a normal distribution.
- The mean of the sample means is the same as the mean of the parent population.
- As sample size increases, the variation in the sample means decreases.

The central limit theorem is great news, because it lets us know how our sample statistic (the mean) is distributed. Knowing that, we can use it to calculate critical values, p-values, and confidence intervals.

We saw that the spread of the sample mean distribution decreases with sample size, and we saw that this spread reflects the spread of the original data. Unfortunately, we do not know the spread in the original data (that is, the parameter called the population standard deviation, σ) for the usual reason that we never know the parameters of a population. We can use the sample standard deviation (s) to estimate the standard deviation of the population (σ), but sampling introduces uncertainty, and our sample standard deviation will rarely equal the population standard deviation exactly.

We need a distribution that has the approximate shape of a normal distribution, but one that is a little wider to reflect the uncertainty that comes from using the sample standard deviation to approximate the population standard deviation. This somewhat-wider normal-like distribution is called a **t-distribution**. The t-distribution is widely used in statistics.

To test a mean (and other statistics later on), we will use a t-test. A **t-statistic** is calculated as follows:

t = (X̄ − μ) / s_{X̄}

where X̄ is our sample mean, μ is our null hypothesis for the population mean, and s_{X̄} is the standard error of the sample mean, defined above but estimated here from the sample standard deviation (s):

s_{X̄} = s / √n

In R, the longhand way of calculating the t-statistic would work like this, but we will often find even easier ways:

t <- (mean(x) - nullHypothesis) / (sd(x) / sqrt(length(x)))

Using the t-distribution requires two assumptions. First, we must have random sampling, as always. Second, our *statistic* (the mean in this case) must be normally distributed, which the central limit theorem guarantees, provided we have a sufficiently large sample size.

Note that there is widespread confusion over this second point. The t-test does *not* assume that the data are normally distributed; it assumes that the statistic is normally distributed. As a result, many people unnecessarily avoid doing t-tests and use lower-power non-parametric tests because their data are not normally distributed.

As we saw in our simulations, if n is reasonably large, the central limit theorem will ensure that the mean is normally distributed. How large n must be depends on the shape of the parent distribution. If the data are normally distributed, the sample means will always be normally distributed. As the data become more asymmetric or oddly distributed, sample size will need to be larger, typically 20–50, for the central limit theorem to hold.

The shape of a t-distribution is controlled by the degrees of freedom, symbolized by the Greek letter nu (ν). In general, the degrees of freedom equals the sample size (n) minus the number of estimated parameters, that is, the parameters that had to be estimated before the statistic could be calculated. Here, we had to estimate one parameter, the population standard deviation (σ), using the sample standard deviation, so we have n−1 degrees of freedom. Using the degrees of freedom is a correction for double-use of the data; here, the double use is making an inference about the mean and estimating the standard deviation from the same data.

When the degrees of freedom is large (over 100), the t-distribution becomes essentially identical to a normal distribution, that is, it is said to *converge* to a normal distribution. As the degrees of freedom gets smaller, the t-distribution becomes increasingly wider and flatter than a normal distribution, reflecting the greater uncertainty introduced by estimating the population standard deviation with the sample standard deviation. Like the standard normal distribution, the mean of the t-distribution is zero and its units are in standard deviations.
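We can see this convergence directly by comparing the height of the t density at its peak with that of the standard normal, using R's dt() and dnorm() density functions (the degrees of freedom chosen here are arbitrary illustrations):

```r
dnorm(0)          # peak of the standard normal, about 0.399
dt(0, df=5)       # t with 5 df: lower peak, with more probability in the tails
dt(0, df=1000)    # nearly indistinguishable from the normal
```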

T-tests in R use the function t.test(). Here, let’s simulate some data (myData1) and test it against a null hypothesis for the population mean (11.1 in this case). Because we do not have any reason for expecting our sample mean to be greater than or less than this null hypothesis, we will run a two-tailed test, which is the default.

myData1 <- rnorm(n=13, mean=12.2, sd=1.8)

t.test(myData1, mu=11.1)

If we had some external reason (that is, a reason *not* derived from our data) that our mean would be larger than (or smaller than) the null hypothesis, we could run a one-tailed test by setting the alternative parameter. Here is an example of a right-tailed test.

myData1 <- rnorm(n=13, mean=12.2, sd=1.8)

t.test(myData1, mu=11, alternative='greater')

The output of the t.test() command reports the t-statistic, the degrees of freedom, and a p-value (but not the critical t statistic).
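The result of t.test() is a list, so if you need individual values (in a simulation, for example), you can extract them by name; the component names below are the standard ones for R's htest objects:

```r
myData1 <- rnorm(n=13, mean=12.2, sd=1.8)
result <- t.test(myData1, mu=11.1)
result$statistic   # the t-statistic
result$parameter   # the degrees of freedom (n - 1 = 12)
result$p.value     # the p-value
```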

If you want to construct confidence limits on a mean, as we generally should, leave off the alternative and mu parameters; the test then defaults to a two-sided test and reports the confidence interval:

t.test(myData1)

The default confidence limits are 95%, but you can change this with the conf.level parameter.
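For example, for 99% confidence limits (continuing with myData1 from above):

```r
t.test(myData1, conf.level=0.99)
```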

If you want to compare whether two means are equal or whether one is greater than the other, you can perform a t-test on the difference in means. The null hypothesis in these situations is usually that the difference in means is zero (in other words, that the means are the same). Testing for a difference of zero, where the alternative hypothesis is that the means are different, is easy:

myData1 <- rnorm(n=13, mean=12.2, sd=1.8)

myData2 <- rnorm(n=27, mean=13.4, sd=2.1)

t.test(myData1, myData2)

This tests whether the difference in means (myData1 – myData2) is non-zero. As before, running the test returns a t-statistic, a p-value, and confidence limits on the difference in means. Remember when you interpret the confidence interval that it is the confidence interval of myData1 – myData2, not the reverse.

Note that this t-test does not assume that variances are equal (some two-sample t-tests do assume this), so it applies a correction that causes the degrees of freedom to be non-integer. You can turn off this correction, if you are willing to assume equal variances.
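For example, setting var.equal=TRUE runs the classic pooled-variance version, which uses integer degrees of freedom (n1 + n2 − 2, or 38 for the sample sizes above); this continues with myData1 and myData2 from above:

```r
t.test(myData1, myData2, var.equal=TRUE)
```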

Again, remember that if you are comparing two means, *do not* use the values of those means to decide whether to run a one-tailed test. Running a one-tailed test requires something beyond your data; it must be a decision that could be made without seeing the data.

Some experiments apply different treatments (or a control and a treatment) to the same set of samples to check for a difference. For example, we could fill up all our cars with regular gas, measure the mileage on each of them, then fill up those same cars with premium gas, measure mileage again, and check for any change. Because we used the same cars in both cases, we can improve our estimate of the difference in means by using what’s called a paired t-test. The paired t-test is run by setting paired=TRUE:

gas <- read.table('gasoline.txt', header=TRUE, row.names=1, sep=',')

head(gas)

t.test(gas$regular, gas$premium, paired=TRUE)
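A paired t-test is mathematically identical to a one-sample t-test on the within-pair differences, which can be a useful way to think about what it does. A sketch with short hypothetical mileage vectors (standing in for the columns of gasoline.txt, which we do not reproduce here):

```r
regular <- c(28.1, 30.2, 25.7, 31.4, 27.9)    # hypothetical mileages
premium <- c(29.0, 30.8, 26.1, 32.0, 28.5)
t.test(regular, premium, paired=TRUE)$p.value
t.test(regular - premium, mu=0)$p.value       # identical result
```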

In some cases (not often), we need to find critical t-scores ourselves, which we can do from our significance level (α) and our degrees of freedom by using the qt() function.

qt(p=0.05, df=16)

By default, the qt() function works off the left (lower) tail. When we need the t-score for the right (upper) tail, we change the lower.tail parameter:

qt(p=0.05, df=16, lower.tail=FALSE)

If we’re doing a two-tailed test and need both the left and right critical values, simply halve the alpha value:

qt(p=0.05/2, df=16)

The qt() function will report the critical value for the left tail, and the critical value for the right tail will be the same value, but with a positive sign.
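We can confirm this symmetry directly; the two calls return values equal in magnitude and opposite in sign:

```r
qt(p=0.05/2, df=16)                    # left (lower) critical value, negative
qt(p=0.05/2, df=16, lower.tail=FALSE)  # right (upper) critical value, positive
```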

The t.test() function will automatically generate two-tailed confidence intervals, but you can also calculate them longhand. We will see later that other statistics also follow a t-distribution, and we can also use the approach below to calculate confidence intervals on them. The general formula for the confidence interval on a statistic is:

X̄ ± t_{α/2,ν} × s_{X̄}

where X̄ is the sample mean, s_{X̄} is the standard error of the mean, and t_{α/2,ν} is the two-tailed t-score for the significance level (α) and degrees of freedom (ν). For other statistics (such as slope of a line), replace the mean with that other statistic, and use the standard error of that statistic.

In R, this is the longhand approach for calculating the confidence interval on a mean, although remember that using t.test() is the simpler way:

confidence <- 0.95

alpha <- 1 - confidence

nu <- length(myData1) - 1

tscore <- qt(p=alpha/2, df=nu, lower.tail=FALSE)

standardError <- sd(myData1) / sqrt(length(myData1))

leftConfidenceLimit <- mean(myData1) - tscore * standardError

rightConfidenceLimit <- mean(myData1) + tscore * standardError
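As a check, these longhand limits match the interval that t.test() reports at the same confidence level (continuing with the variables defined above):

```r
c(leftConfidenceLimit, rightConfidenceLimit)      # longhand interval
t.test(myData1, conf.level=confidence)$conf.int   # same limits from t.test()
```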

In some cases, our sample size is too small to be sure that the central limit theorem will produce a normal distribution of sample means, so we must turn to a non-parametric test. There are many to choose from, but one of the most common is the **Wilcoxon Rank Sum test**, also called the **Mann-Whitney U test**. The Wilcoxon Rank Sum test is used to test whether the medians of two samples differ. It is performed in R with wilcox.test().

wilcox.test(myData1, myData2, alternative="less")

In this case, we set our alternative hypothesis to be that the median of myData1 is less than the median of myData2. Note that this function reports the W statistic and a p-value.

Realize that a small sample size carries a severe penalty: we not only have low power because of the sample size, but we further decrease our power by using a non-parametric test. The recommendation is clear: increase your sample size to the limits of your time, money, and attention.