Steven Holland

# Hypothesis testing

## Probability

Probability describes the frequency of observing some outcome that is subject to chance. Probability may be expressed as a decimal or as a percentage, but is always between 0 and 1 (0–100%) inclusive.

In a game of chance, probability is easy to imagine. For example, there is some probability that we could roll a 2 on a die, get a heads in a coin flip, or draw a royal flush in a game of poker.

For scientists, chance enters our world primarily through how we sample a population. For example, if we measured the heights of a dozen randomly selected trees and calculated their mean height, that mean would be subject to variation because of the trees we happened to measure. If we measured a different dozen trees, we would get a different mean height. As such, all of our scientific data is subject to sampling and probability.
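
A minimal sketch of this idea in R, assuming a made-up population of tree heights (the mean of 20 m and standard deviation of 4 m are purely illustrative, not real data). Two different samples of a dozen trees give two different means.

```r
set.seed(2)  # arbitrary seed, used only so the result is repeatable

# Hypothetical population of tree heights in meters (assumed values)
treeHeights <- rnorm(5000, mean=20, sd=4)

# Two independent samples of a dozen trees each
firstMean <- mean(sample(treeHeights, size=12))
secondMean <- mean(sample(treeHeights, size=12))
c(firstMean, secondMean)
```

Rerunning the last three lines without resetting the seed gives yet another pair of means.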

Probabilities of different outcomes are generally not the same; some outcomes are often more probable than others. Regardless of the individual probabilities, the sum of probabilities of all possible outcomes equals 100%. We can visualize this as the area under a probability distribution; in other words, we can integrate the probability distribution, and the total area is always 100%.
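
As a quick check of this statement, the sketch below sums the probabilities of a fair die (a discrete case) and integrates a standard normal density (a continuous case). Both distributions are chosen here only for illustration.

```r
# Discrete case: the six faces of a fair die
dieProbabilities <- rep(1/6, times=6)
totalDiscrete <- sum(dieProbabilities)
totalDiscrete

# Continuous case: area under the standard normal density
totalArea <- integrate(dnorm, lower=-Inf, upper=Inf)$value
totalArea
```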

Probability has two meanings in the frequentist definition, which we will use for most of this course and which most data analysts use (although its greater popularity does not make it better). For a frequentist, probability is the chance of a given outcome in one trial, but it can also be interpreted as the proportion of times an outcome will occur over many trials.
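
The second interpretation is easy to illustrate with a simulation. In this sketch, which assumes a fair six-sided die, the proportion of rolls that come up 2 over many trials approaches the single-trial probability of 1/6.

```r
set.seed(1)  # arbitrary seed, used only so the result is repeatable

# Roll a fair die many times
rolls <- sample(1:6, size=100000, replace=TRUE)

# Proportion of rolls that were a 2, compared with the single-trial probability
proportionOfTwos <- mean(rolls == 2)
proportionOfTwos
1/6
```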

Later in the course, we will discuss another way to think about probability, called Bayesian probability. For a Bayesian, probability is a value that describes the strength of belief in a statement. Probability in the Bayesian sense does not have the same properties that it does in the frequentist sense.

## Hypothesis testing — the classical approach

One of the most common situations for a scientist is wanting to know if a given statistic is consistent with some hypothesis about a parameter of the population. For example, we might want to know if our sample mean would have been unexpected if the population mean has some particular value. In other words, we might want to know whether we could rule out that our population had some hypothesized mean. This approach is called hypothesis testing, and we will use the approach of Jerzy Neyman and Egon Pearson (not the Karl Pearson who developed the Pearson correlation coefficient).

Hypothesis testing follows six steps:

1. State a null hypothesis
2. Generate a distribution of expected outcomes
3. Declare a level of rarity called significance (α)
4. Find the critical value
5. Calculate our statistic
6. Decide whether to accept or reject the null hypothesis

To test a hypothesis, we first need to state our null hypothesis, which is a statement that we will use to evaluate our data. The null hypothesis is a definitive statement about the world; it is not a question. A null hypothesis is a special type of hypothesis, and it is generally a statement that nothing special is going on. For example, a null hypothesis about an experimental treatment could be that its effects are no different than applying no treatment at all. A null hypothesis about two means is that their difference is zero. A null hypothesis about the correlation of two variables is that they have no correlation at all, that is, that the correlation is zero. A null hypothesis about the slope of the line is that the slope is zero.

Many students find two things counterintuitive about statistics. First, our scientific hypothesis is usually not the null hypothesis. For example, our scientific hypothesis may be that applying a fertilizer causes faster growth rates than not applying it. The null hypothesis would be that applying the fertilizer does not improve growth rates (in other words, that the difference between applying a fertilizer and not applying it is zero). Second, we do not attempt to prove our scientific hypothesis; we attempt to disprove the null hypothesis. If we are able to disprove the null (more accurately, if we can show that the null hypothesis is unlikely to generate the data we have), we conclude that our alternative hypothesis, our scientific hypothesis, is a better explanation of our data.

Next, we need to generate a probability distribution of possible values of the statistic that we plan to measure. This distribution will be based on assuming our null hypothesis is true and the size of the sample we plan to collect. For example, suppose we are interested in the mean of a variable. We would want to know the probability of observing any possible sample mean if our null hypothesis was true. Because we calculate this probability for all possible values of the mean, we can build a probability distribution of the expected value of the mean.

This distribution will let us evaluate whether our observed statistic (our mean) was a likely result or an unlikely one, again, assuming that the null hypothesis is true. The farther out we are on the tails of this distribution, the less probable it is that our statistic was generated by sampling from a population described by our null hypothesis. We will learn several different ways to build these probability distributions in this course.
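
As one illustration, suppose (hypothetically) that the population is normally distributed with a known standard deviation. Under the null hypothesis, the sample mean is then also normally distributed, with a standard deviation of σ/√n. All the numbers in this sketch are invented for the example.

```r
nullMean <- 20      # hypothesized population mean (assumed value)
populationSd <- 4   # assumed known population standard deviation
n <- 12             # planned sample size

# Standard deviation of the sample mean under the null hypothesis
standardError <- populationSd / sqrt(n)

# Probability of a sample mean of 23 or more if the null hypothesis is true
tailProbability <- pnorm(23, mean=nullMean, sd=standardError, lower.tail=FALSE)
tailProbability
```

A small tail probability like this one says that a sample mean of 23 lies far out on the right tail of the distribution expected under the null hypothesis.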

Next, because our conclusion is based on probability, we need to state how unlikely our observed statistic would have to be (again, if the null hypothesis was true) before we decide to reject the null hypothesis and accept the alternative. This level of rarity is called our significance level, and it is symbolized by the Greek letter α. Significance is a probability; it is the probability that we define as rare.

Despite what you may hear, there is no magical or universally accepted significance level, and significance varies by field. That said, significance is commonly 0.05 (5%) in the natural sciences, because when it was first proposed by Ronald Fisher, he suggested that a 1-in-20 event was unusual enough that it warranted further investigation. It is important to realize that we can set significance to any value we wish. For example, we could set it to be very small (<0.1%) to be more certain about rejecting the null hypothesis, and it was proposed in 2017 that significance generally be set to 0.005. There are tradeoffs to setting significance to smaller values, and we will explore those in the next lecture.

From our probability distribution for our statistic, our significance level, and our alternative hypothesis, we can find one or two critical values that define the limits of what would be considered unexpected outcomes. Any value of our statistic equal to those critical values or that is even more extreme, that is, farther out on the tails of our probability distribution, would be regarded as a value that we would not expect if the null hypothesis was true.

For example, one alternative hypothesis might be that the population mean is larger than the value stated in the null hypothesis. This would likely be the case in a test of a new fertilizer, where we would hope that mean growth is greater than when no fertilizer is applied. In this case, we are interested only in rare results on the right (large-value) side of the probability distribution. To find our critical value, we would add up the area under the probability distribution starting at the far right of the distribution and working towards the left (smaller values of the statistic) until we reached a cumulative probability that equals our significance level. When we have reached our significance level, that position marks the critical value: any observed statistic (sample mean in this example) beyond that lies in the realm of improbable results if the null hypothesis was true. This realm is called the critical zone. This would be a one-tailed test, more specifically, a right-tailed test.
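
The sketch below carries out this right-to-left accumulation numerically. It assumes, purely for illustration, that the distribution of the statistic under the null hypothesis is a standard normal; the grid spacing is likewise an arbitrary choice.

```r
significance <- 0.05

# Hypothetical null distribution: standard normal evaluated on a fine grid
step <- 0.001
x <- seq(-5, 5, by=step)
dens <- dnorm(x)

# Area under the curve accumulated from the far right, moving leftward
areaFromRight <- rev(cumsum(rev(dens) * step))

# Smallest grid value whose right-tail area does not exceed the significance level
criticalValue <- x[which(areaFromRight <= significance)[1]]
criticalValue

# The built-in quantile function gives nearly the same answer
qnorm(1 - significance)
```

In practice we would use a quantile function directly; the explicit sum is shown only because it mirrors the description above.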

In some cases, we might expect our statistic to be smaller than what the null hypothesis predicts. We would follow the same procedure, except we would start from the far left of the probability distribution and work towards the right until we hit a cumulative probability equal to the significance level. That would mark our critical value, and any outcome to the left of that would be considered improbable to have been generated by the null hypothesis. This is also called a one-tailed test, specifically a left-tailed test.

In many cases, we do not have anything to suggest that the statistic would be larger or smaller than what the null hypothesis predicts. In that case, we split the significance probability evenly into the two tails of the distribution and calculate two critical values, one for each tail. Anything beyond those critical values, that is, in either of the two critical zones, would be considered an improbable outcome. This is called a two-tailed test.
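
The three kinds of tests can be compared side by side. This sketch again assumes a standard normal distribution under the null hypothesis, chosen only for illustration, and uses qnorm() to find the critical values.

```r
significance <- 0.05

# Right-tailed test: all of the significance goes in the right tail
rightTailed <- qnorm(1 - significance)

# Left-tailed test: all of the significance goes in the left tail
leftTailed <- qnorm(significance)

# Two-tailed test: half of the significance goes in each tail
twoTailed <- qnorm(c(significance/2, 1 - significance/2))

rightTailed
leftTailed
twoTailed
```

Note that the two-tailed critical values lie farther from the center than the one-tailed ones, because each tail holds only half of the significance.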

Notice that by this point, our test is entirely set up: we know our critical values, so we know how we would respond if our measured statistic was in the critical zone or not. Notice also that we have not yet seen our data and we therefore do not yet have our statistic. This is how it should be: none of what we have done so far should be influenced by our data. Modifying this process by what is in the data nullifies the interpretation of our statistical analyses. Stated another way, we should not test a hypothesis driven by an examination of the data. Once we have stated our null hypothesis, declared our significance level, generated a distribution of expected outcomes, and found our critical values, we can then collect our data and calculate our statistic.

With our critical values and our statistic, we can now make a decision about the null hypothesis. If our statistic lies in the critical zone, we reject the null hypothesis, and if it does not fall in the critical zone, we accept the null hypothesis. When we reject the null hypothesis, we implicitly accept our alternative hypothesis. In other words, we never test our scientific (alternative) hypothesis directly; we test the null hypothesis, which leads to a decision on the alternative hypothesis. It is important to realize that we did not determine whether the null hypothesis is true or false. Instead, we have only made a decision on how to act: we will act as if the null hypothesis is true or we will act as if it is false. In most cases, we will never know whether the null hypothesis was indeed true or false. Remember, we accept or reject hypotheses; we never determine if they are true or false.
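
The decision rule can be written as a short function. This is only a sketch: the function name decide() and its arguments are hypothetical, and the example critical values assume a standard normal distribution under the null hypothesis.

```r
# Hypothetical helper: returns our decision given a statistic and critical value(s)
decide <- function(statistic, criticalValues, tail=c("right", "left", "two")) {
  tail <- match.arg(tail)
  # Is the statistic at or beyond the critical value(s)?
  inCriticalZone <- switch(tail,
    right = statistic >= criticalValues[1],
    left  = statistic <= criticalValues[1],
    two   = statistic <= min(criticalValues) || statistic >= max(criticalValues))
  if (inCriticalZone) "reject the null hypothesis" else "accept the null hypothesis"
}

decide(2.1, qnorm(0.95), tail="right")
decide(0.4, qnorm(c(0.025, 0.975)), tail="two")
```

The function returns a decision about how to act, not a statement about whether the null hypothesis is true.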

Some people use the more convoluted language of “fail to accept” and “fail to reject” instead of reject and accept. Avoid using this language, not only because the wording is cumbersome, but more importantly because it confuses the truth of the hypothesis with our decision, a distinction we will continue to explore in this course.

## An example in R

We can explore hypothesis testing with a simulation. Suppose we are engaged in an exploration of a potential gold mine. We do not want to develop the mine if it is not economically viable, that is, if the concentration of gold is not greater than some minimal, economically viable concentration.

Our scientific hypothesis is that our potential mine is economically viable, but we need to state a null hypothesis, which we will state as “The mean concentration of gold in our potential mine is not greater than that of a mine that is at the break-even point”.

Solely for the purposes of illustration, let us suppose we knew the population of gold concentration in assays from a break-even mine, that is, one that is right at the boundary between making money and losing money. In reality, we would never know this, but pretending that we do helps to show how hypothesis testing works. Plus, it is fun to pretend that you are omniscient and know things like populations.

First, suppose that gold concentrations in a mine follow a lognormal distribution. We can use the rlnorm() function to simulate the population of assays from the break-even mine. We will display the frequency distribution of these concentrations, and we can also show the mean of the population, that is, the mean gold concentration of all rocks in a break-even mine.

population <- rlnorm(10000)
hist(population, breaks=50, freq=FALSE, main="Population", xlab="Gold Concentration", ylab="probability density", las=1, col="pink")
mean(population)
abline(v=mean(population), lwd=4, col="red")
text(mean(population)+1, 0.4, pos=4, "population mean", col="red")

(Note that in the last line of this code, because I used magic numbers to specify the x and y coordinates of the label in text(), the label might not plot in the correct position. This is also true in some of the code below. This is not good practice, and I do it here solely to keep the code simple and not detract from the purpose of the simulation.)

Because I plan to collect a small sample (n=20) from the potential mine, it is worth exploring what the mean of this sample might be if our mine was a break-even mine (that is, if the null hypothesis was true). It is easy to simulate a sample of n=20 from the break-even mine and calculate the mean of that sample.

mean(sample(x=population, size=20))

Note that the mean of this sample is not the same as the mean of the population, although it is close. If you repeat these lines of code, you will see that sometimes the sample mean is smaller than the population mean, and sometimes it is larger. This is the effect of chance brought on by random sampling. The effects of chance underlie every measurement you will make as a scientist.

Because of the effects of chance introduced in our sampling, we need to simulate not just one sample mean, but many of them, so that we have the probability distribution of sample means (when n=20) when we sample from the break-even mine. To do this, we repeat the process of drawing a sample of size n=20 many times (say, 50,000), and plot a frequency distribution of those sample means. On this plot, we add the population mean so that we can see how it compares with the distribution of sample means.

numTrials <- 50000
n20means <- replicate(n=numTrials, mean(sample(population, size=20)))
dev.new()
hist(n20means, breaks=50, freq=FALSE, main="Distribution of sample means when n=20", xlab="mean gold concentration in a sample of size n=20", col="gray")
abline(v=mean(population), lwd=4, col="red")
text(mean(population)+0.05, 0.9, "population mean", col="red", pos=4)

This frequency distribution shows that some values of the sample mean are likely to occur, but others are not. For example, if we grabbed a random sample of size n=20 from a break-even mine, it is very likely that our sample would have a mean gold concentration between 1 and 2.

On the other hand, it is unlikely that our sample would have a gold concentration of 2.5, or 3, or more, if it came from a break-even mine. In other words, such large concentrations would be unlikely to be seen if our null hypothesis (that we are sampling from a break-even mine) is true.
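
We can put rough numbers on these statements by counting simulated sample means. The sketch below rebuilds the population and the sample means so that it runs on its own, using fewer trials than above to keep it quick; the seed is arbitrary, and the exact proportions will change from run to run.

```r
set.seed(4)  # arbitrary seed, used only so the result is repeatable

# Rebuild the break-even population and the distribution of sample means
population <- rlnorm(10000)
n20means <- replicate(n=5000, mean(sample(population, size=20)))

# Proportion of sample means between 1 and 2
between1and2 <- mean(n20means >= 1 & n20means <= 2)
between1and2

# Proportion of sample means of 3 or more
atLeast3 <- mean(n20means >= 3)
atLeast3
```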

But what would we conclude about a mean gold concentration of 2.2, or 2.3? What we need is a standard for what would constitute a rare outcome; that standard of rarity is our significance level, and we choose it. It is entirely under our control. Feeling conventional, we decide to follow the convention in the natural sciences, and we set our significance at 0.05 (5%). To repeat, we could have chosen any value for significance that we want.

From our probability distribution and our significance value, we can find our critical value. In this case, we are interested only in unusually high gold concentrations (small ones would indicate an unprofitable mine), so we perform a right-tailed test. We integrate under the curve starting from the far right end, working leftward into the distribution until our cumulative probability equals our significance level. Where these match is our critical value.

significance <- 0.05
criticalValue <- quantile(n20means, 1-significance)
criticalValue
abline(v=criticalValue, lwd=3, col="blue")
text(criticalValue+0.05, 0.8, "critical value", col="blue", pos=4)
text(criticalValue+0.5, 0.5, "unlikely values", cex=2, col="blue", pos=4)

At this point, we can now measure the gold concentrations in our sample and find the mean of those 20 values. If the mean concentration is greater than the critical value, it lies in the critical zone and we would reject our null hypothesis that we sampled from a break-even mine. Because we did a right-tailed test, we would therefore implicitly accept our alternative hypothesis that our potential mine will be profitable. If our sample mean was to the left of the critical value, that is, outside the critical zone, we would accept the null hypothesis that we sampled from a break-even mine (or worse), and we would not develop this new mine as a result.
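
To see the final two steps in action, the sketch below invents a set of 20 assays from the potential mine and makes the decision. The assays are simulated, not real, and the choice of meanlog=1 for the potential mine is an assumption made only so that there is something to test. The code rebuilds the critical value so that it runs on its own, with fewer trials than above.

```r
set.seed(3)  # arbitrary seed, used only so the result is repeatable

# Rebuild the null distribution of sample means and the critical value
population <- rlnorm(10000)
n20means <- replicate(n=5000, mean(sample(population, size=20)))
significance <- 0.05
criticalValue <- unname(quantile(n20means, 1-significance))

# About 5% of sample means from a break-even mine lie beyond the critical value
proportionBeyond <- mean(n20means > criticalValue)
proportionBeyond

# Hypothetical assays from the potential mine (simulated, assumed richer)
assays <- rlnorm(20, meanlog=1)
sampleMean <- mean(assays)

# Right-tailed decision
decision <- if (sampleMean > criticalValue) "reject" else "accept"
c(sampleMean=sampleMean, criticalValue=criticalValue)
decision
```

Because the assays are themselves a random sample, a different seed could lead to a different decision, even for the same mine.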

This example shows you the logic behind hypothesis testing. Although the actual mechanics of how we obtain the distribution of expected outcomes for our statistic will differ depending on the situation, we will follow these steps when hypothesis testing:

1. State a null hypothesis
2. Generate a distribution of expected outcomes
3. Declare a level of rarity called significance (α)
4. Find the critical value
5. Calculate our statistic
6. Decide whether to accept or reject the null hypothesis