Steven Holland

# Type I and Type II Errors

A null hypothesis is either true or false. Unfortunately, we do not know which is the case, and generally, we never will. It is important to realize that there is no probability that the null hypothesis is true or that it is false. Not knowing which is correct does not mean that a probability is involved. For example, if you are testing whether a potential mine has a greater gold concentration than that of a break-even mine, the null hypothesis that your potential mine has a gold concentration no greater than a break-even mine is either true or it is false. There is no probability associated with these two cases (in a frequentist sense) because the gold is already in the ground, and as a result there is no possibility for chance because everything is already set. All we have is our own uncertainty about the null hypothesis.

This lack of knowledge about the null hypothesis is why we need to perform a statistical test: we want to use our data to make an inference about the null hypothesis. Specifically, we need to decide whether we are going to act as if the null hypothesis is true or act as if it is false. From our hypothesis test, we therefore choose to either accept or reject the null hypothesis. If we accept the null hypothesis, we are stating that our data are consistent with the null hypothesis. If we reject the null hypothesis, we are stating that our data are so unexpected that they are inconsistent with the null hypothesis being true.

Our decision will change our behavior. If we reject the null hypothesis, we will act as if the null hypothesis is false, even though we do not know if that is the case. If we accept the null hypothesis, we will act as if the null hypothesis is true, even though we have not demonstrated that it is in fact true. This is a critical point: regardless of the results of our statistical test, we will never know if the null hypothesis is true or false. In other words, we do not prove or disprove null hypotheses; we do not show that null hypotheses are true or that they are false.

Because we have two possibilities for the null hypothesis (true or false) and two possibilities for our decision (accept or reject), there are four possible scenarios. Two of these are correct decisions: we could accept a true null hypothesis or we could reject a false null hypothesis. The other two are errors. If we reject a true null hypothesis, we have committed a type I error. If we accept a false null hypothesis, we have made a type II error.

Each of these four possibilities has some probability of occurring, and those probabilities are contingent on whether the null hypothesis is true or false. If the null hypothesis is true, there are only two possibilities: we will choose to accept the null hypothesis with probability of 1-α, or we will reject it with probability of α. If the null hypothesis is false, there are also only two possibilities: we will choose to accept the null hypothesis with probability of β, or we will reject it with probability of 1-β. If we arrange the four outcomes in a table, with one row for each state of the null hypothesis (true or false) and one column for each decision (accept or reject), it is the probabilities in each row that sum to 100%, because the probabilities are contingent on whether the null hypothesis is true or false. The probabilities in each column generally will not sum to 100%.
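The four outcomes can be laid out as a small table in R. This is only a sketch: the values of `alpha` and `beta` are hypothetical, chosen for illustration, and the object name `outcomes` is our own.

```r
# Hypothetical error rates, chosen only for illustration
alpha <- 0.05   # P(reject | null hypothesis is true)
beta  <- 0.20   # P(accept | null hypothesis is false)

# Rows are the (unknown) state of the null hypothesis,
# columns are our decision
outcomes <- matrix(
  c(1 - alpha, alpha,
    beta,      1 - beta),
  nrow = 2, byrow = TRUE,
  dimnames = list(truth    = c("null true", "null false"),
                  decision = c("accept", "reject"))
)
outcomes

rowSums(outcomes)   # each row sums to 1
colSums(outcomes)   # the columns generally do not
```

Only one of the two rows describes reality for any given test, which is why the probabilities are read along rows and never down columns.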

Because we don’t know the truth of a null hypothesis, we need to cover ourselves and lessen the chances of making both types of error. If the null hypothesis is true, we want to lessen our chances of making a type I error; in other words, we want to find ways to reduce alpha. If the null hypothesis is false, we want to reduce our chances of making a type II error: we want to find ways to reduce beta. Because we will never know if the null hypothesis is true or false, we need to simultaneously keep both alpha and beta as small as possible.

## Significance and confidence

Keeping the probability of a type I error low is straightforward, because we choose our significance level (α). If we are especially concerned about making a type I error, and in some cases we might be, we can set our significance level to be as small as we wish.

If the null hypothesis is true, we have a 1-α probability that we will make the correct decision and accept it. We call that probability (1-α) our confidence level. Confidence and significance sum to one because rejecting and accepting a null hypothesis are the only possible choices when the null hypothesis is true. Therefore, when we decrease significance, we increase our confidence. Although you might think you would always want confidence to be as high as possible, we will see that increasing it (by decreasing significance, or α) necessarily makes it more likely that we will make a type II error when the null hypothesis is false.
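A short simulation can illustrate that the significance level we choose is the long-run rate of type I errors. The sketch below assumes normally distributed data, a one-sided one-sample t-test, and the same illustrative mean and standard deviation used in the demonstration code; the seed and the number of iterations are arbitrary.

```r
set.seed(42)          # arbitrary seed, for reproducibility only
alpha      <- 0.05
nullMean   <- 4.5     # illustrative values, as in the demonstration code
nullSd     <- 1.0
n          <- 20
iterations <- 5000

# Draw samples for which the null hypothesis is true, run a one-sided
# t-test on each, and record the p-value
pValues <- replicate(iterations,
  t.test(rnorm(n, nullMean, nullSd), mu = nullMean,
         alternative = "greater")$p.value)

typeIRate  <- mean(pValues < alpha)   # fraction of true nulls rejected
confidence <- 1 - typeIRate           # fraction of true nulls accepted
c(typeIRate = typeIRate, confidence = confidence)
```

The estimated type I error rate should land close to the chosen α, and the estimated confidence close to 1-α. Changing `alpha` moves both together.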

## Beta and power

Keeping the probability of a type II error small is more complicated.

When the null hypothesis is false, β is the probability that we will make the wrong choice and accept it (a type II error). Beta is nearly always unknown, since knowing it requires knowing whether the null hypothesis is true or not; specifically, calculating beta requires knowing the actual distribution of our statistic. If we knew that, we wouldn’t need statistics.
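To see what calculating β would involve, the sketch below pretends that we know the true distribution, which we never do in practice. It assumes a one-sided test on the mean with a known standard deviation (a z-test), using the illustrative values from the demonstration code.

```r
alpha    <- 0.05
nullMean <- 4.5
trueMean <- 5.0   # assumed known here; in practice it is not
sigma    <- 1.0   # assumed known standard deviation
n        <- 20

standardError <- sigma / sqrt(n)

# One-sided test: reject when the sample mean exceeds this value
criticalValue <- qnorm(1 - alpha, mean = nullMean, sd = standardError)

# beta: probability that the sample mean falls below the critical value
# when the sample really comes from the true distribution
beta <- pnorm(criticalValue, mean = trueMean, sd = standardError)
beta
```

With these values β is roughly 0.28. The calculation depends entirely on `trueMean`, which is exactly the quantity we do not have.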

If the null hypothesis is false, there is a 1-β probability that we’ll make the right choice and reject that false null hypothesis. The probability that we will make the right choice when the null hypothesis is false is called statistical power. Power reflects our ability to reject false null hypotheses and detect new phenomena in the world. We must try to maximize power. Power is influenced by four factors:

• Power increases with the size of the effect that we are trying to detect. For example, it is easier to detect a large difference in means than a small one. We cannot control effect size, because it is determined by the problem we are studying. The remaining three factors, however, are entirely under our control.
• Some statistical tests have greater power than other tests. In general, parametric tests have greater power than nonparametric tests.
• Our significance level affects β. Increasing alpha (significance) will increase our power, but only at an increased risk of rejecting a true null hypothesis.
• Sample size (n) has a major effect on power. Increasing sample size increases power.
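Three of these factors (effect size, significance level, and sample size) can be explored with `power.t.test()` from R's stats package. The sketch below assumes a one-sided, one-sample t-test with a standard deviation of 1; the helper name `powerFor` and the specific numbers are ours, chosen for illustration. It does not address the choice of test.

```r
# Helper (our own name): power of a one-sided, one-sample t-test
powerFor <- function(n, delta, alpha) {
  power.t.test(n = n, delta = delta, sd = 1, sig.level = alpha,
               type = "one.sample", alternative = "one.sided")$power
}

# Baseline, then change one factor at a time
baseline     <- powerFor(n = 20, delta = 0.5, alpha = 0.05)
largerEffect <- powerFor(n = 20, delta = 1.0, alpha = 0.05)
largerAlpha  <- powerFor(n = 20, delta = 0.5, alpha = 0.10)
largerSample <- powerFor(n = 60, delta = 0.5, alpha = 0.05)

round(c(baseline = baseline, largerEffect = largerEffect,
        largerAlpha = largerAlpha, largerSample = largerSample), 3)
```

Each of the three changes raises power above the baseline.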

## Our strategy for minimizing errors

We keep our probability of committing a type I error small by keeping significance (α) as small as possible.

We keep our probability of committing a type II error small by (1) choosing tests with high power, (2) increasing sample size as much as possible, given the constraints of time and money, and (3) not making significance (α) too small. As a data analyst, you should always be thinking of ways to do all three.
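When planning a study, one way to act on this strategy is to ask how large a sample would be needed to reach a target power. The sketch below uses `power.t.test()` and assumes a one-sided, one-sample t-test; the effect size of 0.5, standard deviation of 1, and target power of 0.80 are hypothetical planning values, not recommendations.

```r
# Leave n unspecified so that power.t.test() solves for it
plan <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
                     type = "one.sample", alternative = "one.sided")

requiredN <- ceiling(plan$n)   # round up to a whole observation
requiredN
```

Because the true effect size is unknown, the result is only as good as the guess supplied for `delta`.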

## Tradeoffs in α and β

We would like to minimize the probability of making type I and type II errors, but there is a tradeoff in α and β: for a given test, sample size, and effect size, decreasing one necessarily increases the other. We cannot simultaneously minimize both, and we will have to prioritize one or the other. Which one we prioritize depends on the circumstances.
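The tradeoff can be made concrete by holding everything fixed except α. The sketch below assumes a one-sided z-test with a known standard deviation and, unrealistically, a known true mean; all values are illustrative.

```r
nullMean <- 4.5
trueMean <- 5.0   # assumed known for illustration only
sigma    <- 1.0
n        <- 20
standardError <- sigma / sqrt(n)

# A range of significance levels, from strict to lenient
alphas <- c(0.001, 0.01, 0.05, 0.10, 0.20)

# A smaller alpha pushes the critical value to the right ...
criticalValues <- qnorm(1 - alphas, mean = nullMean, sd = standardError)

# ... which leaves more of the true distribution on the "accept" side
betas <- pnorm(criticalValues, mean = trueMean, sd = standardError)

data.frame(alpha = alphas,
           criticalValue = round(criticalValues, 3),
           beta = round(betas, 3))
```

Reading down the table, β falls as α rises; the only ways to lower both at once are to change the sample size, the test, or the effect size.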

For example, if we were doing quality control at a factory, our null hypothesis would be that the current production run meets some minimum standard of quality. As the producer, we want to maximize our profits, which means we do not want to be too careful and discard too many production runs; thus we set α low, but not too low. α is thus known as producer’s risk. If we are a consumer, we do not want defective goods, so if the null hypothesis is false (the goods are defective), we want them to be caught and not shipped to the consumers. β is therefore known as consumer’s risk.

Another example comes from court proceedings, and we need to think about the differences between a criminal trial and a civil trial.

In a criminal trial, it is the individual pitted against the state. The founders of our government did not want the state to be too powerful and take away the rights of innocent people, so the standard in a criminal trial is “proof beyond a reasonable doubt”. In effect, this sets α to be very low, accepting the risk that some guilty people will walk free (a type II error).

In civil trials, individuals are pitted against one another, and there is no concern for one party over another. The standard of proof is therefore “a preponderance of the evidence”, that is, whoever presents the stronger case wins. In effect, this sets alpha to be much larger, creating a greater balance between type I and type II errors.

## Demonstration Code

This first block will demonstrate the relationship between alpha and beta for a small sample (n=20). This and the next block will generate color versions of the figures used in class.

```r
dev.new(height=8, width=6)
par(mfrow=c(2, 1))

# simulate Ho distribution for mean of small sample (n=20)
nullMean <- 4.5
nullStandardDeviation <- 1.0
smallSample <- 20
iterations <- 10000
smallSampleMeans <- replicate(iterations, mean(rnorm(smallSample, nullMean, nullStandardDeviation)))

# plot the distribution of the mean
breaks <- seq(3, 6, 0.05)
# named xRange so that it does not shadow base::range
xRange <- c(min(breaks), max(breaks))
hist(smallSampleMeans, breaks=breaks, col='gray', main='Hypothesized Mean n=20', xlab='mean', xlim=xRange)

# show the critical value
alpha <- 0.05
criticalValue <- quantile(smallSampleMeans, 1-alpha)
abline(v=criticalValue, col='blue', lwd=3)
text(criticalValue, 800, 'critical value', pos=4, col='blue')
text(5.5, 300, expression(alpha), cex=2, pos=1, col='blue')
text(3.5, 300, expression(1-alpha), cex=2, pos=1, col='blue')

# show the observed mean
observedMean <- 5.1
arrows(observedMean, 600, observedMean, 50, angle=20, length=0.15, lwd=3, col='red')
text(observedMean, 620, 'observed mean', pos=4, col='red')

# simulate true distribution for mean of small sample (n=20)
# remember, this is unknowable to us
trueMean <- 5.0
trueSd <- 1.0
trueSampleMeans <- replicate(iterations, mean(rnorm(smallSample, trueMean, trueSd)))

# plot the distribution
hist(trueSampleMeans, breaks=breaks, col='gray', main='True Mean (Unknowable)', xlab='mean', xlim=xRange)

# add critical value and observed mean
abline(v=criticalValue, col='blue', lwd=3)
text(5.5, 800, expression(1-beta), cex=2, pos=1, col='blue')
text(3.5, 800, expression(beta), cex=2, pos=1, col='blue')
arrows(observedMean, 600, observedMean, 50, angle=20, length=0.15, lwd=3, col='red')

# Note that magic numbers were used for coordinates of arrows and text labels
# If you modify the code, you will likely need to change these numbers
```

This second block will do the same, but for a large sample (n=60).

```r
# Repeat, but for sample size of 60
# dev.new() works on all platforms; quartz() is available only on macOS
dev.new(height=8, width=6)
par(mfrow=c(2, 1))

# simulate Ho distribution for mean of large sample (n=60)
nullMean <- 4.5
nullStandardDeviation <- 1.0
largeSample <- 60
iterations <- 10000
largeSampleMeans <- replicate(iterations, mean(rnorm(largeSample, nullMean, nullStandardDeviation)))

# plot the distribution of the mean
breaks <- seq(3, 6, 0.05)
# named xRange so that it does not shadow base::range
xRange <- c(min(breaks), max(breaks))
hist(largeSampleMeans, breaks=breaks, col='gray', main='Hypothesized Mean n=60', xlab='mean', xlim=xRange)

# show the critical value
alpha <- 0.05
criticalValue <- quantile(largeSampleMeans, 1-alpha)
abline(v=criticalValue, col='blue', lwd=3)
text(criticalValue, 1400, 'critical value', pos=4, col='blue')
text(5.5, 500, expression(alpha), cex=2, pos=1, col='blue')
text(3.5, 500, expression(1-alpha), cex=2, pos=1, col='blue')

# show the observed mean
observedMean <- 5.1
arrows(observedMean, 1000, observedMean, 50, angle=20, length=0.15, lwd=3, col='red')
text(observedMean, 1020, 'observed mean', pos=4, col='red')

# simulate true distribution for mean of large sample (n=60)
# remember, this is unknowable to us
trueMean <- 5.0
trueSd <- 1.0
trueLargeSampleMeans <- replicate(iterations, mean(rnorm(largeSample, trueMean, trueSd)))

# plot the distribution
hist(trueLargeSampleMeans, breaks=breaks, col='gray', main='True Mean (Unknowable)', xlab='mean', xlim=xRange)

# add critical value and observed mean
abline(v=criticalValue, col='blue', lwd=3)
text(5.5, 1400, expression(1-beta), cex=2, pos=1, col='blue')
text(3.5, 1400, expression(beta), cex=2, pos=1, col='blue')
arrows(observedMean, 1000, observedMean, 50, angle=20, length=0.15, lwd=3, col='red')
```
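The two sets of figures can also be summarized numerically. The following sketch is an addition to the demonstration, not part of the original class code: it repeats the same simulation without plotting and estimates β as the fraction of simulated true-distribution means that fall at or below the critical value. The function name `estimateBeta` and the seed are our own choices.

```r
set.seed(1)            # arbitrary seed, for reproducibility only
nullMean   <- 4.5
trueMean   <- 5.0      # unknowable in practice; assumed for the demonstration
sdev       <- 1.0
alpha      <- 0.05
iterations <- 10000

# Estimate beta by simulation for a given sample size
estimateBeta <- function(n) {
  nullMeans <- replicate(iterations, mean(rnorm(n, nullMean, sdev)))
  trueMeans <- replicate(iterations, mean(rnorm(n, trueMean, sdev)))
  criticalValue <- quantile(nullMeans, 1 - alpha)
  # type II error: the sample mean does not exceed the critical value
  mean(trueMeans <= criticalValue)
}

betaSmall <- estimateBeta(20)
betaLarge <- estimateBeta(60)
c(betaSmall = betaSmall, powerSmall = 1 - betaSmall,
  betaLarge = betaLarge, powerLarge = 1 - betaLarge)
```

The estimate of β for n=60 should be much smaller than the one for n=20, matching the visual impression that the two distributions overlap far less with the larger sample.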