In the classical approach to statistical testing, we state a null and alternative hypothesis, choose a significance level, find our critical value(s), measure our statistic, and decide to accept or reject the null hypothesis. When we report our decision, we have no way to convey how convincing that decision was. Was our null hypothesis clearly rejected or accepted, or was it a borderline case? The classical all-or-nothing, accept-or-reject approach does not give us a complete picture.

We could take another approach that would tell us how convinced we were in our decision, an approach developed by Ronald Fisher (of F-test, maximum likelihood, genetics, and evolutionary biology fame). Once we have our statistic and the distribution based on the null hypothesis, instead of finding a critical value, we could instead calculate the probability of observing our statistic or a value even more extreme. This probability is called a *p-value*. The p-value is the probability of observing our statistic or one more extreme *if the null hypothesis is true*.
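For example, suppose our null distribution is the standard normal and we observe a statistic of z = 2.1. A minimal sketch of the calculation in Python, assuming scipy is available (the observed value is purely illustrative):

```python
# Two-tailed p-value for a hypothetical observed statistic under a
# standard-normal null distribution.
from scipy import stats

z_observed = 2.1  # illustrative observed statistic

# Probability of a value at least this extreme in either tail:
# sf (survival function) gives the upper-tail area, doubled for two tails.
p_value = 2 * stats.norm.sf(abs(z_observed))
print(p_value)
```

For a one-tailed test, the doubling is dropped and only the relevant tail is integrated.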

We use the p-value to make a decision about the null hypothesis. If the p-value is less than our significance level, we conclude that our statistic has such a low probability of being observed that we should reject our null hypothesis. If our p-value is greater than our significance level, we accept our null hypothesis, because our p-value tells us that our observed statistic is a likely result when the null hypothesis is true.

Finding a p-value is the reverse of finding a critical value. When you find a critical value, you start with a probability (significance) and integrate under the probability distribution from one or both tails until you equal your probability; the critical value(s) correspond to those cutoff values of the statistic. With a p-value, you start with the observed statistic, then integrate under the probability distribution from that observed statistic towards the tail. The cumulative probability from your statistic to the tail is the p-value.
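The two directions can be placed side by side: a critical value runs from a probability to a statistic (an inverse survival function), while a p-value runs from a statistic to a probability (a survival function). A sketch with a standard-normal distribution, assuming scipy (the observed statistic is illustrative):

```python
from scipy import stats

alpha = 0.05  # significance level

# Critical value: start from a probability and work back to the statistic.
# isf is the inverse survival function; alpha/2 places half the
# significance in each tail for a two-tailed test.
z_crit = stats.norm.isf(alpha / 2)

# p-value: start from an observed statistic and integrate out to the tails.
z_obs = 2.5  # illustrative observed statistic
p_value = 2 * stats.norm.sf(z_obs)

print(z_crit, p_value)
```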

P-values improve on the classical approach to statistics. We can still decide to accept or reject the null hypothesis, just as we did in the classical approach. We can also report a p-value, which directly tells readers the probability of observing what we did if the null hypothesis is true. With this, readers can judge if we convincingly rejected the null hypothesis or not. For example, a minuscule p-value says that we have an extremely unlikely result, whereas a larger p-value shows that our observations are more consistent with the null hypothesis. P-values convey more information than the classical approach, and as a result, scientists today primarily use the p-value approach.

Unfortunately, p-values are widely misused in two ways. First, the p-value is widely misinterpreted as the probability that the null hypothesis is true. If you recall last lecture, there is no probability of the null hypothesis being true: it either is, or it is not. We may not know which, but there can be no probability (in a frequentist sense) that a hypothesis is true or false.

Second, scientists commonly use p-values to indicate the magnitude of the effect being detected. For example, when someone tests a correlation coefficient, a low p-value is widely misinterpreted to mean that the correlation is strong or important. Similarly, scientists comparing the mean values in two samples often interpret a low p-value as indicating that the two means are very different.

*(As an aside, the concept of significance was first introduced by Ronald Fisher, who used a probability of 1 in 20 (0.05) as a reasonable standard for rarity. For Fisher, any p-value below that cutoff was considered significant, in the sense that it meant the data warranted further exploration. Fisher did not mean that the work was done, as a low p-value often gets treated today. So, both the concept of statistical significance and the cutoff of 0.05 stem from Fisher. Neyman and Pearson later integrated the concept of significance into their larger understanding of type I and type II errors.)*

It is critical to understand what controls the size of a p-value. In part, it reflects the effect being investigated: large effects tend to generate small p-values. In other words, means that are quite different tend to produce a small p-value. Strong correlations likewise favor small p-values.

The p-value depends just as importantly on sample size, and larger samples tend to produce smaller p-values. A small p-value might therefore reflect a large effect (e.g., a strong correlation, or a large difference in means), but it might also indicate a large sample size. P-values alone do not convey the size of the effect. Some unscrupulous scientists intentionally play on this misunderstanding by collecting a vast data set, publishing a small p-value, and downplaying the minuscule effect that was discovered. *Watch for this.*
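The effect of sample size is easy to demonstrate with a simulation. Here the difference in population means is held fixed at a small value while the sample size grows; nothing changes but the sample size, yet the p-value shrinks dramatically. A sketch assuming numpy and scipy (the effect size and sample sizes are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_diff = 0.1  # a small, fixed difference in population means

p_values = {}
for n in (50, 500, 50000):
    a = rng.normal(0.0, 1.0, n)        # sample from N(0, 1)
    b = rng.normal(true_diff, 1.0, n)  # sample from N(0.1, 1)
    t_stat, p = stats.ttest_ind(a, b)  # two-sample t-test
    p_values[n] = p
    print(n, p)
```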

Finally, remember to think about p-values in terms of the hypothesis being tested. Generally, our null hypothesis is that nothing special is going on. It might be that our correlation is zero, or that there is no difference in means, or that two variances are the same. When we run our test, the p-value is the probability we would get the result we observed if the null hypothesis were true. In other words, the p-value is the probability we would get the result we observed if there really is no correlation, or the means really are the same, or the variances really are the same. Put more simply, the test tells us whether something is non-zero. “Not zero” encompasses a large range of possibilities, and some of those possibilities are scientifically interesting and some are not. A statistic might have a p-value of 0.001 and be statistically non-zero, merely because sample size is large.
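The same point can be made with a correlation that is statistically non-zero but practically negligible. In this sketch (assuming numpy and scipy; the tiny slope is arbitrary), the true correlation is only about 0.02, yet the enormous sample size drives the p-value far below any conventional significance level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 100_000

x = rng.normal(size=n)
# y is almost pure noise with only a faint trace of x,
# so the true correlation is about 0.02.
y = 0.02 * x + rng.normal(size=n)

r, p = stats.pearsonr(x, y)
print(r, p)  # a negligible correlation, yet a tiny p-value
```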

One problem that classical and p-value approaches share is that they test a single null hypothesis, most commonly that a parameter (mean, correlation coefficient, etc.) is zero. Alternatively, you might want to know all the null hypotheses that are inconsistent with the data, not just the zero hypothesis. Sometimes, it may not be clear what the null hypothesis is, and without a null hypothesis, you cannot generate the distribution needed to calculate a critical value or a p-value.

Imagine instead that you could test all possible null hypotheses and identify those that are consistent with the data and those that are not. Confidence intervals do just that. Confidence intervals improve on the classical and p-value approaches in that they convey our *estimate of a parameter* and our *uncertainty in that estimate*. This is far more informative than either the classical or p-value approach, and *confidence intervals should be your standard practice*.

Critical values and p-values are based on a null hypothesis: given a null hypothesis about a parameter, we generate the expected distribution of a statistic. From that distribution and our chosen significance level, we obtain a critical value and compare that to our observed statistic. We could also use our statistic and directly calculate the probability we would observe that value or one more extreme if the null hypothesis is true (i.e., a p-value). In other words, critical values and p-values are both based on a distribution built from the null hypothesis.

Confidence intervals are not constructed from a distribution of the statistic based on the null hypothesis, but from a distribution of the statistic based on the data. If some assumptions are made about the parent distribution underlying the data, it is possible to construct a distribution of any statistic from our data. From this distribution, we could use our significance level (α) to define unlikely values (i.e., those out on the tails). More commonly, we use confidence (1-α) to define a region of likely values, and this is called a confidence interval.
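For a concrete case, consider a confidence interval for a population mean. If we assume the data come from a normal parent distribution, the sample mean follows a t distribution, which yields the interval directly. A sketch assuming numpy and scipy (the data are made up for illustration):

```python
import numpy as np
from scipy import stats

data = np.array([12.1, 14.0, 13.5, 11.8, 13.9, 14.6, 12.7, 13.3])  # illustrative

mean = data.mean()
sem = stats.sem(data)  # standard error of the mean

# 95% confidence interval for the population mean,
# based on the t distribution with n - 1 degrees of freedom.
lo, hi = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(lo, hi)
```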

**A confidence interval is the set of all null hypotheses that are consistent with the data.** Another way to say this is that a confidence interval is the *set of all acceptable null hypotheses*. In other words, you accept any null hypothesis that lies within the confidence interval. The end points of a confidence interval are called **confidence limits** and they mark the last acceptable null hypotheses. You reject any null hypothesis that equals the confidence limits or that lies outside of them.

Again, note the opposing approaches. For critical values and p-values, we use the null hypothesis to build a distribution, against which we compare our observed statistic. For confidence intervals, we use the data to build a distribution, against which we compare the null hypothesis.

Confidence limits are generally phrased as some value (the best estimate of the parameter) plus or minus some other value, such as 13.3 ± 1.7, which would correspond to a confidence interval of 11.6 to 15.0. A large confidence interval conveys large uncertainty in your estimate of the parameter, and a small interval indicates a high degree of certainty. Confidence intervals are always reported with the value of confidence, as in 95% confidence limits, which would correspond to a significance level of 0.05.

Confidence intervals should be interpreted this way: if your confidence is 95%, approximately 95% of the confidence intervals that you construct in your lifetime will contain the population parameter. Confidence intervals *do not* mean that there is a 95% chance that you have bracketed the population mean, and this misconception is *very common*. Recall that the population parameter has some value; you may not know what it is, but the value of the parameter is not subject to chance. Once you have generated your confidence interval, it also exists, so there is no chance anymore: it either brackets the true value or it doesn’t.
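This long-run interpretation can be checked by simulation: draw many samples from a population whose mean is known, construct a 95% confidence interval from each, and count how often the interval brackets the true mean. The fraction comes out close to 0.95. A sketch assuming numpy and scipy (the population parameters, sample size, and number of trials are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, n, trials = 10.0, 30, 2000

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mean, 2.0, n)
    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=sample.mean(), scale=stats.sem(sample))
    if lo <= true_mean <= hi:
        covered += 1

print(covered / trials)  # close to 0.95
```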

As confidence increases (or equivalently, as significance decreases), so does the size of the confidence interval, and this may seem counterintuitive. One way to think about it: as your confidence increases, your confidence intervals must bracket the population parameter more often, and to do that, they must become wider.

In practice, calculating confidence intervals is not as tedious as it sounds: you will not actually test every possible null hypothesis. There are straightforward methods of calculating confidence limits that we will discuss for each test we cover. What is important is that their interpretation is the same.

Confidence limits are the preferred way to do statistics. Like the classical and p-value approaches, you can still accept or reject a null hypothesis. Like the p-value approach, you get a sense for how convincingly you can reject or accept the null hypothesis, based on how close it falls to your confidence limits. Where confidence intervals improve on the classical and p-value approaches is that they give you an estimate of the population parameter and a measure of your uncertainty in that estimate. These two quantities are quite valuable and can be used in other calculations, such as propagation of error or numerical models. Almost always, these are the two things we really wish to know: what is our best estimate of the population parameter and what is our uncertainty in this estimate. What we generally don’t care about is the probability that we would see our data if the parameter (correlation, difference of means, etc.) was zero, that is, the p-value.

Be sure to read this section from Biometry (Sokal and Rohlf, 1995) on Type I and Type II errors. Some of this is in reference to a specific statistical test, but try not to focus on the particulars of that test, but on the general principles we discussed in class.

Sokal, R. R., and Rohlf, F. J., 1995, Biometry: New York, W.H. Freeman and Company, 887 p.