P-values & confidence intervals

When we reject the null hypothesis, we say that our result is *statistically significant* or *significantly different*. Both of these indicate that our statistic is unlikely if the null hypothesis is true. Because statistical significance is determined by the value of alpha, people will often specify alpha, as in *statistically significant at the 0.05 level*. Unfortunately, people sometimes just say that their result is *significant*, but you should avoid doing this because this subtle change in language leads to confusion.

The statistical term *significance* is perhaps the most unfortunate possible choice of words because when most people hear the term *significant*, they think *important*. However, statistical significance is not the same as scientific significance. Statistical significance means that the results are inconsistent with the null hypothesis; that is, it is unlikely we would have observed our statistic if the null hypothesis were true. This could be because our parameter is very different from the value specified by the null hypothesis, but it could also be because the sample size is large enough to detect a small effect size.

Scientific significance is a statement about whether the result is an important one, that is, whether the effect detected is a large one. For example, is the correlation of two variables strong? Is the difference between two sample means large? A statistically significant difference may not indicate a large or important difference (as can happen when sample size is especially large). Conversely, a large or important difference may not be statistically significant (when sample size is small).
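The interplay of sample size and effect size can be made concrete with a small calculation. The sketch below is illustrative only: the correlation of 0.05 and the two sample sizes are made-up values, and `cor_p_value` is a hypothetical helper name. It relies on the standard t-statistic for testing a zero correlation, t = r × sqrt((n − 2) / (1 − r²)), with n − 2 degrees of freedom.

```r
# Two-tailed p-value for the null hypothesis of zero correlation, given a
# sample correlation r and a sample size n (hypothetical helper).
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

r <- 0.05                              # a weak correlation (made-up value)
p_small <- cor_p_value(r, n = 30)      # small sample
p_large <- cor_p_value(r, n = 10000)   # very large sample
r_squared <- r^2                       # proportion of variation explained

p_small     # well above 0.05: not statistically significant
p_large     # far below 0.05: statistically significant
r_squared   # 0.0025, that is, 0.25% of the variation
```

The same weak correlation is statistically significant or not depending only on the sample size, while the proportion of variation it explains stays tiny.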

Statistical significance is often used misleadingly to imply scientific significance. For example, Richard J. Herrnstein and Charles Murray published a bestselling, influential, and controversial book called The Bell Curve in 1994. In it, they reported that many phenomena in society — such as drug use, teen pregnancy, and unemployment — are significantly correlated with low intelligence. Among the many criticisms of the book, one that is often overlooked is that the authors equate statistical significance with scientific importance. True, all of their results are statistically significant, and they admirably include the statistics in their appendix. What the authors don’t tell you is that the results are significant solely because their sample sizes are enormous. In most cases, intelligence explains only a tiny percentage of the variation in drug use, teen pregnancy, etc. In other words, intelligence is not an *important* explanation of these societal problems, although it is a *statistically significant* one.

The bottom line is that, from this point on, whenever you hear the word *significant*, you must ask whether it is meant in the statistical sense or the scientific sense. In your own work, consider using the phrase “statistically different” or “statistically distinct” instead of statistical significance, and using “important”, “great”, or “substantial” to convey scientific significance.

Furthermore, because the null hypothesis is generally a test of no difference or no relationship, it is a test of the zero hypothesis. For example, a null hypothesis is commonly that there is a zero correlation, or that the difference of two means is zero. Because of this, it would be much less ambiguous if we described our results as statistically non-zero rather than statistically significant. For example, stating that a correlation is statistically non-zero sounds like a very different statement to most people than stating that a correlation is significant. Non-zero includes many possibilities, and only some of those would be ones that most people would call important or noteworthy.

Although the increased use of statistics in science has improved the rigor of many fields, the misuse of statistics comes with a price. Many scientists are now so focused on hypothesis testing via statistics that they lose sight of the scientific problem. Their pursuit has become one of p-values, not one of finding important results.

A growing movement (e.g., Jost, 2009; Anderson et al., 2000; Schervish, 1996; Seaman and Jaeger, 1990; Toft, 1990; Yoccoz, 1991; Johnson, 1995) advocates replacing p-values with confidence intervals and parameter estimation. Confidence intervals give us an estimate of a parameter and our uncertainty in it, and these are what should interest us. In many cases, the null hypothesis being tested is not a reasonable hypothesis, or is even one that must be false by its very nature, yet many scientists feel compelled to use a classical approach or a p-value to test it.

For example, suppose you wanted to compare the density of fiddler crab burrows in two marshes, one influenced by fertilizer runoff and one not. The standard approach would be to make replicate measurements of the density of burrows in the two marshes, and then use a statistical test on the difference of means. The null hypothesis would be that the difference of means is zero, indicating that *nothing special is going on*, that the two marshes are the same because fertilizer runoff has no effect. You already know the answer to this question without collecting one bit of data, because no two marshes anywhere could have *exactly* the same density of fiddler crab burrows. Whether you actually reject the null hypothesis is simply a matter of how much data you are willing and can afford to collect. Instead of testing a patently false null hypothesis (of no difference), we should focus our energy on estimating the difference in burrow density, and calculating our uncertainty in that difference. This is what confidence intervals give us, and we should construct them.
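As a rough sketch of the estimation approach, the code below builds a confidence interval on the difference in mean burrow density. The data are simulated, and the variable names, sample sizes, means, and standard deviations are all invented for illustration. It assumes the densities are close enough to normal for a t-based interval to be reasonable.

```r
set.seed(42)  # make the simulated example reproducible

# Hypothetical burrow densities (burrows per square meter) in replicate
# quadrats from each marsh; these are simulated, not real, data.
fertilized   <- rnorm(20, mean = 52, sd = 10)
unfertilized <- rnorm(20, mean = 45, sd = 10)

# Point estimate of the difference in mean burrow density
difference <- mean(fertilized) - mean(unfertilized)

# 95% confidence interval on that difference (t.test() uses the Welch
# version, which does not assume equal variances, by default)
ci <- t.test(fertilized, unfertilized, conf.level = 0.95)$conf.int

difference
ci
```

The estimate and the width of the interval describe how large the difference is and how well it is known, which a p-value alone does not.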

A friend of mine, Dan McShea, once stated when we were in graduate school that “Science is the art of substituting the question that can be answered for the question that is interesting”. Too much statistical hypothesis testing reduces to exactly this: endless testing of meaningless and plainly false null hypotheses in the name of scientific rigor, without determining the size of an effect and our uncertainty in it.

To make inferences about a population, we collect a sample, and all statistical tests assume that it is a *random* sample. A random sample is said to be an **i.i.d. sequence**, that is, an independent and identically distributed sequence of observations. Let’s consider both parts of this definition.

**Independence** means that the probability distribution of each observation is not controlled by the value of any other observation. Using the principles for avoiding pseudoreplication is a good way to ensure independence.

**Identically distributed** means that the probability distribution of each observation is the same as that of the population. In other words, each item in the population had an equal opportunity of being included in the sample. In a biased sample, particular observations are systematically excluded or included, such that not all items in the population have an equal opportunity of being sampled. Choosing samples based on a preliminary examination of whether they look “good” is a quick route to making a biased sample.

Parametric tests are based on assuming a particular distribution, commonly a normal distribution. Non-parametric tests are often described as distribution-free statistics because they do not assume a particular distribution. Because they do not make such assumptions, non-parametric tests typically have lower power than parametric tests when the assumptions of the parametric test are met.

Because many parametric tests assume normality in the data or the statistic, you must check the assumption of normality prior to running such a test. In particular, parametric tests are sensitive to skewness in data, where one tail is substantially longer than the other. Sample skewness can be calculated with this formula:

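skewness <- function(x) {
  n <- length(x)
  xbar <- mean(x)
  # third central moment
  m3 <- sum((x - xbar)^3) / n
  # cube of the sample standard deviation
  s3 <- sd(x)^3
  # sample skewness, adjusted for sample size
  G1 <- n^2 / ((n - 1) * (n - 2)) * m3 / s3
  G1
}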

A symmetrical distribution has zero skewness, a right-tailed distribution has positive skewness, and a left-tailed distribution has negative skewness. You can also check skewness by comparing the mean and median: for symmetrical distributions, the mean and median will be equal, but the mean will be pulled towards the longer tail in skewed distributions.
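Both checks can be tried on small made-up data sets. In the sketch below, `skew_g1` is a hypothetical name for a compact restatement of the skewness formula given above, and the three vectors are invented values chosen to be right-tailed, left-tailed, and symmetrical.

```r
# Adjusted sample skewness, written compactly (same formula as above)
skew_g1 <- function(x) {
  n <- length(x)
  m3 <- sum((x - mean(x))^3) / n
  n^2 / ((n - 1) * (n - 2)) * m3 / sd(x)^3
}

right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 5, 9, 20)  # made-up values
left_tailed  <- -right_tailed                     # mirror image
symmetric    <- c(1, 2, 3, 4, 5, 6, 7)

skew_g1(right_tailed)   # positive
skew_g1(left_tailed)    # negative
skew_g1(symmetric)      # zero

mean(right_tailed)      # 5.2, pulled toward the long right tail
median(right_tailed)    # 3
```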

Kurtosis, the peakedness of a distribution, is less of a concern for most parametric tests that assume normality. Kurtosis can be calculated with this function:

kurtosis <- function(x) {
  # fourth central moment
  m4 <- mean((x - mean(x))^4)
  # subtract 3 so that a normal distribution has a kurtosis of zero
  kurt <- m4 / (sd(x)^4) - 3
  kurt
}

With this definition, which subtracts 3 to give what is often called excess kurtosis, normal distributions have zero kurtosis, excessively peaked (leptokurtic) distributions have positive kurtosis, and excessively broad (platykurtic) distributions have negative kurtosis.
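The sign of the kurtosis can be seen on two small made-up data sets. In this sketch, `excess_kurtosis` is a hypothetical name for a compact restatement of the calculation above, and the two vectors are invented to be deliberately flat and deliberately peaked.

```r
# Kurtosis minus 3, written compactly (same calculation as above)
excess_kurtosis <- function(x) {
  mean((x - mean(x))^4) / sd(x)^4 - 3
}

flat   <- seq(-1, 1, length.out = 21)   # evenly spread, no real peak
peaked <- c(rep(0, 18), -5, 5)          # sharp peak plus two extreme values

excess_kurtosis(flat)     # negative: platykurtic
excess_kurtosis(peaked)   # positive: leptokurtic
```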

Normality can be checked qualitatively by the shape of a frequency distribution, but this can be misleading in small data sets. For large data sets, you can use hist() to generate a histogram. For small data sets, stripchart() and stripchart(method="stack") are better for visualizing the distribution. In general, you are looking for symmetry: most of the data should be in the center of the distribution, with less data in each tail, and an equal amount in each tail. If you have different groups in your data, you must separate them before making any frequency distributions.

In bivariate data, where you have an x and a y variable, you can visualize bivariate normality on a scatter plot. For each axis, most of the data should lie in the center of the range, with an equal amount in either tail. In particular, beware of the comet distribution (shown below), where most data lie near the origin, with an expanding cloud to the top and right, which indicates a data set with bivariate positive skew. These distributions are particularly common in the natural sciences and they often indicate that both variables are log-normally distributed.

On a bivariate plot, you can also use the rug() function to show the distribution of points along each axis, similar to the stripchart() function. Be careful not to confuse how you sampled along the regression with whether the variables are normally distributed. For example, if you sampled from high-salinity marshes and low-salinity marshes, but not intermediate-salinity marshes, a plot of salinity will look bimodal, only because it reflects your sampling.

Finally, you can evaluate normality with a quantile-quantile plot, performed with the qqnorm() function. This type of plot compares the observed distribution to what would be expected if it was a normal distribution. If the data are normally distributed, this plot will display a linear relationship. Some curvature is normal near the ends of the plot, even for normally distributed data, but substantial departures from a linear relationship, particularly in the middle of the plot, indicate non-normality.
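A minimal sketch of such a plot follows. The two data sets are deterministic stand-ins rather than real measurements: evenly spaced quantiles of a normal and of a log-normal distribution, with arbitrary parameters. The reference line drawn by qqline() and the correlation of the plotted coordinates are conveniences added here, not a formal test.

```r
n <- 50

# Deterministic stand-ins for data (not real measurements): evenly spaced
# quantiles of a normal distribution and of a log-normal distribution.
normal_like  <- qnorm(ppoints(n), mean = 10, sd = 2)
right_skewed <- exp(qnorm(ppoints(n)))

# Draw the two plots side by side, each with a reference line
old_par <- par(mfrow = c(1, 2))
qq_normal <- qqnorm(normal_like, main = "Normal-like data")
qqline(normal_like)
qq_skewed <- qqnorm(right_skewed, main = "Right-skewed data")
qqline(right_skewed)
par(old_par)

# qqnorm() returns the plotted coordinates; their correlation is one
# rough numerical summary of how straight the plot is.
cor(qq_normal$x, qq_normal$y)   # essentially 1
cor(qq_skewed$x, qq_skewed$y)   # noticeably lower
```

The right-skewed data produce a plot that curves upward, the characteristic signature of a long right tail.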

In general, tests of normality have low power in small data sets, as you would expect of any test with a small sample size. For such data sets, a test of normality may fail to reject the null hypothesis that the data are normally distributed, when in fact they are not.
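The effect of sample size on power can be explored with a short simulation. This sketch assumes the Shapiro-Wilk test (shapiro.test() in base R) as one example of a normality test; the text does not single out a particular test. The exponential distribution, the sample sizes, the number of trials, and the helper name `rejection_rate` are likewise illustrative choices.

```r
set.seed(1)  # make the simulation reproducible

# Proportion of simulated samples of size n, drawn from a strongly
# right-skewed (exponential) distribution, for which the Shapiro-Wilk
# test rejects normality at alpha = 0.05.
rejection_rate <- function(n, trials = 500) {
  p_values <- replicate(trials, shapiro.test(rexp(n))$p.value)
  mean(p_values < 0.05)
}

power_small <- rejection_rate(n = 8)
power_large <- rejection_rate(n = 100)

power_small   # many non-normal samples go undetected
power_large   # nearly all are detected
```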

The MVN package includes several tests and plots for evaluating normality, and is especially useful for investigating multivariate normality.

Use non-parametric tests whenever:

- your measurement scale is nominal or ordinal, or
- your distribution is non-normal.

Use parametric tests whenever:

- your measurement scale is interval or ratio, *and*
- your distribution is normal, or at least symmetrical.

Bear in mind that parametric tests generally have greater power than their corresponding non-parametric tests.

Some data are inherently not normally distributed, but can be turned into normally distributed data through the use of a transformation. Transformations change the scale of measurement of data, meaning that the spacing between values may change non-uniformly, but that the order of observations is unchanged. Some people have reservations about using transformations, but the need for a data transformation commonly indicates that the data should not be measured on a linear scale. For example, we would never measure pH on a linear scale, and no one objects to its being measured on a log scale. Likewise, many other types of measures also have a natural scale of measurement, and it may not be linear.

There are two common types of data transformations, and we must discuss these now because you will deal with them so frequently. Be aware that there are several other types of transformations.

The log transformation is one of the most common transformations because many natural phenomena should be measured on a log scale, such as pH and grain size. Log-normally distributed data commonly arise when a measurement is affected by many variables that have a multiplicative effect on one another.

Try a log transformation when the data are right-skewed, when means and variances are positively correlated, and when negative values are impossible.

The log transformation is performed by taking the logarithm of all of your observations. The base is generally not important, although you should report it and you must use a consistent base for everything. If you have zeroes in your data set, such as for species abundances, you should use a log(x+1) transformation, where you first add a one to all observations, then take the logarithm. There are also other options for transformations of data with zeroes.

A log transformation shortens the right tail of a distribution. It also tends to make data **homoscedastic**, that is, to make different data sets have equal variances, which is important for some tests, like the ANOVA.
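The effect of a log transformation can be sketched as follows. The right-skewed values are a deterministic stand-in for data (evenly spaced quantiles of a log-normal distribution with arbitrary parameters), and the abundance counts are invented. The choice of base 10 in the first part is arbitrary.

```r
# Deterministic stand-in for right-skewed measurements (not real data):
# evenly spaced quantiles of a log-normal distribution.
x <- qlnorm(ppoints(200), meanlog = 2, sdlog = 1)

mean(x) - median(x)            # positive: mean pulled toward the right tail

log_x <- log10(x)              # base-10 log; report whichever base you use
mean(log_x) - median(log_x)    # essentially zero: symmetric after transforming

# Hypothetical species abundances that include zeroes
abundance <- c(0, 0, 1, 3, 4, 12, 57, 240)
log(abundance + 1)             # log(x+1); log(0) alone would give -Inf
log1p(abundance)               # base R shortcut for the natural-log version
```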

The square-root transformation is common for data that follow a Poisson distribution, such as the number of objects in a fixed amount of space or events in a fixed amount of time.

Use a square-root transformation when the data are counts or when the mean and variance are equal, which is true for all Poisson distributions.

The square-root transformation is performed as its name suggests: take the square root of each of your observations. Some suggest using a (x+0.5)^{0.5} transformation if your data include many zeroes. Similar to the log(x+1) transformation, add 0.5 to all of your observations, then take their square roots.

Like a log transformation, a square-root transformation shortens the right tail and tends to make data homoscedastic.
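A short simulation illustrates how the square-root transformation evens out the variances of count data. The counts are simulated, and the three Poisson means and the sample size are arbitrary choices for illustration.

```r
set.seed(7)  # make the simulation reproducible

# Simulated counts (not real data) from three Poisson distributions
# with very different means
means <- c(10, 40, 160)
counts <- lapply(means, function(m) rpois(5000, lambda = m))

# Variance of the raw counts tracks the mean...
raw_var <- sapply(counts, var)

# ...but is roughly constant (near 0.25) after a square-root transformation
sqrt_var <- sapply(counts, function(x) var(sqrt(x)))

raw_var
sqrt_var
```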

Anderson, D.R., K.P. Burnham, and W.L. Thompson, 2000. Null hypothesis testing: problems, prevalence, and an alternative. Journal of Wildlife Management 64:912–923.

Herrnstein, R.J., and C. Murray, 1994. The Bell Curve: Intelligence and Class Structure in American Life. Free Press: New York, 912 p.

Johnson, D.H., 1995. Statistical sirens: the allure of nonparametrics. Ecology 76:1998–2000.

Jost, L., 2009. D vs. G_{ST}: Response to Heller and Siegismund (2009) and Ryman and Leimar (2009). Molecular Ecology 18:2088–2091.

Schervish, M.J., 1996. P values: what they are and what they are not. The American Statistician 50:203–206.

Seaman, J.W., and R.G. Jaeger, 1990. Statisticae dogmaticae: a critical essay on statistical practice in ecology. Herpetologica 46:337–346.

Toft, C.A., 1990. Reply to Seaman and Jaeger: an appeal to common sense. Herpetologica 46:357–361.

Yoccoz, N.G., 1991. Use, overuse, and misuse of significance tests in evolutionary biology and ecology. Bulletin of the Ecological Society of America 72:106–111.