As a scientist, you know that scientists ask an enormously wide range of questions about the world. As a statistician, you will see that most of these questions fall into a few basic categories. Once you recognize these categories and the types of statistical analyses each demands, your perspective on science will change, and you will be able to grasp a much broader array of scientific questions. What follows are some of the most common statistical questions; use them as a guide to which statistical techniques to apply in a given situation.
How does the central tendency or typical value compare in two or more groups of data? The central tendency refers to the expected value of a variable; it is described by statistics such as the mean, median, and mode. Tests of the mean in two or more groups are typically done with a t-test or ANOVA, and similar tests of the median can be done with their nonparametric equivalents, such as the Wilcoxon rank-sum (Mann-Whitney U) and Kruskal-Wallis tests.
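As a concrete sketch of these two approaches in Python with SciPy (the two groups here are simulated data, purely for illustration):

```python
import numpy as np
from scipy import stats

# Simulated measurements for two groups (illustrative data only)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)  # e.g., control
group_b = rng.normal(loc=12.0, scale=2.0, size=30)  # e.g., treatment

# Parametric: two-sample t-test compares the means
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Nonparametric: Wilcoxon rank-sum (Mann-Whitney U) compares the groups
# without assuming normality
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)

print(f"t-test p = {t_p:.4f}, Mann-Whitney p = {u_p:.4f}")
```

For more than two groups, `stats.f_oneway` (ANOVA) and `stats.kruskal` (Kruskal-Wallis) follow the same call pattern.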
How does the amount of variation compare in two or more groups of data? Variation can be described with the range, sum of squares, standard deviation, and variance. Common tests include the F-test and ANOVA, although ANOVA is more often used to test for differences in means.
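A minimal sketch of comparing variation in two simulated groups follows; the classic variance-ratio F-test assumes normality, so Levene's test (available in SciPy) is shown as a more robust alternative:

```python
import numpy as np
from scipy import stats

# Two simulated groups with different spreads (illustrative data only)
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=25)
y = rng.normal(0.0, 2.0, size=25)

# Classic variance-ratio F-test: larger sample variance on top
var_x, var_y = np.var(x, ddof=1), np.var(y, ddof=1)
f = max(var_x, var_y) / min(var_x, var_y)
df1 = df2 = len(x) - 1
p_f = 2 * stats.f.sf(f, df1, df2)  # two-tailed p-value

# Robust alternative that does not assume normality
w_stat, p_levene = stats.levene(x, y)
```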
Are two variables correlated in a set of data, that is, do they covary? As one variable increases in value, the other might increase or decrease. The strength of this relationship can be measured in several ways, including the parametric Pearson correlation coefficient and the nonparametric Spearman and Kendall rank correlation coefficients.
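All three correlation coefficients are available in SciPy; this sketch computes them on simulated data with a built-in positive relationship:

```python
import numpy as np
from scipy import stats

# Simulated covarying variables (illustrative data only)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=40)
y = 3.0 * x + rng.normal(0, 2.0, size=40)  # y increases with x, plus noise

# Parametric: Pearson's r measures linear association
r_pearson, p1 = stats.pearsonr(x, y)

# Nonparametric: Spearman's rho and Kendall's tau measure rank association
r_spearman, p2 = stats.spearmanr(x, y)
tau, p3 = stats.kendalltau(x, y)
```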
What is the mathematical relationship that best describes how one variable changes with another? This is a question about regression, and there are many parametric and nonparametric techniques for it. Note that regression is a different question from correlation, which asks about the strength of a relationship. Multiple regression can be applied when a single dependent variable is controlled by more than one independent variable.
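A simple linear regression can be sketched with SciPy's `linregress`; the data here are simulated with a known slope and intercept so the fit has something to recover:

```python
import numpy as np
from scipy import stats

# Simulated data from a known linear relationship (illustrative only)
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(0, 1.0, size=50)

# Ordinary least-squares fit: returns slope, intercept, r, p, and stderr
fit = stats.linregress(x, y)
print(f"y = {fit.slope:.2f}x + {fit.intercept:.2f}")
```

Note that `fit.rvalue` is the correlation coefficient, which underscores the link between the two questions: correlation measures the strength of the relationship, while regression describes its form.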
Questions about these four topics (central tendency, variation, correlation, and regression) can also be answered with other classes of analyses, including resampling, likelihood, and Bayesian statistics.
In a large data set with many variables and many samples, other questions arise. What is the primary structure of the data, that is, what are the sources of variation in the data? This helps you to address a variety of questions, including: Which variables covary with one another? Are there groups of covarying variables? Are some groups of samples more similar to one another than they are to those in other groups? How are samples distributed along natural gradients? Ordination techniques like the parametric Principal Components Analysis and the nonparametric Non-metric Multidimensional Scaling are a good place to start. Cluster analysis is another common technique for identifying similar groups of samples or similarly behaving variables.
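A bare-bones Principal Components Analysis can be sketched with nothing more than NumPy's singular value decomposition; the data here are simulated so that two of the three variables covary strongly. (For NMDS or cluster analysis, libraries such as scikit-learn provide ready implementations.)

```python
import numpy as np

# Simulated data: 100 samples, 3 variables; the first two covary strongly
rng = np.random.default_rng(3)
v1 = rng.normal(0, 3.0, size=100)
data = np.column_stack([
    v1,
    v1 + rng.normal(0, 0.5, size=100),  # nearly a copy of v1
    rng.normal(0, 1.0, size=100),       # independent variable
])

# PCA via SVD of the column-centered data matrix
centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

explained = s**2 / np.sum(s**2)  # proportion of variance per component
scores = centered @ vt.T         # sample positions on the principal axes
```

Because two variables covary, the first principal component should capture most of the variance, revealing the primary structure of the data.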
Collectively, these represent most of the common statistical questions and techniques, and they will be the subject of the course. Knowing how to address these will lay the groundwork for other statistical questions and analyses, ones that we won't cover in this course.
Your task as a data analyst is to identify the scientific question that is being asked, to translate that into a statistical question, and to apply the appropriate procedure. That is the goal of the course. We want to move beyond vague desires to “do some statistics” or “analyze” data, because these usually result in poorly chosen statistical methods or a never-ending morass of pointless analyses. Likewise, we want to avoid doing statistical analyses just because we can or because that is what everyone else does; the scientific question should always drive the data analysis.
In short, data analysis starts by stating your scientific question clearly. Only after that can the statistical question be phrased, usually making the analytical methods clear. Choose appropriate statistical techniques based on your question, and do only those.
For any statistical analysis, you must have replicates, that is, repeated measures of some quantity. The more replicates you have, the more reliable your results will be. How many replicates you need depends on what you are trying to detect: if you’re trying to detect a large effect or difference, a few replicates may suffice, but if you’re trying to detect a small effect, many more replicates will be required. Beware of any blanket statistical wisdom you may have heard, such as, “You must have x points to do any statistics.” Such statements are usually rooted in a particular type of analysis, and they are unlikely to apply in every situation, or even in most situations.
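The relationship between effect size and the number of replicates needed can be demonstrated with a quick simulation; the effect sizes, sample size, and number of trials below are arbitrary assumptions chosen purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def detection_rate(effect, n, trials=500):
    """Fraction of simulated experiments in which a two-sample
    t-test detects the true effect at p < 0.05."""
    hits = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)       # control group
        b = rng.normal(effect, 1.0, n)    # treatment, shifted by `effect`
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    return hits / trials

# Same small number of replicates, very different detection rates
large = detection_rate(effect=2.0, n=5)  # large effect: usually detected
small = detection_rate(effect=0.3, n=5)  # small effect: usually missed
```

With only five replicates per group, the large effect is detected most of the time, while the small effect is almost always missed; detecting the small effect reliably would require far more replicates.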
You will also need to randomize or intersperse your replicates so that the effect you are trying to detect is not inadvertently correlated with some other variable that systematically affects your samples.
For example, suppose you want to test the effect of a fertilizer on the growth of grass. Rather than apply the fertilizer to all of your experimental plots, you realize you will need to compare plots which have fertilizer to ones where it has not been applied. In other words, you need a control. You therefore plan two types of treatments: one in which the fertilizer is applied (black squares below), and one in which it is not (the control, shown in white squares below). Suppose you have eight trays of grass in your experiment, owing to the limitations of cost, time, and space. You decide to apply the fertilizer to four of these trays, leaving four as controls, giving you an equal number of replicates for each treatment. So far, so good. You have several different choices for which trays will receive the fertilizer and which will not. Three of these designs are good, and five are not.
Choice A-1 is completely randomized, where each tray is assigned a random number that determines whether it will receive the fertilizer or be a control. It is a good choice, provided that the treatment and control plots are well interspersed; with few plots, assigning them with a random number generator may not achieve good interspersion. Interspersion is what is desired, more so than strict randomization. Choice A-2 is also good, in which the trays are placed in pairs, with one of each pair receiving the fertilizer and one being a control. Such randomized block designs are common and effective. Last, a systematic alternation of the trays (A-3) is also a good choice, provided there are no external influences that might be regularly spaced, such as having growing lights only above the control trays. A-2 and A-3 are good for achieving interspersion of your treatments.
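The three good designs can be sketched in a few lines of Python; the tray numbering (0-7, in a row) and the labels F (fertilized) and C (control) are illustrative assumptions:

```python
import random

random.seed(4)  # fixed seed so the assignment is reproducible
trays = list(range(8))

# A-1: complete randomization -- shuffle four F and four C labels
# (check the result for interspersion; with few trays it may clump)
labels = ["F"] * 4 + ["C"] * 4
random.shuffle(labels)
completely_random = dict(zip(trays, labels))

# A-2: randomized blocks -- pair adjacent trays, randomize within each pair
blocked = {}
for left in range(0, 8, 2):
    pair = ["F", "C"]
    random.shuffle(pair)
    blocked[left], blocked[left + 1] = pair

# A-3: systematic alternation of treatment and control
alternating = {t: ("F" if t % 2 == 0 else "C") for t in trays}
```

Note that the blocked design (A-2) guarantees interspersion by construction, which is why randomized blocks are so widely used.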
Five strategies create a poorly designed experiment. The trays could be clumped with all of the fertilized treatments at one end and all of the controls at the other (B-1). If one end of the array is lit from a window but the other end is not, the amount of sunlight will differ systematically for the two groups. Similarly, the trays could be clumped (B-2), with all of the fertilized treatments on one table and all of the controls on another. Even worse, the two groups might be separated and isolated from one another (B-3), such as all the fertilized trays in one greenhouse and all of the controls in a different one. The trays may have been randomized (B-4), but some external agent might affect only one group, such as a watering system. Last, if you chose to have only two trays, but to take multiple samples from each tray, this would be equivalent to no replication (B-5) and is the most basic type of poor experimental design. In all cases, the lack of interspersion creates a poor experimental design.
Pseudoreplication occurs when you collect multiple observations that are not independent of one another. This is a problem because independence is a critical assumption underlying all statistical methods. For example, you may take measurements on one individual (organism, rock, outcrop, etc.) through time, called temporal pseudoreplication. Instead, sample through time, but on different individuals. You might also take several measurements from the same location, called spatial pseudoreplication. Instead, take your measurements in different locations. Pseudoreplication is a common problem, and you should watch for it. It affects not only experiments, but also non-experimental data sets.
The five cases of poor design in the example above are all examples of pseudoreplication, defined as when treatments are not replicated (although samples may be) or when replicates are not statistically independent. In cases B-1, B-2, and B-3, the replicates of the treatment are segregated from the controls. In case B-4, the replicates are interconnected in a way that makes them non-independent. In case B-5, repeated measurements (“replicates”) from the same experimental plot are clearly non-independent.
Although the figure was described in terms of the spatial positions of the samples, the same arguments apply if the figure describes the temporal ordering of the measurements.
Pseudoreplication is shockingly common. Of 537 ecological papers examined in one study (Hurlbert, 1984), only 191 described their sampling design and performed statistics, a depressingly low 36%. Of those 191, 62 (32%) committed pseudoreplication. The conclusions of each of these 62 papers are compromised by poor design.
You must design your work with replication and randomization in mind before you collect any data. No statistical analysis can salvage a data set lacking sufficient replication or randomization.
Hurlbert, S.H., 1984. Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54(2):187–211.