Steven Holland

# Statistics fundamentals

## Kinds of data

There are four types of data, of increasing usefulness:

Nominal data consist of mutually exclusive and unranked categories, like granite and basalt, or red, blue and green. Nominal data are also called categorical data or attribute data.

Ordinal data are like nominal data, but they are ordered or ranked. The size of the steps in the ranking may be unequal. The Mohs hardness scale is an example of ordinal data, as are metamorphic grade and Mary Droser’s scale of ichnofabric or burrowing intensity.

Interval data are like ordinal data, but they have steps of equal size. Interval data lack a true natural zero, that is, their zero is set at an arbitrary point rather than a fundamental point. For example, the zero point for Celsius was arbitrarily defined as the freezing point of water; it could have been defined to be the freezing point of ethanol. Likewise, the zero point in isotopic ratios is defined based on an arbitrarily selected isotopic standard. Interval data are also called measurement variables. Interval data measured on different scales can be converted through multiplication and addition, such as converting between Fahrenheit and Celsius.

Ratio data are like interval data, but they have a natural or fundamental zero point, one that is not arbitrarily chosen. For example, 0 K (Kelvin) and 0 °R (Rankine) are both fundamental zero points: absolute zero, classically described as the cessation of molecular motion. Zero length is likewise a fundamental quantity: the absence of any length. Ratio data are also called measurement variables. Conversion between ratio data on different scales (e.g., length in cm vs. length in inches) requires only multiplication.
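The difference between the two kinds of conversion can be sketched in a few lines of Python. This is only an illustration: the function names are invented for the example, and the conversion factors are the standard ones (9/5 and 32 for Celsius to Fahrenheit, 2.54 cm per inch).

```python
def celsius_to_fahrenheit(c):
    # Interval scale: multiplication AND addition, because the zero points differ.
    return c * 9 / 5 + 32

def cm_to_inches(cm):
    # Ratio scale: multiplication only, because both scales share a true zero.
    return cm / 2.54

# Ratios are preserved for ratio data: 20 cm is twice 10 cm in any unit.
length_ratio_cm = 20 / 10
length_ratio_in = cm_to_inches(20) / cm_to_inches(10)

# Ratios are not preserved for interval data: 20 °C is not "twice as warm" as 10 °C.
temp_ratio_c = 20 / 10
temp_ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)  # 68 / 50

print(length_ratio_cm, length_ratio_in)
print(temp_ratio_c, temp_ratio_f)
```

The length ratio is the same in both units, whereas the temperature ratio changes with the scale, which is why statements like "twice as much" are meaningful only for ratio data.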

Measurement variables may be open or closed. Open variables are not constrained by the system of measurement, whereas closed variables are, such as percentages, which must sum to 100%. Measurement variables may also be described as continuous, where all intermediate values are possible, or discrete, where only certain values are possible, typically integers. Temperature is an example of a continuous measurement variable, whereas the number of individuals is a discrete measurement variable.
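A small sketch shows what closure does to data. The mineral names and point counts below are made up for illustration; the only thing that changes between the two cases is the quartz count.

```python
def to_percent(counts):
    """Convert raw counts (open data) to percentages (closed data)."""
    total = sum(counts.values())
    return {name: 100 * value / total for name, value in counts.items()}

# Hypothetical point counts from a thin section.
counts = {"quartz": 50, "feldspar": 30, "lithics": 20}
before = to_percent(counts)

counts["quartz"] = 150  # only the quartz count changes
after = to_percent(counts)

print(before)
print(after)
```

Although the feldspar and lithic counts are unchanged, their percentages drop, because the percentages are forced to sum to 100%. Open data have no such constraint.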

## Populations versus samples

A population is a well-defined set of elements that may be either finite or infinite in size. It is the total collection of objects to be studied, from which a subset is usually selected for study.

A sample is a subset of elements drawn from a population. Note that this is not the same as a geological sample or a tissue sample, etc., so be aware that people use the term “sample” in different contexts. Samples may be random, that is, collected without systematic inclusion or exclusion, or they may be biased, that is, collected from a population in a way that systematically includes or excludes part of the population. All statistical analyses assume that samples are random. You must ensure this through your sampling design, and if you do not, your conclusions may well be invalid.
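The effect of a biased sample can be sketched with a simulation. Everything here is hypothetical: the population is 10,000 simulated specimen sizes, and the bias is a collector who only picks up specimens larger than 50.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 specimen sizes (mean 50, standard deviation 10).
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Random sample: every element has the same chance of being chosen.
random_sample = rng.sample(population, 100)

# Biased sample: specimens of size 50 or less are systematically excluded.
large_only = [x for x in population if x > 50]
biased_sample = rng.sample(large_only, 100)

print(f"population mean:    {mean(population):.1f}")
print(f"random sample mean: {mean(random_sample):.1f}")
print(f"biased sample mean: {mean(biased_sample):.1f}")
```

The random sample's mean should land near the population mean, while the biased sample's mean is shifted well above it, and no amount of additional biased sampling would fix that.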

The general problem in statistics is that we are nearly always more interested in the population, but we can examine only a sample because we are constrained by time, money, or effort. As a result, we collect an unbiased sample and use it to make inferences about the population.

## Statistics versus parameters

A parameter is a measurement that describes some aspect of a population, such as its mean or variance. A statistic is the corresponding measurement that describes a sample. The easy mnemonic is that parameter and population begin with a “p”, and statistic and sample begin with an “s”. Roman letters are generally used for statistics (e.g., s for sample standard deviation), with the corresponding Greek character for the parameter (σ for population standard deviation).

Rephrasing the general problem in statistics, we measure statistics on samples, but we’re interested in parameters of the population. For example, we may have measured the means of two samples (statistics), but what we really want to know is how the means of the populations compare (that is, the parameters). Similarly, we may measure the mean of a sample (the statistic), and use it to estimate the mean of the population (the parameter) and our uncertainty in that estimate. In both cases, the quality of our comparison or our estimate of uncertainty will be controlled by the amount of replication. More replication translates to less uncertainty.
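The link between replication and uncertainty can be illustrated by simulation. In this sketch, the population (normal, mean 100, standard deviation 15) and the sample sizes are arbitrary choices; the function name is invented for the example.

```python
import random
from statistics import fmean, stdev

rng = random.Random(7)  # fixed seed so the sketch is reproducible

def spread_of_sample_means(n, replicates=200):
    """Standard deviation of the means of many random samples of size n."""
    means = []
    for _ in range(replicates):
        sample = [rng.gauss(100, 15) for _ in range(n)]
        means.append(fmean(sample))
    return stdev(means)

spread = {n: spread_of_sample_means(n) for n in (10, 100, 1000)}
for n, s in spread.items():
    print(f"n = {n:4d}: sample means vary with standard deviation {s:.2f}")
```

The sample means scatter less around the population mean as the sample size grows, which is what "more replication translates to less uncertainty" means in practice.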

## Distributions

A frequency distribution describes the frequency with which all observations were made over the range of possible values. Often, a frequency distribution will be rescaled to probability. Frequency distributions are visualized with histograms. Cumulative frequency distributions show the cumulative frequency or probability beginning at one edge of a distribution and progressing to the other. In terms of percentage, such cumulative distributions must begin at zero and end at one hundred.
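As a minimal sketch, a frequency distribution and its cumulative form can be built by hand from a handful of observations. The data and bin edges below are invented for the example.

```python
from itertools import accumulate

observations = [2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.5, 6.1, 6.3, 7.7]
bin_edges = [2, 4, 6, 8]  # three bins: [2, 4), [4, 6), [6, 8)

# Frequency: number of observations falling in each bin.
counts = [
    sum(low <= x < high for x in observations)
    for low, high in zip(bin_edges, bin_edges[1:])
]

# Rescale counts to percentages, then accumulate from left to right.
percent = [100 * c / len(observations) for c in counts]
cumulative_percent = list(accumulate(percent))

print(counts)              # heights of the histogram bars
print(percent)
print(cumulative_percent)  # value reached at the right edge of each bin
```

The cumulative percentage is zero at the left edge of the first bin and reaches one hundred at the right edge of the last bin.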

### Some common theoretical distributions

The normal or Gaussian distribution is a symmetrical continuous distribution with the familiar bell shape. This distribution arises when a variable is affected by many independent factors, whose contributions are additive (not multiplicative).

The lognormal distribution is an asymmetrical humped continuous distribution with a long right tail. Taking the log of this distribution produces a normal distribution, hence the name. A lognormal distribution arises like a normal distribution, except that the effects of the underlying factors are multiplicative. Lognormal distributions are very common in nature (e.g., grain size, concentration, body size), and should often be suspected when values below zero are impossible.
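The additive and multiplicative origins of the normal and lognormal distributions can be sketched by simulation. The setup is an assumption made for illustration: each value is controlled by 20 independent factors, each drawn uniformly between 0.5 and 1.5, which are either summed or multiplied.

```python
import math
import random
from statistics import fmean, median

rng = random.Random(5)  # fixed seed so the sketch is reproducible
n_factors, n_values = 20, 5000

def factors():
    # Hypothetical independent factors, each between 0.5 and 1.5.
    return [rng.uniform(0.5, 1.5) for _ in range(n_factors)]

sums = [sum(factors()) for _ in range(n_values)]            # additive effects
products = [math.prod(factors()) for _ in range(n_values)]  # multiplicative effects
log_products = [math.log(x) for x in products]

# In a symmetrical distribution the mean and median nearly coincide;
# a long right tail pulls the mean above the median.
for name, values in [("sums", sums), ("products", products), ("log(products)", log_products)]:
    print(f"{name:14s} mean = {fmean(values):7.3f}   median = {median(values):7.3f}")
```

The sums and the logged products should be close to symmetrical, while the raw products are right-skewed and never fall below zero, matching the description of lognormal data.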

The binomial distribution is a discrete distribution that describes the number of successes in a series of trials, where the probability of success is fixed, for example, the number of heads when you flip a coin a given number of times. When the number of trials is large, the binomial distribution approximates the shape of a normal distribution. Binomial distributions are symmetrical when the probability is 0.5, but asymmetrical when the probability is greater than 0.5 (left-tailed) or less than 0.5 (right-tailed).
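The shape of the binomial distribution can be checked directly from its probabilities. The sketch below uses the standard binomial formula; the function names and the choices of n and p are only for illustration.

```python
from math import comb

def binomial_pmf(n, p):
    """Probability of k = 0..n successes in n trials with success probability p."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def skewness_of_pmf(pmf):
    """Skewness computed from the probabilities of the outcomes 0, 1, 2, ..."""
    mean = sum(k * pk for k, pk in enumerate(pmf))
    variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
    third_moment = sum((k - mean) ** 3 * pk for k, pk in enumerate(pmf))
    return third_moment / variance**1.5

# Ten trials: positive skew (right tail) for p < 0.5, negative for p > 0.5.
for p in (0.1, 0.5, 0.9):
    print(f"n = 10,  p = {p}: skewness = {skewness_of_pmf(binomial_pmf(10, p)):+.3f}")

# More trials with the same p: the distribution becomes more symmetrical.
print(f"n = 100, p = 0.1: skewness = {skewness_of_pmf(binomial_pmf(100, 0.1)):+.3f}")
```

With the same probability of success, the skewness shrinks toward zero as the number of trials increases, consistent with the approach to a normal shape.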

The Poisson distribution is a discrete distribution that describes the number of events occurring in a fixed period of time, where the events occur at a fixed average rate and the events are independent of one another, that is, they are not dependent on the time since the last event. Poisson distributions also describe the number of objects in a fixed area or volume, when the occurrence of each object is independent of any other objects. The number of chocolate chips per cookie in a batch of cookies follows a Poisson distribution. The Poisson distribution is a right-tailed distribution when the rate of events or occurrences is small, but becomes increasingly symmetrical as the rate of events or occurrences increases.

The exponential distribution is related to the Poisson distribution, but is a continuous distribution of the time between events when those events occur at a fixed average rate, or equivalently, the distance between objects. Exponential distributions are asymmetrical, with a long right tail.
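The relationship between the two distributions can be sketched by simulating one process and measuring it two ways. The rate of 3 events per unit time and the length of the record are arbitrary choices for the example.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(1)  # fixed seed so the sketch is reproducible
rate = 3.0              # assumed average number of events per unit time
total_time = 5000       # number of unit-length intervals to simulate

# Generate event times by adding up exponentially distributed waiting times.
event_times = []
t = rng.expovariate(rate)
while t < total_time:
    event_times.append(t)
    t += rng.expovariate(rate)

# Continuous view: the time between successive events (exponential).
gaps = [later - earlier for earlier, later in zip(event_times, event_times[1:])]

# Discrete view: the number of events in each unit interval (Poisson).
counts = [0] * total_time
for event in event_times:
    counts[int(event)] += 1

print(f"mean gap between events: {fmean(gaps):.3f} (1 / rate = {1 / rate:.3f})")
print(f"mean count per interval: {fmean(counts):.3f} (rate = {rate})")
print(f"variance of the counts:  {pvariance(counts):.3f}")
```

The mean waiting time should be close to one over the rate, and the counts per interval should have a mean close to the rate. The variance of the counts should also be close to the rate, a well-known property of the Poisson distribution.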

The even or uniform distribution is a flat, continuous distribution where the probability of all outcomes is equal. Few things in nature follow such a distribution, although it is the most common distribution made by random number generators, so beware of misapplications in some numerical models.

## Descriptive statistics

Central tendency or location is commonly described with the mean, which is the sum of the measurements divided by the sample size, symbolized with a lowercase “n”. A sample mean is indicated by an x with a bar over it (x̄), and a population mean (a parameter) is indicated by the Greek letter mu (μ). The sample mean is an unbiased estimator of the population mean, that is, it does not tend to be systematically larger or smaller than the population mean: averaged over many random samples, it equals the population mean.

Central tendency can also be measured by the median, that is, the value for which half of the sample is smaller and half of the sample is larger. In the case of an even number of observations, the median is the average of the two middle values. Central tendency can also be measured by the mode, the most frequent value, which appears as the highest peak in a histogram. Last, central tendency can be measured by the geometric mean, which is the nth root of the product of all the observations, although this is more easily calculated by taking the mean of the logarithms of the observations and then exponentiating it (raising e to that mean when natural logarithms are used).
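These measures of central tendency can be computed with Python's standard `statistics` module, as in the following sketch. The five observations are made up, and the geometric mean is calculated both ways described above to show that they agree.

```python
import math
from statistics import geometric_mean, mean, median, mode

observations = [1, 2, 2, 4, 8]
n = len(observations)

arithmetic = mean(observations)   # sum of the observations divided by n
middle = median(observations)     # middle value of the sorted observations
most_common = mode(observations)  # most frequent value

# Geometric mean two ways: nth root of the product, and e raised to the mean log.
gm_root = math.prod(observations) ** (1 / n)
gm_logs = math.exp(mean([math.log(x) for x in observations]))

print(arithmetic, middle, most_common)
print(gm_root, gm_logs, geometric_mean(observations))

# With an even number of observations, the median averages the two middle values.
print(median([1, 2, 4, 8]))
```

For these right-skewed values, the mean is pulled above both the median and the geometric mean by the single large observation.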

Because there are different measures of central tendency, one might ask which is the best. There is no simple answer, but one criterion is called efficiency. One statistic is said to be more efficient than another if it is more likely to lie closer to the population parameter. For normally distributed data, the mean is a more efficient statistic of central tendency than the median.
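Efficiency can be illustrated by simulation. The sketch below assumes a normally distributed population (mean 0, standard deviation 1) and a sample size of 25, both arbitrary choices; the result may differ for other distributions.

```python
import random
from statistics import fmean, median, stdev

rng = random.Random(3)  # fixed seed so the sketch is reproducible
n, replicates = 25, 2000

sample_means, sample_medians = [], []
for _ in range(replicates):
    sample = [rng.gauss(0, 1) for _ in range(n)]
    sample_means.append(fmean(sample))
    sample_medians.append(median(sample))

# Both statistics estimate the same parameter (0 here); the more efficient
# one scatters less around it from sample to sample.
spread_mean = stdev(sample_means)
spread_median = stdev(sample_medians)
print(f"spread of sample means:   {spread_mean:.3f}")
print(f"spread of sample medians: {spread_median:.3f}")
```

Under these assumptions the sample means cluster more tightly around the population value than the sample medians do, so the mean is the more efficient of the two.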

The variation or spread or dispersion of a distribution can also be measured several ways. Range is the difference between the largest and smallest values. Range tends to increase with sample size, and this sensitivity is not desirable. Interquartile deviation, more commonly called the interquartile range, is the difference between the largest and smallest values after the largest 25% and smallest 25% of the distribution have been removed. It is somewhat less sensitive to sample size than range.
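The sensitivity of these two measures to sample size can be sketched by simulation. The population (normal, mean 0, standard deviation 1) and the two sample sizes are arbitrary choices, and the helper names are invented for the example.

```python
import random
from statistics import fmean, quantiles

rng = random.Random(11)  # fixed seed so the sketch is reproducible

def range_and_iqr(sample):
    q1, _, q3 = quantiles(sample, n=4)  # the three quartile cut points
    return max(sample) - min(sample), q3 - q1

def average_spread(n, replicates=200):
    """Average range and average interquartile range over many samples of size n."""
    results = [
        range_and_iqr([rng.gauss(0, 1) for _ in range(n)])
        for _ in range(replicates)
    ]
    ranges, iqrs = zip(*results)
    return fmean(ranges), fmean(iqrs)

small = average_spread(10)
large = average_spread(1000)
print(f"n = 10:   range = {small[0]:.2f}, interquartile range = {small[1]:.2f}")
print(f"n = 1000: range = {large[0]:.2f}, interquartile range = {large[1]:.2f}")
```

Going from 10 to 1000 observations, the range grows substantially even though the population has not changed, whereas the interquartile range changes far less.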

Variance plays a crucial role in statistics. The population variance is the average squared deviation of all possible observations from the population mean; the sample variance estimates it from the squared deviations of the observations from the sample mean, summed and divided by n − 1. Variance is symbolized as s² for samples and σ² for populations. Standard deviation is the square root of variance, which places it on the same measurement scale as the mean. The coefficient of variation is the standard deviation divided by the mean. This dimensionless number is used to compare variation between samples that have different means, or between measurements made on different scales (such as centimeters and inches).
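A short sketch with Python's standard `statistics` module shows these quantities for a made-up set of five lengths, measured in centimeters and converted to inches. Note that `variance` and `stdev` divide by n − 1 (the usual sample formulas), whereas `pvariance` divides by n and treats the data as an entire population.

```python
import math
from statistics import mean, pvariance, stdev, variance

lengths_cm = [10.0, 12.0, 14.0, 16.0, 18.0]
lengths_in = [x / 2.54 for x in lengths_cm]

population_variance = pvariance(lengths_cm)  # sum of squared deviations / n
sample_variance = variance(lengths_cm)       # sum of squared deviations / (n - 1)
sample_sd = stdev(lengths_cm)                # square root of the sample variance

def coefficient_of_variation(data):
    """Standard deviation divided by the mean (dimensionless)."""
    return stdev(data) / mean(data)

print(population_variance, sample_variance, sample_sd)
print(stdev(lengths_cm), stdev(lengths_in))  # differ: they depend on the units
print(coefficient_of_variation(lengths_cm), coefficient_of_variation(lengths_in))
```

The standard deviations differ between the two units, but the coefficient of variation is the same, which is what makes it useful for comparing variation across scales.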

The shape of a distribution is usually described in terms of the number of modes, or peaks on a histogram. Unimodal distributions have one peak, bimodal have two, and multimodal have more than two peaks. Skewness refers to the asymmetry of a distribution. Right-skewed distributions have positive skew and a longer right tail. Left-skewed distributions have negative skew and a longer left tail. Skewed distributions often cause problems for statistical analysis and require special treatment, such as data transformations or the use of non-parametric statistics. Kurtosis is less commonly used, but it describes the peakedness and tail weight of a distribution relative to a normal distribution. Distributions that are strongly pointed, with more data concentrated near the center and in the tails than a normal distribution, are called leptokurtic and have positive kurtosis (measured as excess over a normal distribution). Distributions that are broader or flatter, with less data in their tails than a normal distribution, are called platykurtic and have negative kurtosis.
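Skewness and kurtosis can be computed from the moments of a sample, as in the following sketch. The formulas are the simple moment-based versions (with kurtosis reported as excess over the value of 3 for a normal distribution), and the four simulated samples are arbitrary illustrations.

```python
import random
from statistics import fmean

def skewness(data):
    """Third standardized moment: positive for a longer right tail."""
    m = fmean(data)
    m2 = fmean([(x - m) ** 2 for x in data])
    m3 = fmean([(x - m) ** 3 for x in data])
    return m3 / m2**1.5

def excess_kurtosis(data):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    m = fmean(data)
    m2 = fmean([(x - m) ** 2 for x in data])
    m4 = fmean([(x - m) ** 4 for x in data])
    return m4 / m2**2 - 3

rng = random.Random(9)  # fixed seed so the sketch is reproducible
n = 5000
samples = {
    "normal": [rng.gauss(0, 1) for _ in range(n)],
    "exponential": [rng.expovariate(1) for _ in range(n)],  # long right tail
    "uniform": [rng.random() for _ in range(n)],            # flat, no tails
    # The difference of two exponentials is a Laplace distribution: pointed, heavy tails.
    "laplace": [rng.expovariate(1) - rng.expovariate(1) for _ in range(n)],
}

for name, data in samples.items():
    print(f"{name:12s} skewness = {skewness(data):+.2f}   excess kurtosis = {excess_kurtosis(data):+.2f}")
```

The exponential sample should show clear positive skew, the flat uniform sample negative excess kurtosis, and the pointed, heavy-tailed Laplace sample positive excess kurtosis, with the normal sample near zero on both.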