Steven Holland

# Covariance and Correlation

A very common problem in the sciences is describing the strength of the linear relationship of two variables. Covariance and correlation are the two most common measures of this relationship.

## Covariance

The sum of products lies at the core of covariance and is analogous to the sum of squares used to calculate variance.

$$SP_{jk} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$

where i is the observation number, n is the total number of observations, j is one variable, and k is the other variable.

Alternatively, the sum of products can be calculated as follows, where the first term is called the uncorrected sum of products and the second term is called the correction term:

$$SP_{jk} = \sum_{i=1}^{n} x_{ij} x_{ik} - \frac{\left(\sum_{i=1}^{n} x_{ij}\right)\left(\sum_{i=1}^{n} x_{ik}\right)}{n}$$

The sum of products is corrected in the sense that it subtracts the mean from each of the observations; that is, it is the sum of products relative to the means of each variable, just as we measured the sum of squares relative to the mean.

From the sum of products, we get covariance, which is analogous to variance:

$$\mathrm{cov}_{jk} = \frac{SP_{jk}}{n - 1}$$

Note that j and k can be switched with no change in the result.

Unlike the sum of squares and variance, the sum of products can be positive or negative, as can covariance. Covariance has units equal to the product of the units of the two measurements, similar to those of variance. Also like variance, the value of covariance is a function of the scale on which the data were measured, which is generally undesirable. For standard deviation, we formed a unitless measure of variation by using the coefficient of variation. For covariance, we will develop a unitless measure of the strength of the linear relationship by using the correlation coefficient.
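As a quick check of these definitions, the sketch below computes the sum of products both ways and converts it to covariance. The data values and the variable names (sp, sp_alt, cov_manual) are made up for illustration.

```r
# Hypothetical data: two measurements made on the same six samples
x <- c(2.1, 3.4, 4.0, 5.6, 6.3, 7.9)
y <- c(1.0, 2.2, 1.9, 3.8, 4.1, 5.5)
n <- length(x)

# Sum of products, measured relative to the mean of each variable
sp <- sum((x - mean(x)) * (y - mean(y)))

# Same quantity: uncorrected sum of products minus the correction term
sp_alt <- sum(x * y) - sum(x) * sum(y) / n

# Covariance is the sum of products divided by n - 1
cov_manual <- sp / (n - 1)

all.equal(sp, sp_alt)              # the two formulas agree
all.equal(cov_manual, cov(x, y))   # matches R's built-in cov()
all.equal(cov(x, y), cov(y, x))    # switching the variables changes nothing

# Covariance depends on measurement scale: multiplying x by 100
# (for example, converting m to cm) multiplies the covariance by 100
all.equal(cov(100 * x, y), 100 * cov(x, y))
```

Each all.equal() call should return TRUE, which shows both the equivalence of the two formulas and the scale dependence described above.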

## Pearson’s correlation coefficient

Pearson’s correlation coefficient is also commonly called the product-moment correlation coefficient or simply correlation coefficient.

To produce a dimensionless measure of correlation, we need to divide the covariance by something that has the same units, so we use the product of the standard deviations of the two variables:

$$r_{jk} = \frac{\mathrm{cov}_{jk}}{s_j s_k}$$

In effect, this standardizes the joint variation in the two variables by the product of the variation in each variable. This produces a dimensionless number, one unaffected by the scale on which we measure the data, which makes it a very useful measure.

Pearson’s correlation coefficient varies from -1 (perfect negative correlation) to +1 (perfect positive correlation), with a value of 0 indicating no linear relationship. If one variable does not vary, one of the standard deviations will be zero, making the denominator zero, and making the correlation coefficient undefined.
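These properties are easy to confirm numerically. The following is a minimal sketch using made-up data; the names r_manual and r_const are illustrative only.

```r
# Hypothetical data
x <- c(2.1, 3.4, 4.0, 5.6, 6.3, 7.9)
y <- c(1.0, 2.2, 1.9, 3.8, 4.1, 5.5)

# Covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_manual, cor(x, y))

# Changing the units of either variable leaves the coefficient unchanged
all.equal(cor(100 * x, y / 1000), cor(x, y))

# Perfect linear relationships give +1 or -1 (within floating-point error)
all.equal(cor(x, 3 * x + 2), 1)
all.equal(cor(x, -3 * x + 2), -1)

# A variable that does not vary has a standard deviation of zero,
# so the coefficient is undefined; R returns NA and issues a warning
r_const <- suppressWarnings(cor(x, rep(5, length(x))))
is.na(r_const)
```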

Note that Pearson’s correlation coefficient could also be calculated as the sum of products divided by the square root of the product of the two sums of squares:

$$r_{jk} = \frac{SP_{jk}}{\sqrt{SS_j \, SS_k}}$$

Covariance and correlation are easily calculated in R:

```r
cov(x, y)   # covariance
cor(x, y)   # correlation coefficient
```

As we have seen before, if we have a statistic, we can make inferences and test hypotheses if we know the expected distribution of the statistic. If the data are normally distributed, and if the null hypothesis is that there is zero correlation, Pearson’s correlation coefficient can be tested against a t-distribution, where the standard error of Pearson’s correlation coefficient is:

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$

The t-test for Pearson’s r is set up as:

$$t = \frac{r - \rho}{s_r}$$

Since the null hypothesis is that the correlation is zero, the ρ term drops out, and the t-statistic is simply Pearson’s correlation coefficient divided by its standard error. This t-test assumes random sampling and that the sampling distribution of Pearson’s r is normal, which will be true only if the null hypothesis is that ρ equals zero. The test also assumes that both variables are normally distributed. Like any t-test, the default is a two-tailed test, but you can perform a one-tailed test if you have theoretical reasons for believing that the correlation ought to be either positive or negative. As always, this alternative hypothesis cannot be determined from your data; it must come from prior knowledge, such as a theoretical model or other data sets. Finally, the test assumes that the correlation coefficient is calculated on interval or ratio data. If you have ordinal data, you must use a non-parametric test. This t-test has n-2 degrees of freedom, since population variance is estimated for two variables, not just one.

The t-test for Pearson’s correlation coefficient is most easily performed in R with cor.test(x, y). The output is laid out in the same way as that of t.test(). If you calculate the t-test on Pearson’s r manually, you will get the same results as if you run cor.test().

```r
cor.test(x, y)
```
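To see that the manual calculation and cor.test() agree, here is a minimal sketch. The simulated data, the seed, and the names se_r, t_stat, and p_value are assumptions made for illustration.

```r
set.seed(1)                       # arbitrary seed, for repeatability only
n <- 30
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)           # hypothetical correlated data

r <- cor(x, y)
se_r <- sqrt((1 - r^2) / (n - 2))             # standard error of r
t_stat <- r / se_r                            # rho is 0 under the null
p_value <- 2 * pt(-abs(t_stat), df = n - 2)   # two-tailed p-value

result <- cor.test(x, y)
all.equal(t_stat, unname(result$statistic))
all.equal(p_value, result$p.value)

# One-tailed test, appropriate only if theory predicts a positive correlation
cor.test(x, y, alternative = 'greater')$p.value
```

The two all.equal() calls should both return TRUE. The alternative argument of cor.test() accepts 'two.sided' (the default), 'greater', or 'less'.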

## Spearman’s rank correlation

Spearman’s rank correlation coefficient is a non-parametric measure of correlation based on the ranks of the observations. Suppose you could rank all of your x observations, assigning 1 to the smallest value, 2 to the next smallest, and so on. Suppose also that you did the same for your y-variable. If x and y were perfectly correlated, the ranks for the two data sets would correspond perfectly. Spearman’s rank correlation coefficient measures this correlation.

In R, you can calculate the Spearman rank correlation by setting the method parameter of cor() or cor.test():

```r
cor(x, y, method='spearman')
cor.test(x, y, method='spearman')
```

The Spearman correlation test assumes only that your data were randomly sampled. It will work with any type of data that can be ranked, including ordinal, interval, or ratio data. The Spearman test is valuable in that it is scale-independent, and because it is less sensitive to outliers than the Pearson correlation test.
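One way to see what the Spearman coefficient measures is to compute a Pearson correlation on the ranks directly. The sketch below uses made-up data and also illustrates scale independence: any transformation that preserves the order of the values leaves the ranks, and so the Spearman coefficient, unchanged.

```r
# Hypothetical data
x <- c(2.1, 3.4, 4.0, 5.6, 6.3, 7.9, 9.2)
y <- c(1.0, 2.2, 1.9, 3.8, 4.1, 5.5, 5.2)

rho <- cor(x, y, method = 'spearman')

# Spearman's coefficient equals Pearson's coefficient of the ranks
all.equal(rho, cor(rank(x), rank(y)))

# Order-preserving transformations do not change the ranks,
# so the Spearman coefficient is unchanged
all.equal(cor(exp(x), y^3, method = 'spearman'), rho)

# The Pearson coefficient, by contrast, does change
cor(x, y)
cor(exp(x), y^3)
```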

## Two caveats

### Spurious correlation of time series

Random walks frequently show a strong and statistically significant correlation, and such correlations become increasingly common as the length (n) of the time series increases. This behavior often means that both variables are correlated with a third variable, such as time. For example, the number of ministers and the number of alcoholics are positively correlated through the 1900s. The number of each largely reflects the growing population during that time, so the two display a correlation, even though they are not directly linked. There are plenty of other examples.

The spurious correlation of two random walks is easily simulated by taking the cumulative sum (or running sum) of two sets of random numbers:

```r
x <- cumsum(rnorm(25))
y <- cumsum(rnorm(25))
cor(x, y)
```

Because random walks are unrelated, you might expect that a test of correlation of random walks would produce a statistically significant result at a rate equal to α (e.g., 5%). Statistically significant correlations of random walks are actually far more common than that, meaning that the rate of making a type I error is far larger than your chosen value of α would suggest. You can see this by running the following code several times; p-values less than 0.05 occur far more frequently than you would expect.

```r
x <- cumsum(rnorm(25))
y <- cumsum(rnorm(25))
cor.test(x, y)$p.value
```
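Rather than rerunning the code by hand, you can estimate the type I error rate directly by repeating the test many times. This is a minimal sketch; the seed, the number of trials, and the names p_walk and p_indep are arbitrary choices made for illustration.

```r
set.seed(42)      # arbitrary seed, for repeatability only
trials <- 2000
n <- 25

# p-values from correlation tests on pairs of independent random walks
p_walk <- replicate(trials,
  cor.test(cumsum(rnorm(n)), cumsum(rnorm(n)))$p.value)

# p-values for pairs of independent, non-cumulated random numbers
p_indep <- replicate(trials, cor.test(rnorm(n), rnorm(n))$p.value)

mean(p_walk < 0.05)    # proportion of random-walk pairs declared significant
mean(p_indep < 0.05)   # should be close to alpha = 0.05
```

The first proportion should be many times larger than 0.05, while the second should be close to it.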

Spurious correlation is a problem not only for time series, but also for spatially correlated data.

You might think that you could address the problem of spurious correlation by collecting more data, but this makes the problem worse: for random walks, increasing the sample size tends to decrease the p-value, so statistically significant results become even more common.
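The effect of series length can be checked with a small simulation. The sketch below is only an illustration; the series lengths, seed, number of trials, and the name reject_rate are arbitrary assumptions.

```r
set.seed(42)      # arbitrary seed, for repeatability only
trials <- 1000
lengths <- c(10, 25, 100, 400)   # hypothetical time-series lengths

# For each length, the proportion of independent random-walk pairs
# whose correlation is declared significant at alpha = 0.05
reject_rate <- sapply(lengths, function(n) {
  p <- replicate(trials,
    cor.test(cumsum(rnorm(n)), cumsum(rnorm(n)))$p.value)
  mean(p < 0.05)
})

data.frame(n = lengths, reject_rate = reject_rate)
```

The rejection rate should climb as n increases, instead of settling toward 0.05.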

Spurious correlation can be solved by differencing the data, that is, calculating the change from one observation to the next, which will decrease the size of your data set by one observation. Differenced data will not display spurious correlation because differencing removes the trends that cause spurious correlation. In R, use the diff() function to calculate your differences quickly.

```r
xdiff <- diff(x)
ydiff <- diff(y)
cor(xdiff, ydiff)
```
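To check that differencing restores the expected type I error rate, you can repeat the test on many pairs of differenced random walks. This is a minimal sketch; the seed, the number of trials, and the name p_diff are arbitrary.

```r
set.seed(42)      # arbitrary seed, for repeatability only
trials <- 2000
n <- 25

# p-values from correlation tests on first differences of random walks
p_diff <- replicate(trials, {
  x <- cumsum(rnorm(n))
  y <- cumsum(rnorm(n))
  cor.test(diff(x), diff(y))$p.value
})

mean(p_diff < 0.05)   # should be close to alpha = 0.05
```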

Any time that you see someone trying to correlate time series or spatial series, always ask if the data were differenced. If they were not, be skeptical of any non-zero correlation.

Here is a simulation that illustrates the problem of spurious time-series correlation. Each plot shows 10,000 random simulations of the correlation coefficient of two sets of data. The upper plot shows the results for two uncorrelated variables, the middle plot shows two random walks (time series), and the bottom plot shows two differenced random walks (time series). Note how commonly two random walks can produce strong negative or positive correlations, and note how differencing returns the expectation to that of two uncorrelated variables. Follow the Ten Statistical Commandments of Chairman Alroy and Difference Thy Data.

```r
par(mfrow=c(3,1))

trials <- 10000
sampleSize <- 25
r <- replicate(trials, cor(rnorm(sampleSize), rnorm(sampleSize)))
hist(r, xlim=c(-1, 1), breaks=50, col='green', main='Uncorrelated Variables', las=1)

r <- replicate(trials, cor(cumsum(rnorm(sampleSize)), cumsum(rnorm(sampleSize))))
hist(r, xlim=c(-1, 1), breaks=50, col='green', main='Random Walks', las=1)
# Note that the correlation coefficients are not clumped around zero
# as they would be for uncorrelated variables

r <- replicate(trials, cor(diff(cumsum(rnorm(sampleSize))), diff(cumsum(rnorm(sampleSize)))))
hist(r, xlim=c(-1, 1), breaks=50, col='green', main='First Differences of Random Walks', las=1)
```

### Use and interpretation of Pearson and Spearman coefficients

The Pearson correlation coefficient measures the strength of a linear relationship, that is, how well the data are described by a line. In contrast, the Spearman correlation coefficient measures the strength of a monotonic relationship, that is, as one value increases, whether the other consistently increases or decreases, regardless of the amount. Relationships can be monotonic but non-linear (for example, an exponential relationship), and as a result, they can have a strong Spearman correlation but a weaker Pearson correlation. The following example demonstrates this:

```r
x <- seq(1, 10, by=0.2)
y <- exp(x)
plot(x, y, las=1, pch=16, main='')
cor(x, y, method='pearson')
cor(x, y, method='spearman')
```
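As a further illustration of the difference between the two coefficients, the sketch below log-transforms y from the same example. The log-transformation is an addition here and not part of the original example. It makes the relationship exactly linear, which raises the Pearson coefficient to 1, while the Spearman coefficient is unaffected because the ranks do not change.

```r
x <- seq(1, 10, by = 0.2)
y <- exp(x)

# Pearson on the raw data: well below 1, because the relationship is curved
cor(x, y, method = 'pearson')

# After a log-transformation, log(y) equals x, so the relationship is linear
all.equal(cor(x, log(y), method = 'pearson'), 1)

# Spearman depends only on ranks, so it is 1 before and after transforming
all.equal(cor(x, y, method = 'spearman'), 1)
all.equal(cor(x, log(y), method = 'spearman'), 1)
```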

Data can display strong relationships that are neither linear nor monotonic (for example, a sine wave). When this happens, the Pearson and the Spearman correlations may both be near zero, even though there may be a very simple relationship underlying the data. Always plot your data first so that you understand what the relationship looks like, and always use the appropriate measure of correlation to describe your data.

```r
x <- seq(1, 10, by=0.2)
y <- cos(x)
plot(x, y, las=1, pch=16, main='')
cor(x, y, method='pearson')
cor(x, y, method='spearman')
```

Finally, remember that outliers affect a Pearson correlation coefficient more than they affect a Spearman correlation coefficient. In the example below, one outlier (in red) is added to the data. Notice how much more the Pearson correlation increases than the Spearman correlation does. Again, plot your data first to know if outliers are an issue.

```r
x <- rnorm(10)
y <- rnorm(10)
plot(x, y, pch=16, xlim=c(-3,5), ylim=c(-3,5))
cor(x, y)
cor(x, y, method='spearman')

# Add one outlier
x <- c(x, 5)
y <- c(y, 5)
points(x[11], y[11], pch=16, col='red')
cor(x, y)
cor(x, y, method='spearman')
```

## Concept of R²

The square of the correlation coefficient, R², is a widely reported statistic with an extremely useful property: it measures the proportion of variation that is explained by a linear relationship between two variables. R² will always be between zero and one.
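The interpretation as a proportion of explained variation can be checked against a fitted line. The sketch below assumes a simple linear regression fitted with lm(), which is not otherwise used in this section; the simulated data, the seed, and the names r_squared, ss_total, and ss_resid are illustrative only.

```r
set.seed(7)                 # arbitrary seed, for repeatability only
x <- rnorm(50)
y <- 2 * x + rnorm(50)      # hypothetical data with a linear relationship

r_squared <- cor(x, y)^2

# The squared correlation matches the R-squared reported for a fitted line
fit <- lm(y ~ x)
all.equal(r_squared, summary(fit)$r.squared)

# Proportion of variation explained: 1 - (residual SS / total SS)
ss_total <- sum((y - mean(y))^2)
ss_resid <- sum(residuals(fit)^2)
all.equal(r_squared, 1 - ss_resid / ss_total)
```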