Lecture Notes

Steven Holland

Covariance and Correlation

A very common problem in the sciences is describing the strength of the linear relationship of two variables. Covariance and correlation are the two most common measures of this relationship.

Covariance

The sum of products lies at the core of covariance and is analogous to the sum of squares used to calculate variance.

$$\mathrm{SP}_{jk} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$

where i is the observation number, n is the total number of observations, j is one variable, and k is the other variable.

Alternatively, the sum of products can be calculated as follows, where the first term is called the uncorrected sum of products and the second term is called the correction term.

$$\mathrm{SP}_{jk} = \sum_{i=1}^{n} x_{ij} x_{ik} - \frac{\left(\sum_{i=1}^{n} x_{ij}\right)\left(\sum_{i=1}^{n} x_{ik}\right)}{n}$$
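As a quick check, the two formulations give identical results. Here is a minimal sketch in R (the data are made up purely for illustration):

xj <- c(2.1, 3.4, 4.8, 5.0, 6.3)
xk <- c(1.0, 2.2, 2.9, 4.1, 5.5)
n <- length(xj)

# definitional form: sum of the cross-products of deviations from the means
sum((xj - mean(xj)) * (xk - mean(xk)))

# uncorrected sum of products minus the correction term
sum(xj * xk) - sum(xj) * sum(xk) / n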

From the sum of products, we get covariance, which is analogous to variance. Note that j and k can be switched, with no change in the result.

$$s_{jk} = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{n - 1}$$

Unlike the sum of squares and variance, the sum of products can be positive or negative, as can covariance. Like variance, covariance carries units, in this case the product of the units of the two variables. Also like variance, the value of covariance depends on the scale on which the data were measured, which is generally undesirable. For the standard deviation, we formed a unitless measure of variation through the coefficient of variation; for covariance, we will develop a unitless measure of the strength of the linear relationship through the correlation coefficient.
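For example, dividing the sum of products by n-1 reproduces what cov() reports, and rescaling one variable rescales the covariance by the same factor. This is only a sketch with simulated data:

xj <- rnorm(30)
xk <- rnorm(30)
n <- length(xj)

sp <- sum((xj - mean(xj)) * (xk - mean(xk)))
sp / (n - 1)      # covariance from the sum of products
cov(xj, xk)       # same value from R's built-in function

# covariance depends on measurement scale: multiplying one variable
# by 100 multiplies the covariance by 100
cov(xj * 100, xk)
100 * cov(xj, xk)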

Pearson’s correlation coefficient

Pearson’s correlation coefficient is also commonly called the product-moment correlation coefficient or simply correlation coefficient.

To produce a dimensionless measure of correlation, we need to divide the covariance by something with the same units, so we use the product of the standard deviations of the two variables. In effect, we are standardizing the joint variation of the two variables by the product of the variation in each. This gives us a dimensionless number, one unaffected by the scale on which the data were measured, which makes it a very useful measure.

$$r_{jk} = \frac{s_{jk}}{s_j s_k}$$

Pearson’s correlation coefficient varies from -1 (perfect negative correlation) to +1 (perfect positive correlation), with a value of 0 indicating no linear relationship. If one variable does not vary, one of the standard deviations will be zero, making the denominator zero, and making the correlation coefficient undefined.

Note that Pearson’s correlation coefficient could also be calculated as the sum of products divided by the square root of the product of the two sums of squares:

$$r_{jk} = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}}$$

Covariance and correlation are easily calculated in R:

cov(x,y) # Covariance
cor(x,y) # Correlation Coefficient
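As a check on the formulas above, here is a minimal sketch (with simulated data; the variable names are arbitrary) showing that the covariance-based and sum-of-products-based calculations agree with cor():

xj <- rnorm(30)
xk <- xj + rnorm(30)   # build in some correlation

cov(xj, xk) / (sd(xj) * sd(xk))   # covariance over the product of standard deviations

sp  <- sum((xj - mean(xj)) * (xk - mean(xk)))
ssj <- sum((xj - mean(xj))^2)
ssk <- sum((xk - mean(xk))^2)
sp / sqrt(ssj * ssk)              # sum of products over the root of the product of sums of squares

cor(xj, xk)                       # same value from the built-in function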

As we have seen before, if we have a statistic, we can make inferences and test hypotheses if we know the expected distribution of that statistic. If the data are normally distributed, and if the null hypothesis is that there is no correlation, a t-statistic built from Pearson’s correlation coefficient follows a t-distribution. The standard error of Pearson’s correlation coefficient is given as:

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$

The t-test for Pearson’s r is set up as:

$$t = \frac{r - \rho}{s_r}$$

Since the null hypothesis is that there is no correlation, the ρ term drops out, and the t-statistic is simply Pearson’s correlation coefficient divided by its standard error. This t-test assumes random sampling and that the expected value of Pearson’s r is normally distributed, which will be true only if the null hypothesis is that ρ equals zero. The test also assumes that both variables are normally distributed. Like any t-test, the default is a two-tailed test, but you may also set up a one-tailed test if you have theoretical reasons for believing that the correlation ought to be either positive or negative. As always, this alternative hypothesis cannot be chosen by examining the sign of your correlation coefficient; it must come from other lines of reasoning or other data sets. Finally, the test assumes that the correlation coefficient is calculated on interval or ratio data. If you have ordinal data, you must use a non-parametric test. This t-test has n-2 degrees of freedom, since a population variance is estimated for two variables, not just one.

The t-test for Pearson’s correlation coefficient can be done in R with the cor.test(x, y) function. The output is formatted much like that of t.test(). If you calculate the t-test on Pearson’s r manually, you will get the same results as if you run cor.test().

cor.test(x, y)
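For example, here is a minimal sketch (with simulated data) of the manual calculation, whose t-statistic and p-value match the output of cor.test():

x <- rnorm(20)
y <- x + rnorm(20)
n <- length(x)

r <- cor(x, y)
se <- sqrt((1 - r^2) / (n - 2))          # standard error of r
tStat <- r / se                          # t-statistic under the null hypothesis rho = 0
pValue <- 2 * pt(-abs(tStat), df=n-2)    # two-tailed p-value

tStat
pValue
cor.test(x, y)   # reports the same t-statistic, degrees of freedom, and p-value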

Spearman’s rank correlation

Spearman’s rank correlation coefficient is a non-parametric test of correlation based on the ranks of the observations. Suppose you could rank all of your x observations, assigning 1 to the smallest value, 2 to the next smallest, and so on. Suppose also that you did the same for your y-variable. If x and y were perfectly correlated, the ranks for the two data sets would correspond perfectly. Spearman’s rank correlation coefficient measures this correlation.

In R, you can calculate the Spearman rank correlation by setting the method argument of cor() or cor.test():

cor(x, y, method="spearman")
cor.test(x, y, method="spearman")

The Spearman correlation test assumes only that your data were randomly sampled, and will work with any type of data that can be ranked, including interval data. The Spearman test is valuable in that it is scale-independent, and it is less sensitive to outliers than the Pearson correlation test.
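One way to see what Spearman’s coefficient measures is to compute Pearson’s coefficient on the ranks themselves; the two values are identical. A minimal sketch with simulated data:

x <- rnorm(20)
y <- x + rnorm(20)

cor(x, y, method="spearman")
cor(rank(x), rank(y))   # Pearson's r on the ranks gives the same value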

Two caveats

Spurious correlation of time series

Random walks frequently show strong and statistically significant correlations with one another, and these become more pronounced as the length of the series (n) increases. Often, a spurious correlation arises because both variables are correlated with some third variable, such as time. A classic example is the correlation between the number of ministers and the number of alcoholics through the 1900s: both numbers largely reflect the growing population over that century, so the two are correlated even though they are not directly linked. There are also more recent examples.

The spurious correlation of two random walks is easily simulated:

x <- cumsum(rnorm(25))
y <- cumsum(rnorm(25))
cor(x, y)

Because statistically significant correlation of random walks is much more common than a typical value of alpha (e.g., 0.05) would suggest (that is, such significant correlations occur much more frequently than 1 in 20), the rate of making a type I error in time-series correlation can be much higher than alpha. The same argument applies to spatially correlated data, not just time series.

You might think that you could address the problem of spurious correlation by collecting more data, but this makes the problem worse, since a larger n is more likely to produce a small p-value.
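The inflated type I error rate is easy to demonstrate by applying cor.test() to many pairs of independent random walks. This is a sketch only; the exact proportion will vary from run to run, but it will be far above 0.05:

# proportion of pairs of independent random walks that show a "significant" correlation
trials <- 1000
sampleSize <- 25
pValues <- replicate(trials, cor.test(cumsum(rnorm(sampleSize)), cumsum(rnorm(sampleSize)))$p.value)
mean(pValues < 0.05)   # far larger than the nominal alpha of 0.05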

The problem can be solved by differencing the data, that is, calculating the change from one observation to the next, which shortens the data set by one observation. Differenced data will not display these spurious correlations, so if you difference your data, test for correlation, and find one, you can have far more confidence that the correlation is real. In R, use the diff() function to calculate the differences quickly.

xdiff <- diff(x)
ydiff <- diff(y)
cor(xdiff, ydiff)

Any time that you see someone trying to correlate time series or spatial series, always ask if the data were differenced. If they were not differenced, be skeptical of any significant correlation.

Here is a simulation that illustrates the problem of spurious time-series correlation:

# Correlation coefficient for uncorrelated data
trials <- 10000
sampleSize <- 25
r <- replicate(trials, cor(rnorm(sampleSize), rnorm(sampleSize)))
quartz() # opens a new plot window on macOS; use windows() or dev.new() on other platforms
hist(r, xlim=c(-1,1), breaks=50, col="sienna", main="Uncorrelated Variables", las=1)
 
# Correlation coefficient for random walks
trials <- 10000
sampleSize <- 25
r <- replicate(trials, cor(cumsum(rnorm(sampleSize)), cumsum(rnorm(sampleSize))))
quartz()
hist(r, xlim=c(-1,1), breaks=50, col="turquoise", main="Random Walks", las=1)
# Note that the correlation coefficients are not clumped around zero as they would be for uncorrelated variables
 
# Correlation coefficient for first differences
trials <- 10000
sampleSize <- 25
r <- replicate(trials, cor(diff(cumsum(rnorm(sampleSize))), diff(cumsum(rnorm(sampleSize)))))
quartz()
hist(r, xlim=c(-1,1), breaks=50, col="olivedrab", main="First Differences of Random Walks", las=1)
# Note that this is more like the expected distribution of uncorrelated variables

Use and interpretation of Pearson and Spearman coefficients

The Pearson correlation coefficient measures the strength of a linear relationship (how well the data fit a line), whereas the Spearman correlation coefficient measures the strength of a monotonic relationship (as one value increases, does the other consistently increase or decrease, regardless of the amount). As a result, a non-linear but monotonic relationship can have a strong Spearman correlation but a weak Pearson correlation. The following examples demonstrate this:

x <- seq(1, 10, by=0.2)
y <- exp(x)
plot(x,y, las=1, pch=16, main="")
cor(x,y,method="pearson")
cor(x,y,method="spearman")
# Note that only Spearman adequately recognizes the strength of the relationship
 
x <- seq(1, 10, by=0.2)
y <- cos(x)
plot(x,y, las=1, pch=16, main="")
cor(x,y,method="pearson")
cor(x,y,method="spearman")
# Note that both measures fail to detect this relationship

You should use the appropriate measure for the relationship you are trying to detect. Always plot your data first so that you can see the relationship before you attempt to measure it.

Finally, remember that outliers affect a Pearson correlation coefficient much more strongly than they affect a Spearman correlation coefficient.

# Correlations without an outlier
x <- rnorm(10)
y <- rnorm(10)
plot(x, y, pch=16)
cor(x, y)
cor(x, y, method="spearman")

# Add one outlier
x <- c(x, 15)
y <- c(y, 15)
plot(x, y, pch=16)
cor(x, y)
cor(x, y, method="spearman")

Always check your data for outliers and beware of one or two data points that are the source of a strong correlation. Again, outliers are most easily found by plotting your data.

Concept of R²

R², as we will find out, can be interpreted as the proportion of the variation in one variable that is explained by a linear relationship with the other variable. R² always lies between zero and one.
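As a preview, for two variables the R² reported by a simple linear regression is just the square of Pearson’s r. A minimal sketch with simulated data:

x <- rnorm(30)
y <- 2 * x + rnorm(30)

cor(x, y)^2                    # squared Pearson correlation coefficient
summary(lm(y ~ x))$r.squared   # R-squared reported by the regression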