Problem Sets



Steven Holland

Problem Set 5: Correlation and Regression

In your text, read pages 108–113, which cover covariance and correlation, and chapter 8, which covers regression.

Because this homework is involved, please format your commands as you did on the previous problem set.

1) Download and import the data set purdin.txt, naming the data frame purdin. These are geochemical data from a series of limestone beds. View the first several lines so that you can verify that the import worked and see the structure of the data set.
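
The import step might look like the following minimal sketch. It assumes a whitespace-delimited file with a header row, which you should check against the real file. So that it runs anywhere, it writes a small stand-in file to a temporary location; demoFile, demo, and all of the values are hypothetical and are not the purdin data.

```r
# Hypothetical stand-in for purdin.txt: these values are invented, NOT the real data
demoFile <- tempfile(fileext = ".txt")
writeLines(c("Si\tFe",
             "1.2\t0.31",
             "0.8\t0.22",
             "5.6\t1.40"), demoFile)

# Assumes a header row and whitespace-delimited columns; adjust to the real file
demo <- read.table(demoFile, header = TRUE)

head(demo)  # first several rows
str(demo)   # structure: variable names and types
```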

2) Use names() to view the variable names in this data set. Next, use the appropriate command so that you do not need to use dollar-sign notation.
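
One common way to avoid dollar-sign notation is attach(), paired with detach() when you are done; with() is an alternative. The sketch below uses a hypothetical data frame named demo with invented values, so treat it as an illustration rather than the required answer.

```r
# Hypothetical data frame with invented values
demo <- data.frame(Si = c(1.2, 0.8, 5.6), Fe = c(0.31, 0.22, 1.40))

names(demo)         # variable names
mean(demo$Fe)       # dollar-sign notation

attach(demo)        # put the columns on the search path
meanFe <- mean(Fe)  # same result, no dollar sign needed
detach(demo)        # undo attach() when finished
```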

3) You will use silicon and iron in this data set. First, plot Fe vs. Si, using filled black circles and rotating the y-axis labels to horizontal. There is no need for a main label, so do not include one.
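
As a rough sketch of the plotting options, pch = 16 draws filled circles (black by default) and las = 1 turns the axis labels horizontal. The variables x and y below are simulated and hypothetical; substitute your own.

```r
# Simulated, hypothetical right-skewed data (not the purdin data)
set.seed(1)
x <- rlnorm(50)
y <- x * rlnorm(50, sdlog = 0.3)

# When run as a script, draw to a null device instead of creating a file
if (!interactive()) pdf(NULL)

# Filled circles, horizontal y-axis labels, no main title
plot(x, y, pch = 16, las = 1)
```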

4) Notice that both Si and Fe in this plot are clumped near low values and are spread more widely at higher values. On a bivariate plot, this produces a comet pattern, with the head of the comet near the origin. Such a pattern is often a hallmark of two log-normally distributed variables. You will make four Q-Q norm plots (two here and two in step 6), all on the same page for easy comparison. Make a new plot window, then use par() and mfrow to set up the window to display four plots arranged into two rows of two columns. Run qqnorm() on each raw variable, using the main titles "raw Fe data" and "raw Si data". Do these plots suggest that the data are normally distributed? Explain and include your answer as a comment in your commands file (do this everywhere below where I ask a question).
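
To see how a multi-panel layout and qqnorm() fit together, here is a minimal sketch on two simulated samples, one right-skewed and one normal. The sample names and titles are hypothetical. Points that follow a straight line suggest normality; a curved pattern suggests otherwise.

```r
# Simulated, hypothetical samples (not the purdin data)
set.seed(2)
skewedSample <- rlnorm(100)  # right-skewed
normalSample <- rnorm(100)   # normally distributed

# New plot window when interactive; null device when run as a script
if (interactive()) dev.new() else pdf(NULL)

# Two rows of two columns, filled row by row
par(mfrow = c(2, 2))

qqSkewed <- qqnorm(skewedSample, main = "skewed sample")
qqNormal <- qqnorm(normalSample, main = "normal sample")
# The remaining two panels are filled by the next two plotting calls
```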

5) Apply a log transformation on the data, assigning them to logSi and logFe. R has a variety of log functions, such as log(), log2(), and log10(), which are in base e, base 2, and base 10, respectively. Just so we are all on the same page, use the base-10 logarithm, following the convention of geochemistry. In your own work, realize that you could use any of these for a log transformation; just be consistent, follow the conventions of your field, and record what you did.
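
The difference among the log functions can be checked on a few easy numbers. This is only an illustration; x, logX, and altLogX are hypothetical names.

```r
x <- c(1, 10, 100, 1000)

logX <- log10(x)               # base 10
altLogX <- log(x, base = 10)   # log() also accepts an explicit base

log(exp(1))  # natural log (base e)
log2(8)      # base 2
```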

6) Now run qqnorm() on logFe and logSi, setting the main titles to “log Fe data” and “log Si data”. In your matrix of Q-Q norm plots, the raw data plots and log data plots should be in separate rows, and Fe and Si should be in separate columns. Do these plots suggest that the original data are log-normally distributed? Explain.

7) Make a new plot window and in it, plot logFe vs. logSi. Use filled black circles and rotate the y-axis labels. Make the y-axis label “log Fe” and the x-axis label “log Si”. There is no need for a main label, so do not include one.

8) Test the significance of the correlation of the log-transformed variables, first with a Pearson correlation coefficient and then with a Spearman correlation coefficient. Interpret the results, being sure to consider what each correlation statistic measures (i.e., linear vs. monotonic), the sample size, and the relative powers of the two tests.
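
Both tests are available through cor.test(), selected with the method argument. The sketch below runs them on simulated, hypothetical data, so the numbers it prints say nothing about Fe and Si.

```r
# Simulated, hypothetical data with a moderate linear relationship
set.seed(3)
x <- rnorm(40)
y <- 0.6 * x + rnorm(40, sd = 0.8)

pearson  <- cor.test(x, y, method = "pearson")   # linear association
spearman <- cor.test(x, y, method = "spearman")  # monotonic (rank-based) association

pearson$estimate   # r
pearson$p.value
spearman$estimate  # rho
spearman$p.value
```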

9) Perform a least-squares regression of logFe as a function of logSi, assigning the result to FeVsSiRegression. View the FeVsSiRegression object. As a comment, write the equation of the line as Y = b1 X + b0, replacing X and Y with their correct variable names, and replacing b1 and b0 with their correct values, to a reasonable number of significant figures. In other words, your equation should look something like
      numTrees = 8.25 numFrogs + 27.1
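
Using the made-up variable names from the example equation, a least-squares fit with lm() could be sketched as follows. The data are simulated and the object name treesVsFrogs is hypothetical.

```r
# Simulated, hypothetical data (the frogs and trees are invented)
set.seed(4)
numFrogs <- runif(30, min = 0, max = 10)
numTrees <- 8 * numFrogs + 27 + rnorm(30, sd = 5)

# Model formula: response ~ predictor
treesVsFrogs <- lm(numTrees ~ numFrogs)

treesVsFrogs                     # prints the intercept (b0) and slope (b1)
signif(coef(treesVsFrogs), 3)    # coefficients to three significant figures
```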

Because Fe and Si were measured from rocks that were collected (that is, Si was not set to a value in an experiment, with Fe measured as a result), you might be thinking that a model 2 regression would be more appropriate. If we simply wanted to describe the relationship, a model 2 regression would be appropriate. However, we want to predict Fe from Si, which requires a model 1 regression, and that is what is used here.

10) Use summary() to view the statistical tests associated with your regression. Answer each of the following questions in a separate comment. Is the slope statistically significantly different from zero? Is the intercept statistically significantly different from zero? What percentage of the variation is explained by the regression? Is the regression statistically significant?

Good answers to these questions should show how you arrived at them by making specific reference to the relevant numerical values in the regression results. Good answers should also demonstrate that you understand the relevant statistical concepts, and that is best demonstrated by succinctness.
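
If it helps to see where the relevant numbers live, the following sketch pulls them out of a summary() object for a regression on simulated, hypothetical data. The names fit, fitSummary, and pValue are made up for illustration.

```r
# Simulated, hypothetical data
set.seed(5)
x <- runif(30, min = 0, max = 10)
y <- 2 * x + 1 + rnorm(30)

fit <- lm(y ~ x)
fitSummary <- summary(fit)

# Rows: intercept and slope; columns: estimate, standard error, t value, p-value
fitSummary$coefficients

# Proportion of variation explained by the regression
fitSummary$r.squared

# F statistic with its degrees of freedom, and the p-value for the regression
f <- fitSummary$fstatistic
pValue <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
pValue
```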

11) Now, go back and evaluate the assumptions of your regression to make sure that the results are interpretable. First, open a new plot window, then use par() and mfrow to set the window up to have four plots arranged in a 2x2 grid, as you did above. Next, use plot() on your regression object to evaluate whether the residuals change systematically with the fitted values. Answer the following questions as separate comments. Is there any systematic relationship of the residuals with the fitted values? Are the residuals normally distributed? Considering your answers to these two questions, are the assumptions of the least-squares regression met for these data? Explain.
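
Calling plot() on an lm object draws four diagnostic plots by default, including residuals vs. fitted values and a normal Q-Q plot of the residuals. A minimal sketch on simulated, hypothetical data:

```r
# Simulated, hypothetical data
set.seed(6)
x <- runif(40, min = 0, max = 10)
y <- 3 * x + 2 + rnorm(40)
fit <- lm(y ~ x)

# New plot window when interactive; null device when run as a script
if (interactive()) dev.new() else pdf(NULL)

oldPar <- par(mfrow = c(2, 2))  # 2x2 grid, saving the previous settings
plot(fit)                       # four diagnostic plots, one per panel
par(oldPar)                     # restore the previous layout
```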

12) Undo the command you used in #2 that allowed you to avoid dollar-sign notation.

Email your commands file to me, following all the standard instructions. Do not email the data file, as I already have it. This problem set is due 18 October.