Steven Holland

Problem Set 4: Central Tendency

Background reading

Read / skim chapters 3, 5, and 6 in your textbook. Some sections will clearly relate to lectures and you should read those; other sections will seem more peripheral and you should skim those. The chapters refer to data files (such as yvalues.txt on p. 23) that can be downloaded from the book’s website at:

You should perform the R commands as you read each chapter. This will introduce you to new techniques and improve your overall efficiency with R. Do not just blindly type the commands; you should ensure that you understand each command. If you don’t, check the relevant help pages until you do. Read the chapters before doing the assignment.

Chapter 3 introduces writing functions that do basic things, such as calculating an arithmetic mean. Pay attention to these techniques so that you understand how to write your own functions. In chapter 3, you should focus less on geometric mean and harmonic mean (pp. 28–30) than on earlier parts of the chapter.
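As a minimal sketch of the kind of function chapter 3 builds, the following defines an arithmetic mean by hand. The function name arithmetic.mean and the vector y are made up for illustration and are not taken from the textbook; R's built-in mean() does the same job.

```r
# Hypothetical function name: arithmetic mean of a numeric vector
arithmetic.mean <- function(x) {
  sum(x) / length(x)
}

y <- c(2, 4, 6, 8)     # made-up values for illustration
arithmetic.mean(y)     # 5
mean(y)                # built-in equivalent, also 5
```

Typing the function name without parentheses (arithmetic.mean) prints its definition, which is a handy way to check what a function does.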

Chapter 5 presents today’s lecture in a somewhat different manner that should help you fully understand the concepts. As you read the chapter, be sure you understand the following commands: summary(), which(), boxplot(), table(), range(), diff(), for loops (e.g., p. 56), dnorm(), pnorm(), qnorm(), qqnorm(), qqline(), and names(). You will be expected to know these commands on future homework and exams. The dnorm(), pnorm(), and qnorm() commands are particularly important. Along with rnorm(), which we’ve used in class, these form an important basis for understanding all of the distributions included in R. There are similar forms of these commands for all the distributions, such as rt(), dt(), pt(), and qt() for the t-distribution, so be sure to spend some time on the help pages, running their examples, until you understand what these commands do.
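The four prefixes (d, p, q, r) can be sketched on the standard normal distribution as follows. The variable names and the seed are arbitrary choices for illustration; the help pages remain the authoritative description.

```r
# Standard normal distribution (mean 0, sd 1) throughout
d <- dnorm(0)          # density: height of the curve at x = 0
p <- pnorm(1.96)       # cumulative probability to the left of 1.96
q <- qnorm(0.975)      # quantile with 97.5% of the distribution below it
set.seed(1)            # arbitrary seed so the random draws are repeatable
r <- rnorm(5)          # five random draws

# pnorm() and qnorm() are inverses of one another
pnorm(qnorm(0.975))    # 0.975

# The same prefixes apply to other distributions, e.g. the t-distribution
tcrit <- qt(0.975, df = 10)   # upper 2.5% critical value, 10 degrees of freedom
tcrit
```

Note that the t-distribution versions need an extra argument, the degrees of freedom, because the shape of that distribution depends on it.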

Chapter 6 discusses how to compare two samples. For now, just focus on the pages on the t-test, the Wilcoxon test, and paired samples at the beginning of the chapter.
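For orientation, here is a rough sketch of how two-sample comparisons of this kind are called in R. The samples a and b are simulated, not real data, and the seed and sample sizes are arbitrary; the chapter's own examples should take precedence.

```r
set.seed(42)                          # arbitrary seed for repeatability
a <- rnorm(12, mean = 10, sd = 2)     # simulated sample A
b <- rnorm(12, mean = 12, sd = 2)     # simulated sample B

welch  <- t.test(a, b)                 # two-sample t-test (Welch by default)
ranks  <- wilcox.test(a, b)            # rank-based, non-parametric alternative
paired <- t.test(a, b, paired = TRUE)  # paired test; assumes a[i] matches b[i]

welch$p.value
ranks$p.value
paired$p.value
```

The paired form is appropriate only when each observation in one sample has a natural partner in the other, such as before and after measurements on the same individual.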

The assignment

Conrad Labandeira and Nigel Hughes independently studied the Late Cambrian trilobite Dikelocephalus for their master’s theses. Each came to the same conclusion: that Ulrich and Resser, who had described many species of Dikelocephalus in the early 1900s, had vastly oversplit the genus and that many of the species could not be distinguished. For this week’s problem set, we want to examine a small portion of the data they analyzed and published (Journal of Paleontology 68:492–517).

Listed below are their data for two measurements of the free cheek (the side of the head), which they called omega and sigma, for four species of Dikelocephalus. They measured many other aspects of these trilobites and for many other species, but to keep this problem tractable, we will examine only these two features for only four of the many species. In this problem set, we want to evaluate whether the species can be distinguished on the basis of omega and sigma, individually. This is a problem of central tendency: whether the central tendencies (e.g., mean, median) of these species are the same or different. If the central tendencies are the same, the species cannot be distinguished, but if the central tendencies differ, we will be able to differentiate the species. It is possible that some of the species are distinctive, but others are not. Likewise, it is possible that some variables are useful for distinguishing a pair of species and that other variables are not.

You will do a series of analyses on each variable (omega and sigma), and you will use both sets of analyses to evaluate whether these trilobites can be distinguished.

1) Import the data set trilobites2 into R. Use the appropriate command to view the first few lines of the data frame to verify that it was imported correctly. Use the appropriate command so that you can call the variables directly by name without using the name of the data frame and dollar-sign notation.

2) First, visualize the data for omega. Use stripchart(), with the data (omega) grouped by species. Hint: stripchart(y~x) will plot the variable y grouped by the variable x. Use solid black circles for the plotting symbol. Examine the plot carefully and think ahead: does the mean value of these lengths appear to be the same for all the species or is one or more species different, considering the scatter in the data? You needn’t write a comment, but you should think about what you expect the statistical tests to show before you run them.

3) The standard way in which this data is often examined is to test in one step whether the means of all of the species are statistically indistinguishable, and this is usually done with an ANOVA. Before you can run the ANOVA, you must verify its assumptions. One of these assumptions is that the data are normally distributed. Based on your stripcharts, is omega symmetrically distributed for each species? Because the data set is small, we will be concerned only about strong departures from symmetry. Write your answer as a comment in your R session.

4) The second assumption of the ANOVA is that the variance for each species is the same, in other words, that the scatter for each species is approximately the same. Use the appropriate test that lets you compare the variances of omega among many (>2) groups in a single, simple line of code. What do the results suggest about the homogeneity of variances across the species, that is, whether the variances of all of the species are similar? Include your answer as a comment. Be sure to state the null hypothesis, whether it was accepted or rejected, and report the results as I showed in class. In your comment, summarize whether the assumptions of normality and homoscedasticity (equal variances) have been met for an ANOVA on omega.

5) Normally, you would proceed to the ANOVA only if the assumptions were met. In this case, I want you to do the ANOVA regardless of whether they were met, but I will want you to interpret your results in light of those assumptions. Run an ANOVA using anova(lm()) on omega as a function of species. Do the results indicate that all of the species are indistinguishable with regard to omega? Include your answer as a comment, and be sure to consider the assumptions of the test. Also, be sure to state the null hypothesis, whether you accepted or rejected it, and report the results of this test as I showed in class.

6) Using the t.test() function, calculate 95% confidence limits on the mean value of omega for each species (i.e., four confidence intervals, one for each species). Based on these, which pairs of species appear distinguishable with omega? Justify your answer from your results, being sure to list every pair of species that can be distinguished from each other. Include your answer as a comment.

7) Do step 2, but for sigma. Create a new plot window for this.

8) Do step 3, but for sigma.

9) Do step 4, but for sigma.

10) Do step 5, but for sigma.

11) Do step 6, but for sigma.

12) Undo the command you used in step 1 that allowed you to avoid using dollar-sign notation.
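The functions named in steps 2, 5, and 6 can be tried out first on a data set that ships with R. The sketch below uses PlantGrowth (a numeric weight measured in three levels of group) purely as a stand-in; it is not the trilobite data, and the object names fit, ctrl, and ci are arbitrary.

```r
# PlantGrowth ships with R: 'weight' (numeric) in three 'group' levels
data(PlantGrowth)

# Step 2 analogue: one strip of points per group
stripchart(weight ~ group, data = PlantGrowth)

# Step 5 analogue: one-way ANOVA table from a linear model
fit <- anova(lm(weight ~ group, data = PlantGrowth))
fit

# Step 6 analogue: 95% confidence limits on the mean of a single group
ctrl <- PlantGrowth$weight[PlantGrowth$group == "ctrl"]
ci <- t.test(ctrl)$conf.int
ci
```

Because this example passes data = PlantGrowth or uses dollar-sign notation, it does not depend on the variables being callable directly by name.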

To make your code easier to follow, please insert a blank line before each numbered problem, followed by a comment line indicating the problem number, then all lines of code for that problem with no blank lines between them. Your commands file should look like this:

# 1

# 2

# 3

Turn in the R commands needed for this problem set, but do not send the data file, as I already have it. E-mail your commands file to Steven Holland, following all the standard instructions. This problem set is due 11 October.