2 December, 2011

One of the most useful aspects of R is that you can nest statements within one another. However, this can sometimes produce an error when one of the statements isn’t returning what you think it ought to return. Even worse, calculations might proceed without any warnings or errors, producing a result that is meaningless, but not obviously so.

For example, suppose you wanted to do a t-test on femur length in two species from the femurs.txt data set. You’d read the data, display the first few lines as a sanity check, and attach the data frame so you could access the variables directly.

> femurs <- read.csv('femurs.txt')

> head(femurs)

femur.length species

1 4.6 granulosa

2 16.2 granulosa

3 20.2 meeki

4 15.3 granulosa

5 11.1 granulosa

6 21.0 meeki

> attach(femurs)

Now comes the t-test. You might try the following:

> t.test(species=='granulosa', species=='meeki')

Welch Two Sample t-test

data: species == "granulosa" and species == "meeki"

t = -0.2425, df = 64, p-value = 0.8091

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-0.2799046 0.2192985

sample estimates:

mean of x mean of y

0.4848485 0.5151515

At first glance, it looks like it worked, but there’s something very wrong here. Can you see it?

What’s wrong is that you’ve never specified that you want to compare the femur lengths. So, why is it giving us an answer and no warning? To understand what it is doing, you have to *unpack* your command, that is, take it apart to see what each part is doing. Let’s start with the first parameter to t.test:

> species=='granulosa'

[1] TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE

[15] FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE

[29] TRUE FALSE FALSE FALSE FALSE

What this tells us is that t.test is comparing two vectors of TRUE and FALSE, not femur length for the two species. The reason it gives us an answer is that TRUE=1 and FALSE=0, so these vectors can be regarded as numeric data, even though they don’t appear to be.

What we really want is femur length for each species, so you might try this:

> femurs[species=='granulosa']

Error in `[.data.frame`(femurs, species == "granulosa") :

undefined columns selected

Well, that’s not right. We’re not specifying the column. We could specify column 1, which has the femur length, but we’ve already attached the data frame, so we can access the femur length directly:

> femur.length[species=='granulosa']

[1] 4.6 16.2 15.3 11.1 16.7 16.1 11.7 8.1 12.0 13.9 17.7 16.3 15.8 17.7 16.9 10.3

That’s more like it. Now we can call the t-test function correctly:

> t.test(femur.length[species=='granulosa'], femur.length[species=='meeki'])

Welch Two Sample t-test

data: femur.length[species == "granulosa"] and femur.length[species == "meeki"]

t = -2.097, df = 26.72, p-value = 0.0456

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-4.76402884 -0.05067704

sample estimates:

mean of x mean of y

13.77500 16.18235

This is what we want. What’s important is that we could solve the problem by *unpacking* our commands, that is, unnesting our commands to make sure that each one is doing what we want.

Consider another problem, this time in the trilobites3.txt data set. First, we’ll detach the femurs data set since we no longer need it, then we’ll read the new data, view it, and attach it.

> detach(femurs)

> tri &- read.table('trilobites3.txt', header=TRUE, row.names=1, sep="\t")

> head(tri)

sigma omega species

FMNH059 3.09 0.97 gracilis

FMNH075 3.49 1.35 raaschi

FMNH091856 1.83 0.65 gracilis

FMNH0984 3.38 1.22 gracilis

FMNH0985 3.32 1.20 gracilis

FMNH1007 3.62 1.14 gracilis

> attach(tri)

Suppose we want to delete a particular sample, FMNH075, from the data before we do our analyses. We’ll preserve the original data set and create a new one that has the culled data, and we might try this:

> tri.cull &- tri[-FMNH075]

Error in `[.data.frame`(tri, -FMNH075) : object 'FMNH075' not found

We should realize that we need to access the row names and we might try this:

> tri.cull &- tri[rownames(tri)=='FMNH075']

No errors that time, maybe that means it worked. Better check our results:

> head(tri.cull)

omega

FMNH059 0.97

FMNH075 1.35

FMNH091856 0.65

FMNH0984 1.22

FMNH0985 1.20

FMNH1007 1.14

Uh oh, we lost two variables, sigma and species. Why isn’t this working? Unpack the command and see what that logical statement returns:

> rownames(tri)=='FMNH075'

[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

[15] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

[29] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

A vector of all FALSEs and one TRUE, which corresponds to the sample FMNH075. This should help us know what to do: we want to remove the one row that matches the TRUE, but keep all the columns. Maybe try this:

> tri.cull &- tri[-(rownames(tri)=='FMNH075'), ]

> head(tri.cull)

sigma omega species

FMNH075 3.49 1.35 raaschi

FMNH091856 1.83 0.65 gracilis

FMNH0984 3.38 1.22 gracilis

FMNH0985 3.32 1.20 gracilis

FMNH1007 3.62 1.14 gracilis

FMNH1100 3.21 1.21 gracilis

That didn’t work either. What we want to do is specify the row numbers that we want to delete, that is we want the row number for FMNH075. The which() can tell us which values are TRUE in a vector of TRUEs and FALSEs:

> which(rownames(tri)=='FMNH075')

[1] 2

That’s it: we now know the row number we want to delete and we can delete that row (don’t forget the minus sign):

> tri.cull &- tri[-which(rownames(tri)=='FMNH075'), ]

> head(tri.cull)

sigma omega species

FMNH059 3.09 0.97 gracilis

FMNH091856 1.83 0.65 gracilis

FMNH0984 3.38 1.22 gracilis

FMNH0985 3.32 1.20 gracilis

FMNH1007 3.62 1.14 gracilis

FMNH1100 3.21 1.21 gracilis

That worked. These two examples show you the importance of unpacking your commands. Any time you get strange results or unexpected errors, unpack your commands, starting at the finest level and building up to the most nested set of commands to identify and repair the source of the problem. Doing this will also greatly improve your ability to write the commands correctly the first time because you’ll have a much better understanding of what each of your nested commands is doing.