Steven Holland

# Subset()

30 October 2017

I have always made subsets of a data frame by using logical operations to specify rows and columns. The subset() command offers a simpler and more readable way to do this, but it comes with some caveats.

Suppose we have a geochemical data set:

> geochem <- read.table('purdin2.txt', header=TRUE, row.names=1, sep='\t')

This data set contains geochemical measurements on many samples. Here are the first few:

> head(geochem)

   StratPosition  d13C  d18O   Al     Ca   Fe K   Mg   Mn    Si
Ae        35.080  1.95 -4.66 1.49 297.16 1.52 0 2.22 0.34  8.61
Ad        34.745  1.82 -4.57 2.73 275.89 2.70 0 2.84 0.33  7.32
Ac        34.660  1.91 -4.77 4.26 328.11 3.13 0 3.12 0.40  9.65
Ab        34.555  0.93 -4.58 4.69 329.66 9.11 0 2.88 0.47 23.07
Aa        34.285 -0.34 -4.54 4.80 285.67 3.46 0 3.69 0.45  9.97
A1        34.150  0.46 -3.86 3.89 346.72 3.79 0 8.77 0.32  7.98

Logical operations and row/column notation are a standard way to get all the rows for which the value of Al was greater than 8:

> geochem[geochem$Al>8, ]

These logical operations can be combined to make more complicated queries:

> geochem[geochem$Al>8 & geochem$Ca>250, ]

It’s also possible to retrieve specific columns (variables) instead of all of the columns:

> geochem[geochem$Al>8 & geochem$Ca>250, c('Fe', 'K')]
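To try these bracket queries without the data file, here is a minimal sketch that rebuilds four columns of the six rows printed by head() above. The name toy and the thresholds (4 for Al, 300 for Ca) are assumptions chosen so that this small sample returns some rows; they are not the values used in the text.

```r
# Toy data: four columns from the first six rows printed by head(geochem)
toy <- data.frame(
  Al = c(1.49, 2.73, 4.26, 4.69, 4.80, 3.89),
  Ca = c(297.16, 275.89, 328.11, 329.66, 285.67, 346.72),
  Fe = c(1.52, 2.70, 3.13, 9.11, 3.46, 3.79),
  K  = c(0, 0, 0, 0, 0, 0),
  row.names = c("Ae", "Ad", "Ac", "Ab", "Aa", "A1")
)

# Rows where Al exceeds 4 (illustrative threshold), all columns
high_al <- toy[toy$Al > 4, ]

# Two conditions combined with &
high_al_ca <- toy[toy$Al > 4 & toy$Ca > 300, ]

# Same rows, but only the Fe and K columns
fe_k <- toy[toy$Al > 4 & toy$Ca > 300, c("Fe", "K")]

print(high_al)
print(high_al_ca)
print(fe_k)
```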

A few things about this approach are error-prone. We have to remember to use the name of the data frame and dollar-sign notation; we can’t just write Al>8 & Ca>250, unless we have called attach() on the data frame to put its variables on the search path. If we want all the columns, we have to remember the comma before the closing bracket. We also have to remember what goes in the column position and what goes in the row position. When we select columns, we have to remember to wrap those column names in quotes.

## subset() to the rescue

The subset() command makes all of this easier.

Repeating the operations from above, first we get all the samples with Al>8, with the data frame as the first argument, and the conditions in the second argument:

> subset(geochem, Al>8)

Notice that we are freed from dollar-sign notation with a simple and direct syntax. Also, by default, subset() returns all the columns.

More complex queries are just as easy:

> subset(geochem, Al>8 & Ca>250)

As queries get more complicated, avoiding dollar-sign notation makes them quicker to type and easier to read.

Specifying columns is also simpler because column names aren’t wrapped in quotes. The desired columns are specified as a vector for the third argument:

> subset(geochem, Al>8 & Ca>250, c(Fe, K))
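As a quick check that the two notations select the same thing, the sketch below runs both on a toy data frame built from the head() rows shown earlier. The name toy and the thresholds are assumptions for illustration only.

```r
# Toy data: four columns from the first six rows printed by head(geochem)
toy <- data.frame(
  Al = c(1.49, 2.73, 4.26, 4.69, 4.80, 3.89),
  Ca = c(297.16, 275.89, 328.11, 329.66, 285.67, 346.72),
  Fe = c(1.52, 2.70, 3.13, 9.11, 3.46, 3.79),
  K  = c(0, 0, 0, 0, 0, 0),
  row.names = c("Ae", "Ad", "Ac", "Ab", "Aa", "A1")
)

# Bracket notation: data frame name, dollar signs, quoted column names
by_bracket <- toy[toy$Al > 4 & toy$Ca > 300, c("Fe", "K")]

# subset(): bare column names in both the condition and the selection
by_subset <- subset(toy, Al > 4 & Ca > 300, c(Fe, K))

# Compare the two results
same <- isTRUE(all.equal(by_bracket, by_subset))

print(by_subset)
print(same)
```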

## Trouble in paradise

The subset() command saves so much time in typing and is so much easier to read that it makes you wonder why we should ever bother with using brackets. The help page for subset() offers a clue, and it is a warning near the end:

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

It sounds rather cryptic. For a thorough explanation, read Hadley Wickham’s article, which explains ‘non-standard evaluation’ well. The short version is that if evaluating the expressions for the rows or columns runs into problems, subset() stops with an error rather than returning a value. If you have an R script that assumes a data frame will be returned, your script will fail.
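One way non-standard evaluation can surprise you is name collision: subset() looks up names in the data frame first, so a column can mask a script variable of the same name. The sketch below is illustrative only; toy is rebuilt from the head() rows shown earlier, and the variable Fe used as a cutoff is a hypothetical, deliberately unlucky name.

```r
# Toy data: three columns from the first six rows printed by head(geochem)
toy <- data.frame(
  Al = c(1.49, 2.73, 4.26, 4.69, 4.80, 3.89),
  Ca = c(297.16, 275.89, 328.11, 329.66, 285.67, 346.72),
  Fe = c(1.52, 2.70, 3.13, 9.11, 3.46, 3.79),
  row.names = c("Ae", "Ad", "Ac", "Ab", "Aa", "A1")
)

# Hypothetical cutoff that happens to share its name with a column
Fe <- 3

# Inside subset(), Fe resolves to the column toy$Fe, not the cutoff,
# so this compares Al to Fe row by row
via_subset <- subset(toy, Al > Fe)

# With brackets, Fe is the ordinary variable, so this compares Al to 3
via_bracket <- toy[toy$Al > Fe, ]

print(rownames(via_subset))
print(rownames(via_bracket))
```

Both calls run without complaint, yet they answer different questions, which is the kind of thing that is easy to spot in the console and easy to miss in a script.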

To see this in action, change the column we want to AL, instead of Al. Of course, there is no column called AL, so let’s see what happens with bracket notation.

> geochem[geochem$AL>8 & geochem$Ca>250, ]

[1] StratPosition d13C d18O Al
[5] Ca Fe K Mg
[9] Mn Si
<0 rows> (or 0-length row.names)

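Because there is no AL column, geochem$AL is NULL and the comparison produces a zero-length logical vector. No rows are selected, so the operation returns a data frame with zero rows. Our script could check for that and handle it appropriately. Let’s see how subset() handles the same query.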
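The zero-row result can be traced step by step. This is a minimal sketch on a toy data frame rebuilt from the head() rows shown earlier; the name toy and the thresholds are assumptions for illustration.

```r
# Toy data: two columns from the first six rows printed by head(geochem)
toy <- data.frame(
  Al = c(1.49, 2.73, 4.26, 4.69, 4.80, 3.89),
  Ca = c(297.16, 275.89, 328.11, 329.66, 285.67, 346.72),
  row.names = c("Ae", "Ad", "Ac", "Ab", "Aa", "A1")
)

# A misspelled column name yields NULL rather than an error
missing_col <- toy$AL

# Comparing NULL gives a zero-length logical vector, and combining it
# with a full-length condition is still zero-length
cond <- toy$AL > 4 & toy$Ca > 300

# Indexing rows with a zero-length logical vector selects nothing
result <- toy[cond, ]

print(is.null(missing_col))
print(length(cond))
print(nrow(result))
print(is.data.frame(result))
```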

> subset(geochem, AL>8 & Ca>250)

Error in eval(e, x, parent.frame()) : object 'AL' not found

In this case, a data-frame object was never created. If our code assumes an object was produced, then we have problems, and the rest of the script is likely to fail. In short, the warning tells it like it is: if you are interacting in the console, enjoy using subset(); it will save you time. If you are programming, stick with bracket notation.
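For scripts, one possible pattern is to check the column names up front, use bracket notation, and then test for an empty result. The sketch below is only one way to do it; require_columns() is a hypothetical helper written for this example, not part of base R, and toy is rebuilt from the head() rows shown earlier with illustrative thresholds.

```r
# Toy data: two columns from the first six rows printed by head(geochem)
toy <- data.frame(
  Al = c(1.49, 2.73, 4.26, 4.69, 4.80, 3.89),
  Ca = c(297.16, 275.89, 328.11, 329.66, 285.67, 346.72),
  row.names = c("Ae", "Ad", "Ac", "Ab", "Aa", "A1")
)

# Hypothetical helper: stop early if any required column is missing
require_columns <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Correct names: the check passes and the query runs
require_columns(toy, c("Al", "Ca"))
result <- toy[toy$Al > 4 & toy$Ca > 300, ]
if (nrow(result) == 0) message("no samples matched")

# Misspelled name: the check fails, and tryCatch() turns the error
# into a NULL that the rest of the script can test for
checked <- tryCatch(
  {
    require_columns(toy, c("AL", "Ca"))
    toy[toy$AL > 4 & toy$Ca > 300, ]
  },
  error = function(e) {
    message("query skipped: ", conditionMessage(e))
    NULL
  }
)

print(nrow(result))
print(is.null(checked))
```

The explicit check separates a misspelled column from a query that legitimately matches nothing, two cases that bracket notation alone reports in the same way.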

## Comments

Comments or questions? Contact me at stratum@uga.edu