Steven Holland

Problem Set 3: Opening Files and Accessing Data Frames

Opening files

Opening files and accessing portions of data frames are essential to work in R. You can’t use R effectively unless you master these skills. I highly recommend working in pairs on this problem set, particularly if you are having difficulty. After the homework, challenge each other to solve a particular type of access to build up your speed. By the exam, you should be able to complete any particular access problem in one command and in less than 2 minutes.

From a web browser, download each of the files listed under Data from the 8370 website. Do not rename these files or modify their contents in any way; doing so will guarantee errors when I run your code on the original files.

UPDATE: Be sure to download all 20 files, including those labeled Example Vector and Example Matrix.

In alphabetical order, open each of these downloaded files using only read.table() or scan(), as appropriate. As you open each file, assign the results to an object. Use head() to view the first few lines of that object to verify that it was imported correctly. If it wasn't imported correctly, fix your call to read.table() or scan(), but submit only the call that worked correctly. In the next line of code, delete the object you just created. As always, be sure not to embed paths or include your call to setwd(), as these guarantee errors when the code is executed on my computer.
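The open, check, delete cycle described above might look like the following minimal sketch. The file contents and the object names (tableFile, siteData, vectorFile, depthVector) are hypothetical, and tempfile() is used only so that the example runs on its own; for the problem set, you would pass just the name of the downloaded file.

```r
# Hypothetical table with a header row, written to a temporary file
# only so that this sketch is self-contained.
tableFile <- tempfile(fileext = ".txt")
writeLines(c("depth temp", "1.5 20.1", "3.0 18.4", "4.5 17.9"), tableFile)

# Rectangular data with column names: read.table() returns a data frame.
siteData <- read.table(tableFile, header = TRUE)
head(siteData)
rm(siteData)

# Hypothetical file holding a single series of numbers.
vectorFile <- tempfile(fileext = ".txt")
writeLines("3.2 4.1 5.0 6.3", vectorFile)

# A plain series of values: scan() returns a vector.
depthVector <- scan(vectorFile)
head(depthVector)
rm(depthVector)
```

Which function is appropriate depends on the file: read.table() suits rectangular data with columns, while scan() suits a single series of values.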

UPDATE: Be sure to specify the row names if a file has them, and many of these files do. Row names almost always appear in column 1; in this course, they will always be in column 1. Row names are unique identifiers, usually strings of characters and possibly numbers. Numerical values may also be unique (like date or time), but these are not names and should be treated as a column of numerical values. In some cases, people leave off the column name when there are row names, such that the header begins with a comma; this can be a good clue that the first column contains row names. Note, however, that many other people give row names a column name, such as locality.name or sample.name or some other usually obvious label. The bottom line is, if your data frame has row names, be sure to set the row.names argument.
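As an illustration of the header-begins-with-a-comma case, here is a small sketch using a hypothetical comma-delimited file. The file contents and the names rowNameFile and sites are assumptions made for the example, not part of the course data.

```r
# Hypothetical comma-delimited file whose header starts with a comma,
# hinting that the first column holds row names.
rowNameFile <- tempfile(fileext = ".csv")
writeLines(c(",depth,temp", "siteA,1.5,20.1", "siteB,3.0,18.4"), rowNameFile)

# row.names = 1 tells read.table() to use column 1 as the row names
# rather than as a column of data.
sites <- read.table(rowNameFile, header = TRUE, sep = ",", row.names = 1)
head(sites)

# The identifiers are now row names, and only the data columns remain.
rownames(sites)
colnames(sites)
```

The same row.names = 1 setting applies when the first column has its own label, such as sample.name.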

Accessing data

Using read.table(), open the worms dataset that you used in problem set 2 and assign it to a data frame called worms.

Solve each problem below with a single command. Simply display the results for each; do not assign the results to a new object. For each of these, you are trying to extract data from worms, so what you display in most cases (except 25–27, 31) will be a data frame. For all but problems 25–27 and 31, your command should be in the form of worms[rows, columns], where you substitute values for rows and columns as necessary. Answers to 25–27 and 31 should be vectors. To repeat, we want to see the data that corresponds to particular conditions, not the indices of rows or columns that match particular conditions.
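To make the distinction between data and indices concrete, here is a sketch on a small hypothetical data frame called plots. Its columns and values are invented for illustration and are not the worms data.

```r
# Hypothetical toy data frame; not the worms data set.
plots <- data.frame(
  area = c(3.6, 5.1, 2.8, 0.8),
  cover = c("Grass", "Crop", "Grass", "Scrub"),
  wet = c(FALSE, FALSE, TRUE, FALSE),
  row.names = c("North", "South", "East", "West")
)

# The data: a data frame of the rows that meet the condition.
# Leaving the column slot empty keeps every column.
plots[plots$area > 3, ]

# Only the indices of the matching rows, which is not what is asked for here.
which(plots$area > 3)
```

The first command has the form dataframe[rows, columns] and returns the matching rows themselves; the second returns only their positions.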

Particular rows

Perform the following using $ notation. Do not use attach().

1. Get all rows where the soil pH is 4.0.

2. Get all rows where the vegetation is Arable.

3. Get the row named Ashurst. Select it by name using the rownames() command; do not embed the row number as a magic number.

4. Get all rows where the slope is greater than 8.

5. Get all rows that are not damp.

6. Get all rows that have an area less than or equal to 2.

7. Get all rows where worm density is not equal to 4.

8. Get all rows where the vegetation is Grassland or Arable. Hint: use | to handle the logical or. On a Mac keyboard, this is on the backslash key, just above the Return key.

9. Get all rows where area is greater than or equal to 3 and slope is greater than or equal to 2. Hint: use & for the logical and.

10. Get all rows where the vegetation is Grassland and the soil is damp.
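The comparison and logical operators used in these problems can be sketched on a hypothetical toy data frame. The plots object below and its columns are invented for illustration; the same patterns carry over to other data frames.

```r
# Hypothetical toy data frame; not the worms data set.
plots <- data.frame(
  area = c(3.6, 5.1, 2.8, 0.8),
  cover = c("Grass", "Crop", "Grass", "Scrub"),
  wet = c(FALSE, FALSE, TRUE, FALSE),
  row.names = c("North", "South", "East", "West")
)

# Equality and inequality tests on a column accessed with $.
plots[plots$cover == "Crop", ]
plots[plots$area != 2.8, ]

# ! negates a logical column.
plots[!plots$wet, ]

# | is the logical or; & is the logical and.
plots[plots$cover == "Crop" | plots$cover == "Scrub", ]
plots[plots$area >= 2 & !plots$wet, ]

# Selecting a row by name with rownames(), without hard-coding its number.
plots[rownames(plots) == "East", ]
```

Each condition produces a logical vector with one value per row, and the rows where it is TRUE are returned.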

For problems 11–20, repeat problems 1–10 after running the attach() command on worms. Do not use $ notation and do not use row numbers. When you have finished problems 11–20, detach the worms data frame.
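A minimal sketch of the attach() and detach() pattern, again on a hypothetical toy data frame named plots rather than the worms data:

```r
# Hypothetical toy data frame; not the worms data set.
plots <- data.frame(
  area = c(3.6, 5.1, 2.8, 0.8),
  cover = c("Grass", "Crop", "Grass", "Scrub"),
  wet = c(FALSE, FALSE, TRUE, FALSE),
  row.names = c("North", "South", "East", "West")
)

# After attach(), column names can be used directly, without $.
attach(plots)
plots[area > 3 & !wet, ]

# Detach when finished so the column names no longer sit on the search path.
detach(plots)
```

Forgetting to detach leaves the columns visible by name, which can mask or be masked by other objects with the same names.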

Do the following with row number notation. Do not use $ signs and do not use attach(). Use c() only if necessary.

21. Get rows 1 to 10.

22. Get rows 1 to 5 and row 9.

23. Get all rows except the first. Your answer should not presume knowledge of how many rows there are in the data frame. Hint: the simplest answer involves a minus sign.

24. Get all rows except 10–15, with the same restrictions as problem 23.
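Row-number notation, including negative indices for exclusion, can be sketched as follows on a hypothetical toy data frame. The plots object is invented for illustration.

```r
# Hypothetical toy data frame; not the worms data set.
plots <- data.frame(
  area = c(3.6, 5.1, 2.8, 0.8),
  cover = c("Grass", "Crop", "Grass", "Scrub"),
  wet = c(FALSE, FALSE, TRUE, FALSE),
  row.names = c("North", "South", "East", "West")
)

# A contiguous range of rows needs only the colon operator.
plots[1:2, ]

# c() is needed only to combine a range with other row numbers.
plots[c(1:2, 4), ]

# A negative index drops rows, without knowing how many rows exist.
plots[-1, ]

# Parentheses matter when dropping a range: -(2:3) drops rows 2 and 3,
# whereas -2:3 would expand to -2, -1, 0, 1, 2, 3 and mix signs.
plots[-(2:3), ]
```

Negative indexing works the same way regardless of the size of the data frame, which is why it suits problems that must not assume the number of rows.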

Particular columns

Do questions 25–27 with $ notation. Do not use attach().

25. Get the Damp column.

26. Get the Vegetation column.

27. Get the Area column.

Do questions 28–31 with column number notation. Do not use $ signs or attach().

28. Get columns 2 through 4.

29. Get columns 2 and 4 (but not 3).

30. Get all columns except column 5. Your answer should not presume to know how many columns there are in this data frame. Hint: you solved a similar problem when accessing rows.

31. Get all the row names.
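The difference between $ notation, which returns a vector, and column-number notation, which returns a data frame when more than one column is kept, can be sketched on a hypothetical toy data frame. The plots object is invented for illustration.

```r
# Hypothetical toy data frame; not the worms data set.
plots <- data.frame(
  area = c(3.6, 5.1, 2.8, 0.8),
  cover = c("Grass", "Crop", "Grass", "Scrub"),
  wet = c(FALSE, FALSE, TRUE, FALSE),
  row.names = c("North", "South", "East", "West")
)

# $ notation returns a single column as a vector.
plots$cover

# Column numbers go in the second slot; the empty first slot keeps all rows.
plots[, 1:2]
plots[, c(1, 3)]

# A negative column number drops that column.
plots[, -2]

# Row names are returned as a character vector.
rownames(plots)
```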

Combinations of specified rows and columns

For questions 32–36, use logical operations to find particular rows; do not specify row numbers. Do not use attach().

32. Get columns 2 and 4 for the rows for which area is greater than 4.

33. Get columns 1 and 6 for the cases in which the vegetation is Grassland.

34. Get columns 4 and 6 for the cases in which the area is greater than 2 and the ground is not damp.

35. Get columns 1 and 2 for which the vegetation is not Grassland.

36. Get all columns except column 3 for which the vegetation is not Meadow.
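Combining a logical row condition with a column selection can be sketched as follows, using a hypothetical toy data frame. The plots object is invented for illustration.

```r
# Hypothetical toy data frame; not the worms data set.
plots <- data.frame(
  area = c(3.6, 5.1, 2.8, 0.8),
  cover = c("Grass", "Crop", "Grass", "Scrub"),
  wet = c(FALSE, FALSE, TRUE, FALSE),
  row.names = c("North", "South", "East", "West")
)

# Logical condition in the row slot, column numbers in the column slot.
plots[plots$area > 1 & !plots$wet, c(1, 2)]

# A negative column number combines with a row condition in the same way.
plots[plots$cover != "Grass", -3]
```

The row slot and the column slot are evaluated independently, so any row condition can be paired with any column selection.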

Turn in the R commands needed for this problem set, but do not send any of the data set files, as I already have them. Add a comment before each answer indicating the question number (e.g., # 21), with a blank line before (but not after) the comment. In other words, there should be a blank line, then a comment with the problem number, and then a command with your solution. E-mail your commands file to Steven Holland, following all the standard instructions. This problem set is due 11 September.