Subsetting data

2013-05-07 / Luis

At School we use R across many courses, because students are supposed to use statistics in a variety of contexts. Imagine their disappointment when they pass stats and discover that R and statistics haven’t gone away!

When students start working with real data sets, one of their first stumbling blocks is subsetting data. Either they are required to deal with different subsets of a data set or there is data cleaning to do. For some reason, many students struggle with what should be a simple task.

If one thinks of data as a matrix/2-dimensional array, subsetting boils down to extracting the needed rows (cases) and columns (variables). In the R world one can do this in a variety of ways, ranging from the cryptic to the explicit and clear. For example, let’s assume that we have a dataset called alltrials with heights and stem diameters for a number of trees in different trials (we may have a bunch of additional covariates and experimental design features that we’ll ignore for the moment). How do we extract all trees located in Christchurch?

Two common approaches are:

mytrial <- alltrials[alltrials$location == "Christchurch", ]

mytrial <- subset(alltrials, location == "Christchurch")

While both expressions are equivalent (as long as the condition has no missing values), the former reads like Klingon to students, while the latter makes explicit that we are obtaining a subset of the original data set. This can easily be expanded to more complex conditions; for example, to include all trees from Christchurch that are taller than 10 m:

mytrial <- alltrials[alltrials$location == "Christchurch" & alltrials$height > 10, ]

mytrial <- subset(alltrials, location == "Christchurch" & height > 10)
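
To try these expressions without the original data, here is a minimal sketch using a small made-up alltrials (the values and the tree column are invented for illustration). It also shows one case where the two notations differ: when the condition evaluates to NA, the bracket form returns a row of NAs while subset() drops that row.

```r
# Toy stand-in for alltrials; all values are made up
alltrials <- data.frame(
  tree     = 1:6,
  location = c("Christchurch", "Christchurch", "Rotorua",
               "Christchurch", NA, "Rotorua"),
  height   = c(12.1, 8.4, 15.0, 10.5, 11.2, 9.8),
  stringsAsFactors = FALSE
)

# Bracket notation: a condition that evaluates to NA yields a row full of NAs
klingon <- alltrials[alltrials$location == "Christchurch" &
                       alltrials$height > 10, ]

# subset() treats NA in the condition as FALSE and drops that row
explicit <- subset(alltrials, location == "Christchurch" & height > 10)

nrow(klingon)   # 3: trees 1 and 4 plus one all-NA row
nrow(explicit)  # 2: trees 1 and 4

# Wrapping the condition in which() makes the bracket form drop NA rows too
klingon_clean <- alltrials[which(alltrials$location == "Christchurch" &
                                   alltrials$height > 10), ]
```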

I think the complication with the Klingonian notation comes mostly from two sources:

Variable names for subsetting the data set are not directly accessible, so we have to prefix them with the NameOfDataset$, making the code more difficult to read, particularly if we join several conditions with & and |.
Hanging commas: if we are only working with rows or columns we have to acknowledge it by suffixing or prefixing with a comma, which is often forgotten.

Both points result in frustrating error messages like

Error in `[.data.frame`(alltrials, location == "Christchurch", ) : object 'location' not found

for the first point, or

Error in `[.data.frame`(alltrials, alltrials$location == "Christchurch") : undefined columns selected

for the second point.
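
Both errors can be reproduced on a made-up alltrials. This is only a sketch: the data values and the helper error_message() are invented for illustration, and the helper captures the message so the script keeps running.

```r
# Toy data: 5 rows, 3 columns (values are made up)
alltrials <- data.frame(
  location = c("Christchurch", "Rotorua", "Rotorua",
               "Christchurch", "Christchurch"),
  height   = c(12.1, 15.0, 8.4, 10.5, 9.9),
  diameter = c(20.3, 25.1, 14.8, 18.2, 16.7)
)

# Hypothetical helper: return the error message instead of stopping
error_message <- function(expr) {
  tryCatch({ expr; "no error" }, error = function(e) conditionMessage(e))
}

# First point: column name used without the alltrials$ prefix,
# so R looks for a free-standing object called location
msg_prefix <- error_message(alltrials[location == "Christchurch", ])

# Second point: without the trailing comma the logical vector is read
# as a column selection; here it is TRUE at positions 4 and 5, but the
# data frame only has 3 columns
msg_comma <- error_message(alltrials[alltrials$location == "Christchurch"])

msg_prefix
msg_comma
```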

The generic forms of these two notations are:

dataset[what to do with rows, what to do with columns]

subset(dataset, what to do with rows, what to do with columns)
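
As a minimal sketch of these generic forms (on a made-up alltrials), either slot can be left empty in the bracket notation as long as the comma stays, while subset() takes the columns through its select argument:

```r
# Toy data; values are made up
alltrials <- data.frame(
  location = c("Christchurch", "Rotorua", "Christchurch", "Rotorua"),
  height   = c(12.1, 15.0, 8.4, 9.8),
  diameter = c(20.3, 25.1, 14.8, 16.0)
)

# Rows only: condition before the comma, nothing after it
tall_rows  <- alltrials[alltrials$height > 10, ]
tall_rows2 <- subset(alltrials, height > 10)

# Columns only: nothing before the comma, column names after it
two_cols  <- alltrials[, c("location", "height")]
two_cols2 <- subset(alltrials, select = c(location, height))

# Rows and columns at once
both  <- alltrials[alltrials$height > 10, c("location", "height")]
both2 <- subset(alltrials, height > 10, select = c(location, height))
```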

We often want to keep a subset of the observed cases and keep (or drop) specific variables. For example, we want to keep trees in 'Christchurch' that are taller than 10 m and we want to ignore diameter, because the assessor was 'high' that day:

# With this notation things get a bit trickier
# The easiest way is to provide the number of the variable
# Here diameter is the 5th variable in the dataset
mytrial <- alltrials[alltrials$location == "Christchurch" & alltrials$height > 10, -5]

# This notation is still straightforward
mytrial <- subset(alltrials, location == "Christchurch" & height > 10, select = -diameter)
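
Dropping a column by position breaks if the column order changes. A sketch of two ways to drop diameter by name while staying in the bracket notation, using a made-up alltrials (the keep_rows variable is introduced here only for readability):

```r
# Toy data; values are made up
alltrials <- data.frame(
  location = c("Christchurch", "Rotorua", "Christchurch", "Christchurch"),
  height   = c(12.1, 15.0, 8.4, 10.5),
  diameter = c(20.3, 25.1, 14.8, 18.2)
)

keep_rows <- alltrials$location == "Christchurch" & alltrials$height > 10

# Option 1: logical test on the column names
mytrial1 <- alltrials[keep_rows, names(alltrials) != "diameter"]

# Option 2: build the vector of names to keep with setdiff()
mytrial2 <- alltrials[keep_rows, setdiff(names(alltrials), "diameter")]

# subset() version, for comparison
mytrial3 <- subset(alltrials, location == "Christchurch" & height > 10,
                   select = -diameter)
```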

There are, however, situations where Klingon is easier or more efficient. For example, to take a random sample of 100 trees from the full dataset:

mytrial <- alltrials[sample(1:nrow(alltrials), 100, replace = FALSE),]
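
A sketch of the same idea on made-up data. The call to set.seed() is an addition here, with an arbitrary seed, so that the same rows are drawn on every run:

```r
# Made-up data: 500 trees with random heights
set.seed(2013)  # arbitrary seed, only for reproducibility
alltrials <- data.frame(
  tree   = 1:500,
  height = round(runif(500, min = 2, max = 20), 1)
)

# sample(n, size) draws size values from 1:n, without replacement by default
picked  <- sample(nrow(alltrials), 100)
mytrial <- alltrials[picked, ]

nrow(mytrial)
```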

If you are interested in this issue, Quick-R has a good description of subsetting. I'm sure this basic topic must have been covered many times, although I doubt anyone used Klingon in the process.

Gratuitous picture: building blocks for R.


9 Comments

Thomas Lumley
2013-05-07 at 18:32

There’s also
with(alltrials, alltrials[location == "Christchurch" & height > 10, ])
which isn’t as simple as the subset() approach but is more generally useful
- Luis (Post author)
  2013-05-07 at 19:53
  
  You are right Thomas. I should be using with() more with the students.
  - Fr.
    2013-05-08 at 02:44
    
    Coincidentally, I just recommended with() on a different but related question. The thread might be of interest.
    - Luis (Post author)
      2013-05-08 at 11:22
      
      Thanks. I was drafting a follow up post dealing with with() for my students. I should mention that use.
Robert Young
2013-05-08 at 08:55

If you’re only doing a subset once, then using R is a reasonable option. On the other hand, if you’re going to be doing serial analysis (the norm in the “real world”) on some data, then it is wiser to build the data in an RDBMS and use SQL to do the data manipulation; that’s what relational databases and SQL are for.
- Luis (Post author)
  2013-05-08 at 11:27
  
  Good point. I think it boils down to the complexity/size of the data set and how often the data will be updated. For often-updated and large data sets I would certainly go for the RDBMS + SQL combination.
  
  In my case I tend to deal with lots of single-use, disparate data sets, where the overhead of setting up the database and connection is just too big to justify going that route.
- robhodde
  2016-02-04 at 14:04
  
  Couldn’t agree more. R seems incredibly “not scalable”
PirateGrunt
2013-05-08 at 13:48

How old are your students? I had to wait until I was over 40 to start doing R and statistical programming in earnest. They don’t know it, but they’re very fortunate.

When I was first learning R, I found that I used subset a lot, for the reasons you mentioned. Eventually, I needed the flexibility of the Klingon approach. By the time I got there, I had gotten comfortable enough with the basic concepts that it seemed like English to me. Or perhaps I had started speaking Klingon without realizing it.

-PG
- Luis (Post author)
  2013-05-08 at 15:01
  
  Around 20-22 yo. I think it’s better to start with easier-to-read syntax, and then people can pick up Klingon more easily.