Careless comparison bites back (again)

When running stats labs I like to allocate a slightly different subset of data to each student, which acts as an incentive for people to do their own work (rather than copying the same results from a fellow student). We also need to be able to replicate the results when marking, so we need a record of exactly which observations were dropped to create a particular data set. I have done this in a variety of ways, but this time I opted for code that looked like:

setwd('~/Dropbox/teaching/stat202-2012')
 
biom = read.csv('biom2012.csv', header = TRUE)
drops = read.csv('lab4-dels.csv', header = TRUE)
 
# Use here your OWN student code
my.drop = subset(drops, student.code == 'mjl159')
my.data = subset(biom, !(id %in% my.drop))

Thus, we were reading a full data set and assigning it to biom, reading a table that contained student codes in the first column and 5 columns with observations to be dropped (assigned to drops) and choosing one row of drops depending on the student code (assigned to my.drop). As an example, for student ‘mjl159′ my.drop looked like:

my.drop
#   student.code X1 X2 X3 X4 X5
#40       mjl159 21 18 30 15 43

The problem with the code is the comparison !(id %in% my.drop), as it includes the factor student.code, so when R makes the comparison checking if a record is in my.drop it converts the text, e.g. ‘mjl159’, to the number of level, e.g. 41, which makes the code to delete ONE MORE observation (in this #41) on top of the ones the student was allocated. This happens only for some students, where the number of level is not included in the list of observations to drop.

This is another version of R pitfall #3: friggin’ factors. A simple workaround is to change the comparison to !(id %in% my.drop[2:6]). I should know better than this.

Gratuitous image: Tree spread on metal frame to provide shade in a plaza, Lisbon, Portugal. Some days I would love to have a coffee there without computer, just watching the world pass by. (Photo: Luis).

R pitfall #3: friggin’ factors

I received an email from one of my students expressing deep frustation with a seemingly simple problem. He had a factor containing names of potato lines and wanted to set some levels to NA. Using simple letters as example names he was baffled by the result of the following code:

[sourcecode lang=”r”]
lines = factor(LETTERS)
lines
# [1] A B C D E F G H…
# Levels: A B C D E F G H…

linesNA = ifelse(lines %in% c(‘C’, ‘G’, ‘P’), NA, lines)
linesNA
# [1] 1 2 NA 4 5 6 NA 8…
[/sourcecode]

The factor has been converted to numeric and there was no trace of the level names. Even forcing the conversion to be a factor loses the level names. Newbie frustation guaranteed!

[sourcecode lang=”r”]
linesNA = factor(ifelse(lines %in% c(‘C’, ‘G’, ‘P’), NA, lines))
linesNA
# [1] 1 2 <NA> 4 5 6 <NA> 8…
# Levels: 1 2 4 5 6 8…
[/sourcecode]

Under the hood factors are numerical vectors (of class factor) that have associated character vectors to describe the levels (see Patrick Burns’s R Inferno PDF for details). We can deal directly with the levels using this:

[sourcecode lang=”r”]
linesNA = lines
levels(linesNA)[levels(linesNA) %in% c(‘C’, ‘G’, ‘P’)] = NA
linesNA
# [1] A B <NA> D E F <NA> H…
#Levels: A B D E F H…
[/sourcecode]

We could operate directly on lines (without creating linesNA), which is there to maintain consistency with the previous code. Another way of doing the same would be:

[sourcecode lang=”r”]
linesNA = factor(as.character(ifelse(lines %in%
c(‘C’, ‘G’, ‘P’), NA, lines)))
linesNA
# [1] A B <NA> D E F <NA> H…
#Levels: A B D E F H…
[/sourcecode]

I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).

R pitfall #1: check data structure

A common problem when running a simple (or not so simple) analysis is forgetting that the levels of a factor has been coded using integers. R doesn’t know that this variable is supposed to be a factor and when fitting, for example, something as simple as a one-way anova (using lm()) the variable will be used as a covariate rather than as a factor.

There is a series of steps that I follow to make sure that I am using the right variables (and types) when running a series of analyses. I always define the working directory (using setwd()), so I know where the files that I am reading from and writing to are.

After reading a dataset I will have a look at the first and last few observations (using head() and tail(), which by default show 6 observations). This gives you an idea of how the dataset looks like, but it doesn’t confirm the structure (for example, which variables are factors). The function str() provides a good overview of variable types and together with summary() one gets an idea of ranges, numbers of observations and missing values.

[sourcecode language=”r”]
# Define your working directory (folder). This will make
# your life easier. An example in OS X:
setwd(‘~/Documents/apophenia’)

# and one for a Windows machine
setwd(‘c:/Documents/apophenia’)

# Read the data
apo = read.csv(‘apophenia-example.csv’, header = TRUE)

# Have a look at the first few and last few observations
head(apo)
tail(apo)

# Check the structure of the data (which variables are numeric,
# which ones are factors, etc)
str(apo)

# Obtain a summary for each of the variables in the dataset
summary(apo)
[/sourcecode]

This code should help you avoid the ‘fitting factors as covariates’ pitfall; anyway, always check the degrees of freedom of the ANOVA table just in case.