R pitfall #3: friggin’ factors

I received an email from one of my students expressing deep frustation with a seemingly simple problem. He had a factor containing names of potato lines and wanted to set some levels to NA. Using simple letters as example names he was baffled by the result of the following code:

lines = factor(LETTERS)
lines
# [1] A B C D E F G H...
# Levels: A B C D E F G H...

linesNA = ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
linesNA
#  [1]  1  2 NA  4  5  6 NA  8...

The factor has been converted to numeric and there was no trace of the level names. Even forcing the conversion to be a factor loses the level names. Newbie frustation guaranteed!

linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'), NA, lines))
linesNA
# [1] 1    2    <NA> 4    5    6    <NA> 8...
# Levels: 1 2 4 5 6 8...

Under the hood factors are numerical vectors (of class factor) that have associated character vectors to describe the levels (see Patrick Burns’s R Inferno PDF for details). We can deal directly with the levels using this:

linesNA = lines
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] = NA
linesNA
# [1] A    B    <NA> D    E    F    <NA> H...
#Levels: A B D E F H...

We could operate directly on lines (without creating linesNA), which is there to maintain consistency with the previous code. Another way of doing the same would be:

linesNA = factor(as.character(ifelse(lines %in%
                 c('C', 'G', 'P'), NA, lines)))
linesNA
# [1] A    B    <NA> D    E    F    <NA> H...
#Levels: A B D E F H...

I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).

R pitfall #1: check data structure

A common problem when running a simple (or not so simple) analysis is forgetting that the levels of a factor has been coded using integers. R doesn’t know that this variable is supposed to be a factor and when fitting, for example, something as simple as a one-way anova (using lm()) the variable will be used as a covariate rather than as a factor.

There is a series of steps that I follow to make sure that I am using the right variables (and types) when running a series of analyses. I always define the working directory (using setwd()), so I know where the files that I am reading from and writing to are.

After reading a dataset I will have a look at the first and last few observations (using head() and tail(), which by default show 6 observations). This gives you an idea of how the dataset looks like, but it doesn’t confirm the structure (for example, which variables are factors). The function str() provides a good overview of variable types and together with summary() one gets an idea of ranges, numbers of observations and missing values.

# Define your working directory (folder). This will make
# your life easier. An example in OS X:
setwd('~/Documents/apophenia')

# and one for a Windows machine
setwd('c:/Documents/apophenia')

# Read the data
apo = read.csv('apophenia-example.csv', header = TRUE)

# Have a look at the first few and last few observations
head(apo)
tail(apo)

# Check the structure of the data (which variables are numeric,
# which ones are factors, etc)
str(apo)

# Obtain a summary for each of the variables in the dataset
summary(apo)

This code should help you avoid the ‘fitting factors as covariates’ pitfall; anyway, always check the degrees of freedom of the ANOVA table just in case.