R pitfall #3: friggin’ factors

I received an email from one of my students expressing deep frustration with a seemingly simple problem. He had a factor containing names of potato lines and wanted to set some levels to NA. Using simple letters as example names, he was baffled by the result of the following code:

[sourcecode lang="r"]
lines = factor(LETTERS)
lines
# [1] A B C D E F G H…
# Levels: A B C D E F G H…

linesNA = ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
linesNA
# [1] 1 2 NA 4 5 6 NA 8…
[/sourcecode]

The factor has been converted to numeric, with no trace of the level names. Even wrapping the result in factor() does not bring them back: the new levels are just the numbers. Newbie frustration guaranteed!

[sourcecode lang="r"]
linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'), NA, lines))
linesNA
# [1] 1 2 <NA> 4 5 6 <NA> 8…
# Levels: 1 2 4 5 6 8…
[/sourcecode]
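A minimal sketch of what seems to be going on, assuming the same `lines` factor as above (the names `test`, `res` and `codes` are only illustrative): `ifelse()` builds its result from the logical test vector, so the factor attributes are not carried over and only the integer codes survive.

```r
# Sketch: ifelse() hands back the factor's integer codes, not its labels
lines <- factor(LETTERS)
test <- lines %in% c('C', 'G', 'P')

res <- ifelse(test, NA, lines)
is.factor(res)    # the result is a plain vector, not a factor

# The surviving values match the factor's internal integer codes
codes <- as.integer(lines)
all(res[!test] == codes[!test])
```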

Under the hood, factors are integer vectors (of class factor) with an associated character vector that describes the levels (see Patrick Burns’s R Inferno PDF for details). We can deal directly with the levels using this:

[sourcecode lang="r"]
linesNA = lines
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] = NA
linesNA
# [1] A B <NA> D E F <NA> H…
# Levels: A B D E F H…
[/sourcecode]

We could operate directly on lines (without creating linesNA), which is there to maintain consistency with the previous code. Another way of doing the same would be:

[sourcecode lang="r"]
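# Convert to character *before* ifelse(), so the labels
# (not the integer codes) are what gets kept
linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'),
                        NA, as.character(lines)))
linesNA
# [1] A B <NA> D E F <NA> H…
# Levels: A B D E F H…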
[/sourcecode]
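As a rough check, the two routes (editing the levels, or converting to character inside `ifelse()`) can be compared against each other. This is only a sketch; the names `targets`, `a` and `b` are made up for the example.

```r
# Sketch: both approaches should give the same labels and levels
lines <- factor(LETTERS)
targets <- c('C', 'G', 'P')

# Route 1: set the unwanted levels to NA
a <- lines
levels(a)[levels(a) %in% targets] <- NA

# Route 2: work on the labels as character, then rebuild the factor
b <- factor(ifelse(lines %in% targets, NA, as.character(lines)))

identical(levels(a), levels(b))
identical(as.character(a), as.character(b))
```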

I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).

11 thoughts on “R pitfall #3: friggin’ factors”

• 1999/11/30 at 1:00 pm

Maybe this is more intuitive (although it is not really different from your approach):

lines <- factor(LETTERS)

linesNA <- lines

levels(linesNA) <- ifelse(levels(linesNA) %in% c('C', 'G', 'P'), NA, levels(lines))

• 2011/12/16 at 3:15 am

You can do:

lines[lines %in% c('C', 'G', 'P')] <- NA
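
One difference worth knowing, sketched here on a copy so the original `lines` stays intact: assigning NA to the elements leaves C, G and P in the set of levels, whereas `droplevels()` removes the unused ones.

```r
# Sketch: element-wise NA assignment keeps the now-unused levels
lines <- factor(LETTERS)
linesNA <- lines
linesNA[linesNA %in% c('C', 'G', 'P')] <- NA

nlevels(linesNA)               # 26: C, G and P are still levels
nlevels(droplevels(linesNA))   # 23: unused levels removed
```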

• 2011/12/16 at 4:02 am

My latest complaint about factors is this:

R> factor(letters[1:10])
[1] a b c d e f g h i j
Levels: a b c d e f g h i j
R> nchar(factor(letters[1:10]))
[1] 1 1 1 1 1 1 1 1 1 2

• 2011/12/16 at 8:24 am

Wow! R is counting the number of characters of the internal numeric representation of levels. Devious and nightmarish to debug! I share your pain.
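
A small sketch that makes the effect explicit by pulling out the integer codes by hand (the names `f` and `codes` are illustrative). Note that, as far as I know, more recent versions of R refuse `nchar()` on a factor with an error, so the explicit conversions below are the safer way to see both counts.

```r
# Sketch: character counts of the codes versus the labels
f <- factor(letters[1:10])

codes <- as.integer(f)     # internal codes 1, 2, ..., 10
nchar(codes)               # widths of "1" to "10"; the last one is 2

nchar(as.character(f))     # widths of the labels, all 1
```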

• 2011/12/16 at 8:21 am

Thanks for pointing out the exact location. I like your writing in the Inferno very much!

• 2011/12/22 at 4:54 am

You know, I have an embarrassing pitfall with the simple function "save". I can't save an R object in a file, because it saves a character string of the object instead of the contents of the object! Then I tried saving the whole session, and it saves a vector of all the objects' names. :facepalm:
The sad thing is I haven't solved it yet.

P.S. Moments like this are when I wish I had formal training in R programming.

• 2011/12/22 at 10:22 am

It should be straightforward; for example:

a = c('a', 'b', 'c')
save(a, file = 'whatever.Rdata')

However, if you put the object name between quotes—save('a', file = 'whatever.Rdata')—you will get the name, which is not what you want. I hope this helps, Luis.

• 2011/12/27 at 1:51 am

Thanks to Rbloggers I solved this "easy" task. The problem is that I did something like this:

> x.var <- rnorm(100)
> save(x.var, file="foo")
> rm(x.var)
> something <- load("foo")
> something
[1] "x.var"

My mistake was assigning the result to a variable, which is not needed; the line load("foo") on its own is enough. My line of thought was: "It is good to keep my models, temporary data and other stuff inside variables so I can interact with them later, so let's load the R object and put it inside a variable!"
Maybe I had a weird line of thought…
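
For anyone with the same instinct, here is a hedged sketch of both routes, using temporary files rather than "foo" (the names `f`, `g`, `original` and `loaded_names` are made up): `load()` restores the object under its original name and returns only the name, while `saveRDS()`/`readRDS()` lets you assign the contents to whatever variable you like.

```r
# Sketch: load() returns names; readRDS() returns the object itself
x.var <- rnorm(100)
original <- x.var

f <- tempfile(fileext = ".RData")
save(x.var, file = f)
rm(x.var)
loaded_names <- load(f)    # restores x.var, returns "x.var"
identical(x.var, original)

g <- tempfile(fileext = ".rds")
saveRDS(original, file = g)
something <- readRDS(g)    # the object, under a name of your choosing
identical(something, original)
```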

• 2012/01/03 at 9:14 am

It seems that most of the issue here is the idea that a factor is both a vector of integer codes and a set of accompanying labels. This is a powerful representation, but it needs to be taken into account when dealing with the factor structure.

So, when you want the character count of the labels, you have to tell R it is the labels you are thinking about…

nchar(as.character(factor(letters[1:10])))
