I received an email from one of my students expressing deep frustration with a seemingly simple problem. He had a factor containing names of potato lines and wanted to set some levels to NA. With single letters standing in for the names, he was baffled by the result of the following code:

lines = factor(LETTERS)
lines
# [1] A B C D E F G H...
# Levels: A B C D E F G H...

linesNA = ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
linesNA
#  [1]  1  2 NA  4  5  6 NA  8...


The factor has been converted to its integer codes and there is no trace of the level names. Even forcing the result back to a factor loses the level names. Newbie frustration guaranteed!

linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'), NA, lines))
linesNA
# [1] 1    2    <NA> 4    5    6    <NA> 8...
# Levels: 1 2 4 5 6 8...


Under the hood, factors are integer vectors (of class factor) with an associated character vector that describes the levels (see Patrick Burns’s R Inferno PDF for details). We can deal directly with the levels using this:

linesNA = lines
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] = NA
linesNA
# [1] A    B    <NA> D    E    F    <NA> H...
# Levels: A B D E F H...
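
The internal representation described above can be inspected directly. This is a minimal sketch, assuming a fresh session; the name codes is introduced here purely for illustration.

```r
lines <- factor(LETTERS)

typeof(lines)         # "integer": the storage type of the underlying vector
head(unclass(lines))  # the integer codes, with the levels kept as an attribute
levels(lines)[1:5]

# ifelse() takes its shape and attributes from the test vector, which is a
# plain logical here, so only the integer codes of the factor survive
codes <- ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
class(codes)          # "integer"

# Editing the levels keeps the object a factor
linesNA <- lines
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] <- NA
class(linesNA)        # "factor"
which(is.na(linesNA)) # 3 7 16
```

Working on the levels changes only the lookup table of labels, which is why the class and the remaining level names are preserved.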


We could operate directly on lines; linesNA is created only to stay consistent with the previous code. Another way of doing the same would be:

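# Convert to character before ifelse(), which would otherwise
# return the integer codes instead of the labels
linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'),
                        NA, as.character(lines)))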
linesNA
# [1] A    B    <NA> D    E    F    <NA> H...
# Levels: A B D E F H...
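
With the ifelse() route, the point at which as.character() is applied matters, because ifelse() hands back integer codes when given a factor. A small sketch comparing the two placements follows; the names drop, late and early are hypothetical and used only for this illustration.

```r
lines <- factor(LETTERS)
drop <- c('C', 'G', 'P')  # hypothetical name for the levels to blank out

# Converting after ifelse(): the labels are already gone, so the
# "levels" are just the integer codes written as text
late <- factor(as.character(ifelse(lines %in% drop, NA, lines)))
head(levels(late))

# Converting before ifelse(): the labels survive
early <- factor(ifelse(lines %in% drop, NA, as.character(lines)))
head(levels(early))
```

Only the second form produces a factor whose levels are the original letters minus the three that were set to NA.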


I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).

• Maybe this is more intuitive (although it is not really different from your approach):

lines <- factor(LETTERS)

linesNA <- lines

levels(linesNA) <- ifelse(levels(linesNA) %in% c('C', 'G', 'P'), NA, levels(lines))

• You can do:

lines[lines %in% c('C', 'G', 'P')] <- NA
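
One difference worth noting with direct assignment: it blanks the values but leaves C, G and P in the set of levels. A minimal sketch, assuming the unused levels are not wanted, uses droplevels() from base R to remove them.

```r
lines <- factor(LETTERS)
lines[lines %in% c('C', 'G', 'P')] <- NA

nlevels(lines)             # C, G and P are still listed as levels
table(lines)[c('C', 'G')]  # ...with zero counts

lines <- droplevels(lines) # remove levels that no longer occur
nlevels(lines)
```

Whether the empty levels matter depends on what comes next; tables, plots and model fitting can all treat unused levels as categories in their own right.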

• My latest complaint about factors is this:

R> factor(letters[1:10])
[1] a b c d e f g h i j
Levels: a b c d e f g h i j
R> nchar(factor(letters[1:10]))
[1] 1 1 1 1 1 1 1 1 1 2

• Wow! R is counting the number of characters of the internal numeric representation of levels. Devious and nightmarish to debug! I share your pain.
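
The result above can be reproduced explicitly by counting the characters of the integer codes, and avoided by converting to character first. This is a minimal sketch; as far as the documentation for nchar() indicates, recent versions of R stop with an error when given a factor rather than silently counting the codes, so the factor is never passed to nchar() directly here.

```r
f <- factor(letters[1:10])

nchar(as.integer(f))    # widths of the codes 1..10, so the last one is 2
nchar(as.character(f))  # widths of the labels themselves, all 1
```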

• Several stumbling blocks with factors are shown at the beginning of Circle 8.2 of 'The R Inferno' http://www.burns-stat.com/pages/Tutor/R_inferno.p

• Thanks for pointing out the exact location. I very much like your writing in the Inferno!

• You know, I have an embarrassing pitfall with the simple function "save". I can't save an R object in a file, because it saves a character string of the object instead of the contents of the object! Then I tried saving the whole session, and it saved a vector of all the objects' names. :facepalm:

P.S. It is in moments like this that I wish I had formal training in R programming.

• It should be straightforward; for example:

a = c('a', 'b', 'c')
save(a, file = 'whatever.Rdata')

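Quoting the object name, as in save('a', file = 'whatever.Rdata'), saves the same object, because save() accepts names either as symbols or as character strings. If you get names back instead of contents, check how the file is read: load() restores the objects and returns only their names. I hope this helps, Luis.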
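
One plausible source of "names instead of contents" is treating the return value of load() as the data. The following round trip is a sketch under that assumption; the names path, restored, rds and b are hypothetical, and tempfile() is used only to avoid writing into the working directory.

```r
a <- c('a', 'b', 'c')
path <- tempfile(fileext = '.RData')  # hypothetical location

save(a, file = path)
rm(a)

# load() recreates 'a' in the workspace and returns the names it restored
restored <- load(path)
restored  # "a"
a         # the original contents are back

# saveRDS()/readRDS() store a single object and return its value instead
rds <- tempfile(fileext = '.rds')
saveRDS(a, rds)
b <- readRDS(rds)
identical(a, b)
```

With save()/load() the object comes back under its original name, so the result of load() rarely needs to be assigned; with saveRDS()/readRDS() the value is returned and can be given any name.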

• Thanks to Rbloggers I solved this "easy" task. The problem was that I did something like this:

> x.var <- rnorm(100)
> save(x.var, file="foo")
> rm(x.var)