R pitfall #3: friggin’ factors

I received an email from one of my students expressing deep frustration with a seemingly simple problem. He had a factor containing names of potato lines and wanted to set some levels to NA. Using simple letters as example names, he was baffled by the result of the following code:

lines = factor(LETTERS)
lines
# [1] A B C D E F G H...
# Levels: A B C D E F G H...
 
linesNA = ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
linesNA
#  [1]  1  2 NA  4  5  6 NA  8...

The factor has been converted to a numeric vector and there is no trace of the level names. Even forcing the result back into a factor loses the level names. Newbie frustration guaranteed!

linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'), NA, lines))
linesNA
# [1] 1    2    <NA> 4    5    6    <NA> 8...
# Levels: 1 2 4 5 6 8...
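A quick way to see what went wrong is to inspect the intermediate result. The following is only a minimal sketch; the names `codes` and `refactored` are made up for illustration and are not part of the student's code.

```r
lines = factor(LETTERS)

# ifelse() hands back the internal codes, not a factor
codes = ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
is.factor(codes)   # FALSE: the factor class is gone
typeof(codes)      # "integer": only the codes remain

# factor() then builds brand-new levels from those integers
refactored = factor(codes)
head(levels(refactored))
# [1] "1" "2" "4" "5" "6" "8"
```

By the time factor() is called the original labels are no longer attached to anything, so there is nothing for it to recover.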

Under the hood, factors are integer vectors (of class factor) with an associated character vector that describes the levels (see Patrick Burns’s R Inferno PDF for details). We can deal directly with the levels using this:

linesNA = lines
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] = NA
linesNA
# [1] A    B    <NA> D    E    F    <NA> H...
# Levels: A B D E F H...
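The two parts of a factor, integer codes and level labels, can be looked at separately. This is a small sketch rather than anything from the original email; `labels_by_hand` is a hypothetical name used only here.

```r
lines = factor(LETTERS)

typeof(lines)            # "integer": storage type of the underlying vector
head(as.integer(lines))  # the integer codes
head(levels(lines))      # the labels those codes point to

# Indexing the levels by the codes recovers the labels,
# which matches what as.character() returns for a factor
labels_by_hand = levels(lines)[as.integer(lines)]
identical(labels_by_hand, as.character(lines))
# [1] TRUE
```

Seen this way, editing levels() changes the labels for every element that shares a code, which is why a single assignment to the levels is enough.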

We could operate directly on lines; linesNA is created only to stay consistent with the earlier code. Another way of doing the same would be:

linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'),
                 NA, as.character(lines)))
linesNA
# [1] A    B    <NA> D    E    F    <NA> H...
# Levels: A B D E F H...
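As a sanity check, the two routes can be compared side by side. The sketch below assumes the conversion to character happens before ifelse() sees the values; `toNA`, `byLevels` and `byChar` are illustrative names only.

```r
lines = factor(LETTERS)
toNA = c('C', 'G', 'P')

# Route 1: edit the levels directly
byLevels = lines
levels(byLevels)[levels(byLevels) %in% toNA] = NA

# Route 2: work on the labels as character, then rebuild the factor
byChar = factor(ifelse(lines %in% toNA, NA, as.character(lines)))

# Both routes give the same values and the same remaining levels
identical(as.character(byLevels), as.character(byChar))
identical(levels(byLevels), levels(byChar))
nlevels(byLevels)   # 23: the three levels set to NA are gone
```

The key point in route 2 is that as.character() is applied to the factor itself; applying it to the output of ifelse() would only turn the integer codes into strings.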

I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).

11 thoughts on “R pitfall #3: friggin’ factors”

  • 1999/11/30 at 1:00 pm

    Maybe this is more intuitive (although it is not really different from your approach):

    lines <- factor(LETTERS)

    linesNA <- lines

    levels(linesNA) <- ifelse(levels(linesNA) %in% c('C', 'G', 'P'), NA, levels(lines))

  • 2011/12/16 at 3:15 am

    You can do:

    lines[lines %in% c('C', 'G', 'P')] <- NA

  • 2011/12/16 at 4:02 am

    My latest complaint about factors is this:

    R> factor(letters[1:10])
    [1] a b c d e f g h i j
    Levels: a b c d e f g h i j
    R> nchar(factor(letters[1:10]))
    [1] 1 1 1 1 1 1 1 1 1 2

    • 2011/12/16 at 8:24 am

      Wow! R is counting the number of characters of the internal numeric representation of levels. Devious and nightmarish to debug! I share your pain.

    • 2011/12/16 at 8:21 am

      Thanks for pointing out the exact location. I very much like your writing in the Inferno!

  • 2011/12/22 at 4:54 am

    You know, I have an embarrassing pitfall with the simple function "save". I can't save an R object in a file, because it saves a character string of the object instead of the contents of the object! Then I tried saving the whole session, and it saves a vector of all the objects' names. :facepalm:
    The sad thing is I haven't solved it yet.

    P.S. Moments like this are when I wish I had formal training in R programming.

    • 2011/12/22 at 10:22 am

      It should be straightforward; for example:

      a = c('a', 'b', 'c')
      save(a, file = 'whatever.Rdata')

      However, if you put the object name between quotes—save('a', file = 'whatever.Rdata')—you will get the name, which is not what you want. I hope this helps, Luis.

      • 2011/12/27 at 1:51 am

        Thanks to Rbloggers I solved this "easy" task. The problem is that I did something like this:

        > x.var <- rnorm(100)
        > save(x.var, file="foo")
        > rm(x.var)
        > something <- load("foo")
        > something
        [1] "x.var"

        My mistake was assigning the result of load() to a variable, which is not needed; the line load("foo") on its own is enough. My line of thought was: "It is good to save my models, temporary data and other stuff inside variables, so you can interact with that stuff later. In that case, let's load the R object and put it inside a variable!"
        Maybe I had a weird line of thought…

  • 2012/01/03 at 9:14 am

    Seems that most of the issue here is the idea that factors are both a numeric list, and a set of accompanying labels. This is a powerful representation, but needs to be taken into account when dealing with the factor structure.

    So, when you want the character count of the labels, you have to tell R it is the labels you are thinking about…

    nchar(as.character(factor(letters[1:10])))

