Quantum Forest

notes in a shoebox

R pitfall #3: friggin’ factors

I received an email from one of my students expressing deep frustration with a seemingly simple problem. He had a factor containing names of potato lines and wanted to set some levels to NA. Using simple letters as example names he was baffled by the result of the following code:
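
The listing here is a reconstruction, not necessarily the student's exact code: a minimal sketch that assumes the attempt used ifelse() directly on the factor, with the names lines and linesNA and the levels 'C', 'G' and 'P' taken from elsewhere on this page.

```r
# Hypothetical reconstruction: 26 one-letter "potato lines"
lines <- factor(LETTERS)

# Attempt to blank out three lines with ifelse() applied to the factor itself
linesNA <- ifelse(lines %in% c('C', 'G', 'P'), NA, lines)

linesNA          # integer codes, with NA in positions 3, 7 and 16; no letters
class(linesNA)   # "integer"

# Forcing it back to a factor only turns the codes into labels
levels(factor(linesNA))   # "1" "2" "4" "5" ...
```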

The factor has been converted to numeric and there was no trace of the level names. Even forcing the conversion to be a factor loses the level names. Newbie frustration guaranteed!

Under the hood factors are numerical vectors (of class factor) that have associated character vectors to describe the levels (see Patrick Burns’s R Inferno PDF for details). We can deal directly with the levels using this:
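
A minimal sketch of working on the levels rather than on the values, assuming the same lines factor of capital letters and the same three levels to blank out (the exact original listing may have differed):

```r
lines <- factor(LETTERS)

# Work on a copy so the original factor stays untouched
linesNA <- lines

# Setting a level to NA turns every element with that level into NA
# and removes the level from the factor
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] <- NA

linesNA            # still a factor; C, G and P now show as <NA>
nlevels(linesNA)   # 23
```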

We could operate directly on lines without creating linesNA; the copy is there only to stay consistent with the previous code. Another way of doing the same would be:
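
One possible alternative, offered as a sketch and not necessarily the one originally shown here, is to go through a character vector first so that ifelse() never sees the integer codes. The name linesChar is made up for the illustration.

```r
lines <- factor(LETTERS)

# Convert to character so we handle labels, not internal codes
linesChar <- as.character(lines)

# Replace the unwanted names with NA, then rebuild the factor;
# factor() leaves NA out of the levels by default
linesNA <- factor(ifelse(linesChar %in% c('C', 'G', 'P'), NA, linesChar))

nlevels(linesNA)   # 23
```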

I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).


  1. Maybe this is more intuitive (also it is not really different to your approach):

    lines <- factor(LETTERS)
    linesNA <- lines
    levels(linesNA) <- ifelse(levels(linesNA) %in% c('C', 'G', 'P'), NA, levels(lines))

  2. You can do:

    lines[lines %in% c('C', 'G', 'P')] <- NA

  3. Kevin Wright

    2011/12/16 at 4:02 am

    My latest complaint about factors is this:

    R> factor(letters[1:10])
    [1] a b c d e f g h i j
    Levels: a b c d e f g h i j
    R> nchar(factor(letters[1:10]))
    [1] 1 1 1 1 1 1 1 1 1 2

    • Luis

      2011/12/16 at 8:24 am

      Wow! R is counting the number of characters of the internal numeric representation of levels. Devious and nightmarish to debug! I share your pain.

  4. Several stumbling blocks with factors are shown at the beginning of Circle 8.2 of 'The R Inferno' http://www.burns-stat.com/pages/Tutor/R_inferno.p

  5. You know, I have an embarrassing pitfall with the simple function "save". I can't save an R object in a file, because it saves a character string of the object instead of the contents of the object! Then I tried saving the whole session, and it saves a vector of all the objects' names. :facepalm:
    The sad thing is I haven't solved it yet.

    P.S. Moments like this are when I wish I had formal training in R programming.

    • It should be straightforward; for example:

      a = c('a', 'b', 'c')
      save(a, file = 'whatever.Rdata')

      Note that save() accepts the object name either bare or in quotes, so save('a', file = 'whatever.Rdata') stores the same object rather than the string 'a'. I hope this helps, Luis.

      • Thanks to Rbloggers I solved this "easy" task. The problem is that I did something like this:

        > x.var <- rnorm(100)
        > save(x.var, file="foo")
        > rm(x.var)
        > something <- load("foo")
        > something
        [1] "x.var"

        My mistake was assigning the result to a variable, which is not needed; the line load("foo") on its own is enough. My line of thought was: "It is good to keep my models, temporary data and other stuff in variables so I can interact with them later, so let's load the R object and put it inside a variable!"
        Maybe I had a weird line of thought…

  6. It seems that most of the issue here comes from factors being both a vector of integer codes and a set of accompanying labels. This is a powerful representation, but it needs to be taken into account when working with factors.

    So, when you want the character count of the labels, you have to tell R it is the labels you are thinking about…
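
The point in comment 6 can be sketched as follows, assuming a small made-up factor called fruit (not from the post): ask for the labels explicitly, either per element with as.character() or per level with levels(). As far as I know, recent versions of R refuse nchar() on a factor with an error, instead of returning the counts Kevin shows.

```r
# Hypothetical example data, not from the post
fruit <- factor(c('kiwi', 'fig', 'kiwi'))

# Character count of each element's label
nchar(as.character(fruit))                   # 4 3 4

# Character count of each level (levels are sorted: "fig", "kiwi")
nchar(levels(fruit))                         # 3 4

# Same per-element result, computed once per level and indexed by the codes
nchar(levels(fruit))[as.integer(fruit)]      # 4 3 4
```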



© 2015 Quantum Forest
