Evolving notes, images and sounds by Luis Apiolaza

R pitfall #3: friggin’ factors

I received an email from one of my students expressing deep frustration with a seemingly simple problem. He had a factor containing names of potato lines and wanted to set some levels to NA. Using simple letters as example names, he was baffled by the result of the following code:

lines <- factor(LETTERS)
lines
# [1] A B C D E F G H...
# Levels: A B C D E F G H...

linesNA <- ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
linesNA
#  [1]  1  2 NA  4  5  6 NA  8...

The factor has been converted to numeric, with no trace of the level names. Even forcing the result back into a factor loses the level names. Newbie frustration guaranteed!

linesNA <- factor(ifelse(lines %in% c('C', 'G', 'P'), NA, lines))
linesNA
# [1] 1    2    <NA> 4    5    6    <NA> 8...
# Levels: 1 2 4 5 6 8...
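
One way to see where those numbers come from is to look at the two pieces a factor carries. This is only an illustrative sketch using the same lines factor; codes and labels are names introduced here for the example.

```r
lines <- factor(LETTERS)

# The integer codes stored for each element
codes <- as.integer(lines)
head(codes)
# [1] 1 2 3 4 5 6

# The labels those codes point to
labels <- levels(lines)
head(labels)
# [1] "A" "B" "C" "D" "E" "F"

# ifelse() keeps only the codes and drops the attributes,
# so the labels have to be looked up again by hand
head(labels[codes])
# [1] "A" "B" "C" "D" "E" "F"
```

The numbers printed by the ifelse() call are these codes, which is why the letters disappear.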

Under the hood, factors are integer vectors (of class factor) with an associated character vector describing the levels (see Patrick Burns's R Inferno PDF for details). We can deal directly with the levels using this:

linesNA <- lines
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] <- NA
linesNA
# [1] A    B    <NA> D    E    F    <NA> H...
# Levels: A B D E F H...

We could operate directly on lines; linesNA is there only to maintain consistency with the previous code. Another way of doing the same would be:

linesNA <- factor(ifelse(lines %in% c('C', 'G', 'P'), NA, as.character(lines)))
linesNA
# [1] A    B    <NA> D    E    F    <NA> H...
# Levels: A B D E F H...
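
As a rough check, the two routes can be compared side by side. The sketch below assumes the character-based version converts lines with as.character() before ifelse() sees the values, so that ifelse() works on labels rather than codes; to_na, by_levels and by_labels are names made up for this example.

```r
lines <- factor(LETTERS)
to_na <- c('C', 'G', 'P')

# Route 1: set the unwanted levels to NA
by_levels <- lines
levels(by_levels)[levels(by_levels) %in% to_na] <- NA

# Route 2: work on the labels as character, then rebuild the factor
by_labels <- factor(ifelse(lines %in% to_na, NA, as.character(lines)))

# Compare the values and the remaining levels
identical(as.character(by_levels), as.character(by_labels))
identical(levels(by_levels), levels(by_labels))
```

Both comparisons should come back TRUE, with C, G and P gone from the levels in each case.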

I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).

13 Comments

  1. Jan

    Maybe this is more intuitive (although it is not really different from your approach):

    lines <- factor(LETTERS)
    linesNA <- lines
    levels(linesNA) <- ifelse(levels(linesNA) %in% c('C', 'G', 'P'), NA, levels(lines))

  2. Wojciech

    You can do:

    lines[lines %in% c('C', 'G', 'P')] <- NA
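
One detail worth checking with this kind of direct assignment is what happens to the levels. The sketch below suggests the elements become NA while C, G and P stay in the list of levels until droplevels() is applied; trimmed is a name introduced only for this illustration.

```r
lines <- factor(LETTERS)
lines[lines %in% c('C', 'G', 'P')] <- NA

sum(is.na(lines))   # three elements are now missing
nlevels(lines)      # the levels C, G and P are still listed

# droplevels() discards levels that no longer occur in the data
trimmed <- droplevels(lines)
nlevels(trimmed)
```

Whether the leftover levels matter depends on what comes next: tables and model fits will still show the unused levels unless they are dropped.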

  3. Kevin Wright

    My latest complaint about factors is this:

    R> factor(letters[1:10])
    [1] a b c d e f g h i j
    Levels: a b c d e f g h i j
    R> nchar(factor(letters[1:10]))
    [1] 1 1 1 1 1 1 1 1 1 2

    • Luis

      Wow! R is counting the number of characters of the internal numeric representation of levels. Devious and nightmarish to debug! I share your pain.
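
The result above can be reproduced step by step, as a sketch of what is going on. The current documentation for nchar() states that giving it a factor is an error, so this example does not call nchar() on the factor itself; f and codes are names chosen for the illustration.

```r
f <- factor(letters[1:10])

# The internal codes are the integers 1 to 10
codes <- as.integer(f)

# Counting the characters of those codes gives the odd-looking result
nchar(as.character(codes))
# [1] 1 1 1 1 1 1 1 1 1 2

# Counting the characters of the labels gives what was probably intended
nchar(as.character(f))
# [1] 1 1 1 1 1 1 1 1 1 1
```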

  4. Pat Burns

    Several stumbling blocks with factors are shown at the beginning of Circle 8.2 of 'The R Inferno' http://www.burns-stat.com/pages/Tutor/R_inferno.p

    • Luis

      Thanks for pointing out the exact location. I like very much your writing in the Inferno!

  5. edivimo

    You know, I have an embarrassing pitfall with the simple function "save". I can't save an R object in a file, because it saves a character string of the object instead of the contents of the object! Then I tried saving the whole session, and it saves a vector of all the objects' names. :facepalm:
    The sad thing is I haven't solved it yet.

    P.S. Moments like this are when I wish I had formal training in R programming.

    • Luis

      It should be straightforward; for example:

      a = c('a', 'b', 'c')
      save(a, file = 'whatever.Rdata')

      However, if you put the object name between quotes—save('a', file = 'whatever.Rdata')—you will get the name, which is not what you want. I hope this helps, Luis.

      • edivimo

        Thanks to Rbloggers I solved this "easy" task. The problem is that I did something like this:

        > x.var <- rnorm(100)
        > save(x.var, file="foo")
        > rm(x.var)
        > something <- load("foo")
        > something
        [1] "x.var"

        My mistake was assigning the result of load() to a variable, which is not needed; the line load("foo") on its own is enough. My line of thought was: "It is good to save my models, temporary data and other stuff inside variables so I can interact with them later, so let's load the R object and put it inside a variable!"
        Maybe I had a weird line of thought…
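
For the workflow described in this thread, where the loaded object should end up in a variable of your choosing, saveRDS() and readRDS() may be a closer fit than save() and load(). The following is a minimal sketch; the temporary file names and the variables restored_names and something are assumptions made for the example.

```r
x.var <- rnorm(100)

# save()/load() restore the object under its original name;
# load() itself returns only the names of what it restored
rdata_file <- tempfile(fileext = ".RData")
save(x.var, file = rdata_file)
restored_names <- load(rdata_file)
restored_names
# [1] "x.var"

# saveRDS()/readRDS() store a single object without its name,
# so the result can be assigned to any variable
rds_file <- tempfile(fileext = ".rds")
saveRDS(x.var, file = rds_file)
something <- readRDS(rds_file)
identical(something, x.var)
```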

  6. Tim Bates

    Seems that most of the issue here is the idea that factors are both a numeric list, and a set of accompanying labels. This is a powerful representation, but needs to be taken into account when dealing with the factor structure.

    So, when you want the character count of the labels, you have to tell R it is the labels you are thinking about…

    nchar(as.character(factor(letters[1:10])))

  7. Joel

    When I first learned R (by myself) I had so many factors created, usually because of the base R data.frame()’s automystically changing character vectors into a factor type. So frustrating! Simply because I didn’t know to use stringsAsFactors = FALSE. My solution was to first thing convert the factors by using as.character()

    Then the dplyr package came along and it was a revelation in simplicity.

    • Luis

      In the old times (this post is from 12 years ago), options(stringsAsFactors = FALSE) at the beginning of a script was a solution to what now is the default. dplyr is a great package, but it’s possible to write great, clean code in base R as well. For example, look at this post and parts 2 & 3.
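
A small sketch of the behaviour discussed in this thread: since R 4.0.0 data.frame() keeps character vectors as they are by default, and passing stringsAsFactors explicitly gives the same result on older versions too. The data frames and column name here are invented for the example.

```r
# Passing stringsAsFactors explicitly makes the behaviour independent
# of the default in the installed R version
df_chr <- data.frame(name = c('a', 'b', 'c'), stringsAsFactors = FALSE)
df_fct <- data.frame(name = c('a', 'b', 'c'), stringsAsFactors = TRUE)

class(df_chr$name)
# [1] "character"
class(df_fct$name)
# [1] "factor"

# Converting an existing factor column back to character
name_chr <- as.character(df_fct$name)
class(name_chr)
# [1] "character"
```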

© 2024 Palimpsest
