An R wish list for 2012

I expect there will be many reviews and wish lists for R this year, with many of them focusing on either running speed or dealing with large data sets. However, most issues that I would like to see tackled in R next year are not technical but, for lack of a better word, social.

Many users will first encounter R through the r-project.org website. This site is begging for a redesign, which could start by getting rid of the frames (which have sucked for a long time). At a minimum, this would make pages much easier to bookmark.

The way we find, install and refer to packages could be better; both the main site and alternatives (like crantastic) do not help much to answer the question “Which package am I supposed to download?”. While folksonomies are cool (like in crantastic), they are far from sufficient and some level of curation (at the topic level, for example) could work much better. Sort of an improved version of Task Views with user comments, tags and indication of popularity. Tangent: if old users want packages to be called packages instead of libraries (as in most other software) the use of library() does not help.
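To make the tangent concrete, here is a minimal sketch that only assumes the stats package that ships with R (the vector x is made-up data). The function called library() attaches a package, while .libPaths() lists the libraries, that is, the directories where packages are installed.

```r
x <- c(3, 1, 2)  # made-up data for illustration

# Call a function through its package prefix, without attaching anything
m_prefixed <- stats::median(x)

# library() attaches a *package* (stats is attached by default anyway)
library(stats)
m_attached <- median(x)

# .libPaths() returns the *libraries*: directories that hold installed packages
lib_dirs <- .libPaths()
```

The pkg::fun form sidesteps the naming question, since nothing called library() appears in the call.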

Help! Usability entry-barriers.

Help, I need somebody

Let’s combine a few issues for the second encounter of R users with reality. Most new (and not so new) users will require help, which involves easy access to mailing lists, because they contain the richest set of information in the R world. However, a good proportion of users will have very little idea of even the existence of the R mailing lists, particularly younger people for whom email is not the primary form of communication. Old users always want people to search for answers in previous messages to avoid repeating the same question over and over again. Solution: put a prominent search field, or a link to a searchable R-help list, on the main page. The other mailing lists tend to be of secondary importance to newbies.

The third point should be consistency and meeting user expectations. In a previous post I discussed an example of broken expectations when dealing with factors. The user wants to deal with levels but by default R deals with the underlying numerical coding. Radford Neal presents other examples on reversing sequences and using curly brackets to speed up computation.
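A minimal sketch of the kind of surprise I mean, using made-up data rather than the example from the earlier post: when numbers end up stored as a factor, converting back gives the internal codes unless you go through the labels.

```r
# Made-up data: numbers that ended up stored as factor levels
f <- factor(c("10", "20", "10", "30"))

# What a newcomer tries first: returns the internal integer codes
codes <- as.numeric(f)

# Going through the labels recovers the numbers the user meant
values <- as.numeric(as.character(f))

# Equivalent idiom: convert the levels once, then index by the factor
values_alt <- as.numeric(levels(f))[f]
```

Here codes holds 1, 2, 1, 3 (positions in the sorted levels), while values and values_alt hold 10, 20, 10, 30.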

R immigrants

Finally, be nice to newbies. Newbies are the functional equivalent of immigrants in a society (and I’m one myself). Immigrants induce dynamism in a society (and provide tasty alternatives to bland British food), push boundaries and, sometimes, challenge our beliefs. Newbies will keep the R community on its toes, forcing it to evolve and to be easier to use. Unless… unless we turn them away.

See you on the other side of the calendar.

P.S. Yes! Unless refers to Dr Seuss’s “Unless someone like you cares a whole awful lot, nothing is going to get better. It’s not.” in The Lorax.

P.S.2 I hope this text does not feel overly negative. R is the best thing since sliced bread, but it could be even better.

19 thoughts on “An R wish list for 2012”

  • 2011/12/27 at 2:30 am

    The R Project web pages are done with frames because they are served as a bunch of static pages, managed by the developers via SVN, and so frames are the only solution under those constraints for having a set menu on the left and content. Otherwise, without a server-side framework, you end up having to copy your common code (headers, menus) into every .html file.

    I suspect a halfway-house option would be to use a templating engine that built a static site from templates. This would then not require any frames, server-side technology, and could still be maintained via SVN and statically served and mirrored.

    BUT that would require R-core to change their working practices. Good luck with that :)

    However I do think library(foo) is going away shortly, to be replaced by 'use(foo)'.

  • 2011/12/27 at 5:54 am

    - Agreed, the website could do with an update. I don't think it reflects the sophistication of the R software, or the community.
    - One of the deficiencies in CRAN, to my mind, is lack of accounting information for package downloads (across all mirrors, for all package versions, etc).
    - Wiki might be a good alternative for task views (though this would require some sort of server-side scripting).

  • 2011/12/27 at 6:54 am

    As a newbie myself, I strongly agree with your "Help, I need somebody" section. It's hard to find a place that explains R and its data analysis paradigm to the total newbie (Why do you complicate yourself by being a programming language, R?! Answer that!).
    A statistics rookie doesn't need a mailing list with statistics advice, but maybe a pointer saying "If you need to learn statistics, we can't teach you here; it's too complex and you will be better served by a good book or teacher, but we can recommend this book, which will teach you with the help of R".
    Another good pointer would be about the importance of learning data manipulation with R. I'm too lazy to search, but I think the R tutorial and other introductory texts start right away with the data manipulation functions, but don't explain why these functions and tricks are needed. And they are necessary in every data analysis task.

  • 2011/12/27 at 2:30 pm

    Thanks for thinking of us newbies. My suggestion would be a series of "R for X" resource lists, where X is a domain in which R is used. In my case, I came to R via an interest in social network analysis and Pajek. Based on limited encounters to date, I would highly recommend the Princeton getting started with R tutorial (http://data.princeton.edu/R/gettingStarted.html), the Stanford labs (http://sna.stanford.edu/rlabs.php), Paul Teetor's R Cookbook and Norman Matloff's The Art of R Programming.

      • 2011/12/28 at 8:48 am

        I mentioned them in the post "Sort of an improved version of Task Views with user comments, tags and indication of popularity".

  • 2011/12/28 at 2:52 am

    R appears to have an Achilles' heel. As I've delved into the system, the reliance on recursion is evident. What's also evident is that, according to R sites when the question comes up, the advice is to be careful with recursion and R will never support tail call optimization. Having used Prolog, and a few other recursion centered languages, I will say with certainty that it is a black and white situation: a recursive language falls down if it doesn't supply tail call to the coder. So, bite the bullet. Either remove recursion from the design paradigm, or build in tail call.

  • 2011/12/28 at 3:15 am

    You have some good points, but then push too far.

    For example, "library" is not universal (Matlab, for example, uses toolboxes and packages). And how else do you distinguish the routines in the package (imported via 'library') and the data (accessed, remarkably enough, with 'data')? An R package contains more (including PDFs) than what I'd think of as a 'library'.

    Talking about reversing and curly braces to speed things up is also a complaint too far. All languages have quirks or advanced usage that will stymie a newcomer: you couldn't name a single (useful) language where an advanced user couldn't dazzle you.

    Having a rating/comment system for packages (with dates) would be useful. You'd still have to wade through a bunch of comments to figure out which wavelet package, for example, would work best for your needs. The key is to have a way to indicate packages that really aren't maintained or are buggy versus the alternatives that exist for different tastes and tasks.

    • 2011/12/28 at 3:56 am

      Sorry to reply to myself, but wanted to expand on a couple of thoughts.

      First, in many languages/tools, you really don't have a choice on libraries/packages. The answer to "I want to do a neural net for …" is, "Load the Neural Net library." (Or, "Buy the Neural Net library.") In R, there are usually two or three packages that do what you want. They usually have different approaches, so there isn't one answer: you have to see which works best for you. An App-Store-like rating/review system would be helpful, but there would still be no simple answers.

      Second, the 'if else' issue is non-intuitive, true. If you get help on ifelse (?ifelse), you'll see a Warning:

      "Warning

      "The mode of the result may depend on the value of test (see the examples), and the class attribute (see oldClass) of the result is taken from test and may be inappropriate for the values selected from yes and no.

      "Sometimes it is better to use a construction such as (tmp <- yes; tmp[!test] <- no[!test]; tmp), possibly extended to handle missing values in test."

      And as you think about it, you'd need some non-intuitive magic in ifelse to make this all work the way you think it should. How, exactly, would listNA get its type? Should ifelse take the type from its second argument (in this case, 'list')? What if the logic were reversed and NA were the second argument? Should it have special logic for this single case?

      It's frustrating, but no different from any other dynamic language.
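      A small sketch of the behaviour that Warning describes, using made-up dates rather than the example from the post: ifelse() drops the Date class, while the construction suggested in the help page keeps it.

      ```r
      d <- as.Date(c("2011-12-27", "2011-12-28", "2011-12-30"))  # made-up data
      test <- c(TRUE, FALSE, TRUE)

      # ifelse() builds its result from test, so the Date class is lost
      via_ifelse <- ifelse(test, d, as.Date(NA))

      # Construction from ?ifelse: start from yes, overwrite where test is FALSE
      no <- rep(as.Date(NA), length(d))
      tmp <- d
      tmp[!test] <- no[!test]
      ```

      Here via_ifelse comes back as plain numbers (days since 1970-01-01), while tmp is still a Date vector with NA in the second position.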

      Third, I thought of more things that go into a package. Of course, help, which is probably expected in any interactive language where you can load libraries, but also demos, vignettes (PDF tutorials), multiple datasets, functions, etc. Also, I believe that R uses 'library' to refer to a collection of packages, and you use the 'library' function to load a package from your library.

      Sorry to sound like such a fanboy, but I really do think that your good points are diminished with these other points which are either not unique in the programming world (and pretty much impossible to eliminate), or are the kinds of misunderstandings that you have when you're a newbie to any new topic.

      • 2011/12/28 at 9:07 am

        I'm familiar with other interpreted languages (having used Python, Matlab, Mathematica and a few other obscure things) and R is not the only one that has several options to achieve a task. Again, some of the problems can be reduced with a better choice of defaults. If someone tests a comparison with factors, which is more likely: that their interest is in the names of the levels or in the underlying numerical representation? I would guess the former, because that's why I gave names to the bloody levels.

        No worries about fanboys. As I point out in the P.S. 2 of the post, R is great but we can make it even better.

    • 2011/12/28 at 8:57 am

      My point about library() is the naming. If you want people to call packages, well, packages, then the system should use a function called packages() to load them. I like that Python lets you use things like "import package" and then a prefix to call functions, or "from package import *", which loads the whole namespace, etc.

      Sure, all languages have quirky details; however, my point is about the defaults. Why is the default the slow option? It reminds me of my first PC, which used to have a "turbo" button. I don't know anybody who ever wanted to run the computer more slowly…

      • 2011/12/28 at 9:34 am

        I think you're arguing too far on these issues instead of sticking to the important issues.

        Yes, it's unfortunate that R uses "library" to load a "package". Too late to change that now, and it's not like Python uses "module" to import modules (it uses "import"). Also, you'll note that Python also has libraries, which are collections of modules, just as R libraries are collections of packages. (And you will find people talking about "importing libraries" if you search.) Further note that an R package can include a LOT more than a Python module, so "import" might be misleading in R.

        R does not "default to slow". Some advanced R programmers found implementation quirks that allow them to speed things up with odd constructs. Again, the same could be said of most languages.

        R's help system, as you say, needs some serious work. It actually has some very advanced features, but it's not useful when you don't know how to say what you want. There's no "I'm looking for a way to scramble some numbers" kind of help, or "What kinds of control structures does R have?" kind of help. It's all help for a specific package or a specific function within a package. (With a fairly nice fuzzy matching search, but still you really have to be in the ballpark to get the help you need.) Focus your fire on the kind of issue that 1) is a large hindrance to beginners, and 2) could be changed without breaking a lot of other references, tutorials, etc.

  • 2011/12/28 at 7:51 am

    I want to help redesign the R website.
    Please advise me on how to approach this.

  • 2011/12/29 at 2:25 am

    Another wish: the ability to pay for R membership or to donate with PayPal. This would get many more donations and memberships.

  • 2011/12/30 at 6:10 am

    I'm surprised that no one has pointed to http://stackoverflow.com/ as an alternative to the R-help lists. I wish more people would go there. It provides good searchable questions and answers. Looking in the [r] tag will give you R-related questions. There really is not a good way to traverse the mailing lists looking for answers. Stack Overflow provides a far superior alternative for questions and answers.

    • 2011/12/30 at 7:04 am

      Hi Andrew,

      As a gross generalization, stackoverflow readers seem to be much more competent with programming than with statistics. I do agree that the look of the site is pretty good, but having never used it I find the threading/editing mechanism a bit confusing. I have found some very good threads for Python though…

      If we could move the whole R-help email list to a system like that, i.e. keeping the current users of the list in the loop, that would be great.

      • 2012/01/05 at 2:10 am

        Fabulous stuff.

        From this newbie I would say any mailing list is not the place to start. Quick-R I think is much more accessible. Then move to mailing lists (via google and seeing whichever relevant answer bubbles up to the top, whether from SO or Rhelp or others.)

      • 2012/01/24 at 6:44 am

        For stats, crossvalidated.com is your savior. Questions get moved from CV to SO (and vice versa) all the time. There was a debate about starting an R-stackexchange, but the idea’s on hold for the time being.
