An R wish list for 2012

I expect there will be many reviews and wish lists for R this year, with many of them focusing on either running speed or dealing with large data sets. However, most issues that I would like to see tackled in R next year are not technical but, for lack of a better word, social.

Many users will first encounter R through the r-project.org website. This site is begging for a redesign, which could start by getting rid of the frames (which have sucked for a long time now). At a minimum, this would make pages much easier to bookmark.

The way we find, install and refer to packages could be better; neither the main site nor alternatives (like crantastic) do much to answer the question “Which package am I supposed to download?”. While folksonomies (as in crantastic) are cool, they are far from sufficient, and some level of curation (at the topic level, for example) would work much better: something like an improved version of Task Views, with user comments, tags and an indication of popularity. Tangent: if old users want packages to be called packages instead of libraries (as in most other software), the use of library() does not help.

Help! Usability entry-barriers.

Help, I need somebody

Let’s combine a few issues for the second encounter of R users with reality. Most new (and not so new) users will require help, which means easy access to the mailing lists, because they contain the richest set of information in the R world. However, a good proportion of users will have very little idea that the R mailing lists even exist; particularly younger people, for whom email is not the primary form of communication. Old hands always want people to search previous messages for answers, to avoid repeating the same question over and over. Solution: put a prominent search field, or a link to a searchable R-help archive, on the main page. The other mailing lists tend to be of secondary importance to newbies.

The third point should be consistency and meeting user expectations. In a previous post I discussed an example of broken expectations when dealing with factors: the user wants to deal with levels, but by default R deals with the underlying numerical coding. Radford Neal presents other examples on reversing sequences and on using curly brackets to speed up computation.
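
As a flavor of the kind of surprise he describes, here is a minimal sketch of the sequence gotcha (my own toy example, not Neal’s code): 1:n happily counts downwards when n is zero, while seq_len() does what most loops actually need.

[sourcecode lang="r"]
n = 0

1:n        # counts down: returns c(1, 0), almost never what a loop intends
# [1] 1 0

seq_len(n) # returns integer(0), the empty sequence most users expect
# integer(0)

rev(1:5)   # reversing a sequence is fine only when the endpoints are what you think
# [1] 5 4 3 2 1
[/sourcecode]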

R immigrants

Finally, be nice to newbies. Newbies are the functional equivalent of immigrants in a society (and I’m one myself). Immigrants add dynamism to a society (and provide tasty alternatives to bland British food), push boundaries and, sometimes, challenge our beliefs. Newbies will keep the R community on its toes, forcing it to evolve and to be easier to use. Unless… unless we turn them away.

See you on the other side of the calendar.

P.S. Yes! Unless refers to Dr Seuss’s “Unless someone like you cares a whole awful lot, nothing is going to get better. It’s not.” in The Lorax.

P.S.2 I hope this text does not feel overly negative. R is the best thing since sliced bread, but it could be even better.

Shaken

We were having a beautiful day sightseeing around Banks Peninsula when another quake hit Christchurch: magnitude 5.8 at 1:58 pm local time. This has been followed by numerous aftershocks, including a 6.0 at 3:18 pm. The MQZ quake drum looked like this:


MQZ quake drum, near Christchurch.

Tomorrow is another day (probably filled with aftershocks).

The introduction of a new system

It must be remembered that there is nothing more difficult to plan, more doubtful of success, nor more dangerous to manage, than the creation of a new system. For the initiator has the enmity of all who would profit by the preservation of the old institutions and merely lukewarm defenders in those who would gain by the new ones.

—Niccolò Machiavelli, The Prince, Chapter 6.

E debbasi considerare, come non è cosa più difficile a trattare, né più dubia a riuscire, né più pericolosa a maneggiare, che farsi capo ad introdurre nuovi ordini; perché lo introduttore ha per nimici tutti quelli che degli ordini vecchi fanno bene, e ha tepidi defensori tutti quelli che delli ordini nuovi farebbono bene.

—Niccolò Machiavelli, Il Principe, Capitolo VI.

I’m a big fan of this quote, because it applies to so many new endeavors, including the introduction of new topics in a university curriculum.

First impressions of Doing Bayesian Data Analysis

About a month ago I was discussing the approach that I would like to see in introductory Bayesian statistics books. In that post I mentioned a PDF copy of Doing Bayesian Data Analysis by John K. Kruschke and that I had ordered the book. Well, recently a parcel was waiting in my office with a spanking new, real paper copy. A few days are not enough for a ‘proper’ review, but I would like to discuss my first impressions, as they could be helpful for someone out there.

If I were looking for a single word to define the book it would be meaty; not in the “having the flavor or smell of meat” sense of the word, as pointed out by Newton, but on the conceptual side. Kruschke has clearly put a lot of thought into how to draw a generic student, with little background on the topic, into thinking about statistical concepts. In addition, Kruschke clearly loves language and has an interesting, sometimes odd, sense of humor; anyway, who am I to comment on someone else’s strange sense of humor?


Meaty like a ribeye steak slowly cooked at Quantum Forest's HQ in Christchurch.

One difference between the dodgy PDF copy and the actual book is the use of color (three shades of blue) to highlight section headers and graphical content. In general I am not a big fan of lots of colors and contentless pictures, as used in modern calculus and physics undergraduate books. In this case, the effect is pleasant and makes browsing and reading the book more accessible. Most graphics really drive a point and support the written material, although there are exceptions in my opinion, like some faux 3D graphs (Figures 17.2 and 17.3, under multiple linear regression) that I find somewhat confusing.

The book’s website contains PDF versions of the table of contents and chapter 1, which are a good way to whet your appetite. The book covers enough material to be the sole text for an introductory Bayesian statistics course, either starting from scratch or as a transition from a previous course with a frequentist approach. There are plenty of exercises, a solutions manual and a good deal of R code available.

The mere existence of this book prompts the question: can we afford not to introduce students to a Bayesian approach to statistics? In turn this sparks another question: how do we convince departments to de-emphasize the old way? (The quote above on introducing a new system is extremely relevant here.)

Verdict: if you are looking for a really introductory text, this is hands down the best choice. The material goes from the ‘OMG do I need to learn stats?’ level to multiple linear regression, ANOVA, hierarchical models and GLMs.


Dear Dr Kruschke, we really bought a copy. Promise! Newton and Luis (Photo: Orlando).

P.S. I’m still using a combination of books, including Kruschke’s and Marin and Robert’s, for my own learning process.
P.S.2 There is a lot to be said for a book that includes puppies on its cover and references to A Prairie Home Companion on its first page (the show is sometimes re-broadcast down under by Radio New Zealand).

Hiking

Stitched pictures of the view, courtesy of Hugin and Skitch (Photo: Luis).

Traveling with friends and family, view from Green Lake towards Lake Tarawera, North Island, New Zealand. Time for walks, nice meals and setting R aside (although I have a few drafts for the blog soon to be published). Now back in the South Island we prepare for a great weekend with more walks, food and friends. Merry Christmas.

R pitfall #3: friggin’ factors

I received an email from one of my students expressing deep frustration with a seemingly simple problem. He had a factor containing names of potato lines and wanted to set some levels to NA. Using simple letters as example names, he was baffled by the result of the following code:

[sourcecode lang="r"]
lines = factor(LETTERS)
lines
# [1] A B C D E F G H…
# Levels: A B C D E F G H…

linesNA = ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
linesNA
# [1] 1 2 NA 4 5 6 NA 8…
[/sourcecode]

The factor has been converted to numeric and there is no trace of the level names. Even forcing the conversion back to a factor loses the level names. Newbie frustration guaranteed!

[sourcecode lang="r"]
linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'), NA, lines))
linesNA
# [1] 1 2 <NA> 4 5 6 <NA> 8…
# Levels: 1 2 4 5 6 8…
[/sourcecode]

Under the hood factors are numerical vectors (of class factor) that have associated character vectors to describe the levels (see Patrick Burns’s R Inferno PDF for details). We can deal directly with the levels using this:

[sourcecode lang="r"]
linesNA = lines
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] = NA
linesNA
# [1] A B <NA> D E F <NA> H…
# Levels: A B D E F H…
[/sourcecode]

We could operate directly on lines (without creating linesNA), which is there to maintain consistency with the previous code. Another way of doing the same would be:

[sourcecode lang="r"]
linesNA = factor(as.character(ifelse(lines %in%
                              c('C', 'G', 'P'), NA, lines)))
linesNA
# [1] A B <NA> D E F <NA> H…
# Levels: A B D E F H…
[/sourcecode]
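
A third route (a minimal sketch of my own, not part of the original exchange, and assuming a reasonably recent R that includes droplevels()) is to assign NA directly and then drop the unused levels:

[sourcecode lang="r"]
linesNA = lines
linesNA[linesNA %in% c('C', 'G', 'P')] = NA  # NA is always a valid value for a factor
linesNA = droplevels(linesNA)                # remove the now-unused levels C, G and P
linesNA
# [1] A B <NA> D E F <NA> H…
# Levels: A B D E F H…
[/sourcecode]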

I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).

Tall big data, wide big data

After attending two one-day workshops last week, I spent most days paying attention to (well, at least listening to) presentations at this biostatistics conference. Most presenters were R users, although Genstat, Matlab and SAS fans were also present, and not once did I hear “I can’t deal with the current size of my data sets”. However, there were some complaints about the speed of R, particularly when dealing with simulations or some genomic analyses.

Some people worried about the size of coming datasets; nevertheless, that worry cut across statistical packages or, more precisely, it went beyond statistical software. How will we even be able to store the data from something like the Square Kilometre Array, let alone analyze it?


In a previous post I asked if we needed to actually deal with ‘big data’ in R, and my answer was probably not or, better, at least not directly. I still think that it is a valid, although incomplete, opinion. In many statistical analyses we can think of n (the number of observations) and p (the number of variables per observation). In most cases, particularly when people refer to big data, n >> p. Thus, we may have 100 million people but only 10 potential predictors: tall data. In contrast, we may have only 1,000 individuals, but with 50,000 points each coming from near-infrared spectrometry, or information from 250,000 SNPs (a type of molecular marker): wide data. Both types of data will keep on growing, but they are challenging in different ways.

In a totally generalizing, unfair and simplistic way I will state that dealing with wide data is more difficult (and potentially more interesting) than dealing with tall data, at least from a modeling perspective. As the t-shirt says: sampling is not a crime, and it should work quite well with simpler models and large datasets. In contrast, sampling to fit wide data may not work at all.

Algorithms. Clever algorithms are what we need in a first stage. For example, we can fit linear mixed models to a tall dataset with ten million records, or a multivariate mixed model with 60 responses, using ASReml-R. Wide datasets are often approached using Bayesian inference, but MCMC gets slooow when dealing with thousands of predictors; we need other, faster approximations to the posterior distributions.
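
To make the ‘tall’ case concrete, here is a minimal sketch with simulated data; it uses the free lme4 package rather than ASReml-R (so the syntax below is lme4’s, not ASReml-R’s), and the dataset size and variable names are made up:

[sourcecode lang="r"]
library(lme4)

# Simulated tall data: many records, very few predictors
set.seed(1)
n = 100000                                   # bump this up as far as your RAM allows
n.blocks = 500
dat = data.frame(block = factor(sample(1:n.blocks, n, replace = TRUE)),
                 x = rnorm(n))

block.effect = rnorm(n.blocks, sd = 0.5)     # true random intercepts
dat$y = 2 + 0.5*dat$x + block.effect[as.integer(dat$block)] + rnorm(n)

# Linear mixed model: fixed slope for x, random intercept per block
m = lmer(y ~ x + (1 | block), data = dat)
summary(m)
[/sourcecode]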

This post may not be totally coherent, but it keeps the conversation going. My excuse? I was watching Be Kind Rewind while writing it.

R, academia and the democratization of statistics

I am not a statistician but I use statistics, teach some statistics and write about applications of statistics in biological problems.

Last week I was at this biostatistics conference, talking with a Ph.D. student who was surprised by this situation because I didn’t have any statistical training. I corrected: “any formal training”. On the first day one of the invited speakers was musing about the growing number of “amateurs” using statistics (many times wrongly) and about what biostatisticians could offer as professional value-adding. Yes, he was talking about people like me spoiling the party.

Twenty years ago it was more difficult to say “cool, I have some data and I will run some statistical analyses” because (and you can easily see where I am going here) access to statistical software was difficult. You were among the lucky ones if you were based at a university or a large company, because you had access to SAS, SPSS, MINITAB, etc. However, you were out of luck outside of these environments, because there was no way to easily afford a personal licence, not for a hobby at least. This greatly limited the pool of people that could afford to muck around with stats.

Gratuitous picture: spiral in Sydney

Enter R and other free (sensu gratis) software that allowed us to skip the specialist, skip the university or the large organization. Do you need a formal degree in statistics to start running analyses? Do you even need to go through a university (for any degree, it doesn’t really matter) to do so? There are plenty of resources to start, download R or a matrix language, get online tutorials and books, read, read, read and ask questions in email lists or fora when you get stuck. If you are a clever cookie—and some of you clearly are one—you could easily cover as much ground as someone going through a university degree. It is probably still not enough to settle down, but it is a great start and a great improvement over the situation twenty years ago.

This description leaves three groups in trouble, trying to sort out their “value-adding” ability: academia, (bio)statisticians and software makers. What are universities offering that is unique enough to justify the time and money invested by individuals and governments? If you make software, what makes it special? For how long can you rely on tradition and inertia so people don’t switch to something else? What’s so special about your (bio)statistical training to justify having one of “you” in the organization? Too many questions, so I’d better go to sleep.

P.S. Did I write “value-adding” ability? I must have been talking to the suits for too long… Next post I may end up writing “value-adding proposition”!

On the (statistical) road, workshops and R

Things have been a bit quiet at Quantum Forest during the last ten days. Last Monday (Sunday for most readers) I flew to Australia to attend a couple of one-day workshops: one on spatial analysis (in Sydney) and another on modern applications of linear mixed models (in Wollongong). This will be followed by the International Biometric Society Australasian Region Conference in Kiama.

I would like to comment on the workshops to look for commonalities and differences. First, both workshops heavily relied on R, supporting the idea that if you want to reach a lot of people and get them using your ideas, R is pretty much the vehicle to do so. It is almost trivial to get people to install R and RStudio before the workshop so they are ready to go. “Almost” because you have to count on someone having a bizarre software configuration or draconian security policies for their computer.

The workshop on spatial analysis also used WinBUGS, which, with all respect to the developers, is a clunky piece of software. Calling it from R, or using JAGS from R, seems to me a much more sensible way of using a Bayesian approach while maintaining access to the full power of R. The workshop on linear mixed models relied on asreml-R; if you haven’t tried it, please give it a go (there is a free license for academic/research use). There were applications to multi-stage experiments, composite samples and high-dimensional data (molecular information). In addition, there was an initial session on optimal design of experiments.
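
For those who have not tried it, “using JAGS from R” boils down to something like the following minimal sketch with the rjags package (the model and data are a made-up toy example, not from the workshop material):

[sourcecode lang="r"]
library(rjags)

# A toy model: normal likelihood with vague priors on the mean and precision
model.string = "
model {
  for(i in 1:N) {
    y[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0, 1.0E-3)
  tau ~ dgamma(1.0E-3, 1.0E-3)
}"

set.seed(2011)
y = rnorm(30, mean = 10, sd = 2)              # fake data

jm = jags.model(textConnection(model.string),
                data = list(y = y, N = length(y)),
                n.chains = 3)
update(jm, 1000)                              # burn-in
post = coda.samples(jm, c('mu', 'tau'), n.iter = 5000)
summary(post)
[/sourcecode]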

In my opinion, the second workshop (modern applications…) was much more successful than the first one (spatial analysis…) for a few reasons:

  • One has to limit the material to cover in a one-day workshop; if you want to cover a lot consider three days so people can digest all the material.
  • One has to avoid the “split-personality” approach to presentations; having very basic and super hard material but nothing in the middle is not a great idea (IMHO). Pick a starting point and gradually move people up from there.
  • Limit the introduction of new software. One new piece of software per day seems to be a good rule of thumb.

Something bizarre (for me, at least) was the difference between the audiences. In Sydney the crowd was a lot younger, with many trainees in biostatistics coming mostly from health research. They had little exposure to R and seemed to come from a mostly SAS shop. The crowd in Wollongong had people with a lot of experience (OK, oldish) both in statistics and R. I was expecting young people to be more conversant in R.

Tomorrow we will drive down to Kiama, sort out registration and then go to the welcome BBQ. Funny thing is that this is my first statistics conference; as I mentioned in the About page of this blog, I am a simple forester. :-)