R, Julia and the shiny new thing

2012-04-12 / Luis

My head exploded a while ago. Perhaps not my head, but my brain was all mushy after working every day of March and the first week of April: an explanation, as good as any, for the posting hiatus. Back to the post title.

It is no mystery that for a while there has been some underlying unhappiness in the R world. Ross Ihaka and Duncan Temple Lang have mused on starting over (PDF, 2008). Similar comments were voiced by Ihaka in Christian Robert’s blog (2010) and were probably at the root of the development of Incanter (based on Clojure). Vince Buffalo pointed out the growing pains of R, but he was a realist in his post about Julia: having a cool base language is one thing, but recreating R’s package ecosystem is quite another^†.

A quick look at R-bloggers will show several posts mentioning Julia, including some by A-listers (in the R world, forget Hollywood) like Douglas Bates (of nlme and lme4 fame). They are not alone, as many of us (myself included) are prone to suffer the ‘oh, new, shiny’ syndrome. We are always looking for that new car language smell, which promises to deliver c('consistency', 'elegance', 'maintainability', 'speed', ...); each of us is looking for a different mix and order of requirements.

Personally, I do not find R particularly slow. Put another way, I tend to struggle with a particular family of problems (multivariate linear mixed models) that require heavy use of external libraries written in C or FORTRAN. Whatever language I use will call these libraries, so I may not stand to gain a lot from using [insert shiny language name here]. I doubt that one would gain much with parallelization in this class of problems (but I stand to be corrected). This does not mean that I would not consider a change.

Give me consistency + elegance and you start getting my attention. What do I mean? I have been reading a book on ‘Doing X with R’ so I can write a review about it. This involved looking at the book with newbie eyes and, presto!, R can be a real eyesore and the notation can drive you insane sometimes. At about the same time Vince Buffalo was posting his thoughts on Julia (early March) I wrote:

The success of some of Hadley Wickham’s packages got me thinking about underlying design issues in R that make functions so hard to master for users. Don’t get me wrong, I still think that R is great, but why is it so hard to understand parts of the core functionality? A quick web search will show that there is, for example, an incredible amount of confusion about how to use the apply family of functions. The management of dates and strings is also a sore point. I perfectly understand the need for, and even the desirability of, having new packages that extend the functionality of R. However, this is another kettle of fish; we are talking about making sane design choices so there is no need to repackage basic functionality to make it usable.
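
To make the apply confusion concrete, here is a minimal sketch using only base R. The toy data are invented for illustration; the point is simply that closely related functions return differently shaped results depending on the input.

```r
# Toy data, invented for illustration
x <- list(a = 1:3, b = 4:6)

lapply(x, mean)               # always returns a list
sapply(x, mean)               # simplifies to a named numeric vector
sapply(x, range)              # same function, but now the result is a 2 x 2 matrix
sapply(list(), mean)          # empty input: an empty list, not an empty numeric vector
vapply(x, mean, numeric(1))   # return type is declared up front

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)              # MARGIN = 1 means rows
apply(m, 2, sum)              # MARGIN = 2 means columns
```

A newcomer has to learn that the shape of what `sapply` returns depends on what the function happens to return for that particular input, which is the kind of inconsistency that wrapper packages try to paper over.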

Maybe the question is not as dramatic as ‘Do we need a new, shiny language?’ but rather ‘Can we get rid of the cruft that R has accumulated over the years? Can we make the language more consistent, easier to learn, write and maintain?’

Gratuitous picture: cabbage trees or mushy brain?

^†Some may argue that one could get away with implementing the 20% of the functionality that’s used by 80% of the users. Joel Spolsky already discussed the failure of this approach in 2001.

julia, r, rblogs

12 Comments

Harlan
2012-04-13 at 01:59

Yes, indeed. There are some really great aspects of R/S, such as ubiquitous attributes and lazy evaluation, that make it incredibly good for interactive data manipulation and statistical development. I’ve been working a bit on Julia, exploring basic data structures such as masked arrays (for NA support) and data frame equivalents. But I think a viable alternative approach would be, as you say, to reimplement the S language from scratch, completely redesigning the core libraries. Hadley has done a lot of work demonstrating better ways of dealing with data frame manipulation, date manipulation, and functional programming. Those insights and more could be used to design a clean, orthogonal set of core functionality that would be much easier to learn and use, while the language itself could be re-implemented in a much more modern and speedy way.

This approach would, however, break every package in existence, and would require many person-years of development to get to a stage where it could replace R for end-users. In theory, it’s a better approach than starting from scratch with Julia, whose core design is ideal for mathematical programming, but only good-not-great for interactive statistical work. But I don’t see the R core group proposing this or building support for this, ever. So I suspect that we’ll all plod along with R and its quirks for another 5 or 10 or 15 years or so, before Julia or something similar has the support and scale to slowly replace it.
- Luis (Post author)
  2012-04-13 at 16:06
  
  Over the last 20 years I’ve used SAS, Splus, Genstat, R & ASReml (for stats), Matlab and Mathematica (for simulation), FORTRAN (for simulation and stats) and Python (for general programming/glueing). They represent different stages and projects and I have never thought ‘this is the language that I will use for the rest of my life’. These days I use R quite a lot, and have pushed for its use in our university courses. However, I believe for both students and researchers it should be very important to understand the general concepts rather than a specific implementation of the tools.
  
  “But I don’t see the R core group proposing this or building support for this, ever. So I suspect that we’ll all plod along with R and its quirks for another 5 or 10 or 15 years or so, before Julia or something similar has the support and scale to slowly replace it.”
  
  I think you are totally right.
Robert Young
2012-04-13 at 06:14

— and, presto!, R can be a real eyesore and the notation can drive you insane sometimes.

Is R a command language or a programming language? Both (using a single syntax), of course, and that makes it different, at least to my newbie eyes (new to R, that is; I’ve used everything since BMDP at one time or another). For the users, which is more important? Are the “problems” encountered due to this? Could it be that trying to serve two masters has pleased none?

The counterexample: SQL is used by its users to manage relational (hopefully) data, but those who support and extend the SQL language generally do so in C/C++ (yes, each engine vendor has a Procedural Language for stored procedures and functions, but I don’t see this as analogous). Perhaps R should take that fork in the road: X to create the “commands”, while the users continue to use R syntax for the “commands”.
- Luis (Post author)
  2012-04-13 at 16:19
  
  “Is R a command language or a programming language?”
  
  Both, depending on the user and the problem. Most newbies want only basic commands and then try to expand their use of the language to cover more ground. There is no reason to use only one language to solve a problem, and R already does quite a bit of time-critical processing in C/C++/FORTRAN. Nevertheless, many people complain about speed (referring to execution speed) without considering the time required to develop the code.
  
  Most users will write code to solve a given problem only once. They could get faster processing in any low-level language, but it would take them weeks to write in, say, C because they have little or no practice with it. Sometimes one can get huge changes in performance in R with some attention to detail; for example here and here. Of course most benchmarks are totally unrealistic, because who cares about Fibonacci?
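  
  As a hedged illustration of the "attention to detail" point (this is not the code from the linked posts; the function names `fib_recursive` and `fib_iterative` are made up for this sketch), the same Fibonacci number can be computed in base R in two ways that differ greatly in the amount of work done:
  
  ```r
  # Naive recursion: recomputes the same subproblems over and over
  fib_recursive <- function(n) {
    if (n < 2) return(n)
    fib_recursive(n - 1) + fib_recursive(n - 2)
  }
  
  # Iterative version: a single pass, carrying the last two values
  fib_iterative <- function(n) {
    a <- 0
    b <- 1
    for (i in seq_len(n)) {
      tmp <- a + b
      a <- b
      b <- tmp
    }
    a
  }
  
  # Both give the same answer; only the running time differs
  stopifnot(fib_recursive(20) == fib_iterative(20))
  
  system.time(fib_recursive(20))
  system.time(fib_iterative(20))
  ```
  
  Timings will vary by machine, so none are quoted here. A benchmark that forces the recursive form on every language measures function-call overhead rather than how anyone would sensibly solve the problem.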
Jan Galkowski
2012-04-13 at 12:07

I’m lucky to have “outgrown” or gotten tired of finding neater ways of expressing computation years ago. Now, I’m fortunate: I learned LISP and purely functional ways of expressing computations early, had great teachers in numerical analysis, and was lucky to have done a lot of programming in APL for many years, as well as some hard-nosed time- and space-constrained programming in the embedded world, mostly of quantitative algorithms. I was also fortunate to have participated in the development of the DOD and FAA standard language, ADA, and so got some experience in what a language dev effort is like on the inside.

Someplace along the way I got much more interested in numerical algorithms and problems and began to care less about how my computations are expressed. I must say in comparison to old FORTRAN (there are new ones) and C (and I try not to touch
- Luis (Post author)
  2012-04-13 at 15:55
  
  In summary, what are you using today for technical/scientific computing?
- Jan Galkowski
  2012-04-14 at 02:41
  
  Sorry… My Kindle Fire hung up and I wasn’t able to finish this.
  
  I was going to say, I understand language development because of my participation in the DOD/FAA ADA language effort.
  
  But since about 1984 I became more interested in WHAT was being calculated than how. So I used LISP and primarily APL and MATLAB for a long time. I later worked in Smalltalk and C. Then SQL. Now I use R.
  - Luis (Post author)
    2012-04-14 at 11:27
    
    Thanks for the clarification. I still think that APL was an awesome example of conciseness and power, but it was a real pain to revisit code a year later. An interesting point about focusing on what is being calculated is that one can explore different ways of solving the problem, which makes many benchmarks invalid (e.g. the Fibonacci examples I linked to above). Forcing the same approach in all languages highlights false efficiencies.
nick
2012-04-13 at 12:39

Well… I am always struggling with R’s speed and memory limitations. I must have tried everything out there (ff, bigmemory, databases, Rcpp, RSQLite…) but nothing really works well for me. I really hope base R catches up with SAS in that regard, so that people don’t have to do all that plumbing even for a dataset that’s a couple of gigs big. In the real (non-academic) world, data sizes are even bigger and that’s why SAS is so pervasive. Revolution R is doing something with its out-of-memory stuff, but it’s not compatible with base R, it breaks with the way R works (which is actually pretty neat imo) and it’s closed source.
I like the way the JIT work is going. That will help with speed issues. Memory issues need some help as well.
- Luis (Post author)
  2012-04-13 at 12:49
  
  I have never managed to run my models through proc mixed in SAS. It can easily read large data sets, but once you want to do something fancier it just (technical term) craps out.
  - nick
    2012-04-14 at 02:23
    
    I agree in that regard. I don’t think SAS is great at everything and I’d rather use an open source tool. Also, the range of options available in R is staggering. However, SAS does have an enviable data step (although I find programming it rather clunky and unintuitive). Personally, I can do some preprocessing in Python… however, that makes adoption harder in an enterprise that is used to SAS. I’d love to hear more about the mixed model example though. I’ve never used either proc mixed or lme, although I did take Doug Bates’s tutorial at the useR conference in 2010.
    - Luis (Post author)
      2012-04-14 at 10:29
      
      I’ll post on this type of work soon. Basically, we are splitting an observation into many different sources of variation, some of which relate to genetic effects and some of which relate to the environment.
      
      In a given site we may have, for example, blocks, incomplete blocks, plots and individuals. Individuals are related to each other (which we need to account for) and there may be spatial environmental trends to account for too. Up to that point you could fit the model with proc mixed. Now imagine that you have, say, 50 sites and that you consider your response in each site as a different trait. Now you have 50 variables on the left-hand side, with potentially different experimental designs in each site; you would like to account for spatial trends within each site and for the correlation at the genetic level amongst all sites. By now you are supposed to be modeling residual covariance structures (e.g. ARxAR) and genetic structures (you may need to use something like a factor analytic model).
      Ah, and you have thousands of genotypes in each site.
      
      At this point SAS proc mixed and R’s nlme and lme4 can’t deal with the problem (actually, SAS couldn’t deal with even 10 sites and R couldn’t deal with even 2), so you end up using asreml, which is what researchers like me and many breeding companies (think Monsanto, for example) are using. I can call asreml from R using the asreml-R package (proprietary, free for research).
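      
      One way to write down the kind of model described above, as a sketch under assumed notation rather than the exact specification used in asreml: with s sites, y_j is the vector of responses at site j, u_g holds the genetic effects, u_d collects the design effects (blocks, incomplete blocks, plots), A is the relationship matrix among individuals, and G is the s x s genetic covariance matrix among sites.
      
      ```latex
      \begin{aligned}
      \mathbf{y} &= \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_g \mathbf{u}_g + \mathbf{Z}_d \mathbf{u}_d + \mathbf{e},
        \qquad \mathbf{y} = (\mathbf{y}_1^\top, \dots, \mathbf{y}_s^\top)^\top \\
      \operatorname{var}(\mathbf{u}_g) &= \mathbf{G} \otimes \mathbf{A},
        \qquad \mathbf{G} \approx \boldsymbol{\Lambda}\boldsymbol{\Lambda}^\top + \boldsymbol{\Psi} \\
      \operatorname{var}(\mathbf{e}) &= \bigoplus_{j=1}^{s} \sigma_j^2 \,
        \boldsymbol{\Sigma}_c(\rho_{c_j}) \otimes \boldsymbol{\Sigma}_r(\rho_{r_j})
      \end{aligned}
      ```
      
      Here the factor analytic form (loadings Λ plus a diagonal Ψ of site-specific variances) stands in for an unstructured G, which with 50 sites would need 1275 parameters, and each site gets its own separable ARxAR residual structure built from column and row autoregressive correlation matrices.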