Teaching code, production code, benchmarks and new languages

I’m a bit obsessive with words. May be I should have used learning in the title, rather than teaching code. Or perhaps remembering code. You know? Code where one actually has very clear idea of what is going on; for example, let’s say that we are calculating the average of a bunch of n numbers, we can have a loop that will add up each of them and then divide the total by n. Of course we wouldn’t do that in R, but use a simple function: mean(x).

In a previous post I compared R and Julia code and one of the commenters (Andrés) rightly pointed out that the code was inefficient. It was possible to speed up the calculation many times (and he sent me the code to back it up), because we could reuse intermediate results, generate batches of random numbers, etc. However, if you have studied the genomic selection problem, the implementations in my post are a lot closer to the algorithm. It is easier to follow and to compare, but not too flash in the speed department; for the latter we’d move to production code, highly optimized but not very similar to the original explanation.

This reminded me of the controversial Julia benchmarks, which implemented a series of 7 toy problems in a number of languages, following the ‘teaching code’ approach. Forcing ‘teaching code’ makes the comparison simple, as different language implementations look similar to each other. However, one is also throwing away the very thing that makes languages different: they have been designed and optimized using different typical uses in mind. For example, execution time for the pisum benchmark can be reduced sixty times just by replacing the internal loop with sum(1/c(1:10000)^2). Making comparisons easily understandable (so the code looks very similar) is orthogonal to making them realistic.

Gratuitous picture: looking for the right bicycle in Uppsala (Photo: Luis).

Tony Boyles asked, tongue in cheek, ‘The Best Statistical Programming Language is …Javascript?’ Well, if you define best as fastest language running seven benchmarks that may bear some resemblance to what you’d like to do (despite not having any statistical libraries) maybe the answer is yes.

I have to admit that I’m a sucker for languages; I like to understand new ways of expressing problems and, some times, one is lucky and finds languages that even allow tackling new problems. At the same time, most of my tests never progress beyond the ‘ah, cool, but it isn’t really that interesting’. So, if you are going to design a new language for statistical analyses you may want to:

  • Piggyback on another language. Case in point: R, which is far from an immaculate conception but an implementation of S, which gave it a start with both users and code. It may be that you extend a more general language (e.g. Incanter) rather than a domain specific one.
  • However, one needs a clean syntax from the beginning (ah, Python); my theory is that this partly explains why Incanter got little traction. (((too-many-parentheses-are-annoying)))
  • Related to the previous point, make extensibility to other languages very simple. R’s is a pain, while Julia’s seems to be much straightforward (judging by Douglas Bates’s examples).
  • Indices start at 1, not 0. Come on! This is stats and maths, not computer science. It is easier for us, your audience, to count from 1.
  • Programming structures are first class citizens, not half-assed constructs that complicate life. A clear counterexample are SAS’s macros and PROC IML, which are not conducive to people writing their own libraries/modules and sharing them with the community: they are poorly integrated with the rest of the SAS system.
  • Rely since day one on creating a community; as I pointed out in a previous post one thing is having a cool base language but a different one is recreating R’s package ecosystem. Good luck recreating R’s ecosystem working on your own.
  • However, you have to create a base language with sane basics included: access to both plain text and databases, easy summary functions (Xapply doesn’t cut it), glms and cool graphics (ggplot like) or people may not get hooked with the language, so they start contributing.

Two interesting resources discussing the adoption of R are this paper (PDF) by John Fox and this presentation by John Cook. Incidentally, John Fox is the author of my favorite book on regression and generalized linear models, no R code at all, just explanations and the math behind it all. John Cook writes The Endeavour, a very interesting blog with mathematical/programming bent.

Late-April flotsam

It has been month and a half since I compiled a list of statistical/programming internet flotsam and jetsam.

  • Via Lambda The Ultimate: Evaluating the Design of the R Language: Objects and Functions For Data Analysis (PDF). A very detailed evaluation of the design and performance of R. HT: Christophe Lalanne. If you are in statistical genetics and Twitter Christophe is the man to follow.
  • Attributed to John Tukey, “without assumptions there can be no conclusions” is an extremely important point, which comes to mind when listening to the fascinating interview to Richard Burkhauser on the changes of income for the middle class in USA. Changes to the definition of the unit of analysis may give a completely different result. By the way, does someone have a first-hand reference to Tukey’s quote?
  • Nature news publishes RNA studies under fire: High-profile results challenged over statistical analysis of sequence data. I expect to see happening more often once researchers get used to upload the data and code for their papers.
  • Bob O’Hara writes on Why simple models are better, which is not positive towards the machine learning crowd.
  • A Matlab Programmer’s Take On Julia, and a Python developer interacts with Julia developers. Not everything is smooth. HT: Mike Croucher. ‏
  • Dear NASA: No More Rainbow Color Scales, Please. HT: Mike Dickinson. Important: this applies to R graphs too.
  • Rafael Maia asks “are programmers trying on purpose to come up with names for their languages that make it hard to google for info?” These are the suggestions if one searches Google for Julia:

    Unhelpful search suggestions.

  • I suggest creating a language called Bieber and search for dimension Bieber, loop Bieber and regression Bieber.

That’s all folks.

R, Julia and genome wide selection

— “You are a pussy” emailed my friend.
— “Sensu cat?” I replied.
— “No. Sensu chicken” blurbed my now ex-friend.

What was this about? He read my post on R, Julia and the shiny new thing, which prompted him to assume that I was the proverbial old dog unwilling (or was it unable?) to learn new tricks. (Incidentally, with friends like this who needs enemies? Hi, Gus.)

Having a look at different—statistical—horses (Photo: Luis, click to enlarge).

I decided to tackle a small—but hopefully useful—piece of code: fitting/training a Genome Wide Selection model, using the Bayes A approach put forward by Meuwissen, Hayes and Goddard in 2001. In that approach the breeding values of the individuals (response) are expressed as a function of a very large number of random predictors (2000, our molecular markers). The dataset (csv file) is a simulation of 2000 bi-allelic markers (aa = 0, Aa = 1, AA = 2) for 250 individuals, followed by the phenotypes (column 2001) and breeding values (column 2002). These models are frequently adjusted using MCMC.

In 2010 I attended this course in Ames, Iowa where Rohan Fernando passed us the following R code (pretty much a transliteration from C code; notice the trailing semicolons, for example). P.D. 2012-04-26 Please note that this is teaching code not production code:

nmarkers = 2000;    # number of markers
startMarker = 1981; # set to 1 to use all
numiter  = 2000;    # number of iterations
vara     = 1.0/20.0; 
 
# input data
data     = matrix(scan("trainData.out0"),ncol=nmarkers+2,byrow=TRUE);
nrecords = dim(data)[1];
 
beg = Sys.time()
 
# x has the mean followed by the markers
x = cbind(1,data[,startMarker:nmarkers]);  
y = data[,nmarkers+1];
a =  data[,nmarkers+2];
# inital values
 
nmarkers = nmarkers - startMarker + 1;
mean2pq = 0.5;                          # just an approximation
scalea  = 0.5*vara/(nmarkers*mean2pq);  # 0.5 = (v-2)/v for v=4
 
size = dim(x)[2];
b = array(0.0,size);
meanb = b;
b[1] = mean(y);
var  = array(0.0,size);
 
# adjust y
 ycorr = y - x%*%b;                  
 
# MCMC sampling
for (iter in 1:numiter){	
  # sample vare
  vare = ( t(ycorr)%*%ycorr )/rchisq(1,nrecords + 3);
 
  # sample intercept
  ycorr = ycorr + x[,1]*b[1];
  rhs = sum(ycorr)/vare;
  invLhs = 1.0/(nrecords/vare);
  mean = rhs*invLhs;                            
  b[1] = rnorm(1,mean,sqrt(invLhs));                  
  ycorr = ycorr - x[,1]*b[1];
  meanb[1] = meanb[1] + b[1];
 
  # sample variance for each locus	
  for (locus in 2:size){
    var[locus] = (scalea*4+b[locus]*b[locus])/rchisq(1,4.0+1) 
  }
 
# sample effect for each locus	
  for (locus in 2:size){
    # unadjust y for this locus
    ycorr = ycorr + x[,locus]*b[locus];   
    rhs = t(x[,locus])%*%ycorr/vare;
    lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus];
    invLhs = 1.0/lhs;
    mean = invLhs*rhs;
    b[locus]= rnorm(1,mean,sqrt(invLhs));  
    #adjust y for the new value of this locus                   
    ycorr = ycorr - x[,locus]*b[locus];   
    meanb[locus] = meanb[locus] + b[locus];		
  }	
}
 
Sys.time() - beg
 
meanb = meanb/numiter;
aHat  = x %*% meanb;

Thus, we just need defining a few variables, reading the data (marker genotypes, breeding values and phenotypic data) into a matrix, creating loops, matrix and vector multiplication and generating random numbers (using a Gaussian and Chi squared distributions). Not much if you think about it, but I didn’t have much time to explore Julia’s features as to go for something more complex.

nmarkers = 2000    # Number of markers
startmarker = 1981 # Set to 1 to use all
numiter = 2000     # Number of iterations
 
data = dlmread("markers.csv", ',')
(nrecords, ncols) = size(data)
 
tic()
 
#this is the mean and markers matrix
X = hcat(ones(Float64, nrecords), data[:, startmarker:nmarkers])
y = data[:, nmarkers + 1]
a = data[:, nmarkers + 2]
 
nmarkers = nmarkers - startmarker + 1
vara = 1.0/nmarkers
mean2pq = 0.5
 
scalea  = 0.5*vara/(nmarkers*mean2pq) # 0.5 = (v-2)/v for v=4
 
ndesign = size(X, 2)
b = zeros(Float64, ndesign)
meanb = zeros(Float64, ndesign)
b[1] = mean(y)
varian  = zeros(Float64, ndesign)
 
# adjust y
ycorr = y - X * b                  
 
# MCMC sampling
for i = 1:numiter
  # sample vare
  vare = dot(ycorr, ycorr )/randchi2(nrecords + 3)
 
  # sample intercept
  ycorr = ycorr + X[:, 1] * b[1];
  rhs = sum(ycorr)/vare;
  invlhs = 1.0/(nrecords/vare);
  mn = rhs*invlhs;                            
  b[1] = randn() * sqrt(invlhs) + mn;
  ycorr = ycorr - X[:, 1] * b[1];
  meanb[1] = meanb[1] + b[1];
 
  # sample variance for each locus
  for locus = 2:ndesign
      varian[locus] = (scalea*4 + b[locus]*b[locus])/randchi2(4.0 + 1);
  end
 
  # sample effect for each locus
  for locus = 2:ndesign
      # unadjust y for this locus
      ycorr = ycorr + X[:, locus] * b[locus]; 
      rhs = dot(X[:, locus], ycorr)/vare;
      lhs = dot(X[:, locus], X[:, locus])/vare + 1.0/varian[locus];
      invlhs = 1.0/lhs;
      mn = invlhs * rhs;
      b[locus] = randn() * sqrt(invlhs) + mn;
      #adjust y for the new value of this locus
      ycorr = ycorr - X[:, locus] * b[locus];
      meanb[locus] = meanb[locus] + b[locus];
  end
end
 
toc()
 
meanb = meanb/numiter;
aHat  = X * meanb;

The code looks remarkably similar and there are four main sources of differences:

  1. The first trivial one is that the original code read a binary dataset and I didn’t know how to do it in Julia, so I’ve read a csv file instead (this is why I start timing after reading the file too).
  2. The second trivial one is to avoid name conflicts between variables and functions; for example, in R the user is allowed to have a variable called var that will not interfere with the variance function. Julia is picky about that, so I needed renaming some variables.
  3. Julia pases variables by reference, while R does so by value when assigning matrices, which tripped me because in the original R code there was something like: b = array(0.0,size); meanb = b;. This works fine in R, but in Julia changes to the b vector also changed meanb.
  4. The definition of scalar vs array created some problems in Julia. For example y' * y (t(y) %*% y in R) is numerically equivalent to dot(y, y). However, the first version returns an array, while the second one a scalar. I got an error message when trying to store the ‘scalar like an array’ in to an array. I find that confusing.

One interesting point in this comparison is using rough code, not really optimized for speed; in fact, the only thing that I can say of the Julia code is that ‘it runs’ and it probably is not very idiomatic. Testing runs with different numbers of markers we get that R needs roughly 2.8x the time used by Julia. The Julia website claims better results in benchmarks, but in real life we work with, well, real problems.

In 1996-7 I switched from SAS to ASReml for genetic analyses because it was 1-2 orders of magnitude faster and opened a world of new models. Today a change from R to Julia would deliver (in this particular case) a much more modest speed up (~3x), which is OK but not worth changing languages (yet). Together with the embryonic graphical capabilities and the still-to-develop ecosystem of packages, means that I’m still using R. Nevertheless, the Julia team has achieved very impressive performance in very little time, so it is worth to keep an eye on their progress.

P.S.1 Readers are welcome to suggesting ways of improving the code.
P.S.2 WordPress does not let me upload the binary version of the simulated data.
P.S.3 Hey WordPress guys; it would be handy if the sourcecode plugin supported Julia!
P.S.4 2012-04-26 Following AL’s recommendation in the comments, one can replace in R:

rhs = t(x[,locus])%*%ycorr/vare;
lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus]

by

rhs = crossprod(x[,locus],ycorr)/vare
lhs = crossprod(x[,locus],x[,locus])/vare + 1.0/var[locus]

reducing execution time by roughly 20%, making the difference between Julia and R even smaller.

R, Julia and the shiny new thing

My head exploded a while ago. Perhaps not my head but my brain was all mushy after working every day of March and first week of April; an explanation—as good as any—for the post hiatus. Back to the post title.

It is not a mystery that for a while there have been some underlying unhappiness in the R world. Ross Ihaka and Duncan Temple Long have mused on starting over (PDF, 2008). Similar comments were voiced by Ihaka in Christian Robert’s blog (2010) and were probably at the root of the development of Incanter (based on Clojure). Vince Buffalo pointed out the growing pains of R but it was a realist on his post about Julia: one thing is having a cool base language but a different one is recreating R’s package ecosystem.

A quick look in R-bloggers will show several posts mentioning Julia, including A-listers (in the R world, forget Hollywood) like Douglas Bates (of nlme and lme4 fame). They are not alone, as many of us (myself included) are prone to suffer the ‘oh, new, shiny’ syndrome. We are always looking for that new car language smell, which promises to deliver c('consistency', 'elegance', 'maintainability', 'speed', ...); each of us is looking for a different mix and order of requirements.

Personally, I do not find R particularly slow. Put another way, I tend to struggle with a particular family of problems (multivariate linear mixed models) that require heavy use of external libraries written in C or FORTRAN. Whatever language I use will call these libraries, so I may not stand to gain a lot from using insert shiny language name here. I doubt that one would gain much with parallelization in this class of problems (but I stand to be corrected). This does not mean that I would not consider a change.

Give me consistency + elegance and you start getting my attention. What do I mean? I have been reading a book on ‘Doing X with R’ so I can write a review about it. This involved looking at the book with newbie eyes and, presto!, R can be a real eyesore and the notation can drive you insane sometimes. At about the same time Vince Buffalo was posting his thoughts on Julia (early March) I wrote:

The success of some of Hadley Wickham’s packages got me thinking about underlying design issues in R that make functions so hard to master for users. Don’t get me wrong, I still think that R is great, but why are there so many problems to understand part of the core functionality? A quick web search will highlight that there is, for example, an incredible amount of confusion on how to use the apply family of functions. The management of dates and strings is also a sore point. I perfectly understand the need for, and even the desirability of, having new packages that extend the functionality of R. However, this is another kettle of fish; we are talking about making sane design choices so there is no need to repackage basic functionality to make it usable.

May be the question is not as dramatic as Do we need a new, shiny language? but Can we get rid of the cruft that R has accumulated over the years? Can we make the language more consistent, easier to learn, write and maintain?

Gratuitous picture: cabbage trees or mushy brain? (Photo: Luis).

Some may argue that one could get away with implementing the 20% of the functionality that’s used by 80% of the users. Joel Spolsky already discussed the failure of this approach in 2001.