Teaching code, production code, benchmarks and new languages

I’m a bit obsessive with words. Maybe I should have used learning in the title, rather than teaching code. Or perhaps remembering code. You know? Code where one actually has a very clear idea of what is going on; for example, say we are calculating the average of a bunch of n numbers: we can have a loop that adds up each of them and then divides the total by n. Of course we wouldn’t do that in R, but use a simple function: mean(x).
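
A minimal sketch of the contrast, with made-up numbers just to show the two styles:

# 'Teaching' version: spell out the algorithm step by step
x = c(3, 7, 1, 9, 5)
total = 0
n = length(x)
for (i in 1:n) {
  total = total + x[i]
}
total / n   # 5

# Everyday version: use the built-in function
mean(x)     # 5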

In a previous post I compared R and Julia code, and one of the commenters (Andrés) rightly pointed out that the code was inefficient. It was possible to speed up the calculation many times (and he sent me the code to back it up) because we could reuse intermediate results, generate batches of random numbers, etc. However, if you have studied the genomic selection problem, the implementations in my post are a lot closer to the algorithm. They are easier to follow and to compare, but not too flash in the speed department; for the latter we’d move to production code, highly optimized but not very similar to the original explanation.

This reminded me of the controversial Julia benchmarks, which implemented a series of seven toy problems in a number of languages, following the ‘teaching code’ approach. Forcing ‘teaching code’ makes the comparison simple, as the different language implementations look similar to each other. However, one is also throwing away the very thing that makes languages different: they have been designed and optimized with different typical uses in mind. For example, execution time for the pisum benchmark can be reduced sixty times just by replacing the internal loop with sum(1/c(1:10000)^2). Making comparisons easily understandable (so the code looks very similar) is orthogonal to making them realistic.
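
To make that concrete, here is a rough sketch of the pisum-style inner loop next to the vectorised replacement (my own version, not the official benchmark code):

# Loop version, in the spirit of the benchmark's 'teaching code'
pisum_loop = function() {
  s = 0.0
  for (k in 1:10000) {
    s = s + 1.0/(k * k)
  }
  s
}

# Vectorised version mentioned above
pisum_vec = function() sum(1/c(1:10000)^2)

pisum_loop()   # ~1.6448, approaching pi^2/6
pisum_vec()    # same value, far less interpreter overhead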

Gratuitous picture: looking for the right bicycle in Uppsala (Photo: Luis).

Tony Boyles asked, tongue in cheek, ‘The Best Statistical Programming Language is …Javascript?’ Well, if you define best as the fastest language running seven benchmarks that may bear some resemblance to what you’d like to do (despite not having any statistical libraries), maybe the answer is yes.

I have to admit that I’m a sucker for languages; I like to understand new ways of expressing problems and, sometimes, one is lucky and finds languages that even allow tackling new problems. At the same time, most of my tests never progress beyond the ‘ah, cool, but it isn’t really that interesting’ stage. So, if you are going to design a new language for statistical analyses you may want to:

  • Piggyback on another language. Case in point: R, which is far from an immaculate conception but an implementation of S, which gave it a start with both users and code. It may be that you extend a more general language (e.g. Incanter) rather than a domain specific one.
  • However, one needs a clean syntax from the beginning (ah, Python); my theory is that this partly explains why Incanter got little traction. (((too-many-parentheses-are-annoying)))
  • Related to the previous point, make extensibility to other languages very simple. R’s is a pain, while Julia’s seems to be much more straightforward (judging by Douglas Bates’s examples).
  • Indices start at 1, not 0. Come on! This is stats and maths, not computer science. It is easier for us, your audience, to count from 1.
  • Programming structures are first class citizens, not half-assed constructs that complicate life. Clear counterexamples are SAS’s macros and PROC IML, which are not conducive to people writing their own libraries/modules and sharing them with the community: they are poorly integrated with the rest of the SAS system.
  • Rely from day one on creating a community; as I pointed out in a previous post, one thing is having a cool base language but a different one is recreating R’s package ecosystem. Good luck recreating that ecosystem on your own.
  • However, you have to create a base language with sane basics included: access to both plain text and databases, easy summary functions (Xapply doesn’t cut it), GLMs and cool graphics (ggplot-like), or people may not get hooked on the language and never start contributing.

Two interesting resources discussing the adoption of R are this paper (PDF) by John Fox and this presentation by John Cook. Incidentally, John Fox is the author of my favorite book on regression and generalized linear models: no R code at all, just explanations and the math behind it all. John Cook writes The Endeavour, a very interesting blog with a mathematical/programming bent.

Late-April flotsam

It has been a month and a half since I compiled a list of statistical/programming internet flotsam and jetsam.

  • Via Lambda The Ultimate: Evaluating the Design of the R Language: Objects and Functions For Data Analysis (PDF). A very detailed evaluation of the design and performance of R. HT: Christophe Lalanne. If you are in statistical genetics and on Twitter, Christophe is the man to follow.
  • Attributed to John Tukey, “without assumptions there can be no conclusions” is an extremely important point, which comes to mind when listening to the fascinating interview with Richard Burkhauser on the changes in income for the middle class in the USA. Changes to the definition of the unit of analysis may give a completely different result. By the way, does someone have a first-hand reference for Tukey’s quote?
  • Nature news publishes RNA studies under fire: High-profile results challenged over statistical analysis of sequence data. I expect to see this happening more often once researchers get used to uploading the data and code for their papers.
  • Bob O’Hara writes on Why simple models are better, which is not positive towards the machine learning crowd.
  • A Matlab Programmer’s Take On Julia, and a Python developer interacts with Julia developers. Not everything is smooth. HT: Mike Croucher.
  • Dear NASA: No More Rainbow Color Scales, Please. HT: Mike Dickinson. Important: this applies to R graphs too.
  • Rafael Maia asks “are programmers trying on purpose to come up with names for their languages that make it hard to google for info?” These are the suggestions if one searches Google for Julia:

    Unhelpful search suggestions.

  • I suggest creating a language called Bieber and searching for dimension Bieber, loop Bieber and regression Bieber.

That’s all folks.

R, Julia and genome wide selection

— “You are a pussy” emailed my friend.
— “Sensu cat?” I replied.
— “No. Sensu chicken” blurbed my now ex-friend.

What was this about? He read my post on R, Julia and the shiny new thing, which prompted him to assume that I was the proverbial old dog unwilling (or was it unable?) to learn new tricks. (Incidentally, with friends like this who needs enemies? Hi, Gus.)

Having a look at different—statistical—horses (Photo: Luis, click to enlarge).

I decided to tackle a small—but hopefully useful—piece of code: fitting/training a Genome Wide Selection model, using the Bayes A approach put forward by Meuwissen, Hayes and Goddard in 2001. In that approach the breeding values of the individuals (response) are expressed as a function of a very large number of random predictors (2000, our molecular markers). The dataset (csv file) is a simulation of 2000 bi-allelic markers (aa = 0, Aa = 1, AA = 2) for 250 individuals, followed by the phenotypes (column 2001) and breeding values (column 2002). These models are frequently fitted using MCMC.
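
For orientation, this is roughly how one could read and sanity-check the csv version of the dataset in R (a sketch only; I am assuming there is no header row, and the file name matches the one used later in the Julia code):

# Quick look at the simulated dataset
sim = as.matrix(read.csv("markers.csv", header = FALSE))
dim(sim)               # should be 250 x 2002
table(sim[, 1])        # first marker coded as 0 (aa), 1 (Aa) or 2 (AA)
pheno = sim[, 2001]    # phenotypes
bv    = sim[, 2002]    # true breeding values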

In 2010 I attended this course in Ames, Iowa, where Rohan Fernando passed us the following R code (pretty much a transliteration from C code; notice the trailing semicolons, for example). P.S. 2012-04-26: Please note that this is teaching code, not production code:

nmarkers = 2000;    # number of markers
startMarker = 1981; # set to 1 to use all
numiter  = 2000;    # number of iterations
vara     = 1.0/20.0; 
 
# input data
data     = matrix(scan("trainData.out0"),ncol=nmarkers+2,byrow=TRUE);
nrecords = dim(data)[1];
 
beg = Sys.time()
 
# x has the mean followed by the markers
x = cbind(1,data[,startMarker:nmarkers]);  
y = data[,nmarkers+1];
a =  data[,nmarkers+2];
# initial values
 
nmarkers = nmarkers - startMarker + 1;
mean2pq = 0.5;                          # just an approximation
scalea  = 0.5*vara/(nmarkers*mean2pq);  # 0.5 = (v-2)/v for v=4
 
size = dim(x)[2];
b = array(0.0,size);
meanb = b;
b[1] = mean(y);
var  = array(0.0,size);
 
# adjust y
 ycorr = y - x%*%b;                  
 
# MCMC sampling
for (iter in 1:numiter){	
  # sample vare
  vare = ( t(ycorr)%*%ycorr )/rchisq(1,nrecords + 3);
 
  # sample intercept
  ycorr = ycorr + x[,1]*b[1];
  rhs = sum(ycorr)/vare;
  invLhs = 1.0/(nrecords/vare);
  mean = rhs*invLhs;                            
  b[1] = rnorm(1,mean,sqrt(invLhs));                  
  ycorr = ycorr - x[,1]*b[1];
  meanb[1] = meanb[1] + b[1];
 
  # sample variance for each locus	
  for (locus in 2:size){
    var[locus] = (scalea*4+b[locus]*b[locus])/rchisq(1,4.0+1) 
  }
 
# sample effect for each locus	
  for (locus in 2:size){
    # unadjust y for this locus
    ycorr = ycorr + x[,locus]*b[locus];   
    rhs = t(x[,locus])%*%ycorr/vare;
    lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus];
    invLhs = 1.0/lhs;
    mean = invLhs*rhs;
    b[locus]= rnorm(1,mean,sqrt(invLhs));  
    #adjust y for the new value of this locus                   
    ycorr = ycorr - x[,locus]*b[locus];   
    meanb[locus] = meanb[locus] + b[locus];		
  }	
}
 
Sys.time() - beg
 
meanb = meanb/numiter;
aHat  = x %*% meanb;

Thus, we just need to define a few variables, read the data (marker genotypes, breeding values and phenotypic data) into a matrix, create loops, do matrix and vector multiplication, and generate random numbers (from Gaussian and chi-squared distributions). Not much if you think about it, but I didn’t have enough time to explore Julia’s features to go for something more complex.

nmarkers = 2000    # Number of markers
startmarker = 1981 # Set to 1 to use all
numiter = 2000     # Number of iterations
 
data = dlmread("markers.csv", ',')
(nrecords, ncols) = size(data)
 
tic()
 
#this is the mean and markers matrix
X = hcat(ones(Float64, nrecords), data[:, startmarker:nmarkers])
y = data[:, nmarkers + 1]
a = data[:, nmarkers + 2]
 
nmarkers = nmarkers - startmarker + 1
vara = 1.0/nmarkers
mean2pq = 0.5
 
scalea  = 0.5*vara/(nmarkers*mean2pq) # 0.5 = (v-2)/v for v=4
 
ndesign = size(X, 2)
b = zeros(Float64, ndesign)
meanb = zeros(Float64, ndesign)
b[1] = mean(y)
varian  = zeros(Float64, ndesign)
 
# adjust y
ycorr = y - X * b                  
 
# MCMC sampling
for i = 1:numiter
  # sample vare
  vare = dot(ycorr, ycorr )/randchi2(nrecords + 3)
 
  # sample intercept
  ycorr = ycorr + X[:, 1] * b[1];
  rhs = sum(ycorr)/vare;
  invlhs = 1.0/(nrecords/vare);
  mn = rhs*invlhs;                            
  b[1] = randn() * sqrt(invlhs) + mn;
  ycorr = ycorr - X[:, 1] * b[1];
  meanb[1] = meanb[1] + b[1];
 
  # sample variance for each locus
  for locus = 2:ndesign
      varian[locus] = (scalea*4 + b[locus]*b[locus])/randchi2(4.0 + 1);
  end
 
  # sample effect for each locus
  for locus = 2:ndesign
      # unadjust y for this locus
      ycorr = ycorr + X[:, locus] * b[locus]; 
      rhs = dot(X[:, locus], ycorr)/vare;
      lhs = dot(X[:, locus], X[:, locus])/vare + 1.0/varian[locus];
      invlhs = 1.0/lhs;
      mn = invlhs * rhs;
      b[locus] = randn() * sqrt(invlhs) + mn;
      #adjust y for the new value of this locus
      ycorr = ycorr - X[:, locus] * b[locus];
      meanb[locus] = meanb[locus] + b[locus];
  end
end
 
toc()
 
meanb = meanb/numiter;
aHat  = X * meanb;

The code looks remarkably similar and there are four main sources of differences:

  1. The first trivial one is that the original code read a binary dataset and I didn’t know how to do that in Julia, so I read a csv file instead (which is why I start timing after reading the file in both versions).
  2. The second trivial one is avoiding name conflicts between variables and functions; for example, in R the user is allowed to have a variable called var that will not interfere with the variance function. Julia is picky about that, so I needed to rename some variables.
  3. Julia passes variables by reference, while R does so by value when assigning matrices, which tripped me up because the original R code had something like: b = array(0.0,size); meanb = b;. This works fine in R, but in Julia changes to the b vector also changed meanb (see the sketch after this list).
  4. The definition of scalar vs array created some problems in Julia. For example, y' * y (t(y) %*% y in R) is numerically equivalent to dot(y, y). However, the first version returns an array, while the second one returns a scalar. I got an error message when trying to store the array-valued ‘scalar’ into an array. I find that confusing.
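
A minimal R sketch of points 3 and 4 (the R side is shown; the Julia behaviour is as described above):

# Point 3: in R, modifying b after meanb = b does not touch meanb
b = array(0.0, 5)
meanb = b
b[1] = 10
meanb[1]      # still 0; R copies on modification
# In Julia, meanb = b makes both names refer to the same vector, so
# b[1] = 10 would also change meanb[1].

# Point 4: t(y) %*% y returns a 1 x 1 matrix, not a bare number
y = c(1, 2, 3)
t(y) %*% y    # 1 x 1 matrix containing 14
sum(y * y)    # plain scalar, 14
# Julia draws the same distinction between y' * y (array) and dot(y, y) (scalar).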

One interesting point in this comparison is that we are using rough code, not really optimized for speed; in fact, the only thing that I can say about the Julia code is that ‘it runs’ and it probably is not very idiomatic. Testing runs with different numbers of markers show that R needs roughly 2.8x the time used by Julia. The Julia website claims better results in benchmarks, but in real life we work with, well, real problems.

In 1996-7 I switched from SAS to ASReml for genetic analyses because it was 1-2 orders of magnitude faster and opened a world of new models. Today a change from R to Julia would deliver (in this particular case) a much more modest speed-up (~3x), which is OK but not worth changing languages (yet). That, together with the embryonic graphical capabilities and the still-to-develop ecosystem of packages, means that I’m still using R. Nevertheless, the Julia team has achieved very impressive performance in very little time, so it is worth keeping an eye on their progress.

P.S.1 Readers are welcome to suggest ways of improving the code.
P.S.2 WordPress does not let me upload the binary version of the simulated data.
P.S.3 Hey WordPress guys; it would be handy if the sourcecode plugin supported Julia!
P.S.4 2012-04-26 Following AL’s recommendation in the comments, one can replace in R:

rhs = t(x[,locus])%*%ycorr/vare;
lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus]

by

rhs = crossprod(x[,locus],ycorr)/vare
lhs = crossprod(x[,locus],x[,locus])/vare + 1.0/var[locus]

reducing execution time by roughly 20% and making the difference between Julia and R even smaller.
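
If you want to check the gain on your own machine, a rough timing sketch along these lines works (illustrative only; the 20% figure above comes from the full sampler, and your numbers will depend on your BLAS):

# Micro-benchmark of the two formulations on vectors of the same length as the data
set.seed(42)
xl = rnorm(250)   # stands in for a column of x
yc = rnorm(250)   # stands in for ycorr
system.time(for (i in 1:100000) t(xl) %*% yc)
system.time(for (i in 1:100000) crossprod(xl, yc))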

My setup

Yesterday I accidentally started a dialogue on Twitter with the dude running The Setup. Tonight I decided to procrastinate in my hotel room (I’m in Rotovegas for work) by writing up my own Luis Uses This:

Since 2005 I’ve been using Apple computers as my main machines. They tend to be well built and keep on running without rebooting for a while, and I ssh to a unix box from them when I need extra oomph. At the moment I have a 2009 15″ MacBook Pro and a 2010 27″ iMac; both computers are pretty much stock, except for extra RAM, and they are still running Snow Leopard. I have never liked Apple mice, so I bought a Logitech mouse for the iMac. I use a generic Android phone, because I’m too cheap to spend money on an iPhone. I don’t have an iPad either, because I don’t have a proper use for it and I dislike lugging around gear for the sake of it.

I’m not ‘addicted’ to any software. However, I do use some programs frequently: R for data analysis/scripting (often with RStudio as a frontend), asreml for quantitative genetics, Python for scripting/scraping, XeLaTeX for writing lecture notes and solo papers (because my second surname is Zúñiga), MS Word for writing anything that requires collaborating with other people, and Keynote for presentations (though sometimes I have to use PowerPoint). I check my university email with Thunderbird or Entourage; my browsing is mostly done using Chrome, but when paranoid I use Tor + Firefox + Vidalia. I use a bunch of other programs, but not often enough to deserve a mention. If you think about it, Keynote is the only format that defeats my platform agnosticism (I could still write Word documents using OpenOffice or similar). I almost forgot! I do rely on Dropbox to keep computers synced.

I keep on changing text editors: I don’t understand how people can cope with emacs, and I am uncomfortably writing this post using vim (which is awful as well). I own a copy of TextMate but feel annoyed by the author’s abandonment, so I’m at a point where I tend to use software-specific editors: R – RStudio, XeLaTeX – TeXShop, etc.

If I weren’t allowed to use a Mac at work I’d probably move to Linux; the major hassle would be converting Keynote presentations to something else. I could live with Windows, but I would start with a totally clean install, because I find the pre-installed software very unhelpful. These days I think that I’ve been unconsciously preparing myself for the impermanence of software, so if I need to learn a new stats package or a new editor that is ‘just fine’: software-agnostic Buddhism.

Non-computer-wise, I’m permanently dissatisfied with my bag/backpack: I haven’t found a nice overnight-trip bag that is designed for walking around carrying a laptop. (Did I mention that I like to walk?) Most of them are dorky or plain useless, and my current theory is that the solution involves getting a smaller (say 11-13″) laptop. Because the university depreciates laptops over 4 years, I still have to wait a year to test the theory.

I tend to doodle when thinking or preparing a talk. I prefer to write on unlined photocopy paper with a pen or pencil. A fountain pen is nice, but a $0.20 pen will do too; it has the advantage of being i) cheap and ii) available everywhere.

I like to take pictures and muck around with shutter speeds and apertures, which doesn’t mean that I’m any good at it. I use a Nikon Coolpix P7100 camera, but I’m sure that a Canon G12 would do the job as well. It is the smallest camera that gives me the degree of control I like. I process the pictures in Lightroom, which is just OK, but, again, it sort of fits my platform agnosticism.

I’m slowly moving to ebooks, for which I use a Sony Reader (which I got for free) that I manage using Calibre. I keep wireless disabled and non-configured: it is only for reading books and I often use the dictionary feature while reading (I’m always surprised by the large number of English words).

What would be your dream setup?

This would be a ‘sabbatical package’ where I would spend 6 months living in another (earthquake-proof) country, near the ocean, with my family, good food, a light notebook with a week’s worth of battery life, decent internet connection and the chance to catch up with my research subject.

P.S. 2012-04-19. I came across this post at 37signals discussing a switch from OS X to Ubuntu. I think that there is a class of use cases (e.g. web development, scientific programming) where moving from one to the other should be relatively painless.

On the destruction of a trial of genetically modified pines

The media in New Zealand briefly covered the destruction of a trial of genetically modified pines (Pinus radiata D. Don, common names radiata pine, Monterey pine) near Rotorua. This is not the first time that Luddites have destroyed a trial, ignoring that such trials are established following regulations from the Environmental Protection Agency. Most people have discussed this pseudo-religious vandalism either from the point of view of wasted resources (money and, more importantly, time and delays in publication for the scientists) or from the point of view of criminal activity.

I will discuss something slightly different: when would we plant genetically modified trees?

Some background first

In New Zealand, forest plantations are established by the private sector (mostly forest companies and small growers, usually farmers). Most of the stock planted in the country has some degree of (traditional) breeding, ranging from seed mixes with a large number of parents to the deployment of genetically identical clones. The higher the degree of improvement, the more likely it is that deployment involves a small number of highly selected genotypes. Overall, most tree plantations are based on open-pollinated seed with a modest degree of genetic improvement, which is much more genetically diverse than most agricultural crops. In contrast, agricultural crops tend to deploy named clonal varieties, which are what we buy in supermarkets: Gold kiwifruit, Gala apples, Nadine potatoes, etc.

Stating the obvious: tree and agricultural growers will pay more for genetic material if they expect that the seeds, cuttings, tubers, etc. are going to provide a higher quantity and/or quality of products, which will pay for the extra expense. Here we can see a big difference between people growing trees and those growing annual/short-rotation crops: there is a large lag between tree establishment and income coming from the trees, which means that when one runs a discounted cash flow analysis to estimate profitability:

  1. Income is in the distant future (say 25-30 years) and is heavily discounted.
  2. Establishment costs, which include buying the genetic material, are not discounted because they happen right now.

Unsurprisingly, growers want to reduce establishment costs as much as they can, and remember that the cost of trees is an important component. This means that most people planting trees will go for cheaper trees with a low level of genetic improvement (often seedlings), unless they are convinced that they can recover the extra expense with more improved trees (usually clones, which cost at least twice as much as seedlings).
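
A back-of-the-envelope example of the discounting effect, with entirely hypothetical figures just to show the mechanics:

# Hypothetical numbers: income at harvest vs costs paid today
harvest_income = 20000   # $/ha received at harvest (made-up figure)
harvest_year   = 28      # somewhere in the 25-30 year range
rate           = 0.08    # assumed discount rate

harvest_income / (1 + rate)^harvest_year   # ~ $2,300/ha in today's money

# Establishment costs (including the trees) are paid now, so they are not
# discounted; an extra $500/ha on better genetic material (again, made up)
# has to be recovered from that already heavily discounted future income.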

What’s the relationship with genetic modification?

Modification of any organism is an expensive process, which means that:

  1. One would only modify individuals with an outstanding genetic background; i.e. start with a good genotype to end up with a great one.
  2. Successful modifications will be clonally propagated to scale up the modification, driving down unit cost.

Thus, we have a combination of very good genotypes plus clonal propagation plus no discounting, which would make establishment costs very high (although not impossible). There is a second element that, at least for now, would delay adoption. Most large forest growers will have some type of product certification, which establishes that the grower is using good forestry, environmental and social practices. Think of it as a sticker that says the producer of this piece of wood is a good guy, so please feel confident about buying this product; that is, the sticker is part of a marketing strategy. Currently some forest certification organizations do not accept the use of genetically modified organisms (e.g. Forest Certification Council, PDF of GMO policy).

This does not mean that it is not financially possible to plant genetically modified trees. For one thing, modification costs would drop with economies of scale (as for most biotechnologies), and one of the reasons we don’t have these economies is the political pressure by almost-religious zealots against GMOs, which makes people scared of being the first to plant GM trees/plants. Another option is to change the GMO policy of some certification agencies or to rely on other certification organizations that do accept GMOs. Each individual forest company would have to evaluate the trade-offs of the certification decision, as they do not work as a block.

A simple scenario

Roughly 80% of the forest plantations in New Zealand correspond to radiata pine. Now imagine that we face a very destructive pest or disease that has the potential to severely damage the survival/growth of the trees. I know that it would take us a long time (decades?) to breed trees resistant to this problem. I also know that the GM crowd could insert several disease-resistance genes and silence flowering, so the modified trees would not reproduce. Would you support the use of genetic modification to save one of the largest industries of the country? I would.

However, before using the technology I would like to have access to data from trials growing in New Zealand conditions. The destruction of trials makes it extremely difficult to make informed decisions, and this is the worst crime. These people are not just destroying trees but damaging our ability to make decisions properly as a society, evaluating the pros and cons of our activities.

P.S. These are just my personal musings about the subject and do not represent the views of the forest companies, the university or anyone else. I do not work on genetic modification, but I am a quantitative geneticist & tree breeder.
P.S.2. While I do not work on genetic modification—so I’d struggle to call that crowd ‘colleagues’—I support researchers on that topic in their effort to properly evaluate the performance of genetically modified trees.

R, Julia and the shiny new thing

My head exploded a while ago. Perhaps not my head, but my brain was all mushy after working every day of March and the first week of April; an explanation—as good as any—for the post hiatus. Back to the post title.

It is not a mystery that for a while there has been some underlying unhappiness in the R world. Ross Ihaka and Duncan Temple Lang have mused on starting over (PDF, 2008). Similar comments were voiced by Ihaka in Christian Robert’s blog (2010) and were probably at the root of the development of Incanter (based on Clojure). Vince Buffalo pointed out the growing pains of R but was a realist in his post about Julia: one thing is having a cool base language but a different one is recreating R’s package ecosystem.

A quick look at R-bloggers will show several posts mentioning Julia, including A-listers (in the R world, forget Hollywood) like Douglas Bates (of nlme and lme4 fame). They are not alone, as many of us (myself included) are prone to suffer from the ‘oh, new, shiny’ syndrome. We are always looking for that new car smell in a language, one that promises to deliver c('consistency', 'elegance', 'maintainability', 'speed', ...); each of us is looking for a different mix and order of requirements.

Personally, I do not find R particularly slow. Put another way, I tend to struggle with a particular family of problems (multivariate linear mixed models) that require heavy use of external libraries written in C or FORTRAN. Whatever language I use will call these libraries, so I may not stand to gain a lot from using insert shiny language name here. I doubt that one would gain much with parallelization in this class of problems (but I stand to be corrected). This does not mean that I would not consider a change.

Give me consistency + elegance and you start getting my attention. What do I mean? I have been reading a book on ‘Doing X with R’ so I can write a review about it. This involved looking at the book with newbie eyes and, presto!, R can be a real eyesore and the notation can drive you insane sometimes. At about the same time Vince Buffalo was posting his thoughts on Julia (early March) I wrote:

The success of some of Hadley Wickham’s packages got me thinking about underlying design issues in R that make functions so hard for users to master. Don’t get me wrong, I still think that R is great, but why are there so many problems understanding part of the core functionality? A quick web search will highlight that there is, for example, an incredible amount of confusion on how to use the apply family of functions. The management of dates and strings is also a sore point. I perfectly understand the need for, and even the desirability of, having new packages that extend the functionality of R. However, this is another kettle of fish; we are talking about making sane design choices so there is no need to repackage basic functionality to make it usable.
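
As a concrete illustration of the kind of confusion I mean (these examples are mine, chosen only to make the point):

# Three members of the apply family, each with a different interface and
# a different return type
m = matrix(1:6, nrow = 2)
apply(m, 2, sum)               # needs a margin argument; returns a vector here
lapply(1:3, function(i) i^2)   # always returns a list
sapply(1:3, function(i) i^2)   # 'simplifies' the result... sometimes

# Dates are similarly uneven: easy to read, fiddly to format
as.Date("2012-04-26")
format(Sys.Date(), "%d/%m/%Y")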

Maybe the question is not as dramatic as Do we need a new, shiny language? but rather Can we get rid of the cruft that R has accumulated over the years? Can we make the language more consistent, easier to learn, write and maintain?

Gratuitous picture: cabbage trees or mushy brain? (Photo: Luis).

Some may argue that one could get away with implementing the 20% of the functionality that’s used by 80% of the users. Joel Spolsky already discussed the failure of this approach in 2001.