When R, or any other language, is not enough

This post is tangential to R, although R has a fair share of the issues I mention here, which include research reproducibility, open source, paying for software, multiple languages, salt and pepper.

There is an increasing interest in the reproducibility of research. In many topics we face multiple, often conflicting claims, and as researchers we value the ability to evaluate those claims, including repeating/reproducing research results. While I share the interest in reproducibility, sometimes I feel we obsess too much over only one part of the research process: the statistical analysis. Even there, many people focus not on the models per se but only on the code for the analysis, which, the argument goes, should rely exclusively on tools that are free of charge.

There has been enormous progress in the R world on literate programming, where the combination of RStudio + Markdown + knitr has made analyzing data and documenting the process almost enjoyable. Nevertheless, and here comes the BUT, there is a large difference between making the code repeatable and making the research reproducible.
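As a hypothetical illustration of that workflow (the file name and model are made up, not from any of my analyses), a minimal R Markdown file mixes prose with code chunks, and a single knitr call turns it into a report:

# --- contents of a hypothetical report.Rmd ---
# Fitting a quick model to the built-in iris data:
#
# ```{r model-fit}
# fit = lm(Sepal.Length ~ Species, data = iris)
# summary(fit)
# ```

# One call weaves the text, code and output into an HTML report
knitr::knit2html('report.Rmd')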

Let me give you a concrete case: I am currently working on a project that relies on two trials, which have taken a decade to grow. We took a few hundred increment cores from a sample of trees and processed them using a densitometer, an X-ray diffractometer and a few other lab toys. By now you get the idea: actually replicating the research would take quite a few resources before you even start to play with free software. At that point, of course, I want to be able to get the most out of my data, which means that I won’t settle for a half-assed model just because the software is not able to fit it. If you think about it, spending a couple of grand on software (say ASReml and Mathematica licenses) doesn’t sound outrageous at all. Furthermore, reproducing this piece of research would require a decade, access to the genetic material and the lab toys. I’ll give you the code for free, but I can’t give you ten years or $0.25 million…

In addition, the research process may require linking disparate sources of data, for which other languages (e.g. Python) may be more appropriate. Sometimes R is the perfect tool for the job, while other times I feel like we have reached peak VBS (Visual Basic Syndrome) in R: people want to use it for everything, even when it’s a bad idea.

In summary,

  • research is much more than a few lines of R (although they are very important),
  • even when considering only data collection and analysis, it is a good idea to know more than a single language/software, because it broadens your analytical options, and
  • I prefer free (freedom + beer) software for research; however, I rely on non-free, commercial software for part of my work because it happens to be the best option for specific analyses.

Disclaimer: my primary analysis language is R and I often use lme4, MCMCglmm and INLA (all free). However, many (if not most) of my analyses that use genetic information rely on ASReml (paid, not open source). I’ve used Mathematica, Matlab, Stata and SAS for specific applications with reasonably priced academic licenses.

Gratuitous picture: 3000 trees leaning in a foggy Christchurch day (Photo: Luis, click to enlarge).

INLA: Bayes goes to Norway

INLA is not the Norwegian answer to ABBA; that would probably be a-ha. INLA is the answer to ‘Why do I have enough time to cook a three-course meal while running MCMC analyses?’.

Integrated Nested Laplace Approximation (INLA) is based on direct numerical integration (rather than simulation, as in MCMC) which, according to people ‘in the know’, allows (see the sketch after this list for the general calling pattern):

  • the estimation of marginal posteriors for all parameters,
  • marginal posteriors for each random effect, and
  • estimation of the posterior for linear combinations of random effects.
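As a minimal sketch of that calling pattern (the toy data and names y, x and block are made up for illustration, not part of the fly analysis below): random effects are declared with f() inside the model formula, and the fitted object carries the marginal posteriors.

library(INLA)

# Made-up toy data: 100 observations in 10 blocks
toy = data.frame(y = rnorm(100),
                 x = runif(100),
                 block = factor(rep(1:10, each=10)))

# Random effects go inside f(); here an iid block effect
form = y ~ x + f(block, model='iid')
fit = inla(formula=form, family='gaussian', data=toy)

summary(fit)               # marginal posteriors for all parameters
fit$summary.random$block   # marginal posterior for each block effect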

Rather than going back to the usual univariate randomized complete block or split-plot designs that I have analyzed before (here using REML and here using MCMC), I’ll go for the analyses that motivated me to look for INLA in the first place. Sol, a Ph.D. student, ran a cool experiment here at the university crossing a bunch of flies under different mating strategies over several generations, to check the effect on breeding success; that means we also have to keep track of the pedigree. While looking at some of the reproductive output data I wanted to fit a logistic model using MCMCglmm but, unfortunately, I was running into millions (~3M) of iterations to get a good idea of the posterior and, therefore, leaving the computer running overnight. Almost by accident I came across INLA and started playing with it.

Gratuitous picture: Cubist apartments not in Norway (Photo: Luis, click to enlarge).

# Set working directory containing data and code
setwd('~/Dropbox/research/2010/flies')
 
# Packages needed for analysis
# This code requires the latest version of INLA, which is distributed
# via the r-inla.org repository rather than CRAN
require(INLA)
require(pedigreemm)      # pedigree(), relfactor(), Tinv, D, ...
 
####### Pedigree and assessment files
# Reads pedigree file
ped = read.csv('recodedped.csv', header=FALSE)
names(ped) = c('id', 'mum', 'dad')
 
# Reads data file
dat = read.csv('ggdatall.csv', header=TRUE)
dat$cross = factor(dat$cross)
 
# Pedigree object for pedigreemm functions
pedPMM = with(ped, pedigreemm::pedigree(sire=dad, dam=mum, label=id))
 
# Pedigree precision matrix (A inverse)
# T^{-1} in A^{-1} = (T^{-1})' D^{-1} T^{-1}
Tinv = as(pedPMM, "sparseMatrix") 
D    = Diagonal(x=Dmat(pedPMM))   # D      in A = TDT'
Dinv = solve(D)
Ainv = t(Tinv) %*% Dinv %*% Tinv
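
As a sanity check (my addition; it assumes the getAInv() helper available in pedigreemm), we can compare the hand-built inverse against the one computed by the package:

# Optional check: hand-built A inverse vs. pedigreemm's own
# The maximum absolute difference should be numerically zero
max(abs(Ainv - getAInv(pedPMM)))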

Up to this point we have read the response data and the pedigree, and constructed the inverse of the additive relationship matrix. We also needed to build a contrast matrix to compare the mean response between the different mating strategies. I was struggling there and contacted Gregor Gorjanc, who kindly emailed me the proper way to do it.

# Define contrasts to compare cross types. Thanks to Gregor Gorjanc
# for coding contrasts
k = nlevels(dat$cross)
tmp = matrix(nrow=(k-1)*k/2, ncol=k)
 
##               1   2   3   4   5   6 
tmp[ 1, ] =  c(  1, -1, NA, NA, NA, NA) ## c1-c2
tmp[ 2, ] =  c(  1, NA, -1, NA, NA, NA) ##   -c3
tmp[ 3, ] =  c(  1, NA, NA, -1, NA, NA) ##   -c4
tmp[ 4, ] =  c(  1, NA, NA, NA, -1, NA) ##   -c5
tmp[ 5, ] =  c(  1, NA, NA, NA, NA, -1) ##   -c6
tmp[ 6, ] =  c( NA,  1, -1, NA, NA, NA) ## c2-c3
tmp[ 7, ] =  c( NA,  1, NA, -1, NA, NA) ##   -c4
tmp[ 8, ] =  c( NA,  1, NA, NA, -1, NA) ##   -c5
tmp[ 9, ] =  c( NA,  1, NA, NA, NA, -1) ##   -c6
tmp[10, ] =  c( NA, NA,  1, -1, NA, NA) ## c3-c4
tmp[11, ] =  c( NA, NA,  1, NA, -1, NA) ##   -c5
tmp[12, ] =  c( NA, NA,  1, NA, NA, -1) ##   -c6
tmp[13, ] =  c( NA, NA, NA,  1, -1, NA) ## c4-c5
tmp[14, ] =  c( NA, NA, NA,  1, NA, -1) ##   -c6
tmp[15, ] =  c( NA, NA, NA, NA,  1, -1) ## c5-c6
 
 
# Make Linear Combinations
LC = inla.make.lincombs(cross=tmp)
 
# Assign names to combinations
t = 0
for(i in 1:(k-1)) {
  for(j in (i+1):k) {
    t = t + 1
    names(LC)[t] = paste("c", i, "-", "c", j, sep="")
  }
}
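
A quick check that the bookkeeping worked (again my addition): the contrast names should run from c1-c2 through c5-c6, in the same order as the rows of tmp.

# Should print the 15 names c1-c2, c1-c3, ..., c5-c6
names(LC)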

There is another related package (Animal INLA) that takes care of i. giving details about the priors and ii. ‘easily’ fitting models that include a term with a pedigree (an animal model, in quantitative genetics speak). However, I wanted the assumptions to be explicit, so I read the source of Animal INLA and shamelessly copied the useful bits (read the source, Luke!).

####### Analysis for binomial traits
####### Plain-vanilla INLA version
# Feeling more comfortable with an *explicit* statement of assumptions
# (rather than having them hidden behind animal.inla())
 
# Function to backconvert logits to probabilities
# (inverse logit; equivalent to base R's plogis())
back.conv = function(values){
  return(1/(1+exp(-(values))))
}
 
# Function to get the posterior of the odds
# Thanks to Helena Moltchanova
inla.marginal.summary = function(x){
  m1 = inla.emarginal(function(z) exp(z), marginal=x)     # posterior mean of exp(z)
  odds = inla.marginal.transform(function(z) exp(z), x)   # full marginal of the odds
  q = inla.qmarginal(p=c(0.025, 0.975), marginal=odds)    # 95% credible interval
  c("0.025quant"=q[1], "mean"=m1, "0.975quant"=q[2])      # m1 is a mean, not the median
}
 
 
# Model for pupae/eggs
# Drops the few observations with no reproductive output (they trip INLA)
# and any with more pupae than eggs
no0eggs = subset(dat, eggs > 0 & pupae <= eggs)
 
# Actual model
# cross: iid effect with the log-precision fixed at -10 (virtually
#        flat, so the cross means are essentially unshrunk)
# id:    additive genetic effect with precision structure Ainv
mpueg = pupae ~ f(cross, model='iid', constr=TRUE,
                  hyper=list(theta=list(initial=-10, fixed=TRUE))) +
                f(id, model='generic0', constr=TRUE, Cmatrix=Ainv,
                  hyper=list(theta=list(param=c(0.5,0.5), fixed=FALSE)))
 
# INLA call
fpueg =  inla(formula=mpueg, family='binomial', data=no0eggs,
              lincomb=LC,
              Ntrials=eggs,
              control.compute=list(dic=FALSE))
 
# Results
summary(fpueg)
 
# Call:
#   c("inla(formula = mpueg, family = \"binomial\", data = no0eggs, 
#     Ntrials = eggs, ",  "    lincomb = LC, control.compute = list(dic = FALSE))")
# 
# Time used:
#   Pre-processing    Running inla Post-processing           Total 
# 0.2712612       1.1172159       2.0439510       3.4324281 
# 
# Fixed effects:
#   mean        sd 0.025quant 0.5quant 0.975quant       kld
# (Intercept) 1.772438 0.1830827   1.417413 1.770863   2.136389 0.5833235
# 
# Linear combinations (derived):
#   ID        mean        sd 0.025quant    0.5quant  0.975quant kld
# c1-c2  0 -0.26653572 0.7066540 -1.6558225 -0.26573011  1.11859967   0
# c1-c3  1  0.04150999 0.7554753 -1.4401435  0.04104020  1.52622856   0
# c1-c4  2 -0.08777325 0.6450669 -1.3557501 -0.08713005  1.17693349   0
# c1-c5  3 -1.36702960 0.6583121 -2.6615604 -1.36618274 -0.07690788   0
# c1-c6  4 -1.82037714 0.8193280 -3.4338294 -1.81848244 -0.21714431   0
# c2-c3  5  0.30804735 0.7826815 -1.2248185  0.30677279  1.84852340   0
# c2-c4  6  0.17876229 0.5321948 -0.8654273  0.17859036  1.22421409   0
# c2-c5  7 -1.10049385 0.7466979 -2.5663142 -1.10046590  0.36558211   0
# c2-c6  8 -1.55383673 0.8188321 -3.1640965 -1.55276603  0.05084282   0
# c3-c4  9 -0.12928419 0.7475196 -1.5996080 -0.12817855  1.33522000   0
# c3-c5 10 -1.40854298 0.6016539 -2.5930656 -1.40723901 -0.23103707   0
# c3-c6 11 -1.86189314 0.8595760 -3.5555571 -1.85954031 -0.18100418   0
# c4-c5 12 -1.27925604 0.6998640 -2.6536362 -1.27905616  0.09438701   0
# c4-c6 13 -1.73259977 0.7764105 -3.2600936 -1.73134961 -0.21171790   0
# c5-c6 14 -0.45334267 0.8179794 -2.0618730 -0.45229981  1.14976690   0
# 
# Random effects:
#   Name    Model	  	Max KLD 
# cross   IID model 
# id   Generic0 model 
# 
# Model hyperparameters:
#   mean    sd      0.025quant 0.5quant 0.975quant
# Precision for id 0.08308 0.01076 0.06381    0.08244  0.10604   
# 
# Expected number of effective parameters(std dev): 223.95(0.7513)
# Number of equivalent replicates : 1.121 
# 
# Marginal Likelihood:  -1427.59 
 
fpueg$summary.random$cross
#   ID       mean        sd  0.025quant   0.5quant 0.975quant          kld
# 1  1 -0.5843466 0.4536668 -1.47561024 -0.5840804  0.3056632 0.0178780930
# 2  2 -0.3178102 0.4595676 -1.21808638 -0.3184925  0.5865565 0.0009666916
# 3  3 -0.6258600 0.4978254 -1.60536281 -0.6250077  0.3491075 0.0247426578
# 4  4 -0.4965763 0.4071715 -1.29571071 -0.4966277  0.3030747 0.0008791629
# 5  5  0.7826817 0.4389003 -0.07756805  0.7821937  1.6459253 0.0077476806
# 6  6  1.2360387 0.5768462  0.10897529  1.2340813  2.3744368 0.0451357379
 
# Backtransforms point estimates and credible intervals for the
# contrasts: differences of logits become odds ratios
for(name in names(fpueg$marginals.lincomb.derived)){
  summa = inla.marginal.summary(fpueg$marginals.lincomb.derived[[name]])
  cat(name, summa, '\n')
}
 
# c1-c2 0.1894451 0.9831839 3.019878 
# c1-c3 0.2338952 1.387551 4.534581 
# c1-c4 0.256858 1.127751 3.204961 
# c1-c5 0.0695406 0.3164847 0.9145132 
# c1-c6 0.03157478 0.2264027 0.792517 
# c2-c3 0.289088 1.850719 6.255175 
# c2-c4 0.4213069 1.377848 3.366947 
# c2-c5 0.0759222 0.4398384 1.420934 
# c2-c6 0.04135211 0.2955985 1.035951 
# c3-c4 0.1996085 1.16168 3.747526 
# c3-c5 0.0746894 0.2929174 0.7847903 
# c3-c6 0.02774805 0.2245797 0.821099 
# c4-c5 0.06988459 0.355529 1.084414 
# c4-c6 0.03780307 0.2389529 0.7974092 
# c5-c6 0.1245211 0.8878682 3.108852
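
Note that several of these odds ratios (c1-c5, c1-c6, c3-c5, c3-c6 and c4-c6) have 95% credible intervals that exclude 1, matching the linear combinations above whose intervals exclude 0.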

A quick look at the time taken by INLA shows that it is in the order of seconds (versus overnight using MCMC). I have tried a few examples and the MCMCglmm and INLA results tend to be very close; however, figuring out how to code models has been very tricky for me. INLA follows the glorious tradition of not having a ‘proper’ manual but, instead, a number of examples with code; in fact, they reimplement BUGS’s examples. Personally, I struggle with that approach to documentation, but you may be the right type of person for it. Note for letter to Santa: real documentation for INLA.

I was talking with a student about using Norwegian software and he mentioned Norwegian Black Metal. That got me thinking about what the developers of the package might look like; would they be like Gaahl of Gorgoroth (see interview here)?

Gratuitous picture: Gaahl of Gorgoroth, not an INLA developer.

Talk about disappointment! In fact Håvard Rue, INLA mastermind, looks like a nice, clean, non-black-metal statistician. To be fair, it would be quite hard to code in any language wearing those spikes…