When R, or any other language, is not enough

This post is tangential to R, although R has a fair share of the issues I mention here, which include research reproducibility, open source, paying for software, multiple languages, salt and pepper.

There is an increasing interest in the reproducibility of research. In many topics we face multiple, often conflicting claims and as researchers we value the ability to evaluate those claims, including repeating/reproducing research results. While I share the interest in reproducibility, some times I feel we are obsessing too much on only part of the research process: statistical analysis. Even here, many people focus not on the models per se, but only on the code for the analysis, which should only use tools that are free of charge.

There has been enormous progress in the R world on literate programming, where the combination of RStudio + Markdown + knitr has made analyzing data and documenting the process almost enjoyable. Nevertheless, and here is the BUT coming, there is a large difference between making the code repeatable and making research reproducible.

As an example, currently I am working in a project that relies on two trials, which have taken a decade to grow. We took a few hundred increment cores from a sample of trees and processed them using a densitometer, an X-Ray diffractometer and a few other lab toys. By now you get the idea, actually replicating the research may take you quite a few resources before you even start to play with free software. At that point, of course, I want to be able to get the most of my data, which means that I won’t settle for a half-assed model because the software is not able to fit it. If you think about it, spending a couple of grands in software (say ASReml and Mathematica licenses) doesn’t sound outrageous at all. Furthermore, reproducing this piece of research would require: a decade, access to genetic material and lab toys. I’ll give you the code for free, but I can’t give you ten years or $0.25 million… In addition, the research process may require linking disparate sources of data for which other languages (e.g. Python) may be more appropriate. Some times R is the perfect tool for the job, while other times I feel like we have reached peak VBS (Visual Basic Syndrome) in R: people want to use it for everything, even when it’s a bad idea. In summary, • research is much more than a few lines of R (although they are very important), • even when considering data collection and analysis it is a good idea to know more than a single language/software, because it broadens analytical options • I prefer free (freedom+beer) software for research; however, I rely on non-free, commercial software for part of my work because it happens to be the best option for specific analyses. Disclaimer: my primary analysis language is R and I often use lme4, MCMCglmm and INLA (all free). However, many (if not most) of my analyses that use genetic information rely on ASReml (paid, not open source). I’ve used Mathematica, Matlab, Stata and SAS for specific applications with reasonably priced academic licenses. Gratuitous picture: 3000 trees leaning in a foggy Christchurch day (Photo: Luis). More sense of random effects I can’t exactly remember how I arrived to Making sense of random effects, a good post in the Distributed Ecology blog (go over there and read it). Incidentally, my working theory is that I follow Scott Chamberlain (@recology_), who follows Karthik Ram ‏(@_inundata) who mentioned Edmund Hart’s (@DistribEcology) post. I liked the discussion, but I thought one could add to the explanation to make it a bit clearer. The idea is that there are 9 individuals, assessed five times each—once under each of five different levels for a treatment—so we need to include individual as a random effect; after all, it is our experimental unit. The code to generate the data, plot it and fit the model is available in the post, but I redid data generation to make it a bit more R-ish and, dare I say, a tad more elegant: require(lme4) require(ggplot2) # Basic dataset idf <- data.frame(ind = factor(rep(1:9, 5)), levs = factor(rep(c('i1', 'i2', 'i3', 'i4', 'i5'), each = 9)), f.means = rep(c(6, 16, 2, 10, 13), each = 9), i.eff = rep(seq(-4, 4, length = 9), 5)) # Function for individual values used in apply ran.ind <- function(f.means, i.eff) rnorm(1, f.means + i.eff, 0.3) # Rich showed me this version of apply that takes more than one argument # https://smtp.biostat.wustl.edu/sympa/biostat/arc/s-news/2005-06/msg00022.html # so we skip loops idf$size <- apply(idf[, 3:4], 1,
function(x) do.call('ran.ind', as.list(x)))

ggplot(idf, aes(x = levs, y = size, group = ind, colour = ind)) +
geom_point() + geom_path()

# Fit linear mixed model (avoid an overall mean with -1)
m3 <- lmer(size ~ levs - 1 + (1 | ind), data = idf)
summary(m3)

# Skipping a few things
#   AIC   BIC logLik deviance REMLdev
# 93.84 106.5 -39.92    72.16   79.84
#Random effects:
# Groups   Name        Variance Std.Dev.
# ind      (Intercept) 7.14676  2.67334
# Residual             0.10123  0.31816
#Number of obs: 45, groups: ind, 9

# Show fixed effects
fixef(m3)

#   levsi1    levsi2    levsi3    levsi4    levsi5
# 5.824753 15.896714  2.029902  9.969462 12.870952



Original simulated data, showing the effect of treatment (fixed effect) and each individual.

What we can do to better understand what’s going on is ‘adjust’ the score observations by the estimated fixed effects and plot those values to see what we are modeling with the random effects:

# Substract fixed effects
idf$adjusted <- idf$size - rep(fixef(m3), each = 9)
ggplot(idf, aes(x = levs, y = adjusted, group = ind, colour = ind)) +
geom_point() + geom_path()

# Display random effects
ranef(m3)

#$ind # (Intercept) #1 -3.89632810 #2 -2.83054394 #3 -1.99715524 #4 -1.12733342 #5 0.06619981 #6 0.95605162 #7 2.00483963 #8 2.99947727 #9 3.82479236  Data ‘adjusted’ by fixed effects. The random intercepts would be lines going through the average of points for each individual. The random effects for individual or, better, the individual-level intercepts are pretty much the lines going through the middle of the points for each individual. Furthermore, the variance for ind is the variance of the random intercepts around the’adjusted’ values, which can be seen comparing the variance of random effects above (~7.15) with the result below (~7.13). var(unlist(ranef(m3))) #[1] 7.12707  Distributed Ecology then goes on to randomize randomize the individuals within treatment, which means that the average deviation around the adjusted means is pretty close to zero, making that variance component close to zero. I hope this explanation complements Edmund Hart’s nice post. P.S. If you happen to be in the Southern part of South America next week, I’ll be here and we can have a chat (and a beer, of course). Surviving a binomial mixed model A few years ago we had this really cool idea: we had to establish a trial to understand wood quality in context. Sort of following the saying “we don’t know who discovered water, but we are sure that it wasn’t a fish” (attributed to Marshall McLuhan). By now you are thinking WTF is this guy talking about? But the idea was simple; let’s put a trial that had the species we wanted to study (Pinus radiata, a gymnosperm) and an angiosperm (Eucalyptus nitens if you wish to know) to provide the contrast, as they are supposed to have vastly different types of wood. From space the trial looked like this: The reason you can clearly see the pines but not the eucalypts is because the latter were dying like crazy over a summer drought (45% mortality in one month). And here we get to the analytical part: we will have a look only at the eucalypts where the response variable can’t get any clearer, trees were either totally dead or alive. The experiment followed a randomized complete block design, with 50 open-pollinated families in 48 blocks. The original idea was to harvest 12 blocks each year but—for obvious reasons—we canned this part of the experiment after the first year. The following code shows the analysis in asreml-R, lme4 and MCMCglmm: load('~/Dropbox/euc.Rdata') library(asreml) sasreml = asreml(surv ~ 1, random = ~ Fami + Block, data = euc, family = asreml.binomial(link = 'logit')) summary(sasreml)$varcomp

#                      gamma component  std.error  z.ratio
#Fami!Fami.var     0.5704205 0.5704205 0.14348068 3.975591
#Block!Block.var   0.1298339 0.1298339 0.04893254 2.653324
#R!variance        1.0000000 1.0000000         NA       NA

#                 constraint
#Fami!Fami.var      Positive
#Block!Block.var    Positive
#R!variance            Fixed

# Quick look at heritability
varFami = summary(sasreml)$varcomp[1, 2] varRep = summary(sasreml)$varcomp[2, 2]
h2 = 4*varFami/(varFami + varRep + 3.29)
h2
#[1] 0.5718137

library(lme4)
slme4 = lmer(surv ~ 1 + (1|Fami) + (1|Block),
data = euc,

summary(slme4)

#Generalized linear mixed model fit by the Laplace approximation
#Formula: surv ~ 1 + (1 | Fami) + (1 | Block)
#   Data: euc
#  AIC  BIC logLik deviance
# 2725 2742  -1360     2719
#Random effects:
# Groups   Name        Variance Std.Dev.
# Fami     (Intercept) 0.60941  0.78065
# Block    (Intercept) 0.13796  0.37143
#Number of obs: 2090, groups: Fami, 51; Block, 48
#
#Fixed effects:
#            Estimate Std. Error z value Pr(>|z|)
#(Intercept)   0.2970     0.1315   2.259   0.0239 *

# Quick look at heritability
varFami = VarCorr(slme4)$Fami[1] varRep = VarCorr(slme4)$Block[1]
h2 = 4*varFami/(varFami + varRep + 3.29)
h2
#[1] 0.6037697

# And let's play to be Bayesians!
library(MCMCglmm)
pr = list(R = list(V = 1, n = 0, fix = 1),
G = list(G1 = list(V = 1, n = 0.002),
G2 = list(V = 1, n = 0.002)))

sb <- MCMCglmm(surv ~ 1,
random = ~ Fami + Block,
family = 'categorical',
data = euc,
prior = pr,
verbose = FALSE,
pr = TRUE,
burnin = 10000,
nitt = 100000,
thin = 10)

plot(sb$VCV)  You may be wondering Where does the 3.29 in the heritability formula comes from? Well, that’s the variance of the link function that, in the case of the logit link is pi*pi/3. In the case of MCMCglmm we can estimate the degree of genetic control quite easily, remembering that we have half-siblings (open-pollinated plants): # Heritability h2 = 4*sb$VCV[, 'Fami']/(sb$VCV[, 'Fami'] + sb$VCV[, 'Block'] + 3.29 + 1)
posterior.mode(h2)
#     var1
#0.6476185

HPDinterval(h2)
#         lower     upper
#var1 0.4056492 0.9698148
#attr(,"Probability")
#[1] 0.95

plot(h2)


By the way, it is good to remember that we need to back-transform the estimated effects to probabilities, with very simple code:

# Getting mode and credible interval for solutions
inv.logit(posterior.mode(sb$Sol)) inv.logit(HPDinterval(sb$Sol, 0.95))


Even if one of your trials is trashed there is a silver lining: it is possible to have a look at survival.

Coming out of the (Bayesian) closet

Until today all the posts in this blog have used a frequentist view of the world. I have a confession to make: I have an ecumenical view of statistics and I do sometimes use Bayesian approaches in data analyses. This is not quite one of those “the truth will set you free” moments, but I’ll show that one could almost painlessly repeat some of the analyses I presented before using MCMC.

MCMCglmm is a very neat package that—as its rather complicated em cee em cee gee el em em acronym implies—implements MCMC for generalized linear mixed models. We’ll skip that the frequentist fixed vs random effects distinction gets blurry in Bayesian models and still use the f… and r… terms. I’ll first repeat the code for a Randomized Complete Block design with Family effects (so we have two random factors) using both lme4 and ASReml-R and add the MCMCglmm counterpart:

library(lme4)
m1 = lmer(bden ~ 1 + (1|Block) + (1|Family),
data = trees)

summary(m1)

#Linear mixed model fit by REML
#Formula: bden ~ (1 | Block) + (1 | Family)
#   Data: a
#  AIC  BIC logLik deviance REMLdev
# 4572 4589  -2282     4569    4564
#Random effects:
# Groups   Name        Variance Std.Dev.
# Family   (Intercept)  82.803   9.0996
# Block    (Intercept) 162.743  12.7571
# Residual             545.980  23.3662
#Number of obs: 492, groups: Family, 50; Block, 11

#Fixed effects:
#            Estimate Std. Error t value
#(Intercept)  306.306      4.197   72.97

library(asreml)
m2 = asreml(bden ~ 1, random = ~ Block + Family,
data = trees)

summary(m2)$varcomp # gamma component std.error z.ratio #Block!Block.var 0.2980766 162.74383 78.49271 2.073362 #Family!Family.var 0.1516591 82.80282 29.47153 2.809587 #R!variance 1.0000000 545.97983 37.18323 14.683496 # constraint #Block!Block.var Positive #Family!Family.var Positive #R!variance Positive m2$coeff$fixed #(Intercept) # 306.306  We had already established that the results obtained from lme4 and ASReml-R were pretty much the same, at least for relatively simple models where we can use both packages (as their functionality diverges later for more complex models). This example is no exception and we quickly move to fitting the same model using MCMCglmm: library(MCMCglmm) priors = list(R = list(V = 260, n = 0.002), G = list(G1 = list(V = 260, n = 0.002), G2 = list(V = 260, n = 0.002))) m4 = MCMCglmm(bden ~ 1, random = ~ Block + Family, family = 'gaussian', data = a, prior = priors, verbose = FALSE, pr = TRUE, burnin = 10000, nitt = 20000, thin = 10) plot(mcmc.list(m4$VCV))

autocorr(m4$VCV) posterior.mode(m4$VCV)
#    Block    Family     units
#126.66633  72.97771 542.42237

HPDinterval(m4$VCV) # lower upper #Block 33.12823 431.0233 #Family 26.34490 146.6648 #units 479.24201 627.7724  The first difference is that we have to specify priors for the coefficients that we would like to estimate (by default fixed effects, the overall intercept for our example, start with a zero mean and very large variance: 106). The phenotypic variance for our response is around 780, which I split into equal parts for Block, Family and Residuals. For each random effect we have provided our prior for the variance (V) and a degree of belief on our prior (n). In addition to the model equation, name of the data set and prior information we need to specify things like the number of iterations in the chain (nitt), how many we are discarding for the initial burnin period (burnin), and how many values we are keeping (thin, every ten). Besides the pretty plot of the posterior distributions (see previous figure) they can be summarized using the posterior mode and high probability densities. One of the neat things we can do is to painlessly create functions of variance components and get their posterior mode and credible interval. For example, the heritability (or degree of additive genetic control) can be estimated in this trial with full-sib families using the following formula: $$\hat{h^2} = \frac{2 \sigma_F^2}{\sigma_F^2 + \sigma_B^2 + \sigma_R^2}$$ h2 <- 2 * m4$VCV[, 'Family']/(m4$VCV[, 'Family'] + m4$VCV[, 'Block'] + m4$VCV[, 'units']) plot(h2) posterior.mode(h2) #0.1887017 HPDinterval(h2) # lower upper #var1 0.05951232 0.3414216 #attr(,"Probability") #[1] 0.95  There are some differences on the final results between ASReml-R/lme4 and MCMCglmm; however, the gammas (ratios of variance component/error variance) for the posterior distributions are very similar, and the estimated heritabilities are almost identical (~0.19 vs ~0.21). Overall, MCMCglmm is a very interesting package that covers a lot of ground. It pays to read the reference manual and vignettes, which go into a lot of detail on how to specify different types of models. I will present the MCMCglmm bivariate analyses in another post. P.S. There are several other ways that we could fit this model in R using a Bayesian approach: it is possible to call WinBugs or JAGS (in Linux and OS X) from R, or we could have used INLA. More on this in future posts. Covariance structures In most mixed linear model packages (e.g. asreml, lme4, nlme, etc) one needs to specify only the model equation (the bit that looks like y ~ factors...) when fitting simple models. We explicitly say nothing about the covariances that complete the model specification. This is because most linear mixed model packages assume that, in absence of any additional information, the covariance structure is the product of a scalar (a variance component) by a design matrix. For example, the residual covariance matrix in simple models is R = I σe2, or the additive genetic variance matrix is G = A σa2 (where A is the numerator relationship matrix), or the covariance matrix for a random effect f with incidence matrix Z is ZZ σf2. However, there are several situations when analyses require a more complex covariance structure, usually a direct sum or a Kronecker product of two or more matrices. For example, an analysis of data from several sites might consider different error variances for each site, that is R = Σd Ri, where Σd represents a direct sum and Ri is the residual matrix for site i. Other example of a more complex covariance structure is a multivariate analysis in a single site (so the same individual is assessed for two or more traits), where both the residual and additive genetic covariance matrices are constructed as the product of two matrices. For example, R = IR0, where I is an identity matrix of size number of observations, ⊗ is the Kronecker product (do not confuse with a plain matrix multiplication) and R0 is the error covariance matrix for the traits involved in the analysis. Similarly, G = AG0 where all the matrices are as previously defined and G0 is the additive covariance matrix for the traits. Some structures are easier to understand (at least for me) if we express a covariance matrix (M) as the product of a correlation matrix (C) pre- and post-multiplied by a diagonal matrix (D) containing standard deviations for each of the traits (M = D C D). That is:$latex
M = \left [
\begin{array}{cccc}
v_{11}& c_{12}& c_{13}& c_{14} \\
c_{21}& v_{22}& c_{23}& c_{24} \\
c_{31}& c_{32}& v_{33}& c_{34} \\
c_{41}& c_{42}& c_{43}& v_{44}
\end{array}
\right ]latex
C = \left [
\begin{array}{cccc}
1& r_{12}& r_{13}& r_{14} \\
r_{21}& 1& r_{23}& r_{24} \\
r_{31}& r_{32}& 1& r_{34} \\
r_{41}& r_{42}& r_{43}& 1
\end{array}
\right ]latex
D = \left [
\begin{array}{cccc}
s_{11}& 0& 0& 0 \\
0& s_{22}& 0& 0 \\
0& 0& s_{33}& 0 \\
0& 0& 0& s_{44}
\end{array}
\right ]$where the v are variances, the r correlations and the s standard deviations. If we do not impose any restriction on M, apart from being positive definite (p.d.), we are talking about an unstructured matrix (us() in asreml-R parlance). Thus, M or C can take any value (as long as it is p.d.) as it is usual when analyzing multiple trait problems. There are cases when the order of assessment or the spatial location of the experimental units create patterns of variation, which are reflected by the covariance matrix. For example, the breeding value of an individual i observed at time j (aij) is a function of genes involved in expression at time j – k (aij-k), plus the effect of genes acting in the new measurement (αj), which are considered independent of the past measurement aij = ρjk aij-k + αj, where ρjk is the additive genetic correlation between measures j and k. Rather than using a different correlation for each pair of ages, it is possible to postulate mechanisms which model the correlations. For example, an autoregressive model (ar() in asreml-R lingo), where the correlation between measurements j and k is r|j-k|. In this model M = D CAR D, where CAR (for equally spaced assessments) is:$latex
C_{AR} = \left [
\begin{array}{cccc}
1 & r^{|t_2-t_1|} & \ldots & r^{|t_m-t_1|}\\
r^{|t_2-t_1|} & 1 & \ldots & r^{|t_m-t_2|}\\
\vdots & \vdots & \ddots & \vdots \\
r^{|t_m-t_1|} & r^{|t_m-t_2|} & \ldots & 1
\end{array} \right ]$Assuming three different autocorrelation coefficients (0.95 solid line, 0.90 dashed line and 0.85 dotted line) we can get very different patterns with a few extra units of lag, as shown in the following graph: A model including this structure will certainly be more parsimonious (economic on terms of number of parameters) than one using an unstructured approach. Looking at the previous pattern it is a lot easier to understand why they are called ‘structures’. A similar situation is considered in spatial analysis, where the ‘independent errors’ assumption of typical analyses is relaxed. A common spatial model will consider the presence of autocorrelated residuals in both directions (rows and columns). Here the level of autocorrelation will depend on distance between trees rather than on time. We can get an idea of how separable processes look like using this code: # Separable row col autoregressive process car2 = function(dim, rhor, rhoc) { M = diag(dim) rhor^(row(M) - 1) * rhoc^(col(M) - 1) } library(lattice) levelplot(car2(20, 0.95, 0.85))  This correlation matrix can then be multiplied by a spatial residual variance to obtain the covariance and we can add up a spatially independent residual variance. Much more detail on code notation for covariance structures can be found, for example, in the ASReml-R User Guide (PDF, chapter 4), for nlme in Pinheiro and Bates’s Mixed-effects models in S and S-plus (link to Google Books, chapter 5.3) and in Bates’s draft book for lme4 in chapter 4. Large applications of linear mixed models In a previous post I summarily described our options for (generalized to varying degrees) linear mixed models from a frequentist point of view: nlme, lme4 and ASReml-R, followed by a quick example for a split-plot experiment. But who is really happy with a toy example? We can show a slightly more complicated example assuming that we have a simple situation in breeding: a number of half-sib trials (so we have progeny that share one parent in common), each of them established following a randomized complete block design, analyzed using a ‘family model’. That is, the response variable (dbh: tree stem diameter assessed at breast height—1.3m from ground level) can be expressed as a function of an overall mean, fixed site effects, random block effects (within each site), random family effects and a random site-family interaction. The latter provides an indication of genotype by environment interaction. setwd('~/Dropbox/quantumforest') trees = read.table('treedbh.txt', header=TRUE) # Removing all full-siblings and dropping Trial levels # that are not used anymore trees = subset(trees, Father == 0) trees$Trial = trees$Trial[drop = TRUE] # Which let me with 189,976 trees in 31 trials xtabs(~Trial, data = trees) # First run the analysis with lme4 library(lme4) m1 = lmer(dbh ~ Trial + (1|Trial:Block) + (1|Mother) + (1|Trial:Mother), data = trees) # Now run the analysis with ASReml-R library(asreml) m2 = asreml(dbh ~ Trial, random = ~Trial:Block + Mother + Trial:Mother, data = trees)  First I should make clear that this model has several drawbacks: • It assumes that we have homogeneous variances for all trials, as well as identical correlation between trials. That is, that we are assuming compound symmetry, which is very rarely the case in multi-environment progeny trials. • It also assumes that all families are unrelated to each other, which in this case I know it is not true, as a quick look at the pedigree will show that several mothers are related. • We assume only one type of families (half-sibs), which is true just because I got rid of all the full-sib families and clonal material, to keep things simple in this example. Just to make the point, a quick look at the data using ggplot2—with some jittering and alpha transparency to better display a large number of points—shows differences in both mean and variance among sites. library(ggplot2) ggplot(aes(x=jitter(as.numeric(Trial)), y=dbh), data = trees) + geom_point(colour=alpha('black', 0.05)) + scale_x_continuous('Trial')  Anyway, let’s keep on going with this simplistic model. In my computer, a two year old iMac, ASReml-R takes roughly 9 seconds, while lme4 takes around 65 seconds, obtaining similar results. # First lme4 (and rounding off the variances) summary(m1) Random effects: Groups Name Variance Std.Dev. Trial:Mother (Intercept) 26.457 5.1436 Mother (Intercept) 32.817 5.7286 Trial:Block (Intercept) 77.390 8.7972 Residual 892.391 29.8729 # And now ASReml-R # Round off part of the variance components table round(summary(m2)$varcomp[,1:4], 3)
gamma component std.error z.ratio
Trial:Block!Trial.var  0.087    77.399     4.598  16.832
Mother!Mother.var      0.037    32.819     1.641  20.002
Trial:Mother!Trial.var 0.030    26.459     1.241  21.328
R!variance             1.000   892.389     2.958 301.672


The residual matrix of this model is a, fairly large, diagonal matrix (residual variance times an identity matrix). At this point we can relax this assumption, adding a bit more complexity to the model so we can highlight some syntax. Residuals in one trial should have nothing to do with residuals in another trial, which could be hundreds of kilometers away. I will then allow for a new matrix of residuals, which is the direct sum of trial-specific diagonal matrices. In ASReml-R we can do so by specifying a diagonal matrix at each trial with rcov = ~ at(Trial):units:

m2b =  asreml(dbh ~ Trial, random = ~Trial:Block + Mother +
Trial:Mother, data = trees,
rcov = ~ at(Trial):units)

# Again, extracting and rounding variance components
round(summary(m2b)$varcomp[,1:4], 3) gamma component std.error z.ratio Trial:Block!Trial.var 77.650 77.650 4.602 16.874 Mother!Mother.var 30.241 30.241 1.512 20.006 Trial:Mother!Trial.var 22.435 22.435 1.118 20.065 Trial_1!variance 1176.893 1176.893 18.798 62.606 Trial_2!variance 1093.409 1093.409 13.946 78.403 Trial_3!variance 983.924 983.924 12.061 81.581 ... Trial_29!variance 2104.867 2104.867 55.821 37.708 Trial_30!variance 520.932 520.932 16.372 31.819 Trial_31!variance 936.785 936.785 31.211 30.015  There is a big improvement of log-Likelihood from m2 (-744452.1) to m2b (-738011.6) for 30 additional variances. At this stage, we can also start thinking of heterogeneous variances for blocks, with a small change to the code: m2c = asreml(dbh ~ Trial, random = ~at(Trial):Block + Mother + Trial:Mother, data = wood, rcov = ~ at(Trial):units) round(summary(m2c)$varcomp[,1:4], 3)

# Which adds another 30 variances (one for each Trial)
gamma component std.error z.ratio
at(Trial, 1):Block!Block.var     2.473     2.473     2.268   1.091
at(Trial, 2):Block!Block.var    95.911    95.911    68.124   1.408
at(Trial, 3):Block!Block.var     1.147     1.147     1.064   1.079
...
Mother!Mother.var               30.243    30.243     1.512  20.008
Trial:Mother!Trial.var          22.428    22.428     1.118  20.062
Trial_1!variance              1176.891  1176.891    18.798  62.607
Trial_2!variance              1093.415  1093.415    13.946  78.403
Trial_3!variance               983.926   983.926    12.061  81.581
...


At this point we could do extra modeling at the site level by including any other experimental design features, allowing for spatially correlated residuals, etc. I will cover some of these issues in future posts, as this one is getting a bit too long. However, we could gain even more by expressing the problem in a multivariate fashion, where the performance of Mothers in each trial would be considered a different trait. This would push us towards some large problems, which require modeling covariance structures so we have a change of achieving convergence during our lifetime.

Disclaimer: I am emphasizing the use of ASReml-R because 1.- I am most familiar with it than with nlme or lme4 (although I tend to use nlme for small projects), and 2.- There are plenty of examples for the other packages but not many for ASReml in the R world. I know some of the people involved in the development of ASReml-R but otherwise I have no commercial links with VSN (the guys that market ASReml and Genstat).

There are other packages that I have not had the chance to investigate yet, like hglm.

Linear mixed models in R

A substantial part of my job has little to do with statistics; nevertheless, a large proportion of the statistical side of things relates to applications of linear mixed models. The bulk of my use of mixed models relates to the analysis of experiments that have a genetic structure.

A brief history of time

At the beginning (1992-1995) I would use SAS (first proc glm, later proc mixed), but things started getting painfully slow and limiting if one wanted to move into animal model BLUP. At that time (1995-1996), I moved to DFREML (by Karen Meyer, now replaced by WOMBAT) and AIREML (by Dave Johnson, now defunct—the program I mean), which were designed for the analysis of animal breeding progeny trials, so it was a hassle to deal with experimental design features. At the end of 1996 (or was it the beginning of 1997?) I started playing with ASReml (programed by Arthur Gilmour mostly based on theoretical work by Robin Thompson and Brian Cullis). I was still using SAS for data preparation, but all my analyses went through ASReml (for which I wrote the cookbook), which was orders of magnitude faster than SAS (and could deal with much bigger problems). Around 1999, I started playing with R (prompted by a suggestion from Rod Ball), but I didn’t really use R/S+ often enough until 2003. At the end of 2005 I started using OS X and quickly realized that using a virtual machine or dual booting was not really worth it, so I dropped SAS and totally relied on R in 2009.

Options

As for many other problems, there are several packages in R that let you deal with linear mixed models from a frequentist (REML) point of view. I will only mention nlme (Non-Linear Mixed Effects), lme4 (Linear Mixed Effects) and asreml (average spatial reml). There are also several options for Bayesian approaches, but that will be another post.

nlme is the most mature one and comes by default with any R installation. In addition to fitting hierarchical generalized linear mixed models it also allows fitting non-linear mixed models following a Gaussian distribution (my explanation wasn’t very clear, thanks to ucfagls below for pointing this out). Its main advantages are, in my humble opinion, the ability to fit fairly complex hierarchical models using linear or non-linear approaches, a good variety of variance and correlation structures, and access to several distributions and link functions for generalized models. In my opinion, its main drawbacks are i- fitting cross-classified random factors is a pain, ii- it can be slow and may struggle with lots of data, iii- it does not deal with pedigrees by default and iv- it does not deal with multivariate data.

lme4 is a project led by Douglas Bates (one of the co-authors of nlme), looking at modernizing the code and making room for trying new ideas. On the positive side, it seems to be a bit faster than nlme and it deals a lot better with cross-classified random factors. Drawbacks: similar to nlme’s, but dropping point i- and adding that it doesn’t deal with covariance and correlation structures yet. It is possible to fit pedigrees using the pedigreemm package, but I find the combination a bit flimsy.

ASReml-R is, unsurprisingly, an R package interface to ASReml. On the plus side it i- deals well with cross-classified random effects, ii- copes very well with pedigrees, iii- can work with fairly large datasets, iv-can run multivariate analyses and v- covers a large number of covariance and correlation structures. Main drawbacks are i- limited functionality for non-Gaussian responses, ii- it does not cover non-linear models and iii- it is non-free (as in beer and speech). The last drawback is relative; it is possible to freely use asreml for academic purposes (and there is also a version for developing countries). Besides researchers, the main users of ASReml/ASReml-R are breeding companies.

All these three packages are available for Windows, Linux and OS X.

A (very) simple example

I will use a traditional dataset to show examples of the notation for the three packages: Yates’ variety and nitrogen split-plot experiment. We can get the dataset from the MASS package, after which it is a good idea to rename the variables using meaningful names. In addition, I will follow Bill Venables’s excellent advice and create additional variables for main plot and subplots, as it is confusing to use the same factor for two purposes (e.g. variety as treatment and main plot). Incidentally, if you haven’t read Bill’s post go and read it; it is one of the best explanations I have ever seen for a split-plot analysis.

library(MASS)
data(oats)
names(oats) = c('block', 'variety', 'nitrogen', 'yield')
oats$mainplot = oats$variety
oats$subplot = oats$nitrogen

summary(oats)
block           variety     nitrogen      yield              mainplot
I  :12   Golden.rain:24   0.0cwt:18   Min.   : 53.0   Golden.rain:24
II :12   Marvellous :24   0.2cwt:18   1st Qu.: 86.0   Marvellous :24
III:12   Victory    :24   0.4cwt:18   Median :102.5   Victory    :24
IV :12                    0.6cwt:18   Mean   :104.0
V  :12                                3rd Qu.:121.2
VI :12                                Max.   :174.0
subplot
0.0cwt:18
0.2cwt:18
0.4cwt:18
0.6cwt:18


The nlme code for this analysis is fairly simple: response on the left-hand side of the tilde, followed by the fixed effects (variety, nitrogen and their interaction). Then there is the specification of the random effects (which also uses a tilde) and the data set containing all the data. Notice that 1|block/mainplot is fitting block and mainplot within block. There is no reference to subplot as there is a single assessment for each subplot, which ends up being used at the residual level.

library(nlme)
m1.nlme = lme(yield ~ variety*nitrogen,
random = ~ 1|block/mainplot,
data = oats)

summary(m1.nlme)

Linear mixed-effects model fit by REML
Data: oats
AIC      BIC    logLik
559.0285 590.4437 -264.5143

Random effects:
Formula: ~1 | block
(Intercept)
StdDev:    14.64496

Formula: ~1 | mainplot %in% block
(Intercept) Residual
StdDev:    10.29863 13.30727

Fixed effects: yield ~ variety * nitrogen
Value Std.Error DF   t-value p-value
(Intercept)                      80.00000  9.106958 45  8.784492  0.0000
varietyMarvellous                 6.66667  9.715028 10  0.686222  0.5082
varietyVictory                   -8.50000  9.715028 10 -0.874933  0.4021
nitrogen0.2cwt                   18.50000  7.682957 45  2.407927  0.0202
nitrogen0.4cwt                   34.66667  7.682957 45  4.512152  0.0000
nitrogen0.6cwt                   44.83333  7.682957 45  5.835427  0.0000
varietyMarvellous:nitrogen0.2cwt  3.33333 10.865342 45  0.306786  0.7604
varietyVictory:nitrogen0.2cwt    -0.33333 10.865342 45 -0.030679  0.9757
varietyMarvellous:nitrogen0.4cwt -4.16667 10.865342 45 -0.383482  0.7032
varietyVictory:nitrogen0.4cwt     4.66667 10.865342 45  0.429500  0.6696
varietyMarvellous:nitrogen0.6cwt -4.66667 10.865342 45 -0.429500  0.6696
varietyVictory:nitrogen0.6cwt     2.16667 10.865342 45  0.199411  0.8428

anova(m1.nlme)

numDF denDF   F-value p-value
(Intercept)          1    45 245.14299  <.0001
variety              2    10   1.48534  0.2724
nitrogen             3    45  37.68562  <.0001
variety:nitrogen     6    45   0.30282  0.9322


The syntax for lme4 is not that dissimilar, with random effects specified using a (1|something here) syntax. One difference between the two packages is that nlme reports standard deviations instead of variances for the random effects.

library(lme4)
m1.lme4 = lmer(yield ~ variety*nitrogen + (1|block/mainplot),
data = oats)

summary(m1.lme4)

Linear mixed model fit by REML
Formula: yield ~ variety * nitrogen + (1 | block/mainplot)
Data: oats
AIC   BIC logLik deviance REMLdev
559 593.2 -264.5    595.9     529
Random effects:
Groups         Name        Variance Std.Dev.
mainplot:block (Intercept) 106.06   10.299
block          (Intercept) 214.48   14.645
Residual                   177.08   13.307
Number of obs: 72, groups: mainplot:block, 18; block, 6

Fixed effects:
Estimate Std. Error t value
(Intercept)                       80.0000     9.1064   8.785
varietyMarvellous                  6.6667     9.7150   0.686
varietyVictory                    -8.5000     9.7150  -0.875
nitrogen0.2cwt                    18.5000     7.6830   2.408
nitrogen0.4cwt                    34.6667     7.6830   4.512
nitrogen0.6cwt                    44.8333     7.6830   5.835
varietyMarvellous:nitrogen0.2cwt   3.3333    10.8653   0.307
varietyVictory:nitrogen0.2cwt     -0.3333    10.8653  -0.031
varietyMarvellous:nitrogen0.4cwt  -4.1667    10.8653  -0.383
varietyVictory:nitrogen0.4cwt      4.6667    10.8653   0.430
varietyMarvellous:nitrogen0.6cwt  -4.6667    10.8653  -0.430
varietyVictory:nitrogen0.6cwt      2.1667    10.8653   0.199

anova(m1.lme4)

Analysis of Variance Table
Df  Sum Sq Mean Sq F value
variety           2   526.1   263.0  1.4853
nitrogen          3 20020.5  6673.5 37.6856
variety:nitrogen  6   321.7    53.6  0.3028


For this type of problem, the notation for asreml is also very similar, particularly when compared to nlme.

library(asreml)
m1.asreml = asreml(yield ~ variety*nitrogen,
random = ~ block/mainplot,
data = oats)

summary(m1.asreml)$varcomp gamma component std.error z.ratio constraint block!block.var 1.2111647 214.4771 168.83404 1.270343 Positive block:mainplot!block.var 0.5989373 106.0618 67.87553 1.562593 Positive R!variance 1.0000000 177.0833 37.33244 4.743416 Positive wald(m1.asreml, denDF = 'algebraic')$Wald
Df denDF    F.inc           Pr
(Intercept)       1     5 245.1000 1.931825e-05
variety           2    10   1.4850 2.723869e-01
nitrogen          3    45  37.6900 2.457710e-12
variety:nitrogen  6    45   0.3028 9.321988e-01

\$stratumVariances
df  Variance block block:mainplot R!variance
block           5 3175.0556    12              4          1
block:mainplot 10  601.3306     0              4          1
R!variance     45  177.0833     0              0          1


In this simple example one pretty much gets the same results, independently of the package used (which is certainly comforting). I will soon cover another simple model, but with much larger dataset, to highlight some performance differences between the packages.