Analyzing a simple experiment with heterogeneous variances using asreml, MCMCglmm and SAS

I was working with a small experiment that includes families from two Eucalyptus species and thought it would be nice to code a first analysis using alternative approaches. The experiment is a randomized complete block design, with species as a fixed effect and family and block as random effects, while the response variable is growth strain (in \( \mu \epsilon \)).

When looking at the trees one can see that the residual variances will be very different for the two species. In addition, the trees were growing in plastic bags laid out in rows (the blocks) and columns. Given that the trees were growing in bags sitting on flat terrain, most likely the row effects are zero.

Below is the code for a first go in R (using both MCMCglmm and ASReml-R) and SAS. I had stopped using SAS several years ago, mostly because I was running a Mac, for which there is no version. However, a few weeks ago I started accessing it through their OnDemand for Academics program, via a web browser.

The R code using REML looks like:

# Options
options(stringsAsFactors = FALSE)
setwd('~/Dropbox/research/2013/growth-stress')
 
# Packages
require(ggplot2)
require(asreml)
require(MCMCglmm)
 
 
# Reading data, renaming, factors, etc
gs = read.csv('eucalyptus-growth-stress.csv')
summary(gs)
 
# Both argophloia and bosistoana
gsboth = subset(gs, !is.na(strain))
gsboth = within(gsboth, {
	species = factor(species)
	row = factor(row)
	column = factor(column)
	fam = factor(fam)
})            
 
 
 
# REML fit: heterogeneous residual variances by species via at(species)
ma = asreml(strain ~ species, random = ~ fam + row, 
            rcov = ~ at(species):units,
            data = gsboth)
 
summary(ma)$varcomp
 
#                                   gamma  component std.error   z.ratio constraint
#fam!fam.var                    27809.414  27809.414 10502.036 2.6480022   Positive
#row!row.var                     2337.164   2337.164  3116.357 0.7499666   Positive
#species_E.argopholia!variance 111940.458 111940.458 26609.673 4.2067580   Positive
#species_E.bosistoana!variance  63035.256  63035.256  7226.768 8.7224681   Positive
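
Given the hunch above that the row effects are close to zero, one can check by dropping row and comparing REML log-likelihoods. This is a minimal sketch, assuming the asreml-R v3 syntax used above; the halved p-value accounts for testing a variance component on its boundary:

# Same model without row effects
ma.norow = asreml(strain ~ species, random = ~ fam,
                  rcov = ~ at(species):units,
                  data = gsboth)
 
# REML likelihood-ratio test for the row variance
lrt = 2*(ma$loglik - ma.norow$loglik)
0.5*(1 - pchisq(lrt, df = 1))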

Using MCMC we get estimates in the same ballpark with:

# Priors
bothpr = list(R = list(V = diag(c(50000, 50000)), nu = 3), 
              G = list(G1 = list(V = 20000, nu = 0.002), 
                       G2 = list(V = 20000, nu = 0.002),
                       G3 = list(V = 20000, nu = 0.002)))
 
# MCMC                    
m2 = MCMCglmm(strain ~ species, random = ~ fam + row + column, 
              rcov = ~ idh(species):units,
              data = gsboth,
              prior = bothpr,
              pr = TRUE,
              family = 'gaussian',
              burnin = 10000,
              nitt = 40000,
              thin = 20,
              saveX = TRUE,
              saveZ = TRUE,
              verbose = FALSE)
 
summary(m2)
 
# Iterations = 10001:39981
# Thinning interval  = 20
# Sample size  = 1500 
# 
# DIC: 3332.578 
# 
# G-structure:  ~fam
# 
# post.mean l-95% CI u-95% CI eff.samp
# fam     30315    12211    55136     1500
# 
# ~row
# 
# post.mean l-95% CI u-95% CI eff.samp
# row      1449    5.928     6274    589.5
# 
# R-structure:  ~idh(species):units
# 
# post.mean l-95% CI u-95% CI eff.samp
# E.argopholia.units    112017    71152   168080     1500
# E.bosistoana.units     65006    52676    80049     1500
# 
# Location effects: strain ~ species 
# 
# post.mean l-95% CI u-95% CI eff.samp  pMCMC    
# (Intercept)            502.21   319.45   690.68     1500 <7e-04 ***
# speciesE.bosistoana   -235.95  -449.07   -37.19     1361  0.036 *
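
A quick way to check that "ballpark" claim is to look at the posterior modes of the variance components, which can be compared directly against the REML components above (a one-liner sketch):

posterior.mode(m2$VCV)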

The SAS code is not that dissimilar, except for the clear demarcation between data processing (the data step: reading files, data transformations, etc.) and specific procs (procedures), in this case to summarize the data, produce a boxplot and fit a mixed model.

* termstr=CRLF accounts for the windows-like line endings of the data set;
data gs;
  infile "/home/luis/Projects/euc-strain/growthstresses.csv" 
        dsd termstr=CRLF firstobs=2;
  input row column species $ family $ strain;
  if strain ^= .;
run;
 
proc summary data = gs print;
  class species;
  var strain;
run;
 
proc boxplot data = gs;
  plot strain*species;
run;
 
proc mixed data = gs;
  class row species family;
  model strain = species;
  random row family;
  repeated species / group=species;
run;
 
/*
Covariance Parameter Estimates
Cov Parm 	Group 	               Estimate
row 	  	                       2336.80
family 	  	                       27808
species 	species E.argoph 	111844
species 	species E.bosist 	63036
*/
SAS boxplot for the data set.

I like working with multiple languages and I realized that, in fact, I missed SAS a bit. It was like meeting an old friend: it felt strange at the beginning, but after a few minutes we were quickly chatting away.

When R, or any other language, is not enough

This post is tangential to R, although R has a fair share of the issues I mention here, which include research reproducibility, open source, paying for software, multiple languages, salt and pepper.

There is an increasing interest in the reproducibility of research. In many topics we face multiple, often conflicting claims, and as researchers we value the ability to evaluate those claims, including repeating/reproducing research results. While I share the interest in reproducibility, sometimes I feel we are obsessing too much over only part of the research process: statistical analysis. Even here, many people focus not on the models per se, but only on the code for the analysis, which should only use tools that are free of charge.

There has been enormous progress in the R world on literate programming, where the combination of RStudio + Markdown + knitr has made analyzing data and documenting the process almost enjoyable. Nevertheless, and here is the BUT coming, there is a large difference between making the code repeatable and making research reproducible.

As an example, currently I am working on a project that relies on two trials, which have taken a decade to grow. We took a few hundred increment cores from a sample of trees and processed them using a densitometer, an X-ray diffractometer and a few other lab toys. By now you get the idea: actually replicating the research may take quite a few resources before you even start to play with free software. At that point, of course, I want to be able to get the most out of my data, which means that I won't settle for a half-assed model because the software is not able to fit it. If you think about it, spending a couple of grand on software (say ASReml and Mathematica licenses) doesn't sound outrageous at all. Furthermore, reproducing this piece of research would require a decade, access to genetic material and lab toys. I'll give you the code for free, but I can't give you ten years or $0.25 million…

In addition, the research process may require linking disparate sources of data, for which other languages (e.g. Python) may be more appropriate. Sometimes R is the perfect tool for the job, while other times I feel like we have reached peak VBS (Visual Basic Syndrome) in R: people want to use it for everything, even when it's a bad idea.

In summary,

  • research is much more than a few lines of R (although they are very important),
  • even when considering only data collection and analysis, it is a good idea to know more than a single language/software, because it broadens your analytical options, and
  • I prefer free (freedom + beer) software for research; however, I rely on non-free, commercial software for part of my work because it happens to be the best option for specific analyses.

Disclaimer: my primary analysis language is R and I often use lme4, MCMCglmm and INLA (all free). However, many (if not most) of my analyses that use genetic information rely on ASReml (paid, not open source). I’ve used Mathematica, Matlab, Stata and SAS for specific applications with reasonably priced academic licenses.

Gratuitous picture: 3000 trees leaning in a foggy Christchurch day (Photo: Luis, click to enlarge).

Surviving a binomial mixed model

A few years ago we had this really cool idea: we had to establish a trial to understand wood quality in context, sort of following the saying "we don't know who discovered water, but we are sure that it wasn't a fish" (attributed to Marshall McLuhan). By now you are thinking WTF is this guy talking about? But the idea was simple: let's put in a trial that had the species we wanted to study (Pinus radiata, a gymnosperm) and an angiosperm (Eucalyptus nitens, if you wish to know) to provide the contrast, as they are supposed to have vastly different types of wood. From space the trial looked like this:

The reason you can clearly see the pines but not the eucalypts is that the latter were dying like crazy over a summer drought (45% mortality in one month). And here we get to the analytical part: we will have a look only at the eucalypts, where the response variable can't get any clearer: trees were either totally dead or alive. The experiment followed a randomized complete block design, with 50 open-pollinated families in 48 blocks. The original idea was to harvest 12 blocks each year but—for obvious reasons—we canned this part of the experiment after the first year.

The following code shows the analysis in asreml-R, lme4 and MCMCglmm:

load('~/Dropbox/euc.Rdata')
 
library(asreml)
sasreml = asreml(surv ~ 1, random = ~ Fami + Block,
                 data = euc, 
                 family = asreml.binomial(link = 'logit'))
summary(sasreml)$varcomp
 
#                      gamma component  std.error  z.ratio
#Fami!Fami.var     0.5704205 0.5704205 0.14348068 3.975591
#Block!Block.var   0.1298339 0.1298339 0.04893254 2.653324
#R!variance        1.0000000 1.0000000         NA       NA
 
#                 constraint
#Fami!Fami.var      Positive
#Block!Block.var    Positive
#R!variance            Fixed
 
# Quick look at heritability
varFami = summary(sasreml)$varcomp[1, 2]
varRep = summary(sasreml)$varcomp[2, 2]
h2 = 4*varFami/(varFami + varRep + 3.29)
h2
#[1] 0.5718137
 
 
library(lme4)
# Note: current versions of lme4 fit binomial models with glmer();
# older versions accepted a family argument in lmer()
slme4 = glmer(surv ~ 1 + (1|Fami) + (1|Block), 
              data = euc,
              family = binomial(link = 'logit'))
 
summary(slme4)
 
#Generalized linear mixed model fit by the Laplace approximation 
#Formula: surv ~ 1 + (1 | Fami) + (1 | Block) 
#   Data: euc 
#  AIC  BIC logLik deviance
# 2725 2742  -1360     2719
#Random effects:
# Groups   Name        Variance Std.Dev.
# Fami     (Intercept) 0.60941  0.78065 
# Block    (Intercept) 0.13796  0.37143 
#Number of obs: 2090, groups: Fami, 51; Block, 48
#
#Fixed effects:
#            Estimate Std. Error z value Pr(>|z|)  
#(Intercept)   0.2970     0.1315   2.259   0.0239 *
 
# Quick look at heritability
varFami = VarCorr(slme4)$Fami[1]
varRep = VarCorr(slme4)$Block[1]
h2 = 4*varFami/(varFami + varRep + 3.29)
h2
#[1] 0.6037697
 
# And let's play at being Bayesians!
library(MCMCglmm)
# The residual variance is not identifiable on the liability scale,
# so it is fixed at 1 (nu was called n in older MCMCglmm versions)
pr = list(R = list(V = 1, nu = 0, fix = 1),
          G = list(G1 = list(V = 1, nu = 0.002), 
                   G2 = list(V = 1, nu = 0.002)))
 
sb <- MCMCglmm(surv ~ 1, 
               random = ~ Fami + Block, 
               family = 'categorical',
               data = euc,
               prior = pr,
               verbose = FALSE,
               pr = TRUE,
               burnin = 10000,
               nitt = 100000,
               thin = 10)
 
plot(sb$VCV)

You may be wondering where the 3.29 in the heritability formula comes from. Well, that's the variance of the link function which, in the case of the logit link, is \( \pi^2/3 \approx 3.29 \). In the case of MCMCglmm we can estimate the degree of genetic control quite easily, remembering that we have half-siblings (open-pollinated plants):

# Heritability
h2 = 4*sb$VCV[, 'Fami']/(sb$VCV[, 'Fami'] + 
                         sb$VCV[, 'Block'] + 3.29 + 1)
posterior.mode(h2)
#     var1 
#0.6476185 
 
HPDinterval(h2)
#         lower     upper
#var1 0.4056492 0.9698148
#attr(,"Probability")
#[1] 0.95
 
plot(h2)

By the way, it is good to remember that we need to back-transform the estimated effects to probabilities, with very simple code:

# Getting mode and credible interval for solutions, back-transformed
# to probabilities; inv.logit() comes from the boot package (plogis()
# in base R is equivalent)
library(boot)
inv.logit(posterior.mode(sb$Sol))
inv.logit(HPDinterval(sb$Sol, 0.95))

Even if one of your trials is trashed there is a silver lining: it is possible to have a look at survival.

Coming out of the (Bayesian) closet: multivariate version

This week I’m facing my—and many other lecturers’—least favorite part of teaching: grading exams. In a supreme act of procrastination I will continue the previous post, and the antepenultimate one, showing the code for a bivariate analysis of a randomized complete block design.

Just to recap, the results from the REML multivariate analysis (which used ASReml-R) were the following:

library(asreml)
 
m4 = asreml(cbind(bden, veloc) ~ trait,
            random = ~ us(trait):Block +  us(trait):Family, data = a,
            rcov = ~ units:us(trait))
 
summary(m4)$varcomp
 
#                                      gamma    component    std.error
#trait:Block!trait.bden:bden    1.628812e+02 1.628812e+02 7.854123e+01
#trait:Block!trait.veloc:bden   1.960789e-01 1.960789e-01 2.273473e-01
#trait:Block!trait.veloc:veloc  2.185595e-03 2.185595e-03 1.205128e-03
#trait:Family!trait.bden:bden   8.248391e+01 8.248391e+01 2.932427e+01
#trait:Family!trait.veloc:bden  1.594152e-01 1.594152e-01 1.138992e-01
#trait:Family!trait.veloc:veloc 2.264225e-03 2.264225e-03 8.188618e-04
#R!variance                     1.000000e+00 1.000000e+00           NA
#R!trait.bden:bden              5.460010e+02 5.460010e+02 3.712833e+01
#R!trait.veloc:bden             6.028132e-01 6.028132e-01 1.387624e-01
#R!trait.veloc:veloc            1.710482e-02 1.710482e-02 9.820673e-04
#                                  z.ratio constraint
#trait:Block!trait.bden:bden     2.0738303   Positive
#trait:Block!trait.veloc:bden    0.8624639   Positive
#trait:Block!trait.veloc:veloc   1.8135789   Positive
#trait:Family!trait.bden:bden    2.8128203   Positive
#trait:Family!trait.veloc:bden   1.3996166   Positive
#trait:Family!trait.veloc:veloc  2.7650886   Positive
#R!variance                             NA      Fixed
#R!trait.bden:bden              14.7057812   Positive
#R!trait.veloc:bden              4.3442117   Positive
#R!trait.veloc:veloc            17.4171524   Positive
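
For later comparison, we can also compute the REML estimate of the genetic correlation from the components above. A small sketch, assuming the row order of the varcomp table as printed:

# Family covariance over the square root of the product of the
# Family variances (rows 4-6 of the varcomp table above)
vc = summary(m4)$varcomp
vc[5, 'component']/sqrt(vc[4, 'component'] * vc[6, 'component'])
# ~0.37, from the components listed above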

The corresponding MCMCglmm code is not that different from ASReml-R, after which it is modeled anyway. Following the recommendations of the MCMCglmm Course Notes (included with the package), the priors have been expanded to diagonal matrices with degree of belief equal to the number of traits. The general intercept is dropped (-1), so the trait keyword represents the trait means. We are fitting unstructured (us(trait)) covariance matrices for both Block and Family, as well as an unstructured covariance matrix for the residuals. Finally, both traits follow a Gaussian distribution:

library(MCMCglmm)
# Priors: diagonal V with degree of belief (nu, called n in older
# MCMCglmm versions) equal to the number of traits
bp = list(R = list(V = diag(c(0.007, 260)), nu = 2), 
          G = list(G1 = list(V = diag(c(0.007, 260)), nu = 2), 
                   G2 = list(V = diag(c(0.007, 260)), nu = 2)))
 
bmod = MCMCglmm(cbind(veloc, bden) ~ trait - 1, 
                random = ~ us(trait):Block + us(trait):Family,
                rcov = ~ us(trait):units, 
                family = c('gaussian', 'gaussian'),
                data = a,
                prior = bp,
                verbose = FALSE,
                pr = TRUE,
                burnin = 10000,
                nitt = 20000,
                thin = 10)

Further manipulation of the posterior distributions requires having an idea of the names used to store the results. Following that, we can build an estimate of the genetic correlation between the traits (Family covariance between traits divided by the square root of the product of the Family variances). Incidentally, it wouldn’t be a bad idea to run a much longer chain for this model, so the plot of the posterior for the correlation looks better, but I’m short of time:

dimnames(bmod$VCV)
 
rg = bmod$VCV[,'veloc:bden.Family']/sqrt(bmod$VCV[,'veloc:veloc.Family'] * 
                                  bmod$VCV[,'bden:bden.Family'])
 
plot(rg)
 
posterior.mode(rg)
#     var1 
#0.2221953 
 
 
HPDinterval(rg, prob = 0.95)
#         lower     upper
#var1 -0.132996 0.5764006
#attr(,"Probability")
#[1] 0.95
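
The same recipe works for the other covariance matrices; for example, here is a sketch of the residual (within-tree) correlation, assuming the column names follow the same pattern as above (check dimnames(bmod$VCV) if they don't):

re = bmod$VCV[, 'veloc:bden.units']/sqrt(bmod$VCV[, 'veloc:veloc.units'] * 
                                         bmod$VCV[, 'bden:bden.units'])
posterior.mode(re)
HPDinterval(re, prob = 0.95)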

And that’s it! Time to congratulate Jarrod Hadfield for developing this package.

Coming out of the (Bayesian) closet

Until today all the posts in this blog have used a frequentist view of the world. I have a confession to make: I have an ecumenical view of statistics and I do sometimes use Bayesian approaches in data analyses. This is not quite one of those “the truth will set you free” moments, but I’ll show that one could almost painlessly repeat some of the analyses I presented before using MCMC.

MCMCglmm is a very neat package that—as its rather complicated em cee em cee gee el em em acronym implies—implements MCMC for generalized linear mixed models. We'll skip over the fact that the frequentist fixed vs random effects distinction gets blurry in Bayesian models, and still use the f… and r… terms. I'll first repeat the code for a randomized complete block design with family effects (so we have two random factors) using both lme4 and ASReml-R, and then add the MCMCglmm counterpart:

library(lme4)
m1 = lmer(bden ~ 1 + (1|Block) + (1|Family),
          data = a)
 
summary(m1)
 
#Linear mixed model fit by REML 
#Formula: bden ~ (1 | Block) + (1 | Family) 
#   Data: a 
#  AIC  BIC logLik deviance REMLdev
# 4572 4589  -2282     4569    4564
#Random effects:
# Groups   Name        Variance Std.Dev.
# Family   (Intercept)  82.803   9.0996 
# Block    (Intercept) 162.743  12.7571 
# Residual             545.980  23.3662 
#Number of obs: 492, groups: Family, 50; Block, 11
 
#Fixed effects:
#            Estimate Std. Error t value
#(Intercept)  306.306      4.197   72.97
 
 
library(asreml)
m2 = asreml(bden ~ 1, random = ~ Block + Family,
            data = a)
 
summary(m2)$varcomp
#                      gamma component std.error   z.ratio
#Block!Block.var   0.2980766 162.74383  78.49271  2.073362
#Family!Family.var 0.1516591  82.80282  29.47153  2.809587
#R!variance        1.0000000 545.97983  37.18323 14.683496
 
#                 constraint
#Block!Block.var    Positive
#Family!Family.var  Positive
#R!variance         Positive
 
 
m2$coeff$fixed
#(Intercept) 
#    306.306

We had already established that the results obtained from lme4 and ASReml-R were pretty much the same, at least for relatively simple models where we can use both packages (as their functionality diverges later for more complex models). This example is no exception and we quickly move to fitting the same model using MCMCglmm:

library(MCMCglmm)
# Priors: the phenotypic variance (~780) split into equal parts for
# Block, Family and residuals (nu was called n in older versions)
priors = list(R = list(V = 260, nu = 0.002),
              G = list(G1 = list(V = 260, nu = 0.002),
                       G2 = list(V = 260, nu = 0.002)))
 
m4 = MCMCglmm(bden ~ 1, 
              random = ~ Block + Family, 
              family = 'gaussian',
              data = a,
              prior = priors,
              verbose = FALSE,
              pr = TRUE,
              burnin = 10000,
              nitt = 20000,
              thin = 10)
 
plot(mcmc.list(m4$VCV))
 
autocorr(m4$VCV)
 
posterior.mode(m4$VCV)
#    Block    Family     units 
#126.66633  72.97771 542.42237 
 
HPDinterval(m4$VCV)
#           lower    upper
#Block   33.12823 431.0233
#Family  26.34490 146.6648
#units  479.24201 627.7724

The first difference is that we have to specify priors for the coefficients that we would like to estimate (by default fixed effects—the overall intercept for our example—start with a zero mean and a very large variance: \( 10^6 \)). The phenotypic variance for our response is around 780, which I split into equal parts for Block, Family and Residuals. For each random effect we have provided our prior for the variance (V) and a degree of belief in our prior (n).
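
As a quick sanity check on that 780, one could either look at the raw variance of the response or add up the REML components from the ASReml-R fit above (a sketch, assuming the data frame a and response bden used in the code):

# Rough phenotypic variance to split among the priors
var(a$bden, na.rm = TRUE)
sum(summary(m2)$varcomp[, 'component'])   # 162.7 + 82.8 + 546.0, ~791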

In addition to the model equation, the name of the data set and the prior information, we need to specify things like the number of iterations in the chain (nitt), how many we are discarding for the initial burn-in period (burnin), and how many values we are keeping (thin, every ten). Besides the pretty plot of the posterior distributions (see previous figure), they can be summarized using the posterior mode and highest posterior density intervals.
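
Before summarizing, it is worth checking the quality of the chain. A short sketch using coda (loaded with MCMCglmm): the effective sample sizes should be reasonably close to the 1,000 samples we stored:

# Effective sample size for each variance component
effectiveSize(m4$VCV)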

One of the neat things we can do is to painlessly create functions of variance components and get their posterior mode and credible interval. For example, the heritability (or degree of additive genetic control) can be estimated in this trial with full-sib families using the following formula:

\( \hat{h}^2 = \frac{2 \sigma_F^2}{\sigma_F^2 + \sigma_B^2 + \sigma_R^2} \)

h2 = 2 * m4$VCV[, 'Family']/(m4$VCV[, 'Family'] + 
         m4$VCV[, 'Block'] + m4$VCV[, 'units'])
plot(h2)
 
posterior.mode(h2)
#0.1887017
 
HPDinterval(h2)
#          lower     upper
#var1 0.05951232 0.3414216
#attr(,"Probability")
#[1] 0.95

There are some differences in the final results between ASReml-R/lme4 and MCMCglmm; however, the gammas (ratios of variance component to error variance) for the posterior distributions are very similar, and the estimated heritabilities are almost identical (~0.19 vs ~0.21). Overall, MCMCglmm is a very interesting package that covers a lot of ground. It pays to read the reference manual and vignettes, which go into a lot of detail on how to specify different types of models. I will present the MCMCglmm bivariate analyses in another post.
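
For example, a sketch of those posterior "gammas", which can be compared against the gamma column of the ASReml-R output above:

# Ratios of variance components to error (units) variance
posterior.mode(m4$VCV[, 'Block']/m4$VCV[, 'units'])
posterior.mode(m4$VCV[, 'Family']/m4$VCV[, 'units'])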

P.S. There are several other ways that we could fit this model in R using a Bayesian approach: it is possible to call WinBugs or JAGS (in Linux and OS X) from R, or we could have used INLA. More on this in future posts.