Analyzing a simple experiment with heterogeneous variances using asreml, MCMCglmm and SAS

I was working with a small experiment which includes families from two Eucalyptus species and thought it would be nice to code a first analysis using alternative approaches. The experiment is a randomized complete block design, with species as a fixed effect and family and block as random effects, while the response variable is growth strain (in \( \mu \epsilon\)).

When looking at the trees one can see that the residual variances will be very different. In addition, the trees were growing in plastic bags laid out in rows (the blocks) and columns. Given that the trees were growing in bags sitting on flat terrain, most likely the row effects are zero.
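As a quick check of that claim (a sketch; it uses the gsboth data frame and the ggplot2 package loaded in the code below):

# Rough look at how the spread differs between species
# (gsboth is created in the listing below)
tapply(gsboth$strain, gsboth$species, sd)
 
# Or graphically, with a boxplot per species
ggplot(gsboth, aes(x = species, y = strain)) + geom_boxplot()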

Below is the code for a first go in R (using both MCMCglmm and ASReml-R) and SAS. I had stopped using SAS several years ago, mostly because I was running a Mac, for which there is no version. However, a few weeks ago I started accessing it through their OnDemand for Academics program in a web browser.

The R code using REML looks like:

# Options
options(stringsAsFactors = FALSE)
setwd('~/Dropbox/research/2013/growth-stress')
 
# Packages
require(ggplot2)
require(asreml)
require(MCMCglmm)
 
 
# Reading data, renaming, factors, etc
gs = read.csv('eucalyptus-growth-stress.csv')
summary(gs)
 
# Both argophloia and bosistoana
gsboth = subset(gs, !is.na(strain))
gsboth = within(gsboth, {
	species = factor(species)
	row = factor(row)
	column = factor(column)
	fam = factor(fam)
})            
 
 
 
ma = asreml(strain ~ species, random = ~ fam + row, 
            rcov = ~ at(species):units,
            data = gsboth)
 
summary(ma)$varcomp
 
#                                   gamma  component std.error   z.ratio constraint
#fam!fam.var                    27809.414  27809.414 10502.036 2.6480022   Positive
#row!row.var                     2337.164   2337.164  3116.357 0.7499666   Positive
#species_E.argopholia!variance 111940.458 111940.458 26609.673 4.2067580   Positive
#species_E.bosistoana!variance  63035.256  63035.256  7226.768 8.7224681   Positive

Using MCMC we get estimates in the same ballpark with:

# Priors
bothpr = list(R = list(V = diag(c(50000, 50000)), nu = 3), 
              G = list(G1 = list(V = 20000, nu = 0.002), 
                       G2 = list(V = 20000, nu = 0.002),
                       G3 = list(V = 20000, nu = 0.002)))
 
# MCMC                    
m2 = MCMCglmm(strain ~ species, random = ~ fam + row + column, 
              rcov = ~ idh(species):units,
              data = gsboth,
              prior = bothpr,
              pr = TRUE,
              family = 'gaussian',
              burnin = 10000,
              nitt = 40000,
              thin = 20,
              saveX = TRUE,
              saveZ = TRUE,
              verbose = FALSE)
 
summary(m2)
 
# Iterations = 10001:39981
# Thinning interval  = 20
# Sample size  = 1500 
# 
# DIC: 3332.578 
# 
# G-structure:  ~fam
# 
# post.mean l-95% CI u-95% CI eff.samp
# fam     30315    12211    55136     1500
# 
# ~row
# 
# post.mean l-95% CI u-95% CI eff.samp
# row      1449    5.928     6274    589.5
# 
# R-structure:  ~idh(species):units
# 
# post.mean l-95% CI u-95% CI eff.samp
# E.argopholia.units    112017    71152   168080     1500
# E.bosistoana.units     65006    52676    80049     1500
# 
# Location effects: strain ~ species 
# 
# post.mean l-95% CI u-95% CI eff.samp  pMCMC    
# (Intercept)            502.21   319.45   690.68     1500 <7e-04 ***
# speciesE.bosistoana   -235.95  -449.07   -37.19     1361  0.036 *

The SAS code is not that dissimilar, except for the clear demarcation between data processing (the data step, for reading files, data transformations, etc.) and specific procs (procedures), in this case to summarize the data, produce a boxplot and fit a mixed model.

* termstr=CRLF accounts for the windows-like line endings of the data set;
data gs;
  infile "/home/luis/Projects/euc-strain/growthstresses.csv" 
        dsd termstr=CRLF firstobs=2;
  input row column species $ family $ strain;
  if strain ^= .;
run;
 
proc summary data = gs print;
  class species;
  var strain;
run;
 
proc boxplot data = gs;
  plot strain*species;
run;
 
proc mixed data = gs;
  class row species family;
  model strain = species;
  random row family;
  repeated species / group=species;
run;
 
/*
Covariance Parameter Estimates
Cov Parm 	Group 	               Estimate
row 	  	                       2336.80
family 	  	                       27808
species 	species E.argoph 	111844
species 	species E.bosist 	63036
*/
SAS boxplot for the data set.

I like working with multiple languages and I realized that, in fact, I missed SAS a bit. It was like meeting an old friend; at the beginning it felt strange, but we were quickly chatting away after a few minutes.

When R, or any other language, is not enough

This post is tangential to R, although R has a fair share of the issues I mention here, which include research reproducibility, open source, paying for software, multiple languages, salt and pepper.

There is an increasing interest in the reproducibility of research. In many topics we face multiple, often conflicting claims and as researchers we value the ability to evaluate those claims, including repeating/reproducing research results. While I share the interest in reproducibility, sometimes I feel we are obsessing too much over only part of the research process: statistical analysis. Even here, many people focus not on the models per se, but only on the code for the analysis, which, some argue, should only use tools that are free of charge.

There has been enormous progress in the R world on literate programming, where the combination of RStudio + Markdown + knitr has made analyzing data and documenting the process almost enjoyable. Nevertheless, and here is the BUT coming, there is a large difference between making the code repeatable and making research reproducible.

As an example, currently I am working on a project that relies on two trials, which have taken a decade to grow. We took a few hundred increment cores from a sample of trees and processed them using a densitometer, an X-Ray diffractometer and a few other lab toys. By now you get the idea: actually replicating the research may take quite a few resources before you even start to play with free software. At that point, of course, I want to be able to get the most out of my data, which means that I won’t settle for a half-assed model because the software is not able to fit it. If you think about it, spending a couple of grand on software (say ASReml and Mathematica licenses) doesn’t sound outrageous at all. Furthermore, reproducing this piece of research would require a decade, access to genetic material and lab toys. I’ll give you the code for free, but I can’t give you ten years or $0.25 million…

In addition, the research process may require linking disparate sources of data for which other languages (e.g. Python) may be more appropriate. Sometimes R is the perfect tool for the job, while other times I feel like we have reached peak VBS (Visual Basic Syndrome) in R: people want to use it for everything, even when it’s a bad idea.

In summary,

  • research is much more than a few lines of R (although they are very important),
  • even when considering data collection and analysis it is a good idea to know more than a single language/software, because it broadens analytical options
  • I prefer free (freedom+beer) software for research; however, I rely on non-free, commercial software for part of my work because it happens to be the best option for specific analyses.

Disclaimer: my primary analysis language is R and I often use lme4, MCMCglmm and INLA (all free). However, many (if not most) of my analyses that use genetic information rely on ASReml (paid, not open source). I’ve used Mathematica, Matlab, Stata and SAS for specific applications with reasonably priced academic licenses.

Gratuitous picture: 3000 trees leaning in a foggy Christchurch day (Photo: Luis, click to enlarge).

Multisite, multivariate genetic analysis: simulation and analysis

The email wasn’t a challenge but a simple question: Is it possible to run a multivariate analysis in multiple sites? I was going to answer yes, of course, and leave it there but it would be a cruel, non-satisfying answer. We can get a better handle of the question if we use a simple example; let’s assume that we have two traits (call them tree stem diameter and stem density) assessed in two sites (localities).

Because this is genetics we have a family structure; let’s say half-siblings, so trees within a family only have the mother in common, and we will ignore any experimental design features to keep things simple. We have 100 families, with 30 trees each, in sites A and B, for a total of 6,000 trees (100 x 30 x 2). The data could look like this:

site family tree diam dens
 A     1     1   20    398
 A     1     2   19    400
...
 A    100   30   24    375
...
 B     1    1    25    396
...
 B    100   30   23    403

We can also think of a trait measured in two sites as separate, but correlated, traits. For example, diameter in site 1 (diam1) is a different variable from diameter in site 2 (diam2). So now we have four response variables (diam1, dens1, diam2, dens2), two of which have only missing values in a given site:

site family tree diam1 dens1 diam2 dens2
 A     1     1    20    398   NA    NA
 A     1     2    19    400   NA    NA
...
 A    100   30    24    375   NA    NA
...
 B      1    1    NA    NA    25    396
...
 B    100   30    NA    NA    23    403
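As a sketch of how one might build those stacked variables from long-format data (here longdat is an assumed data frame with the columns shown in the first listing; the name is made up for illustration):

# Turn diam/dens measured in two sites into four site-specific traits,
# with NAs for the site where a tree was not grown. 'longdat' is assumed to
# have the columns site, family, tree, diam and dens shown above
longdat <- within(longdat, {
  diam1 <- ifelse(site == 'A', diam, NA)
  dens1 <- ifelse(site == 'A', dens, NA)
  diam2 <- ifelse(site == 'B', diam, NA)
  dens2 <- ifelse(site == 'B', dens, NA)
})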

All variables are correlated at the genetic level with an unstructured G covariance matrix, while at the residual level they are only correlated within a site (the same tree was assessed for both traits), with zero correlation between sites, because one site could be near Christchurch while the other is near Dunedin, for example.

###
### Data generation
###
 
# Simulate 100 half-sib families with 30 trees each
# for the four traits in two sites
n.fam <- 100
n.prog <- 30
n.trait <- 4
n.site <- 2
 
# Couple of variance matrices,
# G is for additive genetic effects
# R is for residuals
# All variances are assumed equal to 1 without loss
# of generality
G <- matrix(c(1, 0.7, 0.5, 0.3,
              0.7, 1, 0.2, 0.1,
              0.5, 0.2, 1, 0.8,
              0.3, 0.1, 0.8, 1), 4, 4)
 
R <- matrix(c(1, 0.3, 0, 0,
              0.3, 1, 0, 0,
              0, 0, 1, 0.2,
              0, 0, 0.2, 1), 4, 4)
 
 
# Between-family variances account for 1/4 of additive
# genetic effects. Within-family account for 3/4 of
# additive + residual.
# We also get the Cholesky decomposition in the same step
BFL <- chol(1/4 * G)
WFL <- chol(3/4 * G + R)
 
# Simulate random family effects for four traits
fam.eff <- t(BFL) %*% matrix(rnorm(n.fam*n.trait), 
                             n.trait, n.fam)
fam.eff <- t(fam.eff) # This is 100 x 4
 
tre.eff <- t(WFL) %*% matrix(rnorm(n.prog*n.fam*n.trait), 
                             n.trait, n.prog*n.fam)
tre.eff <- t(tre.eff) # This is 3000 x 4
 
# Expand family effects matrix (n.prog each) to match 
# dimension of tree effects
pheno <- fam.eff[rep(1:dim(fam.eff)[1], each = n.prog),] + tre.eff
 
# Now to 2 sites
pheno2s <- matrix(NA, n.prog*n.fam*n.site, n.trait)
colnames(pheno2s) <- c('diam1', 'dens1', 'diam2', 'dens2')
 
pheno2s[1:3000, 1:2] <- pheno[1:3000, 1:2]
pheno2s[3001:6000, 3:4] <- pheno[1:3000, 3:4]
 
# Creating data set. Family and tree are shamelessly recycled
sim.data <- data.frame(site = factor(rep(c('A', 'B'), each = n.fam*n.prog)),
                       family = factor(rep(1:n.fam, each = n.prog)),
                       tree = factor(1:30),
                       pheno2s)

Some neat things:

  • Data simulation relies on the Cholesky decomposition, as explained in this post over a year ago (time flies!). Generation is a bit more complex this time because we have to account for two covariance structures rather than one.
  • Another point with data simulation is that one could generate one set of correlated values at a time by using something like t(BFL) %*% rnorm(n.trait) inside a loop or with apply(); see the sketch after this list. This would require much less memory but would also be much slower.
  • We need to repeat each row of the family effects matrix 30 times so we can add it to the individual tree effects. Often we use indexing in matrices or data frames to extract a few rows; instead, here I’m indexing with rep() to repeat each row a given number of times.
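To make that second point concrete, a minimal sketch of the one-row-at-a-time alternative (same objects as above; slower, but with a smaller memory footprint):

# Generate the correlated family effects one family at a time inside a loop
fam.eff.loop <- matrix(NA, n.fam, n.trait)
for (i in 1:n.fam) {
  fam.eff.loop[i, ] <- as.vector(t(BFL) %*% rnorm(n.trait))
}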

If we show observations 2990 to 3010 (the last 11 for site A and the first 10 for site B) we can see the pattern of observations below, which has the required structure: an unstructured (US, or heterogeneous-variance correlation) matrix for G and a block diagonal matrix for R. By the way, one could also have block diagonal matrices with repeated measurements, although probably with a different correlation structure:

> pheno2s[2990:3010,]
            diam1      dens1      diam2       dens2
 [1,]  0.57087250 -0.8521059         NA          NA
 [2,]  0.94859621  0.6599391         NA          NA
 [3,] -3.37405451 -0.6093312         NA          NA
 [4,]  0.93541048 -0.7977893         NA          NA
 [5,] -0.74758553  0.7962593         NA          NA
 [6,]  0.51280201  1.4870425         NA          NA
 [7,] -1.92571147 -0.2554365         NA          NA
 [8,] -1.15923045 -2.0656582         NA          NA
 [9,] -0.82145904 -0.3138340         NA          NA
[10,]  1.79631670  0.3814723         NA          NA
[11,] -0.01604778 -1.1804723         NA          NA
[12,]          NA         NA -0.1436143 -1.97628883
[13,]          NA         NA -2.7099687 -2.93832962
[14,]          NA         NA -2.5153420  0.73780760
[15,]          NA         NA  0.3084056  0.61696714
[16,]          NA         NA -0.2909500  0.78111864
[17,]          NA         NA -1.8629862 -2.19346309
[18,]          NA         NA  0.8673053 -0.07692884
[19,]          NA         NA -0.1459703  0.36981965
[20,]          NA         NA -0.7688851 -0.96765799
[21,]          NA         NA  0.6637173 -0.34924814
Gratuitous picture: driving in rural Japan, the cultural value of wood and breaking preconceptions (Photo: Luis, click to enlarge).

For the analysis we will use ASReml, which is the Pro option if you work in a breeding program and require solving much larger systems of equations than this tiny example (say 50-100 traits, large pedigrees, etc). Another option for playing with this small data set would be to use MCMCglmm, which also allows for multivariate evaluation of linear mixed models.

###
### Multivariate analysis
###
 
require(asreml)
 
# Assuming zero correlation between traits (equivalent
# to four univariate analyses)
m1 <- asreml(cbind(diam1, dens1, diam2, dens2) ~ trait, 
             random = ~ diag(trait):family, 
             rcov = ~ units:diag(trait),
             data = sim.data)
 
summary(m1)$varcomp
 
#                                 gamma component  std.error   z.ratio constraint
#trait:family!trait.diam1.var 0.2571274 0.2571274 0.04524049  5.683569   Positive
#trait:family!trait.dens1.var 0.2265041 0.2265041 0.04059742  5.579274   Positive
#trait:family!trait.diam2.var 0.2383959 0.2383959 0.04185982  5.695102   Positive
#trait:family!trait.dens2.var 0.2567999 0.2567999 0.04459246  5.758818   Positive
#R!variance                   1.0000000 1.0000000         NA        NA      Fixed
#R!trait.diam1.var            1.8290472 1.8290472 0.04803313 38.078866   Positive
#R!trait.dens1.var            1.7674960 1.7674960 0.04641672 38.078866   Positive
#R!trait.diam2.var            1.6779793 1.6779793 0.04406589 38.078866   Positive
#R!trait.dens2.var            1.7028171 1.7028171 0.04471817 38.078866   Positive
 
# The multivariate analysis allowing for correlation
# is a bit more complex, so we start by generating 
# a structure for starting values
m2.sv <- asreml(cbind(diam1, dens1, diam2, dens2) ~ trait, 
                random = ~ corgh(trait):family, 
                rcov = ~ units:us(trait),
                data = sim.data,
                start.values = TRUE)
 
# Now we'll constraint some R variance components
# to zero and fix them there. These are values for
# which we know are 0
sv <- m2.sv$gammas.table
sv$Value[c(15, 16, 18, 19)] <- 0
sv$Constraint[c(15, 16, 18, 19)] <- 'F'
 
# Run the analyses with those constraints for the R matrix
# using the restricted values in R.param
m2 <- asreml(cbind(diam1, dens1, diam2, dens2) ~ trait, 
             random = ~ corgh(trait):family, 
             rcov = ~ units:us(trait),
             data = sim.data,
             R.param = sv)
 
summary(m2)$varcomp
#                                                 gamma   component  std.error
# trait:family!trait.dens1:!trait.diam1.cor 0.72535514 0.72535514 0.06381776
# trait:family!trait.diam2:!trait.diam1.cor 0.49100215 0.49100215 0.10067768
# trait:family!trait.diam2:!trait.dens1.cor 0.21445972 0.21445972 0.12085780
# trait:family!trait.dens2:!trait.diam1.cor 0.37476119 0.37476119 0.10975970
# trait:family!trait.dens2:!trait.dens1.cor 0.05221773 0.05221773 0.12439924
# trait:family!trait.dens2:!trait.diam2.cor 0.85356318 0.85356318 0.04321683
# trait:family!trait.diam1                  0.25712744 0.25712744 0.04524049
# trait:family!trait.dens1                  0.22650410 0.22650410 0.04059742
# trait:family!trait.diam2                  0.23839593 0.23839593 0.04185982
# trait:family!trait.dens2                  0.25679989 0.25679989 0.04459246
# R!variance                                1.00000000 1.00000000         NA
# R!trait.diam1:diam1                       1.82904721 1.82904721 0.04803313
# R!trait.dens1:diam1                       0.90450432 0.90450432 0.03737490
# R!trait.dens1:dens1                       1.76749598 1.76749598 0.04641672
# R!trait.diam2:diam1                       0.00000000 0.00000000         NA
# R!trait.diam2:dens1                       0.00000000 0.00000000         NA
# R!trait.diam2:diam2                       1.67797927 1.67797927 0.04406589
# R!trait.dens2:diam1                       0.00000000 0.00000000         NA
# R!trait.dens2:dens1                       0.00000000 0.00000000         NA
# R!trait.dens2:diam2                       0.71249532 0.71249532 0.03406354
# R!trait.dens2:dens2                       1.70281710 1.70281710 0.04471817
#
#                                              z.ratio    constraint
# trait:family!trait.dens1:!trait.diam1.cor 11.3660390 Unconstrained
# trait:family!trait.diam2:!trait.diam1.cor  4.8769710 Unconstrained
# trait:family!trait.diam2:!trait.dens1.cor  1.7744796 Unconstrained
# trait:family!trait.dens2:!trait.diam1.cor  3.4143789 Unconstrained
# trait:family!trait.dens2:!trait.dens1.cor  0.4197593 Unconstrained
# trait:family!trait.dens2:!trait.diam2.cor 19.7507102 Unconstrained
# trait:family!trait.diam1                   5.6835685      Positive
# trait:family!trait.dens1                   5.5792738      Positive
# trait:family!trait.diam2                   5.6951016      Positive
# trait:family!trait.dens2                   5.7588182      Positive
# R!variance                                        NA         Fixed
# R!trait.diam1:diam1                       38.0788655      Positive
# R!trait.dens1:diam1                       24.2008477      Positive
# R!trait.dens1:dens1                       38.0788655      Positive
# R!trait.diam2:diam1                               NA         Fixed
# R!trait.diam2:dens1                               NA         Fixed
# R!trait.diam2:diam2                       38.0788655      Positive
# R!trait.dens2:diam1                               NA         Fixed
# R!trait.dens2:dens1                               NA         Fixed
# R!trait.dens2:diam2                       20.9166565      Positive
# R!trait.dens2:dens2                       38.0788655      Positive
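As a quick sanity check (a sketch that relies on the component names shown in the output above), we can put the estimated genetic correlations next to the ones used in the simulation:

# Correlations implied by the simulated G matrix
cov2cor(G)
 
# Estimated genetic correlations: the m2 components whose names end in '.cor'
vc <- summary(m2)$varcomp
vc[grep('\\.cor', rownames(vc)), 'component']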

In model 1 each of the variances was supposed to be ~0.25 (1/4 * 1) and the residual variances ~1.75 (3/4*1 + 1). Once we move to model 2 we also get values similar to the correlations we were trying to simulate. And this is the end of the long answer.

P.S. If, for some bizarre reason, you would like to use SAS for this type of analysis, Fikret Isik has proc mixed code to run multivariate genetic analyses.

Overlay of design matrices in genetic analysis

I’ve ignored my quantitative geneticist side of things for a while (at least in this blog) so this time I’ll cover some code I was exchanging with a couple of colleagues who work for other organizations.

It is common to use diallel mating designs in plant and tree breeding, where a small number of parents act as both males and females. For example, with 5 parents we can have 25 crosses, including reciprocals and selfing (crossing an individual with itself). Decades ago this mating design was tricky to fit and, considering an experimental layout with randomized complete blocks, one would have something like y = mu + blocks + dads + mums + cross + error. In this model dads and mums were estimating a fraction of the additive genetic variance. With the advent of animal model BLUP, it became possible to fit something like y = mu + blocks + individual (using a pedigree) + cross + error. Another, less computationally demanding alternative (at least with unrelated parents) is to fit a parental model, overlaying the design matrices for parents, with something like y = mu + blocks + (dad + mum) + cross + error.

The following code simulates data for a diallel trial with three replicates and runs a parental model with ASReml. Later I expand the analysis by building the matrices by hand.

# Defining diallel cross
n.blocks <- 3
diallel <- matrix(0, nrow = 5, ncol = 5)
males <- 1:dim(diallel)[1]
females <- 1:dim(diallel)[2]
cross <- factor(outer(males, females, paste, sep ='x'))
cross
#[1] 1x1 2x1 3x1 4x1 5x1 1x2 2x2 3x2 4x2 5x2 1x3 2x3 3x3 4x3 5x3 1x4 2x4
#[18] 3x4 4x4 5x4 1x5 2x5 3x5 4x5 5x5
#25 Levels: 1x1 1x2 1x3 1x4 1x5 2x1 2x2 2x3 2x4 2x5 3x1 3x2 3x3 ... 5x5
 
 
#### Generating random values for the trial
# Variance components
sig2.a <- 40    # additive genetic variance
sig2.sca <- 10  # specific combining ability (SCA)
sig2.b <- 10    # blocks
sig2.e <- 30    # residuals
 
# Numbers of everything
n.parents <- length(males)
n.crosses <- length(cross)
n.trees <- n.crosses*n.blocks
 
# Random values for everything
u.g <- rnorm(n.parents)*sqrt(sig2.a)
u.sca <- rnorm(n.crosses)*sqrt(sig2.sca)
u.block <- rnorm(n.blocks)*sqrt(sig2.b)
u.e <- rnorm(n.trees)*sqrt(sig2.e)
 
# Whole trial
trial <- data.frame(block = factor(rep(1:n.blocks, each = n.crosses)),
                    mum = factor(rep(females, n.blocks)),
                    dad = factor(rep(males, each = length(females))),
                    mate = factor(rep(cross, n.blocks)))
 
trial$yield <- 0.5*(u.g[trial$mum] + u.g[trial$dad]) +
               u.sca[trial$mate] + u.block[trial$block] + u.e
 
head(trial)   
#block mum dad mate       yield
#1     1   1   1  1x1 -0.02185486
#2     1   2   1  2x1 10.79760712
#3     1   3   1  3x1 16.45186037
#4     1   4   1  4x1  8.15026291
#5     1   5   1  5x1  5.57707180
#6     1   1   2  1x2 -7.30675148
 
# Fitting the model with ASReml
library(asreml)
m1 <- asreml(yield ~ 1, 
             random = ~ block + dad + and(mum) + mate, 
             data = trial)
summary(m1)
#                       gamma    component    std.error   z.ratio
#block!block.var 1.299110e-02 3.861892e-01 1.588423e+00 0.2431274
#dad!dad.var     2.101417e-01 6.246930e+00 5.120745e+00 1.2199259
#mate!mate.var   4.589938e-07 1.364461e-05 2.340032e-06 5.8309519
#R!variance      1.000000e+00 2.972722e+01 5.098177e+00 5.8309519
 
# Obtaining the predicted breeding values for the parents
coef(m1, pattern = 'dad')
#         effect
#dad_1  1.780683
#dad_2 -2.121174
#dad_3  3.151991
#dad_4 -1.473620
#dad_5 -1.337879

How does the matrix overlay work? We can replicate the calculations used by ASReml by building the matrices from scratch and reusing the variance components, so we avoid the nastiness of writing code for residual maximum likelihood. Once I build the basic matrices I use the code from my How does a linear mixed model look like? post.

# Building incidence matrices for males and females
Zmales <- model.matrix(~ dad - 1, data = trial)
Zfemales <- model.matrix(~ mum - 1, data = trial)
 
# and() in ASReml overlays (i.e. sums) the matrices
# (Notice that this creates 0s, 1 and 2s in the design matrix)
Zparents <- Zmales + Zfemales
Zparents[1:6,]
#   dad1 dad2 dad3 dad4 dad5
#1     2    0    0    0    0
#2     1    1    0    0    0
#3     1    0    1    0    0
#4     1    0    0    1    0
#5     1    0    0    0    1
#6     1    1    0    0    0
 
# Design matrices from other factors
Zblocks <- model.matrix(~ block - 1, data = trial)
Zmates <- model.matrix(~ mate - 1, data = trial)
 
# Creating y, X and big Z matrices to solve mixed model equations
y <- trial$yield
X <- matrix(1, nrow = 75, ncol = 1)
Z <- cbind(Zblocks, Zparents, Zmates)
 
# Building covariance matrices for random effects
# Using the variance components from the ASReml model
G = diag(c(rep(3.861892e-01, 3), 
           rep(6.246930e+00, 5), 
           rep(1.364461e-05, 25)))
R = diag(2.972722e+01, 75, 75)
Rinv = solve(R)
 
# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z
 
XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X
 
Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y
 
# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
          cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)
 
blup = solve(C, RHS)
blup
# These results are identical to the ones
# produced by ASReml
#dad1     1.780683e+00
#dad2    -2.121174e+00
#dad3     3.151991e+00
#dad4    -1.473620e+00
#dad5    -1.337879e+00

The overlay matrix Zparents is double the actual value we should use:

Zparents2 <- 0.5*(Zmales + Zfemales)
Zparents2[1:6,]
#  dad1 dad2 dad3 dad4 dad5
#1  1.0  0.0  0.0  0.0  0.0
#2  0.5  0.5  0.0  0.0  0.0
#3  0.5  0.0  0.5  0.0  0.0
#4  0.5  0.0  0.0  0.5  0.0
#5  0.5  0.0  0.0  0.0  0.5
#6  0.5  0.5  0.0  0.0  0.0

Repeating the analyses ‘by hand’ using Zparents2 to build the Z matrix results in the same overall mean and block effects, but the predicted breeding values for parents obtained with Zparents are 0.7 times those obtained with Zparents2. I may need to refit the model and obtain new variance components for parents when working with the correct overlaid matrix.
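For completeness, a sketch of that ‘by hand’ rerun: only the Z matrix changes, while the variance components are kept from the ASReml fit above.

# Rebuilding Z with the 0.5-scaled overlay and solving the MME again
Z2 <- cbind(Zblocks, Zparents2, Zmates)
 
C2 <- rbind(cbind(XpX, t(X) %*% Rinv %*% Z2),
            cbind(t(Z2) %*% Rinv %*% X, t(Z2) %*% Rinv %*% Z2 + solve(G)))
RHS2 <- rbind(Xpy, t(Z2) %*% Rinv %*% y)
 
blup2 <- solve(C2, RHS2)
blup2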

Gratuitous picture: walking in Quail Island (Photo: Luis).

Split-plot 2: let’s throw in some spatial effects

Disappeared for a while collecting frequent flyer points. In the process I ‘discovered’ that I live in the middle of nowhere, as it took me 36 hours to reach my conference destination (Estoril, Portugal) through Christchurch, Sydney, Bangkok, Dubai, Madrid and Lisbon.

Where was I? Showing what split-plots look like under the bonnet (hood, for you US readers). Yates presented a nice diagram of his oats data set in the paper, so we have the spatial location of each data point, which permits us to play with within-trial spatial trends.

Rather than mucking around with typing coordinates we can rely on Kevin Wright’s version of the oats dataset contained in the agridat package. Kevin is a man of mystery, a James Bond of statisticians—so he keeps a low profile—with a keen interest in experimental design and analyses. This chap has put together a really nice collection of data sets WITH suggested code for the analyses, including nlme, lme4, asreml, MCMCglmm and a few other bits and pieces. Recommended!

Plants (triffids excepted) do not move, which means that environmental trends within a trial (things like fertility, water availability, temperature, etc) can affect experimental units in a way that varies with space and induces correlation of the residuals. In other words, we could be violating the independence assumption, if you haven’t got the hint yet.

Gratuitous picture: Detail of Mosteiro dos Jerónimos, Belém, Lisboa, Portugal (Photo: Luis).

There are a few ways to model environmental trends (AR processes, simple polynomials, splines, etc) that can be accounted for either through the G matrix (as random effects) or the R matrix. See the previous post for an explanation of the bits and pieces. Here we will use a very popular approach, which is to consider two separable (so we can estimate the bloody things) autoregressive processes, one for rows and one for columns, to model spatial association. In addition, we will have a spatial residual. In summary, the residuals have moved from \( \boldsymbol{R} = \sigma^2_e \boldsymbol{I}\) to \( \boldsymbol{R} = \sigma^2_s \boldsymbol{R}_{col} \otimes \boldsymbol{R}_{row}\). I previously showed the general form of these autoregressive matrices in this post, and you can see the \( \boldsymbol{R}_{col}\) matrix below. In some cases we can also add an independent residual (the so-called nugget) to the residual matrix.

We will first fit a split-plot model considering spatial residuals using asreml because, let’s face it, there is no other package that will give you the flexibility:

require(asreml)
require(agridat)
 
# Having a look at the structure of yates.oats
# and changing things slightly so it matches 
# previous post
str(yates.oats)
 
names(yates.oats) = c('row', 'col', 'Y', 'N', 'V', 'B')
 
yates.oats$row = factor(yates.oats$row)
yates.oats$col = factor(yates.oats$col)
yates.oats$N = factor(yates.oats$N)
yates.oats$MP = yates.oats$V
 
# Base model (this was used in the previous post)
m2 = asreml(Y ~ V*N, random = ~ B/MP, data = yates.oats)
summary(m2)$varcomp
#                gamma component std.error  z.ratio constraint
# B!B.var    1.2111647  214.4771 168.83404 1.270343   Positive
# B:MP!B.var 0.5989373  106.0618  67.87553 1.562593   Positive
# R!variance 1.0000000  177.0833  37.33244 4.743416   Positive
 
# Spatial model
m3 = asreml(Y ~ V*N, random = ~ B/MP, 
            rcov = ~ ar1(col):ar1(row),
            data = yates.oats)
summary(m3)$varcomp
#                 gamma    component   std.error   z.ratio    constraint
# B!B.var    0.80338277 169.24347389 156.8662436 1.0789031      Positive
# B:MP!B.var 0.49218005 103.68440202  73.6390759 1.4080079      Positive
# R!variance 1.00000000 210.66355939  67.4051020 3.1253355      Positive
# R!col.cor  0.04484166   0.04484166   0.2006562 0.2234751 Unconstrained
# R!row.cor  0.49412567   0.49412567   0.1420397 3.4787860 Unconstrained
 
coef(m3)$fixed
coef(m3)$random
# effect
# V_GoldenRain:N_0     0.0000000
# V_GoldenRain:N_0.2   0.0000000
# V_GoldenRain:N_0.4   0.0000000
# V_GoldenRain:N_0.6   0.0000000
# V_Marvellous:N_0     0.0000000
# V_Marvellous:N_0.2  -0.8691155
# V_Marvellous:N_0.4 -12.4223873
# V_Marvellous:N_0.6  -5.5018907
# V_Victory:N_0        0.0000000
# V_Victory:N_0.2     -1.9580360
# V_Victory:N_0.4      2.1913469
# V_Victory:N_0.6      0.3728648
# N_0                  0.0000000
# N_0.2               23.3299154
# N_0.4               40.0570745
# N_0.6               47.1749577
# V_GoldenRain         0.0000000
# V_Marvellous         9.2845952
# V_Victory           -5.7259866
# (Intercept)         76.5774292
 
# effect
# B_B1               21.4952875
# B_B2                1.0944484
# B_B3               -5.4336461
# B_B4               -4.4334455
# B_B5               -6.6925874
# B_B6               -6.0300569
# B_B1:MP_GoldenRain  1.3036724
# B_B1:MP_Marvellous -0.9082462
# B_B1:MP_Victory    12.7754283
# B_B2:MP_GoldenRain  1.3286187
# B_B2:MP_Marvellous  7.4181674
# B_B2:MP_Victory    -8.0761824
# B_B3:MP_GoldenRain -6.5288220
# B_B3:MP_Marvellous 10.4361799
# B_B3:MP_Victory    -7.2367277
# B_B4:MP_GoldenRain  6.6868810
# B_B4:MP_Marvellous -9.2317585
# B_B4:MP_Victory    -0.1716372
# B_B5:MP_GoldenRain  2.4635492
# B_B5:MP_Marvellous -9.7086196
# B_B5:MP_Victory     3.1443067
# B_B6:MP_GoldenRain -5.2538993
# B_B6:MP_Marvellous  1.9942770
# B_B6:MP_Victory    -0.4351876
 
# In a larger experiment we could try fitting a nugget using units
m4 = asreml(Y ~ V*N, random = ~ B/MP + units, 
            rcov = ~ ar1(col):ar1(row),
            data = yates.oats)
summary(m4)
 
# However, in this small experiment the system
# goes crazy and the results are meaningless
#                       gamma   component  std.error  z.ratio constraint
# B!B.var         0.006759124   223.70262   185.3816 1.206714   Positive
# B:MP!B.var      0.001339017    44.31663    29.2004 1.517672   Positive
# units!units.var 0.003150356   104.26542    34.2738 3.042132   Positive
# R!variance      1.000000000 33096.39128 19480.8333 1.698921   Positive
# R!col.cor       0.999000000     0.99900         NA       NA      Fixed
# R!row.cor       0.999000000     0.99900         NA       NA      Fixed

So we have to build an autoregressive correlation matrix for rows, another for columns, and multiply the whole thing by a spatial variance. Then we can add an independent residual (the nugget, if we want—and can estimate—one). Peter Dalgaard has neat code for building the autocorrelation matrix. And going back to the code in the previous post:

ar.matrix = function(ar, dim) {
  M = diag(dim)
  M = ar^abs(row(M)-col(M))
  return(M)
}
 
# Variance components (from m3)
varB = 169.243
varMP = 103.684
varR = 210.664
arcol = 0.045
arrow = 0.494
 
# Basic vector and matrices: y, X, Z, G & R
y = matrix(yates.oats$Y, nrow = dim(yates.oats)[1], ncol = 1)
X = model.matrix(~ V*N, data = yates.oats)
Z = model.matrix(~ B/MP - 1, data = yates.oats, 
                 contrasts.arg = list(B = contrasts(yates.oats$B, contrasts = F), 
                                      MP = contrasts(yates.oats$MP, contrasts = F)))
 
G = diag(c(rep(varB, 6), rep(varMP, 18)))
 
# Only change from previous post is building the R matrix
Rcol = ar.matrix(arcol, 4)
Rrow = ar.matrix(arrow, 18)
Rcol
# Having a look at the structure
#            [,1]     [,2]     [,3]       [,4]
# [1,] 1.0000e+00 0.045000 0.002025 9.1125e-05
# [2,] 4.5000e-02 1.000000 0.045000 2.0250e-03
# [3,] 2.0250e-03 0.045000 1.000000 4.5000e-02
# [4,] 9.1125e-05 0.002025 0.045000 1.0000e+00
 
R = varR * kronecker(Rcol, Rrow)
Rinv = solve(R)
 
# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z
 
XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X
 
Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y
 
# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
          cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)
 
blup = solve(C, RHS)
blup
#                         [,1]
# (Intercept)       76.5778238
# VMarvellous        9.2853002
# VVictory          -5.7262894
# N0.2              23.3283060
# N0.4              40.0555464
# N0.6              47.1740348
# VMarvellous:N0.2  -0.8682597
# VVictory:N0.2     -1.9568979
# VMarvellous:N0.4 -12.4200362
# VVictory:N0.4      2.1912083
# VMarvellous:N0.6  -5.5017225
# VVictory:N0.6      0.3732453
# BB1               21.4974445
# BB2                1.0949433
# BB3               -5.4344098
# BB4               -4.4333080
# BB5               -6.6948783
# BB6               -6.0297918
# BB1:MPGoldenRain   1.3047656
# BB2:MPGoldenRain   1.3294043
# BB3:MPGoldenRain  -6.5286993
# BB4:MPGoldenRain   6.6855568
# BB5:MPGoldenRain   2.4624436
# BB6:MPGoldenRain  -5.2534710
# BB1:MPMarvellous  -0.9096022
# BB2:MPMarvellous   7.4170634
# BB3:MPMarvellous  10.4349240
# BB4:MPMarvellous  -9.2293528
# BB5:MPMarvellous  -9.7080694
# BB6:MPMarvellous   1.9950370
# BB1:MPVictory     12.7749000
# BB2:MPVictory     -8.0756682
# BB3:MPVictory     -7.2355284
# BB4:MPVictory     -0.1721988
# BB5:MPVictory      3.1441163
# BB6:MPVictory     -0.4356209

These are the same results one gets from ASReml-R. QED.

P.S. Many thanks to Kevin Wright for answering my questions about agridat.

Split-plot 1: How does a linear mixed model look like?

I like statistics and I struggle with statistics. Oftentimes I get frustrated when I don’t understand, and I really struggled to make sense of Kruschke’s Bayesian analysis of a split-plot, particularly because ‘it didn’t look like’ a split-plot to me.

Additionally, I have written a few posts discussing linear mixed models using several different packages to fit them. At no point have I shown the calculations behind the scenes. So, I decided to combine my frustration and an explanation to myself in a couple of posts. This is number one and the follow-up is Split-plot 2: let’s throw in some spatial effects.

Example of forestry split-plot: one of my colleagues has a trial in which stocking (number of trees per ha) is the main plot and fertilization is the subplot (higher stockings look darker because trees are closer together).

What do linear mixed models look like?

Linear mixed models, models that combine so-called fixed and random effects, are often represented using matrix notation as:

\( \boldsymbol{y} = \boldsymbol{X b} + \boldsymbol{Z a} + \boldsymbol{e}\)

where \( \boldsymbol{y}\) represents the response variable vector, \( \boldsymbol{X} \mbox{ and } \boldsymbol{Z}\) are incidence matrices relating the response to the fixed effects (\( \boldsymbol{b}\), e.g. overall mean, treatment effects, etc) and random effects (\( \boldsymbol{a}\), e.g. family effects, additive genetic effects and a host of experimental design effects like blocks, rows, columns, plots, etc), and \( \boldsymbol{e}\) is the vector of random residuals.

The previous model equation still requires some assumptions about the distribution of the random effects. A very common set of assumptions is to say that the residuals are iid (independent and identically distributed) normal (so \( \boldsymbol{R} = \sigma^2_e \boldsymbol{I} \)) and that the random effects are independent of each other, so \( \boldsymbol{G} = \boldsymbol{B} \bigoplus \boldsymbol{M}\) is a direct sum of the variance of blocks (\( \boldsymbol{B} = \sigma^2_B \boldsymbol{I}\)) and main plots (\( \boldsymbol{M} = \sigma^2_M \boldsymbol{I}\)).

An interesting feature of this matrix formulation is that we can express all sorts of models by choosing different assumptions for our covariance matrices (using different covariance structures). Do you have longitudinal data (units assessed at repeated times)? Is there spatial correlation? Account for this in the \( \boldsymbol{R}\) matrix. Are random effects correlated (e.g. maternal and additive genetic effects in genetics)? Account for this in the \( \boldsymbol{G}\) matrix. Multivariate response? Deal with unstructured \( \boldsymbol{R}\) and \( \boldsymbol{G}\), or model the correlation structure using different constraints (and then you’ll need asreml).

By the way, the history of linear mixed models is strongly related to applications of quantitative genetics for the prediction of breeding values, particularly in dairy cattle. Charles Henderson developed what is now called Henderson’s Mixed Model Equations to simultaneously estimate fixed effects and predict random genetic effects†:

\(
\left [
\begin{array}{cc}
\boldsymbol{X}' \boldsymbol{R}^{-1} \boldsymbol{X} & \boldsymbol{X}' \boldsymbol{R}^{-1} \boldsymbol{Z} \\
\boldsymbol{Z}' \boldsymbol{R}^{-1} \boldsymbol{X} & \boldsymbol{Z}' \boldsymbol{R}^{-1} \boldsymbol{Z} + \boldsymbol{G}^{-1}
\end{array}
\right ]
\left [
\begin{array}{c}
\boldsymbol{b} \\
\boldsymbol{a}
\end{array}
\right ] =
\left [
\begin{array}{c}
\boldsymbol{X}' \boldsymbol{R}^{-1} \boldsymbol{y} \\
\boldsymbol{Z}' \boldsymbol{R}^{-1} \boldsymbol{y}
\end{array}
\right ] \)

The big matrix on the left-hand side of this equation is often called the \( \boldsymbol{C} \) matrix. You could be thinking ‘What does this all mean?’ It is easier to see what is going on with a small example, but rather than starting with, say, a complete block design, we’ll go for a split-plot to start tackling my annoyance with the aforementioned blog post.

Old school: physical split-plots

Given that I’m an unsophisticated forester and that I’d like to use data available to anyone, I’ll rely on an agricultural example (so plots are actually physical plots in the field) that goes back to Frank Yates. There are two factors (oats variety, with three levels, and fertilization, with four levels). Yates, F. (1935) Complex experiments, Journal of the Royal Statistical Society Suppl. 2, 181–247 (behind pay wall here).

Layout of oats experiment in Yates’s paper, from a time when articles were meaty. Each of the 6 replicates is divided into 3 main plots for oats varieties (v1, v2 and v3), while each variety plot was divided into four parts with different levels of fertilization (manure—animal crap—n1 to n4). Cells display yield.

Now it is time to roll up our sleeves and use some code, getting the data and fitting the same model using nlme (m1) and asreml (m2), just for the fun of it. Anyway, nlme and asreml produce exactly the same results.

We will use the oats data set that comes with MASS, although there is also an Oats data set in nlme and another version in the asreml package. (By the way, you can see a very good explanation by Bill Venables of a ‘traditional’ ANOVA analysis for a split-plot here):

require(MASS) # we get the oats data from here
require(nlme) # for lme function
require(asreml) # for asreml function. This dataset uses different variable names,
                # which may require renaming variables to use the code below
 
# Get the oats data set and check structure
data(oats)
head(oats)
str(oats)
 
# Create a main plot effect for clarity's sake
oats$MP = oats$V
 
# Split-plot using NLME
m1 = lme(Y ~ V*N, random = ~ 1|B/MP, data = oats)
summary(m1)
fixef(m1)
ranef(m1)
 
# Split-plot using ASReml
m2 = asreml(Y ~ V*N, random = ~ B/MP, data = oats)
summary(m2)$varcomp
coef(m2)$fixed
coef(m2)$random
 
fixef(m1)
#        (Intercept)         VMarvellous            VVictory             N0.2cwt 
#         80.0000000           6.6666667          -8.5000000          18.5000000 
#            N0.4cwt             N0.6cwt VMarvellous:N0.2cwt    VVictory:N0.2cwt 
#         34.6666667          44.8333333           3.3333333          -0.3333333 
#VMarvellous:N0.4cwt    VVictory:N0.4cwt VMarvellous:N0.6cwt    VVictory:N0.6cwt 
#         -4.1666667           4.6666667          -4.6666667           2.1666667 
 
ranef(m1)
# Level: B 
# (Intercept)
# I     25.421511
# II     2.656987
# III   -6.529883
# IV    -4.706019
# V    -10.582914
# VI    -6.259681
# 
# Level: MP %in% B 
# (Intercept)
# I/Golden.rain      2.348296
# I/Marvellous      -3.854348
# I/Victory         14.077467
# II/Golden.rain     4.298706
# II/Marvellous      6.209473
# II/Victory        -9.194250
# III/Golden.rain   -7.915950
# III/Marvellous    10.750776
# III/Victory       -6.063976
# IV/Golden.rain     5.789462
# IV/Marvellous     -7.115566
# IV/Victory        -1.001111
# V/Golden.rain      1.116768
# V/Marvellous      -9.848096
# V/Victory          3.497878
# VI/Golden.rain    -5.637282
# VI/Marvellous      3.857761
# VI/Victory        -1.316009

Now we can move on to implement the Mixed Model Equations, where probably the only gotcha is the definition of the \(\boldsymbol{Z}\) matrix (incidence matrix for random effects), as both nlme and asreml use ‘number of levels of the factor’ for both the main and interaction effects, which involves using the contrasts.arg argument in model.matrix().

# Variance components
varB = 214.477
varMP = 106.062
varR = 177.083
 
# Basic vector and matrices: y, X, Z, G & R
y = matrix(oats$Y, nrow = dim(oats)[1], ncol = 1)
X = model.matrix(~ V*N, data = oats)
Z = model.matrix(~ B/MP - 1, data = oats, 
                 contrasts.arg = list(B = contrasts(oats$B, contrasts = F), 
                                      MP = contrasts(oats$MP, contrasts = F)))
 
G = diag(c(rep(varB, 6), rep(varMP, 18)))
R = diag(varR, 72, 72)
Rinv = solve(R)
 
# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z
 
XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X
 
Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y
 
# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
          cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)
 
blup = solve(C, RHS)
blup
 
# [,1]
# (Intercept)          80.0000000
# VMarvellous           6.6666667
# VVictory             -8.5000000
# N0.2cwt              18.5000000
# N0.4cwt              34.6666667
# N0.6cwt              44.8333333
# VMarvellous:N0.2cwt   3.3333333
# VVictory:N0.2cwt     -0.3333333
# VMarvellous:N0.4cwt  -4.1666667
# VVictory:N0.4cwt      4.6666667
# VMarvellous:N0.6cwt  -4.6666667
# VVictory:N0.6cwt      2.1666667
# BI                   25.4215578
# BII                   2.6569919
# BIII                 -6.5298953
# BIV                  -4.7060280
# BV                  -10.5829337
# BVI                  -6.2596927
# BI:MPGolden.rain      2.3482656
# BII:MPGolden.rain     4.2987082
# BIII:MPGolden.rain   -7.9159514
# BIV:MPGolden.rain     5.7894753
# BV:MPGolden.rain      1.1167834
# BVI:MPGolden.rain    -5.6372811
# BI:MPMarvellous      -3.8543865
# BII:MPMarvellous      6.2094778
# BIII:MPMarvellous    10.7507978
# BIV:MPMarvellous     -7.1155687
# BV:MPMarvellous      -9.8480945
# BVI:MPMarvellous      3.8577740
# BI:MPVictory         14.0774514
# BII:MPVictory        -9.1942649
# BIII:MPVictory       -6.0639747
# BIV:MPVictory        -1.0011059
# BV:MPVictory          3.4978963
# BVI:MPVictory        -1.3160021

Not surprisingly, we get the same results, except that we start by assuming the variance components from the previous analyses, so we can avoid implementing the code for restricted maximum likelihood estimation as well. By the way, given that \( \boldsymbol{R}^{-1} \) is in all terms it can be factored out from the MME, leaving terms like \( \boldsymbol{X}' \boldsymbol{X} \), i.e. without \( \boldsymbol{R}^{-1}\), making for simpler calculations. In fact, if you drop the \( \boldsymbol{R}^{-1} \) it is easier to see what is going on in the different components of the \( \boldsymbol{C} \) matrix. For example, print \( \boldsymbol{X}' \boldsymbol{X} \) and you’ll get the sum of observations for the overall mean and for each of the levels of the fixed effect factors. Give it a try with the other submatrices too!

XpXnoR = t(X) %*% X
XpXnoR
#                    (Intercept) VMarvellous VVictory N0.2cwt N0.4cwt N0.6cwt
#(Intercept)                  72          24       24      18      18      18
#VMarvellous                  24          24        0       6       6       6
#VVictory                     24           0       24       6       6       6
#N0.2cwt                      18           6        6      18       0       0
#N0.4cwt                      18           6        6       0      18       0
#N0.6cwt                      18           6        6       0       0      18
#VMarvellous:N0.2cwt           6           6        0       6       0       0
#VVictory:N0.2cwt              6           0        6       6       0       0
#VMarvellous:N0.4cwt           6           6        0       0       6       0
#VVictory:N0.4cwt              6           0        6       0       6       0
#VMarvellous:N0.6cwt           6           6        0       0       0       6
#VVictory:N0.6cwt              6           0        6       0       0       6
#                    VMarvellous:N0.2cwt VVictory:N0.2cwt VMarvellous:N0.4cwt
#(Intercept)                           6                6                   6
#VMarvellous                           6                0                   6
#VVictory                              0                6                   0
#N0.2cwt                               6                6                   0
#N0.4cwt                               0                0                   6
#N0.6cwt                               0                0                   0
#VMarvellous:N0.2cwt                   6                0                   0
#VVictory:N0.2cwt                      0                6                   0
#VMarvellous:N0.4cwt                   0                0                   6
#VVictory:N0.4cwt                      0                0                   0
#VMarvellous:N0.6cwt                   0                0                   0
#VVictory:N0.6cwt                      0                0                   0
#                    VVictory:N0.4cwt VMarvellous:N0.6cwt VVictory:N0.6cwt
#(Intercept)                        6                   6                6
#VMarvellous                        0                   6                0
#VVictory                           6                   0                6
#N0.2cwt                            0                   0                0
#N0.4cwt                            6                   0                0
#N0.6cwt                            0                   6                6
#VMarvellous:N0.2cwt                0                   0                0
#VVictory:N0.2cwt                   0                   0                0
#VMarvellous:N0.4cwt                0                   0                0
#VVictory:N0.4cwt                   6                   0                0
#VMarvellous:N0.6cwt                0                   6                0
#VVictory:N0.6cwt                   0                   0                6
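Taking up that suggestion, a quick look at one of the other submatrices without \( \boldsymbol{R}^{-1} \): the diagonal of \( \boldsymbol{Z}' \boldsymbol{Z} \) counts the observations in each level of the random effects, and the off-diagonals count the observations shared by two levels.

ZpZnoR = t(Z) %*% Z
ZpZnoR[1:8, 1:8] # the six blocks and the first two block:main-plot combinations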

I will leave it here and come back to this problem as soon as I can.

† Incidentally, a lot of the theoretical development was supported by Shayle Searle (a Kiwi statistician and Henderson’s colleague at Cornell University).

Bivariate linear mixed models using ASReml-R with multiple cores

A while ago I wanted to run a quantitative genetic analysis where the performance of genotypes in each site was considered a different trait. If you think about it, with 70 sites and thousands of genotypes one is trying to fit a 70×70 additive genetic covariance matrix, which requires 70*69/2 = 2,415 covariance components. Besides requiring huge amounts of memory and being subject to all sorts of estimation problems, there were all sorts of connectedness issues that precluded the use of Factor Analytic models for the covariance matrix. The next best thing was to run over 2,000 bivariate analyses to build a large genetic correlation matrix (which has all sorts of issues, I know). This meant leaving the computer running for over a week.

In another, unrelated post, Kevin asked if I had ever considered using ASReml-R to run in parallel on a computer with multiple cores. Short answer: I hadn’t, but there is always a first time.

require(asreml) # Multivariate mixed models
require(doMC)   # To access multiple cores
 
registerDoMC()  # to start a "parallel backend"
 
# Read basic density data
setwd('~/Dropbox/quantumforest')
# Phenotypes
dat = read.table('density09.txt', header = TRUE)
dat$FCLN = factor(dat$FCLN)
dat$MCLN = factor(dat$MCLN)
dat$REP = factor(dat$REP)
dat$SETS = factor(dat$SETS)
dat$FAM = factor(dat$FAM)
dat$CP = factor(dat$CP)
dat$Tree_id = factor(dat$Tree_id)
head(dat)
 
# Pedigree
ped = read.table('pedigree.txt', header = TRUE)
names(ped)[1] = 'Tree_id'
ped$Tree_id = factor(ped$Tree_id)
ped$Mother = factor(ped$Mother)
ped$Father = factor(ped$Father)
 
# Inverse of the numerator relationship matrix
Ainv = asreml.Ainverse(ped)$ginv
 
# Wrap call to a generic bivariate analysis
# (This one uses the same model for all sites
#  but it doesn't need to be so)
bivar = function(trial1, trial2) {
  t2 = dat[dat$Trial_id == trial1 | dat$Trial_id == trial2,]
  t2$den1 = ifelse(t2$Trial_id == trial1, t2$DEN, NA)
  t2$den2 = ifelse(t2$Trial_id == trial2, t2$DEN, NA)
  t2$Trial_id = t2$Trial_id[drop = TRUE]
 
  # Bivariate run
  # Fits overall mean, random correlated additive effects with
  # heterogeneous variances and diagonal matrix for heterogeneous
  # residuals
  den.bi =  asreml(cbind(den1,den2) ~ trait,
                   random = ~ ped(Tree_id):corgh(trait),
                   data = t2,
                   ginverse = list(Tree_id = Ainv),
                   rcov = ~ units:diag(trait),
                   workspace=64e06, maxiter=30)
 
  # Returns only log-Likelihood for this example
  return(summary(den.bi)$loglik)
}
 
# Run the bivariate example in parallel for two pairs of sites
# FR38_1 with FR38_2, FR38_1 with FR38_3
foreach(trial1=rep("FR38_1",2), trial2=c("FR38_2", "FR38_3")) %dopar% 
        bivar(trial1, trial2)

The code runs on my laptop (only two cores) but I still have to test its performance on my desktop (4 cores) and see if it really makes a difference in time versus running the analyses sequentially. Initially my mistake was using the multicore package (require(multicore)) directly, which doesn’t start the ‘parallel backend’, so the code ran sequentially. Using require(doMC) loads multicore and takes care of starting the ‘parallel backend’.
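In case it helps, a minimal sketch of the registration step (the number of cores passed to registerDoMC() is just an example):

# doMC loads multicore and registers the parallel backend in one go
require(doMC)
registerDoMC(cores = 2) # explicit number of cores; optional
getDoParWorkers()       # from foreach; how many workers %dopar% will use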

Gratuitous picture: Trees at 8 pm illuminated by bonfire and full moon (Photo: Luis).

P.S. Thanks to Etienne Laliberté, who sent me an email in early 2010 pointing out that I had to use doMC. One of the few exceptions in the pile of useless emails I have archived.
P.S.2. If you want to learn more about this type of models I recommend two books: Mrode’s Linear Models for the Prediction of Animal Breeding Values, which covers multivariate evaluation with lots of gory details, and Lynch and Walsh’s Genetics and Analysis of Quantitative Traits, which is the closest thing to the bible in quantitative genetics.
P.S.3. 2012-05-08. I did run the code on my desktop (iMac with 4 cores), making a couple of small changes in the code to have a fairer comparison between using %dopar% (for parallel execution) and %do% (for sequential execution).

  1. Added trace = FALSE to the asreml() call, to avoid printing partial iteration details to the screen, which happened when using %do%.
  2. Used 4 pairs of sites, so I could utilize all cores in my desktop.

Execution time for the parallel version—measured with Sys.time()—was, roughly, 1/(number of cores) of the time required by the sequential version.
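A rough sketch of that comparison (site names beyond FR38_3 are made up for illustration; it assumes the bivar() function and objects defined above):

# Parallel execution of four bivariate analyses
t0 <- Sys.time()
foreach(trial1 = rep('FR38_1', 4),
        trial2 = c('FR38_2', 'FR38_3', 'FR38_4', 'FR38_5')) %dopar%
        bivar(trial1, trial2)
Sys.time() - t0
 
# Sequential execution of the same analyses
t0 <- Sys.time()
foreach(trial1 = rep('FR38_1', 4),
        trial2 = c('FR38_2', 'FR38_3', 'FR38_4', 'FR38_5')) %do%
        bivar(trial1, trial2)
Sys.time() - t0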

Rstudio and asreml working together in a mac

December and January were crazy months, with a lot of travel, and suddenly I found myself in February working on four parallel projects involving quantitative genetics data analyses. (I’ll write about some of them very soon.)

Anyhow, as I have pointed out on repeated occasions, I prefer asreml-R for mixed model analyses because I run out of functionality with nlme and lme4 very quickly. A ten-trait multivariate mixed model with a pedigree, anyone? I thought so. Well, there are asreml-R versions for Windows, Linux and OS X; unsurprisingly, I use the latter. Installation in OS X is not particularly complicated (just follow the instructions in this PDF file); remember to add and export the following environment variables in your .bash_profile:

# Location of license file, usually installed
# under Applications, together with the console
# version of asreml
ASREML_LICENSE_FILE=/Applications/asreml3/bin/asreml.lic
export ASREML_LICENSE_FILE
 
# Location for dynamic library loading. The number will
# depend on the R version (here it is 2.13)
DYLD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Users/lap44/Library/R/2.13/library/asreml/libs
export DYLD_LIBRARY_PATH

These instructions work fine if one uses asreml-R in a Terminal window or if one uses a text editor (emacs + ESS, VIM + VIM-R, Textmate + R + R console bundles, etc). However, I couldn’t get the default R GUI in OS X (R.app) or Rstudio (my favorite way to work with R these days) working.

I would use library(asreml) or require(asreml), which would create no problems, but as soon as I used the function asreml(...) I would get an error message stating that I did not have a valid license. Frustrating to say the least (I even considered using emacs for the rest of my life, can you believe it?), because it meant that Rstudio was not able to ‘see’ the environment variables (as posted in this question). The discussion on that question points to a very useful part of the R help system, which explains the startup process for R, suggesting a simple solution: create an .Renviron file in your home directory.

Thus, I simply copied the previously highlighted code into a text file called .Renviron, saved it in my home directory and I can now call asreml-R from Rstudio without any problems. This solution should also work any other time one would like to access environment variables from Rstudio for OS X. Incidentally, Rstudio is becoming really useful, adding ‘projects’ and integrating version control, which means that now I have a(n almost) self-contained working environment.
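A quick way to confirm that Rstudio picked up the .Renviron file is to query the variable from within R; an empty string means the file was not read at startup:

# Should print the path to the license file if .Renviron was read at startup
Sys.getenv('ASREML_LICENSE_FILE')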

P.S. Note to self: While I was testing environment variables in my .bash_profile I wanted to refresh the variables without rebooting the computer. The easiest way to do so is typing . .bash_profile (yes, that starts with dot space). To check the exported value of a specific variable one can use the echo command, as in echo $ASREML_LICENSE_FILE, which should return the assigned value (/Applications/asreml3/bin/asreml.lic in my case).

Tall big data, wide big data

After attending two one-day workshops last week I spent most days paying attention to (well, at least listening to) presentations at this biostatistics conference. Most presenters were R users (although Genstat, Matlab and SAS fans were also present), and not once did I hear “I can’t deal with the current size of my data sets”. However, there were some complaints about the speed of R, particularly when dealing with simulations or some genomic analyses.

Some people worried about the size of coming datasets; nevertheless, that worry applied across statistical packages or, more precisely, went beyond statistical software. How will we even be able to store the data from something like the Square Kilometer Array, let alone analyze it?

[Image: Laurel and Hardy]

In a previous post I was asking if we needed to actually deal with ‘big data’ in R, and my answer was probably not or, better, at least not directly. I still think that it is a valid, although incomplete, opinion. In many statistical analyses we can think of n (the number of observations) and p (the number of variables per observation). In most cases, particularly when people refer to big data, n >> p. Thus, we may have 100 million people but only 10 potential predictors: tall data. In contrast, we may have only 1,000 individuals, but with 50,000 points each coming from near-infrared spectrometry or information from 250,000 SNPs (a type of molecular marker): wide data. Both types of data will keep on growing, but they are challenging in different ways.

In a totally generalizing, unfair and simplistic way I will state that dealing with wide data is more difficult (and potentially more interesting) than dealing with tall data, at least from a modeling perspective. As the t-shirt says: sampling is not a crime, and it should work quite well with simpler models and large datasets. In contrast, sampling as a way of fitting wide data may not work at all.

Algorithms. Clever algorithms are what we need in a first stage. For example, we can fit linear mixed models to a tall dataset with ten million records, or a multivariate mixed model with 60 responses, using ASReml-R. Wide datasets are often approached using Bayesian inference, but MCMC gets slooow when dealing with thousands of predictors, so we need other, faster approximations to the posterior distributions.

This post may not be totally coherent, but it keeps the conversation going. My excuse? I was watching Be kind rewind while writing it.

Surviving a binomial mixed model

A few years ago we had this really cool idea: we had to establish a trial to understand wood quality in context, sort of following the saying “we don’t know who discovered water, but we are sure that it wasn’t a fish” (attributed to Marshall McLuhan). By now you are thinking WTF is this guy talking about? But the idea was simple: let’s establish a trial that had the species we wanted to study (Pinus radiata, a gymnosperm) and an angiosperm (Eucalyptus nitens, if you wish to know) to provide the contrast, as they are supposed to have vastly different types of wood. From space the trial looked like this:

The reason you can clearly see the pines but not the eucalypts is that the latter were dying like crazy over a summer drought (45% mortality in one month). And here we get to the analytical part: we will have a look only at the eucalypts, where the response variable can’t get any clearer; trees were either totally dead or alive. The experiment followed a randomized complete block design, with 50 open-pollinated families in 48 blocks. The original idea was to harvest 12 blocks each year but, for obvious reasons, we canned this part of the experiment after the first year.
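
The original dataset lives in euc.Rdata, which is not included here. If you want to run the code below without it, a simulated stand-in with the same structure (hypothetical values, with variances roughly in the ballpark of the estimates further down) could be generated like this:

# Simulated stand-in for euc.Rdata: 50 open-pollinated families
# crossed with 48 blocks, with survival generated on the logit scale
set.seed(42)
euc = expand.grid(Fami = factor(1:50), Block = factor(1:48))
fam.eff = rnorm(50, mean = 0, sd = sqrt(0.6))   # family effects
blk.eff = rnorm(48, mean = 0, sd = sqrt(0.13))  # block effects
eta = 0.3 + fam.eff[as.integer(euc$Fami)] + blk.eff[as.integer(euc$Block)]
euc$surv = rbinom(nrow(euc), size = 1, prob = plogis(eta))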

The following code shows the analysis in asreml-R, lme4 and MCMCglmm:

load('~/Dropbox/euc.Rdata')
 
library(asreml)
sasreml = asreml(surv ~ 1, random = ~ Fami + Block,
                 data = euc, 
                 family = asreml.binomial(link = 'logit'))
summary(sasreml)$varcomp
 
#                      gamma component  std.error  z.ratio
#Fami!Fami.var     0.5704205 0.5704205 0.14348068 3.975591
#Block!Block.var   0.1298339 0.1298339 0.04893254 2.653324
#R!variance        1.0000000 1.0000000         NA       NA
 
#                 constraint
#Fami!Fami.var      Positive
#Block!Block.var    Positive
#R!variance            Fixed
 
# Quick look at heritability
varFami = summary(sasreml)$varcomp[1, 2]
varRep = summary(sasreml)$varcomp[2, 2]
h2 = 4*varFami/(varFami + varRep + 3.29)
h2
#[1] 0.5718137
 
 
library(lme4)
# Note: current versions of lme4 fit this model with glmer();
# lmer() with a family argument dates from older versions
slme4 = lmer(surv ~ 1 + (1|Fami) + (1|Block), 
             data = euc,
             family = binomial(link = 'logit'))
 
summary(slme4)
 
#Generalized linear mixed model fit by the Laplace approximation 
#Formula: surv ~ 1 + (1 | Fami) + (1 | Block) 
#   Data: euc 
#  AIC  BIC logLik deviance
# 2725 2742  -1360     2719
#Random effects:
# Groups   Name        Variance Std.Dev.
# Fami     (Intercept) 0.60941  0.78065 
# Block    (Intercept) 0.13796  0.37143 
#Number of obs: 2090, groups: Fami, 51; Block, 48
#
#Fixed effects:
#            Estimate Std. Error z value Pr(>|z|)  
#(Intercept)   0.2970     0.1315   2.259   0.0239 *
 
# Quick look at heritability
varFami = VarCorr(slme4)$Fami[1]
varRep = VarCorr(slme4)$Block[1]
h2 = 4*varFami/(varFami + varRep + 3.29)
h2
#[1] 0.6037697
 
# And let's play to be Bayesians!
library(MCMCglmm)
pr = list(R = list(V = 1, nu = 0, fix = 1),
          G = list(G1 = list(V = 1, nu = 0.002), 
                   G2 = list(V = 1, nu = 0.002)))
 
sb <- MCMCglmm(surv ~ 1, 
               random = ~ Fami + Block, 
               family = 'categorical',
               data = euc,
               prior = pr,
               verbose = FALSE,
               pr = TRUE,
               burnin = 10000,
               nitt = 100000,
               thin = 10)
 
plot(sb$VCV)

You may be wondering where the 3.29 in the heritability formula comes from. Well, that’s the variance of the link function which, in the case of the logit link, is \( \pi^2/3 \approx 3.29 \). In the case of MCMCglmm we can estimate the degree of genetic control quite easily, remembering that we have half-siblings (open-pollinated plants):

# Heritability
h2 = 4*sb$VCV[, 'Fami']/(sb$VCV[, 'Fami'] + 
                         sb$VCV[, 'Block'] + 3.29 + 1)
posterior.mode(h2)
#     var1 
#0.6476185 
 
HPDinterval(h2)
#         lower     upper
#var1 0.4056492 0.9698148
#attr(,"Probability")
#[1] 0.95
 
plot(h2)
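
Written out, the quantity sampled in the code above is \( h^2 = 4\,\sigma^2_{Fami} / (\sigma^2_{Fami} + \sigma^2_{Block} + \pi^2/3 + 1) \): the \( \pi^2/3 \approx 3.29 \) is the logit link variance, the extra 1 is the residual variance fixed in the prior, and the 4 reflects that, under the half-sib assumption for open-pollinated families, the family variance represents a quarter of the additive variance.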

By the way, it is good to remember that we need to back-transform the estimated effects to probabilities, with very simple code:

# Getting mode and credible interval for solutions,
# back-transformed to probabilities; inv.logit()
# comes from the boot package
library(boot)
inv.logit(posterior.mode(sb$Sol))
inv.logit(HPDinterval(sb$Sol, 0.95))

Even if one of your trials is trashed, there is a silver lining: it is possible to have a look at survival.