
## Analyzing a simple experiment with heterogeneous variances using asreml, MCMCglmm and SAS

I was working with a small experiment which includes families from two Eucalyptus species and thought it would be nice to code a first analysis using alternative approaches. The experiment is a randomized complete block design, with species as a fixed effect and family and block as random effects, while the response variable is growth strain (in $$\mu\epsilon$$).
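In model terms the analysis can be sketched as (the subscript notation here is mine):

$$y_{ijk} = \mu + S_i + f_j + b_k + e_{ijk}, \quad e_{ijk} \sim N(0, \sigma^2_{e_i})$$

where $$S_i$$ is the fixed species effect, $$f_j$$ the random family effect, $$b_k$$ the random block effect, and the residual variance $$\sigma^2_{e_i}$$ is allowed to differ between the two species.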

When looking at the trees one can see that the residual variances will be very different. In addition, the trees were growing in plastic bags laid out in rows (the blocks) and columns. Given that the trees were growing in bags sitting on flat terrain, most likely the row effects are zero.

Below is the code for a first go in R (using both MCMCglmm and ASReml-R) and SAS. I had stopped using SAS several years ago, mostly because I was running a Mac, for which there is no version. However, a few weeks ago I started accessing it through a web browser via their OnDemand for Academics program.

The R code using REML looks like:

# Options
options(stringsAsFactors = FALSE)
setwd('~/Dropbox/research/2013/growth-stress')

# Packages
require(ggplot2)
require(asreml)
require(MCMCglmm)

# Reading data, renaming, factors, etc
# (the read is not shown in the original; the file name below is an
# assumption based on the CSV used in the SAS step)
gs = read.csv('growthstresses.csv')
summary(gs)

# Both argophloia and bosistoana
gsboth = subset(gs, !is.na(strain))
gsboth = within(gsboth, {
    species = factor(species)
    row = factor(row)
    column = factor(column)
    fam = factor(fam)
})

ma = asreml(strain ~ species, random = ~ fam + row,
            rcov = ~ at(species):units,
            data = gsboth)

summary(ma)$varcomp
#                                    gamma  component std.error   z.ratio constraint
# fam!fam.var                    27809.414  27809.414 10502.036 2.6480022   Positive
# row!row.var                     2337.164   2337.164  3116.357 0.7499666   Positive
# species_E.argopholia!variance 111940.458 111940.458 26609.673 4.2067580   Positive
# species_E.bosistoana!variance  63035.256  63035.256  7226.768 8.7224681   Positive

While using MCMC we get estimates in the ballpark by using:

# Priors
bothpr = list(R = list(V = diag(c(50000, 50000)), nu = 3),
              G = list(G1 = list(V = 20000, nu = 0.002),
                       G2 = list(V = 20000, nu = 0.002),
                       G3 = list(V = 20000, nu = 0.002)))

# MCMC
m2 = MCMCglmm(strain ~ species, random = ~ fam + row + column,
              rcov = ~ idh(species):units,
              data = gsboth, prior = bothpr, pr = TRUE,
              family = 'gaussian', burnin = 10000,
              nitt = 40000, thin = 20,
              saveX = TRUE, saveZ = TRUE, verbose = FALSE)

summary(m2)
# Iterations = 10001:39981
# Thinning interval = 20
# Sample size = 1500
#
# DIC: 3332.578
#
# G-structure:  ~fam
#
#     post.mean l-95% CI u-95% CI eff.samp
# fam     30315    12211    55136     1500
#
#               ~row
#
#     post.mean l-95% CI u-95% CI eff.samp
# row      1449    5.928     6274    589.5
#
# R-structure:  ~idh(species):units
#
#                    post.mean l-95% CI u-95% CI eff.samp
# E.argopholia.units    112017    71152   168080     1500
# E.bosistoana.units     65006    52676    80049     1500
#
# Location effects: strain ~ species
#
#                     post.mean l-95% CI u-95% CI eff.samp  pMCMC
# (Intercept)            502.21   319.45   690.68     1500 <7e-04 ***
# speciesE.bosistoana   -235.95  -449.07   -37.19     1361  0.036 *

The SAS code is not that dissimilar, except for the clear demarcation between data processing (the data step, for reading files, data transformations, etc) and specific procs (procedures), in this case to summarize the data, produce a boxplot and fit a mixed model.
* termstr=CRLF accounts for the Windows-like line endings of the data set;
data gs;
  infile "/home/luis/Projects/euc-strain/growthstresses.csv"
         dsd termstr=CRLF firstobs=2;
  input row column species $ family $ strain;
  if strain ^= .;
run;

proc summary data = gs print;
  class species;
  var strain;
run;

proc boxplot data = gs;
  plot strain*species;
run;

proc mixed data = gs;
  class row species family;
  model strain = species;
  random row family;
  repeated species / group=species;
run;

/*
Covariance Parameter Estimates
Cov Parm  Group             Estimate
row                          2336.80
family                         27808
species   species E.argoph    111844
species   species E.bosist     63036
*/

SAS boxplot for the data set.

I like working with multiple languages and I realized that, in fact, I missed SAS a bit. It was like meeting an old friend; it felt strange at the beginning, but we were quickly chatting away after a few minutes.

## More sense of random effects

I can't exactly remember how I arrived at Making sense of random effects, a good post in the Distributed Ecology blog (go over there and read it). Incidentally, my working theory is that I follow Scott Chamberlain (@recology_), who follows Karthik Ram (@_inundata), who mentioned Edmund Hart's (@DistribEcology) post. I liked the discussion, but I thought one could add to the explanation to make it a bit clearer.

The idea is that there are 9 individuals, assessed five times each—once under each of five different levels for a treatment—so we need to include individual as a random effect; after all, it is our experimental unit.
The code to generate the data, plot it and fit the model is available in the post, but I redid the data generation to make it a bit more R-ish and, dare I say, a tad more elegant:

require(lme4)
require(ggplot2)

# Basic dataset
idf <- data.frame(ind = factor(rep(1:9, 5)),
                  levs = factor(rep(c('i1', 'i2', 'i3', 'i4', 'i5'), each = 9)),
                  f.means = rep(c(6, 16, 2, 10, 13), each = 9),
                  i.eff = rep(seq(-4, 4, length = 9), 5))

# Function for individual values used in apply
ran.ind <- function(f.means, i.eff) rnorm(1, f.means + i.eff, 0.3)

# Rich showed me this version of apply that takes more than one argument
# https://smtp.biostat.wustl.edu/sympa/biostat/arc/s-news/2005-06/msg00022.html
# so we skip loops
idf$size <- apply(idf[, 3:4], 1,
                  function(x) do.call('ran.ind', as.list(x)))

ggplot(idf, aes(x = levs, y = size, group = ind, colour = ind)) +
geom_point() + geom_path()

# Fit linear mixed model (avoid an overall mean with -1)
m3 <- lmer(size ~ levs - 1 + (1 | ind), data = idf)
summary(m3)

# Skipping a few things
#   AIC   BIC logLik deviance REMLdev
# 93.84 106.5 -39.92    72.16   79.84
#Random effects:
# Groups   Name        Variance Std.Dev.
# ind      (Intercept) 7.14676  2.67334
# Residual             0.10123  0.31816
#Number of obs: 45, groups: ind, 9

# Show fixed effects
fixef(m3)

#   levsi1    levsi2    levsi3    levsi4    levsi5
# 5.824753 15.896714  2.029902  9.969462 12.870952



To better understand what's going on, we can 'adjust' the observed scores by subtracting the estimated fixed effects, and plot those values to see what we are modeling with the random effects:

# Subtract fixed effects
idf$adjusted <- idf$size - rep(fixef(m3), each = 9)
ggplot(idf, aes(x = levs, y = adjusted, group = ind, colour = ind)) +
geom_point() + geom_path()

# Display random effects
ranef(m3)

#$ind
#   (Intercept)
#1  -3.89632810
#2  -2.83054394
#3  -1.99715524
#4  -1.12733342
#5   0.06619981
#6   0.95605162
#7   2.00483963
#8   2.99947727
#9   3.82479236

Data 'adjusted' by fixed effects.

The random effects for individual or, better, the individual-level intercepts are pretty much the lines going through the middle (the average) of the points for each individual. Furthermore, the variance for ind is the variance of these random intercepts around the 'adjusted' values, which can be seen by comparing the variance of the random effects above (~7.15) with the result below (~7.13).

var(unlist(ranef(m3)))
# 7.12707

Distributed Ecology then goes on to randomize the individuals within treatment, which means that the average deviation around the adjusted means is pretty close to zero, making that variance component close to zero. I hope this explanation complements Edmund Hart's nice post.

P.S. If you happen to be in the Southern part of South America next week, I'll be here and we can have a chat (and a beer, of course).

## Overlay of design matrices in genetic analysis

I've ignored my quantitative-geneticist side of things for a while (at least in this blog), so this time I'll cover some code I was exchanging with a couple of colleagues who work for other organizations.

It is common to use diallel mating designs in plant and tree breeding, where a small number of parents acts as both males and females. For example, with 5 parents we can have 25 crosses, including reciprocals and selfing (crossing an individual with itself). Decades ago this mating design was tricky to fit and, considering an experimental layout with randomized complete blocks, one would have something like y = mu + blocks + dads + mums + cross + error. In this model dads and mums were estimating a fraction of the additive genetic variance.
With the advent of animal-model BLUP, it was possible to fit something like y = mu + blocks + individual (using a pedigree) + cross + error. Another, less computationally demanding alternative (at least with unrelated parents) is to fit a parental model, overlaying the design matrices for parents, with something like y = mu + blocks + (dad + mum) + cross + error.

The following code simulates data for a diallel trial with three replicates and runs a parental model with ASReml. Later I expand the analysis, building the matrices by hand.

# Defining diallel cross
n.blocks <- 3
diallel <- matrix(0, nrow = 5, ncol = 5)
males <- 1:nrow(diallel)
females <- 1:ncol(diallel)
cross <- factor(outer(males, females, paste, sep = 'x'))
cross
# 1x1 2x1 3x1 4x1 5x1 1x2 2x2 3x2 4x2 5x2 1x3 2x3 3x3 4x3 5x3 1x4 2x4
# 3x4 4x4 5x4 1x5 2x5 3x5 4x5 5x5
# 25 Levels: 1x1 1x2 1x3 1x4 1x5 2x1 2x2 2x3 2x4 2x5 3x1 3x2 3x3 ... 5x5

#### Generating random values for the trial
# Variance components
sig2.a <- 40    # additive genetic variance
sig2.sca <- 10  # specific combining ability (SCA)
sig2.b <- 10    # blocks
sig2.e <- 30    # residuals

# Numbers of everything
n.parents <- length(males)
n.crosses <- length(cross)
n.trees <- n.crosses*n.blocks

# Random values for everything
u.g <- rnorm(n.parents)*sqrt(sig2.a)
u.sca <- rnorm(n.crosses)*sqrt(sig2.sca)
u.block <- rnorm(n.blocks)*sqrt(sig2.b)
u.e <- rnorm(n.trees)*sqrt(sig2.e)

# Whole trial
trial <- data.frame(block = factor(rep(1:n.blocks, each = n.crosses)),
                    mum = factor(rep(females, n.blocks)),
                    dad = factor(rep(males, each = length(females))),
                    mate = factor(rep(cross, n.blocks)))

trial$yield <- 0.5*(u.g[trial$mum] + u.g[trial$dad]) +
               u.sca[trial$mate] + u.block[trial$block] + u.e

head(trial)
#  block mum dad mate       yield
#1     1   1   1  1x1 -0.02185486
#2     1   2   1  2x1 10.79760712
#3     1   3   1  3x1 16.45186037
#4     1   4   1  4x1  8.15026291
#5     1   5   1  5x1  5.57707180
#6     1   1   2  1x2 -7.30675148

# Fitting the model with ASReml
library(asreml)
m1 <- asreml(yield ~ 1,
             random = ~ block + dad + and(mum) + mate,
             data = trial)
summary(m1)
#                       gamma    component    std.error   z.ratio
#block!block.var 1.299110e-02 3.861892e-01 1.588423e+00 0.2431274
#mate!mate.var   4.589938e-07 1.364461e-05 2.340032e-06 5.8309519
#R!variance      1.000000e+00 2.972722e+01 5.098177e+00 5.8309519

# Obtaining the predicted breeding values for the parents
#         effect


How is the matrix overlay working? We can replicate the calculations used by ASReml by building the matrices from scratch and reusing the variance components, so we avoid the nastiness of writing code for residual maximum likelihood. Once I build the basic matrices I use the code from my How does a linear mixed model look like? post.
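The calculations below are just Henderson's mixed model equations; for known $$\boldsymbol{G}$$ and $$\boldsymbol{R}$$ the solutions for the fixed ($$\boldsymbol{b}$$) and random ($$\boldsymbol{a}$$) effects come from:

$$\left[ \begin{array}{cc} \boldsymbol{X}'\boldsymbol{R}^{-1}\boldsymbol{X} & \boldsymbol{X}'\boldsymbol{R}^{-1}\boldsymbol{Z} \\ \boldsymbol{Z}'\boldsymbol{R}^{-1}\boldsymbol{X} & \boldsymbol{Z}'\boldsymbol{R}^{-1}\boldsymbol{Z} + \boldsymbol{G}^{-1} \end{array} \right] \left[ \begin{array}{c} \hat{\boldsymbol{b}} \\ \hat{\boldsymbol{a}} \end{array} \right] = \left[ \begin{array}{c} \boldsymbol{X}'\boldsymbol{R}^{-1}\boldsymbol{y} \\ \boldsymbol{Z}'\boldsymbol{R}^{-1}\boldsymbol{y} \end{array} \right]$$

The coefficient matrix on the left is the C in the code, and the vector on the right is the RHS.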

# Building incidence matrices for males and females
Zmales <- model.matrix(~ dad - 1, data = trial)
Zfemales <- model.matrix(~ mum - 1, data = trial)

# and() in ASReml overlays (i.e. sums) the matrices
# (Notice that this creates 0s, 1s and 2s in the design matrix)
Zparents <- Zmales + Zfemales
Zparents[1:6,]
#  dad1 dad2 dad3 dad4 dad5
#1    2    0    0    0    0
#2    1    1    0    0    0
#3    1    0    1    0    0
#4    1    0    0    1    0
#5    1    0    0    0    1
#6    1    1    0    0    0

# Design matrices from other factors
Zblocks <- model.matrix(~ block - 1, data = trial)
Zmates <- model.matrix(~ mate - 1, data = trial)

# Creating y, X and big Z matrices to solve mixed model equations
y <- trial$yield
X <- matrix(1, nrow = 75, ncol = 1)
Z <- cbind(Zblocks, Zparents, Zmates)

# Building covariance matrices for random effects
# Using the variance components from the ASReml model
G = diag(c(rep(3.861892e-01, 3),
           rep(6.246930e+00, 5),
           rep(1.364461e-05, 25)))
R = diag(2.972722e+01, 75, 75)
Rinv = solve(R)

# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z
XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X
Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y

# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
          cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)

blup = solve(C, RHS)
blup
# These results are identical to the ones
# produced by ASReml
#dad1  1.780683e+00
#dad2 -2.121174e+00
#dad3  3.151991e+00
#dad4 -1.473620e+00
#dad5 -1.337879e+00

The overlay matrix Zparents is double the actual value we should use:

Zparents2 <- 0.5*(Zmales + Zfemales)
Zparents2[1:6,]
#  dad1 dad2 dad3 dad4 dad5
#1  1.0  0.0  0.0  0.0  0.0
#2  0.5  0.5  0.0  0.0  0.0
#3  0.5  0.0  0.5  0.0  0.0
#4  0.5  0.0  0.0  0.5  0.0
#5  0.5  0.0  0.0  0.0  0.5
#6  0.5  0.5  0.0  0.0  0.0

Repeating the analyses 'by hand' using Zparents2 to build the Z matrix results in the same overall mean and block effects, but the predicted breeding values for parents when using Zparents are 0.7 of the predicted breeding values when using Zparents2. I may need to refit the model and obtain new variance components when working with the correct overlaid matrix.

Gratuitous picture: walking in Quail Island (Photo: Luis).

## INLA: Bayes goes to Norway

INLA is not the Norwegian answer to ABBA; that would probably be a-ha. INLA is the answer to 'Why do I have enough time to cook a three-course meal while running MCMC analyses?'.
Integrated Nested Laplace Approximations (INLA) is based on direct numerical integration (rather than simulation as in MCMC) which, according to people 'in the know', allows:

- the estimation of marginal posteriors for all parameters,
- marginal posteriors for each random effect and
- estimation of the posterior for linear combinations of random effects.

Rather than going to the usual univariate randomized complete block or split-plot designs that I have analyzed before (here using REML and here using MCMC), I'll go for some analyses that motivated me to look for INLA. I was having a look at some reproductive output for Drosophila data here at the university, and wanted to fit a logistic model using MCMCglmm. Unfortunately, I was running into the millions (~3M) of iterations to get a good idea of the posterior and, therefore, leaving the computer running overnight. Almost by accident I came across INLA and started playing with it.

The idea is that Sol—a Ph.D. student—had a cool experiment with a bunch of flies using different mating strategies over several generations, to check the effect on breeding success. Therefore we have to keep track of the pedigree too.

Gratuitous picture: Cubist apartments not in Norway (Photo: Luis, click to enlarge).

# Set working directory containing data and code
setwd('~/Dropbox/research/2010/flies')

# Packages needed for analysis
# This code requires the latest (and updated) version of INLA
require(INLA)       # This loads INLA
require(pedigreemm) # pedigree(), relfactor(), Tinv, D, ...
####### Pedigree and assessment files
# Reads pedigree file
ped = read.csv('recodedped.csv', header = FALSE)
names(ped) = c('id', 'mum', 'dad')

# Reads data file
dat = read.csv('ggdatall.csv', header = TRUE)
dat$cross = factor(dat$cross)

# Pedigree object for pedigreemm functions
pedPMM = with(ped, pedigreemm::pedigree(sire = dad, dam = mum, label = id))

# Pedigree precision matrix (A inverse)
# T^{-1} in A^{-1} = (T^{-1})' D^{-1} T^{-1}
Tinv = as(pedPMM, "sparseMatrix")
D = Diagonal(x = Dmat(pedPMM))  # D in A = TDT'
Dinv = solve(D)
Ainv = t(Tinv) %*% Dinv %*% Tinv

Up to this point we have read the response data and the pedigree, and constructed the inverse of the pedigree matrix. We also need to build a contrast matrix to compare the mean response between the different mating strategies. I was struggling there and contacted Gregor Gorjanc, who kindly emailed me the proper way to do it.

# Define contrasts to compare cross types. Thanks to Gregor Gorjanc
# for coding the contrasts
k = nlevels(dat$cross)
tmp = matrix(nrow=(k-1)*k/2, ncol=k)

##               1   2   3   4   5   6
tmp[ 1, ] =  c(  1, -1, NA, NA, NA, NA) ## c1-c2
tmp[ 2, ] =  c(  1, NA, -1, NA, NA, NA) ##   -c3
tmp[ 3, ] =  c(  1, NA, NA, -1, NA, NA) ##   -c4
tmp[ 4, ] =  c(  1, NA, NA, NA, -1, NA) ##   -c5
tmp[ 5, ] =  c(  1, NA, NA, NA, NA, -1) ##   -c6
tmp[ 6, ] =  c( NA,  1, -1, NA, NA, NA) ## c2-c3
tmp[ 7, ] =  c( NA,  1, NA, -1, NA, NA) ##   -c4
tmp[ 8, ] =  c( NA,  1, NA, NA, -1, NA) ##   -c5
tmp[ 9, ] =  c( NA,  1, NA, NA, NA, -1) ##   -c6
tmp[10, ] =  c( NA, NA,  1, -1, NA, NA) ## c3-c4
tmp[11, ] =  c( NA, NA,  1, NA, -1, NA) ##   -c5
tmp[12, ] =  c( NA, NA,  1, NA, NA, -1) ##   -c6
tmp[13, ] =  c( NA, NA, NA,  1, -1, NA) ## c4-c5
tmp[14, ] =  c( NA, NA, NA,  1, NA, -1) ##   -c6
tmp[15, ] =  c( NA, NA, NA, NA,  1, -1) ## c5-c6

# Make Linear Combinations
LC = inla.make.lincombs(cross=tmp)

# Assign names to combinations
t = 0
for(i in 1:(k-1)) {
    for(j in (i+1):k) {
        t = t + 1
        names(LC)[t] = paste("c", i, "-", "c", j, sep = "")
    }
}


There is another related package (Animal INLA) that takes care of i) giving details about the priors and ii) "easily" fitting models that include a term with a pedigree (an animal model, in quantitative genetics speak). However, I wanted the assumptions to be clear, so I read the source of Animal INLA and shamelessly copied the useful bits (read the source, Luke!).

######  Analysis for binomial traits
####### Plain-vanilla INLA Version
# Feeling more comfortable with *explicit* statement of assumptions
# (rather than hidden behind animal.inla())

# Function to backconvert logits to probabilities
back.conv = function(values) {
    return(1/(1 + exp(-values)))
}
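As a quick sanity check of the backconversion (the one-liner is repeated so the snippet stands alone; the value 1.772438 is the intercept posterior mean reported in the model output further down):

```r
# Logit -> probability backconversion
back.conv = function(values) {
    return(1/(1 + exp(-values)))
}

back.conv(0)         # a logit of 0 is a probability of 0.5
back.conv(1.772438)  # the intercept posterior mean, roughly 0.85
```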

# Function to get posterior of the odds
# Thanks to Helena Moltchanova
inla.marginal.summary = function(x) {
    m1 = inla.emarginal(function(z) exp(z), marginal = x)
    odds = inla.marginal.transform(function(x) exp(x), x)
    q = inla.qmarginal(p = c(0.025, 0.975), marginal = odds)
    c("0.025quant" = q[1], "0.5quant" = m1, "0.975quant" = q[2])
}

# Model for pupae/eggs
# Drops a few observations with no reproductive output (trips INLA)
no0eggs = subset(dat, eggs>0 & pupae <= eggs)

# Actual model
mpueg =  pupae ~ f(cross, model='iid', constr=TRUE, hyper=list(theta=list(initial=-10, fixed=TRUE))) +
f(id, model='generic0', constr=TRUE, Cmatrix=Ainv,
hyper=list(theta=list(param=c(0.5,0.5), fixed=FALSE)))

# INLA call
fpueg =  inla(formula=mpueg, family='binomial', data=no0eggs,
lincomb=LC,
Ntrials=eggs,
control.compute=list(dic=FALSE))

# Results
summary(fpueg)

# Call:
#   c("inla(formula = mpueg, family = \"binomial\", data = no0eggs,
#     Ntrials = eggs, ",  "    lincomb = LC, control.compute = list(dic = FALSE))")
#
# Time used:
#   Pre-processing    Running inla Post-processing           Total
# 0.2712612       1.1172159       2.0439510       3.4324281
#
# Fixed effects:
#   mean        sd 0.025quant 0.5quant 0.975quant       kld
# (Intercept) 1.772438 0.1830827   1.417413 1.770863   2.136389 0.5833235
#
# Linear combinations (derived):
#   ID        mean        sd 0.025quant    0.5quant  0.975quant kld
# c1-c2  0 -0.26653572 0.7066540 -1.6558225 -0.26573011  1.11859967   0
# c1-c3  1  0.04150999 0.7554753 -1.4401435  0.04104020  1.52622856   0
# c1-c4  2 -0.08777325 0.6450669 -1.3557501 -0.08713005  1.17693349   0
# c1-c5  3 -1.36702960 0.6583121 -2.6615604 -1.36618274 -0.07690788   0
# c1-c6  4 -1.82037714 0.8193280 -3.4338294 -1.81848244 -0.21714431   0
# c2-c3  5  0.30804735 0.7826815 -1.2248185  0.30677279  1.84852340   0
# c2-c4  6  0.17876229 0.5321948 -0.8654273  0.17859036  1.22421409   0
# c2-c5  7 -1.10049385 0.7466979 -2.5663142 -1.10046590  0.36558211   0
# c2-c6  8 -1.55383673 0.8188321 -3.1640965 -1.55276603  0.05084282   0
# c3-c4  9 -0.12928419 0.7475196 -1.5996080 -0.12817855  1.33522000   0
# c3-c5 10 -1.40854298 0.6016539 -2.5930656 -1.40723901 -0.23103707   0
# c3-c6 11 -1.86189314 0.8595760 -3.5555571 -1.85954031 -0.18100418   0
# c4-c5 12 -1.27925604 0.6998640 -2.6536362 -1.27905616  0.09438701   0
# c4-c6 13 -1.73259977 0.7764105 -3.2600936 -1.73134961 -0.21171790   0
# c5-c6 14 -0.45334267 0.8179794 -2.0618730 -0.45229981  1.14976690   0
#
# Random effects:
#   Name    Model	  	Max KLD
# cross   IID model
# id   Generic0 model
#
# Model hyperparameters:
#   mean    sd      0.025quant 0.5quant 0.975quant
# Precision for id 0.08308 0.01076 0.06381    0.08244  0.10604
#
# Expected number of effective parameters(std dev): 223.95(0.7513)
# Number of equivalent replicates : 1.121
#
# Marginal Likelihood:  -1427.59

fpueg$summary.random$cross
#   ID       mean        sd  0.025quant   0.5quant 0.975quant          kld
# 1  1 -0.5843466 0.4536668 -1.47561024 -0.5840804  0.3056632 0.0178780930
# 2  2 -0.3178102 0.4595676 -1.21808638 -0.3184925  0.5865565 0.0009666916
# 3  3 -0.6258600 0.4978254 -1.60536281 -0.6250077  0.3491075 0.0247426578
# 4  4 -0.4965763 0.4071715 -1.29571071 -0.4966277  0.3030747 0.0008791629
# 5  5  0.7826817 0.4389003 -0.07756805  0.7821937  1.6459253 0.0077476806
# 6  6  1.2360387 0.5768462  0.10897529  1.2340813  2.3744368 0.0451357379

# Backtransforms point estimates and credible intervals for odds -> prob
for(name in names(fpueg$marginals.lincomb.derived)) {
    summa = inla.marginal.summary(fpueg$marginals.lincomb.derived[[name]])
    cat(name, summa, '\n')
}
# c1-c2 0.1894451 0.9831839 3.019878
# c1-c3 0.2338952 1.387551 4.534581
# c1-c4 0.256858 1.127751 3.204961
# c1-c5 0.0695406 0.3164847 0.9145132
# c1-c6 0.03157478 0.2264027 0.792517
# c2-c3 0.289088 1.850719 6.255175
# c2-c4 0.4213069 1.377848 3.366947
# c2-c5 0.0759222 0.4398384 1.420934
# c2-c6 0.04135211 0.2955985 1.035951
# c3-c4 0.1996085 1.16168 3.747526
# c3-c5 0.0746894 0.2929174 0.7847903
# c3-c6 0.02774805 0.2245797 0.821099
# c4-c5 0.06988459 0.355529 1.084414
# c4-c6 0.03780307 0.2389529 0.7974092
# c5-c6 0.1245211 0.8878682 3.108852

A quick look at the time taken by INLA shows that it is in the order of seconds (versus overnight using MCMC). I have tried a few examples and the MCMCglmm and INLA results tend to be very close; however, figuring out how to code models has been very tricky for me. INLA follows the glorious tradition of not having a 'proper' manual, but a number of examples with code. In fact, they reimplement BUGS's examples. Personally, I struggle with that approach to documentation, but you may be the right type of person for it. Note for letter to Santa: real documentation for INLA.

I was talking with a student about using Norwegian software and he mentioned Norwegian Black Metal. That got me thinking about what the developers of the package would look like; would they look like Gaahl of Gorgoroth (see interview here)?

Not an INLA developer

Talk about disappointment! In fact Håvard Rue, INLA's mastermind, looks like a nice, clean, non-black-metal statistician. To be fair, it would be quite hard to code in any language wearing those spikes…

## Split-plot 2: let's throw in some spatial effects

I disappeared for a while, collecting frequent flyer points.
In the process I 'discovered' that I live in the middle of nowhere, as it took me 36 hours to reach my conference destination (Estoril, Portugal) through Christchurch, Sydney, Bangkok, Dubai, Madrid and Lisbon.

Where was I? Showing what split-plots look like under the bonnet (hood for you US readers). Yates presented a nice diagram of his oats data set in the paper, so we have the spatial location of each data point, which permits us to play with within-trial spatial trends. Rather than mucking around with typing coordinates we can rely on Kevin Wright's version of the oats dataset contained in the agridat package. Kevin is a man of mystery, a James Bond of statisticians—so he keeps a low profile—with a keen interest in experimental design and analyses. This chap has put together a really nice collection of data sets WITH suggested coding for the analyses, including nlme, lme4, asreml, MCMCglmm and a few other bits and pieces. Recommended!

Plants (triffids excepted) do not move, which means that environmental trends within a trial (things like fertility, water availability, temperature, etc) can affect experimental units in a way that varies with space and that induces correlation of the residuals. In other words, we could be violating the independence assumption, if you haven't got the hint yet.

Gratuitous picture: Detail of Mosteiro dos Jerónimos, Belém, Lisboa, Portugal (Photo: Luis).

There are a few ways to model environmental trends (AR processes, simple polynomials, splines, etc) that can be accounted for either through the G matrix (as random effects) or through the R matrix. See the previous post for an explanation of the bits and pieces. We will use here a very popular approach, which is to consider two separable (so we can estimate the bloody things) autoregressive processes, one for rows and one for columns, to model spatial association. In addition, we will have a spatial residual.
In summary, the residuals have moved from $$\boldsymbol{R} = \sigma^2_e \boldsymbol{I}$$ to $$\boldsymbol{R} = \sigma^2_s \boldsymbol{R}_{col} \otimes \boldsymbol{R}_{row}$$. I previously showed the general form of these autoregressive matrices in this post, and you can see the $$\boldsymbol{R}_{col}$$ matrix below. In some cases we can also add an independent residual (the so-called nugget) to the residual matrix.

We will first fit a split-plot model considering spatial residuals using asreml because, let's face it, there is no other package that will give you the flexibility:

require(asreml)
require(agridat)

# Having a look at the structure of yates.oats
# and changing things slightly so it matches
# the previous post
str(yates.oats)
names(yates.oats) = c('row', 'col', 'Y', 'N', 'V', 'B')
yates.oats$row = factor(yates.oats$row)
yates.oats$col = factor(yates.oats$col)
yates.oats$N = factor(yates.oats$N)
yates.oats$MP = yates.oats$V

# Base model (this was used in the previous post)
m2 = asreml(Y ~ V*N, random = ~ B/MP,
            data = yates.oats)
summary(m2)$varcomp
#                gamma component std.error  z.ratio constraint
# B!B.var    1.2111647  214.4771 168.83404 1.270343   Positive
# B:MP!B.var 0.5989373  106.0618  67.87553 1.562593   Positive
# R!variance 1.0000000  177.0833  37.33244 4.743416   Positive

# Spatial model
m3 = asreml(Y ~ V*N, random = ~ B/MP,
            rcov = ~ ar1(col):ar1(row),
            data = yates.oats)
summary(m3)$varcomp
#                 gamma    component   std.error   z.ratio    constraint
# B!B.var    0.80338277 169.24347389 156.8662436 1.0789031      Positive
# B:MP!B.var 0.49218005 103.68440202  73.6390759 1.4080079      Positive
# R!variance 1.00000000 210.66355939  67.4051020 3.1253355      Positive
# R!col.cor  0.04484166   0.04484166   0.2006562 0.2234751 Unconstrained
# R!row.cor  0.49412567   0.49412567   0.1420397 3.4787860 Unconstrained

coef(m3)$fixed
#                         effect
# V_GoldenRain:N_0     0.0000000
# V_GoldenRain:N_0.2   0.0000000
# V_GoldenRain:N_0.4   0.0000000
# V_GoldenRain:N_0.6   0.0000000
# V_Marvellous:N_0     0.0000000
# V_Marvellous:N_0.2  -0.8691155
# V_Marvellous:N_0.4 -12.4223873
# V_Marvellous:N_0.6  -5.5018907
# V_Victory:N_0        0.0000000
# V_Victory:N_0.2     -1.9580360
# V_Victory:N_0.4      2.1913469
# V_Victory:N_0.6      0.3728648
# N_0                  0.0000000
# N_0.2               23.3299154
# N_0.4               40.0570745
# N_0.6               47.1749577
# V_GoldenRain         0.0000000
# V_Marvellous         9.2845952
# V_Victory           -5.7259866
# (Intercept)         76.5774292

coef(m3)$random
#                        effect
# B_B1               21.4952875
# B_B2                1.0944484
# B_B3               -5.4336461
# B_B4               -4.4334455
# B_B5               -6.6925874
# B_B6               -6.0300569
# B_B1:MP_GoldenRain  1.3036724
# B_B1:MP_Marvellous -0.9082462
# B_B1:MP_Victory    12.7754283
# B_B2:MP_GoldenRain  1.3286187
# B_B2:MP_Marvellous  7.4181674
# B_B2:MP_Victory    -8.0761824
# B_B3:MP_GoldenRain -6.5288220
# B_B3:MP_Marvellous 10.4361799
# B_B3:MP_Victory    -7.2367277
# B_B4:MP_GoldenRain  6.6868810
# B_B4:MP_Marvellous -9.2317585
# B_B4:MP_Victory    -0.1716372
# B_B5:MP_GoldenRain  2.4635492
# B_B5:MP_Marvellous -9.7086196
# B_B5:MP_Victory     3.1443067
# B_B6:MP_GoldenRain -5.2538993
# B_B6:MP_Marvellous  1.9942770
# B_B6:MP_Victory    -0.4351876

# In a larger experiment we could try fitting a nugget using units
m4 = asreml(Y ~ V*N, random = ~ B/MP + units,
            rcov = ~ ar1(col):ar1(row),
            data = yates.oats)
summary(m4)
# However, in this small experiment the system
# goes crazy and the results are meaningless
#                       gamma   component  std.error  z.ratio constraint
# B!B.var         0.006759124   223.70262   185.3816 1.206714   Positive
# B:MP!B.var      0.001339017    44.31663    29.2004 1.517672   Positive
# units!units.var 0.003150356   104.26542    34.2738 3.042132   Positive
# R!variance      1.000000000 33096.39128 19480.8333 1.698921   Positive
# R!col.cor       0.999000000     0.99900         NA       NA      Fixed
# R!row.cor       0.999000000     0.99900         NA       NA      Fixed

So we have to build an autoregressive correlation matrix for rows, one for columns, and multiply the whole thing by a spatial variance. Then we can add an independent residual (the nugget, if we want—and can estimate—one). Peter Dalgaard has neat code for building the autocorrelation matrix. And going back to the code in the previous post:

ar.matrix = function(ar, dim) {
    M = diag(dim)
    M = ar^abs(row(M) - col(M))
    return(M)
}

# Variance components (from m3)
varB = 169.243
varMP = 103.684
varR = 210.664
arcol = 0.045
arrow = 0.494

# Basic vector and matrices: y, X, Z, G & R
y = matrix(yates.oats$Y, nrow = nrow(yates.oats), ncol = 1)
X = model.matrix(~ V*N, data = yates.oats)
Z = model.matrix(~ B/MP - 1, data = yates.oats,
                 contrasts.arg = list(B = contrasts(yates.oats$B, contrasts = FALSE),
                                      MP = contrasts(yates.oats$MP, contrasts = FALSE)))

G = diag(c(rep(varB, 6), rep(varMP, 18)))

# Only change from previous post is building the R matrix
Rcol = ar.matrix(arcol, 4)
Rrow = ar.matrix(arrow, 18)
Rcol
# Having a look at the structure
#            [,1]     [,2]     [,3]       [,4]
# [1,] 1.0000e+00 0.045000 0.002025 9.1125e-05
# [2,] 4.5000e-02 1.000000 0.045000 2.0250e-03
# [3,] 2.0250e-03 0.045000 1.000000 4.5000e-02
# [4,] 9.1125e-05 0.002025 0.045000 1.0000e+00

R = varR * kronecker(Rcol, Rrow)
Rinv = solve(R)

# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z

XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X

Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y

# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)

blup = solve(C, RHS)
blup
#                         [,1]
# (Intercept)       76.5778238
# VMarvellous        9.2853002
# VVictory          -5.7262894
# N0.2              23.3283060
# N0.4              40.0555464
# N0.6              47.1740348
# VMarvellous:N0.2  -0.8682597
# VVictory:N0.2     -1.9568979
# VMarvellous:N0.4 -12.4200362
# VVictory:N0.4      2.1912083
# VMarvellous:N0.6  -5.5017225
# VVictory:N0.6      0.3732453
# BB1               21.4974445
# BB2                1.0949433
# BB3               -5.4344098
# BB4               -4.4333080
# BB5               -6.6948783
# BB6               -6.0297918
# BB1:MPGoldenRain   1.3047656
# BB2:MPGoldenRain   1.3294043
# BB3:MPGoldenRain  -6.5286993
# BB4:MPGoldenRain   6.6855568
# BB5:MPGoldenRain   2.4624436
# BB6:MPGoldenRain  -5.2534710
# BB1:MPMarvellous  -0.9096022
# BB2:MPMarvellous   7.4170634
# BB3:MPMarvellous  10.4349240
# BB4:MPMarvellous  -9.2293528
# BB5:MPMarvellous  -9.7080694
# BB6:MPMarvellous   1.9950370
# BB1:MPVictory     12.7749000
# BB2:MPVictory     -8.0756682
# BB3:MPVictory     -7.2355284
# BB4:MPVictory     -0.1721988
# BB5:MPVictory      3.1441163
# BB6:MPVictory     -0.4356209


These are the same results one gets from ASReml-R. QED.

P.S. Many thanks to Kevin Wright for answering my questions about agridat.


## Split-plot 1: How does a linear mixed model look like?

I like statistics and I struggle with statistics. Oftentimes I get frustrated when I don't understand, and I really struggled to make sense of Kruschke's Bayesian analysis of a split-plot, particularly because 'it didn't look like' a split-plot to me.

Additionally, I have made a few posts discussing linear mixed models using several different packages to fit them. At no point have I shown the calculations behind the scenes. So, I decided to combine my frustration and an explanation to myself in a couple of posts. This is number one and the follow-up is Split-plot 2: let’s throw in some spatial effects. An example of a forestry split-plot: one of my colleagues has a trial in which stocking (number of trees per ha) is the main plot and fertilization is the subplot (higher stockings look darker because trees are closer together).

## What do linear mixed models look like?

Linear mixed models, models that combine so-called fixed and random effects, are often represented using matrix notation as:

$$\boldsymbol{y} = \boldsymbol{X b} + \boldsymbol{Z a} + \boldsymbol{e}$$

where $$oldsymbol{y}$$ represents the response variable vector, $$oldsymbol{X} mbox{ and } oldsymbol{Z}$$ are incidence matrices relating the response to the fixed ($$oldsymbol{b}$$ e.g. overall mean, treatment effects, etc) and random ($$oldsymbol{a}$$, e.g. family effects, additive genetic effects and a host of experimental design effects like blocks, rows, columns, plots, etc), and last the random residuals ($$oldsymbol{e}$$).

The previous model equation still requires some assumptions about the distribution of the random effects. A very common set of assumptions is to say that the residuals are iid (independent and identically distributed) normal (so $$\boldsymbol{R} = \sigma^2_e \boldsymbol{I}$$) and that the random effects are independent of each other, so $$\boldsymbol{G} = \boldsymbol{B} \bigoplus \boldsymbol{M}$$ is a direct sum of the variances of blocks ($$\boldsymbol{B} = \sigma^2_B \boldsymbol{I}$$) and main plots ($$\boldsymbol{M} = \sigma^2_M \boldsymbol{I}$$).
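The direct sum simply stacks the two covariance matrices along the diagonal of a bigger, block-diagonal matrix. A quick numeric sketch (Python/NumPy, with made-up variances and only a handful of levels, just to see the shape of G):

```python
import numpy as np

s2_B, s2_M = 214.5, 106.1  # made-up block and main-plot variances
n_B, n_M = 3, 4            # toy numbers of blocks and main plots

B = s2_B * np.eye(n_B)     # B = sigma^2_B * I
M = s2_M * np.eye(n_M)     # M = sigma^2_M * I

# Direct sum: G = B (+) M is block diagonal, zeros everywhere else
G = np.zeros((n_B + n_M, n_B + n_M))
G[:n_B, :n_B] = B
G[n_B:, n_B:] = M

print(np.diag(G))  # [214.5 214.5 214.5 106.1 106.1 106.1 106.1]
```

In the real analyses G has one diagonal entry per level of each random factor (6 blocks + 18 main plots in the oats example below).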

An interesting feature of this matrix formulation is that we can express all sorts of models by choosing different assumptions for our covariance matrices (using different covariance structures). Do you have longitudinal data (units assessed repeated times)? Is there spatial correlation? Account for this in the $$\boldsymbol{R}$$ matrix. Are random effects correlated (e.g. maternal and additive genetic effects in genetics)? Account for this in the $$\boldsymbol{G}$$ matrix. Multivariate response? Deal with unstructured $$\boldsymbol{R}$$ and $$\boldsymbol{G}$$, or model the correlation structure using different constraints (and then you’ll need asreml).

By the way, the history of linear mixed models is strongly related to applications of quantitative genetics for the prediction of breeding values, particularly in dairy cattle. Charles Henderson developed what is now called Henderson’s Mixed Model Equations to simultaneously estimate fixed effects and predict random genetic effects:

$$left [ egin{array}{cc} oldsymbol{X}’ oldsymbol{R}^{-1} oldsymbol{X} & oldsymbol{X}’ oldsymbol{R}^{-1} oldsymbol{Z} \ oldsymbol{Z}’ oldsymbol{R}^{-1} oldsymbol{X} & oldsymbol{Z}’ oldsymbol{R}^{-1} oldsymbol{Z} + oldsymbol{G}^{-1} end{array} ight ] left [ egin{array}{c} oldsymbol{b} \ oldsymbol{a} end{array} ight ] = left [ egin{array}{c} oldsymbol{X}’ oldsymbol{R}^{-1} oldsymbol{y} \ oldsymbol{Z}’ oldsymbol{R}^{-1} oldsymbol{y} end{array} ight ]$$

The big matrix on the left-hand side of this equation is often called the $$\boldsymbol{C}$$ matrix. You could be thinking ‘What does this all mean?’ It is easier to see what is going on with a small example, but rather than starting with, say, a complete block design, we’ll go for a split-plot to start tackling my annoyance with the aforementioned blog post.

## Old school: physical split-plots

Given that I’m an unsophisticated forester and that I’d like to use data available to anyone, I’ll rely on an agricultural example (so plots are actually physical plots in the field) that goes back to Frank Yates. There are two factors (oats variety, with three levels, and fertilization, with four levels). Yates, F. (1935) Complex experiments, Journal of the Royal Statistical Society Suppl. 2, 181–247 (behind pay wall here). Layout of the oats experiment in Yates’s paper, from a time when articles were meaty. Each of the 6 replicates is divided into 3 main plots for oats varieties (v1, v2 and v3), while each variety plot was divided into four parts with different levels of fertilization (manure—animal crap—n1 to n4). Cells display yield.

Now it is time to roll up our sleeves and use some code, getting the data and fitting the same model using nlme (m1) and asreml (m2), just for the fun of it. Anyway, nlme and asreml produce exactly the same results.

We will use the oats data set that comes with MASS, although there is also an Oats data set in nlme and another version in the asreml package. (By the way, you can see a very good explanation by Bill Venables of a ‘traditional’ ANOVA analysis for a split-plot here):

require(MASS) # we get the oats data from here
require(nlme) # for lme function
require(asreml) # for asreml function. asreml's own Oats dataset uses
                # different variable names, which would require renaming
                # to use the code below

# Get the oats data set and check structure
data(oats)
str(oats)

# Create a main plot effect for clarity's sake
oats$MP = oats$V

# Split-plot using NLME
m1 = lme(Y ~ V*N, random = ~ 1|B/MP, data = oats)
summary(m1)
fixef(m1)
ranef(m1)

# Split-plot using ASReml
m2 = asreml(Y ~ V*N, random = ~ B/MP, data = oats)
summary(m2)$varcomp
coef(m2)$fixed
coef(m2)$random

fixef(m1)
#        (Intercept)        VMarvellous           VVictory            N0.2cwt
#         80.0000000          6.6666667         -8.5000000         18.5000000
#            N0.4cwt            N0.6cwt VMarvellous:N0.2cwt    VVictory:N0.2cwt
#         34.6666667         44.8333333          3.3333333         -0.3333333
#VMarvellous:N0.4cwt    VVictory:N0.4cwt VMarvellous:N0.6cwt    VVictory:N0.6cwt
#         -4.1666667          4.6666667         -4.6666667          2.1666667

ranef(m1)
# Level: B
#     (Intercept)
# I     25.421511
# II     2.656987
# III   -6.529883
# IV    -4.706019
# V    -10.582914
# VI    -6.259681
#
# Level: MP %in% B
#                 (Intercept)
# I/Golden.rain      2.348296
# I/Marvellous      -3.854348
# I/Victory         14.077467
# II/Golden.rain     4.298706
# II/Marvellous      6.209473
# II/Victory        -9.194250
# III/Golden.rain   -7.915950
# III/Marvellous    10.750776
# III/Victory       -6.063976
# IV/Golden.rain     5.789462
# IV/Marvellous     -7.115566
# IV/Victory        -1.001111
# V/Golden.rain      1.116768
# V/Marvellous      -9.848096
# V/Victory          3.497878
# VI/Golden.rain    -5.637282
# VI/Marvellous      3.857761
# VI/Victory        -1.316009

Now we can move on to implement the Mixed Model Equations, where probably the only gotcha is the definition of the $$\boldsymbol{Z}$$ matrix (the incidence matrix for random effects), as both nlme and asreml use all the levels of the factors for both the main plot and interaction effects, which involves using the contrasts.arg argument in model.matrix().

# Variance components
varB = 214.477
varMP = 106.062
varR = 177.083

# Basic vector and matrices: y, X, Z, G & R
y = matrix(oats$Y, nrow = nrow(oats), ncol = 1)
X = model.matrix(~ V*N, data = oats)
Z = model.matrix(~ B/MP - 1, data = oats,
                 contrasts.arg = list(B = contrasts(oats$B, contrasts = FALSE),
                                      MP = contrasts(oats$MP, contrasts = FALSE)))

G = diag(c(rep(varB, 6), rep(varMP, 18)))
R = diag(varR, 72, 72)
Rinv = solve(R)

# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z

XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X

Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y

# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)

blup = solve(C, RHS)
blup

# [,1]
# (Intercept)          80.0000000
# VMarvellous           6.6666667
# VVictory             -8.5000000
# N0.2cwt              18.5000000
# N0.4cwt              34.6666667
# N0.6cwt              44.8333333
# VMarvellous:N0.2cwt   3.3333333
# VVictory:N0.2cwt     -0.3333333
# VMarvellous:N0.4cwt  -4.1666667
# VVictory:N0.4cwt      4.6666667
# VMarvellous:N0.6cwt  -4.6666667
# VVictory:N0.6cwt      2.1666667
# BI                   25.4215578
# BII                   2.6569919
# BIII                 -6.5298953
# BIV                  -4.7060280
# BV                  -10.5829337
# BVI                  -6.2596927
# BI:MPGolden.rain      2.3482656
# BII:MPGolden.rain     4.2987082
# BIII:MPGolden.rain   -7.9159514
# BIV:MPGolden.rain     5.7894753
# BV:MPGolden.rain      1.1167834
# BVI:MPGolden.rain    -5.6372811
# BI:MPMarvellous      -3.8543865
# BII:MPMarvellous      6.2094778
# BIII:MPMarvellous    10.7507978
# BIV:MPMarvellous     -7.1155687
# BV:MPMarvellous      -9.8480945
# BVI:MPMarvellous      3.8577740
# BI:MPVictory         14.0774514
# BII:MPVictory        -9.1942649
# BIII:MPVictory       -6.0639747
# BIV:MPVictory        -1.0011059
# BV:MPVictory          3.4978963
# BVI:MPVictory        -1.3160021


Not surprisingly, we get the same results, except that we start from the variance components estimated in the previous analyses, so we can avoid implementing the code for restricted maximum likelihood estimation as well. By the way, given that $$\boldsymbol{R}^{-1}$$ appears in every term, it can be factored out of the MME (after scaling $$\boldsymbol{G}^{-1}$$ by $$\sigma^2_e$$), leaving terms like $$\boldsymbol{X}' \boldsymbol{X}$$, i.e. without $$\boldsymbol{R}^{-1}$$, making for simpler calculations. In fact, if you drop the $$\boldsymbol{R}^{-1}$$ it is easier to see what is going on in the different components of the $$\boldsymbol{C}$$ matrix. For example, print $$\boldsymbol{X}' \boldsymbol{X}$$ and you’ll get the number of observations contributing to the overall mean and to each of the levels of the fixed effect factors. Give it a try with the other submatrices too!
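The factoring argument is easy to check numerically: with R = σ²I, multiplying both sides of the MME by σ² gives [X'X, X'Z; Z'X, Z'Z + σ²G⁻¹][b; a] = [X'y; Z'y], with exactly the same solutions. A small Python/NumPy sketch with simulated matrices (nothing below comes from the oats data):

```python
import numpy as np

rng = np.random.default_rng(42)

n, q = 12, 3                                               # observations, random levels
X = np.column_stack([np.ones(n), np.tile([0.0, 1.0], 6)])  # intercept + one dummy
Z = np.kron(np.eye(q), np.ones((4, 1)))                    # 4 observations per level
y = rng.normal(10, 2, (n, 1))

s2e = 1.5              # residual variance (made up)
G = 0.8 * np.eye(q)    # random-effect covariance (made up)
Rinv = np.eye(n) / s2e # R = s2e * I, so Rinv = I / s2e

# Full MME, with R^{-1} in every term
C1 = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
               [Z.T @ Rinv @ X, Z.T @ Rinv @ Z + np.linalg.inv(G)]])
rhs1 = np.vstack([X.T @ Rinv @ y, Z.T @ Rinv @ y])

# R^{-1} factored out: scale G^{-1} by sigma^2_e instead
C2 = np.block([[X.T @ X, X.T @ Z],
               [Z.T @ X, Z.T @ Z + s2e * np.linalg.inv(G)]])
rhs2 = np.vstack([X.T @ y, Z.T @ y])

sol1 = np.linalg.solve(C1, rhs1)
sol2 = np.linalg.solve(C2, rhs2)
print(np.allclose(sol1, sol2))  # True
```

The same check could be written in R almost line for line with %*% and solve().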

XpXnoR = t(X) %*% X
XpXnoR
#                    (Intercept) VMarvellous VVictory N0.2cwt N0.4cwt N0.6cwt
#(Intercept)                  72          24       24      18      18      18
#VMarvellous                  24          24        0       6       6       6
#VVictory                     24           0       24       6       6       6
#N0.2cwt                      18           6        6      18       0       0
#N0.4cwt                      18           6        6       0      18       0
#N0.6cwt                      18           6        6       0       0      18
#VMarvellous:N0.2cwt           6           6        0       6       0       0
#VVictory:N0.2cwt              6           0        6       6       0       0
#VMarvellous:N0.4cwt           6           6        0       0       6       0
#VVictory:N0.4cwt              6           0        6       0       6       0
#VMarvellous:N0.6cwt           6           6        0       0       0       6
#VVictory:N0.6cwt              6           0        6       0       0       6
#                    VMarvellous:N0.2cwt VVictory:N0.2cwt VMarvellous:N0.4cwt
#(Intercept)                           6                6                   6
#VMarvellous                           6                0                   6
#VVictory                              0                6                   0
#N0.2cwt                               6                6                   0
#N0.4cwt                               0                0                   6
#N0.6cwt                               0                0                   0
#VMarvellous:N0.2cwt                   6                0                   0
#VVictory:N0.2cwt                      0                6                   0
#VMarvellous:N0.4cwt                   0                0                   6
#VVictory:N0.4cwt                      0                0                   0
#VMarvellous:N0.6cwt                   0                0                   0
#VVictory:N0.6cwt                      0                0                   0
#                    VVictory:N0.4cwt VMarvellous:N0.6cwt VVictory:N0.6cwt
#(Intercept)                        6                   6                6
#VMarvellous                        0                   6                0
#VVictory                           6                   0                6
#N0.2cwt                            0                   0                0
#N0.4cwt                            6                   0                0
#N0.6cwt                            0                   6                6
#VMarvellous:N0.2cwt                0                   0                0
#VVictory:N0.2cwt                   0                   0                0
#VMarvellous:N0.4cwt                0                   0                0
#VVictory:N0.4cwt                   6                   0                0
#VMarvellous:N0.6cwt                0                   6                0
#VVictory:N0.6cwt                   0                   0                6
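The counting behaviour of X'X is easy to reproduce in miniature. A Python sketch with a toy two-level factor (not the oats data): for a 0/1 design matrix, the diagonal of X'X holds the number of observations per column and the off-diagonals hold the counts shared between columns.

```python
import numpy as np

# Toy design: 6 observations, a factor with levels a (4 obs) and b (2 obs);
# treatment coding as in R: intercept column plus a dummy for level b
level_b = np.array([0, 0, 0, 0, 1, 1], dtype=float)
X = np.column_stack([np.ones(6), level_b])

XpX = X.T @ X
print(XpX)
# [[6. 2.]
#  [2. 2.]]
# diagonal: total n and the count for level b; off-diagonal: shared counts
```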


I will leave it here and come back to this problem as soon as I can.

† Incidentally, a lot of the theoretical development was supported by Shayle Searle (a Kiwi statistician and Henderson’s colleague at Cornell University).

Categories

## Bivariate linear mixed models using ASReml-R with multiple cores

A while ago I wanted to run a quantitative genetic analysis in which the performance of genotypes in each site was considered a different trait. If you think about it, with 70 sites and thousands of genotypes one is trying to fit a 70×70 additive genetic covariance matrix, which requires 70*69/2 = 2,415 covariance components. Besides requiring huge amounts of memory and being subject to all sorts of estimation problems, there were all sorts of connectedness issues that precluded the use of Factor Analytic models for the covariance matrix. The next best thing was to run over 2,000 bivariate analyses to build a large genetic correlation matrix (which has all sorts of issues, I know). This meant leaving the computer running for over a week.

In another unrelated post, Kevin asked if I have ever considered using ASReml-R to run in parallel using a computer with multiple cores. Short answer, I haven’t, but there is always a first time.

require(asreml) # Multivariate mixed models
require(doMC)   # To access multiple cores

registerDoMC()  # to start a "parallel backend"

setwd('~/Dropbox/quantumforest')
# Phenotypes
dat$FCLN = factor(dat$FCLN)
dat$MCLN = factor(dat$MCLN)
dat$REP = factor(dat$REP)
dat$SETS = factor(dat$SETS)
dat$FAM = factor(dat$FAM)
dat$CP = factor(dat$CP)
dat$Tree_id = factor(dat$Tree_id)

# Pedigree
names(ped) = 'Tree_id'
ped$Tree_id = factor(ped$Tree_id)
ped$Mother = factor(ped$Mother)
ped$Father = factor(ped$Father)

# Inverse of the numerator relationship matrix
Ainv = asreml.Ainverse(ped)$ginv

# Wrap call to a generic bivariate analysis
# (This one uses the same model for all sites
# but it doesn't need to be so)
bivar = function(trial1, trial2) {
  t2 = dat[dat$Trial_id == trial1 | dat$Trial_id == trial2,]
  t2$den1 = ifelse(t2$Trial_id == trial1, t2$DEN, NA)
  t2$den2 = ifelse(t2$Trial_id == trial2, t2$DEN, NA)
  t2$Trial_id = t2$Trial_id[drop = TRUE]

  # Bivariate run
  # Fits overall mean, random correlated additive effects with
  # heterogeneous variances and a diagonal matrix for heterogeneous
  # residuals
  den.bi = asreml(cbind(den1, den2) ~ trait,
                  random = ~ ped(Tree_id):corgh(trait),
                  data = t2,
                  ginverse = list(Tree_id = Ainv),
                  rcov = ~ units:diag(trait),
                  workspace = 64e06, maxiter = 30)

  # Returns only the log-likelihood for this example
  return(summary(den.bi)$loglik)
}

# Run the bivariate example in parallel for two pairs of sites
# FR38_1 with FR38_2, FR38_1 with FR38_3
foreach(trial1=rep("FR38_1",2), trial2=c("FR38_2", "FR38_3")) %dopar%
bivar(trial1, trial2)


The code runs on my laptop (only two cores), but I still have to test its performance on my desktop (4 cores) to see if it really makes a difference in time versus running the analyses sequentially. My initial mistake was using the multicore package (require(multicore)) directly, which doesn’t start the ‘parallel backend’ and therefore ran sequentially. Using require(doMC) loads multicore but also takes care of starting the ‘parallel backend’.
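The pattern here — register a backend, then map the same function over a list of argument pairs, either sequentially or in parallel — has direct analogues in other languages. A hypothetical Python sketch using concurrent.futures, with a cheap stand-in function instead of the real asreml call:

```python
from concurrent.futures import ThreadPoolExecutor

def bivar(trial1, trial2):
    """Stand-in for the real bivariate analysis: just pairs the labels."""
    return f"{trial1}-{trial2}"

pairs = [("FR38_1", "FR38_2"), ("FR38_1", "FR38_3")]

# Sequential version (the analogue of foreach ... %do%)
sequential = [bivar(t1, t2) for t1, t2 in pairs]

# Parallel version (the analogue of foreach ... %dopar%)
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(lambda p: bivar(*p), pairs))

print(parallel == sequential)  # True: same results, different scheduling
```

As with %dopar%, the gain only shows up when each call does real work, like a full asreml fit.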

P.S. Thanks to Etienne Laliberté, who sent me an email in early 2010 pointing out that I had to use doMC. One of the few exceptions in the pile of useless emails I have archived.
P.S.2. If you want to learn more about this type of models I recommend two books: Mrode’s Linear Models for the Prediction of Animal Breeding Values, which covers multivariate evaluation with lots of gory details, and Lynch and Walsh’s Genetics and Analysis of Quantitative Traits, which is the closest thing to the bible in quantitative genetics.
P.S.3. 2012-05-08. I did run the code in my desktop (iMac with 4 cores), making a couple of small changes in the code to have a fairer comparison between using %dopar% (for parallel execution) and %do% (for sequential execution).

1. Added trace = FALSE to the asreml() call, to avoid printing partial iteration details to the screen, which happened when using %do%.
2. Used 4 pairs of sites, so I could utilize all cores in my desktop.

Execution time for the parallel version—measured with Sys.time()—was, roughly, 1/(number of cores) the time required by the sequential version.

Categories

## On the (statistical) road, workshops and R

Things have been a bit quiet at Quantum Forest during the last ten days. Last Monday (Sunday for most readers) I flew to Australia to attend a couple of one-day workshops; one on spatial analysis (in Sydney) and another one on modern applications of linear mixed models (in Wollongong). This will be followed by attending The International Biometric Society Australasian Region Conference in Kiama.

I would like to comment on the workshops to look for commonalities and differences. First, both workshops heavily relied on R, supporting the idea that if you want to reach a lot of people and get them using your ideas, R is pretty much the vehicle to do so. It is almost trivial to get people to install R and RStudio before the workshop so they are ready to go. “Almost” because you have to count on someone having a bizarre software configuration or draconian security policies for their computer.

The workshop on Spatial Analysis also used WinBUGS, which, with all due respect to the developers, is a clunky piece of software. Calling it from R, or using JAGS from R, seems to me a much more sensible way of using a Bayesian approach while maintaining access to the full power of R. The workshop on linear mixed models relied on asreml-R; if you haven’t tried it, please give it a go (the license is free for academic/research use). There were applications on multi-stage experiments, composite samples and high-dimensional data (molecular information). In addition, there was an initial session on optimal design of experiments.

In my opinion, the second workshop (modern applications…) was much more successful than the first one (spatial analysis…) for a few reasons:

• One has to limit the material to cover in a one-day workshop; if you want to cover a lot consider three days so people can digest all the material.
• One has to avoid the “split-personality” approach to presentations; having very-basic and super-hard but nothing in the middle is not a great idea (IMHO). Pick a starting point and gradually move people from there.
• Limit the introduction of new software. One software per day seems to be a good rule of thumb.

Something bizarre (for me, at least) was the difference between the audiences. In Sydney the crowd was a lot younger, with many trainees in biostatistics coming mostly from health research. They had little exposure to R and seemed to come from a mostly-SAS shop. The crowd in Wollongong had people with a lot of experience (OK, oldish) in both statistics and R. I was expecting the young people to be more conversant with R.

Tomorrow we will drive down to Kiama, sort out registration and then go to the welcome BBQ. Funny thing is that this is my first statistics conference; as I mentioned in the About page of this blog, I am a simple forester. 🙂

Categories

## Surviving a binomial mixed model

A few years ago we had this really cool idea: we had to establish a trial to understand wood quality in context. Sort of following the saying “we don’t know who discovered water, but we are sure that it wasn’t a fish” (attributed to Marshall McLuhan). By now you are thinking WTF is this guy talking about? But the idea was simple: let’s put in a trial that had the species we wanted to study (Pinus radiata, a gymnosperm) and an angiosperm (Eucalyptus nitens, if you wish to know) to provide the contrast, as they are supposed to have vastly different types of wood. From space the trial looked like this:

The reason you can clearly see the pines but not the eucalypts is that the latter were dying like crazy over a summer drought (45% mortality in one month). And here we get to the analytical part: we will look only at the eucalypts, where the response variable can’t get any clearer: trees were either totally dead or alive. The experiment followed a randomized complete block design, with 50 open-pollinated families in 48 blocks. The original idea was to harvest 12 blocks each year but—for obvious reasons—we canned this part of the experiment after the first year.

The following code shows the analysis in asreml-R, lme4 and MCMCglmm:

load('~/Dropbox/euc.Rdata')

library(asreml)
sasreml = asreml(surv ~ 1, random = ~ Fami + Block,
                 data = euc,
                 family = asreml.binomial(link = 'logit'))

summary(sasreml)$varcomp

#                    gamma component  std.error  z.ratio
#Fami!Fami.var   0.5704205 0.5704205 0.14348068 3.975591
#Block!Block.var 0.1298339 0.1298339 0.04893254 2.653324
#R!variance      1.0000000 1.0000000         NA       NA
#                constraint
#Fami!Fami.var     Positive
#Block!Block.var   Positive
#R!variance           Fixed

# Quick look at heritability
varFami = summary(sasreml)$varcomp[1, 2]
varRep = summary(sasreml)$varcomp[2, 2]
h2 = 4*varFami/(varFami + varRep + 3.29)
h2
# 0.5718137

library(lme4)
slme4 = lmer(surv ~ 1 + (1|Fami) + (1|Block),
             data = euc,
             family = binomial(link = 'logit'))

summary(slme4)

#Generalized linear mixed model fit by the Laplace approximation
#Formula: surv ~ 1 + (1 | Fami) + (1 | Block)
#   Data: euc
#  AIC  BIC logLik deviance
# 2725 2742  -1360     2719
#Random effects:
# Groups Name        Variance Std.Dev.
# Fami   (Intercept) 0.60941  0.78065
# Block  (Intercept) 0.13796  0.37143
#Number of obs: 2090, groups: Fami, 51; Block, 48
#
#Fixed effects:
#            Estimate Std. Error z value Pr(>|z|)
#(Intercept)   0.2970     0.1315   2.259   0.0239 *

# Quick look at heritability
varFami = VarCorr(slme4)$Fami
varRep = VarCorr(slme4)$Block
h2 = 4*varFami/(varFami + varRep + 3.29)
h2
# 0.6037697

# And let's play at being Bayesians!
library(MCMCglmm)
pr = list(R = list(V = 1, n = 0, fix = 1),
          G = list(G1 = list(V = 1, n = 0.002),
                   G2 = list(V = 1, n = 0.002)))

sb <- MCMCglmm(surv ~ 1, random = ~ Fami + Block,
               family = 'categorical',
               data = euc,
               prior = pr,
               verbose = FALSE,
               pr = TRUE,
               burnin = 10000,
               nitt = 100000,
               thin = 10)

plot(sb$VCV)

You may be wondering: where does the 3.29 in the heritability formula come from? Well, that’s the variance of the link function which, in the case of the logit link, is pi*pi/3. In the case of MCMCglmm we can estimate the degree of genetic control quite easily, remembering that we have half-siblings (open-pollinated plants):
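Before that heritability calculation, a quick numeric aside (a Python check, not part of the original analysis): the standard logistic distribution indeed has variance pi²/3 ≈ 3.29, the asreml-style heritability can be reproduced from the reported components, and effects on the logit scale back-transform to probabilities with the inverse logit.

```python
import math

link_var = math.pi ** 2 / 3
print(round(link_var, 2))  # 3.29

# Heritability from the asreml components reported above
var_fami, var_block = 0.5704205, 0.1298339
h2 = 4 * var_fami / (var_fami + var_block + 3.29)
print(round(h2, 7))  # 0.5718137

# Back-transforming an effect on the logit scale to a probability
inv_logit = lambda x: 1 / (1 + math.exp(-x))
print(round(inv_logit(0.2970), 3))  # 0.574 — survival probability at the intercept
```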

# Heritability
h2 = 4*sb$VCV[, 'Fami']/(sb$VCV[, 'Fami'] +
sb$VCV[, 'Block'] + 3.29 + 1)

posterior.mode(h2)
#      var1
# 0.6476185

HPDinterval(h2)
#          lower     upper
# var1 0.4056492 0.9698148
# attr(,"Probability")
# 0.95

plot(h2)

By the way, it is good to remember that we need to back-transform the estimated effects to probabilities, with very simple code:

# Getting mode and credible interval for solutions
inv.logit(posterior.mode(sb$Sol))
inv.logit(HPDinterval(sb$Sol, 0.95))

Even if one of your trials is trashed there is a silver lining: it is possible to have a look at survival.

Categories

## Coming out of the (Bayesian) closet: multivariate version

This week I’m facing my—and many other lecturers’—least favorite part of teaching: grading exams. In a supreme act of procrastination I will continue the previous post, and the antepenultimate one, showing the code for a bivariate analysis of a randomized complete block design.

Just to recap, the results from the REML multivariate analysis (which used ASReml-R) were the following:

library(asreml)

m4 = asreml(cbind(bden, veloc) ~ trait,
            random = ~ us(trait):Block + us(trait):Family,
            data = a,
            rcov = ~ units:us(trait))

summary(m4)$varcomp

#                                      gamma    component    std.error
#trait:Block!trait.bden:bden    1.628812e+02 1.628812e+02 7.854123e+01
#trait:Block!trait.veloc:bden   1.960789e-01 1.960789e-01 2.273473e-01
#trait:Block!trait.veloc:veloc  2.185595e-03 2.185595e-03 1.205128e-03
#trait:Family!trait.bden:bden   8.248391e+01 8.248391e+01 2.932427e+01
#trait:Family!trait.veloc:bden  1.594152e-01 1.594152e-01 1.138992e-01
#trait:Family!trait.veloc:veloc 2.264225e-03 2.264225e-03 8.188618e-04
#R!variance                     1.000000e+00 1.000000e+00           NA
#R!trait.bden:bden              5.460010e+02 5.460010e+02 3.712833e+01
#R!trait.veloc:bden             6.028132e-01 6.028132e-01 1.387624e-01
#R!trait.veloc:veloc            1.710482e-02 1.710482e-02 9.820673e-04
#                                  z.ratio constraint
#trait:Block!trait.bden:bden     2.0738303   Positive
#trait:Block!trait.veloc:bden    0.8624639   Positive
#trait:Block!trait.veloc:veloc   1.8135789   Positive
#trait:Family!trait.bden:bden    2.8128203   Positive
#trait:Family!trait.veloc:bden   1.3996166   Positive
#trait:Family!trait.veloc:veloc  2.7650886   Positive
#R!variance                             NA      Fixed
#R!trait.bden:bden              14.7057812   Positive
#R!trait.veloc:bden              4.3442117   Positive
#R!trait.veloc:veloc            17.4171524   Positive


The corresponding MCMCglmm code is not that different from the ASReml-R code, on which it is modeled anyway. Following the recommendations of the MCMCglmm Course Notes (included with the package), the priors have been expanded to diagonal matrices with degree of belief equal to the number of traits. The general intercept is dropped (-1) so the trait keyword represents the trait means. We are fitting unstructured (us(trait)) covariance matrices for both Block and Family, as well as an unstructured covariance matrix for the residuals. Finally, both traits follow a Gaussian distribution:

library(MCMCglmm)
bp = list(R = list(V = diag(c(0.007, 260)), n = 2),
G = list(G1 = list(V = diag(c(0.007, 260)), n = 2),
G2 = list(V = diag(c(0.007, 260)), n = 2)))

bmod = MCMCglmm(cbind(veloc, bden) ~ trait - 1,
random = ~ us(trait):Block + us(trait):Family,
rcov = ~ us(trait):units,
family = c('gaussian', 'gaussian'),
data = a,
prior = bp,
verbose = FALSE,
pr = TRUE,
burnin = 10000,
nitt = 20000,
thin = 10)


Further manipulation of the posterior distributions requires having an idea of the names used to store the results. Following that, we can build an estimate of the genetic correlation between the traits (Family covariance between traits divided by the square root of the product of the Family variances). Incidentally, it wouldn’t be a bad idea to run a much longer chain for this model, so the plot of the posterior for the correlation looks better, but I’m short of time:

dimnames(bmod$VCV)

rg = bmod$VCV[, 'veloc:bden.Family']/sqrt(bmod$VCV[, 'veloc:veloc.Family'] *
                                          bmod$VCV[, 'bden:bden.Family'])

plot(rg)

posterior.mode(rg)
#     var1
#0.2221953

HPDinterval(rg, prob = 0.95)
#         lower     upper
#var1 -0.132996 0.5764006
#attr(,"Probability")
# 0.95

And that’s it! Time to congratulate Jarrod Hadfield for developing this package.
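For comparison, the same genetic correlation can be computed directly from the REML point estimates listed at the top of the post (a quick Python check; the value differs a bit from the posterior mode, as one would expect from a different estimator and a short chain):

```python
import math

# Family (co)variance components from the ASReml-R run above
cov_veloc_bden = 1.594152e-01
var_bden = 8.248391e+01
var_veloc = 2.264225e-03

# Genetic correlation: covariance over the square root of the variance product
rg = cov_veloc_bden / math.sqrt(var_bden * var_veloc)
print(round(rg, 3))  # 0.369
```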

Categories

## Coming out of the (Bayesian) closet

Until today all the posts in this blog have used a frequentist view of the world. I have a confession to make: I have an ecumenical view of statistics and I do sometimes use Bayesian approaches in data analyses. This is not quite one of those “the truth will set you free” moments, but I’ll show that one could almost painlessly repeat some of the analyses I presented before using MCMC.

MCMCglmm is a very neat package that—as its rather complicated em cee em cee gee el em em acronym implies—implements MCMC for generalized linear mixed models. We’ll skip that the frequentist fixed vs random effects distinction gets blurry in Bayesian models and still use the f… and r… terms. I’ll first repeat the code for a Randomized Complete Block design with Family effects (so we have two random factors) using both lme4 and ASReml-R and add the MCMCglmm counterpart:

library(lme4)
m1 = lmer(bden ~ 1 + (1|Block) + (1|Family),
data = trees)

summary(m1)

#Linear mixed model fit by REML
#Formula: bden ~ (1 | Block) + (1 | Family)
#   Data: a
#  AIC  BIC logLik deviance REMLdev
# 4572 4589  -2282     4569    4564
#Random effects:
# Groups   Name        Variance Std.Dev.
# Family   (Intercept)  82.803   9.0996
# Block    (Intercept) 162.743  12.7571
# Residual             545.980  23.3662
#Number of obs: 492, groups: Family, 50; Block, 11

#Fixed effects:
#            Estimate Std. Error t value
#(Intercept)  306.306      4.197   72.97

library(asreml)
m2 = asreml(bden ~ 1, random = ~ Block + Family,
data = trees)

summary(m2)$varcomp

#                      gamma component std.error   z.ratio constraint
#Block!Block.var   0.2980766 162.74383  78.49271  2.073362   Positive
#Family!Family.var 0.1516591  82.80282  29.47153  2.809587   Positive
#R!variance        1.0000000 545.97983  37.18323 14.683496   Positive

m2$coeff$fixed
#(Intercept)
#    306.306

We had already established that the results obtained from lme4 and ASReml-R were pretty much the same, at least for relatively simple models where we can use both packages (as their functionality diverges later for more complex models). This example is no exception and we quickly move to fitting the same model using MCMCglmm:

library(MCMCglmm)

priors = list(R = list(V = 260, n = 0.002),
              G = list(G1 = list(V = 260, n = 0.002),
                       G2 = list(V = 260, n = 0.002)))

m4 = MCMCglmm(bden ~ 1, random = ~ Block + Family,
              family = 'gaussian',
              data = a,
              prior = priors,
              verbose = FALSE,
              pr = TRUE,
              burnin = 10000,
              nitt = 20000,
              thin = 10)

plot(mcmc.list(m4$VCV))

autocorr(m4$VCV) posterior.mode(m4$VCV)
#    Block    Family     units
#126.66633  72.97771 542.42237

HPDinterval(m4$VCV) # lower upper #Block 33.12823 431.0233 #Family 26.34490 146.6648 #units 479.24201 627.7724  The first difference is that we have to specify priors for the coefficients that we would like to estimate (by default fixed effects, the overall intercept for our example, start with a zero mean and very large variance: 106). The phenotypic variance for our response is around 780, which I split into equal parts for Block, Family and Residuals. For each random effect we have provided our prior for the variance (V) and a degree of belief on our prior (n). In addition to the model equation, name of the data set and prior information we need to specify things like the number of iterations in the chain (nitt), how many we are discarding for the initial burnin period (burnin), and how many values we are keeping (thin, every ten). Besides the pretty plot of the posterior distributions (see previous figure) they can be summarized using the posterior mode and high probability densities. One of the neat things we can do is to painlessly create functions of variance components and get their posterior mode and credible interval. For example, the heritability (or degree of additive genetic control) can be estimated in this trial with full-sib families using the following formula: $$hat{h^2} = frac{2 sigma_F^2}{sigma_F^2 + sigma_B^2 + sigma_R^2}$$ h2 = 2 * m4$VCV[, 'Family']/(m4$VCV[, 'Family'] + m4$VCV[, 'Block'] + m4$VCV[, 'units']) plot(h2) posterior.mode(h2) #0.1887017 HPDinterval(h2) # lower upper #var1 0.05951232 0.3414216 #attr(,"Probability") # 0.95 There are some differences on the final results between ASReml-R/lme4 and MCMCglmm; however, the gammas (ratios of variance component/error variance) for the posterior distributions are very similar, and the estimated heritabilities are almost identical (~0.19 vs ~0.21). Overall, MCMCglmm is a very interesting package that covers a lot of ground. 
It pays to read the reference manual and vignettes, which go into a lot of detail on how to specify different types of models. I will present the MCMCglmm bivariate analyses in another post.

P.S. There are several other ways that we could fit this model in R using a Bayesian approach: it is possible to call WinBUGS or JAGS (in Linux and OS X) from R, or we could have used INLA. More on this in future posts.

## Multivariate linear mixed models: livin' la vida loca

I swear there was a point in writing an introduction to covariance structures: now we can start joining all sorts of analyses using very similar notation. In a previous post I described simple (even simplistic) models for a single response variable (or 'trait' in quantitative geneticist speak). The R code in three R packages (asreml-R, lme4 and nlme) was quite similar and we were happy-clappy with the consistency of results across packages. The curse of the analyst/statistician/guy who dabbles in analyses is the idea that we can always fit a nicer model and—as both Bob the Builder and Obama like to say—yes, we can.

Let's assume that we have a single trial where we have assessed our experimental units (trees in my case) for more than one response variable; for example, we measured the trees for acoustic velocity (related to stiffness) and basic density. If you don't like trees, think of having height and body weight for people, or whatever takes your fancy. Anyway, we have our traditional randomized complete block design with random blocks, random families and that's it. Rhetorical question: can we simultaneously run an analysis for all responses?

setwd('~/Dropbox/quantumforest')
library(asreml)

canty = read.csv('canty_trials.csv', header = TRUE)
summary(canty)

m1 = asreml(bden ~ 1, random = ~ Block + Family, data = canty)
summary(m1)$varcomp

                      gamma component std.error   z.ratio constraint
Block!Block.var   0.2980766 162.74383  78.49271  2.073362   Positive
Family!Family.var 0.1516591  82.80282  29.47153  2.809587   Positive
R!variance        1.0000000 545.97983  37.18323 14.683496   Positive

m2 = asreml(veloc ~ 1, random = ~ Block + Family, data = canty)
summary(m2)$varcomp

                      gamma   component    std.error   z.ratio constraint
Block!Block.var   0.1255846 0.002186295 0.0011774906  1.856741   Positive
Family!Family.var 0.1290489 0.002246605 0.0008311341  2.703059   Positive
R!variance        1.0000000 0.017408946 0.0012004136 14.502456   Positive

Up to this point we are using the same old code, and remember that we could fit the same model using lme4, so what's the point of this post? Well, we can now move to fit a multivariate model, where we have two responses at the same time (incidentally, below we have a plot of the two response variables, showing a correlation of ~0.2).

We can first refit the model as a multivariate analysis, assuming block-diagonal covariance matrices. The notation now includes:

- The use of cbind() to specify the response matrix,
- the reserved keyword trait, which creates a vector to hold the overall mean for each response,
- at(trait), which asks ASReml-R to fit an effect (e.g. Block) at each trait, by default using a diagonal covariance matrix (σ² I). We could also use diag(trait) for the same effect,
- rcov = ~ units:diag(trait), which specifies a different diagonal matrix for the residuals (units) of each trait.

m3 = asreml(cbind(bden, veloc) ~ trait,
            random = ~ at(trait):Block + at(trait):Family,
            data = canty,
            rcov = ~ units:diag(trait))

summary(m3)$varcomp
                                          gamma    component    std.error   z.ratio constraint
at(trait, bden):Block!Block.var    1.627438e+02 1.627438e+02 78.492736507  2.073362   Positive
at(trait, veloc):Block!Block.var   2.186295e-03 2.186295e-03  0.001177495  1.856733   Positive
at(trait, bden):Family!Family.var  8.280282e+01 8.280282e+01 29.471507439  2.809589   Positive
at(trait, veloc):Family!Family.var 2.246605e-03 2.246605e-03  0.000831134  2.703059   Positive
R!variance                         1.000000e+00 1.000000e+00           NA        NA       Fixed
R!trait.bden.var                   5.459799e+02 5.459799e+02 37.183234014 14.683496   Positive
R!trait.veloc.var                  1.740894e-02 1.740894e-02  0.001200414 14.502455   Positive


Initially, you may not notice that the results are identical, as there is a distracting change to scientific notation for the variance components. A closer inspection shows that we have obtained the same results for both traits, but did we gain anything? Not really, as we took the defaults for covariance components (direct sum of diagonal matrices, which assumes uncorrelated traits); however, we can do better and actually tell ASReml-R to fit the correlation between traits for block and family effects as well as for residuals.

m4 = asreml(cbind(bden, veloc) ~ trait,
            random = ~ us(trait):Block + us(trait):Family,
            data = canty,
            rcov = ~ units:us(trait))

summary(m4)$varcomp

                                      gamma    component    std.error    z.ratio constraint
trait:Block!trait.bden:bden    1.628812e+02 1.628812e+02 7.854123e+01  2.0738303   Positive
trait:Block!trait.veloc:bden   1.960789e-01 1.960789e-01 2.273473e-01  0.8624639   Positive
trait:Block!trait.veloc:veloc  2.185595e-03 2.185595e-03 1.205128e-03  1.8135789   Positive
trait:Family!trait.bden:bden   8.248391e+01 8.248391e+01 2.932427e+01  2.8128203   Positive
trait:Family!trait.veloc:bden  1.594152e-01 1.594152e-01 1.138992e-01  1.3996166   Positive
trait:Family!trait.veloc:veloc 2.264225e-03 2.264225e-03 8.188618e-04  2.7650886   Positive
R!variance                     1.000000e+00 1.000000e+00           NA         NA       Fixed
R!trait.bden:bden              5.460010e+02 5.460010e+02 3.712833e+01 14.7057812   Positive
R!trait.veloc:bden             6.028132e-01 6.028132e-01 1.387624e-01  4.3442117   Positive
R!trait.veloc:veloc            1.710482e-02 1.710482e-02 9.820673e-04 17.4171524   Positive

Moving from model 3 to model 4 we are adding three covariance components (one each for block, family and residuals) and improving the log-likelihood by 8.5. A quick look at the output from m4 suggests that most of that gain comes from allowing for the covariance of residuals between the two traits, as the covariances for family and, particularly, block are more modest:

summary(m3)$loglik
 -1133.312

summary(m4)\$loglik
 -1124.781
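Since model 3 is nested within model 4 (just constrain the three covariances to zero), we can sketch a likelihood-ratio test for the extra parameters; keep in mind the usual caveats about likelihood-ratio tests for (co)variance parameters near the boundary of the parameter space:

```r
# Twice the difference in log-likelihoods, compared against a
# chi-squared distribution with 3 d.f. (three extra covariances)
LRT = 2 * (summary(m4)$loglik - summary(m3)$loglik)  # 2 * 8.531 = 17.062
pchisq(LRT, df = 3, lower.tail = FALSE)
```

The improvement is clearly larger than we would expect by chance for three extra parameters.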


An alternative parameterization for model 4 would be to use a correlation matrix with heterogeneous variances (corgh(trait) instead of us(trait)), which models the correlations and the variances rather than the covariances and the variances. In some cases this parameterization is numerically more stable.
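A sketch of that parameterization, assuming corgh() is accepted in the rcov term as well (same model, just a different way of expressing the covariance matrices):

```r
# us(trait) swapped for corgh(trait): correlation + heterogeneous variances
m5 = asreml(cbind(bden, veloc) ~ trait,
            random = ~ corgh(trait):Block + corgh(trait):Family,
            data = canty,
            rcov = ~ units:corgh(trait))
summary(m5)$varcomp
```

The log-likelihood should match model 4, but the output now reports correlations directly instead of covariances.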

As an aside, we can estimate the between-traits correlation for Blocks (probably not that interesting) and the one for Family (much more interesting, as it is an estimate of the genetic correlation between traits: 1.594152e-01 / sqrt(8.248391e+01 * 2.264225e-03) = 0.37).
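Rather than typing the numbers by hand, we can pull the components straight from the summary; a sketch assuming the row names match the m4 output shown above:

```r
# Genetic correlation from the m4 variance components, by row name
vc = summary(m4)$varcomp
rG = vc['trait:Family!trait.veloc:bden', 'component'] /
     sqrt(vc['trait:Family!trait.bden:bden', 'component'] *
          vc['trait:Family!trait.veloc:veloc', 'component'])
rG  # ~0.37
```

The same pattern works for the Block and residual correlations, swapping the row names accordingly.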