When R, or any other language, is not enough

This post is tangential to R, although R has a fair share of the issues I mention here, which include research reproducibility, open source, paying for software, multiple languages, salt and pepper.

There is an increasing interest in the reproducibility of research. In many topics we face multiple, often conflicting claims, and as researchers we value the ability to evaluate those claims, including repeating/reproducing research results. While I share the interest in reproducibility, sometimes I feel we are obsessing too much over only one part of the research process: the statistical analysis. Even there, many people focus not on the models per se, but only on the code for the analysis, which should only use tools that are free of charge.

There has been enormous progress in the R world on literate programming, where the combination of RStudio + Markdown + knitr has made analyzing data and documenting the process almost enjoyable. Nevertheless, and here is the BUT coming, there is a large difference between making the code repeatable and making research reproducible.

As an example, I am currently working on a project that relies on two trials, which have taken a decade to grow. We took a few hundred increment cores from a sample of trees and processed them using a densitometer, an X-ray diffractometer and a few other lab toys. By now you get the idea: actually replicating the research would take quite a few resources before you even start to play with free software. At that point, of course, I want to get the most out of my data, which means that I won't settle for a half-assed model because the software is not able to fit it. If you think about it, spending a couple of grand on software (say ASReml and Mathematica licenses) doesn't sound outrageous at all. Furthermore, reproducing this piece of research would require a decade, access to the genetic material and the lab toys. I'll give you the code for free, but I can't give you ten years or $0.25 million…

In addition, the research process may require linking disparate sources of data for which other languages (e.g. Python) may be more appropriate. Sometimes R is the perfect tool for the job, while other times I feel we have reached peak VBS (Visual Basic Syndrome) in R: people want to use it for everything, even when it's a bad idea.

In summary,

  • research is much more than a few lines of R (although they are very important),
  • even when considering data collection and analysis, it is a good idea to know more than a single language/software, because it broadens your analytical options, and
  • I prefer free (freedom+beer) software for research; however, I rely on non-free, commercial software for part of my work because it happens to be the best option for specific analyses.

Disclaimer: my primary analysis language is R and I often use lme4, MCMCglmm and INLA (all free). However, many (if not most) of my analyses that use genetic information rely on ASReml (paid, not open source). I’ve used Mathematica, Matlab, Stata and SAS for specific applications with reasonably priced academic licenses.

Gratuitous picture: 3000 trees leaning in a foggy Christchurch day (Photo: Luis).

Split-plot 1: What does a linear mixed model look like?

I like statistics and I struggle with statistics. Often I get frustrated when I don't understand something, and I really struggled to make sense of Kruschke's Bayesian analysis of a split-plot, particularly because it 'didn't look like' a split-plot to me.

Additionally, I have written a few posts discussing linear mixed models, using several different packages to fit them. At no point have I shown the calculations that happen behind the scenes. So I decided to combine my frustration and an explanation to myself in a couple of posts. This is number one and the follow-up is Split-plot 2: let's throw in some spatial effects.

Example of forestry split-plot: one of my colleagues has a trial in which stocking (number of trees per ha) is the main plot and fertilization is the subplot (higher stockings look darker because trees are closer together).

What do linear mixed models look like?

Linear mixed models, models that combine so-called fixed and random effects, are often represented using matrix notation as:

\( \boldsymbol{y} = \boldsymbol{X b} + \boldsymbol{Z a} + \boldsymbol{e}\)

where \( \boldsymbol{y}\) is the response vector, \( \boldsymbol{X}\) and \( \boldsymbol{Z}\) are incidence matrices relating the response to the fixed effects (\( \boldsymbol{b}\), e.g. overall mean, treatment effects, etc.) and the random effects (\( \boldsymbol{a}\), e.g. family effects, additive genetic effects and a host of experimental design effects like blocks, rows, columns, plots, etc.), and \( \boldsymbol{e}\) is the vector of random residuals.

The previous model equation still requires some assumptions about the distribution of the random effects. A very common set of assumptions is to say that the residuals are iid (identically and independently distributed) normal (so \( \boldsymbol{R} = \sigma^2_e \boldsymbol{I} \)) and that the random effects are independent of each other, so \( \boldsymbol{G} = \boldsymbol{B} \bigoplus \boldsymbol{M}\) is a direct sum of the variance of blocks (\( \boldsymbol{B} = \sigma^2_B \boldsymbol{I}\)) and main plots (\( \boldsymbol{M} = \sigma^2_M \boldsymbol{I}\)).
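
To make the direct sum concrete: for the oats example used later in this post (6 blocks and 18 block-by-main-plot combinations), \( \boldsymbol{G}\) is just a block-diagonal matrix:

\(
\boldsymbol{G} = \boldsymbol{B} \bigoplus \boldsymbol{M} =
\left [
\begin{array}{cc}
\sigma^2_B \boldsymbol{I}_{6} & \boldsymbol{0} \\
\boldsymbol{0} & \sigma^2_M \boldsymbol{I}_{18}
\end{array}
\right ]
\)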

An interesting feature of this matrix formulation is that we can express all sorts of models by choosing different assumptions for our covariance matrices (using different covariance structures). Do you have longitudinal data (units assessed repeatedly over time)? Is there spatial correlation? Account for this in the \( \boldsymbol{R}\) matrix. Are random effects correlated (e.g. maternal and additive genetic effects in genetics)? Account for this in the \( \boldsymbol{G}\) matrix. Multivariate response? Deal with unstructured \( \boldsymbol{R}\) and \( \boldsymbol{G}\), or model the correlation structure using different constraints (and then you'll need asreml).

By the way, the history of linear mixed models is strongly related to applications of quantitative genetics for the prediction of breeding values, particularly in dairy cattle. Charles Henderson† developed what is now called Henderson's Mixed Model Equations to simultaneously estimate fixed effects and predict random genetic effects:

\(
\left [
\begin{array}{cc}
\boldsymbol{X}' \boldsymbol{R}^{-1} \boldsymbol{X} & \boldsymbol{X}' \boldsymbol{R}^{-1} \boldsymbol{Z} \\
\boldsymbol{Z}' \boldsymbol{R}^{-1} \boldsymbol{X} & \boldsymbol{Z}' \boldsymbol{R}^{-1} \boldsymbol{Z} + \boldsymbol{G}^{-1}
\end{array}
\right ]
\left [
\begin{array}{c}
\boldsymbol{b} \\
\boldsymbol{a}
\end{array}
\right ] =
\left [
\begin{array}{c}
\boldsymbol{X}' \boldsymbol{R}^{-1} \boldsymbol{y} \\
\boldsymbol{Z}' \boldsymbol{R}^{-1} \boldsymbol{y}
\end{array}
\right ] \)

The big matrix on the left-hand side of this equation is often called the \( \boldsymbol{C} \) matrix. You could be thinking ‘What does this all mean?’ It is easier to see what is going on with a small example, but rather than starting with, say, a complete block design, we’ll go for a split-plot to start tackling my annoyance with the aforementioned blog post.

Old school: physical split-plots

Given that I'm an unsophisticated forester and that I'd like to use data available to anyone, I'll rely on an agricultural example (so plots are actually physical plots in the field) that goes back to Frank Yates. There are two factors (oats variety, with three levels, and fertilization, with four levels). Yates, F. (1935) Complex experiments, Journal of the Royal Statistical Society Suppl. 2, 181–247 (behind a paywall).

Layout of the oats experiment in Yates's paper, from a time when articles were meaty. Each of the 6 replicates is divided into 3 main plots for oats varieties (v1, v2 and v3), while each main plot is divided into four subplots with different levels of fertilization (manure, i.e. animal crap, n1 to n4). Cells display yield.

Now it is time to roll up our sleeves and use some code, getting the data and fitting the same model using nlme (m1) and asreml (m2), just for the fun of it. Anyway, nlme and asreml produce exactly the same results.

We will use the oats data set that comes with MASS, although there is also an Oats data set in nlme and another version in the asreml package. (By the way, you can see a very good explanation by Bill Venables of a ‘traditional’ ANOVA analysis for a split-plot here):

require(MASS) # we get the oats data from here
require(nlme) # for lme function
require(asreml) # for asreml function. The asreml package's own Oats dataset uses
                # different variable names, which may require renaming to run the code below
 
# Get the oats data set and check structure
data(oats)
head(oats)
str(oats)
 
# Create a main plot effect for clarity's sake
oats$MP = oats$V
 
# Split-plot using NLME
m1 = lme(Y ~ V*N, random = ~ 1|B/MP, data = oats)
summary(m1)
fixef(m1)
ranef(m1)
 
# Split-plot using ASReml
m2 = asreml(Y ~ V*N, random = ~ B/MP, data = oats)
summary(m2)$varcomp
coef(m2)$fixed
coef(m2)$random
 
fixef(m1)
#        (Intercept)         VMarvellous            VVictory             N0.2cwt 
#         80.0000000           6.6666667          -8.5000000          18.5000000 
#            N0.4cwt             N0.6cwt VMarvellous:N0.2cwt    VVictory:N0.2cwt 
#         34.6666667          44.8333333           3.3333333          -0.3333333 
#VMarvellous:N0.4cwt    VVictory:N0.4cwt VMarvellous:N0.6cwt    VVictory:N0.6cwt 
#         -4.1666667           4.6666667          -4.6666667           2.1666667 
 
ranef(m1)
# Level: B 
# (Intercept)
# I     25.421511
# II     2.656987
# III   -6.529883
# IV    -4.706019
# V    -10.582914
# VI    -6.259681
# 
# Level: MP %in% B 
# (Intercept)
# I/Golden.rain      2.348296
# I/Marvellous      -3.854348
# I/Victory         14.077467
# II/Golden.rain     4.298706
# II/Marvellous      6.209473
# II/Victory        -9.194250
# III/Golden.rain   -7.915950
# III/Marvellous    10.750776
# III/Victory       -6.063976
# IV/Golden.rain     5.789462
# IV/Marvellous     -7.115566
# IV/Victory        -1.001111
# V/Golden.rain      1.116768
# V/Marvellous      -9.848096
# V/Victory          3.497878
# VI/Golden.rain    -5.637282
# VI/Marvellous      3.857761
# VI/Victory        -1.316009

Now we can move on to implementing the Mixed Model Equations, where probably the only gotcha is the definition of the \(\boldsymbol{Z}\) matrix (the incidence matrix for the random effects): both nlme and asreml use one column per level of the factor for both the main and interaction effects, which involves using the contrasts.arg argument in model.matrix().

# Variance components
varB = 214.477
varMP = 106.062
varR = 177.083
 
# Basic vector and matrices: y, X, Z, G & R
y = matrix(oats$Y, nrow = dim(oats)[1], ncol = 1)
X = model.matrix(~ V*N, data = oats)
Z = model.matrix(~ B/MP - 1, data = oats, 
                 contrasts.arg = list(B = contrasts(oats$B, contrasts = F), 
                                      MP = contrasts(oats$MP, contrasts = F)))
 
G = diag(c(rep(varB, 6), rep(varMP, 18)))
R = diag(varR, 72, 72)
Rinv = solve(R)
 
# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z
 
XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X
 
Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y
 
# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
          cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)
 
blup = solve(C, RHS)
blup
 
# [,1]
# (Intercept)          80.0000000
# VMarvellous           6.6666667
# VVictory             -8.5000000
# N0.2cwt              18.5000000
# N0.4cwt              34.6666667
# N0.6cwt              44.8333333
# VMarvellous:N0.2cwt   3.3333333
# VVictory:N0.2cwt     -0.3333333
# VMarvellous:N0.4cwt  -4.1666667
# VVictory:N0.4cwt      4.6666667
# VMarvellous:N0.6cwt  -4.6666667
# VVictory:N0.6cwt      2.1666667
# BI                   25.4215578
# BII                   2.6569919
# BIII                 -6.5298953
# BIV                  -4.7060280
# BV                  -10.5829337
# BVI                  -6.2596927
# BI:MPGolden.rain      2.3482656
# BII:MPGolden.rain     4.2987082
# BIII:MPGolden.rain   -7.9159514
# BIV:MPGolden.rain     5.7894753
# BV:MPGolden.rain      1.1167834
# BVI:MPGolden.rain    -5.6372811
# BI:MPMarvellous      -3.8543865
# BII:MPMarvellous      6.2094778
# BIII:MPMarvellous    10.7507978
# BIV:MPMarvellous     -7.1155687
# BV:MPMarvellous      -9.8480945
# BVI:MPMarvellous      3.8577740
# BI:MPVictory         14.0774514
# BII:MPVictory        -9.1942649
# BIII:MPVictory       -6.0639747
# BIV:MPVictory        -1.0011059
# BV:MPVictory          3.4978963
# BVI:MPVictory        -1.3160021

Not surprisingly, we get the same results, except that we started by assuming the variance components from the previous analyses, so we could avoid implementing the code for restricted maximum likelihood estimation as well. By the way, given that \( \boldsymbol{R}^{-1} \) appears in all the terms (and \( \boldsymbol{R}\) here is just \( \sigma^2_e \boldsymbol{I}\)), it can be factored out of the MME, leaving terms like \( \boldsymbol{X}' \boldsymbol{X} \), i.e. without \( \boldsymbol{R}^{-1}\), making for simpler calculations. In fact, if you drop \( \boldsymbol{R}^{-1} \) it is easier to see what is going on in the different components of the \( \boldsymbol{C} \) matrix. For example, print \( \boldsymbol{X}' \boldsymbol{X} \) and you'll get the number of observations for the overall mean and for each of the levels of the fixed effect factors. Give it a try with the other submatrices too!

XpXnoR = t(X) %*% X
XpXnoR
#                    (Intercept) VMarvellous VVictory N0.2cwt N0.4cwt N0.6cwt
#(Intercept)                  72          24       24      18      18      18
#VMarvellous                  24          24        0       6       6       6
#VVictory                     24           0       24       6       6       6
#N0.2cwt                      18           6        6      18       0       0
#N0.4cwt                      18           6        6       0      18       0
#N0.6cwt                      18           6        6       0       0      18
#VMarvellous:N0.2cwt           6           6        0       6       0       0
#VVictory:N0.2cwt              6           0        6       6       0       0
#VMarvellous:N0.4cwt           6           6        0       0       6       0
#VVictory:N0.4cwt              6           0        6       0       6       0
#VMarvellous:N0.6cwt           6           6        0       0       0       6
#VVictory:N0.6cwt              6           0        6       0       0       6
#                    VMarvellous:N0.2cwt VVictory:N0.2cwt VMarvellous:N0.4cwt
#(Intercept)                           6                6                   6
#VMarvellous                           6                0                   6
#VVictory                              0                6                   0
#N0.2cwt                               6                6                   0
#N0.4cwt                               0                0                   6
#N0.6cwt                               0                0                   0
#VMarvellous:N0.2cwt                   6                0                   0
#VVictory:N0.2cwt                      0                6                   0
#VMarvellous:N0.4cwt                   0                0                   6
#VVictory:N0.4cwt                      0                0                   0
#VMarvellous:N0.6cwt                   0                0                   0
#VVictory:N0.6cwt                      0                0                   0
#                    VVictory:N0.4cwt VMarvellous:N0.6cwt VVictory:N0.6cwt
#(Intercept)                        6                   6                6
#VMarvellous                        0                   6                0
#VVictory                           6                   0                6
#N0.2cwt                            0                   0                0
#N0.4cwt                            6                   0                0
#N0.6cwt                            0                   6                6
#VMarvellous:N0.2cwt                0                   0                0
#VVictory:N0.2cwt                   0                   0                0
#VMarvellous:N0.4cwt                0                   0                0
#VVictory:N0.4cwt                   6                   0                0
#VMarvellous:N0.6cwt                0                   6                0
#VVictory:N0.6cwt                   0                   0                6
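
If you want to try the other submatrices, here is a quick sketch using the objects already created above; without \( \boldsymbol{R}^{-1}\) the entries are simply counts of observations shared by the corresponding levels:

XpZnoR = t(X) %*% Z # counts of observations shared by each fixed and random effect level
ZpZnoR = t(Z) %*% Z # diagonal: 12 observations per block and 4 per block:mainplot
XpZnoR
ZpZnoR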

I will leave it here and come back to this problem as soon as I can.

† Incidentally, a lot of the theoretical development was supported by Shayle Searle (a Kiwi statistician and Henderson’s colleague in Cornell University).

Covariance structures

In most linear mixed model packages (e.g. asreml, lme4, nlme, etc) one needs to specify only the model equation (the bit that looks like y ~ factors...) when fitting simple models. We explicitly say nothing about the covariances that complete the model specification. This is because most linear mixed model packages assume that, in the absence of any additional information, the covariance structure is the product of a scalar (a variance component) by a design matrix. For example, the residual covariance matrix in simple models is \( \boldsymbol{R} = \sigma^2_e \boldsymbol{I}\), the additive genetic variance matrix is \( \boldsymbol{G} = \sigma^2_a \boldsymbol{A}\) (where \( \boldsymbol{A}\) is the numerator relationship matrix), and the covariance matrix for a random effect f with incidence matrix \( \boldsymbol{Z}\) is \( \sigma^2_f \boldsymbol{Z} \boldsymbol{Z}'\).

However, there are several situations when analyses require a more complex covariance structure, usually a direct sum or a Kronecker product of two or more matrices. For example, an analysis of data from several sites might consider different error variances for each site, that is \( \boldsymbol{R} = \bigoplus_i \boldsymbol{R}_i\), where \( \bigoplus \) represents a direct sum and \( \boldsymbol{R}_i\) is the residual matrix for site i.

Another example of a more complex covariance structure is a multivariate analysis in a single site (so the same individual is assessed for two or more traits), where both the residual and additive genetic covariance matrices are constructed as the Kronecker product of two matrices. For example, \( \boldsymbol{R} = \boldsymbol{I} \otimes \boldsymbol{R}_0\), where \( \boldsymbol{I}\) is an identity matrix of size equal to the number of observations, \( \otimes \) is the Kronecker product (do not confuse it with a plain matrix multiplication) and \( \boldsymbol{R}_0\) is the error covariance matrix for the traits involved in the analysis. Similarly, \( \boldsymbol{G} = \boldsymbol{A} \otimes \boldsymbol{G}_0\), where all the matrices are as previously defined and \( \boldsymbol{G}_0\) is the additive covariance matrix for the traits.
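
A tiny sketch in R (with made-up numbers) may help to see what the Kronecker product is doing here; the kronecker() function (or the %x% operator) builds the full covariance matrix from its two components:

# Toy example: residual covariance for 3 individuals assessed for 2 traits
R0 = matrix(c(1.0, 0.3,
              0.3, 0.5), nrow = 2) # made-up error covariance between the two traits
R = kronecker(diag(3), R0)         # R = I (x) R0, a 6 x 6 block-diagonal matrix
R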

Some structures are easier to understand (at least for me) if we express a covariance matrix (M) as the product of a correlation matrix (C) pre- and post-multiplied by a diagonal matrix (D) containing standard deviations for each of the traits (M = D C D). That is:

\(
M = \left [
\begin{array}{cccc}
v_{11}& c_{12}& c_{13}& c_{14} \\
c_{21}& v_{22}& c_{23}& c_{24} \\
c_{31}& c_{32}& v_{33}& c_{34} \\
c_{41}& c_{42}& c_{43}& v_{44}
\end{array}
\right ]\) \(
C = \left [
\begin{array}{cccc}
1& r_{12}& r_{13}& r_{14} \\
r_{21}& 1& r_{23}& r_{24} \\
r_{31}& r_{32}& 1& r_{34} \\
r_{41}& r_{42}& r_{43}& 1
\end{array}
\right ]\) \(
D = \left [
\begin{array}{cccc}
s_{11}& 0& 0& 0 \\
0& s_{22}& 0& 0 \\
0& 0& s_{33}& 0 \\
0& 0& 0& s_{44}
\end{array}
\right ]\)

where the v are variances, the r correlations and the s standard deviations.
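
As a quick check of the M = D C D idea, we can take any covariance matrix in R, split it with cov2cor() and put it back together (the numbers below are made up):

# Made-up covariance matrix for three traits
M = matrix(c(4.0, 1.2, 0.8,
             1.2, 2.0, 0.5,
             0.8, 0.5, 1.0), nrow = 3)
C = cov2cor(M)               # correlation matrix
D = diag(sqrt(diag(M)))      # diagonal matrix of standard deviations
all.equal(D %*% C %*% D, M)  # TRUE: we recover the original covariance matrix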

If we do not impose any restriction on M, apart from being positive definite (p.d.), we are talking about an unstructured matrix (us() in asreml-R parlance). Thus, M or C can take any values (as long as the matrix remains p.d.), as is usual when analyzing multiple-trait problems.

There are cases when the order of assessment or the spatial location of the experimental units create patterns of variation, which are reflected by the covariance matrix. For example, the breeding value of an individual i observed at time j (\( a_{ij}\)) is a function of genes involved in expression at time j-k (\( a_{i,j-k}\)), plus the effect of genes acting in the new measurement (\( \alpha_j\)), which are considered independent of the past measurement: \( a_{ij} = \rho_{jk} a_{i,j-k} + \alpha_j\), where \( \rho_{jk}\) is the additive genetic correlation between measures j and k.

Rather than using a different correlation for each pair of ages, it is possible to postulate mechanisms which model the correlations. For example, an autoregressive model (ar() in asreml-R lingo), where the correlation between measurements j and k is \( r^{|j-k|}\). In this model \( \boldsymbol{M} = \boldsymbol{D} \boldsymbol{C}_{AR} \boldsymbol{D}\), where \( \boldsymbol{C}_{AR}\) (for equally spaced assessments) is:

\(
C_{AR} = \left [
\begin{array}{cccc}
1 & r^{|t_2-t_1|} & \ldots & r^{|t_m-t_1|}\\
r^{|t_2-t_1|} & 1 & \ldots & r^{|t_m-t_2|}\\
\vdots & \vdots & \ddots & \vdots \\
r^{|t_m-t_1|} & r^{|t_m-t_2|} & \ldots & 1
\end{array} \right ]\)

Assuming three different autocorrelation coefficients (0.95 solid line, 0.90 dashed line and 0.85 dotted line) we can get very different patterns with a few extra units of lag, as shown in the following graph:
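
A minimal sketch to draw this kind of decay pattern (correlation equal to \( r^{\mathrm{lag}}\) for each coefficient) is:

# Correlation as a function of lag for three autocorrelation coefficients
lag = 0:30
plot(lag, 0.95^lag, type = 'l', lty = 1, ylim = c(0, 1),
     xlab = 'Lag', ylab = 'Correlation')
lines(lag, 0.90^lag, lty = 2) # dashed
lines(lag, 0.85^lag, lty = 3) # dotted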

A model including this structure will certainly be more parsimonious (economical in terms of number of parameters) than one using an unstructured approach. Looking at the previous pattern it is a lot easier to understand why they are called 'structures'. A similar situation is considered in spatial analysis, where the 'independent errors' assumption of typical analyses is relaxed. A common spatial model will consider the presence of autocorrelated residuals in both directions (rows and columns). Here the level of autocorrelation will depend on the distance between trees rather than on time. We can get an idea of what separable processes look like using this code:

# Separable row-col autoregressive process: correlation of each cell in a
# dim x dim grid with the corner cell (1, 1)
car2 = function(dim, rhor, rhoc) {
    M = diag(dim)  # only used to get row and column indices
    rhor^(row(M) - 1) * rhoc^(col(M) - 1)
}
 
library(lattice)
levelplot(car2(20, 0.95, 0.85))

This correlation matrix can then be multiplied by a spatial residual variance to obtain the covariance and we can add up a spatially independent residual variance.
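
To make that last sentence concrete, here is a sketch with hypothetical variance components; for a separable process the full correlation matrix of, say, a 10 x 10 grid of trees is the Kronecker product of two AR(1) correlation matrices:

# AR(1) correlation matrix for n equally spaced positions
ar1 = function(n, rho) rho^abs(outer(1:n, 1:n, '-'))
 
var.spatial = 100 # hypothetical spatial variance
var.nugget = 20   # hypothetical spatially independent ('nugget') variance
 
# Covariance for a 10 x 10 grid: a 100 x 100 matrix
V = var.spatial * kronecker(ar1(10, 0.95), ar1(10, 0.85)) + var.nugget * diag(100)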

Much more detail on code notation for covariance structures can be found, for example, in the ASReml-R User Guide (PDF, chapter 4), for nlme in Pinheiro and Bates’s Mixed-effects models in S and S-plus (link to Google Books, chapter 5.3) and in Bates’s draft book for lme4 in chapter 4.

Longitudinal analysis: autocorrelation makes a difference

Back to posting after a long weekend and more than enough rugby coverage to last a few years. Anyway, back to linear models, where we usually assume normality, independence and homogeneous variances. In most statistics courses we live in a fantasy world where we meet all of the assumptions, but in real life—and trees and forests are no exceptions—there are plenty of occasions when we can badly deviate from one or more assumptions. In this post I present a simple example, where we have a number of clones (genetically identical copies of a tree), which had between 2 and 4 cores extracted, and each core was assessed for acoustic velocity (we care about it because it is inversely related to longitudinal shrinkage and its square is proportional to wood stiffness) every two millimeters. This small dataset is only a pilot for a much larger study currently underway.

At this stage I will ignore any relationship between the clones and focus on the core assessments. Let's think for a moment: we have five replicates (which restrict the randomization) and four clones (A, B, C and D). We have (mostly) 2 to 4 cores (cylindrical pieces of wood running from the tree pith to the cambium) within each tree, and we have longitudinal assessments for each core. I would expect that, at least, successive assessments for each core are not independent; that is, assessments that are closer together are more similar than those that are farther apart. What do the data look like? The trellis plot shows trees using a Clone:Rep notation:

library(lattice)
xyplot(velocity ~ distance | Tree, group=Core, 
                  data=cd, type=c('l'))

Incidentally, cores from Clone C in replicate four were damaged, so I dropped them from this example (real life is unbalanced as well!). Just in case, distance is in mm from the tree pith and velocity in m/s. Now we will fit an analysis that totally ignores any relationship between the successive assessments:

library(nlme)
lm1a = lme(ACV ~ Clone*Distance,
                 random = ~ 1|Rep/Tree/Core, data=cd)
summary(lm1a)
 
Linear mixed-effects model fit by REML
 Data: cd 
      AIC      BIC   logLik
  34456.8 34526.28 -17216.4
 
Random effects:
 Formula: ~1 | Rep
        (Intercept)
StdDev:    120.3721
 
 Formula: ~1 | Tree %in% Rep
        (Intercept)
StdDev:    77.69231
 
 Formula: ~1 | Core %in% Tree %in% Rep
        (Intercept) Residual
StdDev:    264.6254 285.9208
 
Fixed effects: ACV ~ Clone * Distance 
                     Value Std.Error   DF  t-value p-value
(Intercept)       3274.654 102.66291 2379 31.89715  0.0000
CloneB             537.829 127.93871   11  4.20380  0.0015
CloneC             209.945 137.10691   11  1.53125  0.1539
CloneD             293.840 124.08420   11  2.36807  0.0373
Distance            14.220   0.28607 2379 49.70873  0.0000
CloneB:distance     -0.748   0.44852 2379 -1.66660  0.0957
CloneC:distance     -0.140   0.45274 2379 -0.30977  0.7568
CloneD:distance      3.091   0.47002 2379  6.57573  0.0000
 
 
anova(lm1a)
               numDF denDF  F-value p-value
(Intercept)        1  2379 3847.011  <.0001
Clone              3    11    4.054  0.0363
distance           1  2379 7689.144  <.0001
Clone:distance     3  2379   22.468  <.0001

Incidentally, our assessment setup looks like this. The nice thing about having good technicians (Nigel made the tool frame), collaborating with other departments (Electrical Engineering, where Michael and students designed the electronics and software for signal processing) and other universities (University of Auckland, where Paul, who cored the trees and ran the machine, works) is that one gets involved in really cool projects.

What happens if we actually allow for an autoregressive process?

lm1b = lme(velocity ~ Clone*distance,
                      random = ~ 1|Rep/Tree/Core, data = cd,
                      correlation = corCAR1(value = 0.8, 
                      form = ~ distance | Rep/Tree/Core))
summary(lm1b)
 
Linear mixed-effects model fit by REML
 Data: ncd 
       AIC      BIC    logLik
  29843.45 29918.72 -14908.73
 
Random effects:
 Formula: ~1 | Rep
        (Intercept)
StdDev:     60.8209
 
 Formula: ~1 | Tree %in% Rep
        (Intercept)
StdDev:    125.3225
 
 Formula: ~1 | Core %in% Tree %in% Rep
        (Intercept) Residual
StdDev:   0.3674224 405.2818
 
Correlation Structure: Continuous AR(1)
 Formula: ~distance | Rep/Tree/Core 
 Parameter estimate(s):
      Phi 
0.9803545 
Fixed effects: velocity ~ Clone * distance 
                   Value Std.Error   DF   t-value p-value
(Intercept)     3297.517 127.98953 2379 25.763960  0.0000
CloneB           377.290 183.16795   11  2.059804  0.0639
CloneC           174.986 195.21327   11  0.896383  0.3892
CloneD           317.581 178.01710   11  1.783994  0.1020
distance          15.209   1.26593 2379 12.013979  0.0000
CloneB:distance    0.931   1.94652 2379  0.478342  0.6325
CloneC:distance   -0.678   2.00308 2379 -0.338629  0.7349
CloneD:distance    2.677   1.95269 2379  1.371135  0.1705
 
 
anova(lm1b)
               numDF denDF  F-value p-value
(Intercept)        1  2379 5676.580  <.0001
Clone              3    11    2.483  0.1152
distance           1  2379  492.957  <.0001
Clone:distance     3  2379    0.963  0.4094

In ASReml-R this would look like (for the same results, but many times faster):

as1a = asreml(velocity ~ Clone*distance, 
                         random = ~ Rep + Tree/Core, 
                         data = cd)
summary(as1a)
anova(as1a)
 
# I need to sort out my code for ar(1) and update to
# the latest version of asreml-r

Oops! What happened to the significance of Clone and its interaction with distance? The perils of ignoring the independence assumption. But wait, isn't an AR(1) process too simplistic to model the autocorrelation (as pointed out by D.J. Keenan when criticizing the IPCC's models and discussing Richard Muller's new BEST project models)? In this case, probably not, as we have a mostly increasing response, where we have a clue about the processes driving the change and far less noise than in climate data.

Could we improve upon this model? Sure! We could add heterogeneous variances, explore non-linearities, take into account the genetic relationship between the trees, run the whole thing in asreml (so it is faster), etc. Nevertheless, at this point you can get an idea of some of the issues (or should I call them niceties?) involved in the analysis of experiments.
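
For instance, a sketch of the heterogeneous variances idea in nlme (untested on these data, so treat it as a starting point rather than a recipe) would use varIdent() to allow a different residual variance for each clone:

# Same model as lm1b, but with clone-specific residual variances
lm1c = lme(velocity ~ Clone*distance,
           random = ~ 1|Rep/Tree/Core, data = cd,
           correlation = corCAR1(value = 0.8,
                                 form = ~ distance | Rep/Tree/Core),
           weights = varIdent(form = ~ 1 | Clone))
anova(lm1b, lm1c) # is the extra complexity worth it?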

Linear mixed models in R

A substantial part of my job has little to do with statistics; nevertheless, a large proportion of the statistical side of things relates to applications of linear mixed models. The bulk of my use of mixed models relates to the analysis of experiments that have a genetic structure.

A brief history of time

At the beginning (1992-1995) I would use SAS (first proc glm, later proc mixed), but things started getting painfully slow and limiting if one wanted to move into animal model BLUP. At that time (1995-1996), I moved to DFREML (by Karen Meyer, now replaced by WOMBAT) and AIREML (by Dave Johnson, now defunct; the program, I mean), which were designed for the analysis of animal breeding progeny trials, so it was a hassle to deal with experimental design features. At the end of 1996 (or was it the beginning of 1997?) I started playing with ASReml (programmed by Arthur Gilmour, mostly based on theoretical work by Robin Thompson and Brian Cullis). I was still using SAS for data preparation, but all my analyses went through ASReml (for which I wrote the cookbook), which was orders of magnitude faster than SAS (and could deal with much bigger problems). Around 1999, I started playing with R (prompted by a suggestion from Rod Ball), but I didn't really use R/S+ often enough until 2003. At the end of 2005 I started using OS X and quickly realized that using a virtual machine or dual booting was not really worth it, so in 2009 I dropped SAS and totally relied on R.

Options

As for many other problems, there are several packages in R that let you deal with linear mixed models from a frequentist (REML) point of view. I will only mention nlme (Non-Linear Mixed Effects), lme4 (Linear Mixed Effects) and asreml (average spatial reml). There are also several options for Bayesian approaches, but that will be another post.

nlme is the most mature of the three and comes by default with any R installation. It fits hierarchical linear mixed models and also non-linear mixed models, in both cases assuming a Gaussian response (my original explanation wasn't very clear; thanks to ucfagls in the comments for pointing this out). Its main advantages are, in my humble opinion, the ability to fit fairly complex hierarchical models using linear or non-linear approaches, and a good variety of variance and correlation structures. In my opinion, its main drawbacks are i- fitting cross-classified random factors is a pain, ii- it can be slow and may struggle with lots of data, iii- it does not deal with pedigrees by default and iv- it does not deal with multivariate data.

lme4 is a project led by Douglas Bates (one of the co-authors of nlme), looking at modernizing the code and making room for trying new ideas. On the positive side, it seems to be a bit faster than nlme and it deals a lot better with cross-classified random factors. Drawbacks: similar to nlme’s, but dropping point i- and adding that it doesn’t deal with covariance and correlation structures yet. It is possible to fit pedigrees using the pedigreemm package, but I find the combination a bit flimsy.
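
As an aside, the cross-classified case that is painful in nlme is straightforward in lme4; a toy example (simulated data, purely to show the syntax) would be:

library(lme4)
 
# Hypothetical crossed (not nested) random effects: every genotype
# is present in every site
dat = expand.grid(genotype = factor(1:10), site = factor(1:5))
dat$y = rnorm(nrow(dat)) # made-up response, just to have something to fit
 
m.crossed = lmer(y ~ 1 + (1|genotype) + (1|site), data = dat)
summary(m.crossed)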

ASReml-R is, unsurprisingly, an R package interface to ASReml. On the plus side it i- deals well with cross-classified random effects, ii- copes very well with pedigrees, iii- can work with fairly large datasets, iv- can run multivariate analyses and v- covers a large number of covariance and correlation structures. Main drawbacks are i- limited functionality for non-Gaussian responses, ii- it does not cover non-linear models and iii- it is non-free (as in beer and speech). The last drawback is relative; it is possible to freely use asreml for academic purposes (and there is also a version for developing countries). Besides researchers, the main users of ASReml/ASReml-R are breeding companies.

All three packages are available for Windows, Linux and OS X.

A (very) simple example

I will use a traditional dataset to show examples of the notation for the three packages: Yates’ variety and nitrogen split-plot experiment. We can get the dataset from the MASS package, after which it is a good idea to rename the variables using meaningful names. In addition, I will follow Bill Venables’s excellent advice and create additional variables for main plot and subplots, as it is confusing to use the same factor for two purposes (e.g. variety as treatment and main plot). Incidentally, if you haven’t read Bill’s post go and read it; it is one of the best explanations I have ever seen for a split-plot analysis.

library(MASS)
data(oats)
names(oats) = c('block', 'variety', 'nitrogen', 'yield')
oats$mainplot = oats$variety
oats$subplot = oats$nitrogen
 
summary(oats)
 block           variety     nitrogen      yield              mainplot 
 I  :12   Golden.rain:24   0.0cwt:18   Min.   : 53.0   Golden.rain:24  
 II :12   Marvellous :24   0.2cwt:18   1st Qu.: 86.0   Marvellous :24  
 III:12   Victory    :24   0.4cwt:18   Median :102.5   Victory    :24  
 IV :12                    0.6cwt:18   Mean   :104.0                   
 V  :12                                3rd Qu.:121.2                   
 VI :12                                Max.   :174.0                   
   subplot  
 0.0cwt:18  
 0.2cwt:18  
 0.4cwt:18  
 0.6cwt:18

The nlme code for this analysis is fairly simple: response on the left-hand side of the tilde, followed by the fixed effects (variety, nitrogen and their interaction). Then there is the specification of the random effects (which also uses a tilde) and the data set containing all the data. Notice that 1|block/mainplot is fitting block and mainplot within block. There is no reference to subplot as there is a single assessment for each subplot, which ends up being used at the residual level.

library(nlme)
m1.nlme = lme(yield ~ variety*nitrogen,
                      random = ~ 1|block/mainplot,
                      data = oats)
 
summary(m1.nlme)
 
Linear mixed-effects model fit by REML
 Data: oats 
       AIC      BIC    logLik
  559.0285 590.4437 -264.5143
 
Random effects:
 Formula: ~1 | block
        (Intercept)
StdDev:    14.64496
 
 Formula: ~1 | mainplot %in% block
        (Intercept) Residual
StdDev:    10.29863 13.30727
 
Fixed effects: yield ~ variety * nitrogen 
                                    Value Std.Error DF   t-value p-value
(Intercept)                      80.00000  9.106958 45  8.784492  0.0000
varietyMarvellous                 6.66667  9.715028 10  0.686222  0.5082
varietyVictory                   -8.50000  9.715028 10 -0.874933  0.4021
nitrogen0.2cwt                   18.50000  7.682957 45  2.407927  0.0202
nitrogen0.4cwt                   34.66667  7.682957 45  4.512152  0.0000
nitrogen0.6cwt                   44.83333  7.682957 45  5.835427  0.0000
varietyMarvellous:nitrogen0.2cwt  3.33333 10.865342 45  0.306786  0.7604
varietyVictory:nitrogen0.2cwt    -0.33333 10.865342 45 -0.030679  0.9757
varietyMarvellous:nitrogen0.4cwt -4.16667 10.865342 45 -0.383482  0.7032
varietyVictory:nitrogen0.4cwt     4.66667 10.865342 45  0.429500  0.6696
varietyMarvellous:nitrogen0.6cwt -4.66667 10.865342 45 -0.429500  0.6696
varietyVictory:nitrogen0.6cwt     2.16667 10.865342 45  0.199411  0.8428
 
anova(m1.nlme)
 
                 numDF denDF   F-value p-value
(Intercept)          1    45 245.14299  <.0001
variety              2    10   1.48534  0.2724
nitrogen             3    45  37.68562  <.0001
variety:nitrogen     6    45   0.30282  0.9322

The syntax for lme4 is not that dissimilar, with random effects specified using a (1|something here) syntax. One difference between the two packages is that nlme reports standard deviations instead of variances for the random effects.

library(lme4)
m1.lme4 = lmer(yield ~ variety*nitrogen + (1|block/mainplot),
                       data = oats)
 
summary(m1.lme4)
 
Linear mixed model fit by REML 
Formula: yield ~ variety * nitrogen + (1 | block/mainplot) 
   Data: oats 
 AIC   BIC logLik deviance REMLdev
 559 593.2 -264.5    595.9     529
Random effects:
 Groups         Name        Variance Std.Dev.
 mainplot:block (Intercept) 106.06   10.299  
 block          (Intercept) 214.48   14.645  
 Residual                   177.08   13.307  
Number of obs: 72, groups: mainplot:block, 18; block, 6
 
Fixed effects:
                                 Estimate Std. Error t value
(Intercept)                       80.0000     9.1064   8.785
varietyMarvellous                  6.6667     9.7150   0.686
varietyVictory                    -8.5000     9.7150  -0.875
nitrogen0.2cwt                    18.5000     7.6830   2.408
nitrogen0.4cwt                    34.6667     7.6830   4.512
nitrogen0.6cwt                    44.8333     7.6830   5.835
varietyMarvellous:nitrogen0.2cwt   3.3333    10.8653   0.307
varietyVictory:nitrogen0.2cwt     -0.3333    10.8653  -0.031
varietyMarvellous:nitrogen0.4cwt  -4.1667    10.8653  -0.383
varietyVictory:nitrogen0.4cwt      4.6667    10.8653   0.430
varietyMarvellous:nitrogen0.6cwt  -4.6667    10.8653  -0.430
varietyVictory:nitrogen0.6cwt      2.1667    10.8653   0.199
 
anova(m1.lme4)
 
Analysis of Variance Table
                 Df  Sum Sq Mean Sq F value
variety           2   526.1   263.0  1.4853
nitrogen          3 20020.5  6673.5 37.6856
variety:nitrogen  6   321.7    53.6  0.3028
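
As a quick check of the difference in parameterization mentioned above, squaring nlme's standard deviations recovers lme4's variance components:

14.64496^2 # 214.48, block
10.29863^2 # 106.06, mainplot within block
13.30727^2 # 177.08, residual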

For this type of problem, the notation for asreml is also very similar, particularly when compared to nlme.

library(asreml)
m1.asreml = asreml(yield ~ variety*nitrogen,
                           random = ~ block/mainplot,
                           data = oats)
 
summary(m1.asreml)$varcomp
 
                             gamma component std.error  z.ratio constraint
block!block.var          1.2111647  214.4771 168.83404 1.270343   Positive
block:mainplot!block.var 0.5989373  106.0618  67.87553 1.562593   Positive
R!variance               1.0000000  177.0833  37.33244 4.743416   Positive
 
wald(m1.asreml, denDF = 'algebraic')
 
$Wald
                 Df denDF    F.inc           Pr
(Intercept)       1     5 245.1000 1.931825e-05
variety           2    10   1.4850 2.723869e-01
nitrogen          3    45  37.6900 2.457710e-12
variety:nitrogen  6    45   0.3028 9.321988e-01
 
$stratumVariances
               df  Variance block block:mainplot R!variance
block           5 3175.0556    12              4          1
block:mainplot 10  601.3306     0              4          1
R!variance     45  177.0833     0              0          1

In this simple example we pretty much get the same results, independently of the package used (which is certainly comforting). I will soon cover another simple model, but with a much larger dataset, to highlight some performance differences between the packages.

Linear regression with correlated data

I started following the debate on differential minimum wage for youth (15-19 year old) and adults in New Zealand. Eric Crampton has written a nice series of blog posts, making the data from Statistics New Zealand available. I will use the nzunemployment.csv data file (with quarterly data from March 1986 to June 2011) and show an example of multiple linear regression with autocorrelated residuals in R.

A first approach could be to ignore autocorrelation and fit a linear model that attempts to predict youth unemployment with two explanatory variables: adult unemployment (continuous) and minimum wage rules (categorical: equal or different). This can be done using:

setwd('~/Dropbox/quantumforest')
un = read.csv('nzunemployment.csv', header = TRUE)
 
# Create factor for minimum wage, which was different for youth 
# and adults before quarter 90 (June 2008)
un$minwage = factor(ifelse(un$q < 90, 'Different', 'Equal'))
 
mod1 = lm(youth ~ adult*minwage, data = un)
summary(mod1)
 
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         8.15314    0.43328  18.817  < 2e-16 ***
adult               1.53334    0.07506  20.428  < 2e-16 ***
minwageEqual       -5.69192    2.19356  -2.595   0.0109 *  
adult:minwageEqual  2.85518    0.46197   6.180 1.47e-08 ***
 
Residual standard error: 1.447 on 98 degrees of freedom
Multiple R-squared: 0.8816,	Adjusted R-squared: 0.878 
F-statistic: 243.3 on 3 and 98 DF,  p-value: < 2.2e-16

Remember that adult*minwage is expanded to adult + minwage + adult:minwage. We can make the coefficients easier to understand if we center adult unemployment on its mean. Notice that we get the same slope, Adj-R2, etc., but now the intercept corresponds to youth unemployment at the average adult unemployment rate, under the old minimum wage rules. All additional analyses will use the centered version.

un$cadult = with(un, adult - mean(adult))
mod2 = lm(youth ~ cadult*minwage, data = un)
summary(mod2)
 
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         16.28209    0.15352  106.06  < 2e-16 ***
cadult               1.53334    0.07506   20.43  < 2e-16 ***
minwageEqual         9.44472    0.52629   17.95  < 2e-16 ***
cadult:minwageEqual  2.85518    0.46197    6.18 1.47e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
 
Residual standard error: 1.447 on 98 degrees of freedom
Multiple R-squared: 0.8816,	Adjusted R-squared: 0.878 
F-statistic: 243.3 on 3 and 98 DF,  p-value: < 2.2e-16 
 
plot(mod2)    # Plots residuals for the model fit
acf(mod2$res) # Plots autocorrelation of the residuals

In the centered version, the intercept corresponds to youth unemployment when the adult unemployment rate is at its mean (5.4 in this dataset). The coefficient of minwageEqual corresponds to the increase in youth unemployment (9.44 percentage points) when the law moved to an equal minimum wage for youth and adults. Notice that the slopes did not change at all.

I will use the function gls() from the nlme package (which comes by default with all R installations) to take into account the serial correlation. First we can fit a model equivalent to mod2, just to check that we get the same results.

library(nlme)
mod3 = gls(youth ~ cadult*minwage, data = un)
summary(mod3)
 
Generalized least squares fit by REML
  Model: youth ~ cadult * minwage 
  Data: un 
       AIC      BIC    logLik
  375.7722 388.6971 -182.8861
 
Coefficients:
                        Value Std.Error   t-value p-value
(Intercept)         16.282089 0.1535237 106.05585       0
cadult               1.533341 0.0750595  20.42834       0
minwageEqual         9.444719 0.5262926  17.94576       0
cadult:minwageEqual  2.855184 0.4619725   6.18042       0
 
 Correlation: 
                    (Intr) cadult mnwgEq
cadult              -0.048              
minwageEqual        -0.292  0.014       
cadult:minwageEqual  0.008 -0.162  0.568
 
Standardized residuals:
        Min          Q1         Med          Q3         Max 
-2.96256631 -0.53975848 -0.02071559  0.63499262  2.35900240 
 
Residual standard error: 1.446696 
Degrees of freedom: 102 total; 98 residual

Yes, they are identical. Notice that the model fitting is done using Restricted Maximum Likelihood (REML). Now we can add an autoregressive process of order 1 for the residuals and compare the two models:

mod4 = gls(youth ~ cadult*minwage, 
           correlation = corAR1(form=~1), data = un)
summary(mod4)
 
 
Generalized least squares fit by REML
  Model: youth ~ cadult * minwage 
  Data: un 
       AIC      BIC    logLik
  353.0064 368.5162 -170.5032
 
Correlation Structure: AR(1)
 Formula: ~1 
 Parameter estimate(s):
      Phi 
0.5012431 
 
Coefficients:
                        Value Std.Error  t-value p-value
(Intercept)         16.328637 0.2733468 59.73598       0
cadult               1.522192 0.1274453 11.94389       0
minwageEqual         9.082626 0.8613543 10.54459       0
cadult:minwageEqual  2.545011 0.5771780  4.40940       0
 
 Correlation: 
                    (Intr) cadult mnwgEq
cadult              -0.020              
minwageEqual        -0.318  0.007       
cadult:minwageEqual -0.067 -0.155  0.583
 
Standardized residuals:
        Min          Q1         Med          Q3         Max 
-2.89233359 -0.55460580 -0.02419759  0.55449166  2.29571080 
 
Residual standard error: 1.5052 
Degrees of freedom: 102 total; 98 residual
 
anova(mod3, mod4)
     Model df      AIC      BIC    logLik   Test L.Ratio p-value
mod3     1  5 375.7722 388.6971 -182.8861                       
mod4     2  6 353.0064 368.5162 -170.5032 1 vs 2 24.7658  <.0001

There is a substantial improvement in the log-likelihood (from -182.9 to -170.5). We can take into account the additional parameter (the autocorrelation) that we are fitting by comparing AIC, which improved from 375.77 (-2*(-182.8861) + 2*5) to 353.01 (-2*(-170.5032) + 2*6). Remember that AIC is -2*logLikelihood + 2*number of parameters.
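
A quick way of checking that arithmetic in R:

AIC(mod3, mod4)       # 375.77 and 353.01
-2*logLik(mod4) + 2*6 # reconstructing the AIC for mod4 by hand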

The file unemployment.txt contains the R code used in this post (I didn’t use the .R extension as WordPress complains).