Evolving notes, images and sounds by Luis Apiolaza

Linear mixed models in R

A substantial part of my job has little to do with statistics; nevertheless, a large proportion of the statistical side of things relates to applications of linear mixed models. The bulk of my use of mixed models relates to the analysis of experiments that have a genetic structure.

A brief history of time

At the beginning (1992-1995) I would use SAS (first proc glm, later proc mixed), but things started getting painfully slow and limiting if one wanted to move into animal model BLUP. At that time (1995-1996), I moved to DFREML (by Karen Meyer, now replaced by WOMBAT) and AIREML (by Dave Johnson, now defunct; the program, I mean), which were designed for the analysis of animal breeding progeny trials, so it was a hassle to deal with experimental design features. At the end of 1996 (or was it the beginning of 1997?) I started playing with ASReml (programmed by Arthur Gilmour, mostly based on theoretical work by Robin Thompson and Brian Cullis). I was still using SAS for data preparation, but all my analyses went through ASReml (for which I wrote the cookbook), which was orders of magnitude faster than SAS (and could deal with much bigger problems). Around 1999, I started playing with R (prompted by a suggestion from Rod Ball), but I didn't really use R/S+ often enough until 2003. At the end of 2005 I started using OS X and quickly realized that using a virtual machine or dual booting was not really worth it, so I dropped SAS and have relied totally on R since 2009.

Options

As with many other problems, there are several packages in R that let you deal with linear mixed models from a frequentist (REML) point of view. I will only mention nlme (Non-Linear Mixed Effects), lme4 (Linear Mixed Effects) and asreml (average spatial reml). There are also several options for Bayesian approaches, but those will be the subject of another post.

nlme is the most mature of the three and comes by default with any R installation. It fits hierarchical linear mixed models and also non-linear mixed models, in both cases assuming a Gaussian response (my original explanation wasn't very clear; thanks to ucfagls below for pointing this out). Its main advantages are, in my humble opinion, the ability to fit fairly complex hierarchical models using linear or non-linear approaches and a good variety of variance and correlation structures. In my opinion, its main drawbacks are i- fitting cross-classified random factors is a pain, ii- it can be slow and may struggle with lots of data, iii- it does not deal with pedigrees by default and iv- it does not deal with multivariate data.
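Just to give a flavour of those variance and correlation structures, here is a minimal sketch using the Orthodont dataset that ships with nlme; the particular structures (AR(1) within subject, a separate residual variance per sex) are chosen purely for illustration:

library(nlme)

# Random intercept per subject, AR(1) correlation for the repeated
# measurements within each subject and a different residual variance per sex
m.orth <- lme(distance ~ age * Sex,
              random = ~ 1 | Subject,
              correlation = corAR1(form = ~ 1 | Subject),
              weights = varIdent(form = ~ 1 | Sex),
              data = Orthodont)

summary(m.orth)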

lme4 is a project led by Douglas Bates (one of the co-authors of nlme), looking at modernizing the code and making room for trying new ideas. On the positive side, it seems to be a bit faster than nlme, it fits generalized (non-Gaussian) mixed models, and it deals a lot better with cross-classified random factors. Drawbacks: similar to nlme's, but dropping point i- and adding that it doesn't deal with covariance and correlation structures yet. It is possible to fit pedigrees using the pedigreemm package, but I find the combination a bit flimsy.
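As an illustration of crossed random factors, here is a minimal sketch using the Penicillin dataset that comes with lme4, where plates and samples are fully crossed:

library(lme4)

# Plates and samples are crossed rather than nested,
# so each factor gets its own (1 | factor) term
m.pen <- lmer(diameter ~ 1 + (1 | plate) + (1 | sample),
              data = Penicillin)

summary(m.pen)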

ASReml-R is, unsurprisingly, an R package interface to ASReml. On the plus side it i- deals well with cross-classified random effects, ii- copes very well with pedigrees, iii- can work with fairly large datasets, iv- can run multivariate analyses and v- covers a large number of covariance and correlation structures. Main drawbacks are i- limited functionality for non-Gaussian responses, ii- it does not cover non-linear models and iii- it is non-free (as in beer and speech). The last drawback is relative; it is possible to freely use asreml for academic purposes (and there is also a version for developing countries). Besides researchers, the main users of ASReml/ASReml-R are breeding companies.
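As a taster of the multivariate side, a bivariate analysis in asreml-R 3 looked roughly like the sketch below; the dataset and variable names are made up for illustration, and newer versions of asreml-R have renamed some of the arguments (rcov, for example):

library(asreml)

# Hypothetical example: two traits (height and diameter) measured in the
# same trial, with an unstructured covariance matrix across traits both
# for blocks and at the residual level
m.biv <- asreml(cbind(height, diameter) ~ trait,
                random = ~ us(trait):block,
                rcov = ~ units:us(trait),
                data = trial.data)

summary(m.biv)$varcomp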

All three packages are available for Windows, Linux and OS X.

A (very) simple example

I will use a traditional dataset to show examples of the notation for the three packages: Yates’ variety and nitrogen split-plot experiment. We can get the dataset from the MASS package, after which it is a good idea to rename the variables using meaningful names. In addition, I will follow Bill Venables’s excellent advice and create additional variables for main plot and subplots, as it is confusing to use the same factor for two purposes (e.g. variety as treatment and main plot). Incidentally, if you haven’t read Bill’s post go and read it; it is one of the best explanations I have ever seen for a split-plot analysis.

library(MASS)
data(oats)
names(oats) <- c('block', 'variety', 'nitrogen', 'yield')
oats$mainplot <- oats$variety
oats$subplot <- oats$nitrogen

summary(oats)
block           variety     nitrogen      yield              mainplot
I  :12   Golden.rain:24   0.0cwt:18   Min.   : 53.0   Golden.rain:24
II :12   Marvellous :24   0.2cwt:18   1st Qu.: 86.0   Marvellous :24
III:12   Victory    :24   0.4cwt:18   Median :102.5   Victory    :24
IV :12                    0.6cwt:18   Mean   :104.0
V  :12                                3rd Qu.:121.2
VI :12                                Max.   :174.0
subplot
0.0cwt:18
0.2cwt:18
0.4cwt:18
0.6cwt:18

The nlme code for this analysis is fairly simple: response on the left-hand side of the tilde, followed by the fixed effects (variety, nitrogen and their interaction). Then there is the specification of the random effects (which also uses a tilde) and the data set containing all the data. Notice that 1|block/mainplot is fitting block and mainplot within block. There is no reference to subplot as there is a single assessment for each subplot, which ends up being used at the residual level.

library(nlme)
m1.nlme <- lme(yield ~ variety*nitrogen,
               random = ~ 1|block/mainplot,
               data = oats)

summary(m1.nlme)

Linear mixed-effects model fit by REML
Data: oats
AIC      BIC    logLik
559.0285 590.4437 -264.5143

Random effects:
Formula: ~1 | block
(Intercept)
StdDev:    14.64496

Formula: ~1 | mainplot %in% block
(Intercept) Residual
StdDev:    10.29863 13.30727

Fixed effects: yield ~ variety * nitrogen
Value Std.Error DF   t-value p-value
(Intercept)                      80.00000  9.106958 45  8.784492  0.0000
varietyMarvellous                 6.66667  9.715028 10  0.686222  0.5082
varietyVictory                   -8.50000  9.715028 10 -0.874933  0.4021
nitrogen0.2cwt                   18.50000  7.682957 45  2.407927  0.0202
nitrogen0.4cwt                   34.66667  7.682957 45  4.512152  0.0000
nitrogen0.6cwt                   44.83333  7.682957 45  5.835427  0.0000
varietyMarvellous:nitrogen0.2cwt  3.33333 10.865342 45  0.306786  0.7604
varietyVictory:nitrogen0.2cwt    -0.33333 10.865342 45 -0.030679  0.9757
varietyMarvellous:nitrogen0.4cwt -4.16667 10.865342 45 -0.383482  0.7032
varietyVictory:nitrogen0.4cwt     4.66667 10.865342 45  0.429500  0.6696
varietyMarvellous:nitrogen0.6cwt -4.66667 10.865342 45 -0.429500  0.6696
varietyVictory:nitrogen0.6cwt     2.16667 10.865342 45  0.199411  0.8428

anova(m1.nlme)

numDF denDF   F-value p-value
(Intercept)          1    45 245.14299  <.0001
variety              2    10   1.48534  0.2724
nitrogen             3    45  37.68562  <.0001
variety:nitrogen     6    45   0.30282  0.9322

The syntax for lme4 is not that dissimilar, with random effects specified using a (1|something here) syntax. One difference between the two packages is that nlme reports standard deviations instead of variances for the random effects.

library(lme4)
m1.lme4 <- lmer(yield ~ variety*nitrogen + (1|block/mainplot),
                data = oats)

summary(m1.lme4)

Linear mixed model fit by REML
Formula: yield ~ variety * nitrogen + (1 | block/mainplot)
Data: oats
AIC   BIC logLik deviance REMLdev
559 593.2 -264.5    595.9     529
Random effects:
Groups         Name        Variance Std.Dev.
mainplot:block (Intercept) 106.06   10.299
block          (Intercept) 214.48   14.645
Residual                   177.08   13.307
Number of obs: 72, groups: mainplot:block, 18; block, 6

Fixed effects:
Estimate Std. Error t value
(Intercept)                       80.0000     9.1064   8.785
varietyMarvellous                  6.6667     9.7150   0.686
varietyVictory                    -8.5000     9.7150  -0.875
nitrogen0.2cwt                    18.5000     7.6830   2.408
nitrogen0.4cwt                    34.6667     7.6830   4.512
nitrogen0.6cwt                    44.8333     7.6830   5.835
varietyMarvellous:nitrogen0.2cwt   3.3333    10.8653   0.307
varietyVictory:nitrogen0.2cwt     -0.3333    10.8653  -0.031
varietyMarvellous:nitrogen0.4cwt  -4.1667    10.8653  -0.383
varietyVictory:nitrogen0.4cwt      4.6667    10.8653   0.430
varietyMarvellous:nitrogen0.6cwt  -4.6667    10.8653  -0.430
varietyVictory:nitrogen0.6cwt      2.1667    10.8653   0.199

anova(m1.lme4)

Analysis of Variance Table
Df  Sum Sq Mean Sq F value
variety           2   526.1   263.0  1.4853
nitrogen          3 20020.5  6673.5 37.6856
variety:nitrogen  6   321.7    53.6  0.3028

For this type of problem, the notation for asreml is also very similar, particularly when compared to nlme.

library(asreml)
m1.asreml <- asreml(yield ~ variety*nitrogen,
                    random = ~ block/mainplot,
                    data = oats)

summary(m1.asreml)$varcomp

gamma component std.error  z.ratio constraint
block!block.var          1.2111647  214.4771 168.83404 1.270343   Positive
block:mainplot!block.var 0.5989373  106.0618  67.87553 1.562593   Positive
R!variance               1.0000000  177.0833  37.33244 4.743416   Positive

wald(m1.asreml, denDF = 'algebraic')

$Wald
Df denDF    F.inc           Pr
(Intercept)       1     5 245.1000 1.931825e-05
variety           2    10   1.4850 2.723869e-01
nitrogen          3    45  37.6900 2.457710e-12
variety:nitrogen  6    45   0.3028 9.321988e-01

$stratumVariances
df  Variance block block:mainplot R!variance
block           5 3175.0556    12              4          1
block:mainplot 10  601.3306     0              4          1
R!variance     45  177.0833     0              0          1

In this simple example one pretty much gets the same results, independently of the package used (which is certainly comforting). I will soon cover another simple model, but with a much larger dataset, to highlight some performance differences between the packages.
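If you want to put the estimates side by side, the variance components can be extracted from each of the fitted objects above:

# nlme and lme4 both provide a VarCorr method, while asreml keeps the
# variance components in the model summary
VarCorr(m1.nlme)
VarCorr(m1.lme4)
summary(m1.asreml)$varcomp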

20 Comments

  1. Oldi

Hi there, this looks like it will become a very interesting blog! Looking forward to it!
    cheers

    • Luis

      Thanks and I hope you are right!

      • Gulshan Madhur

Hi Luis, I am pretty new to mixed models. I would really appreciate your help in understanding how to interpret the results here, or a link to read. 🙂

        Looking forward to your help!

  2. ucfagls

A small clarification: nlme doesn't fit generalized LMMs, just linear ones. It is lme4 that implements generalized linear and non-linear MMs (as well as linear ones).

    • Luis

      Thanks for the clarification. nlme deals only with Gaussian non-linear mixed models; I'll update the post.

  3. Ben Bolker

    I was going to make the same comment as ucfagls. Nice post, though, looking forward to future posts.

  4. jianfeng.mao

MCMCglmm may be another choice. It is a Bayesian animal modeler; you may want to turn to it. The R-INLA package is also attractive. Both of these R packages are free.

    • Luis

Hi Jianfeng, you are right, and I will cover them in more detail in future posts. I used MCMCglmm for a multivariate animal model when working on this paper, and I will soon be submitting another paper using INLA. Both packages are really good, but I am still getting my head around some of their notation.

  5. LarsR

An alternative to the packages already mentioned is the hglm package on CRAN (developed by me and collaborators). There it is possible to give arbitrary design matrices (both for fixed and random effects) as inputs to the hglm function, which allows modelling of correlated random effects. In our R Journal publication (The R Journal 2:20-28) we have an example of an animal model being fitted (see Example 4 in the paper). Note that arbitrary correlation structures can be fitted, allowing for e.g. modelling of IBD matrices in QTL analysis and genomic relationship matrices as well.

    • Luis

      Hi Lars, Thanks for the pointer to hglm and its publication in the R-Journal (PDF). I hope to check it when we finish our semester (I'm trying to keep my head above the water at the moment).

      By the way, it is hard to beat the name for your PhD topic; my childhood dream come true! I visited SLU in Uppsala in Dec 2009, cool place.

      • LarsR

Well Luis, I'm pretty proud of being one of the very few who have been doing research on animal breeding in reindeer. Northern Sweden can be a pretty cool place. Even cooler than Uppsala.
Another package, recently available on CRAN, is dhglm, developed by Noh & Lee. There you can fit GLMs with random effects in both the mean and the dispersion parts, which has some interesting applications in genetics (see for instance my recent papers in this field http://www.gsejournal.org/content/pdf/1297-9686-4… and http://www.genetics.org/content/188/2/435.full). In the dhglm package you can also fit arbitrary correlation structures.

  6. Antonio Pedro

I am very very late to this post but I have a question, if somewhat off topic: how did you collect the output of several lme objects in a useful format for model comparison and, later, for a LaTeX table? I've been struggling with that recently. Best, Antonio.

    • Luis

      You can create a list in R and assign the results of the analyses as one more element of the list (I normally save only the variance components and log-likelihood). Something like this:

results <- list()

for (i in 1:n.analyses) {    # n.analyses: however many analyses you are running
    # ... fit the i-th model here, producing (say) den.uni ...
    results[[i]] <- summary(den.uni)$varcomp    # in the case of asreml
}

You can even pre-allocate the list to whatever size to save some time, but in my experience when running this type of analyses the time to grow the list is minimal compared to the time to run a single analysis.
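For the record, pre-allocating would be something along the lines of (n.analyses as in the loop above):

results <- vector(mode = 'list', length = n.analyses)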

Concerning LaTeX tables, I haven't done that in R.

  7. kas

Could anyone forward syntax for A matrix generation (y = Xb + Zu + e)? A is the pedigree information.

    • Luis

      Search for pedigree in the search box and you’ll see examples on how to specify the pedigree for different packages that can be used for genetic analysis.

  8. Achen

Thanks Luis, a very suitable post for a beginner. However, the link to “Bill Venables’s post on linear mixed model” is broken. Would you mind updating it?

    • Luis

Thanks for pointing out the broken link. It referred to a 16-year-old email in the S-PLUS mailing list, which is not active anymore. Fortunately there are still a couple of copies circulating on the internet, so I have updated the link to one of the surviving copies.

  9. Gabor

Technical comment: some of the linked content in the text points to obvious spam websites. Not sure if this is a problem with my computer, or whether the blog contents were compromised.

E.g. in the text “(first proc glm, this site later proc mixed)”, the “this site” link.

    Maybe check the linked content?

    • Luis

      Thanks for that Gabor. A while ago the site was hacked and sprinkled with links to viagra and similar products. I’m still going over the older posts and removing links.

  10. Michail Tsagris

I would like to add that I have implemented random-intercepts (compound symmetry) linear regression, for continuous predictors only, in the R package Rfast. It is 30+ times faster than lme4, yet it is not at its best. We will expand it to other covariance structures and translate it to C++ at some point. There are some more variance-components-related functions as well, if you want to take a look.
