Covariance structures

In most mixed linear model packages (e.g. asreml, lme4, nlme, etc.) one needs to specify only the model equation (the bit that looks like y ~ factors...) when fitting simple models. We explicitly say nothing about the covariances that complete the model specification. This is because most linear mixed model packages assume that, in the absence of any additional information, the covariance structure is the product of a scalar (a variance component) and a design matrix. For example, the residual covariance matrix in simple models is \(R = I \sigma_e^2\), the additive genetic variance matrix is \(G = A \sigma_a^2\) (where A is the numerator relationship matrix), and the covariance matrix for a random effect f with incidence matrix Z is \(Z Z' \sigma_f^2\).

However, there are several situations when analyses require a more complex covariance structure, usually a direct sum or a Kronecker product of two or more matrices. For example, an analysis of data from several sites might consider different error variances for each site, that is \(R = \oplus_i R_i\), where \(\oplus\) represents a direct sum and \(R_i\) is the residual matrix for site i.

Another example of a more complex covariance structure is a multivariate analysis in a single site (so the same individual is assessed for two or more traits), where both the residual and additive genetic covariance matrices are constructed as the Kronecker product of two matrices. For example, \(R = I \otimes R_0\), where I is an identity matrix of size equal to the number of observations, \(\otimes\) is the Kronecker product (not to be confused with a plain matrix multiplication) and \(R_0\) is the error covariance matrix for the traits involved in the analysis. Similarly, \(G = A \otimes G_0\), where all the matrices are as previously defined and \(G_0\) is the additive covariance matrix for the traits.

Some structures are easier to understand (at least for me) if we express a covariance matrix (M) as the product of a correlation matrix (C) pre- and post-multiplied by a diagonal matrix (D) containing standard deviations for each of the traits (M = D C D). That is:

\(
M = \left [
\begin{array}{cccc}
v_{11}& c_{12}& c_{13}& c_{14} \\
c_{21}& v_{22}& c_{23}& c_{24} \\
c_{31}& c_{32}& v_{33}& c_{34} \\
c_{41}& c_{42}& c_{43}& v_{44}
\end{array}
\right ]\) \(
C = \left [
\begin{array}{cccc}
1& r_{12}& r_{13}& r_{14} \\
r_{21}& 1& r_{23}& r_{24} \\
r_{31}& r_{32}& 1& r_{34} \\
r_{41}& r_{42}& r_{43}& 1
\end{array}
\right ]\) \(
D = \left [
\begin{array}{cccc}
s_{11}& 0& 0& 0 \\
0& s_{22}& 0& 0 \\
0& 0& s_{33}& 0 \\
0& 0& 0& s_{44}
\end{array}
\right ]\)

where the v are variances, the r correlations and the s standard deviations.
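As a quick numerical sketch (the standard deviations and correlation matrix below are made up purely for illustration), we can build M from C and D in R:

# Hypothetical standard deviations for three traits
s = c(2, 5, 10)

# Hypothetical (positive definite) correlation matrix
C = matrix(c(1.0, 0.5, 0.3,
             0.5, 1.0, 0.7,
             0.3, 0.7, 1.0), nrow = 3)

# Covariance matrix as M = D C D
D = diag(s)
M = D %*% C %*% D

cov2cor(M)  # recovers C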

If we do not impose any restriction on M, apart from being positive definite (p.d.), we are talking about an unstructured matrix (us() in asreml-R parlance). Thus, M or C can take any value (as long as it is p.d.), as is usual when analyzing multiple-trait problems.

There are cases when the order of assessment or the spatial location of the experimental units create patterns of variation, which are reflected by the covariance matrix. For example, the breeding value of an individual i observed at time j (\(a_{ij}\)) is a function of genes involved in expression at time j − k (\(a_{i,j-k}\)), plus the effect of genes acting in the new measurement (\(\alpha_j\)), which are considered independent of the past measurement: \(a_{ij} = \rho_{jk} a_{i,j-k} + \alpha_j\), where \(\rho_{jk}\) is the additive genetic correlation between measures j and k.

Rather than using a different correlation for each pair of ages, it is possible to postulate mechanisms which model the correlations. For example, an autoregressive model (ar() in asreml-R lingo), where the correlation between measurements j and k is \(r^{|j-k|}\). In this model \(M = D C_{AR} D\), where \(C_{AR}\) (for equally spaced assessments) is:

\(
C_{AR} = \left [
\begin{array}{cccc}
1 & r^{|t_2-t_1|} & \ldots & r^{|t_m-t_1|}\\
r^{|t_2-t_1|} & 1 & \ldots & r^{|t_m-t_2|}\\
\vdots & \vdots & \ddots & \vdots \\
r^{|t_m-t_1|} & r^{|t_m-t_2|} & \ldots & 1
\end{array} \right ]\)

Assuming three different autocorrelation coefficients (0.95 solid line, 0.90 dashed line and 0.85 dotted line) we can get very different patterns with a few extra units of lag, as shown in the following graph:
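We can reproduce the gist of that graph with a few lines of R (a sketch, not the original code behind the figure):

# AR(1) correlation decay with lag, for three autocorrelation
# coefficients (0.95 solid, 0.90 dashed, 0.85 dotted)
lag = 0:30
plot(lag, 0.95^lag, type = 'l', lty = 1, ylim = c(0, 1),
     xlab = 'Lag', ylab = 'Correlation')
lines(lag, 0.90^lag, lty = 2)
lines(lag, 0.85^lag, lty = 3)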

A model including this structure will certainly be more parsimonious (economical in terms of number of parameters) than one using an unstructured approach. Looking at the previous pattern it is a lot easier to understand why they are called ‘structures’. A similar situation is considered in spatial analysis, where the ‘independent errors’ assumption of typical analyses is relaxed. A common spatial model will consider the presence of autocorrelated residuals in both directions (rows and columns). Here the level of autocorrelation will depend on the distance between trees rather than on time. We can get an idea of what separable processes look like using this code:

# Separable row-col autoregressive process: correlation of each
# cell of the grid with the corner cell (1, 1)
car2 = function(dim, rhor, rhoc) {
    M = diag(dim)  # dim x dim matrix, used only to get row and column indices
    rhor^(row(M) - 1) * rhoc^(col(M) - 1)
}
 
library(lattice)
levelplot(car2(20, 0.95, 0.85))

This correlation matrix can then be multiplied by a spatial residual variance to obtain the covariance and we can add up a spatially independent residual variance.
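As a sketch of that idea (the variances below are made up for illustration), sticking with the correlation matrix produced by car2() above:

# Hypothetical spatial and spatially independent residual variances
sigma2_spatial = 2.5
sigma2_indep   = 1.0

# Covariance = spatial variance * correlation + independent variance * identity
V = sigma2_spatial * car2(20, 0.95, 0.85) + sigma2_indep * diag(20)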

Much more detail on code notation for covariance structures can be found, for example, in the ASReml-R User Guide (PDF, chapter 4), for nlme in Pinheiro and Bates’s Mixed-effects models in S and S-plus (link to Google Books, chapter 5.3) and in Bates’s draft book for lme4 in chapter 4.

Longitudinal analysis: autocorrelation makes a difference

Back to posting after a long weekend and more than enough rugby coverage to last a few years. Anyway, back to linear models, where we usually assume normality, independence and homogeneous variances. In most statistics courses we live in a fantasy world where we meet all of the assumptions, but in real life—and trees and forests are no exceptions—there are plenty of occasions when we can badly deviate from one or more assumptions. In this post I present a simple example, where we have a number of clones (genetically identical copies of a tree), which had between 2 and 4 cores extracted, and each core was assessed for acoustic velocity (we care about it because it is inversely related to longitudinal shrinkage and its square is proportional to wood stiffness) every two millimeters. This small dataset is only a pilot for a much larger study currently underway.

At this stage I will ignore any relationship between the clones and focus on the core assessments. Let’s think for a moment; we have five replicates (which restrict the randomization) and four clones (A, B, C and D). We have (mostly) 2 to 4 cores (cylindrical pieces of wood covering from tree pith to cambium) within each tree, and we have longitudinal assessments for each core. I would have the expectation that, at least, successive assessments for each core are not independent; that is, assessments that are closer together are more similar than those that are farther apart. What does the data look like? The trellis plot shows trees using a Clone:Rep notation:

library(lattice)
xyplot(velocity ~ distance | Tree, group=Core, 
                  data=cd, type=c('l'))

Incidentally, cores from Clone C in replicate four were damaged, so I dropped them from this example (real life is unbalanced as well!). Just in case, distance is in mm from the tree pith and velocity in m/s. Now we will fit an analysis that totally ignores any relationship between the successive assessments:

library(nlme)
lm1a = lme(velocity ~ Clone*distance,
                 random = ~ 1|Rep/Tree/Core, data=cd)
summary(lm1a)
 
Linear mixed-effects model fit by REML
 Data: cd 
      AIC      BIC   logLik
  34456.8 34526.28 -17216.4
 
Random effects:
 Formula: ~1 | Rep
        (Intercept)
StdDev:    120.3721
 
 Formula: ~1 | Tree %in% Rep
        (Intercept)
StdDev:    77.69231
 
 Formula: ~1 | Core %in% Tree %in% Rep
        (Intercept) Residual
StdDev:    264.6254 285.9208
 
Fixed effects: velocity ~ Clone * distance 
                     Value Std.Error   DF  t-value p-value
(Intercept)       3274.654 102.66291 2379 31.89715  0.0000
CloneB             537.829 127.93871   11  4.20380  0.0015
CloneC             209.945 137.10691   11  1.53125  0.1539
CloneD             293.840 124.08420   11  2.36807  0.0373
distance            14.220   0.28607 2379 49.70873  0.0000
CloneB:distance     -0.748   0.44852 2379 -1.66660  0.0957
CloneC:distance     -0.140   0.45274 2379 -0.30977  0.7568
CloneD:distance      3.091   0.47002 2379  6.57573  0.0000
 
 
anova(lm1a)
               numDF denDF  F-value p-value
(Intercept)        1  2379 3847.011  <.0001
Clone              3    11    4.054  0.0363
distance           1  2379 7689.144  <.0001
Clone:distance     3  2379   22.468  <.0001

Incidentally, our assessment setup looks like this. The nice thing of having good technicians (Nigel made the tool frame), collaborating with other departments (Electrical Engineering, Michael and students designed the electronics and software for signal processing) and other universities (Uni of Auckland, where Paul—who cored the trees and ran the machine—works) is that one gets involved in really cool projects.

What happens if we actually allow for an autoregressive process?

lm1b = lme(velocity ~ Clone*distance,
                      random = ~ 1|Rep/Tree/Core, data = cd,
                      correlation = corCAR1(value = 0.8, 
                      form = ~ distance | Rep/Tree/Core))
summary(lm1b)
 
Linear mixed-effects model fit by REML
 Data: cd 
       AIC      BIC    logLik
  29843.45 29918.72 -14908.73
 
Random effects:
 Formula: ~1 | Rep
        (Intercept)
StdDev:     60.8209
 
 Formula: ~1 | Tree %in% Rep
        (Intercept)
StdDev:    125.3225
 
 Formula: ~1 | Core %in% Tree %in% Rep
        (Intercept) Residual
StdDev:   0.3674224 405.2818
 
Correlation Structure: Continuous AR(1)
 Formula: ~distance | Rep/Tree/Core 
 Parameter estimate(s):
      Phi 
0.9803545 
Fixed effects: velocity ~ Clone * distance 
                   Value Std.Error   DF   t-value p-value
(Intercept)     3297.517 127.98953 2379 25.763960  0.0000
CloneB           377.290 183.16795   11  2.059804  0.0639
CloneC           174.986 195.21327   11  0.896383  0.3892
CloneD           317.581 178.01710   11  1.783994  0.1020
distance          15.209   1.26593 2379 12.013979  0.0000
CloneB:distance    0.931   1.94652 2379  0.478342  0.6325
CloneC:distance   -0.678   2.00308 2379 -0.338629  0.7349
CloneD:distance    2.677   1.95269 2379  1.371135  0.1705
 
 
anova(lm1b)
               numDF denDF  F-value p-value
(Intercept)        1  2379 5676.580  <.0001
Clone              3    11    2.483  0.1152
distance           1  2379  492.957  <.0001
Clone:distance     3  2379    0.963  0.4094
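Because both models share the same fixed effects and were fitted with REML, we can also compare them directly (a quick check, not part of the original output):

# Likelihood ratio test / AIC comparison between the independent-errors
# model and the continuous AR(1) model
anova(lm1a, lm1b)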

In ASReml-R this would look like (for the same results, but many times faster):

as1a = asreml(velocity ~ Clone*distance, 
                         random = ~ Rep + Tree/Core, 
                         data = cd)
summary(as1a)
anova(as1a)
 
# I need to sort out my code for ar(1) and update to
# the latest version of asreml-r

Oops! What happened to the significance of Clone and its interaction with distance? The perils of ignoring the independence assumption. But, wait, isn’t an AR(1) process too simplistic to model the autocorrelation (as pointed out by D.J. Keenan when criticizing IPCC’s models and discussing Richard Muller’s new BEST project models)? In this case, probably not, as we have a mostly increasing response, where we have a clue about the processes driving the change, and far less noise than in climate data.

Could we improve upon this model? Sure! We could add heterogeneous variances, explore non-linearities, take into account the genetic relationship between the trees, run the whole thing in asreml (so it is faster), etc. Nevertheless, at this point you can get an idea of some of the issues (or should I call them niceties?) involved in the analysis of experiments.

Teaching with R: the switch

There are several blog posts, websites (and even books) explaining the transition from using another statistical system (e.g. SAS, SPSS, Stata, etc.) to relying on R. Most of that material treats the topic from the point of view of i- an individual user and ii- a researcher. This post explains some of the issues involved in moving several users at once, with an emphasis on teaching.

I have made part of this information available before, but I wanted to update it and keep it together with all the other posts in Quantum Forest. The process started in March 2009.

March 2009

I started explaining to colleagues my position on using R (and R commander) for teaching purposes. Some background first: forestry deals with variability and variability is the province of statistics. The use of statistics permeates forestry: we use sampling for inventory purposes, we use all sort of complex linear and non-linear regression models to predict growth, linear mixed models are the bread and butter of the analysis of experiments, etc.

I think it is fair to expect foresters to be at least acquainted with basic statistical tools, and we have two courses covering ANOVA and regression. In addition, we are supposed to introduce/reinforce statistical concepts in several other courses. So far so good, until we reached the issue of software.

During the first year of study, it is common to use MS Excel. I am not a big fan of Excel, but I can tolerate its use: people do not require much training to (ab)use it and it has a role in introducing students to some of the ‘serious/useful’ functions of a computer; that is, beyond gaming. However, one can hit Excel's limits fairly quickly, which–together with the lack of an audit trail for the analyses and the need to repeat all the pointing and clicking every time we need an analysis–makes looking for more robust tools very important.

Until the end of 2009 SAS (mostly BASE and STAT, with some sprinkles of GRAPH) was our robust tool. SAS was introduced in second year during the ANOVA and regression courses. SAS is a fine product, however:

  • We spent a very long time explaining how to write simple SAS scripts. Students forgot the syntax very quickly.
  • SAS’s graphical capabilities are fairly ordinary and not at all conducive to exploratory data analysis.
  • SAS is extremely expensive.
  • SAS tends to define the subject; I mean, it adopts new techniques very slowly, so there is the tendency to do only what SAS can do. This is unimportant for undergrads, but it is relevant for postgrads.
  • Users sometimes store data in SAS’s own format, which introduces another source of lock-in.

At the time, in my research work I used mostly ASReml (for specialized genetic analyses) and R (for general work); since then I have moved towards using asreml-R (an R library that interfaces ASReml) to have a consistent work environment. For teaching I was using SAS to be consistent with second-year material.

Considering the previously mentioned barriers for students I started playing with R-commander (Rcmdr), a cross-platform GUI for R created by John Fox (the writer of some very nice statistics books, by the way). As I see it:

  • R in command mode is not more difficult (but not simpler either) for students than SAS. I think that SAS is more consistent and they have worked hard at keeping a very similar structure between PROCs.
  • We can get R-commander to start working right away with simple(r) methods, while maintaining the possibility of moving to more complex methods later by typing commands or programming.
  • It is free, so our students can load it into their laptops and keep on using it when they are gone. This is particularly true with international students: many of them will never see SAS again in their home countries.
  • It allows an easy path to data exploration (pre-requisite for building decent models) and high quality graphs.
  • R is open source (nice, but not a deal breaker for me) and easily extensible (this one is really important for me).

At the time I thought that R would be an excellent fit for teaching; nevertheless, there could be a few drawbacks, mostly when dealing with postgrads:

  • There are restrictions to the size of datasets (they have to fit in memory), although there are ways to deal with some of the restrictions. On the other hand, I have hit the limits of PROC GLM and PROC MIXED before and that is where ASReml shines. In two years this has never been a problem.
  • Some people have an investment in SAS and may not like the idea of using a different software. This was a problem the first few months.

As someone put it many years ago–there is always resistance to change:

It must be remembered that there is nothing more difficult to plan, more doubtful of success, nor more dangerous to manage, than the creation of a new system. For the initiator has the enmity of all who would profit by the preservation of the old institutions and merely lukewarm defenders in those who would gain by the new ones.—Niccolò Machiavelli, The Prince, Chapter 6

Five months later: August 2009

At the department level, I had to spend substantial time compiling information to prove that R could satisfy my colleagues’ statistical needs. Good selling points were nlme/lme4, lattice/ggplot2 and pointing my most statistically inclined colleagues to CRAN. Another important issue was the ability to have a GUI (Rcmdr) that could be adapted to our specific needs. At that time the School of Forestry adopted R as the default software for teaching any statistical content during the four years of the curriculum.

At the university level, my questions to the department of Mathematics and Statistics sparked a lot of internal discussion, which resulted in R being adopted as the standard software for the ANOVA and regression second year courses (it was already the standard for many courses in 3rd and 4th year). The decision was not unanimous, particularly because for statisticians SAS is one of those ‘must be in the CV’ skills, but they went for change. The second year courses are offered across colleges, which makes the change very far reaching. These changes implied that many computers in the university labs now come with R pre-installed.

A year later: April 2010

R and R-commander were installed in our computer labs and we started using them in our Research Methods course. It is still too early to see what the effect of R versus SAS will be, but we expect to see an increase in the application of statistics within our curriculum.

One thing that I did not properly consider in the process was the annoying side-effects of the university’s computer policies. Students are not allowed to install software on the university computers and R packages fall within that category. We can either stay with the defaults + R commander (our current position) or introduce an additional complication for students, pushing them to define their own library location. I’d rather teach ggplot2 than lattice, but ggplot2 is an extra installation. Choices, choices… On the positive side, the default installation in some of the computer labs includes all the packages.

Two years later: March 2011

Comments after teaching a regression modeling course using R-commander:

  • Some students really appreciate the possibility of using R-commander as their ‘total analysis system’. Most students that have never used a command line environment prefer it.
  • Students that have some experience with command-line work do not like R-commander much, as they find it confusing, particularly when it is possible to access the R console through two points: Rcmdr and the default console. Some of them could not see the point of using an environment with a limited subset of functionality.
  • Data transformation facilities in R-commander are somewhat limited to the simplest cases.
  • Why is it that the linear regression item does not accept categorical predictors? That works under ‘linear models’, but it is such an arbitrary separation.
  • The OS X version of R-commander (under X Windows) is butt ugly. This is not John Fox’s fault, but just a fact of life.

In general, R would benefit from having a first-class Excel import system that worked across platforms. Yes, I know that some people say that researchers should not use Excel; however, there is a distinction between normative and positive approaches to research. People do use Excel and insisting that they should not is not helpful.

I would love to hear anyone else’s experiences teaching basic statistics with R. Any comments?

Spatial correlation in designed experiments

Last Wednesday I had a meeting with the folks of the New Zealand Drylands Forest Initiative in Blenheim. In addition to sitting in a conference room and having nice sandwiches we went to visit one of our progeny trials at Cravens. Plantation forestry trials are usually laid out following a rectangular lattice defined by rows and columns. The trial follows an incomplete block design with 148 blocks and is testing 60 Eucalyptus bosistoana families. A quick look at survival shows an interesting trend: the bottom of the trial was much more affected by frost than the top.

setwd('~/Dropbox/quantumforest')
library(ggplot2) # for qplot
library(asreml)
 
load('cravens.Rdata')
 
qplot(col, row, fill = Surv, geom='tile', data = cravens)

We have the trial assessed for tree height (ht) at 2 years of age, where a plot also shows some spatial trends, with better growth on the top-left corner; this is the trait that I will analyze here.

Tree height can be expressed as a function of an overall constant, random block effects, random family effects and a normally distributed residual (this is our base model, m1). I will then take into account the position of the trees (expressed as rows and columns within the trial) to fit spatially correlated residuals (m2)—using separable rows and columns processes—to finally add a spatially independent residual to the mix (m3).

# Base model (non-spatial)
m1 = asreml(ht ~ 1, random = ~ Block + Family, data = cravens)
summary(m1)$loglik
[1] -8834.071
 
summary(m1)$varcomp
 
                       gamma component std.error   z.ratio constraint
Block!Block.var   0.52606739 1227.4058 168.84681  7.269345   Positive
Family!Family.var 0.06257139  145.9898  42.20675  3.458921   Positive
R!variance        1.00000000 2333.1722  78.32733 29.787459   Positive

m1 represents a traditional family model with the overall constant as the only fixed effect and a diagonal residual matrix (identity times the residual variance). In m2 I am modeling the R matrix as the Kronecker product of two separable autoregressive processes (one in rows and one in columns) times a spatial residual.

# Spatial model (without non-spatial residual)
m2 = asreml(ht ~ 1, random = ~ Block + Family,
            rcov = ~ ar1(row):ar1(col),
            data = cravens[order(cravens$row, cravens$col),])
summary(m2)$loglik
[1] -8782.112
 
summary(m2)$varcomp
 
                       gamma    component    std.error   z.ratio
Block!Block.var   0.42232303 1025.9588216 157.00799442  6.534437
Family!Family.var 0.05848639  142.0822874  40.20346956  3.534080
R!variance        1.00000000 2429.3224308  88.83064209 27.347798
R!row.cor         0.09915320    0.0991532   0.02981808  3.325271
R!col.cor         0.28044024    0.2804402   0.02605972 10.761445
 
                      constraint
Block!Block.var         Positive
Family!Family.var       Positive
R!variance              Positive
R!row.cor          Unconstrained
R!col.cor          Unconstrained

Adding two parameters to the model improves the log-likelihood from -8834.071 to -8782.112, a difference of 51.959, but with what appears to be a low correlation in both directions. In m3 I am adding a spatially independent residual (using the keyword units), improving the log-likelihood from -8782.112 to -8660.411: not too shabby.

m3 = asreml(ht ~ 1, random = ~ Block + Family + units,
            rcov = ~ ar1(row):ar1(col),
            data = cravens[order(cravens$row, cravens$col),])
summary(m3)$loglik
[1] -8660.411
 
summary(m3)$varcomp
 
                         gamma    component    std.error    z.ratio
Block!Block.var   3.069864e-07 4.155431e-04 6.676718e-05   6.223763
Family!Family.var 1.205382e-01 1.631630e+02 4.327885e+01   3.770040
units!units.var   1.354166e+00 1.833027e+03 7.085134e+01  25.871456
R!variance        1.000000e+00 1.353621e+03 2.174923e+02   6.223763
R!row.cor         7.814065e-01 7.814065e-01 4.355976e-02  17.938724
R!col.cor         9.529984e-01 9.529984e-01 9.100529e-03 104.719017
                     constraint
Block!Block.var        Boundary
Family!Family.var      Positive
units!units.var        Positive
R!variance             Positive
R!row.cor         Unconstrained
R!col.cor         Unconstrained

Allowing for the independent residual (in addition to the spatial one) has permitted higher autocorrelations (particularly in columns), while the Block variance has gone very close to zero. Most of the Block variation is being captured now through the residual matrix (in the rows and columns autoregressive correlations), but we are going to keep it in the model, as it represents a restriction to the randomization process. In addition to log-likelihood (or AIC) we can get graphical descriptions of the fit by plotting the empirical semi-variogram as in plot(variogram(m3)):
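As a rough numerical check (not part of the original post), the likelihood-ratio statistics for these nested comparisons can be computed directly from the log-likelihoods reported above; keep in mind that testing variance components on the boundary makes the usual chi-squared reference only approximate:

# m1 vs m2: two extra parameters (row and column autocorrelations)
2 * (summary(m2)$loglik - summary(m1)$loglik)   # ~103.9

# m2 vs m3: one extra variance (the independent units term)
2 * (summary(m3)$loglik - summary(m2)$loglik)   # ~243.4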

On R, bloggers, politics, sex, alcohol and rock & roll

Yesterday morning at 7 am I was outside walking the dog before getting a taxi to go to the airport to catch a plane to travel from Christchurch to Blenheim (now I can breathe after reading without a pause). It was raining cats and dogs while I was walking doggyo, thinking of a post idea for Quantum Forest; something that I could work on without a computer. Then I remembered that I told Tal Galili that I would ‘mention r-bloggers’ in a future post. Well, Tal, this is it.

I started this blog on 4th October and I thought ‘I could write a few things and see how it goes. I may even get 10 people a day reading this blog’. I reached that number almost immediately and I was thinking that, if I kept going at it, I could eventually reach 50 people. Then I came across R-bloggers and, after some hesitation, I submitted this blog to Tal’s web site. I jumped to over 200 people a day visiting the blog (see below) and even got comments! I repeat: I was writing about mixed models and got comments!

As I point out in the ‘About’ page, this is not an ‘R blog’ but a blog about statistics and data analysis in general, that mostly uses R as a vehicle to express ideas. There will be some Python (my favorite language) and, maybe, other tools; however, the ideas are more important than the syntax. Nevertheless, R has democratized the practice of statistics, as well as facilitated the production of some very interesting visuals. One thing was clear to me: I did not want to write about ‘data visualization’, which is receiving a lot (I dare say too much) attention in the R world. Most infographics produce the same reaction in me as choirs and mimes; which is to say, they bore me to tears (apologies if you are a choir-singing mime in your spare time). I want to write a bit about analysis and models and ‘bread and butter’ issues, because I think that they are many times ignored by people chasing the latest smoke and mirrors.

I have to say that I am learning a lot from comments, particularly people suggesting packages that I didn’t even know existed. I am also learning about spam-comments and I wish there was a horrible place where spammers would go and suffer for eternity (together with pedophiles, torturers and other not-so-nice seedy characters). I am thankful to Tal mostly for putting in the work to create a repository of R blogs; when I have some free time I often wander around looking for some interesting explanation to an R problem that is bugging me. As Tolkien said: ‘Not all those who wander are lost’.

Politics

Most of my professional life I have dealt with, let’s say, ‘agricultural’ statistics; that is, designed experiments, linear mixed models (mostly frequentist, but sometimes Bayesian), often with pedigrees. One of my pet peeves (and somewhat political issue) is that many statisticians—particularly in mathematics departments—tend to look down on ‘bread and butter’ work. They seem to forget that experiments and breeding/genetics have been the basis of many theoretical developments and that we deal with heavily multivariate data, longitudinal data, spatial data, models with sometimes hundreds of covariance components or, if you are into genomics, models with tens of thousands of random effects.

There is also the politics of software, where in the R community there is a strong bias against any package that is not free (sensu both speech and beer). At some level I share the ideal of having free access to tools, which makes the practice of statistics/data analysis available to a large number of people. At another level, it seems counterproductive to work with substandard tools and to go for lesser models only because ‘package X can’t deal with the model that I would like to fit’. This is just another example of software defining our field, which was (and still is) a common accusation pointed towards SAS. In contrast, in this blog you will see several references to ASReml-R, a commercial mixed models package, which I tend to use because (as far as I know) there is nothing at the moment that comes close to its functionality. If I pay for my computer running MS Windows or OS X (far from being paradigms of openness), how can I justify being dogmatic about using a commercial R package (particularly if I can access it via academic or nonprofit pricing)?

Sex and alcohol

As I pointed out above, most of my work has dealt with forestry/agriculture. Nevertheless, this year I have become more interested in the relationship between statistics and public policy issues; for example, minimum wage and unemployment, which I covered as a simple example in this blog. I have to thank my colleague Eric Crampton (in Canterbury’s Department of Economics) for igniting my interest in the use and misuse of statistics, and their interaction with economics, to justify all sorts of restrictions and interventions in society. There are very interesting datasets in this area available from, for example, Statistics New Zealand that could be analyzed and would make very interesting case studies for teaching stats. There is a quote by H.G. Wells that comes to mind:

Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.

Rock & roll? This is getting too long, so I will leave music for another time.

Large applications of linear mixed models

In a previous post I summarily described our options for (generalized to varying degrees) linear mixed models from a frequentist point of view: nlme, lme4 and ASReml-R, followed by a quick example for a split-plot experiment.

But who is really happy with a toy example? We can show a slightly more complicated example assuming that we have a simple situation in breeding: a number of half-sib trials (so we have progeny that share one parent in common), each of them established following a randomized complete block design, analyzed using a ‘family model’. That is, the response variable (dbh: tree stem diameter assessed at breast height—1.3m from ground level) can be expressed as a function of an overall mean, fixed site effects, random block effects (within each site), random family effects and a random site-family interaction. The latter provides an indication of genotype by environment interaction.

setwd('~/Dropbox/quantumforest')
 
trees = read.table('treedbh.txt', header=TRUE)
 
# Removing all full-siblings and dropping Trial levels
# that are not used anymore
trees = subset(trees, Father == 0)
trees$Trial = trees$Trial[drop = TRUE]
 
# Which leaves me with 189,976 trees in 31 trials
xtabs(~Trial, data = trees)
 
# First run the analysis with lme4
library(lme4)
m1 = lmer(dbh ~ Trial + (1|Trial:Block) + (1|Mother) + 
                (1|Trial:Mother), data = trees)
 
 
# Now run the analysis with ASReml-R
library(asreml)
m2 = asreml(dbh ~ Trial, random = ~Trial:Block + Mother + 
                  Trial:Mother, data = trees)

First I should make clear that this model has several drawbacks:

  • It assumes that we have homogeneous variances for all trials, as well as identical correlation between trials. That is, that we are assuming compound symmetry, which is very rarely the case in multi-environment progeny trials.
  • It also assumes that all families are unrelated to each other, which in this case I know is not true, as a quick look at the pedigree will show that several mothers are related.
  • We assume only one type of family (half-sibs), which is true just because I got rid of all the full-sib families and clonal material, to keep things simple in this example.

Just to make the point, a quick look at the data using ggplot2—with some jittering and alpha transparency to better display a large number of points—shows differences in both mean and variance among sites.

library(ggplot2)
ggplot(aes(x=jitter(as.numeric(Trial)), y=dbh), data = trees) + 
      geom_point(colour=alpha('black', 0.05)) + 
      scale_x_continuous('Trial')

Anyway, let’s keep on going with this simplistic model. On my computer, a two-year-old iMac, ASReml-R takes roughly 9 seconds, while lme4 takes around 65 seconds, obtaining similar results.
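For reference, a minimal way of timing both fits (assuming the data and packages from the code above are already loaded; your numbers will differ):

# Wall-clock timing of each fit
system.time(m1 <- lmer(dbh ~ Trial + (1|Trial:Block) + (1|Mother) +
                             (1|Trial:Mother), data = trees))
system.time(m2 <- asreml(dbh ~ Trial, random = ~ Trial:Block + Mother +
                               Trial:Mother, data = trees))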

# First lme4 (and rounding off the variances)
summary(m1)
 
Random effects:
 Groups       Name        Variance Std.Dev.
 Trial:Mother (Intercept)  26.457   5.1436 
 Mother       (Intercept)  32.817   5.7286 
 Trial:Block  (Intercept)  77.390   8.7972 
 Residual                 892.391  29.8729 
 
 
# And now ASReml-R
# Round off part of the variance components table
round(summary(m2)$varcomp[,1:4], 3)
                       gamma component std.error z.ratio
Trial:Block!Trial.var  0.087    77.399     4.598  16.832
Mother!Mother.var      0.037    32.819     1.641  20.002
Trial:Mother!Trial.var 0.030    26.459     1.241  21.328
R!variance             1.000   892.389     2.958 301.672

The residual matrix of this model is a fairly large diagonal matrix (residual variance times an identity matrix). At this point we can relax this assumption, adding a bit more complexity to the model so we can highlight some syntax. Residuals in one trial should have nothing to do with residuals in another trial, which could be hundreds of kilometers away. I will then allow for a new matrix of residuals, which is the direct sum of trial-specific diagonal matrices. In ASReml-R we can do so by specifying a diagonal matrix at each trial with rcov = ~ at(Trial):units:

m2b =  asreml(dbh ~ Trial, random = ~Trial:Block + Mother + 
                    Trial:Mother, data = trees,
                    rcov = ~ at(Trial):units)
 
# Again, extracting and rounding variance components
round(summary(m2b)$varcomp[,1:4], 3)
                          gamma component std.error z.ratio
Trial:Block!Trial.var    77.650    77.650     4.602  16.874
Mother!Mother.var        30.241    30.241     1.512  20.006
Trial:Mother!Trial.var   22.435    22.435     1.118  20.065
Trial_1!variance       1176.893  1176.893    18.798  62.606
Trial_2!variance       1093.409  1093.409    13.946  78.403
Trial_3!variance        983.924   983.924    12.061  81.581
...
Trial_29!variance      2104.867  2104.867    55.821  37.708
Trial_30!variance       520.932   520.932    16.372  31.819
Trial_31!variance       936.785   936.785    31.211  30.015

There is a big improvement in log-likelihood from m2 (-744452.1) to m2b (-738011.6) for 30 additional variances. At this stage, we can also start thinking of heterogeneous variances for blocks, with a small change to the code:

m2c =  asreml(dbh ~ Trial, random = ~at(Trial):Block + Mother + 
                    Trial:Mother, data = trees,
                    rcov = ~ at(Trial):units)
round(summary(m2c)$varcomp[,1:4], 3)
 
# Which adds another 30 variances (one for each Trial)
                                 gamma component std.error z.ratio
at(Trial, 1):Block!Block.var     2.473     2.473     2.268   1.091
at(Trial, 2):Block!Block.var    95.911    95.911    68.124   1.408
at(Trial, 3):Block!Block.var     1.147     1.147     1.064   1.079
...
Mother!Mother.var               30.243    30.243     1.512  20.008
Trial:Mother!Trial.var          22.428    22.428     1.118  20.062
Trial_1!variance              1176.891  1176.891    18.798  62.607
Trial_2!variance              1093.415  1093.415    13.946  78.403
Trial_3!variance               983.926   983.926    12.061  81.581
...

At this point we could do extra modeling at the site level by including any other experimental design features, allowing for spatially correlated residuals, etc. I will cover some of these issues in future posts, as this one is getting a bit too long. However, we could gain even more by expressing the problem in a multivariate fashion, where the performance of Mothers in each trial would be considered a different trait. This would push us towards some large problems, which require modeling covariance structures so we have a chance of achieving convergence during our lifetime.

Disclaimer: I am emphasizing the use of ASReml-R because 1.- I am more familiar with it than with nlme or lme4 (although I tend to use nlme for small projects), and 2.- there are plenty of examples for the other packages but not many for ASReml in the R world. I know some of the people involved in the development of ASReml-R but otherwise I have no commercial links with VSN (the guys that market ASReml and Genstat).

There are other packages that I have not had the chance to investigate yet, like hglm.

Lattice when modeling, ggplot when publishing

When working in research projects I tend to fit several, sometimes quite a few, alternative models. This model fitting is informed by theoretical considerations (e.g. quantitative genetics, experimental design we used, our understanding of the process under study, etc.) but also by visual inspection of the data. Trellis graphics—where subsets of data are plotted in different ‘panels’ defined by one or more factors—are extremely useful to generate research hypotheses.

There are two packages in R that have good support for trellis graphics: lattice and ggplot2. Lattice is the older of the two, while ggplot2 is probably more consistent (implementing a grammar of graphics) and popular with the cool kids and the data visualization crowd. However, lattice is also quite fast, while ggplot2 can be slow as a dog (certainly way slower than my dog).

Tree-breeding progeny trials often have between 1,000 and 12,000 individuals, and analyses commonly include several trials. Thus, it is not unusual to have tens of thousands or even hundreds of thousands of records that will be involved in the analysis. Add to this situation that I am impatient and you will understand that differences in speed can make a big difference to my mental health. But how different is the speed? We can simulate some correlated data (following the explanation in this post) and build a simple scatterplot faceted by site; let’s say 60,000 observations in 6 sites (10,000 per site).

library(lattice)
library(ggplot2)
 
# number of observations to simulate
nobs = 60000
sites = 6
 
# Using a correlation matrix (let's assume that all variables
# have unit variance)
M = matrix(c(1, 0.7,
             0.7, 1), nrow=2, ncol=2)
 
# Cholesky decomposition
L = chol(M)
nvars = dim(L)[1]
 
# Random variables that follow an M correlation matrix
r = t(L) %*% matrix(rnorm(nvars*nobs), nrow=nvars, ncol=nobs)
r = t(r)
 
rdata = as.data.frame(r)
names(rdata) = c('x', 'y')
rdata$site = factor(rep(1:sites, each = nobs/sites))
 
# Plotting with lattice
xyplot(y ~ x | site, data = rdata,
       layout = c(3, 2), type = c('p', 'smooth'))
 
# Plotting with ggplot2
qplot(x, y, facets = ~ site,
      geom = c('point', 'smooth'),
      data = rdata) + facet_wrap(~site)

The timing was done surrounding the graph calls (either xyplot() or qplot()) with system.time(print()), so the graph is sent to the screen and the operation is timed. In summary, in this simple call ggplot2 takes a bit over twice as long as lattice. The more layers you add to the graph, the slower it gets.
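In case it is useful, this is roughly how each call can be timed (a sketch; the exact wrapper behind the reported numbers may have differed):

# Time the lattice version
system.time(print(xyplot(y ~ x | site, data = rdata,
                         layout = c(3, 2), type = c('p', 'smooth'))))

# Time the ggplot2 version
system.time(print(qplot(x, y, facets = ~ site,
                        geom = c('point', 'smooth'),
                        data = rdata) + facet_wrap(~site)))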

The two plots are below. We could improve both plots and make them look more similar to each other, but I want to avoid introducing more distractions in the code.

Nevertheless, I do like the flexibility of ggplot2, so I support most of my exploratory data analysis using lattice but when I have to create the final pretty plots for publications in journals I go back to ggplot2. I subscribe to Frank Harrell’s Philosophy of Biostatistics, which includes ‘use excellent graphics, liberally’. Switching between packages lets me deal with both abundance of graphics and impatience.

This is R pitfall #2: plots inside a function (and system.time() is a function) have to be surrounded by print() or they won’t be sent to the screen. Pitfall #1 is here.

Linear mixed models in R

A substantial part of my job has little to do with statistics; nevertheless, a large proportion of the statistical side of things relates to applications of linear mixed models. The bulk of my use of mixed models relates to the analysis of experiments that have a genetic structure.

A brief history of time

At the beginning (1992-1995) I would use SAS (first proc glm, later proc mixed), but things started getting painfully slow and limiting if one wanted to move into animal model BLUP. At that time (1995-1996), I moved to DFREML (by Karen Meyer, now replaced by WOMBAT) and AIREML (by Dave Johnson, now defunct—the program I mean), which were designed for the analysis of animal breeding progeny trials, so it was a hassle to deal with experimental design features. At the end of 1996 (or was it the beginning of 1997?) I started playing with ASReml (programed by Arthur Gilmour mostly based on theoretical work by Robin Thompson and Brian Cullis). I was still using SAS for data preparation, but all my analyses went through ASReml (for which I wrote the cookbook), which was orders of magnitude faster than SAS (and could deal with much bigger problems). Around 1999, I started playing with R (prompted by a suggestion from Rod Ball), but I didn’t really use R/S+ often enough until 2003. At the end of 2005 I started using OS X and quickly realized that using a virtual machine or dual booting was not really worth it, so I dropped SAS and totally relied on R in 2009.

Options

As for many other problems, there are several packages in R that let you deal with linear mixed models from a frequentist (REML) point of view. I will only mention nlme (Non-Linear Mixed Effects), lme4 (Linear Mixed Effects) and asreml (average spatial reml). There are also several options for Bayesian approaches, but that will be another post.

nlme is the most mature one and comes by default with any R installation. It fits hierarchical linear mixed models as well as non-linear mixed models, in both cases assuming a Gaussian response (my original explanation wasn’t very clear; thanks to ucfagls below for pointing this out). Its main advantages are, in my humble opinion, the ability to fit fairly complex hierarchical models using linear or non-linear approaches, a good variety of variance and correlation structures, and (via extensions such as glmmPQL in MASS) access to several distributions and link functions for generalized models. In my opinion, its main drawbacks are i- fitting cross-classified random factors is a pain, ii- it can be slow and may struggle with lots of data, iii- it does not deal with pedigrees by default and iv- it does not deal with multivariate data.

lme4 is a project led by Douglas Bates (one of the co-authors of nlme), looking at modernizing the code and making room for trying new ideas. On the positive side, it seems to be a bit faster than nlme and it deals a lot better with cross-classified random factors. Drawbacks: similar to nlme’s, but dropping point i- and adding that it doesn’t deal with covariance and correlation structures yet. It is possible to fit pedigrees using the pedigreemm package, but I find the combination a bit flimsy.

ASReml-R is, unsurprisingly, an R package interface to ASReml. On the plus side it i- deals well with cross-classified random effects, ii- copes very well with pedigrees, iii- can work with fairly large datasets, iv-can run multivariate analyses and v- covers a large number of covariance and correlation structures. Main drawbacks are i- limited functionality for non-Gaussian responses, ii- it does not cover non-linear models and iii- it is non-free (as in beer and speech). The last drawback is relative; it is possible to freely use asreml for academic purposes (and there is also a version for developing countries). Besides researchers, the main users of ASReml/ASReml-R are breeding companies.

All three packages are available for Windows, Linux and OS X.

A (very) simple example

I will use a traditional dataset to show examples of the notation for the three packages: Yates’ variety and nitrogen split-plot experiment. We can get the dataset from the MASS package, after which it is a good idea to rename the variables using meaningful names. In addition, I will follow Bill Venables’s excellent advice and create additional variables for main plot and subplots, as it is confusing to use the same factor for two purposes (e.g. variety as treatment and main plot). Incidentally, if you haven’t read Bill’s post go and read it; it is one of the best explanations I have ever seen for a split-plot analysis.

library(MASS)
data(oats)
names(oats) = c('block', 'variety', 'nitrogen', 'yield')
oats$mainplot = oats$variety
oats$subplot = oats$nitrogen
 
summary(oats)
 block           variety     nitrogen      yield              mainplot 
 I  :12   Golden.rain:24   0.0cwt:18   Min.   : 53.0   Golden.rain:24  
 II :12   Marvellous :24   0.2cwt:18   1st Qu.: 86.0   Marvellous :24  
 III:12   Victory    :24   0.4cwt:18   Median :102.5   Victory    :24  
 IV :12                    0.6cwt:18   Mean   :104.0                   
 V  :12                                3rd Qu.:121.2                   
 VI :12                                Max.   :174.0                   
   subplot  
 0.0cwt:18  
 0.2cwt:18  
 0.4cwt:18  
 0.6cwt:18

The nlme code for this analysis is fairly simple: response on the left-hand side of the tilde, followed by the fixed effects (variety, nitrogen and their interaction). Then there is the specification of the random effects (which also uses a tilde) and the data set containing all the data. Notice that 1|block/mainplot is fitting block and mainplot within block. There is no reference to subplot as there is a single assessment for each subplot, which ends up being used at the residual level.

library(nlme)
m1.nlme = lme(yield ~ variety*nitrogen,
                      random = ~ 1|block/mainplot,
                      data = oats)
 
summary(m1.nlme)
 
Linear mixed-effects model fit by REML
 Data: oats 
       AIC      BIC    logLik
  559.0285 590.4437 -264.5143
 
Random effects:
 Formula: ~1 | block
        (Intercept)
StdDev:    14.64496
 
 Formula: ~1 | mainplot %in% block
        (Intercept) Residual
StdDev:    10.29863 13.30727
 
Fixed effects: yield ~ variety * nitrogen 
                                    Value Std.Error DF   t-value p-value
(Intercept)                      80.00000  9.106958 45  8.784492  0.0000
varietyMarvellous                 6.66667  9.715028 10  0.686222  0.5082
varietyVictory                   -8.50000  9.715028 10 -0.874933  0.4021
nitrogen0.2cwt                   18.50000  7.682957 45  2.407927  0.0202
nitrogen0.4cwt                   34.66667  7.682957 45  4.512152  0.0000
nitrogen0.6cwt                   44.83333  7.682957 45  5.835427  0.0000
varietyMarvellous:nitrogen0.2cwt  3.33333 10.865342 45  0.306786  0.7604
varietyVictory:nitrogen0.2cwt    -0.33333 10.865342 45 -0.030679  0.9757
varietyMarvellous:nitrogen0.4cwt -4.16667 10.865342 45 -0.383482  0.7032
varietyVictory:nitrogen0.4cwt     4.66667 10.865342 45  0.429500  0.6696
varietyMarvellous:nitrogen0.6cwt -4.66667 10.865342 45 -0.429500  0.6696
varietyVictory:nitrogen0.6cwt     2.16667 10.865342 45  0.199411  0.8428
 
anova(m1.nlme)
 
                 numDF denDF   F-value p-value
(Intercept)          1    45 245.14299  <.0001
variety              2    10   1.48534  0.2724
nitrogen             3    45  37.68562  <.0001
variety:nitrogen     6    45   0.30282  0.9322

The syntax for lme4 is not that dissimilar, with random effects specified using a (1|something here) syntax. One difference between the two packages is that nlme reports standard deviations instead of variances for the random effects.

library(lme4)
m1.lme4 = lmer(yield ~ variety*nitrogen + (1|block/mainplot),
                       data = oats)
 
summary(m1.lme4)
 
Linear mixed model fit by REML 
Formula: yield ~ variety * nitrogen + (1 | block/mainplot) 
   Data: oats 
 AIC   BIC logLik deviance REMLdev
 559 593.2 -264.5    595.9     529
Random effects:
 Groups         Name        Variance Std.Dev.
 mainplot:block (Intercept) 106.06   10.299  
 block          (Intercept) 214.48   14.645  
 Residual                   177.08   13.307  
Number of obs: 72, groups: mainplot:block, 18; block, 6
 
Fixed effects:
                                 Estimate Std. Error t value
(Intercept)                       80.0000     9.1064   8.785
varietyMarvellous                  6.6667     9.7150   0.686
varietyVictory                    -8.5000     9.7150  -0.875
nitrogen0.2cwt                    18.5000     7.6830   2.408
nitrogen0.4cwt                    34.6667     7.6830   4.512
nitrogen0.6cwt                    44.8333     7.6830   5.835
varietyMarvellous:nitrogen0.2cwt   3.3333    10.8653   0.307
varietyVictory:nitrogen0.2cwt     -0.3333    10.8653  -0.031
varietyMarvellous:nitrogen0.4cwt  -4.1667    10.8653  -0.383
varietyVictory:nitrogen0.4cwt      4.6667    10.8653   0.430
varietyMarvellous:nitrogen0.6cwt  -4.6667    10.8653  -0.430
varietyVictory:nitrogen0.6cwt      2.1667    10.8653   0.199
 
anova(m1.lme4)
 
Analysis of Variance Table
                 Df  Sum Sq Mean Sq F value
variety           2   526.1   263.0  1.4853
nitrogen          3 20020.5  6673.5 37.6856
variety:nitrogen  6   321.7    53.6  0.3028
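Following up on the point above about standard deviations versus variances, the square roots of the lme4 variance components match the nlme standard deviations (a quick check, not part of the original output):

# block, mainplot within block, and residual
sqrt(c(214.48, 106.06, 177.08))
# [1] 14.645 10.299 13.307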

For this type of problem, the notation for asreml is also very similar, particularly when compared to nlme.

library(asreml)
m1.asreml = asreml(yield ~ variety*nitrogen,
                           random = ~ block/mainplot,
                           data = oats)
 
summary(m1.asreml)$varcomp
 
                             gamma component std.error  z.ratio constraint
block!block.var          1.2111647  214.4771 168.83404 1.270343   Positive
block:mainplot!block.var 0.5989373  106.0618  67.87553 1.562593   Positive
R!variance               1.0000000  177.0833  37.33244 4.743416   Positive
 
wald(m1.asreml, denDF = 'algebraic')
 
$Wald
                 Df denDF    F.inc           Pr
(Intercept)       1     5 245.1000 1.931825e-05
variety           2    10   1.4850 2.723869e-01
nitrogen          3    45  37.6900 2.457710e-12
variety:nitrogen  6    45   0.3028 9.321988e-01
 
$stratumVariances
               df  Variance block block:mainplot R!variance
block           5 3175.0556    12              4          1
block:mainplot 10  601.3306     0              4          1
R!variance     45  177.0833     0              0          1

In this simple example one pretty much gets the same results, independently of the package used (which is certainly comforting). I will soon cover another simple model, but with much larger dataset, to highlight some performance differences between the packages.

Maximum likelihood

This post is one of those ‘explain to myself how things work’ documents, which are not necessarily completely correct but are close enough to facilitate understanding.

Background

Let’s assume that we are working with a fairly simple linear model, where we only have a response variable (say tree stem diameter in cm). If we want to ‘guess’ the diameter for a tree (yi) our best bet is the average (μ) and we will have a residual (εi). The model equation then looks like:

\(y_i = \mu + \varepsilon_i \)

We want to estimate both the overall mean (μ) and the error variance (\(\sigma_{\varepsilon}^2\)), for which we could use least squares. However, we will use an alternative method (maximum likelihood) because that is the point of this post. A likelihood function expresses the probability of obtaining the observed sample from a population given a set of model parameters. Thus, it answers the question What is the probability of observing the current dataset, when I assume a given set of parameters for my linear model? To do so, we have to assume a distribution, say normal, for which the probability density function (p.d.f) is:

\(p.d.f. (y) = \frac{1} {\sqrt{2 \pi} \sigma} e^{-1/2 \frac{(y- \mu)^2} {\sigma^2}} \)

with n independent samples the likelihood (L) is:

\(L = \prod_{i=1}^n \frac{1} {\sqrt{2 \pi} \sigma} e^{-1/2 \frac{(y_i- \mu)^2} {\sigma^2}}\)

where ∏ is a multiplication operator, analogous to the summation operator ∑. Maximizing L or its natural logarithm (LogL) is equivalent, then:

\(LogL = \sum_{i=1}^n \log\left ( \frac{1} {\sqrt{2 \pi} \sigma} e^{-1/2 \frac{(y_i- \mu)^2} {\sigma^2}} \right ) \)

\(LogL= \sum_{i=1}^n \left [ \log\left ( \frac{1} {\sqrt{2 \pi} \sigma} \right ) + \log\left ( e^{-1/2 \frac{(y_i- \mu)^2} {\sigma^2}} \right ) \right ]\)

\(LogL= \sum_{i=1}^n \log(2 \pi)^{-1/2} + \sum_{i=1}^n \log\left (\frac{1} {\sigma}\right ) + \sum_{i=1}^n \left ( -1/2 \frac{(y_i- \mu)^2} {\sigma^2} \right )\)

\(LogL= -\frac{n} {2} \log(2 \pi) - n \log(\sigma) - \frac{1} {2 \sigma^2} \sum_{i=1}^n ( y_i - \mu)^2\)

Considering only μ, LogL is maximum when \(\sum_{i=1}^n ( y_i - \mu)^2\) is a minimum, i.e. for:

\(\mu = \frac{\sum_{i=1}^n y_i}{n}\)

Now considering σ:

\(\frac{\delta LogL}{\delta \sigma} = -\frac{n}{\sigma} - (-2) \frac{1}{2} \sum_{i=1}^n (y_i - \mu)^2 \sigma^{-3}\)
\(\frac{\delta LogL}{\delta \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (y_i - \mu)^2\)

and setting the last expression equal to 0:

\(\frac{n}{\sigma} = \frac{1}{\sigma^3} \sum_{i=1}^n (y_i - \mu)^2\)
\(\sigma^2 = \frac{\sum_{i=1}^n (y_i - \mu)^2}{n}\)

We can obtain an estimate of the error variance (\(\hat{\sigma}^2\), notice the hat) by replacing μ by our previous estimate:
\(\hat{\sigma}^2 = \frac{\sum_{i=1}^n (y_i - \bar{y})^2}{n}\)

which is biased, because the denominator is n rather than (n – 1) (the typical denominator for sample variance). This bias arises because maximum likelihood estimates do not take into account the loss of degrees of freedom when estimating fixed effects.

Playing in R with an example

We have data for stem diameters (in mm) for twelve 10 year-old radiata pine (Pinus radiata D. Don) trees:

diams = c(150, 124, 100, 107, 170, 144,
          113, 108, 92, 129, 123, 118)
 
# Obtain typical mean and standard deviation
mean(diams)
[1] 123.1667
 
sd(diams)
[1] 22.38438
 
# A simple function to calculate likelihood.
# Values quickly become very small
like = function(data, mu, sigma) {
    like = 1
    for(obs in data){
        like = like * 1/(sqrt(2*pi)*sigma) *
               exp(-1/2 * (obs - mu)^2/(sigma^2))
    }
    return(like)
}
 
# For example, likelihood of data if mu=120 and sigma=20
like(diams, 120, 20)
[1] 3.476384e-24
 
# A simple function to calculate log-likelihood.
# Values will be easier to manage
loglike = function(data, mu, sigma) {
    loglike = 0
    for(obs in data){
        loglike = loglike +
                  log(1/(sqrt(2*pi)*sigma) *
                      exp(-1/2 * (obs - mu)^2/(sigma^2)))
    }
    return(loglike)
}
 
# Example, log-likelihood of data if mu=122 and sigma=20
loglike(diams, 122, 20)
[1] -53.88605
 
# Let's try some combinations of parameters and
# plot the results
params = expand.grid(mu = seq(50, 200, 1),
                     sigma = seq(10, 40, 1))
params$logL = with(params, loglike(diams, mu, sigma))
summary(params)
 
library(lattice)
contourplot(logL ~ mu*sigma, data = params, cuts = 20)

It is difficult to see the maximum likelihood in this plot, so we will zoom-in by generating a smaller grid around the typical mean and standard deviation estimates.

# Zooming in
params = expand.grid(mu = seq(120, 126, 0.01),
                     sigma = seq(18, 25, 0.1))
params$logL = with(params, loglike(diams, mu, sigma))
summary(params)
 
contourplot(logL ~ mu*sigma, data = params, cuts = 10)

We can now check what would be actual results using any of the functions to fit models using maximum likelihood, for example gls:

library(nlme)
m1 = gls(diams ~ 1, method='ML')
summary(m1)
 
Generalized least squares fit by maximum likelihood
  Model: diams ~ 1 
  Data: NULL 
       AIC      BIC    logLik
  111.6111 112.5809 -53.80556
 
Coefficients:
               Value Std.Error  t-value p-value
(Intercept) 123.1667  6.461815 19.06069       0
 
Residual standard error: 21.43142 
Degrees of freedom: 12 total; 11 residual

In real life, software will use an iterative process to find the combination of parameters that maximizes the log-likelihood value. If we go back to our function and use loglike(diams, 123.1667, 21.43142) we will obtain -53.80556; exactly the same value calculated by gls. In addition, we can see that the estimated standard deviation (21.43) is slightly lower than the one calculated by the function sd() (22.38), because our biased estimate divides the sum of squared deviations by n rather than by n-1.
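We can check that numerically with the diams vector defined above (a quick verification, not part of the original code):

# ML estimate of sigma divides the sum of squares by n; sd() divides by n - 1
sqrt(sum((diams - mean(diams))^2) / length(diams))   # 21.43142
sd(diams)                                            # 22.38438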

P.S. The code for the likelihood and log-likelihood functions is far from being optimal (the loops could be vectorized). However, the loops are easier to understand for many people.
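For completeness, a vectorized version of the log-likelihood could look like this (a sketch; note that, unlike loglike(), it works for a single pair of mu and sigma values at a time):

# Vectorized log-likelihood using dnorm()
loglike2 = function(data, mu, sigma) {
    sum(dnorm(data, mean = mu, sd = sigma, log = TRUE))
}
 
loglike2(diams, 122, 20)   # -53.88605, same as the loop version
 
# For the parameter grid used above one could use, for example:
# params$logL = mapply(function(m, s) loglike2(diams, m, s),
#                      params$mu, params$sigma)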

Simulating data following a given covariance structure

Every year there is at least a couple of occasions when I have to simulate multivariate data that follow a given covariance matrix. For example, let’s say that we want to create an example of the effect of collinearity when fitting multiple linear regressions, so we want to create one variable (the response) that is correlated with a number of explanatory variables and the explanatory variables have different correlations with each other.

There is a matrix operation called Cholesky decomposition, sort of equivalent to taking a square root with scalars, that is useful to produce correlated data. If we have a covariance matrix M, the Cholesky decomposition is a lower triangular matrix L, such that M = L L'. How does this connect to our simulated data? Let’s assume that we generate a vector z of random normally independently distributed numbers with mean zero and variance one (with length equal to the dimension of M); we can then create a realization of our multivariate distribution using the product L z.

The reason why this works is that Variance(L z) = L Variance(z) L', as L is just a constant. The variance of z is the identity matrix I; remember that the random numbers have variance one and are independently distributed. Therefore Variance(L z) = L I L' = L L' = M so, in fact, we are producing random data that follow the desired covariance matrix.

As an example, let’s simulate 100 observations with 4 variables. Because we want to simulate 100 realizations, rather than a single one, it pays to generate a matrix of random numbers with as many rows as variables to simulate and as many columns as observations to simulate.

library(lattice) # for splom
library(car)     # for vif
 
# number of observations to simulate
nobs = 100
 
# Using a correlation matrix (let's assume that all variables
# have unit variance)
M = matrix(c(1, 0.7, 0.7, 0.5,
             0.7, 1, 0.95, 0.3,
             0.7, 0.95, 1, 0.3,
             0.5, 0.3, 0.3, 1), nrow=4, ncol=4)
 
# Cholesky decomposition             
L = chol(M)
nvars = dim(L)[1]
 
 
# R chol function produces an upper triangular version of L
# so we have to transpose it.
# Just to be sure we can have a look at t(L) and the
# product of the Cholesky decomposition by itself
 
t(L)
 
     [,1]       [,2]        [,3]      [,4]
[1,]  1.0  0.0000000  0.00000000 0.0000000
[2,]  0.7  0.7141428  0.00000000 0.0000000
[3,]  0.7  0.6441288  0.30837970 0.0000000
[4,]  0.5 -0.0700140 -0.01589586 0.8630442
 
t(L) %*% L
 
     [,1] [,2] [,3] [,4]
[1,]  1.0 0.70 0.70  0.5
[2,]  0.7 1.00 0.95  0.3
[3,]  0.7 0.95 1.00  0.3
[4,]  0.5 0.30 0.30  1.0
 
 
# Random variables that follow an M correlation matrix
r = t(L) %*% matrix(rnorm(nvars*nobs), nrow=nvars, ncol=nobs)
r = t(r)
 
rdata = as.data.frame(r)
names(rdata) = c('resp', 'pred1', 'pred2', 'pred3')
 
# Plotting and basic stats
splom(rdata)
cor(rdata)
 
# We are not that far from what we want to simulate!
           resp     pred1     pred2     pred3
resp  1.0000000 0.7347572 0.7516808 0.3915817
pred1 0.7347572 1.0000000 0.9587386 0.2841598
pred2 0.7516808 0.9587386 1.0000000 0.2942844
pred3 0.3915817 0.2841598 0.2942844 1.0000000

Now we can use the simulated data to learn something about the effects of collinearity when fitting multiple linear regressions. We will first fit two models using two predictors with low correlation between them, and then fit a third model with three predictors where pred1 and pred2 are highly correlated with each other.

# Model 1: predictors 1 and 3 (correlation is 0.28)
m1 = lm(resp ~ pred1 + pred3, rdata)
summary(m1)
 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.07536    0.06812   1.106  0.27133    
pred1        0.67316    0.06842   9.838 2.99e-16 ***
pred3        0.20920    0.07253   2.884  0.00483 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
Residual standard error: 0.6809 on 97 degrees of freedom
Multiple R-squared: 0.5762,	Adjusted R-squared: 0.5675 
F-statistic: 65.95 on 2 and 97 DF,  p-value: < 2.2e-16 
 
# Model 2: predictors 2 and 3 (correlation is 0.29)
m2 = lm(resp ~ pred2 + pred3, rdata)
summary(m2)
 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.06121    0.06649   0.921  0.35953    
pred2        0.68513    0.06633  10.329  < 2e-16 ***
pred3        0.19624    0.07097   2.765  0.00681 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
Residual standard error: 0.6641 on 97 degrees of freedom
Multiple R-squared: 0.5968,	Adjusted R-squared: 0.5885 
F-statistic: 71.79 on 2 and 97 DF,  p-value: < 2.2e-16 
 
# Model 3: correlation between predictors 1 and 2 is 0.96
m3 = lm(resp ~ pred1 + pred2 + pred3, rdata)
summary(m3)
 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.06421    0.06676   0.962  0.33856   
pred1        0.16844    0.22560   0.747  0.45712   
pred2        0.52525    0.22422   2.343  0.02122 * 
pred3        0.19584    0.07114   2.753  0.00706 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
Residual standard error: 0.6657 on 96 degrees of freedom
Multiple R-squared: 0.5991,	Adjusted R-squared: 0.5866 
F-statistic: 47.83 on 3 and 96 DF,  p-value: < 2.2e-16 
 
 
# Variance inflation
vif(m3)
 
    pred1     pred2     pred3 
12.373826 12.453165  1.094875

In my example it is possible to see the huge increase in the standard errors for pred1 and pred2 when we use both highly correlated explanatory variables in model 3. In addition, model fit does not improve for model 3.