Lattice when modeling, ggplot when publishing

When working on research projects I tend to fit several, sometimes quite a few, alternative models. This model fitting is informed by theoretical considerations (e.g. quantitative genetics, the experimental design we used, our understanding of the process under study, etc.) but also by visual inspection of the data. Trellis graphics—where subsets of data are plotted in different ‘panels’ defined by one or more factors—are extremely useful for generating research hypotheses.

There are two packages in R that have good support for trellis graphics: lattice and ggplot2. Lattice is the older of the two, while ggplot2 is probably more consistent (it implements a grammar of graphics) and more popular with the cool kids and the data visualization crowd. However, lattice is also quite fast, while ggplot2 can be slow as a dog (certainly way slower than my dog).

Tree-breeding progeny trials often have between 1,000 and 12,000 individuals, and analyses commonly include several trials. Thus, it is not unusual to have tens of thousands or even hundreds of thousands of records involved in an analysis. Add to this situation that I am impatient and you will understand that differences in speed can make a big difference to my mental health. But how different is the speed? We can simulate some correlated data (following the explanation in this post) and build a simple scatterplot faceted by site; let’s say 60,000 observations in 6 sites (10,000 per site).

[sourcecode lang="r"]
library(lattice)
library(ggplot2)

# Number of observations and sites to simulate
nobs = 60000
sites = 6

# Correlation matrix (assuming that all variables have unit variance)
M = matrix(c(1, 0.7,
             0.7, 1), nrow = 2, ncol = 2)

# Cholesky decomposition
L = chol(M)
nvars = dim(L)[1]

# Random variables that follow an M correlation matrix
r = t(L) %*% matrix(rnorm(nvars * nobs), nrow = nvars, ncol = nobs)
r = t(r)

rdata = as.data.frame(r)
names(rdata) = c('x', 'y')
rdata$site = factor(rep(1:sites, each = nobs/sites))

# Plotting with lattice
xyplot(y ~ x | site, data = rdata,
       layout = c(3, 2), type = c('p', 'smooth'))

# Plotting with ggplot2
qplot(x, y, data = rdata,
      geom = c('point', 'smooth')) + facet_wrap(~ site)
[/sourcecode]

The timing was done by wrapping each graph call (either xyplot() or qplot()) in system.time(print()), so the graph is sent to the screen and the whole operation is timed. In summary, for this simple call ggplot2 takes a bit over twice as long as lattice, and the more layers you add to the graph the slower it gets.
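As a sketch, the timing calls look like this (the absolute numbers will depend on your machine):

[sourcecode lang="r"]
# Print inside system.time() so rendering is part of what gets timed
system.time(print(xyplot(y ~ x | site, data = rdata,
                         layout = c(3, 2), type = c('p', 'smooth'))))

system.time(print(qplot(x, y, data = rdata,
                        geom = c('point', 'smooth')) + facet_wrap(~ site)))
[/sourcecode]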

The two plots are below. We could improve both plots and make them look more similar to each other, but I want to avoid introducing more distractions in the code.

Nevertheless, I do like the flexibility of ggplot2, so I do most of my exploratory data analysis using lattice, but when I have to create the final pretty plots for journal publications I move to ggplot2. I subscribe to Frank Harrell’s Philosophy of Biostatistics, which includes ‘use excellent graphics, liberally’. Switching between packages lets me deal with both the abundance of graphics and my impatience.

This is R pitfall #2: plots inside a function (and system.time() is a function) have to be surrounded by print() or they won’t be sent to the screen. Pitfall #1 is here.

Linear mixed models in R

A substantial part of my job has little to do with statistics; nevertheless, a large proportion of the statistical side of things relates to applications of linear mixed models. The bulk of my use of mixed models relates to the analysis of experiments that have a genetic structure.

A brief history of time

At the beginning (1992-1995) I would use SAS (first proc glm, later proc mixed), but things started getting painfully slow and limiting if one wanted to move into animal model BLUP. At that time (1995-1996), I moved to DFREML (by Karen Meyer, now replaced by WOMBAT) and AIREML (by Dave Johnson, now defunct—the program I mean), which were designed for the analysis of animal breeding progeny trials, so it was a hassle to deal with experimental design features. At the end of 1996 (or was it the beginning of 1997?) I started playing with ASReml (programmed by Arthur Gilmour mostly based on theoretical work by Robin Thompson and Brian Cullis). I was still using SAS for data preparation, but all my analyses went through ASReml (for which I wrote the cookbook), which was orders of magnitude faster than SAS (and could deal with much bigger problems). Around 1999, I started playing with R (prompted by a suggestion from Rod Ball), but I didn’t really use R/S+ often enough until 2003. At the end of 2005 I started using OS X and quickly realized that using a virtual machine or dual booting was not really worth it, so I dropped SAS and totally relied on R in 2009.

Options

As with many other problems, there are several packages in R that let you deal with linear mixed models from a frequentist (REML) point of view. I will only mention nlme (Non-Linear Mixed Effects), lme4 (Linear Mixed Effects) and asreml (average spatial reml). There are also several options for Bayesian approaches, but that will be the subject of another post.

nlme is the most mature one and comes by default with any R installation. In addition to fitting hierarchical linear mixed models it also allows fitting non-linear mixed models, in both cases assuming a Gaussian distribution (my explanation wasn’t very clear, thanks to ucfagls below for pointing this out). Its main advantages are, in my humble opinion, the ability to fit fairly complex hierarchical models using linear or non-linear approaches, a good variety of variance and correlation structures, and access to several distributions and link functions for generalized models. In my opinion, its main drawbacks are i- fitting cross-classified random factors is a pain, ii- it can be slow and may struggle with lots of data, iii- it does not deal with pedigrees by default and iv- it does not deal with multivariate data.

lme4 is a project led by Douglas Bates (one of the co-authors of nlme), looking at modernizing the code and making room for trying new ideas. On the positive side, it seems to be a bit faster than nlme and it deals a lot better with cross-classified random factors. Drawbacks: similar to nlme’s, but dropping point i- and adding that it doesn’t deal with covariance and correlation structures yet. It is possible to fit pedigrees using the pedigreemm package, but I find the combination a bit flimsy.

ASReml-R is, unsurprisingly, an R package interface to ASReml. On the plus side it i- deals well with cross-classified random effects, ii- copes very well with pedigrees, iii- can work with fairly large datasets, iv- can run multivariate analyses and v- covers a large number of covariance and correlation structures. Main drawbacks are i- limited functionality for non-Gaussian responses, ii- it does not cover non-linear models and iii- it is non-free (as in beer and speech). The last drawback is relative; it is possible to freely use asreml for academic purposes (and there is also a version for developing countries). Besides researchers, the main users of ASReml/ASReml-R are breeding companies.

All three packages are available for Windows, Linux and OS X.

A (very) simple example

I will use a traditional dataset to show examples of the notation for the three packages: Yates’ variety and nitrogen split-plot experiment. We can get the dataset from the MASS package, after which it is a good idea to rename the variables using meaningful names. In addition, I will follow Bill Venables’s excellent advice and create additional variables for main plot and subplots, as it is confusing to use the same factor for two purposes (e.g. variety as treatment and main plot). Incidentally, if you haven’t read Bill’s post go and read it; it is one of the best explanations I have ever seen for a split-plot analysis.
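A sketch of that data preparation (the new variable names are my choice):

[sourcecode lang="r"]
library(MASS)

# Yates's variety and nitrogen split-plot experiment
data(oats)
names(oats) = c('block', 'variety', 'nitrogen', 'yield')

# Separate variables for the design roles, so the same factor is not used for two purposes
oats$mainplot = oats$variety
oats$subplot = oats$nitrogen

summary(oats)
[/sourcecode]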

The nlme code for this analysis is fairly simple: response on the left-hand side of the tilde, followed by the fixed effects (variety, nitrogen and their interaction). Then there is the specification of the random effects (which also uses a tilde) and the data set containing all the data. Notice that 1|block/mainplot is fitting block and mainplot within block. There is no reference to subplot as there is a single assessment for each subplot, which ends up being used at the residual level.
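A minimal sketch of that call, continuing with the oats data frame created above:

[sourcecode lang="r"]
library(nlme)

m1.nlme = lme(yield ~ variety * nitrogen,
              random = ~ 1 | block/mainplot,
              data = oats)
summary(m1.nlme)
[/sourcecode]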

The syntax for lme4 is not that dissimilar, with random effects specified using a (1|something here) syntax. One difference between the two packages is that nlme reports standard deviations instead of variances for the random effects.
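The equivalent lmer() call would look something like this:

[sourcecode lang="r"]
library(lme4)

m1.lme4 = lmer(yield ~ variety * nitrogen + (1 | block/mainplot),
               data = oats)
summary(m1.lme4)
[/sourcecode]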

For this type of problem, the notation for asreml is also very similar, particularly when compared to nlme.
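A sketch of the asreml version (this assumes access to the commercial asreml package):

[sourcecode lang="r"]
library(asreml)

m1.asreml = asreml(yield ~ variety * nitrogen,
                   random = ~ block/mainplot,
                   data = oats)
summary(m1.asreml)$varcomp
[/sourcecode]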

In this simple example one pretty much gets the same results, independently of the package used (which is certainly comforting). I will soon cover another simple model, but with a much larger dataset, to highlight some performance differences between the packages.

Maximum likelihood

This post is one of those ‘explain to myself how things work’ documents, which are not necessarily completely correct but are close enough to facilitate understanding.

Background

Let’s assume that we are working with a fairly simple linear model, where we only have a response variable (say tree stem diameter in cm). If we want to ‘guess’ the diameter of a tree (\(y_i\)) our best bet is the average (μ) and we will have a residual (\(\varepsilon_i\)). The model equation then looks like:

\(y_i = \mu + \varepsilon_i\)

We want to estimate both the overall mean (μ) and the error variance (\(\sigma_{\varepsilon}^2\)), for which we could use least squares. However, we will use an alternative method (maximum likelihood) because that is the point of this post. A likelihood function expresses the probability of obtaining the observed sample from a population given a set of model parameters. Thus, it answers the question: what is the probability of observing the current dataset when I assume a given set of parameters for my linear model? To do so, we have to assume a distribution, say normal, for which the probability density function (p.d.f.) is:

\(p.d.f.(y) = \frac{1}{\sqrt{2 \pi}\, \sigma}\, e^{-\frac{1}{2} \frac{(y - \mu)^2}{\sigma^2}}\)

with n independent samples the likelihood (L) is:

\(L = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi}\, \sigma}\, e^{-\frac{1}{2} \frac{(y_i - \mu)^2}{\sigma^2}}\)

where ∏ is a multiplication operator, analogous to the summation operator ∑. Maximizing L or its natural logarithm (LogL) is equivalent, then:

\(\log L = \sum_{i=1}^n \log \left( \frac{1}{\sqrt{2 \pi}\, \sigma}\, e^{-\frac{1}{2} \frac{(y_i - \mu)^2}{\sigma^2}} \right)\)

\(\log L = \sum_{i=1}^n \left[ \log \left( \frac{1}{\sqrt{2 \pi}\, \sigma} \right) + \log \left( e^{-\frac{1}{2} \frac{(y_i - \mu)^2}{\sigma^2}} \right) \right]\)

\(\log L = \sum_{i=1}^n \log \left[ (2 \pi)^{-1/2} \right] + \sum_{i=1}^n \log \left( \frac{1}{\sigma} \right) + \sum_{i=1}^n \left( -\frac{1}{2} \frac{(y_i - \mu)^2}{\sigma^2} \right)\)

\(\log L = -\frac{n}{2} \log(2 \pi) - n \log(\sigma) - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu)^2\)

Considering only μ, LogL is maximized when \(\sum_{i=1}^n (y_i - \mu)^2\) is minimized, i.e. for:

\(\mu = \frac{\sum_{i=1}^n y_i}{n}\)

Now considering σ:

\(\frac{\partial \log L}{\partial \sigma} = -\frac{n}{\sigma} - (-2) \frac{1}{2} \sum_{i=1}^n (y_i - \mu)^2 \sigma^{-3}\)
\(\frac{\partial \log L}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (y_i - \mu)^2\)

and setting the last expression equal to 0:

\(\frac{n}{\sigma} = \frac{1}{\sigma^3} \sum_{i=1}^n (y_i - \mu)^2\)
\(\sigma^2 = \frac{\sum_{i=1}^n (y_i - \mu)^2}{n}\)

We can obtain an estimate of the error variance (\(\hat{\sigma}^2\), notice the hat) by replacing μ by our previous estimate:
\(\hat{\sigma}^2 = \frac{\sum_{i=1}^n (y_i - \bar{y})^2}{n}\)

which is biased, because the denominator is n rather than (n – 1) (the typical denominator for sample variance). This bias arises because maximum likelihood estimates do not take into account the loss of degrees of freedom when estimating fixed effects.

Playing in R with an example

We have data for stem diameters (in mm) of twelve 10-year-old radiata pine (Pinus radiata D. Don) trees:
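The original measurements and code are not reproduced here, so the following is a sketch: diams is a simulated stand-in for the twelve diameters (the specific numbers quoted further down, 123.1667 and so on, refer to the original data), loglike() is a deliberately loop-based log-likelihood function like the one referred to later, and the plot evaluates the log-likelihood over a wide grid of candidate parameters.

[sourcecode lang="r"]
library(lattice)

# Simulated stand-in for the twelve stem diameters (mm); replace with the real data
set.seed(42)
diams = round(rnorm(12, mean = 123, sd = 22))

# Log-likelihood of a normal sample, written with an explicit loop for clarity
loglike = function(x, mu, sigma) {
    ll = 0
    for (i in 1:length(x)) {
        ll = ll + log(dnorm(x[i], mean = mu, sd = sigma))
    }
    ll
}

# Coarse grid of candidate means and standard deviations
surface = expand.grid(mu = seq(50, 200, 5), sigma = seq(10, 50, 2))
surface$ll = apply(surface, 1, function(p) loglike(diams, p['mu'], p['sigma']))
contourplot(ll ~ mu * sigma, data = surface)
[/sourcecode]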

It is difficult to see the maximum likelihood in this plot, so we will zoom in by generating a smaller grid around the typical mean and standard deviation estimates.
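Continuing the sketch, a finer grid centred on the sample mean and standard deviation:

[sourcecode lang="r"]
# Zoom in around the usual estimates of the mean and standard deviation
zoom = expand.grid(mu = seq(mean(diams) - 10, mean(diams) + 10, 0.5),
                   sigma = seq(sd(diams) - 5, sd(diams) + 5, 0.25))
zoom$ll = apply(zoom, 1, function(p) loglike(diams, p['mu'], p['sigma']))
contourplot(ll ~ mu * sigma, data = zoom)

# Parameter combination with the highest log-likelihood in the grid
zoom[which.max(zoom$ll), ]
[/sourcecode]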

We can now check the actual results using any of the functions that fit models by maximum likelihood, for example gls():
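For example, with gls() from nlme (method = 'ML' requests maximum likelihood rather than the default REML):

[sourcecode lang="r"]
library(nlme)

# Intercept-only model fitted by maximum likelihood
m1 = gls(diams ~ 1, method = 'ML')
summary(m1)
logLik(m1)
[/sourcecode]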

In real life, software will use an iterative process to find the combination of parameters that maximizes the log-likelihood value. If we go back to our function and use loglike(diams, 123.1667, 21.43142) we will obtain -53.80556; exactly the same value calculated by gls. In addition, we can see that the estimated standard deviation (21.43) is slightly lower than the one calculated by the function sd() (22.38), because our biased estimate divides the sum of squared deviations by n rather than by n-1.

P.S. The code for the likelihood and log-likelihood functions is far from being optimal (the loops could be vectorized). However, the loops are easier to understand for many people.

Simulating data following a given covariance structure

Every year there are at least a couple of occasions when I have to simulate multivariate data that follow a given covariance matrix. For example, let’s say that we want to show the effect of collinearity when fitting multiple linear regressions, so we want to create one variable (the response) that is correlated with a number of explanatory variables, while the explanatory variables have different correlations with each other.

There is a matrix operation called the Cholesky decomposition, sort of equivalent to taking a square root of a scalar, that is useful for producing correlated data. If we have a covariance matrix M, the Cholesky decomposition is a lower triangular matrix L such that M = L L'. How does this connect to our simulated data? If we generate a vector z of independent, normally distributed numbers with mean zero and variance one (with length equal to the dimension of M), we can create a realization of our multivariate distribution using the product L z.

The reason why this works is that Variance(L z) = L Variance(z) L', as L is just a constant. The variance of z is the identity matrix I; remember that the random numbers have variance one and are independently distributed. Therefore Variance(L z) = L I L' = L L' = M, so we are in fact producing random data that follow the desired covariance matrix.

As an example, let’s simulate 100 observations with 4 variables. Because we want to simulate 100 realizations, rather than a single one, it pays to generate a matrix of random numbers with as many rows as variables to simulate and as many columns as observations to simulate.
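Here is a sketch under those assumptions; the correlation values below are hypothetical, chosen so that the first variable acts as a response and the second and third predictors are highly correlated with each other:

[sourcecode lang="r"]
set.seed(2011)

# Number of observations and hypothetical correlation matrix (unit variances assumed)
nobs = 100
M = matrix(c(1.0, 0.7,  0.7,  0.5,
             0.7, 1.0,  0.95, 0.3,
             0.7, 0.95, 1.0,  0.3,
             0.5, 0.3,  0.3,  1.0), nrow = 4, ncol = 4)

# Cholesky factor (R returns the upper triangle, hence the t() below)
L = chol(M)
nvars = dim(L)[1]

# One column of independent standard normal deviates per observation
r = t(L) %*% matrix(rnorm(nvars * nobs), nrow = nvars, ncol = nobs)
r = t(r)

rdata = as.data.frame(r)
names(rdata) = c('resp', 'pred1', 'pred2', 'pred3')
cor(rdata)   # should be reasonably close to M
[/sourcecode]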

Now we can use the simulated data to learn something about the effects of collinearity when fitting multiple linear regressions. We will first fit two models using two predictors with low correlation between them, and then fit a third model with three predictors where pred1 and pred2 are highly correlated with each other.
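Continuing with the simulated rdata from above:

[sourcecode lang="r"]
# Two models using predictors with low correlation between them
m1 = lm(resp ~ pred1 + pred3, data = rdata)
m2 = lm(resp ~ pred2 + pred3, data = rdata)

# A third model that includes the two highly correlated predictors together
m3 = lm(resp ~ pred1 + pred2 + pred3, data = rdata)

summary(m1)
summary(m2)
summary(m3)
[/sourcecode]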

In my example it is possible to see the huge increase in the standard errors for pred1 and pred2 when we use both highly correlated explanatory variables in model 3. In addition, the model fit does not improve for model 3.

Setting plots side by side

This is simple example code to display side-by-side lattice plots or ggplot2 plots, using the mtcars dataset that comes with any R installation. We will display a scatterplot of miles per US gallon (mpg) on car weight (wt) next to another scatterplot of the same data, but using different colors by number of engine cylinders (cyl, treated as factor) and adding a smooth line (under the type option).
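A sketch of the lattice version:

[sourcecode lang="r"]
library(lattice)

mtcars$cyl = factor(mtcars$cyl)

p1 = xyplot(mpg ~ wt, data = mtcars)
p2 = xyplot(mpg ~ wt, groups = cyl, data = mtcars,
            type = c('p', 'smooth'))

# position = c(xmin, ymin, xmax, ymax); more = TRUE keeps the device open for the next plot
print(p1, position = c(0, 0, 0.5, 1), more = TRUE)
print(p2, position = c(0.5, 0, 1, 1))
[/sourcecode]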

According to the documentation, position is a vector of 4 numbers, typically c(xmin, ymin, xmax, ymax) that give the lower-left and upper-right corners of a rectangle in which the Trellis plot of x is to be positioned. The coordinate system for this rectangle is [0-1] in both the x and y directions. That is, the first print() sets position to occupy the left part of the graph with full height, as well as to avoid refreshing the graph when displaying the new plot (more = TRUE). The second print() uses the right part of the graph with full height.

In the case of ggplot2, the code is not that different:
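One way to do it, broadly mirroring the lattice calls, is with grid viewports (packages such as gridExtra offer alternatives):

[sourcecode lang="r"]
library(ggplot2)
library(grid)

mtcars$cyl = factor(mtcars$cyl)

g1 = ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
g2 = ggplot(mtcars, aes(x = wt, y = mpg, colour = cyl)) +
     geom_point() + geom_smooth(se = FALSE)

# Split the device into two columns and print one plot in each
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(g1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(g2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
[/sourcecode]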

More details on ggplot’s notation can be found here.

Upgrading R (and packages)

I tend not to upgrade R very often (running from six months to a year behind in version numbers) because upgrading means reinstalling all packages: a real pain. A quick search shows that people have come up with good solutions to this problem, as presented in this stackoverflow thread. I used the code on my Mac:
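The gist is to save the list of installed packages before upgrading and reinstall them afterwards; a sketch of the idea (the file path is arbitrary):

[sourcecode lang="r"]
# Before upgrading: save the names of all installed packages
package.list = installed.packages()[, 'Package']
save(package.list, file = '~/installed-packages.RData')

# After installing the new version of R: reinstall them from CRAN
load('~/installed-packages.RData')
install.packages(package.list)
[/sourcecode]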

Of all the installed packages, I only had issues with 5 of them, which required installation from their respective websites: Acinonyx, INLA (and AnimalINLA) and asreml. Package graph is now available from bioconductor.org. INLA can be installed really easily from inside R using the script provided on its website, while I did not bother downloading asreml again and just copied the folder from ~/Library/R/OldVersion/library/asreml to ~/Library/R/CurrentVersion/library/asreml.

Overall, it was a good upgrade experience, so thanks to the stackoverflow crowd for so many ideas on how to make R even nicer than it is.

P.S. 2011-10-14 Similar instructions, but including compiling R and installing bioconductor.

Reading HTML pages in R for text processing

We were talking with one of my colleagues about doing some text analysis—which, by the way, I have never done before—for which the first issue is getting the text into R. Not just any text, but files that can be accessed through the internet. In summary, we need to access an HTML file, parse it so we can access specific content and then remove the HTML tags. Finally, we may want to replace some text (the end-of-line <br /> tags, for example) before continuing to process the files.

The package XML has the necessary functionality to deal with HTML, while the rest is done using a few standard R functions.
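A sketch of the whole process; the URL is a placeholder for wherever babel.html actually lives:

[sourcecode lang="r"]
library(XML)

# Hypothetical location of the file
my.url = 'http://example.com/babel.html'

# Parse the page and keep only the text inside paragraph tags
doc = htmlParse(my.url)
paragraphs = xpathSApply(doc, '//p', xmlValue)

# Replace end of lines with spaces before further processing
clean.text = gsub('\n', ' ', paragraphs)
head(clean.text)
[/sourcecode]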

Incidentally, babel.html contains a translation of the short story ‘The Library of Babel’ by Jorge Luis Borges. Great story! We can repeat this process with several files and then create a corpus (and analyze it) using the tm package.

Operating on datasets inside a function

There are times when we need to write a function that makes changes to a generic data frame that is passed as an argument. Let’s say, for example, that we want to write a function that converts to factor any variable with names starting with a capital letter. There are a few issues involved in this problem, including:

  • Obtaining a text version of the name of the dataset (using the substitute() function).
  • Looping over the variable names and checking if they start with a capital letter (comparing with the LETTERS vector of constants).
  • Generating the plain text version of the factor conversion, glueing the dataset and variable names (using paste()).
  • Parsing the plain text version of the code to R code (using parse()) and evaluating it (using eval()). This evaluation has to be done in the parent environment or we will lose any transformation when we leave the function, which is the reason for the envir argument in the eval() call. A sketch putting these pieces together follows the list.
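A sketch of such a function; the function name and the example data frame are made up for illustration:

[sourcecode lang="r"]
capital.to.factor = function(dataset) {
    # Text version of the dataset name, e.g. 'mydata'
    dataset.name = deparse(substitute(dataset))
    for (variable in names(dataset)) {
        if (substr(variable, 1, 1) %in% LETTERS) {
            # Plain text version of the conversion, e.g. "mydata$Fert = factor(mydata$Fert)"
            code.text = paste(dataset.name, '$', variable, ' = factor(',
                              dataset.name, '$', variable, ')', sep = '')
            # Parse and evaluate in the calling environment, so the change persists
            eval(parse(text = code.text), envir = parent.frame())
        }
    }
}

# Hypothetical usage
mydata = data.frame(Fert = c(1, 2, 1, 2), yield = rnorm(4))
capital.to.factor(mydata)
str(mydata)   # Fert is now a factor
[/sourcecode]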

And that’s all. Now the Fert integer variable has been converted to a factor. This example function could be useful for someone out there.

A brief idea of style

Once one starts writing more R code the need for consistency increases, as it facilitates managing larger projects and their maintenance. There are several style guides or suggestions for R; for example, Andrew Gelman’s, Hadley Wickham’s, Bioconductor’s and this one. I tend to write closer to Google’s R style guide, which contains some helpful suggestions. I use something similar but:

  • I use = for assignment rather than <-, because it is visually less noisy, <- requires an extra keystroke (yes, I am that lazy) and, from a purely aesthetic point of view, in many monospaced fonts the less-than and hyphen symbols do not align properly, so <- does not look like an arrow. I know that hardcore R programmers prefer the other symbol but, tough, I prefer the equal sign.
  • I indent code using four spaces, just because I am used to doing so in Python. I will make an exception and go down to two spaces if there are too many nested clauses.
  • I like their identifier naming scheme, although I do not use it consistently. Mea culpa.
  • I always use single quotes for text (two fewer keystrokes per text variable).

Of course you’ll find that the examples presented in this site depart from the style guide. I didn’t say that I was consistent, did I?

All combinations for levelplot

In a previous post I explained how to create all possible combinations of the levels of two factors using expand.grid(). Another use for this function is to create a regular grid for two variables to create a levelplot or a contour plot.

For example, let’s say that we have fitted a multiple linear regression to predict wood stiffness (stiff, the response) using basic density (bd) and a measure of microfibril angle (t) as explanatory variables. The regression equation could be something like stiffness = 3.439 + 0.009 bd - 0.052 t. In our dataset bd had a range of 300 to 700 kg m-3, while t had a range from 50 to 70.

We will use the levelplot() function that is part of the lattice package of graphical functions, create a grid for both explanatory variables (every 10 for bd and every 1 for t), predict values of stiffness for all combinations of bd and t, and plot the results.
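A sketch following that description; bd, t and the regression coefficients come straight from the paragraphs above:

[sourcecode lang="r"]
library(lattice)

# Regular grid covering the observed ranges: every 10 units for bd, every 1 for t
wood = expand.grid(bd = seq(300, 700, 10), t = seq(50, 70, 1))

# Predicted stiffness from the fitted equation
wood$stiffness = 3.439 + 0.009 * wood$bd - 0.052 * wood$t

levelplot(stiffness ~ bd * t, data = wood,
          xlab = 'Basic density', ylab = 'Microfibril angle')
[/sourcecode]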

This code creates a graph like this. Simple.

Wood stiffness levelplot