When working in research projects I tend to fit several, sometimes quite a few, alternative models. This model fitting is informed by theoretical considerations (e.g. quantitative genetics, experimental design we used, our understanding of the process under study, etc.) but also by visual inspection of the data. Trellis graphics—where subsets of data are plotted in different ‘panels’ defined by one or more factors—are extremely useful to generate research hypotheses.

There are two packages in R that have good support for trellis graphics: lattice and ggplot2. Lattice is the oldest, while ggplot2 is probably more consistent (implementing a grammar of graphics) and popular with the cool kids and the data visualization crowd. However, lattice is also quite fast, while ggplot2 can be slow as a dog (certainly way slower than *my* dog).

Tree-breeding progeny trials often have between 1,000 and 12,000 individuals, and analyses commonly include several trials. Thus, it is not unusual to have tens of thousands or even hundreds of thousand of records that will be involved in the analysis. Add to this situation that I am impatient and you will understand that differences on speed can make a big difference to my mental health. But how different is the speed? We can simulate some correlated data (following the explanation in this post) and build a simple scatterplot faceted by site; let’s say 60,000 observations in 6 sites (10,000 per site).

[sourcecode lang=”r”]

library(lattice)

library(ggplot2)

# number of observations to simulate

nobs = 60000

sites = 6

# Using a correlation matrix (let’s assume that all variables

# have unit variance

M = matrix(c(1, 0.7,

0.7, 1), nrow=2, ncol=2)

# Cholesky decomposition

L = chol(M)

nvars = dim(L)[1]

# Random variables that follow an M correlation matrix

r = t(L) %*% matrix(rnorm(nvars*nobs), nrow=nvars, ncol=nobs)

r = t(r)

rdata = as.data.frame(r)

names(rdata) = c(‘x’, ‘y’)

rdata$site = factor(rep(1:sites, each = nobs/sites))

# Plotting with lattice

xyplot(y ~ x | site, data = rdata,

layout = c(3, 2), type=c(‘p’,’smooth’))

# Plotting with ggplot2

qplot(x, y, facets = ~ site,

geom=c(‘point’, ‘smooth’),

data = rdata) + facet_wrap(~site)

[/sourcecode]

The timing was done surrounding the graph calls (either `xyplot()`

or `qplot()`

) by `system.time(print())`

, so the graph is sent to the screen and the operation is timed^{†}. In summary, in this simple call ggplot2 takes a bit over double the time than lattice. The more layers you add to the graph the slower it gets.

The two plots are below. We could improve both plots and make them look more similar to each other, but I want to avoid introducing more distractions in the code.

Nevertheless, I do like the flexibility of ggplot2, so I support most of my exploratory data analysis using lattice but when I have to create the final pretty plots for publications in journals I go back to ggplot2. I subscribe to Frank Harrell’s Philosophy of Biostatistics, which includes *‘use excellent graphics, liberally’*. Switching between packages let me deal with both abundance of graphics and impatience.

^{†} This is R pitfall #2: plots inside a function (and `system.time()`

is a function) have to be surrounded by `print()`

or they won’t be sent to the screen. Pitfall #1 is here.

2011/10/18 at 12:09 pm

There is a package may draw your interests. It is called latticeExtra <a href="http://(http://latticeextra.r-forge.r-project.org/#ggplot2like&theme=ggplot2like” target=”_blank”>(http://latticeextra.r-forge.r-project.org/#ggplot2like&theme=ggplot2like.

This package allows lattice graph to mimic ggplot graphs in terms of appearance, fairly easily.

2011/10/18 at 2:42 pm

Thanks for that and one small correction to the link so it works. In addition, Hadley Wickham informs me that ggplot2 will be faster.

2011/10/18 at 3:52 pm

how about geom_hexbin ???

2011/10/19 at 3:06 am

And / or using transparency.

2011/10/19 at 9:03 pm

Picky, picky… I used transparency with ggplot2 in my next post on large linear mixed linear models. Both lattice and ggplot2 have a large number of options that could improve the display of large number of points, but I didn't want to distract the reader from my main point.