Categories

## Cute Gibbs sampling for rounded observations

I was attending a course of Bayesian Statistics where this problem showed up:

There is a number of individuals, say 12, who take a pass/fail test 15 times. For each individual we have recorded the number of passes, which can go from 0 to 15. Because of confidentiality issues, we are presented with rounded-to-the-closest-multiple-of-3 data ($$\mathbf{R}$$). We are interested on estimating $$\theta$$ of the Binomial distribution behind the data.

Rounding is probabilistic, with probability 2/3 if you are one count away from a multiple of 3 and probability 1/3 if the count is you are two counts away. Multiples of 3 are not rounded.

We can use Gibbs sampling to alternate between sampling the posterior for the unrounded $$\mathbf{Y}$$ and $$\theta$$. In the case of $$\mathbf{Y}$$ I used:

# Possible values that were rounded to R
possible <- function(rounded) {
if(rounded == 0) {
options <- c(0, 1, 2)
} else {
options <- c(rounded - 2, rounded - 1, rounded,
rounded + 1, rounded + 2)
}
return(options)
}

# Probability mass function of numbers rounding to R
# given theta
prior_y <- function(options, theta) {
p <- dbinom(options, 15, prob = theta)
return(p)
}

# Likelihood of rounding
like_round3 <- function(options) {
if(length(options) == 3) {
like <- c(1, 2/3, 1/3) }
else {
like <- c(1/3, 2/3, 1, 2/3, 1/3)
}
return(like)
}

# Estimating posterior mass function and drawing a
# random value of it
posterior_sample_y <- function(R, theta) {
po <- possible(R)
pr <- prior_y(po, theta)
li <- like_round3(po)
post <- li*pr/sum(li*pr)
samp <- sample(po, 1, prob = post)
return(samp)
}


While for $$theta$$ we are assuming a vague $$mbox{Beta}(alpha, eta)$$, with $$alpha$$ and $$eta$$ equal to 1, as prior density function for $$theta$$, so the posterior density is a $$mbox{Beta}(alpha + sum Y_i, eta + 12*15 - sum Y_i)$$.

## Function to sample from the posterior Pr(theta | Y, R)
posterior_sample_theta <- function(alpha, beta, Y) {
theta <- rbeta(1, alpha + sum(Y), beta + 12*15 - sum(Y))
return(theta)
}


I then implemented the sampler as:

## Data
R <- c(0, 0, 3, 9, 3, 0, 6, 3, 0, 6, 0, 3)
nsim <- 10000
burnin <- 1000
alpha <- 1
beta <- 1
store <- matrix(0, nrow = nsim, ncol = length(R) + 1)

starting.values <- c(R, 0.1)

## Sampling
store[1,] <- starting.values
for(draw in 2:nsim){
current <- store[draw - 1,]
for(obs in 1:length(R)) {
y <- posterior_sample_y(R[obs], current[length(R) + 1])
# Jump or not still missing
current[obs] <- y
}
theta <- posterior_sample_theta(alpha, beta, current[1:length(R)])
# Jump or not still missing
current[length(R) + 1] <- theta

store[draw,] <- current
}


And plotted the results as:

plot((burnin+1):nsim, store[(burnin+1):nsim,13], type = 'l')

library(ggplot2)

ggplot(data.frame(theta = store[(burnin+1):nsim,13]), aes(x = theta)) +
geom_density(fill = 'blue', alpha = 0.5)

multiple_plot <- data.frame(Y = matrix(store[(burnin+1):nsim, 1:12],
nrow = (nsim - burnin)*12,
ncol = 1))
multiple_plot$obs <- factor(rep(1:12, each = (nsim - burnin))) ggplot(multiple_plot, aes(x = Y)) + geom_histogram() + facet_grid(~obs)  I thought it was a nice, cute example of simultaneously estimating a latent variable and, based on that, estimating the parameter behind it. Categories ## Less wordy R The Swarm Lab presents a nice comparison of R and Python code for a simple (read ‘one could do it in Excel’) problem. The example works, but I was surprised by how wordy the R code was and decided to check if one could easily produce a shorter version. The beginning is pretty much the same, although I’ll use ggplot2 rather than lattice, because it will be a lot easier (and shorter) to get the desired appearance for the plots: require(Quandl) require(ggplot2) # Load data from Quandl my.data = Quandl("TPC/HIST_RECEIPT", start_date = "1945-12-31", end_date = "2013-12-31")  The whole example relies on only three variables and—as I am not great at typing—I tend to work with shorter variable names. I directly changed the names for variables 1 to 3: # Display first lines of the data frame # and set short names for first three columns head(my.data) names(my.data)[1:3] = c('year', 'indtax', 'corptax')  It is a lot easier to compare the regression lines if we change the shape of the data set from wide to long, where there is one variable for year, one for tax type, and one for the actual tax rate. It would be possible to use one of Hadley’s packages to get a simpler syntax for this, but I decided to stick to the minimum set of requirements: # Change shape to fit both regressions simultaneously mdlong = reshape(my.data[, 1:3], idvar = 'year', times = c('Individual', 'Corporate'), varying = list(2:3), direction = 'long') mdlong$taxtype = factor(mdlongtime)  And now we are ready to produce the plots. The first one can be a rough cut to see if we get the right elements: ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) + geom_point() + geom_line() + geom_smooth(method = 'lm')  Yes, this one has the points, lines, linear regression and 95% confidence intervals for the mean predicted responses, but we still need to get rid of the grey background and get black labels (theme_bw()), set the right axis labels and ticks (scale_x... scale_y...) and set the right color palette for points and lines (scale_colour_manual) and filling the confidence intervals (scale_colour_fill) like so: # Plotting the graph (first match color palette) and put the regression # lines as well serious.palette = c('#AD3333', '#00526D') ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) + geom_point() + geom_line() + geom_smooth(method = 'lm', aes(fill = taxtype)) + theme_bw() + scale_y_continuous('Income taxes (% of GDP)', breaks = seq(0, 12, 2), minor_breaks = NULL) + scale_x_date('Fiscal year', minor_breaks = NULL) + scale_colour_manual(values=serious.palette) + scale_fill_manual(values=serious.palette)  One can still change font sizes to match the original plots, reposition the legend, change the aspect ratio while saving the png graphs (all simple statements) but you get the idea. If now we move to fitting the regression lines: # Fitting a regression with dummy variables m1 = lm(indtax ~ year*taxtype, data = mdlong) summary(m1) # The regressions have different intercepts and slopes # Call: # lm(formula = indtax ~ year * taxtype, data = mdlong) # # Residuals: # Min 1Q Median 3Q Max # -1.95221 -0.44303 -0.05731 0.35749 2.39415 # # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 3.435e+00 1.040e-01 33.01 <2e-16 *** # year -1.564e-04 1.278e-05 -12.23 <2e-16 *** # taxtypeIndividual 4.406e+00 1.471e-01 29.94 <2e-16 *** # year:taxtypeIndividual 1.822e-04 1.808e-05 10.08 <2e-16 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # Residual standard error: 0.7724 on 134 degrees of freedom # Multiple R-squared: 0.9245, Adjusted R-squared: 0.9228 # F-statistic: 546.9 on 3 and 134 DF, p-value: < 2.2e-16  This gives the regression coefficients for Corporate (3.45 - 1.564e-04 year) and Individual ([3.45 + 4.41] + [-1.564e-04 + 1.822e-04] year or 7.84 + 2.58e-05 year). As a bonus you get the comparison between regression lines. In R as a second language I pointed out that 'brevity reduces the time between thinking and implementation, so we can move on and keep on trying new ideas'. Some times it seriously does. Categories ## If you have to use circles… Stats Chat is an interesting kiwi site—managed by the Department of Statistics of the University of Auckland—that centers around the use and presentation of statistics in the media. This week there was an interesting discussion on one of those infographics that make you cringe: I understand the newspaper’s need to grab our attention, as well as the designer’s esthetical considerations, but they have to follow avoiding misleading the reader and providing at least a ball-park idea of the importance of whatever issue is under discussion. Clearly, as pointed out in the discussion, a line chart would convey the message with a minimum of ink; however, the designer may still want to use circles, and here we could go back to a faceted version of the always maligned pie chart. The code reads the data, reshapes it and plots it using pretty much the explanation for pie charts in the ggplot2 documentation. Again, please remember my typical disclaimer about lack of design and color flair. Colors and scales need work, but I think it is an improvement over the original. Categories ## Revisiting homicide rates A pint of R plotted an interesting dataset: intentional homicides in South America. I thought the graphs were pretty but I was unhappy about the way information was conveyed in the plots; relative risk should be very important but number of homicides is very misleading as it also relates to country population (this problem often comes up in our discussions in Stats Chat). Instead of just complaining I decided to try a few alternatives (disclaimer: I do not have a good eye for colors or design but I am only looking at ways that could better show relative risk). I therefore downloaded the MS Excel file, which contained a lot of information from other countries and extracted only the information relevant to these plots, which you can obtain here: homicides.csv (4 KB). Some quick code could display the following graph: [sourcecode language=”R”] require(ggplot2) setwd(‘~/Dropbox/quantumforest’) kill = read.csv(‘homicides.csv’, header = TRUE) kp = ggplot(kill, aes(x = year, y = country, fill = rate)) # Colors coming from # http://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/ png(‘homicides-tile.png’, width = 500, height = 500) kp = kp + geom_tile() + scale_x_continuous(name = ‘Year’, expand = c(0, 0)) + scale_y_discrete(name = ‘Country’, expand = c(0, 0)) + scale_fill_gradient(low = ‘white’, high = ‘steelblue’, name = ‘Homicide rate’) + theme_bw() + opts(panel.grid.major = theme_line(colour = NA), panel.grid.minor = theme_line(colour = NA)) dev.off() [/sourcecode] It is also possible to use a line graph, but it quickly gets very messy, so I created totally arbitrary violence categories: [sourcecode language=”R”] # Totally arbitrary classification killtype = ifelse(kill$country %in% c(‘Brazil’, ‘Colombia’, ‘Venezuela’), ‘Freaking violent’, ifelse(kill$country %in% c(‘Ecuador’, ‘Surinam’, ‘Guyana’),
‘Plain violent’,
‘Sort of quiet’))

kp2 = ggplot(kill, aes(x = year, y = rate, colour = country))

png(‘homicides-lines.png’, width=1000, height = 300)
kp2 + geom_line() + facet_grid(. ~ type) +
scale_y_continuous(‘Homicides/100,000 people’) +
scale_x_continuous(‘Year’) + theme_bw() +
opts(axis.text.x = theme_text(size = 10),
axis.text.y = theme_text(size = 10),
legend.position = ‘none’)
dev.off()
[/sourcecode]

I hope others will download the data and provide much better alternatives to display violence. If you do, please add a link in the comments.

Categories

## Plotting earthquake data

Since 4th September 2010 we have had over 2, 800 quakes (considering only magnitude 3+) in Christchurch. Quakes come in swarms, with one or few strong shocks, followed by numerous smaller ones and then the ocasional shock, creating an interesting data visualization problem. In our case, we have had swarms in September 2010, December 2010, February 2011, June 2011 and December 2011.

Geonet provides the basic information and there have been several attempts at displaying the full set of shocks. For example, Christchurch Quake Map uses animation, while Canterbury Quake Live uses four panels showing quakes for last 24 hours, last week, last month and since September 2010. While both alternatives are informative, it is hard to see long-term trends due to overplotting, particularly when we move beyond one week during a swarm.

Geonet allows data extraction through forms and queries. Rough limits for the Christchurch earthquakes are: Southern Latitude (-43.90), Northern Latitude (-43.15), Western Longitude (171.75) and Eastern Longitude (173.35). We can limit the lower magnitude to 3, as shocks are hard to feel below that value.

The file earthquakes.csv contains 2,802 shocks, starting on 2010-09-04. The file can be read using the following code:

setwd('~/Dropbox/quantumforest')
library(ggplot2)

# Reading file and manipulating dates
qk <- read.csv('earthquakes.csv', header=TRUE)

# Joins parts of date, converts it to UTC and then
# expresses it in NZST
qk$DATEtxt = with(qk, paste(ORI_YEAR,'-',ORI_MONTH,'-',ORI_DAY,' ', ORI_HOUR,':',ORI_MINUTE,':',ORI_SECOND, sep='')) qk$DATEutc = as.POSIXct(qk$DATEtxt, tz='UTC') qk$DATEnz = as.POSIXct(format(qk$DATEutc, tz='Pacific/Auckland')) head(qk)  The following code produces a plot that, in my opinion, presents a clearer idea of the swarms but that I still feel does not make justice to the problem. png('earthquakesELE.png',height=600, width=1200) ele = ggplot(qk, aes(DATEnz, MAG, ymin=3, ymax=MAG)) ele + geom_linerange(color='grey') + geom_point(color='red', size=1) + scale_y_continuous(name='Magnitude', limits=c(2.5,7.2)) + scale_x_datetime(name='Date',major='month') + theme_bw() + opts(panel.grid.major = theme_line(size=0)) dev.off()  Please let me know if you have a better idea for the plot. P.S.1 If you want to download data from Geonet there will be problems reading in R the resulting earthquakes.csv file, because the file is badly formed. All lines end with a comma except for the first one, tripping R into believing that the first column contains row names. The easiest way to fix the file is to add a comma at the end of the first line, which will create an extra empty variable called X that is not used in the plots. P.S.2 We had some discussions by email with Michael (who commented below), where he suggested dropping the grey lines and using alpha less than one to reduce plotting density. I opted for also using an alpha scale, so stronger quakes are darker to mimic the psychology of ‘quake suffering’: both frequent smaller quakes and the odd stronger quakes can be distressing to people. In addition, now the plot uses a 1:6 ratio. png('earthquakesALPHA.png',height=300, width=1800) qplot(DATEnz, MAG, data = qk, alpha = MAG) + geom_point(color = 'red', size=1.5) + scale_x_datetime('Date', major='month') + scale_y_continuous('Magnitude') + opts(legend.position = 'none', axis.text.x = theme_text(colour = 'black'), axis.text.y = theme_text(colour = 'black')) dev.off()  I ran out of time, but the background needs more work, as well as finding the right level of alpha to best tell the story. Categories ## Spatial correlation in designed experiments Last Wednesday I had a meeting with the folks of the New Zealand Drylands Forest Initiative in Blenheim. In addition to sitting in a conference room and having nice sandwiches we went to visit one of our progeny trials at Cravens. Plantation forestry trials are usually laid out following a rectangular lattice defined by rows and columns. The trial follows an incomplete block design with 148 blocks and is testing 60 Eucalyptus bosistoana families. A quick look at survival shows an interesting trend: the bottom of the trial was much more affected by frost than the top. setwd('~/Dropbox/quantumforest') library(ggplot2) # for qplot library(asreml) load('cravens.Rdata') qplot(col, row, fill = Surv, geom='tile', data = cravens)  We have the trial assessed for tree height (ht) at 2 years of age, where a plot also shows some spatial trends, with better growth on the top-left corner; this is the trait that I will analyze here. Tree height can be expressed as a function of an overall constant, random block effects, random family effects and a normally distributed residual (this is our base model, m1). I will then take into account the position of the trees (expressed as rows and columns within the trial) to fit spatially correlated residuals (m2)—using separable rows and columns processes—to finally add a spatially independent residual to the mix (m3). # Base model (non-spatial) m1 = asreml(ht ~ 1, random = ~ Block + Family, data = cravens) summary(m1)$loglik
[1] -8834.071

summary(m1)$varcomp gamma component std.error z.ratio constraint Block!Block.var 0.52606739 1227.4058 168.84681 7.269345 Positive Family!Family.var 0.06257139 145.9898 42.20675 3.458921 Positive R!variance 1.00000000 2333.1722 78.32733 29.787459 Positive  m1 represents a traditional family model with only the overall constant as the only fixed effect and a diagonal residual matrix (identity times the residual variance). In m2 I am modeling the R matrix as the Kronecker product of two separable autoregressive processes (one in rows and one in columns) times a spatial residual. # Spatial model (without non-spatial residual) m2 = asreml(ht ~ 1, random = ~ Block + Family, rcov = ~ ar1(row):ar1(col), data = cravens[order(cravens$row, cravens$col),]) summary(m2)$loglik
[1] -8782.112

summary(m2)$varcomp gamma component std.error z.ratio Block!Block.var 0.42232303 1025.9588216 157.00799442 6.534437 Family!Family.var 0.05848639 142.0822874 40.20346956 3.534080 R!variance 1.00000000 2429.3224308 88.83064209 27.347798 R!row.cor 0.09915320 0.0991532 0.02981808 3.325271 R!col.cor 0.28044024 0.2804402 0.02605972 10.761445 constraint Block!Block.var Positive Family!Family.var Positive R!variance Positive R!row.cor Unconstrained R!col.cor Unconstrained  Adding two parameters to the model results in improving the log-likelihood from -8834.071 to -8782.112, a difference of 51.959, but with what appears to be a low correlation in both directions. In m3 I am adding an spatially independent residual (using the keyword units), improving log-likelihood from -8782.112 to -8660.411: not too shabby. m3 = asreml(ht ~ 1, random = ~ Block + Family + units, rcov = ~ ar1(row):ar1(col), data = cravens[order(cravens$row, cravens$col),]) summary(m3)$loglik
[1] -8660.411

summary(m3)$varcomp gamma component std.error z.ratio Block!Block.var 3.069864e-07 4.155431e-04 6.676718e-05 6.223763 Family!Family.var 1.205382e-01 1.631630e+02 4.327885e+01 3.770040 units!units.var 1.354166e+00 1.833027e+03 7.085134e+01 25.871456 R!variance 1.000000e+00 1.353621e+03 2.174923e+02 6.223763 R!row.cor 7.814065e-01 7.814065e-01 4.355976e-02 17.938724 R!col.cor 9.529984e-01 9.529984e-01 9.100529e-03 104.719017 constraint Block!Block.var Boundary Family!Family.var Positive units!units.var Positive R!variance Positive R!row.cor Unconstrained R!col.cor Unconstrained  Allowing for the independent residual (in addition to the spatial one) has permitted higher autocorrelations (particularly in columns), while the Block variance has gone very close to zero. Most of the Block variation is being captured now through the residual matrix (in the rows and columns autoregressive correlations), but we are going to keep it in the model, as it represents a restriction to the randomization process. In addition to log-likelihood (or AIC) we can get graphical descriptions of the fit by plotting the empirical semi-variogram as in plot(variogram(m3)): Categories ## Large applications of linear mixed models In a previous post I summarily described our options for (generalized to varying degrees) linear mixed models from a frequentist point of view: nlme, lme4 and ASReml-R, followed by a quick example for a split-plot experiment. But who is really happy with a toy example? We can show a slightly more complicated example assuming that we have a simple situation in breeding: a number of half-sib trials (so we have progeny that share one parent in common), each of them established following a randomized complete block design, analyzed using a ‘family model’. That is, the response variable (dbh: tree stem diameter assessed at breast height—1.3m from ground level) can be expressed as a function of an overall mean, fixed site effects, random block effects (within each site), random family effects and a random site-family interaction. The latter provides an indication of genotype by environment interaction. setwd('~/Dropbox/quantumforest') trees = read.table('treedbh.txt', header=TRUE) # Removing all full-siblings and dropping Trial levels # that are not used anymore trees = subset(trees, Father == 0) trees$Trial = trees$Trial[drop = TRUE] # Which let me with 189,976 trees in 31 trials xtabs(~Trial, data = trees) # First run the analysis with lme4 library(lme4) m1 = lmer(dbh ~ Trial + (1|Trial:Block) + (1|Mother) + (1|Trial:Mother), data = trees) # Now run the analysis with ASReml-R library(asreml) m2 = asreml(dbh ~ Trial, random = ~Trial:Block + Mother + Trial:Mother, data = trees)  First I should make clear that this model has several drawbacks: • It assumes that we have homogeneous variances for all trials, as well as identical correlation between trials. That is, that we are assuming compound symmetry, which is very rarely the case in multi-environment progeny trials. • It also assumes that all families are unrelated to each other, which in this case I know it is not true, as a quick look at the pedigree will show that several mothers are related. • We assume only one type of families (half-sibs), which is true just because I got rid of all the full-sib families and clonal material, to keep things simple in this example. Just to make the point, a quick look at the data using ggplot2—with some jittering and alpha transparency to better display a large number of points—shows differences in both mean and variance among sites. library(ggplot2) ggplot(aes(x=jitter(as.numeric(Trial)), y=dbh), data = trees) + geom_point(colour=alpha('black', 0.05)) + scale_x_continuous('Trial')  Anyway, let’s keep on going with this simplistic model. In my computer, a two year old iMac, ASReml-R takes roughly 9 seconds, while lme4 takes around 65 seconds, obtaining similar results. # First lme4 (and rounding off the variances) summary(m1) Random effects: Groups Name Variance Std.Dev. Trial:Mother (Intercept) 26.457 5.1436 Mother (Intercept) 32.817 5.7286 Trial:Block (Intercept) 77.390 8.7972 Residual 892.391 29.8729 # And now ASReml-R # Round off part of the variance components table round(summary(m2)$varcomp[,1:4], 3)
gamma component std.error z.ratio
Trial:Block!Trial.var  0.087    77.399     4.598  16.832
Mother!Mother.var      0.037    32.819     1.641  20.002
Trial:Mother!Trial.var 0.030    26.459     1.241  21.328
R!variance             1.000   892.389     2.958 301.672


The residual matrix of this model is a, fairly large, diagonal matrix (residual variance times an identity matrix). At this point we can relax this assumption, adding a bit more complexity to the model so we can highlight some syntax. Residuals in one trial should have nothing to do with residuals in another trial, which could be hundreds of kilometers away. I will then allow for a new matrix of residuals, which is the direct sum of trial-specific diagonal matrices. In ASReml-R we can do so by specifying a diagonal matrix at each trial with rcov = ~ at(Trial):units:

m2b =  asreml(dbh ~ Trial, random = ~Trial:Block + Mother +
Trial:Mother, data = trees,
rcov = ~ at(Trial):units)

# Again, extracting and rounding variance components
round(summary(m2b)$varcomp[,1:4], 3) gamma component std.error z.ratio Trial:Block!Trial.var 77.650 77.650 4.602 16.874 Mother!Mother.var 30.241 30.241 1.512 20.006 Trial:Mother!Trial.var 22.435 22.435 1.118 20.065 Trial_1!variance 1176.893 1176.893 18.798 62.606 Trial_2!variance 1093.409 1093.409 13.946 78.403 Trial_3!variance 983.924 983.924 12.061 81.581 ... Trial_29!variance 2104.867 2104.867 55.821 37.708 Trial_30!variance 520.932 520.932 16.372 31.819 Trial_31!variance 936.785 936.785 31.211 30.015  There is a big improvement of log-Likelihood from m2 (-744452.1) to m2b (-738011.6) for 30 additional variances. At this stage, we can also start thinking of heterogeneous variances for blocks, with a small change to the code: m2c = asreml(dbh ~ Trial, random = ~at(Trial):Block + Mother + Trial:Mother, data = wood, rcov = ~ at(Trial):units) round(summary(m2c)$varcomp[,1:4], 3)

# Which adds another 30 variances (one for each Trial)
gamma component std.error z.ratio
at(Trial, 1):Block!Block.var     2.473     2.473     2.268   1.091
at(Trial, 2):Block!Block.var    95.911    95.911    68.124   1.408
at(Trial, 3):Block!Block.var     1.147     1.147     1.064   1.079
...
Mother!Mother.var               30.243    30.243     1.512  20.008
Trial:Mother!Trial.var          22.428    22.428     1.118  20.062
Trial_1!variance              1176.891  1176.891    18.798  62.607
Trial_2!variance              1093.415  1093.415    13.946  78.403
Trial_3!variance               983.926   983.926    12.061  81.581
...


At this point we could do extra modeling at the site level by including any other experimental design features, allowing for spatially correlated residuals, etc. I will cover some of these issues in future posts, as this one is getting a bit too long. However, we could gain even more by expressing the problem in a multivariate fashion, where the performance of Mothers in each trial would be considered a different trait. This would push us towards some large problems, which require modeling covariance structures so we have a change of achieving convergence during our lifetime.

Disclaimer: I am emphasizing the use of ASReml-R because 1.- I am most familiar with it than with nlme or lme4 (although I tend to use nlme for small projects), and 2.- There are plenty of examples for the other packages but not many for ASReml in the R world. I know some of the people involved in the development of ASReml-R but otherwise I have no commercial links with VSN (the guys that market ASReml and Genstat).

There are other packages that I have not had the chance to investigate yet, like hglm.

Categories

## Lattice when modeling, ggplot when publishing

When working in research projects I tend to fit several, sometimes quite a few, alternative models. This model fitting is informed by theoretical considerations (e.g. quantitative genetics, experimental design we used, our understanding of the process under study, etc.) but also by visual inspection of the data. Trellis graphics—where subsets of data are plotted in different ‘panels’ defined by one or more factors—are extremely useful to generate research hypotheses.

There are two packages in R that have good support for trellis graphics: lattice and ggplot2. Lattice is the oldest, while ggplot2 is probably more consistent (implementing a grammar of graphics) and popular with the cool kids and the data visualization crowd. However, lattice is also quite fast, while ggplot2 can be slow as a dog (certainly way slower than my dog).

Tree-breeding progeny trials often have between 1,000 and 12,000 individuals, and analyses commonly include several trials. Thus, it is not unusual to have tens of thousands or even hundreds of thousand of records that will be involved in the analysis. Add to this situation that I am impatient and you will understand that differences on speed can make a big difference to my mental health. But how different is the speed? We can simulate some correlated data (following the explanation in this post) and build a simple scatterplot faceted by site; let’s say 60,000 observations in 6 sites (10,000 per site).

[sourcecode lang=”r”]
library(lattice)
library(ggplot2)

# number of observations to simulate
nobs = 60000
sites = 6

# Using a correlation matrix (let’s assume that all variables
# have unit variance
M = matrix(c(1, 0.7,
0.7, 1), nrow=2, ncol=2)

# Cholesky decomposition
L = chol(M)
nvars = dim(L)[1]

# Random variables that follow an M correlation matrix
r = t(L) %*% matrix(rnorm(nvars*nobs), nrow=nvars, ncol=nobs)
r = t(r)

rdata = as.data.frame(r)
names(rdata) = c(‘x’, ‘y’)
rdata\$site = factor(rep(1:sites, each = nobs/sites))

# Plotting with lattice
xyplot(y ~ x | site, data = rdata,
layout = c(3, 2), type=c(‘p’,’smooth’))

# Plotting with ggplot2
qplot(x, y, facets = ~ site,
geom=c(‘point’, ‘smooth’),
data = rdata) + facet_wrap(~site)
[/sourcecode]

The timing was done surrounding the graph calls (either xyplot() or qplot()) by system.time(print()), so the graph is sent to the screen and the operation is timed. In summary, in this simple call ggplot2 takes a bit over double the time than lattice. The more layers you add to the graph the slower it gets.

The two plots are below. We could improve both plots and make them look more similar to each other, but I want to avoid introducing more distractions in the code.

Nevertheless, I do like the flexibility of ggplot2, so I support most of my exploratory data analysis using lattice but when I have to create the final pretty plots for publications in journals I go back to ggplot2. I subscribe to Frank Harrell’s Philosophy of Biostatistics, which includes ‘use excellent graphics, liberally’. Switching between packages let me deal with both abundance of graphics and impatience.

This is R pitfall #2: plots inside a function (and system.time() is a function) have to be surrounded by print() or they won’t be sent to the screen. Pitfall #1 is here.

Categories

## Setting plots side by side

This is simple example code to display side-by-side lattice plots or ggplot2 plots, using the mtcars dataset that comes with any R installation. We will display a scatterplot of miles per US gallon (mpg) on car weight (wt) next to another scatterplot of the same data, but using different colors by number of engine cylinders (cyl, treated as factor) and adding a smooth line (under the type option).

library(lattice)
data(mtcars)

# Create first plot and assign it to variable
p1 = xyplot(mpg ~ wt, data = mtcars,
xlab = 'Car weight', ylab = 'Mileage')

# Create second plot and assign it to variable
p2 = xyplot(mpg ~ wt, group= factor(cyl), data = mtcars,
xlab = 'Car weight', ylab = 'Miles/gallon',
type=c('p', 'smooth'))

# print calls print.lattice, which takes a position
# argument.
print(p1, position = c(0, 0, 0.5, 1), more = TRUE)
print(p2, position = c(0.5, 0, 1, 1))


According to the documentation, position is a vector of 4 numbers, typically c(xmin, ymin, xmax, ymax) that give the lower-left and upper-right corners of a rectangle in which the Trellis plot of x is to be positioned. The coordinate system for this rectangle is [0-1] in both the x and y directions. That is, the first print() sets position to occupy the left part of the graph with full height, as well as to avoid refreshing the graph when displaying the new plot (more = TRUE). The second print() uses the right part of the graph with full height.

In the case of ggplot2, the code is not that different:

library(grid)
library(ggplot2)

# Create first plot and assign it to variable
p1 = qplot(wt, mpg, data = mtcars,
xlab = 'Car weight', ylab = 'Mileage')

# Create second plot and assign it to variable
p2 = qplot(wt, mpg, color = factor(cyl), data = mtcars,
geom = c('point', 'smooth'),
xlab = 'Car weight', ylab = 'Mileage')

# Define grid layout to locate plots and print each graph
pushViewport(viewport(layout = grid.layout(1, 2)))
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))


More details on ggplot’s notation can be found here.