notes in a shoebox

# Category: ggplot

I was attending a course of Bayesian Statistics where this problem showed up:

There is a number of individuals, say 12, who take a pass/fail test 15 times. For each individual we have recorded the number of passes, which can go from 0 to 15. Because of confidentiality issues, we are presented with rounded-to-the-closest-multiple-of-3 data ($$\mathbf{R}$$). We are interested on estimating $$\theta$$ of the Binomial distribution behind the data.

Rounding is probabilistic, with probability 2/3 if you are one count away from a multiple of 3 and probability 1/3 if the count is you are two counts away. Multiples of 3 are not rounded.

We can use Gibbs sampling to alternate between sampling the posterior for the unrounded $$\mathbf{Y}$$ and $$\theta$$. In the case of $$\mathbf{Y}$$ I used:

While for $$theta$$ we are assuming a vague $$mbox{Beta}(alpha, eta)$$, with $$alpha$$ and $$eta$$ equal to 1, as prior density function for $$theta$$, so the posterior density is a $$mbox{Beta}(alpha + sum Y_i, eta + 12*15 – sum Y_i)$$.

I then implemented the sampler as:

And plotted the results as:

Posterior density for $theta$.

Posterior mass for each rounded observation.

I thought it was a nice, cute example of simultaneously estimating a latent variable and, based on that, estimating the parameter behind it.

The Swarm Lab presents a nice comparison of R and Python code for a simple (read ‘one could do it in Excel’) problem. The example works, but I was surprised by how wordy the R code was and decided to check if one could easily produce a shorter version.

The beginning is pretty much the same, although I’ll use ggplot2 rather than lattice, because it will be a lot easier (and shorter) to get the desired appearance for the plots:

The whole example relies on only three variables and—as I am not great at typing—I tend to work with shorter variable names. I directly changed the names for variables 1 to 3:

It is a lot easier to compare the regression lines if we change the shape of the data set from wide to long, where there is one variable for year, one for tax type, and one for the actual tax rate. It would be possible to use one of Hadley’s packages to get a simpler syntax for this, but I decided to stick to the minimum set of requirements:

And now we are ready to produce the plots. The first one can be a rough cut to see if we get the right elements:

First cut of the taxes per year plot.

Yes, this one has the points, lines, linear regression and 95% confidence intervals for the mean predicted responses, but we still need to get rid of the grey background and get black labels (theme_bw()), set the right axis labels and ticks (scale_x... scale_y...) and set the right color palette for points and lines (scale_colour_manual) and filling the confidence intervals (scale_colour_fill) like so:

Way closer to the desired plot, still much shorter.

One can still change font sizes to match the original plots, reposition the legend, change the aspect ratio while saving the png graphs (all simple statements) but you get the idea. If now we move to fitting the regression lines:

This gives the regression coefficients for Corporate (3.45 – 1.564e-04 year) and Individual ([3.45 + 4.41] + [-1.564e-04 + 1.822e-04] year or 7.84 + 2.58e-05 year). As a bonus you get the comparison between regression lines.

In R as a second language I pointed out that ‘brevity reduces the time between thinking and implementation, so we can move on and keep on trying new ideas’. Some times it seriously does.

Stats Chat is an interesting kiwi site—managed by the Department of Statistics of the University of Auckland—that centers around the use and presentation of statistics in the media. This week there was an interesting discussion on one of those infographics that make you cringe:

I understand the newspaper’s need to grab our attention, as well as the designer’s esthetical considerations, but they have to follow avoiding misleading the reader and providing at least a ball-park idea of the importance of whatever issue is under discussion. Clearly, as pointed out in the discussion, a line chart would convey the message with a minimum of ink; however, the designer may still want to use circles, and here we could go back to a faceted version of the always maligned pie chart.

Faceted pie charts using ggplot2

The code reads the data, reshapes it and plots it using pretty much the explanation for pie charts in the ggplot2 documentation.

Again, please remember my typical disclaimer about lack of design and color flair. Colors and scales need work, but I think it is an improvement over the original.

A pint of R plotted an interesting dataset: intentional homicides in South America. I thought the graphs were pretty but I was unhappy about the way information was conveyed in the plots; relative risk should be very important but number of homicides is very misleading as it also relates to country population (this problem often comes up in our discussions in Stats Chat).

Instead of just complaining I decided to try a few alternatives (disclaimer: I do not have a good eye for colors or design but I am only looking at ways that could better show relative risk). I therefore downloaded the MS Excel file, which contained a lot of information from other countries and extracted only the information relevant to these plots, which you can obtain here: homicides.csv (4 KB). Some quick code could display the following graph:

Tile graph for homicides.

It is also possible to use a line graph, but it quickly gets very messy, so I created totally arbitrary violence categories:

Another view, which still requires labeling countries. Venezuela, what did happen to you? You look like the Wild West...

Since 4th September 2010 we have had over 2, 800 quakes (considering only magnitude 3+) in Christchurch. Quakes come in swarms, with one or few strong shocks, followed by numerous smaller ones and then the ocasional shock, creating an interesting data visualization problem. In our case, we have had swarms in September 2010, December 2010, February 2011, June 2011 and December 2011.

Geonet provides the basic information and there have been several attempts at displaying the full set of shocks. For example, Christchurch Quake Map uses animation, while Canterbury Quake Live uses four panels showing quakes for last 24 hours, last week, last month and since September 2010. While both alternatives are informative, it is hard to see long-term trends due to overplotting, particularly when we move beyond one week during a swarm.

Geonet allows data extraction through forms and queries. Rough limits for the Christchurch earthquakes are: Southern Latitude (-43.90), Northern Latitude (-43.15), Western Longitude (171.75) and Eastern Longitude (173.35). We can limit the lower magnitude to 3, as shocks are hard to feel below that value.

Graph presented by Canterbury Quake Live, notice how difficult is to read the bottom graph.

The file earthquakes.csv contains 2,802 shocks, starting on 2010-09-04. The file can be read using the following code:

The following code produces a plot that, in my opinion, presents a clearer idea of the swarms but that I still feel does not make justice to the problem.

Graph displaying quake swarms, but still far from perfect. Looking forward to your ideas.

Please let me know if you have a better idea for the plot.

P.S.1 If you want to download data from Geonet there will be problems reading in R the resulting earthquakes.csv file, because the file is badly formed. All lines end with a comma except for the first one, tripping R into believing that the first column contains row names. The easiest way to fix the file is to add a comma at the end of the first line, which will create an extra empty variable called X that is not used in the plots.

P.S.2 We had some discussions by email with Michael (who commented below), where he suggested dropping the grey lines and using alpha less than one to reduce plotting density. I opted for also using an alpha scale, so stronger quakes are darker to mimic the psychology of ‘quake suffering’: both frequent smaller quakes and the odd stronger quakes can be distressing to people. In addition, now the plot uses a 1:6 ratio.

New version, click to expand.

I ran out of time, but the background needs more work, as well as finding the right level of alpha to best tell the story.

Last Wednesday I had a meeting with the folks of the New Zealand Drylands Forest Initiative in Blenheim. In addition to sitting in a conference room and having nice sandwiches we went to visit one of our progeny trials at Cravens. Plantation forestry trials are usually laid out following a rectangular lattice defined by rows and columns. The trial follows an incomplete block design with 148 blocks and is testing 60 Eucalyptus bosistoana families. A quick look at survival shows an interesting trend: the bottom of the trial was much more affected by frost than the top.

We have the trial assessed for tree height (ht) at 2 years of age, where a plot also shows some spatial trends, with better growth on the top-left corner; this is the trait that I will analyze here.

Tree height can be expressed as a function of an overall constant, random block effects, random family effects and a normally distributed residual (this is our base model, m1). I will then take into account the position of the trees (expressed as rows and columns within the trial) to fit spatially correlated residuals (m2)—using separable rows and columns processes—to finally add a spatially independent residual to the mix (m3).

m1 represents a traditional family model with only the overall constant as the only fixed effect and a diagonal residual matrix (identity times the residual variance). In m2 I am modeling the R matrix as the Kronecker product of two separable autoregressive processes (one in rows and one in columns) times a spatial residual.

Adding two parameters to the model results in improving the log-likelihood from -8834.071 to -8782.112, a difference of 51.959, but with what appears to be a low correlation in both directions. In m3 I am adding an spatially independent residual (using the keyword units), improving log-likelihood from -8782.112 to -8660.411: not too shabby.

Allowing for the independent residual (in addition to the spatial one) has permitted higher autocorrelations (particularly in columns), while the Block variance has gone very close to zero. Most of the Block variation is being captured now through the residual matrix (in the rows and columns autoregressive correlations), but we are going to keep it in the model, as it represents a restriction to the randomization process. In addition to log-likelihood (or AIC) we can get graphical descriptions of the fit by plotting the empirical semi-variogram as in plot(variogram(m3)):

In a previous post I summarily described our options for (generalized to varying degrees) linear mixed models from a frequentist point of view: nlme, lme4 and ASReml-R, followed by a quick example for a split-plot experiment.

But who is really happy with a toy example? We can show a slightly more complicated example assuming that we have a simple situation in breeding: a number of half-sib trials (so we have progeny that share one parent in common), each of them established following a randomized complete block design, analyzed using a ‘family model’. That is, the response variable (dbh: tree stem diameter assessed at breast height—1.3m from ground level) can be expressed as a function of an overall mean, fixed site effects, random block effects (within each site), random family effects and a random site-family interaction. The latter provides an indication of genotype by environment interaction.

First I should make clear that this model has several drawbacks:

• It assumes that we have homogeneous variances for all trials, as well as identical correlation between trials. That is, that we are assuming compound symmetry, which is very rarely the case in multi-environment progeny trials.
• It also assumes that all families are unrelated to each other, which in this case I know it is not true, as a quick look at the pedigree will show that several mothers are related.
• We assume only one type of families (half-sibs), which is true just because I got rid of all the full-sib families and clonal material, to keep things simple in this example.

Just to make the point, a quick look at the data using ggplot2—with some jittering and alpha transparency to better display a large number of points—shows differences in both mean and variance among sites.

Anyway, let’s keep on going with this simplistic model. In my computer, a two year old iMac, ASReml-R takes roughly 9 seconds, while lme4 takes around 65 seconds, obtaining similar results.

The residual matrix of this model is a, fairly large, diagonal matrix (residual variance times an identity matrix). At this point we can relax this assumption, adding a bit more complexity to the model so we can highlight some syntax. Residuals in one trial should have nothing to do with residuals in another trial, which could be hundreds of kilometers away. I will then allow for a new matrix of residuals, which is the direct sum of trial-specific diagonal matrices. In ASReml-R we can do so by specifying a diagonal matrix at each trial with rcov = ~ at(Trial):units:

There is a big improvement of log-Likelihood from m2 (-744452.1) to m2b (-738011.6) for 30 additional variances. At this stage, we can also start thinking of heterogeneous variances for blocks, with a small change to the code:

At this point we could do extra modeling at the site level by including any other experimental design features, allowing for spatially correlated residuals, etc. I will cover some of these issues in future posts, as this one is getting a bit too long. However, we could gain even more by expressing the problem in a multivariate fashion, where the performance of Mothers in each trial would be considered a different trait. This would push us towards some large problems, which require modeling covariance structures so we have a change of achieving convergence during our lifetime.

Disclaimer: I am emphasizing the use of ASReml-R because 1.- I am most familiar with it than with nlme or lme4 (although I tend to use nlme for small projects), and 2.- There are plenty of examples for the other packages but not many for ASReml in the R world. I know some of the people involved in the development of ASReml-R but otherwise I have no commercial links with VSN (the guys that market ASReml and Genstat).

There are other packages that I have not had the chance to investigate yet, like hglm.

When working in research projects I tend to fit several, sometimes quite a few, alternative models. This model fitting is informed by theoretical considerations (e.g. quantitative genetics, experimental design we used, our understanding of the process under study, etc.) but also by visual inspection of the data. Trellis graphics—where subsets of data are plotted in different ‘panels’ defined by one or more factors—are extremely useful to generate research hypotheses.

There are two packages in R that have good support for trellis graphics: lattice and ggplot2. Lattice is the oldest, while ggplot2 is probably more consistent (implementing a grammar of graphics) and popular with the cool kids and the data visualization crowd. However, lattice is also quite fast, while ggplot2 can be slow as a dog (certainly way slower than my dog).

Tree-breeding progeny trials often have between 1,000 and 12,000 individuals, and analyses commonly include several trials. Thus, it is not unusual to have tens of thousands or even hundreds of thousand of records that will be involved in the analysis. Add to this situation that I am impatient and you will understand that differences on speed can make a big difference to my mental health. But how different is the speed? We can simulate some correlated data (following the explanation in this post) and build a simple scatterplot faceted by site; let’s say 60,000 observations in 6 sites (10,000 per site).

The timing was done surrounding the graph calls (either xyplot() or qplot()) by system.time(print()), so the graph is sent to the screen and the operation is timed. In summary, in this simple call ggplot2 takes a bit over double the time than lattice. The more layers you add to the graph the slower it gets.

The two plots are below. We could improve both plots and make them look more similar to each other, but I want to avoid introducing more distractions in the code.

Nevertheless, I do like the flexibility of ggplot2, so I support most of my exploratory data analysis using lattice but when I have to create the final pretty plots for publications in journals I go back to ggplot2. I subscribe to Frank Harrell’s Philosophy of Biostatistics, which includes ‘use excellent graphics, liberally’. Switching between packages let me deal with both abundance of graphics and impatience.

This is R pitfall #2: plots inside a function (and system.time() is a function) have to be surrounded by print() or they won’t be sent to the screen. Pitfall #1 is here.

This is simple example code to display side-by-side lattice plots or ggplot2 plots, using the mtcars dataset that comes with any R installation. We will display a scatterplot of miles per US gallon (mpg) on car weight (wt) next to another scatterplot of the same data, but using different colors by number of engine cylinders (cyl, treated as factor) and adding a smooth line (under the type option).

According to the documentation, position is a vector of 4 numbers, typically c(xmin, ymin, xmax, ymax) that give the lower-left and upper-right corners of a rectangle in which the Trellis plot of x is to be positioned. The coordinate system for this rectangle is [0-1] in both the x and y directions. That is, the first print() sets position to occupy the left part of the graph with full height, as well as to avoid refreshing the graph when displaying the new plot (more = TRUE). The second print() uses the right part of the graph with full height.

In the case of ggplot2, the code is not that different:

More details on ggplot’s notation can be found here.