## Quantum Forest

### notes in a shoebox


Stats Chat is an interesting Kiwi site, managed by the Department of Statistics of the University of Auckland, that centers on the use and presentation of statistics in the media. This week there was a discussion of one of those infographics that make you cringe.

I understand the newspaper's need to grab our attention, as well as the designer's aesthetic considerations, but they should avoid misleading the reader and provide at least a ball-park idea of the importance of whatever issue is under discussion. Clearly, as pointed out in the discussion, a line chart would convey the message with a minimum of ink; however, the designer may still want to use circles, and here we could go back to a faceted version of the always maligned pie chart.

Faceted pie charts using ggplot2

The code reads the data, reshapes it and plots it, following pretty much the explanation for pie charts in the ggplot2 documentation.
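As a minimal sketch of that approach (the data frame and its columns are made-up stand-ins, since the original data are not included in the post), a pie chart in ggplot2 is just a stacked bar chart in polar coordinates, faceted by a grouping variable:

```r
library(ggplot2)

# Hypothetical stand-in data: share of some quantity by category and year
infographic <- data.frame(year = rep(c(2010, 2011), each = 3),
                          category = rep(c('A', 'B', 'C'), times = 2),
                          value = c(30, 45, 25, 20, 50, 30))

# A stacked bar chart becomes a pie with polar coordinates;
# facet_wrap() produces one pie per year
ggplot(infographic, aes(x = factor(1), y = value, fill = category)) +
  geom_bar(stat = 'identity', width = 1) +
  coord_polar(theta = 'y') +
  facet_wrap(~ year)
```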

Again, please remember my typical disclaimer about lack of design and color flair. Colors and scales need work, but I think it is an improvement over the original.

A pint of R plotted an interesting dataset: intentional homicides in South America. I thought the graphs were pretty, but I was unhappy about the way information was conveyed in the plots; relative risk is what matters here, and the raw number of homicides is misleading because it also depends on country population (this problem often comes up in our discussions on Stats Chat).

Instead of just complaining I decided to try a few alternatives (disclaimer: I do not have a good eye for colors or design, but I am only looking at ways that could better show relative risk). I therefore downloaded the MS Excel file, which contained a lot of information from other countries, and extracted only the information relevant to these plots, which you can obtain here: homicides.csv (4 KB). Some quick code can then display the following graph:

[sourcecode language="R"]
require(ggplot2)

setwd('~/Dropbox/quantumforest')

# Read the homicide rates extracted from the Excel file
kill = read.csv('homicides.csv', header = TRUE)

kp = ggplot(kill, aes(x = year, y = country, fill = rate))

# Colors coming from
# http://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/
png('homicides-tile.png', width = 500, height = 500)
kp = kp + geom_tile() + scale_x_continuous(name = 'Year', expand = c(0, 0)) +
     scale_y_discrete(name = 'Country', expand = c(0, 0)) +
     scale_fill_gradient(low = 'white', high = 'steelblue', name = 'Homicide rate') +
     theme_bw() +
     opts(panel.grid.major = theme_line(colour = NA),
          panel.grid.minor = theme_line(colour = NA))
print(kp)
dev.off()
[/sourcecode]

Tile graph for homicides.

It is also possible to use a line graph, but it quickly gets very messy, so I created totally arbitrary violence categories:

[sourcecode language="R"]
# Totally arbitrary classification
kill$type = ifelse(kill$country %in% c('Brazil', 'Colombia', 'Venezuela'),
                   'Freaking violent',
                   ifelse(kill$country %in% c('Ecuador', 'Surinam', 'Guyana'),
                          'Plain violent', 'Sort of quiet'))

kp2 = ggplot(kill, aes(x = year, y = rate, colour = country))

png('homicides-lines.png', width = 1000, height = 300)
print(kp2 + geom_line() + facet_grid(. ~ type) +
      scale_y_continuous('Homicides/100,000 people') +
      scale_x_continuous('Year') + theme_bw() +
      opts(axis.text.x = theme_text(size = 10),
           axis.text.y = theme_text(size = 10),
           legend.position = 'none'))
dev.off()
[/sourcecode]

Another view, which still requires labeling countries. Venezuela, what happened to you? You look like the Wild West... I hope others will download the data and provide much better alternatives to display violence. If you do, please add a link in the comments.

Since 4th September 2010 we have had over 2,800 quakes (considering only magnitude 3+) in Christchurch. Quakes come in swarms, with one or a few strong shocks followed by numerous smaller ones and then the occasional shock, creating an interesting data visualization problem. In our case, we have had swarms in September 2010, December 2010, February 2011, June 2011 and December 2011.

Geonet provides the basic information and there have been several attempts at displaying the full set of shocks. For example, Christchurch Quake Map uses animation, while Canterbury Quake Live uses four panels showing quakes for the last 24 hours, last week, last month and since September 2010. While both alternatives are informative, it is hard to see long-term trends due to overplotting, particularly when we move beyond one week during a swarm.

Geonet allows data extraction through forms and queries. Rough limits for the Christchurch earthquakes are: Southern Latitude (-43.90), Northern Latitude (-43.15), Western Longitude (171.75) and Eastern Longitude (173.35). We can limit the lower magnitude to 3, as shocks are hard to feel below that value.

Graph presented by Canterbury Quake Live; notice how difficult it is to read the bottom graph.
The file earthquakes.csv contains 2,802 shocks, starting on 2010-09-04. The file can be read using the following code:

[sourcecode lang="r"]
setwd('~/Dropbox/quantumforest')
library(ggplot2)

# Reading file and manipulating dates
qk <- read.csv('earthquakes.csv', header = TRUE)

# Joins parts of date, converts it to UTC and then
# expresses it in NZST
qk$DATEtxt = with(qk, paste(ORI_YEAR, '-', ORI_MONTH, '-', ORI_DAY, ' ',
                            ORI_HOUR, ':', ORI_MINUTE, ':', ORI_SECOND, sep = ''))
qk$DATEutc = as.POSIXct(qk$DATEtxt, tz = 'UTC')
qk$DATEnz = as.POSIXct(format(qk$DATEutc, tz = 'Pacific/Auckland'))
[/sourcecode]

The following code produces a plot that, in my opinion, presents a clearer idea of the swarms, but that I still feel does not do justice to the problem.

[sourcecode lang="r"]
png('earthquakesELE.png', height = 600, width = 1200)
ele = ggplot(qk, aes(DATEnz, MAG, ymin = 3, ymax = MAG))
ele + geom_linerange(color = 'grey') + geom_point(color = 'red', size = 1) +
      scale_y_continuous(name = 'Magnitude', limits = c(2.5, 7.2)) +
      scale_x_datetime(name = 'Date', major = 'month') + theme_bw() +
      opts(panel.grid.major = theme_line(size = 0))
dev.off()
[/sourcecode]

Graph displaying quake swarms, but still far from perfect.

Please let me know if you have a better idea for the plot.

P.S.1 If you want to download data from Geonet, you will run into problems reading the resulting earthquakes.csv file in R, because the file is badly formed. All lines end with a comma except for the first one, tripping R into believing that the first column contains row names. The easiest way to fix the file is to add a comma at the end of the first line, which will create an extra empty variable called X that is not used in the plots.
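As a sketch of that fix (fix_geonet_csv() is a hypothetical helper, not part of the original post):

```r
# Patch a CSV whose data lines end with a trailing comma but whose
# header line does not, so read.csv() sees a consistent field count
# instead of treating the first column as row names
fix_geonet_csv <- function(infile, outfile) {
  lines <- readLines(infile)
  # Append a comma to the header if it lacks the trailing one
  if (!grepl(',$', lines[1])) {
    lines[1] <- paste(lines[1], ',', sep = '')
  }
  writeLines(lines, outfile)
}
```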

P.S.2 Michael (who commented below) and I had some discussions by email, in which he suggested dropping the grey lines and using alpha less than one to reduce plotting density. I opted for also using an alpha scale, so stronger quakes are darker to mimic the psychology of 'quake suffering': both frequent smaller quakes and the odd stronger quakes can be distressing to people. In addition, the plot now uses a 1:6 ratio.

[sourcecode lang="r"]
png('earthquakesALPHA.png', height = 300, width = 1800)
qplot(DATEnz, MAG, data = qk, alpha = MAG) +
      geom_point(color = 'red', size = 1.5) +
      scale_x_datetime('Date', major = 'month') +
      scale_y_continuous('Magnitude') +
      opts(legend.position = 'none',
           axis.text.x = theme_text(colour = 'black'),
           axis.text.y = theme_text(colour = 'black'))
dev.off()
[/sourcecode]

New version, click to expand.

I ran out of time, but the background needs more work, as well as finding the right level of alpha to best tell the story.

When working on research projects I tend to fit several, sometimes quite a few, alternative models. This model fitting is informed by theoretical considerations (e.g. quantitative genetics, the experimental design we used, our understanding of the process under study, etc.) but also by visual inspection of the data. Trellis graphics, where subsets of data are plotted in different 'panels' defined by one or more factors, are extremely useful for generating research hypotheses.

There are two packages in R with good support for trellis graphics: lattice and ggplot2. Lattice is the older of the two, while ggplot2 is probably more consistent (implementing a grammar of graphics) and popular with the cool kids and the data visualization crowd. However, lattice is also quite fast, while ggplot2 can be slow as a dog (certainly way slower than my dog).

Tree-breeding progeny trials often have between 1,000 and 12,000 individuals, and analyses commonly include several trials. Thus, it is not unusual to have tens of thousands or even hundreds of thousands of records involved in an analysis. Add to this situation that I am impatient and you will understand that differences in speed can make a big difference to my mental health. But how different is the speed? We can simulate some correlated data (following the explanation in this post) and build a simple scatterplot faceted by site; let's say 60,000 observations in 6 sites (10,000 per site).

[sourcecode lang="r"]
library(lattice)
library(ggplot2)

# number of observations to simulate
nobs = 60000
sites = 6

# Using a correlation matrix (let's assume that all variables
# have unit variance)
M = matrix(c(1, 0.7,
             0.7, 1), nrow = 2, ncol = 2)

# Cholesky decomposition
L = chol(M)
nvars = dim(L)[1]

# Random variables that follow an M correlation matrix
r = t(L) %*% matrix(rnorm(nvars*nobs), nrow = nvars, ncol = nobs)
r = t(r)

rdata = as.data.frame(r)
names(rdata) = c('x', 'y')
rdata$site = factor(rep(1:sites, each = nobs/sites))

# Plotting with lattice
xyplot(y ~ x | site, data = rdata,
       layout = c(3, 2), type = c('p', 'smooth'))

# Plotting with ggplot2
qplot(x, y, geom = c('point', 'smooth'),
      data = rdata) + facet_wrap(~ site)
[/sourcecode]

The timing was done by surrounding the graph calls (either xyplot() or qplot()) with system.time(print()), so the graph is sent to the screen and the operation is timed. In summary, for this simple call ggplot2 takes a bit over double the time of lattice. The more layers you add to the graph, the slower it gets.
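The timing pattern looks like this (a sketch on a small stand-in data frame; the numbers themselves will depend on your machine):

```r
library(lattice)

# Small stand-in data set, just to show the timing pattern
d <- data.frame(x = rnorm(1000), y = rnorm(1000),
                site = factor(rep(1:2, each = 500)))

# print() forces the plot to be drawn inside system.time(),
# so the elapsed time includes the actual rendering
lattice_time <- system.time(print(xyplot(y ~ x | site, data = d)))
lattice_time['elapsed']
```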

The two plots are below. We could improve both plots and make them look more similar to each other, but I want to avoid introducing more distractions in the code.

Nevertheless, I do like the flexibility of ggplot2, so I support most of my exploratory data analysis using lattice, but when I have to create the final pretty plots for publication in journals I go back to ggplot2. I subscribe to Frank Harrell's Philosophy of Biostatistics, which includes 'use excellent graphics, liberally'. Switching between packages lets me deal with both abundance of graphics and impatience.

This is R pitfall #2: plots inside a function (and system.time() is a function) have to be surrounded by print() or they won’t be sent to the screen. Pitfall #1 is here.

This is simple example code to display side-by-side lattice plots or ggplot2 plots, using the mtcars dataset that comes with any R installation. We will display a scatterplot of miles per US gallon (mpg) on car weight (wt) next to another scatterplot of the same data, but using different colors by number of engine cylinders (cyl, treated as factor) and adding a smooth line (under the type option).

According to the documentation, position is a vector of 4 numbers, typically c(xmin, ymin, xmax, ymax) that give the lower-left and upper-right corners of a rectangle in which the Trellis plot of x is to be positioned. The coordinate system for this rectangle is [0-1] in both the x and y directions. That is, the first print() sets position to occupy the left part of the graph with full height, as well as to avoid refreshing the graph when displaying the new plot (more = TRUE). The second print() uses the right part of the graph with full height.
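A minimal sketch of those two print() calls (the 50/50 split is my choice; the plots follow the description above):

```r
library(lattice)

# Plain scatterplot of mpg on wt
p1 <- xyplot(mpg ~ wt, data = mtcars)

# Same data, coloured by number of cylinders (as factor),
# adding a smooth line under the type option
p2 <- xyplot(mpg ~ wt, groups = factor(cyl), data = mtcars,
             type = c('p', 'smooth'))

# Left half of the device, full height; more = TRUE keeps the
# device from being refreshed before the second plot is added
print(p1, position = c(0, 0, 0.5, 1), more = TRUE)
# Right half of the device, full height
print(p2, position = c(0.5, 0, 1, 1))
```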

In the case of ggplot2, the code is not that different:
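A sketch of the ggplot2 version, using grid viewports to split the device (gridExtra::grid.arrange() would be a shorter alternative):

```r
library(ggplot2)
library(grid)

# Plain scatterplot of mpg on wt
g1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Coloured by number of cylinders, with a smooth line per group
g2 <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
      geom_point() + geom_smooth(se = FALSE)

# Split the device into a 1-row, 2-column layout and print
# each plot in its own viewport
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(g1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(g2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
```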

More details on ggplot’s notation can be found here.

In a previous post I explained how to create all possible combinations of the levels of two factors using expand.grid(). Another use for this function is to create a regular grid for two variables to create a levelplot or a contour plot.

For example, let's say that we have fitted a multiple linear regression to predict wood stiffness (stiff, the response) using basic density (bd) and a measure of microfibril angle (t) as explanatory variables. The regression equation could be something like stiffness = 3.439 + 0.009 bd - 0.052 t. In our dataset bd had a range of 300 to 700 kg m^-3, while t had a range from 50 to 70.

We will use the levelplot() function that is part of the lattice package of graphical functions, create a grid for both explanatory variables (every 10 for bd and every 1 for t), predict values of stiffness for all combinations of bd and t, and plot the results.
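A sketch of that code, using the regression equation above (the axis labels are my wording):

```r
library(lattice)

# Regular grid for both explanatory variables:
# bd every 10 kg m^-3, t every 1 unit
wood <- expand.grid(bd = seq(300, 700, by = 10),
                    t = seq(50, 70, by = 1))

# Predicted stiffness from the fitted regression equation
wood$stiffness <- 3.439 + 0.009 * wood$bd - 0.052 * wood$t

# Level plot of predicted stiffness over the grid
levelplot(stiffness ~ bd * t, data = wood,
          xlab = 'Basic density', ylab = 'Microfibril angle')
```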

This code creates a graph like this. Simple.