# Quantum Forest

### notes in a shoebox

#### Month: February 2012

This week was the first anniversary of the February 22nd earthquake in Christchurch. Between that and the first week of lectures it has been hard to find time to write much about data analysis. Thus, if I owe you some analyses, some code or some text be patient, please. I’ll be back soon(ish).

I have never been much of a downtown person, so it is no surprise that I haven’t been in Hagley Park for a while. Today (Sunday in my part of the planet) was a a nice treat to spend some time there and see the interaction between city recovery and the Botanic Gardens, particularly in a Eucalyptus delegatensis surrounded by messages.

Earthquake messages around Eucalyptus delegatensis, Christchurch Botanic Gardens (Photo: Luis).

Detail of messages (Photo: Luis).

Trees can carry a lot of sadness and hope (Photo: Luis).

Up the eucalypt tree (Photo: Luis).

Time for a pause, time to look forward.

P.S. I wanted to attend the Christchurch PechaKucha tonight, but I ran out of time at the end. Next time.

This coming Monday we start the first semester in Canterbury (and in New Zealand for that matter). We are all looking forward to an earthquake-free year; more realistically, I’d be happy with low magnitude aftershocks.

• The Wall Street Journal reports that more pediatricians are ‘firing’ patients that refuse to use vaccines. I’m wondering about practices that will cluster with ‘vaccine refusers’.
• I am collaborating with a researcher in Electrical Engineering where he and his students develop very cool tools for us (see example in this previous post on dealing with autocorrelation in mixed models). They use Python to control the tools, data extraction and do some basic processing (isn’t that cool?). Python + Scipy have moved a lot towards creating a nice environment for scientific computing; however, in my opinion setting an R environment is way easier than dealing with all the versions for python, wxpython, etc. At the end I went for the 32 bit version in OS X because 64 bit is not supported (wxpython targets carbon, meh).
• Why not publish your data too? HT: @chlalanne. I’ll be uploading datasets that I’ve used in my papers (at least the ones that do not contain industry/confidential data).
• Timandra Harkness asks Have you been seduced by statistics? in Significance Journal (free access). In the same issue Matt Briggs asks another important question Why do statisticians answer silly questions that no one ever asks?
• How companies learn your secrets: the creepy side of statistics/machine learning.
• Feeling all smug in a discussion in Engineering: people struggling with software licensing (Office, Matlab, etc). I pointed out that we didn’t have any licensing issues with R to be at the bleeding edge when teaching. Take that!
• In the nothing-to-do with-stats category, Retronaut displays a beautiful collection of Soviet space propaganda posters. HT: can’t remember.

Supermarkets in New Zealand, Are they creepy too? Checkouts at Pak N Save, Christchurch (Photo: Luis).

I’ve been running analyses for two to three papers, mostly using R + asreml-R + ggplot2 + plyr. I hope to write one of them using XeLaTeX (I’ll be the sole author), while in the other(s) I am condemned to MS Word (!). Thinking of journals to send the papers to, with the most liberal copyright as possible (hard in forestry).

P.S. Flotsam posts compile small bits of information, twitter favorites, shared links, etc.

Stats Chat is an interesting kiwi site—managed by the Department of Statistics of the University of Auckland—that centers around the use and presentation of statistics in the media. This week there was an interesting discussion on one of those infographics that make you cringe:

I understand the newspaper’s need to grab our attention, as well as the designer’s esthetical considerations, but they have to follow avoiding misleading the reader and providing at least a ball-park idea of the importance of whatever issue is under discussion. Clearly, as pointed out in the discussion, a line chart would convey the message with a minimum of ink; however, the designer may still want to use circles, and here we could go back to a faceted version of the always maligned pie chart.

Faceted pie charts using ggplot2

The code reads the data, reshapes it and plots it using pretty much the explanation for pie charts in the ggplot2 documentation.

Again, please remember my typical disclaimer about lack of design and color flair. Colors and scales need work, but I think it is an improvement over the original.

A pint of R plotted an interesting dataset: intentional homicides in South America. I thought the graphs were pretty but I was unhappy about the way information was conveyed in the plots; relative risk should be very important but number of homicides is very misleading as it also relates to country population (this problem often comes up in our discussions in Stats Chat).

Instead of just complaining I decided to try a few alternatives (disclaimer: I do not have a good eye for colors or design but I am only looking at ways that could better show relative risk). I therefore downloaded the MS Excel file, which contained a lot of information from other countries and extracted only the information relevant to these plots, which you can obtain here: homicides.csv (4 KB). Some quick code could display the following graph:

[sourcecode language=”R”]
require(ggplot2)

setwd(‘~/Dropbox/quantumforest’)

kp = ggplot(kill, aes(x = year, y = country, fill = rate))

# Colors coming from
# http://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/
png(‘homicides-tile.png’, width = 500, height = 500)
kp = kp + geom_tile() + scale_x_continuous(name = ‘Year’, expand = c(0, 0)) +
scale_y_discrete(name = ‘Country’, expand = c(0, 0)) +
scale_fill_gradient(low = ‘white’, high = ‘steelblue’, name = ‘Homicide rate’) +
theme_bw() +
opts(panel.grid.major = theme_line(colour = NA),
panel.grid.minor = theme_line(colour = NA))
dev.off()
[/sourcecode]

Tile graph for homicides.

It is also possible to use a line graph, but it quickly gets very messy, so I created totally arbitrary violence categories:

[sourcecode language=”R”]
# Totally arbitrary classification
kill$type = ifelse(kill$country %in% c(‘Brazil’, ‘Colombia’, ‘Venezuela’),
‘Freaking violent’,
ifelse(kill$country %in% c(‘Ecuador’, ‘Surinam’, ‘Guyana’), ‘Plain violent’, ‘Sort of quiet’)) kp2 = ggplot(kill, aes(x = year, y = rate, colour = country)) png(‘homicides-lines.png’, width=1000, height = 300) kp2 + geom_line() + facet_grid(. ~ type) + scale_y_continuous(‘Homicides/100,000 people’) + scale_x_continuous(‘Year’) + theme_bw() + opts(axis.text.x = theme_text(size = 10), axis.text.y = theme_text(size = 10), legend.position = ‘none’) dev.off() [/sourcecode] Another view, which still requires labeling countries. Venezuela, what did happen to you? You look like the Wild West... I hope others will download the data and provide much better alternatives to display violence. If you do, please add a link in the comments. After reading David Smith’s tweet on the price of Oracle R Enterprise (actually free, but it requires Oracle Data Mining at$23K/core as pointed out by Joshua Ulrich.) I went to Oracle’s site to see what was all about. Oracle has a very interesting concept of why we use R:

Statisticians and data analysts like R because they typically don’t know SQL and are not familiar with database tasks. R allows them to remain highly productive.

Pardon? It sounds like if we only knew SQL and database tasks we would not need statistical software. File for future reference.

Mike Croucher at Walking Randomly points out an interesting difference in operator precedence for several mathematical packages to evaluate a simple operation 2^3^4. It is pretty much a divide between Matlab and Excel (does the later qualify as mathematical software?) on one side with result 4096 (or (2^3)^4) and Mathematica, R and Python on the other, resulting on 2417851639229258349412352 (or 2^(3^4)). Remember your parentheses…

Corey Chivers, aka Bayesian Biologist, uses R to help students understand the Monty Hall problem. I think a large part of the confusion to grok it stems from a convenient distraction: opening doors. The problem could be reframed as: i- you pick a door (so probability of winning the prize is 1/3) and Monty gets the other two doors (probability of winning is 2/3), ii- Monty is offering to switch all his doors for yours, so switching increases the probability of winning, iii- Monty will never open a winning door to entice the switch, so we should forget about them.

To make the point clearer, let’s imagine now that instead of 3 doors the game has 10 doors. You pick one (probability of winning 1/10) and Monty keeps 9 (probability of winning 9/10). Would you switch one door for nine? Of course! The fact that Monty will open 8 non-winning doors rather than all of his doors does not make a difference in the deal.

[sourcecode language=”R”]
# Number of games and doors
n.games = 10000
n.doors = 10

# Assign prize to door for each game. Remember:
# Monty keeps all doors not chosen by player
prize.door = floor(runif(n.games, 1, n.doors + 1))
player.door = floor(runif(n.games, 1, n.doors + 1))

# If prize.door and player.door are the same
# and player does not switch
are.same = prize.door == player.door
cat(‘Probability of winning by not switching’, sum(are.same)/n.games, ‘\n’)
cat(‘Probability of winning by switching’, (n.games – sum(are.same))/n.games, ‘\n’)
[/sourcecode]

Gratuitous picture: fish in New Brighton pier (Photo: Luis).

Pierre Lemieux reminds us that “a dishonest statistician is an outliar”.

If you want to make dulce de leche using condensed milk—but lack a pressure cooker—use an autoclave for 50 to 60 minutes. HT: Heidi Smith. Geeky and one needs an autoclave worth thousands of dollars, but that’s what universities are for.

Lesser and Pearl inform us that there are at least 20 modalities for making statistics fun in “Functional Fun in Statistics Teaching: Resources, Research and Recommendations”. HT: Chelsea Heaven. I’ve used music, videos, cartoons, jokes, striking examples using body parts, quotations, food, juggling, etc.

An old review of Buddhism without Beliefs: A Contemporary Guide to Awakening by Stephen Batchelor. I can’t see any statistical angle, but I liked that book.

P.S. Awesome video by OK Go HT: Eric Crampton.

Back to quantitative genetics!

December and January were crazy months, with a lot of travel and suddenly I found myself in February working in four parallel projects involving quantitative genetics data analyses. (I’ll write about some of them very soon)

Anyhow, as I have pointed out in repeated occasions, I prefer asreml-R for mixed model analyses because I run out of functionality with nlme and lme4 very quickly. Ten-trait multivariate mixed model with a pedigree, anyone? I thought so. Well, there are asreml-R versions for Windows, Linux and OS X; unsurprisingly, I use the latter. Installation in OS X is not particularly complicated (just follow the instructions in this PDF file) and remember to add and export the following environment variables in your .bash_profile:

These instructions work fine if one uses asreml-R in a Terminal window or if one uses a text editor (emacs + ESS, VIM + VIM-R, Textmate + R + R console bundles, etc). However, I couldn’t get the default R GUI in OS X (R.app) or Rstudio (my favorite way to work with R these days) working.

I would use library(asreml) or require(asreml), which would create no problems, but as soon as I used the function asreml(...) I would get an error message stating that I did not have a valid license. Frustrating to say the list (I even considered using emacs for the rest of my life, can you believe it?), because it meant that Rstudio was not being able to ‘see’ the environment variables (as posted in this question). The discussion on that question points to a very useful part of the R help system, which explains the startup process for R, suggesting a simple solution: create an .Renviron file in your home directory.

Thus, I simply copied the previously highlighted code in a text file called .Renviron, saved it in my home directory and I can now call asreml-R from Rstudio without any problems. This solution should also work for any other time when one would like to access environment variables from Rstudio for OS X. Incidentally, Rstudio is becoming really useful, adding ‘projects’ and integrating version control, which means that now I have a(n almost) self-contained working environment.

P.S. Note to self: While I was testing environment variables in my .bash_profile I wanted to refresh the variables without rebooting the computer. The easiest way to do so is typing  . .bash_profile (yes, that starts with dot space). To check the exported value of a specific variable one can use the echo command as so echo \$ASREML_LICENSE_FILE, which should return the assigned value (/Applications/asreml3/bin/asreml.lic in my case).