## Quantum Forest

### notes in a shoebox

#### Month: August 2012

Today Scott Chamberlain tweeted asking for a better/faster solution to building an m x n matrix with randomly assigned 0/1. He already had a working version:

Now, I’m the first to acknowledge that I’ve never got the ‘apply’ family of functions and that—thanks Hadley—if I need to do something like that I go for the plyr package. Nevertheless, if you think about it, the starting point is a sugary version of a loop; I like loops because they are explicit and clear, but they can be very slow in R. The loop could be written as:

In contrast, my first idea was to generate a bunch of uniformly distributed [0, 1) random numbers and round them to the closest integer, which is a more ‘matricy’ way of thinking:

and it happens to be a lot faster. My second idea, and Edmund Hart beat me to that one, was to use something like:

which is a nice option, generating r*c random numbers following a binomial distribution, which also has the advantage of allowing for different probabilities, rather than 0.5 as in m1. This is also faster than m1, although probably one would not notice the difference until working with fairly big matrices. When Scott pointed out the speed bump he got me thinking about order of the operations; would this be better?

In terms of speed, m3 fits between m1 and m2, so the order can make a difference.

David Smith and Rafael Maia came up with a different approach, using sample() which I had not considered at all. m4 has the advantage that one could use any number of values to randomize.

Any of these options can be timed using the system.time() function as in system.time(m3 = matrix(round(runif(r*c)), r, c)). It’s interesting how different people come up with alternative strategies (reflecting different ways of thinking about the problem) using exactly the same language. I’m not sure what should be the ‘obvious’ solution; but for almost any purpose any of them (with the exception of the explicit loop) will work just fine.

Gratuitous picture: training horses in the beach (Photo: Luis).

P.S. 2012-08-29, 22:33. Another way that would work (although 4 times slower than m1) would be m5 = matrix(ifelse(runif(r*c) < 0.5, 0, 1)). Still, it takes less than 0.5 seconds for a 1,000,000 elements matrix.
P.S. 2012-09-04. Scott Chamberlain wrote his own blog post discussing the different suggestions. Even cooler, Dirk Eddelbuettel implemented even faster creation of random 1/0 matrices using Rcpp, with inline C++ code. Much faster in a crazy overkill sort of way.

I wouldn’t kill a fly (Photo: Luis, click to enlarge).

This picture reminded me of The Oatmeal’s How to tell if your cat is plotting to kill you.

Reached mid-semester point, with quite a few new lectures to prepare. Nothing extremely complicated but, as always, the tricky part is finding a way to make it meaningful and memorable. Sometimes, and this is one of those times, I sound like a broken record but I’m a bit obsessive about helping people to ‘get’ a topic.

Gratuitous picture: Lola, Lisbon, Portugal(Photo: Luis).

INLA is not the Norwegian answer to ABBA; that would probably be a-ha. INLA is the answer to ‘Why do I have enough time to cook a three-course meal while running MCMC analyses?”.

Integrated Nested Laplace Approximations (INLA) is based on direct numerical integration (rather than simulation as in MCMC) which, according to people ‘in the know’, allows:

• the estimation of marginal posteriors for all parameters,
• marginal posteriors for each random effect and
• estimation of the posterior for linear combinations of random effects.

Rather than going to the usual univariate randomized complete block or split-plot designs that I have analyzed before (here using REML and here using MCMC), I’ll go for some analyses that motivated me to look for INLA. I was having a look at some reproductive output for Drosophila data here at the university, and wanted to fit a logistic model using MCMCglmm. Unfortunately, I was running into the millions (~3M) of iterations to get a good idea of the posterior and, therefore, leaving the computer running overnight. Almost by accident I came across INLA and started playing with it. The idea is that Sol—a Ph.D. student—had a cool experiment with a bunch of flies using different mating strategies over several generations, to check the effect on breeding success. Therefore we have to keep track of the pedigree too.

Gratuitous picture: Cubist apartments not in Norway (Photo: Luis, click to enlarge).

Up to this point we have read the response data, the pedigree and constructed the inverse of the pedigree matrix. We also needed to build a contrast matrix to compare the mean response between the different mating strategies. I was struggling there and contacted Gregor Gorjanc, who kindly emailed me the proper way to do it.

There is another related package (Animal INLA) that takes care of i- giving details about the priors and ii- “easily” fitting models that include a term with a pedigree (an animal model in quantitative genetics speak). However, I wanted the assumptions to be clear so read the source of Animal INLA and shamelessly copied the useful bits (read the source, Luke!).

A quick look at the time taken by INLA shows that it is in the order of seconds (versus overnight using MCMC). I have tried a few examples and the MCMCglmm and INLA results tend to be very close; however, figuring out how to code models has been very tricky for me. INLA follows the glorious tradition of not having a ‘proper’ manual, but a number of examples with code. In fact, they reimplement BUGS‘s examples. Personally, I struggle with that approach towards documentation, but you may be the right type of person for that. Note for letter to Santa: real documentation for INLA.

I was talking with a student about using Norwegian software and he mentioned Norwegian Black Metal. That got me thinking about how the developers of the package would look like; would they look like Gaahl of Gorgoroth (see interview here)?

Not an INLA developer

Talk about disappointment! In fact Håvard Rue, INLA mastermind, looks like a nice, clean, non-black-metal statistician. To be fair, it would be quite hard to code in any language wearing those spikes…

We were having a chat NZ-MX on Skype with Gus and he mentioned this video on music from a tree. Fantastic!

When running stats labs I like to allocate a slightly different subset of data to each student, which acts as an incentive for people to do their own work (rather than copying the same results from a fellow student). We also need to be able to replicate the results when marking, so we need a record of exactly which observations were dropped to create a particular data set. I have done this in a variety of ways, but this time I opted for code that looked like:

Thus, we were reading a full data set and assigning it to biom, reading a table that contained student codes in the first column and 5 columns with observations to be dropped (assigned to drops) and choosing one row of drops depending on the student code (assigned to my.drop). As an example, for student ‘mjl159’ my.drop looked like:

The problem with the code is the comparison !(id %in% my.drop), as it includes the factor student.code, so when R makes the comparison checking if a record is in my.drop it converts the text, e.g. ‘mjl159’, to the number of level, e.g. 41, which makes the code to delete ONE MORE observation (in this #41) on top of the ones the student was allocated. This happens only for some students, where the number of level is not included in the list of observations to drop.

This is another version of R pitfall #3: friggin’ factors. A simple workaround is to change the comparison to !(id %in% my.drop[2:6]). I should know better than this.

Gratuitous image: Tree spread on metal frame to provide shade in a plaza, Lisbon, Portugal. Some days I would love to have a coffee there without computer, just watching the world pass by. (Photo: Luis).

Back teaching a couple of subjects and it’s the constant challenge to find enough common ground with students so one can push/pull them to the other side of new concepts. We are not talking about complex hierarchical models using mixed models or Bayesian approaches, but multiple linear regression or similar. What do students actually learn in first year stats…?

• I’m enjoying reading Machine Learning for Hackers by Drew Conway and John Myles White. There isn’t a lot of stuff new for me in the book—although working with text is not something I usually do—but I have chosen to read the book with newbie eyes. I’m (repeating myself) looking for enough common ground with students so one can push/pull them to the other side of new concepts and, let’s face it, I was 20 quite a few years ago.
• Observation on teaching a lab for STAT202, in which many students are using R for the first time. Do you remember your first steps in S+/R? Some students see the light quickly while others are struggling to get their heads around giving commands to a computer (without clicking on icons).
• Videos and screencasts on using IPython via Vince Buffalo.
• This tweet by @isomorphisms resonated with me: ‘Someday I hope to be reading more Penguin Classics than John Wileys & Springer Verlags’.
• Tom points to an explanation of ‘What really shoots out of spiderman’s modified forelimbs, and why this causes such consternation’.
• I have to convince College IT guys to install R-Studio in a few hundred computers. R-Studio is becoming better all the time, making it obscene to subject students to the naked R for Windows installation without syntax highlighting.
• Finally, reasons why men should not write advice columns via Arthur Charpentier.

Derelict house in Sintra, Portugal (Photo: Luis).

Digital TV and clothes in Bairro Alto, Lisbon, Portugal (Photo: Luis).