# Quantum Forest

### notes in a shoebox

#### Month: November 2011

Solomon saith, There is no new thing upon the earth. So that as Plato had an imagination, that all knowledge was but remembrance; so Solomon giveth his sentence, That all novelty is but oblivion.

—Francis Bacon: Essays, LVIII quoted by Jorge Luis Borges in The Immortal (1949).

This post is somewhat marginal to R in that there are several statistical systems that could be used to tackle the problem. Bayesian statistics is one of those topics that I would like to understand better, much better, in fact. Unfortunately, I struggle to get the time to attend courses on the topic between running my own lectures, research and travel; there are always books, of course.

In my (highly individual and dubious) opinion Albert’s book is the easiest to read. I was waiting to see the doctor while reading—and actually understanding—some of the concepts. The book is certainly geared towards R users and gradually develops the code necessary to run simple analyses from estimating a proportion to fitting (simple) hierarchical linear models. I’m still reading, which is a compliment.

Marin and Robert’s book is quite different in that uses R as a vehicle (like this blog) but the focus is more on the conceptual side and covers more types of models than Albert’s book. I do not have the probability background for this course (or maybe I did, but it was ages ago); however, the book makes me want to learn/refresh that background. An annoying comment on the book is that it is “self-contained”; well, anything is self-contained if one asks for enough prerequisites! I’m still reading (jumping between Albert’s and this book), and the book has managed to capture my interest.

Finally, Bolstad’s book. How to put this? “It is not you, it is me”. It is much more technical and I do not have the time, nor the patience, to wait until chapter 8 to do something useful (logistic regression). This is going back to the library until an indeterminate future.

If you are now writing a book on the topic I would like to think of the following user case:

• the reader has little or no exposure to Bayesian statistics, but it has been working for a while with ‘classical’ methods,
• the reader is self-motivated, but he doesn’t want to spend ages to be able to fit even a simple linear regression,
• the reader has little background on probability theory, but he is willing to learn some in between learning the tools and to run some analyses,
• using a statistical system that allows for both classical and Bayesian approaches is a plus.

It is hard for me to be more selfish in this description; you are potentially writing a book for me.

† After the first quake our main library looked like this. Now it is mostly normal.

P.S. After publishing this post I remembered that I came across a PDF copy of Doing Bayesian Data Analysis: A Tutorial with R and BUGS by Kruschke. Setting aside the dodginess of the copy, the book looked well-written, started from first principles and had puppies on the cover (!), so I ordered it from Amazon.

P.D. 2011-12-03 23:45 AEST Christian Robert sent me a nice email and wrote a few words on my post. Yes, I’m still plodding along with the book although I’m taking a ten day break while traveling in Australia.

P.D. 2011-11-25 12:25 NZST Here is a list of links to Amazon for the books suggested in the comments:

I believe that no one who is familiar, either with mathematical advances in other fields, or with the range of special biological conditions to be considered, would ever conceive that everything could be summed up in a single mathematical formula, however complex.

— R.A. Fisher (1932) quoted in the preface to Foundations of Mathematical Genetics by A.W.F. Edwards (1976).

David Smith at the Revolutions blog posted a nice presentation on “big data” (oh, how I dislike that term). It is a nice piece of work and the Revolution guys managed to process a large amount of records, starting with a download of 70GB and ending up with a series of linear regressions.

I’ve spent the last two weeks traveling (including a visit to the trial below) and finishing marking for the semester, which has somewhat affected my perception on dealing with large amounts of data. The thing is that dealing with hotel internet caps (100MB) or even with my lowly home connection monthly cap (5GB) does get one thinking… Would I spend several months of internet connection just downloading data so I could graph and plot some regression lines for 110 data points? Or does it make sense to run a linear regression with two predictors using 100 million records?

My basic question is why would I want to deal with all those 100 million records directly in R? Wouldn’t it make much more sense to reduce the data to a meaningful size using the original database, up there in the cloud, and download the reduced version to continue an in-depth analysis? There are packages to query external databases (ROracle, RMySQL, RODBC, …, pick your poison), we can sample to explore the dataset, etc.

We can deal with a rather large dataset in our laptop but is it the best that we can do to deal with the underlying modeling problem? Just wondering.

Gratuitous picture of generic Pinus radiata breeding trial in Northern Tasmania.

A few years ago we had this really cool idea: we had to establish a trial to understand wood quality in context. Sort of following the saying “we don’t know who discovered water, but we are sure that it wasn’t a fish” (attributed to Marshall McLuhan). By now you are thinking WTF is this guy talking about? But the idea was simple; let’s put a trial that had the species we wanted to study (Pinus radiata, a gymnosperm) and an angiosperm (Eucalyptus nitens if you wish to know) to provide the contrast, as they are supposed to have vastly different types of wood. From space the trial looked like this:

The reason you can clearly see the pines but not the eucalypts is because the latter were dying like crazy over a summer drought (45% mortality in one month). And here we get to the analytical part: we will have a look only at the eucalypts where the response variable can’t get any clearer, trees were either totally dead or alive. The experiment followed a randomized complete block design, with 50 open-pollinated families in 48 blocks. The original idea was to harvest 12 blocks each year but—for obvious reasons—we canned this part of the experiment after the first year.

The following code shows the analysis in asreml-R, lme4 and MCMCglmm:

You may be wondering Where does the 3.29 in the heritability formula comes from? Well, that’s the variance of the link function that, in the case of the logit link is pi*pi/3. In the case of MCMCglmm we can estimate the degree of genetic control quite easily, remembering that we have half-siblings (open-pollinated plants):

By the way, it is good to remember that we need to back-transform the estimated effects to probabilities, with very simple code:

Even if one of your trials is trashed there is a silver lining: it is possible to have a look at survival.

Before starting the description of the probability distributions, we want to impose on the reader the essential feature that a model is an interpretation of a real phenomenon that fits its characteristics to some degree of approximation rather than an explanation that would require the model to be “true”. In short, there is no such thing as a “true model”, even though some models are more appropriate than others!

— Jean-Michel Marin and Christian P. Robert in Bayesian Core: a practical approach to computational Bayesian statistics.

This week I’m facing my—and many other lecturers’—least favorite part of teaching: grading exams. In a supreme act of procrastination I will continue the previous post, and the antepenultimate one, showing the code for a bivariate analysis of a randomized complete block design.

Just to recap, the results from the REML multivariate analysis (that used ASReml-R) was the following:

The corresponding MCMCglmm code is not that different from ASReml-R, after which it is modeled anyway. Following the recommendations of the MCMCglmm Course Notes (included with the package), the priors have been expanded to diagonal matrices with degree of belief equal to the number of traits. The general intercept is dropped (-1) so the trait keyword represents trait means. We are fitting unstructured (us(trait)) covariance matrices for both Block and Family, as well as an unstructured covariance matrix for the residuals. Finally, both traits follow a gaussian distribution:

Further manipulation of the posterior distributions requires having an idea of the names used to store the results. Following that, we can build an estimate of the genetic correlation between the traits (Family covariance between traits divided by the square root of the product of the Family variances). Incidentally, it wouldn’t be a bad idea to run a much longer chain for this model, so the plot of the posterior for the correlation looks better, but I’m short of time:

And that’s it! Time to congratulate Jarrod Hadfield for developing this package.

Until today all the posts in this blog have used a frequentist view of the world. I have a confession to make: I have an ecumenical view of statistics and I do sometimes use Bayesian approaches in data analyses. This is not quite one of those “the truth will set you free” moments, but I’ll show that one could almost painlessly repeat some of the analyses I presented before using MCMC.

MCMCglmm is a very neat package that—as its rather complicated em cee em cee gee el em em acronym implies—implements MCMC for generalized linear mixed models. We’ll skip that the frequentist fixed vs random effects distinction gets blurry in Bayesian models and still use the f… and r… terms. I’ll first repeat the code for a Randomized Complete Block design with Family effects (so we have two random factors) using both lme4 and ASReml-R and add the MCMCglmm counterpart:

We had already established that the results obtained from lme4 and ASReml-R were pretty much the same, at least for relatively simple models where we can use both packages (as their functionality diverges later for more complex models). This example is no exception and we quickly move to fitting the same model using MCMCglmm:

The first difference is that we have to specify priors for the coefficients that we would like to estimate (by default fixed effects, the overall intercept for our example, start with a zero mean and very large variance: 106). The phenotypic variance for our response is around 780, which I split into equal parts for Block, Family and Residuals. For each random effect we have provided our prior for the variance (V) and a degree of belief on our prior (n).

In addition to the model equation, name of the data set and prior information we need to specify things like the number of iterations in the chain (nitt), how many we are discarding for the initial burnin period (burnin), and how many values we are keeping (thin, every ten). Besides the pretty plot of the posterior distributions (see previous figure) they can be summarized using the posterior mode and high probability densities.

One of the neat things we can do is to painlessly create functions of variance components and get their posterior mode and credible interval. For example, the heritability (or degree of additive genetic control) can be estimated in this trial with full-sib families using the following formula:

$$\hat{h^2} = \frac{2 \sigma_F^2}{\sigma_F^2 + \sigma_B^2 + \sigma_R^2}$$

There are some differences on the final results between ASReml-R/lme4 and MCMCglmm; however, the gammas (ratios of variance component/error variance) for the posterior distributions are very similar, and the estimated heritabilities are almost identical (~0.19 vs ~0.21). Overall, MCMCglmm is a very interesting package that covers a lot of ground. It pays to read the reference manual and vignettes, which go into a lot of detail on how to specify different types of models. I will present the MCMCglmm bivariate analyses in another post.

P.S. There are several other ways that we could fit this model in R using a Bayesian approach: it is possible to call WinBugs or JAGS (in Linux and OS X) from R, or we could have used INLA. More on this in future posts.

I bought an Android phone, nothing fancy just my first foray in the smartphone world, which is a big change coming from the dumb phone world(*). Everything is different and I am back at being a newbie; this is what many students feel the same time they are exposed to R. However, and before getting into software, I find it useful to think of teaching from several points of view, considering that there are several user cases:

1. A few of the students will get into statistics and heavy duty coding, they will be highly motivated and take several stats courses. It is in their best interest to learn R (and other software) very well.
2. Some, but not many, of the students will use statistics once in while. A point and click GUI may help to remember commands.
3. Most students will have to consume statistics, read reports, request information from colleagues and act as responsible, statistically literate citizens. I think that concepts (rather than software) are far the most relevant element to this group.

The first group requires access to all the power in R and, very likely, has at least a passing idea of coding in other languages. The second and third groups are occasional users, which will tend to forget the language and notation and most likely will need a system with menus.

At this point, some of the readers may be tempted to say that everyone (that is groups 1—3) should learn to write R code and they just need a good text editor (say Emacs). This may come as a surprise but normal people not only do not use text editors, they even don’t know what they are. You could be tempted to say that having a good editor would also let them write in LaTeX (or XeLaTeX), which is an excellent way to ‘future proof’ your documents. Please let me repeat this: normal people do not write in LaTeX, they use Word or something equivalent.

But, but. Yes, I know, we are not normal people.

## What are the problems?

When working in my Ph.D. I had the ‘brilliant’ (a.k.a. masochistic) idea of using different languages for each part of my project: Fortran 90 (Lahey), ASReml, Python (ActiveState), Matlab and Mathematica. One thing that I experienced, was that working with a scripting language integrated with a good IDE (e.g. ActiveState Python or Matlab) was much more conducive to learning than a separate console and text editor. I still have fond memories of learning and using Python. This meandering description brings me back to what we should use for teaching.

Let’s be honest, the stock R IDE that one gets with the initial download is spartan if you are in OS X and plain sucky if you are in Windows. Given that working with the console plus a text editor (Emacs, Vim, Textmate, etc) is an uncomfortable learning experience (at least in my opinion) there is a nice niche for IDEs like RStudio, which integrate editor, data manager, graphs, etc.; particularly if they are cross-platform. Why is that RStudio is not included as the default R IDE? (Incidentally, I have never used Revolution R Productivity Environment—Windows only—that looks quite cool).

Today I am tempted to recommend moving the whole course to RStudio, which means installing it in an awful lot of computers at the university. One of the issues that stops me is that is introducing another layer of abstraction to R. We have the plain-vanilla console, then the normal installation and, on top, RStudio. On the other hand, we are already introducing an extra level with R commander.

At this point we reach the point-and-click GUI. The last two years we have used R Commander, which has helped, but I have never felt entirely comfortable with it. This year I had a chat with some students that used SPSS before and, after the initial shock, they seemed to cope with R Commander. In a previous post someone suggested Deducer, which I hope to check before the end of this year. I am always on the look out for a good and easy interface for students that fall in the second and third cases (see above). It would be nice to have a series of interfaces that look like Jeroen Ooms’s prototypes. Please let me know if you have any suggestions.

(*)This is not strictly true, as I had a Nokia E72, which was a dumb phone with a lot of buttons pretending to be a smartphone.

(**)This post should be read as me thinking aloud and reflecting today’s impressions, which are continually evolving. I am not yet truly comfortable with any GUI in the R world, and still feel that SPSS, Splus or Genstat (for example) provide a nicer flow on the GUI front.

I swear there was a point in writing an introduction to covariance structures: now we can start joining all sort of analyses using very similar notation. In a previous post I described simple (even simplistic) models for a single response variable (or ‘trait’ in quantitative geneticist speak). The R code in three R packages (asreml-R, lme4 and nlme) was quite similar and we were happy-clappy with the consistency of results across packages. The curse of the analyst/statistician/guy who dabbles in analyses is the idea that we can always fit a nicer model and—as both Bob the Builder and Obama like to say—yes, we can.

Let’s assume that we have a single trial where we have assessed our experimental units (trees in my case) for more than one response variable; for example, we measured the trees for acoustic velocity (related to stiffness) and basic density. If you don’t like trees, think of having height and body weight for people, or whatever picks your fancy. Anyway, we have our traditional randomized complete block design with random blocks, random families and that’s it. Rhetorical question: Can we simultaneously run an analysis for all responses?

Up to this point we are using the same old code, and remember that we could fit the same model using lme4, so what’s the point of this post? Well, we can now move to fit a multivariate model, where we have two responses at the same time (incidentally, below we have a plot of the two response variables, showing a correlation of ~0.2).

We can first refit the model as a multivariate analysis, assuming block-diagonal covariance matrices. The notation now includes:

• The use of cbind() to specify the response matrix,
• the reserved keyword trait, which creates a vector to hold the overall mean for each response,
• at(trait), which asks ASReml-R to fit an effect (e.g. Block) at each trait, by default using a diagonal covariance matrix (σ2 I). We could also use diag(trait) for the same effect,
• rcov = ~ units:diag(trait) specifies a different diagonal matrix for the residuals (units) of each trait.

Initially, you may not notice that the results are identical, as there is a distracting change to scientific notation for the variance components. A closer inspection shows that we have obtained the same results for both traits, but did we gain anything? Not really, as we took the defaults for covariance components (direct sum of diagonal matrices, which assumes uncorrelated traits); however, we can do better and actually tell ASReml-R to fit the correlation between traits for block and family effects as well as for residuals.

Moving from model 3 to model 4 we are adding three covariance components (one for each family, block and residuals) and improving log-likelihood by 8.5. A quick look at the output from m4 would indicate that most of that gain is coming from allowing for the covariance of residuals for the two traits, as the covariances for family and, particularly, block are more modest:

An alternative parameterization for model 4 would be to use a correlation matrix with heterogeneous variances (corgh(trait) instead of us(trait)), which would model the correlation and the variances instead of the covariance and the variances. This approach seems to be more numerically stable sometimes.

As an aside, we can estimate the between traits correlation for Blocks (probably not that interesting) and the one for Family (much more interesting, as it is an estimate of the genetic correlation between traits: 1.594152e-01 / sqrt(8.248391e + 01*2.264225e-03) = 0.37).