## Cute Gibbs sampling for rounded observations

I was attending a course on Bayesian statistics where this problem showed up:

There is a number of individuals, say 12, who take a pass/fail test 15 times. For each individual we have recorded the number of passes, which can go from 0 to 15. Because of confidentiality issues, we are presented with data rounded to the closest multiple of 3 ($$\mathbf{R}$$). We are interested in estimating $$\theta$$, the probability of success of the Binomial distribution behind the data.

Rounding is probabilistic: a count is rounded to a given multiple of 3 with probability 2/3 if it is one count away from that multiple and with probability 1/3 if it is two counts away. Multiples of 3 are not rounded.

We can use Gibbs sampling to alternate between sampling the posterior for the unrounded $$\mathbf{Y}$$ and $$\theta$$. In the case of $$\mathbf{Y}$$ I used:

For $$\theta$$ we are assuming a vague $$\mbox{Beta}(\alpha, \beta)$$ prior, with $$\alpha$$ and $$\beta$$ equal to 1, so the posterior density is a $$\mbox{Beta}(\alpha + \sum Y_i, \beta + 12 \times 15 - \sum Y_i)$$.
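To make the two conditional draws concrete, here is a minimal sketch in Python (standard library only). It is not the code used in the course: the function names, the starting value, the burn-in length and the rounded data vector are all made up for illustration. The draw for each $$Y_i$$ weights the Binomial probability of every candidate count by the probability that it would have been rounded to the observed value.

```python
import random
from math import comb

N_TRIALS = 15  # each individual takes the test 15 times


def rounding_prob(y, r):
    """P(R = r | Y = y): 1 if not rounded, 2/3 if one away, 1/3 if two away."""
    return {0: 1.0, 1: 2.0 / 3.0, 2: 1.0 / 3.0}.get(abs(y - r), 0.0)


def sample_y(r, theta, rng):
    """Draw Y from p(Y | R = r, theta), proportional to Binomial(Y) * P(R = r | Y)."""
    candidates = [y for y in range(r - 2, r + 3) if 0 <= y <= N_TRIALS]
    weights = [
        comb(N_TRIALS, y) * theta**y * (1 - theta) ** (N_TRIALS - y)
        * rounding_prob(y, r)
        for y in candidates
    ]
    return rng.choices(candidates, weights=weights)[0]


def gibbs(rounded, n_iter=5000, alpha=1.0, beta=1.0, seed=1):
    """Alternate between drawing the latent counts Y and theta."""
    rng = random.Random(seed)
    theta = 0.5  # arbitrary starting value
    draws = []
    for _ in range(n_iter):
        ys = [sample_y(r, theta, rng) for r in rounded]
        total = sum(ys)
        # Conjugate update: Beta(alpha + sum Y, beta + n * 15 - sum Y)
        theta = rng.betavariate(alpha + total,
                                beta + len(rounded) * N_TRIALS - total)
        draws.append(theta)
    return draws


# Hypothetical rounded counts for 12 individuals (not the course data)
rounded = [0, 3, 3, 6, 6, 6, 9, 9, 9, 12, 12, 15]
draws = gibbs(rounded)
kept = draws[1000:]  # discard an arbitrary burn-in
posterior_mean = sum(kept) / len(kept)
print(round(posterior_mean, 3))
```

Because every full conditional is either a small discrete distribution or a Beta, no tuning is needed; that is what makes the example so tidy.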

I then implemented the sampler as:

And plotted the results as:

I thought it was a nice, cute example of simultaneously estimating a latent variable and, based on that, estimating the parameter behind it.

## Analyzing a simple experiment with heterogeneous variances using asreml, MCMCglmm and SAS

I was working with a small experiment which includes families from two Eucalyptus species and thought it would be nice to code a first analysis using alternative approaches. The experiment is a randomized complete block design, with species as a fixed effect and family and block as random effects, while the response variable is growth strain (in $$\mu \epsilon$$).

When looking at the trees one can see that the residual variances will be very different. In addition, the trees were growing in plastic bags laid out in rows (the blocks) and columns. Given that trees were growing in bags sitting on flat terrain, most likely the row effects are zero.

Below is the code for a first go in R (using both MCMCglmm and ASReml-R) and SAS. I had stopped using SAS for several years, mostly because I was running a mac for which there is no version. However, a few weeks ago I started accessing it via their OnDemand for Academics program via a web browser.

The R code using REML looks like:

While using MCMC we get estimates in the ballpark by using:

The SAS code is not that dissimilar, except for the clear demarcation between data processing (data step, for reading files, data transformations, etc.) and specific procs (procedures), in this case to summarize data, produce a boxplot and fit a mixed model.

I like working with multiple languages and I realized that, in fact, I missed SAS a bit. It was like meeting an old friend; at the beginning it felt strange but we were chatting away after a few minutes.

## INLA: Bayes goes to Norway

INLA is not the Norwegian answer to ABBA; that would probably be a-ha. INLA is the answer to ‘Why do I have enough time to cook a three-course meal while running MCMC analyses?’.

Integrated Nested Laplace Approximations (INLA) is based on direct numerical integration (rather than simulation as in MCMC) which, according to people ‘in the know’, allows:

• the estimation of marginal posteriors for all parameters,
• marginal posteriors for each random effect and
• estimation of the posterior for linear combinations of random effects.

Rather than going to the usual univariate randomized complete block or split-plot designs that I have analyzed before (here using REML and here using MCMC), I’ll go for some analyses that motivated me to look for INLA. I was having a look at some reproductive output for Drosophila data here at the university, and wanted to fit a logistic model using MCMCglmm. Unfortunately, I was running into the millions (~3M) of iterations to get a good idea of the posterior and, therefore, leaving the computer running overnight. Almost by accident I came across INLA and started playing with it. The idea is that Sol—a Ph.D. student—had a cool experiment with a bunch of flies using different mating strategies over several generations, to check the effect on breeding success. Therefore we have to keep track of the pedigree too.

Up to this point we have read the response data, the pedigree and constructed the inverse of the pedigree matrix. We also needed to build a contrast matrix to compare the mean response between the different mating strategies. I was struggling there and contacted Gregor Gorjanc, who kindly emailed me the proper way to do it.

There is another related package (Animal INLA) that takes care of i- giving details about the priors and ii- “easily” fitting models that include a term with a pedigree (an animal model in quantitative genetics speak). However, I wanted the assumptions to be clear, so I read the source of Animal INLA and shamelessly copied the useful bits (read the source, Luke!).

A quick look at the time taken by INLA shows that it is in the order of seconds (versus overnight using MCMC). I have tried a few examples and the MCMCglmm and INLA results tend to be very close; however, figuring out how to code models has been very tricky for me. INLA follows the glorious tradition of not having a ‘proper’ manual, but a number of examples with code. In fact, they reimplement BUGS’s examples. Personally, I struggle with that approach towards documentation, but you may be the right type of person for that. Note for letter to Santa: real documentation for INLA.

I was talking with a student about using Norwegian software and he mentioned Norwegian Black Metal. That got me thinking about what the developers of the package would look like; would they look like Gaahl of Gorgoroth (see interview here)?

Talk about disappointment! In fact Håvard Rue, INLA mastermind, looks like a nice, clean, non-black-metal statistician. To be fair, it would be quite hard to code in any language wearing those spikes…

## R, Julia and genome wide selection

— “You are a pussy” emailed my friend.
— “Sensu cat?” I replied.
— “No. Sensu chicken” blurbed my now ex-friend.

What was this about? He read my post on R, Julia and the shiny new thing, which prompted him to assume that I was the proverbial old dog unwilling (or was it unable?) to learn new tricks. (Incidentally, with friends like this who needs enemies? Hi, Gus.)

I decided to tackle a small (but hopefully useful) piece of code: fitting/training a Genome Wide Selection model, using the Bayes A approach put forward by Meuwissen, Hayes and Goddard in 2001. In that approach the breeding values of the individuals (response) are expressed as a function of a very large number of random predictors (2000, our molecular markers). The dataset (csv file) is a simulation of 2000 bi-allelic markers (aa = 0, Aa = 1, AA = 2) for 250 individuals, followed by the phenotypes (column 2001) and breeding values (column 2002). These models are frequently fitted using MCMC.
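For readers who prefer to see the model written down, Bayes A is usually expressed along these lines. The notation is mine rather than the course's: $$x_{ij}$$ is the 0/1/2 genotype of individual $$i$$ at marker $$j$$, $$b_j$$ is the marker effect with its own variance $$\sigma_j^2$$, and $$\nu$$ and $$S^2$$ are the degrees of freedom and scale of the prior for those variances.

$$y_i = \mu + \sum_{j=1}^{2000} x_{ij} b_j + e_i, \qquad e_i \sim N(0, \sigma_e^2)$$

$$b_j \mid \sigma_j^2 \sim N(0, \sigma_j^2), \qquad \sigma_j^2 \sim \nu S^2 \chi^{-2}_{\nu}$$

The marker-specific variances are what let a few markers have large effects while most are shrunk towards zero.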

In 2010 I attended this course in Ames, Iowa where Rohan Fernando passed us the following R code (pretty much a transliteration from C code; notice the trailing semicolons, for example). P.D. 2012-04-26 Please note that this is teaching code not production code:

Thus, we just need to define a few variables, read the data (marker genotypes, breeding values and phenotypic data) into a matrix, write loops, do matrix and vector multiplication and generate random numbers (using Gaussian and Chi-squared distributions). Not much if you think about it, but I didn’t have much time to explore Julia’s features so as to go for something more complex.
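As an illustration of where the Chi-squared random numbers come in, here is a sketch of one step of such a sampler: updating the variance of a single marker given its current effect. It is written in Python rather than R or Julia, it is not the course code, and the function name and the values of nu and s2 are assumptions chosen only for the example. Under a scaled inverse Chi-squared prior the conditional is again scaled inverse Chi-squared, with one extra degree of freedom.

```python
import random


def draw_marker_variance(b_j, nu, s2, rng):
    """Draw sigma_j^2 given marker effect b_j under a nu * s2 / chi2(nu) prior.

    The full conditional is (nu * s2 + b_j^2) / chi2(nu + 1).
    """
    # A Chi-squared variate with k degrees of freedom is Gamma(shape=k/2, scale=2)
    chi2 = rng.gammavariate((nu + 1) / 2.0, 2.0)
    return (nu * s2 + b_j**2) / chi2


rng = random.Random(42)
nu, s2 = 4.0, 0.01  # hypothetical prior degrees of freedom and scale
b_j = 0.1           # hypothetical current value of one marker effect

draws = [draw_marker_variance(b_j, nu, s2, rng) for _ in range(50000)]
mean_draw = sum(draws) / len(draws)
print(mean_draw)
```

In a full sampler this draw sits inside the loop over markers, right after the Gaussian draw for the marker effect itself.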

The code looks remarkably similar and there are four main sources of differences:

1. The first trivial one is that the original code read a binary dataset and I didn’t know how to do it in Julia, so I’ve read a csv file instead (this is why I start timing after reading the file too).
2. The second trivial one is to avoid name conflicts between variables and functions; for example, in R the user is allowed to have a variable called var that will not interfere with the variance function. Julia is picky about that, so I needed to rename some variables.
3. Julia passes variables by reference, while R behaves as if it copies matrices on assignment, which tripped me up because in the original R code there was something like: b = array(0.0,size); meanb = b;. This works fine in R, but in Julia meanb and b refer to the same array, so changes to the b vector also changed meanb.
4. The definition of scalar vs array created some problems in Julia. For example y' * y (t(y) %*% y in R) is numerically equivalent to dot(y, y). However, the first version returns an array, while the second one returns a scalar. I got an error message when trying to store the ‘scalar like an array’ into an array. I find that confusing.
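The aliasing trap in point 3 is not unique to Julia. Python lists behave the same way, so a short Python sketch (the variable names mirror the snippet above, the values are arbitrary) shows both the problem and the fix, which is to ask for a copy explicitly, much like copy(b) does in Julia.

```python
size = 5
b = [0.0] * size

meanb = b                   # binds a second name to the same list; nothing is copied
b[0] = 1.5
aliased_value = meanb[0]    # the change made through b is visible through meanb

meanb_copy = list(b)        # explicit copy: an independent list
b[1] = 2.5
copied_value = meanb_copy[1]  # unaffected by the later change to b

print(aliased_value, copied_value)
```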

One interesting point in this comparison is that it uses rough code, not really optimized for speed; in fact, the only thing that I can say of the Julia code is that ‘it runs’ and it probably is not very idiomatic. In test runs with different numbers of markers, R needed roughly 2.8 times the time used by Julia. The Julia website claims better results in benchmarks, but in real life we work with, well, real problems.

In 1996-7 I switched from SAS to ASReml for genetic analyses because it was 1-2 orders of magnitude faster and opened a world of new models. Today a change from R to Julia would deliver (in this particular case) a much more modest speed up (~3x), which is OK but not worth changing languages (yet). This, together with the embryonic graphical capabilities and the still-to-develop ecosystem of packages, means that I’m still using R. Nevertheless, the Julia team has achieved very impressive performance in very little time, so it is worth keeping an eye on their progress.

P.S.1 Readers are welcome to suggest ways of improving the code.
P.S.2 WordPress does not let me upload the binary version of the simulated data.
P.S.3 Hey WordPress guys; it would be handy if the sourcecode plugin supported Julia!
P.S.4 2012-04-26 Following AL’s recommendation in the comments, one can replace in R:

by

reducing execution time by roughly 20%, making the difference between Julia and R even smaller.

## Mid-January flotsam: teaching edition

I was thinking about new material that I will use for teaching this coming semester (starting the third week of February) and suddenly compiled the following list of links:

Enough procrastination. Let’s keep on filling out PBRF forms; it is the right time for that hexennial activity.

## Doing Bayesian Data Analysis now in JAGS

Around Christmas time I presented my first impressions of Kruschke’s Doing Bayesian Data Analysis. This is a very nice book but one of its drawbacks was that part of the code used BUGS, which left mac users like me stuck.

Kruschke has now made JAGS code available so I am happy clappy and looking forward to testing this New Year present. In addition, there are other updates available for the programs included in the book.

## First impressions of Doing Bayesian Data Analysis

About a month ago I was discussing the approach that I would like to see in introductory Bayesian statistics books. In that post I mentioned a PDF copy of Doing Bayesian Data Analysis by John K. Kruschke and that I had ordered the book. Well, recently a parcel was waiting in my office with a spanking new, real paper copy of the book. A few days are not enough to provide a ‘proper’ review but I would like to discuss my first impressions of the book, as they could be helpful for someone out there.

If I were looking for a single word to define the book it would be meaty, not in the “having the flavor or smell of meat” sense of the word as pointed out by Newton, but on the conceptual side. Kruschke has clearly put a lot of thought into how to draw a generic student with little background on the topic into thinking about statistical concepts. In addition, Kruschke clearly loves language and has an interesting, sometimes odd, sense of humor; anyway, who am I to comment on someone else’s strange sense of humor?

One difference between the dodgy PDF copy and the actual book is the use of color, three shades of blue, to highlight section headers and graphical content. In general I am not a big fan of lots of colors and contentless pictures as used in modern calculus and physics undergraduate books. In this case, the effect is pleasant and makes browsing and reading the book more accessible. Most graphics really drive a point home and support the written material, although there are exceptions, in my opinion, like some faux 3D graphs (Figures 17.2 and 17.3 under multiple linear regression) that I find somewhat confusing.

The book’s website contains PDF versions of the table of contents and chapter 1, which is a good way to whet your appetite. The book covers enough material as to be the sole text for an introductory Bayesian statistics course, either starting from scratch or as a transition from a previous course with a frequentist approach. There are plenty of exercises, a solutions manual and plenty of R code available.

The mere existence of this book prompts the question: Can we afford not to introduce students to a Bayesian approach to statistics? In turn this sparks the question How do we convince departments to de-emphasize the old way? (this quote is extremely relevant)

Verdict: if you are looking for a really introductory text, this is hands down the best choice. The material goes from the ‘OMG do I need to learn stats?’ level to multiple linear regression, ANOVA, hierarchical models and GLMs.

P.S. I’m still using a combination of books, including Kruschke’s, and Marin and Robert’s for my own learning process.
P.S.2 There is a lot to be said about a book that includes puppies on its cover and references to A Prairie Home Companion on its first page (the show is sometimes rebroadcast down under by Radio New Zealand).

## Tall big data, wide big data

After attending two one-day workshops last week I spent most days paying attention to (well, at least listening to) presentations in this biostatistics conference. Most presenters were R users (although Genstat, Matlab and SAS fans were also present) and not once did I hear “I can’t deal with the current size of my data sets”. However, there were some complaints about the speed of R, particularly when dealing with simulations or some genomic analyses.

Some people worried about the size of coming datasets; nevertheless that worry was across statistical packages or, more precisely, it went beyond statistical software. How will we be able to even store the data from something like the Square Kilometer Array, let alone analyze it?

In a previous post I was asking if we needed to actually deal with ‘big data’ in R, and my answer was probably not or, better, at least not directly. I still think that it is a valid, although incomplete, opinion. In many statistical analyses we can think of n (the number of observations) and p (the number of variables per observation). In most cases, particularly when people refer to big data, n >> p. Thus, we may have 100 million people but we have only 10 potential predictors: tall data. In contrast, we may have only 1,000 individuals but with 50,000 points each coming from near-infrared spectroscopy or information from 250,000 SNPs (a type of molecular marker): wide data. Both types of data will keep on growing but are challenging in different ways.

In a totally generalizing, unfair and simplistic way I will state that, from a modeling perspective, dealing with wide is more difficult (and potentially interesting) than dealing with tall. As the t-shirt says: sampling is not a crime, and it should work quite well with simpler models and large datasets. In contrast, sampling to fit wide data may not work at all.

Algorithms. Clever algorithms are what we need in a first stage. For example, we can fit linear mixed models to a tall dataset with ten million records or a multivariate mixed model with 60 responses using ASReml-R. Wide datasets are often approached using Bayesian inference, but MCMC gets slooow when dealing with thousands of predictors, so we need other, fast approximations to the posterior distributions.

This post may not be totally coherent, but it keeps the conversation going. My excuse? I was watching Be kind rewind while writing it.

## If you are writing a book on Bayesian statistics

This post is somewhat marginal to R in that there are several statistical systems that could be used to tackle the problem. Bayesian statistics is one of those topics that I would like to understand better, much better, in fact. Unfortunately, I struggle to get the time to attend courses on the topic between running my own lectures, research and travel; there are always books, of course.

In my (highly individual and dubious) opinion Albert’s book is the easiest to read. I was waiting to see the doctor while reading—and actually understanding—some of the concepts. The book is certainly geared towards R users and gradually develops the code necessary to run simple analyses from estimating a proportion to fitting (simple) hierarchical linear models. I’m still reading, which is a compliment.

Marin and Robert’s book is quite different in that it uses R as a vehicle (like this blog) but the focus is more on the conceptual side and covers more types of models than Albert’s book. I do not have the probability background for this course (or maybe I did, but it was ages ago); however, the book makes me want to learn/refresh that background. An annoying comment on the book is that it is “self-contained”; well, anything is self-contained if one asks for enough prerequisites! I’m still reading (jumping between Albert’s and this book), and the book has managed to capture my interest.

Finally, Bolstad’s book. How to put this? “It is not you, it is me”. It is much more technical and I do not have the time, nor the patience, to wait until chapter 8 to do something useful (logistic regression). This is going back to the library until an indeterminate future.

If you are now writing a book on the topic I would like you to think of the following use case:

• the reader has little or no exposure to Bayesian statistics, but has been working for a while with ‘classical’ methods,
• the reader is self-motivated, but he doesn’t want to spend ages to be able to fit even a simple linear regression,
• the reader has little background on probability theory, but he is willing to learn some in between learning the tools and running some analyses,
• using a statistical system that allows for both classical and Bayesian approaches is a plus.

It is hard for me to be more selfish in this description; you are potentially writing a book for me.

† After the first quake our main library looked like this. Now it is mostly normal.

P.S. After publishing this post I remembered that I came across a PDF copy of Doing Bayesian Data Analysis: A Tutorial with R and BUGS by Kruschke. Setting aside the dodginess of the copy, the book looked well-written, started from first principles and had puppies on the cover (!), so I ordered it from Amazon.

P.D. 2011-12-03 23:45 AEST Christian Robert sent me a nice email and wrote a few words on my post. Yes, I’m still plodding along with the book although I’m taking a ten day break while traveling in Australia.

P.D. 2011-11-25 12:25 NZST Here is a list of links to Amazon for the books suggested in the comments:

## Surviving a binomial mixed model

A few years ago we had this really cool idea: we had to establish a trial to understand wood quality in context. Sort of following the saying “we don’t know who discovered water, but we are sure that it wasn’t a fish” (attributed to Marshall McLuhan). By now you are thinking WTF is this guy talking about? But the idea was simple; let’s put a trial that had the species we wanted to study (Pinus radiata, a gymnosperm) and an angiosperm (Eucalyptus nitens if you wish to know) to provide the contrast, as they are supposed to have vastly different types of wood. From space the trial looked like this:

The reason you can clearly see the pines but not the eucalypts is that the latter were dying like crazy over a summer drought (45% mortality in one month). And here we get to the analytical part: we will have a look only at the eucalypts, where the response variable can’t get any clearer: trees were either totally dead or alive. The experiment followed a randomized complete block design, with 50 open-pollinated families in 48 blocks. The original idea was to harvest 12 blocks each year but, for obvious reasons, we canned this part of the experiment after the first year.

The following code shows the analysis in asreml-R, lme4 and MCMCglmm:

You may be wondering: where does the 3.29 in the heritability formula come from? Well, that’s the variance implied by the link function which, in the case of the logit link, is pi*pi/3 (the variance of the standard logistic distribution). In the case of MCMCglmm we can estimate the degree of genetic control quite easily, remembering that we have half-siblings (open-pollinated plants):
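As a quick check of the arithmetic, here is a small Python sketch (not the original R code) of the calculation. With half-siblings the family variance is a quarter of the additive variance, hence the factor of 4. The function name and the family variance of 0.5 are made up, and which other variance components go in the denominator (for example blocks) is a modelling choice that I leave as an optional argument.

```python
import math

# Variance of the standard logistic distribution, used for the logit link
LOGIT_LINK_VARIANCE = math.pi**2 / 3  # roughly 3.29


def half_sib_heritability(var_family, var_other=0.0):
    """Heritability on the logit scale for half-sib (open-pollinated) families."""
    additive = 4.0 * var_family  # family variance is 1/4 of the additive variance
    phenotypic = var_family + var_other + LOGIT_LINK_VARIANCE
    return additive / phenotypic


h2 = half_sib_heritability(0.5)  # 0.5 is a hypothetical family variance
print(round(LOGIT_LINK_VARIANCE, 2), round(h2, 3))
```

With MCMC output the same function can be applied to every posterior draw of the variance components, which gives a posterior distribution for the heritability rather than a single number.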

By the way, it is good to remember that we need to back-transform the estimated effects to probabilities, with very simple code:
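The back-transformation is just the inverse of the logit. A minimal Python version (the function name is mine, and it is written in two branches only to avoid overflow for extreme values) looks like this:

```python
import math


def inv_logit(eta):
    """Map a value on the logit scale back to a probability."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


# An effect of 0 on the logit scale corresponds to a probability of 0.5
print(inv_logit(0.0))
```

Remember to add the intercept (and any other relevant effects) before back-transforming, because the transformation is not linear.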

Even if one of your trials is trashed there is a silver lining: it is possible to have a look at survival.