Categories

## Cute Gibbs sampling for rounded observations

I was attending a course of Bayesian Statistics where this problem showed up:

There is a number of individuals, say 12, who take a pass/fail test 15 times. For each individual we have recorded the number of passes, which can go from 0 to 15. Because of confidentiality issues, we are presented with rounded-to-the-closest-multiple-of-3 data ($$\mathbf{R}$$). We are interested on estimating $$\theta$$ of the Binomial distribution behind the data.

Rounding is probabilistic, with probability 2/3 if you are one count away from a multiple of 3 and probability 1/3 if the count is you are two counts away. Multiples of 3 are not rounded.

We can use Gibbs sampling to alternate between sampling the posterior for the unrounded $$\mathbf{Y}$$ and $$\theta$$. In the case of $$\mathbf{Y}$$ I used:

# Possible values that were rounded to R
possible <- function(rounded) {
if(rounded == 0) {
options <- c(0, 1, 2)
} else {
options <- c(rounded - 2, rounded - 1, rounded,
rounded + 1, rounded + 2)
}
return(options)
}

# Probability mass function of numbers rounding to R
# given theta
prior_y <- function(options, theta) {
p <- dbinom(options, 15, prob = theta)
return(p)
}

# Likelihood of rounding
like_round3 <- function(options) {
if(length(options) == 3) {
like <- c(1, 2/3, 1/3) }
else {
like <- c(1/3, 2/3, 1, 2/3, 1/3)
}
return(like)
}

# Estimating posterior mass function and drawing a
# random value of it
posterior_sample_y <- function(R, theta) {
po <- possible(R)
pr <- prior_y(po, theta)
li <- like_round3(po)
post <- li*pr/sum(li*pr)
samp <- sample(po, 1, prob = post)
return(samp)
}

While for $$theta$$ we are assuming a vague $$mbox{Beta}(alpha, eta)$$, with $$alpha$$ and $$eta$$ equal to 1, as prior density function for $$theta$$, so the posterior density is a $$mbox{Beta}(alpha + sum Y_i, eta + 12*15 - sum Y_i)$$.

## Function to sample from the posterior Pr(theta | Y, R)
posterior_sample_theta <- function(alpha, beta, Y) {
theta <- rbeta(1, alpha + sum(Y), beta + 12*15 - sum(Y))
return(theta)
}

I then implemented the sampler as:

## Data
R <- c(0, 0, 3, 9, 3, 0, 6, 3, 0, 6, 0, 3)
nsim <- 10000
burnin <- 1000
alpha <- 1
beta <- 1
store <- matrix(0, nrow = nsim, ncol = length(R) + 1)

starting.values <- c(R, 0.1)

## Sampling
store[1,] <- starting.values
for(draw in 2:nsim){
current <- store[draw - 1,]
for(obs in 1:length(R)) {
y <- posterior_sample_y(R[obs], current[length(R) + 1])
# Jump or not still missing
current[obs] <- y
}
theta <- posterior_sample_theta(alpha, beta, current[1:length(R)])
# Jump or not still missing
current[length(R) + 1] <- theta

store[draw,] <- current
}

And plotted the results as:

plot((burnin+1):nsim, store[(burnin+1):nsim,13], type = 'l')

library(ggplot2)

ggplot(data.frame(theta = store[(burnin+1):nsim,13]), aes(x = theta)) +
geom_density(fill = 'blue', alpha = 0.5)
multiple_plot <- data.frame(Y = matrix(store[(burnin+1):nsim, 1:12],
nrow = (nsim - burnin)*12,
ncol = 1))
multiple_plot$obs <- factor(rep(1:12, each = (nsim - burnin))) ggplot(multiple_plot, aes(x = Y)) + geom_histogram() + facet_grid(~obs) I thought it was a nice, cute example of simultaneously estimating a latent variable and, based on that, estimating the parameter behind it. Categories ## Analyzing a simple experiment with heterogeneous variances using asreml, MCMCglmm and SAS I was working with a small experiment which includes families from two Eucalyptus species and thought it would be nice to code a first analysis using alternative approaches. The experiment is a randomized complete block design, with species as fixed effect and family and block as a random effects, while the response variable is growth strain (in $$mu epsilon$$). When looking at the trees one can see that the residual variances will be very different. In addition, the trees were growing in plastic bags laid out in rows (the blocks) and columns. Given that trees were growing in bags siting on flat terrain, most likely the row effects are zero. Below is the code for a first go in R (using both MCMCglmm and ASReml-R) and SAS. I had stopped using SAS for several years, mostly because I was running a mac for which there is no version. However, a few weeks ago I started accessing it via their OnDemand for Academics program via a web browser. The R code using REML looks like: # Options options(stringsAsFactors = FALSE) setwd('~/Dropbox/research/2013/growth-stress') # Packages require(ggplot2) require(asreml) require(MCMCglmm) # Reading data, renaming, factors, etc gs = read.csv('eucalyptus-growth-stress.csv') summary(gs) # Both argophloia and bosistoana gsboth = subset(gs, !is.na(strain)) gsboth = within(gsboth, { species = factor(species) row = factor(row) column = factor(column) fam = factor(fam) }) ma = asreml(strain ~ species, random = ~ fam + row, rcov = ~ at(species):units, data = gsboth) summary(ma)$varcomp

#                                   gamma  component std.error   z.ratio constraint
#fam!fam.var                    27809.414  27809.414 10502.036 2.6480022   Positive
#row!row.var                     2337.164   2337.164  3116.357 0.7499666   Positive
#species_E.argopholia!variance 111940.458 111940.458 26609.673 4.2067580   Positive
#species_E.bosistoana!variance  63035.256  63035.256  7226.768 8.7224681   Positive

While using MCMC we get estimates in the ballpark by using:

# Priors
bothpr = list(R = list(V = diag(c(50000, 50000)), nu = 3),
G = list(G1 = list(V = 20000, nu = 0.002),
G2 = list(V = 20000, nu = 0.002),
G3 = list(V = 20000, nu = 0.002)))

# MCMC
m2 = MCMCglmm(strain ~ species, random = ~ fam + row + column,
rcov = ~ idh(species):units,
data = gsboth,
prior = bothpr,
pr = TRUE,
family = 'gaussian',
burnin = 10000,
nitt = 40000,
thin = 20,
saveX = TRUE,
saveZ = TRUE,
verbose = FALSE)

summary(m2)

# Iterations = 10001:39981
# Thinning interval  = 20
# Sample size  = 1500
#
# DIC: 3332.578
#
# G-structure:  ~fam
#
# post.mean l-95% CI u-95% CI eff.samp
# fam     30315    12211    55136     1500
#
# ~row
#
# post.mean l-95% CI u-95% CI eff.samp
# row      1449    5.928     6274    589.5
#
# R-structure:  ~idh(species):units
#
# post.mean l-95% CI u-95% CI eff.samp
# E.argopholia.units    112017    71152   168080     1500
# E.bosistoana.units     65006    52676    80049     1500
#
# Location effects: strain ~ species
#
# post.mean l-95% CI u-95% CI eff.samp  pMCMC
# (Intercept)            502.21   319.45   690.68     1500 <7e-04 ***
# speciesE.bosistoana   -235.95  -449.07   -37.19     1361  0.036 *

The SAS code is not that disimilar, except for the clear demarcation between data processing (data step, for reading files, data transformations, etc) and specific procs (procedures), in this case to summarize data, produce a boxplot and fit a mixed model.

* termstr=CRLF accounts for the windows-like line endings of the data set;
data gs;
infile "/home/luis/Projects/euc-strain/growthstresses.csv"
dsd termstr=CRLF firstobs=2;
input row column species $family$ strain;
if strain ^= .;
run;

proc summary data = gs print;
class species;
var strain;
run;

proc boxplot data = gs;
plot strain*species;
run;

proc mixed data = gs;
class row species family;
model strain = species;
random row family;
repeated species / group=species;
run;

/*
Covariance Parameter Estimates
Cov Parm 	Group 	               Estimate
row 	  	                       2336.80
family 	  	                       27808
species 	species E.argoph 	111844
species 	species E.bosist 	63036
*/

I like working with multiple languages and I realized that, in fact, I missed SAS a bit. It was like meeting an old friend; at the beginning felt strange but we were quickly chatting away after a few minutes.

Categories

## INLA: Bayes goes to Norway

INLA is not the Norwegian answer to ABBA; that would probably be a-ha. INLA is the answer to ‘Why do I have enough time to cook a three-course meal while running MCMC analyses?”.

Integrated Nested Laplace Approximations (INLA) is based on direct numerical integration (rather than simulation as in MCMC) which, according to people ‘in the know’, allows:

• the estimation of marginal posteriors for all parameters,
• marginal posteriors for each random effect and
• estimation of the posterior for linear combinations of random effects.

Rather than going to the usual univariate randomized complete block or split-plot designs that I have analyzed before (here using REML and here using MCMC), I’ll go for some analyses that motivated me to look for INLA. I was having a look at some reproductive output for Drosophila data here at the university, and wanted to fit a logistic model using MCMCglmm. Unfortunately, I was running into the millions (~3M) of iterations to get a good idea of the posterior and, therefore, leaving the computer running overnight. Almost by accident I came across INLA and started playing with it. The idea is that Sol—a Ph.D. student—had a cool experiment with a bunch of flies using different mating strategies over several generations, to check the effect on breeding success. Therefore we have to keep track of the pedigree too.

# Set working directory containing data and code
setwd('~/Dropbox/research/2010/flies')

# Packages needed for analysis
# This code requires the latest (and updated) version of INLA
require(pedigreemm)      # pedigree(), relfactor(), Tinv, D, ...

####### Pedigree and assessment files

dat$cross = factor(dat$cross)

# Pedigree object for pedigreemm functions
pedPMM = with(ped, pedigreemm::pedigree(sire=dad, dam=mum, label=id))

# Pedigree precision matrix (A inverse)
# T^{-1} in A^{-1} = (T^{-1})' D^{-1} T^{-1}
Tinv = as(pedPMM, "sparseMatrix")
D    = Diagonal(x=Dmat(pedPMM))   # D      in A = TDT'
Dinv = solve(D)
Ainv = t(Tinv) %*% Dinv %*% Tinv

Up to this point we have read the response data, the pedigree and constructed the inverse of the pedigree matrix. We also needed to build a contrast matrix to compare the mean response between the different mating strategies. I was struggling there and contacted Gregor Gorjanc, who kindly emailed me the proper way to do it.

# Define contrasts to compare cross types. Thanks to Gregor Gorjanc
# for coding contrasts
k = nlevels(dat$cross) tmp = matrix(nrow=(k-1)*k/2, ncol=k) ## 1 2 3 4 5 6 tmp[ 1, ] = c( 1, -1, NA, NA, NA, NA) ## c1-c2 tmp[ 2, ] = c( 1, NA, -1, NA, NA, NA) ## -c3 tmp[ 3, ] = c( 1, NA, NA, -1, NA, NA) ## -c4 tmp[ 4, ] = c( 1, NA, NA, NA, -1, NA) ## -c5 tmp[ 5, ] = c( 1, NA, NA, NA, NA, -1) ## -c6 tmp[ 6, ] = c( NA, 1, -1, NA, NA, NA) ## c2-c3 tmp[ 7, ] = c( NA, 1, NA, -1, NA, NA) ## -c4 tmp[ 8, ] = c( NA, 1, NA, NA, -1, NA) ## -c5 tmp[ 9, ] = c( NA, 1, NA, NA, NA, -1) ## -c6 tmp[10, ] = c( NA, NA, 1, -1, NA, NA) ## c3-c4 tmp[11, ] = c( NA, NA, 1, NA, -1, NA) ## -c5 tmp[12, ] = c( NA, NA, 1, NA, NA, -1) ## -c6 tmp[13, ] = c( NA, NA, NA, 1, -1, NA) ## c4-c5 tmp[14, ] = c( NA, NA, NA, 1, NA, -1) ## -c6 tmp[15, ] = c( NA, NA, NA, NA, 1, -1) ## c5-c6 # Make Linear Combinations LC = inla.make.lincombs(cross=tmp) # Assign names to combinations t = 0 for(i in 1:(k-1)) { for(j in (i+1):k) { t = t + 1 names(LC)[t] = paste("c", i, "-", "c", j, sep="") } } There is another related package (Animal INLA) that takes care of i- giving details about the priors and ii- “easily” fitting models that include a term with a pedigree (an animal model in quantitative genetics speak). However, I wanted the assumptions to be clear so read the source of Animal INLA and shamelessly copied the useful bits (read the source, Luke!). ###### Analysis for for binomial traits ####### Plain-vanilla INLA Version # Feeling more comfortable with *explicit* statement of assumptions # (rather than hidden behind animal.inla()) # Function to backconvert logits to probabilities back.conv = function(values){ return(1/(1+exp(-(values)))) } # Function to get posterior of the odds # Thanks to Helena Moltchanova inla.marginal.summary = function(x){ m1 = inla.emarginal(function(z) exp(z), marginal=x) odds = inla.marginal.transform(function(x) exp(x), x) q = inla.qmarginal(p=c(0.025, 0.975), marginal=odds) c("0.025quant"=q[1], "0.5quant"=m1, "0.975quant"=q[2]) } # Model for pupae/eggs # Drops a few observations with no reproductive output (trips INLA) no0eggs = subset(dat, eggs>0 & pupae <= eggs) # Actual model mpueg = pupae ~ f(cross, model='iid', constr=TRUE, hyper=list(theta=list(initial=-10, fixed=TRUE))) + f(id, model='generic0', constr=TRUE, Cmatrix=Ainv, hyper=list(theta=list(param=c(0.5,0.5), fixed=FALSE))) # INLA call fpueg = inla(formula=mpueg, family='binomial', data=no0eggs, lincomb=LC, Ntrials=eggs, control.compute=list(dic=FALSE)) # Results summary(fpueg) # Call: # c("inla(formula = mpueg, family = \"binomial\", data = no0eggs, # Ntrials = eggs, ", " lincomb = LC, control.compute = list(dic = FALSE))") # # Time used: # Pre-processing Running inla Post-processing Total # 0.2712612 1.1172159 2.0439510 3.4324281 # # Fixed effects: # mean sd 0.025quant 0.5quant 0.975quant kld # (Intercept) 1.772438 0.1830827 1.417413 1.770863 2.136389 0.5833235 # # Linear combinations (derived): # ID mean sd 0.025quant 0.5quant 0.975quant kld # c1-c2 0 -0.26653572 0.7066540 -1.6558225 -0.26573011 1.11859967 0 # c1-c3 1 0.04150999 0.7554753 -1.4401435 0.04104020 1.52622856 0 # c1-c4 2 -0.08777325 0.6450669 -1.3557501 -0.08713005 1.17693349 0 # c1-c5 3 -1.36702960 0.6583121 -2.6615604 -1.36618274 -0.07690788 0 # c1-c6 4 -1.82037714 0.8193280 -3.4338294 -1.81848244 -0.21714431 0 # c2-c3 5 0.30804735 0.7826815 -1.2248185 0.30677279 1.84852340 0 # c2-c4 6 0.17876229 0.5321948 -0.8654273 0.17859036 1.22421409 0 # c2-c5 7 -1.10049385 0.7466979 -2.5663142 -1.10046590 0.36558211 0 # c2-c6 8 -1.55383673 0.8188321 -3.1640965 -1.55276603 0.05084282 0 # c3-c4 9 -0.12928419 0.7475196 -1.5996080 -0.12817855 1.33522000 0 # c3-c5 10 -1.40854298 0.6016539 -2.5930656 -1.40723901 -0.23103707 0 # c3-c6 11 -1.86189314 0.8595760 -3.5555571 -1.85954031 -0.18100418 0 # c4-c5 12 -1.27925604 0.6998640 -2.6536362 -1.27905616 0.09438701 0 # c4-c6 13 -1.73259977 0.7764105 -3.2600936 -1.73134961 -0.21171790 0 # c5-c6 14 -0.45334267 0.8179794 -2.0618730 -0.45229981 1.14976690 0 # # Random effects: # Name Model Max KLD # cross IID model # id Generic0 model # # Model hyperparameters: # mean sd 0.025quant 0.5quant 0.975quant # Precision for id 0.08308 0.01076 0.06381 0.08244 0.10604 # # Expected number of effective parameters(std dev): 223.95(0.7513) # Number of equivalent replicates : 1.121 # # Marginal Likelihood: -1427.59 fpueg$summary.random$cross # ID mean sd 0.025quant 0.5quant 0.975quant kld # 1 1 -0.5843466 0.4536668 -1.47561024 -0.5840804 0.3056632 0.0178780930 # 2 2 -0.3178102 0.4595676 -1.21808638 -0.3184925 0.5865565 0.0009666916 # 3 3 -0.6258600 0.4978254 -1.60536281 -0.6250077 0.3491075 0.0247426578 # 4 4 -0.4965763 0.4071715 -1.29571071 -0.4966277 0.3030747 0.0008791629 # 5 5 0.7826817 0.4389003 -0.07756805 0.7821937 1.6459253 0.0077476806 # 6 6 1.2360387 0.5768462 0.10897529 1.2340813 2.3744368 0.0451357379 # Backtransforms point estimates and credible intervals for odds -> prob for(name in names(fpueg$marginals.lincomb.derived)){
summa = inla.marginal.summary(eval(parse(text=paste("fpueg$marginals.lincomb.derived$\'", name, "\'", sep=''))))
cat(name, summa, '\n')
}

# c1-c2 0.1894451 0.9831839 3.019878
# c1-c3 0.2338952 1.387551 4.534581
# c1-c4 0.256858 1.127751 3.204961
# c1-c5 0.0695406 0.3164847 0.9145132
# c1-c6 0.03157478 0.2264027 0.792517
# c2-c3 0.289088 1.850719 6.255175
# c2-c4 0.4213069 1.377848 3.366947
# c2-c5 0.0759222 0.4398384 1.420934
# c2-c6 0.04135211 0.2955985 1.035951
# c3-c4 0.1996085 1.16168 3.747526
# c3-c5 0.0746894 0.2929174 0.7847903
# c3-c6 0.02774805 0.2245797 0.821099
# c4-c5 0.06988459 0.355529 1.084414
# c4-c6 0.03780307 0.2389529 0.7974092
# c5-c6 0.1245211 0.8878682 3.108852

A quick look at the time taken by INLA shows that it is in the order of seconds (versus overnight using MCMC). I have tried a few examples and the MCMCglmm and INLA results tend to be very close; however, figuring out how to code models has been very tricky for me. INLA follows the glorious tradition of not having a ‘proper’ manual, but a number of examples with code. In fact, they reimplement BUGS‘s examples. Personally, I struggle with that approach towards documentation, but you may be the right type of person for that. Note for letter to Santa: real documentation for INLA.

I was talking with a student about using Norwegian software and he mentioned Norwegian Black Metal. That got me thinking about how the developers of the package would look like; would they look like Gaahl of Gorgoroth (see interview here)?

Talk about disappointment! In fact Håvard Rue, INLA mastermind, looks like a nice, clean, non-black-metal statistician. To be fair, it would be quite hard to code in any language wearing those spikes…

Categories

## R, Julia and genome wide selection

— “You are a pussy” emailed my friend.
— “Sensu cat?” I replied.
— “No. Sensu chicken” blurbed my now ex-friend.

What was this about? He read my post on R, Julia and the shiny new thing, which prompted him to assume that I was the proverbial old dog unwilling (or was it unable?) to learn new tricks. (Incidentally, with friends like this who needs enemies? Hi, Gus.)

I decided to tackle a small—but hopefully useful—piece of code: fitting/training a Genome Wide Selection model, using the Bayes A approach put forward by Meuwissen, Hayes and Goddard in 2001. In that approach the breeding values of the individuals (response) are expressed as a function of a very large number of random predictors (2000, our molecular markers). The dataset (csv file) is a simulation of 2000 bi-allelic markers (aa = 0, Aa = 1, AA = 2) for 250 individuals, followed by the phenotypes (column 2001) and breeding values (column 2002). These models are frequently adjusted using MCMC.

In 2010 I attended this course in Ames, Iowa where Rohan Fernando passed us the following R code (pretty much a transliteration from C code; notice the trailing semicolons, for example). P.D. 2012-04-26 Please note that this is teaching code not production code:

nmarkers = 2000;    # number of markers
startMarker = 1981; # set to 1 to use all
numiter  = 2000;    # number of iterations
vara     = 1.0/20.0;

# input data
data     = matrix(scan("trainData.out0"),ncol=nmarkers+2,byrow=TRUE);
nrecords = dim(data)[1];

beg = Sys.time()

# x has the mean followed by the markers
x = cbind(1,data[,startMarker:nmarkers]);
y = data[,nmarkers+1];
a =  data[,nmarkers+2];
# inital values

nmarkers = nmarkers - startMarker + 1;
mean2pq = 0.5;                          # just an approximation
scalea  = 0.5*vara/(nmarkers*mean2pq);  # 0.5 = (v-2)/v for v=4

size = dim(x)[2];
b = array(0.0,size);
meanb = b;
b[1] = mean(y);
var  = array(0.0,size);

ycorr = y - x%*%b;

# MCMC sampling
for (iter in 1:numiter){
# sample vare
vare = ( t(ycorr)%*%ycorr )/rchisq(1,nrecords + 3);

# sample intercept
ycorr = ycorr + x[,1]*b[1];
rhs = sum(ycorr)/vare;
invLhs = 1.0/(nrecords/vare);
mean = rhs*invLhs;
b[1] = rnorm(1,mean,sqrt(invLhs));
ycorr = ycorr - x[,1]*b[1];
meanb[1] = meanb[1] + b[1];

# sample variance for each locus
for (locus in 2:size){
var[locus] = (scalea*4+b[locus]*b[locus])/rchisq(1,4.0+1)
}

# sample effect for each locus
for (locus in 2:size){
# unadjust y for this locus
ycorr = ycorr + x[,locus]*b[locus];
rhs = t(x[,locus])%*%ycorr/vare;
lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus];
invLhs = 1.0/lhs;
mean = invLhs*rhs;
b[locus]= rnorm(1,mean,sqrt(invLhs));
#adjust y for the new value of this locus
ycorr = ycorr - x[,locus]*b[locus];
meanb[locus] = meanb[locus] + b[locus];
}
}

Sys.time() - beg

meanb = meanb/numiter;
aHat  = x %*% meanb;

Thus, we just need defining a few variables, reading the data (marker genotypes, breeding values and phenotypic data) into a matrix, creating loops, matrix and vector multiplication and generating random numbers (using a Gaussian and Chi squared distributions). Not much if you think about it, but I didn’t have much time to explore Julia’s features as to go for something more complex.

nmarkers = 2000    # Number of markers
startmarker = 1981 # Set to 1 to use all
numiter = 2000     # Number of iterations

(nrecords, ncols) = size(data)

tic()

#this is the mean and markers matrix
X = hcat(ones(Float64, nrecords), data[:, startmarker:nmarkers])
y = data[:, nmarkers + 1]
a = data[:, nmarkers + 2]

nmarkers = nmarkers - startmarker + 1
vara = 1.0/nmarkers
mean2pq = 0.5

scalea  = 0.5*vara/(nmarkers*mean2pq) # 0.5 = (v-2)/v for v=4

ndesign = size(X, 2)
b = zeros(Float64, ndesign)
meanb = zeros(Float64, ndesign)
b[1] = mean(y)
varian  = zeros(Float64, ndesign)

ycorr = y - X * b

# MCMC sampling
for i = 1:numiter
# sample vare
vare = dot(ycorr, ycorr )/randchi2(nrecords + 3)

# sample intercept
ycorr = ycorr + X[:, 1] * b[1];
rhs = sum(ycorr)/vare;
invlhs = 1.0/(nrecords/vare);
mn = rhs*invlhs;
b[1] = randn() * sqrt(invlhs) + mn;
ycorr = ycorr - X[:, 1] * b[1];
meanb[1] = meanb[1] + b[1];

# sample variance for each locus
for locus = 2:ndesign
varian[locus] = (scalea*4 + b[locus]*b[locus])/randchi2(4.0 + 1);
end

# sample effect for each locus
for locus = 2:ndesign
# unadjust y for this locus
ycorr = ycorr + X[:, locus] * b[locus];
rhs = dot(X[:, locus], ycorr)/vare;
lhs = dot(X[:, locus], X[:, locus])/vare + 1.0/varian[locus];
invlhs = 1.0/lhs;
mn = invlhs * rhs;
b[locus] = randn() * sqrt(invlhs) + mn;
#adjust y for the new value of this locus
ycorr = ycorr - X[:, locus] * b[locus];
meanb[locus] = meanb[locus] + b[locus];
end
end

toc()

meanb = meanb/numiter;
aHat  = X * meanb;

The code looks remarkably similar and there are four main sources of differences:

1. The first trivial one is that the original code read a binary dataset and I didn’t know how to do it in Julia, so I’ve read a csv file instead (this is why I start timing after reading the file too).
2. The second trivial one is to avoid name conflicts between variables and functions; for example, in R the user is allowed to have a variable called var that will not interfere with the variance function. Julia is picky about that, so I needed renaming some variables.
3. Julia pases variables by reference, while R does so by value when assigning matrices, which tripped me because in the original R code there was something like: b = array(0.0,size); meanb = b;. This works fine in R, but in Julia changes to the b vector also changed meanb.
4. The definition of scalar vs array created some problems in Julia. For example y' * y (t(y) %*% y in R) is numerically equivalent to dot(y, y). However, the first version returns an array, while the second one a scalar. I got an error message when trying to store the ‘scalar like an array’ in to an array. I find that confusing.

One interesting point in this comparison is using rough code, not really optimized for speed; in fact, the only thing that I can say of the Julia code is that ‘it runs’ and it probably is not very idiomatic. Testing runs with different numbers of markers we get that R needs roughly 2.8x the time used by Julia. The Julia website claims better results in benchmarks, but in real life we work with, well, real problems.

In 1996-7 I switched from SAS to ASReml for genetic analyses because it was 1-2 orders of magnitude faster and opened a world of new models. Today a change from R to Julia would deliver (in this particular case) a much more modest speed up (~3x), which is OK but not worth changing languages (yet). Together with the embryonic graphical capabilities and the still-to-develop ecosystem of packages, means that I’m still using R. Nevertheless, the Julia team has achieved very impressive performance in very little time, so it is worth to keep an eye on their progress.

P.S.1 Readers are welcome to suggesting ways of improving the code.
P.S.2 WordPress does not let me upload the binary version of the simulated data.
P.S.3 Hey WordPress guys; it would be handy if the sourcecode plugin supported Julia!
P.S.4 2012-04-26 Following AL’s recommendation in the comments, one can replace in R:

rhs = t(x[,locus])%*%ycorr/vare;
lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus]

by

rhs = crossprod(x[,locus],ycorr)/vare
lhs = crossprod(x[,locus],x[,locus])/vare + 1.0/var[locus]

reducing execution time by roughly 20%, making the difference between Julia and R even smaller.

Categories

## Mid-January flotsam: teaching edition

I was thinking about new material that I will use for teaching this coming semester (starting the third week of February) and suddenly compiled the following list of links:

Enough procrastination. Let’s keep on filling out PBRF forms; it is the right time for that hexennial activity.

Categories

## Doing Bayesian Data Analysis now in JAGS

Around Christmas time I presented my first impressions of Kruschke’s Doing Bayesian Data Analysis. This is a very nice book but one of its drawbacks was that part of the code used BUGS, which left mac users like me stuck.

Kruschke has now made JAGS code available so I am happy clappy and looking forward to test this New Year present. In addition, there are other updates available for the programs included in the book.

Categories

## First impressions of Doing Bayesian Data Analysis

About a month ago I was discussing the approach that I would like to see in introductory Bayesian statistics books. In that post I mentioned a PDF copy of Doing Bayesian Data Analysis by John K. Kruschke and that I have ordered the book. Well, recently a parcel was waiting in my office with a spanking new, real paper copy of the book. A few days are not enough to provide a ‘proper’ review of the book but I would like to discuss my first impressions about the book, as they could be helpful for someone out there.

If I were looking for a single word to define the word it would be meaty, not on the “having the flavor or smell of meat” sense of the word as pointed out by Newton, but on the conceptual side. Kruschke has clearly put a lot of thought on how to draw a generic student with little background on the topic to start thinking of statistical concepts. In addition Kruschke clearly loves language and has an interesting, sometimes odd, sense of humor; anyway, Who am I to comment on someone else’s strange sense of humor?

One difference between the dodgy PDF copy and the actual book is the use of color, three shades of blue, to highlight section headers and graphical content. In general I am not a big fun of lots of colors and contentless pictures as used in modern calculus and physics undergraduate books. In this case, the effect is pleasant and makes browsing and reading the book more accessible. Most graphics really drive a point and support the written material, although there are exceptions in my opinion like some faux 3D graphs (Figure 17.2 and 17.3 under multiple linear regression) that I find somewhat confusing.

The book’s website contains PDF versions of the table of contents and chapter 1, which is a good way to whet your appetite. The book covers enough material as to be the sole text for an introductory Bayesian statistics course, either starting from scratch or as a transition from a previous course with a frequentist approach. There are plenty of exercises, a solutions manual and plenty of R code available.

The mere existence of this book prompts the question: Can we afford not to introduce students to a Bayesian approach to statistics? In turn this sparks the question How do we convince departments to de-emphasize the old way? (this quote is extremely relevant)

Verdict: if you are looking for a really introductory text, this is hands down the best choice. The material goes from the ‘OMG do I need to learn stats?’ level to multiple linear regression, ANOVA, hierarchical models and GLMs.

P.S. I’m still using a combination of books, including Krushke’s, and Marin and Robert’s for my own learning process.
P.S.2 There is a lot to be said about a book that includes puppies on its cover and references to A Prairie Home Companion on its first page (the show is sometimes re-broadcasted down under by Radio New Zealand).

Categories

## Tall big data, wide big data

After attending two one-day workshops last week I spent most days paying attention to (well, at least listening to) presentations in this biostatistics conference. Most presenters were R users—although Genstat, Matlab and SAS fans were also present and not once I heard “I can’t deal with the current size of my data sets”. However, there were some complaints about the speed of R, particularly when dealing with simulations or some genomic analyses.

Some people worried about the size of coming datasets; nevertheless that worry was across statistical packages or, more precisely, it went beyond statistical software. How will we able to even store the data from something like the Square Kilometer Array, let alone analyze it?

In a previous post I was asking if we needed to actually deal with ‘big data’ in R, and my answer was probably not or, better, at least not directly. I still think that it is a valid, although incomplete opinion. In many statistical analyses we can think of n (the number of observations) and p (the number of variables per observation). In most cases, particularly when people refer to big data, n >> p. Thus, we may have 100 million people but we have only 10 potential predictors: tall data. In contrast, we may have only 1,000 individuals but with 50,000 points each coming from a near infrared spectrometry or information from 250,000 SNPs (a type of molecular marker): wide data. Both types of data will keep on growing but are challenging in a different way.

In a totally generalizing, unfair and simplistic way I will state that dealing with wide is more difficult (and potentially interesting) than dealing with tall. This from a modeling perspective. As the t-shirt says: sampling is not a crime, and it should work quite well with simpler models and large datasets. In contrast, sampling to fitting wide data may not work at all.

Algorithms. Clever algorithms is what we need in a first stage. For example, we can fit linear mixed models to a tall dataset with ten millions records or a multivariate mixed model with 60 responses using ASReml-R. Wide datasets are often approached using Bayesian inference, but MCMC gets slooow when dealing with thousands of predictors, we need other fast approximations to the posterior distributions.

This post may not be totally coherent, but it keeps the conversation going. My excuse? I was watching Be kind rewind while writing it.

Categories

## If you are writing a book on Bayesian statistics

This post is somewhat marginal to R in that there are several statistical systems that could be used to tackle the problem. Bayesian statistics is one of those topics that I would like to understand better, much better, in fact. Unfortunately, I struggle to get the time to attend courses on the topic between running my own lectures, research and travel; there are always books, of course.

In my (highly individual and dubious) opinion Albert’s book is the easiest to read. I was waiting to see the doctor while reading—and actually understanding—some of the concepts. The book is certainly geared towards R users and gradually develops the code necessary to run simple analyses from estimating a proportion to fitting (simple) hierarchical linear models. I’m still reading, which is a compliment.

Marin and Robert’s book is quite different in that uses R as a vehicle (like this blog) but the focus is more on the conceptual side and covers more types of models than Albert’s book. I do not have the probability background for this course (or maybe I did, but it was ages ago); however, the book makes me want to learn/refresh that background. An annoying comment on the book is that it is “self-contained”; well, anything is self-contained if one asks for enough prerequisites! I’m still reading (jumping between Albert’s and this book), and the book has managed to capture my interest.

Finally, Bolstad’s book. How to put this? “It is not you, it is me”. It is much more technical and I do not have the time, nor the patience, to wait until chapter 8 to do something useful (logistic regression). This is going back to the library until an indeterminate future.

If you are now writing a book on the topic I would like to think of the following user case:

• the reader has little or no exposure to Bayesian statistics, but it has been working for a while with ‘classical’ methods,
• the reader is self-motivated, but he doesn’t want to spend ages to be able to fit even a simple linear regression,
• the reader has little background on probability theory, but he is willing to learn some in between learning the tools and to run some analyses,
• using a statistical system that allows for both classical and Bayesian approaches is a plus.

It is hard for me to be more selfish in this description; you are potentially writing a book for me.

† After the first quake our main library looked like this. Now it is mostly normal.

P.S. After publishing this post I remembered that I came across a PDF copy of Doing Bayesian Data Analysis: A Tutorial with R and BUGS by Kruschke. Setting aside the dodginess of the copy, the book looked well-written, started from first principles and had puppies on the cover (!), so I ordered it from Amazon.

P.D. 2011-12-03 23:45 AEST Christian Robert sent me a nice email and wrote a few words on my post. Yes, I’m still plodding along with the book although I’m taking a ten day break while traveling in Australia.

P.D. 2011-11-25 12:25 NZST Here is a list of links to Amazon for the books suggested in the comments:

Categories

## Surviving a binomial mixed model

A few years ago we had this really cool idea: we had to establish a trial to understand wood quality in context. Sort of following the saying “we don’t know who discovered water, but we are sure that it wasn’t a fish” (attributed to Marshall McLuhan). By now you are thinking WTF is this guy talking about? But the idea was simple; let’s put a trial that had the species we wanted to study (Pinus radiata, a gymnosperm) and an angiosperm (Eucalyptus nitens if you wish to know) to provide the contrast, as they are supposed to have vastly different types of wood. From space the trial looked like this:

The reason you can clearly see the pines but not the eucalypts is because the latter were dying like crazy over a summer drought (45% mortality in one month). And here we get to the analytical part: we will have a look only at the eucalypts where the response variable can’t get any clearer, trees were either totally dead or alive. The experiment followed a randomized complete block design, with 50 open-pollinated families in 48 blocks. The original idea was to harvest 12 blocks each year but—for obvious reasons—we canned this part of the experiment after the first year.

The following code shows the analysis in asreml-R, lme4 and MCMCglmm:

library(asreml)
sasreml = asreml(surv ~ 1, random = ~ Fami + Block,
data = euc,
summary(sasreml)$varcomp # gamma component std.error z.ratio #Fami!Fami.var 0.5704205 0.5704205 0.14348068 3.975591 #Block!Block.var 0.1298339 0.1298339 0.04893254 2.653324 #R!variance 1.0000000 1.0000000 NA NA # constraint #Fami!Fami.var Positive #Block!Block.var Positive #R!variance Fixed # Quick look at heritability varFami = summary(sasreml)$varcomp[1, 2]
varRep = summary(sasreml)$varcomp[2, 2] h2 = 4*varFami/(varFami + varRep + 3.29) h2 #[1] 0.5718137 library(lme4) slme4 = lmer(surv ~ 1 + (1|Fami) + (1|Block), data = euc, family = binomial(link = 'logit')) summary(slme4) #Generalized linear mixed model fit by the Laplace approximation #Formula: surv ~ 1 + (1 | Fami) + (1 | Block) # Data: euc # AIC BIC logLik deviance # 2725 2742 -1360 2719 #Random effects: # Groups Name Variance Std.Dev. # Fami (Intercept) 0.60941 0.78065 # Block (Intercept) 0.13796 0.37143 #Number of obs: 2090, groups: Fami, 51; Block, 48 # #Fixed effects: # Estimate Std. Error z value Pr(>|z|) #(Intercept) 0.2970 0.1315 2.259 0.0239 * # Quick look at heritability varFami = VarCorr(slme4)$Fami[1]
varRep = VarCorr(slme4)$Block[1] h2 = 4*varFami/(varFami + varRep + 3.29) h2 #[1] 0.6037697 # And let's play to be Bayesians! library(MCMCglmm) pr = list(R = list(V = 1, n = 0, fix = 1), G = list(G1 = list(V = 1, n = 0.002), G2 = list(V = 1, n = 0.002))) sb <- MCMCglmm(surv ~ 1, random = ~ Fami + Block, family = 'categorical', data = euc, prior = pr, verbose = FALSE, pr = TRUE, burnin = 10000, nitt = 100000, thin = 10) plot(sb$VCV)

You may be wondering Where does the 3.29 in the heritability formula comes from? Well, that's the variance of the link function that, in the case of the logit link is pi*pi/3. In the case of MCMCglmm we can estimate the degree of genetic control quite easily, remembering that we have half-siblings (open-pollinated plants):

# Heritability
h2 = 4*sb$VCV[, 'Fami']/(sb$VCV[, 'Fami'] +
sb$VCV[, 'Block'] + 3.29 + 1) posterior.mode(h2) # var1 #0.6476185 HPDinterval(h2) # lower upper #var1 0.4056492 0.9698148 #attr(,"Probability") #[1] 0.95 plot(h2) By the way, it is good to remember that we need to back-transform the estimated effects to probabilities, with very simple code: # Getting mode and credible interval for solutions inv.logit(posterior.mode(sb$Sol))
inv.logit(HPDinterval(sb$Sol, 0.95)) Even if one of your trials is trashed there is a silver lining: it is possible to have a look at survival. Categories ## Coming out of the (Bayesian) closet: multivariate version This week I’m facing my—and many other lecturers’—least favorite part of teaching: grading exams. In a supreme act of procrastination I will continue the previous post, and the antepenultimate one, showing the code for a bivariate analysis of a randomized complete block design. Just to recap, the results from the REML multivariate analysis (that used ASReml-R) was the following: library(asreml) m4 = asreml(cbind(bden, veloc) ~ trait, random = ~ us(trait):Block + us(trait):Family, data = a, rcov = ~ units:us(trait)) summary(m4)$varcomp

#                                      gamma    component    std.error
#trait:Block!trait.bden:bden    1.628812e+02 1.628812e+02 7.854123e+01
#trait:Block!trait.veloc:bden   1.960789e-01 1.960789e-01 2.273473e-01
#trait:Block!trait.veloc:veloc  2.185595e-03 2.185595e-03 1.205128e-03
#trait:Family!trait.bden:bden   8.248391e+01 8.248391e+01 2.932427e+01
#trait:Family!trait.veloc:bden  1.594152e-01 1.594152e-01 1.138992e-01
#trait:Family!trait.veloc:veloc 2.264225e-03 2.264225e-03 8.188618e-04
#R!variance                     1.000000e+00 1.000000e+00           NA
#R!trait.bden:bden              5.460010e+02 5.460010e+02 3.712833e+01
#R!trait.veloc:bden             6.028132e-01 6.028132e-01 1.387624e-01
#R!trait.veloc:veloc            1.710482e-02 1.710482e-02 9.820673e-04
#                                  z.ratio constraint
#trait:Block!trait.bden:bden     2.0738303   Positive
#trait:Block!trait.veloc:bden    0.8624639   Positive
#trait:Block!trait.veloc:veloc   1.8135789   Positive
#trait:Family!trait.bden:bden    2.8128203   Positive
#trait:Family!trait.veloc:bden   1.3996166   Positive
#trait:Family!trait.veloc:veloc  2.7650886   Positive
#R!variance                             NA      Fixed
#R!trait.bden:bden              14.7057812   Positive
#R!trait.veloc:bden              4.3442117   Positive
#R!trait.veloc:veloc            17.4171524   Positive

The corresponding MCMCglmm code is not that different from ASReml-R, after which it is modeled anyway. Following the recommendations of the MCMCglmm Course Notes (included with the package), the priors have been expanded to diagonal matrices with degree of belief equal to the number of traits. The general intercept is dropped (-1) so the trait keyword represents trait means. We are fitting unstructured (us(trait)) covariance matrices for both Block and Family, as well as an unstructured covariance matrix for the residuals. Finally, both traits follow a gaussian distribution:

library(MCMCglmm)
bp = list(R = list(V = diag(c(0.007, 260)), n = 2),
G = list(G1 = list(V = diag(c(0.007, 260)), n = 2),
G2 = list(V = diag(c(0.007, 260)), n = 2)))

bmod = MCMCglmm(cbind(veloc, bden) ~ trait - 1,
random = ~ us(trait):Block + us(trait):Family,
rcov = ~ us(trait):units,
family = c('gaussian', 'gaussian'),
data = a,
prior = bp,
verbose = FALSE,
pr = TRUE,
burnin = 10000,
nitt = 20000,
thin = 10)

Further manipulation of the posterior distributions requires having an idea of the names used to store the results. Following that, we can build an estimate of the genetic correlation between the traits (Family covariance between traits divided by the square root of the product of the Family variances). Incidentally, it wouldn’t be a bad idea to run a much longer chain for this model, so the plot of the posterior for the correlation looks better, but I’m short of time:

dimnames(bmod$VCV) rg = bmod$VCV[,'veloc:bden.Family']/sqrt(bmod$VCV[,'veloc:veloc.Family'] * bmod$VCV[,'bden:bden.Family'])

plot(rg)

posterior.mode(rg)
#     var1
#0.2221953

HPDinterval(rg, prob = 0.95)
#         lower     upper
#var1 -0.132996 0.5764006
#attr(,"Probability")
#[1] 0.95

And that’s it! Time to congratulate Jarrod Hadfield for developing this package.

Categories

## Coming out of the (Bayesian) closet

Until today all the posts in this blog have used a frequentist view of the world. I have a confession to make: I have an ecumenical view of statistics and I do sometimes use Bayesian approaches in data analyses. This is not quite one of those “the truth will set you free” moments, but I’ll show that one could almost painlessly repeat some of the analyses I presented before using MCMC.

MCMCglmm is a very neat package that—as its rather complicated em cee em cee gee el em em acronym implies—implements MCMC for generalized linear mixed models. We’ll skip that the frequentist fixed vs random effects distinction gets blurry in Bayesian models and still use the f… and r… terms. I’ll first repeat the code for a Randomized Complete Block design with Family effects (so we have two random factors) using both lme4 and ASReml-R and add the MCMCglmm counterpart:

library(lme4)
m1 = lmer(bden ~ 1 + (1|Block) + (1|Family),
data = trees)

summary(m1)

#Linear mixed model fit by REML
#Formula: bden ~ (1 | Block) + (1 | Family)
#   Data: a
#  AIC  BIC logLik deviance REMLdev
# 4572 4589  -2282     4569    4564
#Random effects:
# Groups   Name        Variance Std.Dev.
# Family   (Intercept)  82.803   9.0996
# Block    (Intercept) 162.743  12.7571
# Residual             545.980  23.3662
#Number of obs: 492, groups: Family, 50; Block, 11

#Fixed effects:
#            Estimate Std. Error t value
#(Intercept)  306.306      4.197   72.97

library(asreml)
m2 = asreml(bden ~ 1, random = ~ Block + Family,
data = trees)

summary(m2)$varcomp # gamma component std.error z.ratio #Block!Block.var 0.2980766 162.74383 78.49271 2.073362 #Family!Family.var 0.1516591 82.80282 29.47153 2.809587 #R!variance 1.0000000 545.97983 37.18323 14.683496 # constraint #Block!Block.var Positive #Family!Family.var Positive #R!variance Positive m2$coeff$fixed #(Intercept) # 306.306 We had already established that the results obtained from lme4 and ASReml-R were pretty much the same, at least for relatively simple models where we can use both packages (as their functionality diverges later for more complex models). This example is no exception and we quickly move to fitting the same model using MCMCglmm: library(MCMCglmm) priors = list(R = list(V = 260, n = 0.002), G = list(G1 = list(V = 260, n = 0.002), G2 = list(V = 260, n = 0.002))) m4 = MCMCglmm(bden ~ 1, random = ~ Block + Family, family = 'gaussian', data = a, prior = priors, verbose = FALSE, pr = TRUE, burnin = 10000, nitt = 20000, thin = 10) plot(mcmc.list(m4$VCV))

autocorr(m4$VCV) posterior.mode(m4$VCV)
#    Block    Family     units
#126.66633  72.97771 542.42237

HPDinterval(m4$VCV) # lower upper #Block 33.12823 431.0233 #Family 26.34490 146.6648 #units 479.24201 627.7724 The first difference is that we have to specify priors for the coefficients that we would like to estimate (by default fixed effects, the overall intercept for our example, start with a zero mean and very large variance: 106). The phenotypic variance for our response is around 780, which I split into equal parts for Block, Family and Residuals. For each random effect we have provided our prior for the variance (V) and a degree of belief on our prior (n). In addition to the model equation, name of the data set and prior information we need to specify things like the number of iterations in the chain (nitt), how many we are discarding for the initial burnin period (burnin), and how many values we are keeping (thin, every ten). Besides the pretty plot of the posterior distributions (see previous figure) they can be summarized using the posterior mode and high probability densities. One of the neat things we can do is to painlessly create functions of variance components and get their posterior mode and credible interval. For example, the heritability (or degree of additive genetic control) can be estimated in this trial with full-sib families using the following formula: $$hat{h^2} = frac{2 sigma_F^2}{sigma_F^2 + sigma_B^2 + sigma_R^2}$$ h2 = 2 * m4$VCV[, 'Family']/(m4$VCV[, 'Family'] + m4$VCV[, 'Block'] + m4\$VCV[, 'units'])
plot(h2)

posterior.mode(h2)
#0.1887017

HPDinterval(h2)
#          lower     upper
#var1 0.05951232 0.3414216
#attr(,"Probability")
#[1] 0.95

There are some differences on the final results between ASReml-R/lme4 and MCMCglmm; however, the gammas (ratios of variance component/error variance) for the posterior distributions are very similar, and the estimated heritabilities are almost identical (~0.19 vs ~0.21). Overall, MCMCglmm is a very interesting package that covers a lot of ground. It pays to read the reference manual and vignettes, which go into a lot of detail on how to specify different types of models. I will present the MCMCglmm bivariate analyses in another post.

P.S. There are several other ways that we could fit this model in R using a Bayesian approach: it is possible to call WinBugs or JAGS (in Linux and OS X) from R, or we could have used INLA. More on this in future posts.