m x n matrix with randomly assigned 0/1

Today Scott Chamberlain tweeted asking for a better/faster solution to building an m x n matrix with randomly assigned 0/1. He already had a working version:

r <- 1000
c <- 1000
m0 <- matrix(0, r, c)
apply(m0, c(1,2), function(x) sample(c(0,1),1))

Now, I’m the first to acknowledge that I’ve never got the ‘apply’ family of functions and that—thanks Hadley—if I need to do something like that I go for the plyr package. Nevertheless, if you think about it, the starting point is a sugary version of a loop; I like loops because they are explicit and clear, but they can be very slow in R. The loop could be written as:

m00 = matrix(0, r, c)
for(i in 1:r){
  for(j in 1:c){
    m00[i, j] = sample(c(0,1),1)
  }
}

In contrast, my first idea was to generate a bunch of uniformly distributed [0, 1) random numbers and round them to the closest integer, which is a more ‘matricy’ way of thinking:

m1 = round(matrix(runif(r*c), r, c))

and it happens to be a lot faster. My second idea, and Edmund Hart beat me to that one, was to use something like:

m2 = matrix(rbinom(r*c,1,0.5),r,c)

which is a nice option: it generates r*c random numbers from a binomial distribution and has the advantage of allowing for probabilities other than 0.5 (unlike m1). It is also faster than m1, although one would probably not notice the difference until working with fairly big matrices. When Scott pointed out the speed difference it got me thinking about the order of operations; would this be better?

m3 = matrix(round(runif(r*c)), r, c)

In terms of speed, m3 fits between m1 and m2, so the order can make a difference.

David Smith and Rafael Maia came up with a different approach, using sample(), which I had not considered at all. m4 has the advantage that one can sample from any set of values, not just 0 and 1.

m4 = matrix(sample(0:1,r*c, replace=TRUE),r,c)
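
As an aside (my own addition, not part of the original discussion), the same idea extends directly to more than two values; for example, sampling digits from 0 to 9:

# Sampling from ten possible values instead of two; purely illustrative
m4b = matrix(sample(0:9, r*c, replace = TRUE), r, c)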

Any of these options can be timed using the system.time() function, as in system.time(m3 <- matrix(round(runif(r*c)), r, c)) (note the <- inside the call; using = there would be parsed as a named argument). It’s interesting how different people come up with alternative strategies (reflecting different ways of thinking about the problem) using exactly the same language. I’m not sure what the ‘obvious’ solution should be, but for almost any purpose any of them (with the exception of the explicit loop) will work just fine.
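
A rough side-by-side timing along those lines (a sketch only; the actual numbers will depend on your machine and R version):

r <- 1000
c <- 1000

system.time(m1 <- round(matrix(runif(r*c), r, c)))                 # round the whole matrix
system.time(m2 <- matrix(rbinom(r*c, 1, 0.5), r, c))               # binomial draws
system.time(m3 <- matrix(round(runif(r*c)), r, c))                 # round the vector, then build the matrix
system.time(m4 <- matrix(sample(0:1, r*c, replace = TRUE), r, c))  # sample()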

Gratuitous picture: training horses on the beach (Photo: Luis).

P.S. 2012-08-29, 22:33. Another way that would work (although 4 times slower than m1) would be m5 = matrix(ifelse(runif(r*c) < 0.5, 0, 1), r, c). Still, it takes less than 0.5 seconds for a matrix with 1,000,000 elements.
P.S. 2012-09-04. Scott Chamberlain wrote his own blog post discussing the different suggestions. Even cooler, Dirk Eddelbuettel implemented even faster creation of random 1/0 matrices using Rcpp, with inline C++ code. Much faster in a crazy overkill sort of way.
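
Just to give a flavor of the Rcpp route (this is not Dirk’s code; it is a minimal sketch of my own, assuming the Rcpp package and a working C++ toolchain are installed):

library(Rcpp)

# Rough illustration only: fill an integer matrix with 0/1 in compiled code
cppFunction('
IntegerMatrix rand01(const int r, const int c) {
  IntegerMatrix m(r, c);
  for (int j = 0; j < c; j++)
    for (int i = 0; i < r; i++)
      m(i, j) = (R::runif(0.0, 1.0) < 0.5) ? 0 : 1;
  return m;
}')

m6 = rand01(r, c)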

Mid-August flotsam

Reached mid-semester point, with quite a few new lectures to prepare. Nothing extremely complicated but, as always, the tricky part is finding a way to make it meaningful and memorable. Sometimes, and this is one of those times, I sound like a broken record but I’m a bit obsessive about helping people to ‘get’ a topic.

Gratuitous picture: Lola, Lisbon, Portugal (Photo: Luis).

INLA: Bayes goes to Norway

INLA is not the Norwegian answer to ABBA; that would probably be a-ha. INLA is the answer to ‘Why do I have enough time to cook a three-course meal while running MCMC analyses?’

Integrated Nested Laplace Approximations (INLA) is based on direct numerical integration (rather than simulation as in MCMC) which, according to people ‘in the know’, allows:

  • the estimation of marginal posteriors for all parameters,
  • marginal posteriors for each random effect and
  • estimation of the posterior for linear combinations of random effects.

Rather than going back to the usual univariate randomized complete block or split-plot designs that I have analyzed before (here using REML and here using MCMC), I’ll go for some analyses that motivated me to look for INLA. I was looking at reproductive output for some Drosophila data here at the university, and wanted to fit a logistic model using MCMCglmm. Unfortunately, I needed millions (~3M) of iterations to get a good idea of the posterior, which meant leaving the computer running overnight. Almost by accident I came across INLA and started playing with it. The background is that Sol, a Ph.D. student, had a cool experiment with a bunch of flies, using different mating strategies over several generations to check the effect on breeding success. Therefore we have to keep track of the pedigree too.

Gratuitous picture: Cubist apartments not in Norway (Photo: Luis, click to enlarge).

# Set working directory containing data and code
setwd('~/Dropbox/research/2010/flies')
 
# Packages needed for analysis
# This code requires the latest (and updated) version of INLA
require(INLA)            # This loads INLA
require(pedigreemm)      # pedigree(), relfactor(), Tinv, D, ...
 
####### Pedigree and assessment files
# Reads pedigree file
ped = read.csv('recodedped.csv', header=FALSE)
names(ped) = c('id', 'mum', 'dad')
 
# Reads data file
dat = read.csv('ggdatall.csv', header=TRUE)
dat$cross = factor(dat$cross)
 
# Pedigree object for pedigreemm functions
pedPMM = with(ped, pedigreemm::pedigree(sire=dad, dam=mum, label=id))
 
# Pedigree precision matrix (A inverse)
# T^{-1} in A^{-1} = (T^{-1})' D^{-1} T^{-1}
Tinv = as(pedPMM, "sparseMatrix") 
D    = Diagonal(x=Dmat(pedPMM))   # D      in A = TDT'
Dinv = solve(D)
Ainv = t(Tinv) %*% Dinv %*% Tinv
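
As an optional sanity check (my addition, assuming getAInv() is available in the installed version of pedigreemm), the matrix assembled above can be compared with the inverse that pedigreemm computes directly:

# Optional check, assuming pedigreemm::getAInv() is available:
# the two versions of A inverse should agree
Ainv.check = getAInv(pedPMM)
max(abs(Ainv - Ainv.check))   # should be essentially zero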

Up to this point we have read the response data and the pedigree, and constructed the inverse of the additive (numerator) relationship matrix from it. We also needed to build a matrix of linear contrasts to compare the mean response between the different mating strategies. I was struggling there and contacted Gregor Gorjanc, who kindly emailed me the proper way to do it.

# Define contrasts to compare cross types. Thanks to Gregor Gorjanc
# for coding contrasts
k = nlevels(dat$cross)
tmp = matrix(nrow=(k-1)*k/2, ncol=k)
 
##               1   2   3   4   5   6 
tmp[ 1, ] =  c(  1, -1, NA, NA, NA, NA) ## c1-c2
tmp[ 2, ] =  c(  1, NA, -1, NA, NA, NA) ##   -c3
tmp[ 3, ] =  c(  1, NA, NA, -1, NA, NA) ##   -c4
tmp[ 4, ] =  c(  1, NA, NA, NA, -1, NA) ##   -c5
tmp[ 5, ] =  c(  1, NA, NA, NA, NA, -1) ##   -c6
tmp[ 6, ] =  c( NA,  1, -1, NA, NA, NA) ## c2-c3
tmp[ 7, ] =  c( NA,  1, NA, -1, NA, NA) ##   -c4
tmp[ 8, ] =  c( NA,  1, NA, NA, -1, NA) ##   -c5
tmp[ 9, ] =  c( NA,  1, NA, NA, NA, -1) ##   -c6
tmp[10, ] =  c( NA, NA,  1, -1, NA, NA) ## c3-c4
tmp[11, ] =  c( NA, NA,  1, NA, -1, NA) ##   -c5
tmp[12, ] =  c( NA, NA,  1, NA, NA, -1) ##   -c6
tmp[13, ] =  c( NA, NA, NA,  1, -1, NA) ## c4-c5
tmp[14, ] =  c( NA, NA, NA,  1, NA, -1) ##   -c6
tmp[15, ] =  c( NA, NA, NA, NA,  1, -1) ## c5-c6
 
 
# Make Linear Combinations
LC = inla.make.lincombs(cross=tmp)
 
# Assign names to combinations
t = 0
for(i in 1:(k-1)) {
  for(j in (i+1):k) {
    t = t + 1
    names(LC)[t] = paste("c", i, "-", "c", j, sep="")
  }
}
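
If the loop does its job, the linear combinations end up labelled with the fifteen pairwise comparisons:

# Labels generated by the loop above: "c1-c2", "c1-c3", ..., "c5-c6"
names(LC)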

There is another related package (Animal INLA) that takes care of (i) giving details about the priors and (ii) “easily” fitting models that include a term with a pedigree (an animal model, in quantitative genetics speak). However, I wanted the assumptions to be clear, so I read the source of Animal INLA and shamelessly copied the useful bits (read the source, Luke!).

######  Analysis for binomial traits
####### Plain-vanilla INLA Version
# Feeling more comfortable with *explicit* statement of assumptions
# (rather than hidden behind animal.inla()) 
 
# Function to backconvert logits to probabilities
back.conv = function(values){
  return(1/(1+exp(-(values))))
}
 
# Function to get posterior of the odds
# Thanks to Helena Moltchanova
inla.marginal.summary = function(x){
  m1 = inla.emarginal(function(z) exp(z), marginal=x)
  odds = inla.marginal.transform(function(x) exp(x), x)
  q = inla.qmarginal(p=c(0.025, 0.975), marginal=odds)
  c("0.025quant"=q[1], "0.5quant"=m1, "0.975quant"=q[2])
}
 
 
# Model for pupae/eggs
# Drops observations with no eggs or with pupae > eggs (both trip the binomial likelihood in INLA)
no0eggs = subset(dat, eggs>0 & pupae <= eggs)
 
# Actual model
mpueg =  pupae ~ f(cross, model='iid', constr=TRUE, hyper=list(theta=list(initial=-10, fixed=TRUE))) +
                 f(id, model='generic0', constr=TRUE, Cmatrix=Ainv,
                 hyper=list(theta=list(param=c(0.5,0.5), fixed=FALSE)))
 
# INLA call
fpueg =  inla(formula=mpueg, family='binomial', data=no0eggs,
              lincomb=LC,
              Ntrials=eggs,
              control.compute=list(dic=FALSE))
 
# Results
summary(fpueg)
 
# Call:
#   c("inla(formula = mpueg, family = \"binomial\", data = no0eggs, 
#     Ntrials = eggs, ",  "    lincomb = LC, control.compute = list(dic = FALSE))")
# 
# Time used:
#   Pre-processing    Running inla Post-processing           Total 
# 0.2712612       1.1172159       2.0439510       3.4324281 
# 
# Fixed effects:
#   mean        sd 0.025quant 0.5quant 0.975quant       kld
# (Intercept) 1.772438 0.1830827   1.417413 1.770863   2.136389 0.5833235
# 
# Linear combinations (derived):
#   ID        mean        sd 0.025quant    0.5quant  0.975quant kld
# c1-c2  0 -0.26653572 0.7066540 -1.6558225 -0.26573011  1.11859967   0
# c1-c3  1  0.04150999 0.7554753 -1.4401435  0.04104020  1.52622856   0
# c1-c4  2 -0.08777325 0.6450669 -1.3557501 -0.08713005  1.17693349   0
# c1-c5  3 -1.36702960 0.6583121 -2.6615604 -1.36618274 -0.07690788   0
# c1-c6  4 -1.82037714 0.8193280 -3.4338294 -1.81848244 -0.21714431   0
# c2-c3  5  0.30804735 0.7826815 -1.2248185  0.30677279  1.84852340   0
# c2-c4  6  0.17876229 0.5321948 -0.8654273  0.17859036  1.22421409   0
# c2-c5  7 -1.10049385 0.7466979 -2.5663142 -1.10046590  0.36558211   0
# c2-c6  8 -1.55383673 0.8188321 -3.1640965 -1.55276603  0.05084282   0
# c3-c4  9 -0.12928419 0.7475196 -1.5996080 -0.12817855  1.33522000   0
# c3-c5 10 -1.40854298 0.6016539 -2.5930656 -1.40723901 -0.23103707   0
# c3-c6 11 -1.86189314 0.8595760 -3.5555571 -1.85954031 -0.18100418   0
# c4-c5 12 -1.27925604 0.6998640 -2.6536362 -1.27905616  0.09438701   0
# c4-c6 13 -1.73259977 0.7764105 -3.2600936 -1.73134961 -0.21171790   0
# c5-c6 14 -0.45334267 0.8179794 -2.0618730 -0.45229981  1.14976690   0
# 
# Random effects:
#   Name    Model	  	Max KLD 
# cross   IID model 
# id   Generic0 model 
# 
# Model hyperparameters:
#   mean    sd      0.025quant 0.5quant 0.975quant
# Precision for id 0.08308 0.01076 0.06381    0.08244  0.10604   
# 
# Expected number of effective parameters(std dev): 223.95(0.7513)
# Number of equivalent replicates : 1.121 
# 
# Marginal Likelihood:  -1427.59 
 
fpueg$summary.random$cross
#   ID       mean        sd  0.025quant   0.5quant 0.975quant          kld
# 1  1 -0.5843466 0.4536668 -1.47561024 -0.5840804  0.3056632 0.0178780930
# 2  2 -0.3178102 0.4595676 -1.21808638 -0.3184925  0.5865565 0.0009666916
# 3  3 -0.6258600 0.4978254 -1.60536281 -0.6250077  0.3491075 0.0247426578
# 4  4 -0.4965763 0.4071715 -1.29571071 -0.4966277  0.3030747 0.0008791629
# 5  5  0.7826817 0.4389003 -0.07756805  0.7821937  1.6459253 0.0077476806
# 6  6  1.2360387 0.5768462  0.10897529  1.2340813  2.3744368 0.0451357379
 
# Back-transforms point estimates and credible intervals for the contrasts
# (exponentiates logit differences to the odds scale)
for(name in names(fpueg$marginals.lincomb.derived)){
  summa = inla.marginal.summary(fpueg$marginals.lincomb.derived[[name]])
  cat(name, summa, '\n')
}
 
# c1-c2 0.1894451 0.9831839 3.019878 
# c1-c3 0.2338952 1.387551 4.534581 
# c1-c4 0.256858 1.127751 3.204961 
# c1-c5 0.0695406 0.3164847 0.9145132 
# c1-c6 0.03157478 0.2264027 0.792517 
# c2-c3 0.289088 1.850719 6.255175 
# c2-c4 0.4213069 1.377848 3.366947 
# c2-c5 0.0759222 0.4398384 1.420934 
# c2-c6 0.04135211 0.2955985 1.035951 
# c3-c4 0.1996085 1.16168 3.747526 
# c3-c5 0.0746894 0.2929174 0.7847903 
# c3-c6 0.02774805 0.2245797 0.821099 
# c4-c5 0.06988459 0.355529 1.084414 
# c4-c6 0.03780307 0.2389529 0.7974092 
# c5-c6 0.1245211 0.8878682 3.108852

A quick look at the time taken by INLA shows that it is on the order of seconds (versus overnight using MCMC). I have tried a few examples and the MCMCglmm and INLA results tend to be very close; however, figuring out how to code models has been very tricky for me. INLA follows the glorious tradition of not having a ‘proper’ manual, but rather a number of examples with code. In fact, they reimplement BUGS’s examples. Personally, I struggle with that approach to documentation, but you may be the right type of person for it. Note for letter to Santa: real documentation for INLA.

I was talking with a student about using Norwegian software and he mentioned Norwegian Black Metal. That got me thinking about what the developers of the package might look like; would they look like Gaahl of Gorgoroth (see interview here)?

Gaahl of Gorgoroth: not an INLA developer.

Talk about disappointment! In fact Håvard Rue, INLA mastermind, looks like a nice, clean, non-black-metal statistician. To be fair, it would be quite hard to code in any language wearing those spikes…

Careless comparison bites back (again)

When running stats labs I like to allocate a slightly different subset of data to each student, which acts as an incentive for people to do their own work (rather than copying the same results from a fellow student). We also need to be able to replicate the results when marking, so we need a record of exactly which observations were dropped to create a particular data set. I have done this in a variety of ways, but this time I opted for code that looked like:

setwd('~/Dropbox/teaching/stat202-2012')
 
biom = read.csv('biom2012.csv', header = TRUE)
drops = read.csv('lab4-dels.csv', header = TRUE)
 
# Use here your OWN student code
my.drop = subset(drops, student.code == 'mjl159')
my.data = subset(biom, !(id %in% my.drop))

Thus, we read the full data set (assigned to biom), read a table containing student codes in the first column and five columns with the observations to be dropped (assigned to drops), and chose one row of drops depending on the student code (assigned to my.drop). As an example, for student ‘mjl159’ my.drop looked like:

my.drop
#   student.code X1 X2 X3 X4 X5
#40       mjl159 21 18 30 15 43

The problem with the code is the comparison !(id %in% my.drop), because my.drop still includes the factor student.code. When R checks whether a record is in my.drop, it coerces the text, e.g. ‘mjl159’, to its level number, e.g. 41, which makes the code delete ONE MORE observation (in this case #41) on top of the ones the student was allocated. This happens only for some students, those whose level number is not already included in the list of observations to drop.

This is another version of R pitfall #3: friggin’ factors. A simple workaround is to change the comparison to !(id %in% my.drop[2:6]). I should know better than this.
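
A toy sketch of the mechanism (made-up data, not the real class files): coercing a factor to numbers yields its level code, which is how the extra id sneaks into the comparison.

# Made-up miniature version of the problem
drops = data.frame(student.code = factor(c('abc123', 'mjl159')),
                   X1 = c(2, 21), X2 = c(5, 18), X3 = c(9, 30),
                   X4 = c(11, 15), X5 = c(20, 43))
my.drop = subset(drops, student.code == 'mjl159')

# Coercing the factor gives the level number (2 here, 41 in the real file),
# which is unrelated to any observation id
as.integer(my.drop$student.code)

# Workaround: compare only against the columns that hold observation ids
# my.data = subset(biom, !(id %in% my.drop[2:6]))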

Gratuitous image: Tree spread on metal frame to provide shade in a plaza, Lisbon, Portugal. Some days I would love to have a coffee there without computer, just watching the world pass by. (Photo: Luis).

Early August flotsam

Back teaching a couple of subjects and it’s the constant challenge to find enough common ground with students so one can push/pull them to the other side of new concepts. We are not talking about complex hierarchical models using mixed models or Bayesian approaches, but multiple linear regression or similar. What do students actually learn in first year stats…?

  • I’m enjoying reading Machine Learning for Hackers by Drew Conway and John Myles White. There isn’t a lot of new stuff for me in the book (although working with text is not something I usually do), but I have chosen to read it with newbie eyes. I’m (repeating myself) looking for enough common ground with students so one can push/pull them to the other side of new concepts and, let’s face it, I was 20 quite a few years ago.
  • Observation on teaching a lab for STAT202, in which many students are using R for the first time. Do you remember your first steps in S+/R? Some students see the light quickly while others are struggling to get their heads around giving commands to a computer (without clicking on icons).
  • Videos and screencasts on using IPython via Vince Buffalo.
  • This tweet by @isomorphisms resonated with me: ‘Someday I hope to be reading more Penguin Classics than John Wileys & Springer Verlags’.
  • Tom points to an explanation of ‘What really shoots out of spiderman’s modified forelimbs, and why this causes such consternation’.
  • I have to convince College IT guys to install R-Studio in a few hundred computers. R-Studio is becoming better all the time, making it obscene to subject students to the naked R for Windows installation without syntax highlighting.
  • Finally, reasons why men should not write advice columns via Arthur Charpentier.

Derelict house in Sintra, Portugal (Photo: Luis).