R, Julia and genome wide selection

— “You are a pussy” emailed my friend.
— “Sensu cat?” I replied.
— “No. Sensu chicken” blurbed my now ex-friend.

What was this about? He read my post on R, Julia and the shiny new thing, which prompted him to assume that I was the proverbial old dog unwilling (or was it unable?) to learn new tricks. (Incidentally, with friends like this who needs enemies? Hi, Gus.)

Having a look at different—statistical—horses (Photo: Luis).

I decided to tackle a small—but hopefully useful—piece of code: fitting/training a Genome Wide Selection model, using the Bayes A approach put forward by Meuwissen, Hayes and Goddard in 2001. In that approach the breeding values of the individuals (response) are expressed as a function of a very large number of random predictors (2000, our molecular markers). The dataset (csv file) is a simulation of 2000 bi-allelic markers (aa = 0, Aa = 1, AA = 2) for 250 individuals, followed by the phenotypes (column 2001) and breeding values (column 2002). These models are frequently adjusted using MCMC.

In 2010 I attended this course in Ames, Iowa where Rohan Fernando passed us the following R code (pretty much a transliteration from C code; notice the trailing semicolons, for example). P.D. 2012-04-26 Please note that this is teaching code not production code:

nmarkers = 2000;    # number of markers
startMarker = 1981; # set to 1 to use all
numiter  = 2000;    # number of iterations
vara     = 1.0/20.0; 

# input data
data     = matrix(scan("trainData.out0"),ncol=nmarkers+2,byrow=TRUE);
nrecords = dim(data)[1];

beg = Sys.time()

# x has the mean followed by the markers
x = cbind(1,data[,startMarker:nmarkers]);
y = data[,nmarkers+1];
a =  data[,nmarkers+2];
# inital values

nmarkers = nmarkers - startMarker + 1;
mean2pq = 0.5;                          # just an approximation
scalea  = 0.5*vara/(nmarkers*mean2pq);  # 0.5 = (v-2)/v for v=4

size = dim(x)[2];
b = array(0.0,size);
meanb = b;
b[1] = mean(y);
var  = array(0.0,size);

# adjust y
 ycorr = y - x%*%b;                  

# MCMC sampling
for (iter in 1:numiter){
  # sample vare
  vare = ( t(ycorr)%*%ycorr )/rchisq(1,nrecords + 3);

  # sample intercept
  ycorr = ycorr + x[,1]*b[1];
  rhs = sum(ycorr)/vare;
  invLhs = 1.0/(nrecords/vare);
  mean = rhs*invLhs;
  b[1] = rnorm(1,mean,sqrt(invLhs));
  ycorr = ycorr - x[,1]*b[1];
  meanb[1] = meanb[1] + b[1];

  # sample variance for each locus
  for (locus in 2:size){
    var[locus] = (scalea*4+b[locus]*b[locus])/rchisq(1,4.0+1)
  }

# sample effect for each locus
  for (locus in 2:size){
    # unadjust y for this locus
    ycorr = ycorr + x[,locus]*b[locus];
    rhs = t(x[,locus])%*%ycorr/vare;
    lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus];
    invLhs = 1.0/lhs;
    mean = invLhs*rhs;
    b[locus]= rnorm(1,mean,sqrt(invLhs));
    #adjust y for the new value of this locus
    ycorr = ycorr - x[,locus]*b[locus];
    meanb[locus] = meanb[locus] + b[locus];
  }
}

Sys.time() - beg

meanb = meanb/numiter;
aHat  = x %*% meanb;

Thus, we just need defining a few variables, reading the data (marker genotypes, breeding values and phenotypic data) into a matrix, creating loops, matrix and vector multiplication and generating random numbers (using a Gaussian and Chi squared distributions). Not much if you think about it, but I didn’t have much time to explore Julia’s features as to go for something more complex.

nmarkers = 2000    # Number of markers
startmarker = 1981 # Set to 1 to use all
numiter = 2000     # Number of iterations

data = dlmread("markers.csv", ',')
(nrecords, ncols) = size(data)

tic()

#this is the mean and markers matrix
X = hcat(ones(Float64, nrecords), data[:, startmarker:nmarkers])
y = data[:, nmarkers + 1]
a = data[:, nmarkers + 2]

nmarkers = nmarkers - startmarker + 1
vara = 1.0/nmarkers
mean2pq = 0.5

scalea  = 0.5*vara/(nmarkers*mean2pq) # 0.5 = (v-2)/v for v=4

ndesign = size(X, 2)
b = zeros(Float64, ndesign)
meanb = zeros(Float64, ndesign)
b[1] = mean(y)
varian  = zeros(Float64, ndesign)

# adjust y
ycorr = y - X * b                  

# MCMC sampling
for i = 1:numiter
  # sample vare
  vare = dot(ycorr, ycorr )/randchi2(nrecords + 3)

  # sample intercept
  ycorr = ycorr + X[:, 1] * b[1];
  rhs = sum(ycorr)/vare;
  invlhs = 1.0/(nrecords/vare);
  mn = rhs*invlhs;
  b[1] = randn() * sqrt(invlhs) + mn;
  ycorr = ycorr - X[:, 1] * b[1];
  meanb[1] = meanb[1] + b[1];

  # sample variance for each locus
  for locus = 2:ndesign
      varian[locus] = (scalea*4 + b[locus]*b[locus])/randchi2(4.0 + 1);
  end

  # sample effect for each locus
  for locus = 2:ndesign
      # unadjust y for this locus
      ycorr = ycorr + X[:, locus] * b[locus];
      rhs = dot(X[:, locus], ycorr)/vare;
      lhs = dot(X[:, locus], X[:, locus])/vare + 1.0/varian[locus];
      invlhs = 1.0/lhs;
      mn = invlhs * rhs;
      b[locus] = randn() * sqrt(invlhs) + mn;
      #adjust y for the new value of this locus
      ycorr = ycorr - X[:, locus] * b[locus];
      meanb[locus] = meanb[locus] + b[locus];
  end
end

toc()

meanb = meanb/numiter;
aHat  = X * meanb;

The code looks remarkably similar and there are four main sources of differences:

  1. The first trivial one is that the original code read a binary dataset and I didn’t know how to do it in Julia, so I’ve read a csv file instead (this is why I start timing after reading the file too).
  2. The second trivial one is to avoid name conflicts between variables and functions; for example, in R the user is allowed to have a variable called var that will not interfere with the variance function. Julia is picky about that, so I needed renaming some variables.
  3. Julia pases variables by reference, while R does so by value when assigning matrices, which tripped me because in the original R code there was something like: b = array(0.0,size); meanb = b;. This works fine in R, but in Julia changes to the b vector also changed meanb.
  4. The definition of scalar vs array created some problems in Julia. For example y' * y (t(y) %*% y in R) is numerically equivalent to dot(y, y). However, the first version returns an array, while the second one a scalar. I got an error message when trying to store the ‘scalar like an array’ in to an array. I find that confusing.

One interesting point in this comparison is using rough code, not really optimized for speed; in fact, the only thing that I can say of the Julia code is that ‘it runs’ and it probably is not very idiomatic. Testing runs with different numbers of markers we get that R needs roughly 2.8x the time used by Julia. The Julia website claims better results in benchmarks, but in real life we work with, well, real problems.

In 1996-7 I switched from SAS to ASReml for genetic analyses because it was 1-2 orders of magnitude faster and opened a world of new models. Today a change from R to Julia would deliver (in this particular case) a much more modest speed up (~3x), which is OK but not worth changing languages (yet). Together with the embryonic graphical capabilities and the still-to-develop ecosystem of packages, means that I’m still using R. Nevertheless, the Julia team has achieved very impressive performance in very little time, so it is worth to keep an eye on their progress.

P.S.1 Readers are welcome to suggesting ways of improving the code.
P.S.2 WordPress does not let me upload the binary version of the simulated data.
P.S.3 Hey WordPress guys; it would be handy if the sourcecode plugin supported Julia!
P.S.4 2012-04-26 Following AL’s recommendation in the comments, one can replace in R:

rhs = t(x[,locus])%*%ycorr/vare;
lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus]

by

rhs = crossprod(x[,locus],ycorr)/vare
lhs = crossprod(x[,locus],x[,locus])/vare + 1.0/var[locus]

reducing execution time by roughly 20%, making the difference between Julia and R even smaller.

Simulating data following a given covariance structure

Every year there is at least a couple of occasions when I have to simulate multivariate data that follow a given covariance matrix. For example, let’s say that we want to create an example of the effect of collinearity when fitting multiple linear regressions, so we want to create one variable (the response) that is correlated with a number of explanatory variables and the explanatory variables have different correlations with each other.

There is a matrix operation called Cholesky decomposition, sort of equivalent to taking a square root with scalars, that is useful to produce correlated data. If we have a covariance matrix M, the Cholesky descomposition is a lower triangular matrix L, such as that M = L L'. How does this connect to our simulated data? Let’s assume that we generate a vector z of random normally independently distributed numbers with mean zero and variance one (with length equal to the dimension of M), we can create a realization of our multivariate distribution using the product L z.

The reason why this works is that the Variance(L z) = L Variance(z) L' as L is just a constant. The variance of z is the identity matrix I; remember that the random numbers have variance one and are independently distributed. Therefore Variance(L z) = L I L' = L L` = M so, in fact, we are producing random data that follow the desired covariance matrix.

As an example, let’s simulate 100 observations with 4 variables. Because we want to simulate 100 realizations, rather than a single one, it pays to generate a matrix of random numbers with as many rows as variables to simulate and as many columns as observations to simulate.

library(lattice) # for splom
library(car)     # for vif

# number of observations to simulate
nobs = 100

# Using a correlation matrix (let' assume that all variables
# have unit variance
M = matrix(c(1, 0.7, 0.7, 0.5,
             0.7, 1, 0.95, 0.3,
             0.7, 0.95, 1, 0.3,
             0.5, 0.3, 0.3, 1), nrow=4, ncol=4)

# Cholesky decomposition
L = chol(M)
nvars = dim(L)[1]

# R chol function produces an upper triangular version of L
# so we have to transpose it.
# Just to be sure we can have a look at t(L) and the
# product of the Cholesky decomposition by itself

t(L)

     [,1]       [,2]        [,3]      [,4]
[1,]  1.0  0.0000000  0.00000000 0.0000000
[2,]  0.7  0.7141428  0.00000000 0.0000000
[3,]  0.7  0.6441288  0.30837970 0.0000000
[4,]  0.5 -0.0700140 -0.01589586 0.8630442

t(L) %*% L

     [,1] [,2] [,3] [,4]
[1,]  1.0 0.70 0.70  0.5
[2,]  0.7 1.00 0.95  0.3
[3,]  0.7 0.95 1.00  0.3
[4,]  0.5 0.30 0.30  1.0

# Random variables that follow an M correlation matrix
r = t(L) %*% matrix(rnorm(nvars*nobs), nrow=nvars, ncol=nobs)
r = t(r)

rdata = as.data.frame(r)
names(rdata) = c('resp', 'pred1', 'pred2', 'pred3')

# Plotting and basic stats
splom(rdata)
cor(rdata)

# We are not that far from what we want to simulate!
           resp     pred1     pred2     pred3
resp  1.0000000 0.7347572 0.7516808 0.3915817
pred1 0.7347572 1.0000000 0.9587386 0.2841598
pred2 0.7516808 0.9587386 1.0000000 0.2942844
pred3 0.3915817 0.2841598 0.2942844 1.0000000

Now we can use the simulated data to learn something about the effects of collinearity when fitting multiple linear regressions. We will first fit two models using two predictors with low correlation between them, and then fit a third model with three predictors where pred1 and pred2 are highly correlated with each other.

# Model 1: predictors 1 and 3 (correlation is 0.28)
m1 = lm(resp ~ pred1 + pred3, rdata)
summary(m1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.07536    0.06812   1.106  0.27133
pred1        0.67316    0.06842   9.838 2.99e-16 ***
pred3        0.20920    0.07253   2.884  0.00483 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6809 on 97 degrees of freedom
Multiple R-squared: 0.5762,	Adjusted R-squared: 0.5675
F-statistic: 65.95 on 2 and 97 DF,  p-value: < 2.2e-16 

# Model 2: predictors 2 and 3 (correlation is 0.29)
m2 = lm(resp ~ pred2 + pred3, rdata)
summary(m2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06121    0.06649   0.921  0.35953
pred2        0.68513    0.06633  10.329  < 2e-16 ***
pred3        0.19624    0.07097   2.765  0.00681 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6641 on 97 degrees of freedom
Multiple R-squared: 0.5968,	Adjusted R-squared: 0.5885
F-statistic: 71.79 on 2 and 97 DF,  p-value: < 2.2e-16 

# Model 3: correlation between predictors 1 and 2 is 0.96
m3 = lm(resp ~ pred1 + pred2 + pred3, rdata)
summary(m3)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06421    0.06676   0.962  0.33856
pred1        0.16844    0.22560   0.747  0.45712
pred2        0.52525    0.22422   2.343  0.02122 *
pred3        0.19584    0.07114   2.753  0.00706 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6657 on 96 degrees of freedom
Multiple R-squared: 0.5991,	Adjusted R-squared: 0.5866
F-statistic: 47.83 on 3 and 96 DF,  p-value: < 2.2e-16 

# Variance inflation
vif(m3)

    pred1     pred2     pred3
12.373826 12.453165  1.094875 

In my example it is possible to see the huge increase for the standard error for pred1 and pred2, when we use both highly correlated explanatory variables in model 3. In addition, model fit does not improve for model 3.