Multisite, multivariate genetic analysis: simulation and analysis

The email wasn’t a challenge but a simple question: Is it possible to run a multivariate analysis in multiple sites? I was going to answer yes, of course, and leave it there, but that would be a cruel, unsatisfying answer. We can get a better handle on the question with a simple example; let’s assume that we have two traits (call them tree stem diameter and stem density) assessed in two sites (localities).

Because this is genetics we have a family structure; let’s say half-sib families, so trees within a family share only the mother, and we will ignore any experimental design features to keep things simple. We have 100 families, with 30 trees each, in sites A and B, for a total of 6,000 trees (100 x 30 x 2). The data could look like this:

site family tree diam dens
 A     1     1   20    398
 A     1     2   19    400
...
 A    100   30   24    375
...
 B     1    1    25    396
...
 B    100   30   23    403

We can also think of a trait measured in two sites as separate, but correlated, traits. For example, diameter in site 1 (diam1) is a different variable from diameter in site 2 (diam2). So now we have four response variables (diam1, dens1, diam2, dens2), two of which have only missing values in a given site:

site family tree diam1 dens1 diam2 dens2
 A     1     1    20    398   NA    NA
 A     1     2    19    400   NA    NA
...
 A    100   30    24    375   NA    NA
...
 B      1    1    NA    NA    25    396
...
 B    100   30    NA    NA    23    403

All variables are correlated at the genetic level with an unstructured G covariance matrix, while at the residual level they are correlated only within a site (the same tree was assessed for both traits), with zero correlation between sites; one site could be near Christchurch while the other is near Dunedin, for example.
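
With real data the reshaping from the stacked format to the NA-padded one is straightforward; a minimal sketch, assuming a hypothetical data frame called long with the stacked columns (site, family, tree, diam, dens) shown in the first listing:

# Sketch only: 'long' is a hypothetical stacked data frame
long$diam1 <- ifelse(long$site == 'A', long$diam, NA)
long$dens1 <- ifelse(long$site == 'A', long$dens, NA)
long$diam2 <- ifelse(long$site == 'B', long$diam, NA)
long$dens2 <- ifelse(long$site == 'B', long$dens, NA)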

###
### Data generation
###
 
# Simulate 100 half-sib families with 30 trees each
# for the four traits in two sites
n.fam <- 100
n.prog <- 30
n.trait <- 4
n.site <- 2
 
# Couple of variance matrices,
# G is for additive genetic effects
# R is for residuals
# All variances are assumed equal to 1 without loss
# of generality
G <- matrix(c(1, 0.7, 0.5, 0.3,
              0.7, 1, 0.2, 0.1,
              0.5, 0.2, 1, 0.8,
              0.3, 0.1, 0.8, 1), 4, 4)
 
R <- matrix(c(1, 0.3, 0, 0,
              0.3, 1, 0, 0,
              0, 0, 1, 0.2,
              0, 0, 0.2, 1), 4, 4)
 
 
# Between-family variances account for 1/4 of additive
# genetic effects. Within-family account for 3/4 of
# additive + residual.
# We also get the Cholesky decomposition in the same step
BFL <- chol(1/4 * G)
WFL <- chol(3/4 * G + R)
 
# Simulate random family effects for four traits
fam.eff <- t(BFL) %*% matrix(rnorm(n.fam*n.trait), 
                             n.trait, n.fam)
fam.eff <- t(fam.eff) # This is 100 x 4
 
tre.eff <- t(WFL) %*% matrix(rnorm(n.prog*n.fam*n.trait), 
                             n.trait, n.prog*n.fam)
tre.eff <- t(tre.eff) # This is 3000 x 4
 
# Expand family effects matrix (n.prog each) to match 
# dimension of tree effects
pheno <- fam.eff[rep(1:dim(fam.eff)[1], each = n.prog),] + tre.eff
 
# Now to 2 sites
pheno2s <- matrix(NA, n.prog*n.fam*n.site, n.trait)
colnames(pheno2s) <- c('diam1', 'dens1', 'diam2', 'dens2')
 
pheno2s[1:3000, 1:2] <- pheno[1:3000, 1:2]
pheno2s[3001:6000, 3:4] <- pheno[1:3000, 3:4]
 
# Creating data set. Family and tree are shamelessly recycled
sim.data <- data.frame(site = factor(rep(c('A', 'B'), each = n.fam*n.prog)),
                       family = factor(rep(1:n.fam, each = n.prog)),
                       tree = factor(1:30),
                       pheno2s)

Some neat things:

  • Data simulation relies on Cholesky decomposition, as explained in this post over a year ago (time flies!). Generation is a bit more complex this time because we have to account for two, rather than one, covariance structures.
  • Another point with data simulation is that one could generate one set of correlated values at a time by using something like t(BFL) %*% rnorm(n.trait) inside a loop or with apply() (see the sketch after this list). This would require much less memory but would also be much slower.
  • We need to repeat each row of the family effects matrix 30 times so we can add it to the individual tree effects. Often we use indexing in matrices or data frames to extract a few rows; here, instead, I’m using rep() inside the index to repeat each row a given number of times.
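
This is roughly what the loop-based alternative mentioned in the second point could look like (a sketch only, producing the same type of family effects one family at a time):

# Low-memory alternative: simulate correlated family effects one family at a time
fam.eff.loop <- matrix(NA, n.fam, n.trait)
for (i in 1:n.fam) {
  fam.eff.loop[i, ] <- t(BFL) %*% rnorm(n.trait)
}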

If we show observations 2990 to 3010 (the last 11 for site A and the first 10 for site B) we can see the pattern of observations below, which has the required structure for an unstructured (US) or heterogeneous-variance correlation matrix for G and a block-diagonal matrix for R. By the way, one could also have block-diagonal matrices with repeated measurements, although probably with a different correlation structure:

> pheno2s[2990:3010,]
            diam1      dens1      diam2       dens2
 [1,]  0.57087250 -0.8521059         NA          NA
 [2,]  0.94859621  0.6599391         NA          NA
 [3,] -3.37405451 -0.6093312         NA          NA
 [4,]  0.93541048 -0.7977893         NA          NA
 [5,] -0.74758553  0.7962593         NA          NA
 [6,]  0.51280201  1.4870425         NA          NA
 [7,] -1.92571147 -0.2554365         NA          NA
 [8,] -1.15923045 -2.0656582         NA          NA
 [9,] -0.82145904 -0.3138340         NA          NA
[10,]  1.79631670  0.3814723         NA          NA
[11,] -0.01604778 -1.1804723         NA          NA
[12,]          NA         NA -0.1436143 -1.97628883
[13,]          NA         NA -2.7099687 -2.93832962
[14,]          NA         NA -2.5153420  0.73780760
[15,]          NA         NA  0.3084056  0.61696714
[16,]          NA         NA -0.2909500  0.78111864
[17,]          NA         NA -1.8629862 -2.19346309
[18,]          NA         NA  0.8673053 -0.07692884
[19,]          NA         NA -0.1459703  0.36981965
[20,]          NA         NA -0.7688851 -0.96765799
[21,]          NA         NA  0.6637173 -0.34924814

Gratuitous picture: driving in rural Japan, the cultural value of wood and breaking preconceptions (Photo: Luis, click to enlarge).

For the analysis we will use ASReml, which is the Pro option if you work in a breeding program and require solving much larger systems of equations than this tiny example (say 50-100 traits, large pedigrees, etc). Another option for playing with this small data set would be to use MCMCglmm, which also allows for multivariate evaluation of linear mixed models.
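
As an aside, and purely as a rough sketch (not run here), the MCMCglmm version of a fully unstructured model could look something like the code below. Note that it needs explicit priors for the 4 x 4 structures and, unlike the ASReml analysis that follows, it does not force the between-site residual covariances to zero:

# Rough sketch only: unstructured G and R with MCMCglmm
library(MCMCglmm)
prior <- list(R = list(V = diag(4), nu = 4.002),
              G = list(G1 = list(V = diag(4), nu = 4.002)))
m.mcmc <- MCMCglmm(cbind(diam1, dens1, diam2, dens2) ~ trait - 1,
                   random = ~ us(trait):family,
                   rcov = ~ us(trait):units,
                   family = rep('gaussian', 4),
                   prior = prior,
                   data = sim.data)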

###
### Multivariate analysis
###
 
require(asreml)
 
# Assuming zero correlation between traits (equivalent
# to four univariate analyses)
m1 <- asreml(cbind(diam1, dens1, diam2, dens2) ~ trait, 
             random = ~ diag(trait):family, 
             rcov = ~ units:diag(trait),
             data = sim.data)
 
summary(m1)$varcomp
 
#                                 gamma component  std.error   z.ratio constraint
#trait:family!trait.diam1.var 0.2571274 0.2571274 0.04524049  5.683569   Positive
#trait:family!trait.dens1.var 0.2265041 0.2265041 0.04059742  5.579274   Positive
#trait:family!trait.diam2.var 0.2383959 0.2383959 0.04185982  5.695102   Positive
#trait:family!trait.dens2.var 0.2567999 0.2567999 0.04459246  5.758818   Positive
#R!variance                   1.0000000 1.0000000         NA        NA      Fixed
#R!trait.diam1.var            1.8290472 1.8290472 0.04803313 38.078866   Positive
#R!trait.dens1.var            1.7674960 1.7674960 0.04641672 38.078866   Positive
#R!trait.diam2.var            1.6779793 1.6779793 0.04406589 38.078866   Positive
#R!trait.dens2.var            1.7028171 1.7028171 0.04471817 38.078866   Positive
 
# The multivariate analysis allowing for correlation
# is a bit more complex, so we start by generating 
# a structure for starting values
m2.sv <- asreml(cbind(diam1, dens1, diam2, dens2) ~ trait, 
                random = ~ corgh(trait):family, 
                rcov = ~ units:us(trait),
                data = sim.data,
                start.values = TRUE)
 
# Now we'll constrain some R variance components
# to zero and fix them there. These are values
# which we know are 0
sv <- m2.sv$gammas.table
sv$Value[c(15, 16, 18, 19)] <- 0
sv$Constraint[c(15, 16, 18, 19)] <- 'F'
 
# Run the analyses with those constraints for the R matrix
# using the restricted values in R.param
m2 <- asreml(cbind(diam1, dens1, diam2, dens2) ~ trait, 
             random = ~ corgh(trait):family, 
             rcov = ~ units:us(trait),
             data = sim.data,
             R.param = sv)
 
summary(m2)$varcomp
#                                                 gamma   component  std.error
# trait:family!trait.dens1:!trait.diam1.cor 0.72535514 0.72535514 0.06381776
# trait:family!trait.diam2:!trait.diam1.cor 0.49100215 0.49100215 0.10067768
# trait:family!trait.diam2:!trait.dens1.cor 0.21445972 0.21445972 0.12085780
# trait:family!trait.dens2:!trait.diam1.cor 0.37476119 0.37476119 0.10975970
# trait:family!trait.dens2:!trait.dens1.cor 0.05221773 0.05221773 0.12439924
# trait:family!trait.dens2:!trait.diam2.cor 0.85356318 0.85356318 0.04321683
# trait:family!trait.diam1                  0.25712744 0.25712744 0.04524049
# trait:family!trait.dens1                  0.22650410 0.22650410 0.04059742
# trait:family!trait.diam2                  0.23839593 0.23839593 0.04185982
# trait:family!trait.dens2                  0.25679989 0.25679989 0.04459246
# R!variance                                1.00000000 1.00000000         NA
# R!trait.diam1:diam1                       1.82904721 1.82904721 0.04803313
# R!trait.dens1:diam1                       0.90450432 0.90450432 0.03737490
# R!trait.dens1:dens1                       1.76749598 1.76749598 0.04641672
# R!trait.diam2:diam1                       0.00000000 0.00000000         NA
# R!trait.diam2:dens1                       0.00000000 0.00000000         NA
# R!trait.diam2:diam2                       1.67797927 1.67797927 0.04406589
# R!trait.dens2:diam1                       0.00000000 0.00000000         NA
# R!trait.dens2:dens1                       0.00000000 0.00000000         NA
# R!trait.dens2:diam2                       0.71249532 0.71249532 0.03406354
# R!trait.dens2:dens2                       1.70281710 1.70281710 0.04471817
#
#                                              z.ratio    constraint
# trait:family!trait.dens1:!trait.diam1.cor 11.3660390 Unconstrained
# trait:family!trait.diam2:!trait.diam1.cor  4.8769710 Unconstrained
# trait:family!trait.diam2:!trait.dens1.cor  1.7744796 Unconstrained
# trait:family!trait.dens2:!trait.diam1.cor  3.4143789 Unconstrained
# trait:family!trait.dens2:!trait.dens1.cor  0.4197593 Unconstrained
# trait:family!trait.dens2:!trait.diam2.cor 19.7507102 Unconstrained
# trait:family!trait.diam1                   5.6835685      Positive
# trait:family!trait.dens1                   5.5792738      Positive
# trait:family!trait.diam2                   5.6951016      Positive
# trait:family!trait.dens2                   5.7588182      Positive
# R!variance                                        NA         Fixed
# R!trait.diam1:diam1                       38.0788655      Positive
# R!trait.dens1:diam1                       24.2008477      Positive
# R!trait.dens1:dens1                       38.0788655      Positive
# R!trait.diam2:diam1                               NA         Fixed
# R!trait.diam2:dens1                               NA         Fixed
# R!trait.diam2:diam2                       38.0788655      Positive
# R!trait.dens2:diam1                               NA         Fixed
# R!trait.dens2:dens1                               NA         Fixed
# R!trait.dens2:diam2                       20.9166565      Positive
# R!trait.dens2:dens2                       38.0788655      Positive
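
Before interpreting the results, a quick way to put the targets and the estimates side by side (using the objects already in the session):

# Targets: correlations in G and family variances of 1/4 * 1 = 0.25
G
# Estimates from model 2: correlations (rows 1-6) and family variances (rows 7-10)
round(summary(m2)$varcomp[1:10, c('component', 'std.error')], 3)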

In model 1 each of the family variances was supposed to be ~0.25 (1/4 * 1) and the residual variances ~1.75 (3/4 * 1 + 1). Once we move to model 2 we also recover values similar to the correlations we were trying to simulate. And this is the end of the long answer.

P.S. If, for some bizarre reason, you would like to use SAS for this type of analysis, Fikret Isik has proc mixed code to run multivariate genetic analyses.

Overlay of design matrices in genetic analysis

I’ve ignored my quantitative geneticist side of things for a while (at least in this blog) so this time I’ll cover some code I was exchanging with a couple of colleagues who work for other organizations.

It is common to use diallel mating designs in plant and tree breeding, where a small number of parents acts as both males and females. For example, with 5 parents we can have 25 crosses, including reciprocals and selfing (crossing an individual with itself). Decades ago this mating design was tricky to fit and, considering an experimental layout with randomized complete blocks, one would have something like y = mu + blocks + dads + mums + cross + error. In this model dads and mums were estimating a fraction of the additive genetic variance. With the advent of animal-model BLUP it became possible to fit something like y = mu + blocks + individual (using a pedigree) + cross + error. Another, less computationally demanding alternative (at least with unrelated parents) is to fit a parental model, overlaying the design matrices for parents with something like y = mu + blocks + (dad + mum) + cross + error.

The following code simulates data for a diallel trial with three replicates and runs a parental model with ASReml. Later I expand the analysis, building the matrices by hand.

# Defining diallel cross
n.blocks <- 3
diallel <- matrix(0, nrow = 5, ncol = 5)
males <- 1:dim(diallel)[1]
females <- 1:dim(diallel)[2]
cross <- factor(outer(males, females, paste, sep ='x'))
cross
#[1] 1x1 2x1 3x1 4x1 5x1 1x2 2x2 3x2 4x2 5x2 1x3 2x3 3x3 4x3 5x3 1x4 2x4
#[18] 3x4 4x4 5x4 1x5 2x5 3x5 4x5 5x5
#25 Levels: 1x1 1x2 1x3 1x4 1x5 2x1 2x2 2x3 2x4 2x5 3x1 3x2 3x3 ... 5x5
 
 
#### Generating random values for the trial
# Variance components
sig2.a <- 40    # additive genetic variance
sig2.sca <- 10  # specific combining ability (SCA)
sig2.b <- 10    # blocks
sig2.e <- 30    # residuals
 
# Numbers of everything
n.parents <- length(males)
n.crosses <- length(cross)
n.trees <- n.crosses*n.blocks
 
# Random values for everything
u.g <- rnorm(n.parents)*sqrt(sig2.a)
u.sca <- rnorm(n.crosses)*sqrt(sig2.sca)
u.block <- rnorm(n.blocks)*sqrt(sig2.b)
u.e <- rnorm(n.trees)*sqrt(sig2.e)
 
# Whole trial
trial <- data.frame(block = factor(rep(1:n.blocks, each = n.crosses)),
                    mum = factor(rep(females, n.blocks)),
                    dad = factor(rep(males, each = length(females))),
                    mate = factor(rep(cross, n.blocks)))
 
trial$yield <- 0.5*(u.g[trial$mum] + u.g[trial$dad]) +
               u.sca[trial$mate] + u.block[trial$block] + u.e
 
head(trial)   
#block mum dad mate       yield
#1     1   1   1  1x1 -0.02185486
#2     1   2   1  2x1 10.79760712
#3     1   3   1  3x1 16.45186037
#4     1   4   1  4x1  8.15026291
#5     1   5   1  5x1  5.57707180
#6     1   1   2  1x2 -7.30675148
 
# Fitting the model with ASReml
library(asreml)
m1 <- asreml(yield ~ 1, 
             random = ~ block + dad + and(mum) + mate, 
             data = trial)
summary(m1)$varcomp
#                       gamma    component    std.error   z.ratio
#block!block.var 1.299110e-02 3.861892e-01 1.588423e+00 0.2431274
#dad!dad.var     2.101417e-01 6.246930e+00 5.120745e+00 1.2199259
#mate!mate.var   4.589938e-07 1.364461e-05 2.340032e-06 5.8309519
#R!variance      1.000000e+00 2.972722e+01 5.098177e+00 5.8309519
 
# Obtaining the predicted breeding values for the parents
coef(m1, pattern = 'dad')
#         effect
#dad_1  1.780683
#dad_2 -2.121174
#dad_3  3.151991
#dad_4 -1.473620
#dad_5 -1.337879

How is the matrix overlay working? We can replicate the calculations used by ASReml by building the matrices from scratch and reusing the variance components, so we avoid the nastiness of writing code for residual maximum likelihood. Once I build the basic matrices I use the code from my ‘How does a linear mixed model look like?’ post.
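
For reference, the system assembled in the code below is the standard set of mixed model equations, written here in plain text with ' for transpose and inv() for the inverse:

C [b; a] = RHS, where

C   = | X' inv(R) X    X' inv(R) Z          |
      | Z' inv(R) X    Z' inv(R) Z + inv(G) |

RHS = | X' inv(R) y |
      | Z' inv(R) y |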

# Building incidence matrices for males and females
Zmales <- model.matrix(~ dad - 1, data = trial)
Zfemales <- model.matrix(~ mum - 1, data = trial)
 
# and() in ASReml overlays (i.e. sums) the matrices
# (Notice that this creates 0s, 1s and 2s in the design matrix)
Zparents <- Zmales + Zfemales
Zparents[1:6,]
#   dad1 dad2 dad3 dad4 dad5
#1     2    0    0    0    0
#2     1    1    0    0    0
#3     1    0    1    0    0
#4     1    0    0    1    0
#5     1    0    0    0    1
#6     1    1    0    0    0
 
# Design matrices from other factors
Zblocks <- model.matrix(~ block - 1, data = trial)
Zmates <- model.matrix(~ mate - 1, data = trial)
 
# Creating y, X and big Z matrices to solve mixed model equations
y <- trial$yield
X <- matrix(1, nrow = 75, ncol = 1)
Z <- cbind(Zblocks, Zparents, Zmates)
 
# Building covariance matrices for random effects
# Using the variance components from the ASReml model
G = diag(c(rep(3.861892e-01, 3), 
           rep(6.246930e+00, 5), 
           rep(1.364461e-05, 25)))
R = diag(2.972722e+01, 75, 75)
Rinv = solve(R)
 
# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z
 
XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X
 
Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y
 
# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
          cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)
 
blup = solve(C, RHS)
blup
# These results are identical to the ones
# produced by ASReml
#dad1     1.780683e+00
#dad2    -2.121174e+00
#dad3     3.151991e+00
#dad4    -1.473620e+00
#dad5    -1.337879e+00

The overlay matrix Zparents is double the actual value we should use:

Zparents2 <- 0.5*(Zmales + Zfemales)
Zparents2[1:6,]
#  dad1 dad2 dad3 dad4 dad5
#1  1.0  0.0  0.0  0.0  0.0
#2  0.5  0.5  0.0  0.0  0.0
#3  0.5  0.0  0.5  0.0  0.0
#4  0.5  0.0  0.0  0.5  0.0
#5  0.5  0.0  0.0  0.0  0.5
#6  0.5  0.5  0.0  0.0  0.0

Repeating the analyses ‘by hand’ using Zparents2 to build the Z matrix results in the same overall mean and block effects, but the predicted breeding values for parents obtained with Zparents are 0.7 times those obtained with Zparents2. I may need to refit the model and obtain new variance components when working with the correct overlaid matrix.
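
Just to make the comparison explicit, the ‘by hand’ version with the halved overlay only needs a different Z; a sketch reusing the objects above (note that it recycles the variance components estimated with the and() overlay, so it is not a proper refit):

# Same mixed model equations, but using the halved overlay matrix
Z2 <- cbind(Zblocks, Zparents2, Zmates)
C2 <- rbind(cbind(t(X) %*% Rinv %*% X,  t(X) %*% Rinv %*% Z2),
            cbind(t(Z2) %*% Rinv %*% X, t(Z2) %*% Rinv %*% Z2 + solve(G)))
RHS2 <- rbind(t(X) %*% Rinv %*% y, t(Z2) %*% Rinv %*% y)
blup2 <- solve(C2, RHS2)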

Gratuitous picture: walking in Quail Island (Photo: Luis).

Suicide statistics and the Christchurch earthquake

Suicide is a tragic and complex problem. This week New Zealand’s Chief Coroner released its annual statistics on suicide, which come with several tables and figures. One of those figures refers to monthly suicides in the Christchurch region (where I live) and comes with an interesting comment:

Suicides in the Christchurch region (Timaru to Kaikoura) have risen from 67 (2010/11) to 81 (2011/12). The average number of suicides per year for this region over the past four years is 74. The figure of 67 deaths last year reflected the drop in suicides post-earthquake. The phenomenon of a drop in the suicide rate after a large scale crisis event, such as a natural disaster, has been observed elsewhere. [my emphasis]

‘Provisional Suicide deaths in relation to the Christchurch earthquakes’ is the original header for the graph in the report. The first earthquake was in September 2010 and is marked with red dots.

The figure highlights the earthquake and its major aftershocks using different colors. It is true that we have faced large problems following the 4th September 2010 earthquake and thousands of aftershocks, but can we really support the coroner’s interpretation (already picked up by the media)? One could have a look at the data before the earthquake, where there are also big drops and rises (what would be the explanation for those? nothing to do with any earthquake). In fact, the average number of suicides hasn’t really changed after the quake.

I typed the data into this file, calculated the mean number of suicides per month (~6.3) and generated a few random realizations of a Poisson process using that mean; here I’m plotting the real data in panel 1 and four other randomly generated series in panels 2 to 5.

require(lattice)
su = read.csv('suicide-canterbury-2012.csv')
 
su$Day = ifelse(su$Month %in% c(1, 3, 5, 7, 8, 10, 12), 31,
                ifelse(su$Month %in% c(4, 6, 9, 11), 30, 28))
su$Date = as.Date(paste(su$Day, su$Month, su$Year, sep = '-'), format = '%d-%m-%Y')
 
# No huge autocorrelation
acf(su$Number)
 
# Actual data
xyplot(Number ~ Date, data = su, type = 'b')
 
# Mean number of monthly suicides: 6.283
# Simulating 4 5-year series using Poisson for
# panels 2 to 5. Panel 1 contains the observed data
simulated =  data.frame(s = factor(rep(1:5, each = 60)),
                        d = rep(su$Date, 5),
                        n = c(su$Number, rpois(240, lambda = mean(su$Number))))
 
xyplot(n ~ d | s, simulated, type = 'b', cex = 0.3, layout = c(1, 5),
       xlab = 'Date', ylab = 'Suicides per month')

Observed suicide data for Christchurch (panel 1) and four 60-month simulations (panels 2-5).

Do they really look different? We could try to fit any set of peaks and valleys to our favorite narrative; however, it doesn’t necessarily make any sense. We’ll need a larger data set to see any long-term effects of the earthquake.
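
For what it’s worth, a crude check of the ‘no change in the average’ claim (a sketch reusing the su data frame from above, ignoring seasonality and the timing of individual aftershocks):

# Before/after comparison of mean monthly suicides with a Poisson GLM;
# 'post' splits the series at the September 2010 earthquake
su$post <- factor(su$Date >= as.Date('2010-09-01'))
summary(glm(Number ~ post, family = poisson, data = su))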

P.S. 2012-09-18. Thomas Lumley comments on this post in StatsChat.

R, Julia and genome wide selection

— “You are a pussy” emailed my friend.
— “Sensu cat?” I replied.
— “No. Sensu chicken” blurbed my now ex-friend.

What was this about? He read my post on R, Julia and the shiny new thing, which prompted him to assume that I was the proverbial old dog unwilling (or was it unable?) to learn new tricks. (Incidentally, with friends like this who needs enemies? Hi, Gus.)

Having a look at different—statistical—horses (Photo: Luis, click to enlarge).

I decided to tackle a small—but hopefully useful—piece of code: fitting/training a Genome Wide Selection model, using the Bayes A approach put forward by Meuwissen, Hayes and Goddard in 2001. In that approach the breeding values of the individuals (response) are expressed as a function of a very large number of random predictors (2000, our molecular markers). The dataset (csv file) is a simulation of 2000 bi-allelic markers (aa = 0, Aa = 1, AA = 2) for 250 individuals, followed by the phenotypes (column 2001) and breeding values (column 2002). These models are frequently fitted using MCMC.

In 2010 I attended this course in Ames, Iowa where Rohan Fernando passed us the following R code (pretty much a transliteration from C code; notice the trailing semicolons, for example). P.S. 2012-04-26: Please note that this is teaching code, not production code:

nmarkers = 2000;    # number of markers
startMarker = 1981; # set to 1 to use all
numiter  = 2000;    # number of iterations
vara     = 1.0/20.0; 
 
# input data
data     = matrix(scan("trainData.out0"),ncol=nmarkers+2,byrow=TRUE);
nrecords = dim(data)[1];
 
beg = Sys.time()
 
# x has the mean followed by the markers
x = cbind(1,data[,startMarker:nmarkers]);  
y = data[,nmarkers+1];
a =  data[,nmarkers+2];
# initial values
 
nmarkers = nmarkers - startMarker + 1;
mean2pq = 0.5;                          # just an approximation
scalea  = 0.5*vara/(nmarkers*mean2pq);  # 0.5 = (v-2)/v for v=4
 
size = dim(x)[2];
b = array(0.0,size);
meanb = b;
b[1] = mean(y);
var  = array(0.0,size);
 
# adjust y
 ycorr = y - x%*%b;                  
 
# MCMC sampling
for (iter in 1:numiter){	
  # sample vare
  vare = ( t(ycorr)%*%ycorr )/rchisq(1,nrecords + 3);
 
  # sample intercept
  ycorr = ycorr + x[,1]*b[1];
  rhs = sum(ycorr)/vare;
  invLhs = 1.0/(nrecords/vare);
  mean = rhs*invLhs;                            
  b[1] = rnorm(1,mean,sqrt(invLhs));                  
  ycorr = ycorr - x[,1]*b[1];
  meanb[1] = meanb[1] + b[1];
 
  # sample variance for each locus	
  for (locus in 2:size){
    var[locus] = (scalea*4+b[locus]*b[locus])/rchisq(1,4.0+1) 
  }
 
# sample effect for each locus	
  for (locus in 2:size){
    # unadjust y for this locus
    ycorr = ycorr + x[,locus]*b[locus];   
    rhs = t(x[,locus])%*%ycorr/vare;
    lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus];
    invLhs = 1.0/lhs;
    mean = invLhs*rhs;
    b[locus]= rnorm(1,mean,sqrt(invLhs));  
    #adjust y for the new value of this locus                   
    ycorr = ycorr - x[,locus]*b[locus];   
    meanb[locus] = meanb[locus] + b[locus];		
  }	
}
 
Sys.time() - beg
 
meanb = meanb/numiter;
aHat  = x %*% meanb;

Thus, we just need to define a few variables, read the data (marker genotypes, breeding values and phenotypic data) into a matrix, create loops, do matrix and vector multiplication, and generate random numbers (from Gaussian and Chi-squared distributions). Not much if you think about it, but I didn’t have much time to explore Julia’s features, so I didn’t go for something more complex.

nmarkers = 2000    # Number of markers
startmarker = 1981 # Set to 1 to use all
numiter = 2000     # Number of iterations
 
data = dlmread("markers.csv", ',')
(nrecords, ncols) = size(data)
 
tic()
 
#this is the mean and markers matrix
X = hcat(ones(Float64, nrecords), data[:, startmarker:nmarkers])
y = data[:, nmarkers + 1]
a = data[:, nmarkers + 2]
 
nmarkers = nmarkers - startmarker + 1
vara = 1.0/nmarkers
mean2pq = 0.5
 
scalea  = 0.5*vara/(nmarkers*mean2pq) # 0.5 = (v-2)/v for v=4
 
ndesign = size(X, 2)
b = zeros(Float64, ndesign)
meanb = zeros(Float64, ndesign)
b[1] = mean(y)
varian  = zeros(Float64, ndesign)
 
# adjust y
ycorr = y - X * b                  
 
# MCMC sampling
for i = 1:numiter
  # sample vare
  vare = dot(ycorr, ycorr )/randchi2(nrecords + 3)
 
  # sample intercept
  ycorr = ycorr + X[:, 1] * b[1];
  rhs = sum(ycorr)/vare;
  invlhs = 1.0/(nrecords/vare);
  mn = rhs*invlhs;                            
  b[1] = randn() * sqrt(invlhs) + mn;
  ycorr = ycorr - X[:, 1] * b[1];
  meanb[1] = meanb[1] + b[1];
 
  # sample variance for each locus
  for locus = 2:ndesign
      varian[locus] = (scalea*4 + b[locus]*b[locus])/randchi2(4.0 + 1);
  end
 
  # sample effect for each locus
  for locus = 2:ndesign
      # unadjust y for this locus
      ycorr = ycorr + X[:, locus] * b[locus]; 
      rhs = dot(X[:, locus], ycorr)/vare;
      lhs = dot(X[:, locus], X[:, locus])/vare + 1.0/varian[locus];
      invlhs = 1.0/lhs;
      mn = invlhs * rhs;
      b[locus] = randn() * sqrt(invlhs) + mn;
      #adjust y for the new value of this locus
      ycorr = ycorr - X[:, locus] * b[locus];
      meanb[locus] = meanb[locus] + b[locus];
  end
end
 
toc()
 
meanb = meanb/numiter;
aHat  = X * meanb;

The code looks remarkably similar and there are four main sources of differences:

  1. The first trivial one is that the original code read a binary dataset and I didn’t know how to do it in Julia, so I’ve read a csv file instead (this is why I start timing after reading the file too).
  2. The second trivial one is to avoid name conflicts between variables and functions; for example, in R the user is allowed to have a variable called var that will not interfere with the variance function. Julia is picky about that, so I needed renaming some variables.
  3. Julia passes variables by reference, while R does so by value when assigning matrices, which tripped me because in the original R code there was something like: b = array(0.0,size); meanb = b;. This works fine in R, but in Julia changes to the b vector also changed meanb (see the small example after this list).
  4. The definition of scalar vs array created some problems in Julia. For example y' * y (t(y) %*% y in R) is numerically equivalent to dot(y, y). However, the first version returns an array, while the second one returns a scalar. I got an error message when trying to store the ‘scalar like an array’ into an array. I find that confusing.
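
A tiny example of the third point from the R side (in the 2012 Julia code above, the equivalent assignment would have made meanb an alias of b, which is why it uses separate zeros() calls):

# In R, meanb gets a copy; modifying b afterwards leaves meanb untouched
b <- rep(0, 3)
meanb <- b
b[1] <- 10
meanb[1]  # still 0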

One interesting point in this comparison is that we are using rough code, not really optimized for speed; in fact, the only thing that I can say of the Julia code is that ‘it runs’ and it probably is not very idiomatic. Testing runs with different numbers of markers we get that R needs roughly 2.8x the time used by Julia. The Julia website claims better results in benchmarks, but in real life we work with, well, real problems.

In 1996-7 I switched from SAS to ASReml for genetic analyses because it was 1-2 orders of magnitude faster and opened a world of new models. Today a change from R to Julia would deliver (in this particular case) a much more modest speed-up (~3x), which is OK but not worth changing languages (yet). That, together with the embryonic graphical capabilities and the still-to-develop package ecosystem, means that I’m still using R. Nevertheless, the Julia team has achieved very impressive performance in very little time, so it is worth keeping an eye on their progress.

P.S.1 Readers are welcome to suggest ways of improving the code.
P.S.2 WordPress does not let me upload the binary version of the simulated data.
P.S.3 Hey WordPress guys; it would be handy if the sourcecode plugin supported Julia!
P.S.4 2012-04-26 Following AL’s recommendation in the comments, one can replace in R:

rhs = t(x[,locus])%*%ycorr/vare;
lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus]

by

rhs = crossprod(x[,locus],ycorr)/vare
lhs = crossprod(x[,locus],x[,locus])/vare + 1.0/var[locus]

reducing execution time by roughly 20%, making the difference between Julia and R even smaller.
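
If you want to see that difference in isolation, a self-contained timing sketch (the vectors are simulated here just for the comparison; they are not the actual marker data):

# Rough timing of t(x) %*% y versus crossprod(x, y) for vectors of the
# same length as a marker column (250 individuals)
xl <- rnorm(250)
yc <- rnorm(250)
system.time(for (i in 1:2e5) t(xl) %*% yc)
system.time(for (i in 1:2e5) crossprod(xl, yc))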

Simulating data following a given covariance structure

Every year there are at least a couple of occasions when I have to simulate multivariate data that follow a given covariance matrix. For example, let’s say that we want to create an example of the effect of collinearity when fitting multiple linear regressions, so we want to create one variable (the response) that is correlated with a number of explanatory variables, and where the explanatory variables have different correlations with each other.

There is a matrix operation called the Cholesky decomposition, sort of equivalent to taking a square root with scalars, that is useful to produce correlated data. If we have a covariance matrix M, the Cholesky decomposition is a lower triangular matrix L, such that M = L L'. How does this connect to our simulated data? Let’s assume that we generate a vector z of random, normally and independently distributed numbers with mean zero and variance one (with length equal to the dimension of M); we can create a realization of our multivariate distribution using the product L z.

The reason why this works is that Var(L z) = L Var(z) L', as L is just a constant. The variance of z is the identity matrix I; remember that the random numbers have variance one and are independently distributed. Therefore Var(L z) = L I L' = L L' = M so, in fact, we are producing random data that follow the desired covariance matrix.

As an example, let’s simulate 100 observations with 4 variables. Because we want to simulate 100 realizations, rather than a single one, it pays to generate a matrix of random numbers with as many rows as variables to simulate and as many columns as observations to simulate.

library(lattice) # for splom
library(car)     # for vif
 
# number of observations to simulate
nobs = 100
 
# Using a correlation matrix (let's assume that all variables
# have unit variance)
M = matrix(c(1, 0.7, 0.7, 0.5,
             0.7, 1, 0.95, 0.3,
             0.7, 0.95, 1, 0.3,
             0.5, 0.3, 0.3, 1), nrow=4, ncol=4)
 
# Cholesky decomposition             
L = chol(M)
nvars = dim(L)[1]
 
 
# R chol function produces an upper triangular version of L
# so we have to transpose it.
# Just to be sure we can have a look at t(L) and the
# product of the Cholesky decomposition by itself
 
t(L)
 
     [,1]       [,2]        [,3]      [,4]
[1,]  1.0  0.0000000  0.00000000 0.0000000
[2,]  0.7  0.7141428  0.00000000 0.0000000
[3,]  0.7  0.6441288  0.30837970 0.0000000
[4,]  0.5 -0.0700140 -0.01589586 0.8630442
 
t(L) %*% L
 
     [,1] [,2] [,3] [,4]
[1,]  1.0 0.70 0.70  0.5
[2,]  0.7 1.00 0.95  0.3
[3,]  0.7 0.95 1.00  0.3
[4,]  0.5 0.30 0.30  1.0
 
 
# Random variables that follow an M correlation matrix
r = t(L) %*% matrix(rnorm(nvars*nobs), nrow=nvars, ncol=nobs)
r = t(r)
 
rdata = as.data.frame(r)
names(rdata) = c('resp', 'pred1', 'pred2', 'pred3')
 
# Plotting and basic stats
splom(rdata)
cor(rdata)
 
# We are not that far from what we want to simulate!
           resp     pred1     pred2     pred3
resp  1.0000000 0.7347572 0.7516808 0.3915817
pred1 0.7347572 1.0000000 0.9587386 0.2841598
pred2 0.7516808 0.9587386 1.0000000 0.2942844
pred3 0.3915817 0.2841598 0.2942844 1.0000000

Now we can use the simulated data to learn something about the effects of collinearity when fitting multiple linear regressions. We will first fit two models using two predictors with low correlation between them, and then fit a third model with three predictors where pred1 and pred2 are highly correlated with each other.

# Model 1: predictors 1 and 3 (correlation is 0.28)
m1 = lm(resp ~ pred1 + pred3, rdata)
summary(m1)
 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.07536    0.06812   1.106  0.27133    
pred1        0.67316    0.06842   9.838 2.99e-16 ***
pred3        0.20920    0.07253   2.884  0.00483 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.6809 on 97 degrees of freedom
Multiple R-squared: 0.5762,	Adjusted R-squared: 0.5675 
F-statistic: 65.95 on 2 and 97 DF,  p-value: < 2.2e-16 
 
# Model 2: predictors 2 and 3 (correlation is 0.29)
m2 = lm(resp ~ pred2 + pred3, rdata)
summary(m2)
 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.06121    0.06649   0.921  0.35953    
pred2        0.68513    0.06633  10.329  < 2e-16 ***
pred3        0.19624    0.07097   2.765  0.00681 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.6641 on 97 degrees of freedom
Multiple R-squared: 0.5968,	Adjusted R-squared: 0.5885 
F-statistic: 71.79 on 2 and 97 DF,  p-value: < 2.2e-16 
 
# Model 3: correlation between predictors 1 and 2 is 0.96
m3 = lm(resp ~ pred1 + pred2 + pred3, rdata)
summary(m3)
 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.06421    0.06676   0.962  0.33856   
pred1        0.16844    0.22560   0.747  0.45712   
pred2        0.52525    0.22422   2.343  0.02122 * 
pred3        0.19584    0.07114   2.753  0.00706 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.6657 on 96 degrees of freedom
Multiple R-squared: 0.5991,	Adjusted R-squared: 0.5866 
F-statistic: 47.83 on 3 and 96 DF,  p-value: < 2.2e-16 
 
 
# Variance inflation
vif(m3)
 
    pred1     pred2     pred3 
12.373826 12.453165  1.094875

In my example it is possible to see the huge increase in the standard errors for pred1 and pred2 when we use both highly correlated explanatory variables in model 3. In addition, model fit does not improve for model 3.
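
To see the inflation directly, a quick comparison (the numbers will change between runs because no random seed was set):

# Standard errors for pred1 in the low- and high-collinearity models,
# plus the (barely different) R-squared values
coef(summary(m1))['pred1', 'Std. Error']
coef(summary(m3))['pred1', 'Std. Error']
c(summary(m1)$r.squared, summary(m3)$r.squared)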