Overlay of design matrices in genetic analysis

2012-10-12 / Luis

I’ve ignored my quantitative geneticist side of things for a while (at least in this blog) so this time I’ll cover some code I was exchanging with a couple of colleagues who work for other organizations.

It is common to use diallel mating designs in plant and tree breeding, where a small number of parents acts as both males and females. For example, with 5 parents we can have 25 crosses, including reciprocals and selfing (crossing an individual with itself). Decades ago this mating design was tricky to fit and, considering an experimental layout with randomized complete blocks, one would have something like y = mu + blocks + dads + mums + cross + error. In this model dads and mums were estimating a fraction of the additive genetic variance. With the advent of animal model BLUP, was possible to fit something like y = mu + blocks + individual (using a pedigree) + cross + error. Another less computationally demanding alternative (at least with unrelated parents) is to fit a parental model, overlaying the design matrices for parents with something like this y = mu + blocks + (dad + mum) + cross + error.

The following code simulate data for a a diallel trial with three replicates and runs a parental model with ASReml. Later I expand the analysis building the matrices by hand.

# Defining diallel cross
n.blocks <- 3
diallel <- matrix(0, nrow = 5, ncol = 5)
males <- 1:dim(diallel)[1]
females <- 1:dim(diallel)[2]
cross <- factor(outer(males, females, paste, sep ='x'))
cross
#[1] 1x1 2x1 3x1 4x1 5x1 1x2 2x2 3x2 4x2 5x2 1x3 2x3 3x3 4x3 5x3 1x4 2x4
#[18] 3x4 4x4 5x4 1x5 2x5 3x5 4x5 5x5
#25 Levels: 1x1 1x2 1x3 1x4 1x5 2x1 2x2 2x3 2x4 2x5 3x1 3x2 3x3 ... 5x5


#### Generating random values for the trial
# Variance components
sig2.a <- 40    # additive genetic variance
sig2.sca <- 10  # specific combining ability (SCA)
sig2.b <- 10    # blocks
sig2.e <- 30    # residuals

# Numbers of everything
n.parents <- length(males)
n.crosses <- length(cross)
n.trees <- n.crosses*n.blocks

# Random values for everything
u.g <- rnorm(n.parents)*sqrt(sig2.a)
u.sca <- rnorm(n.crosses)*sqrt(sig2.sca)
u.block <- rnorm(n.blocks)*sqrt(sig2.b)
u.e <- rnorm(n.trees)*sqrt(sig2.e)

# Whole trial
trial <- data.frame(block = factor(rep(1:n.blocks, each = n.crosses)),
                    mum = factor(rep(females, n.blocks)),
                    dad = factor(rep(males, each = length(females))),
                    mate = factor(rep(cross, n.blocks)))

trial$yield <- 0.5*(u.g[trial$mum] + u.g[trial$dad]) +
               u.sca[trial$mate] + u.block[trial$block] + u.e

head(trial)
#block mum dad mate       yield
#1     1   1   1  1x1 -0.02185486
#2     1   2   1  2x1 10.79760712
#3     1   3   1  3x1 16.45186037
#4     1   4   1  4x1  8.15026291
#5     1   5   1  5x1  5.57707180
#6     1   1   2  1x2 -7.30675148

# Fitting the model with ASReml
library(asreml)
m1 <- asreml(yield ~ 1,
             random = ~ block + dad + and(mum) + mate,
             data = trial)
summary(m1)
#                       gamma    component    std.error   z.ratio
#block!block.var 1.299110e-02 3.861892e-01 1.588423e+00 0.2431274
#dad!dad.var     2.101417e-01 6.246930e+00 5.120745e+00 1.2199259
#mate!mate.var   4.589938e-07 1.364461e-05 2.340032e-06 5.8309519
#R!variance      1.000000e+00 2.972722e+01 5.098177e+00 5.8309519

# Obtaining the predicted breeding values for the parents
coef(m1, pattern = 'dad')
#         effect
#dad_1  1.780683
#dad_2 -2.121174
#dad_3  3.151991
#dad_4 -1.473620
#dad_5 -1.337879

How is the matrix overlay working? We can replicate the calculations used by ASReml by building the matrices from scratch and reusing the variance components, so we avoid the nastiness of writing code for residual maximum likelihood. Once I build the basic matrices I use the code from my How does a linear mixed model look like? post.

# Building incidence matrices for males and females
Zmales <- model.matrix(~ dad - 1, data = trial)
Zfemales <- model.matrix(~ mum - 1, data = trial)

# and() in ASReml overlays (i.e. sums) the matrices
# (Notice that this creates 0s, 1 and 2s in the design matrix)
Zparents <- Zmales + Zfemales
Zparents[1:6,]
#   dad1 dad2 dad3 dad4 dad5
#1     2    0    0    0    0
#2     1    1    0    0    0
#3     1    0    1    0    0
#4     1    0    0    1    0
#5     1    0    0    0    1
#6     1    1    0    0    0

# Design matrices from other factors
Zblocks <- model.matrix(~ block - 1, data = trial)
Zmates <- model.matrix(~ mate - 1, data = trial)

# Creating y, X and big Z matrices to solve mixed model equations
y <- trial$yield
X <- matrix(1, nrow = 75, ncol = 1)
Z <- cbind(Zblocks, Zparents, Zmates)

# Building covariance matrices for random effects
# Using the variance components from the ASReml model
G <- diag(c(rep(3.861892e-01, 3),
            rep(6.246930e+00, 5),
            rep(1.364461e-05, 25)))

R <- diag(2.972722e+01, 75, 75)
Rinv <- solve(R)

# Components of C
XpX <- t(X) %*% Rinv %*% X
ZpZ <- t(Z) %*% Rinv %*% Z

XpZ <- t(X) %*% Rinv %*% Z
ZpX <- t(Z) %*% Rinv %*% X

Xpy <- t(X) %*% Rinv %*% y
Zpy <- t(Z) %*% Rinv %*% y

# Building C * [b a] = RHS
C <- rbind(cbind(XpX, XpZ),
           cbind(ZpX, ZpZ + solve(G)))

RHS <- rbind(Xpy, Zpy)

blup <- solve(C, RHS)
blup
# These results are identical to the ones
# produced by ASReml
#dad1     1.780683e+00
#dad2    -2.121174e+00
#dad3     3.151991e+00
#dad4    -1.473620e+00
#dad5    -1.337879e+00

The overlay matrix Zparents is double the actual value we should use:

Zparents2 <- 0.5*(Zmales + Zfemales)
Zparents2[1:6,]
#  dad1 dad2 dad3 dad4 dad5
#1  1.0  0.0  0.0  0.0  0.0
#2  0.5  0.5  0.0  0.0  0.0
#3  0.5  0.0  0.5  0.0  0.0
#4  0.5  0.0  0.0  0.5  0.0
#5  0.5  0.0  0.0  0.0  0.5
#6  0.5  0.5  0.0  0.0  0.0

Repeating the analyses 'by hand' using Zparents2 to build the Z matrix results in the same overall mean and block effects, but the predicted breeding values for parents when using Zparents are 0.7 of the predicted breeding values for parents when using Zparents2. I may need to refit the model and obtain new variance components for parents when working with the correct overlaid matrix.

Gratuitous picture: walking in Quail Island (Photo: Luis).

asreml, linear models, programming, rblogs, simulation, teaching

8 Comments

ibrahim
2012-10-13 at 08:22

I have question, hopefully you can help, how to estimate the breeding values for some lines using Asreml-R, given that we have the kinship matrix ( from molecular markers) and the lines were evaluated in single replicate over number of environments.
in other words how to include the covariance among environment as well as the the covariance among lines.
please refer to this example:
dat <- data.frame(env = c(10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,10,11,12,13,14,15,16,17,18,
19,20,21,22,23,24,25,26,5,6,7,8,9,10,11,12,13,14,15,16,1,2,3,4,5,6,7,8,9,10,11,
12,13,14,15,16,17,26,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18),
gen = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,
4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5),
y = c(63,72,69,55,52,67,63,55,66,57,69,73,68,64,67,70,70,64,82,76,53,59,62,74,54,66,54,
64,69,66,79,66,66,65,75,72,55,52,79,72,83,78,40,54,65,73,66,59,67,55,85,77,51,
49,80,65,88,86,63,53,68,63,52,60,49,40,57,53,74,71,46,44,62,51,72,70,60,51,65,59,51,66))
- Luis (Post author)
  2012-10-13 at 12:59
  
  Your question has nothing to do with this post. Anyway, there should be zero environmental covariance (as they are different sites) and you could model the genetic covariance matrix using a multivariate approach as I explained here. I recommend that you go to the asreml forum for more help.
  - ibrahim
    2012-10-17 at 13:45
    
    Hi:
    thank you for your reply —- and i am so sorry for asking irrelevant question to your post. however it was not clear for me were to post my question. i would like to mention also that i never used ASreml before, so excuse me if my question was a little stupid.
    
    regarding your reply:
    let me make my point more clear
    in plant breeding we are trying to use many locations to evaluate hundreds (if not thousands of lines). in some cases we cannot plant all the lines in each location, but we have some lines overlap from one location to anther. In the same time we can estimate the covariance matrix among all lines using molecular marker data. here is what i have in mind
    
    asreml(yield ~ Site,
    random =~fa(Site):ped(Genotype) + fa(Site):ide(Genotype),
    rcov =~at(Site):ar1(Column):ar1(Row), ## covariance structure
    ginverse =~list(Genotype=ainv), data=qb)
    is that make sense:
    i will appreciate your comments.
    - Luis (Post author)
      2012-10-18 at 13:32
      
      In addition to my previous answer, which included “I recommend that you go to the asreml forum for more help”, I would also like to point 5 in this page.
Odilon Morais
2014-03-15 at 18:05

Dear Luiz,

I am conducting a diallel analysis, and would to employ the mixed models based on Bayesian approach with MCMCglmm package.
But I’m struggling to define the model because of the sources of variation dad and mum, which must be defined in the model together so that a single variance component is obtained for both (dad and mum), as you defined in the your model, observed in the model m1 detailed below.

# Fitting the model with ASReml
library(asreml)
m1 <- asreml(yield ~ 1,
random = ~ block + dad + and(mum) + mate,
data = trial)
summary(m1)
# gamma component std.error z.ratio
#block!block.var 1.299110e-02 3.861892e-01 1.588423e+00 0.2431274
#dad!dad.var 2.101417e-01 6.246930e+00 5.120745e+00 1.2199259
#mate!mate.var 4.589938e-07 1.364461e-05 2.340032e-06 5.8309519
#R!variance 1.000000e+00 2.972722e+01 5.098177e+00 5.8309519

But I did not get information about this possibility for MCMCglmm package in R. It would be possible to set the mcmc model for MCMCglmm also detailed below, in the same way that is set to m1 model, the ASReml?

# Fitting the model with MCMCglmm
library(MCMCglmm)
prior.mcmc <- list(R = list(V = 1e-16, nu = -2),
G = list(G1 = list(V = 1e-16, nu = -2),
G2 = list(V = 1e-16, nu = -2),
G3 = list(V = 1e-16, nu = -2)))

mcmc = MCMCglmm(yield ~ 1,
random = ~ block + dad + mum + mate,
family = 'gaussian',
data = trial,
prior = prior.mcmc,
verbose = FALSE,
pr = TRUE,
burnin = 10000,
nitt = 20000,
thin = 10)
summary(mcmc)

Already very grateful for your attention!
- Luis (Post author)
  2014-03-15 at 19:27
  
  Hi Odilon,
  
  Luis (with s) here. While in this post I discuss the issue of overlapping matrices in the context of a diallel cross, I also point out that “With the advent of animal model BLUP, was possible to fit something like y = mu + blocks + individual (using a pedigree) + cross + error”. Fitting separate mums and dads predates the use of animal models, it is old-fashioned (we’re talking 30-40 years old) and much more complicated. I have code for the analysis of diallels in asreml here. In the case of MCMCglmm there are a few differences, starting with the pedigree terms HAS to be called animal. You could set up something like:
  
  prior = list(R = list(V = 0.007, n = 0),
  G = list(G1 = list(V = 0.002, n = 0), G2 = list(V = 0.001, n = 0), G3 = list(V = 0.001, n = 0))))
  
  # I use high thinning to avoid autocorrelation with animal model
  nvel.bayes <- MCMCglmm(nvel ~ 1, random = ~ animal + Block + Family, family = 'gaussian', data = ncs, pedigree = ped, prior = prior, verbose = FALSE, pr = TRUE, burnin = 10000, nitt = 200000, thin = 200) ped is a data frame with the pedigree, with columns animal, Mum, Dad. This should fit the additive and family effects in a randomized complete block design. Reciprocal effects are home work. 😉
  - Odilon Morais
    2014-03-28 at 20:00
    
    Hi Luis,
    
    Thank you very much by the answer.
    
    Excuse the ignorance on the subject, but I am now initiating studies on Bayesian inference. I’d like to ask a question:
    I want to define a priori information, with normal distribution for the data, in MCMCglmm package. So, what are the values that should be assigned to “V” ans “n” in argument “list (V = …. n …. =)” ??
    
    Already thank you very much!!
Kevin Wright
2014-04-30 at 10:07

Nice post. If you use set.seed(), your example will be reproducible.

Question: Are the calculated breeding values estimates of the u.g effects?