Analyzing a simple experiment with heterogeneous variances using asreml, MCMCglmm and SAS

I was working with a small experiment which includes families from two Eucalyptus species and thought it would be nice to code a first analysis using alternative approaches. The experiment is a randomized complete block design, with species as fixed effect and family and block as a random effects, while the response variable is growth strain (in \( \mu \epsilon\)).

When looking at the trees one can see that the residual variances will be very different. In addition, the trees were growing in plastic bags laid out in rows (the blocks) and columns. Given that trees were growing in bags siting on flat terrain, most likely the row effects are zero.

Below is the code for a first go in R (using both MCMCglmm and ASReml-R) and SAS. I had stopped using SAS for several years, mostly because I was running a mac for which there is no version. However, a few weeks ago I started accessing it via their OnDemand for Academics program via a web browser.

The R code using REML looks like:

# Options
options(stringsAsFactors = FALSE)
setwd('~/Dropbox/research/2013/growth-stress')

# Packages
require(ggplot2)
require(asreml)
require(MCMCglmm)


# Reading data, renaming, factors, etc
gs = read.csv('eucalyptus-growth-stress.csv')
summary(gs)

# Both argophloia and bosistoana
gsboth = subset(gs, !is.na(strain))
gsboth = within(gsboth, {
	species = factor(species)
	row = factor(row)
	column = factor(column)
	fam = factor(fam)
})            



ma = asreml(strain ~ species, random = ~ fam + row, 
            rcov = ~ at(species):units,
            data = gsboth)

summary(ma)$varcomp

#                                   gamma  component std.error   z.ratio constraint
#fam!fam.var                    27809.414  27809.414 10502.036 2.6480022   Positive
#row!row.var                     2337.164   2337.164  3116.357 0.7499666   Positive
#species_E.argopholia!variance 111940.458 111940.458 26609.673 4.2067580   Positive
#species_E.bosistoana!variance  63035.256  63035.256  7226.768 8.7224681   Positive

While using MCMC we get estimates in the ballpark by using:

# Priors
bothpr = list(R = list(V = diag(c(50000, 50000)), nu = 3), 
              G = list(G1 = list(V = 20000, nu = 0.002), 
                       G2 = list(V = 20000, nu = 0.002),
                       G3 = list(V = 20000, nu = 0.002)))

# MCMC                    
m2 = MCMCglmm(strain ~ species, random = ~ fam + row + column, 
              rcov = ~ idh(species):units,
              data = gsboth,
              prior = bothpr,
              pr = TRUE,
              family = 'gaussian',
              burnin = 10000,
              nitt = 40000,
              thin = 20,
              saveX = TRUE,
              saveZ = TRUE,
              verbose = FALSE)

summary(m2)

# Iterations = 10001:39981
# Thinning interval  = 20
# Sample size  = 1500 
# 
# DIC: 3332.578 
# 
# G-structure:  ~fam
# 
# post.mean l-95% CI u-95% CI eff.samp
# fam     30315    12211    55136     1500
# 
# ~row
# 
# post.mean l-95% CI u-95% CI eff.samp
# row      1449    5.928     6274    589.5
# 
# R-structure:  ~idh(species):units
# 
# post.mean l-95% CI u-95% CI eff.samp
# E.argopholia.units    112017    71152   168080     1500
# E.bosistoana.units     65006    52676    80049     1500
# 
# Location effects: strain ~ species 
# 
# post.mean l-95% CI u-95% CI eff.samp  pMCMC    
# (Intercept)            502.21   319.45   690.68     1500 <7e-04 ***
# speciesE.bosistoana   -235.95  -449.07   -37.19     1361  0.036 * 

The SAS code is not that disimilar, except for the clear demarcation between data processing (data step, for reading files, data transformations, etc) and specific procs (procedures), in this case to summarize data, produce a boxplot and fit a mixed model.

* termstr=CRLF accounts for the windows-like line endings of the data set;
data gs;
  infile "/home/luis/Projects/euc-strain/growthstresses.csv" 
        dsd termstr=CRLF firstobs=2;
  input row column species $ family $ strain;
  if strain ^= .;
run;
 
proc summary data = gs print;
  class species;
  var strain;
run;
 
proc boxplot data = gs;
  plot strain*species;
run;
 
proc mixed data = gs;
  class row species family;
  model strain = species;
  random row family;
  repeated species / group=species;
run;
 
/*
Covariance Parameter Estimates
Cov Parm 	Group 	               Estimate
row 	  	                       2336.80
family 	  	                       27808
species 	species E.argoph 	111844
species 	species E.bosist 	63036
*/
SAS boxplot for the data set.

SAS boxplot for the data set.

I like working with multiple languages and I realized that, in fact, I missed SAS a bit. It was like meeting an old friend; at the beginning felt strange but we were quickly chatting away after a few minutes.

Subsetting data

At School we use R across many courses, because students are supposed to use statistics under a variety of contexts. Imagine their disappointment when they pass stats and discovered that R and statistics haven’t gone away!

When students start working with real data sets one of their first stumbling blocks is subsetting data. We have data sets and either they are required to deal with different subsets or there is data cleaning to do. For some reason, many students struggle with what should be a simple task.

If one thinks of data as as a matrix/2-dimensional array, subsetting boils down to extracting the needed rows (cases) and columns (variables). In the R world one can do this in a variety of ways, ranging from the cryptic to the explicit and clear. For example, let’s assume that we have a dataset called alltrials with heights and stem diameters for a number of trees in different trials (We may have a bunch of additional covariates and experimental design features that we’ll ignore for the moment). How do we extract all trees located in Christchurch?

Two common approaches are:

mytrial = alltrials[alltrials$location == 'Christchurch', ]

mytrial = subset(alltrials, location == 'Christchurch')

While both expressions are equivalent, the former reads like Klingon to students, while the latter makes explicit that we are obtaining a subset of the original data set. This can easily be expanded to more complex conditions; for example to include all trees from Christchurch that are taller than 10 m:

mytrial = alltrials[alltrials$location == 'Christchurch' & alltrials$height > 10, ]

mytrial = subset(alltrials, location == 'Christchurch' & height > 10)

I think the complication with the Klingonian notation comes mostly from two sources:

  • Variable names for subsetting the data set are not directly accessible, so we have to prefix them with the NameOfDataset$, making the code more difficult to read, particularly if we join several conditions with & and |.
  • Hanging commas: if we are only working with rows or columns we have to acknowledge it by suffixing or prefixing with a comma, which are often forgotten.

Both points result on frustrating error messages like

  • Error in `[.data.frame`(alltrials, location == 'Christchurch', ) : object 'location' not found for the first point or
  • Error in `[.data.frame`(alltrials, alltrials$location == 'Christchurch') undefined columns selected for the second point.

The generic forms of these two notations are:

dataset[what to do with rows, what to do with columns]

subset(dataset, what to do with rows, what to do with columns)

We often want to keep a subset of the observed cases and keep (or drop) specific variables. For example, we want to keep trees in 'Christchurch' and we want to ignore diameter, because the assessor was 'high' that day:

# With this notation things get a bit trickier
# The easiest way is to provide the number of the variable
# Here diameter is the 5th variable in the dataset
mytrial = alltrials[all.trials$location == 'Christchurch' & all.trials$height > 10, -5]

# This notation is still straightforward
mytrial = subset(alltrials, location == 'Christchurch' & height > 10, select = -diameter)

There are, however, situations where Klingon is easier or more efficient. For example, to take a random sample of 100 trees from the full dataset:


mytrial = alltrials[sample(1:nrow(alltrials), 100, replace = FALSE),]

If you are interested in this issue Quick-R has a good description of subsetting. I'm sure this basic topic must has been covered many times, although I doubt anyone used Klingon in the process.

Gratuitous picture: building blocks for research (Photo: Luis).

Gratuitous picture: building blocks for R (Photo: Luis).

Learning to code in R

It used to be that the one of the first decisions to make when learning to program was between compiled (e.g. C or FORTRAN) and interpreted (e.g. Python) languages. In my opinion these days one would have to be a masochist to learn with a compiled language: the extra compilation time and obscure errors are a killer when learning.

Today the decision would be between using a generic interpreted language (e.g. Python) and an interpreted domain specific language (DSL) like R, MATLAB, etc. While some people prefer generic languages, I’d argue that immediate feedback and easy accomplishment of useful tasks are a great thing when one is learning something for the first time.

As an example, a while ago my son asked me what I was doing in the computer and I told him that I was programming some analyses in R. I showed that the program was spitting back some numbers and plots, a fact that he found totally unremarkable and uninteresting. I searched in internet and I found Scratch, a visual programming language, that let’s the user moves blocks representing code around and build interactive games: now my son was sold. Together we are learning about loops, control structures and variables, drawing characters, etc. We are programming because the problems are i- much more interesting for him and ii- achievable in a short time frame.

An example scratch script.

An example scratch script.

Learning to program for statistics, or other scientific domains for that matter, is not that different from being a kid and learning programming. Having to do too much to get even a mildly interesting result is frustrating and discouraging; it is not that the learner is dumb, but that he has to build too many functions to get a meager reward. This is why I’d say that you should use whatever language already has a large amount of functionality (‘batteries included’ in Python parlance) for your discipline. Choose rightly and you are half-way there.

‘But’ someone will say, R is not a real language. Sorry, but it is a real language (Turing complete and the whole shebang) with oddities, as any other language, granted. As with human languages, the more you study the easier it gets to learn a new language. In fact, the syntax for many basic constructs in R is highly similar to alternatives:

# This is R code
# Loop
for(i in 1:10){
    print(i)
}
#[1] 1
#[1] 2
#[1] 3
#[1] 4
#[1] 5
#[1] 6
#[1] 7
#[1] 8
#[1] 9
#[1] 10

# Control 
if(i == 10) {
    print('It is ten!')
}

#[1] "It is ten!"
# This is Python code
# Loop
for i in range(1,11): # Python starts indexing from zero
    print(i)

#1
#2
#3
#4
#5
#6
#7
#8
#9
#10

# Control
if i == 10:
    print("It is ten!")

#It is ten!

By the way, I formatted the code to highlight similarities. Of course there are plenty of differences between the languages, but many of the building blocks (the core ideas if you will) are shared. You learn one and the next one gets easier.

How does one start learning to program? If I look back at 1985 (yes, last millennium) I was struggling to learn how to program in BASIC when, suddenly, I had an epiphany. The sky opened and I heard a choir singing György Ligeti’s Atmosphères (from 2001: a Space Odyssey, you know) and then I realized: we are only breaking problems in little pieces and dealing with them one at the time. I’ve been breaking problems into little pieces since then. What else did you expect? Anyhow, if you feel that you want to learn how to code in R, or whatever language is the best option for your problems, start small. Just a few lines will do. Read other people’s code but, again, only small pieces that are supposed to do something small. At this stage is easy to get discouraged because everything is new and takes a lot of time. Don’t worry: everyone has to go through this process.

Many people struggle vectorizing the code so it runs faster. Again, don’t worry at the beginning if your code is slow. Keep on writing and learning. Read Norman Noam Ross’s FasteR! HigheR! StongeR! — A Guide to Speeding Up R Code for Busy People. The guide is very neat and useful, although you don’t need super fast code, not yet at least. You need working, readable and reusable code. You may not even need code: actually, try to understand the problem, the statistical theory, before you get coding like crazy. Programming will help you understand your problem a lot better, but you need a helpful starting point.

Don’t get distracted with the politics of research and repeatability and trendy things like git (noticed that I didn’t even link to them?). You’ll learn them in time, once you got a clue about how to program.

P.S. The code used in the examples could be shortened and sped up dramatically (e.g. 1:10 or range(1, 11)) but it is not the point of this post.
P.S.2. A while ago I wrote R is a language, which could be useful connected to this post.

Remembering server installation details

I’ve been moving part of my work to university servers, where I’m just one more peasant user with little privileges. In exchange, I can access the jobs from anywhere and I can access multiple processors if needed. Given that I have a sieve-like memory, where configuration details quickly disappear through many small holes, I’m documenting the little steps needed to move my work environment there.

The server provides a default R installation but none of the additional packages I often install are available (most people accessing the servers don’t use R). I could contact the administrator to get them installed, but I’ve opted for installing them under my user space. For that I followed the instructions presented here, which in summary require adding the name of the default folder (/hpc/home/luis/rpackages) for the local library of packages to my .bashrc file:

R_LIBS="/hpc/home/luis/rpackages"
export R_LIBS

I also have a temporary folder (called rpackages)in the account where I move the source of the packages to be installed (using SFTP). Once the R session is started it is possible to check that the local folder is first in the library path, confirming that R_LIBS has been made available to R.

Then I can install the packages I moved to the server with SFTP from the temporary folder to the local library using install.packages().

.libPaths()
# [1] "/hpc/home/luis/rpackages"                                               
# [2] "/hpc/home/projects/packages/local.linux.ppc/pkg/R/2.15.1/lib64/R/library"
#
install.packages("~/temporary/plyr_1.8.tar.gz", lib="/hpc/home/luis/rpackages", repos=NULL)

# * installing *source* package 'plyr' ...
# ** package 'plyr' successfully unpacked and MD5 sums checked
# ** libs
# gcc -std=gnu99 -I/usr/local/pkg/R/2.15.1/lib64/R/include -DNDEBUG      -fPIC  -O2  -c loop-apply.c -o loop-apply.o
# gcc -std=gnu99 -I/usr/local/pkg/R/2.15.1/lib64/R/include -DNDEBUG      -fPIC  -O2  -c split-numeric.c -o split-numeric.o
# gcc -std=gnu99 -shared -Wl,--as-needed -o plyr.so loop-apply.o split-numeric.o
# installing to /hpc/home/luis/rpackages/plyr/libs
# ** R
# ** data
# **  moving datasets to lazyload DB
# ** inst
# ** preparing package for lazy loading
# ** help
# *** installing help indices
# ** building package indices
# ** testing if installed package can be loaded
#
# * DONE (plyr)

Now we can load the package as normal:

require(plyr)
# Loading required package: plyr

Nothing complicated or groundbreaking, just writing down the small details before I forget them.

Gratuitous picture: just different hardware (Photo: Luis).

Gratuitous picture: just different hardware (Photo: Luis).

An R wish list for 2013

First go and read An R wish list for 2012. None of the wishes came through in 2012. Fix the R website? No, it is the same this year. In fact, it is the same as in 2005. Easy to find help? Sorry, next year. Consistency and sane defaults? Coming soon to a theater near you (one day). Thus my wish list for 2012 is, very handy, still the wish list for 2013.

R as social software

The strength of R is not the software itself, but the community surrounding the software. Put another way, there are several languages that could offer the core functionality, but the whole ‘ecosystem’ that’s another thing. Softening @gappy3000′s comment: innovation is (mostly) happening outside the core.

This prompts some questions: Why isn’t ggplot2 or plyr in the default download? I don’t know if some people realize that ggplot2 is now one of the main attractions for R as data visualization language. Why isn’t Hadley’s name in this page? (Sorry I’m picking on him, first name that came to mind). How come there is not one woman in that page? I’m not saying there is an evil plan, but I’m wondering if (and how) the site and core reflect the R community and the diversity of interests (and uses). I’m also wondering what is the process to express these questions beyond a blog post. Perhaps in the developers email list?

I think that, in summary, my R wish for 2013 is that ‘The R project’—whoever that is—recognizes that the project is much more than the core download. I wish the list of contributors goes beyond the fairly small number of people with writing access to the source. I’d include those who write packages, those who explain, those who market and, yes, those who sell R. Finally, I wish all readers of Quantum Forest a great 2013.

Entry point to the R world. Same as ever.

Entry point to the R world. Same as ever.

P.S. Just in case, no, I’m not suggesting to be included in any list.

My R year

End-of-year posts are corny but, what the heck, I think I can let myself delve in to corniness once a year. The following code gives a snapshot of what and how was R for me in 2012.

outside.packages.2012 <- list(used.the.most = c('asreml', 'ggplot2'),
                              largest.use.decline = c('MASS', 'lattice'),
                              same.use = c('MCMCglmm', 'lme4'),
                              would.like.use.more = 'JAGS')

skill.level <- list(improved = 'fewer loops (plyr and do.call())',
                    unimproved = c('variable.naming (Still an InConsistent mess)', 
                                   'versioning (still hit and miss)'))

interfaces <- list(most.used = c('RStudio', 'plain vanilla R', 'text editor (Textmate and VIM)'),
                   didnt.use.at.all = 'Emacs')

languages <- list(for.inquisition = c('R', 'Python', 'Javascript'),
                  revisiting = 'J',
                  discarded = 'Julia (note to self: revisit in a year)')

(R.2012 <- list(outside.packages.2012, 
                skill.level, 
                interfaces, 
                languages))

# [[1]]
# [[1]]$used.the.most
# [1] "asreml"  "ggplot2"

# [[1]]$largest.use.decline
# [1] "MASS"    "lattice"

# [[1]]$same.use
# [1] "MCMCglmm" "lme4"    

# [[1]]$would.like.use.more
# [1] "JAGS"


# [[2]]
# [[2]]$improved
# [1] "fewer loops (plyr and do.call())"

# [[2]]$unimproved
# [1] "variable.naming (Still an InConsistent mess)"
# [2] "versioning (still hit and miss)"             


# [[3]]
# [[3]]$most.used
# [1] "RStudio"                        "plain vanilla R"               
# [3] "text editor (Textmate and VIM)"

# [[3]]$didnt.use.at.all
# [1] "Emacs"


# [[4]]
# [[4]]$for.inquisition
# [1] "R"          "Python"     "Javascript"

# [[4]]$revisiting
# [1] "J"

# [[4]]$discarded
# [1] "Julia (note to self: revisit in a year)"

So one can query this over-the-top structure with code like R.2012[[3]]$didnt.use.at.all to learn [1] "Emacs", but you already new that, didn’t you?

Despite all my complaints, monologuing about other languages and overall frustration, R has served me well. It’s just that I’d be disappointed if I were still using it a lot in ten-years time.

Gratuitous picture: building blocks for research (Photo: Luis).

Gratuitous picture: building blocks for research (Photo: Luis).

Of course there was a lot more than R and stats this year. For example, the blogs I read most often have nothing to do with either topic: Isomorphismes (can’t define it), The music of sound (sound design), Offsetting behaviour (economics/politics in NZ). In fact, I need reading about a broad range of topics to feel human.

P.S. Incidentally, my favorite R function this year was subset(); I’ve been subsetting like there is no tomorrow. By the way, you are welcome to browse around the blog and subset whatever you like.

R for inquisition

A post on high-dimensional arrays by @isomorphisms reminded me of APL and, more generally, of matrix languages, which took me back to inquisitive computing: computing not in the sense of software engineering, or databases, or formats, but of learning by poking problems through a computer.

I like languages not because I can get a job by using one, but because I can think thoughts and express ideas through them. The way we think about a problem is somehow molded by the tools we use, and if we have loops, loops we use or if we have a terse matrix notation (see my previous post on Matrix Algebra Useful for Statistics), we may use that.

I used APL fairly briefly but I was impressed by some superficial aspects (hey, that’s a weird set of characters that needs a keyboard overlay) and some deeper ones (this is an actual language, cool PDF paper). The APL revolution didn’t happen, at least not directly, but it had an influence over several other languages (including R). Somehow as a group we took a different path from ‘Expository programming’, but I think that we have to recover at least part of that ethos, programming for understanding the world.

While many times I struggle with R frustrations, it is now my primary language for inquisitive computing, although some times I dive into something else. I like Mathematica, but can access it only while plugged to the university network (license limits). Python is turning into a great scientific computing environment—although still with a feeling of sellotape holding it together, J is like APL without the Klingon keyboard.

If anything, dealing with other ways of doing things leads to a better understanding of one’s primary language. Idioms that seem natural acquire a new sense of weirdness when compared to other languages. R’s basic functionality gives an excellent starting point for inquisitive computing but don’t forget other languages that can enrich the way we look at problems.

I am curious about what are people’s favorite inquisitive languages.

Gratuitous picture: inquisition, Why bloody trees grow like this? (Photo: Luis).

Gratuitous picture: inquisition, Why bloody trees grow like this? (Photo: Luis).

R pitfalls #4: redefining the basics

I try to be economical when writing code; for example, I tend to use single quotes over double quotes for characters because it saves me one keystroke. One area where I don’t do that is when typing TRUE and FALSE (R accepts T and F as well), just because it is clearer to see in code and syntax highlighting kicks in. That’s why I was surprised to read Jason Morgan’s post in that it is possible to redefine T and F and get undesirable behavior.

Playing around it is quite easy to redefine other fundamental constants in R. For example, I posted in Twitter:

> pi
[1] 3.141593
> pi <- 2
> pi*2
[1] 4

Ouch, dangerous! I tend to muck around with matrices quite a bit and, being a friend of parsimony, I often use capital letters to represent them. This would have eventually bitten me if I had used the abbreviated TRUE and FALSE. As Kevin Ushey replied to my tweet, one can redefine even basic functions like ‘+’ and be pure evil; over the top, sure, but possible.

Clown faces
Some times coding is scary (Photo: Luis).

Multisite, multivariate genetic analysis: simulation and analysis

The email wasn’t a challenge but a simple question: Is it possible to run a multivariate analysis in multiple sites? I was going to answer yes, of course, and leave it there but it would be a cruel, non-satisfying answer. We can get a better handle of the question if we use a simple example; let’s assume that we have two traits (call them tree stem diameter and stem density) assessed in two sites (localities).

Because this is genetics we have a family structure, let’s say half-siblings so we only half the mother in common, and we will ignore any experimental design features to keep things simple. We have 100 families, with 30 trees each, in sites A and B, for a total of 6,000 trees (100 x 30 x 2). The data could look like this:

site family tree diam dens
 A     1     1   20    398
 A     1     2   19    400
...
 A    100   30   24    375
...
 B     1    1    25    396
...
 B    100   30   23    403

We can also think of a trait measured in two sites as separate, but correlated, traits. For example, diameter in site 1 (diam1) is a different variable from diameter in site 2 (diam 2). So now we have four response variables (diam1, dens1, diam2, dens2), two of which have only missing values in a given site:

site family tree diam1 dens1 diam2 dens2
 A     1     1    20    398   NA    NA
 A     1     2    19    400   NA    NA
...
 A    100   30    24    375   NA    NA
...
 B      1    1    NA    NA    25    396
...
 B    100   30    NA    NA    23    403

All variables are correlated at the genetic level with an unstructured G covariance matrix, while at the residual level are only correlated within-site (same tree was assessed for both traits), but there is zero correlation between sites, because one could be assessed near Christchurch, while the other near Dunedin, for example.

###
### Data generation
###

# Simulate 100 half-sib families with 30 trees each
# for the four traits in two sites
n.fam <- 100
n.prog <- 30
n.trait <- 4
n.site <- 2

# Couple of variance matrices,
# G is for additive genetic effects
# R is for residuals
# All variances are assumed equal to 1 without loss
# of generality
G <- matrix(c(1, 0.7, 0.5, 0.3,
              0.7, 1, 0.2, 0.1,
              0.5, 0.2, 1, 0.8,
              0.3, 0.1, 0.8, 1), 4, 4)

R <- matrix(c(1, 0.3, 0, 0,
              0.3, 1, 0, 0,
              0, 0, 1, 0.2,
              0, 0, 0.2, 1), 4, 4)


# Between-family variances account for 1/4 of additive
# genetic effects. Within-family account for 3/4 of
# additive + residual.
# We also get the Cholesky decomposition in the same step
BFL <- chol(1/4 * G)
WFL <- chol(3/4 * G + R)

# Simulate random family effects for four traits
# Simulate random family effects for four traits
fam.eff <- t(BFL) %*% matrix(rnorm(n.fam*n.trait), 
                             n.trait, n.fam)
fam.eff <- t(fam.eff) # This is 100 x 4

tre.eff <- t(WFL) %*% matrix(rnorm(n.prog*n.fam*n.trait), 
                             n.trait, n.prog*n.fam)
tre.eff <- t(tre.eff) # This is 3000 x 4

# Expand family effects matrix (n.prog each) to match 
# dimension of tree effects
pheno <- fam.eff[rep(1:dim(fam.eff)[1], each = n.prog),] + tre.eff

# Now to 2 sites
pheno2s <- matrix(NA, n.prog*n.fam*n.site, n.trait)
colnames(pheno2s) <- c('diam1', 'dens1', 'diam2', 'dens2')

pheno2s[1:3000, 1:2] <- pheno[1:3000, 1:2]
pheno2s[3001:6000, 3:4] <- pheno[1:3000, 3:4]

# Creating data set. Family and tree are shamelessly recycled
sim.data <- data.frame(site = factor(rep(c('A', 'B'), each = n.fam*n.prog)),
                       family = factor(rep(1:n.fam, each = n.prog)),
                       tree = factor(1:30),
                       pheno2s)

Some neat things:

  • Data simulation relies on Cholesky decomposition, as explained in this post over a year ago (time flies!). Generation is a bit more complex this time because we have to account for two, rather than one, covariance structures.
  • Another point with data simulation is that one could generate one set of correlated values at the time by using something like t(BFL) %*% diag(rnorm(trait), 3) and loop it or use apply(). This would require much less memory but would also be much slower.
  • We need to repeat each line of the family effects matrix 30 times so we can add to the individual tree effects. Often we use indexing in matrices or data frames to extract a few rows. Instead here I’m using to repeat a given number of times each row by indexing with rep().

If we show observations 2990 to 3010 (last 10 for site A and first 10 for site B) we can see the pattern of observations below, which will have the required structure to have a US (or heterogeneous variance correlation matrix) for G and a block diagonal matrix for R. By the way, one could also have block diagonal matrices with repeated measurements, although probably with a different correlation structure:

> pheno2s[2990:3010,]
            diam1      dens1      diam2       dens2
 [1,]  0.57087250 -0.8521059         NA          NA
 [2,]  0.94859621  0.6599391         NA          NA
 [3,] -3.37405451 -0.6093312         NA          NA
 [4,]  0.93541048 -0.7977893         NA          NA
 [5,] -0.74758553  0.7962593         NA          NA
 [6,]  0.51280201  1.4870425         NA          NA
 [7,] -1.92571147 -0.2554365         NA          NA
 [8,] -1.15923045 -2.0656582         NA          NA
 [9,] -0.82145904 -0.3138340         NA          NA
[10,]  1.79631670  0.3814723         NA          NA
[11,] -0.01604778 -1.1804723         NA          NA
[12,]          NA         NA -0.1436143 -1.97628883
[13,]          NA         NA -2.7099687 -2.93832962
[14,]          NA         NA -2.5153420  0.73780760
[15,]          NA         NA  0.3084056  0.61696714
[16,]          NA         NA -0.2909500  0.78111864
[17,]          NA         NA -1.8629862 -2.19346309
[18,]          NA         NA  0.8673053 -0.07692884
[19,]          NA         NA -0.1459703  0.36981965
[20,]          NA         NA -0.7688851 -0.96765799
[21,]          NA         NA  0.6637173 -0.34924814
Gratuitous picture: driving in rural Japan, the cultural value of wood and breaking preconceptions (Photo: Luis).

Gratuitous picture: driving in rural Japan, the cultural value of wood and breaking preconceptions (Photo: Luis).

For the analysis we will use ASReml, which is the Pro option if you work in a breeding program and require solving much larger systems of equations than this tiny example (say 50-100 traits, large pedigrees, etc). Another option for playing with this small data set would be to use MCMCglmm, which also allows for multivariate evaluation of linear mixed models.

###
### Multivariate analysis
###

require(asreml)

# Assuming zero correlation between traits (equivalent
# to four univariate analyses)
m1 <- asreml(cbind(diam1, dens1, diam2, dens2) ~ trait, 
             random = ~ diag(trait):family, 
             rcov = ~ units:diag(trait),
             data = sim.data)

summary(m1)$varcomp

#                                 gamma component  std.error   z.ratio constraint
#trait:family!trait.diam1.var 0.2571274 0.2571274 0.04524049  5.683569   Positive
#trait:family!trait.dens1.var 0.2265041 0.2265041 0.04059742  5.579274   Positive
#trait:family!trait.diam2.var 0.2383959 0.2383959 0.04185982  5.695102   Positive
#trait:family!trait.dens2.var 0.2567999 0.2567999 0.04459246  5.758818   Positive
#R!variance                   1.0000000 1.0000000         NA        NA      Fixed
#R!trait.diam1.var            1.8290472 1.8290472 0.04803313 38.078866   Positive
#R!trait.dens1.var            1.7674960 1.7674960 0.04641672 38.078866   Positive
#R!trait.diam2.var            1.6779793 1.6779793 0.04406589 38.078866   Positive
#R!trait.dens2.var            1.7028171 1.7028171 0.04471817 38.078866   Positive

# The multivariate analysis allowing for correlation
# is a bit more complex, so we start by generating 
# a structure for starting values
m2.sv <- asreml(cbind(diam1, dens1, diam2, dens2) ~ trait, 
                random = ~ corgh(trait):family, 
                rcov = ~ units:us(trait),
                data = sim.data,
                start.values = TRUE)

# Now we'll constraint some R variance components
# to zero and fix them there. These are values for
# which we now they are 0
sv <- m2.sv$gammas.table
sv$Value1 <- 0
sv$Constraint1 <- 'F'

# Run the analyses with those constraints for the R matrix
# using the restricted values in R.param
m2 <- asreml(cbind(diam1, dens1, diam2, dens2) ~ trait, 
             random = ~ corgh(trait):family, 
             rcov = ~ units:us(trait, init = r.init),
             data = sim.data,
             R.param = sv)

summary(m2)$varcomp
#                                                 gamma   component  std.error
# trait:family!trait.dens1:!trait.diam1.cor 0.72535514 0.72535514 0.06381776
# trait:family!trait.diam2:!trait.diam1.cor 0.49100215 0.49100215 0.10067768
# trait:family!trait.diam2:!trait.dens1.cor 0.21445972 0.21445972 0.12085780
# trait:family!trait.dens2:!trait.diam1.cor 0.37476119 0.37476119 0.10975970
# trait:family!trait.dens2:!trait.dens1.cor 0.05221773 0.05221773 0.12439924
# trait:family!trait.dens2:!trait.diam2.cor 0.85356318 0.85356318 0.04321683
# trait:family!trait.diam1                  0.25712744 0.25712744 0.04524049
# trait:family!trait.dens1                  0.22650410 0.22650410 0.04059742
# trait:family!trait.diam2                  0.23839593 0.23839593 0.04185982
# trait:family!trait.dens2                  0.25679989 0.25679989 0.04459246
# R!variance                                1.00000000 1.00000000         NA
# R!trait.diam1:diam1                       1.82904721 1.82904721 0.04803313
# R!trait.dens1:diam1                       0.90450432 0.90450432 0.03737490
# R!trait.dens1:dens1                       1.76749598 1.76749598 0.04641672
# R!trait.diam2:diam1                       0.00000000 0.00000000         NA
# R!trait.diam2:dens1                       0.00000000 0.00000000         NA
# R!trait.diam2:diam2                       1.67797927 1.67797927 0.04406589
# R!trait.dens2:diam1                       0.00000000 0.00000000         NA
# R!trait.dens2:dens1                       0.00000000 0.00000000         NA
# R!trait.dens2:diam2                       0.71249532 0.71249532 0.03406354
# R!trait.dens2:dens2                       1.70281710 1.70281710 0.04471817
#
#                                              z.ratio    constraint
# trait:family!trait.dens1:!trait.diam1.cor 11.3660390 Unconstrained
# trait:family!trait.diam2:!trait.diam1.cor  4.8769710 Unconstrained
# trait:family!trait.diam2:!trait.dens1.cor  1.7744796 Unconstrained
# trait:family!trait.dens2:!trait.diam1.cor  3.4143789 Unconstrained
# trait:family!trait.dens2:!trait.dens1.cor  0.4197593 Unconstrained
# trait:family!trait.dens2:!trait.diam2.cor 19.7507102 Unconstrained
# trait:family!trait.diam1                   5.6835685      Positive
# trait:family!trait.dens1                   5.5792738      Positive
# trait:family!trait.diam2                   5.6951016      Positive
# trait:family!trait.dens2                   5.7588182      Positive
# R!variance                                        NA         Fixed
# R!trait.diam1:diam1                       38.0788655      Positive
# R!trait.dens1:diam1                       24.2008477      Positive
# R!trait.dens1:dens1                       38.0788655      Positive
# R!trait.diam2:diam1                               NA         Fixed
# R!trait.diam2:dens1                               NA         Fixed
# R!trait.diam2:diam2                       38.0788655      Positive
# R!trait.dens2:diam1                               NA         Fixed
# R!trait.dens2:dens1                               NA         Fixed
# R!trait.dens2:diam2                       20.9166565      Positive
# R!trait.dens2:dens2                       38.0788655      Positive

In model 1 each of the variances was supposed to be ~0.25 (1/4 * 1) and the residual variances ~1.75 (3/4*1 + 1). Once we move to model 2 we also get values similar to the correlations we were trying to simulate. And this is the end of the long answer.

P.S. If, for some bizarre reason, you would like to use SAS for this type of analyses, Fikret Isik has proc mixed code to run multivariate genetic analyses.

Scraping pages and downloading files using R

I have written a few posts discussing descriptive analyses of evaluation of National Standards for New Zealand primary schools.The data for roughly half of the schools was made available by the media, but the full version of the dataset is provided in a single-school basis. In the page for a given school there may be link to a PDF file with the information on standards sent by the school to the Ministry of Education.

I’d like to keep a copy of the PDF reports for all the schools for which I do not have performance information, so I decided to write an R script to download just over 1,000 PDF files. Once I can identify all the schools with missing information I just loop over the list, using the fact that all URL for the school pages start with the same prefix. I download the page, look for the name of the PDF file and then download the PDF file, which is named school_schoolnumber.pdf. And that’s it.

Of course life would be a lot simpler if the Ministry of Education made the information available in a usable form for analysis.

library(XML) # HTML processing
options(stringsAsFactors = FALSE)

# Base URL
base.url = 'http://www.educationcounts.govt.nz/find-a-school/school/national?school='
download.folder = '~/Downloads/schools/'

# Schools directory
directory <- read.csv('Directory-Schools-Current.csv')
directory <- subset(directory, 
                    !(school.type %in% c("Secondary (Year 9-15)", "Secondary (Year 11-15)")))

# Reading file obtained from stuff.co.nz obtained from here:
# http://schoolreport.stuff.co.nz/index.html
fairfax <- read.csv('SchoolReport_data_distributable.csv')
fairfax <- subset(fairfax, !is.na(reading.WB)) 

# Defining schools with missing information
to.get <- merge(directory, fairfax, by = 'school.id', all.x = TRUE)
to.get <- subset(to.get, is.na(reading.WB))

# Looping over schools, to find name of PDF file
# with information and download it

for(school in to.get$school.id){
  
  # Read HTML file, extract PDF link name
  cat('Processing school ', school, '\n')
  doc.html <- htmlParse(paste(base.url, school, sep = ''))
  doc.links <- xpathSApply(doc.html, "//a/@href")
  pdf.url <- as.character(doc.links[grep('pdf', doc.links)])
  if(length(pdf.url) > 0) {
    pdf.name <- paste(download.folder, 'school_', school, '.pdf', sep = '')
    download.file(pdf.url, pdf.name, method = 'auto', quiet = FALSE, mode = "w",
                  cacheOK = TRUE, extra = getOption("download.file.extra"))
  }
}

Can you help?

It would be great if you can help me to get the information from the reports. The following link randomly chooses a school, click on the “National Standards” tab and open the PDF file.

Then type the achievement numbers for reading, writing and mathematics in this Google Spreadsheet. No need to worry about different values per sex or ethnicity; the total values will do.

Gratuitous picture: a simple summer lunch (Photo: Luis).