Subsetting data

At School we use R across many courses, because students are supposed to use statistics under a variety of contexts. Imagine their disappointment when they pass stats and discovered that R and statistics haven’t gone away!

When students start working with real data sets one of their first stumbling blocks is subsetting data. We have data sets and either they are required to deal with different subsets or there is data cleaning to do. For some reason, many students struggle with what should be a simple task.

If one thinks of data as as a matrix/2-dimensional array, subsetting boils down to extracting the needed rows (cases) and columns (variables). In the R world one can do this in a variety of ways, ranging from the cryptic to the explicit and clear. For example, let’s assume that we have a dataset called alltrials with heights and stem diameters for a number of trees in different trials (We may have a bunch of additional covariates and experimental design features that we’ll ignore for the moment). How do we extract all trees located in Christchurch?

Two common approaches are:

mytrial = alltrials[alltrials$location == 'Christchurch', ]

mytrial = subset(alltrials, location == 'Christchurch')

While both expressions are equivalent, the former reads like Klingon to students, while the latter makes explicit that we are obtaining a subset of the original data set. This can easily be expanded to more complex conditions; for example to include all trees from Christchurch that are taller than 10 m:

mytrial = alltrials[alltrials$location == 'Christchurch' & alltrials$height > 10, ]

mytrial = subset(alltrials, location == 'Christchurch' & height > 10)

I think the complication with the Klingonian notation comes mostly from two sources:

  • Variable names for subsetting the data set are not directly accessible, so we have to prefix them with the NameOfDataset$, making the code more difficult to read, particularly if we join several conditions with & and |.
  • Hanging commas: if we are only working with rows or columns we have to acknowledge it by suffixing or prefixing with a comma, which are often forgotten.

Both points result on frustrating error messages like

  • Error in `[.data.frame`(alltrials, location == 'Christchurch', ) : object 'location' not found for the first point or
  • Error in `[.data.frame`(alltrials, alltrials$location == 'Christchurch') undefined columns selected for the second point.

The generic forms of these two notations are:

dataset[what to do with rows, what to do with columns]

subset(dataset, what to do with rows, what to do with columns)

We often want to keep a subset of the observed cases and keep (or drop) specific variables. For example, we want to keep trees in 'Christchurch' and we want to ignore diameter, because the assessor was 'high' that day:

# With this notation things get a bit trickier
# The easiest way is to provide the number of the variable
# Here diameter is the 5th variable in the dataset
mytrial = alltrials[all.trials$location == 'Christchurch' & all.trials$height > 10, -5]

# This notation is still straightforward
mytrial = subset(alltrials, location == 'Christchurch' & height > 10, select = -diameter)

There are, however, situations where Klingon is easier or more efficient. For example, to take a random sample of 100 trees from the full dataset:


mytrial = alltrials[sample(1:nrow(alltrials), 100, replace = FALSE),]

If you are interested in this issue Quick-R has a good description of subsetting. I'm sure this basic topic must has been covered many times, although I doubt anyone used Klingon in the process.

Gratuitous picture: building blocks for research (Photo: Luis).

Gratuitous picture: building blocks for R (Photo: Luis).

Learning to code in R

It used to be that the one of the first decisions to make when learning to program was between compiled (e.g. C or FORTRAN) and interpreted (e.g. Python) languages. In my opinion these days one would have to be a masochist to learn with a compiled language: the extra compilation time and obscure errors are a killer when learning.

Today the decision would be between using a generic interpreted language (e.g. Python) and an interpreted domain specific language (DSL) like R, MATLAB, etc. While some people prefer generic languages, I’d argue that immediate feedback and easy accomplishment of useful tasks are a great thing when one is learning something for the first time.

As an example, a while ago my son asked me what I was doing in the computer and I told him that I was programming some analyses in R. I showed that the program was spitting back some numbers and plots, a fact that he found totally unremarkable and uninteresting. I searched in internet and I found Scratch, a visual programming language, that let’s the user moves blocks representing code around and build interactive games: now my son was sold. Together we are learning about loops, control structures and variables, drawing characters, etc. We are programming because the problems are i- much more interesting for him and ii- achievable in a short time frame.

An example scratch script.

An example scratch script.

Learning to program for statistics, or other scientific domains for that matter, is not that different from being a kid and learning programming. Having to do too much to get even a mildly interesting result is frustrating and discouraging; it is not that the learner is dumb, but that he has to build too many functions to get a meager reward. This is why I’d say that you should use whatever language already has a large amount of functionality (‘batteries included’ in Python parlance) for your discipline. Choose rightly and you are half-way there.

‘But’ someone will say, R is not a real language. Sorry, but it is a real language (Turing complete and the whole shebang) with oddities, as any other language, granted. As with human languages, the more you study the easier it gets to learn a new language. In fact, the syntax for many basic constructs in R is highly similar to alternatives:

# This is R code
# Loop
for(i in 1:10){
    print(i)
}
#[1] 1
#[1] 2
#[1] 3
#[1] 4
#[1] 5
#[1] 6
#[1] 7
#[1] 8
#[1] 9
#[1] 10

# Control 
if(i == 10) {
    print('It is ten!')
}

#[1] "It is ten!"
# This is Python code
# Loop
for i in range(1,11): # Python starts indexing from zero
    print(i)

#1
#2
#3
#4
#5
#6
#7
#8
#9
#10

# Control
if i == 10:
    print("It is ten!")

#It is ten!

By the way, I formatted the code to highlight similarities. Of course there are plenty of differences between the languages, but many of the building blocks (the core ideas if you will) are shared. You learn one and the next one gets easier.

How does one start learning to program? If I look back at 1985 (yes, last millennium) I was struggling to learn how to program in BASIC when, suddenly, I had an epiphany. The sky opened and I heard a choir singing György Ligeti’s Atmosphères (from 2001: a Space Odyssey, you know) and then I realized: we are only breaking problems in little pieces and dealing with them one at the time. I’ve been breaking problems into little pieces since then. What else did you expect? Anyhow, if you feel that you want to learn how to code in R, or whatever language is the best option for your problems, start small. Just a few lines will do. Read other people’s code but, again, only small pieces that are supposed to do something small. At this stage is easy to get discouraged because everything is new and takes a lot of time. Don’t worry: everyone has to go through this process.

Many people struggle vectorizing the code so it runs faster. Again, don’t worry at the beginning if your code is slow. Keep on writing and learning. Read Norman Noam Ross’s FasteR! HigheR! StongeR! — A Guide to Speeding Up R Code for Busy People. The guide is very neat and useful, although you don’t need super fast code, not yet at least. You need working, readable and reusable code. You may not even need code: actually, try to understand the problem, the statistical theory, before you get coding like crazy. Programming will help you understand your problem a lot better, but you need a helpful starting point.

Don’t get distracted with the politics of research and repeatability and trendy things like git (noticed that I didn’t even link to them?). You’ll learn them in time, once you got a clue about how to program.

P.S. The code used in the examples could be shortened and sped up dramatically (e.g. 1:10 or range(1, 11)) but it is not the point of this post.
P.S.2. A while ago I wrote R is a language, which could be useful connected to this post.

Matrix Algebra Useful for Statistics

I was having a conversation with an acquaintance about courses that were particularly useful in our work. My forestry degree involved completing 50 compulsory + 10 elective courses; if I had to choose courses that were influential and/or really useful they would be Operations Research, Economic Evaluation of Projects, Ecology, 3 Calculus and 2 Algebras. Subsequently my PhD was almost entirely research based but I sort of did Matrix Algebra: Dorian lent me his copy of Searle’s Matrix Algebra Useful for Statistics and passed me a pile of assignments that Shayle Searle used to give in his course in Cornell. I completed the assignments on my own pace and then sat a crazy take-home exam for 24 hours.

Later that year I bought a cloth-bound 1982 version of the book, not the alien vomit purple paperback reprint currently on sale, which I consult from time to time. Why would one care about matrix algebra? Besides being a perfectly respectable intellectual endeavor on itself, maybe you can see that the degrees of freedom are the rank of a quadratic form; you can learn from this book what a quadratic form and a matrix rank are. Or you want to see more clearly the relationship between regression and ANOVA, because in matrix form a linear model is a linear model is a linear model. The commands outer, inner and kronecker product make a lot more sense once you know what an outer product and an inner product of vectors are. Thus, if you really want to understand a matrix language for data analysis and statistics (like R), it seems reasonable to try to understand the building blocks for such a language.

The book does not deal with any applications to statistics until chapter 13. Before that it is all about laying foundations to understand the applications, but do not expect nice graphs and cute photos. This is a very good text where one as to use the brain to imagine what’s going on in the equations and demonstrations. The exercises rely a lot on ‘prove this’ and ‘prove that’, which lead to much frustration and, after persevering, to many ‘aha! moments’.

XKCD 1050: In your face! Actually I feel the opposite concerning math.

I am the first to accept that I have a biased opinion about this book, because it has sentimental value. It represents difficult times, dealing with a new language, culture and, on top of that, animal breeding. At the same time, it opened doors to a whole world of ideas. This is much more than I can say of most books.

PS 2012-12-17: I have commented on a few more books in these posts.

A good part of my electives were in humanities (history & literature), which was unusual for forestry. I just couldn’t conceive going through a university without doing humanities.

More sense of random effects

I can’t exactly remember how I arrived to Making sense of random effects, a good post in the Distributed Ecology blog (go over there and read it). Incidentally, my working theory is that I follow Scott Chamberlain (@recology_), who follows Karthik Ram ‏(@_inundata) who mentioned Edmund Hart’s (@DistribEcology) post. I liked the discussion, but I thought one could add to the explanation to make it a bit clearer.

The idea is that there are 9 individuals, assessed five times each—once under each of five different levels for a treatment—so we need to include individual as a random effect; after all, it is our experimental unit. The code to generate the data, plot it and fit the model is available in the post, but I redid data generation to make it a bit more R-ish and, dare I say, a tad more elegant:

require(lme4)
require(ggplot2)

# Basic dataset
idf <- data.frame(ind = factor(rep(1:9, 5)),
                  levs = factor(rep(c('i1', 'i2', 'i3', 'i4', 'i5'), each = 9)),
                  f.means = rep(c(6, 16, 2, 10, 13), each = 9),
                  i.eff = rep(seq(-4, 4, length = 9), 5))
                  
# Function for individual values used in apply
ran.ind <- function(f.means, i.eff) rnorm(1, f.means + i.eff, 0.3)

# Rich showed me this version of apply that takes more than one argument
# https://smtp.biostat.wustl.edu/sympa/biostat/arc/s-news/2005-06/msg00022.html
# so we skip loops
idf$size <- apply(idf[, 3:4], 1, 
                  function(x) do.call('ran.ind', as.list(x)))

ggplot(idf, aes(x = levs, y = size, group = ind, colour = ind)) +
       geom_point() + geom_path()

# Fit linear mixed model (avoid an overall mean with -1)
m3 <- lmer(size ~ levs - 1 + (1 | ind), data = idf)
summary(m3)

# Skipping a few things
#   AIC   BIC logLik deviance REMLdev
# 93.84 106.5 -39.92    72.16   79.84
#Random effects:
# Groups   Name        Variance Std.Dev.
# ind      (Intercept) 7.14676  2.67334 
# Residual             0.10123  0.31816 
#Number of obs: 45, groups: ind, 9

# Show fixed effects
fixef(m3)

#   levsi1    levsi2    levsi3    levsi4    levsi5 
# 5.824753 15.896714  2.029902  9.969462 12.870952 

Original simulated data, showing the effect of treatment (fixed effect) and each individual.

What we can do to better understand what’s going on is ‘adjust’ the score observations by the estimated fixed effects and plot those values to see what we are modeling with the random effects:

# Substract fixed effects
idf$adjusted <- idf$size - rep(fixef(m3), each = 9)
ggplot(idf, aes(x = levs, y = adjusted, group = ind, colour = ind)) + 
       geom_point() + geom_path()

# Display random effects
ranef(m3)

#$ind
#  (Intercept)
#1 -3.89632810
#2 -2.83054394
#3 -1.99715524
#4 -1.12733342
#5  0.06619981
#6  0.95605162
#7  2.00483963
#8  2.99947727
#9  3.82479236

Data ‘adjusted’ by fixed effects. The random intercepts would be lines going through the average of points for each individual.

The random effects for individual or, better, the individual-level intercepts are pretty much the lines going through the middle of the points for each individual. Furthermore, the variance for ind is the variance of the random intercepts around the’adjusted’ values, which can be seen comparing the variance of random effects above (~7.15) with the result below (~7.13).

var(unlist(ranef(m3)))
#[1] 7.12707

Distributed Ecology then goes on to randomize randomize the individuals within treatment, which means that the average deviation around the adjusted means is pretty close to zero, making that variance component close to zero. I hope this explanation complements Edmund Hart’s nice post.

P.S. If you happen to be in the Southern part of South America next week, I’ll be here and we can have a chat (and a beer, of course).

Publication incentives

(This post continues discussing issues I described back in January in Academic publication boycott)

Some weeks ago I received a couple of emails the same day: one asking me to submit a paper to an open access journal, while the other one was inviting me to be the editor of an ‘special issue’ of my choice for another journal. I haven’t heard before about any of the two publications, which follow pretty much the same model: submit a paper for $600 and—if they like it—it will be published. However, the special issue email had this ‘buy your way in’ feeling: find ten contributors (i.e. $6,000) and you get to be an editor. Now, there is nothing wrong per-se with open access journals, some of my favorite ones (e.g. PLoS ONE) follow that model. However, I was surprised by the increasing number of new journals that look at filling the gap for ‘I need to publish soon, somewhere’. Surprised until one remembers the incentives at play in academic environments.

If I, or most academics for that matter, want to apply for academic promotion I have to show that I’m a good guy that has a ‘teaching philosophy’ and that my work is good enough to get published in journals; hopefully in lots of them. The first part is a pain, but most people can write something along the lines ‘I’m passionate about teaching and enjoy creating a challenging environment for students…’ without puking. The second part is trickier because one has to really have the papers in actual journals.

Personally, I would be happier with only having the odd ‘formal’ publication. The first time (OK, few times) I saw my name in a properly typeset paper was very exciting, but it gets old after a while. These days, however, I would prefer to just upload my work to a website, saying here you have some ideas and code, play with it. If you like it great, if not well, next time I hope it’ll be better. Nevertheless, this doesn’t count as proper publication, because it isn’t peer reviewed, independently of the number of comments the post may get. PLoS ONE counts, but it’s still a journal and I (and many other researchers) work in many things that are too small for a paper, but cool enough to share. The problem: there is little or no credit for sharing so Quantum Forest is mostly a ‘labor of love’, which counts bugger all for anything else.

These days as a researcher I often learn more from other people’s blogs and quick idea exchanges (for example through Twitter) than via formal publication. I enjoy sharing analysis, ideas and code in this blog. So what’s the point of so many papers in so many journals? I guess that many times we are just ‘ticking the box’ for promotions purposes. In addition, the idea of facing referees’ or editors’ comments like ‘it would be a good idea that you cite the following papers…’ puts me off. And what about authorship arrangements? We have moved from papers with 2-3 authors to enough authors to have a football team (with reserves and everything). Some research groups also run arrangements where ‘I scratch your back (include you as a coauthor) and you scratch mine (include me in your papers)’. We break ideas into little pieces that count for many papers, etc.

Another related issue is the cost of publication (and the barriers it imposes on readership). You see, we referee papers for journals for free (as in for zero money) and tell ourselves that we are doing a professional service to uphold the high standards of whatever research branch we belong to. Then we spend a fortune from our library budget to subscribe to the same journals for which we reviewed the papers (for free, remember?). It is not a great deal, as many reasonable people have pointed out; I added a few comments in academic publication boycott.

So, what do we need? We need promotion committees to reduce the weight on publication. We need to move away from impact factor. We can and need to communicate in other ways: scientific papers will not go away, but their importance should be reduced.

Some times the way forward is unclear. Incense doesn’t hurt (Photo: Luis).

Making an effort to prepare interesting lectures doesn’t hurt either.
These days it is fairly common editors ‘suggesting’ to include additional references in our manuscripts, which just happen to be to papers in the same journal, hoping to inflate the impact factor of the journal. Referees tend to suggest their own papers (some times useful, many times not). Lame, isn’t it?

PS. 2012-10-19 15:27 NZST. You also have to remember that not because something was published it is actually correct: outrageously funny example (via Arthur Charpentier). Yep, through Twitter.

Overlay of design matrices in genetic analysis

I’ve ignored my quantitative geneticist side of things for a while (at least in this blog) so this time I’ll cover some code I was exchanging with a couple of colleagues who work for other organizations.

It is common to use diallel mating designs in plant and tree breeding, where a small number of parents acts as both males and females. For example, with 5 parents we can have 25 crosses, including reciprocals and selfing (crossing an individual with itself). Decades ago this mating design was tricky to fit and, considering an experimental layout with randomized complete blocks, one would have something like y = mu + blocks + dads + mums + cross + error. In this model dads and mums were estimating a fraction of the additive genetic variance. With the advent of animal model BLUP, was possible to fit something like y = mu + blocks + individual (using a pedigree) + cross + error. Another less computationally demanding alternative (at least with unrelated parents) is to fit a parental model, overlaying the design matrices for parents with something like this y = mu + blocks + (dad + mum) + cross + error.

The following code simulate data for a a diallel trial with three replicates and runs a parental model with ASReml. Later I expand the analysis building the matrices by hand.

# Defining diallel cross
n.blocks <- 3
diallel <- matrix(0, nrow = 5, ncol = 5)
males <- 1:dim(diallel)[1]
females <- 1:dim(diallel)[2]
cross <- factor(outer(males, females, paste, sep ='x'))
cross
#[1] 1x1 2x1 3x1 4x1 5x1 1x2 2x2 3x2 4x2 5x2 1x3 2x3 3x3 4x3 5x3 1x4 2x4
#[18] 3x4 4x4 5x4 1x5 2x5 3x5 4x5 5x5
#25 Levels: 1x1 1x2 1x3 1x4 1x5 2x1 2x2 2x3 2x4 2x5 3x1 3x2 3x3 ... 5x5


#### Generating random values for the trial
# Variance components
sig2.a <- 40    # additive genetic variance
sig2.sca <- 10  # specific combining ability (SCA)
sig2.b <- 10    # blocks
sig2.e <- 30    # residuals

# Numbers of everything
n.parents <- length(males)
n.crosses <- length(cross)
n.trees <- n.crosses*n.blocks

# Random values for everything
u.g <- rnorm(n.parents)*sqrt(sig2.a)
u.sca <- rnorm(n.crosses)*sqrt(sig2.sca)
u.block <- rnorm(n.blocks)*sqrt(sig2.b)
u.e <- rnorm(n.trees)*sqrt(sig2.e)

# Whole trial
trial <- data.frame(block = factor(rep(1:n.blocks, each = n.crosses)),
                    mum = factor(rep(females, n.blocks)),
                    dad = factor(rep(males, each = length(females))),
                    mate = factor(rep(cross, n.blocks)))

trial$yield <- 0.5*(u.g[trial$mum] + u.g[trial$dad]) +
               u.sca[trial$mate] + u.block[trial$block] + u.e

head(trial)   
#block mum dad mate       yield
#1     1   1   1  1x1 -0.02185486
#2     1   2   1  2x1 10.79760712
#3     1   3   1  3x1 16.45186037
#4     1   4   1  4x1  8.15026291
#5     1   5   1  5x1  5.57707180
#6     1   1   2  1x2 -7.30675148

# Fitting the model with ASReml
library(asreml)
m1 <- asreml(yield ~ 1, 
             random = ~ block + dad + and(mum) + mate, 
             data = trial)
summary(m1)
#                       gamma    component    std.error   z.ratio
#block!block.var 1.299110e-02 3.861892e-01 1.588423e+00 0.2431274
#dad!dad.var     2.101417e-01 6.246930e+00 5.120745e+00 1.2199259
#mate!mate.var   4.589938e-07 1.364461e-05 2.340032e-06 5.8309519
#R!variance      1.000000e+00 2.972722e+01 5.098177e+00 5.8309519

# Obtaining the predicted breeding values for the parents
coef(m1, pattern = 'dad')
#         effect
#dad_1  1.780683
#dad_2 -2.121174
#dad_3  3.151991
#dad_4 -1.473620
#dad_5 -1.337879

How is the matrix overlay working? We can replicate the calculations used by ASReml by building the matrices from scratch and reusing the variance components, so we avoid the nastiness of writing code for residual maximum likelihood. Once I build the basic matrices I use the code from my How does a linear mixed model look like? post.

# Building incidence matrices for males and females
Zmales <- model.matrix(~ dad - 1, data = trial)
Zfemales <- model.matrix(~ mum - 1, data = trial)

# and() in ASReml overlays (i.e. sums) the matrices
# (Notice that this creates 0s, 1 and 2s in the design matrix)
Zparents <- Zmales + Zfemales
Zparents[1:6,]
#   dad1 dad2 dad3 dad4 dad5
#1     2    0    0    0    0
#2     1    1    0    0    0
#3     1    0    1    0    0
#4     1    0    0    1    0
#5     1    0    0    0    1
#6     1    1    0    0    0

# Design matrices from other factors
Zblocks <- model.matrix(~ block - 1, data = trial)
Zmates <- model.matrix(~ mate - 1, data = trial)

# Creating y, X and big Z matrices to solve mixed model equations
y <- trial$yield
X <- matrix(1, nrow = 75, ncol = 1)
Z <- cbind(Zblocks, Zparents, Zmates)

# Building covariance matrices for random effects
# Using the variance components from the ASReml model
G = diag(c(rep(3.861892e-01, 3), 
           rep(6.246930e+00, 5), 
           rep(1.364461e-05, 25)))
R = diag(2.972722e+01, 75, 75)
Rinv = solve(R)

# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z

XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X

Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y

# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
          cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)

blup = solve(C, RHS)
blup
# These results are identical to the ones
# produced by ASReml
#dad1     1.780683e+00
#dad2    -2.121174e+00
#dad3     3.151991e+00
#dad4    -1.473620e+00
#dad5    -1.337879e+00

The overlay matrix Zparents is double the actual value we should use:

Zparents2 <- 0.5*(Zmales + Zfemales)
Zparents2[1:6,]
#  dad1 dad2 dad3 dad4 dad5
#1  1.0  0.0  0.0  0.0  0.0
#2  0.5  0.5  0.0  0.0  0.0
#3  0.5  0.0  0.5  0.0  0.0
#4  0.5  0.0  0.0  0.5  0.0
#5  0.5  0.0  0.0  0.0  0.5
#6  0.5  0.5  0.0  0.0  0.0

Repeating the analyses ‘by hand’ using Zparents2 to build the Z matrix results in the same overall mean and block effects, but the predicted breeding values for parents when using Zparents are 0.7 of the predicted breeding values for parents when using Zparents2. I may need to refit the model and obtain new variance components for parents when working with the correct overlaid matrix.

Gratuitous picture: walking in Quail Island (Photo: Luis).

A word of caution: the sample may have an effect

This week I’ve tried to i-stay mostly in the descriptive statistics realm and ii-surround any simple(istic) models with caveats and pointing that they are very preliminary. We are working with a sample of ~1,000 schools that did reply to Fairfax’s request, while there is a number of schools that either ignored the request or told Fairfax to go and F themselves. Why am I saying this? If one goes and gets a simple table of the number of schools by type and decile there is something quite interesting: we have different percentages for different types of schools represented in the sample and the possibility of bias on the reporting to Fairfax, due to potential low performance (references to datasets correspond to the ones I used in this post):

summary(standards$school.type)
#         Composite (Year 1-10)          Composite (Year 1-15)        Contributing (Year 1-6) 
#                             1                             29                            403 
#       Full Primary (Year 1-8)    Intermediate (year 7 and 8) Restricted Composite (Yr 7-10) 
#                           458                             62                              1 
#         Secondary (Year 7-15) 
#                            56 

Now let’s compare this number with the school directory:

summary(factor(directory$school.type))
#         Composite (Year 1-10)          Composite (Year 1-15)        Contributing (Year 1-6) 
#                             4                            149                            775 
#         Correspondence School        Full Primary (Year 1-8)    Intermediate (year 7 and 8) 
#                             1                           1101                            122 
#Restricted Composite (Yr 7-10)         Secondary (Year 11-15)          Secondary (Year 7-10) 
#                             4                              2                              2 
#         Secondary (Year 7-15)          Secondary (Year 9-15)                 Special School 
#                           100                            238                             39 
#              Teen Parent Unit 
#                            20 

As a proportion we are missing more secondary schools. We can use the following code to get an idea of how similar are school types, because the small number of different composite schools is a pain. If

# Performance of Contributing (Year 1-6) and
# Full Primary (Year 1-8) looks pretty much the
# same. Composites could be safely merged
qplot(school.type, reading.OK, 
      data = standards, geom = 'jitter')

qplot(school.type, writing.OK, 
      data = standards, geom = 'jitter')

qplot(school.type, math.OK, 
      data = standards, geom = 'jitter')

# Merging school types and plotting them colored
# by decile
standards$school.type.4 <- standards$school.type
levels(standards$school.type.4) <- c('Composite', 'Composite', 'Primary',
                                     'Primary', 'Intermediate',
                                     'Composite', 'Secondary')

qplot(school.type.4, reading.OK, colour = decile,
      data = standards, geom = 'jitter')

Representation of different schools types and deciles is uneven.

Different participations in the sample for school types. This type is performance in mathematics.


I’m using jittering rather than box and whisker plots to i- depict all the schools and ii- get an idea of the different participation of school types in the dataset. Sigh. Another caveat to add in the discussion.

P.S. 2012-09-27 16:15. Originally I mentioned in this post the lack of secondary schools (Year 9-15) but, well, they are not supposed to be here, because National Standards apply to years 1 to 8 (Thanks to Michael MacAskill for pointing out my error.)

Some regressions on school data

Eric and I have been exchanging emails about potential analyses for the school data and he published a first draft model in Offsetting Behaviour. I have kept on doing mostly data exploration while we get a definitive full dataset, and looking at some of the pictures I thought we could present a model with fewer predictors.

The starting point is the standards dataset I created in the previous post:

# Make authority a factor
standards$authority <- factor(standards$authority)

# Plot relationship between school size and number of FTTE
# There seems to be a separate trend for secondary schools
qplot(total.roll, teachers.ftte,  
      data = standards, color = school.type)

There seems to be a different trend for secondary vs non-secondary schools concerning the relationship between number of full time teacher equivalent and total roll. The presence of a small number of large schools suggests that log transforming the variables could be a good idea.

# Create a factor for secondary schools versus the rest
standards$secondary <- factor(ifelse(standards$school.type == 'Secondary (Year 7-15)', 
                                'Yes', 'No'))

# Plot the relationship between number of students per FTTE
# and type of school
 qplot(secondary, total.roll/teachers.ftte, 
      data = standards, geom = 'boxplot',
      ylab = 'Number of students per FTTE')

Difference on the number of students per FTTE between secondary and non-secondary schools.

Now we fit a model where we are trying to predict reading standards achievement per school accounting for decile, authority , proportion of non-european students, secondary schools versus the rest, and a different effect of number of students per FTTE
for secondary and non-secondary schools.

standards$students.per.ftte <- with(standards, total.roll/teachers.ftte)

# Decile2 is just a numeric version of decile (rather than a factor)
m1e <- lm(reading.OK ~ decile2 + I(decile2^2) +  
          authority +  I(1 - prop.euro) + secondary*students.per.ftte, 
          data = standards, weights = total.roll)
summary(m1e)

#Coefficients:
#                                         Estimate Std. Error t value Pr(>|t|)    
#(Intercept)                             0.6164089  0.0305197  20.197  < 2e-16 ***
#decile2                                 0.0422733  0.0060756   6.958 6.24e-12 ***
#I(decile2^2)                           -0.0015441  0.0004532  -3.407 0.000682 ***
#authorityState: Not integrated         -0.0440261  0.0089038  -4.945 8.94e-07 ***
#I(1 - prop.euro)                       -0.0834869  0.0181453  -4.601 4.74e-06 ***
#secondaryYes                           -0.2847035  0.0587674  -4.845 1.47e-06 ***
#students.per.ftte                       0.0023512  0.0011706   2.009 0.044854 *  
#secondaryYes:students.per.ftte          0.0167085  0.0040847   4.091 4.65e-05 ***
---

#Residual standard error: 1.535 on 999 degrees of freedom
#  (3 observations deleted due to missingness)
#Multiple R-squared: 0.4942,	Adjusted R-squared: 0.4906 
#F-statistic: 139.4 on 7 and 999 DF,  p-value: < 2.2e-16 

The residuals are still a bit of a mess:

Residuals for this linear model: still a bit of a mess.

If we remember my previous post decile accounted for 45% of variation and we explain 4% more through the additional predictors. Non-integrated schools have lower performance, a higher proportion of non-European students reduce performance, secondary schools have lower performance and larger classes tend to perform better (Eric suggests reverse causality, I’m agnostic at this stage), although the rate of improvement changes between secondary and non-secondary schools. In contrast with Eric, I didn’t fit separate ethnicities as those predictors are related to each other and constrained to add up to one.

Of course this model is very preliminary, and a quick look at the coefficients will show that changes on any predictors besides decile will move the response by a very small amount (despite the tiny p-values and numerous stars next to them). The distribution of residuals is still heavy-tailed and there are plenty of questions about data quality; I’ll quote Eric here:

But differences in performance among schools of the same decile by definition have to be about something other than decile. I can’t tell from this data whether it’s differences in stat-juking, differences in unobserved characteristics of entering students, differences in school pedagogy, or something else. But there’s something here that bears explaining.

Careless comparison bites back (again)

When running stats labs I like to allocate a slightly different subset of data to each student, which acts as an incentive for people to do their own work (rather than copying the same results from a fellow student). We also need to be able to replicate the results when marking, so we need a record of exactly which observations were dropped to create a particular data set. I have done this in a variety of ways, but this time I opted for code that looked like:

setwd('~/Dropbox/teaching/stat202-2012')

biom <- read.csv('biom2012.csv', header = TRUE)
drops <- read.csv('lab4-dels.csv', header = TRUE)

# Use here your OWN student code
my.drop <- subset(drops, student.code == 'mjl159')
my.data <- subset(biom, !(id %in% my.drop))

Thus, we were reading a full data set and assigning it to biom, reading a table that contained student codes in the first column and 5 columns with observations to be dropped (assigned to drops) and choosing one row of drops depending on the student code (assigned to my.drop). As an example, for student ‘mjl159′ my.drop looked like:

my.drop
#   student.code X1 X2 X3 X4 X5
#40       mjl159 21 18 30 15 43

The problem with the code is the comparison !(id %in% my.drop), as it includes the factor student.code, so when R makes the comparison checking if a record is in my.drop it converts the text, e.g. ‘mjl159’, to the number of level, e.g. 41, which makes the code to delete ONE MORE observation (in this #41) on top of the ones the student was allocated. This happens only for some students, where the number of level is not included in the list of observations to drop.

This is another version of R pitfall #3: friggin’ factors. A simple workaround is to change the comparison to !(id %in% my.drop[2:6]). I should know better than this.

Gratuitous image: Tree spread on metal frame to provide shade in a plaza, Lisbon, Portugal. Some days I would love to have a coffee there without computer, just watching the world pass by. (Photo: Luis).

Split-plot 2: let’s throw in some spatial effects

Disappeared for a while collecting frequent flyer points. In the process I ‘discovered’ that I live in the middle of nowhere, as it took me 36 hours to reach my conference destination (Estoril, Portugal) through Christchurch, Sydney, Bangkok, Dubai, Madrid and Lisbon.

Where was I? Showing how split-plots look like under the bonnet (hood for you US readers). Yates presented a nice diagram of his oats data set in the paper, so we have the spatial location of each data point which permits us playing with within-trial spatial trends.

Rather than mucking around with typing coordinates we can rely on Kevin Wright’s version of the oats dataset contained in the agridat package. Kevin is a man of mistery, a James Bond of statisticians—so he keeps a low profile—with a keen interest in experimental design and analyses. This chap has put a really nice collection of data sets WITH suggested coding for the analyses, including nlme, lme4, asreml, MCMCglmm and a few other bits and pieces. Recommended!

Plants (triffids excepted) do not move, which means that environmental trends within a trial (things like fertility, water availability, temperature, etc) can affect experimental units in a way that varies with space and which induces correlation of the residuals. Kind of we could be violating the independence assumption if you haven’t got the hint yet.

Gratuitous picture: Detail of Mosteiro dos Jerónimos, Belém, Lisboa, Portugal (Photo: Luis).

There are a few ways to model environmental trends (AR processes, simple polynomials, splines, etc) that can be accounted for either through the G matrix (as random effects) or the R matrix. See previous post for explanation of the bits and pieces. We will use here a very popular approach, which is to consider two separable (so we can estimate the bloody things) autoregressive processes, one for rows and one for columns, to model spatial association. In addition, we will have a spatial residual. In summary, the residuals have moved from \(\boldsymbol{R} = \sigma^2_e \boldsymbol{I} \) to \(\boldsymbol{R} = \sigma^2_s \boldsymbol{R}_{col} \otimes \boldsymbol{R}_{row}\). I previously showed the general form of this autoregressive matrices in this post, and you can see the \(\boldsymbol{R}_{col} \) matrix below. In some cases we can also add an independent residual (the so-called nugget) to the residual matrix.

We will first fit a split-plot model considering spatial residuals using asreml because, let’s face it, there is no other package that will give you the flexibility:

require(asreml)
require(agridat)

# Having a look at the structure of yates.oats
# and changing things slightly so it matches 
# previous post
str(yates.oats)

names(yates.oats) = c('row', 'col', 'Y', 'N', 'V', 'B')

yates.oats$row = factor(yates.oats$row)
yates.oats$col = factor(yates.oats$col)
yates.oats$N = factor(yates.oats$N)
yates.oats$MP = yates.oats$V

# Base model (this was used in the previous post)
m2 = asreml(Y ~ V*N, random = ~ B/MP, data = yates.oats)
summary(m2)$varcomp
#                gamma component std.error  z.ratio constraint
# B!B.var    1.2111647  214.4771 168.83404 1.270343   Positive
# B:MP!B.var 0.5989373  106.0618  67.87553 1.562593   Positive
# R!variance 1.0000000  177.0833  37.33244 4.743416   Positive

# Spatial model
m3 = asreml(Y ~ V*N, random = ~ B/MP, 
            rcov = ~ ar1(col):ar1(row),
            data = yates.oats)
summary(m3)$varcomp
#                 gamma    component   std.error   z.ratio    constraint
# B!B.var    0.80338277 169.24347389 156.8662436 1.0789031      Positive
# B:MP!B.var 0.49218005 103.68440202  73.6390759 1.4080079      Positive
# R!variance 1.00000000 210.66355939  67.4051020 3.1253355      Positive
# R!col.cor  0.04484166   0.04484166   0.2006562 0.2234751 Unconstrained
# R!row.cor  0.49412567   0.49412567   0.1420397 3.4787860 Unconstrained

coef(m3)$fixed
coef(m3)$random
# effect
# V_GoldenRain:N_0     0.0000000
# V_GoldenRain:N_0.2   0.0000000
# V_GoldenRain:N_0.4   0.0000000
# V_GoldenRain:N_0.6   0.0000000
# V_Marvellous:N_0     0.0000000
# V_Marvellous:N_0.2  -0.8691155
# V_Marvellous:N_0.4 -12.4223873
# V_Marvellous:N_0.6  -5.5018907
# V_Victory:N_0        0.0000000
# V_Victory:N_0.2     -1.9580360
# V_Victory:N_0.4      2.1913469
# V_Victory:N_0.6      0.3728648
# N_0                  0.0000000
# N_0.2               23.3299154
# N_0.4               40.0570745
# N_0.6               47.1749577
# V_GoldenRain         0.0000000
# V_Marvellous         9.2845952
# V_Victory           -5.7259866
# (Intercept)         76.5774292

# effect
# B_B1               21.4952875
# B_B2                1.0944484
# B_B3               -5.4336461
# B_B4               -4.4334455
# B_B5               -6.6925874
# B_B6               -6.0300569
# B_B1:MP_GoldenRain  1.3036724
# B_B1:MP_Marvellous -0.9082462
# B_B1:MP_Victory    12.7754283
# B_B2:MP_GoldenRain  1.3286187
# B_B2:MP_Marvellous  7.4181674
# B_B2:MP_Victory    -8.0761824
# B_B3:MP_GoldenRain -6.5288220
# B_B3:MP_Marvellous 10.4361799
# B_B3:MP_Victory    -7.2367277
# B_B4:MP_GoldenRain  6.6868810
# B_B4:MP_Marvellous -9.2317585
# B_B4:MP_Victory    -0.1716372
# B_B5:MP_GoldenRain  2.4635492
# B_B5:MP_Marvellous -9.7086196
# B_B5:MP_Victory     3.1443067
# B_B6:MP_GoldenRain -5.2538993
# B_B6:MP_Marvellous  1.9942770
# B_B6:MP_Victory    -0.4351876

# In a larger experiment we could try fitting a nugget using units
m4 = asreml(Y ~ V*N, random = ~ B/MP + units, 
            rcov = ~ ar1(col):ar1(row),
            data = yates.oats)
summary(m4)

# However in this small experiments the system
# goes crazy and results meaningless
#                       gamma   component  std.error  z.ratio constraint
# B!B.var         0.006759124   223.70262   185.3816 1.206714   Positive
# B:MP!B.var      0.001339017    44.31663    29.2004 1.517672   Positive
# units!units.var 0.003150356   104.26542    34.2738 3.042132   Positive
# R!variance      1.000000000 33096.39128 19480.8333 1.698921   Positive
# R!col.cor       0.999000000     0.99900         NA       NA      Fixed
# R!row.cor       0.999000000     0.99900         NA       NA      Fixed

So we have to build an autoregressive correlation matrix for rows, one for columns and multiply the whole thing for a spatial variance. Then we can add an independent residual (the nugget, if we want—and can estimate—one). Peter Dalgaard has neat code for building the autocorrelation matrix. And going back to the code in the previous post:

ar.matrix = function(ar, dim) {
  M = diag(dim)
  M = ar^abs(row(M)-col(M))
  return(M)
}

# Variance components (from m3)
varB = 169.243
varMP = 103.684
varR = 210.664
arcol = 0.045
arrow = 0.494

# Basic vector and matrices: y, X, Z, G & R
y = matrix(yates.oats$Y, nrow = dim(yates.oats)[1], ncol = 1)
X = model.matrix(~ V*N, data = yates.oats)
Z = model.matrix(~ B/MP - 1, data = yates.oats, 
                 contrasts.arg = list(B = contrasts(yates.oats$B, contrasts = F), 
                                      MP = contrasts(yates.oats$MP, contrasts = F)))

G = diag(c(rep(varB, 6), rep(varMP, 18)))

# Only change from previous post is building the R matrix
Rcol = ar.matrix(arcol, 4)
Rrow = ar.matrix(arrow, 18)
Rcol
# Having a look at the structure
#            [,1]     [,2]     [,3]       [,4]
# [1,] 1.0000e+00 0.045000 0.002025 9.1125e-05
# [2,] 4.5000e-02 1.000000 0.045000 2.0250e-03
# [3,] 2.0250e-03 0.045000 1.000000 4.5000e-02
# [4,] 9.1125e-05 0.002025 0.045000 1.0000e+00

R = varR * kronecker(Rcol, Rrow)
Rinv = solve(R)

# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z

XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X

Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y

# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
          cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)

blup = solve(C, RHS)
blup
#                         [,1]
# (Intercept)       76.5778238
# VMarvellous        9.2853002
# VVictory          -5.7262894
# N0.2              23.3283060
# N0.4              40.0555464
# N0.6              47.1740348
# VMarvellous:N0.2  -0.8682597
# VVictory:N0.2     -1.9568979
# VMarvellous:N0.4 -12.4200362
# VVictory:N0.4      2.1912083
# VMarvellous:N0.6  -5.5017225
# VVictory:N0.6      0.3732453
# BB1               21.4974445
# BB2                1.0949433
# BB3               -5.4344098
# BB4               -4.4333080
# BB5               -6.6948783
# BB6               -6.0297918
# BB1:MPGoldenRain   1.3047656
# BB2:MPGoldenRain   1.3294043
# BB3:MPGoldenRain  -6.5286993
# BB4:MPGoldenRain   6.6855568
# BB5:MPGoldenRain   2.4624436
# BB6:MPGoldenRain  -5.2534710
# BB1:MPMarvellous  -0.9096022
# BB2:MPMarvellous   7.4170634
# BB3:MPMarvellous  10.4349240
# BB4:MPMarvellous  -9.2293528
# BB5:MPMarvellous  -9.7080694
# BB6:MPMarvellous   1.9950370
# BB1:MPVictory     12.7749000
# BB2:MPVictory     -8.0756682
# BB3:MPVictory     -7.2355284
# BB4:MPVictory     -0.1721988
# BB5:MPVictory      3.1441163
# BB6:MPVictory     -0.4356209

Which are the same results one gets from ASReml-R. QED.

P.S. Many thanks to Kevin Wright for answering my questions about agridat.