Categories
policy r stats teaching

A word of caution: the sample may have an effect

This week I’ve tried to (i) stay mostly in the descriptive statistics realm and (ii) surround any simple(istic) models with caveats, pointing out that they are very preliminary. We are working with a sample of ~1,000 schools that did reply to Fairfax’s request, while a number of schools either ignored the request or told Fairfax to go and F themselves. Why am I saying this? If one gets a simple table of the number of schools by type and decile there is something quite interesting: different types of schools are represented in different proportions in the sample, and there is the possibility of bias in the reporting to Fairfax due to potentially low performance (references to datasets correspond to the ones I used in this post):

summary(standards$school.type)
#         Composite (Year 1-10)          Composite (Year 1-15)        Contributing (Year 1-6)
#                             1                             29                            403
#       Full Primary (Year 1-8)    Intermediate (year 7 and 8) Restricted Composite (Yr 7-10)
#                           458                             62                              1
#         Secondary (Year 7-15)
#                            56

Now let’s compare these numbers with the school directory:

summary(factor(directory$school.type))
#         Composite (Year 1-10)          Composite (Year 1-15)        Contributing (Year 1-6)
#                             4                            149                            775
#         Correspondence School        Full Primary (Year 1-8)    Intermediate (year 7 and 8)
#                             1                           1101                            122
#Restricted Composite (Yr 7-10)         Secondary (Year 11-15)          Secondary (Year 7-10)
#                             4                              2                              2
#         Secondary (Year 7-15)          Secondary (Year 9-15)                 Special School
#                           100                            238                             39
#              Teen Parent Unit
#                            20

As a proportion, we are missing more secondary schools. We can use the following code to get an idea of how similar the school types are, because the small number of schools in the different composite categories is a pain:

# Performance of Contributing (Year 1-6) and
# Full Primary (Year 1-8) looks pretty much the
# same. Composites could be safely merged
qplot(school.type, reading.OK,
data = standards, geom = 'jitter')

qplot(school.type, writing.OK,
data = standards, geom = 'jitter')

qplot(school.type, math.OK,
data = standards, geom = 'jitter')

# Merging school types and plotting them colored
# by decile
standards$school.type.4 <- standards$school.type
levels(standards$school.type.4) <- c('Composite', 'Composite', 'Primary',
'Primary', 'Intermediate',
'Composite', 'Secondary')

qplot(school.type.4, reading.OK, colour = decile,
data = standards, geom = 'jitter')
Representation of different school types and deciles is uneven.

Participation in the sample differs between school types. This time the plot shows performance in mathematics.

I'm using jittering rather than box and whisker plots to (i) depict all the schools and (ii) get an idea of the different participation of school types in the dataset. Sigh. Another caveat to add to the discussion.
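For comparison, here is the box-and-whisker version of the mathematics plot; it summarizes each school type nicely but completely hides how many schools of each type actually replied:

# Same data as a box-and-whisker plot: tidier, but the number of schools
# behind each box is invisible
qplot(school.type, math.OK,
      data = standards, geom = 'boxplot')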

P.S. 2012-09-27 16:15. Originally I mentioned in this post the lack of secondary schools (Year 9-15) but, well, they are not supposed to be here, because National Standards apply to years 1 to 8. (Thanks to Michael MacAskill for pointing out my error.)

Categories
policy r rblogs stats

New Zealand school performance: beyond the headlines

I like the idea of having data on school performance, not to directly rank schools—hard, to say the least, at this stage—but because we can start having a look at the factors influencing test results. I imagine the opportunity in the not so distant future to run hierarchical models combining Ministry of Education data with Census/Statistics New Zealand data.

At the same time, there is the temptation to come up with very simple analyses that would make appealing newspaper headlines. I’ll read the data and create a headline and then I’ll move to something that, personally, seems more important. In my previous post I combined the national standards for around 1,000 schools with decile information to create the standards.csv file.

library(ggplot2)

# Reading data, building factors
standards <- read.csv('standards.csv')
standards$decile.2008 <- factor(standards$decile.2008)
standards$electorate <- factor(standards$electorate)
standards$school.type <- factor(standards$school.type)

# Removing special schools (all 8 of them) because their
# performance is not representative of the overall school
# system
standards <- subset(standards,
as.character(school.type) != 'Special School')
standards$school.type <- standards$school.type[drop = TRUE]

# Create performance variables. Proportion of the students that
# at least meet the standards (i.e. meet + above)
# Only using all students, because subsets are not reported
# by many schools
standards$math.OK <- with(standards, math.At + math.A)
standards$reading.OK <- with(standards, reading.At + reading.A)
standards$writing.OK <- with(standards, writing.At + writing.A)

Up to this point we have read the data, removed special schools and created variables that represent the proportion of students that at least meet the standards. Now I'll do something very misleading: calculate the average reading.OK for each school decile and plot it, showing the direct relationship between socio-economic decile and school performance.

mislead <- aggregate(reading.OK ~ decile.2008,
data = standards, mean)
qplot(decile.2008, reading.OK , data = mislead)
Scatterplot of average proportion of students at least meeting the reading national standards for each socioeconomic decile.

Using only the average strengthens the relationship, but it hides something extremely important: the variability within each decile. The following code will display box and whisker plots for each decile: the horizontal line is the median, the box contains the middle 50% of the schools, and the vertical lines extend up to 1.5 times the interquartile range (pretty much the whole range in most cases):

qplot(decile.2008, reading.OK,
data = standards, geom = 'boxplot',
xlab = 'Decile in 2008',
ylab = 'Reading at or above National Standard')
Box and whisker plot for proportion of students at least meeting the reading national standard for each decile. Notice the variability for each decile.

A quick look at the previous graph shows a few interesting things:

  • the lower the decile, the more variable the school performance;
  • there is pretty much no difference between deciles 6, 7, 8 and 9, with only a minor increase for decile 10; and
  • there is a trend of decreasing performance for lower deciles; however, there is also huge variability within those deciles.

We can repeat the same process for writing.OK and math.OK with similar trends, although the level of standard achievement is lower than for reading:

qplot(decile.2008, writing.OK,
data = standards, geom = 'boxplot',
xlab = 'Decile in 2008',
ylab = 'Writing at or above National Standard')

qplot(decile.2008, math.OK,
data = standards, geom = 'boxplot',
xlab = 'Decile in 2008',
ylab = 'Mathematics at or above National Standard')
Box and whisker plot for proportion of students at least meeting the writing national standard for each decile.
Box and whisker plot for proportion of students at least meeting the mathematics national standard for each decile.

Achievement in different areas (reading, writing, mathematics) is highly correlated:

cor(standards[, c('reading.OK', 'writing.OK', 'math.OK')],
use = 'pairwise.complete.obs')

#           reading.OK writing.OK   math.OK
#reading.OK  1.0000000  0.7886292 0.7749094
#writing.OK  0.7886292  1.0000000 0.7522446
#math.OK     0.7749094  0.7522446 1.0000000

# We can plot an example and highlight decile groups
# 1-5 vs 6-10
qplot(reading.OK, math.OK,
data = standards, color = ifelse(as.numeric(decile.2008) > 5, 1, 0)) +
opts(legend.position = 'none')
Scatterplot for proportion meeting mathematics and reading national standards. Dark points are deciles 1 to 5, while light points are deciles 6 to 10. Notice the large overlap for performance between the two groups.

All these graphs are only descriptive/exploratory; however, once we have more data we could move to hierarchical models and start adjusting performance by socioeconomic aspects, geographic location, type of school, urban or rural setting, school size, ethnicity, etc. Potentially, this would let us target resources at schools that could be struggling to perform; nevertheless, any attempt at creating a 'quick & dirty' ranking that ignores the previously mentioned covariates would be, at the very least, misleading.
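Only as an illustration of the kind of model I have in mind (a sketch rather than a serious analysis, using just the variables already present in standards), one could start with a random intercept for electorate:

library(nlme)

# Very preliminary sketch: reading performance as a function of decile,
# with a random intercept per electorate to absorb some geographic variation
m.sketch <- lme(reading.OK ~ decile.2008,
                random = ~ 1 | electorate,
                data = standards, na.action = na.omit)
summary(m.sketch)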

Note 2012-09-26: I have updated data and added a few plots in this post.

Categories
flotsam photos rblogs stats

Mid-August flotsam

Reached mid-semester point, with quite a few new lectures to prepare. Nothing extremely complicated but, as always, the tricky part is finding a way to make it meaningful and memorable. Sometimes, and this is one of those times, I sound like a broken record but I’m a bit obsessive about helping people to ‘get’ a topic.

Gratuitous picture: Lola, Lisbon, Portugal (Photo: Luis).
Categories
asreml linear models nlme r rblogs stats teaching

Split-plot 1: What does a linear mixed model look like?

I like statistics and I struggle with statistics. Oftentimes I get frustrated when I don’t understand, and I really struggled to make sense of Kruschke’s Bayesian analysis of a split-plot, particularly because it ‘didn’t look like’ a split-plot to me.

Additionally, I have written a few posts discussing linear mixed models, using several different packages to fit them. At no point have I shown the calculations behind the scenes. So, I decided to combine my frustration and an explanation to myself in a couple of posts. This is number one and the follow-up is Split-plot 2: let’s throw in some spatial effects.

Example of forestry split-plot: one of my colleagues has a trial in which stocking (number of trees per ha) is the main plot and fertilization is the subplot (higher stockings look darker because trees are closer together).

What do linear mixed models look like?

Linear mixed models, models that combine so-called fixed and random effects, are often represented using matrix notation as:

\( \boldsymbol{y} = \boldsymbol{X b} + \boldsymbol{Z a} + \boldsymbol{e}\)

where \( \boldsymbol{y}\) represents the response variable vector, \( \boldsymbol{X} \mbox{ and } \boldsymbol{Z}\) are incidence matrices relating the response to the fixed effects (\( \boldsymbol{b}\), e.g. overall mean, treatment effects, etc.) and random effects (\( \boldsymbol{a}\), e.g. family effects, additive genetic effects and a host of experimental design effects like blocks, rows, columns, plots, etc.), and, last, there are the random residuals (\( \boldsymbol{e}\)).

The previous model equation still requires some assumptions about the distribution of the random effects. A very common set of assumptions is to say that the residuals are iid (independently and identically distributed) normal (so \( \boldsymbol{R} = \sigma^2_e \boldsymbol{I} \)) and that the random effects are independent of each other, so \( \boldsymbol{G} = \boldsymbol{B} \bigoplus \boldsymbol{M}\) is a direct sum of the variance of blocks (\( \boldsymbol{B} = \sigma^2_B \boldsymbol{I}\)) and main plots (\( \boldsymbol{M} = \sigma^2_M \boldsymbol{I}\)).
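Just to make the direct sum idea concrete before we get to the real data, here is a toy version of \( \boldsymbol{G} \) in R, with made-up variances and dimensions:

# Toy version of G as a direct sum (block-diagonal) matrix: invented variances,
# two blocks and the six main plots nested within them (2 blocks x 3 varieties)
sigma2.B = 120
sigma2.M = 100
G.toy = diag(c(rep(sigma2.B, 2), rep(sigma2.M, 6)))
G.toy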

An interesting feature of this matrix formulation is that we can express all sorts of models by choosing different assumptions for our covariance matrices (using different covariance structures). Do you have longitudinal data (units assessed repeatedly over time)? Is there spatial correlation? Account for this in the \( \boldsymbol{R}\) matrix. Are random effects correlated (e.g. maternal and additive genetic effects in genetics)? Account for this in the \( \boldsymbol{G}\) matrix. Multivariate response? Deal with unstructured \( \boldsymbol{R}\) and \( \boldsymbol{G}\), or model the correlation structure using different constraints (and then you’ll need asreml).

By the way, the history of linear mixed models is strongly related to applications of quantitative genetics for the prediction of breeding values, particularly in dairy cattle. Charles Henderson developed what is now called Henderson’s Mixed Model Equations to simultaneously estimate fixed effects and predict random genetic effects:

\(
\left[
\begin{array}{cc}
\boldsymbol{X}' \boldsymbol{R}^{-1} \boldsymbol{X} & \boldsymbol{X}' \boldsymbol{R}^{-1} \boldsymbol{Z} \\
\boldsymbol{Z}' \boldsymbol{R}^{-1} \boldsymbol{X} & \boldsymbol{Z}' \boldsymbol{R}^{-1} \boldsymbol{Z} + \boldsymbol{G}^{-1}
\end{array}
\right]
\left[
\begin{array}{c}
\boldsymbol{b} \\
\boldsymbol{a}
\end{array}
\right] =
\left[
\begin{array}{c}
\boldsymbol{X}' \boldsymbol{R}^{-1} \boldsymbol{y} \\
\boldsymbol{Z}' \boldsymbol{R}^{-1} \boldsymbol{y}
\end{array}
\right]
\)

The big matrix on the left-hand side of this equation is often called the \( \boldsymbol{C} \) matrix. You could be thinking ‘What does this all mean?’ It is easier to see what is going on with a small example, but rather than starting with, say, a complete block design, we’ll go for a split-plot to start tackling my annoyance with the aforementioned blog post.

Old school: physical split-plots

Given that I’m an unsophisticated forester and that I’d like to use data available to anyone, I’ll rely on an agricultural example (so plots are actually physical plots in the field) that goes back to Frank Yates. There are two factors (oats variety, with three levels, and fertilization, with four levels). Yates, F. (1935) Complex experiments, Journal of the Royal Statistical Society Suppl. 2, 181–247 (behind pay wall here).

Layout of the oats experiment in Yates’s paper, from a time when articles were meaty. Each of the 6 replicates is divided into 3 main plots for oats varieties (v1, v2 and v3), while each variety plot is divided into four subplots with different levels of fertilization (manure—animal crap—n1 to n4). Cells display yield.

Now it is time to roll up our sleeves and use some code, getting the data and fitting the same model using nlme (m1) and asreml (m2), just for the fun of it. Anyway, nlme and asreml produce exactly the same results.

We will use the oats data set that comes with MASS, although there is also an Oats data set in nlme and another version in the asreml package. (By the way, you can see a very good explanation by Bill Venables of a ‘traditional’ ANOVA analysis for a split-plot here):

require(MASS) # we get the oats data from here
require(nlme) # for lme function
require(asreml) # for asreml function. Its version of the dataset uses different
# variable names, which may require renaming them to use the code below

# Get the oats data set and check structure
data(oats)
head(oats)
str(oats)

# Create a main plot effect for clarity's sake
oats$MP = oats$V

# Split-plot using NLME
m1 = lme(Y ~ V*N, random = ~ 1|B/MP, data = oats)
summary(m1)
fixef(m1)
ranef(m1)

# Split-plot using ASReml
m2 = asreml(Y ~ V*N, random = ~ B/MP, data = oats)
summary(m2)$varcomp
coef(m2)$fixed
coef(m2)$random

fixef(m1)
#        (Intercept)         VMarvellous            VVictory             N0.2cwt
#         80.0000000           6.6666667          -8.5000000          18.5000000
#            N0.4cwt             N0.6cwt VMarvellous:N0.2cwt    VVictory:N0.2cwt
#         34.6666667          44.8333333           3.3333333          -0.3333333
#VMarvellous:N0.4cwt    VVictory:N0.4cwt VMarvellous:N0.6cwt    VVictory:N0.6cwt
#         -4.1666667           4.6666667          -4.6666667           2.1666667

ranef(m1)
# Level: B
# (Intercept)
# I     25.421511
# II     2.656987
# III   -6.529883
# IV    -4.706019
# V    -10.582914
# VI    -6.259681
#
# Level: MP %in% B
# (Intercept)
# I/Golden.rain      2.348296
# I/Marvellous      -3.854348
# I/Victory         14.077467
# II/Golden.rain     4.298706
# II/Marvellous      6.209473
# II/Victory        -9.194250
# III/Golden.rain   -7.915950
# III/Marvellous    10.750776
# III/Victory       -6.063976
# IV/Golden.rain     5.789462
# IV/Marvellous     -7.115566
# IV/Victory        -1.001111
# V/Golden.rain      1.116768
# V/Marvellous      -9.848096
# V/Victory          3.497878
# VI/Golden.rain    -5.637282
# VI/Marvellous      3.857761
# VI/Victory        -1.316009

Now we can move on to implement the Mixed Model Equations, where probably the only gotcha is the definition of the \( \boldsymbol{Z}\) matrix (the incidence matrix for random effects), as both nlme and asreml use ‘number of levels of the factor’ for both the main and interaction effects, which involves using the contrasts.arg argument in model.matrix().

# Variance components
varB = 214.477
varMP = 106.062
varR = 177.083

# Basic vector and matrices: y, X, Z, G & R
y = matrix(oats$Y, nrow = dim(oats)[1], ncol = 1)
X = model.matrix(~ V*N, data = oats)
Z = model.matrix(~ B/MP - 1, data = oats,
                 contrasts.arg = list(B = contrasts(oats$B, contrasts = F),
                                      MP = contrasts(oats$MP, contrasts = F)))

G = diag(c(rep(varB, 6), rep(varMP, 18)))
R = diag(varR, 72, 72)
Rinv = solve(R)

# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z

XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X

Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y

# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)

blup = solve(C, RHS)
blup

# [,1]
# (Intercept)          80.0000000
# VMarvellous           6.6666667
# VVictory             -8.5000000
# N0.2cwt              18.5000000
# N0.4cwt              34.6666667
# N0.6cwt              44.8333333
# VMarvellous:N0.2cwt   3.3333333
# VVictory:N0.2cwt     -0.3333333
# VMarvellous:N0.4cwt  -4.1666667
# VVictory:N0.4cwt      4.6666667
# VMarvellous:N0.6cwt  -4.6666667
# VVictory:N0.6cwt      2.1666667
# BI                   25.4215578
# BII                   2.6569919
# BIII                 -6.5298953
# BIV                  -4.7060280
# BV                  -10.5829337
# BVI                  -6.2596927
# BI:MPGolden.rain      2.3482656
# BII:MPGolden.rain     4.2987082
# BIII:MPGolden.rain   -7.9159514
# BIV:MPGolden.rain     5.7894753
# BV:MPGolden.rain      1.1167834
# BVI:MPGolden.rain    -5.6372811
# BI:MPMarvellous      -3.8543865
# BII:MPMarvellous      6.2094778
# BIII:MPMarvellous    10.7507978
# BIV:MPMarvellous     -7.1155687
# BV:MPMarvellous      -9.8480945
# BVI:MPMarvellous      3.8577740
# BI:MPVictory         14.0774514
# BII:MPVictory        -9.1942649
# BIII:MPVictory       -6.0639747
# BIV:MPVictory        -1.0011059
# BV:MPVictory          3.4978963
# BVI:MPVictory        -1.3160021

Not surprisingly, we get the same results, except that we start assuming the variance components from the previous analyses, so we can avoid implementing the code for restricted maximum likelihood estimation as well. By the way, given that \( \boldsymbol{R}^{-1} \) is in all terms, it can be factored out from the MME, leaving terms like \( \boldsymbol{X}' \boldsymbol{X} \), i.e. without \( \boldsymbol{R}^{-1}\), making for simpler calculations. In fact, if you drop the \( \boldsymbol{R}^{-1} \) it is easier to see what is going on in the different components of the \( \boldsymbol{C} \) matrix. For example, print \( \boldsymbol{X}' \boldsymbol{X} \) and you’ll get the number of observations for the overall mean and for each of the levels of the fixed effect factors. Give it a try with the other submatrices too!

XpXnoR = t(X) %*% X
XpXnoR
#                    (Intercept) VMarvellous VVictory N0.2cwt N0.4cwt N0.6cwt
#(Intercept)                  72          24       24      18      18      18
#VMarvellous                  24          24        0       6       6       6
#VVictory                     24           0       24       6       6       6
#N0.2cwt                      18           6        6      18       0       0
#N0.4cwt                      18           6        6       0      18       0
#N0.6cwt                      18           6        6       0       0      18
#VMarvellous:N0.2cwt           6           6        0       6       0       0
#VVictory:N0.2cwt              6           0        6       6       0       0
#VMarvellous:N0.4cwt           6           6        0       0       6       0
#VVictory:N0.4cwt              6           0        6       0       6       0
#VMarvellous:N0.6cwt           6           6        0       0       0       6
#VVictory:N0.6cwt              6           0        6       0       0       6
#                    VMarvellous:N0.2cwt VVictory:N0.2cwt VMarvellous:N0.4cwt
#(Intercept)                           6                6                   6
#VMarvellous                           6                0                   6
#VVictory                              0                6                   0
#N0.2cwt                               6                6                   0
#N0.4cwt                               0                0                   6
#N0.6cwt                               0                0                   0
#VMarvellous:N0.2cwt                   6                0                   0
#VVictory:N0.2cwt                      0                6                   0
#VMarvellous:N0.4cwt                   0                0                   6
#VVictory:N0.4cwt                      0                0                   0
#VMarvellous:N0.6cwt                   0                0                   0
#VVictory:N0.6cwt                      0                0                   0
#                    VVictory:N0.4cwt VMarvellous:N0.6cwt VVictory:N0.6cwt
#(Intercept)                        6                   6                6
#VMarvellous                        0                   6                0
#VVictory                           6                   0                6
#N0.2cwt                            0                   0                0
#N0.4cwt                            6                   0                0
#N0.6cwt                            0                   6                6
#VMarvellous:N0.2cwt                0                   0                0
#VVictory:N0.2cwt                   0                   0                0
#VMarvellous:N0.4cwt                0                   0                0
#VVictory:N0.4cwt                   6                   0                0
#VMarvellous:N0.6cwt                0                   6                0
#VVictory:N0.6cwt                   0                   0                6
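For example, the diagonal of \( \boldsymbol{Z}' \boldsymbol{Z} \) (again without \( \boldsymbol{R}^{-1} \)) contains the number of observations per block and per main plot:

# Z'Z without R^{-1}: the diagonal shows 12 observations per block
# (3 varieties x 4 nitrogen levels) and 4 per block:variety main plot
ZpZnoR = t(Z) %*% Z
diag(ZpZnoR)

# X'Z without R^{-1} relates the fixed effect levels to blocks and main plots
XpZnoR = t(X) %*% Z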

I will leave it here and come back to this problem as soon as I can.

† Incidentally, a lot of the theoretical development was supported by Shayle Searle (a Kiwi statistician and Henderson’s colleague at Cornell University).

Categories
bayesian julia r rblogs simulation stats

R, Julia and genome wide selection

— “You are a pussy” emailed my friend.
— “Sensu cat?” I replied.
— “No. Sensu chicken” blurbed my now ex-friend.

What was this about? He read my post on R, Julia and the shiny new thing, which prompted him to assume that I was the proverbial old dog unwilling (or was it unable?) to learn new tricks. (Incidentally, with friends like this who needs enemies? Hi, Gus.)

Having a look at different—statistical—horses (Photo: Luis, click to enlarge).

I decided to tackle a small—but hopefully useful—piece of code: fitting/training a Genome Wide Selection model, using the Bayes A approach put forward by Meuwissen, Hayes and Goddard in 2001. In that approach the breeding values of the individuals (the response) are expressed as a function of a very large number of random predictors (2,000 molecular markers in this case). The dataset (csv file) is a simulation of 2,000 bi-allelic markers (aa = 0, Aa = 1, AA = 2) for 250 individuals, followed by the phenotypes (column 2001) and breeding values (column 2002). These models are frequently fitted using MCMC.

In 2010 I attended this course in Ames, Iowa, where Rohan Fernando passed us the following R code (pretty much a transliteration of C code; notice the trailing semicolons, for example). P.S. 2012-04-26: Please note that this is teaching code, not production code:

nmarkers = 2000;    # number of markers
startMarker = 1981; # set to 1 to use all
numiter  = 2000;    # number of iterations
vara     = 1.0/20.0;

# input data
data     = matrix(scan("trainData.out0"),ncol=nmarkers+2,byrow=TRUE);
nrecords = dim(data)[1];

beg = Sys.time()

# x has the mean followed by the markers
x = cbind(1,data[,startMarker:nmarkers]);
y = data[,nmarkers+1];
a =  data[,nmarkers+2];
# initial values

nmarkers = nmarkers - startMarker + 1;
mean2pq = 0.5;                          # just an approximation
scalea  = 0.5*vara/(nmarkers*mean2pq);  # 0.5 = (v-2)/v for v=4

size = dim(x)[2];
b = array(0.0,size);
meanb = b;
b[1] = mean(y);
var  = array(0.0,size);

# adjust y
ycorr = y - x%*%b;

# MCMC sampling
for (iter in 1:numiter) {
  # sample vare
  vare = (t(ycorr) %*% ycorr)/rchisq(1, nrecords + 3);

  # sample intercept
  ycorr = ycorr + x[,1]*b[1];
  rhs = sum(ycorr)/vare;
  invLhs = 1.0/(nrecords/vare);
  mean = rhs*invLhs;
  b[1] = rnorm(1, mean, sqrt(invLhs));
  ycorr = ycorr - x[,1]*b[1];
  meanb[1] = meanb[1] + b[1];

  # sample variance for each locus
  for (locus in 2:size){
    var[locus] = (scalea*4 + b[locus]*b[locus])/rchisq(1, 4.0 + 1)
  }

  # sample effect for each locus
  for (locus in 2:size){
    # unadjust y for this locus
    ycorr = ycorr + x[,locus]*b[locus];
    rhs = t(x[,locus]) %*% ycorr/vare;
    lhs = t(x[,locus]) %*% x[,locus]/vare + 1.0/var[locus];
    invLhs = 1.0/lhs;
    mean = invLhs*rhs;
    b[locus] = rnorm(1, mean, sqrt(invLhs));
    # adjust y for the new value of this locus
    ycorr = ycorr - x[,locus]*b[locus];
    meanb[locus] = meanb[locus] + b[locus];
  }
}

Sys.time() - beg

meanb = meanb/numiter;
aHat  = x %*% meanb;

Thus, we just need to define a few variables, read the data (marker genotypes, breeding values and phenotypic data) into a matrix, create loops, do matrix and vector multiplication, and generate random numbers (using Gaussian and Chi-squared distributions). Not much if you think about it, but I didn’t have enough time to explore Julia’s features so as to go for something more complex.

nmarkers = 2000    # Number of markers
startmarker = 1981 # Set to 1 to use all
numiter = 2000     # Number of iterations

data = dlmread("markers.csv", ',')
(nrecords, ncols) = size(data)

tic()

#this is the mean and markers matrix
X = hcat(ones(Float64, nrecords), data[:, startmarker:nmarkers])
y = data[:, nmarkers + 1]
a = data[:, nmarkers + 2]

nmarkers = nmarkers - startmarker + 1
vara = 1.0/nmarkers
mean2pq = 0.5

scalea  = 0.5*vara/(nmarkers*mean2pq) # 0.5 = (v-2)/v for v=4

ndesign = size(X, 2)
b = zeros(Float64, ndesign)
meanb = zeros(Float64, ndesign)
b[1] = mean(y)
varian  = zeros(Float64, ndesign)

# adjust y
ycorr = y - X * b

# MCMC sampling
for i = 1:numiter
    # sample vare
    vare = dot(ycorr, ycorr)/randchi2(nrecords + 3)

    # sample intercept
    ycorr = ycorr + X[:, 1] * b[1];
    rhs = sum(ycorr)/vare;
    invlhs = 1.0/(nrecords/vare);
    mn = rhs*invlhs;
    b[1] = randn() * sqrt(invlhs) + mn;
    ycorr = ycorr - X[:, 1] * b[1];
    meanb[1] = meanb[1] + b[1];

    # sample variance for each locus
    for locus = 2:ndesign
        varian[locus] = (scalea*4 + b[locus]*b[locus])/randchi2(4.0 + 1);
    end

    # sample effect for each locus
    for locus = 2:ndesign
        # unadjust y for this locus
        ycorr = ycorr + X[:, locus] * b[locus];
        rhs = dot(X[:, locus], ycorr)/vare;
        lhs = dot(X[:, locus], X[:, locus])/vare + 1.0/varian[locus];
        invlhs = 1.0/lhs;
        mn = invlhs * rhs;
        b[locus] = randn() * sqrt(invlhs) + mn;
        # adjust y for the new value of this locus
        ycorr = ycorr - X[:, locus] * b[locus];
        meanb[locus] = meanb[locus] + b[locus];
    end
end

toc()

meanb = meanb/numiter;
aHat  = X * meanb;

The code looks remarkably similar and there are four main sources of differences:

  1. The first, trivial, one is that the original code read a binary dataset and I didn’t know how to do that in Julia, so I read a csv file instead (this is also why I start timing after reading the file).
  2. The second trivial one is avoiding name conflicts between variables and functions; for example, in R the user is allowed to have a variable called var that will not interfere with the variance function. Julia is picky about that, so I needed to rename some variables.
  3. Julia passes variables by reference, while R behaves as if it passed them by value when assigning matrices and vectors, which tripped me up because the original R code had something like: b = array(0.0,size); meanb = b;. This works fine in R, but in Julia changes to the b vector also changed meanb (see the short sketch after this list).
  4. The definition of scalar vs array created some problems in Julia. For example y' * y (t(y) %*% y in R) is numerically equivalent to dot(y, y). However, the first version returns an array, while the second one returns a scalar. I got an error message when trying to store the array-valued ‘scalar’ into an array. I find that confusing.
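A quick way of seeing the R side of point 3 (vectors behave as if they were copied on assignment, so meanb does not follow b):

# R behaves as if the vector were copied on assignment (copy-on-modify),
# so meanb does not track later changes to b
b = array(0.0, 3)
meanb = b
b[1] = 10
b      # 10  0  0
meanb  #  0  0  0 (unchanged in R; in the Julia version meanb[1] would also be 10)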

One interesting point in this comparison is that we are using rough code, not really optimized for speed; in fact, the only thing I can say about the Julia code is that ‘it runs’ and it probably is not very idiomatic. Testing runs with different numbers of markers, R needs roughly 2.8x the time used by Julia. The Julia website claims better results in benchmarks, but in real life we work with, well, real problems.

In 1996–97 I switched from SAS to ASReml for genetic analyses because it was 1–2 orders of magnitude faster and opened up a world of new models. Today a change from R to Julia would deliver (in this particular case) a much more modest speed-up (~3x), which is OK but not worth changing languages (yet). That, together with the embryonic graphical capabilities and the still-to-develop ecosystem of packages, means that I’m still using R. Nevertheless, the Julia team has achieved very impressive performance in very little time, so it is worth keeping an eye on their progress.

P.S.1 Readers are welcome to suggest ways of improving the code.
P.S.2 WordPress does not let me upload the binary version of the simulated data.
P.S.3 Hey WordPress guys; it would be handy if the sourcecode plugin supported Julia!
P.S.4 2012-04-26 Following AL’s recommendation in the comments, one can replace in R:

rhs = t(x[,locus])%*%ycorr/vare;
lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus]

by

rhs = crossprod(x[,locus],ycorr)/vare
lhs = crossprod(x[,locus],x[,locus])/vare + 1.0/var[locus]

reducing execution time by roughly 20%, making the difference between Julia and R even smaller.
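If you want to check the gain on your own machine, here is a rough timing sketch with simulated marker-like data; the sizes loosely follow this post and the exact percentage will depend on your BLAS and hardware:

# Simulated genotypes (0/1/2) and a response vector, only for timing purposes
set.seed(2012)
xsim = matrix(sample(0:2, 250 * 2000, replace = TRUE), nrow = 250)
ysim = rnorm(250)

# Same operation as in the sampler, repeated over all 'loci'
system.time({
  for (locus in 1:2000) rhs1 <- t(xsim[, locus]) %*% ysim
})
system.time({
  for (locus in 1:2000) rhs2 <- crossprod(xsim[, locus], ysim)
})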

Categories
asreml bayesian r rblogs stats

Tall big data, wide big data

After attending two one-day workshops last week, I spent most days paying attention to (well, at least listening to) presentations in this biostatistics conference. Most presenters were R users—although Genstat, Matlab and SAS fans were also present—and not once did I hear “I can’t deal with the current size of my data sets”. However, there were some complaints about the speed of R, particularly when dealing with simulations or some genomic analyses.

Some people worried about the size of coming datasets; nevertheless, that worry applied across statistical packages or, more precisely, it went beyond statistical software. How will we be able to even store the data from something like the Square Kilometre Array, let alone analyze it?


In a previous post I was asking if we needed to actually deal with ‘big data’ in R, and my answer was probably not or, better, at least not directly. I still think that it is a valid, although incomplete opinion. In many statistical analyses we can think of n (the number of observations) and p (the number of variables per observation). In most cases, particularly when people refer to big data, n >> p. Thus, we may have 100 million people but we have only 10 potential predictors: tall data. In contrast, we may have only 1,000 individuals but with 50,000 points each coming from a near infrared spectrometry or information from 250,000 SNPs (a type of molecular marker): wide data. Both types of data will keep on growing but are challenging in a different way.

In a totally generalizing, unfair and simplistic way, I will state that dealing with wide data is more difficult (and potentially more interesting) than dealing with tall data, at least from a modeling perspective. As the t-shirt says: sampling is not a crime, and it should work quite well with simpler models and large datasets. In contrast, sampling before fitting wide data may not work at all.

Algorithms. Clever algorithms are what we need in a first stage. For example, we can fit linear mixed models to a tall dataset with ten million records, or a multivariate mixed model with 60 responses, using ASReml-R. Wide datasets are often approached using Bayesian inference, but MCMC gets slooow when dealing with thousands of predictors; we need other, faster approximations to the posterior distributions.

This post may not be totally coherent, but it keeps the conversation going. My excuse? I was watching Be Kind Rewind while writing it.

Categories
r rblogs stats teaching

R, academia and the democratization of statistics

I am not a statistician but I use statistics, teach some statistics and write about applications of statistics in biological problems.

Last week I was in this biostatistics conference, talking with a Ph.D. student who was surprised about this situation because I didn’t have any statistical training. I corrected “any formal training”. On the first day one of the invited speakers was musing about the growing number of “amateurs” using statistics—many times wrongly—and about what biostatisticians could offer as professional value-adding. Yes, he was talking about people like me spoiling the party.

Twenty years ago it was more difficult to say “cool, I have some data and I will run some statistical analyses” because (and you can easily see where I am going here) access to statistical software was difficult. You were among the lucky ones if you were based at a university or a large company, because you had access to SAS, SPSS, MINITAB, etc. However, you were out of luck outside of these environments, because there was no way to easily afford a personal licence, not for a hobby at least. This greatly limited the pool of people that could afford to muck around with stats.

Gratuitous picture: spiral in Sydney

Enter R and other free (sensu gratis) software that allowed us to skip the specialist, skip the university or the large organization. Do you need a formal degree in statistics to start running analyses? Do you even need to go through a university (for any degree, it doesn’t really matter) to do so? There are plenty of resources to start, download R or a matrix language, get online tutorials and books, read, read, read and ask questions in email lists or fora when you get stuck. If you are a clever cookie—and some of you clearly are one—you could easily cover as much ground as someone going through a university degree. It is probably still not enough to settle down, but it is a great start and a great improvement over the situation twenty years ago.

This description leaves three groups in trouble, trying to sort out their “value-adding” ability: academia, (bio)statisticians and software makers. What are universities offering that is unique enough to justify the time and money invested by individuals and governments? If you make software, what makes it special? For how long can you rely on tradition and inertia before people switch to something else? What’s so special about your (bio)statistical training to justify having one of “you” in the organization? Too many questions, so I’d better go to sleep.

P.S. Did I write “value-adding” ability? I must have been talking to the suits for too long… Next post I may end up writing “value-adding proposition”!

Categories
linear models r rblogs stats teaching

On the (statistical) road, workshops and R

Things have been a bit quiet at Quantum Forest during the last ten days. Last Monday (Sunday for most readers) I flew to Australia to attend a couple of one-day workshops; one on spatial analysis (in Sydney) and another one on modern applications of linear mixed models (in Wollongong). This will be followed by attending The International Biometric Society Australasian Region Conference in Kiama.

I would like to comment on the workshops to look for commonalities and differences. First, both workshops heavily relied on R, supporting the idea that if you want to reach a lot of people and get them using your ideas, R is pretty much the vehicle to do so. It is almost trivial to get people to install R and RStudio before the workshop so they are ready to go. “Almost” because you have to count on someone having a bizarre software configuration or draconian security policies for their computer.

The workshop on Spatial Analysis also used WinBUGS, which with all respect to the developers, is a clunky piece of software. Calling it from R or using JAGS from R seems to me a much more sensible way of using a Bayesian approach while maintaining access to the full power of R. The workshop on linear mixed models relied on asreml-R; if you haven’t tried it, please give it a go (it is a free license for academic/research use). There were applications on multi-stage experiments, composite samples and high dimensional data (molecular information). In addition, there was an initial session on optimal design of experiments.
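To make that concrete, this is roughly what I mean by calling JAGS from R: a toy normal model with made-up data using the rjags package, not anything from the workshops:

library(rjags)

# Tiny, made-up model written inline; it only stands in for a real analysis
model.string = "
model {
  for (i in 1:n) {
    y[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0, 0.001)
  tau ~ dgamma(0.001, 0.001)
}
"

jags.data = list(y = rnorm(30, mean = 5), n = 30)

m = jags.model(textConnection(model.string), data = jags.data, n.chains = 3)
update(m, 1000)  # burn-in
post = coda.samples(m, variable.names = c('mu', 'tau'), n.iter = 5000)
summary(post)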

In my opinion, the second workshop (modern applications…) was much more successful than the first one (spatial analysis…) for a few reasons:

  • One has to limit the material to cover in a one-day workshop; if you want to cover a lot consider three days so people can digest all the material.
  • One has to avoid the “split-personality” approach to presentations; having very-basic and super-hard but nothing in the middle is not a great idea (IMHO). Pick a starting point and gradually move people from there.
  • Limit the introduction of new software. One new piece of software per day seems to be a good rule of thumb.

Something bizarre (for me, at least) was the difference between the audiences. In Sydney the crowd was a lot younger, with many trainees in biostatistics coming mostly from health research. They had little exposure to R and seemed to come from a mostly SAS shop. The crowd in Wollongong had people with a lot of experience (OK, oldish) both in statistics and R. I was expecting young people to be more conversant in R.

Tomorrow we will drive down to Kiama, sort out registration and then go to the welcome BBQ. Funny thing is that this is my first statistics conference; as I mentioned in the About page of this blog, I am a simple forester. 🙂

Categories
books quotes stats

No one would ever conceive

I believe that no one who is familiar, either with mathematical advances in other fields, or with the range of special biological conditions to be considered, would ever conceive that everything could be summed up in a single mathematical formula, however complex.

— R.A. Fisher (1932) quoted in the preface to Foundations of Mathematical Genetics by A.W.F. Edwards (1976).

Categories
r rblogs stats

Do we need to deal with ‘big data’ in R?

David Smith at the Revolutions blog posted a nice presentation on “big data” (oh, how I dislike that term). It is a nice piece of work and the Revolution guys managed to process a large number of records, starting with a download of 70GB and ending up with a series of linear regressions.

I’ve spent the last two weeks traveling (including a visit to the trial below) and finishing marking for the semester, which has somewhat affected my perception on dealing with large amounts of data. The thing is that dealing with hotel internet caps (100MB) or even with my lowly home connection monthly cap (5GB) does get one thinking… Would I spend several months of internet connection just downloading data so I could graph and plot some regression lines for 110 data points? Or does it make sense to run a linear regression with two predictors using 100 million records?

My basic question is why would I want to deal with all those 100 million records directly in R? Wouldn’t it make much more sense to reduce the data to a meaningful size using the original database, up there in the cloud, and download the reduced version to continue an in-depth analysis? There are packages to query external databases (ROracle, RMySQL, RODBC, …, pick your poison), and we can sample to explore the dataset, etc.
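Something along the lines of this sketch, where the DSN and table names are invented and RODBC is just one of the options mentioned above:

library(RODBC)

# Hypothetical connection: ask the database to do the heavy lifting and
# download only the summarized result
channel = odbcConnect('big_dsn')  # made-up data source name

summarized = sqlQuery(channel,
  "SELECT predictor1, predictor2, AVG(response) AS mean_response
     FROM huge_table
    GROUP BY predictor1, predictor2")

# Or pull a manageable random sample to explore locally (the sampling syntax
# depends on the database engine)
sampled = sqlQuery(channel, "SELECT * FROM huge_table WHERE RAND() < 0.001")

odbcClose(channel)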

We can deal with a rather large dataset on our laptop, but is that the best way to deal with the underlying modeling problem? Just wondering.

Gratuitous picture of generic Pinus radiata breeding trial in Northern Tasmania.

Categories
books quotes stats

On “true” models

Before starting the description of the probability distributions, we want to impose on the reader the essential feature that a model is an interpretation of a real phenomenon that fits its characteristics to some degree of approximation rather than an explanation that would require the model to be “true”. In short, there is no such thing as a “true model”, even though some models are more appropriate than others!

— Jean-Michel Marin and Christian P. Robert in Bayesian Core: a practical approach to computational Bayesian statistics.

Categories
r rblogs stats teaching

Teaching with R: the tools

I bought an Android phone, nothing fancy, just my first foray into the smartphone world, which is a big change coming from the dumb phone world (*). Everything is different and I am back to being a newbie; this is what many students feel the first time they are exposed to R. However, and before getting into software, I find it useful to think of teaching from several points of view, considering that there are several use cases:

  1. A few of the students will get into statistics and heavy-duty coding; they will be highly motivated and take several stats courses. It is in their best interest to learn R (and other software) very well.
  2. Some, but not many, of the students will use statistics once in a while. A point-and-click GUI may help them to remember commands.
  3. Most students will have to consume statistics, read reports, request information from colleagues and act as responsible, statistically literate citizens. I think that concepts (rather than software) are by far the most relevant element for this group.

The first group requires access to all the power in R and, very likely, has at least a passing idea of coding in other languages. The second and third groups are occasional users, which will tend to forget the language and notation and most likely will need a system with menus.

At this point, some readers may be tempted to say that everyone (that is, groups 1–3) should learn to write R code and that they just need a good text editor (say Emacs). This may come as a surprise, but normal people not only do not use text editors, they don’t even know what they are. You could be tempted to say that having a good editor would also let them write in LaTeX (or XeLaTeX), which is an excellent way to ‘future proof’ your documents. Please let me repeat this: normal people do not write in LaTeX, they use Word or something equivalent.

But, but. Yes, I know, we are not normal people.

What are the problems?

When working on my Ph.D. I had the ‘brilliant’ (a.k.a. masochistic) idea of using a different language for each part of my project: Fortran 90 (Lahey), ASReml, Python (ActiveState), Matlab and Mathematica. One thing that I experienced was that working with a scripting language integrated with a good IDE (e.g. ActiveState Python or Matlab) was much more conducive to learning than a separate console and text editor. I still have fond memories of learning and using Python. This meandering description brings me back to what we should use for teaching.

Let’s be honest, the stock R IDE that one gets with the initial download is spartan if you are on OS X and plain sucky if you are on Windows. Given that working with the console plus a text editor (Emacs, Vim, Textmate, etc.) is an uncomfortable learning experience (at least in my opinion), there is a nice niche for IDEs like RStudio, which integrate editor, data manager, graphs, etc.; particularly if they are cross-platform. Why is it that RStudio is not included as the default R IDE? (Incidentally, I have never used the Revolution R Productivity Environment—Windows only—which looks quite cool.)

Today I am tempted to recommend moving the whole course to RStudio, which means installing it on an awful lot of computers at the university. One of the issues that stops me is that it introduces another layer of abstraction to R. We have the plain-vanilla console, then the normal installation and, on top, RStudio. On the other hand, we are already introducing an extra level with R Commander.

At this point we reach the point-and-click GUI. For the last two years we have used R Commander, which has helped, but I have never felt entirely comfortable with it. This year I had a chat with some students who used SPSS before and, after the initial shock, they seemed to cope with R Commander. In a previous post someone suggested Deducer, which I hope to check out before the end of this year. I am always on the lookout for a good and easy interface for students that fall in the second and third cases (see above). It would be nice to have a series of interfaces that look like Jeroen Ooms’s prototypes. Please let me know if you have any suggestions.

(*)This is not strictly true, as I had a Nokia E72, which was a dumb phone with a lot of buttons pretending to be a smartphone.

(**)This post should be read as me thinking aloud and reflecting today’s impressions, which are continually evolving. I am not yet truly comfortable with any GUI in the R world, and I still feel that SPSS, S-Plus or Genstat (for example) provide a nicer flow on the GUI front.