I’ve been having a conversation for a while with @kamal_hothi and @aschiff on maps, schools, census, making NZ data available, etc. This post documents some basic steps I used for creating a map on ethnic diversity in schools at the census-area-unit level. This “el quicko” version requires 3 ingredients:

Census area units shape files (available from Statistics New Zealand for free here).

We’ll read the school directory data, aggregate ethnicity information to calculate the Herfindahl–Hirschman Index of diversity and then plot it.

# School directory
direc = read.csv('directory-school-current.csv', skip = 3)
# Total number of students for each ethnicity by Census Area Unit
hhi = aggregate(cbind(European..Pakeha, Maori, Pasifika, Asian, MELAA, Other) ~
Census.Area.Unit, data = direc, FUN = sum, na.rm = TRUE)
# Function to calculate
index = function(x) {
total = sum(x, na.rm = TRUE)
frac = x/total
h = sum(frac^2)
hn = if(total > 1) (h - 1/total)/(1 - 1/total) else 1
return(hn)
}
# Calculate the index for each area
hhi$hn = apply(hhi[,2:7], 1, index)
# Write data to use in QGis later
write.csv(hhi, 'diversity-index-census-area-unit.csv', quote = FALSE,
row.names = FALSE)

Then I moved to create a map in R, for the sake of it:

library(rgdal) # for readOGR
library(sp) # for spplot
# Reading shapefile
cau = readOGR(dsn='/Users/lap44/Dropbox/research/2015/census/2014 Digital Boundaries Generlised Clipped',
layer='AU2014_GV_Clipped')
# Joining with school ethnicity data (notice we refer to @data, as cau contains spatial info as well)
cau@data = data.frame(cau@data,
hhi[match(cau@data[,"AU2014_NAM"], hhi[,"Census.Area.Unit"]),])
# Limiting map to the area around Christchurch
spplot(cau, zcol = "hn", xlim = c(1540000, 1590000),
ylim= c(5163000, 5198000))

And we get a plot like this:

Just because it is Monday down under.

P.S. Using the diversity-index-census-area-unit.csv and joining it with the shapefile in QGIS one can get something prettier (I have to work on matching the color scales):

Map rendering is so much faster in QGIS than in R! Clearly the system has been optimized for this user case.

When I write code I’m often amazed by the direction of the statements. I mean, we read and write left-to-right except when we assign statements to a variable. So here we are, doing our business, slowly working in a sentence and, puff!, we get this insight from the future and we assign it to our past, on the left. I commented this on Twitter and @fadesingh pointed me to this artistic creation of right-to-left programming language, except that the assign was left-to-right.

R has a right-assign operator (->) and I was thinking, how would it look if I wrote all the assigns to the left, using an adaptation of my previous post.

Currently there is some discussion in New Zealand about the effect of the reclassification of schools in socioeconomic deciles. An interesting aspect of the funding system in New Zealand is that state and state-integrated schools with poorer families receive substantially more funding from the government than schools that receive students from richer families (see this page in the Ministry of Education’s website).

One thing that I haven’t noticed before is that funding decisions are more granular than simply using deciles, as deciles 1 to 4 are split into 3 steps each. For example, for Targeted Funding for Educational Achievement in 2015 we get the following amounts per student for decile: 1 (A: $905.81, B: $842.11, C: $731.3), 2 (D: $617.8, E: 507.01, F: 420.54), 3 (G: $350.25, H: $277.32, I: $220.59), 4 (J: $182.74, K: $149.99, L: $135.12), 5 ($115.76), 6 ($93.71), 7: ($71.64), 8 ($46.86), 9 ($28.93) and 10 ($0).

The Ministry of Education states that 784 schools ‘have moved to a higher decile rating’ while 800 ‘have moved to a lower decile rating’ (800 didn’t move). They do not mean that those numbers of schools changed deciles, but that information also includes changes of steps within deciles. Another issue is that it is not the same to move one step at the bottom of the scale (e.g. ~$63 from 1A to 1B) or at the top (~$29 from 9 to 10); that is, the relationship is not linear.

I assume that the baseline to measure funding changes is to calculate how much would a school would get per student in 2015 without any change of decile/step. That is, funding assuming that the previous step within decile had stayed constant. Then we can calculate how a student will get with the new decile/step for the school. I have limited this ‘back of the envelope’ calculation to Targeted Funding for Educational Achievement, which is not the only source of funding linked to deciles. There are other things like Special Education Grant and Careers Information Grant, but they have much smaller magnitude (maximum $73.94 & $37.31 per student) and the maximum differences between deciles 1 and 10 are 2:1.

Steps are in capital letters and need to be translated into money. Once we get that we can calculate differences at both student level and school level:

steps = c(LETTERS[1:17], 'Z')
money = c(905.81, 842.11, 731.3, 617.8, 507.01, 420.54, 350.25, 277.32,
220.59, 182.74, 149.99, 135.12, 115.76, 93.71, 71.64, 46.86, 28.93, 0)
dc = within(dc, {
sm2014 = sapply(step2014, function(x) money[match(x, steps)])
sm2015 = sapply(step2015, function(x) money[match(x, steps)])
# Change per student
funding.change = sm2015 - sm2014
})
summary(dc$funding.change)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
#-812.100 -22.070 0.000 -3.448 22.070 707.000
# Merging with school directory data
dc = merge(dc, direc[, c('School.ID', 'Total.School.Roll')],
by = 'School.ID', all.x = TRUE)
# Calculating total change at school level
# considering roll in thousands of dollars
dc$school.level.change = with(dc, funding.change*Total.School.Roll/1000)
summary(dc$school.level.change)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
#-137.100 -2.820 0.000 2.689 3.627 413.100

If we look at the 50% of the schools in the middle of the distribution they had fairly small changes, approximately +/- $22 per student per year or at the school level +/- 3,000 dollars per year.

An interesting, though not entirely surprising, graph is plotting changes of funding on the size of the school. Large schools are much more stable on deciles/step than small ones.

# At student level
qplot(Total.School.Roll, funding.change, data = dc, alpha = 0.8,
xlab = 'Total School Roll', ylab = 'Funding change per student (NZ$)') +
theme_bw() + theme(legend.position = 'none')
# At school level
qplot(Total.School.Roll, school.level.change, data = dc, alpha = 0.8,
xlab = 'Total School Roll', ylab = 'Funding change per school (NZ$ 000)') +
theme_bw() + theme(legend.position = 'none')

Overall, there is a small change of the total amount of money for Targeted Funding for Educational Achievement used in the reclassified school system versus using the old deciles ($125M using 2014 deciles versus $132M using 2015 deciles) and for most schools the changes do not seem dramatic. There is, however, a number of schools (mostly small ones) who have had substantial changes to their funding. Very small schools will tend to display the largest changes, as the arrival or departure of only few pupils with very different socioeconomic backgrounds would have a substantial effect. An example would be Mata School in the Gisborne area, which moved 13 steps in decile funding (from A to N) with a roll of 11 kids. How to maintain a more steady funding regime seems to be a difficult challenge in those cases.

One consequence of the larger variability in small schools is that rural areas will be more affected by larger changes of funding. While overall 34% of the schools had no changes to their decile/step classification in rural areas that reduces to 22%; on top of that, the magnitude of the changes for rural schools is also larger.

Operational school funding is much more complex than deciles, as it includes allocations depending on number of students, use of Maori language, etc.

P.S. Stephen Senn highlights an obvious problem with the language the Ministry uses: there are 9 deciles (the points splitting the distribution into 10 parts). We should be talking about tenths, a much simpler word, instead of deciles.

At the moment I am writing R code that involves a lot of simulation for a project. This time I wanted to organize the work properly, put a package together, document it,… the whole shebang. Hadley Wickham has excellent documentation for this process in Advanced R, which works very well as a website. Up to this point there is nothing new; but the material is also available as a book.

At this point in my life I do not want to have a physical object if I can avoid it. On top of that, code tutorials work a lot better as a website, so one can copy, paste and experiment. PDF or ebooks are not very handy for this subject either. Here enters a revolutionary notion: I like to pay people who do a good job and, in the process, make my job easier but sometimes I do not want an object in exchange.

One short term solution: asking Hadley for his favorite charity and donating the cost of a copy of the book. That gets most people happy except, perhaps, the publisher. I then remembered this idea by Cory Doctorow, in which he acts as a middleman between people who wish to pay him for his stories (but don’t want a physical copy of books) and school libraries that wish to have copies of the books.

Wouldn’t it be nice to have an arrangement like that for programming and research books? For example, we could get R learners who prefer but can’t afford books and people willing to pay for them.

I’m the first to acknowledge that most of my code could run faster. The truth of the matter is that, in essence, I write ‘quickies’: code that will run once or twice, so there is no incentive to spend days or hours in shaving seconds of a computation. Most analyses of research data fall in to this approach: read data-clean data-fit model-check model-be happy-write article-(perhaps) make data and code available-move on with life.

One of the reasons why my code doesn’t run faster or uses less memory is the trade-off between the cost of my time (very high) compared to the cost of more memory or faster processors (very cheap) and the gains of shaving a few seconds or minutes of computer time, which tend to be fairly little.

In R vectorization is faster than working with each vector element, although it implies allocating memory for whole vectors and matrices, which for large-enough problems may become prohibitively expensive. On the other hand, not vectorizing some operations may turn your problem into an excruciatingly slow exercise and, for example in large simulations, in practice intractable in a useful timeframe.

John Muschelli wrote an interesting post reimplementing 2×2 frequency tables for a highly specific use, comparing the results of image processing algorithms. In John’s case, there are two greater-than-9-million-long-elements logical vectors, and when comparing vectors for many images the process becomes very slow. He explains part of his rationale in his blog (go and read it), but say that his solution can be expressed like this:

# I don't have any image data, but I'll simulate a couple
# of 10 million long logical vectors (just to round things)
set.seed(2014)
manual = sample(c(TRUE, FALSE), 10E6, replace = TRUE)
auto = sample(c(TRUE, FALSE), 10E6, replace = TRUE)
logical.tab = function(x, y) {
tt = sum(x & y)
tf = sum(x & !y)
ft = sum(!x & y)
ff = sum(!x & !y)
return(matrix(c(ff, tf, ft, tt), 2, 2))
}
logical.tab(manual, auto)

which uses 1/4 of the time used by table(manual, auto). Mission accomplished! However, if I stopped here this blog post would not make much sense, simply rehashing (pun intended) John’s code. The point is to explain what is going on and, perhaps, to find even faster ways of performing the calculations. As a start, we have to be aware that the calculations in logical.tab() rely on logical (boolean) operations and on the coercion of logical vectors to numerical values, as stated in the documentation:

Logical vectors are coerced to integer vectors in contexts where a numerical value is required, with TRUE being mapped to 1L, FALSE to 0L and NA to NA_integer_.

In R Logical operations can be slower than mathematical ones, a consideration that may guide us to think of the problem in a slightly different way. For example, take the difference between the vectors (dif = x - y), so both TRUE-TRUE (1-1) and FALSE-FALSE (0-0) are 0, while TRUE - FALSE (1-0) is 1 and FALSE - TRUE (0-1) is -1. Therefore:

the sum of positive values (sum(dif > 0)) is the frequency of TRUE & FALSE,

while the sum of negative values (sum(dif < 0)) is the frequency of FALSE & TRUE.

The values for TRUE & TRUE can be obtained by adding up the element-wise multiplication of the two vectors, as TRUE*TRUE (1*1) is the only product that's different from zero. A vectorized way of performing this operation would be to use t(x) %*% y; however, for large vectors the implementation crossprod(x, y) is faster. The values for FALSE & FALSE are simply the difference of the length of the vectors and the previously calculated frequencies length(dif) - tt - tf - ft. Putting it all together:

basic.tab = function(x, y) {
dif = x - y
tf = sum(dif > 0)
ft = sum(dif < 0)
tt = crossprod(x, y)
ff = length(dif) - tt - tf - ft
return(c(tf, ft, tt, ff))
}

This code takes 1/20 of the time taken by table(x, y). An interesting result is that the crossproduct crossprod(x, y) can also be expressed as sum(x * y), which I didn't expect to be faster but, hey, it is. So we can express the code as:

basic.tab2 = function(x, y) {
dif = x - y
tf = sum(dif > 0)
ft = sum(dif < 0)
tt = sum(x*y)
ff = length(dif) - tt - tf - ft
return(c(tf, ft, tt, ff))
}

to get roughly 1/22 of the time. The cost of logical operations is easier to see if we isolate particular calculations in logical.tab and basic.tab; for example,

tf1 = function(x, y) {
tf = sum(x & !y)
}

is slower than

tf2 = function(x, y) {
dif = x - y
tf = sum(dif > 0)
}

This also got me thinking of the cost of coercion: Would it take long? In fact, coercing logical vectors to numeric has little (at least I couldn't see from a practical viewpoint) if any cost. In some cases relying on logical vectors converted using as.numeric() seems to be detrimental on terms of speed.

As I mentioned at the beginning, vectorization uses plenty of memory, so if we were constrained in that front and we wanted to do a single pass on the data we could write an explicit loop:

The loopy.tab does only one pass over the vectors and should use less memory as it doesn't need to create those huge 10M elements vectors all the time (at least it would if we used a proper iterator in the loop instead of iterating on a vector of length 10M (that's 40 MB big in this case). The iterators package may help here). We save room/memory at the expense of speed, as loopy.tab is ten times slower than the original table() function. Of course one could run it a lot faster if implemented in another language like C++ or Fortran and here Rcpp or Rcpp11 would come handy (updated! See below).

This is only a not-so-short way of reminding myself what's going on when trading-off memory, execution speed and my personal time. Thus, I am not saying that any of these functions is the best possible solution, but playing with ideas and constraints that one often considers when writing code. Incidentally, an easy way to compare the execution time of these functions is using the microbenchmark library. For example:

library(microbenchmark)
microbenchmark(logical.tab(manual, auto), basic.tab2(manual, auto), times = 1000)

will spit numbers when running a couple of functions 1,000 times with results that apply to your specific system.

PS 2014-07-12 11:08 NZST Hadley Wickham suggests using tabulate(manual + auto * 2 + 1, 4) as a fast alternative. Nevertheless, I would like to point out that i- the initial tabulation problem is only an excuse to discuss a subset of performance issues when working with R and ii- this post is more about the journey than the destination.

PS 2014-07-13 20:32 NZST Links to two posts related to this one:

Wiekvoet's work which i- relies on marginals and ii- reduces the use of logical operators even more than my basic.tab2(), simultaneously taking around 1/4 of the time.

Yin Zhu's post describing vectors is a handy reminder of the basic building blocks used in this post.

Updated with Rcpp goodness!

PS 2014-07-13 21:45 NZST I browsed the Rcpp Gallery and Hadley Wickham's Rcpp tutorial and quickly converted loopy.tab() to C++. Being a bit of an oldie I've used Fortran (90, not that old) before but never C++, so the following code is probably not very idiomatic.

The Swarm Lab presents a nice comparison of R and Python code for a simple (read ‘one could do it in Excel’) problem. The example works, but I was surprised by how wordy the R code was and decided to check if one could easily produce a shorter version.

The beginning is pretty much the same, although I’ll use ggplot2 rather than lattice, because it will be a lot easier (and shorter) to get the desired appearance for the plots:

require(Quandl)
require(ggplot2)
# Load data from Quandl
my.data = Quandl("TPC/HIST_RECEIPT",
start_date = "1945-12-31",
end_date = "2013-12-31")

The whole example relies on only three variables and—as I am not great at typing—I tend to work with shorter variable names. I directly changed the names for variables 1 to 3:

# Display first lines of the data frame
# and set short names for first three columns
head(my.data)
names(my.data)[1:3] = c('year', 'indtax', 'corptax')

It is a lot easier to compare the regression lines if we change the shape of the data set from wide to long, where there is one variable for year, one for tax type, and one for the actual tax rate. It would be possible to use one of Hadley’s packages to get a simpler syntax for this, but I decided to stick to the minimum set of requirements:

# Change shape to fit both regressions simultaneously
mdlong = reshape(my.data[, 1:3],
idvar = 'year', times = c('Individual', 'Corporate'),
varying = list(2:3), direction = 'long')
mdlong$taxtype = factor(mdlong$time)

And now we are ready to produce the plots. The first one can be a rough cut to see if we get the right elements:

ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) +
geom_point() + geom_line() + geom_smooth(method = 'lm')

Yes, this one has the points, lines, linear regression and 95% confidence intervals for the mean predicted responses, but we still need to get rid of the grey background and get black labels (theme_bw()), set the right axis labels and ticks (scale_x... scale_y...) and set the right color palette for points and lines (scale_colour_manual) and filling the confidence intervals (scale_colour_fill) like so:

# Plotting the graph (first match color palette) and put the regression
# lines as well
serious.palette = c('#AD3333', '#00526D')
ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) +
geom_point() + geom_line() + geom_smooth(method = 'lm', aes(fill = taxtype)) +
theme_bw() +
scale_y_continuous('Income taxes (% of GDP)', breaks = seq(0, 12, 2), minor_breaks = NULL) +
scale_x_date('Fiscal year', minor_breaks = NULL) +
scale_colour_manual(values=serious.palette) + scale_fill_manual(values=serious.palette)

One can still change font sizes to match the original plots, reposition the legend, change the aspect ratio while saving the png graphs (all simple statements) but you get the idea. If now we move to fitting the regression lines:

# Fitting a regression with dummy variables
m1 = lm(indtax ~ year*taxtype, data = mdlong)
summary(m1)
# The regressions have different intercepts and slopes
# Call:
# lm(formula = indtax ~ year * taxtype, data = mdlong)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.95221 -0.44303 -0.05731 0.35749 2.39415
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.435e+00 1.040e-01 33.01 <2e-16 ***
# year -1.564e-04 1.278e-05 -12.23 <2e-16 ***
# taxtypeIndividual 4.406e+00 1.471e-01 29.94 <2e-16 ***
# year:taxtypeIndividual 1.822e-04 1.808e-05 10.08 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.7724 on 134 degrees of freedom
# Multiple R-squared: 0.9245, Adjusted R-squared: 0.9228
# F-statistic: 546.9 on 3 and 134 DF, p-value: < 2.2e-16

This gives the regression coefficients for Corporate (3.45 - 1.564e-04 year) and Individual ([3.45 + 4.41] + [-1.564e-04 + 1.822e-04] year or 7.84 + 2.58e-05 year). As a bonus you get the comparison between regression lines.

In R as a second language I pointed out that 'brevity reduces the time between thinking and implementation, so we can move on and keep on trying new ideas'. Some times it seriously does.

Imagine that you are studying English as a second language; you learn the basic rules, some vocabulary and start writing sentences. After a little while, it is very likely that you’ll write grammatically correct sentences that no native speaker would use. You’d be following the formalisms but ignoring culture, idioms, slang and patterns of effective use.

R is a language and any newcomers, particularly if they already know another programming language, will struggle at the beginning to get what is beyond the formal grammar and vocabulary. I use R for inquisition: testing ideas, data exploration, visualization; under this setting, the easiest is to perform a task the more likely is one going to do it. It is possible to use several other languages for this but—and I think this is an important but—R’s brevity reduces the time between thinking and implementation, so we can move on and keep on trying new ideas^{†}.

A typical example is when we want to repeat something or iterate over a collection of elements. In most languages if one wants to do something many times the obvious way is using a loop (coded like, for() or while()). It is possible to use a for() loop in R but many times is the wrong tool for the job, as it increases the lag between thought and code, moving us away from ‘the flow’.

# Generate some random data with 10 rows and 5 columns
M = matrix(round(runif(50, 1, 5), 0), nrow = 10, ncol = 5)
M
# [,1] [,2] [,3] [,4] [,5]
# [1,] 2 3 4 2 1
# [2,] 3 1 3 3 4
# [3,] 4 2 5 1 3
# [4,] 2 4 4 5 3
# [5,] 2 3 1 4 4
# [6,] 3 2 2 5 1
# [7,] 1 3 5 5 2
# [8,] 5 4 2 5 4
# [9,] 3 2 3 4 3
#[10,] 4 4 1 2 3
# Create dumb function that returns mean and median
# for data
sillyFunction = function(aRow) {
c(mean(aRow), median(aRow))
}
# On-liner to apply our function to each row
apply(M, 1, sillyFunction)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,] 2.4 2.8 3 3.6 2.8 2.6 3.2 4 3 2.8
#[2,] 2.0 3.0 3 4.0 3.0 2.0 3.0 4 3 3.0
# or one could do it for each column
apply(M, 2, sillyFunction)
# Of course one could use a loop. Pre-allocating
# the result matrix would have a loop with little
# time penalty (versus growing the matrix)
nCases = dim(M)[1]
resMatrix = matrix(0, nrow = nCases, ncol = 2)
# and here is the loop
for(i in 1:nCases){
resMatrix[i, 1:2] = sillyFunction(M[i,])
}
resMatrix
# Same results as before
# [,1] [,2]
# [1,] 2.4 2
# [2,] 2.8 3
# [3,] 3.0 3
# [4,] 3.6 4
# [5,] 2.8 3
# [6,] 2.6 2
# [7,] 3.2 3
# [8,] 4.0 4
# [9,] 3.0 3
#[10,] 2.8 3

One of the distinctive features of R is that there is already a lot of functionality available for jobs that occur frequently in data analysis. The easiest is to perform a task the more likely is one going to do it, which is perfect if one is exploring/thinking about data.

Thomas Lumley reminded me of the ACM citation for John Chambers—father of S of which R is an implementation—which stated that Chambers’s work:

…will forever alter the way people analyze, visualize, and manipulate data . . . S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers.

If I could summarize the relevance of R in a Tweetable phrase (with hash tags and everything) it would be:

Most data analysis languages underestimate the importance of interactivity/low barrier to exploration. That’s where #Rstats shines.

One could run statistical analyses with many languages (including generic ones), but to provide the right level of interactivity for analysis, visualization and data manipulation one ends up creating functions that, almost invariably, look a bit like R; pandas in Python, for example.

There are some complications with some of the design decisions in R, especially when we get down to consistency which begets memorability. A glaring example is the apply family of functions and here is where master opportunist (in the positive sense of expert at finding good opportunities) Hadley Wickham^{‡} made sense out of confusion in his package plyr.

There is also a tension in languages under considerable use because speakers/writers/analysts/coders start adapting them to new situations, adding words and turns of phrase. Look at English for an example! This is also happening to R and some people wish the language looked different in some non-trivial ways. A couple of examples: Coffeescript for R and Rasmus Bååth’s suggestions. Not all of them can be implemented, but suggestions like this speak of the success of R.

If you are struggling to start working with R, as with other languages, first let go. The key to learning and working with a new language is immersing yourself in it; even better if you do it with people who already speak it.

^{†} Just to be clear, there are several good statistical languages. However, none is as supportive of rapid inquisition as R (IMO). It is not unusual to develop models in one language (e.g. R) and implement it in another for operational purposes (e.g. SAS, Python, whatever).

^{‡} The first thing I admire about Hadley is his ‘good eye’ for finding points of friction. The second one is doing something about the frictions, often with very good taste.

P.S. It should come clear from this post that English is indeed my second language.

I teach several courses every year and the most difficult to pull off is FORE224/STAT202: regression modeling.

The academic promotion application form in my university includes a section on one’s ‘teaching philosophy’. I struggle with that part because I suspect I lack anything as grandiose as a philosophy when teaching: as most university lecturers I never studied teaching, although I try to do my best. If anything, I can say that I enjoy teaching and helping students to ‘get it’ and that I want to instill a sense of ‘statistics is fun’ in them. I spend quite a bit of time looking for memorable examples, linking to stats in the news (statschat and listening the news while walking my dog are very helpful here) and collecting data. But a philosophy? Don’t think so.

One of the hardest parts of the course is the diversity of student backgrounds. Hitting the right level, the right tone is very hard. Make it too easy and the 1/5 to 1/4 of students with a good mathematical background will hate it; they may even decide to abandon any intention of continuing doing stats if ‘that’s all there is about the topic’. Make it too complicated and half the class will fail and/or hate the content.

Part of the problem is based around what we mean by teaching ‘statistics’. In some cases it seems limited to what specific software does; for example, teaching with Excel means restriction to whatever models are covered in Excel’s Data Analysis Toolpak (DAT). The next choice when teaching is using menu-driven software (e.g. SPSS), which provides much more statistical functionality than Excel + DAT, at the expense of being further removed from common usability conventions. At the other extreme of simplicity is software that requires coding to control the analyses (e.g. R or SAS). In general, the more control we want, the more we have to learn to achieve it^{†}.

A while ago I made a distinction between the different levels of learning (user cases) when teaching statistics. In summary, we had i- very few students getting in to statistics and heavy duty coding, ii- a slightly larger group that will use stats while in a while and iii- the majority that will mostly consume statistics. I feel a duty towards the three groups, while admitting that I have predilection for the first one. Nevertheless, the third group provides most of the challenges and need for thinking about how to teach the subject.

When teaching linear models (general form \(y = X eta + epsilon\)) we tend to compartmentalize content: we have an ANOVA course if the design matrix \(X\) represents categorical predictors (contains only 1s and 0s), a regression course if \(X\) is full of continuous predictors and we talk about ANCOVA or regression on dummy variables if \(X\) is a combination of both. The use of different functions for different contents of \(X\) (for example aov() versus lm() in R or proc reg versus proc glm in SAS) further consolidates the distinction. Even when using menus, software tends to guide students through different submenus depending on the type of \(X\).

At the beginning of the course we restrict ourselves to \(X\) full of continuous predictors, but we introduce the notion of matrices with small examples. This permits showing the connection between all the linear model courses (because a rose by any other name…) and it also allows deriving a general expression of the formulas for the regression coefficients (essential for the most advanced students). Slower students may struggle with some of this material; however, working with small examples they can replicate the results from R (or Excel or SAS or whatever one uses to teach). Some times they even think it is cool.

Here is where the model.matrix() R function becomes handy; rather than building incidence matrices by hand—which is easy for tiny examples—we can get the matrices used by the lm() function to then calculate regression parameters (and any other output) for more complex models.

Once students get the idea that on matrix terms our teaching compartments are pretty much the same, we can reinforce the idea by using a single function (or proc) to show that we can obtain all the bits and pieces that make up what we call ‘fitting the model’. This highlights the idea that ANOVA, ANCOVA & regression are subsets of linear models, which are subsets of linear mixed models, which are subsets of generalized linear mixed models. A statistical Russian doll.

We want students to understand, some times so badly that we lower the bar to a point where there is no much to understand. Here is the tricky part, finding the right level of detail so all types of students learn to enjoy the topic, although at different levels of understanding.

^{†}There is software that generates code from menus too, like Stata or Genstat.

I wanted to develop a small experiment with a front end using the Processing language and the backend calculations in R; the reason why will be another post. This post explained the steps assuming that one already has R and Processing installed:

Install the Rserve package. This has to be done from source (e.g. using R CMD INSTALL packagename).

Download Rserve jar files and include them in the Processing sketch.

Run your code

For example, this generates 100 normal distributed random numbers in R and then sorts them (code copy and pasted from second link):

import org.rosuda.REngine.Rserve.*;
import org.rosuda.REngine.*;
double[] data;
void setup() {
size(300, 300);
try {
RConnection c = new RConnection();
// generate 100 normal distributed random numbers and then sort them
data= c.eval("sort(rnorm(100))").asDoubles();
} catch ( REXPMismatchException rme ) {
rme.printStackTrace();
} catch ( REngineException ree ) {
ree.printStackTrace();
}
}
void draw() {
background(255);
for( int i = 0; i < data.length; i++) {
line( i * 3.0, height/2, i* 3.0, height/2 - (float)data[i] * 50 );
}
}

The problem is that this didn't work, because my OS X (I use macs) R installation didn't have shared libraries. My not-so-quick solution was to compile R from source, which involved:

Downloading R source. I went for the latest stable version, but I could have gone for the development one.

Setting up the latest version of C and Fortran compilers. I did have an outdated version of Xcode in my macbook air, but decided to delete it because i- uses many GB of room in a small drive and ii- it's a monster download. Instead I went for Apple's Command Line Tools, which is a small fraction of size and do the job.

In the case of gfortran, there are many sites pointing to this page that hosts a fairly outdated version, which was giving me all sorts of problems (e.g. "checking for Fortran 77 name-mangling scheme") because the versions between the C and Fortran compilers were out of whack. Instead, I downloaded the latest version from the GNU site.

Changing the config.site file in a few places, ensuring that I had:

Then compiled using (didn't want X11 and enabling shared library):

./configure --without-x --enable-R-shlib
make
make check
make pdf # This produces a lot of rubbish on screen and it isn't really needed
make info

And finally installed using:

sudo make prefix=/luis/compiled install

This used a prefix because I didn't want to replace my fully functioning R installation, but just having another one with shared libraries. If one types R in terminal then it is still calling the old version; the new one is called via /luis/compiled/R.framework/Versions/Current/Resources/bin/R. I then installed Rserve in the new version and was able to call R from processing so I could obtain.

Now I can move to what I really wanted to do. File under stuff-that-I-may-need-to-remember-one-day.

This week I’ve been feeling tired of excessive fanaticism (or zealotry) of open source software (OSS) and R in general. I do use a fair amount of OSS and pushed for the adoption of R in our courses; in fact, I do think OSS is a Good Thing^{TM}. I do not like, however, constant yabbering on why using exclusively OSS in science is a good idea and the reduction of science to repeatability^{‡} and computability (both of which I covered in my previous post). I also dislike the snobbery of ‘you shall use R and not Excel at all, because the latter is evil’ (going back ages).

We often have several experiments running during the year and most of the time we do not bother setting up a data base to keep data. Doing that would essentially mean that I would have to do it, and I have a few things more important to do. Therefore, many data sets end up in… (drum roll here) Microsoft Excel.

How should a researcher setup data in Excel? Rather than reinventing the wheel, I’ll use a(n) (im)perfect diagram that I found years ago in a Genstat manual.

I like it because:

It makes clear how to setup the experimental and/or sampling structure; one can handle any design with enough columns.

It also manages any number of traits assessed in the experimental units.

It contains metadata in the first few rows, which can be easily skipped when reading the file. I normally convert Excel files to text and then I skip the first few lines (using skip in R or firstobs in SAS).

People doing data analysis often start convulsing at the mention of Excel; personally, I deeply dislike it for analyses but it makes data entry very easy, and even a monkey can understand how to use it (I’ve seen them typing, I swear). The secret for sane use is to use Excel only for data entry; any data manipulation (subsetting, merging, derived variables, etc.) or analysis is done in statistical software (I use either R or SAS for general statistics, ASReml for quantitative genetics).

It is far from a perfect solution but it fits in the realm of the possible and, considering all my work responsibilities, it’s a reasonable use of my time. Would it be possible that someone makes a weird change in the spreadsheet? Yes. Could you fart while moving the mouse and create a non-obvious side effect? Yes, I guess so. Will it make your life easier, and make possible to complete your research projects? Yes sir!

P.S. One could even save data using a text-based format (e.g. csv, tab-delimited) and use Excel only as a front-end for data entry. Other spreadsheets are of course equally useful.

P.S.2. Some of my data are machine-generated (e.g. by acoustic scanners and NIR spectroscopy) and get dumped by the machine in a separate—usually very wide; for example 2000 columns—text file for each sample. I never put them in Excel, but read them directly (a directory-full of them) in to R for manipulation and analysis.

An opinion piece on Calculus and statistics by Daniel Kaplan, on teaching a different version of your typical introductory calculus course, so it is useful for statistics. He goes as far as teaching calculus using R. There is more information in Project MOSAIC.

Biased and Inefficient, Thomas Lumley’s personal statistics blog (he insists that posting 75% of Statschat is not enough to qualify as personal). You may know Thomas from the survey package (or a few others).

If you are a postgrad student in New Zealand you can apply for a NeSI (New Zealand eScience Infrastructure) postgraduate allocation to access high performance computing facilities.

A few days ago I came across Jack Heinemann and collaborators’ article (Sustainability and innovation in staple crop production in the US Midwest, Open Access) comparing the agricultural sectors of USA and Western Europe^{‡}. While the article is titled around the word sustainability, the main comparison stems from the use of Genetically Modified crops in USA versus the absence of them in Western Europe.

I was curious about part of the results and discussion which, in a nutshell, suggest that “GM cropping systems have not contributed to yield gains, are not necessary for yield gains, and appear to be eroding yields compared to the equally modern agroecosystem of Western Europe”. The authors relied on several crops for the comparison (Maize/corn, rapeseed/canola^{see P.S.6}, soybean and cotton); however, I am going to focus on a single one (corn) for two reasons: 1. I can’t afford a lot of time for blog posts when I should be preparing lectures and 2. I like eating corn.

When the authors of the paper tackled corn the comparison was between the USA and Western Europe, using the United Nations definition of Western Europe (i.e. Austria, Belgium, France, Germany, Liechtenstein, Luxembourg, Monaco, Netherlands, Switzerland). Some large European corn producers like Italy are not there because of the narrow definition of Western.

I struggled with the comparison used by the authors because, in my opinion, there are potentially so many confounded effects (different industry structures, weather, varieties, etc.) that it can’t provide the proper counterfactual for GM versus non-GM crops. Anyway, I decided to have a look at the same data to see if I would reach the same conclusions. The article provides a good description of where the data came from, as well as how the analyses were performed. Small details to match exactly the results were fairly easy to figure out. I downloaded the FAO corn data (3.7 MB csv file) for all countries (so I can reuse the code and data later for lectures and assignments). I then repeated the plots using the following code:

# Default directory
setwd('~/Dropbox/quantumforest')
# Required packages
require(ggplot2)
require(labels)
# Reading FAO corn data
FAOcorn = read.csv('FAOcorn.csv')
# Extracting Area
FAOarea = subset(FAOcorn, Element == 'Area Harvested',
select = c('Country', 'Year', 'Value'))
names(FAOarea)[3] = 'Area'
# and production
FAOprod = subset(FAOcorn, Element == 'Production',
select = c('Country', 'Year', 'Value'))
names(FAOprod)[3] = 'Production'
# to calculate yield in hectograms
FAOarea = merge(FAOarea, FAOprod, by = c('Country', 'Year'))
FAOarea$Yield = with(FAOarea, Production/Area*10000)
# Subsetting only the countries of interest (and years to match paper)
FAOarticle = subset(FAOarea, Country == 'United States of America' | Country == 'Western Europe')
# Plot with regression lines
ggplot(FAOarticle, aes(x = Year, y = Yield, color = Country)) +
geom_point() + stat_smooth(method = lm, fullrange = TRUE, alpha = 0.1) +
scale_y_continuous('Yield [hectograms/ha]', limits = c(0, 100000), labels = comma) +
theme(legend.position="top")

I could obtain pretty much the same regression model equations as in the article by expressing the years as deviation from 1960 as in:

# Expressing year as a deviation from 1960, so results
# match paper
FAOarticle$NewYear = with(FAOarticle, Year - 1960)
usa.lm = lm(Yield ~ NewYear, data = FAOarticle,
subset = Country == 'United States of America')
summary(usa.lm)
#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
# "United States of America")
#
#Residuals:
# Min 1Q Median 3Q Max
#-18435.4 -1958.3 338.3 3663.8 10311.0
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 38677.34 1736.92 22.27|t|)
#(Intercept) 31510.14 1665.90 18.91

Heinemann and collaborators then point out the following:

…the slope in yield increase by year is steeper in W. Europe (y?=?1344.2x?+?31512, R²?=?0.92084) than the United States (y?=?1173.8x?+?38677, R²?=?0.89093) from 1961 to 2010 (Figure 1). This shows that in recent years W. Europe has had similar and even slightly higher yields than the United States despite the latter’s use of GM varieties.

However, that interpretation using all data assumes that both ‘countries’ are using GMO all the time. An interesting thing is that USA and Western Europe were in different trends already before the introduction of GM corn. We can state that because we have some idea of when GM crops were introduced in the USA. This information is collected by the US Department of Agriculture in their June survey to growers and made publicly available at the State level (GMcornPenetration.csv):

This graph tells us that by the year 2000 the percentage of planted corn was way below 50% in most corn producing states (in fact, it was 25% at the country level). From that time on we have a steady increase reaching over 80% for most states by 2008. Given this, it probably makes sense to assume that, at the USA level, yield reflects non-GM corn until 1999 and progressively reflects the effect of GM genotypes from 2000 onwards. This division is somewhat arbitrary, but easy to implement.

We can repeat the previous analyzes limiting the data from 1961 until, say, 1999:

usa.lm2 = lm(Yield ~ NewYear, data = FAOarticle,
subset = Country == 'United States of America' & Year < 2000)
summary(usa.lm2)
#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
# "United States of America" & Year < 2000)
#
#Residuals:
# Min 1Q Median 3Q Max
#-17441 -2156 1123 3989 9878
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 39895.57 2084.81 19.14 < 2e-16 ***
#NewYear 1094.82 90.84 12.05 2.25e-14 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 6385 on 37 degrees of freedom
#Multiple R-squared: 0.797, Adjusted R-squared: 0.7915
#F-statistic: 145.2 on 1 and 37 DF, p-value: 2.245e-14
weu.lm2 = lm(Yield ~ NewYear, data = FAOarticle,
subset = Country == 'Western Europe' & Year < 2000)
summary(weu.lm2)
#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
# "Western Europe" & Year < 2000)
#
#Residuals:
# Min 1Q Median 3Q Max
#-10785 -3348 -34 3504 11117
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 29802.17 1813.79 16.43

These analyses indicate that Western Europe started with a lower yield than the USA (29,802.17 vs 39,895.57 hectograms/ha) and managed to increase yield much more quickly (1,454.48 vs 1,094.82 hectograms/ha per year) before any use of GM corn by the USA. Figure 1 shows a messy picture because there are numerous factors affecting yield each year (e.g. weather has a large influence). We can take averages for each decade and see how the two ‘countries’ are performing:

# Aggregating every decade.
# 2013-07-05 20:10 NZST I fixed the aggregation because it was averaging yields rather
# calculating total production and area for the decade and then calculating average yield
# Discussion points are totally valid
FAOarticle$Decade = cut(FAOarticle$Year,
breaks = seq(1959, 2019, 10),
labels = paste(seq(1960, 2010, 10), 's', sep = ''))
decadeProd = aggregate(Production ~ Country + Decade,
data = FAOarticle,
FUN = sum)
decadeArea = aggregate(Area ~ Country + Decade,
data = FAOarticle,
FUN = sum)
decadeYield = merge(decadeProd, decadeArea, by = c('Country', 'Decade'))
decadeYield$Yield = with(decadeYield, Production/Area*10000)
ggplot(decadeYield, aes(x = Decade, y = Yield, fill = Country)) +
geom_bar(stat = 'identity', position = 'dodge') +
scale_y_continuous('Yield [hectograms/ha]', expand = c(0, 0)) +
theme(legend.position="top")

This last figure requires more attention. We can again see that Western Europe starts with lower yields than the USA; however, it keeps on increasing those yields faster than USA, overtaking it during the 1990s. Again, all this change happened while both USA and Western Europe were not using GM corn. The situation reverses in the 2000s, when the USA overtakes Western Europe, while the USA continuously increased the percentage of GM corn. The last bar in Figure 3 is misleading because it includes a single year (2010) and we know that yields in USA went down in 2011 and 2012, affected by a very large drought (see Figure 4).

At least when looking at corn, I can’t say (with the same data available to Heinemann) that there is no place or need for GM genotypes. I do share some of his concerns with respect to the low level of diversity present in staple crops but, in contrast to his opinion, I envision a future for agriculture that includes large-scale operations (either GM or no-GM), as well as smaller operations (including organic ones). I’d like to finish with some optimism looking further back to yield, because the USDA National Agricultural Statistics Service keeps yield statistics for corn since 1866(!) (csv file), although it uses bizarre non-metric units (bushels/acre). As a metric boy, I converted to kilograms per hectare (multiplying by 62.77 from this page) and then to hectograms (100 g) multiplying by 10.

# Reading NASS corn data
NASS = read.csv('NASScorn.csv')
# Conversion to sensical units (see Iowa State Extension article)
# http://www.extension.iastate.edu/agdm/wholefarm/html/c6-80.html
NASS$Yield = with(NASS, Value*62.77*10)
# Average by decade
NASS$Decade = cut(NASS$Year,
breaks = seq(1859, 2019, 10),
labels = paste(seq(1860, 2010, 10), 's', sep = ''))
oldYield = aggregate(Yield ~ Decade, data = NASS, FUN = mean)
# Plotting
ggplot(oldYield, aes(x = Decade, y = Yield)) +
geom_bar(stat = 'identity') +
scale_y_continuous('Yield [hectograms]', expand = c(0, 0))

It is interesting to see that there was little change until the 1940s, with the advent of the Green Revolution (modern breeding techniques, fertilization, pesticides, etc.). The 2010s decade in Figure 4 includes 2010, 2011 and 2012, with the last two years reflecting extensive droughts. Drought tolerance is one of the most important traits in modern breeding programs.

^{‡} While Prof. Heinemann and myself work for the same university I don’t know him in person.

P.S. Did you know that Norman Borlaug (hero of mine) studied forestry?
P.S.2 Time permitting I’ll have a look at other crops later. I would have liked to test a regression with dummy variables for corn to account for pre-2000 and post-1999, but there are not yet many years to fit a decent model (considering natural variability between years). We’ll have to wait for that one.
P.S.3 I share some of Heinemann’s concerns relating to subsidies and other agricultural practices.
P.S.4 In case anyone is interested, I did write about a GM-fed pigs study not long ago.
P.S.5 2013-07-05 20:10 NZST. I updated Figures 1 and 3 to clearly express that yield was in hectograms/ha and recalculated average decade yield because it was originally averaging yields rather calculating total production and area for the decade and then calculating average yield. The discussion points I raised are still completely valid.
P.S.6 2013-07-07 11:30 NZST. The inclusion of Canadian canola does not make any sense because, as far as I know, Canada is not part of Western Europe or the US Midwest. This opens the inclusion of crops from any origin as far as the results are convenient for one’s argument.
P.S.7 2014-08-06 13:00 NZST My comment was expanded and published in the journal (or here for HTML version with comments on the authors’ reply to my comment.