## Quantum Forest

### notes in a shoebox

#### Category: rblogs (page 1 of 10)

I’ve been having a conversation for a while with @kamal_hothi and @aschiff on maps, schools, census, making NZ data available, etc. This post documents some basic steps I used for creating a map on ethnic diversity in schools at the census-area-unit level. This “el quicko” version requires 3 ingredients:

• Census area units shape files (available from Statistics New Zealand for free here).
• R with some spatial packages (also free).

We’ll read the school directory data, aggregate ethnicity information to calculate the Herfindahl–Hirschman Index of diversity and then plot it.

Then I moved to create a map in R, for the sake of it:

And we get a plot like this:

Ethnic diversity in schools at the Census Area Unit level (0 very diverse, 1 not diverse at all).

Just because it is Monday down under.

P.S. Using the diversity-index-census-area-unit.csv and joining it with the shapefile in QGIS one can get something prettier (I have to work on matching the color scales):

Similar map produced with point and click in QGIS.

Map rendering is so much faster in QGIS than in R! Clearly the system has been optimized for this user case.

After writing a blog post about the paper “Sustainability and innovation in staple crop production in the US Midwest” I decided to submit a formal comment to the International Journal of Agricultural Sustainability in July 2013, which was published today. As far as I know, Heinemann et al. provided a rebuttal to my comments, which I have not seen but that should be published soon. This post is an example on how we can use open data (in this case from the USDA and FAO) and free software (R) to participate in scientific discussion (see supplementary material below).

The text below the *** represents my author’s version provided as part of my Green Access rights. The article published in the International Journal of Agricultural Sustainability [copyright Taylor & Francis]; is freely available online at http://dx.doi.org/10.1080/14735903.2014.939842).

While I had many issues with the original article, I decided to focus on three problems—to make the submission brief and fit under the 1,000 words limit enforced by the journal editor. The first point I make is a summary of my previous post on the article, and then move on to two big problems: assuming that only the choice of biotechnology affects yield (making the comparison between USA and Western Europe inadequate) and comparing use of agrochemicals at the wrong scale (national- versus crop-level).

PS 2014-08-05 16:30: Heinemann et al.’s reply to my comment is now available.

• They state that “Apiolaza did not report variability in the annual yield data” and that they had to ‘reconstruct’ my figure. Firstly, the published figure 2 does include error bars and, secondly, the data and R code are available as supplementary information (in contrast to their original paper). Let’s blame the journal for not passing this information to them.
• They also include 2 more years of data (2011 & 2012), although the whole point of my comment is that the conclusions they derived with the original data set were not justified. Heinemann et al. are right in that yields in 2011 & 2012 were higher in Western Europe than in the USA (the latter suffering a huge drought); however, in 2013 it was the opposite: 99,695 Hg/ha for USA vs 83,724 Hg/ha for Western Europe.
• They question that I commented only on one crop, although I did cover another crop (wheat) quickly with respect to the use of pesticides—but not with the detail I wanted, as there was a 1,000 words limit. I would have also mentioned the higher variability for Western European wheat production, as it is weird that they pointed out that high variability is a problems for USA maize production but said nothing for wheat in Europe.
• Furthermore, they also claimed “There is no statistically significant difference in the means over the entire 50-year period” however, a naive paired t-test t.test(FAOarticle$Yield[1:50], FAOarticle$Yield[51:100], paired = TRUE) (see full code below) says otherwise.

***

## Abstract

This comment highlight issues when comparing genetically modified (GM) crops to non-GM ones across countries. Ignoring structural differences between agricultural sectors and assuming common yield trajectories before the time of introduction of GM crops results on misestimating the effect of GM varieties. Further data collection and analyses should guide policy-makers to encourage diverse approaches to agriculture, rather than excluding specific technologies (like GM crops) from the onset.

Keywords: genetic modification; biotechnology; productivity; economics

In a recent article Heinemann et al. (2013) focused “on the US staple crop agrobiodiversity, particularly maize” using the contrast between the yield of Western Europe and United States as a proxy for the comparison between genetically modified (GM) maize versus non-GM maize. They found no yield benefit from using GM maize when comparing the United States to Western Europe.

In addition, Heinemann et al. contrasted wheat yields across United States and Western Europe to highlight the superiority of the European biotechnological package from a sustainability viewpoint.

I am compelled to comment on two aspects that led the authors to draw incorrect conclusions on these issues. My statistical code and data are available as supplementary material.

## 1. Misestimating the effect of GM maize varieties

Heinemann et al. used FAO data[], from 1961 to 2010 inclusive, to fit linear models with yield as the response variable, country and year as predictors. Based on this analysis they concluded, “W. Europe has benefitted from the same, or marginally greater, yield increases without GM”. However, this assumes a common yield trajectory for United States and Western Europe before significant commercial use of GM maize, conflating GM and non-GM yields. GM maize adoption in United States has continually increased from 25% of area of maize planted in 2000 to the current 90% (Figure 1, United States Department of Agriculture 2013[]).

Figure 1: Adoption of GM maize in United States, expressed as percentage of total maize planted area (click to enlarge).

If we fit a linear model from 1961 to 1999 (last year with less than 25% area of GM maize) we obtain the following regression equations $$y = 1094.8 x + 39895.6$$ (United States, R2 = 0.80) and $$y = 1454.5 x + 29802.2$$ (W. Europe, R2 = 0.90). This means that Western Europe started with a considerably lower yield than the USA (29,802.2 vs 39,895.6 hg/ha) in 1961 but increased yields faster than USA (1,454.5 vs 1,094.8 hg/ha per year) before substantial use of GM maize. By 1999 yield in Western Europe was superior to that in United States.

This is even more evident in Figure 2, which shows average yield per decade, removing year-to-year extraneous variation (e.g. due to weather). Western European yields surpassed United States’s during the 1990s (Figure 2). This trend reverses in the 2000s, while United States simultaneously increased the percentage of planted area with GM maize, directly contradicting Heinemann et al.’s claim.

Figure 2: Average maize yield (and standard error) per decade for United States and Western Europe. The 2010s include a single year to replicate the original data set (click to enlarge).

## 2. Ignoring structural differences between agricultural sectors

When discussing non-GM crops using wheat the authors state, “the combination of biotechnologies used by W. Europe is demonstrating greater productivity than the combination used by the United States”. This sentence summarizes one of the central problems of their article: assuming that, if it were not for the choice of biotechnology bundle, the agricultural sectors would have the same intrinsic yield, making them comparable. However, many inputs beside biotechnology affect yield. For example, Neumann et al. (2010) studied the spatial distribution of yield and found that in the Unites States “access can explain most of the variability in wheat efficiency. In the more remote regions land prices are lower and inputs are therefore often substituted by land leading to lower efficiencies”. Lower yields in United States make sense from an economic point of view, as land replaces more expensive inputs like agrochemicals.

Heinemann et al. support their case by comparing pesticide use between United States and France across all crops. However, what is relevant to the discussion is pesticide use for the crops being compared. European cereals, and wheat in particular, are the most widely fungicide-treated group of crops worldwide (Kucket al. 2012). For example, 27% of the wheat planted area in France was already treated with fungicides by 1979 (Jenkins and Lescar 1980). More than 30 years later in the United States this figure has reached only 19% for winter wheat (which accounts for 70% of planted area, NASS 2013). Fungicide applications result on higher yield responses (Oerke 2006).

## Final remarks

Heinemann et al. ignored available data on GM adoption when analysing maize yields. They also mistakenly treated biotechnological bundles as the only/main explanation for non-GMO yield differences between United States and Western Europe. These issues mean that the thrust of their general conclusion is unsupported by the available evidence. Nevertheless, their article also raised issues that deserve more consideration; e.g. the roles of agricultural subsidies and market concentration on food security.

Agricultural sustainability requires carefully matching options in the biotechnology portfolio to site-specific economic, environmental and cultural constraints. Further data collection and analyses should lead policy makers to encourage diverse approaches to agriculture, rather than excluding specific technologies (like GMOs and pesticides) from the onset.

## References

Heinemann, J. A., Massaro, M., Coray, D. S., Agapito-Tenfen, S. Z. and Wen, J. D. 2013. Sustainability and innovation in staple crop production in the US Midwest. International Journal of Agricultural Sustainability (available here).

Jenkins, J. E. E. and Lescar, L. 1980. Use of foliar fungicides on cereals in Western Europe. Plant Disease, 64(11): 987-994 (behind paywall).

Kuck, K. H., Leadbeater, A. and Gisi, U. 2012. FRAC Mode of Action Classification and Resistance Risk of Fungicides. In: Krämer, W., Schirmer, U., Jeschke, P. and Witschel, M., eds., Modern Crop Protection Compounds. Wiley. 539-567.

NASS, 2013. Highlights: 2012 Agricultural Chemical Use Survey. Wheat. United States Department of Agriculture (available here).

Neumann, K., Verburg, P. H., Stehfest, E., and Müller, C. 2010. The yield gap of global grain production: a spatial analysis. Agricultural Systems, 103(5), 316–326 (behind paywall).

Oerke E. 2006. Crop losses to pests. Journal of Agriculture Science, 144: 31-43 (behind paywall, or Free PDF).

United States Department of Agriculture. 2013. Adoption of Genetically Engineered Crops in the U.S. USDA Economic Research Service (available here).

## Supplementary material

You can replicate the analyses and plots produced in this comment using the following files:

• Maize production data for United States and Western Europe (csv, extracted from FAO). []
• GMO maize penetration data (csv, extracted from USDA). []
• R code for analyses (R file, changed extension to .txt so WordPress would not complain).

I’m the first to acknowledge that most of my code could run faster. The truth of the matter is that, in essence, I write ‘quickies’: code that will run once or twice, so there is no incentive to spend days or hours in shaving seconds of a computation. Most analyses of research data fall in to this approach: read data-clean data-fit model-check model-be happy-write article-(perhaps) make data and code available-move on with life.

One of the reasons why my code doesn’t run faster or uses less memory is the trade-off between the cost of my time (very high) compared to the cost of more memory or faster processors (very cheap) and the gains of shaving a few seconds or minutes of computer time, which tend to be fairly little.

In R vectorization is faster than working with each vector element, although it implies allocating memory for whole vectors and matrices, which for large-enough problems may become prohibitively expensive. On the other hand, not vectorizing some operations may turn your problem into an excruciatingly slow exercise and, for example in large simulations, in practice intractable in a useful timeframe.

Dealing with vectors and matrices is like dealing with many chairs simultaneously. Some patterns and operations are easier and faster than others. (Photo: Luis, click to enlarge).

John Muschelli wrote an interesting post reimplementing 2×2 frequency tables for a highly specific use, comparing the results of image processing algorithms. In John’s case, there are two greater-than-9-million-long-elements logical vectors, and when comparing vectors for many images the process becomes very slow. He explains part of his rationale in his blog (go and read it), but say that his solution can be expressed like this:

which uses 1/4 of the time used by table(manual, auto). Mission accomplished! However, if I stopped here this blog post would not make much sense, simply rehashing (pun intended) John’s code. The point is to explain what is going on and, perhaps, to find even faster ways of performing the calculations. As a start, we have to be aware that the calculations in logical.tab() rely on logical (boolean) operations and on the coercion of logical vectors to numerical values, as stated in the documentation:

Logical vectors are coerced to integer vectors in contexts where a numerical value is required, with TRUE being mapped to 1L, FALSE to 0L and NA to NA_integer_.

In R Logical operations can be slower than mathematical ones, a consideration that may guide us to think of the problem in a slightly different way. For example, take the difference between the vectors (dif = x - y), so both TRUE-TRUE (1-1) and FALSE-FALSE (0-0) are 0, while TRUE - FALSE (1-0) is 1 and FALSE - TRUE (0-1) is -1. Therefore:

• the sum of positive values (sum(dif > 0)) is the frequency of TRUE & FALSE,
• while the sum of negative values (sum(dif < 0)) is the frequency of FALSE & TRUE.

The values for TRUE & TRUE can be obtained by adding up the element-wise multiplication of the two vectors, as TRUE*TRUE (1*1) is the only product that's different from zero. A vectorized way of performing this operation would be to use t(x) %*% y; however, for large vectors the implementation crossprod(x, y) is faster. The values for FALSE & FALSE are simply the difference of the length of the vectors and the previously calculated frequencies length(dif) - tt - tf - ft. Putting it all together:

This code takes 1/20 of the time taken by table(x, y). An interesting result is that the crossproduct crossprod(x, y) can also be expressed as sum(x * y), which I didn't expect to be faster but, hey, it is. So we can express the code as:

to get roughly 1/22 of the time. The cost of logical operations is easier to see if we isolate particular calculations in logical.tab and basic.tab; for example,

is slower than

This also got me thinking of the cost of coercion: Would it take long? In fact, coercing logical vectors to numeric has little (at least I couldn't see from a practical viewpoint) if any cost. In some cases relying on logical vectors converted using as.numeric() seems to be detrimental on terms of speed.

As I mentioned at the beginning, vectorization uses plenty of memory, so if we were constrained in that front and we wanted to do a single pass on the data we could write an explicit loop:

The loopy.tab does only one pass over the vectors and should use less memory as it doesn't need to create those huge 10M elements vectors all the time (at least it would if we used a proper iterator in the loop instead of iterating on a vector of length 10M (that's 40 MB big in this case). The iterators package may help here). We save room/memory at the expense of speed, as loopy.tab is ten times slower than the original table() function. Of course one could run it a lot faster if implemented in another language like C++ or Fortran and here Rcpp or Rcpp11 would come handy (updated! See below).

This is only a not-so-short way of reminding myself what's going on when trading-off memory, execution speed and my personal time. Thus, I am not saying that any of these functions is the best possible solution, but playing with ideas and constraints that one often considers when writing code. Incidentally, an easy way to compare the execution time of these functions is using the microbenchmark library. For example:

will spit numbers when running a couple of functions 1,000 times with results that apply to your specific system.

PS 2014-07-12 11:08 NZST Hadley Wickham suggests using tabulate(manual + auto * 2 + 1, 4) as a fast alternative. Nevertheless, I would like to point out that i- the initial tabulation problem is only an excuse to discuss a subset of performance issues when working with R and ii- this post is more about the journey than the destination.

PS 2014-07-13 20:32 NZST Links to two posts related to this one:

• Wiekvoet's work which i- relies on marginals and ii- reduces the use of logical operators even more than my basic.tab2(), simultaneously taking around 1/4 of the time.
• Yin Zhu's post describing vectors is a handy reminder of the basic building blocks used in this post.

### Updated with Rcpp goodness!

PS 2014-07-13 21:45 NZST I browsed the Rcpp Gallery and Hadley Wickham's Rcpp tutorial and quickly converted loopy.tab() to C++. Being a bit of an oldie I've used Fortran (90, not that old) before but never C++, so the following code is probably not very idiomatic.

Loops and all it runs roughly twice as fast as basic.tab2() but it should also use much less memory.

The Swarm Lab presents a nice comparison of R and Python code for a simple (read ‘one could do it in Excel’) problem. The example works, but I was surprised by how wordy the R code was and decided to check if one could easily produce a shorter version.

The beginning is pretty much the same, although I’ll use ggplot2 rather than lattice, because it will be a lot easier (and shorter) to get the desired appearance for the plots:

The whole example relies on only three variables and—as I am not great at typing—I tend to work with shorter variable names. I directly changed the names for variables 1 to 3:

It is a lot easier to compare the regression lines if we change the shape of the data set from wide to long, where there is one variable for year, one for tax type, and one for the actual tax rate. It would be possible to use one of Hadley’s packages to get a simpler syntax for this, but I decided to stick to the minimum set of requirements:

And now we are ready to produce the plots. The first one can be a rough cut to see if we get the right elements:

First cut of the taxes per year plot.

Yes, this one has the points, lines, linear regression and 95% confidence intervals for the mean predicted responses, but we still need to get rid of the grey background and get black labels (theme_bw()), set the right axis labels and ticks (scale_x... scale_y...) and set the right color palette for points and lines (scale_colour_manual) and filling the confidence intervals (scale_colour_fill) like so:

Way closer to the desired plot, still much shorter.

One can still change font sizes to match the original plots, reposition the legend, change the aspect ratio while saving the png graphs (all simple statements) but you get the idea. If now we move to fitting the regression lines:

This gives the regression coefficients for Corporate (3.45 – 1.564e-04 year) and Individual ([3.45 + 4.41] + [-1.564e-04 + 1.822e-04] year or 7.84 + 2.58e-05 year). As a bonus you get the comparison between regression lines.

In R as a second language I pointed out that ‘brevity reduces the time between thinking and implementation, so we can move on and keep on trying new ideas’. Some times it seriously does.

Imagine that you are studying English as a second language; you learn the basic rules, some vocabulary and start writing sentences. After a little while, it is very likely that you’ll write grammatically correct sentences that no native speaker would use. You’d be following the formalisms but ignoring culture, idioms, slang and patterns of effective use.

R is a language and any newcomers, particularly if they already know another programming language, will struggle at the beginning to get what is beyond the formal grammar and vocabulary. I use R for inquisition: testing ideas, data exploration, visualization; under this setting, the easiest is to perform a task the more likely is one going to do it. It is possible to use several other languages for this but—and I think this is an important but—R’s brevity reduces the time between thinking and implementation, so we can move on and keep on trying new ideas.

A typical example is when we want to repeat something or iterate over a collection of elements. In most languages if one wants to do something many times the obvious way is using a loop (coded like, for() or while()). It is possible to use a for() loop in R but many times is the wrong tool for the job, as it increases the lag between thought and code, moving us away from ‘the flow’.

How apply loops around a matrix or data frame, doing its business for all rows [1] or columns [2] (Shaky handwriting and all).

One of the distinctive features of R is that there is already a lot of functionality available for jobs that occur frequently in data analysis. The easiest is to perform a task the more likely is one going to do it, which is perfect if one is exploring/thinking about data.

Thomas Lumley reminded me of the ACM citation for John Chambers—father of S of which R is an implementation—which stated that Chambers’s work:

…will forever alter the way people analyze, visualize, and manipulate data . . . S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers.

If I could summarize the relevance of R in a Tweetable phrase (with hash tags and everything) it would be:

Most data analysis languages underestimate the importance of interactivity/low barrier to exploration. That’s where #Rstats shines.

One could run statistical analyses with many languages (including generic ones), but to provide the right level of interactivity for analysis, visualization and data manipulation one ends up creating functions that, almost invariably, look a bit like R; pandas in Python, for example.

There are some complications with some of the design decisions in R, especially when we get down to consistency which begets memorability. A glaring example is the apply family of functions and here is where master opportunist (in the positive sense of expert at finding good opportunities) Hadley Wickham made sense out of confusion in his package plyr.

There is also a tension in languages under considerable use because speakers/writers/analysts/coders start adapting them to new situations, adding words and turns of phrase. Look at English for an example! This is also happening to R and some people wish the language looked different in some non-trivial ways. A couple of examples: Coffeescript for R and Rasmus Bååth’s suggestions. Not all of them can be implemented, but suggestions like this speak of the success of R.

If you are struggling to start working with R, as with other languages, first let go. The key to learning and working with a new language is immersing yourself in it; even better if you do it with people who already speak it.

Just to be clear, there are several good statistical languages. However, none is as supportive of rapid inquisition as R (IMO). It is not unusual to develop models in one language (e.g. R) and implement it in another for operational purposes (e.g. SAS, Python, whatever).

The first thing I admire about Hadley is his ‘good eye’ for finding points of friction. The second one is doing something about the frictions, often with very good taste.

P.S. It should come clear from this post that English is indeed my second language.

I teach several courses every year and the most difficult to pull off is FORE224/STAT202: regression modeling.

The academic promotion application form in my university includes a section on one’s ‘teaching philosophy’. I struggle with that part because I suspect I lack anything as grandiose as a philosophy when teaching: as most university lecturers I never studied teaching, although I try to do my best. If anything, I can say that I enjoy teaching and helping students to ‘get it’ and that I want to instill a sense of ‘statistics is fun’ in them. I spend quite a bit of time looking for memorable examples, linking to stats in the news (statschat and listening the news while walking my dog are very helpful here) and collecting data. But a philosophy? Don’t think so.

One of the hardest parts of the course is the diversity of student backgrounds. Hitting the right level, the right tone is very hard. Make it too easy and the 1/5 to 1/4 of students with a good mathematical background will hate it; they may even decide to abandon any intention of continuing doing stats if ‘that’s all there is about the topic’. Make it too complicated and half the class will fail and/or hate the content.

Part of the problem is based around what we mean by teaching ‘statistics’. In some cases it seems limited to what specific software does; for example, teaching with Excel means restriction to whatever models are covered in Excel’s Data Analysis Toolpak (DAT). The next choice when teaching is using menu-driven software (e.g. SPSS), which provides much more statistical functionality than Excel + DAT, at the expense of being further removed from common usability conventions. At the other extreme of simplicity is software that requires coding to control the analyses (e.g. R or SAS). In general, the more control we want, the more we have to learn to achieve it.

A while ago I made a distinction between the different levels of learning (user cases) when teaching statistics. In summary, we had i- very few students getting in to statistics and heavy duty coding, ii- a slightly larger group that will use stats while in a while and iii- the majority that will mostly consume statistics. I feel a duty towards the three groups, while admitting that I have predilection for the first one. Nevertheless, the third group provides most of the challenges and need for thinking about how to teach the subject.

When teaching linear models (general form $$y = X \beta + \epsilon$$) we tend to compartmentalize content: we have an ANOVA course if the design matrix $$X$$ represents categorical predictors (contains only 1s and 0s), a regression course if $$X$$ is full of continuous predictors and we talk about ANCOVA or regression on dummy variables if $$X$$ is a combination of both. The use of different functions for different contents of $$X$$ (for example aov() versus lm() in R or proc reg versus proc glm in SAS) further consolidates the distinction. Even when using menus, software tends to guide students through different submenus depending on the type of $$X$$.

Gliding in a hierarchy of models (Photo: Luis, click to enlarge).

At the beginning of the course we restrict ourselves to $$X$$ full of continuous predictors, but we introduce the notion of matrices with small examples. This permits showing the connection between all the linear model courses (because a rose by any other name…) and it also allows deriving a general expression of the formulas for the regression coefficients (essential for the most advanced students). Slower students may struggle with some of this material; however, working with small examples they can replicate the results from R (or Excel or SAS or whatever one uses to teach). Some times they even think it is cool.

Here is where the model.matrix() R function becomes handy; rather than building incidence matrices by hand—which is easy for tiny examples—we can get the matrices used by the lm() function to then calculate regression parameters (and any other output) for more complex models.

Once students get the idea that on matrix terms our teaching compartments are pretty much the same, we can reinforce the idea by using a single function (or proc) to show that we can obtain all the bits and pieces that make up what we call ‘fitting the model’. This highlights the idea that ANOVA, ANCOVA & regression are subsets of linear models, which are subsets of linear mixed models, which are subsets of generalized linear mixed models. A statistical Russian doll.

We want students to understand, some times so badly that we lower the bar to a point where there is no much to understand. Here is the tricky part, finding the right level of detail so all types of students learn to enjoy the topic, although at different levels of understanding.

There is software that generates code from menus too, like Stata or Genstat.

P.S. This is part of my thinking aloud with hesitation about teaching, as in Statistics unplugged, Excel, fanaticism and R, Split-plot 1: How does a linear mixed model look like?, R, academia and the democratization of statistics, Mid-January flotsam: teaching edition & Teaching with R: the switch. I am always looking for better ways of transferring knowledge.

How much does statistical software help and how much it interferes when teaching statistical concepts? Software used in the practice of statistics (say R, SAS, Stata, etc) brings to the party a mental model that it’s often alien to students, while being highly optimized for practitioners. It is possible to introduce a minimum of distraction while focusing on teaching concepts, although it requires careful choice of a subset of functionality. Almost invariably some students get stuck with the software and everything goes downhill from there; the student moved from struggling with a concept to struggling with syntax (Do I use a parenthesis here?).

I am a big fan of Tim Bell’s Computer Science Unplugged, a program for teaching Computer Science’s ideas at primary and secondary school without using computers (see example videos).

Here is an example video for public key encryption:

This type of instruction makes me question both how we teach statistics and at what level we can start teaching statistics. The good news is that the New Zealand school curriculum includes statistics in secondary school, for which there is increasing number of resources. However, I think we could be targeting students even earlier.

This year my wife was helping primary school students participating in a science fair and I ended up volunteering to introduce them to some basic concepts so they could design their own experiments. Students got the idea of the need for replication, randomization, etc based on a simple question: Did one of them have special powers to guess the result of flipping a coin? (Of course this is Fisher’s tea-drinking-lady-experiment, but no 10 year old cares about tea, while at least some of them care about super powers). After the discussion one of them ran a very cool experiment on the effect of liquefaction on the growth of native grasses (very pertinent in post-earthquake Christchurch), with 20 replicates (pots) for each treatment. He got the concepts behind the experiment; software just entered the scene when we needed to confirm our understanding of the results in a visual way:

Seven-week growth of native grasses with three proportions of liquefied soil. T1: pure liquefaction, T2: 50% liquefaction, 50% normal soil, T3: pure normal soil.

People tell me that teaching stats without a computer is like teaching chemistry without a lab or doing astronomy without a telescope, or… you get the idea. At the same time, there are some books that describe some class activities that do not need a computer; e.g. Gelman’s Teaching Statistics: A Bag of Tricks. (Incidentally, why is that book so friggin’ expensive?)

## Back to uni

Back from primary school kiddies to a regression course at university. Let’s say that we have two variables, x & y, and that we want to regress y (response) on x (predictor) and get diagnostic plots. In R we could simulate some data and plot the relationship using something like this:

Typical simple linear regression scatterplot.

We can the fit the linear regression and get some diagnostic plots using:

Typical diagnostic plot for simple linear regression model. What’s the meaning of the fourth plot (lower right)?

If we ask students to explain the 4th plot—which displays discrepancy (how far a point is from the general trend) on leverage (how far is a point from the center of mass, pivot of the regression)—many of them will struggle to say what is going on in that plot. At that moment one could go and calculate the Hat matrix of the regression ($$X (X’X)^{-1} X’$$) and get leverage from the diagonal, etc and students will get a foggy idea. Another, probably better, option is to present the issue as a physical system on which students already have experience. A good candidate for physical system is using a seesaw, because many (perhaps most) students experienced playing in one as children.

Take your students to a playground (luckily there is one next to uni), get them playing with a seesaw. The influence of a point is related to the product of leverage (how far from the pivot we are applying force) and discrepancy (how big is the force applied). The influence of a point on the estimated regression coefficients will be very large when we apply a strong force far from the pivot (as in our point y[100]), just as it happens in a seesaw. We can apply lots of force (discrepancy) near the pivot (as in our point [y[50]) and little will happen. Students like mucking around with the seesaw and, more importantly, they remember.

Compulsory seesaw picture (source Wikipedia).

Analogy can go only so far. Some times a physical analogy like a quincunx (to demonstrate the central limit theorem) ends up being more confusing than using an example with variables that are more meaningful for students.

I don’t know what is the maximum proportion of course content that could be replaced by using props, experiments, animations, software specifically designed to make a point (rather than to run analysis), etc. I do know that we still need to introduce ‘proper’ statistical software—at some point students have to face praxis. Nevertheless, developing an intuitive understanding is vital to move from performing monkeys; that is, people clicking on menus or going over the motion of copy/pasting code without understanding what’s going on in the analyses.

I’d like to hear if you have any favorite demos/props/etc when explaining statistical concepts.

P.S. In this post I don’t care if you love stats software, but I specifically care about helping learners who struggle understanding concepts.

I wanted to develop a small experiment with a front end using the Processing language and the backend calculations in R; the reason why will be another post. This post explained the steps assuming that one already has R and Processing installed:

1. Install the Rserve package. This has to be done from source (e.g. using R CMD INSTALL packagename).
2. Download Rserve jar files and include them in the Processing sketch.

For example, this generates 100 normal distributed random numbers in R and then sorts them (code copy and pasted from second link):

The problem is that this didn’t work, because my OS X (I use macs) R installation didn’t have shared libraries. My not-so-quick solution was to compile R from source, which involved:

1. Downloading R source. I went for the latest stable version, but I could have gone for the development one.
2. Setting up the latest version of C and Fortran compilers. I did have an outdated version of Xcode in my macbook air, but decided to delete it because i- uses many GB of room in a small drive and ii- it’s a monster download. Instead I went for Apple’s Command Line Tools, which is a small fraction of size and do the job.
3. In the case of gfortran, there are many sites pointing to this page that hosts a fairly outdated version, which was giving me all sorts of problems (e.g. “checking for Fortran 77 name-mangling scheme”) because the versions between the C and Fortran compilers were out of whack. Instead, I downloaded the latest version from the GNU site.
4. Changing the config.site file in a few places, ensuring that I had:

Then compiled using (didn’t want X11 and enabling shared library):

And finally installed using:

This used a prefix because I didn’t want to replace my fully functioning R installation, but just having another one with shared libraries. If one types R in terminal then it is still calling the old version; the new one is called via /luis/compiled/R.framework/Versions/Current/Resources/bin/R. I then installed Rserve in the new version and was able to call R from processing so I could obtain.

A ‘hello world’ of the calling R from Processing world.

Now I can move to what I really wanted to do. File under stuff-that-I-may-need-to-remember-one-day.

This week I’ve been feeling tired of excessive fanaticism (or zealotry) of open source software (OSS) and R in general. I do use a fair amount of OSS and pushed for the adoption of R in our courses; in fact, I do think OSS is a Good ThingTM. I do not like, however, constant yabbering on why using exclusively OSS in science is a good idea and the reduction of science to repeatability and computability (both of which I covered in my previous post). I also dislike the snobbery of ‘you shall use R and not Excel at all, because the latter is evil’ (going back ages).

We often have several experiments running during the year and most of the time we do not bother setting up a data base to keep data. Doing that would essentially mean that I would have to do it, and I have a few things more important to do. Therefore, many data sets end up in… (drum roll here) Microsoft Excel.

How should a researcher setup data in Excel? Rather than reinventing the wheel, I’ll use a(n) (im)perfect diagram that I found years ago in a Genstat manual.

Suggested sane data setup in a spreadsheet.

I like it because:

• It makes clear how to setup the experimental and/or sampling structure; one can handle any design with enough columns.
• It also manages any number of traits assessed in the experimental units.
• It contains metadata in the first few rows, which can be easily skipped when reading the file. I normally convert Excel files to text and then I skip the first few lines (using skip in R or firstobs in SAS).

People doing data analysis often start convulsing at the mention of Excel; personally, I deeply dislike it for analyses but it makes data entry very easy, and even a monkey can understand how to use it (I’ve seen them typing, I swear). The secret for sane use is to use Excel only for data entry; any data manipulation (subsetting, merging, derived variables, etc.) or analysis is done in statistical software (I use either R or SAS for general statistics, ASReml for quantitative genetics).

It is far from a perfect solution but it fits in the realm of the possible and, considering all my work responsibilities, it’s a reasonable use of my time. Would it be possible that someone makes a weird change in the spreadsheet? Yes. Could you fart while moving the mouse and create a non-obvious side effect? Yes, I guess so. Will it make your life easier, and make possible to complete your research projects? Yes sir!

P.S. One could even save data using a text-based format (e.g. csv, tab-delimited) and use Excel only as a front-end for data entry. Other spreadsheets are of course equally useful.

P.S.2. Some of my data are machine-generated (e.g. by acoustic scanners and NIR spectroscopy) and get dumped by the machine in a separate—usually very wide; for example 2000 columns—text file for each sample. I never put them in Excel, but read them directly (a directory-full of them) in to R for manipulation and analysis.

As an interesting aside, the post A summary of the evidence that most published research is false provides a good summary for the need to freak out about repeatability.

Man flu kept me at home today, so I decided to do something ‘useful’ and go for a linkathon:

Sometimes people are truthful and cruel. Here Gappy on a mission goes for the jugular:

Over and out.