Teaching linear models

I teach several courses every year and the most difficult to pull off is FORE224/STAT202: regression modeling.

The academic promotion application form in my university includes a section on one’s ‘teaching philosophy’. I struggle with that part because I suspect I lack anything as grandiose as a philosophy when teaching: like most university lecturers, I never studied teaching, although I try to do my best. If anything, I can say that I enjoy teaching and helping students to ‘get it’, and that I want to instill a sense of ‘statistics is fun’ in them. I spend quite a bit of time looking for memorable examples, linking to stats in the news (statschat and listening to the news while walking my dog are very helpful here) and collecting data. But a philosophy? Don’t think so.

One of the hardest parts of the course is the diversity of student backgrounds. Hitting the right level and the right tone is very hard. Make it too easy and the 1/5 to 1/4 of students with a good mathematical background will hate it; they may even decide to abandon any intention of continuing with stats if ‘that’s all there is to the topic’. Make it too complicated and half the class will fail and/or hate the content.

Part of the problem is what we mean by teaching ‘statistics’. In some cases it seems limited to what specific software does; for example, teaching with Excel means restricting ourselves to whatever models are covered by Excel’s Data Analysis Toolpak (DAT). The next choice when teaching is menu-driven software (e.g. SPSS), which provides much more statistical functionality than Excel + DAT, at the expense of being further removed from common usability conventions. At the opposite extreme is software that requires coding to control the analyses (e.g. R or SAS). In general, the more control we want, the more we have to learn to achieve it.

A while ago I made a distinction between the different levels of learning (user cases) when teaching statistics. In summary, we had i- very few students getting into statistics and heavy-duty coding, ii- a slightly larger group that will use stats once in a while and iii- the majority that will mostly consume statistics. I feel a duty towards the three groups, while admitting that I have a predilection for the first one. Nevertheless, the third group provides most of the challenges and most of the need to think about how to teach the subject.

When teaching linear models (general form \(y = X \beta + \epsilon\)) we tend to compartmentalize content: we have an ANOVA course if the design matrix \(X\) represents categorical predictors (contains only 1s and 0s), a regression course if \(X\) is full of continuous predictors and we talk about ANCOVA or regression on dummy variables if \(X\) is a combination of both. The use of different functions for different contents of \(X\) (for example aov() versus lm() in R or proc reg versus proc glm in SAS) further consolidates the distinction. Even when using menus, software tends to guide students through different submenus depending on the type of \(X\).

Gliding in a hierarchy of models (Photo: Luis, click to enlarge).

At the beginning of the course we restrict ourselves to \(X\) full of continuous predictors, but we introduce the notion of matrices with small examples. This permits showing the connection between all the linear model courses (because a rose by any other name…) and it also allows deriving a general expression for the regression coefficients (essential for the most advanced students). Slower students may struggle with some of this material; however, working with small examples they can replicate the results from R (or Excel or SAS or whatever one uses to teach). Sometimes they even think it is cool.

Here is where the model.matrix() R function comes in handy; rather than building incidence matrices by hand—which is easy for tiny examples—we can get the matrices used by the lm() function and then calculate the regression parameters (and any other output) for more complex models.
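
As a minimal sketch of the idea (a toy data set of my own, not course data), we can pull out the design matrix that lm() builds and recover the coefficients with the usual \((X'X)^{-1} X'y\) expression:

# Toy data: one continuous and one categorical predictor
dat <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  trt = factor(rep(c('a', 'b'), 3)),
                  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.3))

# Design matrix that lm() would build internally
X <- model.matrix(~ x + trt, data = dat)
y <- dat$y

# Coefficients from the normal equations...
solve(t(X) %*% X) %*% t(X) %*% y

# ...match the ones reported by lm()
coef(lm(y ~ x + trt, data = dat))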

Once students get the idea that on matrix terms our teaching compartments are pretty much the same, we can reinforce the idea by using a single function (or proc) to show that we can obtain all the bits and pieces that make up what we call ‘fitting the model’. This highlights the idea that ANOVA, ANCOVA & regression are subsets of linear models, which are subsets of linear mixed models, which are subsets of generalized linear mixed models. A statistical Russian doll.
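
As an illustration (again a toy example of mine, not course material), lm() happily takes a purely categorical predictor and the same fit gives us both the ‘regression’ and the ‘ANOVA’ views:

# Purely categorical predictor, still plain lm()
dat2 <- data.frame(trt = factor(rep(c('a', 'b', 'c'), each = 4)),
                   y = c(5.1, 4.8, 5.5, 5.0, 6.9, 7.2, 6.6, 7.0,
                         9.1, 8.7, 9.4, 9.0))

m.cat <- lm(y ~ trt, data = dat2)
summary(m.cat)   # regression view: coefficients for the dummy variables
anova(m.cat)     # ANOVA view: the familiar F table, from the very same fit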

We want students to understand, sometimes so badly that we lower the bar to a point where there is not much left to understand. Here is the tricky part: finding the right level of detail so all types of students learn to enjoy the topic, although at different levels of understanding.

There is software that generates code from menus too, like Stata or Genstat.

P.S. This is part of my thinking aloud with hesitation about teaching, as in Statistics unplugged, Excel, fanaticism and R, Split-plot 1: How does a linear mixed model look like?, R, academia and the democratization of statistics, Mid-January flotsam: teaching edition & Teaching with R: the switch. I am always looking for better ways of transferring knowledge.

Statistics unplugged

How much does statistical software help and how much does it interfere when teaching statistical concepts? Software used in the practice of statistics (say R, SAS, Stata, etc.) brings to the party a mental model that is often alien to students, while being highly optimized for practitioners. It is possible to introduce a minimum of distraction while focusing on teaching concepts, although it requires careful choice of a subset of functionality. Almost invariably some students get stuck with the software and everything goes downhill from there; the student moves from struggling with a concept to struggling with syntax (Do I use a parenthesis here?).

I am a big fan of Tim Bell’s Computer Science Unplugged, a program for teaching Computer Science’s ideas at primary and secondary school without using computers (see example videos).

Here is an example video for public key encryption:

This type of instruction makes me question both how we teach statistics and at what level we can start teaching statistics. The good news is that the New Zealand school curriculum includes statistics in secondary school, for which there is an increasing number of resources. However, I think we could be targeting students even earlier.

This year my wife was helping primary school students participating in a science fair and I ended up volunteering to introduce them to some basic concepts so they could design their own experiments. Students got the idea of the need for replication, randomization, etc. based on a simple question: did one of them have special powers to guess the result of flipping a coin? (Of course this is Fisher’s tea-tasting lady experiment, but no 10-year-old cares about tea, while at least some of them care about superpowers.) After the discussion one of them ran a very cool experiment on the effect of liquefaction on the growth of native grasses (very pertinent in post-earthquake Christchurch), with 20 replicates (pots) for each treatment. He got the concepts behind the experiment; software only entered the scene when we needed to confirm our understanding of the results in a visual way:

Seven-week growth of native grasses with three proportions of liquefied soil. T1: pure liquefaction, T2: 50% liquefaction, 50% normal soil, T3: pure normal soil.

People tell me that teaching stats without a computer is like teaching chemistry without a lab or doing astronomy without a telescope, or… you get the idea. At the same time, there are books that describe class activities that do not need a computer; e.g. Gelman’s Teaching Statistics: A Bag of Tricks. (Incidentally, why is that book so friggin’ expensive?)

Back to uni

Back from primary school kiddies to a regression course at university. Let’s say that we have two variables, x & y, and that we want to regress y (response) on x (predictor) and get diagnostic plots. In R we could simulate some data and plot the relationship using something like this:

# Basic regression data
n = 100
x = 1:n
y = 70 + x*5 + rnorm(n, 0, 40)
 
# Changing couple of points and plotting
y[50] = 550
y[100] = 350
plot(y ~ x)
Typical simple linear regression scatterplot.

We can then fit the linear regression and get some diagnostic plots using:

# Regression and diagnostics
m1 = lm(y ~ x)
 
par(mfrow = c(2,2))
plot(m1)
par(mfrow = c(1,1))
Typical diagnostic plot for simple linear regression model. What’s the meaning of the fourth plot (lower right)?

If we ask students to explain the 4th plot—which displays discrepancy (how far a point is from the general trend) against leverage (how far a point is from the center of mass, the pivot of the regression)—many of them will struggle to say what is going on. At that moment one could go and calculate the hat matrix of the regression (\(X (X'X)^{-1} X'\)), get leverage from its diagonal, etc., and students will get a foggy idea. Another, probably better, option is to present the issue as a physical system with which students already have experience. A good candidate for the physical system is a seesaw, because many (perhaps most) students played on one as children.

Take your students to a playground (luckily there is one next to uni) and get them playing with a seesaw. The influence of a point is related to the product of leverage (how far from the pivot we are applying force) and discrepancy (how big is the force applied). The influence of a point on the estimated regression coefficients will be very large when we apply a strong force far from the pivot (as in our point y[100]), just as happens in a seesaw. We can apply lots of force (discrepancy) near the pivot (as in our point y[50]) and little will happen. Students like mucking around with the seesaw and, more importantly, they remember.
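
For the students who want to see the algebra behind the seesaw, a quick sketch using the m1 fitted above: leverage is the diagonal of the hat matrix, and measures like Cook’s distance combine leverage with discrepancy.

# Leverage from the hat matrix, using the X behind m1
X <- model.matrix(m1)                     # a column of 1s and x
H <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix
head(diag(H))                             # leverage values...
head(hatvalues(m1))                       # ...the same numbers straight from the fit

# Influence mixes leverage and discrepancy; in this simulation the
# high-leverage, high-discrepancy point y[100] should dominate
which.max(cooks.distance(m1))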

Compulsory seesaw picture (source Wikipedia).

Analogies can go only so far. Sometimes a physical analogy like a quincunx (to demonstrate the central limit theorem) ends up being more confusing than using an example with variables that are more meaningful for students.

I don’t know what the maximum proportion of course content is that could be replaced by props, experiments, animations, software specifically designed to make a point (rather than to run analyses), etc. I do know that we still need to introduce ‘proper’ statistical software—at some point students have to face praxis. Nevertheless, developing an intuitive understanding is vital to move beyond performing monkeys; that is, people clicking on menus or going through the motions of copy/pasting code without understanding what’s going on in the analyses.

I’d like to hear if you have any favorite demos/props/etc when explaining statistical concepts.

P.S. In this post I don’t care if you love stats software, but I specifically care about helping learners who struggle to understand concepts.

Flotsam 13: early July links

Man flu kept me at home today, so I decided to do something ‘useful’ and go for a linkathon:

Sometimes people are truthful and cruel. Here Gappy on a mission goes for the jugular:

Over and out.

My take on the USA versus Western Europe comparison of GM corn

A few days ago I came across Jack Heinemann and collaborators’ article (Sustainability and innovation in staple crop production in the US Midwest, Open Access) comparing the agricultural sectors of USA and Western Europe. While the article is titled around the word sustainability, the main comparison stems from the use of Genetically Modified crops in USA versus the absence of them in Western Europe.

I was curious about part of the results and discussion which, in a nutshell, suggests that “GM cropping systems have not contributed to yield gains, are not necessary for yield gains, and appear to be eroding yields compared to the equally modern agroecosystem of Western Europe”. The authors relied on several crops for the comparison (maize/corn, rapeseed/canola (see P.S.6), soybean and cotton); however, I am going to focus on a single one (corn) for two reasons: 1. I can’t afford a lot of time for blog posts when I should be preparing lectures and 2. I like eating corn.

When the authors of the paper tackled corn the comparison was between the USA and Western Europe, using the United Nations definition of Western Europe (i.e. Austria, Belgium, France, Germany, Liechtenstein, Luxembourg, Monaco, Netherlands, Switzerland). Some large European corn producers like Italy are not there because of the narrow definition of Western.

I struggled with the comparison used by the authors because, in my opinion, there are potentially so many confounded effects (different industry structures, weather, varieties, etc.) that it can’t provide a proper counterfactual for GM versus non-GM crops. Anyway, I decided to have a look at the same data to see if I would reach the same conclusions. The article provides a good description of where the data came from, as well as how the analyses were performed, and the small details needed to exactly match the results were fairly easy to figure out. I downloaded the FAO corn data (3.7 MB csv file) for all countries (so I can reuse the code and data later for lectures and assignments). I then repeated the plots using the following code:

# Default directory
setwd('~/Dropbox/quantumforest')
 
# Required packages
require(ggplot2)
require(scales) # for the comma label formatter used below
 
# Reading FAO corn data
FAOcorn = read.csv('FAOcorn.csv')
 
# Extracting Area
FAOarea = subset(FAOcorn, Element == 'Area Harvested', 
                 select = c('Country', 'Year', 'Value'))
 
names(FAOarea)[3] = 'Area'
 
# and production
FAOprod = subset(FAOcorn, Element == 'Production', 
                 select = c('Country', 'Year', 'Value'))
 
names(FAOprod)[3] = 'Production'
 
# to calculate yield in hectograms
FAOarea = merge(FAOarea, FAOprod, by = c('Country', 'Year'))
FAOarea$Yield = with(FAOarea, Production/Area*10000)
 
# Subsetting only the countries of interest (and years to match paper)
FAOarticle = subset(FAOarea, Country == 'United States of America' | Country == 'Western Europe')
 
# Plot with regression lines
ggplot(FAOarticle, aes(x = Year, y = Yield, color = Country)) + 
      geom_point() + stat_smooth(method = lm, fullrange = TRUE, alpha = 0.1) + 
      scale_y_continuous('Yield [hectograms/ha]', limits = c(0, 100000), labels = comma) + 
      theme(legend.position="top")
Figure 1. Corn yield per year for USA and Western Europe (click to enlarge).

I could obtain pretty much the same regression model equations as in the article by expressing the years as deviation from 1960 as in:

# Expressing year as a deviation from 1960, so results
# match paper
FAOarticle$NewYear = with(FAOarticle, Year - 1960)
 
usa.lm = lm(Yield ~ NewYear, data = FAOarticle, 
            subset = Country == 'United States of America')
summary(usa.lm)   
 
#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country == 
#    "United States of America")
#
#Residuals:
#     Min       1Q   Median       3Q      Max 
#-18435.4  -1958.3    338.3   3663.8  10311.0 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept) 38677.34    1736.92   22.27   <2e-16 ***
#NewYear      1173.83      59.28   19.80   <2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
#
#Residual standard error: 6049 on 48 degrees of freedom
#Multiple R-squared: 0.8909,	Adjusted R-squared: 0.8887 
 
 
weu.lm = lm(Yield ~ NewYear, data = FAOarticle,
            subset = Country == 'Western Europe')
summary(weu.lm)   
 
#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country == 
#   "Western Europe")
#
#Residuals:
# Min       1Q   Median       3Q      Max 
#-14726.6  -3205.8    346.4   4000.6  10289.5 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept) 31510.14    1665.90   18.91   <2e-16 ***
#NewYear      1344.42      56.86   23.65   <2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
#
#Residual standard error: 5802 on 48 degrees of freedom
#Multiple R-squared: 0.9209,	Adjusted R-squared: 0.9193 
#F-statistic: 559.1 on 1 and 48 DF,  p-value: < 2.2e-16

Heinemann and collaborators then point out the following:

…the slope in yield increase by year is steeper in W. Europe (y = 1344.2x + 31512, R² = 0.92084) than the United States (y = 1173.8x + 38677, R² = 0.89093) from 1961 to 2010 (Figure 1). This shows that in recent years W. Europe has had similar and even slightly higher yields than the United States despite the latter’s use of GM varieties.

However, that interpretation using all the data assumes that both ‘countries’ were using GM corn all the time. An interesting thing is that the USA and Western Europe were already on different trends before the introduction of GM corn. We can state that because we have some idea of when GM crops were introduced in the USA. This information is collected by the US Department of Agriculture in their June survey to growers and made publicly available at the State level (GMcornPenetration.csv):

cornPenetration = read.csv('GMcornPenetration.csv')
 
ggplot(cornPenetration, aes(x = Year, y = PerAllGM)) + geom_line() + facet_wrap(~ State) +
      scale_y_continuous('Percentage of GM corn') +
      theme(axis.text.x = element_text(angle = 90))
Figure 2. GM corn percentage by state in the USA (click to enlarge).

This graph tells us that by the year 2000 the percentage of GM corn planted was way below 50% in most corn-producing states (in fact, it was 25% at the country level). From that time on there was a steady increase, reaching over 80% in most states by 2008. Given this, it probably makes sense to assume that, at the USA level, yield reflects non-GM corn until 1999 and progressively reflects the effect of GM genotypes from 2000 onwards. This division is somewhat arbitrary, but easy to implement.

We can repeat the previous analyses limiting the data from 1961 until, say, 1999:

usa.lm2 = lm(Yield ~ NewYear, data = FAOarticle, 
            subset = Country == 'United States of America' & Year < 2000)
summary(usa.lm2)   
 
#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country == 
#    "United States of America" & Year < 2000)
#
#Residuals:
#   Min     1Q Median     3Q    Max 
#-17441  -2156   1123   3989   9878 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept) 39895.57    2084.81   19.14  < 2e-16 ***
#NewYear      1094.82      90.84   12.05 2.25e-14 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 6385 on 37 degrees of freedom
#Multiple R-squared:  0.797,	Adjusted R-squared:  0.7915 
#F-statistic: 145.2 on 1 and 37 DF,  p-value: 2.245e-14
 
weu.lm2 = lm(Yield ~ NewYear, data = FAOarticle,
            subset = Country == 'Western Europe' & Year < 2000)
summary(weu.lm2)   
 
#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country == 
#    "Western Europe" & Year < 2000)
#
#Residuals:
#   Min     1Q Median     3Q    Max 
#-10785  -3348    -34   3504  11117 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept) 29802.17    1813.79   16.43   <2e-16 ***
#NewYear      1454.48      79.03   18.40   <2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 5555 on 37 degrees of freedom
#Multiple R-squared:  0.9015,	Adjusted R-squared:  0.8988 
#F-statistic: 338.7 on 1 and 37 DF,  p-value: < 2.2e-16

These analyses indicate that Western Europe started with a lower yield than the USA (29,802.17 vs 39,895.57 hectograms/ha) and managed to increase yield much more quickly (1,454.48 vs 1,094.82 hectograms/ha per year) before any use of GM corn by the USA. Figure 1 shows a messy picture because there are numerous factors affecting yield each year (e.g. weather has a large influence). We can take averages for each decade and see how the two ‘countries’ are performing:

# Aggregating every decade.
# 2013-07-05 20:10 NZST I fixed the aggregation because it was averaging yields rather than
# calculating total production and area for the decade and then calculating average yield
# Discussion points are totally valid 
FAOarticle$Decade = cut(FAOarticle$Year, 
                        breaks = seq(1959, 2019, 10), 
                        labels = paste(seq(1960, 2010, 10), 's', sep = ''))
 
decadeProd = aggregate(Production ~ Country + Decade,
                       data = FAOarticle,
                       FUN = sum)
 
decadeArea = aggregate(Area ~ Country + Decade,
                       data = FAOarticle,
                       FUN = sum)
 
decadeYield = merge(decadeProd, decadeArea, by = c('Country', 'Decade'))
decadeYield$Yield = with(decadeYield, Production/Area*10000)
 
ggplot(decadeYield, aes(x = Decade, y = Yield, fill = Country)) + 
       geom_bar(stat = 'identity', position = 'dodge') +
       scale_y_continuous('Yield [hectograms/ha]', expand = c(0, 0)) +
       theme(legend.position="top")
Figure 3. Corn yield by decade (click to enlarge).

Figure 3. Corn yield by decade (click to enlarge).

This last figure requires more attention. We can again see that Western Europe starts with lower yields than the USA; however, it keeps on increasing those yields faster than the USA, overtaking it during the 1990s. Again, all this change happened while neither the USA nor Western Europe was using GM corn. The situation reverses in the 2000s, when the USA overtakes Western Europe while continuously increasing its percentage of GM corn. The last bar in Figure 3 is misleading because it includes a single year (2010) and we know that yields in the USA went down in 2011 and 2012, affected by a very large drought (see Figure 4).

At least when looking at corn, I can’t say (with the same data available to Heinemann) that there is no place or need for GM genotypes. I do share some of his concerns with respect to the low level of diversity present in staple crops but, in contrast to his opinion, I envision a future for agriculture that includes large-scale operations (either GM or non-GM), as well as smaller operations (including organic ones). I’d like to finish with some optimism, looking further back in time at yield, because the USDA National Agricultural Statistics Service has kept yield statistics for corn since 1866(!) (csv file), although it uses bizarre non-metric units (bushels/acre). As a metric boy, I converted to kilograms per hectare (multiplying by 62.77, from this page) and then to hectograms/ha (1 hectogram = 100 g) by multiplying by 10.

# Reading NASS corn data
NASS = read.csv('NASScorn.csv')
# Conversion to sensical units (see Iowa State Extension article)
# http://www.extension.iastate.edu/agdm/wholefarm/html/c6-80.html
NASS$Yield = with(NASS, Value*62.77*10)
 
# Average by decade
NASS$Decade = cut(NASS$Year, 
                  breaks = seq(1859, 2019, 10), 
                  labels = paste(seq(1860, 2010, 10), 's', sep = ''))
 
oldYield = aggregate(Yield ~ Decade, data = NASS, FUN = mean)
 
# Plotting
ggplot(oldYield, aes(x = Decade, y = Yield)) + 
       geom_bar(stat = 'identity') +
       scale_y_continuous('Yield [hectograms/ha]', expand = c(0, 0))
Figure 4. Historic average yield per decade for USA (click to enlarge).

It is interesting to see that there was little change until the 1940s, with the advent of the Green Revolution (modern breeding techniques, fertilization, pesticides, etc.). The 2010s decade in Figure 4 includes 2010, 2011 and 2012, with the last two years reflecting extensive droughts. Drought tolerance is one of the most important traits in modern breeding programs.

Drought’s Footprint map produced by The New York Times (click on graph to view larger version in the NYT). This can help to understand the Decade patterns in previous figures.

While Prof. Heinemann and I work for the same university, I don’t know him in person.

P.S. Did you know that Norman Borlaug (hero of mine) studied forestry?
P.S.2 Time permitting I’ll have a look at other crops later. I would have liked to test a regression with dummy variables for corn to account for pre-2000 and post-1999, but there are not yet enough years to fit a decent model (considering the natural variability between years). We’ll have to wait for that one.
P.S.3 I share some of Heinemann’s concerns relating to subsidies and other agricultural practices.
P.S.4 In case anyone is interested, I did write about a GM-fed pigs study not long ago.
P.S.5 2013-07-05 20:10 NZST. I updated Figures 1 and 3 to clearly express that yield was in hectograms/ha and recalculated average decade yield, because it was originally averaging yields rather than calculating total production and area for the decade and then calculating average yield. The discussion points I raised are still completely valid.
P.S.6 2013-07-07 11:30 NZST. The inclusion of Canadian canola does not make any sense because, as far as I know, Canada is not part of Western Europe or the US Midwest. This opens the inclusion of crops from any origin as far as the results are convenient for one’s argument.
P.S.7 2014-08-06 13:00 NZST My comment was expanded and published in the journal (or here for the HTML version, with comments on the authors’ reply to my comment).

GM-fed pigs, chance and how research works

Following my post on GM-fed pigs I received several comments, mostly through Twitter. Some people liked having access to an alternative analysis, while others replied with typical anti-GM slogans, completely ignoring that I was posting about the technical side of the paper. This post is not for the slogan crowd (who clearly are not interested in understanding), but for people that would like to know more about how one would evaluate claims from a scientific article. While I refer to the pig paper, most issues apply to any paper that uses statistics.

In general, researchers want to isolate the effect of the treatments under study (diets in this case) from any other extraneous influence. We want control over the experimental conditions, so we can separate the effects of interest from all other issues that could create differences between our experimental units (pigs in this case). What could create ‘noise’ in our results? Animals could have different genetic backgrounds (for example with different parents), they could be exposed to different environmental conditions, they could be treated differently (more kindly or harshly), etc.

Once we control those conditions as much as possible we would randomly assign animals to each of the treatments (diets). The rationale behind random allocation is easy to follow. Let’s say that we can see that some animals are healthier than others before starting the trial. If I had a pro-GM agenda, I could assign the healthiest animals to the GM-diet and we would not be able to separate the effect of the treatment from the initial difference. To avoid this, we could have many labels in a hat, shake the hat, and for each pig take a label that randomly assigns the pig to a diet so the comparison is fair.
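
A minimal sketch of that ‘labels in a hat’ process in R (with hypothetical pig numbers, purely for illustration):

# Random allocation of (hypothetical) pigs to the two diets
set.seed(2013)                                      # only to make the example reproducible
pig.id <- paste('pig', 1:20, sep = '-')
diet <- sample(rep(c('GM', 'non-GM'), each = 10))   # shake the hat
data.frame(pig.id, diet)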

Researchers also like to have a baseline of the conditions before the treatments are applied. This way we can ‘adjust’ the results by any pre-existing differences. For example, there could be measurable differences on health, size, etc. I normally work with trees in experiments and we routinely assess the height of the trees just planted, so we can establish a baseline.

Finally, we often have a ‘default’ treatment which represents the status quo and acts as a comparison point for the new treatments. In the GM pig case, the default is a non-GM diet and the new treatment is the GM-diet.

The paper on GM fed pigs states that they tried to have as much control as possible of the growing conditions and that they used random allocation to the two feeding treatments. I have no problems with the paper up to this point.

When doing research it is good manners to state one’s expectations before the start of the trial. Doing this provides the experimenter with guidance about how to proceed with the evaluation of the experiment:

  1. What are the characteristics under study? Both the response variables (in the past called ‘dependent variables’) and the predictors (old name ‘independent variables’).
  2. What is the magnitude of the differences between groups that we would consider practically significant? Put another way, what would be the size of the difference that one would care about? For example, if we have two groups of pigs and the difference in weight between them is 1 g (0.035 ounces for people with funny units), who cares? If the difference was 5 kg (11 pounds, funny units again) then, perhaps, we are on to something.
  3. What level of statistical significance do we consider appropriate? Even if we assume that there truly is no difference between the two diets, we would expect to see small differences between the two groups just by chance. Big differences are more unlikely. It is common in research to state that a difference is statistically significant if the probability of observing the difference is smaller than, say, 0.05 (or 1 in 20). There is nothing sacred about that number; it is just a convention.

By this stage one would expect a researcher to state one or more hypotheses to be tested. For example, ‘I expect that the GM diet will increase [condition here] by [number here] percent’. One can run an experiment ignoring ‘good manners’, but (and this is a big but) an informed reader will become suspicious if suddenly one starts testing hypotheses like there is no tomorrow. Why? Because if one conducts too many tests one is bound to find statistically significant results even when there are none.

The comic below presents a brief example with jelly beans assuming that we claim significance for an event occurring with probability 0.05 (1 in 20) or less. Notice that the scientists use 20 colors of jelly beans and find that green ones ’cause’ acne. Running so many tests—without adjusting down the probability of the event that we would call significant, so p needs to be much smaller than 0.05 to be significant—results in wrong conclusions.

XKCD explaining significance (click for original, larger version). p refers to the probability of observing that result if jelly beans have no effect on acne.

In the pig paper there are 8 tests in Table 2, 18 (or 15 with some value) in Table 3, 8 in Table 4 and 17 in Table 5 for a total of 49 (or 46 with some testable values). In fact one would expect to find a couple of significant results (at 0.05 or 1 in 20) by chance even if there are absolutely no differences in reality.
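
A back-of-the-envelope calculation (my numbers, assuming for simplicity that the tests are independent) gives an idea of the scale:

# Expected number of 'significant' results when there are no real differences
alpha <- 0.05
n.tests <- 49
n.tests * alpha           # expect ~2.45 significant tests just by chance
1 - (1 - alpha)^n.tests   # ~0.92 probability of at least one false positive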

Add to this that many of the tests are unnecessary, because they are performing the wrong type of analysis. For example, there are four separate analyses for stomach inflammation; however, the analysis ignores the type of variable one is testing as I point out in a previous post.

This is why, if I were Monsanto, I would use the paper as evidence supporting the idea that there is no difference between the two diets:

  1. the paper was ‘independent’ (although the authors have a clear anti-GM bias) and
  2. when running proper analyses (accounting for the type of variable and the number of tests that one is running) no difference is statistically significant.

P.S. 2013-06-21. 15:38 NZST A footnote about data availability: it is becoming increasingly common to make both the data and the code used to analyze them available when publishing (part of Open Science). I would expect a paper making bold claims that could have policy implications to provide those, which does not happen in this case. However, the publication does include part of the data in results Tables 3 & 4, in the counts of pigs under different classes of inflammation.

Ordinal logistic GM pigs

This week another ‘scary GMO causes disease’ story was doing the rounds on the internet: A long-term toxicology study on pigs fed a combined genetically modified (GM) soy and GM maize diet. Andrew Kniss, a non-smokable weeds expert, mentioned on Twitter that the statistical analyses in the study appeared to be kind of dodgy.

Curious, I decided to have a quick look and I was surprised, first, by the points the authors decided to highlight in their results, second, by the pictures and captioning used in the article and, last, by the way the analysis was run. As I’m in the middle of marking assignments and exams I’ll only have a quick go at part of the analysis. As I see it, the problem can be described as ‘there is a bunch of pigs who were fed either non-GM feed or GM feed. After some time (approximately 23 weeks) they were killed and went through a CSI-like autopsy’, where part of the exam involved the following process:

  1. Write down the type of feed the pig had during his/her life;
  2. Assess the condition of the stomach and put it in one of four boxes labeled ‘Nil’, ‘Mild’, ‘Moderate’ and ‘Severe’.

All this data is summarized in Table 3 of the paper (PDF). How would I go about the analysis? As I see it, we have a categorical response variable—which can take one of four mutually exclusive values—and a categorical predictor (diet). In addition, there is a natural order to the inflammation response variable in that Severe > Moderate > Mild > Nil.

Andrew Kniss wrote a post trying to reproduce the published results. Instead, I present the first approach I would try with the data: ordinal logistic regression. Not only that, but instead of using hippie statistical software like R, I will use industrial-grade-business-like SAS:

/*
 Testing SAS web editor fitting pigs data
 Luis A. Apiolaza - School of Forestry
*/
 
*Reading data set;
data pigs;
  input inflam $ diet $ count;
  datalines;
  Nil Non-GMO 4
  Mild Non-GMO 31
  Moderate Non-GMO 29
  Severe Non-GMO 9
  Nil GMO 8
  Mild GMO 23
  Moderate GMO 18
  Severe GMO 23
;
run;
 
*Showing data set;
proc print data = pigs;
run;
 
*El quicko analysis;
ods graphics on;
proc logistic data = pigs order = data;
  freq count;
  class inflam (ref = "Nil") diet (ref = "Non-GMO") / param = ref;
  model inflam = diet / link = glogit;
  oddsratio diet;
run;

This produces a simple table with the same data as the paper and some very non-exciting results, which are better summarized in a single graph:

/*
Obs	inflam   diet   count
1	Nil      Non-GMO  4
2	Mild     Non-GMO 31
3	Moderate Non-GMO 29
4	Severe   Non-GMO  9
5	Nil      GMO      8
6	Mild     GMO     23
7	Moderate GMO     18
8	Severe   GMO     23
*/
Odds ratios for the different levels of stomach inflammation.

The odds ratios would be 1 if there were no difference between the treatments. The graph shows that the confidence limits for all levels of inflammation include 1, so move on, nothing to see here. In fact, GMO-fed pigs tend to have less inflammation in most disease categories.

P.S. There are many ways of running an analysis for this data set, but I’m in favor of approaches that take the whole problem in one go rather than looking at one class at a time. In an ideal situation we would have a continuous assessment for inflammation and the analysis would be a one-way ANOVA. I understand that for practical reasons one may prefer to split the response into four classes.
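
For example, a rough R counterpart (my sketch, not the analysis in the paper nor the SAS run above) would be a proportional odds model via MASS::polr, again taking the four ordered classes in one go:

# Proportional odds (cumulative logit) fit to the same counts
library(MASS)

pigs <- data.frame(inflam = factor(rep(c('Nil', 'Mild', 'Moderate', 'Severe'), 2),
                                   levels = c('Nil', 'Mild', 'Moderate', 'Severe'),
                                   ordered = TRUE),
                   diet = rep(c('Non-GMO', 'GMO'), each = 4),
                   count = c(4, 31, 29, 9, 8, 23, 18, 23))

polr.m1 <- polr(inflam ~ diet, weights = count, data = pigs, Hess = TRUE)
summary(polr.m1)
exp(coef(polr.m1))   # odds ratio for the diet effect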

P.S.2 2013-06-15 I often act as a reviewer for scientific journals. In the case of this article some of my comments would have included: the analysis does not use the structure of the data properly, the photographs of the damaged organs should include both types of diet for each inflammation class (or at least the most representative diet for the class), and the authors should highlight that there are no significant differences between the two diets for animal health; that is, the trial provides evidence for no difference between feeds. I still feel that the authors should be more forthcoming in terms of disclosing potential conflicts of interest too, but that’s their decision.

P.S.3 2013-07-04 I expand on aspects of the general research process in this post.

Tongue-in-cheek, of course, and with reference to weeds. This blog mostly uses R, but I’m pushing myself to use lots of different software to ‘keep the language’. Now if I could only do this with Spanish.

Matrix Algebra Useful for Statistics

I was having a conversation with an acquaintance about courses that were particularly useful in our work. My forestry degree involved completing 50 compulsory + 10 elective courses; if I had to choose courses that were influential and/or really useful they would be Operations Research, Economic Evaluation of Projects, Ecology, 3 Calculus and 2 Algebra courses. Subsequently my PhD was almost entirely research based, but I sort of did Matrix Algebra: Dorian lent me his copy of Searle’s Matrix Algebra Useful for Statistics and passed me a pile of assignments that Shayle Searle used to give in his course at Cornell. I completed the assignments at my own pace and then sat a crazy 24-hour take-home exam.

Later that year I bought a cloth-bound 1982 version of the book, not the alien-vomit-purple paperback reprint currently on sale, which I consult from time to time. Why would one care about matrix algebra? Besides being a perfectly respectable intellectual endeavor in itself, maybe you can see that the degrees of freedom are the rank of a quadratic form; you can learn from this book what a quadratic form and a matrix rank are. Or you want to see more clearly the relationship between regression and ANOVA, because in matrix form a linear model is a linear model is a linear model. Commands for outer, inner and Kronecker products make a lot more sense once you know what the outer and inner products of vectors are. Thus, if you really want to understand a matrix language for data analysis and statistics (like R), it seems reasonable to try to understand the building blocks of such a language.
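
A tiny sketch of those operations in R (my example, not from the book):

# Outer, inner and Kronecker products of two small vectors
a <- c(1, 2, 3)
b <- c(4, 5, 6)

outer(a, b)         # outer product: a 3 x 3 matrix
crossprod(a, b)     # inner product: the same as t(a) %*% b
kronecker(a, b)     # Kronecker product

# And matrix rank, which ties in with degrees of freedom
X <- model.matrix(~ factor(c('a', 'a', 'b', 'b', 'c', 'c')))
qr(X)$rank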

The book does not deal with any applications to statistics until chapter 13. Before that it is all about laying the foundations to understand the applications, so do not expect nice graphs and cute photos. This is a very good text where one has to use the brain to imagine what’s going on in the equations and demonstrations. The exercises rely a lot on ‘prove this’ and ‘prove that’, which leads to much frustration and, after persevering, to many ‘aha!’ moments.

XKCD 1050: In your face! Actually I feel the opposite concerning math.

I am the first to accept that I have a biased opinion about this book, because it has sentimental value. It represents difficult times, dealing with a new language, culture and, on top of that, animal breeding. At the same time, it opened doors to a whole world of ideas. This is much more than I can say of most books.

PS 2012-12-17: I have commented on a few more books in these posts.

A good part of my electives were in humanities (history & literature), which was unusual for forestry. I just couldn’t conceive going through a university without doing humanities.

A word of caution: the sample may have an effect

This week I’ve tried to i- stay mostly in the descriptive statistics realm and ii- surround any simple(istic) models with caveats, pointing out that they are very preliminary. We are working with a sample of ~1,000 schools that did reply to Fairfax’s request, while a number of schools either ignored the request or told Fairfax to go and F themselves. Why am I saying this? If one gets a simple table of the number of schools by type and decile there is something quite interesting: we have different percentages for different types of schools represented in the sample, and the possibility of bias in the reporting to Fairfax due to potentially low performance (references to datasets correspond to the ones I used in this post):

summary(standards$school.type)
#         Composite (Year 1-10)          Composite (Year 1-15)        Contributing (Year 1-6) 
#                             1                             29                            403 
#       Full Primary (Year 1-8)    Intermediate (year 7 and 8) Restricted Composite (Yr 7-10) 
#                           458                             62                              1 
#         Secondary (Year 7-15) 
#                            56

Now let’s compare this number with the school directory:

summary(factor(directory$school.type))
#         Composite (Year 1-10)          Composite (Year 1-15)        Contributing (Year 1-6) 
#                             4                            149                            775 
#         Correspondence School        Full Primary (Year 1-8)    Intermediate (year 7 and 8) 
#                             1                           1101                            122 
#Restricted Composite (Yr 7-10)         Secondary (Year 11-15)          Secondary (Year 7-10) 
#                             4                              2                              2 
#         Secondary (Year 7-15)          Secondary (Year 9-15)                 Special School 
#                           100                            238                             39 
#              Teen Parent Unit 
#                            20

As a proportion we are missing more secondary schools. We can use the following code to get an idea of how similar school types are, because the small number of different composite schools is a pain.

# Performance of Contributing (Year 1-6) and
# Full Primary (Year 1-8) looks pretty much the
# same. Composites could be safely merged
qplot(school.type, reading.OK, 
      data = standards, geom = 'jitter')
 
qplot(school.type, writing.OK, 
      data = standards, geom = 'jitter')
 
qplot(school.type, math.OK, 
      data = standards, geom = 'jitter')
 
# Merging school types and plotting them colored
# by decile
standards$school.type.4 <- standards$school.type
levels(standards$school.type.4) <- c('Composite', 'Composite', 'Primary',
                                     'Primary', 'Intermediate',
                                     'Composite', 'Secondary')
 
qplot(school.type.4, reading.OK, colour = decile,
      data = standards, geom = 'jitter')

Different participation in the sample by school type and decile; this plot shows performance in mathematics.

I’m using jittering rather than box and whisker plots to i- depict all the schools and ii- get an idea of the different participation of school types in the dataset. Sigh. Another caveat to add in the discussion.

P.S. 2012-09-27 16:15. Originally I mentioned in this post the lack of secondary schools (Year 9-15) but, well, they are not supposed to be here, because National Standards apply to years 1 to 8 (Thanks to Michael MacAskill for pointing out my error.)

New Zealand school performance: beyond the headlines

I like the idea of having data on school performance, not to directly rank schools—hard, to say the least, at this stage—but because we can start having a look at the factors influencing test results. I imagine the opportunity in the not so distant future to run hierarchical models combining Ministry of Education data with Census/Statistics New Zealand data.

At the same time, there is the temptation to come up with very simple analyses that would make appealing newspaper headlines. I’ll read the data and create a headline and then I’ll move to something that, personally, seems more important. In my previous post I combined the national standards for around 1,000 schools with decile information to create the standards.csv file.

library(ggplot2)
 
# Reading data, building factors
standards <- read.csv('standards.csv')
standards$decile.2008 <- factor(standards$decile.2008)
standards$electorate <- factor(standards$electorate)
standards$school.type <- factor(standards$school.type)
 
# Removing special schools (all 8 of them) because their
# performance is not representative of the overall school
# system
standards <- subset(standards, 
                    as.character(school.type) != 'Special School')
standards$school.type <- standards$school.type[drop = TRUE]
 
# Create performance variables. Proportion of the students that
# at least meet the standards (i.e. meet + above)
# Only using all students, because subsets are not reported
# by many schools
standards$math.OK <- with(standards, math.At + math.A)
standards$reading.OK <- with(standards, reading.At + reading.A)
standards$writing.OK <- with(standards, writing.At + writing.A)

Up to this point we have read the data, removed special schools and created variables that represent the proportion of students that at least meet the standards. Now I’ll do something very misleading: calculate the average of reading.OK for each school decile and plot it, showing the direct relationship between socio-economic decile and school performance.

mislead <- aggregate(reading.OK ~ decile.2008, 
                     data = standards, mean)
qplot(decile.2008, reading.OK , data = mislead)

Scatterplot of average proportion of students at least meeting the reading national standards for each socioeconomic decile.

Using only the average strengthens the relationship, but it hides something extremely important: the variability within each decile. The following code will display box and whisker plots for each decile: the horizontal line is the median, the box contains the middle 50% of the schools and the vertical lines extend to 1.5 times the interquartile range (pretty much the full range in most cases):

qplot(decile.2008, reading.OK, 
      data = standards, geom = 'boxplot',
      xlab = 'Decile in 2008',
      ylab = 'Reading at or above National Standard')

Box and whisker plot for proportion of students at least meeting the reading national standard for each decile. Notice the variability for each decile.

A quick look at the previous graph shows a few interesting things:

  • the lower the decile the more variable school performance,
  • there is pretty much no difference between deciles 6, 7, 8 and 9, and a minor increase for decile 10.
  • there is a trend for decreasing performance for lower deciles; however, there is also a huge variability within those deciles.

We can repeat the same process for writing.OK and math.OK with similar trends, although the level of standard achievement is lower than for reading:

qplot(decile.2008, writing.OK, 
      data = standards, geom = 'boxplot',
      xlab = 'Decile in 2008',
      ylab = 'Writing at or above National Standard')
 
qplot(decile.2008, math.OK, 
      data = standards, geom = 'boxplot',
      xlab = 'Decile in 2008',
      ylab = 'Mathematics at or above National Standard')

Box and whisker plot for proportion of students at least meeting the writing national standard for each decile.

Box and whisker plot for proportion of students at least meeting the mathematics national standard for each decile.

Achievement in different areas (reading, writing, mathematics) is highly correlated:

cor(standards[, c('reading.OK', 'writing.OK', 'math.OK')], 
    use = 'pairwise.complete.obs')
 
#           reading.OK writing.OK   math.OK
#reading.OK  1.0000000  0.7886292 0.7749094
#writing.OK  0.7886292  1.0000000 0.7522446
#math.OK     0.7749094  0.7522446 1.0000000
 
# We can plot an example and highlight decile groups
# 1-5 vs 6-10
qplot(reading.OK, math.OK,
      data = standards, color = ifelse(as.numeric(decile.2008) > 5, 1, 0)) + 
      theme(legend.position = 'none')

Scatterplot for proportion meeting mathematics and reading national standards. Dark points are deciles 1 to 5, while light points are deciles 6 to 10. Notice the large overlap for performance between the two groups.

All these graphs are only descriptive/exploratory; however, once we had more data we could move to hierarchical models to start adjusting performance by socioeconomic aspects, geographic location, type of school, urban or rural setting, school size, ethnicity, etc. Potentially, this would let us target resources on schools that could be struggling to perform; nevertheless, any attempt at creating a ‘quick & dirty’ ranking ignoring the previously mentioned covariates would be, at the very least, misleading.
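
Just to give an idea of the shape such a model could take (a sketch only, using lme4 and variables already in this dataset; nothing here has been fitted or checked for this post):

# Hypothetical hierarchical model: performance adjusted by decile and
# school type, with electorate as a grouping factor
library(lme4)

hier.m1 <- lmer(reading.OK ~ decile.2008 + school.type + (1 | electorate),
                data = standards)
summary(hier.m1)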

Note 2012-09-26: I have updated data and added a few plots in this post.

Mid-August flotsam

Reached the mid-semester point, with quite a few new lectures to prepare. Nothing extremely complicated but, as always, the tricky part is finding a way to make it meaningful and memorable. Sometimes, and this is one of those times, I sound like a broken record, but I’m a bit obsessive about helping people to ‘get’ a topic.

Gratuitous picture: Lola, Lisbon, Portugal (Photo: Luis).