# Archive for the ‘teaching’ Category

A few days ago I came across Jack Heinemann and collaborators’ article (Sustainability and innovation in staple crop production in the US Midwest, Open Access) comparing the agricultural sectors of the USA and Western Europe. While the article’s title revolves around the word sustainability, the main comparison hinges on the use of genetically modified (GM) crops in the USA versus their absence in Western Europe.

I was curious about part of the results and discussion which, in a nutshell, suggest that “GM cropping systems have not contributed to yield gains, are not necessary for yield gains, and appear to be eroding yields compared to the equally modern agroecosystem of Western Europe”. The authors relied on several crops for the comparison (maize/corn, rapeseed/canola [see P.S.6], soybean and cotton); however, I am going to focus on a single one (corn) for two reasons: 1. I can’t afford a lot of time for blog posts when I should be preparing lectures and 2. I like eating corn.

When the authors of the paper tackled corn the comparison was between the USA and Western Europe, using the United Nations definition of Western Europe (i.e. Austria, Belgium, France, Germany, Liechtenstein, Luxembourg, Monaco, Netherlands, Switzerland). Some large European corn producers like Italy are not there because of the narrow definition of Western.

I struggled with the comparison used by the authors because, in my opinion, there are potentially so many confounded effects (different industry structures, weather, varieties, etc.) that it can’t provide the proper counterfactual for GM versus non-GM crops. Anyway, I decided to have a look at the same data to see if I would reach the same conclusions. The article provides a good description of where the data came from, as well as how the analyses were performed; the small details needed to match their results exactly were fairly easy to figure out. I downloaded the FAO corn data (3.7 MB csv file) for all countries (so I can reuse the code and data later for lectures and assignments) and then repeated the plots using the following code:

```r
# Default directory
setwd('~/Dropbox/quantumforest')

# Required packages
require(ggplot2)
require(scales)   # provides the comma label formatter

# Reading FAO corn data
FAOcorn = read.csv('FAOcorn.csv')

# Extracting Area
FAOarea = subset(FAOcorn, Element == 'Area Harvested',
                 select = c('Country', 'Year', 'Value'))
names(FAOarea)[3] = 'Area'

# and Production
FAOprod = subset(FAOcorn, Element == 'Production',
                 select = c('Country', 'Year', 'Value'))
names(FAOprod)[3] = 'Production'

# to calculate Yield in hectograms per hectare
FAOarea = merge(FAOarea, FAOprod, by = c('Country', 'Year'))
FAOarea$Yield = with(FAOarea, Production/Area*10000)

# Subsetting only the countries of interest (and years to match paper)
FAOarticle = subset(FAOarea, Country == 'United States of America' | Country == 'Western Europe')

# Plot with regression lines
ggplot(FAOarticle, aes(x = Year, y = Yield, color = Country)) +
  geom_point() + stat_smooth(method = lm, fullrange = TRUE, alpha = 0.1) +
  scale_y_continuous('Yield [hectograms/ha]', limits = c(0, 100000), labels = comma) +
  theme(legend.position = 'top')
```

Figure 1. Corn yield per year for USA and Western Europe (click to enlarge).

I could obtain pretty much the same regression model equations as in the article by expressing the years as deviations from 1960, as in:

```r
# Expressing year as a deviation from 1960, so results match the paper
FAOarticle$NewYear = with(FAOarticle, Year - 1960)

usa.lm = lm(Yield ~ NewYear, data = FAOarticle,
            subset = Country == 'United States of America')
summary(usa.lm)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "United States of America")
#
#Residuals:
#     Min       1Q   Median       3Q      Max
#-18435.4  -1958.3    338.3   3663.8  10311.0
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 38677.34    1736.92   22.27   <2e-16 ***
#NewYear      1173.83      59.28   19.80   <2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 6049 on 48 degrees of freedom
#Multiple R-squared: 0.8909, Adjusted R-squared: 0.8887

weu.lm = lm(Yield ~ NewYear, data = FAOarticle,
            subset = Country == 'Western Europe')
summary(weu.lm)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "Western Europe")
#
#Residuals:
#     Min       1Q   Median       3Q      Max
#-14726.6  -3205.8    346.4   4000.6  10289.5
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 31510.14    1665.90   18.91   <2e-16 ***
#NewYear      1344.42      56.86   23.65   <2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 5802 on 48 degrees of freedom
#Multiple R-squared: 0.9209, Adjusted R-squared: 0.9193
#F-statistic: 559.1 on 1 and 48 DF,  p-value: < 2.2e-16
```

Heinemann and collaborators then point out the following:

…the slope in yield increase by year is steeper in W. Europe (y = 1344.2x + 31512, R² = 0.92084) than the United States (y = 1173.8x + 38677, R² = 0.89093) from 1961 to 2010 (Figure 1). This shows that in recent years W. Europe has had similar and even slightly higher yields than the United States despite the latter’s use of GM varieties.

However, that interpretation using all the data assumes that both ‘countries’ were using GM crops the whole time. The interesting thing is that the USA and Western Europe were already on different trends before the introduction of GM corn. We can state that because we have a good idea of when GM crops were introduced in the USA: this information is collected by the US Department of Agriculture in its June survey of growers and made publicly available at the state level (GMcornPenetration.csv):

```r
cornPenetration = read.csv('GMcornPenetration.csv')

ggplot(cornPenetration, aes(x = Year, y = PerAllGM)) + geom_line() + facet_wrap(~ State) +
  scale_y_continuous('Percentage of GM corn') +
  theme(axis.text.x = element_text(angle = 90))  # element_text replaces the old theme_text
```

Figure 2. GM corn percentage by state in the USA (click to enlarge).

This graph tells us that by the year 2000 the percentage of GM corn planted was well below 50% in most corn-producing states (in fact, it was 25% at the country level). From then on there was a steady increase, reaching over 80% in most states by 2008. Given this, it probably makes sense to assume that, at the USA level, yield reflects non-GM corn until 1999 and progressively reflects the effect of GM genotypes from 2000 onwards. This division is somewhat arbitrary, but easy to implement.

We can repeat the previous analyses limiting the data to 1961 until, say, 1999:

```r
usa.lm2 = lm(Yield ~ NewYear, data = FAOarticle,
             subset = Country == 'United States of America' & Year < 2000)
summary(usa.lm2)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "United States of America" & Year < 2000)
#
#Residuals:
#   Min     1Q Median     3Q    Max
#-17441  -2156   1123   3989   9878
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 39895.57    2084.81   19.14  < 2e-16 ***
#NewYear      1094.82      90.84   12.05 2.25e-14 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 6385 on 37 degrees of freedom
#Multiple R-squared: 0.797, Adjusted R-squared: 0.7915
#F-statistic: 145.2 on 1 and 37 DF,  p-value: 2.245e-14

weu.lm2 = lm(Yield ~ NewYear, data = FAOarticle,
             subset = Country == 'Western Europe' & Year < 2000)
summary(weu.lm2)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "Western Europe" & Year < 2000)
#
#Residuals:
#   Min     1Q Median     3Q    Max
#-10785  -3348    -34   3504  11117
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 29802.17    1813.79   16.43   <2e-16 ***
#NewYear      1454.48      79.03   18.40   <2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 5555 on 37 degrees of freedom
#Multiple R-squared: 0.9015, Adjusted R-squared: 0.8988
#F-statistic: 338.7 on 1 and 37 DF,  p-value: < 2.2e-16
```
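Incidentally, instead of fitting two separate regressions one can fit a single model for both ‘countries’ and estimate the difference in slopes directly. This is just a sketch using the same FAOarticle data frame, not part of the original analysis:

```r
# Hedged sketch: one model for both 'countries' up to 1999.
# The NewYear:Country interaction estimates how much steeper
# the Western European slope is relative to the USA
both.lm = lm(Yield ~ NewYear * Country, data = FAOarticle, subset = Year < 2000)
summary(both.lm)
```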

These analyses indicate that Western Europe started with a lower yield than the USA (29,802.17 vs 39,895.57 hectograms/ha) and managed to increase yield much more quickly (1,454.48 vs 1,094.82 hectograms/ha per year) before any use of GM corn by the USA. Figure 1 shows a messy picture because there are numerous factors affecting yield each year (e.g. weather has a large influence). We can take averages for each decade and see how the two ‘countries’ are performing:

```r
# Aggregating every decade.
# 2013-07-05 20:10 NZST I fixed the aggregation because it was averaging yields rather
# than calculating total production and area for the decade and then calculating average yield.
# Discussion points are totally valid
FAOarticle$Decade = cut(FAOarticle$Year,
                        breaks = seq(1959, 2019, 10),
                        labels = paste(seq(1960, 2010, 10), 's', sep = ''))

decadeProd = aggregate(Production ~ Country + Decade, data = FAOarticle, FUN = sum)
decadeArea = aggregate(Area ~ Country + Decade, data = FAOarticle, FUN = sum)

decadeYield = merge(decadeProd, decadeArea, by = c('Country', 'Decade'))
decadeYield$Yield = with(decadeYield, Production/Area*10000)

ggplot(decadeYield, aes(x = Decade, y = Yield, fill = Country)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  scale_y_continuous('Yield [hectograms/ha]', expand = c(0, 0)) +
  theme(legend.position = 'top')
```

Figure 3. Corn yield by decade (click to enlarge).

This last figure requires more attention. We can again see that Western Europe starts with lower yields than the USA; however, it keeps on increasing those yields faster than the USA, overtaking it during the 1990s. Again, all this change happened while neither the USA nor Western Europe was using GM corn. The situation reverses in the 2000s, when the USA overtakes Western Europe while continuously increasing its percentage of GM corn. The last bar in Figure 3 is misleading because it includes a single year (2010) and we know that yields in the USA went down in 2011 and 2012, affected by a very large drought (see Figure 4).

At least when looking at corn, I can’t say (with the same data available to Heinemann) that there is no place or need for GM genotypes. I do share some of his concerns with respect to the low level of diversity present in staple crops but, in contrast to his opinion, I envision a future for agriculture that includes large-scale operations (either GM or non-GM), as well as smaller operations (including organic ones).

I’d like to finish with some optimism, looking further back in time, because the USDA National Agricultural Statistics Service has kept yield statistics for corn since 1866(!) (csv file), although it uses bizarre non-metric units (bushels/acre). As a metric boy, I converted to kilograms per hectare (multiplying by 62.77 from this page) and then to hectograms (100 g) multiplying by 10.

```r
# Reading NASS corn data
NASS = read.csv('NASScorn.csv')

# Conversion to sensical units (see Iowa State Extension article)
# http://www.extension.iastate.edu/agdm/wholefarm/html/c6-80.html
NASS$Yield = with(NASS, Value*62.77*10)

# Average by decade
NASS$Decade = cut(NASS$Year,
                  breaks = seq(1859, 2019, 10),
                  labels = paste(seq(1860, 2010, 10), 's', sep = ''))

oldYield = aggregate(Yield ~ Decade, data = NASS, FUN = mean)

# Plotting
ggplot(oldYield, aes(x = Decade, y = Yield)) +
  geom_bar(stat = 'identity') +
  scale_y_continuous('Yield [hectograms]', expand = c(0, 0))
```

Figure 4. Historic average yield per decade for USA (click to enlarge).

It is interesting to see that there was little change until the 1940s, with the advent of the Green Revolution (modern breeding techniques, fertilization, pesticides, etc.). The 2010s decade in Figure 4 includes 2010, 2011 and 2012, with the last two years reflecting extensive droughts. Drought tolerance is one of the most important traits in modern breeding programs.

Drought’s Footprint map produced by The New York Times (click on graph to view larger version in the NYT). This can help to understand the decade patterns in the previous figures.

While Prof. Heinemann and I work for the same university, I have not met him in person.

P.S. Did you know that Norman Borlaug (hero of mine) studied forestry?
P.S.2 Time permitting I’ll have a look at other crops later. I would have liked to test a regression with dummy variables for corn to account for pre-2000 and post-1999 (a minimal sketch of that model appears after these postscripts), but there are not yet enough years to fit a decent model (considering the natural variability between years). We’ll have to wait for that one.
P.S.3 I share some of Heinemann’s concerns relating to subsidies and other agricultural practices.
P.S.4 In case anyone is interested, I did write about a GM-fed pigs study not long ago.
P.S.5 2013-07-05 20:10 NZST. I updated Figures 1 and 3 to clearly express that yield was in hectograms/ha and recalculated average decade yield because it was originally averaging yields rather than calculating total production and area for the decade and then calculating the average yield. The discussion points I raised are still completely valid.
P.S.6 2013-07-07 11:30 NZST. The inclusion of Canadian canola does not make any sense because, as far as I know, Canada is not part of Western Europe or the US Midwest. This opens the door to including crops from any origin as long as the results are convenient for one’s argument.
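Regarding P.S.2, this is roughly what that dummy-variable regression could look like. It is only a sketch using the FAOarticle data frame from the post, and it carries the caveat above about the small number of post-1999 years:

```r
# Hedged sketch of the pre-2000/post-1999 dummy-variable model for the USA series
FAOarticle$gm.era <- factor(ifelse(FAOarticle$Year < 2000, 'pre2000', 'post1999'))

# The NewYear:gm.era interaction lets both intercept and slope change after 1999
usa.dummy <- lm(Yield ~ NewYear * gm.era, data = FAOarticle,
                subset = Country == 'United States of America')
summary(usa.dummy)
```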

Following my post on GM-fed pigs I received several comments, mostly through Twitter. Some people liked having access to an alternative analysis, while others replied with typical anti-GM slogans, completely ignoring that I was posting about the technical side of the paper. This post is not for the slogan crowd (who clearly are not interested in understanding), but for people that would like to know more about how one would evaluate claims from a scientific article. While I refer to the pig paper, most issues apply to any paper that uses statistics.

In general, researchers want to isolate the effect of the treatments under study (diets in this case) from any other extraneous influence. We want control over the experimental conditions, so we can separate the effects of interest from all other issues that could create differences between our experimental units (pigs in this case). What could create ‘noise’ in our results? Animals could have different genetic backgrounds (for example with different parents), they could be exposed to different environmental conditions, they could be treated differently (more kindly or harshly), etc.

Once we control those conditions as much as possible we would randomly assign animals to each of the treatments (diets). The rationale behind random allocation is easy to follow. Let’s say that we can see that some animals are healthier than others before starting the trial. If I had a pro-GM agenda, I could assign the healthiest animals to the GM-diet and we would not be able to separate the effect of the treatment from the initial difference. To avoid this, we could have many labels in a hat, shake the hat, and for each pig take a label that randomly assigns the pig to a diet so the comparison is fair.
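In R, the ‘labels in a hat’ idea is a one-liner. A minimal sketch with hypothetical pig identifiers (not the paper’s data):

```r
# Hedged sketch: random allocation of 40 hypothetical pigs to two diets
set.seed(42)                                              # only so the example is reproducible
pigs <- data.frame(pig = 1:40)
pigs$diet <- sample(rep(c('GM', 'non-GM'), each = 20))    # shuffle the labels in the hat
table(pigs$diet)
```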

Researchers also like to have a baseline of the conditions before the treatments are applied. This way we can ‘adjust’ the results for any pre-existing differences. For example, there could be measurable differences in health, size, etc. I normally work with trees in experiments and we routinely assess the height of the trees just planted, so we can establish a baseline.
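Continuing the hypothetical pigs sketch from above, adjusting for a baseline is just an analysis of covariance; the weights below are simulated, not real:

```r
# Hedged sketch: simulate a baseline and a final weight that does not depend on diet
pigs$initial.weight <- rnorm(40, mean = 8, sd = 1)
pigs$final.weight <- 10 + 5*pigs$initial.weight + rnorm(40, sd = 2)

# Adjusting the diet comparison for pre-existing differences (ANCOVA)
ancova <- lm(final.weight ~ initial.weight + diet, data = pigs)
summary(ancova)
```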

Finally, we often have a ‘default’ treatment which represents the status quo and acts as a comparison point for the new treatments. In the GM pig case, the default is a non-GM diet and the new treatment is the GM-diet.

The paper on GM-fed pigs states that they tried to have as much control as possible over the growing conditions and that they used random allocation to the two feeding treatments. I have no problems with the paper up to this point.

When doing research it is good manners to state one’s expectations before the start of the trial. Doing this provides the experimenter with guidance about how to proceed with the evaluation of the experiment:

1. What are the characteristics under study? Both the response variables (in the past called ‘dependent variables’) and the predictors (old name ‘independent variables’).
2. What is the magnitude of the differences between groups that we would consider practically significant? Put another way, what would be the size of the difference that one would care about? For example, if we have two groups of pigs and the difference in weight between them is 1 g (0.035 ounces for people with funny units), who cares? If the difference was 5 kg (11 pounds, funny units again) then, perhaps, we are on to something.
3. What level of statistical significance do we consider appropriate? Even if we assume that there truly is no difference between the two diets, we would expect to see small differences between the two groups just by chance; big differences are much less likely. It is common in research to state that a difference is statistically significant if the probability of observing it is smaller than, say, 0.05 (or 1 in 20). There is nothing sacred about that number; it is just a convention (a small sketch after this list ties points 2 and 3 together).
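A minimal sketch tying points 2 and 3 together, using base R’s power.t.test() with made-up numbers (a 5 kg difference we care about and a 10 kg standard deviation), gives a rough idea of how many animals per group would be needed:

```r
# Hedged sketch: sample size for detecting a difference we actually care about
# delta and sd are made-up values for illustration, not the paper's figures
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.8)
```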

By this stage one would expect a researcher to state one or more hypotheses to be tested. For example, ‘I expect that the GM diet will increase [condition here] by [number here] percent’. One can run an experiment ignoring ‘good manners’, but (and this is a big but) an informed reader will become suspicious if suddenly one starts testing hypotheses like there is no tomorrow. Why? Because if one conducts too many tests one is bound to find statistically significant results even when there are none.

The comic below presents a brief example with jelly beans assuming that we claim significance for an event occurring with probability 0.05 (1 in 20) or less. Notice that the scientists use 20 colors of jelly beans and find that green ones ’cause’ acne. Running so many tests—without adjusting down the probability of the event that we would call significant, so p needs to be much smaller than 0.05 to be significant—results in wrong conclusions.

XKCD explaining significance (click for original, larger version).

In the pig paper there are 8 tests in Table 2, 18 (or 15 with some value) in Table 3, 8 in Table 4 and 17 in Table 5 for a total of 49 (or 46 with some testable values). In fact one would expect to find a couple of significant results (at 0.05 or 1 in 20) by chance even if there are absolutely no differences in reality.
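A minimal sketch of the issue, simulating that many tests when we know there is nothing going on at all:

```r
# Hedged sketch: 49 tests with no real differences still produce 'significant' results
set.seed(2013)
p.values <- runif(49)        # p-values under the null are uniform between 0 and 1
sum(p.values < 0.05)         # a few 'significant' results appear just by chance
sum(p.adjust(p.values, method = 'bonferroni') < 0.05)  # usually none after adjusting
```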

Add to this that many of the tests are unnecessary, because they are performing the wrong type of analysis. For example, there are four separate analyses for stomach inflammation; however, the analysis ignores the type of variable one is testing, as I pointed out in a previous post.
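To make the point concrete, one possible analysis treats inflammation as a single ordered response rather than four separate percentages, for example via a proportional odds model. The counts below are made up for illustration and are not the paper’s numbers:

```r
# Hedged sketch: inflammation severity as one ordered variable (made-up counts)
library(MASS)
inflam <- data.frame(diet = rep(c('GM', 'non-GM'), each = 4),
                     severity = ordered(rep(c('none', 'mild', 'moderate', 'severe'), 2),
                                        levels = c('none', 'mild', 'moderate', 'severe')),
                     count = c(4, 31, 29, 9, 1, 38, 23, 11))

# Proportional odds model using the counts as weights
po.model <- polr(severity ~ diet, weights = count, data = inflam)
summary(po.model)
```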

This is why, if I were Monsanto, I would use the paper as evidence supporting the idea that there is no difference between the two diets:

1. the paper was ‘independent’ (although the authors have a clear anti-GM bias) and
2. when running proper analyses (accounting for the type of variable and the number of tests that one is running) no difference is statistically significant.

P.S. 2013-06-21 15:38 NZST. A footnote about data availability: it is becoming increasingly common to make both the data and the code used to analyze them available when publishing (part of Open Science). I would expect a paper that makes bold claims with potential policy implications to provide those, which does not happen in this case. However, the publication does include part of the data in results Tables 3 & 4, in the counts of pigs under different classes of inflammation.

At School we use R across many courses, because students are supposed to use statistics in a variety of contexts. Imagine their disappointment when they pass stats and discover that R and statistics haven’t gone away!

When students start working with real data sets one of their first stumbling blocks is subsetting data. We have data sets and either students are required to deal with different subsets or there is data cleaning to do. For some reason, many students struggle with what should be a simple task.

If one thinks of data as a matrix/2-dimensional array, subsetting boils down to extracting the needed rows (cases) and columns (variables). In the R world one can do this in a variety of ways, ranging from the cryptic to the explicit and clear. For example, let’s assume that we have a dataset called alltrials with heights and stem diameters for a number of trees in different trials (we may have a bunch of additional covariates and experimental design features that we’ll ignore for the moment). How do we extract all trees located in Christchurch?

Two common approaches are:

```r
mytrial = alltrials[alltrials$location == 'Christchurch', ]
mytrial = subset(alltrials, location == 'Christchurch')
```

While both expressions are equivalent, the former reads like Klingon to students, while the latter makes explicit that we are obtaining a subset of the original data set. This can easily be expanded to more complex conditions; for example, to include all trees from Christchurch that are taller than 10 m:

```r
mytrial = alltrials[alltrials$location == 'Christchurch' & alltrials$height > 10, ]
mytrial = subset(alltrials, location == 'Christchurch' & height > 10)
```

I think the complication with the Klingonian notation comes mostly from two sources:

• Variable names for subsetting the data set are not directly accessible, so we have to prefix them with NameOfDataset$, making the code more difficult to read, particularly if we join several conditions with & and |.
• Hanging commas: if we are only working with rows or columns we have to acknowledge it by suffixing or prefixing with a comma, which is often forgotten.

Both points result in frustrating error messages like

• Error in [.data.frame(alltrials, location == 'Christchurch', ) : object 'location' not found for the first point, or
• Error in [.data.frame(alltrials, alltrials$location == 'Christchurch') undefined columns selected for the second point.

The generic forms of these two notations are:

```r
dataset[what to do with rows, what to do with columns]
subset(dataset, what to do with rows, what to do with columns)
```

We often want to keep a subset of the observed cases and keep (or drop) specific variables. For example, we want to keep trees in 'Christchurch' and we want to ignore diameter, because the assessor was 'high' that day:

```r
# With this notation things get a bit trickier
# The easiest way is to provide the number of the variable
# Here diameter is the 5th variable in the dataset
mytrial = alltrials[alltrials$location == 'Christchurch' & alltrials$height > 10, -5]

# This notation is still straightforward
mytrial = subset(alltrials, location == 'Christchurch' & height > 10, select = -diameter)
```

There are, however, situations where Klingon is easier or more efficient. For example, to take a random sample of 100 trees from the full dataset:

```r
mytrial = alltrials[sample(1:nrow(alltrials), 100, replace = FALSE), ]
```

If you are interested in this issue, Quick-R has a good description of subsetting. I’m sure this basic topic must have been covered many times, although I doubt anyone used Klingon in the process.

Gratuitous picture: building blocks for R (Photo: Luis).

It used to be that one of the first decisions to make when learning to program was between compiled (e.g. C or FORTRAN) and interpreted (e.g. Python) languages. In my opinion, these days one would have to be a masochist to learn with a compiled language: the extra compilation time and obscure errors are a killer when learning. Today the decision would be between using a generic interpreted language (e.g. Python) and an interpreted domain specific language (DSL) like R, MATLAB, etc. While some people prefer generic languages, I’d argue that immediate feedback and the easy accomplishment of useful tasks are a great thing when one is learning something for the first time.

As an example, a while ago my son asked me what I was doing on the computer and I told him that I was programming some analyses in R. I showed him that the program was spitting back some numbers and plots, a fact that he found totally unremarkable and uninteresting. I searched the internet and found Scratch, a visual programming language that lets the user move blocks representing code around and build interactive games: now my son was sold. Together we are learning about loops, control structures and variables, drawing characters, etc. We are programming because the problems are i- much more interesting for him and ii- achievable in a short time frame.

An example Scratch script.

Learning to program for statistics, or other scientific domains for that matter, is not that different from being a kid and learning programming. Having to do too much to get even a mildly interesting result is frustrating and discouraging; it is not that the learner is dumb, but that he has to build too many functions to get a meager reward. This is why I’d say that you should use whatever language already has a large amount of functionality (‘batteries included’ in Python parlance) for your discipline. Choose rightly and you are half-way there.

‘But’, someone will say, ‘R is not a real language’. Sorry, but it is a real language (Turing complete and the whole shebang) with oddities, as any other language, granted. As with human languages, the more you study the easier it gets to learn a new one.
In fact, the syntax for many basic constructs in R is highly similar to the alternatives:

```r
# This is R code
# Loop
for(i in 1:10){
  print(i)
}
#[1] 1
#[1] 2
#[1] 3
#[1] 4
#[1] 5
#[1] 6
#[1] 7
#[1] 8
#[1] 9
#[1] 10

# Control
if(i == 10) {
  print('It is ten!')
}
#[1] "It is ten!"
```

```python
# This is Python code
# Loop
for i in range(1, 11):  # range() excludes the upper bound
    print(i)
#1
#2
#3
#4
#5
#6
#7
#8
#9
#10

# Control
if i == 10:
    print("It is ten!")
#It is ten!
```

By the way, I formatted the code to highlight the similarities. Of course there are plenty of differences between the languages, but many of the building blocks (the core ideas, if you will) are shared. You learn one and the next one gets easier.

How does one start learning to program? If I look back at 1985 (yes, last millennium) I was struggling to learn how to program in BASIC when, suddenly, I had an epiphany. The sky opened and I heard a choir singing György Ligeti’s Atmosphères (from 2001: A Space Odyssey, you know) and then I realized: we are only breaking problems into little pieces and dealing with them one at a time. I’ve been breaking problems into little pieces ever since. What else did you expect?

Anyhow, if you feel that you want to learn how to code in R, or whatever language is the best option for your problems, start small. Just a few lines will do. Read other people’s code but, again, only small pieces that are supposed to do something small. At this stage it is easy to get discouraged because everything is new and takes a lot of time. Don’t worry: everyone has to go through this process. Many people struggle to vectorize their code so it runs faster. Again, don’t worry at the beginning if your code is slow. Keep on writing and learning. Read Noam Ross’s FasteR! HigheR! StrongeR! — A Guide to Speeding Up R Code for Busy People. The guide is very neat and useful, although you don’t need super fast code, not yet at least. You need working, readable and reusable code. You may not even need code: actually, try to understand the problem, and the statistical theory, before you get coding like crazy. Programming will help you understand your problem a lot better, but you need a helpful starting point. Don’t get distracted by the politics of research and repeatability and trendy things like git (noticed that I didn’t even link to them?). You’ll learn them in time, once you have a clue about how to program.

P.S. The code used in the examples could be shortened and sped up dramatically (e.g. 1:10 or range(1, 11)) but that is not the point of this post.
P.S.2. A while ago I wrote R is a language, which could be a useful companion to this post.

I was having a conversation with an acquaintance about courses that were particularly useful in our work. My forestry degree involved completing 50 compulsory + 10 elective courses; if I had to choose the courses that were influential and/or really useful they would be Operations Research, Economic Evaluation of Projects, Ecology, 3 Calculus and 2 Algebra courses. Subsequently my PhD was almost entirely research based, but I sort of did Matrix Algebra: Dorian lent me his copy of Searle’s Matrix Algebra Useful for Statistics and passed me a pile of assignments that Shayle Searle used to give in his course at Cornell. I completed the assignments at my own pace and then sat a crazy 24-hour take-home exam. Later that year I bought a cloth-bound 1982 version of the book, not the alien vomit purple paperback reprint currently on sale, which I consult from time to time. Why would one care about matrix algebra?
Besides being a perfectly respectable intellectual endeavor in itself, maybe you can see that the degrees of freedom are the rank of a quadratic form; you can learn from this book what a quadratic form and a matrix rank are. Or you want to see more clearly the relationship between regression and ANOVA, because in matrix form a linear model is a linear model is a linear model. Functions like outer() and kronecker(), and the %*% operator, make a lot more sense once you know what the outer and inner products of vectors are. Thus, if you really want to understand a matrix language for data analysis and statistics (like R), it seems reasonable to try to understand the building blocks of such a language.

The book does not deal with any applications to statistics until chapter 13. Before that it is all about laying the foundations to understand the applications, but do not expect nice graphs and cute photos. This is a very good text where one has to use the brain to imagine what’s going on in the equations and demonstrations. The exercises rely a lot on ‘prove this’ and ‘prove that’, which leads to much frustration and, after persevering, to many ‘aha! moments’.

XKCD 1050: In your face! Actually I feel the opposite concerning math.

I am the first to accept that I have a biased opinion about this book, because it has sentimental value. It represents difficult times, dealing with a new language, culture and, on top of that, animal breeding. At the same time, it opened doors to a whole world of ideas. This is much more than I can say of most books.

PS 2012-12-17: I have commented on a few more books in these posts.

A good part of my electives were in the humanities (history & literature), which was unusual for forestry. I just couldn’t conceive going through a university without doing humanities.

I can’t exactly remember how I arrived at Making sense of random effects, a good post in the Distributed Ecology blog (go over there and read it). Incidentally, my working theory is that I follow Scott Chamberlain (@recology_), who follows Karthik Ram (@_inundata) who mentioned Edmund Hart’s (@DistribEcology) post. I liked the discussion, but I thought one could add to the explanation to make it a bit clearer. The idea is that there are 9 individuals, assessed five times each—once under each of five different levels for a treatment—so we need to include individual as a random effect; after all, it is our experimental unit.
The code to generate the data, plot it and fit the model is available in the post, but I redid the data generation to make it a bit more R-ish and, dare I say, a tad more elegant:

```r
require(lme4)
require(ggplot2)

# Basic dataset
idf <- data.frame(ind = factor(rep(1:9, 5)),
                  levs = factor(rep(c('i1', 'i2', 'i3', 'i4', 'i5'), each = 9)),
                  f.means = rep(c(6, 16, 2, 10, 13), each = 9),
                  i.eff = rep(seq(-4, 4, length = 9), 5))

# Function for individual values used in apply
ran.ind <- function(f.means, i.eff) rnorm(1, f.means + i.eff, 0.3)

# Rich showed me this version of apply that takes more than one argument
# https://smtp.biostat.wustl.edu/sympa/biostat/arc/s-news/2005-06/msg00022.html
# so we skip loops
idf$size <- apply(idf[, 3:4], 1, function(x) do.call('ran.ind', as.list(x)))

ggplot(idf, aes(x = levs, y = size, group = ind, colour = ind)) +
  geom_point() + geom_path()

# Fit linear mixed model (avoid an overall mean with -1)
m3 <- lmer(size ~ levs - 1 + (1 | ind), data = idf)
summary(m3)

# Skipping a few things
#   AIC   BIC logLik deviance REMLdev
# 93.84 106.5 -39.92    72.16   79.84
#Random effects:
# Groups   Name        Variance Std.Dev.
# ind      (Intercept) 7.14676  2.67334
# Residual             0.10123  0.31816
#Number of obs: 45, groups: ind, 9

# Show fixed effects
fixef(m3)

#   levsi1    levsi2    levsi3    levsi4    levsi5
# 5.824753 15.896714  2.029902  9.969462 12.870952
```

Original simulated data, showing the effect of treatment (fixed effect) and each individual.

What we can do to better understand what’s going on is ‘adjust’ the score observations by the estimated fixed effects and plot those values to see what we are modeling with the random effects:

```r
# Subtract fixed effects
idf$adjusted <- idf$size - rep(fixef(m3), each = 9)

ggplot(idf, aes(x = levs, y = adjusted, group = ind, colour = ind)) +
  geom_point() + geom_path()

# Display random effects
ranef(m3)

#$ind
#  (Intercept)
#1 -3.89632810
#2 -2.83054394
#3 -1.99715524
#4 -1.12733342
#5  0.06619981
#6  0.95605162
#7  2.00483963
#8  2.99947727
#9  3.82479236
```

Data ‘adjusted’ by fixed effects. The random intercepts would be lines going through the average of points for each individual.

The random effects for individual or, better, the individual-level intercepts are pretty much the lines going through the middle of the points for each individual. Furthermore, the variance for ind is the variance of the random intercepts around the ‘adjusted’ values, which can be seen by comparing the variance of the random effects above (~7.15) with the result below (~7.13).

```r
var(unlist(ranef(m3)))
#[1] 7.12707
```

Distributed Ecology then goes on to randomize the individuals within treatment, which means that the average deviation around the adjusted means is pretty close to zero, making that variance component close to zero. I hope this explanation complements Edmund Hart’s nice post.

P.S. If you happen to be in the Southern part of South America next week, I’ll be here and we can have a chat (and a beer, of course).

(This post continues discussing issues I described back in January in Academic publication boycott.)

Some weeks ago I received a couple of emails the same day: one asking me to submit a paper to an open access journal, while the other invited me to be the editor of a ‘special issue’ of my choice for another journal. I had not heard of either of the two publications before, and they follow pretty much the same model: submit a paper for $600 and—if they like it—it will be published. However, the special issue email had this ‘buy your way in’ feeling: find ten contributors (i.e. $6,000) and you get to be an editor.

Now, there is nothing wrong per se with open access journals; some of my favorite ones (e.g. PLoS ONE) follow that model. However, I was surprised by the increasing number of new journals that look at filling the gap for ‘I need to publish soon, somewhere’. Surprised until one remembers the incentives at play in academic environments. If I, or most academics for that matter, want to apply for academic promotion I have to show that I’m a good guy that has a ‘teaching philosophy’ and that my work is good enough to get published in journals; hopefully in lots of them. The first part is a pain, but most people can write something along the lines of ‘I’m passionate about teaching and enjoy creating a challenging environment for students…’ without puking. The second part is trickier because one has to really have the papers in actual journals.

Personally, I would be happier with only having the odd ‘formal’ publication. The first time (OK, first few times) I saw my name in a properly typeset paper was very exciting, but it gets old after a while. These days, however, I would prefer to just upload my work to a website, saying here you have some ideas and code, play with it. If you like it great, if not, well, next time I hope it’ll be better. Nevertheless, this doesn’t count as proper publication, because it isn’t peer reviewed, independently of the number of comments the post may get. PLoS ONE counts, but it’s still a journal and I (and many other researchers) work on many things that are too small for a paper, but cool enough to share.
The problem: there is little or no credit for sharing, so Quantum Forest is mostly a ‘labor of love’, which counts bugger all for anything else. These days as a researcher I often learn more from other people’s blogs and quick idea exchanges (for example through Twitter) than via formal publication. I enjoy sharing analysis, ideas and code in this blog. So what’s the point of so many papers in so many journals? I guess that many times we are just ‘ticking the box’ for promotion purposes. In addition, the idea of facing referees’ or editors’ comments like ‘it would be a good idea to cite the following papers…’ puts me off. And what about authorship arrangements? We have moved from papers with 2-3 authors to enough authors to field a football team (with reserves and everything). Some research groups also run arrangements where ‘I scratch your back (include you as a coauthor) and you scratch mine (include me in your papers)’. We break ideas into little pieces that count for many papers, etc.

Another related issue is the cost of publication (and the barriers it imposes on readership). You see, we referee papers for journals for free (as in for zero money) and tell ourselves that we are doing a professional service to uphold the high standards of whatever research branch we belong to. Then we spend a fortune from our library budget to subscribe to the same journals for which we reviewed the papers (for free, remember?). It is not a great deal, as many reasonable people have pointed out; I added a few comments in academic publication boycott.

So, what do we need? We need promotion committees to reduce the weight on publication. We need to move away from the impact factor. We can and need to communicate in other ways: scientific papers will not go away, but their importance should be reduced. Sometimes the way forward is unclear.

Incense doesn’t hurt (Photo: Luis).

Making an effort to prepare interesting lectures doesn’t hurt either. These days it is fairly common for editors to ‘suggest’ including additional references in our manuscripts, which just happen to be papers in the same journal, hoping to inflate its impact factor. Referees tend to suggest their own papers (sometimes useful, many times not). Lame, isn’t it?

PS. 2012-10-19 15:27 NZST. You also have to remember that just because something was published it is not necessarily correct: outrageously funny example (via Arthur Charpentier). Yep, through Twitter.

I’ve ignored my quantitative geneticist side of things for a while (at least in this blog) so this time I’ll cover some code I was exchanging with a couple of colleagues who work for other organizations. It is common to use diallel mating designs in plant and tree breeding, where a small number of parents acts as both males and females. For example, with 5 parents we can have 25 crosses, including reciprocals and selfing (crossing an individual with itself). Decades ago this mating design was tricky to fit and, considering an experimental layout with randomized complete blocks, one would have something like y = mu + blocks + dads + mums + cross + error. In this model dads and mums were estimating a fraction of the additive genetic variance. With the advent of animal model BLUP, it became possible to fit something like y = mu + blocks + individual (using a pedigree) + cross + error.
Another less computationally demanding alternative (at least with unrelated parents) is to fit a parental model, overlaying the design matrices for parents with something like this: y = mu + blocks + (dad + mum) + cross + error. The following code simulates data for a diallel trial with three replicates and runs a parental model with ASReml. Later I expand the analysis, building the matrices by hand.

```r
# Defining diallel cross
n.blocks <- 3
diallel <- matrix(0, nrow = 5, ncol = 5)
males <- 1:dim(diallel)[1]
females <- 1:dim(diallel)[2]
cross <- factor(outer(males, females, paste, sep = 'x'))
cross
#[1] 1x1 2x1 3x1 4x1 5x1 1x2 2x2 3x2 4x2 5x2 1x3 2x3 3x3 4x3 5x3 1x4 2x4
#[18] 3x4 4x4 5x4 1x5 2x5 3x5 4x5 5x5
#25 Levels: 1x1 1x2 1x3 1x4 1x5 2x1 2x2 2x3 2x4 2x5 3x1 3x2 3x3 ... 5x5

#### Generating random values for the trial
# Variance components
sig2.a <- 40    # additive genetic variance
sig2.sca <- 10  # specific combining ability (SCA)
sig2.b <- 10    # blocks
sig2.e <- 30    # residuals

# Numbers of everything
n.parents <- length(males)
n.crosses <- length(cross)
n.trees <- n.crosses*n.blocks

# Random values for everything
u.g <- rnorm(n.parents)*sqrt(sig2.a)
u.sca <- rnorm(n.crosses)*sqrt(sig2.sca)
u.block <- rnorm(n.blocks)*sqrt(sig2.b)
u.e <- rnorm(n.trees)*sqrt(sig2.e)

# Whole trial
trial <- data.frame(block = factor(rep(1:n.blocks, each = n.crosses)),
                    mum = factor(rep(females, n.blocks)),
                    dad = factor(rep(males, each = length(females))),
                    mate = factor(rep(cross, n.blocks)))

trial$yield <- 0.5*(u.g[trial$mum] + u.g[trial$dad]) +
               u.sca[trial$mate] + u.block[trial$block] + u.e

head(trial)
#  block mum dad mate       yield
#1     1   1   1  1x1 -0.02185486
#2     1   2   1  2x1 10.79760712
#3     1   3   1  3x1 16.45186037
#4     1   4   1  4x1  8.15026291
#5     1   5   1  5x1  5.57707180
#6     1   1   2  1x2 -7.30675148

# Fitting the model with ASReml
library(asreml)
m1 <- asreml(yield ~ 1, random = ~ block + dad + and(mum) + mate, data = trial)
summary(m1)
#                       gamma    component    std.error   z.ratio
#block!block.var 1.299110e-02 3.861892e-01 1.588423e+00 0.2431274
#dad!dad.var     2.101417e-01 6.246930e+00 5.120745e+00 1.2199259
#mate!mate.var   4.589938e-07 1.364461e-05 2.340032e-06 5.8309519
#R!variance      1.000000e+00 2.972722e+01 5.098177e+00 5.8309519

# Obtaining the predicted breeding values for the parents
coef(m1, pattern = 'dad')
#         effect
#dad_1  1.780683
#dad_2 -2.121174
#dad_3  3.151991
#dad_4 -1.473620
#dad_5 -1.337879
```

How is the matrix overlay working? We can replicate the calculations used by ASReml by building the matrices from scratch and reusing the variance components, so we avoid the nastiness of writing code for residual maximum likelihood. Once I build the basic matrices I use the code from my How does a linear mixed model look like? post.

```r
# Building incidence matrices for males and females
Zmales <- model.matrix(~ dad - 1, data = trial)
Zfemales <- model.matrix(~ mum - 1, data = trial)

# and() in ASReml overlays (i.e. sums) the matrices
# (Notice that this creates 0s, 1s and 2s in the design matrix)
Zparents <- Zmales + Zfemales
Zparents[1:6, ]
#  dad1 dad2 dad3 dad4 dad5
#1    2    0    0    0    0
#2    1    1    0    0    0
#3    1    0    1    0    0
#4    1    0    0    1    0
#5    1    0    0    0    1
#6    1    1    0    0    0

# Design matrices for the other factors
Zblocks <- model.matrix(~ block - 1, data = trial)
Zmates <- model.matrix(~ mate - 1, data = trial)

# Creating y, X and big Z matrices to solve the mixed model equations
y <- trial$yield
X <- matrix(1, nrow = 75, ncol = 1)
Z <- cbind(Zblocks, Zparents, Zmates)

# Building covariance matrices for random effects
# Using the variance components from the ASReml model
G = diag(c(rep(3.861892e-01, 3), rep(6.246930e+00, 5), rep(1.364461e-05, 25)))
R = diag(2.972722e+01, 75, 75)
Rinv = solve(R)

# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z
XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X
Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y

# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ), cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)

blup = solve(C, RHS)
blup
# These results are identical to the ones
# produced by ASReml
#dad1  1.780683e+00
#dad2 -2.121174e+00
#dad3  3.151991e+00
#dad4 -1.473620e+00
#dad5 -1.337879e+00
```

The overlay matrix Zparents is double the actual value we should use:

```r
Zparents2 <- 0.5*(Zmales + Zfemales)
Zparents2[1:6, ]
#  dad1 dad2 dad3 dad4 dad5
#1  1.0  0.0  0.0  0.0  0.0
#2  0.5  0.5  0.0  0.0  0.0
#3  0.5  0.0  0.5  0.0  0.0
#4  0.5  0.0  0.0  0.5  0.0
#5  0.5  0.0  0.0  0.0  0.5
#6  0.5  0.5  0.0  0.0  0.0
```

Repeating the analyses ‘by hand’ using Zparents2 to build the Z matrix results in the same overall mean and block effects, but the predicted breeding values for parents when using Zparents are 0.7 of the predicted breeding values for parents when using Zparents2. I may need to refit the model and obtain new variance components for parents when working with the correct overlaid matrix.

Gratuitous picture: walking in Quail Island (Photo: Luis).

This week I’ve tried to i- stay mostly in the descriptive statistics realm and ii- surround any simple(istic) models with caveats, pointing out that they are very preliminary. We are working with a sample of ~1,000 schools that did reply to Fairfax’s request, while there are a number of schools that either ignored the request or told Fairfax to go and F themselves. Why am I saying this? If one gets a simple table of the number of schools by type and decile there is something quite interesting: we have different percentages for different types of schools represented in the sample, and the possibility of bias in the reporting to Fairfax due to potential low performance (references to datasets correspond to the ones I used in this post):

```r
summary(standards$school.type)
#         Composite (Year 1-10)          Composite (Year 1-15)        Contributing (Year 1-6)
#                             1                             29                            403
#       Full Primary (Year 1-8)    Intermediate (year 7 and 8) Restricted Composite (Yr 7-10)
#                           458                             62                              1
#         Secondary (Year 7-15)
#                            56
```

Now let’s compare these numbers with the school directory:

```r
summary(factor(directory$school.type))
#         Composite (Year 1-10)          Composite (Year 1-15)        Contributing (Year 1-6)
#                             4                            149                            775
#         Correspondence School        Full Primary (Year 1-8)    Intermediate (year 7 and 8)
#                             1                           1101                            122
#Restricted Composite (Yr 7-10)         Secondary (Year 11-15)          Secondary (Year 7-10)
#                             4                              2                              2
#         Secondary (Year 7-15)          Secondary (Year 9-15)                 Special School
#                           100                            238                             39
#              Teen Parent Unit
#                            20
```

As a proportion, we are missing more secondary schools. We can use the following code to get an idea of how similar the school types are, because the small number of different composite schools is a pain:

```r
# Performance of Contributing (Year 1-6) and
# Full Primary (Year 1-8) looks pretty much the
# same. Composites could be safely merged
qplot(school.type, reading.OK, data = standards, geom = 'jitter')
qplot(school.type, writing.OK, data = standards, geom = 'jitter')
qplot(school.type, math.OK, data = standards, geom = 'jitter')

# Merging school types and plotting them colored
# by decile
standards$school.type.4 <- standards$school.type
levels(standards$school.type.4) <- c('Composite', 'Composite', 'Primary',
                                     'Primary', 'Intermediate',
                                     'Composite', 'Secondary')

qplot(school.type.4, reading.OK, colour = decile,
      data = standards, geom = 'jitter')
```

Representation of different school types and deciles is uneven.

Different participation in the sample by school type. This time it is performance in mathematics.

I’m using jittering rather than box and whisker plots to i- depict all the schools and ii- get an idea of the different participation of school types in the dataset (the boxplot version is sketched below for comparison). Sigh. Another caveat to add to the discussion.
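For anyone who prefers the summary view, the box-and-whisker version is a one-line change; a quick sketch for the mathematics results:

```r
# Box-and-whisker alternative to the jittered plot for mathematics
qplot(school.type.4, math.OK, data = standards, geom = 'boxplot')
```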

P.S. 2012-09-27 16:15. Originally I mentioned in this post the lack of secondary schools (Year 9-15) but, well, they are not supposed to be here, because National Standards apply to years 1 to 8 (Thanks to Michael MacAskill for pointing out my error.)

Eric and I have been exchanging emails about potential analyses for the school data and he published a first draft model in Offsetting Behaviour. I have kept on doing mostly data exploration while we get a definitive full dataset, and looking at some of the pictures I thought we could present a model with fewer predictors.

The starting point is the standards dataset I created in the previous post:

```r
# Make authority a factor
standards$authority <- factor(standards$authority)

# Plot relationship between school size and number of FTTE
# There seems to be a separate trend for secondary schools
qplot(total.roll, teachers.ftte, data = standards, color = school.type)
```

There seems to be a different trend for secondary versus non-secondary schools in the relationship between the number of full-time teacher equivalents (FTTE) and total roll. The presence of a small number of large schools suggests that log transforming the variables could be a good idea.
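Just to check, a quick sketch of the same plot on log-log axes reduces the pull of those few very large schools:

```r
# Same plot on a log-log scale, so the few very large schools do not dominate
qplot(total.roll, teachers.ftte, data = standards,
      color = school.type, log = 'xy')
```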

```r
# Create a factor for secondary schools versus the rest
standards$secondary <- factor(ifelse(standards$school.type == 'Secondary (Year 7-15)',
                                     'Yes', 'No'))

# Plot the relationship between number of students per FTTE
# and type of school
qplot(secondary, total.roll/teachers.ftte,
      data = standards, geom = 'boxplot',
      ylab = 'Number of students per FTTE')
```

Difference in the number of students per FTTE between secondary and non-secondary schools.

Now we fit a model where we are trying to predict reading standards achievement per school accounting for decile, authority, proportion of non-European students, secondary schools versus the rest, and a different effect of the number of students per FTTE for secondary and non-secondary schools.

standards\$students.per.ftte <- with(standards, total.roll/teachers.ftte)   # Decile2 is just a numeric version of decile (rather than a factor) m1e <- lm(reading.OK ~ decile2 + I(decile2^2) + authority + I(1 - prop.euro) + secondary*students.per.ftte, data = standards, weights = total.roll) summary(m1e)   #Coefficients: # Estimate Std. Error t value Pr(>|t|) #(Intercept) 0.6164089 0.0305197 20.197 < 2e-16 *** #decile2 0.0422733 0.0060756 6.958 6.24e-12 *** #I(decile2^2) -0.0015441 0.0004532 -3.407 0.000682 *** #authorityState: Not integrated -0.0440261 0.0089038 -4.945 8.94e-07 *** #I(1 - prop.euro) -0.0834869 0.0181453 -4.601 4.74e-06 *** #secondaryYes -0.2847035 0.0587674 -4.845 1.47e-06 *** #students.per.ftte 0.0023512 0.0011706 2.009 0.044854 * #secondaryYes:students.per.ftte 0.0167085 0.0040847 4.091 4.65e-05 *** ---   #Residual standard error: 1.535 on 999 degrees of freedom # (3 observations deleted due to missingness) #Multiple R-squared: 0.4942, Adjusted R-squared: 0.4906 #F-statistic: 139.4 on 7 and 999 DF, p-value: < 2.2e-16

The residuals are still a bit of a mess:

Residuals for this linear model: still a bit of a mess.
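For the record, a quick way of looking at those residuals (a sketch, not necessarily the exact plot above) would be:

```r
# Hedged sketch: quick residual diagnostics for m1e
qplot(fitted(m1e), resid(m1e), xlab = 'Fitted values', ylab = 'Residuals')

# Normal qq-plot using base graphics; heavy tails show up at both ends
qqnorm(resid(m1e)); qqline(resid(m1e))
```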

If we remember my previous post, decile accounted for 45% of the variation, and we explain about 4% more through the additional predictors. Non-integrated schools have lower performance, a higher proportion of non-European students reduces performance, secondary schools have lower performance, and larger classes tend to perform better (Eric suggests reverse causality; I’m agnostic at this stage), although the rate of improvement changes between secondary and non-secondary schools. In contrast with Eric, I didn’t fit separate ethnicities, as those predictors are related to each other and constrained to add up to one.
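A minimal sketch of that comparison, fitting the decile-only model alongside m1e (keeping in mind that m1e drops three schools with missing values, so the numbers are approximate):

```r
# Hedged sketch: how much do the extra predictors add over decile alone?
m0e <- lm(reading.OK ~ decile2 + I(decile2^2), data = standards, weights = total.roll)

summary(m0e)$adj.r.squared   # close to the 45% explained by decile alone
summary(m1e)$adj.r.squared   # roughly four percentage points higher
```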

Of course this model is very preliminary, and a quick look at the coefficients will show that changes in any predictor besides decile will move the response by a very small amount (despite the tiny p-values and numerous stars next to them). The distribution of residuals is still heavy-tailed and there are plenty of questions about data quality; I’ll quote Eric here:

But differences in performance among schools of the same decile by definition have to be about something other than decile. I can’t tell from this data whether it’s differences in stat-juking, differences in unobserved characteristics of entering students, differences in school pedagogy, or something else. But there’s something here that bears explaining.