Should I reject a manuscript because the analyses weren’t done using open source software?

“Should I reject a manuscript because the analyses weren’t done using open software?” I overheard a couple of young researchers discussing. Initially I thought it was a joke but, to my surprise, it was not funny at all.

There is an unsettling, underlying idea in that question: that the value of a scientific work can be reduced to its computability. If I, the reader, cannot replicate the computation, the work is of little, if any, value. Even further, my verification must involve no software cost, because otherwise we would be limiting the possibility of computation to only those who can afford it. Therefore, the almost unavoidable conclusion is that we should force the use of open software in science.

What happens if the analyses were run using a point-and-click interface? For example, SPSS, JMP, Genstat, Statistica and a few other programs give access to fairly complex analytical algorithms via a system of menus and icons. Most of them are not open source, nor do they generate code for the analyses. Should we ban their use in science? One could argue that if users only spent the time to learn a programming language (e.g. R or Python) they would be free of the limitations of point-and-click. Nevertheless, we would be shifting accessibility from people who can pay for an academic software license to people who can learn (and moderately enjoy) programming. Are we better off as a research community because of that shift?

There is another assumption: that open software will always provide good (or even appropriate) analytical tools for any problem. I assume that in many cases OSS is good enough and that there is a subset of problems where it is the best option. However, there is another subset where it is suboptimal. For example, I deal a lot with linear mixed models used in quantitative genetics, an area where R is seriously deficient; in fact, I would have to ignore the last 15 years of statistical development to run large problems. Given that some of the data sets are worth millions of dollars and decades of work, should I sacrifice the use of the best models so that a hypothetical someone, somewhere, can run my code without paying for an academic software license? This was a rhetorical question, by the way, as I would not do it.

There are trade-offs and unintended consequences in all research policies. This is one case where I think the negative effects would outweigh the benefits.

Gratuitous picture: I smiled when I saw the sign with the rightful place for forestry (Photo: Luis).

P.S. 2013-12-20 16:13 NZST Timothée Poisot provides some counterarguments for a subset of articles: papers about software.

If one were to invent scientific journals today

While taking a shower I was daydreaming about what would happen if one were to invent journals today, with a very low cost of publication and no physical limits to the size of a publication. My shower answer was that there would be little chance for a model like traditional printed journals.

One could create a central repository (a bit like the arXiv) taking submissions of the article text + figures, which would be automatically translated to a decent-looking web format and a printable version. This would be the canonical version of the article and would get assigned a unique identifier. The submitters would get to update their article any number of times, creating versions (pretty much like software). This way they could fix any issues without breaking references from other articles.

There would be a payment for submitting articles to the repository (say $100 for the sake of argument), covering the costs of hosting and infrastructure, serving at the same time as a deterrent for spam.

Journals in their current form would tend to disappear, but there would be topical aggregators (or feeds). Thus, the ‘Journal of whatever’ would now be a curator of content from the ‘big bucket’ central repository, pulling aside articles worthy (in their opinion) of more scrutiny, commentary, etc. This could be either a commercial venture or an amateur labor of love, run by people very interested in a given topic, and it could even apply a different format to the canonical article, always pointing back to the unique identifier in the central repository.

Some aggregators could be highly picky and recognized by readers, becoming the new Nature or Science. Authors could still ‘submit’ or recommend their papers to these aggregators. However, papers could also be in multiple feeds and copyright would probably stay with the authors for a limited amount of time. The most important currency for academics is recognition, and this system would provide it, as well as the potential for broad exposure and no cost for readers or libraries.

There would be no pre-publication peer review because, let’s face it, currently it’s more of a lottery than anything else. Broad post-publication peer review by the research community would be the new standard.

Any big drawbacks for my shower daydream?

P.S.1 2013-12-15 13:40 NZST Thomas Lumley pointed me to a couple of papers on the ‘Selected Papers Network’, which would be one way of dealing with prestige/quality/recognition signals needed by academics.

P.S.2 2013-12-15 14:43 NZST This ‘journals are feeds’ approach fits well with how I read papers: I do not read journals, but odd papers that I find either via web searches or recommended by other researchers. There are, however, researchers who aim to read whole issues, although I can’t make sense of that.

Jetsam 11: Two hundred beetles

I am a fan of Tim Prebble’s The Music of Sound, where he deals with field recording and the role of sound in general. Tim is running a field recording competition which requires ‘a cardboard box, a microphone, a recorder and you. Thats it! No processing allowed, submit a single take’.

I decided to go for something a bit creepy given the time of the year: getting close to Día de los Muertos. My setup was fairly simple:

  • One (empty) box of Weet-Bix apricot bites.
  • One JrF c-series contact microphone.
  • One impedance transforming thingy.
  • Two pieces of masking tape.
  • One Sound Devices Mixpre-D.
  • One USB cable to record on a MacBook Air using Audacity.
  • Two hundred (200) mealworm beetles (Tenebrio sp.).

The setup looked like this, except that in the recording the box was leaning while the insects went in:

The photo shows the ingredients with the beetles having a dinner-break, while I was setting up a leaning box.

I was pleased with the final result, although I would have liked a fuller, boxier sound. I’ll have to train the beetles to behave better in the box.

Flotsam 13: early July links

Man flu kept me at home today, so I decided to do something ‘useful’ and go for a linkathon:

Sometimes people are truthful and cruel. Here Gappy on a mission goes for the jugular:

Over and out.

My take on the USA versus Western Europe comparison of GM corn

A few days ago I came across Jack Heinemann and collaborators’ article (Sustainability and innovation in staple crop production in the US Midwest, Open Access) comparing the agricultural sectors of the USA and Western Europe. While the article is titled around the word sustainability, the main comparison stems from the use of genetically modified crops in the USA versus their absence in Western Europe.

I was curious about part of the results and discussion which, in a nutshell, suggest that “GM cropping systems have not contributed to yield gains, are not necessary for yield gains, and appear to be eroding yields compared to the equally modern agroecosystem of Western Europe”. The authors relied on several crops for the comparison (maize/corn, rapeseed/canola [see P.S.6], soybean and cotton); however, I am going to focus on a single one (corn) for two reasons: 1. I can’t afford a lot of time for blog posts when I should be preparing lectures, and 2. I like eating corn.

When the authors of the paper tackled corn the comparison was between the USA and Western Europe, using the United Nations definition of Western Europe (i.e. Austria, Belgium, France, Germany, Liechtenstein, Luxembourg, Monaco, Netherlands, Switzerland). Some large European corn producers like Italy are not there because of the narrow definition of Western.

I struggled with the comparison used by the authors because, in my opinion, there are potentially so many confounded effects (different industry structures, weather, varieties, etc.) that it can’t provide a proper counterfactual for GM versus non-GM crops. Anyway, I decided to have a look at the same data to see if I would reach the same conclusions. The article provides a good description of where the data came from, as well as how the analyses were performed, and the small details needed to match the results exactly were fairly easy to figure out. I downloaded the FAO corn data (3.7 MB csv file) for all countries (so I can reuse the code and data later for lectures and assignments) and then repeated the plots using the following code:

# Default directory
setwd('~/Dropbox/quantumforest')
 
# Required packages
require(ggplot2)
require(scales)   # provides the comma() label formatter used below
 
# Reading FAO corn data
FAOcorn = read.csv('FAOcorn.csv')
 
# Extracting Area
FAOarea = subset(FAOcorn, Element == 'Area Harvested', 
                 select = c('Country', 'Year', 'Value'))
 
names(FAOarea)[3] = 'Area'
 
# and production
FAOprod = subset(FAOcorn, Element == 'Production', 
                 select = c('Country', 'Year', 'Value'))
 
names(FAOprod)[3] = 'Production'
 
# to calculate yield in hectograms
FAOarea = merge(FAOarea, FAOprod, by = c('Country', 'Year'))
FAOarea$Yield = with(FAOarea, Production/Area*10000)
 
# Subsetting only the countries of interest (and years to match paper)
FAOarticle = subset(FAOarea, Country == 'United States of America' | Country == 'Western Europe')
 
# Plot with regression lines
ggplot(FAOarticle, aes(x = Year, y = Yield, color = Country)) + 
      geom_point() + stat_smooth(method = lm, fullrange = TRUE, alpha = 0.1) + 
      scale_y_continuous('Yield [hectograms/ha]', limits = c(0, 100000), labels = comma) + 
      theme(legend.position="top")
Figure 1. Corn yield per year for USA and Western Europe (click to enlarge).

I could obtain pretty much the same regression model equations as in the article by expressing the years as deviation from 1960 as in:

# Expressing year as a deviation from 1960, so results
# match paper
FAOarticle$NewYear = with(FAOarticle, Year - 1960)
 
usa.lm = lm(Yield ~ NewYear, data = FAOarticle, 
            subset = Country == 'United States of America')
summary(usa.lm)   
 
#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country == 
#    "United States of America")
#
#Residuals:
#     Min       1Q   Median       3Q      Max 
#-18435.4  -1958.3    338.3   3663.8  10311.0 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept) 38677.34    1736.92   22.27   <2e-16 ***
#NewYear      1173.83      59.28   19.80   <2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
#
#Residual standard error: 6049 on 48 degrees of freedom
#Multiple R-squared: 0.8909,	Adjusted R-squared: 0.8887 
 
 
weu.lm = lm(Yield ~ NewYear, data = FAOarticle,
            subset = Country == 'Western Europe')
summary(weu.lm)   
 
#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country == 
#   "Western Europe")
#
#Residuals:
# Min       1Q   Median       3Q      Max 
#-14726.6  -3205.8    346.4   4000.6  10289.5 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept) 31510.14    1665.90   18.91   <2e-16 ***
#NewYear      1344.42      56.86   23.65   <2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
#
#Residual standard error: 5802 on 48 degrees of freedom
#Multiple R-squared: 0.9209,	Adjusted R-squared: 0.9193 
#F-statistic: 559.1 on 1 and 48 DF,  p-value: < 2.2e-16

Heinemann and collaborators then point out the following:

…the slope in yield increase by year is steeper in W. Europe (y = 1344.2x + 31512, R² = 0.92084) than the United States (y = 1173.8x + 38677, R² = 0.89093) from 1961 to 2010 (Figure 1). This shows that in recent years W. Europe has had similar and even slightly higher yields than the United States despite the latter’s use of GM varieties.

However, that interpretation using all the data assumes that both ‘countries’ were using GM corn the whole time. An interesting point is that the USA and Western Europe were already on different trends before the introduction of GM corn. We can state that because we have some idea of when GM crops were introduced in the USA. This information is collected by the US Department of Agriculture in its June survey of growers and made publicly available at the state level (GMcornPenetration.csv):

cornPenetration = read.csv('GMcornPenetration.csv')
 
ggplot(cornPenetration, aes(x = Year, y = PerAllGM)) + geom_line() + facet_wrap(~ State) +
      scale_y_continuous('Percentage of GM corn') +
      theme(axis.text.x = element_text(angle = 90))  # element_text() replaces the deprecated theme_text()
Figure 2. GM corn percentage by state in the USA (click to enlarge).

This graph tells us that by the year 2000 the percentage of GM corn planted was well below 50% in most corn-producing states (in fact, it was 25% at the country level). From then on there was a steady increase, reaching over 80% in most states by 2008. Given this, it probably makes sense to assume that, at the USA level, yield reflects non-GM corn until 1999 and progressively reflects the effect of GM genotypes from 2000 onwards. This division is somewhat arbitrary, but easy to implement.

We can repeat the previous analyses, limiting the data from 1961 until, say, 1999:

usa.lm2 = lm(Yield ~ NewYear, data = FAOarticle, 
            subset = Country == 'United States of America' & Year < 2000)
summary(usa.lm2)   
 
#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country == 
#    "United States of America" & Year < 2000)
#
#Residuals:
#   Min     1Q Median     3Q    Max 
#-17441  -2156   1123   3989   9878 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept) 39895.57    2084.81   19.14  < 2e-16 ***
#NewYear      1094.82      90.84   12.05 2.25e-14 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 6385 on 37 degrees of freedom
#Multiple R-squared:  0.797,	Adjusted R-squared:  0.7915 
#F-statistic: 145.2 on 1 and 37 DF,  p-value: 2.245e-14
 
weu.lm2 = lm(Yield ~ NewYear, data = FAOarticle,
            subset = Country == 'Western Europe' & Year < 2000)
summary(weu.lm2)   
 
#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country == 
#    "Western Europe" & Year < 2000)
#
#Residuals:
#   Min     1Q Median     3Q    Max 
#-10785  -3348    -34   3504  11117 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept) 29802.17    1813.79   16.43   <2e-16 ***
#NewYear      1454.48      79.03   18.40   <2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 5555 on 37 degrees of freedom
#Multiple R-squared:  0.9015,	Adjusted R-squared:  0.8988 
#F-statistic: 338.7 on 1 and 37 DF,  p-value: < 2.2e-16

These analyses indicate that Western Europe started with a lower yield than the USA (29,802.17 vs 39,895.57 hectograms/ha) and managed to increase yield much more quickly (1,454.48 vs 1,094.82 hectograms/ha per year) before any use of GM corn by the USA. Figure 1 shows a messy picture because there are numerous factors affecting yield each year (e.g. weather has a large influence). We can take averages for each decade and see how the two ‘countries’ are performing:

# Aggregating every decade.
# 2013-07-05 20:10 NZST I fixed the aggregation because it was averaging yields rather than
# calculating total production and area for the decade and then calculating the average yield
# Discussion points are totally valid 
FAOarticle$Decade = cut(FAOarticle$Year, 
                        breaks = seq(1959, 2019, 10), 
                        labels = paste(seq(1960, 2010, 10), 's', sep = ''))
 
decadeProd = aggregate(Production ~ Country + Decade,
                       data = FAOarticle,
                       FUN = sum)
 
decadeArea = aggregate(Area ~ Country + Decade,
                       data = FAOarticle,
                       FUN = sum)
 
decadeYield = merge(decadeProd, decadeArea, by = c('Country', 'Decade'))
decadeYield$Yield = with(decadeYield, Production/Area*10000)
 
ggplot(decadeYield, aes(x = Decade, y = Yield, fill = Country)) + 
       geom_bar(stat = 'identity', position = 'dodge') +
       scale_y_continuous('Yield [hectograms/ha]', expand = c(0, 0)) +
       theme(legend.position="top")
Figure 3. Corn yield by decade (click to enlarge).

This last figure requires more attention. We can again see that Western Europe starts with lower yields than the USA; however, it keeps increasing those yields faster than the USA, overtaking it during the 1990s. Again, all this change happened while neither the USA nor Western Europe was using GM corn. The situation reverses in the 2000s, when the USA overtakes Western Europe while continuously increasing its percentage of GM corn. The last bar in Figure 3 is misleading because it includes a single year (2010) and we know that yields in the USA went down in 2011 and 2012, affected by a very large drought (see Figure 4).

At least when looking at corn, I can’t say (with the same data available to Heinemann) that there is no place or need for GM genotypes. I do share some of his concerns with respect to the low level of diversity present in staple crops but, in contrast to his opinion, I envision a future for agriculture that includes large-scale operations (either GM or non-GM), as well as smaller operations (including organic ones). I’d like to finish on an optimistic note by looking further back at yield, because the USDA National Agricultural Statistics Service has kept yield statistics for corn since 1866(!) (csv file), although it uses bizarre non-metric units (bushels/acre). As a metric boy, I converted to kilograms per hectare (multiplying by 62.77, from this page) and then to hectograms (100 g) per hectare by multiplying by 10.

# Reading NASS corn data
NASS = read.csv('NASScorn.csv')
# Conversion to sensical units (see Iowa State Extension article)
# http://www.extension.iastate.edu/agdm/wholefarm/html/c6-80.html
NASS$Yield = with(NASS, Value*62.77*10)
 
# Average by decade
NASS$Decade = cut(NASS$Year, 
                  breaks = seq(1859, 2019, 10), 
                  labels = paste(seq(1860, 2010, 10), 's', sep = ''))
 
oldYield = aggregate(Yield ~ Decade, data = NASS, FUN = mean)
 
# Plotting
ggplot(oldYield, aes(x = Decade, y = Yield)) + 
       geom_bar(stat = 'identity') +
       scale_y_continuous('Yield [hectograms/ha]', expand = c(0, 0))
Figure 4. Historic average yield per decade for USA (click to enlarge).

It is interesting to see that there was little change until the 1940s, with the advent of the Green Revolution (modern breeding techniques, fertilization, pesticides, etc.). The 2010s decade in Figure 4 includes 2010, 2011 and 2012, with the last two years reflecting extensive droughts. Drought tolerance is one of the most important traits in modern breeding programs.

Drought’s Footprint map produced by The New York Times (click on graph to view larger version in the NYT). This can help to understand the decade patterns in the previous figures.

While Prof. Heinemann and I work for the same university, I don’t know him in person.

P.S. Did you know that Norman Borlaug (hero of mine) studied forestry?
P.S.2 Time permitting I’ll have a look at other crops later. I would have liked to test a regression with dummy variables for corn to account for pre-2000 and post-1999, but there are not yet enough years to fit a decent model (considering natural variability between years). We’ll have to wait for that one, although a sketch of the specification appears at the end of this post.
P.S.3 I share some of Heinemann’s concerns relating to subsidies and other agricultural practices.
P.S.4 In case anyone is interested, I did write about a GM-fed pigs study not long ago.
P.S.5 2013-07-05 20:10 NZST. I updated Figures 1 and 3 to clearly express that yield was in hectograms/ha and recalculated average decade yield, because it was originally averaging yields rather than calculating total production and area for the decade and then calculating the average yield. The discussion points I raised are still completely valid.
P.S.6 2013-07-07 11:30 NZST. The inclusion of Canadian canola does not make any sense because, as far as I know, Canada is not part of Western Europe or the US Midwest. This opens the inclusion of crops from any origin as far as the results are convenient for one’s argument.
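
As promised in P.S.2, here is a minimal sketch of how such a dummy-variable model could be specified (specification only, not an analysis I would stand behind yet). It reuses the FAOarticle data frame created above and the same arbitrary 2000 split used earlier in the post:

# Sketch only: one regression for the USA with a factor separating the
# mostly non-GM years (pre-2000) from the GM years (post-1999)
usa = subset(FAOarticle, Country == 'United States of America')
usa$Era = factor(ifelse(usa$Year < 2000, 'pre-2000', 'post-1999'))
 
# Different intercept and slope for each era
usa.era.lm = lm(Yield ~ NewYear*Era, data = usa)
summary(usa.era.lm)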

GM-fed pigs, chance and how research works

Following my post on GM-fed pigs I received several comments, mostly through Twitter. Some people liked having access to an alternative analysis, while others replied with typical anti-GM slogans, completely ignoring that I was posting about the technical side of the paper. This post is not for the slogan crowd (who clearly are not interested in understanding), but for people that would like to know more about how one would evaluate claims from a scientific article. While I refer to the pig paper, most issues apply to any paper that uses statistics.

In general, researchers want to isolate the effect of the treatments under study (diets in this case) from any other extraneous influence. We want control over the experimental conditions, so we can separate the effects of interest from all other issues that could create differences between our experimental units (pigs in this case). What could create ‘noise’ in our results? Animals could have different genetic backgrounds (for example with different parents), they could be exposed to different environmental conditions, they could be treated differently (more kindly or harshly), etc.

Once we control those conditions as much as possible we would randomly assign animals to each of the treatments (diets). The rationale behind random allocation is easy to follow. Let’s say that we can see that some animals are healthier than others before starting the trial. If I had a pro-GM agenda, I could assign the healthiest animals to the GM-diet and we would not be able to separate the effect of the treatment from the initial difference. To avoid this, we could have many labels in a hat, shake the hat, and for each pig take a label that randomly assigns the pig to a diet so the comparison is fair.
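
A minimal sketch of that hat in R (illustration only; the number of pigs is hypothetical and this is not the paper’s actual procedure):

# Random allocation sketch: shuffle the diet labels and hand one to each pig
set.seed(42)                                  # so the allocation can be reproduced
pigs = paste0('pig', 1:168)                   # hypothetical number of animals
diet = sample(rep(c('non-GM', 'GM'), length.out = length(pigs)))
allocation = data.frame(pig = pigs, diet = diet)
table(allocation$diet)                        # same number of pigs in each diet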

Researchers also like to have a baseline of the conditions before the treatments are applied. This way we can ‘adjust’ the results by any pre-existing differences. For example, there could be measurable differences on health, size, etc. I normally work with trees in experiments and we routinely assess the height of the trees just planted, so we can establish a baseline.
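
As a sketch with made-up numbers, the baseline measurement simply enters the model as a covariate next to the treatment effect:

# Baseline adjustment sketch (simulated data with no real diet effect built in)
set.seed(314)
trial = data.frame(diet = rep(c('non-GM', 'GM'), each = 20),
                   initial = rnorm(40, mean = 30, sd = 3))   # baseline weight
trial$final = 2.5*trial$initial + rnorm(40, sd = 4)          # growth unrelated to diet
 
# The initial measurement 'adjusts' the comparison between diets
baseline.lm = lm(final ~ initial + diet, data = trial)
summary(baseline.lm)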

Finally, we often have a ‘default’ treatment which represents the status quo and acts as a comparison point for the new treatments. In the GM pig case, the default is a non-GM diet and the new treatment is the GM-diet.

The paper on GM-fed pigs states that the researchers tried to have as much control as possible over the growing conditions and that they used random allocation to the two feeding treatments. I have no problems with the paper up to this point.

When doing research it is good manners to state one’s expectations before the start of the trial. Doing this provides the experimenter with guidance about how to proceed with the evaluation of the experiment:

  1. What are the characteristics under study? Both the response variables (in the past called ‘dependent variables’) and the predictors (old name ‘independent variables’).
  2. What is the magnitude of the differences between groups that we would consider practically significant? Put another way, what size of difference would one care about? For example, if we have two groups of pigs and the difference in weight between them is 1 g (0.035 ounces for people with funny units), who cares? If the difference were 5 kg (11 pounds, funny units again) then, perhaps, we are on to something.
  3. What level of statistical significance do we consider appropriate? Even if we assume that there truly is no difference between the two diets, we would expect to see small differences between the two groups just by chance; big differences are more unlikely. It is common in research to state that a difference is statistically significant if the probability of observing it is smaller than, say, 0.05 (or 1 in 20). There is nothing sacred about that number; it is just a convention.

By this stage one would expect a researcher to state one or more hypotheses to be tested. For example, ‘I expect that the GM diet will increase [condition here] by [number here] percent’. One can run an experiment ignoring ‘good manners’, but (and this is a big but) an informed reader will become suspicious if suddenly one starts testing hypotheses like there is no tomorrow. Why? Because if one conducts too many tests one is bound to find statistically significant results even when there are none.

The comic below presents a brief example with jelly beans, assuming that we claim significance for an event occurring with probability 0.05 (1 in 20) or less. Notice that the scientists test 20 colors of jelly beans and find that green ones ‘cause’ acne. Running so many tests without adjusting the significance threshold downwards (so that p needs to be much smaller than 0.05 to count as significant) leads to wrong conclusions.

XKCD explaining significance (click for original, larger version). p refers to the probability of observing that result if jelly beans have no effect on acne.

In the pig paper there are 8 tests in Table 2, 18 (or 15 with testable values) in Table 3, 8 in Table 4 and 17 in Table 5, for a total of 49 (or 46) tests. In fact, one would expect to find a couple of significant results (at 0.05, or 1 in 20) by chance even if there were absolutely no differences in reality.
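
Some back-of-the-envelope numbers, assuming (only to keep the arithmetic simple) that the 49 tests are independent:

# Expected number of 'significant' results among 49 tests when there are no real differences
n.tests = 49
alpha = 0.05
n.tests*alpha             # ~2.45 spurious 'significant' results expected
1 - (1 - alpha)^n.tests   # ~0.92 probability of getting at least one
 
# A Bonferroni-type correction (one of several options in p.adjust) makes a
# nominal p = 0.04 non-significant once the 49 tests are taken into account
p.adjust(rep(0.04, n.tests), method = 'bonferroni')[1]   # returns 1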

Add to this that many of the tests are unnecessary because they perform the wrong type of analysis. For example, there are four separate analyses for stomach inflammation; however, these analyses ignore the type of variable being tested, as I pointed out in a previous post.
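
As an illustration of what a single analysis could look like (a sketch with made-up counts, not the paper’s data and not necessarily the analysis from my previous post), one could treat inflammation severity as an ordered variable and fit one model instead of running four separate tests:

# Sketch with hypothetical counts: inflammation severity as an ordered response
require(MASS)   # polr() fits proportional-odds models
 
inflam = data.frame(diet = rep(c('non-GM', 'GM'), each = 4),
                    severity = ordered(rep(c('nil', 'mild', 'moderate', 'severe'), 2),
                                       levels = c('nil', 'mild', 'moderate', 'severe')),
                    count = c(5, 30, 28, 9,     # made-up counts, non-GM diet
                              8, 24, 18, 22))   # made-up counts, GM diet
 
inflam.polr = polr(severity ~ diet, weights = count, data = inflam)
summary(inflam.polr)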

This is why, if I were Monsanto, I would use the paper as evidence supporting the idea that there is no difference between the two diets:

  1. the paper was ‘independent’ (although the authors have a clear anti-GM bias) and
  2. when running proper analyses (accounting for the type of variable and the number of tests that one is running) no difference is statistically significant.

P.S. 2013-06-21 15:38 NZST A footnote about data availability: it is becoming increasingly common to make both the data and the code used to analyze them available when publishing (part of Open Science). I would expect a paper making bold claims with potential policy implications to provide both, which does not happen in this case. However, the publication does include part of the data in results Tables 3 & 4, in the counts of pigs under different classes of inflammation.