# Author: Luis

# Jetsam 21: View from Vyšehrad

# Comment on Sustainability and innovation in staple crop production in the US Midwest

After writing a blog post about the paper “Sustainability and innovation in staple crop production in the US Midwest” I decided to submit a formal comment to the International Journal of Agricultural Sustainability in July 2013, which was published today. As far as I know, Heinemann et al. provided a rebuttal to my comments, which I have not seen but that should be published soon. This post is an example on how we can use open data (in this case from the USDA and FAO) and free software (R) to participate in scientific discussion (see supplementary material below).

The text below the *** represents my author’s version provided as part of my Green Access rights. The article published in the International Journal of Agricultural Sustainability [copyright Taylor & Francis]; is freely available online at http://dx.doi.org/10.1080/14735903.2014.939842).

While I had many issues with the original article, I decided to focus on three problems—to make the submission brief and fit under the 1,000 words limit enforced by the journal editor. The first point I make is a summary of my previous post on the article, and then move on to two big problems: assuming that only the choice of biotechnology affects yield (making the comparison between USA and Western Europe inadequate) and comparing use of agrochemicals at the wrong scale (national- versus crop-level).

**PS 2014-08-05 16:30:** Heinemann et al.’s reply to my comment is now available.

- They state that “Apiolaza did not report variability in the annual yield data” and that they had to ‘reconstruct’ my figure. Firstly, the published figure 2 does include error bars and, secondly, the data and R code are available as supplementary information (in contrast to their original paper). Let’s blame the journal for not passing this information to them.
- They also include 2 more years of data (2011 & 2012), although the whole point of my comment is that the conclusions they derived with the
*original*data set were not justified. Heinemann et al. are right in that yields in 2011 & 2012 were higher in Western Europe than in the USA (the latter suffering a huge drought); however, in 2013 it was the opposite: 99,695 Hg/ha for USA vs 83,724 Hg/ha for Western Europe. - They question that I commented only on one crop, although I did cover another crop (wheat) quickly with respect to the use of pesticides—but not with the detail I wanted, as there was a 1,000 words limit. I would have also mentioned the higher variability for Western European wheat production, as it is weird that they pointed out that high variability is a problems for USA maize production but said nothing for wheat in Europe.
- Furthermore, they also claimed “There is no statistically significant difference in the means over the entire 50-year period” however, a naive paired t-test
`t.test(FAOarticle$Yield[1:50], FAOarticle$Yield[51:100], paired = TRUE)`

(see full code below) says otherwise.

***

## Abstract

This comment highlight issues when comparing genetically modified (GM) crops to non-GM ones across countries. Ignoring structural differences between agricultural sectors and assuming common yield trajectories before the time of introduction of GM crops results on misestimating the effect of GM varieties. Further data collection and analyses should guide policy-makers to encourage diverse approaches to agriculture, rather than excluding specific technologies (like GM crops) from the onset.

Keywords:genetic modification; biotechnology; productivity; economics

In a recent article Heinemann et al. (2013) focused “on the US staple crop agrobiodiversity, particularly maize” using the contrast between the yield of Western Europe and United States as a proxy for the comparison between genetically modified (GM) maize versus non-GM maize. They found no yield benefit from using GM maize when comparing the United States to Western Europe.

In addition, Heinemann et al. contrasted wheat yields across United States and Western Europe to highlight the superiority of the European biotechnological package from a sustainability viewpoint.

I am compelled to comment on two aspects that led the authors to draw incorrect conclusions on these issues. My statistical code and data are available as supplementary material.

## 1. Misestimating the effect of GM maize varieties

Heinemann et al. used FAO data[↓], from 1961 to 2010 inclusive, to fit linear models with yield as the response variable, country and year as predictors. Based on this analysis they concluded, “W. Europe has benefitted from the same, or marginally greater, yield increases without GM”. However, this assumes a common yield trajectory for United States and Western Europe before significant commercial use of GM maize, conflating GM and non-GM yields. GM maize adoption in United States has continually increased from 25% of area of maize planted in 2000 to the current 90% (Figure 1, United States Department of Agriculture 2013[↓]).

If we fit a linear model from 1961 to 1999 (last year with less than 25% area of GM maize) we obtain the following regression equations \(y = 1094.8 x + 39895.6\) (United States, R^{2} = 0.80) and \(y = 1454.5 x + 29802.2\) (W. Europe, R^{2} = 0.90). This means that Western Europe started with a considerably lower yield than the USA (29,802.2 vs 39,895.6 hg/ha) in 1961 but increased yields faster than USA (1,454.5 vs 1,094.8 hg/ha per year) before substantial use of GM maize. By 1999 yield in Western Europe was superior to that in United States.

This is even more evident in Figure 2, which shows average yield per decade, removing year-to-year extraneous variation (e.g. due to weather). Western European yields surpassed United States’s during the 1990s (Figure 2). This trend reverses in the 2000s, while United States simultaneously increased the percentage of planted area with GM maize, directly contradicting Heinemann et al.’s claim.

## 2. Ignoring structural differences between agricultural sectors

When discussing non-GM crops using wheat the authors state, “the combination of biotechnologies used by W. Europe is demonstrating greater productivity than the combination used by the United States”. This sentence summarizes one of the central problems of their article: assuming that, if it were not for the choice of biotechnology bundle, the agricultural sectors would have the same intrinsic yield, making them comparable. However, many inputs beside biotechnology affect yield. For example, Neumann et al. (2010) studied the spatial distribution of yield and found that in the Unites States “*access* can explain most of the variability in wheat efficiency. In the more remote regions land prices are lower and inputs are therefore often substituted by land leading to lower efficiencies”. Lower yields in United States make sense from an economic point of view, as land replaces more expensive inputs like agrochemicals.

Heinemann et al. support their case by comparing pesticide use between United States and France across *all* crops. However, what is relevant to the discussion is pesticide use for the crops being compared. European cereals, and wheat in particular, are the most widely fungicide-treated group of crops worldwide (Kucket al. 2012). For example, 27% of the wheat planted area in France was already treated with fungicides by 1979 (Jenkins and Lescar 1980). More than 30 years later in the United States this figure has reached only 19% for winter wheat (which accounts for 70% of planted area, NASS 2013). Fungicide applications result on higher yield responses (Oerke 2006).

## Final remarks

Heinemann et al. ignored available data on GM adoption when analysing maize yields. They also mistakenly treated biotechnological bundles as the only/main explanation for non-GMO yield differences between United States and Western Europe. These issues mean that the thrust of their general conclusion is unsupported by the available evidence. Nevertheless, their article also raised issues that deserve more consideration; e.g. the roles of agricultural subsidies and market concentration on food security.

Agricultural sustainability requires carefully matching options in the biotechnology portfolio to site-specific economic, environmental and cultural constraints. Further data collection and analyses should lead policy makers to encourage diverse approaches to agriculture, rather than excluding specific technologies (like GMOs and pesticides) from the onset.

## References

Heinemann, J. A., Massaro, M., Coray, D. S., Agapito-Tenfen, S. Z. and Wen, J. D. 2013. Sustainability and innovation in staple crop production in the US Midwest. *International Journal of Agricultural Sustainability* (available here).

Jenkins, J. E. E. and Lescar, L. 1980. Use of foliar fungicides on cereals in Western Europe. *Plant Disease*, 64(11): 987-994 (behind paywall).

Kuck, K. H., Leadbeater, A. and Gisi, U. 2012. FRAC Mode of Action Classification and Resistance Risk of Fungicides. In: Krämer, W., Schirmer, U., Jeschke, P. and Witschel, M., eds., *Modern Crop Protection Compounds*. Wiley. 539-567.

NASS, 2013. Highlights: 2012 Agricultural Chemical Use Survey. Wheat. United States Department of Agriculture (available here).

Neumann, K., Verburg, P. H., Stehfest, E., and Müller, C. 2010. The yield gap of global grain production: a spatial analysis. *Agricultural Systems*, 103(5), 316–326 (behind paywall).

Oerke E. 2006. Crop losses to pests. *Journal of Agriculture Science*, 144: 31-43 (behind paywall, or Free PDF).

United States Department of Agriculture. 2013. Adoption of Genetically Engineered Crops in the U.S. USDA Economic Research Service (available here).

## Supplementary material

You can replicate the analyses and plots produced in this comment using the following files:

# The Precautionary Principle/Critique of GMOs

Nassim Nicolas Taleb and colleagues present an (almost?) tautological view of the effect of GMO (PDF): the large areas of establishment, human consumption and some unspecified risk mechanism (which does not seem to affect non-GMO crops, see next paragraph) may cause ruin to humanity, because, hey, they say so. I could come up with a similar scenario in which we should stop working on any new computing development, because there is a non-zero probability in which we may end-up with non-bottom-up tinkering causing the rise of the machines (a Black Swan event) that has ruinous consequences for humanity (with apologies to the Terminator). For some reason that escapes me, I do not find it compelling.

In making their case Taleb and colleagues also display a very naive view of non-GMO agriculture. Selective-breeding accelerated tremendously since the formalization of quantitative (statistical) genetics theory around the 1930s. Moreover, since the green revolution, the speed of creation of new cultivars, their geographical spread and the reduction of the number of genotypes commercially planted has accelerated. This all happened *before* the introduction of GMO as part of agricultural breeding programs. Using Taleb et al.’s terminology, crop breeding mostly abandoned ‘bottom-up modifications’ more than half a century ago. Therefore, the comparison between GMO and non-GMO crops is not one of *‘between tinkering with the selective breeding of genetic components of organisms that have previously undergone extensive histories of selection and the top-down engineering of taking a gene from a fish and putting it into a tomato’* but between high-speed, fast-deployment breeding programs and *the same* program with a few extra *plant* genes (sorry, no fish for you), sometimes even from organisms that could conventionally breed with each other (cisgenesis).

If we accept the statement ‘in addition to intentional cultivation, GMOs have the propensity to spread uncontrollably, and thus their risks cannot be localized’ and put it together with the total area of planted GMO (say 160 million hectares, or 10% of planted area, over the last 20 years) we face potentially billions of interactions between GMO and non-GMO plants. Given that there are no reported events of any large-scale disruption in billions of plant interactions—and relying on their reference to the Russian roulette fallacy *‘one needs a large sample for claims of absence of risk in the presence of a small probability of ruin’*— this would support the idea that such probability of ruin is infinitesimally small indeed, as we are way past sample size n = 1. A similar argument would apply to the billions of meals ingested by millions of humans every day: there is no evidence of large (or small) scale effects of consumption of GMO on human health.

From a probabilistic point of view, absence of evidence is evidence (but not proof) of absence (see, e.g., this article). In the same way, we cannot negate the existence of the Chupacabra, but the lack of any reliable sighting puts its existence in the very highly unlikely basket. For any scientist finding real evidence of the deleterious effects of GMOs (or the existence of the Chupacabra) would be a dream come true: guaranteed article in Nature, lecture circuit, book contract, the works. Many are actively looking for negative effects for GMO, but no one has found anything beyond noise. The question is How many billion interactions between plants are enough to put the ruinous effects of GMO at the level of Chupacabra?

I agree with Taleb and colleagues that many of the arguments to advance the use of GMOs are naive or spurious. At the same time, I think ensuring cheap food access to very large populations require the use of many different combinations of agricultural technologies, including GMO and non-GMO crops. There are many problems with modern agriculture both in their industrial and small-scale versions, but I would posit the focus on GMO is more a distraction than one of the core issues.

P.S. 2014-07-27 19:00 The article has a Pascal’s wager feeling: there is a potential for infinite loss (eternity in hell, human ruin), finite—supposedly small—loss (pleasure, agricultural productivity) and quite a bit of window dressing to the basic argument: if we assume a technology has potentially ruinous effect (pretty much infinite loss) with probability greater than 0, then the decision to avoid it is the correct one.

# Jetsam 20: mirror mirror

# Sometimes I feel (some) need for speed

I’m the first to acknowledge that most of my code could run faster. The truth of the matter is that, in essence, I write ‘quickies’: code that will run once or twice, so there is no incentive to spend days or hours in shaving seconds of a computation. Most analyses of research data fall in to this approach: read data-clean data-fit model-check model-be happy-write article-(perhaps) make data and code available-move on with life.

One of the reasons why my code doesn’t run faster or uses less memory is the trade-off between the cost of my time (very high) compared to the cost of more memory or faster processors (very cheap) and the gains of shaving a few seconds or minutes of computer time, which tend to be fairly little.

In R vectorization is faster than working with each vector element, although it implies allocating memory for whole vectors and matrices, which for large-enough problems may become prohibitively expensive. On the other hand, not vectorizing some operations may turn your problem into an excruciatingly slow exercise and, for example in large simulations, in practice intractable in a useful timeframe.

John Muschelli wrote an interesting post reimplementing 2×2 frequency tables for a highly specific use, comparing the results of image processing algorithms. In John’s case, there are two greater-than-9-million-long-elements logical vectors, and when comparing vectors for *many* images the process becomes very slow. He explains part of his rationale in his blog (go and read it), but say that his solution can be expressed like this:

# I don't have any image data, but I'll simulate a couple # of 10 million long logical vectors (just to round things) set.seed(2014) manual = sample(c(TRUE, FALSE), 10E6, replace = TRUE) auto = sample(c(TRUE, FALSE), 10E6, replace = TRUE) logical.tab = function(x, y) { tt = sum(x & y) tf = sum(x & !y) ft = sum(!x & y) ff = sum(!x & !y) return(matrix(c(ff, tf, ft, tt), 2, 2)) } logical.tab(manual, auto) |

which uses 1/4 of the time used by `table(manual, auto)`

. Mission accomplished! However, if I stopped here this blog post would not make much sense, simply rehashing (pun intended) John’s code. The point is to explain what is going on and, perhaps, to find even faster ways of performing the calculations. As a start, we have to be aware that the calculations in `logical.tab()`

rely on logical (boolean) operations and on the coercion of logical vectors to numerical values, as stated in the documentation:

Logical vectors are coerced to integer vectors in contexts where a numerical value is required, with TRUE being mapped to 1L, FALSE to 0L and NA to NA_integer_.

In R Logical operations can be slower than mathematical ones, a consideration that may guide us to think of the problem in a slightly different way. For example, take the difference between the vectors (`dif = x - y`

), so both `TRUE-TRUE`

(1-1) and `FALSE-FALSE`

(0-0) are 0, while `TRUE - FALSE`

(1-0) is 1 and `FALSE - TRUE`

(0-1) is -1. Therefore:

- the sum of positive values (
`sum(dif > 0)`

) is the frequency of`TRUE & FALSE`

, - while the sum of negative values (
`sum(dif < 0)`

) is the frequency of`FALSE & TRUE`

.

The values for `TRUE & TRUE`

can be obtained by adding up the element-wise multiplication of the two vectors, as `TRUE*TRUE`

(1*1) is the only product that's different from zero. A vectorized way of performing this operation would be to use `t(x) %*% y`

; however, for large vectors the implementation `crossprod(x, y)`

is faster. The values for `FALSE & FALSE`

are simply the difference of the length of the vectors and the previously calculated frequencies `length(dif) - tt - tf - ft`

. Putting it all together:

basic.tab = function(x, y) { dif = x - y tf = sum(dif > 0) ft = sum(dif < 0) tt = crossprod(x, y) ff = length(dif) - tt - tf - ft return(c(tf, ft, tt, ff)) } |

This code takes 1/20 of the time taken by `table(x, y)`

. An interesting result is that the crossproduct `crossprod(x, y)`

can also be expressed as `sum(x * y)`

, which I didn't expect to be faster but, hey, it is. So we can express the code as:

basic.tab2 = function(x, y) { dif = x - y tf = sum(dif > 0) ft = sum(dif < 0) tt = sum(x*y) ff = length(dif) - tt - tf - ft return(c(tf, ft, tt, ff)) } |

to get roughly 1/22 of the time. The cost of logical operations is easier to see if we isolate particular calculations in logical.tab and basic.tab; for example,

tf1 = function(x, y) { tf = sum(x & !y) } |

is slower than

tf2 = function(x, y) { dif = x - y tf = sum(dif > 0) } |

This also got me thinking of the cost of coercion: Would it take long? In fact, coercing logical vectors to numeric has little (at least I couldn't see from a practical viewpoint) if any cost. In some cases relying on logical vectors converted using `as.numeric()`

seems to be detrimental on terms of speed.

As I mentioned at the beginning, vectorization uses plenty of memory, so if we were constrained in that front and we wanted to do a single pass on the data we could write an explicit loop:

loopy.tab = function(x, y) { tt = 0; tf = 0; ft = 0; ff = 0 for(i in seq_along(x)) { if(x[i] == TRUE & y[i] == TRUE) tt = tt + 1 else if(x[i] == TRUE & y[i] == FALSE) tf = tf + 1 else if(x[i] == FALSE & y[i] == TRUE) ft = ft + 1 else ff = ff + 1 } return(matrix(c(ff, tf, ft, tt), 2, 2)) } |

The loopy.tab does only one pass over the vectors and should use less memory as it doesn't need to create those huge 10M elements vectors all the time (at least it would if we used a proper iterator in the loop instead of iterating on a vector of length 10M (that's 40 MB big in this case). The iterators package may help here). We save room/memory at the expense of speed, as `loopy.tab`

is ten times slower than the original `table()`

function. Of course one could run it a lot faster if implemented in another language like C++ or Fortran and here Rcpp or Rcpp11 would come handy (updated! See below).

This is only a not-so-short way of reminding myself what's going on when trading-off memory, execution speed and my personal time. Thus, I am not saying that any of these functions is the best possible solution, but playing with ideas and constraints that one often considers when writing code. Incidentally, an easy way to compare the execution time of these functions is using the `microbenchmark`

library. For example:

library(microbenchmark) microbenchmark(logical.tab(manual, auto), basic.tab2(manual, auto), times = 1000) |

will spit numbers when running a couple of functions 1,000 times with results that apply to your specific system.

PS 2014-07-12 11:08 NZST Hadley Wickham suggests using `tabulate(manual + auto * 2 + 1, 4)`

as a fast alternative. Nevertheless, I would like to point out that i- the initial tabulation problem is only an excuse to discuss a subset of performance issues when working with R and ii- this post is more about the journey than the destination.

PS 2014-07-13 20:32 NZST Links to two posts related to this one:

- Wiekvoet's work which i- relies on marginals and ii- reduces the use of logical operators even more than my
`basic.tab2()`

, simultaneously taking around 1/4 of the time. - Yin Zhu's post describing vectors is a handy reminder of the basic building blocks used in this post.

### Updated with Rcpp goodness!

PS 2014-07-13 21:45 NZST I browsed the Rcpp Gallery and Hadley Wickham's Rcpp tutorial and quickly converted `loopy.tab()`

to C++. Being a bit of an oldie I've used Fortran (90, not *that* old) before but never C++, so the following code is probably not very idiomatic.

library(Rcpp) cppFunction('NumericVector loopyCpp(LogicalVector x, LogicalVector y) { int niter = x.size(); int tt = 0, tf = 0, ft = 0, ff = 0; NumericVector tab(4); for(int i = 0; i < niter; i++) { if(x[i] == TRUE && y[i] == TRUE) tt++; else if(x[i] == TRUE && y[i] == FALSE) tf++; else if(x[i] == FALSE && y[i] == TRUE) ft++; else ff++; } tab[0] = ff; tab[1] = tf; tab[2] = ft; tab[3] = tt; return tab; }' ) loopyCpp(manual, auto) |

Loops and all it runs roughly twice as fast as `basic.tab2()`

but it should also use much less memory.

# Jetsam 19: High power

# Jetsam 18: supermarket trolley

# Jetsam 17: Scaffolding

# Less wordy R

The Swarm Lab presents a nice comparison of R and Python code for a simple (read ‘one could do it in Excel’) problem. The example works, but I was surprised by how wordy the R code was and decided to check if one could easily produce a shorter version.

The beginning is pretty much the same, although I’ll use ggplot2 rather than lattice, because it will be a lot easier (and shorter) to get the desired appearance for the plots:

require(Quandl) require(ggplot2) # Load data from Quandl my.data = Quandl("TPC/HIST_RECEIPT", start_date = "1945-12-31", end_date = "2013-12-31") |

The whole example relies on only three variables and—as I am not great at typing—I tend to work with shorter variable names. I directly changed the names for variables 1 to 3:

# Display first lines of the data frame # and set short names for first three columns head(my.data) names(my.data)[1:3] = c('year', 'indtax', 'corptax') |

It is a lot easier to compare the regression lines if we change the shape of the data set from wide to long, where there is one variable for year, one for tax type, and one for the actual tax rate. It would be possible to use one of Hadley’s packages to get a simpler syntax for this, but I decided to stick to the minimum set of requirements:

# Change shape to fit both regressions simultaneously mdlong = reshape(my.data[, 1:3], idvar = 'year', times = c('Individual', 'Corporate'), varying = list(2:3), direction = 'long') mdlong$taxtype = factor(mdlong$time) |

And now we are ready to produce the plots. The first one can be a rough cut to see if we get the right elements:

ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) + geom_point() + geom_line() + geom_smooth(method = 'lm') |

Yes, this one has the points, lines, linear regression and 95% confidence intervals for the mean predicted responses, but we still need to get rid of the grey background and get black labels (`theme_bw()`

), set the right axis labels and ticks (`scale_x... scale_y...`

) and set the right color palette for points and lines (`scale_colour_manual`

) and filling the confidence intervals (`scale_colour_fill`

) like so:

# Plotting the graph (first match color palette) and put the regression # lines as well serious.palette = c('#AD3333', '#00526D') ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) + geom_point() + geom_line() + geom_smooth(method = 'lm', aes(fill = taxtype)) + theme_bw() + scale_y_continuous('Income taxes (% of GDP)', breaks = seq(0, 12, 2), minor_breaks = NULL) + scale_x_date('Fiscal year', minor_breaks = NULL) + scale_colour_manual(values=serious.palette) + scale_fill_manual(values=serious.palette) |

One can still change font sizes to match the original plots, reposition the legend, change the aspect ratio while saving the png graphs (all simple statements) but you get the idea. If now we move to fitting the regression lines:

# Fitting a regression with dummy variables m1 = lm(indtax ~ year*taxtype, data = mdlong) summary(m1) # The regressions have different intercepts and slopes # Call: # lm(formula = indtax ~ year * taxtype, data = mdlong) # # Residuals: # Min 1Q Median 3Q Max # -1.95221 -0.44303 -0.05731 0.35749 2.39415 # # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 3.435e+00 1.040e-01 33.01 <2e-16 *** # year -1.564e-04 1.278e-05 -12.23 <2e-16 *** # taxtypeIndividual 4.406e+00 1.471e-01 29.94 <2e-16 *** # year:taxtypeIndividual 1.822e-04 1.808e-05 10.08 <2e-16 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # Residual standard error: 0.7724 on 134 degrees of freedom # Multiple R-squared: 0.9245, Adjusted R-squared: 0.9228 # F-statistic: 546.9 on 3 and 134 DF, p-value: < 2.2e-16 |

This gives the regression coefficients for Corporate (3.45 – 1.564e-04 year) and Individual ([3.45 + 4.41] + [-1.564e-04 + 1.822e-04] year or 7.84 + 2.58e-05 year). As a bonus you get the comparison between regression lines.

In R as a second language I pointed out that ‘brevity reduces the time between thinking and implementation, so we can move on and keep on trying new ideas’. Some times it seriously does.