The text below the *** is my author’s version, provided as part of my Green Access rights. The article was published in the International Journal of Agricultural Sustainability [copyright Taylor & Francis] and is freely available online at http://dx.doi.org/10.1080/14735903.2014.939842.

While I had many issues with the original article, I decided to focus on three problems, to keep the submission brief and fit under the 1,000-word limit enforced by the journal editor. My first point summarizes my previous post on the article; I then move on to two big problems: assuming that only the choice of biotechnology affects yield (making the comparison between the USA and Western Europe inadequate) and comparing the use of agrochemicals at the wrong scale (national rather than crop level).

***

## Abstract

This comment highlights issues that arise when comparing genetically modified (GM) crops to non-GM ones across countries. Ignoring structural differences between agricultural sectors and assuming common yield trajectories before the introduction of GM crops results in misestimating the effect of GM varieties. Further data collection and analyses should guide policy-makers to encourage diverse approaches to agriculture, rather than excluding specific technologies (like GM crops) from the outset.

Keywords: genetic modification; biotechnology; productivity; economics

In a recent article Heinemann et al. (2013) focused “on the US staple crop agrobiodiversity, particularly maize”, using the contrast between the yields of Western Europe and the United States as a proxy for a comparison between genetically modified (GM) and non-GM maize. They found no yield benefit from using GM maize when comparing the United States to Western Europe.

In addition, Heinemann et al. contrasted wheat yields between the United States and Western Europe to highlight the superiority of the European biotechnological package from a sustainability viewpoint.

I am compelled to comment on two aspects that led the authors to draw incorrect conclusions on these issues. My statistical code and data are available as supplementary material.

Heinemann et al. used FAO data, from 1961 to 2010 inclusive, to fit linear models with yield as the response variable and country and year as predictors. Based on this analysis they concluded, “W. Europe has benefitted from the same, or marginally greater, yield increases without GM”. However, this assumes a common yield trajectory for the United States and Western Europe before significant commercial use of GM maize, conflating GM and non-GM yields. GM maize adoption in the United States has increased continually, from 25% of the area planted with maize in 2000 to the current 90% (Figure 1, United States Department of Agriculture 2013).

If we fit a linear model from 1961 to 1999 (the last year with less than 25% of the maize area planted with GM varieties) we obtain the following regression equations: \(y = 1094.8 x + 39895.6\) (United States, R^{2} = 0.80) and \(y = 1454.5 x + 29802.2\) (W. Europe, R^{2} = 0.90). This means that Western Europe started with a considerably lower yield than the USA (29,802.2 vs 39,895.6 hg/ha) in 1961 but increased yields faster than the USA (1,454.5 vs 1,094.8 hg/ha per year) before substantial use of GM maize. By 1999 yield in Western Europe was superior to that in the United States.
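As a rough sketch of how these pre-GM fits can be reproduced, the snippet below simulates maize yields from the regression equations reported above. I am not shipping the FAOSTAT series here, so the data are illustrative; with the real data, replace the simulated `fao` data frame by the FAO yields for 1961–1999.

```r
# Illustrative data: yields (hg/ha) generated from the reported
# equations plus noise; x is years since 1961
set.seed(2014)
fao = data.frame(x = rep(0:38, 2),
                 country = rep(c('USA', 'W.Europe'), each = 39))
fao$yield = with(fao, ifelse(country == 'USA',
                             39895.6 + 1094.8 * x,
                             29802.2 + 1454.5 * x)) + rnorm(78, sd = 2000)

# One regression per region, restricted to the years before
# substantial GM adoption
m.us = lm(yield ~ x, data = subset(fao, country == 'USA'))
m.we = lm(yield ~ x, data = subset(fao, country == 'W.Europe'))
coef(m.us); coef(m.we)
```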

This is even more evident in Figure 2, which shows average yield per decade, removing year-to-year extraneous variation (e.g. due to weather). Western European yields surpassed those of the United States during the 1990s (Figure 2). This trend reverses in the 2000s, exactly while the United States was increasing the percentage of area planted with GM maize, directly contradicting Heinemann et al.’s claim.

When discussing non-GM crops using wheat, the authors state, “the combination of biotechnologies used by W. Europe is demonstrating greater productivity than the combination used by the United States”. This sentence summarizes one of the central problems of their article: assuming that, if it were not for the choice of biotechnology bundle, the agricultural sectors would have the same intrinsic yield, making them comparable. However, many inputs besides biotechnology affect yield. For example, Neumann et al. (2010) studied the spatial distribution of yield and found that in the United States “*access* can explain most of the variability in wheat efficiency. In the more remote regions land prices are lower and inputs are therefore often substituted by land leading to lower efficiencies”. Lower yields in the United States make sense from an economic point of view, as land replaces more expensive inputs like agrochemicals.

Heinemann et al. support their case by comparing pesticide use between the United States and France across *all* crops. However, what is relevant to the discussion is pesticide use for the crops being compared. European cereals, and wheat in particular, are the most widely fungicide-treated group of crops worldwide (Kuck et al. 2012). For example, 27% of the wheat planted area in France was already treated with fungicides by 1979 (Jenkins and Lescar 1980). More than 30 years later, this figure has reached only 19% in the United States for winter wheat (which accounts for 70% of planted area, NASS 2013). Fungicide applications result in higher yield responses (Oerke 2006).

Heinemann et al. ignored available data on GM adoption when analysing maize yields. They also mistakenly treated biotechnological bundles as the only (or main) explanation for non-GM yield differences between the United States and Western Europe. These issues mean that the thrust of their general conclusion is unsupported by the available evidence. Nevertheless, their article also raised issues that deserve more consideration, e.g. the roles of agricultural subsidies and market concentration in food security.

Agricultural sustainability requires carefully matching options in the biotechnology portfolio to site-specific economic, environmental and cultural constraints. Further data collection and analyses should lead policy-makers to encourage diverse approaches to agriculture, rather than excluding specific technologies (like GMOs and pesticides) from the outset.

Heinemann, J. A., Massaro, M., Coray, D. S., Agapito-Tenfen, S. Z. and Wen, J. D. 2013. Sustainability and innovation in staple crop production in the US Midwest. *International Journal of Agricultural Sustainability*.

Jenkins, J. E. E. and Lescar, L. 1980. Use of foliar fungicides on cereals in Western Europe. *Plant Disease*, 64(11): 987-994.

Kuck, K. H., Leadbeater, A. and Gisi, U. 2012. FRAC Mode of Action Classification and Resistance Risk of Fungicides. In: Krämer, W., Schirmer, U., Jeschke, P. and Witschel, M., eds., *Modern Crop Protection Compounds*. Wiley. 539-567.

NASS, 2013. Highlights: 2012 Agricultural Chemical Use Survey. Wheat. United States Department of Agriculture.

Neumann, K., Verburg, P. H., Stehfest, E. and Müller, C. 2010. The yield gap of global grain production: a spatial analysis. *Agricultural Systems*, 103(5): 316-326.

Oerke, E. 2006. Crop losses to pests. *Journal of Agricultural Science*, 144: 31-43.

United States Department of Agriculture. 2013. Adoption of Genetically Engineered Crops in the U.S. USDA Economic Research Service.

You can replicate the analyses and plots produced in this comment using the supplementary code and data files.

In making their case, Taleb and colleagues also display a very naive view of non-GMO agriculture. Selective breeding has accelerated tremendously since the formalization of quantitative (statistical) genetics theory around the 1930s. Moreover, since the green revolution, the speed of creation of new cultivars, their geographical spread and the reduction of the number of genotypes commercially planted have all accelerated. All of this happened *before* the introduction of GMOs in agricultural breeding programs. Using Taleb et al.’s terminology, crop breeding mostly abandoned ‘bottom-up modifications’ more than half a century ago. Therefore, the comparison between GMO and non-GMO crops is not one *‘between tinkering with the selective breeding of genetic components of organisms that have previously undergone extensive histories of selection and the top-down engineering of taking a gene from a fish and putting it into a tomato’* but between high-speed, fast-deployment breeding programs and *the same* programs with a few extra *plant* genes (sorry, no fish for you), sometimes even from organisms that could conventionally breed with each other (cisgenesis).

If we accept the statement ‘in addition to intentional cultivation, GMOs have the propensity to spread uncontrollably, and thus their risks cannot be localized’ and put it together with the total area planted with GMOs (say 160 million hectares, or 10% of planted area, over the last 20 years), we face potentially billions of interactions between GMO and non-GMO plants. Given that there are no reported events of any large-scale disruption in billions of plant interactions—and relying on their own reference to the Russian roulette fallacy, *‘one needs a large sample for claims of absence of risk in the presence of a small probability of ruin’*—this supports the idea that such a probability of ruin is infinitesimally small indeed, as we are way past sample size n = 1. A similar argument applies to the billions of meals ingested by millions of humans every day: there is no evidence of large (or small) scale effects of the consumption of GMOs on human health.

From a probabilistic point of view, absence of evidence is evidence (though not proof) of absence. In the same way, we cannot negate the existence of the Chupacabra, but the lack of any reliable sighting puts its existence in the very highly unlikely basket. For any scientist, finding real evidence of the deleterious effects of GMOs (or of the existence of the Chupacabra) would be a dream come true: a guaranteed article in Nature, the lecture circuit, a book contract, the works. Many are actively looking for negative effects of GMOs, but no one has found anything beyond noise. The question is: how many billion interactions between plants are enough to put the ruinous effects of GMOs at the level of the Chupacabra?

I agree with Taleb and colleagues that many of the arguments advanced for the use of GMOs are naive or spurious. At the same time, I think that ensuring cheap food access for very large populations requires the use of many different combinations of agricultural technologies, including GMO and non-GMO crops. There are many problems with modern agriculture in both its industrial and small-scale versions, but I would posit that the focus on GMOs is more a distraction than one of the core issues.

P.S. 2014-07-27 19:00 The article has a Pascal’s wager feeling: there is a potential for infinite loss (eternity in hell, human ruin), a finite—supposedly small—loss (pleasure, agricultural productivity) and quite a bit of window dressing around the basic argument: if we assume a technology has a potentially ruinous effect (pretty much infinite loss) with probability greater than 0, then the decision to avoid it is the correct one.

One of the reasons why my code doesn’t run faster or use less memory is the trade-off between the cost of my time (very high) compared to the cost of more memory or faster processors (very cheap), and the gains from shaving a few seconds or minutes of computer time, which tend to be fairly small.

In R, vectorization is faster than working with each vector element, although it implies allocating memory for whole vectors and matrices, which for large-enough problems may become prohibitively expensive. On the other hand, not vectorizing some operations may turn your problem into an excruciatingly slow exercise and, for example in large simulations, make it practically intractable in a useful timeframe.

John Muschelli wrote an interesting post reimplementing 2×2 frequency tables for a highly specific use: comparing the results of image processing algorithms. In John’s case, there are two logical vectors with more than 9 million elements each, and when comparing vectors for *many* images the process becomes very slow. He explains part of his rationale in his blog (go and read it), but his solution can be expressed like this:


```r
# I don't have any image data, but I'll simulate a couple
# of 10 million long logical vectors (just to round things)
set.seed(2014)

manual = sample(c(TRUE, FALSE), 10E6, replace = TRUE)
auto = sample(c(TRUE, FALSE), 10E6, replace = TRUE)

logical.tab = function(x, y) {
  tt = sum(x & y)
  tf = sum(x & !y)
  ft = sum(!x & y)
  ff = sum(!x & !y)
  return(matrix(c(ff, tf, ft, tt), 2, 2))
}

logical.tab(manual, auto)
```

which uses a quarter of the time used by `table(manual, auto)`. Mission accomplished! However, if I stopped here this blog post would not make much sense, simply rehashing (pun intended) John’s code. The point is to explain what is going on and, perhaps, to find even faster ways of performing the calculations. As a start, we have to be aware that the calculations in `logical.tab()` rely on logical (boolean) operations and on the coercion of logical vectors to numerical values, as stated in the documentation:

> Logical vectors are coerced to integer vectors in contexts where a numerical value is required, with TRUE being mapped to 1L, FALSE to 0L and NA to NA_integer_.
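This coercion is easy to verify at the console (a tiny example of my own, not John’s data):

```r
# TRUE maps to 1L and FALSE to 0L, so sum() over a logical
# vector counts the TRUEs
x = c(TRUE, FALSE, TRUE, TRUE)
as.integer(x)  # 1 0 1 1
sum(x)         # 3
```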

In R, logical operations can be slower than mathematical ones, a consideration that may guide us to think of the problem in a slightly different way. For example, take the difference between the vectors (`dif = x - y`), so both `TRUE - TRUE` (1 - 1) and `FALSE - FALSE` (0 - 0) are 0, while `TRUE - FALSE` (1 - 0) is 1 and `FALSE - TRUE` (0 - 1) is -1. Therefore:

- the sum of positive values (`sum(dif > 0)`) is the frequency of `TRUE & FALSE`,
- while the sum of negative values (`sum(dif < 0)`) is the frequency of `FALSE & TRUE`.

The values for `TRUE & TRUE` can be obtained by adding up the element-wise multiplication of the two vectors, as `TRUE*TRUE` (1*1) is the only product that's different from zero. A vectorized way of performing this operation would be to use `t(x) %*% y`; however, for large vectors the implementation `crossprod(x, y)` is faster. The values for `FALSE & FALSE` are simply the difference between the length of the vectors and the previously calculated frequencies: `length(dif) - tt - tf - ft`. Putting it all together:

```r
basic.tab = function(x, y) {
  dif = x - y
  tf = sum(dif > 0)
  ft = sum(dif < 0)
  tt = crossprod(x, y)
  ff = length(dif) - tt - tf - ft
  return(c(tf, ft, tt, ff))
}
```

This code takes 1/20 of the time taken by `table(x, y)`. An interesting result is that the crossproduct `crossprod(x, y)` can also be expressed as `sum(x * y)`, which I didn't expect to be faster but, hey, it is. So we can express the code as:

```r
basic.tab2 = function(x, y) {
  dif = x - y
  tf = sum(dif > 0)
  ft = sum(dif < 0)
  tt = sum(x * y)
  ff = length(dif) - tt - tf - ft
  return(c(tf, ft, tt, ff))
}
```

to get roughly 1/22 of the time. The cost of logical operations is easier to see if we isolate particular calculations in `logical.tab` and `basic.tab`; for example,

```r
tf1 = function(x, y) {
  tf = sum(x & !y)
}
```

is slower than

```r
tf2 = function(x, y) {
  dif = x - y
  tf = sum(dif > 0)
}
```

This also got me thinking of the cost of coercion: would it take long? In fact, coercing logical vectors to numeric has little cost, if any (at least none I could see from a practical viewpoint). In some cases, relying on logical vectors converted using `as.numeric()` seems to be detrimental in terms of speed.
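If you want to check the (lack of) coercion cost on your own machine, a quick-and-dirty comparison would look like this (timings vary by system; this is a sanity check rather than a proper benchmark):

```r
# Summing the logical vector directly (implicit coercion) versus
# summing an explicitly converted copy; the conversion is extra work
x = sample(c(TRUE, FALSE), 1e6, replace = TRUE)
system.time(for(i in 1:50) sum(x))
system.time(for(i in 1:50) sum(as.numeric(x)))
```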

As I mentioned at the beginning, vectorization uses plenty of memory, so if we were constrained on that front and wanted to do a single pass over the data, we could write an explicit loop:

```r
loopy.tab = function(x, y) {
  tt = 0; tf = 0; ft = 0; ff = 0
  for(i in seq_along(x)) {
    if(x[i] == TRUE & y[i] == TRUE) tt = tt + 1
    else if(x[i] == TRUE & y[i] == FALSE) tf = tf + 1
    else if(x[i] == FALSE & y[i] == TRUE) ft = ft + 1
    else ff = ff + 1
  }
  return(matrix(c(ff, tf, ft, tt), 2, 2))
}
```

`loopy.tab` does only one pass over the vectors and should use less memory, as it doesn't need to allocate those huge 10-million-element vectors all the time (at least it would, if we used a proper iterator in the loop instead of iterating over a vector of length 10 million, which is 40 MB big in this case; the iterators package may help here). We save room/memory at the expense of speed, as `loopy.tab` is ten times slower than the original `table()` function. Of course one could run it a lot faster if implemented in another language like C++ or Fortran, and here Rcpp or Rcpp11 would come in handy (updated! See below).

This is only a not-so-short way of reminding myself what's going on when trading off memory, execution speed and my personal time. Thus, I am not saying that any of these functions is the best possible solution, but playing with ideas and constraints that one often considers when writing code. Incidentally, an easy way to compare the execution time of these functions is the `microbenchmark` package. For example:

```r
library(microbenchmark)

microbenchmark(logical.tab(manual, auto), basic.tab2(manual, auto),
               times = 1000)
```

will spit out numbers from running the pair of functions 1,000 times, with results that apply to your specific system.

PS 2014-07-12 11:08 NZST Hadley Wickham suggests using `tabulate(manual + auto * 2 + 1, 4)` as a fast alternative. Nevertheless, I would like to point out that i- the initial tabulation problem is only an excuse to discuss a subset of performance issues when working with R and ii- this post is more about the journey than the destination.

PS 2014-07-13 20:32 NZST Links to two posts related to this one:

- Wiekvoet's work, which i- relies on marginals and ii- reduces the use of logical operators even more than my `basic.tab2()`, simultaneously taking around 1/4 of the time.
- Yin Zhu's post describing vectors, a handy reminder of the basic building blocks used in this post.

PS 2014-07-13 21:45 NZST I browsed the Rcpp Gallery and Hadley Wickham's Rcpp tutorial and quickly converted `loopy.tab()` to C++. Being a bit of an oldie, I've used Fortran (90, not *that* old) before but never C++, so the following code is probably not very idiomatic.

```r
library(Rcpp)

cppFunction('NumericVector loopyCpp(LogicalVector x, LogicalVector y) {
  int niter = x.size();
  int tt = 0, tf = 0, ft = 0, ff = 0;
  NumericVector tab(4);

  for(int i = 0; i < niter; i++) {
    if(x[i] == TRUE && y[i] == TRUE) tt++;
    else if(x[i] == TRUE && y[i] == FALSE) tf++;
    else if(x[i] == FALSE && y[i] == TRUE) ft++;
    else ff++;
  }

  tab[0] = ff; tab[1] = tf; tab[2] = ft; tab[3] = tt;
  return tab;
}')

loopyCpp(manual, auto)
```

Loops and all, it runs roughly twice as fast as `basic.tab2()`, but it should also use much less memory.

The beginning is pretty much the same, although I’ll use ggplot2 rather than lattice, because it will be a lot easier (and shorter) to get the desired appearance for the plots:

```r
require(Quandl)
require(ggplot2)

# Load data from Quandl
my.data = Quandl("TPC/HIST_RECEIPT",
                 start_date = "1945-12-31",
                 end_date = "2013-12-31")
```

The whole example relies on only three variables and—as I am not great at typing—I tend to work with shorter variable names. I directly changed the names for variables 1 to 3:

```r
# Display first lines of the data frame
# and set short names for first three columns
head(my.data)
names(my.data)[1:3] = c('year', 'indtax', 'corptax')
```

It is a lot easier to compare the regression lines if we change the shape of the data set from wide to long, where there is one variable for year, one for tax type, and one for the actual tax rate. It would be possible to use one of Hadley’s packages to get a simpler syntax for this, but I decided to stick to the minimum set of requirements:

```r
# Change shape to fit both regressions simultaneously
mdlong = reshape(my.data[, 1:3], idvar = 'year',
                 times = c('Individual', 'Corporate'),
                 varying = list(2:3), direction = 'long')

mdlong$taxtype = factor(mdlong$time)
```

And now we are ready to produce the plots. The first one can be a rough cut to see if we get the right elements:

```r
ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) +
  geom_point() + geom_line() + geom_smooth(method = 'lm')
```

Yes, this one has the points, lines, linear regressions and 95% confidence intervals for the mean predicted responses, but we still need to get rid of the grey background and get black labels (`theme_bw()`), set the right axis labels and ticks (`scale_x...`, `scale_y...`), set the right color palette for points and lines (`scale_colour_manual`) and for filling the confidence intervals (`scale_fill_manual`), like so:

```r
# Plotting the graph (first match the color palette)
# and put the regression lines as well
serious.palette = c('#AD3333', '#00526D')

ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) +
  geom_point() + geom_line() +
  geom_smooth(method = 'lm', aes(fill = taxtype)) +
  theme_bw() +
  scale_y_continuous('Income taxes (% of GDP)',
                     breaks = seq(0, 12, 2), minor_breaks = NULL) +
  scale_x_date('Fiscal year', minor_breaks = NULL) +
  scale_colour_manual(values = serious.palette) +
  scale_fill_manual(values = serious.palette)
```

One can still change font sizes to match the original plots, reposition the legend and change the aspect ratio while saving the png graphs (all simple statements), but you get the idea. If we now move to fitting the regression lines:

```r
# Fitting a regression with dummy variables
m1 = lm(indtax ~ year*taxtype, data = mdlong)
summary(m1)

# The regressions have different intercepts and slopes
# Call:
# lm(formula = indtax ~ year * taxtype, data = mdlong)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -1.95221 -0.44303 -0.05731  0.35749  2.39415
#
# Coefficients:
#                          Estimate Std. Error t value Pr(>|t|)
# (Intercept)             3.435e+00  1.040e-01   33.01   <2e-16 ***
# year                   -1.564e-04  1.278e-05  -12.23   <2e-16 ***
# taxtypeIndividual       4.406e+00  1.471e-01   29.94   <2e-16 ***
# year:taxtypeIndividual  1.822e-04  1.808e-05   10.08   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.7724 on 134 degrees of freedom
# Multiple R-squared: 0.9245,  Adjusted R-squared: 0.9228
# F-statistic: 546.9 on 3 and 134 DF,  p-value: < 2.2e-16
```

This gives the regression coefficients for Corporate (3.44 - 1.564e-04 year) and Individual ([3.44 + 4.41] + [-1.564e-04 + 1.822e-04] year, or 7.84 + 2.58e-05 year). As a bonus you get the comparison between the regression lines.
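Rather than doing that arithmetic by hand, the two lines can be recovered directly from the fitted object. The toy data below mimic the structure of `mdlong` so the snippet runs without the Quandl download (the slopes and intercepts are made up; the coefficient names are the ones `lm()` generates for this model):

```r
# Toy data with the same shape as mdlong: one row per year and tax type
set.seed(1)
toy = data.frame(year = rep(1:50, 2),
                 taxtype = factor(rep(c('Corporate', 'Individual'),
                                      each = 50)))
toy$indtax = ifelse(toy$taxtype == 'Corporate',
                    3.4 - 0.002 * toy$year,
                    7.8 + 0.001 * toy$year) + rnorm(100, sd = 0.05)

m.toy = lm(indtax ~ year * taxtype, data = toy)
b = coef(m.toy)

# Corporate is the reference level, so its line uses the intercept and
# year slope; Individual adds the dummy and interaction terms
corporate.line  = c(b[['(Intercept)']], b[['year']])
individual.line = c(b[['(Intercept)']] + b[['taxtypeIndividual']],
                    b[['year']] + b[['year:taxtypeIndividual']])
```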

In R as a second language I pointed out that ‘brevity reduces the time between thinking and implementation, so we can move on and keep on trying new ideas’. Sometimes it seriously does.

Once we have a relatively stable government and educated people who can rely on property rights, the effects of different biotechnologies will be magnified and it will be possible to assess them better. I would say that matching the most appropriate technologies to the local environmental, economic and cultural conditions is a good sign of sustainable agriculture. I would also say that the broader the portfolio of biotechnologies and agronomic practices, the more likely a good match will be. That is, I would not *a priori* exclude any biotechnology from the table based on *generic* considerations.

Should the success of a biotechnology for food security be measured as yield? It could be *one* of the desired effects, but it is not necessarily the most important one. For example, having less fluctuating production (that is, reducing the variance rather than increasing the mean) could be more relevant. Or we could be interested in creating combinations of traits that are difficult to achieve by traditional breeding (e.g. biofortification), where yield is still the same but nutritional content differs. Or we would like a reduction of inputs (agrochemicals, for example) while maintaining yield. There are many potential answers and—coming back to matching practices to local requirements—using a simple average of all crops in a country (or a continent) is definitely the wrong scale of assessment. We do not want to work with an average farmer or an average consumer but to target specific needs with the best available practices. Sometimes this will include {insert biotech, agronomical practices here}, other times it will include {insert another biotech and set of agronomical practices here}.

And that is the way I think of improving food security.
