The Precautionary Principle/Critique of GMOs

Nassim Nicolas Taleb and colleagues present an (almost?) tautological view of the effect of GMO (PDF): the large areas of establishment, human consumption and some unspecified risk mechanism (which does not seem to affect non-GMO crops, see next paragraph) may cause ruin to humanity, because, hey, they say so. I could come up with a similar scenario in which we should stop working on any new computing development, because there is a non-zero probability in which we may end-up with non-bottom-up tinkering causing the rise of the machines (a Black Swan event) that has ruinous consequences for humanity (with apologies to the Terminator). For some reason that escapes me, I do not find it compelling.

In making their case Taleb and colleagues also display a very naive view of non-GMO agriculture. Selective-breeding accelerated tremendously since the formalization of quantitative (statistical) genetics theory around the 1930s. Moreover, since the green revolution, the speed of creation of new cultivars, their geographical spread and the reduction of the number of genotypes commercially planted has accelerated. This all happened before the introduction of GMO as part of agricultural breeding programs. Using Taleb et al.’s terminology, crop breeding mostly abandoned ‘bottom-up modifications’ more than half a century ago. Therefore, the comparison between GMO and non-GMO crops is not one of ‘between tinkering with the selective breeding of genetic components of organisms that have previously undergone extensive histories of selection and the top-down engineering of taking a gene from a fish and putting it into a tomato’ but between high-speed, fast-deployment breeding programs and the same program with a few extra plant genes (sorry, no fish for you), sometimes even from organisms that could conventionally breed with each other (cisgenesis).

If we accept the statement ‘in addition to intentional cultivation, GMOs have the propensity to spread uncontrollably, and thus their risks cannot be localized’ and put it together with the total area of planted GMO (say 160 million hectares, or 10% of planted area, over the last 20 years) we face potentially billions of interactions between GMO and non-GMO plants. Given that there are no reported events of any large-scale disruption in billions of plant interactions—and relying on their reference to the Russian roulette fallacy ‘one needs a large sample for claims of absence of risk in the presence of a small probability of ruin’— this would support the idea that such probability of ruin is infinitesimally small indeed, as we are way past sample size n = 1. A similar argument would apply to the billions of meals ingested by millions of humans every day: there is no evidence of large (or small) scale effects of consumption of GMO on human health.

From a probabilistic point of view, absence of evidence is evidence (but not proof) of absence (see, e.g., this article). In the same way, we cannot negate the existence of the Chupacabra, but the lack of any reliable sighting puts its existence in the very highly unlikely basket. For any scientist finding real evidence of the deleterious effects of GMOs (or the existence of the Chupacabra) would be a dream come true: guaranteed article in Nature, lecture circuit, book contract, the works. Many are actively looking for negative effects for GMO, but no one has found anything beyond noise. The question is How many billion interactions between plants are enough to put the ruinous effects of GMO at the level of Chupacabra?

I agree with Taleb and colleagues that many of the arguments to advance the use of GMOs are naive or spurious. At the same time, I think ensuring cheap food access to very large populations require the use of many different combinations of agricultural technologies, including GMO and non-GMO crops. There are many problems with modern agriculture both in their industrial and small-scale versions, but I would posit the focus on GMO is more a distraction than one of the core issues.

Transgenic eel, or not (Photo: Luis, click to enlarge).

Transgenic eel, or not (Photo: Luis, click to enlarge).

P.S. 2014-07-27 19:00 The article has a Pascal’s wager feeling: there is a potential for infinite loss (eternity in hell, human ruin), finite—supposedly small—loss (pleasure, agricultural productivity) and quite a bit of window dressing to the basic argument: if we assume a technology has potentially ruinous effect (pretty much infinite loss) with probability greater than 0, then the decision to avoid it is the correct one.

Sometimes I feel (some) need for speed

I’m the first to acknowledge that most of my code could run faster. The truth of the matter is that, in essence, I write ‘quickies': code that will run once or twice, so there is no incentive to spend days or hours in shaving seconds of a computation. Most analyses of research data fall in to this approach: read data-clean data-fit model-check model-be happy-write article-(perhaps) make data and code available-move on with life.

One of the reasons why my code doesn’t run faster or uses less memory is the trade-off between the cost of my time (very high) compared to the cost of more memory or faster processors (very cheap) and the gains of shaving a few seconds or minutes of computer time, which tend to be fairly little.

In R vectorization is faster than working with each vector element, although it implies allocating memory for whole vectors and matrices, which for large-enough problems may become prohibitively expensive. On the other hand, not vectorizing some operations may turn your problem into an excruciatingly slow exercise and, for example in large simulations, in practice intractable in a useful timeframe.

Dealing with vectors and matrices is like dealing with many chairs simultaneously. Some patterns and operations are easier and faster than others. (Photo: Luis, click to enlarge).

Dealing with vectors and matrices is like dealing with many chairs simultaneously. Some patterns and operations are easier and faster than others. (Photo: Luis, click to enlarge).

John Muschelli wrote an interesting post reimplementing 2×2 frequency tables for a highly specific use, comparing the results of image processing algorithms. In John’s case, there are two greater-than-9-million-long-elements logical vectors, and when comparing vectors for many images the process becomes very slow. He explains part of his rationale in his blog (go and read it), but say that his solution can be expressed like this:

# I don't have any image data, but I'll simulate a couple
# of 10 million long logical vectors (just to round things)
set.seed(2014)
 
manual = sample(c(TRUE, FALSE), 10E6, replace = TRUE)
auto = sample(c(TRUE, FALSE), 10E6, replace = TRUE)
 
logical.tab = function(x, y) {
    tt = sum(x & y)
    tf = sum(x & !y)
    ft = sum(!x & y)
    ff = sum(!x & !y)
    return(matrix(c(ff, tf, ft, tt), 2, 2))
}
 
logical.tab(manual, auto)

which uses 1/4 of the time used by table(manual, auto). Mission accomplished! However, if I stopped here this blog post would not make much sense, simply rehashing (pun intended) John’s code. The point is to explain what is going on and, perhaps, to find even faster ways of performing the calculations. As a start, we have to be aware that the calculations in logical.tab() rely on logical (boolean) operations and on the coercion of logical vectors to numerical values, as stated in the documentation:

Logical vectors are coerced to integer vectors in contexts where a numerical value is required, with TRUE being mapped to 1L, FALSE to 0L and NA to NA_integer_.

In R Logical operations can be slower than mathematical ones, a consideration that may guide us to think of the problem in a slightly different way. For example, take the difference between the vectors (dif = x - y), so both TRUE-TRUE (1-1) and FALSE-FALSE (0-0) are 0, while TRUE - FALSE (1-0) is 1 and FALSE - TRUE (0-1) is -1. Therefore:

  • the sum of positive values (sum(dif > 0)) is the frequency of TRUE & FALSE,
  • while the sum of negative values (sum(dif < 0)) is the frequency of FALSE & TRUE.

The values for TRUE & TRUE can be obtained by adding up the element-wise multiplication of the two vectors, as TRUE*TRUE (1*1) is the only product that's different from zero. A vectorized way of performing this operation would be to use t(x) %*% y; however, for large vectors the implementation crossprod(x, y) is faster. The values for FALSE & FALSE are simply the difference of the length of the vectors and the previously calculated frequencies length(dif) - tt - tf - ft. Putting it all together:

basic.tab = function(x, y) {
    dif = x - y
    tf = sum(dif > 0)
    ft = sum(dif < 0)
    tt = crossprod(x, y)
    ff = length(dif) - tt - tf - ft
    return(c(tf, ft, tt, ff))
}

This code takes 1/20 of the time taken by table(x, y). An interesting result is that the crossproduct crossprod(x, y) can also be expressed as sum(x * y), which I didn't expect to be faster but, hey, it is. So we can express the code as:

basic.tab2 = function(x, y) {
    dif = x - y
    tf = sum(dif > 0)
    ft = sum(dif < 0)
    tt = sum(x*y)
    ff = length(dif) - tt - tf - ft
    return(c(tf, ft, tt, ff))
}

to get roughly 1/22 of the time. The cost of logical operations is easier to see if we isolate particular calculations in logical.tab and basic.tab; for example,

tf1 = function(x, y) {
    tf = sum(x & !y)
}

is slower than

tf2 = function(x, y) {
    dif = x - y
    tf = sum(dif > 0)
}

This also got me thinking of the cost of coercion: Would it take long? In fact, coercing logical vectors to numeric has little (at least I couldn't see from a practical viewpoint) if any cost. In some cases relying on logical vectors converted using as.numeric() seems to be detrimental on terms of speed.

As I mentioned at the beginning, vectorization uses plenty of memory, so if we were constrained in that front and we wanted to do a single pass on the data we could write an explicit loop:

loopy.tab = function(x, y) {
    tt = 0; tf = 0; ft = 0; ff = 0
 
    for(i in seq_along(x)) {
        if(x[i] == TRUE & y[i] == TRUE)
            tt = tt + 1
        else 
            if(x[i] == TRUE & y[i] == FALSE)
                tf = tf + 1
            else
                if(x[i] == FALSE & y[i] == TRUE)
                    ft = ft + 1
                else 
                    ff = ff + 1
    }
    return(matrix(c(ff, tf, ft, tt), 2, 2))
}

The loopy.tab does only one pass over the vectors and should use less memory as it doesn't need to create those huge 10M elements vectors all the time (at least it would if we used a proper iterator in the loop instead of iterating on a vector of length 10M (that's 40 MB big in this case). The iterators package may help here). We save room/memory at the expense of speed, as loopy.tab is ten times slower than the original table() function. Of course one could run it a lot faster if implemented in another language like C++ or Fortran and here Rcpp or Rcpp11 would come handy (updated! See below).

This is only a not-so-short way of reminding myself what's going on when trading-off memory, execution speed and my personal time. Thus, I am not saying that any of these functions is the best possible solution, but playing with ideas and constraints that one often considers when writing code. Incidentally, an easy way to compare the execution time of these functions is using the microbenchmark library. For example:

library(microbenchmark)
microbenchmark(logical.tab(manual, auto), basic.tab2(manual, auto), times = 1000)

will spit numbers when running a couple of functions 1,000 times with results that apply to your specific system.

PS 2014-07-12 11:08 NZST Hadley Wickham suggests using tabulate(manual + auto * 2 + 1, 4) as a fast alternative. Nevertheless, I would like to point out that i- the initial tabulation problem is only an excuse to discuss a subset of performance issues when working with R and ii- this post is more about the journey than the destination.

PS 2014-07-13 20:32 NZST Links to two posts related to this one:

  • Wiekvoet's work which i- relies on marginals and ii- reduces the use of logical operators even more than my basic.tab2(), simultaneously taking around 1/4 of the time.
  • Yin Zhu's post describing vectors is a handy reminder of the basic building blocks used in this post.

Updated with Rcpp goodness!

PS 2014-07-13 21:45 NZST I browsed the Rcpp Gallery and Hadley Wickham's Rcpp tutorial and quickly converted loopy.tab() to C++. Being a bit of an oldie I've used Fortran (90, not that old) before but never C++, so the following code is probably not very idiomatic.

library(Rcpp)
cppFunction('NumericVector loopyCpp(LogicalVector x, LogicalVector y) {
    int niter = x.size();
    int tt = 0, tf = 0, ft = 0, ff = 0;
    NumericVector tab(4);
 
 
    for(int i = 0; i < niter; i++) {
        if(x[i] == TRUE && y[i] == TRUE)
            tt++;
        else
            if(x[i] == TRUE && y[i] == FALSE)
                tf++;
            else
                if(x[i] == FALSE && y[i] == TRUE)
                    ft++;
                else
                    ff++;
    }
 
    tab[0] = ff; tab[1] = tf; tab[2] = ft; tab[3] = tt;
    return tab;
}'
)
 
loopyCpp(manual, auto)

Loops and all it runs roughly twice as fast as basic.tab2() but it should also use much less memory.

Less wordy R

The Swarm Lab presents a nice comparison of R and Python code for a simple (read ‘one could do it in Excel’) problem. The example works, but I was surprised by how wordy the R code was and decided to check if one could easily produce a shorter version.

The beginning is pretty much the same, although I’ll use ggplot2 rather than lattice, because it will be a lot easier (and shorter) to get the desired appearance for the plots:

require(Quandl)
require(ggplot2)
 
# Load data from Quandl
my.data = Quandl("TPC/HIST_RECEIPT", 
                 start_date = "1945-12-31", 
                 end_date = "2013-12-31")

The whole example relies on only three variables and—as I am not great at typing—I tend to work with shorter variable names. I directly changed the names for variables 1 to 3:

# Display first lines of the data frame
# and set short names for first three columns
head(my.data)
names(my.data)[1:3] = c('year', 'indtax', 'corptax')

It is a lot easier to compare the regression lines if we change the shape of the data set from wide to long, where there is one variable for year, one for tax type, and one for the actual tax rate. It would be possible to use one of Hadley’s packages to get a simpler syntax for this, but I decided to stick to the minimum set of requirements:

# Change shape to fit both regressions simultaneously
mdlong = reshape(my.data[, 1:3], 
                 idvar = 'year', times = c('Individual', 'Corporate'), 
                 varying = list(2:3), direction = 'long')
 
mdlong$taxtype = factor(mdlong$time)

And now we are ready to produce the plots. The first one can be a rough cut to see if we get the right elements:

ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) + 
  geom_point() + geom_line() + geom_smooth(method = 'lm')
First cut of the taxes per year plot.

First cut of the taxes per year plot.

Yes, this one has the points, lines, linear regression and 95% confidence intervals for the mean predicted responses, but we still need to get rid of the grey background and get black labels (theme_bw()), set the right axis labels and ticks (scale_x... scale_y...) and set the right color palette for points and lines (scale_colour_manual) and filling the confidence intervals (scale_colour_fill) like so:

# Plotting the graph (first match color palette) and put the regression
# lines as well
serious.palette = c('#AD3333', '#00526D')
ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) + 
  geom_point() + geom_line() + geom_smooth(method = 'lm', aes(fill = taxtype)) + 
  theme_bw() + 
  scale_y_continuous('Income taxes (% of GDP)', breaks = seq(0, 12, 2), minor_breaks = NULL) + 
  scale_x_date('Fiscal year', minor_breaks = NULL) + 
  scale_colour_manual(values=serious.palette) + scale_fill_manual(values=serious.palette)
Way closer to the desired plot, still much shorter.

Way closer to the desired plot, still much shorter.

One can still change font sizes to match the original plots, reposition the legend, change the aspect ratio while saving the png graphs (all simple statements) but you get the idea. If now we move to fitting the regression lines:

# Fitting a regression with dummy variables
m1 = lm(indtax ~ year*taxtype, data = mdlong)
summary(m1)
 
# The regressions have different intercepts and slopes
# Call:
#   lm(formula = indtax ~ year * taxtype, data = mdlong)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -1.95221 -0.44303 -0.05731  0.35749  2.39415 
# 
# Coefficients:
#                            Estimate Std. Error t value Pr(>|t|)    
#   (Intercept)             3.435e+00  1.040e-01   33.01   <2e-16 ***
#   year                   -1.564e-04  1.278e-05  -12.23   <2e-16 ***
#   taxtypeIndividual       4.406e+00  1.471e-01   29.94   <2e-16 ***
#   year:taxtypeIndividual  1.822e-04  1.808e-05   10.08   <2e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.7724 on 134 degrees of freedom
# Multiple R-squared:  0.9245,  Adjusted R-squared:  0.9228 
# F-statistic: 546.9 on 3 and 134 DF,  p-value: < 2.2e-16

This gives the regression coefficients for Corporate (3.45 – 1.564e-04 year) and Individual ([3.45 + 4.41] + [-1.564e-04 + 1.822e-04] year or 7.84 + 2.58e-05 year). As a bonus you get the comparison between regression lines.

In R as a second language I pointed out that ‘brevity reduces the time between thinking and implementation, so we can move on and keep on trying new ideas’. Some times it seriously does.

A couple of thoughts on biotech and food security

“What has {insert biotech here} done for food security?” This question starts at the wrong end of the problem, because food security is much larger than any biotechnology. I would suggest that governance, property rights and education are the fundamental issues for food security, followed by biotechnological options. For example, the best biotechnology is useless if one is trying to do agriculture in a war-ravaged country.

Once we have a relatively stable government and educated people can rely on property rights, the effects of different biotechnologies will be magnified and it will be possible to better assess them. I would say that matching the most appropriate technologies to the local environmental, economic and cultural conditions is a good sign of sustainable agriculture. I would also say that the broader the portfolio of biotechnology and agronomic practices the more likely a good match will be. That is, I would not a priori exclude any biotechnology from the table based on generic considerations.

Should the success of a biotechnology for food security be measured as yield? It could be one of the desired effects but it is not necessarily the most important one. For example, having less fluctuating production (that is reducing the variance rather than increasing the mean) could be more relevant. Or we could be interested in creating combinations of traits that are difficult to achieve by traditional breeding (e.g. biofortification), where yield is still the same but nutritional content differs. Or we would like to have a reduction of inputs (agrochemicals, for example) while maintaining yield. There are many potential answers and—coming back to matching practices to local requirements—using a simple average of all crops in a country (or a continent) is definitely the wrong scale of assessment. We do not want to work with an average farmer or an average consumer but to target specific needs with the best available practices. Some times this will include {insert biotech, agronomical practices here}, other times this will include {insert another biotech and set of agronomical practices here}.

And that is the way I think of improving food security.