Ordinal logistic GM pigs

This week another ‘scary GMO cause disease’ story was doing the rounds in internet: A long-term toxicology study on pigs fed a combined genetically modified (GM) soy and GM maize diet. Andrew Kniss, a non-smokable weeds expert, mentioned in Twitter that the statistical analyses in the study appeared to be kind of dodgy.

Curious, I decided to have a quick look and I was surprised, first, by the points the authors decide to highlight in their results, second, by the pictures and captioning used in the article and, last, by the way of running the analysis. As I’m in the middle of marking assignments and exams I’ll only have a quick go at part of the analysis. As I see it, the problem can be described as ‘there is a bunch of pigs who were fed either non-GM feed or GM feed. After some time (approximately 23 weeks) they were killed and went through a CSI-like autopsy’, where part of the exam involved the following process:

  1. Write down the type of feed the pig had during his/her life;
  2. Assess the condition of the stomach and put it in one of four boxes labeled ‘Nil’, ‘Mild’, ‘Moderate’ and ‘Severe’.

All this data is summarized in Table 3 of the paper (PDF). How would I go about the analysis? As I see it, we have a categorical response variable—which can take one of four mutually exclusive values—and a categorical predictor (diet). In addition, there is a natural order to the inflammation response variable in that Severe > Moderate > Mild > Nil.

Andrew Kniss wrote a post trying to reproduce the published results. Instead, I present the first approach I would try with the data: ordinal logistic regression. Not only that, but instead of using a hippie statistical software like R, I will use industrial-grade-business-like SAS:

/*
 Testing SAS web editor fitting pigs data
 Luis A. Apiolaza - School of Forestry
*/
 
*Reading data set;
data pigs;
  input inflam $ diet $ count;
  datalines;
  Nil Non-GMO 4
  Mild Non-GMO 31
  Moderate Non-GMO 29
  Severe Non-GMO 9
  Nil GMO 8
  Mild GMO 23
  Moderate GMO 18
  Severe GMO 23
;
run;
 
*Showing data set;
proc print data = pigs;
run;
 
*El quicko analysis;
ods graphics on;
proc logistic data = pigs order = data;
  freq count;
  class inflam (ref = "Nil") diet (ref = "Non-GMO") / param = ref;
  model inflam = diet / link = glogit;
  oddsratio diet;
run;

This produces a simple table with the same data as the paper and some very non-exciting results, which are better summarized in a single graph:

/*
Obs	inflam   diet   count
1	Nil      Non-GMO  4
2	Mild     Non-GMO 31
3	Moderate Non-GMO 29
4	Severe   Non-GMO  9
5	Nil      GMO      8
6	Mild     GMO     23
7	Moderate GMO     18
8	Severe   GMO     23
*/
Odd ratios for the different levels of stomach inflammation.

Odd ratios for the different levels of stomach inflammation.

The odd ratios would be 1 for no difference between the treatments. The graph shows that the confidence limits for all levels of inflammation include 1, so move on, nothing to see. In fact, GMO-fed pigs tend to have less inflammation for most disease categories.

P.S. There are many ways of running an analysis for this data set, but I’m in favor of approaches that take the whole problem in one go rather than looking at one class at the time. In an ideal situation we would have a continuous assessment for inflammation and the analysis would be a one-way ANOVA. I understand that for practical reasons one may prefer to split the response in four classes.

P.S.2 2013-06-15 I often act as a reviewer for scientific journals. In the case of this article some of my comments would have included: the analysis does not use the structure of the data properly, the photographs of the damaged organs should include both types of diet for each inflammation class (or at least include the most representative diet for the class), and the authors should highlight that there are no significant differences between the two diets for animal health; that is, the trial provides evidence for no difference between feeds. I still feel that the authors should be more forthcoming on terms of disclosing potential conflicts of interest too, but that’s their decision.

Tongue-in-cheek, of course, and with reference to weeds. This blog mostly uses R, but I’m pushing myself to use lots of different software to ‘keep the language’. Now if I could only do this with Spanish.

Flotsam 12: early June linkathon

A list of interesting R/Stats quickies to keep the mind distracted:

  • A long draft Advanced Data Analysis from an Elementary Point of View by Cosma Shalizi, in which he uses R to drive home the message. Not your average elementary point of view.
  • Good notes by Frank Davenport on starting using R with data from a Geographic Information System (GIS). Read this so you get a general idea of how things fit together.
  • If you are in to maps, Omnia sunt Communia! provides many good tips on producing them using R.
  • Mark James Adams reminded us that Prediction ≠ Understanding, probably inspired by Dan Gianola‘s course on Whole Genome Prediction. He is a monster of Bayesian applications to genetic evaluation.
  • If you are in to data/learning visualization you have to watch Bret Victor’s presentation on Media for thinking the unthinkable. He is so far ahead what we normally do that it is embarrassing.
  • I follow mathematician Atabey Kaygun in twitter and since yesterday I’ve been avidly reading his coverage of the protests in Turkey. Surely there are more important things going on in the world than the latest R gossip.

I’m marking too many assignments right now to have enough time to write something more substantial. I can see the light at the end of the tunnel though.

Analyzing a simple experiment with heterogeneous variances using asreml, MCMCglmm and SAS

I was working with a small experiment which includes families from two Eucalyptus species and thought it would be nice to code a first analysis using alternative approaches. The experiment is a randomized complete block design, with species as fixed effect and family and block as a random effects, while the response variable is growth strain (in \( \mu \epsilon\)).

When looking at the trees one can see that the residual variances will be very different. In addition, the trees were growing in plastic bags laid out in rows (the blocks) and columns. Given that trees were growing in bags siting on flat terrain, most likely the row effects are zero.

Below is the code for a first go in R (using both MCMCglmm and ASReml-R) and SAS. I had stopped using SAS for several years, mostly because I was running a mac for which there is no version. However, a few weeks ago I started accessing it via their OnDemand for Academics program via a web browser.

The R code using REML looks like:

# Options
options(stringsAsFactors = FALSE)
setwd('~/Dropbox/research/2013/growth-stress')
 
# Packages
require(ggplot2)
require(asreml)
require(MCMCglmm)
 
 
# Reading data, renaming, factors, etc
gs = read.csv('eucalyptus-growth-stress.csv')
summary(gs)
 
# Both argophloia and bosistoana
gsboth = subset(gs, !is.na(strain))
gsboth = within(gsboth, {
	species = factor(species)
	row = factor(row)
	column = factor(column)
	fam = factor(fam)
})            
 
 
 
ma = asreml(strain ~ species, random = ~ fam + row, 
            rcov = ~ at(species):units,
            data = gsboth)
 
summary(ma)$varcomp
 
#                                   gamma  component std.error   z.ratio constraint
#fam!fam.var                    27809.414  27809.414 10502.036 2.6480022   Positive
#row!row.var                     2337.164   2337.164  3116.357 0.7499666   Positive
#species_E.argopholia!variance 111940.458 111940.458 26609.673 4.2067580   Positive
#species_E.bosistoana!variance  63035.256  63035.256  7226.768 8.7224681   Positive

While using MCMC we get estimates in the ballpark by using:

# Priors
bothpr = list(R = list(V = diag(c(50000, 50000)), nu = 3), 
              G = list(G1 = list(V = 20000, nu = 0.002), 
                       G2 = list(V = 20000, nu = 0.002),
                       G3 = list(V = 20000, nu = 0.002)))
 
# MCMC                    
m2 = MCMCglmm(strain ~ species, random = ~ fam + row + column, 
              rcov = ~ idh(species):units,
              data = gsboth,
              prior = bothpr,
              pr = TRUE,
              family = 'gaussian',
              burnin = 10000,
              nitt = 40000,
              thin = 20,
              saveX = TRUE,
              saveZ = TRUE,
              verbose = FALSE)
 
summary(m2)
 
# Iterations = 10001:39981
# Thinning interval  = 20
# Sample size  = 1500 
# 
# DIC: 3332.578 
# 
# G-structure:  ~fam
# 
# post.mean l-95% CI u-95% CI eff.samp
# fam     30315    12211    55136     1500
# 
# ~row
# 
# post.mean l-95% CI u-95% CI eff.samp
# row      1449    5.928     6274    589.5
# 
# R-structure:  ~idh(species):units
# 
# post.mean l-95% CI u-95% CI eff.samp
# E.argopholia.units    112017    71152   168080     1500
# E.bosistoana.units     65006    52676    80049     1500
# 
# Location effects: strain ~ species 
# 
# post.mean l-95% CI u-95% CI eff.samp  pMCMC    
# (Intercept)            502.21   319.45   690.68     1500 <7e-04 ***
# speciesE.bosistoana   -235.95  -449.07   -37.19     1361  0.036 *

The SAS code is not that disimilar, except for the clear demarcation between data processing (data step, for reading files, data transformations, etc) and specific procs (procedures), in this case to summarize data, produce a boxplot and fit a mixed model.

* termstr=CRLF accounts for the windows-like line endings of the data set;
data gs;
  infile "/home/luis/Projects/euc-strain/growthstresses.csv" 
        dsd termstr=CRLF firstobs=2;
  input row column species $ family $ strain;
  if strain ^= .;
run;
 
proc summary data = gs print;
  class species;
  var strain;
run;
 
proc boxplot data = gs;
  plot strain*species;
run;
 
proc mixed data = gs;
  class row species family;
  model strain = species;
  random row family;
  repeated species / group=species;
run;
 
/*
Covariance Parameter Estimates
Cov Parm 	Group 	               Estimate
row 	  	                       2336.80
family 	  	                       27808
species 	species E.argoph 	111844
species 	species E.bosist 	63036
*/
SAS boxplot for the data set.

SAS boxplot for the data set.

I like working with multiple languages and I realized that, in fact, I missed SAS a bit. It was like meeting an old friend; at the beginning felt strange but we were quickly chatting away after a few minutes.

Subsetting data

At School we use R across many courses, because students are supposed to use statistics under a variety of contexts. Imagine their disappointment when they pass stats and discovered that R and statistics haven’t gone away!

When students start working with real data sets one of their first stumbling blocks is subsetting data. We have data sets and either they are required to deal with different subsets or there is data cleaning to do. For some reason, many students struggle with what should be a simple task.

If one thinks of data as as a matrix/2-dimensional array, subsetting boils down to extracting the needed rows (cases) and columns (variables). In the R world one can do this in a variety of ways, ranging from the cryptic to the explicit and clear. For example, let’s assume that we have a dataset called alltrials with heights and stem diameters for a number of trees in different trials (We may have a bunch of additional covariates and experimental design features that we’ll ignore for the moment). How do we extract all trees located in Christchurch?

Two common approaches are:

mytrial = alltrials[alltrials$location == 'Christchurch', ]
 
mytrial = subset(alltrials, location == 'Christchurch')

While both expressions are equivalent, the former reads like Klingon to students, while the latter makes explicit that we are obtaining a subset of the original data set. This can easily be expanded to more complex conditions; for example to include all trees from Christchurch that are taller than 10 m:

mytrial = alltrials[alltrials$location == 'Christchurch' & alltrials$height > 10, ]
 
mytrial = subset(alltrials, location == 'Christchurch' & height > 10)

I think the complication with the Klingonian notation comes mostly from two sources:

  • Variable names for subsetting the data set are not directly accessible, so we have to prefix them with the NameOfDataset$, making the code more difficult to read, particularly if we join several conditions with & and |.
  • Hanging commas: if we are only working with rows or columns we have to acknowledge it by suffixing or prefixing with a comma, which are often forgotten.

Both points result on frustrating error messages like

  • Error in `[.data.frame`(alltrials, location == 'Christchurch', ) : object 'location' not found for the first point or
  • Error in `[.data.frame`(alltrials, alltrials$location == 'Christchurch') undefined columns selected for the second point.

The generic forms of these two notations are:

dataset[what to do with rows, what to do with columns]
 
subset(dataset, what to do with rows, what to do with columns)

We often want to keep a subset of the observed cases and keep (or drop) specific variables. For example, we want to keep trees in 'Christchurch' and we want to ignore diameter, because the assessor was 'high' that day:

# With this notation things get a bit trickier
# The easiest way is to provide the number of the variable
# Here diameter is the 5th variable in the dataset
mytrial = alltrials[all.trials$location == 'Christchurch' & all.trials$height > 10, -5]
 
# This notation is still straightforward
mytrial = subset(alltrials, location == 'Christchurch' & height > 10, select = -diameter)

There are, however, situations where Klingon is easier or more efficient. For example, to take a random sample of 100 trees from the full dataset:

 
mytrial = alltrials[sample(1:nrow(alltrials), 100, replace = FALSE),]

If you are interested in this issue Quick-R has a good description of subsetting. I'm sure this basic topic must has been covered many times, although I doubt anyone used Klingon in the process.

Gratuitous picture: building blocks for research (Photo: Luis).

Gratuitous picture: building blocks for R (Photo: Luis).

Protectionism under another name

This morning Radio New Zealand covered a story (audio) where Tomatoes New Zealand (TNZ, the growers association) was asking the Government to introduce compulsory labeling for irradiated products (namely imported Australian tomatoes), stating that consumers deserve an informed choice (TNZ Press Release). Two points that I think merit attention:

  • Food irradiation is perfectly safe: it does not make food radioactive, it does not alter the nutritional value of food and reduces the presence (or completely eliminates) the presence of microorganisms that cause disease or pests (the latter being the reason for irradiation in this case).
  • The second point is that the call for labeling does not come from consumers (or an organization representing them) but from producers that face competition.

This situation reminded me of Milton Friedman talking about professional licenses The justification offered is always the same: to protect the consumer. However, the reason is demonstrated by observing who lobbies. Who is doing the lobbying is very telling in this case, particularly because there is no real reason to induce fear on the consumer, except that irradiation sounds too close to radioactive, and therefore TNZ is hoping to steer consumers away from imported tomatoes. Given that TNZ is for informing the consumer they could label tomatoes with ‘many of the characteristics in this tomato are the product of mutations‘. Harmless but scary.

P.S. The media is now picking up the story. Shameful manipulation.
P.S.2 Give that TNZ is in favor of the right to know I want a list of all chemicals used in the production of New Zealand tomatoes, how good is their water management and the employment practices of tomato growers.