
## Implementing a model as an R package

In our research group we often have people creating statistical models that end up in publications but, most of the time, the practical implementation of those models is lacking: we have a bunch of barely functioning code that is very difficult to use reliably in the operations of the breeding programs. I was keen enough on continuing to use one of the models in our research that I rewrote and documented the model fitting, and then created another package for using the model in operations.

Unfortunately, neither the data nor the model are mine to give away, so I can’t share them (yet). But I hope these notes will help you if you are in the same boat and need to use your models (or ‘you’ are in fact future me, who tends to forget how or why I wrote code in a specific way).

## A basic motivational example

Let’s start with a simple example: linear regression. We want to predict a response using a predictor variable, and we can then predict the response for new values of the predictor contained in new_data with:

my_model <- lm(response ~ predictor, data = my_data)
predictions <- predict(my_model, newdata = new_data)

# Saving the model object
save(my_model, file = 'model_file.Rda')


The model coefficients needed to predict new values are stored in the my_model object. If we want to use the model elsewhere, we can save the object as an .Rda file, in this case model_file.Rda.

We can later read the model file in, say, a different project and get new predictions using:

load('model_file.Rda')
more_predictions <- predict(my_model, newdata = yet_another_new_data)


## Near-infrared Spectroscopy

Near-infrared spectroscopy is the stuff of CSI and other crime shows. We measure the reflection at different wavelengths and run a regression analysis using whatever we want to predict as the Y in the model. The number of predictors (wavelengths) is much larger than in the previous example (1,296 for the NIR machine we are using), so it is not unusual to have more predictors than observations. NIR calibration models are often fitted using plsr() (from the pls package), with help from functions in the prospectr package.
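As a rough sketch of what fitting such a model can look like; the data are simulated and the names (spectra, trait, nir_model) are made up for illustration, with 50 wavelengths standing in for the real 1,296:

```r
library(pls)

# Simulated stand-in for real NIR data: 80 samples, 50 wavelengths
set.seed(42)
n <- 80
spectra <- matrix(rnorm(n * 50), nrow = n)
trait <- rowMeans(spectra[, 1:5]) + rnorm(n, sd = 0.1)

# pls expects the spectra as a matrix column inside a data frame
train_data <- data.frame(trait = trait)
train_data$spectra <- spectra

# Partial least squares fit with cross-validation
nir_model <- plsr(trait ~ spectra, ncomp = 10, data = train_data,
                  validation = 'CV')
summary(nir_model)
```

The number of components (ncomp) would normally be chosen by looking at the cross-validation results rather than fixed in advance.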

I could still use the save/load approach from the motivational example to store and reuse the model object created with pls but, instead, I wanted to implement the model, plus some auxiliary functions, in a package to make the functions easier to use in our lab.

I had two issues/struggles/learning opportunities that I needed to sort out to get this package working:

### 1. How to automatically load the model object when attaching the package?

Normally, datasets and other objects go in the data folder, where they are made available to the user. Instead, I wanted to make the object internally available. The solution turned out to be quite straightforward: save the model object to a file called sysdata.rda in the R folder of the package. This file is automatically loaded when we run library(package_name). We just need to create that file with something like:

save(my_model, file = 'R/sysdata.rda')


### 2. How to make predict.pls work in the package?

I was struggling to use the predict function because, in my head, it was provided by the pls package. However, pls only extends the predict generic, which comes with the default R installation as part of the stats package. In the end, I sorted it out with the following Imports, Depends and LazyData in the DESCRIPTION file:

Imports: prospectr,
stats
Depends: pls
Encoding: UTF-8
LazyData: Yes


Now it is possible to use predict; just remember to specify the package it comes from, as in:

stats::predict(my_model, ncomp = n_components,
               newdata = spectra, interval = 'confidence')


Nothing groundbreaking, I know, but I spent a bit of time sorting out those couple of annoyances before everything fell into place. Right now we are using the models in a much easier and more reproducible way.


## Being data curious: the strange case of lamb consumption in NZ

There is a lot of talk about the skills needed for working in Statistics/Data Science, with the discussion often focusing on theoretical understanding, programming languages, exploratory data analysis, and visualization. There are many good blog posts dealing with how to get data, process it with your favorite language and then create some good-looking plots. However, in my opinion, one important skill is curiosity; more specifically, being data curious.

Oftentimes being data curious doesn’t require statistics or coding, but just searching for and looking at graphs. A quick example comes from Mike Dickinson’s tweet: “This is extraordinary: within a decade, NZers basically stopped eating lamb. 160 years of tradition scrapped almost overnight.”

After reading the news article, many people came up with good potential explanations: Have the relative prices changed? Do we have different demographics with not so much appetite for lamb? etc.

Few comments questioned the data until Peter Ellis voiced exactly what was nagging me:

Do the two data points make sense? In this data-abundant world, it didn’t take long to find the time series from which the points came in this handy OECD page.

A quick look shows that the series contains both of the quoted consumption figures, showing the talked-about ten-year decline. Even more surprisingly, one can see that practically all of the decline occurred between 2008 and 2009 (from 17.7 to 4.9 kg/person), which is a bizarre drop for a single year. A single person may have large differences in consumption from one year to the next; however, over a whole country those deviations tend to be averaged out. This highlights another issue with the time series: it wiggles like crazy.

When exploring data it is useful to have some sort of benchmark to see if other things are also changing at the same time. I chose our neighbor Australia—with a not-so-different diet, in a similar part of the world—as my benchmark. The Australian time series doesn’t show a change like NZ’s. Besides using the benchmark for the same product, we can also compare what’s going on with other meats; for example, beef and veal, pork and poultry.

All the series are smoother and show similar trends in Australia and New Zealand, which makes the lamb saga increasingly look like a mistake. We can now move from trying to explain social changes that are driving the change between two numbers, to being highly suspicious about the numbers under discussion!

So where could the problem be coming from? Consumption per capita requires i) total domestic consumption of sheep meat and ii) the population of the country. We are pretty sure we have good data for population, courtesy of Statistics New Zealand. How would one go about estimating domestic consumption of sheep meat? Probably one would:

• Get the size of the New Zealand sheep flock. We can get sheep numbers from Statistics NZ Agricultural Production Statistics. Livestock numbers are a national indicator, which tend to have high accuracy.
• Get an idea of the proportion of the flock that’s exported, which we know is pretty substantial. I don’t know how good these numbers are, but Beef & Lamb NZ gives us an idea of how many sheep are slaughtered for export. This number, which hovers around 20 million a year, seems quite consistent. We have to remember that not the whole flock is slaughtered every year, as we have to replace the flock.
• The difference between flock size – (sheep for export + replacement sheep) should be the number of sheep for domestic consumption.
• We need a conversion factor between number of sheep and kg of meat produced, so we can calculate meat consumption/capita.
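The steps above boil down to back-of-the-envelope arithmetic. As a sketch, with every number below a made-up placeholder rather than an actual statistic:

```r
# All figures are invented placeholders, only to illustrate the arithmetic
flock_size   <- 30e6   # total sheep (Agricultural Production Statistics)
export_kill  <- 20e6   # slaughtered for export (Beef & Lamb NZ)
replacement  <- 8e6    # kept to replace the flock
kg_per_sheep <- 18     # assumed sheep-to-meat conversion factor
population   <- 4.5e6  # NZ population (Statistics New Zealand)

# Sheep left for the domestic market
domestic_sheep <- flock_size - (export_kill + replacement)

# Apparent consumption per capita
kg_per_capita <- domestic_sheep * kg_per_sheep / population
kg_per_capita
# 8 kg/person/year with these invented inputs
```

The point of the sketch is how sensitive the final number is: domestic consumption is a small difference between large quantities, so modest errors in any input move the per-capita figure a lot.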

I would assume that the sheep-meat conversion factor will show little fluctuation from year to year, so perhaps the likely culprit is the penultimate point, estimating the number of sheep for domestic consumption. One thing that grabs my attention is that while the flock is getting smaller, the number of sheep for exports stays around the same, which should mean fewer sheep available for the domestic market, giving credibility to the lower lamb consumption trend.

I don’t know if this is the actual explanation for the “lamb consumption crash”. If I had more time I could chase some of the domestic consumption numbers, or even call the Beef & Lamb people. But this should be enough to get you started with an example of how to question the news using real data. I’m sure you, the reader, can come up with better ways of looking at this and other stories.


## Functions with multiple results in tidyverse

I have continued playing with the tidyverse for different parts of a couple of projects.

Often I need to apply a function to groups of observations; sometimes, that function returns more than a single number. It could be something like, for each group, fit a distribution and return the distribution parameters. Or, simpler for the purposes of this exploration, calculate and return a bunch of numbers.

describe_c <- function(x) {
  mn <- mean(x, na.rm = TRUE)
  dev <- sd(x, na.rm = TRUE)
  n <- sum(!is.na(x))
  cv <- dev/mn*100
  return(c(mean = mn, sdev = dev, count = n, coefvar = cv))
}


If I have a data frame called field_data, with family codes (trees with the same parents; codes have been changed to protect the innocent) and stem diameters (in mm), I could do the following in base R:

# This line produces an annoying list
summary_one <- with(field_data, tapply(stem, family, FUN = describe_c))

# This puts together a matrix by joining
# the list results using rbind()
summary_one <- do.call(rbind, summary_one)

# To continue processing it might be better to convert
# to a data frame
summary_one <- data.frame(summary_one)


And if I need to do this for several variables, I will have to merge each of these matrices into a data frame.

Continuing with my experimentation with the tidyverse, I was wondering how to get the above going with dplyr et al. After failing a few times I asked the question on Twitter and got a number of helpful replies.

One of the keys is that dplyr can store a list result from a function. Modifying my toy function is pretty straightforward, and now looks like:

describe_list <- function(x) {
  mn <- mean(x, na.rm = TRUE)
  dev <- sd(x, na.rm = TRUE)
  n <- sum(!is.na(x))
  cv <- dev/mn*100
  return(list(c(mean = mn, sdev = dev, count = n, coefvar = cv)))
}


Using the new function with dplyr is then a matter of grouping and summarising:

field_data %>%
  group_by(family) %>%
  summarise(model = describe_list(stem)) -> summary_two


And we can check the contents of summary_two to see we have a list in which each element contains 4 values:

head(summary_two)
# A tibble: 6 x 2
#   family     model
#   <fctr>    <list>
# 1      A <dbl [4]>
# 2      B <dbl [4]>
# 3      C <dbl [4]>
# 4      D <dbl [4]>
# 5      E <dbl [4]>
# 6      F <dbl [4]>


We still need to extract each value inside the list elements and assign it to a variable name. Using map from the purrr package is pretty straightforward here, and we can extract the values either by their names or by their position in the element.

summary_two %>%
  mutate(mn = map_dbl(model, 'mean'),
         sd = map_dbl(model, 'sdev'),
         n = map_dbl(model, 'count'),
         cv = map_dbl(model, 'coefvar'))

# A tibble: 6 x 6
#   family     model       mn       sd     n       cv
#   <fctr>    <list>    <dbl>    <dbl> <dbl>    <dbl>
# 1      A <dbl [4]> 190.8306 23.71290   425 12.42615
# 2      B <dbl [4]> 190.1111 25.46554   396 13.39508
# 3      C <dbl [4]> 188.2646 27.39215   461 14.54981
# 4      D <dbl [4]> 189.2668 25.16330   431 13.29514
# 5      E <dbl [4]> 183.5238 19.70182    21 10.73530
# 6      F <dbl [4]> 183.1250 28.82377    24 15.73994


I'm still playing with ideas to be lazier at extraction time. An almost abhorrent idea is to provide the output as character for later type conversion, as in:

describe_char <- function(x) {
  mn <- mean(x, na.rm = TRUE)
  dev <- sd(x, na.rm = TRUE)
  n <- sum(!is.na(x))
  cv <- dev/mn*100
  return(paste(mn, dev, n, cv, sep = ':'))
}

field_data %>%
  group_by(family) %>%
  summarise(model = describe_char(stem)) -> summary_three

# A tibble: 6 x 2
#   family                                                  model
#   <fctr>                                                  <chr>
# 1      A 190.830588235294:23.7128956613006:425:12.4261502731746
# 2      B 190.111111111111:25.4655444116168:396:13.3950847284951
# 3      C  188.26464208243:27.3921487349435:461:14.5498105390125
# 4      D 189.266821345708:25.1632953227626:431:13.2951434085746
# 5      E   183.52380952381:19.7018249094317:21:10.7352963959021
# 6      F           183.125:28.8237711378767:24:15.7399432834822

summary_three %>%
  separate(model, c('mn', 'sd', 'n', 'cv'), sep = ':') %>% head

# A tibble: 6 x 5
#   family               mn               sd     n               cv
#   <fctr>            <chr>            <chr> <chr>            <chr>
# 1      A 190.830588235294 23.7128956613006   425 12.4261502731746
# 2      B 190.111111111111 25.4655444116168   396 13.3950847284951
# 3      C  188.26464208243 27.3921487349435   461 14.5498105390125
# 4      D 189.266821345708 25.1632953227626   431 13.2951434085746
# 5      E  183.52380952381 19.7018249094317    21 10.7352963959021
# 6      F          183.125 28.8237711378767    24 15.7399432834822


And we can get all the way there with:

summary_three %>%
  separate(model, c('mn', 'sd', 'n', 'cv'), sep = ':') %>%
  mutate_at(c('mn', 'sd', 'n', 'cv'), as.numeric) %>% head

# A tibble: 6 x 5
#   family       mn       sd     n       cv
#   <fctr>    <dbl>    <dbl> <dbl>    <dbl>
# 1      A 190.8306 23.71290   425 12.42615
# 2      B 190.1111 25.46554   396 13.39508
# 3      C 188.2646 27.39215   461 14.54981
# 4      D 189.2668 25.16330   431 13.29514
# 5      E 183.5238 19.70182    21 10.73530
# 6      F 183.1250 28.82377    24 15.73994


Which I assume has all sorts of potential negative side effects, but looks really cool.

In case you want to play with the problem, here is a tiny example of field data.


## Turtles all the way down

One of the main uses for R is for exploration and learning. Let’s say that I wanted to learn simple linear regression (the bread and butter of statistics) and see how the formulas work. I could simulate a simple example and fit the regression with R:

library(arm)  # For display()

# Simulate 5 observations
set.seed(50)
x <- 1:5
y <- 2 + 3*x + rnorm(5, mean = 0, sd = 3)

# Put the simulated data in a data frame and fit the regression
dat <- data.frame(x, y)
reg <- lm(y ~ x, data = dat)
display(reg)

# lm(formula = y ~ x, data = dat)
#             coef.est coef.se
# (Intercept) 3.99     3.05
# x           2.04     0.92
# ---
# n = 5, k = 2
# residual sd = 2.91, R-Squared = 0.62

# Plot it
plot(y ~ x)
abline(coef(reg))


The formulas for the intercept ($$b_0$$) and the slope ($$b_1$$) are pretty simple, and I have been told that there is a generic expression that instead uses matrices.

$$b_1 = \frac{\sum{x y} - n \bar{x} \bar{y}}{\sum{x x} - n \bar{x}^2}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$

$$\boldsymbol{b} = (\boldsymbol{X}'\boldsymbol{X})^{-1} \boldsymbol{X}'\boldsymbol{y}$$

How do the contents of the matrices and the simple formulas relate to each other?

# Formulas for slope and intercept
b1 <- (sum(x*y) - length(x)*mean(x)*mean(y))/(sum(x*x) - length(x)*mean(x)^2)
b0 <- mean(y) - b1*mean(x)


Funnily enough, looking at the matrices we can see similar sums of squares and crossproducts as in the formulas.

X <- model.matrix(reg)

#   (Intercept) x
# 1           1 1
# 2           1 2
# 3           1 3
# 4           1 4
# 5           1 5
# attr(,"assign")
# [1] 0 1

t(X) %*% X
#             (Intercept)  x
# (Intercept)           5 15
# x                    15 55

# So X'X contains bits and pieces of the previous formulas
length(x)
# [1] 5

sum(x)
# [1] 15

sum(x*x)
# [1] 55

# And so does X'y
t(X) %*% y
#                  [,1]
# (Intercept)  50.61283
# x           172.27210

sum(y)
# [1] 50.61283

sum(x*y)
# [1] 172.2721

# So if we combine the whole lot and remember that
# solve() calculates the inverse
solve(t(X) %*% X) %*% t(X) %*% y
#                 [,1]
# (Intercept) 3.992481
# x           2.043362


But I have been told that R (as most statistical software) doesn't use the inverse of the matrix for estimating the coefficients. So how does it work?

If I type lm, R will print the code of the lm() function. A quick look reveals a lot of code reading the arguments and checking that everything is OK before proceeding. However, the function then calls something else: lm.fit(). With some trepidation I type lm.fit, which again performs more checks and then calls something with a different notation:

z <- .Call(C_Cdqrls, x, y, tol, FALSE)


This denotes a call to a C-language function which, after some searching on Google, we can find in readable form in the lm.c file. Another quick look brings more checking and a call to Fortran code:

F77_CALL(dqrls)(REAL(qr), &n, &p, REAL(y), &ny, &rtol,
REAL(coefficients), REAL(residuals), REAL(effects),
&rank, INTEGER(pivot), REAL(qraux), work);


which is a highly tuned routine for QR decomposition in a linear algebra library. By now we know that the general matrix expression produces the same results as our initial formulas, and that R's lm() function does not use a matrix inverse but a QR decomposition to solve the system of equations.
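We can mimic that behaviour from the R level without calling the compiled code directly. A sketch recreating the small simulated dataset from the start of the post, using base R's qr() family:

```r
# Recreate the small simulated dataset and its design matrix
set.seed(50)
x <- 1:5
y <- 2 + 3*x + rnorm(5, mean = 0, sd = 3)
X <- cbind('(Intercept)' = 1, x = x)

# Solve the least-squares problem the way lm() does:
# via a QR decomposition, with no explicit matrix inverse
qr_X <- qr(X)
qr.coef(qr_X, y)

# One-line equivalent
qr.solve(X, y)

# Same answer as the (numerically worse) textbook expression
solve(t(X) %*% X) %*% t(X) %*% y
```

The QR route is preferred because it avoids forming X'X, whose condition number is the square of that of X, so it is much better behaved numerically.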

One of the beauties of R is that it brought the power of statistical computing to the masses, not only letting you fit models but also letting you peek at how things are implemented. As a user, I don't need to know that there is a chain of function calls initiated by my bread-and-butter linear regression. But it is comforting to the nerdy me that I can have a quick look at that.

All this for free, which sounds like a very good deal to me.


## Cute Gibbs sampling for rounded observations

I was attending a course on Bayesian Statistics where this problem showed up:

There are a number of individuals, say 12, who take a pass/fail test 15 times. For each individual we have recorded the number of passes, which can go from 0 to 15. Because of confidentiality issues, we are presented with rounded-to-the-closest-multiple-of-3 data ($$\mathbf{R}$$). We are interested in estimating $$\theta$$ of the Binomial distribution behind the data.

Rounding is probabilistic, with probability 2/3 if the count is one away from a multiple of 3 and probability 1/3 if it is two counts away. Multiples of 3 are not rounded.

We can use Gibbs sampling to alternate between sampling the posterior for the unrounded $$\mathbf{Y}$$ and $$\theta$$. In the case of $$\mathbf{Y}$$ I used:

# Possible values that were rounded to R
possible <- function(rounded) {
  if(rounded == 0) {
    options <- c(0, 1, 2)
  } else {
    options <- c(rounded - 2, rounded - 1, rounded,
                 rounded + 1, rounded + 2)
  }
  return(options)
}

# Probability mass function of numbers rounding to R
# given theta
prior_y <- function(options, theta) {
  p <- dbinom(options, 15, prob = theta)
  return(p)
}

# Likelihood of rounding
like_round3 <- function(options) {
  if(length(options) == 3) {
    like <- c(1, 2/3, 1/3)
  } else {
    like <- c(1/3, 2/3, 1, 2/3, 1/3)
  }
  return(like)
}

# Estimating posterior mass function and drawing a
# random value of it
posterior_sample_y <- function(R, theta) {
  po <- possible(R)
  pr <- prior_y(po, theta)
  li <- like_round3(po)
  post <- li*pr/sum(li*pr)
  samp <- sample(po, 1, prob = post)
  return(samp)
}


For $$\theta$$ we are assuming a vague $$\mbox{Beta}(\alpha, \beta)$$, with $$\alpha$$ and $$\beta$$ equal to 1, as prior density function, so the posterior density is a $$\mbox{Beta}(\alpha + \sum Y_i, \beta + 12 \times 15 - \sum Y_i)$$.

## Function to sample from the posterior Pr(theta | Y, R)
posterior_sample_theta <- function(alpha, beta, Y) {
  theta <- rbeta(1, alpha + sum(Y), beta + 12*15 - sum(Y))
  return(theta)
}


I then implemented the sampler as:

## Data
R <- c(0, 0, 3, 9, 3, 0, 6, 3, 0, 6, 0, 3)
nsim <- 10000
burnin <- 1000
alpha <- 1
beta <- 1
store <- matrix(0, nrow = nsim, ncol = length(R) + 1)

starting.values <- c(R, 0.1)

## Sampling
store[1,] <- starting.values
for(draw in 2:nsim){
  current <- store[draw - 1,]
  for(obs in 1:length(R)) {
    y <- posterior_sample_y(R[obs], current[length(R) + 1])
    # Jump or not still missing
    current[obs] <- y
  }
  theta <- posterior_sample_theta(alpha, beta, current[1:length(R)])
  # Jump or not still missing
  current[length(R) + 1] <- theta

  store[draw,] <- current
}


And plotted the results as:

plot((burnin+1):nsim, store[(burnin+1):nsim,13], type = 'l')

library(ggplot2)

ggplot(data.frame(theta = store[(burnin+1):nsim, 13]), aes(x = theta)) +
  geom_density(fill = 'blue', alpha = 0.5)


multiple_plot <- data.frame(Y = matrix(store[(burnin+1):nsim, 1:12],
nrow = (nsim - burnin)*12,
ncol = 1))
multiple_plot$obs <- factor(rep(1:12, each = (nsim - burnin)))

ggplot(multiple_plot, aes(x = Y)) + geom_histogram() + facet_grid(~ obs)


I thought it was a nice, cute example of simultaneously estimating a latent variable and, based on that, estimating the parameter behind it.

## Teaching linear models

I teach several courses every year, and the most difficult to pull off is FORE224/STAT202: regression modeling.

The academic promotion application form in my university includes a section on one’s ‘teaching philosophy’. I struggle with that part because I suspect I lack anything as grandiose as a philosophy when teaching: like most university lecturers I never studied teaching, although I try to do my best. If anything, I can say that I enjoy teaching, helping students to ‘get it’, and that I want to instill a sense of ‘statistics is fun’ in them. I spend quite a bit of time looking for memorable examples, linking to stats in the news (statschat and listening to the news while walking my dog are very helpful here) and collecting data. But a philosophy? I don’t think so.

One of the hardest parts of the course is the diversity of student backgrounds. Hitting the right level, the right tone, is very hard. Make it too easy and the 1/5 to 1/4 of students with a good mathematical background will hate it; they may even decide to abandon any intention of continuing with stats if ‘that’s all there is to the topic’. Make it too complicated and half the class will fail and/or hate the content.

Part of the problem is what we mean by teaching ‘statistics’. In some cases it seems limited to what specific software does; for example, teaching with Excel means restriction to whatever models are covered in Excel’s Data Analysis Toolpak (DAT). The next choice when teaching is menu-driven software (e.g. SPSS), which provides much more statistical functionality than Excel + DAT, at the expense of being further removed from common usability conventions.
At the other extreme of simplicity is software that requires coding to control the analyses (e.g. R or SAS). In general, the more control we want, the more we have to learn to achieve it. A while ago I made a distinction between the different levels of learning (use cases) when teaching statistics. In summary, we have i) very few students getting into statistics and heavy-duty coding, ii) a slightly larger group that will use stats once in a while, and iii) the majority, who will mostly consume statistics. I feel a duty towards the three groups, while admitting that I have a predilection for the first one. Nevertheless, the third group provides most of the challenges and the need for thinking about how to teach the subject.

When teaching linear models (general form $$y = X \beta + \epsilon$$) we tend to compartmentalize content: we have an ANOVA course if the design matrix $$X$$ represents categorical predictors (contains only 1s and 0s), a regression course if $$X$$ is full of continuous predictors, and we talk about ANCOVA or regression on dummy variables if $$X$$ is a combination of both. The use of different functions for different contents of $$X$$ (for example aov() versus lm() in R, or proc reg versus proc glm in SAS) further consolidates the distinction. Even when using menus, software tends to guide students through different submenus depending on the type of $$X$$.

At the beginning of the course we restrict ourselves to $$X$$ full of continuous predictors, but we introduce the notion of matrices with small examples. This permits showing the connection between all the linear model courses (because a rose by any other name…) and it also allows deriving a general expression of the formulas for the regression coefficients (essential for the most advanced students). Slower students may struggle with some of this material; however, working with small examples they can replicate the results from R (or Excel or SAS or whatever one uses to teach).
Sometimes they even think it is cool. Here is where the model.matrix() R function becomes handy; rather than building incidence matrices by hand (which is easy for tiny examples), we can get the matrices used by the lm() function and then calculate regression parameters (and any other output) for more complex models.

Once students get the idea that, in matrix terms, our teaching compartments are pretty much the same, we can reinforce the idea by using a single function (or proc) to show that we can obtain all the bits and pieces that make up what we call ‘fitting the model’. This highlights the idea that ANOVA, ANCOVA and regression are subsets of linear models, which are subsets of linear mixed models, which are subsets of generalized linear mixed models. A statistical Russian doll.

We want students to understand, sometimes so badly that we lower the bar to a point where there is not much left to understand. Here is the tricky part: finding the right level of detail so all types of students learn to enjoy the topic, although at different levels of understanding.

There is software that generates code from menus too, like Stata or Genstat.

P.S. This is part of my thinking aloud, with hesitation, about teaching, as in Statistics unplugged, Excel, fanaticism and R, Split-plot 1: How does a linear mixed model look like?, R, academia and the democratization of statistics, Mid-January flotsam: teaching edition & Teaching with R: the switch. I am always looking for better ways of transferring knowledge.

## Statistics unplugged

How much does statistical software help and how much does it interfere when teaching statistical concepts? Software used in the practice of statistics (say R, SAS, Stata, etc.) brings to the party a mental model that is often alien to students, while being highly optimized for practitioners. It is possible to introduce a minimum of distraction while focusing on teaching concepts, although it requires a careful choice of a subset of functionality.
Almost invariably some students get stuck with the software and everything goes downhill from there; the student moves from struggling with a concept to struggling with syntax (Do I use a parenthesis here?).

I am a big fan of Tim Bell’s Computer Science Unplugged, a program for teaching Computer Science ideas at primary and secondary school without using computers (see example videos). Here is an example video on public key encryption:

http://youtu.be/jJrICB_HvuI

This type of instruction makes me question both how we teach statistics and at what level we can start teaching statistics. The good news is that the New Zealand school curriculum includes statistics in secondary school, for which there is an increasing number of resources. However, I think we could be targeting students even earlier.

This year my wife was helping primary school students participating in a science fair and I ended up volunteering to introduce them to some basic concepts so they could design their own experiments. Students got the idea of the need for replication, randomization, etc. based on a simple question: did one of them have special powers to guess the result of flipping a coin? (Of course this is Fisher’s tea-drinking-lady experiment, but no 10-year-old cares about tea, while at least some of them care about super powers.)

After the discussion, one of them ran a very cool experiment on the effect of liquefaction on the growth of native grasses (very pertinent in post-earthquake Christchurch), with 20 replicates (pots) for each treatment. He got the concepts behind the experiment; software only entered the scene when we needed to confirm our understanding of the results in a visual way.

People tell me that teaching stats without a computer is like teaching chemistry without a lab or doing astronomy without a telescope, or… you get the idea. At the same time, there are some books that describe class activities that do not need a computer; e.g.
Gelman’s Teaching Statistics: A Bag of Tricks. (Incidentally, why is that book so friggin’ expensive?)

## Back to uni

Back from primary school kiddies to a regression course at university. Let’s say that we have two variables, x and y, and that we want to regress y (response) on x (predictor) and get diagnostic plots. In R we could simulate some data and plot the relationship using something like this:

# Basic regression data
n = 100
x = 1:n
y = 70 + x*5 + rnorm(n, 0, 40)

# Changing a couple of points and plotting
y[50] = 550
y[100] = 350
plot(y ~ x)


We can then fit the linear regression and get some diagnostic plots using:

# Regression and diagnostics
m1 = lm(y ~ x)
par(mfrow = c(2,2))
plot(m1)
par(mfrow = c(1,1))


If we ask students to explain the 4th plot, which displays discrepancy (how far a point is from the general trend) against leverage (how far a point is from the center of mass, the pivot of the regression), many of them will struggle to say what is going on. At that moment one could go and calculate the hat matrix of the regression ($$X (X'X)^{-1} X'$$), get leverage from its diagonal, etc., and students will get a foggy idea. Another, probably better, option is to present the issue as a physical system with which students already have experience.

A good candidate for a physical system is the seesaw, because many (perhaps most) students played on one as children. Take your students to a playground (luckily there is one next to uni) and get them playing with a seesaw. The influence of a point is related to the product of leverage (how far from the pivot we are applying force) and discrepancy (how big is the force applied). The influence of a point on the estimated regression coefficients will be very large when we apply a strong force far from the pivot (as with our point y[100]), just as it happens in a seesaw. We can apply lots of force (discrepancy) near the pivot (as with our point y[50]) and little will happen.
Students like mucking around with the seesaw and, more importantly, they remember.

Analogy can go only so far. Sometimes a physical analogy like a quincunx (to demonstrate the central limit theorem) ends up being more confusing than using an example with variables that are more meaningful to students.

I don’t know what the maximum proportion of course content is that could be replaced by props, experiments, animations, software specifically designed to make a point (rather than to run analyses), etc. I do know that we still need to introduce ‘proper’ statistical software, as at some point students have to face praxis. Nevertheless, developing an intuitive understanding is vital to move beyond performing monkeys; that is, people clicking on menus or going through the motions of copy/pasting code without understanding what’s going on in the analyses.

I’d like to hear if you have any favorite demos/props/etc. for explaining statistical concepts.

P.S. In this post I don’t care if you love stats software, but I specifically care about helping learners who struggle to understand concepts.

## Flotsam 13: early July links

Man flu kept me at home today, so I decided to do something ‘useful’ and go for a linkathon:

Sometimes people are truthful and cruel. Here Gappy on a mission goes for the jugular:

https://twitter.com/gappy3000/status/354063247814561792

Over and out.

## My take on the USA versus Western Europe comparison of GM corn

A few days ago I came across Jack Heinemann and collaborators’ article (Sustainability and innovation in staple crop production in the US Midwest, Open Access) comparing the agricultural sectors of the USA and Western Europe. While the article is titled around the word sustainability, the main comparison stems from the use of genetically modified crops in the USA versus their absence in Western Europe.
I was curious about part of the results and discussion which, in a nutshell, suggest that “GM cropping systems have not contributed to yield gains, are not necessary for yield gains, and appear to be eroding yields compared to the equally modern agroecosystem of Western Europe”. The authors relied on several crops for the comparison (maize/corn, rapeseed/canola—see P.S.6—soybean and cotton); however, I am going to focus on a single one (corn) for two reasons: 1. I can’t afford a lot of time for blog posts when I should be preparing lectures and 2. I like eating corn.

When the authors of the paper tackled corn, the comparison was between the USA and Western Europe, using the United Nations definition of Western Europe (i.e. Austria, Belgium, France, Germany, Liechtenstein, Luxembourg, Monaco, Netherlands, Switzerland). Some large European corn producers like Italy are not there because of the narrow definition of Western. I struggled with the comparison used by the authors because, in my opinion, there are potentially so many confounded effects (different industry structures, weather, varieties, etc.) that it can’t provide the proper counterfactual for GM versus non-GM crops. Anyway, I decided to have a look at the same data to see if I would reach the same conclusions.

The article provides a good description of where the data came from, as well as how the analyses were performed. The small details needed to match the results exactly were fairly easy to figure out. I downloaded the FAO corn data (3.7 MB csv file) for all countries (so I can reuse the code and data later for lectures and assignments).
I then repeated the plots using the following code:

# Default directory
setwd('~/Dropbox/quantumforest')

# Required packages
require(ggplot2)
require(scales) # for the comma label formatter

# Reading FAO corn data
FAOcorn = read.csv('FAOcorn.csv')

# Extracting Area
FAOarea = subset(FAOcorn, Element == 'Area Harvested',
                 select = c('Country', 'Year', 'Value'))
names(FAOarea)[3] = 'Area'

# and Production
FAOprod = subset(FAOcorn, Element == 'Production',
                 select = c('Country', 'Year', 'Value'))
names(FAOprod)[3] = 'Production'

# to calculate yield in hectograms/ha
FAOarea = merge(FAOarea, FAOprod, by = c('Country', 'Year'))
FAOarea$Yield = with(FAOarea, Production/Area*10000)

# Subsetting only the countries of interest (and years to match paper)
FAOarticle = subset(FAOarea, Country == 'United States of America' | Country == 'Western Europe')

# Plot with regression lines
ggplot(FAOarticle, aes(x = Year, y = Yield, color = Country)) +
  geom_point() + stat_smooth(method = lm, fullrange = TRUE, alpha = 0.1) +
  scale_y_continuous('Yield [hectograms/ha]', limits = c(0, 100000), labels = comma) +
  theme(legend.position = 'top')


I could obtain pretty much the same regression model equations as in the article by expressing the years as deviations from 1960, as in:

# Expressing year as a deviation from 1960, so results
# match paper
FAOarticle$NewYear = with(FAOarticle, Year - 1960)

usa.lm = lm(Yield ~ NewYear, data = FAOarticle,
            subset = Country == 'United States of America')
summary(usa.lm)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "United States of America")
#
#Residuals:
#     Min       1Q   Median       3Q      Max
#-18435.4  -1958.3    338.3   3663.8  10311.0
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 38677.34    1736.92   22.27
# [rest of the output truncated in the original]

weu.lm = lm(Yield ~ NewYear, data = FAOarticle,
            subset = Country == 'Western Europe')
summary(weu.lm)

#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 31510.14    1665.90   18.91
# [rest of the output truncated in the original]


Heinemann and collaborators then point out the following:

…the slope in yield increase by year is steeper in W. Europe (y = 1344.2x + 31512, R² = 0.92084) than the United States (y = 1173.8x + 38677, R² = 0.89093) from 1961 to 2010 (Figure 1). This shows that in recent years W. Europe has had similar and even slightly higher yields than the United States despite the latter’s use of GM varieties.

However, that interpretation using all the data assumes that both ‘countries’ were using GMO all the time. An interesting thing is that the USA and Western Europe were already on different trends before the introduction of GM corn. We can state that because we have some idea of when GM crops were introduced in the USA. This information is collected by the US Department of Agriculture in their June survey to growers and made publicly available at the State level (GMcornPenetration.csv):

cornPenetration = read.csv('GMcornPenetration.csv')

ggplot(cornPenetration, aes(x = Year, y = PerAllGM)) + geom_line() + facet_wrap(~ State) +
  scale_y_continuous('Percentage of GM corn') +
  theme(axis.text.x = theme_text(angle = 90))


This graph tells us that by the year 2000 the percentage of planted GM corn was way below 50% in most corn-producing states (in fact, it was 25% at the country level). From then on we see a steady increase, reaching over 80% for most states by 2008.
Given this, it probably makes sense to assume that, at the USA level, yield reflects non-GM corn until 1999 and progressively reflects the effect of GM genotypes from 2000 onwards. This division is somewhat arbitrary, but easy to implement. We can repeat the previous analyses limiting the data from 1961 until, say, 1999:

usa.lm2 = lm(Yield ~ NewYear, data = FAOarticle,
             subset = Country == 'United States of America' & Year < 2000)
summary(usa.lm2)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "United States of America" & Year < 2000)
#
#Residuals:
#   Min     1Q Median     3Q    Max
#-17441  -2156   1123   3989   9878
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 39895.57    2084.81   19.14  < 2e-16 ***
#NewYear      1094.82      90.84   12.05 2.25e-14 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 6385 on 37 degrees of freedom
#Multiple R-squared: 0.797,  Adjusted R-squared: 0.7915
#F-statistic: 145.2 on 1 and 37 DF,  p-value: 2.245e-14

weu.lm2 = lm(Yield ~ NewYear, data = FAOarticle,
             subset = Country == 'Western Europe' & Year < 2000)
summary(weu.lm2)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "Western Europe" & Year < 2000)
#
#Residuals:
#   Min     1Q Median     3Q    Max
#-10785  -3348    -34   3504  11117
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 29802.17    1813.79   16.43
# [rest of the output truncated in the original]


These analyses indicate that Western Europe started with a lower yield than the USA (29,802.17 vs 39,895.57 hectograms/ha) and managed to increase yield much more quickly (1,454.48 vs 1,094.82 hectograms/ha per year) before any use of GM corn by the USA.

Figure 1 shows a messy picture because there are numerous factors affecting yield each year (e.g. weather has a large influence). We can take averages for each decade and see how the two ‘countries’ are performing:

# Aggregating every decade.
# 2013-07-05 20:10 NZST I fixed the aggregation because it was averaging yields rather
# than calculating total production and area for the decade and then calculating
# average yield. Discussion points are totally valid
FAOarticle$Decade = cut(FAOarticle$Year,
                        breaks = seq(1959, 2019, 10),
                        labels = paste(seq(1960, 2010, 10), 's', sep = ''))

decadeProd = aggregate(Production ~ Country + Decade, data = FAOarticle, FUN = sum)
decadeArea = aggregate(Area ~ Country + Decade, data = FAOarticle, FUN = sum)
decadeYield = merge(decadeProd, decadeArea, by = c('Country', 'Decade'))
decadeYield$Yield = with(decadeYield, Production/Area*10000)

ggplot(decadeYield, aes(x = Decade, y = Yield, fill = Country)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  scale_y_continuous('Yield [hectograms/ha]', expand = c(0, 0)) +
  theme(legend.position = 'top')


This last figure requires more attention. We can again see that Western Europe starts with lower yields than the USA; however, it keeps on increasing those yields faster than the USA, overtaking it during the 1990s. Again, all this change happened while both the USA and Western Europe were not using GM corn. The situation reverses in the 2000s, when the USA overtakes Western Europe while continuously increasing the percentage of GM corn planted. The last bar in Figure 3 is misleading because it includes a single year (2010), and we know that yields in the USA went down in 2011 and 2012, affected by a very large drought (see Figure 4).
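The pre-2000 versus post-1999 split could also be expressed as a single regression with a dummy variable and its interaction with year (the model hinted at later in P.S.2). A rough sketch, assuming the FAOarticle data frame (with the NewYear column) built in the code above:

```r
# One model instead of two: a dummy for the GM era and its interaction
# with year gives separate intercepts and slopes for each period
FAOusa = subset(FAOarticle, Country == 'United States of America')
FAOusa$GMera = factor(ifelse(FAOusa$Year < 2000, 'pre-2000', 'post-1999'),
                      levels = c('pre-2000', 'post-1999'))

era.lm = lm(Yield ~ NewYear * GMera, data = FAOusa)
summary(era.lm)

# The GMera and NewYear:GMera coefficients measure the shifts in
# intercept and slope after 1999 relative to the earlier period
```

With only a decade of post-1999 data the interaction estimate will be noisy, which is the caveat raised in the postscripts.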

At least when looking at corn, I can’t say (with the same data available to Heinemann) that there is no place or need for GM genotypes. I do share some of his concerns with respect to the low level of diversity present in staple crops but, in contrast to his opinion, I envision a future for agriculture that includes large-scale operations (either GM or non-GM), as well as smaller operations (including organic ones). I’d like to finish on an optimistic note by looking further back at yields, because the USDA National Agricultural Statistics Service has kept corn yield statistics since 1866(!) (csv file), although it uses bizarre non-metric units (bushels/acre). As a metric boy, I converted to kilograms per hectare (multiplying by 62.77, from this page) and then to hectograms (100 g) per hectare by multiplying by 10.

# Reading NASS corn data
# Conversion to sensical units (see Iowa State Extension article)
# http://www.extension.iastate.edu/agdm/wholefarm/html/c6-80.html
NASS$Yield = with(NASS, Value*62.77*10)

# Average by decade
NASS$Decade = cut(NASS$Year,
                  breaks = seq(1859, 2019, 10),
                  labels = paste(seq(1860, 2010, 10), 's', sep = ''))
oldYield = aggregate(Yield ~ Decade, data = NASS, FUN = mean)

# Plotting
ggplot(oldYield, aes(x = Decade, y = Yield)) + geom_bar(stat = 'identity') +
  scale_y_continuous('Yield [hectograms/ha]', expand = c(0, 0))


It is interesting to see that there was little change until the 1940s, with the advent of the Green Revolution (modern breeding techniques, fertilization, pesticides, etc.). The 2010s decade in Figure 4 includes 2010, 2011 and 2012, with the last two years reflecting extensive droughts. Drought tolerance is one of the most important traits in modern breeding programs.

While Prof. Heinemann and I work for the same university, I don’t know him in person.

P.S. Did you know that Norman Borlaug (a hero of mine) studied forestry?

P.S.2 Time permitting, I’ll have a look at other crops later. I would have liked to test a regression with dummy variables for corn to account for pre-2000 and post-1999, but there are not yet enough years to fit a decent model (considering natural variability between years). We’ll have to wait for that one.

P.S.3 I share some of Heinemann’s concerns relating to subsidies and other agricultural practices.

P.S.4 In case anyone is interested, I did write about a GM-fed pigs study not long ago.

P.S.5 2013-07-05 20:10 NZST. I updated Figures 1 and 3 to clearly express that yield was in hectograms/ha, and recalculated average decade yield because it was originally averaging yields rather than calculating total production and area for the decade and then calculating the average yield. The discussion points I raised are still completely valid.

P.S.6 2013-07-07 11:30 NZST. The inclusion of Canadian canola does not make any sense because, as far as I know, Canada is not part of Western Europe or the US Midwest.
This opens the door to including crops from any origin, as long as the results are convenient for one’s argument.

P.S.7 2014-08-06 13:00 NZST My comment was expanded and published in the journal (or here for the HTML version, with comments on the authors’ reply to my comment).

Categories

## GM-fed pigs, chance and how research works

Following my post on GM-fed pigs I received several comments, mostly through Twitter. Some people liked having access to an alternative analysis, while others replied with typical anti-GM slogans, completely ignoring that I was posting about the technical side of the paper. This post is not for the slogan crowd (who clearly are not interested in understanding), but for people who would like to know more about how one would evaluate claims from a scientific article. While I refer to the pig paper, most issues apply to any paper that uses statistics.

In general, researchers want to isolate the effect of the treatments under study (diets in this case) from any other extraneous influence. We want control over the experimental conditions, so we can separate the effects of interest from all other issues that could create differences between our experimental units (pigs in this case). What could create ‘noise’ in our results? Animals could have different genetic backgrounds (for example, different parents), they could be exposed to different environmental conditions, they could be treated differently (more kindly or harshly), etc.

Once we control those conditions as much as possible, we randomly assign animals to each of the treatments (diets). The rationale behind random allocation is easy to follow. Let’s say that we can see that some animals are healthier than others before starting the trial. If I had a pro-GM agenda, I could assign the healthiest animals to the GM diet, and we would not be able to separate the effect of the treatment from the initial difference.
To avoid this, we could have many labels in a hat, shake the hat, and for each pig take a label that randomly assigns the pig to a diet, so the comparison is fair.

Researchers also like to have a baseline of the conditions before the treatments are applied. This way we can ‘adjust’ the results by any pre-existing differences. For example, there could be measurable differences in health, size, etc. I normally work with trees in experiments and we routinely assess the height of the trees just planted, so we can establish a baseline. Finally, we often have a ‘default’ treatment which represents the status quo and acts as a comparison point for the new treatments. In the GM pig case, the default is a non-GM diet and the new treatment is the GM diet.

The paper on GM-fed pigs states that they tried to have as much control as possible over the growing conditions and that they used random allocation to the two feeding treatments. I have no problems with the paper up to this point.

When doing research it is good manners to state one’s expectations before the start of the trial. Doing this provides the experimenter with guidance about how to proceed with the evaluation of the experiment:

1. What are the characteristics under study? Both the response variables (in the past called ‘dependent variables’) and the predictors (old name: ‘independent variables’).
2. What is the magnitude of the differences between groups that we would consider practically significant? Put another way, what size of difference would we care about? For example, if we have two groups of pigs and the difference in weight between them is 1 g (0.035 ounces for people with funny units), who cares? If the difference was 5 kg (11 pounds, funny units again) then, perhaps, we are on to something.
3. What level of statistical significance do we consider appropriate?
Even if we assume that there is truly no difference between the two diets, we would expect to see small differences between the two groups just by chance. Big differences are more unlikely. It is common in research to state that a difference is statistically significant if the probability of observing it is smaller than, say, 0.05 (or 1 in 20). There is nothing sacred about the number; it is just a convention.

By this stage one would expect a researcher to state one or more hypotheses to be tested. For example, ‘I expect that the GM diet will increase [condition here] by [number here] percent’. One can run an experiment ignoring ‘good manners’, but (and this is a big but) an informed reader will become suspicious if suddenly one starts testing hypotheses like there is no tomorrow. Why? Because if one conducts too many tests one is bound to find statistically significant results even when there are none. The comic below presents a brief example with jelly beans, assuming that we claim significance for an event occurring with probability 0.05 (1 in 20) or less. Notice that the scientists use 20 colors of jelly beans and find that green ones ‘cause’ acne. Running so many tests—without adjusting down the probability of the event that we would call significant, so p needs to be much smaller than 0.05 to be significant—results in wrong conclusions.

In the pig paper there are 8 tests in Table 2, 18 (or 15 with testable values) in Table 3, 8 in Table 4 and 17 in Table 5, for a total of 49 (or 46 with testable values). In fact one would expect to find a couple of significant results (at 0.05 or 1 in 20) by chance even if there are absolutely no differences in reality. Add to this that many of the tests are unnecessary, because they are performing the wrong type of analysis. For example, there are four separate analyses for stomach inflammation; however, the analysis ignores the type of variable one is testing, as I point out in a previous post.
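The multiple-testing arithmetic is easy to check. A small sketch in R, using the 49 tests and the 0.05 threshold from the paper (everything else is illustrative):

```r
# With 49 tests and no real differences, what do we expect by chance?
n.tests = 49
alpha = 0.05

# Expected number of 'significant' results under pure chance
n.tests * alpha               # 2.45, i.e. a couple of tests

# Probability of at least one false positive across all the tests
1 - (1 - alpha)^n.tests       # roughly 0.92

# A Bonferroni-style correction shrinks the threshold accordingly,
# so p needs to be much smaller than 0.05 to be declared significant
alpha / n.tests               # roughly 0.001
```

So finding a handful of ‘significant’ differences among 49 unadjusted tests is exactly what chance alone would deliver.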
This is why, if I were Monsanto, I would use the paper as evidence supporting the idea that there is no difference between the two diets: 1. the paper was ‘independent’ (although the authors have a clear anti-GM bias) and 2. when running proper analyses (accounting for the type of variable and the number of tests that one is running) no difference is statistically significant.

P.S. 2013-06-21 15:38 NZST A footnote about data availability: it is becoming increasingly common to make both the data and the code used to analyze them available when publishing (part of Open Science). I would expect a paper making bold claims with potential policy implications to provide those, which does not happen in this case. However, the publication does include part of the data in Tables 3 & 4, in the counts of pigs under different classes of inflammation.

Categories

## Ordinal logistic GM pigs

This week another ‘scary GMO causes disease’ story was doing the rounds on the internet: A long-term toxicology study on pigs fed a combined genetically modified (GM) soy and GM maize diet. Andrew Kniss, a non-smokable weeds expert, mentioned on Twitter that the statistical analyses in the study appeared to be kind of dodgy. Curious, I decided to have a quick look and I was surprised, first, by the points the authors decided to highlight in their results, second, by the pictures and captioning used in the article and, last, by the way the analysis was run.

As I’m in the middle of marking assignments and exams I’ll only have a quick go at part of the analysis. As I see it, the problem can be described as ‘there is a bunch of pigs who were fed either non-GM feed or GM feed. After some time (approximately 23 weeks) they were killed and went through a CSI-like autopsy’, where part of the exam involved the following process: 1. Write down the type of feed the pig had during his/her life; 2.
Assess the condition of the stomach and put it in one of four boxes labeled ‘Nil’, ‘Mild’, ‘Moderate’ and ‘Severe’.

All this data is summarized in Table 3 of the paper (PDF). How would I go about the analysis? As I see it, we have a categorical response variable—which can take one of four mutually exclusive values—and a categorical predictor (diet). In addition, there is a natural order to the inflammation response variable in that Severe > Moderate > Mild > Nil.

Andrew Kniss wrote a post trying to reproduce the published results. Instead, I present the first approach I would try with the data: ordinal logistic regression. Not only that, but instead of using a hippie statistical software like R, I will use industrial-grade-business-like SAS:

/* Testing SAS web editor
fitting pigs data
Luis A. Apiolaza - School of Forestry */

*Reading data set;
data pigs;
input inflam $ diet $ count;
datalines;
Nil Non-GMO 4
Mild Non-GMO 31
Moderate Non-GMO 29
Severe Non-GMO 9
Nil GMO 8
Mild GMO 23
Moderate GMO 18
Severe GMO 23
;
run;

*Showing data set;
proc print data = pigs;
run;

*El quicko analysis;
ods graphics on;
proc logistic data = pigs order = data;
freq count;
class inflam (ref = "Nil") diet (ref = "Non-GMO") / param = ref;
model inflam = diet / link = glogit;
oddsratio diet;
run;  

This produces a simple table with the same data as the paper and some very non-exciting results, which are better summarized in a single graph:

/*
Obs	inflam   diet   count
1	Nil      Non-GMO  4
2	Mild     Non-GMO 31
3	Moderate Non-GMO 29
4	Severe   Non-GMO  9
5	Nil      GMO      8
6	Mild     GMO     23
7	Moderate GMO     18
8	Severe   GMO     23
*/


The odds ratio would be 1 if there were no difference between the treatments. The graph shows that the confidence limits for all levels of inflammation include 1, so move on, nothing to see here. In fact, GMO-fed pigs tend to have less inflammation in most disease categories.
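For completeness, a rough equivalent in the hippie software, using MASS::polr to fit the proportional-odds flavor of ordinal logistic regression (a sketch rather than an exact match to the generalized-logit model in the SAS code above; the counts are the ones from Table 3):

```r
# Proportional-odds ordinal logistic regression for the pig data
library(MASS)

pigs = data.frame(
  inflam = factor(rep(c('Nil', 'Mild', 'Moderate', 'Severe'), 2),
                  levels = c('Nil', 'Mild', 'Moderate', 'Severe'),
                  ordered = TRUE),
  diet = factor(rep(c('Non-GMO', 'GMO'), each = 4),
                levels = c('Non-GMO', 'GMO')),   # Non-GMO as reference
  count = c(4, 31, 29, 9, 8, 23, 18, 23)
)

# weights expands the table to one observation per pig
m.polr = polr(inflam ~ diet, data = pigs, weights = count, Hess = TRUE)
summary(m.polr)

# Odds ratio for the GMO diet and an approximate 95% confidence interval
exp(coef(m.polr))
exp(confint(m.polr))
```

If the interval for the odds ratio includes 1 we reach the same conclusion as with the SAS output: nothing to see.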

P.S. There are many ways of running an analysis for this data set, but I’m in favor of approaches that take the whole problem in one go rather than looking at one class at a time. In an ideal situation we would have a continuous assessment of inflammation and the analysis would be a one-way ANOVA. I understand that for practical reasons one may prefer to split the response into four classes.

P.S.2 2013-06-15 I often act as a reviewer for scientific journals. In the case of this article some of my comments would have included: the analysis does not use the structure of the data properly, the photographs of the damaged organs should include both types of diet for each inflammation class (or at least the most representative diet for the class), and the authors should highlight that there are no significant differences between the two diets for animal health; that is, the trial provides evidence for no difference between feeds. I still feel that the authors should be more forthcoming in terms of disclosing potential conflicts of interest too, but that’s their decision.

P.S.3 2013-07-04 I expand on aspects of the general research process in this post.

Tongue-in-cheek, of course, and with reference to weeds. This blog mostly uses R, but I’m pushing myself to use lots of different software to ‘keep the language’. Now if I could only do this with Spanish.

Categories

## Matrix Algebra Useful for Statistics

I was having a conversation with an acquaintance about courses that were particularly useful in our work. My forestry degree involved completing 50 compulsory + 10 elective courses; if I had to choose the courses that were influential and/or really useful they would be Operations Research, Economic Evaluation of Projects, Ecology, 3 Calculus courses and 2 Algebra courses. Subsequently my PhD was almost entirely research-based, but I sort of did Matrix Algebra: Dorian lent me his copy of Searle’s Matrix Algebra Useful for Statistics and passed me a pile of assignments that Shayle Searle used to give in his course at Cornell. I completed the assignments at my own pace and then sat a crazy 24-hour take-home exam.

Later that year I bought a cloth-bound 1982 version of the book, not the alien-vomit-purple paperback reprint currently on sale, which I consult from time to time. Why would one care about matrix algebra? Besides being a perfectly respectable intellectual endeavor in itself, maybe you can see that the degrees of freedom are the rank of a quadratic form; you can learn from this book what a quadratic form and a matrix rank are. Or you want to see more clearly the relationship between regression and ANOVA, because in matrix form a linear model is a linear model is a linear model. The `outer`, inner (`%*%`) and `kronecker` products make a lot more sense once you know what the outer and inner products of vectors are. Thus, if you really want to understand a matrix language for data analysis and statistics (like R), it seems reasonable to try to understand the building blocks of such a language.
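For the record, those products are one line each in R; a quick illustration:

```r
# Outer product of two vectors: a matrix with every pairwise product
a = 1:3
b = c(10, 20)
outer(a, b)            # 3 x 2 matrix

# Inner (dot) product of two vectors of equal length: a scalar
crossprod(1:3, 4:6)    # 1*4 + 2*5 + 3*6 = 32, same as t(1:3) %*% (4:6)

# Kronecker product: each element of A scales a full copy of the second matrix
A = matrix(1:4, nrow = 2)
kronecker(A, diag(2))  # 4 x 4 block matrix
```

Once the definitions click, expressions like $$X (X'X)^{-1} X'$$ stop looking like hieroglyphics.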

The book does not deal with any applications to statistics until chapter 13. Before that it is all about laying the foundations to understand the applications, but do not expect nice graphs and cute photos. This is a very good text where one has to use one’s brain to imagine what’s going on in the equations and demonstrations. The exercises rely a lot on ‘prove this’ and ‘prove that’, which leads to much frustration and, after persevering, to many ‘aha! moments’.

I am the first to accept that I have a biased opinion about this book, because it has sentimental value. It represents difficult times, dealing with a new language, culture and, on top of that, animal breeding. At the same time, it opened doors to a whole world of ideas. This is much more than I can say of most books.

PS 2012-12-17: I have commented on a few more books in these posts.

A good part of my electives were in humanities (history & literature), which was unusual for forestry. I just couldn’t conceive going through a university without doing humanities.