Archive for the ‘stats’ Category

Man flu kept me at home today, so I decided to do something ‘useful’ and go for a linkathon:

Sometimes people are truthful and cruel. Here Gappy on a mission goes for the jugular:

Over and out.

A few days ago I came across Jack Heinemann and collaborators’ article (Sustainability and innovation in staple crop production in the US Midwest, Open Access) comparing the agricultural sectors of USA and Western Europe. While the article is titled around the word sustainability, the main comparison stems from the use of Genetically Modified crops in USA versus the absence of them in Western Europe.

I was curious about part of the results and discussion which, in a nutshell, suggest that “GM cropping systems have not contributed to yield gains, are not necessary for yield gains, and appear to be eroding yields compared to the equally modern agroecosystem of Western Europe”. The authors relied on several crops for the comparison (maize/corn, rapeseed/canola [see P.S.6], soybean and cotton); however, I am going to focus on a single one (corn) for two reasons: 1. I can’t afford a lot of time for blog posts when I should be preparing lectures and 2. I like eating corn.

When the authors of the paper tackled corn the comparison was between the USA and Western Europe, using the United Nations definition of Western Europe (i.e. Austria, Belgium, France, Germany, Liechtenstein, Luxembourg, Monaco, Netherlands, Switzerland). Some large European corn producers like Italy are not there because of the narrow definition of Western.

I struggled with the comparison used by the authors because, in my opinion, there are potentially so many confounded effects (different industry structures, weather, varieties, etc.) that it can’t provide the proper counterfactual for GM versus non-GM crops. Anyway, I decided to have a look at the same data to see if I would reach the same conclusions. The article provides a good description of where the data came from, as well as how the analyses were performed. The small details needed to match the results exactly were fairly easy to figure out. I downloaded the FAO corn data (3.7 MB csv file) for all countries (so I can reuse the code and data later for lectures and assignments). I then repeated the plots using the following code:

# Default directory
setwd('~/Dropbox/quantumforest')

# Required packages
require(ggplot2)
require(scales) # for the comma label formatter

# Reading FAO corn data
FAOcorn = read.csv('FAOcorn.csv')

# Extracting Area
FAOarea = subset(FAOcorn, Element == 'Area Harvested',
                 select = c('Country', 'Year', 'Value'))

names(FAOarea)[3] = 'Area'

# and production
FAOprod = subset(FAOcorn, Element == 'Production',
                 select = c('Country', 'Year', 'Value'))

names(FAOprod)[3] = 'Production'

# to calculate yield in hectograms
FAOarea = merge(FAOarea, FAOprod, by = c('Country', 'Year'))
FAOarea$Yield = with(FAOarea, Production/Area*10000)

# Subsetting only the countries of interest (and years to match paper)
FAOarticle = subset(FAOarea, Country == 'United States of America' |
                             Country == 'Western Europe')

# Plot with regression lines
ggplot(FAOarticle, aes(x = Year, y = Yield, color = Country)) +
  geom_point() + stat_smooth(method = lm, fullrange = TRUE, alpha = 0.1) +
  scale_y_continuous('Yield [hectograms/ha]', limits = c(0, 100000), labels = comma) +
  theme(legend.position = 'top')

Figure 1. Corn yield per year for USA and Western Europe (click to enlarge).

I could obtain pretty much the same regression model equations as in the article by expressing the years as deviations from 1960, as in:

# Expressing year as a deviation from 1960, so results
# match paper
FAOarticle$NewYear = with(FAOarticle, Year - 1960)

usa.lm = lm(Yield ~ NewYear, data = FAOarticle,
            subset = Country == 'United States of America')
summary(usa.lm)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "United States of America")
#
#Residuals:
#     Min       1Q   Median       3Q      Max
#-18435.4  -1958.3    338.3   3663.8  10311.0
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 38677.34    1736.92   22.27   <2e-16 ***
#NewYear      1173.83      59.28   19.80   <2e-16 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 6049 on 48 degrees of freedom
#Multiple R-squared: 0.8909,  Adjusted R-squared: 0.8887

weu.lm = lm(Yield ~ NewYear, data = FAOarticle,
            subset = Country == 'Western Europe')
summary(weu.lm)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "Western Europe")
#
#Residuals:
#     Min       1Q   Median       3Q      Max
#-14726.6  -3205.8    346.4   4000.6  10289.5
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 31510.14    1665.90   18.91   <2e-16 ***
#NewYear      1344.42      56.86   23.65   <2e-16 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 5802 on 48 degrees of freedom
#Multiple R-squared: 0.9209,  Adjusted R-squared: 0.9193
#F-statistic: 559.1 on 1 and 48 DF,  p-value: < 2.2e-16

Heinemann and collaborators then point out the following:

…the slope in yield increase by year is steeper in W. Europe (y = 1344.2x + 31512, R² = 0.92084) than the United States (y = 1173.8x + 38677, R² = 0.89093) from 1961 to 2010 (Figure 1). This shows that in recent years W. Europe has had similar and even slightly higher yields than the United States despite the latter’s use of GM varieties.

However, that interpretation using all the data assumes that both ‘countries’ were using GM corn all the time. An interesting thing is that the USA and Western Europe were already on different trends before the introduction of GM corn. We can state that because we have some idea of when GM crops were introduced in the USA. This information is collected by the US Department of Agriculture in their June survey to growers and made publicly available at the State level (GMcornPenetration.csv):

cornPenetration = read.csv('GMcornPenetration.csv')

ggplot(cornPenetration, aes(x = Year, y = PerAllGM)) + geom_line() + facet_wrap(~ State) +
  scale_y_continuous('Percentage of GM corn') +
  theme(axis.text.x = theme_text(angle = 90))

Figure 2. GM corn percentage by state in the USA (click to enlarge).

This graph tells us that by the year 2000 the percentage of GM corn planted was way below 50% in most corn-producing states (in fact, it was 25% at the country level). From then on there was a steady increase, reaching over 80% in most states by 2008. Given this, it probably makes sense to assume that, at the USA level, yield reflects non-GM corn until 1999 and progressively reflects the effect of GM genotypes from 2000 onwards. This division is somewhat arbitrary, but easy to implement.
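A rough sketch of one way to encode that split (the Period variable and the commented-out interaction model are my own additions, just for illustration):

# Hypothetical period indicator (the variable name is mine)
FAOarticle$Period = factor(ifelse(FAOarticle$Year < 2000, 'pre-2000', '2000 onwards'),
                           levels = c('pre-2000', '2000 onwards'))

# With more post-1999 years one could fit a single model with interactions,
# getting separate slopes by country and period in one go:
# lm(Yield ~ NewYear*Country*Period, data = FAOarticle)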

We can repeat the previous analyses limiting the data to 1961 through, say, 1999:

usa.lm2 = lm(Yield ~ NewYear, data = FAOarticle,
             subset = Country == 'United States of America' & Year < 2000)
summary(usa.lm2)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "United States of America" & Year < 2000)
#
#Residuals:
#   Min     1Q Median     3Q    Max
#-17441  -2156   1123   3989   9878
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 39895.57    2084.81   19.14  < 2e-16 ***
#NewYear      1094.82      90.84   12.05 2.25e-14 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 6385 on 37 degrees of freedom
#Multiple R-squared: 0.797,   Adjusted R-squared: 0.7915
#F-statistic: 145.2 on 1 and 37 DF,  p-value: 2.245e-14

weu.lm2 = lm(Yield ~ NewYear, data = FAOarticle,
             subset = Country == 'Western Europe' & Year < 2000)
summary(weu.lm2)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "Western Europe" & Year < 2000)
#
#Residuals:
#   Min     1Q Median     3Q    Max
#-10785  -3348    -34   3504  11117
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 29802.17    1813.79   16.43   <2e-16 ***
#NewYear      1454.48      79.03   18.40   <2e-16 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 5555 on 37 degrees of freedom
#Multiple R-squared: 0.9015,  Adjusted R-squared: 0.8988
#F-statistic: 338.7 on 1 and 37 DF,  p-value: < 2.2e-16

These analyses indicate that Western Europe started with a lower yield than the USA (29,802.17 vs 39,895.57 hectograms/ha) and managed to increase yield much more quickly (1,454.48 vs 1,094.82 hectograms/ha per year) before any use of GM corn by the USA. Figure 1 shows a messy picture because there are numerous factors affecting yield each year (e.g. weather has a large influence). We can take averages for each decade and see how the two ‘countries’ are performing:

# Aggregating every decade.
# 2013-07-05 20:10 NZST I fixed the aggregation because it was averaging yields rather
# than calculating total production and area for the decade and then calculating average yield
# Discussion points are totally valid
FAOarticle$Decade = cut(FAOarticle$Year,
                        breaks = seq(1959, 2019, 10),
                        labels = paste(seq(1960, 2010, 10), 's', sep = ''))

decadeProd = aggregate(Production ~ Country + Decade, data = FAOarticle, FUN = sum)
decadeArea = aggregate(Area ~ Country + Decade, data = FAOarticle, FUN = sum)

decadeYield = merge(decadeProd, decadeArea, by = c('Country', 'Decade'))
decadeYield$Yield = with(decadeYield, Production/Area*10000)

ggplot(decadeYield, aes(x = Decade, y = Yield, fill = Country)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  scale_y_continuous('Yield [hectograms/ha]', expand = c(0, 0)) +
  theme(legend.position = 'top')

Figure 3. Corn yield by decade (click to enlarge).

This last figure requires more attention. We can again see that Western Europe starts with lower yields than the USA; however, it keeps increasing those yields faster than the USA, overtaking it during the 1990s. Again, all this change happened while neither the USA nor Western Europe was using GM corn. The situation reverses in the 2000s, when the USA overtakes Western Europe, while continuously increasing its percentage of GM corn. The last bar in Figure 3 is misleading because it includes a single year (2010), and we know that yields in the USA went down in 2011 and 2012, affected by a very large drought (see Figure 4).

At least when looking at corn, I can’t say (with the same data available to Heinemann) that there is no place or need for GM genotypes. I do share some of his concerns with respect to the low level of diversity present in staple crops but, in contrast to his opinion, I envision a future for agriculture that includes large-scale operations (either GM or non-GM), as well as smaller operations (including organic ones).

I’d like to finish with some optimism by looking further back at yield, because the USDA National Agricultural Statistics Service has kept yield statistics for corn since 1866(!) (csv file), although it uses bizarre non-metric units (bushels/acre). As a metric boy, I converted to kilograms per hectare (multiplying by 62.77, from this page) and then to hectograms per hectare (multiplying by 10).

# Reading NASS corn data
NASS = read.csv('NASScorn.csv')

# Conversion to sensical units (see Iowa State Extension article)
# http://www.extension.iastate.edu/agdm/wholefarm/html/c6-80.html
NASS$Yield = with(NASS, Value*62.77*10)

# Average by decade
NASS$Decade = cut(NASS$Year,
                  breaks = seq(1859, 2019, 10),
                  labels = paste(seq(1860, 2010, 10), 's', sep = ''))

oldYield = aggregate(Yield ~ Decade, data = NASS, FUN = mean)

# Plotting
ggplot(oldYield, aes(x = Decade, y = Yield)) +
  geom_bar(stat = 'identity') +
  scale_y_continuous('Yield [hectograms/ha]', expand = c(0, 0))

Figure 4. Historic average yield per decade for USA (click to enlarge).

It is interesting to see that there was little change until the 1940s, with the advent of the Green Revolution (modern breeding techniques, fertilization, pesticides, etc.). The 2010s decade in Figure 4 includes 2010, 2011 and 2012, with the last two years reflecting extensive droughts. Drought tolerance is one of the most important traits in modern breeding programs.

Drought’s Footprint map produced by The New York Times (click on graph to view larger version in the NYT). This can help in understanding the decade patterns in the previous figures.

Although Prof. Heinemann and I work for the same university, I don’t know him in person.

P.S. Did you know that Norman Borlaug (hero of mine) studied forestry?
P.S.2 Time permitting, I’ll have a look at other crops later. I would have liked to fit a regression with dummy variables for corn to separate the pre-2000 and post-1999 periods, but there are not yet enough post-1999 years to fit a decent model (considering the natural variability between years). We’ll have to wait for that one.
P.S.3 I share some of Heinemann’s concerns relating to subsidies and other agricultural practices.
P.S.4 In case anyone is interested, I did write about a GM-fed pigs study not long ago.
P.S.5 2013-07-05 20:10 NZST. I updated Figures 1 and 3 to clearly express that yield was in hectograms/ha and recalculated average decade yield, because it was originally averaging yields rather than calculating total production and area for the decade and then calculating average yield. The discussion points I raised are still completely valid.
P.S.6 2013-07-07 11:30 NZST. The inclusion of Canadian canola does not make any sense because, as far as I know, Canada is not part of Western Europe or the US Midwest. This opens the door to including crops from any origin, as long as the results are convenient for one’s argument.

Following my post on GM-fed pigs I received several comments, mostly through Twitter. Some people liked having access to an alternative analysis, while others replied with typical anti-GM slogans, completely ignoring that I was posting about the technical side of the paper. This post is not for the slogan crowd (who clearly are not interested in understanding), but for people that would like to know more about how one would evaluate claims from a scientific article. While I refer to the pig paper, most issues apply to any paper that uses statistics.

In general, researchers want to isolate the effect of the treatments under study (diets in this case) from any other extraneous influence. We want control over the experimental conditions, so we can separate the effects of interest from all other issues that could create differences between our experimental units (pigs in this case). What could create ‘noise’ in our results? Animals could have different genetic backgrounds (for example with different parents), they could be exposed to different environmental conditions, they could be treated differently (more kindly or harshly), etc.

Once we control those conditions as much as possible we would randomly assign animals to each of the treatments (diets). The rationale behind random allocation is easy to follow. Let’s say that we can see that some animals are healthier than others before starting the trial. If I had a pro-GM agenda, I could assign the healthiest animals to the GM-diet and we would not be able to separate the effect of the treatment from the initial difference. To avoid this, we could have many labels in a hat, shake the hat, and for each pig take a label that randomly assigns the pig to a diet so the comparison is fair.
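As an aside, the hat-shaking exercise is essentially a one-liner in R; here is a toy sketch with made-up pig IDs and group sizes:

# Toy example: randomly allocate 20 pigs to two diets
set.seed(2013)   # only so the example is repeatable
pigs = data.frame(id = 1:20)
pigs$diet = sample(rep(c('GM', 'non-GM'), each = 10))
table(pigs$diet)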

Researchers also like to have a baseline of the conditions before the treatments are applied. This way we can ‘adjust’ the results by any pre-existing differences. For example, there could be measurable differences on health, size, etc. I normally work with trees in experiments and we routinely assess the height of the trees just planted, so we can establish a baseline.

Finally, we often have a ‘default’ treatment which represents the status quo and acts as a comparison point for the new treatments. In the GM pig case, the default is a non-GM diet and the new treatment is the GM-diet.

The paper on GM-fed pigs states that the researchers tried to have as much control as possible over the growing conditions and that they used random allocation to the two feeding treatments. I have no problems with the paper up to this point.

When doing research it is good manners to state one’s expectations before the start of the trial. Doing this provides the experimenter with guidance about how to proceed with the evaluation of the experiment:

1. What are the characteristics under study? Both the response variables (in the past called ‘dependent variables’) and the predictors (old name ‘independent variables’).
2. What is the magnitude of the differences between groups that we would consider practically significant? Put another way, what size of difference would one care about? For example, if we have two groups of pigs and the difference in weight between them is 1 g (0.035 ounces for people with funny units), who cares? If the difference were 5 kg (11 pounds, funny units again) then, perhaps, we would be on to something.
3. What level of statistical significance do we consider appropriate? Even if we assume that there is truly no difference between the two diets, we would expect to see small differences between the two groups just by chance. Large differences are less likely. It is common in research to state that a difference is statistically significant if the probability of observing the difference is smaller than, say, 0.05 (or 1 in 20). There is nothing sacred about that number; it is just a convention.

By this stage one would expect a researcher to state one or more hypotheses to be tested. For example, ‘I expect that the GM diet will increase [condition here] by [number here] percent’. One can run an experiment ignoring ‘good manners’, but (and this is a big but) an informed reader will become suspicious if suddenly one starts testing hypotheses like there is no tomorrow. Why? Because if one conducts too many tests one is bound to find statistically significant results even when there are none.

The comic below presents a brief example with jelly beans assuming that we claim significance for an event occurring with probability 0.05 (1 in 20) or less. Notice that the scientists use 20 colors of jelly beans and find that green ones ’cause’ acne. Running so many tests—without adjusting down the probability of the event that we would call significant, so p needs to be much smaller than 0.05 to be significant—results in wrong conclusions.

XKCD explaining significance (click for original, larger version).

In the pig paper there are 8 tests in Table 2, 18 (or 15 with some value) in Table 3, 8 in Table 4 and 17 in Table 5 for a total of 49 (or 46 with some testable values). In fact one would expect to find a couple of significant results (at 0.05 or 1 in 20) by chance even if there are absolutely no differences in reality.
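As a back-of-the-envelope calculation (mine, not in the paper, and assuming independent tests for simplicity):

# Expected number of 'significant' results out of 49 tests,
# even if there were no real differences at all
49 * 0.05          # roughly 2.5 tests

# Probability of at least one false positive among 49 tests
1 - (1 - 0.05)^49  # roughly 0.92

# A blunt correction is Bonferroni: each test would need p below about 0.001
0.05/49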

Add to this that many of the tests are unnecessary, because they are performing the wrong type of analysis. For example, there are four separate analyses for stomach inflammation; however, the analysis ignores the type of variable one is testing, as I pointed out in a previous post.

This is why, if I were Monsanto, I would use the paper as evidence supporting the idea that there is no difference between the two diets:

1. the paper was ‘independent’ (although the authors have a clear anti-GM bias) and
2. when running proper analyses (accounting for the type of variable and the number of tests that one is running) no difference is statistically significant.

P.S. 2013-06-21. 15:38 NZST A footnote about data availability: it is becoming increasingly common to make both the data and the code used to analyze the data available when publishing (part of Open Science). I would expect that a paper making bold claims that could have policy implications would provide those, which does not happen in this case. However, the publication does include part of the data in results Tables 3 & 4, in the counts of pigs under different classes of inflammation.

This week another ‘scary GMO cause disease’ story was doing the rounds on the internet: A long-term toxicology study on pigs fed a combined genetically modified (GM) soy and GM maize diet. Andrew Kniss, a non-smokable weeds expert, mentioned on Twitter that the statistical analyses in the study appeared to be kind of dodgy.

Curious, I decided to have a quick look and I was surprised, first, by the points the authors decided to highlight in their results, second, by the pictures and captioning used in the article and, last, by the way the analysis was run. As I’m in the middle of marking assignments and exams I’ll only have a quick go at part of the analysis. As I see it, the problem can be described as ‘there is a bunch of pigs that were fed either non-GM feed or GM feed; after some time (approximately 23 weeks) they were killed and went through a CSI-like autopsy’, where part of the exam involved the following process:

1. Write down the type of feed the pig had during his/her life;
2. Assess the condition of the stomach and put it in one of four boxes labeled ‘Nil’, ‘Mild’, ‘Moderate’ and ‘Severe’.

All this data is summarized in Table 3 of the paper (PDF). How would I go about the analysis? As I see it, we have a categorical response variable—which can take one of four mutually exclusive values—and a categorical predictor (diet). In addition, there is a natural order to the inflammation response variable in that Severe > Moderate > Mild > Nil.

Andrew Kniss wrote a post trying to reproduce the published results. Instead, I present the first approach I would try with the data: ordinal logistic regression. Not only that, but instead of using a hippie statistical software like R, I will use industrial-grade-business-like SAS:

/* Testing SAS web editor fitting pigs data
   Luis A. Apiolaza - School of Forestry */

*Reading data set;
data pigs;
  input inflam $ diet $ count;
  datalines;
Nil Non-GMO 4
Mild Non-GMO 31
Moderate Non-GMO 29
Severe Non-GMO 9
Nil GMO 8
Mild GMO 23
Moderate GMO 18
Severe GMO 23
;
run;

*Showing data set;
proc print data = pigs;
run;

*El quicko analysis;
ods graphics on;
proc logistic data = pigs order = data;
  freq count;
  class inflam (ref = "Nil") diet (ref = "Non-GMO") / param = ref;
  model inflam = diet / link = glogit;
  oddsratio diet;
run;

This produces a simple table with the same data as the paper and some very non-exciting results, which are better summarized in a single graph:

/*
Obs  inflam    diet     count
 1   Nil       Non-GMO    4
 2   Mild      Non-GMO   31
 3   Moderate  Non-GMO   29
 4   Severe    Non-GMO    9
 5   Nil       GMO        8
 6   Mild      GMO       23
 7   Moderate  GMO       18
 8   Severe    GMO       23
*/

Odds ratios for the different levels of stomach inflammation.

The odds ratios would be 1 if there were no difference between the treatments. The graph shows that the confidence limits for all levels of inflammation include 1, so move on, nothing to see. In fact, GMO-fed pigs tend to have less inflammation for most disease categories.
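For readers who would rather stay in R, a minimal sketch of the ordinal (proportional-odds) logistic regression mentioned above could look like the following, using MASS::polr with the Table 3 counts as weights. This is a sketch of the ordinal approach, not a replica of the generalized logit model fitted in SAS, so the estimates will not match exactly:

library(MASS)   # for polr()

# Counts from Table 3 of the paper
pigs = data.frame(inflam = ordered(rep(c('Nil', 'Mild', 'Moderate', 'Severe'), 2),
                                   levels = c('Nil', 'Mild', 'Moderate', 'Severe')),
                  diet = relevel(factor(rep(c('Non-GMO', 'GMO'), each = 4)), ref = 'Non-GMO'),
                  count = c(4, 31, 29, 9, 8, 23, 18, 23))

# Proportional-odds model, weighting each row by its count
inflam.polr = polr(inflam ~ diet, weights = count, data = pigs, Hess = TRUE)
summary(inflam.polr)
confint(inflam.polr)   # does the interval for the diet effect include zero?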

P.S. There are many ways of running an analysis for this data set, but I’m in favor of approaches that take the whole problem in one go rather than looking at one class at a time. In an ideal situation we would have a continuous assessment of inflammation and the analysis would be a one-way ANOVA. I understand that for practical reasons one may prefer to split the response into four classes.

P.S.2 2013-06-15 I often act as a reviewer for scientific journals. In the case of this article some of my comments would have included: the analysis does not use the structure of the data properly, the photographs of the damaged organs should include both types of diet for each inflammation class (or at least the most representative diet for the class), and the authors should highlight that there are no significant differences between the two diets for animal health; that is, the trial provides evidence of no difference between feeds. I still feel that the authors should be more forthcoming in terms of disclosing potential conflicts of interest too, but that’s their decision.

P.S.3 2013-07-04 I expand on aspects of the general research process in this post.

Tongue-in-cheek, of course, and with reference to weeds. This blog mostly uses R, but I’m pushing myself to use lots of different software to ‘keep the language’. Now if I could only do this with Spanish.

I was having a conversation with an acquaintance about courses that were particularly useful in our work. My forestry degree involved completing 50 compulsory + 10 elective courses; if I had to choose courses that were influential and/or really useful they would be Operations Research, Economic Evaluation of Projects, Ecology, 3 Calculus and 2 Algebras. Subsequently my PhD was almost entirely research based, but I sort of did Matrix Algebra: Dorian lent me his copy of Searle’s Matrix Algebra Useful for Statistics and passed me a pile of assignments that Shayle Searle used to give in his course at Cornell. I completed the assignments at my own pace and then sat a crazy 24-hour take-home exam.

Later that year I bought a cloth-bound 1982 version of the book, not the alien-vomit-purple paperback reprint currently on sale, which I consult from time to time. Why would one care about matrix algebra? Besides being a perfectly respectable intellectual endeavor in itself, maybe you can see that the degrees of freedom are the rank of a quadratic form; you can learn from this book what a quadratic form and a matrix rank are. Or you want to see more clearly the relationship between regression and ANOVA, because in matrix form a linear model is a linear model is a linear model. The R commands for outer, inner and Kronecker products make a lot more sense once you know what the outer and inner products of vectors are. Thus, if you really want to understand a matrix language for data analysis and statistics (like R), it seems reasonable to try to understand the building blocks for such a language.
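As a quick, throwaway illustration of those operations in R (my own example, not from the book):

a = c(1, 2, 3)
b = c(4, 5)

outer(a, b)       # 3 x 2 matrix with every product a[i] * b[j]
crossprod(a, a)   # inner product t(a) %*% a, returned as a 1 x 1 matrix
kronecker(diag(2), matrix(1, 2, 2))   # 4 x 4 block-structured matrix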

The book does not deal with any applications to statistics until chapter 13. Before that it is all about laying foundations to understand the applications, but do not expect nice graphs and cute photos. This is a very good text where one has to use the brain to imagine what’s going on in the equations and demonstrations. The exercises rely a lot on ‘prove this’ and ‘prove that’, which leads to much frustration and, after persevering, to many ‘aha!’ moments.

XKCD 1050: In your face! Actually I feel the opposite concerning math.

I am the first to accept that I have a biased opinion about this book, because it has sentimental value. It represents difficult times, dealing with a new language, culture and, on top of that, animal breeding. At the same time, it opened doors to a whole world of ideas. This is much more than I can say of most books.

PS 2012-12-17: I have commented on a few more books in these posts.

A good part of my electives were in humanities (history & literature), which was unusual for forestry. I just couldn’t conceive going through a university without doing humanities.

This week I’ve tried to (i) stay mostly in the descriptive statistics realm and (ii) surround any simple(istic) models with caveats, pointing out that they are very preliminary. We are working with a sample of ~1,000 schools that did reply to Fairfax’s request, while a number of schools either ignored the request or told Fairfax to go and F themselves. Why am I saying this? If one goes and gets a simple table of the number of schools by type and decile there is something quite interesting: we have different percentages for different types of schools represented in the sample, and the possibility of bias in the reporting to Fairfax due to potential low performance (references to datasets correspond to the ones I used in this post):

summary(standards$school.type)
#          Composite (Year 1-10)           Composite (Year 1-15)         Contributing (Year 1-6)
#                              1                              29                             403
#        Full Primary (Year 1-8)     Intermediate (year 7 and 8)  Restricted Composite (Yr 7-10)
#                            458                              62                               1
#          Secondary (Year 7-15)
#                             56

Now let’s compare this with the school directory:

summary(factor(directory$school.type))
#          Composite (Year 1-10)           Composite (Year 1-15)         Contributing (Year 1-6)
#                              4                             149                             775
#          Correspondence School         Full Primary (Year 1-8)     Intermediate (year 7 and 8)
#                              1                            1101                             122
# Restricted Composite (Yr 7-10)          Secondary (Year 11-15)           Secondary (Year 7-10)
#                              4                               2                               2
#          Secondary (Year 7-15)           Secondary (Year 9-15)                  Special School
#                            100                             238                              39
#               Teen Parent Unit
#                             20

As a proportion we are missing more secondary schools. We can use the following code to get an idea of how similar school types are, because the small number of schools in the different composite categories is a pain.

# Performance of Contributing (Year 1-6) and
# Full Primary (Year 1-8) looks pretty much the
# same. Composites could be safely merged
qplot(school.type, reading.OK, data = standards, geom = 'jitter')

qplot(school.type, writing.OK, data = standards, geom = 'jitter')

qplot(school.type, math.OK, data = standards, geom = 'jitter')

# Merging school types and plotting them colored
# by decile
standards$school.type.4 <- standards$school.type
levels(standards$school.type.4) <- c('Composite', 'Composite', 'Primary',
                                     'Primary', 'Intermediate',
                                     'Composite', 'Secondary')

qplot(school.type.4, reading.OK, colour = decile, data = standards, geom = 'jitter')

Representation of different school types and deciles is uneven; participation in the sample differs across school types. This plot shows performance in mathematics.

I’m using jittering rather than box and whisker plots to (i) depict all the schools and (ii) get an idea of the different participation of school types in the dataset. Sigh. Another caveat to add to the discussion.

P.S. 2012-09-27 16:15. Originally I mentioned in this post the lack of secondary schools (Year 9-15) but, well, they are not supposed to be here, because National Standards apply to years 1 to 8. (Thanks to Michael MacAskill for pointing out my error.)

I like the idea of having data on school performance, not to directly rank schools—hard, to say the least, at this stage—but because we can start having a look at the factors influencing test results. I imagine the opportunity in the not so distant future to run hierarchical models combining Ministry of Education data with Census/Statistics New Zealand data.

At the same time, there is the temptation to come up with very simple analyses that would make appealing newspaper headlines. I’ll read the data and create a headline, and then I’ll move to something that, personally, seems more important. In my previous post I combined the national standards for around 1,000 schools with decile information to create the standards.csv file.

library(ggplot2)

# Reading data, building factors
standards <- read.csv('standards.csv')
standards$decile.2008 <- factor(standards$decile.2008)
standards$electorate <- factor(standards$electorate)
standards$school.type <- factor(standards$school.type)

# Removing special schools (all 8 of them) because their
# performance is not representative of the overall school
# system
standards <- subset(standards, as.character(school.type) != 'Special School')
standards$school.type <- standards$school.type[drop = TRUE]

# Create performance variables. Proportion of the students that
# at least meet the standards (i.e. meet + above)
# Only using all students, because subsets are not reported
# by many schools
standards$math.OK <- with(standards, math.At + math.A)
standards$reading.OK <- with(standards, reading.At + reading.A)
standards$writing.OK <- with(standards, writing.At + writing.A)

Up to this point we have read the data, removed special schools and created variables that represent the proportion of students that at least meet the standards. Now I’ll do something very misleading: calculate the average of reading.OK for each school decile and plot it, showing the direct relationship between socio-economic decile and school performance.

mislead <- aggregate(reading.OK ~ decile.2008, data = standards, mean)
qplot(decile.2008, reading.OK, data = mislead)

Scatterplot of average proportion of students at least meeting the reading national standards for each socioeconomic decile.

Using only the average strengthens the relationship, but it hides something extremely important: the variability within each decile. The following code will display box and whisker plots for each decile: the horizontal line is the median, the box contains the middle 50% of the schools, and the vertical lines extend to 1.5 times the interquartile range (pretty much the full range in most cases):

qplot(decile.2008, reading.OK, data = standards, geom = 'boxplot', xlab = 'Decile in 2008', ylab = 'Reading at or above National Standard')

Box and whisker plot for proportion of students at least meeting the reading national standard for each decile. Notice the variability for each decile.

A quick look at the previous graph shows a few interesting things:

• the lower the decile, the more variable school performance is;
• there is pretty much no difference between deciles 6, 7, 8 and 9, and a minor increase for decile 10; and
• there is a trend for decreasing performance for lower deciles; however, there is also huge variability within those deciles.

We can repeat the same process for writing.OK and math.OK with similar trends, although the level of standard achievement is lower than for reading:

qplot(decile.2008, writing.OK, data = standards, geom = 'boxplot',
      xlab = 'Decile in 2008', ylab = 'Writing at or above National Standard')

qplot(decile.2008, math.OK, data = standards, geom = 'boxplot',
      xlab = 'Decile in 2008', ylab = 'Mathematics at or above National Standard')

Box and whisker plot for proportion of students at least meeting the writing national standard for each decile.

Box and whisker plot for proportion of students at least meeting the mathematics national standard for each decile.

Achievement in different areas (reading, writing, mathematics) is highly correlated:

cor(standards[, c('reading.OK', 'writing.OK', 'math.OK')],
    use = 'pairwise.complete.obs')

#           reading.OK writing.OK   math.OK
#reading.OK  1.0000000  0.7886292 0.7749094
#writing.OK  0.7886292  1.0000000 0.7522446
#math.OK     0.7749094  0.7522446 1.0000000

# We can plot an example and highlight decile groups
# 1-5 vs 6-10
qplot(reading.OK, math.OK, data = standards,
      color = ifelse(as.numeric(decile.2008) > 5, 1, 0)) +
  opts(legend.position = 'none')

Scatterplot for proportion meeting mathematics and reading national standards. Dark points are deciles 1 to 5, while light points are deciles 6 to 10. Notice the large overlap for performance between the two groups.

All these graphs are only descriptive/exploratory; however, once we had more data we could move to hierarchical models to start adjusting performance by socioeconomic aspects, geographic location, type of school, urban or rural setting, school size, ethnicity, etc. Potentially, this would let us target resources on schools that could be struggling to perform; nevertheless, any attempt at creating a ‘quick & dirty’ ranking ignoring the previously mentioned covariates would be, at the very least, misleading.

Note 2012-09-26: I have updated data and added a few plots in this post.

Reached mid-semester point, with quite a few new lectures to prepare. Nothing extremely complicated but, as always, the tricky part is finding a way to make it meaningful and memorable. Sometimes, and this is one of those times, I sound like a broken record but I’m a bit obsessive about helping people to ‘get’ a topic.

Gratuitous picture: Lola, Lisbon, Portugal (Photo: Luis).

I like statistics and I struggle with statistics. Oftentimes I get frustrated when I don’t understand, and I really struggled to make sense of Kruschke’s Bayesian analysis of a split-plot, particularly because ‘it didn’t look like’ a split-plot to me.

Additionally, I have written a few posts discussing linear mixed models, using several different packages to fit them. At no point have I shown the calculations going on behind the scenes. So, I decided to combine my frustration and an explanation to myself in a couple of posts. This is number one and, eventually, there will be a follow-up.

Example of forestry split-plot: one of my colleagues has a trial in which stocking (number of trees per ha) is the main plot and fertilization is the subplot (higher stockings look darker because trees are closer together).

What do linear mixed models look like?

Linear mixed models, models that combine so-called fixed and random effects, are often represented using matrix notation as:

$$\boldsymbol{y} = \boldsymbol{X b} + \boldsymbol{Z a} + \boldsymbol{e}$$

where $$\boldsymbol{y}$$ represents the response variable vector, $$\boldsymbol{X} \mbox{ and } \boldsymbol{Z}$$ are incidence matrices relating the response to the fixed effects ($$\boldsymbol{b}$$, e.g. overall mean, treatment effects, etc.) and the random effects ($$\boldsymbol{a}$$, e.g. family effects, additive genetic effects and a host of experimental design effects like blocks, rows, columns, plots, etc.), and $$\boldsymbol{e}$$ is the vector of random residuals.

The previous model equation still requires some assumptions about the distribution of the random effects. A very common set of assumptions is to say that the residuals are iid (identically and independently distributed) normal (so $$\boldsymbol{R} = \sigma^2_e \boldsymbol{I}$$) and that the random effects are independent of each other, so $$\boldsymbol{G} = \boldsymbol{B} \bigoplus \boldsymbol{M}$$ is a direct sum of the variance of blocks ($$\boldsymbol{B} = \sigma^2_B \boldsymbol{I}$$) and main plots ($$\boldsymbol{M} = \sigma^2_M \boldsymbol{I}$$).

An interesting feature of this matrix formulation is that we can express all sorts of models by choosing different assumptions for our covariance matrices (using different covariance structures). Do you have longitudinal data (units assessed repeated times)? Is there spatial correlation? Account for this in the $$\boldsymbol{R}$$ matrix. Random effects are correlated (e.g. maternal and additive genetic effects in genetics)? Account for this in the $$\boldsymbol{G}$$ matrix. Multivariate response? Deal with unstructured $$\boldsymbol{R}$$ and $$\boldsymbol{G}$$, or model the correlation structure using different constraints (and then you’ll need asreml).
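As a toy sketch of that direct sum for $$\boldsymbol{G}$$ (the variances are made up for illustration; the real estimates appear later in this post):

# Made-up variance components, 3 blocks and 6 main plots just for show
sigma2.B = 2.5
sigma2.M = 1.2

# The direct sum is a block-diagonal matrix: a scaled identity for blocks
# followed by a scaled identity for main plots
G = diag(c(rep(sigma2.B, 3), rep(sigma2.M, 6)))
G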

By the way, the history of linear mixed models is strongly related to applications of quantitative genetics for the prediction of breeding values, particularly in dairy cattle. Charles Henderson developed what is now called Henderson’s Mixed Model Equations to simultaneously estimate fixed effects and predict random genetic effects:

$$\left [ \begin{array}{cc} \boldsymbol{X}' \boldsymbol{R}^{-1} \boldsymbol{X} & \boldsymbol{X}' \boldsymbol{R}^{-1} \boldsymbol{Z} \\ \boldsymbol{Z}' \boldsymbol{R}^{-1} \boldsymbol{X} & \boldsymbol{Z}' \boldsymbol{R}^{-1} \boldsymbol{Z} + \boldsymbol{G}^{-1} \end{array} \right ] \left [ \begin{array}{c} \boldsymbol{b} \\ \boldsymbol{a} \end{array} \right ] = \left [ \begin{array}{c} \boldsymbol{X}' \boldsymbol{R}^{-1} \boldsymbol{y} \\ \boldsymbol{Z}' \boldsymbol{R}^{-1} \boldsymbol{y} \end{array} \right ]$$

The big matrix on the left-hand side of this equation is often called the $$\boldsymbol{C}$$ matrix. You could be thinking ‘What does this all mean?’ It is easier to see what is going on with a small example, but rather than starting with, say, a complete block design, we’ll go for a split-plot to start tackling my annoyance with the aforementioned blog post.

Old school: physical split-plots

Given that I’m an unsophisticated forester and that I’d like to use data available to anyone, I’ll rely on an agricultural example (so plots are actually physical plots in the field) that goes back to Frank Yates. There are two factors (oats variety, with three levels, and fertilization, with four levels). Yates, F. (1935) Complex experiments, Journal of the Royal Statistical Society Suppl. 2, 181–247 (behind pay wall here).

Layout of oats experiment in Yates’s paper, from a time when articles were meaty. Each of the 6 replicates is divided into 3 main plots for oats varieties (v1, v2 and v3), while each variety is divided into four parts with different levels of fertilization (manure—animal crap—n1 to n4). Cells display yield.

Now it is time to roll up our sleeves and use some code, getting the data and fitting the same model using nlme (m1) and asreml (m2), just for the fun of it. Anyway, nlme and asreml produce exactly the same results.

We will use the oats data set that comes with MASS, although there is also an Oats data set in nlme and another version in the asreml package. (By the way, you can see a very good explanation by Bill Venables of a ‘traditional’ ANOVA analysis for a split-plot here):

require(MASS)   # we get the oats data from here
require(nlme)   # for lme function
require(asreml) # for asreml function. This dataset uses different variable names,
                # which may require renaming a dataset to use the code below

# Get the oats data set and check structure
data(oats)
head(oats)
str(oats)

# Create a main plot effect for clarity's sake
oats$MP = oats$V

# Split-plot using NLME
m1 = lme(Y ~ V*N, random = ~ 1|B/MP, data = oats)
summary(m1)
fixef(m1)
ranef(m1)

# Split-plot using ASReml
m2 = asreml(Y ~ V*N, random = ~ B/MP, data = oats)
summary(m2)$varcomp
coef(m2)$fixed
coef(m2)$random

fixef(m1)
#        (Intercept)         VMarvellous            VVictory             N0.2cwt
#         80.0000000           6.6666667          -8.5000000          18.5000000
#            N0.4cwt             N0.6cwt VMarvellous:N0.2cwt    VVictory:N0.2cwt
#         34.6666667          44.8333333           3.3333333          -0.3333333
#VMarvellous:N0.4cwt    VVictory:N0.4cwt VMarvellous:N0.6cwt    VVictory:N0.6cwt
#         -4.1666667           4.6666667          -4.6666667           2.1666667

ranef(m1)
# Level: B
#     (Intercept)
# I     25.421511
# II     2.656987
# III   -6.529883
# IV    -4.706019
# V    -10.582914
# VI    -6.259681
#
# Level: MP %in% B
#                 (Intercept)
# I/Golden.rain      2.348296
# I/Marvellous      -3.854348
# I/Victory         14.077467
# II/Golden.rain     4.298706
# II/Marvellous      6.209473
# II/Victory        -9.194250
# III/Golden.rain   -7.915950
# III/Marvellous    10.750776
# III/Victory       -6.063976
# IV/Golden.rain     5.789462
# IV/Marvellous     -7.115566
# IV/Victory        -1.001111
# V/Golden.rain      1.116768
# V/Marvellous      -9.848096
# V/Victory          3.497878
# VI/Golden.rain    -5.637282
# VI/Marvellous      3.857761
# VI/Victory        -1.316009

Now we can move to implementing the Mixed Model Equations, where probably the only gotcha is the definition of the $$\boldsymbol{Z}$$ matrix (the incidence matrix for random effects), as both nlme and asreml use ‘number of levels of the factor’ for both the main and interaction effects, which involves using the contrasts.arg argument in model.matrix().
# Variance components
varB = 214.477
varMP = 106.062
varR = 177.083

# Basic vector and matrices: y, X, Z, G & R
y = matrix(oats$Y, nrow = dim(oats)[1], ncol = 1)
X = model.matrix(~ V*N, data = oats)
Z = model.matrix(~ B/MP - 1, data = oats,
                 contrasts.arg = list(B = contrasts(oats$B, contrasts = F),
                                      MP = contrasts(oats$MP, contrasts = F)))

G = diag(c(rep(varB, 6), rep(varMP, 18)))
R = diag(varR, 72, 72)
Rinv = solve(R)

# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z

XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X

Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y

# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
          cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)

blup = solve(C, RHS)
blup

#                            [,1]
# (Intercept)          80.0000000
# VMarvellous           6.6666667
# VVictory             -8.5000000
# N0.2cwt              18.5000000
# N0.4cwt              34.6666667
# N0.6cwt              44.8333333
# VMarvellous:N0.2cwt   3.3333333
# VVictory:N0.2cwt     -0.3333333
# VMarvellous:N0.4cwt  -4.1666667
# VVictory:N0.4cwt      4.6666667
# VMarvellous:N0.6cwt  -4.6666667
# VVictory:N0.6cwt      2.1666667
# BI                   25.4215578
# BII                   2.6569919
# BIII                 -6.5298953
# BIV                  -4.7060280
# BV                  -10.5829337
# BVI                  -6.2596927
# BI:MPGolden.rain      2.3482656
# BII:MPGolden.rain     4.2987082
# BIII:MPGolden.rain   -7.9159514
# BIV:MPGolden.rain     5.7894753
# BV:MPGolden.rain      1.1167834
# BVI:MPGolden.rain    -5.6372811
# BI:MPMarvellous      -3.8543865
# BII:MPMarvellous      6.2094778
# BIII:MPMarvellous    10.7507978
# BIV:MPMarvellous     -7.1155687
# BV:MPMarvellous      -9.8480945
# BVI:MPMarvellous      3.8577740
# BI:MPVictory         14.0774514
# BII:MPVictory        -9.1942649
# BIII:MPVictory       -6.0639747
# BIV:MPVictory        -1.0011059
# BV:MPVictory          3.4978963
# BVI:MPVictory        -1.3160021

Not surprisingly, we get the same results, except that we started from the variance components estimated in the previous analyses, so we could avoid implementing the code for restricted maximum likelihood estimation as well. By the way, given that $$\boldsymbol{R}^{-1}$$ appears in all terms it can be factored out of the MME, leaving terms like $$\boldsymbol{X}' \boldsymbol{X}$$, i.e. without $$\boldsymbol{R}^{-1}$$, making for simpler calculations. In fact, if you drop the $$\boldsymbol{R}^{-1}$$ it is easier to see what is going on in the different components of the $$\boldsymbol{C}$$ matrix. For example, print $$\boldsymbol{X}' \boldsymbol{X}$$ and you’ll get the number of observations for the overall mean and for each of the levels of the fixed effect factors. Give it a try with the other submatrices too!

XpXnoR = t(X) %*% X
XpXnoR
#                    (Intercept) VMarvellous VVictory N0.2cwt N0.4cwt N0.6cwt
#(Intercept)                  72          24       24      18      18      18
#VMarvellous                  24          24        0       6       6       6
#VVictory                     24           0       24       6       6       6
#N0.2cwt                      18           6        6      18       0       0
#N0.4cwt                      18           6        6       0      18       0
#N0.6cwt                      18           6        6       0       0      18
#VMarvellous:N0.2cwt           6           6        0       6       0       0
#VVictory:N0.2cwt              6           0        6       6       0       0
#VMarvellous:N0.4cwt           6           6        0       0       6       0
#VVictory:N0.4cwt              6           0        6       0       6       0
#VMarvellous:N0.6cwt           6           6        0       0       0       6
#VVictory:N0.6cwt              6           0        6       0       0       6
#                    VMarvellous:N0.2cwt VVictory:N0.2cwt VMarvellous:N0.4cwt
#(Intercept)                           6                6                   6
#VMarvellous                           6                0                   6
#VVictory                              0                6                   0
#N0.2cwt                               6                6                   0
#N0.4cwt                               0                0                   6
#N0.6cwt                               0                0                   0
#VMarvellous:N0.2cwt                   6                0                   0
#VVictory:N0.2cwt                      0                6                   0
#VMarvellous:N0.4cwt                   0                0                   6
#VVictory:N0.4cwt                      0                0                   0
#VMarvellous:N0.6cwt                   0                0                   0
#VVictory:N0.6cwt                      0                0                   0
#                    VVictory:N0.4cwt VMarvellous:N0.6cwt VVictory:N0.6cwt
#(Intercept)                        6                   6                6
#VMarvellous                        0                   6                0
#VVictory                           6                   0                6
#N0.2cwt                            0                   0                0
#N0.4cwt                            6                   0                0
#N0.6cwt                            0                   6                6
#VMarvellous:N0.2cwt                0                   0                0
#VVictory:N0.2cwt                   0                   0                0
#VMarvellous:N0.4cwt                0                   0                0
#VVictory:N0.4cwt                   6                   0                0
#VMarvellous:N0.6cwt                0                   6                0
#VVictory:N0.6cwt                   0                   0                6

I will leave it here and come back to this problem as soon as I can.

† Incidentally, a lot of the theoretical development was supported by Shayle Searle (a Kiwi statistician and Henderson’s colleague at Cornell University).

— “You are a pussy” emailed my friend.
— “Sensu cat?” I replied.
— “No. Sensu chicken” blurbed my now ex-friend.

What was this about? He read my post on R, Julia and the shiny new thing, which prompted him to assume that I was the proverbial old dog unwilling (or was it unable?) to learn new tricks. (Incidentally, with friends like this who needs enemies? Hi, Gus.)

Having a look at different—statistical—horses (Photo: Luis, click to enlarge).

I decided to tackle a small—but hopefully useful—piece of code: fitting/training a Genome Wide Selection model, using the Bayes A approach put forward by Meuwissen, Hayes and Goddard in 2001. In that approach the breeding values of the individuals (response) are expressed as a function of a very large number of random predictors (2000, our molecular markers). The dataset (csv file) is a simulation of 2000 bi-allelic markers (aa = 0, Aa = 1, AA = 2) for 250 individuals, followed by the phenotypes (column 2001) and breeding values (column 2002). These models are frequently fitted using MCMC.

In 2010 I attended this course in Ames, Iowa, where Rohan Fernando passed us the following R code (pretty much a transliteration from C code; notice the trailing semicolons, for example). P.D. 2012-04-26 Please note that this is teaching code, not production code:

nmarkers = 2000;    # number of markers
startMarker = 1981; # set to 1 to use all
numiter = 2000;     # number of iterations
vara = 1.0/20.0;

# input data
data = matrix(scan("trainData.out0"), ncol = nmarkers + 2, byrow = TRUE);
nrecords = dim(data)[1];

beg = Sys.time()

# x has the mean followed by the markers
x = cbind(1, data[, startMarker:nmarkers]);
y = data[, nmarkers + 1];
a = data[, nmarkers + 2];

# inital values
nmarkers = nmarkers - startMarker + 1;
mean2pq = 0.5;                         # just an approximation
scalea = 0.5*vara/(nmarkers*mean2pq);  # 0.5 = (v-2)/v for v=4

size = dim(x)[2];
b = array(0.0, size);
meanb = b;
b[1] = mean(y);
var = array(0.0, size);

# adjust y
ycorr = y - x %*% b;

# MCMC sampling
for (iter in 1:numiter){
  # sample vare
  vare = ( t(ycorr) %*% ycorr )/rchisq(1, nrecords + 3);

  # sample intercept
  ycorr = ycorr + x[, 1]*b[1];
  rhs = sum(ycorr)/vare;
  invLhs = 1.0/(nrecords/vare);
  mean = rhs*invLhs;
  b[1] = rnorm(1, mean, sqrt(invLhs));
  ycorr = ycorr - x[, 1]*b[1];
  meanb[1] = meanb[1] + b[1];

  # sample variance for each locus
  for (locus in 2:size){
    var[locus] = (scalea*4 + b[locus]*b[locus])/rchisq(1, 4.0 + 1)
  }

  # sample effect for each locus
  for (locus in 2:size){
    # unadjust y for this locus
    ycorr = ycorr + x[, locus]*b[locus];
    rhs = t(x[, locus]) %*% ycorr/vare;
    lhs = t(x[, locus]) %*% x[, locus]/vare + 1.0/var[locus];
    invLhs = 1.0/lhs;
    mean = invLhs*rhs;
    b[locus] = rnorm(1, mean, sqrt(invLhs));
    # adjust y for the new value of this locus
    ycorr = ycorr - x[, locus]*b[locus];
    meanb[locus] = meanb[locus] + b[locus];
  }
}

Sys.time() - beg

meanb = meanb/numiter;
aHat = x %*% meanb;

Thus, we just need to define a few variables, read the data (marker genotypes, breeding values and phenotypic data) into a matrix, create loops, do matrix and vector multiplication, and generate random numbers (using Gaussian and Chi-squared distributions). Not much if you think about it, but I didn’t have much time to explore Julia’s features so as to go for something more complex.

nmarkers = 2000    # Number of markers
startmarker = 1981 # Set to 1 to use all
numiter = 2000     # Number of iterations

data = dlmread("markers.csv", ',')
(nrecords, ncols) = size(data)

tic()

# this is the mean and markers matrix
X = hcat(ones(Float64, nrecords), data[:, startmarker:nmarkers])
y = data[:, nmarkers + 1]
a = data[:, nmarkers + 2]

nmarkers = nmarkers - startmarker + 1
vara = 1.0/nmarkers
mean2pq = 0.5

scalea = 0.5*vara/(nmarkers*mean2pq) # 0.5 = (v-2)/v for v=4

ndesign = size(X, 2)
b = zeros(Float64, ndesign)
meanb = zeros(Float64, ndesign)
b[1] = mean(y)
varian = zeros(Float64, ndesign)

# adjust y
ycorr = y - X * b

# MCMC sampling
for i = 1:numiter
    # sample vare
    vare = dot(ycorr, ycorr)/randchi2(nrecords + 3)

    # sample intercept
    ycorr = ycorr + X[:, 1] * b[1];
    rhs = sum(ycorr)/vare;
    invlhs = 1.0/(nrecords/vare);
    mn = rhs*invlhs;
    b[1] = randn() * sqrt(invlhs) + mn;
    ycorr = ycorr - X[:, 1] * b[1];
    meanb[1] = meanb[1] + b[1];

    # sample variance for each locus
    for locus = 2:ndesign
        varian[locus] = (scalea*4 + b[locus]*b[locus])/randchi2(4.0 + 1);
    end

    # sample effect for each locus
    for locus = 2:ndesign
        # unadjust y for this locus
        ycorr = ycorr + X[:, locus] * b[locus];
        rhs = dot(X[:, locus], ycorr)/vare;
        lhs = dot(X[:, locus], X[:, locus])/vare + 1.0/varian[locus];
        invlhs = 1.0/lhs;
        mn = invlhs * rhs;
        b[locus] = randn() * sqrt(invlhs) + mn;
        # adjust y for the new value of this locus
        ycorr = ycorr - X[:, locus] * b[locus];
        meanb[locus] = meanb[locus] + b[locus];
    end
end

toc()

meanb = meanb/numiter;
aHat = X * meanb;

The code looks remarkably similar and there are four main sources of differences:

1. The first trivial one is that the original code read a binary dataset and I didn’t know how to do it in Julia, so I’ve read a csv file instead (this is why I start timing after reading the file too).
2. The second trivial one is avoiding name conflicts between variables and functions; for example, in R the user is allowed to have a variable called var that will not interfere with the variance function. Julia is picky about that, so I needed to rename some variables.
3. Julia passes variables by reference, while R does so by value when assigning matrices, which tripped me up because the original R code had something like b = array(0.0,size); meanb = b;. This works fine in R, but in Julia changes to the b vector also changed meanb (see the small R illustration after this list).
4. The definition of scalar vs array created some problems in Julia. For example y' * y (t(y) %*% y in R) is numerically equivalent to dot(y, y). However, the first version returns an array, while the second one returns a scalar. I got an error message when trying to store the ‘scalar like an array’ into an array. I find that confusing.
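To illustrate the R side of point 3, here is a throwaway example (not part of the original code):

size = 5
b = array(0.0, size)
meanb = b     # in R this creates an independent copy of b
b[1] = 100
meanb[1]      # still 0 in R; in Julia meanb[1] would have changed too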

One interesting point in this comparison is that we are using rough code, not really optimized for speed; in fact, the only thing I can say about the Julia code is that ‘it runs’ and it probably is not very idiomatic. Testing runs with different numbers of markers, we get that R needs roughly 2.8x the time used by Julia. The Julia website claims better results in benchmarks, but in real life we work with, well, real problems.

In 1996-7 I switched from SAS to ASReml for genetic analyses because it was 1-2 orders of magnitude faster and opened a world of new models. Today a change from R to Julia would deliver (in this particular case) a much more modest speed-up (~3x), which is OK but not worth changing languages (yet). Together with the embryonic graphical capabilities and the still-to-develop ecosystem of packages, this means that I’m still using R. Nevertheless, the Julia team has achieved very impressive performance in very little time, so it is worth keeping an eye on their progress.

P.S.1 Readers are welcome to suggest ways of improving the code.
P.S.2 WordPress does not let me upload the binary version of the simulated data.
P.S.3 Hey WordPress guys; it would be handy if the sourcecode plugin supported Julia!
P.S.4 2012-04-26 Following AL’s recommendation in the comments, one can replace in R:

rhs = t(x[,locus])%*%ycorr/vare;
lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus]

by

rhs = crossprod(x[,locus],ycorr)/vare
lhs = crossprod(x[,locus],x[,locus])/vare + 1.0/var[locus]

reducing execution time by roughly 20%, making the difference between Julia and R even smaller.
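If you want to check the gain on your own machine, a rough timing sketch along these lines should do (the dimensions mimic the simulated dataset; results will vary with machine and BLAS):

set.seed(42)
x = matrix(rnorm(250 * 2000), nrow = 250)
ycorr = rnorm(250)
vare = 1

# Timing the two ways of building the right-hand side, one locus at a time
system.time(for (locus in 1:2000) rhs <- t(x[, locus]) %*% ycorr/vare)
system.time(for (locus in 1:2000) rhs <- crossprod(x[, locus], ycorr)/vare)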