Archive for the ‘rblogs’ Category

Man flu kept me at home today, so I decided to do something ‘useful’ and go for a linkathon:

Sometimes people are truthful and cruel. Here Gappy on a mission goes for the jugular:

Over and out.

A few days ago I came across Jack Heinemann and collaborators’ article (Sustainability and innovation in staple crop production in the US Midwest, Open Access) comparing the agricultural sectors of the USA and Western Europe. While the title revolves around the word sustainability, the main comparison hinges on the use of Genetically Modified crops in the USA versus their absence in Western Europe.

I was curious about part of the results and discussion which, in a nutshell, suggest that “GM cropping systems have not contributed to yield gains, are not necessary for yield gains, and appear to be eroding yields compared to the equally modern agroecosystem of Western Europe”. The authors relied on several crops for the comparison (maize/corn, rapeseed/canola (see P.S.6), soybean and cotton); however, I am going to focus on a single one (corn) for two reasons: 1. I can’t afford a lot of time for blog posts when I should be preparing lectures and 2. I like eating corn.

When the authors of the paper tackled corn, the comparison was between the USA and Western Europe, using the United Nations definition of Western Europe (i.e. Austria, Belgium, France, Germany, Liechtenstein, Luxembourg, Monaco, Netherlands, Switzerland). Some large European corn producers, like Italy, are missing because of the narrow definition of Western.

I struggled with the comparison used by the authors because, in my opinion, there are potentially so many confounded effects (different industry structures, weather, varieties, etc.) that it can’t provide the proper counterfactual for GM versus non-GM crops. Anyway, I decided to have a look at the same data to see if I would reach the same conclusions. The article provides a good description of where the data came from, as well as how the analyses were performed, and the small details needed to exactly match the published results were fairly easy to figure out. I downloaded the FAO corn data (3.7 MB csv file) for all countries (so I can reuse the code and data later for lectures and assignments) and then repeated the plots using the following code:

# Default directory
setwd('~/Dropbox/quantumforest')

# Required packages
require(ggplot2)
require(scales) # provides the comma() label formatter

# Reading FAO corn data
FAOcorn = read.csv('FAOcorn.csv')

# Extracting Area
FAOarea = subset(FAOcorn, Element == 'Area Harvested',
                 select = c('Country', 'Year', 'Value'))

names(FAOarea)[3] = 'Area'

# and production
FAOprod = subset(FAOcorn, Element == 'Production',
                 select = c('Country', 'Year', 'Value'))

names(FAOprod)[3] = 'Production'

# to calculate yield in hectograms/ha
FAOarea = merge(FAOarea, FAOprod, by = c('Country', 'Year'))
FAOarea$Yield = with(FAOarea, Production/Area*10000)

# Subsetting only the countries of interest (and years to match paper)
FAOarticle = subset(FAOarea, Country == 'United States of America' |
                             Country == 'Western Europe')

# Plot with regression lines
ggplot(FAOarticle, aes(x = Year, y = Yield, color = Country)) +
  geom_point() + stat_smooth(method = lm, fullrange = TRUE, alpha = 0.1) +
  scale_y_continuous('Yield [hectograms/ha]', limits = c(0, 100000),
                     labels = comma) +
  theme(legend.position = 'top')

Figure 1. Corn yield per year for USA and Western Europe (click to enlarge).

I could obtain pretty much the same regression model equations as in the article by expressing the years as deviations from 1960, as in:

# Expressing year as a deviation from 1960, so results
# match paper
FAOarticle$NewYear = with(FAOarticle, Year - 1960)

usa.lm = lm(Yield ~ NewYear, data = FAOarticle,
            subset = Country == 'United States of America')
summary(usa.lm)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "United States of America")
#
#Residuals:
#      Min        1Q    Median        3Q       Max
#-18435.4   -1958.3     338.3    3663.8   10311.0
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 38677.34    1736.92   22.27   <2e-16 ***
#NewYear      1173.83      59.28   19.80   <2e-16 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 6049 on 48 degrees of freedom
#Multiple R-squared: 0.8909,     Adjusted R-squared: 0.8887

weu.lm = lm(Yield ~ NewYear, data = FAOarticle,
            subset = Country == 'Western Europe')
summary(weu.lm)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "Western Europe")
#
#Residuals:
#      Min        1Q    Median        3Q       Max
#-14726.6   -3205.8     346.4    4000.6   10289.5
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 31510.14    1665.90   18.91   <2e-16 ***
#NewYear      1344.42      56.86   23.65   <2e-16 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 5802 on 48 degrees of freedom
#Multiple R-squared: 0.9209,     Adjusted R-squared: 0.9193
#F-statistic: 559.1 on 1 and 48 DF,  p-value: < 2.2e-16

Heinemann and collaborators then point out the following:

…the slope in yield increase by year is steeper in W. Europe (y = 1344.2x + 31512, R² = 0.92084) than the United States (y = 1173.8x + 38677, R² = 0.89093) from 1961 to 2010 (Figure 1). This shows that in recent years W. Europe has had similar and even slightly higher yields than the United States despite the latter’s use of GM varieties.

However, that interpretation of the full data set assumes that both ‘countries’ were using GM crops the whole time. The interesting thing is that the USA and Western Europe were already on different trends before the introduction of GM corn. We can say this because we have a good idea of when GM crops were introduced in the USA: this information is collected by the US Department of Agriculture in their June survey of growers and made publicly available at the State level (GMcornPenetration.csv):

cornPenetration = read.csv('GMcornPenetration.csv')

ggplot(cornPenetration, aes(x = Year, y = PerAllGM)) + geom_line() +
  facet_wrap(~ State) +
  scale_y_continuous('Percentage of GM corn') +
  theme(axis.text.x = element_text(angle = 90))

Figure 2. GM corn percentage by state in the USA (click to enlarge).

This graph tells us that by the year 2000 the percentage of GM corn planted was well below 50% in most corn-producing states (in fact, it was 25% at the country level). From then on there was a steady increase, reaching over 80% in most states by 2008. Given this, it probably makes sense to assume that, at the USA level, yield reflects non-GM corn until 1999 and progressively reflects the effect of GM genotypes from 2000 onwards. This division is somewhat arbitrary, but easy to implement.

We can repeat the previous analyses limiting the data to 1961 through, say, 1999:

usa.lm2 = lm(Yield ~ NewYear, data = FAOarticle,
             subset = Country == 'United States of America' & Year < 2000)
summary(usa.lm2)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "United States of America" & Year < 2000)
#
#Residuals:
#   Min     1Q Median     3Q    Max
#-17441  -2156   1123   3989   9878
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 39895.57    2084.81   19.14  < 2e-16 ***
#NewYear      1094.82      90.84   12.05 2.25e-14 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 6385 on 37 degrees of freedom
#Multiple R-squared: 0.797,      Adjusted R-squared: 0.7915
#F-statistic: 145.2 on 1 and 37 DF,  p-value: 2.245e-14

weu.lm2 = lm(Yield ~ NewYear, data = FAOarticle,
             subset = Country == 'Western Europe' & Year < 2000)
summary(weu.lm2)

#Call:
#lm(formula = Yield ~ NewYear, data = FAOarticle, subset = Country ==
#    "Western Europe" & Year < 2000)
#
#Residuals:
#   Min     1Q Median     3Q    Max
#-10785  -3348    -34   3504  11117
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 29802.17    1813.79   16.43   <2e-16 ***
#NewYear      1454.48      79.03   18.40   <2e-16 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 5555 on 37 degrees of freedom
#Multiple R-squared: 0.9015,     Adjusted R-squared: 0.8988
#F-statistic: 338.7 on 1 and 37 DF,  p-value: < 2.2e-16

These analyses indicate that Western Europe started with a lower yield than the USA (29,802.17 vs 39,895.57 hectograms/ha) and increased yield much more quickly (1,454.48 vs 1,094.82 hectograms/ha per year) before any use of GM corn by the USA. Figure 1 shows a messy picture, because numerous factors affect yield each year (weather, for example, has a large influence). We can take averages for each decade and see how the two ‘countries’ are performing:

# Aggregating every decade.
# 2013-07-05 20:10 NZST I fixed the aggregation because it was averaging yields rather
# than calculating total production and area for the decade and then calculating
# average yield. Discussion points are totally valid
FAOarticle$Decade = cut(FAOarticle$Year,
                        breaks = seq(1959, 2019, 10),
                        labels = paste(seq(1960, 2010, 10), 's', sep = ''))

decadeProd = aggregate(Production ~ Country + Decade,
                       data = FAOarticle,
                       FUN = sum)

decadeArea = aggregate(Area ~ Country + Decade,
                       data = FAOarticle,
                       FUN = sum)

decadeYield = merge(decadeProd, decadeArea, by = c('Country', 'Decade'))
decadeYield$Yield = with(decadeYield, Production/Area*10000)

ggplot(decadeYield, aes(x = Decade, y = Yield, fill = Country)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  scale_y_continuous('Yield [hectograms/ha]', expand = c(0, 0)) +
  theme(legend.position = 'top')

Figure 3. Corn yield by decade (click to enlarge).

This last figure requires more attention. We can again see that Western Europe starts with lower yields than the USA; however, it keeps increasing those yields faster than the USA, overtaking it during the 1990s. Again, all this change happened while neither the USA nor Western Europe was using GM corn. The situation reverses in the 2000s, when the USA overtakes Western Europe, at the same time as it continuously increased its percentage of GM corn. The last bar in Figure 3 is misleading because it includes a single year (2010), and we know that yields in the USA went down in 2011 and 2012, affected by a very large drought (see Figure 4).

At least when looking at corn, I can’t say (with the same data available to Heinemann) that there is no place or need for GM genotypes. I do share some of his concerns with respect to the low level of diversity present in staple crops but, in contrast to his opinion, I envision a future for agriculture that includes large-scale operations (either GM or non-GM), as well as smaller operations (including organic ones).

I’d like to finish with some optimism, looking further back at yield, because the USDA National Agricultural Statistics Service has kept yield statistics for corn since 1866(!) (csv file), although it uses bizarre non-metric units (bushels/acre). As a metric boy, I converted to kilograms per hectare (multiplying by 62.77, from this page) and then to hectograms per hectare (multiplying by 10).

# Reading NASS corn data
NASS = read.csv('NASScorn.csv')
# Conversion to sensical units (see Iowa State Extension article)
# http://www.extension.iastate.edu/agdm/wholefarm/html/c6-80.html
NASS$Yield = with(NASS, Value*62.77*10)

# Average by decade
NASS$Decade = cut(NASS$Year,
                  breaks = seq(1859, 2019, 10),
                  labels = paste(seq(1860, 2010, 10), 's', sep = ''))

oldYield = aggregate(Yield ~ Decade, data = NASS, FUN = mean)

# Plotting
ggplot(oldYield, aes(x = Decade, y = Yield)) +
  geom_bar(stat = 'identity') +
  scale_y_continuous('Yield [hectograms/ha]', expand = c(0, 0))

Figure 4. Historic average yield per decade for USA (click to enlarge).

It is interesting to see that there was little change until the 1940s, with the advent of the Green Revolution (modern breeding techniques, fertilization, pesticides, etc.). The 2010s decade in Figure 4 includes 2010, 2011 and 2012, with the last two years reflecting extensive droughts. Drought tolerance is one of the most important traits in modern breeding programs.

Drought’s Footprint, a map produced by The New York Times (click on the graph to view a larger version in the NYT), can help explain the decade patterns in the previous figures.

While Prof. Heinemann and I work for the same university, I don’t know him in person.

P.S. Did you know that Norman Borlaug (hero of mine) studied forestry?
P.S.2 Time permitting I’ll have a look at other crops later. I would have liked to test a regression with dummy variables for corn to account for the pre-2000 and post-1999 periods (see the sketch after these notes), but there are not yet enough years to fit a decent model (considering natural variability between years). We’ll have to wait for that one.
P.S.3 I share some of Heinemann’s concerns relating to subsidies and other agricultural practices.
P.S.4 In case anyone is interested, I did write about a GM-fed pigs study not long ago.
P.S.5 2013-07-05 20:10 NZST. I updated Figures 1 and 3 to clearly express that yield was in hectograms/ha, and recalculated average decade yield because it was originally averaging yields rather than calculating total production and area for the decade and then calculating average yield. The discussion points I raised are still completely valid.
P.S.6 2013-07-07 11:30 NZST. The inclusion of Canadian canola does not make any sense because, as far as I know, Canada is not part of Western Europe or the US Midwest. This opens the door to including crops from any origin, as long as the results are convenient for one’s argument.
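In case anyone wonders what the dummy-variable regression in P.S.2 would look like, here is a minimal sketch (my code, not part of the original analysis), assuming the FAOarticle data frame built above. Period marks the somewhat arbitrary pre-2000/post-1999 split for the USA series:

# Sketch only: with so few post-1999 years the second slope is
# poorly estimated, which is why the model has to wait
usa = subset(FAOarticle, Country == 'United States of America')
usa$Period = factor(ifelse(usa$Year < 2000, 'pre2000', 'post1999'),
                    levels = c('pre2000', 'post1999'))

# The interaction allows a different intercept and slope for each period
usa.dummy.lm = lm(Yield ~ NewYear * Period, data = usa)
summary(usa.dummy.lm)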

A list of interesting R/Stats quickies to keep the mind distracted:

• A long draft of Advanced Data Analysis from an Elementary Point of View by Cosma Shalizi, in which he uses R to drive home the message. Not your average elementary point of view.
• Good notes by Frank Davenport on getting started with R using data from a Geographic Information System (GIS). Read this to get a general idea of how things fit together.
• If you are into maps, Omnia sunt Communia! provides many good tips on producing them using R.
• Mark James Adams reminded us that Prediction ≠ Understanding, probably inspired by Dan Gianola‘s course on Whole Genome Prediction. Gianola is a monster of Bayesian applications to genetic evaluation.
• If you are into data/learning visualization you have to watch Bret Victor’s presentation on Media for thinking the unthinkable. He is so far ahead of what we normally do that it is embarrassing.
• I follow mathematician Atabey Kaygun on Twitter and since yesterday I’ve been avidly reading his coverage of the protests in Turkey. Surely there are more important things going on in the world than the latest R gossip.

I’m marking too many assignments right now to have enough time to write something more substantial. I can see the light at the end of the tunnel though.

I was working with a small experiment which includes families from two Eucalyptus species and thought it would be nice to code a first analysis using alternative approaches. The experiment is a randomized complete block design, with species as a fixed effect and family and block as random effects, while the response variable is growth strain (in $$\mu \epsilon$$).

When looking at the trees one can see that the residual variances for the two species will be very different. In addition, the trees were growing in plastic bags laid out in rows (the blocks) and columns. Given that the bags were sitting on flat terrain, most likely the row effects are zero.
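A quick way to eyeball the difference in spread is a boxplot per species. This is a minimal sketch of mine (not in the original post), assuming ggplot2 is loaded and the gsboth data frame has been created as in the code below:

# Visual check of the variance difference between the two species
ggplot(gsboth, aes(x = species, y = strain)) + geom_boxplot()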

Below is the code for a first go in R (using both MCMCglmm and ASReml-R) and SAS. I had stopped using SAS several years ago, mostly because I was running a Mac, for which there is no version. However, a few weeks ago I started accessing it through a web browser via their OnDemand for Academics program.

The R code using REML looks like:

# Options
options(stringsAsFactors = FALSE)
setwd('~/Dropbox/research/2013/growth-stress')

# Packages
require(ggplot2)
require(asreml)
require(MCMCglmm)

# Reading data, renaming, factors, etc
gs = read.csv('eucalyptus-growth-stress.csv')
summary(gs)

# Both argophloia and bosistoana
gsboth = subset(gs, !is.na(strain))
gsboth = within(gsboth, {
  species = factor(species)
  row = factor(row)
  column = factor(column)
  fam = factor(fam)
})

ma = asreml(strain ~ species, random = ~ fam + row,
            rcov = ~ at(species):units,
            data = gsboth)

summary(ma)$varcomp

#                                    gamma  component std.error   z.ratio constraint
#fam!fam.var                     27809.414  27809.414 10502.036 2.6480022   Positive
#row!row.var                      2337.164   2337.164  3116.357 0.7499666   Positive
#species_E.argopholia!variance  111940.458 111940.458 26609.673 4.2067580   Positive
#species_E.bosistoana!variance   63035.256  63035.256  7226.768 8.7224681   Positive

While using MCMC we get estimates in the ballpark by using:

# Priors
bothpr = list(R = list(V = diag(c(50000, 50000)), nu = 3),
              G = list(G1 = list(V = 20000, nu = 0.002),
                       G2 = list(V = 20000, nu = 0.002),
                       G3 = list(V = 20000, nu = 0.002)))

# MCMC
m2 = MCMCglmm(strain ~ species, random = ~ fam + row + column,
              rcov = ~ idh(species):units,
              data = gsboth,
              prior = bothpr,
              pr = TRUE,
              family = 'gaussian',
              burnin = 10000,
              nitt = 40000,
              thin = 20,
              saveX = TRUE,
              saveZ = TRUE,
              verbose = FALSE)

summary(m2)

# Iterations = 10001:39981
# Thinning interval = 20
# Sample size = 1500
#
# DIC: 3332.578
#
# G-structure: ~fam
#
#     post.mean l-95% CI u-95% CI eff.samp
# fam     30315    12211    55136     1500
#
# ~row
#
#     post.mean l-95% CI u-95% CI eff.samp
# row      1449    5.928     6274    589.5
#
# R-structure: ~idh(species):units
#
#                    post.mean l-95% CI u-95% CI eff.samp
# E.argopholia.units    112017    71152   168080     1500
# E.bosistoana.units     65006    52676    80049     1500
#
# Location effects: strain ~ species
#
#                     post.mean l-95% CI u-95% CI eff.samp  pMCMC
# (Intercept)            502.21   319.45   690.68     1500 <7e-04 ***
# speciesE.bosistoana   -235.95  -449.07   -37.19     1361  0.036 *

The SAS code is not that dissimilar, except for the clear demarcation between data processing (the data step, for reading files, data transformations, etc.) and specific procs (procedures), in this case to summarize the data, produce a boxplot and fit a mixed model.

* termstr=CRLF accounts for the windows-like line endings of the data set;
data gs;
  infile "/home/luis/Projects/euc-strain/growthstresses.csv"
         dsd termstr=CRLF firstobs=2;
  input row column species $ family $ strain;
  if strain ^= .;
run;

proc summary data = gs print;
  class species;
  var strain;
run;

proc boxplot data = gs;
  plot strain*species;
run;

proc mixed data = gs;
  class row species family;
  model strain = species;
  random row family;
  repeated species / group=species;
run;

/*
Covariance Parameter Estimates
Cov Parm    Group              Estimate
row                             2336.80
family                            27808
species     species E.argoph     111844
species     species E.bosist      63036
*/

SAS boxplot for the data set.

I like working with multiple languages and I realized that, in fact, I missed SAS a bit. It was like meeting an old friend; at the beginning it felt strange, but we were quickly chatting away after a few minutes.

At School we use R across many courses, because students are supposed to use statistics in a variety of contexts. Imagine their disappointment when they pass stats and discover that R and statistics haven’t gone away! When students start working with real data sets, one of their first stumbling blocks is subsetting data.
We have data sets, and students are either required to deal with different subsets or have data cleaning to do. For some reason, many struggle with what should be a simple task. If one thinks of data as a matrix/2-dimensional array, subsetting boils down to extracting the needed rows (cases) and columns (variables). In the R world one can do this in a variety of ways, ranging from the cryptic to the explicit and clear.

For example, let’s assume that we have a dataset called alltrials with heights and stem diameters for a number of trees in different trials (we may have a bunch of additional covariates and experimental design features that we’ll ignore for the moment). How do we extract all trees located in Christchurch? Two common approaches are:

mytrial = alltrials[alltrials$location == 'Christchurch', ]

mytrial = subset(alltrials, location == 'Christchurch')

While both expressions are equivalent, the former reads like Klingon to students, while the latter makes explicit that we are obtaining a subset of the original data set. This can easily be expanded to more complex conditions; for example to include all trees from Christchurch that are taller than 10 m:

mytrial = alltrials[alltrials$location == 'Christchurch' & alltrials$height > 10, ]

mytrial = subset(alltrials, location == 'Christchurch' & height > 10)

I think the complication with the Klingonian notation comes mostly from two sources:

• Variable names for subsetting the data set are not directly accessible, so we have to prefix them with NameOfDataset$, making the code more difficult to read, particularly if we join several conditions with & and |.
• Hanging commas: if we are only working with rows or columns we have to acknowledge it by suffixing or prefixing with a comma, which is often forgotten.

Both points result in frustrating error messages like:

• Error in [.data.frame(alltrials, location == 'Christchurch', ) : object 'location' not found for the first point, or
• Error in [.data.frame(alltrials, alltrials$location == 'Christchurch') : undefined columns selected for the second point.

The generic forms of these two notations are:

dataset[what to do with rows, what to do with columns]

subset(dataset, what to do with rows, what to do with columns)

We often want to keep a subset of the observed cases and keep (or drop) specific variables. For example, we want to keep trees in 'Christchurch' and we want to ignore diameter, because the assessor was 'high' that day:

# With this notation things get a bit trickier
# The easiest way is to provide the number of the variable
# Here diameter is the 5th variable in the dataset
mytrial = alltrials[alltrials$location == 'Christchurch' & alltrials$height > 10, -5]

# This notation is still straightforward
mytrial = subset(alltrials, location == 'Christchurch' & height > 10,
                 select = -diameter)
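As an aside (my addition, not in the original post), keeping a few variables by name works in both notations too: the bracket notation takes a character vector of column names, while subset() takes unquoted names.

# Keeping only location and height (hypothetical columns, as above)
mytrial = alltrials[alltrials$location == 'Christchurch', c('location', 'height')]

mytrial = subset(alltrials, location == 'Christchurch',
                 select = c(location, height))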

There are, however, situations where Klingon is easier or more efficient. For example, to take a random sample of 100 trees from the full dataset:

mytrial = alltrials[sample(1:nrow(alltrials), 100, replace = FALSE), ]

If you are interested in this issue, Quick-R has a good description of subsetting. I'm sure this basic topic must have been covered many times, although I doubt anyone used Klingon in the process.

Gratuitous picture: building blocks for R (Photo: Luis).

It used to be that one of the first decisions to make when learning to program was between compiled (e.g. C or FORTRAN) and interpreted (e.g. Python) languages. In my opinion, these days one would have to be a masochist to learn with a compiled language: the extra compilation time and obscure errors are a killer when learning.

Today the decision would be between using a generic interpreted language (e.g. Python) and an interpreted domain specific language (DSL) like R, MATLAB, etc. While some people prefer generic languages, I’d argue that immediate feedback and easy accomplishment of useful tasks are a great thing when one is learning something for the first time.

As an example, a while ago my son asked me what I was doing on the computer and I told him that I was programming some analyses in R. I showed him that the program was spitting back some numbers and plots, a fact that he found totally unremarkable and uninteresting. I searched the internet and found Scratch, a visual programming language that lets the user move blocks representing code around and build interactive games: now my son was sold. Together we are learning about loops, control structures and variables, drawing characters, etc. We are programming because the problems are i- much more interesting for him and ii- achievable in a short time frame.

An example scratch script.

Learning to program for statistics, or other scientific domains for that matter, is not that different from being a kid and learning programming. Having to do too much to get even a mildly interesting result is frustrating and discouraging; it is not that the learner is dumb, but that he has to build too many functions to get a meager reward. This is why I’d say that you should use whatever language already has a large amount of functionality (‘batteries included’ in Python parlance) for your discipline. Choose rightly and you are half-way there.

‘But’, someone will say, ‘R is not a real language’. Sorry, but it is a real language (Turing complete and the whole shebang), with oddities like any other language, granted. As with human languages, the more of them you study, the easier it gets to learn a new one. In fact, the syntax for many basic constructs in R is highly similar to the alternatives:

# This is R code
# Loop
for(i in 1:10) {
    print(i)
}
#[1] 1
#[1] 2
#[1] 3
#[1] 4
#[1] 5
#[1] 6
#[1] 7
#[1] 8
#[1] 9
#[1] 10

# Control
if(i == 10) {
    print('It is ten!')
}
#[1] "It is ten!"
# This is Python code
# Loop
for i in range(1, 11): # range(1, 11) yields 1 to 10, as Python counts from zero
    print(i)

#1
#2
#3
#4
#5
#6
#7
#8
#9
#10

# Control
if i == 10:
    print("It is ten!")

#It is ten!

By the way, I formatted the code to highlight similarities. Of course there are plenty of differences between the languages, but many of the building blocks (the core ideas if you will) are shared. You learn one and the next one gets easier.

How does one start learning to program? If I look back at 1985 (yes, last millennium) I was struggling to learn how to program in BASIC when, suddenly, I had an epiphany. The sky opened and I heard a choir singing György Ligeti’s Atmosphères (from 2001: A Space Odyssey, you know) and then I realized: we are only breaking problems into little pieces and dealing with them one at a time. I’ve been breaking problems into little pieces ever since. What else did you expect? Anyhow, if you feel that you want to learn how to code in R, or whatever language is the best option for your problems, start small. Just a few lines will do. Read other people’s code but, again, only small pieces that are supposed to do something small. At this stage it is easy to get discouraged, because everything is new and takes a lot of time. Don’t worry: everyone has to go through this process.

Many people struggle to vectorize their code so it runs faster. Again, don’t worry at the beginning if your code is slow. Keep on writing and learning. Read Noam Ross’s FasteR! HigheR! StrongeR! — A Guide to Speeding Up R Code for Busy People. The guide is very neat and useful, although you don’t need super fast code, not yet at least. You need working, readable and reusable code. You may not even need code: try to understand the problem and the statistical theory before you start coding like crazy. Programming will help you understand your problem a lot better, but you need a helpful starting point.
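To make the vectorization point concrete, here is a tiny sketch (my example, not from Ross’s guide) of the same sum of squares written as an explicit loop and as a single vectorized expression:

# A million random numbers to play with
x <- rnorm(1e6)

# Element-by-element loop version
total <- 0
for(i in seq_along(x)) {
    total <- total + x[i]^2
}

# Vectorized version: one expression, and much faster
total2 <- sum(x^2)

all.equal(total, total2)
# [1] TRUE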

Don’t get distracted by the politics of research and repeatability, or trendy things like git (notice that I didn’t even link to them?). You’ll learn them in time, once you have a clue about how to program.

P.S. The code used in the examples could be shortened and sped up dramatically (e.g. 1:10 or range(1, 11)) but it is not the point of this post.
P.S.2. A while ago I wrote R is a language, which could be useful connected to this post.

‘No estaba muerto, andaba de parranda’, as the song says. Although rather than partying, it has mostly been reading, taking pictures and trying to learn how to record sounds. Here are some things I’ve come across lately.

I can’t remember if I’ve recommended Matloff’s The Art of R Programming before; if I haven’t, go and read the book for a good exposition of the language. Matloff also has an open book (as in free PDF, 3.5 MB) entitled From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science. The download link is near the end of the page. He states that the reader ‘must know calculus, basic matrix algebra, and have some minimal skill in programming’, which incidentally is the bare minimum for someone who wants to get a good handle on stats. In my case I learned calculus partly from Piskunov’s book (I’m a sucker for Soviet books, free DjVu), matrix algebra from Searle’s book and programming with… that’s another story.

I’ve ordered a couple of books from CRC Press, which I hope to receive soon (it depends on how long it takes for the parcel to arrive in the middle of nowhere):

• Stroup’s Generalized Linear Mixed Models: Modern Concepts, Methods and Applications, which according to the blurb comes ‘with numerous examples using SAS PROC GLIMMIX’. You could be wondering why I am reading a book that includes SAS as a selling point. Well, SAS is a very good statistical package that still has a fairly broad installed base. However, the real selling point is that I’ve read some explanations of mixed models written by Stroup and he has a superb understanding of the topic. I’m really looking forward to putting my paws on this book.
• Lunn et al.’s The BUGS Book: A Practical Introduction to Bayesian Analysis. I don’t use BUGS but occasionally use JAGS, and one of the things that irks me about programs like BUGS, JAGS or INLA is that they follow the ‘here is a bunch of examples’ approach to documentation. This book is supposed to provide a much more detailed account of the ins and outs of fitting models, a proper manual. Or at least that’s what I’m hoping to find in it.

Finally, a link to a fairly long (and somewhat old) list of R tips and the acknowledgements of a PhD thesis that make you smile (via Arthur Charpentier).

Gratuitous picture: frozen fence (Photo: Luis).

‘He was not dead, he was out partying’.

I’ve been moving part of my work to university servers, where I’m just one more peasant user with few privileges. In exchange, I can access the jobs from anywhere and I can use multiple processors if needed. Given that I have a sieve-like memory, where configuration details quickly disappear through many small holes, I’m documenting the little steps needed to move my work environment there.

The server provides a default R installation but none of the additional packages I often install are available (most people accessing the servers don’t use R). I could contact the administrator to get them installed, but I’ve opted for installing them under my user space. For that I followed the instructions presented here, which in summary require adding the name of the default folder (/hpc/home/luis/rpackages) for the local library of packages to my .bashrc file:

R_LIBS="/hpc/home/luis/rpackages" export R_LIBS

I also have a temporary folder (called temporary) in the account, where I move the source of the packages to be installed (using SFTP). Once the R session is started it is possible to check that the local folder comes first in the library path, confirming that R_LIBS has been made available to R.

Then I can install the packages I moved to the server with SFTP from the temporary folder to the local library using install.packages().

.libPaths()
# [1] "/hpc/home/luis/rpackages"
# [2] "/hpc/home/projects/packages/local.linux.ppc/pkg/R/2.15.1/lib64/R/library"

install.packages("~/temporary/plyr_1.8.tar.gz",
                 lib = "/hpc/home/luis/rpackages",
                 repos = NULL)

# * installing *source* package 'plyr' ...
# ** package 'plyr' successfully unpacked and MD5 sums checked
# ** libs
# gcc -std=gnu99 -I/usr/local/pkg/R/2.15.1/lib64/R/include -DNDEBUG -fPIC -O2 -c loop-apply.c -o loop-apply.o
# gcc -std=gnu99 -I/usr/local/pkg/R/2.15.1/lib64/R/include -DNDEBUG -fPIC -O2 -c split-numeric.c -o split-numeric.o
# gcc -std=gnu99 -shared -Wl,--as-needed -o plyr.so loop-apply.o split-numeric.o
# installing to /hpc/home/luis/rpackages/plyr/libs
# ** R
# ** data
# ** moving datasets to lazyload DB
# ** inst
# ** preparing package for lazy loading
# ** help
# *** installing help indices
# ** building package indices
# ** testing if installed package can be loaded
#
# * DONE (plyr)

Now we can load the package as normal:

require(plyr)
# Loading required package: plyr

Nothing complicated or groundbreaking, just writing down the small details before I forget them.

Gratuitous picture: just different hardware (Photo: Luis).

First go and read An R wish list for 2012. None of the wishes came true in 2012. Fix the R website? No, it is the same this year; in fact, it is the same as in 2005. Easy-to-find help? Sorry, next year. Consistency and sane defaults? Coming soon to a theater near you (one day). Thus my wish list for 2012 is, very handily, still the wish list for 2013.

R as social software

The strength of R is not the software itself, but the community surrounding the software. Put another way, there are several languages that could offer the core functionality, but the whole ‘ecosystem’ is another thing. Softening @gappy3000’s comment: innovation is (mostly) happening outside the core.

This prompts some questions: Why isn’t ggplot2 or plyr in the default download? I don’t know if some people realize that ggplot2 is now one of the main attractions of R as a data visualization language. Why isn’t Hadley’s name on this page? (Sorry I’m picking on him, first name that came to mind.) How come there is not one woman on that page? I’m not saying there is an evil plan, but I’m wondering if (and how) the site and the core reflect the R community and its diversity of interests (and uses). I’m also wondering what the process is to express these questions beyond a blog post. Perhaps on the developers’ email list?

I think that, in summary, my R wish for 2013 is that ‘The R project’—whoever that is—recognizes that the project is much more than the core download. I wish the list of contributors went beyond the fairly small number of people with write access to the source. I’d include those who write packages, those who explain, those who market and, yes, those who sell R. Finally, I wish all readers of Quantum Forest a great 2013.

Entry point to the R world. Same as ever.

P.S. Just in case, no, I’m not suggesting to be included in any list.

End-of-year posts are corny but, what the heck, I think I can let myself delve into corniness once a year. The following code gives a snapshot of the what and how of R for me in 2012.

outside.packages.2012 <- list(used.the.most = c('asreml', 'ggplot2'),
                              largest.use.decline = c('MASS', 'lattice'),
                              same.use = c('MCMCglmm', 'lme4'),
                              would.like.use.more = 'JAGS')

skill.level <- list(improved = 'fewer loops (plyr and do.call())',
                    unimproved = c('variable.naming (Still an InConsistent mess)',
                                   'versioning (still hit and miss)'))

interfaces <- list(most.used = c('RStudio', 'plain vanilla R',
                                 'text editor (Textmate and VIM)'),
                   didnt.use.at.all = 'Emacs')

languages <- list(for.inquisition = c('R', 'Python', 'Javascript'),
                  revisiting = 'J',
                  discarded = 'Julia (note to self: revisit in a year)')

(R.2012 <- list(outside.packages.2012, skill.level, interfaces, languages))

# [[1]]
# [[1]]$used.the.most
# [1] "asreml"  "ggplot2"
# [[1]]$largest.use.decline
# [1] "MASS"    "lattice"
# [[1]]$same.use
# [1] "MCMCglmm" "lme4"
# [[1]]$would.like.use.more
# [1] "JAGS"
#
# [[2]]
# [[2]]$improved
# [1] "fewer loops (plyr and do.call())"
# [[2]]$unimproved
# [1] "variable.naming (Still an InConsistent mess)"
# [2] "versioning (still hit and miss)"
#
# [[3]]
# [[3]]$most.used
# [1] "RStudio"                        "plain vanilla R"
# [3] "text editor (Textmate and VIM)"
# [[3]]$didnt.use.at.all
# [1] "Emacs"
#
# [[4]]
# [[4]]$for.inquisition
# [1] "R"          "Python"     "Javascript"
# [[4]]$revisiting
# [1] "J"
# [[4]]$discarded
# [1] "Julia (note to self: revisit in a year)"

So one can query this over-the-top structure with code like R.2012[[3]]$didnt.use.at.all to learn [1] "Emacs", but you already knew that, didn’t you?

Despite all my complaints, monologuing about other languages and overall frustration, R has served me well. It’s just that I’d be disappointed if I were still using it a lot in ten years’ time.

Gratuitous picture: building blocks for research (Photo: Luis).

Of course there was a lot more than R and stats this year. For example, the blogs I read most often have nothing to do with either topic: Isomorphismes (can’t define it), The music of sound (sound design) and Offsetting behaviour (economics/politics in NZ). In fact, I need to read about a broad range of topics to feel human.

P.S. Incidentally, my favorite R function this year was subset(); I’ve been subsetting like there is no tomorrow. By the way, you are welcome to browse around the blog and subset whatever you like.