A list of interesting R/Stats quickies to keep the mind distracted:

• A long draft of Advanced Data Analysis from an Elementary Point of View by Cosma Shalizi, in which he uses R to drive home the message. Not your average elementary point of view.
• Good notes by Frank Davenport on getting started with R using data from a Geographic Information System (GIS). Read this to get a general idea of how things fit together.
• If you are into maps, Omnia sunt Communia! provides many good tips on producing them using R.
• Mark James Adams reminded us that Prediction ≠ Understanding, probably inspired by Dan Gianola’s course on Whole Genome Prediction. He is a monster of Bayesian applications to genetic evaluation.
• If you are into data/learning visualization you have to watch Bret Victor’s presentation on Media for thinking the unthinkable. He is so far ahead of what we normally do that it is embarrassing.
• I follow mathematician Atabey Kaygun on Twitter and since yesterday I’ve been avidly reading his coverage of the protests in Turkey. Surely there are more important things going on in the world than the latest R gossip.

I’m marking too many assignments right now to have enough time to write something more substantial. I can see the light at the end of the tunnel though.

Winter flying between Blenheim and Christchurch (Photo: Luis, click to enlarge).

I was working with a small experiment which includes families from two Eucalyptus species and thought it would be nice to code a first analysis using alternative approaches. The experiment is a randomized complete block design, with species as a fixed effect and family and block as random effects, while the response variable is growth strain (in $$\mu \epsilon$$).

When looking at the trees one can see that the residual variances will be very different. In addition, the trees were growing in plastic bags laid out in rows (the blocks) and columns. Given that the trees were growing in bags sitting on flat terrain, most likely the row effects are zero.
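In model form, the analysis below fits something like (a sketch, using generic symbols): $$y_{ijk} = \mu + S_i + F_j + B_k + \varepsilon_{ijk}$$, where $$S_i$$ is the fixed species effect, $$F_j$$ and $$B_k$$ are the random family and block (row) effects, and the residuals $$\varepsilon_{ijk}$$ are allowed a different variance for each species.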

Below is the code for a first go in R (using both MCMCglmm and ASReml-R) and SAS. I had stopped using SAS several years ago, mostly because I was running a Mac, for which there is no version. However, a few weeks ago I started accessing it via their OnDemand for Academics program through a web browser.

The R code using REML looks like:

# Options
options(stringsAsFactors = FALSE)
setwd('~/Dropbox/research/2013/growth-stress')

# Packages
require(ggplot2)
require(asreml)
require(MCMCglmm)

# Reading data, renaming, factors, etc
gs = read.csv('eucalyptus-growth-stress.csv')
summary(gs)

# Both argophloia and bosistoana
gsboth = subset(gs, !is.na(strain))
gsboth = within(gsboth, {
    species = factor(species)
    row = factor(row)
    column = factor(column)
    fam = factor(fam)
})

ma = asreml(strain ~ species,
            random = ~ fam + row,
            rcov = ~ at(species):units,
            data = gsboth)

summary(ma)$varcomp
#                                    gamma  component std.error   z.ratio constraint
# fam!fam.var                    27809.414  27809.414 10502.036 2.6480022   Positive
# row!row.var                     2337.164   2337.164  3116.357 0.7499666   Positive
# species_E.argopholia!variance 111940.458 111940.458 26609.673 4.2067580   Positive
# species_E.bosistoana!variance  63035.256  63035.256  7226.768 8.7224681   Positive

While using MCMC we get estimates in the ballpark by using:

# Priors
bothpr = list(R = list(V = diag(c(50000, 50000)), nu = 3),
              G = list(G1 = list(V = 20000, nu = 0.002),
                       G2 = list(V = 20000, nu = 0.002),
                       G3 = list(V = 20000, nu = 0.002)))

# MCMC
m2 = MCMCglmm(strain ~ species,
              random = ~ fam + row + column,
              rcov = ~ idh(species):units,
              data = gsboth,
              prior = bothpr,
              pr = TRUE,
              family = 'gaussian',
              burnin = 10000,
              nitt = 40000,
              thin = 20,
              saveX = TRUE,
              saveZ = TRUE,
              verbose = FALSE)

summary(m2)
# Iterations = 10001:39981
# Thinning interval = 20
# Sample size = 1500
#
# DIC: 3332.578
#
# G-structure: ~fam
#
#     post.mean l-95% CI u-95% CI eff.samp
# fam     30315    12211    55136     1500
#
# ~row
#
#     post.mean l-95% CI u-95% CI eff.samp
# row      1449    5.928     6274    589.5
#
# R-structure: ~idh(species):units
#
#                    post.mean l-95% CI u-95% CI eff.samp
# E.argopholia.units    112017    71152   168080     1500
# E.bosistoana.units     65006    52676    80049     1500
#
# Location effects: strain ~ species
#
#                     post.mean l-95% CI u-95% CI eff.samp pMCMC
# (Intercept)            502.21   319.45   690.68     1500 <7e-04 ***
# speciesE.bosistoana   -235.95  -449.07   -37.19     1361  0.036 *

The SAS code is not that dissimilar, except for the clear demarcation between data processing (the data step: reading files, data transformations, etc) and specific procs (procedures), in this case to summarize the data, produce a boxplot and fit a mixed model.

* termstr=CRLF accounts for the Windows-like line endings of the data set;
data gs;
    infile "/home/luis/Projects/euc-strain/growthstresses.csv"
           dsd termstr=CRLF firstobs=2;
    input row column species $ family $ strain;
    if strain ^= .;
run;

proc summary data = gs print;
    class species;
    var strain;
run;

proc boxplot data = gs;
    plot strain*species;
run;

proc mixed data = gs;
    class row species family;
    model strain = species;
    random row family;
    repeated species / group=species;
run;

/*
Covariance Parameter Estimates
Cov Parm    Group               Estimate
row                              2336.80
family                             27808
species     species E.argoph      111844
species     species E.bosist       63036
*/

SAS boxplot for the data set.

I like working with multiple languages and I realized that, in fact, I missed SAS a bit. It was like meeting an old friend; at the beginning it felt strange, but we were quickly chatting away after a few minutes.
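Before leaving the mixed models, a quick sanity check that REML and MCMC agree is to look at the chains themselves. A minimal sketch using the m2 object fitted above (posterior.mode() and HPDinterval() ship with MCMCglmm/coda):

# Trace and density plots of the variance component chains
plot(m2$VCV)

# Posterior modes and 95% credible intervals; these should sit
# in the same ballpark as the REML estimates from asreml
posterior.mode(m2$VCV)
HPDinterval(m2$VCV)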
Snowless December in Uppsala (Photo: Luis, click to enlarge).

At School we use R across many courses, because students are supposed to use statistics in a variety of contexts. Imagine their disappointment when they pass stats and discover that R and statistics haven’t gone away! When students start working with real data sets, one of their first stumbling blocks is subsetting data.

We have data sets, and either students are required to deal with different subsets or there is data cleaning to do. For some reason, many struggle with what should be a simple task. If one thinks of data as a matrix/2-dimensional array, subsetting boils down to extracting the needed rows (cases) and columns (variables). In the R world one can do this in a variety of ways, ranging from the cryptic to the explicit and clear. For example, let’s assume that we have a dataset called alltrials with heights and stem diameters for a number of trees in different trials (we may have a bunch of additional covariates and experimental design features that we’ll ignore for the moment). How do we extract all trees located in Christchurch? Two common approaches are:

mytrial = alltrials[alltrials$location == 'Christchurch', ]

mytrial = subset(alltrials, location == 'Christchurch')

Both expressions are equivalent, but the former reads like Klingon to students, while the latter makes explicit that we are obtaining a subset of the original data set. This can easily be expanded to more complex conditions; for example, to include all trees from Christchurch that are taller than 10 m:

mytrial = alltrials[alltrials$location == 'Christchurch' & alltrials$height > 10, ]

mytrial = subset(alltrials, location == 'Christchurch' & height > 10)

I think the complication with the Klingonian notation comes mostly from two sources:

• Variable names for subsetting the data set are not directly accessible, so we have to prefix them with NameOfDataset$, making the code more difficult to read, particularly if we join several conditions with & and |.
• Hanging commas: if we are only working with rows or columns we have to acknowledge it by suffixing or prefixing with a comma, which is often forgotten.

Both points result in frustrating error messages (reproduced in the sketch after this list), like:

• Error in `[.data.frame`(alltrials, location == 'Christchurch', ) : object 'location' not found for the first point, or
• Error in `[.data.frame`(alltrials, alltrials$location == 'Christchurch') : undefined columns selected for the second point.
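Both mistakes are easy to reproduce (a minimal sketch, assuming the alltrials dataset used above):

# Forgetting the NameOfDataset$ prefix triggers the first error
mytrial = alltrials[location == 'Christchurch', ]
# Error: object 'location' not found

# Forgetting the hanging comma triggers the second error
mytrial = alltrials[alltrials$location == 'Christchurch']
# Error: undefined columns selected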

The generic forms of these two notations are:

dataset[what to do with rows, what to do with columns]

subset(dataset, what to do with rows, what to do with columns)

We often want to retain a subset of the observed cases and keep (or drop) specific variables. For example, we want to keep the trees in 'Christchurch' but ignore diameter, because the assessor was 'high' that day:

# With this notation things get a bit trickier
# The easiest way is to provide the number of the variable
# Here diameter is the 5th variable in the dataset
mytrial = alltrials[alltrials$location == 'Christchurch' & alltrials$height > 10, -5]

# This notation is still straightforward
mytrial = subset(alltrials, location == 'Christchurch' & height > 10,
                 select = -diameter)
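If you would rather not hard-code the column number (it silently breaks as soon as the variables change order), the Klingon notation can also drop the column by name. A small sketch, assuming the same alltrials dataset:

# Drop diameter by name rather than by position
mytrial = alltrials[alltrials$location == 'Christchurch' & alltrials$height > 10,
                    names(alltrials) != 'diameter']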

There are, however, situations where Klingon is easier or more efficient. For example, to take a random sample of 100 trees from the full dataset:

mytrial = alltrials[sample(1:nrow(alltrials), 100, replace = FALSE), ]
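If the sample needs to be reproducible (handy when marking assignments), set the random seed first. A minimal sketch, where the seed value is arbitrary:

# Fix the random number generator so the sample is repeatable
set.seed(42)
mytrial = alltrials[sample(1:nrow(alltrials), 100, replace = FALSE), ]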

If you are interested in this issue, Quick-R has a good description of subsetting. I'm sure this basic topic must have been covered many times, although I doubt anyone used Klingon in the process.

Gratuitous picture: building blocks for R (Photo: Luis).

Suburbia bus stop, Christchurch (Photo: Luis, click to enlarge).

This morning Radio New Zealand covered a story (audio) where Tomatoes New Zealand (TNZ, the growers association) was asking the Government to introduce compulsory labeling for irradiated products (namely imported Australian tomatoes), stating that consumers deserve an informed choice (TNZ Press Release). Two points that I think merit attention:

• Food irradiation is perfectly safe: it does not make food radioactive, it does not alter the nutritional value of food, and it reduces (or completely eliminates) the presence of microorganisms that cause disease or pests (the latter being the reason for irradiation in this case).
• The second point is that the call for labeling does not come from consumers (or an organization representing them) but from producers that face competition.

This situation reminded me of Milton Friedman talking about professional licenses: ‘The justification offered is always the same: to protect the consumer. However, the reason is demonstrated by observing who lobbies’. Who is doing the lobbying is very telling in this case, particularly because there is no real reason to induce fear in the consumer, except that irradiation sounds too close to radioactive, and therefore TNZ is hoping to steer consumers away from imported tomatoes. Given that TNZ is for informing the consumer, they could label tomatoes with ‘many of the characteristics in this tomato are the product of mutations’. Harmless but scary.

P.S. The media is now picking up the story. Shameful manipulation.
P.S.2 Given that TNZ is in favor of the right to know, I want a list of all chemicals used in the production of New Zealand tomatoes, details of their water management, and the employment practices of tomato growers.

I was fascinated by the shape of the power cords (Photo: Luis, click to enlarge).

It used to be that one of the first decisions to make when learning to program was between compiled (e.g. C or FORTRAN) and interpreted (e.g. Python) languages. In my opinion, these days one would have to be a masochist to learn with a compiled language: the extra compilation time and obscure errors are a killer when learning.

Today the decision would be between using a generic interpreted language (e.g. Python) and an interpreted domain specific language (DSL) like R, MATLAB, etc. While some people prefer generic languages, I’d argue that immediate feedback and easy accomplishment of useful tasks are a great thing when one is learning something for the first time.

As an example, a while ago my son asked me what I was doing on the computer and I told him that I was programming some analyses in R. I showed him that the program was spitting back some numbers and plots, a fact that he found totally unremarkable and uninteresting. I searched the internet and found Scratch, a visual programming language that lets the user move blocks representing code around and build interactive games: now my son was sold. Together we are learning about loops, control structures and variables, drawing characters, etc. We are programming because the problems are (i) much more interesting for him and (ii) achievable in a short time frame.

An example Scratch script.

Learning to program for statistics, or other scientific domains for that matter, is not that different from being a kid and learning programming. Having to do too much to get even a mildly interesting result is frustrating and discouraging; it is not that the learner is dumb, but that he has to build too many functions to get a meager reward. This is why I’d say that you should use whatever language already has a large amount of functionality (‘batteries included’ in Python parlance) for your discipline. Choose rightly and you are half-way there.

‘But’, someone will say, ‘R is not a real language’. Sorry, it is a real language (Turing complete and the whole shebang), with oddities like any other language, granted. As with human languages, the more you study the easier it gets to learn a new one. In fact, the syntax for many basic constructs in R is highly similar to the alternatives:

# This is R code
# Loop
for(i in 1:10) {
    print(i)
}
#[1] 1
#[1] 2
#[1] 3
#[1] 4
#[1] 5
#[1] 6
#[1] 7
#[1] 8
#[1] 9
#[1] 10

# Control
if(i == 10) {
    print('It is ten!')
}
#[1] "It is ten!"
# This is Python code
# Loop
for i in range(1, 11):  # Python counts from zero and range excludes the end point
    print(i)
#1
#2
#3
#4
#5
#6
#7
#8
#9
#10

# Control
if i == 10:
    print("It is ten!")
#It is ten!

By the way, I formatted the code to highlight similarities. Of course there are plenty of differences between the languages, but many of the building blocks (the core ideas if you will) are shared. You learn one and the next one gets easier.
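The same goes for the next building block most people meet: writing a small function. A tiny sketch in R (the function name and values are made up for illustration):

# Defining and calling our own function
cube = function(x) {
    x^3
}
cube(3)
#[1] 27

Python's def statements follow the same pattern: a name, arguments, a body and a value coming back.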

How does one start learning to program? If I look back at 1985 (yes, last millennium) I was struggling to learn how to program in BASIC when, suddenly, I had an epiphany. The sky opened and I heard a choir singing György Ligeti’s Atmosphères (from 2001: A Space Odyssey, you know) and then I realized: we are only breaking problems into little pieces and dealing with them one at a time. I’ve been breaking problems into little pieces ever since. What else did you expect? Anyhow, if you feel that you want to learn how to code in R, or whatever language is the best option for your problems, start small. Just a few lines will do. Read other people’s code but, again, only small pieces that are supposed to do something small. At this stage it is easy to get discouraged because everything is new and takes a lot of time. Don’t worry: everyone has to go through this process.
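Something like this already counts as a program, and it is the kind of ‘few lines’ I have in mind (the numbers are made up):

# A perfectly respectable first few lines of R
heights = c(12.1, 15.3, 9.8, 14.2)  # made-up tree heights in metres
mean(heights)
#[1] 12.85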

Many people struggle with vectorizing their code so it runs faster. Again, don’t worry at the beginning if your code is slow. Keep on writing and learning. Read Noam Ross’s FasteR! HigheR! StrongeR! — A Guide to Speeding Up R Code for Busy People. The guide is very neat and useful, although you don’t need super fast code, not yet at least. You need working, readable and reusable code. You may not even need code: first try to understand the problem and the statistical theory before you start coding like crazy. Programming will help you understand your problem a lot better, but you need a helpful starting point.
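To make the vectorization point concrete, here is the same small job written both ways (a sketch; the squares are just an example):

# Loop version
squares = numeric(10)
for(i in 1:10) {
    squares[i] = i^2
}

# Vectorized version: same result, no explicit loop
squares = (1:10)^2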

Don’t get distracted with the politics of research and repeatability, or trendy things like git (noticed that I didn’t even link to them?). You’ll learn them in time, once you have a clue about how to program.

P.S. The code used in the examples could be shortened and sped up dramatically (e.g. 1:10 or range(1, 11)) but it is not the point of this post.
P.S.2. A while ago I wrote R is a language, which could be useful connected to this post.

It was a rainy day in Christchurch. Looking forward to receiving the new windscreen for the recorder.