An R wish list for 2013

First go and read An R wish list for 2012. None of the wishes came true in 2012. Fix the R website? No, it is the same this year. In fact, it is the same as in 2005. Easy to find help? Sorry, next year. Consistency and sane defaults? Coming soon to a theater near you (one day). Thus, very conveniently, my wish list for 2012 is still the wish list for 2013.

R as social software

The strength of R is not the software itself, but the community surrounding the software. Put another way, there are several languages that could offer the core functionality, but the whole ‘ecosystem’ is another matter entirely. Softening @gappy3000's comment: innovation is (mostly) happening outside the core.

This prompts some questions: Why isn’t ggplot2 or plyr in the default download? I don’t know if people realize that ggplot2 is now one of the main attractions of R as a data visualization language. Why isn’t Hadley’s name on this page? (Sorry for picking on him; his was the first name that came to mind.) How come there is not one woman on that page? I’m not saying there is an evil plan, but I wonder if (and how) the site and core reflect the R community and its diversity of interests (and uses). I’m also wondering what the process is to raise these questions beyond a blog post. Perhaps on the developers’ mailing list?

I think that, in summary, my R wish for 2013 is that ‘The R project’—whoever that is—recognizes that the project is much more than the core download. I wish the list of contributors went beyond the fairly small number of people with write access to the source. I’d include those who write packages, those who explain, those who market and, yes, those who sell R. Finally, I wish all readers of Quantum Forest a great 2013.

Entry point to the R world. Same as ever.

P.S. Just in case: no, I’m not suggesting that I should be included in any list.

My R year

End-of-year posts are corny but, what the heck, I think I can let myself delve into corniness once a year. The following code gives a snapshot of what R was like for me in 2012.

outside.packages.2012 <- list(used.the.most = c('asreml', 'ggplot2'),
                              largest.use.decline = c('MASS', 'lattice'),
                              same.use = c('MCMCglmm', 'lme4'),
                              would.like.use.more = 'JAGS')
 
skill.level <- list(improved = 'fewer loops (plyr and do.call())',
                    unimproved = c('variable.naming (Still an InConsistent mess)', 
                                   'versioning (still hit and miss)'))
 
interfaces <- list(most.used = c('RStudio', 'plain vanilla R', 'text editor (Textmate and VIM)'),
                   didnt.use.at.all = 'Emacs')
 
languages <- list(for.inquisition = c('R', 'Python', 'Javascript'),
                  revisiting = 'J',
                  discarded = 'Julia (note to self: revisit in a year)')
 
(R.2012 <- list(outside.packages.2012, 
                skill.level, 
                interfaces, 
                languages))
 
# [[1]]
# [[1]]$used.the.most
# [1] "asreml"  "ggplot2"
 
# [[1]]$largest.use.decline
# [1] "MASS"    "lattice"
 
# [[1]]$same.use
# [1] "MCMCglmm" "lme4"    
 
# [[1]]$would.like.use.more
# [1] "JAGS"
 
 
# [[2]]
# [[2]]$improved
# [1] "fewer loops (plyr and do.call())"
 
# [[2]]$unimproved
# [1] "variable.naming (Still an InConsistent mess)"
# [2] "versioning (still hit and miss)"             
 
 
# [[3]]
# [[3]]$most.used
# [1] "RStudio"                        "plain vanilla R"               
# [3] "text editor (Textmate and VIM)"
 
# [[3]]$didnt.use.at.all
# [1] "Emacs"
 
 
# [[4]]
# [[4]]$for.inquisition
# [1] "R"          "Python"     "Javascript"
 
# [[4]]$revisiting
# [1] "J"
 
# [[4]]$discarded
# [1] "Julia (note to self: revisit in a year)"

So one can query this over-the-top structure with code like R.2012[[3]]$didnt.use.at.all to learn [1] "Emacs", but you already knew that, didn’t you?

Despite all my complaints, monologuing about other languages and overall frustration, R has served me well. It’s just that I’d be disappointed if I were still using it a lot in ten years’ time.

Gratuitous picture: building blocks for research (Photo: Luis, click to enlarge).

Of course there was a lot more than R and stats this year. For example, the blogs I read most often have nothing to do with either topic: Isomorphismes (can’t define it), The music of sound (sound design), Offsetting behaviour (economics/politics in NZ). In fact, I need to read about a broad range of topics to feel human.

P.S. Incidentally, my favorite R function this year was subset(); I’ve been subsetting like there is no tomorrow. By the way, you are welcome to browse around the blog and subset whatever you like.
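
Just to show the sort of thing I mean, a minimal, made-up example (the data frame and values are invented purely for illustration):

# Hypothetical data frame, just for illustration
trees <- data.frame(site = rep(c('A', 'B'), each = 5),
                    diam = c(21, 18, 23, 19, 25, 20, 22, 17, 24, 16))
 
# Keep only the site A trees with diameter over 20
subset(trees, site == 'A' & diam > 20)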

R for inquisition

A post on high-dimensional arrays by @isomorphisms reminded me of APL and, more generally, of matrix languages, which took me back to inquisitive computing: computing not in the sense of software engineering, or databases, or formats, but of learning by poking problems through a computer.

I like languages not because I can get a job by using one, but because I can think thoughts and express ideas through them. The way we think about a problem is somehow molded by the tools we use: if we have loops, loops we use; if we have a terse matrix notation (see my previous post on Matrix Algebra Useful for Statistics), we may use that instead.

I used APL fairly briefly but I was impressed by some superficial aspects (hey, that’s a weird set of characters that needs a keyboard overlay) and some deeper ones (this is an actual language, cool PDF paper). The APL revolution didn’t happen, at least not directly, but it had an influence over several other languages (including R). Somehow as a group we took a different path from ‘Expository programming’, but I think that we have to recover at least part of that ethos, programming for understanding the world.

While I struggle with R frustrations many times, it is now my primary language for inquisitive computing, although sometimes I dive into something else. I like Mathematica, but can access it only while plugged into the university network (license limits). Python is turning into a great scientific computing environment, although it still feels like sellotape is holding it together. J is like APL without the Klingon keyboard.

If anything, dealing with other ways of doing things leads to a better understanding of one’s primary language. Idioms that seem natural acquire a new sense of weirdness when compared to other languages. R’s basic functionality gives an excellent starting point for inquisitive computing but don’t forget other languages that can enrich the way we look at problems.

I am curious about people’s favorite languages for inquisitive computing.

Gratuitous picture: inquisition, Why bloody trees grow like this? (Photo: Luis, click to enlarge).

Matrix Algebra Useful for Statistics

I was having a conversation with an acquaintance about courses that were particularly useful in our work. My forestry degree involved completing 50 compulsory + 10 elective courses; if I had to choose the ones that were influential and/or really useful they would be Operations Research, Economic Evaluation of Projects, Ecology, three Calculus courses and two Algebra courses. Subsequently my PhD was almost entirely research based, but I sort of did Matrix Algebra: Dorian lent me his copy of Searle’s Matrix Algebra Useful for Statistics and passed me a pile of assignments that Shayle Searle used to give in his course at Cornell. I completed the assignments at my own pace and then sat a crazy 24-hour take-home exam.

Later that year I bought a cloth-bound 1982 version of the book, not the alien-vomit-purple paperback reprint currently on sale, which I consult from time to time. Why would one care about matrix algebra? Besides being a perfectly respectable intellectual endeavor in itself, maybe you want to see that the degrees of freedom are the rank of a quadratic form; you can learn from this book what a quadratic form and a matrix rank are. Or you want to see more clearly the relationship between regression and ANOVA, because in matrix form a linear model is a linear model is a linear model. Commands for the outer, inner and Kronecker products make a lot more sense once you know what an outer product and an inner product of vectors are. Thus, if you really want to understand a matrix language for data analysis and statistics (like R), it seems reasonable to try to understand the building blocks of such a language.
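
In R those operations are only a couple of keystrokes away; a toy sketch (the vectors are made up just to show the functions):

a <- 1:3
b <- c(2, 0, 1)
 
outer(a, b)      # outer product: a 3 x 3 matrix
t(a) %*% b       # inner (dot) product: a 1 x 1 matrix
kronecker(a, b)  # Kronecker product of the two vectors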

The book does not deal with any applications to statistics until chapter 13. Before that it is all about laying foundations to understand the applications, but do not expect nice graphs and cute photos. This is a very good text where one has to use the brain to imagine what’s going on in the equations and demonstrations. The exercises rely a lot on ‘prove this’ and ‘prove that’, which leads to much frustration and, after persevering, to many ‘aha! moments’.

XKCD 1050: In your face! Actually I feel the opposite concerning math.

I am the first to accept that I have a biased opinion about this book, because it has sentimental value. It represents difficult times, dealing with a new language, culture and, on top of that, animal breeding. At the same time, it opened doors to a whole world of ideas. This is much more than I can say of most books.

PS 2012-12-17: I have commented on a few more books in these posts.

A good part of my electives were in humanities (history & literature), which was unusual for forestry. I just couldn’t conceive going through a university without doing humanities.

When R, or any other language, is not enough

This post is tangential to R, although R has a fair share of the issues I mention here, which include research reproducibility, open source, paying for software, multiple languages, salt and pepper.

There is an increasing interest in the reproducibility of research. In many topics we face multiple, often conflicting claims, and as researchers we value the ability to evaluate those claims, including repeating/reproducing research results. While I share the interest in reproducibility, sometimes I feel we obsess too much over only one part of the research process: the statistical analysis. Even there, many people focus not on the models per se, but only on the code for the analysis, which should only use tools that are free of charge.

There has been enormous progress in the R world on literate programming, where the combination of RStudio + Markdown + knitr has made analyzing data and documenting the process almost enjoyable. Nevertheless, and here is the BUT coming, there is a large difference between making the code repeatable and making research reproducible.
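
As a rough sketch of what that workflow looks like, the contents of a tiny, hypothetical report.Rmd could be:

A sentence of *Markdown* prose describing the analysis.

```{r cars-summary}
summary(cars)  # code, output and figures get woven into the final document
```

Knitting the file (the Knit button in RStudio, or knitr::knit('report.Rmd') at the console) then produces a document with text, code and results interleaved.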

As an example, I am currently working on a project that relies on two trials, which have taken a decade to grow. We took a few hundred increment cores from a sample of trees and processed them using a densitometer, an X-ray diffractometer and a few other lab toys. By now you get the idea: actually replicating the research may take quite a few resources before you even start to play with free software. At that point, of course, I want to be able to get the most out of my data, which means that I won’t settle for a half-assed model because the software is not able to fit it. If you think about it, spending a couple of grand on software (say ASReml and Mathematica licenses) doesn’t sound outrageous at all. Furthermore, reproducing this piece of research would require a decade, access to the genetic material and the lab toys. I’ll give you the code for free, but I can’t give you ten years or $0.25 million…

In addition, the research process may require linking disparate sources of data, for which other languages (e.g. Python) may be more appropriate. Sometimes R is the perfect tool for the job, while other times I feel like we have reached peak VBS (Visual Basic Syndrome) in R: people want to use it for everything, even when it’s a bad idea.

In summary,

  • research is much more than a few lines of R (although they are very important),
  • even when considering only data collection and analysis, it is a good idea to know more than a single language/software, because it broadens analytical options, and
  • I prefer free (freedom+beer) software for research; however, I rely on non-free, commercial software for part of my work because it happens to be the best option for specific analyses.

Disclaimer: my primary analysis language is R and I often use lme4, MCMCglmm and INLA (all free). However, many (if not most) of my analyses that use genetic information rely on ASReml (paid, not open source). I’ve used Mathematica, Matlab, Stata and SAS for specific applications with reasonably priced academic licenses.

Gratuitous picture: 3000 trees leaning in a foggy Christchurch day (Photo: Luis, click to enlarge).

R pitfalls #4: redefining the basics

I try to be economical when writing code; for example, I tend to use single quotes over double quotes for characters because it saves me one keystroke. One area where I don’t do that is when typing TRUE and FALSE (R accepts T and F as well), just because they are clearer to see in code and syntax highlighting kicks in. That’s why I was surprised to read in Jason Morgan’s post that it is possible to redefine T and F and get undesirable behavior.

Playing around, it is quite easy to redefine other ‘fundamental constants’ in R. For example, I posted on Twitter:

> pi
[1] 3.141593
> pi <- 2
> pi*2
[1] 4

Ouch, dangerous! I tend to muck around with matrices quite a bit and, being a friend of parsimony, I often use capital letters to represent them. This would have eventually bitten me if I had used the abbreviated TRUE and FALSE. As Kevin Ushey replied to my tweet, one can redefine even basic functions like ‘+’ and be pure evil; over the top, sure, but possible.
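
A couple of contrived examples of how this bites, run in a throwaway session:

T <- 0                          # masks the abbreviation for TRUE
mean(c(1, 2, NA), na.rm = T)    # na.rm is now 0 (i.e. FALSE), so this returns NA
 
'+' <- function(a, b) a - b     # pure evil: '+' now subtracts
2 + 2                           # [1] 0
 
rm(T, '+')                      # undo the damage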

Clown faces: sometimes coding is scary (Photo: Luis).

Multisite, multivariate genetic analysis: simulation and analysis

The email wasn’t a challenge but a simple question: Is it possible to run a multivariate analysis in multiple sites? I was going to answer yes, of course, and leave it at that, but that would be a cruel, unsatisfying answer. We can get a better handle on the question if we use a simple example; let’s assume that we have two traits (call them tree stem diameter and stem density) assessed in two sites (localities).

Because this is genetics we have a family structure; let’s say half-siblings, so trees within a family share only their mother, and we will ignore any experimental design features to keep things simple. We have 100 families, with 30 trees each, in sites A and B, for a total of 6,000 trees (100 x 30 x 2). The data could look like this:

site family tree diam dens
 A     1     1   20    398
 A     1     2   19    400
...
 A    100   30   24    375
...
 B     1    1    25    396
...
 B    100   30   23    403

We can also think of a trait measured in two sites as separate, but correlated, traits. For example, diameter in site 1 (diam1) is a different variable from diameter in site 2 (diam2). So now we have four response variables (diam1, dens1, diam2, dens2), two of which have only missing values in a given site:

site family tree diam1 dens1 diam2 dens2
 A     1     1    20    398   NA    NA
 A     1     2    19    400   NA    NA
...
 A    100   30    24    375   NA    NA
...
 B      1    1    NA    NA    25    396
...
 B    100   30    NA    NA    23    403

All variables are correlated at the genetic level with an unstructured G covariance matrix, while at the residual level they are correlated only within-site (the same tree was assessed for both traits) and there is zero correlation between sites, because one site could be near Christchurch while the other is near Dunedin, for example.

###
### Data generation
###
 
# Simulate 100 half-sib families with 30 trees each
# for the four traits in two sites
n.fam <- 100
n.prog <- 30
n.trait <- 4
n.site <- 2
 
# Couple of variance matrices,
# G is for additive genetic effects
# R is for residuals
# All variances are assumed equal to 1 without loss
# of generality
G <- matrix(c(1, 0.7, 0.5, 0.3,
              0.7, 1, 0.2, 0.1,
              0.5, 0.2, 1, 0.8,
              0.3, 0.1, 0.8, 1), 4, 4)
 
R <- matrix(c(1, 0.3, 0, 0,
              0.3, 1, 0, 0,
              0, 0, 1, 0.2,
              0, 0, 0.2, 1), 4, 4)
 
 
# Between-family variances account for 1/4 of additive
# genetic effects. Within-family account for 3/4 of
# additive + residual.
# We also get the Cholesky decomposition in the same step
BFL <- chol(1/4 * G)
WFL <- chol(3/4 * G + R)
 
# Simulate random family effects for four traits
fam.eff <- t(BFL) %*% matrix(rnorm(n.fam*n.trait), 
                             n.trait, n.fam)
fam.eff <- t(fam.eff) # This is 100 x 4
 
tre.eff <- t(WFL) %*% matrix(rnorm(n.prog*n.fam*n.trait), 
                             n.trait, n.prog*n.fam)
tre.eff <- t(tre.eff) # This is 3000 x 4
 
# Expand family effects matrix (n.prog each) to match 
# dimension of tree effects
pheno <- fam.eff[rep(1:dim(fam.eff)[1], each = n.prog),] + tre.eff
 
# Now to 2 sites
pheno2s <- matrix(NA, n.prog*n.fam*n.site, n.trait)
colnames(pheno2s) <- c('diam1', 'dens1', 'diam2', 'dens2')
 
pheno2s[1:3000, 1:2] <- pheno[1:3000, 1:2]
pheno2s[3001:6000, 3:4] <- pheno[1:3000, 3:4]
 
# Creating data set. Family and tree are shamelessly recycled
sim.data <- data.frame(site = factor(rep(c('A', 'B'), each = n.fam*n.prog)),
                       family = factor(rep(1:n.fam, each = n.prog)),
                       tree = factor(1:30),
                       pheno2s)

Some neat things:

  • Data simulation relies on Cholesky decomposition, as explained in this post over a year ago (time flies!). Generation is a bit more complex this time because we have to account for two, rather than one, covariance structures.
  • Another point with data simulation is that one could generate one set of correlated values at a time by using something like t(BFL) %*% rnorm(n.trait) and looping over families (or using apply()). This would require much less memory, but would also be much slower.
  • We need to repeat each line of the family effects matrix 30 times so we can add it to the individual tree effects. Often we use indexing in matrices or data frames to extract a few rows; here, instead, I index with rep() to repeat each row a given number of times (see the tiny example after this list).
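
A tiny example of that indexing trick, using a made-up 2 x 2 matrix:

m <- matrix(1:4, 2, 2)
m[rep(1:nrow(m), each = 3), ]  # each row repeated 3 times, giving a 6 x 2 matrix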

If we show observations 2990 to 3010 (the last 11 rows for site A and the first 10 for site B) we can see the pattern of observations below, which will have the required structure for a US (or, equivalently, heterogeneous-variance correlation) matrix for G and a block-diagonal matrix for R. By the way, one could also have block-diagonal matrices with repeated measurements, although probably with a different correlation structure:

> pheno2s[2990:3010,]
            diam1      dens1      diam2       dens2
 [1,]  0.57087250 -0.8521059         NA          NA
 [2,]  0.94859621  0.6599391         NA          NA
 [3,] -3.37405451 -0.6093312         NA          NA
 [4,]  0.93541048 -0.7977893         NA          NA
 [5,] -0.74758553  0.7962593         NA          NA
 [6,]  0.51280201  1.4870425         NA          NA
 [7,] -1.92571147 -0.2554365         NA          NA
 [8,] -1.15923045 -2.0656582         NA          NA
 [9,] -0.82145904 -0.3138340         NA          NA
[10,]  1.79631670  0.3814723         NA          NA
[11,] -0.01604778 -1.1804723         NA          NA
[12,]          NA         NA -0.1436143 -1.97628883
[13,]          NA         NA -2.7099687 -2.93832962
[14,]          NA         NA -2.5153420  0.73780760
[15,]          NA         NA  0.3084056  0.61696714
[16,]          NA         NA -0.2909500  0.78111864
[17,]          NA         NA -1.8629862 -2.19346309
[18,]          NA         NA  0.8673053 -0.07692884
[19,]          NA         NA -0.1459703  0.36981965
[20,]          NA         NA -0.7688851 -0.96765799
[21,]          NA         NA  0.6637173 -0.34924814

Gratuitous picture: driving in rural Japan, the cultural value of wood and breaking preconceptions (Photo: Luis, click to enlarge).

For the analysis we will use ASReml, which is the Pro option if you work in a breeding program and require solving much larger systems of equations than this tiny example (say 50-100 traits, large pedigrees, etc). Another option for playing with this small data set would be to use MCMCglmm, which also allows for multivariate evaluation of linear mixed models.

###
### Multivariate analysis
###
 
require(asreml)
 
# Assuming zero correlation between traits (equivalent
# to four univariate analyses)
m1 <- asreml(cbind(diam1, dens1, diam2, dens2) ~ trait, 
             random = ~ diag(trait):family, 
             rcov = ~ units:diag(trait),
             data = sim.data)
 
summary(m1)$varcomp
 
#                                 gamma component  std.error   z.ratio constraint
#trait:family!trait.diam1.var 0.2571274 0.2571274 0.04524049  5.683569   Positive
#trait:family!trait.dens1.var 0.2265041 0.2265041 0.04059742  5.579274   Positive
#trait:family!trait.diam2.var 0.2383959 0.2383959 0.04185982  5.695102   Positive
#trait:family!trait.dens2.var 0.2567999 0.2567999 0.04459246  5.758818   Positive
#R!variance                   1.0000000 1.0000000         NA        NA      Fixed
#R!trait.diam1.var            1.8290472 1.8290472 0.04803313 38.078866   Positive
#R!trait.dens1.var            1.7674960 1.7674960 0.04641672 38.078866   Positive
#R!trait.diam2.var            1.6779793 1.6779793 0.04406589 38.078866   Positive
#R!trait.dens2.var            1.7028171 1.7028171 0.04471817 38.078866   Positive
 
# The multivariate analysis allowing for correlation
# is a bit more complex, so we start by generating 
# a structure for starting values
m2.sv <- asreml(cbind(diam1, dens1, diam2, dens2) ~ trait, 
                random = ~ corgh(trait):family, 
                rcov = ~ units:us(trait),
                data = sim.data,
                start.values = TRUE)
 
# Now we'll constraint some R variance components
# to zero and fix them there. These are values for
# which we know they are 0
sv <- m2.sv$gammas.table
sv$Value[c(15, 16, 18, 19)] <- 0
sv$Constraint[c(15, 16, 18, 19)] <- 'F'
 
# Run the analyses with those constraints for the R matrix
# using the restricted values in R.param
m2 <- asreml(cbind(diam1, dens1, diam2, dens2) ~ trait, 
             random = ~ corgh(trait):family, 
             rcov = ~ units:us(trait),
             data = sim.data,
             R.param = sv)
 
summary(m2)$varcomp
#                                                 gamma   component  std.error
# trait:family!trait.dens1:!trait.diam1.cor 0.72535514 0.72535514 0.06381776
# trait:family!trait.diam2:!trait.diam1.cor 0.49100215 0.49100215 0.10067768
# trait:family!trait.diam2:!trait.dens1.cor 0.21445972 0.21445972 0.12085780
# trait:family!trait.dens2:!trait.diam1.cor 0.37476119 0.37476119 0.10975970
# trait:family!trait.dens2:!trait.dens1.cor 0.05221773 0.05221773 0.12439924
# trait:family!trait.dens2:!trait.diam2.cor 0.85356318 0.85356318 0.04321683
# trait:family!trait.diam1                  0.25712744 0.25712744 0.04524049
# trait:family!trait.dens1                  0.22650410 0.22650410 0.04059742
# trait:family!trait.diam2                  0.23839593 0.23839593 0.04185982
# trait:family!trait.dens2                  0.25679989 0.25679989 0.04459246
# R!variance                                1.00000000 1.00000000         NA
# R!trait.diam1:diam1                       1.82904721 1.82904721 0.04803313
# R!trait.dens1:diam1                       0.90450432 0.90450432 0.03737490
# R!trait.dens1:dens1                       1.76749598 1.76749598 0.04641672
# R!trait.diam2:diam1                       0.00000000 0.00000000         NA
# R!trait.diam2:dens1                       0.00000000 0.00000000         NA
# R!trait.diam2:diam2                       1.67797927 1.67797927 0.04406589
# R!trait.dens2:diam1                       0.00000000 0.00000000         NA
# R!trait.dens2:dens1                       0.00000000 0.00000000         NA
# R!trait.dens2:diam2                       0.71249532 0.71249532 0.03406354
# R!trait.dens2:dens2                       1.70281710 1.70281710 0.04471817
#
#                                              z.ratio    constraint
# trait:family!trait.dens1:!trait.diam1.cor 11.3660390 Unconstrained
# trait:family!trait.diam2:!trait.diam1.cor  4.8769710 Unconstrained
# trait:family!trait.diam2:!trait.dens1.cor  1.7744796 Unconstrained
# trait:family!trait.dens2:!trait.diam1.cor  3.4143789 Unconstrained
# trait:family!trait.dens2:!trait.dens1.cor  0.4197593 Unconstrained
# trait:family!trait.dens2:!trait.diam2.cor 19.7507102 Unconstrained
# trait:family!trait.diam1                   5.6835685      Positive
# trait:family!trait.dens1                   5.5792738      Positive
# trait:family!trait.diam2                   5.6951016      Positive
# trait:family!trait.dens2                   5.7588182      Positive
# R!variance                                        NA         Fixed
# R!trait.diam1:diam1                       38.0788655      Positive
# R!trait.dens1:diam1                       24.2008477      Positive
# R!trait.dens1:dens1                       38.0788655      Positive
# R!trait.diam2:diam1                               NA         Fixed
# R!trait.diam2:dens1                               NA         Fixed
# R!trait.diam2:diam2                       38.0788655      Positive
# R!trait.dens2:diam1                               NA         Fixed
# R!trait.dens2:dens1                               NA         Fixed
# R!trait.dens2:diam2                       20.9166565      Positive
# R!trait.dens2:dens2                       38.0788655      Positive

In model 1 each of the family variances was expected to be ~0.25 (1/4 * 1) and each of the residual variances ~1.75 (3/4 * 1 + 1). Once we move to model 2 we also get values similar to the correlations we were trying to simulate. And this is the end of the long answer.
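
A quick way of checking where those expected values come from, using the G and R matrices defined at simulation time:

1/4 * diag(G)             # expected family variances: 0.25 for each trait
3/4 * diag(G) + diag(R)   # expected residual variances: 1.75 for each trait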

P.S. If, for some bizarre reason, you would like to use SAS for this type of analysis, Fikret Isik has proc mixed code to run multivariate genetic analyses.