Review: “Forest Analytics with R: an introduction”

Forestry is the province of variability. From a spatial point of view this variability ranges from within-tree variation (e.g. modeling wood properties) to billions of trees growing in millions of hectares (e.g. forest inventory). From a temporal point of view we can deal with daily variation in a physiological model to many decades in an empirical growth and yield model. Therefore, it is not surprising that there is a rich tradition of statistical applications to forestry problems.

At the same time, the scope of statistical problems is very diverse. As the saying goes forestry deals with “an ocean of knowledge, but only one centimeter deep”, which is perhaps an elegant way of saying a jack of all trades, master of none. Forest Analytics with R: an introduction by Andrew Robinson and Jeff Hamann (FAWR hereafter) attempts to provide a consistent overview of typical statistical techniques in forestry as they are implemented using the R statistical system.

Following the compulsory introduction to the R language and forest data management concepts, FAWR deals mostly with three themes: sampling and mapping (forest inventory), allometry and model fitting (e.g. diameter distributions, height-diameter equations and growth models), and simulation and optimization (implementing a growth and yield model, and forest estate planning). For each area the book provides a brief overview of the problem, a general description of the statistical issues, and then it uses R to deal with one or more example data sets. Because of this structure, chapters tend to stand on their own and guide the reader towards a standard analysis of the problem, with liberal use of graphics (very useful) and plenty of interspersing code with explanations (which can be visually confusing for some readers).

While the authors bill the book as using “state-of-the-art statistical and data-handling functionality”, the most modern applications are probably the use of non-linear mixed-effects models using a residual maximum likelihood approach. There is no coverage of, for example, Bayesian methodologies increasingly present in the forest biometrics literature.

Harvesting Eucalyptus urophylla x E. grandis hybrid clones in Brazil (Photo: Luis).

FAWR reminds me of a great but infuriating book by Italo Calvino (1993): “If on a Winter’s Night a Traveler“. Calvino starts many good stories and, once the reader is hooked in them, keeps on moving to a new one. The authors of FAWR acknowledge that they will only introduce the techniques, but a more comprehensive coverage of some topics would be appreciated. Readers with some experience in the topic may choose to skip the book altogether and move directly to, for example, Pinheiro and Bates (2000) book on Mixed-Effect Models in S and S-Plus and Lumley’s (2010) Complex Surveys: A Guide to Analysis Using R. FAWR is part of the growing number of “do X using R” books that, although useful in the short term, are so highly tied to specific software that one suspects they should come with a best-before date. A relevant question is how much content is left once we drop the software specific parts… perhaps not enough.

The book certainly has redeeming features. For example, Part IV introduces the reader to calling an external function written in C (a growth model), to then combining the results with R functions to create a credible growth and yield forecasting system. Later the authors tackle harvest scheduling through linear programming models, task often addressed using domain-specific (both proprietary and expensive) software. The authors use this part to provide a good case study of model implementation.

At the end of the day I am ambivalent about FAWR. On the one hand, it is possible to find better coverage of most topics in other books or R documentation. On the other, it provides a convenient point of entry if one is lost on how to start working in forest biometrics with R. An additional positive aspect is that the book increases R credibility as an alternative for forest analytics, which makes me wish this book had been around 3 years ago, when I needed to convince colleagues to move our statistics teaching to R.

P.S. This review was published with minor changes as “Apiolaza, L.A. 2012. Andrew P. Robinson, Jeff D. Hamann: Forest Analytics With R: An Introduction. Springer, 2011. ISBN 978-1-4419-7761-8. xv+339 pp. Journal of Agricultural, Biological and Environmental Statistics 17(2): 306-307″ (DOI: 10.1007/s13253-012-0093-y).
P.S.2. 2012-05-31. After publishing this text I discovered that I already used the sentence “[f]orestry deals with variability and variability is the province of statistics” in a blog post in 2009.
P.S.3. 2012-05-31. I first heard the saying “forestry deals with an ocean of knowledge, but only one centimeter deep” around 1994 in a presentation by Oscar García in Valdivia, Chile.
P.S.4. 2012-06-01. Added links to both authors internet presence.

End of May flotsam

The end is near! At least the semester is coming to an end, so students have crazy expectations like getting marks back for assignments, and administrators want to see exam scripts. Sigh! What has been happening meanwhile in Quantum Forest?

  • Tom cracked me up with “…my data is so fucking patchy. I’m zipoissoning the place up like a motherfucker, or something”. I probably need to embark in some zipoissoning, and he was kind enough to send me some links.
  • People keep on kicking this guy called “p-value” when he is still unconscious on the floor. Bob O’Hara declares that p-values are evil. Not funny! John Cook reminds us that “The language of science is the language of probability, and not of p-values.” —Luis Pericchi”. Actually, these days the language of Science is English or whatever passes for English in a press release.
  • Discussion with Mark about the canonical pronunciation for MCMCglmm: mac-mac-glim, em-see-em-see-glim or Luis’s dumb em-see-em-see-gee-el-em-em. We need a press release from Jarrod Hadfield to clear the air!
  • RStudio now supports knitr; I’m looking forward to being able to send email from it. Wait, then it would be like a pretty Emacs.

Unfortunately named Fiat dealer in Southern Brazil. Ideal if you want to zipoisson your way around. Locals told me that it was a German surname, pronounced Fook. Mmh. (Photo: Luis)

  • Did you know? There is life beyond R. Pandas keeps on growing (if Python is your thing). Douglas Bates keeps on digging Julia. I ‘discovered’ Bartosz Milewski‘s blog, which I enjoy reading although I understand a small fraction of what he’s taking about. I came across Bartosz while looking for information on using supercomputers.
  • Data points: “How do you know you have an ageing economy? Adult nappy sales are more than kids’ nappy sales. That’s Japan now.” tweeted Bernard. “Crap!” was my reaction (nappy = diaper for US readers).
  • Feeling frustrated using R? Just go for some language Schadenfreude at Abandon Matlab.
  • Going to Auckland? Our agent Mike has just the place to go “La Voie Française (875 Dom Rd) is worth a trip. Great baguettes, $2 flaky croissants, queue out the door”.
  • Still shaking in Christchurch. Last Monday I was teaching while we had a 5.2 magnitude quake; we kept on going with the lecture.

And that’s my view of the month from down under Christchurch.

On point of view

Often times we experience mental paralysis: we can only see a problem, a situation or a person from a single point of view (mea culpa). Some times the single mindedness of our view point becomes so bad that we inexorably drift to complete silliness. This is the case when one keeps on insisting on a point that has been shown to be, how to put it, wrong.

Photography is a fascinating hobby. I think it was around 30 years ago, may be a bit earlier, that I started taking it more seriously. Learned to process film and to use an enlarger and to witness the magic of an image slowly appearing on paper submerged in developer, while a dim red light bathed the room. A few years later I stopped taking pictures, mostly due to economic problems: I was not able to even buy film, let alone to process it. Photography stayed dormant for many years, then resurfaced in the digital area, but it did not feel the same.

Where was I? Point of view. Now I can see my son taking pictures and something clicked—pun intended—in my mind. His point of view is related to mine but with no composition or any other restrictions (and all pictures come from half a meter below my normal eye level). The “problems” I see are many times not there any more.

We slowly, and often unknowingly, become role models. We transmit at least part of our points of view, which is a really scary thought. The way we move our hands, our tone of voice, our way to view the world are absorbed and learned. Once I understood this I could start polishing my self, try to become more flexible, perhaps even more tolerant, because my point of view—or at least parts of it—will live on, even when coming from half a meter below.

Luis in Sydney Botanical Gardens (Photo: Orlando).

P.S. I posted this text in my other blog on 2011-01-23 and it reappears here as part of my blog consolidation process.

R’s increasing popularity. Should we care?

Some people will say ‘you have to learn R if you want to get a job doing statistics/data science’. I say bullshit, you have to learn statistics and learn to work in a variety of languages if you want to be any good, beyond getting a job today coding in R.

R4stats has a recent post discussing the increasing popularity of R against other statistical software, using citation counts in Google Scholar. It is a flawed methodology, at least as flawed as other methodologies used to measure language popularities. Nevertheless, I think is hard to argue against the general trend: R is becoming more popular. There is a deluge of books looking at R from every angle, thousands of packages and many jobs openings asking for R experience, which prompts the following question:

Should you/I/we care?

First answer: no. I try to use the best tool for the job; which often happens to be R but it can also be Python, SAS or Fortran. It is nice to be able to use the same tool, say R, across a range of problems, but there are occasions when it feels like using Excel for statistics: one can do it, but one knows that it isn’t a great idea. I know good statisticians that prefer R, SAS or Genstat; the tool doesn’t make you good in the same way that I could buy a Rickenbacker 4001 and I wouldn’t play like Geddy Lee.

Second answer: yes. Popularity attracts good people, who develop good packages, making new techniques available first in R. This doesn’t matter if you are into plain vanilla analyses (there is nothing wrong with this, by the way). Popularity + open source means that the system has been ported to a diversity of computer systems. Need R in a supercomputer? Done. R in a mac? Done. R for your strange operating system, for which there are C and Fortran compilers? Download it and compile it. Done. There is also the ‘I’m not crazy aspect’: other people take the software seriously.

Gratuitous picture: Firescapes II, night illuminated by bonfire (Photo: Luis).

I find people learning R because of ‘job opportunities’ irritating, in the same way that people learn javascript or java only to get a job. Give me any time people that learn R—or any other language for that matter—because they want to, because they are curious, because they want to share their discoveries with other people. Again, it is the difference between someone competent and someone great at what they do; great analysts are very much better than competent ones.

In the comments for the R4stats post there is a reference to R fanboys. Are R fanboys worse than fanboys of other statistical systems? In some respects the answer is yes, because many R users are also open source & open science supporters. Personally, I support both concepts, although I’m not dogmatic about them: I do buy some proprietary software and often can’t provide every detail about my work (commercially sensitive results). Maybe we are looking for a deeper change: we want to democratize statistics. We push for R not necessarily because it is intrinsically a better language, but because we can envision many people doing statistics to better understand the world around us and R is free. Anyway, I would prefer you call me a Python fanboy with split R personality.

Bivariate linear mixed models using ASReml-R with multiple cores

A while ago I wanted to run a quantitative genetic analysis where the performance of genotypes in each site was considered as a different trait. If you think about it, with 70 sites and thousands of genotypes one is trying to fit a 70×70 additive genetic covariance matrix, which requires 70*69/2 = 2,415 covariance components. Besides requiring huge amounts of memory and being subject to all sort of estimation problems there were all sort of connectedness issues that precluded the use of Factor Analytic models to model the covariance matrix. The best next thing was to run over 2,000 bivariate analyses to build a large genetic correlation matrix (which has all sort of issues, I know). This meant leaving the computer running for over a week.

In another unrelated post, Kevin asked if I have ever considered using ASReml-R to run in parallel using a computer with multiple cores. Short answer, I haven’t, but there is always a first time.

require(asreml) # Multivariate mixed models
require(doMC)   # To access multiple cores
registerDoMC()  # to start a "parallel backend"
# Read basic density data
# Phenotypes
dat = read.table('density09.txt', header = TRUE)
dat$FCLN = factor(dat$FCLN)
dat$MCLN = factor(dat$MCLN)
dat$REP = factor(dat$REP)
dat$SETS = factor(dat$SETS)
dat$FAM = factor(dat$FAM)
dat$CP = factor(dat$CP)
dat$Tree_id = factor(dat$Tree_id)
# Pedigree
ped = read.table('pedigree.txt', header = TRUE)
names(ped)[1] = 'Tree_id'
ped$Tree_id = factor(ped$Tree_id)
ped$Mother = factor(ped$Mother)
ped$Father = factor(ped$Father)
# Inverse of the numerator relationship matrix
Ainv = asreml.Ainverse(ped)$ginv
# Wrap call to a generic bivariate analysis
# (This one uses the same model for all sites
#  but it doesn't need to be so)
bivar = function(trial1, trial2) {
  t2 = dat[dat$Trial_id == trial1 | dat$Trial_id == trial2,]
  t2$den1 = ifelse(t2$Trial_id == trial1, t2$DEN, NA)
  t2$den2 = ifelse(t2$Trial_id == trial2, t2$DEN, NA)
  t2$Trial_id = t2$Trial_id[drop = TRUE]
  # Bivariate run
  # Fits overall mean, random correlated additive effects with
  # heterogeneous variances and diagonal matrix for heterogeneous
  # residuals =  asreml(cbind(den1,den2) ~ trait,
                   random = ~ ped(Tree_id):corgh(trait),
                   data = t2,
                   ginverse = list(Tree_id = Ainv),
                   rcov = ~ units:diag(trait),
                   workspace=64e06, maxiter=30)
  # Returns only log-Likelihood for this example
# Run the bivariate example in parallel for two pairs of sites
# FR38_1 with FR38_2, FR38_1 with FR38_3
foreach(trial1=rep("FR38_1",2), trial2=c("FR38_2", "FR38_3")) %dopar% 
        bivar(trial1, trial2)

The code runs in my laptop (only two cores) but I still have to test its performance in my desktop (4 cores) and see if it really makes a difference in time versus running the analyses sequentially. Initially my mistake was using the multicore package (require(multicore)) directly, which doesn’t start the ‘parallel backend’ and ran sequentially. Using require(doMC) loads multicore but takes care of starting the ‘parallel backend’.

Gratuitous picture: Trees at 8 pm illuminated by bonfire and full moon (Photo: Luis).

P.S. Thanks to Etienne Laliberté, who sent me an email in early 2010 pointing out that I had to use doMC. One of the few exceptions in the pile of useless emails I have archived.
P.S.2. If you want to learn more about this type of models I recommend two books: Mrode’s Linear Models for the Prediction of Animal Breeding Values, which covers multivariate evaluation with lots of gory details, and Lynch and Walsh’s Genetics and Analysis of Quantitative Traits, which is the closest thing to the bible in quantitative genetics.
P.S.3. 2012-05-08. I did run the code in my desktop (iMac with 4 cores), making a couple of small changes in the code to have a fairer comparison between using %dopar% (for parallel code) and %par% (for sequential execution).

  1. Added trace = FALSE to the asreml() call, to avoid printing partial iteration details to the screen, which happened when using %do%.
  2. Used 4 pairs of sites, so I could utilize all cores in my desktop.

Execution time for the parallel version—measured with Sys.time()—was, roughly, 1/(number of cores) the time required by the sequential version.