Review: “Forest Analytics with R: an introduction”

Forestry is the province of variability. From a spatial point of view this variability ranges from within-tree variation (e.g. modeling wood properties) to billions of trees growing in millions of hectares (e.g. forest inventory). From a temporal point of view we can deal with daily variation in a physiological model to many decades in an empirical growth and yield model. Therefore, it is not surprising that there is a rich tradition of statistical applications to forestry problems.

At the same time, the scope of statistical problems is very diverse. As the saying goes, forestry deals with “an ocean of knowledge, but only one centimeter deep”, which is perhaps an elegant way of saying jack of all trades, master of none. Forest Analytics with R: an introduction by Andrew Robinson and Jeff Hamann (FAWR hereafter) attempts to provide a consistent overview of typical statistical techniques in forestry as they are implemented in the R statistical system.

Following the compulsory introduction to the R language and forest data management concepts, FAWR deals mostly with three themes: sampling and mapping (forest inventory), allometry and model fitting (e.g. diameter distributions, height-diameter equations and growth models), and simulation and optimization (implementing a growth and yield model, and forest estate planning). For each area the book provides a brief overview of the problem and a general description of the statistical issues, and then uses R to work through one or more example data sets. Because of this structure, chapters tend to stand on their own and guide the reader towards a standard analysis of the problem, with liberal use of graphics (very useful) and plenty of code interspersed with explanations (which can be visually confusing for some readers).

While the authors bill the book as using “state-of-the-art statistical and data-handling functionality”, the most modern applications are probably the use of non-linear mixed-effects models using a residual maximum likelihood approach. There is no coverage of, for example, Bayesian methodologies increasingly present in the forest biometrics literature.
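For readers who have not met this class of model, here is a minimal sketch of a non-linear mixed-effects fit by REML in R. It is not an example from the book: it uses the nlme package and the built-in Loblolly pine data, and the starting values are rough guesses.

[sourcecode language="R"]
# Asymptotic height-growth curve for Loblolly pine, with a random
# asymptote per seed source, fitted by REML (not the book's example)
library(nlme)

fm <- nlme(height ~ SSasymp(age, Asym, R0, lrc),
           data = Loblolly,
           fixed = Asym + R0 + lrc ~ 1,
           random = Asym ~ 1 | Seed,
           start = c(Asym = 103, R0 = -8.5, lrc = -3.3),
           method = 'REML')
summary(fm)
[/sourcecode]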

Harvesting Eucalyptus urophylla x E. grandis hybrid clones in Brazil (Photo: Luis).

FAWR reminds me of a great but infuriating book by Italo Calvino (1993): “If on a Winter’s Night a Traveler”. Calvino starts many good stories and, once the reader is hooked on them, keeps moving to a new one. The authors of FAWR acknowledge that they will only introduce the techniques, but a more comprehensive coverage of some topics would be appreciated. Readers with some experience in the topic may choose to skip the book altogether and move directly to, for example, Pinheiro and Bates’s (2000) Mixed-Effects Models in S and S-PLUS or Lumley’s (2010) Complex Surveys: A Guide to Analysis Using R. FAWR is part of the growing number of “do X using R” books that, although useful in the short term, are so tightly tied to specific software that one suspects they should come with a best-before date. A relevant question is how much content is left once we drop the software-specific parts… perhaps not enough.

The book certainly has redeeming features. For example, Part IV introduces the reader to calling an external function written in C (a growth model) and then combining the results with R functions to create a credible growth and yield forecasting system. Later the authors tackle harvest scheduling through linear programming models, a task often addressed using domain-specific (and both proprietary and expensive) software. The authors use this part to provide a good case study of model implementation.
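To give a flavour of the kind of problem involved, here is a toy one-period harvest scheduling LP. This is not the book's formulation, and I am assuming the lpSolve package rather than whatever the authors actually use; all figures are invented.

[sourcecode language="R"]
# Toy harvest scheduling: choose hectares to harvest in two stands to
# maximize volume, subject to stand areas and a total harvest limit.
library(lpSolve)

volume.per.ha <- c(450, 600)    # m3/ha, stands 1 and 2 (invented)
area.limit <- c(100, 80)        # available area (ha) per stand
max.harvest <- 150              # total area (ha) allowed this period

const.mat <- rbind(c(1, 0),     # area harvested in stand 1
                   c(0, 1),     # area harvested in stand 2
                   c(1, 1))     # total harvested area
const.dir <- rep('<=', 3)
const.rhs <- c(area.limit, max.harvest)

sol <- lp(direction = 'max', objective.in = volume.per.ha,
          const.mat = const.mat, const.dir = const.dir,
          const.rhs = const.rhs)
sol$solution    # hectares to harvest in each stand
sol$objval      # total harvested volume (m3)
[/sourcecode]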

At the end of the day I am ambivalent about FAWR. On the one hand, it is possible to find better coverage of most topics in other books or in the R documentation. On the other, it provides a convenient point of entry if one is lost on how to start working in forest biometrics with R. An additional positive is that the book increases R’s credibility as an alternative for forest analytics, which makes me wish it had been around three years ago, when I needed to convince colleagues to move our statistics teaching to R.

P.S. This review was published with minor changes as “Apiolaza, L.A. 2012. Andrew P. Robinson, Jeff D. Hamann: Forest Analytics With R: An Introduction. Springer, 2011. ISBN 978-1-4419-7761-8. xv+339 pp. Journal of Agricultural, Biological and Environmental Statistics 17(2): 306-307” (DOI: 10.1007/s13253-012-0093-y).
P.S.2. 2012-05-31. After publishing this text I discovered that I already used the sentence “[f]orestry deals with variability and variability is the province of statistics” in a blog post in 2009.
P.S.3. 2012-05-31. I first heard the saying “forestry deals with an ocean of knowledge, but only one centimeter deep” around 1994 in a presentation by Oscar García in Valdivia, Chile.
P.S.4. 2012-06-01. Added links to both authors’ internet presence.

R’s increasing popularity. Should we care?

Some people will say ‘you have to learn R if you want to get a job doing statistics/data science’. I say bullshit: you have to learn statistics and learn to work in a variety of languages if you want to be any good, beyond getting a job today coding in R.

R4stats has a recent post discussing the increasing popularity of R relative to other statistical software, using citation counts in Google Scholar. It is a flawed methodology, at least as flawed as other methodologies used to measure language popularity. Nevertheless, I think it is hard to argue against the general trend: R is becoming more popular. There is a deluge of books looking at R from every angle, thousands of packages and many job openings asking for R experience, which prompts the following question:

Should you/I/we care?

First answer: no. I try to use the best tool for the job, which often happens to be R but can also be Python, SAS or Fortran. It is nice to be able to use the same tool, say R, across a range of problems, but there are occasions when it feels like using Excel for statistics: one can do it, but one knows that it isn’t a great idea. I know good statisticians who prefer R, SAS or Genstat; the tool doesn’t make you good, in the same way that buying a Rickenbacker 4001 wouldn’t make me play like Geddy Lee.

Second answer: yes. Popularity attracts good people, who develop good packages, making new techniques available in R first. This doesn’t matter if you are into plain vanilla analyses (there is nothing wrong with that, by the way). Popularity + open source means that the system has been ported to a diversity of computer systems. Need R on a supercomputer? Done. R on a Mac? Done. R for your strange operating system, for which there are C and Fortran compilers? Download it and compile it. Done. There is also the ‘I’m not crazy’ aspect: other people take the software seriously.

Gratuitous picture: Firescapes II, night illuminated by bonfire (Photo: Luis).

I find people who learn R just because of ‘job opportunities’ irritating, in the same way I find people who learn JavaScript or Java only to get a job. Give me, any time, people who learn R—or any other language for that matter—because they want to, because they are curious, because they want to share their discoveries with other people. Again, it is the difference between someone competent and someone great at what they do; great analysts are very much better than merely competent ones.

In the comments on the R4stats post there is a reference to R fanboys. Are R fanboys worse than fanboys of other statistical systems? In some respects the answer is yes, because many R users are also open source and open science supporters. Personally, I support both concepts, although I’m not dogmatic about them: I do buy some proprietary software and often can’t provide every detail about my work (commercially sensitive results). Maybe we are looking for a deeper change: we want to democratize statistics. We push for R not necessarily because it is intrinsically a better language, but because we can envision many people doing statistics to better understand the world around us, and R is free. Anyway, I would prefer you call me a Python fanboy with a split R personality.

On the destruction of a trial of genetically modified pines

The media in New Zealand briefly covered the destruction of a trial of genetically modified pines (Pinus radiata D. Don, common names radiata pine and Monterey pine) near Rotorua. This is not the first time that Luddites have destroyed a trial, ignoring that it was established following regulations from the Environmental Protection Agency. Most people have discussed this pseudo-religious vandalism either from the wasted-resources point of view (money and, more importantly, time; delays in publication for the scientists; etc.) or from the criminal-activity point of view.

I will discuss something slightly different: when would we plant genetically modified trees?

Some background first

In New Zealand, plantations of forest trees are established by the private sector (mostly forest companies and small growers, usually farmers). Most of the stock planted in the country has some degree of (traditional) breeding, ranging from seed mixes with a large number of parents to the deployment of genetically identical clones. The higher the degree of improvement, the more likely it is that deployment involves a small number of highly selected genotypes. Overall, most tree plantations are based on open-pollinated seed with a modest degree of genetic improvement, which is much more genetically diverse than most agricultural crops. In contrast, agricultural crops tend to deploy named clonal varieties, which is what we buy in supermarkets: Gold kiwifruit, Gala apples, Nadine potatoes, etc.

Stating the obvious, tree and agricultural growers will pay more for genetic material if they expect the seeds, cuttings, tubers, etc. to provide higher quantity and/or quality of products, enough to pay for the extra expense. Here we can see a big difference between people growing trees and those growing annual/short-rotation crops: there is a long lag between tree establishment and income from the trees, which means that when one runs a discounted cash flow analysis to estimate profitability:

  1. Income is in the distant future (say 25-30 years) and is heavily discounted.
  2. Establishment costs, which include buying the genetic material, are not discounted because they happen right now (see the sketch after this list).
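A rough sketch of that discounted cash flow logic in R; every number (costs, revenues, discount rate, rotation length) is invented purely for illustration:

[sourcecode language="R"]
# Toy net present value comparison for tree establishment options.
# All figures are made up for illustration only.
npv <- function(cost.now, revenue.at.harvest, rate, years) {
  -cost.now + revenue.at.harvest / (1 + rate)^years
}

rate <- 0.08        # discount rate
rotation <- 28      # years from planting to harvest

# Option A: cheap seedlings, lower revenue at harvest (per hectare)
npv(cost.now = 1500, revenue.at.harvest = 45000, rate = rate, years = rotation)

# Option B: improved clones, double the establishment cost, higher revenue
npv(cost.now = 3000, revenue.at.harvest = 52000, rate = rate, years = rotation)
[/sourcecode]

The extra revenue 28 years away gets squashed by discounting, while the extra establishment cost hits at full value today, which is exactly why growers scrutinize the price of planting stock.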

Unsurprisingly, growers want to reduce establishment costs as much as they can, and remember that the cost of trees is an important component. This means that most people planting trees will go for cheaper stock with a low level of genetic improvement (often seedlings), unless they are convinced that they can recover the extra expense of more improved trees (usually clones, which cost at least twice as much as seedlings).

What’s the relationship with genetic modification?

Modification of any organism is an expensive process, which means that:

  1. One would only modify individuals with an outstanding genetic background; i.e. start with a good genotype to end up with a great one.
  2. Successful modifications will be clonally propagated to scale up the modification, driving down unit cost.

Thus, we have a combination of very good genotypes plus clonal propagation plus no discounting, which would make establishment costs very high (although not impossibly so). There is a second element that, at least for now, would delay adoption. Most large forest growers have some type of product certification, which establishes that the grower is using good forestry, environmental and social practices. Think of it as a sticker that says ‘the producer of this piece of wood is a good guy, so please feel confident about buying this product’; that is, the sticker is part of a marketing strategy. Currently some forest certification organizations do not accept the use of genetically modified organisms (e.g. Forest Certification Council, PDF of GMO policy).

This does not mean that planting genetically modified trees is financially impossible. For one, modification costs would fall with economies of scale (as for most biotechnologies), and one of the reasons we don’t have those economies is the political pressure by almost-religious zealots against GMOs, which makes people scared of being the first to plant GM trees/plants. Another option is for some certification agencies to change their GMO policy or, alternatively, for growers to rely on other certification organizations that do accept GMOs. Each individual forest company would have to evaluate the trade-offs of the certification decision, as they do not act as a block.

A simple scenario

Roughly 80% of the forest plantations in New Zealand correspond to radiata pine. Now imagine that we face a very destructive pest or disease with the potential to severely damage the survival and growth of the trees. I know that it would take us a long time (decades?) to breed trees resistant to this problem. I also know that the GM crowd could insert several disease-resistance genes and silence flowering, so that modified trees could not reproduce. Would you support the use of genetic modification to save one of the largest industries of the country? I would.

However, before using the technology I would like to have access to data from trials growing under New Zealand conditions. The destruction of trials makes it extremely difficult to make informed decisions, and this is the worst crime. These people are not just destroying trees but damaging our ability to properly make decisions as a society, evaluating the pros and cons of our activities.

P.S. These are just my personal musings about the subject and do not represent the views of the forest companies, the university or anyone else. I do not work on genetic modification, but I am a quantitative geneticist & tree breeder.
P.S.2. While I do not work on genetic modification—so I’d struggle to call that crowd ‘colleagues’—I support researchers on that topic in their effort to properly evaluate the performance of genetically modified trees.

Early-February flotsam

Mike Croucher at Walking Randomly points out an interesting difference in operator precedence among several mathematical packages when evaluating the simple expression 2^3^4. It is pretty much a divide between Matlab and Excel (does the latter qualify as mathematical software?) on one side, with result 4096 (i.e. (2^3)^4), and Mathematica, R and Python on the other, resulting in 2417851639229258349412352 (i.e. 2^(3^4)). Remember your parentheses…
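A quick check of both readings in R (which, like Mathematica and Python, treats ^ as right-associative):

[sourcecode language="R"]
2^3^4      # 2.417852e+24, because R reads it as 2^(3^4)
(2^3)^4    # 4096, the Matlab/Excel reading
2^(3^4)    # 2.417852e+24
[/sourcecode]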

Corey Chivers, aka Bayesian Biologist, uses R to help students understand the Monty Hall problem. I think a large part of the difficulty in grokking it stems from a convenient distraction: opening doors. The problem can be reframed as: i- you pick a door (so your probability of winning the prize is 1/3) and Monty gets the other two doors (probability of winning 2/3); ii- Monty is offering to swap all his doors for yours, so switching increases the probability of winning; iii- Monty will never open a winning door to entice the switch, so we should forget about the opened doors.

To make the point clearer, let’s imagine now that instead of 3 doors the game has 10 doors. You pick one (probability of winning 1/10) and Monty keeps 9 (probability of winning 9/10). Would you switch one door for nine? Of course! The fact that Monty will open 8 non-winning doors rather than all of his doors does not make a difference in the deal.

[sourcecode language="R"]
# Number of games and doors
n.games = 10000
n.doors = 10

# Assign the prize door and the player's pick for each game. Remember:
# Monty keeps all doors not chosen by the player
prize.door = floor(runif(n.games, 1, n.doors + 1))
player.door = floor(runif(n.games, 1, n.doors + 1))

# The player wins without switching only when the initial pick
# already matches the prize door
are.same = prize.door == player.door
cat('Probability of winning by not switching', sum(are.same)/n.games, '\n')
cat('Probability of winning by switching', (n.games - sum(are.same))/n.games, '\n')
[/sourcecode]

Gratuitous picture: fish in New Brighton pier (Photo: Luis).

Pierre Lemieux reminds us that “a dishonest statistician is an outliar”.

If you want to make dulce de leche using condensed milk—but lack a pressure cooker—use an autoclave for 50 to 60 minutes. HT: Heidi Smith. Geeky and one needs an autoclave worth thousands of dollars, but that’s what universities are for.

Lesser and Pearl inform us that there are at least 20 modalities for making statistics fun in “Functional Fun in Statistics Teaching: Resources, Research and Recommendations”. HT: Chelsea Heaven. I’ve used music, videos, cartoons, jokes, striking examples using body parts, quotations, food, juggling, etc.

An old review of Buddhism without Beliefs: A Contemporary Guide to Awakening by Stephen Batchelor. I can’t see any statistical angle, but I liked that book.

P.S. Awesome video by OK Go. HT: Eric Crampton.

Back to quantitative genetics!

Mid-January flotsam: teaching edition

I was thinking about new material that I will use for teaching this coming semester (starting the third week of February) and suddenly compiled the following list of links:

Enough procrastination. Let’s keep on filling out PBRF forms; it is the right time for that hexennial activity.

Gratuitous picture: Avonhead Cemetery, a place to reflect on teaching and the meaning of life (Photo by Luis).

R, academia and the democratization of statistics

I am not a statistician but I use statistics, teach some statistics and write about applications of statistics in biological problems.

Last week I was at a biostatistics conference, talking with a Ph.D. student who was surprised about this situation because I didn’t have any statistical training. I corrected: “any formal training”. On the first day one of the invited speakers was musing about the growing number of “amateurs” using statistics—many times wrongly—and about what biostatisticians could offer as professional value-adding. Yes, he was talking about people like me spoiling the party.

Twenty years ago it was more difficult to say “cool, I have some data and I will run some statistical analyses” because (and you can easily see where I am going here) access to statistical software was difficult. You were among the lucky ones if you were based at a university or a large company, because you had access to SAS, SPSS, MINITAB, etc. However, you were out of luck outside of these environments, because there was no way to easily afford a personal licence, not for a hobby at least. This greatly limited the pool of people that could afford to muck around with stats.

Gratuitous picture: spiral in Sydney

Enter R and other free (sensu gratis) software that allowed us to skip the specialist, skip the university or the large organization. Do you need a formal degree in statistics to start running analyses? Do you even need to go through a university (for any degree, it doesn’t really matter) to do so? There are plenty of resources to start, download R or a matrix language, get online tutorials and books, read, read, read and ask questions in email lists or fora when you get stuck. If you are a clever cookie—and some of you clearly are one—you could easily cover as much ground as someone going through a university degree. It is probably still not enough to settle down, but it is a great start and a great improvement over the situation twenty years ago.

This description leaves three groups in trouble, trying to sort out their “value-adding” ability: academia, (bio)statisticians and software makers. What are universities offering, unique enough to justify the time and money invested by individuals and governments? If you make software, what makes it special? For how long can you rely on tradition and inertia before people switch to something else? What’s so special about your (bio)statistical training to justify having one of “you” in the organization? Too many questions, so I’d better go to sleep.

P.S. Did I write “value-adding” ability? I must have been talking to the suits for too long… Next post I may end up writing “value-adding proposition”!

On the (statistical) road, workshops and R

Things have been a bit quiet at Quantum Forest during the last ten days. Last Monday (Sunday for most readers) I flew to Australia to attend a couple of one-day workshops; one on spatial analysis (in Sydney) and another one on modern applications of linear mixed models (in Wollongong). This will be followed by attending The International Biometric Society Australasian Region Conference in Kiama.

I would like to comment on the workshops, looking for commonalities and differences. First, both workshops relied heavily on R, supporting the idea that if you want to reach a lot of people and get them using your ideas, R is pretty much the vehicle to do so. It is almost trivial to get people to install R and RStudio before the workshop so they are ready to go. “Almost” because you have to count on someone having a bizarre software configuration or draconian security policies on their computer.

The workshop on spatial analysis also used WinBUGS, which, with all respect to the developers, is a clunky piece of software. Calling it from R, or using JAGS from R, seems to me a much more sensible way of taking a Bayesian approach while maintaining access to the full power of R. The workshop on linear mixed models relied on asreml-R; if you haven’t tried it, please give it a go (the license is free for academic/research use). There were applications to multi-stage experiments, composite samples and high-dimensional data (molecular information). In addition, there was an initial session on optimal design of experiments.
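As an illustration of the JAGS-from-R route (this is my own minimal sketch, not material from the workshop, and it assumes JAGS plus the rjags package are installed), estimating a humble mean:

[sourcecode language="R"]
# Minimal Bayesian estimation of a mean and precision via rjags
library(rjags)

set.seed(42)
y <- rnorm(50, mean = 10, sd = 2)   # fake data standing in for a real problem

model.string <- "
model {
  for (i in 1:N) {
    y[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0, 1.0E-6)       # vague prior for the mean
  tau ~ dgamma(0.001, 0.001)  # vague prior for the precision
}
"

jm <- jags.model(textConnection(model.string),
                 data = list(y = y, N = length(y)),
                 n.chains = 3)
post <- coda.samples(jm, variable.names = c('mu', 'tau'), n.iter = 5000)
summary(post)
[/sourcecode]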

In my opinion, the second workshop (modern applications…) was much more successful than the first one (spatial analysis…) for a few reasons:

  • One has to limit the material covered in a one-day workshop; if you want to cover a lot, consider three days so people can digest all the material.
  • One has to avoid the “split-personality” approach to presentations; having very basic and super hard material but nothing in the middle is not a great idea (IMHO). Pick a starting point and gradually move people from there.
  • Limit the introduction of new software. One software per day seems to be a good rule of thumb.

Something bizarre (for me, at least) was the difference between the audiences. In Sydney the crowd was a lot younger, with many trainees in biostatistics coming mostly from health research. They had had little exposure to R and seemed to come from a mostly SAS shop. The crowd in Wollongong had people with a lot of experience (OK, oldish) in both statistics and R. I was expecting the young people to be more conversant in R.

Tomorrow we will drive down to Kiama, sort out registration and then go to the welcome BBQ. Funny thing is that this is my first statistics conference; as I mentioned in the About page of this blog, I am a simple forester. 🙂

If you are writing a book on Bayesian statistics

This post is somewhat marginal to R, in that there are several statistical systems that could be used to tackle the problem. Bayesian statistics is one of those topics that I would like to understand better, much better in fact. Unfortunately, I struggle to find the time to attend courses on the topic between running my own lectures, research and travel; there are always books, of course.

After the strong earthquakes in Christchurch we have had limited access to most of our physical library† (though we still have full access to our electronic collection). Last week I had a quick visit to the library and picked up three introductory books: Albert’s Bayesian Computation with R, Marin and Robert’s Bayesian Core: A Practical Approach to Computational Bayesian Statistics and Bolstad’s Understanding Computational Bayesian Statistics (all links to Amazon). My intention was to see if I could use one (or several) of them to start on the topic. What follows are my (probably unfair) comments after reading the first couple of chapters of each book.

In my (highly individual and dubious) opinion Albert’s book is the easiest to read. I was waiting to see the doctor while reading—and actually understanding—some of the concepts. The book is certainly geared towards R users and gradually develops the code necessary to run simple analyses from estimating a proportion to fitting (simple) hierarchical linear models. I’m still reading, which is a compliment.

Marin and Robert’s book is quite different in that it uses R as a vehicle (like this blog), but the focus is more on the conceptual side and it covers more types of models than Albert’s book. I do not have the probability background for this course (or maybe I did, but it was ages ago); however, the book makes me want to learn/refresh that background. An annoying comment in the book is that it is “self-contained”; well, anything is self-contained if one asks for enough prerequisites! I’m still reading (jumping between Albert’s book and this one), and the book has managed to capture my interest.

Finally, Bolstad’s book. How to put this? “It is not you, it is me”. It is much more technical and I do not have the time, nor the patience, to wait until chapter 8 to do something useful (logistic regression). This is going back to the library until an indeterminate future.

If you are now writing a book on the topic I would like to think of the following user case:

  • the reader has little or no exposure to Bayesian statistics, but has been working for a while with ‘classical’ methods,
  • the reader is self-motivated, but doesn’t want to spend ages before being able to fit even a simple linear regression,
  • the reader has little background in probability theory, but is willing to learn some of it in between learning the tools and running some analyses,
  • using a statistical system that allows for both classical and Bayesian approaches is a plus.

It is hard for me to be more selfish in this description; you are potentially writing a book for me.

† After the first quake our main library looked like this. Now it is mostly normal.

P.S. After publishing this post I remembered that I had come across a PDF copy of Doing Bayesian Data Analysis: A Tutorial with R and BUGS by Kruschke. Setting aside the dodginess of the copy, the book looked well written, started from first principles and had puppies on the cover (!), so I ordered it from Amazon.

P.D. 2011-12-03 23:45 AEST Christian Robert sent me a nice email and wrote a few words on my post. Yes, I’m still plodding along with the book although I’m taking a ten day break while traveling in Australia.

P.D. 2011-11-25 12:25 NZST Here is a list of links to Amazon for the books suggested in the comments:

Teaching with R: the tools

I bought an Android phone, nothing fancy, just my first foray into the smartphone world, which is a big change coming from the dumb phone world(*). Everything is different and I am back to being a newbie; this is what many students feel the first time they are exposed to R. However, before getting into software, I find it useful to think of teaching from several points of view, considering that there are several user cases:

  1. A few of the students will get into statistics and heavy duty coding, they will be highly motivated and take several stats courses. It is in their best interest to learn R (and other software) very well.
  2. Some, but not many, of the students will use statistics once in a while. A point-and-click GUI may help them remember commands.
  3. Most students will have to consume statistics: read reports, request information from colleagues and act as responsible, statistically literate citizens. I think that concepts (rather than software) are by far the most relevant element for this group.

The first group requires access to all the power in R and, very likely, has at least a passing idea of coding in other languages. The second and third groups are occasional users, who will tend to forget the language and notation and will most likely need a system with menus.

At this point, some readers may be tempted to say that everyone (that is, groups 1—3) should learn to write R code and that they just need a good text editor (say, Emacs). This may come as a surprise, but normal people not only do not use text editors, they don’t even know what they are. You could be tempted to say that having a good editor would also let them write in LaTeX (or XeLaTeX), which is an excellent way to ‘future proof’ your documents. Please let me repeat this: normal people do not write in LaTeX, they use Word or something equivalent.

But, but. Yes, I know, we are not normal people.

What are the problems?

When working on my Ph.D. I had the ‘brilliant’ (a.k.a. masochistic) idea of using a different language for each part of my project: Fortran 90 (Lahey), ASReml, Python (ActiveState), Matlab and Mathematica. One thing I experienced was that working with a scripting language integrated with a good IDE (e.g. ActiveState Python or Matlab) was much more conducive to learning than a separate console and text editor. I still have fond memories of learning and using Python. This meandering description brings me back to what we should use for teaching.

Let’s be honest, the stock R IDE that one gets with the initial download is spartan if you are on OS X and plain sucky if you are on Windows. Given that working with the console plus a text editor (Emacs, Vim, Textmate, etc.) is an uncomfortable learning experience (at least in my opinion), there is a nice niche for IDEs like RStudio, which integrate editor, data manager, graphs, etc., particularly if they are cross-platform. Why is it that RStudio is not included as the default R IDE? (Incidentally, I have never used the Revolution R Productivity Environment—Windows only—which looks quite cool.)

Today I am tempted to recommend moving the whole course to RStudio, which means installing it on an awful lot of computers at the university. One of the issues that stops me is that it introduces another layer of abstraction to R: we have the plain-vanilla console, then the normal installation and, on top, RStudio. On the other hand, we are already introducing an extra level with R Commander.

At this point we reach the point-and-click GUI. For the last two years we have used R Commander, which has helped, but I have never felt entirely comfortable with it. This year I had a chat with some students who had used SPSS before and, after the initial shock, they seemed to cope with R Commander. In a previous post someone suggested Deducer, which I hope to check out before the end of this year. I am always on the lookout for a good and easy interface for students that fall in the second and third cases (see above). It would be nice to have a series of interfaces that look like Jeroen Ooms’s prototypes. Please let me know if you have any suggestions.
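For what it’s worth, getting R Commander in front of students is only a package install away (Deducer needs a bit more setup, so it is not shown here):

[sourcecode language="R"]
# Install and launch R Commander; loading the package opens the GUI
install.packages('Rcmdr')
library(Rcmdr)
[/sourcecode]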

(*)This is not strictly true, as I had a Nokia E72, which was a dumb phone with a lot of buttons pretending to be a smartphone.

(**)This post should be read as me thinking aloud and reflecting today’s impressions, which are continually evolving. I am not yet truly comfortable with any GUI in the R world, and still feel that SPSS, Splus or Genstat (for example) provide a nicer flow on the GUI front.

Teaching with R: the switch

There are several blog posts, websites (and even books) explaining the transition from using another statistical system (e.g. SAS, SPSS, Stata, etc.) to relying on R. Most of that material treats the topic from the point of view of i- an individual user and ii- a researcher. This post explains some of the issues involved in, first, moving several users at once and, second, doing so with an emphasis on teaching.

I have made part of this information available before, but I wanted to update it and keep it together with all the other posts in Quantum Forest. The process started in March 2009.

March 2009

I started explaining to colleagues my position on using R (and R Commander) for teaching purposes. Some background first: forestry deals with variability and variability is the province of statistics. The use of statistics permeates forestry: we use sampling for inventory purposes, we use all sorts of complex linear and non-linear regression models to predict growth, linear mixed models are the bread and butter of the analysis of experiments, etc.

I think it is fair to expect foresters to be at least acquainted with basic statistical tools, and we have two courses covering ANOVA and regression. In addition, we are supposed to introduce/reinforce statistical concepts in several other courses. So far so good, until we reached the issue of software.

During the first year of study it is common to use MS Excel. I am not a big fan of Excel, but I can tolerate its use: people do not require much training to (ab)use it and it has a role in introducing students to some of the ‘serious/useful’ functions of a computer; that is, beyond gaming. However, one can hit Excel’s limits fairly quickly, which–together with the lack of an audit trail for the analyses and the need to repeat all the pointing and clicking every time we need an analysis–makes looking for more robust tools very important.

Until the end of 2009, SAS (mostly BASE and STAT, with some sprinkles of GRAPH) was our robust tool. SAS was introduced in second year during the ANOVA and regression courses. SAS is a fine product; however:

  • We spent a very long time explaining how to write simple SAS scripts. Students forgot the syntax very quickly.
  • SAS’s graphical capabilities are fairly ordinary and not at all conducive to exploratory data analysis.
  • SAS is extremely expensive.
  • SAS tends to define the subject; I mean, it adopts new techniques very slowly, so there is the tendency to do only what SAS can do. This is unimportant for undergrads, but it is relevant for postgrads.
  • Users sometimes store data in SAS’s own format, which introduces another source of lock-in.

At the time, in my research work I used mostly ASReml (for specialized genetic analyses) and R (for general work); since then I have moved towards using asreml-R (an R package that interfaces with ASReml) to have a consistent work environment. For teaching I was using SAS to be consistent with the second-year material.

Considering the previously mentioned barriers for students, I started playing with R Commander (Rcmdr), a cross-platform GUI for R created by John Fox (the writer of some very nice statistics books, by the way). As I see it:

  • R in command mode is not more difficult (but not simpler either) for students than SAS. I think that SAS is more consistent, though: they have worked hard at keeping a very similar structure between PROCs.
  • We can get R Commander to start working right away with simple(r) methods, while maintaining the possibility of moving to more complex methods later by typing commands or programming.
  • It is free, so our students can load it onto their laptops and keep on using it after they leave. This is particularly true for international students: many of them will never see SAS again in their home countries.
  • It allows an easy path to data exploration (a prerequisite for building decent models) and high-quality graphs (see the one-liner after this list).
  • R is open source (nice, but not a deal breaker for me) and easily extensible (this one is really important for me).
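As a taste of the kind of exploration I mean, here is a lattice one-liner (my own toy example on a built-in data set, not part of the course material):

[sourcecode language="R"]
# Conditioned scatterplots with points and a regression line per panel
library(lattice)
xyplot(Petal.Length ~ Sepal.Length | Species, data = iris,
       type = c('p', 'r'),
       xlab = 'Sepal length (cm)', ylab = 'Petal length (cm)')
[/sourcecode]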

At the time I thought that R would be an excellent fit for teaching; nevertheless, there could be a few drawbacks, mostly when dealing with postgrads:

  • There are restrictions on the size of datasets (they have to fit in memory), although there are ways to deal with some of these restrictions. On the other hand, I have hit the limits of PROC GLM and PROC MIXED before, and that is where ASReml shines. In two years this has never been a problem.
  • Some people have an investment in SAS and may not like the idea of using different software. This was a problem for the first few months.

As someone put it many years ago, there is always resistance to change:

It must be remembered that there is nothing more difficult to plan, more doubtful of success, nor more dangerous to manage, than the creation of a new system. For the initiator has the enmity of all who would profit by the preservation of the old institutions and merely lukewarm defenders in those who would gain by the new ones.—Niccolò Machiavelli, The Prince, Chapter 6


Five months later: August 2009

At the department level, I had to spend substantial time compiling information to prove that R could satisfy my colleagues’ statistical needs. Good selling points were nlme/lme4, lattice/ggplot2 and pointing my most statistically inclined colleagues to CRAN. Another important issue was the ability to have a GUI (Rcmdr) that could be adapted to our specific needs. At that time the School of Forestry adopted R as the default software for teaching any statistical content during the four years of the curriculum.

At the university level, my questions to the Department of Mathematics and Statistics sparked a lot of internal discussion, which resulted in R being adopted as the standard software for the second-year ANOVA and regression courses (it was already the standard for many courses in 3rd and 4th year). The decision was not unanimous, particularly because for statisticians SAS is one of those ‘must be in the CV’ skills, but they went for change. The second-year courses are offered across colleges, which makes the change very far-reaching. These changes meant that many computers in the university labs now come with R pre-installed.

A year later: April 2010

R and R Commander were installed in our computer labs and we started using them in our Research Methods course. It is still too early to see what the effect of R versus SAS will be, but we expect to see an increase in the application of statistics within our curriculum.

One thing that I did not properly consider in the process was the annoying side-effects of the university’s computer policies. Students are not allowed to install software on the university computers, and R packages fall within that category. We can either stay with the defaults + R Commander (our current position) or introduce an additional complication for students, pushing them to define their own library location (see the sketch below). I’d rather teach ggplot2 than lattice, but ggplot2 is an extra installation. Choices, choices… On the positive side, some of the computer labs install all the packages by default.
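If we do go down the personal library route, the incantation is short; the path below is just an example:

[sourcecode language="R"]
# Point R to a per-user package library (example path)
lib.dir <- '~/R/library'
dir.create(lib.dir, recursive = TRUE, showWarnings = FALSE)
.libPaths(c(lib.dir, .libPaths()))          # search the personal library first

install.packages('ggplot2', lib = lib.dir)  # install there explicitly
library(ggplot2, lib.loc = lib.dir)
[/sourcecode]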

Two years later: March 2011

Comments after teaching a regression modeling course using R Commander:

  • Some students really appreciate the possibility of using R Commander as their ‘total analysis system’. Most students who have never used a command-line environment prefer it.
  • Students who have some experience with command-line work do not like R Commander much, as they find it confusing, particularly because it is possible to access the R console through two points: Rcmdr and the default console. Some of them could not see the point of using an environment with a limited subset of the functionality.
  • Data transformation facilities in R Commander are somewhat limited to the simplest cases.
  • Why is it that the linear regression item does not accept categorical predictors? That works under ‘linear models’, but it is such an arbitrary separation.
  • The OS X version of R Commander (under X Windows) is butt ugly. This is not John Fox’s fault, just a fact of life.

In general, R would benefit from having a first-class Excel import system that works across platforms. Yes, I know that some people say researchers should not use Excel; however, there is a distinction between normative and positive approaches to research. People do use Excel, and insisting that they should not is not helpful.
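For the record, one relatively painless cross-platform route today is the readxl package (which appeared after this post was written); the file name below is hypothetical:

[sourcecode language="R"]
# Reading an Excel sheet without Excel, Java or Perl dependencies
# install.packages('readxl')
library(readxl)
dat <- read_excel('trial_data.xlsx', sheet = 1)  # hypothetical file
str(dat)
[/sourcecode]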

I would love to hear anyone else’s experiences teaching basic statistics with R. Any comments?