Categories
r rblogs stats teaching

R, academia and the democratization of statistics

I am not a statistician but I use statistics, teach some statistics and write about applications of statistics in biological problems.

Last week I was at a biostatistics conference, talking with a Ph.D. student who was surprised by this situation because I didn’t have any statistical training. I corrected that to “any formal training”. On the first day one of the invited speakers was musing about the growing number of “amateurs” using statistics—many times wrongly—and about what biostatisticians could offer as professional value-adding. Yes, he was talking about people like me spoiling the party.

Twenty years ago it was more difficult to say “cool, I have some data and I will run some statistical analyses” because (and you can easily see where I am going here) access to statistical software was difficult. You were among the lucky ones if you were based at a university or a large company, because you had access to SAS, SPSS, MINITAB, etc. However, you were out of luck outside of these environments, because there was no way to easily afford a personal licence, not for a hobby at least. This greatly limited the pool of people that could afford to muck around with stats.

Gratuitous picture: spiral in Sydney

Enter R and other free (sensu gratis) software that allowed us to skip the specialist, skip the university or the large organization. Do you need a formal degree in statistics to start running analyses? Do you even need to go through a university (for any degree, it doesn’t really matter) to do so? There are plenty of resources to start, download R or a matrix language, get online tutorials and books, read, read, read and ask questions in email lists or fora when you get stuck. If you are a clever cookie—and some of you clearly are one—you could easily cover as much ground as someone going through a university degree. It is probably still not enough to settle down, but it is a great start and a great improvement over the situation twenty years ago.

This description leaves three groups in trouble, trying to sort out their “value-adding” ability: academia, (bio)statisticians and software makers. What are universities offering that is unique enough to justify the time and money invested by individuals and governments? If you make software, what makes it special? For how long can you rely on tradition and inertia so people don’t switch to something else? What’s so special about your (bio)statistical training to justify having one “you” in the organization? Too many questions, so I’d better go to sleep.

P.S. Did I write “value-adding” ability? I must have been talking to the suits for too long… Next post I may end up writing “value-adding proposition”!

Categories
linear models r rblogs stats teaching

On the (statistical) road, workshops and R

Things have been a bit quiet at Quantum Forest during the last ten days. Last Monday (Sunday for most readers) I flew to Australia to attend a couple of one-day workshops: one on spatial analysis (in Sydney) and another on modern applications of linear mixed models (in Wollongong). This will be followed by attending The International Biometric Society Australasian Region Conference in Kiama.

I would like to comment on the workshops to look for commonalities and differences. First, both workshops heavily relied on R, supporting the idea that if you want to reach a lot of people and get them using your ideas, R is pretty much the vehicle to do so. It is almost trivial to get people to install R and RStudio before the workshop so they are ready to go. “Almost” because you have to count on someone having a bizarre software configuration or draconian security policies for their computer.

The workshop on Spatial Analysis also used WinBUGS, which, with all respect to the developers, is a clunky piece of software. Calling it from R or using JAGS from R seems to me a much more sensible way of using a Bayesian approach while maintaining access to the full power of R. The workshop on linear mixed models relied on asreml-R; if you haven’t tried it, please give it a go (there is a free licence for academic/research use). There were applications to multi-stage experiments, composite samples and high-dimensional data (molecular information). In addition, there was an initial session on optimal design of experiments.
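To make the JAGS-from-R point concrete, here is a minimal sketch (my own illustration, not material from either workshop) estimating a proportion with the rjags package. It assumes JAGS and rjags are installed, and the data vector and flat prior are just placeholders:

    # Hypothetical binary outcomes
    library(rjags)
    y <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 1)

    model_string <- "
    model {
      for (i in 1:N) {
        y[i] ~ dbern(theta)
      }
      theta ~ dbeta(1, 1)   # flat prior on the proportion
    }"

    jm <- jags.model(textConnection(model_string),
                     data = list(y = y, N = length(y)),
                     n.chains = 3)
    update(jm, 1000)                                   # burn-in
    post <- coda.samples(jm, variable.names = "theta", n.iter = 5000)
    summary(post)

The whole analysis stays inside an R script, which is the point: the model, the data handling and the summaries live in the same place.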

In my opinion, the second workshop (modern applications…) was much more successful than the first one (spatial analysis…) for a few reasons:

  • One has to limit the material to cover in a one-day workshop; if you want to cover a lot, consider three days so people can digest all the material.
  • One has to avoid the “split-personality” approach to presentations; having very basic and super hard material but nothing in the middle is not a great idea (IMHO). Pick a starting point and gradually move people from there.
  • Limit the introduction of new software. One new piece of software per day seems to be a good rule of thumb.

Something bizarre (for me, at least) was the difference between the audiences. In Sydney the crowd was a lot younger, with many trainees in biostatistics coming mostly from health research. They had little exposure to R and seemed to come from a mostly SAS shop. The crowd in Wollongong had people with a lot of experience (OK, oldish) both in statistics and R. I was expecting young people to be more conversant in R.

Tomorrow we will drive down to Kiama, sort out registration and then go to the welcome BBQ. Funny thing is that this is my first statistics conference; as I mentioned in the About page of this blog, I am a simple forester. 🙂

Categories
bayesian books rblogs teaching

If you are writing a book on Bayesian statistics

This post is somewhat marginal to R in that there are several statistical systems that could be used to tackle the problem. Bayesian statistics is one of those topics that I would like to understand better, much better, in fact. Unfortunately, I struggle to get the time to attend courses on the topic between running my own lectures, research and travel; there are always books, of course.

After the strong earthquakes in Christchurch we have had limited access to most of our physical library† (we still had full access to our electronic collection). Last week I had a quick visit to the library and picked up three introductory books: Albert’s Bayesian computation with R, Marin and Robert’s Bayesian core: a practical approach to computational Bayesian statistics and Bolstad’s Understanding computational Bayesian statistics (all links to Amazon). My intention was to see if I could use one (or several) of them to start on the topic. What follows are my (probably unfair) comments after reading the first couple of chapters of each book.

In my (highly individual and dubious) opinion Albert’s book is the easiest to read. I was waiting to see the doctor while reading—and actually understanding—some of the concepts. The book is certainly geared towards R users and gradually develops the code necessary to run simple analyses from estimating a proportion to fitting (simple) hierarchical linear models. I’m still reading, which is a compliment.
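For a flavour of that starting point (this is not code from Albert’s book, just a back-of-the-envelope sketch in base R with made-up numbers), estimating a proportion with a conjugate Beta prior takes only a handful of lines:

    # Made-up data: 7 successes in 20 trials, flat Beta(1, 1) prior
    successes <- 7
    trials    <- 20
    a <- 1 + successes             # posterior is Beta(a, b)
    b <- 1 + trials - successes

    a / (a + b)                    # posterior mean
    qbeta(c(0.025, 0.975), a, b)   # 95% credible interval
    curve(dbeta(x, a, b), from = 0, to = 1,
          xlab = expression(theta), ylab = "Posterior density")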

Marin and Robert’s book is quite different in that it uses R as a vehicle (like this blog) but the focus is more on the conceptual side, and it covers more types of models than Albert’s book. I do not have the probability background for this course (or maybe I did, but it was ages ago); however, the book makes me want to learn/refresh that background. An annoying comment in the book is that it is “self-contained”; well, anything is self-contained if one asks for enough prerequisites! I’m still reading (jumping between Albert’s book and this one), and the book has managed to capture my interest.

Finally, Bolstad’s book. How to put this? “It is not you, it is me”. It is much more technical and I do not have the time, nor the patience, to wait until chapter 8 to do something useful (logistic regression). This is going back to the library until an indeterminate future.

If you are now writing a book on the topic, I would like you to consider the following use case:

  • the reader has little or no exposure to Bayesian statistics, but has been working for a while with ‘classical’ methods,
  • the reader is self-motivated, but doesn’t want to spend ages before being able to fit even a simple linear regression,
  • the reader has little background in probability theory, but is willing to learn some along the way, while learning the tools and running some analyses,
  • using a statistical system that allows for both classical and Bayesian approaches is a plus.

It is hard for me to be more selfish in this description; you are potentially writing a book for me.

† After the first quake our main library looked like this. Now it is mostly normal.

P.S. After publishing this post I remembered that I came across a PDF copy of Doing Bayesian Data Analysis: A Tutorial with R and BUGS by Kruschke. Setting aside the dodginess of the copy, the book looked well-written, started from first principles and had puppies on the cover (!), so I ordered it from Amazon.

P.D. 2011-12-03 23:45 AEST Christian Robert sent me a nice email and wrote a few words on my post. Yes, I’m still plodding along with the book although I’m taking a ten day break while traveling in Australia.

P.D. 2011-11-25 12:25 NZST Here is a list of links to Amazon for the books suggested in the comments:

Categories
r rblogs stats teaching

Teaching with R: the tools

I bought an Android phone, nothing fancy, just my first foray into the smartphone world, which is a big change coming from the dumb phone world(*). Everything is different and I am back to being a newbie; this is what many students feel the first time they are exposed to R. However, before getting into software, I find it useful to think of teaching from several points of view, considering that there are several use cases:

  1. A few of the students will get into statistics and heavy duty coding, they will be highly motivated and take several stats courses. It is in their best interest to learn R (and other software) very well.
  2. Some, but not many, of the students will use statistics once in a while. A point-and-click GUI may help, because they will not remember the commands.
  3. Most students will have to consume statistics, read reports, request information from colleagues and act as responsible, statistically literate citizens. I think that concepts (rather than software) are by far the most relevant element for this group.

The first group requires access to all the power in R and, very likely, has at least a passing idea of coding in other languages. The second and third groups are occasional users, who will tend to forget the language and notation and will most likely need a system with menus.

At this point, some of the readers may be tempted to say that everyone (that is, groups 1–3) should learn to write R code and that they just need a good text editor (say Emacs). This may come as a surprise, but normal people not only do not use text editors, they don’t even know what they are. You could be tempted to say that having a good editor would also let them write in LaTeX (or XeLaTeX), which is an excellent way to ‘future-proof’ your documents. Please let me repeat this: normal people do not write in LaTeX, they use Word or something equivalent.

But, but. Yes, I know, we are not normal people.

What are the problems?

When working on my Ph.D. I had the ‘brilliant’ (a.k.a. masochistic) idea of using a different language for each part of my project: Fortran 90 (Lahey), ASReml, Python (ActiveState), Matlab and Mathematica. One thing I experienced was that working with a scripting language integrated with a good IDE (e.g. ActiveState Python or Matlab) was much more conducive to learning than a separate console and text editor. I still have fond memories of learning and using Python. This meandering description brings me back to what we should use for teaching.

Let’s be honest, the stock R IDE that one gets with the initial download is spartan if you are on OS X and plain sucky if you are on Windows. Given that working with the console plus a text editor (Emacs, Vim, TextMate, etc.) is an uncomfortable learning experience (at least in my opinion), there is a nice niche for IDEs like RStudio, which integrate editor, data manager, graphs, etc., particularly if they are cross-platform. Why is it that RStudio is not included as the default R IDE? (Incidentally, I have never used the Revolution R Productivity Environment—Windows only—which looks quite cool.)

Today I am tempted to recommend moving the whole course to RStudio, which means installing it on an awful lot of computers at the university. One of the issues that stops me is that it introduces another layer of abstraction on top of R. We have the plain-vanilla console, then the normal installation and, on top, RStudio. On the other hand, we are already introducing an extra layer with R Commander.

At this point we reach the point-and-click GUI. For the last two years we have used R Commander, which has helped, but I have never felt entirely comfortable with it. This year I had a chat with some students who used SPSS before and, after the initial shock, they seemed to cope with R Commander. In a previous post someone suggested Deducer, which I hope to check out before the end of this year. I am always on the lookout for a good and easy interface for students who fall into the second and third cases (see above). It would be nice to have a series of interfaces that look like Jeroen Ooms’s prototypes. Please let me know if you have any suggestions.
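For what it is worth, getting R Commander going on a student’s own machine is a two-liner, assuming a working R installation (and, on OS X, an X11 server):

    # Install R Commander with its dependencies, then load it;
    # loading the package opens the GUI
    install.packages("Rcmdr", dependencies = TRUE)
    library(Rcmdr)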

(*)This is not strictly true, as I had a Nokia E72, which was a dumb phone with a lot of buttons pretending to be a smartphone.

(**)This post should be read as me thinking aloud and reflecting today’s impressions, which are continually evolving. I am not yet truly comfortable with any GUI in the R world, and still feel that SPSS, Splus or Genstat (for example) provide a nicer flow on the GUI front.

Categories
r rblogs teaching

Teaching with R: the switch

There are several blog posts, websites (and even books) explaining the transition from using another statistical system (e.g. SAS, SPSS, Stata, etc.) to relying on R. Most of that material treats the topic from the point of view of (i) an individual user and (ii) a researcher. This post explains some of the issues involved in moving several users at once, with an emphasis on teaching.

I have made part of this information available before, but I wanted to update it and keep it together with all the other posts in Quantum Forest. The process started in March 2009.

March 2009

I started explaining to colleagues my position on using R (and R Commander) for teaching purposes. Some background first: forestry deals with variability, and variability is the province of statistics. The use of statistics permeates forestry: we use sampling for inventory purposes, we use all sorts of complex linear and non-linear regression models to predict growth, linear mixed models are the bread and butter of the analysis of experiments, etc.

I think it is fair to expect foresters to be at least acquainted with basic statistical tools, and we have two courses covering ANOVA and regression. In addition, we are supposed to introduce/reinforce statistical concepts in several other courses. So far so good, until we reached the issue of software.

During the first year of study it is common to use MS Excel. I am not a big fan of Excel, but I can tolerate its use: people do not require much training to (ab)use it, and it has a role in introducing students to some of the ‘serious/useful’ functions of a computer; that is, beyond gaming. However, one hits Excel’s limits fairly quickly, which (together with the lack of an audit trail for the analyses and the need to repeat all the pointing and clicking every time we need an analysis) makes looking for more robust tools very important.

Until the end of 2009 SAS (mostly BASE and STAT, with some sprinkles of GRAPH) was our robust tool. SAS was introduced in second year during the ANOVA and regression courses. SAS is a fine product, however:

  • We spent a very long time explaining how to write simple SAS scripts. Students forgot the syntax very quickly.
  • SAS’s graphical capabilities are fairly ordinary and not at all conducive to exploratory data analysis.
  • SAS is extremely expensive.
  • SAS tends to define the subject; I mean, it adopts new techniques very slowly, so there is the tendency to do only what SAS can do. This is unimportant for undergrads, but it is relevant for postgrads.
  • Users sometimes store data in SAS’s own format, which introduces another source of lock-in.

At the time, in my research work I used mostly ASReml (for specialized genetic analyses) and R (for general work); since then I have moved towards using asreml-R (an R package that interfaces with ASReml) to have a consistent work environment. For teaching I was using SAS to be consistent with the second-year material.

Considering the previously mentioned barriers for students, I started playing with R-commander (Rcmdr), a cross-platform GUI for R created by John Fox (the writer of some very nice statistics books, by the way). As I see it:

  • R in command mode is not more difficult (but not simpler either) for students than SAS. I think that SAS is more consistent and they have worked hard at keeping a very similar structure between PROCs.
  • We can get R-commander to start working right away with simple(r) methods, while maintaining the possibility of moving to more complex methods later by typing commands or programming (see the sketch after this list).
  • It is free, so our students can load it into their laptops and keep on using it when they are gone. This is particularly true with international students: many of them will never see SAS again in their home countries.
  • It allows an easy path to data exploration (pre-requisite for building decent models) and high quality graphs.
  • R is open source (nice, but not a deal breaker for me) and easily extensible (this one is really important for me).
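As a sketch of that ‘typing commands later’ path, here is the kind of regression a student could first click through in R Commander and later reproduce as code, using the trees dataset that ships with R rather than one of our own datasets:

    # Volume of black cherry trees as a function of girth and height
    fit <- lm(Volume ~ Girth + Height, data = trees)
    summary(fit)          # coefficients, R-squared, etc.
    anova(fit)            # ANOVA table for the fitted model
    plot(fit, which = 1)  # residuals vs fitted values, for model checking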

At the time I thought that R would be an excellent fit for teaching; nevertheless, there could be a few drawbacks, mostly when dealing with postgrads:

  • There are restrictions on the size of datasets (they have to fit in memory), although there are ways to deal with some of the restrictions. On the other hand, I have hit the limits of PROC GLM and PROC MIXED before, and that is where ASReml shines. In two years this has never been a problem.
  • Some people have an investment in SAS and may not like the idea of using a different software. This was a problem the first few months.

As someone put it many years ago, there is always resistance to change:

It must be remembered that there is nothing more difficult to plan, more doubtful of success, nor more dangerous to manage, than the creation of a new system. For the initiator has the enmity of all who would profit by the preservation of the old institutions and merely lukewarm defenders in those who would gain by the new ones.—Niccolò Machiavelli, The Prince, Chapter 6


Five months later: August 2009

At the department level, I had to spend substantial time compiling information to prove that R could satisfy my colleagues’ statistical needs. Good selling points were nlme/lme4, lattice/ggplot2 and pointing my most statistically inclined colleagues to CRAN. Another important issue was the ability to have a GUI (Rcmdr) that could be adapted to our specific needs. At that time the School of Forestry adopted R as the default software for teaching any statistical content during the four years of the curriculum.
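As an illustration of why lme4 was such a good selling point (a hedged sketch only: the data frame and variable names below are hypothetical), a randomized block experiment reduces to a one-line model specification:

    library(lme4)
    # Hypothetical field trial: 'yield' measured on plots, 'treatment' as a
    # fixed effect and 'block' as a random effect; 'trial' is a made-up data frame
    m <- lmer(yield ~ treatment + (1 | block), data = trial)
    summary(m)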

At the university level, my questions to the department of Mathematics and Statistics sparked a lot of internal discussion, which resulted in R being adopted as the standard software for the second-year ANOVA and regression courses (it was already the standard for many courses in 3rd and 4th year). The decision was not unanimous, particularly because for statisticians SAS is one of those ‘must be in the CV’ skills, but they went for change. The second-year courses are offered across colleges, which makes the change very far-reaching. These changes implied that many computers in the university labs now come with R pre-installed.

A year later: April 2010

R and R-commander were installed in our computer labs and we started using them in our Research Methods course. It is still too early to see what the effect of R versus SAS will be, but we expect to see an increase in the application of statistics within our curriculum.

One thing that I did not properly consider in the process was the annoying side effects of the university’s computer policies. Students are not allowed to install software on the university computers, and R packages fall within that category. We can either stay with the defaults plus R Commander (our current position) or introduce an additional complication for students, pushing them to define their own library location. I’d rather teach ggplot2 than lattice, but ggplot2 is an extra installation. Choices, choices… On the positive side, the standard setup in some of the computer labs installs all the packages by default.
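In case it helps anyone facing similar policies, the ‘own library location’ workaround looks roughly like this (the path is only an example):

    # Create a personal library in the user's home directory and tell R about it
    dir.create("~/Rlibs", showWarnings = FALSE)
    .libPaths(c("~/Rlibs", .libPaths()))

    # Packages then install to, and load from, the personal library
    install.packages("ggplot2", lib = "~/Rlibs")
    library(ggplot2)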

Two years later: March 2011

Comments after teaching a regression modeling course using R-commander:

  • Some students really appreciate the possibility of using R-commander as their ‘total analysis system’. Most students who have never used a command-line environment prefer it.
  • Students who have some experience with command-line work do not like R-commander much, as they find it confusing, particularly because it is possible to access the R console through two points: Rcmdr and the default console. Some of them could not see the point of using an environment with a limited subset of R’s functionality.
  • Data transformation facilities in R-commander are somewhat limited to the simplest cases.
  • Why is it that the linear regression item does not accept categorical predictors? That works under ‘linear models’, but it is such an arbitrary separation.
  • The OS X version of R-commander (under X Windows) is butt ugly. This is not John Fox’s fault, but just a fact of life.

In general, R would benefit from having a first-class Excel import system that works across platforms. Yes, I know that some people say that researchers should not use Excel; however, there is a distinction between normative and positive approaches to research. People do use Excel, and insisting that they should not is not helpful.
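In the meantime the workaround is decidedly low-tech (a sketch with a hypothetical file name): export the worksheet from Excel as CSV and read it with base R. Packages such as gdata can read .xls files directly, but they bring their own dependencies.

    # After saving the worksheet as CSV from Excel:
    dat <- read.csv("experiment.csv", header = TRUE)   # hypothetical file name
    str(dat)

    # Alternatively, gdata::read.xls() reads .xls files directly,
    # but it relies on a working Perl installation
    # dat <- gdata::read.xls("experiment.xls", sheet = 1)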

I would love to hear anyone else’s experiences teaching basic statistics with R. Any comments?