Paying for a job well done

At the moment I am writing R code that involves a lot of simulation for a project. This time I wanted to organize the work properly, put a package together, document it,… the whole shebang. Hadley Wickham has excellent documentation for this process in Advanced R, which works very well as a website. Up to this point there is nothing new; but the material is also available as a book.

At this point in my life I do not want to have a physical object if I can avoid it. On top of that, code tutorials work a lot better as a website, so one can copy, paste and experiment. PDF or ebooks are not very handy for this subject either. Here enters a revolutionary notion: I like to pay people who do a good job and, in the process, make my job easier but sometimes I do not want an object in exchange.

One short term solution: asking Hadley for his favorite charity and donating the cost of a copy of the book. That gets most people happy except, perhaps, the publisher. I then remembered this idea by Cory Doctorow, in which he acts as a middleman between people who wish to pay him for his stories (but don’t want a physical copy of books) and school libraries that wish to have copies of the books.

Wouldn’t it be nice to have an arrangement like that for programming and research books? For example, we could get R learners who prefer but can’t afford books and people willing to pay for them.

Paying for intangibles (Photo: Luis, click to enlarge).

Paying for intangibles (Photo: Luis, click to enlarge).

Flotsam 11: mostly on books

‘No estaba muerto, andaba the parranda’ as the song says. Although rather than partying it mostly has been reading, taking pictures and trying to learn how to record sounds. Here there are some things I’ve come across lately.

I can’t remember if I’ve recommended Matloff’s The Art of R Programming before; if I haven’t, go and read the book for a good exposition of the language. Matloff also has an open book (as in free PDF, 3.5MB) entitled ‘From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science’. The download link is near the end of the page. He states that the reader ‘must know calculus, basic matrix algebra, and have some minimal skill in programming’, which incidentally is the bare minimum for someone that wants to get a good handle on stats. In my case I learned calculus partly with Piskunov’s book (I’m a sucker for Soviet books, free DjVu), matrix algebra with Searle’s book and programming with… that’s another story.

I’ve ordered a couple of books from CRC Press, which I hope to receive soon (it depends on how long it takes for the parcel to arrive to the middle of nowhere):

  • Stroup’s Generalized Linear Mixed Models: Modern Concepts, Methods and Applications, which according to the blurb comes ‘with numerous examples using SAS PROC GLIMMIX’. You could be wondering Why is he reading a book that includes SAS as a selling point? Well, SAS is a very good statistical thinking that still has a fairly broad installed based. However, the real selling point is that I’ve read some explanations on mixed models written by Stroup and he has superb understanding of the topic. I’m really looking forward to put my paws on this book.
  • Lunn et al.’s The BUGS Book: A Practical Introduction to Bayesian Analysis. I don’t use BUGS but occasionally use JAGS and one of the things that irks me of programs like BUGS, JAGS or INLA is that they follow the ‘here is a bunch of examples’ approach to documentation. This books is supposed to provide a much more detailed account of the ins and outs of fitting models and a proper manual. Or at least that’s what I’m hoping to find in it.

Finally, a link to a fairly long (and somewhat old) list of R tips and the acknowledgements of a PhD thesis that make you smile (via Arthur Charpentier).

Gratuitous picture: frozen fence (Photo: Luis, click to enlarge).

Gratuitous picture: frozen fence (Photo: Luis, click to enlarge).


‘He was not dead, he was out partying’.

Matrix Algebra Useful for Statistics

I was having a conversation with an acquaintance about courses that were particularly useful in our work. My forestry degree involved completing 50 compulsory + 10 elective courses; if I had to choose courses that were influential and/or really useful they would be Operations Research, Economic Evaluation of Projects, Ecology, 3 Calculus and 2 Algebras. Subsequently my PhD was almost entirely research based but I sort of did Matrix Algebra: Dorian lent me his copy of Searle’s Matrix Algebra Useful for Statistics and passed me a pile of assignments that Shayle Searle used to give in his course in Cornell. I completed the assignments on my own pace and then sat a crazy take-home exam for 24 hours.

Later that year I bought a cloth-bound 1982 version of the book, not the alien vomit purple paperback reprint currently on sale, which I consult from time to time. Why would one care about matrix algebra? Besides being a perfectly respectable intellectual endeavor on itself, maybe you can see that the degrees of freedom are the rank of a quadratic form; you can learn from this book what a quadratic form and a matrix rank are. Or you want to see more clearly the relationship between regression and ANOVA, because in matrix form a linear model is a linear model is a linear model. The commands outer, inner and kronecker product make a lot more sense once you know what an outer product and an inner product of vectors are. Thus, if you really want to understand a matrix language for data analysis and statistics (like R), it seems reasonable to try to understand the building blocks for such a language.

The book does not deal with any applications to statistics until chapter 13. Before that it is all about laying foundations to understand the applications, but do not expect nice graphs and cute photos. This is a very good text where one as to use the brain to imagine what’s going on in the equations and demonstrations. The exercises rely a lot on ‘prove this’ and ‘prove that’, which lead to much frustration and, after persevering, to many ‘aha! moments’.

XKCD 1050: In your face! Actually I feel the opposite concerning math.

I am the first to accept that I have a biased opinion about this book, because it has sentimental value. It represents difficult times, dealing with a new language, culture and, on top of that, animal breeding. At the same time, it opened doors to a whole world of ideas. This is much more than I can say of most books.

PS 2012-12-17: I have commented on a few more books in these posts.

A good part of my electives were in humanities (history & literature), which was unusual for forestry. I just couldn’t conceive going through a university without doing humanities.

Review: “Forest Analytics with R: an introduction”

Forestry is the province of variability. From a spatial point of view this variability ranges from within-tree variation (e.g. modeling wood properties) to billions of trees growing in millions of hectares (e.g. forest inventory). From a temporal point of view we can deal with daily variation in a physiological model to many decades in an empirical growth and yield model. Therefore, it is not surprising that there is a rich tradition of statistical applications to forestry problems.

At the same time, the scope of statistical problems is very diverse. As the saying goes forestry deals with “an ocean of knowledge, but only one centimeter deep”, which is perhaps an elegant way of saying a jack of all trades, master of none. Forest Analytics with R: an introduction by Andrew Robinson and Jeff Hamann (FAWR hereafter) attempts to provide a consistent overview of typical statistical techniques in forestry as they are implemented using the R statistical system.

Following the compulsory introduction to the R language and forest data management concepts, FAWR deals mostly with three themes: sampling and mapping (forest inventory), allometry and model fitting (e.g. diameter distributions, height-diameter equations and growth models), and simulation and optimization (implementing a growth and yield model, and forest estate planning). For each area the book provides a brief overview of the problem, a general description of the statistical issues, and then it uses R to deal with one or more example data sets. Because of this structure, chapters tend to stand on their own and guide the reader towards a standard analysis of the problem, with liberal use of graphics (very useful) and plenty of interspersing code with explanations (which can be visually confusing for some readers).

While the authors bill the book as using “state-of-the-art statistical and data-handling functionality”, the most modern applications are probably the use of non-linear mixed-effects models using a residual maximum likelihood approach. There is no coverage of, for example, Bayesian methodologies increasingly present in the forest biometrics literature.

Harvesting Eucalyptus urophylla x E. grandis hybrid clones in Brazil (Photo: Luis).

FAWR reminds me of a great but infuriating book by Italo Calvino (1993): “If on a Winter’s Night a Traveler“. Calvino starts many good stories and, once the reader is hooked in them, keeps on moving to a new one. The authors of FAWR acknowledge that they will only introduce the techniques, but a more comprehensive coverage of some topics would be appreciated. Readers with some experience in the topic may choose to skip the book altogether and move directly to, for example, Pinheiro and Bates (2000) book on Mixed-Effect Models in S and S-Plus and Lumley’s (2010) Complex Surveys: A Guide to Analysis Using R. FAWR is part of the growing number of “do X using R” books that, although useful in the short term, are so highly tied to specific software that one suspects they should come with a best-before date. A relevant question is how much content is left once we drop the software specific parts… perhaps not enough.

The book certainly has redeeming features. For example, Part IV introduces the reader to calling an external function written in C (a growth model), to then combining the results with R functions to create a credible growth and yield forecasting system. Later the authors tackle harvest scheduling through linear programming models, task often addressed using domain-specific (both proprietary and expensive) software. The authors use this part to provide a good case study of model implementation.

At the end of the day I am ambivalent about FAWR. On the one hand, it is possible to find better coverage of most topics in other books or R documentation. On the other, it provides a convenient point of entry if one is lost on how to start working in forest biometrics with R. An additional positive aspect is that the book increases R credibility as an alternative for forest analytics, which makes me wish this book had been around 3 years ago, when I needed to convince colleagues to move our statistics teaching to R.

P.S. This review was published with minor changes as “Apiolaza, L.A. 2012. Andrew P. Robinson, Jeff D. Hamann: Forest Analytics With R: An Introduction. Springer, 2011. ISBN 978-1-4419-7761-8. xv+339 pp. Journal of Agricultural, Biological and Environmental Statistics 17(2): 306-307″ (DOI: 10.1007/s13253-012-0093-y).
P.S.2. 2012-05-31. After publishing this text I discovered that I already used the sentence “[f]orestry deals with variability and variability is the province of statistics” in a blog post in 2009.
P.S.3. 2012-05-31. I first heard the saying “forestry deals with an ocean of knowledge, but only one centimeter deep” around 1994 in a presentation by Oscar García in Valdivia, Chile.
P.S.4. 2012-06-01. Added links to both authors internet presence.

The weirdness of ebooks

A couple of weeks ago I got a Sony Reader PRS-T1 through the use of Flybuys (a loyalty card scheme available in New Zealand). I had been thinking about buying an Amazon Kindle but then we got the Flybuys catalogue and I could not see the point of shelling out cash for something that I could get much more cheaply.

The device is quite nice and my only hardware quibble is the front frame of the screen, which is too reflective. In contrast, the Sony software to synchronize ebook reader and computer (a mac in my case) is a piece of junk. Therefore, the first thing I did was to install Calibre, which is not pretty but quite effective as a book manager.

By default Sony software pushes the reader to buy books through Whitcoulls, which manages to inspire limitless disappointment: selection is poor and prices high. Why would someone pays $27 for an ebook? The good thing is one can buy books from other sources (e.g. Book Depository or quite a few other book stores), at much lower prices, often below $10. Most books will come with a DRM (usually using Adobe Digital Editions) although there are a few bookshops (e.g. Baen) that ship them DRM-free.

Here is, perhaps, the biggest disappointment with the way many publishers/stores are dealing with their customers; they are treating us like potential criminals. We already went through this with iTunes, which eliminated DRM for music files some years ago. Why choose to unnecessarily constrain the files, particularly when DRM is annoying and can be easily broken? In a similar vein, how come that publishers manage to make pirate copies of books much more easily available than legal ones? If i- I have the money (promise, I do) and ii- I am willing to pay (promise, I do too), why can’t I get legal copies of books by Borges, Cortázar, Bolaño or whoever I want to read? Publishers should compulsorily read this comic by The Oatmeal.

On the plus side, as a quick Google search will show, it is possible to easily break either Amazon’s or Adobe’s DRM using a plugin for Calibre. I am not saying that one should do it, only that is an option. It would always be a pain to end up with unreadable books in the same way that Microsoft MSN customers ended up with unplayable music.

Tuff-luv ebook cover.

After that rant, how does it feel to read in the Sony Reader? It is quite nice and, after a little while, the hardware disappears and the story moves ahead, just like in a normal book. It won’t work for all books—e.g. if you are a fan of Edward Tufte’s books- but it is perfect for most novels or short stories. Rather than buying a typical Sony cover, I ordered this one from Tuff-Luv (a company from the UK) that arrived in less than one week: excellent service and shipped from Germany! The cover is nice looking and, more importantly, covers the shiny sides of the reader, so no more reflection.

In summary, I’m back at reading lots because it is easy to always carry many books in my backpack. This is bliss for an inveterate ‘reading many books in parallel’ aficionado.

Doing Bayesian Data Analysis now in JAGS

Around Christmas time I presented my first impressions of Kruschke’s Doing Bayesian Data Analysis. This is a very nice book but one of its drawbacks was that part of the code used BUGS, which left mac users like me stuck.

Kruschke has now made JAGS code available so I am happy clappy and looking forward to test this New Year present. In addition, there are other updates available for the programs included in the book.

The introduction of a new system

It must be remembered that there is nothing more difficult to plan, more doubtful of success, nor more dangerous to manage, than the creation of a new system. For the initiator has the enmity of all who would profit by the preservation of the old institutions and merely lukewarm defenders in those who would gain by the new ones.

—Niccolò Machiavelli, The Prince, Chapter 6.

E debbasi considerare, come non è cosa più difficile a trattare, né più dubia a riuscire, né più pericolosa a maneggiare, che farsi capo ad introdurre nuovi ordini; perché lo introduttore ha per nimici tutti quelli che degli ordini vecchi fanno bene, e ha tepidi defensori tutti quelli che delli ordini nuovi farebbono bene.

—Niccolò Machiavelli, Il Principe, Capitolo VI.

I’m a big fan of this quote, because it applies to so many new endeavors including the introduction of new topics in a university curriculum.

First impressions of Doing Bayesian Data Analysis

About a month ago I was discussing the approach that I would like to see in introductory Bayesian statistics books. In that post I mentioned a PDF copy of Doing Bayesian Data Analysis by John K. Kruschke and that I have ordered the book. Well, recently a parcel was waiting in my office with a spanking new, real paper copy of the book. A few days are not enough to provide a ‘proper’ review of the book but I would like to discuss my first impressions about the book, as they could be helpful for someone out there.

If I were looking for a single word to define the word it would be meaty, not on the “having the flavor or smell of meat” sense of the word as pointed out by Newton, but on the conceptual side. Kruschke has clearly put a lot of thought on how to draw a generic student with little background on the topic to start thinking of statistical concepts. In addition Kruschke clearly loves language and has an interesting, sometimes odd, sense of humor; anyway, Who am I to comment on someone else’s strange sense of humor?

Steak at Quantum Forest's HQ

Meaty like a ribeye steak slowly cooked at Quantum Forest's HQ in Christchurch.

One difference between the dodgy PDF copy and the actual book is the use of color, three shades of blue, to highlight section headers and graphical content. In general I am not a big fun of lots of colors and contentless pictures as used in modern calculus and physics undergraduate books. In this case, the effect is pleasant and makes browsing and reading the book more accessible. Most graphics really drive a point and support the written material, although there are exceptions in my opinion like some faux 3D graphs (Figure 17.2 and 17.3 under multiple linear regression) that I find somewhat confusing.

The book’s website contains PDF versions of the table of contents and chapter 1, which is a good way to whet your appetite. The book covers enough material as to be the sole text for an introductory Bayesian statistics course, either starting from scratch or as a transition from a previous course with a frequentist approach. There are plenty of exercises, a solutions manual and plenty of R code available.

The mere existence of this book prompts the question: Can we afford not to introduce students to a Bayesian approach to statistics? In turn this sparks the question How do we convince departments to de-emphasize the old way? (this quote is extremely relevant)

Verdict: if you are looking for a really introductory text, this is hands down the best choice. The material goes from the ‘OMG do I need to learn stats?’ level to multiple linear regression, ANOVA, hierarchical models and GLMs.

Newton and Luis with our new copy of Doing Bayesian Data Analyses

Dear Dr Kruschke, we really bought a copy. Promise! Newton and Luis (Photo: Orlando).

P.S. I’m still using a combination of books, including Krushke’s, and Marin and Robert’s for my own learning process.
P.S.2 There is a lot to be said about a book that includes puppies on its cover and references to A Prairie Home Companion on its first page (the show is sometimes re-broadcasted down under by Radio New Zealand).

Solomon saith

Solomon saith, There is no new thing upon the earth. So that as Plato had an imagination, that all knowledge was but remembrance; so Solomon giveth his sentence, That all novelty is but oblivion.

—Francis Bacon: Essays, LVIII quoted by Jorge Luis Borges in The Immortal (1949).