Scraping pages and downloading files using R

I have written a few posts discussing descriptive analyses of the evaluation of National Standards for New Zealand primary schools. The data for roughly half of the schools was made available by the media, but the full dataset is provided only on a single-school basis. On the page for a given school there may be a link to a PDF file with the information on standards sent by the school to the Ministry of Education.

I’d like to keep a copy of the PDF reports for all the schools for which I do not have performance information, so I decided to write an R script to download just over 1,000 PDF files. Once I identify all the schools with missing information, I just loop over the list, relying on the fact that all URLs for the school pages start with the same prefix. I download each page, look for the name of the PDF file and then download the PDF file, which is named school_schoolnumber.pdf. And that’s it.
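The logic looks something like this minimal sketch (the URL prefix and school numbers below are placeholders I made up, not the real Ministry of Education addresses):

    # Hypothetical prefix shared by all school pages
    base_url <- 'http://www.example.govt.nz/school/'
    # Hypothetical school numbers with missing performance information
    missing_schools <- c(1001, 1002, 1003)

    for (school in missing_schools) {
      # Download the school page and search its source for the PDF name
      page <- readLines(paste(base_url, school, sep = ''), warn = FALSE)
      pdf_name <- unlist(regmatches(page,
                                    regexpr('school_[0-9]+\\.pdf', page)))[1]
      if (!is.na(pdf_name)) {
        # Grab the report, keeping the school_schoolnumber.pdf name
        download.file(paste(base_url, pdf_name, sep = ''),
                      destfile = pdf_name, mode = 'wb')
      }
    }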

Of course life would be a lot simpler if the Ministry of Education made the information available in a usable form for analysis.

Can you help?

It would be great if you could help me get the information from the reports. The following link randomly chooses a school; click on the “National Standards” tab and open the PDF file.

Then type the achievement numbers for reading, writing and mathematics in this Google Spreadsheet. No need to worry about different values per sex or ethnicity; the total values will do.

Gratuitous picture: a simple summer lunch (Photo: Luis).

m x n matrix with randomly assigned 0/1

Today Scott Chamberlain tweeted asking for a better/faster solution to building an m x n matrix with randomly assigned 0/1. He already had a working version:
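His version was along these lines (a reconstruction on my part, so the exact code may have differed):

    # Dimensions used for all the examples below
    r <- 1000
    c <- 1000
    # Fill an r x c matrix cell by cell with apply()
    mm <- matrix(0, r, c)
    m0 <- apply(mm, c(1, 2), function(x) sample(c(0, 1), 1))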

Now, I’m the first to acknowledge that I’ve never got the ‘apply’ family of functions and that—thanks Hadley—if I need to do something like that I go for the plyr package. Nevertheless, if you think about it, the starting point is a sugary version of a loop; I like loops because they are explicit and clear, but they can be very slow in R. The loop could be written as:
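My sketch of that explicit version:

    # The same idea written as a double loop over rows and columns
    m.loop <- matrix(0, r, c)
    for (i in 1:r) {
      for (j in 1:c) {
        m.loop[i, j] <- sample(c(0, 1), 1)
      }
    }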

In contrast, my first idea was to generate a bunch of uniformly distributed [0, 1) random numbers and round them to the closest integer, which is a more ‘matricy’ way of thinking:
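If memory serves, m1 was:

    # Round r*c uniform [0, 1) numbers to the nearest integer
    m1 <- round(matrix(runif(r * c), r, c))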

and it happens to be a lot faster. My second idea, and Edmund Hart beat me to that one, was to use something like:
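That is, drawing the values straight from a binomial (Bernoulli) distribution:

    # Draw r*c Bernoulli(0.5) values directly
    m2 <- matrix(rbinom(r * c, 1, 0.5), r, c)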

which is a nice option: it generates r*c random numbers following a binomial distribution, and it has the advantage of allowing for probabilities other than the 0.5 implicit in m1. This is also faster than m1, although one would probably not notice the difference until working with fairly big matrices. When Scott pointed out the speedup he got me thinking about the order of the operations; would this be better?
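That is, rounding the vector of random numbers before shaping it into a matrix:

    # Same ingredients as m1, but round first and build the matrix afterwards
    m3 <- matrix(round(runif(r * c)), r, c)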

In terms of speed, m3 fits between m1 and m2, so the order can make a difference.

David Smith and Rafael Maia came up with a different approach, using sample(), which I had not considered at all. m4 (below) has the advantage that one could use any number of values to randomize.
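Presumably along these lines:

    # Sample directly from the set of values; any set of values would work
    m4 <- matrix(sample(0:1, r * c, replace = TRUE), r, c)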

Any of these options can be timed using the system.time() function, as in system.time(m3 <- matrix(round(runif(r*c)), r, c)). (Note the <-: using = inside system.time() would be parsed as a named argument rather than an assignment.) It’s interesting how different people come up with alternative strategies (reflecting different ways of thinking about the problem) using exactly the same language. I’m not sure what the ‘obvious’ solution should be, but for almost any purpose any of them (with the exception of the explicit loop) will work just fine.
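For example, a quick comparison of the vectorized options (the exact timings will depend on your machine):

    r <- 1000
    c <- 1000
    # Time each option; <- is needed inside system.time()
    system.time(m1 <- round(matrix(runif(r * c), r, c)))
    system.time(m2 <- matrix(rbinom(r * c, 1, 0.5), r, c))
    system.time(m3 <- matrix(round(runif(r * c)), r, c))
    system.time(m4 <- matrix(sample(0:1, r * c, replace = TRUE), r, c))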

Gratuitous picture: training horses on the beach (Photo: Luis).

P.S. 2012-08-29, 22:33. Another way that would work (although 4 times slower than m1) would be m5 = matrix(ifelse(runif(r*c) < 0.5, 0, 1), r, c) (the r and c dimensions are needed, or matrix() will return a single-column matrix). Still, it takes less than 0.5 seconds for a matrix with 1,000,000 elements.
P.S. 2012-09-04. Scott Chamberlain wrote his own blog post discussing the different suggestions. Even cooler, Dirk Eddelbuettel implemented even faster creation of random 1/0 matrices using Rcpp, with inline C++ code. Much faster in a crazy overkill sort of way.

R’s increasing popularity. Should we care?

Some people will say ‘you have to learn R if you want to get a job doing statistics/data science’. I say bullshit, you have to learn statistics and learn to work in a variety of languages if you want to be any good, beyond getting a job today coding in R.

R4stats has a recent post discussing the increasing popularity of R relative to other statistical software, using citation counts in Google Scholar. It is a flawed methodology, at least as flawed as other methodologies used to measure language popularity. Nevertheless, I think it is hard to argue against the general trend: R is becoming more popular. There is a deluge of books looking at R from every angle, thousands of packages and many job openings asking for R experience, which prompts the following question:

Should you/I/we care?

First answer: no. I try to use the best tool for the job, which often happens to be R but can also be Python, SAS or Fortran. It is nice to be able to use the same tool, say R, across a range of problems, but there are occasions when it feels like using Excel for statistics: one can do it, but one knows that it isn’t a great idea. I know good statisticians who prefer R, SAS or Genstat; the tool doesn’t make you good, in the same way that buying a Rickenbacker 4001 wouldn’t make me play like Geddy Lee.

Second answer: yes. Popularity attracts good people, who develop good packages, making new techniques available first in R. This doesn’t matter if you are into plain vanilla analyses (there is nothing wrong with that, by the way). Popularity + open source means that the system has been ported to a diversity of computer systems. Need R in a supercomputer? Done. R on a Mac? Done. R for your strange operating system, for which there are C and Fortran compilers? Download it and compile it. Done. There is also the ‘I’m not crazy’ aspect: other people take the software seriously.

Gratuitous picture: Firescapes II, night illuminated by bonfire (Photo: Luis).

I find people learning R only because of ‘job opportunities’ irritating, in the same way that I find people who learn JavaScript or Java only to get a job irritating. Give me, any time, people who learn R—or any other language for that matter—because they want to, because they are curious, because they want to share their discoveries with other people. Again, it is the difference between someone competent and someone great at what they do; great analysts are very much better than competent ones.

In the comments on the R4stats post there is a reference to R fanboys. Are R fanboys worse than fanboys of other statistical systems? In some respects the answer is yes, because many R users are also open source and open science supporters. Personally, I support both concepts, although I’m not dogmatic about them: I do buy some proprietary software and often can’t provide every detail about my work (commercially sensitive results). Maybe we are looking for a deeper change: we want to democratize statistics. We push for R not necessarily because it is intrinsically a better language, but because we can envision many people doing statistics to better understand the world around us, and R is free. Anyway, I would prefer you to call me a Python fanboy with a split R personality.

Teaching code, production code, benchmarks and new languages

I’m a bit obsessive with words. Maybe I should have used learning in the title, rather than teaching code. Or perhaps remembering code. You know? Code where one actually has a very clear idea of what is going on; for example, say we are calculating the average of a bunch of n numbers: we can have a loop that adds up each of them and then divides the total by n. Of course we wouldn’t do that in R, but would use a simple function: mean(x).
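In code, the contrast is something like this trivial sketch:

    # 'Teaching code': make every step of the average explicit
    x <- rnorm(100)
    n <- length(x)
    total <- 0
    for (i in 1:n) {
      total <- total + x[i]
    }
    total / n

    # What we would actually write in R
    mean(x)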

In a previous post I compared R and Julia code, and one of the commenters (Andrés) rightly pointed out that the code was inefficient. It was possible to speed up the calculation many times (and he sent me the code to back it up), because we could reuse intermediate results, generate batches of random numbers, etc. However, if you have studied the genomic selection problem, the implementations in my post are a lot closer to the algorithm. They are easier to follow and to compare, but not too flash in the speed department; for the latter we’d move to production code, highly optimized but not very similar to the original explanation.

This reminded me of the controversial Julia benchmarks, which implemented a series of 7 toy problems in a number of languages, following the ‘teaching code’ approach. Forcing ‘teaching code’ makes the comparison simple, as different language implementations look similar to each other. However, one is also throwing away the very thing that makes languages different: they have been designed and optimized with different typical uses in mind. For example, execution time for the pisum benchmark can be reduced sixty times just by replacing the internal loop with sum(1/c(1:10000)^2). Making comparisons easily understandable (so the code looks very similar) is orthogonal to making them realistic.
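For reference, that one-liner sums 1/k^2 for k up to 10,000 (the series converges to pi^2/6):

    # Vectorized replacement for the pisum inner loop
    sum(1 / c(1:10000)^2)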

Gratuitous picture: looking for the right bicycle in Uppsala (Photo: Luis).

Tony Boyles asked, tongue in cheek, ‘The Best Statistical Programming Language is …Javascript?’ Well, if you define best as the fastest language running seven benchmarks that may bear some resemblance to what you’d like to do (despite not having any statistical libraries), maybe the answer is yes.

I have to admit that I’m a sucker for languages; I like to understand new ways of expressing problems and, sometimes, one is lucky and finds languages that even allow tackling new problems. At the same time, most of my tests never progress beyond the ‘ah, cool, but it isn’t really that interesting’ stage. So, if you are going to design a new language for statistical analyses you may want to:

  • Piggyback on another language. Case in point: R, which is far from an immaculate conception but an implementation of S, which gave it a head start with both users and code. It may be that you extend a more general language (e.g. Incanter) rather than a domain-specific one.
  • However, one needs a clean syntax from the beginning (ah, Python); my theory is that this partly explains why Incanter got little traction. (((too-many-parentheses-are-annoying)))
  • Related to the previous point, make extensibility to other languages very simple. R’s is a pain, while Julia’s seems to be much more straightforward (judging by Douglas Bates’s examples).
  • Indices start at 1, not 0. Come on! This is stats and maths, not computer science. It is easier for us, your audience, to count from 1.
  • Programming structures are first-class citizens, not half-assed constructs that complicate life. Clear counterexamples are SAS’s macros and PROC IML, which are not conducive to people writing their own libraries/modules and sharing them with the community: they are poorly integrated with the rest of the SAS system.
  • Rely from day one on creating a community; as I pointed out in a previous post, one thing is having a cool base language but a different one is recreating R’s package ecosystem. Good luck recreating R’s ecosystem working on your own.
  • However, you have to create a base language with sane basics included: access to both plain text files and databases, easy summary functions (Xapply doesn’t cut it), GLMs and cool graphics (ggplot-like), or people may not get hooked on the language and start contributing.

Two interesting resources discussing the adoption of R are this paper (PDF) by John Fox and this presentation by John Cook. Incidentally, John Fox is the author of my favorite book on regression and generalized linear models: no R code at all, just explanations and the math behind it all. John Cook writes The Endeavour, a very interesting blog with a mathematical/programming bent.