Scraping pages and downloading files using R

2012-10-02 / Luis

I have written a few posts discussing descriptive analyses of the evaluation of National Standards for New Zealand primary schools. The data for roughly half of the schools was made available by the media, but the full version of the dataset is provided on a school-by-school basis. In the page for a given school there may be a link to a PDF file with the information on standards sent by the school to the Ministry of Education.

I’d like to keep a copy of the PDF reports for all the schools for which I do not have performance information, so I decided to write an R script to download just over 1,000 PDF files. Once I have identified all the schools with missing information I just loop over the list, using the fact that all URLs for the school pages start with the same prefix. For each school I download the page, look for the link to the PDF file and then download it, saving it as school_schoolnumber.pdf. And that’s it.

Of course life would be a lot simpler if the Ministry of Education made the information available in a usable form for analysis.

library(XML) # HTML processing
options(stringsAsFactors = FALSE)

# Base URL
base.url <- 'http://www.educationcounts.govt.nz/find-a-school/school/national?school='
download.folder <- '~/Downloads/schools/'

# Schools directory
directory <- read.csv('Directory-Schools-Current.csv')
directory <- subset(directory, 
                    !(school.type %in% c("Secondary (Year 9-15)", "Secondary (Year 11-15)")))

# Reading file from stuff.co.nz, obtained from here:
# http://schoolreport.stuff.co.nz/index.html
fairfax <- read.csv('SchoolReport_data_distributable.csv')
fairfax <- subset(fairfax, !is.na(reading.WB)) 

# Defining schools with missing information
to.get <- merge(directory, fairfax, by = 'school.id', all.x = TRUE)
to.get <- subset(to.get, is.na(reading.WB))

# Looping over schools, to find name of PDF file
# with information and download it

for(school in to.get$school.id){
  
  # Read HTML file, extract PDF link name
  cat('Processing school ', school, '\n')
  doc.html <- htmlParse(paste(base.url, school, sep = ''))
  doc.links <- xpathSApply(doc.html, "//a/@href")
  pdf.url <- as.character(doc.links[grep('pdf', doc.links)])
  if(length(pdf.url) > 0) {
    pdf.name <- paste(download.folder, 'school_', school, '.pdf', sep = '')
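    # PDFs are binary files, so write in binary mode ("wb");
    # if a page has several matching links, take the first one
    download.file(pdf.url[1], pdf.name, method = 'auto', quiet = FALSE, mode = "wb",
                  cacheOK = TRUE, extra = getOption("download.file.extra"))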
  }
}
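
A possible refinement of the loop above, offered only as a sketch: the pattern 'pdf' matches any link that merely contains those letters, and a single failed download would stop the whole run. The two helpers below, pick.pdf.link and safe.download, are hypothetical names that are not part of the original script. They assume that the report links end in ".pdf" (as in the example URL quoted in the comments) and that download.file signals an error when a download fails.

```r
# Hypothetical helper (not in the original script): keep only links that
# end in ".pdf", ignoring case, and return the first one (NA if none).
pick.pdf.link <- function(links) {
  links <- as.character(links)
  hits <- grep('\\.pdf$', links, ignore.case = TRUE, value = TRUE)
  if (length(hits) == 0) NA_character_ else hits[1]
}

# Hypothetical wrapper: attempt a download and return TRUE on success,
# FALSE if download.file raises an error, so the loop can carry on.
safe.download <- function(url, destfile) {
  tryCatch(
    download.file(url, destfile, mode = 'wb', quiet = TRUE) == 0,
    error = function(e) {
      cat('Could not download ', url, '\n')
      FALSE
    }
  )
}

# Example with made-up links: only the second one ends in ".pdf"
example.links <- c('/find-a-school/pdf-help', '/reports/1308_2011.pdf')
chosen <- pick.pdf.link(example.links)
```

Inside the loop, pdf.url could then be set with pick.pdf.link(doc.links), and the download attempted with safe.download(pdf.url, pdf.name) only when pdf.url is not NA.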

Can you help?

It would be great if you could help me to get the information from the reports. The following link randomly chooses a school; click on the "National Standards" tab and open the PDF file.

Then type the achievement numbers for reading, writing and mathematics in this Google Spreadsheet. No need to worry about different values per sex or ethnicity; the total values will do.

policy, programming, r, rblogs

8 Comments

John Baumgartner
2012-10-03 at 00:22

Very creative way of getting things done 🙂
Corey Chivers
2012-10-03 at 00:52

Could be a job for Amazon’s Mechanical Turk?
- Luis (Post author)
  2012-10-03 at 05:51
  
  It could, but this is a hobby and I’m hoping that people will participate.
John Baumgartner
2012-10-03 at 12:06

Just a heads up… results for some schools seem to be divided into years (e.g. http://www.educationcounts.govt.nz/__data/assets/pdf_file/0004/111865/1308_2011.pdf).
- Luis (Post author)
  2012-10-03 at 13:08
  
  Yes, we’ll have to add up all the numbers and then get the proportions. Interesting reports, with traffic light colors and using Comic Sans.
  
  Thanks for the support; it seems to be such a divisive issue!
Tom
2012-10-03 at 13:09

Isn’t it possible to mine the data out of a PDF?
http://www.r-bloggers.com/reading-and-text-mining-a-pdf-file-in-r/
- Luis (Post author)
  2012-10-03 at 13:29
  
  No, because the documents do not follow a preset format. One has to read each one to know where the relevant numbers are. In addition, some of them are text documents while others are images.
mdev
2012-10-21 at 12:13

This was a useful example of practical data analysis for me, especially the political and technical issues with the underlying data. I strung them together in RStudio as a reproducible report – that’s the nature of reporting in education.

The reply from a teacher was a lesson too: the complex information encoded in an overall grade or assessment, both subjective observations and objective test data.

Starting data analysis from that point might produce analyses which they would embrace, e.g. supplying the kind of ‘spreadsheet’ teachers are compelled to construct for themselves to record grading information. That might support reproducible reporting locally within schools, not just nationally.