I have written a few posts discussing descriptive analyses of the evaluation of National Standards for New Zealand primary schools. The data for roughly half of the schools was made available by the media, but the full version of the dataset is provided on a school-by-school basis. On the page for a given school there may be a link to a PDF file with the information on standards sent by the school to the Ministry of Education.

I’d like to keep a copy of the PDF reports for all the schools for which I do not have performance information, so I decided to write an R script to download just over 1,000 PDF files. Once I have identified all the schools with missing information I just loop over the list, using the fact that all URLs for the school pages start with the same prefix. I download the page, look for the name of the PDF file and then download the PDF file, which is saved as school_schoolnumber.pdf. And that’s it.

Of course life would be a lot simpler if the Ministry of Education made the information available in a usable form for analysis.

```r
library(XML) # HTML processing
options(stringsAsFactors = FALSE)

# Base URL
base.url = 'http://www.educationcounts.govt.nz/find-a-school/school/national?school='
download.folder = '~/Downloads/schools/'

# Schools directory
directory <- read.csv('Directory-Schools-Current.csv')
directory <- subset(directory,
                    !(school.type %in% c("Secondary (Year 9-15)", "Secondary (Year 11-15)")))

# Reading file from stuff.co.nz, obtained from here:
# http://schoolreport.stuff.co.nz/index.html
fairfax <- read.csv('SchoolReport_data_distributable.csv')
fairfax <- subset(fairfax, !is.na(reading.WB))

# Defining schools with missing information
to.get <- merge(directory, fairfax, by = 'school.id', all.x = TRUE)
to.get <- subset(to.get, is.na(reading.WB))

# Looping over schools, to find name of PDF file
# with information and download it
for(school in to.get$school.id){
  # Read HTML file, extract PDF link name
  cat('Processing school ', school, '\n')
  doc.html <- htmlParse(paste(base.url, school, sep = ''))
  doc.links <- xpathSApply(doc.html, "//a/@href")
  pdf.url <- as.character(doc.links[grep('pdf', doc.links)])
  if(length(pdf.url) > 0) {
    pdf.name <- paste(download.folder, 'school_', school, '.pdf', sep = '')
    # mode = "wb" writes the PDF as a binary file (it matters on Windows)
    download.file(pdf.url, pdf.name, method = 'auto', quiet = FALSE,
                  mode = "wb", cacheOK = TRUE,
                  extra = getOption("download.file.extra"))
  }
}
```
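The script matches any link containing 'pdf', and a single failed download stops the loop. A minimal sketch of a more defensive version of those two steps follows. The helpers `pick.pdf.links` and `safe.download` are hypothetical names and not part of the original script, and the example links are made up.

```r
# Hypothetical helper (not in the original script): keep only the links
# whose path ends in '.pdf', ignoring case.
pick.pdf.links <- function(links) {
  links <- as.character(links)
  links[grepl('\\.pdf$', links, ignore.case = TRUE)]
}

# Hypothetical wrapper: attempt one download and return TRUE or FALSE
# instead of letting an error stop the whole loop.
safe.download <- function(url, destfile) {
  tryCatch({
    # mode = 'wb' writes the PDF as a binary file
    status <- download.file(url, destfile, mode = 'wb', quiet = TRUE)
    status == 0
  }, error = function(e) {
    message('Could not download ', url, ': ', conditionMessage(e))
    FALSE
  })
}

# Made-up links, for illustration only
example.links <- c('/find-a-school',
                   '/assets/pdf_file/0001/report.pdf',
                   'http://example.org/pdfs/index.html')
pdf.links <- pick.pdf.links(example.links)
print(pdf.links)
```

Inside the loop one could then call `safe.download(pdf.links[1], pdf.name)` and keep a list of the schools for which it returned `FALSE`, to retry them later.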

## Can you help?

It would be great if you could help me to get the information from the reports. The following link randomly chooses a school; click on the “National Standards” tab and open the PDF file.

Then type the achievement numbers for reading, writing and mathematics in this Google Spreadsheet. No need to worry about different values per sex or ethnicity; the total values will do.
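Once the totals are typed in, turning them into proportions is a one-liner in R. This is only a sketch: the counts below are made up, and the four labels (WB, B, A, AA for well below, below, at and above the standard) are assumed names, not necessarily the column names in the spreadsheet.

```r
# Made-up totals for one school's reading results, for illustration only.
# The labels are assumed: WB = well below, B = below, A = at, AA = above.
reading <- c(WB = 5, B = 12, A = 40, AA = 23)

# Proportion of students in each category
reading.prop <- reading / sum(reading)
print(round(reading.prop, 3))
```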

Gratuitous picture: a simple summer lunch (Photo: Luis).

• 2012/10/03 at 5:51 am

It could, but this is a hobby and I’m hoping that people will participate.

• 2012/10/03 at 1:08 pm

Yes, we’ll have to add up all the numbers and then get the proportions. Interesting reports, with traffic light colors and using Comic Sans.

Thanks for the support; it seems to be such a divisive issue!

• 2012/10/03 at 1:29 pm

No, because the documents do not follow a preset format. One has to read each of them to find where the relevant numbers are. In addition, some of them are text documents while others are images.