In two previous posts I put together a data set and presented some exploratory data analysis on school achievement for national standards. After those posts I exchanged emails with a few people about the sources of data and Jeremy Greenbrook-Held pointed out Education Counts as a good source of additional variables, malady including number of teachers per school and proportions for different ethnic groups.

The code below call three files: Directory-Schools-Current.csv, teacher-numbers.csv and SchoolReport_data_distributable.csv, which you can download from the links.

This updated data set is more comprehensive but it doesn’t change the general picture presented in my previous post beyond the headlines. Now we can get some cool graphs to point out the obvious, for example the large proportion of Maori and Pacific Island students in low decile schools:

Proportion of Pacific Island (vertical axis) and Maori students (horizontal axis) in schools with points colored by decile. Higher proportions for both are observed in low decile schools.

I have avoided ‘proper’ statistical modeling because i- there is substantial uncertainty in the data and ii- the national standards for all schools (as opposed to only 1,000 schools) will be released soon; we do’t know if the published data are a random sample. In any case, a quick linear model fitting the proportion of students that meet reading standards (reading.OK) as a function of decile and weighted by total school roll—to account for the varying school sizes—will explain roughly 45% of the observed variability on reading achievement.

Model fit has a few issues with distribution of residuals, we should probably use a power transformation for the response variable, but I wouldn’t spend much more time before getting the full data for national standards.

Residuals of a quick weighted linear model. The residuals show some heterogeneity of variance (top-left) and deviation from normality (top-right) with heavy tails.

Bonus plot: map of New Zealand based on school locations, colors depicting proportion of students meeting reading national standards.

[sourcecode lang=”R”]
qplot(longitude, latitude,
data = standards, color = reading.OK)
[/sourcecode]

New Zealand drawn using school locations; color coding is for proportion of students meeting reading national standards.

P.S. 2012-09-26 16:01. The simple model above could be fitted taking into account the order of the decile factor (using `ordered()`) or just fitting linear and quadratic terms for a numeric expression of decile. Anyway, that would account for 45% of the observed variability.

P.S. 2012-09-26 18:17. Eric Crampton has posted preliminary analyses based on this dataset in Offsetting Behaviour.