Mucking around with maps, schools and ethnicity in NZ

I’ve been having a conversation for a while with @kamal_hothi and @aschiff on maps, schools, census, making NZ data available, etc. This post documents some basic steps I used for creating a map on ethnic diversity in schools at the census-area-unit level. This “el quicko” version requires 3 ingredients:

  • Census area units shape files (available from Statistics New Zealand for free here).
  • School directory (directory-school-current.csv available for free here).
  • R with some spatial packages (also free).

We’ll read the school directory data, aggregate ethnicity information to calculate the Herfindahl–Hirschman Index of diversity and then plot it.

# School directory
direc = read.csv('directory-school-current.csv', skip = 3)
 
# Total number of students for each ethnicity by Census Area Unit 
hhi = aggregate(cbind(European..Pakeha, Maori, Pasifika, Asian, MELAA, Other) ~ 
                Census.Area.Unit, data = direc, FUN = sum, na.rm = TRUE)
 
# Function to calculate the normalized Herfindahl–Hirschman Index
index = function(x) {
  total = sum(x, na.rm = TRUE)
  frac = x/total
  h = sum(frac^2)
  hn = ifelse(total > 1, (h - 1/total)/(1 - 1/total), 1)
  return(hn)
}
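 
# Quick sanity check with made-up rolls: a school with students from a
# single ethnicity should score 1, an evenly split roll close to 0
index(c(120, 0, 0, 0, 0, 0))     # 1 (no diversity)
index(c(20, 20, 20, 20, 20, 20)) # ~0.16 (very diverse)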
 
# Calculate the index for each area
hhi$hn = apply(hhi[,2:7], 1, index)
 
# Write data to use in QGis later
write.csv(hhi, 'diversity-index-census-area-unit.csv', quote = FALSE,
          row.names = FALSE)

Then I moved to create a map in R, for the sake of it:

library(rgdal) # for readOGR
library(sp)    # for spplot
 
# Reading shapefile
cau = readOGR(dsn='/Users/lap44/Dropbox/research/2015/census/2014 Digital Boundaries Generlised Clipped',
              layer='AU2014_GV_Clipped')
 
# Joining with school ethnicity data (notice we refer to @data, as cau contains spatial info as well)
cau@data = data.frame(cau@data, 
                      hhi[match(cau@data[,"AU2014_NAM"], hhi[,"Census.Area.Unit"]),])
 
# Limiting map to the area around Christchurch
spplot(cau, zcol = "hn", xlim = c(1540000, 1590000), 
       ylim= c(5163000, 5198000))
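
Not all census area units contain schools, so the match above leaves missing values in the joined data. A quick check of the join before plotting:

# Area units without any school end up with NA for the diversity index
sum(is.na(cau@data$hn))
summary(cau@data$hn)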

And we get a plot like this:

Ethnic diversity in schools at the Census Area Unit level (0 very diverse, 1 not diverse at all).

Just because it is Monday down under.

P.S. Using the diversity-index-census-area-unit.csv and joining it with the shapefile in QGIS one can get something prettier (I have to work on matching the color scales):

Similar map produced with point and click in QGIS.

Map rendering is so much faster in QGIS than in R! Clearly the system has been optimized for this use case.

Left-to-right

When I write code I’m often amazed by the direction of the statements. I mean, we read and write left-to-right, except when we assign the result of a statement to a variable. So here we are, doing our business, slowly working on a sentence and, puff!, we get this insight from the future and we assign it to our past, on the left. I commented on this on Twitter and @fadesingh pointed me to this artistic creation of a right-to-left programming language, except that the assignment was left-to-right.

R has a right-assign operator (->) and I wondered how the code would look if I wrote all the assignments to the right, adapting the code from my previous post.

options(stringsAsFactors = FALSE)
library(ggplot2)
 
# School directory
read.csv('directory-school-current.csv', skip = 3) -> direc
 
# Decile changes
read.csv('DecileChanges_20142015.csv', skip = 2) -> dc
 
subset(dc, 
       select = c(Inst.., X2014.Decile, X2015.Decile, X2014.Step, 
       X2015..Step, Decile.Change, Step.Change)) -> dc
 
c('School.ID', 'deci2014', 'deci2015', 'step2014', 'step2015', 
  'decile.change', 'step.change') -> names(dc)
 
c(LETTERS[1:17], 'Z') -> steps
c(905.81, 842.11, 731.3, 617.8, 507.01, 420.54, 350.25, 
  277.32, 220.59, 182.74, 149.99, 135.12, 115.76, 93.71, 
  71.64, 46.86, 28.93, 0) -> money
 
within(dc, {
  ifelse(step2014 == '', NA, step2014) -> step2014
  ifelse(step2015 == '', NA, step2015) -> step2015
  sapply(step2014, function(x) money[match(x, steps)]) -> sm2014
  sapply(step2015, function(x) money[match(x, steps)]) -> sm2015
  sm2015 - sm2014 -> funding.change
}) -> dc
 
merge(dc, 
      direc[, c('School.ID', 'Total.School.Roll', 'Urban.Area', 'Regional.Council')], 
      by = 'School.ID', all.x = TRUE) -> dc
 
within(dc, {
  factor(Urban.Area) -> Urban.Area
  factor(Urban.Area, levels(Urban.Area)[c(3, 2, 4, 1)]) -> Urban.Area
  funding.change*Total.School.Roll/1000 -> school.level.change
}) -> dc
 
 
with(dc, {
  print(summary(funding.change))
  print(summary(school.level.change))
 
  # Totals in NZ$ millions under the 2014 and 2015 steps
  print(sum(sm2014*Total.School.Roll/1000000, na.rm = TRUE))
  print(sum(sm2015*Total.School.Roll/1000000, na.rm = TRUE))
})
 
#### By Urban area
# Funding change per student on school size
qplot(Total.School.Roll, funding.change, data = dc[!is.na(dc$Urban.Area),], alpha = 0.8,
      xlab = 'Total School Roll', ylab = 'Funding change per student (NZ$)') + 
  theme_bw() + theme(legend.position = 'none') + facet_grid(~Urban.Area)

Not too alien after all; in fact, using the magrittr package together with right assign would make a lot more sense.
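
As a sketch of what that might look like (same files and column selection as above), a magrittr pipeline reads left-to-right all the way and finishes with a right assign:

library(magrittr)
 
# Read, subset and rename in a single left-to-right stream
read.csv('DecileChanges_20142015.csv', skip = 2) %>%
  subset(select = c(Inst.., X2014.Decile, X2015.Decile, X2014.Step, 
                    X2015..Step, Decile.Change, Step.Change)) %>%
  setNames(c('School.ID', 'deci2014', 'deci2015', 'step2014', 'step2015', 
             'decile.change', 'step.change')) -> dc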

Funding change per student (NZ$) on total school roll (click to enlarge).

Back of the envelope look at school decile changes

Currently there is some discussion in New Zealand about the effect of the reclassification of schools in socioeconomic deciles. An interesting aspect of the funding system in New Zealand is that state and state-integrated schools with poorer families receive substantially more funding from the government than schools that receive students from richer families (see this page in the Ministry of Education’s website).

One thing that I hadn’t noticed before is that funding decisions are more granular than simply using deciles, as deciles 1 to 4 are split into 3 steps each. For example, for Targeted Funding for Educational Achievement in 2015 we get the following amounts per student by decile: 1 (A: $905.81, B: $842.11, C: $731.30), 2 (D: $617.80, E: $507.01, F: $420.54), 3 (G: $350.25, H: $277.32, I: $220.59), 4 (J: $182.74, K: $149.99, L: $135.12), 5 ($115.76), 6 ($93.71), 7 ($71.64), 8 ($46.86), 9 ($28.93) and 10 ($0).

The Ministry of Education states that 784 schools ‘have moved to a higher decile rating’ while 800 ‘have moved to a lower decile rating’ (800 didn’t move). They do not mean that those numbers of schools changed deciles; the figures also include changes of steps within deciles. Another issue is that moving one step at the bottom of the scale (e.g. $63.70 from 1A to 1B) is not the same as moving one at the top ($28.93 from 9 to 10); that is, the relationship is not linear, as the quick check below shows.
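
Here is that quick check, using the same steps and money vectors defined in the code further below, to show the dollar gap between consecutive funding steps:

steps = c(LETTERS[1:17], 'Z')
money = c(905.81, 842.11, 731.3, 617.8, 507.01, 420.54, 350.25, 277.32, 
          220.59, 182.74, 149.99, 135.12, 115.76, 93.71, 71.64, 46.86, 28.93, 0)
 
# Dollar drop when moving down one step: large at the bottom, small at the top
round(-diff(money), 2)
#  63.70 110.81 113.50 110.79  86.47  70.29  72.93  56.73  37.85
#  32.75  14.87  19.36  22.05  22.07  24.78  17.93  28.93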

I assume that the baseline to measure funding changes is how much a school would get per student in 2015 without any change of decile/step; that is, funding assuming that the previous step within decile had stayed constant. Then we can calculate how much a school will get per student with the new decile/step. I have limited this ‘back of the envelope’ calculation to Targeted Funding for Educational Achievement, which is not the only source of funding linked to deciles. There are other allocations, like the Special Education Grant and the Careers Information Grant, but they are of much smaller magnitude (at most $73.94 and $37.31 per student, respectively) and the maximum difference between deciles 1 and 10 is around 2:1.

options(stringsAsFactors = FALSE)
library(ggplot2)
 
# School directory
direc = read.csv('directory-school-current.csv', skip = 3)
 
# Decile changes
dc = read.csv('DecileChanges_20142015.csv', skip = 2)
dc = subset(dc, select = c(Inst.., X2014.Decile, X2015.Decile, X2014.Step, X2015..Step, Decile.Change, Step.Change))
names(dc) = c('School.ID', 'deci2014', 'deci2015', 'step2014', 'step2015', 'decile.change', 'step.change')
 
# Getting rid of missing values
dc$step2014 = with(dc, ifelse(step2014 == '', NA, step2014))
dc$step2015 = with(dc, ifelse(step2015 == '', NA, step2015))

Steps are in capital letters and need to be translated into money. Once we get that we can calculate differences at both student level and school level:

steps = c(LETTERS[1:17], 'Z')
money = c(905.81, 842.11, 731.3, 617.8, 507.01, 420.54, 350.25, 277.32, 
          220.59, 182.74, 149.99, 135.12, 115.76, 93.71, 71.64, 46.86, 28.93, 0)
 
dc = within(dc, {
  sm2014 = sapply(step2014, function(x) money[match(x, steps)])
  sm2015 = sapply(step2015, function(x) money[match(x, steps)])
  # Change per student
  funding.change = sm2015 - sm2014
})
 
summary(dc$funding.change)
#    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#-812.100  -22.070    0.000   -3.448   22.070  707.000
 
# Merging with school directory data
dc = merge(dc, direc[, c('School.ID', 'Total.School.Roll')], 
           by = 'School.ID', all.x = TRUE)
 
# Calculating total change at school level
# considering roll in thousands of dollars
dc$school.level.change = with(dc, funding.change*Total.School.Roll/1000)
summary(dc$school.level.change)
#    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#-137.100   -2.820    0.000    2.689    3.627  413.100

If we look at the middle 50% of schools, the changes were fairly small: approximately +/- $22 per student per year or, at the school level, +/- $3,000 per year.

An interesting, though not entirely surprising, graph plots the change of funding against the size of the school. Large schools are much more stable in decile/step than small ones.

# At student level
qplot(Total.School.Roll, funding.change, data = dc, alpha = 0.8,
      xlab = 'Total School Roll', ylab = 'Funding change per student (NZ$)') + 
  theme_bw() + theme(legend.position = 'none')
 
# At school level
qplot(Total.School.Roll, school.level.change, data = dc, alpha = 0.8,
      xlab = 'Total School Roll', ylab = 'Funding change per school (NZ$ 000)') + 
  theme_bw() + theme(legend.position = 'none')

Change of funding per student per year (NZ$) on size of the school (number of students).

Change of funding per school per year (thousands of NZ$) on school size (number of students).

Overall, there is a small change in the total amount of money for Targeted Funding for Educational Achievement under the reclassified system versus the old deciles ($125M using 2014 deciles versus $132M using 2015 deciles) and for most schools the changes do not seem dramatic. There are, however, a number of schools (mostly small ones) that have had substantial changes to their funding. Very small schools will tend to display the largest changes, as the arrival or departure of only a few pupils with very different socioeconomic backgrounds can have a substantial effect. An example is Mata School in the Gisborne area, which moved 13 steps in decile funding (from A to N) with a roll of 11 kids; a quick calculation of that shift follows below. How to maintain a more steady funding regime seems to be a difficult challenge in those cases.
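
As a rough check of the size of that shift, using the steps and money vectors from above:

# Per-student yearly change for a move from step A to step N, which
# matches the largest drop (Min.) in the earlier summary
money[match('N', steps)] - money[match('A', steps)]
#-812.1
 
# Change at the school level for a roll of 11 students (NZ$ per year)
(money[match('N', steps)] - money[match('A', steps)])*11
#-8933.1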

One consequence of the larger variability in small schools is that rural areas will be more affected by large changes of funding. While overall 34% of schools had no change to their decile/step classification, in rural areas that figure drops to 22%; on top of that, the magnitude of the changes for rural schools is also larger. A sketch of that tabulation follows.
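
This is a minimal sketch of how that split could be tabulated, assuming we merge the Urban.Area classification from the school directory (as in the right-assign version earlier) and that a zero in both decile.change and step.change means no change:

# Add the urban/rural classification from the school directory
dc = merge(dc, direc[, c('School.ID', 'Urban.Area')], 
           by = 'School.ID', all.x = TRUE)
 
# Proportion of schools with no change of decile or step, by area type
with(dc, tapply(decile.change == 0 & step.change == 0, Urban.Area, 
                function(x) round(mean(x, na.rm = TRUE), 2)))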

Footnote:

Data files used for this post: DecileChanges_20142015 and directory-school-current.

Operational school funding is much more complex than deciles, as it includes allocations depending on number of students, use of Maori language, etc.

P.S. Stephen Senn highlights an obvious problem with the language the Ministry uses: there are 9 deciles (the points splitting the distribution into 10 parts). We should be talking about tenths, a much simpler word, instead of deciles.

Paying for a job well done

At the moment I am writing R code that involves a lot of simulation for a project. This time I wanted to organize the work properly: put a package together, document it… the whole shebang. Hadley Wickham has excellent documentation for this process in Advanced R, which works very well as a website. Up to this point there is nothing new; but the material is also available as a book.

At this point in my life I do not want to have a physical object if I can avoid it. On top of that, code tutorials work a lot better as a website, so one can copy, paste and experiment. PDF or ebooks are not very handy for this subject either. Here enters a revolutionary notion: I like to pay people who do a good job and, in the process, make my job easier but sometimes I do not want an object in exchange.

One short-term solution: asking Hadley for his favorite charity and donating the cost of a copy of the book. That would make most people happy except, perhaps, the publisher. I then remembered this idea by Cory Doctorow, in which he acts as a middleman between people who wish to pay him for his stories (but do not want a physical copy of the books) and school libraries that wish to have copies of the books.

Wouldn’t it be nice to have an arrangement like that for programming and research books? For example, we could match R learners who would prefer physical books but cannot afford them with people willing to pay for those copies.

Paying for intangibles (Photo: Luis, click to enlarge).

Comment on Sustainability and innovation in staple crop production in the US Midwest

After writing a blog post about the paper “Sustainability and innovation in staple crop production in the US Midwest” I decided to submit a formal comment to the International Journal of Agricultural Sustainability in July 2013, which was published today. As far as I know, Heinemann et al. provided a rebuttal to my comments, which I have not seen but should be published soon. This post is an example of how we can use open data (in this case from the USDA and FAO) and free software (R) to participate in scientific discussion (see supplementary material below).

The text below the *** represents my author’s version, provided as part of my Green Access rights. The article was published in the International Journal of Agricultural Sustainability [copyright Taylor & Francis] and is freely available online at http://dx.doi.org/10.1080/14735903.2014.939842.

While I had many issues with the original article, I decided to focus on three problems to keep the submission brief and fit under the 1,000-word limit enforced by the journal editor. The first point I make is a summary of my previous post on the article; I then move on to two big problems: assuming that only the choice of biotechnology affects yield (making the comparison between the USA and Western Europe inadequate) and comparing the use of agrochemicals at the wrong scale (national level versus crop level).

PS 2014-08-05 16:30: Heinemann et al.’s reply to my comment is now available.

  • They state that “Apiolaza did not report variability in the annual yield data” and that they had to ‘reconstruct’ my figure. Firstly, the published figure 2 does include error bars and, secondly, the data and R code are available as supplementary information (in contrast to their original paper). Let’s blame the journal for not passing this information to them.
  • They also include 2 more years of data (2011 & 2012), although the whole point of my comment is that the conclusions they derived from the original data set were not justified. Heinemann et al. are right in that yields in 2011 & 2012 were higher in Western Europe than in the USA (the latter suffering a huge drought); however, in 2013 it was the opposite: 99,695 hg/ha for the USA vs 83,724 hg/ha for Western Europe.
  • They question that I commented only on one crop, although I did cover another crop (wheat) quickly with respect to the use of pesticides, though not in the detail I wanted given the 1,000-word limit. I would have also mentioned the higher variability of Western European wheat production, as it is odd that they pointed out that high variability is a problem for USA maize production but said nothing about wheat in Europe.
  • Furthermore, they also claimed that “There is no statistically significant difference in the means over the entire 50-year period”; however, a naive paired t-test t.test(FAOarticle$Yield[1:50], FAOarticle$Yield[51:100], paired = TRUE) (see full code below and the sketch after this list) says otherwise.
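
For completeness, here is a minimal sketch of that test. It assumes, as in the supplementary csv below, that the data frame stacks 50 years of United States maize yields on top of 50 years of Western European yields, sorted by year within each region; the file name is illustrative:

# Maize yield data (see supplementary material for the actual csv)
FAOarticle = read.csv('FAOarticle.csv')
 
# Paired t-test: same years, two regions (rows 1-50 are United States,
# rows 51-100 are Western Europe in this stacked layout)
t.test(FAOarticle$Yield[1:50], FAOarticle$Yield[51:100], paired = TRUE)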

***

Abstract

This comment highlights issues that arise when comparing genetically modified (GM) crops to non-GM ones across countries. Ignoring structural differences between agricultural sectors and assuming common yield trajectories before the time of introduction of GM crops results in misestimating the effect of GM varieties. Further data collection and analyses should guide policy-makers to encourage diverse approaches to agriculture, rather than excluding specific technologies (like GM crops) from the outset.

Keywords: genetic modification; biotechnology; productivity; economics

In a recent article Heinemann et al. (2013) focused “on the US staple crop agrobiodiversity, particularly maize”, using the contrast between the yields of Western Europe and the United States as a proxy for the comparison between genetically modified (GM) maize and non-GM maize. They found no yield benefit from using GM maize when comparing the United States to Western Europe.

In addition, Heinemann et al. contrasted wheat yields across the United States and Western Europe to highlight the superiority of the European biotechnological package from a sustainability viewpoint.

I am compelled to comment on two aspects that led the authors to draw incorrect conclusions on these issues. My statistical code and data are available as supplementary material.


1. Misestimating the effect of GM maize varieties

Heinemann et al. used FAO data, from 1961 to 2010 inclusive, to fit linear models with yield as the response variable and country and year as predictors. Based on this analysis they concluded, “W. Europe has benefitted from the same, or marginally greater, yield increases without GM”. However, this assumes a common yield trajectory for the United States and Western Europe before significant commercial use of GM maize, conflating GM and non-GM yields. GM maize adoption in the United States has continually increased, from 25% of the planted maize area in 2000 to the current 90% (Figure 1, United States Department of Agriculture 2013).

Figure 1: Adoption of GM maize in the United States, expressed as percentage of total maize planted area (click to enlarge).

If we fit a linear model from 1961 to 1999 (the last year with less than 25% of GM maize area) we obtain the following regression equations: \(y = 1094.8 x + 39895.6\) (United States, R2 = 0.80) and \(y = 1454.5 x + 29802.2\) (W. Europe, R2 = 0.90). This means that Western Europe started with a considerably lower yield than the USA (29,802.2 vs 39,895.6 hg/ha) in 1961 but increased yields faster than the USA (1,454.5 vs 1,094.8 hg/ha per year) before the substantial use of GM maize. By 1999 yield in Western Europe was superior to that of the United States.
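
These regressions can be reproduced from the supplementary data. A sketch, under the same assumptions about the stacked data frame as before, plus assumed Country and Year columns (the country labels are illustrative and should match the csv):

# Pre-GM-era (1961-1999) linear trends; x = 0 corresponds to 1961
usa.lm = lm(Yield ~ I(Year - 1961), 
            data = subset(FAOarticle, Country == 'United States of America' & Year < 2000))
eur.lm = lm(Yield ~ I(Year - 1961), 
            data = subset(FAOarticle, Country == 'Western Europe' & Year < 2000))
 
summary(usa.lm) # slope ~1094.8, intercept ~39895.6
summary(eur.lm) # slope ~1454.5, intercept ~29802.2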

This is even more evident in Figure 2, which shows average yield per decade, removing year-to-year extraneous variation (e.g. due to weather). Western European yields surpassed those of the United States during the 1990s. The trend reverses in the 2000s, at the same time as the United States increased the percentage of planted area with GM maize, directly contradicting Heinemann et al.’s claim.
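
A sketch of the decade averages behind Figure 2, under the same assumptions about the data frame:

# Average yield and standard error per decade and region
# (the 2010s contain a single year, so their standard error is NA)
FAOarticle$Decade = 10*floor(FAOarticle$Year/10)
se = function(x) sd(x)/sqrt(length(x))
aggregate(Yield ~ Country + Decade, data = FAOarticle, FUN = mean)
aggregate(Yield ~ Country + Decade, data = FAOarticle, FUN = se)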

Figure 2: Average maize yield (and standard error) per decade for United States and Western Europe. The 2010s include a single year to replicate the original data set (click to enlarge).

2. Ignoring structural differences between agricultural sectors

When discussing non-GM crops using wheat, the authors state, “the combination of biotechnologies used by W. Europe is demonstrating greater productivity than the combination used by the United States”. This sentence summarizes one of the central problems of their article: assuming that, if it were not for the choice of biotechnology bundle, the agricultural sectors would have the same intrinsic yield, making them comparable. However, many inputs besides biotechnology affect yield. For example, Neumann et al. (2010) studied the spatial distribution of yield and found that in the United States “access can explain most of the variability in wheat efficiency. In the more remote regions land prices are lower and inputs are therefore often substituted by land leading to lower efficiencies”. Lower yields in the United States make sense from an economic point of view, as land replaces more expensive inputs like agrochemicals.

Heinemann et al. support their case by comparing pesticide use between the United States and France across all crops. However, what is relevant to the discussion is pesticide use for the crops being compared. European cereals, and wheat in particular, are the most widely fungicide-treated group of crops worldwide (Kuck et al. 2012). For example, 27% of the wheat planted area in France was already treated with fungicides by 1979 (Jenkins and Lescar 1980). More than 30 years later, in the United States this figure has reached only 19% for winter wheat (which accounts for 70% of planted area, NASS 2013). Fungicide applications result in higher yield responses (Oerke 2006).

Final remarks

Heinemann et al. ignored available data on GM adoption when analysing maize yields. They also mistakenly treated biotechnological bundles as the only (or main) explanation for non-GM yield differences between the United States and Western Europe. These issues mean that the thrust of their general conclusion is unsupported by the available evidence. Nevertheless, their article also raised issues that deserve more consideration; e.g. the roles of agricultural subsidies and market concentration in food security.

Agricultural sustainability requires carefully matching options in the biotechnology portfolio to site-specific economic, environmental and cultural constraints. Further data collection and analyses should lead policy-makers to encourage diverse approaches to agriculture, rather than excluding specific technologies (like GMOs and pesticides) from the outset.

References

Heinemann, J. A., Massaro, M., Coray, D. S., Agapito-Tenfen, S. Z. and Wen, J. D. 2013. Sustainability and innovation in staple crop production in the US Midwest. International Journal of Agricultural Sustainability (available here).

Jenkins, J. E. E. and Lescar, L. 1980. Use of foliar fungicides on cereals in Western Europe. Plant Disease, 64(11): 987-994 (behind paywall).

Kuck, K. H., Leadbeater, A. and Gisi, U. 2012. FRAC Mode of Action Classification and Resistance Risk of Fungicides. In: Krämer, W., Schirmer, U., Jeschke, P. and Witschel, M., eds., Modern Crop Protection Compounds. Wiley. 539-567.

NASS, 2013. Highlights: 2012 Agricultural Chemical Use Survey. Wheat. United States Department of Agriculture (available here).

Neumann, K., Verburg, P. H., Stehfest, E., and Müller, C. 2010. The yield gap of global grain production: a spatial analysis. Agricultural Systems, 103(5), 316–326 (behind paywall).

Oerke, E. 2006. Crop losses to pests. Journal of Agricultural Science, 144: 31-43 (behind paywall, or free PDF).

United States Department of Agriculture. 2013. Adoption of Genetically Engineered Crops in the U.S. USDA Economic Research Service (available here).

Supplementary material

You can replicate the analyses and plots produced in this comment using the following files:

  • Maize production data for United States and Western Europe (csv, extracted from FAO).
  • GMO maize penetration data (csv, extracted from USDA).
  • R code for analyses (R file, changed extension to .txt so WordPress would not complain).