Mucking around with maps, schools and ethnicity in NZ

I’ve been having a conversation for a while with @kamal_hothi and @aschiff on maps, schools, census, making NZ data available, etc. This post documents some basic steps I used for creating a map on ethnic diversity in schools at the census-area-unit level. This “el quicko” version requires 3 ingredients:

  • Census area units shape files (available from Statistics New Zealand for free here).
  • School directory (directory-school-current.csv available for free here).
  • R with some spatial packages (also free).

We’ll read the school directory data, aggregate ethnicity information to calculate the Herfindahl–Hirschman Index of diversity and then plot it.

# School directory
direc = read.csv('directory-school-current.csv', skip = 3)
# Total number of students for each ethnicity by Census Area Unit 
hhi = aggregate(cbind(European..Pakeha, Maori, Pasifika, Asian, MELAA, Other) ~ 
                Census.Area.Unit, data = direc, FUN = sum, na.rm = TRUE)
# Function to calculate 
index = function(x) {
  total = sum(x, na.rm = TRUE)
  frac = x/total
  h = sum(frac^2)
  hn = if(total > 1, (h - 1/total)/(1 - 1/total), 1)
Calculate the index for each area
hhi$hn = apply(hhi[,2:7], 1, index)
# Write data to use in QGis later
write.csv(hhi, 'diversity-index-census-area-unit.csv', quote = FALSE,
          row.names = FALSE)

Then I moved to create a map in R, for the sake of it:

library(rgdal) # for readOGR
library(sp)    # for spplot
# Reading shapefile
cau = readOGR(dsn='/Users/lap44/Dropbox/research/2015/census/2014 Digital Boundaries Generlised Clipped',
# Joining with school ethnicity data (notice we refer to @data, as cau contains spatial info as well)
cau@data = data.frame(cau@data, 
                      hhi[match(cau@data[,"AU2014_NAM"], hhi[,"Census.Area.Unit"]),])
# Limiting map to the area around Christchurch
spplot(cau, zcol = "hn", xlim = c(1540000, 1590000), 
       ylim= c(5163000, 5198000))

And we get a plot like this:

Ethnic diversity in schools at the Census Area Unit level (0 very diverse, 1 not diverse at all).

Ethnic diversity in schools at the Census Area Unit level (0 very diverse, 1 not diverse at all).

Just because it is Monday down under.

P.S. Using the diversity-index-census-area-unit.csv and joining it with the shapefile in QGIS one can get something prettier (I have to work on matching the color scales):

Similar map produced with point and click in QGIS.

Similar map produced with point and click in QGIS.

Map rendering is so much faster in QGIS than in R! Clearly the system has been optimized for this user case.

Spatial correlation in designed experiments

Last Wednesday I had a meeting with the folks of the New Zealand Drylands Forest Initiative in Blenheim. In addition to sitting in a conference room and having nice sandwiches we went to visit one of our progeny trials at Cravens. Plantation forestry trials are usually laid out following a rectangular lattice defined by rows and columns. The trial follows an incomplete block design with 148 blocks and is testing 60 Eucalyptus bosistoana families. A quick look at survival shows an interesting trend: the bottom of the trial was much more affected by frost than the top.

library(ggplot2) # for qplot
qplot(col, row, fill = Surv, geom='tile', data = cravens)

We have the trial assessed for tree height (ht) at 2 years of age, where a plot also shows some spatial trends, with better growth on the top-left corner; this is the trait that I will analyze here.

Tree height can be expressed as a function of an overall constant, random block effects, random family effects and a normally distributed residual (this is our base model, m1). I will then take into account the position of the trees (expressed as rows and columns within the trial) to fit spatially correlated residuals (m2)—using separable rows and columns processes—to finally add a spatially independent residual to the mix (m3).

# Base model (non-spatial)
m1 = asreml(ht ~ 1, random = ~ Block + Family, data = cravens)
[1] -8834.071
                       gamma component std.error   z.ratio constraint
Block!Block.var   0.52606739 1227.4058 168.84681  7.269345   Positive
Family!Family.var 0.06257139  145.9898  42.20675  3.458921   Positive
R!variance        1.00000000 2333.1722  78.32733 29.787459   Positive

m1 represents a traditional family model with only the overall constant as the only fixed effect and a diagonal residual matrix (identity times the residual variance). In m2 I am modeling the R matrix as the Kronecker product of two separable autoregressive processes (one in rows and one in columns) times a spatial residual.

# Spatial model (without non-spatial residual)
m2 = asreml(ht ~ 1, random = ~ Block + Family,
            rcov = ~ ar1(row):ar1(col),
            data = cravens[order(cravens$row, cravens$col),])
[1] -8782.112
                       gamma    component    std.error   z.ratio
Block!Block.var   0.42232303 1025.9588216 157.00799442  6.534437
Family!Family.var 0.05848639  142.0822874  40.20346956  3.534080
R!variance        1.00000000 2429.3224308  88.83064209 27.347798
R!row.cor         0.09915320    0.0991532   0.02981808  3.325271
R!col.cor         0.28044024    0.2804402   0.02605972 10.761445
Block!Block.var         Positive
Family!Family.var       Positive
R!variance              Positive
R!row.cor          Unconstrained
R!col.cor          Unconstrained

Adding two parameters to the model results in improving the log-likelihood from -8834.071 to -8782.112, a difference of 51.959, but with what appears to be a low correlation in both directions. In m3 I am adding an spatially independent residual (using the keyword units), improving log-likelihood from -8782.112 to -8660.411: not too shabby.

m3 = asreml(ht ~ 1, random = ~ Block + Family + units,
            rcov = ~ ar1(row):ar1(col),
            data = cravens[order(cravens$row, cravens$col),])
[1] -8660.411
                         gamma    component    std.error    z.ratio
Block!Block.var   3.069864e-07 4.155431e-04 6.676718e-05   6.223763
Family!Family.var 1.205382e-01 1.631630e+02 4.327885e+01   3.770040
units!units.var   1.354166e+00 1.833027e+03 7.085134e+01  25.871456
R!variance        1.000000e+00 1.353621e+03 2.174923e+02   6.223763
R!row.cor         7.814065e-01 7.814065e-01 4.355976e-02  17.938724
R!col.cor         9.529984e-01 9.529984e-01 9.100529e-03 104.719017
Block!Block.var        Boundary
Family!Family.var      Positive
units!units.var        Positive
R!variance             Positive
R!row.cor         Unconstrained
R!col.cor         Unconstrained

Allowing for the independent residual (in addition to the spatial one) has permitted higher autocorrelations (particularly in columns), while the Block variance has gone very close to zero. Most of the Block variation is being captured now through the residual matrix (in the rows and columns autoregressive correlations), but we are going to keep it in the model, as it represents a restriction to the randomization process. In addition to log-likelihood (or AIC) we can get graphical descriptions of the fit by plotting the empirical semi-variogram as in plot(variogram(m3)):