# Archive for September, 2012

This week I’ve tried to i-stay mostly in the descriptive statistics realm and ii-surround any simple(istic) models with caveats and pointing that they are very preliminary. We are working with a sample of ~1,000 schools that did reply to Fairfax’s request, while there is a number of schools that either ignored the request or told Fairfax to go and F themselves. Why am I saying this? If one goes and gets a simple table of the number of schools by type and decile there is something quite interesting: we have different percentages for different types of schools represented in the sample and the possibility of bias on the reporting to Fairfax, due to potential low performance (references to datasets correspond to the ones I used in this post):

summary(standards$school.type) # Composite (Year 1-10) Composite (Year 1-15) Contributing (Year 1-6) # 1 29 403 # Full Primary (Year 1-8) Intermediate (year 7 and 8) Restricted Composite (Yr 7-10) # 458 62 1 # Secondary (Year 7-15) # 56 Now let’s compare this number with the school directory: summary(factor(directory$school.type)) # Composite (Year 1-10) Composite (Year 1-15) Contributing (Year 1-6) # 4 149 775 # Correspondence School Full Primary (Year 1-8) Intermediate (year 7 and 8) # 1 1101 122 #Restricted Composite (Yr 7-10) Secondary (Year 11-15) Secondary (Year 7-10) # 4 2 2 # Secondary (Year 7-15) Secondary (Year 9-15) Special School # 100 238 39 # Teen Parent Unit # 20

As a proportion we are missing more secondary schools. We can use the following code to get an idea of how similar are school types, because the small number of different composite schools is a pain. If

# Performance of Contributing (Year 1-6) and # Full Primary (Year 1-8) looks pretty much the # same. Composites could be safely merged qplot(school.type, reading.OK, data = standards, geom = 'jitter')   qplot(school.type, writing.OK, data = standards, geom = 'jitter')   qplot(school.type, math.OK, data = standards, geom = 'jitter')   # Merging school types and plotting them colored # by decile standards$school.type.4 <- standards$school.type levels(standards$school.type.4) <- c('Composite', 'Composite', 'Primary', 'Primary', 'Intermediate', 'Composite', 'Secondary') qplot(school.type.4, reading.OK, colour = decile, data = standards, geom = 'jitter') Representation of different schools types and deciles is uneven. Different participations in the sample for school types. This type is performance in mathematics. I’m using jittering rather than box and whisker plots to i- depict all the schools and ii- get an idea of the different participation of school types in the dataset. Sigh. Another caveat to add in the discussion. P.S. 2012-09-27 16:15. Originally I mentioned in this post the lack of secondary schools (Year 9-15) but, well, they are not supposed to be here, because National Standards apply to years 1 to 8 (Thanks to Michael MacAskill for pointing out my error.) Eric and I have been exchanging emails about potential analyses for the school data and he published a first draft model in Offsetting Behaviour. I have kept on doing mostly data exploration while we get a definitive full dataset, and looking at some of the pictures I thought we could present a model with fewer predictors. The starting point is the standards dataset I created in the previous post: # Make authority a factor standards$authority <- factor(standards$authority) # Plot relationship between school size and number of FTTE # There seems to be a separate trend for secondary schools qplot(total.roll, teachers.ftte, data = standards, color = school.type) There seems to be a different trend for secondary vs non-secondary schools concerning the relationship between number of full time teacher equivalent and total roll. The presence of a small number of large schools suggests that log transforming the variables could be a good idea. # Create a factor for secondary schools versus the rest standards$secondary <- factor(ifelse(standards$school.type == 'Secondary (Year 7-15)', 'Yes', 'No')) # Plot the relationship between number of students per FTTE # and type of school qplot(secondary, total.roll/teachers.ftte, data = standards, geom = 'boxplot', ylab = 'Number of students per FTTE') Difference on the number of students per FTTE between secondary and non-secondary schools. Now we fit a model where we are trying to predict reading standards achievement per school accounting for decile, authority , proportion of non-european students, secondary schools versus the rest, and a different effect of number of students per FTTE for secondary and non-secondary schools. standards$students.per.ftte <- with(standards, total.roll/teachers.ftte)   # Decile2 is just a numeric version of decile (rather than a factor) m1e <- lm(reading.OK ~ decile2 + I(decile2^2) + authority + I(1 - prop.euro) + secondary*students.per.ftte, data = standards, weights = total.roll) summary(m1e)   #Coefficients: # Estimate Std. Error t value Pr(>|t|) #(Intercept) 0.6164089 0.0305197 20.197 < 2e-16 *** #decile2 0.0422733 0.0060756 6.958 6.24e-12 *** #I(decile2^2) -0.0015441 0.0004532 -3.407 0.000682 *** #authorityState: Not integrated -0.0440261 0.0089038 -4.945 8.94e-07 *** #I(1 - prop.euro) -0.0834869 0.0181453 -4.601 4.74e-06 *** #secondaryYes -0.2847035 0.0587674 -4.845 1.47e-06 *** #students.per.ftte 0.0023512 0.0011706 2.009 0.044854 * #secondaryYes:students.per.ftte 0.0167085 0.0040847 4.091 4.65e-05 *** ---   #Residual standard error: 1.535 on 999 degrees of freedom # (3 observations deleted due to missingness) #Multiple R-squared: 0.4942, Adjusted R-squared: 0.4906 #F-statistic: 139.4 on 7 and 999 DF, p-value: < 2.2e-16

The residuals are still a bit of a mess:

Residuals for this linear model: still a bit of a mess.

If we remember my previous post decile accounted for 45% of variation and we explain 4% more through the additional predictors. Non-integrated schools have lower performance, a higher proportion of non-European students reduce performance, secondary schools have lower performance and larger classes tend to perform better (Eric suggests reverse causality, I’m agnostic at this stage), although the rate of improvement changes between secondary and non-secondary schools. In contrast with Eric, I didn’t fit separate ethnicities as those predictors are related to each other and constrained to add up to one.

Of course this model is very preliminary, and a quick look at the coefficients will show that changes on any predictors besides decile will move the response by a very small amount (despite the tiny p-values and numerous stars next to them). The distribution of residuals is still heavy-tailed and there are plenty of questions about data quality; I’ll quote Eric here:

But differences in performance among schools of the same decile by definition have to be about something other than decile. I can’t tell from this data whether it’s differences in stat-juking, differences in unobserved characteristics of entering students, differences in school pedagogy, or something else. But there’s something here that bears explaining.

In two previous posts I put together a data set and presented some exploratory data analysis on school achievement for national standards. After those posts I exchanged emails with a few people about the sources of data and Jeremy Greenbrook-Held pointed out Education Counts as a good source of additional variables, including number of teachers per school and proportions for different ethnic groups.

The code below call three files: Directory-Schools-Current.csv, teacher-numbers.csv and SchoolReport_data_distributable.csv, which you can download from the links.

options(stringsAsFactors = FALSE)   library(ggplot2)   # Reading School directory and dropping some address information # for 2012 obtained from here: # http://www.educationcounts.govt.nz/directories/list-of-nz-schools directory <- read.csv('Directory-Schools-Current.csv') directory <- subset(directory, select = -1*c(3:9, 11:13))   # Reading teacher numbers for 2011 obtained from here: # http://www.educationcounts.govt.nz/statistics/schooling/teaching_staff teachers <- read.csv('teacher-numbers.csv')   # Reading file obtained from stuff.co.nz obtained from here: # http://schoolreport.stuff.co.nz/index.html fairfax <- read.csv('SchoolReport_data_distributable.csv')   # Merging directory and teachers info standards <- merge(directory, teachers, by = 'school.id', all.x = TRUE)   # Checking that the school names match # This shows ten cases, some of them obviously the same school # e.g. Epsom Girls Grammar School vs Epsom Girls' Grammar School tocheck <- subset(standards, school.name.x != school.name.y) tocheck[, c(1:2, 29)] #school.id school.name.x school.name.y #61 64 Epsom Girls Grammar School Epsom Girls' Grammar School #125 135 Fraser High School Hamilton's Fraser High School #362 402 Waiau Area School Tuatapere Community College #442 559 Te Wainui a Rua Whanganui Awa School #920 1506 St Michael's Catholic School (Remuera) St Michael's School (Remuera) #929 1515 Sunnyhills School Sunny Hills School #985 1573 Willow Park School Willowpark School #1212 1865 Te Wharekura o Maniapoto Te Wharekura o Oparure #1926 2985 Sacred Heart Cathedral School Sacred Heart School (Thorndon) #2266 3554 Waitaha School Waitaha Learning Centre   # Merging now with fairfax data standards <- merge(standards, fairfax, by = 'school.id', all.x = TRUE)   # Four schools have a different name tocheck2 <- subset(standards, school.name.x != school.name) tocheck2[, c(1:2, 32)] #school.id school.name.x school.name #920 1506 St Michael's Catholic School (Remuera) St Michael's School (Remuera) #929 1515 Sunnyhills School Sunny Hills School #985 1573 Willow Park School Willowpark School #2266 3554 Waitaha School Waitaha Learning Centre   # Checking the original spreadsheets it seems that, despite # the different name they are the same school (at least same # city)   # Now start dropping a few observations that are of no use # for any analysis; e.g. schools without data standards <- subset(standards, !is.na(reading.WB))   # Dropping school 498 Te Aho o Te Kura Pounamu Wellington # http://www.tekura.school.nz/ # It is a correspondence school without decile information standards <- subset(standards, school.id != 498)   # Converting variables to factors standards$decile <- factor(standards$decile) standards$school.type <- factor(standards$school.type) standards$education.region <- factor(standards$education.region, levels = c('Northern', 'Central North', 'Central South', 'Southern'))   # Removing 8 special schools standards <- subset(standards, as.character(school.type) != 'Special School') standards$school.type <- standards$school.type[drop = TRUE]   # Saving file for Eric # write.csv(standards, 'standardsUpdated.csv', # row.names = FALSE, quote = FALSE)   # Create performance groups standards$math.OK <- with(standards, math.At + math.A) standards$reading.OK <- with(standards, reading.At + reading.A) standards$writing.OK <- with(standards, writing.At + writing.A) # Creating a few proportions standards$prop.euro <- with(standards, european/total.roll) standards$prop.maori <- with(standards, maori/total.roll) standards$prop.pacific <- with(standards, pacific.island/total.roll) standards$prop.asian <- with(standards, asian/total.roll) This updated data set is more comprehensive but it doesn’t change the general picture presented in my previous post beyond the headlines. Now we can get some cool graphs to point out the obvious, for example the large proportion of Maori and Pacific Island students in low decile schools: qplot(prop.maori, prop.pacific, data = standards, color = decile, xlab = 'Proportion of Maori', ylab = 'Proportion of Pacific Island') Proportion of Pacific Island (vertical axis) and Maori students (horizontal axis) in schools with points colored by decile. Higher proportions for both are observed in low decile schools. I have avoided ‘proper’ statistical modeling because i- there is substantial uncertainty in the data and ii- the national standards for all schools (as opposed to only 1,000 schools) will be released soon; we do’t know if the published data are a random sample. In any case, a quick linear model fitting the proportion of students that meet reading standards (reading.OK) as a function of decile and weighted by total school roll—to account for the varying school sizes—will explain roughly 45% of the observed variability on reading achievement. m1 <- lm(reading.OK ~ decile, data = standards, weights = total.roll) summary(m1) # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 0.56164 0.01240 45.278 < 2e-16 *** # decile2 0.06238 0.01769 3.527 0.000439 *** # decile3 0.10890 0.01749 6.225 7.07e-10 *** # decile4 0.16434 0.01681 9.774 < 2e-16 *** # decile5 0.18579 0.01588 11.697 < 2e-16 *** # decile6 0.22849 0.01557 14.675 < 2e-16 *** # decile7 0.25138 0.01578 15.931 < 2e-16 *** # decile8 0.24849 0.01488 16.701 < 2e-16 *** # decile9 0.28861 0.01492 19.347 < 2e-16 *** # decile10 0.31074 0.01439 21.597 < 2e-16 *** # # Residual standard error: 1.594 on 999 degrees of freedom # (1 observation deleted due to missingness) # Multiple R-squared: 0.4548, Adjusted R-squared: 0.4499 # F-statistic: 92.58 on 9 and 999 DF, p-value: < 2.2e-16 Model fit has a few issues with distribution of residuals, we should probably use a power transformation for the response variable, but I wouldn’t spend much more time before getting the full data for national standards. par(mfrow = c(2, 2)) plot(m1) par(mfrow = c(1, 1)) Residuals of a quick weighted linear model. The residuals show some heterogeneity of variance (top-left) and deviation from normality (top-right) with heavy tails. Bonus plot: map of New Zealand based on school locations, colors depicting proportion of students meeting reading national standards. qplot(longitude, latitude, data = standards, color = reading.OK)  New Zealand drawn using school locations; color coding is for proportion of students meeting reading national standards. P.S. 2012-09-26 16:01. The simple model above could be fitted taking into account the order of the decile factor (using ordered()) or just fitting linear and quadratic terms for a numeric expression of decile. Anyway, that would account for 45% of the observed variability. m1d <- lm(reading.OK ~ as.numeric(decile) + I(as.numeric(decile)^2), data = standards, weights = total.roll) summary(m1d) #Coefficients: # Estimate Std. Error t value Pr(>|t|) #(Intercept) 0.5154954 0.0136715 37.706 < 2e-16 *** #as.numeric(decile) 0.0583659 0.0051233 11.392 < 2e-16 *** #I(as.numeric(decile)^2) -0.0023453 0.0004251 -5.517 4.39e-08 *** #--- #Residual standard error: 1.598 on 1006 degrees of freedom # (1 observation deleted due to missingness) #Multiple R-squared: 0.4483, Adjusted R-squared: 0.4472 #F-statistic: 408.8 on 2 and 1006 DF, p-value: < 2.2e-16 P.S. 2012-09-26 18:17. Eric Crampton has posted preliminary analyses based on this dataset in Offsetting Behaviour. Music and creativity relax me: Diego Stocco’s Experibass. I like the idea of having data on school performance, not to directly rank schools—hard, to say the least, at this stage—but because we can start having a look at the factors influencing test results. I imagine the opportunity in the not so distant future to run hierarchical models combining Ministry of Education data with Census/Statistics New Zealand data. At the same time, there is the temptation to come up with very simple analyses that would make appealing newspaper headlines. I’ll read the data and create a headline and then I’ll move to something that, personally, seems more important. In my previous post I combined the national standards for around 1,000 schools with decile information to create the standards.csv file. library(ggplot2) # Reading data, building factors standards <- read.csv('standards.csv') standards$decile.2008 <- factor(standards$decile.2008) standards$electorate <- factor(standards$electorate) standards$school.type <- factor(standards$school.type) # Removing special schools (all 8 of them) because their # performance is not representative of the overall school # system standards <- subset(standards, as.character(school.type) != 'Special School') standards$school.type <- standards$school.type[drop = TRUE] # Create performance variables. Proportion of the students that # at least meet the standards (i.e. meet + above) # Only using all students, because subsets are not reported # by many schools standards$math.OK <- with(standards, math.At + math.A) standards$reading.OK <- with(standards, reading.At + reading.A) standards$writing.OK <- with(standards, writing.At + writing.A)

Up to this point we have read the data, removed special schools and created variables that represent the proportion of students that al least meet standards. Now I’ll do something very misleading: calculate the average for reading.OK for each school decile and plot it, showing the direct relationship between socio-economic decile and school performance.

mislead <- aggregate(reading.OK ~ decile.2008, data = standards, mean) qplot(decile.2008, reading.OK , data = mislead)

Scatterplot of average proportion of students at least meeting the reading national standards for each socioeconomic decile.

Using only the average strengthens the relationship, but it hides something extremely important: the variability within each decile. The following code will display box and whisker plots for each decile: the horizontal line is the median, the box contains the middle 50% of the schools and the vertical lines 1.5 times the interquantile range (pretty much the range in most cases):

qplot(decile.2008, reading.OK, data = standards, geom = 'boxplot', xlab = 'Decile in 2008', ylab = 'Reading at or above National Standard')

Box and whisker plot for proportion of students at least meeting the reading national standard for each decile. Notice the variability for each decile.

A quick look at the previous graph shows a few interesting things:

• the lower the decile the more variable school performance,
• there is pretty much no difference between deciles 6, 7, 8 and 9, and a minor increase for decile 10.
• there is a trend for decreasing performance for lower deciles; however, there is also a huge variability within those deciles.

We can repeat the same process for writing.OK and math.OK with similar trends, although the level of standard achievement is lower than for reading:

qplot(decile.2008, writing.OK, data = standards, geom = 'boxplot', xlab = 'Decile in 2008', ylab = 'Writing at or above National Standard')   qplot(decile.2008, math.OK, data = standards, geom = 'boxplot', xlab = 'Decile in 2008', ylab = 'Mathematics at or above National Standard')

Box and whisker plot for proportion of students at least meeting the writing national standard for each decile.

Box and whisker plot for proportion of students at least meeting the mathematics national standard for each decile.

Achievement in different areas (reading, writing, mathematics) is highly correlated:

cor(standards[, c('reading.OK', 'writing.OK', 'math.OK')], use = 'pairwise.complete.obs')   # reading.OK writing.OK math.OK #reading.OK 1.0000000 0.7886292 0.7749094 #writing.OK 0.7886292 1.0000000 0.7522446 #math.OK 0.7749094 0.7522446 1.0000000   # We can plot an example and highlight decile groups # 1-5 vs 6-10 qplot(reading.OK, math.OK, data = standards, color = ifelse(as.numeric(decile.2008) > 5, 1, 0)) + opts(legend.position = 'none')

Scatterplot for proportion meeting mathematics and reading national standards. Dark points are deciles 1 to 5, while light points are deciles 6 to 10. Notice the large overlap for performance between the two groups.

All these graphs are only descriptive/exploratory; however, once we had more data we could move to hierarchical models to start adjusting performance by socioeconomic aspects, geographic location, type of school, urban or rural setting, school size, ethnicity, etc. Potentially, this would let us target resources on schools that could be struggling to perform; nevertheless, any attempt at creating a ‘quick & dirty’ ranking ignoring the previously mentioned covariates would be, at the very least, misleading.

Note 2012-09-26: I have updated data and added a few plots in this post.

Some people have the naive idea that the national standards data and the decile data will not be put together. Given that at some point in time all data will be available, there is no technical reason to not have merged the data, which I have done in this post for an early release.

Stuff made data for the National Standards for a bit over 1,000 schools available here as an Excel File. I cleaned it up a bit to read it in R, saved it to csv format, of which you can get a copy here. Decile data is available from the Ministry of Education, which I transformed to csv and uploaded here. We can now merge the datasets using R:

options(stringsAsFactors = FALSE)   deciles <- read.csv('DecileChanges20072008.csv') standards <- read.csv('SchoolReport_data_distributable.csv')   standards <- merge(standards, deciles, by = 'school.id')

But there are some name discrepancies in the merge, which we can check using:

standards$name.discrepancy <- factor(ifelse(standards$school.name.x != standards$school.name.y, 1, 0)) tocheck <- subset(standards[, c(1, 2, 64, 77)], name.discrepancy == 1) # school.id school.name.x school.name.y name.discrepancy #9 82 Aidanfield Christian School Canterbury Christian College 1 #33 266 Waipa Christian School Bethel Christian School 1 #73 429 Excellere College Kamo Christian College 1 #143 1245 Christ the King Catholic School (Owairak Christ The King School (Mt Roskill) 1 #150 1256 Cornwall Park District School Cornwall Park School 1 #211 1360 Marist Catholic School (Herne Bay) Marist School (Herne Bay) 1 #270 1500 St Leo's Catholic School (Devonport) St Leos School (Devonport) 1 #297 1588 St Francis Xavier Catholic School (Whang St Francis Xavier School (Whangarei) 1 #305 1650 Drummond Primary School Central Southland Rural Primary School 1 #314 1678 Te Kura o Waikaremoana Te Kura O Waikaremoana 1 #335 1744 Horahora School (Cambridge) Horahora School 1 #389 1991 Tauranga Primary School Tauranga School 1 #511 2424 Parkland School (P North) Parkland School 1 #587 2663 Reignier Catholic School Reignier School (Taradale) 1 #627 2841 Fergusson Intermediate (Trentham) Fergusson Intermediate 1 #771 3213 Parklands School (Motueka) Parklands School 1 #956 3801 Pine Hill School (Dunedin) Pine Hill School 1 There are four schools that have very different names in the dataset: 82, 266, 429 and 1650. They may just have changed names or something could be wrong. At this point one may choose to drop them from any analysis; up to you. We then just rename the second variable to school.name and save the file to csv (available here standards.csv for your enjoyment). names(standards)[2] <- 'school.name' write.csv(standards, 'standards.csv', row.names = FALSE, quote = FALSE) Now we should try to merge any other economic and social information available in Statistics New Zealand. Note 2012-09-26: I have updated data and added a few plots in this post. Oddities tend to jump out when one uses software in a daily basis. The situation is even clearer when using software for teaching: many more people looking at it with fresh eyes. Let’s say that we are fitting a simple linear model and we use the summary function, then POW! i- one gets all sorts of stars next to each of the coefficients and ii- some tiny p-values with lots of digits. Since immemorial times (childcare, at least) we got star stickers when doing a good job and here we have R doing the same. It is possible to remove the stars, I know, but the default is the subject of this post. x <- 1:20 y <- 5 + 2*x + rnorm(20) *5 m <- lm(y ~ x) summary(m) #Call: #lm(formula = y ~ x) #Residuals: # Min 1Q Median 3Q Max #-9.1030 -0.8772 0.1168 2.3375 9.5846 #Coefficients: # Estimate Std. Error t value Pr(>|t|) #(Intercept) 5.0193 2.0507 2.448 0.0249 * #x 2.0670 0.1712 12.074 4.57e-10 *** #--- #Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #Residual standard error: 4.415 on 18 degrees of freedom #Multiple R-squared: 0.8901, Adjusted R-squared: 0.884 #F-statistic: 145.8 on 1 and 18 DF, p-value: 4.571e-10 In contrast, I like the simplicity of the display() function in the arm package. If we use it with the same model m we get: library(arm) display(m) #lm(formula = y ~ x) # coef.est coef.se #(Intercept) 5.02 2.05 #x 2.07 0.17 #--- #n = 20, k = 2 #residual sd = 4.41, R-Squared = 0.89 There is an emphasis on understanding and awareness of limitations. Rather than a false sense of accuracy and precision, we get coefficients rounded to two decimal places, no stars and no p-values. Did you really believed that the p-value was 0.0249 instead of 0.0243? Defaults also require consistency, because consistency begets ‘intuitiveness’. Every function that deals with data should take a data argument and avoid being schizophrenic: df <- data.frame(x = rnorm(100), y = rnorm(100)) # This works fine plot(y ~ x, data = df) # This craps out plot(x, y, data = df) #Error in plot(x, y, data = df) : object 'x' not found I’ve mentioned in previous posts the apply() family of functions. Yes, I hate them with passion and that’s why I think we should unite and rely on Talk like a Pirate Day (Rrrrstats, matey!) to lead a mutiny to get plyr as the new default. However, even some of my favorite packages have strange defaults. In ggplot2, for example, the size (too small) and the color (too light) of the text in the axes is extremely hard to read. Maybe not for you young reader, but for a long-sighted, receding-hairline-guy with salt & pepper beard is a struggle. This is magnified when pasting said graph in a Keynote or PowerPoint presentation and then using a crappy projector. Again, thanks Hadley, I can define my own theme but there are thousands of difficult to read graphs doing the rounds because, hey, those are the defaults. Every time we define a default we have to think that, even when we allow change, most users will probably stick to it. Of course I’m the first one to acknowledge that my favorite defaults may not coincide with yours, as shown below. Sticker in a University of Canterbury toilet cubicle. There are cultural differences on the default correct position. This is one of those times of the year: struggling to keep the head above the water, roughly one month before the last lecture of the semester. On top trying to squeeze trips, meetings and presentations in between while dealing with man flu. Gratuitous picture: looking for peace in Japan (Photo: Luis). Suicide is a tragic and complex problem. This week New Zealand’s Chief Coroner released its annual statistics on suicide, which come with several tables and figures. One of those figures refers to monthly suicides in the Christchurch region (where I live) and comes with an interesting comment: Suicides in the Christchurch region (Timaru to Kaikoura) have risen from 67 (2010/11) to 81 (2011/12). The average number of suicides per year for this region over the past four years is 74. The figure of 67 deaths last year reflected the drop in suicides post-earthquake. The phenomenon of a drop in the suicide rate after a large scale crisis event, such as a natural disaster, has been observed elsewhere. [my emphasis] ‘Provisional Suicide deaths in relation to the Christchurch earthquakes’ this is the original header for the graph in the report. The first earthquake was in September 2010 and is marked with red dots. The figure highlights the earthquake and its major aftershocks using different colors. It is true that we have faced large problems following the 4th September 2010 earthquake and thousands of aftershocks, but can we really make the coroner’s interpretation (already picked up by the media)? In fact, one could have a look at the data before the earthquake, where there are big drops and rises (What would be the explanation for that? Nothing to do with any earthquake). In fact, the average number of suicides hasn’t really changed after the quake. I typed the data in to this file, calculated the mean number of suicides per month (~6.3) and generated a few random realizations of a Poisson process using that mean; here I’m plotting the real data in panel 1 and 4 other randomly generated series in panels 2 to 5. require(lattice) su = read.csv('suicide-canterbury-2012.csv') su$Day = ifelse(Month %in% c(1, 3, 5, 7, 8, 10, 12), 31, ifelse(Month %in% c(4, 6, 9, 11), 30, 28)) su$Date = as.Date(paste(Day, Month, Year, sep = '-'), format = '%d-%m-%Y') # No huge autocorrelation acf(su$Number)   # Actual data xyplot(Number ~ Date, data = su, type = 'b')   # Mean number of monthly suicides: 6.283 # Simulating 4 5-year series using Poisson for # panels 2 to 5. Panel 1 contains the observed data simulated = data.frame(s = factor(rep(1:5, each = 60)), d = rep(su$Date, 5), n = c(su$Number, rpois(240, lambda = mean(su\$Number))))   xyplot(n ~ d | s, simulated, type = 'b', cex = 0.3, layout = c(1, 5), xlab = 'Date', ylab = 'Suicides per month')

Observed suicide data for Christchurch (panel 1) and four 60-month simulations (panels 2-5).

Do they really look different? We could try to fit any set of peaks and valleys to our favorite narrative; however, it doesn’t necessarily make any sense. We’ll need a larger data set to see any long-time effects of the earthquake.

P.S. 2012-09-18. Thomas Lumley comments on this post in StatsChat.