In quantitative genetic analyses we often use a pedigree to represent the relatedness between individuals, so this is accounted in the analyses, because the observations are not independent of each other. Often this pedigree contains alphanumeric labels, and most software can cope with that.

Sometimes, though, we want to use numeric identities because we would like to make the data available to third parties (other researchers, publication), and there is commercial sensitivity about them. Or just want to use a piece of software that can’t deal with character identities.

Last night put together an El quicko^{*} function to numberify identities, which returns a list with a numeric version of the pedigree and a key to then go back to the old identities.

There is a lot of talk about the skills needed for working in Statistics/Data Science, with the discussion often focusing on theoretical understanding, programming languages, exploratory data analysis, and visualization. There are many good blog posts dealing with how you get data, process it with your favorite language and then creating some good-looking plots. However, in my opinion, one important skill is curiosity; more specifically being data curious.

Often times being data curious doesn’t require statistics or coding, but just searching for and looking at graphs. A quick example comes from Mike Dickinson’s tweet: “This is extraordinary: within a decade, NZers basically stopped eating lamb. 160 years of tradition scrapped almost overnight.”

This is extraordinary: within a decade, NZers basically stopped eating lamb. 160 years of tradition scrapped almost overnight. pic.twitter.com/o1UTsG4su8

After reading the news article, many people came up with good potential explanations: Have the relative prices changed? Do we have different demographics with not so much appetite for lamb? etc.

Food price index (not adjusted for overall inflation) meat prices Oct 2017 as % of price in June 2006

Beef and veal 158.8 Pork 111.1 Mutton, lamb, and hogget 135.6 Poultry 127.9 Preserved, prepared, and processed meat 111.2 Fish and other seafood 135.8

Do the two data points make sense? In this data-abundant world, it didn’t take long to find the time series from which the points came from in this handy OECD page.

A quick look shows that the series contains both quoted consumption figures, showing the talked-about 10-year decline. Even more surprisingly, one can see that practically most of the decline occurred from 2008 to 2009 (from 17.7 to 4.9 kg/person), which is a bizarre drop for a single year. A single person may have large differences in consumption from one year to the next; however, over a whole country those deviations tend to be averaged out. This highlights another issue with the time series: it wiggles like crazy.

When exploring data is useful to have some sort of benchmark to see if other things are also changing at the same time. I chose our neighbor Australia—with a not so different diet, similar part of the world—as my benchmark. The Australian time series doesn’t show a change like NZ. Besides using the benchmark for the same product, we can also compare what’s going on with other meats. For example, beef and veal, pork and poultry.

All the series are smoother and show similar trends in Australia and New Zealand, which makes the lamb saga increasingly look like a mistake. We can now move from trying to explain social changes that are driving the change between two numbers, to being highly suspicious about the numbers under discussion!

So where could be the problem coming from? Consumption per capita requires i) total domestic consumption of sheep meat and ii) population of the country. We are pretty sure we have good data for population, courtesy of Statistics New Zealand. How would one go about estimating domestic consumption of sheep meat? Probably one would:

Get the size of the New Zealand sheep flock. We can get sheep numbers from Statistics NZ Agricultural Production Statistics. Livestock numbers are a national indicator, which tend to have high accuracy.

Get an idea of the proportion of the flock that’s exported, which we know is pretty substantial. I don’t know how good these numbers are, but Beef & Lamb NZ gives us an idea of how many sheep are slaughtered for export. This number, which hovers around 20 million a year seems quite consistent. We have to remember that not the whole population is slaughtered every year, as we have to replace the flock.

The difference between flock size – (sheep for export + replacement sheep) should be the number of sheep for domestic consumption.

We need a conversion factor between number of sheep and kg of meat produced, so we can calculate meat consumption/capita.

I would assume that the sheep-meat conversion factor will show little fluctuation from year to year, so perhaps the likely culprit is the penultimate point, estimating the number of sheep for domestic consumption. One thing that grabs my attention is that while the flock is getting smaller, the number of sheep for exports stays around the same, which should mean fewer sheep available for the domestic market, giving credibility to the lower lamb consumption trend.

I don’t know if this the actual explanation for the “lamb consumption crash”. If I had more time I could chase some of the domestic consumption numbers, even call the Beef & Lamb people. But this should be enough to get you started with an example on how to question the news using real data. I’m sure you reader can come up with better ways of looking at this and other stories.

When you have students working in a project there is always an element of quality control. Some times the results just make sense, while others we are suspicious about something going wrong. This means going back to check the whole analysis process: can we retrace all the steps in a calculation (going back to data collection) and see if there is anything funny going on? So we sat with the student and started running code (in RStudio, of course) and I noticed something interesting: there was a lot of redundancy, pieces of code that didn’t do anything or were weirdly placed. These are typical signs of code copied from several sources, which together with the presence of setwd() showed unfamiliarity with R and RStudio (we have a mix of students with a broad range of R skills).

But the part that really caught my eye was that the script read many Near Infrared spectra files, column bound them together with the sample ID (which was 4 numbers separated by hyphens) and saved the 45 MB file to a CSV file. Then the student opened the file and split the sample ID into 4 columns, deleted the top row, saved the file and read it again into R to continue the process.

The friction point which forced the student to drop to Excel—the first of many not easily reproducible parts—was variable splitting. The loop for reading the files and some condition testing was hard to follow too. If one knows R well, any of these steps is relatively simple, but if one doesn’t know it, the copy and pasting from many different sources begins, often with inconsistent programming approaches.

Here is where I think the tidyverse brings something important to the table: consistency, more meaningful naming of functions and good documentation. For example, doing:

is probably the easiest way of dealing with separating the contents of a single variable.

When working with several collaborators (colleagues, students, etc) the easiest way to reduce friction is to convince/drag/supplicate everyone to adopt a common language. Within the R world, the tidyverse is the closest thing we have to a lingua franca of research collaboration. ‘But isn’t R a lingua franca already?’ you may ask. The problem is that programming in base R is often too weird for normal people, and too many people just give up before feeling they can do anything useful in R (particularly if they are proficient in Excel).

Even if you are an old dog (like me) I think it pays to change to a subset of R that is more learnable. And once someone gets hooked, the transition to adding non-tidyverse functions is more bearable.

I was reading an article about the results of our latest elections where I was having a look at the spatial pattern for votes in my city.

I was wondering how would I go over obtaining the data for something like that and went to the Electoral Commission, which has this neat page with links to CSV files with results at the voting place level. The CSV files have results for each of the candidates in the first few rows (which I didn’t care about) and at the party level later in the file.

As I could see it I needed to:

Read the Electoral Commission website and extract the table that contains the links to all CSV files.

Read each of the files and i- extract the electorate name, ii- skipping all the candidates votes, followed by iii-reading the party vote.

Remove sub-totals and other junk from the files.

Geocode the addresses

Use the data for whatever else I wanted (exam question anyone?).

So I first loaded the needed packages and read the list of CSV files:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

library(magrittr)

library(tidyverse)

library(rvest)

library(stringr)

library(ggmap)

# Extract list of CSV file names containing voting place data

Then wrote a couple of functions to, first, read the whole file, get the electorate name and, second, detect where the party vote starts to keep from that line onwards. Rather than explicitly looping over the list of CSV file names, I used map_dfr from the purrr package to extract the data and join all the results by row.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

get_electorate<-function(row){

row%>%str_split(pattern=',')%>%

unlist()%>%.[1]%>%str_split(pattern='-')%>%

unlist()%>%.[1]%>%str_trim()->elect

return(elect)

}

# Function to read only party-level votes from voting places

Cleaning the data and summarising by voting place (as one can vote for several electorates in a single place) is fairly straightforward. I appended the string Mobile to mobile teams that visited places like retirement homes, hospitals, prisons, etc:

Geolocation is the not-working-very-well part right now. First, I had problems with Google (beyond the 1,000 places limit for the query). Then I went for using the Data Science Kit as the source but, even excluding the mobile places, it was a bit hit and miss for geolocation, particularly as the format of some address (like corner of X and Y) is not the best for a search.

In addition, either of the two sources for geolocation work really slowly and may produce a lot of output. Using sink() could be a good idea to not end up with output for roughly 3,000 queries. I did try the mutate_geocode() function, but didn’t work out properly.

I was reading a piece by Graeme Edgeler who, near the end, asked “Where are New Zealand’s bellwether electorates?”. I didn’t know where the data came from or how was the “index of disproportionality for each electorate” calculated, but I saw it mostly as an opportunity to whip up some quick code to practice the use of R and look at other packages that play well with the tidyverse.

The task can be described as: fetch Wikipedia page with results of the 2014 parliamentary election, extract the table with results by electorate, calculate some form of deviation from the national results, get the top X electorates with lowest deviation from national results.

A web search revealed that this page contains a whole bunch of results for the 2014 election and that the specific results I’m interested in are in table number 17 of the list created by html_nodes('table'). Besides the tidyverse, I needed the packages rvest for web scraping, magrittr for using %<>% (pipe and assign to original data frame) and lucid for pretty printing the final table.

Rather than reading the national results directly from Wikipedia I just typed them in code, as I already had them from some other stuff I was working on. My measure of “disproportionality for each electorate” was as sophisticated as the sum of squared deviations.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

# Results for National, Labour, Green, NZ First, Conservative, Internet Mana & Māori

I’m sure there must be a ‘more idiomatic’ way of doing the squared deviation using the tidyverse. At the same time, using apply came naturally in my head when writing the code, so I opted for keeping it and not interrupting the coding flow. The results are pretty similar to the ones presented by Graeme in his piece.

I’m getting increasingly comfortable with this mestizo approach of using the tidyverse and base R for completing tasks. Whatever it takes to express what I need to achieve quickly and more or less in a readable way.

Often I need to apply a function by groups of observations; sometimes, that function returns more than a single number. It could be something like for each group fit a distribution and return the distribution parameters. Or, simpler for the purposes of this exploration, calculate and return a bunch of numbers.

1

2

3

4

5

6

7

describe_c<-function(x){

mn<-mean(x,na.rm=TRUE)

dev<-sd(x,na.rm=TRUE)

n<-sum(!is.na(x))

cv<-dev/mn*100

return(c(mean=mn,sdev=dev,count=n,coefvar=cv))

}

If I have a data frame called field_data, with family codes (trees with the same parents, codes have been changed to protect the innocent) and stem diameters (in mm), I could do the following in base R:

# To continue processing it might be better to convert

# to a data frame

summary_one<-data.frame(summary_one)

And if I need to do this for several variables, I will need to merge each of these matrices in a data frame.

Continuing with my experimentation with the tidyverse, I was wondering how to get the above going with dplyr et al. After failing a few times I asked the question in Twitter and got a number of helpful replies.

One of the keys is that dplyr can store a list result from a function. Modifying my toy function is pretty straightforward, and now looks like:

And we can check the contents of summary_two to see we have a list in which each element contains 4 values:

1

2

3

4

5

6

7

8

9

10

head(summary_two)

# A tibble: 6 x 2

# family model

# <fctr> <list>

# 1 A <dbl [4]>

# 2 B <dbl [4]>

# 3 C <dbl [4]>

# 4 D <dbl [4]>

# 5 E <dbl [4]>

# 6 F <dbl [4]>

We still need to extract the elements of each element of the list and assign them to a variable name. Using map from the purrr package is pretty straightforward in this case, and we can extract the values either using their names or their position in the element.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

summary_two%>%

mutate(mn=map_dbl(model,'mean'),

sd=map_dbl(model,'sdev'),

n=map_dbl(model,'count'),

cv=map_dbl(model,4))%>%head

# A tibble: 6 x 6

# family model mn sd n cv

# <fctr> <list> <dbl> <dbl> <dbl> <dbl>

# 1 A <dbl [4]> 190.8306 23.71290 425 12.42615

# 2 B <dbl [4]> 190.1111 25.46554 396 13.39508

# 3 C <dbl [4]> 188.2646 27.39215 461 14.54981

# 4 D <dbl [4]> 189.2668 25.16330 431 13.29514

# 5 E <dbl [4]> 183.5238 19.70182 21 10.73530

# 6 F <dbl [4]> 183.1250 28.82377 24 15.73994

I’m still playing with ideas to be lazier at extraction time. An almost abhorrent idea is to provide the output as character for posterior type conversion, as in:

One of the main uses for R is for exploration and learning. Let’s say that I wanted to learn simple linear regression (the bread and butter of statistics) and see how the formulas work. I could simulate a simple example and fit the regression with R:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

library(arm)# For display()

# Simulate 5 observations

set.seed(50)

x<-1:5

y<-2+3*x+rnorm(5,mean=0,sd=3)

# Fit regression

reg<-lm(y~x,dat)

display(reg)

# lm(formula = y ~ x, data = dat)

# coef.est coef.se

# (Intercept) 3.99 3.05

# x 2.04 0.92

# ---

# n = 5, k = 2

# residual sd = 2.91, R-Squared = 0.62

# Plot it

plot(y~x)

abline(coef(reg))

The formulas for the intercept (\(b_0\)) and the slope (\(b_1\)) are pretty simple, and I have been told that there is a generic expression that instead uses matrices.

\(b_1 = \frac{\sum{x y} – n \bar{x} \bar{y}}{\sum{x x} – n \bar{x}^2}\)
\(b_0 = \bar{y} – b_1 \bar{x}\)

Funnily enough, looking at the matrices we can see similar sums of squares and crossproducts as in the formulas.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

X<-model.matrix(reg)

# (Intercept) x

# 1 1 1

# 2 1 2

# 3 1 3

# 4 1 4

# 5 1 5

# attr(,"assign")

# [1] 0 1

t(X)%*%X

# (Intercept) x

# (Intercept) 5 15

# x 15 55

# So X`X contains bits and pieces of the previous formulas

length(x)

# [1] 5

sum(x)

# [1] 15

sum(x*x)

# [1] 55

# And so does X`y

t(X)%*%y

# [,1]

# (Intercept) 50.61283

# x 172.27210

sum(y)

# [1] 50.61283

sum(x*y)

# [1] 172.2721

# So if we combine the whole lot and remember that

# solves calculates the inverse

solve(t(X)%*%X)%*%t(X)%*%y

# [,1]

# (Intercept) 3.992481

# x 2.043362

But I have been told that R (as most statistical software) doesn’t use the inverse of the matrix for estimating the coefficients. So how does it work?

If I type lm R will print the code of the lm() function. A quick look will reveal that there is a lot of code reading the arguments and checking that everything is OK before proceeding. However, the function then calls something else: lm.fit(). With some trepidation I type lm.fit, which again performs more checks and then calls something with a different notation:

1

z<-.Call(C_Cdqrls,x,y,tol,FALSE)

This denotes a call to a C language function, which after some searching in Google we find in a readable form in the lm.c file. Another quick look brings more checking and a call to Fortran code:

1

2

3

F77_CALL(dqrls)(REAL(qr),&n,&p,REAL(y),&ny,&rtol,

REAL(coefficients),REAL(residuals),REAL(effects),

&rank,INTEGER(pivot),REAL(qraux),work);

which is a highly tuned routine for QR decomposition in a linear algebra library. By now we know that the general matrix expression produces the same as our initial formula, and that the R lm() function does not use a matrix inverse but QR decomposition to solve the system of equations.

One of the beauties of R is that brought the power of statistical computing to the masses, by not only letting you fit models but also having a peek at how things are implemented. As a user, I don’t need to know that there is a chain of function calls initiated by my bread-and-butter linear regression. But it is comforting to the nerdy me, that I can have a quick look at that.

All this for free, which sounds like a very good deal to me.

I started using R ages ago and have happily lived in mostly-base-R for data manipulation. Once in a while I move to something that makes a big difference, like ggplot2 in 2010 or Rmarkdown in 2015, but the set of packages I use for data + plotting hasn’t seen many changes. I have to confess that, meanwhile, I have tested quite a few approaches on the analytics side of things (last year was the turn of Bayesian for me).

Last week, I decided to learn more about the tidyverse, thinking of using it more with forestry postgrad students. Now, there is no lack of tutorials, reviews, documentation, etc. for the tidyverse, but most writing shows a final version of the code, without exposing the thinking and dead ends that go behind it. In this post I show how my code was changing, both after reading a few pieces of documentation and, mostly, from feedback I got from Hadley Wickham and Michael MacAskill via this Kiwi Twitter thread. This post shows minor differences in variable names from that thread, as I changed a few things while reading the files.

The problem is as follows: I have two data frames with trial assessments. Frame one, called early, covers trees at ages 5, 7 and 8 years (although ages are in months rather than years). Frame two, called late, covers trees at age 20 years. Yes, it takes a while working with trees.

We want to keep only age 8 years (96 months) from early and want to split a code into two variables, as well as convert a variable from numeric to character. In late we want to create a tree family code, based on a set of rules to connect field codes to the pedigree of trees. Did I mention that I work breeding trees?

Finally, we want to merge all the assessments from age 8 with the assessment at age 20 for the same trees.

Rather than showing the final version of the code, it is much more interesting to show its evolution, also including how I would have done this in base R. I’m omitting the reading of the file and boring case conversion of variable names, etc.

In base R, I would probably do something like this (I’m using the stringr package just to make my life easier):

I used multiple instances of mutate because I thought it would be easier to read. The use of map instead of sapply is cool, particularly when one starts looking at more advanced features. Comments from the crowd in Twitter: mutate the whole lot in a single statement (although Hadley pointed out that there was no performance penalty by using them separately) and try using case_when to make nested ifelse easier to understand. The comments on map went in two directions: either use map_chr as a clearer and safer alternative, or just use dplyr‘s separate function. The first option for early would look like:

However, I ended up going with the final version that used separate, which is easier on the eye and faster, for a final version that looks like this:

1

2

3

4

5

early%>%

filter(age==96)%>%

separate(section,c('rep','sets'),sep='_')%>%

mutate(tree=as.character(`tree position`),

genotype=as.character(1:nrow(.)+10^6))->early_8

So we filter early, separate the single set code into two variables (rep and sets) and create a couple of variables using mutate (one is a simple type conversion to character, while the other is a code starting at 1,000,000).

Some (many, most?) people may disagree with my use of right assign, which I love. Surely one could use either left assign or %<>% from the maggrittr package. By the way, why do I have to explicitely load magrittr (instead of relying on tidyverse) to access %<>%?

And this is how I go about learning new things: lots of false starts, often working with small examples (I used a few to check how left_join was working), lots of searching for explanations/tutorials (thanks to everyone who has written them) and asking in Twitter. If you are just starting programming, in any language, do not feel intimidated by cool looking code; most of the time it took many iterations to get it looking like that.

I was attending a course of Bayesian Statistics where this problem showed up:

There is a number of individuals, say 12, who take a pass/fail test 15 times. For each individual we have recorded the number of passes, which can go from 0 to 15. Because of confidentiality issues, we are presented with rounded-to-the-closest-multiple-of-3 data (\(\mathbf{R}\)). We are interested on estimating \(\theta\) of the Binomial distribution behind the data.

Rounding is probabilistic, with probability 2/3 if you are one count away from a multiple of 3 and probability 1/3 if the count is you are two counts away. Multiples of 3 are not rounded.

We can use Gibbs sampling to alternate between sampling the posterior for the unrounded \(\mathbf{Y}\) and \(\theta\). In the case of \(\mathbf{Y}\) I used:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

# Possible values that were rounded to R

possible<-function(rounded){

if(rounded==0){

options<-c(0,1,2)

}else{

options<-c(rounded-2,rounded-1,rounded,

rounded+1,rounded+2)

}

return(options)

}

# Probability mass function of numbers rounding to R

# given theta

prior_y<-function(options,theta){

p<-dbinom(options,15,prob=theta)

return(p)

}

# Likelihood of rounding

like_round3<-function(options){

if(length(options)==3){

like<-c(1,2/3,1/3)}

else{

like<-c(1/3,2/3,1,2/3,1/3)

}

return(like)

}

# Estimating posterior mass function and drawing a

# random value of it

posterior_sample_y<-function(R,theta){

po<-possible(R)

pr<-prior_y(po,theta)

li<-like_round3(po)

post<-li*pr/sum(li*pr)

samp<-sample(po,1,prob=post)

return(samp)

}

While for \(theta\) we are assuming a vague \(mbox{Beta}(alpha, eta)\), with \(alpha\) and \(eta\) equal to 1, as prior density function for \(theta\), so the posterior density is a \(mbox{Beta}(alpha + sum Y_i, eta + 12*15 – sum Y_i)\).

1

2

3

4

5

## Function to sample from the posterior Pr(theta | Y, R)

Currently there is some discussion in New Zealand about the effect of the reclassification of schools in socioeconomic deciles. An interesting aspect of the funding system in New Zealand is that state and state-integrated schools with poorer families receive substantially more funding from the government than schools that receive students from richer families (see this page in the Ministry of Education’s website).

One thing that I haven’t noticed before is that funding decisions are more granular than simply using deciles, as deciles 1 to 4 are split into 3 steps each. For example, for Targeted Funding for Educational Achievement in 2015 we get the following amounts per student for decile: 1 (A: $905.81, B: $842.11, C: $731.3), 2 (D: $617.8, E: 507.01, F: 420.54), 3 (G: $350.25, H: $277.32, I: $220.59), 4 (J: $182.74, K: $149.99, L: $135.12), 5 ($115.76), 6 ($93.71), 7: ($71.64), 8 ($46.86), 9 ($28.93) and 10 ($0).

The Ministry of Education states that 784 schools ‘have moved to a higher decile rating’ while 800 ‘have moved to a lower decile rating’ (800 didn’t move). They do not mean that those numbers of schools changed deciles, but that information also includes changes of steps within deciles. Another issue is that it is not the same to move one step at the bottom of the scale (e.g. ~$63 from 1A to 1B) or at the top (~$29 from 9 to 10); that is, the relationship is not linear.

I assume that the baseline to measure funding changes is to calculate how much would a school would get per student in 2015 without any change of decile/step. That is, funding assuming that the previous step within decile had stayed constant. Then we can calculate how a student will get with the new decile/step for the school. I have limited this ‘back of the envelope’ calculation to Targeted Funding for Educational Achievement, which is not the only source of funding linked to deciles. There are other things like Special Education Grant and Careers Information Grant, but they have much smaller magnitude (maximum $73.94 & $37.31 per student) and the maximum differences between deciles 1 and 10 are 2:1.

Steps are in capital letters and need to be translated into money. Once we get that we can calculate differences at both student level and school level:

If we look at the 50% of the schools in the middle of the distribution they had fairly small changes, approximately +/- $22 per student per year or at the school level +/- 3,000 dollars per year.

An interesting, though not entirely surprising, graph is plotting changes of funding on the size of the school. Large schools are much more stable on deciles/step than small ones.

xlab='Total School Roll',ylab='Funding change per school (NZ$ 000)')+

theme_bw()+theme(legend.position='none')

Overall, there is a small change of the total amount of money for Targeted Funding for Educational Achievement used in the reclassified school system versus using the old deciles ($125M using 2014 deciles versus $132M using 2015 deciles) and for most schools the changes do not seem dramatic. There is, however, a number of schools (mostly small ones) who have had substantial changes to their funding. Very small schools will tend to display the largest changes, as the arrival or departure of only few pupils with very different socioeconomic backgrounds would have a substantial effect. An example would be Mata School in the Gisborne area, which moved 13 steps in decile funding (from A to N) with a roll of 11 kids. How to maintain a more steady funding regime seems to be a difficult challenge in those cases.

One consequence of the larger variability in small schools is that rural areas will be more affected by larger changes of funding. While overall 34% of the schools had no changes to their decile/step classification in rural areas that reduces to 22%; on top of that, the magnitude of the changes for rural schools is also larger.

Operational school funding is much more complex than deciles, as it includes allocations depending on number of students, use of Maori language, etc.

P.S. Stephen Senn highlights an obvious problem with the language the Ministry uses: there are 9 deciles (the points splitting the distribution into 10 parts). We should be talking about tenths, a much simpler word, instead of deciles.