Categories

## Reading a folder with many small files

One of the tools we use in our research is NIR (Near-Infrared Spectroscopy), which we apply to thousands of samples to predict their chemical composition. Each NIR spectrum is contained in a CSV text file with two numerical columns: wavelength and reflectance. All files have the same number of rows (1296 in our case), which corresponds to the number of wavelengths assessed by the spectrometer. One last thing: the sample ID is encoded in the file name.

As an example, file A1-4-999-H-L.0000.csv’s contents look like:

8994.82461,0.26393
8990.96748,0.26391
8987.11035,0.26388
8983.25322,0.26402
8979.39609,0.26417
...

Once the contents of all the files are stored in a single matrix, one can apply a bunch of algorithms to build a model, and then use the model to predict chemical composition for new observations. I am not concerned about that process in this blog post, but only about reading thousands of small files from R, without relying on calls to the operating system to join the small files and read a single large file.

As I see it, I want to:

• give R a folder name,
• get a list of all the file names in that folder,
• iterate over that list and keep only the second column for each of the files.
• join the elements of the list.

I can use list.files() to get the names of all files in a folder. Rather than using a explicit loop, it’s easier to use lapply() to iterate over the list of names and apply the read.csv() function to all of them. I want a matrix, but lapply() creates a list, so I joined all the elements of the list using do.call() to bind the rows using rbind().

spectra_folder <- 'avery_raw_spectra'

# Read all files and keep second column only for each of them. Then join all rows
spectra_list <- list.files(path = folder, full.names = TRUE)
raw_spectra <- lapply(spectra_list,

raw_spectra <- do.call(rbind, raw_spectra)


There are many ways to test performance, for example using the microbenchmark package. Instead, I'm using something rather basic, almost cute, the Sys.time() function:

start <- Sys.time()
end <- Sys.time()
end - start


This takes about 12 seconds in my laptop (when reading over 6,000 files). I was curious to see if it would be dramatically faster with data.table, so I replaced read.csv() with fread() and joined the elements of the list using rbindlist().

library(data.table)

spectra_list <- list.files(path = folder, full.names = TRUE)
raw_spectra <-  lapply(spectra_list,

raw_spectra <- rbindlist(raw_spectra)
}


Using the same basic timing as before this takes around 10 seconds in my laptop

I have the impression that packages like data.table and readr have been optimized for reading larg(ish) files, so they won't necessarily help much in this reading-many-small-files type of problem. Instead, I tested going back to even more basic R functions (scan), and adding more information about the types of data I was reading. Essentially, going back even closer to base R.

read_folder_scan <- function(folder, prefix = 'F') {
# Read all files and keep second column only for each of them. Then join all rows
spectra_list <- list.files(path = folder, full.names = TRUE)
raw_spectra <- lapply(spectra_list,
function(x) matrix(scan(x, what = list(NULL, double()),
sep = ',', quiet = TRUE)[[2]], nrow = 1))
raw_spectra <- do.call(rbind, raw_spectra)
}


Timing this new version takes only 4 seconds, not adding any additional dependencies. Any of these versions is faster than the original code that was growing a data frame with rbind() one iteration at the time.

Categories

## Calculating parliament seats allocation and quotients

I was having a conversation about dropping the minimum threshold (currently 5% of the vote) for political parties to get representation in Parliament. The obvious question is how would seat allocation change, which of course involved a calculation. There is a calculator in the Electoral Commission website, but trying to understand how things work (and therefore coding) is my thing, and the Electoral Commission has a handy explanation of the Sainte-Laguë allocation formula used in New Zealand. So I had to write my own seat allocation function:

allocate_seats <- function(votes) {
denom <- seq(1, 121, 2)

quotients <- t(quotients)
colnames(quotients) <- parties

priority <- rank(-quotients)
seat_ranking <- matrix(priority, nrow = nrow(quotients), ncol = ncol(quotients))
seat_ranking <- ifelse(seat_ranking <= 120, seat_ranking, NA)
colnames(seat_ranking) <- parties

return(list(quotients = quotients, ranking = seat_ranking))
}


Testing it with the preliminary election results (that is, no including special votes) gives:

votes2017 <- c(998813, 776556, 162988, 126995, 10959, 48018, 23456)
names(votes2017) <- c('National', 'Labour', 'NZ First', 'Green', 'ACT',
'Opportunities', 'Māori')

seats2017$ranking # National Labour NZ First Green ACT Opportunities Māori # [1,] 1 2 6 9 98 22 46 # [2,] 3 4 19 26 NA 67 NA # [3,] 5 7 33 42 NA 112 NA # [4,] 8 11 47 59 NA NA NA # [5,] 10 13 60 77 NA NA NA # [6,] 12 15 73 93 NA NA NA # [7,] 14 17 86 110 NA NA NA # [8,] 16 21 100 NA NA NA NA # [9,] 18 24 113 NA NA NA NA #[10,] 20 27 NA NA NA NA NA # ...  In our current setup The Opportunities and Māori parties did not reach the minimum threshold (nor won an electorate as ACT violating the spirit of the system), so did not get any seats. Those 4 seats that would have gone to minor parties under no threshold ended up going to National and Labour (2 each). It sucks. Categories ## Collecting results of the New Zealand General Elections I was reading an article about the results of our latest elections where I was having a look at the spatial pattern for votes in my city. I was wondering how would I go over obtaining the data for something like that and went to the Electoral Commission, which has this neat page with links to CSV files with results at the voting place level. The CSV files have results for each of the candidates in the first few rows (which I didn’t care about) and at the party level later in the file. As I could see it I needed to: 1. Read the Electoral Commission website and extract the table that contains the links to all CSV files. 2. Read each of the files and i- extract the electorate name, ii- skipping all the candidates votes, followed by iii-reading the party vote. 3. Remove sub-totals and other junk from the files. 4. Geocode the addresses 5. Use the data for whatever else I wanted (exam question anyone?). So I first loaded the needed packages and read the list of CSV files: library(magrittr) library(tidyverse) library(rvest) library(stringr) library(ggmap) # Extract list of CSV file names containing voting place data voting_place <- 'http://www.electionresults.govt.nz/electionresults_2017_preliminary/voting-place-statistics.html' election17 <- read_html(voting_place) election17 %>% html_nodes('table') %>% html_nodes('a') %>% html_attr('href') %>% str_subset('csv') %>% paste('http://www.electionresults.govt.nz/electionresults_2017_preliminary','/',., sep = '') -> voting_place_list  Then wrote a couple of functions to, first, read the whole file, get the electorate name and, second, detect where the party vote starts to keep from that line onwards. Rather than explicitly looping over the list of CSV file names, I used map_dfr from the purrr package to extract the data and join all the results by row. get_electorate <- function(row) { row %>% str_split(pattern = ',') %>% unlist() %>% .[1] %>% str_split(pattern = '-') %>% unlist() %>% .[1] %>% str_trim() -> elect return(elect) } # Function to read only party-level votes from voting places extract_party_vote <- function(file_name) { all_records <- read_lines(file_name) electorate <- get_electorate(all_records[1]) start_party <- grep('Party Vote', all_records) party_records <- all_records[start_party:length(all_records)] party_records_df <- read.table(text = party_records, sep = ',', fill = TRUE, header = TRUE, quote = '"', stringsAsFactors = FALSE) names(party_records_df)[1:2] <- c('neighbourhood', 'address') party_records_df$electorate <- electorate
return(party_records_df)
}

vote_by_place <- map_dfr(voting_place_list, extract_party_vote)


Cleaning the data and summarising by voting place (as one can vote for several electorates in a single place) is fairly straightforward. I appended the string Mobile to mobile teams that visited places like retirement homes, hospitals, prisons, etc:

# Remove TOTAL and empty records
vote_by_place %>%
mutate(neighbourhood = ifelse(neighbourhood == '', paste(electorate, 'Mobile'), neighbourhood)) %>%
summarise_at(vars(ACT.New.Zealand:Informal.Party.Votes), sum, na.rm = TRUE) -> clean_vote_by_place


Geolocation is the not-working-very-well part right now. First, I had problems with Google (beyond the 1,000 places limit for the query). Then I went for using the Data Science Kit as the source but, even excluding the mobile places, it was a bit hit and miss for geolocation, particularly as the format of some address (like corner of X and Y) is not the best for a search.

In addition, either of the two sources for geolocation work really slowly and may produce a lot of output. Using sink() could be a good idea to not end up with output for roughly 3,000 queries. I did try the mutate_geocode() function, but didn't work out properly.

# Geolocate voting places
get_geoloc <- function(record) {
address <- paste(record$address, 'New Zealand', sep = ', ') where <- try(geocode(address, output = "latlona", source = "dsk"), silent = TRUE) where$address <- as.character(address)
where$locality <- as.character(record$neighbourhood)
return(where)
}

# Sending whole row to the map function
clean_vote_by_place %>% pmap_dfr(., lift_ld(get_geoloc)) -> geoloc_voting_places


David Robinson was kind enough to help me with the last line of the script, although he updated the advise to:

Given the size of my dataset, either option took bugger all time, although I have to say that

clean_vote_by_place %>% mutate(map(transpose(.), get_geoloc)) -> etc


looks prettier.

Once the data are geolocated, creating a visualisation is not so hard. Even old dogs can find their way to do that!

Categories

## Functions with multiple results in tidyverse

I have continued playing with the tidyverse for different parts of a couple of projects.

Often I need to apply a function by groups of observations; sometimes, that function returns more than a single number. It could be something like for each group fit a distribution and return the distribution parameters. Or, simpler for the purposes of this exploration, calculate and return a bunch of numbers.

describe_c <- function(x) {
mn <- mean(x, na.rm = TRUE)
dev <- sd(x, na.rm = TRUE)
n <- sum(!is.na(x))
cv <- dev/mn*100
return(c(mean = mn, sdev = dev, count = n, coefvar = cv))
}


If I have a data frame called field_data, with family codes (trees with the same parents, codes have been changed to protect the innocent) and stem diameters (in mm), I could do the following in base R:

# This line produces an annoying list
summary_one <- with(field_data, tapply(stem, family, FUN = describe_v))

# This puts together a matrix by joining
# the list results using rbind()
summary_one <- do.call(rbind, summary_one)

# To continue processing it might be better to convert
# to a data frame
summary_one <- data.frame(summary_one)


And if I need to do this for several variables, I will need to merge each of these matrices in a data frame.

Continuing with my experimentation with the tidyverse, I was wondering how to get the above going with dplyr et al. After failing a few times I asked the question in Twitter and got a number of helpful replies.

One of the keys is that dplyr can store a list result from a function. Modifying my toy function is pretty straightforward, and now looks like:

describe_list <- function(x) {
mn <- mean(x, na.rm = TRUE)
dev <- sd(x, na.rm = TRUE)
n <- sum(!is.na(x))
cv <- dev/mn*100
return(list(c(mean = mn, sdev = dev, count = n, coefvar = cv)))
}


And we can check the contents of summary_two to see we have a list in which each element contains 4 values:

head(summary_two)
#  A tibble: 6 x 2
#   family     model
#
# 1      A
# 2      B
# 3      C
# 4      D
# 5      E
# 6      F


We still need to extract the elements of each element of the list and assign them to a variable name. Using map from the purrr package is pretty straightforward in this case, and we can extract the values either using their names or their position in the element.

summary_two %>%
mutate(mn = map_dbl(model,'mean'),
sd = map_dbl(model,'sdev'),
n = map_dbl(model,'count'),

#  A tibble: 6 x 6
#   family     model       mn       sd     n       cv
#
# 1      A  190.8306 23.71290   425 12.42615
# 2      B  190.1111 25.46554   396 13.39508
# 3      C  188.2646 27.39215   461 14.54981
# 4      D  189.2668 25.16330   431 13.29514
# 5      E  183.5238 19.70182    21 10.73530
# 6      F  183.1250 28.82377    24 15.73994


I'm still playing with ideas to be lazier at extraction time. An almost abhorrent idea is to provide the output as character for posterior type conversion, as in:

describe_char <- function(x) {
mn <- mean(x, na.rm = TRUE)
dev <- sd(x, na.rm = TRUE)
n <- sum(!is.na(x))
cv <- dev/mn*100
return(paste(mn, dev, n, cv, sep = ':'))
}

field_data %>%
group_by(family) %>%
summarise(model = describe_char(stem)) -> summary_three

# A tibble: 6 x 2
#   family                                                  model
#
# 1      A 190.830588235294:23.7128956613006:425:12.4261502731746
# 2      B 190.111111111111:25.4655444116168:396:13.3950847284951
# 3      C  188.26464208243:27.3921487349435:461:14.5498105390125
# 4      D 189.266821345708:25.1632953227626:431:13.2951434085746
# 5      E   183.52380952381:19.7018249094317:21:10.7352963959021
# 6      F           183.125:28.8237711378767:24:15.7399432834822

summary_three %>%
separate(model, c('mn', 'sd', 'n', 'cv'), sep = ':') %>% head

# A tibble: 6 x 5
#   family               mn               sd     n               cv
#
# 1      A 190.830588235294 23.7128956613006   425 12.4261502731746
# 2      B 190.111111111111 25.4655444116168   396 13.3950847284951
# 3      C  188.26464208243 27.3921487349435   461 14.5498105390125
# 4      D 189.266821345708 25.1632953227626   431 13.2951434085746
# 5      E  183.52380952381 19.7018249094317    21 10.7352963959021
# 6      F          183.125 28.8237711378767    24 15.7399432834822


And we can get all the way there with:

summary_three %>%
separate(model, c('mn', 'sd', 'n', 'cv'), sep = ':') %>%
mutate_at(c('mn', 'sd', 'n', 'cv'), as.numeric) %>% head

# A tibble: 6 x 5
#   family       mn       sd     n       cv
#
# 1      A 190.8306 23.71290   425 12.42615
# 2      B 190.1111 25.46554   396 13.39508
# 3      C 188.2646 27.39215   461 14.54981
# 4      D 189.2668 25.16330   431 13.29514
# 5      E 183.5238 19.70182    21 10.73530
# 6      F 183.1250 28.82377    24 15.73994


Which I assume has all sort of potential negative side-effects, but looks really cool.

In case you want to play with the problem, here is a tiny example of field data.

Categories

## Turtles all the way down

One of the main uses for R is for exploration and learning. Let’s say that I wanted to learn simple linear regression (the bread and butter of statistics) and see how the formulas work. I could simulate a simple example and fit the regression with R:

library(arm)  # For display()

# Simulate 5 observations
set.seed(50)
x <- 1:5
y <- 2 + 3*x + rnorm(5, mean = 0, sd = 3)

# Fit regression
reg <- lm(y ~ x, dat)
display(reg)

# lm(formula = y ~ x, data = dat)
#             coef.est coef.se
# (Intercept) 3.99     3.05
# x           2.04     0.92
# ---
# n = 5, k = 2
# residual sd = 2.91, R-Squared = 0.62

# Plot it
plot(y ~ x)
abline(coef(reg))


The formulas for the intercept ($$b_0$$) and the slope ($$b_1$$) are pretty simple, and I have been told that there is a generic expression that instead uses matrices.

$$b_1 = \frac{\sum{x y} - n \bar{x} \bar{y}}{\sum{x x} - n \bar{x}^2}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$

$$\boldsymbol{b} = \boldsymbol{X}\boldsymbol{X}^{-1} \boldsymbol{Xy}$$

How do the contents of the matrices and the simple formulates relate to each other?

# Formulas for slope and intercept
b1 <- (sum(x*y) - length(x)*mean(x)*mean(y))/(sum(x*x) - length(x)*mean(x)^2)
b0 <- mean(y) - b1*mean(x)


Funnily enough, looking at the matrices we can see similar sums of squares and crossproducts as in the formulas.

X <- model.matrix(reg)

#   (Intercept) x
# 1           1 1
# 2           1 2
# 3           1 3
# 4           1 4
# 5           1 5
# attr(,"assign")
# [1] 0 1

t(X) %*% X
#             (Intercept)  x
# (Intercept)           5 15
# x                    15 55

# So XX contains bits and pieces of the previous formulas
length(x)
# [1] 5

sum(x)
# [1] 15

sum(x*x)
# [1] 55

# And so does Xy
t(X) %*% y
#                  [,1]
# (Intercept)  50.61283
# x           172.27210

sum(y)
# [1] 50.61283

sum(x*y)
# [1] 172.2721

# So if we combine the whole lot and remember that
# solves calculates the inverse
solve(t(X) %*% X) %*% t(X) %*% y
#                 [,1]
# (Intercept) 3.992481
# x           2.043362


But I have been told that R (as most statistical software) doesn't use the inverse of the matrix for estimating the coefficients. So how does it work?

If I type lm R will print the code of the lm() function. A quick look will reveal that there is a lot of code reading the arguments and checking that everything is OK before proceeding. However, the function then calls something else: lm.fit(). With some trepidation I type lm.fit, which again performs more checks and then calls something with a different notation:

z <- .Call(C_Cdqrls, x, y, tol, FALSE)


This denotes a call to a C language function, which after some searching in Google we find in a readable form in the lm.c file. Another quick look brings more checking and a call to Fortran code:

F77_CALL(dqrls)(REAL(qr), &n, &p, REAL(y), &ny, &rtol,
REAL(coefficients), REAL(residuals), REAL(effects),
&rank, INTEGER(pivot), REAL(qraux), work);


which is a highly tuned routine for QR decomposition in a linear algebra library. By now we know that the general matrix expression produces the same as our initial formula, and that the R lm() function does not use a matrix inverse but QR decomposition to solve the system of equations.

One of the beauties of R is that brought the power of statistical computing to the masses, by not only letting you fit models but also having a peek at how things are implemented. As a user, I don't need to know that there is a chain of function calls initiated by my bread-and-butter linear regression. But it is comforting to the nerdy me, that I can have a quick look at that.

All this for free, which sounds like a very good deal to me.

Categories

## Old dog and the tidyverse

I started using R ages ago and have happily lived in mostly-base-R for data manipulation. Once in a while I move to something that makes a big difference, like ggplot2 in 2010 or Rmarkdown in 2015, but the set of packages I use for data + plotting hasn’t seen many changes. I have to confess that, meanwhile, I have tested quite a few approaches on the analytics side of things (last year was the turn of Bayesian for me).

Last week, I decided to learn more about the tidyverse, thinking of using it more with forestry postgrad students. Now, there is no lack of tutorials, reviews, documentation, etc. for the tidyverse, but most writing shows a final version of the code, without exposing the thinking and dead ends that go behind it. In this post I show how my code was changing, both after reading a few pieces of documentation and, mostly, from feedback I got from Hadley Wickham and Michael MacAskill via this Kiwi Twitter thread. This post shows minor differences in variable names from that thread, as I changed a few things while reading the files.

The problem is as follows: I have two data frames with trial assessments. Frame one, called early, covers trees at ages 5, 7 and 8 years (although ages are in months rather than years). Frame two, called late, covers trees at age 20 years. Yes, it takes a while working with trees.

We want to keep only age 8 years (96 months) from early and want to split a code into two variables, as well as convert a variable from numeric to character. In late we want to create a tree family code, based on a set of rules to connect field codes to the pedigree of trees. Did I mention that I work breeding trees?

Finally, we want to merge all the assessments from age 8 with the assessment at age 20 for the same trees.

Rather than showing the final version of the code, it is much more interesting to show its evolution, also including how I would have done this in base R. I’m omitting the reading of the file and boring case conversion of variable names, etc.

In base R, I would probably do something like this (I’m using the stringr package just to make my life easier):

library(stringr)

early_8 <- subset(early, age == 96)
early_8 <- within(early_8, {
rep <- sapply(section, function(x) unlist(str_split(x, '_'))[1])
sets <- sapply(section, function(x) unlist(str_split(x, '_'))[2])
tree <- as.character(tree position)
})

late <- within(late, {
family <- ifelse(field_code < 500, paste('885', str_pad(field_code, 3, pad = '0'), sep = ''),
ifelse(field_code >= 500 & field_code < 600, paste('883', str_pad(field_code - 500, 3, pad = '0'), sep = ''),
as.character(field_code)))
rep <- as.character(rep)
tree <- as.character(tree)
})

both <- merge(early_8, late, by.x = 'genotype', by.y = 'family')


My first approach to dealing with the early frame with the tidyverse looked like:

library(tidyverse)
library(stringr)

early %>%
filter(age == 96) %>%
mutate(rep = flatten_chr(map(section, function(x) unlist(str_split(x, '_'))[1]))) %>%
mutate(sets = flatten_chr(map(section, function(x) unlist(str_split(x, '_'))[2]))) %>%
mutate(tree = as.character(tree position)) -> early_8


While the second frame was processed using:

late %>%
mutate(family = ifelse(field_code < 500, paste('885', str_pad(field_code, 3, pad = '0'), sep = ''),
ifelse(field_code >= 500 & field_code < 600, paste('883', str_pad(field_code - 500, 3, pad = '0'), sep = ''),
as.character(field_code)))) %>%
mutate(rep = as.character(rep))
mutate(tree = as.character(tree)) -> late


I used multiple instances of mutate because I thought it would be easier to read. The use of map instead of sapply is cool, particularly when one starts looking at more advanced features. Comments from the crowd in Twitter: mutate the whole lot in a single statement (although Hadley pointed out that there was no performance penalty by using them separately) and try using case_when to make nested ifelse easier to understand. The comments on map went in two directions: either use map_chr as a clearer and safer alternative, or just use dplyr's separate function. The first option for early would look like:

early %>%
filter(age == 96) %>%
mutate(rep = map_chr(section, function(x) unlist(str_split(x, '_'))[1])) %>%
mutate(sets = map_chr(section, function(x) unlist(str_split(x, '_'))[2])) %>%
mutate(tree = as.character(tree position)) -> early_8


However, I ended up going with the final version that used separate, which is easier on the eye and faster, for a final version that looks like this:

early %>%
filter(age == 96) %>%
separate(section, c('rep', 'sets'), sep = '_') %>%
mutate(tree = as.character(tree position),
genotype = as.character(1:nrow(.) + 10^6)) -> early_8


So we filter early, separate the single set code into two variables (rep and sets) and create a couple of variables using mutate (one is a simple type conversion to character, while the other is a code starting at 1,000,000).

In the case of late, I ended up with:

late %>%
mutate(family = case_when(
field_code < 500  ~  paste('885', str_pad(field_code, 3, pad = '0'), sep = ''),
field_code >= 500 & field_code < 600  ~  paste('883', str_pad(field_code - 500, 3, pad = '0'), sep = ''),
TRUE ~ as.character(field_code)),
rep = as.character(rep),
tree = as.character(tree)) -> late


And we merge the files using:

early_8 %>%
left_join(late, c('rep', 'sets', 'tree')) %>%
left_join(base_pedigree, by = c('family' = 'genotype'))  -> both


Some (many, most?) people may disagree with my use of right assign, which I love. Surely one could use either left assign or %<>% from the maggrittr package. By the way, why do I have to explicitely load magrittr (instead of relying on tidyverse) to access %<>%?

And this is how I go about learning new things: lots of false starts, often working with small examples (I used a few to check how left_join was working), lots of searching for explanations/tutorials (thanks to everyone who has written them) and asking in Twitter. If you are just starting programming, in any language, do not feel intimidated by cool looking code; most of the time it took many iterations to get it looking like that.

Categories

## Left-to-right

When I write code I’m often amazed by the direction of the statements. I mean, we read and write left-to-right except when we assign statements to a variable. So here we are, doing our business, slowly working in a sentence and, puff!, we get this insight from the future and we assign it to our past, on the left. I commented this on Twitter and @fadesingh pointed me to this artistic creation of right-to-left programming language, except that the assign was left-to-right.

R has a right-assign operator (->) and I was thinking, how would it look if I wrote all the assigns to the left, using an adaptation of my previous post.

options(stringsAsFactors = FALSE)
library(ggplot2)

# School directory
read.csv('directory-school-current.csv', skip = 3) -> direc

# Decile changes
read.csv('DecileChanges_20142015.csv', skip = 2) -> dc

subset(dc,
select = c(Inst.., X2014.Decile, X2015.Decile, X2014.Step,
X2015..Step, Decile.Change, Step.Change)) -> dc

c('School.ID', 'deci2014', 'deci2015', 'step2014', 'step2015',
'decile.change', 'step.change') -> names(dc)

c(LETTERS[1:17], 'Z') -> steps
c(905.81, 842.11, 731.3, 617.8, 507.01, 420.54, 350.25,
277.32, 220.59, 182.74, 149.99, 135.12, 115.76, 93.71,
71.64, 46.86, 28.93, 0) -> money

within(dc, {
ifelse(step2014 == '', NA, step2014) -> step2014
ifelse(step2015 == '', NA, step2015) -> step2015
sapply(step2014, function(x) money[match(x, steps)]) -> sm2014
sapply(step2015, function(x) money[match(x, steps)]) -> sm2015
sm2015 - sm2014 -> funding.change
}) -> dc

merge(dc,
direc[, c('School.ID', 'Total.School.Roll', 'Urban.Area', 'Regional.Council')],
by = 'School.ID', all.x = TRUE) -> dc

within(dc, {
factor(Urban.Area) -> Urban.Area
factor(Urban.Area, levels(Urban.Area)[c(3, 2, 4, 1)]) -> Urban.Area
funding.change*Total.School.Roll/1000 -> school.level.change
}) -> dc

with(dc, {
summary(funding.change)
summary(school.level.change)

sum(sm2014*Total.School.Roll/1000000, na.rm = NA)
sum(sm2015*Total.School.Roll/1000000, na.rm = NA)
})

#### By Urban area
# Funding change per student on school size
qplot(Total.School.Roll, funding.change, data = dc[!is.na(dc$Urban.Area),], alpha = 0.8, xlab = 'Total School Roll', ylab = 'Funding change per student (NZ$)') +
theme_bw() + theme(legend.position = 'none') + facet_grid(~Urban.Area)


Not too alien; now using the magrittr package with right assign would make a lot more sense.

Categories

## Sometimes I feel (some) need for speed

I’m the first to acknowledge that most of my code could run faster. The truth of the matter is that, in essence, I write ‘quickies’: code that will run once or twice, so there is no incentive to spend days or hours in shaving seconds of a computation. Most analyses of research data fall in to this approach: read data-clean data-fit model-check model-be happy-write article-(perhaps) make data and code available-move on with life.

One of the reasons why my code doesn’t run faster or uses less memory is the trade-off between the cost of my time (very high) compared to the cost of more memory or faster processors (very cheap) and the gains of shaving a few seconds or minutes of computer time, which tend to be fairly little.

In R vectorization is faster than working with each vector element, although it implies allocating memory for whole vectors and matrices, which for large-enough problems may become prohibitively expensive. On the other hand, not vectorizing some operations may turn your problem into an excruciatingly slow exercise and, for example in large simulations, in practice intractable in a useful timeframe.

John Muschelli wrote an interesting post reimplementing 2×2 frequency tables for a highly specific use, comparing the results of image processing algorithms. In John’s case, there are two greater-than-9-million-long-elements logical vectors, and when comparing vectors for many images the process becomes very slow. He explains part of his rationale in his blog (go and read it), but say that his solution can be expressed like this:

# I don't have any image data, but I'll simulate a couple
# of 10 million long logical vectors (just to round things)
set.seed(2014)

manual = sample(c(TRUE, FALSE), 10E6, replace = TRUE)
auto = sample(c(TRUE, FALSE), 10E6, replace = TRUE)

logical.tab = function(x, y) {
tt = sum(x & y)
tf = sum(x & !y)
ft = sum(!x & y)
ff = sum(!x & !y)
return(matrix(c(ff, tf, ft, tt), 2, 2))
}

logical.tab(manual, auto)


which uses 1/4 of the time used by table(manual, auto). Mission accomplished! However, if I stopped here this blog post would not make much sense, simply rehashing (pun intended) John’s code. The point is to explain what is going on and, perhaps, to find even faster ways of performing the calculations. As a start, we have to be aware that the calculations in logical.tab() rely on logical (boolean) operations and on the coercion of logical vectors to numerical values, as stated in the documentation:

Logical vectors are coerced to integer vectors in contexts where a numerical value is required, with TRUE being mapped to 1L, FALSE to 0L and NA to NA_integer_.

In R Logical operations can be slower than mathematical ones, a consideration that may guide us to think of the problem in a slightly different way. For example, take the difference between the vectors (dif = x - y), so both TRUE-TRUE (1-1) and FALSE-FALSE (0-0) are 0, while TRUE - FALSE (1-0) is 1 and FALSE - TRUE (0-1) is -1. Therefore:

• the sum of positive values (sum(dif > 0)) is the frequency of TRUE & FALSE,
• while the sum of negative values (sum(dif < 0)) is the frequency of FALSE & TRUE.

The values for TRUE & TRUE can be obtained by adding up the element-wise multiplication of the two vectors, as TRUE*TRUE (1*1) is the only product that's different from zero. A vectorized way of performing this operation would be to use t(x) %*% y; however, for large vectors the implementation crossprod(x, y) is faster. The values for FALSE & FALSE are simply the difference of the length of the vectors and the previously calculated frequencies length(dif) - tt - tf - ft. Putting it all together:

basic.tab = function(x, y) {
dif = x - y
tf = sum(dif > 0)
ft = sum(dif < 0)
tt = crossprod(x, y)
ff = length(dif) - tt - tf - ft
return(c(tf, ft, tt, ff))
}


This code takes 1/20 of the time taken by table(x, y). An interesting result is that the crossproduct crossprod(x, y) can also be expressed as sum(x * y), which I didn't expect to be faster but, hey, it is. So we can express the code as:

basic.tab2 = function(x, y) {
dif = x - y
tf = sum(dif > 0)
ft = sum(dif < 0)
tt = sum(x*y)
ff = length(dif) - tt - tf - ft
return(c(tf, ft, tt, ff))
}


to get roughly 1/22 of the time. The cost of logical operations is easier to see if we isolate particular calculations in logical.tab and basic.tab; for example,

tf1 = function(x, y) {
tf = sum(x & !y)
}


is slower than

tf2 = function(x, y) {
dif = x - y
tf = sum(dif > 0)
}


This also got me thinking of the cost of coercion: Would it take long? In fact, coercing logical vectors to numeric has little (at least I couldn't see from a practical viewpoint) if any cost. In some cases relying on logical vectors converted using as.numeric() seems to be detrimental on terms of speed.

As I mentioned at the beginning, vectorization uses plenty of memory, so if we were constrained in that front and we wanted to do a single pass on the data we could write an explicit loop:

loopy.tab = function(x, y) {
tt = 0; tf = 0; ft = 0; ff = 0

for(i in seq_along(x)) {
if(x[i] == TRUE & y[i] == TRUE)
tt = tt + 1
else
if(x[i] == TRUE & y[i] == FALSE)
tf = tf + 1
else
if(x[i] == FALSE & y[i] == TRUE)
ft = ft + 1
else
ff = ff + 1
}
return(matrix(c(ff, tf, ft, tt), 2, 2))
}


The loopy.tab does only one pass over the vectors and should use less memory as it doesn't need to create those huge 10M elements vectors all the time (at least it would if we used a proper iterator in the loop instead of iterating on a vector of length 10M (that's 40 MB big in this case). The iterators package may help here). We save room/memory at the expense of speed, as loopy.tab is ten times slower than the original table() function. Of course one could run it a lot faster if implemented in another language like C++ or Fortran and here Rcpp or Rcpp11 would come handy (updated! See below).

This is only a not-so-short way of reminding myself what's going on when trading-off memory, execution speed and my personal time. Thus, I am not saying that any of these functions is the best possible solution, but playing with ideas and constraints that one often considers when writing code. Incidentally, an easy way to compare the execution time of these functions is using the microbenchmark library. For example:

library(microbenchmark)
microbenchmark(logical.tab(manual, auto), basic.tab2(manual, auto), times = 1000)


will spit numbers when running a couple of functions 1,000 times with results that apply to your specific system.

PS 2014-07-12 11:08 NZST Hadley Wickham suggests using tabulate(manual + auto * 2 + 1, 4) as a fast alternative. Nevertheless, I would like to point out that i- the initial tabulation problem is only an excuse to discuss a subset of performance issues when working with R and ii- this post is more about the journey than the destination.

PS 2014-07-13 20:32 NZST Links to two posts related to this one:

• Wiekvoet's work which i- relies on marginals and ii- reduces the use of logical operators even more than my basic.tab2(), simultaneously taking around 1/4 of the time.
• Yin Zhu's post describing vectors is a handy reminder of the basic building blocks used in this post.

### Updated with Rcpp goodness!

PS 2014-07-13 21:45 NZST I browsed the Rcpp Gallery and Hadley Wickham's Rcpp tutorial and quickly converted loopy.tab() to C++. Being a bit of an oldie I've used Fortran (90, not that old) before but never C++, so the following code is probably not very idiomatic.

library(Rcpp)
cppFunction('NumericVector loopyCpp(LogicalVector x, LogicalVector y) {
int niter = x.size();
int tt = 0, tf = 0, ft = 0, ff = 0;
NumericVector tab(4);

for(int i = 0; i < niter; i++) {
if(x[i] == TRUE && y[i] == TRUE)
tt++;
else
if(x[i] == TRUE && y[i] == FALSE)
tf++;
else
if(x[i] == FALSE && y[i] == TRUE)
ft++;
else
ff++;
}

tab[0] = ff; tab[1] = tf; tab[2] = ft; tab[3] = tt;
return tab;
}'
)

loopyCpp(manual, auto)


Loops and all it runs roughly twice as fast as basic.tab2() but it should also use much less memory.

Categories

## R as a second language

Imagine that you are studying English as a second language; you learn the basic rules, some vocabulary and start writing sentences. After a little while, it is very likely that you’ll write grammatically correct sentences that no native speaker would use. You’d be following the formalisms but ignoring culture, idioms, slang and patterns of effective use.

R is a language and any newcomers, particularly if they already know another programming language, will struggle at the beginning to get what is beyond the formal grammar and vocabulary. I use R for inquisition: testing ideas, data exploration, visualization; under this setting, the easiest is to perform a task the more likely is one going to do it. It is possible to use several other languages for this but—and I think this is an important but—R’s brevity reduces the time between thinking and implementation, so we can move on and keep on trying new ideas.

A typical example is when we want to repeat something or iterate over a collection of elements. In most languages if one wants to do something many times the obvious way is using a loop (coded like, for() or while()). It is possible to use a for() loop in R but many times is the wrong tool for the job, as it increases the lag between thought and code, moving us away from ‘the flow’.

# Generate some random data with 10 rows and 5 columns
M = matrix(round(runif(50, 1, 5), 0), nrow = 10, ncol = 5)
M

#      [,1] [,2] [,3] [,4] [,5]
# [1,]    2    3    4    2    1
# [2,]    3    1    3    3    4
# [3,]    4    2    5    1    3
# [4,]    2    4    4    5    3
# [5,]    2    3    1    4    4
# [6,]    3    2    2    5    1
# [7,]    1    3    5    5    2
# [8,]    5    4    2    5    4
# [9,]    3    2    3    4    3
#[10,]    4    4    1    2    3

# Create dumb function that returns mean and median
# for data
sillyFunction = function(aRow) {
c(mean(aRow), median(aRow))
}

# On-liner to apply our function to each row
apply(M, 1, sillyFunction)

#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,]  2.4  2.8    3  3.6  2.8  2.6  3.2    4    3   2.8
#[2,]  2.0  3.0    3  4.0  3.0  2.0  3.0    4    3   3.0

# or one could do it for each column
apply(M, 2, sillyFunction)

# Of course one could use a loop. Pre-allocating
# the result matrix would have a loop with little
# time penalty (versus growing the matrix)
nCases = dim(M)[1]
resMatrix = matrix(0, nrow = nCases, ncol = 2)
# and here is the loop
for(i in 1:nCases){
resMatrix[i, 1:2] = sillyFunction(M[i,])
}

resMatrix
# Same results as before
#      [,1] [,2]
# [1,]  2.4    2
# [2,]  2.8    3
# [3,]  3.0    3
# [4,]  3.6    4
# [5,]  2.8    3
# [6,]  2.6    2
# [7,]  3.2    3
# [8,]  4.0    4
# [9,]  3.0    3
#[10,]  2.8    3


One of the distinctive features of R is that there is already a lot of functionality available for jobs that occur frequently in data analysis. The easiest is to perform a task the more likely is one going to do it, which is perfect if one is exploring/thinking about data.

Thomas Lumley reminded me of the ACM citation for John Chambers—father of S of which R is an implementation—which stated that Chambers’s work:

…will forever alter the way people analyze, visualize, and manipulate data . . . S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers.

If I could summarize the relevance of R in a Tweetable phrase (with hash tags and everything) it would be:

Most data analysis languages underestimate the importance of interactivity/low barrier to exploration. That’s where #Rstats shines.

One could run statistical analyses with many languages (including generic ones), but to provide the right level of interactivity for analysis, visualization and data manipulation one ends up creating functions that, almost invariably, look a bit like R; pandas in Python, for example.

There are some complications with some of the design decisions in R, especially when we get down to consistency which begets memorability. A glaring example is the apply family of functions and here is where master opportunist (in the positive sense of expert at finding good opportunities) Hadley Wickham made sense out of confusion in his package plyr.

There is also a tension in languages under considerable use because speakers/writers/analysts/coders start adapting them to new situations, adding words and turns of phrase. Look at English for an example! This is also happening to R and some people wish the language looked different in some non-trivial ways. A couple of examples: Coffeescript for R and Rasmus Bååth’s suggestions. Not all of them can be implemented, but suggestions like this speak of the success of R.

If you are struggling to start working with R, as with other languages, first let go. The key to learning and working with a new language is immersing yourself in it; even better if you do it with people who already speak it.

Just to be clear, there are several good statistical languages. However, none is as supportive of rapid inquisition as R (IMO). It is not unusual to develop models in one language (e.g. R) and implement it in another for operational purposes (e.g. SAS, Python, whatever).

The first thing I admire about Hadley is his ‘good eye’ for finding points of friction. The second one is doing something about the frictions, often with very good taste.

P.S. It should come clear from this post that English is indeed my second language.

Categories

## Using Processing and R together (in OS X)

I wanted to develop a small experiment with a front end using the Processing language and the backend calculations in R; the reason why will be another post. This post explained the steps assuming that one already has R and Processing installed:

1. Install the Rserve package. This has to be done from source (e.g. using R CMD INSTALL packagename).
2. Download Rserve jar files and include them in the Processing sketch.

For example, this generates 100 normal distributed random numbers in R and then sorts them (code copy and pasted from second link):

import org.rosuda.REngine.Rserve.*;
import org.rosuda.REngine.*;

double[] data;

void setup() {
size(300, 300);

try {
RConnection c = new RConnection();
// generate 100 normal distributed random numbers and then sort them
data= c.eval("sort(rnorm(100))").asDoubles();

} catch ( REXPMismatchException rme ) {
rme.printStackTrace();

} catch ( REngineException ree ) {
ree.printStackTrace();
}
}

void draw() {
background(255);
for( int i = 0; i < data.length; i++) {
line( i * 3.0, height/2, i* 3.0, height/2 - (float)data[i] * 50 );
}
}


The problem is that this didn't work, because my OS X (I use macs) R installation didn't have shared libraries. My not-so-quick solution was to compile R from source, which involved:

1. Downloading R source. I went for the latest stable version, but I could have gone for the development one.
2. Setting up the latest version of C and Fortran compilers. I did have an outdated version of Xcode in my macbook air, but decided to delete it because i- uses many GB of room in a small drive and ii- it's a monster download. Instead I went for Apple's Command Line Tools, which is a small fraction of size and do the job.
3. In the case of gfortran, there are many sites pointing to this page that hosts a fairly outdated version, which was giving me all sorts of problems (e.g. "checking for Fortran 77 name-mangling scheme") because the versions between the C and Fortran compilers were out of whack. Instead, I downloaded the latest version from the GNU site.
4. Changing the config.site file in a few places, ensuring that I had:
5. CC="gcc -arch x86_64 -std=gnu99"
CXX="g++ -arch x86_64"
F77="gfortran -arch x86_64"
FC="gfortran -arch x86_64"


Then compiled using (didn't want X11 and enabling shared library):

./configure --without-x --enable-R-shlib
make
make check
make pdf # This produces a lot of rubbish on screen and it isn't really needed
make info


And finally installed using:

sudo make prefix=/luis/compiled install

This used a prefix because I didn't want to replace my fully functioning R installation, but just having another one with shared libraries. If one types R in terminal then it is still calling the old version; the new one is called via /luis/compiled/R.framework/Versions/Current/Resources/bin/R. I then installed Rserve in the new version and was able to call R from processing so I could obtain.

Now I can move to what I really wanted to do. File under stuff-that-I-may-need-to-remember-one-day.

Categories

## Overlay of design matrices in genetic analysis

I’ve ignored my quantitative geneticist side of things for a while (at least in this blog) so this time I’ll cover some code I was exchanging with a couple of colleagues who work for other organizations.

It is common to use diallel mating designs in plant and tree breeding, where a small number of parents acts as both males and females. For example, with 5 parents we can have 25 crosses, including reciprocals and selfing (crossing an individual with itself). Decades ago this mating design was tricky to fit and, considering an experimental layout with randomized complete blocks, one would have something like y = mu + blocks + dads + mums + cross + error. In this model dads and mums were estimating a fraction of the additive genetic variance. With the advent of animal model BLUP, was possible to fit something like y = mu + blocks + individual (using a pedigree) + cross + error. Another less computationally demanding alternative (at least with unrelated parents) is to fit a parental model, overlaying the design matrices for parents with something like this y = mu + blocks + (dad + mum) + cross + error.

The following code simulate data for a a diallel trial with three replicates and runs a parental model with ASReml. Later I expand the analysis building the matrices by hand.

# Defining diallel cross
n.blocks <- 3
diallel <- matrix(0, nrow = 5, ncol = 5)
males <- 1:dim(diallel)[1]
females <- 1:dim(diallel)[2]
cross <- factor(outer(males, females, paste, sep ='x'))
cross
#[1] 1x1 2x1 3x1 4x1 5x1 1x2 2x2 3x2 4x2 5x2 1x3 2x3 3x3 4x3 5x3 1x4 2x4
#[18] 3x4 4x4 5x4 1x5 2x5 3x5 4x5 5x5
#25 Levels: 1x1 1x2 1x3 1x4 1x5 2x1 2x2 2x3 2x4 2x5 3x1 3x2 3x3 ... 5x5

#### Generating random values for the trial
# Variance components
sig2.a <- 40    # additive genetic variance
sig2.sca <- 10  # specific combining ability (SCA)
sig2.b <- 10    # blocks
sig2.e <- 30    # residuals

# Numbers of everything
n.parents <- length(males)
n.crosses <- length(cross)
n.trees <- n.crosses*n.blocks

# Random values for everything
u.g <- rnorm(n.parents)*sqrt(sig2.a)
u.sca <- rnorm(n.crosses)*sqrt(sig2.sca)
u.block <- rnorm(n.blocks)*sqrt(sig2.b)
u.e <- rnorm(n.trees)*sqrt(sig2.e)

# Whole trial
trial <- data.frame(block = factor(rep(1:n.blocks, each = n.crosses)),
mum = factor(rep(females, n.blocks)),
dad = factor(rep(males, each = length(females))),
mate = factor(rep(cross, n.blocks)))

trial$yield <- 0.5*(u.g[trial$mum] + u.g[trial$dad]) + u.sca[trial$mate] + u.block[trial$block] + u.e head(trial) #block mum dad mate yield #1 1 1 1 1x1 -0.02185486 #2 1 2 1 2x1 10.79760712 #3 1 3 1 3x1 16.45186037 #4 1 4 1 4x1 8.15026291 #5 1 5 1 5x1 5.57707180 #6 1 1 2 1x2 -7.30675148 # Fitting the model with ASReml library(asreml) m1 <- asreml(yield ~ 1, random = ~ block + dad + and(mum) + mate, data = trial) summary(m1) # gamma component std.error z.ratio #block!block.var 1.299110e-02 3.861892e-01 1.588423e+00 0.2431274 #dad!dad.var 2.101417e-01 6.246930e+00 5.120745e+00 1.2199259 #mate!mate.var 4.589938e-07 1.364461e-05 2.340032e-06 5.8309519 #R!variance 1.000000e+00 2.972722e+01 5.098177e+00 5.8309519 # Obtaining the predicted breeding values for the parents coef(m1, pattern = 'dad') # effect #dad_1 1.780683 #dad_2 -2.121174 #dad_3 3.151991 #dad_4 -1.473620 #dad_5 -1.337879  How is the matrix overlay working? We can replicate the calculations used by ASReml by building the matrices from scratch and reusing the variance components, so we avoid the nastiness of writing code for residual maximum likelihood. Once I build the basic matrices I use the code from my How does a linear mixed model look like? post. # Building incidence matrices for males and females Zmales <- model.matrix(~ dad - 1, data = trial) Zfemales <- model.matrix(~ mum - 1, data = trial) # and() in ASReml overlays (i.e. sums) the matrices # (Notice that this creates 0s, 1 and 2s in the design matrix) Zparents <- Zmales + Zfemales Zparents[1:6,] # dad1 dad2 dad3 dad4 dad5 #1 2 0 0 0 0 #2 1 1 0 0 0 #3 1 0 1 0 0 #4 1 0 0 1 0 #5 1 0 0 0 1 #6 1 1 0 0 0 # Design matrices from other factors Zblocks <- model.matrix(~ block - 1, data = trial) Zmates <- model.matrix(~ mate - 1, data = trial) # Creating y, X and big Z matrices to solve mixed model equations y <- trial$yield
X <- matrix(1, nrow = 75, ncol = 1)
Z <- cbind(Zblocks, Zparents, Zmates)

# Building covariance matrices for random effects
# Using the variance components from the ASReml model
G = diag(c(rep(3.861892e-01, 3),
rep(6.246930e+00, 5),
rep(1.364461e-05, 25)))
R = diag(2.972722e+01, 75, 75)
Rinv = solve(R)

# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z

XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X

Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y

# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)

blup = solve(C, RHS)
blup
# These results are identical to the ones
# produced by ASReml


The overlay matrix Zparents is double the actual value we should use:

Zparents2 <- 0.5*(Zmales + Zfemales)
Zparents2[1:6,]
#1  1.0  0.0  0.0  0.0  0.0
#2  0.5  0.5  0.0  0.0  0.0
#3  0.5  0.0  0.5  0.0  0.0
#4  0.5  0.0  0.0  0.5  0.0
#5  0.5  0.0  0.0  0.0  0.5
#6  0.5  0.5  0.0  0.0  0.0


Repeating the analyses 'by hand' using Zparents2 to build the Z matrix results in the same overall mean and block effects, but the predicted breeding values for parents when using Zparents are 0.7 of the predicted breeding values for parents when using Zparents2. I may need to refit the model and obtain new variance components for parents when working with the correct overlaid matrix.

Categories

I have written a few posts discussing descriptive analyses of evaluation of National Standards for New Zealand primary schools.The data for roughly half of the schools was made available by the media, but the full version of the dataset is provided in a single-school basis. In the page for a given school there may be link to a PDF file with the information on standards sent by the school to the Ministry of Education.

I’d like to keep a copy of the PDF reports for all the schools for which I do not have performance information, so I decided to write an R script to download just over 1,000 PDF files. Once I can identify all the schools with missing information I just loop over the list, using the fact that all URL for the school pages start with the same prefix. I download the page, look for the name of the PDF file and then download the PDF file, which is named school_schoolnumber.pdf. And that’s it.

Of course life would be a lot simpler if the Ministry of Education made the information available in a usable form for analysis.

library(XML) # HTML processing
options(stringsAsFactors = FALSE)

# Base URL
base.url = 'http://www.educationcounts.govt.nz/find-a-school/school/national?school='

# Schools directory
directory <- subset(directory,
!(school.type %in% c("Secondary (Year 9-15)", "Secondary (Year 11-15)")))

# Reading file obtained from stuff.co.nz obtained from here:
# http://schoolreport.stuff.co.nz/index.html

# Defining schools with missing information
to.get <- merge(directory, fairfax, by = 'school.id', all.x = TRUE)

# Looping over schools, to find name of PDF file

for(school in to.get\$school.id){

cat('Processing school ', school, '\n')
doc.html <- htmlParse(paste(base.url, school, sep = ''))
`