r rblogs research stats teaching

Implementing a model as an R package

In our research group we often have people creating statistical models that end up in publications but, most of the time, the practical implementation of those models is lacking. I mean, we have a bunch of barely functioning code that is very difficult to use in a reliable way in operations of the breeding programs. I was very keen on continue using one of the models in our research, enough to rewrite and document the model fitting, and then create another package for using the.model in operations.

Unfortunately, neither the data nor the model are mine to give away, so I can’t share them (yet). But I hope these notes will help you in you are in the same boat and need to use your models (or ‘you’ are in fact future me, who tend to forget how or why I wrote code in a specific way).

A basic motivational example

Let’s start with a simple example: linear regression. We want to predict a response using a predictor variable, and then we can predict the response for new values of the predictor contained in new_data with:

my_model <- lm(response ~ predictor, data = my_data)
predictions <- predict(my_model, newdat = new_data)

# Saving the model object
save(my_model, 'model_file.Rda')

The model coefficients needed to predict new values are stored in the my_model object. If we want to use the model elsewhere, we can save the object as an .Rda file, in this case model_file.Rda.

We can later read the model file in, say, a different project and get new predictions using:

more_predictions <- predict(my_model, newdat = yet_another_new_data)

Near-infrared Spectroscopy

Near-infrared spectroscopy is the stuff of CSI and other crime shows. We measure the reflection at different wavelengths and run a regression analysis using what we want to predict as Y in the model. The number of predictors (wavelengths) is much larger than in the previous example—1,296 for the NIR machine we are using—so it is not unusual to have more predictors than observations. NIR spectra are often trained using pls() (from the pls package) with help from functions from the prospectr package.

I could still use the save/load approach from the motivational example to store and reuse the model object created with pls but, instead, I wanted to implement the model, plus some auxiliary functions, in a package to make the functions easier to use in our lab.

I had two issues/struggles/learning opportunities that I needed to sort out to get this package working:

1. How to automatically load the model object when attaching the package?

Normally, datasets and other objects go in the data folder, where they are made available to the user. Instead, I wanted to make the object internally available. The solution turned out to be quite straightforward: save the model object to a file called sysdata.rda (either in the R or data folders of the package). This file is automatically loaded when we run library(package_name). We just need to create that file with something like:

save(my_model, 'sysdata.rda')

2. How to make predict.pls work in the package?

I was struggling to use the predict function, as in my head it was being provided by the pls package. However, pls is only extending the predict function, which comes with the default R installation but is part of the stats package. At the end, sorted it out with the following Imports, Depends and LazyData in the DESCRIPTION file:

Imports: prospectr,
Depends: pls
Encoding: UTF-8
LazyData: Yes

Now it is possible to use predict, just remember to specify the package where it is coming from, as in:

stats::predict(my_model, ncomp = n_components,
               newdata = spectra, interval = 'confidence')

Nothing groundbreaking, I know, but spent a bit of time sorting out that couple of annoyances before everything fell into place. Right now we are using the models in a much easier and reproducible way.

programming r rblogs

Reading a folder with many small files

One of the tools we use in our research is NIR (Near-Infrared Spectroscopy), which we apply to thousands of samples to predict their chemical composition. Each NIR spectrum is contained in a CSV text file with two numerical columns: wavelength and reflectance. All files have the same number of rows (1296 in our case), which corresponds to the number of wavelengths assessed by the spectrometer. One last thing: the sample ID is encoded in the file name.

As an example, file A1-4-999-H-L.0000.csv’s contents look like:


Once the contents of all the files are stored in a single matrix, one can apply a bunch of algorithms to build a model, and then use the model to predict chemical composition for new observations. I am not concerned about that process in this blog post, but only about reading thousands of small files from R, without relying on calls to the operating system to join the small files and read a single large file.

As I see it, I want to:

  • give R a folder name,
  • get a list of all the file names in that folder,
  • iterate over that list and keep only the second column for each of the files.
  • join the elements of the list.

I can use list.files() to get the names of all files in a folder. Rather than using a explicit loop, it’s easier to use lapply() to iterate over the list of names and apply the read.csv() function to all of them. I want a matrix, but lapply() creates a list, so I joined all the elements of the list using to bind the rows using rbind().

spectra_folder <- 'avery_raw_spectra'

read_spectra_folder <- function(folder) {
  # Read all files and keep second column only for each of them. Then join all rows
  spectra_list <- list.files(path = folder, full.names = TRUE)
  raw_spectra <- lapply(spectra_list, 
                        function(x) read.csv(x, header = FALSE)[,2])
  raw_spectra <-, raw_spectra)

There are many ways to test performance, for example using the microbenchmark package. Instead, I'm using something rather basic, almost cute, the Sys.time() function:

start <- Sys.time()
option1 <- read_spectra_folder(spectra_folder)
end <- Sys.time()
end - start

This takes about 12 seconds in my laptop (when reading over 6,000 files). I was curious to see if it would be dramatically faster with data.table, so I replaced read.csv() with fread() and joined the elements of the list using rbindlist().


read_folder_dt <- function(folder) {
  spectra_list <- list.files(path = folder, full.names = TRUE)
  raw_spectra <-  lapply(spectra_list, 
                         function(x) fread(x, sep=",")[,2])
  raw_spectra <- rbindlist(raw_spectra)

Using the same basic timing as before this takes around 10 seconds in my laptop

I have the impression that packages like data.table and readr have been optimized for reading larg(ish) files, so they won't necessarily help much in this reading-many-small-files type of problem. Instead, I tested going back to even more basic R functions (scan), and adding more information about the types of data I was reading. Essentially, going back even closer to base R.

read_folder_scan <- function(folder, prefix = 'F') {
  # Read all files and keep second column only for each of them. Then join all rows
  spectra_list <- list.files(path = folder, full.names = TRUE)
  raw_spectra <- lapply(spectra_list, 
                        function(x) matrix(scan(x, what = list(NULL, double()), 
                                                sep = ',', quiet = TRUE)[[2]], nrow = 1))
  raw_spectra <-, raw_spectra)

Timing this new version takes only 4 seconds, not adding any additional dependencies. Any of these versions is faster than the original code that was growing a data frame with rbind() one iteration at the time.

code programming r rblogs

Calculating parliament seats allocation and quotients

I was having a conversation about dropping the minimum threshold (currently 5% of the vote) for political parties to get representation in Parliament. The obvious question is how would seat allocation change, which of course involved a calculation. There is a calculator in the Electoral Commission website, but trying to understand how things work (and therefore coding) is my thing, and the Electoral Commission has a handy explanation of the Sainte-Laguë allocation formula used in New Zealand. So I had to write my own seat allocation function:

allocate_seats <- function(votes) {
  parties <- names(votes)
  denom <- seq(1, 121, 2)
  quotients <- vapply(denom, FUN =  function(x) votes/x, FUN.VALUE = rep(1, length(votes)))
  quotients <- t(quotients)
  colnames(quotients) <- parties
  priority <- rank(-quotients)
  seat_ranking <- matrix(priority, nrow = nrow(quotients), ncol = ncol(quotients))
  seat_ranking <- ifelse(seat_ranking <= 120, seat_ranking, NA)
  colnames(seat_ranking) <- parties
  return(list(quotients = quotients, ranking = seat_ranking))

Testing it with the preliminary election results (that is, no including special votes) gives:

votes2017 <- c(998813, 776556, 162988, 126995, 10959, 48018, 23456)
names(votes2017) <- c('National', 'Labour', 'NZ First', 'Green', 'ACT', 
                  'Opportunities', 'Māori')

seats2017 <- allocate_seats(votes2017)


#      National Labour NZ First Green ACT Opportunities Māori
# [1,]        1      2        6     9  98            22    46
# [2,]        3      4       19    26  NA            67    NA
# [3,]        5      7       33    42  NA           112    NA
# [4,]        8     11       47    59  NA            NA    NA
# [5,]       10     13       60    77  NA            NA    NA
# [6,]       12     15       73    93  NA            NA    NA
# [7,]       14     17       86   110  NA            NA    NA
# [8,]       16     21      100    NA  NA            NA    NA
# [9,]       18     24      113    NA  NA            NA    NA
#[10,]       20     27       NA    NA  NA            NA    NA
# ...

In our current setup The Opportunities and Māori parties did not reach the minimum threshold (nor won an electorate as ACT violating the spirit of the system), so did not get any seats. Those 4 seats that would have gone to minor parties under no threshold ended up going to National and Labour (2 each). It sucks.

code programming r rblogs teaching

Collecting results of the New Zealand General Elections

I was reading an article about the results of our latest elections where I was having a look at the spatial pattern for votes in my city.

I was wondering how would I go over obtaining the data for something like that and went to the Electoral Commission, which has this neat page with links to CSV files with results at the voting place level. The CSV files have results for each of the candidates in the first few rows (which I didn’t care about) and at the party level later in the file.

As I could see it I needed to:

  1. Read the Electoral Commission website and extract the table that contains the links to all CSV files.
  2. Read each of the files and i- extract the electorate name, ii- skipping all the candidates votes, followed by iii-reading the party vote.
  3. Remove sub-totals and other junk from the files.
  4. Geocode the addresses
  5. Use the data for whatever else I wanted (exam question anyone?).

So I first loaded the needed packages and read the list of CSV files:


# Extract list of CSV file names containing voting place data
voting_place <- ''

election17 <- read_html(voting_place)
election17 %>% 
  html_nodes('table') %>% html_nodes('a') %>% 
  html_attr('href') %>% str_subset('csv') %>% 
  paste('','/',., sep = '') -> voting_place_list

Then wrote a couple of functions to, first, read the whole file, get the electorate name and, second, detect where the party vote starts to keep from that line onwards. Rather than explicitly looping over the list of CSV file names, I used map_dfr from the purrr package to extract the data and join all the results by row.

get_electorate <- function(row) {
  row %>% str_split(pattern = ',') %>% 
    unlist() %>% .[1] %>% str_split(pattern = '-') %>% 
    unlist() %>% .[1] %>% str_trim() -> elect

# Function to read only party-level votes from voting places
extract_party_vote <- function(file_name) {
  all_records <- read_lines(file_name)
  electorate <- get_electorate(all_records[1])
  start_party <- grep('Party Vote', all_records)
  party_records <- all_records[start_party:length(all_records)]
  party_records_df <- read.table(text = party_records, sep = ',', 
                                 fill = TRUE, header = TRUE, quote = '"',
                                 stringsAsFactors = FALSE)
  names(party_records_df)[1:2] <- c('neighbourhood', 'address')
  party_records_df$electorate <- electorate

# Download files and create dataframe
vote_by_place <- map_dfr(voting_place_list, extract_party_vote)

Cleaning the data and summarising by voting place (as one can vote for several electorates in a single place) is fairly straightforward. I appended the string Mobile to mobile teams that visited places like retirement homes, hospitals, prisons, etc:

# Remove TOTAL and empty records
vote_by_place %>% 
  filter(address != '') %>% 
  mutate(neighbourhood = ifelse(neighbourhood == '', paste(electorate, 'Mobile'), neighbourhood)) %>%
  group_by(neighbourhood, address) %>% 
  summarise_at(vars(ACT.New.Zealand:Informal.Party.Votes), sum, na.rm = TRUE) -> clean_vote_by_place

Geolocation is the not-working-very-well part right now. First, I had problems with Google (beyond the 1,000 places limit for the query). Then I went for using the Data Science Kit as the source but, even excluding the mobile places, it was a bit hit and miss for geolocation, particularly as the format of some address (like corner of X and Y) is not the best for a search.

In addition, either of the two sources for geolocation work really slowly and may produce a lot of output. Using sink() could be a good idea to not end up with output for roughly 3,000 queries. I did try the mutate_geocode() function, but didn't work out properly.

# Geolocate voting places    
get_geoloc <- function(record) {
  address <- paste(record$address, 'New Zealand', sep = ', ')
  where <- try(geocode(address, output = "latlona", source = "dsk"), silent = TRUE)
  where$address <- as.character(address)
  where$locality <- as.character(record$neighbourhood)

# Sending whole row to the map function
clean_vote_by_place %>% pmap_dfr(., lift_ld(get_geoloc)) -> geoloc_voting_places

David Robinson was kind enough to help me with the last line of the script, although he updated the advise to:


Given the size of my dataset, either option took bugger all time, although I have to say that

clean_vote_by_place %>% mutate(map(transpose(.), get_geoloc)) -> etc

looks prettier.

Once the data are geolocated, creating a visualisation is not so hard. Even old dogs can find their way to do that!

r rblogs teaching

Where are New Zealand’s bellwether electorates?

I was reading a piece by Graeme Edgeler who, near the end, asked “Where are New Zealand’s bellwether electorates?”. I didn’t know where the data came from or how was the “index of disproportionality for each electorate” calculated, but I saw it mostly as an opportunity to whip up some quick code to practice the use of R and look at other packages that play well with the tidyverse.

The task can be described as: fetch Wikipedia page with results of the 2014 parliamentary election, extract the table with results by electorate, calculate some form of deviation from the national results, get the top X electorates with lowest deviation from national results.

A web search revealed that this page contains a whole bunch of results for the 2014 election and that the specific results I’m interested in are in table number 17 of the list created by html_nodes('table'). Besides the tidyverse, I needed the packages rvest for web scraping, magrittr for using %<>% (pipe and assign to original data frame) and lucid for pretty printing the final table.


election14 <- read_html(',_2014')

election14 %>% 
  html_nodes('table') %>% .[[17]] %>% 
  html_table() %>% filter(Electorate != 'Electorate') -> electorate14


Rather than reading the national results directly from Wikipedia I just typed them in code, as I already had them from some other stuff I was working on. My measure of “disproportionality for each electorate” was as sophisticated as the sum of squared deviations.

# Results for National, Labour, Green, NZ First, Conservative, Internet Mana & Māori
national_results <- c(47.04, 25.13, 10.7, 8.66, 3.99, 1.42, 1.32)

electorate14 %>% mutate(total_vote = apply(.[,2:8], 1, sum),
                        dev = apply(.[,2:8], 1, function(x) sum((x - national_results)^2))) %>%
                 arrange(dev) %>% slice(1:15) %>% lucid

# A tibble: 15 x 10
#             Electorate National Labour Green `NZ First` Conservative `Internet Mana` Māori total_vote   dev
# 1                Ōtaki     49.1   24.8  9.46       9.96         4.41            0.65  0.44       98.8  9.02
# 2        Hamilton West     47.7   25.7  8.21      10.8          4.67            0.72  0.56       98.4 13.2 
# 3        Hamilton East     50     23.8 11          7.14         4.81            1     0.64       98.4 14.5 
# 4    West Coast-Tasman     44.8   23.5 13          8.71         5.12            0.76  0.28       96.2 15.7 
# 5               Napier     49.4   26    8.77       7.43         6.23            0.6   0.44       98.8 17.9 
# 6           Hutt South     45.3   28   12.8        7.48         3.57            0.72  0.53       98.3 18   
# 7           East Coast     48.6   22.7  9.21      11.8          4.08            1.17  0.95       98.6 20.7 
# 8               Nelson     44.4   24.7 14.1        7.67         5.5             0.83  0.33       97.6 23.4 
# 9         Invercargill     49.5   25.1  7.57      11.2          3.68            0.62  0.32       97.9 23.7 
#10            Whanganui     47.3   25.5  7.21      12            5.02            0.73  0.58       98.3 25.4 
#11            Northcote     50.7   22.1 11.6        7.32         4.31            0.95  0.46       97.5 26.3 
#12               Wigram     42.9   28.7 12.8        8.56         3.61            0.76  0.47       97.8 35.4 
#13 Christchurch Central     44.7   26.2 15.8        7.19         3.11            1.03  0.46       98.5 37   
#14             Tukituki     52     22.8  8.57       7.6          6.56            0.68  0.52       98.8 43.3 
#15           Port Hills     47     23.9 17.1        6.62         3.11            0.75  0.4        98.8 48.7 

I’m sure there must be a ‘more idiomatic’ way of doing the squared deviation using the tidyverse. At the same time, using apply came naturally in my head when writing the code, so I opted for keeping it and not interrupting the coding flow. The results are pretty similar to the ones presented by Graeme in his piece.

I’m getting increasingly comfortable with this mestizo approach of using the tidyverse and base R for completing tasks. Whatever it takes to express what I need to achieve quickly and more or less in a readable way.

programming r rblogs stats teaching

Functions with multiple results in tidyverse

I have continued playing with the tidyverse for different parts of a couple of projects.

Often I need to apply a function by groups of observations; sometimes, that function returns more than a single number. It could be something like for each group fit a distribution and return the distribution parameters. Or, simpler for the purposes of this exploration, calculate and return a bunch of numbers.

describe_c <- function(x) {
  mn <- mean(x, na.rm = TRUE)
  dev <- sd(x, na.rm = TRUE)
  n <- sum(!
  cv <- dev/mn*100
  return(c(mean = mn, sdev = dev, count = n, coefvar = cv))

If I have a data frame called field_data, with family codes (trees with the same parents, codes have been changed to protect the innocent) and stem diameters (in mm), I could do the following in base R:

# This line produces an annoying list
summary_one <- with(field_data, tapply(stem, family, FUN = describe_v))

# This puts together a matrix by joining 
# the list results using rbind()
summary_one <-, summary_one)

# To continue processing it might be better to convert
# to a data frame
summary_one <- data.frame(summary_one)

And if I need to do this for several variables, I will need to merge each of these matrices in a data frame.

Continuing with my experimentation with the tidyverse, I was wondering how to get the above going with dplyr et al. After failing a few times I asked the question in Twitter and got a number of helpful replies.

One of the keys is that dplyr can store a list result from a function. Modifying my toy function is pretty straightforward, and now looks like:

describe_list <- function(x) {
  mn <- mean(x, na.rm = TRUE)
  dev <- sd(x, na.rm = TRUE)
  n <- sum(!
  cv <- dev/mn*100
  return(list(c(mean = mn, sdev = dev, count = n, coefvar = cv)))

And we can check the contents of summary_two to see we have a list in which each element contains 4 values:

#  A tibble: 6 x 2
#   family     model
# 1      A 
# 2      B 
# 3      C 
# 4      D 
# 5      E 
# 6      F 

We still need to extract the elements of each element of the list and assign them to a variable name. Using map from the purrr package is pretty straightforward in this case, and we can extract the values either using their names or their position in the element.

summary_two %>% 
  mutate(mn = map_dbl(model,'mean'),
         sd = map_dbl(model,'sdev'),
         n = map_dbl(model,'count'),
         cv = map_dbl(model,4)) %>% head

#  A tibble: 6 x 6
#   family     model       mn       sd     n       cv
# 1      A  190.8306 23.71290   425 12.42615
# 2      B  190.1111 25.46554   396 13.39508
# 3      C  188.2646 27.39215   461 14.54981
# 4      D  189.2668 25.16330   431 13.29514
# 5      E  183.5238 19.70182    21 10.73530
# 6      F  183.1250 28.82377    24 15.73994

I'm still playing with ideas to be lazier at extraction time. An almost abhorrent idea is to provide the output as character for posterior type conversion, as in:

describe_char <- function(x) {
  mn <- mean(x, na.rm = TRUE)
  dev <- sd(x, na.rm = TRUE)
  n <- sum(!
  cv <- dev/mn*100
  return(paste(mn, dev, n, cv, sep = ':'))

field_data %>% 
  group_by(family) %>%
  summarise(model = describe_char(stem)) -> summary_three


# A tibble: 6 x 2
#   family                                                  model
# 1      A 190.830588235294:23.7128956613006:425:12.4261502731746
# 2      B 190.111111111111:25.4655444116168:396:13.3950847284951
# 3      C  188.26464208243:27.3921487349435:461:14.5498105390125
# 4      D 189.266821345708:25.1632953227626:431:13.2951434085746
# 5      E   183.52380952381:19.7018249094317:21:10.7352963959021
# 6      F           183.125:28.8237711378767:24:15.7399432834822

summary_three %>% 
  separate(model, c('mn', 'sd', 'n', 'cv'), sep = ':') %>% head

# A tibble: 6 x 5
#   family               mn               sd     n               cv
# 1      A 190.830588235294 23.7128956613006   425 12.4261502731746
# 2      B 190.111111111111 25.4655444116168   396 13.3950847284951
# 3      C  188.26464208243 27.3921487349435   461 14.5498105390125
# 4      D 189.266821345708 25.1632953227626   431 13.2951434085746
# 5      E  183.52380952381 19.7018249094317    21 10.7352963959021
# 6      F          183.125 28.8237711378767    24 15.7399432834822

And we can get all the way there with:

summary_three %>% 
  separate(model, c('mn', 'sd', 'n', 'cv'), sep = ':') %>% 
  mutate_at(c('mn', 'sd', 'n', 'cv'), as.numeric) %>% head

# A tibble: 6 x 5
#   family       mn       sd     n       cv
# 1      A 190.8306 23.71290   425 12.42615
# 2      B 190.1111 25.46554   396 13.39508
# 3      C 188.2646 27.39215   461 14.54981
# 4      D 189.2668 25.16330   431 13.29514
# 5      E 183.5238 19.70182    21 10.73530
# 6      F 183.1250 28.82377    24 15.73994

Which I assume has all sort of potential negative side-effects, but looks really cool.

In case you want to play with the problem, here is a tiny example of field data.

programming r rblogs stats teaching

Turtles all the way down

One of the main uses for R is for exploration and learning. Let’s say that I wanted to learn simple linear regression (the bread and butter of statistics) and see how the formulas work. I could simulate a simple example and fit the regression with R:

library(arm)  # For display() 

# Simulate 5 observations
x <- 1:5
y <- 2 + 3*x + rnorm(5, mean = 0, sd = 3)

# Fit regression
reg <- lm(y ~ x, dat)

# lm(formula = y ~ x, data = dat)
#             coef.est
# (Intercept) 3.99     3.05   
# x           2.04     0.92   
# ---
# n = 5, k = 2
# residual sd = 2.91, R-Squared = 0.62

# Plot it
plot(y ~ x)

Your typical toy problem.

The formulas for the intercept (\(b_0\)) and the slope (\(b_1\)) are pretty simple, and I have been told that there is a generic expression that instead uses matrices.

\(b_1 = \frac{\sum{x y} - n \bar{x} \bar{y}}{\sum{x x} - n \bar{x}^2}\)
\(b_0 = \bar{y} - b_1 \bar{x}\)

\( \boldsymbol{b} = \boldsymbol{X}`\boldsymbol{X}^{-1} \boldsymbol{Xy}\)

How do the contents of the matrices and the simple formulates relate to each other?

# Formulas for slope and intercept
b1 <- (sum(x*y) - length(x)*mean(x)*mean(y))/(sum(x*x) - length(x)*mean(x)^2)
b0 <- mean(y) - b1*mean(x)

Funnily enough, looking at the matrices we can see similar sums of squares and crossproducts as in the formulas.

X <- model.matrix(reg)

#   (Intercept) x
# 1           1 1
# 2           1 2
# 3           1 3
# 4           1 4
# 5           1 5
# attr(,"assign")
# [1] 0 1

t(X) %*% X
#             (Intercept)  x
# (Intercept)           5 15
# x                    15 55

# So X`X contains bits and pieces of the previous formulas
# [1] 5

# [1] 15

# [1] 55

# And so does X`y
t(X) %*% y
#                  [,1]
# (Intercept)  50.61283
# x           172.27210

# [1] 50.61283

# [1] 172.2721

# So if we combine the whole lot and remember that
# solves calculates the inverse
solve(t(X) %*% X) %*% t(X) %*% y
#                 [,1]
# (Intercept) 3.992481
# x           2.043362

But I have been told that R (as most statistical software) doesn't use the inverse of the matrix for estimating the coefficients. So how does it work?

If I type lm R will print the code of the lm() function. A quick look will reveal that there is a lot of code reading the arguments and checking that everything is OK before proceeding. However, the function then calls something else: With some trepidation I type, which again performs more checks and then calls something with a different notation:

z <- .Call(C_Cdqrls, x, y, tol, FALSE)

This denotes a call to a C language function, which after some searching in Google we find in a readable form in the lm.c file. Another quick look brings more checking and a call to Fortran code:

F77_CALL(dqrls)(REAL(qr), &n, &p, REAL(y), &ny, &rtol,
		    REAL(coefficients), REAL(residuals), REAL(effects),
		    &rank, INTEGER(pivot), REAL(qraux), work);

which is a highly tuned routine for QR decomposition in a linear algebra library. By now we know that the general matrix expression produces the same as our initial formula, and that the R lm() function does not use a matrix inverse but QR decomposition to solve the system of equations.

One of the beauties of R is that brought the power of statistical computing to the masses, by not only letting you fit models but also having a peek at how things are implemented. As a user, I don't need to know that there is a chain of function calls initiated by my bread-and-butter linear regression. But it is comforting to the nerdy me, that I can have a quick look at that.

All this for free, which sounds like a very good deal to me.

programming r rblogs teaching

Old dog and the tidyverse

I started using R ages ago and have happily lived in mostly-base-R for data manipulation. Once in a while I move to something that makes a big difference, like ggplot2 in 2010 or Rmarkdown in 2015, but the set of packages I use for data + plotting hasn’t seen many changes. I have to confess that, meanwhile, I have tested quite a few approaches on the analytics side of things (last year was the turn of Bayesian for me).

Last week, I decided to learn more about the tidyverse, thinking of using it more with forestry postgrad students. Now, there is no lack of tutorials, reviews, documentation, etc. for the tidyverse, but most writing shows a final version of the code, without exposing the thinking and dead ends that go behind it. In this post I show how my code was changing, both after reading a few pieces of documentation and, mostly, from feedback I got from Hadley Wickham and Michael MacAskill via this Kiwi Twitter thread. This post shows minor differences in variable names from that thread, as I changed a few things while reading the files.

The problem is as follows: I have two data frames with trial assessments. Frame one, called early, covers trees at ages 5, 7 and 8 years (although ages are in months rather than years). Frame two, called late, covers trees at age 20 years. Yes, it takes a while working with trees.

We want to keep only age 8 years (96 months) from early and want to split a code into two variables, as well as convert a variable from numeric to character. In late we want to create a tree family code, based on a set of rules to connect field codes to the pedigree of trees. Did I mention that I work breeding trees?

Finally, we want to merge all the assessments from age 8 with the assessment at age 20 for the same trees.

Rather than showing the final version of the code, it is much more interesting to show its evolution, also including how I would have done this in base R. I’m omitting the reading of the file and boring case conversion of variable names, etc.

In base R, I would probably do something like this (I’m using the stringr package just to make my life easier):


early_8 <- subset(early, age == 96)
early_8 <- within(early_8, {
  rep <- sapply(section, function(x) unlist(str_split(x, '_'))[1])
  sets <- sapply(section, function(x) unlist(str_split(x, '_'))[2])
  tree <- as.character(`tree position`)

late <- within(late, {
  family <- ifelse(field_code < 500, paste('885', str_pad(field_code, 3, pad = '0'), sep = ''),
                   ifelse(field_code >= 500 & field_code < 600, paste('883', str_pad(field_code - 500, 3, pad = '0'), sep = ''),
  rep <- as.character(rep)
  tree <- as.character(tree)

both <- merge(early_8, late, by.x = 'genotype', by.y = 'family')

My first approach to dealing with the early frame with the tidyverse looked like:


early %>%
  filter(age == 96) %>%
  mutate(rep = flatten_chr(map(section, function(x) unlist(str_split(x, '_'))[1]))) %>%
  mutate(sets = flatten_chr(map(section, function(x) unlist(str_split(x, '_'))[2]))) %>%
  mutate(tree = as.character(`tree position`)) -> early_8

While the second frame was processed using:

late %>% 
  mutate(family = ifelse(field_code < 500, paste('885', str_pad(field_code, 3, pad = '0'), sep = ''),
                         ifelse(field_code >= 500 & field_code < 600, paste('883', str_pad(field_code - 500, 3, pad = '0'), sep = ''),
                                as.character(field_code)))) %>%
  mutate(rep = as.character(rep))
  mutate(tree = as.character(tree)) -> late

I used multiple instances of mutate because I thought it would be easier to read. The use of map instead of sapply is cool, particularly when one starts looking at more advanced features. Comments from the crowd in Twitter: mutate the whole lot in a single statement (although Hadley pointed out that there was no performance penalty by using them separately) and try using case_when to make nested ifelse easier to understand. The comments on map went in two directions: either use map_chr as a clearer and safer alternative, or just use dplyr's separate function. The first option for early would look like:

early %>%
  filter(age == 96) %>%
  mutate(rep = map_chr(section, function(x) unlist(str_split(x, '_'))[1])) %>%
  mutate(sets = map_chr(section, function(x) unlist(str_split(x, '_'))[2])) %>%
  mutate(tree = as.character(`tree position`)) -> early_8

However, I ended up going with the final version that used separate, which is easier on the eye and faster, for a final version that looks like this:

early %>% 
  filter(age == 96) %>%
  separate(section, c('rep', 'sets'), sep = '_') %>%
  mutate(tree = as.character(`tree position`),
         genotype = as.character(1:nrow(.) + 10^6)) -> early_8

So we filter early, separate the single set code into two variables (rep and sets) and create a couple of variables using mutate (one is a simple type conversion to character, while the other is a code starting at 1,000,000).

In the case of late, I ended up with:

late %>% 
  mutate(family = case_when(
                    field_code < 500  ~  paste('885', str_pad(field_code, 3, pad = '0'), sep = ''),
                    field_code >= 500 & field_code < 600  ~  paste('883', str_pad(field_code - 500, 3, pad = '0'), sep = ''),
                    TRUE ~ as.character(field_code)),
         rep = as.character(rep), 
         tree = as.character(tree)) -> late

And we merge the files using:

early_8 %>% 
  left_join(late, c('rep', 'sets', 'tree')) %>%
  left_join(base_pedigree, by = c('family' = 'genotype'))  -> both

Some (many, most?) people may disagree with my use of right assign, which I love. Surely one could use either left assign or %<>% from the maggrittr package. By the way, why do I have to explicitely load magrittr (instead of relying on tidyverse) to access %<>%?

And this is how I go about learning new things: lots of false starts, often working with small examples (I used a few to check how left_join was working), lots of searching for explanations/tutorials (thanks to everyone who has written them) and asking in Twitter. If you are just starting programming, in any language, do not feel intimidated by cool looking code; most of the time it took many iterations to get it looking like that.

bayesian ggplot r rblogs stats teaching

Cute Gibbs sampling for rounded observations

I was attending a course of Bayesian Statistics where this problem showed up:

There is a number of individuals, say 12, who take a pass/fail test 15 times. For each individual we have recorded the number of passes, which can go from 0 to 15. Because of confidentiality issues, we are presented with rounded-to-the-closest-multiple-of-3 data (\(\mathbf{R}\)). We are interested on estimating \(\theta\) of the Binomial distribution behind the data.

Rounding is probabilistic, with probability 2/3 if you are one count away from a multiple of 3 and probability 1/3 if the count is you are two counts away. Multiples of 3 are not rounded.

We can use Gibbs sampling to alternate between sampling the posterior for the unrounded \(\mathbf{Y}\) and \(\theta\). In the case of \(\mathbf{Y}\) I used:

# Possible values that were rounded to R
possible <- function(rounded) {
if(rounded == 0) {
options <- c(0, 1, 2)
} else {
options <- c(rounded - 2, rounded - 1, rounded,
rounded + 1, rounded + 2)

# Probability mass function of numbers rounding to R
# given theta
prior_y <- function(options, theta) {
p <- dbinom(options, 15, prob = theta)

# Likelihood of rounding
like_round3 <- function(options) {
if(length(options) == 3) {
like <- c(1, 2/3, 1/3) }
else {
like <- c(1/3, 2/3, 1, 2/3, 1/3)

# Estimating posterior mass function and drawing a
# random value of it
posterior_sample_y <- function(R, theta) {
po <- possible(R)
pr <- prior_y(po, theta)
li <- like_round3(po)
post <- li*pr/sum(li*pr)
samp <- sample(po, 1, prob = post)

While for \(theta\) we are assuming a vague \(mbox{Beta}(alpha, eta)\), with \(alpha\) and \(eta\) equal to 1, as prior density function for \(theta\), so the posterior density is a \(mbox{Beta}(alpha + sum Y_i, eta + 12*15 - sum Y_i)\).

## Function to sample from the posterior Pr(theta | Y, R)
posterior_sample_theta <- function(alpha, beta, Y) {
theta <- rbeta(1, alpha + sum(Y), beta + 12*15 - sum(Y))

I then implemented the sampler as:

## Data
R <- c(0, 0, 3, 9, 3, 0, 6, 3, 0, 6, 0, 3)
nsim <- 10000
burnin <- 1000
alpha <- 1
beta <- 1
store <- matrix(0, nrow = nsim, ncol = length(R) + 1)

starting.values <- c(R, 0.1)

## Sampling
store[1,] <- starting.values
for(draw in 2:nsim){
current <- store[draw - 1,]
for(obs in 1:length(R)) {
y <- posterior_sample_y(R[obs], current[length(R) + 1])
# Jump or not still missing
current[obs] <- y
theta <- posterior_sample_theta(alpha, beta, current[1:length(R)])
# Jump or not still missing
current[length(R) + 1] <- theta

store[draw,] <- current

And plotted the results as:

plot((burnin+1):nsim, store[(burnin+1):nsim,13], type = 'l')


ggplot(data.frame(theta = store[(burnin+1):nsim,13]), aes(x = theta)) +
geom_density(fill = 'blue', alpha = 0.5)

Posterior density for Binomials's theta.
Posterior density for [latex]theta[/latex].

multiple_plot <- data.frame(Y = matrix(store[(burnin+1):nsim, 1:12],
nrow = (nsim - burnin)*12,
ncol = 1))
multiple_plot$obs <- factor(rep(1:12, each = (nsim - burnin)))

ggplot(multiple_plot, aes(x = Y)) + geom_histogram() + facet_grid(~obs)

Posterior mass for each rounded observation.
Posterior mass for each rounded observation.

I thought it was a nice, cute example of simultaneously estimating a latent variable and, based on that, estimating the parameter behind it.

maps r rblogs spatial

Mucking around with maps, schools and ethnicity in NZ

I’ve been having a conversation for a while with @kamal_hothi and @aschiff on maps, schools, census, making NZ data available, etc. This post documents some basic steps I used for creating a map on ethnic diversity in schools at the census-area-unit level. This “el quicko” version requires 3 ingredients:

  • Census area units shape files (available from Statistics New Zealand for free here).
  • School directory (directory-school-current.csv available for free here).
  • R with some spatial packages (also free).

We’ll read the school directory data, aggregate ethnicity information to calculate the Herfindahl–Hirschman Index of diversity and then plot it.

# School directory
direc = read.csv('directory-school-current.csv', skip = 3)

# Total number of students for each ethnicity by Census Area Unit
hhi = aggregate(cbind(European..Pakeha, Maori, Pasifika, Asian, MELAA, Other) ~
Census.Area.Unit, data = direc, FUN = sum, na.rm = TRUE)

# Function to calculate
index = function(x) {
total = sum(x, na.rm = TRUE)
frac = x/total
h = sum(frac^2)
hn = if(total > 1) (h - 1/total)/(1 - 1/total) else 1

# Calculate the index for each area
hhi$hn = apply(hhi[,2:7], 1, index)

# Write data to use in QGis later
write.csv(hhi, 'diversity-index-census-area-unit.csv', quote = FALSE,
row.names = FALSE)

Then I moved to create a map in R, for the sake of it:

library(rgdal) # for readOGR
library(sp)    # for spplot

# Reading shapefile
cau = readOGR(dsn='/Users/lap44/Dropbox/research/2015/census/2014 Digital Boundaries Generlised Clipped',

# Joining with school ethnicity data (notice we refer to @data, as cau contains spatial info as well)
cau@data = data.frame(cau@data,
hhi[match(cau@data[,"AU2014_NAM"], hhi[,"Census.Area.Unit"]),])

# Limiting map to the area around Christchurch
spplot(cau, zcol = "hn", xlim = c(1540000, 1590000),
ylim= c(5163000, 5198000))

And we get a plot like this:

Ethnic diversity in schools at the Census Area Unit level (0 very diverse, 1 not diverse at all).
Ethnic diversity in schools at the Census Area Unit level (0 very diverse, 1 not diverse at all).

Just because it is Monday down under.

P.S. Using the diversity-index-census-area-unit.csv and joining it with the shapefile in QGIS one can get something prettier (I have to work on matching the color scales):

Similar map produced with point and click in QGIS.
Similar map produced with point and click in QGIS.

Map rendering is so much faster in QGIS than in R! Clearly the system has been optimized for this user case.

data curious genetically modified rblogs teaching

Comment on Sustainability and innovation in staple crop production in the US Midwest

After writing a blog post about the paper “Sustainability and innovation in staple crop production in the US Midwest” I decided to submit a formal comment to the International Journal of Agricultural Sustainability in July 2013, which was published today. As far as I know, Heinemann et al. provided a rebuttal to my comments, which I have not seen but that should be published soon. This post is an example on how we can use open data (in this case from the USDA and FAO) and free software (R) to participate in scientific discussion (see supplementary material below).

The text below the *** represents my author’s version provided as part of my Green Access rights. The article published in the International Journal of Agricultural Sustainability [copyright Taylor & Francis]; is freely available online at

While I had many issues with the original article, I decided to focus on three problems—to make the submission brief and fit under the 1,000 words limit enforced by the journal editor. The first point I make is a summary of my previous post on the article, and then move on to two big problems: assuming that only the choice of biotechnology affects yield (making the comparison between USA and Western Europe inadequate) and comparing use of agrochemicals at the wrong scale (national- versus crop-level).

PS 2014-08-05 16:30: Heinemann et al.’s reply to my comment is now available.

  • They state that “Apiolaza did not report variability in the annual yield data” and that they had to ‘reconstruct’ my figure. Firstly, the published figure 2 does include error bars and, secondly, the data and R code are available as supplementary information (in contrast to their original paper). Let’s blame the journal for not passing this information to them.
  • They also include 2 more years of data (2011 & 2012), although the whole point of my comment is that the conclusions they derived with the original data set were not justified. Heinemann et al. are right in that yields in 2011 & 2012 were higher in Western Europe than in the USA (the latter suffering a huge drought); however, in 2013 it was the opposite: 99,695 Hg/ha for USA vs 83,724 Hg/ha for Western Europe.
  • They question that I commented only on one crop, although I did cover another crop (wheat) quickly with respect to the use of pesticides—but not with the detail I wanted, as there was a 1,000 words limit. I would have also mentioned the higher variability for Western European wheat production, as it is weird that they pointed out that high variability is a problems for USA maize production but said nothing for wheat in Europe.
  • Furthermore, they also claimed “There is no statistically significant difference in the means over the entire 50-year period” however, a naive paired t-test t.test(FAOarticle$Yield[1:50], FAOarticle$Yield[51:100], paired = TRUE) (see full code below) says otherwise.



This comment highlight issues when comparing genetically modified (GM) crops to non-GM ones across countries. Ignoring structural differences between agricultural sectors and assuming common yield trajectories before the time of introduction of GM crops results on misestimating the effect of GM varieties. Further data collection and analyses should guide policy-makers to encourage diverse approaches to agriculture, rather than excluding specific technologies (like GM crops) from the onset.

Keywords: genetic modification; biotechnology; productivity; economics

In a recent article Heinemann et al. (2013) focused “on the US staple crop agrobiodiversity, particularly maize” using the contrast between the yield of Western Europe and United States as a proxy for the comparison between genetically modified (GM) maize versus non-GM maize. They found no yield benefit from using GM maize when comparing the United States to Western Europe.

In addition, Heinemann et al. contrasted wheat yields across United States and Western Europe to highlight the superiority of the European biotechnological package from a sustainability viewpoint.

I am compelled to comment on two aspects that led the authors to draw incorrect conclusions on these issues. My statistical code and data are available as supplementary material.

1. Misestimating the effect of GM maize varieties

Heinemann et al. used FAO data[], from 1961 to 2010 inclusive, to fit linear models with yield as the response variable, country and year as predictors. Based on this analysis they concluded, “W. Europe has benefitted from the same, or marginally greater, yield increases without GM”. However, this assumes a common yield trajectory for United States and Western Europe before significant commercial use of GM maize, conflating GM and non-GM yields. GM maize adoption in United States has continually increased from 25% of area of maize planted in 2000 to the current 90% (Figure 1, United States Department of Agriculture 2013[]).

Figure 1: Adoption of GM maize in United States, expressed as percentage of planted area.
Figure 1: Adoption of GM maize in United States, expressed as percentage of total maize planted area (click to enlarge).

If we fit a linear model from 1961 to 1999 (last year with less than 25% area of GM maize) we obtain the following regression equations \(y = 1094.8 x + 39895.6\) (United States, R2 = 0.80) and \(y = 1454.5 x + 29802.2\) (W. Europe, R2 = 0.90). This means that Western Europe started with a considerably lower yield than the USA (29,802.2 vs 39,895.6 hg/ha) in 1961 but increased yields faster than USA (1,454.5 vs 1,094.8 hg/ha per year) before substantial use of GM maize. By 1999 yield in Western Europe was superior to that in United States.

This is even more evident in Figure 2, which shows average yield per decade, removing year-to-year extraneous variation (e.g. due to weather). Western European yields surpassed United States’s during the 1990s (Figure 2). This trend reverses in the 2000s, while United States simultaneously increased the percentage of planted area with GM maize, directly contradicting Heinemann et al.’s claim.

Figure 2: Average maize yield (and standard error) per decade for United States and Western Europe. The 2010s include a single year to replicate the original data set (click to enlarge).
Figure 2: Average maize yield (and standard error) per decade for United States and Western Europe. The 2010s include a single year to replicate the original data set (click to enlarge).

2. Ignoring structural differences between agricultural sectors

When discussing non-GM crops using wheat the authors state, “the combination of biotechnologies used by W. Europe is demonstrating greater productivity than the combination used by the United States”. This sentence summarizes one of the central problems of their article: assuming that, if it were not for the choice of biotechnology bundle, the agricultural sectors would have the same intrinsic yield, making them comparable. However, many inputs beside biotechnology affect yield. For example, Neumann et al. (2010) studied the spatial distribution of yield and found that in the Unites States “access can explain most of the variability in wheat efficiency. In the more remote regions land prices are lower and inputs are therefore often substituted by land leading to lower efficiencies”. Lower yields in United States make sense from an economic point of view, as land replaces more expensive inputs like agrochemicals.

Heinemann et al. support their case by comparing pesticide use between United States and France across all crops. However, what is relevant to the discussion is pesticide use for the crops being compared. European cereals, and wheat in particular, are the most widely fungicide-treated group of crops worldwide (Kucket al. 2012). For example, 27% of the wheat planted area in France was already treated with fungicides by 1979 (Jenkins and Lescar 1980). More than 30 years later in the United States this figure has reached only 19% for winter wheat (which accounts for 70% of planted area, NASS 2013). Fungicide applications result on higher yield responses (Oerke 2006).

Final remarks

Heinemann et al. ignored available data on GM adoption when analysing maize yields. They also mistakenly treated biotechnological bundles as the only/main explanation for non-GMO yield differences between United States and Western Europe. These issues mean that the thrust of their general conclusion is unsupported by the available evidence. Nevertheless, their article also raised issues that deserve more consideration; e.g. the roles of agricultural subsidies and market concentration on food security.

Agricultural sustainability requires carefully matching options in the biotechnology portfolio to site-specific economic, environmental and cultural constraints. Further data collection and analyses should lead policy makers to encourage diverse approaches to agriculture, rather than excluding specific technologies (like GMOs and pesticides) from the onset.


Heinemann, J. A., Massaro, M., Coray, D. S., Agapito-Tenfen, S. Z. and Wen, J. D. 2013. Sustainability and innovation in staple crop production in the US Midwest. International Journal of Agricultural Sustainability (available here).

Jenkins, J. E. E. and Lescar, L. 1980. Use of foliar fungicides on cereals in Western Europe. Plant Disease, 64(11): 987-994 (behind paywall).

Kuck, K. H., Leadbeater, A. and Gisi, U. 2012. FRAC Mode of Action Classification and Resistance Risk of Fungicides. In: Krämer, W., Schirmer, U., Jeschke, P. and Witschel, M., eds., Modern Crop Protection Compounds. Wiley. 539-567.

NASS, 2013. Highlights: 2012 Agricultural Chemical Use Survey. Wheat. United States Department of Agriculture (available here).

Neumann, K., Verburg, P. H., Stehfest, E., and Müller, C. 2010. The yield gap of global grain production: a spatial analysis. Agricultural Systems, 103(5), 316–326 (behind paywall).

Oerke E. 2006. Crop losses to pests. Journal of Agriculture Science, 144: 31-43 (behind paywall, or Free PDF).

United States Department of Agriculture. 2013. Adoption of Genetically Engineered Crops in the U.S. USDA Economic Research Service (available here).

Supplementary material

You can replicate the analyses and plots produced in this comment using the following files:

  • Maize production data for United States and Western Europe (csv, extracted from FAO). []
  • GMO maize penetration data (csv, extracted from USDA). []
  • R code for analyses (R file, changed extension to .txt so WordPress would not complain).
programming r rblogs teaching

Sometimes I feel (some) need for speed

I’m the first to acknowledge that most of my code could run faster. The truth of the matter is that, in essence, I write ‘quickies’: code that will run once or twice, so there is no incentive to spend days or hours in shaving seconds of a computation. Most analyses of research data fall in to this approach: read data-clean data-fit model-check model-be happy-write article-(perhaps) make data and code available-move on with life.

One of the reasons why my code doesn’t run faster or uses less memory is the trade-off between the cost of my time (very high) compared to the cost of more memory or faster processors (very cheap) and the gains of shaving a few seconds or minutes of computer time, which tend to be fairly little.

In R vectorization is faster than working with each vector element, although it implies allocating memory for whole vectors and matrices, which for large-enough problems may become prohibitively expensive. On the other hand, not vectorizing some operations may turn your problem into an excruciatingly slow exercise and, for example in large simulations, in practice intractable in a useful timeframe.

Dealing with vectors and matrices is like dealing with many chairs simultaneously. Some patterns and operations are easier and faster than others. (Photo: Luis, click to enlarge).
Dealing with vectors and matrices is like dealing with many chairs simultaneously. Some patterns and operations are easier and faster than others. (Photo: Luis, click to enlarge).

John Muschelli wrote an interesting post reimplementing 2×2 frequency tables for a highly specific use, comparing the results of image processing algorithms. In John’s case, there are two greater-than-9-million-long-elements logical vectors, and when comparing vectors for many images the process becomes very slow. He explains part of his rationale in his blog (go and read it), but say that his solution can be expressed like this:

# I don't have any image data, but I'll simulate a couple
# of 10 million long logical vectors (just to round things)

manual = sample(c(TRUE, FALSE), 10E6, replace = TRUE)
auto = sample(c(TRUE, FALSE), 10E6, replace = TRUE) = function(x, y) {
tt = sum(x & y)
tf = sum(x & !y)
ft = sum(!x & y)
ff = sum(!x & !y)
return(matrix(c(ff, tf, ft, tt), 2, 2))
}, auto)

which uses 1/4 of the time used by table(manual, auto). Mission accomplished! However, if I stopped here this blog post would not make much sense, simply rehashing (pun intended) John’s code. The point is to explain what is going on and, perhaps, to find even faster ways of performing the calculations. As a start, we have to be aware that the calculations in rely on logical (boolean) operations and on the coercion of logical vectors to numerical values, as stated in the documentation:

Logical vectors are coerced to integer vectors in contexts where a numerical value is required, with TRUE being mapped to 1L, FALSE to 0L and NA to NA_integer_.

In R Logical operations can be slower than mathematical ones, a consideration that may guide us to think of the problem in a slightly different way. For example, take the difference between the vectors (dif = x - y), so both TRUE-TRUE (1-1) and FALSE-FALSE (0-0) are 0, while TRUE - FALSE (1-0) is 1 and FALSE - TRUE (0-1) is -1. Therefore:

  • the sum of positive values (sum(dif > 0)) is the frequency of TRUE & FALSE,
  • while the sum of negative values (sum(dif < 0)) is the frequency of FALSE & TRUE.

The values for TRUE & TRUE can be obtained by adding up the element-wise multiplication of the two vectors, as TRUE*TRUE (1*1) is the only product that's different from zero. A vectorized way of performing this operation would be to use t(x) %*% y; however, for large vectors the implementation crossprod(x, y) is faster. The values for FALSE & FALSE are simply the difference of the length of the vectors and the previously calculated frequencies length(dif) - tt - tf - ft. Putting it all together: = function(x, y) {
dif = x - y
tf = sum(dif > 0)
ft = sum(dif < 0)
tt = crossprod(x, y)
ff = length(dif) - tt - tf - ft
return(c(tf, ft, tt, ff))

This code takes 1/20 of the time taken by table(x, y). An interesting result is that the crossproduct crossprod(x, y) can also be expressed as sum(x * y), which I didn't expect to be faster but, hey, it is. So we can express the code as:

basic.tab2 = function(x, y) {
dif = x - y
tf = sum(dif > 0)
ft = sum(dif < 0)
tt = sum(x*y)
ff = length(dif) - tt - tf - ft
return(c(tf, ft, tt, ff))

to get roughly 1/22 of the time. The cost of logical operations is easier to see if we isolate particular calculations in and; for example,

tf1 = function(x, y) {
tf = sum(x & !y)

is slower than

tf2 = function(x, y) {
dif = x - y
tf = sum(dif > 0)

This also got me thinking of the cost of coercion: Would it take long? In fact, coercing logical vectors to numeric has little (at least I couldn't see from a practical viewpoint) if any cost. In some cases relying on logical vectors converted using as.numeric() seems to be detrimental on terms of speed.

As I mentioned at the beginning, vectorization uses plenty of memory, so if we were constrained in that front and we wanted to do a single pass on the data we could write an explicit loop: = function(x, y) {
tt = 0; tf = 0; ft = 0; ff = 0

for(i in seq_along(x)) {
if(x[i] == TRUE & y[i] == TRUE)
tt = tt + 1
if(x[i] == TRUE & y[i] == FALSE)
tf = tf + 1
if(x[i] == FALSE & y[i] == TRUE)
ft = ft + 1
ff = ff + 1
return(matrix(c(ff, tf, ft, tt), 2, 2))

The does only one pass over the vectors and should use less memory as it doesn't need to create those huge 10M elements vectors all the time (at least it would if we used a proper iterator in the loop instead of iterating on a vector of length 10M (that's 40 MB big in this case). The iterators package may help here). We save room/memory at the expense of speed, as is ten times slower than the original table() function. Of course one could run it a lot faster if implemented in another language like C++ or Fortran and here Rcpp or Rcpp11 would come handy (updated! See below).

This is only a not-so-short way of reminding myself what's going on when trading-off memory, execution speed and my personal time. Thus, I am not saying that any of these functions is the best possible solution, but playing with ideas and constraints that one often considers when writing code. Incidentally, an easy way to compare the execution time of these functions is using the microbenchmark library. For example:

microbenchmark(, auto), basic.tab2(manual, auto), times = 1000)

will spit numbers when running a couple of functions 1,000 times with results that apply to your specific system.

PS 2014-07-12 11:08 NZST Hadley Wickham suggests using tabulate(manual + auto * 2 + 1, 4) as a fast alternative. Nevertheless, I would like to point out that i- the initial tabulation problem is only an excuse to discuss a subset of performance issues when working with R and ii- this post is more about the journey than the destination.

PS 2014-07-13 20:32 NZST Links to two posts related to this one:

  • Wiekvoet's work which i- relies on marginals and ii- reduces the use of logical operators even more than my basic.tab2(), simultaneously taking around 1/4 of the time.
  • Yin Zhu's post describing vectors is a handy reminder of the basic building blocks used in this post.

Updated with Rcpp goodness!

PS 2014-07-13 21:45 NZST I browsed the Rcpp Gallery and Hadley Wickham's Rcpp tutorial and quickly converted to C++. Being a bit of an oldie I've used Fortran (90, not that old) before but never C++, so the following code is probably not very idiomatic.

cppFunction('NumericVector loopyCpp(LogicalVector x, LogicalVector y) {
int niter = x.size();
int tt = 0, tf = 0, ft = 0, ff = 0;
NumericVector tab(4);

for(int i = 0; i < niter; i++) {
if(x[i] == TRUE && y[i] == TRUE)
if(x[i] == TRUE && y[i] == FALSE)
if(x[i] == FALSE && y[i] == TRUE)

tab[0] = ff; tab[1] = tf; tab[2] = ft; tab[3] = tt;
return tab;

loopyCpp(manual, auto)

Loops and all it runs roughly twice as fast as basic.tab2() but it should also use much less memory.