Categories
r rblogs research stats teaching

Implementing a model as an R package

In our research group we often have people creating statistical models that end up in publications but, most of the time, the practical implementation of those models is lacking. I mean, we have a bunch of barely functioning code that is very difficult to use reliably in the operations of the breeding programs. I was very keen on continuing to use one of the models in our research, keen enough to rewrite and document the model fitting, and then create another package for using the model in operations.

Unfortunately, neither the data nor the model are mine to give away, so I can’t share them (yet). But I hope these notes will help you if you are in the same boat and need to use your models (or ‘you’ are in fact future me, who tends to forget how or why I wrote code in a specific way).

A basic motivational example

Let’s start with a simple example: linear regression. We fit a response as a function of a predictor variable, and then we can predict the response for new values of the predictor contained in new_data with:

my_model <- lm(response ~ predictor, data = my_data)
predictions <- predict(my_model, newdata = new_data)

# Saving the model object
save(my_model, file = 'model_file.Rda')

The model coefficients needed to predict new values are stored in the my_model object. If we want to use the model elsewhere, we can save the object as an .Rda file, in this case model_file.Rda.

We can later read the model file in, say, a different project and get new predictions using:

load('model_file.Rda')
more_predictions <- predict(my_model, newdata = yet_another_new_data)

Near-infrared Spectroscopy

Near-infrared spectroscopy is the stuff of CSI and other crime shows. We measure the reflection at different wavelengths and run a regression analysis using what we want to predict as Y in the model. The number of predictors (wavelengths) is much larger than in the previous example—1,296 for the NIR machine we are using—so it is not unusual to have more predictors than observations. Models for NIR spectra are often trained using partial least squares, via plsr() from the pls package, with help from functions in the prospectr package.

I could still use the save/load approach from the motivational example to store and reuse the model object created with pls but, instead, I wanted to implement the model, plus some auxiliary functions, in a package to make the functions easier to use in our lab.

I had two issues/struggles/learning opportunities that I needed to sort out to get this package working:

1. How to automatically load the model object when attaching the package?

Normally, datasets and other objects go in the data folder, where they are made available to the user. Instead, I wanted to make the object internally available. The solution turned out to be quite straightforward: save the model object to a file called sysdata.rda in the R folder of the package (that is, R/sysdata.rda). This file is automatically loaded when we run library(package_name). We just need to create that file with something like:

save(my_model, file = 'R/sysdata.rda')
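If you use the usethis package, there is a one-liner that creates the same file (an alternative; not what I originally used):

usethis::use_data(my_model, internal = TRUE)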

2. How to make predict.pls work in the package?

I was struggling to use the predict function because, in my head, it was being provided by the pls package. However, pls is only extending the predict generic, which comes with the default R installation as part of the stats package. In the end, I sorted it out with the following Imports, Depends and LazyData in the DESCRIPTION file:

Imports: prospectr,
         stats
Depends: pls
Encoding: UTF-8
LazyData: Yes

Now it is possible to use predict; just remember to specify the package it comes from, as in:

stats::predict(my_model, ncomp = n_components,
               newdata = spectra, interval = 'confidence')

Nothing groundbreaking, I know, but I spent a bit of time sorting out those couple of annoyances before everything fell into place. Right now we are using the models in a much easier and more reproducible way.

Categories
programming r rblogs

Reading a folder with many small files

One of the tools we use in our research is NIR (Near-Infrared Spectroscopy), which we apply to thousands of samples to predict their chemical composition. Each NIR spectrum is contained in a CSV text file with two numerical columns: wavelength and reflectance. All files have the same number of rows (1296 in our case), which corresponds to the number of wavelengths assessed by the spectrometer. One last thing: the sample ID is encoded in the file name.

As an example, file A1-4-999-H-L.0000.csv’s contents look like:

8994.82461,0.26393
8990.96748,0.26391
8987.11035,0.26388
8983.25322,0.26402
8979.39609,0.26417
...

Once the contents of all the files are stored in a single matrix, one can apply a bunch of algorithms to build a model, and then use the model to predict chemical composition for new observations. I am not concerned about that process in this blog post, but only about reading thousands of small files from R, without relying on calls to the operating system to join the small files and read a single large file.

As I see it, I want to:

  • give R a folder name,
  • get a list of all the file names in that folder,
  • iterate over that list and keep only the second column for each of the files, and
  • join the elements of the list.

I can use list.files() to get the names of all files in a folder. Rather than using an explicit loop, it’s easier to use lapply() to iterate over the list of names and apply the read.csv() function to all of them. I want a matrix, but lapply() creates a list, so I join all the elements of the list using do.call(), binding the rows with rbind().

spectra_folder <- 'avery_raw_spectra'

read_spectra_folder <- function(folder) {
  # Read all files, keep only the second column of each, then join all rows
  spectra_list <- list.files(path = folder, full.names = TRUE)
  raw_spectra <- lapply(spectra_list, 
                        function(x) read.csv(x, header = FALSE)[, 2])
  do.call(rbind, raw_spectra)
}

There are many ways to test performance, for example using the microbenchmark package. Instead, I'm using something rather basic, almost cute, the Sys.time() function:

start <- Sys.time()
option1 <- read_spectra_folder(spectra_folder)
end <- Sys.time()
end - start

This takes about 12 seconds on my laptop (when reading over 6,000 files). I was curious to see if it would be dramatically faster with data.table, so I replaced read.csv() with fread() and joined the elements of the list using rbindlist().

library(data.table)

read_folder_dt <- function(folder) {
  spectra_list <- list.files(path = folder, full.names = TRUE)
  # Read only the second column and transpose it, so each file becomes one row
  raw_spectra <- lapply(spectra_list, 
                        function(x) transpose(fread(x, sep = ",", select = 2)))
  rbindlist(raw_spectra)
}

Using the same basic timing as before, this takes around 10 seconds on my laptop.

I have the impression that packages like data.table and readr have been optimized for reading larg(ish) files, so they won't necessarily help much with this reading-many-small-files type of problem. Instead, I tested going back to an even more basic R function (scan()) and adding more information about the types of data I was reading; essentially, moving even closer to base R.

read_folder_scan <- function(folder) {
  # Read all files, keep only the second column of each, then join all rows
  spectra_list <- list.files(path = folder, full.names = TRUE)
  raw_spectra <- lapply(spectra_list, 
                        function(x) matrix(scan(x, what = list(NULL, double()), 
                                                sep = ',', quiet = TRUE)[[2]], nrow = 1))
  do.call(rbind, raw_spectra)
}

Timing this new version takes only 4 seconds, without adding any additional dependencies. Any of these versions is faster than the original code, which grew a data frame with rbind() one iteration at a time.
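For a more thorough comparison than Sys.time() one could use the microbenchmark package mentioned above. A minimal sketch, assuming the three functions defined in this post (times is kept low because each run reads thousands of files):

library(microbenchmark)

microbenchmark(
  base_read = read_spectra_folder(spectra_folder),
  data_table = read_folder_dt(spectra_folder),
  base_scan = read_folder_scan(spectra_folder),
  times = 5
)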

Categories
r teaching

From character to numeric pedigrees

In quantitative genetic analyses we often use a pedigree to represent the relatedness between individuals, so that relatedness is accounted for in the analyses, because the observations are not independent of each other. Often this pedigree contains alphanumeric labels, and most software can cope with that.

Sometimes, though, we want to use numeric identities because we would like to make the data available to third parties (other researchers, publication) and there is commercial sensitivity about them. Or we just want to use a piece of software that can’t deal with character identities.

Last night I put together an ‘El quicko’* function to numberify identities, which returns a list with a numeric version of the pedigree and a key to go back to the old identities.

# Requires dplyr (or the tidyverse) for %>% and mutate()
library(dplyr)

numberify <- function(pedigree) {
  ped_key <- with(pedigree, unique(c(as.character(mother), 
                                     as.character(father), 
                                     as.character(tree_id))))
  numeric_pedigree <- pedigree %>%
    mutate(tree_id = as.integer(factor(tree_id, levels = ped_key)),
           mother = as.integer(factor(mother, levels = ped_key)),
           father = as.integer(factor(father, levels = ped_key)))
  
  return(list(ped = numeric_pedigree, key = ped_key))
}

new_ped <- numberify(old_ped)

old_id <- new_ped$key[new_ped$ped$tree_id]
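A hypothetical toy pedigree (using '0' for unknown parents and the column names numberify() expects) shows the round trip:

library(dplyr)

toy_ped <- tibble(tree_id = c('A100', 'A101', 'B200'),
                  mother  = c('0', 'A100', 'A100'),
                  father  = c('0', '0', 'A101'))

toy_num <- numberify(toy_ped)
toy_num$ped                       # numeric tree_id, mother and father
toy_num$key[toy_num$ped$tree_id]  # back to the original labels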

* It could be generalized to extract the names of the 3 fields, etc.

Categories
data curious stats teaching tutorials

Being data curious: the strange case of lamb consumption in NZ

There is a lot of talk about the skills needed for working in Statistics/Data Science, with the discussion often focusing on theoretical understanding, programming languages, exploratory data analysis, and visualization. There are many good blog posts dealing with how you get data, process it with your favorite language and then create some good-looking plots. However, in my opinion, one important skill is curiosity; more specifically, being data curious.

Oftentimes being data curious doesn’t require statistics or coding, just searching for and looking at graphs. A quick example comes from Mike Dickinson’s tweet: “This is extraordinary: within a decade, NZers basically stopped eating lamb. 160 years of tradition scrapped almost overnight.”

After reading the news article, many people came up with good potential explanations: Have the relative prices changed? Do we have different demographics with not so much appetite for lamb? etc.

Few comments questioned the data until Peter Ellis voiced exactly what was nagging me:

Do the two data points make sense? In this data-abundant world, it didn’t take long to find the time series from which the points came, in this handy OECD page.

Sheep meat consumption, kg/person. Data from OECD statistics.

A quick look shows that the series contains both of the quoted consumption figures, showing the talked-about 10-year decline. Even more surprisingly, one can see that most of the decline occurred from 2008 to 2009 (from 17.7 to 4.9 kg/person), which is a bizarre drop for a single year. A single person may have large differences in consumption from one year to the next; however, over a whole country those deviations tend to average out. This highlights another issue with the time series: it wiggles like crazy.

When exploring data it is useful to have some sort of benchmark to see if other things are also changing at the same time. I chose our neighbour Australia (not so different a diet, similar part of the world) as my benchmark. The Australian time series doesn’t show a change like New Zealand’s. Besides using the benchmark for the same product, we can also compare what’s going on with other meats; for example beef and veal, pork and poultry.

Pork consumption for Australia and New Zealand, kg/capita.

Poultry consumption for Australia and New Zealand, kg/capita.

All the series are smoother and show similar trends in Australia and New Zealand, which makes the lamb saga increasingly look like a mistake. We can now move from trying to explain social changes that are driving the change between two numbers, to being highly suspicious about the numbers under discussion!

Export lamb slaughter in New Zealand.

So where could the problem be coming from? Consumption per capita requires i) total domestic consumption of sheep meat and ii) the population of the country. We are pretty sure we have good data for population, courtesy of Statistics New Zealand. How would one go about estimating domestic consumption of sheep meat? Probably one would:

  • Get the size of the New Zealand sheep flock. We can get sheep numbers from Statistics NZ Agricultural Production Statistics. Livestock numbers are a national indicator, which tend to have high accuracy.
  • Get an idea of the proportion of the flock that’s exported, which we know is pretty substantial. I don’t know how good these numbers are, but Beef & Lamb NZ gives us an idea of how many sheep are slaughtered for export. This number, which hovers around 20 million a year, seems quite consistent. We have to remember that not the whole population is slaughtered every year, as we have to replace the flock.
  • Take the difference: flock size − (sheep for export + replacement sheep) should be the number of sheep available for domestic consumption.
  • We need a conversion factor between number of sheep and kg of meat produced, so we can calculate meat consumption/capita.

I would assume that the sheep-meat conversion factor will show little fluctuation from year to year, so perhaps the likely culprit is the penultimate point, estimating the number of sheep for domestic consumption. One thing that grabs my attention is that while the flock is getting smaller, the number of sheep for exports stays around the same, which should mean fewer sheep available for the domestic market, giving credibility to the lower lamb consumption trend.

I don’t know if this is the actual explanation for the “lamb consumption crash”. If I had more time I could chase some of the domestic consumption numbers, or even call the Beef & Lamb people. But this should be enough to get you started on questioning the news using real data. I’m sure you, the reader, can come up with better ways of looking at this and other stories.

Categories
code r teaching

Reducing friction in R to avoid Excel

When you have students working on a project there is always an element of quality control. Sometimes the results just make sense, while other times we are suspicious that something is going wrong. This means going back to check the whole analysis process: can we retrace all the steps of a calculation (going back to data collection) and see if there is anything funny going on? So we sat down with the student and started running code (in RStudio, of course), and I noticed something interesting: there was a lot of redundancy, and pieces of code that didn’t do anything or were weirdly placed. These are typical signs of code copied from several sources, which, together with the presence of setwd(), showed unfamiliarity with R and RStudio (we have a mix of students with a broad range of R skills).

But the part that really caught my eye was that the script read many Near Infrared spectra files, column-bound them together with the sample ID (which was 4 numbers separated by hyphens) and saved the resulting 45 MB data set to a CSV file. Then the student opened the file in Excel, split the sample ID into 4 columns, deleted the top row, saved the file and read it back into R to continue the process.

The friction point that forced the student to drop to Excel (the first of many not easily reproducible steps) was variable splitting. The loop for reading the files, and some condition testing, was hard to follow too. If one knows R well, any of these steps is relatively simple; but if one doesn’t, the copying and pasting from many different sources begins, often with inconsistent programming approaches.

Here is where I think the tidyverse brings something important to the table: consistency, more meaningful function names and good documentation. For example, doing:

nir %>%
  separate(sample_id, c('block', 'tree', 'family', 'side'), sep = '-') 

is probably the easiest way of splitting the contents of a single variable into several columns.
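As a toy illustration (made-up sample IDs and a single made-up spectral reading):

library(dplyr)
library(tidyr)

nir <- tibble(sample_id = c('1-21-305-1', '1-21-305-2'),
              nir_1 = c(0.264, 0.271))

nir %>%
  separate(sample_id, c('block', 'tree', 'family', 'side'), sep = '-')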

When working with several collaborators (colleagues, students, etc) the easiest way to reduce friction is to convince/drag/supplicate everyone to adopt a common language. Within the R world, the tidyverse is the closest thing we have to a lingua franca of research collaboration. ‘But isn’t R a lingua franca already?’ you may ask. The problem is that programming in base R is often too weird for normal people, and too many people just give up before feeling they can do anything useful in R (particularly if they are proficient in Excel).

Even if you are an old dog (like me) I think it pays to change to a subset of R that is more learnable. And once someone gets hooked, the transition to adding non-tidyverse functions is more bearable.

Categories
r research

Keeping track of research

If you search for data analysis workflows for research there are lots of blog posts on using R + databases + git, etc. While in some cases I may end up working with a combination like that, it’s much more likely that reality is closer to a bunch of emailed Excel or CSV files.

Some may argue that one should move the whole group of collaborators to work the right way. In practice, well, not everyone has the interest and/or the time to do so. In one of our collaborations we are dealing with a trial established in 2009, and I was tracking a field coding mistake (as in a mistake made outdoors, doing field work, assigning codes to trees), so I had to backtrack to where the errors were introduced. After checking emails from three collaborators, I think I put together the story and found the correct code values in a couple of files going back two years.

The new analysis lives in an RStudio project with the following characteristics:

  1. Folder in Dropbox, so it’s copied in several locations and it’s easy to share.
  2. Excel or CSV files with their original names (warts and all), errors, etc. Resist the temptation to rename the files to sane names: keeping the originals makes it easier to trace back the history of the project.
  3. R code.
  4. Important part: a text file (Markdown if you want) documenting the names of the data files and who sent them to me, and when.

Very low tech but, hey, it works.

Warts and all: fight your inner OCD and keep original file names.

Categories
books recycling

Influences: Cronopios and Famas

Books have accompanied me for all my life, or at least for as long as I can remember. However, my reading habits have changed many times, from reading simple books, to reading very complex books, to reading anything, to reading if I squeeze a few minutes here and there, to… you get the idea. ‘Habits’ is a funny word, an oxymoron, to refer to constant change.

Today I was thinking of influential books. Not ‘good’ books, or books that have received many awards, or that have guided generations or catalyzed social change. I mean only books that have been important for me at a given point in time. If I had read them before or after that time they may have passed unnoticed. But I read them then, at the right time… for me.

As an adult I have moved houses several times, and every time I have lost books. There are also books that have been with me all this time. One of them is ‘Cronopios and Famas’, a collection of very short stories by Julio Cortázar, one of the big voices of Argentinian literature. My first encounter with ‘Historias de Cronopios y Famas’–the original Spanish title–was in my maternal grandparents’ apartment. I was living with them and I was looking for something to read. Anything. I opened a drawer and found some interesting books, including Cortázar’s. It was one of the first editions, which I think belonged to one of my uncles, the one in exile.

Why was this an important book? Language, raw language. I am completely at a loss when trying to explain Cortázar to someone who has not read his books. As Borges said:

No one can retell the plot of a Cortázar story; each one consists of determined words in a determined order. If we try to summarize them, we realize that something precious has been lost—Jorge Luis Borges

In ‘Progreso y retroceso’ (progress and regress) the whole story fits in only two paragraphs. The story is about a crystal that lets flies through but that does not let them come back because ‘no one knows what stuff in the flexibility of the fibers of this crystal, which was too fibrous’ or something like that:

Inventaron un cristal que dejaba pasar las moscas. La mosca venía empujaba un poco con la cabeza y, pop, ya estaba del otro lado. Alegría enormísima de la mosca.

Todo lo arruinó un sabio húngaro al descubrir que la mosca podía entrar pero no salir, o viceversa a causa de no se sabe que macana en la flexibilidad de las fibras de este cristal, que era muy fibroso. En seguida inventaron el cazamoscas con un terrón de azúcar dentro, y muchas moscas morían desesperadas. Así acabó toda posible confraternidad con estos animales dignos de mejor suerte.

The story is straightforward, with simple, almost pedestrian words. But those words have been extremely carefully selected, crafted in a particular order. I imagine Cortázar spending countless hours, agonizing over a myriad of small decisions until reaching a point of perfect simplicity.

There was a clear before and after reading this book in 1981: language was never the same again. I learned to find the fantastic side of the quotidian. I grew to appreciate risk when building sentences, when pushing meanings and readings. My whole way of looking at the world was influenced by a small book of ridiculous short stories.

P.S. I published this post in my old, extinct blog on 2009-02-02

Categories
books recycling

This time is Calvino

This happens relatively frequently: I am talking with someone who doesn’t know me well and, at some point in the conversation, I mention that I am a forester. Then we move on to books, I mention someone like Borges or Calvino, and they look at me with a puzzled face, as in ‘I didn’t know that foresters could read’. I know, it happens to other professions as well; just for the record, not all of us are semi-literate apes working with a chainsaw.

I was sorting out my bookshelves at work when I found a copy of The literature machine, a collection of essays by Italo Calvino. It had my name and signature, together with 2002, Melbourne, Australia. (Digression: besides my name and signature I always put the city where I bought a book.) I had vague memories of walking around Melbourne’s CBD and finding an underground bookshop. At the time I was not looking for anything in particular, just browsing titles.

Why did I buy the book and never read it? I do remember browsing it and getting distracted by something more urgent, albeit clearly unimportant, because I cannot remember what it was. Probably I was not ready either; it has happened to me before. From ‘Uncle Tom’s cabin’ when I was nine, to ‘The Fountainhead’ when I was a teenager, to ‘The literature machine’ seven years ago. Most likely there is an issue of maturity, of being ready to read a particular story, philosophy or approach to the world.

Many years ago I read some of Calvino’s books, like Cosmicomics (brilliantly funny) and ‘The cloven viscount’ (very enjoyable reading). But I particularly struggle with two literary forms: essays and plays. I sometimes can get into the former, but the latter has proven–until today–insurmountable.

However, today is the time for Calvino and essays. There is something deeply stimulating in these essays, together with a quaintness created by the forty years gone by since they were written. The feeling of freshness, possibility and hope from 1968 reads strangely in 2017. At the same time, there is a bit of a break with the system since the implosion of the international economy. Maybe it is an excellent time to resonate with Calvino, as in the old days.

Categories
code programming r rblogs

Calculating parliament seats allocation and quotients

I was having a conversation about dropping the minimum threshold (currently 5% of the vote) for political parties to get representation in Parliament. The obvious question is how seat allocation would change, which of course involved a calculation. There is a calculator on the Electoral Commission website, but trying to understand how things work (and therefore coding them) is my thing, and the Electoral Commission has a handy explanation of the Sainte-Laguë allocation formula used in New Zealand. So I had to write my own seat allocation function:

allocate_seats <- function(votes) {
  parties <- names(votes)
  # Sainte-Laguë divisors: 1, 3, 5, ..., enough for 120 seats
  denom <- seq(1, 121, 2)
  
  # Quotients votes/divisor; after transposing, rows are divisors and columns are parties
  quotients <- vapply(denom, FUN = function(x) votes/x, FUN.VALUE = rep(1, length(votes)))
  quotients <- t(quotients)
  colnames(quotients) <- parties
  
  # Rank all quotients from largest to smallest; the top 120 get a seat
  priority <- rank(-quotients)
  seat_ranking <- matrix(priority, nrow = nrow(quotients), ncol = ncol(quotients))
  seat_ranking <- ifelse(seat_ranking <= 120, seat_ranking, NA)
  colnames(seat_ranking) <- parties
  
  return(list(quotients = quotients, ranking = seat_ranking))
}

Testing it with the preliminary election results (that is, not including special votes) gives:

votes2017 <- c(998813, 776556, 162988, 126995, 10959, 48018, 23456)
names(votes2017) <- c('National', 'Labour', 'NZ First', 'Green', 'ACT', 
                  'Opportunities', 'Māori')

seats2017 <- allocate_seats(votes2017)

seats2017$ranking

#      National Labour NZ First Green ACT Opportunities Māori
# [1,]        1      2        6     9  98            22    46
# [2,]        3      4       19    26  NA            67    NA
# [3,]        5      7       33    42  NA           112    NA
# [4,]        8     11       47    59  NA            NA    NA
# [5,]       10     13       60    77  NA            NA    NA
# [6,]       12     15       73    93  NA            NA    NA
# [7,]       14     17       86   110  NA            NA    NA
# [8,]       16     21      100    NA  NA            NA    NA
# [9,]       18     24      113    NA  NA            NA    NA
#[10,]       20     27       NA    NA  NA            NA    NA
# ...
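Counting the non-NA entries in each column of the ranking gives the total seats per party; something like:

# Seats per party: quotients ranked within the top 120
colSums(!is.na(seats2017$ranking))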

In our current setup, the Opportunities and Māori parties did not reach the minimum threshold (nor did they win an electorate, as ACT did, violating the spirit of the system), so they did not get any seats. The 4 seats that would have gone to minor parties under no threshold ended up going to National and Labour (2 each). It sucks.

Categories
code programming r rblogs teaching

Collecting results of the New Zealand General Elections

I was reading an article about the results of our latest elections where I was having a look at the spatial pattern for votes in my city.

I was wondering how I would go about obtaining the data for something like that and went to the Electoral Commission, which has this neat page with links to CSV files with results at the voting-place level. The CSV files have results for each of the candidates in the first few rows (which I didn’t care about), followed by results at the party level later in the file.

As I saw it, I needed to:

  1. Read the Electoral Commission website and extract the table that contains the links to all the CSV files.
  2. Read each of the files: i) extract the electorate name, ii) skip all the candidate votes, and iii) read the party vote.
  3. Remove sub-totals and other junk from the files.
  4. Geocode the addresses.
  5. Use the data for whatever else I wanted (exam question, anyone?).

So I first loaded the needed packages and read the list of CSV files:

library(magrittr)
library(tidyverse)
library(rvest)
library(stringr)
library(ggmap)

# Extract list of CSV file names containing voting place data
voting_place <- 'http://www.electionresults.govt.nz/electionresults_2017_preliminary/voting-place-statistics.html'

election17 <- read_html(voting_place)
election17 %>% 
  html_nodes('table') %>% html_nodes('a') %>% 
  html_attr('href') %>% str_subset('csv') %>% 
  paste('http://www.electionresults.govt.nz/electionresults_2017_preliminary','/',., sep = '') -> voting_place_list

Then I wrote a couple of functions: one to read a whole file and get the electorate name, and another to detect where the party vote starts and keep from that line onwards. Rather than explicitly looping over the list of CSV file names, I used map_dfr() from the purrr package to extract the data and join all the results by row.

get_electorate <- function(row) {
  row %>% str_split(pattern = ',') %>% 
    unlist() %>% .[1] %>% str_split(pattern = '-') %>% 
    unlist() %>% .[1] %>% str_trim() -> elect
  return(elect)
}

# Function to read only party-level votes from voting places
extract_party_vote <- function(file_name) {
  all_records <- read_lines(file_name)
  
  electorate <- get_electorate(all_records[1])
  
  start_party <- grep('Party Vote', all_records)
  party_records <- all_records[start_party:length(all_records)]
  party_records_df <- read.table(text = party_records, sep = ',', 
                                 fill = TRUE, header = TRUE, quote = '"',
                                 stringsAsFactors = FALSE)
  names(party_records_df)[1:2] <- c('neighbourhood', 'address')
  party_records_df$electorate <- electorate
  return(party_records_df)
}

# Download files and create dataframe
vote_by_place <- map_dfr(voting_place_list, extract_party_vote)

Cleaning the data and summarising by voting place (as one can vote for several electorates in a single place) is fairly straightforward. I appended the string Mobile to mobile teams that visited places like retirement homes, hospitals, prisons, etc:

# Remove TOTAL and empty records
vote_by_place %>% 
  filter(address != '') %>% 
  mutate(neighbourhood = ifelse(neighbourhood == '', paste(electorate, 'Mobile'), neighbourhood)) %>%
  group_by(neighbourhood, address) %>% 
  summarise_at(vars(ACT.New.Zealand:Informal.Party.Votes), sum, na.rm = TRUE) -> clean_vote_by_place

Geolocation is the not-working-very-well part right now. First, I had problems with Google (beyond the 1,000 places limit for the query). Then I went for using the Data Science Kit as the source but, even excluding the mobile places, it was a bit hit and miss for geolocation, particularly as the format of some addresses (like corner of X and Y) is not the best for a search.

In addition, either of the two geolocation sources works really slowly and may produce a lot of output. Using sink() could be a good idea to avoid ending up with output from roughly 3,000 queries. I did try the mutate_geocode() function, but it didn't work out properly.

# Geolocate voting places    
get_geoloc <- function(record) {
  address <- paste(record$address, 'New Zealand', sep = ', ')
  where <- try(geocode(address, output = "latlona", source = "dsk"), silent = TRUE)
  where$address <- as.character(address)
  where$locality <- as.character(record$neighbourhood)
  return(where)
}

# Sending whole row to the map function
clean_vote_by_place %>% pmap_dfr(., lift_ld(get_geoloc)) -> geoloc_voting_places

David Robinson was kind enough to help me with the last line of the script, although he updated the advice to:

(tweet: https://twitter.com/drob/status/913584753974136832)

Given the size of my dataset, either option took bugger all time, although I have to say that

clean_vote_by_place %>% mutate(map(transpose(.), get_geoloc)) -> etc

looks prettier.

Once the data are geolocated, creating a visualisation is not so hard. Even old dogs can find their way to do that!
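A minimal sketch of that visualisation, assuming geoloc_voting_places ends up with lon and lat columns (which is what geocode() returns with output = 'latlona'):

library(ggplot2)

ggplot(geoloc_voting_places, aes(x = lon, y = lat)) +
  geom_point(alpha = 0.3) +
  coord_quickmap() +
  labs(title = 'Voting places, 2017 preliminary results')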

Categories
r rblogs teaching

Where are New Zealand’s bellwether electorates?

I was reading a piece by Graeme Edgeler who, near the end, asked “Where are New Zealand’s bellwether electorates?”. I didn’t know where the data came from or how the “index of disproportionality for each electorate” was calculated, but I saw it mostly as an opportunity to whip up some quick code to practice the use of R and look at other packages that play well with the tidyverse.

The task can be described as: fetch the Wikipedia page with the results of the 2014 parliamentary election, extract the table with results by electorate, calculate some form of deviation from the national results, and get the top X electorates with the lowest deviation from the national results.

A web search revealed that this page contains a whole bunch of results for the 2014 election and that the specific results I’m interested in are in table number 17 of the list created by html_nodes('table'). Besides the tidyverse, I needed the packages rvest for web scraping, magrittr for using %<>% (pipe and assign to original data frame) and lucid for pretty printing the final table.

library(tidyverse)
library(rvest)
library(magrittr)
library(lucid)

election14 <- read_html('https://en.wikipedia.org/wiki/New_Zealand_general_election,_2014')

election14 %>% 
  html_nodes('table') %>% .[[17]] %>% 
  html_table() %>% filter(Electorate != 'Electorate') -> electorate14

glimpse(electorate14)

Rather than reading the national results directly from Wikipedia I just typed them in code, as I already had them from some other stuff I was working on. My measure of “disproportionality for each electorate” was as sophisticated as the sum of squared deviations.

# Results for National, Labour, Green, NZ First, Conservative, Internet Mana & Māori
national_results <- c(47.04, 25.13, 10.7, 8.66, 3.99, 1.42, 1.32)

electorate14 %>% mutate(total_vote = apply(.[,2:8], 1, sum),
                        dev = apply(.[,2:8], 1, function(x) sum((x - national_results)^2))) %>%
                 arrange(dev) %>% slice(1:15) %>% lucid

# A tibble: 15 x 10
#             Electorate National Labour Green `NZ First` Conservative `Internet Mana` Māori total_vote   dev
#                  <chr>    <dbl>  <dbl> <dbl>      <dbl>        <dbl>           <dbl> <dbl>      <dbl> <dbl>
# 1                Ōtaki     49.1   24.8  9.46       9.96         4.41            0.65  0.44       98.8  9.02
# 2        Hamilton West     47.7   25.7  8.21      10.8          4.67            0.72  0.56       98.4 13.2 
# 3        Hamilton East     50     23.8 11          7.14         4.81            1     0.64       98.4 14.5 
# 4    West Coast-Tasman     44.8   23.5 13          8.71         5.12            0.76  0.28       96.2 15.7 
# 5               Napier     49.4   26    8.77       7.43         6.23            0.6   0.44       98.8 17.9 
# 6           Hutt South     45.3   28   12.8        7.48         3.57            0.72  0.53       98.3 18   
# 7           East Coast     48.6   22.7  9.21      11.8          4.08            1.17  0.95       98.6 20.7 
# 8               Nelson     44.4   24.7 14.1        7.67         5.5             0.83  0.33       97.6 23.4 
# 9         Invercargill     49.5   25.1  7.57      11.2          3.68            0.62  0.32       97.9 23.7 
#10            Whanganui     47.3   25.5  7.21      12            5.02            0.73  0.58       98.3 25.4 
#11            Northcote     50.7   22.1 11.6        7.32         4.31            0.95  0.46       97.5 26.3 
#12               Wigram     42.9   28.7 12.8        8.56         3.61            0.76  0.47       97.8 35.4 
#13 Christchurch Central     44.7   26.2 15.8        7.19         3.11            1.03  0.46       98.5 37   
#14             Tukituki     52     22.8  8.57       7.6          6.56            0.68  0.52       98.8 43.3 
#15           Port Hills     47     23.9 17.1        6.62         3.11            0.75  0.4        98.8 48.7 

I’m sure there must be a ‘more idiomatic’ way of doing the squared deviation using the tidyverse. At the same time, using apply came naturally in my head when writing the code, so I opted for keeping it and not interrupting the coding flow. The results are pretty similar to the ones presented by Graeme in his piece.
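One possibility, sketched here with tidyr's pivot_longer() (a function newer than this post) and assuming the same electorate14 and national_results objects:

national_lookup <- tibble(party = names(electorate14)[2:8],
                          national = national_results)

electorate14 %>% 
  pivot_longer(2:8, names_to = 'party', values_to = 'percentage') %>% 
  left_join(national_lookup, by = 'party') %>% 
  group_by(Electorate) %>% 
  summarise(dev = sum((percentage - national)^2)) %>% 
  arrange(dev) %>% 
  slice(1:15)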

I’m getting increasingly comfortable with this mestizo approach of using the tidyverse and base R for completing tasks. Whatever it takes to express what I need to achieve quickly and more or less in a readable way.

Categories
programming r rblogs stats teaching

Functions with multiple results in tidyverse

I have continued playing with the tidyverse for different parts of a couple of projects.

Often I need to apply a function by groups of observations; sometimes that function returns more than a single number. It could be something like ‘for each group, fit a distribution and return the distribution parameters’ or, simpler for the purposes of this exploration, calculate and return a bunch of numbers.

describe_c <- function(x) {
  mn <- mean(x, na.rm = TRUE)
  dev <- sd(x, na.rm = TRUE)
  n <- sum(!is.na(x))
  cv <- dev/mn*100
  return(c(mean = mn, sdev = dev, count = n, coefvar = cv))
}

If I have a data frame called field_data, with family codes (trees with the same parents, codes have been changed to protect the innocent) and stem diameters (in mm), I could do the following in base R:

# This line produces an annoying list
summary_one <- with(field_data, tapply(stem, family, FUN = describe_c))

# This puts together a matrix by joining 
# the list results using rbind()
summary_one <- do.call(rbind, summary_one)

# To continue processing it might be better to convert
# to a data frame
summary_one <- data.frame(summary_one)

And if I need to do this for several variables, I will need to merge each of these matrices into a data frame.

Continuing with my experimentation with the tidyverse, I was wondering how to get the above going with dplyr et al. After failing a few times I asked the question in Twitter and got a number of helpful replies.

One of the keys is that dplyr can store a list result from a function. Modifying my toy function is pretty straightforward, and now looks like:

describe_list <- function(x) {
  mn <- mean(x, na.rm = TRUE)
  dev <- sd(x, na.rm = TRUE)
  n <- sum(!is.na(x))
  cv <- dev/mn*100
  return(list(c(mean = mn, sdev = dev, count = n, coefvar = cv)))
}
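Applying it by group gives us summary_two (the call mirrors the one used for summary_three further down):

field_data %>% 
  group_by(family) %>%
  summarise(model = describe_list(stem)) -> summary_two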

And we can check the contents of summary_two to see that we have a list in which each element contains 4 values:

head(summary_two)
#  A tibble: 6 x 2
#   family     model
#    <chr>    <list>
# 1      A <dbl [4]>
# 2      B <dbl [4]>
# 3      C <dbl [4]>
# 4      D <dbl [4]>
# 5      E <dbl [4]>
# 6      F <dbl [4]>

We still need to extract the elements of each element of the list and assign them to a variable name. Using map from the purrr package is pretty straightforward in this case, and we can extract the values either using their names or their position in the element.

summary_two %>% 
  mutate(mn = map_dbl(model,'mean'),
         sd = map_dbl(model,'sdev'),
         n = map_dbl(model,'count'),
         cv = map_dbl(model,4)) %>% head

#  A tibble: 6 x 6
#   family     model       mn       sd     n       cv
#    <chr>    <list>    <dbl>    <dbl> <dbl>    <dbl>
# 1      A <dbl [4]> 190.8306 23.71290   425 12.42615
# 2      B <dbl [4]> 190.1111 25.46554   396 13.39508
# 3      C <dbl [4]> 188.2646 27.39215   461 14.54981
# 4      D <dbl [4]> 189.2668 25.16330   431 13.29514
# 5      E <dbl [4]> 183.5238 19.70182    21 10.73530
# 6      F <dbl [4]> 183.1250 28.82377    24 15.73994

I'm still playing with ideas to be lazier at extraction time. An almost abhorrent idea is to provide the output as character for later type conversion, as in:

describe_char <- function(x) {
  mn <- mean(x, na.rm = TRUE)
  dev <- sd(x, na.rm = TRUE)
  n <- sum(!is.na(x))
  cv <- dev/mn*100
  return(paste(mn, dev, n, cv, sep = ':'))
}

field_data %>% 
  group_by(family) %>%
  summarise(model = describe_char(stem)) -> summary_three

head(summary_three)

# A tibble: 6 x 2
#   family                                                  model
#    <chr>                                                  <chr>
# 1      A 190.830588235294:23.7128956613006:425:12.4261502731746
# 2      B 190.111111111111:25.4655444116168:396:13.3950847284951
# 3      C  188.26464208243:27.3921487349435:461:14.5498105390125
# 4      D 189.266821345708:25.1632953227626:431:13.2951434085746
# 5      E   183.52380952381:19.7018249094317:21:10.7352963959021
# 6      F           183.125:28.8237711378767:24:15.7399432834822

summary_three %>% 
  separate(model, c('mn', 'sd', 'n', 'cv'), sep = ':') %>% head

# A tibble: 6 x 5
#   family               mn               sd     n               cv
#    <chr>            <chr>            <chr> <chr>            <chr>
# 1      A 190.830588235294 23.7128956613006   425 12.4261502731746
# 2      B 190.111111111111 25.4655444116168   396 13.3950847284951
# 3      C  188.26464208243 27.3921487349435   461 14.5498105390125
# 4      D 189.266821345708 25.1632953227626   431 13.2951434085746
# 5      E  183.52380952381 19.7018249094317    21 10.7352963959021
# 6      F          183.125 28.8237711378767    24 15.7399432834822

And we can get all the way there with:

summary_three %>% 
  separate(model, c('mn', 'sd', 'n', 'cv'), sep = ':') %>% 
  mutate_at(c('mn', 'sd', 'n', 'cv'), as.numeric) %>% head

# A tibble: 6 x 5
#   family       mn       sd     n       cv
#    <chr>    <dbl>    <dbl> <dbl>    <dbl>
# 1      A 190.8306 23.71290   425 12.42615
# 2      B 190.1111 25.46554   396 13.39508
# 3      C 188.2646 27.39215   461 14.54981
# 4      D 189.2668 25.16330   431 13.29514
# 5      E 183.5238 19.70182    21 10.73530
# 6      F 183.1250 28.82377    24 15.73994

Which I assume has all sorts of potential negative side-effects, but looks really cool.
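These days one could be even lazier and skip the character round trip altogether using tidyr's unnest_wider(), which appeared in tidyr versions released well after this post; a sketch:

library(tidyr)

summary_two %>% 
  unnest_wider(model)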

In case you want to play with the problem, here is a tiny example of field data.