Keeping track of research

If you search for data analysis workflows for research, there are lots of blog posts on using R + databases + git, etc. While in some cases I may end up working with a combination like that, it’s much more likely that reality is closer to a bunch of emailed Excel or CSV files.

Some may argue that one should move the whole group of collaborators to work the right way. In practice, well, not everyone has the interest and/or the time to do so. In one of our collaborations we are dealing with a trial established in 2009 and I was tracking a field coding mistake (as in happening outdoors, doing field work, assigning codes to trees), so I had to backtrack where the errors were introduced. After checking emails from three collaborators, I think I put together the story and found the correct code values in a couple of files going back two years.

The new analysis lives in an RStudio project with the following characteristics:

  1. Folder in Dropbox, so it’s copied in several locations and it’s easy to share.
  2. Excel or CSV files with their original names (warts and all), errors, etc. Resist the temptation to rename the files to sane names, so it’s easier to track back the history of the project.
  3. R code
  4. Important part: a text file (Markdown if you want) documenting the names of the data files, and who sent each one to me and when.

Very low tech but, hey, it works.

Warts and all: fight your inner OCD and keep original file names.

Influences: Cronopios and Famas

Books have accompanied me for all my life, or at least for as long as I can remember. However, my reading habits have changed many times, from reading simple books, to reading very complex books, to reading anything, to reading if I squeeze a few minutes here and there, to… you get the idea. ‘Habits’ is a funny word, an oxymoron, to refer to constant change.

Today I was thinking of influential books. No ‘good’ books or books that have received many awards or that have guided generations or catalyzed social change. I mean only books that have been important for me at a given point in time. If I had read them before or after that time they may have passed unnoticed. But I read them then, at the right time… for me.

As an adult I have moved houses several times, and every time I have lost books. There are also books that have been with me all this time. One of them is ‘Cronopios and Famas’, a collection of very short stories by Julio Cortázar, one of the big voices of Argentinian literature. My first encounter with ‘Historias de Cronopios y Famas’–the original Spanish title–was in my maternal grandparents’ apartment. I was living with them and I was looking for something to read. Anything. I opened a drawer and found some interesting books, including Cortázar’s. It was one of the first editions, which I think belonged to one of my uncles, the one in exile.

Why was this an important book? Language, raw language. I am completely at a loss when trying to explain Cortázar to someone who has not read his books. As Borges said:

No one can retell the plot of a Cortázar story; each one consists of determined words in a determined order. If we try to summarize them, we realize that something precious has been lost—Jorge Luis Borges

In ‘Progreso y retroceso’ (progress and regress) the whole story fits in just two paragraphs. The story is about a crystal that lets flies through but does not let them come back, because of ‘no one knows what quirk in the flexibility of the fibers of this crystal, which was very fibrous’ or something like that:

Inventaron un cristal que dejaba pasar las moscas. La mosca venía empujaba un poco con la cabeza y, pop, ya estaba del otro lado. Alegría enormísima de la mosca.

Todo lo arruinó un sabio húngaro al descubrir que la mosca podía entrar pero no salir, o viceversa a causa de no se sabe que macana en la flexibilidad de las fibras de este cristal, que era muy fibroso. En seguida inventaron el cazamoscas con un terrón de azúcar dentro, y muchas moscas morían desesperadas. Así acabó toda posible confraternidad con estos animales dignos de mejor suerte.

The story is straightforward, told with simple, almost pedestrian words. But those words have been extremely carefully selected, crafted in a particular order. I imagine Cortázar spending countless hours, agonizing over myriad small decisions until reaching a point of perfect simplicity.

There was a clear before and after reading this book in 1981: language was never the same again. I learned to find the fantastic side of the quotidian. I grew to appreciate risk when building sentences, when pushing meanings and readings. My whole way of looking at the world was influenced by a small book of ridiculous short stories.

P.S. I published this post in my old, extinct blog on 2009-02-02

This time is Calvino

This happens relatively frequently: I am talking with someone who doesn’t know me well and, at some point of the conversation, I mention that I am a forester. Then we move into books and I mention someone like Borges or Calvino, and they look at me with this puzzled face as in ‘I didn’t know that foresters could read’. I know, it happens to other professions as well; just for the record, not all of us are semi-literate apes working with a chainsaw.

I was sorting out my bookshelves at work when I found a copy of ‘The literature machine’, a collection of essays by Italo Calvino. It had my name and signature, together with 2002, Melbourne, Australia. (Digression: besides my name and signature I always put the city where I bought a book.) I had vague memories of walking around Melbourne’s CBD and finding an underground bookshop. At the time I was not looking for anything in particular, just browsing titles.

Why did I buy the book and never read it? I do remember browsing it and getting distracted by something more urgent, albeit clearly unimportant, because I cannot remember what it was. Probably I was not ready either; it has happened to me before. From ‘Uncle Tom’s cabin’ when I was nine, to ‘The Fountainhead’ when I was a teenager, to ‘The literature machine’ seven years ago. Most likely there is an issue of maturity, of being ready to read a particular story, philosophy or approach to the world.

Many years ago I read some of Calvino’s books, like Cosmicomics (brilliantly funny) and ‘The cloven viscount’ (very enjoyable reading). But I particularly struggle with two literary forms: essays and plays. I sometimes can get into the former, but the latter has proven–until today–insurmountable.

However, today is the time for Calvino and essays. There is something deeply stimulating in these essays, together with a quaintness created by the forty years gone by since they were written. The feeling of freshness, possibility and hope from 1968 reads strangely in 2017. At the same time, there is again a sense of breaking with the system, since the implosion of the international economy. Maybe it is an excellent time to resonate with Calvino, as in the old days.

Calculating parliament seats allocation and quotients

I was having a conversation about dropping the minimum threshold (currently 5% of the vote) for political parties to get representation in Parliament. The obvious question is how seat allocation would change, which of course involved a calculation. There is a calculator on the Electoral Commission website, but trying to understand how things work (and therefore coding) is my thing, and the Electoral Commission has a handy explanation of the Sainte-Laguë allocation formula used in New Zealand. So I had to write my own seat allocation function:
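A minimal sketch of what such a function can look like; the function name, argument names and the 120-seat default are my own choices, so treat them as placeholders:

```r
# Sainte-Laguë allocation: divisors are 1, 3, 5, ... and each seat goes, one
# at a time, to the party with the highest quotient
allocate_seats <- function(votes, n_seats = 120) {
  seats <- setNames(rep(0, length(votes)), names(votes))
  for (i in seq_len(n_seats)) {
    quotients <- votes / (2 * seats + 1)
    winner <- which.max(quotients)
    seats[winner] <- seats[winner] + 1
  }
  seats
}
```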

Testing it with the preliminary election results (that is, not including special votes) gives:
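Something along these lines, with made-up vote counts purely for illustration (the actual preliminary figures live on the Electoral Commission website):

```r
# Illustrative party votes only; NOT the actual preliminary results
prelim <- c(National = 998000, Labour = 776000, `NZ First` = 162000,
            Green = 126000, TOP = 48000, `Māori` = 23000, ACT = 11000)

# No threshold: every party is in the running
allocate_seats(prelim)

# 5% threshold (ignoring the electorate-seat exception): small parties are
# dropped before the quotients are calculated
allocate_seats(prelim[prelim / sum(prelim) >= 0.05])
```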

In our current setup The Opportunities and Māori parties did not reach the minimum threshold (nor did they win an electorate, as ACT did while violating the spirit of the system), so they did not get any seats. Those 4 seats that would have gone to minor parties under no threshold ended up going to National and Labour (2 each). It sucks.

Gratuitous picture: tree on stilts (Photo: Luis, click to enlarge).

Collecting results of the New Zealand General Elections

I was reading an article about the results of our latest elections and having a look at the spatial pattern of votes in my city.

I was wondering how I would go about obtaining the data for something like that, and went to the Electoral Commission website, which has this neat page with links to CSV files with results at the voting place level. The CSV files have results for each of the candidates in the first few rows (which I didn’t care about) and at the party level later in the file.

As I saw it, I needed to:

  1. Read the Electoral Commission website and extract the table that contains the links to all CSV files.
  2. Read each of the files and i) extract the electorate name, ii) skip all the candidate votes, and iii) read the party vote.
  3. Remove sub-totals and other junk from the files.
  4. Geocode the addresses.
  5. Use the data for whatever else I wanted (exam question anyone?).
New Zealand Electoral Commission results website. It held up really well on election night.

So I first loaded the needed packages and read the list of CSV files:
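Something like the following, where the URL is a placeholder for the Electoral Commission results page and I simply grab every link ending in .csv:

```r
library(tidyverse)
library(rvest)

# Placeholder URL for the preliminary results page
results_url <- 'http://www.electionresults.govt.nz/electionresults_2017_preliminary/'

# Grab the address of every link on the page that points to a CSV file
csv_links <- read_html(results_url) %>%
  html_nodes('a') %>%
  html_attr('href') %>%
  str_subset('\\.csv$')
```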

Then I wrote a couple of functions to, first, read the whole file and get the electorate name and, second, detect where the party vote starts and keep the file from that line onwards. Rather than explicitly looping over the list of CSV file names, I used map_dfr from the purrr package to extract the data and join all the results by row.
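A rough reconstruction of the idea; the exact layout of the files, and therefore the marker used to find the party vote, is an assumption here:

```r
# Hypothetical layout: electorate name in the first row, candidate votes next,
# and a marker row where the party vote starts
get_electorate <- function(file) {
  read_csv(file, col_names = FALSE, n_max = 1)$X1[1]
}

read_party_votes <- function(file) {
  all_lines <- read_lines(file)
  start <- grep('Party Vote', all_lines)[1]   # assumed marker row
  read_csv(file, skip = start - 1) %>%
    mutate(electorate = get_electorate(file))
}

# Apply the function to every file and bind all the results by row
party_votes <- map_dfr(csv_links, read_party_votes)
```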

Cleaning the data and summarising by voting place (as one can vote for several electorates in a single place) is fairly straightforward. I appended the string Mobile to mobile teams that visited places like retirement homes, hospitals, prisons, etc:
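In the spirit of the sketch below, where the column names (voting_place, party, votes) and the way mobile teams are detected are assumptions:

```r
# Drop sub-totals, flag mobile teams and summarise by voting place
party_results <- party_votes %>%
  filter(!str_detect(voting_place, 'Total')) %>%
  mutate(voting_place = if_else(str_detect(voting_place, 'Hospital|Rest Home|Prison'),
                                paste('Mobile', voting_place),
                                voting_place)) %>%
  group_by(voting_place, party) %>%    # one place can serve several electorates
  summarise(votes = sum(votes, na.rm = TRUE))
```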

Geolocation is the not-working-very-well part right now. First, I had problems with Google (beyond the 1,000 places limit for the query). Then I went for using the Data Science Kit as the source but, even excluding the mobile places, it was a bit hit and miss for geolocation, particularly as the format of some addresses (like corner of X and Y) is not the best for a search.

In addition, either of the two sources for geolocation works really slowly and may produce a lot of output. Using sink() could be a good idea to avoid ending up with output for roughly 3,000 queries. I did try the mutate_geocode() function, but it didn’t work out properly.
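Roughly the idea (not my actual script), using ggmap and geocoding only the distinct places:

```r
library(ggmap)

# Geocode each distinct voting place, diverting the chatty output to a file
places <- party_results %>% ungroup() %>% distinct(voting_place)

sink('geocoding_log.txt')
place_coords <- geocode(places$voting_place, source = 'dsk')
sink()

places <- bind_cols(places, place_coords)
```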

David Robinson was kind enough to help me with the last line of the script, although he later updated his advice. Given the size of my dataset, either option took bugger all time, although I have to say that his updated version looks prettier.

Once the data are geolocated, creating a visualisation is not so hard. Even old dogs can find their way to do that!

Where are New Zealand’s bellwether electorates?

I was reading a piece by Graeme Edgeler who, near the end, asked “Where are New Zealand’s bellwether electorates?”. I didn’t know where the data came from or how the “index of disproportionality for each electorate” was calculated, but I saw it mostly as an opportunity to whip up some quick code to practice the use of R and look at other packages that play well with the tidyverse.

The task can be described as: fetch Wikipedia page with results of the 2014 parliamentary election, extract the table with results by electorate, calculate some form of deviation from the national results, get the top X electorates with lowest deviation from national results.

A web search revealed that this page contains a whole bunch of results for the 2014 election and that the specific results I’m interested in are in table number 17 of the list created by html_nodes('table'). Besides the tidyverse, I needed the packages rvest for web scraping, magrittr for using %<>% (pipe and assign to original data frame) and lucid for pretty printing the final table.

Rather than reading the national results directly from Wikipedia I just typed them in code, as I already had them from some other stuff I was working on. My measure of “disproportionality for each electorate” was as sophisticated as the sum of squared deviations.
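Putting those pieces together, a sketch like the one below captures the idea; the Wikipedia URL, the column names in the scraped table and the national percentages (typed in from memory, illustrative only) are all assumptions, so double-check them before trusting the output:

```r
library(tidyverse)
library(rvest)
library(magrittr)
library(lucid)

# Table number 17 in the page holds the party vote by electorate
wiki_page <- read_html('https://en.wikipedia.org/wiki/Results_of_the_2014_New_Zealand_general_election')
by_electorate <- wiki_page %>% html_nodes('table') %>% extract2(17) %>% html_table()

# National party-vote percentages, typed in rather than scraped (illustrative)
national <- c(National = 47.04, Labour = 25.13, Green = 10.70, `NZ First` = 8.66)

# Sum of squared deviations from the national result, assuming one column per
# party with names matching the vector above
by_electorate %<>%
  mutate(sq_dev = apply(select(., names(national)), 1,
                        function(x) sum((x - national)^2)))

by_electorate %>% arrange(sq_dev) %>% head(10) %>% lucid()
```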

I’m sure there must be a ‘more idiomatic’ way of doing the squared deviation using the tidyverse. At the same time, using apply came naturally to me when writing the code, so I opted for keeping it and not interrupting the coding flow. The results are pretty similar to the ones presented by Graeme in his piece.

I’m getting increasingly comfortable with this mestizo approach of using the tidyverse and base R for completing tasks. Whatever it takes to express what I need to achieve quickly and more or less in a readable way.

Newton meditating on how far down the list is the Wigram electorate: not quite bellwether (Photo: Luis. Click to enlarge).

Functions with multiple results in tidyverse

I have continued playing with the tidyverse for different parts of a couple of projects.

Often I need to apply a function by groups of observations; sometimes, that function returns more than a single number. It could be something like ‘for each group, fit a distribution and return the distribution parameters’. Or, simpler for the purposes of this exploration, calculate and return a bunch of numbers.

If I have a data frame called field_data, with family codes (trees with the same parents, codes have been changed to protect the innocent) and stem diameters (in mm), I could do the following in base R:
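For example, something like this, where the variable names family_code and stem_diameter are placeholders for whatever the real data frame uses:

```r
# Toy function returning several summaries at once; aggregate() stores the
# result as a matrix column
summarise_stems <- function(x) {
  c(n = length(x), mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE), max = max(x, na.rm = TRUE))
}

summary_one <- aggregate(stem_diameter ~ family_code, data = field_data,
                         FUN = summarise_stems)
```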

And if I need to do this for several variables, I will need to merge each of these matrices into a data frame.

Mobile phone antenna in church (Photo: Luis, click to enlarge).

Continuing with my experimentation with the tidyverse, I was wondering how to get the above going with dplyr et al. After failing a few times I asked the question on Twitter and got a number of helpful replies.

One of the keys is that dplyr can store a list result from a function. Modifying my toy function is pretty straightforward, and now looks like:
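Something like this, keeping the placeholder names from the base R sketch:

```r
library(tidyverse)

# Same toy function, now wrapping the four values in a list so that dplyr
# can store them in a single column
summarise_stems <- function(x) {
  list(c(n = length(x), mean = mean(x, na.rm = TRUE),
         sd = sd(x, na.rm = TRUE), max = max(x, na.rm = TRUE)))
}

summary_two <- field_data %>%
  group_by(family_code) %>%
  summarise(summ = summarise_stems(stem_diameter))
```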

And we can check the contents of summary_two to see we have a list in which each element contains 4 values:
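With the names from my sketch above:

```r
summary_two

# Each element of the list column holds the four named values
summary_two$summ[[1]]
```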

We still need to extract the values from each element of the list and assign them to a variable name. Using map from the purrr package is pretty straightforward in this case, and we can extract the values either using their names or their position in the element.
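For instance, extracting three values by name and one by position (continuing the sketch above):

```r
summary_three <- summary_two %>%
  mutate(n    = map_dbl(summ, 'n'),
         mean = map_dbl(summ, 'mean'),
         sd   = map_dbl(summ, 'sd'),
         max  = map_dbl(summ, 4))   # fourth element, extracted by position
```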

I’m still playing with ideas to be lazier at extraction time. An almost abhorrent idea is to provide the output as character for later type conversion, as in:
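For instance, a version of the toy function that pastes everything into one string:

```r
# Collapse the four values into a single character string
summarise_stems_chr <- function(x) {
  paste(length(x), mean(x, na.rm = TRUE),
        sd(x, na.rm = TRUE), max(x, na.rm = TRUE))
}
```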

And we can get all the way there with:
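Keeping the same placeholder names:

```r
summary_four <- field_data %>%
  group_by(family_code) %>%
  summarise(summ = summarise_stems_chr(stem_diameter)) %>%
  separate(summ, into = c('n', 'mean', 'sd', 'max'),
           sep = ' ', convert = TRUE)   # convert = TRUE does the type conversion
```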

Which I assume has all sorts of potential negative side-effects, but looks really cool.

In case you want to play with the problem, here is a tiny example of field data.

Turtles all the way down

One of the main uses for R is for exploration and learning. Let’s say that I wanted to learn simple linear regression (the bread and butter of statistics) and see how the formulas work. I could simulate a simple example and fit the regression with R:
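Something along these lines, with the coefficients and sample size picked arbitrarily:

```r
# Simulate a simple linear relationship and fit it
set.seed(42)
n <- 20
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)

toy <- lm(y ~ x)
summary(toy)

plot(x, y)
abline(toy)
```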

Your typical toy problem.

The formulas for the intercept (\(b_0\)) and the slope (\(b_1\)) are pretty simple, and I have been told that there is a generic expression that instead uses matrices.

\(b_1 = \frac{\sum{x y} - n \bar{x} \bar{y}}{\sum{x^2} - n \bar{x}^2}\)
\(b_0 = \bar{y} - b_1 \bar{x}\)

\(\boldsymbol{b} = (\boldsymbol{X}'\boldsymbol{X})^{-1} \boldsymbol{X}'\boldsymbol{y}\)

How do the contents of the matrices and the simple formulas relate to each other?

Funnily enough, looking at the matrices we can see similar sums of squares and crossproducts as in the formulas.
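Using the simulated data from above:

```r
# Design matrix: a column of ones (intercept) and the predictor
X <- cbind(1, x)

t(X) %*% X   # contains n, sum(x) and sum(x^2)
t(X) %*% y   # contains sum(y) and sum(x * y)

# Solving the normal equations reproduces the lm() coefficients
solve(t(X) %*% X) %*% t(X) %*% y
coef(toy)
```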

But I have been told that R (as most statistical software) doesn’t use the inverse of the matrix for estimating the coefficients. So how does it work?

Trees in the fog (Photo: Luis, click to enlarge).

If I type lm, R will print the code of the lm() function. A quick look will reveal that there is a lot of code reading the arguments and checking that everything is OK before proceeding. However, the function then calls something else: lm.fit(). With some trepidation I type lm.fit, which again performs more checks and then calls something with a different notation:
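From memory, the relevant line looks something like this (check your own installation, as details change between R versions):

```r
z <- .Call(C_Cdqrls, x, y, tol, FALSE)
```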

This denotes a call to a C language function which, after some searching in Google, we find in readable form in the lm.c file. Another quick look brings more checking and then a call to Fortran code, a highly tuned routine for QR decomposition in a linear algebra library. By now we know that the general matrix expression produces the same results as our initial formulas, and that the R lm() function does not use a matrix inverse but a QR decomposition to solve the system of equations.

One of the beauties of R is that it brought the power of statistical computing to the masses, by not only letting you fit models but also letting you have a peek at how things are implemented. As a user, I don’t need to know that there is a chain of function calls initiated by my bread-and-butter linear regression. But it is comforting to the nerdy me that I can have a quick look at it.

All this for free, which sounds like a very good deal to me.

Old dog and the tidyverse

I started using R ages ago and have happily lived in mostly-base-R for data manipulation. Once in a while I move to something that makes a big difference, like ggplot2 in 2010 or Rmarkdown in 2015, but the set of packages I use for data + plotting hasn’t seen many changes. I have to confess that, meanwhile, I have tested quite a few approaches on the analytics side of things (last year was the turn of Bayesian for me).

Last week, I decided to learn more about the tidyverse, thinking of using it more with forestry postgrad students. Now, there is no lack of tutorials, reviews, documentation, etc. for the tidyverse, but most writing shows a final version of the code, without exposing the thinking and dead ends that go behind it. In this post I show how my code was changing, both after reading a few pieces of documentation and, mostly, from feedback I got from Hadley Wickham and Michael MacAskill via this Kiwi Twitter thread. This post shows minor differences in variable names from that thread, as I changed a few things while reading the files.

Trees in a carpark (Photo: Luis, click to enlarge).

The problem is as follows: I have two data frames with trial assessments. Frame one, called early, covers trees at ages 5, 7 and 8 years (although ages are in months rather than years). Frame two, called late, covers trees at age 20 years. Yes, it takes a while working with trees.

We want to keep only age 8 years (96 months) from early and want to split a code into two variables, as well as convert a variable from numeric to character. In late we want to create a tree family code, based on a set of rules to connect field codes to the pedigree of trees. Did I mention that I work breeding trees?

Finally, we want to merge all the assessments from age 8 with the assessment at age 20 for the same trees.

Rather than showing the final version of the code, it is much more interesting to show its evolution, also including how I would have done this in base R. I’m omitting the reading of the file and boring case conversion of variable names, etc.

In base R, I would probably do something like this (I’m using the stringr package just to make my life easier):
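A sketch only; the variable names (age_months, set_code, tree, family) are placeholders for whatever the real files use:

```r
library(stringr)

# Keep age 8 (96 months), split the set code and create the two new variables
early_base <- early[early$age_months == 96, ]
early_base$rep    <- str_sub(early_base$set_code, 1, 2)
early_base$sets   <- str_sub(early_base$set_code, 3, 4)
early_base$tree   <- as.character(early_base$tree)
early_base$family <- early_base$family + 1000000
```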

My first approach to dealing with the early frame with the tidyverse looked like:
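Roughly like this, with the same placeholder names and one mutate() per step:

```r
library(tidyverse)

early_tidy <- early %>%
  filter(age_months == 96) %>%
  mutate(rep = map(set_code, ~ str_sub(.x, 1, 2))) %>%
  mutate(sets = map(set_code, ~ str_sub(.x, 3, 4))) %>%
  mutate(tree = as.character(tree)) %>%
  mutate(family = family + 1000000)
```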

While the second frame was processed using:
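Along these lines; the field-code rules here are entirely made up, just to show the nested ifelse structure:

```r
late_tidy <- late %>%
  mutate(family = ifelse(field_code < 500, field_code + 1000000,
                  ifelse(field_code < 800, field_code + 2000000, NA)))
```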

I used multiple instances of mutate because I thought it would be easier to read. The use of map instead of sapply is cool, particularly when one starts looking at more advanced features. Comments from the crowd on Twitter: mutate the whole lot in a single statement (although Hadley pointed out that there was no performance penalty in using them separately) and try using case_when to make nested ifelse easier to understand. The comments on map went in two directions: either use map_chr as a clearer and safer alternative, or just use tidyr’s separate function. The first option for early would look like:
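A sketch of that option, with the same placeholders:

```r
early_tidy <- early %>%
  filter(age_months == 96) %>%
  mutate(rep  = map_chr(set_code, ~ str_sub(.x, 1, 2)),
         sets = map_chr(set_code, ~ str_sub(.x, 3, 4)),
         tree = as.character(tree),
         family = family + 1000000)
```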

However, I ended up going with the final version that used separate, which is easier on the eye and faster, for a final version that looks like this:
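Something like this, still with my placeholder names, and using the right assign I mention below:

```r
early %>%
  filter(age_months == 96) %>%
  separate(set_code, into = c('rep', 'sets'), sep = 2) %>%
  mutate(tree = as.character(tree),
         family = family + 1000000) -> early_clean
```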

So we filter early, separate the single set code into two variables (rep and sets) and create a couple of variables using mutate (one is a simple type conversion to character, while the other is a code starting at 1,000,000).

In the case of late, I ended up with:
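Again with made-up rules, but showing the case_when() structure:

```r
late %>%
  mutate(family = case_when(
    field_code < 500 ~ field_code + 1000000,
    field_code < 800 ~ field_code + 2000000,
    TRUE             ~ NA_real_
  )) -> late_clean
```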

And we merge the files using:
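A sketch, assuming the two frames share a tree identifier (plus rep and sets) as the join key:

```r
# Join keys are an assumption; any unique tree identifier in both frames works
left_join(early_clean, late_clean, by = c('rep', 'sets', 'tree')) -> trees
```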

Some (many, most?) people may disagree with my use of right assign, which I love. Surely one could use either left assign or %<>% from the magrittr package. By the way, why do I have to explicitly load magrittr (instead of relying on the tidyverse) to access %<>%?

And this is how I go about learning new things: lots of false starts, often working with small examples (I used a few to check how left_join was working), lots of searching for explanations/tutorials (thanks to everyone who has written them) and asking on Twitter. If you are just starting programming, in any language, do not feel intimidated by cool-looking code; most of the time it took many iterations to get it looking like that.

Why you shouldn’t entrust your pet to Glenstar Kennels

Travel is part of life and if you have pets, finding appropriate boarding for them is a must. This is my explanation for why you should not entrust your dog to Glenstar Kennels, in Canterbury, New Zealand.

At the end of 2016 I had work approval for a two-month trip overseas. Normally I would book accommodation for my dog at the SPCA boarding kennels (as when we had two months of repairs to our house following Christchurch’s earthquake). However, as this trip included Christmas/New Year, it was impossible to find a vacancy. I was happy to find a spot for my dog at Glenstar Kennels spanning the whole end-of-year period.

Sadly, after 2 months travelling I got an unpleasant surprise when I went to pick up my dog. He was almost 5 kg overweight, which for a 27 kg dog is close to 20% weight gain in 2 months. As an illustration, imagine if you were 75 kg and gained 15 kg in only 2 months.

I immediately wrote to the owner of Glenstar Kennels, who stated that “Whilst I do agree he has put on weight had he fed a cheap food and lost weight in the kennels I feel you would be more upset”. Well, a dog becoming overweight is not any better than a dog losing weight! Both situations lead to reduced animal lifespan. While I believe the quality of the food provided was probably appropriate, the combination of physical activity and food quantity was clearly inappropriate.

The New Zealand Animal Welfare Act 1999 and the Code of Welfare for Dogs 2010 (administered by the Ministry for Primary Industries) establish a series of minimum standards for dog care:

  1. Dogs need a balanced daily diet in quantities that meet their requirements for health and welfare and to maintain their ideal bodyweight.
  2. The amount of food offered needs to be increased if a dog is losing condition, or decreased if it is becoming overweight.
  3. The code of welfare for dogs applies in all situations, including temporary housing such as shelters, doggy daycares or day boarding facilities, and kennels.

According to the schedule sent to me by Glenstar Kennels, there were 12 hours of contact a day when someone had access to my dog, and 65 days to figure out that he was becoming overweight and take appropriate action. The photo below shows my dog’s increased girth: I tried to fit his harness as per his size when I dropped him off at Glenstar’s facility on 9th December, against his size on 12th February (after gaining 5 kg). There is a dramatic difference, well explained by the term negligence.

Poor doggio showing his change of girth after two months in Glenstar Kennels.

I am very unhappy with the level of care provided by Glenstar Kennels and the lack of a satisfactory reply to my written complaints. After 2 months away I had to take my dog to his usual veterinarian to discuss his excess weight, with the associated cost, so I could bring him back to good health. Dog and I have been doing our usual daily walks (as we did before the trip), and I have been very careful about his nutrition, so he can slowly go back to his optimal 27 kg.

Unfortunately, there is no compulsory regulatory body for dog boarding kennels that could enforce the Code of Welfare for Dogs. However, I feel that I have to write this review and make Glenstar Kennels’ negligence public, so other dog owners (and potential customers) are aware of their extremely poor service.

P.S. The owners of Glenstar Kennels also have another company: Star Pet Travel for pet relocations. They use Glenstar Kennels for their temporary accommodation. I wouldn’t use them either.