Reducing friction in R to avoid Excel

When you have students working on a project there is always an element of quality control. Sometimes the results just make sense, while other times we suspect something has gone wrong. This means going back to check the whole analysis process: can we retrace all the steps in a calculation (going back to data collection) and see if there is anything funny going on? So we sat down with the student and started running code (in RStudio, of course) and I noticed something interesting: there was a lot of redundancy, with pieces of code that didn't do anything or were weirdly placed. These are typical signs of code copied from several sources, which together with the presence of setwd() showed unfamiliarity with R and RStudio (we have a mix of students with a broad range of R skills).

But the part that really caught my eye was that the script read many Near Infrared spectra files, column-bound them together with the sample ID (which was 4 numbers separated by hyphens) and saved the resulting 45 MB data set to a CSV file. Then the student opened the file in Excel, split the sample ID into 4 columns, deleted the top row, saved the file and read it again into R to continue the process.

The friction point which forced the student to drop to Excel—the first of many not easily reproducible parts—was variable splitting. The loop for reading the files and some condition testing was hard to follow too. If one knows R well, any of these steps is relatively simple, but if one doesn’t know it, the copy and pasting from many different sources begins, often with inconsistent programming approaches.

Here is where I think the tidyverse brings something important to the table: consistency, more meaningful naming of functions and good documentation. For example, doing:
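something along these lines (a sketch assuming tidyr's separate(); the data frame spectra and its sample_id column are hypothetical names)

[sourcecode lang="r"]
library(tidyr)

# Split the hyphen-separated sample ID into four columns
spectra <- separate(spectra, sample_id,
                    into = c('id1', 'id2', 'id3', 'id4'),
                    sep = '-')
[/sourcecode]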

is probably the easiest way of dealing with separating the contents of a single variable.

When working with several collaborators (colleagues, students, etc) the easiest way to reduce friction is to convince/drag/supplicate everyone to adopt a common language. Within the R world, the tidyverse is the closest thing we have to a lingua franca of research collaboration. ‘But isn’t R a lingua franca already?’ you may ask. The problem is that programming in base R is often too weird for normal people, and too many people just give up before feeling they can do anything useful in R (particularly if they are proficient in Excel).

Even if you are an old dog (like me) I think it pays to change to a subset of R that is more learnable. And once someone gets hooked, the transition to adding non-tidyverse functions is more bearable.

View of my desk in research mode, including friendly robot (Photo: Luis, click to enlarge).

Calculating parliament seats allocation and quotients

I was having a conversation about dropping the minimum threshold (currently 5% of the vote) for political parties to get representation in Parliament. The obvious question is how seat allocation would change, which of course involved a calculation. There is a calculator on the Electoral Commission website, but trying to understand how things work (and therefore coding it) is my thing, and the Electoral Commission has a handy explanation of the Sainte-Laguë allocation formula used in New Zealand. So I had to write my own seat allocation function:
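A minimal sketch of such a function (the name allocate_seats, its arguments and the 120-seat default are my own choices): under Sainte-Laguë, each party's votes are divided by the odd numbers 1, 3, 5, … and the seats go to the largest resulting quotients.

[sourcecode lang="r"]
# Sainte-Laguë: divide each party's votes by 1, 3, 5, ... and
# allocate the seats to the highest quotients
allocate_seats <- function(votes, parties, n_seats = 120) {
  divisors <- seq(1, 2 * n_seats, by = 2)
  quotients <- data.frame(
    party = rep(parties, each = n_seats),
    quotient = as.vector(sapply(votes, function(v) v / divisors))
  )
  # The n_seats largest quotients each win a seat
  winners <- quotients[order(-quotients$quotient), ][1:n_seats, ]
  table(winners$party)
}
[/sourcecode]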

Testing it with the preliminary election results (that is, not including special votes) gives:

In our current setup, The Opportunities and Māori parties did not reach the minimum threshold (nor did they win an electorate seat, as ACT did while violating the spirit of the system), so they did not get any seats. The 4 seats that would have gone to minor parties under no threshold ended up going to National and Labour (2 each). It sucks.

Gratuitous picture: tree on stilts (Photo: Luis, click to enlarge).

Collecting results of the New Zealand General Elections

I was reading an article about the results of our latest elections and having a look at the spatial pattern of votes in my city.

I was wondering how I would go about obtaining the data for something like that, and went to the Electoral Commission website, which has this neat page with links to CSV files with results at the voting-place level. The CSV files have results for each of the candidates in the first few rows (which I didn't care about) and at the party level later in the file.

As I saw it, I needed to:

  1. Read the Electoral Commission website and extract the table that contains the links to all the CSV files.
  2. Read each of the files, i) extracting the electorate name, ii) skipping all the candidate votes, and iii) reading the party vote.
  3. Remove sub-totals and other junk from the files.
  4. Geocode the addresses.
  5. Use the data for whatever else I wanted (exam question, anyone?).
New Zealand Electoral Commission results website. It held up really well on election night.

So I first loaded the needed packages and read the list of CSV files:
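A sketch of that step, assuming rvest for the scraping (the URL below is a placeholder, not the actual results page):

[sourcecode lang="r"]
library(tidyverse)
library(stringr)
library(rvest)

# Placeholder for the Electoral Commission results page
results_page <- 'http://example.org/2017-preliminary-results/'

# Grab all the links in the page and keep only the ones pointing to CSV files
csv_links <- read_html(results_page) %>%
  html_nodes('a') %>%
  html_attr('href') %>%
  str_subset('\\.csv$')
[/sourcecode]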

Then I wrote a couple of functions to, first, read the whole file and get the electorate name and, second, detect where the party vote starts so as to keep from that line onwards. Rather than explicitly looping over the list of CSV file names, I used map_dfr from the purrr package to extract the data and join all the results by row.
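Something along these lines, continuing from the previous chunk (a sketch only: the location of the electorate name, the marker for the party-vote block and the column names are assumptions about the file format):

[sourcecode lang="r"]
# Read one results file: grab the electorate name, skip the candidate
# votes and keep the party-vote block
read_party_votes <- function(file) {
  raw <- read_lines(file)
  electorate <- raw[2]                          # assumed location of the name
  start <- str_which(raw, 'Party Votes')[1]     # assumed marker for the block
  read_csv(paste(raw[(start + 1):length(raw)], collapse = '\n'),
           col_names = c('voting_place', 'party', 'votes')) %>%
    mutate(electorate = electorate)
}

# Read every file and bind all the results by row
party_votes <- map_dfr(csv_links, read_party_votes)
[/sourcecode]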

Cleaning the data and summarising by voting place (as one can vote for several electorates in a single place) is fairly straightforward. I appended the string Mobile to the names of mobile teams that visited places like retirement homes, hospitals, prisons, etc.:
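For example (a sketch using the column names from the previous chunk, and assuming mobile teams can be spotted by the word 'Team' in the voting-place name):

[sourcecode lang="r"]
place_votes <- party_votes %>%
  filter(!str_detect(party, 'Total'),          # drop sub-totals and junk
         !is.na(voting_place)) %>%
  mutate(voting_place = if_else(str_detect(voting_place, 'Team'),
                                paste(voting_place, 'Mobile'),
                                voting_place)) %>%
  group_by(voting_place, party) %>%
  summarise(votes = sum(votes)) %>%
  ungroup()
[/sourcecode]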

Geolocation is the not-working-very-well part right now. First, I had problems with Google (beyond the 1,000-place limit for queries). Then I went for using the Data Science Kit as the source but, even excluding the mobile places, it was a bit hit and miss for geolocation, particularly as the format of some addresses (like corner of X and Y) is not the best for a search.

In addition, either of the two sources for geolocation works really slowly and may produce a lot of output. Using sink() could be a good idea to avoid ending up with output for roughly 3,000 queries. I did try the mutate_geocode() function, but it didn't work out properly.

David Robinson was kind enough to help me with the last line of the script, although he later updated the advice to:

Given the size of my dataset, either option took bugger all time, although I have to say that

looks prettier.

Once the data are geolocated, creating a visualisation is not so hard. Even old dogs can find their way to do that!

Learning to code in R

It used to be that one of the first decisions to make when learning to program was between compiled (e.g. C or FORTRAN) and interpreted (e.g. Python) languages. In my opinion, these days one would have to be a masochist to learn with a compiled language: the extra compilation time and obscure errors are a killer when learning.

Today the decision would be between using a generic interpreted language (e.g. Python) and an interpreted domain specific language (DSL) like R, MATLAB, etc. While some people prefer generic languages, I’d argue that immediate feedback and easy accomplishment of useful tasks are a great thing when one is learning something for the first time.

As an example, a while ago my son asked me what I was doing on the computer and I told him that I was programming some analyses in R. I showed him that the program was spitting back some numbers and plots, a fact that he found totally unremarkable and uninteresting. I searched the internet and found Scratch, a visual programming language that lets the user move blocks representing code around and build interactive games: now my son was sold. Together we are learning about loops, control structures and variables, drawing characters, etc. We are programming because the problems are i) much more interesting for him and ii) achievable in a short time frame.

An example Scratch script.

Learning to program for statistics, or other scientific domains for that matter, is not that different from being a kid and learning programming. Having to do too much to get even a mildly interesting result is frustrating and discouraging; it is not that the learner is dumb, but that he has to build too many functions to get a meager reward. This is why I’d say that you should use whatever language already has a large amount of functionality (‘batteries included’ in Python parlance) for your discipline. Choose rightly and you are half-way there.

'But', someone will say, 'R is not a real language'. Sorry, but it is a real language (Turing complete and the whole shebang), with oddities, granted, as in any other language. As with human languages, the more of them you study the easier it gets to learn a new one. In fact, the syntax for many basic constructs in R is highly similar to the alternatives:
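Here is a deliberately verbose R version of one such construct, building a vector with the numbers 1 to 10 (a sketch only; the P.S. below explains how it could be shortened):

[sourcecode lang="r"]
# Accumulate the numbers 1 to 10 in a vector using a loop
numbers <- c()

for (i in 1:10) {
  numbers <- c(numbers, i)
}

# The equivalent Python loop (for i in range(1, 11): ...) reads almost
# the same; the building blocks are shared across languages.
[/sourcecode]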

By the way, I formatted the code to highlight similarities. Of course there are plenty of differences between the languages, but many of the building blocks (the core ideas if you will) are shared. You learn one and the next one gets easier.

How does one start learning to program? If I look back at 1985 (yes, last millennium) I was struggling to learn how to program in BASIC when, suddenly, I had an epiphany. The sky opened and I heard a choir singing György Ligeti's Atmosphères (from 2001: A Space Odyssey, you know) and then I realized: we are only breaking problems into little pieces and dealing with them one at a time. I've been breaking problems into little pieces ever since. What else did you expect? Anyhow, if you feel that you want to learn how to code in R, or whatever language is the best option for your problems, start small. Just a few lines will do. Read other people's code but, again, only small pieces that are supposed to do something small. At this stage it is easy to get discouraged because everything is new and takes a lot of time. Don't worry: everyone has to go through this process.

Many people struggle with vectorizing their code so it runs faster. Again, don't worry at the beginning if your code is slow. Keep on writing and learning. Read Noam Ross's FasteR! HigheR! StrongeR! — A Guide to Speeding Up R Code for Busy People. The guide is very neat and useful, although you don't need super-fast code, not yet at least. You need working, readable and reusable code. You may not even need code: actually, try to understand the problem, and the statistical theory, before you start coding like crazy. Programming will help you understand your problem a lot better, but you need a helpful starting point.

Don't get distracted by the politics of research and repeatability and trendy things like git (notice that I didn't even link to them?). You'll learn them in time, once you have a clue about how to program.

P.S. The code used in the examples could be shortened and sped up dramatically (e.g. using 1:10 or range(1, 11)) but that is not the point of this post.
P.S.2. A while ago I wrote R is a language, which could be a useful companion to this post.

Doing Bayesian Data Analysis now in JAGS

Around Christmas time I presented my first impressions of Kruschke's Doing Bayesian Data Analysis. This is a very nice book, but one of its drawbacks was that part of the code used BUGS, which left Mac users like me stuck.

Kruschke has now made JAGS code available, so I am happy clappy and looking forward to testing this New Year's present. In addition, there are other updates available for the programs included in the book.

R pitfall #3: friggin’ factors

I received an email from one of my students expressing deep frustration with a seemingly simple problem. He had a factor containing the names of potato lines and wanted to set some levels to NA. Using simple letters as example names, he was baffled by the result of the following code:
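Something like this (a reconstruction with made-up line names, not his actual code):

[sourcecode lang="r"]
# A factor with four (made-up) line names; we want C and D set to NA
lines <- factor(c('A', 'B', 'C', 'D'))
linesNA <- ifelse(lines %in% c('C', 'D'), NA, lines)
linesNA
# [1]  1  2 NA NA

# Even forcing the result back to a factor loses the level names
factor(linesNA)
# [1] 1    2    <NA> <NA>
# Levels: 1 2
[/sourcecode]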

The factor was converted to numeric and there is no trace of the level names. Even forcing the conversion back to a factor loses the level names. Newbie frustration guaranteed!

Under the hood, factors are numerical vectors (of class factor) that have associated character vectors describing the levels (see Patrick Burns's R Inferno PDF for details). We can deal directly with the levels using this:
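For example, continuing the sketch above:

[sourcecode lang="r"]
# Work on the levels rather than on the individual values
linesNA <- lines
levels(linesNA)[levels(linesNA) %in% c('C', 'D')] <- NA
linesNA
# [1] A    B    <NA> <NA>
# Levels: A B
[/sourcecode]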

We could operate directly on lines (without creating linesNA, which is there only to maintain consistency with the previous code). Another way of doing the same would be:
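For example (again a sketch):

[sourcecode lang="r"]
# Set the unwanted values to NA and then drop the unused levels
linesNA <- lines
linesNA[linesNA %in% c('C', 'D')] <- NA
linesNA <- droplevels(linesNA)
[/sourcecode]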

I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).

Lattice when modeling, ggplot when publishing

When working in research projects I tend to fit several, sometimes quite a few, alternative models. This model fitting is informed by theoretical considerations (e.g. quantitative genetics, experimental design we used, our understanding of the process under study, etc.) but also by visual inspection of the data. Trellis graphics—where subsets of data are plotted in different ‘panels’ defined by one or more factors—are extremely useful to generate research hypotheses.

There are two packages in R that have good support for trellis graphics: lattice and ggplot2. Lattice is the older of the two, while ggplot2 is probably more consistent (implementing a grammar of graphics) and more popular with the cool kids and the data visualization crowd. However, lattice is also quite fast, while ggplot2 can be slow as a dog (certainly way slower than my dog).

Tree-breeding progeny trials often have between 1,000 and 12,000 individuals, and analyses commonly include several trials. Thus, it is not unusual to have tens of thousands or even hundreds of thousands of records involved in the analysis. Add to this situation that I am impatient and you will understand that differences in speed can make a big difference to my mental health. But how different is the speed? We can simulate some correlated data (following the explanation in this post) and build a simple scatterplot faceted by site; let's say 60,000 observations in 6 sites (10,000 per site).

[sourcecode lang="r"]
library(lattice)
library(ggplot2)

# Number of observations to simulate
nobs = 60000
sites = 6

# Using a correlation matrix (let's assume that all variables
# have unit variance)
M = matrix(c(1, 0.7,
             0.7, 1), nrow = 2, ncol = 2)

# Cholesky decomposition
L = chol(M)
nvars = dim(L)[1]

# Random variables that follow an M correlation matrix
r = t(L) %*% matrix(rnorm(nvars*nobs), nrow = nvars, ncol = nobs)
r = t(r)

rdata = as.data.frame(r)
names(rdata) = c('x', 'y')
rdata$site = factor(rep(1:sites, each = nobs/sites))

# Plotting with lattice
xyplot(y ~ x | site, data = rdata,
       layout = c(3, 2), type = c('p', 'smooth'))

# Plotting with ggplot2
qplot(x, y, facets = ~ site,
      geom = c('point', 'smooth'),
      data = rdata)
[/sourcecode]

The timing was done by surrounding the graph calls (either xyplot() or qplot()) with system.time(print()), so the graph is sent to the screen and the operation is timed. In summary, for this simple call ggplot2 takes a bit over double the time of lattice. The more layers you add to the graph, the slower it gets.
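For example, the lattice timing call would look something like this (the ggplot2 version is analogous):

[sourcecode lang="r"]
# Time the lattice plot, forcing it to render with print()
system.time(print(
  xyplot(y ~ x | site, data = rdata,
         layout = c(3, 2), type = c('p', 'smooth'))
))
[/sourcecode]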

The two plots are below. We could improve both plots and make them look more similar to each other, but I want to avoid introducing more distractions in the code.

Nevertheless, I do like the flexibility of ggplot2, so I do most of my exploratory data analysis using lattice but, when I have to create the final pretty plots for publication in journals, I go back to ggplot2. I subscribe to Frank Harrell's Philosophy of Biostatistics, which includes 'use excellent graphics, liberally'. Switching between packages lets me deal with both the abundance of graphics and my impatience.

This is R pitfall #2: plots inside a function (and system.time() is a function) have to be surrounded by print() or they won’t be sent to the screen. Pitfall #1 is here.

Setting plots side by side

This is simple example code to display side-by-side lattice plots or ggplot2 plots, using the mtcars dataset that comes with any R installation. We will display a scatterplot of miles per US gallon (mpg) on car weight (wt) next to another scatterplot of the same data, but using different colors by number of engine cylinders (cyl, treated as factor) and adding a smooth line (under the type option).
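A sketch of the lattice version following that description (the plot object names are my own):

[sourcecode lang="r"]
library(lattice)

# Plain scatterplot of mpg on wt
plot1 <- xyplot(mpg ~ wt, data = mtcars)

# Same data, colored by number of cylinders, with a smooth line
plot2 <- xyplot(mpg ~ wt, groups = factor(cyl), data = mtcars,
                type = c('p', 'smooth'), auto.key = TRUE)

# First plot on the left half, second on the right half of the device
print(plot1, position = c(0, 0, 0.5, 1), more = TRUE)
print(plot2, position = c(0.5, 0, 1, 1))
[/sourcecode]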

According to the documentation, position is a vector of 4 numbers, typically c(xmin, ymin, xmax, ymax), that give the lower-left and upper-right corners of the rectangle in which the Trellis plot of x is to be positioned. The coordinate system for this rectangle is [0, 1] in both the x and y directions. That is, the first print() sets position to occupy the left part of the graph at full height, as well as avoiding refreshing the graph when displaying the new plot (more = TRUE). The second print() uses the right part of the graph at full height.

In the case of ggplot2, the code is not that different:
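One common way of doing this is with grid viewports (a sketch; the object names are my own):

[sourcecode lang="r"]
library(ggplot2)
library(grid)

# The same two scatterplots, this time with ggplot2
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() + geom_smooth(se = FALSE)

# Split the device in a 1 x 2 layout and print each plot in its own viewport
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
[/sourcecode]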

More details on ggplot’s notation can be found here.

Upgrading R (and packages)

I tend not to upgrade R very often—running from 6 months to 1 year behind in version numbers—because I have to reinstall all my packages: a real pain. A quick search shows that people have managed to come up with good solutions to this problem, as presented in this stackoverflow thread. I used the code on my Mac:
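The gist of the approach (a sketch, not the exact code from the thread; the file path is a placeholder):

[sourcecode lang="r"]
# Before upgrading: save the list of currently installed packages
packages <- installed.packages()[, 'Package']
save(packages, file = '~/installed_packages.Rdata')

# After upgrading R: reload the list and reinstall everything
load('~/installed_packages.Rdata')
install.packages(packages)
[/sourcecode]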

Of all the installed packages, I only had issues with 5 of them, which required installation from their respective websites: Acinonyx, INLA (and AnimalINLA) and asreml. Package graph is now available from bioconductor.org. INLA can be installed really easily from inside R (see below), while I did not bother downloading asreml again and just copied the folder from ~/Library/R/OldVersion/library/asreml to ~/Library/R/CurrentVersion/library/asreml.

Overall, it was a good upgrade experience, so thanks to the stackoverflow crowd for so many ideas on how to make R even nicer than it is.

P.S. 2010-10-14 Similar instructions, but including compiling R and installing Bioconductor.

Reading HTML pages in R for text processing

We were talking with one of my colleagues about doing some text analysis—which, by the way, I have never done before—for which the first issue is to get the text into R. Not just any text, but files that can be accessed through the internet. In summary, we need to access an HTML file, parse it so we can access specific content and then remove the HTML tags. Finally, we may want to replace some text (the end-of-line tags, for example) before continuing to process the files.

The package XML has the necessary functionality to deal with HTML, while the rest is done using a few standard R functions.
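Something along these lines (a sketch; the XPath query and the clean-up steps are illustrative):

[sourcecode lang="r"]
library(XML)

# Parse the HTML file (it could also be a URL)
page <- htmlParse('babel.html')

# Extract the text of the paragraphs, dropping the HTML tags
paragraphs <- xpathSApply(page, '//p', xmlValue)

# Replace end-of-line characters and collapse into a single string
babel <- gsub('\n', ' ', paste(paragraphs, collapse = ' '))
[/sourcecode]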

Incidentally, babel.html contains a translation of the short story ‘The Library of Babel’ by Jorge Luis Borges. Great story! We can repeat this process with several files and then create a corpus (and analyze it) using the tm package.