Learning to code in R

It used to be that one of the first decisions to make when learning to program was between compiled (e.g. C or FORTRAN) and interpreted (e.g. Python) languages. In my opinion these days one would have to be a masochist to learn with a compiled language: the extra compilation time and obscure errors are a killer when learning.

Today the decision would be between using a generic interpreted language (e.g. Python) and an interpreted domain-specific language (DSL) like R, MATLAB, etc. While some people prefer generic languages, I’d argue that immediate feedback and the easy accomplishment of useful tasks are great things when one is learning something for the first time.

As an example, a while ago my son asked me what I was doing on the computer and I told him that I was programming some analyses in R. I showed him that the program was spitting back some numbers and plots, a fact that he found totally unremarkable and uninteresting. I searched the internet and found Scratch, a visual programming language that lets the user move blocks representing code around and build interactive games: now my son was sold. Together we are learning about loops, control structures and variables, drawing characters, etc. We are programming because the problems are i) much more interesting for him and ii) achievable in a short time frame.

An example Scratch script.

Learning to program for statistics, or other scientific domains for that matter, is not that different from being a kid and learning to program. Having to do too much to get even a mildly interesting result is frustrating and discouraging; it is not that the learner is dumb, but that he has to build too many functions to get a meager reward. This is why I’d say that you should use whatever language already has a large amount of functionality (‘batteries included’ in Python parlance) for your discipline. Choose wisely and you are halfway there.

‘But’, someone will say, ‘R is not a real language’. Sorry, but it is a real language (Turing complete and the whole shebang), with oddities like any other language, granted. As with human languages, the more of them you study the easier it gets to learn a new one. In fact, the syntax for many basic constructs in R is remarkably similar to the alternatives:

# This is R code
# Loop
for(i in 1:10){
    print(i)
}
#[1] 1
#[1] 2
#[1] 3
#[1] 4
#[1] 5
#[1] 6
#[1] 7
#[1] 8
#[1] 9
#[1] 10
 
# Control 
if(i == 10) {
    print('It is ten!')
}
 
#[1] "It is ten!"
# This is Python code
# Loop
for i in range(1, 11): # range excludes the upper bound, so this gives 1 to 10
    print(i)
 
#1
#2
#3
#4
#5
#6
#7
#8
#9
#10
 
# Control
if i == 10:
    print("It is ten!")
 
#It is ten!

By the way, I formatted the code to highlight similarities. Of course there are plenty of differences between the languages, but many of the building blocks (the core ideas if you will) are shared. You learn one and the next one gets easier.

How does one start learning to program? If I look back at 1985 (yes, last millennium) I was struggling to learn how to program in BASIC when, suddenly, I had an epiphany. The sky opened and I heard a choir singing György Ligeti’s Atmosphères (from 2001: A Space Odyssey, you know) and then I realized: we are only breaking problems into little pieces and dealing with them one at a time. I’ve been breaking problems into little pieces ever since. What else did you expect? Anyhow, if you feel that you want to learn how to code in R, or whatever language is the best option for your problems, start small. Just a few lines will do. Read other people’s code but, again, only small pieces that are supposed to do something small. At this stage it is easy to get discouraged because everything is new and takes a lot of time. Don’t worry: everyone has to go through this process.

Many people struggle with vectorizing their code so it runs faster. Again, don’t worry if your code is slow at the beginning. Keep on writing and learning. Read Noam Ross’s FasteR! HigheR! StrongeR! — A Guide to Speeding Up R Code for Busy People. The guide is very neat and useful, although you don’t need super fast code, not yet at least. You need working, readable and reusable code. You may not even need code: actually, try to understand the problem and the statistical theory before you start coding like crazy. Programming will help you understand your problem a lot better, but you need a helpful starting point.
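To give a flavor of what vectorization means, here is a minimal sketch (the variable names are made up for this example):

# Computing squares with an explicit loop
squares = numeric(10)
for(i in 1:10){
    squares[i] = i^2
}

# The vectorized version: one operation over the whole vector
squares = (1:10)^2

Both produce the same result, but the second one expresses the idea in a single line and runs much faster on large vectors.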

Don’t get distracted by the politics of research and repeatability, or by trendy things like git (did you notice that I didn’t even link to them?). You’ll learn about them in time, once you have a clue about how to program.

P.S. The code used in the examples could be shortened and sped up dramatically (e.g. 1:10 or range(1, 11)) but that is not the point of this post.
P.S.2. A while ago I wrote R is a language, which could be useful connected to this post.

Doing Bayesian Data Analysis now in JAGS

Around Christmas time I presented my first impressions of Kruschke’s Doing Bayesian Data Analysis. This is a very nice book, but one of its drawbacks was that part of the code used BUGS, which left Mac users like me stuck.

Kruschke has now made JAGS code available, so I am happy clappy and looking forward to testing this New Year present. In addition, there are other updates available for the programs included in the book.
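For the curious, calling JAGS from R looks roughly like this. This is a minimal sketch using the rjags package with a made-up model and fake data, not Kruschke’s actual programs:

# A minimal sketch with rjags (assumes JAGS and the rjags
# package are installed); the model and data are made up
library(rjags)

# Estimating the mean of normal data; JAGS parameterizes
# the normal with a precision (tau), not a variance
model.text = 'model {
    for(i in 1:N) {
        y[i] ~ dnorm(mu, tau)
    }
    mu ~ dnorm(0, 1.0E-4)
    tau ~ dgamma(0.001, 0.001)
}'

y = rnorm(20, mean = 10, sd = 2)  # fake data

model = jags.model(textConnection(model.text),
                   data = list(y = y, N = length(y)),
                   n.chains = 3)
samples = coda.samples(model, variable.names = c('mu', 'tau'),
                       n.iter = 10000)
summary(samples)

Because JAGS is cross-platform, the same code runs on a Mac without the WINE contortions needed for BUGS.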

R pitfall #3: friggin’ factors

I received an email from one of my students expressing deep frustration with a seemingly simple problem. He had a factor containing the names of potato lines and wanted to set some levels to NA. Using simple letters as example names, he was baffled by the result of the following code:

[sourcecode lang="r"]
lines = factor(LETTERS)
lines
# [1] A B C D E F G H…
# Levels: A B C D E F G H…

linesNA = ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
linesNA
# [1] 1 2 NA 4 5 6 NA 8…
[/sourcecode]

The factor has been converted to numeric and there is no trace of the level names. Even forcing the conversion to be a factor loses the level names. Newbie frustration guaranteed!

[sourcecode lang="r"]
linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'), NA, lines))
linesNA
# [1] 1 2 <NA> 4 5 6 <NA> 8…
# Levels: 1 2 4 5 6 8…
[/sourcecode]

Under the hood factors are numerical vectors (of class factor) with associated character vectors that describe the levels (see Patrick Burns’s R Inferno PDF for details).
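A quick sketch to see this dual nature for yourself (the fruit names are made up for the example):

[sourcecode lang="r"]
fruit = factor(c('apple', 'pear', 'apple'))
as.integer(fruit)  # the underlying numeric codes
# [1] 1 2 1
levels(fruit)      # the character vector describing the levels
# [1] "apple" "pear"
[/sourcecode]

We can deal directly with the levels using this: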

[sourcecode lang="r"]
linesNA = lines
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] = NA
linesNA
# [1] A B <NA> D E F <NA> H…
# Levels: A B D E F H…
[/sourcecode]

We could have operated directly on lines (without creating linesNA, which is there only to maintain consistency with the previous code). Another way of doing the same would be:

[sourcecode lang="r"]
linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'),
                        NA, as.character(lines)))
linesNA
# [1] A B <NA> D E F <NA> H…
# Levels: A B D E F H…
[/sourcecode]

I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).

Lattice when modeling, ggplot when publishing

When working on research projects I tend to fit several, sometimes quite a few, alternative models. This model fitting is informed by theoretical considerations (e.g. quantitative genetics, the experimental design we used, our understanding of the process under study, etc.) but also by visual inspection of the data. Trellis graphics—where subsets of the data are plotted in different ‘panels’ defined by one or more factors—are extremely useful for generating research hypotheses.

There are two packages in R with good support for trellis graphics: lattice and ggplot2. Lattice is the older of the two, while ggplot2 is probably more consistent (implementing a grammar of graphics) and more popular with the cool kids and the data visualization crowd. However, lattice is also quite fast, while ggplot2 can be slow as a dog (certainly way slower than my dog).

Tree-breeding progeny trials often have between 1,000 and 12,000 individuals, and analyses commonly include several trials. Thus, it is not unusual to have tens of thousands or even hundreds of thousands of records involved in an analysis. Add to this the fact that I am impatient and you will understand that differences in speed can make a big difference to my mental health. But how different is the speed? We can simulate some correlated data (following the explanation in this post) and build a simple scatterplot faceted by site; let’s say 60,000 observations in 6 sites (10,000 per site).

[sourcecode lang="r"]
library(lattice)
library(ggplot2)

# Number of observations to simulate
nobs = 60000
sites = 6

# Using a correlation matrix (let's assume that all variables
# have unit variance)
M = matrix(c(1, 0.7,
             0.7, 1), nrow = 2, ncol = 2)

# Cholesky decomposition
L = chol(M)
nvars = dim(L)[1]

# Random variables that follow the M correlation matrix
r = t(L) %*% matrix(rnorm(nvars*nobs), nrow = nvars, ncol = nobs)
r = t(r)

rdata = as.data.frame(r)
names(rdata) = c('x', 'y')
rdata$site = factor(rep(1:sites, each = nobs/sites))

# Plotting with lattice
xyplot(y ~ x | site, data = rdata,
       layout = c(3, 2), type = c('p', 'smooth'))

# Plotting with ggplot2 (a one-sided facets formula
# uses facet_wrap behind the scenes)
qplot(x, y, facets = ~ site,
      geom = c('point', 'smooth'),
      data = rdata)
[/sourcecode]

The timing was done by surrounding the graph calls (either xyplot() or qplot()) with system.time(print()), so that the graph is sent to the screen and the whole operation is timed. In summary, for this simple call ggplot2 takes a bit more than twice as long as lattice, and the more layers you add to the graph the slower it gets.
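In case you want to reproduce the timing, the calls look something like this (the actual times will depend on your machine):

[sourcecode lang="r"]
# print() sends the graph to the screen, so rendering
# is included in what we time
system.time(print(xyplot(y ~ x | site, data = rdata,
                         layout = c(3, 2),
                         type = c('p', 'smooth'))))

system.time(print(qplot(x, y, facets = ~ site,
                        geom = c('point', 'smooth'),
                        data = rdata)))
[/sourcecode]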

The two plots are below. We could improve both plots and make them look more similar to each other, but I want to avoid introducing more distractions in the code.

Nevertheless, I do like the flexibility of ggplot2, so I do most of my exploratory data analysis with lattice, but when I have to create the final pretty plots for publication in journals I go back to ggplot2. I subscribe to Frank Harrell’s Philosophy of Biostatistics, which includes ‘use excellent graphics, liberally’. Switching between packages lets me deal with both the abundance of graphics and my impatience.

This is R pitfall #2: plots inside a function (and system.time() is a function) have to be surrounded by print() or they won’t be sent to the screen. Pitfall #1 is here.
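A quick sketch of the pitfall, using a made-up function and the rdata simulated above:

[sourcecode lang="r"]
# A made-up function to illustrate the pitfall
PlotSites = function(df) {
    xyplot(y ~ x | site, data = df)  # the object is created
    cat('Finished plotting\n')       # but never displayed
}

# Explicitly printing the plot solves the problem
PlotSites = function(df) {
    print(xyplot(y ~ x | site, data = df))  # sent to the screen
    cat('Finished plotting\n')
}
PlotSites(rdata)
[/sourcecode]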

Setting plots side by side

This is a simple code example for displaying lattice or ggplot2 plots side by side, using the mtcars dataset that comes with any R installation. We will display a scatterplot of miles per US gallon (mpg) against car weight (wt) next to another scatterplot of the same data, but using different colors for the number of engine cylinders (cyl, treated as a factor) and adding a smooth line (via the type option).

library(lattice)
data(mtcars)
 
# Create first plot and assign it to variable
p1 = xyplot(mpg ~ wt, data = mtcars, 
            xlab = 'Car weight', ylab = 'Mileage')
 
# Create second plot and assign it to variable
p2 = xyplot(mpg ~ wt, groups = factor(cyl), data = mtcars,
            xlab = 'Car weight', ylab = 'Miles/gallon', 
            type=c('p', 'smooth'))
 
# print calls print.lattice, which takes a position
# argument.
print(p1, position = c(0, 0, 0.5, 1), more = TRUE)
print(p2, position = c(0.5, 0, 1, 1))

According to the documentation, position is a vector of four numbers, typically c(xmin, ymin, xmax, ymax), giving the lower-left and upper-right corners of the rectangle in which the Trellis plot is to be positioned. The coordinate system for this rectangle is [0, 1] in both the x and y directions. That is, the first print() sets position to occupy the left half of the device at full height, and more = TRUE avoids starting a new page before displaying the next plot. The second print() uses the right half of the device at full height.
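Following the same coordinate convention, we could stack the two plots vertically instead:

# p1 on the top half, p2 on the bottom half of the device
print(p1, position = c(0, 0.5, 1, 1), more = TRUE)
print(p2, position = c(0, 0, 1, 0.5))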

In the case of ggplot2, the code is not that different:

library(grid)
library(ggplot2)
 
# Create first plot and assign it to variable
p1 = qplot(wt, mpg, data = mtcars,
           xlab = 'Car weight', ylab = 'Mileage')
 
# Create second plot and assign it to variable
p2 = qplot(wt, mpg, color = factor(cyl), data = mtcars,
           geom = c('point', 'smooth'),
           xlab = 'Car weight', ylab = 'Mileage')
 
# Define grid layout to locate plots and print each graph
pushViewport(viewport(layout = grid.layout(1, 2))) 
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1)) 
print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))

More details on ggplot’s notation can be found here.

Upgrading R (and packages)

I tend not to upgrade R very often—running from six months to a year behind in version numbers—because I have to reinstall all my packages: a real pain. A quick search shows that people have come up with good solutions to this problem, as presented in this stackoverflow thread. I used the code on my Mac:

# Run in the old installation (pre-upgrade)
# saving a list of installed packages
setwd('~/Downloads')
package.list = installed.packages()[,"Package"]
save(package.list, file='package-list.Rdata')
 
# Run in new install
setwd('~/Downloads')
load('package-list.Rdata')
for (p in setdiff(package.list, installed.packages()[,"Package"])) {
    install.packages(p)
}
 
# Final output
Warning messages:
1: In getDependencies(pkgs, dependencies, available, lib) :
  package ‘Acinonyx’ is not available (for R version 2.13.2)
2: In getDependencies(pkgs, dependencies, available, lib) :
  package ‘AnimalINLA’ is not available (for R version 2.13.2)
3: In getDependencies(pkgs, dependencies, available, lib) :
  package ‘asreml’ is not available (for R version 2.13.2)
4: In getDependencies(pkgs, dependencies, available, lib) :
  package ‘INLA’ is not available (for R version 2.13.2)
5: In getDependencies(pkgs, dependencies, available, lib) :
  package ‘graph’ is not available (for R version 2.13.2)

Of all the installed packages, I only had issues with five, which required installation from their respective websites: Acinonyx, INLA (and AnimalINLA) and asreml. Package graph is now available from bioconductor.org. INLA can be installed really easily from inside R (see below), while I did not bother downloading asreml again and just copied the folder from ~/Library/R/OldVersion/library/asreml to ~/Library/R/CurrentVersion/library/asreml.
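For the record, the copying can also be done from inside R; a quick sketch, where the version names in the paths are placeholders for your actual R versions:

# Copy the old asreml folder into the new library
# (the OldVersion/CurrentVersion paths are placeholders)
file.copy(from = '~/Library/R/OldVersion/library/asreml',
          to = '~/Library/R/CurrentVersion/library',
          recursive = TRUE)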

# Installing INLA
source("http://www.math.ntnu.no/inla/givemeINLA.R")
inla.upgrade(testing=TRUE)

Overall, it was a good upgrade experience, so thanks to the stackoverflow crowd for so many ideas on how to make R even nicer than it is.

P.S. 2011-10-14: Similar instructions, but including compiling R and installing Bioconductor.

Reading HTML pages in R for text processing

We were talking with one of my colleagues about doing some text analysis—which, by the way, I have never done before—where the first issue is to get the text into R. Not just any text, but files that can be accessed over the internet. In summary, we need to access an HTML file, parse it so we can reach specific content, and then remove the HTML tags. Finally, we may want to replace some text (the end-of-line characters, \n, for example) before continuing to process the files.

The package XML has the necessary functionality to deal with HTML, while the rest is done using a few standard R functions.

[sourcecode language="r"]
library(XML)

# Read and parse HTML file
doc.html = htmlTreeParse('http://apiolaza.net/babel.html',
                         useInternal = TRUE)

# Extract all the paragraphs (HTML tag is p, starting at
# the root of the document). unlist flattens the list to
# create a character vector.
doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))

# Replace all \n by spaces
doc.text = gsub('\\n', ' ', doc.text)

# Join all the elements of the character vector into a single
# character string, separated by spaces
doc.text = paste(doc.text, collapse = ' ')
[/sourcecode]

Incidentally, babel.html contains a translation of the short story ‘The Library of Babel’ by Jorge Luis Borges. Great story! We can repeat this process with several files and then create a corpus (and analyze it) using the tm package.
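That last step could look something like this; a minimal sketch assuming the tm package, where doc.text would be a character vector with one element per downloaded page:

[sourcecode language="r"]
library(tm)

# Build a corpus treating each page as one document
corpus = Corpus(VectorSource(doc.text))

# A term-document matrix as a starting point for analysis
tdm = TermDocumentMatrix(corpus)
inspect(tdm)
[/sourcecode]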

Operating on datasets inside a function

There are times when we need to write a function that makes changes to a generic data frame passed as an argument. Let’s say, for example, that we want to write a function that converts to factor any variable whose name starts with a capital letter. There are a few issues involved in this problem, including:

  • Obtaining a text version of the name of the dataset (using the substitute() function).
  • Looping over the variable names and checking if they start with a capital letter (comparing with the LETTERS vector of constants).
  • Generating the plain-text version of the factor conversion, gluing together the dataset and variable names (using paste()).
  • Parsing the plain-text version of the code to R code (using parse()) and evaluating it (using eval()). This evaluation has to be done in the parent environment or we will lose any transformation when we leave the function, which is the reason for the envir argument.

[sourcecode lang="r"]
CapitalFactors = function(dataset) {
    # Gets text version of the name of the dataset
    data.name = substitute(dataset)

    # Loops over the variable names of the dataset
    # and extracts the ones that start with uppercase
    for(var.name in names(dataset)){
        if(substr(var.name, 1, 1) %in% LETTERS) {
            left = paste(data.name, '$', var.name, sep = '')
            right = paste('factor(', left, ')', sep = '')
            code = paste(left, '=', right)
            # Evaluates the parsed text, using the parent environment
            # so as to actually update the original dataset
            eval(parse(text = code), envir = parent.frame())
        }
    }
}

# Create example dataset and display structure
example = data.frame(Fert = rep(1:2, each = 4),
                     yield = c(12, 9, 11, 13, 14, 13, 15, 14))
str(example)

# 'data.frame': 8 obs. of 2 variables:
# $ Fert : int 1 1 1 1 2 2 2 2
# $ yield: num 12 9 11 13 14 13 15 14

# Use function on dataset and display structure
CapitalFactors(example)
str(example)

# 'data.frame': 8 obs. of 2 variables:
# $ Fert : Factor w/ 2 levels "1","2": 1 1 1 1 2 2 2 2
# $ yield: num 12 9 11 13 14 13 15 14
[/sourcecode]

And that’s all. Now the Fert integer variable has been converted to a factor. This example function could be useful for someone out there.

A brief idea of style

Once one starts writing more R code the need for consistency increases, as it facilitates managing larger projects and their maintenance. There are several style guides or suggestions for R; for example, Andrew Gelman’s, Hadley Wickham’s, Bioconductor’s and this one. I tend to write closer to Google’s R style guide, which contains some helpful suggestions. I use something similar but:

  • I use = for assignment rather than <-, because it is visually less noisy, <- requires an extra keystroke (yes, I am that lazy) and—from a purely esthetic point of view—in many monospaced fonts the less-than and hyphen symbols do not align properly, so <- does not look like an arrow. I know that hardcore R programmers prefer the other symbol but, tough, I prefer the equals sign.
  • I indent code using four spaces, just because I am used to doing so in Python. I will make an exception and go down to two spaces if there are too many nested clauses.
  • I like their identifier naming scheme, although I do not use it consistently. Mea culpa.
  • I always use single quotes for text (two fewer keystrokes per text variable).

Of course you’ll find that the examples presented in this site depart from the style guide. I didn’t say that I was consistent, did I?

All combinations for levelplot

In a previous post I explained how to create all possible combinations of the levels of two factors using expand.grid(). Another use for this function is to create a regular grid of two variables for a levelplot or a contour plot.
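As a reminder, expand.grid() returns a data frame with one row per combination (the factor names are made up for this example):

# expand.grid() generates all combinations of its arguments,
# varying the first one fastest
expand.grid(fertilizer = c('A', 'B'), dose = c(1, 2))
#   fertilizer dose
# 1          A    1
# 2          B    1
# 3          A    2
# 4          B    2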

For example, let’s say that we have fitted a multiple linear regression to predict wood stiffness (stiffness, the response) using basic density (bd) and a measure of microfibril angle (t) as explanatory variables. The regression equation could be something like stiffness = 3.439 + 0.009 bd - 0.052 t. In our dataset bd ranged from 300 to 700 kg m-3, while t ranged from 50 to 70.

We will use the levelplot() function that is part of the lattice package of graphical functions, create a grid for both explanatory variables (every 10 for bd and every 1 for t), predict values of stiffness for all combinations of bd and t, and plot the results.

library(lattice)
 
# Create grid of data
wood = expand.grid(bd = seq(300, 700, 10), t = seq(50, 70, 1))
wood$stiffness = with(wood, 3.439 + 0.009*bd - 0.052*t)
 
levelplot(stiffness ~ bd*t, data = wood, 
          xlab='Basic density', ylab='T')

This code creates a graph like the one below. Simple.

Wood stiffness levelplot