Less wordy R

The Swarm Lab presents a nice comparison of R and Python code for a simple (read ‘one could do it in Excel’) problem. The example works, but I was surprised by how wordy the R code was and decided to check if one could easily produce a shorter version.

The beginning is pretty much the same, although I’ll use ggplot2 rather than lattice, because it will be a lot easier (and shorter) to get the desired appearance for the plots:

require(Quandl)
require(ggplot2)
 
# Load data from Quandl
my.data = Quandl("TPC/HIST_RECEIPT", 
                 start_date = "1945-12-31", 
                 end_date = "2013-12-31")

The whole example relies on only three variables and—as I am not great at typing—I tend to work with shorter variable names. I directly changed the names for variables 1 to 3:

# Display first lines of the data frame
# and set short names for first three columns
head(my.data)
names(my.data)[1:3] = c('year', 'indtax', 'corptax')

It is a lot easier to compare the regression lines if we change the shape of the data set from wide to long, where there is one variable for year, one for tax type, and one for the actual tax rate. It would be possible to use one of Hadley’s packages to get a simpler syntax for this, but I decided to stick to the minimum set of requirements:

# Change shape to fit both regressions simultaneously
mdlong = reshape(my.data[, 1:3], 
                 idvar = 'year', times = c('Individual', 'Corporate'), 
                 varying = list(2:3), direction = 'long')
 
mdlong$taxtype = factor(mdlong$time)
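For reference, here is a minimal sketch of the same reshaping using reshape2 (one of Hadley's packages and an extra dependency, which is why I did not use it above); note that the taxtype levels would then be the original column names, indtax and corptax, rather than Individual and Corporate:

# Same long format with reshape2::melt()
require(reshape2)
mdlong2 = melt(my.data[, 1:3], id.vars = 'year',
               variable.name = 'taxtype', value.name = 'indtax')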

And now we are ready to produce the plots. The first one can be a rough cut to see if we get the right elements:

ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) + 
  geom_point() + geom_line() + geom_smooth(method = 'lm')
First cut of the taxes per year plot.

Yes, this one has the points, lines, linear regressions and 95% confidence intervals for the mean predicted responses, but we still need to get rid of the grey background and use black labels (theme_bw()), set the right axis labels and ticks (scale_x..., scale_y...), and set the right colour palette for points and lines (scale_colour_manual) and for the confidence-interval fill (scale_fill_manual), like so:

# Plotting the graph (first match color palette) and put the regression
# lines as well
serious.palette = c('#AD3333', '#00526D')
ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) + 
  geom_point() + geom_line() + geom_smooth(method = 'lm', aes(fill = taxtype)) + 
  theme_bw() + 
  scale_y_continuous('Income taxes (% of GDP)', breaks = seq(0, 12, 2), minor_breaks = NULL) + 
  scale_x_date('Fiscal year', minor_breaks = NULL) + 
  scale_colour_manual(values=serious.palette) + scale_fill_manual(values=serious.palette)
Way closer to the desired plot, still much shorter.

One can still change the font sizes to match the original plots, reposition the legend and change the aspect ratio when saving the PNG graphs (all simple statements), but you get the idea; a quick sketch follows.
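A minimal sketch of those finishing touches (the text size, legend position and file name are arbitrary choices of mine, not part of the original code):

# Tweak text size and legend position, then save with a chosen
# aspect ratio; last_plot() grabs the most recent ggplot
p = last_plot() + theme(text = element_text(size = 12),
                        legend.position = 'bottom')
ggsave('taxes-per-year.png', p, width = 8, height = 5, dpi = 100)

With the plot sorted out, we can now move on to fitting the regression lines: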

# Fitting a regression with dummy variables
m1 = lm(indtax ~ year*taxtype, data = mdlong)
summary(m1)
 
# The regressions have different intercepts and slopes
# Call:
#   lm(formula = indtax ~ year * taxtype, data = mdlong)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -1.95221 -0.44303 -0.05731  0.35749  2.39415 
# 
# Coefficients:
#                            Estimate Std. Error t value Pr(>|t|)    
#   (Intercept)             3.435e+00  1.040e-01   33.01   <2e-16 ***
#   year                   -1.564e-04  1.278e-05  -12.23   <2e-16 ***
#   taxtypeIndividual       4.406e+00  1.471e-01   29.94   <2e-16 ***
#   year:taxtypeIndividual  1.822e-04  1.808e-05   10.08   <2e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.7724 on 134 degrees of freedom
# Multiple R-squared:  0.9245,  Adjusted R-squared:  0.9228 
# F-statistic: 546.9 on 3 and 134 DF,  p-value: < 2.2e-16

This gives the regression coefficients for Corporate (3.44 - 1.564e-04 year) and Individual ([3.44 + 4.41] + [-1.564e-04 + 1.822e-04] year, or 7.84 + 2.58e-05 year). As a bonus you get the comparison between regression lines.
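The same arithmetic can be done directly on the coefficient vector; a minimal sketch (the object names corp and ind are my own, and the coefficient names follow the summary above):

# Corporate line: intercept and slope straight from the coefficients
# Individual line: add the dummy-variable adjustments
b = coef(m1)
corp = c(b['(Intercept)'], b['year'])
ind = c(b['(Intercept)'] + b['taxtypeIndividual'],
        b['year'] + b['year:taxtypeIndividual'])
corp
ind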

In 'R as a second language' I pointed out that 'brevity reduces the time between thinking and implementation, so we can move on and keep on trying new ideas'. Sometimes it seriously does.

Split-plot 2: let’s throw in some spatial effects

Disappeared for a while collecting frequent flyer points. In the process I ‘discovered’ that I live in the middle of nowhere, as it took me 36 hours to reach my conference destination (Estoril, Portugal) through Christchurch, Sydney, Bangkok, Dubai, Madrid and Lisbon.

Where was I? Showing what split-plots look like under the bonnet (hood, for you US readers). Yates presented a nice diagram of his oats data set in the paper, so we have the spatial location of each data point, which permits us to play with within-trial spatial trends.

Rather than mucking around with typing coordinates we can rely on Kevin Wright's version of the oats dataset, contained in the agridat package. Kevin is a man of mystery, a James Bond of statisticians (hence the low profile) with a keen interest in experimental design and analysis. He has put together a really nice collection of data sets WITH suggested code for the analyses, including nlme, lme4, asreml, MCMCglmm and a few other bits and pieces. Recommended!

Plants (triffids excepted) do not move, which means that environmental trends within a trial (things like fertility, water availability, temperature, etc.) can affect experimental units in a way that varies with space and induces correlation of the residuals. In other words, we could be violating the independence assumption, if you haven't got the hint yet.

Gratuitous picture: Detail of Mosteiro dos Jerónimos, Belém, Lisboa, Portugal (Photo: Luis).

There are a few ways to model environmental trends (AR processes, simple polynomials, splines, etc.) that can be accounted for either through the G matrix (as random effects) or through the R matrix. See the previous post for an explanation of the bits and pieces. We will use here a very popular approach, which is to consider two separable (so we can estimate the bloody things) autoregressive processes, one for rows and one for columns, to model spatial association. In addition, we will have a spatial residual. In summary, the residuals have moved from \( \boldsymbol{R} = \sigma^2_e \boldsymbol{I}\) to \( \boldsymbol{R} = \sigma^2_s \boldsymbol{R}_{col} \otimes \boldsymbol{R}_{row}\). I previously showed the general form of these autoregressive matrices in this post, and you can see the \( \boldsymbol{R}_{col}\) matrix below. In some cases we can also add an independent residual (the so-called nugget) to the residual matrix.
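For completeness, a sketch of the residual structure once a nugget is added (the extra independent term is the only change):

\( \boldsymbol{R} = \sigma^2_s \boldsymbol{R}_{col} \otimes \boldsymbol{R}_{row} + \sigma^2_n \boldsymbol{I}\)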

We will first fit a split-plot model considering spatial residuals using asreml because, let's face it, there is no other package that will give you this kind of flexibility:

require(asreml)
require(agridat)
 
# Having a look at the structure of yates.oats
# and changing things slightly so it matches 
# previous post
str(yates.oats)
 
names(yates.oats) = c('row', 'col', 'Y', 'N', 'V', 'B')
 
yates.oats$row = factor(yates.oats$row)
yates.oats$col = factor(yates.oats$col)
yates.oats$N = factor(yates.oats$N)
yates.oats$MP = yates.oats$V
 
# Base model (this was used in the previous post)
m2 = asreml(Y ~ V*N, random = ~ B/MP, data = yates.oats)
summary(m2)$varcomp
#                gamma component std.error  z.ratio constraint
# B!B.var    1.2111647  214.4771 168.83404 1.270343   Positive
# B:MP!B.var 0.5989373  106.0618  67.87553 1.562593   Positive
# R!variance 1.0000000  177.0833  37.33244 4.743416   Positive
 
# Spatial model
m3 = asreml(Y ~ V*N, random = ~ B/MP, 
            rcov = ~ ar1(col):ar1(row),
            data = yates.oats)
summary(m3)$varcomp
#                 gamma    component   std.error   z.ratio    constraint
# B!B.var    0.80338277 169.24347389 156.8662436 1.0789031      Positive
# B:MP!B.var 0.49218005 103.68440202  73.6390759 1.4080079      Positive
# R!variance 1.00000000 210.66355939  67.4051020 3.1253355      Positive
# R!col.cor  0.04484166   0.04484166   0.2006562 0.2234751 Unconstrained
# R!row.cor  0.49412567   0.49412567   0.1420397 3.4787860 Unconstrained
 
coef(m3)$fixed
coef(m3)$random
# effect
# V_GoldenRain:N_0     0.0000000
# V_GoldenRain:N_0.2   0.0000000
# V_GoldenRain:N_0.4   0.0000000
# V_GoldenRain:N_0.6   0.0000000
# V_Marvellous:N_0     0.0000000
# V_Marvellous:N_0.2  -0.8691155
# V_Marvellous:N_0.4 -12.4223873
# V_Marvellous:N_0.6  -5.5018907
# V_Victory:N_0        0.0000000
# V_Victory:N_0.2     -1.9580360
# V_Victory:N_0.4      2.1913469
# V_Victory:N_0.6      0.3728648
# N_0                  0.0000000
# N_0.2               23.3299154
# N_0.4               40.0570745
# N_0.6               47.1749577
# V_GoldenRain         0.0000000
# V_Marvellous         9.2845952
# V_Victory           -5.7259866
# (Intercept)         76.5774292
 
# effect
# B_B1               21.4952875
# B_B2                1.0944484
# B_B3               -5.4336461
# B_B4               -4.4334455
# B_B5               -6.6925874
# B_B6               -6.0300569
# B_B1:MP_GoldenRain  1.3036724
# B_B1:MP_Marvellous -0.9082462
# B_B1:MP_Victory    12.7754283
# B_B2:MP_GoldenRain  1.3286187
# B_B2:MP_Marvellous  7.4181674
# B_B2:MP_Victory    -8.0761824
# B_B3:MP_GoldenRain -6.5288220
# B_B3:MP_Marvellous 10.4361799
# B_B3:MP_Victory    -7.2367277
# B_B4:MP_GoldenRain  6.6868810
# B_B4:MP_Marvellous -9.2317585
# B_B4:MP_Victory    -0.1716372
# B_B5:MP_GoldenRain  2.4635492
# B_B5:MP_Marvellous -9.7086196
# B_B5:MP_Victory     3.1443067
# B_B6:MP_GoldenRain -5.2538993
# B_B6:MP_Marvellous  1.9942770
# B_B6:MP_Victory    -0.4351876
 
# In a larger experiment we could try fitting a nugget using units
m4 = asreml(Y ~ V*N, random = ~ B/MP + units, 
            rcov = ~ ar1(col):ar1(row),
            data = yates.oats)
summary(m4)
 
# However, in this small experiment the system
# goes crazy and the results are meaningless
#                       gamma   component  std.error  z.ratio constraint
# B!B.var         0.006759124   223.70262   185.3816 1.206714   Positive
# B:MP!B.var      0.001339017    44.31663    29.2004 1.517672   Positive
# units!units.var 0.003150356   104.26542    34.2738 3.042132   Positive
# R!variance      1.000000000 33096.39128 19480.8333 1.698921   Positive
# R!col.cor       0.999000000     0.99900         NA       NA      Fixed
# R!row.cor       0.999000000     0.99900         NA       NA      Fixed

So we have to build an autoregressive correlation matrix for rows, another one for columns, and multiply the whole thing by a spatial variance. Then we can add an independent residual (the nugget, if we want one and can estimate it). Peter Dalgaard has neat code for building the autocorrelation matrix. And going back to the code in the previous post:

# Builds an AR(1) correlation matrix of dimension dim x dim
# with autocorrelation parameter ar
ar.matrix = function(ar, dim) {
  M = diag(dim)
  M = ar^abs(row(M) - col(M))
  return(M)
}
 
# Variance components (from m3)
varB = 169.243
varMP = 103.684
varR = 210.664
arcol = 0.045
arrow = 0.494
 
# Basic vector and matrices: y, X, Z, G & R
y = matrix(yates.oats$Y, nrow = dim(yates.oats)[1], ncol = 1)
X = model.matrix(~ V*N, data = yates.oats)
Z = model.matrix(~ B/MP - 1, data = yates.oats, 
                 contrasts.arg = list(B = contrasts(yates.oats$B, contrasts = F), 
                                      MP = contrasts(yates.oats$MP, contrasts = F)))
 
G = diag(c(rep(varB, 6), rep(varMP, 18)))
 
# Only change from previous post is building the R matrix
Rcol = ar.matrix(arcol, 4)
Rrow = ar.matrix(arrow, 18)
Rcol
# Having a look at the structure
#            [,1]     [,2]     [,3]       [,4]
# [1,] 1.0000e+00 0.045000 0.002025 9.1125e-05
# [2,] 4.5000e-02 1.000000 0.045000 2.0250e-03
# [3,] 2.0250e-03 0.045000 1.000000 4.5000e-02
# [4,] 9.1125e-05 0.002025 0.045000 1.0000e+00
 
R = varR * kronecker(Rcol, Rrow)
Rinv = solve(R)
 
# Components of C
XpX = t(X) %*% Rinv %*% X
ZpZ = t(Z) %*% Rinv %*% Z
 
XpZ = t(X) %*% Rinv %*% Z
ZpX = t(Z) %*% Rinv %*% X
 
Xpy = t(X) %*% Rinv %*% y
Zpy = t(Z) %*% Rinv %*% y
 
# Building C * [b a] = RHS
C = rbind(cbind(XpX, XpZ),
          cbind(ZpX, ZpZ + solve(G)))
RHS = rbind(Xpy, Zpy)
 
blup = solve(C, RHS)
blup
#                         [,1]
# (Intercept)       76.5778238
# VMarvellous        9.2853002
# VVictory          -5.7262894
# N0.2              23.3283060
# N0.4              40.0555464
# N0.6              47.1740348
# VMarvellous:N0.2  -0.8682597
# VVictory:N0.2     -1.9568979
# VMarvellous:N0.4 -12.4200362
# VVictory:N0.4      2.1912083
# VMarvellous:N0.6  -5.5017225
# VVictory:N0.6      0.3732453
# BB1               21.4974445
# BB2                1.0949433
# BB3               -5.4344098
# BB4               -4.4333080
# BB5               -6.6948783
# BB6               -6.0297918
# BB1:MPGoldenRain   1.3047656
# BB2:MPGoldenRain   1.3294043
# BB3:MPGoldenRain  -6.5286993
# BB4:MPGoldenRain   6.6855568
# BB5:MPGoldenRain   2.4624436
# BB6:MPGoldenRain  -5.2534710
# BB1:MPMarvellous  -0.9096022
# BB2:MPMarvellous   7.4170634
# BB3:MPMarvellous  10.4349240
# BB4:MPMarvellous  -9.2293528
# BB5:MPMarvellous  -9.7080694
# BB6:MPMarvellous   1.9950370
# BB1:MPVictory     12.7749000
# BB2:MPVictory     -8.0756682
# BB3:MPVictory     -7.2355284
# BB4:MPVictory     -0.1721988
# BB5:MPVictory      3.1441163
# BB6:MPVictory     -0.4356209

These are the same results one gets from ASReml-R. QED.

P.S. Many thanks to Kevin Wright for answering my questions about agridat.

R is a language

A commenter on this blog reminded me of one of the frustrating aspects faced by newbies, not only to R but to any programming environment (I am thinking of typical students doing stats for the first time). The statement "R is a language" sounds perfectly harmless if you have previous exposure to programming. However, if you come from a zero-programming background the question is "What do you really mean?"

R—and many other statistical systems for that matter—is often approached from either one of two extremes:

  1. Explaining R as a programming language, as most of the R documentation and books (like The Art of R Programming, quite good by the way) do.
  2. Presenting a hodgepodge of statistical analyses, introducing the language as a bonus, best represented by Crawley's The R Book (which I find close to unreadable). In contrast, Modern Applied Statistics with S by Venables and Ripley is much better, even though it doesn't mention R in the title.

If you are new to both statistics and R, I like the level of the Quick-R website as a starting point; it was later expanded into a book (R in Action). It uses the second approach listed above, so if you come from a programming background the book will most likely be disappointing. Nevertheless, if you come from a newbie point of view, both the website and the book are great resources. In spite of this, Quick-R assumes that the reader is familiar with statistics and starts with "R is an elegant and comprehensive statistical and graphical programming language."

A simpler starting point

I would like to start from an even simpler point: ignore programming for a moment and think about languages like English, Spanish, etc. In languages we have things (nouns) and actions (verbs). We perform actions on things: we measure a tree, draw a plot, make some assumptions, estimate coefficients, etc. In R we use functions to perform actions on objects (the 'things' of the previous explanation). Our data sets are objects that we read, write, fit a model to (creating objects with the results), and so on. "R is a language" means that we have a grammar designed to deal with data from a statistical point of view.
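As a tiny preview of a verb acting on a noun, here is a function, mean(), acting on a collection of numbers (an arbitrary example of mine, not part of the original post):

> mean(c(23, 42, pi))
[1] 22.71386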

A simple sentence, "Luis writes Quantum Forest", has two objects (Luis and Quantum Forest) and one function (writes). Now let's look at some simple objects in R; for example, a number, a string of characters and a collection of numbers (the latter using the function c() to keep the numbers together):

> 23
[1] 23
 
> "Luis"
[1] "Luis"
 
> c(23, 42, pi)
[1] 23.000000 42.000000  3.141593

Up to this point we have pretty boring software, but things start becoming more interesting when we can assign objects to names, so we can keep acting on those objects (using functions). In this blog I use = instead of <- to assign things (objects) to names outside function calls. This is considered "bad form" in the R world, but to me it is much more readable§. (Inside function calls, arguments should always be specified with an = sign, as we'll see in a future post.) Anyway, if you are feeling in a conformist mood, replace the = by <- and the code will work equally well.

> sex = 23
 
> Sex = "Luis"
 
> SEX = c(23, 42, pi)
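And, as mentioned above, the arrow does exactly the same job; a quick check:

> sex <- 23
> sex
[1] 23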

R is case sensitive, meaning that upper- and lower-case letters are considered different. Thus, we can assign different objects to variables named sex, Sex and SEX, and R will keep track of them as separate entities (a Kama Sutra of names!). Once an object is assigned to a variable, R stops printing the object back to the user. However, it is possible to type the variable name, press enter and get the content stored under that name. For example:

> sex
[1] 23
 
> SEX
[1] 23.000000 42.000000  3.141593

The lowly = sign is a function as well. For example, both a and b are assigned the same bunch of numbers:

> a = c(23 , 42, pi)
> a
[1] 23.000000 42.000000  3.141593
 
# Is equivalent to
> assign('b', c(23 , 42, pi))
> b
[1] 23.000000 42.000000  3.141593

Even referring to an object by its name calls a function, print(), which is why we get [1] 23.000000 42.000000 3.141593 when typing b and hitting enter in R.
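To make that implicit call explicit:

> print(b)
[1] 23.000000 42.000000  3.141593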

Grouping objects

Robert Kabacoff has a nice invited post explaining data frames. Here I will present a very rough explanation with a toy example.

Objects can be collected in other objects and assigned a name. In data analysis we tend to collect several variables (for example tree height and stem diameter, people's age and income, etc). It is convenient to keep variables referring to the same units (trees, persons) together in a table. If you have used Excel, a rough approximation would be a spreadsheet. Our toy example could be like:

> x = c(1, 3, 4, 6, 8, 9) 
> y = c(10, 11, 15, 17, 17, 20)
 
> toy = data.frame(x, y)
> toy
  x  y
1 1 10
2 3 11
3 4 15
4 6 17
5 8 17
6 9 20

The last line combines two objects (x and y) in an R table using the function data.frame() and then it assigns the name "toy" to that table (using the function =). From now on we can refer to that data table when using other functions as in:

# Getting descriptive statistics using summary()
> summary(toy)
 
       x               y     
 Min.   :1.000   Min.   :10  
 1st Qu.:3.250   1st Qu.:12  
 Median :5.000   Median :16  
 Mean   :5.167   Mean   :15  
 3rd Qu.:7.500   3rd Qu.:17  
 Max.   :9.000   Max.   :20  
 
# Obtaining the names of the variables in
# a data frame
> names(toy)
[1] "x" "y"
 
# Or actually doing some stats. Here fitting
# a linear regression model
> fm1 = lm(y ~ x, data = toy)

Incidentally, anything following a # is a comment, which helps users document their work. Use them liberally.

Fitting a linear regression will produce lots of different outputs: estimated regression coefficients, fitted values, residuals, etc. Thus, it is very handy to assign the results of the regression to a name (in this case "fm1") for further manipulation. For example:

# Obtaining the names of objects contained in the
# fm1 object
> names(fm1)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        
 
# We can access individual objects using the notation
# objectName$components
 
# Obtaining the intercept and slope
> fm1$coefficients
(Intercept)           x 
   8.822064    1.195730
 
# Fitted (predicted) values
> fm1$fitted.values
       1        2        3        4        5        6 
10.01779 12.40925 13.60498 15.99644 18.38790 19.58363 
 
# Residuals
> fm1$residuals
          1           2           3           4           5           6 
-0.01779359 -1.40925267  1.39501779  1.00355872 -1.38790036  0.41637011 
 
# We can also use functions to extract components from an object
# as in the following graph
> plot(resid(fm1) ~ fitted(fm1))

The last line of code extracts the residuals of fm1 (using the function resid()) and the fitted values (using the function fitted()), which are then combined using the function plot().
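For comparison, the same plot using the $ notation shown earlier (the extractor functions are just more general, as they work across many types of models):

> plot(fm1$residuals ~ fm1$fitted.values)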

In summary: in this introduction we relied on the R language to manipulate objects using functions. By assigning names to objects (and to the results of applying functions) we can keep processing data and improving our understanding of the problem under study.

Footnotes

R is an implementation (a dialect) of the S language. But remember, a language is a dialect with an army and a navy.

Natural languages tend to be more complex and will have pronouns, articles, adjectives, etc. Let's ignore that for the moment.

§Languages change; for example, I speak Spanish—a bad form of Latin—together with hundreds of millions of people. Who speaks Latin today?