
## Setting plots side by side

This is a simple example of displaying side-by-side lattice or ggplot2 plots, using the `mtcars` dataset that comes with any R installation. We will display a scatterplot of miles per US gallon (mpg) on car weight (wt) next to another scatterplot of the same data, but using different colors by number of engine cylinders (cyl, treated as factor) and adding a smooth line (via the `type` argument).

```
library(lattice)
data(mtcars)

# Create first plot and assign it to variable
p1 = xyplot(mpg ~ wt, data = mtcars,
            xlab = 'Car weight', ylab = 'Mileage')

# Create second plot and assign it to variable
p2 = xyplot(mpg ~ wt, group = factor(cyl), data = mtcars,
            xlab = 'Car weight', ylab = 'Miles/gallon',
            type = c('p', 'smooth'))

# print calls print.lattice, which takes a position
# argument.
print(p1, position = c(0, 0, 0.5, 1), more = TRUE)
print(p2, position = c(0.5, 0, 1, 1))
```

According to the documentation, `position` is a vector of four numbers, typically `c(xmin, ymin, xmax, ymax)`, that give the lower-left and upper-right corners of the rectangle in which the Trellis plot is to be positioned. The coordinate system for this rectangle is [0, 1] in both the x and y directions. That is, the first `print()` sets the position to occupy the left half of the graph at full height, and avoids refreshing the graph when displaying the next plot (`more = TRUE`). The second `print()` uses the right half of the graph at full height.

In the case of ggplot2, the code is not that different:

```
library(grid)
library(ggplot2)

# Create first plot and assign it to variable
p1 = qplot(wt, mpg, data = mtcars,
           xlab = 'Car weight', ylab = 'Mileage')

# Create second plot and assign it to variable
p2 = qplot(wt, mpg, color = factor(cyl), data = mtcars,
           geom = c('point', 'smooth'),
           xlab = 'Car weight', ylab = 'Mileage')

# Define grid layout to locate plots and print each graph
pushViewport(viewport(layout = grid.layout(1, 2)))
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
```

More details on ggplot2’s notation can be found here.


I tend not to upgrade R very often (running from six months to a year behind in version numbers) because I have to reinstall all packages: a real pain. A quick search shows that people have managed to come up with good solutions to this problem, as presented in this stackoverflow thread. I used the code on my Mac:

```
# Run in the old installation (pre-upgrade),
# saving a list of installed packages
package.list = installed.packages()[, "Package"]
save(package.list, file = 'package-list.Rdata')

# Run in the new installation: load the saved list and
# install any packages that are missing
load('package-list.Rdata')
for (p in setdiff(package.list, installed.packages()[, "Package"])) {
    install.packages(p)
}

# Final output
Warning messages:
1: In getDependencies(pkgs, dependencies, available, lib) :
package ‘Acinonyx’ is not available (for R version 2.13.2)
2: In getDependencies(pkgs, dependencies, available, lib) :
package ‘AnimalINLA’ is not available (for R version 2.13.2)
3: In getDependencies(pkgs, dependencies, available, lib) :
package ‘asreml’ is not available (for R version 2.13.2)
4: In getDependencies(pkgs, dependencies, available, lib) :
package ‘INLA’ is not available (for R version 2.13.2)
5: In getDependencies(pkgs, dependencies, available, lib) :
package ‘graph’ is not available (for R version 2.13.2)
```

From all installed packages, I only had issues with five of them, which require installation from their respective websites: Acinonyx, INLA (and AnimalINLA) and asreml. Package graph is now available from bioconductor.org. INLA can be installed really easily from inside R (see below), while I did not bother downloading asreml again and just copied the folder from `~/Library/R/OldVersion/library/asreml` to `~/Library/R/CurrentVersion/library/asreml`.

```
# Installing INLA
source("http://www.math.ntnu.no/inla/givemeINLA.R")
```

Overall, it was a good upgrade experience, so thanks to the stackoverflow crowd for so many ideas on how to make R even nicer than it is.
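Copying a package folder between library trees can also be done from inside R with `file.copy()`. A small sketch, using temporary folders to stand in for the `OldVersion` and `CurrentVersion` library paths mentioned above (the paths on your machine will differ):

```
# Demonstration with temporary folders standing in for
# ~/Library/R/OldVersion/library and .../CurrentVersion/library
old.lib = file.path(tempdir(), 'OldVersion', 'library')
new.lib = file.path(tempdir(), 'CurrentVersion', 'library')
dir.create(file.path(old.lib, 'asreml'), recursive = TRUE)
dir.create(new.lib, recursive = TRUE)

# Copy the whole package folder across library trees
file.copy(from = file.path(old.lib, 'asreml'),
          to = new.lib, recursive = TRUE)
file.exists(file.path(new.lib, 'asreml'))  # TRUE
```

Note that this only works safely when the old and new R versions are binary compatible for that package.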

P.S. 2010-10-14 Similar instructions, but including compiling R and installing bioconductor.


## Reading HTML pages in R for text processing

I was talking with one of my colleagues about doing some text analysis—which, by the way, I have never done before—for which the first issue is getting text into R. Not just any text, but files that can be accessed through the internet. In summary, we need to access an HTML file, parse it so we can reach specific content, and then remove the HTML tags. Finally, we may want to replace some text (the end-of-line characters `\n`, for example) before continuing to process the files.

The package `XML` has the necessary functionality to deal with HTML, while the rest is done using a few standard R functions.

```
library(XML)

# Read and parse HTML file
doc.html = htmlTreeParse('http://apiolaza.net/babel.html',
                         useInternal = TRUE)

# Extract all the paragraphs (HTML tag is p, starting at
# the root of the document). Unlist flattens the list to
# create a character vector.
doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))

# Replace all \n by spaces
doc.text = gsub('\n', ' ', doc.text)

# Join all the elements of the character vector into a single
# character string, separated by spaces
doc.text = paste(doc.text, collapse = ' ')
```

Incidentally, babel.html contains a translation of the short story ‘The Library of Babel’ by Jorge Luis Borges. Great story! We can repeat this process with several files and then create a corpus (and analyze it) using the `tm` package.
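As a small taster of what comes after this step, a word-frequency count can be built with base R alone. This is my addition, not part of the original workflow; the short string below stands in for the `doc.text` produced above:

```
# Minimal word-frequency sketch using only base R.
# A short string stands in for the doc.text built above.
doc.text = 'the library of babel is the universe the library'

# Lowercase, split on anything that is not a letter,
# and drop empty strings
words = unlist(strsplit(tolower(doc.text), '[^a-z]+'))
words = words[words != '']

# Tabulate and sort by decreasing frequency
word.freq = sort(table(words), decreasing = TRUE)
head(word.freq)
```

The `tm` package does this (and much more) properly, but the base R version is handy for quick checks.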


## Operating on datasets inside a function

There are times when we need to write a function that makes changes to a generic data frame that is passed as an argument. Let’s say, for example, that we want to write a function that converts to factor any variable with names starting with a capital letter. There are a few issues involved in this problem, including:

• Obtaining a text version of the name of the dataset (using the `substitute()` function).
• Looping over the variable names and checking if they start with a capital letter (comparing with the `LETTERS` vector of constants).
• Generating the plain text version of the factor conversion, glueing the dataset and variable names (using `paste()`).
• Parsing the plain text version of the code to R code (using `parse()`) and evaluating it (using `eval()`). This evaluation has to be done in the parent environment or we will lose any transformation when we leave the function, which is the reason for the `envir = parent.frame()` argument.

```
CapitalFactors = function(dataset) {
    # Gets text version of the name of the dataset
    data.name = substitute(dataset)

    # Loops over the variable names of the dataset
    for (var.name in names(dataset)) {
        if (substr(var.name, 1, 1) %in% LETTERS) {
            left = paste(data.name, '$', var.name, sep = '')
            right = paste('factor(', left, ')', sep = '')
            code = paste(left, '=', right)
            # Evaluates the parsed text, using the parent environment
            # so as to actually update the original dataset
            eval(parse(text = code), envir = parent.frame())
        }
    }
}

# Create example dataset and display structure
example = data.frame(Fert = rep(1:2, each = 4),
                     yield = c(12, 9, 11, 13, 14, 13, 15, 14))
str(example)

'data.frame':	8 obs. of  2 variables:
 $ Fert : int  1 1 1 1 2 2 2 2
 $ yield: num  12 9 11 13 14 13 15 14

# Use function on dataset and display structure
CapitalFactors(example)
str(example)

'data.frame':	8 obs. of  2 variables:
 $ Fert : Factor w/ 2 levels "1","2": 1 1 1 1 2 2 2 2
 $ yield: num  12 9 11 13 14 13 15 14
```

And that’s all. Now the Fert integer variable has been converted to a factor. This example function could be useful for someone out there.
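For comparison (and this is my aside, not part of the approach above), the same conversion can be done without any metaprogramming by returning a modified copy of the data frame and reassigning it; a sketch:

```
# Alternative without parse()/eval(): modify a copy of the
# data frame and return it, then reassign at the call site
CapitalFactors2 = function(dataset) {
    for (var.name in names(dataset)) {
        if (substr(var.name, 1, 1) %in% LETTERS) {
            dataset[[var.name]] = factor(dataset[[var.name]])
        }
    }
    dataset
}

example = data.frame(Fert = rep(1:2, each = 4),
                     yield = c(12, 9, 11, 13, 14, 13, 15, 14))
example = CapitalFactors2(example)
str(example)  # Fert is now a factor, yield stays numeric
```

The trade-off: the caller has to write `example = CapitalFactors2(example)`, but nothing happens behind the caller's back, which is the more idiomatic R style.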


## A brief idea of style

Once one starts writing more R code the need for consistency increases, as it facilitates managing larger projects and their maintenance. There are several style guides or suggestions for R; for example, Andrew Gelman’s, Hadley Wickham’s, Bioconductor’s and this one. I tend to write closer to Google’s R style guide, which contains some helpful suggestions. I use something similar but:

• I use `=` for assignment rather than `<-`, because it is visually less noisy, `<-` requires an extra keystroke (yes, I am that lazy) and, from a purely esthetic point of view, in many monospaced fonts the less-than and hyphen symbols do not align properly, so `<-` does not look like an arrow. I know that hardcore R programmers prefer the other symbol but, tough, I prefer the equal sign.
• I indent code using four spaces, just because I am used to doing so in Python. I will make an exception and go down to two spaces if there are too many nested clauses.
• I like their identifier naming scheme, although I do not use it consistently. Mea culpa.
• I always use single quotes for text (two fewer keystrokes per text variable).

Of course you’ll find that the examples presented in this site depart from the style guide. I didn’t say that I was consistent, did I?
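One caveat with `=` worth knowing (my addition to the style debate): inside a function call, `=` always names an argument, so assignment at that spot still needs `<-`. A quick illustration:

```
# Inside a call, '=' names an argument; '<-' assigns
m1 = mean(x = 1:5)   # 'x' is an argument of mean(); no variable
                     # called x is created by this line
m2 = mean(x <- 1:5)  # x is assigned in the calling environment
x                    # 1 2 3 4 5
```

Outside function calls (the vast majority of lines), `=` and `<-` behave the same for top-level assignment, which is why the choice is mostly a matter of taste.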


## All combinations for levelplot

In a previous post I explained how to create all possible combinations of the levels of two factors using `expand.grid()`. Another use for this function is to create a regular grid for two variables to create a levelplot or a contour plot.

For example, let’s say that we have fitted a multiple linear regression to predict wood stiffness (stiff, the response) using basic density (bd) and a measure of microfibril angle (t) as explanatory variables. The regression equation could be something like `stiffness = 3.439 + 0.009 bd - 0.052 t`. In our dataset bd had a range of 300 to 700 kg m-3, while t had a range from 50 to 70.

We will use the `levelplot()` function that is part of the `lattice` package of graphical functions, create a grid for both explanatory variables (every 10 for bd and every 1 for t), predict values of stiffness for all combinations of bd and t, and plot the results.

```
library(lattice)

# Create grid of data
wood = expand.grid(bd = seq(300, 700, 10), t = seq(50, 70, 1))
wood$stiffness = with(wood, 3.439 + 0.009*bd - 0.052*t)

levelplot(stiffness ~ bd*t, data = wood,
          xlab = 'Basic density', ylab = 'T')
```

This code creates a graph like this. Simple.
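Before plotting, it can be worth sanity-checking the grid and the predicted values; a quick check (not in the original code) using only base R:

```
# Quick checks on the prediction grid: 41 values of bd
# (300 to 700 every 10) times 21 values of t (50 to 70)
wood = expand.grid(bd = seq(300, 700, 10), t = seq(50, 70, 1))
wood$stiffness = with(wood, 3.439 + 0.009*bd - 0.052*t)

nrow(wood)             # 41 * 21 = 861 combinations
range(wood$stiffness)  # 2.499 (bd = 300, t = 70) to
                       # 7.139 (bd = 700, t = 50)
```

If the number of rows or the predicted range looks wrong, the `seq()` steps are the usual suspects.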

## On R versus SAS

A short while ago there was a discussion on linkedin about the use of SAS versus R for the enterprise. I have thought a bit about the issue but, as I do not use Linkedin, I did not make any comments there.

Disclaimer: I did use SAS a lot between 1992 and 1997, mostly for genetic evaluation, heavily relying on BASE, STAT, IML and GRAPH. From that point on, I was a light SAS user (mostly STAT and IML) until 2009. The main reason I left SAS was that I started using ASReml in 1997 and, around two years ago, asreml-R, the R package version of ASReml. Through my job I can access any statistical software; if the university does not have a license, I can buy an academic one without any issues.

I think it is important to make a distinction between enterprise use and huge datasets. Some companies have large datasets, but there probably are many companies that need to analyze large numbers of small to medium size datasets. If we accept this premise, there is room to use a diversity of statistical packages, including both SAS and R.

Another topic that often appears in the R vs. SAS discussion is cost. SAS licenses are not cheap, but for many large companies the cost of having expensive researchers with lower productivity while they learn another “free” system can be really high. Same issue applies if there are legacy programs: converting software to a new system can be expensive and time consuming. Of course this situation is changing: new graduates are being exposed much more to R than to SAS in many departments. We now use R in many courses and students may end up working in a small company that will be happy not to spend any money to pay for a SAS license.

These are some of the reasons why I prefer R:

• There is good integration between the programming language and the statistical functions. Both SAS macros and IML are poorly integrated with the data step and procs.
• R is highly conducive to exploratory data analysis; visualization functions (either the lattice or the ggplot2 packages) produce high quality plots that really help develop ideas to build models.
• Statistics is not defined by the software. If someone develops a new methodology or algorithm chances are that there will be an R implementation almost immediately. If I want to test a new idea I can scramble to write some code that connects packages developed by other researchers.
• It is relatively easy to integrate R with other languages, for example Python, to glue a variety of systems.
• asreml-r!
• I can exchange ideas with a huge number of people, because slowly R is becoming the de facto standard for many disciplines that make use of statistics.

Of course R has many drawbacks when compared to SAS; for example:

• The default editor in the Windows version is pathetic, while the one in OS X is passable (code folding and proper refreshing would be great additions).
• R syntax can be horribly inconsistent across packages, making the learning process more difficult.
• There are many, too many, ways of doing the same thing, which can be confusing, particularly for newbies. For example, summarizing data by combinations of factors could be done using `aggregate()`, `summarize()` (from `Hmisc`), functions of the `apply` family, `doBy`, etc. Compare this situation to `proc means`.

No, I did not mention technical support (which I find a non-issue), access to large datasets (it is possible to integrate R with databases, and there is ongoing work to process data that can’t fit in memory) or documentation. Concerning the latter, it would be helpful to have better R documentation, but SAS would also benefit from better manuals. There has been a huge number of books using R published recently and the documentation gap is closing. R would benefit from having good canonical documentation, something that all users could see first as the default documentation. The documentation included with the system is, how to call it, Spartan, and sometimes plain useless and confusing. A gigantic link to a searchable version of the R-help mailing list from the main R project page would be great.


## Linear regression with correlated data

I started following the debate on differential minimum wage for youth (15-19 year old) and adults in New Zealand. Eric Crampton has written a nice series of blog posts, making the data from Statistics New Zealand available. I will use the nzunemployment.csv data file (with quarterly data from March 1986 to June 2011) and show an example of multiple linear regression with autocorrelated residuals in R.

A first approach could be to ignore autocorrelation and fit a linear model that attempts to predict youth unemployment with two explanatory variables: adult unemployment (continuous) and minimum wage rules (categorical: equal or different). This can be done using:

```
setwd('~/Dropbox/quantumforest')

# Read the quarterly unemployment data
un = read.csv('nzunemployment.csv')

# Create factor for minimum wage, which was different for youth
# and adults before quarter 90 (June 2008)
un$minwage = factor(ifelse(un$q < 90, 'Different', 'Equal'))

mod1 = lm(youth ~ adult*minwage, data = un)
summary(mod1)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)         8.15314    0.43328  18.817  < 2e-16 ***
adult               1.53334    0.07506  20.428  < 2e-16 ***
minwageEqual       -5.69192    2.19356  -2.595   0.0109 *
adult:minwageEqual  2.85518    0.46197   6.180 1.47e-08 ***

Residual standard error: 1.447 on 98 degrees of freedom
Multiple R-squared: 0.8816,	Adjusted R-squared: 0.878
F-statistic: 243.3 on 3 and 98 DF,  p-value: < 2.2e-16
```

Remember that `adult*minwage` is expanded to `adult + minwage + adult:minwage`. We can make the coefficients easier to understand if we center adult unemployment on its mean. Notice that we get the same slope, Adj-R2, etc., but now the intercept corresponds to the youth unemployment for the average adult unemployment before changing minimum wage rules. All additional analyses will use the centered version.

```
un$cadult = with(un, adult - mean(adult))
mod2 = lm(youth ~ cadult*minwage, data = un)
summary(mod2)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)         16.28209    0.15352  106.06  < 2e-16 ***
cadult               1.53334    0.07506   20.43  < 2e-16 ***
minwageEqual         9.44472    0.52629   17.95  < 2e-16 ***
cadult:minwageEqual  2.85518    0.46197    6.18 1.47e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.447 on 98 degrees of freedom
Multiple R-squared: 0.8816,	Adjusted R-squared: 0.878
F-statistic: 243.3 on 3 and 98 DF,  p-value: < 2.2e-16

plot(mod2)    # Plots residuals for the model fit
acf(mod2$res) # Plots autocorrelation of the residuals
```

In the centered version, the intercept corresponds to youth unemployment when the adult unemployment rate is at its average (5.4). The coefficient of minwageEqual corresponds to the increase of youth unemployment (9.44%) when the law moved to have equal minimum wage for youth and adults. Notice that the slopes did not change at all.
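The effect of centering can be checked on any dataset: the slope is untouched and the intercept shifts to the predicted response at the mean of the predictor. A small synthetic sketch (made-up data, not the unemployment series):

```
# Centering a predictor changes the intercept, not the slope
set.seed(42)
x = runif(50, 10, 20)
y = 2 + 3*x + rnorm(50)

m.raw      = lm(y ~ x)
m.centered = lm(y ~ I(x - mean(x)))

coef(m.raw)[2]       # slope, identical in both fits
coef(m.centered)[1]  # intercept = prediction at x = mean(x)
```

This is exactly why the centered intercept above can be read as "youth unemployment at average adult unemployment".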

I will use the function `gls()` from the nlme package (which comes by default with all R installations) to take into account the serial correlation. First we can fit a model equivalent to mod2, just to check that we get the same results.

```
library(nlme)
mod3 = gls(youth ~ cadult*minwage, data = un)
summary(mod3)

Generalized least squares fit by REML
Model: youth ~ cadult * minwage
Data: un
AIC      BIC    logLik
375.7722 388.6971 -182.8861

Coefficients:
Value Std.Error   t-value p-value
(Intercept)         16.282089 0.1535237 106.05585       0
minwageEqual         9.444719 0.5262926  17.94576       0

Correlation:
minwageEqual        -0.292  0.014

Standardized residuals:
Min          Q1         Med          Q3         Max
-2.96256631 -0.53975848 -0.02071559  0.63499262  2.35900240

Residual standard error: 1.446696
Degrees of freedom: 102 total; 98 residual

```

Yes, they are identical. Notice that the model fitting is done using Restricted Maximum Likelihood (REML). Now we can add an autoregressive process of order 1 for the residuals and compare the two models:

```
mod4 = gls(youth ~ cadult*minwage,
           correlation = corAR1(form = ~1), data = un)
summary(mod4)

Generalized least squares fit by REML
Model: youth ~ cadult * minwage
Data: un
AIC      BIC    logLik
353.0064 368.5162 -170.5032

Correlation Structure: AR(1)
Formula: ~1
Parameter estimate(s):
Phi
0.5012431

Coefficients:
Value Std.Error  t-value p-value
(Intercept)         16.328637 0.2733468 59.73598       0
minwageEqual         9.082626 0.8613543 10.54459       0

Correlation:
minwageEqual        -0.318  0.007

Standardized residuals:
Min          Q1         Med          Q3         Max
-2.89233359 -0.55460580 -0.02419759  0.55449166  2.29571080

Residual standard error: 1.5052
Degrees of freedom: 102 total; 98 residual

anova(mod3, mod4)
Model df      AIC      BIC    logLik   Test L.Ratio p-value
mod3     1  5 375.7722 388.6971 -182.8861
mod4     2  6 353.0064 368.5162 -170.5032 1 vs 2 24.7658  <.0001
```

There is a substantial improvement in the log-likelihood (from -182.89 to -170.50). We can take into account the additional parameter (the autocorrelation) that we are fitting by comparing AIC, which improved from 375.77 (-2*(-182.8861) + 2*5) to 353.01 (-2*(-170.5032) + 2*6). Remember that AIC is -2*logLikelihood + 2*(number of parameters).
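The AIC arithmetic can be reproduced for any fitted model; for example, with a simple `lm()` fit on the built-in `mtcars` data (illustrative only, unrelated to the unemployment models):

```
# AIC = -2*logLik + 2*(number of estimated parameters).
# For lm, the parameters are the coefficients plus the
# residual variance.
m = lm(mpg ~ wt, data = mtcars)

k = length(coef(m)) + 1  # 2 coefficients + sigma^2
aic.by.hand = -2*as.numeric(logLik(m)) + 2*k

all.equal(aic.by.hand, AIC(m))  # TRUE
```

Counting the extra variance (or correlation) parameters this way is what makes the mod3 vs mod4 comparison fair.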

The file unemployment.txt contains the R code used in this post (I didn't use the .R extension as WordPress complains).


## R pitfall #1: check data structure

A common problem when running a simple (or not so simple) analysis is forgetting that the levels of a factor have been coded using integers. R doesn’t know that this variable is supposed to be a factor and, when fitting, for example, something as simple as a one-way anova (using `lm()`), the variable will be used as a covariate rather than as a factor.

There is a series of steps that I follow to make sure that I am using the right variables (and types) when running a series of analyses. I always define the working directory (using `setwd()`), so I know where the files that I am reading from and writing to are.

After reading a dataset I will have a look at the first and last few observations (using `head()` and `tail()`, which by default show 6 observations). This gives you an idea of what the dataset looks like, but it doesn’t confirm the structure (for example, which variables are factors). The function `str()` provides a good overview of variable types and, together with `summary()`, one gets an idea of ranges, numbers of observations and missing values.

```
# Define your working directory (folder). This will make
# your life easier. An example in OS X:
setwd('~/Documents/apophenia')

# and one for a Windows machine
setwd('c:/Documents/apophenia')

# Have a look at the first few and last few observations
# (assuming the data have been read into a data frame apo)
head(apo)
tail(apo)

# Check the structure of the data (which variables are numeric,
# which ones are factors, etc)
str(apo)

# Obtain a summary for each of the variables in the dataset
summary(apo)
```

This code should help you avoid the ‘fitting factors as covariates’ pitfall; anyway, always check the degrees of freedom of the ANOVA table just in case.
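The pitfall is easy to demonstrate with made-up data: fitting an integer-coded treatment as-is estimates a single slope, while the factor version estimates a separate contrast per level (hence the degrees-of-freedom check). A sketch:

```
# Integer-coded treatment fitted as covariate vs as factor
trt = rep(1:3, each = 4)  # 3 'treatments' coded 1, 2, 3
y   = c(10, 11, 9, 10, 14, 15, 13, 14, 20, 21, 19, 20)

m.wrong = lm(y ~ trt)          # treated as covariate: 1 slope
m.right = lm(y ~ factor(trt))  # treated as factor: 2 contrasts

length(coef(m.wrong))  # 2 (intercept + slope)
length(coef(m.right))  # 3 (intercept + 2 level contrasts)
```

A one-way anova on three levels should spend 2 degrees of freedom on the treatment; seeing only 1 in the ANOVA table is the telltale sign of this pitfall.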


## All combinations of levels for two factors

There are circumstances when one wants to generate all possible combinations of levels for two factors. For example, factor one with levels ‘A’, ‘B’ and ‘C’, and factor two with levels ‘D’, ‘E’, ‘F’. The function `expand.grid()` comes in very handy here:

```
combo = expand.grid(factor1 = LETTERS[1:3],
                    factor2 = LETTERS[4:6])
combo

  factor1 factor2
1       A       D
2       B       D
3       C       D
4       A       E
5       B       E
6       C       E
7       A       F
8       B       F
9       C       F
```

Omitting the variable names (`factor1` and `factor2`) will automatically name the variables `Var1` and `Var2`. Of course we do not have to use letters for the factor levels; if you have defined a couple of factors (say Fertilizer and Irrigation) you can use `levels(Fertilizer)` and `levels(Irrigation)` instead of LETTERS…
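For example, with a couple of hypothetical factors already defined (the level names here are made up for illustration):

```
# Using levels() of existing factors instead of LETTERS
Fertilizer = factor(c('Low', 'High'))
Irrigation = factor(c('None', 'Drip', 'Full'))

treatments = expand.grid(Fertilizer = levels(Fertilizer),
                         Irrigation = levels(Irrigation))
nrow(treatments)  # 2 * 3 = 6 combinations
```

This is handy for building treatment layouts or prediction grids that match the factors already in your dataset.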


## “Not in” in R

When processing data it is common to test if an observation belongs to a set. Let’s suppose that we want to see if the sample code belongs to a set that includes A, B, C and D. In R it is easy to write something like:

```
inside.set = subset(my.data, code %in% c('A', 'B', 'C', 'D'))
```

Now, what happens if what we want are the observations that are not in that set? Simple, we use the negation operator (!) as in:

```
outside.set = subset(my.data, !(code %in% c('A', 'B', 'C', 'D')))
```

In summary, surround the condition by !().
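If you use the negated test often, you can define a ‘not in’ operator once and reuse it (a common convenience, not part of base R; the `Hmisc` package ships a similar `%nin%`):

```
# A 'not in' binary operator built on %in%
'%nin%' = function(x, set) !(x %in% set)

codes = c('A', 'B', 'E', 'F')
codes %nin% c('A', 'B', 'C', 'D')  # FALSE FALSE TRUE TRUE
```

After that, `subset(my.data, code %nin% c('A', 'B', 'C', 'D'))` reads more naturally than the explicit negation.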


## A shoebox for data analysis

Recidivism. That’s my situation concerning this posting flotsam in/on/to the ether. I’ve tried before and, often, will change priorities after a few posts; I rationalize this process thinking that I’m cured and have moved on to real life.

This time may not be different but, simultaneously, it will be more focused: just a shoebox for the little pieces of code or ideas that I use when running analyses.