Operating on datasets inside a function

There are times when we need to write a function that makes changes to a generic data frame that is passed as an argument. Let’s say, for example, that we want to write a function that converts to factor any variable with names starting with a capital letter. There are a few issues involved in this problem, including:

  • Obtaining a text version of the name of the dataset (using the substitute() function).
  • Looping over the variable names and checking if they start with a capital letter (comparing with the LETTERS vector of constants).
  • Generating the plain text version of the factor conversion, glueing the dataset and variable names (using paste()).
  • Parsing the plain text version of the code to R code (using parse()) and evaluating it (using eval()). This evaluation has to be done in the parent environment or we will lose any transformation when we leave the function, which is the reason for the envir() specification.

And that’s all. Now the Fert integer variable has been converted to a factor. This example function could be useful for someone out there.

A brief idea of style

Once one starts writing more R code the need for consistency increases, as it facilitates managing larger projects and their maintenance. There are several style guides or suggestions for R; for example, Andrew Gelman’s, Hadley Wickham’s, Bioconductor’s and this one. I tend to write closer to Google’s R style guide, which contains some helpful suggestions. I use something similar but:

  • I use = for assignment rather than <-, because it is visually less noisy, <- requires an extra keystroke (yes, I am that lazy) and—from a merely esthetics point of view—in many monospaced fonts the lower than and hyphen symbols do not align properly, so <- does not look like an arrow. I know that hardcore R programmers prefer the other symbol but, tough, I prefer the equal sign.
  • I indent code using four spaces, just because I am used to do so in Python. I will make an exception and go down to two spaces if there are too many nested clauses.
  • I like their identifier naming scheme, although I do not use it consistently. Mea culpa.
  • I always use single quotes for text (two fewer keystrokes per text variable).

Of course you’ll find that the examples presented in this site depart from the style guide. I didn’t say that I was consistent, did I?

All combinations for levelplot

In a previous post I explained how to create all possible combinations of the levels of two factors using expand.grid(). Another use for this function is to create a regular grid for two variables to create a levelplot or a contour plot.

For example, let’s say that we have fitted a multiple linear regression to predict wood stiffness (stiff, the response) using basic density (bd) and a measure of microfibril angle (t) as explanatory variables. The regression equation could be something like stiffness = 3.439 + 0.009 bd - 0.052 t. In our dataset bd had a range of 300 to 700 kg m-3, while t had a range from 50 to 70.

We will use the levelplot() function that is part of the lattice package of graphical functions, create a grid for both explanatory variables (every 10 for bd and every 1 for t), predict values of stiffness for all combinations of bd and t, and plot the results.

This code creates a graph like this. Simple.

Wood stiffness levelplot

Linear regression with correlated data

I started following the debate on differential minimum wage for youth (15-19 year old) and adults in New Zealand. Eric Crampton has written a nice series of blog posts, making the data from Statistics New Zealand available. I will use the nzunemployment.csv data file (with quarterly data from March 1986 to June 2011) and show an example of multiple linear regression with autocorrelated residuals in R.

A first approach could be to ignore autocorrelation and fit a linear model that attempts to predict youth unemployment with two explanatory variables: adult unemployment (continuous) and minimum wage rules (categorical: equal or different). This can be done using:

Remember that adult*minwage is expanded to adult + minwage + adult:minwage. We can make the coefficients easier to understand if we center adult unemployment on the mean of the first 80 quarters. Notice that we get the same slope, Adj-R2, etc. but now the intercept corresponds to the youth unemployment for the average adult unemployment before changing minimum wage rules. All additional analyses will use the centered version.

In the centered version, the intercept corresponds to youth unemployment when adult unemployment rate is 5.4 (average for the first 89 quarters). The coefficient of minwageEqual corresponds to the increase of youth unemployment (9.44%) when the law moved to have equal minimum wage for youth and adults. Notice that the slopes did not change at all.

I will use the function gls() from the nlme package (which comes by default with all R installations) to take into account the serial correlation. First we can fit a model equivalent to mod2, just to check that we get the same results.

Yes, they are identical. Notice that the model fitting is done using Restricted Maximum Likelihood (REML). Now we can add an autoregressive process of order 1 for the residuals and compare the two models:

There is a substantial improvement for the log likelihood (from -182 to -170). We can take into account the additional parameter (autocorrelation) that we are fitting by comparing AIC, which improved from 375.77 (-2*(-182.8861) + 2*5) to 368.52 (-2*(-170.5032) + 2*6). Remember that AIC is -2*logLikelihood + 2*number of parameters.

The file unemployment.txt contains the R code used in this post (I didn’t use the .R extension as WordPress complains).

R pitfall #1: check data structure

A common problem when running a simple (or not so simple) analysis is forgetting that the levels of a factor has been coded using integers. R doesn’t know that this variable is supposed to be a factor and when fitting, for example, something as simple as a one-way anova (using lm()) the variable will be used as a covariate rather than as a factor.

There is a series of steps that I follow to make sure that I am using the right variables (and types) when running a series of analyses. I always define the working directory (using setwd()), so I know where the files that I am reading from and writing to are.

After reading a dataset I will have a look at the first and last few observations (using head() and tail(), which by default show 6 observations). This gives you an idea of how the dataset looks like, but it doesn’t confirm the structure (for example, which variables are factors). The function str() provides a good overview of variable types and together with summary() one gets an idea of ranges, numbers of observations and missing values.

This code should help you avoid the ‘fitting factors as covariates’ pitfall; anyway, always check the degrees of freedom of the ANOVA table just in case.

All combinations of levels for two factors

There are circumstances when one wants to generate all possible combinations of levels for two factors. For example, factor one with levels ‘A’, ‘B’ and ‘C’, and factor two with levels ‘D’, ‘E’, ‘F’. The function expand.grid() comes very handy here:

[sourcecode language=”r”]
combo = expand.grid(factor1 = LETTERS[1:3],
factor2 = LETTERS[4:6])
combo

factor1 factor2
1 A D
2 B D
3 C D
4 A E
5 B E
6 C E
7 A F
8 B F
9 C F
[/sourcecode]

Omitting the variable names (factor1 and factor 2) will automatically name the variables as Var1 and Var2. Of course we do not have to use letters for the factor levels; if you have defined a couple of factors (say Fertilizer and Irrigation) you can use levels(Fertilizer) and levels(Irrigation) instead of LETTERS…

“Not in” in R

When processing data it is common to test if an observation belongs to a set. Let’s suppose that we want to see if the sample code belongs to a set that includes A, B, C and D. In R it is easy to write something like:

[sourcecode language=”r”]
inside.set = subset(my.data, code %in% c(‘A’, ‘B’, ‘C’, ‘D’))
[/sourcecode]

Now, what happens if what we want are the observations that are not in that set? Simple, we use the negation operator (!) as in:

[sourcecode language=”r”]
outside.set = subset(my.data, !(code %in% c(‘A’, ‘B’, ‘C’, ‘D’)))
[/sourcecode]

In summary, surround the condition by !().