Categories

## Linear regression with correlated data

I started following the debate on differential minimum wage for youth (15-19 year old) and adults in New Zealand. Eric Crampton has written a nice series of blog posts, making the data from Statistics New Zealand available. I will use the nzunemployment.csv data file (with quarterly data from March 1986 to June 2011) and show an example of multiple linear regression with autocorrelated residuals in R.

A first approach could be to ignore autocorrelation and fit a linear model that attempts to predict youth unemployment with two explanatory variables: adult unemployment (continuous) and minimum wage rules (categorical: equal or different). This can be done using:

```setwd('~/Dropbox/quantumforest')

# Create factor for minimum wage, which was different for youth
# and adults before quarter 90 (June 2008)
un\$minwage = factor(ifelse(un\$q < 90, 'Different', 'Equal'))

mod1 = lm(youth ~ adult*minwage, data = un)
summary(mod1)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)         8.15314    0.43328  18.817  < 2e-16 ***
adult               1.53334    0.07506  20.428  < 2e-16 ***
minwageEqual       -5.69192    2.19356  -2.595   0.0109 *
adult:minwageEqual  2.85518    0.46197   6.180 1.47e-08 ***

Residual standard error: 1.447 on 98 degrees of freedom
Multiple R-squared: 0.8816,	Adjusted R-squared: 0.878
F-statistic: 243.3 on 3 and 98 DF,  p-value: < 2.2e-16
```

Remember that `adult*minwage` is expanded to `adult + minwage + adult:minwage`. We can make the coefficients easier to understand if we center adult unemployment on the mean of the first 80 quarters. Notice that we get the same slope, Adj-R2, etc. but now the intercept corresponds to the youth unemployment for the average adult unemployment before changing minimum wage rules. All additional analyses will use the centered version.

```un\$cadult = with(un, adult - mean(adult))
mod2 = lm(youth ~ cadult*minwage, data = un)
summary(mod2)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)         16.28209    0.15352  106.06  < 2e-16 ***
cadult               1.53334    0.07506   20.43  < 2e-16 ***
minwageEqual         9.44472    0.52629   17.95  < 2e-16 ***
cadult:minwageEqual  2.85518    0.46197    6.18 1.47e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.447 on 98 degrees of freedom
Multiple R-squared: 0.8816,	Adjusted R-squared: 0.878
F-statistic: 243.3 on 3 and 98 DF,  p-value: < 2.2e-16

plot(mod2)    # Plots residuals for the model fit
acf(mod2\$res) # Plots autocorrelation of the residuals
```

In the centered version, the intercept corresponds to youth unemployment when adult unemployment rate is 5.4 (average for the first 89 quarters). The coefficient of minwageEqual corresponds to the increase of youth unemployment (9.44%) when the law moved to have equal minimum wage for youth and adults. Notice that the slopes did not change at all.

I will use the function `gls()` from the nlme package (which comes by default with all R installations) to take into account the serial correlation. First we can fit a model equivalent to mod2, just to check that we get the same results.

```library(nlme)
mod3 = gls(youth ~ cadult*minwage, data = un)
summary(mod3)

Generalized least squares fit by REML
Model: youth ~ cadult * minwage
Data: un
AIC      BIC    logLik
375.7722 388.6971 -182.8861

Coefficients:
Value Std.Error   t-value p-value
(Intercept)         16.282089 0.1535237 106.05585       0
minwageEqual         9.444719 0.5262926  17.94576       0

Correlation:
minwageEqual        -0.292  0.014

Standardized residuals:
Min          Q1         Med          Q3         Max
-2.96256631 -0.53975848 -0.02071559  0.63499262  2.35900240

Residual standard error: 1.446696
Degrees of freedom: 102 total; 98 residual

```

Yes, they are identical. Notice that the model fitting is done using Restricted Maximum Likelihood (REML). Now we can add an autoregressive process of order 1 for the residuals and compare the two models:

```mod4 = gls(youth ~ cadult*minwage,
correlation = corAR1(form=~1), data = un)
summary(mod4)

Generalized least squares fit by REML
Model: youth ~ cadult * minwage
Data: un
AIC      BIC    logLik
353.0064 368.5162 -170.5032

Correlation Structure: AR(1)
Formula: ~1
Parameter estimate(s):
Phi
0.5012431

Coefficients:
Value Std.Error  t-value p-value
(Intercept)         16.328637 0.2733468 59.73598       0
minwageEqual         9.082626 0.8613543 10.54459       0

Correlation:
minwageEqual        -0.318  0.007

Standardized residuals:
Min          Q1         Med          Q3         Max
-2.89233359 -0.55460580 -0.02419759  0.55449166  2.29571080

Residual standard error: 1.5052
Degrees of freedom: 102 total; 98 residual

anova(mod3, mod4)
Model df      AIC      BIC    logLik   Test L.Ratio p-value
mod3     1  5 375.7722 388.6971 -182.8861
mod4     2  6 353.0064 368.5162 -170.5032 1 vs 2 24.7658  <.0001
```

There is a substantial improvement for the log likelihood (from -182 to -170). We can take into account the additional parameter (autocorrelation) that we are fitting by comparing AIC, which improved from 375.77 (-2*(-182.8861) + 2*5) to 368.52 (-2*(-170.5032) + 2*6). Remember that AIC is -2*logLikelihood + 2*number of parameters.

The file unemployment.txt contains the R code used in this post (I didn't use the .R extension as WordPress complains).

Categories

## R pitfall #1: check data structure

A common problem when running a simple (or not so simple) analysis is forgetting that the levels of a factor has been coded using integers. R doesn’t know that this variable is supposed to be a factor and when fitting, for example, something as simple as a one-way anova (using `lm()`) the variable will be used as a covariate rather than as a factor.

There is a series of steps that I follow to make sure that I am using the right variables (and types) when running a series of analyses. I always define the working directory (using `setwd()`), so I know where the files that I am reading from and writing to are.

After reading a dataset I will have a look at the first and last few observations (using `head()` and `tail()`, which by default show 6 observations). This gives you an idea of how the dataset looks like, but it doesn’t confirm the structure (for example, which variables are factors). The function `str()` provides a good overview of variable types and together with `summary()` one gets an idea of ranges, numbers of observations and missing values.

```# Define your working directory (folder). This will make
# your life easier. An example in OS X:
setwd('~/Documents/apophenia')

# and one for a Windows machine
setwd('c:/Documents/apophenia')

# Have a look at the first few and last few observations
tail(apo)

# Check the structure of the data (which variables are numeric,
# which ones are factors, etc)
str(apo)

# Obtain a summary for each of the variables in the dataset
summary(apo)
```

This code should help you avoid the ‘fitting factors as covariates’ pitfall; anyway, always check the degrees of freedom of the ANOVA table just in case.

Categories

## All combinations of levels for two factors

There are circumstances when one wants to generate all possible combinations of levels for two factors. For example, factor one with levels ‘A’, ‘B’ and ‘C’, and factor two with levels ‘D’, ‘E’, ‘F’. The function `expand.grid()` comes very handy here:

[sourcecode language=”r”]
combo = expand.grid(factor1 = LETTERS[1:3],
factor2 = LETTERS[4:6])
combo

factor1 factor2
1 A D
2 B D
3 C D
4 A E
5 B E
6 C E
7 A F
8 B F
9 C F
[/sourcecode]

Omitting the variable names (factor1 and factor 2) will automatically name the variables as Var1 and Var2. Of course we do not have to use letters for the factor levels; if you have defined a couple of factors (say Fertilizer and Irrigation) you can use `levels(Fertilizer)` and `levels(Irrigation)` instead of LETTERS…

Categories

## “Not in” in R

When processing data it is common to test if an observation belongs to a set. Let’s suppose that we want to see if the sample code belongs to a set that includes A, B, C and D. In R it is easy to write something like:

[sourcecode language=”r”]
inside.set = subset(my.data, code %in% c(‘A’, ‘B’, ‘C’, ‘D’))
[/sourcecode]

Now, what happens if what we want are the observations that are not in that set? Simple, we use the negation operator (!) as in:

[sourcecode language=”r”]
outside.set = subset(my.data, !(code %in% c(‘A’, ‘B’, ‘C’, ‘D’)))
[/sourcecode]

In summary, surround the condition by !().