Categories

## Covariance structures

In most mixed linear model packages (e.g. asreml, ed lme4, nlme, etc) one needs to specify only the model equation (the bit that looks like y ~ factors...) when fitting simple models. We explicitly say nothing about the covariances that complete the model specification. This is because most linear mixed model packages assume that, in absence of any additional information, the covariance structure is the product of a scalar (a variance component) by a design matrix. For example, the residual covariance matrix in simple models is R = I σe2, or the additive genetic variance matrix is G = A σa2 (where A is the numerator relationship matrix), or the covariance matrix for a random effect f with incidence matrix Z is ZZ σf2.

However, there are several situations when analyses require a more complex covariance structure, usually a direct sum or a Kronecker product of two or more matrices. For example, an analysis of data from several sites might consider different error variances for each site, that is R = Σd Ri, where Σd represents a direct sum and Ri is the residual matrix for site i.

Other example of a more complex covariance structure is a multivariate analysis in a single site (so the same individual is assessed for two or more traits), where both the residual and additive genetic covariance matrices are constructed as the product of two matrices. For example, R = IR0, where I is an identity matrix of size number of observations, ⊗ is the Kronecker product (do not confuse with a plain matrix multiplication) and R0 is the error covariance matrix for the traits involved in the analysis. Similarly, G = AG0 where all the matrices are as previously defined and G0 is the additive covariance matrix for the traits.

Some structures are easier to understand (at least for me) if we express a covariance matrix (M) as the product of a correlation matrix (C) pre- and post-multiplied by a diagonal matrix (D) containing standard deviations for each of the traits (M = D C D). That is:

$$M = left [ egin{array}{cccc} v_{11}& c_{12}& c_{13}& c_{14} \ c_{21}& v_{22}& c_{23}& c_{24} \ c_{31}& c_{32}& v_{33}& c_{34} \ c_{41}& c_{42}& c_{43}& v_{44} end{array} ight ]$$ $$C = left [ egin{array}{cccc} 1& r_{12}& r_{13}& r_{14} \ r_{21}& 1& r_{23}& r_{24} \ r_{31}& r_{32}& 1& r_{34} \ r_{41}& r_{42}& r_{43}& 1 end{array} ight ]$$ $$D = left [ egin{array}{cccc} s_{11}& 0& 0& 0 \ 0& s_{22}& 0& 0 \ 0& 0& s_{33}& 0 \ 0& 0& 0& s_{44} end{array} ight ]$$

where the v are variances, the r correlations and the s standard deviations.

If we do not impose any restriction on M, apart from being positive definite (p.d.), we are talking about an unstructured matrix (us() in asreml-R parlance). Thus, M or C can take any value (as long as it is p.d.) as it is usual when analyzing multiple trait problems.

There are cases when the order of assessment or the spatial location of the experimental units create patterns of variation, which are reflected by the covariance matrix. For example, the breeding value of an individual i observed at time j (aij) is a function of genes involved in expression at time j – k (aij-k), plus the effect of genes acting in the new measurement (αj), which are considered independent of the past measurement aij = ρjk aij-k + αj, where ρjk is the additive genetic correlation between measures j and k.

Rather than using a different correlation for each pair of ages, it is possible to postulate mechanisms which model the correlations. For example, an autoregressive model (ar() in asreml-R lingo), where the correlation between measurements j and k is r|j-k|. In this model M = D CAR D, where CAR (for equally spaced assessments) is:

$$C_{AR} = left [ egin{array}{cccc} 1 & r^{|t_2-t_1|} & ldots & r^{|t_m-t_1|}\ r^{|t_2-t_1|} & 1 & ldots & r^{|t_m-t_2|}\ vdots & vdots & ddots & vdots \ r^{|t_m-t_1|} & r^{|t_m-t_2|} & ldots & 1 end{array} ight ]$$

Assuming three different autocorrelation coefficients (0.95 solid line, 0.90 dashed line and 0.85 dotted line) we can get very different patterns with a few extra units of lag, as shown in the following graph:

A model including this structure will certainly be more parsimonious (economic on terms of number of parameters) than one using an unstructured approach. Looking at the previous pattern it is a lot easier to understand why they are called ‘structures’. A similar situation is considered in spatial analysis, where the ‘independent errors’ assumption of typical analyses is relaxed. A common spatial model will consider the presence of autocorrelated residuals in both directions (rows and columns). Here the level of autocorrelation will depend on distance between trees rather than on time. We can get an idea of how separable processes look like using this code:

# Separable row col autoregressive process
car2 = function(dim, rhor, rhoc) {
M = diag(dim)
rhor^(row(M) - 1) * rhoc^(col(M) - 1)
}

library(lattice)
levelplot(car2(20, 0.95, 0.85))


This correlation matrix can then be multiplied by a spatial residual variance to obtain the covariance and we can add up a spatially independent residual variance.

Much more detail on code notation for covariance structures can be found, for example, in the ASReml-R User Guide (PDF, chapter 4), for nlme in Pinheiro and Bates’s Mixed-effects models in S and S-plus (link to Google Books, chapter 5.3) and in Bates’s draft book for lme4 in chapter 4.

Categories

## Maximum likelihood

This post is one of those ‘explain to myself how things work’ documents, which are not necessarily completely correct but are close enough to facilitate understanding.

## Background

Let’s assume that we are working with a fairly simple linear model, where we only have a response variable (say tree stem diameter in cm). If we want to ‘guess’ the diameter for a tree (yi) our best bet is the average (μ) and we will have a residual (εi). The model equation then looks like:

$$y_i = mu + varepsilon_i$$

We want to estimate both the overall mean (μ) and the error variance ($$sigma_{varepsilon}^2$$), for which we could use least squares. However, we will use an alternative method (maximum likelihood) because that is the point of this post. A likelihood function expresses the probability of obtaining the observed sample from a population given a set of model parameters. Thus, it answers the question What is the probability of observing the current dataset, when I assume a given set of parameters for my linear model? To do so, we have to assume a distribution, say normal, for which the probability density function (p.d.f) is:

$$p.d.f. (y) = frac{1} {sqrt{2 pi} sigma} e^{-1/2 frac{(y- mu)^2} {sigma^2}}$$

with n independent samples the likelihood (L) is:

$$L = prod_{i=1}^n frac{1} {sqrt{2 pi} sigma} e^{-1/2 frac{(y- mu)^2} {sigma^2}}$$

where ∏ is a multiplication operator, analogous to the summation operator ∑. Maximizing L or its natural logarithm (LogL) is equivalent, then:

$latex LogL = sum_{i=1}^n logleft ( frac{1} {sqrt{2 pi} sigma} e^{-1/2 frac{(y- mu)^2} {sigma^2}} ight )$

$latex LogL= sum_{i=1}^n left [ logleft ( frac{1} {sqrt{2 pi} sigma} ight ) + logleft ( e^{-1/2 frac{(y- mu)^2} {sigma^2}} ight ) ight ]$

$latex LogL= sum_{i=1}^n log(2 pi)^{-1/2} + sum_{i=1}^n logleft (frac{1} {sigma} ight ) + sum_{i=1}^n left ( -1/2 frac{(y- mu)^2} {sigma^2} ight )$

$$LogL= -frac{n} {2} log(2 pi) – n log(sigma) – frac{1} {2 sigma^2} sum_{i=1}^n ( y – mu)^2$$

Considering only μ, LogL is maximum when $$sum_{i=1}^n ( y – mu)^2$$ is a minimum, i.e. for:

$$mu = frac{sum_{i=1}^n y_i}{n}$$

Now considering σ:

$$frac{delta LogL}{delta sigma} = -frac{n}{sigma} – (-2) frac{1}{2} sum_{i=1}^n (y – mu)^2 sigma^{-3}$$
$$frac{delta LogL}{delta sigma} = -frac{n}{sigma} + frac{1}{sigma^3} sum_{i=1}^n (y – mu)^2$$

and setting the last expression equal to 0:

$$frac{n}{sigma} = frac{1}{sigma^3} sum_{i=1}^n (y – mu)^2$$
$$sigma^2 = frac{sum_{i=1}^n (y – mu)^2}{n}$$

We can obtain an estimate of the error variance ($$hat{sigma}$$, notice the hat) by replacing μ by our previous estimate:
$$hat{sigma}^2 = frac{sum_{i=1}^n (y – ar{y})^2}{n}$$

which is biased, because the denominator is n rather than (n – 1) (the typical denominator for sample variance). This bias arises because maximum likelihood estimates do not take into account the loss of degrees of freedom when estimating fixed effects.

## Playing in R with an example

We have data for stem diameters (in mm) for twelve 10 year-old radiata pine (Pinus radiata D. Don) trees:

diams = c(150, 124, 100, 107, 170, 144,
113, 108, 92, 129, 123, 118)

# Obtain typical mean and standard deviation
mean(diams)

[1] 123.1667

sd(diams)
[1] 22.38438

# A simple function to calculate likelihood.
# Values quickly become very small
like = function(data, mu, sigma) {
like = 1
for(obs in data){
like = like * 1/(sqrt(2*pi)*sigma) *
exp(-1/2 * (obs - mu)^2/(sigma^2))
}
return(like)
}

# For example, likelihood of data if mu=120 and sigma=20
like(diams, 120, 20)
[1] 3.476384e-24

# A simple function to calculate log-likelihood.
# Values will be easier to manage
loglike = function(data, mu, sigma) {
loglike = 0
for(obs in data){
loglike = loglike +
log(1/(sqrt(2*pi)*sigma) *
exp(-1/2 * (obs - mu)^2/(sigma^2)))
}
return(loglike)
}

# Example, log-likelihood of data if mu=120 and sigma=20
loglike(diams, 122, 20)
[1] -53.88605

# Let's try some combinations of parameters and
# plot the results
params = expand.grid(mu = seq(50, 200, 1),
sigma = seq(10, 40, 1))
params$logL = with(params, loglike(diams, mu, sigma)) summary(params) library(lattice) contourplot(logL ~ mu*sigma, data = params, cuts = 20)  It is difficult to see the maximum likelihood in this plot, so we will zoom-in by generating a smaller grid around the typical mean and standard deviation estimates. # Zooming in params = expand.grid(mu = seq(120, 126, 0.01), sigma = seq(18, 25, 0.1)) params$logL = with(params, loglike(diams, mu, sigma))
summary(params)

contourplot(logL ~ mu*sigma, data = params, cuts = 10)


We can now check what would be actual results using any of the functions to fit models using maximum likelihood, for example gls:

library(nlme)
m1 = gls(diams ~ 1, method='ML')
summary(m1)

Generalized least squares fit by maximum likelihood
Model: diams ~ 1
Data: NULL
AIC      BIC    logLik
111.6111 112.5809 -53.80556

Coefficients:
Value Std.Error  t-value p-value
(Intercept) 123.1667  6.461815 19.06069       0

Residual standard error: 21.43142
Degrees of freedom: 12 total; 11 residual


In real life, software will use an iterative process to find the combination of parameters that maximizes the log-likelihood value. If we go back to our function and use loglike(diams, 123.1667, 21.43142) we will obtain -53.80556; exactly the same value calculated by gls. In addition, we can see that the estimated standard deviation (21.43) is slightly lower than the one calculated by the function sd() (22.38), because our biased estimate divides the sum of squared deviations by n rather than by n-1.

P.S. The code for the likelihood and log-likelihood functions is far from being optimal (the loops could be vectorized). However, the loops are easier to understand for many people.

Categories

## Simulating data following a given covariance structure

Every year there is at least a couple of occasions when I have to simulate multivariate data that follow a given covariance matrix. For example, let’s say that we want to create an example of the effect of collinearity when fitting multiple linear regressions, so we want to create one variable (the response) that is correlated with a number of explanatory variables and the explanatory variables have different correlations with each other.

There is a matrix operation called Cholesky decomposition, sort of equivalent to taking a square root with scalars, that is useful to produce correlated data. If we have a covariance matrix M, the Cholesky descomposition is a lower triangular matrix L, such as that M = L L'. How does this connect to our simulated data? Let’s assume that we generate a vector z of random normally independently distributed numbers with mean zero and variance one (with length equal to the dimension of M), we can create a realization of our multivariate distribution using the product L z.

The reason why this works is that the Variance(L z) = L Variance(z) L' as L is just a constant. The variance of z is the identity matrix I; remember that the random numbers have variance one and are independently distributed. Therefore Variance(L z) = L I L' = L L = M so, in fact, we are producing random data that follow the desired covariance matrix.

As an example, let’s simulate 100 observations with 4 variables. Because we want to simulate 100 realizations, rather than a single one, it pays to generate a matrix of random numbers with as many rows as variables to simulate and as many columns as observations to simulate.

library(lattice) # for splom
library(car)     # for vif

# number of observations to simulate
nobs = 100

# Using a correlation matrix (let' assume that all variables
# have unit variance
M = matrix(c(1, 0.7, 0.7, 0.5,
0.7, 1, 0.95, 0.3,
0.7, 0.95, 1, 0.3,
0.5, 0.3, 0.3, 1), nrow=4, ncol=4)

# Cholesky decomposition
L = chol(M)
nvars = dim(L)[1]

# R chol function produces an upper triangular version of L
# so we have to transpose it.
# Just to be sure we can have a look at t(L) and the
# product of the Cholesky decomposition by itself

t(L)

[,1]       [,2]        [,3]      [,4]
[1,]  1.0  0.0000000  0.00000000 0.0000000
[2,]  0.7  0.7141428  0.00000000 0.0000000
[3,]  0.7  0.6441288  0.30837970 0.0000000
[4,]  0.5 -0.0700140 -0.01589586 0.8630442

t(L) %*% L

[,1] [,2] [,3] [,4]
[1,]  1.0 0.70 0.70  0.5
[2,]  0.7 1.00 0.95  0.3
[3,]  0.7 0.95 1.00  0.3
[4,]  0.5 0.30 0.30  1.0

# Random variables that follow an M correlation matrix
r = t(L) %*% matrix(rnorm(nvars*nobs), nrow=nvars, ncol=nobs)
r = t(r)

rdata = as.data.frame(r)
names(rdata) = c('resp', 'pred1', 'pred2', 'pred3')

# Plotting and basic stats
splom(rdata)
cor(rdata)

# We are not that far from what we want to simulate!
resp     pred1     pred2     pred3
resp  1.0000000 0.7347572 0.7516808 0.3915817
pred1 0.7347572 1.0000000 0.9587386 0.2841598
pred2 0.7516808 0.9587386 1.0000000 0.2942844
pred3 0.3915817 0.2841598 0.2942844 1.0000000


Now we can use the simulated data to learn something about the effects of collinearity when fitting multiple linear regressions. We will first fit two models using two predictors with low correlation between them, and then fit a third model with three predictors where pred1 and pred2 are highly correlated with each other.

# Model 1: predictors 1 and 3 (correlation is 0.28)
m1 = lm(resp ~ pred1 + pred3, rdata)
summary(m1)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.07536    0.06812   1.106  0.27133
pred1        0.67316    0.06842   9.838 2.99e-16 ***
pred3        0.20920    0.07253   2.884  0.00483 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6809 on 97 degrees of freedom
Multiple R-squared: 0.5762,	Adjusted R-squared: 0.5675
F-statistic: 65.95 on 2 and 97 DF,  p-value: < 2.2e-16

# Model 2: predictors 2 and 3 (correlation is 0.29)
m2 = lm(resp ~ pred2 + pred3, rdata)
summary(m2)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06121    0.06649   0.921  0.35953
pred2        0.68513    0.06633  10.329  < 2e-16 ***
pred3        0.19624    0.07097   2.765  0.00681 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6641 on 97 degrees of freedom
Multiple R-squared: 0.5968,	Adjusted R-squared: 0.5885
F-statistic: 71.79 on 2 and 97 DF,  p-value: < 2.2e-16

# Model 3: correlation between predictors 1 and 2 is 0.96
m3 = lm(resp ~ pred1 + pred2 + pred3, rdata)
summary(m3)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06421    0.06676   0.962  0.33856
pred1        0.16844    0.22560   0.747  0.45712
pred2        0.52525    0.22422   2.343  0.02122 *
pred3        0.19584    0.07114   2.753  0.00706 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6657 on 96 degrees of freedom
Multiple R-squared: 0.5991,	Adjusted R-squared: 0.5866
F-statistic: 47.83 on 3 and 96 DF,  p-value: < 2.2e-16

# Variance inflation
vif(m3)

pred1     pred2     pred3
12.373826 12.453165  1.094875



In my example it is possible to see the huge increase for the standard error for pred1 and pred2, when we use both highly correlated explanatory variables in model 3. In addition, model fit does not improve for model 3.

Categories

## On R versus SAS

A short while ago there was a discussion on linkedin about the use of SAS versus R for the enterprise. I have thought a bit about the issue but, as I do not use Linkedin, I did not make any comments there.

Disclaimer: I did use SAS a lot between 1992 and 1997, mostly for genetic evaluation, heavily relying on BASE, STAT, IML and GRAPH. From that point on, I was a light SAS user (mostly STAT and IML) until 2009. The main reason I left SAS was that I started using ASReml in 1997 and, around two years ago asreml-R, the R package version of ASReml. Through my job I can access any statistical software; if the university does not have a license, I can buy an academic one without any issues.

I think it is important to make a distinction between enterprise use and huge datasets. Some companies have large datasets, but there probably are many companies that need to analyze large numbers of small to medium size datasets. If we accept this premise, there is room to use a diversity of statistical packages, including both SAS and R.

Another topic that often appears in the R vs. SAS discussion is cost. SAS licenses are not cheap, but for many large companies the cost of having expensive researchers with lower productivity while they learn another “free” system can be really high. Same issue applies if there are legacy programs: converting software to a new system can be expensive and time consuming. Of course this situation is changing: new graduates are being exposed much more to R than to SAS in many departments. We now use R in many courses and students may end up working in a small company that will be happy not to spend any money to pay for a SAS license.

• There is good integration between the programming language and the statistical functions. Both SAS macros and IML are poorly integrated with the data step and procs.
• R is highly conducive to exploratory data analysis; visualization functions (either the lattice or the ggplot 2 packages) produce high quality plots that really help developing ideas to build models.
• Statistics is not defined by the software. If someone develops a new methodology or algorithm chances are that there will be an R implementation almost immediately. If I want to test a new idea I can scramble to write some code that connects packages developed by other researchers.
• It is relatively easy to integrate R with other languages, for example Python, to glue a variety of systems.
• asreml-r!
• I can exchange ideas with a huge number of people, because slowly R is becoming the de facto standard for many disciplines that make use of statistics.

Of course R has many drawbacks when compared to SAS; for example:

• The default editor in the Windows version is pathetic, while the one in OS X is pasable (code folding and proper refreshing would be great additions).
• R syntax can be horribly inconsistent across packages, making the learning process more difficult.
• There are many, too many, ways of doing the same thing, which can be confusing, particularly for newbies. For example, summarizing data by combinations of factors could be done using aggregate, summarize (from Hmisc), functions of the apply family, doBy, etc. Compare this situation to proc means.

No, I did not mention technical support (which I find a non-issue), access to large data sets (it is possible to integrate R with databases and ongoing work to process data that can’t fit in memory) or documentation. Concerning the latter, it would be helpful to have better R documentation, but SAS would also benefit from better manuals. There has been a huge number of books using R published recently and the documentation gap is closing. R would benefit of having good canonical documentation, something that all users could see first as the default documentation. The documentation included with the system is, how to call it, Spartan, and sometimes plain useless and confusing. A gigantic link to a searchable version of the R users email list from the main R project page would be great.

Categories

## Linear regression with correlated data

I started following the debate on differential minimum wage for youth (15-19 year old) and adults in New Zealand. Eric Crampton has written a nice series of blog posts, making the data from Statistics New Zealand available. I will use the nzunemployment.csv data file (with quarterly data from March 1986 to June 2011) and show an example of multiple linear regression with autocorrelated residuals in R.

A first approach could be to ignore autocorrelation and fit a linear model that attempts to predict youth unemployment with two explanatory variables: adult unemployment (continuous) and minimum wage rules (categorical: equal or different). This can be done using:

setwd('~/Dropbox/quantumforest')

# Create factor for minimum wage, which was different for youth
# and adults before quarter 90 (June 2008)
un$minwage = factor(ifelse(un$q < 90, 'Different', 'Equal'))

mod1 = lm(youth ~ adult*minwage, data = un)
summary(mod1)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)         8.15314    0.43328  18.817  < 2e-16 ***
adult               1.53334    0.07506  20.428  < 2e-16 ***
minwageEqual       -5.69192    2.19356  -2.595   0.0109 *
adult:minwageEqual  2.85518    0.46197   6.180 1.47e-08 ***

Residual standard error: 1.447 on 98 degrees of freedom
Multiple R-squared: 0.8816,	Adjusted R-squared: 0.878
F-statistic: 243.3 on 3 and 98 DF,  p-value: < 2.2e-16


Remember that adult*minwage is expanded to adult + minwage + adult:minwage. We can make the coefficients easier to understand if we center adult unemployment on the mean of the first 80 quarters. Notice that we get the same slope, Adj-R2, etc. but now the intercept corresponds to the youth unemployment for the average adult unemployment before changing minimum wage rules. All additional analyses will use the centered version.

un$cadult = with(un, adult - mean(adult)) mod2 = lm(youth ~ cadult*minwage, data = un) summary(mod2) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 16.28209 0.15352 106.06 < 2e-16 *** cadult 1.53334 0.07506 20.43 < 2e-16 *** minwageEqual 9.44472 0.52629 17.95 < 2e-16 *** cadult:minwageEqual 2.85518 0.46197 6.18 1.47e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.447 on 98 degrees of freedom Multiple R-squared: 0.8816, Adjusted R-squared: 0.878 F-statistic: 243.3 on 3 and 98 DF, p-value: < 2.2e-16 plot(mod2) # Plots residuals for the model fit acf(mod2$res) # Plots autocorrelation of the residuals


In the centered version, the intercept corresponds to youth unemployment when adult unemployment rate is 5.4 (average for the first 89 quarters). The coefficient of minwageEqual corresponds to the increase of youth unemployment (9.44%) when the law moved to have equal minimum wage for youth and adults. Notice that the slopes did not change at all.

I will use the function gls() from the nlme package (which comes by default with all R installations) to take into account the serial correlation. First we can fit a model equivalent to mod2, just to check that we get the same results.

library(nlme)
mod3 = gls(youth ~ cadult*minwage, data = un)
summary(mod3)

Generalized least squares fit by REML
Model: youth ~ cadult * minwage
Data: un
AIC      BIC    logLik
375.7722 388.6971 -182.8861

Coefficients:
Value Std.Error   t-value p-value
(Intercept)         16.282089 0.1535237 106.05585       0
minwageEqual         9.444719 0.5262926  17.94576       0

Correlation:
minwageEqual        -0.292  0.014

Standardized residuals:
Min          Q1         Med          Q3         Max
-2.96256631 -0.53975848 -0.02071559  0.63499262  2.35900240

Residual standard error: 1.446696
Degrees of freedom: 102 total; 98 residual



Yes, they are identical. Notice that the model fitting is done using Restricted Maximum Likelihood (REML). Now we can add an autoregressive process of order 1 for the residuals and compare the two models:

mod4 = gls(youth ~ cadult*minwage,
correlation = corAR1(form=~1), data = un)
summary(mod4)

Generalized least squares fit by REML
Model: youth ~ cadult * minwage
Data: un
AIC      BIC    logLik
353.0064 368.5162 -170.5032

Correlation Structure: AR(1)
Formula: ~1
Parameter estimate(s):
Phi
0.5012431

Coefficients:
Value Std.Error  t-value p-value
(Intercept)         16.328637 0.2733468 59.73598       0
minwageEqual         9.082626 0.8613543 10.54459       0

Correlation:
minwageEqual        -0.318  0.007

Standardized residuals:
Min          Q1         Med          Q3         Max
-2.89233359 -0.55460580 -0.02419759  0.55449166  2.29571080

Residual standard error: 1.5052
Degrees of freedom: 102 total; 98 residual

anova(mod3, mod4)
Model df      AIC      BIC    logLik   Test L.Ratio p-value
mod3     1  5 375.7722 388.6971 -182.8861
mod4     2  6 353.0064 368.5162 -170.5032 1 vs 2 24.7658  <.0001


There is a substantial improvement for the log likelihood (from -182 to -170). We can take into account the additional parameter (autocorrelation) that we are fitting by comparing AIC, which improved from 375.77 (-2*(-182.8861) + 2*5) to 368.52 (-2*(-170.5032) + 2*6). Remember that AIC is -2*logLikelihood + 2*number of parameters.

The file unemployment.txt contains the R code used in this post (I didn't use the .R extension as WordPress complains).

Categories

## R pitfall #1: check data structure

A common problem when running a simple (or not so simple) analysis is forgetting that the levels of a factor has been coded using integers. R doesn’t know that this variable is supposed to be a factor and when fitting, for example, something as simple as a one-way anova (using lm()) the variable will be used as a covariate rather than as a factor.

There is a series of steps that I follow to make sure that I am using the right variables (and types) when running a series of analyses. I always define the working directory (using setwd()), so I know where the files that I am reading from and writing to are.

After reading a dataset I will have a look at the first and last few observations (using head() and tail(), which by default show 6 observations). This gives you an idea of how the dataset looks like, but it doesn’t confirm the structure (for example, which variables are factors). The function str() provides a good overview of variable types and together with summary() one gets an idea of ranges, numbers of observations and missing values.

# Define your working directory (folder). This will make
# your life easier. An example in OS X:
setwd('~/Documents/apophenia')

# and one for a Windows machine
setwd('c:/Documents/apophenia')

# Have a look at the first few and last few observations
tail(apo)

# Check the structure of the data (which variables are numeric,
# which ones are factors, etc)
str(apo)

# Obtain a summary for each of the variables in the dataset
summary(apo)
`

This code should help you avoid the ‘fitting factors as covariates’ pitfall; anyway, always check the degrees of freedom of the ANOVA table just in case.

Categories

## A shoebox for data analysis

Recidivism. That’s my situation concerning this posting flotsam in/on/to the ether. I’ve tried before and, often, will change priorities after a few posts; I rationalize this process thinking that I’m cured and have moved on to real life.

This time may not be different but, simultaneously, it will be more focused: just a shoebox for the little pieces of code or ideas that I use when running analyses.