R pitfall #1: check data structure

A common problem when running a simple (or not so simple) analysis is forgetting that the levels of a factor have been coded using integers. R doesn’t know that this variable is supposed to be a factor, so when fitting something as simple as a one-way ANOVA (using lm()), the variable will be used as a covariate rather than as a factor.
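A minimal sketch of the problem, using simulated data (the data frame trial and its columns treatment and yield are made up for illustration and are not part of the example dataset used later):

```r
# Simulated data: three treatments coded 1, 2, 3 (hypothetical example)
set.seed(42)
trial <- data.frame(treatment = rep(1:3, each = 10),
                    yield = rnorm(30, mean = 10))

# R stores the integer codes as a numeric variable, not a factor
class(trial$treatment)

# Fitted as a covariate: a single slope, so 1 degree of freedom
as_covariate <- lm(yield ~ treatment, data = trial)
anova(as_covariate)

# Fitted as a factor: 3 levels, so 2 degrees of freedom
as_factor <- lm(yield ~ factor(treatment), data = trial)
anova(as_factor)
```

Both calls run without any warning, which is why the mistake is easy to miss; only the Df column of the two tables gives it away.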

There is a series of steps that I follow to make sure that I am using the right variables (and types) when running a series of analyses. I always define the working directory (using setwd()), so I know where the files that I am reading from and writing to are located.

After reading a dataset I have a look at the first and last few observations (using head() and tail(), which by default show 6 observations). This gives an idea of what the dataset looks like, but it doesn’t confirm the structure (for example, which variables are factors). The function str() provides a good overview of variable types, and together with summary() it gives an idea of ranges, numbers of observations and missing values.

# Define your working directory (folder). This will make
# your life easier. An example in OS X:
setwd('~/Documents/apophenia')

# and one for a Windows machine
setwd('c:/Documents/apophenia')

# Read the data
apo <- read.csv('apophenia-example.csv', header = TRUE)

# Have a look at the first few and last few observations
head(apo)
tail(apo)

# Check the structure of the data (which variables are numeric,
# which ones are factors, etc)
str(apo)

# Obtain a summary for each of the variables in the dataset
summary(apo)

This code should help you avoid the 'fitting factors as covariates' pitfall. Even so, always check the degrees of freedom in the ANOVA table just in case: a factor with k levels should use k - 1 degrees of freedom, whereas a covariate uses only one.
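As a rough sketch of that check, assuming a made-up data frame apo_demo with an integer-coded factor called block (both names are hypothetical, not columns of the example file), one could convert the variable and then confirm the degrees of freedom:

```r
# Hypothetical dataset: 'block' is a factor that was coded with integers
set.seed(1)
apo_demo <- data.frame(block = rep(1:4, times = 5),
                       response = rnorm(20))

# str() reports block as int, so convert it before fitting
str(apo_demo)
apo_demo$block <- factor(apo_demo$block)

# Fit the one-way ANOVA and extract the table
fit <- lm(response ~ block, data = apo_demo)
tab <- anova(fit)

# A factor with k levels should use k - 1 degrees of freedom
expected_df <- nlevels(apo_demo$block) - 1
stopifnot(tab["block", "Df"] == expected_df)
tab
```

Converting the column once in the data frame, rather than wrapping it in factor() inside every model formula, keeps str() and summary() consistent with what the models actually use.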

5 Comments

Kevin Wright

2011-10-17 at 16:09

I have a similar sequence of steps, plus one more:

library(Hmsic)
describe(apo)

Kevin Wright
2011-10-18 at 08:22

Small typo. In case it is not obvious, here is the correct code:

library(Hmisc)
describe(apo)

Luis (Post author)

2011-10-17 at 16:15

Hi Kevin, Nice to hear from you and thanks for the tip.

matthew gushta

2011-10-27 at 14:50

similar to kevin, though i prefer this:
library(psych)
describe(apo, skew=F)

also, as a native windows user who copy-pastes directories, i find it easier to add a slash than reverse direction:
setwd('c:Documentsapophenia')

Luis (Post author)
2011-10-27 at 15:59

Thanks Matthew. I used to do the double backslash, but retrained muscle memory to single slash (in OS X) within a week in early 2006. Part of your code was eaten by the commenting system:
setwd('c:\\Documents\\apophenia')
