R is a language

2012-01-11 / Luis

A commenter on this blog reminded me of one of the frustrating aspects faced by newbies, not only to R but to any other programming environment (I am thinking of typical students doing stats for the first time). The statement “R is a language” sounds perfectly harmless if you have previous exposure to programming. However, if you come from a zero-programming background the question is: what do you really mean?

R—and many other statistical systems for that matter—is often approached from either one of two extremes:

Explaining R as a programming language, as most of the R documentation and books (like The Art of R Programming, quite good by the way) do.
Presenting a hodgepodge of statistical analyses and introducing the language as a bonus, best represented by Crawley’s The R Book (which I find close to unreadable). In contrast, Modern Applied Statistics with S by Venables and Ripley is much better, even though it doesn’t mention R in the title^†.

If you are new to both statistics and R, I like the level of the Quick-R website as a starting point; it was expanded into a book (R in Action). It uses the second approach listed above, so if you come from a programming background the book will most likely be disappointing. Nevertheless, if you come from a newbie point of view both the website and the book are great resources. Even so, Quick-R assumes that the reader is familiar with statistics and starts with “R is an elegant and comprehensive statistical and graphical programming language”.

A simpler starting point

I would like to start from an even simpler point, ignoring programming for a moment and thinking about languages like English, Spanish, etc. In languages we have things (nouns) and actions (verbs)^‡. We perform actions on things: we measure a tree, draw a plot, make some assumptions, estimate coefficients, etc. In R we use functions to perform actions on objects (the things of the previous explanation). Our data sets are objects that we read, write and fit models to (creating new objects with the results), etc. “R is a language” means that we have a grammar that is designed to deal with data from a statistical point of view.

A simple sentence, “Luis writes Quantum Forest”, has two objects (Luis and Quantum Forest) and one function (writes). Now let's look at some simple objects in R; for example, a number, a string of characters and a collection of numbers (the latter using the function c() to keep the numbers together):

> 23
[1] 23

> "Luis"
[1] "Luis"

> c(23, 42, pi)
[1] 23.000000 42.000000  3.141593
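
To make the noun and verb analogy concrete, here is a minimal sketch that applies a few built-in functions (verbs) to the same kinds of objects (nouns). The choice of nchar(), length() and mean() is mine, purely for illustration; any function would make the same point.

```r
# Functions (verbs) acting on objects (nouns)
nchar("Luis")          # counts the characters in a string: 4
length(c(23, 42, pi))  # counts the elements in a collection: 3
mean(c(23, 42, pi))    # averages the elements: about 22.71
```

Each line reads like a tiny sentence: the function says what to do and the object inside the parentheses says what to do it to.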

Up to this point we have pretty boring software, but things become more interesting once we can assign objects to names, so we can keep acting on those objects (using functions). In this blog I use = instead of <- to assign things (objects) to names outside function calls. This is considered "bad form" in the R world, but to me it is much more readable^§. (Inside function calls, arguments should always be passed with an = sign, as we'll see in a future post.) Anyway, if you are in a conformist mood, replace the = by <- and the code will work equally well.

> sex <- 23

> Sex <- "Luis"

> SEX <- c(23, 42, pi)

R is case sensitive, meaning that upper- and lower-case letters are considered different. Thus, we can assign different objects to variables named sex, Sex and SEX and R will keep track of them as separate entities (A Kama Sutra of names!). Once objects are assigned to a variable R stops printing the object back to the user. However, it is possible to type the object name, press enter and get the content stored in the name. For example:

> sex
[1] 23

> SEX
[1] 23.000000 42.000000  3.141593
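
As a small sketch of what case sensitivity means in practice, the block below asks R what each of the three names holds. The functions class(), length() and exists() are standard R; the never-assigned name sEx is a made-up example.

```r
sex <- 23
Sex <- "Luis"
SEX <- c(23, 42, pi)

# Each name points to a different object
class(sex)   # "numeric"
class(Sex)   # "character"
length(SEX)  # 3

# A spelling that was never assigned does not exist, whatever its case
exists("sEx")  # FALSE
```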

The lowly <- sign is a function as well. For example, both a and b are assigned the same bunch of numbers:

> a <- c(23 , 42, pi)
> a
[1] 23.000000 42.000000  3.141593

# Is equivalent to
> assign('b', c(23 , 42, pi))
> b
[1] 23.000000 42.000000  3.141593

Even referring to an object by its name calls a function: print(), which is why we get [1] 23.000000 42.000000 3.141593 when typing b and hitting enter in R.
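
The idea that assignment and printing are functions can be checked directly. The following is only a sketch: it calls the assignment operator with ordinary function syntax (the backticks are needed because <- is not a regular name) and then prints explicitly. For a plain numeric vector like this one, the explicit print() shows the same thing as typing the name at the prompt.

```r
# Calling the assignment operator like any other function
`<-`(a, c(23, 42, pi))

# assign() takes the name as a character string instead
assign("b", c(23, 42, pi))

# Both routes stored the same object
identical(a, b)  # TRUE

# Explicit version of typing b and pressing enter
print(b)
```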

Grouping objects

Robert Kabacoff has a nice invited post explaining data frames. Here I will present a very rough explanation with a toy example.

Objects can be collected in other objects and assigned a name. In data analysis we tend to collect several variables (for example tree height and stem diameter, people's age and income, etc). It is convenient to keep variables referring to the same units (trees, persons) together in a table. If you have used Excel, a rough approximation would be a spreadsheet. Our toy example could be like:

> x <- c(1, 3, 4, 6, 8, 9)
> y <- c(10, 11, 15, 17, 17, 20)

> toy <- data.frame(x, y)
> toy
  x  y
1 1 10
2 3 11
3 4 15
4 6 17
5 8 17
6 9 20

The line toy <- data.frame(x, y) combines two objects (x and y) into an R table using the function data.frame() and then assigns the name "toy" to that table (using the assignment function, <- in the listing above). From now on we can refer to that data table when using other functions, as in:

# Getting descriptive statistics using summary()
> summary(toy)

      x               y
Min.   :1.000   Min.   :10
1st Qu.:3.250   1st Qu.:12
Median :5.000   Median :16
Mean   :5.167   Mean   :15
3rd Qu.:7.500   3rd Qu.:17
Max.   :9.000   Max.   :20

# Obtaining the names of the variables in
# a data frame
> names(toy)
[1] "x" "y"

# Or actually doing some stats. Here fitting
# a linear regression model
> fm1 <- lm(y ~ x, data = toy)

Incidentally, anything following a # is a comment, which helps users document their work. Use them liberally.
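
The same pattern (a function applied to a named object) answers other questions about the toy table. This is a minimal sketch rather than a complete tour; it rebuilds toy so that it runs on its own, and the functions used (nrow(), ncol(), mean()) are standard R.

```r
x <- c(1, 3, 4, 6, 8, 9)
y <- c(10, 11, 15, 17, 17, 20)
toy <- data.frame(x, y)

nrow(toy)    # number of units (rows): 6
ncol(toy)    # number of variables (columns): 2
toy$y        # a single variable, pulled out of the table with $
mean(toy$y)  # 15, the same Mean that summary(toy) reports for y
```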

Fitting a linear regression will produce lots of different outputs: estimated regression coefficients, fitted values, residuals, etc. Thus, it is very handy to assign the results of the regression to a name (in this case "fm1") for further manipulation. For example:

# Obtaining the names of objects contained in the
# fm1 object
> names(fm1)
[1] "coefficients"  "residuals"     "effects"       "rank"
[5] "fitted.values" "assign"        "qr"            "df.residual"
[9] "xlevels"       "call"          "terms"         "model"

# We can access individual objects using the notation
# objectName$components

# Obtaining the intercept and slope
> fm1$coefficients
(Intercept)           x
   8.822064    1.195730

# Fitted (predicted) values
> fm1$fitted.values
       1        2        3        4        5        6
10.01779 12.40925 13.60498 15.99644 18.38790 19.58363

# Residuals
> fm1$residuals
          1           2           3           4           5           6
-0.01779359 -1.40925267  1.39501779  1.00355872 -1.38790036  0.41637011

# We can also use functions to extract components from an object
# as in the following graph
> plot(resid(fm1) ~ fitted(fm1))

The last line of code extracts the residuals of fm1 (using the function resid()) and the fitted values (using the function fitted()), which are then plotted against each other using the function plot().
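
To see how the pieces stored in fm1 relate to each other, the sketch below rebuilds the fitted values by hand from the intercept and slope. It refits the model so the block is self-contained; coef() is the standard extractor counterpart of fm1$coefficients, and the names b and by_hand are mine.

```r
x <- c(1, 3, 4, 6, 8, 9)
y <- c(10, 11, 15, 17, 17, 20)
toy <- data.frame(x, y)
fm1 <- lm(y ~ x, data = toy)

# Extract intercept and slope with a function instead of $
b <- coef(fm1)

# Fitted values by hand: intercept + slope * x
by_hand <- b["(Intercept)"] + b["x"] * toy$x
all.equal(unname(by_hand), unname(fitted(fm1)))  # TRUE

# Observed values are the fitted values plus the residuals
all.equal(unname(fitted(fm1) + resid(fm1)), toy$y)  # TRUE
```

all.equal() is used instead of == because the comparison involves decimal numbers, where tiny rounding differences are expected.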

In summary: in this introduction we relied on the R language to manipulate objects using functions. By assigning names to objects (and to the results of applying functions) we can continue processing data and improving our understanding of the problem under study.

Footnotes

^† R is an implementation (a dialect) of the S language. But remember, a language is a dialect with an army and a navy.

^‡ Natural languages tend to be more complex and will have pronouns, articles, adjectives, etc. Let's ignore that for the moment.

^§ Languages change; for example, I speak Spanish, a bad form of Latin, together with hundreds of millions of people. Who speaks Latin today?


9 Comments

Robert Young
2012-01-12 at 14:07

There are two problems with R, and texts attempting to teach it (schizophrenic, one might say):
R is both a programming language and a command language. For most statistical analysis, a command language (BMD to Stata and such) is sufficient.
R is not an object-oriented or functional language, despite authors’ attempts to drape the language in such finery. R is just Fortran/C with a pretty pink dress. In no other presumed OO language I’ve used is an object defined as merely a C struct, as is the case with R. An object is not just a data lump, but that is how R folks have chosen to redefine the term. An object is data and function in a unified whole. Languages from Smalltalk to Java recognize this. R is simple function/data, the paradigm going back to Fortran. Trying to bend the language into some kind of OO paradigm just confuses, and irritates, those who have used OO languages. Stat folks who learn R as an “OO” language will find life quite difficult if they take the mindset to a real one.

Since R is also a command processor language, object management is crufty, thus setClass(). In real OO languages, objects map to reasonable file system constructs. It’s not just syntax, but semantics.

The R core folks really need to decide which to abandon: the command syntax or the programming syntax. Doing so will define the semantics (I know, this is backwards) of the result.
- Luis (Post author)
  2012-01-12 at 16:21
  
  Hi Robert,
  
  Personally I do not worry much about the specifics of the implementation of R (or any other language), as I come more from the statistics than the programming side of things. My main point is to explain to non-programmers how to deal with the language and my use of objects refers to things (rather than OO) and function to action (rather than functional programming).
- Carl
  2012-07-26 at 07:24
  
  >R is just Fortran/C with a pretty pink dress.
  
  No, R is lisp with a pretty pink dress. Fortran/C have no parallels to many of the constructs that R borrowed from lisp – e.g. lambda expressions, or dynamic typing. Even that statement is going to piss off a lisp programmer since R is most definitely not lisp, but from what I’ve heard lisp was a major influence on S, and in turn R.
Henrik
2012-01-13 at 00:21

I just want to comment on your statement regarding = and <-.
It is not only a question of style or conformity that leads to this recommendation to assign variables with <- instead of =.
One should use = when mapping values to arguments and f f(a = 2)
[1] 4
> ls()
[1] “f”
> f(a ls()
[1] “a” “f”
> a
[1] 2
> rm(a)
> ls()
[1] “f”
> f(a = a ls()
[1] “a” “f”
> f(a = a = 2)
Error: unexpected ‘=’ in “f(a = a =”

As one can see, inside a function call = and <- behave quite differently. Therefore, it is considered good style to use = when mapping arguments to values and <- when assigning variables.
- Henrik
  2012-01-13 at 00:23
  
  wordpress kind of messed up the code. The last sentence before the code should be:
  One should use = when mapping values to arguments and <- when assigning variables.
  
  The code should be (hopefully now better):
  
  > f f(a = 2) [1] 4 > ls() [1] "f" > f(a ls() [1] "a" "f" > a [1] 2 > rm(a) > ls() [1] "f" > f(a = a ls() [1] "a" "f" > f(a = a = 2) Error: unexpected '=' in "f(a = a ="
  - Henrik
    2012-01-13 at 00:27
    
    aargh, again.
    the important second call to f is (hopefully this time):
    f(a<-2)
    - Luis (Post author)
      2012-01-13 at 01:10
      
      Hi Henrik,
      
      I understand that one needs to use = to map values to function arguments. However, your code is not clear to me. Is f any generic function? How does it take the value 4 at the beginning?
      - Henrik
        2012-01-13 at 01:27
        
        ahh, I hate wordpress for always destroying R code and giving me no chance to correct it in a preview. However, another try:
        
        First I define f (which was obviously missing).
        Then I show that there is a difference in calling f with either = or <-. Specifically, <- leads to mapping via position and creates a variable in the global environment, whereas = simply maps arguments by name. In the end I show that you can indeed combine = with <- in a call but not = with =.
        
        # define the function
        f <- function(a) a^2
        # call with = just calls the function
        f(a = 2)
        # call with <- creates a corresponding object in the global environment
        f(a <- 2)
        # = and <- works (maps via name and creates an object)
        f(a = a <- 2)
        # = and = crashes
        f(a = a = 2)
        
        This obviously is more important or problematic if you have a function with more than one argument (e.g., f(a,b)) and you compare f(b=2, a=3) with f(b<-2,a<-3).
        
        However, the punchline remains: There are differences between = and <- inside function calls. Therefore it is always good (and makes the behavior clear in all cases) to always use <- for assignments.
Luis (Post author)
2012-01-13 at 11:32

Dear Henrik,

I have slightly rephrased my explanation in the original post, so it is clearer that I’m advocating the use of = both inside and outside function calls and that people should use only = inside a function call.

You correctly point out that the meanings of = and <- inside a function call are different. In my opinion, using a global environment assignment (<-) inside a function call would be very bad programming practice; actually, it would be perverse, producing difficult-to-debug code that I’d never use.

(I’m not replying directly to your latest comment because WordPress does not allow further comment nesting)