Learning to code in R

It used to be that one of the first decisions to make when learning to program was between compiled (e.g. C or FORTRAN) and interpreted (e.g. Python) languages. In my opinion, these days one would have to be a masochist to learn with a compiled language: the extra compilation time and obscure errors are a killer when learning.

Today the decision would be between using a generic interpreted language (e.g. Python) and an interpreted domain specific language (DSL) like R, MATLAB, etc. While some people prefer generic languages, I’d argue that immediate feedback and easy accomplishment of useful tasks are a great thing when one is learning something for the first time.

As an example, a while ago my son asked me what I was doing on the computer and I told him that I was programming some analyses in R. I showed him that the program was spitting back some numbers and plots, a fact that he found totally unremarkable and uninteresting. I searched the internet and found Scratch, a visual programming language that lets the user move blocks representing code around and build interactive games: now my son was sold. Together we are learning about loops, control structures and variables, drawing characters, etc. We are programming because the problems are (i) much more interesting for him and (ii) achievable in a short time frame.

An example Scratch script.


Learning to program for statistics, or other scientific domains for that matter, is not that different from being a kid and learning programming. Having to do too much to get even a mildly interesting result is frustrating and discouraging; it is not that the learner is dumb, but that he has to build too many functions to get a meager reward. This is why I’d say that you should use whatever language already has a large amount of functionality (‘batteries included’ in Python parlance) for your discipline. Choose rightly and you are half-way there.

‘But’, someone will say, ‘R is not a real language’. Sorry, but it is a real language (Turing complete and the whole shebang) with oddities, as any other language, granted. As with human languages, the more you study the easier it gets to learn a new one. In fact, the syntax for many basic constructs in R is highly similar to alternatives:

# This is R code
# Loop
for(i in 1:10){
    print(i)
}
#[1] 1
#[1] 2
#[1] 3
#[1] 4
#[1] 5
#[1] 6
#[1] 7
#[1] 8
#[1] 9
#[1] 10
 
# Control 
if(i == 10) {
    print('It is ten!')
}
 
#[1] "It is ten!"

# This is Python code
# Loop
for i in range(1, 11): # range() stops before its upper bound, so this gives 1 to 10
    print(i)
 
#1
#2
#3
#4
#5
#6
#7
#8
#9
#10
 
# Control
if i == 10:
    print("It is ten!")
 
#It is ten!

By the way, I formatted the code to highlight similarities. Of course there are plenty of differences between the languages, but many of the building blocks (the core ideas if you will) are shared. You learn one and the next one gets easier.

How does one start learning to program? If I look back at 1985 (yes, last millennium) I was struggling to learn how to program in BASIC when, suddenly, I had an epiphany. The sky opened and I heard a choir singing György Ligeti’s Atmosphères (from 2001: A Space Odyssey, you know) and then I realized: we are only breaking problems into little pieces and dealing with them one at a time. I’ve been breaking problems into little pieces since then. What else did you expect? Anyhow, if you feel that you want to learn how to code in R, or whatever language is the best option for your problems, start small. Just a few lines will do. Read other people’s code but, again, only small pieces that are supposed to do something small. At this stage it is easy to get discouraged because everything is new and takes a lot of time. Don’t worry: everyone has to go through this process.

Many people struggle to vectorize their code so it runs faster. Again, don’t worry at the beginning if your code is slow. Keep on writing and learning. Read Noam Ross’s FasteR! HigheR! StrongeR! - A Guide to Speeding Up R Code for Busy People. The guide is very neat and useful, although you don’t need super fast code, not yet at least. You need working, readable and reusable code. You may not even need code: actually, try to understand the problem and the statistical theory before you start coding like crazy. Programming will help you understand your problem a lot better, but you need a helpful starting point.
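To make the vectorization point concrete, here is a minimal sketch (the variable names are mine, made up for illustration): the same squares computed first with an explicit loop and then with a single vectorized expression. Both are fine while learning; the second is simply what people tend to mean by ‘vectorizing’.

```r
# Loop version: fill a pre-allocated vector one element at a time
squares.loop <- numeric(10)
for (i in 1:10) {
    squares.loop[i] <- i^2
}

# Vectorized version: the arithmetic is applied to the whole vector at once
squares.vec <- (1:10)^2

# Check that both approaches give the same values
all.equal(squares.loop, squares.vec)
```

With ten numbers the speed difference is irrelevant; the vectorized form mostly pays off in readability until the data gets large.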

Don’t get distracted with the politics of research and repeatability and trendy things like git (noticed that I didn’t even link to them?). You’ll learn them in time, once you have a clue about how to program.

P.S. The code used in the examples could be shortened and sped up dramatically (e.g. 1:10 or range(1, 11)) but that is not the point of this post.
P.S.2. A while ago I wrote R is a language, which could be useful in connection with this post.

Flotsam 11: mostly on books

‘No estaba muerto, andaba de parranda’, as the song says. Although rather than partying it has mostly been reading, taking pictures and trying to learn how to record sounds. Here are some things I’ve come across lately.

I can’t remember if I’ve recommended Matloff’s The Art of R Programming before; if I haven’t, go and read the book for a good exposition of the language. Matloff also has an open book (as in free PDF, 3.5MB) entitled ‘From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science’. The download link is near the end of the page. He states that the reader ‘must know calculus, basic matrix algebra, and have some minimal skill in programming’, which incidentally is the bare minimum for someone that wants to get a good handle on stats. In my case I learned calculus partly with Piskunov’s book (I’m a sucker for Soviet books, free DjVu), matrix algebra with Searle’s book and programming with… that’s another story.

I’ve ordered a couple of books from CRC Press, which I hope to receive soon (it depends on how long it takes for the parcel to arrive in the middle of nowhere):

  • Stroup’s Generalized Linear Mixed Models: Modern Concepts, Methods and Applications, which according to the blurb comes ‘with numerous examples using SAS PROC GLIMMIX’. You could be wondering ‘Why is he reading a book that includes SAS as a selling point?’ Well, SAS is a very good statistical system that still has a fairly broad installed base. However, the real selling point is that I’ve read some explanations on mixed models written by Stroup and he has a superb understanding of the topic. I’m really looking forward to putting my paws on this book.
  • Lunn et al.’s The BUGS Book: A Practical Introduction to Bayesian Analysis. I don’t use BUGS but occasionally use JAGS, and one of the things that irks me about programs like BUGS, JAGS or INLA is that they follow the ‘here is a bunch of examples’ approach to documentation. This book is supposed to provide a much more detailed account of the ins and outs of fitting models and a proper manual. Or at least that’s what I’m hoping to find in it.

Finally, a link to a fairly long (and somewhat old) list of R tips and the acknowledgements of a PhD thesis that make you smile (via Arthur Charpentier).

Gratuitous picture: frozen fence (Photo: Luis).


‘He was not dead, he was out partying’.

Dealing with software impermanence

Every so often I get bored of writing about statistical analyses, software and torturing data, and spend time on alternative creative endeavors: taking and processing pictures, writing short stories or exploring new research topics. The first is mostly covered in 500px, I keep the stories private and I’m just starting to play with bioacoustics.

While I was away from this blog the Google Reader debacle came along: Google announced that Reader will be canned mid-year, probably because they want to move everyone towards Google+. Let’s be straightforward, it is not the end of the world but a (relatively) minor annoyance. The main consequence is that this decision led me to reevaluate my relationship with Google services and the result is that I’m replacing most services, particularly those where what I consider private information is stored.

My work email (Luis.Apiolaza@canterbury.ac.nz) stayed the same while I moved my Google calendar back to my work’s Exchange server. I set up my personal email address in one of my domains, served by Zoho. There are no ads in this account. I opted for this to avoid worrying about maintaining email servers, spam filtering, etc. I’ll see how it works, but if it doesn’t, I will swap it for another service: my email address will stay the same.

I exported my RSS subscriptions from Google Reader and put them in Vienna. I tried some of the online alternatives, like Feedly, but didn’t like them.

I barely use Google Docs, so it won’t be a big deal to move from them. I deleted my Google+ account, no big loss. I’m keeping my Gmail account for a little while as I transition my registration with various services. Nevertheless, the most difficult services to replace are Search, Maps and Scholar, which I’m now using without being logged in to Google. I’m testing Duck Duck Go for search (kind of OK), while I’m sticking to Maps and, particularly, to Scholar. Funnily enough, I have access to Web of Science and Scopus (two well-known academic search services) through the university, yet I will often prefer to look in Scholar, which is easier to use, more responsive and has much better coverage of the literature, particularly of conferences and reports.

Google didn’t remove a small service. It did remove my confidence in their whole ecosystem.

Finding our way in the darkness (Photo: Luis).


Remembering server installation details

I’ve been moving part of my work to university servers, where I’m just one more peasant user with few privileges. In exchange, I can access the jobs from anywhere and I can access multiple processors if needed. Given that I have a sieve-like memory, where configuration details quickly disappear through many small holes, I’m documenting the little steps needed to move my work environment there.

The server provides a default R installation but none of the additional packages I often install are available (most people accessing the servers don’t use R). I could contact the administrator to get them installed, but I’ve opted for installing them under my user space. For that I followed the instructions presented here, which in summary require adding the name of the default folder (/hpc/home/luis/rpackages) for the local library of packages to my .bashrc file:

R_LIBS="/hpc/home/luis/rpackages"
export R_LIBS

I also have a temporary folder (called temporary) in the account, where I move the source of the packages to be installed (using SFTP). Once the R session is started it is possible to check that the local folder is first in the library path, confirming that R_LIBS has been made available to R.

Then I can install the packages I moved to the server with SFTP from the temporary folder to the local library using install.packages().

.libPaths()
# [1] "/hpc/home/luis/rpackages"                                               
# [2] "/hpc/home/projects/packages/local.linux.ppc/pkg/R/2.15.1/lib64/R/library"
#
install.packages("~/temporary/plyr_1.8.tar.gz", lib="/hpc/home/luis/rpackages", repos=NULL)
 
# * installing *source* package 'plyr' ...
# ** package 'plyr' successfully unpacked and MD5 sums checked
# ** libs
# gcc -std=gnu99 -I/usr/local/pkg/R/2.15.1/lib64/R/include -DNDEBUG      -fPIC  -O2  -c loop-apply.c -o loop-apply.o
# gcc -std=gnu99 -I/usr/local/pkg/R/2.15.1/lib64/R/include -DNDEBUG      -fPIC  -O2  -c split-numeric.c -o split-numeric.o
# gcc -std=gnu99 -shared -Wl,--as-needed -o plyr.so loop-apply.o split-numeric.o
# installing to /hpc/home/luis/rpackages/plyr/libs
# ** R
# ** data
# **  moving datasets to lazyload DB
# ** inst
# ** preparing package for lazy loading
# ** help
# *** installing help indices
# ** building package indices
# ** testing if installed package can be loaded
#
# * DONE (plyr)

Now we can load the package as normal:

require(plyr)
# Loading required package: plyr

Nothing complicated or groundbreaking, just writing down the small details before I forget them.
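When there are several tarballs waiting in the temporary folder, the same install.packages() call can be wrapped in a small helper. This is only a sketch under my own assumptions: install_local_tarballs() and its dry_run argument are hypothetical names, not part of R, and the files are installed in alphabetical order, so a package whose dependencies sort later in the alphabet may need a second pass.

```r
# Hypothetical helper: install every source tarball found in src_dir into lib_dir.
# With dry_run = TRUE it only reports which files it would install.
install_local_tarballs <- function(src_dir, lib_dir, dry_run = FALSE) {
    tarballs <- list.files(src_dir, pattern = "\\.tar\\.gz$", full.names = TRUE)
    if (!dry_run) {
        for (tarball in tarballs) {
            # repos = NULL tells R to install from the local file
            install.packages(tarball, lib = lib_dir, repos = NULL, type = "source")
        }
    }
    invisible(basename(tarballs))
}

# Example call, using the folders mentioned in this post:
# install_local_tarballs("~/temporary", "/hpc/home/luis/rpackages")
```

Running it first with dry_run = TRUE is a cheap way to confirm that the folder contains what I think it contains.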

Gratuitous picture: just different hardware (Photo: Luis).


An R wish list for 2013

First go and read An R wish list for 2012. None of the wishes came true in 2012. Fix the R website? No, it is the same this year. In fact, it is the same as in 2005. Easy to find help? Sorry, next year. Consistency and sane defaults? Coming soon to a theater near you (one day). Thus my wish list for 2012 is, very handily, still the wish list for 2013.

R as social software

The strength of R is not the software itself, but the community surrounding the software. Put another way, there are several languages that could offer the core functionality, but the whole ‘ecosystem’, that’s another thing. Softening @gappy3000’s comment: innovation is (mostly) happening outside the core.

This prompts some questions: Why aren’t ggplot2 or plyr in the default download? I don’t know if some people realize that ggplot2 is now one of the main attractions of R as a data visualization language. Why isn’t Hadley’s name on this page? (Sorry I’m picking on him, first name that came to mind). How come there is not one woman on that page? I’m not saying there is an evil plan, but I’m wondering if (and how) the site and core reflect the R community and the diversity of interests (and uses). I’m also wondering what the process is to express these questions beyond a blog post. Perhaps in the developers’ email list?

I think that, in summary, my R wish for 2013 is that ‘The R project’ (whoever that is) recognizes that the project is much more than the core download. I wish the list of contributors went beyond the fairly small number of people with write access to the source. I’d include those who write packages, those who explain, those who market and, yes, those who sell R. Finally, I wish all readers of Quantum Forest a great 2013.

Entry point to the R world. Same as ever.


P.S. Just in case, no, I’m not suggesting that I be included in any list.

My R year

End-of-year posts are corny but, what the heck, I think I can let myself delve into corniness once a year. The following code gives a snapshot of what R was for me, and how I used it, in 2012.

outside.packages.2012 <- list(used.the.most = c('asreml', 'ggplot2'),
                              largest.use.decline = c('MASS', 'lattice'),
                              same.use = c('MCMCglmm', 'lme4'),
                              would.like.use.more = 'JAGS')
 
skill.level <- list(improved = 'fewer loops (plyr and do.call())',
                    unimproved = c('variable.naming (Still an InConsistent mess)', 
                                   'versioning (still hit and miss)'))
 
interfaces <- list(most.used = c('RStudio', 'plain vanilla R', 'text editor (Textmate and VIM)'),
                   didnt.use.at.all = 'Emacs')
 
languages <- list(for.inquisition = c('R', 'Python', 'Javascript'),
                  revisiting = 'J',
                  discarded = 'Julia (note to self: revisit in a year)')
 
(R.2012 <- list(outside.packages.2012, 
                skill.level, 
                interfaces, 
                languages))
 
# [[1]]
# [[1]]$used.the.most
# [1] "asreml"  "ggplot2"
 
# [[1]]$largest.use.decline
# [1] "MASS"    "lattice"
 
# [[1]]$same.use
# [1] "MCMCglmm" "lme4"    
 
# [[1]]$would.like.use.more
# [1] "JAGS"
 
 
# [[2]]
# [[2]]$improved
# [1] "fewer loops (plyr and do.call())"
 
# [[2]]$unimproved
# [1] "variable.naming (Still an InConsistent mess)"
# [2] "versioning (still hit and miss)"             
 
 
# [[3]]
# [[3]]$most.used
# [1] "RStudio"                        "plain vanilla R"               
# [3] "text editor (Textmate and VIM)"
 
# [[3]]$didnt.use.at.all
# [1] "Emacs"
 
 
# [[4]]
# [[4]]$for.inquisition
# [1] "R"          "Python"     "Javascript"
 
# [[4]]$revisiting
# [1] "J"
 
# [[4]]$discarded
# [1] "Julia (note to self: revisit in a year)"

So one can query this over-the-top structure with code like R.2012[[3]]$didnt.use.at.all to learn [1] "Emacs", but you already knew that, didn’t you?
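The numeric index is needed because the outer list has no names. As a small sketch (R.2012.named is a name I am making up here, and only two of the four inner lists are repeated to keep it short), naming the outer elements lets one query by name at both levels:

```r
# Two of the inner lists from above, repeated so the example stands alone
interfaces <- list(most.used = c('RStudio', 'plain vanilla R', 'text editor (Textmate and VIM)'),
                   didnt.use.at.all = 'Emacs')

languages <- list(for.inquisition = c('R', 'Python', 'Javascript'),
                  revisiting = 'J',
                  discarded = 'Julia (note to self: revisit in a year)')

# Naming the outer elements avoids having to remember positions
R.2012.named <- list(interfaces = interfaces,
                     languages = languages)

R.2012.named$interfaces$didnt.use.at.all
# [1] "Emacs"

R.2012.named[['languages']][['revisiting']]
# [1] "J"
```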

Despite all my complaints, monologuing about other languages and overall frustration, R has served me well. It’s just that I’d be disappointed if I were still using it a lot in ten years’ time.

Gratuitous picture: building blocks for research (Photo: Luis).


Of course there was a lot more than R and stats this year. For example, the blogs I read most often have nothing to do with either topic: Isomorphismes (can’t define it), The music of sound (sound design), Offsetting behaviour (economics/politics in NZ). In fact, I need to read about a broad range of topics to feel human.

P.S. Incidentally, my favorite R function this year was subset(); I’ve been subsetting like there is no tomorrow. By the way, you are welcome to browse around the blog and subset whatever you like.
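For anyone who has not met subset() yet, a minimal sketch using the mtcars data set that ships with R (the choice of data set and the name thrifty are mine, just for illustration): the first argument is the data frame, the second picks rows and select picks columns.

```r
# Keep only the cars doing more than 30 miles per gallon,
# and only three of the columns
thrifty <- subset(mtcars, mpg > 30, select = c(mpg, cyl, wt))

# The same result with plain bracket indexing, for comparison
thrifty.brackets <- mtcars[mtcars$mpg > 30, c('mpg', 'cyl', 'wt')]

identical(thrifty, thrifty.brackets)
```

The appeal is that column names can be written bare inside subset(), which makes interactive poking around shorter to type.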

R for inquisition

A post on high-dimensional arrays by @isomorphisms reminded me of APL and, more generally, of matrix languages, which took me back to inquisitive computing: computing not in the sense of software engineering, or databases, or formats, but of learning by poking problems through a computer.

I like languages not because I can get a job by using one, but because I can think thoughts and express ideas through them. The way we think about a problem is somehow molded by the tools we use, and if we have loops, loops we use or if we have a terse matrix notation (see my previous post on Matrix Algebra Useful for Statistics), we may use that.
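As a small sketch of how the tool shapes the thought (simulated data and variable names are made up for this example), here is the same simple regression approached first with loop-flavoured thinking and then with matrix notation:

```r
# Simulated data: a straight line plus a bit of noise
set.seed(42)
x <- runif(20)
y <- 2 + 3 * x + rnorm(20, sd = 0.1)

# Loop-flavoured thinking: accumulate sums of cross-products and squares
sxy <- 0
sxx <- 0
for (i in seq_along(x)) {
    sxy <- sxy + (x[i] - mean(x)) * (y[i] - mean(y))
    sxx <- sxx + (x[i] - mean(x))^2
}
slope.loop <- sxy / sxx
intercept.loop <- mean(y) - slope.loop * mean(x)

# Matrix-flavoured thinking: solve the normal equations X'X b = X'y
X <- cbind(1, x)
b <- solve(crossprod(X), crossprod(X, y))

# Both routes lead to the same intercept and slope
all.equal(c(intercept.loop, slope.loop), as.numeric(b))
```

Neither version is how one would fit the model in practice (lm() exists for that); the point is only that the matrix version extends to more predictors without changing a line of the solving code.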

I used APL fairly briefly but I was impressed by some superficial aspects (hey, that’s a weird set of characters that needs a keyboard overlay) and some deeper ones (this is an actual language, cool PDF paper). The APL revolution didn’t happen, at least not directly, but it had an influence over several other languages (including R). Somehow as a group we took a different path from ‘Expository programming’, but I think that we have to recover at least part of that ethos, programming for understanding the world.

While I often struggle with R frustrations, it is now my primary language for inquisitive computing, although sometimes I dive into something else. I like Mathematica, but can access it only while plugged into the university network (license limits). Python is turning into a great scientific computing environment, although still with a feeling of sellotape holding it together, and J is like APL without the Klingon keyboard.

If anything, dealing with other ways of doing things leads to a better understanding of one’s primary language. Idioms that seem natural acquire a new sense of weirdness when compared to other languages. R’s basic functionality gives an excellent starting point for inquisitive computing but don’t forget other languages that can enrich the way we look at problems.

I am curious about people’s favorite inquisitive languages.

Gratuitous picture: inquisition, why do bloody trees grow like this? (Photo: Luis).
