# Quantum Forest

### notes in a shoebox


I have been writing on the internet on and off (perhaps mostly off) for nearly 20 years, including various blog stints since July 2003. This is my fifth or sixth iteration of a blog, and I figured out that one element that makes it difficult to keep going in its current form is how skewed the sampling of topics I have covered is. I mean, all this quantitative stuff, coding, etc. is like looking through a prism that only lets through a tiny portion of life.

Prism used to set ‘prism plots’ in forest inventory, where the distance to the tree and its size determine if it is inside the plot (Photo: Luis).

I am loosening my mental definition of what should be in this site because, as much as I like programming and numbers, it becomes tiring to always be switched on for those topics. Sometimes this change will go unnoticed, while at other times it will represent a big departure from what is (or used to be) the core of this blog’s content.

I am hoping to try different topics (perhaps more common in a previous blog incarnation), angles and media. We will see how it works out.

This week I’ve been feeling tired of the excessive fanaticism (or zealotry) around open source software (OSS) and R in general. I do use a fair amount of OSS and pushed for the adoption of R in our courses; in fact, I do think OSS is a Good Thing™. I do not like, however, the constant yabbering on why using exclusively OSS in science is a good idea and the reduction of science to repeatability and computability (both of which I covered in my previous post). I also dislike the snobbery of ‘you shall use R and not Excel at all, because the latter is evil’ (going back ages).

We often have several experiments running during the year and most of the time we do not bother setting up a database to keep the data. Doing that would essentially mean that I would have to do it, and I have a few more important things to do. Therefore, many data sets end up in… (drum roll here) Microsoft Excel.

How should a researcher set up data in Excel? Rather than reinventing the wheel, I’ll use a(n) (im)perfect diagram that I found years ago in a Genstat manual.

Suggested sane data setup in a spreadsheet.

I like it because:

• It makes clear how to set up the experimental and/or sampling structure; one can handle any design with enough columns.
• It also manages any number of traits assessed in the experimental units.
• It contains metadata in the first few rows, which can be easily skipped when reading the file. I normally convert Excel files to text and then I skip the first few lines (using skip in R or firstobs in SAS).
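As a minimal sketch of the skipping step in R: the file contents, column names and the number of metadata rows below are all made up for illustration; the only real piece is the skip argument of read.csv().

```r
# Hypothetical example: a spreadsheet exported to CSV, with three
# metadata rows sitting above the header row (all values are made up).
lines <- c(
  "Trial: example progeny test",
  "Assessed: 2012",
  "Units: height in m, diameter in cm",
  "block,plot,tree,height,diameter",
  "1,1,1,12.3,18.5",
  "1,1,2,11.8,17.2",
  "2,1,1,13.1,19.0"
)
path <- tempfile(fileext = ".csv")
writeLines(lines, path)

# skip = 3 drops the metadata rows, so the fourth row is read as the header
trial <- read.csv(path, skip = 3, header = TRUE)
str(trial)
```

The metadata stays in the file for humans to read, while the analysis only ever sees the rectangular part.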

People doing data analysis often start convulsing at the mention of Excel; personally, I deeply dislike it for analyses but it makes data entry very easy, and even a monkey can understand how to use it (I’ve seen them typing, I swear). The secret for sane use is to use Excel only for data entry; any data manipulation (subsetting, merging, derived variables, etc.) or analysis is done in statistical software (I use either R or SAS for general statistics, ASReml for quantitative genetics).
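To make ‘manipulation happens outside the spreadsheet’ concrete, here is a small sketch in base R. The two data frames and their column names are invented for the example; they stand in for whatever was typed into Excel.

```r
# Made-up tree-level data, as it might arrive from data entry
trees <- data.frame(
  tree     = 1:4,
  family   = c("A", "A", "B", "B"),
  height   = c(12.3, 11.8, 13.1, 9.5),   # m
  diameter = c(18.5, 17.2, 19.0, 14.1)   # cm
)

# Made-up family-level information kept in a separate sheet
families <- data.frame(family = c("A", "B"),
                       origin = c("north", "south"))

# Derived variable: basal area in square meters from diameter in cm
trees$basal_area <- pi * (trees$diameter / 200)^2

# Merging: attach the family-level information to each tree
trees <- merge(trees, families, by = "family")

# Subsetting: keep only trees taller than 10 m and a few columns
tall <- subset(trees, height > 10,
               select = c(tree, family, origin, height, basal_area))
tall
```

Because every step is in a script, the spreadsheet itself is never modified and the whole sequence can be rerun when new rows are entered.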

It is far from a perfect solution but it fits in the realm of the possible and, considering all my work responsibilities, it’s a reasonable use of my time. Would it be possible for someone to make a weird change in the spreadsheet? Yes. Could you fart while moving the mouse and create a non-obvious side effect? Yes, I guess so. Will it make your life easier, and make it possible to complete your research projects? Yes sir!

P.S. One could even save data using a text-based format (e.g. csv, tab-delimited) and use Excel only as a front-end for data entry. Other spreadsheets are of course equally useful.

P.S.2. Some of my data are machine-generated (e.g. by acoustic scanners and NIR spectroscopy) and get dumped by the machine in a separate text file for each sample (usually very wide; for example, 2000 columns). I never put them in Excel, but read them directly (a directory-full of them) into R for manipulation and analysis.
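A rough sketch of reading a directory-full of such files follows. The directory, the file names and the one-row-per-file layout are assumptions made for the example (real instrument dumps will differ, and would be far wider than five columns).

```r
# Create a throwaway directory with two made-up sample files
dir_path <- tempfile("scans_")
dir.create(dir_path)
for (id in c("sample01", "sample02")) {
  spectrum <- matrix(runif(5), nrow = 1)   # stand-in for ~2000 columns
  write.table(spectrum, file.path(dir_path, paste0(id, ".txt")),
              row.names = FALSE, col.names = FALSE)
}

# List every text file in the directory
files <- list.files(dir_path, pattern = "\\.txt$", full.names = TRUE)

# Read each file, tag it with the sample name taken from the file name,
# and stack everything into a single data frame
scans <- do.call(rbind, lapply(files, function(f) {
  values <- read.table(f, header = FALSE)
  cbind(sample = tools::file_path_sans_ext(basename(f)), values)
}))
dim(scans)
```

Keeping the sample identifier as a column makes it easy to merge the scans with the experimental layout later on.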

As an interesting aside, the post A summary of the evidence that most published research is false provides a good summary of the need to freak out about repeatability.

Every so often I get bored writing about statistical analyses, software and torturing data and spend time in alternative creative endeavors: taking and processing pictures, writing short stories or exploring new research topics. The first is mostly covered in 500px, I keep the stories private and I’m just starting to play with bioacoustics.

While I was away from this blog the Google Reader debacle came along: Google announced that Reader will be canned mid-year, probably because they want to move everyone towards Google+. Let’s be straightforward, it is not the end of the world but a (relatively) minor annoyance. The main consequence is that this decision led me to reevaluate my relationship with Google services, and the result is that I’m replacing most of them, particularly those that store what I consider private information.

My work email (Luis.Apiolaza@canterbury.ac.nz) stayed the same while I moved my Google calendar back to my work’s Exchange server. I set up my personal email address in one of my domains, served by Zoho. There are no ads in this account. I opted for this to avoid worrying about maintaining email servers, spam filtering, etc. I’ll see how it works but, if it doesn’t, I will swap it for another service: my email address will stay the same.

I exported my RSS subscriptions from Google Reader and put them in Vienna. I tried some of the online alternatives, like Feedly, but didn’t like them.

I barely use Google Docs, so it won’t be a big deal to move away from them. I deleted my Google+ account, no big loss. I’m keeping my Gmail account for a little while as I transition my registration with various services. Nevertheless, the most difficult services to replace are Search, Maps and Scholar, which I’m now using without being logged in to Google. I’m testing Duck Duck Go for search (kind of OK), while I’m sticking to Maps and, particularly, to Scholar. Funnily enough, I have access to Web of Science and Scopus (two well-known academic search services) through the university, but I will often prefer to look in Scholar, which is easier to use, more responsive and has much better coverage of the literature, particularly of conferences and reports.

Google didn’t just remove a small service. It removed my confidence in their whole ecosystem.

Finding our way in the darkness (Photo: Luis).

First go and read An R wish list for 2012. None of the wishes came true in 2012. Fix the R website? No, it is the same this year. In fact, it is the same as in 2005. Easy to find help? Sorry, next year. Consistency and sane defaults? Coming soon to a theater near you (one day). Thus my wish list for 2012 is, very handily, still the wish list for 2013.

## R as social software

The strength of R is not the software itself, but the community surrounding the software. Put another way, there are several languages that could offer the core functionality, but the whole ‘ecosystem’ is another thing. Softening @gappy3000’s comment: innovation is (mostly) happening outside the core.

This prompts some questions: Why aren’t ggplot2 or plyr in the default download? I don’t know if some people realize that ggplot2 is now one of the main attractions of R as a data visualization language. Why isn’t Hadley’s name on this page? (Sorry I’m picking on him, first name that came to mind). How come there is not one woman on that page? I’m not saying there is an evil plan, but I’m wondering if (and how) the site and core reflect the R community and the diversity of interests (and uses). I’m also wondering what the process is to express these questions beyond a blog post. Perhaps in the developers’ email list?

I think that, in summary, my R wish for 2013 is that ‘The R project’ (whoever that is) recognizes that the project is much more than the core download. I wish the list of contributors went beyond the fairly small number of people with write access to the source. I’d include those who write packages, those who explain, those who market and, yes, those who sell R. Finally, I wish all readers of Quantum Forest a great 2013.

Entry point to the R world. Same as ever.

P.S. Just in case, no, I’m not suggesting that I should be included in any list.

End-of-year posts are corny but, what the heck, I think I can let myself delve into corniness once a year. The following code gives a snapshot of what R was for me, and how, in 2012.

So one can query this over-the-top structure with code like R.2012[[3]]$didnt.use.at.all to learn [1] "Emacs", but you already knew that, didn’t you?

Despite all my complaints, monologuing about other languages and overall frustration, R has served me well. It’s just that I’d be disappointed if I were still using it a lot in ten years’ time.

Gratuitous picture: building blocks for research (Photo: Luis).

Of course there was a lot more than R and stats this year. For example, the blogs I read most often have nothing to do with either topic: Isomorphismes (can’t define it), The music of sound (sound design), Offsetting behaviour (economics/politics in NZ). In fact, I need to read about a broad range of topics to feel human.

P.S. Incidentally, my favorite R function this year was subset(); I’ve been subsetting like there is no tomorrow. By the way, you are welcome to browse around the blog and subset whatever you like.

A post on high-dimensional arrays by @isomorphisms reminded me of APL and, more generally, of matrix languages, which took me back to inquisitive computing: computing not in the sense of software engineering, or databases, or formats, but of learning by poking problems through a computer.

I like languages not because I can get a job by using one, but because I can think thoughts and express ideas through them. The way we think about a problem is somehow molded by the tools we use: if we have loops, loops we use, and if we have a terse matrix notation (see my previous post on Matrix Algebra Useful for Statistics), we may use that. I used APL fairly briefly but I was impressed by some superficial aspects (hey, that’s a weird set of characters that needs a keyboard overlay) and some deeper ones (this is an actual language, cool PDF paper). The APL revolution didn’t happen, at least not directly, but it had an influence on several other languages (including R).

Somehow as a group we took a different path from ‘Expository programming’, but I think that we have to recover at least part of that ethos: programming for understanding the world. While many times I struggle with R frustrations, it is now my primary language for inquisitive computing, although sometimes I dive into something else. I like Mathematica, but can access it only while plugged in to the university network (license limits). Python is turning into a great scientific computing environment (although still with a feeling of sellotape holding it together), and J is like APL without the Klingon keyboard.

If anything, dealing with other ways of doing things leads to a better understanding of one’s primary language. Idioms that seem natural acquire a new sense of weirdness when compared to other languages. R’s basic functionality gives an excellent starting point for inquisitive computing, but don’t forget other languages that can enrich the way we look at problems. I am curious about what people’s favorite inquisitive languages are.

Gratuitous picture: inquisition, why do bloody trees grow like this? (Photo: Luis).

This post is tangential to R, although R has a fair share of the issues I mention here, which include research reproducibility, open source, paying for software, multiple languages, salt and pepper.

There is an increasing interest in the reproducibility of research. In many topics we face multiple, often conflicting claims, and as researchers we value the ability to evaluate those claims, including repeating/reproducing research results. While I share the interest in reproducibility, sometimes I feel we are obsessing too much over only part of the research process: statistical analysis. Even here, many people focus not on the models per se, but only on the code for the analysis, which should only use tools that are free of charge.

There has been enormous progress in the R world on literate programming, where the combination of RStudio + Markdown + knitr has made analyzing data and documenting the process almost enjoyable. Nevertheless, and here is the BUT coming, there is a large difference between making the code repeatable and making research reproducible.

As an example, I am currently working on a project that relies on two trials, which have taken a decade to grow. We took a few hundred increment cores from a sample of trees and processed them using a densitometer, an X-ray diffractometer and a few other lab toys. By now you get the idea: actually replicating the research may take you quite a few resources before you even start to play with free software. At that point, of course, I want to be able to get the most out of my data, which means that I won’t settle for a half-assed model because the software is not able to fit it. If you think about it, spending a couple of grand on software (say ASReml and Mathematica licenses) doesn’t sound outrageous at all. Furthermore, reproducing this piece of research would require a decade, access to genetic material and lab toys. I’ll give you the code for free, but I can’t give you ten years or $0.25 million…

In addition, the research process may require linking disparate sources of data, for which other languages (e.g. Python) may be more appropriate. Sometimes R is the perfect tool for the job, while at other times I feel like we have reached peak VBS (Visual Basic Syndrome) in R: people want to use it for everything, even when it’s a bad idea.

In summary,

• research is much more than a few lines of R (although they are very important),
• even when considering data collection and analysis it is a good idea to know more than a single language/software, because it broadens analytical options, and
• I prefer free (freedom+beer) software for research; however, I rely on non-free, commercial software for part of my work because it happens to be the best option for specific analyses.

Disclaimer: my primary analysis language is R and I often use lme4, MCMCglmm and INLA (all free). However, many (if not most) of my analyses that use genetic information rely on ASReml (paid, not open source). I’ve used Mathematica, Matlab, Stata and SAS for specific applications with reasonably priced academic licenses.

Gratuitous picture: 3000 trees leaning in a foggy Christchurch day (Photo: Luis).

(This post continues discussing issues I described back in January in Academic publication boycott)

Some weeks ago I received a couple of emails on the same day: one asking me to submit a paper to an open access journal, while the other was inviting me to be the editor of a ‘special issue’ of my choice for another journal. I hadn’t heard of either of the two publications, which follow pretty much the same model: submit a paper for $600 and, if they like it, it will be published. However, the special issue email had this ‘buy your way in’ feeling: find ten contributors (i.e. $6,000) and you get to be an editor. Now, there is nothing wrong per se with open access journals; some of my favorite ones (e.g. PLoS ONE) follow that model. However, I was surprised by the increasing number of new journals that aim to fill the gap for ‘I need to publish soon, somewhere’. Surprised until one remembers the incentives at play in academic environments.

If I, or most academics for that matter, want to apply for academic promotion I have to show that I’m a good guy who has a ‘teaching philosophy’ and that my work is good enough to get published in journals; hopefully in lots of them. The first part is a pain, but most people can write something along the lines of ‘I’m passionate about teaching and enjoy creating a challenging environment for students…’ without puking. The second part is trickier because one really has to have the papers in actual journals.

Personally, I would be happier with only having the odd ‘formal’ publication. The first time (OK, few times) I saw my name in a properly typeset paper was very exciting, but it gets old after a while. These days, however, I would prefer to just upload my work to a website, saying: here you have some ideas and code, play with them. If you like it, great; if not, well, next time I hope it’ll be better. Nevertheless, this doesn’t count as proper publication, because it isn’t peer reviewed, independently of the number of comments the post may get. PLoS ONE counts, but it’s still a journal and I (and many other researchers) work on many things that are too small for a paper, but cool enough to share. The problem: there is little or no credit for sharing, so Quantum Forest is mostly a ‘labor of love’, which counts bugger all for anything else.

These days as a researcher I often learn more from other people’s blogs and quick idea exchanges (for example through Twitter) than via formal publication. I enjoy sharing analysis, ideas and code in this blog. So what’s the point of so many papers in so many journals? I guess that many times we are just ‘ticking the box’ for promotion purposes. In addition, the idea of facing referees’ or editors’ comments like ‘it would be a good idea that you cite the following papers…’ puts me off. And what about authorship arrangements? We have moved from papers with 2-3 authors to enough authors to have a football team (with reserves and everything). Some research groups also run arrangements where ‘I scratch your back (include you as a coauthor) and you scratch mine (include me in your papers)’. We break ideas into little pieces that count for many papers, etc.

Another related issue is the cost of publication (and the barriers it imposes on readership). You see, we referee papers for journals for free (as in for zero money) and tell ourselves that we are doing a professional service to uphold the high standards of whatever research branch we belong to. Then we spend a fortune from our library budget to subscribe to the same journals for which we reviewed the papers (for free, remember?). It is not a great deal, as many reasonable people have pointed out; I added a few comments in academic publication boycott.

So, what do we need? We need promotion committees to reduce the weight on publication. We need to move away from impact factor. We can and need to communicate in other ways: scientific papers will not go away, but their importance should be reduced.

Sometimes the way forward is unclear. Incense doesn’t hurt (Photo: Luis).

Making an effort to prepare interesting lectures doesn’t hurt either.

These days it is fairly common for editors to ‘suggest’ including additional references in our manuscripts, which just happen to be papers in the same journal, hoping to inflate the impact factor of the journal. Referees tend to suggest their own papers (sometimes useful, many times not). Lame, isn’t it?

PS. 2012-10-19 15:27 NZST. You also have to remember that just because something was published does not mean it is actually correct: outrageously funny example (via Arthur Charpentier). Yep, through Twitter.

Yesterday I accidentally started a dialogue on Twitter with the dude running The Setup. Tonight I decided to procrastinate in my hotel room (I am in Rotovegas for work) by writing up my own Luis Uses This:

Since 2005 I’ve been using Apple computers as my main machines. They tend to be well built and keep on running without rebooting for a while, and I ssh to a Unix box from them when I need extra oomph. At the moment I have a 2009 15″ MacBook Pro and a 2010 27″ iMac; both computers are pretty much the default, except for extra RAM, and they are still running Snow Leopard. I have never liked Apple mice, so I bought a Logitech mouse for the iMac. I use a generic Android phone, because I’m too cheap to spend money on an iPhone. I don’t have an iPad either, because I don’t have a proper use for it and I dislike lugging around gear for the sake of it.

I’m not ‘addicted’ to any software. However, I do use some programs frequently: R for data analysis/scripting (often with RStudio as a frontend), ASReml for quantitative genetics, Python for scripting/scraping, XeLaTeX for writing lecture notes and solo papers (because my second surname is Zúñiga), MS Word for writing anything that requires collaborating with other people, and Keynote for presentations (but sometimes I have to use PowerPoint). I check my university email with Thunderbird or Entourage; my browsing is mostly done using Chrome, but when paranoid I use Tor + Firefox + Vidalia. I use a bunch of other programs but not often enough to deserve a mention. If you think about it, Keynote is the only format that defeats my platform agnosticism (I could still write Word documents using OpenOffice or similar). I almost forgot! I do rely on Dropbox to keep computers synced.

I keep on changing text editors: I don’t understand how people can cope with Emacs and am uncomfortably writing this post using Vim (which is awful as well). I own a copy of TextMate but I feel annoyed by the author’s abandonment, so I’m at a point where I tend to use software-specific editors: R – RStudio, XeLaTeX – TeXShop, etc.

If I weren’t allowed to use a Mac at work I’d probably move to Linux; the major hassle would be converting Keynote presentations to something else. I could live with Windows, but I would start with a totally clean install, because I find the pre-installed software very unhelpful. These days I think that I’ve been unconsciously preparing myself for the impermanence of software, so if I need to learn a new stats package or new editor that is ‘just fine’: software agnostic Buddhism.

Non-computer-wise I’m permanently dissatisfied with my bag/backpack: I haven’t found a nice overnight trip bag that is designed for walking around carrying a laptop. (Did I mention that I like to walk?) Most of them are dorky or plain useless and my current theory is that the solution is getting a smaller (say 11-13″) laptop. Because the university depreciates laptops over 4 years I still have to wait a year to test the theory.

I tend to doodle when thinking or preparing a talk. I prefer to write on unlined photocopy paper with a pen or pencil. A fountain pen is nice, but a $0.20 pen will do too. It has the advantage of being (i) cheap and (ii) available everywhere.

I like to take pictures and muck around with shutter speeds and apertures, which doesn’t mean that I’m any good at it. I use a Nikon Coolpix P7100 camera, but I’m sure that a Canon G12 would do the job as well. It is the smallest camera that gives me the degree of control I like. I process the pictures in Lightroom, which is just OK, but, again, it sort of fits my platform agnosticism.

I’m slowly moving to ebooks, for which I use a Sony Reader (which I got for free) that I manage using Calibre. I keep wireless disabled and non-configured: it is only for reading books and I often use the dictionary feature while reading (I’m always surprised by the large number of English words).

## What would be your dream setup?

This would be a ‘sabbatical package’ where I would spend 6 months living in another (earthquake-proof) country, near the ocean, with my family, good food, a light notebook with a week’s worth of battery life, decent internet connection and the chance to catch up with my research subject.

P.S. 2012-04-19. I came across this post in 37signals discussing a switch from OS X to Ubuntu. I think that there is a class of use cases (e.g. web development, scientific programming) where moving from one to the other should be relatively painless.

This week was the first anniversary of the February 22nd earthquake in Christchurch. Between that and the first week of lectures it has been hard to find time to write much about data analysis. Thus, if I owe you some analyses, some code or some text be patient, please. I’ll be back soon(ish).

I have never been much of a downtown person, so it is no surprise that I haven’t been in Hagley Park for a while. Today (Sunday in my part of the planet) it was a nice treat to spend some time there and see the interaction between city recovery and the Botanic Gardens, particularly in a Eucalyptus delegatensis surrounded by messages.

Earthquake messages around Eucalyptus delegatensis, Christchurch Botanic Gardens (Photo: Luis).

Detail of messages (Photo: Luis).

Trees can carry a lot of sadness and hope (Photo: Luis).

Up the eucalypt tree (Photo: Luis).

Time for a pause, time to look forward.

P.S. I wanted to attend the Christchurch PechaKucha tonight, but I ran out of time at the end. Next time.