Excel, fanaticism and R

This week I’ve been feeling tired of the excessive fanaticism (or zealotry) around open source software (OSS) and R in general. I use a fair amount of OSS and pushed for the adoption of R in our courses; in fact, I do think OSS is a Good Thing™. I do not like, however, the constant yabbering on why using exclusively OSS in science is a good idea, or the reduction of science to repeatability and computability (both of which I covered in my previous post). I also dislike the snobbery of ‘you shall use R and not Excel at all, because the latter is evil’ (going back ages).

We often have several experiments running during the year, and most of the time we do not bother setting up a database to keep the data. Doing so would essentially mean that I would have to do it, and I have more important things to do. Therefore, many data sets end up in… (drum roll here) Microsoft Excel.

How should a researcher set up data in Excel? Rather than reinventing the wheel, I’ll use an (im)perfect diagram that I found years ago in a Genstat manual.

Suggested sane data setup in a spreadsheet.

I like it because:

  • It makes clear how to set up the experimental and/or sampling structure; one can handle any design with enough columns.
  • It also accommodates any number of traits assessed in the experimental units.
  • It contains metadata in the first few rows, which can be easily skipped when reading the file. I normally convert Excel files to text and then skip the first few lines (using skip in R or firstobs in SAS), as in the sketch after this list.
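
A minimal sketch of that reading step in R, assuming the metadata occupies the first three rows and the spreadsheet was exported as 'trial.csv' (both the row count and the file name are made up):

# Skip the metadata rows; the next row is read as the header
trial <- read.csv('trial.csv', skip = 3, header = TRUE)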

People doing data analysis often start convulsing at the mention of Excel; personally, I deeply dislike it for analyses, but it makes data entry very easy, and even a monkey can understand how to use it (I’ve seen them typing, I swear). The secret to sane use is to restrict Excel to data entry; any data manipulation (subsetting, merging, derived variables, etc.) or analysis is done in statistical software (I use either R or SAS for general statistics, and ASReml for quantitative genetics).
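
A sketch of that separation, with made-up data standing in for what would come from the spreadsheet; subsetting, merging and derived variables all live in code, where they are recorded and repeatable:

# Toy data standing in for a real trial
trial <- data.frame(tree = 1:6,
                    site = rep(c('A', 'B'), each = 3),
                    dbh = c(12, 15, 9, 20, 18, 22))
site.info <- data.frame(site = c('A', 'B'), altitude = c(300, 650))

small <- subset(trial, dbh < 15)               # subsetting
trial <- merge(trial, site.info, by = 'site')  # merging
trial$rel.dbh <- trial$dbh / mean(trial$dbh)   # derived variable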

It is far from a perfect solution, but it fits in the realm of the possible and, considering all my work responsibilities, it’s a reasonable use of my time. Could someone make a weird change in the spreadsheet? Yes. Could you fart while moving the mouse and create a non-obvious side effect? Yes, I guess so. Will it make your life easier and make it possible to complete your research projects? Yes sir!

P.S. One could even save the data in a text-based format (e.g. CSV, tab-delimited) and use Excel only as a front end for data entry. Other spreadsheets are, of course, equally useful.

P.S.2. Some of my data are machine-generated (e.g. by acoustic scanners and NIR spectroscopy) and get dumped by the machine into a separate text file for each sample, usually a very wide one (say 2,000 columns). I never put those in Excel; I read them (a directory-full of them at a time) directly into R for manipulation and analysis.
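
A minimal sketch of that reading step, assuming a directory scans/ full of tab-delimited files (the directory name, file pattern and delimiter are made up for illustration):

# Read every text file in the directory and stack them into one data frame
files <- list.files('scans', pattern = '\\.txt$', full.names = TRUE)
spectra <- do.call(rbind, lapply(files, read.delim, header = FALSE))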

As an interesting aside, the post A summary of the evidence that most published research is false provides a good summary of the need to freak out about repeatability.

Dealing with software impermanence

Every so often I get bored writing about statistical analyses, software and torturing data, and spend time on alternative creative endeavors: taking and processing pictures, writing short stories or exploring new research topics. The first is, mostly, covered on 500px, I keep the stories private, and I’m just starting to play with bioacoustics.

While I was away from this blog came the Google Reader debacle: Google announced that Reader would be canned mid-year, probably because they want to move everyone towards Google+. Let’s be straightforward: it is not the end of the world, but a (relatively) minor annoyance. The main consequence is that this decision led me to reevaluate my relationship with Google services, and the result is that I’m replacing most of them, particularly those storing what I consider private information.

My work email (Luis.Apiolaza@canterbury.ac.nz) stayed the same, while I moved my Google calendar back to my work’s Exchange server. I set up my personal email address on one of my domains, served by Zoho; there are no ads in this account. I opted for this to avoid worrying about maintaining email servers, spam filtering, etc. I’ll see how it works but, if it doesn’t, I’ll swap it for another service: my email address will stay the same.

I exported my RSS subscriptions from Google Reader and put them in Vienna. I tried some of the online alternatives, like Feedly, but didn’t like them.

I barely use Google Docs, so it won’t be a big deal to move away from them. I deleted my Google+ account; no big loss. I’m keeping my Gmail account for a little while as I transition registrations at various services. Nevertheless, the most difficult services to replace are Search, Maps and Scholar, which I’m now using without being logged in to Google. I’m testing Duck Duck Go for search (kind of OK), while I’m sticking to Maps and, particularly, to Scholar. Funnily enough, I have access to Web of Science and Scopus—two well-known academic search services—through the university, but I often prefer to look in Scholar, which is easier to use, more responsive and has much better coverage of the literature, particularly of conferences and reports.

Google didn’t remove a small service. It removed my confidence in their whole ecosystem.

Finding our way in the darkness (Photo: Luis, click to enlarge).

An R wish list for 2013

First go and read An R wish list for 2012. None of the wishes came true in 2012. Fix the R website? No, it is the same this year; in fact, it is the same as in 2005. Easy-to-find help? Sorry, next year. Consistency and sane defaults? Coming soon to a theater near you (one day). Thus my wish list for 2012 is, very conveniently, still the wish list for 2013.

R as social software

The strength of R is not the software itself but the community surrounding it. Put another way, there are several languages that could offer the core functionality, but the whole ‘ecosystem’ is another matter. Softening @gappy3000’s comment: innovation is (mostly) happening outside the core.

This prompts some questions: why isn’t ggplot2 or plyr in the default download? I don’t know if some people realize that ggplot2 is now one of the main attractions of R as a data visualization language. Why isn’t Hadley’s name on this page? (Sorry I’m picking on him; his was the first name that came to mind.) How come there is not one woman on that page? I’m not saying there is an evil plan, but I’m wondering if (and how) the site and core reflect the R community and its diversity of interests (and uses). I’m also wondering what the process is to express these questions beyond a blog post. Perhaps the developers’ email list?
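
To make the first question concrete, this is the extra step every newcomer has to discover, because ggplot2 does not ship with the default download (a trivial sketch; qplot() and the built-in cars data are just convenient examples):

install.packages('ggplot2')     # one-off installation from CRAN
library(ggplot2)
qplot(speed, dist, data = cars) # scatterplot of a data set that ships with R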

I think that, in summary, my R wish for 2013 is that ‘The R project’—whoever that is—recognizes that the project is much more than the core download. I wish the list of contributors went beyond the fairly small number of people with write access to the source; I’d include those who write packages, those who explain, those who market and, yes, those who sell R. Finally, I wish all readers of Quantum Forest a great 2013.

Entry point to the R world. Same as ever.

P.S. Just in case: no, I’m not suggesting that I be included in any list.

My R year

End-of-year posts are corny but, what the heck, I think I can let myself delve into corniness once a year. The following code gives a snapshot of what R was like for me in 2012.

outside.packages.2012 <- list(used.the.most = c('asreml', 'ggplot2'),
                              largest.use.decline = c('MASS', 'lattice'),
                              same.use = c('MCMCglmm', 'lme4'),
                              would.like.use.more = 'JAGS')
 
skill.level <- list(improved = 'fewer loops (plyr and do.call())',
                    unimproved = c('variable.naming (Still an InConsistent mess)', 
                                   'versioning (still hit and miss)'))
 
interfaces <- list(most.used = c('RStudio', 'plain vanilla R', 'text editor (Textmate and VIM)'),
                   didnt.use.at.all = 'Emacs')
 
languages <- list(for.inquisition = c('R', 'Python', 'Javascript'),
                  revisiting = 'J',
                  discarded = 'Julia (note to self: revisit in a year)')
 
(R.2012 <- list(outside.packages.2012, 
                skill.level, 
                interfaces, 
                languages))
 
# [[1]]
# [[1]]$used.the.most
# [1] "asreml"  "ggplot2"
 
# [[1]]$largest.use.decline
# [1] "MASS"    "lattice"
 
# [[1]]$same.use
# [1] "MCMCglmm" "lme4"    
 
# [[1]]$would.like.use.more
# [1] "JAGS"
 
 
# [[2]]
# [[2]]$improved
# [1] "fewer loops (plyr and do.call())"
 
# [[2]]$unimproved
# [1] "variable.naming (Still an InConsistent mess)"
# [2] "versioning (still hit and miss)"             
 
 
# [[3]]
# [[3]]$most.used
# [1] "RStudio"                        "plain vanilla R"               
# [3] "text editor (Textmate and VIM)"
 
# [[3]]$didnt.use.at.all
# [1] "Emacs"
 
 
# [[4]]
# [[4]]$for.inquisition
# [1] "R"          "Python"     "Javascript"
 
# [[4]]$revisiting
# [1] "J"
 
# [[4]]$discarded
# [1] "Julia (note to self: revisit in a year)"

So one can query this over-the-top structure with code like R.2012[[3]]$didnt.use.at.all to learn [1] "Emacs", but you already knew that, didn’t you?

Despite all my complaints, monologuing about other languages and overall frustration, R has served me well. It’s just that I’d be disappointed if I were still using it a lot in ten years’ time.

Gratuitous picture: building blocks for research (Photo: Luis, click to enlarge).

Of course there was a lot more than R and stats this year. For example, the blogs I read most often have nothing to do with either topic: Isomorphismes (can’t define it), The music of sound (sound design) and Offsetting behaviour (economics/politics in NZ). In fact, I need to read about a broad range of topics to feel human.

P.S. Incidentally, my favorite R function this year was subset(); I’ve been subsetting like there is no tomorrow. By the way, you are welcome to browse around the blog and subset whatever you like.
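
For the record, a one-liner of the kind I’ve been (ab)using all year, on a built-in data set:

# days hotter than 90 °F, keeping only two variables
subset(airquality, Temp > 90, select = c(Ozone, Temp))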

R for inquisition

A post on high-dimensional arrays by @isomorphisms reminded me of APL and, more generally, of matrix languages, which took me back to inquisitive computing: computing not in the sense of software engineering, or databases, or formats, but of learning by poking at problems through a computer.

I like languages not because I can get a job by using one, but because I can think thoughts and express ideas through them. The way we think about a problem is somehow molded by the tools we use: if we have loops, loops we use; if we have a terse matrix notation (see my previous post on Matrix Algebra Useful for Statistics), we may use that.
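
As a tiny illustration of notation molding thought, here are column means computed twice in R on made-up data, once with an explicit loop and once with the terser idiom:

m <- matrix(rnorm(20), nrow = 5)

# if we have loops, loops we use...
means.loop <- numeric(ncol(m))
for(i in 1:ncol(m)) means.loop[i] <- mean(m[, i])

# ...if we have a terse notation, one line does it
means.terse <- colMeans(m)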

I used APL fairly briefly, but I was impressed by some superficial aspects (hey, that’s a weird set of characters that needs a keyboard overlay) and some deeper ones (this is an actual language; cool PDF paper). The APL revolution didn’t happen, at least not directly, but it influenced several other languages (including R). Somehow, as a group, we took a different path from ‘expository programming’, but I think we have to recover at least part of that ethos: programming for understanding the world.

While I often struggle with R frustrations, it is now my primary language for inquisitive computing, although sometimes I dive into something else. I like Mathematica, but I can access it only while plugged into the university network (license limits). Python is turning into a great scientific computing environment, although it still feels held together with sellotape. J is like APL without the Klingon keyboard.

If anything, dealing with other ways of doing things leads to a better understanding of one’s primary language. Idioms that seem natural acquire a new sense of weirdness when compared to other languages. R’s basic functionality is an excellent starting point for inquisitive computing, but don’t forget other languages that can enrich the way we look at problems.

I am curious about people’s favorite inquisitive languages.

Gratuitous picture: inquisition. Why do bloody trees grow like this? (Photo: Luis, click to enlarge).

When R, or any other language, is not enough

This post is tangential to R, although R has a fair share of the issues I mention here, which include research reproducibility, open source, paying for software, multiple languages, salt and pepper.

There is an increasing interest in the reproducibility of research. In many topics we face multiple, often conflicting claims, and as researchers we value the ability to evaluate those claims, including repeating/reproducing research results. While I share the interest in reproducibility, sometimes I feel we obsess too much over only one part of the research process: statistical analysis. Even here, many people focus not on the models per se but only on the code for the analysis, which should only use tools that are free of charge.

There has been enormous progress in the R world on literate programming, where the combination of RStudio + Markdown + knitr has made analyzing data and documenting the process almost enjoyable. Nevertheless, and here comes the BUT, there is a large difference between making the code repeatable and making the research reproducible.
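
For those who haven’t tried the combination, a minimal sketch of a fragment of an R Markdown file (the chunk label and model are made up); knitting the file re-runs the code and weaves the output into the document:

We fitted a simple model relating stopping distance to speed:

```{r cars-model}
fit <- lm(dist ~ speed, data = cars)  # cars ships with R
summary(fit)
```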

As an example, I am currently working on a project that relies on two trials, which have taken a decade to grow. We took a few hundred increment cores from a sample of trees and processed them using a densitometer, an X-ray diffractometer and a few other lab toys. By now you get the idea: actually replicating the research may take quite a few resources before you even start to play with free software. At that point, of course, I want to get the most out of my data, which means that I won’t settle for a half-assed model because the software is not able to fit it. If you think about it, spending a couple of grand on software (say ASReml and Mathematica licenses) doesn’t sound outrageous at all. Furthermore, reproducing this piece of research would require a decade, access to genetic material and lab toys. I’ll give you the code for free, but I can’t give you ten years or $0.25 million…

In addition, the research process may require linking disparate sources of data, for which other languages (e.g. Python) may be more appropriate. Sometimes R is the perfect tool for the job, while other times I feel we have reached peak VBS (Visual Basic Syndrome) in R: people want to use it for everything, even when it’s a bad idea.

In summary,

  • research is much more than a few lines of R (although they are very important),
  • even when considering data collection and analysis, it is a good idea to know more than a single language/software, because it broadens analytical options, and
  • I prefer free (freedom + beer) software for research; however, I rely on non-free, commercial software for part of my work because it happens to be the best option for specific analyses.

Disclaimer: my primary analysis language is R and I often use lme4, MCMCglmm and INLA (all free). However, many (if not most) of my analyses that use genetic information rely on ASReml (paid, not open source). I’ve used Mathematica, Matlab, Stata and SAS for specific applications with reasonably priced academic licenses.

Gratuitous picture: 3000 trees leaning on a foggy Christchurch day (Photo: Luis, click to enlarge).

Publication incentives

(This post continues discussing issues I described back in January in Academic publication boycott)

Some weeks ago I received two emails on the same day: one asking me to submit a paper to an open access journal, and the other inviting me to be the editor of a ‘special issue’ of my choice for another journal. I had never heard of either publication; both follow pretty much the same model: submit a paper for $600 and—if they like it—it will be published. However, the special issue email had a ‘buy your way in’ feeling: find ten contributors (i.e. $6,000) and you get to be an editor. Now, there is nothing wrong per se with open access journals; some of my favorite ones (e.g. PLoS ONE) follow that model. However, I was surprised by the increasing number of new journals looking to fill the gap for ‘I need to publish soon, somewhere’. Surprised until one remembers the incentives at play in academic environments.

If I, or most academics for that matter, want to apply for academic promotion, I have to show that I’m a good guy who has a ‘teaching philosophy’ and whose work is good enough to get published in journals; hopefully lots of them. The first part is a pain, but most people can write something along the lines of ‘I’m passionate about teaching and enjoy creating a challenging environment for students…’ without puking. The second part is trickier, because one has to really have the papers in actual journals.

Personally, I would be happier with only the odd ‘formal’ publication. The first time (OK, first few times) I saw my name in a properly typeset paper it was very exciting, but it gets old after a while. These days, however, I would prefer to just upload my work to a website and say: here you have some ideas and code, play with them. If you like it, great; if not, well, I hope the next one will be better. Nevertheless, this doesn’t count as proper publication, because it isn’t peer reviewed, independently of the number of comments a post may get. PLoS ONE counts, but it’s still a journal, and I (and many other researchers) work on many things that are too small for a paper but cool enough to share. The problem: there is little or no credit for sharing, so Quantum Forest is mostly a ‘labor of love’, which counts for bugger all anywhere else.

These days, as a researcher, I often learn more from other people’s blogs and quick idea exchanges (for example, through Twitter) than via formal publications. I enjoy sharing analyses, ideas and code in this blog. So what’s the point of so many papers in so many journals? I guess that many times we are just ‘ticking the box’ for promotion purposes. In addition, the prospect of facing referees’ or editors’ comments like ‘it would be a good idea to cite the following papers…’ puts me off. And what about authorship arrangements? We have moved from papers with 2-3 authors to enough authors to field a football team (with reserves and everything). Some research groups also run arrangements where ‘I scratch your back (include you as a coauthor) and you scratch mine (include me in your papers)’. We break ideas into little pieces that count for many papers, etc.

Another related issue is the cost of publication (and the barriers it imposes on readership). You see, we referee papers for journals for free (as in for zero money) and tell ourselves that we are doing a professional service to uphold the high standards of whatever research branch we belong to. Then we spend a fortune from our library budgets to subscribe to the same journals for which we reviewed the papers (for free, remember?). It is not a great deal, as many reasonable people have pointed out; I added a few comments in Academic publication boycott.

So, what do we need? We need promotion committees to reduce the weight on publication. We need to move away from impact factor. We can and need to communicate in other ways: scientific papers will not go away, but their importance should be reduced.

Sometimes the way forward is unclear. Incense doesn’t hurt (Photo: Luis).

Making an effort to prepare interesting lectures doesn’t hurt either.

These days it is fairly common for editors to ‘suggest’ including additional references in our manuscripts, which just happen to be papers in the same journal, in the hope of inflating the journal’s impact factor. Referees tend to suggest their own papers as well (sometimes useful, many times not). Lame, isn’t it?

PS. 2012-10-19 15:27 NZST. You also have to remember that just because something was published doesn’t mean it is actually correct: outrageously funny example (via Arthur Charpentier). Yep, through Twitter.

My setup

Yesterday I accidentally started a dialogue on Twitter with the dude running The Setup. Tonight I decided to procrastinate in my hotel room (in Rotovegas for work) by writing up my own Luis Uses This:

Since 2005 I’ve been using Apple computers as my main machines. They tend to be well built and keep running without rebooting for a while, and I ssh into a unix box from them when I need extra oomph. At the moment I have a 2009 15″ MacBook Pro and a 2010 27″ iMac; both computers are pretty much stock, except for extra RAM, and they are still running Snow Leopard. I have never liked Apple mice, so I bought a Logitech mouse for the iMac. I use a generic Android phone, because I’m too cheap to spend money on an iPhone. I don’t have an iPad either, because I don’t have a proper use for it and I dislike lugging around gear for the sake of it.

I’m not ‘addicted’ to any software. However, I do use some programs frequently: R for data analysis/scripting (often with RStudio as a front end), asreml for quantitative genetics, Python for scripting/scraping, XeLaTeX for writing lecture notes and solo papers (because my second surname is Zúñiga), MS Word for anything that requires collaborating with other people, and Keynote for presentations (though sometimes I have to use PowerPoint). I check my university email with Thunderbird or Entourage; my browsing is mostly done in Chrome, but when paranoid I use Tor + Firefox + Vidalia. I use a bunch of other programs, but not often enough for them to deserve a mention. If you think about it, Keynote is the only format that defeats my platform agnosticism (I could still write Word documents in OpenOffice or similar). I almost forgot! I do rely on Dropbox to keep my computers synced.

I keep changing text editors: I don’t understand how people can cope with emacs, and I am uncomfortably writing this post in vim (which is awful as well). I own a copy of Textmate, but I’m annoyed by the author’s abandonment of it, so I’m at the point where I tend to use software-specific editors: RStudio for R, TeXShop for XeLaTeX, etc.

If I weren’t allowed to use a mac at work I’d probably move to Linux; the major hassle would be converting Keynote presentations to something else. I could live with Windows, but I would start with a totally clean install, because I find the pre-installed software very unhelpful. These days I think I’ve been unconsciously preparing myself for the impermanence of software, so if I need to learn a new stats package or a new editor, that’s ‘just fine’: software-agnostic Buddhism.

Non-computer-wise, I’m permanently dissatisfied with my bag/backpack: I haven’t found a nice overnight-trip bag designed for walking around while carrying a laptop. (Did I mention that I like to walk?) Most of them are dorky or plain useless, and my current theory is that the solution involves getting a smaller (say 11-13″) laptop. Because the university depreciates laptops over 4 years, I still have to wait a year to test the theory.

I tend to doodle when thinking or preparing a talk. I prefer to write on unlined photocopy paper with a pen or pencil. A fountain pen is nice, but a $0.20 pen will do too; it has the advantage of being (i) cheap and (ii) available everywhere.

I like to take pictures and muck around with shutter speeds and apertures, which doesn’t mean that I’m any good at it. I use a Nikon Coolpix P7100 camera, but I’m sure that a Canon G12 would do the job as well. It is the smallest camera that gives me the degree of control I like. I process the pictures in Lightroom, which is just OK, but, again, it sort of fits my platform agnosticism.

I’m slowly moving to ebooks, for which I use a Sony Reader (which I got for free) that I manage using Calibre. I keep wireless disabled and non-configured: it is only for reading books and I often use the dictionary feature while reading (I’m always surprised by the large number of English words).

What would be your dream setup?

This would be a ‘sabbatical package’ where I would spend 6 months living in another (earthquake-proof) country, near the ocean, with my family, good food, a light notebook with a week’s worth of battery life, decent internet connection and the chance to catch up with my research subject.

P.S. 2012-04-19. I came across this post on 37signals discussing a switch from OS X to Ubuntu. I think that there is a class of use cases (e.g. web developers, scientific programming) where moving from one to the other should be relatively painless.

Eucalypt earthquake memories

This week was the first anniversary of the February 22nd earthquake in Christchurch. Between that and the first week of lectures it has been hard to find time to write much about data analysis. Thus, if I owe you some analyses, some code or some text be patient, please. I’ll be back soon(ish).

I have never been much of a downtown person, so it is no surprise that I haven’t been in Hagley Park for a while. Today (Sunday in my part of the planet) it was a nice treat to spend some time there and see the interaction between city recovery and the Botanic Gardens, particularly around a Eucalyptus delegatensis surrounded by messages.

Earthquake messages around Eucalyptus delegatensis, Christchurch Botanic Gardens (Photo: Luis).

Detail of messages (Photo: Luis).

Trees can carry a lot of sadness and hope (Photo: Luis).

Up the eucalypt tree (Photo: Luis).

Time for a pause, time to look forward.

P.S. I wanted to attend the Christchurch PechaKucha tonight, but I ran out of time at the end. Next time.

Academic publication boycott

Over the last few weeks a number of researchers have called for, or supported, a boycott against Elsevier; for example, Scientific Community to Elsevier: Drop Dead, Elsevier—my part in its downfall or, more generally, Should you boycott academic publishers?

What metrics are used to compare Elsevier to other publishers? It is common to refer to cost per article; for example, in my area Forest Ecology and Management (one of the most popular general forestry journals) charges USD 31.50 per article, while Tree Genetics and Genomes (published by Springer Verlag) charges EUR 34.95 (roughly USD 46). Nevertheless, researchers affiliated with universities or research institutes rarely pay per article; instead, our libraries have institution-wide subscriptions. Before the great consolidation drive we would have access to individual journal subscription prices (sometimes reaching thousands of dollars per year, each of them). Now libraries buy bundles from a given publisher (e.g. Elsevier, Springer, Blackwell, Wiley, etc.), so it is very hard to get a feeling for the actual cost of a single journal. With this in mind, I am not sure Elsevier ‘deserves’ to be singled out in this mess; at least not any more than Springer or Blackwell, or… a number of other publishers.

Elsevier? No, just Gaahl Gorgoroth.

What we do know is that most of the work is done and paid for by scientists (and society in general) rather than journals. Researchers do the research, and our salaries and research expenses are often paid (at least partially, if not completely) from public funding. We also act as referees for publications, and a subset of us serve on the editorial boards of journals. We do use some journal facilities; for example, an electronic submission system (for which there are free alternatives), and someone will ‘produce’ the papers in electronic format, which would be a small(ish) problem if everyone used LaTeX.

If we go back a few years, many scientific societies used to run their own journals (often scraping by or directly running them at a loss). Then big publishers came ‘to the rescue’, offering economies of scale and an opportunity to make a buck. There is nothing wrong with the existence of publishers facilitating the publication process, but combined with the distortions in the publication process (see below) publishers have achieved tremendous power. At the same time, publishers have hiked prices and moved a large part of their operations to cheaper countries (e.g. India, Indonesia, etc.), leaving us researchers struggling to pay for the subscriptions to read our own work. Not only that, but copyright restrictions in many journals do not allow us to make our work available to the people who paid for the research: you, the taxpayer.

Today scientific societies could run their own journals and completely drop the printed version, so we could have cheaper journals without societies going belly up moving paper across continents. Some questions: would scientific societies be willing to change? And if so, could they change their contractual arrangements with publishers?

Why do we play the game?

The most important part of the problem is that we (the researchers) are willing to participate in the publication process under the current set of rules. Why do we do it? At the end of the day, many of us play the journal publication game because it has been subverted from disseminating important research results to signaling researcher value. University and research institute managers need a way to evaluate their researchers, to manage tenure, promotions, etc. Rather than actually doing a proper evaluation (difficult, expensive and subjective), they go for an easy one (subjective as well): the number of publications in ‘good’ journals. If I want to get promoted, or be taken seriously in funding applications, I have to publish in journals.

I think it is easy to see that I enjoy openly communicating what I have learned (for example, in this blog and on my main site). I would rather spend more time doing this than writing ‘proper’ papers, but of course this is rarely considered important in my evaluations.

If you already are a top-of-the-scale, tenured professor, it is very easy to say ‘I don’t want to play the game anymore’. If you are a newcomer to the game, trying to establish yourself in these times of PhD gluts and very few available research positions, all the incentives line up to play the game.

This is only part of the problem

The questioning does not stop at the publication process; the value of the peer review process is also under scrutiny. Then we enter open science: beyond having access to publications, how much can we trust the results? We have discussions on open access to data, even when the papers are in closed journals. And on, and on.

We have moved from a situation of scarcity, where publishing was expensive, the tools to analyze our data were expensive and making data available was painfully difficult, to a time when all that is trivially easy. I can collect some data, upload it to my site, rely on the democratization of statistics, write it up and create a PDF or HTML version by pressing a button. We would like to have feedback: relatively easy if the publication is interesting. We want an idea of reliability or trust: we could have, for example, some within-organization peer reviewing. Remember, though, that peer reviewing is not a panacea. We want an idea of community standing, which would be the number of people referring to a given document (paper, blog post, wiki, whatever).

Maybe the most important thing is that we are trying to carry on with ‘traditional’ practices that do not extend back more than, say, 100 years. We do not need to do so if we are open to a more fluid environment for publication, analytics and data sharing. Better yet, we wouldn’t need to continue if we stopped putting so much weight on traditional publication avenues when evaluating researchers.

Is Elsevier evil? I don’t think so; or, at least, it doesn’t seem to be significantly worse than other publishers. Have we vested too much power in Elsevier and other publishers? You bet! At the very least we should get back to saner copyright practices, where the authors retain copyright and provide a non-exclusive license to the publishers. Publishers would still make money, but everyone would be able to freely access our research results because, you know, they already pay for the research.

Disclaimer: I have published in journals managed by Elsevier and Springer. I currently have articles under review for both publishers.

P.S. Gaahl Gorgoroth image from Wikipedia.

P.S.2 The cost of knowledge is keeping track of academics taking a stand against Elsevier; 1503 of them at 12:32 NZST 2012-01-30. HT: Arthur Charpentier.

P.S.3 2012-01-31 NZST I would love to know what other big publishers are thinking.

P.S.4 2012-02-01 NZST Research Works Act: are you kidding me?

The Research Works Act (RWA) bill (H.R.3699) introduced to the US Congress on 16 December 2011 proposes that:

No Federal agency may adopt, implement, maintain, continue, or otherwise engage in any policy, program, or other activity that–

(1) causes, permits, or authorizes network dissemination of any private-sector research work without the prior consent of the publisher of such work; or

(2) requires that any actual or prospective author, or the employer of such an actual or prospective author, assent to network dissemination of a private-sector research work.

The idea of calling researchers’ work, funded by government and edited by their peers (probably at least partially funded by government as well), ‘private-sector research work’ because a publishing company applied whatever document template they use on top of the original manuscript is obscene. By the way, Richard Poynder has a post listing a number of publishers that have publicly disavowed the RWA.

P.S.5 2012-02-02 16:38 NZST Doron Zeilberger points to the obvious corollary: we don’t need journals for research dissemination anymore (although we still do for signaling). Therefore, if one is keen on boycotts, it should affect all publishers. Academics are stuck with last century’s publication model.

P.S.6 2012-10-19 15:18 NZST I have some comments on publication incentives.