In our research group we often have people creating statistical models that end up in publications but, most of the time, the practical implementation of those models is lacking. That is, we have a bunch of barely functioning code that is very difficult to use reliably in the operations of the breeding programs. I was keen enough on continuing to use one of the models in our research to rewrite and document the model fitting, and then create another package for using the model in operations.
Unfortunately, neither the data nor the model are mine to give away, so I can’t share them (yet). But I hope these notes will help you if you are in the same boat and need to use your models (or ‘you’ are in fact future me, who tends to forget how or why I wrote code in a specific way).
A basic motivational example
Let’s start with a simple example: linear regression. We want to predict a response using a predictor variable, and then we can predict the response for new values of the predictor contained in new_data with:
my_model <- lm(response ~ predictor, data = my_data)
predictions <- predict(my_model, newdata = new_data)
Saving the model object
The model coefficients needed to predict new values are stored in the my_model object. If we want to use the model elsewhere, we can save the object as an .Rda file, in this case model_file.Rda.
We can later read the model file in, say, a different project and get new predictions using:
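Something along these lines, a minimal sketch using base R (my_data, new_data and model_file.Rda follow the names above; the toy data are made up for illustration):

```r
# Toy stand-ins for my_data and new_data
my_data  <- data.frame(predictor = 1:10,
                       response  = 2 * (1:10) + rnorm(10))
new_data <- data.frame(predictor = c(2.5, 7.5))

my_model <- lm(response ~ predictor, data = my_data)

# Save the fitted model for later use
model_file <- file.path(tempdir(), "model_file.Rda")
save(my_model, file = model_file)

# Later, in a different project: load() restores the object
# under its original name, my_model
rm(my_model)
load(model_file)
predictions <- predict(my_model, newdata = new_data)
```

Note that load() brings the object back under whatever name it had when saved, which is why it pays to give model objects descriptive names before saving them.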
Near-infrared spectroscopy is the stuff of CSI and other crime shows. We measure the reflection at different wavelengths and run a regression analysis using what we want to predict as Y in the model. The number of predictors (wavelengths) is much larger than in the previous example—1,296 for the NIR machine we are using—so it is not unusual to have more predictors than observations. NIR spectra models are often fitted using plsr() (from the pls package) with help from functions from the prospectr package.
I could still use the save/load approach from the motivational example to store and reuse the model object created with pls but, instead, I wanted to implement the model, plus some auxiliary functions, in a package to make the functions easier to use in our lab.
I had two issues/struggles/learning opportunities that I needed to sort out to get this package working:
1. How to automatically load the model object when attaching the package?
Normally, datasets and other objects go in the data folder, where they are made available to the user. Instead, I wanted to make the object internally available. The solution turned out to be quite straightforward: save the model object to a file called sysdata.rda in the R folder of the package. This file is automatically loaded when we run library(package_name). We just need to create that file with something like:
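For example, from the package’s source directory, something like this (a sketch; nir_model is a hypothetical name for the fitted model, and a toy lm stands in for it here):

```r
pkg_dir <- file.path(tempdir(), "nirpkg")  # stand-in for the package source directory
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# A toy model standing in for the fitted pls model
nir_model <- lm(mpg ~ wt, data = mtcars)

# Objects saved in R/sysdata.rda become internal package data:
# available to the package's own functions, not exported to users
save(nir_model, file = file.path(pkg_dir, "R", "sysdata.rda"))
```

If you use the usethis package, `usethis::use_data(nir_model, internal = TRUE)` writes the same file for you.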
2. How to make predict.pls work in the package?
I was struggling to use the predict function because, in my head, it was provided by the pls package. However, pls only extends the predict generic, which comes with the default R installation as part of the stats package. In the end, I sorted it out with the following Imports, Depends and LazyData entries in the DESCRIPTION file:
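The relevant DESCRIPTION fields ended up looking something like this (a sketch; the package name and version numbers are made up). Listing pls under Depends attaches it when our package is loaded, so the predict method for pls models can be found:

```
Package: nirpredictr
Version: 0.1
Depends:
    R (>= 3.0.0),
    pls
Imports:
    stats
LazyData: true
```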
Nothing groundbreaking, I know, but I spent a bit of time sorting out those couple of annoyances before everything fell into place. Right now we are using the models in a much easier and more reproducible way.
If you search for data analysis workflows for research you will find lots of blog posts on using R + databases + git, etc. While in some cases I may end up working with a combination like that, it’s much more likely that reality is closer to a bunch of emailed Excel or CSV files.
Some may argue that one should move the whole group of collaborators to work the right way. In practice, well, not everyone has the interest and/or the time to do so. In one of our collaborations we are dealing with a trial established in 2009, and I was tracking a field coding mistake (as in happening outdoors, doing field work, assigning codes to trees), so I had to backtrack to where the errors were introduced. After checking emails from three collaborators, I think I have put together the story and found the correct code values in a couple of files going back two years.
The new analysis lives in an RStudio project with the following characteristics:
Folder in Dropbox, so it’s copied in several locations and it’s easy to share.
Excel or CSV files with their original names (warts and all), errors, etc. Resist the temptation to rename the files to sane names: keeping the originals makes it easier to trace the history of the project.
The important part: a text file (Markdown if you want) documenting the names of the data files, plus who sent them to me and when.
I have postponed this post many times, so here it goes in its current state: incomplete, partially digested, as a way to start a conversation.
As a researcher, it benefits me to say that all countries should invest (or should that be spend?) more resources in research. The larger the budget, the more likely it is that I will get a slice of it. However, when I look at Chile from a distance, something bothers me in the logic of greater investment in research.
The story goes more or less like this:
Chile devotes a very small proportion of its budget to scientific research.
Developed countries invest much more in research.
Therefore, if Chile wants to become developed, it has to invest more in research.
How is Chile going to take part in the knowledge economy with so little research?
At a superficial level the story seems to make sense. But when I stop to think about it, “the story doesn’t convince me, it just chokes me” (as What’s-his-name would say).
The first two points are completely true: Chile devotes a small (some would say minuscule) proportion of the national budget to scientific research. Developed countries invest proportionally many times more of their gross domestic product (and orders of magnitude more in absolute terms) in research (data here). However, and this “however” is very relevant, there is an assumption of causality at the heart of my “but”: the assumption that countries are more developed because they invested more in research.
That is where I start to doubt. Developed countries invest much more in art, but are they more developed because they spend more money on ballet (or theater, or films, or statues)? Perhaps richer countries can afford the luxury of investing more in culture (and science is a cultural expression), and that investment may have positive effects on the economy (or it may not). For example, Japan invests a larger percentage than Germany in research, and the German economy has grown much more than the Japanese one. Of course, you could point out, the economic and cultural contexts are different. Why would you expect a direct relationship? Well, that is my point.
Perhaps a better question is: why should scientific research be evaluated as a path to development? We researchers are partly to blame on this front. At some point the budget had to be justified, someone said “but it’s an investment”, and we have kept repeating it ever since.
We researchers are a privileged group: we have had the best education available; we are the apex of the educational system. To some extent, our demands for more funding represent an extension of that privilege, while most of the population receives an education that condemns them to minimum-wage jobs.
If one were in charge of developing public policy, which investments would maximize the benefit to society? Perhaps investing in good-quality education for the majority of the population, or improving health and nutrition in the least favored sectors, would have a higher social return. Something like the efforts to reduce child mortality (data here; by the way, the countries with the highest mortality in the graph are Mexico and Turkey).
And the knowledge economy?
Think of Google, Facebook, Uber, Airbnb, …, Apple. How much science is there in these companies? How much technology and engineering? A simple bet: there is much more technology, engineering and entrepreneurship than science.
Monsanto, 23andMe, Syngenta: a similar story.
I don’t think we lack science, but there is a shortage of technical/scientific entrepreneurship. There is an oversupply of PhDs and a shortage of masters graduates who integrate scientific understanding with commercialization. This situation is not exclusive to Chile: there are too many PhDs in a good part of the planet. In many areas education works as a pyramid scheme: there are many more students graduating with postgraduate degrees than positions available in universities and research institutes.
There is a World Economic Forum publication that tries to measure which countries are the most creative. To that end it uses an index with three factors: technology (investment in research and development, patents per capita), talent (percentage of adults with tertiary education and of workers in creative activities) and tolerance (treatment of immigrants, ethnic minorities and sexual alternatives). It is possible to have a high creativity index with values that are not particularly high for some factor, and creativity is positively correlated with economic performance. Chile certainly has to work on talent and tolerance, which involve larger sectors of the population; that is, they are essentially more democratic indicators.
Are you saying we shouldn’t fund research?
When looking to create funding opportunities, it is easy to end up selling the idea of a close connection between science and development, but how many researchers actually work with that in mind? If one is honest, practically nobody studies some obscure corner of a scientific topic in order to develop the country. One does it because the challenge is interesting, out of wanting to understand and explain.
To make it very clear: I am not saying stop funding scientific research. What I am saying is that the motives commonly presented by people lobbying the government have a tenuous causal relationship with the economic development of the country. Research deserves funding as an expression of the country’s culture, just as theater, music, etc. do.
P.S. Most of my research connects with applications of genetic improvement, statistics and wood science in the forest industry (although on occasion I have worked on the genetics of squid personality, reproductive performance in flies and a few other oddities).
This week I’ve been feeling tired of the excessive fanaticism (or zealotry) around open source software (OSS) and R in general. I do use a fair amount of OSS and pushed for the adoption of R in our courses; in fact, I do think OSS is a Good Thing™. I do not like, however, the constant yabbering on why using exclusively OSS in science is a good idea, or the reduction of science to repeatability‡ and computability (both of which I covered in my previous post). I also dislike the snobbery of ‘you shall use R and not Excel at all, because the latter is evil’ (going back ages).
We often have several experiments running during the year and most of the time we do not bother setting up a database to keep the data. Doing that would essentially mean that I would have to do it, and I have a few more important things to do. Therefore, many data sets end up in… (drum roll here) Microsoft Excel.
How should a researcher set up data in Excel? Rather than reinventing the wheel, I’ll use a(n) (im)perfect diagram that I found years ago in a Genstat manual.
I like it because:
It makes clear how to set up the experimental and/or sampling structure; one can handle any design with enough columns.
It also manages any number of traits assessed in the experimental units.
It contains metadata in the first few rows, which can be easily skipped when reading the file. I normally convert Excel files to text and then I skip the first few lines (using skip in R or firstobs in SAS).
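Reading one of those converted files in R could look like this (a sketch; the file contents and the three metadata rows are made-up assumptions):

```r
# Suppose trial.csv starts with three metadata rows before the real header
lines <- c("Trial: hypothetical example",
           "Assessed: 2012",
           "Location: somewhere",
           "block,plot,tree,height",
           "1,1,1,12.3",
           "1,1,2,11.8")
csv_file <- file.path(tempdir(), "trial.csv")
writeLines(lines, csv_file)

# skip = 3 jumps over the metadata; the next line becomes the header
trial <- read.csv(csv_file, skip = 3, header = TRUE)
```

The same idea works in SAS with firstobs (and one extra line for the header), so the metadata can stay in the file without bothering the analysis.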
People doing data analysis often start convulsing at the mention of Excel; personally, I deeply dislike it for analyses but it makes data entry very easy, and even a monkey can understand how to use it (I’ve seen them typing, I swear). The secret for sane use is to use Excel only for data entry; any data manipulation (subsetting, merging, derived variables, etc.) or analysis is done in statistical software (I use either R or SAS for general statistics, ASReml for quantitative genetics).
It is far from a perfect solution but it fits in the realm of the possible and, considering all my work responsibilities, it’s a reasonable use of my time. Could someone make a weird change in the spreadsheet? Yes. Could you fart while moving the mouse and create a non-obvious side effect? Yes, I guess so. Will it make your life easier, and make it possible to complete your research projects? Yes sir!
P.S. One could even save data using a text-based format (e.g. csv, tab-delimited) and use Excel only as a front-end for data entry. Other spreadsheets are of course equally useful.
P.S.2. Some of my data are machine-generated (e.g. by acoustic scanners and NIR spectroscopy) and get dumped by the machine in a separate (usually very wide; for example, 2,000 columns) text file for each sample. I never put them in Excel, but read them directly (a directory-full of them) into R for manipulation and analysis.
“Should I reject a manuscript because the analyses weren’t done using open software?” I overheard a couple of young researchers discussing. Initially I thought it was a joke but, to my surprise, it was not funny at all.
There is an unsettling, underlying idea in that question: the value of a scientific work can be reduced to its computability. If I, the reader, cannot replicate the computation the work is of little, if any, value. Even further, my verification has to have no software cost involved, because if that is not the case we are limiting the possibility of computation to only those who can afford it. Therefore, the almost unavoidable conclusion is that we should force the use of open software in science.
What happens if the analyses were run using a point-and-click interface? For example, SPSS, JMP, Genstat, Statistica and a few other programs allow access to fairly complex analytical algorithms via a system of menus and icons. Most of them are not open source, nor do they generate code for the analyses. Should we ban their use in science? One could argue that if users just spent the time learning a programming language (e.g. R or Python) they would be free of the limitations of point-and-click. Nevertheless, we would be shifting accessibility from people who can pay for an academic software license to people who can learn and moderately enjoy programming. Are we better off as a research community with that shift?
There is another assumption: open software will always provide good (or even appropriate) analytical tools for any problem. I assume that in many cases OSS is good enough and that there is a subset of problems where it is the best option. However, there is another subset where it is suboptimal. For example, I deal a lot with linear mixed models used in quantitative genetics, an area where R is seriously deficient; in fact, I would have to ignore the last 15 years of statistical development to run large problems. Given that some of the data sets are worth millions of dollars and decades of work, should I sacrifice the use of the best models so that a hypothetical someone, somewhere, can actually run my code without paying for an academic software license? That was a rhetorical question, by the way, as I would not do it.
There are trade-offs and unintended consequences in all research policies. This is one case where I think the negative effects would outweigh the benefits.
P.S. 2013-12-20 16:13 NZST Timothée Poisot provides some counterarguments for a subset of articles: papers about software.
While taking a shower I was daydreaming about what would happen if one were to invent journals today, with a very low cost of publication and no physical limits to the size of a publication. My shower answer was that there would be little chance for a model like traditional printed journals.
One could create a central repository (a bit like the arXiv) taking submissions of the article text + figures, which are automatically translated to a decent-looking web format and a printable version. This would be the canonical version of the article and would get assigned a unique identifier. The submitters would get to update their article any number of times, creating versions (pretty much like software). This way they could fix any issues without breaking references from other articles.
There would be a payment for submitting articles to the repository (say $100 for the sake of argument), covering the costs of hosting and infrastructure, serving at the same time as a deterrent for spam.
Journals in their current form would tend to disappear, but there would be topical aggregators (or feeds). Thus, the ‘Journal of whatever’ would now be a curator of content from the ‘big bucket’ central repository, pulling aside articles worthy (in their opinion) of more scrutiny, or commentary, etc. This could be either a commercial venture or an amateur labor of love, run by people very interested in a given topic, and could even apply a different format to the canonical article, always pointing back to the unique identifier in the central repository.
Some aggregators could be highly picky and recognized by readers, becoming the new Nature or Science. Authors could still ‘submit’ or recommend their papers to these aggregators. However, papers could also be in multiple feeds and copyright would probably stay with the authors for a limited amount of time. The most important currency for academics is recognition, and this system would provide it, as well as the potential for broad exposure and no cost for readers or libraries.
There would be no pre-publication peer review because, let’s face it, currently it’s more of a lottery than anything else. Post-publication peer review, done broadly by the research community, would be the new standard.
Any big drawbacks for my shower daydream?
P.S.1 2013-12-15 13:40 NZST Thomas Lumley pointed me to a couple of papers on the ‘Selected Papers Network’, which would be one way of dealing with prestige/quality/recognition signals needed by academics.
P.S.2 2013-12-15 14:43 NZST This ‘journals are feeds’ approach fits well with how I read papers: I do not read journals, but odd papers that I find either via web searches or recommended by other researchers. There are, however, researchers who aim to read whole issues, although I can’t make sense of that.
This post is tangential to R, although R has a fair share of the issues I mention here, which include research reproducibility, open source, paying for software, multiple languages, salt and pepper.
There is an increasing interest in the reproducibility of research. In many topics we face multiple, often conflicting claims and as researchers we value the ability to evaluate those claims, including repeating/reproducing research results. While I share the interest in reproducibility, sometimes I feel we are obsessing too much over only one part of the research process: statistical analysis. Even there, many people focus not on the models per se, but only on the code for the analysis, which should only use tools that are free of charge.
There has been enormous progress in the R world on literate programming, where the combination of RStudio + Markdown + knitr has made analyzing data and documenting the process almost enjoyable. Nevertheless, and here is the BUT coming, there is a large difference between making the code repeatable and making research reproducible.
As an example, currently I am working on a project that relies on two trials, which have taken a decade to grow. We took a few hundred increment cores from a sample of trees and processed them using a densitometer, an X-ray diffractometer and a few other lab toys. By now you get the idea: actually replicating the research may take quite a few resources before you even start to play with free software. At that point, of course, I want to be able to get the most out of my data, which means that I won’t settle for a half-assed model because the software is not able to fit it. If you think about it, spending a couple of grand on software (say ASReml and Mathematica licenses) doesn’t sound outrageous at all. Furthermore, reproducing this piece of research would require a decade, access to the genetic material and the lab toys. I’ll give you the code for free, but I can’t give you ten years or $0.25 million…
In addition, the research process may require linking disparate sources of data for which other languages (e.g. Python) may be more appropriate. Sometimes R is the perfect tool for the job, while other times I feel like we have reached peak VBS (Visual Basic Syndrome) in R: people want to use it for everything, even when it’s a bad idea.
research is much more than a few lines of R (although they are very important),
even when considering data collection and analysis it is a good idea to know more than a single language/software, because it broadens analytical options, and
I prefer free (freedom+beer) software for research; however, I rely on non-free, commercial software for part of my work because it happens to be the best option for specific analyses.
Disclaimer: my primary analysis language is R and I often use lme4, MCMCglmm and INLA (all free). However, many (if not most) of my analyses that use genetic information rely on ASReml (paid, not open source). I’ve used Mathematica, Matlab, Stata and SAS for specific applications with reasonably priced academic licenses.
Some weeks ago I received two emails on the same day: one asking me to submit a paper to an open access journal, while the other invited me to be the editor of a ‘special issue’ of my choice for another journal. I hadn’t heard of either publication before; both follow pretty much the same model: submit a paper for $600 and—if they like it—it will be published. However, the special-issue email had this ‘buy your way in’ feeling: find ten contributors (i.e. $6,000) and you get to be an editor. Now, there is nothing wrong per se with open access journals; some of my favorite ones (e.g. PLoS ONE) follow that model. However, I was surprised by the increasing number of new journals aiming to fill the gap for ‘I need to publish soon, somewhere’. Surprised until one remembers the incentives at play in academic environments.
If I, or most academics for that matter, want to apply for academic promotion, I have to show that I’m a good guy who has a ‘teaching philosophy’ and that my work is good enough to get published in journals; hopefully in lots of them. The first part is a pain, but most people can write something along the lines of ‘I’m passionate about teaching and enjoy creating a challenging environment for students…’ without puking†. The second part is trickier because one has to actually have the papers in actual journals.
Personally, I would be happier with only the odd ‘formal’ publication. The first time (OK, the first few times) I saw my name in a properly typeset paper it was very exciting, but it gets old after a while. These days, however, I would prefer to just upload my work to a website and say: here you have some ideas and code, play with them. If you like it, great; if not, well, next time I hope it’ll be better. Nevertheless, this doesn’t count as proper publication because it isn’t peer reviewed, independently of the number of comments the post may get. PLoS ONE counts, but it’s still a journal, and I (and many other researchers) work on many things that are too small for a paper but cool enough to share. The problem: there is little or no credit for sharing, so Quantum Forest is mostly a ‘labor of love’, which counts for bugger all anywhere else.
These days as a researcher I often learn more from other people’s blogs and quick idea exchanges (for example, through Twitter) than via formal publications. I enjoy sharing analyses, ideas and code in this blog. So what’s the point of so many papers in so many journals? I guess that many times we are just ‘ticking the box’ for promotion purposes. In addition, the idea of facing referees’ or editors’ comments like ‘it would be a good idea that you cite the following papers…’ puts me off‡. And what about authorship arrangements? We have moved from papers with 2-3 authors to enough authors to field a football team (with reserves and everything). Some research groups also run arrangements where ‘I scratch your back (include you as a coauthor) and you scratch mine (include me in your papers)’. We break ideas into little pieces so they count for many papers, etc.
Another related issue is the cost of publication (and the barriers it imposes on readership). You see, we referee papers for journals for free (as in for zero money) and tell ourselves that we are doing a professional service to uphold the high standards of whatever research branch we belong to. Then we spend a fortune from our library budget to subscribe to the same journals for which we reviewed the papers (for free, remember?). It is not a great deal, as many reasonable people have pointed out; I added a few comments in academic publication boycott.
So, what do we need? We need promotion committees to reduce the weight put on publication. We need to move away from the impact factor. We can and need to communicate in other ways: scientific papers will not go away, but their importance should be reduced.
† Making an effort to prepare interesting lectures doesn’t hurt either. ‡ These days it is fairly common for editors to ‘suggest’ including additional references in our manuscripts, which just happen to be papers in the same journal, hoping to inflate the journal’s impact factor. Referees tend to suggest their own papers (sometimes useful, many times not). Lame, isn’t it?
PS. 2012-10-19 15:27 NZST. You also have to remember that just because something was published doesn’t mean it is actually correct: outrageously funny example (via Arthur Charpentier). Yep, through Twitter.
The media in New Zealand briefly covered the destruction of a trial with genetically modified pines (Pinus radiata D. Don, common name Radiata pine or Monterey pine) near Rotorua. This is not the first time that Luddites destroy a trial, ignoring that it was established following regulations from the Environmental Protection Agency. Most people have discussed this pseudo-religious vandalism either from the point of view of wasted resources (money and, more importantly, time; delays in publication for scientists, etc.) or from that of criminal activity.
I will discuss something slightly different, when would we plant genetically modified trees?
Some background first
In New Zealand, plantations of forest trees are established by the private sector (mostly forest companies and small growers, usually farmers). Most of the stock planted in the country has some degree of (traditional) breeding, ranging from seed mixes with a large number of parents to the deployment of genetically identical clones. The higher the degree of improvement, the more likely it is that tree deployment involves a small number of highly selected genotypes. Overall, most tree plantations are based on open-pollinated seed with a modest degree of genetic improvement, which is much more genetically diverse than most agricultural crops. In contrast, agricultural crops tend to deploy named clonal varieties, which are what we buy in supermarkets: Gold kiwifruit, Gala apples, Nadine potatoes, etc.
Stating the obvious: tree and agricultural growers will pay more for genetic material if they expect the seeds, cuttings, tubers, etc. to provide a higher quantity and/or quality of products that will pay for the extra expense. Here we can see a big difference between people growing trees and people growing annual/short-rotation crops: there is a long lag between tree establishment and income from the trees, which means that when one runs a discounted cash flow analysis to estimate profitability:
Income is in the distant future (say 25-30 years) and is heavily discounted.
Establishment costs, which include buying the genetic material, are not discounted because they happen right now.
Unsurprisingly, growers want to reduce establishment costs as much as they can, and remember that the cost of the trees is an important component. This means that most people planting trees will go for cheaper trees with a low level of genetic improvement (often seedlings), unless they are convinced that they can recover the extra expense with more improved trees (usually clones, which cost at least twice as much as seedlings).
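The effect of discounting can be sketched with a toy calculation (all numbers are made up for illustration, not actual forestry figures):

```r
discount_rate <- 0.08   # hypothetical discount rate
harvest_year  <- 28     # income arrives only at harvest

income_at_harvest  <- 50000   # per hectare at harvest, made up
establishment_cost <- 1500    # paid now, so not discounted

# Present value of the harvest income: shrunk by 28 years of discounting
pv_income <- income_at_harvest / (1 + discount_rate)^harvest_year

# Net present value of the stand
npv <- pv_income - establishment_cost
```

Every extra dollar of establishment cost reduces the NPV by a full dollar, while the distant income is divided by (1 + r)^28, which is why growers scrutinize the price of planting stock so closely.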
What’s the relationship with genetic modification?
Modification of any organism is an expensive process, which means that:
One would only modify individuals with an outstanding genetic background; i.e. start with a good genotype to end up with a great one.
Successful modifications will be clonally propagated to scale up the modification, driving down unit cost.
Thus, we have a combination of very good genotypes plus clonal propagation plus no discounting of the up-front expense, which would make establishment costs very high (although not impossible). There is a second element that, at least for now, would delay adoption. Most large forest growers have some type of product certification, which establishes that the grower is using good forestry, environmental and social practices. Think of it as a sticker that says the producer of this piece of wood is a good guy, so please feel confident about buying this product; that is, the sticker is part of a marketing strategy. Currently some forest certification organizations do not accept the use of genetically modified organisms (e.g. Forest Certification Council, PDF of GMO policy).
This does not mean that it is not financially possible to plant genetically modified trees. For one, modification costs would fall with economies of scale (as with most biotechnologies), and one of the reasons we don’t have those economies is the political pressure from almost-religious zealots against GMOs, which makes people scared of being the first to plant GM trees/plants. Another option is to change the GMO policy of some certification agencies, or to rely on other certification organizations that do accept GMOs. Each individual forest company would have to evaluate the trade-offs of the certification decision, as they do not act as a block.
A simple scenario
Roughly 80% of the forest plantations in New Zealand correspond to radiata pine. Now imagine that we face a very destructive pest or disease that has the potential to severely damage the survival/growth of the trees. I know that it would take us a long time (decades?) to breed trees resistant to this problem. I also know that the GM crowd could insert several disease-resistance genes and silence flowering, so modified trees would not reproduce. Would you support the use of genetic modification to save one of the largest industries of the country? I would.
However, before using the technology I would like to have access to data from trials growing in New Zealand conditions. The destruction of trials makes it extremely difficult to make informed decisions, and this is the worst part of the crime. These people are not just destroying trees but damaging our ability to make proper decisions as a society, evaluating the pros and cons of our activities.
P.S. These are just my personal musings about the subject and do not represent the views of the forest companies, the university or anyone else. I do not work on genetic modification, but I am a quantitative geneticist & tree breeder.
P.S.2. While I do not work on genetic modification—so I’d struggle to call that crowd ‘colleagues’—I support researchers on that topic in their effort to properly evaluate the performance of genetically modified trees.
What metrics are used to compare Elsevier to other publishers? It is common to refer to cost per article; for example, in my area Forest Ecology and Management (one of the most popular general forestry journals) charges USD 31.50 per article, while Tree Genetics and Genomes (published by Springer Verlag) costs EUR 34.95 (roughly USD 46). Nevertheless, researchers affiliated with universities or research institutes rarely pay per article; instead, our libraries have institution-wide subscriptions. Before the great consolidation drive we had access to individual journal subscription prices (sometimes reaching thousands of dollars per year, each of them). Now libraries buy bundles from a given publisher (e.g. Elsevier, Springer, Blackwell, Wiley, etc.), so it is very hard to get a feel for the actual cost of a single journal. With this in mind, I am not sure Elsevier 'deserves' to be singled out in this mess; at least not any more than Springer or Blackwell, or… a number of other publishers.
What we do know is that most of the work is done and paid for by scientists (and society in general) rather than by journals. Researchers do the research, and our salaries and research expenses are often paid (at least partially, if not completely) from public funding. We also act as referees for publications, and a subset of us serve on the editorial boards of journals. We do use some journal facilities; for example, an electronic submission system (for which there are free alternatives), and someone will 'produce' the papers in electronic format, which would be a small(ish) problem if everyone used LaTeX.
Some years ago many scientific societies ran their own journals (often scraping by or directly running them at a loss). Then big publishers came 'to the rescue', offering economies of scale and an opportunity to make a buck. There is nothing wrong with the existence of publishers facilitating the publication process; but combined with the distortions in the publication process (see below), publishers have achieved tremendous power. At the same time, publishers have hiked prices and moved a large part of their operations to cheaper countries (e.g. India, Indonesia, etc.), leaving us researchers struggling to pay for the subscriptions to read our own work. Not only that, but copyright restrictions in many journals do not allow us to make our work available to the people who paid for the research: you, the taxpayer.
Today scientific societies could run their own journals and completely drop the printed version, so we could have cheaper journals while societies wouldn't go belly up moving paper across continents. Some questions: would scientific societies be willing to change? If so, could they change their contractual arrangements with publishers?
Why do we play the game?
The most important part of the problem is that we (the researchers) are willing to participate in the publication process under the current set of rules. Why do we do it? At the end of the day, many of us play the journal publication game because it has been subverted from disseminating important research results to signaling researcher value. University and research institute managers need a way to evaluate their researchers, managing tenure, promotions, etc. Rather than doing a proper evaluation (difficult, expensive and subjective), they go for an easy one (subjective as well): the number of publications in 'good' journals. If I want to get promoted, or to be taken seriously in funding applications, I have to publish in journals.
I think it is easy to see that I enjoy openly communicating what I have learned (for example, in this blog and on my main site). I would rather spend more time doing this than writing 'proper' papers, but of course this is rarely considered important in my evaluations.
If you are already a top-of-the-scale, tenured professor, it is very easy to say 'I don't want to play the game anymore'. If you are a newcomer, trying to establish yourself in these times of PhD gluts and very few available research positions, all the incentives line up for playing the game.
This is only part of the problem
The questioning does not stop at the publication process; the value of the peer review process is also under scrutiny. Then we enter open science: beyond having access to publications, how much can we trust the results? We have discussions on open access to data even when the papers are in closed journals. And on, and on.
We have moved from a situation of scarcity, where publishing was expensive, the tools to analyze our data were expensive, and making data available was painfully difficult, to a time when all that is trivially easy. I can collect some data, upload it to my site, rely on the democratization of statistics, write it up and create a PDF or HTML version by pressing a button. We would like feedback: relatively easy if the publication is interesting. We want an idea of reliability or trust: we could have, for example, some within-organization peer reviewing (remember, though, that peer reviewing is not a panacea). We want an idea of community standing, which would be the number of people referring to a given document (paper, blog post, wiki, whatever).
Maybe the most important thing is that we are trying to carry on with 'traditional' practices that do not extend back beyond, say, 100 years. We would not need to do so if we were open to a more fluid environment for publication, analytics and data sharing. Better yet, we wouldn't need to continue if we stopped putting so much weight on traditional publication avenues when evaluating researchers.
Is Elsevier evil? I don't think so; or, at least, it doesn't seem to be significantly worse than other publishers. Have we vested too much power in Elsevier and other publishers? You bet! At the very least we should get back to saner copyright practices, where authors retain copyright and grant a non-exclusive license to the publisher. Publishers would still make money, but everyone would be able to freely access our research results because, you know, they already pay for the research.
Disclaimer: I have published in journals managed by Elsevier and Springer. I currently have articles under review for both publishers.
P.S.3 2012-01-31 NZST I would love to know what other big publishers are thinking.
P.S.4 2012-02-01 NZST Research Works Act: are you kidding me?
The Research Works Act (RWA) bill (H.R.3699) introduced to the US Congress on 16 December 2011 proposes that:
No Federal agency may adopt, implement, maintain, continue, or otherwise engage in any policy, program, or other activity that–
(1) causes, permits, or authorizes network dissemination of any private-sector research work without the prior consent of the publisher of such work; or
(2) requires that any actual or prospective author, or the employer of such an actual or prospective author, assent to network dissemination of a private-sector research work.
The idea of calling researchers' work, funded by government and edited by their peers (probably at least partially with government funds as well), 'private-sector research work' because a publishing company applied whatever document template they use on top of the original manuscript is obscene. By the way, Richard Poynder has a post that lists a number of publishers that have publicly disavowed the RWA.
P.S.5 2012-02-02 16:38 NZST Doron Zeilberger points to the obvious corollary: we don't need journals for research dissemination anymore (although we still do for signaling). Therefore, if one is keen on boycotts, they should affect all publishers. Academics are stuck with last century's publication model.