The last few weeks there has been a number of researchers calling for, or supporting, a boycott against Elsevier; for example, Scientific Community to Elsevier: Drop Dead, Elsevier—my part in its downfall or, more general, Should you boycott academic publishers?

What metrics are used to compare Elsevier to other publishers? It is common to refer to cost-per-article; for example, in my area Forest Ecology and Management (one of the most popular general Forestry Journals) charges USD 31.50 per article but Tree Genetics and Genomes (published by Springer Verlag) costs EUR 34.95 (roughly USD 46). Nevertheless, researchers affiliated to universities or research institutes rarely pay per article; instead, our libraries have institution-wide subscriptions. Before the great consolidation drive we would have access to individual journal subscription prices (sometimes reaching thousands of dollars per year, each of them). Now libraries buy bundles from a given publisher (e.g. Elsevier, Springer, Blackwell, Wiley, etc) so it is very hard to get a feeling of the actual cost of a single journal. With this consideration, I am not sure if Elsevier ‘deserves’ being singled out in this mess; at least not any more than Springer or Blackwell, or… a number of other publishers.

Elsevier? No, just Gaahl Gorgoroth

What we do know is that most of the work is done and paid for by scientists (and society in general) rather than journals. Researchers do research and our salaries and research expenses are many times paid for (at least partially if not completely) by public funding. We also act as referees for publications and a subset of us are part of editorial boards of journals. We do use some journal facilities; for example, an electronic submission system (for which there are free alternatives) and someone will ‘produce’ the papers in electronic format, which would be a small(ish) problem if everyone used LaTeX.

If we go back some years ago, many scientific societies used to run their own journals (many times scrapping by or directly running them at a loss). Then big publishers came ‘to the rescue’ offering economies of scale and an opportunity to make a buck. There is nothing wrong with the existence of publishers facilitating the publication process; but when combined with the distortions in the publication process (see below) publishers have achieved a tremendous power. At the same time, publishers have hiked prices and moved a large part of their operations to cheaper countries (e.g. India, Indonesia, etc) leaving us researchers struggling to pay for the subscriptions to read our own work. Not only that, but copyright restrictions in many journals do not allow us to make our work available to the people who paid for the research: you, the tax payer.

Today scientific societies could run their own journals and completely drop the printed version, so we could have cheaper journals while societies wouldn’t go belly up moving paper across continents. Some questions, Would scientific societies be willing to change? If that’s the case, Could they change their contractual arrangements with publishers?

## Why do we play the game?

The most important part of the problem is that we (the researchers) are willing to participate in the publication process with the current set of rules. Why do we do it? At the end of the day, many of us play the journal publication game because it has been subverted from dissemination of important research results to signaling researcher value. University and research institute managers need to have a way to evaluate their researchers, managing tenures, promotions, etc. Rather than going for actually doing a proper evaluation (difficult, expensive and subjective), they go for an easy one (subjective as well): number of publications in ‘good’ journals. If I want to get promoted or taken seriously in funding applications I have to publish in journals.

I think it is easy to see that I enjoy openly communicating what I have learned (for example this blog and in my main site). I would rather spend more time doing this than writing ‘proper’ papers, but of course this is rarely considered important in my evaluations.

If you already are a top-of-the-scale, tenured professor it is very easy to say ‘I don’t want to play the game anymore’. If you are a newcomer to the game, trying to establish yourself in these times of PhD gluts and very few available research positions, all incentives line up to play the game.

## This is only part of the problem

The questioning does not stop at the publication process. Instead, the peer value of review process is also under scrutiny. Then we enter into open science: beyond having access to publications, How much can we trust the results? We have discussions on open access data even when it is in closed journals. And on, and on.

We have moved from a situation of scarcity, where publishing was expensive, the tools to analyze our data were expensive and making data available was painfully difficult to a time when all that is trivially easy. I can collect some data, upload it to my site, rely on the democratization of statistics, write it up and create a PDF or HTML version by pressing a button. We would like to have feedback: relatively easy if the publication is interesting. We want an idea of reliability or trust: we could have, for example, some within-organization peer reviewing. Remember though that peer reviewing is not a panacea. We want to have an idea of community standing, which would be the number of people referring to that document (paper, blog post, wiki, whatever).

Maybe the most important thing is that we are trying to carry on with ‘traditional’ practices that do not extend beyond, say, 100 years. We do not need to do so if we are open to a more fluid environment on both publication, analytics and data sharing. Better, we wouldn’t need to continue if we stopped putting so much weight on traditional publication avenues when evaluating researchers.

Is Elsevier evil? I don’t think so; or, at least, it doesn’t seem to be significantly worse than other publishers. Have we vested too much power on Elsevier and other publishers? You bet! At the very least we should get back to saner copyright practices, where the authors retain copyright and provide a non-exclusive license to the publishers. Publishers will still make money but everyone will be able to freely access our research results because, you know, they already pay for the research.

Disclaimer: I have published in journals managed by Elsevier and Springer. I currently have articles under review for both publishers.

P.S. Gaahl Gorgoroth image from Wikipedia.

P.S.2 The cost of knowledge is keeping track of academics taking a stand against Elsevier; 1503 of them at 12:32 NZST 2012-01-30. HT: Arthur Charpentier.

P.S.3 2012-01-31 NZST I would love to know what other big publishers are thinking.

## P.S.4 2012-02-01 NZST Research Works Act: are you kidding me?

The Research Works Act (RWA) bill (H.R.3699) introduced to the US Congress on 16 December 2011 proposes that:

No Federal agency may adopt, implement, maintain, continue, or otherwise engage in any policy, program, or other activity that–

(1) causes, permits, or authorizes network dissemination of any private-sector research work without the prior consent of the publisher of such work; or

(2) requires that any actual or prospective author, or the employer of such an actual or prospective author, assent to network dissemination of a private-sector research work.

The idea of calling researcher’s work funded by government, edited by their peers (probably at least partially funded by government funds) private-sector research work because a publishing company applied whatever document template they use on top of the original manuscript is obscene. By the way, Richard Poynder has a post that lists a number of publishers that have publicly disavowed the RWA.

P.S.5 2012-02-02 16:38 NZST Doron Zeilberger points to the obvious corollary: we don’t need journals for research dissemination anymore (although still we do for signaling). Therefore if one is keen on boycotts it should affect all publishers. Academics are stuck with last century’s publication model.

P.S.6 2012-10-19 15:18 NZST I have some comments on publication incentives.

# Mid-January flotsam: teaching edition

I was thinking about new material that I will use for teaching this coming semester (starting the third week of February) and suddenly compiled the following list of links:

Enough procrastination. Let’s keep on filling out PBRF forms; it is the right time for that hexennial activity.

Gratuitous picture: Avonhead Cemetery, a place to reflect on teaching and the meaning of life (Photo by Luis).

# IPython

I installed the latest version (0.12) of IPython from source in my mac, but forgot that I had a previous version (0.10) installed using easy_install. When trying to run ipython I kept getting the error: No module named terminal.ipapp.

Fixing it. I ran easy_install -m ipython in Terminal, so Python doesn’t continue looking for the old package, as explained here. Then I was able to navigate to /Library/Python/2.6/site-packages and then use rm -r ipython-0.10.1-py2.6.egg. Now everything works fine.

# R is a language

A commenter on this blog reminded me of one of the frustrating aspects faced by newbies, not only to R but to any other programming environment (I am thinking of typical students doing stats for the first time). The statement “R is a language” sounds perfectly harmless if you have previous exposure to programming. However, if you come from a zero-programming background the question is What do you really mean?

R—and many other statistical systems for that matter—is often approached from either one of two extremes:

1. Explaining R as a programming language, as most of the R documentation and books (like The Art of R Programming, quite good by the way) do.
2. The other one is from a hodgepodge of statistical analyses, introducing the language as a bonus, best represented by Crowley’s The R Book (which I find close to unreadable). In contrast, Modern Applied Statistics with S by Ripley and Venables is much better even when it doesn’t mention R in the title.

If you are new to both statistics and R I like the level of the Quick-R website as a starting point, which was expanded into a book (R in Action). It uses the second approach listed above, so if you come from a programming background the book will most likely be disappointing. Nevertheless, if you come from a newbie point of view both the website and book are great resources. In spite of this, Quick-R assumes that the reader is familiar with statistics and starts with “R is an elegant and comprehensive statistical and graphical programming language.

## A simpler starting point

I would like to start from an even simpler point, ignoring for a moment programming and think about languages like English, Spanish, etc. In languages we have things (nouns) and actions (verbs). We perform actions on things: we measure a tree, draw a plot, make some assumptions, estimate coefficients, etc. In R we use functions to perform actions on objects (things in the previous explanations). Our data sets are objects that we read, write, fit a model (and create objects with results), etc. “R is a language” means that we have a grammar that is designed to deal with data from a statistical point of view.

A simple sentence “Luis writes Quantum Forest” has two objects (Luis and Quantum Forest) and one function (writes). Now lets look at some simple objects in R; for example, a number, a string of characters and a collection of numbers (the latter using the function c() to keep the numbers together):

> 23 [1] 23   > "Luis" [1] "Luis"   > c(23, 42, pi) [1] 23.000000 42.000000 3.141593

Up to this point we have pretty boring software, but things start becoming more interesting when we can assign objects to names, so we can keep acting on those objects (using functions). In this blog I use = instead of <- to assign things (objects) to names outside function calls. This is considered "bad form" in the R world, but to me is much more readable§. (Inside function calls the arguments should always be referred with an = sign, as we'll see in a future post). Anyway, if you are feeling in a conformist mood replace the = by <- and the code will work equally well.

> sex = 23   > Sex = "Luis"   > SEX = c(23, 42, pi)

R is case sensitive, meaning that upper- and lower-case letters are considered different. Thus, we can assign different objects to variables named sex, Sex and SEX and R will keep track of them as separate entities (A Kama Sutra of names!). Once objects are assigned to a variable R stops printing the object back to the user. However, it is possible to type the object name, press enter and get the content stored in the name. For example:

> sex [1] 23   > SEX [1] 23.000000 44.000000 3.141593

The lowly = sign is a function as well. For example, both a and b are assigned the same bunch of numbers:

> a = c(23 , 42, pi) > a [1] 23.000000 42.000000 3.141593   # Is equivalent to > assign('b', c(23 , 42, pi)) > b [1] 23.000000 42.000000 3.141593

Even referring to an object by its name calls a function! print(), which is why we get [1] 23.000000 44.000000 3.141593 when typing b and hitting enter in R.

## Grouping objects

Robert Kabacoff has a nice invited post explaining data frames. Here I will present a very rough explanation with a toy example.

Objects can be collected in other objects and assigned a name. In data analysis we tend to collect several variables (for example tree height and stem diameter, people's age and income, etc). It is convenient to keep variables referring to the same units (trees, persons) together in a table. If you have used Excel, a rough approximation would be a spreadsheet. Our toy example could be like:

> x = c(1, 3, 4, 6, 8, 9) > y = c(10, 11, 15, 17, 17, 20)   > toy = data.frame(x, y) > toy x y 1 1 10 2 3 11 3 4 15 4 6 17 5 8 17 6 9 20

The last line combines two objects (x and y) in an R table using the function data.frame() and then it assigns the name "toy" to that table (using the function =). From now on we can refer to that data table when using other functions as in:

# Getting descriptive statistics using summary() > summary(toy)   x y Min. :1.000 Min. :10 1st Qu.:3.250 1st Qu.:12 Median :5.000 Median :16 Mean :5.167 Mean :15 3rd Qu.:7.500 3rd Qu.:17 Max. :9.000 Max. :20   # Obtaining the names of the variables in # a data frame > names(toy) [1] "x" "y"   # Or actually doing some stats. Here fitting # a linear regression model > fm1 = lm(y ~ x, data = toy)

Incidentally, anything following a # is a comment, which helps users document their work. Use them liberally.

Fitting a linear regression will produce lots of different outputs: estimated regression coefficients, fitted values, residuals, etc. Thus, it is very handy to assign the results of the regression to a name (in this case "fm1") for further manipulation. For example:

# Obtaining the names of objects contained in the # fm1 object > names(fm1) [1] "coefficients" "residuals" "effects" "rank" [5] "fitted.values" "assign" "qr" "df.residual" [9] "xlevels" "call" "terms" "model"   # We can access individual objects using the notation # objectName$components # Obtaining the intercept and slope > fm1$coefficients (Intercept) x 8.822064 1.195730   # Fitted (predicted) values > fm1$fitted.values 1 2 3 4 5 6 10.01779 12.40925 13.60498 15.99644 18.38790 19.58363 # Residuals > fm1$residuals 1 2 3 4 5 6 -0.01779359 -1.40925267 1.39501779 1.00355872 -1.38790036 0.41637011   # We can also use functions to extract components from an object # as in the following graph > plot(resid(fm1) ~ fitted(fm1))

The last line of code extracts the residuals of fm1 (using the function resid()) and the fitted values (using the function fitted()), which are then combined using the function plot().

In summary: in this introduction we relied on the R language to manipulate objects using functions. Assigning names to objects (and to the results of applying functions) we can continue processing data and improving our understanding of the problem under study.

## Footnotes

R is an implementation (a dialect) of the S language. But remember, a language is a dialect with an army and a navy.

Natural languages tend to be more complex and will have pronouns, articles, adjectives, etc. Let's ignore that for the moment.

§Languages change; for example, I speak Spanish—a bad form of Latin—together with hundreds of millions of people. Who speaks Latin today?

# The world owes you nothing

Don’t go around saying the world owes you a living. The world owes you nothing. It was here first.

―Mark Twain.

# Doing Bayesian Data Analysis now in JAGS

Around Christmas time I presented my first impressions of Kruschke’s Doing Bayesian Data Analysis. This is a very nice book but one of its drawbacks was that part of the code used BUGS, which left mac users like me stuck.

Kruschke has now made JAGS code available so I am happy clappy and looking forward to test this New Year present. In addition, there are other updates available for the programs included in the book.

# Deactivated Intense Debate, testing default comments

Intense Debate did not play well with Linux users, so I’m going back to WordPress’s default commenting system. Please email me at Luis.Apiolaza@canterbury.ac.nz if you have any problems using the site.

Incidentally, Happy 2012!

# Plotting earthquake data

Since 4th September 2010 we have had over 2,800 quakes (considering only magnitude 3+) in Christchurch. Quakes come in swarms, with one or few strong shocks, followed by numerous smaller ones and then the ocasional shock, creating an interesting data visualization problem. In our case, we have had swarms in September 2010, December 2010, February 2011, June 2011 and December 2011.

Geonet provides the basic information and there have been several attempts at displaying the full set of shocks. For example, Christchurch Quake Map uses animation, while Canterbury Quake Live uses four panels showing quakes for last 24 hours, last week, last month and since September 2010. While both alternatives are informative, it is hard to see long-term trends due to overplotting, particularly when we move beyond one week during a swarm.

Geonet allows data extraction through forms and queries. Rough limits for the Christchurch earthquakes are: Southern Latitude (-43.90), Northern Latitude (-43.15), Western Longitude (171.75) and Eastern Longitude (173.35). We can limit the lower magnitude to 3, as shocks are hard to feel below that value.

Graph presented by Canterbury Quake Live, notice how difficult is to read the bottom graph.

The file earthquakes.csv contains 2,802 shocks, starting on 2010-09-04. The file can be read using the following code:

[sourcecode lang="r"]
setwd(‘~/Dropbox/quantumforest’)
library(ggplot2)

# Reading file and manipulating dates

# Joins parts of date, converts it to UTC and then
# expresses it in NZST
qk$DATEtxt = with(qk, paste(ORI_YEAR,’-’,ORI_MONTH,’-’,ORI_DAY,’ ‘, ORI_HOUR,’:’,ORI_MINUTE,’:’,ORI_SECOND, sep=”)) qk$DATEutc = as.POSIXct(qk$DATEtxt, tz=’UTC’) qk$DATEnz = as.POSIXct(format(qk\$DATEutc, tz=’Pacific/Auckland’))
[/sourcecode]

The following code produces a plot that, in my opinion, presents a clearer idea of the swarms but that I still feel does not make justice to the problem.

[sourcecode lang="r"]
png(‘earthquakesELE.png’,height=600, width=1200)
ele = ggplot(qk, aes(DATEnz, MAG, ymin=3, ymax=MAG))
ele + geom_linerange(color=’grey’) + geom_point(color=’red’, size=1) +
scale_y_continuous(name=’Magnitude’, limits=c(2.5,7.2)) +
scale_x_datetime(name=’Date’,major=’month’) + theme_bw() +
opts(panel.grid.major = theme_line(size=0))
dev.off()
[/sourcecode]

Graph displaying quake swarms, but still far from perfect. Looking forward to your ideas.

Please let me know if you have a better idea for the plot.

P.S.1 If you want to download data from Geonet there will be problems reading in R the resulting earthquakes.csv file, because the file is badly formed. All lines end with a comma except for the first one, tripping R into believing that the first column contains row names. The easiest way to fix the file is to add a comma at the end of the first line, which will create an extra empty variable called X that is not used in the plots.

P.S.2 We had some discussions by email with Michael (who commented below), where he suggested dropping the grey lines and using alpha less than one to reduce plotting density. I opted for also using an alpha scale, so stronger quakes are darker to mimic the psychology of ‘quake suffering’: both frequent smaller quakes and the odd stronger quakes can be distressing to people. In addition, now the plot uses a 1:6 ratio.

[sourcecode lang="r"]
png(‘earthquakesALPHA.png’,height=300, width=1800)
qplot(DATEnz, MAG, data = qk, alpha = MAG) +
geom_point(color = ‘red’, size=1.5) +
scale_x_datetime(‘Date’, major=’month’) +
scale_y_continuous(‘Magnitude’) +
opts(legend.position = ‘none’,
axis.text.x = theme_text(colour = ‘black’),
axis.text.y = theme_text(colour = ‘black’))
dev.off()
[/sourcecode]

New version, click to expand.

I ran out of time, but the background needs more work, as well as finding the right level of alpha to best tell the story.