Mid-February flotsam

This coming Monday we start the first semester in Canterbury (and in New Zealand for that matter). We are all looking forward to an earthquake-free year; more realistically, I’d be happy with low magnitude aftershocks.

  • The Wall Street Journal reports that more pediatricians are ‘firing’ patients that refuse to use vaccines. I’m wondering about practices that will cluster with ‘vaccine refusers’.
  • I am collaborating with a researcher in Electrical Engineering where he and his students develop very cool tools for us (see example in this previous post on dealing with autocorrelation in mixed models). They use Python to control the tools, data extraction and do some basic processing (isn’t that cool?). Python + Scipy have moved a lot towards creating a nice environment for scientific computing; however, in my opinion setting an R environment is way easier than dealing with all the versions for python, wxpython, etc. At the end I went for the 32 bit version in OS X because 64 bit is not supported (wxpython targets carbon, meh).
  • Why not publish your data too? HT: @chlalanne. I’ll be uploading datasets that I’ve used in my papers (at least the ones that do not contain industry/confidential data).
  • Timandra Harkness asks Have you been seduced by statistics? in Significance Journal (free access). In the same issue Matt Briggs asks another important question Why do statisticians answer silly questions that no one ever asks?
  • How companies learn your secrets: the creepy side of statistics/machine learning.
  • Feeling all smug in a discussion in Engineering: people struggling with software licensing (Office, Matlab, etc). I pointed out that we didn’t have any licensing issues with R to be at the bleeding edge when teaching. Take that!
  • In the nothing-to-do with-stats category, Retronaut displays a beautiful collection of Soviet space propaganda posters. HT: can’t remember.

Supermarkets in New Zealand, Are they creepy too? Checkouts at Pak N Save, Christchurch (Photo: Luis).

I’ve been running analyses for two to three papers, mostly using R + asreml-R + ggplot2 + plyr. I hope to write one of them using XeLaTeX (I’ll be the sole author), while in the other(s) I am condemned to MS Word (!). Thinking of journals to send the papers to, with the most liberal copyright as possible (hard in forestry).

P.S. Flotsam posts compile small bits of information, twitter favorites, shared links, etc.

If you have to use circles…

Stats Chat is an interesting kiwi site—managed by the Department of Statistics of the University of Auckland—that centers around the use and presentation of statistics in the media. This week there was an interesting discussion on one of those infographics that make you cringe:

I understand the newspaper’s need to grab our attention, as well as the designer’s esthetical considerations, but they have to follow avoiding misleading the reader and providing at least a ball-park idea of the importance of whatever issue is under discussion. Clearly, as pointed out in the discussion, a line chart would convey the message with a minimum of ink; however, the designer may still want to use circles, and here we could go back to a faceted version of the always maligned pie chart.

Faceted pie charts using ggplot2

The code reads the data, reshapes it and plots it using pretty much the explanation for pie charts in the ggplot2 documentation.

# Location and libraries
require(reshape)
require(ggplot2)
setwd('~/Dropbox/quantumforest')


# Reading data and melting into long shape
bub = read.csv('bubble-data.csv', header = TRUE)
bub2 = melt(bub, id = 'year')
names(bub2)[2] = 'worry'
levels(bub2$worry)[3] = 'Something else'

# Pie charts; no, really, pie charts!
crazyPalette = scale_fill_manual(values = c('#7DADDD', '#EF1C22', '#BEC7D6'))
crazy = ggplot(bub2, aes(x = factor(1), y = value, fill = worry))
crazy = crazy + geom_bar(width = 1) + coord_polar(theta = 'y') +
        facet_wrap(~ year) + crazyPalette +
        scale_x_discrete('Worry per year (%)') + scale_y_continuous('')
         
crazy
view raw pie-charts.R This Gist brought to you by GitHub.

Again, please remember my typical disclaimer about lack of design and color flair. Colors and scales need work, but I think it is an improvement over the original.

Revisiting homicide rates

A pint of R plotted an interesting dataset: intentional homicides in South America. I thought the graphs were pretty but I was unhappy about the way information was conveyed in the plots; relative risk should be very important but number of homicides is very misleading as it also relates to country population (this problem often comes up in our discussions in Stats Chat).

Instead of just complaining I decided to try a few alternatives (disclaimer: I do not have a good eye for colors or design but I am only looking at ways that could better show relative risk). I therefore downloaded the MS Excel file, which contained a lot of information from other countries and extracted only the information relevant to these plots, which you can obtain here: homicides.csv (4 KB). Some quick code could display the following graph:

require(ggplot2)

setwd('~/Dropbox/quantumforest')
kill = read.csv('homicides.csv', header = TRUE)

kp = ggplot(kill, aes(x = year, y = country, fill = rate))

# Colors coming from
# http://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/
png('homicides-tile.png', width = 500, height = 500)
kp = kp + geom_tile() + scale_x_continuous(name = 'Year', expand = c(0, 0)) +
     scale_y_discrete(name = 'Country', expand = c(0, 0)) +
     scale_fill_gradient(low = 'white', high = 'steelblue', name = 'Homicide rate') +
     theme_bw() +
     opts(panel.grid.major = theme_line(colour = NA),
          panel.grid.minor = theme_line(colour = NA))
dev.off()

Tile graph for homicides.

It is also possible to use a line graph, but it quickly gets very messy, so I created totally arbitrary violence categories:

# Totally arbitrary classification
kill$type = ifelse(kill$country %in% c('Brazil', 'Colombia', 'Venezuela'),
                   'Freaking violent',
                   ifelse(kill$country %in% c('Ecuador', 'Surinam', 'Guyana'),
                          'Plain violent',
                          'Sort of quiet'))

kp2 = ggplot(kill, aes(x = year, y = rate, colour = country))

png('homicides-lines.png', width=1000, height = 300)
kp2 + geom_line() + facet_grid(. ~ type) +
      scale_y_continuous('Homicides/100,000 people') +
      scale_x_continuous('Year') + theme_bw() +
      opts(axis.text.x = theme_text(size = 10),
           axis.text.y = theme_text(size = 10),
           legend.position = 'none')
dev.off()

Another view, which still requires labeling countries. Venezuela, what did happen to you? You look like the Wild West...

I hope others will download the data and provide much better alternatives to display violence. If you do, please add a link in the comments.

Oracle’s strange understanding of R users

After reading David Smith’s tweet on the price of Oracle R Enterprise (actually free, but it requires Oracle Data Mining at $23K/core as pointed out by Joshua Ulrich.) I went to Oracle’s site to see what was all about. Oracle has a very interesting concept of why we use R:

Statisticians and data analysts like R because they typically don’t know SQL and are not familiar with database tasks. R allows them to remain highly productive.

Pardon? It sounds like if we only knew SQL and database tasks we would not need statistical software. File for future reference.

Early-February flotsam

Mike Croucher at Walking Randomly points out an interesting difference in operator precedence for several mathematical packages to evaluate a simple operation 2^3^4. It is pretty much a divide between Matlab and Excel (does the later qualify as mathematical software?) on one side with result 4096 (or (2^3)^4) and Mathematica, R and Python on the other, resulting on 2417851639229258349412352 (or 2^(3^4)). Remember your parentheses…

Corey Chivers, aka Bayesian Biologist, uses R to help students understand the Monty Hall problem. I think a large part of the confusion to grok it stems from a convenient distraction: opening doors. The problem could be reframed as: i- you pick a door (so probability of winning the prize is 1/3) and Monty gets the other two doors (probability of winning is 2/3), ii- Monty is offering to switch all his doors for yours, so switching increases the probability of winning, iii- Monty will never open a winning door to entice the switch, so we should forget about them.

To make the point clearer, let’s imagine now that instead of 3 doors the game has 10 doors. You pick one (probability of winning 1/10) and Monty keeps 9 (probability of winning 9/10). Would you switch one door for nine? Of course! The fact that Monty will open 8 non-winning doors rather than all of his doors does not make a difference in the deal.

# Number of games and doors
n.games = 10000
n.doors = 10

# Assign prize to door for each game. Remember:
# Monty keeps all doors not chosen by player
prize.door = floor(runif(n.games, 1, n.doors + 1))
player.door = floor(runif(n.games, 1, n.doors + 1))

# If prize.door and player.door are the same
# and player does not switch
are.same = prize.door == player.door
cat('Probability of winning by not switching', sum(are.same)/n.games, '\n')
cat('Probability of winning by switching', (n.games - sum(are.same))/n.games, '\n')

Gratuitous picture: fish in New Brighton pier (Photo: Luis).

Pierre Lemieux reminds us that “a dishonest statistician is an outliar”.

If you want to make dulce de leche using condensed milk—but lack a pressure cooker—use an autoclave for 50 to 60 minutes. HT: Heidi Smith. Geeky and one needs an autoclave worth thousands of dollars, but that’s what universities are for.

Lesser and Pearl inform us that there are at least 20 modalities for making statistics fun in “Functional Fun in Statistics Teaching: Resources, Research and Recommendations”. HT: Chelsea Heaven. I’ve used music, videos, cartoons, jokes, striking examples using body parts, quotations, food, juggling, etc.

An old review of Buddhism without Beliefs: A Contemporary Guide to Awakening by Stephen Batchelor. I can’t see any statistical angle, but I liked that book.

P.S. Awesome video by OK Go HT: Eric Crampton.

Back to quantitative genetics!

Rstudio and asreml working together in a mac

December and January were crazy months, with a lot of travel and suddenly I found myself in February working in four parallel projects involving quantitative genetics data analyses. (I’ll write about some of them very soon)

Anyhow, as I have pointed out in repeated occasions, I prefer asreml-R for mixed model analyses because I run out of functionality with nlme and lme4 very quickly. Ten-trait multivariate mixed model with a pedigree, anyone? I thought so. Well, there are asreml-R versions for Windows, Linux and OS X; unsurprisingly, I use the latter. Installation in OS X is not particularly complicated (just follow the instructions in this PDF file) and remember to add and export the following environment variables in your .bash_profile:

# Location of license file, usually installed
# under Applications, together with the console
# version of asreml
ASREML_LICENSE_FILE=/Applications/asreml3/bin/asreml.lic
export ASREML_LICENSE_FILE

# Location for dynamic library loading. The number will
# depend on R version (here is 2.13)
DYLD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Users/lap44/Library/R/2.13/library/asreml/libs
export DYLD_LIBRARY_PATH

These instructions work fine if one uses asreml-R in a Terminal window or if one uses a text editor (emacs + ESS, VIM + VIM-R, Textmate + R + R console bundles, etc). However, I couldn’t get the default R GUI in OS X (R.app) or Rstudio (my favorite way to work with R these days) working.

I would use library(asreml) or require(asreml), which would create no problems, but as soon as I used the function asreml(...) I would get an error message stating that I did not have a valid license. Frustrating to say the list (I even considered using emacs for the rest of my life, can you believe it?), because it meant that Rstudio was not being able to ‘see’ the environment variables (as posted in this question). The discussion on that question points to a very useful part of the R help system, which explains the startup process for R, suggesting a simple solution: create an .Renviron file in your home directory.

Thus, I simply copied the previously highlighted code in a text file called .Renviron, saved it in my home directory and I can now call asreml-R from Rstudio without any problems. This solution should also work for any other time when one would like to access environment variables from Rstudio for OS X. Incidentally, Rstudio is becoming really useful, adding ‘projects’ and integrating version control, which means that now I have a(n almost) self-contained working environment.

P.S. Note to self: While I was testing environment variables in my .bash_profile I wanted to refresh the variables without rebooting the computer. The easiest way to do so is typing . .bash_profile (yes, that starts with dot space). To check the exported value of a specific variable one can use the echo command as so echo $ASREML_LICENSE_FILE, which should return the assigned value (/Applications/asreml3/bin/asreml.lic in my case).

Academic publication boycott

The last few weeks there has been a number of researchers calling for, or supporting, a boycott against Elsevier; for example, Scientific Community to Elsevier: Drop Dead, Elsevier—my part in its downfall or, more general, Should you boycott academic publishers?

What metrics are used to compare Elsevier to other publishers? It is common to refer to cost-per-article; for example, in my area Forest Ecology and Management (one of the most popular general Forestry Journals) charges USD 31.50 per article but Tree Genetics and Genomes (published by Springer Verlag) costs EUR 34.95 (roughly USD 46). Nevertheless, researchers affiliated to universities or research institutes rarely pay per article; instead, our libraries have institution-wide subscriptions. Before the great consolidation drive we would have access to individual journal subscription prices (sometimes reaching thousands of dollars per year, each of them). Now libraries buy bundles from a given publisher (e.g. Elsevier, Springer, Blackwell, Wiley, etc) so it is very hard to get a feeling of the actual cost of a single journal. With this consideration, I am not sure if Elsevier ‘deserves’ being singled out in this mess; at least not any more than Springer or Blackwell, or… a number of other publishers.

Gaahl Gorgoroth

Elsevier? No, just Gaahl Gorgoroth

What we do know is that most of the work is done and paid for by scientists (and society in general) rather than journals. Researchers do research and our salaries and research expenses are many times paid for (at least partially if not completely) by public funding. We also act as referees for publications and a subset of us are part of editorial boards of journals. We do use some journal facilities; for example, an electronic submission system (for which there are free alternatives) and someone will ‘produce’ the papers in electronic format, which would be a small(ish) problem if everyone used LaTeX.

If we go back some years ago, many scientific societies used to run their own journals (many times scrapping by or directly running them at a loss). Then big publishers came ‘to the rescue’ offering economies of scale and an opportunity to make a buck. There is nothing wrong with the existence of publishers facilitating the publication process; but when combined with the distortions in the publication process (see below) publishers have achieved a tremendous power. At the same time, publishers have hiked prices and moved a large part of their operations to cheaper countries (e.g. India, Indonesia, etc) leaving us researchers struggling to pay for the subscriptions to read our own work. Not only that, but copyright restrictions in many journals do not allow us to make our work available to the people who paid for the research: you, the tax payer.

Today scientific societies could run their own journals and completely drop the printed version, so we could have cheaper journals while societies wouldn’t go belly up moving paper across continents. Some questions, Would scientific societies be willing to change? If that’s the case, Could they change their contractual arrangements with publishers?

Why do we play the game?

The most important part of the problem is that we (the researchers) are willing to participate in the publication process with the current set of rules. Why do we do it? At the end of the day, many of us play the journal publication game because it has been subverted from dissemination of important research results to signaling researcher value. University and research institute managers need to have a way to evaluate their researchers, managing tenures, promotions, etc. Rather than going for actually doing a proper evaluation (difficult, expensive and subjective), they go for an easy one (subjective as well): number of publications in ‘good’ journals. If I want to get promoted or taken seriously in funding applications I have to publish in journals.

I think it is easy to see that I enjoy openly communicating what I have learned (for example this blog and in my main site). I would rather spend more time doing this than writing ‘proper’ papers, but of course this is rarely considered important in my evaluations.

If you already are a top-of-the-scale, tenured professor it is very easy to say ‘I don’t want to play the game anymore’. If you are a newcomer to the game, trying to establish yourself in these times of PhD gluts and very few available research positions, all incentives line up to play the game.

This is only part of the problem

The questioning does not stop at the publication process. Instead, the peer value of review process is also under scrutiny. Then we enter into open science: beyond having access to publications, How much can we trust the results? We have discussions on open access data even when it is in closed journals. And on, and on.

We have moved from a situation of scarcity, where publishing was expensive, the tools to analyze our data were expensive and making data available was painfully difficult to a time when all that is trivially easy. I can collect some data, upload it to my site, rely on the democratization of statistics, write it up and create a PDF or HTML version by pressing a button. We would like to have feedback: relatively easy if the publication is interesting. We want an idea of reliability or trust: we could have, for example, some within-organization peer reviewing. Remember though that peer reviewing is not a panacea. We want to have an idea of community standing, which would be the number of people referring to that document (paper, blog post, wiki, whatever).

Maybe the most important thing is that we are trying to carry on with ‘traditional’ practices that do not extend beyond, say, 100 years. We do not need to do so if we are open to a more fluid environment on both publication, analytics and data sharing. Better, we wouldn’t need to continue if we stopped putting so much weight on traditional publication avenues when evaluating researchers.

Is Elsevier evil? I don’t think so; or, at least, it doesn’t seem to be significantly worse than other publishers. Have we vested too much power on Elsevier and other publishers? You bet! At the very least we should get back to saner copyright practices, where the authors retain copyright and provide a non-exclusive license to the publishers. Publishers will still make money but everyone will be able to freely access our research results because, you know, they already pay for the research.

Disclaimer: I have published in journals managed by Elsevier and Springer. I currently have articles under review for both publishers.

P.S. Gaahl Gorgoroth image from Wikipedia.

P.S.2 The cost of knowledge is keeping track of academics taking a stand against Elsevier; 1503 of them at 12:32 NZST 2012-01-30. HT: Arthur Charpentier.

P.S.3 2012-01-31 NZST I would love to know what other big publishers are thinking.

P.S.4 2012-02-01 NZST Research Works Act: are you kidding me?

The Research Works Act (RWA) bill (H.R.3699) introduced to the US Congress on 16 December 2011 proposes that:

No Federal agency may adopt, implement, maintain, continue, or otherwise engage in any policy, program, or other activity that–

(1) causes, permits, or authorizes network dissemination of any private-sector research work without the prior consent of the publisher of such work; or

(2) requires that any actual or prospective author, or the employer of such an actual or prospective author, assent to network dissemination of a private-sector research work.

The idea of calling researcher’s work funded by government, edited by their peers (probably at least partially funded by government funds) private-sector research work because a publishing company applied whatever document template they use on top of the original manuscript is obscene. By the way, Richard Poynder has a post that lists a number of publishers that have publicly disavowed the RWA.

P.S.5 2012-02-02 16:38 NZST Doron Zeilberger points to the obvious corollary: we don’t need journals for research dissemination anymore (although still we do for signaling). Therefore if one is keen on boycotts it should affect all publishers. Academics are stuck with last century’s publication model.

Mid-January flotsam: teaching edition

I was thinking about new material that I will use for teaching this coming semester (starting the third week of February) and suddenly compiled the following list of links:

Enough procrastination. Let’s keep on filling out PBRF forms; it is the right time for that hexennial activity.

Gratuitous picture: Avonhead Cemetery, a place to reflect on teaching and the meaning of life (Photo by Luis).

IPython

I installed the latest version (0.12) of IPython from source in my mac, but forgot that I had a previous version (0.10) installed using easy_install. When trying to run ipython I kept getting the error: No module named terminal.ipapp.

Fixing it. I ran easy_install -m ipython in Terminal, so Python doesn’t continue looking for the old package, as explained here. Then I was able to navigate to /Library/Python/2.6/site-packages and then use rm -r ipython-0.10.1-py2.6.egg. Now everything works fine.