R, academia and the democratization of statistics

I am not a statistician but I use statistics, teach some statistics and write about applications of statistics in biological problems.

Last week I was at a biostatistics conference, talking with a Ph.D. student who was surprised by this situation because I didn’t have any statistical training. I corrected that to “any formal training”. On the first day one of the invited speakers was musing about the growing number of “amateurs” using statistics (often wrongly) and about what value biostatisticians could add as professionals. Yes, he was talking about people like me spoiling the party.

Twenty years ago it was more difficult to say “cool, I have some data and I will run some statistical analyses” because (and you can easily see where I am going here) access to statistical software was difficult. You were among the lucky ones if you were based at a university or a large company, because you had access to SAS, SPSS, MINITAB, etc. However, you were out of luck outside of these environments, because there was no way to easily afford a personal licence, not for a hobby at least. This greatly limited the pool of people that could afford to muck around with stats.

Gratuitous picture: spiral in Sydney

Enter R and other free (sensu gratis) software that allowed us to skip the specialist, skip the university or the large organization. Do you need a formal degree in statistics to start running analyses? Do you even need to go through a university (for any degree, it doesn’t really matter) to do so? There are plenty of resources to get started: download R or a matrix language, get online tutorials and books, read, read, read and ask questions in email lists or fora when you get stuck. If you are a clever cookie (and some of you clearly are) you could easily cover as much ground as someone going through a university degree. It is probably still not enough to settle down, but it is a great start and a great improvement over the situation twenty years ago.

This description leaves three groups in trouble, trying to sort out their “value-adding” ability: academia, (bio)statisticians and software makers. What are universities offering that’s unique enough to justify the time and money invested by individuals and governments? If you make software, what makes it special? For how long can you rely on tradition and inertia so people don’t switch to something else? What’s so special about your (bio)statistical training to justify having one “you” in the organization? Too many questions, so I had better go to sleep.

P.S. Did I write “value-adding” ability? I must have been talking to the suits for too long… Next post I may end up writing “value-adding proposition”!

19 thoughts on “R, academia and the democratization of statistics”

  • 2011/12/13 at 3:50 am

    Great post. I think that universities provide condensed, "high speed" education and allow you to be in touch with people in the same field of expertise. That makes learning a more efficient and more social experience than being self educated.

    Like you, I'm mostly self educated in statistics. I feel that I would have learned statistics a lot faster if I'd had a formal training in statistics.

    Looking at it another way, I know quite a few people who are self educated in the field that I studied in university (musicology). Some of those people have more knowledge of musicology than I do, but it took them many more years to reach that level.

    • 2011/12/14 at 7:08 am

      I agree with you. The "high speed" education of universities is the main advantage over the self-teaching way. I have tripped up a lot trying new statistical techniques on my data.
      Another aspect no one has mentioned: is a specialty in statistics (MSc, PhD, etc.) worth the financial risk involved in paying for that training? I have pondered this quite a lot, and I'm inclined against the training. The financial cost is harder to justify if you can buy a lot of good books and try to do it in R.

      • 2012/06/23 at 6:42 am

        I’m actually in the same position as you. I have an option to invest £10k in an MSc in stats, but I’m disinclined too. I’m already very good at using R, and I intend to push my knowledge further by linking it into the internet and automating analyses. Passing the Royal Stat Exams is a far cheaper way to achieve ‘paper’ qualifications, and the level on the exams is actually very high (think old school degrees from the 60s when things used to be difficult). So I intend to be fully self taught and just skip the grad degree – too expensive, and too inflexible.

  • 2011/12/13 at 4:39 am

    Great post indeed. I am also trying to educate myself in statistics and I think I have made a lot of progress. But as much as I "understand" what I'm learning, I'm aware that I lack the "mastery" that someone with formal training has. I think one of the advantages of universities is that they offer rigorous training.

    On the other hand, an important advantage of self learning is motivation. Learning something because you like it rather than because you have to makes the process much more enjoyable.

    Saludos.

    • 2011/12/13 at 9:40 am

      Hola. I think that going through informal training we tend to get a strange density, with a very deep understanding of odd issues and some gaping holes in our general understanding. We are 'heterogeneous' but I'm not sure if that is a bad thing.

  • 2011/12/13 at 7:00 am

    What you didn't mention, and it's all over the dreaded Financial Services part of stats (one need only count the blogs posting here to see the assault), is that Excel is the primary (oft times, sole) stat software being used. To make decisions which affect many people.

    Trained as an econometrician, but until recently mostly doing database development, I worked with Ph.D. math stats early on out of university. The good ones admitted they didn't know everything. In the final analysis, 99.44% of the time, you just need to add up the squared differences! :)

    • 2011/12/13 at 8:14 am

      You are right that Excel is the tool most often used to produce statistical analyses (particularly in business), mostly descriptive statistics, simple linear regression and one-way ANOVA. However, I work in a field (forestry) with a strong culture of setting up experiments that need better and more complex statistical models.

      I don't mind people using Excel for simple things, but I do have problems when they let Excel define their analyses. For example, many students will try analyzing one factor at a time when we have factorial/nested experiments.

  • 2011/12/13 at 7:31 am

    I'll join others who wrote that this is a good post – thanks for writing it.

    From my understanding, the role of a University (at least in Israel) is not to educate (that is the role of colleges) but to train researchers and give them some level of supporting environment (both social and financial). Having said this, I still think there is much room for improvement and "soul searching" on what a University should be these days. I think the people who deal with the open research movement are also doing a lot of good thinking regarding these questions.

    • 2011/12/13 at 9:47 am

      Before going to uni I was looking for an education, not necessarily training; I thought of the latter as a byproduct of showing up in lectures. Unfortunately, the demands were heavily biased towards achieving a piece of paper that signalled that we were prepared as a [fill title here]. The changing environment is squeezing universities (and I'm part of one) and some of us are asking ourselves what our future role is in both education and generation of knowledge.

      One part of using R for analyses is that we can now (potentially) share data, code and interpretation opening the door to a much more interesting conversation with other professional researchers and "amateurs". Many researchers, myself included, are struggling to find the best way to make that conversation happen.

  • 2011/12/13 at 7:55 am

    hmm …
    statisticians "musing about the growing number of “amateurs” using statistics". statistics is a tool to make inferences about the real world. statisticians do not have “any formal training” in that world.

  • 2011/12/13 at 9:00 am

    Nice post. I can answer the question "What’s so special about your (bio) statistical training to justify having one “you” in the organization?" from personal experience. I have a master's degree in biostatistics and have been employed at a research hospital for many years. Almost none of the work I do or the tools I use were taught to me in graduate or undergraduate school. I think, like you, most of what I use now was self-taught or learned on the job. And I believe this can be (maybe even should be) generalized across fields.

    I briefly taught at the high school level and students would often ask me "why are you teaching me this? When am I ever going to use this?" Wrong question, because the answer is always going to be "never". If formally trained (bio)statisticians graduate from their programs with a mindset similar to the high school student's (i.e., school is supposed to train me to do a job), then they will be hobbled. If they leave with the understanding that there's so much more to learn, then they will do fine.

    • 2011/12/13 at 9:34 am

      I share your mindset. When teaching I try to provide plenty of examples for different areas (e.g. agriculture, sociology, economics, etc) but I insist on i- the importance of understanding the general concepts, ii- their broad applicability under situations that we can't foresee and iii- that students will need to learn a lot after leaving university if they want to be valuable.

  • 2011/12/13 at 12:52 pm

    Hi. Two cents from my point of view: 1) some statistical training may help you be aware of information in the data that you may not notice unless you were shown how to recognize it. This is true of any field, right? You may kick the dirt and see something that looks like a strange pebble and just move on, whereas the archaeologist may see a rare hominid molar. To what degree is training in statistics helping you "see" more than the untrained eye? Hard to say. 2) A degree in statistics may subjectively matter. I think there is a degree of confidence – call it subjective confidence – in analyses done by a trained statistician as compared to the self-taught. Put differently: say you are hiring for a data analysis position and you have two candidates with similar experience, but one of them has a degree in statistics. Does that degree matter? 3) The coolest thing about what you call the "democratization" of statistics is (hopefully) the awakening of our school systems to teaching probabilities and uncertainty early on. The more of us use statistics, the more we will seek the use and understanding of statistics. In that sense, R and other free stats software may be fueling a paradigm shift in education that I much welcome.

    • 2011/12/13 at 2:54 pm

      I'm not advocating ignoring university training, but pointing out that today one can go a long distance without formal training. We can access the same tools that are used by professional statisticians (e.g. R) as well as a huge amount of literature and course material. At the margins the distinction between pros and amateurs is blurring, so the province of the pros is becoming ever more specialized.

      Degrees do matter (still) but, as a counterexample, would you hire a new graduate fresh from university or an amateur that created several R packages? The choice there is harder.

  • 2011/12/16 at 12:42 am

    There are basic principles of statistical inference that cannot be obtained from packages or mucking around with data.

    • 2012/05/18 at 9:16 am

      …but I think that's where the reading comes in!

  • 2011/12/18 at 6:28 pm

    Having done my graduate work in economics after an undergraduate degree in the humanities, I was most impressed with the bizarre exclusiveness and snobbery amongst the numerical set. I had thought that said snobbery was a humanities department specialty.
    No one really likes democracy; as no one really likes market competition.

  • 2013/04/10 at 8:47 am

    I’m late to this discussion, but this is a very interesting post. I work as the sole stats analyst and decision modeller for my employer but don’t have a degree in anything (just an under-grad certificate in decision analysis, and that was achieved via distance learning). All my stats training is self taught and I’ve taught myself R.

    One advantage of being self taught is that you have no artificial time limits imposed on your study. That means if you are passionate about stats, or any subject for that matter, you can cover a 3 year degree level curriculum in half the time because you’ll be thinking of nothing else 24/7. A friend with two MSc’s in applied maths and statistics has said my work is of publishable quality and now at a post masters/doctoral level. I’ve been studying for about 3 years and of course I’ve been applying what I’ve learnt 8-10 hours a day which helps.

    For those interested in self learning statistics I recommend not only the PDFs on the R website but also, if not more so, the text books written for undergrads in the 1960s or 1970s, i.e. before the advent of the home computer, when even undergrads would only be allowed an hour's session on the campus mainframe per week.

    In those days you really did need to know the math and underlying proofs because 90% of the time that’s all you had to work with and these old out-of-print textbooks reflect that.

  • 2013/12/07 at 11:58 pm

    Nice post! I feel that there is a similarity between the development you describe in statistics and the development in computer science. If you wanted to solve a problem that required programming in the 80s you would have to have formal training / lots of experience as the tools were not super friendly (C, C++, etc.). Now we have tools like Python and Ruby which make programming accessible to a much wider audience…
