Quantum Forest

a shoebox for data analysis

Statistics unplugged

How much does statistical software help, and how much does it interfere, when teaching statistical concepts? Software used in the practice of statistics (say R, SAS, Stata, etc.) brings to the party a mental model that is often alien to students, while being highly optimized for practitioners. It is possible to introduce a minimum of distraction while focusing on teaching concepts, although it requires a careful choice of a subset of functionality. Almost invariably some students get stuck with the software and everything goes downhill from there; the student moves from struggling with a concept to struggling with syntax (Do I use a parenthesis here?).

I am a big fan of Tim Bell’s Computer Science Unplugged, a program for teaching Computer Science’s ideas at primary and secondary school without using computers (see example videos).

Here is an example video for public key encryption:

This type of instruction makes me question both how we teach statistics and at what level we can start teaching statistics. The good news is that the New Zealand school curriculum includes statistics in secondary school, for which there is an increasing number of resources. However, I think we could be targeting students even earlier.

This year my wife was helping primary school students participating in a science fair and I ended up volunteering to introduce them to some basic concepts so they could design their own experiments. Students got the idea of the need for replication, randomization, etc., based on a simple question: Did one of them have special powers to guess the result of flipping a coin? (Of course this is Fisher’s tea-drinking-lady-experiment, but no 10-year-old cares about tea, while at least some of them care about super powers). After the discussion one of them ran a very cool experiment on the effect of liquefaction on the growth of native grasses (very pertinent in post-earthquake Christchurch), with 20 replicates (pots) for each treatment. He got the concepts behind the experiment; software only entered the scene when we needed to confirm our understanding of the results in a visual way:

Seven-week growth of native grasses with three proportions of liquefied soil. T1: pure liquefaction, T2: 50% liquefaction, 50% normal soil, T3: pure normal soil.

People tell me that teaching stats without a computer is like teaching chemistry without a lab or doing astronomy without a telescope, or… you get the idea. At the same time, there are some books that describe some class activities that do not need a computer; e.g. Gelman’s Teaching Statistics: A Bag of Tricks. (Incidentally, why is that book so friggin’ expensive?)

Back to uni

Back from primary school kiddies to a regression course at university. Let’s say that we have two variables, x & y, and that we want to regress y (response) on x (predictor) and get diagnostic plots. In R we could simulate some data and plot the relationship using something like this:
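A minimal sketch of such a simulation follows. The seed, sample size, coefficients and the size of the two planted discrepancies are all illustrative assumptions; the only features taken from the discussion are that y[50] sits near the middle of x and y[100] at its far end.

```r
# Illustrative values only: seed, sample size and coefficients are assumptions.
set.seed(2013)
n <- 100
x <- seq(1, 10, length.out = n)   # sorted, so x[50] is central and x[100] is extreme
y <- 2 + 0.5 * x + rnorm(n)       # straight line plus unit-variance noise

# Place two points exactly 6 units above the true line
y[c(50, 100)] <- 2 + 0.5 * x[c(50, 100)] + 6

plot(x, y, pch = 19, col = "grey40",
     xlab = "x (predictor)", ylab = "y (response)")
# Highlight the two planted points
points(x[c(50, 100)], y[c(50, 100)], pch = 19, col = "red")
```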

Typical simple linear regression scatterplot.

We can then fit the linear regression and get some diagnostic plots using:
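A hedged sketch, re-creating the same kind of simulated data so that it runs on its own (all numeric values are assumptions): `lm()` fits the model, and calling `plot()` on the fitted object draws the four default diagnostic panels.

```r
# Simulated data; seed and coefficients are illustrative assumptions.
set.seed(2013)
x <- seq(1, 10, length.out = 100)
y <- 2 + 0.5 * x + rnorm(100)
y[c(50, 100)] <- 2 + 0.5 * x[c(50, 100)] + 6   # two points 6 units above the line

fit <- lm(y ~ x)       # regress y (response) on x (predictor)
summary(fit)

op <- par(mfrow = c(2, 2))   # 2 x 2 grid of panels
plot(fit)                    # the four default diagnostic plots for an lm fit
par(op)                      # restore the previous graphics settings
```

With a 2 x 2 layout the panels are filled row by row, so the residuals versus leverage plot lands in the lower right.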

Typical diagnostic plot for simple linear regression model. What’s the meaning of the fourth plot (lower right)?

If we ask students to explain the fourth plot, which displays discrepancy (how far a point is from the general trend) against leverage (how far a point is from the center of mass, the pivot of the regression), many of them will struggle to say what is going on in that plot. At that moment one could go and calculate the hat matrix of the regression (\(X (X'X)^{-1} X'\)), get leverage from its diagonal, etc., and students will get a foggy idea. Another, probably better, option is to present the issue as a physical system with which students already have experience. A good candidate is a seesaw, because many (perhaps most) students played on one as children.
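For anyone who does want the algebra, here is a minimal sketch of that calculation on simulated data whose values are assumptions: build \(X (X'X)^{-1} X'\) by hand, take its diagonal, and compare it with R's `hatvalues()`. In simple regression the diagonal reduces to \(h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2\), which is the distance from the pivot written in symbols.

```r
# Simulated data; seed and coefficients are illustrative assumptions.
set.seed(2013)
x <- seq(1, 10, length.out = 100)
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

X <- model.matrix(fit)                       # n x 2 design matrix: intercept and x
H <- X %*% solve(crossprod(X)) %*% t(X)      # hat matrix X (X'X)^{-1} X'
leverage <- diag(H)                          # leverage of each observation

# Closed form for simple regression: squared distance from mean(x), rescaled
h_formula <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

all.equal(unname(leverage), unname(hatvalues(fit)))   # hand-built vs built-in
all.equal(unname(leverage), h_formula)                # hand-built vs closed form
sum(leverage)     # trace of H, i.e. the number of estimated coefficients

leverage[c(50, 100)]   # central point versus the point at the far end of x
```

Note that y never enters the calculation: leverage is a property of where the points sit along x.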

Take your students to a playground (luckily there is one next to uni) and get them playing with a seesaw. The influence of a point is related to the product of leverage (how far from the pivot we are applying force) and discrepancy (how big the applied force is). The influence of a point on the estimated regression coefficients will be very large when we apply a strong force far from the pivot (as in our point y[100]), just as happens on a seesaw. We can apply lots of force (discrepancy) near the pivot (as in our point y[50]) and little will happen. Students like mucking around with the seesaw and, more importantly, they remember.

Compulsory seesaw picture (source Wikipedia).
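Back at the computer, the seesaw can be checked numerically. The sketch below is one way to do it, not the only one: it uses Cook's distance as the measure of influence, which combines the standardized residual \(r_i\) (discrepancy) and the leverage \(h_i\) as \(D_i = \frac{r_i^2}{p} \cdot \frac{h_i}{1 - h_i}\), where \(p\) is the number of coefficients. The simulated values are assumptions; the same push of 6 units is applied at the middle (y[50]) and at the end (y[100]) of x.

```r
# Simulated data; seed and coefficients are illustrative assumptions.
set.seed(2013)
x <- seq(1, 10, length.out = 100)
y <- 2 + 0.5 * x + rnorm(100)
y[c(50, 100)] <- 2 + 0.5 * x[c(50, 100)] + 6   # same force at pivot and at far end

fit <- lm(y ~ x)
h <- hatvalues(fit)        # leverage: distance from the pivot
r <- rstandard(fit)        # standardized residuals: discrepancy
p <- length(coef(fit))     # number of coefficients (2 here)

cook_by_hand <- r^2 / p * h / (1 - h)   # Cook's distance from its two ingredients
cook <- cooks.distance(fit)             # built-in version for comparison

round(cbind(leverage = h, discrepancy = r, cook = cook)[c(50, 100), ], 3)

# How much does the slope move when each point is dropped?
slope_all      <- coef(fit)[["x"]]
shift_drop_50  <- slope_all - coef(lm(y ~ x, subset = -50))[["x"]]
shift_drop_100 <- slope_all - coef(lm(y ~ x, subset = -100))[["x"]]
c(drop_50 = shift_drop_50, drop_100 = shift_drop_100)
```

Both points have a similar discrepancy, but only the one far from the pivot tilts the line noticeably.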

Analogy can go only so far. Sometimes a physical analogy like a quincunx (to demonstrate the central limit theorem) ends up being more confusing than using an example with variables that are more meaningful for students.

I don’t know how much course content could, at most, be replaced by props, experiments, animations, software specifically designed to make a point (rather than to run analyses), etc. I do know that we still need to introduce ‘proper’ statistical software; at some point students have to face praxis. Nevertheless, developing an intuitive understanding is vital to move beyond being performing monkeys; that is, people clicking on menus or going through the motions of copy/pasting code without understanding what’s going on in the analyses.

I’d like to hear if you have any favorite demos/props/etc when explaining statistical concepts.

P.S. In this post I don’t care if you love stats software, but I specifically care about helping learners who struggle to understand concepts.


  1. I do love a good analogy. As a student so far I’ve found manual calculation and very small datasets to be useful – big enough to give meaningful results, but small enough that you can ‘see’ the numbers for yourself; it lets you see what is going on when you manipulate them. Some texts try to avoid formulae, which I think is a huge mistake, because the formula is critical to understanding what you are doing. And I suspect that if you can’t comprehend a fairly simple formula, you’re going to have problems.

    I think it’s worthwhile to find some novelty. By third year stats, after intro stats in first-year psych, testing and assessment, and second year stats, the old stock-standard examples were getting worn really thin.

  2. oh one other thing – when you *do* introduce software, why not just go for R? You type in the formula and actually have hands-on with your calculations. Learning SPSS takes time – it’s not entirely intuitive – and at times it’s like learning to press vaguely labeled buttons on a big black box. You can use it without having a clue about what is going on inside.

    • Luis

      2013/12/28 at 10:06 am

      We have used R for the last 4 or so years in the regression modeling course; before that we used SAS since time immemorial. My concern is that some students use incantation code instead of incantation buttons: they still have little clue of what the code is doing. Parts of the code even contradict each other. In addition, I don’t think there is a difference in intrinsic understanding of what’s going on between filling a menu with y in ‘dependent var’ and x in ‘independent var’ and typing lm(y ~ x).

      My main issue is how do we teach concepts in such a way that we reach the largest number of students. The actual tool for analysis is somewhat secondary for someone who does understand the concepts; they’ll tend to do the right analyses.

      • that’s an interesting observation about ‘incantation code’. Good point.

        In my course, Psychology, I feel more tight focus within the statistics subject would be helpful. They are always muddled up with ethics, testing and other issues which really clutter your perspective. Assessment tasks often involve faking a psychological report from bogus data, so a lot of effort is expended on material that is tangential to the actual statistical work. I think it would be better taught like maths, broken into discrete chunks with regular, narrowly focused assessments. As it stands, it’s possible to pass with quite large gaps in knowledge.

        I’d love to give our class a stats pop quiz in a year’s time and see how much material we’ve actually retained.

  3. I have some experience with college-age and older students at an American community college (equivalent to the first two years of university). In many cases they have TI-83s etc. which they can summon to solve nearly any straightforward combinatorial or statistical problem. Not to any instructor’s surprise, they often have no idea of whether their solutions make any sense. Whether it is a calculator or a programming language, the concepts come first. These assistive tools take the unwary (or uneducated) down irrelevant paths that seem to lead to solutions but lead only to confusion. These tools can indeed be helpful but only to supplement, not replace, a sound grounding. If we want to produce clueless graduates who think they know what they are doing, then we should give them all calculators or computers and skip instruction. That, however, is hardly doing a service to the students or to society.

  4. Good for you for your efforts to make stats palatable to the new. I would say try and find a few angles to get at each concept. For some students the see-saw will click, for others, something else makes sense.

    I also think the relevance of the question makes it all the easier. I dream of teaching a stats class like I took in high school where we actually find out something interesting about the entire student body. Like does car ownership affect your grades? Or do blond students actually have more romantic relationships?

© 2015 Quantum Forest