# Jetsam 17: Scaffolding

# Less wordy R

The Swarm Lab presents a nice comparison of R and Python code for a simple (read ‘one could do it in Excel’) problem. The example works, but I was surprised by how wordy the R code was and decided to check if one could easily produce a shorter version.

The beginning is pretty much the same, although I’ll use ggplot2 rather than lattice, because it will be a lot easier (and shorter) to get the desired appearance for the plots:

    require(Quandl)
    require(ggplot2)

    # Load data from Quandl
    my.data = Quandl("TPC/HIST_RECEIPT",
                     start_date = "1945-12-31",
                     end_date = "2013-12-31")

The whole example relies on only three variables and—as I am not great at typing—I tend to work with shorter variable names. I directly changed the names for variables 1 to 3:

    # Display first lines of the data frame
    # and set short names for first three columns
    head(my.data)
    names(my.data)[1:3] = c('year', 'indtax', 'corptax')

It is a lot easier to compare the regression lines if we change the shape of the data set from wide to long, where there is one variable for year, one for tax type, and one for the actual tax rate. It would be possible to use one of Hadley’s packages to get a simpler syntax for this, but I decided to stick to the minimum set of requirements:

    # Change shape to fit both regressions simultaneously
    mdlong = reshape(my.data[, 1:3], idvar = 'year',
                     times = c('Individual', 'Corporate'),
                     varying = list(2:3), direction = 'long')
    mdlong$taxtype = factor(mdlong$time)
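To see what the reshape does without downloading anything, here is a minimal sketch on a made-up three-year data frame (the `wide` and `long` names and all the numbers are hypothetical, not the Quandl data). It assumes the same column order as above: year, then individual, then corporate.

```r
# Hypothetical toy data in the same wide layout (values are made up)
wide = data.frame(year = 2001:2003,
                  indtax = c(8.1, 7.9, 7.0),
                  corptax = c(1.5, 1.4, 1.2))

# times[1] labels the first column listed in varying (indtax),
# times[2] labels the second one (corptax)
long = reshape(wide, idvar = 'year',
               times = c('Individual', 'Corporate'),
               varying = list(2:3), direction = 'long')
long$taxtype = factor(long$time)

# One row per year and tax type; the stacked values keep the
# name of the first varying column (indtax)
long
```

This is why the plots and the model use `indtax` as the response for both tax types: after reshaping, that column holds the stacked values and `taxtype` says which series each row came from.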

And now we are ready to produce the plots. The first one can be a rough cut to see if we get the right elements:

    ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) +
      geom_point() + geom_line() + geom_smooth(method = 'lm')

Yes, this one has the points, lines, linear regression and 95% confidence intervals for the mean predicted responses, but we still need to get rid of the grey background and get black labels (`theme_bw()`), set the right axis labels and ticks (`scale_x... scale_y...`), set the right color palette for points and lines (`scale_colour_manual`) and fill the confidence intervals (`scale_fill_manual`), like so:

    # Plotting the graph (first match color palette) and put the regression
    # lines as well
    serious.palette = c('#AD3333', '#00526D')
    ggplot(mdlong, aes(x = year, y = indtax, color = taxtype)) +
      geom_point() + geom_line() +
      geom_smooth(method = 'lm', aes(fill = taxtype)) +
      theme_bw() +
      scale_y_continuous('Income taxes (% of GDP)',
                         breaks = seq(0, 12, 2), minor_breaks = NULL) +
      scale_x_date('Fiscal year', minor_breaks = NULL) +
      scale_colour_manual(values=serious.palette) +
      scale_fill_manual(values=serious.palette)

One can still change font sizes to match the original plots, reposition the legend, change the aspect ratio while saving the png graphs (all simple statements) but you get the idea. Now we move on to fitting the regression lines:

    # Fitting a regression with dummy variables
    m1 = lm(indtax ~ year*taxtype, data = mdlong)
    summary(m1)

    # The regressions have different intercepts and slopes
    # Call:
    # lm(formula = indtax ~ year * taxtype, data = mdlong)
    #
    # Residuals:
    #      Min       1Q   Median       3Q      Max
    # -1.95221 -0.44303 -0.05731  0.35749  2.39415
    #
    # Coefficients:
    #                          Estimate Std. Error t value Pr(>|t|)
    # (Intercept)             3.435e+00  1.040e-01   33.01   <2e-16 ***
    # year                   -1.564e-04  1.278e-05  -12.23   <2e-16 ***
    # taxtypeIndividual       4.406e+00  1.471e-01   29.94   <2e-16 ***
    # year:taxtypeIndividual  1.822e-04  1.808e-05   10.08   <2e-16 ***
    # ---
    # Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    #
    # Residual standard error: 0.7724 on 134 degrees of freedom
    # Multiple R-squared:  0.9245, Adjusted R-squared:  0.9228
    # F-statistic: 546.9 on 3 and 134 DF,  p-value: < 2.2e-16

This gives the regression coefficients for Corporate (3.435 - 1.564e-04 year) and Individual ([3.435 + 4.406] + [-1.564e-04 + 1.822e-04] year or 7.84 + 2.58e-05 year). As a bonus you get the comparison between regression lines.
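The arithmetic for recovering each line from a single model with an interaction can be checked on simulated data. This is only a sketch: the groups `A` and `B`, the coefficients and the seed are all made up, and it simply shows that the lines implied by `y ~ x * g` match two separate regressions.

```r
set.seed(42)  # simulated data; all names and numbers here are made up
n = 50
x = rep(1:n, times = 2)
g = factor(rep(c('A', 'B'), each = n))
y = ifelse(g == 'A', 3 - 0.02 * x, 8 + 0.01 * x) + rnorm(2 * n, 0, 0.5)

# Single model with a group-specific intercept and slope
both = lm(y ~ x * g)
b = coef(both)

# Lines implied by the single model: A is the reference level,
# B adds the dummy and interaction terms
lineA = c(b['(Intercept)'], b['x'])
lineB = c(b['(Intercept)'] + b['gB'], b['x'] + b['x:gB'])

# Separate fits, one group at a time
fitA = coef(lm(y ~ x, subset = g == 'A'))
fitB = coef(lm(y ~ x, subset = g == 'B'))

# Both comparisons should agree up to numerical error
all.equal(unname(lineA), unname(fitA))
all.equal(unname(lineB), unname(fitB))
```

The single model adds something the separate fits do not: the `gB` and `x:gB` rows of `summary(both)` test directly whether the intercepts and slopes differ.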

In R as a second language I pointed out that ‘brevity reduces the time between thinking and implementation, so we can move on and keep on trying new ideas’. Sometimes it seriously does.

# A couple of thoughts on biotech and food security

“What has {insert biotech here} done for food security?” This question starts at the wrong end of the problem, because food security is much larger than any biotechnology. I would suggest that governance, property rights and education are the fundamental issues for food security, followed by biotechnological options. For example, the best biotechnology is useless if one is trying to do agriculture in a war-ravaged country.

Once we have a relatively stable government and educated people can rely on property rights, the effects of different biotechnologies will be magnified and it will be possible to better assess them. I would say that matching the most appropriate technologies to the local environmental, economic and cultural conditions is a good sign of sustainable agriculture. I would also say that the broader the portfolio of biotechnology and agronomic practices the more likely a good match will be. That is, I would not *a priori* exclude any biotechnology from the table based on *generic* considerations.

Should the success of a biotechnology for food security be measured as yield? It could be *one* of the desired effects but it is not necessarily the most important one. For example, having less fluctuating production (that is, reducing the variance rather than increasing the mean) could be more relevant. Or we could be interested in creating combinations of traits that are difficult to achieve by traditional breeding (e.g. biofortification), where yield is still the same but nutritional content differs. Or we would like to have a reduction of inputs (agrochemicals, for example) while maintaining yield. There are many potential answers and, coming back to matching practices to local requirements, using a simple average of all crops in a country (or a continent) is definitely the wrong scale of assessment. We do not want to work with an average farmer or an average consumer but to target specific needs with the best available practices. Sometimes this will include {insert biotech, agronomic practices here}, other times this will include {insert another biotech and set of agronomic practices here}.

And that is the way I think of improving food security.

# Jetsam 16: Backstroke

# Jetsam 15: My clown bear

# Jetsam 14: Dashboard

# R as a second language

Imagine that you are studying English as a second language; you learn the basic rules, some vocabulary and start writing sentences. After a little while, it is very likely that you’ll write grammatically correct sentences that no native speaker would use. You’d be following the formalisms but ignoring culture, idioms, slang and patterns of effective use.

R is a language and any newcomers, particularly if they already know another programming language, will struggle at the beginning to get what is beyond the formal grammar and vocabulary. I use R for inquisition: testing ideas, data exploration, visualization; under this setting, the easier it is to perform a task, the more likely one is to do it. It is possible to use several other languages for this but (and I think this is an important but) R’s brevity reduces the time between thinking and implementation, so we can move on and keep on trying new ideas^{†}.

A typical example is when we want to repeat something or iterate over a collection of elements. In most languages, if one wants to do something many times the obvious way is using a loop (coded like `for()` or `while()`). It is *possible* to use a `for()` loop in R, but many times it is the wrong tool for the job, as it increases the lag between thought and code, moving us away from ‘the flow’.

    # Generate some random data with 10 rows and 5 columns
    M = matrix(round(runif(50, 1, 5), 0), nrow = 10, ncol = 5)
    M
    #       [,1] [,2] [,3] [,4] [,5]
    #  [1,]    2    3    4    2    1
    #  [2,]    3    1    3    3    4
    #  [3,]    4    2    5    1    3
    #  [4,]    2    4    4    5    3
    #  [5,]    2    3    1    4    4
    #  [6,]    3    2    2    5    1
    #  [7,]    1    3    5    5    2
    #  [8,]    5    4    2    5    4
    #  [9,]    3    2    3    4    3
    # [10,]    4    4    1    2    3

    # Create dumb function that returns mean and median
    # for data
    sillyFunction = function(aRow) {
      c(mean(aRow), median(aRow))
    }

    # One-liner to apply our function to each row
    apply(M, 1, sillyFunction)
    #      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    # [1,]  2.4  2.8    3  3.6  2.8  2.6  3.2    4    3   2.8
    # [2,]  2.0  3.0    3  4.0  3.0  2.0  3.0    4    3   3.0

    # or one could do it for each column
    apply(M, 2, sillyFunction)

    # Of course one could use a loop. Pre-allocating
    # the result matrix would have a loop with little
    # time penalty (versus growing the matrix)
    nCases = dim(M)[1]
    resMatrix = matrix(0, nrow = nCases, ncol = 2)

    # and here is the loop
    for(i in 1:nCases){
      resMatrix[i, 1:2] = sillyFunction(M[i,])
    }

    resMatrix
    # Same results as before (one row per case)
    #       [,1] [,2]
    #  [1,]  2.4    2
    #  [2,]  2.8    3
    #  [3,]  3.0    3
    #  [4,]  3.6    4
    #  [5,]  2.8    3
    #  [6,]  2.6    2
    #  [7,]  3.2    3
    #  [8,]  4.0    4
    #  [9,]  3.0    3
    # [10,]  2.8    3
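One detail that is easy to miss in the output above: `apply()` over rows returns one *column* per row of the input, so it is the transpose of what the loop builds. A small sketch (with an arbitrary seed, so the numbers will differ from the ones printed above) makes the equivalence explicit:

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
M = matrix(round(runif(50, 1, 5), 0), nrow = 10, ncol = 5)
sillyFunction = function(aRow) c(mean(aRow), median(aRow))

# 2 x 10 result: one column per row of M
byApply = apply(M, 1, sillyFunction)

# 10 x 2 result: one row per row of M
byLoop = matrix(0, nrow = nrow(M), ncol = 2)
for (i in seq_len(nrow(M))) {
  byLoop[i, ] = sillyFunction(M[i, ])
}

dim(byApply)
dim(byLoop)

# Transposing the apply() result gives the loop result
all.equal(t(byApply), byLoop)
```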

One of the distinctive features of R is that there is already a lot of functionality available for jobs that occur frequently in data analysis. The easier it is to perform a task, the more likely one is to do it, which is perfect if one is exploring/thinking about data.

Thomas Lumley reminded me of the ACM citation for John Chambers—father of S of which R is an implementation—which stated that Chambers’s work:

…will forever alter the way people analyze, visualize, and manipulate data . . . S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers.

If I could summarize the relevance of R in a Tweetable phrase (with hash tags and everything) it would be:

Most data analysis languages underestimate the importance of interactivity/low barrier to exploration. That’s where #Rstats shines.

One could run statistical analyses with many languages (including generic ones), but to provide the right level of interactivity for analysis, visualization and data manipulation one ends up creating functions that, almost invariably, look a bit like R; pandas in Python, for example.

There are some complications with some of the design decisions in R, especially when we get down to consistency, which begets memorability. A glaring example is the `apply` family of functions, and here is where master opportunist (in the positive sense of expert at finding good opportunities) Hadley Wickham^{‡} made sense out of confusion in his package plyr.
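To illustrate the kind of inconsistency meant here, the sketch below uses only base R functions; the toy inputs are made up. Argument order and the type of the result change from one member of the family to the next.

```r
x = list(a = 1:3, b = 4:6)

# lapply() and sapply(): data first, function second
r1 = lapply(x, sum)      # always a list
r2 = sapply(x, sum)      # simplified to a named vector here...
r3 = sapply(x, range)    # ...but to a matrix when each result has length 2

# mapply(): function first, data afterwards
r4 = mapply(function(u, v) u + v, 1:3, 4:6)

# tapply(): data, then grouping index, then function
r5 = tapply(c(10, 20, 30, 40), c('g1', 'g1', 'g2', 'g2'), mean)

str(list(r1, r2, r3, r4, r5))
```

plyr's answer, as I understand it, was a naming scheme where the function name states the input and output types; the sketch above does not use plyr so that it runs on a bare R installation.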

There is also a tension in languages under considerable use because speakers/writers/analysts/coders start adapting them to new situations, adding words and turns of phrase. Look at English for an example! This is also happening to R and some people wish the language looked different in some non-trivial ways. A couple of examples: Coffeescript for R and Rasmus Bååth’s suggestions. Not all of them can be implemented, but suggestions like this speak of the success of R.

If you are struggling to start working with R, as with other languages, first let go. The key to learning and working with a new language is immersing yourself in it; even better if you do it with people who already speak it.

^{†} Just to be clear, there are several good statistical languages. However, none is as supportive of rapid inquisition as R (IMO). It is not unusual to develop models in one language (e.g. R) and implement them in another for operational purposes (e.g. SAS, Python, whatever).

^{‡} The first thing I admire about Hadley is his ‘good eye’ for finding points of friction. The second one is doing something about the frictions, often with very good taste.

P.S. It should come clear from this post that English is indeed my second language.

# Teaching linear models

I teach several courses every year and the most difficult to pull off is FORE224/STAT202: regression modeling.

The academic promotion application form in my university includes a section on one’s ‘teaching philosophy’. I struggle with that part because I suspect I lack anything as grandiose as a philosophy when teaching: like most university lecturers, I never studied teaching, although I try to do my best. If anything, I can say that I enjoy teaching and helping students to ‘get it’ and that I want to instill a sense of ‘statistics is fun’ in them. I spend quite a bit of time looking for memorable examples, linking to stats in the news (statschat and listening to the news while walking my dog are very helpful here) and collecting data. But a philosophy? Don’t think so.

One of the hardest parts of the course is the diversity of student backgrounds. Hitting the right level, the right tone is very hard. Make it too easy and the 1/5 to 1/4 of students with a good mathematical background will hate it; they may even decide to abandon any intention of continuing doing stats if ‘that’s all there is about the topic’. Make it too complicated and half the class will fail and/or hate the content.

Part of the problem is based around what we mean by teaching ‘statistics’. In some cases it seems limited to what specific software does; for example, teaching with Excel means restriction to whatever models are covered in Excel’s Data Analysis Toolpak (DAT). The next choice when teaching is using menu-driven software (e.g. SPSS), which provides much more statistical functionality than Excel + DAT, at the expense of being further removed from common usability conventions. At the other extreme of simplicity is software that requires coding to control the analyses (e.g. R or SAS). In general, the more control we want, the more we have to learn to achieve it^{†}.

A while ago I made a distinction between the different levels of learning (use cases) when teaching statistics. In summary, we had i- very few students getting into statistics and heavy-duty coding, ii- a slightly larger group that will use stats once in a while and iii- the majority that will mostly consume statistics. I feel a duty towards the three groups, while admitting that I have a predilection for the first one. Nevertheless, the third group provides most of the challenges and need for thinking about how to teach the subject.

When teaching linear models (general form \(y = X \beta + \epsilon\)) we tend to compartmentalize content: we have an ANOVA course if the design matrix \(X\) represents categorical predictors (contains only 1s and 0s), a regression course if \(X\) is full of continuous predictors and we talk about ANCOVA or regression on dummy variables if \(X\) is a combination of both. The use of different functions for different contents of \(X\) (for example `aov()` versus `lm()` in R or `proc reg` versus `proc glm` in SAS) further consolidates the distinction. Even when using menus, software tends to guide students through different submenus depending on the type of \(X\).

At the beginning of the course we restrict ourselves to \(X\) full of continuous predictors, but we introduce the notion of matrices with small examples. This permits showing the connection between all the linear model courses (because a rose by any other name…) and it also allows deriving a general expression of the formulas for the regression coefficients (essential for the most advanced students). Slower students may struggle with some of this material; however, working with small examples they can replicate the results from R (or Excel or SAS or whatever one uses to teach). Sometimes they even think it is cool.
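For reference, the general expression for the coefficients follows from minimizing the residual sum of squares of the model \(y = X \beta + \epsilon\). This is the standard least squares derivation, sketched here under the assumption that \(X\) has full column rank so that \(X'X\) can be inverted:

\[
\begin{aligned}
S(\beta) &= (y - X\beta)'(y - X\beta) \\
\frac{\partial S}{\partial \beta} &= -2 X'(y - X\beta) = 0 \\
X'X \hat{\beta} &= X'y \\
\hat{\beta} &= (X'X)^{-1} X'y
\end{aligned}
\]

Nothing in the derivation depends on whether the columns of \(X\) are continuous values or 0/1 indicators, which is the whole point of presenting it early.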

Here is where the `model.matrix()` R function becomes handy; rather than building incidence matrices by hand (which is easy for tiny examples) we can get the matrices used by the `lm()` function to then calculate regression parameters (and any other output) for more complex models.
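A minimal sketch of that workflow, using a made-up six-row data set (the names `d`, `grp` and `betaHat` are only for illustration): extract \(X\) with `model.matrix()`, compute the coefficients with matrix algebra and compare them with `lm()`.

```r
set.seed(7)  # made-up data; seed chosen arbitrarily
d = data.frame(x = c(1, 2, 3, 4, 5, 6),
               grp = factor(c('a', 'a', 'b', 'b', 'c', 'c')))
d$y = 2 + 0.5 * d$x + rnorm(6, 0, 0.1)

# Continuous predictor: a column of 1s and the x values
X1 = model.matrix(y ~ x, data = d)
# Categorical predictor: a column of 1s plus 0/1 dummy columns
X2 = model.matrix(y ~ grp, data = d)

# The same expression gives the coefficients in both cases:
# solve the normal equations X'X b = X'y
betaHat = function(X, y) solve(t(X) %*% X, t(X) %*% y)
b1 = betaHat(X1, d$y)
b2 = betaHat(X2, d$y)

# Compare with lm(); both comparisons should agree
all.equal(as.vector(b1), unname(coef(lm(y ~ x, data = d))))
all.equal(as.vector(b2), unname(coef(lm(y ~ grp, data = d))))
```

Printing `X1` and `X2` side by side is usually enough for students to see that the 'regression' and 'ANOVA' cases differ only in what the columns contain.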

Once students get the idea that in matrix terms our teaching compartments are pretty much the same, we can reinforce the idea by using a single function (or proc) to show that we can obtain all the bits and pieces that make up what we call ‘fitting the model’. This highlights the idea that ANOVA, ANCOVA & regression are subsets of linear models, which are subsets of linear mixed models, which are subsets of generalized linear mixed models. A statistical Russian doll.
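As a sketch of the 'single function' idea, the three compartments can all be fitted with `lm()`. The data are simulated and every name and number below is made up for illustration.

```r
set.seed(11)  # simulated data, arbitrary seed
d = data.frame(x = rep(1:10, times = 3),
               trt = factor(rep(c('a', 'b', 'c'), each = 10)))
d$y = 1 + 0.3 * d$x + c(0, 1, 2)[as.integer(d$trt)] + rnorm(30, 0, 0.5)

mAnova  = lm(y ~ trt, data = d)      # X holds only 0s and 1s
mReg    = lm(y ~ x, data = d)        # X holds a continuous predictor
mAncova = lm(y ~ x + trt, data = d)  # X combines both

# aov() gives the same coefficients as lm() for the categorical model
all.equal(coef(aov(y ~ trt, data = d)), coef(mAnova))

# and anova() on the lm() fit produces the familiar ANOVA table
anova(mAnova)
```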

We want students to understand, sometimes so badly that we lower the bar to a point where there is not much to understand. Here is the tricky part: finding the right level of detail so all types of students learn to enjoy the topic, although at different levels of understanding.

^{†}There is software that generates code from menus too, like Stata or Genstat.

P.S. This is part of my thinking aloud with hesitation about teaching, as in Statistics unplugged, Excel, fanaticism and R, Split-plot 1: How does a linear mixed model look like?, R, academia and the democratization of statistics, Mid-January flotsam: teaching edition & Teaching with R: the switch. I am always looking for better ways of transferring knowledge.

# Statistics unplugged

How much does *statistical* software help and how much does it interfere when teaching statistical *concepts*? Software used in the *practice* of statistics (say R, SAS, Stata, etc) brings to the party a mental model that is often alien to students, while being highly optimized for practitioners. It is possible to introduce a minimum of distraction while focusing on teaching concepts, although it requires careful choice of a subset of functionality. Almost invariably some students get stuck with the software and everything goes downhill from there; the student moves from struggling with a concept to struggling with syntax (Do I use a parenthesis here?).

I am a big fan of Tim Bell’s Computer Science Unplugged, a program for teaching Computer Science’s ideas at primary and secondary school *without* using computers (see example videos).

Here is an example video for public key encryption:

This type of instruction makes me question both *how* we teach statistics and *at what level* we can start teaching statistics. The good news is that the New Zealand school curriculum includes statistics in secondary school, for which there is an increasing number of resources. However, I think we could be targeting students even earlier.

This year my wife was helping primary school students participating in a science fair and I ended up volunteering to introduce them to some basic concepts so they could design their own experiments. Students got the idea of the need for replication, randomization, etc based on a simple question: Did one of them have special powers to guess the result of flipping a coin? (Of course this is Fisher’s tea-drinking-lady-experiment, but no 10 year old cares about tea, while at least some of them care about super powers). After the discussion one of them ran a very cool experiment on the effect of liquefaction on the growth of native grasses (very pertinent in post-earthquake Christchurch), with 20 replicates (pots) for each treatment. He got the concepts behind the experiment; software just entered the scene when we needed to confirm our understanding of the results in a visual way:

People tell me that teaching stats without a computer is like teaching chemistry without a lab or doing astronomy without a telescope, or… you get the idea. At the same time, there are some books that describe *some* class activities that do not need a computer; e.g. Gelman’s Teaching Statistics: A Bag of Tricks. (Incidentally, why is that book so friggin’ expensive?)

## Back to uni

Back from primary school kiddies to a regression course at university. Let’s say that we have two variables, x & y, and that we want to regress y (response) on x (predictor) and get diagnostic plots. In R we could simulate some data and plot the relationship using something like this:

    # Basic regression data
    n = 100
    x = 1:n
    y = 70 + x*5 + rnorm(n, 0, 40)

    # Changing couple of points and plotting
    y[50] = 550
    y[100] = 350
    plot(y ~ x)

We can then fit the linear regression and get some diagnostic plots using:

    # Regression and diagnostics
    m1 = lm(y ~ x)
    par(mfrow = c(2,2))
    plot(m1)
    par(mfrow = c(1,1))

If we ask students to explain the 4th plot, which displays discrepancy (how far a point is from the general trend) against leverage (how far a point is from the center of mass, the pivot of the regression), many of them will struggle to say what is going on in that plot. At that moment one could go and calculate the hat matrix of the regression (\(X (X'X)^{-1} X'\)) and get leverage from the diagonal, etc and students will get a foggy idea. Another, probably better, option is to present the issue as a physical system on which students already have experience. A good candidate for a physical system is a seesaw, because many (perhaps most) students experienced playing on one as children.
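For the hat matrix route, a minimal sketch follows. It repeats the simulation from above with an arbitrary seed (my addition, so the result is repeatable) and checks the hand calculation against R's `hatvalues()`.

```r
set.seed(123)  # arbitrary seed for a repeatable sketch
n = 100
x = 1:n
y = 70 + x*5 + rnorm(n, 0, 40)
y[50] = 550
y[100] = 350
m1 = lm(y ~ x)

# Hat matrix H = X (X'X)^{-1} X'; leverage is its diagonal
X = model.matrix(m1)
H = X %*% solve(t(X) %*% X) %*% t(X)
leverage = diag(H)

# Should match the built-in calculation
all.equal(unname(leverage), unname(hatvalues(m1)))

# The point at the end of the x range has more leverage
# than the one in the middle
leverage[c(50, 100)]
```

Leverage depends only on `x`, not on `y`: changing `y[50]` and `y[100]` moved the points away from the trend but did not change how far they sit from the pivot.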

Take your students to a playground (luckily there is one next to uni) and get them playing with a seesaw. The influence of a point is related to the product of *leverage* (how far from the pivot we are applying force) and *discrepancy* (how big the applied force is). The influence of a point on the estimated regression coefficients will be very large when we apply a strong force far from the pivot (as in our point y[100]), just as it happens in a seesaw. We can apply lots of force (discrepancy) near the pivot (as in our point y[50]) and little will happen. Students like mucking around with the seesaw and, more importantly, they remember.
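Back at the computer, one common way of putting a number on influence is Cook's distance, which combines the squared residual (discrepancy) with a function of leverage. The post does not name a specific measure, so treat this choice as my assumption; the sketch reuses the simulated data with an arbitrary seed and checks the hand calculation against `cooks.distance()`.

```r
set.seed(123)  # arbitrary seed for a repeatable sketch
n = 100
x = 1:n
y = 70 + x*5 + rnorm(n, 0, 40)
y[50] = 550
y[100] = 350
m1 = lm(y ~ x)

h  = hatvalues(m1)                 # leverage
e  = residuals(m1)                 # raw residuals (discrepancy)
p  = length(coef(m1))              # number of coefficients
s2 = sum(e^2) / df.residual(m1)    # residual variance

# Cook's distance: discrepancy term times leverage term
cooksByHand = (e^2 / (p * s2)) * h / (1 - h)^2
all.equal(unname(cooksByHand), unname(cooks.distance(m1)))

# Similar discrepancy, very different influence
cooksByHand[c(50, 100)]
```

Both points were moved a similar distance from the trend, but the one pushed far from the pivot ends up with the larger value, as on the seesaw.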

Analogy can go only so far. Sometimes a physical analogy like a quincunx (to demonstrate the central limit theorem) ends up being more confusing than using an example with variables that are more meaningful for students.

I don’t know what the maximum proportion of course content is that could be replaced by using props, experiments, animations, software specifically designed to make a point (rather than to run analysis), etc. I do know that we still need to introduce ‘proper’ statistical software; at some point students have to face praxis. Nevertheless, developing an intuitive understanding is vital to move beyond performing monkeys; that is, people clicking on menus or going through the motions of copy/pasting code without understanding what’s going on in the analyses.

I’d like to hear if you have any favorite demos/props/etc when explaining statistical concepts.

P.S. In this post I don’t care if you love stats software, but I specifically care about helping learners who struggle to understand concepts.