
## Turtles all the way down

One of the main uses for R is for exploration and learning. Let’s say that I wanted to learn simple linear regression (the bread and butter of statistics) and see how the formulas work. I could simulate a simple example and fit the regression with R:

library(arm)  # For display()

# Simulate 5 observations
set.seed(50)
x <- 1:5
y <- 2 + 3*x + rnorm(5, mean = 0, sd = 3)

# Fit regression
dat <- data.frame(x, y)
reg <- lm(y ~ x, data = dat)
display(reg)

# lm(formula = y ~ x, data = dat)
#             coef.est coef.se
# (Intercept) 3.99     3.05
# x           2.04     0.92
# ---
# n = 5, k = 2
# residual sd = 2.91, R-Squared = 0.62

# Plot it
plot(y ~ x)
abline(coef(reg))


The formulas for the intercept ($$b_0$$) and the slope ($$b_1$$) are pretty simple, and I have been told that there is a generic expression that instead uses matrices.

$$b_1 = \frac{\sum{x y} - n \bar{x} \bar{y}}{\sum{x x} - n \bar{x}^2}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$

$$\boldsymbol{b} = (\boldsymbol{X}'\boldsymbol{X})^{-1} \boldsymbol{X}'\boldsymbol{y}$$

How do the contents of the matrices and the simple formulas relate to each other?

# Formulas for slope and intercept
b1 <- (sum(x*y) - length(x)*mean(x)*mean(y))/(sum(x*x) - length(x)*mean(x)^2)
b0 <- mean(y) - b1*mean(x)


Funnily enough, looking at the matrices we can see the same sums of squares and cross products as in the formulas.

X <- model.matrix(reg)

#   (Intercept) x
# 1           1 1
# 2           1 2
# 3           1 3
# 4           1 4
# 5           1 5
# attr(,"assign")
# [1] 0 1

t(X) %*% X
#             (Intercept)  x
# (Intercept)           5 15
# x                    15 55

# So XX contains bits and pieces of the previous formulas
length(x)
# [1] 5

sum(x)
# [1] 15

sum(x*x)
# [1] 55

# And so does Xy
t(X) %*% y
#                  [,1]
# (Intercept)  50.61283
# x           172.27210

sum(y)
# [1] 50.61283

sum(x*y)
# [1] 172.2721

# So if we combine the whole lot and remember that
# solve() calculates the inverse
solve(t(X) %*% X) %*% t(X) %*% y
#                 [,1]
# (Intercept) 3.992481
# x           2.043362


But I have been told that R (as most statistical software) doesn't use the inverse of the matrix for estimating the coefficients. So how does it work?

If I type lm, R will print the code of the lm() function. A quick look reveals that there is a lot of code reading the arguments and checking that everything is OK before proceeding. However, the function then calls something else: lm.fit(). With some trepidation I type lm.fit, which again performs more checks and then calls something with a different notation:

z <- .Call(C_Cdqrls, x, y, tol, FALSE)


This denotes a call to a C-language function which, after some searching on Google, we find in readable form in the lm.c file. Another quick look brings more checking and a call to Fortran code:

F77_CALL(dqrls)(REAL(qr), &n, &p, REAL(y), &ny, &rtol,
                REAL(coefficients), REAL(residuals), REAL(effects),
                &rank, INTEGER(pivot), REAL(qraux), work);


which is a highly tuned routine for QR decomposition in a linear algebra library. By now we know that the general matrix expression produces the same results as our initial formulas, and that R's lm() function does not use a matrix inverse but a QR decomposition to solve the system of equations.
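
To convince ourselves, we can ask R for the QR-based solution directly and compare it with the inverse-based formula. This is my own quick check, reusing the simulated data from above:

```r
# Refit the example data and solve the least-squares system two ways
set.seed(50)
x <- 1:5
y <- 2 + 3*x + rnorm(5, mean = 0, sd = 3)
X <- cbind('(Intercept)' = 1, x = x)  # same design matrix as model.matrix(reg)

b_inv <- solve(t(X) %*% X) %*% t(X) %*% y  # textbook inverse-based formula
b_qr  <- qr.coef(qr(X), y)                 # QR decomposition, as lm() uses

all.equal(as.numeric(b_inv), as.numeric(b_qr))
# Both approaches agree up to floating-point error,
# but QR is more numerically stable
```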

One of the beauties of R is that it brought the power of statistical computing to the masses, not only letting you fit models but also letting you peek at how things are implemented. As a user, I don't need to know that there is a chain of function calls behind my bread-and-butter linear regression. But it is comforting to the nerdy me that I can have a quick look at it.

All this for free, which sounds like a very good deal to me.


## Old dog and the tidyverse

I started using R ages ago and have happily lived in mostly-base-R for data manipulation. Once in a while I move to something that makes a big difference, like ggplot2 in 2010 or Rmarkdown in 2015, but the set of packages I use for data manipulation and plotting hasn't seen many changes. I have to confess that, meanwhile, I have tested quite a few approaches on the analytics side of things (last year it was the turn of Bayesian statistics for me).

Last week, I decided to learn more about the tidyverse, thinking of using it more with forestry postgrad students. Now, there is no lack of tutorials, reviews, documentation, etc. for the tidyverse, but most writing shows a final version of the code, without exposing the thinking and dead ends that go behind it. In this post I show how my code was changing, both after reading a few pieces of documentation and, mostly, from feedback I got from Hadley Wickham and Michael MacAskill via this Kiwi Twitter thread. This post shows minor differences in variable names from that thread, as I changed a few things while reading the files.

The problem is as follows: I have two data frames with trial assessments. Frame one, called early, covers trees at ages 5, 7 and 8 years (although ages are in months rather than years). Frame two, called late, covers trees at age 20 years. Yes, it takes a while working with trees.

We want to keep only age 8 years (96 months) from early and want to split a code into two variables, as well as convert a variable from numeric to character. In late we want to create a tree family code, based on a set of rules to connect field codes to the pedigree of trees. Did I mention that I work breeding trees?

Finally, we want to merge all the assessments from age 8 with the assessment at age 20 for the same trees.

Rather than showing the final version of the code, it is much more interesting to show its evolution, also including how I would have done this in base R. I’m omitting the reading of the file and boring case conversion of variable names, etc.

In base R, I would probably do something like this (I’m using the stringr package just to make my life easier):

library(stringr)

early_8 <- subset(early, age == 96)
early_8 <- within(early_8, {
  rep <- sapply(section, function(x) unlist(str_split(x, '_'))[1])
  sets <- sapply(section, function(x) unlist(str_split(x, '_'))[2])
  tree <- as.character(tree_position)
})

late <- within(late, {
  family <- ifelse(field_code < 500,
                   paste('885', str_pad(field_code, 3, pad = '0'), sep = ''),
                   ifelse(field_code >= 500 & field_code < 600,
                          paste('883', str_pad(field_code - 500, 3, pad = '0'), sep = ''),
                          as.character(field_code)))
  rep <- as.character(rep)
  tree <- as.character(tree)
})

both <- merge(early_8, late, by.x = 'genotype', by.y = 'family')


My first approach to dealing with the early frame with the tidyverse looked like:

library(tidyverse)
library(stringr)

early %>%
  filter(age == 96) %>%
  mutate(rep = flatten_chr(map(section, function(x) unlist(str_split(x, '_'))[1]))) %>%
  mutate(sets = flatten_chr(map(section, function(x) unlist(str_split(x, '_'))[2]))) %>%
  mutate(tree = as.character(tree_position)) -> early_8


While the second frame was processed using:

late %>%
  mutate(family = ifelse(field_code < 500,
                         paste('885', str_pad(field_code, 3, pad = '0'), sep = ''),
                         ifelse(field_code >= 500 & field_code < 600,
                                paste('883', str_pad(field_code - 500, 3, pad = '0'), sep = ''),
                                as.character(field_code)))) %>%
  mutate(rep = as.character(rep)) %>%
  mutate(tree = as.character(tree)) -> late


I used multiple instances of mutate because I thought it would be easier to read. The use of map instead of sapply is cool, particularly when one starts looking at more advanced features. Comments from the crowd on Twitter: mutate the whole lot in a single statement (although Hadley pointed out that there was no performance penalty in using them separately) and try case_when to make nested ifelse easier to understand. The comments on map went in two directions: either use map_chr as a clearer and safer alternative, or just use tidyr's separate function. The first option for early would look like:

early %>%
  filter(age == 96) %>%
  mutate(rep = map_chr(section, function(x) unlist(str_split(x, '_'))[1])) %>%
  mutate(sets = map_chr(section, function(x) unlist(str_split(x, '_'))[2])) %>%
  mutate(tree = as.character(tree_position)) -> early_8


However, I ended up going with separate, which is easier on the eye and faster, for a final version that looks like this:

early %>%
  filter(age == 96) %>%
  separate(section, c('rep', 'sets'), sep = '_') %>%
  mutate(tree = as.character(tree_position),
         genotype = as.character(1:nrow(.) + 10^6)) -> early_8


So we filter early, separate the single section code into two variables (rep and sets), and create a couple of variables using mutate (one is a simple type conversion to character, while the other is a code starting at 1,000,000).
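
A minimal toy example (with made-up section codes, not my trial data) shows what separate() is doing here:

```r
library(tidyverse)

# Toy data frame using the same 'rep_sets' coding pattern
toy <- tibble(section = c('1_A', '1_B', '2_A'))

toy %>%
  separate(section, c('rep', 'sets'), sep = '_')
# section is split at the underscore into two character
# columns: rep ('1', '1', '2') and sets ('A', 'B', 'A')
```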

In the case of late, I ended up with:

late %>%
  mutate(family = case_when(
           field_code < 500 ~ paste('885', str_pad(field_code, 3, pad = '0'), sep = ''),
           field_code >= 500 & field_code < 600 ~ paste('883', str_pad(field_code - 500, 3, pad = '0'), sep = ''),
           TRUE ~ as.character(field_code)),
         rep = as.character(rep),
         tree = as.character(tree)) -> late


And we merge the files using:

early_8 %>%
  left_join(late, by = c('rep', 'sets', 'tree')) %>%
  left_join(base_pedigree, by = c('family' = 'genotype')) -> both


Some (many? most?) people may disagree with my use of the right assign, which I love. Surely one could use either the left assign or %<>% from the magrittr package. By the way, why do I have to explicitly load magrittr (instead of relying on the tidyverse) to access %<>%?

And this is how I go about learning new things: lots of false starts, often working with small examples (I used a few to check how left_join was working), lots of searching for explanations and tutorials (thanks to everyone who has written them), and asking on Twitter. If you are just starting to program, in any language, do not feel intimidated by cool-looking code; most of the time it took many iterations to get it looking like that.


## Why you shouldn’t entrust your pet to Glenstar Kennels

Travel is part of life and if you have pets, finding appropriate boarding for them is a must. This is my explanation for why you should not entrust your dog to Glenstar Kennels, in Canterbury, New Zealand.

At the end of 2016 I had work approval for a two-month trip overseas. Normally I would book accommodation for my dog at the SPCA boarding kennels (as when we had two-months repairs to our house following Christchurch’s earthquake). However, as this trip included Christmas/New Year, it was impossible to find a vacancy. I was happy to find a spot for my dog at Glenstar Kennels spanning the whole end of year period.

Sadly, after two months travelling I got an unwelcome surprise when I went to pick up my dog. He was almost 5 kg overweight, which for a 27 kg dog is nearly 20% weight gain in two months. As an illustration, imagine weighing 75 kg and gaining 15 kg in only two months.

I immediately wrote to the owner of Glenstar Kennels, who stated that “Whilst I do agree he has put on weight had he fed a cheap food and lost weight in the kennels I feel you would be more upset”. Well, a dog becoming overweight is no better than a dog losing weight! Both situations shorten an animal's lifespan. While I believe the quality of the food provided was probably appropriate, the combination of physical activity and food quantity was clearly inappropriate.

The New Zealand Animal Welfare Act 1999 and the Code of Welfare for Dogs 2010 (administered by the Ministry for Primary Industries) establish a series of minimum standards for dog care:

1. Dogs need a balanced daily diet in quantities that meet their requirements for health and welfare and to maintain their ideal bodyweight.
2. The amount of food offered needs to be increased if a dog is losing condition, or decreased if it is becoming overweight.
3. The code of welfare for dogs applies in all situations, including temporary housing such as shelters, doggy daycares or day boarding facilities, and kennels.

According to the schedule sent to me by Glenstar Kennels, there were 12 hours of contact a day when someone had access to my dog, and 65 days to figure out that he was becoming overweight and take appropriate action. The photo below shows my dog's increased girth: his harness, fitted to his size when I dropped him off at Glenstar's facility on 9th December, against his size on 12th February (after gaining 5 kg). The difference is dramatic, and well explained by the term negligence.

I am very unhappy with the level of care provided by Glenstar Kennels and the lack of a satisfactory reply to my written complaints. After two months away I had to take my dog to his usual veterinarian to discuss his excess weight, with the associated cost, so I could bring him back to good health. Dog and I have been doing our usual daily walks (as we did before the trip), and I have been very careful with his nutrition, so he can slowly go back to his optimal 27 kg.

Unfortunately, there is no compulsory regulatory body for dog boarding kennels that could enforce the Code of Welfare for Dogs. However, I feel that I have to write this review and make Glenstar Kennels' negligence public, so other dog owners (and potential customers) are aware of their extremely poor service.

P.S. The owners of Glenstar Kennels also have another company: Star Pet Travel for pet relocations. They use Glenstar Kennels for their temporary accommodation. I wouldn’t use them either.


## Jetsam 33: 1:51″ in the backyard

I was testing the Zoom H4n with a couple of external omni microphones while walking in the backyard. It was quite windy, with the wind picking up around 30″, and then the dog came and sniffed the mics at around 45″.


## Cute Gibbs sampling for rounded observations

I was attending a course of Bayesian Statistics where this problem showed up:

There are a number of individuals, say 12, who take a pass/fail test 15 times. For each individual we have recorded the number of passes, which can go from 0 to 15. Because of confidentiality issues, we are presented with data rounded to the closest multiple of 3 ($$\mathbf{R}$$). We are interested in estimating $$\theta$$ of the Binomial distribution behind the data.

Rounding is probabilistic, with probability 2/3 if the count is one away from a multiple of 3 and probability 1/3 if it is two counts away. Multiples of 3 are not rounded.
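
The rounding mechanism can be sketched as a small simulation (round3() is my own helper, not part of the course material):

```r
# Probabilistic rounding of a single count to a multiple of 3:
# counts one away from a multiple move there with probability 2/3
# (and to the multiple two counts away with probability 1/3);
# exact multiples of 3 are reported unchanged.
round3 <- function(y) {
  r <- y %% 3
  if (r == 0) return(y)
  lower <- y - r       # multiple of 3 below y
  upper <- lower + 3   # multiple of 3 above y
  near  <- if (r == 1) lower else upper  # one count away
  far   <- if (r == 1) upper else lower  # two counts away
  sample(c(near, far), 1, prob = c(2/3, 1/3))
}

round3(6)  # always 6
round3(4)  # 3 with probability 2/3, 6 with probability 1/3
```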

We can use Gibbs sampling to alternate between sampling the posterior for the unrounded $$\mathbf{Y}$$ and $$\theta$$. In the case of $$\mathbf{Y}$$ I used:

# Possible values that were rounded to R
possible <- function(rounded) {
  if(rounded == 0) {
    options <- c(0, 1, 2)
  } else {
    options <- c(rounded - 2, rounded - 1, rounded,
                 rounded + 1, rounded + 2)
  }
  return(options)
}

# Probability mass function of numbers rounding to R
# given theta
prior_y <- function(options, theta) {
  p <- dbinom(options, 15, prob = theta)
  return(p)
}

# Likelihood of rounding
like_round3 <- function(options) {
  if(length(options) == 3) {
    like <- c(1, 2/3, 1/3)
  } else {
    like <- c(1/3, 2/3, 1, 2/3, 1/3)
  }
  return(like)
}

# Estimating the posterior mass function and drawing a
# random value from it
posterior_sample_y <- function(R, theta) {
  po <- possible(R)
  pr <- prior_y(po, theta)
  li <- like_round3(po)
  post <- li*pr/sum(li*pr)
  samp <- sample(po, 1, prob = post)
  return(samp)
}


While for $$\theta$$ we are assuming a vague $$\mbox{Beta}(\alpha, \beta)$$, with $$\alpha$$ and $$\beta$$ equal to 1, as the prior density function, so the posterior density is a $$\mbox{Beta}(\alpha + \sum Y_i, \beta + 12 \times 15 - \sum Y_i)$$.

## Function to sample from the posterior Pr(theta | Y, R)
posterior_sample_theta <- function(alpha, beta, Y) {
  theta <- rbeta(1, alpha + sum(Y), beta + 12*15 - sum(Y))
  return(theta)
}


I then implemented the sampler as:

## Data
R <- c(0, 0, 3, 9, 3, 0, 6, 3, 0, 6, 0, 3)
nsim <- 10000
burnin <- 1000
alpha <- 1
beta <- 1
store <- matrix(0, nrow = nsim, ncol = length(R) + 1)

starting.values <- c(R, 0.1)

## Sampling
store[1,] <- starting.values
for(draw in 2:nsim){
  current <- store[draw - 1,]
  for(obs in 1:length(R)) {
    y <- posterior_sample_y(R[obs], current[length(R) + 1])
    # Jump or not still missing
    current[obs] <- y
  }
  theta <- posterior_sample_theta(alpha, beta, current[1:length(R)])
  # Jump or not still missing
  current[length(R) + 1] <- theta

  store[draw,] <- current
}


And plotted the results as:

plot((burnin+1):nsim, store[(burnin+1):nsim,13], type = 'l')

library(ggplot2)

ggplot(data.frame(theta = store[(burnin+1):nsim, 13]), aes(x = theta)) +
  geom_density(fill = 'blue', alpha = 0.5)


multiple_plot <- data.frame(Y = matrix(store[(burnin+1):nsim, 1:12],
                                       nrow = (nsim - burnin)*12,
                                       ncol = 1))
multiple_plot$obs <- factor(rep(1:12, each = (nsim - burnin)))

ggplot(multiple_plot, aes(x = Y)) + geom_histogram() + facet_grid(~obs)

I thought it was a nice, cute example of simultaneously estimating a latent variable and, based on that, estimating the parameter behind it.


## Entitlement

You are not entitled to your opinion. You are entitled to your informed opinion. No one is entitled to be ignorant—Harlan Ellison


## The camera is an instrument

The camera is an instrument that teaches people how to see without a camera—Dorothea Lange
