R, Julia and genome wide selection

— “You are a pussy” emailed my friend.
— “Sensu cat?” I replied.
— “No. Sensu chicken” blurbed my now ex-friend.

What was this about? He read my post on R, Julia and the shiny new thing, which prompted him to assume that I was the proverbial old dog unwilling (or was it unable?) to learn new tricks. (Incidentally, with friends like this who needs enemies? Hi, Gus.)

Having a look at different—statistical—horses (Photo: Luis).

I decided to tackle a small—but hopefully useful—piece of code: fitting/training a Genome Wide Selection model, using the Bayes A approach put forward by Meuwissen, Hayes and Goddard in 2001. In that approach the breeding values of the individuals (response) are expressed as a function of a very large number of random predictors (2000, our molecular markers). The dataset (csv file) is a simulation of 2000 bi-allelic markers (aa = 0, Aa = 1, AA = 2) for 250 individuals, followed by the phenotypes (column 2001) and breeding values (column 2002). These models are frequently adjusted using MCMC.

In 2010 I attended this course in Ames, Iowa where Rohan Fernando passed us the following R code (pretty much a transliteration from C code; notice the trailing semicolons, for example). P.D. 2012-04-26 Please note that this is teaching code not production code:

nmarkers = 2000;    # number of markers
startMarker = 1981; # set to 1 to use all
numiter  = 2000;    # number of iterations
vara     = 1.0/20.0; 

# input data
data     = matrix(scan("trainData.out0"),ncol=nmarkers+2,byrow=TRUE);
nrecords = dim(data)[1];

beg = Sys.time()

# x has the mean followed by the markers
x = cbind(1,data[,startMarker:nmarkers]);
y = data[,nmarkers+1];
a =  data[,nmarkers+2];
# inital values

nmarkers = nmarkers - startMarker + 1;
mean2pq = 0.5;                          # just an approximation
scalea  = 0.5*vara/(nmarkers*mean2pq);  # 0.5 = (v-2)/v for v=4

size = dim(x)[2];
b = array(0.0,size);
meanb = b;
b[1] = mean(y);
var  = array(0.0,size);

# adjust y
 ycorr = y - x%*%b;                  

# MCMC sampling
for (iter in 1:numiter){
  # sample vare
  vare = ( t(ycorr)%*%ycorr )/rchisq(1,nrecords + 3);

  # sample intercept
  ycorr = ycorr + x[,1]*b[1];
  rhs = sum(ycorr)/vare;
  invLhs = 1.0/(nrecords/vare);
  mean = rhs*invLhs;
  b[1] = rnorm(1,mean,sqrt(invLhs));
  ycorr = ycorr - x[,1]*b[1];
  meanb[1] = meanb[1] + b[1];

  # sample variance for each locus
  for (locus in 2:size){
    var[locus] = (scalea*4+b[locus]*b[locus])/rchisq(1,4.0+1)
  }

# sample effect for each locus
  for (locus in 2:size){
    # unadjust y for this locus
    ycorr = ycorr + x[,locus]*b[locus];
    rhs = t(x[,locus])%*%ycorr/vare;
    lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus];
    invLhs = 1.0/lhs;
    mean = invLhs*rhs;
    b[locus]= rnorm(1,mean,sqrt(invLhs));
    #adjust y for the new value of this locus
    ycorr = ycorr - x[,locus]*b[locus];
    meanb[locus] = meanb[locus] + b[locus];
  }
}

Sys.time() - beg

meanb = meanb/numiter;
aHat  = x %*% meanb;

Thus, we just need defining a few variables, reading the data (marker genotypes, breeding values and phenotypic data) into a matrix, creating loops, matrix and vector multiplication and generating random numbers (using a Gaussian and Chi squared distributions). Not much if you think about it, but I didn’t have much time to explore Julia’s features as to go for something more complex.

nmarkers = 2000    # Number of markers
startmarker = 1981 # Set to 1 to use all
numiter = 2000     # Number of iterations

data = dlmread("markers.csv", ',')
(nrecords, ncols) = size(data)

tic()

#this is the mean and markers matrix
X = hcat(ones(Float64, nrecords), data[:, startmarker:nmarkers])
y = data[:, nmarkers + 1]
a = data[:, nmarkers + 2]

nmarkers = nmarkers - startmarker + 1
vara = 1.0/nmarkers
mean2pq = 0.5

scalea  = 0.5*vara/(nmarkers*mean2pq) # 0.5 = (v-2)/v for v=4

ndesign = size(X, 2)
b = zeros(Float64, ndesign)
meanb = zeros(Float64, ndesign)
b[1] = mean(y)
varian  = zeros(Float64, ndesign)

# adjust y
ycorr = y - X * b                  

# MCMC sampling
for i = 1:numiter
  # sample vare
  vare = dot(ycorr, ycorr )/randchi2(nrecords + 3)

  # sample intercept
  ycorr = ycorr + X[:, 1] * b[1];
  rhs = sum(ycorr)/vare;
  invlhs = 1.0/(nrecords/vare);
  mn = rhs*invlhs;
  b[1] = randn() * sqrt(invlhs) + mn;
  ycorr = ycorr - X[:, 1] * b[1];
  meanb[1] = meanb[1] + b[1];

  # sample variance for each locus
  for locus = 2:ndesign
      varian[locus] = (scalea*4 + b[locus]*b[locus])/randchi2(4.0 + 1);
  end

  # sample effect for each locus
  for locus = 2:ndesign
      # unadjust y for this locus
      ycorr = ycorr + X[:, locus] * b[locus];
      rhs = dot(X[:, locus], ycorr)/vare;
      lhs = dot(X[:, locus], X[:, locus])/vare + 1.0/varian[locus];
      invlhs = 1.0/lhs;
      mn = invlhs * rhs;
      b[locus] = randn() * sqrt(invlhs) + mn;
      #adjust y for the new value of this locus
      ycorr = ycorr - X[:, locus] * b[locus];
      meanb[locus] = meanb[locus] + b[locus];
  end
end

toc()

meanb = meanb/numiter;
aHat  = X * meanb;

The code looks remarkably similar and there are four main sources of differences:

  1. The first trivial one is that the original code read a binary dataset and I didn’t know how to do it in Julia, so I’ve read a csv file instead (this is why I start timing after reading the file too).
  2. The second trivial one is to avoid name conflicts between variables and functions; for example, in R the user is allowed to have a variable called var that will not interfere with the variance function. Julia is picky about that, so I needed renaming some variables.
  3. Julia pases variables by reference, while R does so by value when assigning matrices, which tripped me because in the original R code there was something like: b = array(0.0,size); meanb = b;. This works fine in R, but in Julia changes to the b vector also changed meanb.
  4. The definition of scalar vs array created some problems in Julia. For example y' * y (t(y) %*% y in R) is numerically equivalent to dot(y, y). However, the first version returns an array, while the second one a scalar. I got an error message when trying to store the ‘scalar like an array’ in to an array. I find that confusing.

One interesting point in this comparison is using rough code, not really optimized for speed; in fact, the only thing that I can say of the Julia code is that ‘it runs’ and it probably is not very idiomatic. Testing runs with different numbers of markers we get that R needs roughly 2.8x the time used by Julia. The Julia website claims better results in benchmarks, but in real life we work with, well, real problems.

In 1996-7 I switched from SAS to ASReml for genetic analyses because it was 1-2 orders of magnitude faster and opened a world of new models. Today a change from R to Julia would deliver (in this particular case) a much more modest speed up (~3x), which is OK but not worth changing languages (yet). Together with the embryonic graphical capabilities and the still-to-develop ecosystem of packages, means that I’m still using R. Nevertheless, the Julia team has achieved very impressive performance in very little time, so it is worth to keep an eye on their progress.

P.S.1 Readers are welcome to suggesting ways of improving the code.
P.S.2 WordPress does not let me upload the binary version of the simulated data.
P.S.3 Hey WordPress guys; it would be handy if the sourcecode plugin supported Julia!
P.S.4 2012-04-26 Following AL’s recommendation in the comments, one can replace in R:

rhs = t(x[,locus])%*%ycorr/vare;
lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus]

by

rhs = crossprod(x[,locus],ycorr)/vare
lhs = crossprod(x[,locus],x[,locus])/vare + 1.0/var[locus]

reducing execution time by roughly 20%, making the difference between Julia and R even smaller.

Tall big data, wide big data

After attending two one-day workshops last week I spent most days paying attention to (well, at least listening to) presentations in this biostatistics conference. Most presenters were R users—although Genstat, Matlab and SAS fans were also present and not once I heard “I can’t deal with the current size of my data sets”. However, there were some complaints about the speed of R, particularly when dealing with simulations or some genomic analyses.

Some people worried about the size of coming datasets; nevertheless that worry was across statistical packages or, more precisely, it went beyond statistical software. How will we able to even store the data from something like the Square Kilometer Array, let alone analyze it?

LaurelAndHardy.jpg

In a previous post I was asking if we needed to actually deal with ‘big data’ in R, and my answer was probably not or, better, at least not directly. I still think that it is a valid, although incomplete opinion. In many statistical analyses we can think of n (the number of observations) and p (the number of variables per observation). In most cases, particularly when people refer to big data, n >> p. Thus, we may have 100 million people but we have only 10 potential predictors: tall data. In contrast, we may have only 1,000 individuals but with 50,000 points each coming from a near infrared spectrometry or information from 250,000 SNPs (a type of molecular marker): wide data. Both types of data will keep on growing but are challenging in a different way.

In a totally generalizing, unfair and simplistic way I will state that dealing with wide is more difficult (and potentially interesting) than dealing with tall. This from a modeling perspective. As the t-shirt says: sampling is not a crime, and it should work quite well with simpler models and large datasets. In contrast, sampling to fitting wide data may not work at all.

Algorithms. Clever algorithms is what we need in a first stage. For example, we can fit linear mixed models to a tall dataset with ten millions records or a multivariate mixed model with 60 responses using ASReml-R. Wide datasets are often approached using Bayesian inference, but MCMC gets slooow when dealing with thousands of predictors, we need other fast approximations to the posterior distributions.

This post may not be totally coherent, but it keeps the conversation going. My excuse? I was watching Be kind rewind while writing it.

R, academia and the democratization of statistics

I am not a statistician but I use statistics, teach some statistics and write about applications of statistics in biological problems.

Last week I was in this biostatistics conference, talking with a Ph.D. student who was surprised about this situation because I didn’t have any statistical training. I corrected “any formal training”. On the first day one of the invited speakers was musing about the growing number of “amateurs” using statistics—many times wrongly—and about what biostatisticians could offer as professional value-adding. Yes, he was talking about people like me spoiling the party.

Twenty years ago it was more difficult to say “cool, I have some data and I will run some statistical analyses” because (and you can easily see where I am going here) access to statistical software was difficult. You were among the lucky ones if you were based at a university or a large company, because you had access to SAS, SPSS, MINITAB, etc. However, you were out of luck outside of these environments, because there was no way to easily afford a personal licence, not for a hobby at least. This greatly limited the pool of people that could afford to muck around with stats.

Gratuitous picture: spiral in Sydney

Enter R and other free (sensu gratis) software that allowed us to skip the specialist, skip the university or the large organization. Do you need a formal degree in statistics to start running analyses? Do you even need to go through a university (for any degree, it doesn’t really matter) to do so? There are plenty of resources to start, download R or a matrix language, get online tutorials and books, read, read, read and ask questions in email lists or fora when you get stuck. If you are a clever cookie—and some of you clearly are one—you could easily cover as much ground as someone going through a university degree. It is probably still not enough to settle down, but it is a great start and a great improvement over the situation twenty years ago.

This description leaves three groups in trouble, trying to sort out their “value-adding” ability: academia, (bio)statisticians and software makers. What are universities offering, that’s unique enough, to justify the time and money invested by individuals and governments? If you make software, What makes it special? For how long can you rely on tradition and inertia so people don’t switch to something else? What’s so special about your (bio) statistical training to justify having one “you” in the organization? Too many questions, so better I go to sleep.

P.S. Did I write “value-adding” ability? I must have been talking to the suits for too long… Next post I may end up writing “value-adding proposition”!

On the (statistical) road, workshops and R

Things have been a bit quiet at Quantum Forest during the last ten days. Last Monday (Sunday for most readers) I flew to Australia to attend a couple of one-day workshops; one on spatial analysis (in Sydney) and another one on modern applications of linear mixed models (in Wollongong). This will be followed by attending The International Biometric Society Australasian Region Conference in Kiama.

I would like to comment on the workshops to look for commonalities and differences. First, both workshops heavily relied on R, supporting the idea that if you want to reach a lot of people and get them using your ideas, R is pretty much the vehicle to do so. It is almost trivial to get people to install R and RStudio before the workshop so they are ready to go. “Almost” because you have to count on someone having a bizarre software configuration or draconian security policies for their computer.

The workshop on Spatial Analysis also used WinBUGS, which with all respect to the developers, is a clunky piece of software. Calling it from R or using JAGS from R seems to me a much more sensible way of using a Bayesian approach while maintaining access to the full power of R. The workshop on linear mixed models relied on asreml-R; if you haven’t tried it, please give it a go (it is a free license for academic/research use). There were applications on multi-stage experiments, composite samples and high dimensional data (molecular information). In addition, there was an initial session on optimal design of experiments.

In my opinion, the second workshop (modern applications…) was much more successful than the first one (spatial analysis…) for a few reasons:

  • One has to limit the material to cover in a one-day workshop; if you want to cover a lot consider three days so people can digest all the material.
  • One has to avoid the “split-personality” approach to presentations; having very-basic and super-hard but nothing in the middle is not a great idea (IMHO). Pick a starting point and gradually move people from there.
  • Limit the introduction of new software. One software per day seems to be a good rule of thumb.

Something bizarre (for me, at least) was the difference between the audiences. In Sydney the crowd was a lot younger, with many trainees in biostatistics coming mostly from health research. They had little exposure to R and seemed to come from a mostly SAS shop. The crowd in Wollongong had people with a lot of experience (OK, oldish) both in statistics and R. I was expecting young people to be more conversant in R.

Tomorrow we will drive down to Kiama, sort out registration and then go to the welcome BBQ. Funny thing is that this is my first statistics conference; as I mentioned in the About page of this blog, I am a simple forester. :-)

No one would ever conceive

I believe that no one who is familiar, either with mathematical advances in other fields, or with the range of special biological conditions to be considered, would ever conceive that everything could be summed up in a single mathematical formula, however complex.

— R.A. Fisher (1932) quoted in the preface to Foundations of Mathematical Genetics by A.W.F. Edwards (1976).

Do we need to deal with ‘big data’ in R?

David Smith at the Revolutions blog posted a nice presentation on “big data” (oh, how I dislike that term). It is a nice piece of work and the Revolution guys managed to process a large amount of records, starting with a download of 70GB and ending up with a series of linear regressions.

I’ve spent the last two weeks traveling (including a visit to the trial below) and finishing marking for the semester, which has somewhat affected my perception on dealing with large amounts of data. The thing is that dealing with hotel internet caps (100MB) or even with my lowly home connection monthly cap (5GB) does get one thinking… Would I spend several months of internet connection just downloading data so I could graph and plot some regression lines for 110 data points? Or does it make sense to run a linear regression with two predictors using 100 million records?

My basic question is why would I want to deal with all those 100 million records directly in R? Wouldn’t it make much more sense to reduce the data to a meaningful size using the original database, up there in the cloud, and download the reduced version to continue an in-depth analysis? There are packages to query external databases (ROracle, RMySQL, RODBC, …, pick your poison), we can sample to explore the dataset, etc.

We can deal with a rather large dataset in our laptop but is it the best that we can do to deal with the underlying modeling problem? Just wondering.

Gratuitous picture of generic Pinus radiata breeding trial in Northern Tasmania.

On “true” models

Before starting the description of the probability distributions, we want to impose on the reader the essential feature that a model is an interpretation of a real phenomenon that fits its characteristics to some degree of approximation rather than an explanation that would require the model to be “true”. In short, there is no such thing as a “true model”, even though some models are more appropriate than others!

— Jean-Michel Marin and Christian P. Robert in Bayesian Core: a practical approach to computational Bayesian statistics.

Teaching with R: the tools

I bought an Android phone, nothing fancy just my first foray in the smartphone world, which is a big change coming from the dumb phone world(*). Everything is different and I am back at being a newbie; this is what many students feel the same time they are exposed to R. However, and before getting into software, I find it useful to think of teaching from several points of view, considering that there are several user cases:

  1. A few of the students will get into statistics and heavy duty coding, they will be highly motivated and take several stats courses. It is in their best interest to learn R (and other software very well).
  2. Some, but not many, of the students will use statistics once in while. A point and click GUI may help to build remember commands.
  3. Most students will have to consume statistics, read reports, request information from colleagues and act as responsible, statistically literate citizens. I think that concepts (rather than software) are far the most relevant element to this group.

The first group requires access to all the power in R and, very likely, has at least a passing idea of coding in other languages. The second and third groups are occasional users, which will tend to forget the language and notation and most likely will need a system with menus.

At this point, some of the readers may be tempted to say that everyone (that is groups 1—3) should learn to write R code and they just need a good text editor (say Emacs). This may come as a surprise but normal people not only do not use text editors, they even don’t know what they are. You could be tempted to say that having a good editor would also let them write in LaTeX (or XeLaTeX), which is an excellent way to ‘future proof’ your documents. Please let me repeat this: normal people do not write in LaTeX, they use Word or something equivalent.

But, but. Yes, I know, we are not normal people.

What are the problems?

When working in my Ph.D. I had the ‘brilliant’ (a.k.a. masochistic) idea of using different languages for each part of my project: Fortran 90 (Lahey), ASReml, Python (ActiveState), Matlab and Mathematica. One thing that I experienced, was that working with a scripting language integrated with a good IDE (e.g. ActiveState Python or Matlab) was much more conducive to learning than a separate console and text editor. I still have fond memories of learning and using Python. This meandering description brings me back to what we should use for teaching.

Let’s be honest, the stock R IDE that one gets with the initial download is spartan if you are in OS X and plain sucky if you are in Windows. Given that working with the console plus a text editor (Emacs, Vim, Textmate, etc) is an uncomfortable learning experience (at least in my opinion) there is a nice niche for IDEs like RStudio, which integrate editor, data manager, graphs, etc.; particularly if they are cross-platform. Why is that RStudio is not included as the default R IDE? (Incidentally, I have never used Revolution R Productivity Environment—Windows only—that looks quite cool).

Today I am tempted to recommend moving the whole course to RStudio, which means installing it in an awful lot of computers at the university. One of the issues that stops me is that is introducing another layer of abstraction to R. We have the plain-vanilla console, then the normal installation and, on top, RStudio. On the other hand, we are already introducing an extra level with R commander.

At this point we reach the point-and-click GUI. The last two years we have used R Commander, which has helped, but I have never felt entirely comfortable with it. This year I had a chat with some students that used SPSS before and, after the initial shock, they seemed to cope with R Commander. In a previous post someone suggested Deducer, which I hope to check before the end of this year. I am always on the look out for a good and easy interface for students that fall in the second and third cases (see above). It would be nice to have a series of interfaces that look like Jeroen Ooms’s prototypes. Please let me know if you have any suggestions.

(*)This is not strictly true, as I had a Nokia E72, which was a dumb phone with a lot of buttons pretending to be a smartphone.

(**)This post should be read as me thinking aloud and reflecting today’s impressions, which are continually evolving. I am not yet truly comfortable with any GUI in the R world, and still feel that SPSS, Splus or Genstat (for example) provide a nicer flow on the GUI front.

Covariance structures

In most mixed linear model packages (e.g. asreml, lme4, nlme, etc) one needs to specify only the model equation (the bit that looks like y ~ factors...) when fitting simple models. We explicitly say nothing about the covariances that complete the model specification. This is because most linear mixed model packages assume that, in absence of any additional information, the covariance structure is the product of a scalar (a variance component) by a design matrix. For example, the residual covariance matrix in simple models is R = I σe2, or the additive genetic variance matrix is G = A σa2 (where A is the numerator relationship matrix), or the covariance matrix for a random effect f with incidence matrix Z is ZZ σf2.

However, there are several situations when analyses require a more complex covariance structure, usually a direct sum or a Kronecker product of two or more matrices. For example, an analysis of data from several sites might consider different error variances for each site, that is R = Σd Ri, where Σd represents a direct sum and Ri is the residual matrix for site i.

Other example of a more complex covariance structure is a multivariate analysis in a single site (so the same individual is assessed for two or more traits), where both the residual and additive genetic covariance matrices are constructed as the product of two matrices. For example, R = IR0, where I is an identity matrix of size number of observations, ⊗ is the Kronecker product (do not confuse with a plain matrix multiplication) and R0 is the error covariance matrix for the traits involved in the analysis. Similarly, G = AG0 where all the matrices are as previously defined and G0 is the additive covariance matrix for the traits.

Some structures are easier to understand (at least for me) if we express a covariance matrix (M) as the product of a correlation matrix (C) pre- and post-multiplied by a diagonal matrix (D) containing standard deviations for each of the traits (M = D C D). That is:

M = \left [  \begin{array}{cccc}  v_{11}& c_{12}& c_{13}& c_{14} \\  c_{21}& v_{22}& c_{23}& c_{24} \\  c_{31}& c_{32}& v_{33}& c_{34} \\  c_{41}& c_{42}& c_{43}& v_{44}  \end{array}   \right ]

C = \left [  \begin{array}{cccc}  1& r_{12}& r_{13}& r_{14} \\  r_{21}& 1& r_{23}& r_{24} \\  r_{31}& r_{32}& 1& r_{34} \\  r_{41}& r_{42}& r_{43}& 1  \end{array}   \right ]

D = \left [  \begin{array}{cccc}  s_{11}& 0& 0& 0 \\  0& s_{22}& 0& 0 \\  0& 0& s_{33}& 0 \\  0& 0& 0& s_{44}  \end{array}   \right ]

where the v are variances, the r correlations and the s standard deviations.

If we do not impose any restriction on M, apart from being positive definite (p.d.), we are talking about an unstructured matrix (us() in asreml-R parlance). Thus, M or C can take any value (as long as it is p.d.) as it is usual when analyzing multiple trait problems.

There are cases when the order of assessment or the spatial location of the experimental units create patterns of variation, which are reflected by the covariance matrix. For example, the breeding value of an individual i observed at time j (aij) is a function of genes involved in expression at time j – k (aij-k), plus the effect of genes acting in the new measurement (αj), which are considered independent of the past measurement aij = ρjk aij-k + αj, where ρjk is the additive genetic correlation between measures j and k.

Rather than using a different correlation for each pair of ages, it is possible to postulate mechanisms which model the correlations. For example, an autoregressive model (ar() in asreml-R lingo), where the correlation between measurements j and k is r|j-k|. In this model M = D CAR D, where CAR (for equally spaced assessments) is:

C_{AR} = \left [   \begin{array}{cccc}  1 & r^{|t_2-t_1|} & \ldots & r^{|t_m-t_1|}\\  r^{|t_2-t_1|} & 1 & \ldots & r^{|t_m-t_2|}\\  \vdots & \vdots & \ddots & \vdots \\  r^{|t_m-t_1|} & r^{|t_m-t_2|} & \ldots & 1  \end{array} \right ]

Assuming three different autocorrelation coefficients (0.95 solid line, 0.90 dashed line and 0.85 dotted line) we can get very different patterns with a few extra units of lag, as shown in the following graph:

A model including this structure will certainly be more parsimonious (economic on terms of number of parameters) than one using an unstructured approach. Looking at the previous pattern it is a lot easier to understand why they are called ‘structures’. A similar situation is considered in spatial analysis, where the ‘independent errors’ assumption of typical analyses is relaxed. A common spatial model will consider the presence of autocorrelated residuals in both directions (rows and columns). Here the level of autocorrelation will depend on distance between trees rather than on time. We can get an idea of how separable processes look like using this code:

# Separable row col autoregressive process
car2 = function(dim, rhor, rhoc) {
    M = diag(dim)
    rhor^(row(M) - 1) * rhoc^(col(M) - 1)
}

library(lattice)
levelplot(car2(20, 0.95, 0.85))

This correlation matrix can then be multiplied by a spatial residual variance to obtain the covariance and we can add up a spatially independent residual variance.

Much more detail on code notation for covariance structures can be found, for example, in the ASReml-R User Guide (PDF, chapter 4), for nlme in Pinheiro and Bates’s Mixed-effects models in S and S-plus (link to Google Books, chapter 5.3) and in Bates’s draft book for lme4 in chapter 4.

Maximum likelihood

This post is one of those ‘explain to myself how things work’ documents, which are not necessarily completely correct but are close enough to facilitate understanding.

Background

Let’s assume that we are working with a fairly simple linear model, where we only have a response variable (say tree stem diameter in cm). If we want to ‘guess’ the diameter for a tree (yi) our best bet is the average (μ) and we will have a residual (εi). The model equation then looks like:

y_i = \mu + \varepsilon_i

We want to estimate both the overall mean (μ) and the error variance (\sigma_{\varepsilon}^2), for which we could use least squares. However, we will use an alternative method (maximum likelihood) because that is the point of this post. A likelihood function expresses the probability of obtaining the observed sample from a population given a set of model parameters. Thus, it answers the question What is the probability of observing the current dataset, when I assume a given set of parameters for my linear model? To do so, we have to assume a distribution, say normal, for which the probability density function (p.d.f) is:

p.d.f. (y) = \frac{1} {\sqrt{2 \pi} \sigma} e^{-1/2 \frac{(y- \mu)^2} {\sigma^2}}

with n independent samples the likelihood (L) is:

L = \prod_{i=1}^n \frac{1} {\sqrt{2 \pi} \sigma} e^{-1/2 \frac{(y- \mu)^2} {\sigma^2}}

where ∏ is a multiplication operator, analogous to the summation operator ∑. Maximizing L or its natural logarithm (LogL) is equivalent, then:

LogL = \sum_{i=1}^n \log\left ( \frac{1} {\sqrt{2 \pi} \sigma} e^{-1/2 \frac{(y- \mu)^2} {\sigma^2}} \right )

LogL= \sum_{i=1}^n \left [ \log\left ( \frac{1} {\sqrt{2 \pi} \sigma} \right ) + \log\left ( e^{-1/2 \frac{(y- \mu)^2} {\sigma^2}}  \right ) \right ]

LogL= \sum_{i=1}^n \log(2 \pi)^{-1/2} + \sum_{i=1}^n \log\left (\frac{1} {\sigma}\right ) + \sum_{i=1}^n \left ( -1/2 \frac{(y- \mu)^2} {\sigma^2} \right )

LogL= -\frac{n} {2} \log(2 \pi) - n \log(\sigma) - \frac{1} {2 \sigma^2} \sum_{i=1}^n ( y - \mu)^2

Considering only μ, LogL is maximum when \sum_{i=1}^n ( y - \mu)^2 is a minimum, i.e. for:

\mu = \frac{\sum_{i=1}^n y_i}{n}

Now considering σ:

\frac{\delta LogL}{\delta \sigma} = -\frac{n}{\sigma} - (-2) \frac{1}{2} \sum_{i=1}^n (y - \mu)^2 \sigma^{-3}
\frac{\delta LogL}{\delta \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (y - \mu)^2

and setting the last expression equal to 0:

\frac{n}{\sigma} = \frac{1}{\sigma^3} \sum_{i=1}^n (y - \mu)^2
\sigma^2 = \frac{\sum_{i=1}^n (y - \mu)^2}{n}

We can obtain an estimate of the error variance (\hat{\sigma}, notice the hat) by replacing μ by our previous estimate:
\hat{\sigma}^2 = \frac{\sum_{i=1}^n (y - \bar{y})^2}{n}

which is biased, because the denominator is n rather than (n – 1) (the typical denominator for sample variance). This bias arises because maximum likelihood estimates do not take into account the loss of degrees of freedom when estimating fixed effects.

Playing in R with an example

We have data for stem diameters (in mm) for twelve 10 year-old radiata pine (Pinus radiata D. Don) trees:

diams = c(150, 124, 100, 107, 170, 144,
          113, 108, 92, 129, 123, 118)

# Obtain typical mean and standard deviation
mean(diams)

[1] 123.1667

sd(diams)
[1] 22.38438

# A simple function to calculate likelihood.
# Values quickly become very small
like = function(data, mu, sigma) {
  like = 1
  for(obs in data){
    like = like * 1/(sqrt(2*pi)*sigma) *
                     exp(-1/2 * (obs - mu)^2/(sigma^2))
  }
  return(like)
}

# For example, likelihood of data if mu=120 and sigma=20
like(diams, 120, 20)
[1] 3.476384e-24

# A simple function to calculate log-likelihood.
# Values will be easier to manage
loglike = function(data, mu, sigma) {
  loglike = 0
    for(obs in data){
      loglike = loglike +
                log(1/(sqrt(2*pi)*sigma) *
                       exp(-1/2 * (obs - mu)^2/(sigma^2)))
    }
    return(loglike)
}

# Example, log-likelihood of data if mu=120 and sigma=20
loglike(diams, 122, 20)
[1] -53.88605

# Let's try some combinations of parameters and
# plot the results
params = expand.grid(mu = seq(50, 200, 1),
                     sigma = seq(10, 40, 1))
params$logL = with(params, loglike(diams, mu, sigma))
summary(params)

library(lattice)
contourplot(logL ~ mu*sigma, data = params, cuts = 20)

It is difficult to see the maximum likelihood in this plot, so we will zoom-in by generating a smaller grid around the typical mean and standard deviation estimates.

# Zooming in
params = expand.grid(mu = seq(120, 126, 0.01),
                     sigma = seq(18, 25, 0.1))
params$logL = with(params, loglike(diams, mu, sigma))
summary(params)

contourplot(logL ~ mu*sigma, data = params, cuts = 10)

We can now check what would be actual results using any of the functions to fit models using maximum likelihood, for example gls:

library(nlme)
m1 = gls(diams ~ 1, method='ML')
summary(m1)

Generalized least squares fit by maximum likelihood
  Model: diams ~ 1
  Data: NULL
       AIC      BIC    logLik
  111.6111 112.5809 -53.80556

Coefficients:
               Value Std.Error  t-value p-value
(Intercept) 123.1667  6.461815 19.06069       0

Residual standard error: 21.43142
Degrees of freedom: 12 total; 11 residual

In real life, software will use an iterative process to find the combination of parameters that maximizes the log-likelihood value. If we go back to our function and use loglike(diams, 123.1667, 21.43142) we will obtain -53.80556; exactly the same value calculated by gls. In addition, we can see that the estimated standard deviation (21.43) is slightly lower than the one calculated by the function sd() (22.38), because our biased estimate divides the sum of squared deviations by n rather than by n-1.

P.S. The code for the likelihood and log-likelihood functions is far from being optimal (the loops could be vectorized). However, the loops are easier to understand for many people.