This post is one of those ‘explain to myself how things work’ documents, which are not necessarily completely correct but are close enough to facilitate understanding.


Let’s assume that we are working with a fairly simple linear model, where we only have a response variable (say tree stem diameter in cm). If we want to ‘guess’ the diameter for a tree (yi) our best bet is the average (μ) and we will have a residual (εi). The model equation then looks like:

\(y_i = \mu + \varepsilon_i\)

We want to estimate both the overall mean (μ) and the error variance (\(\sigma_{\varepsilon}^2\)), for which we could use least squares. However, we will use an alternative method (maximum likelihood) because that is the point of this post. A likelihood function expresses the probability of obtaining the observed sample from a population given a set of model parameters. Thus, it answers the question: what is the probability of observing the current dataset, when I assume a given set of parameters for my linear model? To do so, we have to assume a distribution, say normal, for which the probability density function (p.d.f.) is:

\(p.d.f.(y) = \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2} \frac{(y - \mu)^2}{\sigma^2}}\)

With n independent samples, the likelihood (L) is:

\(L = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2} \frac{(y_i - \mu)^2}{\sigma^2}}\)

where ∏ is a multiplication operator, analogous to the summation operator ∑. Maximizing L or its natural logarithm (LogL) is equivalent, then:

\(LogL = \sum_{i=1}^n \log \left( \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2} \frac{(y_i - \mu)^2}{\sigma^2}} \right)\)

\(LogL = \sum_{i=1}^n \left[ \log \left( \frac{1}{\sqrt{2 \pi} \sigma} \right) + \log \left( e^{-\frac{1}{2} \frac{(y_i - \mu)^2}{\sigma^2}} \right) \right]\)

\(LogL = \sum_{i=1}^n \log (2 \pi)^{-1/2} + \sum_{i=1}^n \log \left( \frac{1}{\sigma} \right) + \sum_{i=1}^n \left( -\frac{1}{2} \frac{(y_i - \mu)^2}{\sigma^2} \right)\)

\(LogL = -\frac{n}{2} \log(2 \pi) - n \log(\sigma) - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu)^2\)

Considering only μ, LogL is maximum when \(\sum_{i=1}^n (y_i - \mu)^2\) is a minimum, i.e. for:

\(\mu = \frac{\sum_{i=1}^n y_i}{n}\)
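This can be verified by differentiating LogL with respect to μ and setting the derivative to zero:

\(\frac{\delta LogL}{\delta \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \mu) = 0\)

so that \(\sum_{i=1}^n y_i = n \mu\), i.e. the sample mean.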

Now considering σ:

\(\frac{\delta LogL}{\delta \sigma} = -\frac{n}{\sigma} - (-2) \frac{1}{2} \sum_{i=1}^n (y_i - \mu)^2 \sigma^{-3}\)
\(\frac{\delta LogL}{\delta \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (y_i - \mu)^2\)

and setting the last expression equal to 0:

\(\frac{n}{\sigma} = \frac{1}{\sigma^3} \sum_{i=1}^n (y_i - \mu)^2\)
\(\sigma^2 = \frac{\sum_{i=1}^n (y_i - \mu)^2}{n}\)

We can obtain an estimate of the error variance (\(\hat{\sigma}^2\), notice the hat) by replacing μ with our previous estimate:
\(\hat{\sigma}^2 = \frac{\sum_{i=1}^n (y_i - \bar{y})^2}{n}\)

which is biased, because the denominator is n rather than (n – 1) (the typical denominator for sample variance). This bias arises because maximum likelihood estimates do not take into account the loss of degrees of freedom when estimating fixed effects.

Playing in R with an example

We have data for stem diameters (in mm) for twelve 10-year-old radiata pine (Pinus radiata D. Don) trees:
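The original data and code listing appears to have been lost in formatting. A sketch in the spirit of the post, using a hypothetical vector of twelve diameters chosen to match the summary statistics quoted later (mean 123.17 mm, sd 22.38 mm), with loop-based likelihood and log-likelihood functions:

```r
# Hypothetical stem diameters (mm) for twelve trees, chosen so that
# mean(diams) is ~123.17 and sd(diams) is ~22.38, matching the text
diams <- c(150, 124, 100, 107, 170, 144, 113, 108, 92, 129, 123, 118)

# Likelihood: product of normal densities over the sample
like <- function(data, mu, sigma) {
  L <- 1
  for (y in data) {
    L <- L * (1 / (sqrt(2 * pi) * sigma)) * exp(-0.5 * (y - mu)^2 / sigma^2)
  }
  L
}

# Log-likelihood: sum of log densities (numerically safer than like())
loglike <- function(data, mu, sigma) {
  logL <- 0
  for (y in data) {
    logL <- logL + log(1 / (sqrt(2 * pi) * sigma)) - 0.5 * (y - mu)^2 / sigma^2
  }
  logL
}

mean(diams)  # ~123.17
sd(diams)    # ~22.38 (uses the n - 1 denominator)
```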

It is difficult to see the maximum likelihood in this plot, so we will zoom-in by generating a smaller grid around the typical mean and standard deviation estimates.
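A sketch of that zoom-in, assuming the hypothetical `diams` vector from above: evaluate the log-likelihood over a fine grid of candidate means and standard deviations and locate the combination with the highest value.

```r
# Hypothetical data and log-likelihood, repeated so this snippet runs standalone
diams <- c(150, 124, 100, 107, 170, 144, 113, 108, 92, 129, 123, 118)
loglike <- function(data, mu, sigma) {
  sum(log(1 / (sqrt(2 * pi) * sigma)) - 0.5 * (data - mu)^2 / sigma^2)
}

# Fine grid around the usual mean and standard deviation estimates
mu.grid    <- seq(120, 126, length.out = 100)
sigma.grid <- seq(19, 24, length.out = 100)
logL <- outer(mu.grid, sigma.grid, Vectorize(function(m, s) loglike(diams, m, s)))

# Grid coordinates of the maximum log-likelihood
best <- which(logL == max(logL), arr.ind = TRUE)
c(mu = mu.grid[best[1]], sigma = sigma.grid[best[2]])

# contour(mu.grid, sigma.grid, logL)  # visualize the log-likelihood surface
```

The maximum over the grid should sit close to the sample mean and to the (biased) maximum likelihood estimate of the standard deviation.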

We can now check what the actual results would be using any of the functions that fit models by maximum likelihood, for example gls:
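A sketch of that check with gls() from the nlme package, again assuming the hypothetical `diams` vector; an intercept-only model, so the estimated intercept is the mean. Note `method = 'ML'`, since the gls default (REML) would give a different log-likelihood:

```r
library(nlme)  # provides gls()

# Hypothetical data, repeated so this snippet runs standalone
diams <- c(150, 124, 100, 107, 170, 144, 113, 108, 92, 129, 123, 118)

# Intercept-only model fitted by maximum likelihood
m <- gls(diams ~ 1, method = 'ML')
coef(m)    # intercept ~123.17, the sample mean
m$sigma    # ~21.43, the biased ML estimate of the standard deviation
logLik(m)  # log-likelihood at the estimates
```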

In real life, software will use an iterative process to find the combination of parameters that maximizes the log-likelihood value. If we go back to our function and use loglike(diams, 123.1667, 21.43142) we will obtain -53.80556; exactly the same value calculated by gls. In addition, we can see that the estimated standard deviation (21.43) is slightly lower than the one calculated by the function sd() (22.38), because our biased estimate divides the sum of squared deviations by n rather than by n-1.

P.S. The code for the likelihood and log-likelihood functions is far from optimal (the loops could be vectorized). However, the loops are easier to understand for many people.
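For completeness, a vectorized sketch of the same log-likelihood using dnorm(), which evaluates all the densities at once (again using the hypothetical `diams` vector, not the post's original code):

```r
# Hypothetical data, repeated so this snippet runs standalone
diams <- c(150, 124, 100, 107, 170, 144, 113, 108, 92, 129, 123, 118)

# Vectorized log-likelihood: sum of log normal densities
loglike.vec <- function(data, mu, sigma) {
  sum(dnorm(data, mean = mu, sd = sigma, log = TRUE))
}

loglike.vec(diams, 123.1667, 21.43142)  # ~ -53.80556
```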