This post is one of those ‘explain to myself how things work’ documents, which are not necessarily completely correct but are close enough to facilitate understanding.

## Background

Let’s assume that we are working with a fairly simple linear model, where we only have a response variable (say tree stem diameter in cm). If we want to ‘guess’ the diameter for a tree (y_{i}) our best bet is the average (μ) and we will have a residual (ε_{i}). The model equation then looks like:

\(y_i = mu + varepsilon_i \)

We want to estimate both the overall mean (μ) and the error variance (\(sigma_{varepsilon}^2\)), for which we could use least squares. However, we will use an alternative method (maximum likelihood) because that is the point of this post. A likelihood function expresses the probability of obtaining the observed sample from a population given a set of model parameters. Thus, it answers the question *What is the probability of observing the current dataset, when I assume a given set of parameters for my linear model?* To do so, we have to assume a distribution, say normal, for which the probability density function (p.d.f) is:

\(p.d.f. (y) = frac{1} {sqrt{2 pi} sigma} e^{-1/2 frac{(y- mu)^2} {sigma^2}} \)

with n independent samples the likelihood (*L*) is:

\(L = prod_{i=1}^n frac{1} {sqrt{2 pi} sigma} e^{-1/2 frac{(y- mu)^2} {sigma^2}}\)

where ∏ is a multiplication operator, analogous to the summation operator ∑. Maximizing *L* or its natural logarithm (*LogL*) is equivalent, then:

$latex LogL = sum_{i=1}^n logleft ( frac{1} {sqrt{2 pi} sigma} e^{-1/2 frac{(y- mu)^2} {sigma^2}}

ight ) $

$latex LogL= sum_{i=1}^n left [ logleft ( frac{1} {sqrt{2 pi} sigma}

ight ) + logleft ( e^{-1/2 frac{(y- mu)^2} {sigma^2}}

ight )

ight ]$

$latex LogL= sum_{i=1}^n log(2 pi)^{-1/2} + sum_{i=1}^n logleft (frac{1} {sigma}

ight ) + sum_{i=1}^n left ( -1/2 frac{(y- mu)^2} {sigma^2}

ight )$

\(LogL= -frac{n} {2} log(2 pi) – n log(sigma) – frac{1} {2 sigma^2} sum_{i=1}^n ( y – mu)^2\)

Considering only μ, *LogL* is maximum when \(sum_{i=1}^n ( y – mu)^2\) is a minimum, i.e. for:

\(mu = frac{sum_{i=1}^n y_i}{n}\)

Now considering σ:

\(frac{delta LogL}{delta sigma} = -frac{n}{sigma} – (-2) frac{1}{2} sum_{i=1}^n (y – mu)^2 sigma^{-3}\)

\(frac{delta LogL}{delta sigma} = -frac{n}{sigma} + frac{1}{sigma^3} sum_{i=1}^n (y – mu)^2\)

and setting the last expression equal to 0:

\(frac{n}{sigma} = frac{1}{sigma^3} sum_{i=1}^n (y – mu)^2\)

\(sigma^2 = frac{sum_{i=1}^n (y – mu)^2}{n}\)

We can obtain an estimate of the error variance (\(hat{sigma}\), notice the hat) by replacing μ by our previous estimate:

\(hat{sigma}^2 = frac{sum_{i=1}^n (y – ar{y})^2}{n}\)

which is biased, because the denominator is n rather than (n – 1) (the typical denominator for sample variance). This bias arises because maximum likelihood estimates do not take into account the loss of degrees of freedom when estimating fixed effects.

## Playing in R with an example

We have data for stem diameters (in mm) for twelve 10 year-old radiata pine (*Pinus radiata* D. Don) trees:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 |
diams = c(150, 124, 100, 107, 170, 144, 113, 108, 92, 129, 123, 118) # Obtain typical mean and standard deviation mean(diams) [1] 123.1667 sd(diams) [1] 22.38438 # A simple function to calculate likelihood. # Values quickly become very small like = function(data, mu, sigma) { like = 1 for(obs in data){ like = like * 1/(sqrt(2*pi)*sigma) * exp(-1/2 * (obs - mu)^2/(sigma^2)) } return(like) } # For example, likelihood of data if mu=120 and sigma=20 like(diams, 120, 20) [1] 3.476384e-24 # A simple function to calculate log-likelihood. # Values will be easier to manage loglike = function(data, mu, sigma) { loglike = 0 for(obs in data){ loglike = loglike + log(1/(sqrt(2*pi)*sigma) * exp(-1/2 * (obs - mu)^2/(sigma^2))) } return(loglike) } # Example, log-likelihood of data if mu=120 and sigma=20 loglike(diams, 122, 20) [1] -53.88605 # Let's try some combinations of parameters and # plot the results params = expand.grid(mu = seq(50, 200, 1), sigma = seq(10, 40, 1)) params$logL = with(params, loglike(diams, mu, sigma)) summary(params) library(lattice) contourplot(logL ~ mu*sigma, data = params, cuts = 20) |

It is difficult to see the maximum likelihood in this plot, so we will zoom-in by generating a smaller grid around the typical mean and standard deviation estimates.

1 2 3 4 5 6 7 |
# Zooming in params = expand.grid(mu = seq(120, 126, 0.01), sigma = seq(18, 25, 0.1)) params$logL = with(params, loglike(diams, mu, sigma)) summary(params) contourplot(logL ~ mu*sigma, data = params, cuts = 10) |

We can now check what would be actual results using any of the functions to fit models using maximum likelihood, for example `gls`

:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
library(nlme) m1 = gls(diams ~ 1, method='ML') summary(m1) Generalized least squares fit by maximum likelihood Model: diams ~ 1 Data: NULL AIC BIC logLik 111.6111 112.5809 -53.80556 Coefficients: Value Std.Error t-value p-value (Intercept) 123.1667 6.461815 19.06069 0 Residual standard error: 21.43142 Degrees of freedom: 12 total; 11 residual |

In real life, software will use an iterative process to find the combination of parameters that maximizes the log-likelihood value. If we go back to our function and use `loglike(diams, 123.1667, 21.43142)`

we will obtain -53.80556; exactly the same value calculated by `gls`

. In addition, we can see that the estimated standard deviation (21.43) is slightly lower than the one calculated by the function `sd()`

(22.38), because our biased estimate divides the sum of squared deviations by n rather than by n-1.

P.S. The code for the likelihood and log-likelihood functions is far from being optimal (the loops could be vectorized). However, the loops are easier to understand for many people.