This post is one of those ‘explain to myself how things work’ documents, which are not necessarily completely correct but are close enough to facilitate understanding.

## Background

Let’s assume that we are working with a fairly simple linear model, where we only have a response variable (say tree stem diameter in cm). If we want to ‘guess’ the diameter for a tree (y_{i}) our best bet is the average (μ) and we will have a residual (ε_{i}). The model equation then looks like:

\(y_i = \mu + \varepsilon_i \)

We want to estimate both the overall mean (μ) and the error variance (\(\sigma_{\varepsilon}^2\)), for which we could use least squares. However, we will use an alternative method (maximum likelihood) because that is the point of this post. A likelihood function expresses the probability of obtaining the observed sample from a population given a set of model parameters. Thus, it answers the question *What is the probability of observing the current dataset, when I assume a given set of parameters for my linear model?* To do so, we have to assume a distribution, say normal, for which the probability density function (p.d.f) is:

\(p.d.f. (y) = \frac{1} {\sqrt{2 \pi} \sigma} e^{-1/2 \frac{(y- \mu)^2} {\sigma^2}} \)

with n independent samples the likelihood (*L*) is:

\(L = \prod_{i=1}^n \frac{1} {\sqrt{2 \pi} \sigma} e^{-1/2 \frac{(y- \mu)^2} {\sigma^2}}\)

where ∏ is a multiplication operator, analogous to the summation operator ∑. Maximizing *L* or its natural logarithm (*LogL*) is equivalent, then:

\(LogL = \sum_{i=1}^n \log\left ( \frac{1} {\sqrt{2 \pi} \sigma} e^{-1/2 \frac{(y- \mu)^2} {\sigma^2}} \right ) \)

\(LogL= \sum_{i=1}^n \left [ \log\left ( \frac{1} {\sqrt{2 \pi} \sigma} \right ) + \log\left ( e^{-1/2 \frac{(y- \mu)^2} {\sigma^2}} \right ) \right ]\)

\(LogL= \sum_{i=1}^n \log(2 \pi)^{-1/2} + \sum_{i=1}^n \log\left (\frac{1} {\sigma}\right ) + \sum_{i=1}^n \left ( -1/2 \frac{(y- \mu)^2} {\sigma^2} \right )\)

\(LogL= -\frac{n} {2} \log(2 \pi) – n \log(\sigma) – \frac{1} {2 \sigma^2} \sum_{i=1}^n ( y – \mu)^2\)

Considering only μ, *LogL* is maximum when \(\sum_{i=1}^n ( y – \mu)^2\) is a minimum, i.e. for:

\(\mu = \frac{\sum_{i=1}^n y_i}{n}\)

Now considering σ:

\(\frac{\delta LogL}{\delta \sigma} = -\frac{n}{\sigma} – (-2) \frac{1}{2} \sum_{i=1}^n (y – \mu)^2 \sigma^{-3}\)

\(\frac{\delta LogL}{\delta \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (y – \mu)^2\)

and setting the last expression equal to 0:

\(\frac{n}{\sigma} = \frac{1}{\sigma^3} \sum_{i=1}^n (y – \mu)^2\)

\(\sigma^2 = \frac{\sum_{i=1}^n (y – \mu)^2}{n}\)

We can obtain an estimate of the error variance (\(\hat{\sigma}\), notice the hat) by replacing μ by our previous estimate:

\(\hat{\sigma}^2 = \frac{\sum_{i=1}^n (y – \bar{y})^2}{n}\)

which is biased, because the denominator is n rather than (n – 1) (the typical denominator for sample variance). This bias arises because maximum likelihood estimates do not take into account the loss of degrees of freedom when estimating fixed effects.

## Playing in R with an example

We have data for stem diameters (in mm) for twelve 10 year-old radiata pine (*Pinus radiata* D. Don) trees:

[sourcecode lang=”r”]

diams = c(150, 124, 100, 107, 170, 144,

113, 108, 92, 129, 123, 118)

# Obtain typical mean and standard deviation

mean(diams)

[1] 123.1667

sd(diams)

[1] 22.38438

# A simple function to calculate likelihood.

# Values quickly become very small

like = function(data, mu, sigma) {

like = 1

for(obs in data){

like = like * 1/(sqrt(2*pi)*sigma) *

exp(-1/2 * (obs – mu)^2/(sigma^2))

}

return(like)

}

# For example, likelihood of data if mu=120 and sigma=20

like(diams, 120, 20)

[1] 3.476384e-24

# A simple function to calculate log-likelihood.

# Values will be easier to manage

loglike = function(data, mu, sigma) {

loglike = 0

for(obs in data){

loglike = loglike +

log(1/(sqrt(2*pi)*sigma) *

exp(-1/2 * (obs – mu)^2/(sigma^2)))

}

return(loglike)

}

# Example, log-likelihood of data if mu=120 and sigma=20

loglike(diams, 122, 20)

[1] -53.88605

# Let’s try some combinations of parameters and

# plot the results

params = expand.grid(mu = seq(50, 200, 1),

sigma = seq(10, 40, 1))

params$logL = with(params, loglike(diams, mu, sigma))

summary(params)

library(lattice)

contourplot(logL ~ mu*sigma, data = params, cuts = 20)

[/sourcecode]

It is difficult to see the maximum likelihood in this plot, so we will zoom-in by generating a smaller grid around the typical mean and standard deviation estimates.

[sourcecode lang=”r”]

# Zooming in

params = expand.grid(mu = seq(120, 126, 0.01),

sigma = seq(18, 25, 0.1))

params$logL = with(params, loglike(diams, mu, sigma))

summary(params)

contourplot(logL ~ mu*sigma, data = params, cuts = 10)

[/sourcecode]

We can now check what would be actual results using any of the functions to fit models using maximum likelihood, for example `gls`

:

[sourcecode lang=”r”]

library(nlme)

m1 = gls(diams ~ 1, method=’ML’)

summary(m1)

Generalized least squares fit by maximum likelihood

Model: diams ~ 1

Data: NULL

AIC BIC logLik

111.6111 112.5809 -53.80556

Coefficients:

Value Std.Error t-value p-value

(Intercept) 123.1667 6.461815 19.06069 0

Residual standard error: 21.43142

Degrees of freedom: 12 total; 11 residual

[/sourcecode]

In real life, software will use an iterative process to find the combination of parameters that maximizes the log-likelihood value. If we go back to our function and use `loglike(diams, 123.1667, 21.43142)`

we will obtain -53.80556; exactly the same value calculated by `gls`

. In addition, we can see that the estimated standard deviation (21.43) is slightly lower than the one calculated by the function `sd()`

(22.38), because our biased estimate divides the sum of squared deviations by n rather than by n-1.

P.S. The code for the likelihood and log-likelihood functions is far from being optimal (the loops could be vectorized). However, the loops are easier to understand for many people.