Quantum Forest

notes in a shoebox

Simulating data following a given covariance structure

Every year there are at least a couple of occasions when I have to simulate multivariate data that follow a given covariance matrix. For example, let’s say that we want to illustrate the effect of collinearity when fitting multiple linear regressions: we need one variable (the response) that is correlated with a number of explanatory variables, which in turn have different correlations with each other.

There is a matrix operation called Cholesky decomposition, sort of equivalent to taking a square root with scalars, that is useful to produce correlated data. If we have a (positive definite) covariance matrix M, the Cholesky decomposition is a lower triangular matrix L such that M = L L'. How does this connect to our simulated data? Let’s assume that we generate a vector z of independent, normally distributed random numbers with mean zero and variance one (with length equal to the dimension of M); we can then create a realization of our multivariate distribution using the product L z.
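As a minimal sketch of the decomposition in R: the 3×3 matrix below is made up for illustration, and the only thing to watch is that base R's chol() returns the upper triangular factor, so it has to be transposed to get L.

```r
# Illustrative covariance matrix (values are assumptions for this sketch)
M <- matrix(c(1.0, 0.6, 0.3,
              0.6, 1.0, 0.5,
              0.3, 0.5, 1.0), nrow = 3, byrow = TRUE)

# chol() returns the upper triangular factor U with M = U'U,
# so the lower triangular factor L is its transpose
L <- t(chol(M))

# One vector of independent standard normal values, as long as M is wide
set.seed(42)  # arbitrary seed, only for reproducibility
z <- rnorm(nrow(M))

# One realization from a distribution with covariance matrix M
x <- L %*% z

# Checks: L is lower triangular and L L' reproduces M
all(L[upper.tri(L)] == 0)
all.equal(L %*% t(L), M)
```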

The reason why this works is that Variance(L z) = L Variance(z) L', as L is just a constant matrix. The variance of z is the identity matrix I; remember that the random numbers have variance one and are independently distributed. Therefore Variance(L z) = L I L' = L L' = M so, in fact, we are producing random data that follow the desired covariance matrix.
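The argument can also be checked numerically. The sketch below assumes a made-up 2×2 covariance matrix and a large number of draws, so that the sample covariances land close to their theoretical values.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Illustrative covariance matrix (values are assumptions for this sketch)
M <- matrix(c(4.0, 1.2,
              1.2, 1.0), nrow = 2, byrow = TRUE)
L <- t(chol(M))  # lower triangular factor, M = L L'

# 100,000 independent standard normal vectors, one per column
Z <- matrix(rnorm(2 * 1e5), nrow = 2)
X <- L %*% Z

cov(t(Z))  # approximately the identity matrix
cov(t(X))  # approximately M
```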

As an example, let’s simulate 100 observations with 4 variables. Because we want to simulate 100 realizations, rather than a single one, it pays to generate a matrix of random numbers with as many rows as variables to simulate and as many columns as observations to simulate.
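A sketch of that simulation in R could look like the following. The covariance values and the names resp and pred3 are assumptions chosen for illustration (only pred1 and pred2 are named in the text): all variances are one, pred1 and pred2 are strongly correlated, and pred3 is nearly unrelated to both.

```r
set.seed(2011)  # arbitrary seed, only for reproducibility

# Illustrative covariance matrix (made-up values, not taken from the post)
vars <- c("resp", "pred1", "pred2", "pred3")
M <- matrix(c(1.0, 0.6, 0.6, 0.6,
              0.6, 1.0, 0.9, 0.1,
              0.6, 0.9, 1.0, 0.1,
              0.6, 0.1, 0.1, 1.0),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

L <- t(chol(M))  # lower triangular factor, M = L L'
nvars <- nrow(M)
nobs <- 100

# Standard normal matrix: one row per variable, one column per observation
Z <- matrix(rnorm(nvars * nobs), nrow = nvars, ncol = nobs)

# Each column of L Z is one realization; transpose so observations are rows
sim <- as.data.frame(t(L %*% Z))
names(sim) <- vars

# Sample covariance: close to M, but not identical with only 100 observations
round(cov(sim), 2)
```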

Now we can use the simulated data to learn something about the effects of collinearity when fitting multiple linear regressions. We will first fit two models using two predictors with low correlation between them, and then fit a third model with three predictors where pred1 and pred2 are highly correlated with each other.
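The three fits can be sketched as below. The data are regenerated so the block runs on its own; the covariance values, the variable names resp and pred3, and the model names m1 to m3 are assumptions for illustration, not the ones behind the results discussed in this post.

```r
set.seed(2011)  # arbitrary seed, only for reproducibility

# Illustrative covariance matrix: pred1 and pred2 correlated at 0.9,
# pred3 only weakly correlated with either of them
vars <- c("resp", "pred1", "pred2", "pred3")
M <- matrix(c(1.0, 0.6, 0.6, 0.6,
              0.6, 1.0, 0.9, 0.1,
              0.6, 0.9, 1.0, 0.1,
              0.6, 0.1, 0.1, 1.0),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# 100 observations (rows) of the 4 variables (columns)
Z <- matrix(rnorm(4 * 100), nrow = 4)
sim <- as.data.frame(t(t(chol(M)) %*% Z))
names(sim) <- vars

# Models 1 and 2: two predictors with low correlation between them
m1 <- lm(resp ~ pred1 + pred3, data = sim)
m2 <- lm(resp ~ pred2 + pred3, data = sim)

# Model 3: adds the second of the two highly correlated predictors
m3 <- lm(resp ~ pred1 + pred2 + pred3, data = sim)

# Helper to pull the coefficient standard errors from a fitted model
se <- function(m) coef(summary(m))[, "Std. Error"]
se(m1)
se(m2)
se(m3)

# Adjusted R-squared for each model, to compare fit
sapply(list(m1 = m1, m2 = m2, m3 = m3),
       function(m) summary(m)$adj.r.squared)
```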

In my example it is possible to see the huge increase in the standard errors for pred1 and pred2 when we use both highly correlated explanatory variables in model 3. In addition, model fit does not improve for model 3.


  1. Unfortunately there are numerous problems when you use Cholesky decomposition for large matrices. You have a nice example of a 4×4 matrix. When you try to do the same for, say, 1000×1000 images, very often it takes an hour and the matrix is not positive definite because of rounding. There are some methods to escape the computational difficulties, but they often do not work well. As usual, nice theory needs some changes to apply to real data :-)

    • Luis

      2011/10/17 at 1:52 pm

      Hi Andiy, At least when working with the analysis of experiments we tend not to use such high dimensional matrices, particularly because we want to analyze the problem using multivariate mixed models. At the analysis level, anything beyond, say, five traits will make life miserable to achieve convergence when estimating a positive definite matrix. From an analysis point of view we would probably move to a factor analytic decomposition.

  2. Hi Luis,

    I am looking for this kind of R code. I want to generate data following a given covariance structure. I wonder whether I can use this R code for my research. If so, what is the correct citation for the code? Thanks a million!!


    • Luis

      2012/10/14 at 7:40 am

      Sure you can use it; that’s why I put it here! Correct citation depends on the standard you are using; see examples here. If you want to reference a book, search Google Books for Cholesky and simulation.

© 2015 Quantum Forest