## How to simulate multilevel/longitudinal data

I was recently talking with my friend Rebecca about simulating multilevel data, and she asked me if I would show her some examples. It occurred to me that many of you might also like to see some examples, so I decided to post them to the Stata Blog.

### Introduction

We simulate data all the time at StataCorp and for a variety of reasons.

One reason is that real datasets that include the features we would like are often difficult to find. We prefer to use real datasets in the manual examples, but sometimes that isn’t feasible and so we create simulated datasets.

We also simulate data to check the coverage probabilities of new estimators in Stata. Sometimes the formulae published in books and papers contain typographical errors. Sometimes the asymptotic properties of estimators don’t hold under certain conditions. And every once in a while, we make coding mistakes. We run simulations during development to verify that a 95% confidence interval really is a 95% confidence interval.

Simulated data can also come in handy for presentations, teaching purposes, and calculating statistical power using simulations for complex study designs.

And, simulating data is just plain fun once you get the hang of it.

Some of you will recall Vince Wiggins’s blog entry from 2011 entitled “Multilevel random effects in xtmixed and sem — the long and wide of it” in which he simulated a three-level dataset. I’m going to elaborate on how Vince simulated multilevel data, and then I’ll show you some useful variations. Specifically, I’m going to talk about:

- How to simulate single-level data
- How to simulate two- and three-level data
- How to simulate three-level data with covariates
- How to simulate longitudinal data with random slopes
- How to simulate longitudinal data with structured errors

### How to simulate single-level data

Let’s begin by simulating a trivially simple, single-level dataset that has the form

\[y_i = 70 + e_i\]

We will assume that e is normally distributed with mean zero and variance \(\sigma^2\).

We’d want to simulate 500 observations, so let’s begin by clearing Stata’s memory and setting the number of observations to 500.

. clear . set obs 500

Next, let’s create a variable named **e** that contains pseudorandom normally distributed data with mean zero and standard deviation 5:

. generate e = rnormal(0,5)

The variable **e** is our error term, so we can create an outcome variable **y** by typing

. generate y = 70 + e . list y e in 1/5 +----------------------+ | y e | |----------------------| 1. | 78.83927 8.83927 | 2. | 69.97774 -.0222647 | 3. | 69.80065 -.1993514 | 4. | 68.11398 -1.88602 | 5. | 63.08952 -6.910483 | +----------------------+

We can fit a linear regression for the variable **y** to determine whether our parameter estimates are reasonably close to the parameters we specified when we simulated our dataset:

. regress y Source | SS df MS Number of obs = 500 -------------+------------------------------ F( 0, 499) = 0.00 Model | 0 0 . Prob > F = . Residual | 12188.8118 499 24.4264766 R-squared = 0.0000 -------------+------------------------------ Adj R-squared = 0.0000 Total | 12188.8118 499 24.4264766 Root MSE = 4.9423 ------------------------------------------------------------------------------ y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- _cons | 69.89768 .221027 316.24 0.000 69.46342 70.33194 ------------------------------------------------------------------------------

The estimate of **_cons** is 69.9, which is very close to 70, and the Root MSE of 4.9 is equally close to the error’s standard deviation of 5. The parameter estimates will not be exactly equal to the underlying parameters we specified when we created the data because we introduced randomness with the **rnormal()** function.

This simple example is just to get us started before we work with multilevel data. For familiarity, let’s fit the same model with the **mixed** command that we will be using later:

. mixed y, stddev Mixed-effects ML regression Number of obs = 500 Wald chi2(0) = . Log likelihood = -1507.8857 Prob > chi2 = . ------------------------------------------------------------------------------ y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _cons | 69.89768 .2208059 316.56 0.000 69.46491 70.33045 ------------------------------------------------------------------------------ ------------------------------------------------------------------------------ Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval] -----------------------------+------------------------------------------------ sd(Residual) | 4.93737 .1561334 4.640645 5.253068 ------------------------------------------------------------------------------

The output is organized with the parameter estimates for the fixed part in the top table and the estimated standard deviations for the random effects in the bottom table. Just as previously, the estimate of **_cons** is 69.9, and the estimate of the standard deviation of the residuals is 4.9.

Okay. That really was trivial, wasn’t it? Simulating two- and three-level data is almost as easy.

### How to simulate two- and three-level data

I posted a blog entry last year titled “Multilevel linear models in Stata, part 1: Components of variance“. In that posting, I showed a diagram for a residual of a three-level model.

The equation for the variance-components model I fit had the form

\[y_{ijk} = mu + u_i.. + u_{ij.} + e_{ijk}\]

This model had three residuals, whereas the one-level model we just fit above had only one.

This time, let’s start with a two-level model. Let’s simulate Read more…