Home > Statistics > How to simulate multilevel/longitudinal data

How to simulate multilevel/longitudinal data

I was recently talking with my friend Rebecca about simulating multilevel data, and she asked me if I would show her some examples. It occurred to me that many of you might also like to see some examples, so I decided to post them to the Stata Blog.

Introduction

We simulate data all the time at StataCorp and for a variety of reasons.

One reason is that real datasets that include the features we would like are often difficult to find. We prefer to use real datasets in the manual examples, but sometimes that isn’t feasible and so we create simulated datasets.

We also simulate data to check the coverage probabilities of new estimators in Stata. Sometimes the formulae published in books and papers contain typographical errors. Sometimes the asymptotic properties of estimators don’t hold under certain conditions. And every once in a while, we make coding mistakes. We run simulations during development to verify that a 95% confidence interval really is a 95% confidence interval.

Simulated data can also come in handy for presentations, teaching purposes, and calculating statistical power using simulations for complex study designs.

And, simulating data is just plain fun once you get the hang of it.

Some of you will recall Vince Wiggins’s blog entry from 2011 entitled “Multilevel random effects in xtmixed and sem — the long and wide of it” in which he simulated a three-level dataset. I’m going to elaborate on how Vince simulated multilevel data, and then I’ll show you some useful variations. Specifically, I’m going to talk about:

1. How to simulate single-level data
2. How to simulate two- and three-level data
3. How to simulate three-level data with covariates
4. How to simulate longitudinal data with random slopes
5. How to simulate longitudinal data with structured errors

How to simulate single-level data

Let’s begin by simulating a trivially simple, single-level dataset that has the form

$y_i = 70 + e_i$

We will assume that e is normally distributed with mean zero and variance $$\sigma^2$$.

We’d want to simulate 500 observations, so let’s begin by clearing Stata’s memory and setting the number of observations to 500.

. clear
. set obs 500

Next, let’s create a variable named e that contains pseudorandom normally distributed data with mean zero and standard deviation 5:

. generate e = rnormal(0,5)

The variable e is our error term, so we can create an outcome variable y by typing

. generate y = 70 + e

. list y e in 1/5

+----------------------+
|        y           e |
|----------------------|
1. | 78.83927     8.83927 |
2. | 69.97774   -.0222647 |
3. | 69.80065   -.1993514 |
4. | 68.11398    -1.88602 |
5. | 63.08952   -6.910483 |
+----------------------+

We can fit a linear regression for the variable y to determine whether our parameter estimates are reasonably close to the parameters we specified when we simulated our dataset:

. regress y

Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  0,   499) =    0.00
Model |           0     0           .           Prob > F      =       .
Residual |  12188.8118   499  24.4264766           R-squared     =  0.0000
Total |  12188.8118   499  24.4264766           Root MSE      =  4.9423

------------------------------------------------------------------------------
y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons |   69.89768    .221027   316.24   0.000     69.46342    70.33194
------------------------------------------------------------------------------

The estimate of _cons is 69.9, which is very close to 70, and the Root MSE of 4.9 is equally close to the error’s standard deviation of 5. The parameter estimates will not be exactly equal to the underlying parameters we specified when we created the data because we introduced randomness with the rnormal() function.

This simple example is just to get us started before we work with multilevel data. For familiarity, let’s fit the same model with the mixed command that we will be using later:

. mixed y, stddev

Mixed-effects ML regression                     Number of obs      =       500

Wald chi2(0)       =         .
Log likelihood = -1507.8857                     Prob > chi2        =         .

------------------------------------------------------------------------------
y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons |   69.89768   .2208059   316.56   0.000     69.46491    70.33045
------------------------------------------------------------------------------

------------------------------------------------------------------------------
Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
sd(Residual) |    4.93737   .1561334      4.640645    5.253068
------------------------------------------------------------------------------

The output is organized with the parameter estimates for the fixed part in the top table and the estimated standard deviations for the random effects in the bottom table. Just as previously, the estimate of _cons is 69.9, and the estimate of the standard deviation of the residuals is 4.9.

Okay. That really was trivial, wasn’t it? Simulating two- and three-level data is almost as easy.

How to simulate two- and three-level data

I posted a blog entry last year titled “Multilevel linear models in Stata, part 1: Components of variance“. In that posting, I showed a diagram for a residual of a three-level model.

The equation for the variance-components model I fit had the form

$y_{ijk} = mu + u_i.. + u_{ij.} + e_{ijk}$

This model had three residuals, whereas the one-level model we just fit above had only one.

This time, let’s start with a two-level model. Let’s simulate a two-level dataset, a model for children nested within classrooms. We’ll index classrooms by i and children by j. The model is
$y_{ij} = mu + u_{i.} + e_{ij}$

For this toy model, let’s assume two classrooms with two students per classroom, meaning that we want to create a four-observation dataset, where the observations are students.

To create this four-observation dataset, we start by creating a two-observation dataset, where the observations are classrooms. Because there are two classrooms, we type

. clear
. set obs 2
. generate classroom = _n

From now on, we’ll refer to classroom as i. It’s easier to remember what variables mean if they have meaningful names.

Next, we’ll create a variable that contains each classroom’s random effect $$u_i$$, which we’ll assume follows an N(0,3) distribution.

. generate u_i = rnormal(0,3)

. list

+----------------------+
| classr~m         u_i |
|----------------------|
1. |        1    .7491351 |
2. |        2   -.0031386 |
+----------------------+

We can now expand our data to include two children per classroom by typing

. expand 2

. list

+----------------------+
| classr~m         u_i |
|----------------------|
1. |        1    .7491351 |
2. |        2   -.0031386 |
3. |        1    .7491351 |
4. |        2   -.0031386 |
+----------------------+

Now, we can think of our observations as being students. We can create a child ID (we’ll call it child rather than j), and we can create each child’s residual $$e_{ij}$$, which we will assume has an N(0,5) distribution:

. bysort classroom: generate child = _n

. generate e_ij = rnormal(0,5)

. list

+------------------------------------------+
| classr~m         u_i   child        e_ij |
|------------------------------------------|
1. |        1    .7491351       1    2.832674 |
2. |        1    .7491351       2    1.487452 |
3. |        2   -.0031386       1    6.598946 |
4. |        2   -.0031386       2   -.3605778 |
+------------------------------------------+

We now have nearly all the ingredients to calculate $$y_{ij}$$:

$$y_{ij} = mu + u_{i.} + e_{ij}$$

We’ll assume mu is 70. We type

. generate y = 70 + u_i + e_ij

. list y classroom u_i child e_ij, sepby(classroom)

+-----------------------------------------------------+
|        y   classr~m         u_i   child        e_ij |
|-----------------------------------------------------|
1. | 73.58181          1    .7491351       1    2.832674 |
2. | 72.23659          1    .7491351       2    1.487452 |
|-----------------------------------------------------|
3. | 76.59581          2   -.0031386       1    6.598946 |
4. | 69.63628          2   -.0031386       2   -.3605778 |
+-----------------------------------------------------+

Note that the random effect u_i is the same within each school, and each child has a different value for e_ij.

Our strategy was simple:

2. Create variables for the level ID and its random effect.
3. Expand the data by the number of observations within that level.
4. Repeat steps 2 and 3 until the bottom level is reached.

Let’s try this recipe for three-level data where children are nested within classrooms which are nested within schools. This time, I will index schools with i, classrooms with j, and children with k so that my model is

$y_{ijk} = mu + u_{i..} + u_{ij.} + e_{ijk}$

where

$$u_{i..}$$ ~ N(0,2)
$$u_{ij.}$$ ~ N(0,3)
$$u_{ijk}$$ ~ N(0,5)

Let’s create data for

(level 3, i)   2 schools

(level 2, j)   2 classrooms in each school

(level 1, k)  2 students in most classrooms; 3 students in i==2 & j==2

Begin by creating the level-three data for the two schools:

. clear
. set obs 2
. generate school = _n
. generate u_i = rnormal(0,2)
. list school u_i

+--------------------+
| school         u_i |
|--------------------|
1. |      1    3.677312 |
2. |      2   -3.193004 |
+--------------------+

Next, we expand the data so that we have the two classrooms nested within each of the schools, and we create its random effect:

. expand 2
. bysort school: generate classroom = _n
. generate u_ij = rnormal(0,3)
. list school u_i classroom u_ij, sepby(school)

+-------------------------------------------+
| school         u_i   classr~m        u_ij |
|-------------------------------------------|
1. |      1    3.677312          1    .9811059 |
2. |      1    3.677312          2   -3.482453 |
|-------------------------------------------|
3. |      2   -3.193004          1   -4.107915 |
4. |      2   -3.193004          2   -2.450383 |
+-------------------------------------------+

Finally, we expand the data so that we have three students in school 2’s classroom 2, and two students in all the other classrooms. Sorry for that complication, but I wanted to show you how to create unbalanced data.

In the previous examples, we’ve been typing things like expand 2, meaning double the observations. In this case, we need to do something different for school 2, classroom 2, namely,

. expand 3 if school==2 & classroom==2

and then we can just expand the rest:

. expand 2 if !(school==2 & clasroom==2)

Obviously, in a real simulation, you would probably want 16 to 25 students in each classroom. You could do something like that by typing

. expand 16+int((25-16+1)*runiform())

In any case, we will type

. expand 3 if school==2 & classroom==2

. expand 2 if !(school==2 & classroom==2)

. bysort school classroom: generate child = _n

. generate e_ijk = rnormal(0,5)

. generate y = 70 + u_i + u_ij + e_ijk

. list y school u_i classroom u_ij child e_ijk, sepby(classroom)

+------------------------------------------------------------------------+
|        y   school       u_i   classr~m        u_ij   child       e_ijk |
|------------------------------------------------------------------------|
1. | 76.72794        1  3.677312          1    .9811059       1    2.069526 |
2. | 69.81315        1  3.677312          1    .9811059       2   -4.845268 |
|------------------------------------------------------------------------|
3. | 74.09565        1  3.677312          2   -3.482453       1    3.900788 |
4. | 71.50263        1  3.677312          2   -3.482453       2    1.307775 |
|------------------------------------------------------------------------|
5. | 64.86206        2 -3.193004          1   -4.107915       1    2.162977 |
6. | 61.80236        2 -3.193004          1   -4.107915       2   -.8967164 |
|------------------------------------------------------------------------|
7. | 66.65285        2 -3.193004          2   -2.450383       1    2.296242 |
8. | 49.96139        2 -3.193004          2   -2.450383       2   -14.39522 |
9. | 64.41605        2 -3.193004          2   -2.450383       3    .0594433 |
+------------------------------------------------------------------------+

Regardless of how we generate the data, we must ensure that the school-level random effects u_i are the same within school and the classroom-level random effects u_ij are the same within classroom.

Concerning data construction, the example above we concocted to produce a dataset that would be easy to list. Let’s now create a dataset that is more reasonable:

$y_{ijk} = mu + u_{i..} + u_{ij.} + e_{ijk}$

where

$$u_{i..}$$ ~ N(0,2)
$$u_{ij.}$$ ~ N(0,3)
$$u_{ijk}$$ ~ N(0,5)

Let’s create data for

(level 3, i)   6 schools

(level 2, j)   10 classrooms in each school

(level 1, k)   16-25 students

. clear
. set obs 6
. generate school = _n
. generate u_i = rnormal(0,2)
. expand 10
. bysort school: generate classroom = _n
. generate u_ij = rnormal(0,3)
. expand 16+int((25-16+1)*runiform())
. bysort school classroom: generate child = _n
. generate e_ijk = rnormal(0,5)
. generate y = 70 + u_i + u_ij + e_ijk

We can use the mixed command to fit the model with our simulated data.

. mixed y || school: || classroom: , stddev

Mixed-effects ML regression                     Number of obs      =      1217

-----------------------------------------------------------
|   No. of       Observations per Group
Group Variable |   Groups    Minimum    Average    Maximum
----------------+------------------------------------------
school |        6        197      202.8        213
classroom |       60         16       20.3         25
-----------------------------------------------------------

Wald chi2(0)       =         .
Log likelihood = -3710.0673                     Prob > chi2        =         .

------------------------------------------------------------------------------
y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons |   70.25941   .9144719    76.83   0.000     68.46707    72.05174
------------------------------------------------------------------------------

------------------------------------------------------------------------------
Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
school: Identity             |
sd(_cons) |   2.027064   .7159027      1.014487    4.050309
-----------------------------+------------------------------------------------
classroom: Identity          |
sd(_cons) |   2.814152   .3107647       2.26647    3.494178
-----------------------------+------------------------------------------------
sd(Residual) |   4.828923   .1003814      4.636133     5.02973
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =   379.37   Prob > chi2 = 0.0000

The parameter estimates from our simulated data match the parameters used to create the data pretty well: the estimate for _cons is 70.3, which is near 70; the estimated standard deviation for the school-level random effects is 2.02, which is near 2; the estimated standard deviation for the classroom-level random effects is 2.8, which is near 3; and the estimated standard deviation for the individual-level residuals is 4.8, which is near 5.

We’ve just done one reasonable simulation.

If we wanted to do a full simulation, we would need to do the above 100, 1,000, 10,000, or more times. We would put our code in a loop. And in that loop, we would keep track of whatever parameter interested us.

How to simulate three-level data with covariates

Usually, we’re more interested in estimating the effects of the covariates than in estimating the variance of the random effects. Covariates are typically binary (such as male/female), categorical (such as race), ordinal (such as education level), or continuous (such as age).

Let’s add some covariates to our simulated data. Our model is

$y_{ijk} = mu + u_{i..} + u_{ij.} + e_{ijk}$

where

$$u_{i..}$$ ~ N(0,2)
$$u_{ij.}$$ ~ N(0,3)
$$u_{ijk}$$ ~ N(0,5)

We create data for

(level 3, i)   6 schools

(level 2, j)   10 classrooms in each school

(level 1, k)   16-25 students

(level 3, school i)       whether the school is in an urban environment

(level 2, classroom j)  teacher’s experience (years)

(level 1, student k)    student’s mother’s education level

We can create a binary covariate called urban at the school level that equals 1 if the school is located in an urban area and equals 0 otherwise.

. clear
. set obs 6
. generate school = _n
. generate u_i = rnormal(0,2)
. generate urban = runiform()<0.50

Here we assigned schools to one of the two groups with equal probability (runiform()<0.50), but we could have assigned 70% of the schools to be urban by typing

. generate urban = runiform()<0.70

At the classroom level, we could add a continuous covariate for the teacher's years of experience. We could generate this variable by using any of Stata's random-number functions (see help random_number_functions. In the example below, I've generated teacher's years of experience with a uniform distribution ranging from 5-20 years.

. expand 10
. bysort school: generate classroom = _n
. generate u_ij = rnormal(0,3)
. bysort school: generate teach_exp = 5+int((20-5+1)*runiform())

When we summarize our data, we see that teaching experience ranges from 6-20 years with an average of 13 years.

. summarize teach_exp

Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
teach_exp |        60    13.21667    4.075939          6         20

At the child level, we could add a categorical/ordinal covariate for mother's highest level of education completed. After we expand the data and create the child ID and error variables, we can generate a uniformly distributed random variable, temprand, on the interval [0,1].

. expand 16+int((25-16+1)*runiform())
. bysort school classroom: generate child = _n
. generate e_ijk = rnormal(0,5)
. generate temprand = runiform()

We can assign children to different groups by using the egen command with cutpoints. In the example below, children whose value of temprand is in the interval [0,0.5) will be assigned to mother_educ==0, children whose value of temprand is in the interval [0.5,0.9) will be assigned to mother_educ==1, and children whose value of temprand is in the interval [0.9,1) will be assigned to mother_educ==2.

. egen mother_educ = cut(temprand), at(0,0.5, 0.9, 1) icodes
. label define mother_educ 0 "HighSchool" 1 "College" 2 ">College"
. label values mother_educ mother_educ

The resulting frequencies of each category are very close to the frequencies we specified in our egen command.

. tabulate mother_educ, generate(meduc)

mother_educ |      Freq.     Percent        Cum.
------------+-----------------------------------
HighSchool |        602       50.17       50.17
College |        476       39.67       89.83
>College |        122       10.17      100.00
------------+-----------------------------------
Total |      1,200      100.00

We used the option generate(meduc) in the tabulate command above to create indicator variables for each category of mother_educ. This will allow us to specify an effect size for each category when we create our outcome variable.

. summarize meduc*

Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
meduc1 |      1200    .5016667    .5002057          0          1
meduc2 |      1200    .3966667    .4894097          0          1
meduc3 |      1200    .1016667    .3023355          0          1

Now, we can create an outcome variable called score by adding all our fixed and random effects together. We can specify an effect size (regression coefficient) for each fixed effect in our model.

. generate score = 70
+ (-2)*urban
+ 1.5*teach_exp
+ 0*meduc1
+ 2*meduc2
+ 5*meduc3
+ u_i + u_ij + e_ijk

I have specified that the grand mean is 70, urban schools will have scores 2 points lower than nonurban schools, and each year of teacher's experience will add 1.5 points to the students score.

Mothers whose highest level of education was high school (meduc1==1) will serve as the referent category for mother_educ(mother_educ==0). The scores of children whose mother completed college (meduc2==1 and mother_educ==1) will be 2 points higher than the children in the referent group. And the scores of children whose mother completed more than college (meduc3==1 and mother_educ==2) will be 5 points higher than the children in the referent group. Now, we can use the mixed command to fit a model to our simulated data. We used the indicator variables meduc1-meduc3 to create the data, but we will use the factor variable i.mother_educ to fit the model.

. mixed score urban teach_exp i.mother_educ  || school: ||
classroom: , stddev baselevel

Mixed-effects ML regression                     Number of obs      =      1259

-----------------------------------------------------------
|   No. of       Observations per Group
Group Variable |   Groups    Minimum    Average    Maximum
----------------+------------------------------------------
school |        6        200      209.8        217
classroom |       60         16       21.0         25
-----------------------------------------------------------

Wald chi2(4)       =    387.64
Log likelihood = -3870.5395                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
score |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
urban |  -2.606451    2.07896    -1.25   0.210    -6.681138    1.468237
teach_exp |   1.584759    .096492    16.42   0.000     1.395638     1.77388
|
mother_educ |
HighSchool  |          0  (base)
College  |   2.215281   .3007208     7.37   0.000     1.625879    2.804683
>College  |   5.065907   .5237817     9.67   0.000     4.039314      6.0925
|
_cons |   68.95018   2.060273    33.47   0.000     64.91212    72.98824
------------------------------------------------------------------------------

------------------------------------------------------------------------------
Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
school: Identity             |
sd(_cons) |   2.168154   .7713944      1.079559    4.354457
-----------------------------+------------------------------------------------
classroom: Identity          |
sd(_cons) |    3.06871   .3320171      2.482336    3.793596
-----------------------------+------------------------------------------------
sd(Residual) |   4.947779   .1010263      4.753681    5.149802
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =   441.25   Prob > chi2 = 0.0000

"Close" is in the eye of the beholder, but to my eyes, the parameter estimates look remarkably close to the parameters that were used to simulate the data. The parameter estimates for the fixed part of the model are -2.6 for urban (parameter = -2), 1.6 for teach_exp (parameter = 1.5), 2.2 for the College category of mother_educ (parameter = 2), 5.1 for the >College category of mother_educ (parameter = 5), and 69.0 for the intercept (parameter = 70). The estimated standard deviations for the random effects are also very close to the simulation parameters. The estimated standard deviation is 2.2 (parameter = 2) at the school level, 3.1 (parameter = 3) at the classroom level, and 4.9 (parameter = 5) at the child level.

Some of you may disagree that the parameter estimates are close. My reply is that it doesn't matter unless you're simulating a single dataset for demonstration purposes. If you are, simply simulate more datasets until you get one that looks close enough for you. If you are simulating data to check coverage probabilities or to estimate statistical power, you will be averaging over thousands of simulated datasets and the results of any one of those datasets won't matter.

How to simulate longitudinal data with random slopes

Longitudinal data are often conceptualized as multilevel data where the repeated observations are nested within individuals. The main difference between ordinary multilevel models and multilevel models for longitudinal data is the inclusion of a random slope. If you are not familiar with random slopes, you can learn more about them in a blog entry I wrote last year (Multilevel linear models in Stata, part 2: Longitudinal data).

Simulating longitudinal data with a random slope is much like simulating two-level data, with a couple of modifications. First, the bottom level will be observations within person. Second, there will be an interaction between time (age) and a person-level random effect. So we will generate data for the following model:

$weight_{ij} = mu + age_{ij} + u_{0i.} + age*u_{1i.} + e_{ij}$

where

$$u_{0i.}$$ ~ N(0,3)   $$u_{1i.}$$ ~ N(0,1)   $$e_{ij}$$ ~ N(0,2)

Let's begin by simulating longitudinal data for 300 people.

. clear
. set obs 300
. gen person = _n

For longitudinal data, we must create two person-level random effects: the variable u_0i is analogous to the random effect we created earlier, and the variable u_1i is the random effect for the slope over time.

. generate u_0i = rnormal(0,3)
. generate u_1i = rnormal(0,1)

Let's expand the data so that there are five observations nested within each person. Rather than create an observation-level identification number, let's create a variable for age that ranges from 12 to 16 years,

. expand 5
. bysort person: generate age = _n + 11

and create an observation-level error term from an N(0,2) distribution:

. generate e_ij = rnormal(0,2)

. list person u_0i u_1i age e_ij if person==1

+-------------------------------------------------+
| person       u_0i        u_1i   age        e_ij |
|-------------------------------------------------|
1. |      1   .9338312   -.3097848    12    1.172153 |
2. |      1   .9338312   -.3097848    13    2.935366 |
3. |      1   .9338312   -.3097848    14   -2.306981 |
4. |      1   .9338312   -.3097848    15   -2.148335 |
5. |      1   .9338312   -.3097848    16   -.4276625 |
+-------------------------------------------------+

The person-level random effects u_0i and u_1i are the same at all ages, and the observation-level random effects e_ij are different at each age. Now we're ready to generate an outcome variable called weight, measured in kilograms, based on the following model:

$weight_{ij} = 3 + 3.6*age_{ij} + u_{0i} + age*u_{1i} + e_{ij}$

. generate weight = 3 + 3.6*age + u_0i + age*u_1i + e_ij

The random effect u_1i is multiplied by age, which is why it is called a random slope. We could rewrite the model as

$weight_{ij} = 3 + age_{ij}*(3.6 + u_{1i}) + u_{01} + e_{ij}$

Note that for each year of age, a person's weight will increase by 3.6 kilograms plus some random amount specified by u_1j. In other words,the slope for age will be slightly different for each person.

We can use the mixed command to fit a model to our data:

. mixed weight age || person: age , stddev

Mixed-effects ML regression                     Number of obs      =      1500
Group variable: person                          Number of groups   =       300

Obs per group: min =         5
avg =       5.0
max =         5

Wald chi2(1)       =   3035.03
Log likelihood = -3966.3842                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
weight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |   3.708161   .0673096    55.09   0.000     3.576237    3.840085
_cons |   2.147311   .5272368     4.07   0.000     1.113946    3.180676
------------------------------------------------------------------------------

------------------------------------------------------------------------------
Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
person: Independent          |
sd(age) |   .9979648   .0444139      .9146037    1.088924
sd(_cons) |    3.38705   .8425298      2.080103    5.515161
-----------------------------+------------------------------------------------
sd(Residual) |   1.905885   .0422249      1.824897    1.990468
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =  4366.32   Prob > chi2 = 0.0000

The estimate for the intercept _cons = 2.1 is not very close to the original parameter value of 3, but the estimate of 3.7 for age is very close (parameter = 3.6). The standard deviations of the random effects are also very close to the parameters used to simulate the data. The estimate for the person level _cons is 2.1 (parameter = 2), the person-level slope is 0.997 (parameter = 1), and the observation-level residual is 1.9 (parameter = 2).

How to simulate longitudinal data with structured errors

Longitudinal data often have an autoregressive pattern to their errors because of the sequential collection of the observations. Measurements taken closer together in time will be more similar than measurements taken further apart in time. There are many patterns that can be used to descibe the correlation among the errors, including autoregressive, moving average, banded, exponential, Toeplitz, and others (see help mixed##rspec).

Let's simulate a dataset where the errors have a Toeplitz structure, which I will define below.

We begin by creating a sample with 500 people with a person-level random effect having an N(0,2) distribution.

. clear
. set obs 500
. gen person = _n
. generate u_i = rnormal(0,2)

Next, we can use the drawnorm command to create error variables with a Toeplitz pattern.

A Toeplitz 1 correlation matrix has the following structure:

. matrix V = ( 1.0, 0.5, 0.0, 0.0, 0.0  \     ///
0.5, 1.0, 0.5, 0.0, 0.0  \     ///
0.0, 0.5, 1.0, 0.5, 0.0  \     ///
0.0, 0.0, 0.5, 1.0, 0.5  \     ///
0.0, 0.0, 0.0, 0.5, 1.0 )

. matrix list V

symmetric V[5,5]
c1  c2  c3  c4  c5
r1   1
r2  .5   1
r3   0  .5   1
r4   0   0  .5   1
r5   0   0   0  .5   1

The correlation matrix has 1s on the main diagonal, and each pair of contiguous observations will have a correlation of 0.5. Observations more than 1 unit of time away from each other are assumed to be uncorrelated.

We must also define a matrix of means to use the drawnorm command.

. matrix M = (0 \ 0 \ 0 \ 0 \ 0)

. matrix list M

M[5,1]
c1
r1   0
r2   0
r3   0
r4   0
r5   0

Now, we're ready to use the drawnorm command to create five error variables that have a Toeplitz 1 structure.

. drawnorm e1 e2 e3 e4 e5, means(M) cov(V)

. list in 1/2

+---------------------------------------------------------------------------+
| person        u_i         e1         e2        e3          e4          e5 |
|---------------------------------------------------------------------------|
1. |      1   5.303562  -1.288265  -1.201399   .353249    .0495944   -1.472762 |
2. |      2  -.0133588   .6949759    2.82179  .7195075   -1.032395    .1995016 |
+---------------------------------------------------------------------------+

Let's estimate the correlation matrix for our simulated data to verify that our simulation worked as we expected.

. correlate e1-e5
(obs=300)

|       e1       e2       e3       e4       e5
-------------+---------------------------------------------
e1 |   1.0000
e2 |   0.5542   1.0000
e3 |  -0.0149   0.4791   1.0000
e4 |  -0.0508  -0.0364   0.5107   1.0000
e5 |   0.0022  -0.0615   0.0248   0.4857   1.0000

The correlations are 1 along the main diagonal, near 0.5 for the contiguous observations, and near 0 otherwise.

Our data are currently in wide format, and we need them in long format to use the mixed command. We can use the reshape command to convert our data from wide to long format. If you are not familiar with the reshape command, you can learn more about it by typing help reshape.

. reshape long e, i(person) j(time)
(note: j = 1 2 3 4 5)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      300   ->    1500
Number of variables                   7   ->       4
j variable (5 values)                     ->   time
xij variables:
e1 e2 ... e5   ->   e
-----------------------------------------------------------------------------

Now, we are ready to create our age variable and the outcome variable weight.

. bysort person: generate age = _n + 11
. generate weight = 3 + 3.6*age + u_i + e

. list weight person u_i time age e if person==1

+-------------------------------------------------------+
|   weight   person        u_i   time   age           e |
|-------------------------------------------------------|
1. |  50.2153        1   5.303562      1    12   -1.288265 |
2. | 53.90216        1   5.303562      2    13   -1.201399 |
3. | 59.05681        1   5.303562      3    14     .353249 |
4. | 62.35316        1   5.303562      4    15    .0495944 |
5. |  64.4308        1   5.303562      5    16   -1.472762 |
+-------------------------------------------------------+

We can use the mixed command to fit a model to our simulated data.

. mixed weight age || person:, residual(toeplitz 1, t(time)) , stddev

Mixed-effects ML regression                     Number of obs      =      1500
Group variable: person                          Number of groups   =       300

Obs per group: min =         5
avg =       5.0
max =         5

Wald chi2(1)       =  33797.58
Log likelihood = -2323.9389                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
weight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |   3.576738   .0194556   183.84   0.000     3.538606     3.61487
_cons |   3.119974   .3244898     9.62   0.000     2.483985    3.755962
------------------------------------------------------------------------------

------------------------------------------------------------------------------
Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
person: Identity             |
sd(_cons) |   3.004718   .1268162      2.766166    3.263843
-----------------------------+------------------------------------------------
Residual: Toeplitz(1)        |
rho1 |   .4977523   .0078807      .4821492    .5130398
sd(e) |   .9531284   .0230028      .9090933    .9992964
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =  3063.87   Prob > chi2 = 0.0000

Again, our parameter estimates match the parameters that were used to simulate the data very closely.

The parameter estimate is 3.6 for age (parameter = 3.6) and 3.1 for _cons (parameter = 3). The estimated standard deviations of the person-level random effect is 3.0 (parameter = 3). The estimated standard deviation for the errors is 0.95 (parameter = 1), and the estimated correlation for the Toeplitz structure is 0.5 (parameter = 0.5).

Conclusion

I hope I've convinced you that simulating multilevel/longitudinal data is easy and useful. The next time you find yourself teaching a class or giving a talk that requires multilevel examples, try simulating the data. And if you need to calculate statistical power for a multilevel or longitudinal model, consider simulations.

Categories: Statistics Tags:
• Dick Campbell

This is very informative. I would note, however, that the three level with covariates example assumes that the three covariates are constructed in such a way that they are mutually uncorrelated which would not be true in actual data. I’m wondering if there is some elegant way to construct these variables with some desired degree of correlation.

• Klaus Steitzel

I have been trying to create the three-level data without covariates. When I do

clear

program checkit, rclass
version 13
drop _all
set obs 6
generate school = _n
generate u_i = rnormal(0,2)
expand 10
bysort school: generate classroom = _n
generate u_ij = rnormal(0,3)
expand 16+int((25-16+1)*runiform())
bysort school classroom: generate child = _n
generate e_ijk = rnormal(0,5)
generate y = 70 + u_i + u_ij + e_ijk
mixed y || school: || classroom: , stddev iterate(300)
return scalar intercept = _b[_cons] // Obtain intercept
matrix b = e(b) // Get matrix for variance components
return scalar sd_e_ijk = exp(b[1,4]) // Get SD of residual variance
return scalar sd_u_ij = exp(b[1,3]) // Get SD of classroom variance
return scalar sd_u_i = exp(b[1,2]) // Get SD of school variance
return scalar n = e(N) // Get sample size
end

simulate constant = r(intercept) ///
res_var = r(sd_e_ijk) ///
clas_var = r(sd_u_ij) ///
scho_var = r(sd_u_i) ///
n = r(n) ///
, seed(3171) reps(10000): checkit

-sum- gives me a school-level variance component of 1.6, not 2.0:

Variable | Obs Mean Std. Dev. Min Max
————-+——————————————————–
constant | 10000 70.00696 .9110622 66.54556 73.34476
res_var | 10000 4.997051 .1042258 4.638548 5.409982
clas_var | 10000 2.986707 .3304136 1.791261 4.274279
scho_var | 10000 1.5931 .7894726 4.43e-11 5.00181
n | 10000 1229.93 22.152 1144 1311

Am I doing something wrong?

• Simon Turner

I feel like it may be due to the very small number of schools. If you up the number of schools to something like, 100, you should see more consistent results.

My results with 200 simulations of 100 schools…

Variable | Obs Mean Std. Dev. Min Max
————-+———————————————————
constant | 200 69.98887 .2223072 69.24297 70.6267
res_var | 200 4.998262 .0236969 4.938057 5.064035
clas_var | 200 2.984202 .0763061 2.742453 3.167063
scho_var | 200 1.979497 .1706455 1.490328 2.403322
n | 200 20494.91 86.57499 20264 20815