How to simulate multilevel/longitudinal data


I was recently talking with my friend Rebecca about simulating multilevel data, and she asked me if I would show her some examples. It occurred to me that many of you might also like to see some examples, so I decided to post them to the Stata Blog.

Introduction

We simulate data all the time at StataCorp and for a variety of reasons.

One reason is that real datasets that include the features we would like are often difficult to find. We prefer to use real datasets in the manual examples, but sometimes that isn’t feasible and so we create simulated datasets.

We also simulate data to check the coverage probabilities of new estimators in Stata. Sometimes the formulae published in books and papers contain typographical errors. Sometimes the asymptotic properties of estimators don’t hold under certain conditions. And every once in a while, we make coding mistakes. We run simulations during development to verify that a 95% confidence interval really is a 95% confidence interval.

Simulated data can also come in handy for presentations, teaching purposes, and calculating statistical power using simulations for complex study designs.

And, simulating data is just plain fun once you get the hang of it.

Some of you will recall Vince Wiggins’s blog entry from 2011 entitled “Multilevel random effects in xtmixed and sem — the long and wide of it” in which he simulated a three-level dataset. I’m going to elaborate on how Vince simulated multilevel data, and then I’ll show you some useful variations. Specifically, I’m going to talk about:

  1. How to simulate single-level data
  2. How to simulate two- and three-level data
  3. How to simulate three-level data with covariates
  4. How to simulate longitudinal data with random slopes
  5. How to simulate longitudinal data with structured errors

How to simulate single-level data

Let’s begin by simulating a trivially simple, single-level dataset that has the form

\[y_i = 70 + e_i\]

We will assume that e is normally distributed with mean zero and variance \(\sigma^2\).

We want to simulate 500 observations, so let’s begin by clearing Stata’s memory and setting the number of observations to 500.

. clear 
. set obs 500
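
Incidentally, if you want your own simulation runs to be reproducible, you can set the random-number seed at this point. The value below is an arbitrary choice of mine, and your numbers will still differ from the ones listed in this post.

. set seed 12345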

Next, let’s create a variable named e that contains pseudorandom normally distributed data with mean zero and standard deviation 5:

. generate e = rnormal(0,5)

The variable e is our error term, so we can create an outcome variable y by typing

. generate y = 70 + e

. list y e in 1/5

     +----------------------+
     |        y           e |
     |----------------------|
  1. | 78.83927     8.83927 |
  2. | 69.97774   -.0222647 |
  3. | 69.80065   -.1993514 |
  4. | 68.11398    -1.88602 |
  5. | 63.08952   -6.910483 |
     +----------------------+

We can fit a linear regression for the variable y to determine whether our parameter estimates are reasonably close to the parameters we specified when we simulated our dataset:

. regress y

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  0,   499) =    0.00
       Model |           0     0           .           Prob > F      =       .
    Residual |  12188.8118   499  24.4264766           R-squared     =  0.0000
-------------+------------------------------           Adj R-squared =  0.0000
       Total |  12188.8118   499  24.4264766           Root MSE      =  4.9423

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   69.89768    .221027   316.24   0.000     69.46342    70.33194
------------------------------------------------------------------------------

The estimate of _cons is 69.9, which is very close to 70, and the Root MSE of 4.9 is equally close to the error’s standard deviation of 5. The parameter estimates will not be exactly equal to the underlying parameters we specified when we created the data because we introduced randomness with the rnormal() function.

This simple example is just to get us started before we work with multilevel data. For familiarity, let’s fit the same model with the mixed command that we will be using later:

. mixed y, stddev

Mixed-effects ML regression                     Number of obs      =       500

                                                Wald chi2(0)       =         .
Log likelihood = -1507.8857                     Prob > chi2        =         .

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   69.89768   .2208059   316.56   0.000     69.46491    70.33045
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
                sd(Residual) |    4.93737   .1561334      4.640645    5.253068
------------------------------------------------------------------------------

The output is organized with the parameter estimates for the fixed part in the top table and the estimated standard deviations for the random effects in the bottom table. Just as previously, the estimate of _cons is 69.9, and the estimate of the standard deviation of the residuals is 4.9.

Okay. That really was trivial, wasn’t it? Simulating two- and three-level data is almost as easy.

How to simulate two- and three-level data

I posted a blog entry last year titled “Multilevel linear models in Stata, part 1: Components of variance“. In that posting, I showed a diagram for a residual of a three-level model.

The equation for the variance-components model I fit had the form

\[y_{ijk} = \mu + u_{i..} + u_{ij.} + e_{ijk}\]

This model has three error components, whereas the single-level model we just fit above had only one.

This time, let’s start with a two-level model. Let’s simulate a two-level dataset, a model for children nested within classrooms. We’ll index classrooms by i and children by j. The model is
\[y_{ij} = \mu + u_{i.} + e_{ij}\]

For this toy model, let’s assume two classrooms with two students per classroom, meaning that we want to create a four-observation dataset, where the observations are students.

To create this four-observation dataset, we start by creating a two-observation dataset, where the observations are classrooms. Because there are two classrooms, we type

. clear 
. set obs 2
. generate classroom = _n

From now on, we’ll refer to classroom as i. It’s easier to remember what variables mean if they have meaningful names.

Next, we’ll create a variable that contains each classroom’s random effect \(u_i\), which we’ll assume follows an N(0,3) distribution.

. generate u_i = rnormal(0,3)

. list

     +----------------------+
     | classr~m         u_i |
     |----------------------|
  1. |        1    .7491351 |
  2. |        2   -.0031386 |
     +----------------------+

We can now expand our data to include two children per classroom by typing

. expand 2

. list

     +----------------------+
     | classr~m         u_i |
     |----------------------|
  1. |        1    .7491351 |
  2. |        2   -.0031386 |
  3. |        1    .7491351 |
  4. |        2   -.0031386 |
     +----------------------+

Now, we can think of our observations as being students. We can create a child ID (we’ll call it child rather than j), and we can create each child’s residual \(e_{ij}\), which we will assume has an N(0,5) distribution:

. bysort classroom: generate child = _n

. generate e_ij = rnormal(0,5)

. list

     +------------------------------------------+
     | classr~m         u_i   child        e_ij |
     |------------------------------------------|
  1. |        1    .7491351       1    2.832674 |
  2. |        1    .7491351       2    1.487452 |
  3. |        2   -.0031386       1    6.598946 |
  4. |        2   -.0031386       2   -.3605778 |
     +------------------------------------------+

We now have nearly all the ingredients to calculate \(y_{ij}\):

\(y_{ij} = \mu + u_{i.} + e_{ij}\)

We’ll assume \(\mu\) is 70. We type

. generate y = 70 + u_i + e_ij

. list y classroom u_i child e_ij, sepby(classroom)

     +-----------------------------------------------------+
     |        y   classr~m         u_i   child        e_ij |
     |-----------------------------------------------------|
  1. | 73.58181          1    .7491351       1    2.832674 |
  2. | 72.23659          1    .7491351       2    1.487452 |
     |-----------------------------------------------------|
  3. | 76.59581          2   -.0031386       1    6.598946 |
  4. | 69.63628          2   -.0031386       2   -.3605778 |
     +-----------------------------------------------------+

Note that the random effect u_i is the same within each classroom, and each child has a different value for e_ij.

Our strategy was simple:

  1. Start with the top level of the data hierarchy.
  2. Create variables for the level ID and its random effect.
  3. Expand the data by the number of observations within that level.
  4. Repeat steps 2 and 3 until the bottom level is reached.

Let’s try this recipe for three-level data where children are nested within classrooms which are nested within schools. This time, I will index schools with i, classrooms with j, and children with k so that my model is

\[y_{ijk} = \mu + u_{i..} + u_{ij.} + e_{ijk}\]

where

\(u_{i..}\) ~ N(0,2)
\(u_{ij.}\) ~ N(0,3)
\(e_{ijk}\) ~ N(0,5)

Let’s create data for

(level 3, i)   2 schools

(level 2, j)   2 classrooms in each school

(level 1, k)  2 students in most classrooms; 3 students in i==2 & j==2

Begin by creating the level-three data for the two schools:

. clear
. set obs 2
. generate school = _n
. generate u_i = rnormal(0,2)
. list school u_i

     +--------------------+
     | school         u_i |
     |--------------------|
  1. |      1    3.677312 |
  2. |      2   -3.193004 |
     +--------------------+

Next, we expand the data so that we have two classrooms nested within each school, and we create the classroom-level random effect:

. expand 2
. bysort school: generate classroom = _n
. generate u_ij = rnormal(0,3)
. list school u_i classroom u_ij, sepby(school)

     +-------------------------------------------+
     | school         u_i   classr~m        u_ij |
     |-------------------------------------------|
  1. |      1    3.677312          1    .9811059 |
  2. |      1    3.677312          2   -3.482453 |
     |-------------------------------------------|
  3. |      2   -3.193004          1   -4.107915 |
  4. |      2   -3.193004          2   -2.450383 |
     +-------------------------------------------+

Finally, we expand the data so that we have three students in school 2's classroom 2, and two students in all the other classrooms. Sorry for that complication, but I wanted to show you how to create unbalanced data.

In the previous examples, we’ve been typing things like expand 2, meaning double the observations. In this case, we need to do something different for school 2, classroom 2, namely,

. expand 3 if school==2 & classroom==2

and then we can just expand the rest:

. expand 2 if !(school==2 & classroom==2)

Obviously, in a real simulation, you would probably want 16 to 25 students in each classroom. You could do something like that by typing

. expand 16+int((25-16+1)*runiform())

In any case, we will type

. expand 3 if school==2 & classroom==2

. expand 2 if !(school==2 & classroom==2)

. bysort school classroom: generate child = _n

. generate e_ijk = rnormal(0,5)

. generate y = 70 + u_i + u_ij + e_ijk

. list y school u_i classroom u_ij child e_ijk, sepby(classroom)

     +------------------------------------------------------------------------+
     |        y   school       u_i   classr~m        u_ij   child       e_ijk |
     |------------------------------------------------------------------------|
  1. | 76.72794        1  3.677312          1    .9811059       1    2.069526 |
  2. | 69.81315        1  3.677312          1    .9811059       2   -4.845268 |
     |------------------------------------------------------------------------|
  3. | 74.09565        1  3.677312          2   -3.482453       1    3.900788 |
  4. | 71.50263        1  3.677312          2   -3.482453       2    1.307775 |
     |------------------------------------------------------------------------|
  5. | 64.86206        2 -3.193004          1   -4.107915       1    2.162977 |
  6. | 61.80236        2 -3.193004          1   -4.107915       2   -.8967164 |
     |------------------------------------------------------------------------|
  7. | 66.65285        2 -3.193004          2   -2.450383       1    2.296242 |
  8. | 49.96139        2 -3.193004          2   -2.450383       2   -14.39522 |
  9. | 64.41605        2 -3.193004          2   -2.450383       3    .0594433 |
     +------------------------------------------------------------------------+

Regardless of how we generate the data, we must ensure that the school-level random effects u_i are the same within school and the classroom-level random effects u_ij are the same within classroom.

Concerning data construction, we concocted the example above to produce a dataset that would be easy to list. Let’s now create a dataset that is more reasonable:

\[y_{ijk} = \mu + u_{i..} + u_{ij.} + e_{ijk}\]

where

\(u_{i..}\) ~ N(0,2)
\(u_{ij.}\) ~ N(0,3)
\(e_{ijk}\) ~ N(0,5)

Let’s create data for

(level 3, i)   6 schools

(level 2, j)   10 classrooms in each school

(level 1, k)   16-25 students

. clear
. set obs 6
. generate school = _n
. generate u_i = rnormal(0,2)
. expand 10
. bysort school: generate classroom = _n
. generate u_ij = rnormal(0,3)
. expand 16+int((25-16+1)*runiform())
. bysort school classroom: generate child = _n
. generate e_ijk = rnormal(0,5)
. generate y = 70 + u_i + u_ij + e_ijk

We can use the mixed command to fit the model with our simulated data.

. mixed y || school: || classroom: , stddev

Mixed-effects ML regression                     Number of obs      =      1217

-----------------------------------------------------------
                |   No. of       Observations per Group
 Group Variable |   Groups    Minimum    Average    Maximum
----------------+------------------------------------------
         school |        6        197      202.8        213
      classroom |       60         16       20.3         25
-----------------------------------------------------------

                                                Wald chi2(0)       =         .
Log likelihood = -3710.0673                     Prob > chi2        =         .

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   70.25941   .9144719    76.83   0.000     68.46707    72.05174
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
school: Identity             |
                   sd(_cons) |   2.027064   .7159027      1.014487    4.050309
-----------------------------+------------------------------------------------
classroom: Identity          |
                   sd(_cons) |   2.814152   .3107647       2.26647    3.494178
-----------------------------+------------------------------------------------
                sd(Residual) |   4.828923   .1003814      4.636133     5.02973
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =   379.37   Prob > chi2 = 0.0000

The parameter estimates from our simulated data match the parameters used to create the data pretty well: the estimate for _cons is 70.3, which is near 70; the estimated standard deviation for the school-level random effects is 2.02, which is near 2; the estimated standard deviation for the classroom-level random effects is 2.8, which is near 3; and the estimated standard deviation for the individual-level residuals is 4.8, which is near 5.

We’ve just done one reasonable simulation.

If we wanted to do a full simulation, we would need to do the above 100, 1,000, 10,000, or more times. We would put our code in a loop. And in that loop, we would keep track of whatever parameter interested us.
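
For example, here is a minimal sketch of such a loop built around Stata's simulate command. It reruns the six-school simulation above and keeps track of the estimated grand mean; the program name, the number of replications, the seed, and the decision to track only _cons are my own choices for illustration, not part of the original example.

capture program drop sim3level
program define sim3level, rclass
    // recreate one three-level dataset like the one above
    clear
    set obs 6
    generate school = _n
    generate u_i = rnormal(0,2)
    expand 10
    bysort school: generate classroom = _n
    generate u_ij = rnormal(0,3)
    expand 16+int((25-16+1)*runiform())
    bysort school classroom: generate child = _n
    generate e_ijk = rnormal(0,5)
    generate y = 70 + u_i + u_ij + e_ijk
    // fit the model and save the estimated grand mean
    quietly mixed y || school: || classroom:
    return scalar b_cons = _b[_cons]
end

simulate b_cons=r(b_cons), reps(100) seed(12345): sim3level
summarize b_cons

The mean of b_cons across the replications tells us how the estimator behaves on average; you could return and track any other quantity in e() or r() the same way.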

How to simulate three-level data with covariates

Usually, we’re more interested in estimating the effects of the covariates than in estimating the variance of the random effects. Covariates are typically binary (such as male/female), categorical (such as race), ordinal (such as education level), or continuous (such as age).

Let’s add some covariates to our simulated data. Our model is

\[y_{ijk} = \mu + u_{i..} + u_{ij.} + e_{ijk}\]

where

\(u_{i..}\) ~ N(0,2)
\(u_{ij.}\) ~ N(0,3)
\(e_{ijk}\) ~ N(0,5)

We create data for

(level 3, i)   6 schools

(level 2, j)   10 classrooms in each school

(level 1, k)   16-25 students

Let’s add to this model

(level 3, school i)       whether the school is in an urban environment

(level 2, classroom j)  teacher’s experience (years)

(level 1, student k)    student’s mother’s education level

We can create a binary covariate called urban at the school level that equals 1 if the school is located in an urban area and equals 0 otherwise.

. clear
. set obs 6
. generate school = _n
. generate u_i = rnormal(0,2)
. generate urban = runiform()<0.50

Here we assigned schools to one of the two groups with equal probability (runiform()<0.50), but we could have assigned 70% of the schools to be urban by typing

. generate urban = runiform()<0.70

At the classroom level, we could add a continuous covariate for the teacher's years of experience. We could generate this variable by using any of Stata's random-number functions (see help random_number_functions). In the example below, I've generated teacher's years of experience as a random integer between 5 and 20 years.

. expand 10
. bysort school: generate classroom = _n
. generate u_ij = rnormal(0,3)
. bysort school: generate teach_exp = 5+int((20-5+1)*runiform())

When we summarize our data, we see that teaching experience ranges from 6 to 20 years with an average of 13 years.

. summarize teach_exp

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   teach_exp |        60    13.21667    4.075939          6         20

At the child level, we could add a categorical/ordinal covariate for mother's highest level of education completed. After we expand the data and create the child ID and error variables, we can generate a uniformly distributed random variable, temprand, on the interval [0,1].

. expand 16+int((25-16+1)*runiform())
. bysort school classroom: generate child = _n
. generate e_ijk = rnormal(0,5)
. generate temprand = runiform()

We can assign children to different groups by using the egen command with cutpoints. In the example below, children whose value of temprand is in the interval [0,0.5) will be assigned to mother_educ==0, children whose value of temprand is in the interval [0.5,0.9) will be assigned to mother_educ==1, and children whose value of temprand is in the interval [0.9,1) will be assigned to mother_educ==2.

. egen mother_educ = cut(temprand), at(0,0.5, 0.9, 1) icodes
. label define mother_educ 0 "HighSchool" 1 "College" 2 ">College"
. label values mother_educ mother_educ

The resulting frequencies of each category are very close to the frequencies we specified in our egen command.

. tabulate mother_educ, generate(meduc)

mother_educ |      Freq.     Percent        Cum.
------------+-----------------------------------
 HighSchool |        602       50.17       50.17
    College |        476       39.67       89.83
   >College |        122       10.17      100.00
------------+-----------------------------------
      Total |      1,200      100.00

We used the option generate(meduc) in the tabulate command above to create indicator variables for each category of mother_educ. This will allow us to specify an effect size for each category when we create our outcome variable.

. summarize meduc*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      meduc1 |      1200    .5016667    .5002057          0          1
      meduc2 |      1200    .3966667    .4894097          0          1
      meduc3 |      1200    .1016667    .3023355          0          1

Now, we can create an outcome variable called score by adding all our fixed and random effects together. We can specify an effect size (regression coefficient) for each fixed effect in our model.

. generate score = 70            ///
        + (-2)*urban             ///
        + 1.5*teach_exp          ///
        + 0*meduc1               ///
        + 2*meduc2               ///
        + 5*meduc3               ///
        + u_i + u_ij + e_ijk

I have specified that the grand mean is 70, that urban schools will have scores 2 points lower than nonurban schools, and that each year of teacher's experience will add 1.5 points to the student's score.

Mothers whose highest level of education was high school (meduc1==1) will serve as the referent category for mother_educ (mother_educ==0). The scores of children whose mother completed college (meduc2==1 and mother_educ==1) will be 2 points higher than those of the children in the referent group. And the scores of children whose mother completed more than college (meduc3==1 and mother_educ==2) will be 5 points higher than those of the children in the referent group.

Now, we can use the mixed command to fit a model to our simulated data. We used the indicator variables meduc1-meduc3 to create the data, but we will use the factor variable i.mother_educ to fit the model.

. mixed score urban teach_exp i.mother_educ  || school: ||
                                             classroom: , stddev baselevel

Mixed-effects ML regression                     Number of obs      =      1259

-----------------------------------------------------------
                |   No. of       Observations per Group
 Group Variable |   Groups    Minimum    Average    Maximum
----------------+------------------------------------------
         school |        6        200      209.8        217
      classroom |       60         16       21.0         25
-----------------------------------------------------------

                                                Wald chi2(4)       =    387.64
Log likelihood = -3870.5395                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       score |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       urban |  -2.606451    2.07896    -1.25   0.210    -6.681138    1.468237
   teach_exp |   1.584759    .096492    16.42   0.000     1.395638     1.77388
             |
 mother_educ |
 HighSchool  |          0  (base)
    College  |   2.215281   .3007208     7.37   0.000     1.625879    2.804683
   >College  |   5.065907   .5237817     9.67   0.000     4.039314      6.0925
             |
       _cons |   68.95018   2.060273    33.47   0.000     64.91212    72.98824
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
school: Identity             |
                   sd(_cons) |   2.168154   .7713944      1.079559    4.354457
-----------------------------+------------------------------------------------
classroom: Identity          |
                   sd(_cons) |    3.06871   .3320171      2.482336    3.793596
-----------------------------+------------------------------------------------
                sd(Residual) |   4.947779   .1010263      4.753681    5.149802
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =   441.25   Prob > chi2 = 0.0000

“Close” is in the eye of the beholder, but to my eyes, the parameter estimates look remarkably close to the parameters that were used to simulate the data. The parameter estimates for the fixed part of the model are -2.6 for urban (parameter = -2), 1.6 for teach_exp (parameter = 1.5), 2.2 for the College category of mother_educ (parameter = 2), 5.1 for the >College category of mother_educ (parameter = 5), and 69.0 for the intercept (parameter = 70). The estimated standard deviations for the random effects are also very close to the simulation parameters. The estimated standard deviation is 2.2 (parameter = 2) at the school level, 3.1 (parameter = 3) at the classroom level, and 4.9 (parameter = 5) at the child level.

Some of you may disagree that the parameter estimates are close. My reply is that it doesn’t matter unless you’re simulating a single dataset for demonstration purposes. If you are, simply simulate more datasets until you get one that looks close enough for you. If you are simulating data to check coverage probabilities or to estimate statistical power, you will be averaging over thousands of simulated datasets and the results of any one of those datasets won’t matter.
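
As a rough sketch of what that looks like (the program name, replication count, seed, simplified model without mother_educ, and the particular quantities tracked are all my choices, not part of the original example), each replication below records whether the 95% confidence interval for the urban coefficient covers its true value of -2 and whether a 5%-level test of that coefficient rejects. The means of those indicators across replications estimate the coverage probability and the power.

capture program drop simcover
program define simcover, rclass
    // simulate one dataset like the covariate example above (mother_educ omitted for brevity)
    clear
    set obs 6
    generate school = _n
    generate u_i = rnormal(0,2)
    generate urban = mod(school, 2)   // assign half the schools to be urban (deterministic, to keep the sketch simple)
    expand 10
    bysort school: generate classroom = _n
    generate u_ij = rnormal(0,3)
    generate teach_exp = 5+int((20-5+1)*runiform())
    expand 16+int((25-16+1)*runiform())
    generate e_ijk = rnormal(0,5)
    generate score = 70 - 2*urban + 1.5*teach_exp + u_i + u_ij + e_ijk
    quietly mixed score urban teach_exp || school: || classroom:
    // does the 95% CI for urban cover the true value of -2?
    return scalar cover = (-2 > _b[urban] - invnormal(0.975)*_se[urban]) & ///
                          (-2 < _b[urban] + invnormal(0.975)*_se[urban])
    // does a 5%-level Wald test of urban reject the (false) null of no effect?
    quietly test urban
    return scalar reject = (r(p) < 0.05)
end

simulate cover=r(cover) reject=r(reject), reps(1000) seed(54321): simcover
summarize cover reject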

How to simulate longitudinal data with random slopes

Longitudinal data are often conceptualized as multilevel data where the repeated observations are nested within individuals. The main difference between ordinary multilevel models and multilevel models for longitudinal data is the inclusion of a random slope. If you are not familiar with random slopes, you can learn more about them in a blog entry I wrote last year (Multilevel linear models in Stata, part 2: Longitudinal data).

Simulating longitudinal data with a random slope is much like simulating two-level data, with a couple of modifications. First, the bottom level will be observations within person. Second, there will be an interaction between time (age) and a person-level random effect. So we will generate data for the following model:

\[weight_{ij} = \mu + \beta \, age_{ij} + u_{0i} + u_{1i} \, age_{ij} + e_{ij}\]

where

\(u_{0i}\) ~ N(0,3)   \(u_{1i}\) ~ N(0,1)   \(e_{ij}\) ~ N(0,2)

Let’s begin by simulating longitudinal data for 300 people.

. clear
. set obs 300
. gen person = _n

For longitudinal data, we must create two person-level random effects: the variable u_0i is analogous to the random effect we created earlier, and the variable u_1i is the random effect for the slope over time.

. generate u_0i = rnormal(0,3)
. generate u_1i = rnormal(0,1)

Let’s expand the data so that there are five observations nested within each person. Rather than create an observation-level identification number, let’s create a variable for age that ranges from 12 to 16 years,

. expand 5
. bysort person: generate age = _n + 11

and create an observation-level error term from an N(0,2) distribution:

. generate e_ij = rnormal(0,2)

. list person u_0i u_1i age e_ij if person==1

      +-------------------------------------------------+
      | person       u_0i        u_1i   age        e_ij |
      |-------------------------------------------------|
   1. |      1   .9338312   -.3097848    12    1.172153 |
   2. |      1   .9338312   -.3097848    13    2.935366 |
   3. |      1   .9338312   -.3097848    14   -2.306981 |
   4. |      1   .9338312   -.3097848    15   -2.148335 |
   5. |      1   .9338312   -.3097848    16   -.4276625 |
      +-------------------------------------------------+

The person-level random effects u_0i and u_1i are the same at all ages, and the observation-level random effects e_ij are different at each age. Now we’re ready to generate an outcome variable called weight, measured in kilograms, based on the following model:

\[weight_{ij} = 3 + 3.6 \, age_{ij} + u_{0i} + u_{1i} \, age_{ij} + e_{ij}\]

. generate weight = 3 + 3.6*age + u_0i + age*u_1i + e_ij

The random effect u_1i is multiplied by age, which is why it is called a random slope. We could rewrite the model as

\[weight_{ij} = 3 + age_{ij} \, (3.6 + u_{1i}) + u_{0i} + e_{ij}\]

Note that for each year of age, a person’s weight will increase by 3.6 kilograms plus some random amount specified by u_1i. In other words, the slope for age will be slightly different for each person.

We can use the mixed command to fit a model to our data:

. mixed weight age || person: age , stddev

Mixed-effects ML regression                     Number of obs      =      1500
Group variable: person                          Number of groups   =       300

                                                Obs per group: min =         5
                                                               avg =       5.0
                                                               max =         5


                                                Wald chi2(1)       =   3035.03
Log likelihood = -3966.3842                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   3.708161   .0673096    55.09   0.000     3.576237    3.840085
       _cons |   2.147311   .5272368     4.07   0.000     1.113946    3.180676
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
person: Independent          |
                     sd(age) |   .9979648   .0444139      .9146037    1.088924
                   sd(_cons) |    3.38705   .8425298      2.080103    5.515161
-----------------------------+------------------------------------------------
                sd(Residual) |   1.905885   .0422249      1.824897    1.990468
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =  4366.32   Prob > chi2 = 0.0000

The estimate for the intercept _cons = 2.1 is not very close to the original parameter value of 3, but the estimate of 3.7 for age is very close (parameter = 3.6). The standard deviations of the random effects are also very close to the parameters used to simulate the data. The estimated standard deviation of the person-level intercepts is 3.4 (parameter = 3), the person-level slope is 0.998 (parameter = 1), and the observation-level residual is 1.9 (parameter = 2).

How to simulate longitudinal data with structured errors

Longitudinal data often have an autoregressive pattern to their errors because of the sequential collection of the observations. Measurements taken closer together in time will be more similar than measurements taken further apart in time. There are many patterns that can be used to describe the correlation among the errors, including autoregressive, moving average, banded, exponential, Toeplitz, and others (see help mixed##rspec).

Let’s simulate a dataset where the errors have a Toeplitz structure, which I will define below.

We begin by creating a sample of 300 people, each with a person-level random effect drawn from an N(0,3) distribution.

. clear
. set obs 300
. gen person = _n
. generate u_i = rnormal(0,3)

Next, we can use the drawnorm command to create error variables with a Toeplitz pattern.

A Toeplitz 1 correlation matrix has the following structure:

. matrix V = ( 1.0, 0.5, 0.0, 0.0, 0.0  \     ///
               0.5, 1.0, 0.5, 0.0, 0.0  \     ///
               0.0, 0.5, 1.0, 0.5, 0.0  \     ///
               0.0, 0.0, 0.5, 1.0, 0.5  \     ///
               0.0, 0.0, 0.0, 0.5, 1.0 )

. matrix list V

symmetric V[5,5]
    c1  c2  c3  c4  c5
r1   1
r2  .5   1
r3   0  .5   1
r4   0   0  .5   1
r5   0   0   0  .5   1

The correlation matrix has 1s on the main diagonal, and each pair of contiguous observations will have a correlation of 0.5. Observations more than 1 unit of time away from each other are assumed to be uncorrelated.
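
As an aside, if you need more than five time points, you may not want to type the matrix out by hand. Here is one way you might build a Toeplitz(1) correlation matrix of arbitrary size; the dimension and the 0.5 correlation below simply mirror the example above.

local T 5                        // number of time points
matrix V = I(`T')                // 1s on the main diagonal
forvalues r = 2/`T' {
    local rm1 = `r' - 1
    matrix V[`rm1', `r'] = 0.5   // first off-diagonal band
    matrix V[`r', `rm1'] = 0.5
}
matrix list V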

We must also define a matrix of means to use the drawnorm command.

. matrix M = (0 \ 0 \ 0 \ 0 \ 0)

. matrix list M

M[5,1]
    c1
r1   0
r2   0
r3   0
r4   0
r5   0

Now, we’re ready to use the drawnorm command to create five error variables that have a Toeplitz 1 structure.

. drawnorm e1 e2 e3 e4 e5, means(M) cov(V)

. list in 1/2

     +---------------------------------------------------------------------------+
     | person        u_i         e1         e2        e3          e4          e5 |
     |---------------------------------------------------------------------------|
  1. |      1   5.303562  -1.288265  -1.201399   .353249    .0495944   -1.472762 |
  2. |      2  -.0133588   .6949759    2.82179  .7195075   -1.032395    .1995016 |
     +---------------------------------------------------------------------------+

Let’s estimate the correlation matrix for our simulated data to verify that our simulation worked as we expected.

. correlate e1-e5
(obs=300)

             |       e1       e2       e3       e4       e5
-------------+---------------------------------------------
          e1 |   1.0000
          e2 |   0.5542   1.0000
          e3 |  -0.0149   0.4791   1.0000
          e4 |  -0.0508  -0.0364   0.5107   1.0000
          e5 |   0.0022  -0.0615   0.0248   0.4857   1.0000

The correlations are 1 along the main diagonal, near 0.5 for the contiguous observations, and near 0 otherwise.

Our data are currently in wide format, and we need them in long format to use the mixed command. We can use the reshape command to convert our data from wide to long format. If you are not familiar with the reshape command, you can learn more about it by typing help reshape.

. reshape long e, i(person) j(time)
(note: j = 1 2 3 4 5)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      300   ->    1500
Number of variables                   7   ->       4
j variable (5 values)                     ->   time
xij variables:
                           e1 e2 ... e5   ->   e
-----------------------------------------------------------------------------

Now, we are ready to create our age variable and the outcome variable weight.

. bysort person: generate age = _n + 11
. generate weight = 3 + 3.6*age + u_i + e

. list weight person u_i time age e if person==1

      +-------------------------------------------------------+
      |   weight   person        u_i   time   age           e |
      |-------------------------------------------------------|
   1. |  50.2153        1   5.303562      1    12   -1.288265 |
   2. | 53.90216        1   5.303562      2    13   -1.201399 |
   3. | 59.05681        1   5.303562      3    14     .353249 |
   4. | 62.35316        1   5.303562      4    15    .0495944 |
   5. |  64.4308        1   5.303562      5    16   -1.472762 |
      +-------------------------------------------------------+

We can use the mixed command to fit a model to our simulated data.

. mixed weight age || person:, residual(toeplitz 1, t(time)) stddev

Mixed-effects ML regression                     Number of obs      =      1500
Group variable: person                          Number of groups   =       300

                                                Obs per group: min =         5
                                                               avg =       5.0
                                                               max =         5

                                                Wald chi2(1)       =  33797.58
Log likelihood = -2323.9389                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   3.576738   .0194556   183.84   0.000     3.538606     3.61487
       _cons |   3.119974   .3244898     9.62   0.000     2.483985    3.755962
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
person: Identity             |
                   sd(_cons) |   3.004718   .1268162      2.766166    3.263843
-----------------------------+------------------------------------------------
Residual: Toeplitz(1)        |
                        rho1 |   .4977523   .0078807      .4821492    .5130398
                       sd(e) |   .9531284   .0230028      .9090933    .9992964
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =  3063.87   Prob > chi2 = 0.0000

Again, our parameter estimates match the parameters that were used to simulate the data very closely.

The estimate for age is 3.6 (parameter = 3.6), and the estimate for _cons is 3.1 (parameter = 3). The estimated standard deviation of the person-level random effect is 3.0 (parameter = 3). The estimated standard deviation for the errors is 0.95 (parameter = 1), and the estimated correlation for the Toeplitz structure is 0.5 (parameter = 0.5).

Conclusion

I hope I’ve convinced you that simulating multilevel/longitudinal data is easy and useful. The next time you find yourself teaching a class or giving a talk that requires multilevel examples, try simulating the data. And if you need to calculate statistical power for a multilevel or longitudinal model, consider simulations.

How to create animated graphics using Stata

Introduction

Today I want to show you how to create animated graphics using Stata. It’s easier than you might expect, and you can use animated graphics to illustrate concepts that would be challenging to illustrate with static graphs. In addition to Stata, you will need a video-editing program, but don’t be concerned if you don’t have one. At the 2012 UK Stata User Group Meeting, Robert Grant demonstrated how to create animated graphics from within Stata using a free software program called FFmpeg. I will show you how I create my animated graphs using Camtasia and how Robert creates his using FFmpeg.

I recently recorded a video for the Stata YouTube channel called “Power and sample size calculations in Stata: A conceptual introduction”. I wanted to illustrate two concepts: (1) that statistical power increases as sample size increases and (2) that power increases as effect size increases. Both of these concepts can be illustrated with a static graph along with the explanation “imagine that …”. Creating animated graphs allowed me to skip the explanation and just show what I meant.

ChangeMeans

ChangeSampleSize

Creating the graphs

Videos are illusions. All videos — from Charles-Émile Reynaud’s 1877 praxinoscope to modern blu-ray movies — are created by displaying a series of ordered still images for a fraction of a second each. Our brains perceive this series of still images as motion.

To create the illusion of motion with graphs, we make an ordered series of slightly differing graphs. We can use loops to do this. If you are not familiar with loops in Stata, here’s one to count to five:

forvalues i = 1(1)5 {
    disp "i = `i'"
}

i = 1
i = 2
i = 3
i = 4
i = 5

We could place a graph command inside the loop. If, for each iteration, the graph command created a slightly different graph, we would be on our way to creating our first video. The loop below creates a series of graphs of normal densities with means 0 through 1 in increments of 0.1.

forvalues mu = 0(0.1)1 {
    twoway  function y=normalden(x,`mu',1), range(-3 6) title("N(`mu',1)")
}

You may have noticed the illusion of motion as Stata created each graph; the normal densities appeared to be moving to the right as each new graph appeared on the screen.

You may have also noticed that some of the values of the mean did not look as you would have wanted. For example, 1.0 was displayed as 0.999999999. That’s not a mistake; it happens because Stata stores numbers and performs calculations in base two and displays them in base ten. For a detailed explanation, see Precision (yet again), Part I.

We can fix that by reformatting the means using the string() function.

forvalues mu = 0(0.1)1 {
    local mu = string(`mu', "%3.1f")
    twoway  function y=normalden(x,`mu',1), range(-3 6) title("N(`mu',1)")
}

Next, we need to save our graphs. We can do this by adding graph export inside the loop.

forvalues mu = 0(0.1)1 {
    local mu = string(`mu', "%3.1f")
    twoway  function y=normalden(x,`mu',1), range(-3 6) title("N(`mu',1)")
    graph export graph_`mu'.png, as(png) width(1280) height(720) replace
}

Note that the name of each graph file includes the value of mu so that we know the order of our files. We can view the contents of the directory to verify that Stata has created a file for each of our graphs.

. ls
  <dir>   2/11/14 12:12  .
  <dir>   2/11/14 12:12  ..
  35.6k   2/11/14 12:11  graph_0.0.png
  35.6k   2/11/14 12:11  graph_0.1.png
  35.7k   2/11/14 12:11  graph_0.2.png
  35.7k   2/11/14 12:11  graph_0.3.png
  35.7k   2/11/14 12:11  graph_0.4.png
  35.8k   2/11/14 12:11  graph_0.5.png
  35.9k   2/11/14 12:12  graph_0.6.png
  35.7k   2/11/14 12:12  graph_0.7.png
  35.8k   2/11/14 12:12  graph_0.8.png
  35.9k   2/11/14 12:12  graph_0.9.png
  35.6k   2/11/14 12:12  graph_1.0.png

Now that we have created our graphs, we need to combine them into a video.

There are many commercial, freeware, and free software programs available that we could use. I will outline the basic steps using two of them, one a commercial GUI-based product (not free) called Camtasia, and the other a free command-based program called FFmpeg.

Creating videos with Camtasia

Most commercial video editing programs have similar interfaces. The user imports image, sound and video files, organizes them in tracks on a timeline and then previews the resulting video. Camtasia is a commercial video program that I use to record videos for the Stata Youtube channel and its interface looks like this.

Camtasia

We begin by importing the graph files into Camtasia:

CamtasiaImport

Next we drag the images onto the timeline:

CamtasiaTimeline

And then we make the display time for each image very short, in this case 0.1 seconds, or 10 frames per second.

CamtasiaTimelineShortened

After previewing the video, we can export it to any of Camtasia’s supported formats. I’ve exported to a “.gif” file because it is easy to view in a web browser.

graph_camtasia

We just created our first animated graph! All we have to do to make it look as professional as the power-and-sample size examples I showed you earlier is go back into our Stata program and modify the graph command to add the additional elements we want to display!

Creating videos with FFmpeg

Stata user and medical statistician Robert Grant gave a presentation at the 2012 UK Stata User Group Meeting in London entitled “Producing animated graphs from Stata without having to learn any specialized software“. You can read more about Robert by visiting his blog and clicking on About.

In his presentation, Robert demonstrated how to combine graph images into a video using a free software program called FFmpeg. Robert followed the same basic strategy I demonstrated above, but Robert’s choice of software has two appealing features. First, the software is readily available and free. Second, FFmpeg can be called from within the Stata environment using the winexec command. This means that we can create our graphs and combine them into a video using Stata do files. Combining dozens or hundreds of graphs into a single video with a program is faster and easier than using a drag-and-drop interface.

Let’s return to our previous example and combine the files using FFmpeg. Recall that we inserted the mean into the name of each file (e.g. “graph_0.4.png”) so that we could keep track of the order of the files. In my experience, it can be difficult to combine files with decimals in their names using FFmpeg. To avoid the problem, I have added a line of code between the twoway command and the graph export command that names the files with sequential integers which are padded with zeros.

forvalues mu = 0(0.1)1 {
    local mu = string(`mu', "%3.1f")
    twoway function y=normalden(x,`mu',1), range(-3 6) title("N(`mu',1)")
    local mu = string(`mu'*10+1, "%03.0f")
    graph export graph_`mu'.png, as(png) width(1280) height(720) replace
}

. ls
  <dir>   2/12/14 12:21  .
  <dir>   2/12/14 12:21  ..
  35.6k   2/12/14 12:21  graph_001.png
  35.6k   2/12/14 12:21  graph_002.png
  35.7k   2/12/14 12:21  graph_003.png
  35.7k   2/12/14 12:21  graph_004.png
  35.7k   2/12/14 12:21  graph_005.png
  35.8k   2/12/14 12:21  graph_006.png
  35.9k   2/12/14 12:21  graph_007.png
  35.7k   2/12/14 12:21  graph_008.png
  35.8k   2/12/14 12:21  graph_009.png
  35.9k   2/12/14 12:21  graph_010.png
  35.6k   2/12/14 12:21  graph_011.png

We can then combine these files into a video with FFmpeg using the following commands

local GraphPath "C:\Users\jch\AnimatedGraphics\example\"
winexec "C:\Program Files\FFmpeg\bin\ffmpeg.exe" -i `GraphPath'graph_%03d.png
    -b:v 512k `GraphPath'graph.mpg

The local macro GraphPath contains the path for the directory where my graphics files are stored.

The Stata command winexec whatever executes whatever. In our case, whatever is ffmpeg.exe, preceded by ffmpeg.exe's path, and followed by the arguments FFmpeg needs. We specify two options, -i and -b:v.

The -i option is followed by a path and filename template. In our case, the path is obtained from the Stata local macro GraphPath and the filename template is “graph_%03d.png”. This template tells FFmpeg to look for a three digit sequence of numbers between “graph_” and “.png” in the filenames. The zero that precedes the three in the template tells FFmpeg that the three digit sequence of numbers is padded with zeros.

The -b:v option specifies the bitrate of the video to be created, and the final argument specifies the path and filename of the output video.

Once we have created our video, we can use FFmpeg to convert our video to other video formats. For example, we could convert “graph.mpg” to “graph.gif” using the following command:

winexec "C:\Program Files\FFmpeg\bin\ffmpeg.exe" -r 10 -i `GraphPath'graph.mpg
    -t 10 -r 10 `GraphPath'graph.gif

which creates this graph:

graph_ffmpeg

FFmpeg is a very flexible program and there are far too many options to discuss in this blog entry. If you would like to learn more about FFmpeg you can visit their website at www.ffmpeg.org.

More Examples

I made the preceding examples as simple as possible so that we could focus on the mechanics of creating videos. We now know that, if we want to make professional looking videos, all the complication comes on the Stata side. We leave our loop alone but change the graph command inside it to be more complicated.

So here’s how I created the two animated-graphics videos that I used to create the overall video “Power and sample size calculations in Stata: A conceptual introduction” on our YouTube channel.

The first demonstrated that increasing the effect size (the difference between the means) results in increased statistical power.

local GraphCounter = 100
local mu_null = 0
local sd = 1
local z_crit = round(-1*invnormal(0.05), 0.01)
local z_crit_label = `z_crit' + 0.75

forvalues mu_alt = 1(0.01)3 {
  twoway  ///
    function y=normalden(x,`mu_null',`sd'),                    ///
             range(-3 `z_crit') color(red) dropline(0)   ||    ///
    function y=normalden(x,`mu_alt',`sd'),                     ///
             range(-3 5) color(green) dropline(`mu_alt') ||    ///
    function y=normalden(x,`mu_alt',`sd'),                     ///
             range(`z_crit' 6) recast(area) color(green) ||    ///
    function y=normalden(x,`mu_null',`sd'),                    ///
             range(`z_crit' 6) recast(area) color(red)         ///
    title("Power for {&mu}={&mu}{subscript:0} versus {&mu}={&mu}{subscript:A}") ///
    xtitle("{it: z}") xlabel(-3 -2 -1 0 1 2 3 4 5 6)           ///
    legend(off)                                                ///
    ytitle("Density") yscale(range(0 0.6))                     ///
    ylabel(0(0.1)0.6, angle(horizontal) nogrid)                ///
    text(0.45 0 "{&mu}{subscript:0}", color(red))              ///
    text(0.45 `mu_alt' "{&mu}{subscript:A}", color(green))

  graph export mu_alt_`GraphCounter'.png, as(png) width(1280) height(720) replace

  local ++GraphCounter
}

The above Stata code created the *.png files that I then combined using Camtasia to produce this gif:

ChangeMeans

The second video demonstrated that power increases as the sample size increases.

local GraphCounter = 301
local mu_label = 0.45
local power_label = 2.10
local mu_null = 0
local mu_alt = 2

forvalues sd = 1(-0.01)0.5 {
  local z_crit = round(-1*invnormal(0.05)*`sd', 0.01)
  local z_crit_label = `z_crit' + 0.75

  twoway                                                                        ///
    function y=normalden(x,`mu_null',`sd'),                                     ///
             range(-3 `z_crit') color(red) dropline(0)  ||                      ///
    function y=normalden(x,`mu_alt',`sd'),                                      ///
             range(-3 5) color(green)  dropline(`mu_alt')      ||               ///
    function y=normalden(x,`mu_alt',`sd'),                                      ///
             range(`z_crit' 6) recast(area) color(green)       ||               ///
    function y=normalden(x,`mu_null',`sd'),                                     ///
             range(`z_crit' 6) recast(area) color(red)                          ///
    title("Power for {&mu}={&mu}{subscript:0} versus {&mu}={&mu}{subscript:A}") ///
    xtitle("{it: z}") xlabel(-3 -2 -1 0 1 2 3 4 5 6)                            ///
    legend(off)                                                                 ///
    ytitle("Density") yscale(range(0 0.6))                                      ///
    ylabel(0(0.1)0.6, angle(horizontal) nogrid)                                 ///
    text(`mu_label' 0 "{&mu}{subscript:0}", color(red))                         ///
    text(`mu_label' `mu_alt' "{&mu}{subscript:A}", color(green))
  graph export mu_alt_`GraphCounter'.png, as(png) width(1280) height(720) replace

  local ++GraphCounter
  local mu_label = `mu_label' + 0.005
  local power_label = `power_label' + 0.03
}

Just as previously, the above Stata code creates the *.png files that I then combine using Camtasia to produce a gif:

ChangeSampleSize

Let me show you some more examples.

The next example demonstrates the basic idea of lowess smoothing.

sysuse auto
local WindowWidth = 500
forvalues WindowUpper = 2200(25)5000 {
  local WindowLower = `WindowUpper' - `WindowWidth'
  twoway (scatter mpg weight)                                             ///
    (lowess mpg weight if weight < (`WindowUpper'-250), lcolor(green))    ///
    (lfit mpg weight if weight>`WindowLower' & weight<`WindowUpper',      ///
         lwidth(medium) lcolor(red))                                      ///
    , xline(`WindowLower' `WindowUpper', lwidth(medium) lcolor(black))    ///
    legend(on order(1 2 3) cols(3))
  graph export lowess_`WindowUpper'.png, as(png) width(1280) height(720) replace
}

The result is,

lowess

The animated graph I created is not yet a perfect analogy to what lowess actually does, but it comes close. It has two problems. The lowess curve changes outside of the sliding window, which it should not, and the animation does not illustrate the weighting of the points within the window, say by using differently sized markers for the points in the sliding window. Even so, the graph does a far better job than the usual explanation that one should imagine sliding a window across the scatterplot.
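
Here is a rough sketch of one step in that direction. It simply overlays larger markers on the points inside the sliding window so they stand out; it still does not vary marker size with the actual lowess weights, but it makes the window's contents visually distinct. The marker sizes, colors, and output filenames are arbitrary choices of mine.

sysuse auto, clear
local WindowWidth = 500
forvalues WindowUpper = 2200(25)5000 {
  local WindowLower = `WindowUpper' - `WindowWidth'
  twoway (scatter mpg weight, msize(small) mcolor(gs10))                     ///
    (scatter mpg weight if weight>`WindowLower' & weight<`WindowUpper',      ///
         msize(medlarge) mcolor(navy))                                       ///
    (lowess mpg weight if weight < (`WindowUpper'-250), lcolor(green))       ///
    (lfit mpg weight if weight>`WindowLower' & weight<`WindowUpper',         ///
         lwidth(medium) lcolor(red))                                         ///
    , xline(`WindowLower' `WindowUpper', lwidth(medium) lcolor(black))       ///
    legend(off)
  graph export lowess2_`WindowUpper'.png, as(png) width(1280) height(720) replace
}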

As yet another example, we can use animated graphs to demonstrate the concept of convergence. There is a FAQ on the Stata website written by Bill Gould that explains the relationship between the chi-squared and F distributions. The animated graph below shows that the distribution of F(d1, d2) converges to the distribution of χ²(d1)/d1 as d2 goes to infinity:

forvalues df = 1(1)100 {
  twoway function  y=chi2(2,2*x), range(0 6) color(red) ||                                       ///
    function y=F(2,`df',x), range(0 6) color(green)                                              ///
    title("Cumulative distributions for {&chi}{sup:2}{sub:df} and {it:F}{subscript:df,df2}")     ///
    xtitle("{it: denominator df}") xlabel(0 1 2 3 4 5 6) legend(off)                             ///
    text(0.45 4 "df2 = `df'",  size(huge) color(black))                                          ///
    legend(on order(1 "{&chi}{sup:2}{sub:df}" 2 "{it:F}{subscript:df,df2}") cols(2) position(5) ring(0))

  local df = string(`df', "%03.0f")
  graph export converge2_`df'.png, as(png) width(1280) height(720) replace
}

converge2

The t distribution has a similar relationship with the normal distribution.

forvalues df = 1(1)100 {
  twoway  function y=normal(x), range(-3 3) color(red)   ||                     ///
    function y=t(`df',x), range(-3 3) color(green)                              ///
    title("Cumulative distributions for Normal(0,1) and {it:t}{subscript:df}")  ///
    xtitle("{it: t/z}") xlabel(-3 -2 -1 0 1 2 3) legend(off)                    ///
    text(0.45 -2 "df = `df'",  size(huge) color(black))                         ///
    legend(on order(1 "N(0,1)" 2 "{it:t}{subscript:df}") cols(2) position(5) ring(0))

  local df = string(`df', "%03.0f")
  graph export converge_`df'.png, as(png) width(1280) height(720) replace
}

The result is

converge

Final thoughts

I have learned through trial and error two things that improve the quality of my animated graphs. First, note that the axes of the graphs in most of the examples above are explicitly defined in the graph commands. This is often necessary to keep the axes stable from graph to graph. Second, videos have a smoother, higher quality appearance when there are many graphs with very small changes from graph to graph.
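
For example, in the normal-density loop from the beginning of this post, you might pin the y axis and the x labels so they cannot rescale as the mean moves; the specific ranges below are just plausible choices of mine, not anything special.

forvalues mu = 0(0.1)1 {
    local mu = string(`mu', "%3.1f")
    twoway function y=normalden(x,`mu',1), range(-3 6)     ///
        title("N(`mu',1)")                                 ///
        xlabel(-3(1)6)                                     ///
        yscale(range(0 0.45)) ylabel(0(0.1)0.4)
    graph export graph_`mu'.png, as(png) width(1280) height(720) replace
}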

I hope I have convinced you that creating animated graphics with Stata is easier than you imagined. If the old saying that “a picture is worth a thousand words” is true, imagine how many words you can save using animated graphs.

Other resources

FFmpeg

Camtasia

Relationship between chi-squared and F distributions

Robert Grant’s blog and examples

Hans Rosling’s 200 Countries recreated using only Stata

Measures of effect size in Stata 13

Today I want to talk about effect sizes such as Cohen’s d, Hedges’s g, Glass’s Δ, η², and ω². Effect sizes concern rescaling parameter estimates to make them easier to interpret, especially in terms of practical significance.

Many researchers in psychology and education advocate reporting of effect sizes, professional organizations such as the American Psychological Association (APA) and the American Educational Research Association (AERA) strongly recommend their reporting, and professional journals such as the Journal of Experimental Psychology: Applied and Educational and Psychological Measurement require that they be reported.

Anyway, today I want to show you

  1. What effect sizes are.
  2. How to calculate effect sizes and their confidence intervals in Stata.
  3. How to calculate bootstrap confidence intervals for those effect sizes.
  4. How to use Stata’s effect-size calculator.

1. What are effect sizes?

The importance of research results is often assessed by statistical significance, usually that the p-value is less than 0.05. P-values and statistical significance, however, don’t tell us anything about practical significance.

What if I told you that I had developed a new weight-loss pill and that the difference between the average weight loss for people who took the pill and those who took a placebo was statistically significant? Would you buy my new pill? If you were overweight, you might reply, “Of course! I’ll take two bottles and a large order of french fries to go!”. Now let me add that the average difference in weight loss was only one pound over the year. Still interested? My results may be statistically significant but they are not practically significant.

Or what if I told you that the difference in weight loss was not statistically significant — the p-value was “only” 0.06 — but the average difference over the year was 20 pounds? You might very well be interested in that pill.

The size of the effect tells us about the practical significance. P-values do not assess practical significance.

All of which is to say, one should report parameter estimates along with statistical significance.

In my examples above, you knew that 1 pound over the year is small and 20 pounds is large because you are familiar with human weights.

In another context, 1 pound might be large, and in yet another, 20 pounds small.

Formal measures of effect size are thus usually presented in unit-free but easy-to-interpret form, such as standardized differences and proportions of variability explained.

The “d” family

Effect sizes that measure the scaled difference between means belong to the “d” family. The generic formula is

\[\delta = \frac{\mu_1 - \mu_2}{\sigma}\]

The estimators differ in how \(\sigma\) is calculated.

Cohen’s d, for instance, uses the pooled sample standard deviation.
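
For instance, here is a minimal sketch of that calculation done by hand, assuming a hypothetical outcome y and a 0/1 grouping variable named group (the esize command shown later does all of this for you):

. quietly summarize y if group == 0
. scalar m1 = r(mean)
. scalar s1 = r(sd)
. scalar n1 = r(N)
. quietly summarize y if group == 1
. scalar m2 = r(mean)
. scalar s2 = r(sd)
. scalar n2 = r(N)
. scalar sp = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2)/(n1 + n2 - 2))   // pooled SD
. display "Cohen's d = " (m1 - m2)/sp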

Hedges’s g incorporates an adjustment which removes the bias of Cohen’s d.

Glass’s Δ was originally developed in the context of experiments and uses the “control group” standard deviation in the denominator. It has subsequently been generalized to nonexperimental studies. Because there is no control group in observational studies, Kline (2013) recommends reporting Glass’s Δ using the standard deviation for each group. Glass’s Delta_1 uses one group’s standard deviation and Delta_2 uses the other group’s.

Although I have given definitions to Cohen’s d, Hedges’s g, and Glass’s Δ, different authors swap the definitions around! As a result, many authors refer to all of the above as just Delta.

Be careful when using software to know which Delta you are getting. I have used Stata terminology, of course.

Anyway, the use of a standardized scale allows us to assess practical significance. A delta of 1.5 indicates that the mean of one group is 1.5 standard deviations higher than that of the other. A difference of 1.5 standard deviations is obviously large, and a difference of 0.1 standard deviations is obviously small.

The “r” family

The r family quantifies the ratio of the variance attributable to an effect to the total variance and is often interpreted as the “proportion of variance explained”. The generic estimator is known as eta-squared,

\[\eta^2 = \frac{\sigma^2_{\text{effect}}}{\sigma^2_{\text{total}}}\]

η2 is equivalent to the R-squared statistic from linear regression.

ω2 is a less biased variation of η2 that is equivalent to the adjusted R-squared.
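
A quick way to see these equivalences for yourself is to fit the model with regress and then ask for the effect sizes. This is a small sketch with hypothetical variables y and x; the ANOVA example below shows the same pattern, where the overall η2 and ω2 equal the R-squared (0.267) and adjusted R-squared (0.114) reported by anova.

. regress y i.x          // header reports R-squared and Adj R-squared
. estat esize            // overall Eta-Squared equals that R-squared
. estat esize, omega     // overall Omega-Squared equals the Adj R-squared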

Both of these measures concern the entire model.

Partial η2 and partial ω2 are like partial R-squareds and concern individual terms in the model. A term might be a variable or a variable and its interaction with another variable.

Both the d and r families allow us to make an apples-to-apples comparison of variables measured on different scales. For example, an intervention could affect both systolic blood pressure and total cholesterol. Comparing the relative effect of the intervention on the two outcomes would be difficult on their original scales.

How does one compare mm Hg and mg/dL? It is straightforward in terms of Cohen’s d or ω2 because then we are comparing standard-deviation changes or proportions of variance explained.

2. How to calculate effect sizes and their confidence intervals in Stata

Consider a study in which 30 school children are randomly assigned to classrooms that incorporated web-based instruction (treatment) or standard classroom environments (control). At the end of the school year, the children were given tests to measure reading and mathematics skills. The reading test is scored on a 0-15 point scale and the mathematics test on a 0-100 point scale.

Let’s download a dataset for our fictitious example from the Stata website by typing:

. use http://www.stata.com/videos13/data/webclass.dta

Contains data from http://www.stata.com/videos13/data/webclass.dta
  obs:            30                          Fictitious web-based learning 
                                                experiment data
 vars:             5                          5 Sep 2013 11:28
 size:           330                          (_dta has notes)
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
id              byte    %9.0g                 ID Number
treated         byte    %9.0g      treated    Treatment Group
agegroup        byte    %9.0g      agegroup   Age Group
reading         float   %9.0g                 Reading Score
math            float   %9.0g                 Math Score
-------------------------------------------------------------------------------

. notes

_dta:
  1.  Variable treated records 0=control, 1=treated.
  2.  Variable agegroup records 1=7 years old, 2=8 years old, 3=9 years old.

We can compute a t-statistic to test the null hypothesis that the average math scores are the same in the treatment and control groups.

. ttest math, by(treated)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 Control |      15    69.98866    3.232864    12.52083    63.05485    76.92246
 Treated |      15    79.54943    1.812756    7.020772    75.66146     83.4374
---------+--------------------------------------------------------------------
combined |      30    74.76904    2.025821    11.09588    70.62577    78.91231
---------+--------------------------------------------------------------------
    diff |           -9.560774    3.706412               -17.15301   -1.968533
------------------------------------------------------------------------------
    diff = mean(Control) - mean(Treated)                          t =  -2.5795
Ho: diff = 0                                     degrees of freedom =       28

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0077         Pr(|T| > |t|) = 0.0154          Pr(T > t) = 0.9923

The treated students have a larger mean, yet the difference of -9.56 is reported as negative because -ttest- calculated Control minus Treated. So just remember, negative differences mean Treated > Control in this case.

The t-statistic equals -2.58 and its two-sided p-value of 0.0154 indicates that the difference between the math scores in the two groups is statistically significant.

Next, let’s calculate effect sizes from the d family:

. esize twosample math, by(treated) cohensd hedgesg glassdelta

Effect size based on mean comparison

                                   Obs per group:
                                         Control =     15
                                         Treated =     15
---------------------------------------------------------
        Effect Size |   Estimate     [95% Conf. Interval]
--------------------+------------------------------------
          Cohen's d |  -.9419085    -1.691029   -.1777553
         Hedges's g |   -.916413    -1.645256   -.1729438
    Glass's Delta 1 |  -.7635896     -1.52044    .0167094
    Glass's Delta 2 |  -1.361784    -2.218342   -.4727376
---------------------------------------------------------

Cohen’s d and Hedges’s g both indicate that the average math scores differ by roughly -0.9 standard deviations, with 95% confidence intervals of (-1.69, -0.18) and (-1.65, -0.17), respectively.

Since this is an experiment, we are interested in Glass’s Delta 1 because it is calculated using the control-group standard deviation. Average math scores differ by -0.76 standard deviations, and the confidence interval is (-1.52, 0.02).

The confidence intervals for Cohen’s d and Hedges’s g do not include the null value of zero but the confidence interval for Glass’s Delta 1 does. Thus we cannot completely rule out the possibility that the treatment had no effect on math scores.

Next we could incorporate the age group of the children into our analysis by using a two-way ANOVA to test the null hypothesis that the mean math scores are equal for all groups.

. anova math treated##agegroup

                           Number of obs =      30     R-squared     =  0.2671
                           Root MSE      = 10.4418     Adj R-squared =  0.1144

                  Source |  Partial SS    df       MS           F     Prob > F
        -----------------+----------------------------------------------------
                   Model |  953.697551     5   190.73951       1.75     0.1617
                         |
                 treated |  685.562956     1  685.562956       6.29     0.0193
                agegroup |  47.7059268     2  23.8529634       0.22     0.8051
        treated#agegroup |  220.428668     2  110.214334       1.01     0.3789
                         |
                Residual |  2616.73825    24  109.030761
        -----------------+----------------------------------------------------
                   Total |   3570.4358    29  123.118476

The F-statistic for the entire model is not statistically significant (F=1.75, ndf=5, ddf=24, p=0.1617) but the F-statistic for the main effect of treatment is statistically significant (F=6.29, ndf=1, ddf=24, p=0.0193).

We can compute the η2 and partial η2 estimates for this model using the estat esize command immediately after our anova command (note that estat esize works after the regress command too).

. estat esize

Effect sizes for linear models

---------------------------------------------------------------------
               Source |   Eta-Squared     df     [95% Conf. Interval]
----------------------+----------------------------------------------
                Model |   .2671096         5            0    .4067062
                      |
              treated |   .2076016         1     .0039512    .4451877
             agegroup |   .0179046         2            0    .1458161
     treated#agegroup |   .0776932         2            0     .271507
---------------------------------------------------------------------

The overall η2 indicates that our model accounts for approximately 26.7% of the variability in math scores, though the 95% confidence interval includes the null value of zero (0.0%, 40.7%). The partial η2 for treatment is 0.21 (21% of the variability explained), and its 95% confidence interval excludes zero (0.4%, 44.5%).

We could calculate the alternative r-family member ω2 rather than η2 by typing

. estat esize, omega

Effect sizes for linear models

---------------------------------------------------------------------
               Source | Omega-Squared     df     [95% Conf. Interval]
----------------------+----------------------------------------------
                Model |   .1144241         5            0    .2831033
                      |
              treated |    .174585         1            0    .4220705
             agegroup |          0         2            0    .0746342
     treated#agegroup |   .0008343         2            0    .2107992
---------------------------------------------------------------------

The overall ω2 indicates that our model accounts for approximately 11.4% of the variability in math scores and treatment accounts for 17.5%. This perplexing result stems from the way that ω2 and partial ω2 are calculated. See Pierce, Block, & Aguinis (2004) for a thorough explanation.

Except for the partial η2 for treatment, the confidence intervals include 0, so we cannot rule out the possibility that there is no effect. Whether results are practically significant is generally a matter of context and opinion. In some situations, accounting for 5% of the variability in an outcome could be very important; in others, accounting for 30% may not be.

We could repeat the same analyses for the reading scores using the following commands:

. ttest reading, by(treated)
. esize twosample reading, by(treated) cohensd hedgesg glassdelta
. anova reading treated##agegroup
. estat esize
. estat esize, omega

None of the t- or F-statistics for reading scores were statistically significant at the 0.05 level.

Even though the reading and math scores were measured on two different scales, we can directly compare the relative effect of the treatment using effect sizes:

        Effect Size   |     Reading Score            Math Score
        --------------------------------------------------------------
        Cohen's d     |   -0.23 (-0.95, 0.49)    -0.94 (-1.69, -0.18)
        Hedges's g    |   -0.22 (-0.92, 0.48)    -0.92 (-1.65, -0.17)
        Glass's Delta |   -0.21 (-0.93, 0.51)    -0.76 (-1.52,  0.02)
        Eta-squared   |    0.02 ( 0.00, 0.20)     0.21 ( 0.00,  0.44)
        Omega-squared |    0.00 ( 0.00, 0.17)     0.17 ( 0.00,  0.42)

The results show that the average reading scores in the treated and control groups differ by approximately 0.22 standard deviations while the average math scores differ by approximately 0.92 standard deviations. Similarly, treatment status accounted for almost none of the variability in reading scores while it accounted for roughly 17% of the variability in math scores. The intervention clearly had a larger effect on math scores than on reading scores. We also know that we cannot completely rule out an effect size of zero (no effect) for either reading or math scores because several confidence intervals included zero. Whether or not the effects are practically significant is a matter of interpretation, but the effect sizes provide a standardized metric for evaluation.
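
If you would rather assemble a comparison like this programmatically, the stored results from esize can be collected in a short loop. This sketch uses r(d) and r(g), the same stored results used with the bootstrap prefix below:

foreach outcome in reading math {
    quietly esize twosample `outcome', by(treated) cohensd hedgesg
    display "`outcome':  Cohen's d = " %6.3f r(d) "   Hedges's g = " %6.3f r(g)
}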

3. How to calculate bootstrap confidence intervals

Simulation studies have shown that bootstrap confidence intervals for the d family may be preferable to confidence intervals based on the noncentral t distribution when the variable of interest does not have a normal distribution (Kelley 2005; Algina, Keselman, and Penfield 2006). We can calculate bootstrap confidence intervals for Cohen’s d and Hedges’s g using Stata’s bootstrap prefix:

. bootstrap r(d) r(g), reps(500) nowarn:  esize twosample reading, by(treated)
(running esize on estimation sample)

Bootstrap replications (500)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
..................................................   150
..................................................   200
..................................................   250
..................................................   300
..................................................   350
..................................................   400
..................................................   450
..................................................   500

Bootstrap results                               Number of obs      =        30
                                                Replications       =       500

      command:  esize twosample reading, by(treated)
        _bs_1:  r(d)
        _bs_2:  r(g)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |   -.228966   .3905644    -0.59   0.558    -.9944582    .5365262
       _bs_2 |  -.2227684   .3799927    -0.59   0.558    -.9675403    .5220036
------------------------------------------------------------------------------

The bootstrap estimate of the 95% confidence interval for Cohen’s d is -0.99 to 0.54 which is slightly wider than the earlier estimate based on the non-central t distribution (see [R] esize for details). The bootstrap estimate is slightly wider for Hedges’s g as well.
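
The table above reports normal-based intervals. If you prefer percentile or bias-corrected bootstrap intervals, you can replay the results with estat bootstrap while the bootstrap estimates are still in memory, for example:

. estat bootstrap, percentile bc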

4. How to use Stata’s effect-size calculator

You can use Stata’s effect-size calculator, esizei, to estimate effect sizes from summary statistics. If we know that the sample size, mean, and standard deviation are 15, 70, and 12.5 for one group and 15, 80, and 7 for another, we can estimate effect sizes from the d family:

. esizei 15 70 12.5 15 80 7, cohensd hedgesg glassdelta

Effect size based on mean comparison

                                   Obs per group:
                                         Group 1 =     15
                                         Group 2 =     15
---------------------------------------------------------
        Effect Size |   Estimate     [95% Conf. Interval]
--------------------+------------------------------------
          Cohen's d |  -.9871279    -1.739873   -.2187839
         Hedges's g |  -.9604084    -1.692779   -.2128619
    Glass's Delta 1 |        -.8    -1.561417   -.0143276
    Glass's Delta 2 |  -1.428571    -2.299112   -.5250285
---------------------------------------------------------

We can estimate effect sizes from the r family using esizei with slightly different syntax: we supply the numerator and denominator degrees of freedom along with the F statistic. For example, the two-sample t test above reported t = -2.58 with 28 degrees of freedom, and squaring that t statistic gives an equivalent F(1, 28) = 6.65, so we can calculate η2 and ω2 using the following command:

. esizei 1 28 6.65

Effect sizes for linear models

---------------------------------------------------------
        Effect Size |   Estimate     [95% Conf. Interval]
--------------------+------------------------------------
        Eta-Squared |   .1919192     .0065357    .4167874
      Omega-Squared |   .1630592            0    .3959584
---------------------------------------------------------

Video demonstration

Stata has dialog boxes that can assist you in calculating effect sizes. If you would like a brief introduction using the GUI, you can watch a demonstration on Stata’s YouTube Channel:

Tour of effect sizes in Stata

Final thoughts and further reading

Most older papers and many current papers do not report effect sizes. Nowadays, the general consensus among behavioral scientists, their professional organizations, and their journals is that effect sizes should always be reported in addition to tests of statistical significance. Stata 13 now makes it easy to compute the most popular effect sizes.

Some methodologists believe that effect sizes with confidence intervals should always be reported and that statistical hypothesis tests should be abandoned altogether; see Cumming (2012) and Kline (2013). While this may sound like a radical notion, other fields such as epidemiology have been moving in this direction since the 1990s. Cumming and Kline offer compelling arguments for this paradigm shift as well as excellent introductions to effect sizes.

American Psychological Association (2009). Publication Manual of the American Psychological Association, 6th Ed. Washington, DC: American Psychological Association.

Algina, J., H. J. Keselman, and R. D. Penfield. (2006). Confidence interval coverage for Cohen’s effect size statistic. Educational and Psychological Measurement, 66(6): 945–960.

Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Taylor & Francis.

Kelley, K. (2005). The effects of nonnormal distributions on confidence intervals around the standardized mean difference: Bootstrap and parametric confidence intervals. Educational and Psychological Measurement, 65: 51–69.

Kirk, R. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56: 746–759.

Kline, R. B. (2013). Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. 2nd ed. Washington, DC: American Psychological Association.

Pierce, C. A., Block, R. A., and Aguinis, H. (2004). Cautionary note on reporting eta-squared values from multifactor ANOVA designs. Educational and Psychological Measurement, 64(6): 916–924.

Thompson, B. (1996). AERA Editorial Policies regarding Statistical Significance Testing: Three Suggested Reforms. Educational Researcher, 25(2): 26–30.

Wilkinson, L., & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54: 594–604.

Update on the Stata YouTube Channel

What is it about round numbers that compels us to pause and reflect? We celebrate 20-year school reunions, 25-year wedding anniversaries, 50th birthdays, and other similar milestones. I don’t know the answer, but the Stata YouTube Channel recently passed several milestones of its own: more than 1,500 subscribers, over 50,000 video views, and six months since its launch. We felt the need for a small celebration to mark the occasion, and I thought I would give you a brief update.

I could tell you about re-recording the original 24 videos with a larger font to make them easier to read. I could tell you about the hardware and software that we use to record them including our experiments with various condenser and dynamic microphones. I could share quotes from some of the nice messages we’ve received. But I think it would be more fun to talk about….you!

YouTube collects data about the number of views each video receives as well as summary data about who, what, when, where, and how you are watching them. There is no need to be concerned about your privacy; there are no personal identifiers of any kind associated with these data. But the summary data are interesting, and I thought it might be fun to share some of the data with you.

Who’s watching?

Figure1

Figure 1 shows the age distribution of Stata YouTube Channel viewers. If you have ever attended a Stata Conference, you will not be surprised by this graph…until you notice the age group at the bottom. I would not have guessed that 13-17 year olds are watching our videos. Perhaps they saw Stata in the movie “Moneyball” with Brad Pitt and wanted to learn more. Or maybe they were influenced by the latest fashion craze sweeping the youth of the world.

What are you watching?

Figure2

We have posted more than 50 videos over a wide range of topics. Figure 2 shows the total number of views for the ten most popular videos. The more popular of the ten are about broad topics. These broader videos are mostly older and have thus had time to accumulate more views.

Even so, these videos receive more views per day currently than do the special topic videos that have been posted more recently. This supports my belief that Stata YouTube Channel viewers tend to be relatively new Stata users who want to learn about general topics, and that means more generic videos in the future. So you and your two post-docs will just have to read the manual if you want to learn how to fit asymmetric power ARCH models with outer-product gradient standard errors.

When are you watching?

Figure3

We usually post new videos on Tuesday mornings which might lead you to believe that the peak viewing day would also be Tuesday. Figure 3, however, shows us that the average number of views per day (vpd) is higher on Wednesdays at 420 vpd and in fact peaks on Thursdays at 430 vpd before declining Friday through Sunday.

Figure4

Figure 4 also shows us that late September may not have been the best time to launch the Stata YouTube Channel. Our early momentum in September and October slowed during the November and December holiday season. We were, however, pleased to see that 49 of you spent New Year’s Eve watching our videos. Perhaps next year we’ll prepare something more festive just for you!

Where are you watching?

What do the Czech Republic, Pakistan, Uganda, Madagascar, the United Kingdom, the Bahamas, the United States, Montenegro, and Italy have in common? Correct! They are all countries in which you are watching our videos. They are also locations depicted in one of my favorite action films but I’ll leave that to the trivia buffs. I think the most exciting information that we found in our data is that the Stata YouTube Channel is being viewed in 164 countries!

Figure5

You might not be surprised to learn that roughly half of the people watching the videos live in the United States, the United Kingdom, or Canada. The results may be unexpected when we consider the “view rate” defined as the number of views per 100,000 residents. Figure 5 shows the top 20 countries ranked by view rate for countries with at least four million residents. Denmark had the highest view rate which was nearly twice the rate of Norway which had the second highest view rate. The view rate in Denmark was more than three times the rate in the US and the UK.

How are you watching?

You might think that I wouldn’t have anything to report about “how” you are watching the videos, but it turns out that 5.2% of you are watching on mobile devices. Perhaps this explains the 13-17 year old demographic or the 49 people watching on New Year’s Eve. Or maybe we are helping you pass the time in the dentist’s office waiting room.

Final thoughts

Six months isn’t much of a milestone. We Stata folk will use any excuse to break out the cake and ice cream. Even so, the Stata YouTube Channel began as an experiment and often experiments do not work out as we would like. This experiment has exceeded our expectations and, as a result, we have started taking requests for videos on our Facebook page and we’ll be adding more videos every week. So thanks for watching and stay tuned!

Now if you will excuse me, I’m going to get some cake and ice cream.

Categories: Resources

Multilevel linear models in Stata, part 2: Longitudinal data

In my last posting, I introduced you to the concepts of hierarchical or “multilevel” data. In today’s post, I’d like to show you how to use multilevel modeling techniques to analyse longitudinal data with Stata’s xtmixed command.

Last time, we noticed that our data had two features. First, we noticed that the means within each level of the hierarchy were different from each other and we incorporated that into our data analysis by fitting a “variance component” model using Stata’s xtmixed command.

The second feature that we noticed is that the repeated measurements of GSP showed an upward trend. We’ll pick up where we left off last time. I’ll again stick to the concepts, and you can refer to the references at the end to learn more about the details.

The videos

Stata has a very friendly dialog box that can assist you in building multilevel models. If you would like a brief introduction using the GUI, you can watch a demonstration on Stata’s YouTube Channel:

Introduction to multilevel linear models in Stata, part 2: Longitudinal data

Longitudinal data

I’m often asked by beginning data analysts, “What’s the difference between longitudinal data and time-series data? Aren’t they the same thing?”

The confusion is understandable — both types of data involve some measurement of time. But the answer is no, they are not the same thing.

Univariate time series data typically arise from the collection of many data points over time from a single source, such as from a person, country, financial instrument, etc.

Longitudinal data typically arise from collecting a few observations over time from many sources, such as a few blood pressure measurements from many people.

There are some multivariate time series that blur this distinction but a rule of thumb for distinguishing between the two is that time series have more repeated observations than subjects while longitudinal data have more subjects than repeated observations.

Because our GSP data from last time involve 17 measurements from 48 states (more sources than measurements), we will treat them as longitudinal data.

GSP Data: http://www.stata-press.com/data/r12/productivity.dta
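
If you would like to verify this structure yourself, a quick sketch (not part of the original analysis) is to declare the panel structure and describe it:

. use http://www.stata-press.com/data/r12/productivity.dta, clear
. xtset state year
. xtdescribe

xtdescribe will confirm that each of the 48 states is observed in all 17 years, 1970-1986.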

Random intercept models

As I mentioned last time, repeated observations on a group of individuals can be conceptualized as multilevel data and modeled just as any other multilevel data. We left off last time with a variance component model for GSP (Gross State Product, logged) and noted that our model assumed a constant GSP over time while the data showed a clear upward trend.

Graph3

If we consider a single observation and think about our model, nothing in the fixed or random part of the model is a function of time.

Slide15

Let’s begin by adding the variable year to the fixed part of our model.

Slide16

As we expected, our grand mean has become a linear regression which more accurately reflects the change over time in GSP. What might be unexpected is that each state’s and region’s mean has changed as well and now has the same slope as the regression line. This is because none of the random components of our model are a function of time. Let’s fit this model with the xtmixed command:

. xtmixed gsp year, || region: || state:

------------------------------------------------------------------------------
         gsp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        year |   .0274903   .0005247    52.39   0.000     .0264618    .0285188
       _cons |  -43.71617   1.067718   -40.94   0.000    -45.80886   -41.62348
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
region: Identity             |
                   sd(_cons) |   .6615238   .2038949      .3615664    1.210327
-----------------------------+------------------------------------------------
state: Identity              |
                   sd(_cons) |   .7805107   .0885788      .6248525    .9749452
-----------------------------+------------------------------------------------
                sd(Residual) |   .0734343   .0018737      .0698522    .0772001
------------------------------------------------------------------------------

The fixed part of our model now displays an estimate of the intercept (_cons = -43.7) and the slope (year = 0.027). Let’s graph the model for Region 7 and see if it fits the data better than the variance component model.

predict GrandMean, xb
label var GrandMean "GrandMean"
predict RegionEffect, reffects level(region)
predict StateEffect, reffects level(state)
gen RegionMean = GrandMean + RegionEffect
gen StateMean = GrandMean + RegionEffect + StateEffect

twoway  (line GrandMean year, lcolor(black) lwidth(thick))      ///
        (line RegionMean year, lcolor(blue) lwidth(medthick))   ///
        (line StateMean year, lcolor(green) connect(ascending)) ///
        (scatter gsp year, mcolor(red) msize(medsmall))         ///
        if region ==7,                                          ///
        ytitle(log(Gross State Product), margin(medsmall))      ///
        legend(cols(4) size(small))                             ///
        title("Multilevel Model of GSP for Region 7", size(medsmall))

Graph4

That looks like a much better fit than our variance-components model from last time. Perhaps I should leave well enough alone, but I can’t help noticing that the slopes of the green lines for each state don’t fit as well as they could. The top green line fits nicely, but the second from the top slopes upward more than it needs to. That’s the best fit we can achieve if the regression lines are forced to be parallel to each other. But what if the lines were not forced to be parallel? What if we could fit a “mini-regression model” for each state within the context of our overall multilevel model? Well, good news: we can!

Random slope models

By introducing the variable year to the fixed part of the model, we turned our grand mean into a regression line. Next I’d like to incorporate the variable year into the random part of the model. By introducing a fourth random component that is a function of time, I am effectively estimating a separate regression line within each state.
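
Written out, with subscripting of my own choosing that is meant to be consistent with the slides (\(u_{0ij.}\) and \(u_{1ij.}\) denote the state-level random intercept and random slope), the random-slope model is

\[\text{gsp}_{ijk} = \beta_0 + \beta_1\,\text{year}_{ijk} + u_{i..} + u_{0ij.} + u_{1ij.}\,\text{year}_{ijk} + e_{ijk}\]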

Slide19

Notice that the size of the new, brown deviation \(u_{1ij.}\) is a function of time. If the observation were one year to the left, \(u_{1ij.}\) would be smaller; if it were one year to the right, \(u_{1ij.}\) would be larger.

It is common to “center” the time variable before fitting these kinds of models. Explaining why is a topic for another day, but the quick answer is that, at some point during the fitting of the model, Stata will have to compute the equivalent of the inverse of the square of year. For the year 1986, this turns out to be 2.535e-07. That’s a fairly small number, and if we multiply it by another small number…well, you get the idea. By centering year (e.g., cyear = year - 1978), we get a more reasonable number for 1986 (about 0.016). (Hint: If you have problems with your model converging and you have large values for time, try centering them. It won’t always help, but it might.)
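
You can see the difference with a quick calculation:

. display 1/(1986^2)           // 2.535e-07
. display 1/((1986 - 1978)^2)  // .015625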

So let’s center our year variable by subtracting 1978 and fit a model that includes a random slope.

gen cyear = year - 1978
xtmixed gsp cyear, || region: || state: cyear, cov(indep)

Slide21

I’ve color-coded the output so that we can match each part of the output back to the model and the graph. The fixed part of the model appears in the top table, and it looks like any other simple linear regression model. The random part of the model is definitely more complicated. If you get lost, look back at the graphic of the deviations and remind yourself that we have simply partitioned the deviation of each observation into four components. If we did this for every observation, the standard deviations in our output would summarize the typical size of those deviations at each level.

Let’s look at a graph of our new “random slope” model for Region 7 and see how well it fits our data.

predict GrandMean, xb
label var GrandMean "GrandMean"
predict RegionEffect, reffects level(region)
predict StateEffect_year StateEffect_cons, reffects level(state)

gen RegionMean = GrandMean + RegionEffect
gen StateMean_cons = GrandMean + RegionEffect + StateEffect_cons
gen StateMean_year = GrandMean + RegionEffect + StateEffect_cons + ///
                     (cyear*StateEffect_year)

twoway  (line GrandMean cyear, lcolor(black) lwidth(thick))             ///
        (line RegionMean cyear, lcolor(blue) lwidth(medthick))          ///
        (line StateMean_cons cyear, lcolor(green) connect(ascending))   ///
        (line StateMean_year cyear, lcolor(brown) connect(ascending))   ///
        (scatter gsp cyear, mcolor(red) msize(medsmall))                ///
        if region ==7,                                                  ///
        ytitle(log(Gross State Product), margin(medsmall))              ///
        legend(cols(3) size(small))                                     ///
        title("Multilevel Model of GSP for Region 7", size(medsmall))

Graph6

The top brown line fits the data slightly better, but the brown line below it (second from the top) is a much better fit. Mission accomplished!
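
If you want a more formal check that the random slope improves the fit, one option (a sketch that goes a bit beyond this walkthrough; the stored-estimate names ri and rs are mine) is a likelihood-ratio test comparing the two models:

quietly xtmixed gsp cyear, || region: || state:
estimates store ri
quietly xtmixed gsp cyear, || region: || state: cyear, cov(indep)
estimates store rs
lrtest rs ri

Keep in mind that the null hypothesis places a variance on the boundary of its parameter space, so the usual chi-squared reference distribution makes this test conservative.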

Where do we go from here?

I hope I have been able to convince you that multilevel modeling is easy using Stata’s xtmixed command and that this is a tool that you will want to add to your kit. I would love to say something like “And that’s all there is to it. Go forth and build models!”, but I would be remiss if I didn’t point out that I have glossed over many critical topics.

In our GSP example, we would still like to consider the impact of other independent variables. I haven’t mentioned the choice of estimation method (ML or REML in the case of xtmixed). I’ve assessed the fit of our models by looking at graphs, an approach that is important but incomplete. We haven’t thought about hypothesis testing. Oh, and all the usual residual diagnostics for linear regression, such as checking for outliers, influential observations, heteroskedasticity, and normality, still apply…times four! But now that you understand the concepts and some of the mechanics, it shouldn’t be difficult to fill in the details. If you’d like to learn more, check out the links below.

I hope this was helpful…thanks for stopping by.

For more information

If you’d like to learn more about modeling multilevel and longitudinal data, check out

Multilevel and Longitudinal Modeling Using Stata, Third Edition
Volume I: Continuous Responses
Volume II: Categorical Responses, Counts, and Survival
by Sophia Rabe-Hesketh and Anders Skrondal

or sign up for our popular public training course Multilevel/Mixed Models Using Stata.

Multilevel linear models in Stata, part 1: Components of variance

In the last 15-20 years multilevel modeling has evolved from a specialty area of statistical research into a standard analytical tool used by many applied researchers.

Stata has a lot of multilevel modeling capabilities.

I want to show you how easy it is to fit multilevel models in Stata. Along the way, we’ll unavoidably introduce some of the jargon of multilevel modeling.

I’m going to focus on concepts and ignore many of the details that would be part of a formal data analysis. I’ll give you some suggestions for learning more at the end of the post.

The videos

Stata has a friendly dialog box that can assist you in building multilevel models. If you would like a brief introduction using the GUI, you can watch a demonstration on Stata’s YouTube Channel:

Introduction to multilevel linear models in Stata, part 1: The xtmixed command

Multilevel data

Multilevel data are characterized by a hierarchical structure. A classic example is children nested within classrooms and classrooms nested within schools. The test scores of students within the same classroom may be correlated due to exposure to the same teacher or textbook. Likewise, the average test scores of classes might be correlated within a school due to the similar socioeconomic level of the students.

You may have run across datasets with these kinds of structures in your own work. For our example, I would like to use a dataset that has both longitudinal and classical hierarchical features. You can access this dataset from within Stata by typing the following command:

use http://www.stata-press.com/data/r12/productivity.dta

We are going to build a model of gross state product for 48 states in the USA measured annually from 1970 to 1986. The states have been grouped into nine regions based on their economic similarity. For distributional reasons, we will be modeling the logarithm of annual Gross State Product (GSP) but in the interest of readability, I will simply refer to the dependent variable as GSP.

. describe gsp year state region

              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------
gsp             float  %9.0g                  log(gross state product)
year            int    %9.0g                  years 1970-1986
state           byte   %9.0g                  states 1-48
region          byte   %9.0g                  regions 1-9

Let’s look at a graph of these data to see what we’re working with.

twoway (line gsp year, connect(ascending)), ///
        by(region, title("log(Gross State Product) by Region", size(medsmall)))

graph1

Each line represents the trajectory of a state’s (log) GSP over the years 1970 to 1986. The first thing I notice is that the groups of lines are different in each of the nine regions. Some groups of lines seem higher and some groups seem lower. The second thing that I notice is that the slopes of the lines are not the same. I’d like to incorporate those attributes of the data into my model.

Components of variance

Let’s tackle the vertical differences in the groups of lines first. If we think about the hierarchical structure of these data, I have repeated observations nested within states which are in turn nested within regions. I used color to keep track of the data hierarchy.

slide2

We could compute the mean GSP within each state and note that the observations within each state vary about their state mean.

slide3

Likewise, we could compute the mean GSP within each region and note that the state means vary about their regional mean.

slide4

We could also compute a grand mean and note that the regional means vary about the grand mean.

slide5

Next, let’s introduce some notation to help us keep track of our multilevel structure. In the jargon of multilevel modeling, the repeated measurements of GSP are described as “level 1”, the states are referred to as “level 2”, and the regions are “level 3”. I can add a three-part subscript to each observation to keep track of its place in the hierarchy.

slide7

Now let’s think about our model. The simplest regression model is the intercept-only model which is equivalent to the sample mean. The sample mean is the “fixed” part of the model and the difference between the observation and the mean is the residual or “random” part of the model. Econometricians often prefer the term “disturbance”. I’m going to use the symbol μ to denote the fixed part of the model. μ could represent something as simple as the sample mean or it could represent a collection of independent variables and their parameters.

slide8

Each observation can then be described in terms of its deviation from the fixed part of the model.
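
In symbols, each observation is just the fixed part plus its deviation; using \(\varepsilon_{ijk}\) (a symbol of my own choosing) for that total deviation, we have

\[y_{ijk} = \mu + \varepsilon_{ijk}\]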

slide9

If we computed this deviation of each observation, we could estimate the variability of those deviations. Let’s try that for our data using Stata’s xtmixed command to fit the model:

. xtmixed gsp

Mixed-effects ML regression                     Number of obs      =       816

                                                Wald chi2(0)       =         .
Log likelihood = -1174.4175                     Prob > chi2        =         .

------------------------------------------------------------------------------
         gsp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   10.50885   .0357249   294.16   0.000     10.43883    10.57887
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
                sd(Residual) |   1.020506   .0252613      .9721766    1.071238
------------------------------------------------------------------------------

The top table in the output shows the fixed part of the model which looks like any other regression output from Stata, and the bottom table displays the random part of the model. Let’s look at a graph of our model along with the raw data and interpret our results.

predict GrandMean, xb
label var GrandMean "GrandMean"
twoway  (line GrandMean year, lcolor(black) lwidth(thick))              ///
        (scatter gsp year, mcolor(red) msize(tiny)),                    ///
        ytitle(log(Gross State Product), margin(medsmall))              ///
        legend(cols(4) size(small))                                     ///
        title("GSP for 1970-1986 by Region", size(medsmall))

graph1b

The thick black line in the center of the graph is the estimate of _cons, which is an estimate of the fixed part of the model for GSP. In this simple model, _cons is the sample mean, which is equal to 10.51. In the “Random-effects Parameters” section of the output, sd(Residual) is the average vertical distance between each observation (the red dots) and the fixed part of the model (the black line). In this model, sd(Residual) is the estimate of the sample standard deviation, which equals 1.02.

At this point you may be thinking to yourself – “That’s not very interesting – I could have done that with Stata’s summarize command”. And you would be correct.

. summ gsp

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         gsp |       816    10.50885    1.021132    8.37885   13.04882

But here’s where it does become interesting. Let’s make the random part of the model more complex to account for the hierarchical structure of the data. Consider a single observation, \(y_{ijk}\), and take another look at its residual.

slide11

The observation deviates from its state mean by an amount that we will denote \(e_{ijk}\). The state mean deviates from the regional mean by an amount that we will denote \(u_{ij.}\), and the regional mean deviates from the fixed part of the model, \(\mu\), by an amount that we will denote \(u_{i..}\). We have partitioned the observation’s residual into three parts, or “components”, that describe its magnitude relative to the state, regional, and grand means. If we calculated this set of residuals for each observation, we could estimate the variability of those residuals and make distributional assumptions about them.
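
Putting the pieces together (writing the equation out explicitly; this is a reconstruction consistent with the notation just described), a single observation is

\[y_{ijk} = \mu + u_{i..} + u_{ij.} + e_{ijk}\]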

slide12

These kinds of models are often called “variance component” models because they estimate the variability accounted for by each level of the hierarchy. We can estimate a variance component model for GSP using Stata’s xtmixed command:

xtmixed gsp, || region: || state:

------------------------------------------------------------------------------
         gsp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   10.65961   .2503806    42.57   0.000     10.16887    11.15035
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
region: Identity             |   
                   sd(_cons) |   .6615227   .2038944       .361566    1.210325
-----------------------------+------------------------------------------------
state: Identity              |   
                   sd(_cons) |   .7797837   .0886614      .6240114    .9744415
-----------------------------+------------------------------------------------
                sd(Residual) |   .1570457   .0040071       .149385    .1650992
------------------------------------------------------------------------------

The fixed part of the model, _cons, is still the sample mean. But now there are three parameter estimates in the bottom table labeled “Random-effects Parameters”. Each quantifies the typical size of the deviations at one level of the hierarchy.
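
One way to make those three estimates more concrete (a small sketch that is not part of the original post) is to square them and express each variance as a share of the total, plugging in the values reported above:

. display "region share   = " .6615227^2/(.6615227^2 + .7797837^2 + .1570457^2)
. display "state share    = " .7797837^2/(.6615227^2 + .7797837^2 + .1570457^2)
. display "residual share = " .1570457^2/(.6615227^2 + .7797837^2 + .1570457^2)

Roughly 41% of the variance lies between regions, about 57% between states within regions, and only about 2% within states.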

Let’s graph the predictions from our model and see how well they fit the data.

predict GrandMean, xb
label var GrandMean "GrandMean"
predict RegionEffect, reffects level(region)
predict StateEffect, reffects level(state)
gen RegionMean = GrandMean + RegionEffect
gen StateMean = GrandMean + RegionEffect + StateEffect

twoway  (line GrandMean year, lcolor(black) lwidth(thick))      ///
        (line RegionMean year, lcolor(blue) lwidth(medthick))   ///
        (line StateMean year, lcolor(green) connect(ascending)) ///
        (scatter gsp year, mcolor(red) msize(tiny)),            ///
        ytitle(log(Gross State Product), margin(medsmall))      ///
        legend(cols(4) size(small))                             ///
        by(region, title("Multilevel Model of GSP by Region", size(medsmall)))

graph2

Wow – that’s a nice graph if I do say so myself. It would be impressive for a report or publication, but it’s a little tough to read with all nine regions displayed at once. Let’s take a closer look at Region 7 instead.

twoway  (line GrandMean year, lcolor(black) lwidth(thick))      ///
        (line RegionMean year, lcolor(blue) lwidth(medthick))   ///
        (line StateMean year, lcolor(green) connect(ascending)) ///
        (scatter gsp year, mcolor(red) msize(medsmall))         ///
        if region ==7,                                          ///
        ytitle(log(Gross State Product), margin(medsmall))      ///
        legend(cols(4) size(small))                             ///
        title("Multilevel Model of GSP for Region 7", size(medsmall))

graph3

The red dots are the observations of GSP for each state within Region 7. The green lines are the estimated mean GSP within each State and the blue line is the estimated mean GSP within Region 7. The thick black line in the center is the overall grand mean for all nine regions. The model appears to fit the data fairly well but I can’t help noticing that the red dots seem to have an upward slant to them. Our model predicts that GSP is constant within each state and region from 1970 to 1986 when clearly the data show an upward trend.

So we’ve tackled the first feature of our data. We’ve successfully incorporated the basic hierarchical structure into our model by fitting a variance-components model using Stata’s xtmixed command. But our graph tells us that we aren’t finished yet.

Next time we’ll tackle the second feature of our data — the longitudinal nature of the observations.

For more information

If you’d like to learn more about modelling multilevel and longitudinal data, check out

Multilevel and Longitudinal Modeling Using Stata, Third Edition
Volume I: Continuous Responses
Volume II: Categorical Responses, Counts, and Survival
by Sophia Rabe-Hesketh and Anders Skrondal

or sign up for our popular public training course “Multilevel/Mixed Models Using Stata”.

There’s a course coming up in Washington, DC on February 7-8, 2013.

Using Stata’s SEM features to model the Beck Depression Inventory

I just got back from the 2012 Stata Conference in San Diego where I gave a talk on Psychometric Analysis Using Stata and from the 2012 American Psychological Association Meeting in Orlando. Stata’s structural equation modeling (SEM) builder was popular at both meetings and I wanted to show you how easy it is to use. If you are not familiar with the basics of SEM, please refer to the references at the end of the post. My goal is simply to show you how to use the SEM builder assuming that you already know something about SEM. If you would like to view a video demonstration of the SEM builder, please click the play button below:

The data used here and for the silly examples in my talk were simulated to resemble one of the most commonly used measures of depression: the Beck Depression Inventory (BDI). If you find these data too silly or not relevant to your own research, you could instead imagine it being a set of questions to measure mathematical ability, the ability to use a statistical package, or whatever you wanted.

The Beck Depression Inventory

Originally published by Aaron Beck and colleagues in 1961, the BDI marked an important change in the conceptualization of depression from a psychoanalytic perspective to a cognitive/behavioral perspective. It was also a landmark in the measurement of depression shifting from lengthy, expensive interviews with a psychiatrist to a brief, inexpensive questionnaire that could be scored and quantified. The original inventory consisted of 21 questions each allowing ordinal responses of increasing symptom severity from 0-3. The sum of the responses could then be used to classify a respondent’s depressive symptoms as none, mild, moderate or severe. Many studies have demonstrated that the BDI has good psychometric properties such as high test-retest reliability and the scores correlate well with the assessments of psychiatrists and psychologists. The 21 questions can also be grouped into two subscales. The affective scale includes questions like “I feel sad” and “I feel like a failure” that quantify emotional symptoms of depression. The somatic or physical scale includes questions like “I have lost my appetite” and “I have trouble sleeping” that quantify physical symptoms of depression. Since its original publication, the BDI has undergone two revisions in response to the American Psychiatric Association’s (APA) Diagnostic and Statistical Manuals (DSM) and the BDI-II remains very popular.

The Stata Depression Inventory

Since the BDI is a copyrighted psychometric instrument, I created a fictitious instrument called the “Stata Depression Inventory”. It consists of 20 questions each beginning with the phrase “My statistical software makes me…”. The individual questions are listed in the variable labels below.

. describe qu1-qu20

variable  storage  display    value
 name       type   format     label      variable label
------------------------------------------------------------------------------
qu1         byte   %16.0g     response   ...feel sad
qu2         byte   %16.0g     response   ...feel pessimistic about the future
qu3         byte   %16.0g     response   ...feel like a failure
qu4         byte   %16.0g     response   ...feel dissatisfied
qu5         byte   %16.0g     response   ...feel guilty or unworthy
qu6         byte   %16.0g     response   ...feel that I am being punished
qu7         byte   %16.0g     response   ...feel disappointed in myself
qu8         byte   %16.0g     response   ...feel am very critical of myself
qu9         byte   %16.0g     response   ...feel like harming myself
qu10        byte   %16.0g     response   ...feel like crying more than usual
qu11        byte   %16.0g     response   ...become annoyed or irritated easily
qu12        byte   %16.0g     response   ...have lost interest in other people
qu13        byte   %16.0g     qu13_t1    ...have trouble making decisions
qu14        byte   %16.0g     qu14_t1    ...feel unattractive
qu15        byte   %16.0g     qu15_t1    ...feel like not working
qu16        byte   %16.0g     qu16_t1    ...have trouble sleeping
qu17        byte   %16.0g     qu17_t1    ...feel tired or fatigued
qu18        byte   %16.0g     qu18_t1    ...makes my appetite lower than usual
qu19        byte   %16.0g     qu19_t1    ...concerned about my health
qu20        byte   %16.0g     qu20_t1    ...experience decreased libido

The responses consist of a 5-point Likert scale ranging from 1 (Strongly Disagree) to 5 (Strongly Agree). Questions 1-10 form the affective scale of the inventory and questions 11-20 form the physical scale. Data were simulated for 1000 imaginary people and included demographic variables such as age, sex and race. The responses can be summarized succinctly in a matrix of bar graphs:
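
The figure itself is not reproduced here, but one way to build such a display (a sketch; not necessarily how the original figure was produced) is to draw a discrete histogram for each question and combine them:

local graphs
foreach v of varlist qu1-qu20 {
    quietly histogram `v', discrete frequency nodraw name(h_`v', replace)
    local graphs `graphs' h_`v'
}
graph combine `graphs', cols(5)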

Classical statistical analysis

The beginning of a classical statistical analysis of these data might consist of summing the responses to questions 1-10 to form an “Affective Depression Score” and summing the responses to questions 11-20 to form a “Physical Depression Score”.

egen Affective = rowtotal(qu1-qu10)
label var Affective "Affective Depression Score"
egen physical = rowtotal(qu11-qu20)
label var physical "Physical Depression Score"

We could be more sophisticated and use principal components to create the affective and physical depression score:

pca qu1-qu20, components(2)
predict Affective Physical
label var Affective "Affective Depression Score"
label var Physical "Physical Depression Score"

We could then ask questions such as “Are there differences in affective and physical depression scores by sex?” and test these hypotheses using multivariate statistics such as Hotelling’s T-squared statistic. The problem with this analysis strategy is that it treats the depression scores as though they were measured without error, which can lead to inaccurate p-values for our test statistics.
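
For example, the classical multivariate comparison described above might look like the following sketch, assuming the summed scores created earlier (Affective and physical) and a two-group variable named sex:

. hotelling Affective physical, by(sex)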

Structural equation modeling

Structural equation modeling (SEM) is an ideal way to analyze data where the outcome of interest is a scale or scales derived from a set of measured variables. The affective and physical scores are treated as latent variables in the model, resulting in accurate p-values, and, best of all, these models are very easy to fit using Stata! We begin by selecting the SEM builder from the Statistics menu:

In the SEM builder, we can select the “Add Measurement Component” icon:

which will open the following dialog box:

In the box labeled “Latent Variable Name” we can type “Affective” (red arrow below) and we can select the variables qu1-qu10 in the “Measured variables” box (blue arrow below).

When we click “OK”, the affective measurement component appears in the builder:

We can repeat this process to create a measurement component for our physical depression scale (images not shown). We can also allow for covariance/correlation between our affective and physical depression scales using the “Add Covariance” icon on the toolbar (red arrow below).

I’ll omit the intermediate steps, but it’s easy to use the “Add Observed Variable” and “Add Path” icons to build the full model shown below:

Now we’re ready to estimate the parameters for our model. To do this, we click the “Estimate” icon on the toolbar (duh!):

And the following dialog box appears:

Let’s ignore the estimation options for now and use the default settings. Click “OK” and the parameter estimates will appear in the diagram:

Some of the parameter estimates are difficult to read in this form but it is easy to rearrange the placement and formatting of the estimates to make them easier to read.

If we look at Stata’s output window and scroll up, we’ll notice that the SEM Builder automatically generated the command for our model:

sem (Affective -> qu1) (Affective -> qu2) (Affective -> qu3)
    (Affective -> qu4) (Affective -> qu5) (Affective -> qu6)
    (Affective -> qu7) (Affective -> qu8) (Affective -> qu9)
    (Affective -> qu10) (Physical -> qu11) (Physical -> qu12)
    (Physical -> qu13) (Physical -> qu14) (Physical -> qu15)
    (Physical -> qu16) (Physical -> qu17) (Physical -> qu18)
    (Physical -> qu19) (Physical -> qu20) (sex -> Affective)
    (sex -> Physical), latent(Affective Physical) cov(e.Physical*e.Affective)

We can gather terms and abbreviate some things to make the command much easier to read:

sem (Affective -> qu1-qu10) ///
    (Physical -> qu11-qu20) ///
    (sex -> Affective Physical), ///
    latent(Affective Physical) ///
    cov(e.Physical*e.Affective)

We could then calculate a Wald statistic to test the null hypothesis that there is no association between sex and our affective and physical depression scales.

test sex

 ( 1)  [Affective]sex = 0
 ( 2)  [Physical]sex = 0

           chi2(  2) =    2.51
         Prob > chi2 =    0.2854

Final thoughts
This is an admittedly oversimplified example: we haven’t assessed the fit of the model or considered any alternative models, and we have included only one dichotomous independent variable. We might also prefer to use a likelihood ratio test or a score test rather than the Wald test. Those are all very important issues that should not be ignored in a proper data analysis, but my goal was to demonstrate how easy it is to use Stata’s SEM Builder to model data such as those arising from the Beck Depression Inventory. Incidentally, if these data had been collected using a complex survey design, it would not be difficult to incorporate the sampling structure and sampling weights into the analysis. Missing data can be handled easily as well using full-information maximum likelihood (FIML), but those are topics for another day.
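
As a rough sketch of those extensions (the exact specifications would depend on the data at hand; the variable and model names simply reuse those above), the Wald test could be replaced by a likelihood-ratio test that compares the model with and without the sex paths, and missing item responses could be handled with FIML by adding the method(mlmv) option:

* Likelihood-ratio test of the sex effects (assumes a fresh copy of the
* data with no observed variables named Affective or Physical)
sem (Affective -> qu1-qu10) (Physical -> qu11-qu20) ///
    (sex -> Affective Physical), ///
    latent(Affective Physical) cov(e.Physical*e.Affective)
estimates store full

sem (Affective -> qu1-qu10) (Physical -> qu11-qu20), ///
    latent(Affective Physical) cov(e.Physical*e.Affective)
estimates store nosex

lrtest full nosex

* FIML for missing item responses
sem (Affective -> qu1-qu10) (Physical -> qu11-qu20) ///
    (sex -> Affective Physical), ///
    latent(Affective Physical) cov(e.Physical*e.Affective) method(mlmv)

Similarly, after svyset, sem accepts the svy: prefix to account for a complex sampling design.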

If you would like to view the slides from my talk, download the data used in this example, or view a video demonstration of Stata’s SEM Builder using these data, please use the links below. For the dataset, you can also type use followed by the URL (as shown after the links) to load the data directly into Stata.

Slides:
http://stata.com/meeting/sandiego12/materials/sd12_huber.pdf

Data:
http://stata.com/meeting/sandiego12/materials/Huber_2012SanDiego.dta

YouTube video demonstration:
http://www.youtube.com/watch?v=Xj0gBlqwYHI
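
For example, to load the dataset directly from the web (clearing any data currently in memory):

use http://stata.com/meeting/sandiego12/materials/Huber_2012SanDiego.dta, clear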

References

Beck AT, Ward CH, Mendelson M, Mock J, Erbaugh J (1961). An inventory for measuring depression. Archives of General Psychiatry 4(6): 561–571.

Beck AT, Ward C, Mendelson M (1961). Beck Depression Inventory (BDI). Archives of General Psychiatry 4(6): 561–571.

Beck AT, Steer RA, Ball R, Ranieri W (1996). Comparison of Beck Depression Inventories-IA and -II in psychiatric outpatients. Journal of Personality Assessment 67(3): 588–597.

Bollen KA (1989). Structural Equations with Latent Variables. New York, NY: John Wiley & Sons.

Kline RB (2011). Principles and Practice of Structural Equation Modeling. New York, NY: Guilford Press.

Raykov T, Marcoulides GA (2006). A First Course in Structural Equation Modeling. Mahwah, NJ: Lawrence Erlbaum.

Schumacker RE, Lomax RG (2012). A Beginner’s Guide to Structural Equation Modeling, 3rd ed. New York, NY: Routledge.


Stata YouTube channel announced!

StataCorp now provides free tutorial videos on StataCorp’s YouTube channel,

http://www.youtube.com/user/statacorp

There are 24 videos providing 1 hour 51 minutes of instructional entertainment:

Stata Quick Tour (5:47)
Stata Quick Help (2:47)
Stata PDF Documentation (6:37)

Stata One-sample t-test (3:43)
Stata t-test for Two Independent Samples (5:09)
Stata t-test for Two Paired Samples (4:42)

Stata Simple Linear Regression (5:33)

Stata SEM Builder (8:09)
Stata One-way ANOVA (5:15)
Stata Two-way ANOVA (5:57)

Stata Pearson’s Correlation Coefficient (3:29)
Stata Pearson’s Chi2 and Fisher’s Exact Test (3:16)

Stata Box Plots (4:04)
Stata Basic Scatterplots (5:19)
Stata Bar Graphs (4:15)
Stata Histograms (4:50)
Stata Pie Charts (5:32)

Stata Descriptive Statistics (5:49)

Stata Tables and Crosstabulations (7:20)
Stata Combining Crosstabs and Descriptives (5:58)

Stata Converting Data to Stata with Stat/Transfer (2:47)
Stata Import Excel Data (1:33)
Stata Excel Copy/Paste (1:16)
Stata Example Data Included with Stata (2:14)

And more are forthcoming.

 

The inside story

Alright, that’s the official announcement.

Last Friday, 21 September 2012, was an exciting day here at StataCorp. After a couple of years of “wouldn’t it be cool if”, and a couple of months of “we’re almost there”, Stata’s YouTube channel was finally ready for prime time.

Stata’s YouTube Channel was the brainchild of Karen Strope, StataCorp’s Director of Marketing, but I had something to do with it, too. Well, maybe more than something, but I’m a modest guy. Anyway, I thought it sounded like fun and recorded a few prototype videos. Annette Fett, StataCorp’s Graphic Designer, added the cool splash-screen and after a few experiments, we soon had 24 Blu-ray resolution videos. We’ve kicked off with videos covering topics such as a tour of Stata’s interface, how to create basic graphs, how to conduct many common statistical analyses, and more.

My personal favorite is the video entitled Combining Crosstabs and Descriptives because it’s relevant to nearly all Stata users and works well as a video demonstration.

Videos about Stata – isn’t that like dancing about architecture?

Stata has over 9,000 pages of documentation included in PDF format, a built-in Help system, a collection of books on general and special topics published by Stata Press, and an extensive collection of dialog boxes that make even the most complex graphs and analyses easy to perform.

So aren’t the videos, ahh, unnecessary?

The problem is that it’s cumbersome to describe how to use all of Stata’s features, especially the dialog boxes, in a manual, even one with 9,000 pages. And 9,000 pages tries even the most dedicated user’s patience.

In a 3-7 minute video, we can show you how to create a complicated graph or fit a sophisticated structural equation model.

We have three audiences in mind.

  1. Videos for non-Stata users, whom we call future Stata users. These videos provide a loosely guided tour of Stata’s features.
  2. Videos for new Stata users, such as the person who might simply want to know “How do I calculate a twoway ANOVA in Stata?” or “How do I create a Pie Chart?”. These videos will get them up and running quickly and painlessly.
  3. Videos for experienced Stata users who want to learn new tips and tricks.

There’s actually a fourth group that’s of interest, too: experienced Stata users who teach statistics or data analysis classes and don’t want to spend valuable class time showing their students how to use Stata. They can refer their students to the relevant videos as homework and thus free up class time for teaching statistics.

Request for comments

One of the fun things about working at StataCorp is that management doesn’t much use the word “no”. New ideas are more often met with the phrase, “well, let’s try it and see what happens”. So I’m trying this. My plan is to add a couple of videos to the channel every week or two as time permits. I have a list of topics I’d like to cover including things like multiple imputation, survey analysis, mixed models, Stata’s “immediate” commands (tabi, ttesti, csi, cci, etc…), and more examples using the SEM Builder.

However, I will take requests. If you have a suggested topic for a future video, leave a comment.

I’d like to keep the videos brief, between 3 and 7 minutes, so please don’t request feature-length films like “How to do survival analysis in Stata”. Similarly, topics that are interesting only to you and your two post-docs, such as “Please describe the difference between the Laplacian Approximation and Adaptive Gauss-Hermite Quadrature in the xtmepoisson command”, are not likely to see the light of day. But I am very interested in your ideas for small, bite-sized topics that will work in a video format.
