Two-step estimators use first-step estimates to estimate the parameters of interest in a second step. For instance, inverse-probability-weighted (IPW) estimators are weighted averages in which the weights are estimated in the first step. The two-step-estimation problem is that such an estimator produces consistent point estimates but inconsistent standard errors. The problem arises because the second step ignores the estimation error in the first step.
One solution is to convert the two-step estimator into a one-step estimator. My favorite way to do this conversion is to stack the equations solved by each of the two estimators and solve them jointly. This one-step approach produces consistent point estimates and consistent standard errors. There is no two-step problem because all the computations are performed jointly. Newey (1984) derives and justifies this approach.
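As a rough sketch of what stacking means, in notation that is mine rather than Newey's: let \(\boldsymbol{\gamma}\) be the first-step parameters, \(\boldsymbol{\theta}\) the second-step parameters, and \({\bf g}_1\) and \({\bf g}_2\) the moment functions that the two steps set to zero on average. For an exactly identified problem with independent observations, the stacked system and the usual sandwich variance estimator are

\[
\frac{1}{N}\sum_{i=1}^{N}
\begin{pmatrix}
{\bf g}_{1}({\bf z}_i,\boldsymbol{\gamma}) \\
{\bf g}_{2}({\bf z}_i,\boldsymbol{\gamma},\boldsymbol{\theta})
\end{pmatrix}
= {\bf 0},
\qquad
\widehat{\rm Var}(\widehat{\boldsymbol{\gamma}},\widehat{\boldsymbol{\theta}})
= \frac{1}{N}\,\widehat{\bf G}^{-1}\,\widehat{\bf S}\,\left(\widehat{\bf G}^{-1}\right)'
\]

where \(\widehat{\bf G}\) is the sample average of the derivatives of the stacked moment functions with respect to \((\boldsymbol{\gamma},\boldsymbol{\theta})\) and \(\widehat{\bf S}\) is the sample average of their outer products. The block of \(\widehat{\bf G}\) that holds the derivatives of \({\bf g}_2\) with respect to \(\boldsymbol{\gamma}\) is what carries the first-step estimation error into the second-step standard errors; a two-step estimator that takes the first step as fixed implicitly sets that block to zero.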
I’m going to illustrate this approach with the IPW example, but it can be used with any two-step problem as long as the moment conditions of each step are continuously differentiable.
IPW estimators are frequently used to estimate the mean that would be observed if everyone in a population received a specified treatment, a quantity known as a potential-outcome mean (POM). A difference of POMs is called the average treatment effect (ATE). Aside from all that, it is the mechanics of the two-step IPW estimator that interest me here. IPW estimators are weighted averages of the outcome, and the weights are estimated in a first step. The weights used in the second step are the inverse of the estimated probability of treatment.
Let’s imagine we are analyzing an extract of the birthweight data used by Cattaneo (2010). In this dataset, bweight is the baby’s weight at birth, mbsmoke is 1 if the mother smoked while pregnant (and 0 otherwise), mmarried is 1 if the mother is married, and prenatal1 is 1 if the mother had a prenatal visit in the first trimester.
Let’s imagine we want to estimate the mean when all pregnant women smoked, which is to say, the POM for smoking. If we were doing substantive research, we would also estimate the POM when no pregnant women smoked. The difference between these estimated POMs would then estimate the ATE of smoking.
In the IPW estimator, we begin by estimating the probability weights for smoking. We fit a probit model of mbsmoke as a function of mmarried and prenatal1.
. use cattaneo2 (Excerpt from Cattaneo (2010) Journal of Econometrics 155: 138-154) . probit mbsmoke mmarried prenatal1, vce(robust) Iteration 0: log pseudolikelihood = -2230.7484 Iteration 1: log pseudolikelihood = -2102.6994 Iteration 2: log pseudolikelihood = -2102.1437 Iteration 3: log pseudolikelihood = -2102.1436 Probit regression Number of obs = 4642 Wald chi2(2) = 259.42 Prob > chi2 = 0.0000 Log pseudolikelihood = -2102.1436 Pseudo R2 = 0.0577 ------------------------------------------------------------------------------ | Robust mbsmoke | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- mmarried | -.6365472 .0478037 -13.32 0.000 -.7302407 -.5428537 prenatal1 | -.2144569 .0547583 -3.92 0.000 -.3217811 -.1071327 _cons | -.3226297 .0471906 -6.84 0.000 -.4151215 -.2301379 ------------------------------------------------------------------------------
The results indicate that both mmarried and prenatal1 significantly predict whether the mother smoked while pregnant.
We want to calculate the inverse probabilities. We begin by getting the probabilities:
. predict double pr, pr
Now, we can obtain the inverse probabilities by typing
. generate double ipw = (mbsmoke==1)/pr
We can now perform the second step: calculate the mean for smokers by using the IPWs.
. mean bweight [pw=ipw] if mbsmoke==1

Mean estimation                   Number of obs   =        864

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
     bweight |   3162.868   21.71397      3120.249    3205.486
--------------------------------------------------------------
The point estimate reported by mean is consistent; the reported standard error is not. The standard error is inconsistent because mean treats the weights as fixed when they were in fact estimated.
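The same two steps can be repeated for the other POM mentioned earlier, the mean when no pregnant women smoked. The following is only a sketch under my own assumptions: the variable names psmoke and ipw0 are made up for illustration, and the dataset is loaded with webuse rather than from a local copy. For nonsmokers, the weight is the inverse of the estimated probability of not smoking.

```stata
* Sketch: two-step IPW estimate of the POM for nonsmokers.
* psmoke and ipw0 are illustrative names, not part of the original example.
webuse cattaneo2, clear

* First step: probit model for the probability of smoking
quietly probit mbsmoke mmarried prenatal1, vce(robust)
predict double psmoke, pr

* Weight is 1/Pr(not smoking) for nonsmokers and 0 for smokers
generate double ipw0 = (mbsmoke==0)/(1 - psmoke)

* Second step: weighted mean for nonsmokers.
* As with the smokers, the reported standard error ignores
* the estimation error in the first step.
mean bweight [pw=ipw0] if mbsmoke==0
```

As before, only the point estimate from this sketch should be trusted; the standard error has the same two-step problem.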
The stacked approach instead uses gmm to create a one-step estimator that solves both steps simultaneously, which removes the two-step-estimation problem.
To do that, we have to find and then code the moment conditions.
So what are the moment conditions for the first-step maximum-likelihood probit? Maximum likelihood (ML) estimators obtain their parameter estimates by finding the parameters that set the means of the first derivatives with respect to each parameter to 0. The means of the first derivatives are the moments.
The moment conditions are that the means of the first derivatives equal 0. We can obtain those first derivatives for ourselves, or we can copy them from the Methods and formulas section of [R] probit:
\[
1/N\sum_{i=1}^N\frac{ \phi({\bf x}_i\boldsymbol{\beta}')
\left\{d_i-\Phi\left({\bf x}_i\boldsymbol{\beta}'\right)\right\}}
{\Phi\left({\bf x}_i\boldsymbol{\beta}'\right)
\left\{1-\Phi\left({\bf x}_i\boldsymbol{\beta}'\right)\right\}}{\bf x}_i' = {\bf 0}
\]
where \(\phi()\) is the density function of the standard normal distribution, \(d_i\) is the binary variable that is 1 for treated individuals (and 0 otherwise), and \(\Phi()\) is the cumulative distribution function of the standard normal distribution.
What’s the point of these moment conditions? We are going to use the generalized method of moments (GMM) to solve for the ML probit estimates. GMM is an estimation framework that defines estimators that solve moment conditions. The GMM estimator that sets the mean of the first derivatives of the ML probit to 0 produces the same point estimates as the ML probit estimator.
Stata’s GMM estimator is the gmm command; see [R] gmm for an introduction.
The structure of these moment conditions greatly simplifies the problem. For each observation, the left-hand side is the product of a scalar subexpression, namely,
\[
\frac{\phi({\bf x}_i\boldsymbol{\beta}')\{d_i-\Phi({\bf x}_i\boldsymbol{\beta}')\}}
{\Phi({\bf x}_i\boldsymbol{\beta}')\{1-\Phi({\bf x}_i\boldsymbol{\beta}')\}}
\]
and the covariates \({\bf x}_i\). In GMM parlance, the variables that multiply the scalar expression are called instruments.
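For readers who would rather obtain the first derivatives themselves than copy them, here is a short derivation. It is a sketch that writes \(\Phi()\) for the standard normal cumulative distribution function and \(\phi()\) for its density, and it uses the fact that the derivative of \(\Phi(u)\) with respect to \(u\) is \(\phi(u)\). The probit log likelihood is

\[
\ln L = \sum_{i=1}^{N}\left[ d_i \ln \Phi({\bf x}_i\boldsymbol{\beta}')
+ (1-d_i)\ln\left\{1-\Phi({\bf x}_i\boldsymbol{\beta}')\right\}\right]
\]

Differentiating with respect to \(\boldsymbol{\beta}\) and putting the two terms over a common denominator gives

\[
\sum_{i=1}^{N}\left[ \frac{d_i\,\phi({\bf x}_i\boldsymbol{\beta}')}{\Phi({\bf x}_i\boldsymbol{\beta}')}
- \frac{(1-d_i)\,\phi({\bf x}_i\boldsymbol{\beta}')}{1-\Phi({\bf x}_i\boldsymbol{\beta}')}\right]{\bf x}_i'
= \sum_{i=1}^{N}\frac{\phi({\bf x}_i\boldsymbol{\beta}')\left\{d_i-\Phi({\bf x}_i\boldsymbol{\beta}')\right\}}
{\Phi({\bf x}_i\boldsymbol{\beta}')\left\{1-\Phi({\bf x}_i\boldsymbol{\beta}')\right\}}{\bf x}_i'
\]

because \(d_i\{1-\Phi\} - (1-d_i)\Phi = d_i - \Phi\). Dividing by \(N\) and setting the result to \({\bf 0}\) yields the moment conditions shown above.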
The gmm command that will solve these moment conditions is
. generate double cons = 1

. gmm (normalden({xb:mmarried prenatal1 cons})*(mbsmoke - normal({xb:}))/ ///
>      (normal({xb:})*(1-normal({xb:})) )),                               ///
>      instruments(mmarried prenatal1 ) winitial(identity) onestep

Step 1
Iteration 0:   GMM criterion Q(b) =  .61413428
Iteration 1:   GMM criterion Q(b) =  .00153235
Iteration 2:   GMM criterion Q(b) =  1.652e-06
Iteration 3:   GMM criterion Q(b) =  1.217e-12
Iteration 4:   GMM criterion Q(b) =  7.162e-25

GMM estimation

Number of parameters =   3
Number of moments    =   3
Initial weight matrix: Identity                   Number of obs   =       4642

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
/xb_mmarried |  -.6365472   .0477985   -13.32   0.000    -.7302306   -.5428638
/xb_prenat~1 |  -.2144569   .0547524    -3.92   0.000    -.3217696   -.1071442
    /xb_cons |  -.3226297   .0471855    -6.84   0.000    -.4151115   -.2301479
------------------------------------------------------------------------------
Instruments for equation 1: mmarried prenatal1 _cons
With gmm, we specify in parentheses the scalar expression, and we specify the covariates in the instruments() option. The unknown parameters are the implied coefficients on the variables specified in {xb:mmarried prenatal1 cons}. Note that we subsequently refer to the linear combination as {xb:}.
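The winitial(identity) and onestep options keep the computation simple. Because the number of moment conditions equals the number of parameters, the choice of weight matrix does not affect the solution, so a single step with the identity matrix is all that is needed.
The point estimates produced by the gmm command match those reported by probit. The standard errors agree to about four significant digits; the small remaining difference is consistent with the N/(N-1) finite-sample adjustment that probit applies to its robust variance estimate and that gmm does not apply here.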
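The comparison between the two sets of results can also be made programmatically instead of by eye. The following is a sketch, not part of the original analysis: the matrix and scalar names are mine, the dataset is loaded with webuse, and the N/(N-1) rescaling is my reading of why the reported standard errors differ slightly.

```stata
* Sketch: compare the probit and gmm results numerically.
webuse cattaneo2, clear

quietly probit mbsmoke mmarried prenatal1, vce(robust)
matrix b_probit = e(b)
matrix V_probit = e(V)

generate double cons = 1
quietly gmm (normalden({xb:mmarried prenatal1 cons})*(mbsmoke - normal({xb:}))/ ///
        (normal({xb:})*(1-normal({xb:})))),                                     ///
        instruments(mmarried prenatal1) winitial(identity) onestep
matrix b_gmm = e(b)

* Assumption: the two VCEs differ only by the factor N/(N-1)
scalar adj = e(N)/(e(N)-1)
matrix V_gmm = adj*e(V)

* mreldif() returns the largest relative difference between two matrices
display "coefficients: " mreldif(b_probit, b_gmm)
display "rescaled VCE: " mreldif(V_probit, V_gmm)
```

Both displayed values should be close to 0 if the two estimators agree as described.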
Now that we can use gmm to obtain our first-step estimates, we need to add the moment condition that defines the weighted average of the POM for smokers.
Recall that the inverse weights are \(1/\Phi({\bf x}_i\boldsymbol{\beta}')\) for smokers. When we solved this problem using a two-step estimator, we performed the second step only for smokers. We typed mean bweight [pw=ipw] if mbsmoke==1. We cannot use if mbsmoke==1 in the gmm command because the first step has to be performed over all the data. Instead, we set the weights to 0 in the second step for the nonsmokers. Multiplying \(1/\Phi({\bf x}_i\boldsymbol{\beta}')\) by \({\bf mbsmoke}_i\) does that.
The POM for smokers is the weighted average of the outcome,
\[
{\rm POM} = \frac{\sum_{i=1}^{N}\frac{{\bf mbsmoke}_i}{\Phi({\bf x}_i\boldsymbol{\beta}')}\,{\bf bweight}_i}
{\sum_{i=1}^{N}\frac{{\bf mbsmoke}_i}{\Phi({\bf x}_i\boldsymbol{\beta}')}}
\]
and rearranging gives the moment condition
\[
1/N\sum_{i=1}^{N}{{\bf mbsmoke}_i\over{\Phi({\bf x}_i\boldsymbol{\beta}')}}
\left({\bf bweight}_i - {\rm POM}\right) = 0
\]
In the gmm command below, I call the scalar expression for the probit moment conditions eq1, and I call the scalar expression for the POM weighted-average equation eq2. Both moment conditions have the scalar-expression-times-instrument structure, but the weighted-average moment expression is multiplied by a constant that is included as an instrument by default. In the weighted-average moment condition, parameter pom is the POM we wish to estimate.
. gmm (eq1: normalden({xb:mmarried prenatal1 cons})*                  ///
>     (mbsmoke - normal({xb:}))/(normal({xb:})*(1-normal({xb:})) ))   ///
>     (eq2: (mbsmoke/normal({xb:}))*(bweight - {pom})),               ///
>     instruments(eq1:mmarried prenatal1 )                            ///
>     instruments(eq2: )                                              ///
>     winitial(identity) onestep

Step 1
Iteration 0:   GMM criterion Q(b) =  1364234.7
Iteration 1:   GMM criterion Q(b) =  141803.69
Iteration 2:   GMM criterion Q(b) =  84836.523
Iteration 3:   GMM criterion Q(b) =  1073.6829
Iteration 4:   GMM criterion Q(b) =  .01215102
Iteration 5:   GMM criterion Q(b) =  1.196e-13
Iteration 6:   GMM criterion Q(b) =  2.815e-27

GMM estimation

Number of parameters =   4
Number of moments    =   4
Initial weight matrix: Identity                   Number of obs   =       4642

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
/xb_mmarried |  -.6365472   .0477985   -13.32   0.000    -.7302306   -.5428638
/xb_prenat~1 |  -.2144569   .0547524    -3.92   0.000    -.3217696   -.1071442
    /xb_cons |  -.3226297   .0471855    -6.84   0.000    -.4151115   -.2301479
        /pom |   3162.868   21.65827   146.04   0.000     3120.418    3205.317
------------------------------------------------------------------------------
Instruments for equation 1: mmarried prenatal1 _cons
Instruments for equation 2: _cons
In this output, both the point estimates and the standard errors are consistent!
They are consistent because we converted our two-step estimator into a one-step estimator.
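The same stacking idea extends to the POM for nonsmokers and hence to the ATE. The sketch below is my own extension, not part of the original example: it adds a third moment condition, eq3, that weights nonsmokers by the inverse of the estimated probability of not smoking, and the parameter names pom1 and pom0 are made up for illustration. The way parameters are referenced after gmm can differ across Stata versions, so the names used with lincom are an assumption that the coefficient legend should confirm.

```stata
* Sketch: stack the probit score with moment conditions for both POMs.
webuse cattaneo2, clear
generate double cons = 1

gmm (eq1: normalden({xb:mmarried prenatal1 cons})*                      ///
          (mbsmoke - normal({xb:}))/(normal({xb:})*(1-normal({xb:}))))  ///
    (eq2: (mbsmoke/normal({xb:}))*(bweight - {pom1}))                   ///
    (eq3: ((1-mbsmoke)/(1-normal({xb:})))*(bweight - {pom0})),          ///
    instruments(eq1: mmarried prenatal1)                                ///
    instruments(eq2: )                                                  ///
    instruments(eq3: )                                                  ///
    winitial(identity) onestep

* Show how the parameters are named in this version of Stata
gmm, coeflegend

* Assumption: the POMs are stored as pom1:_cons and pom0:_cons;
* newer versions may display them as _b[/pom1] and _b[/pom0]
lincom _b[pom1:_cons] - _b[pom0:_cons]
```

Because all three sets of moment conditions are solved jointly, the standard error that lincom reports for the difference accounts for the estimation error in the probit step.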
What we have just done is reimplement Stata’s teffects command in a particular case. Results are identical:
. teffects ipw (bweight) (mbsmoke mmarried prenatal1, probit) , pom Iteration 0: EE criterion = 5.387e-22 Iteration 1: EE criterion = 3.332e-27 Treatment-effects estimation Number of obs = 4642 Estimator : inverse-probability weights Outcome model : weighted mean Treatment model: probit ------------------------------------------------------------------------------ | Robust bweight | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- POmeans | mbsmoke | nonsmoker | 3401.441 9.528643 356.97 0.000 3382.765 3420.117 smoker | 3162.868 21.65827 146.04 0.000 3120.418 3205.317 ------------------------------------------------------------------------------
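If the quantity of interest is the ATE rather than the individual POMs, teffects can report it directly. This is a minimal sketch that reuses the specification above and replaces the pom option with ate; the dataset is loaded with webuse.

```stata
* Sketch: report the ATE of smoking instead of the two POMs
webuse cattaneo2, clear
teffects ipw (bweight) (mbsmoke mmarried prenatal1, probit), ate
```

The ATE reported this way is the difference between the smoker and nonsmoker POMs shown in the output above.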
To which problems can you apply this stacked two-step approach?
This approach of stacking the moment conditions is designed for two-step problems in which the number of parameters equals the number of sample moment conditions in each step. Such estimators are called exactly identified because the number of parameters is the same as the number of equations that they solve.
For exactly identified estimators, the point estimates produced by the stacked GMM are identical to the point estimates produced by the two-step estimator. The stacked GMM, however, produces consistent standard errors.
For estimators with more conditions than parameters, the stacked GMM also corrects the standard errors, but there are caveats that I’m not going to discuss here.
The stacked GMM requires that the moment conditions be continuously differentiable and satisfy standard regularity conditions. Smooth, regular ML estimators and least-squares estimators meet these requirements; see Newey (1984) for details.
The main practical hurdle is getting the moment conditions for the estimators in the different steps. If the steps involve ML, those first-derivative conditions can be directly translated to moment conditions. The calculus part is worked out in many textbooks, and sometimes even in the Stata manuals.
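To illustrate how directly ML first-derivative conditions translate, here is a sketch that swaps the probit first step for a logit. This variation is mine and is not in the original example. For a logit, the first derivative of the log likelihood reduces to the residual \(d_i - \Lambda({\bf x}_i\boldsymbol{\beta}')\) times the covariates, where \(\Lambda()\) is the logistic function, which Stata provides as invlogit().

```stata
* Sketch: stacked IPW estimator with a logit first step.
webuse cattaneo2, clear
generate double cons = 1

gmm (eq1: mbsmoke - invlogit({xb:mmarried prenatal1 cons}))   ///
    (eq2: (mbsmoke/invlogit({xb:}))*(bweight - {pom})),       ///
    instruments(eq1: mmarried prenatal1)                      ///
    instruments(eq2: )                                        ///
    winitial(identity) onestep

* For comparison: the same model fit by teffects with a logit treatment model
teffects ipw (bweight) (mbsmoke mmarried prenatal1, logit), pom
```

If the translation is right, the /pom row from gmm should agree with the smoker row reported by teffects.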
See [R] gmm for more information on how to use the gmm command.
Cattaneo, M. D. 2010. Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155: 138–154.
Newey, W. K. 1984. A method of moments interpretation of sequential estimators. Economics Letters 14: 201–206.
Once installed, launch GoodReader, press the bookmark icon at the bottom of the screen, and GoodReader shows you the list of the manuals.
Well, that’s only a partial list. We’d have to scroll to see them all.
If you tap on a manual, it opens.
You can swipe to go forward.
All the links are live. If you tap on graph intro, the reader jumps to the manual entry.
Here are some formulas:
To illustrate formulas, I jumped to mi estimate in the [MI] manual. I can jump anywhere because I have all 21 manuals—all 11,000-plus pages—installed on my iPad.
You can have them installed on your iPad, too.
Here’s how.
You must purchase GoodReader 4 from the App Store. No other PDF reader will do. What makes GoodReader a good reader for the Stata manuals is that it can handle links across manuals. As of this date, only GoodReader will do this.
We are going to copy the manuals from your computer to your iPad. You need a computer containing Stata. This does not have to be the same computer to which you sync your iPad.
Before we do that, however, let’s verify your Stata is up to date. We want to copy the latest version of the manuals, and StataCorp sometimes updates them. Launch Stata and type update query. Follow update’s instructions if there’s an update. Updates are free.
We are ready to copy. There are two ways you can copy the manuals. You can physically copy them using iTunes by plugging your iPad into your computer, or you can use GoodReader to wirelessly copy them. We recommend using iTunes because the other method requires file sharing be enabled on the computer and setting that up can be difficult.
Copying the manuals with iTunes is simple and is the method we recommend.
Once the docs folder appears in GoodReader Documents list, you’re done! Unplug your iPad from the computer. If it makes you feel better, you can eject your iPad first.
Skip to Viewing the manuals below.
GoodReader can copy the manuals wirelessly from your computer and, even better, keep the copied manuals in sync with the manuals on the computer. However, file sharing must be enabled on your computer and setting that up can be difficult. So we’ll just assume that file sharing is working and, if you have trouble using file sharing, copy the manuals using iTunes as covered above.
Once the download has completed, the progress window will close and you’re done!
Launch GoodReader.
This first time, it will show you a list of folders. Select docs to open the folder and view a list of Stata manuals.
Select one. GoodReader will show you the first page of the manual, surrounded by GoodReader icons. Tap once in the middle of the screen to hide the icons. Tap later in the middle of the screen to bring them back.
The most useful icon is the bookmarks icon, located at the bottom center of the screen.
Tap the icon and the list of manuals reappears in the Outlines tab so that you can choose another manual.
StataCorp sometimes updates the manuals. Refreshing the manuals on your iPad is easy enough.
If you copied the manuals using iTunes — or even if you didn’t — you can repeat the steps to copy the manuals using iTunes.
If you copied the manuals wirelessly, just tap the Sync button from the main screen of GoodReader while your iPad is on the same network as the computer.
When manuals were real, I used to write in them. I’d highlight something I thought important, or put an arrow here or there. With virtual manuals, that’s called annotating.
There’s an issue with annotating, however. If you update your manuals, you lose your annotations. So either don’t annotate or don’t update.
For example, when we want to compare parameters among two or more models, we usually use suest, which combines the estimation results under one parameter vector and creates a simultaneous covariance matrix of the robust type. This covariance estimate is described in the Methods and formulas of [R] suest as the robust variance from a “stacked model”. In fact, gsem can fit these kinds of “stacked models”, even if the estimation samples are not the same and even if they overlap. By using the option vce(robust), we can replicate the results from suest if the models are available in gsem. In addition, gsem allows us to combine results from some estimation commands that are not supported by suest, such as models that include random effects.
Let’s consider the childweight dataset, described in [ME] mixed. Consider the following models, where weights of boys and girls are modeled using the age and the age-squared:
. webuse childweight, clear (Weight data on Asian children) . regress weight age c.age#c.age if girl == 0, noheader ------------------------------------------------------------------------------ weight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- age | 7.985022 .6343855 12.59 0.000 6.725942 9.244101 | c.age#c.age | -1.74346 .2374504 -7.34 0.000 -2.214733 -1.272187 | _cons | 3.684363 .3217223 11.45 0.000 3.045833 4.322893 ------------------------------------------------------------------------------ . regress weight age c.age#c.age if girl == 1, noheader ------------------------------------------------------------------------------ weight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- age | 7.008066 .5164687 13.57 0.000 5.982746 8.033386 | c.age#c.age | -1.450582 .1930318 -7.51 0.000 -1.833798 -1.067365 | _cons | 3.480933 .2616616 13.30 0.000 2.961469 4.000397 ------------------------------------------------------------------------------
To test whether birthweights are the same for the two groups, we need to test whether the intercepts in the two regressions are the same. Using suest, we would proceed as follows:
. quietly regress weight age c.age#c.age if girl == 0, noheader . estimates store boys . quietly regress weight age c.age#c.age if girl == 1, noheader . estimates store girls . suest boys girls Simultaneous results for boys, girls Number of obs = 198 ------------------------------------------------------------------------------ | Robust | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- boys_mean | age | 7.985022 .4678417 17.07 0.000 7.068069 8.901975 | c.age#c.age | -1.74346 .2034352 -8.57 0.000 -2.142186 -1.344734 | _cons | 3.684363 .1719028 21.43 0.000 3.34744 4.021286 -------------+---------------------------------------------------------------- boys_lnvar | _cons | .4770289 .1870822 2.55 0.011 .1103546 .8437032 -------------+---------------------------------------------------------------- girls_mean | age | 7.008066 .4166916 16.82 0.000 6.191365 7.824766 | c.age#c.age | -1.450582 .1695722 -8.55 0.000 -1.782937 -1.118226 | _cons | 3.480933 .1556014 22.37 0.000 3.17596 3.785906 -------------+---------------------------------------------------------------- girls_lnvar | _cons | .0097127 .1351769 0.07 0.943 -.2552292 .2746545 ------------------------------------------------------------------------------
Invoking an estimation command with the option coeflegend will give us a legend we can use to refer to the parameters when we use postestimation commands like test.
. suest, coeflegend Simultaneous results for boys, girls Number of obs = 198 ------------------------------------------------------------------------------ | Coef. Legend -------------+---------------------------------------------------------------- boys_mean | age | 7.985022 _b[boys_mean:age] | c.age#c.age | -1.74346 _b[boys_mean:c.age#c.age] | _cons | 3.684363 _b[boys_mean:_cons] -------------+---------------------------------------------------------------- boys_lnvar | _cons | .4770289 _b[boys_lnvar:_cons] -------------+---------------------------------------------------------------- girls_mean | age | 7.008066 _b[girls_mean:age] | c.age#c.age | -1.450582 _b[girls_mean:c.age#c.age] | _cons | 3.480933 _b[girls_mean:_cons] -------------+---------------------------------------------------------------- girls_lnvar | _cons | .0097127 _b[girls_lnvar:_cons] ------------------------------------------------------------------------------ . test _b[boys_mean:_cons] = _b[girls_mean:_cons] ( 1) [boys_mean]_cons - [girls_mean]_cons = 0 chi2( 1) = 0.77 Prob > chi2 = 0.3803
We find no evidence that the intercepts are different.
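A Wald test reports only the test statistic. If the size of the difference is also of interest, lincom gives the estimated difference between the two intercepts along with its standard error and confidence interval. This is a sketch that refits the models shown above and uses the parameter names from the coefficient legend.

```stata
* Sketch: estimate the difference between the two intercepts after suest
webuse childweight, clear

quietly regress weight age c.age#c.age if girl == 0
estimates store boys
quietly regress weight age c.age#c.age if girl == 1
estimates store girls
quietly suest boys girls

* Names taken from the coefficient legend of suest
lincom _b[boys_mean:_cons] - _b[girls_mean:_cons]
```

The conclusion should agree with the Wald test above, because both rely on the same robust covariance matrix from suest.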
Now, let’s replicate those results by using the gsem command. We generate the variable weightboy, a copy of weight for boys and missing otherwise, and the variable weightgirl, a copy of weight for girls and missing otherwise.
. quietly generate weightboy = weight if girl == 0

. quietly generate weightgirl = weight if girl == 1

. gsem (weightboy <- age c.age#c.age) (weightgirl <- age c.age#c.age), ///
>      nolog vce(robust)

Generalized structural equation model             Number of obs   =        198
Log pseudolikelihood =  -302.2308

-----------------------------------------------------------------------------------
                  |               Robust
                  |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
weightboy <-      |
              age |   7.985022   .4678417    17.07   0.000     7.068069    8.901975
                  |
      c.age#c.age |   -1.74346   .2034352    -8.57   0.000    -2.142186   -1.344734
                  |
            _cons |   3.684363   .1719028    21.43   0.000      3.34744    4.021286
------------------+----------------------------------------------------------------
weightgirl <-     |
              age |   7.008066   .4166916    16.82   0.000     6.191365    7.824766
                  |
      c.age#c.age |  -1.450582   .1695722    -8.55   0.000    -1.782937   -1.118226
                  |
            _cons |   3.480933   .1556014    22.37   0.000      3.17596    3.785906
------------------+----------------------------------------------------------------
 var(e.weightboy) |   1.562942   .3014028                      1.071012    2.280821
var(e.weightgirl) |    .978849   .1364603                      .7448187    1.286414
-----------------------------------------------------------------------------------

. gsem, coeflegend

Generalized structural equation model             Number of obs   =        198
Log pseudolikelihood =  -302.2308

-----------------------------------------------------------------------------------
                  |      Coef.  Legend
------------------+----------------------------------------------------------------
weightboy <-      |
              age |   7.985022  _b[weightboy:age]
                  |
      c.age#c.age |   -1.74346  _b[weightboy:c.age#c.age]
                  |
            _cons |   3.684363  _b[weightboy:_cons]
------------------+----------------------------------------------------------------
weightgirl <-     |
              age |   7.008066  _b[weightgirl:age]
                  |
      c.age#c.age |  -1.450582  _b[weightgirl:c.age#c.age]
                  |
            _cons |   3.480933  _b[weightgirl:_cons]
------------------+----------------------------------------------------------------
 var(e.weightboy) |   1.562942  _b[var(e.weightboy):_cons]
var(e.weightgirl) |    .978849  _b[var(e.weightgirl):_cons]
-----------------------------------------------------------------------------------

. test _b[weightgirl:_cons]= _b[weightboy:_cons]

 ( 1)  - [weightboy]_cons + [weightgirl]_cons = 0

           chi2(  1) =    0.77
         Prob > chi2 =    0.3803
gsem allowed us to fit models on different subsets simultaneously. By default, the model is assumed to be a linear regression, but several links and families are available; for example, you can combine two Poisson models or a multinomial logistic model with a regular logistic model. See [SEM] sem and gsem for details.
Here, I use the vce(robust) option to replicate the results from suest. However, when the estimation samples don’t overlap, the results from the two estimations are assumed to be independent, and thus the vce(robust) option is not needed. When the estimation is performed without the vce(robust) option, the joint covariance matrix contains two blocks with the covariances from the original models and 0s outside those blocks.
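The block structure can be checked directly. The following sketch refits the joint model without vce(robust) and extracts the block of the covariance matrix that links the boys' coefficients to the girls' coefficients. It assumes that the submatrix can be selected by the equation names weightboy and weightgirl, as they appear in the coefficient legend; the matrix names V and C are mine.

```stata
* Sketch: inspect the cross-model block of the joint covariance matrix
webuse childweight, clear
generate weightboy  = weight if girl == 0
generate weightgirl = weight if girl == 1

quietly gsem (weightboy <- age c.age#c.age) (weightgirl <- age c.age#c.age), nolog

matrix V = e(V)

* Rows for the boys' coefficients, columns for the girls' coefficients
matrix C = V["weightboy:", "weightgirl:"]
matrix list C

* Largest absolute entry of the cross-model block (0 if block diagonal)
display mreldif(C, J(rowsof(C), colsof(C), 0))
```

With nonoverlapping samples and no robust variance estimate, every entry of C should be 0 up to numerical precision.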
The childweight dataset contains repeated measures, and in the documentation it is analyzed using the mixed command, which allows us to account for the intra-individual correlation via random effects.
Now, let’s use the techniques described above to combine results from two random-effects models. Here are the two separate models:
. mixed weight age c.age#c.age if girl == 0 || id:, nolog

Mixed-effects ML regression                     Number of obs      =       100
Group variable: id                              Number of groups   =        34

                                                Obs per group: min =         1
                                                               avg =       2.9
                                                               max =         5

                                                Wald chi2(2)       =   1070.28
Log likelihood = -149.05479                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   8.328882   .4601093    18.10   0.000     7.427084    9.230679
             |
 c.age#c.age |  -1.859798   .1722784   -10.80   0.000    -2.197458   -1.522139
             |
       _cons |   3.525929   .2723617    12.95   0.000      2.99211    4.059749
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Identity                 |
                  var(_cons) |   .7607779   .2439115      .4058409    1.426133
-----------------------------+------------------------------------------------
               var(Residual) |   .7225673   .1236759      .5166365    1.010582
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) =    30.34 Prob >= chibar2 = 0.0000

. mixed weight age c.age#c.age if girl == 1 || id:, nolog

Mixed-effects ML regression                     Number of obs      =        98
Group variable: id                              Number of groups   =        34

                                                Obs per group: min =         1
                                                               avg =       2.9
                                                               max =         5

                                                Wald chi2(2)       =   2141.72
Log likelihood =  -114.3008                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   7.273082   .3167266    22.96   0.000     6.652309    7.893854
             |
 c.age#c.age |  -1.538309    .118958   -12.93   0.000    -1.771462   -1.305156
             |
       _cons |   3.354834   .2111793    15.89   0.000      2.94093    3.768738
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Identity                 |
                  var(_cons) |   .6925554   .1967582       .396848    1.208606
-----------------------------+------------------------------------------------
               var(Residual) |   .3034231   .0535359      .2147152    .4287799
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) =    47.42 Prob >= chibar2 = 0.0000
Random effects can be included in a gsem model by incorporating latent variables at the group level; these are the latent variables M1[id] and M2[id] below. By default, gsem will try to estimate a covariance when it sees two latent variables at the same level. The two separate models contain no such covariance, so we constrain this term to 0. The vce(robust) option should be used whenever we want to reproduce the mechanism used by suest.
. gsem (weightboy <- age c.age#c.age M1[id]) /// > (weightgirl <- age c.age#c.age M2[id]), /// > cov(M1[id]*M2[id]@0) vce(robust) nolog Generalized structural equation model Number of obs = 198 Log pseudolikelihood = -263.35559 ( 1) [weightboy]M1[id] = 1 ( 2) [weightgirl]M2[id] = 1 (Std. Err. adjusted for clustering on id) ------------------------------------------------------------------------------- | Robust | Coef. Std. Err. z P>|z| [95% Conf. Interval] -----------------+------------------------------------------------------------- weightboy <- | age | 8.328882 .4211157 19.78 0.000 7.50351 9.154253 | c.age#c.age | -1.859798 .1591742 -11.68 0.000 -2.171774 -1.547823 | M1[id] | 1 (constrained) | _cons | 3.525929 .1526964 23.09 0.000 3.22665 3.825209 -----------------+------------------------------------------------------------- weightgirl <- | age | 7.273082 .3067378 23.71 0.000 6.671887 7.874277 | c.age#c.age | -1.538309 .120155 -12.80 0.000 -1.773808 -1.30281 | M2[id] | 1 (constrained) | _cons | 3.354834 .1482248 22.63 0.000 3.064319 3.64535 -----------------+------------------------------------------------------------- var(M1[id])| .7607774 .2255575 .4254915 1.360268 var(M2[id])| .6925553 .1850283 .4102429 1.169144 -----------------+------------------------------------------------------------- var(e.weightboy)| .7225674 .1645983 .4623572 1.129221 var(e.weightgirl)| .3034231 .0667975 .1970877 .4671298 -------------------------------------------------------------------------------
Above, we have the joint output from the two models, which would allow us to perform tests among parameters in both models. Notice that option vce(robust) implies that standard errors will be clustered on the groups determined by id.
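As an example of such a test, the sketch below repeats the comparison of the two intercepts, this time using the joint random-effects model. It refits the model shown above and refers to the parameters by the equation names weightboy and weightgirl; the choice of hypothesis is mine, picked to mirror the earlier suest example.

```stata
* Sketch: test equality of the intercepts in the joint random-effects model
webuse childweight, clear
generate weightboy  = weight if girl == 0
generate weightgirl = weight if girl == 1

quietly gsem (weightboy  <- age c.age#c.age M1[id])   ///
             (weightgirl <- age c.age#c.age M2[id]),  ///
             cov(M1[id]*M2[id]@0) vce(robust) nolog

test _b[weightboy:_cons] = _b[weightgirl:_cons]
```

This test uses standard errors clustered on id, so it accounts for the repeated measurements on each child.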
gsem, when called with the vce(robust) option, will complain if there are inconsistencies among the groups in the models (for example, if the random effects in both models were crossed).
In the previous model, gsem‘s default covariance structure included a term that wasn’t in the original two models, so we needed to include an additional restriction. This can be easy to spot in a simple model, but if you don’t want to rely just on a visual inspection, you can write a small loop to make sure that all the estimates in the joint model are actually also in the original models.
Let’s see an example with random effects, this time with overlapping data.
. *fit first model and save the estimates
. gsem (weightboy <- age c.age#c.age M1[id]), nolog

Generalized structural equation model             Number of obs   =        100
Log likelihood = -149.05479

 ( 1)  [weightboy]M1[id] = 1
----------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
weightboy <-     |
             age |   8.328882   .4609841    18.07   0.000     7.425369    9.232394
                 |
     c.age#c.age |  -1.859798   .1725233   -10.78   0.000    -2.197938   -1.521659
                 |
          M1[id] |          1  (constrained)
                 |
           _cons |   3.525929   .2726322    12.93   0.000      2.99158    4.060279
-----------------+----------------------------------------------------------------
     var(M1[id]) |   .7607774   .2439114                      .4058407    1.426132
-----------------+----------------------------------------------------------------
var(e.weightboy) |   .7225674   .1236759                      .5166366    1.010582
----------------------------------------------------------------------------------

. mat b1 = e(b)

. *fit second model and save the estimates
. gsem (weight <- age M2[id]), nolog

Generalized structural equation model             Number of obs   =        198
Log likelihood = -348.32402

 ( 1)  [weight]M2[id] = 1
-------------------------------------------------------------------------------
              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
weight <-     |
          age |   3.389281   .1152211    29.42   0.000     3.163452    3.615111
              |
       M2[id] |          1  (constrained)
              |
        _cons |   5.156913   .1803059    28.60   0.000      4.80352    5.510306
--------------+----------------------------------------------------------------
  var(M2[id]) |   .6076662   .2040674                      .3146395    1.173591
--------------+----------------------------------------------------------------
var(e.weight) |   1.524052   .1866496                      1.198819    1.937518
-------------------------------------------------------------------------------

. mat b2 = e(b)

. *stack estimates from first and second models
. mat stacked = b1, b2

. *estimate joint model and save results
. gsem (weightboy <- age c.age#c.age M1[id]) ///
>      (weight <- age M2[id]), cov(M1[id]*M2[id]@0) vce(robust) nolog

Generalized structural equation model             Number of obs   =        198
Log pseudolikelihood = -497.37881

 ( 1)  [weightboy]M1[id] = 1
 ( 2)  [weight]M2[id] = 1
                                        (Std. Err. adjusted for clustering on id)
----------------------------------------------------------------------------------
                 |               Robust
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
weightboy <-     |
             age |   8.328882   .4211157    19.78   0.000      7.50351    9.154253
                 |
     c.age#c.age |  -1.859798   .1591742   -11.68   0.000    -2.171774   -1.547823
                 |
          M1[id] |          1  (constrained)
                 |
           _cons |   3.525929   .1526964    23.09   0.000      3.22665    3.825209
-----------------+----------------------------------------------------------------
weight <-        |
             age |   3.389281   .1157835    29.27   0.000      3.16235    3.616213
                 |
          M2[id] |          1  (constrained)
                 |
           _cons |   5.156913   .1345701    38.32   0.000      4.89316    5.420665
-----------------+----------------------------------------------------------------
     var(M1[id]) |   .7607774   .2255575                      .4254915    1.360268
     var(M2[id]) |   .6076662      .1974                      .3214791    1.148623
-----------------+----------------------------------------------------------------
var(e.weightboy) |   .7225674   .1645983                      .4623572    1.129221
   var(e.weight) |   1.524052   .1705637                      1.223877    1.897849
----------------------------------------------------------------------------------

. mat b = e(b)

. *verify that estimates from the joint model are the same as
. *from models 1 and 2
. local stripes : colfullnames(b)

. foreach l of local stripes{
  2.    matrix r1 = b[1,"`l'"]
  3.    matrix r2 = stacked[1,"`l'"]
  4.    assert reldif(el(r1,1,1), el(r2,1,1))<1e-5
  5. }
The loop above verifies that every column label in the joint model corresponds to an estimate in the stacked vector from the two separate models and that the estimates are actually the same. If you omit the restriction on the covariance between M1[id] and M2[id] in the joint model, the assert command will produce an error.
As documented in [U] 20.21.2 Correlated errors: Cluster-robust standard errors, the formula for the robust estimator of the variance is
\[
V_{robust} = \hat V \left(\sum_{j=1}^N u_j' u_j\right) \hat V
\]
where \(N\) is the number of observations, \(\hat V\) is the conventional estimator of the variance, and for each observation \(j\), \(u_j\) is a row vector (with as many columns as parameters), which represents the contribution of this observation to the gradient. (If we stack the rows \(u_j\), the columns of this matrix are the scores.)
When we apply suest, the matrix \(\hat V\) is constructed as the stacked block-diagonal conventional variance estimates from the original submodels; this is the variance you will see if you apply gsem to the joint model without the vce(robust) option. The \(u_j\) values used by suest are now the values from both estimations, so we have as many \(u_j\) values as the sum of observations in the two original models and each row contains as many columns as the total number of parameters in both models. This is the exact operation that gsem, vce(robust) does.
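To make the stacking concrete, the construction can be written out for two submodels. The subscripts 1 and 2 are labels introduced here for illustration only; they are not the manual's notation.

\[
\hat V = \begin{pmatrix} \hat V_1 & 0 \\ 0 & \hat V_2 \end{pmatrix}, \qquad u_j = \begin{pmatrix} u_{1j} & u_{2j} \end{pmatrix}
\]

Here \(\hat V_1\) and \(\hat V_2\) are the conventional variance estimates from the two submodels, and \(u_{1j}\) and \(u_{2j}\) are the score rows that observation \(j\) contributes to each submodel. When an observation is used in only one of the models, its block for the other model is a row of zeros.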
When random effects are present, standard errors will be clustered on groups. Instead of observation-level contributions to the gradient, we would use cluster-level contributions. This means that observations in the two models would need to be clustered in a consistent manner; observations that are common to the two estimations would need to be in the same cluster in the two estimations.
We simulate data all the time at StataCorp and for a variety of reasons.
One reason is that real datasets that include the features we would like are often difficult to find. We prefer to use real datasets in the manual examples, but sometimes that isn’t feasible and so we create simulated datasets.
We also simulate data to check the coverage probabilities of new estimators in Stata. Sometimes the formulae published in books and papers contain typographical errors. Sometimes the asymptotic properties of estimators don’t hold under certain conditions. And every once in a while, we make coding mistakes. We run simulations during development to verify that a 95% confidence interval really is a 95% confidence interval.
Simulated data can also come in handy for presentations, teaching purposes, and calculating statistical power using simulations for complex study designs.
And, simulating data is just plain fun once you get the hang of it.
Some of you will recall Vince Wiggins’s blog entry from 2011 entitled “Multilevel random effects in xtmixed and sem — the long and wide of it” in which he simulated a three-level dataset. I’m going to elaborate on how Vince simulated multilevel data, and then I’ll show you some useful variations. Specifically, I’m going to talk about how to simulate single-level data, two- and three-level data, covariates, longitudinal data with random slopes, and longitudinal data with structured errors.
Let’s begin by simulating a trivially simple, single-level dataset that has the form
\[y_i = 70 + e_i\]
We will assume that e is normally distributed with mean zero and variance \(\sigma^2\).
We want to simulate 500 observations, so let’s begin by clearing Stata’s memory and setting the number of observations to 500.
. clear
. set obs 500
Next, let’s create a variable named e that contains pseudorandom normally distributed data with mean zero and standard deviation 5:
. generate e = rnormal(0,5)
The variable e is our error term, so we can create an outcome variable y by typing
. generate y = 70 + e
. list y e in 1/5

     +----------------------+
     |        y           e |
     |----------------------|
  1. | 78.83927     8.83927 |
  2. | 69.97774   -.0222647 |
  3. | 69.80065   -.1993514 |
  4. | 68.11398    -1.88602 |
  5. | 63.08952   -6.910483 |
     +----------------------+
We can fit a linear regression for the variable y to determine whether our parameter estimates are reasonably close to the parameters we specified when we simulated our dataset:
. regress y

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  0,   499) =    0.00
       Model |           0     0           .           Prob > F      =       .
    Residual |  12188.8118   499  24.4264766           R-squared     =  0.0000
-------------+------------------------------           Adj R-squared =  0.0000
       Total |  12188.8118   499  24.4264766           Root MSE      =  4.9423

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   69.89768    .221027   316.24   0.000     69.46342    70.33194
------------------------------------------------------------------------------
The estimate of _cons is 69.9, which is very close to 70, and the Root MSE of 4.9 is equally close to the error’s standard deviation of 5. The parameter estimates will not be exactly equal to the underlying parameters we specified when we created the data because we introduced randomness with the rnormal() function.
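Because rnormal() draws pseudorandom numbers, rerunning the commands above gives different values each time. A minimal sketch of a reproducible version, written for a do-file, follows. The seed value 12345 is an arbitrary choice made here for illustration; it is not the seed used to produce the output above, so the numbers will differ.

```stata
* reproducible version of the single-level simulation
clear
set seed 12345              // arbitrary seed, for illustration only
set obs 500
generate e = rnormal(0,5)   // errors with mean 0 and standard deviation 5
generate y = 70 + e
summarize y                 // mean should be near 70, Std. Dev. near 5
```

Rerunning the block with the same seed reproduces the same dataset; changing the seed gives a new draw from the same model.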
This simple example is just to get us started before we work with multilevel data. For familiarity, let’s fit the same model with the mixed command that we will be using later:
. mixed y, stddev

Mixed-effects ML regression                     Number of obs      =       500

                                                Wald chi2(0)       =         .
Log likelihood = -1507.8857                     Prob > chi2        =         .

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   69.89768   .2208059   316.56   0.000     69.46491    70.33045
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
                sd(Residual) |    4.93737   .1561334      4.640645    5.253068
------------------------------------------------------------------------------
The output is organized with the parameter estimates for the fixed part in the top table and the estimated standard deviations for the random effects in the bottom table. Just as previously, the estimate of _cons is 69.9, and the estimate of the standard deviation of the residuals is 4.9.
Okay. That really was trivial, wasn’t it? Simulating two- and three-level data is almost as easy.
I posted a blog entry last year titled “Multilevel linear models in Stata, part 1: Components of variance”. In that posting, I showed a diagram for a residual of a three-level model.
The equation for the variance-components model I fit had the form
\[y_{ijk} = \mu + u_{i..} + u_{ij.} + e_{ijk}\]
This model had three residuals, whereas the one-level model we just fit above had only one.
This time, let’s start with a two-level model. Let’s simulate a two-level dataset, a model for children nested within classrooms. We’ll index classrooms by i and children by j. The model is
\[y_{ij} = \mu + u_{i.} + e_{ij}\]
For this toy model, let’s assume two classrooms with two students per classroom, meaning that we want to create a four-observation dataset, where the observations are students.
To create this four-observation dataset, we start by creating a two-observation dataset, where the observations are classrooms. Because there are two classrooms, we type
. clear
. set obs 2
. generate classroom = _n
From now on, we’ll refer to i as classroom. It’s easier to remember what variables mean if they have meaningful names.
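Next, we’ll create a variable that contains each classroom’s random effect \(u_i\), which we’ll assume follows an N(0,3) distribution. Here and below, the second argument is the standard deviation, matching the arguments of rnormal().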
. generate u_i = rnormal(0,3)
. list

     +----------------------+
     | classr~m         u_i |
     |----------------------|
  1. |        1    .7491351 |
  2. |        2   -.0031386 |
     +----------------------+
We can now expand our data to include two children per classroom by typing
. expand 2
. list

     +----------------------+
     | classr~m         u_i |
     |----------------------|
  1. |        1    .7491351 |
  2. |        2   -.0031386 |
  3. |        1    .7491351 |
  4. |        2   -.0031386 |
     +----------------------+
Now, we can think of our observations as being students. We can create a child ID (we’ll call it child rather than j), and we can create each child’s residual \(e_{ij}\), which we will assume has an N(0,5) distribution:
. bysort classroom: generate child = _n
. generate e_ij = rnormal(0,5)
. list

     +------------------------------------------+
     | classr~m         u_i   child        e_ij |
     |------------------------------------------|
  1. |        1    .7491351       1    2.832674 |
  2. |        1    .7491351       2    1.487452 |
  3. |        2   -.0031386       1    6.598946 |
  4. |        2   -.0031386       2   -.3605778 |
     +------------------------------------------+
We now have nearly all the ingredients to calculate \(y_{ij}\):
\(y_{ij} = \mu + u_{i.} + e_{ij}\)
We’ll assume mu is 70. We type
. generate y = 70 + u_i + e_ij
. list y classroom u_i child e_ij, sepby(classroom)

     +-----------------------------------------------------+
     |        y   classr~m         u_i   child        e_ij |
     |-----------------------------------------------------|
  1. | 73.58181          1    .7491351       1    2.832674 |
  2. | 72.23659          1    .7491351       2    1.487452 |
     |-----------------------------------------------------|
  3. | 76.59581          2   -.0031386       1    6.598946 |
  4. | 69.63628          2   -.0031386       2   -.3605778 |
     +-----------------------------------------------------+
Note that the random effect u_i is the same within each classroom, and each child has a different value for e_ij.
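It can help to write down what this setup implies for the variance of y. The sketch below assumes that the second arguments of N(0,3) and N(0,5) are standard deviations, as in rnormal(), and that \(u_i\) and \(e_{ij}\) are independent, which is how they were generated. The symbols \(\sigma_u\) and \(\sigma_e\) are labels introduced here for those two standard deviations.

\[
Var(y_{ij}) = \sigma_u^2 + \sigma_e^2 = 3^2 + 5^2 = 34, \qquad Corr(y_{ij}, y_{ij'}) = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} = \frac{9}{34} \approx 0.26 \quad (j \ne j')
\]

So the outcomes of two different children in the same classroom have a correlation (the intraclass correlation) of about 0.26, while children in different classrooms are uncorrelated.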
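Our strategy was simple: start at the top level of the hierarchy, create that level’s ID variable and random effect, expand the data by the number of units nested within each group, and repeat until we reach the bottom level.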
Let’s try this recipe for three-level data where children are nested within classrooms which are nested within schools. This time, I will index schools with i, classrooms with j, and children with k so that my model is
\[y_{ijk} = \mu + u_{i..} + u_{ij.} + e_{ijk}\]
where
\(u_{i..}\) ~ N(0,2)
\(u_{ij.}\) ~ N(0,3)
\(e_{ijk}\) ~ N(0,5)
Let’s create data for
(level 3, i) 2 schools
(level 2, j) 2 classrooms in each school
(level 1, k) 2 students in most classrooms; 3 students in i==2 & j==2
Begin by creating the level-three data for the two schools:
. clear
. set obs 2
. generate school = _n
. generate u_i = rnormal(0,2)
. list school u_i

     +--------------------+
     | school         u_i |
     |--------------------|
  1. |      1    3.677312 |
  2. |      2   -3.193004 |
     +--------------------+
Next, we expand the data so that we have two classrooms nested within each of the schools, and we create each classroom’s random effect:
. expand 2
. bysort school: generate classroom = _n
. generate u_ij = rnormal(0,3)
. list school u_i classroom u_ij, sepby(school)

     +-------------------------------------------+
     | school         u_i   classr~m        u_ij |
     |-------------------------------------------|
  1. |      1    3.677312          1    .9811059 |
  2. |      1    3.677312          2   -3.482453 |
     |-------------------------------------------|
  3. |      2   -3.193004          1   -4.107915 |
  4. |      2   -3.193004          2   -2.450383 |
     +-------------------------------------------+
Finally, we expand the data so that we have three students in school 2’s classroom 2, and two students in all the other classrooms. Sorry for that complication, but I wanted to show you how to create unbalanced data.
In the previous examples, we’ve been typing things like expand 2, meaning double the observations. In this case, we need to do something different for school 2, classroom 2, namely,
. expand 3 if school==2 & classroom==2
and then we can just expand the rest:
. expand 2 if !(school==2 & classroom==2)
Obviously, in a real simulation, you would probably want 16 to 25 students in each classroom. You could do something like that by typing
. expand 16+int((25-16+1)*runiform())
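The expression 16+int((25-16+1)*runiform()) yields integers from 16 through 25 because runiform() is always less than 1, so int(10*runiform()) takes the values 0 through 9. The sketch below checks this on a scratch dataset of 60 hypothetical classrooms; the variable names size_a and size_b and the seed are made up for this illustration. It also shows runiformint(), which, assuming Stata 14 or later, draws the same kind of random integer directly.

```stata
clear
set seed 1357                                      // arbitrary seed
set obs 60                                         // 60 hypothetical classrooms
generate size_a = 16 + int((25-16+1)*runiform())   // formula used in the text
generate size_b = runiformint(16, 25)              // assumes Stata 14 or later
assert inrange(size_a, 16, 25)                     // stops with an error if violated
assert inrange(size_b, 16, 25)
tabulate size_a
```

Either expression could be placed after expand to give each classroom its own random number of students.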
In any case, we will type
. expand 3 if school==2 & classroom==2
. expand 2 if !(school==2 & classroom==2)
. bysort school classroom: generate child = _n
. generate e_ijk = rnormal(0,5)
. generate y = 70 + u_i + u_ij + e_ijk
. list y school u_i classroom u_ij child e_ijk, sepby(classroom)

     +--------------------------------------------------------------------------+
     |        y   school         u_i   classr~m        u_ij   child       e_ijk |
     |--------------------------------------------------------------------------|
  1. | 76.72794        1    3.677312          1    .9811059       1    2.069526 |
  2. | 69.81315        1    3.677312          1    .9811059       2   -4.845268 |
     |--------------------------------------------------------------------------|
  3. | 74.09565        1    3.677312          2   -3.482453       1    3.900788 |
  4. | 71.50263        1    3.677312          2   -3.482453       2    1.307775 |
     |--------------------------------------------------------------------------|
  5. | 64.86206        2   -3.193004          1   -4.107915       1    2.162977 |
  6. | 61.80236        2   -3.193004          1   -4.107915       2   -.8967164 |
     |--------------------------------------------------------------------------|
  7. | 66.65285        2   -3.193004          2   -2.450383       1    2.296242 |
  8. | 49.96139        2   -3.193004          2   -2.450383       2   -14.39522 |
  9. | 64.41605        2   -3.193004          2   -2.450383       3    .0594433 |
     +--------------------------------------------------------------------------+
Regardless of how we generate the data, we must ensure that the school-level random effects u_i are the same within school and the classroom-level random effects u_ij are the same within classroom.
We concocted the example above to produce a dataset that would be easy to list. Let’s now create a dataset that is more reasonable:
\[y_{ijk} = \mu + u_{i..} + u_{ij.} + e_{ijk}\]
where
\(u_{i..}\) ~ N(0,2)
\(u_{ij.}\) ~ N(0,3)
\(e_{ijk}\) ~ N(0,5)
Let’s create data for
(level 3, i) 6 schools
(level 2, j) 10 classrooms in each school
(level 1, k) 16-25 students
. clear
. set obs 6
. generate school = _n
. generate u_i = rnormal(0,2)
. expand 10
. bysort school: generate classroom = _n
. generate u_ij = rnormal(0,3)
. expand 16+int((25-16+1)*runiform())
. bysort school classroom: generate child = _n
. generate e_ijk = rnormal(0,5)
. generate y = 70 + u_i + u_ij + e_ijk
We can use the mixed command to fit the model with our simulated data.
. mixed y || school: || classroom: , stddev

Mixed-effects ML regression                     Number of obs      =      1217

-----------------------------------------------------------
                |   No. of       Observations per Group
 Group Variable |   Groups    Minimum    Average    Maximum
----------------+------------------------------------------
         school |        6        197      202.8        213
      classroom |       60         16       20.3         25
-----------------------------------------------------------

                                                Wald chi2(0)       =         .
Log likelihood = -3710.0673                     Prob > chi2        =         .

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   70.25941   .9144719    76.83   0.000     68.46707    72.05174
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
school: Identity             |
                   sd(_cons) |   2.027064   .7159027      1.014487    4.050309
-----------------------------+------------------------------------------------
classroom: Identity          |
                   sd(_cons) |   2.814152   .3107647       2.26647    3.494178
-----------------------------+------------------------------------------------
                sd(Residual) |   4.828923   .1003814      4.636133     5.02973
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =   379.37   Prob > chi2 = 0.0000
The parameter estimates from our simulated data match the parameters used to create the data pretty well: the estimate for _cons is 70.3, which is near 70; the estimated standard deviation for the school-level random effects is 2.0, which is near 2; the estimated standard deviation for the classroom-level random effects is 2.8, which is near 3; and the estimated standard deviation for the individual-level residuals is 4.8, which is near 5.
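Another way to read these estimates is through the intraclass correlations they imply. With standard deviations of 2, 3, and 5, the total variance is 4 + 9 + 25 = 38, so the simulation parameters imply a school-level intraclass correlation of 4/38, or about 0.11, and a classroom-within-school correlation of (4+9)/38, or about 0.34. The sketch below regenerates the data and asks for the estimated versions, assuming estat icc is available after mixed in your version of Stata. The seed is arbitrary, so the numbers will not match the output above, and with only six schools the school-level estimate will be noisy.

```stata
clear
set seed 2014                            // arbitrary seed
set obs 6
generate school = _n
generate u_i = rnormal(0,2)              // school-level random effect
expand 10
bysort school: generate classroom = _n
generate u_ij = rnormal(0,3)             // classroom-level random effect
expand 16+int((25-16+1)*runiform())      // 16 to 25 students per classroom
generate e_ijk = rnormal(0,5)            // student-level residual
generate y = 70 + u_i + u_ij + e_ijk
mixed y || school: || classroom:, stddev
estat icc                                // estimated intraclass correlations
```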
We’ve just done one reasonable simulation.
If we wanted to do a full simulation, we would need to do the above 100, 1,000, 10,000, or more times. We would put our code in a loop. And in that loop, we would keep track of whatever parameter interested us.
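One way to sketch such a loop is with Stata's simulate command, which runs a program repeatedly and collects the results it returns. The program name sim3level, the seed, the choice of 200 replications, and the variable name covered are all choices made here for illustration; this is a minimal sketch for a do-file, not the code we use in development. It tracks the estimate of _cons and checks how often a nominal 95% interval covers the true value of 70.

```stata
capture program drop sim3level
program define sim3level, rclass
    // one replication: simulate three-level data and fit the model
    clear
    set obs 6
    generate school = _n
    generate u_i = rnormal(0,2)
    expand 10
    bysort school: generate classroom = _n
    generate u_ij = rnormal(0,3)
    expand 16+int((25-16+1)*runiform())
    generate e_ijk = rnormal(0,5)
    generate y = 70 + u_i + u_ij + e_ijk
    quietly mixed y || school: || classroom:
    return scalar b  = _b[y:_cons]       // estimated intercept
    return scalar se = _se[y:_cons]      // its standard error
end

set seed 12345                           // arbitrary seed
simulate b=r(b) se=r(se), reps(200) nodots: sim3level

// 1 if the normal-based 95% interval contains the true value of 70
generate byte covered = abs(b - 70) <= invnormal(0.975)*se
summarize b se covered
```

The mean of covered estimates the coverage probability. With so few schools, it would not be surprising to see coverage somewhat below 95%, and an occasional replication may have trouble converging; a real study would use more replications and handle such failures explicitly.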
Usually, we’re more interested in estimating the effects of the covariates than in estimating the variance of the random effects. Covariates are typically binary (such as male/female), categorical (such as race), ordinal (such as education level), or continuous (such as age).
Let’s add some covariates to our simulated data. Our model is
\[y_{ijk} = \mu + u_{i..} + u_{ij.} + e_{ijk}\]
where
\(u_{i..}\) ~ N(0,2)
\(u_{ij.}\) ~ N(0,3)
\(e_{ijk}\) ~ N(0,5)
We create data for
(level 3, i) 6 schools
(level 2, j) 10 classrooms in each school
(level 1, k) 16-25 students
Let’s add to this model
(level 3, school i) whether the school is in an urban environment
(level 2, classroom j) teacher’s experience (years)
(level 1, student k) student’s mother’s education level
We can create a binary covariate called urban at the school level that equals 1 if the school is located in an urban area and equals 0 otherwise.
. clear
. set obs 6
. generate school = _n
. generate u_i = rnormal(0,2)
. generate urban = runiform()<0.50
Here we assigned schools to one of the two groups with equal probability (runiform()<0.50), but we could have assigned 70% of the schools to be urban by typing
. generate urban = runiform()<0.70
At the classroom level, we could add a continuous covariate for the teacher's years of experience. We could generate this variable by using any of Stata's random-number functions (see help random_number_functions). In the example below, I've generated teacher's years of experience from a discrete uniform distribution ranging from 5 to 20 years.
. expand 10
. bysort school: generate classroom = _n
. generate u_ij = rnormal(0,3)
. bysort school: generate teach_exp = 5+int((20-5+1)*runiform())
When we summarize our data, we see that teaching experience ranges from 6 to 20 years with an average of 13 years.
. summarize teach_exp

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   teach_exp |        60    13.21667    4.075939          6         20
At the child level, we could add a categorical/ordinal covariate for mother's highest level of education completed. After we expand the data and create the child ID and error variables, we can generate a uniformly distributed random variable, temprand, on the interval [0,1].
. expand 16+int((25-16+1)*runiform())
. bysort school classroom: generate child = _n
. generate e_ijk = rnormal(0,5)
. generate temprand = runiform()
We can assign children to different groups by using the egen command with cutpoints. In the example below, children whose value of temprand is in the interval [0,0.5) will be assigned to mother_educ==0, children whose value of temprand is in the interval [0.5,0.9) will be assigned to mother_educ==1, and children whose value of temprand is in the interval [0.9,1) will be assigned to mother_educ==2.
. egen mother_educ = cut(temprand), at(0,0.5, 0.9, 1) icodes
. label define mother_educ 0 "HighSchool" 1 "College" 2 ">College"
. label values mother_educ mother_educ
The resulting frequencies of each category are very close to the frequencies we specified in our egen command.
. tabulate mother_educ, generate(meduc)

mother_educ |      Freq.     Percent        Cum.
------------+-----------------------------------
 HighSchool |        602       50.17       50.17
    College |        476       39.67       89.83
   >College |        122       10.17      100.00
------------+-----------------------------------
      Total |      1,200      100.00
We used the option generate(meduc) in the tabulate command above to create indicator variables for each category of mother_educ. This will allow us to specify an effect size for each category when we create our outcome variable.
. summarize meduc*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      meduc1 |      1200    .5016667    .5002057          0          1
      meduc2 |      1200    .3966667    .4894097          0          1
      meduc3 |      1200    .1016667    .3023355          0          1
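The assignment of children to education groups can also be sketched without egen. The version below uses the irecode() function, which returns 0 for values at or below the first cutpoint, 1 for values above the first cutpoint and at or below the second, and so on. It runs on a scratch dataset of 1,200 hypothetical children; the variable name mother_educ2 and the seed are made up for this illustration. The only difference from cut() is which group receives a value exactly equal to a cutpoint.

```stata
clear
set seed 97531                                   // arbitrary seed
set obs 1200                                     // 1,200 hypothetical children
generate temprand = runiform()
// 0 if temprand<=0.5, 1 if 0.5<temprand<=0.9, 2 if temprand>0.9
generate byte mother_educ2 = irecode(temprand, 0.5, 0.9)
label define mother_educ2 0 "HighSchool" 1 "College" 2 ">College"
label values mother_educ2 mother_educ2
tabulate mother_educ2
```

The tabulated percentages should again be close to 50%, 40%, and 10%.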
Now, we can create an outcome variable called score by adding all our fixed and random effects together. We can specify an effect size (regression coefficient) for each fixed effect in our model.
. generate score = 70 + (-2)*urban + 1.5*teach_exp + 0*meduc1 + 2*meduc2 + 5*meduc3 + u_i + u_ij + e_ijk
I have specified that the grand mean is 70, urban schools will have scores 2 points lower than nonurban schools, and each year of teacher's experience will add 1.5 points to the student's score.
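Written as an equation, the generate command above corresponds to the following model, where the variable names themselves are used as symbols and the subscripts show the level at which each covariate varies:

\[
score_{ijk} = 70 - 2\,\text{urban}_{i} + 1.5\,\text{teach\_exp}_{ij} + 2\,\text{meduc2}_{ijk} + 5\,\text{meduc3}_{ijk} + u_{i..} + u_{ij.} + e_{ijk}
\]

The term 0*meduc1 contributes nothing to the sum, so it does not appear in the equation.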
Mothers whose highest level of education was high school (meduc1==1) will serve as the referent category for mother_educ (mother_educ==0). The scores of children whose mother completed college (meduc2==1 and mother_educ==1) will be 2 points higher than the children in the referent group. And the scores of children whose mother completed more than college (meduc3==1 and mother_educ==2) will be 5 points higher than the children in the referent group.
Now, we can use the mixed command to fit a model to our simulated data. We used the indicator variables meduc1-meduc3 to create the data, but we will use the factor variable i.mother_educ to fit the model.
. mixed score urban teach_exp i.mother_educ || school: || classroom: , stddev baselevel

Mixed-effects ML regression                     Number of obs      =      1259

-----------------------------------------------------------
                |   No. of       Observations per Group
 Group Variable |   Groups    Minimum    Average    Maximum
----------------+------------------------------------------
         school |        6        200      209.8        217
      classroom |       60         16       21.0         25
-----------------------------------------------------------

                                                Wald chi2(4)       =    387.64
Log likelihood = -3870.5395                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       score |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       urban |  -2.606451    2.07896    -1.25   0.210    -6.681138    1.468237
   teach_exp |   1.584759    .096492    16.42   0.000     1.395638     1.77388
             |
 mother_educ |
 HighSchool  |          0  (base)
    College  |   2.215281   .3007208     7.37   0.000     1.625879    2.804683
   >College  |   5.065907   .5237817     9.67   0.000     4.039314      6.0925
             |
       _cons |   68.95018   2.060273    33.47   0.000     64.91212    72.98824
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
school: Identity             |
                   sd(_cons) |   2.168154   .7713944      1.079559    4.354457
-----------------------------+------------------------------------------------
classroom: Identity          |
                   sd(_cons) |    3.06871   .3320171      2.482336    3.793596
-----------------------------+------------------------------------------------
                sd(Residual) |   4.947779   .1010263      4.753681    5.149802
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =   441.25   Prob > chi2 = 0.0000
"Close" is in the eye of the beholder, but to my eyes, the parameter estimates look remarkably close to the parameters that were used to simulate the data. The parameter estimates for the fixed part of the model are -2.6 for urban (parameter = -2), 1.6 for teach_exp (parameter = 1.5), 2.2 for the College category of mother_educ (parameter = 2), 5.1 for the >College category of mother_educ (parameter = 5), and 69.0 for the intercept (parameter = 70). The estimated standard deviations for the random effects are also very close to the simulation parameters. The estimated standard deviation is 2.2 (parameter = 2) at the school level, 3.1 (parameter = 3) at the classroom level, and 4.9 (parameter = 5) at the child level.
Some of you may disagree that the parameter estimates are close. My reply is that it doesn't matter unless you're simulating a single dataset for demonstration purposes. If you are, simply simulate more datasets until you get one that looks close enough for you. If you are simulating data to check coverage probabilities or to estimate statistical power, you will be averaging over thousands of simulated datasets and the results of any one of those datasets won't matter.
Longitudinal data are often conceptualized as multilevel data where the repeated observations are nested within individuals. The main difference between ordinary multilevel models and multilevel models for longitudinal data is the inclusion of a random slope. If you are not familiar with random slopes, you can learn more about them in a blog entry I wrote last year (Multilevel linear models in Stata, part 2: Longitudinal data).
Simulating longitudinal data with a random slope is much like simulating two-level data, with a couple of modifications. First, the bottom level will be observations within person. Second, there will be an interaction between time (age) and a person-level random effect. So we will generate data for the following model:
\[weight_{ij} = \mu + \beta\,age_{ij} + u_{0i.} + age_{ij}\,u_{1i.} + e_{ij}\]
where \(\mu\) is the fixed intercept, \(\beta\) is the fixed slope for age, and
\(u_{0i.}\) ~ N(0,3)
\(u_{1i.}\) ~ N(0,1)
\(e_{ij}\) ~ N(0,2)
Let's begin by simulating longitudinal data for 300 people.
. clear
. set obs 300
. gen person = _n
For longitudinal data, we must create two person-level random effects: the variable u_0i is analogous to the random effect we created earlier, and the variable u_1i is the random effect for the slope over time.
. generate u_0i = rnormal(0,3)
. generate u_1i = rnormal(0,1)
Let's expand the data so that there are five observations nested within each person. Rather than create an observation-level identification number, let's create a variable for age that ranges from 12 to 16 years,
. expand 5
. bysort person: generate age = _n + 11
and create an observation-level error term from an N(0,2) distribution:
. generate e_ij = rnormal(0,2)
. list person u_0i u_1i age e_ij if person==1

     +--------------------------------------------------+
     | person        u_0i        u_1i   age        e_ij |
     |--------------------------------------------------|
  1. |      1    .9338312   -.3097848    12    1.172153 |
  2. |      1    .9338312   -.3097848    13    2.935366 |
  3. |      1    .9338312   -.3097848    14   -2.306981 |
  4. |      1    .9338312   -.3097848    15   -2.148335 |
  5. |      1    .9338312   -.3097848    16   -.4276625 |
     +--------------------------------------------------+
The person-level random effects u_0i and u_1i are the same at all ages, and the observation-level random effects e_ij are different at each age. Now we're ready to generate an outcome variable called weight, measured in kilograms, based on the following model:
\[weight_{ij} = 3 + 3.6*age_{ij} + u_{0i} + age*u_{1i} + e_{ij}\]
. generate weight = 3 + 3.6*age + u_0i + age*u_1i + e_ij
The random effect u_1i is multiplied by age, which is why it is called a random slope. We could rewrite the model as
\[weight_{ij} = 3 + age_{ij}*(3.6 + u_{1i}) + u_{0i} + e_{ij}\]
Note that for each year of age, a person's weight will increase by 3.6 kilograms plus some random amount specified by u_1i. In other words, the slope for age will be slightly different for each person.
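One consequence of the random slope is that the spread of weight across people grows with age. A short derivation, assuming \(u_{0i}\), \(u_{1i}\), and \(e_{ij}\) are independent (which is how they were generated above) and writing \(\sigma_{u0}\), \(\sigma_{u1}\), and \(\sigma_e\) as labels of my own for their standard deviations of 3, 1, and 2:

\[
Var(weight_{ij} \mid age_{ij}) = \sigma_{u0}^2 + age_{ij}^2\,\sigma_{u1}^2 + \sigma_e^2 = 9 + age_{ij}^2 + 4
\]

At age 12 this is 9 + 144 + 4 = 157, and at age 16 it is 9 + 256 + 4 = 269.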
We can use the mixed command to fit a model to our data:
. mixed weight age || person: age , stddev

Mixed-effects ML regression                     Number of obs      =      1500
Group variable: person                          Number of groups   =       300

                                                Obs per group: min =         5
                                                               avg =       5.0
                                                               max =         5

                                                Wald chi2(1)       =   3035.03
Log likelihood = -3966.3842                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   3.708161   .0673096    55.09   0.000     3.576237    3.840085
       _cons |   2.147311   .5272368     4.07   0.000     1.113946    3.180676
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
person: Independent          |
                     sd(age) |   .9979648   .0444139      .9146037    1.088924
                   sd(_cons) |    3.38705   .8425298      2.080103    5.515161
-----------------------------+------------------------------------------------
                sd(Residual) |   1.905885   .0422249      1.824897    1.990468
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =  4366.32   Prob > chi2 = 0.0000
The estimate for the intercept, _cons = 2.1, is not very close to the original parameter value of 3 (although 3 does lie within its 95% confidence interval), but the estimate of 3.7 for age is very close (parameter = 3.6). The standard deviations of the random effects are also close to the parameters used to simulate the data. The estimate for the person-level _cons is 3.4 (parameter = 3), the person-level slope is 0.998 (parameter = 1), and the observation-level residual is 1.9 (parameter = 2).
Longitudinal data often have an autoregressive pattern to their errors because of the sequential collection of the observations. Measurements taken closer together in time will be more similar than measurements taken further apart in time. There are many patterns that can be used to describe the correlation among the errors, including autoregressive, moving average, banded, exponential, Toeplitz, and others (see help mixed##rspec).
Let's simulate a dataset where the errors have a Toeplitz structure, which I will define below.
We begin by creating a sample with 300 people with a person-level random effect having an N(0,2) distribution.
. clear
. set obs 300
. gen person = _n
. generate u_i = rnormal(0,2)
Next, we can use the drawnorm command to create error variables with a Toeplitz pattern.
A Toeplitz 1 correlation matrix has the following structure:
. matrix V = ( 1.0, 0.5, 0.0, 0.0, 0.0 \ ///
               0.5, 1.0, 0.5, 0.0, 0.0 \ ///
               0.0, 0.5, 1.0, 0.5, 0.0 \ ///
               0.0, 0.0, 0.5, 1.0, 0.5 \ ///
               0.0, 0.0, 0.0, 0.5, 1.0 )
. matrix list V

symmetric V[5,5]
    c1  c2  c3  c4  c5
r1   1
r2  .5   1
r3   0  .5   1
r4   0   0  .5   1
r5   0   0   0  .5   1
The correlation matrix has 1s on the main diagonal, and each pair of contiguous observations will have a correlation of 0.5. Observations more than 1 unit of time away from each other are assumed to be uncorrelated.
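Typing the matrix by hand becomes tedious with more time points, and drawnorm needs a valid (positive semidefinite) matrix. The sketch below builds the same Toeplitz 1 matrix in a loop and inspects its eigenvalues before any data are drawn. It is meant to be run from a do-file; the local macro names T, rho, and s and the matrix names X and lambda are made up for this illustration.

```stata
local T   = 5                       // number of time points
local rho = 0.5                     // correlation between adjacent time points
matrix V = I(`T')                   // start from the identity matrix
forvalues t = 2/`T' {
    local s = `t' - 1
    matrix V[`t', `s'] = `rho'      // first subdiagonal
    matrix V[`s', `t'] = `rho'      // first superdiagonal
}
matrix list V
matrix symeigen X lambda = V        // eigenvalues in lambda, largest first
matrix list lambda
```

For five time points and a correlation of 0.5, all five eigenvalues are positive (the smallest is about 0.13), so the matrix is a valid correlation matrix.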
We must also define a matrix of means to use the drawnorm command.
. matrix M = (0 \ 0 \ 0 \ 0 \ 0)
. matrix list M

M[5,1]
    c1
r1   0
r2   0
r3   0
r4   0
r5   0
Now, we're ready to use the drawnorm command to create five error variables that have a Toeplitz 1 structure.
. drawnorm e1 e2 e3 e4 e5, means(M) cov(V)
. list in 1/2

     +-------------------------------------------------------------------------------------+
     | person         u_i          e1          e2          e3          e4          e5 |
     |-------------------------------------------------------------------------------------|
  1. |      1    5.303562   -1.288265   -1.201399     .353249    .0495944   -1.472762 |
  2. |      2   -.0133588    .6949759     2.82179    .7195075   -1.032395    .1995016 |
     +-------------------------------------------------------------------------------------+
Let's estimate the correlation matrix for our simulated data to verify that our simulation worked as we expected.
. correlate e1-e5
(obs=300)

             |       e1       e2       e3       e4       e5
-------------+---------------------------------------------
          e1 |   1.0000
          e2 |   0.5542   1.0000
          e3 |  -0.0149   0.4791   1.0000
          e4 |  -0.0508  -0.0364   0.5107   1.0000
          e5 |   0.0022  -0.0615   0.0248   0.4857   1.0000
The correlations are 1 along the main diagonal, near 0.5 for the contiguous observations, and near 0 otherwise.
Our data are currently in wide format, and we need them in long format to use the mixed command. We can use the reshape command to convert our data from wide to long format. If you are not familiar with the reshape command, you can learn more about it by typing help reshape.
. reshape long e, i(person) j(time)
(note: j = 1 2 3 4 5)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      300   ->    1500
Number of variables                   7   ->       4
j variable (5 values)                     ->   time
xij variables:
                           e1 e2 ... e5   ->   e
-----------------------------------------------------------------------------
Now, we are ready to create our age variable and the outcome variable weight.
. bysort person: generate age = _n + 11

. generate weight = 3 + 3.6*age + u_i + e

. list weight person u_i time age e if person==1

     +-------------------------------------------------------+
     |   weight   person        u_i   time   age           e |
     |-------------------------------------------------------|
  1. |  50.2153        1   5.303562      1    12   -1.288265 |
  2. | 53.90216        1   5.303562      2    13   -1.201399 |
  3. | 59.05681        1   5.303562      3    14     .353249 |
  4. | 62.35316        1   5.303562      4    15    .0495944 |
  5. |  64.4308        1   5.303562      5    16   -1.472762 |
     +-------------------------------------------------------+
We can use the mixed command to fit a model to our simulated data.
. mixed weight age || person:, residual(toeplitz 1, t(time)) stddev

Mixed-effects ML regression                     Number of obs      =      1500
Group variable: person                          Number of groups   =       300

                                                Obs per group: min =         5
                                                               avg =       5.0
                                                               max =         5

                                                Wald chi2(1)       =  33797.58
Log likelihood = -2323.9389                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   3.576738   .0194556   183.84   0.000     3.538606     3.61487
       _cons |   3.119974   .3244898     9.62   0.000     2.483985    3.755962
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
person: Identity             |
                   sd(_cons) |   3.004718   .1268162      2.766166    3.263843
-----------------------------+------------------------------------------------
Residual: Toeplitz(1)        |
                        rho1 |   .4977523   .0078807      .4821492    .5130398
                       sd(e) |   .9531284   .0230028      .9090933    .9992964
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =  3063.87   Prob > chi2 = 0.0000
Again, our parameter estimates match the parameters that were used to simulate the data very closely.
The parameter estimate is 3.6 for age (parameter = 3.6) and 3.1 for _cons (parameter = 3). The estimated standard deviation of the person-level random effect is 3.0 (parameter = 3). The estimated standard deviation for the errors is 0.95 (parameter = 1), and the estimated correlation for the Toeplitz structure is 0.5 (parameter = 0.5).
I hope I've convinced you that simulating multilevel/longitudinal data is easy and useful. The next time you find yourself teaching a class or giving a talk that requires multilevel examples, try simulating the data. And if you need to calculate statistical power for a multilevel or longitudinal model, consider simulations.
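As a rough illustration of the power-by-simulation idea, here is a minimal sketch that wraps a simplified version of the data-generating steps in a program and repeats it with simulate. Everything specific in it is an assumption made for illustration: the program name simpower, 50 people, a slope of 0.2 for age, independent errors instead of the Toeplitz structure, and 200 replications.

```stata
* Sketch: simulation-based power for the age slope (all values are assumptions)
capture program drop simpower
program define simpower, rclass
    drop _all
    set obs 50
    generate person = _n
    generate u_i = rnormal(0, 3)
    * five measurements per person, ages 12 through 16
    expand 5
    bysort person: generate age = _n + 11
    generate weight = 3 + 0.2*age + u_i + rnormal(0, 1)
    mixed weight age || person:
    * two-sided p-value of the Wald z test for age
    return scalar p = 2*normal(-abs(_b[age]/_se[age]))
end

simulate p = r(p), reps(200) seed(12345) nodots: simpower
generate reject = (p < 0.05)
summarize reject
```

The mean of reject is the proportion of replications in which the test rejected at the 5% level, which serves as the estimate of power under these assumed settings.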
Some commands, like logit or stcox, come with their own set of prediction tools to detect influential points. However, these kinds of predictions can be computed for virtually any regression command. In particular, we will see that the dfbeta statistics can be easily computed for any command that accepts the jackknife prefix. dfbeta statistics allow us to visualize how influential some observations are compared with the rest, concerning a specific parameter.
We will also compute Cook’s likelihood displacement, which is an overall measure of influence, and it can also be compared with a specific threshold.
The main task of jackknife is to fit the model while suppressing one observation at a time, which allows us to see how much results change when each observation is suppressed; in other words, it allows us to see how much each observation influences the results. A very intuitive measure of influence is dfbeta, which is the amount that a particular parameter changes when an observation is suppressed. There will be one dfbeta variable for each parameter. If \(\hat\beta\) is the estimate for parameter \(\beta\) obtained from the full data and \( \hat\beta_{(i)} \) is the corresponding estimate obtained when the \(i\)th observation is suppressed, then the \(i\)th element of variable dfbeta is obtained as
\[\text{dfbeta}_i = \hat\beta - \hat\beta_{(i)}\]
Parameters \(\hat\beta\) are saved by the estimation commands in matrix e(b) and also can be obtained using the _b notation, as we will show below. The leave-one-out values \(\hat\beta_{(i)}\) can be saved in a new file by using the option saving() with jackknife. With these two elements, we can compute the dfbeta values for each variable.
Let’s see an example with the probit command.
. sysuse auto, clear
(1978 Automobile Data)

. *preserve original dataset
. preserve

. *generate a variable with the original observation number
. gen obs =_n

. probit foreign mpg weight

Iteration 0:   log likelihood =  -45.03321
Iteration 1:   log likelihood = -27.914626
Iteration 2:   log likelihood = -26.858074
Iteration 3:   log likelihood = -26.844197
Iteration 4:   log likelihood = -26.844189
Iteration 5:   log likelihood = -26.844189

Probit regression                                 Number of obs   =         74
                                                  LR chi2(2)      =      36.38
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.844189                       Pseudo R2       =     0.4039

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -.1039503   .0515689    -2.02   0.044    -.2050235   -.0028772
      weight |  -.0023355   .0005661    -4.13   0.000     -.003445   -.0012261
       _cons |   8.275464   2.554142     3.24   0.001     3.269437    13.28149
------------------------------------------------------------------------------

. *keep the estimation sample so each observation will be matched
. *with the corresponding replication
. keep if e(sample)
(0 observations deleted)

. *use jackknife to generate the replications, and save the values in
. *file b_replic
. jackknife, saving(b_replic, replace): probit foreign mpg weight
(running probit on estimation sample)

Jackknife replications (74)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
........................

Probit regression                               Number of obs      =        74
                                                Replications       =        74
                                                F(   2,     73)    =     10.36
                                                Prob > F           =    0.0001
Log likelihood = -26.844189                     Pseudo R2          =    0.4039

------------------------------------------------------------------------------
             |              Jackknife
     foreign |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -.1039503   .0831194    -1.25   0.215     -.269607    .0617063
      weight |  -.0023355   .0006619    -3.53   0.001    -.0036547   -.0010164
       _cons |   8.275464   3.506085     2.36   0.021     1.287847    15.26308
------------------------------------------------------------------------------

. *verify that all the replications were successful
. assert e(N_misreps) ==0

. merge 1:1 _n using b_replic

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                74  (_merge==3)
    -----------------------------------------

. *see how values from replications are stored
. describe, fullnames

Contains data from .../auto.dta
  obs:            74                          1978 Automobile Data
 vars:            17                          13 Apr 2013 17:45
 size:         4,440                          (_dta has notes)
--------------------------------------------------------------------------------
                  storage   display    value
variable name       type    format     label      variable label
--------------------------------------------------------------------------------
make                str18   %-18s                 Make and Model
price               int     %8.0gc                Price
mpg                 int     %8.0g                 Mileage (mpg)
rep78               int     %8.0g                 Repair Record 1978
headroom            float   %6.1f                 Headroom (in.)
trunk               int     %8.0g                 Trunk space (cu. ft.)
weight              int     %8.0gc                Weight (lbs.)
length              int     %8.0g                 Length (in.)
turn                int     %8.0g                 Turn Circle (ft.)
displacement        int     %8.0g                 Displacement (cu. in.)
gear_ratio          float   %6.2f                 Gear Ratio
foreign             byte    %8.0g      origin     Car type
obs                 float   %9.0g
foreign_b_mpg       float   %9.0g                 [foreign]_b[mpg]
foreign_b_weight    float   %9.0g                 [foreign]_b[weight]
foreign_b_cons      float   %9.0g                 [foreign]_b[_cons]
_merge              byte    %23.0g     _merge
--------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. *compute the dfbeta for each covariate
. foreach var in mpg weight {
  2. gen dfbeta_`var' = (_b[`var'] -foreign_b_`var')
  3. }

. gen dfbeta_cons = (_b[_cons] - foreign_b_cons)

. label var obs "observation number"
. label var dfbeta_mpg "dfbeta for mpg"
. label var dfbeta_weight "dfbeta for weight"
. label var dfbeta_cons "dfbeta for the constant"

. *plot dfbeta values for variable mpg
. scatter dfbeta_mpg obs, mlabel(obs) title("dfbeta values for variable mpg")

. *restore original dataset
. restore
Based on the impact on the coefficient for variable mpg, observation 71 seems to be the most influential. We could create a similar plot for each parameter.
jackknife prints a dot for each successful replication and an ‘x’ for each replication that ends with an error. By looking at the output immediately following the jackknife command, we can see that all the replications were successful. However, we added an assert line in the code to avoid relying on visual inspection. If some replications failed, we would need to explore the reasons.
The command jackknife allows us to save the leave-one-out values in a different file. To use these, we would need to do some data management and merge the two files. On the other hand, the same command called with the option keep saves pseudovalues, which are defined as follows:
\[\hat{\beta}_i^* = N\hat\beta - (N-1)\hat\beta_{(i)} \]
where \(N\) is the number of observations involved in the computation, returned as e(N). Therefore, using the pseudovalues, the \(\hat\beta_{(i)}\) values can be computed as \[\hat\beta_{(i)} = \frac{ N \hat\beta - \hat\beta^*_i}{N-1} \]
Also, dfbeta values can be computed directly from the pseudovalues as \[ \hat\beta - \hat\beta_{(i)} = \frac{\hat\beta_{i}^* -\hat\beta} {N-1} \]
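To see where the last expression comes from, substitute the formula for \(\hat\beta_{(i)}\) into the definition of dfbeta and collect terms:

\[ \hat\beta - \hat\beta_{(i)} = \hat\beta - \frac{N\hat\beta - \hat\beta_i^*}{N-1} = \frac{(N-1)\hat\beta - N\hat\beta + \hat\beta_i^*}{N-1} = \frac{\hat\beta_i^* - \hat\beta}{N-1} \]

As a check with made-up numbers, take \(N = 74\), \(\hat\beta = -0.10\), and \(\hat\beta_{(i)} = -0.11\). The pseudovalue is \(74(-0.10) - 73(-0.11) = 0.63\), and \((0.63 - (-0.10))/73 = 0.01\), which equals \(\hat\beta - \hat\beta_{(i)} = -0.10 - (-0.11) = 0.01\).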
Using the pseudovalues instead of the leave-one-out values simplifies our program because we don’t have to worry about matching each pseudovalue to the correct observation.
Let’s reproduce the previous example.
. sysuse auto, clear
(1978 Automobile Data)

. jackknife, keep: probit foreign mpg weight
(running probit on estimation sample)

Jackknife replications (74)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
........................

Probit regression                               Number of obs      =        74
                                                Replications       =        74
                                                F(   2,     73)    =     10.36
                                                Prob > F           =    0.0001
Log likelihood = -26.844189                     Pseudo R2          =    0.4039

------------------------------------------------------------------------------
             |              Jackknife
     foreign |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -.1039503   .0831194    -1.25   0.215     -.269607    .0617063
      weight |  -.0023355   .0006619    -3.53   0.001    -.0036547   -.0010164
       _cons |   8.275464   3.506085     2.36   0.021     1.287847    15.26308
------------------------------------------------------------------------------

. *see how pseudovalues are stored
. describe, fullnames

Contains data from /Users/isabelcanette/Desktop/stata_mar18/309/ado/base/a/auto.
> dta
  obs:            74                          1978 Automobile Data
 vars:            15                          13 Apr 2013 17:45
 size:         4,070                          (_dta has notes)
--------------------------------------------------------------------------------
                  storage   display    value
variable name       type    format     label      variable label
--------------------------------------------------------------------------------
make                str18   %-18s                 Make and Model
price               int     %8.0gc                Price
mpg                 int     %8.0g                 Mileage (mpg)
rep78               int     %8.0g                 Repair Record 1978
headroom            float   %6.1f                 Headroom (in.)
trunk               int     %8.0g                 Trunk space (cu. ft.)
weight              int     %8.0gc                Weight (lbs.)
length              int     %8.0g                 Length (in.)
turn                int     %8.0g                 Turn Circle (ft.)
displacement        int     %8.0g                 Displacement (cu. in.)
gear_ratio          float   %6.2f                 Gear Ratio
foreign             byte    %8.0g      origin     Car type
foreign_b_mpg       float   %9.0g                 pseudovalues: [foreign]_b[mpg]
foreign_b_weight    float   %9.0g                 pseudovalues: [foreign]_b[weight]
foreign_b_cons      float   %9.0g                 pseudovalues: [foreign]_b[_cons]
--------------------------------------------------------------------------------
Sorted by:  foreign
     Note:  dataset has changed since last saved

. *verify that all the replications were successful
. assert e(N_misreps)==0

. *compute the dfbeta for each covariate
. local N = e(N)

. foreach var in mpg weight {
  2. gen dfbeta_`var' = (foreign_b_`var' - _b[`var'])/(`N'-1)
  3. }

. gen dfbeta_cons = (foreign_b_cons - _b[_cons])/(`N'-1)

. *plot dfbeta values for variable mpg
. gen obs = _n
. label var obs "observation number"
. label var dfbeta_mpg "dfbeta for mpg"
. scatter dfbeta_mpg obs, mlabel(obs) title("dfbeta values for variable mpg")
If you have panel data or a situation where each individual is represented by a group of observations (for example, conditional logit or survival models), you might be interested in influential groups. In this case, you would look at the changes on the parameters when each group is suppressed. Let’s see an example with xtlogit.
. webuse towerlondon, clear
. xtset family
. jackknife, cluster(family) idcluster(newclus) keep: xtlogit dtlm difficulty
. assert e(N_misreps)==0
The group-level pseudovalues will be saved on the first observations corresponding to each group, and there will be missing values on the rest. To compute the dfbeta value for the coefficient for difficulty, we type
. local N = e(N_clust)
. gen dfbeta_difficulty = (dtlm_b_difficulty - _b[difficulty])/(`N'-1)
We can then plot those values:
. scatter dfbeta_difficulty newclus, mlabel(family) ///
        title("dfbeta values for variable difficulty") xtitle("family")
Option idcluster() for jackknife generates a new variable that assigns consecutive integers to the clusters; using this variable produces a plot where families are equally spaced on the horizontal axis.
As before, we can see that some groups are more influential than others. It would require some research to find out whether this is a problem.
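A plot is one way to spot influential groups; a sorted listing is another. The sketch below repeats the xtlogit steps so that it runs on its own, and then lists the five families with the largest absolute dfbeta. The variable name abs_dfbeta and the choice of five families are assumptions made for illustration.

```stata
* Sketch: rank families by the size of their dfbeta for difficulty
webuse towerlondon, clear
xtset family
jackknife, cluster(family) idcluster(newclus) keep: xtlogit dtlm difficulty
assert e(N_misreps)==0
local N = e(N_clust)
generate dfbeta_difficulty = (dtlm_b_difficulty - _b[difficulty])/(`N'-1)

* gsort places missing values last in descending orderings by default,
* so the rows without a group-level pseudovalue end up at the bottom
generate abs_dfbeta = abs(dfbeta_difficulty)
gsort -abs_dfbeta
list family dfbeta_difficulty in 1/5
```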
If we want a global measure of influence (that is, not tied to a particular parameter), we can compute the likelihood displacement values. We consider the likelihood displacement value as defined by Cook (1986):
\[LD_i = 2[L(\hat\theta) - L(\hat\theta_{(i)})] \]
where \(L\) is the log-likelihood function (evaluated on the full dataset), \(\hat\theta\) is the set of parameter estimates obtained from the full dataset, and \(\hat\theta_{(i)}\) is the set of the parameter estimates obtained when leaving out the \(i\)th observation. Notice that what changes is the parameter vector. The log-likelihood function is always evaluated on the whole sample; because \(\hat\theta\) is the set of parameters that maximizes the log likelihood, the likelihood displacement is never negative. Cook suggested, as a confidence region for this value, the interval \([0, \chi^2_p(\alpha))\), where \(\chi^2_p(\alpha)\) is the (\(1-\alpha\)) quantile from a chi-squared distribution with \(p\) degrees of freedom, and \(p\) is the number of parameters in \(\theta\).
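To get a feel for the size of this reference value, here is a small sketch that computes the upper bound for a model with three parameters and \(\alpha = 0.10\). The values of p and alpha are assumptions chosen to match a probit model with two covariates and a constant.

```stata
* Sketch: upper bound of Cook's reference interval for p = 3 and alpha = 0.10
local p = 3
local alpha = 0.10

* invchi2tail() takes the upper-tail probability
display invchi2tail(`p', `alpha')

* equivalent form using the lower-tail quantile function
display invchi2(`p', 1 - `alpha')
```

Both calls return the same quantile, about 6.25, so likelihood displacement values below that would fall inside the reference interval.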
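To perform our assessment based on the likelihood displacement, we will need to do the following:
1. Save the leave-one-out parameter estimates in the rows of a matrix B.
2. Create a variable L1 whose \(i\)th observation contains the log likelihood evaluated at the estimates in the \(i\)th row of B.
3. Use L1 and the original log likelihood to compute the likelihood displacement values.
4. Plot the likelihood displacement values, adding the \(\chi^2_p(\alpha)\) quantile as a reference.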
Let’s do it with our probit model.
We first create the macro cmdline containing the command line for the model we want to use. We fit the model and save the original log likelihood in macro ll0.
With a loop, the leave-one-out parameters are saved in consecutive rows of matrix B. It is useful to have those values in a matrix, because we will then extract each row to evaluate the log likelihood at those values.
**********Step 1
sysuse auto, clear
set more off
local cmdline probit foreign weight mpg
`cmdline'
keep if e(sample)
local ll0 = e(ll)
mat b0 = e(b)
mat b = b0
local N = _N
forvalues i = 1(1)`N'{
    `cmdline' if _n !=`i'
    mat b1 = e(b)
    mat b = b \ b1
}
mat B = b[2...,1...]
mat list B
In each iteration of a loop, a row from B is stored as matrix b. To evaluate the log likelihood at these values, the trick is to use them as initial values and invoke the command with 0 iterations. This can be done for any command that is based on ml.
**********Step 2
gen L1 = .
forvalues i = 1(1)`N'{
    mat b = B[`i',1...]
    `cmdline', from(b) iter(0)
    local ll = e(ll)
    replace L1 = `ll' in `i'
}
Using variable L1 and the macro with the original log likelihood, we compute Cook's likelihood displacement.
**********Step 3
gen LD = 2*(`ll0' - L1)
Create the plot, using as a reference the 90% quantile for the \(\chi^2\) distribution. \(p\) is the number of columns in matrix b0 (or equivalently, the number of columns in matrix B).
**********Step 4
local k = colsof(b0)
gen upper_bound = invchi2tail(`k', .1)
gen n = _n
twoway scatter LD n, mlabel(n) || line upper_bound n, ///
    title("Likelihood displacement")
We can see that observation 71 is the most influential, and its likelihood displacement value is within the range we would normally expect.
Cook, D. 1986. Assessment of local influence. Journal of the Royal Statistical Society, Series B 48: 133–169.
Today I want to show you how to create animated graphics using Stata. It’s easier than you might expect and you can use animated graphics to illustrate concepts that would be challenging to illustrate with static graphs. In addition to Stata, you will need a video editing program but don’t be concerned if you don’t have one. At the 2012 UK Stata User Group Meeting Robert Grant demonstrated how to create animated graphics from within Stata using a free software program called FFmpeg. I will show you how I create my animated graphs using Camtasia and how Robert creates his using FFmpeg.
I recently recorded a video for the Stata Youtube channel called “Power and sample size calculations in Stata: A conceptual introduction”. I wanted to illustrate two concepts: (1) that statistical power increases as sample size increases, and (2) that statistical power increases as effect size increases. Both of these concepts can be illustrated with a static graph along with the explanation “imagine that …”. Creating animated graphs allowed me to skip the explanation and just show what I meant.
Videos are illusions. All videos — from Charles-Émile Reynaud’s 1877 praxinoscope to modern blu-ray movies — are created by displaying a series of ordered still images for a fraction of a second each. Our brains perceive this series of still images as motion.
To create the illusion of motion with graphs, we make an ordered series of slightly differing graphs. We can use loops to do this. If you are not familiar with loops in Stata, here’s one to count to five:
forvalues i = 1(1)5 {
    disp "i = `i'"
}
i = 1
i = 2
i = 3
i = 4
i = 5
We could place a graph command inside the loop. If, for each iteration, the graph command created a slightly different graph, we would be on our way to creating our first video. The loop below creates a series of graphs of normal densities with means 0 through 1 in increments of 0.1.
forvalues mu = 0(0.1)1 {
    twoway function y=normalden(x,`mu',1), range(-3 6) title("N(`mu',1)")
}
You may have noticed the illusion of motion as Stata created each graph; the normal densities appeared to be moving to the right as each new graph appeared on the screen.
You may have also noticed that some of the values of the mean did not look as you would have wanted. For example, 1.0 was displayed as 0.999999999. That’s not a mistake; it happens because Stata stores numbers and performs calculations in base two and displays them in base ten. For a detailed explanation, see Precision (yet again), Part I.
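The effect is easy to reproduce outside of a graph. The following sketch, which is not from the original post, displays 0.1 with many decimal places and then adds 0.1 ten times; the scalar name s is just a placeholder.

```stata
* Sketch: 0.1 has no exact representation in base two
display %21.18f 0.1

* adding 0.1 ten times accumulates a small rounding error
scalar s = 0
forvalues i = 1/10 {
    scalar s = s + 0.1
}
display %21.18f s

* the comparison returns 0 (false) when the sum is not exactly 1
display s == 1
```

In double precision, the accumulated sum falls just short of 1, which is the same kind of discrepancy that showed up in the graph titles.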
We can fix that by reformatting the means using the string() function.
forvalues mu = 0(0.1)1 {
    local mu = string(`mu', "%3.1f")
    twoway function y=normalden(x,`mu',1), range(-3 6) title("N(`mu',1)")
}
Next, we need to save our graphs. We can do this by adding graph export inside the loop.
forvalues mu = 0(0.1)1 {
    local mu = string(`mu', "%3.1f")
    twoway function y=normalden(x,`mu',1), range(-3 6) title("N(`mu',1)")
    graph export graph_`mu'.png, as(png) width(1280) height(720) replace
}
Note that the name of each graph file includes the value of mu so that we know the order of our files. We can view the contents of the directory to verify that Stata has created a file for each of our graphs.
. ls
  <dir>   2/11/14 12:12  .
  <dir>   2/11/14 12:12  ..
  35.6k   2/11/14 12:11  graph_0.0.png
  35.6k   2/11/14 12:11  graph_0.1.png
  35.7k   2/11/14 12:11  graph_0.2.png
  35.7k   2/11/14 12:11  graph_0.3.png
  35.7k   2/11/14 12:11  graph_0.4.png
  35.8k   2/11/14 12:11  graph_0.5.png
  35.9k   2/11/14 12:12  graph_0.6.png
  35.7k   2/11/14 12:12  graph_0.7.png
  35.8k   2/11/14 12:12  graph_0.8.png
  35.9k   2/11/14 12:12  graph_0.9.png
  35.6k   2/11/14 12:12  graph_1.0.png
Now that we have created our graphs, we need to combine them into a video.
There are many commercial, freeware, and free software programs available that we could use. I will outline the basic steps using two of them: a commercial GUI-based product (not free) called Camtasia, and a free command-based program called FFmpeg.
Most commercial video editing programs have similar interfaces. The user imports image, sound and video files, organizes them in tracks on a timeline and then previews the resulting video. Camtasia is a commercial video program that I use to record videos for the Stata Youtube channel and its interface looks like this.
We begin by importing the graph files into Camtasia:
Next we drag the images onto the timeline:
And then we make the display time for each image very short…in this case 0.1 seconds or 10 frames per second.
After previewing the video, we can export it to any of Camtasia’s supported formats. I’ve exported to a “.gif” file because it is easy to view in a web browser.
We just created our first animated graph! All we have to do to make it look as professional as the power-and-sample size examples I showed you earlier is go back into our Stata program and modify the graph command to add the additional elements we want to display!
Stata user and medical statistician Robert Grant gave a presentation at the 2012 UK Stata User Group Meeting in London entitled “Producing animated graphs from Stata without having to learn any specialized software“. You can read more about Robert by visiting his blog and clicking on About.
In his presentation, Robert demonstrated how to combine graph images into a video using a free software program called FFmpeg. Robert followed the same basic strategy I demonstrated above, but Robert’s choice of software has two appealing features. First, the software is readily available and free. Second, FFmpeg can be called from within the Stata environment using the winexec command. This means that we can create our graphs and combine them into a video using Stata do files. Combining dozens or hundreds of graphs into a single video with a program is faster and easier than using a drag-and-drop interface.
Let’s return to our previous example and combine the files using FFmpeg. Recall that we inserted the mean into the name of each file (e.g. “graph_0.4.png”) so that we could keep track of the order of the files. In my experience, it can be difficult to combine files with decimals in their names using FFmpeg. To avoid the problem, I have added a line of code between the twoway command and the graph export command that names the files with sequential integers which are padded with zeros.
forvalues mu = 0(0.1)1 {
    local mu = string(`mu', "%3.1f")
    twoway function y=normalden(x,`mu',1), range(-3 6) title("N(`mu',1)")
    local mu = string(`mu'*10+1, "%03.0f")
    graph export graph_`mu'.png, as(png) width(1280) height(720) replace
}

. ls
  <dir>   2/12/14 12:21  .
  <dir>   2/12/14 12:21  ..
  35.6k   2/12/14 12:21  graph_001.png
  35.6k   2/12/14 12:21  graph_002.png
  35.7k   2/12/14 12:21  graph_003.png
  35.7k   2/12/14 12:21  graph_004.png
  35.7k   2/12/14 12:21  graph_005.png
  35.8k   2/12/14 12:21  graph_006.png
  35.9k   2/12/14 12:21  graph_007.png
  35.7k   2/12/14 12:21  graph_008.png
  35.8k   2/12/14 12:21  graph_009.png
  35.9k   2/12/14 12:21  graph_010.png
  35.6k   2/12/14 12:21  graph_011.png
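To see what the "%03.0f" format does in isolation, here is a short sketch that pads a few integers and builds file names from them. The macro name tag is a placeholder.

```stata
* Sketch: zero-padded counters keep file names in the right order
display string(7, "%03.0f")
display string(11, "%03.0f")

forvalues i = 9/11 {
    local tag = string(`i', "%03.0f")
    display "graph_`tag'.png"
}
```

The first two lines display 007 and 011, and the loop displays graph_009.png, graph_010.png, and graph_011.png. Because every name has the same number of digits, alphabetical order and numerical order agree.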
We can then combine these files into a video with FFmpeg using the following commands
local GraphPath "C:\Users\jch\AnimatedGraphics\example\"
winexec "C:\Program Files\FFmpeg\bin\ffmpeg.exe" -i `GraphPath'graph_%03d.png -b:v 512k `GraphPath'graph.mpg
The local macro GraphPath contains the path for the directory where my graphics files are stored.
The Stata command winexec “whatever” executes whatever. In our case, whatever is ffmpeg.exe, preceded by ffmpeg.exe’s path, and followed by the arguments FFmpeg needs. We specify two options, -i and -b:v.
The -i option is followed by a path and filename template. In our case, the path is obtained from the Stata local macro GraphPath and the filename template is “graph_%03d.png”. This template tells FFmpeg to look for a three digit sequence of numbers between “graph_” and “.png” in the filenames. The zero that precedes the three in the template tells FFmpeg that the three digit sequence of numbers is padded with zeros.
The -b:v option sets the bitrate of the video, here 512 kilobits per second. The final argument gives the path and filename of the video to be created.
Once we have created our video, we can use FFmpeg to convert our video to other video formats. For example, we could convert “graph.mpg” to “graph.gif” using the following command:
winexec "C:\Program Files\FFmpeg\bin\ffmpeg.exe" -r 10 -i `GraphPath'graph.mpg -t 10 -r 10 `GraphPath'graph.gif
which creates this graph:
FFmpeg is a very flexible program and there are far too many options to discuss in this blog entry. If you would like to learn more about FFmpeg you can visit their website at www.ffmpeg.org.
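On systems where winexec is not convenient, the same idea can be sketched with Stata's shell command. This is only an illustration under several assumptions: that an ffmpeg executable is on the system PATH, that the zero-padded PNG files are in the current working directory, and that writing the GIF directly (skipping the intermediate .mpg file) is acceptable.

```stata
* Sketch: call FFmpeg through shell instead of winexec
* Assumes ffmpeg is on the PATH and graph_001.png, graph_002.png, ...
* are in the current working directory.
* -framerate 10 reads the images at 10 frames per second.
shell ffmpeg -framerate 10 -i graph_%03d.png graph.gif
```

Unlike winexec, shell waits for the program to finish before returning control to Stata, which can be handy in a do-file that uses the video in a later step.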
I made the preceding examples as simple as possible so that we could focus on the mechanics of creating videos. We now know that, if we want to make professional looking videos, all the complication comes on the Stata side. We leave our loop alone but change the graph command inside it to be more complicated.
So here’s how I created the two animated-graphics videos that I used to create the overall video “Power and sample size calculations in Stata: A conceptual introduction” on our YouTube channel.
The first demonstrated that increasing the effect size (the difference between the means) results in increased statistical power.
local GraphCounter = 100
local mu_null = 0
local sd = 1
local z_crit = round(-1*invnormal(0.05), 0.01)
local z_crit_label = `z_crit' + 0.75
forvalues mu_alt = 1(0.01)3 {
    twoway ///
        function y=normalden(x,`mu_null',`sd'), ///
            range(-3 `z_crit') color(red) dropline(0) || ///
        function y=normalden(x,`mu_alt',`sd'), ///
            range(-3 5) color(green) dropline(`mu_alt') || ///
        function y=normalden(x,`mu_alt',`sd'), ///
            range(`z_crit' 6) recast(area) color(green) || ///
        function y=normalden(x,`mu_null',`sd'), ///
            range(`z_crit' 6) recast(area) color(red) ///
        title("Power for {&mu}={&mu}{subscript:0} versus {&mu}={&mu}{subscript:A}") ///
        xtitle("{it: z}") xlabel(-3 -2 -1 0 1 2 3 4 5 6) ///
        legend(off) ///
        ytitle("Density") yscale(range(0 0.6)) ///
        ylabel(0(0.1)0.6, angle(horizontal) nogrid) ///
        text(0.45 0 "{&mu}{subscript:0}", color(red)) ///
        text(0.45 `mu_alt' "{&mu}{subscript:A}", color(green))
    graph export mu_alt_`GraphCounter'.png, as(png) width(1280) height(720) replace
    local ++GraphCounter
}
The above Stata code created the *.png files that I then combined using Camtasia to produce this gif:
The second video demonstrated that power increases as the sample size increases.
local GraphCounter = 301
local mu_label = 0.45
local power_label = 2.10
local mu_null = 0
local mu_alt = 2
forvalues sd = 1(-0.01)0.5 {
    local z_crit = round(-1*invnormal(0.05)*`sd', 0.01)
    local z_crit_label = `z_crit' + 0.75
    twoway ///
        function y=normalden(x,`mu_null',`sd'), ///
            range(-3 `z_crit') color(red) dropline(0) || ///
        function y=normalden(x,`mu_alt',`sd'), ///
            range(-3 5) color(green) dropline(`mu_alt') || ///
        function y=normalden(x,`mu_alt',`sd'), ///
            range(`z_crit' 6) recast(area) color(green) || ///
        function y=normalden(x,`mu_null',`sd'), ///
            range(`z_crit' 6) recast(area) color(red) ///
        title("Power for {&mu}={&mu}{subscript:0} versus {&mu}={&mu}{subscript:A}") ///
        xtitle("{it: z}") xlabel(-3 -2 -1 0 1 2 3 4 5 6) ///
        legend(off) ///
        ytitle("Density") yscale(range(0 0.6)) ///
        ylabel(0(0.1)0.6, angle(horizontal) nogrid) ///
        text(`mu_label' 0 "{&mu}{subscript:0}", color(red)) ///
        text(`mu_label' `mu_alt' "{&mu}{subscript:A}", color(green))
    graph export mu_alt_`GraphCounter'.png, as(png) width(1280) height(720) replace
    local ++GraphCounter
    local mu_label = `mu_label' + 0.005
    local power_label = `power_label' + 0.03
}
Just as previously, the above Stata code creates the *.png files that I then combine using Camtasia to produce a gif:
Let me show you some more examples.
The next example demonstrates the basic idea of lowess smoothing.
sysuse auto
local WindowWidth = 500
forvalues WindowUpper = 2200(25)5000 {
    local WindowLower = `WindowUpper' - `WindowWidth'
    twoway (scatter mpg weight) ///
        (lowess mpg weight if weight < (`WindowUpper'-250), lcolor(green)) ///
        (lfit mpg weight if weight>`WindowLower' & weight<`WindowUpper', ///
            lwidth(medium) lcolor(red)) ///
        , xline(`WindowLower' `WindowUpper', lwidth(medium) lcolor(black)) ///
        legend(on order(1 2 3) cols(3))
    graph export lowess_`WindowUpper'.png, as(png) width(1280) height(720) replace
}
The result is,
The animated graph I created is not yet a perfect analogy to what lowess actually does, but it comes close. It has two problems. First, the lowess curve changes outside of the sliding window, which it should not. Second, the animation does not illustrate the weighting of the points within the window, say, by using differently sized markers for the points in the sliding window. Even so, the graph does a far better job than the usual explanation that one should imagine sliding a window across the scatterplot.
As yet another example, we can use animated graphs to demonstrate the concept of convergence. There is a FAQ on the Stata website written by Bill Gould that explains the relationship between the chi-squared and F distributions. The animated graph below shows that d1*F(d1, d2) converges to a χ^2 distribution with d1 degrees of freedom as d2 goes to infinity:
forvalues df = 1(1)100 {
    twoway function y=chi2(2,2*x), range(0 6) color(red) || ///
        function y=F(2,`df',x), range(0 6) color(green) ///
        title("Cumulative distributions for {&chi}{sup:2}{sub:df} and {it:F}{subscript:df,df2}") ///
        xtitle("{it: denominator df}") xlabel(0 1 2 3 4 5 6) legend(off) ///
        text(0.45 4 "df2 = `df'", size(huge) color(black)) ///
        legend(on order(1 "{&chi}{sup:2}{sub:df}" 2 "{it:F}{subscript:df,df2}") cols(2) position(5) ring(0))
    local df = string(`df', "%03.0f")
    graph export converge2_`df'.png, as(png) width(1280) height(720) replace
}
The t distribution has a similar relationship with the normal distribution.
forvalues df = 1(1)100 {
    twoway function y=normal(x), range(-3 3) color(red) || ///
        function y=t(`df',x), range(-3 3) color(green) ///
        title("Cumulative distributions for Normal(0,1) and {it:t}{subscript:df}") ///
        xtitle("{it: t/z}") xlabel(-3 -2 -1 0 1 2 3) legend(off) ///
        text(0.45 -2 "df = `df'", size(huge) color(black)) ///
        legend(on order(1 "N(0,1)" 2 "{it:t}{subscript:df}") cols(2) position(5) ring(0))
    local df = string(`df', "%03.0f")
    graph export converge_`df'.png, as(png) width(1280) height(720) replace
}
The result is
I have learned through trial and error two things that improve the quality of my animated graphs. First, note that the axes of the graphs in most of the examples above are explicitly defined in the graph commands. This is often necessary to keep the axes stable from graph to graph. Second, videos have a smoother, higher quality appearance when there are many graphs with very small changes from graph to graph.
I hope I have convinced you that creating animated graphics with Stata is easier than you imagined. If the old saying that “a picture is worth a thousand words” is true, imagine how many words you can save using animated graphs.
Relationship between chi-squared and F distributions
After the entry was posted, a few users pointed out two features they wanted added to putexcel:
In Stata 13.1, we added the new option keepcellformat to putexcel. This option retains a cell’s format after writing numeric data to it. keepcellformat is useful for people who want to automate the updating of a report or paper.
To review, the basic syntax of putexcel is as follows:
putexcel excel_cell=(expression) … using filename[, options]
If you are working with matrices, the syntax is
putexcel excel_cell=matrix(expression) … using filename[, options]
In the previous blog post, we exported a simple table created by the correlate command by using the commands below.
. sysuse auto
(1978 Automobile Data)

. correlate foreign mpg
(obs=74)

             |  foreign      mpg
-------------+------------------
     foreign |   1.0000
         mpg |   0.3934   1.0000

. putexcel A1=matrix(r(C), names) using corr
These commands created the file corr.xlsx, which contained the table below in the first worksheet.
As you can see, this table is not formatted. So, I formatted the table by hand in Excel so that the correlations were rounded to two digits and the column and row headers were bold with a blue background.
putexcel's default behavior is to remove the formatting of cells. Thus, if we change the correlated variables in our command from foreign and mpg to foreign and weight using the commands below, the new correlations shown in Excel will revert to the default format:
. sysuse auto, clear
(1978 Automobile Data)

. correlate foreign weight
(obs=74)

             |  foreign   weight
-------------+------------------
     foreign |   1.0000
      weight |  -0.5928   1.0000

. putexcel A1=matrix(r(C), names) using corr, modify
As of Stata 13.1, you can now use the keepcellformat option to preserve a numeric cell’s format when writing to it. For example, the command
. putexcel A1=matrix(r(C), names) using corr, modify keepcellformat
will produce
Let’s look at a real-world problem and really see how the keepcellformat option can help us. Suppose we need to export the following tabulate table to a report we wrote in Word.
. webuse auto2, clear
(1978 Automobile Data)

. label variable rep78 "Repair Record"

. tabulate rep78

     Repair |
     Record |      Freq.     Percent        Cum.
------------+-----------------------------------
       Poor |          2        2.90        2.90
       Fair |          8       11.59       14.49
    Average |         30       43.48       57.97
       Good |         18       26.09       84.06
  Excellent |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00
In the previous putexcel blog post, I mentioned my user-written command tab2xl, which exports a one-way tabulation to an Excel file. I have since updated the command so that it uses the new keepcellformat option to preserve cell formatting. You can download the updated tab2xl command by typing the following:
. net install http://www.stata.com/users/kcrow/tab2xl, replace
Using this command, I can now export my tabulate table to Excel by typing
. tab2xl rep78 using tables, row(1) col(1)
Once the table is in Excel, I format it by hand so that it looks like this:
I then link this Excel table to a Word document. When you link an Excel table to a Word document, it
It is fairly easy to link an Excel table to a Word document or PowerPoint presentation. In Excel/Word 2010, you would do as follows:
My report now looks like this:
With the Excel table linked into Word, any time we update our Excel table using putexcel, we also update our table in Word.
Suppose that after a few weeks, we get more repair record data. We now need to update our report, and our new tabulate table looks like this:
. tabulate rep78

     Repair |
     Record |      Freq.     Percent        Cum.
------------+-----------------------------------
       Poor |          4        2.90        2.90
       Fair |          8        5.80        8.70
    Average |         60       43.48       52.17
       Good |         44       31.88       84.06
  Excellent |         22       15.94      100.00
------------+-----------------------------------
      Total |        138      100.00
To update the report, we simply need to reissue the tab2xl command (which calls putexcel) after tabulate.
. tabulate rep78
. tab2xl rep78 using tables, row(1) col(1)
The linked Word report will automatically reflect the changes:
The ordinal probit model is used to model ordinal dependent variables. In the usual parameterization, we assume that there is an underlying linear regression, which relates an unobserved continuous variable \(y^*\) to the covariates \(x\).
\[y^*_{i} = x_{i}\gamma + u_i\]
The observed dependent variable \(y\) relates to \(y^*\) through a series of cut-points \(-\infty =\kappa_0<\kappa_1<\dots< \kappa_m=+\infty\) , as follows:
\[y_{i} = j {\mbox{ if }} \kappa_{j-1} < y^*_{i} \leq \kappa_j\]
Because the variance of \(u_i\) cannot be identified from the observed data, it is assumed to be equal to one. However, we can consider a re-scaled parameterization of the same model. A straightforward way of seeing this is to note that, for any positive number \(M\):
\[\kappa_{j-1} < y^*_{i} \leq \kappa_j \iff
M\kappa_{j-1} < M y^*_{i} \leq M\kappa_j
\]
that is,
\[\kappa_{j-1} < x_i\gamma + u_i \leq \kappa_j \iff
M\kappa_{j-1}< x_i(M\gamma) + Mu_i \leq M\kappa_j
\]
In other words, if the model is identified, it can be re-expressed by multiplying the unobserved variable \(y^*\) by a positive number; the standard deviation of the residual component, the coefficients, and the cut-points are then all multiplied by that same number.
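A short worked step shows why the rescaling cannot be detected from the data. Writing \(\Phi\) for the standard normal distribution function (notation introduced here for illustration) and recalling that \(Mu_i\) has standard deviation \(M\), the probability of each outcome is the same in both parameterizations:

\[
\Pr(y_i = j \mid x_i) = \Phi(\kappa_j - x_i\gamma) - \Phi(\kappa_{j-1} - x_i\gamma)
= \Phi\left(\frac{M\kappa_j - x_i(M\gamma)}{M}\right) - \Phi\left(\frac{M\kappa_{j-1} - x_i(M\gamma)}{M}\right)
\]

The likelihood is therefore identical for every \(M > 0\), which is why the scale has to be fixed by assumption.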
Let me show you an example; I will first fit a standard ordinal probit model, both with oprobit and with gsem. Then, I will use gsem to fit an ordinal probit model where the residual term for the underlying linear regression has a standard deviation equal to 2. I will do this by introducing a latent variable \(L\), with variance 1, and coefficient \(\sqrt 3\). This will be added to the underlying latent residual, with variance 1; then, the ‘new’ residual term will have variance equal to \(1+((\sqrt 3)^2\times Var(L))= 4\), so the standard deviation will be 2. We will see that as a result, the coefficients, as well as the cut-points, will be multiplied by 2.
. sysuse auto, clear
(1978 Automobile Data)

. oprobit rep mpg disp , nolog

Ordered probit regression                         Number of obs   =         69
                                                  LR chi2(2)      =      14.68
                                                  Prob > chi2     =     0.0006
Log likelihood = -86.352646                       Pseudo R2       =     0.0783

------------------------------------------------------------------------------
       rep78 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   .0497185   .0355452     1.40   0.162    -.0199487    .1193858
displacement |  -.0029884   .0021498    -1.39   0.165     -.007202    .0012252
-------------+----------------------------------------------------------------
       /cut1 |  -1.570496   1.146391                      -3.81738    .6763888
       /cut2 |  -.7295982   1.122361                     -2.929386     1.47019
       /cut3 |   .6580529   1.107838                     -1.513269    2.829375
       /cut4 |    1.60884   1.117905                     -.5822132    3.799892
------------------------------------------------------------------------------

. gsem (rep <- mpg disp, oprobit), nolog

Generalized structural equation model             Number of obs   =         69
Log likelihood = -86.352646

--------------------------------------------------------------------------------
               |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
rep78 <-       |
           mpg |   .0497185   .0355452     1.40   0.162    -.0199487    .1193858
  displacement |  -.0029884   .0021498    -1.39   0.165     -.007202    .0012252
---------------+----------------------------------------------------------------
rep78          |
         /cut1 |  -1.570496   1.146391    -1.37   0.171     -3.81738    .6763888
         /cut2 |  -.7295982   1.122361    -0.65   0.516    -2.929386     1.47019
         /cut3 |   .6580529   1.107838     0.59   0.553    -1.513269    2.829375
         /cut4 |    1.60884   1.117905     1.44   0.150    -.5822132    3.799892
--------------------------------------------------------------------------------

. local a = sqrt(3)

. gsem (rep <- mpg disp L@`a'), oprobit var(L@1) nolog

Generalized structural equation model             Number of obs   =         69
Log likelihood = -86.353008

 ( 1)  [rep78]L = 1.732051
 ( 2)  [var(L)]_cons = 1
--------------------------------------------------------------------------------
               |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
rep78 <-       |
           mpg |    .099532     .07113     1.40   0.162    -.0398802    .2389442
  displacement |  -.0059739   .0043002    -1.39   0.165    -.0144022    .0024544
             L |   1.732051  (constrained)
---------------+----------------------------------------------------------------
rep78          |
         /cut1 |  -3.138491   2.293613    -1.37   0.171     -7.63389    1.356907
         /cut2 |  -1.456712   2.245565    -0.65   0.517    -5.857938    2.944513
         /cut3 |   1.318568    2.21653     0.59   0.552     -3.02575    5.662887
         /cut4 |   3.220004   2.236599     1.44   0.150     -1.16365    7.603657
---------------+----------------------------------------------------------------
        var(L) |          1  (constrained)
--------------------------------------------------------------------------------
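As a quick check on the output above, the ratios of the rescaled estimates to the original ones can be worked out by hand, using values copied from the tables:

\[
\frac{.099532}{.0497185} \approx 2.002, \qquad
\frac{-.0059739}{-.0029884} \approx 1.999, \qquad
\frac{-3.138491}{-1.570496} \approx 1.998, \qquad
\frac{3.220004}{1.60884} \approx 2.001
\]

All four agree with the factor of 2 to about three significant digits. The small departures, like the slightly different log likelihoods (-86.352646 versus -86.353008), are presumably due to the numerical integration over \(L\); that explanation is an assumption, not something the output states.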
This model is defined analogously to the model fitted by -ivprobit- for probit models with endogenous covariates; we assume an underlying model with two equations,
\[
\begin{eqnarray}
y^*_{1i} =& y_{2i} \beta + x_{1i} \gamma + u_i & \\
y_{2i} =& x_{1i} \pi_1 + x_{2i} \pi_2 + v_i & \,\,\,\,\,\, (1)
\end{eqnarray}
\]
where \(u_i \sim N(0, 1) \), \(v_i\sim N(0,s^2) \), and \(corr(u_i, v_i) = \rho\).
We don’t observe \(y^*_{1i}\); instead, we observe a discrete variable \(y_{1i}\) such that, for a set of cut-points (to be estimated) \(\kappa_0 = -\infty < \kappa_1 < \kappa_2 \dots < \kappa_m = +\infty \),
\[y_{1i} = j {\mbox{ if }} \kappa_{j-1} < y^*_{1i} \leq \kappa_j \]
I will re-scale the first equation, preserving the correlation. That is, I will consider the following system:
\[
\begin{eqnarray}
z^*_{1i} =&
y_{2i}b +x_{1i}c + t_i + \alpha L_i &\\
y_{2i} = &x_{1i}\pi_1 + x_{2i}\pi_2 + w_i + \alpha L_i & \,\,\,\,\,\, (2)
\end{eqnarray}
\]
where \(t_i, w_i, L_i\) are independent, \(t_i \sim N(0, 1)\) , \(w_i \sim N(0,\sigma^2)\), \(L_i \sim N(0, 1)\)
\[y_{1i} = j {\mbox{ if }} \lambda_{j-1} < z^*_{1i} \leq \lambda_j \]
By introducing a latent variable in both equations, I am modeling a correlation between the error terms. The first equation is a re-scaled version of the original equation, that is, \(z^*_1 = My^*_1\),
\[ y_{2i}b +x_{1i}c + t_i + \alpha L_i
= M(y_{2i}\beta) +M x_{1i}\gamma + M u_i \]
This implies that
\[M u_i = t_i + \alpha L_i, \]
where \(Var(u_i) = 1\) and \(Var(t_i + \alpha L_i) = 1 + \alpha^2\), so the scale is \(M = \sqrt{1+\alpha^2} \).
The second equation remains the same; we just express \(v_i\) as \(w_i + \alpha L_i\). Now, after estimating the system (2), we can recover the parameters in (1) as follows:
\[\beta = \frac{1}{\sqrt{1+ \alpha^2}} b\]
\[\gamma = \frac{1}{\sqrt{1+ \alpha^2}} c\]
\[\kappa_j = \frac{1}{\sqrt{1+ \alpha^2}} \lambda_j \]
\[V(v_i) = V(w_i + \alpha L_i) =V(w_i) + \alpha^2\]
\[\rho = corr(t_i + \alpha L_i, w_i + \alpha L_i) =
\frac{\alpha^2}{\sqrt{1+\alpha^2}\,\sqrt{V(w_i)+\alpha^2}}\]
Note: This parameterization assumes that the correlation is positive; for negative values of the correlation, \(L\) should be included in the second equation with a negative sign (that is, L@(-a) instead of L@a). When trying to perform the estimation with the wrong sign, the model most likely won’t achieve convergence. Otherwise, you will see a coefficient for L that is virtually zero. In Stata 13.1 we have included features that allow you to fit the model without this restriction. However, this time we will use the older parameterization, which will allow you to visualize the different components more easily.
clear
set seed 1357
set obs 10000
forvalues i = 1(1)5 {
    gen x`i' = 2*rnormal() + _n/1000
}

mat C = [1,.5 \ .5, 1]
drawnorm z1 z2, cov(C)

gen y2 = 0
forvalues i = 1(1)5 {
    replace y2 = y2 + x`i'
}
replace y2 = y2 + z2

gen y1star = y2 + x1 + x2 + z1
gen xb1 = y2 + x1 + x2

gen y1 = 4
replace y1 = 3 if xb1 + z1 <= .8
replace y1 = 2 if xb1 + z1 <= .3
replace y1 = 1 if xb1 + z1 <= -.3
replace y1 = 0 if xb1 + z1 <= -.8

gsem (y1 <- y2 x1 x2 L@a, oprobit) (y2 <- x1 x2 x3 x4 x5 L@a), var(L@1)

local y1 y1
local y2 y2

local xaux x1 x2 x3 x4 x5
local xmain y2 x1 x2

local s2 sqrt(1+_b[`y1':L]^2)

foreach v in `xmain' {
    local trans `trans' (`y1'_`v': _b[`y1':`v']/`s2')
}

foreach v in `xaux' _cons {
    local trans `trans' (`y2'_`v': _b[`y2':`v'])
}

qui tab `y1' if e(sample)
local ncuts = r(r)-1
forvalues i = 1(1) `ncuts' {
    local trans `trans' (cut_`i': _b[`y1'_cut`i':_cons]/`s2')
}

local s1 sqrt( _b[var(e.`y2'):_cons] +_b[`y1':L]^2)

local trans `trans' (sig_2: `s1')
local trans `trans' (rho_12: _b[`y1':L]^2/(`s1'*`s2'))
nlcom `trans'
This is the output from gsem:
Generalized structural equation model             Number of obs   =      10000
Log likelihood = -14451.117

 ( 1)  [y1]L - [y2]L = 0
 ( 2)  [var(L)]_cons = 1
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y1 <-        |
          y2 |   1.379511   .0775028    17.80   0.000     1.227608    1.531414
          x1 |   1.355687   .0851558    15.92   0.000     1.188785    1.522589
          x2 |   1.346323   .0833242    16.16   0.000      1.18301    1.509635
           L |   .7786594   .0479403    16.24   0.000     .6846982    .8726206
-------------+----------------------------------------------------------------
y2 <-        |
          x1 |   .9901353   .0044941   220.32   0.000      .981327    .9989435
          x2 |   1.006836   .0044795   224.76   0.000      .998056    1.015615
          x3 |   1.004249   .0044657   224.88   0.000     .9954963    1.013002
          x4 |   .9976541   .0044783   222.77   0.000     .9888767    1.006431
          x5 |   .9987587   .0044736   223.26   0.000     .9899907    1.007527
           L |   .7786594   .0479403    16.24   0.000     .6846982    .8726206
       _cons |   .0002758   .0192417     0.01   0.989    -.0374372    .0379887
-------------+----------------------------------------------------------------
y1           |
       /cut1 |  -1.131155   .1157771    -9.77   0.000    -1.358074   -.9042358
       /cut2 |  -.5330973   .1079414    -4.94   0.000    -.7446585    -.321536
       /cut3 |   .2722794   .1061315     2.57   0.010     .0642654    .4802933
       /cut4 |     .89394   .1123013     7.96   0.000     .6738334    1.114047
-------------+----------------------------------------------------------------
      var(L) |          1  (constrained)
-------------+----------------------------------------------------------------
   var(e.y2) |   .3823751    .074215                      .2613848    .5593696
------------------------------------------------------------------------------
These are the results we obtain when we transform the values reported by gsem to the original parameterization:
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       y1_y2 |   1.088455   .0608501    17.89   0.000     .9691909    1.207719
       y1_x1 |   1.069657   .0642069    16.66   0.000      .943814    1.195501
       y1_x2 |   1.062269   .0619939    17.14   0.000      .940763    1.183774
       y2_x1 |   .9901353   .0044941   220.32   0.000      .981327    .9989435
       y2_x2 |   1.006836   .0044795   224.76   0.000      .998056    1.015615
       y2_x3 |   1.004249   .0044657   224.88   0.000     .9954963    1.013002
       y2_x4 |   .9976541   .0044783   222.77   0.000     .9888767    1.006431
       y2_x5 |   .9987587   .0044736   223.26   0.000     .9899907    1.007527
    y2__cons |   .0002758   .0192417     0.01   0.989    -.0374372    .0379887
       cut_1 |   -.892498   .0895971    -9.96   0.000    -1.068105   -.7168909
       cut_2 |  -.4206217   .0841852    -5.00   0.000    -.5856218   -.2556217
       cut_3 |   .2148325   .0843737     2.55   0.011     .0494632    .3802018
       cut_4 |    .705332   .0905974     7.79   0.000     .5277644    .8828997
       sig_2 |   .9943267    .007031   141.42   0.000     .9805462    1.008107
      rho_12 |   .4811176   .0477552    10.07   0.000     .3875191     .574716
------------------------------------------------------------------------------
The estimates are quite close to the values used for the simulation. If you try to perform the estimation with the wrong sign for the coefficient for L, you will get a number that is virtually zero (if you get convergence at all). In this case, the evaluator is telling us that the best value it can find, given the restrictions we have imposed, is zero. If you see such results, you may want to try the opposite sign. If both give a zero coefficient, it means that this is the solution, and there is no endogeneity at all. If one of them is not zero, it means that the non-zero value is the solution. As stated before, in Stata 13.1, the model can be fitted without this restriction.
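The back-transformation from the rescaled system to the original parameterization can be checked by hand. The sketch below retypes a few estimates from the gsem output as scalars (the scalar names are chosen here for illustration) and applies the formulas for \(M\), \(\beta\), \(\kappa_j\), the standard deviation of \(v_i\), and \(\rho\).

```stata
* Values copied from the gsem output; scalar names are illustrative
* coefficient on L (constrained equal in both equations)
scalar alpha = .7786594
* coefficient on y2 and first cut-point in the rescaled y1 equation
scalar b_y2  = 1.379511
scalar lam1  = -1.131155
* var(e.y2)
scalar var_w = .3823751

* scale factor M and standard deviation of v = w + alpha*L
scalar M     = sqrt(1 + alpha^2)
scalar sig_v = sqrt(var_w + alpha^2)

display "M      = " %9.6f M
display "beta   = " %9.6f b_y2/M
display "kappa1 = " %9.6f lam1/M
display "sig_2  = " %9.6f sig_v
display "rho_12 = " %9.6f alpha^2/(M*sig_v)
```

The displayed values should agree with the y1_y2, cut_1, sig_2, and rho_12 rows reported by nlcom, up to the rounding of the retyped inputs.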
A stored result is simply a scalar, macro, or matrix stored in memory after you run a Stata command. The two main types of stored results are e-class (for estimation commands) and r-class (for general commands). You can list a command’s stored results after it has been run by typing ereturn list (for estimation commands) and return list (for general commands). Let’s try a simple example by loading the auto dataset and running correlate on the variables foreign and mpg.
. sysuse auto
(1978 Automobile Data)

. correlate foreign mpg
(obs=74)

             |  foreign      mpg
-------------+------------------
     foreign |   1.0000
         mpg |   0.3934   1.0000
Because correlate is not an estimation command, use the return list command to see its stored results.
. return list

scalars:
                  r(N) =  74
                r(rho) =  .3933974152205484

matrices:
                  r(C) :  2 x 2
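Stored results can be used directly in expressions, but they are replaced as soon as another r-class command runs. A minimal sketch of copying them for later use follows; the names rho_fm and C are chosen here for illustration.

```stata
sysuse auto, clear
quietly correlate foreign mpg

* r() results can be used directly in expressions
display "N = " r(N) ", rho = " %6.4f r(rho)

* Copy what you need before running another r-class command
scalar rho_fm = r(rho)
matrix C = r(C)
matrix list C
```

Copying into a scalar and a matrix keeps the values available even after later commands overwrite r().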
Now we can use putexcel to export these results to Excel. The basic syntax of putexcel is
putexcel excel_cell=(expression) … using filename [, options]
If you are working with matrices, the syntax is
putexcel excel_cell=matrix(expression) … using filename [, options]
It is easy to build the above syntax in the putexcel dialog. There is a helpful video on YouTube about the dialog here. Let’s list the matrix r(C) to see what it contains.
. matrix list r(C)

symmetric r(C)[2,2]
           foreign        mpg
foreign          1
    mpg  .39339742          1
To re-create the table in Excel, we need to export the matrix r(C) with the matrix row and column names. The command to type in your Stata Command window is
putexcel A1=matrix(r(C), names) using corr
Note that to export the matrix row and column names, we used the names option after we specified the matrix r(C). When I open the file corr.xlsx in Excel, the table below is displayed.
Next let’s try a more involved example. Load the auto dataset, and run a tabulation on the variable foreign. Because tabulate is not an estimation command, use the return list command to see its stored results.
. sysuse auto
(1978 Automobile Data)

. tabulate foreign

   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

. return list

scalars:
                  r(N) =  74
                  r(r) =  2
tabulate is different from most commands in Stata in that it does not automatically save all the results we need into the stored results (we will use scalar r(N)). We need to use the matcell() and matrow() options of tabulate to save the results produced by the command into two Stata matrices.
. tabulate foreign, matcell(freq) matrow(names)

   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

. matrix list freq

freq[2,1]
    c1
r1  52
r2  22

. matrix list names

names[2,1]
    c1
r1   0
r2   1
The putexcel commands used to create a basic tabulation table in Excel, starting at column 1, row 1, are
putexcel A1=("Car type") B1=("Freq.") C1=("Percent") using results, replace
putexcel A2=matrix(names) B2=matrix(freq) C2=matrix(freq/r(N)) using results, modify
Below is the table produced in Excel by these commands.
Again, this is a basic tabulation table. You probably noticed that we did not have the Cum. column or the Total row in the export table. Also, our Car type column contains the numeric values (0,1), not the value labels (Domestic, Foreign) of the variable foreign, and our Percent column is not formatted correctly. To get the exact table displayed in the Results window into an Excel file takes a little programming. With a few functions and a forvalues loop, we can easily export any table produced by running the tabulate command on a numeric variable.
There are two extended macro functions, label and display, that can help us. The label function can extract the value labels for each variable, and the display function can correctly format numbers for our numeric columns. Last, we use forvalues to loop over the rows of the returned matrices to produce our final tables. Our do-file to produce the tabulate table in Excel looks like
sysuse auto
tabulate foreign, matcell(freq) matrow(names)

putexcel A1=("Car type") B1=("Freq.") C1=("Percent") D1=("Cum.") using results, replace

local rows = rowsof(names)
local row = 2
local cum_percent = 0

forvalues i = 1/`rows' {

    local val = names[`i',1]
    local val_lab : label (foreign) `val'

    local freq_val = freq[`i',1]

    local percent_val = `freq_val'/`r(N)'*100
    local percent_val : display %9.2f `percent_val'

    local cum_percent : display %9.2f (`cum_percent' + `percent_val')

    putexcel A`row'=("`val_lab'") B`row'=(`freq_val') C`row'=(`percent_val') ///
             D`row'=(`cum_percent') using results, modify
    local row = `row' + 1
}

putexcel A`row'=("Total") B`row'=(r(N)) C`row'=(100.00) using results, modify
The above commands produce this table in Excel:
The solution above works well for this one table, but what if we need to export the tabulation table for 100 variables to the same Excel spreadsheet? It would be very tedious to run the same do-file 100 times, each time changing the cell and row numbers. Now we could easily change our do-file into the Stata command (ado-file) called tab2xl. The syntax for our new command could be
tab2xl varname using filename, row(rownumber) col(colnumber) [replace sheet(name)]
The pseudocode of our program (file tab2xl.ado) looks like
program tab2xl
    /* parse command syntax */
    /* tabulate varname */
    /* get column letters based on starting column number passed in */
    /* write header row to filename in starting row number passed in */
    /* loop over rows of returned matrix and calculate/write values to filename */
    /* write total row to filename */
end
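One of these steps, turning a starting column number into Excel column letters, can be sketched with the char() function. This is a hypothetical illustration, not the code inside tab2xl: the macro names are invented here, and the approach only covers columns A through Z.

```stata
* Hypothetical sketch: letters for four adjacent columns starting at col
* (works only while 64 + col + k stays within the range for A-Z)
local col = 1
forvalues k = 0/3 {
    local letter`k' = char(64 + `col' + `k')
}
display "`letter0' `letter1' `letter2' `letter3'"
```

With col set to 1, the four macros hold A, B, C, and D, which correspond to the Car type, Freq., Percent, and Cum. columns of the do-file version.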
If you would like to download a working version of our tab2xl command, type
net install http://www.stata.com/users/kcrow/tab2xl
in Stata.
Many researchers in psychology and education advocate reporting of effect sizes, professional organizations such as the American Psychological Association (APA) and the American Educational Research Association (AERA) strongly recommend their reporting, and professional journals such as the Journal of Experimental Psychology: Applied and Educational and Psychological Measurement require that they be reported.
Anyway, today I want to show you
The importance of research results is often assessed by statistical significance, usually that the p-value is less than 0.05. P-values and statistical significance, however, don’t tell us anything about practical significance.
What if I told you that I had developed a new weight-loss pill and that the difference between the average weight loss for people who took the pill and those who took a placebo was statistically significant? Would you buy my new pill? If you were overweight, you might reply, “Of course! I’ll take two bottles and a large order of french fries to go!”. Now let me add that the average difference in weight loss was only one pound over the year. Still interested? My results may be statistically significant but they are not practically significant.
Or what if I told you that the difference in weight loss was not statistically significant — the p-value was “only” 0.06 — but the average difference over the year was 20 pounds? You might very well be interested in that pill.
The size of the effect tells us about the practical significance. P-values do not assess practical significance.
All of which is to say, one should report parameter estimates along with statistical significance.
In my examples above, you knew that 1 pound over the year is small and 20 pounds is large because you are familiar with human weights.
In another context, 1 pound might be large, and in yet another, 20 pounds small.
Formal measures of effect sizes are thus usually presented in unit-free but easy-to-interpret form, such as standardized differences and proportions of variability explained.
Effect sizes that measure the scaled difference between means belong to the “d” family. The generic formula is
\[\delta = \frac{\mu_1 - \mu_2}{\sigma}\]
The estimators differ in terms of how sigma is calculated.
Cohen’s d, for instance, uses the pooled sample standard deviation.
Hedges’s g incorporates an adjustment which removes the bias of Cohen’s d.
Glass’s Δ was originally developed in the context of experiments and uses the “control group” standard deviation in the denominator. It has subsequently been generalized to nonexperimental studies. Because there is no control group in observational studies, Kline (2013) recommends reporting Glass’s Δ using the standard deviation for each group. Glass’s Delta_1 uses one group’s standard deviation and Delta_2 uses the other group’s.
Although I have given definitions to Cohen’s d, Hedges’s g, and Glass’s Δ, different authors swap the definitions around! As a result, many authors refer to all of the above as just Delta.
Be careful when using software to know which Delta you are getting. I have used Stata terminology, of course.
Anyway, the use of a standardized scale allows us to assess practical significance. Delta = 1.5 indicates that the mean of one group is 1.5 standard deviations higher than that of the other. A difference of 1.5 standard deviations is obviously large, and a difference of 0.1 standard deviations is obviously small.
The r family quantifies the ratio of the variance attributable to an effect to the total variance and is often interpreted as the “proportion of variance explained”. The generic estimator is known as eta-squared,
\[\eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{total}}}\]
η^{2} is equivalent to the R-squared statistic from linear regression.
ω^{2} is a less biased variation of η^{2} that is equivalent to the adjusted R-squared.
Both of these measures concern the entire model.
Partial η^{2} and partial ω^{2} are like partial R-squareds and concern individual terms in the model. A term might be a variable or a variable and its interaction with another variable.
Both the d and r families allow us to make an apples-to-apples comparison of variables measured on different scales. For example, an intervention could affect both systolic blood pressure and total cholesterol. Comparing the relative effect of the intervention on the two outcomes would be difficult on their original scales.
How does one compare mmHg and mg/dL? It is straightforward in terms of Cohen’s d or ω^{2} because then we are comparing standard deviation changes or proportion of variance explained.
Consider a study where 30 school children are randomly assigned to classrooms that incorporated web-based instruction (treatment) or standard classroom environments (control). At the end of the school year, the children were given tests to measure reading and mathematics skills. The reading test is scored on a 0-15 point scale and, the mathematics test, on a 0-100 point scale.
Let’s download a dataset for our fictitious example from the Stata website by typing:
. use http://www.stata.com/videos13/data/webclass.dta Contains data from http://www.stata.com/videos13/data/webclass.dta obs: 30 Fictitious web-based learning experiment data vars: 5 5 Sep 2013 11:28 size: 330 (_dta has notes) ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- id byte %9.0g ID Number treated byte %9.0g treated Treatment Group agegroup byte %9.0g agegroup Age Group reading float %9.0g Reading Score math float %9.0g Math Score ------------------------------------------------------------------------------- . notes _dta: 1. Variable treated records 0=control, 1=treated. 2. Variable agegroup records 1=7 years old, 2=8 years old, 3=9 years old.
We can compute a t-statistic to test the null hypothesis that the average math scores are the same in the treatment and control groups.
. ttest math, by(treated)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 Control |      15    69.98866    3.232864    12.52083    63.05485    76.92246
 Treated |      15    79.54943    1.812756    7.020772    75.66146     83.4374
---------+--------------------------------------------------------------------
combined |      30    74.76904    2.025821    11.09588    70.62577    78.91231
---------+--------------------------------------------------------------------
    diff |           -9.560774    3.706412               -17.15301   -1.968533
------------------------------------------------------------------------------
    diff = mean(Control) - mean(Treated)                          t =  -2.5795
Ho: diff = 0                                     degrees of freedom =       28

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0077         Pr(|T| > |t|) = 0.0154          Pr(T > t) = 0.9923
The treated students have a larger mean, yet the difference of -9.56 is reported as negative because -ttest- calculated Control minus Treated. So just remember, negative differences mean Treated > Control in this case.
The t-statistic equals -2.58 and its two-sided p-value of 0.0154 indicates that the difference between the math scores in the two groups is statistically significant.
Next, let’s calculate effect sizes from the d family:
. esize twosample math, by(treated) cohensd hedgesg glassdelta

Effect size based on mean comparison

                                   Obs per group:
                                         Control =     15
                                         Treated =     15
---------------------------------------------------------
        Effect Size |   Estimate     [95% Conf. Interval]
--------------------+------------------------------------
          Cohen's d |  -.9419085    -1.691029   -.1777553
         Hedges's g |   -.916413    -1.645256   -.1729438
    Glass's Delta 1 |  -.7635896     -1.52044    .0167094
    Glass's Delta 2 |  -1.361784    -2.218342   -.4727376
---------------------------------------------------------
Cohen’s d and Hedges’s g both indicate that the average math scores differ by approximately -0.93 standard deviations, with 95% confidence intervals of (-1.69, -0.18) and (-1.65, -0.17), respectively.
Since this is an experiment, we are interested in Glass’s Delta 1 because it is calculated using the control group standard deviation. Average math scores differ by -0.76 control-group standard deviations, and the confidence interval is (-1.52, 0.02).
The confidence intervals for Cohen’s d and Hedges’s g do not include the null value of zero but the confidence interval for Glass’s Delta 1 does. Thus we cannot completely rule out the possibility that the treatment had no effect on math scores.
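The d-family point estimates can be reproduced by hand from the group summaries in the ttest output. The sketch below retypes those summaries as scalars (names chosen here for illustration). The gamma-function form of the small-sample correction for Hedges's g is an assumption; it reproduces the reported value, but the output does not say which form esize uses.

```stata
* Group summaries copied from the ttest output (scalar names are illustrative)
scalar n_c    = 15
scalar mean_c = 69.98866
scalar sd_c   = 12.52083
scalar n_t    = 15
scalar mean_t = 79.54943
scalar sd_t   = 7.020772

* Cohen's d divides the mean difference by the pooled standard deviation
scalar df_pool  = n_c + n_t - 2
scalar sd_pool  = sqrt(((n_c-1)*sd_c^2 + (n_t-1)*sd_t^2)/df_pool)
scalar cohens_d = (mean_c - mean_t)/sd_pool

* Hedges's g: assumed gamma-function bias correction applied to d
scalar g_corr   = exp(lngamma(df_pool/2) - lngamma((df_pool-1)/2))/sqrt(df_pool/2)
scalar hedges_g = cohens_d*g_corr

* Glass's Delta uses one group's standard deviation at a time
scalar delta_1  = (mean_c - mean_t)/sd_c
scalar delta_2  = (mean_c - mean_t)/sd_t

display "Cohen's d  = " %9.6f cohens_d
display "Hedges's g = " %9.6f hedges_g
display "Delta 1    = " %9.6f delta_1
display "Delta 2    = " %9.6f delta_2
```

Because the control group has the larger standard deviation (12.52 versus 7.02), Delta 1 is the smallest of the four estimates in absolute value, which is why its confidence interval is the only one that reaches zero.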
Next we could incorporate the age group of the children into our analysis by using a two-way ANOVA to test the null hypothesis that the mean math scores are equal for all groups.
. anova math treated##agegroup

                       Number of obs =      30     R-squared     =  0.2671
                       Root MSE      = 10.4418     Adj R-squared =  0.1144

              Source |  Partial SS    df       MS           F     Prob > F
    -----------------+----------------------------------------------------
               Model |  953.697551     5   190.73951       1.75     0.1617
                     |
             treated |  685.562956     1  685.562956       6.29     0.0193
            agegroup |  47.7059268     2  23.8529634       0.22     0.8051
    treated#agegroup |  220.428668     2  110.214334       1.01     0.3789
                     |
            Residual |  2616.73825    24  109.030761
    -----------------+----------------------------------------------------
               Total |   3570.4358    29  123.118476
The F-statistic for the entire model is not statistically significant (F=1.75, ndf=5, ddf=24, p=0.1617) but the F-statistic for the main effect of treatment is statistically significant (F=6.29, ndf=1, ddf=24, p=0.0193).
We can compute the η^{2} and partial η^{2} estimates for this model using the estat esize command immediately after our anova command (note that estat esize works after the regress command too).
. estat esize

Effect sizes for linear models

---------------------------------------------------------------------
               Source |   Eta-Squared     df     [95% Conf. Interval]
----------------------+----------------------------------------------
                Model |      .2671096      5            0    .4067062
                      |
              treated |      .2076016      1     .0039512    .4451877
             agegroup |      .0179046      2            0    .1458161
     treated#agegroup |      .0776932      2            0     .271507
---------------------------------------------------------------------
The overall η^{2} indicates that our model accounts for approximately 26.7% of the variability in math scores, though the 95% confidence interval includes the null value of zero (0.0%, 40.7%). The partial η^{2} for treatment is 0.21 (21% of the variability explained), and its 95% confidence interval excludes zero (0.4%, 44.5%).
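The point estimates in the Eta-Squared column can be reproduced by hand from the partial sums of squares in the anova table: each one is SS_effect / (SS_effect + SS_residual). Here is a minimal Python sketch of that arithmetic, using numbers copied from the output; the variable and function names are illustrative only, and the confidence intervals (which rely on the noncentral F distribution) are not attempted.

```python
# Partial sums of squares copied from the anova output for math scores.
ss_residual = 2616.73825

partial_ss = {
    "Model": 953.697551,
    "treated": 685.562956,
    "agegroup": 47.7059268,
    "treated#agegroup": 220.428668,
}

def partial_eta_squared(ss_effect, ss_error):
    """Share of (effect + residual) variation attributed to the effect."""
    return ss_effect / (ss_effect + ss_error)

eta_sq = {name: partial_eta_squared(ss, ss_residual)
          for name, ss in partial_ss.items()}

for source, value in eta_sq.items():
    print(f"{source:<18} eta-squared = {value:.7f}")
```

For the Model row, SS_effect + SS_residual is the total sum of squares, so the overall η^{2} coincides with the R-squared reported by anova.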
We could calculate the alternative r-family member ω^{2} rather than η^{2} by typing
. estat esize, omega

Effect sizes for linear models

---------------------------------------------------------------------
               Source | Omega-Squared     df     [95% Conf. Interval]
----------------------+----------------------------------------------
                Model |      .1144241      5            0    .2831033
                      |
              treated |       .174585      1            0    .4220705
             agegroup |             0      2            0    .0746342
     treated#agegroup |      .0008343      2            0    .2107992
---------------------------------------------------------------------
The overall ω^{2} indicates that our model accounts for approximately 11.4% of the variability in math scores and treatment accounts for 17.5%. This perplexing result stems from the way that ω^{2} and partial ω^{2} are calculated. See Pierce, Block, & Aguinis (2004) for a thorough explanation.
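One way to see why the model's ω^{2} can be smaller than the treatment's: the point estimates in the output above are reproduced by shrinking each η^{2} by (df_effect / df_residual) × (1 − η^{2}) and truncating at zero. That formula is inferred here from the displayed numbers rather than quoted from the manual, and other definitions of ω^{2} exist, so treat the Python sketch below as an assumption-laden hand check. The function name is hypothetical.

```python
DF_RESIDUAL = 24  # residual degrees of freedom from the anova table

# (eta-squared, effect df) pairs copied from the estat esize output.
eta_and_df = {
    "Model": (0.2671096, 5),
    "treated": (0.2076016, 1),
    "agegroup": (0.0179046, 2),
    "treated#agegroup": (0.0776932, 2),
}

def omega_squared(eta_sq, df_effect, df_error):
    """Shrink eta-squared by a degrees-of-freedom penalty; floor at zero.

    Assumed formula, chosen because it matches the output shown above.
    """
    adjusted = eta_sq - (df_effect / df_error) * (1.0 - eta_sq)
    return max(0.0, adjusted)

omega_sq = {name: omega_squared(eta, df, DF_RESIDUAL)
            for name, (eta, df) in eta_and_df.items()}

for source, value in omega_sq.items():
    print(f"{source:<18} omega-squared = {value:.7f}")
```

Under this formula, the model is penalized for all 5 of its degrees of freedom, while treatment is penalized for only 1, which is how the adjusted value for the whole model can fall below the adjusted value for one of its terms.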
Except for the η^{2} for treatment, the confidence intervals include 0, so we cannot rule out the possibility that there is no effect. Whether results are practically significant is generally a matter of context and opinion. In some situations, accounting for 5% of the variability in an outcome could be very important, and in other situations, accounting for 30% may not be.
We could repeat the same analyses for the reading scores using the following commands:
. ttest reading, by(treated)
. esize twosample reading, by(treated) cohensd hedgesg glassdelta
. anova reading treated##agegroup
. estat esize
. estat esize, omega
None of the t- or F-statistics for reading scores were statistically significant at the 0.05 level.
Even though the reading and math scores were measured on two different scales, we can directly compare the relative effect of the treatment using effect sizes:
Effect Size   |   Reading Score            Math Score
------------------------------------------------------------
Cohen's d     |   -0.23 (-0.95 - 0.49)     -0.94 (-1.69 - -0.18)
Hedges's g    |   -0.22 (-0.92 - 0.48)     -0.92 (-1.65 - -0.17)
Glass's Delta |   -0.21 (-0.93 - 0.51)     -0.76 (-1.52 -  0.02)
Eta-squared   |    0.02 ( 0.00 - 0.20)      0.21 ( 0.00 -  0.44)
Omega-squared |    0.00 ( 0.00 - 0.17)      0.17 ( 0.00 -  0.42)
The results show that the average reading scores in the treated and control groups differ by approximately 0.22 standard deviations, while the average math scores differ by approximately 0.92 standard deviations. Similarly, treatment status accounted for almost none of the variability in reading scores, while it accounted for roughly 17% of the variability in math scores. The estimated effect of the intervention was larger for math scores than for reading scores. However, we cannot completely rule out an effect size of zero (no effect) for either reading or math scores because several confidence intervals included zero. Whether or not the effects are practically significant is a matter of interpretation, but the effect sizes provide a standardized metric for evaluation.
Simulation studies have shown that bootstrap confidence intervals for the d family may be preferable to confidence intervals based on the noncentral t distribution when the variable of interest does not have a normal distribution (Kelley 2005; Algina, Keselman, and Penfield 2006). We can calculate bootstrap confidence intervals for Cohen’s d and Hedges’s g using Stata’s bootstrap prefix:
. bootstrap r(d) r(g), reps(500) nowarn: esize twosample reading, by(treated)
(running esize on estimation sample)

Bootstrap replications (500)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
..................................................   150
..................................................   200
..................................................   250
..................................................   300
..................................................   350
..................................................   400
..................................................   450
..................................................   500

Bootstrap results                               Number of obs      =        30
                                                Replications       =       500

      command:  esize twosample reading, by(treated)
        _bs_1:  r(d)
        _bs_2:  r(g)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |   -.228966   .3905644    -0.59   0.558    -.9944582    .5365262
       _bs_2 |  -.2227684   .3799927    -0.59   0.558    -.9675403    .5220036
------------------------------------------------------------------------------
The bootstrap estimate of the 95% confidence interval for Cohen’s d is -0.99 to 0.54, which is slightly wider than the earlier estimate of -0.95 to 0.49 based on the noncentral t distribution (see [R] esize for details). The bootstrap estimate is slightly wider for Hedges’s g as well.
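To make the mechanics concrete, here is a rough Python sketch of what a normal-based bootstrap interval involves: resample observations with replacement, recompute Cohen's d each time, take the standard deviation of those replicates as the standard error, and form estimate ± z × SE. This is an illustration under stated assumptions, not Stata's implementation. The function names are invented, and the 30 data points are simulated stand-ins because the original dataset is not reproduced here. Only the final interval calculation uses numbers from the output above.

```python
import random
import statistics

def cohens_d(x, y):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * statistics.variance(x)
                  + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    return (statistics.mean(x) - statistics.mean(y)) / pooled_var ** 0.5

def bootstrap_se(rows, reps=500, seed=1):
    """Bootstrap standard error of d; rows are (group, value) pairs."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < reps:
        sample = rng.choices(rows, k=len(rows))  # resample observations
        g0 = [v for g, v in sample if g == 0]
        g1 = [v for g, v in sample if g == 1]
        # Skip degenerate resamples in which a group has no variation.
        if len(set(g0)) < 2 or len(set(g1)) < 2:
            continue
        draws.append(cohens_d(g0, g1))
    return statistics.stdev(draws)

def normal_ci(estimate, se, level=0.95):
    """Normal-based interval: estimate +/- z * se."""
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se

# Reproduce the interval for _bs_1 from the coefficient and bootstrap SE.
lower, upper = normal_ci(-0.228966, 0.3905644)
print(f"Normal-based 95% CI: ({lower:.7f}, {upper:.7f})")

# Hypothetical data: 15 simulated scores per group, for illustration only.
sim = random.Random(42)
rows = ([(0, sim.gauss(75, 10)) for _ in range(15)]
        + [(1, sim.gauss(73, 10)) for _ in range(15)])
se_sim = bootstrap_se(rows)
print(f"Bootstrap SE of d on simulated data: {se_sim:.3f}")
```

The first printed interval can be compared with the _bs_1 row of the bootstrap table; the second number has no counterpart in the output because it comes from simulated data.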
Stata’s effect-size calculators can also estimate effect sizes from summary statistics. If we know that the mean, standard deviation, and sample size are 70, 12.5, and 15 for one group and 80, 7, and 15 for another group, we can use esizei to estimate effect sizes from the d family:
. esizei 15 70 12.5 15 80 7, cohensd hedgesg glassdelta

Effect size based on mean comparison

                                   Obs per group:
                                         Group 1 =     15
                                         Group 2 =     15
---------------------------------------------------------
        Effect Size |   Estimate     [95% Conf. Interval]
--------------------+------------------------------------
          Cohen's d |  -.9871279    -1.739873   -.2187839
         Hedges's g |  -.9604084    -1.692779   -.2128619
    Glass's Delta 1 |        -.8    -1.561417   -.0143276
    Glass's Delta 2 |  -1.428571    -2.299112   -.5250285
---------------------------------------------------------
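The point estimates above can be worked out by hand from the six summary statistics. The Python sketch below does so under the usual definitions: Cohen's d divides the mean difference by the pooled standard deviation, each Glass's Delta divides it by one group's standard deviation, and Hedges's g multiplies d by a small-sample correction factor. The gamma-function form of that correction is an assumption that happens to reproduce the displayed value. The function name `d_family` is hypothetical, and the confidence intervals are not attempted.

```python
import math

def d_family(n1, m1, s1, n2, m2, s2):
    """d-family point estimates from group sizes, means, and SDs."""
    diff = m1 - m2
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
    d = diff / pooled_sd
    # Assumed small-sample correction for Hedges's g:
    # Gamma(df/2) / (sqrt(df/2) * Gamma((df - 1)/2)), via log-gamma.
    correction = (math.exp(math.lgamma(df / 2) - math.lgamma((df - 1) / 2))
                  / math.sqrt(df / 2))
    return {
        "cohens_d": d,
        "hedges_g": d * correction,
        "glass_delta_1": diff / s1,  # scaled by group 1's SD
        "glass_delta_2": diff / s2,  # scaled by group 2's SD
    }

# Same inputs as the esizei command: n, mean, SD for each group.
est = d_family(15, 70, 12.5, 15, 80, 7)
for name, value in est.items():
    print(f"{name:<14} {value:.7f}")
```

Because the two groups have quite different standard deviations (12.5 versus 7), the two Glass's Delta values differ considerably, which is why the choice of reference group matters.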
We can estimate effect sizes from the r family using esizei with slightly different syntax. For example, if we know the numerator and denominator degrees of freedom along with the F statistic, we can calculate η^{2} and ω^{2} using the following command:
. esizei 1 28 6.65

Effect sizes for linear models

---------------------------------------------------------
        Effect Size |   Estimate     [95% Conf. Interval]
--------------------+------------------------------------
        Eta-Squared |   .1919192     .0065357    .4167874
      Omega-Squared |   .1630592            0    .3959584
---------------------------------------------------------
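The two point estimates above can be recovered from the F statistic and its degrees of freedom alone. In the Python sketch below, η^{2} = F × df_num / (F × df_num + df_den) is the standard conversion, while the ω^{2} line uses the degrees-of-freedom adjustment that matches the displayed value and should be read as an inferred formula, not a quoted one. Function names are invented for this illustration.

```python
def eta_squared_from_f(f, df_num, df_den):
    """Convert an F statistic to eta-squared."""
    return f * df_num / (f * df_num + df_den)

def omega_squared_from_f(f, df_num, df_den):
    """Adjust eta-squared for degrees of freedom (assumed formula)."""
    eta_sq = eta_squared_from_f(f, df_num, df_den)
    return max(0.0, eta_sq - (df_num / df_den) * (1.0 - eta_sq))

# Same inputs as the esizei command: df_num = 1, df_den = 28, F = 6.65.
eta_sq_f = eta_squared_from_f(6.65, 1, 28)
omega_sq_f = omega_squared_from_f(6.65, 1, 28)
print(f"Eta-squared   = {eta_sq_f:.7f}")
print(f"Omega-squared = {omega_sq_f:.7f}")
```

This kind of conversion is handy when a published paper reports only F statistics and degrees of freedom.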
Stata has dialog boxes that can assist you in calculating effect sizes. If you would like a brief introduction using the GUI, you can watch a demonstration on Stata’s YouTube Channel:
Most older papers and many current papers do not report effect sizes. Nowadays, the general consensus among behavioral scientists, their professional organizations, and their journals is that effect sizes should always be reported in addition to tests of statistical significance. Stata 13 makes it easy to compute the most popular effect sizes.
Some methodologists believe that effect sizes with confidence intervals should always be reported and that statistical hypothesis tests should be abandoned altogether; see Cumming (2012) and Kline (2013). While this may sound like a radical notion, other fields such as epidemiology have been moving in this direction since the 1990s. Cumming and Kline offer compelling arguments for this paradigm shift as well as excellent introductions to effect sizes.
Algina, J., H. J. Keselman, and R. D. Penfield. (2006). Confidence interval coverage for Cohen’s effect size statistic. Educational and Psychological Measurement, 66(6): 945–960.
American Psychological Association. (2009). Publication Manual of the American Psychological Association. 6th ed. Washington, DC: American Psychological Association.
Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Taylor & Francis.
Kelley, K. (2005). The effects of nonnormal distributions on confidence intervals around the standardized mean difference: Bootstrap and parametric confidence intervals. Educational and Psychological Measurement, 65: 51–69.
Kirk, R. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56: 746–759.
Kline, R. B. (2013). Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. 2nd ed. Washington, DC: American Psychological Association.
Pierce, C. A., R. A. Block, and H. Aguinis. (2004). Cautionary note on reporting eta-squared values from multifactor ANOVA designs. Educational and Psychological Measurement, 64(6): 916–924.
Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25(2): 26–30.
Wilkinson, L., and APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54: 594–604.