Today I want to show you how to create animated graphics using Stata. It’s easier than you might expect and you can use animated graphics to illustrate concepts that would be challenging to illustrate with static graphs. In addition to Stata, you will need a video editing program but don’t be concerned if you don’t have one. At the 2012 UK Stata User Group Meeting Robert Grant demonstrated how to create animated graphics from within Stata using a free software program called FFmpeg. I will show you how I create my animated graphs using Camtasia and how Robert creates his using FFmpeg.

I recently recorded a video for the Stata YouTube channel called “Power and sample size calculations in Stata: A conceptual introduction”. I wanted to illustrate two concepts: (1) that statistical power increases as sample size increases, and (2) that statistical power increases as effect size increases. Both of these concepts can be illustrated with a static graph along with the explanation “imagine that …”. Creating animated graphs allowed me to skip the explanation and just show what I meant.

Videos are illusions. All videos — from Charles-Émile Reynaud’s 1877 praxinoscope to modern blu-ray movies — are created by displaying a series of ordered still images for a fraction of a second each. Our brains perceive this series of still images as motion.

To create the illusion of motion with graphs, we make an ordered series of slightly differing graphs. We can use loops to do this. If you are not familiar with loops in Stata, here’s one to count to five:

    forvalues i = 1(1)5 {
        disp "i = `i'"
    }
    i = 1
    i = 2
    i = 3
    i = 4
    i = 5

We could place a graph command inside the loop. If, for each iteration, the **graph** command created a slightly different graph, we would be on our way to creating our first video. The loop below creates a series of graphs of normal densities with means 0 through 1 in increments of 0.1.

    forvalues mu = 0(0.1)1 {
        twoway function y=normalden(x,`mu',1), range(-3 6) title("N(`mu',1)")
    }

You may have noticed the illusion of motion as Stata created each graph; the normal densities appeared to be moving to the right as each new graph appeared on the screen.

You may have also noticed that some of the values of the mean did not look as you would have wanted. For example, 1.0 was displayed as 0.999999999. That’s not a mistake; it happens because Stata stores numbers and performs calculations in base two and displays them in base ten. For a detailed explanation, see Precision (yet again), Part I.
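Here is a minimal sketch of the effect, not taken from the original example. The decimal fraction 0.1 has no exact binary representation, so arithmetic on it can drift by a tiny amount. The particular numbers and display formats are my own choices for illustration.

```stata
* 0.1 cannot be stored exactly in binary; show more digits than the default
display %21.18f 0.1

* In double precision, 0.1 + 0.2 is not exactly equal to 0.3 (displays 0)
display 0.1 + 0.2 == 0.3

* Formatting to one decimal place hides the tiny discrepancy
display string(0.1 + 0.2, "%3.1f")
```

The last line uses the same kind of display format that cleans up the graph titles.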

We can fix that by reformatting the means using the **string()** function.

    forvalues mu = 0(0.1)1 {
        local mu = string(`mu', "%3.1f")
        twoway function y=normalden(x,`mu',1), range(-3 6) title("N(`mu',1)")
    }

Next, we need to save our graphs. We can do this by adding **graph export** inside the loop.

    forvalues mu = 0(0.1)1 {
        local mu = string(`mu', "%3.1f")
        twoway function y=normalden(x,`mu',1), range(-3 6) title("N(`mu',1)")
        graph export graph_`mu'.png, as(png) width(1280) height(720) replace
    }

Note that the name of each graph file includes the value of mu so that we know the order of our files. We can view the contents of the directory to verify that Stata has created a file for each of our graphs.

    . ls
      <dir>   2/11/14 12:12  .
      <dir>   2/11/14 12:12  ..
      35.6k   2/11/14 12:11  graph_0.0.png
      35.6k   2/11/14 12:11  graph_0.1.png
      35.7k   2/11/14 12:11  graph_0.2.png
      35.7k   2/11/14 12:11  graph_0.3.png
      35.7k   2/11/14 12:11  graph_0.4.png
      35.8k   2/11/14 12:11  graph_0.5.png
      35.9k   2/11/14 12:12  graph_0.6.png
      35.7k   2/11/14 12:12  graph_0.7.png
      35.8k   2/11/14 12:12  graph_0.8.png
      35.9k   2/11/14 12:12  graph_0.9.png
      35.6k   2/11/14 12:12  graph_1.0.png

Now that we have created our graphs, we need to combine them into a video.

There are many commercial, freeware, and free software programs available that we could use. I will outline the basic steps using two of them: Camtasia, a commercial GUI-based product (not free), and FFmpeg, a free command-based program.

Most commercial video editing programs have similar interfaces. The user imports image, sound, and video files, organizes them in tracks on a timeline, and then previews the resulting video. Camtasia is a commercial video program that I use to record videos for the Stata YouTube channel, and its interface looks like this.

We begin by importing the graph files into Camtasia:

Next we drag the images onto the timeline:

And then we make the display time for each image very short…in this case 0.1 seconds or 10 frames per second.

After previewing the video, we can export it to any of Camtasia’s supported formats. I’ve exported to a “.gif” file because it is easy to view in a web browser.

We just created our first animated graph! All we have to do to make it look as professional as the power-and-sample size examples I showed you earlier is go back into our Stata program and modify the **graph** command to add the additional elements we want to display!

Stata user and medical statistician Robert Grant gave a presentation at the 2012 UK Stata User Group Meeting in London entitled “Producing animated graphs from Stata without having to learn any specialized software”. You can read more about Robert by visiting his blog and clicking on About.

In his presentation, Robert demonstrated how to combine graph images into a video using a free software program called FFmpeg. Robert followed the same basic strategy I demonstrated above, but Robert’s choice of software has two appealing features. First, the software is readily available and free. Second, FFmpeg can be called from within the Stata environment using the **winexec** command. This means that we can create our graphs and combine them into a video using Stata do files. Combining dozens or hundreds of graphs into a single video with a program is faster and easier than using a drag-and-drop interface.

Let’s return to our previous example and combine the files using FFmpeg. Recall that we inserted the mean into the name of each file (e.g. “graph_0.4.png”) so that we could keep track of the order of the files. In my experience, it can be difficult to combine files with decimals in their names using FFmpeg. To avoid the problem, I have added a line of code between the **twoway** command and the **graph export** command that names the files with sequential integers which are padded with zeros.

    forvalues mu = 0(0.1)1 {
        local mu = string(`mu', "%3.1f")
        twoway function y=normalden(x,`mu',1), range(-3 6) title("N(`mu',1)")
        local mu = string(`mu'*10+1, "%03.0f")
        graph export graph_`mu'.png, as(png) width(1280) height(720) replace
    }

    . ls
      <dir>   2/12/14 12:21  .
      <dir>   2/12/14 12:21  ..
      35.6k   2/12/14 12:21  graph_001.png
      35.6k   2/12/14 12:21  graph_002.png
      35.7k   2/12/14 12:21  graph_003.png
      35.7k   2/12/14 12:21  graph_004.png
      35.7k   2/12/14 12:21  graph_005.png
      35.8k   2/12/14 12:21  graph_006.png
      35.9k   2/12/14 12:21  graph_007.png
      35.7k   2/12/14 12:21  graph_008.png
      35.8k   2/12/14 12:21  graph_009.png
      35.9k   2/12/14 12:21  graph_010.png
      35.6k   2/12/14 12:21  graph_011.png

We can then combine these files into a video with FFmpeg using the following commands:

    local GraphPath "C:\Users\jch\AnimatedGraphics\example\"
    winexec "C:\Program Files\FFmpeg\bin\ffmpeg.exe" -i `GraphPath'graph_%03d.png -b:v 512k `GraphPath'graph.mpg

The local macro **GraphPath** contains the path for the directory where my graphics files are stored.

The Stata command **winexec** *whatever* executes *whatever*. In our case, *whatever* is **ffmpeg.exe**, preceded by **ffmpeg.exe**’s path and followed by the arguments FFmpeg needs. We specify two options, **-i** and **-b:v**, followed by the name of the output file.

The **-i** option is followed by a path and filename template. In our case, the path is obtained from the Stata local macro GraphPath and the filename template is “graph_%03d.png”. This template tells FFmpeg to look for a three-digit sequence of numbers between “graph_” and “.png” in the filenames. The zero that precedes the three in the template tells FFmpeg that the three-digit sequence of numbers is padded with zeros.

The **-b:v** option sets the bitrate of the video, here 512k. The final argument is the path and filename of the video to be created; FFmpeg infers the output format from the file extension.
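If **winexec** is not convenient on your system, the same FFmpeg call could probably be issued with Stata’s **shell** command instead. This is only a sketch under two assumptions that are mine, not part of the example above: that the ffmpeg executable is on the system path, and that the PNG files are in Stata’s current working directory.

```stata
* Hypothetical variant of the winexec call above.
* Assumes ffmpeg is on the system path and that graph_001.png ... graph_011.png
* are in the current working directory.
shell ffmpeg -i graph_%03d.png -b:v 512k graph.mpg
```

Unlike **winexec**, **shell** waits for the program to finish before returning control to Stata, which can be handy when later commands in a do-file depend on the video file already existing.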

Once we have created our video, we can use FFmpeg to convert our video to other video formats. For example, we could convert “graph.mpg” to “graph.gif” using the following command:

winexec "C:\Program Files\FFmpeg\bin\ffmpeg.exe" -r 10 -i `GraphPath'graph.mpg -t 10 -r 10 `GraphPath'graph.gif

which creates this graph:

FFmpeg is a very flexible program and there are far too many options to discuss in this blog entry. If you would like to learn more about FFmpeg you can visit their website at www.ffmpeg.org.

I made the preceding examples as simple as possible so that we could focus on the mechanics of creating videos. We now know that, if we want to make professional looking videos, all the complication comes on the Stata side. We leave our loop alone but change the **graph** command inside it to be more complicated.

So here’s how I created the two animated-graphics videos that I used to create the overall video “Power and sample size calculations in Stata: A conceptual introduction” on our YouTube channel.

The first demonstrated that increasing the effect size (the difference between the means) results in increased statistical power.

    local GraphCounter = 100
    local mu_null = 0
    local sd = 1
    local z_crit = round(-1*invnormal(0.05), 0.01)
    local z_crit_label = `z_crit' + 0.75
    forvalues mu_alt = 1(0.01)3 {
        twoway ///
            function y=normalden(x,`mu_null',`sd'), ///
                range(-3 `z_crit') color(red) dropline(0) || ///
            function y=normalden(x,`mu_alt',`sd'), ///
                range(-3 5) color(green) dropline(`mu_alt') || ///
            function y=normalden(x,`mu_alt',`sd'), ///
                range(`z_crit' 6) recast(area) color(green) || ///
            function y=normalden(x,`mu_null',`sd'), ///
                range(`z_crit' 6) recast(area) color(red) ///
            title("Power for {&mu}={&mu}{subscript:0} versus {&mu}={&mu}{subscript:A}") ///
            xtitle("{it: z}") xlabel(-3 -2 -1 0 1 2 3 4 5 6) ///
            legend(off) ///
            ytitle("Density") yscale(range(0 0.6)) ///
            ylabel(0(0.1)0.6, angle(horizontal) nogrid) ///
            text(0.45 0 "{&mu}{subscript:0}", color(red)) ///
            text(0.45 `mu_alt' "{&mu}{subscript:A}", color(green))
        graph export mu_alt_`GraphCounter'.png, as(png) width(1280) height(720) replace
        local ++GraphCounter
    }

The above Stata code created the *.png files that I then combined using Camtasia to produce this gif:

The second video demonstrated that power increases as the sample size increases.

    local GraphCounter = 301
    local mu_label = 0.45
    local power_label = 2.10
    local mu_null = 0
    local mu_alt = 2
    forvalues sd = 1(-0.01)0.5 {
        local z_crit = round(-1*invnormal(0.05)*`sd', 0.01)
        local z_crit_label = `z_crit' + 0.75
        twoway ///
            function y=normalden(x,`mu_null',`sd'), ///
                range(-3 `z_crit') color(red) dropline(0) || ///
            function y=normalden(x,`mu_alt',`sd'), ///
                range(-3 5) color(green) dropline(`mu_alt') || ///
            function y=normalden(x,`mu_alt',`sd'), ///
                range(`z_crit' 6) recast(area) color(green) || ///
            function y=normalden(x,`mu_null',`sd'), ///
                range(`z_crit' 6) recast(area) color(red) ///
            title("Power for {&mu}={&mu}{subscript:0} versus {&mu}={&mu}{subscript:A}") ///
            xtitle("{it: z}") xlabel(-3 -2 -1 0 1 2 3 4 5 6) ///
            legend(off) ///
            ytitle("Density") yscale(range(0 0.6)) ///
            ylabel(0(0.1)0.6, angle(horizontal) nogrid) ///
            text(`mu_label' 0 "{&mu}{subscript:0}", color(red)) ///
            text(`mu_label' `mu_alt' "{&mu}{subscript:A}", color(green))
        graph export mu_alt_`GraphCounter'.png, as(png) width(1280) height(720) replace
        local ++GraphCounter
        local mu_label = `mu_label' + 0.005
        local power_label = `power_label' + 0.03
    }

Just as previously, the above Stata code creates the *.png files that I then combine using Camtasia to produce a gif:

Let me show you some more examples.

The next example demonstrates the basic idea of lowess smoothing.

    sysuse auto
    local WindowWidth = 500
    forvalues WindowUpper = 2200(25)5000 {
        local WindowLower = `WindowUpper' - `WindowWidth'
        twoway (scatter mpg weight) ///
            (lowess mpg weight if weight < (`WindowUpper'-250), lcolor(green)) ///
            (lfit mpg weight if weight>`WindowLower' & weight<`WindowUpper', ///
                lwidth(medium) lcolor(red)) ///
            , xline(`WindowLower' `WindowUpper', lwidth(medium) lcolor(black)) ///
            legend(on order(1 2 3) cols(3))
        graph export lowess_`WindowUpper'.png, as(png) width(1280) height(720) replace
    }

The result is,

The animated graph I created is not yet a perfect analogy to what lowess actually does, but it comes close. It has two problems. First, the lowess curve changes outside of the sliding window, which it should not. Second, the animation does not illustrate the weighting of the points within the window, say, by using differently sized markers for the points in the sliding window. Even so, the graph does a far better job than the usual explanation that one should imagine sliding a window across the scatterplot.

As yet another example, we can use animated graphs to demonstrate the concept of convergence. There is a FAQ on the Stata website written by Bill Gould that explains the relationship between the chi-squared and F distributions. The animated graph below shows that F(d1, d2) converges to χ^2(d1)/d1 as d2 goes to infinity; that is, d1 times an F(d1, d2) variable converges in distribution to a χ^2 variable with d1 degrees of freedom:

    forvalues df = 1(1)100 {
        twoway function y=chi2(2,2*x), range(0 6) color(red) || ///
            function y=F(2,`df',x), range(0 6) color(green) ///
            title("Cumulative distributions for {&chi}{sup:2}{sub:df} and {it:F}{subscript:df,df2}") ///
            xtitle("{it: denominator df}") xlabel(0 1 2 3 4 5 6) legend(off) ///
            text(0.45 4 "df2 = `df'", size(huge) color(black)) ///
            legend(on order(1 "{&chi}{sup:2}{sub:df}" 2 "{it:F}{subscript:df,df2}") cols(2) position(5) ring(0))
        local df = string(`df', "%03.0f")
        graph export converge2_`df'.png, as(png) width(1280) height(720) replace
    }
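The convergence can also be checked numerically at a single point. The sketch below uses the same **F()** and **chi2()** functions as the graph code; the evaluation point x = 1.5 and the list of denominator degrees of freedom are arbitrary choices of mine.

```stata
* chi2(2, 2*x) is the CDF at x of a chi-squared(2) variable divided by its df.
* F(2, df2, x) should approach that value as the denominator df grows.
foreach df2 in 1 10 100 10000 {
    display "df2 = `df2':  F = " F(2, `df2', 1.5) "   chi2 = " chi2(2, 2*1.5)
}
```

The chi-squared value is the same on every line; only the F value moves toward it as df2 increases.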

The t distribution has a similar relationship with the normal distribution.

    forvalues df = 1(1)100 {
        twoway function y=normal(x), range(-3 3) color(red) || ///
            function y=t(`df',x), range(-3 3) color(green) ///
            title("Cumulative distributions for Normal(0,1) and {it:t}{subscript:df}") ///
            xtitle("{it: t/z}") xlabel(-3 -2 -1 0 1 2 3) legend(off) ///
            text(0.45 -2 "df = `df'", size(huge) color(black)) ///
            legend(on order(1 "N(0,1)" 2 "{it:t}{subscript:df}") cols(2) position(5) ring(0))
        local df = string(`df', "%03.0f")
        graph export converge_`df'.png, as(png) width(1280) height(720) replace
    }

The result is

I have learned through trial and error two things that improve the quality of my animated graphs. First, note that the axes of the graphs in most of the examples above are explicitly defined in the graph commands. This is often necessary to keep the axes stable from graph to graph. Second, videos have a smoother, higher quality appearance when there are many graphs with very small changes from graph to graph.

I hope I have convinced you that creating animated graphics with Stata is easier than you imagined. If the old saying that “a picture is worth a thousand words” is true, imagine how many words you can save using animated graphs.

Relationship between chi-squared and F distributions

After the entry was posted, a few users pointed out two features they wanted added to **putexcel**:

- Retain a cell’s format after writing numeric data to it.
- Allow **putexcel** to format a cell.

In Stata 13.1, we added the new option **keepcellformat** to **putexcel**. This option retains a cell’s format after writing numeric data to it. **keepcellformat** is useful for people who want to automate the updating of a report or paper.

To review, the basic syntax of **putexcel** is as follows:

    putexcel excel_cell=(expression) … using filename [, options]

If you are working with matrices, the syntax is

    putexcel excel_cell=matrix(expression) … using filename [, options]

In the previous blog post, we exported a simple table created by the **correlate** command by using the commands below.

    . sysuse auto
    (1978 Automobile Data)

    . correlate foreign mpg
    (obs=74)

                 |  foreign      mpg
    -------------+------------------
         foreign |   1.0000
             mpg |   0.3934   1.0000

    . putexcel A1=matrix(r(C), names) using corr

These commands created the file **corr.xlsx**, which contained the table below in the first worksheet.

As you can see, this table is not formatted. So, I formatted the table by hand in Excel so that the correlations were rounded to two digits and the column and row headers were bold with a blue background.

**putexcel**’s default behavior is to remove the formatting of cells. Thus, if we change the correlated variables from **foreign** and **mpg** to **foreign** and **weight** using the commands below, the new correlations shown in Excel will revert to the default format:

    . sysuse auto, clear
    (1978 Automobile Data)

    . correlate foreign weight
    (obs=74)

                 |  foreign   weight
    -------------+------------------
         foreign |   1.0000
          weight |  -0.5928   1.0000

    . putexcel A1=matrix(r(C), names) using corr, modify

As of Stata 13.1, you can now use the **keepcellformat** option to preserve a numeric cell’s format when writing to it. For example, the command

. putexcel A1=matrix(r(C), names) using corr, modify keepcellformat

will produce

Let’s look at a real-world problem and really see how the **keepcellformat** option can help us. Suppose we need to export the following **tabulate** table to a report we wrote in Word.

    . webuse auto2, clear
    (1978 Automobile Data)

    . label variable rep78 "Repair Record"

    . tabulate rep78

         Repair |
         Record |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           Poor |          2        2.90        2.90
           Fair |          8       11.59       14.49
        Average |         30       43.48       57.97
           Good |         18       26.09       84.06
      Excellent |         11       15.94      100.00
    ------------+-----------------------------------
          Total |         69      100.00

In the previous **putexcel** blog post, I mentioned my user-written command **tab2xl**, which exports a one-way tabulation to an Excel file. I have since updated the command so that it uses the new **keepcellformat** option to preserve cell formatting. You can download the updated **tab2xl** command by typing the following:

. net install http://www.stata.com/users/kcrow/tab2xl, replace

Using this command, I can now export my **tabulate** table to Excel by typing

. tab2xl rep78 using tables, row(1) col(1)

Once the table is in Excel, I format it by hand so that it looks like this:

I then link this Excel table to a Word document. When you link an Excel table to a Word document, it

- preserves the formatting of the table and
- automatically updates the Word document when you update the Excel table.

It is fairly easy to link an Excel table to a Word document or PowerPoint presentation. In Excel/Word 2010, you would do as follows:

- Highlight the table/data in Excel.
- On the Home tab, click on the Copy button.
- Open the Word document and scroll to where you want the table pasted.
- On the Home tab of Word, click on the Paste button.
- Select **Link & Keep Source Formatting** from the Paste icon menu.

My report now looks like this:

With the Excel table linked into Word, any time we update our Excel table using **putexcel**, we also update our table in Word.

Suppose that after a few weeks, we get more repair record data. We now need to update our report, and our new **tabulate** table looks like this:

    . tabulate rep78

         Repair |
         Record |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           Poor |          4        2.90        2.90
           Fair |          8        5.80        8.70
        Average |         60       43.48       52.17
           Good |         44       31.88       84.06
      Excellent |         22       15.94      100.00
    ------------+-----------------------------------
          Total |        138      100.00

To update the report, we simply need to reissue the **tab2xl** command, which writes the table with **putexcel**, after **tabulate**.

    . tabulate rep78
    . tab2xl rep78 using tables, row(1) col(1)

The linked Word report will automatically reflect the changes:

The new command

The ordinal probit model is used to model ordinal dependent variables. In the usual parameterization, we assume that there is an underlying linear regression, which relates an unobserved continuous variable \(y^*\) to the covariates \(x\).

\[y^*_{i} = x_{i}\gamma + u_i\]

The observed dependent variable \(y\) relates to \(y^*\) through a series of cut-points \(-\infty =\kappa_0<\kappa_1<\dots< \kappa_m=+\infty\) , as follows:

\[y_{i} = j {\mbox{ if }} \kappa_{j-1} < y^*_{i} \leq \kappa_j\]

Because the variance of \(u_i\) can’t be identified from the observed data, it is assumed to be equal to one. However, we can consider a re-scaled parameterization for the same model; a straightforward way of seeing this is by noting that, for any positive number \(M\):

\[\kappa_{j-1} < y^*_{i} \leq \kappa_j \iff M\kappa_{j-1} < M y^*_{i} \leq M\kappa_j\]

that is,

\[\kappa_{j-1} < x_i\gamma + u_i \leq \kappa_j \iff M\kappa_{j-1}< x_i(M\gamma) + Mu_i \leq M\kappa_j\]

In other words, if the model is identified, it can also be represented by multiplying the unobserved variable \(y^*\) by a positive number; the standard deviation of the residual component, the coefficients, and the cut-points are then all multiplied by that same number.

Let me show you an example; I will first fit a standard ordinal probit model, both with **oprobit** and with **gsem**. Then, I will use **gsem** to fit an ordinal probit model where the residual term for the underlying linear regression has a standard deviation equal to 2. I will do this by introducing a latent variable \(L\), with variance 1, and coefficient \(\sqrt 3\). This will be added to the underlying latent residual, with variance 1; then, the ‘new’ residual term will have variance equal to \(1+((\sqrt 3)^2\times Var(L))= 4\), so the standard deviation will be 2. We will see that as a result, the coefficients, as well as the cut-points, will be multiplied by 2.

    . sysuse auto, clear
    (1978 Automobile Data)

    . oprobit rep mpg disp , nolog

    Ordered probit regression                         Number of obs   =         69
                                                      LR chi2(2)      =      14.68
                                                      Prob > chi2     =     0.0006
    Log likelihood = -86.352646                       Pseudo R2       =     0.0783

    ------------------------------------------------------------------------------
           rep78 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             mpg |   .0497185   .0355452     1.40   0.162    -.0199487    .1193858
    displacement |  -.0029884   .0021498    -1.39   0.165     -.007202    .0012252
    -------------+----------------------------------------------------------------
           /cut1 |  -1.570496   1.146391                      -3.81738    .6763888
           /cut2 |  -.7295982   1.122361                     -2.929386     1.47019
           /cut3 |   .6580529   1.107838                     -1.513269    2.829375
           /cut4 |    1.60884   1.117905                     -.5822132    3.799892
    ------------------------------------------------------------------------------

    . gsem (rep <- mpg disp, oprobit), nolog

    Generalized structural equation model             Number of obs   =         69
    Log likelihood = -86.352646

    --------------------------------------------------------------------------------
                   |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ---------------+----------------------------------------------------------------
    rep78 <-       |
               mpg |   .0497185   .0355452     1.40   0.162    -.0199487    .1193858
      displacement |  -.0029884   .0021498    -1.39   0.165     -.007202    .0012252
    ---------------+----------------------------------------------------------------
    rep78          |
             /cut1 |  -1.570496   1.146391    -1.37   0.171     -3.81738    .6763888
             /cut2 |  -.7295982   1.122361    -0.65   0.516    -2.929386     1.47019
             /cut3 |   .6580529   1.107838     0.59   0.553    -1.513269    2.829375
             /cut4 |    1.60884   1.117905     1.44   0.150    -.5822132    3.799892
    --------------------------------------------------------------------------------

    . local a = sqrt(3)

    . gsem (rep <- mpg disp L@`a'), oprobit var(L@1) nolog

    Generalized structural equation model             Number of obs   =         69
    Log likelihood = -86.353008

     ( 1)  [rep78]L = 1.732051
     ( 2)  [var(L)]_cons = 1
    --------------------------------------------------------------------------------
                   |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ---------------+----------------------------------------------------------------
    rep78 <-       |
               mpg |    .099532     .07113     1.40   0.162    -.0398802    .2389442
      displacement |  -.0059739   .0043002    -1.39   0.165    -.0144022    .0024544
                 L |   1.732051  (constrained)
    ---------------+----------------------------------------------------------------
    rep78          |
             /cut1 |  -3.138491   2.293613    -1.37   0.171     -7.63389    1.356907
             /cut2 |  -1.456712   2.245565    -0.65   0.517    -5.857938    2.944513
             /cut3 |   1.318568    2.21653     0.59   0.552     -3.02575    5.662887
             /cut4 |   3.220004   2.236599     1.44   0.150     -1.16365    7.603657
    ---------------+----------------------------------------------------------------
            var(L) |          1  (constrained)
    --------------------------------------------------------------------------------
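As a quick check of the factor-of-2 claim, we can divide a few of the re-scaled estimates by their counterparts from the standard fit. The numbers below are copied by hand from the two sets of output; the ratios come out close to 2 rather than exactly 2 because the model with the latent variable is fitted numerically.

```stata
* Ratios of re-scaled estimates (third model) to original estimates (first model),
* with the values typed in from the output above
display "mpg:          " .099532/.0497185
display "displacement: " (-.0059739)/(-.0029884)
display "cut1:         " (-3.138491)/(-1.570496)
display "cut4:         " 3.220004/1.60884
```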

This model is defined analogously to the model fitted by **ivprobit** for probit models with endogenous covariates; we assume an underlying model with two equations,

\[
\begin{eqnarray}
y^*_{1i} =& y_{2i} \beta + x_{1i} \gamma + u_i & \\
y_{2i} =& x_{1i} \pi_1 + x_{2i} \pi_2 + v_i & \,\,\,\,\,\, (1)
\end{eqnarray}
\]

where \(u_i \sim N(0, 1) \), \(v_i\sim N(0,s^2) \), and \(corr(u_i, v_i) = \rho\).

We don’t observe \(y^*_{1i}\); instead, we observe a discrete variable \(y_{1i}\), such that, for a set of cut-points (to be estimated) \(\kappa_0 = -\infty < \kappa_1 < \kappa_2 \dots < \kappa_m = +\infty \),

\[y_{1i} = j {\mbox{ if }} \kappa_{j-1} < y^*_{1i} \leq \kappa_j \]

I will re-scale the first equation, preserving the correlation. That is, I will consider the following system:

\[
\begin{eqnarray}
z^*_{1i} =& y_{2i}b +x_{1i}c + t_i + \alpha L_i &\\
y_{2i} = &x_{1i}\pi_1 + x_{2i}\pi_2 + w_i + \alpha L_i & \,\,\,\,\,\, (2)
\end{eqnarray}
\]

where \(t_i, w_i, L_i\) are independent, \(t_i \sim N(0, 1)\), \(w_i \sim N(0,\sigma^2)\), \(L_i \sim N(0, 1)\), and the observed outcome is defined through re-scaled cut-points \(\lambda_j\):

\[y_{1i} = j {\mbox{ if }} \lambda_{j-1} < z^*_{1i} \leq \lambda_j \]

By introducing a latent variable in both equations, I am modeling a correlation between the error terms. The first equation is a re-scaled version of the original equation, that is, \(z^*_1 = My^*_1\),

\[ y_{2i}b +x_{1i}c + t_i + \alpha L_i = M(y_{2i}\beta) +M x_{1i}\gamma + M u_i \]

This implies that

\[M u_i = t_i + \alpha L_i, \]

where \(Var(u_i) = 1\) and \(Var(t_i + \alpha L_i) = 1 + \alpha^2\), so the scale is \(M = \sqrt{1+\alpha^2} \).

The second equation remains the same, we just express \(v_i\) as \(w_i + \alpha L_i\). Now, after estimating the system (2), we can recover the parameters in (1) as follows:

\[\beta = \frac{1}{\sqrt{1+ \alpha^2}} b\]

\[\gamma = \frac{1}{\sqrt{1+ \alpha^2}} c\]

\[\kappa_j = \frac{1}{\sqrt{1+ \alpha^2}} \lambda_j \]

\[V(v_i) = V(w_i + \alpha L_i) =V(w_i) + \alpha^2\]

\[\rho = Corr(t_i + \alpha L_i, w_i + \alpha L_i) = \frac{\alpha^2}{\sqrt{1+\alpha^2}\,\sqrt{V(w_i)+\alpha^2}}\]

Note: This parameterization assumes that the correlation is positive; for negative values of the correlation, \(L\) should be included in the second equation with a negative sign (that is, L@(-a) instead of L@a). When trying to perform the estimation with the wrong sign, the model most likely won’t achieve convergence. Otherwise, you will see a coefficient for L that is virtually zero. In Stata 13.1 we have included features that allow you to fit the model without this restriction. However, this time we will use the older parameterization, which will allow you to visualize the different components more easily.

    clear
    set seed 1357
    set obs 10000
    forvalues i = 1(1)5 {
        gen x`i' =2* rnormal() + _n/1000
    }

    mat C = [1,.5 \ .5, 1]
    drawnorm z1 z2, cov(C)

    gen y2 = 0
    forvalues i = 1(1)5 {
        replace y2 = y2 + x`i'
    }
    replace y2 = y2 + z2

    gen y1star = y2 + x1 + x2 + z1
    gen xb1 = y2 + x1 + x2

    gen y1 = 4
    replace y1 = 3 if xb1 + z1 <=.8
    replace y1 = 2 if xb1 + z1 <=.3
    replace y1 = 1 if xb1 + z1 <=-.3
    replace y1 = 0 if xb1 + z1 <=-.8

    gsem (y1 <- y2 x1 x2 L@a, oprobit) (y2 <- x1 x2 x3 x4 x5 L@a), var(L@1)

    local y1 y1
    local y2 y2

    local xaux x1 x2 x3 x4 x5
    local xmain y2 x1 x2

    local s2 sqrt(1+_b[`y1':L]^2)
    foreach v in `xmain' {
        local trans `trans' (`y1'_`v': _b[`y1':`v']/`s2')
    }

    foreach v in `xaux' _cons {
        local trans `trans' (`y2'_`v': _b[`y2':`v'])
    }

    qui tab `y1' if e(sample)
    local ncuts = r(r)-1
    forvalues i = 1(1) `ncuts' {
        local trans `trans' (cut_`i': _b[`y1'_cut`i':_cons]/`s2')
    }

    local s1 sqrt( _b[var(e.`y2'):_cons] +_b[`y1':L]^2)

    local trans `trans' (sig_2: `s1')
    local trans `trans' (rho_12: _b[`y1':L]^2/(`s1'*`s2'))
    nlcom `trans'

This is the output from **gsem**:

    Generalized structural equation model             Number of obs   =      10000
    Log likelihood = -14451.117

     ( 1)  [y1]L - [y2]L = 0
     ( 2)  [var(L)]_cons = 1
    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    y1 <-        |
              y2 |   1.379511   .0775028    17.80   0.000     1.227608    1.531414
              x1 |   1.355687   .0851558    15.92   0.000     1.188785    1.522589
              x2 |   1.346323   .0833242    16.16   0.000      1.18301    1.509635
               L |   .7786594   .0479403    16.24   0.000     .6846982    .8726206
    -------------+----------------------------------------------------------------
    y2 <-        |
              x1 |   .9901353   .0044941   220.32   0.000      .981327    .9989435
              x2 |   1.006836   .0044795   224.76   0.000      .998056    1.015615
              x3 |   1.004249   .0044657   224.88   0.000     .9954963    1.013002
              x4 |   .9976541   .0044783   222.77   0.000     .9888767    1.006431
              x5 |   .9987587   .0044736   223.26   0.000     .9899907    1.007527
               L |   .7786594   .0479403    16.24   0.000     .6846982    .8726206
           _cons |   .0002758   .0192417     0.01   0.989    -.0374372    .0379887
    -------------+----------------------------------------------------------------
    y1           |
           /cut1 |  -1.131155   .1157771    -9.77   0.000    -1.358074   -.9042358
           /cut2 |  -.5330973   .1079414    -4.94   0.000    -.7446585    -.321536
           /cut3 |   .2722794   .1061315     2.57   0.010     .0642654    .4802933
           /cut4 |     .89394   .1123013     7.96   0.000     .6738334    1.114047
    -------------+----------------------------------------------------------------
          var(L) |          1  (constrained)
    -------------+----------------------------------------------------------------
       var(e.y2) |   .3823751    .074215                      .2613848    .5593696
    ------------------------------------------------------------------------------

These are the results we obtain when we transform the values reported by **gsem** to the original parameterization:

    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           y1_y2 |   1.088455   .0608501    17.89   0.000     .9691909    1.207719
           y1_x1 |   1.069657   .0642069    16.66   0.000      .943814    1.195501
           y1_x2 |   1.062269   .0619939    17.14   0.000      .940763    1.183774
           y2_x1 |   .9901353   .0044941   220.32   0.000      .981327    .9989435
           y2_x2 |   1.006836   .0044795   224.76   0.000      .998056    1.015615
           y2_x3 |   1.004249   .0044657   224.88   0.000     .9954963    1.013002
           y2_x4 |   .9976541   .0044783   222.77   0.000     .9888767    1.006431
           y2_x5 |   .9987587   .0044736   223.26   0.000     .9899907    1.007527
        y2__cons |   .0002758   .0192417     0.01   0.989    -.0374372    .0379887
           cut_1 |   -.892498   .0895971    -9.96   0.000    -1.068105   -.7168909
           cut_2 |  -.4206217   .0841852    -5.00   0.000    -.5856218   -.2556217
           cut_3 |   .2148325   .0843737     2.55   0.011     .0494632    .3802018
           cut_4 |    .705332   .0905974     7.79   0.000     .5277644    .8828997
           sig_2 |   .9943267    .007031   141.42   0.000     .9805462    1.008107
          rho_12 |   .4811176   .0477552    10.07   0.000     .3875191     .574716
    ------------------------------------------------------------------------------
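A few of these transformed values can be reproduced by hand from the **gsem** output, which may help in seeing how the formulas for \(M\), \(\beta\), \(V(v_i)\), and \(\rho\) are applied. The sketch below types in three reported estimates; the local macro names are mine and are not part of the code above.

```stata
* Estimates copied by hand from the gsem output above
* alpha: coefficient on L; b: coefficient on y2 in the y1 equation; var_w: var(e.y2)
local alpha = .7786594
local b     = 1.379511
local var_w = .3823751

* Scale factor M = sqrt(1 + alpha^2)
local M = sqrt(1 + `alpha'^2)

* Back-transform to the original parameterization
display "beta   = " `b'/`M'
display "sig_2  = " sqrt(`var_w' + `alpha'^2)
display "rho_12 = " `alpha'^2/(`M'*sqrt(`var_w' + `alpha'^2))
```

The three displayed values should agree, up to rounding, with the y1_y2, sig_2, and rho_12 rows reported by **nlcom**. Unlike **nlcom**, this hand calculation gives no standard errors.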

The estimates are quite close to the values used for the simulation. If you try to perform the estimation with the wrong sign for the coefficient for L, you will get a number that is virtually zero (if you get convergence at all). In this case, the evaluator is telling us that the best value it can find, given the restrictions we have imposed, is zero. If you see such results, you may want to try the opposite sign. If both give a zero coefficient, it means that this is the solution, and there is no endogeneity at all. If one of them is not zero, it means that the non-zero value is the solution. As stated before, in Stata 13.1, the model can be fitted without this restriction.

A **stored result** is simply a scalar, macro, or matrix stored in memory after you run a Stata command. The two main types of stored results are **e-class** (for estimation commands) and **r-class** (for general commands). You can list a command’s stored results after it has been run by typing **ereturn list** (for estimation commands) and **return list** (for general commands). Let’s try a simple example by loading the auto dataset and running **correlate** on the variables **foreign** and **mpg**.

    . sysuse auto
    (1978 Automobile Data)

    . correlate foreign mpg
    (obs=74)

                 |  foreign      mpg
    -------------+------------------
         foreign |   1.0000
             mpg |   0.3934   1.0000

Because **correlate** is not an estimation command, use the **return list** command to see its **stored results**.

    . return list

    scalars:
                      r(N) =  74
                    r(rho) =  .3933974152205484

    matrices:
                      r(C) :  2 x 2

Now we can use **putexcel** to export these results to Excel. The basic syntax of **putexcel** is

    putexcel excel_cell=(expression) … using filename [, options]

If you are working with matrices, the syntax is

    putexcel excel_cell=matrix(expression) … using filename [, options]

It is easy to build the above syntax in the **putexcel** dialog. There is a helpful video on YouTube about the dialog here. Let’s list the matrix **r(C)** to see what it contains.

    . matrix list r(C)

    symmetric r(C)[2,2]
               foreign        mpg
    foreign          1
        mpg  .39339742          1

To re-create the table in Excel, we need to export the matrix **r(C)** with the matrix row and column names. The command to type in your Stata Command window is

putexcel A1=matrix(r(C), names) using corr

Note that to export the matrix row and column names, we used the **names** option after we specified the matrix **r(C)**. When I open the file corr.xlsx in Excel, the table below is displayed.

Next let’s try a more involved example. Load the auto dataset, and run a tabulation on the variable **foreign**. Because **tabulate** is not an estimation command, use the **return list** command to see its **stored results**.

    . sysuse auto
    (1978 Automobile Data)

    . tabulate foreign

       Car type |      Freq.     Percent        Cum.
    ------------+-----------------------------------
       Domestic |         52       70.27       70.27
        Foreign |         22       29.73      100.00
    ------------+-----------------------------------
          Total |         74      100.00

    . return list

    scalars:
                      r(N) =  74
                      r(r) =  2

**tabulate** is different from most commands in Stata in that it does not automatically save all the results we need into the **stored results** (we will use scalar **r(N)**). We need to use the **matcell()** and **matrow()** options of **tabulate** to save the results produced by the command into two Stata matrices.

    . tabulate foreign, matcell(freq) matrow(names)

       Car type |      Freq.     Percent        Cum.
    ------------+-----------------------------------
       Domestic |         52       70.27       70.27
        Foreign |         22       29.73      100.00
    ------------+-----------------------------------
          Total |         74      100.00

    . matrix list freq

    freq[2,1]
        c1
    r1  52
    r2  22

    . matrix list names

    names[2,1]
        c1
    r1   0
    r2   1

The **putexcel** commands used to create a basic tabulation table in Excel, starting at column 1, row 1 (cell A1), are

    putexcel A1=("Car type") B1=("Freq.") C1=("Percent") using results, replace
    putexcel A2=matrix(names) B2=matrix(freq) C2=matrix(freq/r(N)) using results, modify

Below is the table produced in Excel by these commands.

Again, this is a basic tabulation table. You probably noticed that we did not have the **Cum.** column or the **Total** row in the exported table. Also, our **Car type** column contains the numeric values (0, 1), not the value labels (Domestic, Foreign) of the variable **foreign**, and our **Percent** column is not formatted correctly. To get the exact table displayed in the Results window into an Excel file takes a little programming. With a few functions and a **forvalues** loop, we can easily export any table produced by running the **tabulate** command on a numeric variable.
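The arithmetic behind the missing **Percent** and **Cum.** columns is small enough to sketch outside Stata. The Python snippet below is only an illustration of the calculation, using the frequencies reported by **tabulate foreign** above; the function name `tab_rows` is hypothetical, and the rounding to two decimals mirrors the %9.2f display format.

```python
# Frequencies reported by `tabulate foreign` above (Domestic = 52, Foreign = 22).
labels = ["Domestic", "Foreign"]
freqs = [52, 22]


def tab_rows(labels, freqs):
    """Return (label, freq, percent, cum_percent) rows, rounded to 2 decimals."""
    total = sum(freqs)
    rows = []
    cum = 0.0
    for label, freq in zip(labels, freqs):
        percent = round(freq / total * 100, 2)
        # Accumulate the rounded percentages so the column ends at 100.00.
        cum = round(cum + percent, 2)
        rows.append((label, freq, percent, cum))
    return rows


rows = tab_rows(labels, freqs)
for label, freq, percent, cum in rows:
    print(f"{label:<10}{freq:>6}{percent:>10.2f}{cum:>10.2f}")
print(f"{'Total':<10}{sum(freqs):>6}{100:>10.2f}")
```

Each row needs the label, the frequency, the percentage, and a running total, which is why a loop over the rows of the returned matrices is the natural structure.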

There are two extended macro functions, **label** and **display**, that can help us. The **label** function can extract the value labels for each variable, and the **display** function can correctly format numbers for our numeric columns. Last, we use **forvalues** to loop over the rows of the returned matrices to produce our final tables. Our do-file to produce the **tabulate** table in Excel looks like

    sysuse auto
    tabulate foreign, matcell(freq) matrow(names)

    putexcel A1=("Car type") B1=("Freq.") C1=("Percent") D1=("Cum.") using results, replace

    local rows = rowsof(names)
    local row = 2
    local cum_percent = 0

    forvalues i = 1/`rows' {

            local val = names[`i',1]
            local val_lab : label (foreign) `val'

            local freq_val = freq[`i',1]

            local percent_val = `freq_val'/`r(N)'*100
            local percent_val : display %9.2f `percent_val'

            local cum_percent : display %9.2f (`cum_percent' + `percent_val')

            putexcel A`row'=("`val_lab'") B`row'=(`freq_val') C`row'=(`percent_val') ///
                    D`row'=(`cum_percent') using results, modify
            local row = `row' + 1
    }

    putexcel A`row'=("Total") B`row'=(r(N)) C`row'=(100.00) using results, modify

The above commands produce this table in Excel:

The solution above works well for this one table, but what if we need to export the tabulation tables for 100 variables to the same Excel spreadsheet? It would be very tedious to run the same do-file 100 times, each time changing the cell and row numbers. Instead, we could turn our do-file into a Stata command (ado-file) called **tab2xl**. The syntax for our new command could be

    tab2xl varname using filename, row(rownumber) col(colnumber) [replace sheet(name)]

The pseudocode of our program (file tab2xl.ado) looks like

    program tab2xl
      /* parse command syntax */

      /* tabulate varname */

      /* get column letters based on starting column number passed in */

      /* write header row to filename in starting row number passed in */

      /* loop over rows of returned matrix and calculate/write values to filename */

      /* write total row to filename */
    end
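The step "get column letters based on starting column number" is the only part of the pseudocode that is not plain bookkeeping. As a sketch of the idea (in Python, and not necessarily how **tab2xl** itself does it), a 1-based column number can be converted to Excel letters by treating it as a base-26 number without a zero digit. The function name `column_letters` is hypothetical.

```python
def column_letters(col: int) -> str:
    """Convert a 1-based column number to Excel letters (1 -> A, 27 -> AA)."""
    if col < 1:
        raise ValueError("column numbers start at 1")
    letters = ""
    while col > 0:
        # Subtract 1 first because Excel columns have no zero digit:
        # Z is 26, and the next column is AA rather than BA.
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


# A table starting in column 3 with four columns occupies C, D, E, and F.
start = 3
print([column_letters(start + offset) for offset in range(4)])
```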

If you would like to download a working version of our **tab2xl** command, type

net install http://www.stata.com/users/kcrow/tab2xl

in Stata.

Many researchers in psychology and education advocate reporting of effect sizes, professional organizations such as the American Psychological Association (APA) and the American Educational Research Association (AERA) strongly recommend their reporting, and professional journals such as the *Journal of Experimental Psychology: Applied* and *Educational and Psychological Measurement* require that they be reported.

Anyway, today I want to show you

- What effect sizes are.
- How to calculate effect sizes and their confidence intervals in Stata.
- How to calculate bootstrap confidence intervals for those effect sizes.
- How to use Stata’s effect-size calculator.

The importance of research results is often assessed by statistical significance, usually that the p-value is less than 0.05. P-values and statistical significance, however, don’t tell us anything about practical significance.

What if I told you that I had developed a new weight-loss pill and that the difference between the average weight loss for people who took the pill and those who took a placebo was statistically significant? Would you buy my new pill? If you were overweight, you might reply, “Of course! I’ll take two bottles and a large order of french fries to go!”. Now let me add that the average difference in weight loss was only one pound over the year. Still interested? My results may be statistically significant but they are not practically significant.

Or what if I told you that the difference in weight loss was not statistically significant — the p-value was “only” 0.06 — but the average difference over the year was 20 pounds? You might very well be interested in that pill.

The size of the effect tells us about the practical significance. P-values do not assess practical significance.

All of which is to say, one should report parameter estimates along with statistical significance.

In my examples above, you knew that 1 pound over the year is small and 20 pounds is large because you are familiar with human weights.

In another context, 1 pound might be large, and in yet another, 20 pounds small.

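Formal measures of effect sizes are thus usually presented in unit-free but easy-to-interpret form, such as standardized differences and proportions of variability explained.

Effect sizes that measure the scaled difference between means belong to the “d” family. The generic formula is

$$\Delta = \frac{\bar{x}_1 - \bar{x}_2}{\sigma}$$

where the numerator is the difference between the two group means and sigma is a standard deviation.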

The estimators differ in terms of how sigma is calculated.

Cohen’s d, for instance, uses the pooled sample standard deviation.

Hedges’s g incorporates an adjustment which removes the bias of Cohen’s d.

Glass’s Δ was originally developed in the context of experiments and uses the “control group” standard deviation in the denominator. It has subsequently been generalized to nonexperimental studies. Because there is no control group in observational studies, Kline (2013) recommends reporting Glass’s Δ using the standard deviation for each group. Glass’s Delta_1 uses one group’s standard deviation and Delta_2 uses the other group’s.
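To make the definitions concrete, here is a minimal Python sketch (not Stata code) that computes all four members of the d family from summary statistics. It uses the summary statistics that appear later in this post (means 70 and 80, standard deviations 12.5 and 7, 15 observations per group). The function name `d_family` is hypothetical, and the small-sample correction for Hedges’s g is written in its exact gamma-function form, which is an assumption on my part that happens to reproduce the values Stata reports for these inputs.

```python
import math


def d_family(n1, mean1, sd1, n2, mean2, sd2):
    """Cohen's d, Hedges's g, and Glass's Delta 1 and 2 from summary statistics."""
    diff = mean1 - mean2
    df = n1 + n2 - 2
    # Cohen's d scales the difference by the pooled standard deviation.
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    cohens_d = diff / pooled_sd
    # Assumed bias correction: Gamma(df/2) / (sqrt(df/2) * Gamma((df-1)/2)).
    correction = math.exp(math.lgamma(df / 2) - math.lgamma((df - 1) / 2))
    correction /= math.sqrt(df / 2)
    return {
        "cohens_d": cohens_d,
        "hedges_g": cohens_d * correction,
        # Glass's Delta uses a single group's standard deviation.
        "glass_delta1": diff / sd1,
        "glass_delta2": diff / sd2,
    }


effects = d_family(15, 70, 12.5, 15, 80, 7)
for name, value in effects.items():
    print(f"{name:<14}{value:>10.6f}")
```

The numerator is identical in every case; only the standard deviation in the denominator (and, for Hedges’s g, the correction factor) changes.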

Although I have given definitions to Cohen’s d, Hedges’s g, and Glass’s Δ, different authors swap the definitions around! As a result, many authors refer to all of the above as just Delta.

Be careful when using software to know which Delta you are getting. I have used Stata terminology, of course.

Anyway, the use of a standardized scale allows us to assess practical significance. Delta = 1.5 indicates that the mean of one group is 1.5 standard deviations higher than that of the other. A difference of 1.5 standard deviations is obviously large, and a difference of 0.1 standard deviations is obviously small.

The r family quantifies the ratio of the variance attributable to an effect to the total variance and is often interpreted as the “proportion of variance explained”. The generic estimator is known as eta-squared,

$$\eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{total}}}$$

η^{2} is equivalent to the R-squared statistic from linear regression.

ω^{2} is a less biased variation of η^{2} that is equivalent to the adjusted R-squared.

Both of these measures concern the entire model.

Partial η^{2} and partial ω^{2} are like partial R-squareds and concern individual terms in the model. A term might be a variable or a variable and its interaction with another variable.

Both the d and r families allow us to make an apples-to-apples comparison of variables measured on different scales. For example, an intervention could affect both systolic blood pressure and total cholesterol. Comparing the relative effect of the intervention on the two outcomes would be difficult on their original scales.

How does one compare mmHg and mg/dL? It is straightforward in terms of Cohen’s d or ω^{2} because then we are comparing standard deviation changes or proportion of variance explained.

Consider a study in which 30 school children are randomly assigned to classrooms that incorporate web-based instruction (treatment) or to standard classroom environments (control). At the end of the school year, the children are given tests to measure reading and mathematics skills. The reading test is scored on a 0-15 point scale and the mathematics test on a 0-100 point scale.

Let’s download a dataset for our fictitious example from the Stata website by typing:

    . use http://www.stata.com/videos13/data/webclass.dta

    Contains data from http://www.stata.com/videos13/data/webclass.dta
      obs:            30                          Fictitious web-based learning experiment data
     vars:             5                          5 Sep 2013 11:28
     size:           330                          (_dta has notes)
    -------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    -------------------------------------------------------------------------------
    id              byte    %9.0g                 ID Number
    treated         byte    %9.0g      treated    Treatment Group
    agegroup        byte    %9.0g      agegroup   Age Group
    reading         float   %9.0g                 Reading Score
    math            float   %9.0g                 Math Score
    -------------------------------------------------------------------------------

    . notes

    _dta:
      1.  Variable treated records 0=control, 1=treated.
      2.  Variable agegroup records 1=7 years old, 2=8 years old, 3=9 years old.

We can compute a t-statistic to test the null hypothesis that the average math scores are the same in the treatment and control groups.

    . ttest math, by(treated)

    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
     Control |      15    69.98866    3.232864    12.52083    63.05485    76.92246
     Treated |      15    79.54943    1.812756    7.020772    75.66146     83.4374
    ---------+--------------------------------------------------------------------
    combined |      30    74.76904    2.025821    11.09588    70.62577    78.91231
    ---------+--------------------------------------------------------------------
        diff |           -9.560774    3.706412               -17.15301   -1.968533
    ------------------------------------------------------------------------------
        diff = mean(Control) - mean(Treated)                          t =  -2.5795
    Ho: diff = 0                                     degrees of freedom =       28

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.0077         Pr(|T| > |t|) = 0.0154          Pr(T > t) = 0.9923

The treated students have the larger mean, yet the difference of -9.56 is reported as negative because **ttest** calculated Control minus Treated. So just remember that negative differences mean Treated > Control in this case.

The t-statistic equals -2.58 and its two-sided p-value of 0.0154 indicates that the difference between the math scores in the two groups is statistically significant.
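For readers who want to see where the t-statistic comes from, the sketch below recomputes it in Python from the group summaries in the output above. This is the standard equal-variance two-sample formula rather than Stata’s own code, and the function name `pooled_t` is hypothetical. The inputs are the displayed (rounded) means and standard deviations, so the last digits can differ slightly from Stata’s.

```python
import math


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Equal-variance two-sample t test from group summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    # Standard error of the difference between the two means.
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff, se, diff / se, df


# Control first, then Treated, matching diff = mean(Control) - mean(Treated).
diff, se, t, df = pooled_t(15, 69.98866, 12.52083, 15, 79.54943, 7.020772)
print(f"diff = {diff:.4f}, se = {se:.4f}, t = {t:.4f}, df = {df}")
```

Notice that the raw difference and its standard error are both in test-score points; the effect sizes discussed next rescale that same difference into standard-deviation units.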

Next, let’s calculate effect sizes from the d family:

    . esize twosample math, by(treated) cohensd hedgesg glassdelta

    Effect size based on mean comparison

                                       Obs per group:
                                             Control =     15
                                             Treated =     15
    ---------------------------------------------------------
            Effect Size |   Estimate     [95% Conf. Interval]
    --------------------+------------------------------------
              Cohen's d |  -.9419085    -1.691029   -.1777553
             Hedges's g |   -.916413    -1.645256   -.1729438
        Glass's Delta 1 |  -.7635896     -1.52044    .0167094
        Glass's Delta 2 |  -1.361784    -2.218342   -.4727376
    ---------------------------------------------------------

Cohen’s d and Hedges’s g both indicate that the average math scores differ by approximately -0.93 standard deviations, with 95% confidence intervals of (-1.69, -0.18) and (-1.65, -0.17), respectively.

Since this is an experiment, we are interested in Glass’s Delta 1 because it is calculated using the control group standard deviation. Average math scores differ by -0.76 standard deviations, and the confidence interval is (-1.52, 0.02).

The confidence intervals for Cohen’s d and Hedges’s g do not include the null value of zero but the confidence interval for Glass’s Delta 1 does. Thus we cannot completely rule out the possibility that the treatment had no effect on math scores.

Next we could incorporate the age group of the children into our analysis by using a two-way ANOVA to test the null hypothesis that the mean math scores are equal for all groups.

    . anova math treated##agegroup

                             Number of obs =      30     R-squared     =  0.2671
                             Root MSE      = 10.4418     Adj R-squared =  0.1144

                      Source |  Partial SS    df       MS           F     Prob > F
            -----------------+----------------------------------------------------
                       Model |  953.697551     5   190.73951       1.75     0.1617
                             |
                     treated |  685.562956     1  685.562956       6.29     0.0193
                    agegroup |  47.7059268     2  23.8529634       0.22     0.8051
            treated#agegroup |  220.428668     2  110.214334       1.01     0.3789
                             |
                    Residual |  2616.73825    24  109.030761
            -----------------+----------------------------------------------------
                       Total |   3570.4358    29  123.118476

The F-statistic for the entire model is not statistically significant (F=1.75, ndf=5, ddf=24, p=0.1617) but the F-statistic for the main effect of treatment is statistically significant (F=6.29, ndf=1, ddf=24, p=0.0193).

We can compute the η^{2} and partial η^{2} estimates for this model using the **estat esize** command immediately after our **anova** command (note that **estat esize** works after the **regress** command too).

    . estat esize

    Effect sizes for linear models

    ---------------------------------------------------------------------
                   Source |   Eta-Squared     df     [95% Conf. Interval]
    ----------------------+----------------------------------------------
                    Model |   .2671096         5            0    .4067062
                          |
                  treated |   .2076016         1     .0039512    .4451877
                 agegroup |   .0179046         2            0    .1458161
         treated#agegroup |   .0776932         2            0     .271507
    ---------------------------------------------------------------------

The overall η^{2} indicates that our model accounts for approximately 26.7% of the variability in math scores, though the 95% confidence interval includes the null value of zero (0.0%, 40.7%). The partial η^{2} for treatment is 0.21 (21% of the variability explained) and its 95% confidence interval excludes zero (0.4%, 44.5%).

We could calculate the alternative r-family member ω^{2} rather than η^{2} by typing

    . estat esize, omega

    Effect sizes for linear models

    ---------------------------------------------------------------------
                   Source | Omega-Squared     df     [95% Conf. Interval]
    ----------------------+----------------------------------------------
                    Model |   .1144241         5            0    .2831033
                          |
                  treated |    .174585         1            0    .4220705
                 agegroup |          0         2            0    .0746342
         treated#agegroup |   .0008343         2            0    .2107992
    ---------------------------------------------------------------------

The overall ω^{2} indicates that our model accounts for approximately 11.4% of the variability in math scores, while treatment alone accounts for 17.5%. This perplexing result (a single term appearing to explain more than the whole model) stems from the way that ω^{2} and partial ω^{2} are calculated. See Pierce, Block, & Aguinis (2004) for a thorough explanation.
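The numbers in the two tables above can be reproduced from the sums of squares in the ANOVA table. The Python sketch below is an illustration, not Stata’s implementation: the function names are hypothetical, and the ω^{2} formula (η^{2} minus a degrees-of-freedom penalty, truncated at zero) is an assumed form that matches the reported values.

```python
def eta_squared(ss_effect, ss_resid):
    """(Partial) eta-squared: effect SS over effect SS plus residual SS."""
    return ss_effect / (ss_effect + ss_resid)


def omega_squared(eta2, df_effect, df_resid):
    """Assumed form: eta-squared minus a df-based penalty, truncated at 0."""
    return max(0.0, eta2 - (df_effect / df_resid) * (1 - eta2))


# Partial SS and df copied from the anova output above.
ss_resid, df_resid = 2616.73825, 24
terms = {
    "Model": (953.697551, 5),
    "treated": (685.562956, 1),
    "agegroup": (47.7059268, 2),
    "treated#agegroup": (220.428668, 2),
}

results = {}
for name, (ss, df) in terms.items():
    eta2 = eta_squared(ss, ss_resid)
    results[name] = (eta2, omega_squared(eta2, df, df_resid))
    print(f"{name:<18}{eta2:>10.7f}{results[name][1]:>10.7f}")
```

Under this assumed form, the penalty grows with the degrees of freedom a source uses. The model spends 5 degrees of freedom while treatment spends only 1, which is one way to see why the overall ω^{2} can end up smaller than the partial ω^{2} for treatment.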

Except for the η^{2} for treatment, the confidence intervals include 0, so we cannot rule out the possibility that there is no effect. Whether results are practically significant is generally a matter of context and opinion. In some situations, accounting for 5% of the variability in an outcome could be very important, and in other situations, accounting for 30% may not be.

We could repeat the same analyses for the reading scores using the following commands:

    . ttest reading, by(treated)
    . esize twosample reading, by(treated) cohensd hedgesg glassdelta
    . anova reading treated##agegroup
    . estat esize
    . estat esize, omega

None of the t- or F-statistics for reading scores were statistically significant at the 0.05 level.

Even though the reading and math scores were measured on two different scales, we can directly compare the relative effect of the treatment using effect sizes:

    Effect Size     |   Reading Score            Math Score
    ----------------+------------------------------------------------
    Cohen's d       |   -0.23 (-0.95 - 0.49)     -0.94 (-1.69 - -0.18)
    Hedges's g      |   -0.22 (-0.92 - 0.48)     -0.92 (-1.65 - -0.17)
    Glass's Delta   |   -0.21 (-0.93 - 0.51)     -0.76 (-1.52 -  0.02)
    Eta-squared     |    0.02 ( 0.00 - 0.20)      0.21 ( 0.00 -  0.44)
    Omega-squared   |    0.00 ( 0.00 - 0.17)      0.17 ( 0.00 -  0.42)

The results show that the average reading scores in the treated and control groups differ by approximately 0.22 standard deviations while the average math scores differ by approximately 0.92 standard deviations. Similarly, treatment status accounted for almost none of the variability in reading scores while it accounted for roughly 17% of the variability in math scores. The intervention clearly had a larger effect on math scores than reading scores. We also know that we cannot completely rule out an effect size of zero (no effect) for both reading and math scores because several confidence intervals included zero. Whether or not the effects are practically significant is a matter of interpretation but the effect sizes provide a standardized metric for evaluation.

Simulation studies have shown that bootstrap confidence intervals for the d family may be preferable to confidence intervals based on the noncentral t distribution when the variable of interest does not have a normal distribution (Kelley 2005; Algina, Keselman, and Penfield 2006). We can calculate bootstrap confidence intervals for Cohen’s d and Hedges’s g using Stata’s **bootstrap** prefix:

    . bootstrap r(d) r(g), reps(500) nowarn: esize twosample reading, by(treated)
    (running esize on estimation sample)

    Bootstrap replications (500)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
    ..................................................    50
    ..................................................   100
    ..................................................   150
    ..................................................   200
    ..................................................   250
    ..................................................   300
    ..................................................   350
    ..................................................   400
    ..................................................   450
    ..................................................   500

    Bootstrap results                               Number of obs      =        30
                                                    Replications       =       500

          command:  esize twosample reading, by(treated)
            _bs_1:  r(d)
            _bs_2:  r(g)

    ------------------------------------------------------------------------------
                 |   Observed   Bootstrap                         Normal-based
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           _bs_1 |   -.228966   .3905644    -0.59   0.558    -.9944582    .5365262
           _bs_2 |  -.2227684   .3799927    -0.59   0.558    -.9675403    .5220036
    ------------------------------------------------------------------------------

The bootstrap estimate of the 95% confidence interval for Cohen’s d is -0.99 to 0.54 which is slightly wider than the earlier estimate based on the non-central t distribution (see [R] esize for details). The bootstrap estimate is slightly wider for Hedges’s g as well.
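If you are curious what the **bootstrap** prefix is doing conceptually, the Python sketch below walks through the same steps: resample the rows with replacement, recompute Cohen’s d on each resample, take the standard deviation of the replicates as the standard error, and form a normal-based interval around the observed estimate. It is a sketch under stated assumptions, not a reimplementation of Stata: the scores are simulated (hypothetical) because the webclass data are not reproduced here, whole rows are resampled without stratifying by group, and the rare resamples that leave a group with fewer than two distinct values are simply redrawn.

```python
import math
import random
import statistics

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def cohens_d(control, treated):
    """Cohen's d as Control minus Treated over the pooled standard deviation."""
    n1, n2 = len(control), len(treated)
    pooled_var = ((n1 - 1) * statistics.variance(control)
                  + (n2 - 1) * statistics.variance(treated)) / (n1 + n2 - 2)
    return (statistics.mean(control) - statistics.mean(treated)) / math.sqrt(pooled_var)


def split(rows):
    """Separate (group, score) rows into control (0) and treated (1) scores."""
    control = [score for group, score in rows if group == 0]
    treated = [score for group, score in rows if group == 1]
    return control, treated


def bootstrap_se(rows, reps, rng):
    """Standard deviation of Cohen's d across bootstrap resamples of the rows."""
    replicates = []
    while len(replicates) < reps:
        sample = [rng.choice(rows) for _ in rows]
        control, treated = split(sample)
        # Redraw degenerate resamples where a group has no usable variance.
        if len(set(control)) < 2 or len(set(treated)) < 2:
            continue
        replicates.append(cohens_d(control, treated))
    return statistics.stdev(replicates)


def normal_ci(estimate, se):
    """Normal-based 95% confidence interval."""
    return estimate - Z_975 * se, estimate + Z_975 * se


rng = random.Random(2013)
# Hypothetical data: 15 control and 15 treated scores, NOT the webclass data.
rows = [(0, rng.gauss(10.0, 2.0)) for _ in range(15)]
rows += [(1, rng.gauss(10.5, 2.0)) for _ in range(15)]

observed = cohens_d(*split(rows))
se = bootstrap_se(rows, reps=500, rng=rng)
low, high = normal_ci(observed, se)
print(f"d = {observed:.3f}, bootstrap SE = {se:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

The last step can be checked against the Stata output above: plugging the observed coefficient of -.228966 and bootstrap standard error of .3905644 into `normal_ci` gives the reported interval of -.9944582 to .5365262.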

You can also use Stata’s effect-size calculator to estimate effect sizes from summary statistics. If we know that the mean, standard deviation, and sample size are 70, 12.5, and 15 for one group and 80, 7, and 15 for another, we can use **esizei** to estimate effect sizes from the d family (note that **esizei** takes the values for each group in the order sample size, mean, standard deviation):

    . esizei 15 70 12.5 15 80 7, cohensd hedgesg glassdelta

    Effect size based on mean comparison

                                       Obs per group:
                                             Group 1 =     15
                                             Group 2 =     15
    ---------------------------------------------------------
            Effect Size |   Estimate     [95% Conf. Interval]
    --------------------+------------------------------------
              Cohen's d |  -.9871279    -1.739873   -.2187839
             Hedges's g |  -.9604084    -1.692779   -.2128619
        Glass's Delta 1 |        -.8    -1.561417   -.0143276
        Glass's Delta 2 |  -1.428571    -2.299112   -.5250285
    ---------------------------------------------------------

We can estimate effect sizes from the r family using **esizei** with slightly different syntax. For example, if we know the numerator and denominator degrees of freedom along with the F statistic, we can calculate η^{2} and ω^{2} using the following command:

    . esizei 1 28 6.65

    Effect sizes for linear models

    ---------------------------------------------------------
            Effect Size |   Estimate     [95% Conf. Interval]
    --------------------+------------------------------------
            Eta-Squared |   .1919192     .0065357    .4167874
          Omega-Squared |   .1630592            0    .3959584
    ---------------------------------------------------------
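The point estimates in this table follow directly from the F statistic and its degrees of freedom. As a worked check (the ω^{2} expression is an assumed form that reproduces the reported value, not a formula quoted from the Stata documentation), with numerator degrees of freedom 1, denominator degrees of freedom 28, and F = 6.65:

```latex
\eta^2 = \frac{F \cdot df_{\text{num}}}{F \cdot df_{\text{num}} + df_{\text{den}}}
       = \frac{6.65 \times 1}{6.65 \times 1 + 28}
       = \frac{6.65}{34.65} \approx 0.1919

\omega^2 = \eta^2 - \frac{df_{\text{num}}}{df_{\text{den}}}\,(1 - \eta^2)
         \approx 0.1919 - \frac{1}{28}\,(0.8081) \approx 0.1631
```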

Stata has dialog boxes that can assist you in calculating effect sizes. If you would like a brief introduction using the GUI, you can watch a demonstration on Stata’s YouTube Channel:

Most older papers and many current papers do not report effect sizes. Nowadays, the general consensus among behavioral scientists, their professional organizations, and their journals is that effect sizes should always be reported in addition to tests of statistical significance. Stata 13 now makes it easy to compute the most popular effect sizes.

Some methodologists believe that effect sizes with confidence intervals should always be reported and that statistical hypothesis tests should be abandoned altogether; see Cumming (2012) and Kline (2013). While this may sound like a radical notion, other fields such as epidemiology have been moving in this direction since the 1990s. Cumming and Kline offer compelling arguments for this paradigm shift as well as excellent introductions to effect sizes.

Algina, J., H. J. Keselman, and R. D. Penfield. (2006). Confidence interval coverage for Cohen’s effect size statistic. Educational and Psychological Measurement, 66(6): 945–960.

American Psychological Association. (2009). Publication Manual of the American Psychological Association. 6th ed. Washington, DC: American Psychological Association.

Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Taylor & Francis.

Kelley, K. (2005). The effects of nonnormal distributions on confidence intervals around the standardized mean difference: Bootstrap and parametric confidence intervals. Educational and Psychological Measurement, 65: 51–69.

Kirk, R. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56: 746–759.

Kline, R. B. (2013). Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. 2nd ed. Washington, DC: American Psychological Association.

Pierce, C. A., R. A. Block, and H. Aguinis. (2004). Cautionary note on reporting eta-squared values from multifactor ANOVA designs. Educational and Psychological Measurement, 64(6): 916–924.

Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25(2): 26–30.

Wilkinson, L., and APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54: 594–604.

Well, we sure haven’t made that sound exciting when, in fact, Stata 13 is a big (we mean really BIG) release, and we really do want to tell you about it.

Rather than summarizing, however, we’ll send you to the website, which in addition to the standard marketing materials, has technical sheets, demonstrations, and even videos of the new features.

And all 11,000 pages of the manuals are now online.

I could tell you about re-recording the original 24 videos with a larger font to make them easier to read. I could tell you about the hardware and software that we use to record them, including our experiments with various condenser and dynamic microphones. I could share quotes from some of the nice messages we’ve received. But I think it would be more fun to talk about... you!

YouTube collects data about the number of views each video receives as well as summary data about who, what, when, where, and how you are watching them. There is no need to be concerned about your privacy; there are no personal identifiers of any kind associated with these data. But the summary data are interesting, and I thought it might be fun to share some of the data with you.

Figure 1 shows the age distribution of Stata YouTube Channel viewers. If you have ever attended a Stata Conference, you will not be surprised by this graph…until you notice the age group at the bottom. I would not have guessed that 13-17 year olds are watching our videos. Perhaps they saw Stata in the movie “Moneyball” with Brad Pitt and wanted to learn more. Or maybe they were influenced by the latest fashion craze sweeping the youth of the world.

We have posted more than 50 videos over a wide range of topics. Figure 2 shows the total number of views for the ten most popular videos. The more popular of the ten are about broad topics. These broader videos are mostly older and have thus had time to accumulate more views.

Even so, these videos receive more views per day currently than do the special topic videos that have been posted more recently. This supports my belief that Stata YouTube Channel viewers tend to be relatively new Stata users who want to learn about general topics, and that means more generic videos in the future. So you and your two post-docs will just have to read the manual if you want to learn how to fit asymmetric power ARCH models with outer-product gradient standard errors.

We usually post new videos on Tuesday mornings which might lead you to believe that the peak viewing day would also be Tuesday. Figure 3, however, shows us that the average number of views per day (vpd) is higher on Wednesdays at 420 vpd and in fact peaks on Thursdays at 430 vpd before declining Friday through Sunday.

Figure 4 also shows us that late September may not have been the best time to launch the Stata YouTube Channel. Our early momentum in September and October slowed during the November and December holiday seasons. We were, however, pleased to see that 49 of you spent New Year’s Eve watching our videos. Perhaps next year we’ll prepare something more festive just for you!

What do the Czech Republic, Pakistan, Uganda, Madagascar, the United Kingdom, the Bahamas, the United States, Montenegro, and Italy have in common? Correct! They are all countries in which you are watching our videos. They are also locations depicted in one of my favorite action films but I’ll leave that to the trivia buffs. I think the most exciting information that we found in our data is that the Stata YouTube Channel is being viewed in 164 countries!

You might not be surprised to learn that roughly half of the people watching the videos live in the United States, the United Kingdom, or Canada. The results may be unexpected when we consider the “view rate” defined as the number of views per 100,000 residents. Figure 5 shows the top 20 countries ranked by view rate for countries with at least four million residents. Denmark had the highest view rate which was nearly twice the rate of Norway which had the second highest view rate. The view rate in Denmark was more than three times the rate in the US and the UK.

You might not think that I would have anything to report about “how” you are watching the videos, but it turns out that 5.2% of you are watching on mobile devices. Perhaps this explains the 13-17 year old demographic or the 49 people watching on New Year’s Eve. Or maybe we are helping you pass the time in the dentist’s office waiting room.

Six months isn’t much of a milestone. We Stata folk will use any excuse to break out the cake and ice cream. Even so, the Stata YouTube Channel began as an experiment and often experiments do not work out as we would like. This experiment has exceeded our expectations and, as a result, we have started taking requests for videos on our Facebook page and we’ll be adding more videos every week. So thanks for watching and stay tuned!

Now if you will excuse me, I’m going to get some cake and ice cream.

Last time, we noticed that our data had two features. First, we noticed that the means within each level of the hierarchy were different from each other and we incorporated that into our data analysis by fitting a “variance component” model using Stata’s **xtmixed** command.

The second feature that we noticed is that repeated measurements of GSP showed an upward trend. We’ll pick up where we left off last time and stick to the concepts again; you can refer to the references at the end to learn more about the details.

Stata has a very friendly dialog box that can assist you in building multilevel models. If you would like a brief introduction using the GUI, you can watch a demonstration on Stata’s YouTube Channel:

Introduction to multilevel linear models in Stata, part 2: Longitudinal data

I’m often asked by beginning data analysts, “What’s the difference between longitudinal data and time-series data? Aren’t they the same thing?”

The confusion is understandable — both types of data involve some measurement of time. But the answer is no, they are not the same thing.

Univariate time series data typically arise from the collection of many data points over time from a single source, such as from a person, country, financial instrument, etc.

Longitudinal data typically arise from collecting a few observations over time from many sources, such as a few blood pressure measurements from many people.

There are some multivariate time series that blur this distinction but a rule of thumb for distinguishing between the two is that time series have more repeated observations than subjects while longitudinal data have more subjects than repeated observations.

Because our GSP data from last time involve 17 measurements from 48 states (more sources than measurements), we will treat them as longitudinal data.

GSP Data: http://www.stata-press.com/data/r12/productivity.dta

As I mentioned last time, repeated observations on a group of individuals can be conceptualized as multilevel data and modeled just as any other multilevel data. We left off last time with a variance component model for GSP (Gross State Product, logged) and noted that our model assumed a constant GSP over time while the data showed a clear upward trend.

If we consider a single observation and think about our model, nothing in the fixed or random part of the model is a function of time.

Let’s begin by adding the variable year to the fixed part of our model.
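In symbols, adding year to the fixed part replaces the constant μ with a line. The coefficient names β0 and β1 are my own notation, assumed here for illustration; the random parts are the same three deviations as in the variance component model.

```latex
y_{ijk} = \underbrace{\beta_0 + \beta_1\,\text{year}_{ijk}}_{\mu} + u_{i..} + u_{ij.} + e_{ijk}
```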

As we expected, our grand mean has become a linear regression which more accurately reflects the change over time in GSP. What might be unexpected is that each state’s and region’s mean has changed as well and now has the same slope as the regression line. This is because none of the random components of our model are a function of time. Let’s fit this model with the **xtmixed** command:

    . xtmixed gsp year, || region: || state:

    ------------------------------------------------------------------------------
             gsp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            year |   .0274903   .0005247    52.39   0.000     .0264618    .0285188
           _cons |  -43.71617   1.067718   -40.94   0.000    -45.80886   -41.62348
    ------------------------------------------------------------------------------

    ------------------------------------------------------------------------------
      Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
    region: Identity             |
                       sd(_cons) |   .6615238   .2038949      .3615664    1.210327
    -----------------------------+------------------------------------------------
    state: Identity              |
                       sd(_cons) |   .7805107   .0885788      .6248525    .9749452
    -----------------------------+------------------------------------------------
                    sd(Residual) |   .0734343   .0018737      .0698522    .0772001
    ------------------------------------------------------------------------------

The fixed part of our model now displays an estimate of the intercept (_cons = -43.7) and the slope (year = 0.027). Let’s graph the model for Region 7 and see if it fits the data better than the variance component model.

    predict GrandMean, xb
    label var GrandMean "GrandMean"
    predict RegionEffect, reffects level(region)
    predict StateEffect, reffects level(state)
    gen RegionMean = GrandMean + RegionEffect
    gen StateMean = GrandMean + RegionEffect + StateEffect
    twoway (line GrandMean year, lcolor(black) lwidth(thick)) ///
           (line RegionMean year, lcolor(blue) lwidth(medthick)) ///
           (line StateMean year, lcolor(green) connect(ascending)) ///
           (scatter gsp year, mcolor(red) msize(medsmall)) ///
           if region ==7, ///
           ytitle(log(Gross State Product), margin(medsmall)) ///
           legend(cols(4) size(small)) ///
           title("Multilevel Model of GSP for Region 7", size(medsmall))

That looks like a much better fit than our variance-components model from last time. Perhaps I should leave well enough alone, but I can’t help noticing that the slopes of the green lines for each state don’t fit as well as they could. The top green line fits nicely but the second from the top looks like it slopes upward more than is necessary. That’s the best fit we can achieve if the regression lines are forced to be parallel to each other. But what if the lines were not forced to be parallel? What if we could fit a “mini-regression model” for each state within the context of our overall multilevel model? Well, good news: we can!

By introducing the variable year to the fixed part of the model, we turned our grand mean into a regression line. Next I’d like to incorporate the variable year into the random part of the model. By introducing a fourth random component that is a function of time, I am effectively estimating a separate regression line within each state.

Notice that the size of the new, brown deviation u_{1ij.} is a function of time. If the observation were one year to the left, u_{1ij.} would be smaller and if the observation were one year to the right, u_{1ij.} would be larger.
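One way to write the random-slope model is sketched below. The symbol b_{ij} is hypothetical (it does not appear in the post) and stands for state j's own deviation from the overall slope; μ is the fixed part, which now includes the regression on time.

```latex
y_{ijk} = \mu + u_{i..} + u_{ij.} + u_{1ij.} + e_{ijk},
\qquad
u_{1ij.} = b_{ij} \times \text{time}_{ijk}
```

The slope deviation b_{ij} is fixed for a given state, so the fourth component u_{1ij.} changes only because the time value of the observation changes.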

It is common to “center” the time variable before fitting these kinds of models. Explaining why in full is for another day. The quick answer is that, at some point during the fitting of the model, Stata will have to compute the equivalent of the inverse of the square of year. For the year 1986 this turns out to be 2.535e-07. That’s a fairly small number and if we multiply it by another small number…well, you get the idea. By centering year (e.g. cyear = year - 1978), we get a more reasonable number for 1986 (1/64, or about 0.016). (Hint: If you have problems with your model converging and you have large values for time, try centering them. It won’t always help, but it might).
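The arithmetic behind that comparison can be checked directly with **display**. This is only a check of the two numbers quoted above, not a picture of what the estimator does internally.

```stata
* Inverse of the square of the raw year
display 1/(1986^2)

* Inverse of the square of the centered year (1986 - 1978 = 8)
display 1/((1986 - 1978)^2)
```

The second value is several orders of magnitude larger than the first, which is the point of centering.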

So let’s center our year variable by subtracting 1978 and fit a model that includes a random slope.

    gen cyear = year - 1978
    xtmixed gsp cyear, || region: || state: cyear, cov(indep)

I’ve color-coded the output so that we can match each part of the output back to the model and the graph. The fixed part of the model appears in the top table and it looks like any other simple linear regression model. The random part of the model is definitely more complicated. If you get lost, look back at the graphic of the deviations and remind yourself that we have simply partitioned the deviation of each observation into four components. If we did this for every observation, the standard deviations in our output are simply the average of those deviations.

Let’s look at a graph of our new “random slope” model for Region 7 and see how well it fits our data.

    * Drop predictions left over from the earlier model, if they exist
    capture drop GrandMean RegionEffect RegionMean
    predict GrandMean, xb
    label var GrandMean "GrandMean"
    predict RegionEffect, reffects level(region)
    predict StateEffect_year StateEffect_cons, reffects level(state)
    gen RegionMean = GrandMean + RegionEffect
    gen StateMean_cons = GrandMean + RegionEffect + StateEffect_cons
    gen StateMean_year = GrandMean + RegionEffect + StateEffect_cons + ///
                         (cyear*StateEffect_year)
    twoway (line GrandMean cyear, lcolor(black) lwidth(thick)) ///
           (line RegionMean cyear, lcolor(blue) lwidth(medthick)) ///
           (line StateMean_cons cyear, lcolor(green) connect(ascending)) ///
           (line StateMean_year cyear, lcolor(brown) connect(ascending)) ///
           (scatter gsp cyear, mcolor(red) msize(medsmall)) ///
           if region ==7, ///
           ytitle(log(Gross State Product), margin(medsmall)) ///
           legend(cols(3) size(small)) ///
           title("Multilevel Model of GSP for Region 7", size(medsmall))

The top brown line fits the data slightly better, but the brown line below it (second from the top) is a much better fit. Mission accomplished!

I hope I have been able to convince you that multilevel modeling is easy using Stata’s **xtmixed** command and that this is a tool that you will want to add to your kit. I would love to say something like “And that’s all there is to it. Go forth and build models!”, but I would be remiss if I didn’t point out that I have glossed over many critical topics.

In our GSP example, we would still like to consider the impact of other independent variables. I haven’t mentioned the choice of estimation method (ML or REML in the case of **xtmixed**). I’ve assessed the fit of our models by looking at graphs, an approach that is important but incomplete. We haven’t thought about hypothesis testing. Oh, and all the usual residual diagnostics for linear regression, such as checking for outliers, influential observations, heteroskedasticity and normality, still apply…times four! But now that you understand the concepts and some of the mechanics, it shouldn’t be difficult to fill in the details. If you’d like to learn more, check out the links below.
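As a starting point for some of those details, here is a minimal sketch. It assumes `cyear` has already been generated as above; the stored-estimate names `parallel` and `randslope` and the variable `resid` are names I made up for illustration. It is one possible workflow, not a complete analysis.

```stata
* Parallel-lines model (random intercepts only), fit by ML and stored
xtmixed gsp cyear || region: || state: , mle
estimates store parallel

* Random-slope model, fit by ML and stored
xtmixed gsp cyear || region: || state: cyear, cov(indep) mle
estimates store randslope

* Likelihood-ratio test of the random slope against the parallel-lines model
lrtest parallel randslope

* The same random-slope model estimated by REML instead of ML
xtmixed gsp cyear || region: || state: cyear, cov(indep) reml

* Observation-level residuals and a normal quantile plot
predict double resid, residuals
qnorm resid
```

The likelihood-ratio test compares the two ML fits because both models share the same fixed part; the residual check shown covers only the lowest level of the hierarchy.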

I hope this was helpful…thanks for stopping by.

If you’d like to learn more about modeling multilevel and longitudinal data, check out

Multilevel and Longitudinal Modeling Using Stata, Third Edition

Volume I: Continuous Responses

Volume II: Categorical Responses, Counts, and Survival

by Sophia Rabe-Hesketh and Anders Skrondal

or sign up for our popular public training course Multilevel/Mixed Models Using Stata.

]]>Stata has a lot of multilevel modeling capabilities.

I want to show you how easy it is to fit multilevel models in Stata. Along the way, we’ll unavoidably introduce some of the jargon of multilevel modeling.

I’m going to focus on concepts and ignore many of the details that would be part of a formal data analysis. I’ll give you some suggestions for learning more at the end of the post.

- The videos

Stata has a friendly dialog box that can assist you in building multilevel models. If you would like a brief introduction using the GUI, you can watch a demonstration on Stata’s YouTube Channel:

Introduction to multilevel linear models in Stata, part 1: The **xtmixed** command

- Multilevel data

Multilevel data are characterized by a hierarchical structure. A classic example is children nested within classrooms and classrooms nested within schools. The test scores of students within the same classroom may be correlated due to exposure to the same teacher or textbook. Likewise, the average test scores of classes might be correlated within a school due to the similar socioeconomic level of the students.

You may have run across datasets with these kinds of structures in your own work. For our example, I would like to use a dataset that has both longitudinal and classical hierarchical features. You can access this dataset from within Stata by typing the following command:

**use http://www.stata-press.com/data/r12/productivity.dta**

We are going to build a model of gross state product for 48 states in the USA measured annually from 1970 to 1986. The states have been grouped into nine regions based on their economic similarity. For distributional reasons, we will be modeling the logarithm of annual Gross State Product (GSP) but in the interest of readability, I will simply refer to the dependent variable as GSP.

    . describe gsp year state region

                  storage  display     value
    variable name   type   format      label      variable label
    -----------------------------------------------------------------------------
    gsp             float  %9.0g                  log(gross state product)
    year            int    %9.0g                  years 1970-1986
    state           byte   %9.0g                  states 1-48
    region          byte   %9.0g                  regions 1-9

Let’s look at a graph of these data to see what we’re working with.

    twoway (line gsp year, connect(ascending)), ///
           by(region, title("log(Gross State Product) by Region", size(medsmall)))

Each line represents the trajectory of a state’s (log) GSP over the years 1970 to 1986. The first thing I notice is that the groups of lines are different in each of the nine regions. Some groups of lines seem higher and some groups seem lower. The second thing that I notice is that the slopes of the lines are not the same. I’d like to incorporate those attributes of the data into my model.

- Components of variance

Let’s tackle the vertical differences in the groups of lines first. If we think about the hierarchical structure of these data, I have repeated observations nested within states which are in turn nested within regions. I used color to keep track of the data hierarchy.

We could compute the mean GSP within each state and note that the observations within each state vary about their state mean.

Likewise, we could compute the mean GSP within each region and note that the state means vary about their regional mean.

We could also compute a grand mean and note that the regional means vary about the grand mean.
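The three sets of means just described can be computed by hand with **egen**. This is a rough sketch with variable names of my own choosing (`state_mean`, `region_mean`, `grand_mean`, `one_per_state`); these raw means are for building intuition and need not equal the model-based predictions shown later.

```stata
* Raw mean of log GSP within each state, within each region, and overall
egen double state_mean  = mean(gsp), by(state)
egen double region_mean = mean(gsp), by(region)
egen double grand_mean  = mean(gsp)

* Flag one row per state, then list the state means next to their regional mean
egen byte one_per_state = tag(state)
sort region state year
list region state state_mean region_mean grand_mean if one_per_state, sepby(region)
```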

Next, let’s introduce some notation to help us keep track of our multilevel structure. In the jargon of multilevel modeling, the repeated measurements of GSP are described as “level 1”, the states are referred to as “level 2” and the regions are “level 3”. I can add a three-part subscript to each observation to keep track of its place in the hierarchy.

Now let’s think about our model. The simplest regression model is the intercept-only model which is equivalent to the sample mean. The sample mean is the “fixed” part of the model and the difference between the observation and the mean is the residual or “random” part of the model. Econometricians often prefer the term “disturbance”. I’m going to use the symbol μ to denote the fixed part of the model. μ could represent something as simple as the sample mean or it could represent a collection of independent variables and their parameters.

Each observation can then be described in terms of its deviation from the fixed part of the model.

If we computed this deviation of each observation, we could estimate the variability of those deviations. Let’s try that for our data using Stata’s **xtmixed** command to fit the model:

    . xtmixed gsp

    Mixed-effects ML regression                     Number of obs      =       816

                                                    Wald chi2(0)       =         .
    Log likelihood = -1174.4175                     Prob > chi2        =         .

    ------------------------------------------------------------------------------
             gsp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           _cons |   10.50885   .0357249   294.16   0.000     10.43883    10.57887
    ------------------------------------------------------------------------------

    ------------------------------------------------------------------------------
      Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
                    sd(Residual) |   1.020506   .0252613      .9721766    1.071238
    ------------------------------------------------------------------------------

The top table in the output shows the fixed part of the model which looks like any other regression output from Stata, and the bottom table displays the random part of the model. Let’s look at a graph of our model along with the raw data and interpret our results.

    predict GrandMean, xb
    label var GrandMean "GrandMean"
    twoway (line GrandMean year, lcolor(black) lwidth(thick)) ///
           (scatter gsp year, mcolor(red) msize(tiny)), ///
           ytitle(log(Gross State Product), margin(medsmall)) ///
           legend(cols(4) size(small)) ///
           title("GSP for 1970-1986 by Region", size(medsmall))

The thick black line in the center of the graph is the estimate of _cons, which is an estimate of the fixed part of the model for GSP. In this simple model, _cons is the sample mean, which is equal to 10.51. In the “Random-effects Parameters” section of the output, sd(Residual) is the average vertical distance between each observation (the red dots) and the fixed part of the model (the black line). In this model, sd(Residual) is the estimate of the sample standard deviation, which equals 1.02.

At this point you may be thinking to yourself – “That’s not very interesting – I could have done that with Stata’s **summarize** command”. And you would be correct.

    . summ gsp

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             gsp |       816    10.50885    1.021132    8.37885   13.04882
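The two standard deviations differ slightly in the fourth decimal place (1.020506 from **xtmixed** versus 1.021132 from **summarize**). A likely explanation is the divisor: **summarize** divides the sum of squares by n - 1, whereas maximum likelihood divides by n. The line below rescales one to the other using the numbers reported above.

```stata
* summarize divides by n - 1 = 815; ML divides by n = 816
display 1.021132 * sqrt(815/816)
```

The result is approximately 1.0205, which agrees with the sd(Residual) reported by **xtmixed**.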

But here’s where it does become interesting. Let’s make the random part of the model more complex to account for the hierarchical structure of the data. Consider a single observation, y_{ijk}, and take another look at its residual.

The observation deviates from its state mean by an amount that we will denote e_{ijk}. The observation’s state mean deviates from the regional mean by an amount that we will denote u_{ij.}, and the observation’s regional mean deviates from the fixed part of the model, μ, by an amount that we will denote u_{i..}. We have partitioned the observation’s residual into three parts, aka “components”, that describe its magnitude relative to the state, region and grand means. If we calculated this set of residuals for each observation, we could estimate the variability of those residuals and make distributional assumptions about them.
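Written out, the partition described above is the identity below. The variance symbols σ² are my own labels rather than anything printed in the output, and the normality assumption shown is the usual one made for the random effects and residuals in a linear mixed model.

```latex
y_{ijk} = \mu + u_{i..} + u_{ij.} + e_{ijk},
\qquad
u_{i..} \sim N(0, \sigma^2_{\text{region}}),\quad
u_{ij.} \sim N(0, \sigma^2_{\text{state}}),\quad
e_{ijk} \sim N(0, \sigma^2_{e})
```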

These kinds of models are often called “variance component” models because they estimate the variability accounted for by each level of the hierarchy. We can estimate a variance component model for GSP using Stata’s **xtmixed** command:

    . xtmixed gsp, || region: || state:

    ------------------------------------------------------------------------------
             gsp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           _cons |   10.65961   .2503806    42.57   0.000     10.16887    11.15035
    ------------------------------------------------------------------------------

    ------------------------------------------------------------------------------
      Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
    region: Identity             |
                       sd(_cons) |   .6615227   .2038944       .361566    1.210325
    -----------------------------+------------------------------------------------
    state: Identity              |
                       sd(_cons) |   .7797837   .0886614      .6240114    .9744415
    -----------------------------+------------------------------------------------
                    sd(Residual) |   .1570457   .0040071       .149385    .1650992
    ------------------------------------------------------------------------------

The fixed part of the model, _cons, is still an estimate of the grand mean, although it is no longer identical to the simple sample mean (10.66 here versus 10.51 before). But now there are three parameter estimates in the bottom table labeled “Random-effects Parameters”. Each quantifies the average deviation at one level of the hierarchy.
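To get a feel for how the variability splits across the levels, the three standard deviations can be squared and expressed as shares of their sum. This is a back-of-the-envelope sketch using the estimates copied from the output above; the scalar names are my own.

```stata
* Variances are the squares of the reported standard deviations
scalar v_region = .6615227^2
scalar v_state  = .7797837^2
scalar v_resid  = .1570457^2
scalar v_total  = v_region + v_state + v_resid

* Share of the total variance at each level
display "region share:   " v_region / v_total
display "state share:    " v_state  / v_total
display "residual share: " v_resid  / v_total
```

The shares come out to roughly 41%, 57% and 2%, so nearly all of the variability in this model lies between regions and between states rather than within states.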

Let’s graph the predictions from our model and see how well they fit the data.

    * GrandMean already exists from the intercept-only model, so drop it first
    capture drop GrandMean
    predict GrandMean, xb
    label var GrandMean "GrandMean"
    predict RegionEffect, reffects level(region)
    predict StateEffect, reffects level(state)
    gen RegionMean = GrandMean + RegionEffect
    gen StateMean = GrandMean + RegionEffect + StateEffect
    twoway (line GrandMean year, lcolor(black) lwidth(thick)) ///
           (line RegionMean year, lcolor(blue) lwidth(medthick)) ///
           (line StateMean year, lcolor(green) connect(ascending)) ///
           (scatter gsp year, mcolor(red) msize(tiny)), ///
           ytitle(log(Gross State Product), margin(medsmall)) ///
           legend(cols(4) size(small)) ///
           by(region, title("Multilevel Model of GSP by Region", size(medsmall)))

Wow – that’s a nice graph if I do say so myself. It would be impressive for a report or publication, but it’s a little tough to read with all nine regions displayed at once. Let’s take a closer look at Region 7 instead.

    twoway (line GrandMean year, lcolor(black) lwidth(thick)) ///
           (line RegionMean year, lcolor(blue) lwidth(medthick)) ///
           (line StateMean year, lcolor(green) connect(ascending)) ///
           (scatter gsp year, mcolor(red) msize(medsmall)) ///
           if region ==7, ///
           ytitle(log(Gross State Product), margin(medsmall)) ///
           legend(cols(4) size(small)) ///
           title("Multilevel Model of GSP for Region 7", size(medsmall))

The red dots are the observations of GSP for each state within Region 7. The green lines are the estimated mean GSP within each State and the blue line is the estimated mean GSP within Region 7. The thick black line in the center is the overall grand mean for all nine regions. The model appears to fit the data fairly well but I can’t help noticing that the red dots seem to have an upward slant to them. Our model predicts that GSP is constant within each state and region from 1970 to 1986 when clearly the data show an upward trend.

So we’ve tackled the first feature of our data. We’ve successfully incorporated the basic hierarchical structure into our model by fitting a variance component model using Stata’s **xtmixed** command. But our graph tells us that we aren’t finished yet.

Next time we’ll tackle the second feature of our data — the longitudinal nature of the observations.

- For more information

If you’d like to learn more about modelling multilevel and longitudinal data, check out

Multilevel and Longitudinal Modeling Using Stata, Third Edition

Volume I: Continuous Responses

Volume II: Categorical Responses, Counts, and Survival

by Sophia Rabe-Hesketh and Anders Skrondal

or sign up for our popular public training course “Multilevel/Mixed Models Using Stata”.

There’s a course coming up in Washington, DC on February 7-8, 2013.

]]>**http://www.stata.com/statalist/archive/2012-10/msg01129.html**

To remind you, I’ve been writing about how to *use* random-number generators in parts **1**, **2**, and **3**, and I still have one more posting I want to write on the subject. What I just wrote on Statalist, however, is about how random-number generators work, and I think you will find it interesting.

To find out more about Statalist, see

]]>