Programming estimators in Stata: Why you should
Distributing a Stata command that implements a statistical method will get that method used by lots of people. They will thank you. And, they will cite you!
This post is the first in the series #StataProgramming about programing an estimation command in Stata that uses Mata to do the numerical work. In the process of showing you how to program an estimation command in Stata, I will discuss do-file programming, ado-file programming, and Mata programming. When the series ends, you will be able to write Stata commands.
Stata users like its predictable syntax and its estimation-postestimation structure that facilitates hypothesis testing, specification tests, and parameter interpretation. To help you write Stata commands that people want to use, I illustrate how Stata syntax is predictable and give an overview of the estimation-postestimation structure that you will want to emulate in your programs.
Stata structure by example
I use and describe some simulated data about the number of traffic accidents observed on 948 people.
Example 1: Accident data
. use http://www.stata.com/data/accident2.dta . describe Contains data from http://www.stata.com/data/accident2.dta obs: 948 vars: 6 23 Sep 2015 13:04 size: 22,752 -------------------------------------------------------------------------------- storage display value variable name type format label variable label -------------------------------------------------------------------------------- kids float %9.0g number of children cvalue float %9.0g car value index tickets float %9.0g number of tickets in last 2 years traffic float %9.0g local traffic index, larger=>worse male float %9.0g 1=>man, 0=>woman accidents float %9.0g number of traffic in last 5 years -------------------------------------------------------------------------------- Sorted by:
Stata’s predictable syntax
I estimate the parameters of a Poisson regression model for accidents as a function of traffic conditions (traffic), an indicator for being a male driver (male), and the number of tickets received in the last two years (tickets).
Example 2: A Poisson model for accidents
. poisson accidents traffic male tickets , vce(robust) Iteration 0: log pseudolikelihood = -377.98594 Iteration 1: log pseudolikelihood = -370.68001 Iteration 2: log pseudolikelihood = -370.66527 Iteration 3: log pseudolikelihood = -370.66527 Poisson regression Number of obs = 948 Wald chi2(3) = 1798.65 Prob > chi2 = 0.0000 Log pseudolikelihood = -370.66527 Pseudo R2 = 0.8191 ------------------------------------------------------------------------------ | Robust accidents | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- traffic | .0764399 .0165119 4.63 0.000 .0440772 .1088027 male | 3.228004 .1232081 26.20 0.000 2.986521 3.469488 tickets | 1.366614 .0328218 41.64 0.000 1.302284 1.430943 _cons | -7.434478 .2413188 -30.81 0.000 -7.907454 -6.961502 ------------------------------------------------------------------------------
I want to focus on the structure in this example so that you can use it to make your commands easier to use. In particular, I want to discuss the structure of the command syntax and to point out that the output is easy to read and interpret because it is a standard Stata output table. For estimators that table almost always reports estimates (often coefficients), standard errors, tests against zero and their $p$-values, and confidence intervals.
Stata syntax is predictable, which makes it easy to use. Stata users “speak Stata” and do not even notice the details. I highlight some of these details so that we can make the syntax of the commands we write predictable. Here are some of the standard syntax elements illustrated in example 2.
- The command has four syntactical elements;
- command name (poisson),
- list of variable names (accidents traffic male tickets),
- a comma,
- an option (vce(robust)).
- In the list of variable names, the name of the dependent variable is first and it is followed by the names of the independent variables.
- The job of the comma is to separate the command name and variable list from the option or options.
The output is also structured; it is composed of an iteration log, a header, and a standard output table.
Estimation-postestimation framework
As a Stata user, I could now use the estimation-postestimation framework. For example, I could perform a Wald test of the hypothesis that the coefficient on male is 3.
Example 3: A Wald test of a linear restriction
. test male = 3 ( 1) [accidents]male = 3 chi2( 1) = 3.42 Prob > chi2 = 0.0642
or I could perform a Wald test of the nonlinear hypothesis that the ratio of the coefficient on male to the ratio of the coefficient on tickets is 2.
Example 4: A Wald test of a nonlinear restriction
. testnl _b[male]/_b[tickets] = 2 (1) _b[male]/_b[tickets] = 2 chi2(1) = 19.65 Prob > chi2 = 0.0000
I could also predict the mean of accidents for each observation and summarize the results.
Example 5: Summarizing the predicted conditional means
. predict nhat (option n assumed; predicted number of events) . summarize nhat Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- nhat | 948 .8512658 2.971087 .0006086 29.0763
Finally, I could use margins to estimate conditional or population-averaged parameters that are functions of the parameters in the original model. I use margins to estimate the average number of accidents that would be observed if each individual received 0 tickets, or 1 ticket, or 2 tickets, …, or 7 tickets. See [R] margins, Long and Freese (2006, sec. 4.4.2-4.4.3), and Cameron and Trivedi (2010, 10.5.6{10.6.9) for introductions to estimating functions of the the model parameters by margins.
Example 6: Estimating functions of model parameters
. margins, at(tickets=(0 1 2 3 4 5 6 7)) Predictive margins Number of obs = 948 Model VCE : Robust Expression : Predicted number of events, predict() 1._at : tickets = 0 2._at : tickets = 1 3._at : tickets = 2 4._at : tickets = 3 5._at : tickets = 4 6._at : tickets = 5 7._at : tickets = 6 8._at : tickets = 7 ------------------------------------------------------------------------------ | Delta-method | Margin Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _at | 1 | .0097252 .0015387 6.32 0.000 .0067094 .012741 2 | .0381426 .0048762 7.82 0.000 .0285854 .0476998 3 | .1495971 .0148157 10.10 0.000 .120559 .1786353 4 | .5867272 .0432256 13.57 0.000 .5020066 .6714478 5 | 2.301172 .1302033 17.67 0.000 2.045978 2.556366 6 | 9.025308 .5049176 17.87 0.000 8.035688 10.01493 7 | 35.39769 2.555679 13.85 0.000 30.38865 40.40673 8 | 138.8315 13.49606 10.29 0.000 112.3797 165.2832 ------------------------------------------------------------------------------
The glue
The estimation results stored in e() are the glue that holds together the estimation-postestimation framework. The poisson command stores lots of stuff in e(). I could use ereturn list to list all this stuff, but there are many stored objects that do not interest you yet.
Most of the estimation-postestimation features that I discussed were implemented using e(b), e(V), and e(predict), which are the vector of point estimates, the estimated VCE, and the name of the command that implements predict after poisson.
I will show how to store what you need in e() in the #StataProgramming series.
Structure of Stata commands
Here is an outline of the tasks performed by a Stata estimation command.
- Parse the input to the command.
- Compute results.
- Store results in e()
- Display output.
You need to write a predict command to complete the estimation-postestimation framework. After you have stored the estimation results and written the predict command, margins works.
I will explain each of these steps in the #StataProgramming series of posts.
Use this structure to your advantage. To make your command easy to use, design it to have the predictable syntax implemented in other commands and make it work in the estimation-postestimation framework. This task is far easier than it sounds. In fact, it is just plain easy. The Stata language steers you in this direction.
Done and undone
I will teach you how to program an estimation command in Stata in the #StataProgramming series. I will also show you how do the numerical work in Mata. I discussed the following points, in this first post.
- The predictable structure of Stata syntax makes Stata easy to use. You should emulate this structure, so that your commands are easy to use.
- The estimation-postestimation framework makes inference and advanced estimation simple. It is easy for you to make your command work with this framework.
- The estimation results stored in e(), and the predict command, are the glue that holds the estimation-postestimation framework together.
In the next post, I discuss do-file programming tools that I will subsequently use to parse the input to the command.
References
Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Revised ed. College Station, Texas: Stata Press.
Long, J. S., and J. Freese. 2014. Regression models for categorical dependent variables using Stata. 3rd ed. College Station, Texas: Stata Press.