Home > Programming > Programming estimators in Stata: Why you should

Programming estimators in Stata: Why you should

Distributing a Stata command that implements a statistical method will get that method used by lots of people. They will thank you. And, they will cite you!

This post is the first in the series #StataProgramming about programing an estimation command in Stata that uses Mata to do the numerical work. In the process of showing you how to program an estimation command in Stata, I will discuss do-file programming, ado-file programming, and Mata programming. When the series ends, you will be able to write Stata commands.

Stata users like its predictable syntax and its estimation-postestimation structure that facilitates hypothesis testing, specification tests, and parameter interpretation. To help you write Stata commands that people want to use, I illustrate how Stata syntax is predictable and give an overview of the estimation-postestimation structure that you will want to emulate in your programs.

Stata structure by example

I use and describe some simulated data about the number of traffic accidents observed on 948 people.

Example 1: Accident data

. use http://www.stata.com/data/accident2.dta

. describe

Contains data from http://www.stata.com/data/accident2.dta
  obs:           948                          
 vars:             6                          23 Sep 2015 13:04
 size:        22,752                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
kids            float   %9.0g                 number of children
cvalue          float   %9.0g                 car value index
tickets         float   %9.0g                 number of tickets in last 2 years
traffic         float   %9.0g                 local traffic index, larger=>worse
male            float   %9.0g                 1=>man, 0=>woman
accidents       float   %9.0g                 number of traffic in last 5 years
--------------------------------------------------------------------------------
Sorted by: 

Stata’s predictable syntax

I estimate the parameters of a Poisson regression model for accidents as a function of traffic conditions (traffic), an indicator for being a male driver (male), and the number of tickets received in the last two years (tickets).

Example 2: A Poisson model for accidents

. poisson accidents traffic male tickets , vce(robust)

Iteration 0:   log pseudolikelihood = -377.98594  
Iteration 1:   log pseudolikelihood = -370.68001  
Iteration 2:   log pseudolikelihood = -370.66527  
Iteration 3:   log pseudolikelihood = -370.66527  

Poisson regression                              Number of obs     =        948
                                                Wald chi2(3)      =    1798.65
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -370.66527               Pseudo R2         =     0.8191

------------------------------------------------------------------------------
             |               Robust
   accidents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     traffic |   .0764399   .0165119     4.63   0.000     .0440772    .1088027
        male |   3.228004   .1232081    26.20   0.000     2.986521    3.469488
     tickets |   1.366614   .0328218    41.64   0.000     1.302284    1.430943
       _cons |  -7.434478   .2413188   -30.81   0.000    -7.907454   -6.961502
------------------------------------------------------------------------------

I want to focus on the structure in this example so that you can use it to make your commands easier to use. In particular, I want to discuss the structure of the command syntax and to point out that the output is easy to read and interpret because it is a standard Stata output table. For estimators that table almost always reports estimates (often coefficients), standard errors, tests against zero and their $p$-values, and confidence intervals.

Stata syntax is predictable, which makes it easy to use. Stata users “speak Stata” and do not even notice the details. I highlight some of these details so that we can make the syntax of the commands we write predictable. Here are some of the standard syntax elements illustrated in example 2.

  1. The command has four syntactical elements;
    1. command name (poisson),
    2. list of variable names (accidents traffic male tickets),
    3. a comma,
    4. an option (vce(robust)).
  2. In the list of variable names, the name of the dependent variable is first and it is followed by the names of the independent variables.
  3. The job of the comma is to separate the command name and variable list from the option or options.

The output is also structured; it is composed of an iteration log, a header, and a standard output table.

Estimation-postestimation framework

As a Stata user, I could now use the estimation-postestimation framework. For example, I could perform a Wald test of the hypothesis that the coefficient on male is 3.

Example 3: A Wald test of a linear restriction

. test male = 3

 ( 1)  [accidents]male = 3

           chi2(  1) =    3.42
         Prob > chi2 =    0.0642

or I could perform a Wald test of the nonlinear hypothesis that the ratio of the coefficient on male to the ratio of the coefficient on tickets is 2.

Example 4: A Wald test of a nonlinear restriction

. testnl _b[male]/_b[tickets] = 2

  (1)  _b[male]/_b[tickets] = 2

               chi2(1) =       19.65
           Prob > chi2 =        0.0000

I could also predict the mean of accidents for each observation and summarize the results.

Example 5: Summarizing the predicted conditional means

. predict nhat
(option n assumed; predicted number of events)

. summarize nhat

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        nhat |        948    .8512658    2.971087   .0006086    29.0763

Finally, I could use margins to estimate conditional or population-averaged parameters that are functions of the parameters in the original model. I use margins to estimate the average number of accidents that would be observed if each individual received 0 tickets, or 1 ticket, or 2 tickets, …, or 7 tickets. See [R] margins, Long and Freese (2006, sec. 4.4.2-4.4.3), and Cameron and Trivedi (2010, 10.5.6{10.6.9) for introductions to estimating functions of the the model parameters by margins.

Example 6: Estimating functions of model parameters

. margins, at(tickets=(0 1 2 3 4 5 6 7))

Predictive margins                              Number of obs     =        948
Model VCE    : Robust

Expression   : Predicted number of events, predict()

1._at        : tickets         =           0

2._at        : tickets         =           1

3._at        : tickets         =           2

4._at        : tickets         =           3

5._at        : tickets         =           4

6._at        : tickets         =           5

7._at        : tickets         =           6

8._at        : tickets         =           7

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   .0097252   .0015387     6.32   0.000     .0067094     .012741
          2  |   .0381426   .0048762     7.82   0.000     .0285854    .0476998
          3  |   .1495971   .0148157    10.10   0.000      .120559    .1786353
          4  |   .5867272   .0432256    13.57   0.000     .5020066    .6714478
          5  |   2.301172   .1302033    17.67   0.000     2.045978    2.556366
          6  |   9.025308   .5049176    17.87   0.000     8.035688    10.01493
          7  |   35.39769   2.555679    13.85   0.000     30.38865    40.40673
          8  |   138.8315   13.49606    10.29   0.000     112.3797    165.2832
------------------------------------------------------------------------------

The glue

The estimation results stored in e() are the glue that holds together the estimation-postestimation framework. The poisson command stores lots of stuff in e(). I could use ereturn list to list all this stuff, but there are many stored objects that do not interest you yet.

Most of the estimation-postestimation features that I discussed were implemented using e(b), e(V), and e(predict), which are the vector of point estimates, the estimated VCE, and the name of the command that implements predict after poisson.

I will show how to store what you need in e() in the #StataProgramming series.

Structure of Stata commands

Here is an outline of the tasks performed by a Stata estimation command.

  1. Parse the input to the command.
  2. Compute results.
  3. Store results in e()
  4. Display output.

You need to write a predict command to complete the estimation-postestimation framework. After you have stored the estimation results and written the predict command, margins works.

I will explain each of these steps in the #StataProgramming series of posts.

Use this structure to your advantage. To make your command easy to use, design it to have the predictable syntax implemented in other commands and make it work in the estimation-postestimation framework. This task is far easier than it sounds. In fact, it is just plain easy. The Stata language steers you in this direction.

Done and undone

I will teach you how to program an estimation command in Stata in the #StataProgramming series. I will also show you how do the numerical work in Mata. I discussed the following points, in this first post.

  1. The predictable structure of Stata syntax makes Stata easy to use. You should emulate this structure, so that your commands are easy to use.
  2. The estimation-postestimation framework makes inference and advanced estimation simple. It is easy for you to make your command work with this framework.
  3. The estimation results stored in e(), and the predict command, are the glue that holds the estimation-postestimation framework together.

In the next post, I discuss do-file programming tools that I will subsequently use to parse the input to the command.

References

Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Revised ed. College Station, Texas: Stata Press.

Long, J. S., and J. Freese. 2014. Regression models for categorical dependent variables using Stata. 3rd ed. College Station, Texas: Stata Press.