**Stata syntax and the syntax command**

The **myregress2** command described in Programming an estimation command in Stata: A first command for OLS has the syntax

**myregress2** *depvar* [*indepvars*]

This syntax requires that the dependent variable be specified because *depvar* is not enclosed in square brackets. The independent variables are optional because *indepvars* is enclosed in square brackets. See **[U] 11 Language syntax** for an introduction to reading Stata syntax diagrams.

This syntax is implemented by the **syntax** command in line 5 of **myregress2.ado**, which I discussed at length in Programming an estimation command in Stata: A first command for OLS. The user must specify a list of variable names because **varlist** is not enclosed in square brackets. The syntax of the **syntax** command follows the rules of a syntax diagram.

```stata
*! version 2.0.0 26Oct2015
program define myregress2, eclass
    version 14

    syntax varlist
    gettoken depvar : varlist

    tempname zpz xpx xpy xpxi b V
    tempvar  xbhat res res2

    quietly matrix accum `zpz' = `varlist'
    local p : word count `varlist'
    local p = `p' + 1
    matrix `xpx'  = `zpz'[2..`p', 2..`p']
    matrix `xpy'  = `zpz'[2..`p', 1]
    matrix `xpxi' = syminv(`xpx')
    matrix `b'    = (`xpxi'*`xpy')'
    quietly matrix score double `xbhat' = `b'
    quietly generate double `res'  = (`depvar' - `xbhat')
    quietly generate double `res2' = (`res')^2
    quietly summarize `res2'
    local N   = r(N)
    local sum = r(sum)
    local s2  = `sum'/(`N'-(`p'-1))
    matrix `V' = `s2'*`xpxi'
    ereturn post `b' `V'
    ereturn local cmd "myregress2"
    ereturn display
end
```

Example 1 illustrates that **myregress2** runs the requested regression when I specify a varlist.

**Example 1: myregress2 with specified variables**

```
. sysuse auto
(1978 Automobile Data)

. myregress2 price mpg trunk
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -220.1649   65.59262    -3.36   0.001    -348.7241    -91.6057
       trunk |   43.55851   88.71884     0.49   0.623    -130.3272    217.4442
       _cons |   10254.95   2349.084     4.37   0.000      5650.83    14859.07
------------------------------------------------------------------------------
```

Example 2 illustrates that the **syntax** command displays an error message and stops execution when I do not specify a varlist. I use **set trace on** to see each line of code and the output it produces.

**Example 2: myregress2 without a varlist**

```
. set trace on

. myregress2
  ------------------------------------------------- begin myregress2 ---
  - version 14
  - syntax varlist
varlist required
  --------------------------------------------------- end myregress2 ---
r(100);
```

Example 3 illustrates that the **syntax** command checks that the specified variables exist in the current dataset. **syntax** throws an error because **DoesNotExist** is not a variable in the current dataset.

**Example 3: myregress2 with a variable not in this dataset**

```
. set trace on

. myregress2 price mpg trunk DoesNotExist
  ------------------------------------------------- begin myregress2 ---
  - version 14
  - syntax varlist
variable DoesNotExist not found
  --------------------------------------------------- end myregress2 ---
r(111);

end of do-file

r(111);
```

Because the **syntax** command on line 5 does not restrict the specified variables to be numeric, I get the **no observations** error in example 4 instead of an error indicating the actual problem, which is the string variable **make**.

**Example 4: myregress2 with a string variable**

```
. describe make

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
make            str18   %-18s                 Make and Model

. myregress2 price mpg trunk make
no observations
r(2000);

end of do-file

r(2000);
```

On line 5 of **myregress3**, I modify **varlist** to accept only numeric variables. This change produces a more informative error message when I try to include a string variable in the regression.

```stata
*! version 3.0.0 30Oct2015
program define myregress3, eclass
    version 14

    syntax varlist(numeric)
    gettoken depvar : varlist

    tempname zpz xpx xpy xpxi b V
    tempvar  xbhat res res2

    quietly matrix accum `zpz' = `varlist'
    local p : word count `varlist'
    local p = `p' + 1
    matrix `xpx'  = `zpz'[2..`p', 2..`p']
    matrix `xpy'  = `zpz'[2..`p', 1]
    matrix `xpxi' = syminv(`xpx')
    matrix `b'    = (`xpxi'*`xpy')'
    quietly matrix score double `xbhat' = `b'
    quietly generate double `res'  = (`depvar' - `xbhat')
    quietly generate double `res2' = (`res')^2
    quietly summarize `res2'
    local N   = r(N)
    local sum = r(sum)
    local s2  = `sum'/(`N'-(`p'-1))
    matrix `V' = `s2'*`xpxi'
    ereturn post `b' `V'
    ereturn local cmd "myregress3"
    ereturn display
end
```

**Example 5: myregress3 with a string variable**

```
. set trace on

. myregress3 price mpg trunk make
  ------------------------------------------------- begin myregress3 ---
  - version 14
  - syntax varlist(numeric)
string variables not allowed in varlist; make is a string variable
  --------------------------------------------------- end myregress3 ---
r(109);

end of do-file

r(109);
```

On line 5 of **myregress4**, I modify the **varlist** to accept time-series (**ts**) variables. The **syntax** command puts time-series variables in a canonical form that is stored in the local macro **varlist**, as illustrated in the display on line 6, whose output appears in example 6.

```stata
*! version 4.0.0 31Oct2015
program define myregress4, eclass
    version 14

    syntax varlist(numeric ts)
    display "varlist is `varlist'"
    gettoken depvar : varlist

    tempname zpz xpx xpy xpxi b V
    tempvar  xbhat res res2

    quietly matrix accum `zpz' = `varlist'
    local p : word count `varlist'
    local p = `p' + 1
    matrix `xpx'  = `zpz'[2..`p', 2..`p']
    matrix `xpy'  = `zpz'[2..`p', 1]
    matrix `xpxi' = syminv(`xpx')
    matrix `b'    = (`xpxi'*`xpy')'
    quietly matrix score double `xbhat' = `b'
    quietly generate double `res'  = (`depvar' - `xbhat')
    quietly generate double `res2' = (`res')^2
    quietly summarize `res2'
    local N   = r(N)
    local sum = r(sum)
    local s2  = `sum'/(`N'-(`p'-1))
    matrix `V' = `s2'*`xpxi'
    ereturn post `b' `V'
    ereturn local cmd "myregress4"
    ereturn display
end
```

**Example 6: myregress4 with time-series variables**

```
. sysuse gnp96

. myregress4 L(0/3).gnp
varlist is gnp96 L.gnp96 L2.gnp96 L3.gnp96
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       gnp96 |
         L1. |   1.277086   .0860652    14.84   0.000     1.108402    1.445771
         L2. |   -.135549   .1407719    -0.96   0.336    -.4114568    .1403588
         L3. |  -.1368326   .0871645    -1.57   0.116    -.3076719    .0340067
             |
       _cons |   -2.94825   14.36785    -0.21   0.837    -31.10871    25.21221
------------------------------------------------------------------------------
```

**Done and undone**

I used the **syntax** command to improve how **myregress2** handles the variables specified by the user. I showed how to require that all variables be numeric variables and how to make the command accept time-series operated variables. In the next post, I show how to make the command allow for sample restrictions, how to handle missing values, how to allow for factor-operated variables, and how to deal with perfectly collinear variables.

\(
\newcommand{\betab}{\boldsymbol{\beta}}
\newcommand{\xb}{{\bf x}}
\newcommand{\yb}{{\bf y}}
\newcommand{\Xb}{{\bf X}}
\)

I show how to write a Stata estimation command that implements the ordinary least-squares (**OLS**) estimator.

This post builds on Programming an estimation command in Stata: A first ado command, in which I introduced some ado-programming concepts. Although I introduce some local macro tricks that I use all the time, I also build on Programming an estimation command in Stata: Where to store your stuff.

**Local macro tricks**

I use lots of local macro tricks in my ado-files. In this section, I illustrate ones that I use in the commands that I develop in this post. In every ado-file that I write, I ask questions about lists of variable names stored in local macros. I frequently use the extended macro functions and the **gettoken** command to ask these questions and store the results in a local macro.

The syntax for storing the result of an extended macro function in a local macro is

**local** *localname* : *extended_fcn*

Below, I use the extended macro function **word count** to count the number of elements in the list and store the result in the local macro **count**.

**Example 1: Storing and extracting the result of an extended macro function**

```
. local count : word count a b c

. display "count contains `count'"
count contains 3
```

There are many extended macro functions, but I illustrate just the one I use in this post; type **help extended fcn** for a complete list.

A token is an element in a list. I frequently use the **gettoken** command to split lists apart. The **gettoken** command has the syntax

**gettoken** *localname1* [*localname2*] : *localname3*

**gettoken** stores the first token in the list stored in the local macro *localname3* into the local macro *localname1*. If the optional *localname2* is specified, the remaining tokens are stored in the local macro *localname2*.
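For readers more familiar with general-purpose languages, the first-token split that **gettoken** performs has a rough Python analogy (this is just an illustration of the idea, not a Stata feature):

```python
# Rough Python analogy to gettoken: peel off the first whitespace-
# delimited token and keep the remainder of the list.
mylist = "y x1 x2"
first, _, left = mylist.partition(" ")
print(first)  # y
print(left)   # x1 x2
```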

I use **gettoken** to store the first token stored in **mylist** into the local macro **first**, whose contents I subsequently extract and display.

**Example 2: Using gettoken to store first token only**

```
. local mylist y x1 x2

. display "mylist contains `mylist'"
mylist contains y x1 x2

. gettoken first : mylist

. display "first contains `first'"
first contains y
```

Now, I use **gettoken** to store the first token stored in **mylist** into the local macro **first** and the remaining tokens into the local macro **left**. I subsequently extract and display the contents of **first** and **left**.

**Example 3: Using gettoken to store first and remaining tokens**

```
. gettoken first left : mylist

. display "first contains `first'"
first contains y

. display "left contains `left'"
left contains x1 x2
```

I frequently want to increase the value of a local macro by some fixed amount, say, \(3\). I now illustrate a solution that I use.

**Example 4: Local macro update**

```
. local p = 1

. local p = `p' + 3

. display "p is now `p'"
p is now 4
```

When the update value, also known as the increment value, is \(1\), we can use the increment operator, as below:

**Example 5: Local macro update**

```
. local p = 1

. local ++p

. display "p is now `p'"
p is now 2
```

**A first version of myregress**

The code in **myregress1** implements a version of the **OLS** formulas. I use **myregress1** in example 6. Below example 6, I discuss the code and the output.

```stata
*! version 1.0.0 23Oct2015
program define myregress1, eclass
    version 14

    syntax varlist
    display "The syntax command puts the variables specified by the "
    display "   user into the local macro varlist"
    display "   varlist contains `varlist'"

    gettoken depvar : varlist
    display "The dependent variable is `depvar'"

    matrix accum zpz = `varlist'
    display "matrix accum forms Z'Z"
    matrix list zpz

    local p : word count `varlist'
    local p = `p' + 1

    matrix xpx  = zpz[2..`p', 2..`p']
    matrix xpy  = zpz[2..`p', 1]
    matrix xpxi = syminv(xpx)
    matrix b    = (xpxi*xpy)'

    matrix score double xbhat = b
    generate double res  = (`depvar' - xbhat)
    generate double res2 = res^2
    summarize res2
    local N   = r(N)
    local sum = r(sum)
    local s2  = `sum'/(`N'-(`p'-1))
    matrix V  = `s2'*xpxi
    ereturn post b V
    ereturn local cmd "myregress1"
    ereturn display
end
```

**Example 6: myregress1 output**

```
. sysuse auto
(1978 Automobile Data)

. myregress1 price mpg trunk
The syntax command puts the variables specified by the 
   user into the local macro varlist
   varlist contains price mpg trunk
The dependent variable is price
(obs=74)
matrix accum forms Z'Z

symmetric zpz[4,4]
           price        mpg      trunk      _cons
price  3.448e+09
  mpg    9132716      36008
trunk    6565725      20630      15340
_cons     456229       1576       1018         74

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        res2 |         74     6674851    1.30e+07   11.24372   9.43e+07

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -220.1649   65.59262    -3.36   0.001    -348.7241    -91.6057
       trunk |   43.55851   88.71884     0.49   0.623    -130.3272    217.4442
       _cons |   10254.95   2349.084     4.37   0.000      5650.83    14859.07
------------------------------------------------------------------------------
```

Here are my comments on the code and the output in example 6.

- Line 2 specifies that **myregress1** is an e-class command that stores its results in **e()**.
- Lines 5–8 illustrate that the **syntax** command stores the names of the variables specified by the user in the local macro **varlist**. This behavior is also illustrated in example 6.
- Line 10 uses the **gettoken** command to store the first variable name stored in the local macro **varlist** in the local macro **depvar**. Line 11 displays this name, and the usage is illustrated in example 6.
- Line 13 uses **matrix accum** to put \((\Xb'\Xb)\) and \((\Xb'\yb)\) into a Stata matrix named **zpz**, as discussed in Programming an estimation command in Stata: Using Stata matrix commands and functions to compute OLS objects and further illustrated in lines 14–15 and example 6.
- Line 17 stores the number of variables in the local macro **varlist** into the local macro **p**.
- Line 18 increments the local macro **p** by \(1\) to account for the constant term included by **matrix accum** by default.
- Lines 20–23 extract \((\Xb'\Xb)\) and \((\Xb'\yb)\) from **zpz** and put the vector of point estimates \(\widehat{\betab}\) into the Stata row vector **b**.
- Line 25 puts \(\Xb\widehat{\betab}\) into the variable **xbhat**.
- Lines 26 and 27 calculate the residuals and the squared residuals, respectively.
- Lines 28–32 calculate the estimated variance-covariance matrix of the estimator (**VCE**) from the sum of squared residuals.
- Line 33 stores **b** and **V** into **e(b)** and **e(V)**, respectively.
- Line 34 stores the name of the estimation command (**myregress1**) in **e(cmd)**.
- Line 35 produces a standard Stata output table from the results in **e(b)** and **e(V)**.

**myregress1** contains code to help illustrate how it works, and it uses hard-coded names for global objects like Stata variables and Stata matrices. Users do not want to see the output from the illustration lines, so they must be removed. Users do not want their global Stata matrices overwritten by a command they use, which is what **myregress1** would do to a matrix named **zpz**, **xpx**, **xpxi**, **b**, or **V**.

The code in **myregress2** fixes these problems.

```stata
*! version 2.0.0 26Oct2015
program define myregress2, eclass
    version 14

    syntax varlist
    gettoken depvar : varlist

    tempname zpz xpx xpy xpxi b V
    tempvar  xbhat res res2

    quietly matrix accum `zpz' = `varlist'
    local p : word count `varlist'
    local p = `p' + 1
    matrix `xpx'  = `zpz'[2..`p', 2..`p']
    matrix `xpy'  = `zpz'[2..`p', 1]
    matrix `xpxi' = syminv(`xpx')
    matrix `b'    = (`xpxi'*`xpy')'
    quietly matrix score double `xbhat' = `b'
    quietly generate double `res'  = (`depvar' - `xbhat')
    quietly generate double `res2' = (`res')^2
    quietly summarize `res2'
    local N   = r(N)
    local sum = r(sum)
    local s2  = `sum'/(`N'-(`p'-1))
    matrix `V' = `s2'*`xpxi'
    ereturn post `b' `V'
    ereturn local cmd "myregress2"
    ereturn display
end
```

- Line 8 uses **tempname** to put safe names into the local macros **zpz**, **xpx**, **xpy**, **xpxi**, **b**, and **V**.
- Line 9 uses **tempvar** to put safe names into the local macros **xbhat**, **res**, and **res2**.
- Lines 11, 14–18, and 25–26 use the safe names in the local macros created by **tempname** instead of the hard-coded names for the matrices.
- Lines 18–20 use the safe names in the local macros created by **tempvar** instead of the hard-coded names for the variables it creates.

Example 7 shows the output produced by **myregress2**.

**Example 7: myregress2 output**

```
. myregress2 price mpg trunk
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -220.1649   65.59262    -3.36   0.001    -348.7241    -91.6057
       trunk |   43.55851   88.71884     0.49   0.623    -130.3272    217.4442
       _cons |   10254.95   2349.084     4.37   0.000      5650.83    14859.07
------------------------------------------------------------------------------
```

**Done and undone**

After reviewing some tricks with local macros that I use in most of the ado-files that I write, I discussed two versions of an ado-command that implements the **OLS** estimator. In the next post, I extend this command so that the user may request a robust **VCE**, or that the constant term be suppressed, or both.

\(
\newcommand{\ebi}{\boldsymbol{\epsilon}_i}
\newcommand{\Sigmab}{\boldsymbol{\Sigma}}
\newcommand{\Omegab}{\boldsymbol{\Omega}}
\newcommand{\Lambdab}{\boldsymbol{\Lambda}}
\newcommand{\betab}{\boldsymbol{\beta}}
\newcommand{\gammab}{\boldsymbol{\gamma}}
\newcommand{\Gammab}{\boldsymbol{\Gamma}}
\newcommand{\deltab}{\boldsymbol{\delta}}
\newcommand{\xib}{\boldsymbol{\xi}}
\newcommand{\iotab}{\boldsymbol{\iota}}
\newcommand{\eb}{{\bf e}}
\newcommand{\xb}{{\bf x}}
\newcommand{\xbit}{{\bf x}_{it}}
\newcommand{\xbi}{{\bf x}_{i}}
\newcommand{\zb}{{\bf z}}
\newcommand{\zbi}{{\bf z}_i}
\newcommand{\wb}{{\bf w}}
\newcommand{\yb}{{\bf y}}
\newcommand{\ub}{{\bf u}}
\newcommand{\Gb}{{\bf G}}
\newcommand{\Hb}{{\bf H}}
\newcommand{\thetab}{\boldsymbol{\theta}}
\newcommand{\XBI}{{\bf x}_{i1},\ldots,{\bf x}_{iT}}
\newcommand{\Sb}{{\bf S}}
\newcommand{\Xb}{{\bf X}}
\newcommand{\Mb}{{\bf M}}
\newcommand{\Xtb}{\tilde{\bf X}}
\newcommand{\Wb}{{\bf W}}
\newcommand{\Ab}{{\bf A}}
\newcommand{\Bb}{{\bf B}}
\newcommand{\Zb}{{\bf Z}}
\newcommand{\Eb}{{\bf E}}
\newcommand{\Vb}{{\bf V}}
\)

I present the formulas for computing the ordinary least-squares (**OLS**) results and show how to compute them using Stata's matrix commands and functions.

**OLS formulas**

Recall that the OLS point estimates are given by

\[
\widehat{\betab} =
\left( \sum_{i=1}^N \xb_i'\xb_i \right)^{-1}
\left( \sum_{i=1}^N \xb_i'y_i \right)
\]

where \({\bf x}_i\) is the \(1\times k\) vector of independent variables, \(y_i\) is the dependent variable for each of the \(N\) sample observations, and the model for \(y_i\) is

\[
y_i = \xb_i\betab' + \epsilon_i
\]

If the \(\epsilon_i\) are independently and identically distributed, we estimate the variance-covariance matrix of the estimator (**VCE**) by

\[
\widehat{\Vb} = \widehat{s}
\left( \sum_{i=1}^N \xb_i'\xb_i \right)^{-1}
\]

where \(\widehat{s} = 1/(N-k)\sum_{i=1}^N e_i^2\) and \(e_i=y_i-{\bf x}_i\widehat{{\boldsymbol \beta}}\).

See Cameron and Trivedi (2005), Stock and Watson (2010), or Wooldridge (2015) for introductions to **OLS**.
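Before turning to the Stata implementation, the formulas above can be cross-checked with a minimal pure-Python sketch. The three-observation dataset, with one regressor plus a constant, is made up; it fits the line \(y = 2x + 1\) exactly, so the residual variance comes out as zero:

```python
# Hedged pure-Python cross-check of the OLS formulas for one regressor
# plus a constant; the tiny dataset is made up and fits exactly.
x = [1.0, 2.0, 3.0]
y = [3.0, 5.0, 7.0]          # y = 2*x + 1 exactly
N, k = len(x), 2

# X'X and X'y with x_i = (x_i, 1)
xpx = [[sum(v * v for v in x), sum(x)],
       [sum(x), float(N)]]
xpy = [sum(v * w for v, w in zip(x, y)), sum(y)]

# invert the 2x2 matrix and apply the normal equations
det = xpx[0][0] * xpx[1][1] - xpx[0][1] * xpx[1][0]
b = [(xpx[1][1] * xpy[0] - xpx[0][1] * xpy[1]) / det,
     (-xpx[1][0] * xpy[0] + xpx[0][0] * xpy[1]) / det]

# residual variance with the k = 2 degrees-of-freedom adjustment
res2 = [(yi - b[0] * xi - b[1]) ** 2 for xi, yi in zip(x, y)]
s2 = sum(res2) / (N - k)
print(b, s2)                 # [2.0, 1.0] 0.0
```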

**Stata matrix implementation**

I use the **matrix accum** command to compute the sum of the products over the observations. Typing

. matrix accum zpz = z1 z2 z3

puts \(\left( \sum_{i=1}^N {\bf z}_i'{\bf z}_i \right)\) into the Stata matrix **zpz**, where \({\bf z}_i=( {\tt z1}_i, {\tt z2}_i, {\tt z3}_i, 1)\). The \(1\) appears because **matrix accum** includes the constant term by default, like almost all estimation commands.
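To make the role of the appended constant concrete, here is a hypothetical pure-Python sketch of the accumulation that **matrix accum** performs; the function name and data are made up for illustration:

```python
# Minimal pure-Python sketch of what `matrix accum` computes: append a 1
# for the constant to each observation and accumulate the outer product
# z_i' z_i over observations.
def matrix_accum(rows):
    """rows: list of observations, each a list of variable values."""
    z = [r + [1.0] for r in rows]          # append the constant term
    k = len(z[0])
    zpz = [[0.0] * k for _ in range(k)]
    for zi in z:
        for a in range(k):
            for b in range(k):
                zpz[a][b] += zi[a] * zi[b]
    return zpz

zpz = matrix_accum([[2.0, 1.0], [4.0, 3.0], [6.0, 5.0]])
print(zpz)   # the last row/column holds the variable sums and the obs count
```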

Below, I use **matrix accum** to compute \(\left( \sum_{i=1}^N {\bf z}_i'{\bf z}_i \right)\), which contains \(\left( \sum_{i=1}^N {\bf x}_i'{\bf x}_i \right)\) and \(\left( \sum_{i=1}^N {\bf x}_i'y_i \right)\).

**Example 1: Using matrix accum**

```
. sysuse auto
(1978 Automobile Data)

. matrix accum zpz = price mpg trunk
(obs=74)

. matrix list zpz

symmetric zpz[4,4]
           price        mpg      trunk      _cons
price  3.448e+09
  mpg    9132716      36008
trunk    6565725      20630      15340
_cons     456229       1576       1018         74
```

Now, I extract \(\left( \sum_{i=1}^N {\bf x}_i'{\bf x}_i \right)\) from rows 2–4 and columns 2–4 of **zpz** and \(\left( \sum_{i=1}^N {\bf x}_i'y_i \right)\) from rows 2–4 and column 1 of **zpz**.

**Example 2: Extracting submatrices**

```
. matrix xpx = zpz[2..4, 2..4]

. matrix xpy = zpz[2..4, 1]

. matrix list xpx

symmetric xpx[3,3]
         mpg  trunk  _cons
  mpg  36008
trunk  20630  15340
_cons   1576   1018     74

. matrix list xpy

xpy[3,1]
         price
  mpg  9132716
trunk  6565725
_cons   456229
```

I now compute \(\widehat{{\boldsymbol \beta}}\) from the matrices formed in example 2.

**Example 3: Computing \(\widehat{\betab}\)**

```
. matrix xpxi = syminv(xpx)

. matrix b = xpxi*xpy

. matrix list b

b[3,1]
            price
  mpg  -220.16488
trunk    43.55851
_cons    10254.95

. matrix b = b'

. matrix list b

b[1,3]
              mpg       trunk       _cons
price  -220.16488    43.55851    10254.95
```

I transposed **b** to make it a row vector because point estimates in Stata are stored as row vectors.

Example 3 illustrates that the Stata matrix **b** contains both the estimated coefficients and the names of the variables to which each coefficient belongs. To clarify, our model is

\[
\Eb[{\tt price}|{\tt mpg}, {\tt trunk}] = {\tt mpg}*\beta_{\tt mpg}
+ {\tt trunk}*\beta_{\tt trunk} + {\tt \_cons}
\]

and **b** contains the information that \(-220.16\) is the estimated coefficient on **mpg**, that \(43.56\) is the estimated coefficient on **trunk**, and that \(10254.95\) is the estimated constant. We can compute the linear combination \(\xb_i\widehat{\betab}\) over the observations using the information in **b**, because **b** contains both the value and the name for each coefficient.

I use **matrix score** to compute this linear combination for each observation, and I use **generate** to reiterate what this linear combination is.

**Example 4: Using matrix score to compute \(\xb_i\widehat{\betab}'\)**

```
. matrix score double xbhat1 = b

. generate double xbhat2 = mpg*(-220.16488) + trunk*(43.55851) + 10254.95

. list xbhat1 xbhat2 in 1/4

     +-----------------------+
     |    xbhat1      xbhat2 |
     |-----------------------|
  1. | 5890.4661   5890.4663 |
  2. | 6991.2905   6991.2907 |
  3. | 5934.0246   5934.0248 |
  4. | 6548.5884   6548.5886 |
     +-----------------------+
```

I use the predictions for \(\Eb[{\tt price}|{\tt mpg}, {\tt trunk} ]\) in **xbhat1** to compute the residuals and the estimated **VCE**.

**Example 5: Computing the estimated VCE**

```
. generate double res = (price - xbhat1)

. generate double res2 = res^2

. summarize res2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        res2 |         74     6674851    1.30e+07   11.24372   9.43e+07

. return list

scalars:
                  r(N) =  74
              r(sum_w) =  74
               r(mean) =  6674850.504745401
                r(Var) =  168983977867533.1
                 r(sd) =  12999383.74952956
                r(min) =  11.24371634723049
                r(max) =  94250157.2111593
                r(sum) =  493938937.3511598

. local N = r(N)

. local sum = r(sum)

. local s2 = `sum'/(`N'-3)

. matrix V = (`s2')*xpxi
```

(See Programming an estimation command in Stata: Where to store your stuff for discussions of using results from r-class commands and using local macros.)

I verify that my computations for \(\widehat{\betab}\) and the **VCE** match those of **regress**.

**Example 6: Comparing against regress**

```
. regress price mpg trunk

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     10.14
       Model |   141126459         2  70563229.4   Prob > F        =    0.0001
    Residual |   493938937        71  6956886.44   R-squared       =    0.2222
-------------+----------------------------------   Adj R-squared   =    0.2003
       Total |   635065396        73  8699525.97   Root MSE        =    2637.6

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -220.1649   65.59262    -3.36   0.001    -350.9529    -89.3769
       trunk |   43.55851   88.71884     0.49   0.625    -133.3418    220.4589
       _cons |   10254.95   2349.084     4.37   0.000      5571.01    14938.89
------------------------------------------------------------------------------

. matrix list e(b)

e(b)[1,3]
           mpg      trunk      _cons
y1  -220.16488   43.55851   10254.95

. matrix list b

b[1,3]
              mpg      trunk      _cons
price  -220.16488   43.55851   10254.95

. matrix list e(V)

symmetric e(V)[3,3]
              mpg       trunk       _cons
  mpg   4302.3924
trunk   3384.4186   7871.0326
_cons  -138187.95  -180358.85   5518194.7

. matrix list V

symmetric V[3,3]
              mpg       trunk       _cons
  mpg   4302.3924
trunk   3384.4186   7871.0326
_cons  -138187.95  -180358.85   5518194.7
```

**Robust standard errors**

The frequently used robust estimator of the **VCE** is given by

\[
\widehat{V}_{robust}=\frac{N}{N-k}
\left( \sum_{i=1}^N \xb_i'\xb_i \right)^{-1}
\Mb
\left( \sum_{i=1}^N \xb_i'\xb_i \right)^{-1}
\]

where

\[\Mb=\sum_{i=1}^N \widehat{e}_i^2\xb_i'\xb_i\]

See Cameron and Trivedi (2005), Stock and Watson (2010), or Wooldridge (2015) for derivations and discussions.
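The weighted accumulation behind \(\Mb\) can be sketched in pure Python; the function name, data, and weights below are made up for illustration:

```python
# Hypothetical pure-Python sketch of the weighted cross-product behind
# the robust VCE: accumulate e_i^2 * x_i' x_i over observations, as
# `matrix accum ... [iweight=res2]` does (here the constant is carried
# explicitly as a second column rather than appended automatically).
def weighted_accum(x_rows, weights):
    k = len(x_rows[0])
    m = [[0.0] * k for _ in range(k)]
    for xi, w in zip(x_rows, weights):
        for a in range(k):
            for b in range(k):
                m[a][b] += w * xi[a] * xi[b]
    return m

# one regressor plus a constant column, with squared residuals as weights
M = weighted_accum([[1.0, 1.0], [2.0, 1.0]], [4.0, 9.0])
print(M)  # [[40.0, 22.0], [22.0, 13.0]]
```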

**matrix accum** with weights \(\widehat{e}_i^2\) computes the formula for \(\Mb\). Below, I use **matrix accum** to compute \(\Mb\) and \(\widehat{V}_{robust}\).

**Example 7: A robust VCE**

```
. matrix accum M = mpg trunk [iweight=res2]
(obs=493938937.4)

. matrix V2 = (`N'/(`N'-3))*xpxi*M*xpxi
```

I now verify that my computations match those reported by **regress**.

**Example 8: Comparing computations of robust VCE**

```
. regress price mpg trunk, robust

Linear regression                               Number of obs     =         74
                                                F(2, 71)          =      11.59
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2222
                                                Root MSE          =     2637.6

------------------------------------------------------------------------------
             |               Robust
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -220.1649   72.45388    -3.04   0.003    -364.6338   -75.69595
       trunk |   43.55851    71.4537     0.61   0.544    -98.91613    186.0331
       _cons |   10254.95   2430.641     4.22   0.000      5408.39    15101.51
------------------------------------------------------------------------------

. matrix list e(V)

symmetric e(V)[3,3]
              mpg       trunk       _cons
  mpg   5249.5646
trunk   3569.5316   5105.6316
_cons  -169049.76  -147284.49   5908013.8

. matrix list V2

symmetric V2[3,3]
              mpg       trunk       _cons
  mpg   5249.5646
trunk   3569.5316   5105.6316
_cons  -169049.76  -147284.49   5908013.8
```

**Cluster-robust standard errors**

The cluster-robust estimator of the **VCE** is frequently used when the data have a panel structure, also known as a longitudinal structure. This **VCE** accounts for the within-group correlation of the errors, and it is given by

\[
\widehat{V}_{cluster}=\frac{N-1}{N-k}\frac{g}{g-1}
\left( \sum_{i=1}^N \xb_i'\xb_i \right)^{-1}
\Mb_c
\left( \sum_{i=1}^N \xb_i'\xb_i \right)^{-1}
\]

where

\[
\Mb_c=\sum_{j=1}^g
\Xb_j' (\widehat{\eb}_j \widehat{\eb}_j') \Xb_j
\]

\(\Xb_j\) is the \(n_j\times k\) matrix of observations on \(\xb_i\) in group \(j\), \(\widehat{\eb}_j\) is the \(n_j\times 1\) vector of residuals in group \(j\), and \(g\) is the number of groups. See Cameron and Trivedi (2005), Wooldridge (2010), and **[R] regress** for derivations and discussions.
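The group-level accumulation can be sketched in pure Python by noting that \(\Xb_j' \widehat{\eb}_j \widehat{\eb}_j' \Xb_j\) equals \(\sb_j \sb_j'\) where \(\sb_j = \Xb_j'\widehat{\eb}_j\); the function name and data below are made up for illustration:

```python
# Hypothetical sketch of the cluster term: for each group, form the
# score s_j = X_j' e_j (a k-vector) and accumulate s_j s_j', which
# equals X_j' e_j e_j' X_j -- the quantity `matrix opaccum` computes.
def cluster_accum(groups):
    """groups: list of (x_rows, residuals) pairs, one per cluster."""
    k = len(groups[0][0][0])
    M = [[0.0] * k for _ in range(k)]
    for x_rows, res in groups:
        s = [sum(e * xi[a] for xi, e in zip(x_rows, res)) for a in range(k)]
        for a in range(k):
            for b in range(k):
                M[a][b] += s[a] * s[b]
    return M

M = cluster_accum([
    ([[1.0, 1.0], [2.0, 1.0]], [0.5, -0.5]),   # group 1: two observations
    ([[3.0, 1.0]], [1.0]),                      # group 2: one observation
])
print(M)
```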

**matrix opaccum** computes the formula for \(\Mb_c\). Below, I create the group variable **cvar** from **rep78** and use **matrix opaccum** to compute \(\Mb_c\) and \(\widehat{V}_{cluster}\).

**Example 9: A cluster-robust VCE**

```
. generate cvar = cond(missing(rep78), 6, rep78)

. tab cvar

       cvar |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.70        2.70
          2 |          8       10.81       13.51
          3 |         30       40.54       54.05
          4 |         18       24.32       78.38
          5 |         11       14.86       93.24
          6 |          5        6.76      100.00
------------+-----------------------------------
      Total |         74      100.00

. local Nc = r(r)

. sort cvar

. matrix opaccum M2 = mpg trunk , group(cvar) opvar(res)

. matrix V2 = ((`N'-1)/(`N'-3))*(`Nc'/(`Nc'-1))*xpxi*M2*xpxi
```

I now verify that my computations match those reported by **regress**.

**Example 10: Comparing computations of cluster-robust VCE**

```
. regress price mpg trunk, vce(cluster cvar)

Linear regression                               Number of obs     =         74
                                                F(2, 5)           =       9.54
                                                Prob > F          =     0.0196
                                                R-squared         =     0.2222
                                                Root MSE          =     2637.6

                                  (Std. Err. adjusted for 6 clusters in cvar)
------------------------------------------------------------------------------
             |               Robust
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -220.1649   93.28127    -2.36   0.065     -459.952    19.62226
       trunk |   43.55851   58.89644     0.74   0.493    -107.8396    194.9566
       _cons |   10254.95   2448.547     4.19   0.009     3960.758    16549.14
------------------------------------------------------------------------------

. matrix list e(V)

symmetric e(V)[3,3]
              mpg       trunk       _cons
  mpg   8701.3957
trunk   4053.5381   3468.7911
_cons     -223021  -124190.97   5995384.3

. matrix list V2

symmetric V2[3,3]
              mpg       trunk       _cons
  mpg   8701.3957
trunk   4053.5381   3468.7911
_cons     -223021  -124190.97   5995384.3
```

**Done and undone**

I reviewed the formulas that underlie the **OLS** estimator and showed how to compute them using Stata matrix commands and functions. In the next two posts, I write an ado-command that implements these formulas.

**References**

Cameron, A. C., and P. K. Trivedi. 2005. *Microeconometrics: Methods and Applications*. Cambridge: Cambridge University Press.

Stock, J. H., and M. W. Watson. 2010. *Introduction to Econometrics*. 3rd ed. Boston, MA: Addison-Wesley.

Wooldridge, J. M. 2010. *Econometric Analysis of Cross Section and Panel Data*. 2nd ed. Cambridge, MA: MIT Press.

Wooldridge, J. M. 2015. *Introductory Econometrics: A Modern Approach*. 6th ed. Cincinnati, OH: South-Western.

Today I will provide information that will help you interpret the estimation and postestimation results from Stata’s Arellano–Bond estimator **xtabond**, the most common linear dynamic panel-data estimator.

**The instruments and the regressors**

We have fictional data for 1,000 people from 1991 to 2000. The outcome of interest is income (**income**), and the explanatory variables are years of schooling (**educ**) and an indicator for marital status (**married**). Below, we fit an Arellano–Bond model using **xtabond**.

```
. xtabond income married educ, vce(robust)

Arellano-Bond dynamic panel-data estimation     Number of obs     =      8,000
Group variable: id                              Number of groups  =      1,000
Time variable: year
                                                Obs per group:
                                                              min =          8
                                                              avg =          8
                                                              max =          8

Number of instruments =     39                  Wald chi2(3)      =    3113.63
                                                Prob > chi2       =     0.0000
One-step results
                                     (Std. Err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |               Robust
      income |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |
         L1. |   .2008311   .0036375    55.21   0.000     .1937018    .2079604
             |
     married |   1.057667   .1006091    10.51   0.000     .8604764    1.254857
        educ |    .057551   .0045863    12.55   0.000     .0485619     .06654
       _cons |   .2645702   .0805474     3.28   0.001     .1067002    .4224403
------------------------------------------------------------------------------
Instruments for differenced equation
        GMM-type: L(2/.).income
        Standard: D.married D.educ
Instruments for level equation
        Standard: _cons
```

A couple of elements in the output table are different from what one would expect. The output includes a coefficient for the lagged value of the dependent variable that we did not specify in the command. Why?

In the Arellano–Bond framework, the value of the dependent variable in the previous period is a predictor for the current value of the dependent variable. Stata includes the value of the dependent variable in the previous period for us. Another noteworthy aspect that appears in the table is the mention of 39 instruments in the header. This is followed by a footnote that refers to GMM and standard-type instruments. Here a bit of math will help us understand what is going on.

The relationship of interest is given by

\[\begin{equation*}
y_{it} = x_{it}'\beta_1 + y_{i(t-1)}\beta_2 + \alpha_i + \varepsilon_{it}
\end{equation*}\]

In the equation above, \(y_{it}\) is the outcome of interest for individual \(i\) at time \(t\), \(x_{it}\) is a vector of regressors that may include past values, \(y_{i(t-1)}\) is the value of the outcome in the previous period, \(\alpha_i\) is a time-invariant unobservable, and \(\varepsilon_{it}\) is a time-varying unobservable.

As in the fixed-effects framework, we assume the time-invariant unobserved component is related to the regressors. When unobservables and observables are correlated, we have an endogeneity problem that yields inconsistent parameter estimates if we use a conventional linear panel-data estimator. One candidate solution is to take first-differences of the relationship of interest. However, first-differencing alone does not work. Why?

\[\begin{eqnarray*}

\Delta y_{it} &=& \Delta x_{it}’\beta_1 + \Delta y_{i(t-1)}\beta_2 + \Delta \varepsilon_{it} \\

E\left( \Delta y_{i(t-1)} \Delta \varepsilon_{it} \right) &\neq & 0

\end{eqnarray*}\]

In the first equation above, we got rid of \(\alpha_i\), which is correlated with our regressors, but we generated a new endogeneity problem: the second equation shows that one of our regressors is related to our unobservables. The solution is instrumental variables. Which instruments? Arellano and Bond suggest the second lag of the dependent variable and all feasible lags thereafter. This yields the set of moment conditions defined by

\[\begin{eqnarray*}

E\left( \Delta y_{i(t-2)} \Delta \varepsilon_{it} \right) &=& 0 \\

E\left( \Delta y_{i(t-3)} \Delta \varepsilon_{it} \right) &=& 0 \\

\ldots & & \\

E\left( \Delta y_{i(t-j)} \Delta \varepsilon_{it} \right) &=& 0

\end{eqnarray*}\]

In our example, we have 10 time periods, which yield the following set of instruments:

\[\begin{eqnarray*}

t&=10& \quad y_{t-9}, y_{t-8}, y_{t-7}, y_{t-6}, y_{t-5}, y_{t-4}, y_{t-3}, y_{t-2} \\

t&=9& \quad y_{t-8}, y_{t-7}, y_{t-6}, y_{t-5}, y_{t-4}, y_{t-3}, y_{t-2} \\

t&=8& \quad y_{t-7}, y_{t-6}, y_{t-5}, y_{t-4}, y_{t-3}, y_{t-2} \\

t&=7& \quad y_{t-6}, y_{t-5}, y_{t-4}, y_{t-3}, y_{t-2} \\

t&=6& \quad y_{t-5}, y_{t-4}, y_{t-3}, y_{t-2} \\

t&=5& \quad y_{t-4}, y_{t-3}, y_{t-2} \\

t&=4& \quad y_{t-3}, y_{t-2} \\

t&=3& \quad y_{t-2}

\end{eqnarray*}\]

This gives us 36 instruments, which are what the table calls GMM-type instruments. GMM has been explored in the blog post Estimating parameters by maximum likelihood and method of moments using mlexp and gmm, and we will talk about it in a later post. The other three instruments are given by the first differences of the regressors **educ** and **married** and by the constant. This is no different from two-stage least squares, where we include the exogenous variables as part of our instrument list.
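The instrument count is easy to verify. The short Python script below (an illustration, not part of the original post) counts the available lags at each period:

```python
# Count the Arellano-Bond GMM-type instruments for T = 10 time periods.
# At each t = 3, ..., 10 the available instruments are the levels
# y_{i1}, ..., y_{i(t-2)}, that is, t - 2 instruments per period.
T = 10
gmm_type = sum(t - 2 for t in range(3, T + 1))
standard = 3  # D.married, D.educ, and the constant
print(gmm_type, gmm_type + standard)  # 36 39
```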

**Testing for serial correlation**

The key for the instrument set in Arellano–Bond to work is that

\[\begin{equation}

E\left( \Delta y_{i(t-j)} \Delta \varepsilon_{it} \right) = 0 \quad j \geq 2

\end{equation}\]

We can test these conditions in Stata using **estat abond**. In essence, the differenced time-varying unobserved component should be unrelated to the second lag of the dependent variable and the lags thereafter. If this is not the case, we are back to the initial problem: endogeneity. Again, a bit of math will help us understand what is going on.

All is well if

\[\begin{equation}

\Delta \varepsilon_{it} = \Delta \nu_{it}

\end{equation}\]

The differenced error \(\Delta \varepsilon_{it}\) is then serially correlated of order 1 but not of order 2 or beyond.

But we are in trouble if

\[\begin{equation}

\Delta \varepsilon_{it} = \Delta \nu_{it} + \Delta \nu_{i(t-1)}

\end{equation}\]

The second lag of the dependent variable will be related to the differenced time-varying component \(\Delta \varepsilon_{it}\). Another way of saying this is that the differenced time-varying unobserved component is serially correlated with an order greater than 1.
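The two cases can be checked by simulation. The Python sketch below (an illustration, not the post's code) builds \(\Delta \varepsilon_{it}\) under each assumption and computes the sample autocorrelations of the differenced errors: in the well-behaved case only the first-order autocorrelation of \(\Delta \varepsilon_{it}\) is nonzero; in the trouble case there is second-order autocorrelation.

```python
import random

random.seed(12345)

# Simulate nu_t iid standard normal and form the two error processes in the
# text: eps_t = nu_t (all is well) and eps_t = nu_t + nu_{t-1} (trouble).
# Then inspect the serial correlation of the differenced errors at lags 1 and 2.
T = 200_000
nu = [random.gauss(0, 1) for _ in range(T)]

def autocorr(x, lag):
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t - lag] - m) for t in range(lag, n))
    den = sum((v - m) ** 2 for v in x)
    return num / den

# Case 1: eps_t = nu_t, so D.eps_t = nu_t - nu_{t-1} is MA(1)
d_eps_iid = [nu[t] - nu[t - 1] for t in range(1, T)]

# Case 2: eps_t = nu_t + nu_{t-1}, so D.eps_t = nu_t - nu_{t-2} is MA(2)
d_eps_ma1 = [(nu[t] + nu[t - 1]) - (nu[t - 1] + nu[t - 2]) for t in range(2, T)]

print(autocorr(d_eps_iid, 1), autocorr(d_eps_iid, 2))  # about -.5 and 0
print(autocorr(d_eps_ma1, 1), autocorr(d_eps_ma1, 2))  # about 0 and -.5
```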

**estat abond** provides a test for the serial correlation structure. For the example above,

. estat abond

Arellano-Bond test for zero autocorrelation in first-differenced errors
  +-----------------------+
  |Order |  z     Prob > z|
  |------+----------------|
  |   1  |-22.975  0.0000 |
  |   2  |-.36763  0.7132 |
  +-----------------------+
   H0: no autocorrelation

We reject the null of no first-order autocorrelation and cannot reject the null of no second-order autocorrelation. This is evidence that the Arellano–Bond model assumptions are satisfied. If this were not the case, we would have to look for different instruments; essentially, we would have to fit a different dynamic model. The **xtdpd** command allows us to do this, but it is beyond the scope of this post.

**Parting words**

Dynamic panel-data models provide a useful research framework. In this post, I touched on the interpretation of a couple of results from estimation and postestimation from **xtabond** that will help you understand your output.

**The formulas for our estimator**

The formulas for the sample average and its estimated sampling variance, assuming an independently and identically distributed process, are

\[

\widehat{\mu} = 1/N \sum_{i=1}^N y_i

\]

\[

\widehat{Var}(\widehat{\mu}) = 1/[N(N-1)] \sum_{i=1}^N (y_i-\widehat{\mu})^2

\]
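As a sanity check on these formulas, here is a small hand-computable example in Python (an illustrative aside, not part of the original post):

```python
# muhat = (1/N) * sum(y_i);  Var(muhat) = 1/(N*(N-1)) * sum((y_i - muhat)^2)
y = [1.0, 2.0, 3.0, 4.0]
N = len(y)
muhat = sum(y) / N
V = sum((yi - muhat) ** 2 for yi in y) / (N * (N - 1))
print(muhat, V)  # 2.5 and 5/12
```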

The code **mean1.do** performs these computations on **price** from the auto dataset.

// version 1.0.0 20Oct2015
version 14
sysuse auto
quietly summarize price
local sum = r(sum)
local N = r(N)
local mu = (1/`N')*`sum'
generate double e2 = (price - `mu')^2
quietly summarize e2
local V = (1/((`N')*(`N'-1)))*r(sum)
display "muhat = " `mu'
display " V = " `V'

**mean1.do** uses **summarize** to compute the summations. Lines 5–7 and line 11 store results returned by **summarize** in **r()** into local macros that are subsequently used to compute the formulas. I recommend that you use **double**, instead of the default **float**, for all variables used in your formulas; it is almost always worth taking up the extra memory to gain the extra precision that **double** offers. (Essentially, each variable takes up twice as much space, but you get calculations that are correct to about \(10^{-16}\) instead of \(10^{-8}\).)
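To see why **double** matters, the following Python snippet (an illustrative aside; `as_float32` is a helper defined here, not a Stata function) round-trips a value through 4-byte IEEE floating point, the precision of Stata's **float** storage type:

```python
import struct

def as_float32(x):
    # round-trip a Python float (a C double) through 4-byte IEEE float,
    # mimicking storage in Stata's float type
    return struct.unpack('f', struct.pack('f', x))[0]

price = 16777217.0        # 2^24 + 1: exactly representable as double only
print(as_float32(price))  # 16777216.0: the trailing 1 is lost in float
```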

These calculations yield

**Example 1: Computing the average and its sampling variance**

. do mean1

. // version 1.0.0 20Oct2015
. version 14
. sysuse auto
(1978 Automobile Data)
. quietly summarize price
. local sum = r(sum)
. local N = r(N)
. local mu = (1/`N')*`sum'
. generate double e2 = (price - `mu')^2
. quietly summarize e2
. local V = (1/((`N')*(`N'-1)))*r(sum)
. display "muhat = " `mu'
muhat = 6165.2568
. display " V = " `V'
 V = 117561.16
. 
end of do-file

Now I verify that **mean** produces the same results.

**Example 2: Results from mean**

. mean price

Mean estimation                   Number of obs   =         74

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       price |   6165.257   342.8719      5481.914      6848.6
--------------------------------------------------------------

. matrix list e(b)

symmetric e(b)[1,1]
        price
y1  6165.2568

. matrix list e(V)

symmetric e(V)[1,1]
           price
price  117561.16

**A first ado-file**

The code in **mymean1.ado** performs the same calculations as **mean1.do**. (The file **mymean1.ado** is in my current working directory.)

*! version 1.0.0 20Oct2015
program define mymean1
    version 14

    quietly summarize price
    local sum = r(sum)
    local N = r(N)
    local mu = (1/`N')*`sum'
    capture drop e2               // Drop e2 if it exists
    generate double e2 = (price - `mu')^2
    quietly summarize e2
    local V = (1/((`N')*(`N'-1)))*r(sum)
    display "muhat = " `mu'
    display " V = " `V'
end

Line 1 of **mymean1.ado** specifies that the file defines the command **mymean1**. The command name must be the same as the file name, excluding the **.ado** suffix. The **mymean1** command performs the same computations as the do-file **mean1.do**.

**Example 3: Results from mymean1**

. mymean1
muhat = 6165.2568
 V = 117561.16

**A slightly better command**

We want our command to be reusable; we want it to estimate the mean for any variable in memory, instead of only for price as performed by **mymean1.ado**. On line 5 of **mymean2.ado**, we use the **syntax** command to store the name of the variable specified by the user into the local macro **varlist**, which we use in the remainder of the computations.

*! version 2.0.0 20Oct2015
program define mymean2
    version 14

    syntax varlist

    display "The local macro varlist contains `varlist'"
    quietly summarize `varlist'
    local sum = r(sum)
    local N = r(N)
    local mu = (1/`N')*`sum'
    capture drop e2               // Drop e2 if it exists
    generate double e2 = (`varlist' - `mu')^2
    quietly summarize e2
    local V = (1/((`N')*(`N'-1)))*r(sum)
    display "The average of `varlist' is " `mu'
    display "The estimated variance of the average is " `V'
end

The extremely powerful **syntax** command puts the elements of Stata syntax specified by the user into local macros and throws errors when the user makes a mistake. I will discuss **syntax** in greater detail in subsequent posts.

I begin by illustrating how to replicate the previous results.

**Example 4: Results from mymean2 price**

. mymean2 price
The local macro varlist contains price
The average of price is 6165.2568
The estimated variance of the average is 117561.16

I now illustrate that it works for another variable.

**Example 5: Results from mymean2 trunk**

. mymean2 trunk
The local macro varlist contains trunk
The average of trunk is 13.756757
The estimated variance of the average is .24724576

. mean trunk

Mean estimation                   Number of obs   =         74

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       trunk |   13.75676   .4972381      12.76576    14.74775
--------------------------------------------------------------

. display "The variance of the estimator is " (_se[trunk])^2
The variance of the estimator is .24724576

**Storing results in e()**

**mymean2.ado** does not save the results that it displays. We fix this problem in **mymean3.ado**. Line 2 specifies the option **eclass** on **program define** to make **mymean3** an e-class command. Line 18 uses **ereturn post** to move the matrix of point estimates (**b**) and the estimated variance-covariance matrix of the estimator (**V**) into **e(b)** and **e(V)**. The estimation-postestimation framework uses parameter names for display, hypothesis tests, and other features. In lines 15 and 16, we put those names into the column stripes of the vector of estimates and the estimated VCE. In line 17, we put those names into the row stripe of the estimated VCE.

*! version 3.0.0 20Oct2015
program define mymean3, eclass
    version 14

    syntax varlist

    quietly summarize `varlist'
    local sum = r(sum)
    local N = r(N)
    matrix b = (1/`N')*`sum'
    capture drop e2               // Drop e2 if it exists
    generate double e2 = (`varlist' - b[1,1])^2
    quietly summarize e2
    matrix V = (1/((`N')*(`N'-1)))*r(sum)
    matrix colnames b = `varlist'
    matrix colnames V = `varlist'
    matrix rownames V = `varlist'
    ereturn post b V
    ereturn display
end

The **ereturn display** command on line 19 of **mymean3.ado** easily creates a standard output table using the results now stored in **e(b)** and **e(V)**.

**Example 6: Results from mymean3 trunk**

. mymean3 trunk
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       trunk |   13.75676   .4972381    27.67   0.000     12.78219    14.73133
------------------------------------------------------------------------------

**Estimation-postestimation**

**test**, **lincom**, **testnl**, **nlcom**, and other Wald-based estimation-postestimation features work after **mymean3** because all the required information is stored in **e(b)** and **e(V)**.

To illustrate, I perform a Wald test of the null hypothesis that the mean of **trunk** is \(11\).

**Example 7: test works after mymean3**

. test _b[trunk]==11

 ( 1)  trunk = 11

           chi2(  1) =   30.74
         Prob > chi2 =    0.0000

The results stored in **e()** are the glue that holds the estimation-postestimation framework together. We have only stored **e(b)** and **e(V)** so far, so not all the standard features are working yet. (But we will get there in the #StataProgramming series.)

**Using temporary names for global objects**

Stata variables and matrices are global, as discussed in my previous blog post. We need safe names for global objects: names that are not in use elsewhere, and temporary in the sense that we want Stata to drop the corresponding objects when the command finishes. The **tempvar** and **tempname** commands put safe names into local macros and then drop the corresponding objects when the ado-file or do-file finishes. We explicitly dropped **e2**, if it existed, in line 9 of code block 2, in line 12 of code block 3, and in line 11 of code block 4. We do not need such a line here, because we are using temporary variable names.

In line 7 of **mymean4.ado**, the **tempvar** command puts a safe name into the local macro **e2**. In line 8 of **mymean4.ado**, the **tempname** command puts safe names into the local macros **b** and **V**. I illustrate the format followed by these safe names by displaying them on lines 9–11. The output reveals that a leading pair of underscores is followed by numbers and capital letters. Line 15 illustrates the use of these safe names. Instead of creating the matrix **b**, we create the matrix whose name is stored in the local macro **b**. In line 8, the **tempname** command created the local macro **b** to hold a safe name.

*! version 4.0.0 20Oct2015
program define mymean4, eclass
    version 14

    syntax varlist

    tempvar e2
    tempname b V
    display "The safe name in e2 is `e2'"
    display "The safe name in b is `b'"
    display "The safe name in V is `V'"
    quietly summarize `varlist'
    local sum = r(sum)
    local N = r(N)
    matrix `b' = (1/`N')*`sum'
    generate double `e2' = (`varlist' - `b'[1,1])^2
    quietly summarize `e2'
    matrix `V' = (1/((`N')*(`N'-1)))*r(sum)
    matrix colnames `b' = `varlist'
    matrix colnames `V' = `varlist'
    matrix rownames `V' = `varlist'
    ereturn post `b' `V'
    ereturn display
end

This code produces the output

**Example 8: Results from mymean4 trunk**

. mymean4 trunk
The safe name in e2 is __000000
The safe name in b is __000001
The safe name in V is __000002
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       trunk |   13.75676   .4972381    27.67   0.000     12.78219    14.73133
------------------------------------------------------------------------------

Removing the lines that display the safe names contained in the local macros yields **mymean5.ado**.

*! version 5.0.0 20Oct2015
program define mymean5, eclass
    version 14

    syntax varlist

    tempvar e2
    tempname b V
    quietly summarize `varlist'
    local sum = r(sum)
    local N = r(N)
    matrix `b' = (1/`N')*`sum'
    generate double `e2' = (`varlist' - `b'[1,1])^2
    quietly summarize `e2'
    matrix `V' = (1/((`N')*(`N'-1)))*r(sum)
    matrix colnames `b' = `varlist'
    matrix colnames `V' = `varlist'
    matrix rownames `V' = `varlist'
    ereturn post `b' `V'
    ereturn display
end

This code produces the output

**Example 9: Results from mymean5 trunk**

. mymean5 trunk
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       trunk |   13.75676   .4972381    27.67   0.000     12.78219    14.73133
------------------------------------------------------------------------------

**Done and undone**

I illustrated some basic ado-file programming techniques by implementing a command that estimates the mean of a variable. Even though we have a command that produces correct, easy-to-read output and has some estimation-postestimation features, we have only scratched the surface of what we usually want an estimation command to do. I dig a little deeper in the next few posts by developing a command that performs ordinary least-squares estimation.

I am building on a previous post in which I demonstrated how to use **mlexp** to estimate the parameters of a probit model with sample selection. Our results match those obtained with **biprobit**; see [R] biprobit for more details. In a future post, I use these techniques to estimate treatment-effect parameters not yet available from another Stata command.

**Probit model with treatment**

In this section, I describe the potential-outcome framework used to define an ATE. For each treatment level, there is an outcome that we would observe if a person were to select that treatment level. When the outcome is binary and there are two treatment levels, we can specify how the potential outcomes \(y_{0i}\) and \(y_{1i}\) are generated from the regressors \({\bf x}_i\) and the error terms \(\epsilon_{0i}\) and \(\epsilon_{1i}\):

\[\begin{eqnarray*}

y_{0i} &=& {\bf 1}({\bf x}_i{\boldsymbol \beta}_0 + \epsilon_{0i} > 0) \cr

y_{1i} &=& {\bf 1}({\bf x}_i{\boldsymbol \beta}_1 + \epsilon_{1i} > 0)

\end{eqnarray*}\]

(Assuming that each error is standard normal, this gives us a bivariate probit model.) The indicator function \({\bf 1}(\cdot)\) outputs 1 when its input is true and 0 otherwise.

The probit model for potential outcomes \(y_{0i}\) and \(y_{1i}\) with treatment \(t_i\) assumes that we observe the outcome

\[\begin{equation}

y_i = (1-t_i) y_{0i} + t_i y_{1i}

\nonumber

\end{equation}\]

So we observe \(y_{1i}\) under the treatment (\(t_{i}=1\)) and \(y_{0i}\) when the treatment is withheld (\(t_{i}=0\)).

The treatment \(t_i\) is determined by regressors \({\bf z}_i\) and standard normal error \(u_i\):

\[\begin{equation}

t_i = {\bf 1}({\bf z}_i{\boldsymbol \gamma} + u_i > 0)

\nonumber

\end{equation}\]

**Probit model with endogenous treatment**

We could estimate the parameters \({\boldsymbol \beta}_0\) and \({\boldsymbol \beta}_1\) using a probit regression on \(y_i\) if \(t_i\) were not related to the unobserved errors \(\epsilon_{0i}\) and \(\epsilon_{1i}\). This is not always the case. Suppose we modeled whether parents send their children to private school and used private tutoring for the child as a treatment. Unobserved factors that influence private school enrollment may be correlated with the unobserved factors that influence whether private tutoring is given. The treatment would then be correlated with the unobserved errors of the outcome.

We can treat \(t_i\) as endogenous by allowing \(\epsilon_{0i}\) and \(\epsilon_{1i}\) to be correlated with \(u_i\). In this post, we will assume that these correlations are the same. Formally, \(\epsilon_{0i}\), \(\epsilon_{1i}\), and \(u_i\) are trivariate normal with covariance:

\[\begin{equation}

\left[\begin{matrix}

1 & \rho_{01} & \rho_{t} \cr

\rho_{01} & 1 & \rho_{t} \cr

\rho_{t} & \rho_{t} & 1

\end{matrix}\right]

\nonumber

\end{equation}\]

The correlation \(\rho_{01}\) cannot be identified because we never observe both \(y_{0i}\) and \(y_{1i}\). However, identification of \(\rho_{01}\) is not necessary to estimate the other parameters, because we will observe the covariates and outcome in observations from each treatment group.

The log-likelihood for observation \(i\) is

\[\begin{eqnarray*}

\ln L_i = & & {\bf 1}(y_i =1 \mbox{ and } t_i = 1) \ln \Phi_2({\bf x}_i{\boldsymbol \beta}_1, {\bf z}_i{\boldsymbol \gamma},\rho_t) + \cr

& & {\bf 1}(y_i=0 \mbox{ and } t_i=1)\ln \Phi_2(-{\bf x}_i{\boldsymbol \beta}_1, {\bf z}_i{\boldsymbol \gamma},-\rho_t) + \cr

& & {\bf 1}(y_i=1 \mbox{ and } t_i=0) \ln \Phi_2({\bf x}_i{\boldsymbol \beta}_0, -{\bf z}_i{\boldsymbol \gamma},-\rho_t) + \cr

& & {\bf 1}(y_i=0 \mbox{ and } t_i = 0)\ln \Phi_2(-{\bf x}_i{\boldsymbol \beta}_0, -{\bf z}_i{\boldsymbol \gamma},\rho_t)

\end{eqnarray*}\]

where \(\Phi_2\) is the bivariate normal cumulative distribution function.
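To make the four branches concrete, here is a numerical sketch of this log likelihood in pure Python (an illustration, not the blog's code; `binormal` is a simple quadrature-based implementation of \(\Phi_2\), and `xb` stands for \({\bf x}_i{\boldsymbol \beta}_1\) or \({\bf x}_i{\boldsymbol \beta}_0\) depending on the treatment level):

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    # standard normal cumulative distribution function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def binormal(a, b, rho, n=4000):
    # P(U <= a, V <= b) for standard bivariate normal (U, V) with correlation
    # rho, using V | U = u ~ N(rho*u, 1 - rho^2) and trapezoid quadrature in u
    lo = -8.0
    h = (a - lo) / n
    s = math.sqrt(1 - rho * rho)
    total = 0.0
    for i in range(n + 1):
        u = lo + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * phi(u) * Phi((b - rho * u) / s)
    return total * h

def loglik_i(y, t, xb, zg, rho):
    # the four cases of ln L_i; xb plays the role of x*b1 when t = 1 and
    # x*b0 when t = 0, and zg is z*gamma
    if y == 1 and t == 1:
        return math.log(binormal(xb, zg, rho))
    if y == 0 and t == 1:
        return math.log(binormal(-xb, zg, -rho))
    if y == 1 and t == 0:
        return math.log(binormal(xb, -zg, -rho))
    return math.log(binormal(-xb, -zg, rho))

print(round(binormal(0, 0, 0.7), 4))  # 1/4 + arcsin(.7)/(2*pi), about .3734
```

A useful check on the four branches is that the case probabilities sum to one for any parameter values, because the cases partition the sample space.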

This model is a variation of the bivariate probit model. For a good introduction to the bivariate probit model, see Pindyck and Rubinfeld (1998).

**The data**

We will simulate data from a probit model with an endogenous treatment and then estimate the parameters of the model using **mlexp**. Then, we will use **margins** to estimate the ATE. We simulate a random sample of 10,000 observations.

. set seed 3211

. set obs 10000
number of observations (_N) was 0, now 10,000

. gen x = rnormal() + 4

. gen b = rpoisson(1)

. gen z = rnormal()

First, we generate the regressors. The variable \(x\) has a normal distribution with a mean of 4 and variance of 1. It is used as a regressor for the outcome and treatment. The variable \(b\) has a Poisson distribution with a mean of 1 and will be used as a treatment regressor. A standard normal variable \(z\) is also used as a treatment regressor.

. matrix cm = (1, .3, .7 \ .3, 1, .7 \ .7, .7, 1)

. drawnorm ey0 ey1 et, corr(cm)

. gen t = .5*x - .1*b + .4*z - 2.4 + et > 0

. gen y0 = .6*x - .8 + ey0 > 0

. gen y1 = .3*x - 1.2 + ey1 > 0

. gen y = (1-t)*y0 + t*y1

Next, we draw the unobserved errors. The potential outcome and treatment errors will have correlation \(.7\). We generate the errors using the **drawnorm** command. Finally, the outcome and treatment indicators are created.

**Estimating the model parameters**

Now, we will use **mlexp** to estimate the parameters of the probit model with an endogenous treatment. As in the previous post, we use the **cond()** function to calculate different values of the likelihood based on the different values of \(y\) and \(t\). We use the factor variable operator **ibn** on \(t\) in equation **y** to allow for a different intercept at each level of \(t\). An interaction between \(t\) and \(x\) is also specified in equation **y**. This allows for a different coefficient on \(x\) at each level of \(t\). We also specify **vce(robust)** so that we can use **vce(unconditional)** when we use **margins** later.

. mlexp (ln(cond(t,cond(y,binormal({y: i.t#c.x ibn.t},  ///
>                {t: x b z _cons}, {rho}),              ///
>        binormal(-{y:},{t:}, -{rho})),                 ///
>        cond(y,binormal({y:},-{t:},-{rho}),            ///
>        binormal(-{y:},-{t:},{rho})))))                ///
>        , vce(robust)

initial:       log pseudolikelihood = -13862.944
alternative:   log pseudolikelihood = -15511.071
rescale:       log pseudolikelihood = -13818.369
rescale eq:    log pseudolikelihood = -10510.488
Iteration 0:   log pseudolikelihood = -10510.488  (not concave)
Iteration 1:   log pseudolikelihood = -10004.946
Iteration 2:   log pseudolikelihood = -9487.4032
Iteration 3:   log pseudolikelihood = -9286.0118
Iteration 4:   log pseudolikelihood =  -9183.901
Iteration 5:   log pseudolikelihood = -9181.9207
Iteration 6:   log pseudolikelihood = -9172.0256
Iteration 7:   log pseudolikelihood = -9170.8198
Iteration 8:   log pseudolikelihood = -9170.7994
Iteration 9:   log pseudolikelihood = -9170.7994

Maximum likelihood estimation

Log pseudolikelihood = -9170.7994               Number of obs     =     10,000

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y            |
       t#c.x |
          0  |   .5829362   .0223326    26.10   0.000     .5391651    .6267073
          1  |   .2745585   .0259477    10.58   0.000     .2237021     .325415
             |
           t |
          0  |  -.7423227   .0788659    -9.41   0.000     -.896897   -.5877483
          1  |  -1.088765   .1488922    -7.31   0.000    -1.380589   -.7969419
-------------+----------------------------------------------------------------
t            |
           x |   .4900691   .0148391    33.03   0.000     .4609851    .5191532
           b |  -.1086717   .0132481    -8.20   0.000    -.1346375   -.0827059
           z |   .4135792   .0150112    27.55   0.000     .3841579    .4430006
       _cons |  -2.354418   .0640056   -36.78   0.000    -2.479867   -2.228969
-------------+----------------------------------------------------------------
        /rho |   .7146737   .0377255    18.94   0.000     .6407331    .7886143
------------------------------------------------------------------------------

Our parameter estimates are close to their true values.

**Estimating the ATE**

The ATE of \(t\) is the expected value of the difference between \(y_{1i}\) and \(y_{0i}\), the average difference between the potential outcomes. Using the law of iterated expectations, we have

\[\begin{eqnarray*}

E(y_{1i}-y_{0i}) &=& E\{E(y_{1i}-y_{0i}|{\bf x}_i)\} \cr

&=& E\{\Phi({\bf x}_i{\boldsymbol \beta}_1)-

\Phi({\bf x}_i{\boldsymbol \beta}_0)\}

\end{eqnarray*}\]

This can be estimated as a predictive margin.
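Before running **margins**, we can anticipate the answer. For \(x \sim N(\mu, 1)\) and an independent standard normal error, \(E\{\Phi(a + bx)\} = \Phi\{(a + b\mu)/\sqrt{1 + b^2}\}\), so the population ATE implied by the simulation's coefficients has a closed form. A quick Python check (an illustration, not part of the original post):

```python
import math

def Phi(x):
    # standard normal cumulative distribution function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# For x ~ N(mu, 1) and an independent standard normal error,
# E[Phi(a + b*x)] = Phi((a + b*mu) / sqrt(1 + b^2)).
# The simulation used y1: .3*x - 1.2, y0: .6*x - .8, and x ~ N(4, 1).
p1 = Phi((-1.2 + 0.3 * 4) / math.sqrt(1 + 0.3 ** 2))  # E(y1), exactly Phi(0) = .5
p0 = Phi((-0.8 + 0.6 * 4) / math.sqrt(1 + 0.6 ** 2))  # E(y0), about .915
ate = p1 - p0
print(round(ate, 3))  # about -.415, close to the margins estimate below
```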

Now, we estimate the ATE using **margins**. We specify the normal probability expression in the **expression()** option. The **xb()** term refers to the linear prediction of the first equation, which we can now predict in Stata 14.1. We specify **r.t** so that **margins** will take the difference of the expression under \(t=1\) and \(t=0\). We specify **vce(unconditional)** to obtain standard errors for the population ATE rather than the sample ATE. The **contrast(nowald)** option is specified to omit the Wald test for the difference.

. margins r.t, expression(normal(xb())) vce(unconditional) contrast(nowald)

Contrasts of predictive margins

Expression   : normal(xb())

--------------------------------------------------------------
             |            Unconditional
             |   Contrast   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           t |
   (1 vs 0)  |  -.4112345   .0248909     -.4600197   -.3624493
--------------------------------------------------------------

We estimate that the ATE of \(t\) on \(y\) is \(-.41\). So taking the treatment decreases the probability of a positive outcome by \(.41\) on average over the population.

We will compare this estimate to the sample difference of \(y_{1}\) and \(y_{0}\).

. gen diff = y1 - y0

. sum diff

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        diff |     10,000      -.4132    .5303715         -1          1

In our sample, the average difference of \(y_{1}\) and \(y_{0}\) is also \(-.41\).

**Conclusion**

I have demonstrated how to estimate the parameters of a model with a complex likelihood function: the probit model with an endogenous treatment using **mlexp**. See [R] mlexp for more details about **mlexp**. I have also demonstrated how to use **margins** to estimate the ATE for the probit model with an endogenous treatment. See [R] margins for more details about **margins**.

**Reference**

Pindyck, R. S., and D. L. Rubinfeld. 1998. *Econometric Models and Economic Forecasts*. 4th ed. New York: McGraw-Hill.

In every command I write, I use local macros to store stuff in a workspace that will not alter a user’s data and to make my code easier to read. A good understanding of the differences between global macros and local macros helps me to write better code. The essential differences between global macros and local macros can be summarized in two points.

1. There is only one global macro with a specific name in Stata, and its contents can be accessed or changed by a Stata command executed at any Stata level.
2. In contrast, each Stata level can have a local macro of a specific name, and each one’s contents cannot be accessed or changed by commands executed at other Stata levels.

If you are already comfortable with 1 and 2, skip the remainder of this post.

**Global macros are global**

The do-files **globala.do** and **globalb.do** in code blocks **globala** and **globalb** illustrate what it means to be global.

*-------------------------------Begin globala.do ---------------
*! globala.do
* In this do-file we define the global macro vlist, but we
* do not use it
global vlist var1 var2 var3

do globalb
*-------------------------------End globala.do ---------------

*-------------------------------Begin globalb.do ---------------
*! globalb.do
* In this do-file, we use the global macro vlist, defined in globala.do

display "The global macro vlist contains $vlist"
*-------------------------------End globalb.do ---------------

The easiest way to see what this code does is to execute it; the output is in example 1.

**Example 1: Output from do globala**

. do globala

. *-------------------------------Begin globala.do ---------------
. *! globala.do
. * In this do-file we define the global macro vlist, but we
. * do not use it
. global vlist var1 var2 var3
. 
. do globalb

. *-------------------------------Begin globalb.do ---------------
. *! globalb.do
. * In this do-file, we use the global macro vlist, defined in globala.do
. 
. display "The global macro vlist contains $vlist"
The global macro vlist contains var1 var2 var3
. *-------------------------------End globalb.do ---------------
. 
end of do-file

. *-------------------------------End globala.do ---------------
. 
end of do-file

Line 5 of **globalb.do** can access the contents of **vlist** created on line 5 of **globala.do** because **vlist** is a global macro.

Figure 1 makes this same point graphically: the global macro **vlist** is in global memory, and a command executed anywhere can access or change the contents of **vlist**.

**Figure 1: A global macro in global memory**

**Local macros are local**

The do-files **locala.do** and **localb.do** in code blocks 3 and 4 illustrate what it means to be local.

*-------------------------------Begin locala.do ---------------
*! locala.do
local mylist "a b c"
display "mylist contains `mylist'"

do localb

display "mylist contains `mylist'"
*-------------------------------End locala.do ---------------

*-------------------------------Begin localb.do ---------------
*! localb.do
local mylist "x y z"
display "mylist contains `mylist'"
*-------------------------------End localb.do ---------------

The easiest way to see what this code does is to execute it; the output is in example 2.

**Example 2: Output from do locala**

. do locala

. *-------------------------------Begin locala.do ---------------
. *! locala.do
. local mylist "a b c"
. display "mylist contains `mylist'"
mylist contains a b c
. 
. do localb

. *-------------------------------Begin localb.do ---------------
. *! localb.do
. local mylist "x y z"
. display "mylist contains `mylist'"
mylist contains x y z
. *-------------------------------End localb.do ---------------
. 
end of do-file

. 
. display "mylist contains `mylist'"
mylist contains a b c
. *-------------------------------End locala.do ---------------
. 
end of do-file

The code in blocks 3 and 4 and the output in example 2 illustrate that a command executed at the level of **localb.do** cannot change the local macro **mylist** that is local to **locala.do**. Line 8 of **locala.do** displays the contents of the **mylist** local to **locala.do**. The contents are still **a b c** after **localb.do** finishes because the local macro **mylist** created on line 3 of **locala.do** is local to **locala.do** and it is unaffected by the **mylist** created on line 3 of **localb.do**.

Figure 2 makes this point graphically. The contents of the local macro **mylist** that is local to **locala.do** can be accessed and changed by commands run in **locala.do**, but not by commands run in **localb.do**. Analogously, the contents of the local macro **mylist** that is local to **localb.do** can be accessed and changed by commands run in **localb.do**, but not by commands run in **locala.do**.

**Figure 2: Local macros are local to do-files**

**Done and undone**

I essentially provided you with a technical appendix to the previous #StataProgramming post. I illustrated that global macros are global and that local macros are local. I use the concepts developed thus far to present an ado-command in the next post.

**What is going on?**

Say I want to fit a linear panel-data model and need to decide whether to use a random-effects or fixed-effects estimator. My decision depends on how time-invariant unobservable variables are related to variables in my model. Here are two examples that may yield different answers:

- A panel dataset of individuals endowed with innate ability that does not change over time
- A panel dataset of countries where the time-invariant unobservables in our model are sets of country-specific geographic characteristics

In the first case, innate ability can affect observable characteristics such as the amount of schooling someone pursues. In the second case, geographic characteristics are probably not correlated with the variables in our model. Of course, these are conjectures, and we want a test to verify if unobservables are related to the variables in our model.

First, I will tell you how to compute the test; then, I will explain the theory and intuition behind it.

**Computing the test**

1. Compute the panel-level averages of your time-varying covariates.
2. Use a random-effects estimator to regress your outcome on your covariates and the panel-level means generated in (1).
3. Test that the coefficients on the panel-level means generated in (1) are jointly zero.

If you reject the null hypothesis that the coefficients are jointly zero, the test suggests that there is correlation between the time-invariant unobservables and your regressors; that is, the fixed-effects assumptions are satisfied. If you cannot reject the null, there is no evidence of correlation between the time-invariant unobservables and your regressors; that is, the random-effects assumptions are satisfied.

Below I demonstrate the three-step procedure above using simulated data. The data satisfy the fixed-effects assumptions and have two time-varying covariates and one time-invariant covariate.
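The post does not show how the data were simulated. A minimal sketch of a data-generating do-file that satisfies the fixed-effects assumptions might look like the following; the seed, sample sizes, and coefficients here are hypothetical, not the ones used in the post:

```stata
// Hypothetical DGP (not from the post): the individual effect a
// is built into x2 and x3, so it is correlated with the regressors
clear
set seed 12345
set obs 500                         // 500 panels
generate id = _n
generate a  = rnormal()             // time-invariant unobservable
generate x1 = runiform()            // time-invariant covariate
expand 5                            // 5 periods per panel
bysort id: generate t = _n
generate x2 = rnormal() + .5*a      // time-varying, correlated with a
generate x3 = rnormal() + .5*a      // time-varying, correlated with a
generate y  = 1 + x1 + x2 + x3 + a + rnormal()
xtset id t
```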

**STEP 1**

. bysort id: egen mean_x2 = mean(x2)

. bysort id: egen mean_x3 = mean(x3)

**STEP 2**

. quietly xtreg y x1 x2 x3 mean_x2 mean_x3, vce(robust)

. estimates store mundlak

**STEP 3**

. test mean_x2 mean_x3

 ( 1)  mean_x2 = 0
 ( 2)  mean_x3 = 0

           chi2(  2) =    8.94
         Prob > chi2 =    0.0114

We reject the null hypothesis. This suggests that time-invariant unobservables are related to our regressors and that the fixed-effects model is appropriate. Note that I used a robust estimator of the variance-covariance matrix. I could not have done this if I had used a Hausman test.
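For comparison, the traditional Hausman test would be computed as below. This is a sketch of the standard approach, not code from the post; note that it requires the conventional (nonrobust) variance estimator:

```stata
// Classic Hausman test (sketch): compare FE and RE estimates
quietly xtreg y x1 x2 x3, fe
estimates store fixed
quietly xtreg y x1 x2 x3, re
estimates store random
hausman fixed random
```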

**Where all this came from**

A linear panel-data model is given by

\[\begin{equation*}

y_{it} = x_{it}\beta + \alpha_i + \varepsilon_{it}

\end{equation*}\]

The index \(i\) denotes the individual and the index \(t\) time. \(y_{it}\) is the outcome of interest, \(x_{it}\) is the set of regressors, \(\varepsilon_{it}\) is the time-varying unobservable, and \(\alpha_i\) is the time-invariant unobservable.

The key to the Mundlak approach is to determine if \(\alpha_i\) and \(x_{it}\) are correlated. We know how to think about this problem from our regression intuition. We can think of the mean of \(\alpha_i\) conditional on the time-invariant part of our regressors in the same way that we think of the mean of our outcome conditional on our covariates.

\[\begin{eqnarray*}

\alpha_i &=& \bar{x}_i\theta + \nu_i \\

E\left(\alpha_i|x_i\right) &=& \bar{x}_i\theta

\end{eqnarray*}\]

In the expression above, \(\bar{x}_i\) is the panel-level mean of \(x_{it}\), and \(\nu_i\) is a time-invariant unobservable that is uncorrelated to the regressors.

As in regression, if \(\theta = 0\), we know \(\alpha_i\) and the covariates are uncorrelated. This is what we test. The implied model is given by

\[\begin{eqnarray*}

y_{it} &=& x_{it}\beta + \alpha_i + \varepsilon_{it} \\

y_{it} &=& x_{it}\beta + \bar{x}_i\theta + \nu_i + \varepsilon_{it} \\

E\left(y_{it}|x_{it}\right) &=& x_{it}\beta + \bar{x}_i\theta

\end{eqnarray*}\]

The second equality replaces \(\alpha_i\) by \(\bar{x}_i\theta + \nu_i\). The third equality relies on the fact that the regressors and unobservables are mean independent. The test is given by

\[\begin{equation*}

H_{\text{o}}: \theta = 0

\end{equation*}\]

**Reference**

Mundlak, Y. 1978. On the pooling of time series and cross section data. *Econometrica* 46: 69–85.

If you tell me “I program in Stata”, it makes me happy, but I do not know what you mean. Do you write scripts to make your research reproducible, or do you write Stata commands that anyone can use and reuse? In the series #StataProgramming, I will show you how to write your own commands, but I start at the beginning. Discussing the difference between scripts and commands here introduces some essential programming concepts and constructions that I use to write scripts and commands.

**Scripts versus commands**

A script is a program that always performs the same tasks on the same inputs and produces exactly the same results. Scripts in Stata are known as do-files and the files containing them end in **.do**. For example, I could write a do-file to

- read in the National Longitudinal Study of Youth (NLSY) dataset,
- clean the data,
- form a sample for some population, and
- run a bunch of regressions on the sample.

This structure is at the heart of reproducible research: produce the same results from the same inputs every time. Do-files have a one-off structure. For example, I could not somehow tell this do-file that I want it to perform the analogous tasks on the Panel Study of Income Dynamics (PSID). Commands are reusable programs that take arguments to perform a task on any data of a certain type. For example, **regress** performs ordinary least squares on the specified variables regardless of whether they come from the NLSY, the PSID, or any other dataset. Stata commands are written in the automatic do-file (ado) language; the files containing them end in **.ado**. Stata commands written in the ado language are known as ado-commands.
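To make the contrast concrete, here is a minimal sketch of an ado-command (the name **mysumm** is hypothetical, not from the post). Unlike a one-off do-file, it takes the variables to summarize as an argument, so it works on any dataset:

```stata
// mysumm.ado -- hypothetical minimal ado-command (sketch)
program define mysumm
    version 14
    syntax varlist          // parse the variables the user typed
    summarize `varlist'     // perform the task on whatever was passed
end
```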

**An example do-file**

The commands in code block 1 are contained in the file **doex.do** in the current working directory of my computer.

// version 1.0.0 04Oct2015  (This line is comment)
version 14     // version #.# fixes the version of Stata
use http://www.stata.com/data/accident2.dta
summarize accidents tickets

We execute the commands by typing **do doex**, which produces

**Example 1: Output from do doex**

. do doex

. // version 1.0.0 04Oct2015  (This line is comment)

. version 14 // version #.# fixes the version of Stata

. use http://www.stata.com/data/accident2.dta

. summarize accidents tickets

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   accidents |        948    .8512658    2.851856          0         20
     tickets |        948    1.436709    1.849456          0          7

.
end of do-file

- Line 1 in **doex.do** is a comment that helps to document the code but is not executed by Stata. The **//** initiates a comment; anything following the **//** on that line is ignored by Stata.
- In the comment on line 1, I put a version number and the date that I last changed this file. The date and the version help me keep track of the changes that I make as I work on the project. This information also helps me answer questions from others with whom I have shared a version of this file.
- Line 2 specifies the definition of the Stata language that I use. Stata changes over time. Setting the version ensures that the do-file continues to run and that the results do not change as the Stata language evolves.
- Line 3 reads in the **accident2.dta** dataset.
- Line 4 summarizes the variables **accidents** and **tickets**.

**Storing stuff in Stata**

Programming in Stata is like putting stuff into boxes, making Stata change the stuff in the boxes, and getting the changed stuff out of the boxes. For example, code block 2 contains the code for **doex2.do**, whose output I display in example 2.

// version 1.0.0 04Oct2015  (This line is comment)
version 14     // version #.# fixes the version of Stata
use http://www.stata.com/data/accident2.dta
generate ln_traffic = ln(traffic)
summarize ln_traffic

**Example 2: Output from do doex2**

. do doex2

. // version 1.0.0 04Oct2015  (This line is comment)

. version 14 // version #.# fixes the version of Stata

. use http://www.stata.com/data/accident2.dta

. generate ln_traffic = ln(traffic)

. summarize ln_traffic

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
  ln_traffic |        948    1.346907    1.004952  -5.261297   2.302408

.
end of do-file

In line 4 of code block 2, I generate the new variable **ln_traffic**, which I summarize on line 5. **doex2.do** uses **generate** to change what is in the box **ln_traffic** and uses **summarize** to get a function of the changed stuff out of the box. Stata variables are the most frequently used box type in Stata, but when you are programming, you will also rely on Stata matrices.
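Stata matrices work the same way as variable boxes: you put stuff in, change it, and get it back out. A small sketch:

```stata
// A Stata matrix is another kind of box
matrix beta = (1, 2, 3)     // put stuff into the box
matrix beta[1,2] = 5        // change the stuff in the box
matrix list beta            // get the changed stuff back out
```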

There can be only one variable named **traffic** in a Stata dataset, and its contents can be viewed or changed interactively, by a do-file, or by an ado-file command. Similarly, there can be only one Stata matrix named **beta** in a Stata session, and its contents can be viewed or changed interactively, by a do-file, or by an ado-file command. Stata variables and Stata matrices are global boxes: there can be only one variable or matrix with a given name in a Stata session, and its contents can be viewed or changed anywhere in that session.

The opposite of global is local. If something is local in Stata, its contents can be accessed or changed only in the interactive session, in a particular do-file, or in a particular ado-file.

Although I am discussing do-files at the moment, remember that we are learning techniques to write commands. It is essential to understand the differences between global boxes and local boxes to program commands in Stata. Global boxes, like variables, could contain data that the users of your command do not want changed. For example, a command you write should never change a user’s variable in a way that was not requested.

**Levels of Stata**

The notion that there are levels of Stata can help explain the difference between global boxes and local boxes. Suppose that I run two do-files or ado-files. Think of the interactive Stata session as level 0 of Stata, and think of the two do-files or ado-files as Stata levels 1 and 2. Global boxes like variables and matrices live in global memory that can be accessed or changed by a Stata command executed in level 0, 1, or 2. Local boxes can be accessed or changed only by a Stata command within a particular level of Stata. (This description is not exactly how Stata works, but the details about how Stata really handles levels are not important here.)

Figure 1 depicts this structure.

Figure 1 clarifies

- that commands executed at all Stata levels can access and change the objects in global memory,
- that only commands executed at Stata level 0 can access and change the objects local to Stata level 0,
- that only commands executed at Stata level 1 can access and change the objects local to Stata level 1, and
- that only commands executed at Stata level 2 can access and change the objects local to Stata level 2.

**Global and local macros: Storing and extracting**

Macros are Stata boxes that hold information as characters, also known as strings. Stata has both global macros and local macros. Global macros are global and local macros are local. Global macros can be accessed and changed by a command executed at any Stata level. Local macros can be accessed and changed only by a command executed at a specific Stata level.
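A quick, hypothetical way to see the difference: run the two definitions below from a do-file, and then try to display them from the interactive session after the do-file ends.

```stata
// In a do-file (level 1):
global gmac "I am global"
local  lmac "I am local"
// Afterward, in the Command window (level 0):
//   display "$gmac"      displays: I am global
//   display "`lmac'"     displays nothing; the local died with the do-file
```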

The easiest way to begin to understand global macros is to put something into a global macro and then to get it back out. Code block 3 contains the code for **global1.do**, which stores and then retrieves information from a global macro.

// version 1.0.0 04Oct2015
version 14
global vlist "y x1 x2"
display "vlist contains $vlist"

**Example 3: Output from do global1**

. do global1

. // version 1.0.0 04Oct2015

. version 14

. global vlist "y x1 x2"

. display "vlist contains $vlist"
vlist contains y x1 x2

.
end of do-file

Line 3 of code block 3 puts the string **y x1 x2** into the global macro named **vlist**. To extract what I put into a global macro, I prefix the name of the global macro with a **$**. Line 4 of the code block and its output in example 3 illustrate this usage by extracting and displaying the contents of **vlist**.

Code block 4 contains the code for **local1.do** and its output is given in example 4. They illustrate how to put something into a local macro and how to extract something from it.

// version 1.0.0 04Oct2015
version 14
local vlist "y x1 x2"
display "vlist contains `vlist'"

**Example 4: Output from do local1**

. do local1

. // version 1.0.0 04Oct2015

. version 14

. local vlist "y x1 x2"

. display "vlist contains `vlist'"
vlist contains y x1 x2

.
end of do-file

Line 3 of code block 4 puts the string **y x1 x2** into the local macro named **vlist**. To extract what I put into a local macro, I enclose the name of the local macro between a single left quote (`) and a single right quote ('). Line 4 of code block 4 displays what is contained in the local macro **vlist**, and its output in example 4 illustrates this usage.

**Getting stuff from Stata commands**

Now that we have boxes, I will show you how to store stuff computed by Stata in these boxes. Analysis commands, like **summarize**, store their results in **r()**. Estimation commands, like **regress**, store their results in **e()**. Somewhat tautologically, commands that store their results in **r()** are also known as r-class commands and commands that store their results in **e()** are also known as e-class commands.

I can use **return list** to see results stored by an r-class command. Below, I list out what **summarize** has stored in **r()** and compute the mean from the stored results.

**Example 5: Getting results from an r-class command**

. use http://www.stata.com/data/accident2.dta, clear

. summarize accidents

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   accidents |        948    .8512658    2.851856          0         20

. return list

scalars:
                  r(N) =  948
              r(sum_w) =  948
               r(mean) =  .8512658227848101
                r(Var) =  8.133081817331211
                 r(sd) =  2.851855854935732
                r(min) =  0
                r(max) =  20
                r(sum) =  807

. local sum = r(sum)

. local N = r(N)

. display "The mean is " `sum'/`N'
The mean is .85126582
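One caution the example glosses over: the next r-class command replaces what is in **r()**, so copy the results you need into local macros right away. A sketch:

```stata
// r() holds only the results of the most recent r-class command
quietly summarize accidents
local mean_acc = r(mean)       // save before r() is overwritten
quietly summarize tickets      // r(mean) now refers to tickets
display "accidents mean: " `mean_acc'
```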

Estimation commands are more formal than analysis commands, so they save more stuff.

Official Stata estimation commands save lots of stuff, because they follow lots of rules that make postestimation easy for users. Do not be alarmed by the number of things stored by **poisson**. Below, I list out the results stored by **poisson** and create a Stata matrix that contains the coefficient estimates.

**Example 6: Getting results from an e-class command**

. poisson accidents traffic tickets male

Iteration 0:   log likelihood = -377.98594
Iteration 1:   log likelihood = -370.68001
Iteration 2:   log likelihood = -370.66527
Iteration 3:   log likelihood = -370.66527

Poisson regression                              Number of obs     =        948
                                                LR chi2(3)        =    3357.64
                                                Prob > chi2       =     0.0000
Log likelihood = -370.66527                     Pseudo R2         =     0.8191

------------------------------------------------------------------------------
   accidents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     traffic |   .0764399   .0129856     5.89   0.000     .0509887    .1018912
     tickets |   1.366614   .0380641    35.90   0.000      1.29201    1.441218
        male |   3.228004   .1145458    28.18   0.000     3.003499     3.45251
       _cons |  -7.434478   .2590086   -28.70   0.000    -7.942126    -6.92683
------------------------------------------------------------------------------

. ereturn list

scalars:
               e(rank) =  4
                  e(N) =  948
                 e(ic) =  3
                  e(k) =  4
               e(k_eq) =  1
               e(k_dv) =  1
          e(converged) =  1
                 e(rc) =  0
                 e(ll) =  -370.6652697757637
         e(k_eq_model) =  1
               e(ll_0) =  -2049.485325326086
               e(df_m) =  3
               e(chi2) =  3357.640111100644
                  e(p) =  0
               e(r2_p) =  .8191422669899876

macros:
            e(cmdline) : "poisson accidents traffic tickets male"
                e(cmd) : "poisson"
            e(predict) : "poisso_p"
          e(estat_cmd) : "poisson_estat"
           e(chi2type) : "LR"
                e(opt) : "moptimize"
                e(vce) : "oim"
              e(title) : "Poisson regression"
               e(user) : "poiss_lf"
          e(ml_method) : "e2"
          e(technique) : "nr"
              e(which) : "max"
             e(depvar) : "accidents"
         e(properties) : "b V"

matrices:
                  e(b) :  1 x 4
                  e(V) :  4 x 4
               e(ilog) :  1 x 20
           e(gradient) :  1 x 4

functions:
            e(sample)

. matrix b = e(b)

. matrix list b

b[1,4]
     accidents:  accidents:  accidents:  accidents:
        traffic     tickets        male       _cons
y1    .07643992    1.366614   3.2280044   -7.434478
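After an e-class command, individual coefficients and standard errors can also be pulled out directly with the **_b[]** and **_se[]** system variables, without copying the whole **e(b)** matrix; a sketch:

```stata
// Extract single results after an estimation command
display "traffic coefficient: " _b[traffic]
display "traffic std. error:  " _se[traffic]
```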

**Done and Undone**

In this second post in the series #StataProgramming, I discussed the difference between scripts and commands, I provided an introduction to the concepts of global and local memory objects, I discussed global macros and local macros, and I showed how to access results stored by other commands.

In the next post in the series #StataProgramming, I discuss an example that further illustrates the differences between global macros and local macros.

In a previous post, David Drukker demonstrated how to use **mlexp** to estimate the degree of freedom parameter in a chi-squared distribution by maximum likelihood (ML). In this post, I am going to use **mlexp** to estimate the parameters of a probit model with sample selection. I will illustrate how to specify a more complex likelihood in **mlexp** and provide intuition for the probit model with sample selection. Our results match the **heckprobit** command; see **[R] heckprobit** for more details.

**Probit model**

For binary outcome \(y_i\) and regressors \({\bf x}_i\), the probit model assumes

\[\begin{equation} \label{eq:outcome} y_i = {\bf 1}({\bf x}_i{\boldsymbol \beta} + \epsilon_i > 0) \tag{1} \end{equation}\]

where the error \(\epsilon_i\) is standard normal. The indicator function \({\bf1}(\cdot)\) outputs 1 when its input is true and outputs 0 otherwise.

The log likelihood of the probit model is

\[\begin{equation}

\ln L = \sum_{i=1}^{N} y_i \ln \Phi({\bf x}_i{\boldsymbol \beta}) + (1-y_i)\ln\{1-\Phi({\bf x}_i{\boldsymbol \beta})\} \nonumber

\end{equation}\]

where \(\Phi\) is the standard normal cumulative distribution function.

The probit model is widely used to model binary outcomes. But there are situations where it is not appropriate. Sometimes we observe a random sample where the outcome is missing on certain observations. If there is a relationship between the unobserved error of the outcome \(\epsilon_i\) and the unobserved error that affects whether the outcome is observed \(\epsilon_{si}\), then estimates made using the probit model will be inconsistent for \({\boldsymbol \beta}\). For instance, this could happen when we model job satisfaction and our sample includes employed and unemployed individuals. The unobserved factors that affect your job satisfaction may be correlated with factors that affect your employment status. Samples like this are said to suffer from “selection on unobservables”.

**Probit model with sample selection**

Van de Ven and Van Praag (1981) introduced the probit model with sample selection to allow for consistent estimation of \({\boldsymbol \beta}\) in samples that suffer from selection on unobservables. The equation for the outcome (1) remains the same, but we add another equation. The selection process for the outcome is modeled as

\[\begin{equation}

s_i = {\bf 1}({\bf z}_i{\boldsymbol \gamma} + \epsilon_{si} > 0) \nonumber

\end{equation}\]

where \(s_i=1\) if we observed \(y_i\) and \(s_i=0\) otherwise, and \({\bf z}_i\) are regressors that affect the selection process.

The errors \(\epsilon_i\) and \(\epsilon_{si}\) are assumed to be standard normal with

\[\begin{equation}

\mbox{corr}(\epsilon_i,\epsilon_{si}) = \rho \nonumber

\end{equation}\]

Let \(S\) be the set of observations where \(y_i\) is observed. The likelihood for the probit model with sample selection is

\[\begin{eqnarray*}

\ln L &=& \sum_{i\in S}^{} y_i\ln\Phi_2({\bf x}_i{\boldsymbol \beta}, {\bf z}_i{\boldsymbol \gamma},\rho) +

(1-y_i)\ln\Phi_2(-{\bf x}_i{\boldsymbol \beta}, {\bf z}_i{\boldsymbol \gamma},-\rho) + \cr

& & \sum_{i\not\in S}^{} \ln \{1- \Phi({\bf z}_i{\boldsymbol \gamma})\}

\end{eqnarray*}\]

where \(\Phi_2\) is the bivariate normal cumulative distribution function.

**The data**

We will simulate data from a probit model with sample selection and then estimate the parameters of the model using **mlexp**. We simulate a random sample of 7,000 observations.

. drop _all

. set seed 441

. set obs 7000
number of observations (_N) was 0, now 7,000

. generate x = .5*rchi2(2)

. generate z = rnormal()

. generate b = rbinomial(2,.5)

First, we generate the regressors. We use \(x\), a \(\chi^2\) variable with \(2\) degrees of freedom scaled by \(0.5\), as a regressor for the outcome. A standard normal variable \(z\) is used as a selection regressor. The variable \(b\) has a binomial(\(2,0.5\)) distribution and will also be used as a selection regressor.

. matrix cm = (1,.7 \ .7,1)

. drawnorm ey es, corr(cm)

Next, we draw the unobserved errors. The outcome \(y\) and selection indicator \(s\) will be generated with errors that have correlation \(0.7\). We generate the errors with the **drawnorm** command.

. generate s = z + 1.3*0.b + 1.b + .5*2.b + es > 0

. generate y = .7*x + ey + .5 > 0

. replace y = . if !s
(1,750 real changes made, 1,750 to missing)

Finally, we generate the outcome and selection indicator. We specify the effect of \(b\) on selection by using factor-variable notation. Every value of \(b\) provides a different intercept for \(s\). We set the outcome to missing for observations where \(s\) is \(0\).
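A quick, optional sanity check on the simulated data (not shown in the post) is to confirm the realized error correlation and the selection rate:

```stata
// Check that ey and es are correlated at roughly 0.7 and that
// about 1,750 of the 7,000 observations are unselected
correlate ey es
tabulate s
```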

**Effect of ignoring sample selection**

First, we will use **mlexp** to estimate the probit model, ignoring the sample selection. We use the **cond()** function to calculate different values of the likelihood based on the value of \(y\). For **cond(a,b,c)**, **b** is returned if **a** is true and **c** is returned otherwise. We use only the observations for which \(y\) is not missing by specifying \(y\) in the **variables()** option. The variables in the equation **y** are specified once, the first time the equation parameters are used in the likelihood. When the equation is used again, it is referred to as \(\{{\bf y}:\}\).

. mlexp (ln(cond(y,normal({y: x _cons}),1-normal({y:})))), variables(y)

initial:       log likelihood = -3639.0227
alternative:   log likelihood = -2342.8722
rescale:       log likelihood = -1746.0961
Iteration 0:   log likelihood = -1746.0961
Iteration 1:   log likelihood = -1503.9519
Iteration 2:   log likelihood = -1485.2935
Iteration 3:   log likelihood = -1485.1677
Iteration 4:   log likelihood = -1485.1677

Maximum likelihood estimation

Log likelihood = -1485.1677                     Number of obs     =      5,250

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |    .813723   .0568938    14.30   0.000     .7022132    .9252328
       _cons |   .7623006   .0386929    19.70   0.000     .6864639    .8381372
------------------------------------------------------------------------------

Both parameters are overestimated, and the true values are not in the estimated confidence intervals.

**Accounting for sample selection**

Now, we use **mlexp** to estimate the probit model with sample selection. We use the **cond()** function twice, once for the selection indicator value and once for the outcome value. We no longer need to specify the **variables()** option because we will use each observation in the data. We use the factor-variable operator **ibn** in the selection equation so that a separate intercept is used in the equation for each level of \(b\).

. mlexp (ln(cond(s,cond(y,binormal({y: x _cons},{s: z ibn.b},{rho}),
>     binormal(-{y:},{s:},-{rho})),1-normal({s:}))))

initial:       log likelihood =  -8491.053
alternative:   log likelihood =  -5898.851
rescale:       log likelihood =  -5898.851
rescale eq:    log likelihood = -5654.3504
Iteration 0:   log likelihood = -5654.3504
Iteration 1:   log likelihood = -5473.5319  (not concave)
Iteration 2:   log likelihood = -4401.6027  (not concave)
Iteration 3:   log likelihood = -4340.7398  (not concave)
Iteration 4:   log likelihood = -4333.6402  (not concave)
Iteration 5:   log likelihood = -4326.1744  (not concave)
Iteration 6:   log likelihood = -4316.4936  (not concave)
Iteration 7:   log likelihood =  -4261.307
Iteration 8:   log likelihood = -4154.7548
Iteration 9:   log likelihood = -4142.7991
Iteration 10:  log likelihood = -4141.7431
Iteration 11:  log likelihood = -4141.7306
Iteration 12:  log likelihood = -4141.7305

Maximum likelihood estimation

Log likelihood = -4141.7305                     Number of obs     =      7,000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y            |
           x |   .7643362   .0532342    14.36   0.000      .659999    .8686734
       _cons |   .5259657   .0406914    12.93   0.000      .446212    .6057195
-------------+----------------------------------------------------------------
s            |
           z |   1.028631   .0260977    39.41   0.000      .977481    1.079782
             |
           b |
          0  |   1.365497   .0440301    31.01   0.000       1.2792    1.451794
          1  |   1.034018   .0297178    34.79   0.000     .9757726    1.092264
          2  |    .530342   .0353022    15.02   0.000      .461151    .5995331
-------------+----------------------------------------------------------------
        /rho |   .6854869   .0417266    16.43   0.000     .6037043    .7672696
------------------------------------------------------------------------------

Our estimates of the coefficient on \(x\) and the constant intercept are closer to the true values. The confidence intervals also include the true values. The correlation \(\rho\) is estimated to be \(0.69\), and the true value of \(0.7\) is in the confidence interval. This model obviously works better.
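Since the post notes that the results match **heckprobit**, the equivalent built-in estimation would be along these lines (a sketch; see **[R] heckprobit** for the option details). The **noconstant** suboption mirrors the **ibn.b** factor notation, which already supplies an intercept for each level of \(b\):

```stata
// Sketch: the same model via the built-in command
heckprobit y x, select(s = z ibn.b, noconstant)
```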

**Conclusion**

I have demonstrated how to use **mlexp** to estimate the parameters of a model with a moderately complex likelihood function: the probit model with sample selection. I also illustrated how to generate data from this model and how its results differ from those of the simple probit model.

See **[R] mlexp** for more details about **mlexp**. In a future post, we will show how to make predictions after **mlexp** and how to estimate population average parameters using **mlexp** and **margins**.

**Reference**

Van de Ven, W. P. M. M., and B. M. S. Van Praag. 1981. The demand for deductibles in private health insurance: A probit model with sample selection. *Journal of Econometrics* 17: 229–252.