Programming an estimation command in Stata: A first ado-command
I discuss the code for a simple estimation command to focus on the details of how to implement an estimation command. The command that I discuss estimates the mean by the sample average. I begin by reviewing the formulas and a do-file that implements them. I subsequently introduce ado-file programming and discuss two versions of the command. Along the way, I illustrate some of the postestimation features that work after the command.
This is the fourth post in the series Programming an estimation command in Stata. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.
The formulas for our estimator
The formulas for the sample average and its estimated sampling variance, assuming an independently and identically distributed process, are
\[
\widehat{\mu} = 1/N \sum_{i=1}^N y_i
\]
\[
\widehat{Var}(\widehat{\mu}) = 1/[N(N-1)] \sum_{i=1}^N (y_i-\widehat{\mu})^2
\]
The code mean1.do performs these computations on price from the auto dataset.
// version 1.0.0 20Oct2015 version 14 sysuse auto quietly summarize price local sum = r(sum) local N = r(N) local mu = (1/`N')*`sum' generate double e2 = (price - `mu')^2 quietly summarize e2 local V = (1/((`N')*(`N'-1)))*r(sum) display "muhat = " `mu' display " V = " `V'
mean1.do uses summarize to compute the summations. Lines 5–7 and line 11 store results stored by summarize in r() into local macros that are subsequently used to compute the formulas. I recommend that you use double, instead of the default float, to compute all variables used in your formulas because it is almost always worth taking up the extra memory to gain the extra precision offered by double over float. (Essentially, each variable takes up twice as much space, but you get calculations that are correct to about \(10^{-16}\) instead of \(10^{-8}\).)
These calculations yield
Example 1: Computing the average and its sampling variance
. do mean1 . // version 1.0.0 20Oct2015 . version 14 . sysuse auto (1978 Automobile Data) . quietly summarize price . local sum = r(sum) . local N = r(N) . local mu = (1/`N')*`sum' . generate double e2 = (price - `mu')^2 . quietly summarize e2 . local V = (1/((`N')*(`N'-1)))*r(sum) . display "muhat = " `mu' muhat = 6165.2568 . display " V = " `V' V = 117561.16 . end of do-file
Now I verify that mean produces the same results.
Example 2: Results from mean
. mean price Mean estimation Number of obs = 74 -------------------------------------------------------------- | Mean Std. Err. [95% Conf. Interval] -------------+------------------------------------------------ price | 6165.257 342.8719 5481.914 6848.6 -------------------------------------------------------------- . matrix list e(b) symmetric e(b)[1,1] price y1 6165.2568 . matrix list e(V) symmetric e(V)[1,1] price price 117561.16
A first ado-file
The code in mymean1.ado performs the same calculations as mean1.do. (The file mymean1.ado is in my current working directory.)
*! version 1.0.0 20Oct2015 program define mymean1 version 14 quietly summarize price local sum = r(sum) local N = r(N) local mu = (1/`N')*`sum' capture drop e2 // Drop e2 if it exists generate double e2 = (price - `mu')^2 quietly summarize e2 local V = (1/((`N')*(`N'-1)))*r(sum) display "muhat = " `mu' display " V = " `V' end
Line 1 of mymean1.ado specifies that file defines the command mymean1. The command name must be the same as the file name that precedes the suffix .ado. The mymean1 command performs the same computations as the do-file mean1.do.
Example 3: Results from mymean1
. mymean1 muhat = 6165.2568 V = 117561.16
A slightly better command
We want our command to be reusable; we want it to estimate the mean for any variable in memory, instead of only for price as performed by mymean1.ado. On line 5 of mymean2.ado, we use the syntax command to store the name of the variable specified by the user into the local macro varlist, which we use in the remainder of the computations.
*! version 2.0.0 20Oct2015 program define mymean2 version 14 syntax varlist display "The local macro varlist contains `varlist'" quietly summarize `varlist' local sum = r(sum) local N = r(N) local mu = (1/`N')*`sum' capture drop e2 // Drop e2 if it exists generate double e2 = (`varlist' - `mu')^2 quietly summarize e2 local V = (1/((`N')*(`N'-1)))*r(sum) display "The average of `varlist' is " `mu' display "The estimated variance of the average is " `V' end
The extremely powerful syntax command puts the elements of Stata syntax specified by the user into local macros and throws errors when the user makes a mistake. I will discuss syntax in greater detail in subsequent posts.
I begin by illustrating how to replicate the previous results.
Example 4: Results from mymean2 price
. mymean2 price The local macro varlist contains price The average of price is 6165.2568 The estimated variance of the average is 117561.16
I now illustrate that it works for another variable.
Example 5: Results from mymean2 trunk
. mymean2 trunk The local macro varlist contains trunk The average of trunk is 13.756757 The estimated variance of the average is .24724576 . mean trunk Mean estimation Number of obs = 74 -------------------------------------------------------------- | Mean Std. Err. [95% Conf. Interval] -------------+------------------------------------------------ trunk | 13.75676 .4972381 12.76576 14.74775 -------------------------------------------------------------- . display "The variance of the estimator is " (_se[trunk])^2 The variance of the estimator is .24724576
Storing results in e()
mymean2.ado does not save the results that it displays. We fix this problem in mymean3.ado. Line 2 specifies the option e-class on program define to make mymean3 an e-class command. Line 18 uses ereturn post to move the matrix of point estimates (b) and the estimated variance-covariance of the estimator (VCE) into e(b) and e(V). The estimation-postestimation framework uses parameter names for display, hypothesis tests, and other features. In lines 15 and 16, we put those names into the column stripes of the vector of estimates and the estimated VCE. In line 17, we put those names into the row stripe of the estimated VCE.
*! version 3.0.0 20Oct2015 program define mymean3, eclass version 14 syntax varlist quietly summarize `varlist' local sum = r(sum) local N = r(N) matrix b = (1/`N')*`sum' capture drop e2 // Drop e2 if it exists generate double e2 = (`varlist' - b[1,1])^2 quietly summarize e2 matrix V = (1/((`N')*(`N'-1)))*r(sum) matrix colnames b = `varlist' matrix colnames V = `varlist' matrix rownames V = `varlist' ereturn post b V ereturn display end
The ereturn display command on line 19 of mymean3.ado easily creates a standard output table using the results now stored in e(b) and e(V).
Example 6: Results from mymean3 trunk
. mymean3 trunk ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- trunk | 13.75676 .4972381 27.67 0.000 12.78219 14.73133 ------------------------------------------------------------------------------
Estimation-postestimation
test, lincom, testnl, nlcom, and other Wald-based estimation-postestimation features work after mymean3 because all the required information is stored in e(b) and e(V).
To illustrate, I perform a Wald of the null hypothesis that the mean of trunk is \(11\).
Example 7: test works after mymean3
. test _b[trunk]==11 ( 1) trunk = 11 chi2( 1) = 30.74 Prob > chi2 = 0.0000
The results stored in e() are the glue that holds the estimation-postestimation framework together. We have only stored e(b) and e(V) so far, so not all the standard features are working yet. (But we will get there in the #StataProgramming series.)
Using temporary names for global objects
Stata variables and matrices are global, as discussed in my previous blog post. We need some safe names for global objects. These safe names should not be in use elsewhere, and they should be temporary in that we want Stata to drop the corresponding objects when the command finishes. The tempvar and tempname commands put safe names into local macros and then drop the corresponding objects when the ado-file or do-file finishes. We explicitly dropped e2, if it existed, in line 9 of code block 2, in line 12 of code block 3, and in line 11 of code block 4. We do not need such a line in code block, because we are using temporary variable names.
In line 7 of mymean4.ado, the tempvar command puts a safe name into the local macro e2. In line 8 of mymean4.ado, the tempname command puts safe names into the local macros b and V. I illustrate the format followed by these safe names by displaying them on lines 9–11. The output reveals that a leading pair of underscores is followed by numbers and capital letters. Line 15 illustrates the use of these safe names. Instead of creating the matrix b, we create the matrix whose name is stored in the local macro b. In line 8, the tempname command created the local macro b to hold a safe name.
*! version 4.0.0 20Oct2015 program define mymean4, eclass version 14 syntax varlist tempvar e2 tempname b V display "The safe name in e2 is `e2'" display "The safe name in b is `b'" display "The safe name in V is `V'" quietly summarize `varlist' local sum = r(sum) local N = r(N) matrix `b' = (1/`N')*`sum' generate double `e2' = (`varlist' - `b'[1,1])^2 quietly summarize `e2' matrix `V' = (1/((`N')*(`N'-1)))*r(sum) matrix colnames `b' = `varlist' matrix colnames `V' = `varlist' matrix rownames `V' = `varlist' ereturn post `b' `V' ereturn display end
This code produces the output
Example 8: Results from mymean4 trunk
. mymean4 trunk The safe name in e2 is __000000 The safe name in b is __000001 The safe name in V is __000002 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- trunk | 13.75676 .4972381 27.67 0.000 12.78219 14.73133 ------------------------------------------------------------------------------
Removing the lines that display the safe names contained in the local macros yields mymean5.ado.
*! version 5.0.0 20Oct2015 program define mymean5, eclass version 14 syntax varlist tempvar e2 tempname b V quietly summarize `varlist' local sum = r(sum) local N = r(N) matrix `b' = (1/`N')*`sum' generate double `e2' = (`varlist' - `b'[1,1])^2 quietly summarize `e2' matrix `V' = (1/((`N')*(`N'-1)))*r(sum) matrix colnames `b' = `varlist' matrix colnames `V' = `varlist' matrix rownames `V' = `varlist' ereturn post `b' `V' ereturn display end
This code produces the output
Example 9: Results from mymean5 trunk
. mymean5 trunk ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- trunk | 13.75676 .4972381 27.67 0.000 12.78219 14.73133 ------------------------------------------------------------------------------
Done and undone
I illustrated some basic ado-file programming techniques by implementing a command that estimates the mean of variable. Even though we have a command that produces correct, easy-to-read output that has some estimation-postestimation features, we have only scratched the surface of what we usually want to do in an estimation command. I dig a little deeper in the next few posts by developing a command that performs ordinary least-squares estimation.