Home > Programming > Programming an estimation command in Stata: A better OLS command

Programming an estimation command in Stata: A better OLS command

I use the syntax command to improve the command that implements the ordinary least-squares (OLS) estimator that I discussed in Programming an estimation command in Stata: A first command for OLS. I show how to require that all variables be numeric variables and how to make the command accept time-series operated variables.

This is the seventh post in the series Programming an estimation command in Stata. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.

Stata syntax and the syntax command

The myregress2 command described in Programming an estimation command in Stata: A first command for OLS has the syntax

myregress2 depvar [indepvars]

This syntax requires that the dependent variable be specified because depvar is not enclosed in square brackets. The independent variables are optional because indepvars is enclosed in square brackets. Type

. help language

for an introduction to reading Stata syntax diagrams.

This syntax is implemented by the syntax command in line 5 of myregress2.ado, which I discussed at length in Programming an estimation command in Stata: A first command for OLS. The user must specify a list of variable names because varlist is not enclosed in square brackets. The syntax of the syntax command follows the rules of a syntax diagram.

Code block 1: myregress2.ado

*! version 2.0.0  26Oct2015
program define myregress2, eclass
	version 14

	syntax varlist
	gettoken depvar : varlist

	tempname zpz xpx xpy xpxi b V
	tempvar  xbhat res res2 

	quietly matrix accum `zpz' = `varlist'
	local p : word count `varlist'
	local p = `p' + 1
	matrix `xpx'                = `zpz'[2..`p', 2..`p']
	matrix `xpy'                = `zpz'[2..`p', 1]
	matrix `xpxi'               = syminv(`xpx')
	matrix `b'                  = (`xpxi'*`xpy')'
	quietly matrix score double `xbhat' = `b'
	quietly generate double `res'       = (`depvar' - `xbhat')
	quietly generate double `res2'      = (`res')^2
	quietly summarize `res2'
	local N                     = r(N)
	local sum                   = r(sum)
	local s2                    = `sum'/(`N'-(`p'-1))
	matrix `V'                  = `s2'*`xpxi'
	ereturn post `b' `V'
	ereturn local         cmd   "myregress2"
	ereturn display
end

Example 1 illustrates that myregress2 runs the requested regression when I specify a varlist.

Example 1: myregress2 with specified variables

. sysuse auto
(1978 Automobile Data)

. myregress2 price mpg trunk
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -220.1649   65.59262    -3.36   0.001    -348.7241    -91.6057
       trunk |   43.55851   88.71884     0.49   0.623    -130.3272    217.4442
       _cons |   10254.95   2349.084     4.37   0.000      5650.83    14859.07
------------------------------------------------------------------------------

Example 2 illustrates that the syntax command displays an error message and stops execution when I do not specify a varlist. I use set trace on to see each line of code and the output it produces.

Example 2: myregress2 without a varlist

. set trace on 

. myregress2 
  --------------------------------------------------------- begin myregress2 --
  - version 14
  - syntax varlist
varlist required
  ----------------------------------------------------------- end myregress2 --
r(100);

Example 3 illustrates that the syntax command is checking that the specified variables are in the current dataset. syntax throws an error because DoesNotExist is not a variable in the current dataset.

Example 3: myregress2 with a variable not in this dataset

. set trace on 

. myregress2 price mpg trunk DoesNotExist
  --------------------------------------------------------- begin myregress2 --
  - version 14
  - syntax varlist
variable DoesNotExist not found
  ----------------------------------------------------------- end myregress2 --
r(111);

end of do-file

r(111);

Because the syntax command on line 5 is not restricting the specified variables to be numeric, I get the no observations error in example 4 instead of an error indicating the actual problem, which is the string variable make.

Example 4: myregress2 with a string variable

. describe make

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
make            str18   %-18s                 Make and Model

. myregress2 price mpg trunk make
no observations
r(2000);

end of do-file

r(2000);

On line 5 of myregress3, I modify varlist to only accept numeric variables This change produces a more informative error message when I try to include a string variable in the regression.

Code block 2: myregress3.ado

*! version 3.0.0  30Oct2015
program define myregress3, eclass
	version 14

	syntax varlist(numeric)
	gettoken depvar : varlist

	tempname zpz xpx xpy xpxi b V
	tempvar  xbhat res res2 

	quietly matrix accum `zpz' = `varlist'
	local p : word count `varlist'
	local p = `p' + 1
	matrix `xpx'                = `zpz'[2..`p', 2..`p']
	matrix `xpy'                = `zpz'[2..`p', 1]
	matrix `xpxi'               = syminv(`xpx')
	matrix `b'                  = (`xpxi'*`xpy')'
	quietly matrix score double `xbhat' = `b'
	quietly generate double `res'       = (`depvar' - `xbhat')
	quietly generate double `res2'      = (`res')^2
	quietly summarize `res2'
	local N                     = r(N)
	local sum                   = r(sum)
	local s2                    = `sum'/(`N'-(`p'-1))
	matrix `V'                  = `s2'*`xpxi'
	ereturn post `b' `V'
	ereturn local         cmd   "myregress3"
	ereturn display
end

Example 5: myregress3 with a string variable

. set trace on 

. myregress3 price mpg trunk make
  --------------------------------------------------------- begin myregress3 --
  - version 14
  - syntax varlist(numeric)
string variables not allowed in varlist;
make is a string variable
  ----------------------------------------------------------- end myregress3 --
r(109);

end of do-file

r(109);

On line 5 of myregress4, I modify the varlist to accept time-series (ts) variables. The syntax command puts time-series variables in a canonical form that is stored in the local macro varlist, as illustrated in the display on line 6, whose output appears in example 6.

Code block 3: myregress4.ado

*! version 4.0.0  31Oct2015
program define myregress4, eclass
	version 14

	syntax varlist(numeric ts)
	display "varlist is `varlist'"
	gettoken depvar : varlist

	tempname zpz xpx xpy xpxi b V
	tempvar  xbhat res res2 

	quietly matrix accum `zpz' = `varlist'
	local p : word count `varlist'
	local p = `p' + 1
	matrix `xpx'                = `zpz'[2..`p', 2..`p']
	matrix `xpy'                = `zpz'[2..`p', 1]
	matrix `xpxi'               = syminv(`xpx')
	matrix `b'                  = (`xpxi'*`xpy')'
	quietly matrix score double `xbhat' = `b'
	quietly generate double `res'       = (`depvar' - `xbhat')
	quietly generate double `res2'      = (`res')^2
	quietly summarize `res2'
	local N                     = r(N)
	local sum                   = r(sum)
	local s2                    = `sum'/(`N'-(`p'-1))
	matrix `V'                  = `s2'*`xpxi'
	ereturn post `b' `V'
	ereturn local         cmd   "myregress4"
	ereturn display
end

Example 6: myregress4 with time-series variables

. sysuse gnp96

. myregress4  L(0/3).gnp 
varlist is gnp96 L.gnp96 L2.gnp96 L3.gnp96
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       gnp96 |
         L1. |   1.277086   .0860652    14.84   0.000     1.108402    1.445771
         L2. |   -.135549   .1407719    -0.96   0.336    -.4114568    .1403588
         L3. |  -.1368326   .0871645    -1.57   0.116    -.3076719    .0340067
             |
       _cons |   -2.94825   14.36785    -0.21   0.837    -31.10871    25.21221
------------------------------------------------------------------------------

Done and undone

I used the syntax command to improve how myregress2 handles the variables specified by the user. I showed how to require that all variables be numeric variables and how to make the command accept time-series operated variables. In the next post, I show how to make the command allow for sample restrictions, how to handle missing values, how to allow for factor-operated variables, and how to deal with perfectly collinear variables.