Home > Programming > Programming an estimation command in Stata: Allowing for sample restrictions and factor variables

Programming an estimation command in Stata: Allowing for sample restrictions and factor variables

I modify the ordinary least-squares (OLS) command discussed in Programming an estimation command in Stata: A better OLS command to allow for sample restrictions, to handle missing values, to allow for factor variables, and to deal with perfectly collinear variables.

This is the eighth post in the series Programming an estimation command in Stata. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.

Sample restrictions

The myregress4 command described in Programming an estimation command in Stata: A better OLS command has the syntax

myregress4 depvar [indepvars]

where the indepvars may be time-series operated variables. myregress5 allows for sample restrictions and missing values. It has the syntax

myregress5 depvar [indepvars] [if] [in]

A user may optionally specify an if expression or an in range to restrict the sample. I also make myregress5 handle missing values in the user-specified variables.

Code block 1: myregress5.ado

*! version 5.0.0  22Nov2015
program define myregress5, eclass
	version 14

	syntax varlist(numeric ts) [if] [in]
	marksample touse

	gettoken depvar : varlist

	tempname zpz xpx xpy xpxi b V
	tempvar  xbhat res res2 

	quietly matrix accum `zpz' = `varlist' if `touse'
	local p : word count `varlist'
	local p = `p' + 1
	matrix `xpx'                = `zpz'[2..`p', 2..`p']
	matrix `xpy'                = `zpz'[2..`p', 1]
	matrix `xpxi'               = syminv(`xpx')
	matrix `b'                  = (`xpxi'*`xpy')'
	quietly matrix score double `xbhat' = `b' if `touse'
	quietly generate double `res'       = (`depvar' - `xbhat') if `touse'
	quietly generate double `res2'      = (`res')^2 if `touse'
	quietly summarize `res2' if `touse' , meanonly
	local N                     = r(N)
	local sum                   = r(sum)
	local s2                    = `sum'/(`N'-(`p'-1))
	matrix `V'                  = `s2'*`xpxi'
	ereturn post `b' `V', esample(`touse')
	ereturn scalar           N  = `N'
	ereturn local         cmd   "myregress5"
	ereturn display
end

The syntax command in line 5 specifies that a user may optionally restrict the sample by specifying an if expression or in range. When the user specifies an if expression, syntax puts it into the local macro if; otherwise, the local macro if is empty. When the user specifies an in range, syntax puts it into the local macro in; otherwise, the local macro in is empty.

We could use the local macros if and in to handle user-specified sample restrictions, but these do not account for missing values in the user-specified variables. The marksample command in line 6 creates a local macro named touse, which contains the name of a temporary variable that is a sample-identification variable. Each observation in the sample-identification variable is either one or zero. It is one if the observation is included in the sample. It is zero if the observation is excluded from the sample. An observation can be excluded by a user-specified if expression, by a user-specified in range, or because there is a missing value in one of the user-specified variables.

Lines 20–23 use the sample-identification variable contained in the local macro touse to enforce these sample restrictions on the OLS calculations.

Line 28 posts the sample-identification variable into e(sample), which is one if the observation was included in the estimation sample and it is zero if the observation was excluded from the estimation sample.

Line 29 stores the number of observations in the sample in e(N).

Example 1 illustrates that myregress5 runs the requested regression on the sample that respects the missing values in rep78 and accounts for an if expression.

Example 1: myregress5 with missing values and an if expression

. sysuse auto
(1978 Automobile Data)

. count if !missing(rep78)
  69

. count if !missing(rep78) & mpg < 30
  62

. myregress5 price mpg trunk rep78 if mpg < 30
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -376.8591   107.4289    -3.51   0.000    -587.4159   -166.3024
       trunk |  -36.39376   102.2139    -0.36   0.722    -236.7294    163.9418
       rep78 |   556.3029   378.1101     1.47   0.141    -184.7793    1297.385
       _cons |    12569.5   3455.556     3.64   0.000     5796.735    19342.27
------------------------------------------------------------------------------

. ereturn list

scalars:
                  e(N) =  62

macros:
                e(cmd) : "myregress5"
         e(properties) : "b V"

matrices:
                  e(b) :  1 x 4
                  e(V) :  4 x 4

functions:
             e(sample)   

Allowing for factor variables

Example 1 includes the number of repairs as a continuous variable, but it might be better treated as a discrete factor. myregress6 accepts factor variables. Factor-variable lists usually imply variable lists that contain perfectly collinear variables, so myregress6 also handles
perfectly collinear variables.

Code block 2: myregress6.ado

*! version 6.0.0  22Nov2015
program define myregress6, eclass
	version 14

	syntax varlist(numeric ts fv) [if] [in]
	marksample touse

	gettoken depvar : varlist
	_fv_check_depvar `depvar'

	tempname zpz xpx xpy xpxi b V
	tempvar  xbhat res res2 

	quietly matrix accum `zpz' = `varlist' if `touse'
	local p                    = colsof(`zpz')
	matrix `xpx'               = `zpz'[2..`p', 2..`p']
	matrix `xpy'               = `zpz'[2..`p', 1]
	matrix `xpxi'              = syminv(`xpx')
	local k                    = `p' - diag0cnt(`xpxi') - 1
	matrix `b'                 = (`xpxi'*`xpy')'
	quietly matrix score double `xbhat' = `b' if `touse'
	quietly generate double `res'       = (`depvar' - `xbhat') if `touse'
	quietly generate double `res2'      = (`res')^2 if `touse'
	quietly summarize `res2' if `touse' , meanonly
	local N                     = r(N)
	local sum                   = r(sum)
	local s2                    = `sum'/(`N'-(`k'))
	matrix `V'                  = `s2'*`xpxi'
	ereturn post `b' `V', esample(`touse') buildfvinfo
	ereturn scalar N            = `N'
	ereturn scalar rank         = `k'
	ereturn local  cmd             "myregress6"
	ereturn display
end

The fv in the parentheses after varlist in the syntax command in line 5 modifies the varlist to accept factor variables. Any specified factor variables are stored in the local macro varlist in a canonical form.

Estimation commands do not allow the dependent variable to be a factor variable. The _fv_check_depvar command in line 9 will exit with an error if the local macro depvar contains a factor variable.

Line 15 stores the number of columns in the matrix formed by matrix accum in the local macro p. Line 19 stores the number of linearly independent columns in the local macro k. This calculation uses diag0cnt() to account for the perfectly collinear variables that were dropped. (Each dropped variable puts a zero on the diagonal of the generalized inverse calculated by syminv() and diag0cnt() returns the number of zeros on the diagonal.)

On line 29, I specified option buildfvinfo on ereturn post to store hidden information that ereturn display, contrast, margins, and pwcompare use to label tables and to decide which functions of the parameters are estimable.

Line 31 stores the number of linearly independent variables in e(rank) for postestimation commands.

Now, I use myregress6 to include rep78 as a factor-operated variable. The base category is dropped because we included a constant term.

Example 2: myregress6 with a factor-operated variable

. myregress6 price mpg trunk i.rep78 
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -262.7053   73.49434    -3.57   0.000    -406.7516   -118.6591
       trunk |   41.75706    93.9671     0.44   0.657    -142.4151    225.9292
             |
       rep78 |
          2  |   654.7905   2136.246     0.31   0.759    -3532.175    4841.756
          3  |   1170.606   2001.739     0.58   0.559     -2752.73    5093.941
          4  |   1473.352   2017.138     0.73   0.465    -2480.167     5426.87
          5  |   2896.888   2121.206     1.37   0.172    -1260.599    7054.375
             |
       _cons |   9726.377   2790.009     3.49   0.000      4258.06    15194.69
------------------------------------------------------------------------------

Done and undone

I modified the OLS command discussed in Programming an estimation command in Stata: A better OLS command to allow for sample restrictions, to handle missing values, to allow for factor variables, and to deal with perfectly collinear variables. In the next post, I show how to allow options for robust standard errors and to suppress the constant term.