Programming an estimation command in Stata: Allowing for sample restrictions and factor variables
I modify the ordinary least-squares (OLS) command discussed in Programming an estimation command in Stata: A better OLS command to allow for sample restrictions, to handle missing values, to allow for factor variables, and to deal with perfectly collinear variables.
This is the eighth post in the series Programming an estimation command in Stata. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.
Sample restrictions
The myregress4 command described in Programming an estimation command in Stata: A better OLS command has the syntax
myregress4 depvar [indepvars]
where the indepvars may be time-series operated variables. myregress5 allows for sample restrictions and missing values. It has the syntax
myregress5 depvar [indepvars] [if] [in]
A user may optionally specify an if expression or an in range to restrict the sample. I also make myregress5 handle missing values in the user-specified variables.
*! version 5.0.0 22Nov2015 program define myregress5, eclass version 14 syntax varlist(numeric ts) [if] [in] marksample touse gettoken depvar : varlist tempname zpz xpx xpy xpxi b V tempvar xbhat res res2 quietly matrix accum `zpz' = `varlist' if `touse' local p : word count `varlist' local p = `p' + 1 matrix `xpx' = `zpz'[2..`p', 2..`p'] matrix `xpy' = `zpz'[2..`p', 1] matrix `xpxi' = syminv(`xpx') matrix `b' = (`xpxi'*`xpy')' quietly matrix score double `xbhat' = `b' if `touse' quietly generate double `res' = (`depvar' - `xbhat') if `touse' quietly generate double `res2' = (`res')^2 if `touse' quietly summarize `res2' if `touse' , meanonly local N = r(N) local sum = r(sum) local s2 = `sum'/(`N'-(`p'-1)) matrix `V' = `s2'*`xpxi' ereturn post `b' `V', esample(`touse') ereturn scalar N = `N' ereturn local cmd "myregress5" ereturn display end
The syntax command in line 5 specifies that a user may optionally restrict the sample by specifying an if expression or in range. When the user specifies an if expression, syntax puts it into the local macro if; otherwise, the local macro if is empty. When the user specifies an in range, syntax puts it into the local macro in; otherwise, the local macro in is empty.
We could use the local macros if and in to handle user-specified sample restrictions, but these do not account for missing values in the user-specified variables. The marksample command in line 6 creates a local macro named touse, which contains the name of a temporary variable that is a sample-identification variable. Each observation in the sample-identification variable is either one or zero. It is one if the observation is included in the sample. It is zero if the observation is excluded from the sample. An observation can be excluded by a user-specified if expression, by a user-specified in range, or because there is a missing value in one of the user-specified variables.
Lines 20–23 use the sample-identification variable contained in the local macro touse to enforce these sample restrictions on the OLS calculations.
Line 28 posts the sample-identification variable into e(sample), which is one if the observation was included in the estimation sample and it is zero if the observation was excluded from the estimation sample.
Line 29 stores the number of observations in the sample in e(N).
Example 1 illustrates that myregress5 runs the requested regression on the sample that respects the missing values in rep78 and accounts for an if expression.
Example 1: myregress5 with missing values and an if expression
. sysuse auto (1978 Automobile Data) . count if !missing(rep78) 69 . count if !missing(rep78) & mpg < 30 62 . myregress5 price mpg trunk rep78 if mpg < 30 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- mpg | -376.8591 107.4289 -3.51 0.000 -587.4159 -166.3024 trunk | -36.39376 102.2139 -0.36 0.722 -236.7294 163.9418 rep78 | 556.3029 378.1101 1.47 0.141 -184.7793 1297.385 _cons | 12569.5 3455.556 3.64 0.000 5796.735 19342.27 ------------------------------------------------------------------------------ . ereturn list scalars: e(N) = 62 macros: e(cmd) : "myregress5" e(properties) : "b V" matrices: e(b) : 1 x 4 e(V) : 4 x 4 functions: e(sample)
Allowing for factor variables
Example 1 includes the number of repairs as a continuous variable, but it might be better treated as a discrete factor. myregress6 accepts factor variables. Factor-variable lists usually imply variable lists that contain perfectly collinear variables, so myregress6 also handles
perfectly collinear variables.
*! version 6.0.0 22Nov2015 program define myregress6, eclass version 14 syntax varlist(numeric ts fv) [if] [in] marksample touse gettoken depvar : varlist _fv_check_depvar `depvar' tempname zpz xpx xpy xpxi b V tempvar xbhat res res2 quietly matrix accum `zpz' = `varlist' if `touse' local p = colsof(`zpz') matrix `xpx' = `zpz'[2..`p', 2..`p'] matrix `xpy' = `zpz'[2..`p', 1] matrix `xpxi' = syminv(`xpx') local k = `p' - diag0cnt(`xpxi') - 1 matrix `b' = (`xpxi'*`xpy')' quietly matrix score double `xbhat' = `b' if `touse' quietly generate double `res' = (`depvar' - `xbhat') if `touse' quietly generate double `res2' = (`res')^2 if `touse' quietly summarize `res2' if `touse' , meanonly local N = r(N) local sum = r(sum) local s2 = `sum'/(`N'-(`k')) matrix `V' = `s2'*`xpxi' ereturn post `b' `V', esample(`touse') buildfvinfo ereturn scalar N = `N' ereturn scalar rank = `k' ereturn local cmd "myregress6" ereturn display end
The fv in the parentheses after varlist in the syntax command in line 5 modifies the varlist to accept factor variables. Any specified factor variables are stored in the local macro varlist in a canonical form.
Estimation commands do not allow the dependent variable to be a factor variable. The _fv_check_depvar command in line 9 will exit with an error if the local macro depvar contains a factor variable.
Line 15 stores the number of columns in the matrix formed by matrix accum in the local macro p. Line 19 stores the number of linearly independent columns in the local macro k. This calculation uses diag0cnt() to account for the perfectly collinear variables that were dropped. (Each dropped variable puts a zero on the diagonal of the generalized inverse calculated by syminv() and diag0cnt() returns the number of zeros on the diagonal.)
On line 29, I specified option buildfvinfo on ereturn post to store hidden information that ereturn display, contrast, margins, and pwcompare use to label tables and to decide which functions of the parameters are estimable.
Line 31 stores the number of linearly independent variables in e(rank) for postestimation commands.
Now, I use myregress6 to include rep78 as a factor-operated variable. The base category is dropped because we included a constant term.
Example 2: myregress6 with a factor-operated variable
. myregress6 price mpg trunk i.rep78 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- mpg | -262.7053 73.49434 -3.57 0.000 -406.7516 -118.6591 trunk | 41.75706 93.9671 0.44 0.657 -142.4151 225.9292 | rep78 | 2 | 654.7905 2136.246 0.31 0.759 -3532.175 4841.756 3 | 1170.606 2001.739 0.58 0.559 -2752.73 5093.941 4 | 1473.352 2017.138 0.73 0.465 -2480.167 5426.87 5 | 2896.888 2121.206 1.37 0.172 -1260.599 7054.375 | _cons | 9726.377 2790.009 3.49 0.000 4258.06 15194.69 ------------------------------------------------------------------------------
Done and undone
I modified the OLS command discussed in Programming an estimation command in Stata: A better OLS command to allow for sample restrictions, to handle missing values, to allow for factor variables, and to deal with perfectly collinear variables. In the next post, I show how to allow options for robust standard errors and to suppress the constant term.