## Programming an estimation command in Stata: Allowing for sample restrictions and factor variables

I modify the ordinary least-squares (**OLS**) command discussed in Programming an estimation command in Stata: A better OLS command to allow for sample restrictions, to handle missing values, to allow for factor variables, and to deal with perfectly collinear variables.

This is the eighth post in the series **Programming an estimation command in Stata**. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.

**Sample restrictions**

The **myregress4** command described in Programming an estimation command in Stata: A better OLS command has the syntax

**myregress4** *depvar* [*indepvars*]

where the *indepvars* may be time-series operated variables. **myregress5** allows for sample restrictions and missing values. It has the syntax

**myregress5** *depvar* [*indepvars*] [if] [in]

A user may optionally specify an **if** expression or an **in** range to restrict the sample. I also make **myregress5** handle missing values in the user-specified variables.

**Code block 1: myregress5.ado**

*! version 5.0.0 22Nov2015 program define myregress5, eclass version 14 syntax varlist(numeric ts) [if] [in] marksample touse gettoken depvar : varlist tempname zpz xpx xpy xpxi b V tempvar xbhat res res2 quietly matrix accum `zpz' = `varlist' if `touse' local p : word count `varlist' local p = `p' + 1 matrix `xpx' = `zpz'[2..`p', 2..`p'] matrix `xpy' = `zpz'[2..`p', 1] matrix `xpxi' = syminv(`xpx') matrix `b' = (`xpxi'*`xpy')' quietly matrix score double `xbhat' = `b' if `touse' quietly generate double `res' = (`depvar' - `xbhat') if `touse' quietly generate double `res2' = (`res')^2 if `touse' quietly summarize `res2' if `touse' , meanonly local N = r(N) local sum = r(sum) local s2 = `sum'/(`N'-(`p'-1)) matrix `V' = `s2'*`xpxi' ereturn post `b' `V', esample(`touse') ereturn scalar N = `N' ereturn local cmd "myregress5" ereturn display end

The **syntax** command in line 5 specifies that a user may optionally restrict the sample by specifying an **if** expression or **in** range. When the user specifies an **if** expression, **syntax** puts it into the local macro **if**; otherwise, the local macro **if** is empty. When the user specifies an **in** range, **syntax** puts it into the local macro **in**; otherwise, the local macro **in** is empty.

We could use the local macros **if** and **in** to handle user-specified sample restrictions, but these do not account for missing values in the user-specified variables. The **marksample** command in line 6 creates a local macro named **touse**, which contains the name of a temporary variable that is a sample-identification variable. Each observation in the sample-identification variable is either one or zero. It is one if the observation is included in the sample. It is zero if the observation is excluded from the sample. An observation can be excluded by a user-specified **if** expression, by a user-specified **in** range, or because there is a missing value in one of the user-specified variables.

Lines 20–23 use the sample-identification variable contained in the local macro **touse** to enforce these sample restrictions on the **OLS** calculations.

Line 28 posts the sample-identification variable into **e(sample)**, which is one if the observation was included in the estimation sample and it is zero if the observation was excluded from the estimation sample.

Line 29 stores the number of observations in the sample in **e(N)**.

Example 1 illustrates that **myregress5** runs the requested regression on the sample that respects the missing values in **rep78** and accounts for an **if** expression.

**Example 1: myregress5 with missing values and an if expression**

. sysuse auto (1978 Automobile Data) . count if !missing(rep78) 69 . count if !missing(rep78) & mpg < 30 62 . myregress5 price mpg trunk rep78 if mpg < 30 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- mpg | -376.8591 107.4289 -3.51 0.000 -587.4159 -166.3024 trunk | -36.39376 102.2139 -0.36 0.722 -236.7294 163.9418 rep78 | 556.3029 378.1101 1.47 0.141 -184.7793 1297.385 _cons | 12569.5 3455.556 3.64 0.000 5796.735 19342.27 ------------------------------------------------------------------------------ . ereturn list scalars: e(N) = 62 macros: e(cmd) : "myregress5" e(properties) : "b V" matrices: e(b) : 1 x 4 e(V) : 4 x 4 functions: e(sample)

**Allowing for factor variables**

Example 1 includes the number of repairs as a continuous variable, but it might be better treated as a discrete factor. **myregress6** accepts factor variables. Factor-variable lists usually imply variable lists that contain perfectly collinear variables, so **myregress6** also handles

perfectly collinear variables.

**Code block 2: myregress6.ado**

*! version 6.0.0 22Nov2015 program define myregress6, eclass version 14 syntax varlist(numeric ts fv) [if] [in] marksample touse gettoken depvar : varlist _fv_check_depvar `depvar' tempname zpz xpx xpy xpxi b V tempvar xbhat res res2 quietly matrix accum `zpz' = `varlist' if `touse' local p = colsof(`zpz') matrix `xpx' = `zpz'[2..`p', 2..`p'] matrix `xpy' = `zpz'[2..`p', 1] matrix `xpxi' = syminv(`xpx') local k = `p' - diag0cnt(`xpxi') - 1 matrix `b' = (`xpxi'*`xpy')' quietly matrix score double `xbhat' = `b' if `touse' quietly generate double `res' = (`depvar' - `xbhat') if `touse' quietly generate double `res2' = (`res')^2 if `touse' quietly summarize `res2' if `touse' , meanonly local N = r(N) local sum = r(sum) local s2 = `sum'/(`N'-(`k')) matrix `V' = `s2'*`xpxi' ereturn post `b' `V', esample(`touse') buildfvinfo ereturn scalar N = `N' ereturn scalar rank = `k' ereturn local cmd "myregress6" ereturn display end

The **fv** in the parentheses after **varlist** in the **syntax** command in line 5 modifies the **varlist** to accept factor variables. Any specified factor variables are stored in the local macro **varlist** in a canonical form.

Estimation commands do not allow the dependent variable to be a factor variable. The **_fv_check_depvar** command in line 9 will exit with an error if the local macro **depvar** contains a factor variable.

Line 15 stores the number of columns in the matrix formed by **matrix accum** in the local macro **p**. Line 19 stores the number of linearly independent columns in the local macro **k**. This calculation uses **diag0cnt()** to account for the perfectly collinear variables that were dropped. (Each dropped variable puts a zero on the diagonal of the generalized inverse calculated by **syminv()** and **diag0cnt()** returns the number of zeros on the diagonal.)

On line 29, I specified option **buildfvinfo** on **ereturn post** to store hidden information that **ereturn display**, **contrast**, **margins**, and **pwcompare** use to label tables and to decide which functions of the parameters are estimable.

Line 31 stores the number of linearly independent variables in **e(rank)** for postestimation commands.

Now, I use **myregress6** to include **rep78** as a factor-operated variable. The base category is dropped because we included a constant term.

**Example 2: myregress6 with a factor-operated variable**

. myregress6 price mpg trunk i.rep78 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- mpg | -262.7053 73.49434 -3.57 0.000 -406.7516 -118.6591 trunk | 41.75706 93.9671 0.44 0.657 -142.4151 225.9292 | rep78 | 2 | 654.7905 2136.246 0.31 0.759 -3532.175 4841.756 3 | 1170.606 2001.739 0.58 0.559 -2752.73 5093.941 4 | 1473.352 2017.138 0.73 0.465 -2480.167 5426.87 5 | 2896.888 2121.206 1.37 0.172 -1260.599 7054.375 | _cons | 9726.377 2790.009 3.49 0.000 4258.06 15194.69 ------------------------------------------------------------------------------

**Done and undone**

I modified the **OLS** command discussed in Programming an estimation command in Stata: A better OLS command to allow for sample restrictions, to handle missing values, to allow for factor variables, and to deal with perfectly collinear variables. In the next post, I show how to allow options for robust standard errors and to suppress the constant term.