## Programming an estimation command in Stata: An OLS command using Mata

I discuss a command that computes ordinary least-squares (**OLS**) results in Mata, paying special attention to the structure of Stata programs that use Mata work functions.

This command builds on several previous posts; at a minimum, you should be familiar with Programming an estimation command in Stata: A first ado-command using Mata and Programming an estimation command in Stata: Computing OLS objects in Mata.

This is the fifteenth post in the series **Programming an estimation command in Stata**. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.

**An OLS command with Mata computations**

The Stata command **myregress11** computes the results in Mata. The syntax of the **myregress11** command is

**myregress11** *depvar* [*indepvars*] [if] [in] [, ** noconstant**]

where *indepvars* can contain factor variables or time-series variables.

In the remainder of this post, I discuss the code for **myregress11.ado**. I recommend that you click on the file name to download the code. To avoid scrolling, view the code in the do-file editor, or your favorite text editor, to see the line numbers.

I do not discuss the formulas for the **OLS** objects. See Programming an estimation command in Stata: Computing OLS objects in Mata for the formulas and Mata implementations.

**Code block 1: myregress11.ado**

*! version 11.0.0 11Jan2016 program define myregress11, eclass sortpreserve version 14 syntax varlist(numeric ts fv) [if] [in] [, noCONStant ] marksample touse gettoken depvar indepvars : varlist _fv_check_depvar `depvar' fvexpand `indepvars' local cnames `r(varlist)' tempname b V N rank df_r mata: mywork("`depvar'", "`cnames'", "`touse'", "`constant'", /// "`b'", "`V'", "`N'", "`rank'", "`df_r'") if "`constant'" == "" { local cnames `cnames' _cons } matrix colnames `b' = `cnames' matrix colnames `V' = `cnames' matrix rownames `V' = `cnames' ereturn post `b' `V', esample(`touse') buildfvinfo ereturn scalar N = `N' ereturn scalar rank = `rank' ereturn scalar df_r = `df_r' ereturn local cmd "myregress11" ereturn display end mata: void mywork( string scalar depvar, string scalar indepvars, string scalar touse, string scalar constant, string scalar bname, string scalar Vname, string scalar nname, string scalar rname, string scalar dfrname) { real vector y, b, e, e2 real matrix X, XpXi real scalar n, k y = st_data(., depvar, touse) X = st_data(., indepvars, touse) n = rows(X) if (constant == "") { X = X,J(n,1,1) } XpXi = quadcross(X, X) XpXi = invsym(XpXi) b = XpXi*quadcross(X, y) e = y - X*b e2 = e:^2 k = cols(X) - diag0cnt(XpXi) V = (quadsum(e2)/(n-k))*XpXi st_matrix(bname, b') st_matrix(Vname, V) st_numscalar(nname, n) st_numscalar(rname, k) st_numscalar(dfrname, n-k) } end

Let’s break this 74-line program into familiar pieces to make it easier to understand. Lines 2–35 define the ado-command, and lines 37–74 define the Mata work function that is used by the ado-command. Although there are more details, I used this structure in **mymean8.ado**, which I discussed in Programming an estimation command in Stata: A first ado-command using Mata.

The ado-command has four parts.

- Lines 5–14 parse what the user typed, identify the sample, and create temporary names for the results returned by our Mata work function.
- Lines 16-17 call the Mata work function.
- Lines 19–31 post the results returned by the Mata work function to
**e()**. - Line 33 displays the results.

The Mata function **mywork()** also has four parts.

- Lines 39–43 parse the arguments.
- Lines 46–48 declare vectors, matrices, and scalars that are local to
**mywork()**. - Lines 54–64 compute the results.
- Lines 66–70 copy the computed results to Stata, using the names that were passed in arguments.

Now let’s discuss the ado-code in some detail. Line 5 uses the **syntax** command to put the names of the variables specified by the user into the local macro **varlist**, to parse an optional **if** restriction, to parse an optional **in** restriction, and to parse the **noconstant** option. The variables specified by the user must be numeric, may contain time-series variables, and may contain factor variables. For more detailed discussions of similar **syntax** usages, see Programming an estimation command in Stata: Allowing for sample restrictions and factor variables, Programming an estimation command in Stata: Allowing for options, and Programming an estimation command in Stata: Using a subroutine to parse a complex option.

Line 6 creates a temporary sample-identification variable and stores its name in the local macro **touse**. I discussed this usage in detail in the section surrounding code block 1 in Programming an estimation command in Stata: Allowing for sample restrictions and factor variables.

Line 8 uses **gettoken** to store the name of the dependent variable in the local macro **depvar** and the names of the independent variables in the local macro **indepvars**. Line 9 uses **_fv_check_depvar** to check that the name of the dependent variable is not a factor variable.

Line 11 uses **fvexpand** to expand the factor variables in **indepvars**. Line 12 puts the expanded names stored in **r(varlist)** by **fvexpand** in the local macro **cnames**. A single factor variable can imply more than one coefficient. **fvexpand** finds the canonical names for these coefficients and returns them in **r(varlist)**. I have not used **fvexpand** until now because the Stata commands that I used to compute the results automatically created the coefficient names. Mata functions are designed for speed, so I must create the coefficient names when I use them.

Example 1 illustrates how one factor variable can imply more than one coefficient.

**Example 1: fvexpand**

. sysuse auto (1978 Automobile Data) . tabulate rep78 Repair | Record 1978 | Freq. Percent Cum. ------------+----------------------------------- 1 | 2 2.90 2.90 2 | 8 11.59 14.49 3 | 30 43.48 57.97 4 | 18 26.09 84.06 5 | 11 15.94 100.00 ------------+----------------------------------- Total | 69 100.00 . fvexpand i.rep78 . return list macros: r(fvops) : "true" r(varlist) : "1b.rep78 2.rep78 3.rep78 4.rep78 5.rep78" . summarize 2.rep78 Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- 2.rep78 | 69 .115942 .3225009 0 1

The **tabulate** results show that there are five levels in **rep78**. **fvexpand** finds the levels and creates a list of the names of the implied indicator variables **1b.rep78**, **2.rep78**, **3.rep78**, **4.rep78**, and **5.rep78**. Comparing the results from **summarize 2.rep78** and **tabulate rep78** illustrates this notation. The **b** in **1b.rep78** identifies level **1** as the base category to be omitted when there is a constant in the model. Type help fvvarlist for more details.

Line 14 creates the temporary names for the results. For example, it stores a safe, temporary name in the local macro **b** that can be used for the matrix storing the point estimates. I discussed this usage in the section **Using temporary names for global objects** in Programming an estimation command in Stata: A first ado-command using Mata.

Lines 16 and 17 call the Mata function **mywork()**, which uses the information contained in the local macros **depvar**, **cnames**, **touse**, and **constant** to compute the results that are returned in the Stata objects whose names are stored in the local macros **b**, **V**, **N**, **rank**, and **df_r**.

Line 20 appends **_cons** to the local macro **cnames**, if the user specified the option **noconstant**.

Lines 23–25 put row names on the vector of point estimates and row and column names on the matrix containing the estimated variance-covariance of the estimator (**VCE**).

Lines 27–31 post the results to **e()**.

Line 33 displays a standard Stata output table, using the results in **e(b)**, **e(V)**, and **e(df_r)**.

Note that the local macro **b** created on line 14 contains a temporary name that is passed to **mywork()** on line 17 and that the Stata matrix whose name is contained in the local macro **b** is used on lines 23 and 27. **mywork()** puts the vector of point estimates in the Stata matrix whose name is stored in the local macro **b**. Also note that the local macro **V** created on line 14 contains a temporary name that is passed to **mywork()** on line 17 and that the Stata matrix whose name is contained in the local macro **V** is used on lines 24, 25, and 27. **mywork()** puts the estimated **VCE** in the Stata matrix whose name is stored in the local macro **V**.

To see how this works, let’s discuss the **mywork()** function in detail. Lines 39–43 declare that **mywork()** returns nothing, it is **void**, and declare that **mywork()** accepts nine arguments, each of which is a string scalar. The first four arguments are inputs; **depvar** contains the name of the independent variable, **indepvars** contains the names of the independent variables, **touse** contains the name of the sample-identification variable, and **constant** contains either **noconstant** or is empty. The values of these arguments are used on lines 50, 51, and 54 to create the vector **y** and the matrix **X**.

The last five arguments contain names used to write the results back to Stata. **mywork()** writes the results back to Stata using the passed-in temporary names. For example, line 17 shows that the Mata string scalar **bname** contains the temporary name stored in the local macro **b**. Line 66 copies the results stored in the transpose of the Mata vector **b** to a Stata matrix whose name is stored in the Mata string scalar **bname**. (Line 60 shows that the vector **b** contains the **OLS** point estimates.) Lines 23 and 27 then use this Stata vector whose name is stored in the local macro **b**. Similarly, line 17 shows that the Mata string scalar **Vname** contains the temporary name stored in the local macro **V**. Line 67 copies the results stored in the Mata matrix **V** to a Stata matrix whose name is stored in the Mata string scalar **Vname**. (Line 64 shows that **V** contains the estimated **VCE**.) Lines 24, 25, and 27 then use this Stata matrix whose name is stored in the local macro **V**. The arguments **nname**, **rname**, and **dfrname** are used analogously to return the results for the number of observations, the rank of the **VCE**, and the degrees of freedom of the residuals.

Lines 50–64 compute the point estimates and the **VCE**. Except for line 54, I discussed these computations in Programming an estimation command in Stata: A first ado-command using Mata. Line 54 causes a column of 1s to be joined to the covariate matrix **X** when the string scalar **constant** is empty. Lines 5 and 16 imply that the Mata string scalar **constant** contains **noconstant** when the user specifies the **noconstant** option and that it is empty otherwise.

**Done and undone**

I discussed the code for **myregress11.ado**, which uses Mata to compute **OLS** point estimates and a **VCE** that assumes independent and identically distributed observations. The structure of the code is the same as the one that I used in **mymean7.ado** and **mymean8.ado**, discussed in Programming an estimation command in Stata: A first ado-command using Mata, although there are more details in the **OLS** program.

Key to this structure is that the Mata work function accepts two types of arguments: the names of Stata objects that are inputs and temporary names that are used to write the results back to Stata from Mata.

In the next post, I extend **myregress11.ado** to allow for robust or cluster-robust estimators of the **VCE**.