Programming an estimation command in Stata: An OLS command using Mata
I discuss a command that computes ordinary least-squares (OLS) results in Mata, paying special attention to the structure of Stata programs that use Mata work functions.
This command builds on several previous posts; at a minimum, you should be familiar with Programming an estimation command in Stata: A first ado-command using Mata and Programming an estimation command in Stata: Computing OLS objects in Mata.
This is the fifteenth post in the series Programming an estimation command in Stata. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.
An OLS command with Mata computations
The Stata command myregress11 computes the results in Mata. The syntax of the myregress11 command is
myregress11 depvar [indepvars] [if] [in] [, noconstant]
where indepvars can contain factor variables or time-series variables.
In the remainder of this post, I discuss the code for myregress11.ado. I recommend that you click on the file name to download the code. To avoid scrolling, view the code in the do-file editor, or your favorite text editor, to see the line numbers.
I do not discuss the formulas for the OLS objects. See Programming an estimation command in Stata: Computing OLS objects in Mata for the formulas and Mata implementations.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 | *! version 11.0.0 11Jan2016 program define myregress11, eclass sortpreserve version 14 syntax varlist(numeric ts fv) [if] [in] [, noCONStant ] marksample touse gettoken depvar indepvars : varlist _fv_check_depvar `depvar' fvexpand `indepvars' local cnames `r(varlist)' tempname b V N rank df_r mata: mywork("`depvar'", "`cnames'", "`touse'", "`constant'", /// "`b'", "`V'", "`N'", "`rank'", "`df_r'") if "`constant'" == "" { local cnames `cnames' _cons } matrix colnames `b' = `cnames' matrix colnames `V' = `cnames' matrix rownames `V' = `cnames' ereturn post `b' `V', esample(`touse') buildfvinfo ereturn scalar N = `N' ereturn scalar rank = `rank' ereturn scalar df_r = `df_r' ereturn local cmd "myregress11" ereturn display end mata: void mywork( string scalar depvar, string scalar indepvars, string scalar touse, string scalar constant, string scalar bname, string scalar Vname, string scalar nname, string scalar rname, string scalar dfrname) { real vector y, b, e, e2 real matrix X, XpXi real scalar n, k y = st_data(., depvar, touse) X = st_data(., indepvars, touse) n = rows(X) if (constant == "") { X = X,J(n,1,1) } XpXi = quadcross(X, X) XpXi = invsym(XpXi) b = XpXi*quadcross(X, y) e = y - X*b e2 = e:^2 k = cols(X) - diag0cnt(XpXi) V = (quadsum(e2)/(n-k))*XpXi st_matrix(bname, b') st_matrix(Vname, V) st_numscalar(nname, n) st_numscalar(rname, k) st_numscalar(dfrname, n-k) } end |
Let’s break this 74-line program into familiar pieces to make it easier to understand. Lines 2–35 define the ado-command, and lines 37–74 define the Mata work function that is used by the ado-command. Although there are more details, I used this structure in mymean8.ado, which I discussed in Programming an estimation command in Stata: A first ado-command using Mata.
The ado-command has four parts.
- Lines 5–14 parse what the user typed, identify the sample, and create temporary names for the results returned by our Mata work function.
- Lines 16-17 call the Mata work function.
- Lines 19–31 post the results returned by the Mata work function to e().
- Line 33 displays the results.
The Mata function mywork() also has four parts.
- Lines 39–43 parse the arguments.
- Lines 46–48 declare vectors, matrices, and scalars that are local to mywork().
- Lines 54–64 compute the results.
- Lines 66–70 copy the computed results to Stata, using the names that were passed in arguments.
Now let’s discuss the ado-code in some detail. Line 5 uses the syntax command to put the names of the variables specified by the user into the local macro varlist, to parse an optional if restriction, to parse an optional in restriction, and to parse the noconstant option. The variables specified by the user must be numeric, may contain time-series variables, and may contain factor variables. For more detailed discussions of similar syntax usages, see Programming an estimation command in Stata: Allowing for sample restrictions and factor variables, Programming an estimation command in Stata: Allowing for options, and Programming an estimation command in Stata: Using a subroutine to parse a complex option.
Line 6 creates a temporary sample-identification variable and stores its name in the local macro touse. I discussed this usage in detail in the section surrounding code block 1 in Programming an estimation command in Stata: Allowing for sample restrictions and factor variables.
Line 8 uses gettoken to store the name of the dependent variable in the local macro depvar and the names of the independent variables in the local macro indepvars. Line 9 uses _fv_check_depvar to check that the name of the dependent variable is not a factor variable.
Line 11 uses fvexpand to expand the factor variables in indepvars. Line 12 puts the expanded names stored in r(varlist) by fvexpand in the local macro cnames. A single factor variable can imply more than one coefficient. fvexpand finds the canonical names for these coefficients and returns them in r(varlist). I have not used fvexpand until now because the Stata commands that I used to compute the results automatically created the coefficient names. Mata functions are designed for speed, so I must create the coefficient names when I use them.
Example 1 illustrates how one factor variable can imply more than one coefficient.
Example 1: fvexpand
. sysuse auto (1978 Automobile Data) . tabulate rep78 Repair | Record 1978 | Freq. Percent Cum. ------------+----------------------------------- 1 | 2 2.90 2.90 2 | 8 11.59 14.49 3 | 30 43.48 57.97 4 | 18 26.09 84.06 5 | 11 15.94 100.00 ------------+----------------------------------- Total | 69 100.00 . fvexpand i.rep78 . return list macros: r(fvops) : "true" r(varlist) : "1b.rep78 2.rep78 3.rep78 4.rep78 5.rep78" . summarize 2.rep78 Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- 2.rep78 | 69 .115942 .3225009 0 1
The tabulate results show that there are five levels in rep78. fvexpand finds the levels and creates a list of the names of the implied indicator variables 1b.rep78, 2.rep78, 3.rep78, 4.rep78, and 5.rep78. Comparing the results from summarize 2.rep78 and tabulate rep78 illustrates this notation. The b in 1b.rep78 identifies level 1 as the base category to be omitted when there is a constant in the model. Type help fvvarlist for more details.
Line 14 creates the temporary names for the results. For example, it stores a safe, temporary name in the local macro b that can be used for the matrix storing the point estimates. I discussed this usage in the section Using temporary names for global objects in Programming an estimation command in Stata: A first ado-command using Mata.
Lines 16 and 17 call the Mata function mywork(), which uses the information contained in the local macros depvar, cnames, touse, and constant to compute the results that are returned in the Stata objects whose names are stored in the local macros b, V, N, rank, and df_r.
Line 20 appends _cons to the local macro cnames, if the user specified the option noconstant.
Lines 23–25 put row names on the vector of point estimates and row and column names on the matrix containing the estimated variance-covariance of the estimator (VCE).
Lines 27–31 post the results to e().
Line 33 displays a standard Stata output table, using the results in e(b), e(V), and e(df_r).
Note that the local macro b created on line 14 contains a temporary name that is passed to mywork() on line 17 and that the Stata matrix whose name is contained in the local macro b is used on lines 23 and 27. mywork() puts the vector of point estimates in the Stata matrix whose name is stored in the local macro b. Also note that the local macro V created on line 14 contains a temporary name that is passed to mywork() on line 17 and that the Stata matrix whose name is contained in the local macro V is used on lines 24, 25, and 27. mywork() puts the estimated VCE in the Stata matrix whose name is stored in the local macro V.
To see how this works, let’s discuss the mywork() function in detail. Lines 39–43 declare that mywork() returns nothing, it is void, and declare that mywork() accepts nine arguments, each of which is a string scalar. The first four arguments are inputs; depvar contains the name of the independent variable, indepvars contains the names of the independent variables, touse contains the name of the sample-identification variable, and constant contains either noconstant or is empty. The values of these arguments are used on lines 50, 51, and 54 to create the vector y and the matrix X.
The last five arguments contain names used to write the results back to Stata. mywork() writes the results back to Stata using the passed-in temporary names. For example, line 17 shows that the Mata string scalar bname contains the temporary name stored in the local macro b. Line 66 copies the results stored in the transpose of the Mata vector b to a Stata matrix whose name is stored in the Mata string scalar bname. (Line 60 shows that the vector b contains the OLS point estimates.) Lines 23 and 27 then use this Stata vector whose name is stored in the local macro b. Similarly, line 17 shows that the Mata string scalar Vname contains the temporary name stored in the local macro V. Line 67 copies the results stored in the Mata matrix V to a Stata matrix whose name is stored in the Mata string scalar Vname. (Line 64 shows that V contains the estimated VCE.) Lines 24, 25, and 27 then use this Stata matrix whose name is stored in the local macro V. The arguments nname, rname, and dfrname are used analogously to return the results for the number of observations, the rank of the VCE, and the degrees of freedom of the residuals.
Lines 50–64 compute the point estimates and the VCE. Except for line 54, I discussed these computations in Programming an estimation command in Stata: A first ado-command using Mata. Line 54 causes a column of 1s to be joined to the covariate matrix X when the string scalar constant is empty. Lines 5 and 16 imply that the Mata string scalar constant contains noconstant when the user specifies the noconstant option and that it is empty otherwise.
Done and undone
I discussed the code for myregress11.ado, which uses Mata to compute OLS point estimates and a VCE that assumes independent and identically distributed observations. The structure of the code is the same as the one that I used in mymean7.ado and mymean8.ado, discussed in Programming an estimation command in Stata: A first ado-command using Mata, although there are more details in the OLS program.
Key to this structure is that the Mata work function accepts two types of arguments: the names of Stata objects that are inputs and temporary names that are used to write the results back to Stata from Mata.
In the next post, I extend myregress11.ado to allow for robust or cluster-robust estimators of the VCE.