## Programming an estimation command in Stata: Adding robust and cluster-robust VCEs to our Mata-based OLS command

I show how to use the undocumented command _vce_parse to parse the options for robust or cluster-robust estimators of the variance-covariance of the estimator (**VCE**). I then discuss **myregress12.ado**, which performs its computations in Mata and computes **VCE** estimators based on independently and identically distributed (**IID**) observations, robust methods, or cluster-robust methods.

**myregress12.ado** performs ordinary least-squares (**OLS**) regression, and it extends **myregress11.ado**, which I discussed in Programming an estimation command in Stata: An OLS command using Mata. To get the most out of this post, you should be familiar with Programming an estimation command in Stata: Using a subroutine to parse a complex option and Programming an estimation command in Stata: Computing OLS objects in Mata.

This is the sixteenth post in the series **Programming an estimation command in Stata**. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.

**Parsing the vce() option**

I used ado-subroutines to simplify the parsing of the options **vce(robust)** and **vce(cluster** *cvarname***)** in **myregress10.ado**; see Programming an estimation command in Stata: Using a subroutine to parse a complex option. Part of the point was to illustrate how to write ado-subroutines and the programming tricks that I used in these subroutines.

Here I use the undocumented command _vce_parse to simplify the parsing. There are many undocumented commands designed to help Stata programmers. They are undocumented in that they are tersely documented in the system help but not documented in the manuals. In addition, the syntax or behavior of these commands may change over Stata releases, although this rarely happens.

**_vce_parse** helps Stata programmers parse the **vce()** option. To see how it works, consider the problem of parsing the syntax of **myregress12**.

**myregress12** *depvar* [*indepvars*] [if] [in] [, **vce( robust** |

__cl__uster*clustervar*

**)**

**]**

__nocons__tantwhere *indepvars* can contain factor variables or time-series variables.

I can use the **syntax** command to put whatever the user specifies in the option **vce()** into the local macro **vce**, but I still must (1) check that what was specified makes sense and (2) create local macros that the code can use to do the right thing. Examples 1–7 create the local macro **vce**, simulating what **syntax** would do, and then use **_vce_robust** to perform tasks (1) and (2).

I begin with the case in which the user specified **vce(robust)**; here the local macro **vce** would contain the word **robust**.

**Example 1: parsing vce(robust)**

. clear all . sysuse auto (1978 Automobile Data) . local vce "robust" . _vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce') . return list macros: r(robust) : "robust" r(vceopt) : "vce(robust)" r(vce) : "robust"

The command

**_vce_parse** , **optlist(Robust) argoptlist(CLuster) :** , **vce(`vce’)**

has two pieces. The piece before the colon (**:**) specifies the rules; the piece after the colon specifies what the user typed. Each piece can have a Stata object followed by some options; note the commas before **optlist(Robust)** and before **vce(`vce’)**. In the case at hand, the second piece only contains what the user specified – **vce(robust)** – and the first piece only contains the options **optlist(Robust)** and **argoptlist(CLuster)**. The option **optlist(Robust)** specifies that the **vce()** option in the second piece may contain the option **robust** and that its minimal abbreviation is **r**. Note how the word **Robust** in **optlist(Robust)** mimics how **syntax** specifies minimum abbreviations. The option **argoptlist(CLuster)** specifies that the **vce()** option in the second piece may contain **cluster** *clustervar*, that the minimum abbreviation of **cluster** is **cl**, and that it will put the argument **clustervar** into a local macro.

After the command,

**_vce_parse** , **optlist(Robust) argoptlist(CLuster) : **, **vce(`vce’)**

I use **return list** to show what **_vce_parse** stored in **r()**. Because local macro **vce** contains “robust”, **_vce_parse**

- puts the word
**robust**in the local macro**r(robust)**; - puts what the user typed,
**vce(robust)**, in the local macro**r(vceopt)**; and - puts the type of
**VCE**,**robust**, in the local macro**r(vce)**.

Examples 2 and 3 illustrate that **_vce_parse** stores the same values in these local macros when the user specifies **vce(rob)** or **vce(r)**, which are valid abbreviations for **vce(robust)**.

**Example 2: parsing vce(rob)**

. local vce "rob" . _vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce') . return list macros: r(robust) : "robust" r(vceopt) : "vce(robust)" r(vce) : "robust"

**Example 3: parsing vce(r)**

. local vce "r" . _vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce') . return list macros: r(robust) : "robust" r(vceopt) : "vce(robust)" r(vce) : "robust"

Now, consider parsing the option **vce(cluster** *clustervar***)**. Because the cluster variable *clustervar* may contain missing values, **_vce_parse** may need to update a sample-identification variable before it stores the name of the cluster variable in a local macro. In example 4, I use the command

**_vce_parse mytouse**, **optlist(Robust) argoptlist(CLuster) :** , **vce(`vce’)**

to handle the case when the user specifies **vce(cluster rep78)**. The results from the **tabulate** and **summarize** commands illustrate that **_vce_parse** updates the sample-identification variable **mytouse** to account for the missing observations in **rep78**.

**Example 4: parsing vce(cluster rep78)**

. generate byte mytouse = 1 . tabulate mytouse mytouse | Freq. Percent Cum. ------------+----------------------------------- 1 | 74 100.00 100.00 ------------+----------------------------------- Total | 74 100.00 . summarize rep78 Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- rep78 | 69 3.405797 .9899323 1 5 . local vce "cluster rep78" . _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce') . return list macros: r(robust) : "robust" r(cluster) : "rep78" r(vceopt) : "vce(cluster rep78)" r(vceargs) : "rep78" r(vce) : "cluster" . tabulate mytouse mytouse | Freq. Percent Cum. ------------+----------------------------------- 0 | 5 6.76 6.76 1 | 69 93.24 100.00 ------------+----------------------------------- Total | 74 100.00

I use **return list** to show what **_vce_parse** stored in **r()**. Because local macro **vce** contains **cluster rep78**, **_vce_parse**

- puts the word
**robust**in the local macro**r(robust)**; - puts the name of the cluster variable,
**rep78**, in the local macro**r(cluster)**; - puts what the user typed,
**vce(cluster rep78)**, in the local macro**r(vceopt)**; - puts the argument to the
**cluster**option,**rep78**, in the local macro**r(vceargs)**; and - puts the type of
**VCE**,**cluster**, in the local macro**r(vce)**.

Examples 5 and 6 illustrate that **_vce_parse** stores the same values in these local macros when the user specifies **vce(clus rep78)** or **vce(cl rep78)**, which are valid abbreviations for **vce(cluster rep78)**.

**Example 5: parsing vce(clus rep78)**

. local vce "clus rep78" . _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce') . return list macros: r(robust) : "robust" r(cluster) : "rep78" r(vceopt) : "vce(cluster rep78)" r(vceargs) : "rep78" r(vce) : "cluster"

**Example 6: parsing vce(cl rep78)**

. local vce "cl rep78" . _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce') . return list macros: r(robust) : "robust" r(cluster) : "rep78" r(vceopt) : "vce(cluster rep78)" r(vceargs) : "rep78" r(vce) : "cluster"

Having illustrated how to make **_vce_parse** handle the cases when the user specifies something valid, I will show in example 7 that it will also produce a standard error message when the user specifies an error condition.

**Example 7: parsing vce(silly)**

. local vce "silly" . capture noisily _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vc > e(`vce') vcetype 'silly' not allowed . return list

**_vce_parse** can parse other types of **vce()** options; to see them type **help _vce_parse**.

Also, remember to type undocumented when you are looking for a programmer’s tool.

**The code for myregress12**

Here is the code for **myregress12.ado**, which uses **_vce_parse**. I describe how it works below.

I recommend that you click on the file name to download the code for my myregress12.ado. To avoid scrolling, view the code in the do-file editor, or your favorite text editor, to see the line numbers.

**Code block 1: myregress12.ado**

*! version 12.0.0 16Jan2016 program define myregress12, eclass sortpreserve version 14.1 syntax varlist(numeric ts fv) [if] [in] [, noCONStant vce(string) ] marksample touse _vce_parse `touse' , optlist(Robust) argoptlist(CLuster) : , vce(`vce') local vce "`r(vce)'" local clustervar "`r(cluster)'" if "`vce'" == "robust" | "`vce'" == "cluster" { local vcetype "Robust" } if "`clustervar'" != "" { capture confirm numeric variable `clustervar' if _rc { display in red "invalid vce() option" display in red "cluster variable {bf:`clustervar'} is " /// "string variable instead of a numeric variable" exit(198) } sort `clustervar' } gettoken depvar indepvars : varlist _fv_check_depvar `depvar' fvexpand `indepvars' local cnames `r(varlist)' tempname b V N rank df_r mata: mywork("`depvar'", "`cnames'", "`touse'", "`constant'", /// "`vce'", "`clustervar'", /// "`b'", "`V'", "`N'", "`rank'", "`df_r'") if "`constant'" == "" { local cnames `cnames' _cons } matrix colnames `b' = `cnames' matrix colnames `V' = `cnames' matrix rownames `V' = `cnames' ereturn post `b' `V', esample(`touse') buildfvinfo ereturn scalar N = `N' ereturn scalar rank = `rank' ereturn scalar df_r = `df_r' ereturn local vce "`vce'" ereturn local vcetype "`vcetype'" ereturn local clustvar "`clustervar'" ereturn local cmd "myregress12" ereturn display end mata: void mywork( string scalar depvar, string scalar indepvars, string scalar touse, string scalar constant, string scalar vcetype, string scalar clustervar, string scalar bname, string scalar Vname, string scalar nname, string scalar rname, string scalar dfrname) { real vector y, b, e, e2, cvar, ei real matrix X, XpXi, M, info, xi real scalar n, p, k, nc, i, dfr y = st_data(., depvar, touse) X = st_data(., indepvars, touse) n = rows(X) if (constant == "") { X = X,J(n,1,1) } XpXi = quadcross(X, X) XpXi = invsym(XpXi) b = XpXi*quadcross(X, y) e = y - X*b e2 = e:^2 p = cols(X) k = p - diag0cnt(XpXi) if (vcetype == "robust") { M = quadcross(X, e2, X) dfr = n - k V = (n/dfr)*XpXi*M*XpXi } else if (vcetype == "cluster") { cvar = st_data(., clustervar, touse) info = panelsetup(cvar, 1) nc = rows(info) M = J(k, k, 0) dfr = nc - 1 for(i=1; i<=nc; i++) { xi = panelsubmatrix(X,i,info) ei = panelsubmatrix(e,i,info) M = M + xi'*(ei*ei')*xi } V = ((n-1)/(n-k))*(nc/(nc-1))*XpXi*M*XpXi } else { // vcetype must IID dfr = n - k V = (quadsum(e2)/dfr)*XpXi } st_matrix(bname, b') st_matrix(Vname, V) st_numscalar(nname, n) st_numscalar(rname, k) st_numscalar(dfrname, dfr) } end

Let’s break this 118-line program into familiar pieces. Lines 2-56 define the ado-command, and lines 58-118 define the Mata work function that is used by the ado-command. Despite the addition of details to handle the parsing and computation of a robust or cluster-robust **VCE**, the structures of the ado-command and of the Mata work function are the same as they were in **myregress11.ado**; see Programming an estimation command in Stata: An OLS command using Mata.

The ado-command has four parts.

- Lines 5-31 parse what the user typed, identify the sample, and create temporary names for the results returned by our Mata work function.
- Lines 33-35 call the Mata work function.
- Lines 37-52 post the results returned by the Mata work function to
**e()**. - Line 54 displays the results.

The Mata function **mywork()** also has four parts.

- Lines 60-65 parse the arguments.
- Lines 68-70 declare vectors, matrices, and scalars that are local to
**mywork()**. - Lines 80-108 compute the results.
- Lines 110-114 copy the computed results to Stata, using the names that were passed in the arguments.

Now, I address the details of the ado-code, although I do not discuss details in **myregress12.ado**, which I already covered when describing **myregress11.ado** in Programming an estimation command in Stata: An OLS command using Mata. Line 5 allows the user to specify the **vce()** option, and line 8 uses **_vce_parse** to parse what the user specifies. Lines 9 and 10 put the type of **VCE** found by **_vce_parse** in the local macro **vce** and the name of the cluster variable, if specified, in the local macro **clustervar**. Lines 11-13 put **Robust** in the local **vcetype**, if the specified **vce** is either robust or cluster. If there is a cluster variable, lines 14–23 check that it is numeric and use it to sort the data.

Line 34 passes the new arguments for the type of **VCE** and the name of the cluster variable to the Mata work function **mywork()**.

Lines 49–51 store the type of **VCE**, the output label for the **VCE** type, and the name of the cluster variable in **e()**, respectively.

Now, I address the details of the Mata work function **mywork()** but only discussing what I have added to **mywork()** in **myregress11.ado**. Line 62 declares the new arguments. The string scalar **vcetype** is empty, or it contains “robust”, or it contains “cluster”. The string scalar **clustervar** is either empty or contains the name of the cluster variable.

Lines 68–70 declare the local-to-the-function vectors **cvar** and **ei** and the local-to-the-function matrices **M,** **info**, and **xi** that are needed now but not previously.

Lines 87, 91–92, 104–105, and 108 specify if-else blocks to compute the correct **VCE**. Lines 88–90 compute a robust estimator of the **VCE** if **vcetype** contains “robust”. Lines 93–103 compute a cluster-robust estimator of the **VCE** if **vcetype** contains “cluster”. Lines 106–107 compute an **IID**-based estimator of the **VCE** if **vcetype** contains neither “robust” nor “cluster”.

**Done and undone**

I introduced the undocumented command _vce_parse and discussed the code for **myregress12.ado**, which uses Mata to compute **OLS** point estimates and an **IID**-based **VCE**, a robust **VCE**, or a cluster-robust **VCE**.

The structure of the code is the same as the one that I used in **myregress11.ado** and in **mymean8.ado**, which I discussed in Programming an estimation command in Stata: An OLS command using Mata and in Programming an estimation command in Stata: A first ado-command using Mata. That the structure remains the same makes it easier to handle the details that arise in more complicated problems.