Home > Programming > Programming an estimation command in Stata: Adding robust and cluster-robust VCEs to our Mata-based OLS command

Programming an estimation command in Stata: Adding robust and cluster-robust VCEs to our Mata-based OLS command


I show how to use the undocumented command _vce_parse to parse the options for robust or cluster-robust estimators of the variance-covariance of the estimator (VCE). I then discuss myregress12.ado, which performs its computations in Mata and computes VCE estimators based on independently and identically distributed (IID) observations, robust methods, or cluster-robust methods.

myregress12.ado performs ordinary least-squares (OLS) regression, and it extends myregress11.ado, which I discussed in Programming an estimation command in Stata: An OLS command using Mata. To get the most out of this post, you should be familiar with Programming an estimation command in Stata: Using a subroutine to parse a complex option and Programming an estimation command in Stata: Computing OLS objects in Mata.

This is the sixteenth post in the series Programming an estimation command in Stata. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.

Parsing the vce() option

I used ado-subroutines to simplify the parsing of the options vce(robust) and vce(cluster cvarname) in myregress10.ado; see Programming an estimation command in Stata: Using a subroutine to parse a complex option. Part of the point was to illustrate how to write ado-subroutines and the programming tricks that I used in these subroutines.

Here I use the undocumented command _vce_parse to simplify the parsing. There are many undocumented commands designed to help Stata programmers. They are undocumented in that they are tersely documented in the system help but not documented in the manuals. In addition, the syntax or behavior of these commands may change over Stata releases, although this rarely happens.

_vce_parse helps Stata programmers parse the vce() option. To see how it works, consider the problem of parsing the syntax of myregress12.

myregress12 depvar [indepvars] [if] [in] [, vce(robust | cluster clustervar) noconstant]

where indepvars can contain factor variables or time-series variables.

I can use the syntax command to put whatever the user specifies in the option vce() into the local macro vce, but I still must (1) check that what was specified makes sense and (2) create local macros that the code can use to do the right thing. Examples 1–7 create the local macro vce, simulating what syntax would do, and then use _vce_robust to perform tasks (1) and (2).

I begin with the case in which the user specified vce(robust); here the local macro vce would contain the word robust.

Example 1: parsing vce(robust)

. clear all

. sysuse auto
(1978 Automobile Data)

. local vce "robust"

. _vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce')

. return list

macros:
             r(robust) : "robust"
             r(vceopt) : "vce(robust)"
                r(vce) : "robust"

The command

_vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce’)

has two pieces. The piece before the colon (:) specifies the rules; the piece after the colon specifies what the user typed. Each piece can have a Stata object followed by some options; note the commas before optlist(Robust) and before vce(`vce’). In the case at hand, the second piece only contains what the user specified – vce(robust) – and the first piece only contains the options optlist(Robust) and argoptlist(CLuster). The option optlist(Robust) specifies that the vce() option in the second piece may contain the option robust and that its minimal abbreviation is r. Note how the word Robust in optlist(Robust) mimics how syntax specifies minimum abbreviations. The option argoptlist(CLuster) specifies that the vce() option in the second piece may contain cluster clustervar, that the minimum abbreviation of cluster is cl, and that it will put the argument clustervar into a local macro.

After the command,

_vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce’)

I use return list to show what _vce_parse stored in r(). Because local macro vce contains “robust”, _vce_parse

  1. puts the word robust in the local macro r(robust);
  2. puts what the user typed, vce(robust), in the local macro r(vceopt); and
  3. puts the type of VCE, robust, in the local macro r(vce).

Examples 2 and 3 illustrate that _vce_parse stores the same values in these local macros when the user specifies vce(rob) or vce(r), which are valid abbreviations for vce(robust).

Example 2: parsing vce(rob)

. local vce "rob"

. _vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce')

. return list

macros:
             r(robust) : "robust"
             r(vceopt) : "vce(robust)"
                r(vce) : "robust"

Example 3: parsing vce(r)

. local vce "r"

. _vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce')

. return list

macros:
             r(robust) : "robust"
             r(vceopt) : "vce(robust)"
                r(vce) : "robust"

Now, consider parsing the option vce(cluster clustervar). Because the cluster variable clustervar may contain missing values, _vce_parse may need to update a sample-identification variable before it stores the name of the cluster variable in a local macro. In example 4, I use the command

_vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce’)

to handle the case when the user specifies vce(cluster rep78). The results from the tabulate and summarize commands illustrate that _vce_parse updates the sample-identification variable mytouse to account for the missing observations in rep78.

Example 4: parsing vce(cluster rep78)

. generate byte mytouse = 1

. tabulate mytouse

    mytouse |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         74      100.00      100.00
------------+-----------------------------------
      Total |         74      100.00

. summarize rep78

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       rep78 |         69    3.405797    .9899323          1          5

. local vce "cluster rep78"

. _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce')

. return list

macros:
             r(robust) : "robust"
            r(cluster) : "rep78"
             r(vceopt) : "vce(cluster rep78)"
            r(vceargs) : "rep78"
                r(vce) : "cluster"

. tabulate mytouse

    mytouse |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          5        6.76        6.76
          1 |         69       93.24      100.00
------------+-----------------------------------
      Total |         74      100.00

I use return list to show what _vce_parse stored in r(). Because local macro vce contains cluster rep78, _vce_parse

  1. puts the word robust in the local macro r(robust);
  2. puts the name of the cluster variable, rep78, in the local macro r(cluster);
  3. puts what the user typed, vce(cluster rep78), in the local macro r(vceopt);
  4. puts the argument to the cluster option, rep78, in the local macro r(vceargs); and
  5. puts the type of VCE, cluster, in the local macro r(vce).

Examples 5 and 6 illustrate that _vce_parse stores the same values in these local macros when the user specifies vce(clus rep78) or vce(cl rep78), which are valid abbreviations for vce(cluster rep78).

Example 5: parsing vce(clus rep78)

. local vce "clus rep78"

. _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce')

. return list

macros:
             r(robust) : "robust"
            r(cluster) : "rep78"
             r(vceopt) : "vce(cluster rep78)"
            r(vceargs) : "rep78"
                r(vce) : "cluster"

Example 6: parsing vce(cl rep78)

. local vce "cl rep78"

. _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce')

. return list

macros:
             r(robust) : "robust"
            r(cluster) : "rep78"
             r(vceopt) : "vce(cluster rep78)"
            r(vceargs) : "rep78"
                r(vce) : "cluster"

Having illustrated how to make _vce_parse handle the cases when the user specifies something valid, I will show in example 7 that it will also produce a standard error message when the user specifies an error condition.

Example 7: parsing vce(silly)

. local vce "silly"

. capture noisily _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vc
> e(`vce')
vcetype 'silly' not allowed

. return list

_vce_parse can parse other types of vce() options; to see them type help _vce_parse.

Also, remember to type undocumented when you are looking for a programmer’s tool.

The code for myregress12

Here is the code for myregress12.ado, which uses _vce_parse. I describe how it works below.

I recommend that you click on the file name to download the code for my myregress12.ado. To avoid scrolling, view the code in the do-file editor, or your favorite text editor, to see the line numbers.

Code block 1: myregress12.ado

*! version 12.0.0  16Jan2016
program define myregress12, eclass sortpreserve
    version 14.1

    syntax varlist(numeric ts fv) [if] [in] [, noCONStant vce(string) ]
    marksample touse

    _vce_parse `touse' , optlist(Robust) argoptlist(CLuster) : , vce(`vce')
    local vce        "`r(vce)'"
    local clustervar "`r(cluster)'"
    if "`vce'" == "robust" | "`vce'" == "cluster" {
        local vcetype "Robust"
    }
    if "`clustervar'" != "" {
        capture confirm numeric variable `clustervar'
        if _rc {
            display in red "invalid vce() option"
            display in red "cluster variable {bf:`clustervar'} is " ///
                "string variable instead of a numeric variable"
            exit(198)
        }
        sort `clustervar'
    }

    gettoken depvar indepvars : varlist
    _fv_check_depvar `depvar'

    fvexpand `indepvars' 
    local cnames `r(varlist)'

    tempname b V N rank df_r

    mata: mywork("`depvar'", "`cnames'", "`touse'", "`constant'",    ///
       "`vce'", "`clustervar'",                                  /// 
       "`b'", "`V'", "`N'", "`rank'", "`df_r'") 

    if "`constant'" == "" {
        local cnames `cnames' _cons
    }

    matrix colnames `b' = `cnames'
    matrix colnames `V' = `cnames'
    matrix rownames `V' = `cnames'

    ereturn post `b' `V', esample(`touse') buildfvinfo
    ereturn scalar N        = `N'
    ereturn scalar rank     = `rank'
    ereturn scalar df_r     = `df_r'
    ereturn local  vce      "`vce'"
    ereturn local  vcetype  "`vcetype'"
    ereturn local  clustvar "`clustervar'"
    ereturn local  cmd      "myregress12"

    ereturn display

end

mata:

void mywork( string scalar depvar,  string scalar indepvars, 
             string scalar touse,   string scalar constant,  
             string scalar vcetype, string scalar clustervar,
             string scalar bname,   string scalar Vname,     
             string scalar nname,   string scalar rname,     
             string scalar dfrname) 
{

    real vector    y, b, e, e2, cvar, ei 
    real matrix    X, XpXi, M, info, xi 
    real scalar    n, p, k, nc, i, dfr

    y    = st_data(., depvar, touse)
    X    = st_data(., indepvars, touse)
    n    = rows(X)

    if (constant == "") {
        X    = X,J(n,1,1)
    }

    XpXi = quadcross(X, X)
    XpXi = invsym(XpXi)
    b    = XpXi*quadcross(X, y)
    e    = y - X*b
    e2   = e:^2
    p    = cols(X)
    k    = p - diag0cnt(XpXi)
    if (vcetype == "robust") {
        M    = quadcross(X, e2, X)
        dfr  = n - k
        V    = (n/dfr)*XpXi*M*XpXi
    }
    else if (vcetype == "cluster") {
        cvar = st_data(., clustervar, touse)
        info = panelsetup(cvar, 1)
        nc   = rows(info)
        M    = J(k, k, 0)
        dfr  = nc - 1
        for(i=1; i<=nc; i++) {
            xi = panelsubmatrix(X,i,info)
            ei = panelsubmatrix(e,i,info)
            M  = M + xi'*(ei*ei')*xi
        }
        V    = ((n-1)/(n-k))*(nc/(nc-1))*XpXi*M*XpXi
    }
    else {                 // vcetype must IID
        dfr  = n - k
        V    = (quadsum(e2)/dfr)*XpXi
    }

    st_matrix(bname, b')
    st_matrix(Vname, V)
    st_numscalar(nname, n)
    st_numscalar(rname, k)
    st_numscalar(dfrname, dfr)

}

end

Let’s break this 118-line program into familiar pieces. Lines 2-56 define the ado-command, and lines 58-118 define the Mata work function that is used by the ado-command. Despite the addition of details to handle the parsing and computation of a robust or cluster-robust VCE, the structures of the ado-command and of the Mata work function are the same as they were in myregress11.ado; see Programming an estimation command in Stata: An OLS command using Mata.

The ado-command has four parts.

  1. Lines 5-31 parse what the user typed, identify the sample, and create temporary names for the results returned by our Mata work function.
  2. Lines 33-35 call the Mata work function.
  3. Lines 37-52 post the results returned by the Mata work function to e().
  4. Line 54 displays the results.

The Mata function mywork() also has four parts.

  1. Lines 60-65 parse the arguments.
  2. Lines 68-70 declare vectors, matrices, and scalars that are local to mywork().
  3. Lines 80-108 compute the results.
  4. Lines 110-114 copy the computed results to Stata, using the names that were passed in the arguments.

Now, I address the details of the ado-code, although I do not discuss details in myregress12.ado, which I already covered when describing myregress11.ado in Programming an estimation command in Stata: An OLS command using Mata. Line 5 allows the user to specify the vce() option, and line 8 uses _vce_parse to parse what the user specifies. Lines 9 and 10 put the type of VCE found by _vce_parse in the local macro vce and the name of the cluster variable, if specified, in the local macro clustervar. Lines 11-13 put Robust in the local vcetype, if the specified vce is either robust or cluster. If there is a cluster variable, lines 14–23 check that it is numeric and use it to sort the data.

Line 34 passes the new arguments for the type of VCE and the name of the cluster variable to the Mata work function mywork().

Lines 49–51 store the type of VCE, the output label for the VCE type, and the name of the cluster variable in e(), respectively.

Now, I address the details of the Mata work function mywork() but only discussing what I have added to mywork() in myregress11.ado. Line 62 declares the new arguments. The string scalar vcetype is empty, or it contains “robust”, or it contains “cluster”. The string scalar clustervar is either empty or contains the name of the cluster variable.

Lines 68–70 declare the local-to-the-function vectors cvar and ei and the local-to-the-function matrices M, info, and xi that are needed now but not previously.

Lines 87, 91–92, 104–105, and 108 specify if-else blocks to compute the correct VCE. Lines 88–90 compute a robust estimator of the VCE if vcetype contains “robust”. Lines 93–103 compute a cluster-robust estimator of the VCE if vcetype contains “cluster”. Lines 106–107 compute an IID-based estimator of the VCE if vcetype contains neither “robust” nor “cluster”.

Done and undone

I introduced the undocumented command _vce_parse and discussed the code for myregress12.ado, which uses Mata to compute OLS point estimates and an IID-based VCE, a robust VCE, or a cluster-robust VCE.

The structure of the code is the same as the one that I used in myregress11.ado and in mymean8.ado, which I discussed in Programming an estimation command in Stata: An OLS command using Mata and in Programming an estimation command in Stata: A first ado-command using Mata. That the structure remains the same makes it easier to handle the details that arise in more complicated problems.