Home > Programming > Programming an estimation command in Stata: Preparing to write a plugin

Programming an estimation command in Stata: Preparing to write a plugin

This post is the first in a series that illustrates how to plug code written in another language (like C, C++, or Java) into Stata. This technique is known as writing a plugin or as writing a dynamic-link library (DLL) for Stata.

Plugins can be written for any task, including data management, graphical analysis, or statistical estimation. Per the theme of this series, I discuss plugins for estimation commands.

In this post, I discuss the tradeoffs of writing a plugin, and I discuss a simple program whose calculations I will replace with plugins in subsequent posts.

This is the 29th post in the series Programming an estimation command in Stata. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.

What is a plugin?

A plugin is one or more functions written in another language that you can call from Stata. Technically, the other program is linked into Stata when it is called, a process known as dynamically linking the library containing the other program in to Stata. For this reason, plugins are also known as DLLs.

People write plugins in other languages to solve really difficult problems. The three most common reasons that you might write a plugin for Stata are the following.

  1. You have code written in another language for a method that is not available in Stata or Mata.

  2. You have an implementation in Stata/Mata that does not utilize built-in, fast, vectorized functions, and you could make it faster by doing some of the calculations in a lower-level language like C.

  3. You develop your methods in a low-level language like C and then plug the methods in to Stata and other statistical software packages.

The problem with plugins is that they are difficult. Writing and compiling a plugin written in another language is much more difficult than just using Stata and Mata. In addition, the costs of a mistake are much higher. Plugins can be dangerous; they can corrupt memory or do other things that cause Stata to exit or otherwise “crash”.

It is hard to maintain plugins in C, C++, or Fortran. For these plugins, you must compile a version for every operating system on which you want it to work. For example, you might need compiled libraries for Windows, Mac, and a version of Linux or two. Furthermore, when the operating system gets a new version, you must recompile your plugin, and you might need to distribute versions for the old and new versions of the operating system.

Java plugins are much easier to maintain, but they are usually slower than plugins written in C, C++, or Fortran. You can distribute a single Java library to run on all the supported operating systems. You only need to recompile when changes in the Java environment require it.

Despite these difficulties, there are cases in which a plugin can make a Stata implementation usable or feasible. This series of posts illustrates how to write and compile a plugin in several different languages.

A mean estimator

I begin by discussing mymean10.ado, given in code block 1, which implements the command mymean10. mymean10 calculates the sample-average estimator for the mean and the variance–covariance of the estimator (VCE) for the variables specified by the user. mymean10 allows the user to specify an if restriction and an in range, it handles missing values in the specified variables, but it does not allow any options.

mymean10.ado uses Mata for its computations. Subsequent posts will replace these Mata computations with plugin computations.

Code block 1: mymean10.ado

*! version 10.0.0  13Feb2018
program define mymean10, eclass

    version 15.1

    syntax varlist(numeric) [if] [in]
    marksample touse
    tempname b V N

    mata: mymean_work("`varlist'", "`touse'", "`b'", "`V'", "`N'")

    matrix colnames `b'  = `varlist'
    matrix colnames `V'  = `varlist'
    matrix rownames `V'  = `varlist'
    ereturn post `b' `V', esample(`touse')
    ereturn scalar   N   = `N'
    ereturn scalar df_r  = `N'-1
    ereturn display
end

mata:
void mymean_work(string scalar vlist,          ///
    string scalar touse, string scalar bname,  ///
    string scalar vname, string scalar nname )
{
    real matrix X, E, V
    real vector b
    real scalar n

    X = st_data(., vlist, touse)
    b = mean(X)
    E = (X :- b)
    n = rows(E)
    V = (1/n)*(1/(n-1))*quadcross(E,E)

    st_matrix(bname, b)
    st_matrix(vname, V)
    st_numscalar(nname, n)
}
end

Note the structure of the program. Lines 6–7 parse what the user typed, line 10 uses the Mata work function mymean_work() to do the calculations, lines 12–18 store and display the results, and lines 21–40 define mymean_work().

Let’s look at these parts.

In line 6, syntax creates the local macro varlist, which contains the variables specified by the user. syntax also creates the local macros if and in that respectively contain any if restriction or in range that the user specified. In line 7, marksample uses the local macros varlist, if, and in to create a sample-inclusion variable. This sample-inclusion variable is one for each included observation and is zero for each excluded observation. The sample-inclusion variable accounts for a user-specified if restriction or in range that explicitly excludes an observation, and it accounts for a missing value in any of the variables in varlist that implicitly excludes an observation. marksample puts the name of this sample-inclusion variable in the local macro touse. (See Programming an estimation command in Stata: Allowing for sample restrictions and factor variables for details of this process.)

In line 8, tempname puts temporary names into the local macros b, V, and N. These names are not in use, and the objects stored in these names will be dropped when mymean10 finishes. We use temporary names to avoid overwriting global objects created by users, like Stata matrices and Stata scalars. (See Programming an estimation command in Stata: A first ado-command for an introduction to temporary names.)

Line 10 uses a one-line call to the Mata work function mymean_work() to calculate the point estimates, the VCE, and the sample size. mymean_work() puts the vector of point estimates in the Stata matrix whose name is stored in the local macro b. mymean_work() puts the VCE in the Stata matrix whose name is stored in the local macro V. And mymean_work() puts the number of included observations in the Stata scalar whose name is stored in the local macro N. (Details of this process are also discussed in Programming an estimation command in Stata: Allowing for sample restrictions and factor variables.)

Lines 12–14 put row and column names on the matrices that store the vector of point estimates and the VCE. Lines 15–17 store the results in e(), and line 18 produces a standard Stata output table. (See Programming an estimation command in Stata: A first ado-command using Mata for more details.)

Lines 21–40 define mymean_work().

Lines 22–24 specify that mywork_work() returns nothing to its caller and accepts five string scalar arguments. The first argument, vlist, contains the names of the variables specified by the user. The second, touse, contains the name of the sample-inclusion variable discussed above. The last three contain the names into which the results will be stored. When mymean_work() finishes, the local macro bname contains the name of the Stata matrix storing the vector of point estimates, the local macro vname contains the name of the Stata matrix storing the VCE, and the local macro nname contains the name of the Stata scalar storing the number of sample observations.

Now, I discuss mymean_work(). Lines 26–28 declare variables used in the function. Line 30 puts a copy of the observations for which sample-inclusion variable in Stata is one into the matrix X.

Lines 31–34 calculate the results. Line 31 puts the point estimates into the Mata vector b. Lines 32–34 calculate the VCE and store it in the Mata matrix V.

In line 36, st_matrix() copies the point estimates from b to the Stata matrix whose name is stored in bname. In line 37, st_matrix() copies the VCE from V to the Stata matrix whose name is stored in vname. In line 38, st_numscalar() copies the number of sample observations from the Mata scalar n to the Stata scalar whose name is stored in nname.

What will change when we write a plugin?

All the general structure and many of the specifics stay the same when we write a plugin. What changes is that we call a plugin to do the calculations instead of a Mata function.

Think about writing code in a language like C to replicate the calculations performed by mymean_work(). Three things would change. First, we would not copy the data from Stata into our plugin. The plugin environment gives us a view onto the data in Stata. Second, we have to write a function to implement the mean performed on line 31. Third, we would have to write a function to implement the VCE calculations performed on lines 32–34.

To facilitate the introduction to plugins, I made these changes in Mata, as illustrated in mymean11.ado in code block 2.

Code block 2: mymean11.ado

*! version 11.0.0  13Feb2018
program define mymean11, eclass

    version 15.1

    syntax varlist(numeric) [if] [in]
    marksample touse
    tempname b V N

    mata: mymean_work("`varlist'", "`touse'", "`b'", "`V'", "`N'")

    matrix colnames `b'  = `varlist'
    matrix colnames `V'  = `varlist'
    matrix rownames `V'  = `varlist'

    ereturn post `b' `V', esample(`touse')
    ereturn scalar   N   = `N'
    ereturn scalar df_r  = `N'-1
    ereturn display
end

mata:
void mymean_work(string scalar vlist,          ///
    string scalar touse, string scalar bname,  ///
    string scalar vname, string scalar nname )
{
    real matrix X, E, V, samp
    real vector b
    real scalar n

    st_view(samp=., ., touse)
    st_view(X=., ., vlist)

    MyAve(X, samp, b, n)
    MyV(X, samp, b, V)

    st_matrix(bname, b)
    st_matrix(vname, V)
    st_numscalar(nname, n)
}

void MyAve(real matrix X, real vector samp, b, n)
{
    real scalar   r, c, i, j

    r = rows(X)
    c = cols(X)
    b = J(1, c, 0)
    n = 0
    for(i=1; i<=r; i++) {
        if (samp[i]==1) {
            ++n
            for(j=1; j<=c; j++) {
                b[1,j] = b[1,j] + X[i,j]
            }
        }
    }
    b = (1/n)*b
}

void MyV(real matrix X, real vector samp, real matrix b, V)
{
    real scalar r, c, i, j, j2, n

    r = rows(X)
    c = cols(X)

    V = J(c, c, 0)
    e = J(1, c, 0)
    n = 0
    if (rows(b)!=1 | cols(b)!=c) {
        printf("{err}sample-average vector and VCE are not conformable\n")
        exit(error(503))
    }
    for(i=1; i<=r; i++) {
        if (samp[i]==1) {
            ++n
            for(j=1; j<=c; j++) {
                e[1,j] = X[i,j] - b[1,j]
            }
            for(j=1; j<=c; j++) {
                for(j2=1; j2<=j; j2++) {
                    V[j, j2] = V[j, j2] + e[1,j]*e[1,j2]
                }
            }
        }
    }
    for(j=1; j<=c; j++) {
        for(j2=j+1; j2<=c; j2++) {
            V[j, j2] = V[j2, j]
        }
    }

    V = (1/n)*(1/(n-1))*V
}

end

The Stata part in lines 1–19 is unchanged, as is the call to mymean_work().

Lines 31 and 32 differ from their counterparts in mymean10.ado. Line 30 in mymean10.ado puts a copy of the observations on the user-specified variables for which the sample-inclusion variable is one into the matrix X. Line 31 in mymean11.ado gets a view named samp on to all the observations of the sample-inclusion variable. Line 32 mymean11 gets a separate view named X on to all the observations of the variables specified by the user. These views on samp and X are more like the data–access functions provided by the plugin environment.

In line 34 of mymean11.ado, we use the MyAve() function to put the sample average into b and the number of observations for which the sample-inclusion variable is one into n. The code for MyAve() on lines 42–59 is analogous to the code one could write in, say, C. The largest difference is that line 58 uses Mata operators to divide each element of b by a scalar. The plugins will contain functions to perform these operations.

In line 35, we use the MyV() function to put the VCE into V. The code for MyV() is on lines 61–95. This code is also analogous to the code one could write in C, with the same caveat that line 94 would be implemented as a function.

Lines 37-39 in myean11.ado are the same as their counterpart lines 36–38 in mymean10.ado.

If I were coding an estimator in Mata, I would not use the implementation in myean11. I could make things significantly faster by only putting the observations for which the sample-inclusion variable is one in to the Mata view. The MyAve() and MyV() functions illustrate how the calculations can be performed in loops over data. The plugins will implement versions of these functions.

Done and undone

After discussing some pros and cons of writing a plugin for Stata, I discussed a program whose calculations I could implement in a plugin. In my next post, I will discuss a C plugin for these calculations.