Testing model specification and using the program version of gmm

This post was written jointly with Joerg Luedicke, Senior Social Scientist and Statistician, StataCorp.

The gmm command estimates the parameters of a model using the generalized method of moments (GMM). GMM can estimate the parameters of models that have more moment conditions than parameters; such models are overidentified. The specification of these models can be evaluated using Hansen’s J statistic (Hansen, 1982).

We use gmm to estimate the parameters of a Poisson model with an endogenous regressor. More instruments than regressors are available, so the model is overidentified. We then use estat overid to calculate Hansen’s J statistic and test the validity of the overidentifying restrictions.

In previous posts (see Estimating parameters by maximum likelihood and method of moments using mlexp and gmm and Understanding the generalized method of moments (GMM): A simple example), the interactive version of gmm has been used to estimate simple single-equation models. For more complex models, it can be easier to use the moment-evaluator program version of gmm. We demonstrate how to use this version of gmm.

Poisson model with endogenous regressors

In this post, the Poisson regression of \(y_i\) on exogenous \({\bf x}_i\) and endogenous \({\bf y}_{2,i}\) has the form
\begin{equation*}
E(y_i \vert {\bf x}_i,{\bf y}_{2,i},\epsilon_i)= \exp({\boldsymbol \beta}_1{\bf x}_i + {\boldsymbol \beta}_2{\bf y}_{2,i}) + \epsilon_i
\end{equation*}
where \(\epsilon_i\) is a zero-mean error term. The endogenous regressors \({\bf y}_{2,i}\) may be correlated with \(\epsilon_i\). This is the same formulation used by ivpoisson with additive errors; see [R] ivpoisson for more details. For more information on Poisson models with endogenous regressors, see Mullahy (1997), Cameron and Trivedi (2013), Windmeijer and Santos Silva (1997), and Wooldridge (2010).

Moment conditions are expectations of functions of the data and the parameters that equal zero at the true parameter values. GMM finds the parameter values that come closest to satisfying the sample analogs of the moment conditions. In this model, we define moment conditions using an error function,
\begin{equation*}
u_i({\boldsymbol \beta}_1,{\boldsymbol \beta}_2) = y_i - \exp({\boldsymbol \beta}_1{\bf x}_i + {\boldsymbol \beta}_2{\bf y}_{2,i})
\end{equation*}

Let \({\bf x}_{2,i}\) be additional exogenous variables. These are not correlated with \(\epsilon_i\), but are correlated with \({\bf y}_{2,i}\). Combining them with \({\bf x}_i\), we have the instruments \({\bf z}_i = (\begin{matrix} {\bf x}_{i} & {\bf x}_{2,i}\end{matrix})\). So the moment conditions are
\begin{equation*}
E({\bf z}_i u_i({\boldsymbol \beta}_1,{\boldsymbol \beta}_2)) = {\bf 0}
\end{equation*}

Suppose there are \(k\) parameters in \({\boldsymbol \beta}_1\) and \({\boldsymbol \beta}_2\) and \(q\) instruments. When \(q>k\), there are more moment conditions than parameters, and the model is overidentified. The sample moment conditions cannot generally all be set to zero, so GMM chooses the estimates that make a weighted quadratic form of the sample moments as small as possible. GMM minimizes
\[
Q({{\boldsymbol \beta}_1},{\boldsymbol \beta}_2) = \left\{\frac{1}{N}\sum\nolimits_i {{\bf z}}_i
u_i({\boldsymbol \beta}_1,{\boldsymbol \beta}_2)\right\}
{\bf W}
\left\{\frac{1}{N}\sum\nolimits_i {{\bf z}}_i u_i({\boldsymbol \beta}_1,{\boldsymbol \beta}_2)\right\}'
\]
for \(q\times q\) weight matrix \({\bf W}\).

Overidentification test

When the model is correctly specified,
\begin{equation*}
E({\bf z}_i u_i({\boldsymbol \beta}_1,{\boldsymbol \beta}_2)) = {\bf 0}
\end{equation*}

In this case, an optimal weight matrix \({\bf W}\) is the inverse of the covariance matrix of the moment conditions. Here we have
\[
{\bf W}^{-1} = E\{{\bf z}_i' u_{i}({\boldsymbol \beta}_1,{\boldsymbol \beta}_2)
u_{i}({\boldsymbol \beta}_1,{\boldsymbol \beta}_2) {\bf z}_i\}
\]

Hansen’s test evaluates the null hypothesis that an overidentified model is correctly specified. The test statistic \(J = N Q(\hat{\boldsymbol \beta}_1, \hat{\boldsymbol \beta}_2)\) is used. If \({\bf W}\) is an optimal weight matrix, under the null hypothesis, Hansen’s J statistic has a \(\chi^2(q-k)\) distribution.

The two-step and iterated estimators used by gmm provide estimates of the optimal \({\bf W}\). For overidentified models, the estat overid command calculates Hansen’s J statistic after these estimators are used.

Moment-evaluator program

We define a program that can be called by gmm in calculating the moment conditions for Poisson models with endogenous regressors. See Programming an estimation command in Stata: A map to posted entries for more information about programming in Stata. The program calculates the error function \(u_i\), and gmm generates the moment conditions by multiplying by the instruments \({\bf z}_i\).

To minimize the weighted criterion, gmm needs the derivatives of the moment conditions with respect to the parameters. By the chain rule, these are the derivatives of the error functions multiplied by the instruments. Users may specify these derivatives themselves, or gmm will calculate them numerically. Correctly specifying the derivatives yields gains in speed and numerical stability.

When linear forms of the parameters are estimated, users may specify derivatives to gmm in terms of the linear form (prediction). The chain rule is then used by gmm to determine the derivatives of the error function \(u_i\) with respect to the parameters. Our error function \(u_i\) is a function of the linear prediction \({\boldsymbol \beta}_1{\bf x}_i + {\boldsymbol \beta}_2{\bf y}_{2,i}\).
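
Concretely, if we write the linear prediction as \(\xi_i = {\boldsymbol \beta}_1{\bf x}_i + {\boldsymbol \beta}_2{\bf y}_{2,i}\), then \(u_i = y_i - \exp(\xi_i)\), and the chain rule gives
\begin{equation*}
\frac{\partial u_i}{\partial \xi_i} = -\exp(\xi_i)
\end{equation*}
This derivative with respect to the linear prediction is the quantity that the moment-evaluator program below computes when its derivatives() option is specified.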

The program gmm_ivpois calculates the error function \(u_i\) and the derivative of \(u_i\) in terms of the linear prediction \({\boldsymbol \beta}_1{\bf x}_i + {\boldsymbol \beta}_2{\bf y}_{2,i}\).

program gmm_ivpois
    version 14.1
    syntax varlist [if], at(name) depvar(varlist) rhs(varlist) ///
           [derivatives(varlist)]
    tempvar m
    quietly gen double `m' = 0 `if'
    local i = 1
    foreach var of varlist `rhs' {
        quietly replace `m' = `m' + `var'*`at'[1,`i'] `if'
        local i = `i' + 1
    }
    quietly replace `m' = `m' + `at'[1,`i'] `if'
    quietly replace `varlist' = `depvar' - exp(`m') `if'
    if "`derivatives'" == "" {
         exit
    }
    replace `derivatives' = -exp(`m')
end

Lines 3–4 of gmm_ivpois contain the syntax statement that parses the arguments to the program. All moment-evaluator programs must accept a varlist, the if condition, and the at() option. The varlist corresponds to variables that store the values of the error functions. The program gmm_ivpois will calculate the error function and store it in the specified varlist. The at() option is specified with the name of a matrix that contains the model parameters. The if condition specifies the observations for which estimation is performed.

The program also requires the options depvar() and rhs(). The name of the dependent variable is specified in the depvar() option. The regressors are specified in the rhs() option.

On line 4, derivatives() is optional. The variable name specified here corresponds to the derivative of the error function with respect to the linear prediction.

The linear prediction of the regressors is stored in the temporary variable m over lines 6–12. On line 13, we give the value of the error function to the specified varlist. Lines 14–16 allow the program to exit if derivatives() is not specified. Otherwise, on line 17, we store the value of the derivative of the error function with respect to the linear prediction in the variable specified in derivatives().

The data

We simulate data from a Poisson regression with an endogenous covariate, and then we use gmm and the gmm_ivpois program to estimate the parameters of the regression. We will then use estat overid to check the specification of the model. We simulate a random sample of 3,000 observations.

. set seed  45

. set obs 3000
number of observations (_N) was 0, now 3,000

. generate x = rnormal()*.8 + .5

. generate z = rchi2(1)

. generate w = rnormal()*.5

. matrix cm = (1, .9 \ .9, 1)

. matrix sd = (.5,.8)

. drawnorm e u, corr(cm) sd(sd)

We generate the exogenous covariates \(x\), \(z\), and \(w\). The variable \(x\) will be a regressor, while \(z\) and \(w\) will be extra instruments. Then we use drawnorm to draw the errors \(e\) and \(u\). The errors are positively correlated.

. generate y2 = exp(.2*x + .1*z + .3*w -1 + u)

. generate y = exp(.5*x + .2*y2+1) + e

We generate the endogenous regressor \(y2\) as a lognormal regression on the instruments. The outcome of interest \(y\) has an exponential mean on \(x\) and \(y2\), with \(e\) as an additive error. As \(e\) is correlated with \(u\), \(y2\) is correlated with \(e\).

Estimating the model parameters

Now we use gmm to estimate the parameters of the Poisson regression with endogenous covariates. The name of our moment-evaluator program is listed to the right of gmm. The instruments that gmm will use to form the moment conditions are listed in instruments(). We specify the options depvar() and rhs() with the appropriate variables. They will be passed on to gmm_ivpois.

In the parameters() option, we specify the parameters as a linear form named y; the haslfderivatives option informs gmm that gmm_ivpois provides derivatives with respect to this linear form. The option nequations() tells gmm how many error functions to expect.

. gmm gmm_ivpois, depvar(y) rhs(x y2)             ///
>         haslfderivatives instruments(x z w)     ///                            
>         parameters({y: x y2 _cons}) nequations(1)

Step 1
Iteration 0:   GMM criterion Q(b) =  14.960972
Iteration 1:   GMM criterion Q(b) =  3.3038486
Iteration 2:   GMM criterion Q(b) =  .59045217
Iteration 3:   GMM criterion Q(b) =  .00079862
Iteration 4:   GMM criterion Q(b) =  .00001419
Iteration 5:   GMM criterion Q(b) =  .00001418

Step 2
Iteration 0:   GMM criterion Q(b) =   .0000567
Iteration 1:   GMM criterion Q(b) =  .00005648
Iteration 2:   GMM criterion Q(b) =  .00005648

GMM estimation

Number of parameters =   3
Number of moments    =   4
Initial weight matrix: Unadjusted                 Number of obs   =      3,000
GMM weight matrix:     Robust

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .5006366   .0033273   150.46   0.000     .4941151     .507158
          y2 |   .2007893   .0075153    26.72   0.000     .1860597    .2155189
       _cons |   1.000717   .0063414   157.81   0.000      .988288    1.013146
------------------------------------------------------------------------------
Instruments for equation 1: x z w _cons

The parameter estimates are close to the values of 0.5, 0.2, and 1 used to generate the data, and each is statistically significant. However, the model could still be misspecified.
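
For comparison with the interactive version of gmm mentioned at the start of this post, the same model could be specified with a substitutable expression along the following lines. This is only a sketch: the parameter names xb and b0 are our own choices, and we rely on the defaults, which give the two-step estimator with a robust weight matrix.

gmm (y - exp({xb: x y2} + {b0})), instruments(x z w)

The moment-evaluator program pays off as models grow to include more equations or analytic derivatives.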

Overidentification test

We use estat overid to compute Hansen’s J statistic.

. estat overid

  Test of overidentifying restriction:

  Hansen's J chi2(1) = .169449 (p = 0.6806)

The J statistic equals 0.17. In addition to computing Hansen’s J, estat overid reports a test of the null hypothesis that the model is correctly specified. Here we have one more instrument than parameters, so the J statistic has a \(\chi^2(1)\) distribution under the null hypothesis. The probability of obtaining a \(\chi^2(1)\) value greater than 0.17 is given in parentheses. This probability, the p-value of the test, is large, so we fail to reject the null hypothesis that the model is properly specified.
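
As a check, the J statistic is just the sample size times the minimized step-2 criterion reported above: 3,000 times 0.00005648 is about 0.169. Assuming gmm stores the final criterion in e(Q) and the number of observations in e(N) (see the stored results in [R] gmm), the statistic can be reproduced by hand:

display e(N)*e(Q)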

Conclusion

We have demonstrated how to estimate the parameters of a Poisson regression with an endogenous regressor using the moment-evaluator program version of gmm. We have also demonstrated how to use estat overid to test for model misspecification after estimation of an overidentified model in gmm. See [R] gmm and [R] gmm postestimation for more information.

References

Cameron, A. C., and P. K. Trivedi. 2013. Regression Analysis of Count Data. 2nd ed. New York: Cambridge University Press.

Hansen, L. P. 1982. Large sample properties of generalized method of moments estimators. Econometrica 50: 1029–1054.

Mullahy, J. 1997. Instrumental-variable estimation of count data models: Applications to models of cigarette smoking behavior. Review of Economics and Statistics 79: 586–593.

Windmeijer, F., and J. M. C. Santos Silva. 1997. Endogeneity in count data models: An application to demand for health care. Journal of Applied Econometrics 12: 281–294.

Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.

Programming an estimation command in Stata: Handling factor variables in optimize()

\(
\newcommand{\xb}{{\bf x}}
\newcommand{\betab}{\boldsymbol{\beta}}\)I discuss a method for handling factor variables when performing nonlinear optimization using optimize(). After illustrating the issue caused by factor variables, I present a method and apply it to an example using optimize().

How poisson handles factor variables

Consider the Poisson regression in which I include a full set of indicator variables created from the categorical variable kids and a constant term.

Example 1: Collinear factor variables

. clear all

. use accident3

. poisson accidents cvalue ibn.kids traffic, coeflegend
note: 3.kids omitted because of collinearity

Iteration 0:   log likelihood = -546.35782
Iteration 1:   log likelihood = -545.11016
Iteration 2:   log likelihood = -545.10898
Iteration 3:   log likelihood = -545.10898

Poisson regression                              Number of obs     =        505
                                                LR chi2(5)        =     361.62
                                                Prob > chi2       =     0.0000
Log likelihood = -545.10898                     Pseudo R2         =     0.2491

------------------------------------------------------------------------------
   accidents |      Coef.  Legend
-------------+----------------------------------------------------------------
      cvalue |  -.6582924  _b[cvalue]
             |
        kids |
          0  |   3.233932  _b[0bn.kids]
          1  |   1.571582  _b[1.kids]
          2  |   1.659241  _b[2.kids]
          3  |          0  _b[3o.kids]
             |
     traffic |   .1383977  _b[traffic]
       _cons |  -2.518175  _b[_cons]
------------------------------------------------------------------------------

The full set of indicator variables is collinear with the constant term. The output shows that no variables were dropped; instead, the name 3o.kids indicates that 3.kids was omitted. Omitted variables are not dropped; their coefficients are constrained to zero.

Specifying variables as omitted instead of dropping them allows postestimation features such as margins to work properly.

For the case in example 1, poisson is maximizing the log-likelihood function subject to the constraint that \(\beta_{3.kids}=0\). In terms of the parameter vector \(\betab\), I represent this constraint by

\[
\left[\begin{matrix}
0&0&0&0&1&0&0
\end{matrix}\right]
\betab' = 0
\]

where \(\betab=(
\beta_{cvalue},
\beta_{0.kids},
\beta_{1.kids},
\beta_{2.kids},
\beta_{3.kids},
\beta_{traffic},
\beta_{\_cons})\).

In general, I can represent \(q\) linear equality constraints on a \(1\times k\) parameter vector as

\[
{\bf C}\betab’ = {\bf c}
\]

where \({\bf C}\) is a \(q\times k\) matrix and \({\bf c}\) is a \(q\times 1\) vector. These constraints are conveniently represented as \(\widetilde{\bf C}=\left[{\bf C},{\bf c}\right]\).

I now show how to use optimize() to solve optimization problems subject to linear equality constraints by putting \(\widetilde{\bf C}\) into the optimize object. In code block 1, I use optimize() to maximize the Poisson log-likelihood function for the problem in example 1. Code block 1 augments example 3 in Programming an estimation command in Stata: Using optimize() to estimate Poisson parameters by using optimize_init_constraints() to impose a linear equality constraint on the coefficient vector.

Code block 1: Linear equality constraints in optimize()

mata:
void plleval3(real scalar todo, real vector b,     ///
              real vector y,    real matrix X,     ///
              val, grad, hess)
{
    real vector  xb

    xb = X*b'
   val = -exp(xb) + y:*xb - lnfactorial(y)
}

y  = st_data(., "accidents")
X  = st_data(., "cvalue ibn.kids traffic")
X  = X,J(rows(X), 1, 1)

C  = e(5, 7)
c  = 0
Ct = C,c

S  = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &plleval3())
optimize_init_evaluatortype(S, "gf0")
optimize_init_params(S, J(1, 7, .01))
optimize_init_constraints(S, Ct)

bh = optimize(S)
optimize_result_params(S)
end

Only line 13, lines 16–18, and line 26 differ from the code in example 3 in Programming an estimation command in Stata: Using optimize() to estimate Poisson parameters. Line 13 illustrates that st_data() can create the indicator variables from the factor variable ibn.kids. Lines 16–18 define the constraint matrix \(\widetilde{\bf C}\) for this problem. Line 26 puts \(\widetilde{\bf C}\) into the optimize() object S, which causes optimize() to maximize the Poisson log-likelihood function with evaluator plleval3() subject to constraints specified in matrix Ct.

Example 2 illustrates that code block 1 reproduces the point estimates reported in example 1.

Example 2: Linear equality constraints in optimize()

. do pc

. mata:
------------------------------------------------- mata (type end to exit) -----
: void plleval3(real scalar todo, real vector b,     ///
>               real vector y,    real matrix X,     ///
>               val, grad, hess)
> {
>     real vector  xb
>
>     xb = X*b'
>    val = -exp(xb) + y:*xb - lnfactorial(y)
> }
note: argument todo unused
note: argument grad unused
note: argument hess unused

:
: y  = st_data(., "accidents")

: X  = st_data(., "cvalue ibn.kids traffic")

: X  = X,J(rows(X), 1, 1)

:
: C  = e(5, 7)

: c  = 0

: Ct = C,c

:
: S  = optimize_init()

: optimize_init_argument(S, 1, y)

: optimize_init_argument(S, 2, X)

: optimize_init_evaluator(S, &plleval3())

: optimize_init_evaluatortype(S, "gf0")

: optimize_init_params(S, J(1, 7, .01))

: optimize_init_constraints(S, Ct)

:
: bh = optimize(S)
Iteration 0:   f(p) = -845.47138
Iteration 1:   f(p) = -572.68676
Iteration 2:   f(p) = -545.68381
Iteration 3:   f(p) = -545.11241
Iteration 4:   f(p) = -545.10898
Iteration 5:   f(p) = -545.10898
: optimize_result_params(S)
                  1              2              3              4
    +-------------------------------------------------------------
  1 |  -.6582923624    3.233932519    1.571581623     1.65924145
    +-------------------------------------------------------------
                  5              6              7
     ----------------------------------------------+
  1               0      .13839766   -2.518174926  |
     ----------------------------------------------+

: end
-------------------------------------------------------------------------------

.
end of do-file

Code block 1 shows how to use a linear equality constraint to handle collinear variables when we know which variables are omitted. In the code for an estimation command, we must

  1. find which variables will be omitted, and
  2. create the constraint matrix \(\widetilde{\bf C}\) that imposes the constraints implied by omitting these variables.

Example 3 illustrates that _rmcoll stores in r(varlist) a list of variable names that identifies which variables will be omitted, thereby solving problem 1.

Example 3: Using _rmcoll to identify omitted variables

. _rmcoll cvalue ibn.kids traffic, expand
note: 3.kids omitted because of collinearity

. return list

scalars:
          r(k_omitted) =  1

macros:
            r(varlist) : "cvalue 0bn.kids 1.kids 2.kids 3o.kids traffic"

. local cnames "`r(varlist)' _cons"

I specified the option expand so that _rmcoll would expand any factor variables. The expanded variable list in the local r(varlist) identifies 3.kids as a variable that must be omitted. I then put this expanded variable list, augmented by the name _cons, in the local macro cnames.

Here is an outline for the solution to problem 2 that I present in examples 4–6.

  • In example 4, I create the Stata vector bt, whose column names are contained in cnames.
  • In example 5, I use _ms_omit_info to create the Stata vector bto, which indicates which variables will be omitted from bt.
  • In example 6, I create a Mata matrix specifying the constraints from bto.

Now for the details, beginning with example 4.

Example 4: Putting the coefficient names on a Stata vector

. matrix bt = J(1, 7, 0)

. matrix colnames bt = `cnames'

. matrix list bt

bt[1,7]
                   0.       1.       2.      3o.
     cvalue     kids     kids     kids     kids  traffic    _cons
r1        0        0        0        0        0        0        0

cnames contains the names of the coefficients for this problem, so I create a conformable row vector bt, make cnames the column names on bt, and display bt. The values in bt do not matter; the column names are the important information.

In example 5, _ms_omit_info uses the column names on a Stata vector to create the vector r(omit), which specifies which variables are omitted.

Example 5: Creating a vector that indicates omitted variables

. matrix bt = J(1, 7, 0)

. matrix colnames bt = `cnames'

. _ms_omit_info bt

. return list

scalars:
             r(k_omit) =  1

matrices:
               r(omit) :  1 x 7

. matrix bto = r(omit)

. matrix list bto

bto[1,7]
    c1  c2  c3  c4  c5  c6  c7
r1   0   0   0   0   1   0   0

An element of r(omit) is 1 if the corresponding variable is omitted. An element of r(omit) is 0 if the corresponding variable is not omitted. I put a copy of r(omit) in bto.

In example 6, I create a constraint matrix from bto. The loop in example 6 will create the constraint matrix implied by any r(omit) vector created by _ms_omit_info.

Example 6: Creating a constraint matrix from r(omit)

. mata:
------------------------------------------------- mata (type end to exit) -----
: mo = st_matrix("bto")

: ko = sum(mo)

: p  = cols(mo)

: if (ko>0) {
>     Cm   = J(0, p, .)
>     for(j=1; j<=p; j++) {
>         if (mo[j]==1) {
>             Cm  = Cm \ e(j, p)
>         }
>     }
>     Cm = Cm, J(ko, 1, 0)
> }
> else {
>     Cm = J(0,p+1,.)
> }

: "Constraint matrix is "
  Constraint matrix is 

: Cm
       1   2   3   4   5   6   7   8
    +---------------------------------+
  1 |  0   0   0   0   1   0   0   0  |
    +---------------------------------+

: end
-------------------------------------------------------------------------------

After copying bto to the Mata vector mo, I put the number of constraints in the scalar ko and the number of parameters in p. If there are constraints, I initialize Cm to be a matrix with zero rows and p columns, use a for loop to iteratively append a new row corresponding to each constraint identified in mo, and finish by appending a ko \(\times\) 1 column of zeros onto Cm. If there are no constraints, I put a matrix with zero rows and p+1 columns in Cm.

Regardless of whether there are any omitted variables, I can put the Cm matrix created by the method in example 6 into an optimize() object. If there are no omitted variables, Cm will have zero rows, and no constraints will be imposed. If there are omitted variables, Cm will have ko rows, and the constraints for the omitted variables will be imposed.
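
For readers who prefer vectorized Mata, the loop in example 6 can be replaced by a single call to select(). The sketch below is not part of the original code; it assumes the Stata vector bto from example 5 is in memory and builds the same Cm.

mata:
// Vectorized alternative to the loop in example 6 (a sketch, not the post's code)
mo = st_matrix("bto")          // 1 x p vector flagging omitted coefficients
ko = sum(mo)                   // number of omitted coefficients
p  = cols(mo)
if (ko > 0) {
    Cm = select(I(p), mo')     // rows of I(p) that correspond to omitted coefficients
    Cm = Cm, J(ko, 1, 0)       // append the zero right-hand-side column
}
else {
    Cm = J(0, p+1, .)          // no omitted coefficients: a matrix with zero rows
}
end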

Code block 2 combines these pieces into a coherent example.

Code block 2: Putting it all together

clear all
use accident3
local depvar    "accidents"
local indepvars "cvalue ibn.kids traffic"
_rmcoll `indepvars', expand
local cnames "`r(varlist)' _cons"
local p   : word count `cnames'
matrix bt = J(1, `p', 0)
matrix colnames bt = `cnames'
_ms_omit_info bt
matrix bto = r(omit)

mata:
void plleval3(real scalar todo, real vector b,     ///
              real vector y,    real matrix X,     ///
              val, grad, hess)
{
    real vector  xb

    xb = X*b'
   val = -exp(xb) + y:*xb - lnfactorial(y)
}

y  = st_data(., "`depvar'")
X  = st_data(., "`indepvars'")
X  = X,J(rows(X), 1, 1)

mo = st_matrix("bto")
ko = sum(mo)
p  = cols(mo)
if (ko>0) {
    Ct   = J(0, p, .)
    for(j=1; j<=p; j++) {
        if (mo[j]==1) {
            Ct  = Ct \ e(j, p)
        }
    }
    Ct = Ct, J(ko, 1, 0)
}
else {
    Ct = J(0,p+1,.)
}

S  = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &plleval3())
optimize_init_evaluatortype(S, "gf0")
optimize_init_params(S, J(1, 7, .01))
optimize_init_constraints(S, Ct)

bh = optimize(S)
optimize_result_params(S)
end

Lines 1–2 drop all the objects that I created in previous examples and read the accident3 dataset into memory. Lines 3–4 create locals to hold the dependent variable and the independent variables.

Lines 5–6 use _rmcoll to identify which variables should be omitted and put the list of names in the local macro cnames, as in example 3. Lines 7–9 create bt, whose column names specify which variables should be omitted, as in example 4. Lines 10–11 create bto, whose entries specify which variables should be omitted, as in example 5. Lines 28–42 create the constraint matrix Ct from bto, as in example 6.

Line 50 puts Ct into the optimize() object.

Example 7 illustrates that the code in code block 2 reproduces the previously obtained results.

Example 7: Putting it all together

. do pc2

. clear all

. use accident3

. local depvar    "accidents"

. local indepvars "cvalue ibn.kids traffic"

. _rmcoll `indepvars', expand
note: 3.kids omitted because of collinearity

. local cnames "`r(varlist)' _cons"

. local p   : word count `cnames'

. matrix bt = J(1, `p', 0)

. matrix colnames bt = `cnames'

. _ms_omit_info bt

. matrix bto = r(omit)

.
. mata:
------------------------------------------------- mata (type end to exit) -----
: void plleval3(real scalar todo, real vector b,     ///
>               real vector y,    real matrix X,     ///
>               val, grad, hess)
> {
>     real vector  xb
>
>     xb = X*b'
>    val = -exp(xb) + y:*xb - lnfactorial(y)
> }
note: argument todo unused
note: argument grad unused
note: argument hess unused

:
: y  = st_data(., "`depvar'")

: X  = st_data(., "`indepvars'")

: X  = X,J(rows(X), 1, 1)

:
: mo = st_matrix("bto")

: ko = sum(mo)

: p  = cols(mo)

: if (ko>0) {
>     Ct   = J(0, p, .)
>     for(j=1; j<=p; j++) {
>         if (mo[j]==1) {
>             Ct  = Ct \ e(j, p)
>         }
>     }
>     Ct = Ct, J(ko, 1, 0)
> }
> else {
>     Ct = J(0,p+1,.)
> }

:
: S  = optimize_init()

: optimize_init_argument(S, 1, y)

: optimize_init_argument(S, 2, X)

: optimize_init_evaluator(S, &plleval3())

: optimize_init_evaluatortype(S, "gf0")

: optimize_init_params(S, J(1, 7, .01))

: optimize_init_constraints(S, Ct)

:
: bh = optimize(S)
Iteration 0:   f(p) = -845.47138
Iteration 1:   f(p) = -572.68676
Iteration 2:   f(p) = -545.68381
Iteration 3:   f(p) = -545.11241
Iteration 4:   f(p) = -545.10898
Iteration 5:   f(p) = -545.10898

: optimize_result_params(S)
                  1              2              3              4
    +-------------------------------------------------------------
  1 |  -.6582923624    3.233932519    1.571581623     1.65924145
    +-------------------------------------------------------------
                  5              6              7
     ----------------------------------------------+
  1               0      .13839766   -2.518174926  |
     ----------------------------------------------+

: end
-------------------------------------------------------------------------------

.
end of do-file

Done and undone

I discussed a method for handling factor variables when performing nonlinear optimization using optimize(). In my next post, I implement these methods in an estimation command for Poisson regression.

Handling gaps in time series using business calendars

Time-series data, such as financial data, often have known gaps because there are no observations on days such as weekends or holidays. Using regular Stata datetime formats with time-series data that have gaps can result in misleading analysis. Rather than treating these gaps as missing values, we should adjust our calculations appropriately. I illustrate a convenient way to work with irregularly spaced dates by using Stata’s business calendars.

In nasdaq.dta, I have daily data on the NASDAQ index from February 5, 1971 to March 23, 2015 that I downloaded from the St. Louis Federal Reserve Economic Database (FRED).

. use http://www.stata.com/data/nasdaq

. describe

Contains data from http://www.stata.com/data/nasdaq.dta
  obs:        11,132                          
 vars:             2                          29 Jan 2016 16:21
 size:       155,848                          
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
date            str10   %10s                  Daily date
index           float   %9.0g                 NASDAQ Composite Index (1971=100)
-------------------------------------------------------------------------------
Sorted by: 

date is the time variable in our data; it is a string variable with the date stored in year, month, day order. I use the date() function to convert the string daily date to a Stata numeric date and store the values in mydate. To find out more about converting string dates to numeric, you can read A Tour of Datetime in Stata.

. generate mydate = date(date,"YMD")

. format %td mydate

I tsset these data with mydate as the time variable and then list the first five observations, along with the first lag of index.

. tsset mydate
        time variable:  mydate, 05feb1971 to 23mar2015, but with gaps
                delta:  1 day

. list date mydate index l.index in 1/5

     +------------------------------------------+
     |                                        L.|
     |       date      mydate    index    index |
     |------------------------------------------|
  1. | 1971-02-05   05feb1971      100        . |
  2. | 1971-02-08   08feb1971   100.84        . |
  3. | 1971-02-09   09feb1971   100.76   100.84 |
  4. | 1971-02-10   10feb1971   100.69   100.76 |
  5. | 1971-02-11   11feb1971   101.45   100.69 |
     +------------------------------------------+

The first observation on l.index is missing; I expect this because there are no observations prior to the first observation on index. However, the second observation on l.index is also missing. As you may have already noticed, the dates are irregularly spaced in my dataset—the first observation corresponds to a Friday and the second observation to a Monday.

I get missing data in this case because mydate is a regular date, and tsset-ing by a regular date will treat all weekends and other holidays as if they are missing in the dataset instead of ignoring them in calculations. To avoid the problem of gaps inherent in business data, I can create a business calendar. Business calendars specify which dates are omitted. For daily financial data, a business calendar specifies the weekends and holidays for which the markets were closed.

Creating business calendars

Business calendars are defined in files named calname.stbcal. You can create your own calendars, use the ones provided by StataCorp, or obtain them directly from other users or via the SSC. Calendars can also be created automatically from the current dataset using the bcal create command.
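
As an aside, a calendar that matches the dates already in the dataset could also be generated automatically along the following lines. This is only a sketch: nasdaq2 is a hypothetical calendar name, and the options shown should be checked against help bcal before use.

bcal create nasdaq2, from(mydate) replace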

Every stbcal-file requires you to specify the following four things:

  • the version of Stata being used
  • the range of the calendar
  • the center date of the calendar
  • the dates to be omitted

I begin by creating nasdaq.stbcal, which will omit Saturdays and Sundays of every month. I do this using the Do-file editor, but you can use any text editor.

version 14.1
purpose "Converting daily financial data into business calendar dates"
dateformat dmy
range 05feb1971 23mar2015
centerdate 05feb1971
omit dayofweek (Sa Su)

The first line specifies the current version of Stata I am using. The second line is optional, but the text typed there will display if I type bcal describe nasdaq and is useful for record keeping when I have multiple calendars. Line 3 specifies the display date format and is also optional. Line 4 specifies the range of dates in the dataset.

Line 5 specifies the center date to be 05feb1971. I picked the first date in the sample, but I could have picked any date in the range specified for the business calendar. centerdate does not need to be a date at the center of the sample. For example, Stata’s default %td calendar uses 01jan1960 as its center.

The last line specifies that all Saturdays and Sundays be omitted. Later, I will show several variations of the omit command to omit other holidays. Once I have a business calendar, I can use it to convert regular dates to business dates, share the file with colleagues, and make further changes to my calendar.

Using a business calendar

. bcal load nasdaq
loading ./nasdaq.stbcal ...

     1. version 14.1
     2. purpose "Converting daily financial data into business calendar dates"
     3. dateformat dmy
     4. range 05feb1971 23mar2015
     5. centerdate 05feb1971
     6. omit dayofweek (Sa Su)

(calendar loaded successfully)

. generate bcaldate = bofd("nasdaq",mydate)

. assert !missing(bcaldate) if !missing(mydate)

To create business dates using bofd(), I specified two arguments: the name of the business calendar and the name of the variable containing regular dates. The assert statement verifies that all dates recorded in mydate appear in the business calendar. This is a way of checking that I created my calendar for the complete date range—the bofd() function returns a missing value when mydate does not appear on the specified calendar.

Business dates have a specific display format, %tbcalname, which in my case is %tbnasdaq. In order to display business dates in a Stata date format I will apply this format to bcaldate just as I would for a regular date.

. format %tbnasdaq bcaldate

. list in 1/5

     +---------------------------------------------+
     |       date    index      mydate    bcaldate |
     |---------------------------------------------|
  1. | 1971-02-05      100   05feb1971   05feb1971 |
  2. | 1971-02-08   100.84   08feb1971   08feb1971 |
  3. | 1971-02-09   100.76   09feb1971   09feb1971 |
  4. | 1971-02-10   100.69   10feb1971   10feb1971 |
  5. | 1971-02-11   101.45   11feb1971   11feb1971 |
     +---------------------------------------------+

Although mydate and bcaldate look similar, they have different encodings. Now, I can tsset on the business date bcaldate and list the first five observations with the lag of index recalculated.

. tsset bcaldate
        time variable:  bcaldate, 05feb1971 to 23mar2015, but with gaps
                delta:  1 day

. list bcaldate index l.index in 1/5

     +-----------------------------+
     |                           L.|
     |  bcaldate    index    index |
     |-----------------------------|
  1. | 05feb1971      100        . |
  2. | 08feb1971   100.84      100 |
  3. | 09feb1971   100.76   100.84 |
  4. | 10feb1971   100.69   100.76 |
  5. | 11feb1971   101.45   100.69 |
     +-----------------------------+

As expected, the issue of gaps due to weekends is now resolved. Because I have a calendar that excludes Saturdays and Sundays, bcaldate skipped the weekend between 05feb1971 and 08feb1971 when calculating the lagged index value and will do the same for any subsequent weekends in the data.

Excluding specific dates

So far I have not excluded gaps in the data due to other major holidays, such as Thanksgiving and Christmas. Stata has several variations on the omit command that let you exclude specific dates. For example, I use the omit command to omit the Thanksgiving holiday (the fourth Thursday of November in the U.S.) by adding the following statement in my business calendar.

omit dowinmonth +4 Th of Nov

dowinmonth stands for day of week in month and +4 Th of Nov refers to the fourth Thursday of November. This rule is applied to every year in the data.

Another major holiday is Christmas, with the NASDAQ closed on the 25th of December every year. I can omit this holiday in the calendar as

omit date 25dec*

The * in the statement above indicates that December 25 should be omitted for every year in my nasdaq calendar.

This rule alone is not enough, because the 25th may fall on a weekend, in which case the holiday is observed on the preceding Friday or the following Monday. To capture these cases, I add the following statements:

omit date 25dec* and (-1) if dow(Sa)
omit date 25dec* and (+1) if dow(Su)

The first statement omits December 24 if Christmas is on a Saturday, and the second statement omits December 26 if Christmas is on a Sunday.
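
The same pattern extends to other fixed-date holidays. For example, a sketch for New Year’s Day would be the statements below; the actual NASDAQ holiday schedule should be checked before adding such rules to the calendar.

omit date 1jan*
omit date 1jan* and (-1) if dow(Sa)
omit date 1jan* and (+1) if dow(Su)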

Encodings

I mentioned earlier that the encodings of the regular date mydate and the business date bcaldate are different. To see the encodings of my date variables, I apply a numeric display format and list the first five observations.

. format %8.0g mydate bcaldate

. list in 1/5

     +-----------------------------------------+
     |       date    index   mydate   bcaldate |
     |-----------------------------------------|
  1. | 1971-02-05      100     4053          0 |
  2. | 1971-02-08   100.84     4056          1 |
  3. | 1971-02-09   100.76     4057          2 |
  4. | 1971-02-10   100.69     4058          3 |
  5. | 1971-02-11   101.45     4059          4 |
     +-----------------------------------------+

The variable bcaldate starts with 0 because this was the centerdate in my calendar nasdaq.stbcal. The business date encoding is consecutive without gaps, which is why using lags or any time-series operators will yield correct values.
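
To confirm that the two encodings refer to the same calendar days, the business dates can be mapped back to regular dates with the dofb() function. The following lines are a sketch that was not run above; backdate is a hypothetical variable created only for this check.

generate backdate = dofb(bcaldate, "nasdaq")   // regular date implied by each business date
format %td backdate
assert backdate == mydate                      // the round trip recovers the original dates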

Summary

Using regular dates with time-series data instead of business dates may be misleading in case there are gaps in the data. In this post, I showed a convenient way to work with business dates by creating a business calendar. Once I loaded a calendar file into Stata, I created business dates using the bofd() function. I also showed some variations of the omit command used in business calendars to accommodate specific gaps due to different holidays.

Programming an estimation command in Stata: A poisson command using Mata

\(
\newcommand{\xb}{{\bf x}}
\newcommand{\betab}{\boldsymbol{\beta}}\)I discuss mypoisson1, which computes Poisson-regression results in Mata. The code in mypoisson1.ado is remarkably similar to the code in myregress11.ado, which computes ordinary least-squares (OLS) results in Mata, as I discussed in Programming an estimation command in Stata: An OLS command using Mata.

I build on previous posts. I use the structure of Stata programs that use Mata work functions that I discussed previously in Programming an estimation command in Stata: A first ado-command using Mata and Programming an estimation command in Stata: An OLS command using Mata. You should be familiar with Poisson regression and using optimize(), which I discussed in Programming an estimation command in Stata: Using optimize() to estimate Poisson parameters.

A poisson command with Mata computations

The Stata command mypoisson1 computes the results in Mata. The syntax of the mypoisson1 command is

mypoisson1 depvar indepvars [if] [in] [, noconstant]

where indepvars can contain time-series variables. mypoisson1 does not allow for factor variables because they complicate the program. I discuss these complications, and present solutions, in my next post.

In the remainder of this post, I discuss the code for mypoisson1.ado. I recommend that you click on the file name to download the code. To avoid scrolling, view the code in the do-file editor, or your favorite text editor, to see the line numbers.

Code block 1: mypoisson1.ado

*! version 1.0.0  31Jan2016
program define mypoisson1, eclass sortpreserve
    version 14.1

    syntax varlist(numeric ts min=2) [if] [in] [, noCONStant ]
    marksample touse

    gettoken depvar indepvars : varlist

    _rmcoll `indepvars', `constant' forcedrop
    local indepvars  "`r(varlist)'"

    tempname b V N rank

    mata: mywork("`depvar'", "`indepvars'", "`touse'", "`constant'", ///
       "`b'", "`V'", "`N'", "`rank'")

    if "`constant'" == "" {
        local cnames "`indepvars' _cons"
    }
    else {
        local cnames "`indepvars'"
    }
    matrix colnames `b' = `cnames'
    matrix colnames `V' = `cnames'
    matrix rownames `V' = `cnames'

    ereturn post `b' `V', esample(`touse') buildfvinfo
    ereturn scalar N       = `N'
    ereturn scalar rank    = `rank'
    ereturn local  cmd     "mypoisson1"

    ereturn display

end

mata:

void mywork( string scalar depvar,  string scalar indepvars,
             string scalar touse,   string scalar constant,
             string scalar bname,   string scalar Vname,
             string scalar nname,   string scalar rname)
{

    real vector y, b, mo_v, cv
    real matrix X, V, Cm
    real scalar n, p, rank, ko

    y = st_data(., depvar, touse)
    n = rows(y)
    X = st_data(., indepvars, touse)
    if (constant == "") {
        X = X,J(n, 1, 1)
    }
    p = cols(X)

    S  = optimize_init()
    optimize_init_argument(S, 1, y)
    optimize_init_argument(S, 2, X)
    optimize_init_evaluator(S, &plleval2())
    optimize_init_params(S, J(1, p, .01))

    b    = optimize(S)
    V    = optimize_result_V_oim(S)
    rank = p - diag0cnt(invsym(V))

    st_matrix(bname, b)
    st_matrix(Vname, V)
    st_numscalar(nname, n)
    st_numscalar(rname, rank)
}

void plleval2(real scalar todo, real vector b,     ///
              real vector y,    real matrix X,     ///
              val, grad, hess)
{
    real vector  xb

    xb = X*b'
   val = sum(-exp(xb) + y:*xb - lnfactorial(y))
}

end

mypoisson1.ado has the structure of Stata programs that compute their results in Mata, which I discussed in Programming an estimation command in Stata: A first ado-command using Mata and Programming an estimation command in Stata: An OLS command using Mata. Lines 1–35 define the ado-command mypoisson1. Lines 37–83 define the Mata work function mywork() used in mypoisson1 and the evaluator function plleval2() used in mywork().

The ado-command mypoisson1 has four parts:

  1. Lines 5–13 parse what the user typed, identify the sample, drop collinear variables from the list of independent variables, and create temporary names for Stata objects returned by our Mata work function.
  2. Lines 15–16 call the Mata work function.
  3. Lines 18–31 post the results returned by the Mata work function to e().
  4. Line 33 displays the results.

The Mata work function mywork() has four parts.

  1. Lines 39–42 parse the arguments.
  2. Lines 45–47 declare vectors, matrices, and scalars that are local to mywork().
  3. Lines 49–65 compute the results.
  4. Lines 67–70 copy the computed results to Stata, using the names that were passed in arguments.

Now, I discuss the ado-code in some detail. Lines 2–35 are almost the same as lines 2–35 of myregress11, which I discussed in Programming an estimation command in Stata: An OLS command using Mata. Most of the differences arise because myregress11 handles factor variables and mypoisson1 does not. Two minor differences arise because myregress11 stores the residual degrees of freedom and mypoisson1 does not.

myregress11 uses the matrix inverter to handle cases of collinear independent variables. In the Poisson-regression case discussed here, collinear independent variables leave the unconstrained log-likelihood function without a unique maximum, so optimize() cannot converge. mypoisson1 drops collinear independent variables to avoid this problem. I discuss a better solution in my next post.

Lines 10–11 use _rmcoll to drop the collinear independent variables and store the list of linearly independent variables in the local macro indepvars.
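
To see what forcedrop does, consider the following sketch, which is not part of mypoisson1; cvalue2 is a hypothetical, perfectly collinear copy created only for illustration. Unlike the expand option used in the factor-variable post, forcedrop removes the collinear variables from the returned list instead of marking them as omitted.

generate cvalue2 = cvalue                      // hypothetical collinear copy
_rmcoll cvalue cvalue2 kids traffic, forcedrop
display "`r(varlist)'"                         // cvalue2 is dropped from the returned list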

Now, I discuss the Mata work function mywork() in some detail. The mywork() defined on lines 39–71 is remarkably similar to the mywork() function defined on lines 39–72 of myregress11.ado, discussed in Programming an estimation command in Stata: An OLS command using Mata. In mypoisson1.ado, lines 57–65 use optimize() to compute the results. In myregress11.ado, lines 58–64 use matrix computations to compute the results.

Lines 73–81 define the evaluator function plleval2(), which is used by optimize() to compute the results, as I discussed in Programming an estimation command in Stata: Using optimize() to estimate Poisson parameters.

Examples 1 and 2 illustrate that mypoisson1 produces the same results as poisson.

Example 1: mypoisson1
(Uses accident3.dta)

. clear all

. use accident3

. mypoisson1 accidents cvalue kids traffic
Iteration 0:   f(p) = -851.18669
Iteration 1:   f(p) = -556.66874
Iteration 2:   f(p) = -555.81708
Iteration 3:   f(p) = -555.81538
Iteration 4:   f(p) = -555.81538
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cvalue |  -.6558871   .0706484    -9.28   0.000    -.7943554   -.5174188
        kids |  -1.009017   .0807961   -12.49   0.000    -1.167374   -.8506596
     traffic |   .1467115   .0313762     4.68   0.000     .0852153    .2082077
       _cons |   .5743543   .2839519     2.02   0.043     .0178187     1.13089
------------------------------------------------------------------------------

Example 2: poisson

. poisson accidents cvalue kids traffic

Iteration 0:   log likelihood = -555.86605
Iteration 1:   log likelihood =  -555.8154
Iteration 2:   log likelihood = -555.81538

Poisson regression                              Number of obs     =        505
                                                LR chi2(3)        =     340.20
                                                Prob > chi2       =     0.0000
Log likelihood = -555.81538                     Pseudo R2         =     0.2343

------------------------------------------------------------------------------
   accidents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cvalue |  -.6558871   .0706484    -9.28   0.000    -.7943553   -.5174188
        kids |  -1.009017   .0807961   -12.49   0.000    -1.167374   -.8506594
     traffic |   .1467115   .0313762     4.68   0.000     .0852153    .2082076
       _cons |    .574354   .2839515     2.02   0.043     .0178193    1.130889
------------------------------------------------------------------------------

Done and undone

I discussed mypoisson1, which computes Poisson-regression results in Mata. I highlighted how similar the code in mypoisson1.ado is to the code in myregress11.ado.

I also discussed that mypoisson1 drops collinear independent variables and noted that mypoisson1 does not handle factor variables. In my next post, I discuss a solution to the problems caused by collinear independent variables and discuss a command that handles factor variables.

Programming an estimation command in Stata: Using optimize() to estimate Poisson parameters

\(
\newcommand{\xb}{{\bf x}}
\newcommand{\betab}{\boldsymbol{\beta}}\)I show how to use optimize() in Mata to maximize a Poisson log-likelihood function and to obtain estimators of the variance–covariance of the estimator (VCE) based on independent and identically distributed (IID) observations or on robust methods.

Using optimize()

There are many optional choices that one may make when solving a nonlinear optimization problem, but there are very few that one must make. The optimize*() functions in Mata handle this problem by making a set of default choices for you, requiring that you specify a few things, and allowing you to change any of the default choices.

When I use optimize() to solve a nonlinear optimization problem, I do four steps.

  1. I create an optimize() object

      
    : S = optimize_init()

    which contains all the default choices

  2. I use some of the optimize_init_*(S) functions to put information about my optimization problem into S.
  3. I use

      
    : betahat = optimize(S)

    to perform the optimization.

  4. I use some of the optimize_result_*(S) functions to get the results, which optimize(S) stored in S. (A small self-contained sketch of all four steps follows this list.)
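
Here is a small self-contained sketch of the four steps applied to a toy objective function, \(f(p) = -(p-3)^2\), whose maximum is at \(p=3\). It is not part of the Poisson example that follows; it only shows the mechanics.

mata:
// Toy example of the four optimize() steps (a sketch, not the post's code)
void toyeval(real scalar todo, real vector p, val, grad, hess)
{
    val = -(p[1] - 3)^2                     // objective function, maximized at p = 3
}
S = optimize_init()                         // step 1: create the object with default choices
optimize_init_evaluator(S, &toyeval())      // step 2: tell S which evaluator to call
optimize_init_params(S, 0)                  //         and supply a starting value
phat = optimize(S)                          // step 3: perform the optimization
optimize_result_params(S)                   // step 4: retrieve results stored in S
end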

Consider maximizing the log-likelihood function of a Poisson model. The contribution of the \(i\)th observation to the log-likelihood is

\[
f_i(\betab) = y_i{\bf x}_i\betab' - \exp({\bf x}_i\betab') - \ln(y_i!)
\]

where \(y_i\) is the dependent variable, \({\bf x}_i\) is the vector of covariates, and \(\betab\) is the row vector of parameters that we select to maximize the log-likelihood function given by \(F(\betab) =\sum_i f_i(\betab)\). I could drop \(\ln(y_i!)\), because it does not depend on the parameters. I include it to make the value of the log-likelihood function the same as that reported by Stata. Stata includes these terms so that the values of the log-likelihood functions are comparable across models.

Code block 1 copies the data from Stata to Mata and computes the Poisson log-likelihood function at the vector of parameter values b, which has been set to the arbitrary starting values of .01 for each parameter.

Code block 1: Computing the Poisson log-likelihood in Mata

clear all
use accident3
mata:
     y = st_data(., "accidents")
     X = st_data(., "cvalue kids traffic")
     X = X,J(rows(X), 1, 1)
     b = J(1, cols(X), .01)
    xb = X*b'
    f  = sum(-exp(xb) + y:*xb - lnfactorial(y))
end

The Mata function plleval() in code block 2 puts the value of the Poisson log-likelihood function at the vector of parameter values b into val.

Code block 2: An evaluator function for the Poisson log-likelihood

mata:
void plleval(real scalar todo, real vector b, val, grad, hess)
{
    real vector  y, xb
    real matrix  X

     y = st_data(., "accidents")
     X = st_data(., "cvalue kids traffic")
     X = X,J(rows(X), 1, 1)
    xb = X*b'
   val = sum(-exp(xb) + y:*xb - lnfactorial(y))
}
end

plleval() has the default syntax of an evaluator function. Evaluator functions must follow a known syntax so that optimize() can call them, which it must do to find the maximum. After describing the default syntax, I show how to use evaluators that take extra arguments.

plleval() is void; it returns nothing. The real scalar todo allows optimize() to tell the evaluator function what it must compute. The real vector b is the current value of the parameter vector. val is not typed because, no matter what it contains on input, it will contain the value of the objective function on output. grad is not typed because it will optionally contain the vector of first derivatives of the objective function at the current value of b on output. hess is not typed because it will optionally contain the matrix of second derivatives of the objective function at the current value of b on output. As plleval() illustrates, the evaluator function must put the value of the objective function into the third argument, but it need not compute either the vector of first derivatives or the matrix of second derivatives.

In example 1, I use optimize() to maximize the Poisson log-likelihood function computed in plleval().

Example 1: Using optimize() to estimate Poisson parameters

(Uses accident3.dta)

. clear all

. use accident3

. mata:
------------------------------------------------- mata (type end to exit) ------
: void plleval(real scalar todo, real vector b, val, grad, hess)
> {
>     real vector  y, xb
>     real matrix  X
> 
>      y = st_data(., "accidents")
>      X = st_data(., "cvalue kids traffic")
>      X = X,J(rows(X), 1, 1)
>     xb = X*b'
>    val = sum(-exp(xb) + y:*xb - lnfactorial(y))
> }
note: argument todo unused
note: argument grad unused
note: argument hess unused

: 
: S  = optimize_init()

: optimize_init_evaluator(S, &plleval())

: optimize_init_params(S, J(1, 4, .01))

: bh = optimize(S)
Iteration 0:   f(p) = -851.18669  
Iteration 1:   f(p) = -556.66874  
Iteration 2:   f(p) = -555.81708  
Iteration 3:   f(p) = -555.81538  
Iteration 4:   f(p) = -555.81538  

: bh
                  1              2              3              4
    +-------------------------------------------------------------+
  1 |  -.6558871399   -1.009017051    .1467114648    .5743542793  |
    +-------------------------------------------------------------+

: optimize_result_params(S)
                  1              2              3              4
    +-------------------------------------------------------------+
  1 |  -.6558871399   -1.009017051    .1467114648    .5743542793  |
    +-------------------------------------------------------------+

: sqrt(diagonal(optimize_result_V_oim(S)))'
                 1             2             3             4
    +---------------------------------------------------------+
  1 |  .0706483931   .0807960852   .0313761961   .2839519366  |
    +---------------------------------------------------------+

: end
--------------------------------------------------------------------------------

After defining plleval(), I use optimize_init() to create the optimize() object S. I must put information about how to call plleval() and the vector of starting values into S. Typing

optimize_init_evaluator(S, &plleval())

puts the address of the evaluator function plleval() into S; preceding the function name with an ampersand (&) yields its address. optimize() requires the address rather than the name because the address makes the function faster to find. Typing

optimize_init_params(S, J(1, 4, .01))

puts the vector of starting values, J(1, 4, .01), into S.

Typing

bh = optimize(S)

causes optimize() to solve the optimization problem described in S, and it causes optimize() to put the vector of optimal parameters in bh. optimize() produces the default iteration log, because we did not change the default specification in S.

When optimize() has completed, the results are in S. For example, I display the bh returned by optimize() and use optimize_result_params(S) to display the result stored in S. I further illustrate by displaying the standard errors; optimize_result_V_oim() retrieves the observed-information-matrix (OIM) estimator of the variance–covariance of the estimator (VCE). Many other results are stored in S; type help mf optimize and look at the optimize_result*() functions for details.
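
For instance, a few results that I often retrieve after optimize() has run are shown below. These functions are documented in help mf optimize; the lines are a sketch rather than output from the session above.

optimize_result_value(S)          // value of the objective function at the optimum
optimize_result_converged(S)      // 1 if convergence was declared, 0 otherwise
optimize_result_iterations(S)     // number of iterations performed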

Comparing the results in example 1 with those reported by poisson in example 2 shows that they agree.

Example 2: Results from poisson

. poisson accidents cvalue kids traffic

Iteration 0:   log likelihood = -555.86605
Iteration 1:   log likelihood =  -555.8154
Iteration 2:   log likelihood = -555.81538

Poisson regression                              Number of obs     =        505
                                                LR chi2(3)        =     340.20
                                                Prob > chi2       =     0.0000
Log likelihood = -555.81538                     Pseudo R2         =     0.2343

------------------------------------------------------------------------------
   accidents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cvalue |  -.6558871   .0706484    -9.28   0.000    -.7943553   -.5174188
        kids |  -1.009017   .0807961   -12.49   0.000    -1.167374   -.8506594
     traffic |   .1467115   .0313762     4.68   0.000     .0852153    .2082076
       _cons |    .574354   .2839515     2.02   0.043     .0178193    1.130889
------------------------------------------------------------------------------

plleval() is slow because it copies the data from Stata to Mata every time optimize() calls it. I would much rather pass the data to the evaluator function, but this requires putting information about the syntax of the new evaluator function in S. For example, I would like to use the evaluator function plleval2(). In example 3, I use optimize_init_argument() to put information into S about the extra arguments accepted by the new evaluator function plleval2().

Code block 3: Passing data to the Poisson evaluator function

mata:
void plleval2(real scalar todo, real vector b,     ///
              real vector y,    real matrix X,     ///
              val, grad, hess)
{
    real vector  xb

    xb = X*b'
   val = sum(-exp(xb) + y:*xb - lnfactorial(y))
}
end

Line 3 declares the extra arguments, the real vector y and the real matrix X. The extra arguments come between the inputs that must always be present, the real scalar todo and the real vector b, and the always-present outputs: val, grad, and hess.

Example 3 uses optimize() to maximize the Poisson objective function coded in plleval2().

Example 3: Using optional arguments to pass data

. mata:
------------------------------------------------- mata (type end to exit) ------
: void plleval2(real scalar todo, real vector b,     ///
>               real vector y,    real matrix X,     ///
>               val, grad, hess)
> {
>     real vector  xb
>
>     xb = X*b'
>    val = sum(-exp(xb) + y:*xb - lnfactorial(y))
> }
note: argument todo unused
note: argument grad unused
note: argument hess unused

:
: y = st_data(., "accidents")

: X = st_data(., "cvalue kids traffic")

: X = X,J(rows(X), 1, 1)

:
: S  = optimize_init()

: optimize_init_argument(S, 1, y)

: optimize_init_argument(S, 2, X)

: optimize_init_evaluator(S, &plleval2())

: optimize_init_params(S, J(1, 4, .01))

:
: bh = optimize(S)
Iteration 0:   f(p) = -851.18669
Iteration 1:   f(p) = -556.66874
Iteration 2:   f(p) = -555.81708
Iteration 3:   f(p) = -555.81538
Iteration 4:   f(p) = -555.81538

: optimize_result_params(S)
                  1              2              3              4
    +-------------------------------------------------------------+
  1 |  -.6558871399   -1.009017051    .1467114648    .5743542793  |
    +-------------------------------------------------------------+

: sqrt(diagonal(optimize_result_V_oim(S)))'
                 1             2             3             4
    +---------------------------------------------------------+
  1 |  .0706483931   .0807960852   .0313761961   .2839519366  |
    +---------------------------------------------------------+

: end
--------------------------------------------------------------------------------

After defining plleval2(), I copy the data from Stata to Mata, and I use optimize_init() to put the default choices into the optimize() object S. When I typed

optimize_init_argument(S, 1, y)

I put information into S specifying that optimize() should pass y as the first extra argument to the evaluator function. When I typed

optimize_init_argument(S, 2, X)

I put information into S specifying that optimize() should pass X as the second extra argument to the evaluator function.

Analogous to example 1, typing

optimize_init_evaluator(S, &plleval2())

puts the address of plleval2() into S, and typing

optimize_init_params(S, J(1, 4, .01))

puts the vector of starting values, J(1, 4, .01), in S.

The results are the same as those in example 1.

Vector of observation-level contributions and robust VCE estimation

Robust estimators for the VCE of an estimator use the structure of observation-level contributions; see Wooldridge (2010, chapters 12 and 13) or Cameron and Trivedi (2005, chapter 5). When the evaluator function gives optimize() a vector of observation-level contributions, instead of a scalar summation, optimize() can use this structure to compute robust or cluster-robust estimators of the VCE.

Consider plleval3(), which puts the vector of observation-level contributions into val.

Code block 4: A vector of observation-level contributions

mata:
void plleval3(real scalar todo, real vector b,     ///
              real vector y,    real matrix X,     ///
              val, grad, hess)
{
    real vector  xb

    xb = X*b'
   val = -exp(xb) + y:*xb - lnfactorial(y)
}
end

To use plleval3(), I must put information in the optimize() object stating that the evaluator function computes a vector of observation-level contributions. In example 4, I use optimize_init_evaluatortype() to put this information into the optimize() object S.

Example 4: Robust VCE estimation

. mata:
------------------------------------------------- mata (type end to exit) ------
: void plleval3(real scalar todo, real vector b,     ///
>               real vector y,    real matrix X,     ///
>               val, grad, hess)
> {
>     real vector  xb
>
>     xb = X*b'
>    val = -exp(xb) + y:*xb - lnfactorial(y)
> }
note: argument todo unused
note: argument grad unused
note: argument hess unused

:
:
: y = st_data(., "accidents")

: X = st_data(., "cvalue kids traffic")

: X = X,J(rows(X), 1, 1)

:
: S  = optimize_init()

: optimize_init_argument(S, 1, y)

: optimize_init_argument(S, 2, X)

: optimize_init_evaluator(S, &plleval3())

: optimize_init_evaluatortype(S, "gf0")

: optimize_init_params(S, J(1, 4, .01))

:
: bh = optimize(S)
Iteration 0:   f(p) = -851.18669
Iteration 1:   f(p) = -556.66874
Iteration 2:   f(p) = -555.81731
Iteration 3:   f(p) = -555.81538
Iteration 4:   f(p) = -555.81538

: optimize_result_params(S)
                  1              2              3              4
    +-------------------------------------------------------------+
  1 |  -.6558871527   -1.009017051    .1467114658    .5743542978  |
    +-------------------------------------------------------------+

: sqrt(diagonal(optimize_result_V_oim(S)))'
                 1             2             3             4
    +---------------------------------------------------------+
  1 |  .0706483832   .0807960809    .031376176   .2839517337  |
    +---------------------------------------------------------+

: sqrt(diagonal(optimize_result_V_robust(S)))'
                 1             2             3             4
    +---------------------------------------------------------+
  1 |  .1096020124    .188666044    .092431746   .6045057623  |
    +---------------------------------------------------------+

: end
--------------------------------------------------------------------------------

After defining plleval3(), I copy the data, create the optimize() object S, put the specifications for the extra arguments y and X in S, and put the address of plleval3() into S. Typing

optimize_init_evaluatortype(S, "gf0")

puts in S the information that the evaluator function returns a vector of observation-level contributions and that it computes zero derivatives; that is, the evaluator function is of type "gf0". Given the vector structure, I can type

optimize_result_V_robust(S)

to compute a robust estimator of the VCE.

sqrt(diagonal(optimize_result_V_robust(S)))'

returns the robust standard errors, which are the same as those reported by poisson in example 5.

Example 5: Robust VCE estimation by poisson

. poisson accidents cvalue kids traffic, vce(robust)

Iteration 0:   log pseudolikelihood = -555.86605
Iteration 1:   log pseudolikelihood =  -555.8154
Iteration 2:   log pseudolikelihood = -555.81538

Poisson regression                              Number of obs     =        505
                                                Wald chi2(3)      =      99.76
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -555.81538               Pseudo R2         =     0.2343

------------------------------------------------------------------------------
             |               Robust
   accidents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cvalue |  -.6558871   .1096019    -5.98   0.000    -.8707029   -.4410712
        kids |  -1.009017    .188666    -5.35   0.000    -1.378795   -.6392382
     traffic |   .1467115   .0924316     1.59   0.112    -.0344512    .3278741
       _cons |    .574354   .6045047     0.95   0.342    -.6104535    1.759162
------------------------------------------------------------------------------
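Because plleval3() computes observation-level contributions, the same machinery extends to a cluster-robust VCE. Below is a minimal sketch that is not part of the original examples; it assumes the dataset contains a hypothetical numeric cluster identifier named cid and that optimize_init_cluster() is given the vector of cluster values before the optimization is rerun.

mata:
cvar = st_data(., "cid")                        // hypothetical cluster variable
optimize_init_cluster(S, cvar)                  // declare the clusters to optimize()
bh = optimize(S)
sqrt(diagonal(optimize_result_V_robust(S)))'    // cluster-robust standard errors
end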

Done and undone

I showed how to use optimize() to maximize a Poisson log-likelihood function. I also showed how to obtain a robust estimator of the VCE by coding the evaluator function to compute a vector of observation-level contributions. In my next post, I show how to write a Stata command that uses Mata to estimate the parameters of a Poisson regression model.

References

Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.

Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, Massachusetts: MIT Press.

Programming an estimation command in Stata: A review of nonlinear optimization using Mata

\(\newcommand{\betab}{\boldsymbol{\beta}}
\newcommand{\xb}{{\bf x}}
\newcommand{\yb}{{\bf y}}
\newcommand{\gb}{{\bf g}}
\newcommand{\Hb}{{\bf H}}
\newcommand{\thetab}{\boldsymbol{\theta}}
\newcommand{\Xb}{{\bf X}}
\)

I review the theory behind nonlinear optimization and get more practice in Mata programming by implementing an optimizer in Mata. In real problems, I recommend using the optimize() function or moptimize() function instead of the one I describe here. In subsequent posts, I will discuss optimize() and moptimize(). This post will help you develop your Mata programming skills and will improve your understanding of how optimize() and moptimize() work.

A quick review of nonlinear optimization

We want to maximize a real-valued function \(Q(\thetab)\), where \(\thetab\) is a \(p\times 1\) vector of parameters. Minimization is done by maximizing \(-Q(\thetab)\). We require that \(Q(\thetab)\) be twice continuously differentiable, so that we can use a second-order Taylor series to approximate \(Q(\thetab)\) in a neighborhood of the point \(\thetab_s\),

\[
Q(\thetab) \approx Q(\thetab_s) + \gb_s'(\thetab -\thetab_s)
+ \frac{1}{2} (\thetab -\thetab_s)'\Hb_s (\thetab -\thetab_s)
\tag{1}
\]

where \(\gb_s\) is the \(p\times 1\) vector of first derivatives of \(Q(\thetab)\) evaluated at \(\thetab_s\) and \(\Hb_s\) is the \(p\times p\) matrix of second derivatives of \(Q(\thetab)\) evaluated at \(\thetab_s\), known as the Hessian matrix.

Nonlinear maximization algorithms start with a vector of initial values and produce a sequence of updated values that converge to the parameter vector that maximizes the objective function. The algorithms I discuss here can only find local maxima. The function in figure 1 has a local maximum at .2 and another at 1.5. The global maximum is at .2.

Figure 1: Local maxima

Each update is produced by finding the \(\thetab\) that maximizes the approximation on the right-hand side of equation (1) and letting it be \(\thetab_{s+1}\). To find the \(\thetab\) that maximizes the approximation, we set to \({\bf 0}\) the derivative of the right-hand side of equation (1) with respect to \(\thetab\),

\[
\gb_s + \Hb_s (\thetab -\thetab_s) = {\bf 0}
\tag{2}
\]

Replacing \(\thetab\) with \(\thetab_{s+1}\) and solving yields the update rule for \(\thetab_{s+1}\).

\[
\thetab_{s+1} = \thetab_s - \Hb_s^{-1} \gb_s
\tag{3}
\]

Note that the update is uniquely defined only if the Hessian matrix \(\Hb_s\) is full rank. To ensure that we have a local maximum, we will require that the Hessian be negative definite at the optimum, which also implies that the symmetric Hessian is full rank.

The update rule in equation (3) does not guarantee that \(Q(\thetab_{s+1})>Q(\thetab_s)\). We want to accept only those \(\thetab_{s+1}\) that do produce such an increase, so in practice, we use

\[
\thetab_{s+1} = \thetab_s - \lambda \Hb_s^{-1} \gb_s
\tag{4}
\]

where \(\lambda\) is the step size. In the algorithm presented here, we start with \(\lambda\) equal to \(1\) and, if necessary, decrease \(\lambda\) until we find a value that yields an increase.

The previous sentence is vague. I clarify it by writing an algorithm in Mata. Suppose that real scalar Q( real vector theta ) is a Mata function that returns the value of the objective function at a value of the parameter vector theta. For the moment, suppose that g_s is the vector of derivatives at the current theta, denoted by theta_s, and that Hi_s is the inverse of the Hessian matrix at theta_s. These definitions allow us to define the update function

Code block 1: Candidate rule for parameter vector

real vector tupdate(                 ///
	real scalar lambda,          ///
	real vector theta_s,         ///
	real vector g_s,             ///
	real matrix Hi_s)
{
	return (theta_s - lambda*Hi_s*g_s)
}

For specified values of lambda, theta_s, g_s, and Hi_s, tupdate() returns a candidate value for theta_s1. But we only accept candidate values of theta_s1 that yield an increase, so instead of using tupdate() to get an update, we would use GetUpdate().

Code block 2: Update function for parameter vector

real vector GetUpdate(            ///
    real vector theta_s,          ///
    real vector g_s,              ///
    real matrix Hi_s)
{
    real scalar lambda
    real vector theta_s1

    lambda   = 1
    theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s)
    while ( Q(theta_s1) < Q(theta_s) ) {
        lambda   = lambda/2
        theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s)
    }
    return(theta_s1)
}

GetUpdate() starts by getting a candidate value for theta_s1 when lambda = 1. GetUpdate() returns this candidate theta_s1 if it produces an increase in Q(). Otherwise, GetUpdate() divides lambda by 2 and gets another candidate theta_s1 until it finds a candidate that produces an increase in Q(). GetUpdate() returns the first candidate that produces an increase in Q().

While these functions clarify the ambiguities in the original vague statement, GetUpdate() makes the unwise assumption that there is always a lambda for which the candidate theta_s1 produces an increase in Q(). The version of GetUpdate() in code block 3 does not make this assumption; it exits with an error if lambda becomes too small (less than \(10^{-11}\)).

Code block 3: A better update function for parameter vector

real vector GetUpdate(            ///
    real vector theta_s,          ///
    real vector g_s,              ///
    real matrix Hi_s)
{
    real scalar lambda
    real vector theta_s1

    lambda   = 1
    theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s)
    while ( Q(theta_s1) < Q(theta_s) ) {
        lambda   = lambda/2
        if (lambda < 1e-11) {
            printf("{red}Cannot find parameters that produce an increase.\n")
            exit(error(3360))
        }
        theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s)
    }
    return(theta_s1)
}

An outline of our algorithm for nonlinear optimization is the following:

  1. Select initial values for the parameter vector.
  2. If the current parameters set the vector of derivatives of Q() to zero, go to (3); otherwise go to (A).
    A. Use GetUpdate() to get new parameter values.
    B. Calculate g_s and Hi_s at the parameter values from (A).
    C. Go to (2).
  3. Display results.

Code block 4 contains a Mata version of this algorithm.

Code block 4: Pseudocode for Newton–Raphson algorithm

theta_s  =  J(p, 1, .01)
GetDerives(theta_s, g_s, Hi_s)
gz = g_s'*Hi_s*g_s
while (abs(gz) > 1e-13) {
    theta_s = GetUpdate(theta_s, g_s, Hi_s)
    GetDerives(theta_s, g_s, Hi_s)
    gz      = g_s'*Hi_s*g_s
    printf("gz is now %8.7g\n", gz)
}
printf("Converged value of theta is\n")
theta_s

Line 1 puts the vector of starting values, a \(p\times 1\) vector with each element equal to .01, in theta_s. Line 2 uses GetDerives() to put the vector of first derivatives into g_s and the inverse of the Hessian matrix into Hi_s. In GetDerives(), I use cholinv() to calculate Hi_s. cholinv() returns missing values if the matrix is not positive definite. By calculating Hi_s = -1*cholinv(-H_s), I ensure that Hi_s contains missing values whenever the Hessian is not negative definite and full rank.

Line 3 calculates how different the vector of first derivatives is from 0. Instead of using a sum of squares, obtainable by g_s'g_s, I weight the first derivatives by the inverse of the Hessian matrix, which puts the \(p\) first derivatives on a similar scale and ensures that the Hessian matrix is negative definite at convergence. (If the Hessian matrix is not negative definite, GetDerives() puts a matrix of missing values into Hi_s, which makes gz missing, and a missing gz exceeds the tolerance.)
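To see the mechanics, here is a minimal sketch, not part of the original code, showing that cholinv() returns missing values when its argument is not positive definite, so that -1*cholinv(-H_s) is missing exactly when H_s is not negative definite:

mata:
H1 = (-2, 0 \ 0, -3)        // negative definite
H2 = ( 2, 0 \ 0, -3)        // not negative definite
-1*cholinv(-H1)             // the inverse of H1
-1*cholinv(-H2)             // a matrix of missing values
end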

To flesh out the details, we need a specific problem. Consider maximizing the log-likelihood function of a Poisson model, which has a simple functional form. The contribution of each observation to the log-likelihood is

\[
f_i(\betab) = y_i\xb_i\betab - \exp(\xb_i\betab) - \ln( y_i !)
\]

where \(y_i\) is the dependent variable, \({\bf x}_i\) is the vector of covariates, and \(\betab\) is the vector of parameters that we select to maximize the log-likelihood function given by \(F(\betab) =\sum_i f_i(\betab)\). I could drop ln(y_i!), because it does not depend on the parameters. I include it to make the value of the log-likelihood function the same as that reported by Stata. Stata includes these terms so that log-likelihood-function values are comparable across models.

The pll() function in code block 5 computes the Poisson log-likelihood function from the vector of observations on the dependent variable y, the matrix of observations on the covariates X, and the vector of parameter values b.

Code block 5: A function for the Poisson log-likelihood function

// Compute Poisson log-likelihood 
mata:
real scalar pll(real vector y, real matrix X, real vector b)
{
    real vector  xb

    xb = X*b
    return(sum(-exp(xb) + y:*xb - lnfactorial(y)))
}
end

The vector of first derivatives is

\[
\frac{\partial F(\betab)}{\partial \betab}
= \sum_{i=1}^N (y_i - \exp(\xb_i\betab))\xb_i
\]

which I can compute in Mata as quadcolsum((y-exp(X*b)):*X), and the Hessian matrix is

\[
\sum_{i=1}^N\frac{\partial^2 f_i(\betab)}{\partial\betab\,\partial\betab^\prime}
= - \sum_{i=1}^N\exp(\xb_i\betab)\xb_i^\prime \xb_i
\tag{5}
\]

which I can compute in Mata as -quadcross(X, exp(X*b), X).
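As a quick check that these expressions are right, the analytic gradient can be compared with a central finite-difference approximation. The following is a minimal sketch that is not part of the original post; it assumes that pll() from code block 5 has been defined and that y and X have been created as in code block 6 below.

mata:
b0   = J(cols(X), 1, .01)                       // test point
g    = (quadcolsum((y - exp(X*b0)):*X))'        // analytic gradient at b0
h    = 1e-6
gnum = J(cols(X), 1, .)
for (j=1; j<=cols(X); j++) {
    e       = J(cols(X), 1, 0)
    e[j]    = h
    gnum[j] = (pll(y, X, b0 + e) - pll(y, X, b0 - e))/(2*h)
}
mreldif(g, gnum)                                // should be close to zero
end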

Here is some code that implements this Newton-Raphson (NR) algorithm for the Poisson regression problem.

Code block 6: pnr1.do
(Uses accident3.dta)

// Newton-Raphson for Poisson log-likelihood
clear all
use accident3

mata:
real scalar pll(real vector y, real matrix X, real vector b)
{
    real vector  xb
    xb = X*b
    return(sum(-exp(xb) + y:*xb - lnfactorial(y)))
}

void GetDerives(real vector y, real matrix X, real vector theta, g, Hi)
{
	real vector exb

	exb = exp(X*theta)
	g   =  (quadcolsum((y - exb):*X))'
	Hi  = quadcross(X, exb, X)
	Hi = -1*cholinv(Hi)
}

real vector tupdate(                 ///
	real scalar lambda,          ///
	real vector theta_s,          ///
	real vector g_s,              ///
	real matrix Hi_s)
{
	return (theta_s - lambda*Hi_s*g_s)
}

real vector GetUpdate(            ///
    real vector y,                ///
    real matrix X,                ///
    real vector theta_s,          ///
    real vector g_s,              ///
    real matrix Hi_s)
{
    real scalar lambda 
    real vector theta_s1

    lambda = 1
    theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s)
    while ( pll(y, X, theta_s1) <= pll(y, X, theta_s) ) {
        lambda   = lambda/2
        if (lambda < 1e-11) {
            printf("{red}Cannot find parameters that produce an increase.\n")
            exit(error(3360))
        }
        theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s)
    }
    return(theta_s1)
}


y = st_data(., "accidents")
X = st_data(., "cvalue kids traffic")
X = X,J(rows(X), 1, 1)

b  =  J(cols(X), 1, .01)
GetDerives(y, X, b, g=., Hi=.)
gz = .
while (abs(gz) > 1e-11) {
	bs1 = GetUpdate(y, X, b, g, Hi)
	b   = bs1
	GetDerives(y, X, b, g, Hi)
	gz = g'*Hi*g
	printf("gz is now %8.7g\n", gz)
}
printf("Converged value of beta is\n")
b

end

Line 3 reads in the downloadable accident3.dta dataset before dropping down to Mata. I use variables from this dataset on lines 56 and 57.

Lines 6–11 define pll(), which returns the value of the Poisson log-likelihood function, given the vector of observations on the dependent variable y, the matrix of covariate observations X, and the current parameters b.

Lines 13–21 put the vector of first derivatives in g and the inverse of the Hessian matrix in Hi. Equation (5) specifies a matrix that is negative definite as long as the covariates are not linearly dependent. As discussed above, cholinv() returns a matrix of missing values if its argument is not positive definite. I therefore compute the negative of the Hessian on line 19 and multiply by \(-1\) on line 20, after inverting with cholinv().

Lines 23–30 implement the tupdate() function previously discussed.

Lines 32–53 implement the GetUpdate() function previously discussed, with the caveats that this version handles the data and uses pll() to compute the value of the objective function.

Lines 56–58 get the data from Stata and join a column of ones to X for the constant term.

Lines 60–71 implement the NR algorithm discussed above for this Poisson regression problem.

Running pnr1.do produces

Example 1: NR algorithm for Poisson

. do pnr1

. // Newton-Raphson for Poisson log-likelihood
. clear all

. use accident3

. 
. mata:

[Output Omitted]

: b  =  J(cols(X), 1, .01)

: GetDerives(y, X, b, g=., Hi=.)

: gz = .

: while (abs(gz) > 1e-11) {
>         bs1 = GetUpdate(y, X, b, g, Hi)
>         b   = bs1
>         GetDerives(y, X, b, g, Hi)
>         gz = g'*Hi*g
>         printf("gz is now %8.7g\n", gz)
> }
gz is now -119.201
gz is now -26.6231
gz is now -2.02142
gz is now -.016214
gz is now -1.3e-06
gz is now -8.3e-15

: printf("Converged value of beta is\n")
Converged value of beta is

: b
                  1
    +----------------+
  1 |  -.6558870685  |
  2 |  -1.009016966  |
  3 |   .1467114652  |
  4 |   .5743541223  |
    +----------------+

: 
: end
--------------------------------------------------------------------------------
. 
end of do-file

The point estimates in example 1 are equivalent to those produced by poisson.

Example 2: poisson results

. poisson accidents cvalue kids traffic

Iteration 0:   log likelihood = -555.86605  
Iteration 1:   log likelihood =  -555.8154  
Iteration 2:   log likelihood = -555.81538  

Poisson regression                              Number of obs     =        505
                                                LR chi2(3)        =     340.20
                                                Prob > chi2       =     0.0000
Log likelihood = -555.81538                     Pseudo R2         =     0.2343

------------------------------------------------------------------------------
   accidents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cvalue |  -.6558871   .0706484    -9.28   0.000    -.7943553   -.5174188
        kids |  -1.009017   .0807961   -12.49   0.000    -1.167374   -.8506594
     traffic |   .1467115   .0313762     4.68   0.000     .0852153    .2082076
       _cons |    .574354   .2839515     2.02   0.043     .0178193    1.130889
------------------------------------------------------------------------------

Done and undone

I implemented a simple nonlinear optimizer to practice Mata programming and to review the theory behind nonlinear optimization. In future posts, I implement a command for Poisson regression that uses the optimizer in optimize().

Programming an estimation command in Stata: Adding robust and cluster-robust VCEs to our Mata-based OLS command

I show how to use the undocumented command _vce_parse to parse the options for robust or cluster-robust estimators of the variance-covariance of the estimator (VCE). I then discuss myregress12.ado, which performs its computations in Mata and computes VCE estimators based on independently and identically distributed (IID) observations, robust methods, or cluster-robust methods.

myregress12.ado performs ordinary least-squares (OLS) regression, and it extends myregress11.ado, which I discussed in Programming an estimation command in Stata: An OLS command using Mata. To get the most out of this post, you should be familiar with Programming an estimation command in Stata: Using a subroutine to parse a complex option and Programming an estimation command in Stata: Computing OLS objects in Mata.

Parsing the vce() option

I used ado-subroutines to simplify the parsing of the options vce(robust) and vce(cluster cvarname) in myregress10.ado; see Programming an estimation command in Stata: Using a subroutine to parse a complex option. Part of the point was to illustrate how to write ado-subroutines and the programming tricks that I used in these subroutines.

Here I use the undocumented command _vce_parse to simplify the parsing. There are many undocumented commands designed to help Stata programmers. They are undocumented in that they are tersely documented in the system help but not documented in the manuals. In addition, the syntax or behavior of these commands may change over Stata releases, although this rarely happens.

_vce_parse helps Stata programmers parse the vce() option. To see how it works, consider the problem of parsing the syntax of myregress12.

myregress12 depvar [indepvars] [if] [in] [, vce(robust | cluster clustervar) noconstant]

where indepvars can contain factor variables or time-series variables.

I can use the syntax command to put whatever the user specifies in the option vce() into the local macro vce, but I still must (1) check that what was specified makes sense and (2) create local macros that the code can use to do the right thing. Examples 1–7 create the local macro vce, simulating what syntax would do, and then use _vce_parse to perform tasks (1) and (2).

I begin with the case in which the user specified vce(robust); here the local macro vce would contain the word robust.

Example 1: parsing vce(robust)

. clear all

. sysuse auto
(1978 Automobile Data)

. local vce "robust"

. _vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce')

. return list

macros:
             r(robust) : "robust"
             r(vceopt) : "vce(robust)"
                r(vce) : "robust"

The command

_vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce')

has two pieces. The piece before the colon (:) specifies the rules; the piece after the colon specifies what the user typed. Each piece can have a Stata object followed by some options; note the commas before optlist(Robust) and before vce(`vce'). In the case at hand, the second piece only contains what the user specified – vce(robust) – and the first piece only contains the options optlist(Robust) and argoptlist(CLuster). The option optlist(Robust) specifies that the vce() option in the second piece may contain the option robust and that its minimal abbreviation is r. Note how the word Robust in optlist(Robust) mimics how syntax specifies minimum abbreviations. The option argoptlist(CLuster) specifies that the vce() option in the second piece may contain cluster clustervar, that the minimum abbreviation of cluster is cl, and that it will put the argument clustervar into a local macro.

After the command,

_vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce')

I use return list to show what _vce_parse stored in r(). Because local macro vce contains “robust”, _vce_parse

  1. puts the word robust in the local macro r(robust);
  2. puts what the user typed, vce(robust), in the local macro r(vceopt); and
  3. puts the type of VCE, robust, in the local macro r(vce).

Examples 2 and 3 illustrate that _vce_parse stores the same values in these local macros when the user specifies vce(rob) or vce(r), which are valid abbreviations for vce(robust).

Example 2: parsing vce(rob)

. local vce "rob"

. _vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce')

. return list

macros:
             r(robust) : "robust"
             r(vceopt) : "vce(robust)"
                r(vce) : "robust"

Example 3: parsing vce(r)

. local vce "r"

. _vce_parse , optlist(Robust) argoptlist(CLuster) : , vce(`vce')

. return list

macros:
             r(robust) : "robust"
             r(vceopt) : "vce(robust)"
                r(vce) : "robust"

Now, consider parsing the option vce(cluster clustervar). Because the cluster variable clustervar may contain missing values, _vce_parse may need to update a sample-identification variable before it stores the name of the cluster variable in a local macro. In example 4, I use the command

_vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce')

to handle the case when the user specifies vce(cluster rep78). The results from the tabulate and summarize commands illustrate that _vce_parse updates the sample-identification variable mytouse to account for the missing observations in rep78.

Example 4: parsing vce(cluster rep78)

. generate byte mytouse = 1

. tabulate mytouse

    mytouse |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         74      100.00      100.00
------------+-----------------------------------
      Total |         74      100.00

. summarize rep78

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       rep78 |         69    3.405797    .9899323          1          5

. local vce "cluster rep78"

. _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce')

. return list

macros:
             r(robust) : "robust"
            r(cluster) : "rep78"
             r(vceopt) : "vce(cluster rep78)"
            r(vceargs) : "rep78"
                r(vce) : "cluster"

. tabulate mytouse

    mytouse |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          5        6.76        6.76
          1 |         69       93.24      100.00
------------+-----------------------------------
      Total |         74      100.00

I use return list to show what _vce_parse stored in r(). Because local macro vce contains cluster rep78, _vce_parse

  1. puts the word robust in the local macro r(robust);
  2. puts the name of the cluster variable, rep78, in the local macro r(cluster);
  3. puts what the user typed, vce(cluster rep78), in the local macro r(vceopt);
  4. puts the argument to the cluster option, rep78, in the local macro r(vceargs); and
  5. puts the type of VCE, cluster, in the local macro r(vce).

Examples 5 and 6 illustrate that _vce_parse stores the same values in these local macros when the user specifies vce(clus rep78) or vce(cl rep78), which are valid abbreviations for vce(cluster rep78).

Example 5: parsing vce(clus rep78)

. local vce "clus rep78"

. _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce')

. return list

macros:
             r(robust) : "robust"
            r(cluster) : "rep78"
             r(vceopt) : "vce(cluster rep78)"
            r(vceargs) : "rep78"
                r(vce) : "cluster"

Example 6: parsing vce(cl rep78)

. local vce "cl rep78"

. _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vce(`vce')

. return list

macros:
             r(robust) : "robust"
            r(cluster) : "rep78"
             r(vceopt) : "vce(cluster rep78)"
            r(vceargs) : "rep78"
                r(vce) : "cluster"

Having illustrated how _vce_parse handles the cases in which the user specifies something valid, I show in example 7 that it also produces a standard error message when the user specifies something invalid.

Example 7: parsing vce(silly)

. local vce "silly"

. capture noisily _vce_parse mytouse, optlist(Robust) argoptlist(CLuster) : , vc
> e(`vce')
vcetype 'silly' not allowed

. return list

_vce_parse can parse other types of vce() options; to see them type help _vce_parse.

Also, remember to type help undocumented when you are looking for a programmer's tool.
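Putting the pieces together, here is a minimal sketch of a hypothetical ado-program, myvce_demo, that does nothing except parse and display what _vce_parse returns; the full estimation command that actually uses these results appears in the next section.

program define myvce_demo
    version 14.1
    syntax varlist(numeric) [if] [in] [, vce(string) ]
    marksample touse

    // parse the vce() option and display the parsed results
    _vce_parse `touse' , optlist(Robust) argoptlist(CLuster) : , vce(`vce')
    display "type of VCE:      `r(vce)'"
    display "cluster variable: `r(cluster)'"
end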

The code for myregress12

Here is the code for myregress12.ado, which uses _vce_parse. I describe how it works below.

I recommend that you click on the file name to download the code for myregress12.ado. To avoid scrolling, view the code in the Do-file Editor, or your favorite text editor, to see the line numbers.

Code block 1: myregress12.ado

*! version 12.0.0  16Jan2016
program define myregress12, eclass sortpreserve
    version 14.1

    syntax varlist(numeric ts fv) [if] [in] [, noCONStant vce(string) ]
    marksample touse

    _vce_parse `touse' , optlist(Robust) argoptlist(CLuster) : , vce(`vce')
    local vce        "`r(vce)'"
    local clustervar "`r(cluster)'"
    if "`vce'" == "robust" | "`vce'" == "cluster" {
        local vcetype "Robust"
    }
    if "`clustervar'" != "" {
        capture confirm numeric variable `clustervar'
        if _rc {
            display in red "invalid vce() option"
            display in red "cluster variable {bf:`clustervar'} is " ///
                "string variable instead of a numeric variable"
            exit(198)
        }
        sort `clustervar'
    }

    gettoken depvar indepvars : varlist
    _fv_check_depvar `depvar'

    fvexpand `indepvars' 
    local cnames `r(varlist)'

    tempname b V N rank df_r

    mata: mywork("`depvar'", "`cnames'", "`touse'", "`constant'",    ///
       "`vce'", "`clustervar'",                                  /// 
       "`b'", "`V'", "`N'", "`rank'", "`df_r'") 

    if "`constant'" == "" {
        local cnames `cnames' _cons
    }

    matrix colnames `b' = `cnames'
    matrix colnames `V' = `cnames'
    matrix rownames `V' = `cnames'

    ereturn post `b' `V', esample(`touse') buildfvinfo
    ereturn scalar N        = `N'
    ereturn scalar rank     = `rank'
    ereturn scalar df_r     = `df_r'
    ereturn local  vce      "`vce'"
    ereturn local  vcetype  "`vcetype'"
    ereturn local  clustvar "`clustervar'"
    ereturn local  cmd      "myregress12"

    ereturn display

end

mata:

void mywork( string scalar depvar,  string scalar indepvars, 
             string scalar touse,   string scalar constant,  
             string scalar vcetype, string scalar clustervar,
             string scalar bname,   string scalar Vname,     
             string scalar nname,   string scalar rname,     
             string scalar dfrname) 
{

    real vector    y, b, e, e2, cvar, ei 
    real matrix    X, XpXi, M, V, info, xi 
    real scalar    n, p, k, nc, i, dfr

    y    = st_data(., depvar, touse)
    X    = st_data(., indepvars, touse)
    n    = rows(X)

    if (constant == "") {
        X    = X,J(n,1,1)
    }

    XpXi = quadcross(X, X)
    XpXi = invsym(XpXi)
    b    = XpXi*quadcross(X, y)
    e    = y - X*b
    e2   = e:^2
    p    = cols(X)
    k    = p - diag0cnt(XpXi)
    if (vcetype == "robust") {
        M    = quadcross(X, e2, X)
        dfr  = n - k
        V    = (n/dfr)*XpXi*M*XpXi
    }
    else if (vcetype == "cluster") {
        cvar = st_data(., clustervar, touse)
        info = panelsetup(cvar, 1)
        nc   = rows(info)
        M    = J(k, k, 0)
        dfr  = nc - 1
        for(i=1; i<=nc; i++) {
            xi = panelsubmatrix(X,i,info)
            ei = panelsubmatrix(e,i,info)
            M  = M + xi'*(ei*ei')*xi
        }
        V    = ((n-1)/(n-k))*(nc/(nc-1))*XpXi*M*XpXi
    }
    else {                 // vcetype must be IID
        dfr  = n - k
        V    = (quadsum(e2)/dfr)*XpXi
    }

    st_matrix(bname, b')
    st_matrix(Vname, V)
    st_numscalar(nname, n)
    st_numscalar(rname, k)
    st_numscalar(dfrname, dfr)

}

end

Let’s break this 118-line program into familiar pieces. Lines 2-56 define the ado-command, and lines 58-118 define the Mata work function that is used by the ado-command. Despite the addition of details to handle the parsing and computation of a robust or cluster-robust VCE, the structures of the ado-command and of the Mata work function are the same as they were in myregress11.ado; see Programming an estimation command in Stata: An OLS command using Mata.

The ado-command has four parts.

  1. Lines 5-31 parse what the user typed, identify the sample, and create temporary names for the results returned by our Mata work function.
  2. Lines 33-35 call the Mata work function.
  3. Lines 37-52 post the results returned by the Mata work function to e().
  4. Line 54 displays the results.

The Mata function mywork() also has four parts.

  1. Lines 60-65 parse the arguments.
  2. Lines 68-70 declare vectors, matrices, and scalars that are local to mywork().
  3. Lines 80-108 compute the results.
  4. Lines 110-114 copy the computed results to Stata, using the names that were passed in the arguments.

Now, I address the details of the ado-code, although I do not discuss the details of myregress12.ado that I already covered when describing myregress11.ado in Programming an estimation command in Stata: An OLS command using Mata. Line 5 allows the user to specify the vce() option, and line 8 uses _vce_parse to parse what the user specifies. Lines 9 and 10 put the type of VCE found by _vce_parse in the local macro vce and the name of the cluster variable, if specified, in the local macro clustervar. Lines 11–13 put Robust in the local macro vcetype if the specified vce is either robust or cluster. If there is a cluster variable, lines 14–23 check that it is numeric and use it to sort the data.

Line 34 passes the new arguments for the type of VCE and the name of the cluster variable to the Mata work function mywork().

Lines 49–51 store the type of VCE, the output label for the VCE type, and the name of the cluster variable in e(), respectively.

Now, I address the details of the Mata work function mywork(), discussing only what I have added relative to the mywork() in myregress11.ado. Line 62 declares the new arguments. The string scalar vcetype is empty, contains “robust”, or contains “cluster”. The string scalar clustervar is either empty or contains the name of the cluster variable.

Lines 68–70 declare the local-to-the-function vectors cvar and ei and the local-to-the-function matrices M, info, and xi that are needed now but not previously.

Lines 87, 91–92, 104–105, and 108 specify if-else blocks to compute the correct VCE. Lines 88–90 compute a robust estimator of the VCE if vcetype contains “robust”. Lines 93–103 compute a cluster-robust estimator of the VCE if vcetype contains “cluster”. Lines 106–107 compute an IID-based estimator of the VCE if vcetype contains neither “robust” nor “cluster”.
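As a quick check that is not part of the original post, one might install myregress12.ado and compare its output with official regress on the auto data; the point estimates and cluster-robust standard errors from the two commands can then be compared side by side.

sysuse auto, clear
myregress12 mpg weight displacement, vce(cluster rep78)
regress     mpg weight displacement, vce(cluster rep78)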

Done and undone

I introduced the undocumented command _vce_parse and discussed the code for myregress12.ado, which uses Mata to compute OLS point estimates and an IID-based VCE, a robust VCE, or a cluster-robust VCE.

The structure of the code is the same as the one that I used in myregress11.ado and in mymean8.ado, which I discussed in Programming an estimation command in Stata: An OLS command using Mata and in Programming an estimation command in Stata: A first ado-command using Mata. That the structure remains the same makes it easier to handle the details that arise in more complicated problems.

Bayesian binary item response theory models using bayesmh

This post was written jointly with Yulia Marchenko, Executive Director of Statistics, StataCorp.

Table of Contents

Overview
1PL model
2PL model
3PL model
4PL model
5PL model
Conclusion

Overview

Item response theory (IRT) is used for modeling the relationship between the latent abilities of a group of subjects and the examination items used for measuring their abilities. Stata 14 introduced a suite of commands for fitting IRT models using maximum likelihood; see, for example, the blog post Spotlight on irt by Rafal Raciborski and the [IRT] Item Response Theory manual for more details. In this post, we demonstrate how to fit Bayesian binary IRT models by using the redefine() option introduced for the bayesmh command in Stata 14.1. If you are not familiar with the concepts and jargon of Bayesian statistics, you may want to watch the introductory videos on the Stata YouTube channel before proceeding.

Introduction to Bayesian analysis, part 1 : The basic concepts
Introduction to Bayesian analysis, part 2: MCMC and the Metropolis-Hastings algorithm

We use the abridged version of the mathematics and science data from De Boeck and Wilson (2004), masc1. The dataset includes 800 student responses to 9 test questions intended to measure mathematical ability.

The irt suite fits IRT models using data in the wide form – one observation per subject with items recorded in separate variables. To fit IRT models using bayesmh, we need the data in the long form, where items are recorded as multiple observations per subject. We thus reshape the dataset into long form: we have a single binary response variable, y, and two index variables, item and id, which identify the items and subjects, respectively. This allows us to formulate our IRT models as multilevel models. The following commands load and prepare the dataset.

. webuse masc1
(Data from De Boeck & Wilson (2004))

. generate id = _n

. quietly reshape long q, i(id) j(item)

. rename q y

To ensure that we include all levels of item and id in our models, we use fvset base none to keep the base categories.

. fvset base none id item

In what follows, we present eight Bayesian binary IRT models increasing in complexity and explanatory power. We perform Bayesian model comparison to gain insight into what would be the more appropriate model for the data at hand.

For high-dimensional models such as IRT models, you may see differences in the estimation results between different platforms or different flavors of Stata because of the nature of the Markov chain Monte Carlo (MCMC) sampling and finite numerical precision. These differences are not a source of concern; they will be within the range of the MCMC variability and will lead to similar inferential conclusions. The differences will diminish as the MCMC sample size increases. The results in this post are obtained from Stata/SE on the 64-bit Linux platform using the default 10,000 MCMC sample size.

Let the items be indexed by \(i=1,\dots,9\) and the subjects by \(j=1,\dots,800\). Let \(\theta_j\) be the latent mathematical ability of subject \(j\), and let \(Y_{ij}\) be the response of subject \(j\) to item \(i\).

Back to table of contents

1PL model

In the one-parameter logistic (1PL) model, the probability of getting a correct response is modeled as an inverse-logit function of location parameters \(b_i\), also called item difficulties, and a common slope parameter \(a\), also called item discrimination:

\[
P(Y_{ij}=1) = {\rm InvLogit}\{a(\theta_j-b_i)\} =
\frac{\exp\{a(\theta_j-b_i)\}}{1+\exp\{a(\theta_j-b_i)\}}
\]

Typically, the abilities are assumed to be normally distributed:
\[
\theta_j \sim {\rm N}(0,1)
\]
In a multilevel framework, the \(\theta_j\)’s represent random effects. In a Bayesian framework, we use the term “random effects” to refer to the parameters corresponding to levels of grouping variables identifying the hierarchy of the data.

A Bayesian formulation of the 1PL model also requires prior specification for the model parameters \(a\) and \(b_i\). The discrimination parameter \(a\) is assumed to be positive and is often modeled in the log scale. Because we have no prior knowledge about the discrimination and difficulty parameters, we assume that the prior distributions of \(\ln(a)\) and \(b_i\) have support on the whole real line, are symmetric, and are centered at 0. A normal prior distribution is thus a natural choice. We furthermore assume that \(\ln(a)\) and \(b_i\) are close to 0 and have prior variance of 1, which is an entirely subjective decision. We thus assign \(\ln(a)\) and \(b_i\) standard normal prior distributions:

\[\ln(a) \sim {\rm N}(0, 1)\] \[b_i \sim {\rm N}(0, 1) \]

To specify the likelihood function of the 1PL model in bayesmh, we use a nonlinear equation specification for the response variable y. The direct nonlinear specification for this model is

bayesmh y = ({discrim}*({subj:i.id}-{diff:i.item})), likelihood(logit) ...

where {discrim} is the discrimination parameter \(a\), {subj:i.id} are latent abilities \(\theta_j\), and {diff:i.item} are item difficulties \(b_i\). The logit model is used for the probability of a success, \(P(Y_{ij}=1)\). Specification {subj:i.id} in the above nonlinear expression is viewed as a substitutable expression for linear combinations of indicators associated with the id variable and parameters \(\theta_j\). This specification may be computationally prohibitive with a large number of subjects. A more efficient solution is to use the redefine() option to include subject random effects \(\theta_j\) in the model. The same argument may apply to the {diff:i.item} specification when there are many items. Thus, it may be computationally convenient to treat the \(b_i\) parameters as “random effects” in the specification and use the redefine() option to include them in the model.

A more efficient specification is thus

bayesmh y = ({discrim}*({subj:}-{diff:})), likelihood(logit) ///
               redefine(subj:i.id) redefine(diff:i.item) ...

where {subj:} and {diff:} in the nonlinear specification now represent the \(\theta_j\) and \(b_i\) parameters, respectively, without using expansions into linear combinations of indicator variables.

Below, we show the full bayesmh specification of the 1PL model and the output summary. In our examples, we treat the abilities {subj:i.id} as nuisance parameters and exclude them from the final results. The discrimination model parameter {discrim} must be positive and is thus initialized with 1. A longer burn-in period, burnin(5000), allows for longer adaptation of the MCMC sampler, which is needed given the large number of parameters in the model. Finally, the estimation results are stored for later model comparison.

. set seed 14

. bayesmh y = ({discrim}*({subj:}-{diff:})), likelihood(logit) ///
>         redefine(diff:i.item) redefine(subj:i.id)            ///
>         prior({subj:i.id},    normal(0, 1))                  ///
>         prior({discrim},      lognormal(0, 1))               ///
>         prior({diff:i.item},  normal(0, 1))                  ///
>         init({discrim} 1) exclude({subj:i.id})               ///
>         burnin(5000) saving(sim1pl, replace)
  
Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood: 
  y ~ logit({discrim}*(xb_subj-xb_diff))

Priors: 
  {diff:i.item} ~ normal(0,1)                                              (1)
    {subj:i.id} ~ normal(0,1)                                              (2)
      {discrim} ~ lognormal(0,1)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_diff.
(2) Parameters are elements of the linear form xb_subj.

Bayesian logistic regression                     MCMC iterations  =     15,000
Random-walk Metropolis-Hastings sampling         Burn-in          =      5,000
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      7,200
                                                 Acceptance rate  =      .3074
                                                 Efficiency:  min =     .02691
                                                              avg =     .06168
Log marginal likelihood =          .                          max =     .09527
 
------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
diff         |
        item |
          1  | -.6934123   .0998543   .003576  -.6934789  -.8909473  -.4917364
          2  | -.1234553   .0917187   .002972  -.1241642  -.3030341   .0597863
          3  | -1.782762   .1323252    .00566  -1.781142   -2.05219  -1.534451
          4  |  .3152835   .0951978   .003289   .3154714   .1279147   .4981263
          5  |  1.622545    .127213   .005561   1.619388   1.377123   1.883083
          6  |  .6815517   .0978777   .003712   .6788345   .4911366    .881128
          7  |  1.303482   .1173994   .005021   1.302328   1.084295   1.544913
          8  | -2.353975   .1620307   .008062  -2.351207  -2.672983  -2.053112
          9  | -1.168668   .1120243   .004526  -1.163922  -1.392936  -.9549209
-------------+----------------------------------------------------------------
     discrim |  .8644787   .0439804   .002681   .8644331   .7818035   .9494433
------------------------------------------------------------------------------

file sim1pl.dta saved

. estimates store est1pl

The sampling efficiency is acceptable, about 6% on average, with no indication of convergence problems. Although detailed convergence inspection of all parameters is outside the scope of this post, we recommend that you do so by using, for example, the bayesgraph diagnostics command.
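For example, a minimal sketch of such a check for the common discrimination parameter is shown below; these commands are not part of the original output. bayesgraph diagnostics plots the trace, histogram, autocorrelation, and density of the parameter, and bayesstats ess reports its effective sample size.

bayesgraph diagnostics {discrim}
bayesstats ess {discrim}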

Though we used informative priors for the model parameters, the estimation results from our Bayesian model are not that different from the maximum likelihood estimates obtained using the irt 1pl command (see example 1 in [IRT] irt 1pl). For example, the posterior mean estimate for {discrim} is 0.86 with an MCMC standard error of 0.003, whereas irt 1pl reports 0.85 with a standard error of 0.05.

The log-marginal likelihood is reported missing because we have excluded the {subj:i.id} parameters from the simulation results and the Laplace-Metropolis estimator of the log-marginal likelihood is not available in such cases. This estimator requires simulation results for all model parameters to compute the log-marginal likelihood.

Back to table of contents

2PL model

The two-parameter logistic (2PL) model extends the 1PL model by allowing for item-specific discrimination. The probability of correct response is now modeled as a function of item-specific slope parameters \(a_i\):
\[
P(Y_{ij}=1) = {\rm InvLogit}\{a_i(\theta_j-b_i)\} =
\frac{\exp\{a_i(\theta_j-b_i)\}}{1+\exp\{a_i(\theta_j-b_i)\}}
\]

The prior specification for \(\theta_j\) stays the same as in the 1PL model. We will, however, apply more elaborate prior specifications for the \(a_i\)’s and \(b_i\)’s. It is a good practice to use proper prior specifications without overwhelming the evidence from the data. The impact of the priors can be controlled by introducing additional hyperparameters. For example, Kim and Bolt (2007) proposed the use of a normal prior for the difficulty parameters with unknown mean and variance. Extending this approach to the discrimination parameters as well, we apply a hierarchical Bayesian model in which the \(\ln(a_i)\) and \(b_i\) parameters have the following prior specifications:

\[ \ln(a_i) \sim {\rm N}(\mu_a, \sigma_a^2) \] \[ b_i \sim {\rm N}(\mu_b, \sigma_b^2) \]

The mean hyperparameters, \(\mu_a\) and \(\mu_b\), and variance hyperparameters, \(\sigma_a^2\) and \(\sigma_b^2\), require informative prior specifications. We assume that the means are centered at 0 with a variation of 0.1:
\[
\mu_a, \mu_b \sim {\rm N}(0, 0.1)
\]

To lower the variability of the \(\ln(a_i)\) and \(b_i\) parameters, we apply an inverse-gamma prior with shape 10 and scale 1 for the variance parameters:

\[
\sigma_a^2, \sigma_b^2 \sim {\rm InvGamma}(10, 1)
\]

Thus, the prior mean of \(\sigma_a^2\) and \(\sigma_b^2\) is \(1/(10-1) \approx 0.1\).

In the bayesmh specification, the hyperparameters \(\mu_a\), \(\mu_b\), \(\sigma_a^2\), and \(\sigma_b^2\) are denoted as {mu_a}, {mu_b}, {var_a}, and {var_b}, respectively. We use the redefine(discrim:i.item) option to include in the model the discrimination parameters \(a_i\), referred to as {discrim:} in the likelihood specification.

Regarding the MCMC simulation, we change some of the default options. The hyperparameters {mu_a}, {mu_b}, {var_a}, and {var_b} are placed in separate blocks to improve the simulation efficiency. The discrimination parameters {discrim:i.item} must be positive and are thus initialized with 1s.

. set seed 14

. bayesmh y = ({discrim:}*({subj:}-{diff:})), likelihood(logit) ///
>         redefine(discrim:i.item) redefine(diff:i.item)        ///
>         redefine(subj:i.id)                                   ///
>         prior({subj:i.id},      normal(0, 1))                 ///
>         prior({discrim:i.item}, lognormal({mu_a}, {var_a}))   ///
>         prior({diff:i.item},    normal({mu_b}, {var_b}))      ///
>         prior({mu_a} {mu_b},    normal(0, 0.1))               ///
>         prior({var_a} {var_b},  igamma(10, 1))                ///
>         block({mu_a mu_b var_a var_b}, split)                 ///
>         init({discrim:i.item} 1)                              ///
>         exclude({subj:i.id}) burnin(5000) saving(sim2pl, replace)
  
Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood: 
  y ~ logit(xb_discrim*(xb_subj-xb_diff))

Priors: 
  {discrim:i.item} ~ lognormal({mu_a},{var_a})                             (1)
     {diff:i.item} ~ normal({mu_b},{var_b})                                (2)
       {subj:i.id} ~ normal(0,1)                                           (3)

Hyperpriors: 
    {mu_a mu_b} ~ normal(0,0.1)
  {var_a var_b} ~ igamma(10,1)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_discrim.
(2) Parameters are elements of the linear form xb_diff.
(3) Parameters are elements of the linear form xb_subj.

Bayesian logistic regression                     MCMC iterations  =     15,000
Random-walk Metropolis-Hastings sampling         Burn-in          =      5,000
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      7,200
                                                 Acceptance rate  =      .3711
                                                 Efficiency:  min =     .01617
                                                              avg =     .04923
Log marginal likelihood =          .                          max =      .1698
 
------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
discrim      |
        item |
          1  |  1.430976   .1986011   .010953   1.413063   1.089405   1.850241
          2  |  .6954823   .1081209   .004677   .6897267   .4985004   .9276975
          3  |  .9838528   .1343908   .009079   .9780275   .7506566   1.259427
          4  |  .8167792   .1169157   .005601   .8136229   .5992495   1.067578
          5  |  .9402715   .1351977   .010584   .9370298   .6691103   1.214885
          6  |  .9666747   .1420065   .008099   .9616285   .7038868   1.245007
          7  |  .5651287   .0864522   .006201   .5617302   .3956216   .7431265
          8  |  1.354053   .2048404   .015547   1.344227   .9791096   1.761437
          9  |  .7065096   .1060773   .006573   .6999745   .5102749   .9271799
-------------+----------------------------------------------------------------
diff         |
        item |
          1  | -.5070314   .0784172   .003565   -.507922   -.671257  -.3596057
          2  | -.1467198    .117422   .003143  -.1456633  -.3895978   .0716841
          3  | -1.630259   .1900103   .013494  -1.612534  -2.033169  -1.304171
          4  |  .3273735   .1073891   .003565   .3231703   .1248782   .5492114
          5  |  1.529584   .1969554    .01549   1.507982   1.202271   1.993196
          6  |  .6325194    .115724   .005613   .6243691   .4272131   .8851649
          7  |  1.827013   .2884057   .019582    1.79828   1.349654   2.490633
          8  | -1.753744   .1939559   .014743  -1.738199  -2.211475  -1.438146
          9  | -1.384486   .2059005   .012105  -1.361195  -1.838918  -1.059687
-------------+----------------------------------------------------------------
        mu_a | -.1032615   .1148176   .003874   -.102376  -.3347816   .1277031
       var_a |  .1129835   .0356735   .001269   .1056105    .063403   .1981331
        mu_b | -.0696525   .2039387   .004949   -.072602  -.4641566   .3298393
       var_b |  .6216005   .2023137   .008293   .5843444   .3388551   1.101153
------------------------------------------------------------------------------

file sim2pl.dta saved

. estimates store est2pl

The average simulation efficiency is about 5%, but some of the parameters converge more slowly than others; for example, {diff:7.item} has the largest MCMC standard error (0.02) among the difficulty parameters. If this were a rigorous study, we would recommend longer simulations, with MCMC sample sizes of at least 50,000, to lower the MCMC standard errors.
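A minimal sketch of such a longer run is below; apart from the burnin(), mcmcsize(), and saving() options, the specification is identical to the call above, and sim2pl_long is a hypothetical filename.

set seed 14
bayesmh y = ({discrim:}*({subj:}-{diff:})), likelihood(logit) ///
        redefine(discrim:i.item) redefine(diff:i.item)        ///
        redefine(subj:i.id)                                   ///
        prior({subj:i.id},      normal(0, 1))                 ///
        prior({discrim:i.item}, lognormal({mu_a}, {var_a}))   ///
        prior({diff:i.item},    normal({mu_b}, {var_b}))      ///
        prior({mu_a} {mu_b},    normal(0, 0.1))               ///
        prior({var_a} {var_b},  igamma(10, 1))                ///
        block({mu_a mu_b var_a var_b}, split)                 ///
        init({discrim:i.item} 1)                              ///
        exclude({subj:i.id}) burnin(10000) mcmcsize(50000)    ///
        saving(sim2pl_long, replace)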

We can compare the 1PL and 2PL models by using the deviance information criterion (DIC) available with the bayesstats ic command.

. bayesstats ic est1pl est2pl, diconly

Deviance information criterion

------------------------
             |       DIC
-------------+----------
      est1pl |  8122.428
      est2pl |  8055.005
------------------------

DIC is often used in Bayesian model selection as an alternative to AIC and BIC criteria and can be easily obtained from an MCMC sample. Larger MCMC samples produce more reliable DIC estimates. Because different MCMC samples produce different sample DIC values and the sample approximation error in calculating DIC is not known, one should not rely solely on DIC when choosing a model.
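
For reference, under its usual definition, DIC combines the posterior mean deviance \(\bar{D}\), a measure of fit, with the effective number of parameters \(p_D\), a measure of complexity,

\[
{\rm DIC} = \bar{D} + p_D, \qquad p_D = \bar{D} - D(\bar{\theta})
\]

where \(D(\bar{\theta})\) is the deviance evaluated at the posterior mean of the parameters.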

Lower DIC values indicate better fit. The DIC of the 2PL model (8,055) is markedly lower than the DIC of the 1PL model (8,122), implying better fit of the 2PL model.

3PL model

The three-parameter logistic (3PL) model introduces lower asymptote parameters \(c_i\), also called guessing parameters. The probability of giving a correct response is given by

\[
P(Y_{ij}=1) = c_i + (1-c_i){\rm InvLogit}\{a_i(\theta_j-b_i)\} ,\ c_i > 0
\]

The guessing parameters may be difficult to estimate using maximum likelihood. Indeed, the irt 3pl command with the sepguessing option fails to converge, as you can verify by typing

. irt 3pl q1-q9, sepguessing

on the original dataset.

It is thus important to specify an informative prior for \(c_i\). We assume that the prior mean of the guessing parameters is about 0.1 and thus apply
\[
c_i \sim {\rm InvGamma}(10, 1)
\]
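
This choice matches the intended prior mean: an inverse-gamma distribution with shape \(\alpha\) and scale \(\beta\) has mean \(\beta/(\alpha-1)\) for \(\alpha>1\), so here

\[
E(c_i) = \frac{1}{10-1} \approx 0.11
\]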

Similarly to the discrimination and difficulty parameters, the \(c_i\)’s are introduced as random-effects parameters in the bayesmh specification and are referred to as {gues:} in the likelihood specification.

Unlike with the 1PL and 2PL models, we cannot use the likelihood(logit) option to model the probability of success because the probability of a correct response is no longer an inverse-logit transformation of the parameters. Instead, we use likelihood(binlogit(1), noglmtransform) to model the probability of success of a Bernoulli outcome directly.

To have a valid initialization of the MCMC sampler, we assign the \(c_i\)'s a positive starting value of 0.1.

. set seed 14

. bayesmh y = ({gues:}+(1-{gues:})*invlogit({discrim:}*({subj:}-{diff:}))), ///
>                 likelihood(binlogit(1), noglmtransform)                   ///
>         redefine(discrim:i.item) redefine(diff:i.item)                    ///
>         redefine(gues:i.item)    redefine(subj:i.id)                      ///
>         prior({subj:i.id},      normal(0, 1))                             ///
>         prior({discrim:i.item}, lognormal({mu_a}, {var_a}))               ///
>         prior({diff:i.item},    normal({mu_b}, {var_b}))                  ///
>         prior({gues:i.item},    igamma(10, 1))                            ///
>         prior({mu_a} {mu_b},    normal(0, 0.1))                           ///
>         prior({var_a} {var_b},  igamma(10, 1))                            ///
>         block({mu_a mu_b var_a var_b}, split)                             ///
>         init({discrim:i.item} 1 {gues:i.item} 0.1)                        ///
>         exclude({subj:i.id}) burnin(5000) saving(sim3pls, replace)
  
Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood: 
  y ~ binomial(xb_gues+(1-xb_gues)*invlogit(xb_discrim*(xb_subj-xb_diff)),1)

Priors: 
  {discrim:i.item} ~ lognormal({mu_a},{var_a})                             (1)
     {diff:i.item} ~ normal({mu_b},{var_b})                                (2)
     {gues:i.item} ~ igamma(10,1)                                          (3)
       {subj:i.id} ~ normal(0,1)                                           (4)

Hyperpriors: 
    {mu_a mu_b} ~ normal(0,0.1)
  {var_a var_b} ~ igamma(10,1)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_discrim.
(2) Parameters are elements of the linear form xb_diff.
(3) Parameters are elements of the linear form xb_gues.
(4) Parameters are elements of the linear form xb_subj.

Bayesian binomial regression                     MCMC iterations  =     15,000
Random-walk Metropolis-Hastings sampling         Burn-in          =      5,000
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      7,200
                                                 Acceptance rate  =      .3496
                                                 Efficiency:  min =      .0148
                                                              avg =     .03748
Log marginal likelihood =          .                          max =      .2044
 
------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
discrim      |
        item |
          1  |  1.712831   .2839419   .018436   1.681216   1.232644   2.351383
          2  |  .8540871   .1499645   .008265   .8414399   .6058463   1.165732
          3  |  1.094723   .1637954    .01126   1.081756    .817031   1.454845
          4  |  1.090891   .2149095   .013977   1.064651   .7488589   1.588164
          5  |  1.363236   .2525573   .014858   1.338075   .9348136   1.954695
          6  |  1.388325   .3027436   .024245   1.336303   .9466695   2.068181
          7  |  .9288217   .2678741   .021626   .8750048   .5690308   1.603375
          8  |  1.457763   .2201065    .01809   1.438027   1.068937   1.940431
          9  |  .7873631    .127779   .007447   .7796568    .563821    1.06523
-------------+----------------------------------------------------------------
diff         |
        item |
          1  | -.2933734   .0976177   .006339  -.2940499  -.4879558  -.0946848
          2  |  .2140365    .157158   .008333   .2037788  -.0553537   .5550411
          3  | -1.326351   .1981196   .013101  -1.326817  -1.706671  -.9307443
          4  |  .6367877   .1486799   .007895   .6277349   .3791045   .9509913
          5  |  1.616056   .1799378    .00966   1.606213   1.303614   2.006817
          6  |  .8354059    .124184    .00656   .8191839    .614221   1.097801
          7  |  2.066205   .3010858   .018377   2.034757   1.554484   2.709601
          8  | -1.555583   .1671435   .012265   -1.54984   -1.89487  -1.267001
          9  | -.9775626   .2477279   .016722  -.9936727  -1.431964  -.4093629
-------------+----------------------------------------------------------------
gues         |
        item |
          1  |  .1078598   .0337844     .0019   .1020673   .0581353   .1929404
          2  |  .1128113   .0372217   .002162   .1065996   .0596554   .2082417
          3  |   .123031   .0480042   .002579   .1127147   .0605462   .2516237
          4  |  .1190103   .0390721   .002369   .1123544   .0617698   .2095427
          5  |  .0829503   .0185785   .001275   .0807116   .0514752   .1232547
          6  |  .1059315   .0289175   .001708   .1022741   .0584959   .1709483
          7  |  .1235553   .0382661   .002964   .1186648   .0626495   .2067556
          8  |  .1142118   .0408348   .001733   .1062507   .0592389   .2134006
          9  |  .1270767   .0557821   .003939    .113562   .0621876   .2825752
-------------+----------------------------------------------------------------
        mu_a |   .109161   .1218499   .005504   .1126253   -.135329   .3501061
       var_a |   .108864   .0331522   .001053   .1030106   .0604834   .1860996
        mu_b |  .0782094   .1974657   .004367   .0755023  -.3067717   .4638104
       var_b |  .5829738   .1803167   .006263   .5562159   .3260449   1.034225
------------------------------------------------------------------------------

file sim3pls.dta saved

. estimates store est3pls

The estimated posterior means of the \(c_i\)’s range between 0.08 and 0.13. Clearly, the introduction of guessing parameters has an impact on the item discrimination and difficulty parameters. For example, the estimated posterior means of \(\mu_a\) and \(\mu_b\) shift from -0.10 and -0.07, respectively, for the 2PL model to 0.11 and 0.08, respectively, for the 3PL model.

Because the estimated guessing parameters are not that different, one may ask whether item-specific guessing parameters are really necessary. To answer this question, we fit a model with a common guessing parameter, {gues}, and compare it with the previous model.

. set seed 14

. bayesmh y = ({gues}+(1-{gues})*invlogit({discrim:}*({subj:}-{diff:}))), ///
>                 likelihood(binlogit(1), noglmtransform)                 ///
>         redefine(discrim:i.item) redefine(diff:i.item)                  ///
>         redefine(subj:i.id)                                             ///
>         prior({subj:i.id},      normal(0, 1))                           ///
>         prior({discrim:i.item}, lognormal({mu_a}, {var_a}))             ///
>         prior({diff:i.item},    normal({mu_b}, {var_b}))                ///
>         prior({gues},           igamma(10, 1))                          ///
>         prior({mu_a} {mu_b},    normal(0, 0.1))                         ///
>         prior({var_a} {var_b},  igamma(10, 1))                          ///
>         block({mu_a mu_b var_a var_b gues}, split)                      ///
>         init({discrim:i.item} 1 {gues} 0.1)                             ///
>         exclude({subj:i.id}) burnin(5000) saving(sim3pl, replace)
  
Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood: 
  y ~ binomial({gues}+(1-{gues})*invlogit(xb_discrim*(xb_subj-xb_diff)),1)

Priors: 
  {discrim:i.item} ~ lognormal({mu_a},{var_a})                             (1)
     {diff:i.item} ~ normal({mu_b},{var_b})                                (2)
       {subj:i.id} ~ normal(0,1)                                           (3)
            {gues} ~ igamma(10,1)

Hyperpriors: 
    {mu_a mu_b} ~ normal(0,0.1)
  {var_a var_b} ~ igamma(10,1)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_discrim.
(2) Parameters are elements of the linear form xb_diff.
(3) Parameters are elements of the linear form xb_subj.

Bayesian binomial regression                     MCMC iterations  =     15,000
Random-walk Metropolis-Hastings sampling         Burn-in          =      5,000
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      7,200
                                                 Acceptance rate  =      .3753
                                                 Efficiency:  min =     .01295
                                                              avg =     .03714
Log marginal likelihood =          .                          max =      .1874
 
------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
discrim      |
        item |
          1  |  1.692894   .2748163   .021944   1.664569   1.232347   2.299125
          2  |  .8313512   .1355267    .00606   .8218212   .5928602   1.125729
          3  |  1.058833   .1611742   .014163   1.054126   .7676045   1.393611
          4  |  1.041808   .1718472   .008782   1.029867   .7398569   1.397073
          5  |  1.534997   .3208687   .023965   1.497019   1.019998   2.266078
          6  |   1.38296   .2581948   .019265   1.355706   .9559487   1.979358
          7  |  .8310222   .1698206   .012896   .8107371   .5736484   1.248736
          8  |  1.442949   .2266268   .017562   1.431204   1.066646   1.930829
          9  |    .77944   .1159669   .007266   .7750891   .5657258   1.014941
-------------+----------------------------------------------------------------
diff         |
        item |
          1  | -.3043161   .0859905   .005373  -.2968324  -.4870583  -.1407109
          2  |  .1814508   .1289251   .006543   .1832146  -.0723988   .4313265
          3  | -1.391216   .1924384   .014986  -1.373093  -1.809343  -1.050919
          4  |  .5928491   .1262631   .006721   .5829347    .356614    .857743
          5  |  1.617348   .1929263   .011604   1.601534   1.293032   2.061096
          6  |   .817635   .1172884   .006125    .812838   .5990503   1.064322
          7  |  2.006949   .2743517    .01785   1.981052   1.556682   2.594236
          8  | -1.576235   .1747855   .013455  -1.559435  -1.952676  -1.272108
          9  | -1.039362   .1840773    .01138   -1.02785  -1.432058  -.7160181
-------------+----------------------------------------------------------------
        gues |  .1027336   .0214544   .001753   .1022211   .0627299   .1466367
        mu_a |  .1009741    .123915   .006567   .0965353  -.1343028   .3510697
       var_a |  .1121003   .0344401   .001154   .1059563   .0628117   .1970842
        mu_b |  .0632173   .1979426   .004572   .0666684  -.3292497   .4482957
       var_b |  .5861236   .1818885   .006991   .5574743   .3239369   1.053172
------------------------------------------------------------------------------

file sim3pl.dta saved

. estimates store est3pl

We can again compare the two 3PL models by using the bayesstats ic command:

. bayesstats ic est3pls est3pl, diconly

Deviance information criterion

------------------------
             |       DIC
-------------+----------
     est3pls |  8049.425
      est3pl |  8049.426
------------------------

Although the estimated DICs of the two 3PL models are essentially the same, we decide for demonstration purposes to proceed with the model with item-specific guessing parameters.

4PL model

The four-parameter logistic (4PL) model extends the 3PL model by adding item-specific upper asymptote parameters \(d_i\):
\[
P(Y_{ij}=1) = c_i + (d_i-c_i){\rm InvLogit}\{a_i(\theta_j-b_i)\}
,\ c_i < d_i < 1
\]

The \(d_i\) parameter can be viewed as an upper limit on the probability of correct response to the \(i\)th item. The probability of giving correct answers by subjects with very high ability can thus be no greater than \(d_i\). We restrict the \(d_i\)'s to the (0.8,1) range and assign them a \({\rm Uniform}(0.8,1)\) prior. For other parameters, we use the same priors as in the 3PL model. In the bayesmh specification of the model, the condition \(c_i < d_i\) is incorporated in the likelihood, and the condition \(d_i < 1\) is implied by the specified prior for the \(d_i\)'s. We initialize the \(d_i\)'s to 0.9. We use the notable option to suppress the long table output.
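
The constraint is imposed through Stata's cond() function: the likelihood expression is multiplied by cond({gues:}<{d:},1,.), which equals 1 when \(c_i < d_i\) and missing otherwise, and bayesmh treats a missing likelihood value as an invalid proposal and rejects it. A quick illustration of the mechanism:

. display cond(0.1 < 0.9, 1, .)
1

. display cond(0.95 < 0.9, 1, .)
.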

. set seed 14

. bayesmh y =                                                              ///
>       (({gues:}+({d:}-{gues:})*invlogit({discrim:}*({subj:}-{diff:})))*  ///
>         cond({gues:}<{d:},1,.)), likelihood(binlogit(1), noglmtransform) ///
>         redefine(discrim:i.item) redefine(diff:i.item)                   ///
>         redefine(gues:i.item)    redefine(d:i.item)  redefine(subj:i.id) ///
>         prior({subj:i.id},      normal(0, 1))                            ///
>         prior({discrim:i.item}, lognormal({mu_a}, {var_a}))              ///
>         prior({diff:i.item},    normal({mu_b}, {var_b}))                 ///
>         prior({gues:i.item},    igamma(10, 1))                           ///
>         prior({d:i.item},       uniform(0.8, 1))                         ///
>         prior({mu_a} {mu_b},    normal(0, 0.1))                          ///
>         prior({var_a} {var_b},  igamma(10, 1))                           ///
>         block({mu_a mu_b var_a var_b}, split)                            ///
>         init({discrim:i.item} 1 {gues:i.item} 0.1 {d:i.item} 0.9)        ///
>         exclude({subj:i.id}) burnin(5000) saving(sim4pls, replace) notable
  
Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood: 
  y ~ binomial(<expr1>,1)

Priors: 
  {discrim:i.item} ~ lognormal({mu_a},{var_a})                             (1)
     {diff:i.item} ~ normal({mu_b},{var_b})                                (2)
     {gues:i.item} ~ igamma(10,1)                                          (3)
        {d:i.item} ~ uniform(0.8,1)                                        (4)
       {subj:i.id} ~ normal(0,1)                                           (5)

Hyperpriors: 
    {mu_a mu_b} ~ normal(0,0.1)
  {var_a var_b} ~ igamma(10,1)

Expression: 
  expr1 : (xb_gues+(xb_d-xb_gues)*invlogit(xb_discrim*(xb_subj-xb_diff)))* con
          d(xb_gues<xb_d,1,.)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_discrim.
(2) Parameters are elements of the linear form xb_diff.
(3) Parameters are elements of the linear form xb_gues.
(4) Parameters are elements of the linear form xb_d.
(5) Parameters are elements of the linear form xb_subj.

Bayesian binomial regression                     MCMC iterations  =     15,000
Random-walk Metropolis-Hastings sampling         Burn-in          =      5,000
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      7,200
                                                 Acceptance rate  =      .3639
                                                 Efficiency:  min =    .004825
                                                              avg =      .0219
Log marginal likelihood =          .                          max =      .1203
Note: There is a high autocorrelation after 500 lags.

file sim4pls.dta saved

. estimates store est4pls
 

We use bayesstats summary to display results of selected model parameters.

. bayesstats summary {d:i.item} {mu_a var_a mu_b var_b}

Posterior summary statistics                      MCMC sample size =    10,000
 
------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
d            |
        item |
          1  |  .9598183   .0255321   .001948   .9621874   .9044441   .9981723
          2  |  .9024564   .0565702   .007407   .9019505   .8066354   .9944216
          3  |  .9525519   .0281878   .002845   .9551054   .8972454   .9971564
          4  |  .8887963   .0561697   .005793   .8859503   .8036236   .9916784
          5  |  .8815547   .0588907   .007215   .8708021   .8031737   .9926549
          6  |  .8891188   .0586482   .006891    .881882   .8024593   .9935512
          7  |   .874271   .0561718   .008087   .8635082   .8018176   .9880433
          8  |  .9663644   .0147606   .001121   .9667563   .9370666   .9950912
          9  |   .889164   .0486038   .005524   .8834207   .8084921   .9857415
-------------+----------------------------------------------------------------
        mu_a |  .3336887   .1436216   .009742    .334092   .0562924   .6164115
       var_a |  .1221547   .0406908   .002376   .1144729   .0642768   .2229326
        mu_b | -.0407488   .1958039   .005645  -.0398847  -.4220523   .3323791
       var_b |  .4991736   .1612246    .00629   .4660071   .2802531   .9023824
------------------------------------------------------------------------------

The bayesmh command issued a note indicating high autocorrelation for some of the model parameters. This may be related to slow MCMC convergence or to more substantial problems in the model specification. It is thus worthwhile to inspect the autocorrelation of the individual parameters. We can do so by using the bayesstats ess command. Parameters with a lower effective sample size (ESS) have higher autocorrelation, and vice versa.

. bayesstats ess {d:i.item} {mu_a var_a mu_b var_b}

Efficiency summaries    MCMC sample size =    10,000
 
----------------------------------------------------
             |        ESS   Corr. time    Efficiency
-------------+--------------------------------------
d            |
        item |
          1  |     171.82        58.20        0.0172
          2  |      58.33       171.43        0.0058
          3  |      98.17       101.87        0.0098
          4  |      94.02       106.36        0.0094
          5  |      66.62       150.11        0.0067
          6  |      72.44       138.05        0.0072
          7  |      48.25       207.26        0.0048
          8  |     173.30        57.70        0.0173
          9  |      77.41       129.19        0.0077
-------------+--------------------------------------
        mu_a |     217.35        46.01        0.0217
       var_a |     293.34        34.09        0.0293
        mu_b |    1203.20         8.31        0.1203
       var_b |     656.92        15.22        0.0657
----------------------------------------------------

We observe that the parameters with ESS lower than 200 are all among the upper asymptote parameters \(d_i\). This may be caused, for example, by overparameterization of the likelihood model and subsequent nonidentifiability, which is not resolved by the specified priors.

We can also fit a model with a common upper asymptote parameter, \(d\), and compare it with the model with the item-specific upper asymptote.

. set seed 14

. bayesmh y =                                                              ///
>         (({gues:}+({d}-{gues:})*invlogit({discrim:}*({subj:}-{diff:})))* ///
>         cond({gues:}<{d},1,.)), likelihood(binlogit(1), noglmtransform)  ///
>         redefine(discrim:i.item) redefine(diff:i.item)                   ///
>         redefine(gues:i.item)    redefine(subj:i.id)                     ///
>         prior({subj:i.id},      normal(0, 1))                            ///
>         prior({discrim:i.item}, lognormal({mu_a}, {var_a}))              ///
>         prior({diff:i.item},    normal({mu_b}, {var_b}))                 ///
>         prior({gues:i.item},    igamma(10, 1))                           ///
>         prior({d},              uniform(0.8, 1))                         ///
>         prior({mu_a} {mu_b},    normal(0, 0.1))                          ///
>         prior({var_a} {var_b},  igamma(10, 1))                           ///
>         block({mu_a mu_b var_a var_b d}, split)                          ///
>         init({discrim:i.item} 1 {gues:i.item} 0.1 {d} 0.9)               ///
>         exclude({subj:i.id}) burnin(5000) saving(sim4pl, replace) notable
  
Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood: 
  y ~ binomial(<expr1>,1)

Priors: 
  {discrim:i.item} ~ lognormal({mu_a},{var_a})                             (1)
     {diff:i.item} ~ normal({mu_b},{var_b})                                (2)
     {gues:i.item} ~ igamma(10,1)                                          (3)
       {subj:i.id} ~ normal(0,1)                                           (4)
               {d} ~ uniform(0.8,1)

Hyperpriors: 
    {mu_a mu_b} ~ normal(0,0.1)
  {var_a var_b} ~ igamma(10,1)

Expression: 
  expr1 : (xb_gues+({d}-xb_gues)*invlogit(xb_discrim*(xb_subj-xb_diff)))* cond
          (xb_gues<{d},1,.)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_discrim.
(2) Parameters are elements of the linear form xb_diff.
(3) Parameters are elements of the linear form xb_gues.
(4) Parameters are elements of the linear form xb_subj.

Bayesian binomial regression                     MCMC iterations  =     15,000
Random-walk Metropolis-Hastings sampling         Burn-in          =      5,000
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      7,200
                                                 Acceptance rate  =      .3877
                                                 Efficiency:  min =      .0107
                                                              avg =     .03047
Log marginal likelihood =          .                          max =      .1626

file sim4pl.dta saved

. estimates store est4pl

. bayesstats summary {d mu_a var_a mu_b var_b}

Posterior summary statistics                      MCMC sample size =    10,000
 
------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
           d |  .9664578   .0144952   .001293   .9668207   .9371181   .9924572
        mu_a |  .2206696   .1387873    .01113   .2208302  -.0483587   .4952625
       var_a |  .1245785   .0391551   .001806   .1188779   .0658243   .2187058
        mu_b |  .0371722   .2020157    .00501   .0331742  -.3481366   .4336587
       var_b |  .5603447   .1761812   .006817   .5279243   .3157048   .9805077
------------------------------------------------------------------------------

We now compare the two 4PL models by using the bayesstats ic command:

. bayesstats ic est4pls est4pl, diconly

Deviance information criterion

------------------------
             |       DIC
-------------+----------
     est4pls |  8050.805
      est4pl |  8037.075
------------------------

The DIC of the more complex 4PL model (8,051) is substantially higher than the DIC of the simpler model (8,037). This and the potential nonidentifiability of the more complex est4pls model, indicated by high autocorrelation in the simulated MCMC sample, compel us to proceed with the model with a common upper asymptote, est4pl.

The posterior distribution of \(d\) has an estimated 95% equal-tailed credible interval of (0.93, 0.99) and is concentrated about 0.97. The \({\rm Uniform}(0.8,1)\) prior on \(d\) does not seem to be too restrictive. The estimated DIC of the est4pl model (8,037) is lower than the DIC of the est3pls 3PL model from the previous section (8,049), implying that the introduction of the upper asymptote parameter \(d\) does improve the model fit.
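
This cross-model comparison can be obtained directly from the stored estimates; the command is shown without output because the relevant DIC values were already quoted above:

. bayesstats ic est3pls est4pl, diconly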

5PL model

The five-parameter logistic (5PL) model extends the 4PL model by adding item-specific asymmetry parameters \(e_i\):
\[
P(Y_{ij}=1) = c_i + (d_i-c_i)\big[{\rm InvLogit}\{a_i(\theta_j-b_i)\}\big]^{e_i}
,\ c_i < d_i < 1, \ 0 < e_i < 1
\]

In the previous section, we found the 4PL model with common upper asymptote \(d\), est4pl, to be the best one so far. We thus consider here a 5PL model with a common upper asymptote \(d\).

Typically, we expect the \(e_i\) parameters to be close to 1. Similarly to the upper asymptote parameter \(d\), the \(e_i\) parameters are assumed to be in the (0.8,1) range and are assigned a \({\rm Uniform}(0.8,1)\) prior. We initialize the \(e_i\)'s to 0.9. We again use the notable option to suppress the long table output, and we display a subset of results by using bayesstats summary. (We could have used bayesmh's noshow() option instead to achieve the same result.)

. set seed 14

. bayesmh y =                                                               ///
>   (({gues:}+({d}-{gues:})*(invlogit({discrim:}*({subj:}-{diff:})))^{e:})* ///
>         cond({gues:}<{d},1,.)), likelihood(binlogit(1), noglmtransform)   ///
>         redefine(discrim:i.item) redefine(diff:i.item)                    ///
>         redefine(gues:i.item)    redefine(e:i.item)  redefine(subj:i.id)  ///
>         prior({subj:i.id},      normal(0, 1))                             ///
>         prior({discrim:i.item}, lognormal({mu_a}, {var_a}))               ///
>         prior({diff:i.item},    normal({mu_b}, {var_b}))                  ///
>         prior({gues:i.item},    igamma(10, 1))                            ///
>         prior({d},              uniform(0.8, 1))                          ///
>         prior({e:i.item},       uniform(0.8, 1))                          ///
>         prior({mu_a} {mu_b},    normal(0, 0.1))                           ///
>         prior({var_a} {var_b},  igamma(10, 1))                            ///
>         block({mu_a mu_b var_a var_b d}, split)                           ///
>         init({discrim:i.item} 1 {gues:i.item} 0.1 {d} {e:i.item} 0.9)     ///
>         exclude({subj:i.id}) burnin(5000) saving(sim5pls, replace) notable
  
Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood: 
  y ~ binomial(<expr1>,1)

Priors: 
  {discrim:i.item} ~ lognormal({mu_a},{var_a})                             (1)
     {diff:i.item} ~ normal({mu_b},{var_b})                                (2)
     {gues:i.item} ~ igamma(10,1)                                          (3)
        {e:i.item} ~ uniform(0.8,1)                                        (4)
       {subj:i.id} ~ normal(0,1)                                           (5)
               {d} ~ uniform(0.8,1)

Hyperpriors: 
    {mu_a mu_b} ~ normal(0,0.1)
  {var_a var_b} ~ igamma(10,1)

Expression: 
  expr1 : (xb_gues+({d}-xb_gues)*(invlogit(xb_discrim*(xb_subj-xb_diff)))^xb_e
          )* cond(xb_gues<{d},1,.)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_discrim.
(2) Parameters are elements of the linear form xb_diff.
(3) Parameters are elements of the linear form xb_gues.
(4) Parameters are elements of the linear form xb_e.
(5) Parameters are elements of the linear form xb_subj.

Bayesian binomial regression                     MCMC iterations  =     15,000
Random-walk Metropolis-Hastings sampling         Burn-in          =      5,000
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      7,200
                                                 Acceptance rate  =      .3708
                                                 Efficiency:  min =    .007341
                                                              avg =     .02526
Log marginal likelihood =          .                          max =      .1517

file sim5pls.dta saved

. estimates store est5pls

. bayesstats summary {e:i.item} {d mu_a var_a mu_b var_b}

Posterior summary statistics                      MCMC sample size =    10,000
 
------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
e            |
        item |
          1  |   .897859   .0578428   .006083   .8939272   .8050315   .9957951
          2  |  .9042669   .0585023   .005822     .90525   .8053789   .9956565
          3  |    .88993   .0562398   .005013    .887011    .803389   .9930454
          4  |  .9010241   .0574186   .006492   .9042044   .8030981   .9925598
          5  |  .9126369   .0545625    .00521   .9178927   .8098596   .9964487
          6  |  .9037269   .0583833   .006814   .9086704   .8054932   .9961268
          7  |  .9136308   .0558911   .005373   .9203899   .8112029    .996217
          8  |   .889775   .0568656   .005119   .8849938    .803912   .9938777
          9  |  .8808435    .056257   .004743   .8727194   .8030522   .9904972
-------------+----------------------------------------------------------------
           d |  .9671374   .0144004   .001165   .9670598   .9382404   .9933374
        mu_a |  .2770211   .1353777    .00832   .2782552   .0141125   .5418087
       var_a |   .122635   .0404159   .002148   .1160322   .0666951   .2208711
        mu_b |  .1211885   .1929743   .004955   .1199136  -.2515431    .503733
       var_b |  .5407642   .1747674   .006353   .5088269   .3016315   .9590086
------------------------------------------------------------------------------

We also want to compare the above model with a simpler one using a common asymmetry parameter \(e\).

. set seed 14

. bayesmh y =                                                               ///
>    (({gues:}+({d}-{gues:})*(invlogit({discrim:}*({subj:}-{diff:})))^{e})* ///
>         cond({gues:}<{d},1,.)), likelihood(binlogit(1), noglmtransform)   ///
>         redefine(discrim:i.item) redefine(diff:i.item)                    ///
>         redefine(gues:i.item)    redefine(subj:i.id)                      ///
>         prior({subj:i.id},      normal(0, 1))                             ///
>         prior({discrim:i.item}, lognormal({mu_a}, {var_a}))               ///
>         prior({diff:i.item},    normal({mu_b}, {var_b}))                  ///
>         prior({gues:i.item},    igamma(10, 1))                            ///
>         prior({d} {e},          uniform(0.8, 1))                          ///
>         prior({mu_a} {mu_b},    normal(0, 0.1))                           ///
>         prior({var_a} {var_b},  igamma(10, 1))                            ///
>         block({mu_a mu_b var_a var_b d e}, split)                         ///
>         init({discrim:i.item} 1 {gues:i.item} 0.1 {d e} 0.9)              ///
>         exclude({subj:i.id}) burnin(5000) saving(sim5pl, replace) notable
  
Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood: 
  y ~ binomial(<expr1>,1)

Priors: 
  {discrim:i.item} ~ lognormal({mu_a},{var_a})                             (1)
     {diff:i.item} ~ normal({mu_b},{var_b})                                (2)
     {gues:i.item} ~ igamma(10,1)                                          (3)
       {subj:i.id} ~ normal(0,1)                                           (4)
             {d e} ~ uniform(0.8,1)

Hyperpriors: 
    {mu_a mu_b} ~ normal(0,0.1)
  {var_a var_b} ~ igamma(10,1)

Expression: 
  expr1 : (xb_gues+({d}-xb_gues)*(invlogit(xb_discrim*(xb_subj-xb_diff)))^{e})
          * cond(xb_gues<{d},1,.)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_discrim.
(2) Parameters are elements of the linear form xb_diff.
(3) Parameters are elements of the linear form xb_gues.
(4) Parameters are elements of the linear form xb_subj.

Bayesian binomial regression                     MCMC iterations  =     15,000
Random-walk Metropolis-Hastings sampling         Burn-in          =      5,000
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      7,200
                                                 Acceptance rate  =      .3805
                                                 Efficiency:  min =    .008179
                                                              avg =     .02768
Log marginal likelihood =          .                          max =     .08904

file sim5pl.dta saved

. estimates store est5pl

. bayesstats summary {e d mu_a var_a mu_b var_b}

Posterior summary statistics                      MCMC sample size =    10,000
 
------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
           e |  .9118363   .0558178   .004194   .9175841   .8063153   .9960286
           d |  .9655166   .0147373   .001495   .9659029   .9354708   .9924492
        mu_a |  .2674271   .1368926   .008485    .270597   .0102798   .5443345
       var_a |  .1250759   .0428095   .002635   .1173619   .0654135   .2340525
        mu_b |  .1015121   .2048178   .006864    .103268  -.3052377   .4934158
       var_b |  .5677309   .1824591   .006981   .5331636   .3079868   1.016762
------------------------------------------------------------------------------

We use bayesstats ic to compare the DIC values of the two 5PL models:

. bayesstats ic est5pls est5pl, diconly

Deviance information criterion

------------------------
             |       DIC
-------------+----------
     est5pls |  8030.894
      est5pl |  8034.517
------------------------

The estimated DIC of the more complex est5pls model (8,031) is lower than the DIC of the simpler model (8,035), suggesting a better fit.

Conclusion

Finally, we compare all eight fitted models.

. bayesstats ic est1pl est2pl est3pl est3pls est4pl est4pls est5pl est5pls, ///
>         diconly

Deviance information criterion

------------------------
             |       DIC
-------------+----------
      est1pl |  8122.428
      est2pl |  8055.005
      est3pl |  8049.426
     est3pls |  8049.425
      est4pl |  8037.075
     est4pls |  8050.805
      est5pl |  8034.517
     est5pls |  8030.894
------------------------

The est5pls model has the lowest overall DIC. To confirm this result, we run another set of simulations with a larger MCMC sample size of 50,000. (We simply added the mcmcsize(50000) option to the bayesmh specification of the above eight models.) The following DIC values, based on the larger MCMC sample size, are more reliably estimated.
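
For example, under this approach the 3PL model with a common guessing parameter is refit with exactly the same specification as before, only with the mcmcsize(50000) option added (a sketch; output omitted):

. set seed 14

. bayesmh y = ({gues}+(1-{gues})*invlogit({discrim:}*({subj:}-{diff:}))), ///
>                 likelihood(binlogit(1), noglmtransform)                 ///
>         redefine(discrim:i.item) redefine(diff:i.item)                  ///
>         redefine(subj:i.id)                                             ///
>         prior({subj:i.id},      normal(0, 1))                           ///
>         prior({discrim:i.item}, lognormal({mu_a}, {var_a}))             ///
>         prior({diff:i.item},    normal({mu_b}, {var_b}))                ///
>         prior({gues},           igamma(10, 1))                          ///
>         prior({mu_a} {mu_b},    normal(0, 0.1))                         ///
>         prior({var_a} {var_b},  igamma(10, 1))                          ///
>         block({mu_a mu_b var_a var_b gues}, split)                      ///
>         init({discrim:i.item} 1 {gues} 0.1)                             ///
>         exclude({subj:i.id}) burnin(5000) mcmcsize(50000)               ///
>         saving(sim3pl, replace)

. estimates store est3pl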

. bayesstats ic est1pl est2pl est3pl est3pls est4pl est4pls est5pl est5pls, ///
>         diconly

Deviance information criterion

------------------------
             |       DIC
-------------+----------
      est1pl |  8124.015
      est2pl |  8052.068
      est3pl |  8047.067
     est3pls |  8047.738
      est4pl |  8032.417
     est4pls |  8049.712
      est5pl |  8031.375
     est5pls |  8031.905
------------------------

Again, the 5PL models have the lowest DIC values and seem to provide the best fit. However, the DIC differences between models est4pl, est5pl, and est5pls are minimal and may very well be within the estimation error. Regardless, these three models appear to be better than the simpler 1PL, 2PL, and 3PL models.

Additional model checking may be needed to assess the models' fit, and we should not rely solely on the DIC values to make our final model selection. A practitioner may still prefer the simpler est4pl 4PL model to the 5PL models even though it has a slightly higher DIC. In fact, given that the posterior mean estimate of the upper asymptote parameter \(d\) is 0.96 with a 95% equal-tailed credible interval of (0.94, 0.99), some practitioners may prefer the even simpler est3pl 3PL model.

Programming an estimation command in Stata: A map to posted entries

I have posted 20 entries to the Stata blog about programming an estimation command in Stata. They are best read in order. The comprehensive list below allows you to read them from first to last, at your own pace.

  1. Programming estimators in Stata: Why you should

    To help you write Stata commands that people want to use, I illustrate how Stata syntax is predictable and give an overview of the estimation-postestimation structure that you will want to emulate in your programs.

  2. Programming an estimation command in Stata: Where to store your stuff

    I discuss the difference between scripts and commands, and I introduce some essential programming concepts and constructions that I use to write the scripts and commands.

  3. Programming an estimation command in Stata: Global macros versus local macros

    I discuss a pair of examples that illustrate the differences between global macros and local macros.

  4. Programming an estimation command in Stata: A first ado-command

    I discuss the code for a simple estimation command to focus on the details of how to implement an estimation command. The command that I discuss estimates the mean by the sample average. I begin by reviewing the formulas and a do-file that implements them. I subsequently introduce ado-file programming and discuss two versions of the command. Along the way, I illustrate some of the postestimation features that work after the command.

  5. Programming an estimation command in Stata: Using Stata matrix commands and functions to compute OLS objects

    I present the formulas for computing the ordinary least-squares (OLS) estimator, and I discuss some do-file implementations of them. I discuss the formulas and the computation of independence-based standard errors, robust standard errors, and cluster-robust standard errors. I introduce the Stata matrix commands and matrix functions that I use in ado-commands that I discuss in upcoming posts.

  6. Programming an estimation command in Stata: A first command for OLS

    I show how to write a Stata estimation command that implements the OLS estimator by explaining the code.

  7. Programming an estimation command in Stata: A better OLS command

    I use the syntax command to improve the command that implements the OLS estimator that I discussed in Programming an estimation command in Stata: A first command for OLS. I show how to require that all variables be numeric variables and how to make the command accept time-series operated variables.

  8. Programming an estimation command in Stata: Allowing for sample restrictions and factor variables

    I modify the OLS command discussed in Programming an estimation command in Stata: A better OLS command to allow for sample restrictions, to handle missing values, to allow for factor variables, and to deal with perfectly collinear variables.

  9. Programming an estimation command in Stata: Allowing for options

    I make three improvements to the command that implements the OLS estimator that I discussed in Programming an estimation command in Stata: Allowing for sample restrictions and factor variables. First, I allow the user to request a robust estimator of the variance-covariance of the estimator. Second, I allow the user to suppress the constant term. Third, I store the residual degrees of freedom in e(df_r) so that test will use the t or F distribution instead of the normal or chi-squared distribution to compute the p-value of Wald tests.

  10. Programming an estimation command in Stata: Using a subroutine to parse a complex option

    I make two improvements to the command that implements the OLS estimator that I discussed in Programming an estimation command in Stata: Allowing for options. First, I add an option for a cluster-robust estimator of the variance-covariance of the estimator (VCE). Second, I make the command accept the modern syntax for either a robust or a cluster-robust estimator of the VCE. In the process, I use subroutines in my ado-program to facilitate the parsing, and I discuss some advanced parsing tricks.

  11. Programming an estimation command in Stata: Mata 101

    I introduce Mata, the matrix programming language that is part of Stata.

  12. Programming an estimation command in Stata: Mata functions

    I show how to write a function in Mata, the matrix programming language that is part of Stata.

  13. Programming an estimation command in Stata: A first ado-command using Mata

    I discuss a sequence of ado-commands that use Mata to estimate the mean of a variable. The commands illustrate a general structure for Stata/Mata programs.

  14. Programming an estimation command in Stata: Computing OLS objects in Mata

    I present the formulas for computing the OLS estimator and show how to compute them in Mata. This post is a Mata version of Programming an estimation command in Stata: Using Stata matrix commands and functions to compute OLS objects. I discuss the formulas and the computation of independence-based standard errors, robust standard errors, and cluster-robust standard errors.

  15. Programming an estimation command in Stata: An OLS command using Mata

    I discuss a command that computes OLS results in Mata, paying special attention to the structure of Stata programs that use Mata work functions.

  16. Programming an estimation command in Stata: Adding robust and cluster-robust VCEs to our Mata-based OLS command

    I show how to use the undocumented command _vce_parse to parse the options for robust or cluster-robust estimators of the VCE. I then discuss myregress12.ado, which performs its computations in Mata and computes an IID-based, a robust, or a cluster-robust estimator of the VCE.

  17. Programming an estimation command in Stata: A review of nonlinear optimization using Mata

    I review the theory behind nonlinear optimization and get some practice in Mata programming by implementing an optimizer in Mata. This post is designed to help you develop your Mata programming skills and to improve your understanding of how the Mata optimization suites optimize() and moptimize() work.

  18. Programming an estimation command in Stata: Using optimize() to estimate Poisson parameters

    I show how to use optimize() in Mata to maximize a Poisson log-likelihood function and to obtain estimators of the VCE based on IID observations or on robust methods.

  19. Programming an estimation command in Stata: A poisson command using Mata

    I discuss mypoisson1, which computes Poisson-regression results in Mata. The code in mypoisson1.ado is remarkably similar to the code in myregress11.ado, which computes OLS results in Mata, as I discussed in Programming an estimation command in Stata: An OLS command using Mata.

  20. Programming an estimation command in Stata: Handling factor variables in optimize()

    I discuss a method for handling factor variables when performing nonlinear optimization using optimize(). After illustrating the issue caused by factor variables, I present a method and apply it to an example using optimize().

regress, probit, or logit?


In a previous post I illustrated that the probit model and the logit model produce statistically equivalent estimates of marginal effects. In this post, I compare the marginal effect estimates from a linear probability model (linear regression) with marginal effect estimates from probit and logit models.

My simulations show that when the true model is a probit or a logit, using a linear probability model can produce inconsistent estimates of the marginal effects of interest to researchers. The conclusions hinge on the probit or logit model being the true model.

Simulation results

For all simulations below, I use a sample size of 10,000 and 5,000 replications. The true data-generating processes (DGPs) are constructed using one discrete covariate and one continuous covariate. I study the average effect of a change in the continuous variable on the conditional probability (AME) and the average effect of a change in the discrete covariate on the conditional probability (ATE). I also look at the effect of a change in the continuous variable on the conditional probability, evaluated at the mean value of the covariates (MEM), and the effect of a change in the discrete covariate on the conditional probability, evaluated at the mean value of the covariates (TEM).
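
In Stata, these four quantities correspond to margins, dydx() with and without the atmeans option. A minimal sketch for the logit case (x1 and x2 are the covariate names used in the simulation code later in this post):

logit y x1 i.x2, vce(robust)
margins, dydx(*)             // AME of x1 and ATE of x2
margins, dydx(*) atmeans     // MEM of x1 and TEM of x2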

In Table 1, I present the results of a simulation when the true DGP satisfies the assumptions of a logit model. I show the average of the AME and ATE estimates and the rejection rate of the true null hypothesis at the 5% level. I also provide approximate true values of the AME and ATE, which I obtain by computing the ATE and AME, at the true values of the coefficients, using a sample of 20 million observations. I will provide more details on the simulation in a later section.

Table 1: Average Marginal and Treatment Effects: True DGP Logit

Simulation Results for N=10,000 and 5,000 Replications
Statistic             Approximate True Value    Logit    Regress (LPM)
AME of x1                       -.084           -.084        -.094
5% Rejection Rate                                .050         .99
ATE of x2                        .092            .091         .091
5% Rejection Rate                                .058         .058

From Table 1, we see that the logit model estimates are close to the true value and that the rejection rate of the true null hypothesis is close to 5%. For the linear probability model, the rejection rate is 99% for the AME. For the ATE, the rejection rate and point estimates are close to what is estimated using a logit.

For the MEM and TEM, we have the following:

Table 2: Marginal and Treatment Effects at Mean Values: True DGP Logit

Simulation Results for N=10,000 and 5,000 Replications
Statistic             Approximate True Value    Logit    Regress (LPM)
MEM of x1                       -.099           -.099        -.094
5% Rejection Rate                                .054         .618
TEM of x2                        .109            .109         .092
5% Rejection Rate                                .062         .073

Again, the logit estimates behave as expected. For the linear probability model, the rejection rate of the true null hypothesis is 62% for the MEM. For the TEM, the rejection rate is 7.3%, and the estimated effect is smaller than the true effect.

For the AME and ATE, when the true DGP is a probit, we have the following:

Table 3: Average Marginal and Treatment Effects: True DGP Probit

Simulation Results for N=10,000 and 5,000 Replications
Statistic             Approximate True Value    Probit   Regress (LPM)
AME of x1                       -.094           -.094        -.121
5% Rejection Rate                                .047        1
ATE of x2                        .111            .111         .111
5% Rejection Rate                                .065         .061

The probit model estimates are close to the true value, and the rejection rate of the true null hypothesis is close to 5%. For the linear probability model, the rejection rate is 100% for the AME. For the ATE, the rejection rate and point estimates are close to what is estimated using a probit.

For the MEM and TEM, we have the following:

Table 4: Marginal and Treatment Effects at Mean Values: True DGP Probit

Simulation Results for N=10,000 and 5,000 Replications
Statistic             Approximate True Value    Probit   Regress (LPM)
MEM of x1                       -.121           -.122        -.121
5% Rejection Rate                                .063         .054
TEM of x2                        .150            .150         .110
5% Rejection Rate                                .059         .158

For the MEM, the probit and linear probability model produce reliable inference. For the TEM, the probit marginal effects behave as expected, but the linear probability model has a rejection rate of 16%, and the point estimates are not close to the true value.

Simulation design

Below is the code I used to generate the data for my simulations. In the first part (see the // 1. comment in the code), I generate outcome variables that satisfy the assumptions of the logit model, y, and the probit model, yp. In the second part (// 2.), I compute the marginal effects for the logit and probit models. I have a continuous and a discrete covariate. For the discrete covariate, the marginal effect is a treatment effect. In the third part (// 3.), I compute the marginal effects evaluated at the means. I will use these estimates later to compute approximations to the true values of the effects.

program define mkdata
    syntax, [n(integer 1000)]
    clear
    quietly set obs `n'
    // 1. Generating data from probit, logit, and misspecified 
    generate x1    = rchi2(2)-2
    generate x2    = rbeta(4,2)>.2
    generate u     = runiform()
    generate e     = ln(u) -ln(1-u) 
    generate ep    = rnormal()
    generate xb    = .5*(1 - x1 + x2)
    generate y     =  xb + e > 0
    generate yp    = xb + ep > 0 
    // 2. Computing probit & logit marginal and treatment effects 
    generate m1   = exp(xb)*(-.5)/(1+exp(xb))^2
    generate m2   = exp(1 -.5*x1)/(1+ exp(1 -.5*x1 )) - ///
	              exp(.5 -.5*x1)/(1+ exp(.5 -.5*x1 ))
    generate m1p  = normalden(xb)*(-.5)
    generate m2p  = normal(1 -.5*x1 ) - normal(.5 -.5*x1)
    // 3. Computing marginal and treatment effects at means
    quietly mean x1 x2 
    matrix A        = r(table)
    scalar a        = .5 -.5*A[1,1] + .5*A[1,2]
    scalar b1       =  1 -.5*A[1,1]
    scalar b0       = .5 -.5*A[1,1]
    generate mean1  = exp(a)*(-.5)/(1+exp(a))^2
    generate mean2  = exp(b1)/(1+ exp(b1)) - exp(b0)/(1+ exp(b0))
    generate mean1p = normalden(a)*(-.5)
    generate mean2p = normal(b1) - normal(b0)
end

I approximate the true marginal effects using a sample of 20 million observations. This is a reasonable strategy in this case. For example, take the average marginal effect for a continuous covariate, \(x_{k}\), in the case of the probit model:

\begin{equation*}
\frac{1}{N}\sum_{i=1}^N \phi\left({\bf x}_{i}{\boldsymbol \beta}\right)\beta_{k}
\end{equation*}

The expression above is an approximation of \(E\left(\phi\left({\bf x}_{i}{\boldsymbol \beta}\right)\beta_{k}\right)\). To obtain this expected value, we would need to integrate over the distribution of all the covariates. This is not practical and would limit my choice of covariates. Instead, I draw a sample of 20 million observations, compute \(\frac{1}{N}\sum_{i=1}^N \phi\left({\bf x}_{i}{\boldsymbol \beta}\right)\beta_{k}\), and take it to be the true value. I follow the same logic for the other marginal effects.

Below is the code I use to compute the approximate true marginal effects. I draw the 20 million observations, compute the averages that I will use in my simulation, and create locals for each approximate true value.
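
The locals `L', `N', and `R' that appear below and in the simulation loop are assumed to have been set beforehand; a minimal sketch consistent with the sample sizes stated above:

local L = 20000000    // observations used to approximate the true effects
local N = 10000       // sample size per replication
local R = 5000        // number of replications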

. mkdata, n(`L')
(2 missing values generated)

. local values "m1 m2 mean1 mean2 m1p m2p mean1p mean2p"

. local means  "mx1 mx2 meanx1 meanx2 mx1p mx2p meanx1p meanx2p"

. local n : word count `values'

. 
. forvalues i= 1/`n' {
  2.         local a: word `i' of `values'
  3.         local b: word `i' of `means'
  4.         sum `a', meanonly
  5.         local `b' = r(mean)
  6. }

Now, I am ready to run all the simulations that I used to produce the results in the previous section. The code that I used for the simulations for the TEM and the MEM when the true DGP is a logit is given by:

. postfile lpm y1l y1l_r y1lp y1lp_r y2l y2l_r y2lp y2lp_r ///
>                 using simslpm, replace 

. forvalues i=1/`R' {
  2.         quietly {
  3.                 mkdata, n(`N')
  4.                 logit  y x1 i.x2, vce(robust) 
  5.                 margins, dydx(*) atmeans post  vce(unconditional)
  6.                 local y1l = _b[x1]
  7.                 test _b[x1] = `meanx1'
  8.                 local y1l_r   = (r(p)<.05) 
  9.                 local y2l = _b[1.x2]
 10.                 test _b[1.x2] = `meanx2'
 11.                 local y2l_r   = (r(p)<.05) 
 12.                 regress  y x1 i.x2, vce(robust) 
 13.                 margins, dydx(*) atmeans post  vce(unconditional)
 14.                 local y1lp = _b[x1]
 15.                 test _b[x1] = `meanx1'
 16.                 local y1lp_r   = (r(p)<.05) 
 17.                 local y2lp = _b[1.x2]
 18.                 test _b[1.x2] = `meanx2'
 19.                 local y2lp_r   = (r(p)<.05) 
 20.                 post lpm (`y1l') (`y1l_r') (`y1lp') (`y1lp_r') ///
>                          (`y2l') (`y2l_r') (`y2lp') (`y2lp_r')
 21.         }
 22. }

. postclose lpm

. use simslpm, clear 

. sum 

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         y1l |      5,000   -.0985646      .00288  -.1083639  -.0889075
       y1l_r |      5,000       .0544     .226828          0          1
        y1lp |      5,000   -.0939211    .0020038  -.1008612  -.0868043
      y1lp_r |      5,000       .6182    .4858765          0          1
         y2l |      5,000    .1084959     .065586  -.1065291   .3743112
-------------+---------------------------------------------------------
       y2l_r |      5,000       .0618     .240816          0          1
        y2lp |      5,000    .0915894     .055462  -.0975456   .3184061
      y2lp_r |      5,000       .0732    .2604906          0          1

For the results for the AME and the ATE when the true DGP is a logit, I use margins without the atmeans option. The other cases are similar. I use robust standard errors for all computations because my likelihood model is an approximation to the true likelihood, and I use the option vce(unconditional) to account for the fact that I am using two-step M-estimation. See Wooldridge (2010) for more details on two-step M-estimation.
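
As a sketch, the main change relative to the loop above for the AME and ATE results is in the margins calls, which drop atmeans (the test statements would then compare against the corresponding AME and ATE true values):

margins, dydx(*) post vce(unconditional)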

You can obtain the code used to produce these results here.

Conclusion

Using a probit or a logit model yields equivalent marginal effects. I provide evidence that the same cannot be said of the marginal effect estimates of the linear probability model when compared with those of the logit and probit models.

Acknowledgment

This post was inspired by a question posed by Stephen Jenkins after my previous post.

Reference

Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, Massachusetts: MIT Press.