Home > Programming > Programming an estimation command in Stata: Adding analytical derivatives to a poisson command using Mata

## Programming an estimation command in Stata: Adding analytical derivatives to a poisson command using Mata

$$\newcommand{\xb}{{\bf x}} \newcommand{\betab}{\boldsymbol{\beta}}$$Using analytically computed derivatives can greatly reduce the time required to solve a nonlinear estimation problem. I show how to use analytically computed derivatives with optimize(), and I discuss mypoisson4.ado, which uses these analytically computed derivatives. Only a few lines of mypoisson4.ado differ from the code for mypoisson3.ado, which I discussed in Programming an estimation command in Stata: Allowing for robust or cluster–robust standard errors in a poisson command using Mata.

This is the twenty-third post in the series Programming an estimation command in Stata. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.

Analytically computed derivatives for Poisson

The contribution of the i(th) observation to the log-likelihood function for the Poisson maximum-likelihood estimator is
$$L_i = -\exp(\xb_i\betab’) + y_i\xb_i\betab’ – \ln(y_i\!)$$

The vector of observation-level contributions can be coded in Mata by

    xb  = X*b'
mu  = exp(xb)
val = (-mu + y:*xb - lnfactorial(y))


where X is the matrix of observations on the covariates, b is the row vector of parameters, y is the vector of observations on the dependent variable, mu is the vector of observations on xb=X*b’, and val is the vector of observation-level contributions.

The gradient for the i(th) observation is
$$g_i = (y_i-\exp(\xb_i\betab’))\xb_i$$

The vector of all the observation-level gradients can be coded in Mata by (y-mu):*X.

The sum of the Hessians calculated at each observation i is
$$H = -\sum_{i=1}^N \exp(\xb_i\betab)\xb_i’\xb_i$$

which can be coded in Mata by -quadcross(X, mu, X).

Using analytically computed gradients in optimize()

The code in dex1.do implements the observation-level gradients in the evaluator function plleval3() used by optimize() in dowork() to maximize the Poisson log-likelihood function for the given data.

Code block 1: dex1.do

mata:

void plleval3(real scalar todo, real vector b,     ///
real vector y,    real matrix X,     ///
{
real vector  xb, mu

xb  = X*b'
mu  = exp(xb)
val = (-mu + y:*xb - lnfactorial(y))

if (todo>=1) {
}
}

void dowork( )
{
real vector   y, b
real matrix   X
real scalar   n, p
transmorphic  S

y  = st_data(., "accidents")
X  = st_data(., "cvalue kids traffic ")
n  = rows(y)
X  = X, J(n, 1, 1)
p  = cols(X)

S  = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &plleval3())
optimize_init_evaluatortype(S, "gf1debug")
optimize_init_params(S, J(1, p, .01))

b    = optimize(S)

}

dowork()

end


Lines 2–16 define the evaluator function plleval3(), which stores the observation-level contributions to the log likelihood in val and the observation-level gradients in grad. grad is only calculated when
todo>=1.

optimize() uses todo to tell the evaluator function what it needs. At some points in the optimization process, optimize() needs only the value of the objective function, which optimize() communicates to the evaluator by setting todo=0. At other points in the optimization process, optimize() needs the value of the objective function and the gradient, which optimize() communicates to the evaluator by setting todo=1. At still other points in the optimization process, optimize() needs the value of the objective function, the gradient, and the Hessian, which optimize() communicates to the evaluator by setting todo=2. An evaluator function that calculates the gradient analytically must compute it when todo=1 or todo=2. Coding >= instead of == on line 13 is crucial.

Lines 18–40 define dowork(), which implements a call to optimize() to maximize the Poisson log-likelihood function for these data. Line 35 differs from the examples that I previously discussed; it sets the evaluator type to gf1debug. This evaluator type has two parts: gf1 and debug. gf1 specifies that the evaluator return observation-level contributions to the objective function and that it return a matrix of observation-level gradients when todo==1 or todo==2. Appending debug to gf1 tells optimize() to produce a report comparing the analytically computed derivatives with those computed numerically by optimize() and to use the numerically computed derivatives for the optimization.

Example 1 illustrates the derivative comparison report.

Example 1: gf1debug output

. clear all

. use accident3

. do dex1

. mata:
------------------------------------------------- mata (type end to exit) ------
:
: void plleval3(real scalar todo, real vector b,     ///
>               real vector y,    real matrix X,     ///
> {
>     real vector  xb, mu
>
>     xb  = X*b'
>     mu  = exp(xb)
>     val = (-mu + y:*xb - lnfactorial(y))
>
>     if (todo>=1) {
>         grad = (y - mu):*X
>     }
> }
note: argument hess unused

:
: void dowork( )
> {
>     real vector   y, b
>     real matrix   X
>     real scalar   n, p
>     transmorphic  S
>
>     y  = st_data(., "accidents")
>     X  = st_data(., "cvalue kids traffic ")
>     n  = rows(y)
>     X  = X, J(n, 1, 1)
>     p  = cols(X)
>
>     S  = optimize_init()
>     optimize_init_argument(S, 1, y)
>     optimize_init_argument(S, 2, X)
>     optimize_init_evaluator(S, &plleval3())
>     optimize_init_evaluatortype(S, "gf1debug")
>     optimize_init_params(S, J(1, p, .01))
>
>     b    = optimize(S)
>
> }
note: variable b set but not used

:
: dowork()

gf1debug:  Begin derivative-comparison report ----------------------------------
gf1debug:  Warning:  evaluator did not compute Hessian matrix
gf1debug:  End derivative-comparison report ------------------------------------
Iteration 0:   f(p) = -851.18669

gf1debug:  Begin derivative-comparison report ----------------------------------
gf1debug:  Warning:  evaluator did not compute Hessian matrix
gf1debug:  End derivative-comparison report ------------------------------------
Iteration 1:   f(p) = -556.66874

gf1debug:  Begin derivative-comparison report ----------------------------------
gf1debug:  Warning:  evaluator did not compute Hessian matrix
gf1debug:  End derivative-comparison report ------------------------------------
Iteration 2:   f(p) = -555.81731

gf1debug:  Begin derivative-comparison report ----------------------------------
gf1debug:  Warning:  evaluator did not compute Hessian matrix
gf1debug:  End derivative-comparison report ------------------------------------
Iteration 3:   f(p) = -555.81538

gf1debug:  Begin derivative-comparison report ----------------------------------
gf1debug:  Warning:  evaluator did not compute Hessian matrix
gf1debug:  End derivative-comparison report ------------------------------------
Iteration 4:   f(p) = -555.81538
:
: end
--------------------------------------------------------------------------------

.
end of do-file


For each iteration, mreldif(gradient vectors) reports the maximum relative difference between the analytically and numerically computed derivatives. Away from the optimum, a correctly coded analytical gradient will yield an mreldif of e-08 or smaller. The numerically computed gradients are imperfect approximations to the true gradients, and e-08 is about the best we can reliably hope for when using double precision numbers. Use the mreldif reports from iterations away from the optimum. Because the gradient is almost zero at the optimum, the mreldif calculation produces an over-sized difference for iterations near the optimum.

In the example at hand, the mreldif calculations of 9.91e-07, 2.06e-10, and 1.59e-07 for iterations 0, 1, and 2 indicate that the analytically computed derivatives are correct.

The code in dex2.do differs from that in dex1.do by specifying the evaluator type to be gf1 instead of gf1debug on line 37. A gf1 evaluator type differs from a gf1debug evaluator type in that it uses the analytically computed gradients in the optimization, the numerical gradients are not computed, and there are no derivative comparison reports.

Code block 2: dex2.do

mata:

mata drop plleval3() dowork()

void plleval3(real scalar todo, real vector b,     ///
real vector y,    real matrix X,     ///
{
real vector  xb, mu

xb  = X*b'
mu  = exp(xb)
val = (-mu + y:*xb - lnfactorial(y))

if (todo>=1) {
}
}

void dowork( )
{
real vector   y, b
real matrix   X
real scalar   n, p
transmorphic  S

y  = st_data(., "accidents")
X  = st_data(., "cvalue kids traffic ")
n  = rows(y)
X  = X, J(n, 1, 1)
p  = cols(X)

S  = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &plleval3())
optimize_init_evaluatortype(S, "gf1")
optimize_init_params(S, J(1, p, .01))

b    = optimize(S)

}

dowork()

end


Example 2 illustrates the output.

Example 2: gf1 output

. do dex2

. mata:
------------------------------------------------- mata (type end to exit) ------
:
: mata drop plleval3() dowork()

:
: void plleval3(real scalar todo, real vector b,     ///
>               real vector y,    real matrix X,     ///
> {
>     real vector  xb, mu
>
>     xb  = X*b'
>     mu  = exp(xb)
>     val = (-mu + y:*xb - lnfactorial(y))
>
>     if (todo>=1) {
>         grad = (y - mu):*X
>     }
> }
note: argument hess unused

:
: void dowork( )
> {
>     real vector   y, b
>     real matrix   X
>     real scalar   n, p
>     transmorphic  S
>
>     y  = st_data(., "accidents")
>     X  = st_data(., "cvalue kids traffic ")
>     n  = rows(y)
>     X  = X, J(n, 1, 1)
>     p  = cols(X)
>
>     S  = optimize_init()
>     optimize_init_argument(S, 1, y)
>     optimize_init_argument(S, 2, X)
>     optimize_init_evaluator(S, &plleval3())
>     optimize_init_evaluatortype(S, "gf1")
>     optimize_init_params(S, J(1, p, .01))
>
>     b    = optimize(S)
>
> }
note: variable b set but not used

:
: dowork()
Iteration 0:   f(p) = -851.18669
Iteration 1:   f(p) = -556.66855
Iteration 2:   f(p) = -555.81731
Iteration 3:   f(p) = -555.81538
Iteration 4:   f(p) = -555.81538
:
: end
--------------------------------------------------------------------------------

.
end of do-file


Using an analytically computed Hessian in optimize()

The code in dex3.do adds the sum of the observation-level Hessians to the evaluator function plleval3() used by optimize() in dowork().

Code block 3: dex3.do

mata:

mata drop plleval3() dowork()

void plleval3(real scalar todo, real vector b,     ///
real vector y,    real matrix X,     ///
{
real vector  xb, mu

xb  = X*b'
mu  = exp(xb)
val = (-mu + y:*xb - lnfactorial(y))

if (todo>=1) {
}
if (todo==2) {
}

}

void dowork( )
{
real vector   y, b
real matrix   X
real scalar   n, p
transmorphic  S

y  = st_data(., "accidents")
X  = st_data(., "cvalue kids traffic ")
n  = rows(y)
X  = X, J(n, 1, 1)
p  = cols(X)

S  = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &plleval3())
optimize_init_evaluatortype(S, "gf2debug")
optimize_init_params(S, J(1, p, .01))

b    = optimize(S)

}

dowork()

end


Lines 18–20 are new to dex3.do, and they compute the Hessian when todo==2. Line 41 in dex3.do specifies a gf2debug evaluator type instead of the gf1 evaluator type specified on line 37 of dex2.do.

The gf2debug evaluator type is a second-derivative version of the gf1debug evaluator type; it specifies that the evaluator return observation-level contributions to the objective function, that it return a matrix of observation-level gradients when todo==1 or todo==2, and that it return a matrix containing the sum of observation-level Hessians when todo==2. The gf2debug evaluator type also specifies that optimize() will produce a derivative-comparison report for the gradient and the Hessian and that optimize() will use the numerically computed derivatives for the optimization.

Example 3 illustrates the output.

Example 3: gf2debug output

. do dex3

. mata:
------------------------------------------------- mata (type end to exit) ------
:
: mata drop plleval3() dowork()

:
: void plleval3(real scalar todo, real vector b,     ///
>               real vector y,    real matrix X,     ///
> {
>     real vector  xb, mu
>
>     xb  = X*b'
>     mu  = exp(xb)
>     val = (-mu + y:*xb - lnfactorial(y))
>
>     if (todo>=1) {
>         grad = (y - mu):*X
>     }
>     if (todo==2) {
>         hess = -quadcross(X, mu, X)
>     }
>
> }

:
: void dowork( )
> {
>     real vector   y, b
>     real matrix   X
>     real scalar   n, p
>     transmorphic  S
>
>     y  = st_data(., "accidents")
>     X  = st_data(., "cvalue kids traffic ")
>     n  = rows(y)
>     X  = X, J(n, 1, 1)
>     p  = cols(X)
>
>     S  = optimize_init()
>     optimize_init_argument(S, 1, y)
>     optimize_init_argument(S, 2, X)
>     optimize_init_evaluator(S, &plleval3())
>     optimize_init_evaluatortype(S, "gf2debug")
>     optimize_init_params(S, J(1, p, .01))
>
>     b    = optimize(S)
>
> }
note: variable b set but not used

:
: dowork()

gf2debug:  Begin derivative-comparison report ----------------------------------
gf2debug:  mreldif(Hessian matrices) =  1.53e-06
gf2debug:  End derivative-comparison report ------------------------------------
Iteration 0:   f(p) = -851.18669

gf2debug:  Begin derivative-comparison report ----------------------------------
gf2debug:  mreldif(Hessian matrices) =  .0001703
gf2debug:  End derivative-comparison report ------------------------------------
Iteration 1:   f(p) = -556.66874

gf2debug:  Begin derivative-comparison report ----------------------------------
gf2debug:  mreldif(Hessian matrices) =  5.42e-07
gf2debug:  End derivative-comparison report ------------------------------------
Iteration 2:   f(p) = -555.81731

gf2debug:  Begin derivative-comparison report ----------------------------------
gf2debug:  mreldif(Hessian matrices) =  2.45e-07
gf2debug:  End derivative-comparison report ------------------------------------
Iteration 3:   f(p) = -555.81538

gf2debug:  Begin derivative-comparison report ----------------------------------
gf2debug:  mreldif(Hessian matrices) =  2.46e-07
gf2debug:  End derivative-comparison report ------------------------------------
Iteration 4:   f(p) = -555.81538
:
: end
--------------------------------------------------------------------------------

.
end of do-file


Unlike the mreldif calculations for the gradient, I look closely at the mreldif calculations for the Hessian near the optimum, because the Hessian must be full rank at the optimum. In this example, the mreldif calculations near the optimum are on the order of e-07, indicating a correctly coded analytical Hessian.

Now consider dex4.do, which differs from dex3.do in that line 40 specifies a gf2 evaluator type instead of a gf2debug evaluator type. A gf2 evaluator type is gf1 evaluator type for first and second derivatives. A gf2 evaluator type differs from a gf2debug evaluator type in that it uses the analytically computed gradients and the analytically computed Hessian in the optimization, the numerical derivatives are not computed, and there are no derivative comparison reports.

Code block 4: dex4.do

mata:

mata drop plleval3() dowork()

void plleval3(real scalar todo, real vector b,     ///
real vector y,    real matrix X,     ///
{
real vector  xb, mu

xb  = X*b'
mu  = exp(xb)
val = (-mu + y:*xb - lnfactorial(y))

if (todo>=1) {
}
if (todo==2) {
}

}

void dowork( )
{
real vector   y, b
real matrix   X
real scalar   n, p
transmorphic  S

y  = st_data(., "accidents")
X  = st_data(., "cvalue kids traffic ")
n  = rows(y)
X  = X, J(n, 1, 1)
p  = cols(X)

S  = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &plleval3())
optimize_init_evaluatortype(S, "gf2")
optimize_init_params(S, J(1, p, .01))

b    = optimize(S)

}

dowork()

end


Example 4 illustrates the output.

Example 4: gf2 output

. do dex4

. mata:
------------------------------------------------- mata (type end to exit) ------
:
: mata drop plleval3() dowork()

:
: void plleval3(real scalar todo, real vector b,     ///
>               real vector y,    real matrix X,     ///
> {
>     real vector  xb, mu
>
>     xb  = X*b'
>     mu  = exp(xb)
>     val = (-mu + y:*xb - lnfactorial(y))
>
>     if (todo>=1) {
>         grad = (y - mu):*X
>     }
>     if (todo==2) {
>         hess = -quadcross(X, mu, X)
>     }
>
> }

:
: void dowork( )
> {
>     real vector   y, b
>     real matrix   X
>     real scalar   n, p
>     transmorphic  S
>
>     y  = st_data(., "accidents")
>     X  = st_data(., "cvalue kids traffic ")
>     n  = rows(y)
>     X  = X, J(n, 1, 1)
>     p  = cols(X)
>
>     S  = optimize_init()
>     optimize_init_argument(S, 1, y)
>     optimize_init_argument(S, 2, X)
>     optimize_init_evaluator(S, &plleval3())
>     optimize_init_evaluatortype(S, "gf2")
>     optimize_init_params(S, J(1, p, .01))
>
>     b    = optimize(S)
>
> }
note: variable b set but not used

:
: dowork()
Iteration 0:   f(p) = -851.18669
Iteration 1:   f(p) = -556.66855
Iteration 2:   f(p) = -555.81731
Iteration 3:   f(p) = -555.81538
Iteration 4:   f(p) = -555.81538
:
: end
--------------------------------------------------------------------------------

.
end of do-file


Including analytical derivatives in the command

mypoisson4 is like mypoisson3, except that it computes the derivatives analytically. In the remainder of this post, I briefly discuss the code for mypoisson4.ado.

*! version 4.0.0  28Feb2016
program define mypoisson4, eclass sortpreserve
version 14

syntax varlist(numeric ts fv min=2) [if] [in] [, noCONStant vce(string) ]
marksample touse

_vce_parse touse' , optlist(Robust) argoptlist(CLuster) : , vce(vce')
local vce        "r(vce)'"
local clustervar "r(cluster)'"
if "vce'" == "robust" | "vce'" == "cluster" {
local vcetype "Robust"
}
if "clustervar'" != "" {
capture confirm numeric variable clustervar'
if _rc {
display in red "invalid vce() option"
display in red "cluster variable {bf:clustervar'} is " ///
"string variable instead of a numeric variable"
exit(198)
}
sort clustervar'
}

gettoken depvar indepvars : varlist
_fv_check_depvar depvar'

tempname b mo V N rank

getcinfo indepvars' , constant'
local  cnames "r(cnames)'"
matrix mo' = r(mo)

mata: mywork("depvar'", "cnames'", "touse'", "constant'", ///
"b'", "V'", "N'", "rank'", "mo'", "vce'", "clustervar'")

if "constant'" == "" {
local cnames "cnames' _cons"
}
matrix colnames b' = cnames'
matrix colnames V' = cnames'
matrix rownames V' = cnames'

ereturn post b' V', esample(touse') buildfvinfo
ereturn scalar N       = N'
ereturn scalar rank    = rank'
ereturn local  vce      "vce'"
ereturn local  vcetype  "vcetype'"
ereturn local  clustvar "clustervar'"
ereturn local  cmd     "mypoisson4"

ereturn display

end

program getcinfo, rclass
syntax varlist(ts fv), [ noCONStant ]

_rmcoll varlist' , constant' expand
local cnames r(varlist)'
local p : word count cnames'
if "constant'" == "" {
local p = p' + 1
local cons _cons
}

tempname b mo

matrix b' = J(1, p', 0)
matrix colnames b' = cnames' cons'
_ms_omit_info b'
matrix mo' = r(omit)

return local  cnames "cnames'"
return matrix mo = mo'
end

mata:

void mywork( string scalar depvar,  string scalar indepvars,
string scalar touse,   string scalar constant,
string scalar bname,   string scalar Vname,
string scalar nname,   string scalar rname,
string scalar mo,
string scalar vcetype, string scalar clustervar)
{

real vector y, b
real matrix X, V, Ct
real scalar n, p, rank

y = st_data(., depvar, touse)
n = rows(y)
X = st_data(., indepvars, touse)
if (constant == "") {
X = X,J(n, 1, 1)
}
p = cols(X)

Ct = makeCt(mo)

S  = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &plleval3())
optimize_init_evaluatortype(S, "gf2")
optimize_init_params(S, J(1, p, .01))
optimize_init_constraints(S, Ct)

b    = optimize(S)

if (vcetype == "robust") {
V    = optimize_result_V_robust(S)
}
else if (vcetype == "cluster") {
cvar = st_data(., clustervar, touse)
optimize_init_cluster(S, cvar)
V    = optimize_result_V_robust(S)
}
else {                 // vcetype must IID
V    = optimize_result_V_oim(S)
}
rank = p - diag0cnt(invsym(V))

st_matrix(bname, b)
st_matrix(Vname, V)
st_numscalar(nname, n)
st_numscalar(rname, rank)
}

real matrix makeCt(string scalar mo)
{
real vector mo_v
real scalar ko, j, p

mo_v = st_matrix(mo)
p    = cols(mo_v)
ko   = sum(mo_v)
if (ko>0) {
Ct   = J(0, p, .)
for(j=1; j<=p; j++) {
if (mo_v[j]==1) {
Ct  = Ct \ e(j, p)
}
}
Ct = Ct, J(ko, 1, 0)
}
else {
Ct = J(0,p+1,.)
}

return(Ct)

}

void plleval3(real scalar todo, real vector b,     ///
real vector y,    real matrix X,     ///
{
real vector  xb, mu

xb  = X*b'
mu  = exp(xb)
val = (-mu + y:*xb - lnfactorial(y))

if (todo>=1) {
}
if (todo==2) {
}
}

end


Only a few lines of mypoisson4.ado differ from their counterparts in mypoisson3.ado. Line 106 of mypoisson4.ado specifies a gf2 evaluator type, while line 106 of mypoisson3.ado specifies a gf0 evaluator type. Lines 166–171 in mypoisson4.ado compute the gradient and the Hessian analytically, and they have no counterparts in mypoisson3.ado.

The output in examples 5 and 6 confirms that mypoisson4 produces the same results as poisson when the option vce(cluster id) is specified.

Example 5: mypoisson4 results

. mypoisson4 accidents cvalue kids traffic , vce(cluster id)
Iteration 0:   f(p) = -851.18669
Iteration 1:   f(p) = -556.66855
Iteration 2:   f(p) = -555.81731
Iteration 3:   f(p) = -555.81538
Iteration 4:   f(p) = -555.81538
(Std. Err. adjusted for clustering on id)
------------------------------------------------------------------------------
|               Robust
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
cvalue |  -.6558871   .1125223    -5.83   0.000    -.8764267   -.4353475
kids |  -1.009017   .1805639    -5.59   0.000    -1.362916   -.6551182
traffic |   .1467115    .092712     1.58   0.114    -.0350008    .3284237
_cons |   .5743541   .6238015     0.92   0.357    -.6482744    1.796983
------------------------------------------------------------------------------


Example 6: poisson results

. poisson accidents cvalue kids traffic , vce(cluster id)

Iteration 0:   log pseudolikelihood = -555.86605
Iteration 1:   log pseudolikelihood =  -555.8154
Iteration 2:   log pseudolikelihood = -555.81538

Poisson regression                              Number of obs     =        505
Wald chi2(3)      =     103.53
Prob > chi2       =     0.0000
Log pseudolikelihood = -555.81538               Pseudo R2         =     0.2343

(Std. Err. adjusted for 285 clusters in id)
------------------------------------------------------------------------------
|               Robust
accidents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
cvalue |  -.6558871   .1125223    -5.83   0.000    -.8764266   -.4353475
kids |  -1.009017   .1805639    -5.59   0.000    -1.362915   -.6551181
traffic |   .1467115    .092712     1.58   0.114    -.0350008    .3284237
_cons |    .574354   .6238015     0.92   0.357    -.6482744    1.796982
------------------------------------------------------------------------------
`

Done and undone

I showed how to compute derivatives analytically when using optimize(), and I included analytically computed derivatives in mypoisson4.ado. In my next post, I show how to make predict work after mypoisson4.

Categories: Programming Tags: