Programming an estimation command in Stata: Adding analytical derivatives to a poisson command using Mata
\(\newcommand{\xb}{{\bf x}}
\newcommand{\betab}{\boldsymbol{\beta}}\)
Using analytically computed derivatives can greatly reduce the time required to solve a nonlinear estimation problem. I show how to use analytically computed derivatives with optimize(), and I discuss mypoisson4.ado, which uses these analytically computed derivatives. Only a few lines of mypoisson4.ado differ from the code for mypoisson3.ado, which I discussed in Programming an estimation command in Stata: Allowing for robust or cluster–robust standard errors in a poisson command using Mata.
This is the twenty-third post in the series Programming an estimation command in Stata. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.
Analytically computed derivatives for Poisson
The contribution of the ith observation to the log-likelihood function for the Poisson maximum-likelihood estimator is
$$
L_i = -\exp(\xb_i\betab') + y_i\xb_i\betab' - \ln(y_i!)
$$
The vector of observation-level contributions can be coded in Mata by
xb  = X*b'
mu  = exp(xb)
val = (-mu + y:*xb - lnfactorial(y))
where X is the matrix of observations on the covariates, b is the row vector of parameters, y is the vector of observations on the dependent variable, mu is the vector of observations on exp(xb), and val is the vector of observation-level contributions.
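As a quick sanity check, these three lines can be run interactively on a few made-up observations. The data values and parameter vector below are hypothetical, chosen only to illustrate the calculation.

mata:
// hypothetical toy data, for illustration only
y = (0 \ 2 \ 1)                        // dependent variable
X = (1, 1 \ 2, 1 \ 3, 1)               // one covariate plus a constant
b = (.1, .2)                           // an arbitrary parameter vector

xb  = X*b'
mu  = exp(xb)
val = (-mu + y:*xb - lnfactorial(y))   // observation-level contributions
sum(val)                               // the log likelihood at b
end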
The gradient for the ith observation is
$$
g_i = (y_i-\exp(\xb_i\betab'))\xb_i
$$
The vector of all the observation-level gradients can be coded in Mata by (y-mu):*X.
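Continuing the toy example above, the observation-level gradients stack into an n x p matrix, and their column sums form the score vector.

mata:
// the colon operator scales each row of X by the corresponding residual
grad = (y - mu):*X
colsum(grad)                           // 1 x 2 score vector at b
end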
The sum of the Hessians calculated at each observation i is
$$
H = -\sum_{i=1}^N \exp(\xb_i\betab')\xb_i'\xb_i
$$
which can be coded in Mata by -quadcross(X, mu, X).
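quadcross(X, mu, X) computes X'diag(mu)X in quad precision without forming the n x n diagonal matrix. A quick check on the toy data above compares it with the slower but more transparent formula.

mata:
// two ways of computing the Hessian; they agree up to numerical precision
H1 = -quadcross(X, mu, X)
H2 = -X'*diag(mu)*X
mreldif(H1, H2)                        // effectively zero
end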
Using analytically computed gradients in optimize()
The code in dex1.do implements the observation-level gradients in the evaluator function plleval3() used by optimize() in dowork() to maximize the Poisson log-likelihood function for the given data.
 1 mata:
 2
 3 void plleval3(real scalar todo, real vector b,   ///
 4         real vector y, real matrix X,            ///
 5         val, grad, hess)
 6 {
 7         real vector  xb, mu
 8
 9         xb  = X*b'
10         mu  = exp(xb)
11         val = (-mu + y:*xb - lnfactorial(y))
12
13         if (todo>=1) {
14                 grad = (y - mu):*X
15         }
16 }
17
18 void dowork( )
19 {
20         real vector   y, b
21         real matrix   X
22         real scalar   n, p
23         transmorphic  S
24
25         y = st_data(., "accidents")
26         X = st_data(., "cvalue kids traffic ")
27         n = rows(y)
28         X = X, J(n, 1, 1)
29         p = cols(X)
30
31         S = optimize_init()
32         optimize_init_argument(S, 1, y)
33         optimize_init_argument(S, 2, X)
34         optimize_init_evaluator(S, &plleval3())
35         optimize_init_evaluatortype(S, "gf1debug")
36         optimize_init_params(S, J(1, p, .01))
37
38         b = optimize(S)
39
40 }
41
42 dowork()
43 end
Lines 3–16 define the evaluator function plleval3(), which stores the observation-level contributions to the log likelihood in val and the observation-level gradients in grad. grad is calculated only when todo>=1.
optimize() uses todo to tell the evaluator function what it needs. When optimize() needs only the value of the objective function, it sets todo=0; when it also needs the gradient, it sets todo=1; and when it needs the value, the gradient, and the Hessian, it sets todo=2. An evaluator function that calculates the gradient analytically must therefore compute it when todo=1 or when todo=2, which is why coding >= instead of == on line 13 is crucial.
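One way to see the todo protocol in action is to call the evaluator directly, outside of optimize(). The sketch below assumes that plleval3() from dex1.do has already been defined and that accident3.dta is in memory.

mata:
// call plleval3() by hand: todo=0 fills in val only; todo=1 also fills
// in grad; hess is never touched because plleval3() ignores it
y = st_data(., "accidents")
X = st_data(., "cvalue kids traffic"), J(rows(y), 1, 1)
b = J(1, cols(X), .01)
val  = .
grad = .
hess = .
plleval3(0, b, y, X, val, grad, hess)
rows(val), cols(grad)                  // val is n x 1; grad is still a missing scalar
plleval3(1, b, y, X, val, grad, hess)
rows(grad), cols(grad)                 // grad is now n x p
end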
Lines 18–40 define dowork(), which calls optimize() to maximize the Poisson log-likelihood function for these data. Line 35 differs from the examples that I previously discussed; it sets the evaluator type to gf1debug. This evaluator type has two parts: gf1 and debug. gf1 specifies that the evaluator return observation-level contributions to the objective function and a matrix of observation-level gradients when todo==1 or todo==2. Appending debug tells optimize() to produce a report comparing the analytically computed derivatives with those it computes numerically and to use the numerically computed derivatives for the optimization.
Example 1 illustrates the derivative-comparison report.
Example 1: gf1debug output
. clear all

. use accident3

. do dex1

. mata:
------------------------------------------------- mata (type end to exit) ------
:
: void plleval3(real scalar todo, real vector b,   ///
>         real vector y, real matrix X,            ///
>         val, grad, hess)
> {
>         real vector  xb, mu
>
>         xb  = X*b'
>         mu  = exp(xb)
>         val = (-mu + y:*xb - lnfactorial(y))
>
>         if (todo>=1) {
>                 grad = (y - mu):*X
>         }
> }
note: argument hess unused
:
: void dowork( )
> {
>         real vector   y, b
>         real matrix   X
>         real scalar   n, p
>         transmorphic  S
>
>         y = st_data(., "accidents")
>         X = st_data(., "cvalue kids traffic ")
>         n = rows(y)
>         X = X, J(n, 1, 1)
>         p = cols(X)
>
>         S = optimize_init()
>         optimize_init_argument(S, 1, y)
>         optimize_init_argument(S, 2, X)
>         optimize_init_evaluator(S, &plleval3())
>         optimize_init_evaluatortype(S, "gf1debug")
>         optimize_init_params(S, J(1, p, .01))
>
>         b = optimize(S)
>
> }
note: variable b set but not used
:
: dowork()
gf1debug: Begin derivative-comparison report ----------------------------------
gf1debug: mreldif(gradient vectors) = 9.91e-07
gf1debug: Warning: evaluator did not compute Hessian matrix
gf1debug: End derivative-comparison report ------------------------------------
Iteration 0:   f(p) = -851.18669
gf1debug: Begin derivative-comparison report ----------------------------------
gf1debug: mreldif(gradient vectors) = 2.06e-10
gf1debug: Warning: evaluator did not compute Hessian matrix
gf1debug: End derivative-comparison report ------------------------------------
Iteration 1:   f(p) = -556.66874
gf1debug: Begin derivative-comparison report ----------------------------------
gf1debug: mreldif(gradient vectors) = 1.59e-07
gf1debug: Warning: evaluator did not compute Hessian matrix
gf1debug: End derivative-comparison report ------------------------------------
Iteration 2:   f(p) = -555.81731
gf1debug: Begin derivative-comparison report ----------------------------------
gf1debug: mreldif(gradient vectors) = .0000267
gf1debug: Warning: evaluator did not compute Hessian matrix
gf1debug: End derivative-comparison report ------------------------------------
Iteration 3:   f(p) = -555.81538
gf1debug: Begin derivative-comparison report ----------------------------------
gf1debug: mreldif(gradient vectors) = .0000272
gf1debug: Warning: evaluator did not compute Hessian matrix
gf1debug: End derivative-comparison report ------------------------------------
Iteration 4:   f(p) = -555.81538
:
: end
--------------------------------------------------------------------------------

.
end of do-file
For each iteration, mreldif(gradient vectors) reports the maximum relative difference between the analytically and numerically computed gradients. Away from the optimum, a correctly coded analytical gradient will yield an mreldif on the order of e-06 or smaller; the numerically computed gradients are imperfect approximations to the true gradients, and about e-08 is the best we can reliably hope for with double-precision numbers. Use the mreldif reports from iterations away from the optimum: because the gradient is almost zero at the optimum, the mreldif calculation produces an outsized difference for iterations near the optimum.
In the example at hand, the mreldif calculations of 9.91e-07, 2.06e-10, and 1.59e-07 for iterations 0, 1, and 2 indicate that the analytically computed derivatives are correct.
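The same comparison is easy to reproduce, because mreldif() is also a Mata function. The two vectors below are hypothetical values chosen only to illustrate what the report computes.

mata:
// mreldif(A, B) = max(abs(A - B) :/ (abs(B) :+ 1)); small values mean
// the two matrices agree element by element
g_analytic = (1.5, -2.25, .001)
g_numeric  = (1.5000001, -2.2499998, .001)
mreldif(g_analytic, g_numeric)
end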
The code in dex2.do differs from that in dex1.do by specifying the gf1 evaluator type instead of gf1debug on line 37. (dex2.do also begins by dropping the functions that dex1.do left in memory.) With a gf1 evaluator type, optimize() uses the analytically computed gradients in the optimization, does not compute numerical gradients, and produces no derivative-comparison reports.
 1 mata:
 2
 3 mata drop plleval3() dowork()
 4
 5 void plleval3(real scalar todo, real vector b,   ///
 6         real vector y, real matrix X,            ///
 7         val, grad, hess)
 8 {
 9         real vector  xb, mu
10
11         xb  = X*b'
12         mu  = exp(xb)
13         val = (-mu + y:*xb - lnfactorial(y))
14
15         if (todo>=1) {
16                 grad = (y - mu):*X
17         }
18 }
19
20 void dowork( )
21 {
22         real vector   y, b
23         real matrix   X
24         real scalar   n, p
25         transmorphic  S
26
27         y = st_data(., "accidents")
28         X = st_data(., "cvalue kids traffic ")
29         n = rows(y)
30         X = X, J(n, 1, 1)
31         p = cols(X)
32
33         S = optimize_init()
34         optimize_init_argument(S, 1, y)
35         optimize_init_argument(S, 2, X)
36         optimize_init_evaluator(S, &plleval3())
37         optimize_init_evaluatortype(S, "gf1")
38         optimize_init_params(S, J(1, p, .01))
39
40         b = optimize(S)
41
42 }
43
44 dowork()
45 end
Example 2 illustrates the output.
Example 2: gf1 output
. do dex2

. mata:
------------------------------------------------- mata (type end to exit) ------
:
: mata drop plleval3() dowork()
:
: void plleval3(real scalar todo, real vector b,   ///
>         real vector y, real matrix X,            ///
>         val, grad, hess)
> {
>         real vector  xb, mu
>
>         xb  = X*b'
>         mu  = exp(xb)
>         val = (-mu + y:*xb - lnfactorial(y))
>
>         if (todo>=1) {
>                 grad = (y - mu):*X
>         }
> }
note: argument hess unused
:
: void dowork( )
> {
>         real vector   y, b
>         real matrix   X
>         real scalar   n, p
>         transmorphic  S
>
>         y = st_data(., "accidents")
>         X = st_data(., "cvalue kids traffic ")
>         n = rows(y)
>         X = X, J(n, 1, 1)
>         p = cols(X)
>
>         S = optimize_init()
>         optimize_init_argument(S, 1, y)
>         optimize_init_argument(S, 2, X)
>         optimize_init_evaluator(S, &plleval3())
>         optimize_init_evaluatortype(S, "gf1")
>         optimize_init_params(S, J(1, p, .01))
>
>         b = optimize(S)
>
> }
note: variable b set but not used
:
: dowork()
Iteration 0:   f(p) = -851.18669
Iteration 1:   f(p) = -556.66855
Iteration 2:   f(p) = -555.81731
Iteration 3:   f(p) = -555.81538
Iteration 4:   f(p) = -555.81538
:
: end
--------------------------------------------------------------------------------

.
end of do-file
Using an analytically computed Hessian in optimize()
The code in dex3.do adds the sum of the observation-level Hessians to the evaluator function plleval3() used by optimize() in dowork().
 1 mata:
 2
 3 mata drop plleval3() dowork()
 4
 5 void plleval3(real scalar todo, real vector b,   ///
 6         real vector y, real matrix X,            ///
 7         val, grad, hess)
 8 {
 9         real vector  xb, mu
10
11         xb  = X*b'
12         mu  = exp(xb)
13         val = (-mu + y:*xb - lnfactorial(y))
14
15         if (todo>=1) {
16                 grad = (y - mu):*X
17         }
18         if (todo==2) {
19                 hess = -quadcross(X, mu, X)
20         }
21
22 }
23
24 void dowork( )
25 {
26         real vector   y, b
27         real matrix   X
28         real scalar   n, p
29         transmorphic  S
30
31         y = st_data(., "accidents")
32         X = st_data(., "cvalue kids traffic ")
33         n = rows(y)
34         X = X, J(n, 1, 1)
35         p = cols(X)
36
37         S = optimize_init()
38         optimize_init_argument(S, 1, y)
39         optimize_init_argument(S, 2, X)
40         optimize_init_evaluator(S, &plleval3())
41         optimize_init_evaluatortype(S, "gf2debug")
42         optimize_init_params(S, J(1, p, .01))
43
44         b = optimize(S)
45
46 }
47
48 dowork()
49 end
Lines 18–20 are new to dex3.do, and they compute the Hessian when todo==2. Line 41 in dex3.do specifies a gf2debug evaluator type instead of the gf1 evaluator type specified on line 37 of dex2.do.
The gf2debug evaluator type is a second-derivative version of the gf1debug evaluator type: it specifies that the evaluator return observation-level contributions to the objective function, a matrix of observation-level gradients when todo==1 or todo==2, and the sum of the observation-level Hessians when todo==2. Like gf1debug, it tells optimize() to produce a derivative-comparison report, now covering both the gradient and the Hessian, and to use the numerically computed derivatives for the optimization.
Example 3 illustrates the output.
Example 3: gf2debug output
. do dex3

. mata:
------------------------------------------------- mata (type end to exit) ------
:
: mata drop plleval3() dowork()
:
: void plleval3(real scalar todo, real vector b,   ///
>         real vector y, real matrix X,            ///
>         val, grad, hess)
> {
>         real vector  xb, mu
>
>         xb  = X*b'
>         mu  = exp(xb)
>         val = (-mu + y:*xb - lnfactorial(y))
>
>         if (todo>=1) {
>                 grad = (y - mu):*X
>         }
>         if (todo==2) {
>                 hess = -quadcross(X, mu, X)
>         }
>
> }
:
: void dowork( )
> {
>         real vector   y, b
>         real matrix   X
>         real scalar   n, p
>         transmorphic  S
>
>         y = st_data(., "accidents")
>         X = st_data(., "cvalue kids traffic ")
>         n = rows(y)
>         X = X, J(n, 1, 1)
>         p = cols(X)
>
>         S = optimize_init()
>         optimize_init_argument(S, 1, y)
>         optimize_init_argument(S, 2, X)
>         optimize_init_evaluator(S, &plleval3())
>         optimize_init_evaluatortype(S, "gf2debug")
>         optimize_init_params(S, J(1, p, .01))
>
>         b = optimize(S)
>
> }
note: variable b set but not used
:
: dowork()
gf2debug: Begin derivative-comparison report ----------------------------------
gf2debug: mreldif(gradient vectors) = 9.91e-07
gf2debug: mreldif(Hessian matrices) = 1.53e-06
gf2debug: End derivative-comparison report ------------------------------------
Iteration 0:   f(p) = -851.18669
gf2debug: Begin derivative-comparison report ----------------------------------
gf2debug: mreldif(gradient vectors) = 2.06e-10
gf2debug: mreldif(Hessian matrices) = .0001703
gf2debug: End derivative-comparison report ------------------------------------
Iteration 1:   f(p) = -556.66874
gf2debug: Begin derivative-comparison report ----------------------------------
gf2debug: mreldif(gradient vectors) = 1.59e-07
gf2debug: mreldif(Hessian matrices) = 5.42e-07
gf2debug: End derivative-comparison report ------------------------------------
Iteration 2:   f(p) = -555.81731
gf2debug: Begin derivative-comparison report ----------------------------------
gf2debug: mreldif(gradient vectors) = .0000267
gf2debug: mreldif(Hessian matrices) = 2.45e-07
gf2debug: End derivative-comparison report ------------------------------------
Iteration 3:   f(p) = -555.81538
gf2debug: Begin derivative-comparison report ----------------------------------
gf2debug: mreldif(gradient vectors) = .0000272
gf2debug: mreldif(Hessian matrices) = 2.46e-07
gf2debug: End derivative-comparison report ------------------------------------
Iteration 4:   f(p) = -555.81538
:
: end
--------------------------------------------------------------------------------

.
end of do-file
Unlike the gradient, the Hessian does not vanish at the optimum; it must be full rank there. I therefore look closely at the mreldif calculations for the Hessian near the optimum. In this example, they are on the order of e-07, indicating a correctly coded analytical Hessian.
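A related check, assuming dowork() were modified to return its handle S instead of discarding it, is to retrieve the Hessian at the converged parameter vector and confirm that it has full rank.

mata:
// sketch only: S is assumed to be the optimize() handle after convergence
H = optimize_result_Hessian(S)
rank(H) == cols(H)                     // 1 if the Hessian has full rank
end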
Now consider dex4.do, which differs from dex3.do in that line 41 specifies a gf2 evaluator type instead of a gf2debug evaluator type. A gf2 evaluator type is to first and second derivatives what a gf1 evaluator type is to first derivatives: optimize() uses the analytically computed gradient and Hessian in the optimization, does not compute numerical derivatives, and produces no derivative-comparison reports.
 1 mata:
 2
 3 mata drop plleval3() dowork()
 4
 5 void plleval3(real scalar todo, real vector b,   ///
 6         real vector y, real matrix X,            ///
 7         val, grad, hess)
 8 {
 9         real vector  xb, mu
10
11         xb  = X*b'
12         mu  = exp(xb)
13         val = (-mu + y:*xb - lnfactorial(y))
14
15         if (todo>=1) {
16                 grad = (y - mu):*X
17         }
18         if (todo==2) {
19                 hess = -quadcross(X, mu, X)
20         }
21
22 }
23
24 void dowork( )
25 {
26         real vector   y, b
27         real matrix   X
28         real scalar   n, p
29         transmorphic  S
30
31         y = st_data(., "accidents")
32         X = st_data(., "cvalue kids traffic ")
33         n = rows(y)
34         X = X, J(n, 1, 1)
35         p = cols(X)
36
37         S = optimize_init()
38         optimize_init_argument(S, 1, y)
39         optimize_init_argument(S, 2, X)
40         optimize_init_evaluator(S, &plleval3())
41         optimize_init_evaluatortype(S, "gf2")
42         optimize_init_params(S, J(1, p, .01))
43
44         b = optimize(S)
45
46 }
47
48 dowork()
49 end
Example 4 illustrates the output.
Example 4: gf2 output
. do dex4

. mata:
------------------------------------------------- mata (type end to exit) ------
:
: mata drop plleval3() dowork()
:
: void plleval3(real scalar todo, real vector b,   ///
>         real vector y, real matrix X,            ///
>         val, grad, hess)
> {
>         real vector  xb, mu
>
>         xb  = X*b'
>         mu  = exp(xb)
>         val = (-mu + y:*xb - lnfactorial(y))
>
>         if (todo>=1) {
>                 grad = (y - mu):*X
>         }
>         if (todo==2) {
>                 hess = -quadcross(X, mu, X)
>         }
>
> }
:
: void dowork( )
> {
>         real vector   y, b
>         real matrix   X
>         real scalar   n, p
>         transmorphic  S
>
>         y = st_data(., "accidents")
>         X = st_data(., "cvalue kids traffic ")
>         n = rows(y)
>         X = X, J(n, 1, 1)
>         p = cols(X)
>
>         S = optimize_init()
>         optimize_init_argument(S, 1, y)
>         optimize_init_argument(S, 2, X)
>         optimize_init_evaluator(S, &plleval3())
>         optimize_init_evaluatortype(S, "gf2")
>         optimize_init_params(S, J(1, p, .01))
>
>         b = optimize(S)
>
> }
note: variable b set but not used
:
: dowork()
Iteration 0:   f(p) = -851.18669
Iteration 1:   f(p) = -556.66855
Iteration 2:   f(p) = -555.81731
Iteration 3:   f(p) = -555.81538
Iteration 4:   f(p) = -555.81538
:
: end
--------------------------------------------------------------------------------

.
end of do-file
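The payoff from the analytical derivatives is speed. A rough way to measure it, assuming plleval3() from dex4.do is defined and accident3.dta is in memory, is to time a numerical-derivative run against an analytical-derivative run with Mata's timer functions. On a problem this small the gap is modest, but numerical first and second derivatives require many additional evaluations of the objective function per iteration, so the gap grows with the number of parameters and observations.

mata:
// compare "gf0" (numerical first and second derivatives) with "gf2"
// (analytical first and second derivatives) on the same problem
y = st_data(., "accidents")
X = st_data(., "cvalue kids traffic"), J(rows(y), 1, 1)

S = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &plleval3())
optimize_init_params(S, J(1, cols(X), .01))
optimize_init_tracelevel(S, "none")    // suppress the iteration log

timer_clear()
optimize_init_evaluatortype(S, "gf0")
timer_on(1)
b = optimize(S)
timer_off(1)

optimize_init_evaluatortype(S, "gf2")
timer_on(2)
b = optimize(S)
timer_off(2)

timer()                                // timer 1: gf0; timer 2: gf2
end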
Including analytical derivatives in the command
mypoisson4 is like mypoisson3, except that it computes the derivatives analytically. In the remainder of this post, I briefly discuss the code for mypoisson4.ado.
  1 *! version 4.0.0 28Feb2016
  2 program define mypoisson4, eclass sortpreserve
  3         version 14
  4         syntax varlist(numeric ts fv min=2) [if] [in] [, noCONStant vce(string) ]
  5         marksample touse
  6
  7         _vce_parse `touse' , optlist(Robust) argoptlist(CLuster) : , vce(`vce')
  8         local vce        "`r(vce)'"
  9         local clustervar "`r(cluster)'"
 10         if "`vce'" == "robust" | "`vce'" == "cluster" {
 11                 local vcetype "Robust"
 12         }
 13         if "`clustervar'" != "" {
 14                 capture confirm numeric variable `clustervar'
 15                 if _rc {
 16                         display in red "invalid vce() option"
 17                         display in red "cluster variable {bf:`clustervar'} is " ///
 18                             "string variable instead of a numeric variable"
 19                         exit(198)
 20                 }
 21                 sort `clustervar'
 22         }
 23
 24         gettoken depvar indepvars : varlist
 25         _fv_check_depvar `depvar'
 26
 27         tempname b mo V N rank
 28         getcinfo `indepvars' , `constant'
 29         local cnames "`r(cnames)'"
 30         matrix `mo' = r(mo)
 31
 32         mata: mywork("`depvar'", "`cnames'", "`touse'", "`constant'", ///
 33             "`b'", "`V'", "`N'", "`rank'", "`mo'", "`vce'", "`clustervar'")
 34
 35         if "`constant'" == "" {
 36                 local cnames "`cnames' _cons"
 37         }
 38         matrix colnames `b' = `cnames'
 39         matrix colnames `V' = `cnames'
 40         matrix rownames `V' = `cnames'
 41
 42         ereturn post `b' `V', esample(`touse') buildfvinfo
 43         ereturn scalar N        = `N'
 44         ereturn scalar rank     = `rank'
 45         ereturn local  vce       "`vce'"
 46         ereturn local  vcetype   "`vcetype'"
 47         ereturn local  clustvar  "`clustervar'"
 48         ereturn local  cmd      "mypoisson4"
 49
 50         ereturn display
 51 end
 52
 53 program getcinfo, rclass
 54         syntax varlist(ts fv), [ noCONStant ]
 55
 56         _rmcoll `varlist' , `constant' expand
 57         local cnames `r(varlist)'
 58         local p : word count `cnames'
 59         if "`constant'" == "" {
 60                 local p = `p' + 1
 61                 local cons _cons
 62         }
 63
 64         tempname b mo
 65
 66         matrix `b' = J(1, `p', 0)
 67         matrix colnames `b' = `cnames' `cons'
 68         _ms_omit_info `b'
 69         matrix `mo' = r(omit)
 70
 71         return local  cnames "`cnames'"
 72         return matrix mo = `mo'
 73 end
 74
 75 mata:
 76
 77 void mywork( string scalar depvar,  string scalar indepvars,
 78              string scalar touse,   string scalar constant,
 79              string scalar bname,   string scalar Vname,
 80              string scalar nname,   string scalar rname,
 81              string scalar mo,      string scalar vcetype,
 82              string scalar clustervar)
 83 {
 84         real vector y, b
 85         real matrix X, V, Ct
 86         real scalar n, p, rank
 87
 88         y = st_data(., depvar, touse)
 89         n = rows(y)
 90         X = st_data(., indepvars, touse)
 91         if (constant == "") {
 92                 X = X,J(n, 1, 1)
 93         }
 94         p  = cols(X)
 95         Ct = makeCt(mo)
 96
 97         S = optimize_init()
 98         optimize_init_argument(S, 1, y)
 99         optimize_init_argument(S, 2, X)
100         optimize_init_evaluator(S, &plleval3())
101         optimize_init_evaluatortype(S, "gf2")
102         optimize_init_params(S, J(1, p, .01))
103         optimize_init_constraints(S, Ct)
104         b = optimize(S)
105
106         if (vcetype == "robust") {
107                 V = optimize_result_V_robust(S)
108         }
109         else if (vcetype == "cluster") {
110                 cvar = st_data(., clustervar, touse)
111                 optimize_init_cluster(S, cvar)
112                 V = optimize_result_V_robust(S)
113         }
114         else {                             // vcetype must be IID
115                 V = optimize_result_V_oim(S)
116         }
117         rank = p - diag0cnt(invsym(V))
118
119         st_matrix(bname, b)
120         st_matrix(Vname, V)
121         st_numscalar(nname, n)
122         st_numscalar(rname, rank)
123 }
124
125 real matrix makeCt(string scalar mo)
126 {
127         real vector mo_v
128         real scalar ko, j, p
129
130         mo_v = st_matrix(mo)
131         p    = cols(mo_v)
132         ko   = sum(mo_v)
133         if (ko>0) {
134                 Ct   = J(0, p, .)
135                 for(j=1; j<=p; j++) {
136                         if (mo_v[j]==1) {
137                                 Ct = Ct \ e(j, p)
138                         }
139                 }
140                 Ct = Ct, J(ko, 1, 0)
141         }
142         else {
143                 Ct = J(0,p+1,.)
144         }
145         return(Ct)
146 }
147
148 void plleval3(real scalar todo, real vector b,  ///
149         real vector y, real matrix X,           ///
150         val, grad, hess)
151 {
152         real vector  xb, mu
153
154         xb  = X*b'
155         mu  = exp(xb)
156         val = (-mu + y:*xb - lnfactorial(y))
157
158         if (todo>=1) {
159                 grad = (y - mu):*X
160         }
161         if (todo==2) {
162                 hess = -quadcross(X, mu, X)
163         }
164 }
165 end
Only a few lines of mypoisson4.ado differ from their counterparts in mypoisson3.ado. Line 101 of mypoisson4.ado specifies a gf2 evaluator type, while the corresponding line of mypoisson3.ado specifies a gf0 evaluator type. Lines 158–163 of mypoisson4.ado compute the gradient and the Hessian analytically; they have no counterparts in mypoisson3.ado.
The output in examples 5 and 6 confirms that mypoisson4 produces the same results as poisson when the option vce(cluster id) is specified.
Example 5: mypoisson4 results
. mypoisson4 accidents cvalue kids traffic , vce(cluster id)
Iteration 0:   f(p) = -851.18669
Iteration 1:   f(p) = -556.66855
Iteration 2:   f(p) = -555.81731
Iteration 3:   f(p) = -555.81538
Iteration 4:   f(p) = -555.81538

                                     (Std. Err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cvalue |  -.6558871   .1125223    -5.83   0.000    -.8764267   -.4353475
        kids |  -1.009017   .1805639    -5.59   0.000    -1.362916   -.6551182
     traffic |   .1467115    .092712     1.58   0.114    -.0350008    .3284237
       _cons |   .5743541   .6238015     0.92   0.357    -.6482744    1.796983
------------------------------------------------------------------------------
Example 6: poisson results
. poisson accidents cvalue kids traffic , vce(cluster id)
Iteration 0:   log pseudolikelihood = -555.86605
Iteration 1:   log pseudolikelihood =  -555.8154
Iteration 2:   log pseudolikelihood = -555.81538

Poisson regression                              Number of obs     =        505
                                                Wald chi2(3)      =     103.53
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -555.81538               Pseudo R2         =     0.2343

                                  (Std. Err. adjusted for 285 clusters in id)
------------------------------------------------------------------------------
             |               Robust
   accidents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cvalue |  -.6558871   .1125223    -5.83   0.000    -.8764266   -.4353475
        kids |  -1.009017   .1805639    -5.59   0.000    -1.362915   -.6551181
     traffic |   .1467115    .092712     1.58   0.114    -.0350008    .3284237
       _cons |    .574354   .6238015     0.92   0.357    -.6482744    1.796982
------------------------------------------------------------------------------
Done and undone
I showed how to compute derivatives analytically when using optimize(), and I included analytically computed derivatives in mypoisson4.ado. In my next post, I show how to make predict work after mypoisson4.