Programming an estimation command in Stata: Nonlinear least-squares estimators
\(\newcommand{\xb}{{\bf x}}
\newcommand{\gb}{{\bf g}}
\newcommand{\Hb}{{\bf H}}
\newcommand{\Gb}{{\bf G}}
\newcommand{\Eb}{{\bf E}}
\newcommand{\betab}{\boldsymbol{\beta}}\)
I want to write ado-commands to estimate the parameters of an exponential conditional mean (ECM) model and a probit conditional mean (PCM) model by nonlinear least squares (NLS). Before I can write these commands, I need to show how to trick optimize() into performing the Gauss–Newton algorithm and apply this trick to these two problems.
This is the 26th post in the series Programming an estimation command in Stata. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.
Gauss–Newton algorithm
Gauss–Newton algorithms frequently perform better than other Newton-type algorithms for solving NLS minimization problems, because they use an expected Hessian instead of a full Hessian.
Recall that Newton-type algorithms get the next guess of parameter estimates by an update rule of the form
$$
\betab_{s+1} = \betab_s - \lambda\Hb_s^{-1}\gb_s
$$
as I discussed in Programming an estimation command in Stata: A review of nonlinear optimization using Mata. The objective function in NLS problems is
$$
\min_{\betab} \frac{1}{2} \sum_{i=1}^N \left[y_i-f(\xb_i,\betab)\right]^2
$$
The Gauss–Newton algorithm uses
$$
\betab_{s+1} = \betab_s - \lambda\Gb_s^{-1}\gb_s
$$
where
$$
\Gb_s = -
\sum_{i=1}^N \frac{\partial f(\xb_i,\betab)}{\partial \betab}
\frac{\partial f(\xb_i,\betab)}{\partial \betab\,'}
\hspace{1cm}\mbox{with } \betab \mbox{ evaluated at } \betab_s
$$
Using \(\Gb_s\) instead of \(\Hb_s\) causes Gauss–Newton to perform better than other Newton-type algorithms because the first term in
\begin{align*}
\Hb_s &=
\sum_{i=1}^N \left[y_i-f(\xb_i,\betab)\right]
\frac{\partial^2 f(\xb_i,\betab)}{\partial\betab\,\partial \betab\,'}
-
\sum_{i=1}^N \frac{\partial f(\xb_i,\betab)}{\partial \betab}
\frac{\partial f(\xb_i,\betab)}{\partial \betab\,'}
\end{align*}
with \(\betab\) evaluated at \(\betab_s\) has mean \({\bf 0}\) when the conditional mean is correctly specified. Dropping this zero-mean term also guarantees that \(\Gb_s\) is negative semidefinite, so the update rule always moves in an ascent direction. Much of the literature on this algorithm exploits the fact that the step \(-\Gb_s^{-1}\gb_s\) can be obtained as the coefficients from an OLS regression of the residuals \([y_i-f(\xb_i,\betab)]\) on the columns of \(\frac{\partial f(\xb_i,\betab)}{\partial \betab\,'}\) (with \(\betab\) evaluated at \(\betab_s\)) because
$$
\gb_s = \sum_{i=1}^N \left[y_i-f(\xb_i,\betab)\right]
\frac{\partial f(\xb_i,\betab)}{\partial \betab\,'}
\hspace{1cm}\mbox{with } \betab \mbox{ evaluated at } \betab_s
$$
See Cameron and Trivedi (2005, section 10.3.6) and Wooldridge (2010, section 12.7.3) for examples of this approach. While this approach is used in nl, I will trick optimize() into doing Gauss–Newton by specifying \(\Gb_s\) instead of \(\Hb_s\) as my Hessian.
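To see the mechanics in one place, here is a minimal Mata sketch; it is not part of the do-files discussed below, and the matrix X, the vector y, and the current guess bs are made-up placeholders. It computes one Gauss–Newton step for the ECM model from \(\Gb_s\) and \(\gb_s\) and verifies that the step equals the OLS coefficients from regressing the residuals on the columns of the Jacobian.

mata:
// Illustration only: made-up data and a made-up current guess bs
X  = (1, 2, 1 \ 3, 1, 1 \ 2, 2, 1 \ 4, 1, 1)
y  = (2 \ 5 \ 4 \ 9)
bs = (.1, .2, .1)

f  = exp(X*bs')                    // f(x_i, b) for the ECM model below
r  = y - f                         // residuals
df = f:*X                          // row i holds the derivatives of f(x_i, b)

G    = -quadcross(df, df)          // Gauss-Newton "Hessian" G_s
g    = quadcross(df, r)            // gradient g_s
step = -invsym(G)*g                // Gauss-Newton step -G_s^{-1} g_s

bols = invsym(quadcross(df, df))*quadcross(df, r)   // OLS of r on df

step, bols                         // the two columns are identical
end

With \(\lambda=1\), the next guess is bs' + step.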
ECM by NLS
In code block 1, I use optimize() to fit an ECM model to the accident data,
$$
\Eb[{\tt accidents}|{\tt cvalue},{\tt tickets}] =
\exp(\beta_1 {\tt cvalue} + \beta_2 {\tt tickets} + \beta_0)
$$
mata:
void MYNLExp(real scalar todo, real vector b,     ///
        real vector y, real matrix X,             ///
        val, grad, hess)
{
        real vector  r, f, xb
        real matrix  df

        xb   = X*b'
        f    = exp(X*b')
        r    = y - f
        val  = -(r:^2)
        df   = f:*X

        if (todo>=1) {
                grad = r:*df
        }
        if (todo==2) {
                hess = -1*quadcross(df, df)
        }
}
y = st_data(., "accidents")
X = st_data(., "cvalue tickets")
n = rows(y)
X = X, J(n, 1, 1)
p = cols(X)
S = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &MYNLExp())
optimize_init_conv_nrtol(S, 1e-11)
optimize_init_params(S, J(1, 3, .01))
optimize_init_evaluatortype(S, "gf2")
bh = optimize(S)
M  = invsym(-1*optimize_result_Hessian(S))
sb = (-1/(n-p))*optimize_result_value(S)
V  = sb*M
"Point estimates"
bh
"VCE"
V
end
Lines 2–21 define the evaluator function for the NLS problem. This code should be familiar from the Poisson regression command that I previously discussed. Note that line 19 sets the Hessian to \(\Gb_s\).
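The derivative on line 13 is just the chain rule applied to the exponential mean function; I state it explicitly here because the code does not:
$$
\frac{\partial f(\xb_i,\betab)}{\partial \betab\,'} =
\frac{\partial \exp(\xb_i\betab\,')}{\partial \betab\,'} =
\exp(\xb_i\betab\,')\,\xb_i
$$
so df = f:*X stores the derivative for observation \(i\) in row \(i\), and line 16 scales each row by the residual to produce the observation-level gradients that the gf2 evaluator type expects.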
Lines 22–26 copy the data from Stata into Mata and define n and p. Lines 27–34 use optimize() to solve the NLS problem. Lines 35–37 compute the estimator for the VCE based on correct specification and independently and identically distributed errors; see Wooldridge (2010, 417). Lines 38–41 display the results.
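In formulas, lines 35–37 compute (my rendering of the code, using the notation above)
$$
\widehat{\sigma}^2 = \frac{1}{N-p} \sum_{i=1}^N \left[y_i-f(\xb_i,\widehat{\betab})\right]^2
\hspace{1cm}
\widehat{\bf V} = \widehat{\sigma}^2\left(-\Gb\right)^{-1}
$$
where \(p\) is the number of parameters. The \(-1\) factors in the code undo the sign conventions used by the evaluator: optimize_result_value() returns \(-\sum_i \widehat{r}_i^2\), and optimize_result_Hessian() returns the negative semidefinite \(\Gb\) supplied on line 19.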
Example 1 runs this code.
Example 1: Gauss–Newton in optimize() for ECM model
. use accident3

. do nls1

. mata:
------------------------------------------------- mata (type end to exit) ------
: void MYNLExp(real scalar todo, real vector b,     ///
>         real vector y, real matrix X,             ///
>         val, grad, hess)
> {
>         real vector  r, f, xb
>         real matrix  df
>
>         xb   = X*b'
>         f    = exp(X*b')
>         r    = y - f
>         val  = -(r:^2)
>         df   = f:*X
>
>         if (todo>=1) {
>                 grad = r:*df
>         }
>         if (todo==2) {
>                 hess = -1*quadcross(df, df)
>         }
> }
note: variable xb set but not used

: y = st_data(., "accidents")
: X = st_data(., "cvalue tickets")
: n = rows(y)
: X = X, J(n, 1, 1)
: p = cols(X)
: S = optimize_init()
: optimize_init_argument(S, 1, y)
: optimize_init_argument(S, 2, X)
: optimize_init_evaluator(S, &MYNLExp())
: optimize_init_conv_nrtol(S, 1e-11)
: optimize_init_params(S, J(1, 3, .01))
: optimize_init_evaluatortype(S, "gf2")
: bh = optimize(S)
Iteration 0:   f(p) =  -2530.846
Iteration 1:   f(p) = -1116.4901
Iteration 2:   f(p) = -248.56923
Iteration 3:   f(p) = -225.91644
Iteration 4:   f(p) = -225.89573
Iteration 5:   f(p) = -225.89573
Iteration 6:   f(p) = -225.89573
: M = invsym(-1*optimize_result_Hessian(S))
: sb = (-1/(n-p))*optimize_result_value(S)
: V = sb*M
: "Point estimates"
  Point estimates
: bh
                 1             2             3
    +-------------------------------------------+
  1 |  .1759434081   1.447671532   -7.66060808  |
    +-------------------------------------------+
: "VCE"
  VCE
: V
[symmetric]
                  1              2              3
    +----------------------------------------------+
  1 |   .0010491815                                 |
  2 |  -.0000111792     .001112881                  |
  3 |  -.0019055019   -.0075744389    .0554943947   |
    +----------------------------------------------+
: end
--------------------------------------------------------------------------------

. end of do-file
For comparison, I use nl to replicate these results.
Example 2: ECM model by nl
. nl (accidents = exp({b1}*cvalue + {b2}*tickets + {b0})) , eps(1e-11)
(obs = 505)

Iteration 0:  residual SS =  1123.962
Iteration 1:  residual SS =  380.2557
Iteration 2:  residual SS =  232.1493
Iteration 3:  residual SS =  225.9077
Iteration 4:  residual SS =  225.8957
Iteration 5:  residual SS =  225.8957
Iteration 6:  residual SS =  225.8957
Iteration 7:  residual SS =  225.8957
Iteration 8:  residual SS =  225.8957
Iteration 9:  residual SS =  225.8957
Iteration 10: residual SS =  225.8957
Iteration 11: residual SS =  225.8957

      Source |       SS           df       MS
-------------+----------------------------------    Number of obs =        505
       Model |  2266.1043          3    755.368089   R-squared     =     0.9094
    Residual |  225.89573        502    .449991501   Adj R-squared =     0.9088
-------------+----------------------------------    Root MSE      =   .6708141
       Total |       2492        505    4.93465347   Res. dev.     =   1026.863

------------------------------------------------------------------------------
   accidents |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         /b1 |   .1759434   .0323911     5.43   0.000     .1123047    .2395822
         /b2 |   1.447671   .0333597    43.40   0.000      1.38213    1.513213
         /b0 |  -7.660608   .2355717   -32.52   0.000    -8.123436    -7.19778
------------------------------------------------------------------------------

. matrix list e(V)

symmetric e(V)[3,3]
                  b1:        b2:        b0:
                _cons      _cons      _cons
b1:_cons    .00104918
b2:_cons   -.00001118  .00111287
b0:_cons    -.0019055 -.00757439  .05549405
As expected, the point estimates and the estimates of the VCE are essentially the same.
I now implement the PCM model
$$
\Eb[{\tt hadaccident}|{\tt cvalue},{\tt tickets}] =
\Phi(\beta_1 {\tt cvalue} + \beta_2 {\tt tickets} + \beta_0)
$$
in code block 2, where \(\Phi()\) is the standard-normal cumulative distribution function.
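Relative to MYNLExp(), the only new derivative is that of the normal CDF; written out (standard calculus, stated here for clarity), it is
$$
\frac{\partial \Phi(\xb_i\betab\,')}{\partial \betab\,'} =
\phi(\xb_i\betab\,')\,\xb_i
$$
where \(\phi()\) is the standard-normal density function, which Mata's normalden() computes; this is what line 13 of code block 2 implements.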
mata:
void MYNLProbit(real scalar todo, real vector b,     ///
        real vector y, real matrix X,                ///
        val, grad, hess)
{
        real vector  r, f, xb
        real matrix  df

        xb   = X*b'
        f    = normal(xb)
        r    = y - f
        val  = -(r:^2)
        df   = normalden(xb):*X

        if (todo>=1) {
                grad = r:*df
        }
        if (todo==2) {
                hess = -1*quadcross(df, df)
        }
}
y = st_data(., "hadaccident")
X = st_data(., "cvalue tickets")
n = rows(y)
X = X, J(n, 1, 1)
p = cols(X)
S = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &MYNLProbit())
optimize_init_conv_nrtol(S, 1e-11)
optimize_init_params(S, J(1, 3, .01))
optimize_init_evaluatortype(S, "gf2")
bh = optimize(S)
M  = invsym(-1*optimize_result_Hessian(S))
sb = (-1/(n-p))*optimize_result_value(S)
V  = sb*M
"Point estimates"
bh
"VCE"
V
end
nls2.do is almost identical to nls1.do. The differences are that lines 2–21 define MYNLProbit() instead of MYNLExp(), that lines 10 and 13 define the mean function and its first derivative for the PCM model instead of for the ECM model, that line 22 specifies the binary dependent variable hadaccident instead of the accident count accidents, and that line 30 specifies MYNLProbit() instead of MYNLExp() as the evaluator function.
Example 3 runs this code.
Example 3: Gauss–Newton in optimize() for PCM model
. generate hadaccident = accidents>0

. do nls2

. mata:
------------------------------------------------- mata (type end to exit) ------
: void MYNLProbit(real scalar todo, real vector b,     ///
>         real vector y, real matrix X,                ///
>         val, grad, hess)
> {
>         real vector  r, f, xb
>         real matrix  df
>
>         xb   = X*b'
>         f    = normal(xb)
>         r    = y - f
>         val  = -(r:^2)
>         df   = normalden(xb):*X
>
>         if (todo>=1) {
>                 grad = r:*df
>         }
>         if (todo==2) {
>                 hess = -1*quadcross(df, df)
>         }
> }
: y = st_data(., "hadaccident")
: X = st_data(., "cvalue tickets")
: n = rows(y)
: X = X, J(n, 1, 1)
: p = cols(X)
: S = optimize_init()
: optimize_init_argument(S, 1, y)
: optimize_init_argument(S, 2, X)
: optimize_init_evaluator(S, &MYNLProbit())
: optimize_init_conv_nrtol(S, 1e-11)
: optimize_init_params(S, J(1, 3, .01))
: optimize_init_evaluatortype(S, "gf2")
: bh = optimize(S)
Iteration 0:   f(p) = -132.90997
Iteration 1:   f(p) = -16.917203
Iteration 2:   f(p) = -10.995001
Iteration 3:   f(p) = -10.437501
Iteration 4:   f(p) = -10.427738
Iteration 5:   f(p) = -10.427156
Iteration 6:   f(p) = -10.427123
Iteration 7:   f(p) = -10.427121
Iteration 8:   f(p) =  -10.42712
Iteration 9:   f(p) =  -10.42712
Iteration 10:  f(p) =  -10.42712
Iteration 11:  f(p) =  -10.42712
: M = invsym(-1*optimize_result_Hessian(S))
: sb = (-1/(n-p))*optimize_result_value(S)
: V = sb*M
: "Point estimates"
  Point estimates
: bh
                  1              2              3
    +----------------------------------------------+
  1 |  .3616312823    2.177513743   -10.95168163   |
    +----------------------------------------------+
: "VCE"
  VCE
: V
[symmetric]
                  1              2              3
    +----------------------------------------------+
  1 |   .0084311958                                 |
  2 |   .0102702556    .0389739985                  |
  3 |  -.0648856339   -.2061800064    1.114408707   |
    +----------------------------------------------+
: end
--------------------------------------------------------------------------------

. end of do-file
For comparison, I use nl to replicate the results.
Example 4: PCM model by nl
. nl (hadaccident = normal({b1}*cvalue + {b2}*tickets + {b0})) , eps(1e-11)
(obs = 505)

Iteration 0:  residual SS =  29.15659
Iteration 1:  residual SS =  18.64138
Iteration 2:  residual SS =  12.69845
Iteration 3:  residual SS =  10.74838
Iteration 4:  residual SS =  10.44729
Iteration 5:  residual SS =  10.42855
Iteration 6:  residual SS =  10.42723
Iteration 7:  residual SS =  10.42713
Iteration 8:  residual SS =  10.42712
Iteration 9:  residual SS =  10.42712
Iteration 10: residual SS =  10.42712
Iteration 11: residual SS =  10.42712
Iteration 12: residual SS =  10.42712
Iteration 13: residual SS =  10.42712
Iteration 14: residual SS =  10.42712
Iteration 15: residual SS =  10.42712
Iteration 16: residual SS =  10.42712
Iteration 17: residual SS =  10.42712
Iteration 18: residual SS =  10.42712
Iteration 19: residual SS =  10.42712
Iteration 20: residual SS =  10.42712
Iteration 21: residual SS =  10.42712
Iteration 22: residual SS =  10.42712

      Source |       SS           df       MS
-------------+----------------------------------    Number of obs =        505
       Model |   49.57288          3    16.5242932   R-squared     =     0.8262
    Residual |   10.42712        502    .020771156   Adj R-squared =     0.8252
-------------+----------------------------------    Root MSE      =    .144122
       Total |         60        505    .118811881   Res. dev.     =   -526.347

------------------------------------------------------------------------------
 hadaccident |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         /b1 |    .361632   .0918214     3.94   0.000     .1812304    .5420337
         /b2 |   2.177515   .1974173    11.03   0.000     1.789649    2.565381
         /b0 |  -10.95169    1.05565   -10.37   0.000    -13.02572   -8.877652
------------------------------------------------------------------------------

. matrix list e(V)

symmetric e(V)[3,3]
                  b1:        b2:        b0:
                _cons      _cons      _cons
b1:_cons    .00843118
b2:_cons    .01027015  .03897358
b0:_cons   -.06488509 -.20617778   1.1143968
As expected, the point estimates and the estimates of the VCE are essentially the same.
Done and undone
I showed how to trick optimize() into performing the Gauss–Newton algorithm and how to compute an estimator of the VCE based on correct specification and independently and identically distributed errors. In the next post, I discuss ado-commands that implement these estimators.
References
Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, Massachusetts: MIT Press.