Programming an estimation command in Stata: Nonlinear least-squares estimators
\(\newcommand{\xb}{{\bf x}}
\newcommand{\gb}{{\bf g}}
\newcommand{\Hb}{{\bf H}}
\newcommand{\Gb}{{\bf G}}
\newcommand{\Eb}{{\bf E}}
\newcommand{\betab}{\boldsymbol{\beta}}\)I want to write ado-commands to estimate the parameters of an exponential conditional mean (ECM) model and probit conditional mean (PCM) model by nonlinear least squares (NLS). Before I can write these commands, I need to show how to trick optimize() into performing the Gauss–Newton algorithm and apply this trick to these two problems.
This is the 26th post in the series Programming an estimation command in Stata. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.
Gauss–Newton algorithm
Gauss–Newton algorithms frequently perform better than other Newton-type algorithms for solving NLS minimization problems, because they use an expected Hessian instead of a full Hessian.
Recall that Newton-type algorithms get the next guess of parameter estimates by an update rule of the form
$$
\betab_{s+1} = \betab_s - \lambda\Hb_s^{-1}\gb_s
$$
as I discussed in Programming an estimation command in Stata: A review of nonlinear optimization using Mata. The objective function in NLS problems is
$$
\min_{\betab} \frac{1}{2} \sum_{i=1}^N \left[y_i-f(\xb_i,\betab)\right]^2
$$
The Gauss–Newton algorithm uses
$$
\betab_{s+1} = \betab_s - \lambda\Gb_s^{-1}\gb_s
$$
where
$$
\Gb_s = -\sum_{i=1}^N \frac{\partial f(\xb_i,\betab)}{\partial \betab\,'}
\frac{\partial f(\xb_i,\betab)}{\partial \betab}
\hspace{1cm}\mbox{with } \betab \mbox{ evaluated at } \betab_s
$$
Using \(\Gb_s\) instead of \(\Hb_s\) causes Gauss–Newton to perform better than other Newton-type algorithms because the first term in
\begin{align*}
\Hb_s &=
\sum_{i=1}^N \left[y_i-f(\xb_i,\betab)\right]
\frac{\partial^2 f(\xb_i,\betab)}{\partial\betab\,'\partial \betab}
-
\sum_{i=1}^N \frac{\partial f(\xb_i,\betab)}{\partial \betab\,'}
\frac{\partial f(\xb_i,\betab)}{\partial \betab}
\end{align*}
with \(\betab\) evaluated at \(\betab_s\) has mean \({\bf 0}\). Much of the literature on this algorithm exploits the fact that \(-\Gb_s^{-1}\gb_s\) can be obtained by an OLS regression of the residuals \([y_i-f(\xb_i,\betab)]\) on the columns of \(\frac{\partial f(\xb_i,\betab)}{\partial \betab\,'}\) (with \(\betab\) evaluated at \(\betab_s\)) because
$$
\gb_s= \sum_{i=1}^N \left[y_i-f(\xb_i,\betab)\right]
\frac{\partial f(\xb_i,\betab)}{\partial \betab\,'}
\hspace{1cm}\mbox{with } \betab \mbox{ evaluated at } \betab_s
$$
See Cameron and Trivedi (2005, section 10.3.6) and Wooldridge (2010, section 12.7.3) for examples of this approach. While this approach is used in nl, I will trick optimize() into doing Gauss–Newton by specifying \(\Gb_s\) instead of \(\Hb_s\) as my Hessian.
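To make the regression interpretation concrete, below is a minimal Mata sketch of the step computation; the function name gnstep() is mine for illustration and is not part of the do-files that follow. Given the stacked residuals and the matrix whose rows hold the derivatives of \(f()\) for each observation, both evaluated at \(\betab_s\), the Gauss–Newton direction \(-\Gb_s^{-1}\gb_s\) is the coefficient vector from the OLS regression just described.
mata:
// Minimal sketch: compute the Gauss-Newton direction -G_s^{-1} g_s as
// the OLS coefficients from regressing the residuals on the columns
// of the derivative matrix.
//   r  : N x 1 vector of residuals y_i - f(x_i, b_s)
//   df : N x p matrix whose ith row holds the derivatives of
//        f(x_i, b) with respect to b, evaluated at b_s
real rowvector gnstep(real colvector r, real matrix df)
{
    // (df'df)^{-1} df'r, transposed to match the row-vector
    // parameter convention used by optimize()
    return((invsym(quadcross(df, df))*quadcross(df, r))')
}
end
With a step size of \(\lambda = 1\), the update is then \(\betab_{s+1} = \betab_s\) + gnstep(r, df).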
ECM by NLS
In code block 1, I use optimize() to fit an ECM model to the accident data,
$$
\Eb[{\tt accidents}|{\tt cvalue},{\tt tickets}] =
\exp(\beta_1 {\tt cvalue} + \beta_2 {\tt tickets} + \beta_0)
$$
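For this model, \(f(\xb_i,\betab)=\exp(\xb_i\betab)\), so the derivatives that the Gauss–Newton computations require are
$$
\frac{\partial f(\xb_i,\betab)}{\partial \betab\,'} = \exp(\xb_i\betab)\,\xb_i
$$
which the line df = f:*X in code block 1 computes for all the observations at once.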
Code block 1: nls1.do
mata:
void MYNLExp(real scalar todo, real vector b,   ///
    real vector y, real matrix X,               ///
    val, grad, hess)
{
    real vector r, f, xb
    real matrix df

    xb  = X*b'
    f   = exp(xb)
    r   = y - f
    val = -(r:^2)
    df  = f:*X

    if (todo>=1) {
        grad = r:*df
    }
    if (todo==2) {
        hess = -1*quadcross(df, df)
    }
}
y = st_data(., "accidents")
X = st_data(., "cvalue tickets")
n = rows(y)
X = X, J(n, 1, 1)
p = cols(X)
S = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &MYNLExp())
optimize_init_conv_nrtol(S, 1e-11)
optimize_init_params(S, J(1, 3, .01))
optimize_init_evaluatortype(S, "gf2")
bh = optimize(S)
M = invsym(-1*optimize_result_Hessian(S))
sb = (-1/(n-p))*optimize_result_value(S)
V = sb*M
"Point estimates"
bh
"VCE"
V
end
The evaluator function MYNLExp() in nls1.do defines the NLS problem. This code should be familiar from the Poisson regression command that I previously discussed. Note that the computation of hess defines the Hessian to be \(\Gb_s\) instead of \(\Hb_s\); this substitution is what makes optimize() perform the Gauss–Newton algorithm.
The rest of the code copies the data from Stata into Mata, uses optimize() to solve the NLS problem, computes the estimator of the VCE based on correct specification and errors that are independently and identically distributed (see Wooldridge 2010, 417), and displays the results.
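In the notation used above, the VCE estimator that the code computes is
$$
\widehat{\bf V} = \widehat{\sigma}^2
\left[\sum_{i=1}^N
\frac{\partial f(\xb_i,\widehat{\betab})}{\partial \betab\,'}
\frac{\partial f(\xb_i,\widehat{\betab})}{\partial \betab}
\right]^{-1}
\hspace{1cm}\mbox{with }
\widehat{\sigma}^2 = \frac{1}{N-p}\sum_{i=1}^N \left[y_i-f(\xb_i,\widehat{\betab})\right]^2
$$
The code recovers the bracketed sum as \(-1\) times the matrix returned by optimize_result_Hessian() and recovers \(\sum_i \widehat{r}_i^2\) as \(-1\) times the value returned by optimize_result_value().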
Example 1 runs this code.
Example 1: Gauss–Newton in optimize() for ECM model
. use accident3
. do nls1
. mata:
------------------------------------------------- mata (type end to exit) ------
: void MYNLExp(real scalar todo, real vector b,   ///
>     real vector y, real matrix X,               ///
>     val, grad, hess)
> {
>     real vector r, f, xb
>     real matrix df
> 
>     xb  = X*b'
>     f   = exp(xb)
>     r   = y - f
>     val = -(r:^2)
>     df  = f:*X
> 
>     if (todo>=1) {
>         grad = r:*df
>     }
>     if (todo==2) {
>         hess = -1*quadcross(df, df)
>     }
> }
: y = st_data(., "accidents")
: X = st_data(., "cvalue tickets")
: n = rows(y)
: X = X, J(n, 1, 1)
: p = cols(X)
: S = optimize_init()
: optimize_init_argument(S, 1, y)
: optimize_init_argument(S, 2, X)
: optimize_init_evaluator(S, &MYNLExp())
: optimize_init_conv_nrtol(S, 1e-11)
: optimize_init_params(S, J(1, 3, .01))
: optimize_init_evaluatortype(S, "gf2")
: bh = optimize(S)
Iteration 0: f(p) = -2530.846
Iteration 1: f(p) = -1116.4901
Iteration 2: f(p) = -248.56923
Iteration 3: f(p) = -225.91644
Iteration 4: f(p) = -225.89573
Iteration 5: f(p) = -225.89573
Iteration 6: f(p) = -225.89573
: M = invsym(-1*optimize_result_Hessian(S))
: sb = (-1/(n-p))*optimize_result_value(S)
: V = sb*M
: "Point estimates"
Point estimates
: bh
                  1              2              3
    +---------------------------------------------+
  1 |   .1759434081    1.447671532    -7.66060808 |
    +---------------------------------------------+
: "VCE"
VCE
: V
[symmetric]
                  1              2              3
    +---------------------------------------------+
  1 |   .0010491815                               |
  2 |  -.0000111792     .001112881                |
  3 |  -.0019055019   -.0075744389    .0554943947 |
    +---------------------------------------------+
: end
--------------------------------------------------------------------------------
.
end of do-file
For comparison, I use nl to replicate these results.
Example 2: ECM model by nl
. nl (accidents = exp({b1}*cvalue + {b2}*tickets + {b0})) , eps(1e-11)
(obs = 505)
Iteration 0: residual SS = 1123.962
Iteration 1: residual SS = 380.2557
Iteration 2: residual SS = 232.1493
Iteration 3: residual SS = 225.9077
Iteration 4: residual SS = 225.8957
Iteration 5: residual SS = 225.8957
Iteration 6: residual SS = 225.8957
Iteration 7: residual SS = 225.8957
Iteration 8: residual SS = 225.8957
Iteration 9: residual SS = 225.8957
Iteration 10: residual SS = 225.8957
Iteration 11: residual SS = 225.8957
     Source |       SS           df       MS
-----------+----------------------------------    Number of obs =        505
      Model |  2266.1043          3   755.368089  R-squared     =     0.9094
   Residual |  225.89573        502   .449991501  Adj R-squared =     0.9088
-----------+----------------------------------    Root MSE      =   .6708141
      Total |       2492        505   4.93465347  Res. dev.     =   1026.863
----------------------------------------------------------------------------
  accidents |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------+----------------------------------------------------------------
        /b1 |   .1759434   .0323911     5.43   0.000     .1123047    .2395822
        /b2 |   1.447671   .0333597    43.40   0.000      1.38213    1.513213
        /b0 |  -7.660608   .2355717   -32.52   0.000    -8.123436    -7.19778
----------------------------------------------------------------------------
. matrix list e(V)
symmetric e(V)[3,3]
                 b1:         b2:         b0:
              _cons       _cons       _cons
b1:_cons  .00104918
b2:_cons -.00001118   .00111287
b0:_cons  -.0019055  -.00757439   .05549405
As expected, the point estimates and the estimates of the VCE are essentially the same.
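One can also compare the two sets of results numerically instead of by eye. Assuming the Mata results bh and V from nls1.do are still in memory, mreldif() reports the maximum relative difference between them and the results stored by nl:
. mata: mreldif(st_matrix("e(b)"), bh)
. mata: mreldif(st_matrix("e(V)"), V)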
PCM by NLS
I now implement the PCM model
$$
\Eb[{\tt hadaccident}|{\tt cvalue},{\tt tickets}] =
\Phi(\beta_1 {\tt cvalue} + \beta_2 {\tt tickets} + \beta_0)
$$
in code block 2, where \(\Phi()\) is the standard-normal cumulative distribution function.
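With \(f(\xb_i,\betab)=\Phi(\xb_i\betab)\), the chain rule gives the derivatives
$$
\frac{\partial f(\xb_i,\betab)}{\partial \betab\,'} = \phi(\xb_i\betab)\,\xb_i
$$
where \(\phi()\) is the standard-normal density function; the line df = normalden(xb):*X in code block 2 computes them for all the observations.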
Code block 2: nls2.do
mata:
void MYNLProbit(real scalar todo, real vector b,   ///
    real vector y, real matrix X,                  ///
    val, grad, hess)
{
    real vector r, f, xb
    real matrix df

    xb  = X*b'
    f   = normal(xb)
    r   = y - f
    val = -(r:^2)
    df  = normalden(xb):*X

    if (todo>=1) {
        grad = r:*df
    }
    if (todo==2) {
        hess = -1*quadcross(df, df)
    }
}
y = st_data(., "hadaccident")
X = st_data(., "cvalue tickets")
n = rows(y)
X = X, J(n, 1, 1)
p = cols(X)
S = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &MYNLProbit())
optimize_init_conv_nrtol(S, 1e-11)
optimize_init_params(S, J(1, 3, .01))
optimize_init_evaluatortype(S, "gf2")
bh = optimize(S)
M = invsym(-1*optimize_result_Hessian(S))
sb = (-1/(n-p))*optimize_result_value(S)
V = sb*M
"Point estimates"
bh
"VCE"
V
end
nls2.do is almost identical to nls1.do. The differences are that it defines the evaluator function MYNLProbit() instead of MYNLExp(); that the lines f = normal(xb) and df = normalden(xb):*X define the conditional mean function and its first derivative for the PCM model instead of the ECM model; that it specifies the binary dependent variable hadaccident instead of the accident count accidents; and that it tells optimize() to use MYNLProbit() instead of MYNLExp() as the evaluator function.
Example 3 runs this code.
Example 3: Gauss–Newton in optimize() for PCM model
. generate hadaccident = accidents>0
. do nls2
. mata:
------------------------------------------------- mata (type end to exit) ------
: void MYNLProbit(real scalar todo, real vector b,   ///
>     real vector y, real matrix X,                  ///
>     val, grad, hess)
> {
>     real vector r, f, xb
>     real matrix df
> 
>     xb  = X*b'
>     f   = normal(xb)
>     r   = y - f
>     val = -(r:^2)
>     df  = normalden(xb):*X
> 
>     if (todo>=1) {
>         grad = r:*df
>     }
>     if (todo==2) {
>         hess = -1*quadcross(df, df)
>     }
> }
: y = st_data(., "hadaccident")
: X = st_data(., "cvalue tickets")
: n = rows(y)
: X = X, J(n, 1, 1)
: p = cols(X)
: S = optimize_init()
: optimize_init_argument(S, 1, y)
: optimize_init_argument(S, 2, X)
: optimize_init_evaluator(S, &MYNLProbit())
: optimize_init_conv_nrtol(S, 1e-11)
: optimize_init_params(S, J(1, 3, .01))
: optimize_init_evaluatortype(S, "gf2")
: bh = optimize(S)
Iteration 0: f(p) = -132.90997
Iteration 1: f(p) = -16.917203
Iteration 2: f(p) = -10.995001
Iteration 3: f(p) = -10.437501
Iteration 4: f(p) = -10.427738
Iteration 5: f(p) = -10.427156
Iteration 6: f(p) = -10.427123
Iteration 7: f(p) = -10.427121
Iteration 8: f(p) = -10.42712
Iteration 9: f(p) = -10.42712
Iteration 10: f(p) = -10.42712
Iteration 11: f(p) = -10.42712
: M = invsym(-1*optimize_result_Hessian(S))
: sb = (-1/(n-p))*optimize_result_value(S)
: V = sb*M
: "Point estimates"
Point estimates
: bh
                  1              2              3
    +---------------------------------------------+
  1 |   .3616312823    2.177513743   -10.95168163 |
    +---------------------------------------------+
: "VCE"
VCE
: V
[symmetric]
                  1              2              3
    +---------------------------------------------+
  1 |   .0084311958                               |
  2 |   .0102702556    .0389739985                |
  3 |  -.0648856339   -.2061800064    1.114408707 |
    +---------------------------------------------+
: end
--------------------------------------------------------------------------------
.
end of do-file
For comparison, I use nl to replicate the results.
Example 4: PCM model by nl
. nl (hadaccident = normal({b1}*cvalue + {b2}*tickets + {b0})) , eps(1e-11)
(obs = 505)
Iteration 0: residual SS = 29.15659
Iteration 1: residual SS = 18.64138
Iteration 2: residual SS = 12.69845
Iteration 3: residual SS = 10.74838
Iteration 4: residual SS = 10.44729
Iteration 5: residual SS = 10.42855
Iteration 6: residual SS = 10.42723
Iteration 7: residual SS = 10.42713
Iteration 8: residual SS = 10.42712
Iteration 9: residual SS = 10.42712
Iteration 10: residual SS = 10.42712
Iteration 11: residual SS = 10.42712
Iteration 12: residual SS = 10.42712
Iteration 13: residual SS = 10.42712
Iteration 14: residual SS = 10.42712
Iteration 15: residual SS = 10.42712
Iteration 16: residual SS = 10.42712
Iteration 17: residual SS = 10.42712
Iteration 18: residual SS = 10.42712
Iteration 19: residual SS = 10.42712
Iteration 20: residual SS = 10.42712
Iteration 21: residual SS = 10.42712
Iteration 22: residual SS = 10.42712
     Source |       SS           df       MS
-----------+----------------------------------    Number of obs =        505
      Model |   49.57288          3   16.5242932  R-squared     =     0.8262
   Residual |   10.42712        502   .020771156  Adj R-squared =     0.8252
-----------+----------------------------------    Root MSE      =    .144122
      Total |         60        505   .118811881  Res. dev.     =   -526.347
----------------------------------------------------------------------------
hadaccident |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------+----------------------------------------------------------------
        /b1 |    .361632   .0918214     3.94   0.000     .1812304    .5420337
        /b2 |   2.177515   .1974173    11.03   0.000     1.789649    2.565381
        /b0 |  -10.95169    1.05565   -10.37   0.000    -13.02572   -8.877652
----------------------------------------------------------------------------
. matrix list e(V)
symmetric e(V)[3,3]
                 b1:         b2:         b0:
              _cons       _cons       _cons
b1:_cons  .00843118
b2:_cons  .01027015   .03897358
b0:_cons -.06488509  -.20617778   1.1143968
As expected, the point estimates and the estimates of the VCE are essentially the same.
Done and undone
I showed how to trick optimize() into performing the Gauss–Newton algorithm and how to compute an estimator of the VCE based on correct specification and independently and identically distributed errors. In the next post, I discuss ado-commands that implement these estimators.
References
Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, Massachusetts: MIT Press.