## Programming an estimation command in Stata: Nonlinear least-squares estimators

\(\newcommand{\xb}{{\bf x}}

\newcommand{\gb}{{\bf g}}

\newcommand{\Hb}{{\bf H}}

\newcommand{\Gb}{{\bf G}}

\newcommand{\Eb}{{\bf E}}

\newcommand{\betab}{\boldsymbol{\beta}}\)I want to write ado-commands to estimate the parameters of an exponential conditional mean (ECM) model and probit conditional mean (PCM) model by nonlinear least squares (NLS). Before I can write these commands, I need to show how to trick **optimize()** into performing the Gauss–Newton algorithm and apply this trick to these two problems.

This is the 26th post in the series **Programming an estimation command in Stata**. I recommend that you start at the beginning. See **Programming an estimation command in Stata: A map to posted entries** for a map to all the posts in this series.

**Gauss–Newton algorithm**

Gauss–Newton algorithms frequently perform better than other Newton-type algorithms for solving NLS minimization problems, because they use an expected Hessian instead of a full Hessian.

Recall that Newton-type algorithms get the next guess of parameter estimates by an update rule of the form

$$

\betab_{s+1} = \betab_s – \lambda\Hb_s^{-1}\gb_s

$$

as I discussed in **Programming an estimation command in Stata: A review of nonlinear optimization using Mata**. The objective function in NLS problems is

$$

\min_{\betab} \frac{1}{2} \sum_{i=1}^n \left[y_i-f(\xb_i,\betab)\right]^2

$$

The Gauss–Newton algorithm uses

$$

\betab_{s+1} = \betab_s – \lambda\Gb_s^{-1}\gb_s

$$

where

$$\Gb_s =-

\sum_{i=1}^N \frac{\partial f(\xb_i,\betab)}{\partial \betab\,’}

\frac{\partial f(\xb_i,\betab)}{\partial \betab}

\hspace{1cm}\mbox{with } \betab \mbox{ evaluated at } \betab_s

$$

Using \(\Gb_s\) instead of \(\Hb_s\) causes Gauss–Newton to perform better than other Newton-type algorithms because the first term in

\begin{align*}

\Hb_s &=

\sum_{i=1}^N \left[y_i-f(\xb_i,\betab)\right]

\frac{\partial^2 f(\xb_i,\betab)}{\partial\betab\,’\partial \betab}

–

\sum_{i=1}^N \frac{\partial f(\xb_i,\betab)}{\partial \boldsymbol{\beta}\,’}

\frac{\partial f(\xb_i,\betab)}{\partial \betab}

\end{align*}

with \(\betab\) evaluated at \(\betab_s\) has mean \({\bf 0}\). Much of the literature on this algorithm exploits the fact that \(\Gb_s^{-1}\gb_s\) can be obtained by an OLS regression of the residuals \([y_i-f(\xb_i,\betab)]\) on the columns of \(\frac{\partial f(\xb_i,\betab)}{\partial \betab\,’}\) (with \(\betab\) evaluated at \(\betab_s\)) because

$$

\gb_s= \sum_{i=1}^N \left[y_i-f(\xb_i,\betab)\right]

\frac{\partial f(\xb_i,\betab)}{\partial \betab\,’}

\hspace{1cm}\mbox{ with } \betab \mbox{ evaluated at } \betab_s

$$

See Cameron and Trivedi (2005, section 10.3.6) and Wooldridge (2010, section 12.7.3) for examples of this approach. While this approach is used in **nl**, I will trick **optimize()** into doing Gauss–Newton by specifying \(\Gb_s\) instead of \(\Hb_s\) as my Hessian.

**ECM by NLS**

In code block 1, I use **optimize()** to fit the accident data to an ECM model,

$$

\Eb[{\tt accidents}|{\tt cvalue},{\tt tickets}] =

\exp(\beta_1 {\tt cvalue} + \beta_2 {\tt tickets} + \beta_0)

$$

**Code block 1: nls1.do**

mata: void MYNLExp(real scalar todo, real vector b, /// real vector y, real matrix X, /// val, grad, hess) { real vector r, f, xb real matrix df xb = X*b' f = exp(X*b') r = y - f val = -(r:^2) df = f:*X if (todo>=1) { grad = r:*df } if (todo==2) { hess = -1*quadcross(df, df) } } y = st_data(., "accidents") X = st_data(., "cvalue tickets") n = rows(y) X = X, J(n, 1, 1) p = cols(X) S = optimize_init() optimize_init_argument(S, 1, y) optimize_init_argument(S, 2, X) optimize_init_evaluator(S, &MYNLExp()) optimize_init_conv_nrtol(S, 1e-11) optimize_init_params(S, J(1, 3, .01)) optimize_init_evaluatortype(S, "gf2") bh = optimize(S) M = invsym(-1*optimize_result_Hessian(S)) sb = (-1/(n-p))*optimize_result_value(S) V = sb*M "Point estimates" bh "VCE" V end

Lines 2–21 are the evaluator function for the NLS problem. This code should be familiar from the Poisson regression command that I previously discussed. Note that line 19 defines \(\Gb_s\) to the Hessian.

Lines 22–26 copy the data from Stata into Mata. Lines 28–34 use **optimize()** to solve the NLS problem. Lines 35–37 compute the estimator for the VCE based on correct specification and errors that are independently and identically distributed; see Wooldridge (2010, 417). Lines 38–41 display the results.

Example 1 runs this code

**Example 1: Gauss–Newton in optimize() for ECM model**

. use accident3 . do nls1 . mata: ------------------------------------------------- mata (type end to exit) ------ : void MYNLExp(real scalar todo, real vector b, /// > real vector y, real matrix X, /// > val, grad, hess) > { > real vector r, f, xb > real matrix df > > xb = X*b' > f = exp(X*b') > r = y - f > val = -(r:^2) > df = f:*X > > if (todo>=1) { > grad = r:*df > } > if (todo==2) { > hess = -1*quadcross(df, df) > } > } note: variable xb set but not used : y = st_data(., "accidents") : X = st_data(., "cvalue tickets") : n = rows(y) : X = X, J(n, 1, 1) : p = cols(X) : S = optimize_init() : optimize_init_argument(S, 1, y) : optimize_init_argument(S, 2, X) : optimize_init_evaluator(S, &MYNLExp()) : optimize_init_conv_nrtol(S, 1e-11) : optimize_init_params(S, J(1, 3, .01)) : optimize_init_evaluatortype(S, "gf2") : bh = optimize(S) Iteration 0: f(p) = -2530.846 Iteration 1: f(p) = -1116.4901 Iteration 2: f(p) = -248.56923 Iteration 3: f(p) = -225.91644 Iteration 4: f(p) = -225.89573 Iteration 5: f(p) = -225.89573 Iteration 6: f(p) = -225.89573 : M = invsym(-1*optimize_result_Hessian(S)) : sb = (-1/(n-p))*optimize_result_value(S) : V = sb*M : "Point estimates" Point estimates : bh 1 2 3 +-------------------------------------------+ 1 | .1759434081 1.447671532 -7.66060808 | +-------------------------------------------+ : "VCE" VCE : V [symmetric] 1 2 3 +----------------------------------------------+ 1 | .0010491815 | 2 | -.0000111792 .001112881 | 3 | -.0019055019 -.0075744389 .0554943947 | +----------------------------------------------+ : end -------------------------------------------------------------------------------- . end of do-file

For comparison, I use **nl** to replicate these results.

**Example 2: ECM model by nl**

. nl (accidents = exp({b1}*cvalue + {b2}*tickets + {b0})) , eps(1e-11) (obs = 505) Iteration 0: residual SS = 1123.962 Iteration 1: residual SS = 380.2557 Iteration 2: residual SS = 232.1493 Iteration 3: residual SS = 225.9077 Iteration 4: residual SS = 225.8957 Iteration 5: residual SS = 225.8957 Iteration 6: residual SS = 225.8957 Iteration 7: residual SS = 225.8957 Iteration 8: residual SS = 225.8957 Iteration 9: residual SS = 225.8957 Iteration 10: residual SS = 225.8957 Iteration 11: residual SS = 225.8957 Source | SS df MS -----------+---------------------------------- Number of obs = 505 Model | 2266.1043 3 755.368089 R-squared = 0.9094 Residual | 225.89573 502 .449991501 Adj R-squared = 0.9088 -----------+---------------------------------- Root MSE = .6708141 Total | 2492 505 4.93465347 Res. dev. = 1026.863 ---------------------------------------------------------------------------- accidents | Coef. Std. Err. t P>|t| [95% Conf. Interval] -----------+---------------------------------------------------------------- /b1 | .1759434 .0323911 5.43 0.000 .1123047 .2395822 /b2 | 1.447671 .0333597 43.40 0.000 1.38213 1.513213 /b0 | -7.660608 .2355717 -32.52 0.000 -8.123436 -7.19778 ---------------------------------------------------------------------------- . matrix list e(V) symmetric e(V)[3,3] b1: b2: b0: _cons _cons _cons b1:_cons .00104918 b2:_cons -.00001118 .00111287 b0:_cons -.0019055 -.00757439 .05549405

As expected, the point estimates and the estimates of the VCE are essentially the same.

I now implement the PCM model

$$

\Eb[{\tt hadaccident}|{\tt cvalue},{\tt tickets}] =

\Phi(\beta_1 {\tt cvalue} + \beta_2 {\tt tickets} + \beta_0)

$$

in code block 2, where \(\Phi()\) is the standard-normal cumulative distribution function.

**Code block 2: nls2.do**

mata: void MYNLProbit(real scalar todo, real vector b, /// real vector y, real matrix X, /// val, grad, hess) { real vector r, f, xb real matrix df xb = X*b' f = normal(xb) r = y - f val = -(r:^2) df = normalden(xb):*X if (todo>=1) { grad = r:*df } if (todo==2) { hess = -1*quadcross(df, df) } } y = st_data(., "hadaccident") X = st_data(., "cvalue tickets") n = rows(y) X = X, J(n, 1, 1) p = cols(X) S = optimize_init() optimize_init_argument(S, 1, y) optimize_init_argument(S, 2, X) optimize_init_evaluator(S, &MYNLProbit()) optimize_init_conv_nrtol(S, 1e-11) optimize_init_params(S, J(1, 3, .01)) optimize_init_evaluatortype(S, "gf2") bh = optimize(S) M = invsym(-1*optimize_result_Hessian(S)) sb = (-1/(n-p))*optimize_result_value(S) V = sb*M "Point estimates" bh "VCE" V end

**nls2.do** is almost identical to **nls1.do**. The differences are that lines 2–21 define **MYNLProbit()** instead of **MYNLExp()**, that lines 10 and 13 define the function and the first derivative for the PCM model instead of for the ECM model, that line 22 specifies the binary dependent variable **hadaccident** instead of the accident count **accidents**, and that line 31 specifies that **optimize()** use **MYNLProbit()** instead of **MYNLExp()** as the evaluator function.

Example 3 runs this code

**Example 3: Gauss–Newton in optimize() for PCM model**

. generate hadaccident = accidents>0 . do nls2 . mata: ------------------------------------------------- mata (type end to exit) ------ : void MYNLProbit(real scalar todo, real vector b, /// > real vector y, real matrix X, /// > val, grad, hess) > { > real vector r, f, xb > real matrix df > > xb = X*b' > f = normal(xb) > r = y - f > val = -(r:^2) > df = normalden(xb):*X > > if (todo>=1) { > grad = r:*df > } > if (todo==2) { > hess = -1*quadcross(df, df) > } > } : y = st_data(., "hadaccident") : X = st_data(., "cvalue tickets") : n = rows(y) : X = X, J(n, 1, 1) : p = cols(X) : S = optimize_init() : optimize_init_argument(S, 1, y) : optimize_init_argument(S, 2, X) : optimize_init_evaluator(S, &MYNLProbit()) : optimize_init_conv_nrtol(S, 1e-11) : optimize_init_params(S, J(1, 3, .01)) : optimize_init_evaluatortype(S, "gf2") : bh = optimize(S) Iteration 0: f(p) = -132.90997 Iteration 1: f(p) = -16.917203 Iteration 2: f(p) = -10.995001 Iteration 3: f(p) = -10.437501 Iteration 4: f(p) = -10.427738 Iteration 5: f(p) = -10.427156 Iteration 6: f(p) = -10.427123 Iteration 7: f(p) = -10.427121 Iteration 8: f(p) = -10.42712 Iteration 9: f(p) = -10.42712 Iteration 10: f(p) = -10.42712 Iteration 11: f(p) = -10.42712 : M = invsym(-1*optimize_result_Hessian(S)) : sb = (-1/(n-p))*optimize_result_value(S) : V = sb*M : "Point estimates" Point estimates : bh 1 2 3 +----------------------------------------------+ 1 | .3616312823 2.177513743 -10.95168163 | +----------------------------------------------+ : "VCE" VCE : V [symmetric] 1 2 3 +----------------------------------------------+ 1 | .0084311958 | 2 | .0102702556 .0389739985 | 3 | -.0648856339 -.2061800064 1.114408707 | +----------------------------------------------+ : end -------------------------------------------------------------------------------- . end of do-file

For comparison, I use **nl** to replicate the results.

**Example 4: PCM model by nl**

. nl (hadaccident = normal({b1}*cvalue + {b2}*tickets + {b0})) , eps(1e-11) (obs = 505) Iteration 0: residual SS = 29.15659 Iteration 1: residual SS = 18.64138 Iteration 2: residual SS = 12.69845 Iteration 3: residual SS = 10.74838 Iteration 4: residual SS = 10.44729 Iteration 5: residual SS = 10.42855 Iteration 6: residual SS = 10.42723 Iteration 7: residual SS = 10.42713 Iteration 8: residual SS = 10.42712 Iteration 9: residual SS = 10.42712 Iteration 10: residual SS = 10.42712 Iteration 11: residual SS = 10.42712 Iteration 12: residual SS = 10.42712 Iteration 13: residual SS = 10.42712 Iteration 14: residual SS = 10.42712 Iteration 15: residual SS = 10.42712 Iteration 16: residual SS = 10.42712 Iteration 17: residual SS = 10.42712 Iteration 18: residual SS = 10.42712 Iteration 19: residual SS = 10.42712 Iteration 20: residual SS = 10.42712 Iteration 21: residual SS = 10.42712 Iteration 22: residual SS = 10.42712 Source | SS df MS -----------+---------------------------------- Number of obs = 505 Model | 49.57288 3 16.5242932 R-squared = 0.8262 Residual | 10.42712 502 .020771156 Adj R-squared = 0.8252 -----------+---------------------------------- Root MSE = .144122 Total | 60 505 .118811881 Res. dev. = -526.347 ---------------------------------------------------------------------------- hadaccident| Coef. Std. Err. t P>|t| [95% Conf. Interval] -----------+---------------------------------------------------------------- /b1 | .361632 .0918214 3.94 0.000 .1812304 .5420337 /b2 | 2.177515 .1974173 11.03 0.000 1.789649 2.565381 /b0 | -10.95169 1.05565 -10.37 0.000 -13.02572 -8.877652 ---------------------------------------------------------------------------- . matrix list e(V) symmetric e(V)[3,3] b1: b2: b0: _cons _cons _cons b1:_cons .00843118 b2:_cons .01027015 .03897358 b0:_cons -.06488509 -.20617778 1.1143968

As expected, the point estimates and the estimates of the VCE are essentially the same.

**Done and undone**

I showed how to trick **optimize()** into performing the Gauss–Newton algorithm and how to compute an estimator of the VCE based on correct specification and independently and identically distributed errors. In the next post, I discuss ado-commands that implement these estimators.

**References**

Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.

Wooldridge, J. M. 2010. *Econometric Analysis of Cross Section and Panel Data*. 2nd ed. Cambridge, Massachusetts: MIT Press.