Home > Programming > Programming an estimation command in Stata: Nonlinear least-squares estimators

Programming an estimation command in Stata: Nonlinear least-squares estimators

\(\newcommand{\xb}{{\bf x}}
\newcommand{\gb}{{\bf g}}
\newcommand{\Hb}{{\bf H}}
\newcommand{\Gb}{{\bf G}}
\newcommand{\Eb}{{\bf E}}
\newcommand{\betab}{\boldsymbol{\beta}}\)I want to write ado-commands to estimate the parameters of an exponential conditional mean (ECM) model and probit conditional mean (PCM) model by nonlinear least squares (NLS). Before I can write these commands, I need to show how to trick optimize() into performing the Gauss–Newton algorithm and apply this trick to these two problems.

This is the 26th post in the series Programming an estimation command in Stata. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.

Gauss–Newton algorithm

Gauss–Newton algorithms frequently perform better than other Newton-type algorithms for solving NLS minimization problems, because they use an expected Hessian instead of a full Hessian.

Recall that Newton-type algorithms get the next guess of parameter estimates by an update rule of the form
$$
\betab_{s+1} = \betab_s – \lambda\Hb_s^{-1}\gb_s
$$
as I discussed in Programming an estimation command in Stata: A review of nonlinear optimization using Mata. The objective function in NLS problems is
$$
\min_{\betab} \frac{1}{2} \sum_{i=1}^n \left[y_i-f(\xb_i,\betab)\right]^2
$$

The Gauss–Newton algorithm uses
$$
\betab_{s+1} = \betab_s – \lambda\Gb_s^{-1}\gb_s
$$
where
$$\Gb_s =-
\sum_{i=1}^N \frac{\partial f(\xb_i,\betab)}{\partial \betab\,’}
\frac{\partial f(\xb_i,\betab)}{\partial \betab}
\hspace{1cm}\mbox{with } \betab \mbox{ evaluated at } \betab_s
$$
Using \(\Gb_s\) instead of \(\Hb_s\) causes Gauss–Newton to perform better than other Newton-type algorithms because the first term in
\begin{align*}
\Hb_s &=
\sum_{i=1}^N \left[y_i-f(\xb_i,\betab)\right]
\frac{\partial^2 f(\xb_i,\betab)}{\partial\betab\,’\partial \betab}

\sum_{i=1}^N \frac{\partial f(\xb_i,\betab)}{\partial \boldsymbol{\beta}\,’}
\frac{\partial f(\xb_i,\betab)}{\partial \betab}
\end{align*}
with \(\betab\) evaluated at \(\betab_s\) has mean \({\bf 0}\). Much of the literature on this algorithm exploits the fact that \(\Gb_s^{-1}\gb_s\) can be obtained by an OLS regression of the residuals \([y_i-f(\xb_i,\betab)]\) on the columns of \(\frac{\partial f(\xb_i,\betab)}{\partial \betab\,’}\) (with \(\betab\) evaluated at \(\betab_s\)) because
$$
\gb_s= \sum_{i=1}^N \left[y_i-f(\xb_i,\betab)\right]
\frac{\partial f(\xb_i,\betab)}{\partial \betab\,’}
\hspace{1cm}\mbox{ with } \betab \mbox{ evaluated at } \betab_s
$$
See Cameron and Trivedi (2005, section 10.3.6) and Wooldridge (2010, section 12.7.3) for examples of this approach. While this approach is used in nl, I will trick optimize() into doing Gauss–Newton by specifying \(\Gb_s\) instead of \(\Hb_s\) as my Hessian.

ECM by NLS

In code block 1, I use optimize() to fit the accident data to an ECM model,
$$
\Eb[{\tt accidents}|{\tt cvalue},{\tt tickets}] =
\exp(\beta_1 {\tt cvalue} + \beta_2 {\tt tickets} + \beta_0)
$$

Code block 1: nls1.do

mata:
void MYNLExp(real scalar todo, real vector b,   ///
        real vector y, real matrix X,           ///
        val, grad, hess)
{
        real vector  r, f, xb
        real matrix  df

        xb  = X*b'
        f   = exp(X*b')
        r   = y - f
        val = -(r:^2)
        df  = f:*X

        if (todo>=1) {
                grad = r:*df
        }
        if (todo==2) {
                hess = -1*quadcross(df, df)
        }
}
y = st_data(., "accidents")
X = st_data(., "cvalue tickets")
n = rows(y)
X = X, J(n, 1, 1)
p = cols(X)
S = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &MYNLExp())
optimize_init_conv_nrtol(S, 1e-11)
optimize_init_params(S, J(1, 3, .01))
optimize_init_evaluatortype(S, "gf2")
bh  = optimize(S)
M   = invsym(-1*optimize_result_Hessian(S))
sb  = (-1/(n-p))*optimize_result_value(S)
V  = sb*M
"Point estimates"
bh
"VCE"
V
end

Lines 2–21 are the evaluator function for the NLS problem. This code should be familiar from the Poisson regression command that I previously discussed. Note that line 19 defines \(\Gb_s\) to the Hessian.

Lines 22–26 copy the data from Stata into Mata. Lines 28–34 use optimize() to solve the NLS problem. Lines 35–37 compute the estimator for the VCE based on correct specification and errors that are independently and identically distributed; see Wooldridge (2010, 417). Lines 38–41 display the results.

Example 1 runs this code

Example 1: Gauss–Newton in optimize() for ECM model

. use accident3

. do nls1

. mata:
------------------------------------------------- mata (type end to exit) ------
: void MYNLExp(real scalar todo, real vector b,   ///
>         real vector y, real matrix X,           ///
>         val, grad, hess)
> {
>         real vector  r, f, xb
>         real matrix  df
>
>         xb  = X*b'
>         f   = exp(X*b')
>         r   = y - f
>         val = -(r:^2)
>         df  = f:*X
>
>         if (todo>=1) {
>                 grad = r:*df
>         }
>         if (todo==2) {
>                 hess = -1*quadcross(df, df)
>         }
> }
note: variable xb set but not used

: y = st_data(., "accidents")

: X = st_data(., "cvalue tickets")

: n = rows(y)

: X = X, J(n, 1, 1)

: p = cols(X)

: S = optimize_init()

: optimize_init_argument(S, 1, y)

: optimize_init_argument(S, 2, X)

: optimize_init_evaluator(S, &MYNLExp())

: optimize_init_conv_nrtol(S, 1e-11)

: optimize_init_params(S, J(1, 3, .01))

: optimize_init_evaluatortype(S, "gf2")

: bh  = optimize(S)
Iteration 0:   f(p) =  -2530.846
Iteration 1:   f(p) = -1116.4901
Iteration 2:   f(p) = -248.56923
Iteration 3:   f(p) = -225.91644
Iteration 4:   f(p) = -225.89573
Iteration 5:   f(p) = -225.89573
Iteration 6:   f(p) = -225.89573

: M   = invsym(-1*optimize_result_Hessian(S))

: sb  = (-1/(n-p))*optimize_result_value(S)

: V  = sb*M

: "Point estimates"
  Point estimates

: bh
                 1             2             3
    +-------------------------------------------+
  1 |  .1759434081   1.447671532   -7.66060808  |
    +-------------------------------------------+

: "VCE"
  VCE

: V
[symmetric]
                  1              2              3
    +----------------------------------------------+
  1 |   .0010491815                                |
  2 |  -.0000111792     .001112881                 |
  3 |  -.0019055019   -.0075744389    .0554943947  |
    +----------------------------------------------+

: end
--------------------------------------------------------------------------------

.
end of do-file

For comparison, I use nl to replicate these results.

Example 2: ECM model by nl

. nl (accidents = exp({b1}*cvalue + {b2}*tickets + {b0})) , eps(1e-11)
(obs = 505)

Iteration 0:  residual SS =  1123.962
Iteration 1:  residual SS =  380.2557
Iteration 2:  residual SS =  232.1493
Iteration 3:  residual SS =  225.9077
Iteration 4:  residual SS =  225.8957
Iteration 5:  residual SS =  225.8957
Iteration 6:  residual SS =  225.8957
Iteration 7:  residual SS =  225.8957
Iteration 8:  residual SS =  225.8957
Iteration 9:  residual SS =  225.8957
Iteration 10:  residual SS =  225.8957
Iteration 11:  residual SS =  225.8957


    Source |      SS            df       MS
-----------+----------------------------------    Number of obs =        505
     Model |  2266.1043          3  755.368089    R-squared     =     0.9094
  Residual |  225.89573        502  .449991501    Adj R-squared =     0.9088
-----------+----------------------------------    Root MSE      =   .6708141
     Total |       2492        505  4.93465347    Res. dev.     =   1026.863

----------------------------------------------------------------------------
 accidents |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------+----------------------------------------------------------------
       /b1 |   .1759434   .0323911     5.43   0.000     .1123047    .2395822
       /b2 |   1.447671   .0333597    43.40   0.000      1.38213    1.513213
       /b0 |  -7.660608   .2355717   -32.52   0.000    -8.123436    -7.19778
----------------------------------------------------------------------------

. matrix list e(V)

symmetric e(V)[3,3]
                  b1:         b2:         b0:
               _cons       _cons       _cons
b1:_cons   .00104918
b2:_cons  -.00001118   .00111287
b0:_cons   -.0019055  -.00757439   .05549405

As expected, the point estimates and the estimates of the VCE are essentially the same.

I now implement the PCM model
$$
\Eb[{\tt hadaccident}|{\tt cvalue},{\tt tickets}] =
\Phi(\beta_1 {\tt cvalue} + \beta_2 {\tt tickets} + \beta_0)
$$
in code block 2, where \(\Phi()\) is the standard-normal cumulative distribution function.

Code block 2: nls2.do

mata:
void MYNLProbit(real scalar todo, real vector b,        ///
        real vector y, real matrix X,           ///
        val, grad, hess)
{
        real vector  r, f, xb
        real matrix  df

        xb  = X*b'
        f   = normal(xb)
        r   = y - f
        val = -(r:^2)
        df  = normalden(xb):*X

        if (todo>=1) {
                grad = r:*df
        }
        if (todo==2) {
                hess = -1*quadcross(df, df)
        }
}
y = st_data(., "hadaccident")
X = st_data(., "cvalue tickets")
n = rows(y)
X = X, J(n, 1, 1)
p = cols(X)
S = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &MYNLProbit())
optimize_init_conv_nrtol(S, 1e-11)
optimize_init_params(S, J(1, 3, .01))
optimize_init_evaluatortype(S, "gf2")
bh  = optimize(S)
M   = invsym(-1*optimize_result_Hessian(S))
sb  = (-1/(n-p))*optimize_result_value(S)
V   = sb*M
"Point estimates"
bh
"VCE"
V
end

nls2.do is almost identical to nls1.do. The differences are that lines 2–21 define MYNLProbit() instead of MYNLExp(), that lines 10 and 13 define the function and the first derivative for the PCM model instead of for the ECM model, that line 22 specifies the binary dependent variable hadaccident instead of the accident count accidents, and that line 31 specifies that optimize() use MYNLProbit() instead of MYNLExp() as the evaluator function.

Example 3 runs this code

Example 3: Gauss–Newton in optimize() for PCM model

. generate hadaccident = accidents>0

. do nls2

. mata:
------------------------------------------------- mata (type end to exit) ------
: void MYNLProbit(real scalar todo, real vector b,        ///
>         real vector y, real matrix X,           ///
>         val, grad, hess)
> {
>         real vector  r, f, xb
>         real matrix  df
>
>         xb  = X*b'
>         f   = normal(xb)
>         r   = y - f
>         val = -(r:^2)
>         df  = normalden(xb):*X
>
>         if (todo>=1) {
>                 grad = r:*df
>         }
>         if (todo==2) {
>                 hess = -1*quadcross(df, df)
>         }
> }

: y = st_data(., "hadaccident")

: X = st_data(., "cvalue tickets")

: n = rows(y)

: X = X, J(n, 1, 1)

: p = cols(X)

: S = optimize_init()

: optimize_init_argument(S, 1, y)

: optimize_init_argument(S, 2, X)

: optimize_init_evaluator(S, &MYNLProbit())

: optimize_init_conv_nrtol(S, 1e-11)

: optimize_init_params(S, J(1, 3, .01))

: optimize_init_evaluatortype(S, "gf2")

: bh  = optimize(S)
Iteration 0:   f(p) = -132.90997
Iteration 1:   f(p) = -16.917203
Iteration 2:   f(p) = -10.995001
Iteration 3:   f(p) = -10.437501
Iteration 4:   f(p) = -10.427738
Iteration 5:   f(p) = -10.427156
Iteration 6:   f(p) = -10.427123
Iteration 7:   f(p) = -10.427121
Iteration 8:   f(p) =  -10.42712
Iteration 9:   f(p) =  -10.42712
Iteration 10:  f(p) =  -10.42712
Iteration 11:  f(p) =  -10.42712

: M   = invsym(-1*optimize_result_Hessian(S))

: sb  = (-1/(n-p))*optimize_result_value(S)

: V   = sb*M

: "Point estimates"
  Point estimates

: bh
                  1              2              3
    +----------------------------------------------+
  1 |   .3616312823    2.177513743   -10.95168163  |
    +----------------------------------------------+

: "VCE"
  VCE

: V
[symmetric]
                  1              2              3
    +----------------------------------------------+
  1 |   .0084311958                                |
  2 |   .0102702556    .0389739985                 |
  3 |  -.0648856339   -.2061800064    1.114408707  |
    +----------------------------------------------+

: end
--------------------------------------------------------------------------------

.
end of do-file

For comparison, I use nl to replicate the results.

Example 4: PCM model by nl

. nl (hadaccident = normal({b1}*cvalue + {b2}*tickets + {b0})) , eps(1e-11)
(obs = 505)

Iteration 0:  residual SS =  29.15659
Iteration 1:  residual SS =  18.64138
Iteration 2:  residual SS =  12.69845
Iteration 3:  residual SS =  10.74838
Iteration 4:  residual SS =  10.44729
Iteration 5:  residual SS =  10.42855
Iteration 6:  residual SS =  10.42723
Iteration 7:  residual SS =  10.42713
Iteration 8:  residual SS =  10.42712
Iteration 9:  residual SS =  10.42712
Iteration 10:  residual SS =  10.42712
Iteration 11:  residual SS =  10.42712
Iteration 12:  residual SS =  10.42712
Iteration 13:  residual SS =  10.42712
Iteration 14:  residual SS =  10.42712
Iteration 15:  residual SS =  10.42712
Iteration 16:  residual SS =  10.42712
Iteration 17:  residual SS =  10.42712
Iteration 18:  residual SS =  10.42712
Iteration 19:  residual SS =  10.42712
Iteration 20:  residual SS =  10.42712
Iteration 21:  residual SS =  10.42712
Iteration 22:  residual SS =  10.42712


    Source |      SS            df       MS
-----------+----------------------------------    Number of obs =        505
     Model |   49.57288          3  16.5242932    R-squared     =     0.8262
  Residual |   10.42712        502  .020771156    Adj R-squared =     0.8252
-----------+----------------------------------    Root MSE      =    .144122
     Total |         60        505  .118811881    Res. dev.     =   -526.347

----------------------------------------------------------------------------
hadaccident|      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------+----------------------------------------------------------------
       /b1 |    .361632   .0918214     3.94   0.000     .1812304    .5420337
       /b2 |   2.177515   .1974173    11.03   0.000     1.789649    2.565381
       /b0 |  -10.95169    1.05565   -10.37   0.000    -13.02572   -8.877652
----------------------------------------------------------------------------

. matrix list e(V)

symmetric e(V)[3,3]
                  b1:         b2:         b0:
               _cons       _cons       _cons
b1:_cons   .00843118
b2:_cons   .01027015   .03897358
b0:_cons  -.06488509  -.20617778   1.1143968

As expected, the point estimates and the estimates of the VCE are essentially the same.

Done and undone

I showed how to trick optimize() into performing the Gauss–Newton algorithm and how to compute an estimator of the VCE based on correct specification and independently and identically distributed errors. In the next post, I discuss ado-commands that implement these estimators.

References

Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.

Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, Massachusetts: MIT Press.