## Programming an estimation command in Stata: A review of nonlinear optimization using Mata

\(\newcommand{\betab}{\boldsymbol{\beta}}

\newcommand{\xb}{{\bf x}}

\newcommand{\yb}{{\bf y}}

\newcommand{\gb}{{\bf g}}

\newcommand{\Hb}{{\bf H}}

\newcommand{\thetab}{\boldsymbol{\theta}}

\newcommand{\Xb}{{\bf X}}

\)I review the theory behind nonlinear optimization and get more practice in Mata programming by implementing an optimizer in Mata. In real problems, I recommend using the **optimize()** function or **moptimize()** function instead of the one I describe here. In subsequent posts, I will discuss **optimize()** and **moptimize()**. This post will help you develop your Mata programming skills and will improve your understanding of how **optimize()** and **moptimize()** work.

This is the seventeenth post in the series **Programming an estimation command in Stata**. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.

**A quick review of nonlinear optimization**

We want to maximize a real-valued function \(Q(\thetab)\), where \(\thetab\) is a \(p\times 1\) vector of parameters. Minimization is done by maximizing \(-Q(\thetab)\). We require that \(Q(\thetab)\) is twice, continuously differentiable, so that we can use a second-order Taylor series to approximate \(Q(\thetab)\) in a neighborhood of the point \(\thetab_s\),

\[

Q(\thetab) \approx Q(\thetab_s) + \gb_s'(\thetab -\thetab_s)

+ \frac{1}{2} (\thetab -\thetab_s)’\Hb_s (\thetab -\thetab_s)

\tag{1}

\]

where \(\gb_s\) is the \(p\times 1\) vector of first derivatives of \(Q(\thetab)\) evaluated at \(\thetab_s\) and \(\Hb_s\) is the \(p\times p\) matrix of second derivatives of \(Q(\thetab)\) evaluated at \(\thetab_s\), known as the Hessian matrix.

Nonlinear maximization algorithms start with a vector of initial values and produce a sequence of updated values that converge to the parameter vector that maximizes the objective function. The algorithms I discuss here can only find local maxima. The function in figure 1 has a local maximum at .2 and another at 1.5. The global maximum is at .2.

Each update is produced by finding the \(\thetab\) that maximizes the approximation on the right-hand side of equation (1) and letting it be \(\thetab_{s+1}\). To find the \(\thetab\) that maximizes the approximation, we set to \({\bf 0}\) the derivative of the right-hand side of equation (1) with respect to \(\thetab\),

\[

\gb_s + \Hb_s (\thetab -\thetab_s) = {\bf 0}

\tag{2}

\]

Replacing \(\thetab\) with \(\thetab_{s+1}\) and solving yields the update rule for \(\thetab_{s+1}\).

\[

\thetab_{s+1} = \thetab_s – \Hb_s^{-1} \gb_s

\tag{3}

\]

Note that the update is uniquely defined only if the Hessian matrix \(\Hb_s\) is full rank. To ensure that we have a local maximum, we will require that the Hessian be negative definite at the optimum, which also implies that the symmetric Hessian is full rank.

The update rule in equation (3) does not guarantee that \(Q(\thetab_{s+1})>Q(\thetab_s)\). We want to accept only those \(\thetab_{s+1}\) that do produce such an increase, so in practice, we use

\[

\thetab_{s+1} = \thetab_s – \lambda \Hb_s^{-1} \gb_s

\tag{4}

\]

where \(\lambda\) is the step size. In the algorithm presented here, we start with \(\lambda\) equal to \(1\) and, if necessary, decrease \(\lambda\) until we find a value that yields an increase.

The previous sentence is vague. I clarify it by writing an algorithm in Mata. Suppose that **real scalar Q( real vector theta )** is a Mata function that returns the value of the objective function at a value of the parameter vector **theta**. For the moment, suppose that **g_s** is the vector of derivatives at the current theta, denoted by **theta_s**, and that **Hi_s** is the inverse of the Hessian matrix at **theta_s**. These definitions allow us to define the update function

**Code block 1: Candidate rule for parameter vector**

real vector tupdate( /// real scalar lambda, /// real vector theta_s, /// real vector g_s, /// real matrix Hi_s) { return (theta_s - lambda*Hi_s*g_s) }

For specified values of **lambda**, **theta_s**, **g_s**, and **Hi_s**, **tupdate()** returns a candidate value for **theta_s1**. But we only accept candidate values of **theta_s1** that yield an increase, so instead of using **tupdate()** to get an update, we would use **GetUpdate()**.

**Code block 2: Update function for parameter vector**

real vector GetUpdate( /// real vector theta_s, /// real vector g_s, /// real matrix Hi_s) { lambda = 1 theta_s1 = tupdate(lambda, thetas, g_s, Hi_s) while ( Q(theta_s1) < Q(theta_s) ) { lambda = lambda/2 theta_s1 = tupdate(lambda, thetas, g_s, Hi_s) } return(theta_s1) }

**GetUpdate()** starts by getting a candidate value for **theta_s1** when **lambda = 1**. **GetUpdate()** returns this candidate **theta_s1** if it produces an increase in **Q()**. Otherwise, **GetUpdate()** divides **lambda** by 2 and gets another candidate **theta_s1** until it finds a candidate that produces an increase in **Q()**. **GetUpdate()** returns the first candidate that produces an increase in **Q()**.

While these functions clarify the ambiguities in the original vague statement, **GetUpdate()** makes the unwise assumption that there is always a **lambda** for which the candidate **theta_s1** produces an increase in **Q()**. The version of **GetUpdate()** in code block 3 does not make this assumption, it exits with an error if lambda is too small; less than \(10^{-11}\).

**Code block 3: A better update function for parameter vector**

real vector GetUpdate( /// real vector theta_s, /// real vector g_s, /// real matrix Hi_s) { lambda = 1 theta_s1 = tupdate(lambda, thetas, g_s, Hi_s) while ( Q(theta_s1) < Q(theta_s) ) { lambda = lambda/2 if (lambda < 1e-11) { printf("{red}Cannot find parameters that produce an increase.\n") exit(error(3360)) } theta_s1 = tupdate(lambda, thetas, g_s, Hi_s) } return(theta_s1) }

An outline of our algorithm for nonlinear optimization is the following:

- Select initial values for parameter vector.
- If current parameters set vector of derivatives of
**Q()**to zero, go to (3); otherwise go to (A).- Use
**GetUpdate()**to get new parameter values. - Calculate
**g_s**and**Hi_s**at parameter values from (A). - Go to (2).

- Use
- Display results.

Code block 4 contains a Mata version of this algorithm.

**Code block 4: Pseudocode for Newton–Raphson algorithm**

theta_s = J(p, 1, .01) GetDerives(theta_s, g_s, Hi_s.) gz = g_s'*Hi*g while (abs(gz) > 1e-13) { theta_s = GetUpdate(theta_s, g_s, Hi_s) GetDerives(theta_s, g_s, Hi_s) gz = g_s'*Hi_s*g_s printf("gz is now %8.7g\n", gz) } printf("Converged value of theta is\n") theta_s

Line 2 puts the vector of starting values, a \(p\times 1\) vector with each element equal to .01, in **theta_s**. Line 3 uses **GetDerives()** to put the vector of first derivatives into **g_s** and the inverse of the Hessian matrix into **Hi_s**. In **GetDerives()**, I use **cholinv()** to calculate **Hi_s**. **cholinv()** returns missing values if the matrix is not positive definite. By calculating **Hi_s = -1*cholinv(-H_s)**, I ensure that **Hi_s** contains missing values when the Hessian is not negative definite and full rank.

Line 3 calculates how different the vector of first derivatives is from 0. Instead of using a sum of squares, obtainable by **g_s’g_s**, I weight the first derivatives by the inverse of the Hessian matrix, which puts the \(p\) first derivatives on a similar scale and ensures that the Hessian matrix is negative definite at convergence. (If the Hessian matrix is not negative definite, **GetDerives()** will put a matrix of missing values into **Hi_s**, which causes **gz=.**, which will exceed the tolerance.)

To flush out the details we need a specific problem. Consider maximizing the log-likelihood function of a Poisson model, which has a simple functional form. The contribution of each observation to the log-likelihood is

\[

f_i(\betab) = y_i{\bf x_i}\betab – \exp({\bf x}_i\betab) – \ln( y_i !)

\]

where \(y_i\) is the dependent variable, \({\bf x}_i\) is the vector of covariates, and \(\betab\) is the vector of parameters that we select to maximize the log-likelihood function given by \(F(\betab) =\sum_i f_i(\betab)\). I could drop **ln(y_i!)**, because it does not depend on the parameters. I include it to make the value of the log-likelihood function the same as that reported by Stata. Stata includes these terms so that log-likelihood-function values are comparable across models.

The **pll()** function in code block 5 computes the Poisson log-likelihood function from the vector of observations on the dependent variable **y**, the matrix of observations on the covariates **X**, and the vector of parameter values **b**.

**Code block 5: A function for the Poisson log-likelihood function**

// Compute Poisson log-likelihood mata: real scalar pll(real vector y, real matrix X, real vector b) { real vector xb xb = x*b f = sum(-exp(xb) + y:*xb - lnfactorial(y)) } end

The vector of first derivatives is

\[

\frac{\partial~F(\xb_,\betab)}{\partial \betab}

= \sum_{i=1}^N (y_i – \exp(\xb_i\betab))\xb_i

\]

which I can compute in Mata as **quadcolsum((y-exp(X*b)):*X)**, and the Hessian matrix is

\[

\label{eq:H}

\sum_{i=1}^N\frac{\partial^2~f(\xb_i,\betab)}{\partial\betab

\partial\betab^\prime}

= – \sum_{i=1}^N\exp(\xb_i\betab)\xb_i^\prime \xb_i

\tag{5}

\]

which I can compute in Mata as **-quadcross(X, exp(X*b), X)**.

Here is some code that implements this Newton-Raphson **NR** algorithm for the Poisson regression problem.

**Code block 6: pnr1.do**

**(Uses accident3.dta)**

// Newton-Raphson for Poisson log-likelihood clear all use accident3 mata: real scalar pll(real vector y, real matrix X, real vector b) { real vector xb xb = X*b return(sum(-exp(xb) + y:*xb - lnfactorial(y))) } void GetDerives(real vector y, real matrix X, real vector theta, g, Hi) { real vector exb exb = exp(X*theta) g = (quadcolsum((y - exb):*X))' Hi = quadcross(X, exb, X) Hi = -1*cholinv(Hi) } real vector tupdate( /// real scalar lambda, /// real vector theta_s, /// real vector g_s, /// real matrix Hi_s) { return (theta_s - lambda*Hi_s*g_s) } real vector GetUpdate( /// real vector y, /// real matrix X, /// real vector theta_s, /// real vector g_s, /// real matrix Hi_s) { real scalar lambda real vector theta_s1 lambda = 1 theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s) while ( pll(y, X, theta_s1) <= pll(y, X, theta_s) ) { lambda = lambda/2 if (lambda < 1e-11) { printf("{red}Cannot find parameters that produce an increase.\n") exit(error(3360)) } theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s) } return(theta_s1) } y = st_data(., "accidents") X = st_data(., "cvalue kids traffic") X = X,J(rows(X), 1, 1) b = J(cols(X), 1, .01) GetDerives(y, X, b, g=., Hi=.) gz = . while (abs(gz) > 1e-11) { bs1 = GetUpdate(y, X, b, g, Hi) b = bs1 GetDerives(y, X, b, g, Hi) gz = g'*Hi*g printf("gz is now %8.7g\n", gz) } printf("Converged value of beta is\n") b end

Line 3 reads in the downloadable **accident3.dta** dataset before dropping down to Mata. I use variables from this dataset on lines 56 and 57.

Lines 6–11 define **pll()**, which returns the value of the Poisson log-likelihood function, given the vector of observations on the dependent variable **y**, the matrix of covariate observations **X**, and the current parameters **b**.

Lines 13‐21 put the vector of first derivatives in **g** and the inverse of the Hessian matrix in **Hi**. Equation 5 specifies a matrix that is negative definite, as long as the covariates are not linearly dependent. As discussed above, **cholinv()** returns a matrix of missing values if the matrix is not positive definite. I multiply the right-hand side on line 20 by \(-1\) instead of on line 19.

Lines 23–30 implement the **tupdate()** function previously discussed.

Lines 32–53 implement the **GetUpdate()** function previously discussed, with the caveats that this version handles the data and uses **pll()** to compute the value of the objective function.

Lines 56–58 get the data from Stata and row-join a vector to **X** for the constant term.

Lines 60–71 implement the **NR** algorithm discussed above for this Poisson regression problem.

Running nr1.do produces

**Example 1: NR algorithm for Poisson**

. do pnr1 . // Newton-Raphson for Poisson log-likelihood . clear all . use accident3 . . mata: [Output Omitted] : b = J(cols(X), 1, .01) : GetDerives(y, X, b, g=., Hi=.) : gz = . : while (abs(gz) > 1e-11) { > bs1 = GetUpdate(y, X, b, g, Hi) > b = bs1 > GetDerives(y, X, b, g, Hi) > gz = g'*Hi*g > printf("gz is now %8.7g\n", gz) > } gz is now -119.201 gz is now -26.6231 gz is now -2.02142 gz is now -.016214 gz is now -1.3e-06 gz is now -8.3e-15 : printf("Converged value of beta is\n") Converged value of beta is : b 1 +----------------+ 1 | -.6558870685 | 2 | -1.009016966 | 3 | .1467114652 | 4 | .5743541223 | +----------------+ : : end -------------------------------------------------------------------------------- . end of do-file

The point estimates in example 1 are equivalent to those produced by **poisson**.

**Example 2: poisson results**

. poisson accidents cvalue kids traffic Iteration 0: log likelihood = -555.86605 Iteration 1: log likelihood = -555.8154 Iteration 2: log likelihood = -555.81538 Poisson regression Number of obs = 505 LR chi2(3) = 340.20 Prob > chi2 = 0.0000 Log likelihood = -555.81538 Pseudo R2 = 0.2343 ------------------------------------------------------------------------------ accidents | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- cvalue | -.6558871 .0706484 -9.28 0.000 -.7943553 -.5174188 kids | -1.009017 .0807961 -12.49 0.000 -1.167374 -.8506594 traffic | .1467115 .0313762 4.68 0.000 .0852153 .2082076 _cons | .574354 .2839515 2.02 0.043 .0178193 1.130889 ------------------------------------------------------------------------------

**Done and undone**

I implemented a simple nonlinear optimizer to practice Mata programming and to review the theory behind nonlinear optimization. In future posts, I implement a command for Poisson regression that uses the optimizer in **optimize()**.

Pingback: The Stata Blog » Programming an estimation command in Stata: A map to posted entries()

Pingback: The Stata Blog » Programming an estimation command in Stata: Nonlinear least-squares estimators()