## Testing model specification and using the program version of gmm

This post was written jointly with Joerg Luedicke, Senior Social Scientist and Statistician, StataCorp.

The command **gmm** estimates the parameters of a model using the generalized method of moments (GMM). GMM can estimate the parameters of models that have more moment conditions than parameters; such models are called overidentified. The specification of an overidentified model can be evaluated using Hansen’s *J* statistic (Hansen, 1982).

We use **gmm** to estimate the parameters of a Poisson model with an endogenous regressor. More instruments than regressors are available, so the model is overidentified. We then use **estat overid** to calculate Hansen’s *J* statistic and test the validity of the overidentification restrictions.

In previous posts (see Estimating parameters by maximum likelihood and method of moments using mlexp and gmm and Understanding the generalized method of moments (GMM): A simple example), the interactive version of **gmm** has been used to estimate simple single-equation models. For more complex models, it can be easier to use the moment-evaluator program version of **gmm**. We demonstrate how to use this version of **gmm**.

**Poisson model with endogenous regressors**

In this post, the Poisson regression of \(y_i\) on exogenous regressors \({\bf x}_i\) and endogenous regressors \({\bf y}_{2,i}\) has the form

\begin{equation*}
E(y_i \vert {\bf x}_i,{\bf y}_{2,i},\epsilon_i)= \exp({\boldsymbol \beta}_1{\bf x}_i + {\boldsymbol \beta}_2{\bf y}_{2,i}) + \epsilon_i
\end{equation*}

where \(\epsilon_i\) is a zero-mean error term. The endogenous regressors \({\bf y}_{2,i}\) may be correlated with \(\epsilon_i\). This is the same formulation used by **ivpoisson** with additive errors; see **[R] ivpoisson** for more details. For more information on Poisson models with endogenous regressors, see Mullahy (1997), Cameron and Trivedi (2013), Windmeijer and Santos Silva (1997), and Wooldridge (2010).

Moment conditions are expected values that specify the model parameters in terms of the true moments. GMM finds the parameter values that are closest to satisfying the sample equivalent of the moment conditions. In this model, we define moment conditions using an error function,

\begin{equation*}
u_i({\boldsymbol \beta}_1,{\boldsymbol \beta}_2) = y_i - \exp({\boldsymbol \beta}_1{\bf x}_i + {\boldsymbol \beta}_2{\bf y}_{2,i})
\end{equation*}

Let \({\bf x}_{2,i}\) be additional exogenous variables. These are not correlated with \(\epsilon_i\), but are correlated with \({\bf y}_{2,i}\). Combining them with \({\bf x}_i\), we have the instruments \({\bf z}_i = (\begin{matrix} {\bf x}_{i} & {\bf x}_{2,i}\end{matrix})\). So the moment conditions are

\begin{equation*}
E({\bf z}_i u_i({\boldsymbol \beta}_1,{\boldsymbol \beta}_2)) = {\bf 0}
\end{equation*}

Suppose there are \(k\) parameters in \({\boldsymbol \beta}_1\) and \({\boldsymbol \beta}_2\) and \(q\) instruments. When \(q>k\), there are more moment conditions than parameters. The model is overidentified. Here GMM finds parameter estimates that solve weighted moment conditions. GMM minimizes

\[
Q({\boldsymbol \beta}_1,{\boldsymbol \beta}_2) = \left\{\frac{1}{N}\sum\nolimits_i {\bf z}_i u_i({\boldsymbol \beta}_1,{\boldsymbol \beta}_2)\right\} {\bf W} \left\{\frac{1}{N}\sum\nolimits_i {\bf z}_i u_i({\boldsymbol \beta}_1,{\boldsymbol \beta}_2)\right\}'
\]

for \(q\times q\) weight matrix \({\bf W}\).
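To make the criterion concrete, here is a minimal Python sketch (not Stata code) that evaluates \(Q\) for a one-regressor exogenous Poisson-type model with \({\bf W}\) set to the identity matrix; the data-generating values and seed are invented for illustration:

```python
# Illustrative sketch: the GMM criterion Q(b) for the error
# u_i = y_i - exp(b0 + b1*x_i), instruments z_i = (1, x_i), W = identity.
import math, random

random.seed(1)
N = 2000
x = [random.gauss(0, 0.5) for _ in range(N)]
# exogenous Poisson-type outcome with an additive error
y = [math.exp(1.0 + 0.5*xi) + random.gauss(0, 0.3) for xi in x]

def gmm_criterion(b0, b1):
    # sample moments: gbar_j = (1/N) * sum_i z_ij * u_i(b)
    g0 = sum(yi - math.exp(b0 + b1*xi) for xi, yi in zip(x, y)) / N
    g1 = sum(xi*(yi - math.exp(b0 + b1*xi)) for xi, yi in zip(x, y)) / N
    return g0*g0 + g1*g1                  # Q(b) with W = identity

# Q is near zero at the true parameters and larger elsewhere
print(gmm_criterion(1.0, 0.5), gmm_criterion(0.0, 0.0))
```

GMM searches for the parameter values that minimize this quantity.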

**Overidentification test**

When the model is correctly specified,

\begin{equation*}
E({\bf z}_i u_i({\boldsymbol \beta}_1,{\boldsymbol \beta}_2)) = {\bf 0}
\end{equation*}

In this case, if \({\bf W}\) is an optimal weight matrix, it is equal to the inverse of the covariance matrix of the moment conditions. Here we have

\[
{\bf W}^{-1} = E\{{\bf z}_i' u_{i}({\boldsymbol \beta}_1,{\boldsymbol \beta}_2) u_{i}({\boldsymbol \beta}_1,{\boldsymbol \beta}_2) {\bf z}_i\}
\]

Hansen’s test evaluates the null hypothesis that an overidentified model is correctly specified. The test statistic \(J = N Q(\hat{\boldsymbol \beta}_1, \hat{\boldsymbol \beta}_2)\) is used. If \({\bf W}\) is an optimal weight matrix, under the null hypothesis, Hansen’s *J* statistic has a \(\chi^2(q-k)\) distribution.

The two-step and iterated estimators used by **gmm** provide estimates of the optimal **W**. For overidentified models, the **estat overid** command calculates Hansen’s *J* statistic after these estimators are used.
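The mechanics of the two-step estimator and of \(J = N\,Q(\hat{\boldsymbol \beta})\) can be sketched outside Stata. The following hypothetical Python example (all names, values, and the seed are invented) runs two-step GMM for a deliberately simple linear model with one parameter and two instruments, so \(q-k=1\); under correct specification, the resulting \(J\) behaves like a \(\chi^2(1)\) draw:

```python
# Sketch of two-step GMM and Hansen's J for y = b*x + e with
# instruments z1, z2 (overidentified: 2 moments, 1 parameter).
import random

random.seed(7)
N = 5000
z1 = [random.gauss(0, 1) for _ in range(N)]
z2 = [random.gauss(0, 1) for _ in range(N)]
x  = [a + b + random.gauss(0, 1) for a, b in zip(z1, z2)]
y  = [0.5*xi + random.gauss(0, 1) for xi in x]

def mean(v): return sum(v) / N

def b_hat(W):
    # minimize Q(b) = g(b)' W g(b), g(b) = (1/N) sum z_i (y_i - b*x_i);
    # closed form for one parameter: b = (c'Wa) / (c'Wc)
    a = [mean([zi*yi for zi, yi in zip(z, y)]) for z in (z1, z2)]
    c = [mean([zi*xi for zi, xi in zip(z, x)]) for z in (z1, z2)]
    cWa = sum(c[i]*W[i][j]*a[j] for i in range(2) for j in range(2))
    cWc = sum(c[i]*W[i][j]*c[j] for i in range(2) for j in range(2))
    return cWa / cWc

b1 = b_hat([[1, 0], [0, 1]])                     # step 1: W = identity
u  = [yi - b1*xi for xi, yi in zip(x, y)]
s11 = mean([ui*ui*a*a for ui, a in zip(u, z1)])  # S = (1/N) sum u^2 z z'
s22 = mean([ui*ui*a*a for ui, a in zip(u, z2)])
s12 = mean([ui*ui*a*b for ui, a, b in zip(u, z1, z2)])
det = s11*s22 - s12*s12
W2  = [[s22/det, -s12/det], [-s12/det, s11/det]] # optimal W = S^{-1}
b2  = b_hat(W2)                                  # step 2

g = [mean([zi*(yi - b2*xi) for zi, yi, xi in zip(z, y, x)]) for z in (z1, z2)]
J = N * sum(g[i]*W2[i][j]*g[j] for i in range(2) for j in range(2))
print(b2, J)   # b2 should be near 0.5; J should look like a chi2(1) draw
```

This is the same sequence **gmm** automates: an initial weight matrix, a first-step estimate, an estimated optimal \({\bf W}\), a second-step estimate, and finally \(J = N\,Q\).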

**Moment-evaluator program**

We define a program that can be called by **gmm** in calculating the moment conditions for Poisson models with endogenous regressors. See Programming an estimation command in Stata: A map to posted entries for more information about programming in Stata. The program calculates the error function \(u_i\), and **gmm** generates the moment conditions by multiplying by the instruments \({\bf z}_i\).

To solve the weighted moment conditions, **gmm** must take the derivative of the moment conditions with respect to the parameters. Using the chain rule, these are the derivatives of the error functions multiplied by the instruments. Users may specify these derivatives themselves, or **gmm** will calculate the derivatives numerically. Users can gain speed and numeric stability by properly specifying the derivatives themselves.

When linear forms of the parameters are estimated, users may specify derivatives to **gmm** in terms of the linear form (prediction). The chain rule is then used by **gmm** to determine the derivatives of the error function \(u_i\) with respect to the parameters. Our error function \(u_i\) is a function of the linear prediction \({\boldsymbol \beta}_1{\bf x}_i + {\boldsymbol \beta}_2{\bf y}_{2,i}\).

The program **gmm_ivpois** calculates the error function \(u_i\) and the derivative of \(u_i\) in terms of the linear prediction \({\boldsymbol \beta}_1{\bf x}_i + {\boldsymbol \beta}_2{\bf y}_{2,i}\).

```
program gmm_ivpois
    version 14.1
    syntax varlist [if], at(name) depvar(varlist) rhs(varlist) ///
            [derivatives(varlist)]
    tempvar m
    quietly gen double `m' = 0 `if'
    local i = 1
    foreach var of varlist `rhs' {
        quietly replace `m' = `m' + `var'*`at'[1,`i'] `if'
        local i = `i' + 1
    }
    quietly replace `m' = `m' + `at'[1,`i'] `if'
    quietly replace `varlist' = `depvar' - exp(`m') `if'
    if "`derivatives'" == "" {
        exit
    }
    replace `derivatives' = -exp(`m') `if'
end
```

Lines 3–4 of **gmm_ivpois** contain the syntax statement that parses the arguments to the program. All moment-evaluator programs must accept a **varlist**, the **if** condition, and the **at()** option. The **varlist** corresponds to variables that store the values of the error functions. The program **gmm_ivpois** will calculate the error function and store it in the specified **varlist**. The **at()** option is specified with the name of a matrix that contains the model parameters. The **if** condition specifies the observations for which estimation is performed.

The program also requires the options **depvar()** and **rhs()**. The name of the dependent variable is specified in the **depvar()** option. The regressors are specified in the **rhs()** option.

On line 4, **derivatives()** is optional. The variable name specified here corresponds to the derivative of the error function with respect to the linear prediction.

The linear prediction of the regressors is stored in the temporary variable **m** over lines 6–12. On line 13, we give the value of the error function to the specified **varlist**. Lines 14–16 allow the program to exit if **derivatives()** is not specified. Otherwise, on line 17, we store the value of the derivative of the error function with respect to the linear prediction in the variable specified in **derivatives()**.
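The derivative that **gmm_ivpois** supplies can be verified with a quick finite-difference computation. This Python sketch is not part of the Stata workflow, and its data values are made up; it simply confirms that \(\partial u/\partial m = -\exp(m)\), combined with the chain rule, yields the derivatives with respect to each coefficient:

```python
# Check the linear-form derivative: with m = b1*x + b2*y2 + b0 and
# u = y - exp(m), du/dm = -exp(m) and du/db_j = (du/dm) * (dm/db_j).
import math

x, y2, y = 0.4, 1.3, 5.0        # made-up data point
b1, b2, b0 = 0.5, 0.2, 1.0      # made-up parameter values

def u(b1, b2, b0):
    return y - math.exp(b1*x + b2*y2 + b0)

m = b1*x + b2*y2 + b0
du_dm = -math.exp(m)            # what the derivatives() variable holds

# chain rule: derivatives with respect to b1, b2, and the constant
analytic = [du_dm*x, du_dm*y2, du_dm*1.0]

# finite-difference check
h = 1e-6
numeric = [
    (u(b1+h, b2, b0) - u(b1, b2, b0)) / h,
    (u(b1, b2+h, b0) - u(b1, b2, b0)) / h,
    (u(b1, b2, b0+h) - u(b1, b2, b0)) / h,
]
print(analytic, numeric)        # the two lists agree closely
```

This is exactly the work the **haslfderivatives** option delegates to **gmm**: the program returns only \(-\exp(m)\), and **gmm** applies the chain rule for each parameter.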

**The data**

We simulate data from a Poisson regression with an endogenous covariate, and then we use **gmm** and the **gmm_ivpois** program to estimate the parameters of the regression. We will then use **estat overid** to check the specification of the model. We simulate a random sample of 3,000 observations.

```
. set seed 45

. set obs 3000
number of observations (_N) was 0, now 3,000

. generate x = rnormal()*.8 + .5

. generate z = rchi2(1)

. generate w = rnormal()*.5

. matrix cm = (1, .9 \ .9, 1)

. matrix sd = (.5,.8)

. drawnorm e u, corr(cm) sd(sd)
```

We generate the exogenous covariates \(x\), \(z\), and \(w\). The variable \(x\) will be a regressor, while \(z\) and \(w\) will be extra instruments. Then we use **drawnorm** to draw the errors \(e\) and \(u\). The errors are positively correlated.

```
. generate y2 = exp(.2*x + .1*z + .3*w - 1 + u)

. generate y = exp(.5*x + .2*y2 + 1) + e
```

We generate the endogenous regressor \(y2\) as a lognormal regression on the instruments. The outcome of interest \(y\) has an exponential mean on \(x\) and \(y2\), with \(e\) as an additive error. As \(e\) is correlated with \(u\), \(y2\) is correlated with \(e\).
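To see the endogeneity directly, the same data-generating process can be sketched in Python (an illustrative translation, not the Stata code above): because \(e\) and \(u\) are positively correlated and \(y2\) depends on \(u\), the sample correlation between \(y2\) and \(e\) is clearly positive.

```python
# Illustrative replication of the DGP: correlated errors e and u
# make the regressor y2 correlated with the outcome error e.
import math, random

random.seed(45)
N = 3000
x = [random.gauss(0, 0.8) + 0.5 for _ in range(N)]
z = [random.gauss(0, 1)**2 for _ in range(N)]     # a chi2(1) draw
w = [random.gauss(0, 0.5) for _ in range(N)]

# corr(e, u) = .9, sd(e) = .5, sd(u) = .8, as in drawnorm above
e, u = [], []
for _ in range(N):
    a, b = random.gauss(0, 1), random.gauss(0, 1)
    e.append(0.5*a)
    u.append(0.8*(0.9*a + math.sqrt(1 - 0.9**2)*b))

y2 = [math.exp(0.2*xi + 0.1*zi + 0.3*wi - 1 + ui)
      for xi, zi, wi, ui in zip(x, z, w, u)]

def corr(v1, v2):
    m1, m2 = sum(v1)/N, sum(v2)/N
    cov = sum((p-m1)*(q-m2) for p, q in zip(v1, v2)) / N
    s1 = math.sqrt(sum((p-m1)**2 for p in v1) / N)
    s2 = math.sqrt(sum((q-m2)**2 for q in v2) / N)
    return cov / (s1*s2)

print(corr(y2, e))   # clearly positive: y2 is endogenous
```

A plain Poisson regression of \(y\) on \(x\) and \(y2\) would therefore be inconsistent, which is why we instrument \(y2\) with \(z\) and \(w\).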

**Estimating the model parameters**

Now we use **gmm** to estimate the parameters of the Poisson regression with endogenous covariates. The name of our moment-evaluator program is listed to the right of **gmm**. The instruments that **gmm** will use to form the moment conditions are listed in **instruments()**. We specify the options **depvar()** and **rhs()** with the appropriate variables. They will be passed on to **gmm_ivpois**.

The parameters are specified as the linear form **y** in the **parameters()** option, while we specify **haslfderivatives** to inform **gmm** that **gmm_ivpois** provides derivatives of this linear form. The option **nequations()** tells **gmm** how many error functions to expect.

```
. gmm gmm_ivpois, depvar(y) rhs(x y2)          ///
>         haslfderivatives instruments(x z w)  ///
>         parameters({y: x y2 _cons}) nequations(1)

Step 1
Iteration 0:   GMM criterion Q(b) =  14.960972
Iteration 1:   GMM criterion Q(b) =  3.3038486
Iteration 2:   GMM criterion Q(b) =  .59045217
Iteration 3:   GMM criterion Q(b) =  .00079862
Iteration 4:   GMM criterion Q(b) =  .00001419
Iteration 5:   GMM criterion Q(b) =  .00001418

Step 2
Iteration 0:   GMM criterion Q(b) =   .0000567
Iteration 1:   GMM criterion Q(b) =  .00005648
Iteration 2:   GMM criterion Q(b) =  .00005648

GMM estimation

Number of parameters =   3
Number of moments    =   4
Initial weight matrix: Unadjusted                 Number of obs   =      3,000
GMM weight matrix:     Robust

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .5006366   .0033273   150.46   0.000     .4941151     .507158
          y2 |   .2007893   .0075153    26.72   0.000     .1860597    .2155189
       _cons |   1.000717   .0063414   157.81   0.000      .988288    1.013146
------------------------------------------------------------------------------
Instruments for equation 1: x z w _cons
```

Our coefficients are significant and close to the true values used in the simulation. However, the model could still be misspecified.

**Overidentification test**

We use **estat overid** to compute Hansen’s *J* statistic.

```
. estat overid

Test of overidentifying restriction:

  Hansen's J chi2(1) = .169449 (p = 0.6806)
```

The *J* statistic equals 0.17. In addition to computing Hansen’s *J*, **estat overid** provides a test against misspecification of the model. In this case, we have one more instrument than regressor, so the *J* statistic has a \(\chi^2(1)\) distribution. The probability of obtaining a \(\chi^2(1)\) value greater than 0.17 is given in parentheses. This probability—the *p*-value of the test—is large and so we fail to reject the null hypothesis that the model is properly specified.
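As a quick sanity check outside Stata, the reported *p*-value can be reproduced from the \(\chi^2(1)\) survival function, \(P(X > x) = \mathrm{erfc}(\sqrt{x/2})\); a small Python sketch:

```python
# Reproduce the p-value of Hansen's J test for 1 degree of freedom:
# if X ~ chi2(1), then P(X > x) = erfc(sqrt(x/2)).
import math

J = 0.169449
p = math.erfc(math.sqrt(J / 2))   # chi2(1) upper-tail probability
print(round(p, 4))                # 0.6806, matching estat overid
```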

**Conclusion**

We have demonstrated how to estimate the parameters of a Poisson regression with an endogenous regressor using the moment-evaluator program version of **gmm**. We have also demonstrated how to use **estat overid** to test for model misspecification after estimation of an overidentified model in **gmm**. See **[R] gmm** and **[R] gmm postestimation** for more information.

**References**

Cameron, A. C., and P. K. Trivedi. 2013. *Regression Analysis of Count Data*. 2nd ed. New York: Cambridge University Press.

Hansen, L. P. 1982. Large sample properties of generalized method of moments estimators. *Econometrica* 50: 1029–1054.

Mullahy, J. 1997. Instrumental-variable estimation of count data models: Applications to models of cigarette smoking behavior. *Review of Economics and Statistics* 79: 586–593.

Windmeijer, F., and J. M. C. Santos Silva. 1997. Endogeneity in count data models: An application to demand for health care. *Journal of Applied Econometrics* 12: 281–294.

Wooldridge, J. M. 2010. *Econometric Analysis of Cross Section and Panel Data*. 2nd ed. Cambridge, MA: MIT Press.