Multiple equation models: Estimation and marginal effects using gsem
Starting point: A hurdle model with multiple hurdles
In a sequence of posts, we are going to illustrate how to obtain correct standard errors and marginal effects for models with multiple steps.
Our inspiration for this post is an old Statalist inquiry about how to obtain marginal effects for a hurdle model with more than one hurdle (http://www.statalist.org/forums/forum/general-stata-discussion/general/1337504-estimating-marginal-effect-for-triple-hurdle-model). Hurdle models have the appealing property that their likelihood is separable. Each hurdle has its own likelihood and regressors. You can estimate each one of these hurdles separately to obtain point estimates. However, you cannot get standard errors or marginal effects this way.
In this post, we show how to get the marginal effects and standard errors for a hurdle model with two hurdles using gsem. gsem is ideal for this purpose because it allows us to estimate likelihood-based models with multiple equations.
The model
Suppose we are interested in the mean spending on dental care, given the characteristic of the individuals. Some people spend zero dollars on dental care in a year, and some people spend more than zero dollars. Only the individuals that cross a hurdle are willing to spend a positive amount on dental care. Hurdle models allow the characteristics of the individuals that spend a positive amount and those who spend zero to differ.
There could be more than one hurdle. In the dental-care spending example, the second hurdle could be insurance coverage: uninsured, basic insurance, or premium insurance. We model the first hurdle of spending zero or a positive amount by a probit. We model the second hurdle of insurance level using an ordered probit. Finally, we model the positive amount spent using an exponential-mean model.
We are interested in the marginal effects for the mean amount spent for someone with premium insurance, given individual characteristics. The expression for this conditional mean is
\begin{eqnarray*}
E\left(\text{expenditure}|X, {\tt insurance} =\text{premium}\right)
&=& \Phi(X_p\beta_p)\Phi\left(X_o\beta_o – \text{premium}\right)
\exp\left(X_e\beta_e\right)
\end{eqnarray*}
The conditional mean accounts for the probabilities of being in different threshold levels and for the expenditure preferences among those spending a positive amount.We use the subscripts \(p\), \(o\), and \(e\) to emphasize that the covariates and coefficients related to the probit, ordered probit, and exponential mean are different.
Below we will use gsem to estimate the model parameters from simulated data. spend is a binary outcome for whether an individual spends money on dental care, insurance is an ordered outcome indicating insurance level, and expenditure corresponds to the amount spent on dental care.
. gsem (spend <- x1 x2 x4, probit) > (insurance <- x3 x4, oprobit) > (expenditure <- x5 x6 x4, poisson), > vce(robust) note: expenditure has noncount values; you are responsible for the family(poisson) interpretation Iteration 0: log pseudolikelihood = -171938.67 Iteration 1: log pseudolikelihood = -79591.213 Iteration 2: log pseudolikelihood = -78928.015 Iteration 3: log pseudolikelihood = -78925.126 Iteration 4: log pseudolikelihood = -78925.126 Generalized structural equation model Number of obs = 10,000 Response : spend Family : Bernoulli Link : probit Response : insurance Family : ordinal Link : probit Response : expenditure Family : Poisson Link : log Log pseudolikelihood = -78925.126 ---------------------------------------------------------------------------- | Robust | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------------+------------------------------------------------------------ spend <- | x1 | .5189993 .0161283 32.18 0.000 .4873884 .5506102 x2 | -.4755281 .02257 -21.07 0.000 -.5197646 -.4312917 x4 | .5300193 .0187114 28.33 0.000 .4933455 .566693 _cons | .4849085 .0288667 16.80 0.000 .4283308 .5414862 ---------------+------------------------------------------------------------ insurance <- | x3 | .299793 .0084822 35.34 0.000 .2831681 .3164178 x4 | -.2835648 .0135266 -20.96 0.000 -.3100765 -.2570531 ---------------+------------------------------------------------------------ expenditure <- | x5 | -.2992792 .0192201 -15.57 0.000 -.3369499 -.2616086 x6 | .319377 .0483959 6.60 0.000 .2245229 .4142312 x4 | .448041 .0252857 17.72 0.000 .3984819 .4976001 _cons | 1.088217 .0375369 28.99 0.000 1.014646 1.161788 ---------------+------------------------------------------------------------ insurance | /cut1 | -1.28517 .0236876 -54.26 0.000 -1.331596 -1.238743 /cut2 | -.2925979 .0216827 -13.49 0.000 -.3350951 -.2501006 /cut3 | .7400875 .0230452 32.11 0.000 .6949198 .7852552 ----------------------------------------------------------------------------
The estimated probit parameters are in the spend equation. The estimated ordinal-probit parameters are in the insurance equation. The estimated expenditure parameters are in the expenditure equation. We could have obtained these point estimates using probit, oprobit, and poisson. With gsem, we do this jointly and obtain correct standard errors when computing marginal effects. In the case of the poisson model, we are using gsem to obtain an exponential mean and should interpret the outcomes from a quasilikelihood perspective. Because of the quasilikelihood nature of the problem, we use the vce(robust) option.
The average of the marginal effect of x4 is
\begin{equation*}
\frac{1}{N}\sum_{i=1}^N \frac{\partial \hat{E}\left(\text{expenditure}_i|X_i, {\tt insurance}_i\right)}{\partial {\tt x4}_i}
\end{equation*}
and we estimate it by
. margins, vce(unconditional) predict(expression(normal(eta(spend))* > normal(eta(insurance)-_b[insurance_cut2:_cons])* > exp(eta(expenditure)))) dydx(x4) Average marginal effects Number of obs = 10,000 Expression : Predicted normal(eta(spend))* normal(eta(insurance)-_b[insurance_cut2:_cons])* e, predict(expression(normal(eta(spend))* normal(eta(insurance)-_b[insurance_cut2:_cons])* exp(eta(expenditure)))) dy/dx w.r.t. : x4 --------------------------------------------------------------------------- | Unconditional | dy/dx Std. Err. z P>|z| [95% Conf. Interval] ----------+---------------------------------------------------------------- x4 | .5382276 .0506354 10.63 0.000 .4389841 .6374711 ---------------------------------------------------------------------------
We used the expression() option to write an expression for the expected value of interest and predict() and eta() to denote the linear predictions for each model. We use the vce(unconditional) option to allow the covariates to be random instead of fixed. In other words, we are estimating a population effect instead of a sample effect.
Final considerations
We illustrated how to use gsem to obtain the estimates and standard errors for a multiple hurdle model and its marginal effect. In subsequent posts, we will obtain these results using other Stata tools.
Appendix
Below is the code used to produce the data.
clear set seed 111 set obs 10000 // Generating exogenous variables generate x1 = rnormal() generate x2 = int(3*rbeta(2,3)) generate x3 = rchi2(1)-2 generate x4 = ln(rchi2(4)) generate x5 = rnormal() generate x6 = rbeta(2,3)>.6 // Generating unobservables generate ep = rnormal() // for probit generate eo = rnormal() // for ordered probit generate e = rnormal() // for lognormal equation // Generating linear predictions generate xbp = .5*(1 + x1 - x2 + x4) generate xbo = .3*(1 + x3 - x4) generate xbe = .3*(1 - x5 + x6 + x4) // Generating outcomes generate spend = xbp + ep > 0 generate yotemp = xbo + eo generate insurance = yotemp generate yexp = exp(xbe + e) replace insurance = 1 if yotemp < -1 replace insurance = 2 if yotemp> -1 & yotemp<0 replace insurance = 3 if yotemp> 0 & yotemp <1 replace insurance = 4 if yotemp>1 generate expenditure = spend*insurance*yexp