Home > Statistics > Multiple equation models: Estimation and marginal effects using gsem

Multiple equation models: Estimation and marginal effects using gsem

Starting point: A hurdle model with multiple hurdles

In a sequence of posts, we are going to illustrate how to obtain correct standard errors and marginal effects for models with multiple steps.

Our inspiration for this post is an old Statalist inquiry about how to obtain marginal effects for a hurdle model with more than one hurdle (http://www.statalist.org/forums/forum/general-stata-discussion/general/1337504-estimating-marginal-effect-for-triple-hurdle-model). Hurdle models have the appealing property that their likelihood is separable. Each hurdle has its own likelihood and regressors. You can estimate each one of these hurdles separately to obtain point estimates. However, you cannot get standard errors or marginal effects this way.

In this post, we show how to get the marginal effects and standard errors for a hurdle model with two hurdles using gsem. gsem is ideal for this purpose because it allows us to estimate likelihood-based models with multiple equations.

The model

Suppose we are interested in the mean spending on dental care, given the characteristic of the individuals. Some people spend zero dollars on dental care in a year, and some people spend more than zero dollars. Only the individuals that cross a hurdle are willing to spend a positive amount on dental care. Hurdle models allow the characteristics of the individuals that spend a positive amount and those who spend zero to differ.

There could be more than one hurdle. In the dental-care spending example, the second hurdle could be insurance coverage: uninsured, basic insurance, or premium insurance. We model the first hurdle of spending zero or a positive amount by a probit. We model the second hurdle of insurance level using an ordered probit. Finally, we model the positive amount spent using an exponential-mean model.

We are interested in the marginal effects for the mean amount spent for someone with premium insurance, given individual characteristics. The expression for this conditional mean is
\begin{eqnarray*}
E\left(\text{expenditure}|X, {\tt insurance} =\text{premium}\right)
&=& \Phi(X_p\beta_p)\Phi\left(X_o\beta_o – \text{premium}\right)
\exp\left(X_e\beta_e\right)
\end{eqnarray*}
The conditional mean accounts for the probabilities of being in different threshold levels and for the expenditure preferences among those spending a positive amount.We use the subscripts \(p\), \(o\), and \(e\) to emphasize that the covariates and coefficients related to the probit, ordered probit, and exponential mean are different.

Below we will use gsem to estimate the model parameters from simulated data. spend is a binary outcome for whether an individual spends money on dental care, insurance is an ordered outcome indicating insurance level, and expenditure corresponds to the amount spent on dental care.

. gsem  (spend       <- x1 x2 x4, probit)    
>       (insurance   <- x3 x4, oprobit)      
>       (expenditure <- x5 x6 x4, poisson),  
>        vce(robust)
note: expenditure has noncount values;
      you are responsible for the family(poisson) interpretation

Iteration 0:   log pseudolikelihood = -171938.67
Iteration 1:   log pseudolikelihood = -79591.213
Iteration 2:   log pseudolikelihood = -78928.015
Iteration 3:   log pseudolikelihood = -78925.126
Iteration 4:   log pseudolikelihood = -78925.126

Generalized structural equation model        Number of obs     =     10,000

Response       : spend
Family         : Bernoulli
Link           : probit

Response       : insurance
Family         : ordinal
Link           : probit

Response       : expenditure
Family         : Poisson
Link           : log

Log pseudolikelihood = -78925.126

----------------------------------------------------------------------------
               |               Robust
               |      Coef.   Std. Err.     z    P>|z|  [95% Conf. Interval]
---------------+------------------------------------------------------------
spend <-       |
         x1    |   .5189993   .0161283   32.18   0.000   .4873884   .5506102
         x2    |  -.4755281     .02257  -21.07   0.000  -.5197646  -.4312917
         x4    |   .5300193   .0187114   28.33   0.000   .4933455    .566693
      _cons    |   .4849085   .0288667   16.80   0.000   .4283308   .5414862
---------------+------------------------------------------------------------
insurance <-   |
            x3 |    .299793   .0084822   35.34   0.000   .2831681   .3164178
            x4 |  -.2835648   .0135266  -20.96   0.000  -.3100765  -.2570531
---------------+------------------------------------------------------------
expenditure <- |
            x5 |  -.2992792   .0192201  -15.57   0.000  -.3369499  -.2616086
            x6 |    .319377   .0483959    6.60   0.000   .2245229   .4142312
            x4 |    .448041   .0252857   17.72   0.000   .3984819   .4976001
         _cons |   1.088217   .0375369   28.99   0.000   1.014646   1.161788
---------------+------------------------------------------------------------
insurance      |
         /cut1 |   -1.28517   .0236876  -54.26   0.000  -1.331596  -1.238743
         /cut2 |  -.2925979   .0216827  -13.49   0.000  -.3350951  -.2501006
         /cut3 |   .7400875   .0230452   32.11   0.000   .6949198   .7852552
----------------------------------------------------------------------------

The estimated probit parameters are in the spend equation. The estimated ordinal-probit parameters are in the insurance equation. The estimated expenditure parameters are in the expenditure equation. We could have obtained these point estimates using probit, oprobit, and poisson. With gsem, we do this jointly and obtain correct standard errors when computing marginal effects. In the case of the poisson model, we are using gsem to obtain an exponential mean and should interpret the outcomes from a quasilikelihood perspective. Because of the quasilikelihood nature of the problem, we use the vce(robust) option.

The average of the marginal effect of x4 is
\begin{equation*}
\frac{1}{N}\sum_{i=1}^N \frac{\partial \hat{E}\left(\text{expenditure}_i|X_i, {\tt insurance}_i\right)}{\partial {\tt x4}_i}
\end{equation*}
and we estimate it by

. margins, vce(unconditional) predict(expression(normal(eta(spend))* 
>          normal(eta(insurance)-_b[insurance_cut2:_cons])*          
>          exp(eta(expenditure)))) dydx(x4)

Average marginal effects                     Number of obs     =     10,000

Expression   : Predicted normal(eta(spend))*
               normal(eta(insurance)-_b[insurance_cut2:_cons])* e,
               predict(expression(normal(eta(spend))*
               normal(eta(insurance)-_b[insurance_cut2:_cons])*
               exp(eta(expenditure))))
dy/dx w.r.t. : x4

---------------------------------------------------------------------------
          |            Unconditional
          |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------+----------------------------------------------------------------
       x4 |   .5382276   .0506354    10.63   0.000     .4389841    .6374711
---------------------------------------------------------------------------

We used the expression() option to write an expression for the expected value of interest and predict() and eta() to denote the linear predictions for each model. We use the vce(unconditional) option to allow the covariates to be random instead of fixed. In other words, we are estimating a population effect instead of a sample effect.

Final considerations

We illustrated how to use gsem to obtain the estimates and standard errors for a multiple hurdle model and its marginal effect. In subsequent posts, we will obtain these results using other Stata tools.

Appendix

Below is the code used to produce the data.

clear
set seed 111
set obs 10000
// Generating exogenous variables
generate x1 = rnormal()
generate x2 = int(3*rbeta(2,3))
generate x3 = rchi2(1)-2
generate x4 = ln(rchi2(4))
generate x5 = rnormal()
generate x6 = rbeta(2,3)>.6
// Generating unobservables
generate ep = rnormal() // for probit
generate eo = rnormal() // for ordered probit
generate e  = rnormal() // for lognormal equation
// Generating linear predictions
generate xbp = .5*(1 + x1 - x2 + x4)
generate xbo = .3*(1 + x3 - x4)
generate xbe = .3*(1 - x5 + x6 + x4)
// Generating outcomes
generate spend       = xbp + ep > 0
generate yotemp      = xbo + eo
generate insurance   = yotemp
generate yexp = exp(xbe + e)
replace insurance = 1 if yotemp < -1
replace insurance = 2 if yotemp> -1 & yotemp<0
replace insurance = 3 if yotemp> 0 & yotemp <1
replace insurance = 4 if yotemp>1
generate expenditure = spend*insurance*yexp