Home > Statistics > Probit model with sample selection by mlexp

Probit model with sample selection by mlexp

Overview

In a previous post, David Drukker demonstrated how to use mlexp to estimate the degree of freedom parameter in a chi-squared distribution by maximum likelihood (ML). In this post, I am going to use mlexp to estimate the parameters of a probit model with sample selection. I will illustrate how to specify a more complex likelihood in mlexp and provide intuition for the probit model with sample selection. Our results match the heckprobit command; see [R] heckprobit for more details.

Probit model

For binary outcome \(y_i\) and regressors \({\bf x}_i\), the probit model assumes

\[\begin{equation} \label{eq:outcome} y_i = {\bf 1}({\bf x}_i{\boldsymbol \beta} + \epsilon_i > 0) \tag{1} \end{equation}\]

where the error \(\epsilon_i\) is standard normal. The indicator function \({\bf1}(\cdot)\) outputs 1 when its input is true and outputs 0 otherwise.

The log likelihood of the probit model is

\[\begin{equation}
\ln L = \sum_{i=1}^{N} y_i \ln \Phi({\bf x}_i{\boldsymbol \beta}) + (1-y_i)\ln\{1-\Phi({\bf x}_i{\boldsymbol \beta})\} \nonumber
\end{equation}\]

where \(\Phi\) is the standard normal cumulative distribution function.

The probit model is widely used to model binary outcomes. But there are situations where it is not appropriate. Sometimes we observe a random sample where the outcome is missing on certain observations. If there is a relationship between the unobserved error of the outcome \(\epsilon_i\) and the unobserved error that affects whether the outcome is observed \(\epsilon_{si}\), then estimates made using the probit model will be inconsistent for \({\boldsymbol \beta}\). For instance, this could happen when we model job satisfaction and our sample includes employed and unemployed individuals. The unobserved factors that affect your job satisfaction may be correlated with factors that affect your employment status. Samples like this are said to suffer from “selection on unobservables”.

Probit model with sample selection

Van de Ven and Van Pragg (1981) introduced the probit model with sample selection to allow for consistent estimation of \({\boldsymbol \beta}\) in samples that suffer from selection on unobservables. The equation for the outcome (1) remains the same, but we add another equation. The selection process for the outcome is modeled as

\[\begin{equation}
s_i = {\bf 1}({\bf z}_i{\boldsymbol \gamma} + \epsilon_{si} > 0) \nonumber
\end{equation}\]

where \(s_i=1\) if we observed \(y_i\) and \(s_i=0\) otherwise, and \({\bf z}_i\) are regressors that affect the selection process.

The errors \(\epsilon_i\) and \(\epsilon_{si}\) are assumed to be standard normal with

\[\begin{equation}
\mbox{corr}(\epsilon_i,\epsilon_{si}) = \rho \nonumber
\end{equation}\]

Let \(S\) be the set of observations where \(y_i\) is observed. The likelihood for the probit model with sample selection is

\[\begin{eqnarray*}
\ln L &=& \sum_{i\in S}^{} y_i\ln\Phi_2({\bf x}_i{\boldsymbol \beta}, {\bf z}_i{\boldsymbol \gamma},\rho) +
(1-y_i)\ln\Phi_2(-{\bf x}_i{\boldsymbol \beta}, {\bf z}_i{\boldsymbol \gamma},-\rho) + \cr
& & \sum_{i\not\in S}^{} \ln \{1- \Phi({\bf z}_i{\boldsymbol \gamma})\}
\end{eqnarray*}\]

where \(\Phi_2\) is the bivariate normal cumulative distribution function.

The data

We will simulate data from a probit model with sample selection and then estimate the parameters of the model using mlexp. We simulate a random sample of 7,000 observations.

. drop _all

. set seed 441

. set obs 7000
number of observations (_N) was 0, now 7,000

. generate x = .5*rchi2(2)

. generate z = rnormal()

. generate b = rbinomial(2,.5)

First, we generate the regressors. We use a \(\chi^2\) variable with \(2\) degrees of freedom \(x\) scaled by \(0.5\) as a regressor for the outcome. A standard normal variable \(z\) is used as a selection regressor. The variable \(b\) has a binomial(\(2,0.5\)) distribution and will be used as a selection regressor.

. matrix cm = (1,.7 \ .7,1)

. drawnorm ey es, corr(cm)

Next, we draw the unobserved errors. The outcome \(y\) and selection indicator \(s\) will be generated with errors that have correlation \(0.7\). We generate the errors with the drawnorm command.

. generate s = z + 1.3*0.b + 1.b + .5*2.b + es > 0

. generate y = .7*x + ey  + .5 > 0

. replace y = .  if !s
(1,750 real changes made, 1,750 to missing)

Finally, we generate the outcome and selection indicator. We specify the effect of \(b\) on selection by using factor-variable notation. Every value of \(b\) provides a different intercept for \(s\). We set the outcome to missing for observations where \(s\) is \(0\).

Effect of ignoring sample selection

First, we will use mlexp to estimate the probit model, ignoring the sample selection. We use the cond() function to calculate different values of the likelihood based on the value of \(y\). For cond(a,b,c), b is returned if a is true and c is returned otherwise. We use only the observations for which \(y\) is not missing by specifying \(y\) in the variables() option. The variables in the equation y are specified once, the first time the equation parameters are used in the likelihood. When the equation is used again, it is referred to as \(\{{\bf y}:\}\).

. mlexp (ln(cond(y,normal({y: x _cons}),1-normal({y:})))), variables(y)

initial:       log likelihood = -3639.0227
alternative:   log likelihood = -2342.8722
rescale:       log likelihood = -1746.0961
Iteration 0:   log likelihood = -1746.0961  
Iteration 1:   log likelihood = -1503.9519  
Iteration 2:   log likelihood = -1485.2935  
Iteration 3:   log likelihood = -1485.1677  
Iteration 4:   log likelihood = -1485.1677  

Maximum likelihood estimation

Log likelihood = -1485.1677                     Number of obs     =      5,250

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |    .813723   .0568938    14.30   0.000     .7022132    .9252328
       _cons |   .7623006   .0386929    19.70   0.000     .6864639    .8381372
------------------------------------------------------------------------------

Both parameters are overestimated, and the true values are not in the estimated confidence intervals.

Accounting for sample selection

Now, we use mlexp to estimate the probit model with sample selection. We use the cond() function twice, once for the selection indicator value and once for the outcome value. We no longer need to specify the variables() option because we will use each observation in the data. We use the factor-variable operator ibn in the selection equation so that a separate intercept is used in the equation for each level of \(b\).

. mlexp (ln(cond(s,cond(y,binormal({y: x _cons},{s: z ibn.b}, {rho}), binormal(
> -{y:},{s:}, -{rho})),1-normal({s:}))))

initial:       log likelihood =  -8491.053
alternative:   log likelihood =  -5898.851
rescale:       log likelihood =  -5898.851
rescale eq:    log likelihood = -5654.3504
Iteration 0:   log likelihood = -5654.3504  
Iteration 1:   log likelihood = -5473.5319  (not concave)
Iteration 2:   log likelihood = -4401.6027  (not concave)
Iteration 3:   log likelihood = -4340.7398  (not concave)
Iteration 4:   log likelihood = -4333.6402  (not concave)
Iteration 5:   log likelihood = -4326.1744  (not concave)
Iteration 6:   log likelihood = -4316.4936  (not concave)
Iteration 7:   log likelihood =  -4261.307  
Iteration 8:   log likelihood = -4154.7548  
Iteration 9:   log likelihood = -4142.7991  
Iteration 10:  log likelihood = -4141.7431  
Iteration 11:  log likelihood = -4141.7306  
Iteration 12:  log likelihood = -4141.7305  

Maximum likelihood estimation

Log likelihood = -4141.7305                     Number of obs     =      7,000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y            |
           x |   .7643362   .0532342    14.36   0.000      .659999    .8686734
       _cons |   .5259657   .0406914    12.93   0.000      .446212    .6057195
-------------+----------------------------------------------------------------
s            |
           z |   1.028631   .0260977    39.41   0.000      .977481    1.079782
             |
           b |
          0  |   1.365497   .0440301    31.01   0.000       1.2792    1.451794
          1  |   1.034018   .0297178    34.79   0.000     .9757726    1.092264
          2  |    .530342   .0353022    15.02   0.000      .461151    .5995331
-------------+----------------------------------------------------------------
        /rho |   .6854869   .0417266    16.43   0.000     .6037043    .7672696
------------------------------------------------------------------------------

Our estimates of the coefficient on \(x\) and the constant intercept are closer to the true values. The confidence intervals also include the true values. The correlation \(\rho\) is estimated to be \(0.69\), and the true value of \(0.7\) is in the confidence interval. This model obviously works better.

Conclusion

I have demonstrated how to estimate the parameters of a model with a moderately complex likelihood function: the probit model with sample selection using mlexp. I also illustrated how to generate data from this model and how its results differ from the simple probit model.

See [R] mlexp for more details about mlexp. In a future post, we will show how to make predictions after mlexp and how to estimate population average parameters using mlexp and margins.

Reference

Van de Ven, W. P. M. M., and B. M. S. Van Pragg. 1981. The demand for deductibles in private health insurance: A probit model with sample selection. Journal of Econometrics 17: 229{252.

Categories: Statistics Tags: , ,
  • Pingback: The Stata Blog » Using mlexp to estimate endogenous treatment effects in a probit model()

  • Craig Depken

    Don’t know if anyone is going to see this but I am trying to replicate your code and get the following error when trying to implement the first mlexp command:

    variable _cons not found
    r(111);

    I am using Stata/SE 13.1

    Here’s my code:

    drop _all
    set seed 441
    set obs 7000
    generate x = .5*rchi2(2)
    generate z = rnormal()
    generate b = rbinomial(2,.5)
    matrix cm = (1,.7 .7,1)
    drawnorm ey es, corr(cm)
    generate s = z + 1.3*0.b + 1.b + .5*2.b + es > 0
    generate y = .7*x + ey + .5 > 0
    replace y = . if !s
    mlexp (ln(cond(y,normal({y: x _cons}),1-normal({y:})))), variables(y)
    mlexp (ln(cond(s,cond(y,binormal({y: x _cons},{s: z ibn.b}, {rho}), binormal(-{y:},{s:}, -{rho})),1-normal({s:}))))

    Any thoughts?

  • Charles D Lindsey

    Craig,

    This blog post was written for Stata 14.1. In Stata 13.1, mlexp could not include a constant in an equation.

    This is how a probit regression was coded in mlexp previously

    sysuse auto
    mlexp (ln(cond(foreign==1, ///
    normal({xb:gear_ratio turn} + {c}), ///
    normal(-1*({xb:} +{c})))))

    The {c} parameter represents the constant.

    In Stata 14.1, we would code

    sysuse auto
    mlexp (ln(cond(foreign==1, ///
    normal({xb:gear_ratio turn _cons}), ///
    normal(-1*({xb:})))))

  • Craig Depken

    This is what I thought – thanks for the confirmation on the old syntax.

  • Charles D Lindsey

    I am glad to help.