## Probit model with sample selection by mlexp

**Overview**

In a previous post, David Drukker demonstrated how to use **mlexp** to estimate the degree of freedom parameter in a chi-squared distribution by maximum likelihood (ML). In this post, I am going to use **mlexp** to estimate the parameters of a probit model with sample selection. I will illustrate how to specify a more complex likelihood in **mlexp** and provide intuition for the probit model with sample selection. Our results match the **heckprobit** command; see **[R] heckprobit** for more details.

**Probit model**

For binary outcome \(y_i\) and regressors \({\bf x}_i\), the probit model assumes

\[\begin{equation} \label{eq:outcome} y_i = {\bf 1}({\bf x}_i{\boldsymbol \beta} + \epsilon_i > 0) \tag{1} \end{equation}\]

where the error \(\epsilon_i\) is standard normal. The indicator function \({\bf1}(\cdot)\) outputs 1 when its input is true and outputs 0 otherwise.

The log likelihood of the probit model is

\[\begin{equation}

\ln L = \sum_{i=1}^{N} y_i \ln \Phi({\bf x}_i{\boldsymbol \beta}) + (1-y_i)\ln\{1-\Phi({\bf x}_i{\boldsymbol \beta})\} \nonumber

\end{equation}\]

where \(\Phi\) is the standard normal cumulative distribution function.

The probit model is widely used to model binary outcomes. But there are situations where it is not appropriate. Sometimes we observe a random sample where the outcome is missing on certain observations. If there is a relationship between the unobserved error of the outcome \(\epsilon_i\) and the unobserved error that affects whether the outcome is observed \(\epsilon_{si}\), then estimates made using the probit model will be inconsistent for \({\boldsymbol \beta}\). For instance, this could happen when we model job satisfaction and our sample includes employed and unemployed individuals. The unobserved factors that affect your job satisfaction may be correlated with factors that affect your employment status. Samples like this are said to suffer from “selection on unobservables”.

**Probit model with sample selection**

Van de Ven and Van Pragg (1981) introduced the probit model with sample selection to allow for consistent estimation of \({\boldsymbol \beta}\) in samples that suffer from selection on unobservables. The equation for the outcome (1) remains the same, but we add another equation. The selection process for the outcome is modeled as

\[\begin{equation}

s_i = {\bf 1}({\bf z}_i{\boldsymbol \gamma} + \epsilon_{si} > 0) \nonumber

\end{equation}\]

where \(s_i=1\) if we observed \(y_i\) and \(s_i=0\) otherwise, and \({\bf z}_i\) are regressors that affect the selection process.

The errors \(\epsilon_i\) and \(\epsilon_{si}\) are assumed to be standard normal with

\[\begin{equation}

\mbox{corr}(\epsilon_i,\epsilon_{si}) = \rho \nonumber

\end{equation}\]

Let \(S\) be the set of observations where \(y_i\) is observed. The likelihood for the probit model with sample selection is

\[\begin{eqnarray*}

\ln L &=& \sum_{i\in S}^{} y_i\ln\Phi_2({\bf x}_i{\boldsymbol \beta}, {\bf z}_i{\boldsymbol \gamma},\rho) +

(1-y_i)\ln\Phi_2(-{\bf x}_i{\boldsymbol \beta}, {\bf z}_i{\boldsymbol \gamma},-\rho) + \cr

& & \sum_{i\not\in S}^{} \ln \{1- \Phi({\bf z}_i{\boldsymbol \gamma})\}

\end{eqnarray*}\]

where \(\Phi_2\) is the bivariate normal cumulative distribution function.

**The data**

We will simulate data from a probit model with sample selection and then estimate the parameters of the model using **mlexp**. We simulate a random sample of 7,000 observations.

. drop _all . set seed 441 . set obs 7000 number of observations (_N) was 0, now 7,000 . generate x = .5*rchi2(2) . generate z = rnormal() . generate b = rbinomial(2,.5)

First, we generate the regressors. We use a \(\chi^2\) variable with \(2\) degrees of freedom \(x\) scaled by \(0.5\) as a regressor for the outcome. A standard normal variable \(z\) is used as a selection regressor. The variable \(b\) has a binomial(\(2,0.5\)) distribution and will be used as a selection regressor.

. matrix cm = (1,.7 \ .7,1) . drawnorm ey es, corr(cm)

Next, we draw the unobserved errors. The outcome \(y\) and selection indicator \(s\) will be generated with errors that have correlation \(0.7\). We generate the errors with the **drawnorm** command.

. generate s = z + 1.3*0.b + 1.b + .5*2.b + es > 0 . generate y = .7*x + ey + .5 > 0 . replace y = . if !s (1,750 real changes made, 1,750 to missing)

Finally, we generate the outcome and selection indicator. We specify the effect of \(b\) on selection by using factor-variable notation. Every value of \(b\) provides a different intercept for \(s\). We set the outcome to missing for observations where \(s\) is \(0\).

**Effect of ignoring sample selection**

First, we will use **mlexp** to estimate the probit model, ignoring the sample selection. We use the **cond()** function to calculate different values of the likelihood based on the value of \(y\). For **cond(a,b,c)**, **b** is returned if **a** is true and **c** is returned otherwise. We use only the observations for which \(y\) is not missing by specifying \(y\) in the **variables()** option. The variables in the equation **y** are specified once, the first time the equation parameters are used in the likelihood. When the equation is used again, it is referred to as \(\{{\bf y}:\}\).

. mlexp (ln(cond(y,normal({y: x _cons}),1-normal({y:})))), variables(y) initial: log likelihood = -3639.0227 alternative: log likelihood = -2342.8722 rescale: log likelihood = -1746.0961 Iteration 0: log likelihood = -1746.0961 Iteration 1: log likelihood = -1503.9519 Iteration 2: log likelihood = -1485.2935 Iteration 3: log likelihood = -1485.1677 Iteration 4: log likelihood = -1485.1677 Maximum likelihood estimation Log likelihood = -1485.1677 Number of obs = 5,250 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- x | .813723 .0568938 14.30 0.000 .7022132 .9252328 _cons | .7623006 .0386929 19.70 0.000 .6864639 .8381372 ------------------------------------------------------------------------------

Both parameters are overestimated, and the true values are not in the estimated confidence intervals.

**Accounting for sample selection**

Now, we use **mlexp** to estimate the probit model with sample selection. We use the **cond()** function twice, once for the selection indicator value and once for the outcome value. We no longer need to specify the **variables()** option because we will use each observation in the data. We use the factor-variable operator **ibn** in the selection equation so that a separate intercept is used in the equation for each level of \(b\).

. mlexp (ln(cond(s,cond(y,binormal({y: x _cons},{s: z ibn.b}, {rho}), binormal( > -{y:},{s:}, -{rho})),1-normal({s:})))) initial: log likelihood = -8491.053 alternative: log likelihood = -5898.851 rescale: log likelihood = -5898.851 rescale eq: log likelihood = -5654.3504 Iteration 0: log likelihood = -5654.3504 Iteration 1: log likelihood = -5473.5319 (not concave) Iteration 2: log likelihood = -4401.6027 (not concave) Iteration 3: log likelihood = -4340.7398 (not concave) Iteration 4: log likelihood = -4333.6402 (not concave) Iteration 5: log likelihood = -4326.1744 (not concave) Iteration 6: log likelihood = -4316.4936 (not concave) Iteration 7: log likelihood = -4261.307 Iteration 8: log likelihood = -4154.7548 Iteration 9: log likelihood = -4142.7991 Iteration 10: log likelihood = -4141.7431 Iteration 11: log likelihood = -4141.7306 Iteration 12: log likelihood = -4141.7305 Maximum likelihood estimation Log likelihood = -4141.7305 Number of obs = 7,000 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- y | x | .7643362 .0532342 14.36 0.000 .659999 .8686734 _cons | .5259657 .0406914 12.93 0.000 .446212 .6057195 -------------+---------------------------------------------------------------- s | z | 1.028631 .0260977 39.41 0.000 .977481 1.079782 | b | 0 | 1.365497 .0440301 31.01 0.000 1.2792 1.451794 1 | 1.034018 .0297178 34.79 0.000 .9757726 1.092264 2 | .530342 .0353022 15.02 0.000 .461151 .5995331 -------------+---------------------------------------------------------------- /rho | .6854869 .0417266 16.43 0.000 .6037043 .7672696 ------------------------------------------------------------------------------

Our estimates of the coefficient on \(x\) and the constant intercept are closer to the true values. The confidence intervals also include the true values. The correlation \(\rho\) is estimated to be \(0.69\), and the true value of \(0.7\) is in the confidence interval. This model obviously works better.

**Conclusion**

I have demonstrated how to estimate the parameters of a model with a moderately complex likelihood function: the probit model with sample selection using **mlexp**. I also illustrated how to generate data from this model and how its results differ from the simple probit model.

See **[R] mlexp** for more details about **mlexp**. In a future post, we will show how to make predictions after **mlexp** and how to estimate population average parameters using **mlexp** and **margins**.

**Reference**

Van de Ven, W. P. M. M., and B. M. S. Van Pragg. 1981. The demand for deductibles in private health insurance: A probit model with sample selection. *Journal of Econometrics* 17: 229{252.

Pingback: The Stata Blog » Using mlexp to estimate endogenous treatment effects in a probit model()