## Solving missing data problems using inverse-probability-weighted estimators

We discuss estimating population-averaged parameters when some of the data are missing. In particular, we show how to use **gmm** to estimate population-averaged parameters for a probit model when the process that causes some of the data to be missing is a function of observable covariates and of a random process that is independent of the outcome. This type of missingness is variously known as missing at random, selection on observables, or exogenous sample selection.

This is a follow-up to an earlier post where we estimated the parameters of a probit model under endogenous sample selection (http://blog.stata.com/2015/11/05/using-mlexp-to-estimate-endogenous-treatment-effects-in-a-probit-model/). In endogenous sample selection, the random process that affects which observations are missing is correlated with an unobservable random process that affects the outcome.

Under exogenous sample selection, **probit** consistently estimates the regression coefficients, which determine the effects conditional on the covariates. But estimates of the population-averaged parameters will be inconsistent when the model covariates are correlated with the selection process. To get consistent estimates of the population-averaged parameters in this case, we use inverse-probability weighting to reweight the data so that our estimates reflect both the fully observed and the partially observed observations.

This estimator uses the same trick as the inverse-probability-weighted (IPW) estimators used in causal inference. For two examples of IPW estimators applied to causal inference, see http://blog.stata.com/2016/09/13/an-ordered-probit-inverse-probability-weighted-ipw-estimator/ and http://blog.stata.com/2014/12/08/using-gmm-to-solve-two-step-estimation-problems/.

**Exogenous sample selection**

Consider the case where we draw a simple random sample from a population at some point in time (\(t_{1}\)) and then survey the same respondents again at a later point (\(t_{2}\)). If we only partly observe our variables of interest at \(t_{2}\), for example, because of panel attrition, then estimates of population-averaged parameters will be inconsistent if selection into the \(t_{2}\) sample is a function of the covariates used to model the outcome.

To illustrate, let’s suppose we have a binary outcome variable \(y_{i}\) that is only observed at \(t_{2}\) and we wish to fit the following probit model,

\[
y_{i} = 1[\beta_{0} + \beta_{1}d_{i} + \beta_{2}z_{1,i} + \beta_{3}z_{2,i} + e_{i} > 0] = 1[{\bf x}_i{\boldsymbol \beta} + e_{i} > 0]
\]

where only \(z_{1,i}\) and \(z_{2,i}\) are observed for the full sample at \(t_{1}\), while \(d_{i}\) and \(y_{i}\) are not. While we could consistently estimate the conditional parameters in \({\boldsymbol \beta}\) even if \(z_{1,i}\) or \(z_{2,i}\) are correlated with respondents dropping out of the sample, we cannot consistently estimate population-averaged effects from our model using the reduced sample. Let’s take a look at a snippet from our (fictitious) dataset:

```
. li id s y d z1 z2 in 1/5

     +-----------------------------------------+
     | id   s   y   d          z1          z2  |
     |-----------------------------------------|
  1. |  1   1   0   0   .87519349   .53714872  |
  2. |  2   1   0   1  -.02251873   .51122735  |
  3. |  3   0   .   .    .4885629   .07667308  |
  4. |  4   0   .   .  -.95665816   .73366976  |
  5. |  5   0   .   .  -1.2078948   .32982661  |
     +-----------------------------------------+
```

We see that only observations 1 and 2 were observed at both times, while observations 3, 4, and 5 dropped out. Suppose we went ahead and fit our model to the observed subsample and wanted to estimate, say, the marginal means for the binary variable \(d_i\). We use **probit** to estimate the parameters of the model and then use **margins** with the **over()** option to estimate the marginal means.

```
. probit y i.d z1 z2

Iteration 0:   log likelihood = -6387.4076
Iteration 1:   log likelihood = -5009.8148
Iteration 2:   log likelihood = -5000.3354
Iteration 3:   log likelihood = -5000.3285
Iteration 4:   log likelihood = -5000.3285

Probit regression                               Number of obs     =      9,234
                                                LR chi2(3)        =    2774.16
                                                Prob > chi2       =     0.0000
Log likelihood = -5000.3285                     Pseudo R2         =     0.2172

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         1.d |  -.9754357   .0296157   -32.94   0.000    -1.033481   -.9173901
          z1 |  -.6995319   .0185883   -37.63   0.000    -.7359644   -.6630995
          z2 |    .998608    .052668    18.96   0.000     .8953806    1.101835
       _cons |  -.3024672   .0359367    -8.42   0.000    -.3729019   -.2320326
------------------------------------------------------------------------------

. margins, over(d)

Predictive margins                              Number of obs     =      9,234
Model VCE    : OIM

Expression   : Pr(y), predict()
over         : d

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           d |
          0  |   .6828357    .006175   110.58   0.000     .6707329    .6949384
          1  |   .3705836   .0063825    58.06   0.000     .3580741     .383093
------------------------------------------------------------------------------
```

Because we simulated the dataset, we know that the true population parameter is 0.56 for \(d_{i}=0\) and 0.26 for \(d_{i}=1\). Although the conditional parameters of the probit model are estimated consistently, our marginal mean estimates are far off. However, if we have a good model for the sample-selection process, and if we observe the relevant variables for that model, we can use it to create weights that correct for the sample selection. That is, we need a model for each respondent’s probability of selecting him- or herself into the follow-up, and we then weight each observation in the reduced sample by the inverse of that probability. If we can model sample selection based on variables we observe, we speak of exogenous sample selection (unlike endogenous sample selection, where unobservables in the selection and outcome models are correlated).

For details about IPW sample-selection models, see Wooldridge (2010), section 19.8. Also notice that the basic idea here is the same as that for IPW treatment-effects estimators; see http://blog.stata.com/2016/09/13/an-ordered-probit-inverse-probability-weighted-ipw-estimator/ and http://blog.stata.com/2015/07/07/introduction-to-treatment-effects-in-stata-part-1/.
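To see the reweighting logic in isolation, here is a minimal numerical sketch (in Python rather than Stata, and not part of the original post) in which selection depends on an observed covariate: the naive mean of the observed outcomes is biased, while weighting each observed outcome by the inverse of its (here, known) selection probability recovers the population mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(12345)
n = 200_000
z = rng.standard_normal(n)
y = (0.5 * z + rng.standard_normal(n) > 0).astype(float)  # outcome depends on z
p_sel = norm.cdf(0.3 + 1.0 * z)                           # P(s=1 | z): exogenous selection
s = (rng.uniform(size=n) < p_sel).astype(float)           # selection indicator

true_mean = y.mean()                        # population-averaged parameter (about 0.5)
naive_mean = y[s == 1].mean()               # biased upward: selected units have larger z
ipw_mean = np.sum(s * y / p_sel) / np.sum(s / p_sel)  # inverse-probability weighted
print(true_mean, naive_mean, ipw_mean)
```

In practice the selection probabilities are unknown and must themselves be estimated, which is exactly why the post stacks all the moment conditions into one **gmm** call.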

While this sounds straightforward, proper estimation of our marginal means requires some care, because we face a multistage estimation problem: we have a selection model and an outcome model, and we want marginal means based on both. One way of solving this problem is to estimate everything simultaneously with a generalized method-of-moments estimator (http://blog.stata.com/2014/12/08/using-gmm-to-solve-two-step-estimation-problems/).

**Model and estimator**

Modeling the sample selection using a probit model with \(s_i\) being the selection indicator, we have

\[
s_{i} = 1[\gamma_{0} + \gamma_{1}z_{1,i} + \gamma_{2}z_{2,i} + \gamma_{3}z_{3,i} + u_{i} > 0] = 1[{\bf z}_i{\boldsymbol \gamma} + u_i > 0]
\]

The conditional probability of selection is

\[
P(s_i=1 \vert {\bf z}_i) = \Phi({\bf z}_i{\boldsymbol \gamma})
\]

We can use the inverse of this probability as a weight when we estimate the model parameters and population-averaged parameters on the fully observed sample. Intuitively, the inverse-probability weight corrects the estimates so that they reflect both the fully observed and the partially observed observations.

For the expectations of interest, we have

\begin{eqnarray*}
E(y_i\vert d_i) &=& E\left\{s_i\,\Phi({\bf z}_i{\boldsymbol \gamma})^{-1}\, E(y_i\vert d_i,{\bf z}_i) \Big\vert d_i\right\} \cr
&=& E\left\{s_i\,\Phi({\bf z}_i{\boldsymbol \gamma})^{-1}\, \Phi({\bf x}_i{\boldsymbol \beta}) \Big\vert d_i\right\}
\end{eqnarray*}
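As an illustrative check (not part of the original post), we can verify this identity by Monte Carlo in Python, using the data-generating process from the appendix with the true parameter values. Because we simulate the data, we can compute the left-hand side from the full population, even though in practice \(y_i\) is seen only when \(s_i=1\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(123)
n = 400_000
d = (rng.uniform(size=n) > 0.5).astype(float)
z1 = rng.standard_normal(n)
z2 = rng.uniform(size=n)
z3 = rng.uniform(size=n)
zb = -0.1 - 0.7 * z1 + z2 - z3          # true selection index z'gamma
xb = -0.3 - d - 0.7 * z1 + z2           # true outcome index x'beta
s = (zb + rng.standard_normal(n) > 0).astype(float)
y = (xb + rng.standard_normal(n) > 0).astype(float)

mask = (d == 0)
# Left-hand side: population mean of y among d = 0 (only computable in a simulation)
lhs = y[mask].mean()
# Right-hand side: E{ s * Phi(z'gamma)^(-1) * Phi(x'beta) | d = 0 };
# unselected units contribute zero because s = 0
rhs = np.mean(s[mask] * norm.cdf(xb[mask]) / norm.cdf(zb[mask]))
print(lhs, rhs)
```

Both sides come out near the true marginal mean of 0.56 for \(d_i=0\) quoted earlier.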

We will use the inverse-probability weight in the moment conditions as we estimate the model parameters and marginal means by the generalized method of moments. Because we use a probit model for both the outcome and the selection process, we can use the same form of moment conditions for both, except that they are evaluated on different samples (the full sample and the reduced sample). We use the first-order conditions of the probit log-likelihood function as moment conditions so that we retain the maximum likelihood estimates. For the selection model, we have the sample moment conditions

\[
\sum_{i=1}^{N} \Bigg[ \Bigg\{ s_{i}\, \frac{\phi({\bf z}_{i}{\boldsymbol \gamma})}{\Phi({\bf z}_{i}{\boldsymbol \gamma})} - (1-s_{i})\, \frac{\phi({\bf z}_{i}{\boldsymbol \gamma})}{\Phi(-{\bf z}_{i}{\boldsymbol \gamma})} \Bigg\} {\bf z}_{i} \Bigg] = 0
\]
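The bracketed term is the probit score, so these moment conditions are solved by the maximum likelihood estimate. A short Python sketch (illustrative, with simulated data; not part of the post) that fits a probit by maximum likelihood and confirms the score sums to zero at the solution:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(42)
n = 5_000
Z = np.column_stack([rng.standard_normal(n), np.ones(n)])  # one covariate plus a constant
gamma_true = np.array([0.8, -0.2])
s = (Z @ gamma_true + rng.standard_normal(n) > 0).astype(float)

def score(g):
    """Sum over i of { s_i*phi/Phi - (1-s_i)*phi/Phi(-.) } * z_i."""
    idx = Z @ g
    return Z.T @ (s * norm.pdf(idx) / norm.cdf(idx)
                  - (1 - s) * norm.pdf(idx) / norm.cdf(-idx))

def negloglik(g):
    idx = Z @ g
    return -np.sum(s * norm.logcdf(idx) + (1 - s) * norm.logcdf(-idx))

# The gradient of the negative log likelihood is minus the score,
# so minimizing it solves the moment conditions above.
ghat = minimize(negloglik, np.zeros(2), jac=lambda g: -score(g), method="BFGS").x
print(ghat, score(ghat))
```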

Let \(S\) be the indices for the fully observed sample. For the outcome model we have sample moment conditions

\[
\sum_{i\in S} \Phi({\bf z}_{i}{\boldsymbol \gamma})^{-1} \Bigg[ \Bigg\{ y_{i}\, \frac{\phi({\bf x}_{i}{\boldsymbol \beta})}{\Phi({\bf x}_{i}{\boldsymbol \beta})} - (1-y_{i})\, \frac{\phi({\bf x}_{i}{\boldsymbol \beta})}{\Phi(-{\bf x}_{i}{\boldsymbol \beta})} \Bigg\} {\bf x}_{i} \Bigg] = 0
\]
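Solving these weighted moment conditions is equivalent to weighted probit maximum likelihood on the selected subsample. As an illustrative sketch (not the post's Stata code), we can simulate data from the appendix's design and, for simplicity, use the true selection probabilities \(\Phi({\bf z}_i{\boldsymbol \gamma})\) to form the weights:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 100_000
d = (rng.uniform(size=n) > 0.5).astype(float)
z1 = rng.standard_normal(n)
z2 = rng.uniform(size=n)
z3 = rng.uniform(size=n)
zb = -0.1 - 0.7 * z1 + z2 - z3                 # true selection index z'gamma
s = (zb + rng.standard_normal(n) > 0)          # boolean selection indicator
y = (-0.3 - d - 0.7 * z1 + z2 + rng.standard_normal(n) > 0).astype(float)

# Keep only the selected (fully observed) sample, with weights 1/Phi(z'gamma)
Xs = np.column_stack([d, z1, z2, np.ones(n)])[s]
ys, ws = y[s], 1.0 / norm.cdf(zb[s])

def wscore(b):
    """The weighted outcome moment conditions above."""
    idx = Xs @ b
    g = ys * norm.pdf(idx) / norm.cdf(idx) - (1 - ys) * norm.pdf(idx) / norm.cdf(-idx)
    return Xs.T @ (ws * g)

def wnegloglik(b):
    idx = Xs @ b
    return -np.sum(ws * (ys * norm.logcdf(idx) + (1 - ys) * norm.logcdf(-idx)))

bhat = minimize(wnegloglik, np.zeros(4), jac=lambda b: -wscore(b), method="BFGS").x
print(bhat)   # close to the true beta = (-1, -0.7, 1, -0.3)
```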

Finally, the sample moment conditions of our marginal parameters are

\[
\sum_{i\in S} \Phi({\bf z}_{i}{\boldsymbol \gamma})^{-1} \Bigg[ \Bigg\{\Phi({\bf x}_{i}{\boldsymbol \beta}) - \mu_{0} \Bigg\} (1-d_{i}) \Bigg] = 0
\]

\[
\sum_{i\in S} \Phi({\bf z}_{i}{\boldsymbol \gamma})^{-1} \Bigg[ \Bigg\{\Phi({\bf x}_{i}{\boldsymbol \beta}) - \mu_{1} \Bigg\} d_{i} \Bigg] = 0
\]
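Each of these conditions can be solved in closed form: the estimate of each marginal mean is an inverse-probability-weighted average of \(\Phi({\bf x}_{i}{\boldsymbol \beta})\) over the selected observations in the corresponding \(d\) group. A sketch (illustrative, using the appendix's data-generating process with the true parameter values rather than estimates):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
n = 200_000
d = (rng.uniform(size=n) > 0.5).astype(float)
z1 = rng.standard_normal(n)
z2 = rng.uniform(size=n)
z3 = rng.uniform(size=n)
zb = -0.1 - 0.7 * z1 + z2 - z3          # true selection index z'gamma
xb = -0.3 - d - 0.7 * z1 + z2           # true outcome index x'beta
s = (zb + rng.standard_normal(n) > 0).astype(float)

w = 1.0 / norm.cdf(zb)                  # inverse-probability weights
sel0 = (s == 1) & (d == 0)
sel1 = (s == 1) & (d == 1)
# Solving each moment condition for mu gives a weighted average of Phi(x'beta)
mu0 = np.sum(w[sel0] * norm.cdf(xb[sel0])) / np.sum(w[sel0])
mu1 = np.sum(w[sel1] * norm.cdf(xb[sel1])) / np.sum(w[sel1])
print(mu0, mu1)
```

With the true parameters plugged in, the results land near the true marginal means of 0.56 and 0.26; **gmm** instead solves all the conditions jointly, which is what delivers correct standard errors.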

**Estimation**

Now, we estimate our parameters with **gmm**, using its interactive syntax. We first fit the plain probit model, without the marginal means:

```
. gmm (eq1: (y*normalden({xb : i.d z1 z2 _cons})/normal({xb:}) -
>           (1-y)*normalden(-{xb:})/normal(-{xb:}))),
>     instruments(eq1: i.d z1 z2)
>     winitial(unadjusted, independent) onestep

Step 1
Iteration 0:   GMM criterion Q(b) =   .1679307
Iteration 1:   GMM criterion Q(b) =  .00078746
Iteration 2:   GMM criterion Q(b) =  4.901e-07
Iteration 3:   GMM criterion Q(b) =  3.725e-13
Iteration 4:   GMM criterion Q(b) =  2.405e-25

note: model is exactly identified

GMM estimation

Number of parameters =   4
Number of moments    =   4
Initial weight matrix: Unadjusted               Number of obs    =      9,234

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         1.d |  -.9754357   .0297148   -32.83   0.000    -1.033676   -.9171958
          z1 |  -.6995319   .0184779   -37.86   0.000    -.7357479   -.6633159
          z2 |    .998608    .052408    19.05   0.000     .8958901    1.101326
       _cons |  -.3024672   .0355903    -8.50   0.000    -.3722229   -.2327115
------------------------------------------------------------------------------
Instruments for equation eq1: 0b.d 1.d z1 z2 _cons
```

The true population parameters in \({\boldsymbol \beta}\) are \(\beta_{0}=-0.3\), \(\beta_{1}=-1\), \(\beta_{2}=-0.7\), and \(\beta_{3}=1\), and we can see that our estimates are very close to these. The point estimates are the same as those we obtained before using **probit**. Now, we fit the same probit model but add the marginal means:

```
. gmm (eq1: (y*normalden({xb : i.d z1 z2 _cons})/normal({xb:}) -
>           (1-y)*normalden(-{xb:})/normal(-{xb:})))
>     (eq2: (1-d)*(normal({xb:})-{mu0}) )
>     (eq3: d*(normal({xb:})-{mu1}) ),
>     instruments(eq1: i.d z1 z2)
>     winitial(unadjusted, independent) onestep

Step 1
Iteration 0:   GMM criterion Q(b) =  .29293128
Iteration 1:   GMM criterion Q(b) =  .00400745
Iteration 2:   GMM criterion Q(b) =  2.492e-06
Iteration 3:   GMM criterion Q(b) =  1.014e-11
Iteration 4:   GMM criterion Q(b) =  1.691e-22

note: model is exactly identified

GMM estimation

Number of parameters =   6
Number of moments    =   6
Initial weight matrix: Unadjusted               Number of obs    =      9,234

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         1.d |  -.9754357   .0297148   -32.83   0.000    -1.033676   -.9171958
          z1 |  -.6995319   .0184779   -37.86   0.000    -.7357479   -.6633159
          z2 |    .998608    .052408    19.05   0.000     .8958901    1.101326
       _cons |  -.3024672   .0355903    -8.50   0.000    -.3722229   -.2327115
-------------+----------------------------------------------------------------
        /mu0 |   .6828357     .00683    99.98   0.000     .6694491    .6962222
        /mu1 |   .3705836   .0071107    52.12   0.000     .3566469    .3845202
------------------------------------------------------------------------------
Instruments for equation eq1: 0b.d 1.d z1 z2 _cons
Instruments for equation eq2: _cons
Instruments for equation eq3: _cons
```

The parameters labeled **/mu0** and **/mu1** are estimates of the marginal means for \(d_{i}=0\) and \(d_{i}=1\), respectively, not yet accounting for sample selection. Again, the marginal means are the same as those we obtained earlier from **margins**, and these estimates are inconsistent (the true values are 0.56 for \(d_{i}=0\) and 0.26 for \(d_{i}=1\)).

Finally, we fit our sample-selection model. We specify the **nocommonesample** option because we use different sets of observations across moment conditions:

```
. gmm (eq1: s*normalden({zb : z1 z2 z3 _cons})/normal({zb:}) -
>          (1-s)*normalden(-{zb:})/normal(-{zb:}))
>     (eq2: (y*normalden({xb : i.d z1 z2 _cons})/normal({xb:}) -
>           (1-y)*normalden(-{xb:})/normal(-{xb:}))/(normal({zb:})))
>     (eq3: (1-d)*(normal({xb:})-{mu0}) / normal({zb:}))
>     (eq4: d*(normal({xb:})-{mu1}) / normal({zb:})),
>     instruments(eq1: z1 z2 z3)
>     instruments(eq2: i.d z1 z2)
>     winitial(unadjusted, independent)
>     onestep nocommonesample

Step 1
Iteration 0:   GMM criterion Q(b) =  .57729426
Iteration 1:   GMM criterion Q(b) =  .03006396
Iteration 2:   GMM criterion Q(b) =  .00014618
Iteration 3:   GMM criterion Q(b) =  2.016e-08
Iteration 4:   GMM criterion Q(b) =  1.477e-16

note: model is exactly identified

GMM estimation

Number of parameters =  10
Number of moments    =  10
Initial weight matrix: Unadjusted               Number of obs    =          *

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
zb           |
          z1 |   -.706703   .0115024   -61.44   0.000    -.7292473   -.6841587
          z2 |   .9562022   .0347235    27.54   0.000     .8881454    1.024259
          z3 |  -.9735917   .0345574   -28.17   0.000    -1.041323   -.9058605
       _cons |   -.112247   .0257628    -4.36   0.000    -.1627412   -.0617528
-------------+----------------------------------------------------------------
xb           |
         1.d |  -.9810332    .038836   -25.26   0.000     -1.05715   -.9049161
          z1 |  -.6651958   .0435223   -15.28   0.000    -.7504979   -.5798937
          z2 |   .9877247   .0986749    10.01   0.000     .7943255    1.181124
       _cons |  -.2775556   .0726005    -3.82   0.000    -.4198501   -.1352612
-------------+----------------------------------------------------------------
        /mu0 |   .5628197    .011652    48.30   0.000     .5399823    .5856572
        /mu1 |   .2694232    .007469    36.07   0.000     .2547843    .2840621
------------------------------------------------------------------------------
* Number of observations for equation eq1: 20000
  Number of observations for equation eq2: 9234
  Number of observations for equation eq3: 9234
  Number of observations for equation eq4: 9234
------------------------------------------------------------------------------
Instruments for equation eq1: z1 z2 z3 _cons
Instruments for equation eq2: 0b.d 1.d z1 z2 _cons
Instruments for equation eq3: _cons
Instruments for equation eq4: _cons
```

We can see that our estimates of the marginal means are now close to the true values.

**Conclusion**

We demonstrated how to use **gmm** to estimate population-averaged parameters with an IPW estimator. This solves a missing data problem arising from an exogenous sample-selection process.

**Appendix**

Here is the code that we used for generating the dataset:

```
drop _all
set seed 123
qui set obs 20000
generate double d = runiform() > .5
generate double z1 = rnormal()
generate double z2 = runiform()
generate double z3 = runiform()
generate double u = rnormal()
generate double e = rnormal()
generate double zb = (-0.1 - 0.7*z1 + z2 - z3)
generate double xb = (-0.3 - d - 0.7*z1 + z2)
generate double s = (zb + u) > 0
generate double y = (xb + e) > 0
qui replace y = . if s == 0
qui replace d = . if s == 0
generate id = _n
```
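For readers who want to replicate the simulation without Stata, here is a rough Python equivalent (numpy's random streams differ from Stata's, so individual draws will not match the Stata dataset, only the distributions):

```python
import numpy as np

rng = np.random.default_rng(123)
n = 20_000
d = (rng.uniform(size=n) > 0.5).astype(float)
z1 = rng.standard_normal(n)
z2 = rng.uniform(size=n)
z3 = rng.uniform(size=n)
u = rng.standard_normal(n)
e = rng.standard_normal(n)
zb = -0.1 - 0.7 * z1 + z2 - z3
xb = -0.3 - d - 0.7 * z1 + z2
s = (zb + u > 0).astype(float)
y = np.where(s == 1, (xb + e > 0).astype(float), np.nan)  # y missing when not selected
d = np.where(s == 1, d, np.nan)                           # d missing when not selected
```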

**Reference**

Wooldridge, J. M. 2010. *Econometric Analysis of Cross Section and Panel Data*. 2nd ed. Cambridge, MA: MIT Press.