Solving missing data problems using inverse-probability-weighted estimators

We discuss estimating population-averaged parameters when some of the data are missing. In particular, we show how to use gmm to estimate population-averaged parameters for a probit model when the process that causes some of the data to be missing is a function of observable covariates and a random process that is independent of the outcome. This type of missingness is known as missing at random, selection on observables, or exogenous sample selection.

This is a follow-up to an earlier post where we estimated the parameters of a probit model under endogenous sample selection (http://blog.stata.com/2015/11/05/using-mlexp-to-estimate-endogenous-treatment-effects-in-a-probit-model/). In endogenous sample selection, the random process that affects which observations are missing is correlated with an unobservable random process that affects the outcome.

Under exogenous sample selection, probit consistently estimates the regression coefficients, which determine the effects conditional on the covariates. But estimates of the population-averaged parameters will be inconsistent when the model covariates are correlated with the selection process. To get consistent estimates of the population-averaged parameters in this case, we use inverse-probability weighting to reweight the data so that our estimates reflect both the fully observed and the partially observed observations.

This estimator uses the same trick as the inverse-probability-weighted (IPW) estimators used in causal inference. For two examples of IPW estimators applied to causal inference, see http://blog.stata.com/2016/09/13/an-ordered-probit-inverse-probability-weighted-ipw-estimator/ and http://blog.stata.com/2014/12/08/using-gmm-to-solve-two-step-estimation-problems/.

Exogenous sample selection

Consider the case where we draw a simple random sample from a population at some point in time (\(t_{1}\)) and then survey the same respondents again at a later point (\(t_{2}\)). If we only partly observe our variables of interest at \(t_{2}\), for example, because of panel attrition, then estimates of population-averaged parameters are inconsistent if selection into \(t_{2}\) is a function of the covariates used to model the outcome.

To illustrate, let’s suppose we have a binary outcome variable \(y_{i}\) that is only observed at \(t_{2}\) and we wish to fit the following probit model,

\[y_{i}
= 1[\beta_{0} + \beta_{1}d_{i} + \beta_{2}z_{1,i} + \beta_{3}z_{2,i} + e_{i} >
0] = 1[{\bf x}_i{\boldsymbol \beta} + e_{i} > 0]\]

where only \(z_{1,i}\) and \(z_{2,i}\) are observed for the full sample at \(t_{1}\), while \(d_{i}\) and \(y_{i}\) are not. Even if \(z_{1,i}\) or \(z_{2,i}\) is correlated with respondents dropping out of the sample, we could still consistently estimate the conditional parameters in \({\boldsymbol \beta}\); however, we cannot consistently estimate population-averaged effects from our model using the reduced sample. Let’s take a look at a snippet from our (fictitious) dataset:

. li id s y d z1 z2 in 1/5

     +-----------------------------------------+
     | id   s   y   d           z1          z2 |
     |-----------------------------------------|
  1. |  1   1   0   0    .87519349   .53714872 |
  2. |  2   1   0   1   -.02251873   .51122735 |
  3. |  3   0   .   .     .4885629   .07667308 |
  4. |  4   0   .   .   -.95665816   .73366976 |
  5. |  5   0   .   .   -1.2078948   .32982661 |
     +-----------------------------------------+

We see that only observations 1 and 2 were observed at both times, while observations 3, 4, and 5 dropped out. Suppose we went ahead, fit our model to the observed subsample, and wanted to estimate, say, the marginal means for the binary variable \(d_i\). We use probit to estimate the parameters of the model and then use margins with the over() option to estimate the marginal means.

. probit y i.d z1 z2

Iteration 0:   log likelihood = -6387.4076
Iteration 1:   log likelihood = -5009.8148
Iteration 2:   log likelihood = -5000.3354
Iteration 3:   log likelihood = -5000.3285
Iteration 4:   log likelihood = -5000.3285

Probit regression                               Number of obs     =      9,234
                                                LR chi2(3)        =    2774.16
                                                Prob > chi2       =     0.0000
Log likelihood = -5000.3285                     Pseudo R2         =     0.2172

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         1.d |  -.9754357   .0296157   -32.94   0.000    -1.033481   -.9173901
          z1 |  -.6995319   .0185883   -37.63   0.000    -.7359644   -.6630995
          z2 |    .998608    .052668    18.96   0.000     .8953806    1.101835
       _cons |  -.3024672   .0359367    -8.42   0.000    -.3729019   -.2320326
------------------------------------------------------------------------------

. margins, over(d)

Predictive margins                              Number of obs     =      9,234
Model VCE    : OIM

Expression   : Pr(y), predict()
over         : d

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           d |
          0  |   .6828357    .006175   110.58   0.000     .6707329    .6949384
          1  |   .3705836   .0063825    58.06   0.000     .3580741     .383093
------------------------------------------------------------------------------

Because we simulated the dataset, we know that the true population parameter is 0.56 for \(d_{i}=0\) and 0.26 for \(d_{i}=1\). Although the conditional parameters of the probit model are estimated consistently, our marginal mean estimates are far off. However, if we have a good model for the sample-selection process, and if we observe the relevant variables for that model, we can use it to create weights that correct for the sample selection. That is, we need a model for each respondent’s probability of selecting him- or herself into the follow-up, and we then weight the reduced-sample model by the inverse of that probability.

When we can model sample selection based on variables we observe, we speak of exogenous sample selection (unlike endogenous sample selection, where unobservables are correlated across the selection and outcome models). For details about IPW sample-selection models, see Wooldridge (2010), section 19.8. Also notice that the basic idea here is the same as that for IPW treatment-effects estimators; see http://blog.stata.com/2016/09/13/an-ordered-probit-inverse-probability-weighted-ipw-estimator/ and http://blog.stata.com/2015/07/07/introduction-to-treatment-effects-in-stata-part-1/.

While this sounds straightforward, proper estimation of our marginal means requires some care because we have a multistage estimation problem: we have a selection model, an outcome model, and we want to estimate marginal means. One way of solving this problem is to estimate everything simultaneously by using a generalized method-of-moments estimator (http://blog.stata.com/2014/12/08/using-gmm-to-solve-two-step-estimation-problems/).

Model and estimator

Modeling the sample selection using a probit model with \(s_i\) being the selection indicator, we have

\[s_{i} = 1[\gamma_{0} + \gamma_{1}z_{1,i} + \gamma_{2}z_{2,i} +
\gamma_{3}z_{3,i} + u_{i} > 0] = 1[{\bf z}_i{\boldsymbol \gamma} + u_i > 0]\]

The conditional probability of selection is

\[
P(s_i=1 \vert {\boldsymbol z}_i) = \Phi({\bf z}_i{\boldsymbol \gamma})
\]

We can use the inverse of this probability as a weight in estimating the model parameters and population-averaged parameters using the fully observed sample. Intuitively, using the inverse-probability weight will correct the estimate to reflect both the fully and partially observed observations.
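To see how this correction works, consider a stripped-down simulation in Python (an illustration, not the blog’s Stata example; the single covariate z, the coefficients, and the selection equation are all made up for this sketch). Units with high z are both more likely to stay in the sample and more likely to have y = 1, so the selected-sample mean of y is biased upward; weighting each selected unit by the inverse of its selection probability lets it stand in for the similar units that dropped out.

```python
import math
import random

random.seed(12345)

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n = 200_000
ipw_num = ipw_den = 0.0      # inverse-probability-weighted sums (selected units)
sel_num = sel_den = 0.0      # unweighted sums over selected units
pop_num = 0.0                # full-population sum of y

for _ in range(n):
    z = random.gauss(0.0, 1.0)
    p = Phi(0.5 + z)                        # P(selected | z), assumed known here
    s = random.random() < p                 # selection indicator
    y = 1 if (0.2 + 0.8 * z + random.gauss(0.0, 1.0)) > 0 else 0
    pop_num += y
    if s:
        sel_num += y
        sel_den += 1
        w = 1.0 / p                         # each selected unit stands in for 1/p units
        ipw_num += w * y
        ipw_den += w

print(f"population mean of y : {pop_num / n:.3f}")
print(f"selected-sample mean : {sel_num / sel_den:.3f}")   # biased: high-z units over-represented
print(f"IPW mean             : {ipw_num / ipw_den:.3f}")   # close to the population mean
```

Here the selection probability is known. In practice, as in this post, it must be estimated with a probit model, which is why the estimation error in \({\boldsymbol \gamma}\) has to be accounted for, for example, by estimating everything jointly with gmm.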

For the expectations of interest, we have

\begin{eqnarray*}
E(y_i\vert d_i) &=& E\left\{s_i\,\Phi({\bf z}_i{\boldsymbol \gamma})^{-1}\, E(y_i\vert d_i,{\bf z}_i) \,\Big\vert\, d_i\right\} \\
&=& E\left\{s_i\,\Phi({\bf z}_i{\boldsymbol \gamma})^{-1}\, \Phi({\bf x}_i{\boldsymbol \beta}) \,\Big\vert\, d_i\right\}
\end{eqnarray*}

We will use the inverse-probability weight in moment conditions as we estimate the model parameters and marginal means using the generalized method of moments. Because we use a probit model for both the outcome and the selection process, we can use the same moment conditions for both, except that they are evaluated on different samples (the full and the reduced sample). Here we use the first-order derivatives of the probit log-likelihood function as moment conditions, so that the GMM point estimates coincide with the maximum likelihood estimates. For the selection model, we have the sample moment conditions

\[
\sum_{i=1}^{N} \Bigg[ \Bigg\{ s_{i} \frac{\phi({\bf z}_{i}\boldsymbol{\gamma})}{\Phi({\bf z}_{i}\boldsymbol{\gamma})} - (1-s_{i})
\frac{\phi({\bf z}_{i}\boldsymbol{\gamma})}{\Phi(-{\bf z}_{i}\boldsymbol{\gamma})} \Bigg\} {\bf z}_{i} \Bigg] = 0
\]
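To see where this expression comes from, note that it is the score of the probit log likelihood. Each observation contributes \(\ell_i = s_i \ln \Phi({\bf z}_i{\boldsymbol \gamma}) + (1-s_i) \ln \Phi(-{\bf z}_i{\boldsymbol \gamma})\), and differentiating with respect to \({\boldsymbol \gamma}\) (using \(\phi(-a)=\phi(a)\)) gives

\[
\frac{\partial \ell_i}{\partial {\boldsymbol \gamma}} =
\Bigg\{ s_{i} \frac{\phi({\bf z}_{i}\boldsymbol{\gamma})}{\Phi({\bf z}_{i}\boldsymbol{\gamma})} - (1-s_{i})
\frac{\phi({\bf z}_{i}\boldsymbol{\gamma})}{\Phi(-{\bf z}_{i}\boldsymbol{\gamma})} \Bigg\} {\bf z}_{i}
\]

Setting the sum of these scores to zero is exactly the moment condition above, which is why the GMM point estimates coincide with the probit maximum likelihood estimates.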

Let \(S\) be the indices for the fully observed sample. For the outcome model we have sample moment conditions

\[
\sum_{i\in S} \Phi({\bf z}_{i}\boldsymbol{\gamma})^{-1} \Bigg[ \Bigg\{ y_{i} \frac{\phi({\bf x}_{i}\boldsymbol{\beta})}{\Phi({\bf x}_{i}\boldsymbol{\beta})} - (1-y_{i})
\frac{\phi({\bf x}_{i}\boldsymbol{\beta})}{\Phi(-{\bf x}_{i}\boldsymbol{\beta})} \Bigg\} {\bf x}_{i} \Bigg] = 0
\]

Finally, the sample moment conditions of our marginal parameters are

\[
\sum_{i\in S} \Phi({\bf z}_{i}{\boldsymbol \gamma})^{-1}
\Bigg[ \Bigg\{\Phi({\bf x}_{i}{\boldsymbol \beta}) - \mu_{0} \Bigg\} (1-d_{i}) \Bigg] = 0
\]

\[
\sum_{i\in S} \Phi({\bf z}_{i}{\boldsymbol \gamma})^{-1}
\Bigg[ \Bigg\{\Phi({\bf x}_{i}{\boldsymbol \beta}) - \mu_{1} \Bigg\} d_{i} \Bigg] = 0
\]
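Solving these two conditions for \(\mu_{0}\) and \(\mu_{1}\) shows what they estimate: the marginal means are inverse-probability-weighted averages of the predicted probabilities over the fully observed sample,

\[
\hat{\mu}_{0} =
\frac{\sum_{i\in S} \Phi({\bf z}_{i}\hat{\boldsymbol \gamma})^{-1} (1-d_{i})\, \Phi({\bf x}_{i}\hat{\boldsymbol \beta})}
     {\sum_{i\in S} \Phi({\bf z}_{i}\hat{\boldsymbol \gamma})^{-1} (1-d_{i})}
\qquad
\hat{\mu}_{1} =
\frac{\sum_{i\in S} \Phi({\bf z}_{i}\hat{\boldsymbol \gamma})^{-1} d_{i}\, \Phi({\bf x}_{i}\hat{\boldsymbol \beta})}
     {\sum_{i\in S} \Phi({\bf z}_{i}\hat{\boldsymbol \gamma})^{-1} d_{i}}
\]

so respondents who were unlikely to remain in the sample receive larger weights in the averages.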

Estimation

Now, we estimate our parameters with gmm, using its interactive syntax. We first fit the probit model alone, without the marginal means:

. gmm (eq1: (y*normalden({xb : i.d z1 z2 _cons})/normal({xb:})-   
>       (1-y)*normalden(-{xb:})/normal(-{xb:}))),                 
>     instruments(eq1: i.d z1 z2)                                 
>     winitial(unadjusted, independent) onestep

Step 1
Iteration 0:   GMM criterion Q(b) =   .1679307
Iteration 1:   GMM criterion Q(b) =  .00078746
Iteration 2:   GMM criterion Q(b) =  4.901e-07
Iteration 3:   GMM criterion Q(b) =  3.725e-13
Iteration 4:   GMM criterion Q(b) =  2.405e-25

note: model is exactly identified

GMM estimation

Number of parameters =   4
Number of moments    =   4
Initial weight matrix: Unadjusted                 Number of obs   =      9,234

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         1.d |  -.9754357   .0297148   -32.83   0.000    -1.033676   -.9171958
          z1 |  -.6995319   .0184779   -37.86   0.000    -.7357479   -.6633159
          z2 |    .998608    .052408    19.05   0.000     .8958901    1.101326
       _cons |  -.3024672   .0355903    -8.50   0.000    -.3722229   -.2327115
------------------------------------------------------------------------------
Instruments for equation eq1: 0b.d 1.d z1 z2 _cons

The true population parameters in \({\boldsymbol \beta}\) are \(\beta_{0}=-0.3\), \(\beta_{1}=-1\), \(\beta_{2}=-0.7\), and \(\beta_{3}=1\), and we can see that our estimates are very close to these. These results are equivalent to those we obtained before using probit. Now, we fit the same probit model but add the marginal means:

. gmm (eq1: (y*normalden({xb : i.d z1 z2 _cons})/normal({xb:})-   
>       (1-y)*normalden(-{xb:})/normal(-{xb:})))                  
>     (eq2: (1-d)*(normal({xb:})-{mu0}) )                         
>     (eq3:     d*(normal({xb:})-{mu1}) ),                        
>     instruments(eq1: i.d z1 z2)                                 
>     winitial(unadjusted, independent) onestep

Step 1
Iteration 0:   GMM criterion Q(b) =  .29293128
Iteration 1:   GMM criterion Q(b) =  .00400745
Iteration 2:   GMM criterion Q(b) =  2.492e-06
Iteration 3:   GMM criterion Q(b) =  1.014e-11
Iteration 4:   GMM criterion Q(b) =  1.691e-22

note: model is exactly identified

GMM estimation

Number of parameters =   6
Number of moments    =   6
Initial weight matrix: Unadjusted                 Number of obs   =      9,234

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         1.d |  -.9754357   .0297148   -32.83   0.000    -1.033676   -.9171958
          z1 |  -.6995319   .0184779   -37.86   0.000    -.7357479   -.6633159
          z2 |    .998608    .052408    19.05   0.000     .8958901    1.101326
       _cons |  -.3024672   .0355903    -8.50   0.000    -.3722229   -.2327115
-------------+----------------------------------------------------------------
        /mu0 |   .6828357     .00683    99.98   0.000     .6694491    .6962222
        /mu1 |   .3705836   .0071107    52.12   0.000     .3566469    .3845202
------------------------------------------------------------------------------
Instruments for equation eq1: 0b.d 1.d z1 z2 _cons
Instruments for equation eq2: _cons
Instruments for equation eq3: _cons

The parameters labeled /mu0 and /mu1 are estimates of the marginal means for \(d_{i}=0\) and \(d_{i}=1\), respectively, not accounting for sample selection. Again, these marginal means are the same as those we obtained earlier from margins, and they are inconsistent estimates (the true values are 0.56 for \(d_{i}=0\) and 0.26 for \(d_{i}=1\)).

Finally, we fit our sample-selection model. We specify the nocommonesample option because we use different sets of observations across moment conditions:

. gmm (eq1: s*normalden({zb : z1 z2 z3 _cons})/normal({zb:})-     
>       (1-s)*normalden(-{zb:})/normal(-{zb:}))                   
>     (eq2: (y*normalden({xb : i.d z1 z2 _cons})/normal({xb:})-   
>       (1-y)*normalden(-{xb:})/normal(-{xb:}))/(normal({zb:})))  
>     (eq3: (1-d)*(normal({xb:})-{mu0}) / normal({zb:}))          
>     (eq4:     d*(normal({xb:})-{mu1}) / normal({zb:})),         
>     instruments(eq1: z1 z2 z3)                                  
>     instruments(eq2: i.d z1 z2)                                 
>     winitial(unadjusted, independent)                           
>     onestep nocommonesample

Step 1
Iteration 0:   GMM criterion Q(b) =  .57729426
Iteration 1:   GMM criterion Q(b) =  .03006396
Iteration 2:   GMM criterion Q(b) =  .00014618
Iteration 3:   GMM criterion Q(b) =  2.016e-08
Iteration 4:   GMM criterion Q(b) =  1.477e-16

note: model is exactly identified

GMM estimation

Number of parameters =  10
Number of moments    =  10
Initial weight matrix: Unadjusted                 Number of obs   =   *

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
zb           |
          z1 |   -.706703   .0115024   -61.44   0.000    -.7292473   -.6841587
          z2 |   .9562022   .0347235    27.54   0.000     .8881454    1.024259
          z3 |  -.9735917   .0345574   -28.17   0.000    -1.041323   -.9058605
       _cons |   -.112247   .0257628    -4.36   0.000    -.1627412   -.0617528
-------------+----------------------------------------------------------------
xb           |
         1.d |  -.9810332    .038836   -25.26   0.000     -1.05715   -.9049161
          z1 |  -.6651958   .0435223   -15.28   0.000    -.7504979   -.5798937
          z2 |   .9877247   .0986749    10.01   0.000     .7943255    1.181124
       _cons |  -.2775556   .0726005    -3.82   0.000    -.4198501   -.1352612
-------------+----------------------------------------------------------------
        /mu0 |   .5628197    .011652    48.30   0.000     .5399823    .5856572
        /mu1 |   .2694232    .007469    36.07   0.000     .2547843    .2840621
------------------------------------------------------------------------------
* Number of observations for equation eq1: 20000
  Number of observations for equation eq2: 9234
  Number of observations for equation eq3: 9234
  Number of observations for equation eq4: 9234
------------------------------------------------------------------------------
Instruments for equation eq1: z1 z2 z3 _cons
Instruments for equation eq2: 0b.d 1.d z1 z2 _cons
Instruments for equation eq3: _cons
Instruments for equation eq4: _cons

We can see that our estimates of the marginal means are now close to the true values.

Conclusion

We demonstrated how to use gmm to estimate population-averaged parameters with an IPW estimator. This solves a missing data problem arising from an exogenous sample-selection process.

Appendix

Here is the code that we used for generating the dataset:

drop _all
set seed 123
qui set obs 20000
generate double d  = runiform() > .5
generate double z1 = rnormal()
generate double z2 = runiform()
generate double z3 = runiform()
generate double u  = rnormal()
generate double e  = rnormal()
generate double zb = (-0.1 - 0.7*z1 + z2 - z3)
generate double xb = (-0.3 - d - 0.7*z1 + z2)
generate double s  = (zb + u) > 0
generate double y  = (xb + e) > 0
qui replace y = . if s == 0
qui replace d = . if s == 0
generate id = _n
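As a cross-check, the same data-generating process can be replicated outside Stata. The Python sketch below is an illustration (it is not the blog’s code, and it uses the true selection probability \(\Phi({\bf z}_i{\boldsymbol \gamma})\) as an oracle weight rather than an estimated one): it redraws data from the appendix design and compares the full-population means of \(y\) by \(d\) with the IPW estimates computed from the selected units only. Both should land near the true values of 0.56 and 0.26.

```python
import math
import random

random.seed(123)

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n = 200_000
pop = {0: [0.0, 0], 1: [0.0, 0]}      # [sum of y, count] over the full population
ipw = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # [weighted sum of y, sum of weights] over selected units

for _ in range(n):
    d  = 1 if random.random() > 0.5 else 0
    z1 = random.gauss(0.0, 1.0)
    z2 = random.random()
    z3 = random.random()
    zb = -0.1 - 0.7 * z1 + z2 - z3    # selection index, as in the appendix
    xb = -0.3 - d - 0.7 * z1 + z2     # outcome index, as in the appendix
    s  = (zb + random.gauss(0.0, 1.0)) > 0
    y  = 1 if (xb + random.gauss(0.0, 1.0)) > 0 else 0
    pop[d][0] += y
    pop[d][1] += 1
    if s:
        w = 1.0 / Phi(zb)             # oracle inverse-probability weight
        ipw[d][0] += w * y
        ipw[d][1] += w

for d in (0, 1):
    print(f"d={d}: population mean {pop[d][0]/pop[d][1]:.3f}, "
          f"IPW estimate {ipw[d][0]/ipw[d][1]:.3f}")
```

In the post, the oracle weight is replaced by the estimated probability \(\Phi({\bf z}_i\hat{\boldsymbol \gamma})\), and the joint GMM estimation propagates the uncertainty from estimating \({\boldsymbol \gamma}\) into the standard errors of the marginal means.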

Reference

Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.

  • ravas

    Hi, why does the method use the inverse of the conditional probability of selection instead of the probability itself?
    I know Wooldridge uses the same approach, but to me it is counterintuitive to assign more weight, via the inverse probability, to the less likely occurrences and less weight to the most likely ones.