
Estimation under omitted confounders, endogeneity, omitted variable bias, and related problems

Initial thoughts

Estimating causal relationships from data is one of the fundamental endeavors of researchers, but causality is elusive. In the presence of omitted confounders, endogeneity, omitted variables, or a misspecified model, estimates of predicted values and effects of interest are inconsistent; causality is obscured.

A controlled experiment to estimate causal relations is an alternative. Yet conducting a controlled experiment may be infeasible. Policy makers cannot randomize taxation, for example. In the absence of experimental data, an option is to use instrumental variables or a control function approach.

Stata has many built-in estimators to implement these potential solutions and tools to construct estimators for situations that are not covered by built-in estimators. Below I illustrate both possibilities for a linear model and, in a later post, will talk about nonlinear models.

Linear model example

I start with a linear model with two covariates, \(x_1\) and \(x_2\). In this model, \(x_1\) is exogenous: it is unrelated to the error term, \(\varepsilon\), a condition given by \(E\left(x_1\varepsilon \right) = 0\). \(x_2\) is endogenous: it is related to the error term, which is given by \(E\left(x_2\varepsilon \right) \neq 0\). The model is given by

\begin{eqnarray*}
y &=& \beta_0 + x_1\beta_1 + x_2\beta_2 + \varepsilon \\
E\left(x_1\varepsilon \right) &=& 0 \\
E\left(x_2\varepsilon \right) &\neq& 0 \\
\end{eqnarray*}

The fact that \(x_2\) is related to the unobservable component \(\varepsilon\) means that fitting this model using linear regression yields inconsistent parameter estimates.

One option is to use a two-stage least-squares estimator. For two-stage least squares to be valid, I need to correctly specify a model for \(x_2\) that includes a variable, \(z_1\), that affects \(x_2\) but does not appear in the outcome equation. I also need \(z_1\) and \(x_1\) to be unrelated to the unobservable of the outcome equation, \(\varepsilon\), and to the unobservable of the equation for \(x_2\), \(\nu\). These conditions are expressed by

\begin{eqnarray}
x_2 &=& \Pi_0 + z_1\Pi_1 + x_1\Pi_2 + \nu \label{instrument}\tag{1} \\
E\left(z'\varepsilon \right)&=& E\left(z'\nu \right) = 0 \label{scores} \tag{2}\\
z &\equiv & \left[z_1 \quad x_1 \right] \notag
\end{eqnarray}

The relationship in \eqref{instrument} implies that \(x_2\) can be split into two components: one that is related to \(\varepsilon\), and is therefore the crux of the problem, \(\nu\), and another that is unrelated to \(\varepsilon\), \(\Pi_0 + z_1\Pi_1 + x_1\Pi_2\). The key to two-stage least squares is to get a consistent estimator of the latter component of \(x_2\).
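Substituting \eqref{instrument} into the outcome equation shows why this works:

\begin{eqnarray*}
y &=& \beta_0 + x_1\beta_1 + \left(\Pi_0 + z_1\Pi_1 + x_1\Pi_2\right)\beta_2 + \left(\nu\beta_2 + \varepsilon\right)
\end{eqnarray*}

By \eqref{scores}, the composite error \(\nu\beta_2 + \varepsilon\) is unrelated to \(z\), so regressing \(y\) on \(x_1\) and the fitted values of \(x_2\) yields consistent estimates of \(\beta_1\) and \(\beta_2\).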

Below I simulate data that satisfy the assumptions above.

set obs 1000
set seed 111
generate e = rchi2(2) - 2
generate v = 0.5*e + rchi2(1) - 1
generate x1 = rchi2(1)-1
generate z1 = rchi2(1)-2
generate x2 = 2*(1 - x1 - z1) + v
generate y  = 2*(1 - x1 - x2) + e
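For comparison, fitting the outcome equation by ordinary least squares ignores the correlation between \(x_2\) and \(\varepsilon\); on these simulated data, the resulting estimates are inconsistent for the true values of \(-2\), \(-2\), and \(2\):

regress y x1 x2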

If I estimate the model parameters using two-stage least squares, I obtain

. ivregress 2sls y x1 (x2 = z1)

Instrumental variables (2SLS) regression          Number of obs   =      1,000
                                                  Wald chi2(2)    =    9351.74
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.9124
                                                  Root MSE        =     2.1175

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x2 |   -2.00309   .0227842   -87.92   0.000    -2.047746   -1.958434
          x1 |  -2.010345   .0665863   -30.19   0.000    -2.140851   -1.879838
       _cons |   2.098502   .1158818    18.11   0.000     1.871378    2.325626
------------------------------------------------------------------------------
Instrumented:  x2
Instruments:   x1 z1

I recover the true coefficient values: \(-2\) for x1, \(-2\) for x2, and \(2\) for the constant.

I can also recover the parameters of the model via structural equation modeling using sem. The key here is to specify two linear equations and state that the unobservable components of both equations are correlated. Interestingly, sem estimation assumes joint normality of the unobservables, which is not satisfied by the model, yet I obtain consistent estimates as illustrated by the coefficient values in the equation for y in the output table below:

. sem (y <- x1 x2) (x2 <- x1 z1), cov(e.y*e.x2) nolog

Endogenous variables

Observed:  y x2

Exogenous variables

Observed:  x1 z1

Structural equation model                       Number of obs     =      1,000
Estimation method  = ml
Log likelihood     = -7564.0866

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  y <-       |
          x2 |   -2.00309   .0227842   -87.92   0.000    -2.047746   -1.958434
          x1 |  -2.010345   .0665863   -30.19   0.000    -2.140851   -1.879838
       _cons |   2.098502   .1158818    18.11   0.000     1.871378    2.325626
  -----------+----------------------------------------------------------------
  x2 <-      |
          x1 |   -1.97412   .0431461   -45.75   0.000    -2.058685   -1.889555
          z1 |  -1.958336   .0394089   -49.69   0.000    -2.035576   -1.881096
       _cons |   2.164649   .0713838    30.32   0.000     2.024739    2.304558
-------------+----------------------------------------------------------------
     var(e.y)|   4.483833   .2266015                      4.060989    4.950704
    var(e.x2)|    3.49781   .1564268                      3.204271    3.818238
-------------+----------------------------------------------------------------
cov(e.y,e.x2)|   2.316083   .1655267    13.99   0.000     1.991657     2.64051
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 =      .

The syntax of sem requires that I write the two linear equations, specify each equation's dependent variable using <-, and state that the unobservables of the two endogenous variables, denoted by e.y and e.x2, are correlated. The correlation is specified using the option cov(e.y*e.x2).

The coefficients and standard errors I obtain using sem are exactly the same as those from two-stage least squares. This equivalence arises between moment-based estimators, like two-stage least squares and the generalized method of moments (GMM), and likelihood- and quasilikelihood-based estimators whenever the moment conditions and the score equations coincide. Even though the assumptions differ, the estimating equations are the same; for these models, they are given by \eqref{scores}.
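Written out in terms of the residuals, the shared estimating equations set the instrument-residual products to zero in expectation:

\begin{eqnarray*}
E\left[z'\left\{y - \left(\beta_0 + x_1\beta_1 + x_2\beta_2\right)\right\}\right] &=& 0 \\
E\left[z'\left\{x_2 - \left(\Pi_0 + z_1\Pi_1 + x_1\Pi_2\right)\right\}\right] &=& 0
\end{eqnarray*}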

I could also fit this model using GMM implemented in gmm. Here is one way to do this:

  1. Write the residuals of the equations of the endogenous variables. In this example, \(\varepsilon = y - (\beta_0 + x_1\beta_1 + x_2\beta_2)\) and \(\nu = x_2 - (\Pi_0 + z_1\Pi_1 + x_1\Pi_2)\).

  2. Use all the exogenous variables in the system as instruments, in this case, \(x_1\) and \(z_1\).

Using gmm gives us

. gmm (eq1: y  - {xb: x1 x2 _cons})          
> (eq2: x2 - {xpi: x1 z1 _cons}),            
> instruments(x1 z1)                         
> winitial(unadjusted, independent) nolog

Final GMM criterion Q(b) =  7.35e-32

note: model is exactly identified

GMM estimation

Number of parameters =   6
Number of moments    =   6
Initial weight matrix: Unadjusted                 Number of obs   =      1,000
GMM weight matrix:     Robust

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb           |
          x1 |  -2.010345   .0776184   -25.90   0.000    -2.162474   -1.858215
          x2 |   -2.00309   .0323531   -61.91   0.000    -2.066501   -1.939679
       _cons |   2.098502   .1530806    13.71   0.000     1.798469    2.398534
-------------+----------------------------------------------------------------
xpi          |
          x1 |   -1.97412   .0380642   -51.86   0.000    -2.048724   -1.899515
          z1 |  -1.958336   .0448162   -43.70   0.000    -2.046174   -1.870498
       _cons |   2.164649   .0772429    28.02   0.000     2.013255    2.316042
------------------------------------------------------------------------------
Instruments for equation eq1: x1 z1 _cons
Instruments for equation eq2: x1 z1 _cons

Once again, I obtain the same parameter values as with ivregress and sem. However, the standard errors are different. The reason is that gmm computes robust standard errors by default. If I refit the model with ivregress using robust standard errors, the results are again exactly the same:

. ivregress 2sls y x1 (x2 = z1), vce(robust)

Instrumental variables (2SLS) regression          Number of obs   =      1,000
                                                  Wald chi2(2)    =    6028.31
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.9124
                                                  Root MSE        =     2.1175

------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x2 |   -2.00309   .0323531   -61.91   0.000    -2.066501   -1.939679
          x1 |  -2.010345   .0776184   -25.90   0.000    -2.162474   -1.858215
       _cons |   2.098502   .1530806    13.71   0.000     1.798469    2.398534
------------------------------------------------------------------------------
Instrumented:  x2
Instruments:   x1 z1

Another way to obtain the parameters of interest is a control function approach. This approach uses the residuals from a regression of the endogenous variable \(x_2\) on \(x_1\) and the instrument \(z_1\) as an additional regressor in the regression of \(y\) on \(x_1\) and \(x_2\). Below I implement the control function approach using gmm.

. local xb ({b1}*x1 + {b2}*x2 + {b3}*(x2-{xpi:}) + {b0})

. gmm (eq1: x2 - {xpi: x1 z1 _cons})          
>     (eq2: y  - `xb')                                
>     (eq3: (y  - `xb')*(x2-{xpi:})),         
> instruments(eq1: x1 z1)                     
> instruments(eq2: x1 z1)                     
> winitial(unadjusted, independent) nolog

Final GMM criterion Q(b) =  1.02e-31

note: model is exactly identified

GMM estimation

Number of parameters =   7
Number of moments    =   7
Initial weight matrix: Unadjusted                 Number of obs   =      1,000
GMM weight matrix:     Robust

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   -1.97412   .0380642   -51.86   0.000    -2.048724   -1.899515
          z1 |  -1.958336   .0448162   -43.70   0.000    -2.046174   -1.870498
       _cons |   2.164649   .0772429    28.02   0.000     2.013255    2.316042
-------------+----------------------------------------------------------------
         /b1 |  -2.010345   .0776184   -25.90   0.000    -2.162474   -1.858215
         /b2 |   -2.00309   .0323531   -61.91   0.000    -2.066501   -1.939679
         /b3 |   .6621525   .0700282     9.46   0.000     .5248998    .7994052
         /b0 |   2.098502   .1530806    13.71   0.000     1.798469    2.398534
------------------------------------------------------------------------------
Instruments for equation eq1: x1 z1 _cons
Instruments for equation eq2: x1 z1 _cons
Instruments for equation eq3: _cons

As in the previous examples, I define residuals and instruments, and gmm creates a moment condition using these two pieces of information. In the example above, the residuals from the regression of the endogenous variable on the exogenous variables of the model are at the same time residuals and instruments. Thus I do not include them as an exogenous instrument. Instead, I construct the moment condition for the residuals from the regression of the endogenous variable manually in eq3.

Using the control function approach again gives the same results as in the three previous cases.
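For intuition, the control function point estimates can also be reproduced in two manual steps: regress \(x_2\) on the exogenous variables, save the residuals, and add them to the outcome regression. This is only a sketch; the residual variable name vhat is my own choice, and the second-stage standard errors do not account for the first-stage estimation, which is why the joint gmm formulation above is preferable.

regress x2 x1 z1
predict vhat, residuals
regress y x1 x2 vhat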

In the first example, I used a built-in estimator. In the last two examples, I used estimation tools that let me construct estimators for a large class of models.

Concluding remarks

Estimating the parameters of a model in the presence of endogeneity or related problems (http://blog.stata.com/tag/endogeneity/) is daunting. Above, I illustrated how to estimate the parameters of such models using Stata commands created for that purpose, and I also showed how to use gmm and sem to construct these estimators.
