Estimation under omitted confounders, endogeneity, omitted variable bias, and related problems
Initial thoughts
Estimating causal relationships from data is one of the fundamental endeavors of researchers, but causality is elusive. In the presence of omitted confounders, endogeneity, omitted variables, or a misspecified model, estimates of predicted values and effects of interest are inconsistent; causality is obscured.
A controlled experiment to estimate causal relations is an alternative. Yet conducting a controlled experiment may be infeasible. Policy makers cannot randomize taxation, for example. In the absence of experimental data, an option is to use instrumental variables or a control function approach.
Stata has many builtin estimators to implement these potential solutions and tools to construct estimators for situations that are not covered by builtin estimators. Below I illustrate both possibilities for a linear model and, in a later post, will talk about nonlinear models.
Linear model example
I start with a linear model with two covariates, \(x_1\) and \(x_2\). In this model, \(x_1\) is unrelated to the error term, \(\varepsilon\); this is given by the condition \(E\left(x_1\varepsilon \right) = 0\). \(x_1\) is exogenous. \(x_2\) is related to the error term; this is given by \(E\left(x_2\varepsilon \right) \neq 0\). \(x_2\) is endogenous. The model is given by
\begin{eqnarray*}
y &=& \beta_0 + x_1\beta_1 + x_2\beta_2 + \varepsilon \\
E\left(x_1\varepsilon \right) &=& 0 \\
E\left(x_2\varepsilon \right) &\neq& 0 \\
\end{eqnarray*}
The fact that \(x_2\) is related to the unobservable component \(\varepsilon\) means that fitting this model using linear regression yields inconsistent parameter estimates.
One option is to use a twostage leastsquares estimator. For twostage least squares to be valid, I need to correctly specify a model for \(x_2\) that includes a variable, \(z_1\), that is unrelated to the unobservables of the outcome of interest and \(x_1\). We also need \(z_1\) and \(x_1\) to be unrelated to the unobservable of the outcome, \(\varepsilon\), and the unobservable of the equation for \(x_2\). These conditions are expressed by
\begin{eqnarray}
x_2 &=& \Pi_0 + z_1\Pi_1 + x_1\Pi_2 + \nu \label{instrument}\tag{1} \\
E\left(z’\varepsilon \right)&=& E\left(z’\nu \right) = 0 \label{scores} \tag{2}\\
z &\equiv & \left[z_1 \quad x_1 \right] \notag
\end{eqnarray}
The relationship in \eqref{instrument} implies that \(x_2\) can be split into two components: one that is related to \(\varepsilon\), and is therefore the crux of the problem, \(\nu\), and another that is unrelated to \(\varepsilon\), \(\Pi_0 + z_1\Pi_1 + x_1\Pi_2\). The key to twostage least squares is to get a consistent estimator of the latter component of \(x_2\).
Below I simulate data that satisfy the assumptions above.
set obs 1000 set seed 111 generate e = rchi2(2)  2 generate v = 0.5*e + rchi2(1)  1 generate x1 = rchi2(1)1 generate z1 = rchi2(1)2 generate x2 = 2*(1  x1  z1) + v generate y = 2*(1  x1  x2) + e
If I estimate the model parameters using twostaged least squares, I obtain
. ivregress 2sls y x1 (x2 = z1) Instrumental variables (2SLS) regression Number of obs = 1,000 Wald chi2(2) = 9351.74 Prob > chi2 = 0.0000 Rsquared = 0.9124 Root MSE = 2.1175  y  Coef. Std. Err. z P>z [95% Conf. Interval] + x2  2.00309 .0227842 87.92 0.000 2.047746 1.958434 x1  2.010345 .0665863 30.19 0.000 2.140851 1.879838 _cons  2.098502 .1158818 18.11 0.000 1.871378 2.325626  Instrumented: x2 Instruments: x1 z1
I recover the coefficient values for the covariates, which are \(2\) for x1, \(2\) for x2, and 2 for the constant.
I can also recover the parameters of the model via structural equation modeling using sem. The key here is to specify two linear equations and state that the unobservable components of both equations are correlated. Interestingly, sem estimation assumes joint normality of the unobservables, which is not satisfied by the model, yet I obtain consistent estimates as illustrated by the coefficient values in the equation for y in the output table below:
. sem (y < x1 x2) (x2 < x1 z1), cov(e.y*e.x2) nolog Endogenous variables Observed: y x2 Exogenous variables Observed: x1 z1 Structural equation model Number of obs = 1,000 Estimation method = ml Log likelihood = 7564.0866   OIM  Coef. Std. Err. z P>z [95% Conf. Interval] + Structural  y <  x2  2.00309 .0227842 87.92 0.000 2.047746 1.958434 x1  2.010345 .0665863 30.19 0.000 2.140851 1.879838 _cons  2.098502 .1158818 18.11 0.000 1.871378 2.325626 + x2 <  x1  1.97412 .0431461 45.75 0.000 2.058685 1.889555 z1  1.958336 .0394089 49.69 0.000 2.035576 1.881096 _cons  2.164649 .0713838 30.32 0.000 2.024739 2.304558 + var(e.y) 4.483833 .2266015 4.060989 4.950704 var(e.x2) 3.49781 .1564268 3.204271 3.818238 + cov(e.y,e.x2) 2.316083 .1655267 13.99 0.000 1.991657 2.64051  LR test of model vs. saturated: chi2(0) = 0.00, Prob > chi2 = .
The syntax of sem requires that I write the two linear equations; establish which variables are endogenous using an <; and state that the unobservables of the two endogenous variables, denoted by e.y and e.x2, are correlated. The correlation is specified using the option cov(e.y*e.x2).
The coefficients and standard errors I obtain using sem are exactly the same as those from twostage least squares. This equivalence occurs between momentbased estimation, like twostage least squares and the generalized method of moments (GMM), and likelihood and quasilikelihoodbased estimators, when the moment conditions and score equations are the same. Therefore, even if the assumptions are different, the estimating equations are the same. The estimating equations for these models are given by \eqref{scores}.
I could also fit this model using GMM implemented in gmm. Here is one way to do this:

Write the residuals of the equations of the endogenous variables. In this example, \(\varepsilon = y – (\beta_0 + x_1\beta_1 + x_2\beta_2)\) and \(\nu = x_2 – (\Pi_0 + z_1\Pi_1 + x_1\Pi_2)\).

Use all the exogenous variables in the system as instruments, in this case, \(x_1\) and \(z_1\).
Using gmm gives us
. gmm (eq1: y  {xb: x1 x2 _cons}) > (eq2: x2  {xpi: x1 z1 _cons}), > instruments(x1 z1) > winitial(unadjusted, independent) nolog Final GMM criterion Q(b) = 7.35e32 note: model is exactly identified GMM estimation Number of parameters = 6 Number of moments = 6 Initial weight matrix: Unadjusted Number of obs = 1,000 GMM weight matrix: Robust   Robust  Coef. Std. Err. z P>z [95% Conf. Interval] + xb  x1  2.010345 .0776184 25.90 0.000 2.162474 1.858215 x2  2.00309 .0323531 61.91 0.000 2.066501 1.939679 _cons  2.098502 .1530806 13.71 0.000 1.798469 2.398534 + xpi  x1  1.97412 .0380642 51.86 0.000 2.048724 1.899515 z1  1.958336 .0448162 43.70 0.000 2.046174 1.870498 _cons  2.164649 .0772429 28.02 0.000 2.013255 2.316042  Instruments for equation eq1: x1 z1 _cons Instruments for equation eq2: x1 z1 _cons
Once again, I obtain the same parameter values as with ivregress and gsem. However, the standard errors are different. The reason is that gmm computes robust standard errors by default. If I compute ivregress with robust standard errors, the results are again exactly the same:
. ivregress 2sls y x1 (x2 = z1), vce(robust) Instrumental variables (2SLS) regression Number of obs = 1,000 Wald chi2(2) = 6028.31 Prob > chi2 = 0.0000 Rsquared = 0.9124 Root MSE = 2.1175   Robust y  Coef. Std. Err. z P>z [95% Conf. Interval] + x2  2.00309 .0323531 61.91 0.000 2.066501 1.939679 x1  2.010345 .0776184 25.90 0.000 2.162474 1.858215 _cons  2.098502 .1530806 13.71 0.000 1.798469 2.398534  Instrumented: x2 Instruments: x1 z1
Another way to obtain the parameters of interest is using a control function approach. This uses the residuals, from a regression of the endogenous variable \(x_2\) on the instruments \(x_1\) and \(z_1\), as regressors for the regression of \(y\) on \(x_1\) and \(x_2\). Below I implement the control function approach using gmm.
. local xb ({b1}*x1 + {b2}*x2 + {b3}*(x2{xpi:}) + {b0}) . gmm (eq1: x2  {xpi: x1 z1 _cons}) > (eq2: y  `xb') > (eq3: (y  `xb')*(x2{xpi:})), > instruments(eq1: x1 z1) > instruments(eq2: x1 z1) > winitial(unadjusted, independent) nolog Final GMM criterion Q(b) = 1.02e31 note: model is exactly identified GMM estimation Number of parameters = 7 Number of moments = 7 Initial weight matrix: Unadjusted Number of obs = 1,000 GMM weight matrix: Robust   Robust  Coef. Std. Err. z P>z [95% Conf. Interval] + x1  1.97412 .0380642 51.86 0.000 2.048724 1.899515 z1  1.958336 .0448162 43.70 0.000 2.046174 1.870498 _cons  2.164649 .0772429 28.02 0.000 2.013255 2.316042 + /b1  2.010345 .0776184 25.90 0.000 2.162474 1.858215 /b2  2.00309 .0323531 61.91 0.000 2.066501 1.939679 /b3  .6621525 .0700282 9.46 0.000 .5248998 .7994052 /b0  2.098502 .1530806 13.71 0.000 1.798469 2.398534  Instruments for equation eq1: x1 z1 _cons Instruments for equation eq2: x1 z1 _cons Instruments for equation eq3: _cons
As in the previous examples, I define residuals and instruments, and gmm creates a moment condition using these two pieces of information. In the example above, the residuals from the regression of the endogenous variable on the exogenous variables of the model are at the same time residuals and instruments. Thus I do not include them as an exogenous instrument. Instead, I construct the moment condition for the residuals from the regression of the endogenous variable manually in eq3.
Using the control function approach again gives the same results as in the three previous cases.
In the first example, I used an estimator that exists in Stata. In the last two examples, I used estimation tools that allow us to obtain estimators for a large class of models.
Concluding remarks
Estimating the parameters of a model in the presence of endogeneity or related problems http://blog.stata.com/tag/endogeneity/ is daunting. Above I illustrated how to estimate the parameters of such models using commands in Stata that were created for such a purpose and also illustrated how you can use gmm and sem to estimate these models.