Understanding omitted confounders, endogeneity, omitted variable bias, and related concepts
Initial thoughts
Estimating causal relationships from data is one of the fundamental endeavors of researchers. Ideally, we would conduct a controlled experiment to estimate causal relationships. However, a controlled experiment is often infeasible. For example, education researchers cannot randomize educational attainment, so they must learn from observational data.
In the absence of experimental data, we construct models from observational data to capture the relevant features of the causal relationship we are interested in. Models are successful if the features we did not include can be ignored without affecting our ability to ascertain the causal relationship of interest. Sometimes, however, ignoring some features of reality yields models whose estimates cannot be interpreted causally. In a regression framework, depending on our discipline or research question, we give this phenomenon different names: endogeneity, omitted confounders, omitted variable bias, simultaneity bias, selection bias, etc.
Below I show how we can understand many of these problems in a unified regression framework and use simulated data to illustrate how they affect estimation and inference.
Framework
The following statements allow us to obtain a causal relationship in a regression framework.
\begin{eqnarray*}
y &=& g\left(X\right) + \varepsilon \\
E\left(\varepsilon|X\right) &=& 0
\end{eqnarray*}
In the expression above, \(y\) is the outcome vector of interest, \(X\) is a matrix of covariates, \(\varepsilon\) is a vector of unobservables, and \(g\left(X\right)\) is a vector-valued function. The statement \(E\left(\varepsilon|X\right) = 0\) implies that once we account for all the information in the covariates, what we did not include in our model, \(\varepsilon\), does not give us any information, on average. It also implies that, on average, we can infer the causal relationship between our outcome of interest and our covariates. In other words, it implies that
\begin{equation*}
E\left(y|X\right) = g\left(X\right)
\end{equation*}
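This follows from taking conditional expectations of both sides of the model:
\begin{equation*}
E\left(y|X\right) = E\left\{g\left(X\right) + \varepsilon|X\right\} = g\left(X\right) + E\left(\varepsilon|X\right) = g\left(X\right)
\end{equation*}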
The opposite occurs when
\begin{eqnarray*}
y &=& g\left(X\right) + \varepsilon \\
E\left(\varepsilon|X\right) &\neq& 0
\end{eqnarray*}
The expression \(E\left(\varepsilon|X\right) \neq 0\) implies that it does not suffice to control for the covariates \(X\) to obtain a causal relationship because the unobservables are not negligible when we incorporate the information of the covariates in our model.
Below I present three examples that fall into this framework. In the examples below, \(g\left(X\right)\) is linear, but the framework extends beyond linearity.
Example 1 (omitted variable bias and confounders). The true model is given by
\begin{eqnarray*}
y &=& X_1\beta_1 + X_2\beta_2 + \varepsilon \\
E\left(\varepsilon| X_1, X_2\right)&=& 0
\end{eqnarray*}
However, the researcher does not include the covariate matrix \(X_2\) in the model and believes that the relationship between the covariates and the outcome is given by
\begin{eqnarray*}
y &=& X_1\beta_1 + \eta \\
E\left(\eta|X_1\right)&=& 0
\end{eqnarray*}
If \(E\left(\eta|X_1\right)= 0\), the researcher will get correct inference about \(\beta_1\) from linear regression. However, \(E\left(\eta|X_1\right)= 0\) holds only if \(X_2\) is irrelevant once we incorporate the information in \(X_1\), that is, if \(E\left(X_2|X_1\right)=0\). To see this, we write
\begin{eqnarray*}
E\left(\eta|X_1\right)&=& E\left(X_2\beta_2 + \varepsilon| X_1\right) \\
&=& E\left(X_2|X_1\right)\beta_2 + E\left(\varepsilon| X_1\right) \\
&=& E\left(X_2|X_1\right)\beta_2
\end{eqnarray*}
If \(E\left(\eta|X_1\right) \neq 0\), we have omitted variable bias, which in this case comes from the relationship between the included and omitted variable, that is, \(E\left(X_2|X_1\right)\). Depending on your discipline, you would also refer to \(X_2\) as an omitted confounder.
Below I simulate data that exemplify omitted variable bias.
clear
capture set seed 111
quietly set obs 20000
local rho = .5

// Generating correlated regressors
generate x1 = rnormal()
generate x2 = `rho'*x1 + rnormal()

// Generating Model
quietly generate y = 1 + x1 - x2 + rnormal()
I first set a local macro, rho, that correlates the two regressors in the model. The two generate commands then create the correlated regressors, and the final command generates the outcome variable. Below I estimate the model excluding one of the regressors.
. regress y x1, vce(robust)

Linear regression                               Number of obs     =     20,000
                                                F(1, 19998)       =    2468.92
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1086
                                                Root MSE          =     1.4183

------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .4953172   .0099685    49.69   0.000     .4757781    .5148563
       _cons |   1.006971   .0100287   100.41   0.000     .9873138    1.026628
------------------------------------------------------------------------------
The estimated coefficient is 0.495, but we know that the true value is 1. Also, our confidence interval suggests that the true value is somewhere between 0.476 and 0.515. Estimation and inference are misleading.
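The size of the bias is exactly what the simulation implies. Substituting \(x_2 = \rho x_1 + u\), where \(u\) is an independent standard normal, into the true model gives
\begin{eqnarray*}
y &=& 1 + x_1 - \left(\rho x_1 + u\right) + \varepsilon \\
&=& 1 + \left(1 - \rho\right)x_1 + \left(\varepsilon - u\right)
\end{eqnarray*}
Because \(\varepsilon - u\) is mean independent of \(x_1\), the population coefficient on \(x_1\) in the short regression is \(1 - \rho = 0.5\), consistent with the estimate of 0.495. As a quick check with these simulated data, including the omitted regressor should recover estimates close to the true values of 1 and \(-1\):

// Including the omitted regressor removes the omitted variable bias
regress y x1 x2, vce(robust)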
Example 2 (endogeneity in a projection model). The projection model gives us correct inference if
\begin{eqnarray*}
y &=& X_1\beta_1 + X_2\beta_2 + \varepsilon \\
E\left(X_j'\varepsilon\right) &=& 0 \quad \text{for} \quad j \in \left\{1,2\right\}
\end{eqnarray*}
If \(E\left(X_j'\varepsilon\right) \neq 0\), we say that the covariates \(X_j\) are endogenous. By the law of iterated expectations, \(E\left(X_j'\varepsilon\right) = E\left\{X_j'E\left(\varepsilon|X_j\right)\right\}\), so \(E\left(\varepsilon|X_j\right) = 0\) yields \(E\left(X_j'\varepsilon\right) = 0\). Thus, if \(E\left(X_j'\varepsilon\right) \neq 0\), we must have \(E\left(\varepsilon|X_j\right) \neq 0\). Say \(X_1\) is endogenous; then, we can write the model under endogeneity within our framework as
\begin{eqnarray*}
y &=& X_1\beta_1 + X_2\beta_2 + \varepsilon \\
E\left(\varepsilon| X_1 \right)&\neq& 0 \\
E\left(\varepsilon| X_2 \right)&=& 0
\end{eqnarray*}
Below I simulate data that exemplify endogeneity:
clear
capture set seed 111
quietly set obs 20000

// Generating Endogenous Components
matrix C = (1, .5 \ .5, 1)
quietly drawnorm e v, corr(C)

// Generating Regressors
generate x1 = rnormal()
generate x2 = v

// Generating Model
generate y = 1 + x1 - x2 + e
The drawnorm command generates correlated unobservable variables, e and v. I then generate a covariate, x2, that is equal to one of the unobservables, and I generate the outcome variable using the other unobservable, e. Because e and v are correlated, the covariate x2 is endogenous, and its estimated coefficient should be far from the true value (in this case, \(-1\)). Below we observe exactly this:
. regress y x1 x2, vce(robust)

Linear regression                               Number of obs     =     20,000
                                                F(2, 19997)       =   17126.12
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6292
                                                Root MSE          =     .86244

------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.005441   .0060477   166.25   0.000     .9935867    1.017295
          x2 |  -.4980092    .006066   -82.10   0.000    -.5098991   -.4861193
       _cons |   .9917196   .0060981   162.63   0.000     .9797669    1.003672
------------------------------------------------------------------------------
The estimated coefficient is \(-0.498\), and our confidence interval suggests that the true value is somewhere between \(-0.510\) and \(-0.486\). Estimation and inference are misleading.
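The size of the inconsistency is again what the simulation implies. Because e and v are jointly normal with unit variances and correlation 0.5, we can write \(e = 0.5\,x_2 + \xi\), where \(\xi\) is mean independent of \(x_2 = v\) and of \(x_1\). Substituting into the true model gives
\begin{eqnarray*}
y &=& 1 + x_1 - x_2 + e \\
&=& 1 + x_1 - 0.5\,x_2 + \xi
\end{eqnarray*}
so the population coefficient on \(x_2\) in the regression above is \(-0.5\), consistent with the estimate of \(-0.498\), while the coefficient on the exogenous covariate \(x_1\) is unaffected.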
Example 3 (selection bias). In this case, we observe our outcome of interest only for a subset of the population. Which subset we observe depends on a rule; for instance, we observe \(y\) if \(y_2\geq 0\). The conditional expectation of our outcome of interest is then given by
\begin{equation*}
E\left(y|X_1, y_2 \geq 0\right) = X_1\beta + E\left(\varepsilon|X_1, y_2 \geq 0 \right)
\end{equation*}
Selection bias arises if \(E\left(\varepsilon|X_1, y_2 \geq 0 \right) \neq 0\). This implies that the selection rule is related to the unobservables in our model. If we define \(X \equiv (X_1, y_2 \geq 0)\), we can rewrite the problem in terms of our general framework:
\begin{eqnarray*}
E\left(y|X\right) &=& X_1\beta + E\left(\varepsilon|X \right) \\
E\left(\varepsilon|X\right) &\neq & 0
\end{eqnarray*}
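To see why this conditional expectation can differ from zero, consider the case, mirrored in the simulation below, where the unobservables of the outcome and selection equations, \(\varepsilon\) and \(v\), are jointly normal and independent of the covariates, with \(v\) standard normal, \(Corr\left(\varepsilon, v\right) = \rho\), and \(y_2 = Z\gamma + v\). A standard result then gives
\begin{equation*}
E\left(\varepsilon|X_1, Z, y_2 \geq 0\right) = \rho\sigma_{\varepsilon}\frac{\phi\left(Z\gamma\right)}{\Phi\left(Z\gamma\right)}
\end{equation*}
where \(\phi\) and \(\Phi\) are the standard normal density and distribution functions, and \(\sigma_{\varepsilon}\) is the standard deviation of \(\varepsilon\). This inverse Mills ratio term is zero only when \(\rho = 0\), that is, only when selection is unrelated to the unobservables of the outcome equation.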
Below I simulate data that exemplify selection on unobservables:
clear
capture set seed 111
quietly set obs 20000

// Generating Endogenous Components
matrix C = (1, .8 \ .8, 1)
quietly drawnorm e v, corr(C)

// Generating exogenous variables
generate x1 = rbeta(2,3)
generate x2 = rbeta(2,3)
generate x3 = rnormal()
generate x4 = rchi2(1)

// Generating outcome variables
generate y1 = x1 - x2 + e
generate y2 = 2 + x3 - x4 + v
replace y1 = . if y2<=0
The drawnorm command generates correlated unobservable variables, e and v. The next four commands generate the exogenous covariates. The last three commands generate the two outcomes and set y1 to missing according to the selection rule, so y1 is observed only when y2 is positive. If we use linear regression on the selected sample, we obtain
. regress y1 x1 x2, vce(robust) noconstant

Linear regression                               Number of obs     =     14,847
                                                F(2, 14845)       =     808.75
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0988
                                                Root MSE          =     .94485

------------------------------------------------------------------------------
             |               Robust
          y1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.153796   .0291331    39.60   0.000     1.096692    1.210901
          x2 |  -.7896144   .0288036   -27.41   0.000     -.846073   -.7331558
------------------------------------------------------------------------------
As in the previous cases, the point estimates and confidence intervals lead us to incorrect conclusions.
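Because these are simulated data, we can check what we would have estimated absent selection. Re-creating the outcome without applying the selection rule, which is infeasible with real data, should yield estimates close to the true values of 1 and \(-1\):

// Infeasible check: rebuild the outcome without the selection rule
generate y1_full = x1 - x2 + e
regress y1_full x1 x2, vce(robust) noconstant

When the unobservables are jointly normal, as they are in this simulation, estimators such as heckman can account for the selection mechanism.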
Concluding remarks
I have presented a general regression framework to understand many of the problems that prevent us from interpreting our results causally. I also used simulated data to illustrate the effects of these problems on our point estimates and confidence intervals.