Two faces of misspecification in maximum likelihood: Heteroskedasticity and robust standard errors
For a nonlinear model with heteroskedasticity, a maximum likelihood estimator gives misleading inference and inconsistent marginal effect estimates unless I model the variance. Using a robust estimate of the variance–covariance matrix will not help me obtain correct inference.
This differs from the intuition we gain from linear regression. The estimates of the marginal effects in linear regression are consistent under heteroskedasticity and using robust standard errors yields correct inference.
If robust standard errors do not solve the problems associated with heteroskedasticity for a nonlinear model estimated using maximum likelihood, what does it mean to use robust standard errors in this context? I answer this question using simulations and illustrate the effect of heteroskedasticity in nonlinear models estimated using maximum likelihood.
What happens when I have heteroskedasticity
Suppose that the true model is a heteroskedastic probit where
\begin{equation*}
y = \left\{
\begin{array}{cl}
1 & \text{if} \quad x\beta + \varepsilon > 0\\
0 & \text{otherwise}
\end{array}\right.
\end{equation*}
\begin{equation*}
\varepsilon | x \sim N\left(0, \exp\left(x\gamma\right) \right)
\end{equation*}
This model is heteroskedastic because the variance of the unobserved component, \(\varepsilon\), is a function of the covariates; as a result, \(P\left(y=1|x\right) = \Phi\left(x\beta/\sigma\left(x\right)\right)\), where \(\sigma\left(x\right)\) is the conditional standard deviation of \(\varepsilon\), so the marginal effects depend on \(\gamma\) as well as \(\beta\). In contrast, the variance of the probit model does not depend on the covariates; it is equal to 1.
In table 1 below, I show the average marginal effect (AME), the average change in the outcome when a continuous covariate changes; the average treatment effect (ATE), the average change in the outcome when a discrete covariate moves away from its base level; and the rejection rate of a 5% test of the true null hypothesis. I compare two estimators: a probit with a robust variance–covariance matrix and a heteroskedastic probit. Table 1 also shows an approximate true value of each AME and ATE, which I obtain by computing the effects at the true values of the coefficients in a sample of 10 million observations. I provide more details about the estimators and the simulation in the appendix.
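Concretely, in each replication the two estimators amount to the following commands, a condensed sketch of the appendix code:

```
* Probit with a robust variance-covariance matrix, then average effects
probit y x1 i.x2 x3, vce(robust)
margins, dydx(*)

* Heteroskedastic probit that models the variance, then average effects
hetprobit y x1 i.x2 x3, het(x1 i.x2 x3)
margins, dydx(*)
```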
Table 1. Simulation results for N = 10,000 and 2,000 replications

| Statistic | Approximate true value | Probit | Hetprobit |
|---|---|---|---|
| AME of x1 | -.210 | -.099 | -.210 |
| 5% rejection rate | | 1.00 | .052 |
| AME of x3 | .166 | .061 | .166 |
| 5% rejection rate | | 1.00 | .062 |
| ATE of x2 (1 vs 0) | -.190 | -.193 | -.191 |
| 5% rejection rate | | .060 | .064 |
| ATE of x2 (2 vs 0) | .082 | .077 | .081 |
| 5% rejection rate | | .061 | .065 |
| ATE of x2 (3 vs 0) | -.190 | -.192 | -.191 |
| 5% rejection rate | | .058 | .063 |
As expected, the heteroskedastic probit estimates are close to the true values, and the rejection rate of the true null hypothesis is close to 5%. The same is true of the probit ATE estimates. However, the probit AME estimates are far from the true values, and the rejection rate is 100%, regardless of whether I use a robust variance–covariance matrix estimator.
The probit likelihood in this example is misspecified. As White (1996) shows, the misspecified probit estimates converge to a well-defined pseudo-true parameter, and robust standard errors provide correct coverage for that parameter. However, as the simulations illustrate, this parameter gives inconsistent estimates of the effects of interest. Sometimes, as we will see below, the quantity a misspecified likelihood recovers is exactly what we want.
If a robust variance–covariance matrix does not correct for heteroskedasticity, what is it doing?
Although a robust variance–covariance matrix estimator is closely associated with heteroskedasticity in linear regression models, it has a different interpretation in a nonlinear model estimated using maximum likelihood, as I show in the two examples below.
First, let’s look at the probit likelihood in another context.
Example 1 (Probit pseudolikelihood). Suppose the true model is given by
\begin{eqnarray*}
y &=& \Phi\left(x\beta + \varepsilon\right) \\
\varepsilon | x & \sim & N\left(0, 1\right)
\end{eqnarray*}
The model above is a fractional response model: the outcome takes values in the interval \([0, 1]\) rather than only the values 0 and 1. This is not a binary response model, but we may use the probit likelihood to obtain a consistent estimate of the outcome mean,
\begin{equation*}
E\left(y|x\right) = \Phi\left(\frac{x\beta}{\sqrt{2}}\right)
\end{equation*}
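To see where the \(\sqrt{2}\) comes from, write \(\Phi\left(x\beta + \varepsilon\right) = P\left(u \leq x\beta + \varepsilon \mid \varepsilon\right)\) for a standard normal \(u\) independent of \(\varepsilon\). Averaging over \(\varepsilon\) then gives

\begin{equation*}
E\left(y|x\right) = P\left(u - \varepsilon \leq x\beta \mid x\right) = \Phi\left(\frac{x\beta}{\sqrt{2}}\right)
\end{equation*}

because \(u - \varepsilon \mid x \sim N\left(0, 2\right)\).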
Below I simulate and obtain estimates for the model above:
```
clear
set seed 222
set obs 1000000
generate x1 = rnormal()
generate x2 = int(rbeta(3,2)*3)
// generate does not accept factor-variable notation, so the x2
// indicators are written out explicitly
generate xb = .5*(1 + x1 - (x2==1) + (x2==2))
generate e = rnormal()
generate yp = normal(xb + e)
```
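As a quick numerical check of the mean result, an addition of mine rather than part of the original code, we can compare the simulated outcome with its implied conditional mean; the two sample means should be nearly identical:

```
// conditional mean implied by the derivation above
generate Ey = normal(xb/sqrt(2))
summarize yp Ey
```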
For these data, the AMEs and ATEs are given by
```
// Marginal effect of x1
generate m1 = normalden(xb/sqrt(2))*(.5/sqrt(2))
// Treatment effect of x2, 1 vs 0
generate m21 = normal(.5*(x1)/sqrt(2)) - normal(.5*(x1 + 1)/sqrt(2))
// Treatment effect of x2, 2 vs 0
generate m22 = normal(.5*(x1 + 2)/sqrt(2)) - normal(.5*(x1 + 1)/sqrt(2))
```
To fit the model, I use fracreg, which by default combines a probit likelihood with a robust variance–covariance matrix. fracreg assumes a correct model for the mean and is agnostic about the other moments of the outcome. Because we model only a subset of the moments of our outcome, in this example the mean, we use a robust estimator of the variance–covariance matrix to obtain consistent estimates of the standard errors. Given that the probit likelihood is not the true likelihood, we refer to it as a pseudolikelihood or quasilikelihood.
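The estimates below come from a fracreg call along these lines; the excerpt does not show the command itself, so take the exact invocation as an assumption:

```
fracreg probit yp x1 i.x2
```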
The estimates for the AMEs and ATEs after fracreg are given by
```
. margins, dydx(*)

Average marginal effects                        Number of obs     =  1,000,000
Model VCE    : Robust

Expression   : Conditional mean of yp, predict()
dy/dx w.r.t. : x1 1.x2 2.x2

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .1209728   .0002398   504.55   0.000     .1205028    .1214427
             |
          x2 |
          1  |  -.1304846   .0008839  -147.62   0.000    -.1322171   -.1287521
          2  |   .1175945   .0008696   135.23   0.000     .1158902    .1192988
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
```
which are close to the sample values
```
. summarize m*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          m1 |  1,000,000    .1213898    .0219322   .0083771   .1410474
         m21 |  1,000,000   -.1305564    .0123099  -.1403162   -.025986
         m22 |  1,000,000    .1169432    .0206344   .0128037   .1403162
```
Let’s look at another example that was discussed in http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/.
Example 2 (Exponential mean model). Suppose the true model is given by
\begin{eqnarray*}
y &=& \exp\left(x\beta + \varepsilon\right) \\
\varepsilon | x & \sim & N\left(0, 1\right)
\end{eqnarray*}
This is not a Poisson model, but we can use the Poisson likelihood estimates to get a consistent estimate of the outcome mean
\begin{equation*}
E\left(y|x\right) = \exp\left(x\beta + \frac{1}{2}\right)
\end{equation*}
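The extra \(1/2\) in the exponent is the lognormal mean correction: because \(\varepsilon | x \sim N\left(0, 1\right)\),

\begin{equation*}
E\left(y|x\right) = \exp\left(x\beta\right)E\left[\exp\left(\varepsilon\right)|x\right] = \exp\left(x\beta\right)\exp\left(\frac{1}{2}\right)
\end{equation*}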
Given that we do not have a true Poisson model, our estimates should not be used to obtain statistics that are not functions of the outcome mean. For example, it makes no sense to predict counts or the probability that the outcome equals a specific integer, predictions that would be natural if the true likelihood were a Poisson likelihood.
Below I simulate data for the exponential mean model above:
```
clear
set seed 222
set obs 1000000
generate x1 = rnormal()
generate x2 = int(rbeta(3,2)*3)
// again writing out the x2 indicators instead of factor notation
generate xb = .5*(1 + x1 - (x2==1) + (x2==2))
generate e = rnormal()
generate ye = exp(xb + e)
```
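As with the fractional response example, a quick check that I add here confirms the exponential mean formula:

```
// conditional mean implied by the lognormal correction
generate Ey = exp(xb + .5)
summarize ye Ey
```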
The estimation results are given by
```
. poisson ye x1 i.x2, vce(robust)
note: you are responsible for interpretation of noncount dep. variable

Iteration 0:   log pseudolikelihood = -2904731.1
Iteration 1:   log pseudolikelihood = -2904726.1
Iteration 2:   log pseudolikelihood = -2904726.1

Poisson regression                              Number of obs     =  1,000,000
                                                Wald chi2(3)      =  142144.11
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -2904726.1               Pseudo R2         =     0.2087

------------------------------------------------------------------------------
             |               Robust
          ye |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .5006891   .0018594   269.27   0.000     .4970447    .5043335
             |
          x2 |
          1  |  -.4953304   .0049604   -99.86   0.000    -.5050527   -.4856081
          2  |   .5086742   .0050554   100.62   0.000     .4987659    .5185825
             |
       _cons |   .9956749   .0044566   223.42   0.000     .9869401     1.00441
------------------------------------------------------------------------------
```
The output notes that we have a noncount outcome, in our case a continuous outcome with an exponential mean, and that we are responsible for the interpretation of our results. The iteration log refers to a pseudolikelihood, as it always does when we pair a maximum likelihood estimator with a robust variance–covariance matrix. Note also that the estimated constant, .996, is close to the true constant plus the lognormal correction, \(.5 + 1/2 = 1\): the Poisson constant absorbs the term \(E\left[\exp\left(\varepsilon\right)|x\right] = \exp\left(1/2\right)\), which is why \(\exp\left(xb\right)\) evaluated at the estimates is consistent for the outcome mean.
The AMEs and ATEs for the exponential mean model are given by
```
// Marginal effect of x1
generate mex1 = exp(.5*(1 + x1 - (x2==1) + (x2==2)) + .5)*.5
// Treatment effect of x2, 1 vs 0
generate te1 = exp(.5*(1 + x1 - 1) + .5) - exp(.5*(1 + x1) + .5)
// Treatment effect of x2, 2 vs 0
generate te2 = exp(.5*(1 + x1 + 1) + .5) - exp(.5*(1 + x1) + .5)
```
and their estimates are given by
```
. quietly poisson ye x1 i.x2, vce(robust)

. margins, dydx(*) expression(exp(xb()))

Average marginal effects                        Number of obs     =  1,000,000
Model VCE    : Robust

Expression   : exp(xb())
dy/dx w.r.t. : x1 1.x2 2.x2

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.661569   .0078353   212.06   0.000     1.646212    1.676926
             |
          x2 |
          1  |  -1.198624   .0142905   -83.88   0.000    -1.226633   -1.170615
          2  |   2.034632   .0182593   111.43   0.000     1.998844    2.070419
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
```
which are close to the sample values
```
. summarize mex1 te*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        mex1 |  1,000,000    1.654795    1.229289   .0810196    23.7496
         te1 |  1,000,000   -1.212149    .6463085  -11.33574  -.1051182
         te2 |  1,000,000    1.998497    1.065583   .1733107   18.68948
```
Because we used a robust variance–covariance matrix, we have consistent estimates of the standard errors of the effects.
Concluding remarks
Using simulations, I showed that heteroskedasticity in nonlinear models estimated using maximum likelihood produces inconsistent estimates of marginal effects. This differs from heteroskedasticity in linear regression models, which does not affect the consistency of marginal effect estimates.
Another difference between linear regression models and nonlinear models estimated using maximum likelihood is the interpretation of the robust variance–covariance matrix. In the latter case, as I illustrated via two examples, it means that we are using a pseudolikelihood to model a set of moments of our outcome and are agnostic about all other moments.
In both cases, we have a misspecified likelihood. In the case of heteroskedasticity, the pseudolikelihood estimates converge to values that differ from the effects of interest. In the case where we model the mean correctly, the pseudolikelihood estimates converge to the effects of interest. They are two faces of the same problem: misspecified likelihoods in nonlinear models estimated using maximum likelihood.
Reference
White, H. 1996. Estimation, Inference and Specification Analysis. Cambridge: Cambridge University Press.
Appendix
The program used for the simulations of the first example is given by
```
clear all
local L = 10000000
local R = 2000
local N = 10000
set seed 222

// Data-generating process plus the true marginal and treatment effects.
// generate does not accept factor-variable notation, so the x2
// indicators are written out explicitly.
program define mkdata
    syntax, [n(integer 10000)]
    clear
    quietly set obs `n'
    generate x1 = rchi2(1)-1
    generate x2 = int(4*rbeta(5,2))
    generate x3 = rchi2(1)-1
    // conditional standard deviation of the unobservable
    generate sg = exp(.3*(x1 - (x2==1) + (x2==2) - (x2==3) + x3))
    generate e  = rnormal(0, sg)
    generate xb = .5*(1 - x1 - (x2==1) + (x2==2) - (x2==3) + x3)
    generate y  = xb + e > 0
    // P(y=1|x) = normal(xb/sg), so the marginal effect of a continuous
    // covariate with index coefficient b and variance coefficient g is
    // normalden(xb/sg)*(b - g*xb)/sg
    generate m1 = normalden(xb/sg)*((-.5 - .3*xb)/sg)
    generate m3 = normalden(xb/sg)*((.5 - .3*xb)/sg)
    // treatment effects of x2: differences in P(y=1|x) across levels
    generate m21 = normal(.5*(-x1 + x3)/exp(.3*(x1 - 1 + x3)))    ///
                 - normal(.5*(1 - x1 + x3)/exp(.3*(x1 + x3)))
    generate m22 = normal(.5*(2 - x1 + x3)/exp(.3*(x1 + 1 + x3))) ///
                 - normal(.5*(1 - x1 + x3)/exp(.3*(x1 + x3)))
    generate m23 = m21
end

// Approximate the true AMEs and ATEs using 10 million observations
mkdata, n(`L')
summarize m1, meanonly
local m1 = r(mean)
summarize m3, meanonly
local m3 = r(mean)
summarize m21, meanonly
local m21 = r(mean)
summarize m22, meanonly
local m22 = r(mean)
summarize m23, meanonly
local m23 = r(mean)
display `m1'
display `m3'
display `m21'
display `m22'
display `m23'

// Simulations: est==1 is probit with robust VCE; est==2 is hetprobit
postfile sims est hm1 hm1_r hm21 hm21_r hm22 hm22_r hm23 hm23_r hm3 hm3_r ///
    rc cv using hetprobit, replace
forvalues i=1/`R' {
    quietly {
        mkdata, n(`N')
        capture probit y x1 i.x2 x3, vce(robust) iterate(200)
        local rc = _rc
        local cv = e(converged)
        if (`rc' | `cv'==0) {
            local hm1 = .
            local hm1_r = .
            local hm21 = .
            local hm21_r = .
            local hm22 = .
            local hm22_r = .
            local hm23 = .
            local hm23_r = .
            local hm3 = .
            local hm3_r = .
        }
        else {
            margins, dydx(*) post
            local hm1 = _b[x1]
            test _b[x1] = `m1'
            local hm1_r = (r(p)<.05)
            local hm21 = _b[1.x2]
            test _b[1.x2] = `m21'
            local hm21_r = (r(p)<.05)
            local hm22 = _b[2.x2]
            test _b[2.x2] = `m22'
            local hm22_r = (r(p)<.05)
            local hm23 = _b[3.x2]
            test _b[3.x2] = `m23'
            local hm23_r = (r(p)<.05)
            local hm3 = _b[x3]
            test _b[x3] = `m3'
            local hm3_r = (r(p)<.05)
        }
        post sims (1) (`hm1') (`hm1_r') (`hm21') (`hm21_r')   ///
            (`hm22') (`hm22_r') (`hm23') (`hm23_r') (`hm3')   ///
            (`hm3_r') (`rc') (`cv')
        capture hetprobit y x1 i.x2 x3, het(x1 i.x2 x3) iterate(200)
        local rc = _rc
        local cv = e(converged)
        if (`rc' | `cv'==0) {
            local hm1 = .
            local hm1_r = .
            local hm21 = .
            local hm21_r = .
            local hm22 = .
            local hm22_r = .
            local hm23 = .
            local hm23_r = .
            local hm3 = .
            local hm3_r = .
        }
        else {
            margins, dydx(*) post
            local hm1 = _b[x1]
            test _b[x1] = `m1'
            local hm1_r = (r(p)<.05)
            local hm21 = _b[1.x2]
            test _b[1.x2] = `m21'
            local hm21_r = (r(p)<.05)
            local hm22 = _b[2.x2]
            test _b[2.x2] = `m22'
            local hm22_r = (r(p)<.05)
            local hm23 = _b[3.x2]
            test _b[3.x2] = `m23'
            local hm23_r = (r(p)<.05)
            local hm3 = _b[x3]
            test _b[x3] = `m3'
            local hm3_r = (r(p)<.05)
        }
        post sims (2) (`hm1') (`hm1_r') (`hm21') (`hm21_r')   ///
            (`hm22') (`hm22_r') (`hm23') (`hm23_r') (`hm3')   ///
            (`hm3_r') (`rc') (`cv')
    }
    if (`i'/50) == int(`i'/50) {
        di ". `i'"
    }
    else {
        di _c "."
    }
}
postclose sims
use hetprobit, clear
label define est 1 "probit" 2 "hetprobit"
label values est est
bysort est: summarize
```
The mkdata program defines the data-generating process, the marginal effects, and the treatment effects. I then draw a sample of 10 million observations and take the averages of the marginal effects and treatment effects; because the sample size is large, I take these means to be a good approximation of the true values of the AMEs and ATEs. The forvalues loop runs the simulations, fitting a probit with a robust variance–covariance matrix and a heteroskedastic probit in each replication, and the final lines summarize the simulation results.