Home > Statistics > Two faces of misspecification in maximum likelihood: Heteroskedasticity and robust standard errors

Two faces of misspecification in maximum likelihood: Heteroskedasticity and robust standard errors

For a nonlinear model with heteroskedasticity, a maximum likelihood estimator gives misleading inference and inconsistent marginal effect estimates unless I model the variance. Using a robust estimate of the variance–covariance matrix will not help me obtain correct inference.

This differs from the intuition we gain from linear regression. The estimates of the marginal effects in linear regression are consistent under heteroskedasticity and using robust standard errors yields correct inference.

If robust standard errors do not solve the problems associated with heteroskedasticity for a nonlinear model estimated using maximum likelihood, what does it mean to use robust standard errors in this context? I answer this question using simulations and illustrate the effect of heteroskedasticity in nonlinear models estimated using maximum likelihood.

What happens when I have heteroskedasticity

Suppose that the true model is a heteroskedastic probit where

\begin{equation*}
y = \left\{
\begin{array}{cl}
1 & \text{if} \quad x\beta + \varepsilon > 0\\
0 & \text{otherwise}
\end{array}\right.
\end{equation*}

\begin{equation*}
\varepsilon | x \sim N\left(0, \exp\left(x\gamma\right) \right)
\end{equation*}

This model is heteroskedastic because the variance of the unobserved component, \(\varepsilon\), is a function of the covariates. In contrast, the variance of the probit model does not depend on the covariates; it is equal to 1.

In table 1 below, I show the average of the change in the outcome when a continuous covariate changes, the average marginal effect (AME), the average of the change in the outcome when a discrete variable varies from a base level, which I refer to as the average treatment effect (ATE), and the 5% rejection rate of a test against the true null hypothesis. I compare two estimators, a probit with a robust variance–covariance matrix and a heteroskedastic probit. In table 1, I also show an approximate true value of the AME and ATE. I obtain the approximate true values by computing the ATE and AME, at the true values of the coefficients, using a sample of 10 million observations. I provide more details about estimation and the the simulation in the appendix.

Table 1. Average marginal and treatment effects: True DGP heteroskedastic probit
Simulation results for N=10,000 and 2,000 replications

Statistic Approximate True Value Probit Hetprobit
AME of x1 -.210 -.099 -.210
5% Rejection Rate 1.00 .052
AME of x3 .166 .061 .166
5% Rejection Rate 1.00 .062
ATE of x2 (1 vs 0) -.190 -.193 -.191
5% Rejection Rate .060 .064
ATE of x2 (2 vs 0) .082 .077 .081
5% Rejection Rate .061 .065
ATE of x2 (3 vs 0) -.190 -.192 -.191
5% Rejection Rate .058 .063

As expected, the heteroskedastic probit estimates are close to the true value, and the rejection rate of the true null hypothesis is close to 5%. This is also true for the ATE probit estimates. However, the probit AME estimates are far away from the true value, and the rejection rate is 100%, regardless of my use of a robust variance–covariance matrix estimator.

The probit likelihood in this example is misspecified. As White (1996) illustrates, the misspecified probit likelihood estimates converge to a well-defined parameter, and robust standard errors provide correct coverage for this parameter. However, the value obtained from the probit likelihood, as the simulations illustrate, gives an inconsistent estimate of the effects of interest. Sometimes, as we will see below, this misspecified value is of interest.

If a robust variance did not correct for heteroskedasticity, what is it doing?

Although a robust variance–covariance matrix estimator is closely related to heteroskedasticity in linear regression models, as I show in the two examples below, a robust variance–covariance matrix estimator has a different interpretation in a nonlinear model estimated using maximum likelihood.

First, let’s look at the probit likelihood in another context.

Example 1 (Probit pseudolikelihood). Suppose the true model is given by

\begin{eqnarray*}
y &=& \Phi\left(x\beta + \varepsilon\right) \\
\varepsilon | x & \sim & N\left(0, 1\right)
\end{eqnarray*}

The model above is a fractional response model. The outcome variable in fractional response models takes values that are greater than or equal to 0 and less than or equal to 1. This is not a binary response model, but we may use the probit likelihood to get a consistent estimate of the outcome mean,

\begin{equation*}
E\left(y|x\right) = \Phi\left(\frac{x\beta}{\sqrt{2}}\right)
\end{equation*}

Below I simulate and obtain estimates for the model above:

clear
set seed 222
set obs 1000000
generate x1   = rnormal()
generate x2   = int(rbeta(3,2)*3)
generate xb   = .5*(1 + x1 -1.x2 + 2.x2)
generate e    = rnormal()
generate yp   = normal(xb + e)

For these data, the AMEs and ATEs are given by

// Marginal effect x1
generate m1   = normalden(xb/sqrt(2))*(.5/sqrt(2))
// Treatment effect of x2 1 vs 0
generate m21  = normal(.5*(x1)/sqrt(2)) - normal(.5*(x1 + 1)/sqrt(2))
// Treatment effect of x2 2 vs 0
generate m22  = normal(.5*(x1 + 2)/sqrt(2)) - normal(.5*(x1 + 1)/sqrt(2))

To fit the model, I use fracreg, which employs a probit likelihood with a robust variance–covariance matrix by default. fracreg assumes a correct model for the mean and is agnostic about other moments of the outcome. Because we are modeling a subset of the moments of our outcome, in this example the mean, and do not model the other moments, we use a robust estimator of the variance–covariance matrix to obtain consistent estimates of the unknown standard errors. Given that the probit likelihood is not the true likelihood, we refer to the likelihood as a pseudolikelihood or quasilikelihood.

The estimates for the AMEs and ATEs after fracreg are given by

. margins, dydx(*)

Average marginal effects                        Number of obs     =  1,000,000
Model VCE    : Robust

Expression   : Conditional mean of yp, predict()
dy/dx w.r.t. : x1 1.x2 2.x2

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .1209728   .0002398   504.55   0.000     .1205028    .1214427
             |
          x2 |
          1  |  -.1304846   .0008839  -147.62   0.000    -.1322171   -.1287521
          2  |   .1175945   .0008696   135.23   0.000     .1158902    .1192988
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

which are close to the sample values

. summarize m*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          m1 |  1,000,000    .1213898    .0219322   .0083771   .1410474
         m21 |  1,000,000   -.1305564    .0123099  -.1403162   -.025986
         m22 |  1,000,000    .1169432    .0206344   .0128037   .1403162

Let’s look at another example that was discussed in http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/.

Example 2 (Exponential mean model). Suppose the true model is given by

\begin{eqnarray*}
y &=& \exp\left(x\beta + \varepsilon\right) \\
\varepsilon | x & \sim & N\left(0, 1\right)
\end{eqnarray*}

This is not a Poisson model, but we can use the Poisson likelihood estimates to get a consistent estimate of the outcome mean

\begin{equation*}
E\left(y|x\right) = \exp\left(x\beta + \frac{1}{2}\right)
\end{equation*}

Given that we do not have a Poisson model, our estimates should not be used to obtain statistics that are not functions of the outcome mean. For example, it makes no sense to predict counts or the probability of the outcome being a specific integer, natural predictions, if the true likelihood was a Poisson likelihood.

Below I simulate data for the exponential mean model above:

clear
set seed 222
set obs 1000000
generate x1   = rnormal()
generate x2   = int(rbeta(3,2)*3)
generate xb   = .5*(1 + x1 -1.x2 + 2.x2)
generate e    = rnormal()
generate ye   = exp(xb + e)

The estimation results are given by

. poisson ye x1 i.x2, vce(robust)
note: you are responsible for interpretation of noncount dep. variable

Iteration 0:   log pseudolikelihood = -2904731.1
Iteration 1:   log pseudolikelihood = -2904726.1
Iteration 2:   log pseudolikelihood = -2904726.1

Poisson regression                              Number of obs     =  1,000,000
                                                Wald chi2(3)      =  142144.11
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -2904726.1               Pseudo R2         =     0.2087

------------------------------------------------------------------------------
             |               Robust
          ye |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .5006891   .0018594   269.27   0.000     .4970447    .5043335
             |
          x2 |
          1  |  -.4953304   .0049604   -99.86   0.000    -.5050527   -.4856081
          2  |   .5086742   .0050554   100.62   0.000     .4987659    .5185825
             |
       _cons |   .9956749   .0044566   223.42   0.000     .9869401     1.00441
------------------------------------------------------------------------------

The output notes that we have a noncount outcome—in our case a continuous outcome with an exponential mean—and are responsible for the interpretation of our results. The iteration log states that we have a pseudolikelihood, which will always be stated when we use a robust variance–covariance matrix with a maximum likelihood estimator.

The AMEs and ATEs for the exponential mean model are given by

// Marginal effect x1
generate mex1  = exp(.5*(1 + x1 -1.x2 + 2.x2) + .5)*.5
// Treatment effect of x2 1 vs 0
generate te1   = exp(.5*(1 + x1 - 1) + .5) - exp(.5*(1 + x1) + .5)
// Treatment effect of x2 2 vs 0
generate te2   = exp(.5*(1 + x1 + 1) + .5) - exp(.5*(1 + x1) + .5)

and their estimates are given by

. quietly poisson ye x1 i.x2, vce(robust)

. margins, dydx(*) expression(exp(xb()))

Average marginal effects                        Number of obs     =  1,000,000
Model VCE    : Robust

Expression   : exp(xb())
dy/dx w.r.t. : x1 1.x2 2.x2

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.661569   .0078353   212.06   0.000     1.646212    1.676926
             |
          x2 |
          1  |  -1.198624   .0142905   -83.88   0.000    -1.226633   -1.170615
          2  |   2.034632   .0182593   111.43   0.000     1.998844    2.070419
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

which are close to the sample values

. summarize mex1 te*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        mex1 |  1,000,000    1.654795    1.229289   .0810196    23.7496
         te1 |  1,000,000   -1.212149    .6463085  -11.33574  -.1051182
         te2 |  1,000,000    1.998497    1.065583   .1733107   18.68948

Because we used a robust variance–covariance matrix, we have consistent estimates of the standard errors of the effects.

Concluding remarks

Using simulations, I showed that heteroskedasticity in nonlinear models estimated using maximum likelihood produces inconsistent estimates of marginal effects. This differs from heteroskedasticity in linear regression models, which does not affect the consistency of marginal effect estimates.

Another difference between linear regression models and nonlinear models estimated using maximum likelihood is the interpretation of the robust variance–covariance matrix. In the latter case, as I illustrated via two examples, it means that we are using a pseudolikelihood to model a set of moments of our outcome and are agnostic about all other moments.

In both cases, we have a misspecified likelihood. In the case of heteroskedasticity, the psedudolikelihood estimates converge to an estimate that is different from the effects of interest. In the case where we model the mean correctly, the psedudolikelihood estimates converge to the effects of interest. They are two faces of the same problem, misspecified likelihoods in nonlinear models estimated using maximum likelihood.

Reference

White, H. 1996. Estimation, Inference and Specification Analysis. Cambridge: Cambridge University Press.

Appendix

The program used for the simulations of the first example is given by

clear all
local L = 10000000
local R = 2000
local N = 10000
set seed 222

program define mkdata
        syntax, [n(integer 10000)]
        clear
        quietly set obs `n'
        generate x1    = rchi2(1)-1
        generate x2    = int(4*rbeta(5,2))
        generate x3    = rchi2(1)-1
        generate sg    = exp(.3*(x1 -1.x2 + 2.x2 - 3.x2 + x3))
        generate e     = rnormal(0 , sg)
        generate xb    = .5*(1 - x1 - 1.x2 + 2.x2 - 3.x2 + x3)
        generate y     =  xb + e > 0

        generate m1  = normalden(xb/sg)*((-.5 -.3*xb)/sg)
        generate m3  = normalden(xb/sg)*((.5 -.3*xb)/sg)
        generate m21 = normal(.5*(-x1 + x3)/exp(.3*(x1 -1 + x3)))       ///
                   - normal(.5*(1 -x1 + x3)/exp(.3*(x1 + x3)))
        generate m22 = normal(.5*(2-x1 + x3)/exp(.3*(x1 + 1 + x3)))     ///
                                  -normal(.5*(1 -x1 + x3)/exp(.3*(x1+ x3)))
        generate m23 = m21
end

mkdata, n(`L')
summarize m1, meanonly
local m1  = r(mean)
summarize m3, meanonly
local m3  = r(mean)
summarize m21, meanonly
local m21 = r(mean)
summarize m22, meanonly
local m22 = r(mean)
summarize m23, meanonly
local m23 = r(mean)

display `m1'
display `m3'
display `m21'
display `m22'
display `m23'

postfile sims est hm1 hm1_r hm21 hm21_r hm22 hm22_r hm23 hm23_r hm3 hm3_r  ///
         rc cv using hetprobit, replace

forvalues i=1/`R' {
        quietly {
                mkdata, n(`N')
                capture probit y x1 i.x2 x3, vce(robust) iterate(200)
                local rc = _rc
                local cv = e(converged)
                if (`rc' | `cv'==0){
                        local hm1    = .
                        local hm1_r  = .
                        local hm21   = .
                        local hm21_r = .
                        local hm22   = .
                        local hm22_r = .
                        local hm23   = .
                        local hm23_r = .
                        local hm3    = .
                        local hm3_r  = .
                }
                else {
                        margins, dydx(*) post
                        local hm1 = _b[x1]
                        test _b[x1] = `m1'
                        local hm1_r   = (r(p)<.05)
                        local hm21 = _b[1.x2]
                        test _b[1.x2] = `m21'
                        local hm21_r   = (r(p)<.05)
                        local hm22 = _b[2.x2]
                        test _b[2.x2] = `m22'
                        local hm22_r   = (r(p)<.05)
                        local hm23 = _b[3.x2]
                        test _b[3.x2] = `m23'
                        local hm23_r   = (r(p)<.05)
                        local hm3 = _b[x3]
                        test _b[x3] = `m3'
                        local hm3_r   = (r(p)<.05)
                }
                post sims (1) (`hm1') (`hm1_r') (`hm21') (`hm21_r')       ///
                          (`hm22') (`hm22_r') (`hm23') (`hm23_r') (`hm3') ///
                          (`hm3_r') (`rc') (`cv')

                capture hetprobit y x1 i.x2 x3, het(x1 i.x2 x3) iterate(200)
                local rc = _rc
                local cv = e(converged)
                if (`rc' | `cv'==0) {
                        local hm1    = .
                        local hm1_r  = .
                        local hm21   = .
                        local hm21_r = .
                        local hm22   = .
                        local hm22_r = .
                        local hm23   = .
                        local hm23_r = .
                        local hm3    = .
                        local hm3_r  = .
                }
                else {
                        margins, dydx(*) post
                        local hm1 = _b[x1]
                        test _b[x1] = `m1'
                        local hm1_r   = (r(p)<.05)
                        local hm21 = _b[1.x2]
                        test _b[1.x2] = `m21'
                        local hm21_r   = (r(p)<.05)
                        local hm22 = _b[2.x2]
                        test _b[2.x2] = `m22'
                        local hm22_r   = (r(p)<.05)
                        local hm23 = _b[3.x2]
                        test _b[3.x2] = `m23'
                        local hm23_r   = (r(p)<.05)
                        local hm3 = _b[x3]
                        test _b[x3] = `m3'
                        local hm3_r   = (r(p)<.05)
                }
                post sims (2) (`hm1') (`hm1_r') (`hm21') (`hm21_r')       ///
                          (`hm22') (`hm22_r') (`hm23') (`hm23_r') (`hm3') ///
                          (`hm3_r') (`rc') (`cv')
        }
        if (`i'/50) == int(`i'/50) {
        di ".                 `i'"
    }
    else {
        di _c "."
    }
}
postclose sims
use hetprobit, clear
label define est 1 "probit" 2 "hetprobit"
label values est est
bysort est: summarize

In lines 7 to 26, I create a program that defines the data-generating process, the marginal effects, and the treatment effects. In lines 28 to 44, I draw a sample of 10 million observations and take the average of the marginal effects and treatment effects. Because the sample size is large, I take these means to be a good approximation to the true value of the ATEs and AMEs. Lines 46 to 132 show the code used for the simulations. The last lines summarize the simulation results.

  • Joro Kolev

    Dear Enrique, thank you for the very interesting and informative post !

    I have a question on Stata syntax that you are using, and which syntax I am encountering for a first time:

    On line 6 in your first block of code, you use statements such as 1.x2, 2.x2… e.g.

    generate xb = .5*(1 + x1 -1.x2 + 2.x2)

    By trial and error I kind of figured out what this is doing.

    However, what is the formal name of this “trick” and where in the Stata documentation I can find detailed explanation of how this “dot operator” you are using is supposed to behave?

  • epinzon

    Hello Joro,

    This is what is referred to as factor variable notation. If in the command window you type :

    help factor variables

    you will see some information about this. Also you can go from the help file to a more complete discussion in the pdf documentation, clicking on the blue link underneath the title of the help file.

    There is also an FAQ that touches on this from a different perspective

    http://www.stata.com/support/faqs/data-management/creating-dummy-variables/

  • Lukas Lang

    Dear Enrique, I found your post very interesting and useful.

    Suppose we have panel data and we need to estimate our model (whatever it is) by maximum likelihood (ML). Also assume that there is serial correlation in the error term.

    Is serial correlation a problem for the consistency of the ML estimator?

    Does the calculatation of clustered standard errors (clustering at the cross-sectional unit level), say through the usual “cluster(clustervar)” option, allow for the correct inference in this case? Or, will this create the same issues as the ones discussed in your post?