
Heteroskedasticity robust standard errors: Some practical considerations

Introduction

Some discussions have arisen lately with regard to which standard errors practitioners should use in the presence of heteroskedasticity in linear models. The discussion intrigued me, so I took a second look at the existing literature. I provide an overview of theoretical and simulation research that helps us answer this question. I also present simulation results that mimic or expand some of the existing simulation studies. I'll share the Stata code I used for the simulations in hopes that it might be useful to those who want to explore how the various standard-error estimators perform in situations relevant to their own research.

From my simulation exercises, I conclude that no single variance–covariance matrix of estimators (VCE) is preferred to others across all possible combinations of different sample sizes and degrees of heteroskedasticity. For instance, if we extend the simulation design of MacKinnon (2012) to include discrete covariates, the 5% rejection rate for the coefficients of the discrete covariates in some cases is best when we use the Huber/White/sandwich VCE provided by Stata’s vce(robust) option. For continuous covariates, the conclusions are different.

From the literature, two practical considerations arise. First, taking sample size on its own as a criterion is not enough to obtain accurate standard errors in the presence of heteroskedasticity. What matters is the number of observations per regressor. If you have 250 observations and 4 regressors, performance of heteroskedasticity-consistent standard-error estimators will probably be good. If you have 250 observations and 10 regressors, this may no longer be true. Similarly, having 10,000 observations may not be enough if you have 500 regressors. As the number of parameters grows, so does the information required to consistently estimate them. Also, as the number of observations per regressor becomes smaller, all current heteroskedasticity-consistent standard errors become inaccurate, as discussed by Cattaneo, Jansson, and Newey (2018).

Second, leverage points far above average matter. Leverage points are functions of the covariates that measure how much influence an observation has on the ordinary least-squares fit. Leverage points lie between 0 and 1. An observation with a leverage of 1 wields so much influence that the orientation of the fit in the direction of that covariate is entirely determined by that observation; its estimated residual is 0. Simulation evidence shows that the performance of heteroskedasticity-consistent standard errors improves when high-leverage points are not present in a design, as discussed in Chesher and Austin (1991).

These two considerations are related. The mean of the leverage points is equal to the number of regressors divided by the number of observations (the inverse of the number of observations per regressor). For a fixed number of regressors, as the sample size increases, the mean of the leverage and the probability of having leverage points with large values decrease. Thus, having sufficient observations per regressor reduces the problems associated with high-leverage points.

To summarize, when we think about robust standard errors, the relevant metric is the number of observations per regressor. If the number of observations per regressor is small, regardless of the sample size, our inference may be imprecise, even when we use heteroskedasticity-consistent standard errors that correct for bias. There is no silver bullet that will give you reliable inference if there is not enough data for each parameter you want to estimate.

History and intuition

When we think about heteroskedasticity-consistent standard errors in linear models, we think of White (1980). The key result of White’s work is that we can get a consistent estimate of the VCE even if we cannot get a consistent estimate of some of its individual components. This was a groundbreaking insight and led to other important developments in the estimation of standard errors, as mentioned by MacKinnon (2012).

White’s results, however, are asymptotic. They are appropriate when you have a large sample size under certain regularity conditions. There is nothing benign about the phrase “under certain regularity conditions”. This is where things get tricky and where we need to explore what needs to be satisfied for us to trust the tools we are using.

White’s estimator has bias. Like many asymptotic results, the bias decreases as the number of observations increases. MacKinnon and White (1985) propose three asymptotically equivalent estimators to address the small-sample bias of White’s heteroskedasticity-consistent standard errors.

The intuition behind the estimators is that the least-squares residuals tend to underestimate the true errors. The proposed solutions all address this by inflating the individual squared residuals that enter the variance estimate. The first of these estimators, HC1, is a degrees-of-freedom adjustment of the order \(n/(n-k)\), where \(n\) is the sample size and \(k\) is the number of regressors. This is what you get in Stata when you use vce(robust). The other two estimators are HC2 (vce(hc2)), which corrects for the bias in the variance of the residual that arises under homoskedasticity, and HC3 (vce(hc3)), a jackknife estimator. MacKinnon and White (1985) find that, for small sample sizes, HC2 and HC3 perform better than HC1 in their simulations and that HC3 is the preferred alternative.

HC2 and HC3 are functions of \(h_{ii}\), the diagonal elements of the matrix
\begin{equation*}
X(X'X)^{-1}X'
\end{equation*}
The \(h_{ii}\) are also referred to as leverage points. A high \(h_{ii}\), relative to the average of the \(h_{ii}\), exerts more leverage on the orientation of the regression plane. Points with a leverage of 1, for instance, lie exactly on the fitted regression plane. HC2 and HC3 give higher weight to the residuals of observations with higher leverage.
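
The weights used by HC2 and HC3 can be written directly in terms of the \(h_{ii}\). Below is a NumPy sketch of this algebra on made-up data (the article's own simulations use Stata do-files; this is only an illustration, and the design, coefficients, and seed are invented). The Stata options each estimator corresponds to are noted in the comments.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Made-up design: n observations, k regressors (constant included).
n, k = 100, 6
X = np.column_stack([np.ones(n), rng.lognormal(size=(n, k - 1))])
y = X @ np.ones(k) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Leverage points: diagonal of the hat matrix X(X'X)^{-1}X'.
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
print(h.mean(), k / n)  # the mean of the leverages is exactly k/n

# Sandwich VCE: (X'X)^{-1} X' diag(w_i e_i^2) X (X'X)^{-1},
# where the weights w_i distinguish HC1, HC2, and HC3.
def sandwich(w):
    meat = (X * (w * resid**2)[:, None]).T @ X
    return XtX_inv @ meat @ XtX_inv

V_hc1 = sandwich(np.full(n, n / (n - k)))  # Stata: vce(robust)
V_hc2 = sandwich(1 / (1 - h))              # Stata: vce(hc2)
V_hc3 = sandwich(1 / (1 - h) ** 2)         # Stata: vce(hc3)
```

Because the HC3 weights dominate the HC2 weights whenever \(0 \le h_{ii} < 1\), the HC3 standard errors are never smaller than the HC2 ones in a given sample.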

Chesher and Jewitt (1987) and Chesher and Austin (1991) study the bias of the estimators proposed by MacKinnon and White (1985). The explicit form of the bias is a function of \(h_{ii}\). One of the interesting results of Chesher and Austin (1991) is that the simulations of MacKinnon and White (1985) “contain a point of moderate leverage”, which, once removed, makes “all the tests that use heteroskedasticity[-]consistent covariance matrix estimators perform well”.

Long and Ervin (2000) provide advice about the use of heteroskedasticity-consistent standard errors. Like MacKinnon and White (1985), they find that HC3 performs better in small samples. They also suggest that the different VCE estimators studied in MacKinnon and White (1985) start to be equivalent after 250 observations. This finding is consistent across their simulation designs. The number 250 is not random: their designs have five regressors, so 250 observations give 50 observations per regressor. This is an important consideration. With 10 regressors and heteroskedasticity, 250 observations might not be adequate.

Theoretical and simulation results by Cattaneo, Jansson, and Newey (2018) illustrate the importance of having enough observations for each regressor. They study asymptotics in which the ratio \(k/n\) need not vanish. Although their results are for a different framework, they show that the performance of all the estimators discussed above is poor when \(n/k\) is small.

The results of Long and Ervin (2000) and Cattaneo, Jansson, and Newey (2018) are closely related to those of MacKinnon and White (1985), Chesher and Jewitt (1987), and Chesher and Austin (1991). They are related via \(h_{ii}\). The mean of the \(h_{ii}\) is \(k/n\). Thus, the mean of the leverage points is also a way of looking at how much information we need to recover each of the \(k\) parameters in our specification.

Simulation results

Below, I present three sets of simulation results. The first follows the spirit of MacKinnon (2012). The second follows the spirit of Long and Ervin (2000). The third follows Angrist and Pischke (2009). In the simulations, I compare HC1, HC2, HC3, and the wild bootstrap (WB). The WB in the simulations imposes the null hypothesis and uses 999 replications with Rademacher weights.

The MacKinnon (2012) simulations look at the performance of heteroskedasticity-consistent standard errors in the HCk class (HC1, HC2, and HC3) and at the WB. He considers an error term, \(\varepsilon_i\), of the form
\begin{equation*}
\varepsilon_i = \left(\sum_{j=1}^k X_{ij}\beta_j\right)^{\gamma}N(0,1)
\end{equation*}

A value of \(\gamma=0\) implies no heteroskedasticity, and a value of \(\gamma \geq 1\) implies high heteroskedasticity. The covariates are uncorrelated and log-normally distributed. The simulations below that I refer to as MacKinnon-type simulations follow the same structure but also incorporate discrete covariates.
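
As a rough illustration, the error process above can be generated as follows. This is a Python sketch with invented values (all \(\beta_j = 1\) and a fixed seed); the actual simulations use the Stata do-files listed in the appendix.

```python
import numpy as np

def mackinnon_dgp(n, k, gamma, rng):
    """Draw (y, X) with error scale (X @ beta)**gamma; beta_j = 1 here."""
    beta = np.ones(k)
    X = rng.lognormal(size=(n, k))       # uncorrelated, log-normal covariates
    scale = (X @ beta) ** gamma          # positive, because X > 0
    eps = scale * rng.normal(size=n)     # gamma = 0 gives homoskedastic errors
    return X @ beta + eps, X

y, X = mackinnon_dgp(1000, 6, 1.0, np.random.default_rng(0))
```

Note that the log-normal covariates keep the index \(\sum_j X_{ij}\beta_j\) positive, so raising it to a noninteger power \(\gamma\) is well defined.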

Long and Ervin (2000) also create the variance by multiplying the error term by a function of the covariates. However, they allow the covariates to be correlated and include discrete covariates, and the distributions of the covariates differ. For some designs, error terms are drawn from a normal distribution and for others from a \(\chi^2\) distribution. The simulations below that I call Long and Ervin-type simulations follow the idea of correlating the covariates, using error terms from a \(\chi^2\) distribution, and drawing continuous and discrete covariates from different distributions. Unlike Long and Ervin (2000), however, I include all covariates in the expression that multiplies the error term.

The Angrist and Pischke (2009) simulations are for only one binary variable. They introduce heteroskedasticity by allowing the variance to change depending on the value of the covariate. Also, the proportion of zeros and ones is skewed. I follow the same design but explore behavior for different sample sizes rather than only the sample size of 30 used by Angrist and Pischke (2009).
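
A minimal sketch of this kind of design follows. The proportion of ones (p_one) and the two group standard deviations (sd_zero, sd_one) are illustrative values I chose, not the values used in the book.

```python
import numpy as np

def ap_dgp(n, rng, p_one=0.9, sd_zero=1.0, sd_one=0.5):
    """One skewed binary regressor; the error s.d. depends on its value."""
    d = (rng.uniform(size=n) < p_one).astype(float)
    eps = rng.normal(size=n) * np.where(d == 1, sd_one, sd_zero)
    y = 1 + 0.5 * d + eps
    return y, d

y, d = ap_dgp(300, np.random.default_rng(7))
```

The skewed proportion means one of the two groups is small even in moderate samples, which is precisely what makes inference on the binary coefficient fragile.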

The reason I chose these three sets of simulations is to try to cover the most representative and well-known results in the literature. I expand them to incorporate features that I wanted to consider, such as discrete covariates and a form of heteroskedasticity that incorporates all covariates. The modifications are minor but provide intuition that was not immediate from the original specifications.

The do-files used for the simulations are in the appendix.

MacKinnon-type simulations

I conduct simulations for 3 sample sizes—100, 1,000, and 5,000—and 4 levels of heteroskedasticity: low (\(\gamma=0.5\)), medium (\(\gamma=1\)), high (\(\gamma=1.5\)), and very high (\(\gamma=2.0\)). In all cases, there are six parameters we are trying to recover. Two parameters are associated with log-normally distributed continuous variables, as suggested in MacKinnon (2012). The other four parameters are from two categorical variables with three categories (the base category is excluded). I keep only simulation draws for which the VCE is full rank. In a couple of the simulations, I lose one of the 2,000 repetitions because the matrix is not full rank.

When \(N=100\), the number of observations per regressor, \(N/k = 16.66\), is small, making inference challenging for all estimators. For each simulation draw, I compute the maximum of the leverage points. The average maximum value of the leverages is around 0.46 for all levels of heteroskedasticity and reaches 1 for some of the simulation draws.

When \(N=1000\), the number of observations per regressor, \(N/k = 166.66\), is a bit larger, and inference starts to become more accurate. The maximum of the leverage for all draws now has a mean of around 0.20 for all levels of heteroskedasticity and, at its largest, is between 0.78 and 0.87 depending on the level of heteroskedasticity. Inference is still challenging, and some of the issues we observe at \(N/k=16.66\) remain.

When \(N=5000\), the number of observations per regressor is \(N/k = 833.33\), and inference becomes more accurate for all estimators. The maximum of the leverage for all draws now has a mean of around 0.10 for all levels of heteroskedasticity and, at its largest, is between 0.6 and 0.82 depending on the level of heteroskedasticity. Even for this sample size, leverage points can be high in some designs.

Below, I present the simulation results. I split the discussion between coefficients associated with continuous covariates and those associated with categorical covariates.

Continuous covariates

For a small sample size, \(N=100\), the 5% rejection rates of the coefficients for the continuous covariates follow what MacKinnon and White (1985) found. That is, 5% rejection rates are closer to 0.05 for HC3 than for HC2 and HC1, but they are above 0.05 for all HCk-type estimators. The WB, on the other hand, tends to be conservative, with rejection rates at or slightly below 0.05 and closer to the nominal rate than those of the other VCE estimators.

Table 1 below presents the simulation results for the 4 VCE estimators for different levels of heteroskedasticity when the sample size is \(N=100\).

Table 1: Continuous covariates: 5% rejection rates for different levels of heteroskedasticity

Simulation results for \(N=100\) and 2,000 replications
Parameter VCE \(\gamma=0.5\) \(\gamma=1.0\) \(\gamma=1.5\) \(\gamma=2.0\)
\(\beta_1\) HC1 0.159 0.208 0.234 0.255
HC2 0.125 0.156 0.170 0.175
HC3 0.089 0.096 0.096 0.086
WB 0.042 0.041 0.043 0.037
\(\beta_2\) HC1 0.137 0.180 0.214 0.238
HC2 0.109 0.138 0.151 0.157
HC3 0.080 0.088 0.089 0.087
WB 0.034 0.035 0.031 0.028

In tables 2 and 3 below, we see that when the sample sizes are \(N=1000\) and \(N=5000\), the behavior above persists. However, as the sample size increases, all estimators are closer to the 5% rejection rate.

Table 2: Continuous covariates: 5% rejection rates for different levels of heteroskedasticity

Simulation results for \(N=1000\) and 2,000 replications
Parameter VCE \(\gamma=0.5\) \(\gamma=1.0\) \(\gamma=1.5\) \(\gamma=2.0\)
\(\beta_1\) HC1 0.087 0.104 0.108 0.105
HC2 0.076 0.084 0.083 0.085
HC3 0.066 0.070 0.065 0.066
WB 0.052 0.044 0.044 0.036
\(\beta_2\) HC1 0.087 0.094 0.099 0.097
HC2 0.078 0.075 0.078 0.072
HC3 0.064 0.064 0.061 0.052
WB 0.048 0.045 0.031 0.031

Table 3: Continuous covariates: 5% rejection rates for different levels of heteroskedasticity

Simulation results for \(N=5000\) and 2,000 replications
Parameter VCE \(\gamma=0.5\) \(\gamma=1.0\) \(\gamma=1.5\) \(\gamma=2.0\)
\(\beta_1\) HC1 0.076 0.062 0.065 0.061
HC2 0.072 0.051 0.057 0.053
HC3 0.069 0.044 0.053 0.048
WB 0.061 0.044 0.044 0.039
\(\beta_2\) HC1 0.073 0.062 0.070 0.061
HC2 0.070 0.058 0.062 0.056
HC3 0.066 0.051 0.060 0.050
WB 0.057 0.044 0.050 0.043

Discrete covariates

For \(\beta_3\) and \(\beta_5\) with \(N=100\), the following holds. HC1 is the closest to the 5% rejection rate. HC2 is close to the 5% rate when heteroskedasticity is not high; when heteroskedasticity is high, HC2 has rejection rates below 0.05. HC3 and the WB have rejection rates smaller than 0.05, and the rates become smaller the larger the heteroskedasticity. The rates of HC3 and the wild bootstrap are always below those of HC2.

For \(\beta_4\) and \(\beta_6\) with \(N=100\), the following holds. HC1 and HC2 have rejection rates higher than 0.05 for low levels of heteroskedasticity, while HC3 is close to the ideal rate in these cases. When heteroskedasticity is high, the behavior of HC1 persists, HC2 gets closer to the ideal rate, and HC3 starts to produce rates below 0.05. The WB always produces rates below those of the other estimators.

When \(N=1000\), all estimators are close to the ideal rejection rate when heteroskedasticity is less than very high. When heteroskedasticity is very high, HC1 is closest to the ideal rejection rate. When \(N=5000\), all estimators are close to the ideal rejection rate except HC3, which has rejection rates below 0.05 for very high levels of heteroskedasticity.

Table 4 below presents the simulation results for the 4 VCE estimators for different levels of heteroskedasticity when the sample size is \(N=100\). Tables 5 and 6 show results for \(N=1000\) and \(N=5000\).
Table 4: Discrete covariates: 5% rejection rates for different levels of heteroskedasticity

Simulation results for \(N=100\) and 2,000 replications
Parameter VCE \(\gamma=0.5\) \(\gamma=1.0\) \(\gamma=1.5\) \(\gamma=2.0\)
\(\beta_3\) HC1 0.054 0.052 0.051 0.047
HC2 0.053 0.050 0.044 0.034
HC3 0.046 0.038 0.026 0.022
WB 0.032 0.032 0.030 0.027
\(\beta_4\) HC1 0.084 0.082 0.076 0.068
HC2 0.072 0.071 0.063 0.049
HC3 0.058 0.053 0.042 0.025
WB 0.040 0.039 0.031 0.025
\(\beta_5\) HC1 0.049 0.050 0.046 0.048
HC2 0.047 0.045 0.037 0.035
HC3 0.036 0.035 0.028 0.019
WB 0.033 0.033 0.027 0.028
\(\beta_6\) HC1 0.081 0.078 0.068 0.061
HC2 0.069 0.066 0.059 0.045
HC3 0.050 0.047 0.037 0.027
WB 0.037 0.033 0.024 0.020

Table 5: Discrete covariates: 5% rejection rates for different levels of heteroskedasticity

Simulation results for \(N=1000\) and 2,000 replications
Parameter VCE \(\gamma=0.5\) \(\gamma=1.0\) \(\gamma=1.5\) \(\gamma=2.0\)
\(\beta_3\) HC1 0.047 0.053 0.053 0.040
HC2 0.047 0.051 0.049 0.032
HC3 0.045 0.050 0.044 0.027
WB 0.043 0.052 0.049 0.037
\(\beta_4\) HC1 0.051 0.054 0.056 0.040
HC2 0.051 0.051 0.049 0.032
HC3 0.049 0.046 0.045 0.029
WB 0.050 0.047 0.050 0.036
\(\beta_5\) HC1 0.044 0.054 0.051 0.054
HC2 0.044 0.053 0.048 0.046
HC3 0.042 0.050 0.045 0.039
WB 0.043 0.053 0.049 0.048
\(\beta_6\) HC1 0.053 0.057 0.051 0.049
HC2 0.052 0.054 0.048 0.043
HC3 0.050 0.052 0.042 0.038
WB 0.047 0.052 0.046 0.041

Table 6: Discrete covariates: 5% rejection rates for different levels of heteroskedasticity

Simulation results for \(N=5000\) and 2,000 replications
Parameter VCE \(\gamma=0.5\) \(\gamma=1.0\) \(\gamma=1.5\) \(\gamma=2.0\)
\(\beta_3\) HC1 0.046 0.053 0.049 0.045
HC2 0.046 0.053 0.047 0.043
HC3 0.046 0.052 0.045 0.040
WB 0.045 0.052 0.049 0.045
\(\beta_4\) HC1 0.058 0.054 0.048 0.048
HC2 0.058 0.054 0.047 0.044
HC3 0.057 0.053 0.045 0.039
WB 0.058 0.052 0.047 0.049
\(\beta_5\) HC1 0.050 0.058 0.047 0.045
HC2 0.050 0.057 0.044 0.041
HC3 0.049 0.057 0.042 0.038
WB 0.048 0.055 0.046 0.043
\(\beta_6\) HC1 0.055 0.059 0.051 0.045
HC2 0.055 0.058 0.050 0.041
HC3 0.055 0.056 0.049 0.039
WB 0.055 0.059 0.051 0.046

Long and Ervin-type simulations

I again conduct simulations for three sample sizes. As in Long and Ervin (2000), I allow correlation between covariates and include both continuous and categorical covariates. The error term is not normal, and I allow for a high level of heteroskedasticity throughout. Instead of the five parameters of Long and Ervin (2000), I focus on six.

When the sample size is \(N=100\), the average value of the maximum leverage is approximately 0.24 and may reach 0.46 for some draws. This is less severe than in the MacKinnon-type simulations but still generates rejection rates above 0.05 for the HCk estimators. When the sample size is \(N=1000\), the average maximum leverage is approximately 0.042 and is at its maximum around 0.11. When \(N=5000\), the maximum leverage is always below 0.04.

I arrive at a conclusion for the Long and Ervin-type simulations similar to the one I drew for the MacKinnon-type simulations in the previous section. HC3 is best at approximating the ideal rejection rate for the continuous covariates, \(\beta_1\) and \(\beta_2\), but has rejection rates that are too low for the discrete covariates. For the discrete covariates, HC1 is closest to the ideal rejection rate but has high rejection rates for the continuous covariates. HC2 is better than HC1 for the continuous covariates but worse for the discrete covariates. The WB tends to have rejection rates below 0.05 and lower than those of the other estimators.

In table 7 below, I present the rejection rates for all covariates and sample sizes.

Table 7: 5% rejection rates for three sample sizes

Parameter VCE \(N=100\) \(N=1000\) \(N=5000\)
\(\beta_1\) HC1 0.099 0.054 0.053
HC2 0.082 0.051 0.052
HC3 0.064 0.050 0.052
WB 0.035 0.047 0.055
\(\beta_2\) HC1 0.089 0.052 0.042
HC2 0.073 0.050 0.042
HC3 0.056 0.048 0.042
WB 0.043 0.051 0.044
\(\beta_3\) HC1 0.046 0.046 0.050
HC2 0.045 0.044 0.049
HC3 0.033 0.044 0.049
WB 0.026 0.047 0.052
\(\beta_4\) HC1 0.031 0.044 0.050
HC2 0.024 0.044 0.050
HC3 0.014 0.040 0.049
WB 0.011 0.046 0.051
\(\beta_5\) HC1 0.047 0.063 0.057
HC2 0.038 0.061 0.057
HC3 0.025 0.060 0.057
WB 0.013 0.063 0.061
\(\beta_6\) HC1 0.059 0.060 0.061
HC2 0.045 0.059 0.060
HC3 0.030 0.057 0.060
WB 0.023 0.062 0.060

Angrist and Pischke-type simulations

I mimic the Angrist and Pischke (2009) simulations, but instead of allowing a sample size of 30, I allow 3 different sample sizes, \(N=100\), \(N=300\), and \(N=1000\). All results are in table 8 below. Here I am trying to recover one parameter for a binary regressor. When I have 100 observations, the rejection rate for all estimators is above 0.05, except for the WB, which is below it. The mean of the maximum leverage is approximately 0.11 and at its largest is 0.5. When the sample size is \(N=300\) and \(N=1000\), all estimators are close to the 0.05 rejection rate. Below are the simulation results.

Table 8: 5% rejection rates for three sample sizes

Parameter VCE \(N=100\) \(N=300\) \(N=1000\)
\(\beta_1\) HC1 0.099 0.055 0.055
HC2 0.082 0.052 0.054
HC3 0.066 0.048 0.053
WB 0.030 0.040 0.050

Conclusion

From the literature and my simulations, I conclude that the most important consideration when using heteroskedasticity-consistent standard errors is to have many observations for each parameter (regressor) you would like to estimate. Also, whenever you are concerned about the validity of your standard errors, you should look at the leverage points implied by the fitted model. Leverage points close to 1 should be reason for concern. Simulations show that very high leverage points yield VCE estimators that are not close to the ideal rejection rates.

References

Angrist, J. D., and J.-S. Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton, NJ: Princeton University Press.


Cattaneo, M. D., M. Jansson, and W. K. Newey. 2018. Inference in linear regression models with many covariates and heteroscedasticity. Journal of the American Statistical Association 113: 1350–1361. https://doi.org/10.1080/01621459.2017.1328360.


Chesher, A., and I. Jewitt. 1987. The bias of a heteroskedasticity consistent covariance matrix estimator. Econometrica 55: 1217–1222. https://doi.org/10.2307/1911269.


Chesher, A., and G. Austin. 1991. The finite-sample distributions of heteroskedasticity robust Wald statistics. Journal of Econometrics 47: 153–173. https://doi.org/10.1016/0304-4076(91)90082-O.


Long, J. S., and L. H. Ervin. 2000. Using heteroscedasticity consistent standard errors in the linear regression model. American Statistician 54: 217–224. https://doi.org/10.2307/2685594.


MacKinnon, J. G. 2012. Thirty years of heteroscedasticity-robust inference. In Recent Advances and Future Directions in Causality, Prediction, and Specification Analysis, ed. X. Chen, and N. R. Swanson, 437–461. New York: Springer. https://doi.org/10.1007/978-1-4614-1653-1_17.


MacKinnon, J. G., and H. White. 1985. Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics 29: 305–325. https://doi.org/10.1016/0304-4076(85)90158-7.


White, H. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817–838. https://doi.org/10.2307/1912934.

Appendix: Do-files and simulations

For the MacKinnon-type simulations, there is a file for each sample size and level of heteroskedasticity. There are many ways of running simulations using all of these files. I provide each one so that those wanting to use them can decide which way is best.

For the sample size \(N=100\), for example, the files are named

gamma_05_100.do
gamma_1_100.do
gamma_15_100.do
gamma_20_100.do

The number after the first underscore refers to the level of heteroskedasticity. The number after the second underscore refers to the sample size.

For the Long and Ervin-type simulations, I have

long_100.do
long_1000.do
long_5000.do

The number after the first underscore refers to the sample size.

For the Angrist and Pischke-type simulations, the naming convention is the same as for the Long and Ervin case.

harmless_100.do
harmless_300.do
harmless_1000.do