Use poisson rather than regress; tell a friend
Do you ever fit regressions of the form
ln(y_{j}) = b_{0} + b_{1}x_{1j} + b_{2}x_{2j} + … + b_{k}x_{kj} + ε_{j}
by typing
. generate lny = ln(y)
. regress lny x1 x2 … xk
The above is just an ordinary linear regression except that ln(y) appears on the left-hand side in place of y.
The next time you need to fit such a model, rather than fitting a regression on ln(y), consider typing
. poisson y x1 x2 … xk, vce(robust)
which is to say, fit instead a model of the form
y_{j} = exp(b_{0} + b_{1}x_{1j} + b_{2}x_{2j} + … + b_{k}x_{kj} + ε_{j})
Wait, you are probably thinking. Poisson regression assumes the variance is equal to the mean,
E(y_{j}) = Var(y_{j}) = exp(b_{0} + b_{1}x_{1j} + b_{2}x_{2j} + … + b_{k}x_{kj})
whereas linear regression merely assumes E(ln(y_{j})) = b_{0} + b_{1}x_{1j} + b_{2}x_{2j} + … + b_{k}x_{kj} and places no constraint on the variance. Actually, regression does assume the variance is constant, but because we are working in logs, that amounts to assuming that Var(y_{j}) is proportional to the square of its mean, [E(y_{j})]^{2}, which is reasonable in many cases and can be relaxed if you specify vce(robust).
In any case, in a Poisson process, the mean is equal to the variance. If your goal is to fit something like a Mincer earnings model,
ln(income_{j}) = b_{0} + b_{1}*education_{j} + b_{2}*experience_{j} + b_{3}*experience_{j}^{2} + ε_{j}
there is simply no reason to think that the variance of income is equal to its mean. If a person has an expected income of $45,000, there is no reason to think that the variance around that mean is 45,000, which is to say, that the standard deviation is $212.13. Indeed, it would be absurd to think one could predict income so accurately based solely on years of schooling and job experience.
Nonetheless, I suggest you fit this model using Poisson regression rather than linear regression. It turns out that the estimated coefficients of the maximum-likelihood Poisson estimator in no way depend on the assumption that E(y_{j}) = Var(y_{j}), so even if the assumption is violated, the estimates of the coefficients b_{0}, b_{1}, …, b_{k} are unaffected. In the maximum-likelihood estimator for Poisson, what does depend on the assumption that E(y_{j}) = Var(y_{j}) are the estimated standard errors of the coefficients b_{0}, b_{1}, …, b_{k}. If the E(y_{j}) = Var(y_{j}) assumption is violated, the reported standard errors are useless. I did not suggest, however, that you type
. poisson y x1 x2 … xk
I suggested that you type
. poisson y x1 x2 … xk, vce(robust)
That is, I suggested that you specify that the variance-covariance matrix of the estimates (the standard errors being the square roots of its diagonal elements) be estimated using the Huber/White/Sandwich linearized estimator. That estimator of the variance-covariance matrix does not assume E(y_{j}) = Var(y_{j}), nor does it even require that Var(y_{j}) be constant across j. Thus, Poisson regression with the Huber/White/Sandwich linearized estimator of variance is a permissible alternative to log linear regression. I am about to show you that, and then I am going to tell you why it is better.
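To make the argument concrete, the Poisson first-order conditions are sum_j (y_{j} - exp(x_{j}'b)) x_{j} = 0, and the sandwich variance is built from squared residuals. Here is a minimal sketch in Python rather than Stata; the variable names, data-generating values, and the Newton-Raphson details are illustrative assumptions, not Stata's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulated data (not the dataset used in this post):
# lognormal errors, so E(y) != Var(y) and the Poisson variance
# assumption is deliberately violated.
n = 5000
educ = rng.integers(8, 21, n).astype(float)
exper = rng.uniform(0.0, 40.0, n)
X = np.column_stack([np.ones(n), educ, exper, exper ** 2])
b_true = np.array([8.5172, 0.06, 0.1, -0.002])
sigma2 = 1.083
y = np.exp(X @ b_true + rng.normal(0.0, np.sqrt(sigma2), n))

def loglik(b):
    # Poisson log pseudolikelihood, up to an additive constant.
    with np.errstate(over="ignore", invalid="ignore"):
        eta = X @ b
        return y @ eta - np.exp(eta).sum()

# Newton-Raphson (with step-halving) on the score equations
#   sum_j (y_j - exp(x_j'b)) x_j = 0.
# The assumption E(y_j) = Var(y_j) appears nowhere in these equations,
# which is why the coefficient estimates do not depend on it.
b = np.zeros(4)
b[0] = np.log(y.mean())              # crude but adequate starting value
for _ in range(100):
    mu = np.exp(X @ b)
    score = X.T @ (y - mu)
    info = (X * mu[:, None]).T @ X   # negative Hessian
    step = np.linalg.solve(info, score)
    t, ll0 = 1.0, loglik(b)
    for _ in range(60):              # halve the step if it overshoots
        if loglik(b + t * step) >= ll0:
            break
        t /= 2.0
    b = b + t * step
    if np.max(np.abs(t * step)) < 1e-9:
        break

# Huber/White sandwich covariance A^{-1} B A^{-1}; B uses the squared
# residuals, so it makes no assumption that the variance equals the mean.
mu = np.exp(X @ b)
A_inv = np.linalg.inv((X * mu[:, None]).T @ X)
B = (X * ((y - mu) ** 2)[:, None]).T @ X
se = np.sqrt(np.diag(A_inv @ B @ A_inv))

# Slopes land near 0.06, 0.1, -0.002; the intercept lands near
# 8.5172 + sigma2/2 because Poisson targets E(y), not E(ln(y)).
print(np.round(b, 4))
print(np.round(se, 4))
```

The variance assumption enters only through the sandwich filling B; replacing the squared residuals there with mu would reproduce the (incorrect, for these data) model-based standard errors.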
I have created simulated data in which
y_{j} = exp(8.5172 + 0.06*educ_{j} + 0.1*exp_{j} – 0.002*exp_{j}^{2} + ε_{j})
where ε_{j} is distributed normal with mean 0 and variance 1.083 (standard deviation 1.041). Here’s the result of estimation using regress:
. regress lny educ exp exp2

      Source |       SS       df       MS              Number of obs =    5000
-------------+------------------------------           F(  3,  4996) =   44.72
       Model |  141.437342     3  47.1457806           Prob > F      =  0.0000
    Residual |  5267.33405  4996  1.05431026           R-squared     =  0.0261
-------------+------------------------------           Adj R-squared =  0.0256
       Total |  5408.77139  4999  1.08197067           Root MSE      =  1.0268

------------------------------------------------------------------------------
         lny |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0716126   .0099511     7.20   0.000      .052104    .0911212
         exp |   .1091811   .0129334     8.44   0.000     .0838261    .1345362
        exp2 |  -.0022044   .0002893    -7.62   0.000    -.0027716   -.0016373
       _cons |   8.272475   .1855614    44.58   0.000     7.908693    8.636257
------------------------------------------------------------------------------
I intentionally created these data to produce a low R-squared.
We obtained the following results:
                truth      est.      S.E.
       ----------------------------------
       educ    0.0600    0.0716    0.0100
       exp     0.1000    0.1092    0.0129
       exp2   -0.0020   -0.0022    0.0003
       ----------------------------------
       _cons   8.5172    8.2725    0.1856   <- unadjusted (1)
               9.0587    8.7959         ?   <- adjusted   (2)
       ----------------------------------
       (1) To be used for predicting E(ln(y_{j}))
       (2) To be used for predicting E(y_{j})
Note that the estimated coefficients are quite close to the true values. Ordinarily, we would not know the true values, except I created this artificial dataset and those are the values I used.
For the intercept, I list two values, so I need to explain. We estimated a linear regression of the form,
ln(y_{j}) = b_{0} + X_{j}b + ε_{j}
As with all linear regressions,
E(ln(y_{j})) = E(b_{0} + X_{j}b + ε_{j}) = b_{0} + X_{j}b + E(ε_{j}) = b_{0} + X_{j}b
We, however, have no real interest in E(ln(y_{j})). We fit this log regression as a way of obtaining estimates of our real model, namely
y_{j} = exp(b_{0} + X_{j}b + ε_{j})
So rather than taking the expectation of ln(y_{j}), let's take the expectation of y_{j}:
E(y_{j}) = E(exp(b_{0} + X_{j}b + ε_{j})) = E(exp(b_{0} + X_{j}b) * exp(ε_{j})) = exp(b_{0} + X_{j}b) * E(exp(ε_{j}))
E(exp(ε_{j})) is not one. E(exp(ε_{j})) for ε_{j} distributed N(0, σ^{2}) is exp(σ^{2}/2). We thus obtain
E(y_{j}) = exp(b_{0} + X_{j}b) * exp(σ^{2}/2)
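If you want to convince yourself of that fact, a quick Monte Carlo check will do it. This is a sketch in Python rather than Stata; the sample size is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma2 = 1.083                    # the error variance used in this post's data
eps = rng.normal(0.0, np.sqrt(sigma2), 2_000_000)

# For eps ~ N(0, sigma^2), E(exp(eps)) is exp(sigma^2/2), not 1.
mc_mean = np.exp(eps).mean()
theory = np.exp(sigma2 / 2)
print(mc_mean, theory)            # both close to 1.7186
```

With sigma^2 around 1, the factor is roughly 1.7, so forgetting it understates E(y_{j}) by over 40 percent.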
People who fit log regressions know about this — or should — and know that to obtain predicted y_{j} values, they must
- Obtain predicted values for ln(y_{j}) = b_{0} + X_{j}b.
- Exponentiate the predicted log values.
- Multiply those exponentiated values by exp(σ^{2}/2), where σ^{2} is the square of the root mean squared error (RMSE) of the regression.
They do this in Stata by typing
. predict yhat
. replace yhat = exp(yhat)
. replace yhat = yhat*exp(e(rmse)^2/2)
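The same three steps can be sketched outside Stata to check that the adjustment matters. This is Python with illustrative data-generating numbers, not the dataset used in this post:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative setup: one regressor, lognormal errors with this post's sigma^2.
n = 100_000
x = rng.uniform(0.0, 2.0, n)
sigma2 = 1.083
y = np.exp(1.0 + 0.5 * x + rng.normal(0.0, np.sqrt(sigma2), n))

# Step 1: regress ln(y) on x and predict ln(y)-hat.
lny = np.log(y)
X = np.column_stack([np.ones(n), x])
bhat, *_ = np.linalg.lstsq(X, lny, rcond=None)
lnyhat = X @ bhat
rmse = np.sqrt(((lny - lnyhat) ** 2).sum() / (n - 2))

# Steps 2 and 3: exponentiate, then multiply by exp(RMSE^2 / 2).
yhat_naive = np.exp(lnyhat)               # forgets the adjustment
yhat = yhat_naive * np.exp(rmse ** 2 / 2)

# The naive prediction badly underestimates E(y); the adjusted one does not.
print(y.mean(), yhat_naive.mean(), yhat.mean())
```

With sigma^2 = 1.083 the naive predictions average only about 58 percent of the mean of y, which is exactly the 1/exp(σ^{2}/2) shortfall derived above.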
In the table that I just showed you,
                truth      est.      S.E.
       ----------------------------------
       educ    0.0600    0.0716    0.0100
       exp     0.1000    0.1092    0.0129
       exp2   -0.0020   -0.0022    0.0003
       ----------------------------------
       _cons   8.5172    8.2725    0.1856   <- unadjusted (1)
               9.0587    8.7959         ?   <- adjusted   (2)
       ----------------------------------
       (1) To be used for predicting E(ln(y_{j}))
       (2) To be used for predicting E(y_{j})
I’m setting us up to compare these estimates with those produced by poisson. When we estimate using poisson, we will not need to take logs because the Poisson model is stated in terms of y_{j}, not ln(y_{j}). In preparation for that, I have included two lines for the intercept: 8.5172, the intercept of the log model, which is the one appropriate for making predictions of ln(y), and 9.0587, the intercept appropriate for making predictions of y, equal to 8.5172 plus σ^{2}/2. Poisson regression will estimate the 9.0587 result because Poisson is stated in terms of y rather than ln(y).
I placed a question mark in the column for the standard error of the adjusted intercept because, to calculate that, I would need to know the standard error of the estimated RMSE, and regress does not calculate that.
Let’s now look at the results that poisson with option vce(robust) reports. We must not forget to specify option vce(robust) because otherwise, in this model that violates the Poisson assumption that E(y_{j}) = Var(y_{j}), we would obtain incorrect standard errors.
. poisson y educ exp exp2, vce(robust)
note: you are responsible for interpretation of noncount dep. variable

Iteration 0:   log pseudolikelihood = -1.484e+08
Iteration 1:   log pseudolikelihood = -1.484e+08
Iteration 2:   log pseudolikelihood = -1.484e+08

Poisson regression                                Number of obs   =       5000
                                                  Wald chi2(3)    =      67.52
                                                  Prob > chi2     =     0.0000
Log pseudolikelihood = -1.484e+08                 Pseudo R2       =     0.0183

------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0575636   .0127996     4.50   0.000     .0324769    .0826504
         exp |   .1074603   .0163766     6.56   0.000     .0753628    .1395578
        exp2 |  -.0022204   .0003604    -6.16   0.000    -.0029267   -.0015141
       _cons |   9.016428   .2359002    38.22   0.000     8.554072    9.478784
------------------------------------------------------------------------------
So now we can fill in the rest of our table:
                        regress              poisson
                truth     est.     S.E.     est.     S.E.
       ---------------------------------------------------
       educ    0.0600   0.0716   0.0100   0.0576   0.0128
       exp     0.1000   0.1092   0.0129   0.1075   0.0164
       exp2   -0.0020  -0.0022   0.0003  -0.0022   0.0004
       ---------------------------------------------------
       _cons   8.5172   8.2725   0.1856        ?        ?   <- (1)
               9.0587   8.7959        ?   9.0164   0.2359   <- (2)
       ---------------------------------------------------
       (1) To be used for predicting E(ln(y_{j}))
       (2) To be used for predicting E(y_{j})
I told you that Poisson works, and in this case, it works well. I’ll now tell you that in all cases it works well, and it works better than log regression. You want to think about Poisson regression with the vce(robust) option as a better alternative to log regression.
How is Poisson better?
First off, Poisson handles outcomes that are zero. Log regression does not, because ln(0) is -∞. You want to be careful about what it means to handle zeros, however. Poisson handles zeros that arise in accordance with the model. In the Poisson model, everybody participates in the y_{j} = exp(b_{0} + X_{j}b + ε_{j}) process. Poisson regression does not handle cases where some participate and others do not, and where those who did not participate would likely have produced an outcome greater than zero had they participated. I would never suggest using Poisson regression to handle zeros in an earned-income model, because those who earned zero simply did not participate in the labor force. Had they participated, their earnings might have been low, but certainly they would have been greater than zero. Log linear regression does not handle that problem, either.
Natural zeros do arise in other situations, however, and a popular question on Statalist is whether one should recode those natural zeros as 0.01, 0.0001, or 0.0000001 to avoid the missing values when using log linear regression. The answer is that you should not recode at all; you should use Poisson regression with vce(robust).
Secondly, small nonzero values, however they arise, can be influential in log-linear regressions. 0.01, 0.0001, 0.0000001, and 0 may be close to each other, but in logs they are -4.61, -9.21, -16.12, and -∞ and thus not close at all. Pretending that the logged values are close would be the same as pretending that exp(4.61)=100, exp(9.21)=9,997, exp(16.12)=10,019,062, and exp(∞)=∞ are close to each other. Poisson regression understands that 0.01, 0.0001, 0.0000001, and 0 are indeed nearly equal.
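A few lines in any language make the point; here is a sketch in Python:

```python
import math

vals = [0.01, 0.0001, 0.0000001]

# On the raw scale these values all sit within about 0.01 of zero and
# of one another...
raw_spread = max(vals) - min(vals)

# ...but on the log scale they are many units apart, so a log regression
# treats the tiny gaps among them as enormous.
logs = [round(math.log(v), 2) for v in vals]
print(raw_spread)                 # about 0.01
print(logs)                       # [-4.61, -9.21, -16.12]
```

A change from 0.0001 to 0.0000001 is invisible on the raw scale yet moves the log value by about 7 units, which is why recoded zeros can dominate a log-linear fit.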
Thirdly, when estimating with Poisson, you do not have to remember to apply the exp(σ^{2}/2) multiplicative adjustment to transform results from ln(y) to y. I wrote earlier that people who fit log regressions know to apply the adjustment, or should know, but the sad fact is that many do not.
Finally, I would like to tell you that everyone who estimates log models knows about the Poisson-regression alternative and it is only you who have been out to lunch. You, however, are in esteemed company. At the recent Stata Conference in Chicago, I asked a group of knowledgeable researchers a loaded question, to which the right answer was Poisson regression with option vce(robust), but they mostly got it wrong.
I said to them, “I have a process for which it is perfectly reasonable to assume that the mean of y_{j} is given by exp(b_{0} + X_{j}b), but I have no reason to believe that E(y_{j}) = Var(y_{j}), which is to say, no reason to suspect that the process is Poisson. How would you suggest I estimate the model?” Certainly not using Poisson, they replied. Social scientists suggested I use log regression. Biostatisticians and health researchers suggested I use negative binomial regression even when I objected that the process was not the gamma mixture of Poissons that negative binomial regression assumes. “What else can you do?” they said and shrugged their collective shoulders. And of course, they simply assumed overdispersion.
Based on those answers, I was ready to write this blog entry, but it turned out differently than I expected. I was going to slam negative binomial regression. Negative binomial regression makes assumptions about the variance, assumptions different from the one made by Poisson, but assumptions nonetheless, and unlike the assumption made in Poisson, those assumptions do appear in the first-order conditions that determine the fitted coefficients that negative binomial regression reports. Not only would negative binomial’s standard errors be wrong, which vce(robust) could fix, but the coefficients would be biased, too, and vce(robust) would not fix that. I planned to run simulations showing this.
When I ran the simulations, I was surprised by the results. The negative binomial estimator (Stata’s nbreg) was remarkably robust to violations in variance assumptions as long as the data were overdispersed. In fact, negative binomial regression did about as well as Poisson regression. I did not run enough simulations to make generalizations, and theory tells me those generalizations have to favor Poisson, but the simulations suggested that if Poisson does do better, it’s not in the first four decimal places. I was impressed. And disappointed. It would have been a dynamite blog entry.
So you’ll have to content yourself with this one.
Others have preceded me in the knowledge that Poisson regression with vce(robust) is a better alternative to log-linear regression. I direct you to Jeffrey Wooldridge, Econometric Analysis of Cross Section and Panel Data, 2nd ed., chapter 18. Or see A. Colin Cameron and Pravin K. Trivedi, Microeconometrics Using Stata, revised edition, chapter 17.3.2.
I first learned about this from a talk given by Austin Nichols, Regression for nonnegative skewed dependent variables, given in 2010 at the Stata Conference in Boston. That talk goes far beyond what I have presented here, and I heartily recommend it.