## Our users’ favorite commands

We recently had a contest on our Facebook page. To enter, contestants posted their favorite Stata command, feature, or just a post telling us why they love Stata. Contestants then asked their friends, colleagues, and fellow Stata users to vote for their entry by ‘Like’-ing the post. The prize was a copy of Stata/MP 12 (8-core).

The response was overwhelming! We enjoyed reading all the reasons why users love Stata so much that we wanted to share them with you.

The contest question was:

Do you have a favorite command or feature in Stata? What about a memorable experience when using the software? Post your favorite command, feature, or experience in the comments section of this post. Then, get your friends to “like” your comment. The person with the most “likes” by March 13, 2012, wins. The winner will receive a single-user copy of Stata/MP8 12 with PDF documentation.

We had many submissions with multiple “likes”. The winning submissions are:

- 1st place (2,235 Likes), Rodrigo Briceno: “One of the most remarkable experiences with Stata was when I learned to use loops. Making repetitive procedures in so short amounts of time is really amazing! I LIKE STATA!”
- 2nd place (1,464 Likes), Juan Jose Salcedo: “My Favorite STATA command is by far COLLAPSE! Getting descriptive statistics couldn’t be any easier!”
- 3rd place (140 Likes), Tymon Sloczynski: “My favourite command is ‘oaxaca’, a user-written command (by Ben Jann from Zurich) which can be used to carry out the so-called Oaxaca-Blinder decomposition. I often use it in my research and it saves a lot of time – which easily makes it favourite!”

## Comparing predictions after arima with manual computations

Some of our users have asked about the way predictions are computed after fitting their models with arima. Those users report that they cannot reproduce the complete set of forecasts manually when the model contains MA terms. Specifically, they cannot reproduce the exact values for the first few predicted periods. The reason for the difference between their manual results and the forecasts obtained with predict after arima is the way the starting values and the recursive predictions are computed. While Stata uses the Kalman filter to compute the forecasts based on the state space representation of the model, users reporting differences compute their forecasts with a different estimator that is based on the recursions derived from the ARIMA representation of the model. Both estimators are consistent, but they produce slightly different results for the first few forecasting periods.

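When using the postestimation command predict after fitting their MA(1) model with arima, some users claim that they should be able to reproduce the predictions with

$$\hat{y}_t = \hat{\beta} + \hat{\theta}\,\hat{\epsilon}_{t-1}$$

where

$$\hat{\epsilon}_t = y_t - \hat{y}_t$$

However, the recursive formula for the Kalman filter prediction is based on the shrunk error (see section 13.3 in Hamilton (1994) for the complete derivation based on the state space representation):

$$\hat{\epsilon}_t = \frac{\hat{\sigma}^2}{\hat{\sigma}^2 + \hat{\theta}^2 p_t}\,(y_t - \hat{y}_t)$$

where

$$p_{t+1} = \frac{\hat{\sigma}^2\,\hat{\theta}^2\,p_t}{\hat{\sigma}^2 + \hat{\theta}^2 p_t}, \qquad p_1 = \hat{\sigma}^2, \qquad \hat{\epsilon}_0 = 0$$

$\hat{\sigma}^2$: is the estimated variance of the white noise disturbance

$\hat{\epsilon}_0$: corresponds to the unconditional mean for the error term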

Let’s use one of the datasets available from our website to fit a MA(1) model and compute the predictions based on the Kalman filter recursions formulated above:

** Predictions with Kalman Filter recursions (obtained with -predict-) **
use http://www.stata-press.com/data/r12/lutkepohl, clear
arima dlinvestment, ma(1)
predict double yhat

** Coefficient estimates and sigma^2 from ereturn list **
scalar beta = _b[_cons]
scalar theta = [ARMA]_b[L1.ma]
scalar sigma2 = e(sigma)^2

** pt and shrinking factor for the first two observations**
generate double pt=sigma2 in 1/2
generate double sh_factor=(sigma2)/(sigma2+theta^2*pt) in 2

** Predicted series and errors for the first two observations **
generate double my_yhat = beta
generate double myehat = sh_factor*(dlinvestment - my_yhat) in 2

** Predictions with the Kalman filter recursions **
quietly {
    forvalues i = 3/91 {
        replace my_yhat = my_yhat + theta*l.myehat in `i'
        replace pt = (sigma2*theta^2*L.pt)/(sigma2+theta^2*L.pt) in `i'
        replace sh_factor = (sigma2)/(sigma2+theta^2*pt) in `i'
        replace myehat = sh_factor*(dlinvestment - my_yhat) in `i'
    }
}

List the first 10 predictions (yhat from predict and my_yhat from the manual computations):

. list qtr yhat my_yhat pt sh_factor in 1/10

+--------------------------------------------------------+
|    qtr        yhat     my_yhat          pt   sh_factor |
|--------------------------------------------------------|
1. | 1960q1   .01686688   .01686688   .00192542           . |
2. | 1960q2   .01686688   .01686688   .00192542   .97272668 |
3. | 1960q3   .02052151   .02052151   .00005251   .99923589 |
4. | 1960q4   .01478403   .01478403   1.471e-06   .99997858 |
5. | 1961q1   .01312365   .01312365   4.125e-08    .9999994 |
|--------------------------------------------------------|
6. | 1961q2   .00326376   .00326376   1.157e-09   .99999998 |
7. | 1961q3   .02471242   .02471242   3.243e-11           1 |
8. | 1961q4   .01691061   .01691061   9.092e-13           1 |
9. | 1962q1   .01412974   .01412974   2.549e-14           1 |
10. | 1962q2   .00643301   .00643301   7.147e-16           1 |
+--------------------------------------------------------+

Notice that the shrinking factor (sh_factor) tends to 1 as t increases, which implies that after a few initial periods the predictions produced with the Kalman filter recursions become numerically indistinguishable from the ones produced by the formula at the top of this entry for the recursions derived from the ARIMA representation of the model.
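To see the size of the gap directly, the recursion derived from the ARIMA representation can be run next to the Kalman filter version. The following is only a sketch: it assumes the commands above have already been run in the same session (so that yhat, beta, and theta exist), and the names naive_yhat, naive_ehat, and diff are introduced here purely for illustration. Because $p_{t+1} \le \hat{\theta}^2 p_t$, pt shrinks geometrically whenever $|\hat{\theta}| < 1$, so any difference should be confined to the first few periods.

```stata
* Naive recursion from the ARIMA representation (no shrinking factor):
*   yhat_t = beta + theta*e_{t-1},   e_t = y_t - yhat_t,   e_0 = 0
generate double naive_yhat = beta
generate double naive_ehat = dlinvestment - naive_yhat in 2

quietly {
    forvalues i = 3/`=_N' {
        replace naive_yhat = beta + theta*L.naive_ehat in `i'
        replace naive_ehat = dlinvestment - naive_yhat in `i'
    }
}

* Gap between -predict- (Kalman filter) and the naive recursion
generate double diff = yhat - naive_yhat
list qtr yhat naive_yhat diff in 1/10
```

The diff column is the quantity users notice when their manual forecasts do not match predict in the early periods.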

Reference:

Hamilton, James. 1994. Time Series Analysis. Princeton University Press.


## Building complicated expressions the easy way

Have you ever wanted to make an “easy” calculation–say, after fitting a model–and gotten lost because you just weren’t sure where to find the degrees of freedom of the residual or the standard error of the coefficient? Have you ever been in the midst of constructing an “easy” calculation and been suddenly unsure just what e(df_r) really was? I have a solution.

It’s called Stata’s expression builder. You can get to it from the display dialog (Data->Other Utilities->Hand Calculator).

In the dialog, click the Create button to bring up the builder. Really, it doesn’t look like much:

I want to show you how to use this expression builder; if you’ll stick with me, it’ll be worth your time.

Let’s start over again and assume you are in the midst of an analysis, say,

. sysuse auto, clear
. regress price mpg length

Next invoke the expression builder by pulling down the menu Data->Other Utilities->Hand Calculator. Click Create. It looks like this:

Now click on the tree node icon (+) in front of “Estimation results” and then scroll down to see what’s underneath. You’ll see

Click on Scalars:

The middle box now contains the scalars stored in e(). N happens to be highlighted, but you could click on any of the scalars. If you look below the two boxes, you see the name of the selected e() scalar as well as its value and a short description. e(N) is 74 and is the “number of observations”.

It works the same way for all the other categories in the box on the left: Operators, Functions, Variables, Coefficients, Estimation results, Returned results, System parameters, Matrices, Macros, Scalars, Notes, and Characteristics. You simply click on the tree node icon (+), and the category expands to show what is available.

You have now mastered the expression builder!

Let’s try it out.

Say you want to verify that the p-value of the coefficient on mpg is correctly calculated by regress–which reports 0.052–or more likely, you want to verify that you know how it was calculated. You think the formula is

$$p = 2\,\Pr\left(t_{df} > \left|\frac{b}{se(b)}\right|\right)$$

or, as an expression in Stata,

2*ttail(e(df_r), abs(_b[mpg]/_se[mpg]))

But I’m jumping ahead. You may not remember that _b[mpg] is the coefficient on variable mpg, or that _se[mpg] is its corresponding standard error, or that abs() is Stata’s absolute value function, or that e(df_r) is the residual degrees of freedom from the regression, or that ttail() is Stata’s reverse cumulative (upper-tail) Student’s t distribution function. We can build the above expression using the builder because all the components can be accessed through the builder. The ttail() and abs() functions are in the Functions category, the e(df_r) scalar is in the Estimation results category, and _b[mpg] and _se[mpg] are in the Coefficients category.
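Once the expression is built, it can be checked against Stata's own calculation. The sketch below compares the hand-built p-value with the one returned by test, which for a single coefficient after regress is the same two-sided test; the scalar name p_manual is made up for this illustration.

```stata
sysuse auto, clear
quietly regress price mpg length

* p-value built from its components
scalar p_manual = 2*ttail(e(df_r), abs(_b[mpg]/_se[mpg]))

* p-value of the equivalent Wald test of _b[mpg] = 0, stored in r(p)
quietly test mpg
display "manual p = " %6.4f p_manual "   test p = " %6.4f r(p)
assert reldif(p_manual, r(p)) < 1e-8
```

If the two numbers agree, the formula you had in mind is the one regress uses.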

What’s nice about the builder is that not only are the item names listed but also a definition, syntax, and value are displayed when you click on an item. Having all this information in one place makes building a complex expression much easier.

Another example of when the expression builder comes in handy is when computing intraclass correlations after xtmixed. Consider a simple two-level model from Example 1 in [XT] xtmixed, which models weight trajectories of 48 pigs from 9 successive weeks:

. use http://www.stata-press.com/data/r12/pig
. xtmixed weight week || id:, variance

The intraclass correlation is a nonlinear function of variance components. In this example, the (residual) intraclass correlation is the ratio of the between-pig variance, var(_cons), to the total variance, between-pig variance plus residual (within-pig) variance, or var(_cons) + var(residual).

The xtmixed command does not store the estimates of variance components directly. Instead, it stores them as log standard deviations in e(b) such that _b[lns1_1_1:_cons] is the estimated log of between-pig standard deviation, and _b[lnsig_e:_cons] is the estimated log of residual (within-pig) standard deviation. So to compute the intraclass correlation, we must first transform log standard deviations to variances:

exp(2*_b[lns1_1_1:_cons])
exp(2*_b[lnsig_e:_cons])

The final expression for the intraclass correlation is then

exp(2*_b[lns1_1_1:_cons]) / (exp(2*_b[lns1_1_1:_cons])+exp(2*_b[lnsig_e:_cons]))

The problem is that few people remember that _b[lns1_1_1:_cons] is the estimated log of between-pig standard deviation. The few who do certainly do not want to type it. So use the expression builder as we do below:

In this case, we’re using the expression builder accessed from Stata’s nlcom dialog, which reports estimated nonlinear combinations along with their standard errors. Once we press OK here and in the nlcom dialog, we’ll see

. nlcom (exp(2*_b[lns1_1_1:_cons])/(exp(2*_b[lns1_1_1:_cons])+exp(2*_b[lnsig_e:_cons])))

_nl_1:  exp(2*_b[lns1_1_1:_cons])/(exp(2*_b[lns1_1_1:_cons])+exp(2*_b[lnsig_e:_cons]))

------------------------------------------------------------------------------
weight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_nl_1 |   .7717142   .0393959    19.59   0.000     .6944996    .8489288
------------------------------------------------------------------------------

The above could easily be extended to computing different types of intraclass correlations arising in higher-level random-effects models. The use of the expression builder for that becomes even more handy.
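As a cross-check on the nlcom point estimate, the same expression can be evaluated step by step with scalars. This is only a sketch of the arithmetic; the scalar names v_between, v_within, and icc are illustrative and not part of xtmixed's stored results.

```stata
use http://www.stata-press.com/data/r12/pig, clear
quietly xtmixed weight week || id:, variance

* Back-transform the log standard deviations stored in e(b) to variances
scalar v_between = exp(2*_b[lns1_1_1:_cons])
scalar v_within  = exp(2*_b[lnsig_e:_cons])

* Intraclass correlation: between-pig share of the total variance
scalar icc = v_between/(v_between + v_within)
display "ICC = " %9.7f icc
```

The displayed value should agree with the coefficient in the nlcom table above; what nlcom adds is the delta-method standard error and confidence interval.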


## The next leap second will be on June 30th, maybe

Leap seconds are the extra seconds inserted every so often to keep precise atomic clocks better synchronized with the rotation of the Earth. Scheduled for June 30th is the extra second 23:59:60 inserted between 23:59:59 and 00:00:00. Or maybe not.

Tomorrow or Friday a vote may be held at the International Telecommunication Union (ITU) meeting in Geneva to abolish the leap second from the definition of UTC (Coordinated Universal Time). That would mean StataCorp would not have to post an update to Stata to keep the %tC format working correctly.

As I’ve blogged before — scroll down to “Why Stata has two datetime encodings” in Using dates and times from other software — Stata supports both UTC time (%tC) and constant-86,400-seconds/day time (%tc). Stata does that because some data are collected using leap-second corrected time, and some uncorrected. Stata is unique or nearly unique in providing both time formats.
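The difference between the two encodings shows up when you subtract two times that straddle a leap second. The sketch below uses the leap second inserted at the end of 31 December 2008; it assumes that leap second is in the table used by your copy of Stata, in which case the %tC gap should be one second longer than the %tc gap. The scalar names gap_C and gap_c are illustrative.

```stata
* %tC counts leap seconds; %tc assumes 86,400 seconds in every day
scalar gap_C = Clock("01jan2009 00:00:00", "DMYhms") - ///
               Clock("31dec2008 23:59:59", "DMYhms")
scalar gap_c = clock("01jan2009 00:00:00", "DMYhms") - ///
               clock("31dec2008 23:59:59", "DMYhms")

display "gap in ms using Clock(): " gap_C
display "gap in ms using clock(): " gap_c

* Cofc() converts a %tc value to its %tC equivalent
display %tC Cofc(clock("01jan2009 00:00:00", "DMYhms"))
```

Use Cofc() and cofC() to move data between the two encodings rather than adding or subtracting seconds by hand.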

I read that Google does something very clever: they stretch the last second of the year out when a leap second occurs, so the data they collect does not end up with ugly times like 23:59:60, and so that it can be more easily processed by software that assumes a constant 86,400 seconds per day.

The IT industry and a number of others, I gather, are pretty united about the benefits of scrapping the leap second.

The vote is predicted to go against continuing the leap second, according to The Economist magazine. The United States and France are for abolishing the leap second. Britain, Canada, and China are believed to be for continuing it. Some 192 countries will get to vote.

Whichever way the vote goes, I would like to remind readers of advice I previously offered to help alleviate the need for leap seconds: Face west and throw rocks. As I previously noted, the benefit will be transitory if the rocks land back on Earth, so you need to throw the rocks really hard. Having now thought more about this issue, a less strenuous way occurs to me: Push rocks downhill or carry them toward the poles, and preferably do both. These suggestions are designed to attack the real problem, which is that the Earth is currently rotating too slowly.


## A tip to debug your nl/nlsur function evaluator program

If you have a bug in your evaluator program, nl will produce, most probably, the following error:

your program returned 198
verify that your program is a function evaluator program
r(198);

The error indicates that your program cannot be evaluated.

The best way to spot any issues in your evaluator program is to run it interactively. You just need to define your sample (usually observations where none of the variables are missing), and a matrix with values for your parameters. Let me show you an example with nlces2. This is the code to fit the CES production function, from the documentation for the nl command:

cscript
program nlces2
    version 12
    syntax varlist(min=3 max=3) if, at(name)
    local logout : word 1 of `varlist'
    local capital : word 2 of `varlist'
    local labor : word 3 of `varlist'
    // Retrieve parameters out of at matrix
    tempname b0 rho delta
    scalar `b0' = `at'[1, 1]
    scalar `rho' = `at'[1, 2]
    scalar `delta' = `at'[1, 3]
    tempvar kterm lterm
    generate double `kterm' = `delta'*`capital'^(-1*`rho') `if'
    generate double `lterm' = (1-`delta')*`labor'^(-1*`rho') `if'
    // Fill in dependent variable
    replace `logout' = `b0' - 1/`rho'*ln(`kterm' + `lterm') `if'
end

webuse production, clear
nl ces2 @ lnoutput capital labor, parameters(b0 rho delta) ///
initial(b0 0 rho 1 delta 0.5)

Now, let me show you how to run it interactively:

webuse production, clear
*generate a variable to restrict my sample to observations
*with non-missing values in my variables
egen u = rowmiss(lnoutput capital labor)

*generate a matrix with parameters where I will evaluate my function
mat M = (0,1,.5)
gen nloutput_new = 1
nlces2 nloutput_new capital labor if u==0, at(M)

This will evaluate the program only once, using the parameters in matrix M. Notice that I generated a new variable to use as my dependent variable. This is because the program nlces2, when run by itself, will modify the dependent variable.
When you run this program by itself, you will obtain a more specific error message. You can add debugging code to this program, and you can also use the trace setting to see how each step is executed. Type help trace to learn about this setting.
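As a sketch of the trace approach, the interactive call can be wrapped as follows. It assumes the setup lines above (u, M, and nloutput_new) have already been run; limiting the trace depth to 1 is a choice made here to keep the output to the lines of nlces2 itself.

```stata
* Echo each line of nlces2 as it executes, without descending
* into the programs that nlces2 itself calls
set tracedepth 1
set trace on
nlces2 nloutput_new capital labor if u==0, at(M)
set trace off
```

The last line echoed before the error message is usually the one to fix.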

Another possible source of error (which will generate error r(480) when run from nl) is when an evaluator function produces missing values for observations in the sample. If this is the case, you will see those missing values in the variable nloutput_new, i.e., in the variable you entered as dependent when running your evaluator by itself. You can then add debugging code, for example, using codebook or summarize to examine the different parts that contribute to the substitution performed in the dependent variable.

For example, after the line that generates `kterm', I could write

summarize `kterm' if u == 0

to see if this variable contains any missing values in my sample.
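The result of the evaluator can be checked the same way from outside the program. This sketch assumes the interactive run above has just filled in nloutput_new and that u flags rows with missing inputs.

```stata
* Count evaluator results that are missing within the estimation sample
count if missing(nloutput_new) & u == 0

* List the offending observations, if any, with the inputs behind them
list capital labor nloutput_new if missing(nloutput_new) & u == 0
```

A nonzero count here is the situation that would produce r(480) when the evaluator is called from nl.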

This method can also be used to debug your function evaluator programs for nlsur. In order to preserve your dataset, you need to use copies for all the dependent variables in your model.


## Good company

17 October 2011

Dembe, Partridge, and Geist (2011, pdf), in a paper recently published in BMC Health Services Research, report that Stata and SAS were “overwhelmingly the most commonly used software applications employed (in 46% and 42.6% of articles respectively)”. The articles referred to were those in health services research studies published in the U.S.

Good company. Both are, in our humble opinion, excellent packages, although we admit to having a preference for one of them.

We should mention that the authors report that SAS usage grew considerably during the study period, and that Stata usage held roughly constant, a conclusion that matches the results in their Table 1, an extract of which is

| | 2007 | 2008 | 2009 | 2007-2009 |
|---|---|---|---|---|
| Total articles | 393 | 374 | 372 | 1,139 |
| Included articles | 282 | 308 | 287 | 877 |
| % Stata used | 48.3 | 42.6 | 47.4 | 46.0 |
| % SAS used | 37.2 | 43.1 | 47.4 | 42.6 |

The authors speculated that the growth of SAS “may have been stimulated by enhancements [...] that gave users the ability to use balanced repeated replication (BRR) and jackknife methods for variance estimation with complex survey data [...]”. Since those features were already in Stata, that sounds reasonable to us.

Let us just say, good company. Good companies.


## Multilevel random effects in xtmixed and sem — the long and wide of it

xtmixed was built from the ground up for dealing with multilevel random effects — that is its raison d’être. sem was built for multivariate outcomes, for handling latent variables, and for estimating structural equations (also called simultaneous systems or models with endogeneity). Can sem also handle multilevel random effects (REs)? Do we care?

This would be a short entry if either answer were “no”, so let’s get after the first question.

Can sem handle multilevel REs?

A good place to start is to simulate some multilevel RE data. Let’s create data for the 3-level regression model

$$y_{ijk} = \beta x_{ijk} + U_k + V_{jk} + E_{ijk}$$

where the classical multilevel regression assumption holds that $U_k$, $V_{jk}$, and $E_{ijk}$ are distributed normal and are uncorrelated.

This represents a model of $i$ nested within $j$ nested within $k$. An example would be students nested within schools nested within counties. We have random intercepts at the 2nd and 3rd levels: $V_{jk}$ and $U_k$. Because these are random effects, we need estimate only the variance of $U_k$, $V_{jk}$, and $E_{ijk}$.

For our simulated data, let’s assume there are 3 groups at the 3rd level, 2 groups at the 2nd level within each 3rd level group, and 2 individuals within each 2nd level group. Or, $K = 3$, $J = 2$, and $I = 2$. Having only 3 groups at the 3rd level is silly. It gives us only 3 observations to estimate the variance of $U_k$. But with only $3 \times 2 \times 2 = 12$ observations, we will be able to easily see our entire dataset, and the concepts scale to any number of 3rd-level groups.

First, create our 3rd-level random effects, $U_k$.

. set obs 3
. gen k = _n
. gen Uk = rnormal()

There are only 3 $U_k$ in our dataset.

I am showing the effects symbolically in the table rather than showing numeric values. It is the pattern of unique effects that will become interesting, not their actual values.

Now, create our 2nd-level random effects, $V_{jk}$, by doubling this data and creating 2nd-level effects.

. expand 2
. by k, sort: gen j = _n
. gen Vjk = rnormal()

We have 6 unique values of our 2nd-level effects and the same 3 unique values of our 3rd-level effects. Our original 3rd-level effects just appear twice each.

Now, create our 1st-level random effects, $E_{ijk}$, which we typically just call errors.

. expand 2
. by k j, sort: gen i = _n
. gen Eijk = rnormal()

There are still only 3 unique $U_k$ in our dataset, and only 6 unique $V_{jk}$.
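The pattern of unique effects can also be checked in code rather than by eye. The following sketch assumes the three data-creation steps above have just been run; the helper variables one_per_k and one_per_jk are introduced only for the check and are dropped afterward so that they do not interfere with the reshape commands later on.

```stata
* Flag one observation per 3rd-level group and per 2nd-level group
egen byte one_per_k  = tag(k)
egen byte one_per_jk = tag(k j)

count if one_per_k          // number of distinct 3rd-level effects
assert r(N) == 3
count if one_per_jk         // number of distinct 2nd-level effects
assert r(N) == 6

* Uk is constant within k; Vjk is constant within each (k, j) pair
by k, sort: assert Uk == Uk[1]
by k j, sort: assert Vjk == Vjk[1]

list k j i Uk Vjk Eijk, sepby(k)
drop one_per_k one_per_jk
```

The listing shows each Uk repeated four times and each Vjk repeated twice, which is the nesting that the multilevel model exploits.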

Finally, we create our regression data, using $\beta = 2$,

. gen xijk = runiform()
. gen yijk = 2 * xijk + Uk + Vjk + Eijk

We could estimate our multilevel RE model on this data by typing,

. xtmixed yijk xijk || k: || j:

xtmixed uses the index variables k and j to deeply understand the multilevel structure of our data. sem has no such understanding of multilevel data. What it does have is an understanding of multivariate data and a comfortable willingness to apply constraints.

Let’s restructure our data so that sem can be made to understand its multilevel structure.

First some renaming so that the results of our restructuring will be easier to interpret.

. rename Uk U
. rename Vjk V
. rename Eijk E
. rename xijk x
. rename yijk y

We reshape to turn our multilevel data into multivariate data that sem has a chance of understanding. First, we reshape wide on our 2nd-level identifier j. Before that, we egen to create a unique identifier for each observation of the two groups identified by j.

. egen ik = group(i k)
. reshape wide y x E V, i(ik) j(j)

We now have a y variable for each group in j (y1 and y2). Likewise, we have two x variables, two residuals, and most importantly two 2nd-level random effects V1 and V2. This is the same data; we have merely created a set of variables for every level of j. We have gone from multilevel to multivariate.

We still have a multilevel component. There are still two levels of i in our dataset. We must reshape wide again to remove any remnant of multilevel structure.

. drop ik
. reshape wide y* x* E*, i(k) j(i)

I admit that is a microscopic font, but it is the structure that is important, not the values. We now have 4 y’s, one for each combination of 1st- and 2nd-level identifiers, i and j. Likewise for the x’s and E’s.
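The wide layout can be confirmed directly. This sketch assumes the renames and both reshape commands above have been run on the 12-observation simulated dataset.

```stata
* One row per 3rd-level group
assert _N == 3

* One outcome, covariate, and error column per (j, i) combination
confirm variable y11 y12 y21 y22 x11 x12 x21 x22 E11 E12 E21 E22

* The random effects are now ordinary columns: one U and two V's per row
list k U V1 V2, noobs
```

In each variable name the first digit is j and the second is i, so y21 is the outcome for the first individual in the second 2nd-level group.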

We can think of each xji yji pair of columns as representing a regression for a specific combination of j and i: y11 on x11, y12 on x12, y21 on x21, and y22 on x22. Or, more explicitly,

$$\begin{aligned}
y_{11} &= \beta x_{11} + U + V_1 + E_{11} \\
y_{12} &= \beta x_{12} + U + V_1 + E_{12} \\
y_{21} &= \beta x_{21} + U + V_2 + E_{21} \\
y_{22} &= \beta x_{22} + U + V_2 + E_{22}
\end{aligned}$$

So, rather than a univariate multilevel regression with 4 nested observation sets, ($J = 2$ groups) * ($I = 2$ individuals), we now have 4 regressions which are all related through $U$ and each of two pairs are related through $V_1$ or $V_2$. Oh, and all share the same coefficient $\beta$. Oh, and the $E_{ji}$ all have identical variances. Oh, and the $V_j$ also have identical variances. Luckily both the sem command and the SEM Builder (the GUI for sem) make setting constraints easy.

There is one other thing we haven’t addressed. xtmixed understands random effects. Does sem? Random effects are just unobserved (latent) variables and sem clearly understands those. So, yes, sem does understand random effects.

Many SEMers would represent this model in a path diagram by drawing

There is a lot of information in that diagram. Each regression is represented by one of the x boxes being connected by a path to a y box. That each of the four paths is labeled with $\beta$ means that we have constrained the regressions to have the same coefficient. The y21 and y22 boxes also receive input from the random latent variable V2 (representing our 2nd-level random effects). The other two y boxes receive input from V1 (also our 2nd-level random effects). For this to match how xtmixed handles random effects, V1 and V2 must be constrained to have the same variance. This was done in the path diagram by “locking” them to have the same variance, S_v. To match xtmixed, each of the four residuals must also have the same variance, shown in the diagram as S_e. The residuals and random effect variables also have their paths constrained to 1. That is to say, they do not have coefficients.

We do not need any of the U, V, or E variables. We kept these only to make clear how the multilevel data was restructured to multivariate data. We might “follow the money” in a criminal investigation, but with simulated multilevel data it is best to “follow the effects”. Seeing how these effects were distributed in our reshaped data made it clear how they entered our multivariate model.

Just to prove that this all works, here are the results from a simulated dataset (with 100 groups at the 3rd level rather than the 3 that we have been using). The xtmixed results are,

. xtmixed yijk xijk || k: || j: , mle var

(log omitted)

Mixed-effects ML regression                     Number of obs      =       400

-----------------------------------------------------------
|   No. of       Observations per Group
Group Variable |   Groups    Minimum    Average    Maximum
----------------+------------------------------------------
k |      100          4        4.0          4
j |      200          2        2.0          2
-----------------------------------------------------------

Wald chi2(1)       =     61.84
Log likelihood = -768.96733                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
yijk |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
xijk |   1.792529   .2279392     7.86   0.000     1.345776    2.239282
_cons |    .460124   .2242677     2.05   0.040     .0205673    .8996807
------------------------------------------------------------------------------

------------------------------------------------------------------------------
Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
k: Identity                  |
var(_cons) |   2.469012   .5386108      1.610034    3.786268
-----------------------------+------------------------------------------------
j: Identity                  |
var(_cons) |   1.858889    .332251      1.309522    2.638725
-----------------------------+------------------------------------------------
var(Residual) |   .9140237   .0915914      .7510369    1.112381
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =   259.16   Prob > chi2 = 0.0000

Note: LR test is conservative and provided only for reference.

The sem results are,

sem (y11 <- x11@bx _cons@c V1@1 U@1) ///
    (y12 <- x12@bx _cons@c V1@1 U@1) ///
    (y21 <- x21@bx _cons@c V2@1 U@1) ///
    (y22 <- x22@bx _cons@c V2@1 U@1), ///
    covstruct(_lexog, diagonal) cov(_lexog*_oexog@0) ///
    cov(V1@S_v V2@S_v e.y11@S_e e.y12@S_e e.y21@S_e e.y22@S_e)
(notes omitted)

Endogenous variables

Observed:  y11 y12 y21 y22

Exogenous variables

Observed:  x11 x12 x21 x22
Latent:    V1 U V2

(iteration log omitted)

Structural equation model                       Number of obs      =       100
Estimation method  = ml
Log likelihood     = -826.63615

(constraint listing omitted)
------------------------------------------------------------------------------
|                 OIM
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
y11 <-     |
x11 |   1.792529   .2356323     7.61   0.000     1.330698     2.25436
V1 |          1   7.68e-17  1.3e+16   0.000            1           1
U |          1   2.22e-18  4.5e+17   0.000            1           1
_cons |    .460124    .226404     2.03   0.042     .0163802    .9038677
-----------+----------------------------------------------------------------
y12 <-     |
x12 |   1.792529   .2356323     7.61   0.000     1.330698     2.25436
V1 |          1   2.00e-22  5.0e+21   0.000            1           1
U |          1   5.03e-17  2.0e+16   0.000            1           1
_cons |    .460124    .226404     2.03   0.042     .0163802    .9038677
-----------+----------------------------------------------------------------
y21 <-     |
x21 |   1.792529   .2356323     7.61   0.000     1.330698     2.25436
U |          1   5.70e-46  1.8e+45   0.000            1           1
V2 |          1   5.06e-45  2.0e+44   0.000            1           1
_cons |    .460124    .226404     2.03   0.042     .0163802    .9038677
-----------+----------------------------------------------------------------
y22 <-     |
x22 |   1.792529   .2356323     7.61   0.000     1.330698     2.25436
U |          1  (constrained)
V2 |          1  (constrained)
_cons |    .460124    .226404     2.03   0.042     .0163802    .9038677
-------------+----------------------------------------------------------------
Variance     |
e.y11 |   .9140239    .091602                        .75102    1.112407
e.y12 |   .9140239    .091602                        .75102    1.112407
e.y21 |   .9140239    .091602                        .75102    1.112407
e.y22 |   .9140239    .091602                        .75102    1.112407
V1 |   1.858889   .3323379                      1.309402    2.638967
U |   2.469011   .5386202                      1.610021    3.786296
V2 |   1.858889   .3323379                      1.309402    2.638967
-------------+----------------------------------------------------------------
Covariance   |
x11        |
V1 |          0  (constrained)
U |          0  (constrained)
V2 |          0  (constrained)
-----------+----------------------------------------------------------------
x12        |
V1 |          0  (constrained)
U |          0  (constrained)
V2 |          0  (constrained)
-----------+----------------------------------------------------------------
x21        |
V1 |          0  (constrained)
U |          0  (constrained)
V2 |          0  (constrained)
-----------+----------------------------------------------------------------
x22        |
V1 |          0  (constrained)
U |          0  (constrained)
V2 |          0  (constrained)
-----------+----------------------------------------------------------------
V1         |
U |          0  (constrained)
V2 |          0  (constrained)
-----------+----------------------------------------------------------------
U          |
V2 |          0  (constrained)
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(25)  =     22.43, Prob > chi2 = 0.6110

And here is the path diagram after estimation.

The standard errors of the two estimation methods are asymptotically equivalent, but will differ in finite samples.

Sidenote: Those familiar with multilevel modeling will be wondering if sem can handle unbalanced data. That is to say a different number of observations or subgroups within groups. It can. Simply let reshape create missing values where it will and then add the method(mlmv) option to your sem command. mlmv stands for maximum likelihood with missing values. And, as strange as it may seem, with this option the multivariate sem representation and the multilevel xtmixed representations are the same.

Do we care?

You will have noticed that the sem command was, well, it was really long. (I wrote a little loop to get all the constraints right.) You will also have noticed that there is a lot of redundant output because our SEM model has so many constraints. Why would anyone go to all this trouble to do something that is so simple with xtmixed? The answer lies in all of those constraints. With sem we can relax any of those constraints we wish!
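The loop mentioned above is not shown in this entry, so the following is only one possible sketch of how such a loop might assemble the command. It assumes the wide dataset with variables y11 through y22 and x11 through x22; the local macro names eqs and errs are made up for the illustration.

```stata
* Build the four equations and the error-variance constraints
local eqs
local errs
forvalues j = 1/2 {
    forvalues i = 1/2 {
        * y_ji depends on x_ji, the shared U, and its own group's V_j
        local eqs  `eqs' (y`j'`i' <- x`j'`i'@bx _cons@c V`j'@1 U@1)
        local errs `errs' e.y`j'`i'@S_e
    }
}

sem `eqs', covstruct(_lexog, diagonal) cov(_lexog*_oexog@0) ///
    cov(V1@S_v V2@S_v `errs')
```

With more groups or individuals, only the loop bounds and the list of V variances in the last option would need to change.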

Relax the constraint that the V# have the same variance and you can introduce heteroskedasticity in the 2nd-level effects. That seems a little silly when there are only two levels, but imagine there were 10 levels.
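As a sketch of what relaxing that constraint could look like, the model can be fit once with the shared variance S_v and once without it, and the two fits compared with a likelihood-ratio test. This assumes the same wide dataset used for the sem results above; the stored-estimate names equal_v and free_v and the local macros are illustrative.

```stata
local eqs (y11 <- x11@bx _cons@c V1@1 U@1) (y12 <- x12@bx _cons@c V1@1 U@1) ///
          (y21 <- x21@bx _cons@c V2@1 U@1) (y22 <- x22@bx _cons@c V2@1 U@1)
local errs e.y11@S_e e.y12@S_e e.y21@S_e e.y22@S_e

* Constrained model: V1 and V2 share one variance, as in xtmixed
quietly sem `eqs', covstruct(_lexog, diagonal) cov(_lexog*_oexog@0) ///
    cov(V1@S_v V2@S_v `errs')
estimates store equal_v

* Relaxed model: V1 and V2 each get their own variance
quietly sem `eqs', covstruct(_lexog, diagonal) cov(_lexog*_oexog@0) ///
    cov(`errs')
estimates store free_v

lrtest equal_v free_v
```

In the simulated data the two variances are equal by construction, so the test is shown only to illustrate the mechanics.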

Add a covariance between the V# and you introduce correlation between the groups in the 3rd level.

What’s more, the pattern of heteroskedasticity and correlation can be arbitrary. Here is our path diagram redrawn to represent children within schools within counties and increasing the number of groups in the 2nd level.

We have 5 counties at the 3rd level and two schools within each county at the 2nd level — for a total of 10 dimensions in our multivariate regression. The diagram does not change based on the number of children drawn from each school.

Our regression coefficients have been organized horizontally down the center of the diagram to allow room along the left and right for the random effects. Taken as a multilevel model, we have only a single covariate — x. Just to be clear, we could generalize this to multiple covariates by adding more boxes with covariates for each dependent variable in the diagram.

The labels are chosen carefully. The 3rd-level effects N1, N2, and N3 are for northern counties, and the remaining 3rd-level effects S1 and S2 are for southern counties. There is a separate dependent variable and associated error for each school. We have 4 public schools (pub1, pub2, pub3, and pub4); 3 private schools (prv1, prv2, and prv3); and 3 church-sponsored schools (chr1, chr2, and chr3).

The multivariate structure seen in the diagram makes it clear that we can relax some constraints that the multilevel model imposes. Because the sem representation of the model breaks the 3rd-level effect into an effect for each county, we can apply a structure to the 3rd-level effect. Consider the path diagram below.

We have correlated the effects for the 3 northern counties. We did this by drawing curved lines between the effects. We have also correlated the effects of the two southern counties. xtmixed does not allow these types of correlations. Had we wished, we could have constrained the correlations of the 3 northern counties to be the same.

We could also have allowed the northern and southern counties to have different variances. We did just that in the diagram below by constraining the northern counties' variances to be N and the southern counties' variances to be S.

In this diagram we have also correlated the errors for the 4 public schools. As drawn, each correlation is free to take on its own values, but we could just as easily constrain each public school to be equally correlated with all other public schools. Likewise, to keep the diagram readable, we did not correlate the private schools with each other or the church schools with each other. We could have done that.

There is one thing that xtmixed can do that sem cannot. It can put a structure on the residual correlations within the 2nd level groups. xtmixed has a special option, residuals(), for just this purpose.

With xtmixed and sem you get,

• robust and cluster-robust SEs
• survey data

With sem you also get

• endogenous covariates
• estimation by GMM
• missing data — MAR (also called missing on observables)
• heteroskedastic effects at any level
• correlated effects at any level
• easy score tests using estat scoretests:
    • are the coefficients truly the same across all equations/levels?
    • are effects or sets of effects uncorrelated?
    • are effects within a grouping homoskedastic?

Whether you view this rethinking of multilevel random-effects models as multivariate structural equation models (SEMs) as interesting, or merely an academic exercise, depends on whether your model calls for any of the items in the second list.

Categories: Statistics Tags:

## Advanced Mata: Pointers

I’m still recycling my talk called “Mata, The Missing Manual” at user meetings, a talk designed to make Mata more approachable. One of the things I say late in the talk is, “Unless you already know what pointers are and know you need them, ignore them. You don’t need them.” And here I am writing about, of all things, pointers. Well, I exaggerated a little in my talk, but just a little.

Before you take my previous advice and stop reading, let me explain: Mata serves a number of purposes and one of them is as the primary language we at StataCorp use to implement new features in Stata. I’m not referring to mock ups, toys, and experiments; I’m talking about ready-to-ship code. Stata 12’s Structural Equation Modeling features are written in Mata, so is Multiple Imputation, so is Stata’s optimizer that is used by nearly all estimation commands, and so are most features. Mata has a side to it that is exceedingly serious and intended for use by serious developers, and every one of those features is available to users just as it is to StataCorp developers. This is one of the reasons so many user-written commands are available for Stata. Even if you don’t use the serious features, you benefit.

So every so often I need to take time out and address the concerns of these user/developers. I knew I needed to do that now when Kit Baum emailed a question to me that ended with “I’m stumped.” Kit is the author of An Introduction to Stata Programming which has done more to make Mata approachable to professional researchers than anything StataCorp has done, and Kit is not often stumped.

I have a certain reputation about how I answer most questions. “Why do you want to do that?” I invariably reply, or worse, “You don’t want to do that!” and then go on to give the answer to the question I wished they had asked. When Kit asks a question, however, I just answer it. Kit asked a question about pointers by setting up an artificial example and I have no idea what his real motivation was, so I’m not even going to try to motivate the question for you. The question is interesting in and of itself anyway.

Here is Kit’s artificial example:

real function x2(real scalar x) return(x^2)

real function x3(real scalar x) return(x^3)

void function tryit()
{
pointer(real scalar function) scalar fn
string rowvector                     func
real scalar                          i

func = ("x2", "x3")
for(i=1;i<=length(func);i++) {
fn = &(func[i])
(*fn)(4)
}
}

Kit is working with pointers, and not just pointers to variables, but pointers to functions. A pointer is a memory address, the address where the variable or function is stored. Real compilers translate names into memory addresses, which is one of the reasons real compilers produce code that runs fast. Mata is a real compiler. Anyway, pointers are memory addresses, such as 58; 212,770; or 427,339,488, except the values are usually written in hexadecimal rather than decimal. In the example, Kit has two functions, x2(x) and x3(x). Kit wants to create a vector of the function addresses and then call each of the functions in the vector. In the artificial example, he's calling each with an argument of 4.

The above code does not work:

: tryit()
tryit():  3101  matrix found where function required
<istmt>:     -  function returned error

The error message is from the Mata compiler and it's complaining about the line

(*fn)(4)

but the real problem is earlier in the tryit() code.

One corrected version of tryit() would read,

void function tryit()
{
pointer(real scalar function) scalar fn
pointer(real scalar function) vector func     // <---
real scalar                          i

func = (&x2(), &x3())                         // <---
for(i=1;i<=length(func);i++) {
fn = func[i]                          // <---
(*fn)(4)
}
}

If you make the three changes I marked, tryit() works:

: tryit()
16
64

I want to explain this code and alternative ways the code could have been fixed. It will be easier if we just work interactively, so let's start all over again:

: real scalar x2(x) return(x^2)

: real scalar x3(x) return(x^3)

: func = (&x2(), &x3())

Let's take a look at what is in func:

: func
                1            2
    +---------------------------+
  1 |  0x19551ef8   0x19552048  |
    +---------------------------+

Those are memory addresses. When we typed &x2() and &x3() in the line

: func = (&x2(), &x3())

functions x2() and x3() were not called. &x2() and &x3() instead evaluate to the addresses of the functions named x2() and x3(). I can demonstrate this:

: &x2()
0x19551ef8

0x19551ef8 is the memory address of where the function x2() is stored. 0x19551ef8 may not look like a number, but that is only because it is presented in base 16. 0x19551ef8 is in fact the number 425,008,888, and the compiled code for the function x2() starts at the 425,008,888th byte of memory and continues thereafter.
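The conversion is easy to check by hand. Expanding the hexadecimal digits of 0x19551ef8 in powers of 16 (with e = 14 and f = 15) gives:

```latex
\begin{aligned}
\mathtt{0x19551ef8}
  &= 1\cdot16^{7} + 9\cdot16^{6} + 5\cdot16^{5} + 5\cdot16^{4}
   + 1\cdot16^{3} + 14\cdot16^{2} + 15\cdot16^{1} + 8\cdot16^{0} \\
  &= 268{,}435{,}456 + 150{,}994{,}944 + 5{,}242{,}880 + 327{,}680
   + 4{,}096 + 3{,}584 + 240 + 8 \\
  &= 425{,}008{,}888
\end{aligned}
```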

Let's assign to fn the value of the address of one of the functions, say x2(). I could do that by typing

: fn = func[1]

or by typing

: fn = &x2()

and either way, when I look at fn, it contains a memory address:

: fn
0x19551ef8

Let's now call the function whose address we have stored in fn:

: (*fn)(2)
4

When we call a function and want to pass 2 as an argument, we normally code f(2). In this case, we substitute (*fn) for f because we do not want to call the function named f(), we want to call the function whose address is stored in variable fn. The operator * usually means multiplication, but when * is used as a prefix, it means something different, in much the same way the minus operator - can be subtract or negate. The meaning of unary * is "the contents of". When we code *fn, we mean not the value 425,008,888 stored in fn, we mean the contents of the memory address 425,008,888, which happens to be the function x2().

We type (*fn)(2) and not *fn(2) because *fn(2) would be interpreted to mean *(fn(2)). If there were a function named fn(), that function would be called with argument 2, the result obtained, and then the star would take the contents of that memory address, assuming fn(2) returned a memory address. If it didn't, we'd get a type mismatch error.
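The same "contents of" reading applies to pointers to variables. The following is only a minimal sketch for experimenting in a do-file; the names mysq() and demo_deref() are made up for illustration, and the leading mata clear is there only so the block can be rerun (it wipes whatever is currently in Mata's memory).

```stata
mata:
mata clear      // assumption: a clean Mata workspace is acceptable

// mysq() and demo_deref() are hypothetical names used only for illustration
real scalar mysq(real scalar x) return(x^2)

void function demo_deref()
{
    real scalar                          a
    pointer(real scalar) scalar          pa
    pointer(real scalar function) scalar pf

    a  = 3
    pa = &a                 // address of the variable a
    pf = &mysq()            // address of the function mysq()

    printf("*pa      = %g\n", (*pa))      // contents of a
    printf("(*pf)(a) = %g\n", (*pf)(a))   // call mysq() through the pointer
}

demo_deref()
end
```

If this runs as intended, it should display 3 for *pa and 9 for (*pf)(a). In both cases the unary star means "go to the address and use what is stored there".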

The syntax can be confusing until you understand the reasoning behind it. Let's start with all new names. Consider something named X. Actually, there could be two different things named X and Mata would not be confused. There could be a variable named X and there could be a function named X(). To Mata, X and X() are different things, or said in the jargon, have different name spaces. In Mata, variables and functions can have the same names. Variables and functions having the same names in C is not allowed -- C has only one name space. So in C, you can type

fn = &x2

to obtain the address of variable x2 or function x2(), but in Mata, the above means the address of the variable x2, and if there is no such variable, that's an error. In Mata, to obtain the address of function x2(), you type

fn = &x2()

The syntax &x2() is a definitional nugget; there is no taking it apart to understand its logic. But we can take apart the logic of the programmer who defined the syntax. & means "address of" and &thing means to take the address of thing. If thing is a name -- &name -- that means to look up name in the variable space and return its address. If thing is name(), that means look up name in the function space and return its address. The way we formally write this grammar is

&thing, where

thing  :=   name
            name()
            exp

There are three possibilities for thing; it's a name or it's a name followed by () or it's an expression. The last is not much used. &2 creates a literal 2 and then tells you the address where the 2 is stored, which might be 0x195525d8. &(2+3) creates 5 and then tells you where the 5 is stored.

But let's get back to Kit's problem. Kit coded,

func = ("x2", "x3")

and I said no, code instead

func = (&x2(), &x3())

You do not use strings to obtain pointers, you use the actual name prefixed by ampersand.

There's a subtle difference in what Kit was trying to code and what I did code, however. In what Kit tried to code, Kit was seeking "run-time binding". I, however, coded "compile-time binding". I'm about to explain the difference and show you how to achieve run-time binding, but before I do, let me tell you that

1. You probably want compile-time binding.
2. Compile-time binding is faster.
3. Run-time binding is sometimes required, but when persons new to pointers think they need run-time binding, they usually do not.

Let me define compile-time and run-time binding:

1. Binding refers to establishing addresses corresponding to names and names(). The names are said to be bound to the address.

2. In compile-time binding, the addresses are established at the time the code is compiled.

More correctly, compile-time binding does not really occur at the time the code is compiled, it occurs when the code is brought together for execution, an act called linking and which happens automatically in Mata. This is a fine and unimportant distinction, but I do not want you to think that all the functions have to be compiled at the same time or that the order in which they are compiled matters.

In compile-time binding, if any functions are missing when the code is brought together for execution, an error message is issued.

3. In run-time binding, the addresses are established at the time the code is executed (run), which happens after compilation, and after linking, and is an explicit act performed by you, the programmer.

To obtain the address of a variable or function at run-time, you use the built-in function findexternal(). findexternal() takes one argument, a string scalar, containing the name of the object to be found. The function looks up that name and returns the address corresponding to it, or it returns NULL if the object cannot be found. NULL is the word used to mean invalid memory address and is in fact defined as equaling zero.

findexternal() can be used only with globals. The other variables that appear in your program might appear to have names, but those names are used solely by the compiler and, in the compiled code, these "stack-variables" or "local variables" are referred to by their addresses. The names play no other role and are not even preserved, so findexternal() cannot be used to obtain their addresses. There would be no reason you would want findexternal() to find their addresses because, in all such cases, the ampersand prefix is a perfect substitute.

Functions, however, are global, so we can look up functions. Watch:

: findexternal("x2()")
0x19551ef8

Compare that with

: &x2()
0x19551ef8

It's the same result, but they were produced differently. In the findexternal() case, the 0x19551ef8 result was produced after the code was compiled and assembled. The value was obtained, in fact, by execution of the findexternal() function.

In the &x2() case, the 0x19551ef8 result was obtained during the compile/assembly process. We can better understand the distinction if we look up a function that does not exist. I have no function named x4(). Let's obtain x4()'s address:

: findexternal("x4()")
0x0

: &x4()

I may have no function named x4(), but that didn't bother findexternal(). It merely returned 0x0, another way of saying NULL.

In the &x4() case, the compiler issued an error. The compiler, faced with evaluating &x4(), could not, and so complained.

Anyway, here is how we could write tryit() with run-time binding using the findexternal() function:

void function tryit()
{
pointer(real scalar function) scalar fn
pointer(real scalar function) vector func
real scalar                          i

func = (findexternal("x2()"), findexternal("x3()"))

for(i=1;i<=length(func);i++) {
fn = func[i]
(*fn)(4)
}
}

To obtain run-time rather than compile-time bindings, all I did was change the line

func = (&x2(), &x3())

to be

func = (findexternal("x2()"), findexternal("x3()"))

Or we could write it this way:

void function tryit()
{
pointer(real scalar function) scalar fn
string vector                        func
real scalar                          i

func = ("x2()", "x3()")

for(i=1;i<=length(func);i++) {
fn = findexternal(func[i])
(*fn)(4)
}
}

In this variation, I put the names in a string vector just as Kit did originally. Then I changed the line that Kit wrote,

fn = &(func[i])

to read

fn = findexternal(func[i])

Either way you code it, when performing run-time binding, you the programmer should deal with what is to be done if the function is not found. The loop

for(i=1;i<=length(func);i++) {
fn = findexternal(func[i])
(*fn)(4)
}

would be better written

for(i=1;i<=length(func);i++) {
fn = findexternal(func[i])
if (fn!=NULL) {
(*fn)(4)
}
else {
...
}
}

Unlike C, if you do not include the code for the not-found case, the program will not crash if the function is not found. Mata will give you an "invalid use of NULL pointer" error message and a traceback log.
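To see the not-found case handled end to end, here is a minimal sketch of a complete variant of tryit(). The name tryit_safe(), the deliberately undefined x4(), the messages printed, and the leading mata clear (which wipes Mata's memory so the block can be rerun) are assumptions made for illustration; what the else branch should really do depends on your application.

```stata
mata:
mata clear      // assumption: a clean Mata workspace is acceptable

real scalar x2(real scalar x) return(x^2)
real scalar x3(real scalar x) return(x^3)

// tryit_safe() is a hypothetical name; x4() is deliberately not defined
void function tryit_safe()
{
    pointer(real scalar function) scalar fn
    string rowvector                     func
    real scalar                          i

    func = ("x2()", "x3()", "x4()")
    for (i=1; i<=length(func); i++) {
        fn = findexternal(func[i])       // run-time lookup by name
        if (fn != NULL) {
            printf("%s applied to 4 gives %g\n", func[i], (*fn)(4))
        }
        else {
            printf("%s not found; skipping\n", func[i])
        }
    }
}

tryit_safe()
end
```

With x2() and x3() defined as shown, the first two names should report 16 and 64, and the third should be reported as not found rather than stopping the program.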

If you were writing a program in which the user of your program was to pass to you a function you were to use, such as a likelihood function to be maximized, you could write your program with compile-time binding by coding,

function myopt(..., pointer(real scalar function) scalar f, ...)
{
...
... (*f)(...) ...
...
}

and the user would call your program by coding myopt(..., &myfunc(), ...), or you could use run-time binding by coding

function myopt(..., string scalar fname, ...)
{
pointer(real scalar function) scalar f
...

f = findexternal(fname)
if (f==NULL) {
exit(111)
}
...
... (*f)(...) ...
...
}

and the user would call your program by coding myopt(..., "myfunc()", ...).

In this case I could be convinced to prefer the run-time binding solution for professional code because, the error being tolerated by Mata, I can write code to give a better, more professional looking error message.
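The two calling conventions can be compared side by side in a small, self-contained sketch. The names myfunc(), apply_ptr(), and apply_name() are hypothetical stand-ins for the elided myopt() above, the wording of the error message is an assumption, and mata clear is included only so the block can be rerun.

```stata
mata:
mata clear      // assumption: a clean Mata workspace is acceptable

// hypothetical user-supplied function
real scalar myfunc(real scalar x) return(x^2 + 1)

// compile-time binding: the caller passes &myfunc()
real scalar apply_ptr(pointer(real scalar function) scalar f, real scalar x)
{
    return((*f)(x))
}

// run-time binding: the caller passes the string "myfunc()"
real scalar apply_name(string scalar fname, real scalar x)
{
    pointer(real scalar function) scalar f

    f = findexternal(fname)
    if (f == NULL) {
        errprintf("function %s not found\n", fname)   // our own message
        exit(111)
    }
    return((*f)(x))
}

apply_ptr(&myfunc(), 3)
apply_name("myfunc()", 3)
end
```

Both calls should display 10. Only the second version gets the chance to issue its own message when the name is wrong, for example apply_name("nosuchfunc()", 3).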

Categories: Mata Tags:

## Use poisson rather than regress; tell a friend

Do you ever fit regressions of the form

ln(yj) = b0 + b1x1j + b2x2j + … + bkxkj + εj

by typing

. generate lny = ln(y)

. regress lny x1 x2 … xk

The above is just an ordinary linear regression except that ln(y) appears on the left-hand side in place of y.

The next time you need to fit such a model, rather than fitting a regression on ln(y), consider typing

. poisson y x1 x2 … xk, vce(robust)

which is to say, fit instead a model of the form

yj = exp(b0 + b1x1j + b2x2j + … + bkxkj + εj)

Wait, you are probably thinking. Poisson regression assumes the variance is equal to the mean,

E(yj) = Var(yj) = exp(b0 + b1x1j + b2x2j + … + bkxkj)

whereas linear regression merely assumes E(ln(yj)) = b0 + b1x1j + b2x2j + … + bkxkj and places no constraint on the variance. Actually regression does assume the variance is constant, but since we are working in logs, that amounts to assuming that the standard deviation of yj is proportional to its mean, which is reasonable in many cases and can be relaxed if you specify vce(robust).

In any case, in a Poisson process, the mean is equal to the variance. If your goal is to fit something like a Mincer earnings model,

ln(incomej) = b0 + b1*educationj + b2*experiencej + b3*experiencej² + εj

there is simply no reason to think that the variance of income is equal to its mean. If a person has an expected income of $45,000, there is no reason to think that the variance around that mean is 45,000, which is to say, the standard deviation is $212.13. Indeed, it would be absurd to think one could predict income so accurately based solely on years of schooling and job experience.

Nonetheless, I suggest you fit this model using Poisson regression rather than linear regression. It turns out that the estimated coefficients of the maximum-likelihood Poisson estimator in no way depend on the assumption that E(yj) = Var(yj), so even if the assumption is violated, the estimates of the coefficients b0, b1, …, bk are unaffected. In the maximum-likelihood estimator for Poisson, what does depend on the assumption that E(yj) = Var(yj) are the estimated standard errors of the coefficients b0, b1, …, bk. If the E(yj) = Var(yj) assumption is violated, the reported standard errors are useless. I did not suggest, however, that you type

. poisson y x1 x2 … xk

I suggested that you type

. poisson y x1 x2 … xk, vce(robust)

That is, I suggested that you specify that the variance-covariance matrix of the estimates (of which the standard errors are the square root of the diagonal) be estimated using the Huber/White/Sandwich linearized estimator. That estimator of the variance-covariance matrix does not assume E(yj) = Var(yj), nor does it even require that Var(yj) be constant across j. Thus, Poisson regression with the Huber/White/Sandwich linearized estimator of variance is a permissible alternative to log linear regression — which I am about to show you — and then I’m going to tell you why it’s better.

I have created simulated data in which

yj = exp(8.5172 + 0.06*educj + 0.1*expj - 0.002*expj² + εj)

where εj is distributed normal with mean 0 and variance 1.083 (standard deviation 1.041). Here’s the result of estimation using regress:

. regress lny educ exp exp2

      Source |       SS       df       MS              Number of obs =    5000
-------------+------------------------------           F(  3,  4996) =   44.72
       Model |  141.437342     3  47.1457806           Prob > F      =  0.0000
    Residual |  5267.33405  4996  1.05431026           R-squared     =  0.0261
-------------+------------------------------           Adj R-squared =  0.0256
       Total |  5408.77139  4999  1.08197067           Root MSE      =  1.0268

------------------------------------------------------------------------------
         lny |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0716126   .0099511     7.20   0.000      .052104    .0911212
         exp |   .1091811   .0129334     8.44   0.000     .0838261    .1345362
        exp2 |  -.0022044   .0002893    -7.62   0.000    -.0027716   -.0016373
       _cons |   8.272475   .1855614    44.58   0.000     7.908693    8.636257
------------------------------------------------------------------------------

I intentionally created these data to produce a low R-squared.
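For readers who want to experiment, a dataset of this kind could be simulated along the following lines. This is only a sketch: the seed and the distributions of educ and exp are assumptions, not the ones used to produce the output above, so the estimates will differ from those shown. Only the coefficients, the sample size, and the error standard deviation of 1.041 are taken from the text.

```stata
clear
set seed 12345                                // arbitrary seed (assumption)
set obs 5000

// assumed covariate distributions; the original ones are not given
generate educ = 8 + floor(13*runiform())      // 8 to 20 years of schooling
generate exp  = floor(41*runiform())          // 0 to 40 years of experience
generate exp2 = exp^2

// data-generating process stated in the text, with sd(e) = 1.041
generate double xb = 8.5172 + 0.06*educ + 0.1*exp - 0.002*exp2
generate double y  = exp(xb + rnormal(0, 1.041))
generate double lny = ln(y)

regress lny educ exp exp2
poisson y educ exp exp2, vce(robust)
```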

We obtained the following results:

           truth      est.    S.E.
-----------------------------------
educ      0.0600    0.0716  0.0100
exp       0.1000    0.1092  0.0129
exp2     -0.0020   -0.0022  0.0003
-----------------------------------
_cons     8.5172    8.2725  0.1856   <- unadjusted (1)
          9.0587    8.7959     ?     <-   adjusted (2)
-----------------------------------
(1) To be used for predicting E(ln(yj))
(2) To be used for predicting E(yj)

Note that the estimated coefficients are quite close to the true values. Ordinarily, we would not know the true values, except I created this artificial dataset and those are the values I used.

For the intercept, I list two values, so I need to explain. We estimated a linear regression of the form,

ln(yj) = b0 + Xjb + εj

As with all linear regressions,

E(ln(yj)) = E(b0 + Xjb + εj)
          = b0 + Xjb + E(εj)
          = b0 + Xjb

We, however, have no real interest in E(ln(yj)). We fit this log regression as a way of obtaining estimates of our real model, namely

yj = exp(b0 + Xjb + εj)

So rather than taking the expectation of ln(yj), let’s take the expectation of yj:

E(yj) = E(exp(b0 + Xjb + εj))
      = E(exp(b0 + Xjb) * exp(εj))
      = exp(b0 + Xjb) * E(exp(εj))

E(exp(εj)) is not one. E(exp(εj)) for εj distributed N(0, σ²) is exp(σ²/2). We thus obtain

E(yj) = exp(b0 + Xjb) * exp(σ²/2)
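To get a sense of the size of this correction, take the error variance used in the simulated data above, σ² = 1.083:

```latex
\begin{aligned}
E\{\exp(\epsilon_j)\} &= \exp(\sigma^2/2) = \exp(1.083/2) = \exp(0.5415) \approx 1.72 \\
b_0 + \sigma^2/2 &= 8.5172 + 0.5415 = 9.0587
\end{aligned}
```

In these data, then, simply exponentiating the predicted logs would understate E(yj) by roughly 42%, and 9.0587 is the intercept on the scale of y.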

People who fit log regressions know about this — or should — and know that to obtain predicted yj values, they must

1. Obtain predicted values for ln(yj) = b0 + Xjb.

2. Exponentiate the predicted log values.

3. Multiply those exponentiated values by exp(σ²/2), where σ² is the square of the root-mean-square-error (RMSE) of the regression.

They do this in Stata by typing

. predict yhat

. replace yhat = exp(yhat)

. replace yhat = yhat*exp(e(rmse)^2/2)

In the table that I just showed you,

           truth      est.    S.E.
-----------------------------------
educ      0.0600    0.0716  0.0100
exp       0.1000    0.1092  0.0129
exp2     -0.0020   -0.0022  0.0003
-----------------------------------
_cons     8.5172    8.2725  0.1856   <- unadjusted (1)
          9.0587    8.7959     ?     <-   adjusted (2)
-----------------------------------
(1) To be used for predicting E(ln(yj))
(2) To be used for predicting E(yj)

I’m setting us up to compare these estimates with those produced by poisson. When we estimate using poisson, we will not need to take logs because the Poisson model is stated in terms of yj, not ln(yj). In preparation for that, I have included two lines for the intercept: 8.5172, which is the true value of the intercept that regress estimates and is the one appropriate for making predictions of ln(y), and 9.0587, an intercept appropriate for making predictions of y and equal to 8.5172 plus σ²/2. Poisson regression will estimate the 9.0587 result because Poisson is stated in terms of y rather than ln(y).

I placed a question mark in the column for the standard error of the adjusted intercept because, to calculate that, I would need to know the standard error of the estimated RMSE, and regress does not calculate that.

Let’s now look at the results that poisson with option vce(robust) reports. We must not forget to specify option vce(robust) because otherwise, in this model that violates the Poisson assumption that E(yj) = Var(yj), we would obtain incorrect standard errors.

. poisson y educ exp exp2, vce(robust)
note: you are responsible for interpretation of noncount dep. variable

Iteration 0:   log pseudolikelihood = -1.484e+08
Iteration 1:   log pseudolikelihood = -1.484e+08
Iteration 2:   log pseudolikelihood = -1.484e+08

Poisson regression                                Number of obs   =       5000
                                                  Wald chi2(3)    =      67.52
                                                  Prob > chi2     =     0.0000
Log pseudolikelihood = -1.484e+08                 Pseudo R2       =     0.0183

------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0575636   .0127996     4.50   0.000     .0324769    .0826504
         exp |   .1074603   .0163766     6.56   0.000     .0753628    .1395578
        exp2 |  -.0022204   .0003604    -6.16   0.000    -.0029267   -.0015141
       _cons |   9.016428   .2359002    38.22   0.000     8.554072    9.478784
------------------------------------------------------------------------------

So now we can fill in the rest of our table:

                    regress            poisson
           truth      est.    S.E.      est.     S.E.
------------------------------------------------------
educ      0.0600    0.0716  0.0100     0.0576  0.0128
exp       0.1000    0.1092  0.0129     0.1075  0.0164
exp2     -0.0020   -0.0022  0.0003    -0.0022  0.0004
------------------------------------------------------
_cons     8.5172    8.2725  0.1856          ?       ?   <- (1)
          9.0587    8.7959       ?     9.0164  0.2359   <- (2)
------------------------------------------------------
(1) To be used for predicting E(ln(yj))
(2) To be used for predicting E(yj)

I told you that Poisson works, and in this case, it works well. I’ll now tell you that in all cases it works well, and it works better than log regression. You want to think about Poisson regression with the vce(robust) option as a better alternative to log regression.

How is Poisson better?

First off, Poisson handles outcomes that are zero. Log regression does not because ln(0) is -∞. You want to be careful about what it means to handle zeros, however. Poisson handles zeros that arise in correspondence to the model. In the Poisson model, everybody participates in the yj = exp(b0 + Xjb + εj) process. Poisson regression does not handle cases where some participate and others do not, and among those who do not, had they participated, would likely produce an outcome greater than zero. I would never suggest using Poisson regression to handle zeros in an earned income model because those that earned zero simply didn’t participate in the labor force. Had they participated, their earnings might have been low, but certainly they would have been greater than zero. Log linear regression does not handle that problem, either.

Natural zeros do arise in other situations, however, and a popular question on Statalist is whether one should recode those natural zeros as 0.01, 0.0001, or 0.0000001 to avoid the missing values when using log linear regression. The answer is that you should not recode at all; you should use Poisson regression with vce(robust).
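What happens to natural zeros can be seen in a small simulation. The sketch below is illustrative only: the seed, the sample size, the single covariate x, and the coefficients 0.5 and 0.3 are all assumptions chosen so that a noticeable share of the outcomes are zero.

```stata
clear
set seed 2011                                  // arbitrary seed (assumption)
set obs 1000

generate x = rnormal()
generate y = rpoisson(exp(0.5 + 0.3*x))        // counts; zeros arise naturally

count if y == 0                                // how many natural zeros?

generate lny = ln(y)                           // missing wherever y == 0
regress lny x                                  // estimated without the zeros
poisson y x, vce(robust)                       // uses every observation
```

Comparing the number of observations reported by the two commands shows that regress silently drops the zeros, whereas poisson keeps them.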

Secondly, small nonzero values, however they arise, can be influential in log-linear regressions. 0.01, 0.0001, 0.0000001, and 0 may be close to each other, but in the logs they are -4.61, -9.21, -16.12, and -∞ and thus not close at all. Pretending that the values are close would be the same as pretending that exp(4.61)=100, exp(9.21)=9,997, exp(16.12)=10,019,062, and exp(∞)=∞ are close to each other. Poisson regression understands that 0.01, 0.0001, 0.0000001, and 0 are indeed nearly equal.

Thirdly, when estimating with Poisson, you do not have to remember to apply the exp(σ²/2) multiplicative adjustment to transform results from ln(y) to y. I wrote earlier that people who fit log regressions of course remember to apply the adjustment, but the sad fact is that they do not.
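The difference in bookkeeping shows up when the two sets of predictions are produced side by side. The sketch below uses the auto dataset that ships with Stata purely as a convenient stand-in. The choice of price, mpg, and weight is an assumption for illustration and has nothing to do with the earnings example.

```stata
sysuse auto, clear
generate double lnprice = ln(price)

// log regression: predict ln(y), exponentiate, then apply exp(rmse^2/2)
regress lnprice mpg weight
predict double yhat_ols, xb
replace yhat_ols = exp(yhat_ols)*exp(e(rmse)^2/2)

// Poisson with robust SEs: predictions are already on the scale of y
poisson price mpg weight, vce(robust)
predict double yhat_pois, n

summarize price yhat_ols yhat_pois
```

After poisson, predict with option n returns exp(xb) directly, so there is no adjustment step to forget.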

Finally, I would like to tell you that everyone who estimates log models knows about the Poisson-regression alternative and it is only you who have been out to lunch. You, however, are in esteemed company. At the recent Stata Conference in Chicago, I asked a group of knowledgeable researchers a loaded question, to which the right answer was Poisson regression with option vce(robust), but they mostly got it wrong.

I said to them, “I have a process for which it is perfectly reasonable to assume that the mean of yj is given by exp(b0 + Xjb), but I have no reason to believe that E(yj) = Var(yj), which is to say, no reason to suspect that the process is Poisson. How would you suggest I estimate the model?” Certainly not using Poisson, they replied. Social scientists suggested I use log regression. Biostatisticians and health researchers suggested I use negative binomial regression even when I objected that the process was not the gamma mixture of Poissons that negative binomial regression assumes. “What else can you do?” they said and shrugged their collective shoulders. And of course, they just assumed overdispersion.

Based on those answers, I was ready to write this blog entry, but it turned out differently than I expected. I was going to slam negative binomial regression. Negative binomial regression makes assumptions about the variance, assumptions different from those made by Poisson, but assumptions nonetheless, and unlike the assumption made in Poisson, those assumptions do appear in the first-order conditions that determine the fitted coefficients that negative binomial regression reports. Not only would negative binomial’s standard errors be wrong — which vce(robust) could fix — but the coefficients would be biased, too, and vce(robust) would not fix that. I planned to run simulations showing this.

When I ran the simulations, I was surprised by the results. The negative binomial estimator (Stata’s nbreg) was remarkably robust to violations in variance assumptions as long as the data were overdispersed. In fact, negative binomial regression did about as well as Poisson regression. I did not run enough simulations to make generalizations, and theory tells me those generalizations have to favor Poisson, but the simulations suggested that if Poisson does do better, it’s not in the first four decimal places. I was impressed. And disappointed. It would have been a dynamite blog entry.

So you’ll have to content yourself with this one.

Others have preceded me in the knowledge that Poisson regression with vce(robust) is a better alternative to log-linear regression. I direct you to Jeffrey Wooldridge, Econometric Analysis of Cross Section and Panel Data, 2nd ed., chapter 18. Or see A. Colin Cameron and Pravin K. Trivedi, Microeconometrics Using Stata, revised edition, chapter 17.3.2.

I first learned about this from a talk given by Austin Nichols, Regression for nonnegative skewed dependent variables, given in 2010 at the Stata Conference in Boston. That talk goes far beyond what I have presented here, and I heartily recommend it.

Categories: Statistics Tags:

## 2011 Stata Conference recap

The 2011 Stata Conference in Chicago ended last Friday, and a good time was had by all.

The two days had the usual wide array of talks, given by researchers in Econometrics, Sociology, Medicine, and Statistics, together with three of us from StataCorp—Bill Gould, David Drukker, and me.

The conference was held in the Gleacher Center on the banks of the Chicago River in Chicago (of course), which is a fine facility. I know it sounds mundane, but the acoustics in the lecture hall were excellent, making it very easy for speakers and questions to be heard clearly.

It was really fun talking to old friends and making new ones, both during the breaks and at the conference dinner on Thursday night.

The Wishes and Grumbles session was one of the liveliest in recent memory. These are always fun for us, because they give us a window on design questions in Stata. The extra buzz from Stata 12 being recently announced was an added bonus.

Chris and Gretchen Farrar, who were running the logistics for the meeting, said this was one of the happiest groups they can remember.

Here are the sentiments of Gabi Huiber, who tweeted:

Back from @Stata Conference, telling my wife about it. Her: “You’re glowing. That must have been like a spa retreat for you.”

I couldn’t have said it better.

A gallery of photos from the conference is available on Facebook.

The 2012 Stata Conference will be in San Diego on July 26 and 27. See you there!

Categories: Meetings Tags: