## Ermistatas and Stata’s new ERMs commands

Ermistatas is our most popular t-shirt these days. See it and you will understand why.

We call the character Ermistatas and he is thinking—Ermistatas cogitatu. Notice the electricity bolts being emitted and received by his three antennae.

The shirt is popular even among those who do not use Stata, and it’s leading them to ask questions: “Who or what is Ermistatas, and why is he, she, or it deserving of a t-shirt?” Then they add, “And why three and not the usual two antennae?”

Ermistatas is the creation of our arts-and-graphics department to represent Stata 15’s new commands for fitting Extended Regression Models—a term we coined. We call it ERMs for short. The new commands are

| Command | Description |
| --- | --- |
| **eregress** | fits linear regression |
| **eintreg** | fits interval regression |
| **eprobit** | fits binary-outcome probit regression |
| **eoprobit** | fits ordinal-outcome probit regression |

Ermi has three antennae because the new commands handle three problems not usually handled together. I’m going to use the word endogenous to describe them, but if that isn’t a word you use, I’ve included alternative descriptions. The problems that ERMs handle are

- endogenous covariates, *or* covariates correlated with the error
- endogenous selection, *or* nonrandom selection, *or* missing not at random
- exogenous and endogenous treatment assignment, *or* random and nonrandom treatment assignment

If you are reading the alternative descriptions, know that when economists use the word endogenous, they mean “correlated with the error of the model”. The reason for the correlation can vary. A variable might be endogenous because it has a value that is the outcome of actions previously chosen by the subject, or there is an unobserved confounder affecting both the variable and the outcome, or the variable is simply measured with error.
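To make “correlated with the error” concrete, here is a minimal simulation sketch of the unobserved-confounder case. It is written in Python with NumPy rather than Stata, and everything in it is an assumption chosen for illustration: the `confounder` variable, the true slope of 1.0, the intercept of 2.0, and the sample size.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    x_centered = x - x.mean()
    return float(x_centered @ (y - y.mean()) / (x_centered @ x_centered))


rng = np.random.default_rng(0)
n = 100_000

# Hypothetical unobserved confounder that affects both x1 and the outcome.
confounder = rng.normal(size=n)

x1 = confounder + rng.normal(size=n)      # covariate depends on the confounder
error = confounder + rng.normal(size=n)   # so does the model's error term
y = 2.0 + 1.0 * x1 + error                # true b1 is 1.0

corr = float(np.corrcoef(x1, error)[0, 1])
slope = ols_slope(x1, y)

print(f"corr(x1, error) = {corr:.2f}")
print(f"OLS slope       = {slope:.2f}  (true b1 = 1.0)")
```

Under these assumed variances, the correlation between `x1` and the error is about 0.5, and the usual regression reports a slope near 1.5 instead of 1.0. The extra 0.5 is the confounder's effect being credited to `x1`.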

If I haven’t excited you, I’m not surprised. I could not figure out how to excite you in a few words, so I opened with the picture of the t-shirt in hopes it would keep you reading.

**Why the hullabaloo?**

Stata and other statistical packages have features for fitting models with endogenous covariates, sample selection, and nonrandom treatment assignment. Until now, they could not handle combinations of the three. The other reason for the hullabaloo is that the ERMs commands are really easy to use. Despite that, we had to write a 258-page manual about it. Here’s why.

**What can ERMs do?**

Imagine fitting a model such as

\[

y = b_0 + b_1 \times x_1 + b_2 \times x_2 + … + b_K \times x_K + error

\]

You do not need ERMs if the values of \(y\), \(x_1\), \(x_2\), etc., meet the usual assumptions, which amount to the covariates being uncorrelated with the error and \(y\) being observed for a representative sample. You use the usual linear regression command when \(y\) is continuous, the usual probit command when \(y\) is binary, and so on.

Other times, the situation is not as simple as you would wish. In those cases, most researchers introduce into the model the complications that the reality of the situation requires. It is a useful and productive way of proceeding.

It might be inherent in reality that the values of \(x_1\) are a result of choices made by the subjects (say \(x_1\) is their schooling). If so, \(x_1\) is endogenous, and you will not be able to fit the model using the usual commands because there are other, confounding variables.

Or it might be that \(y\) is observed only for subjects who choose to do something, such as find employment. This is the problem of sample selection for which James Heckman earned the Nobel Prize in 2000.
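A small simulation sketch shows what goes wrong in this case. It is Python with NumPy, not Stata, and the setup is assumed purely for illustration: a true slope of 1.0, a selection rule `x + v > 0`, and a correlation `rho = 0.8` between the outcome error `e` and the selection error `v`.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    x_centered = x - x.mean()
    return float(x_centered @ (y - y.mean()) / (x_centered @ x_centered))


rng = np.random.default_rng(1)
n = 200_000
rho = 0.8  # assumed correlation between outcome error and selection error

x = rng.normal(size=n)
v = rng.normal(size=n)                                   # selection error
e = rho * v + np.sqrt(1 - rho**2) * rng.normal(size=n)   # outcome error, corr(e, v) = rho
y = 1.0 + 1.0 * x + e                                    # true slope is 1.0

# y is observed only for subjects whose selection index is positive,
# e.g. those who choose to work.
observed = (x + v) > 0

slope_everyone = ols_slope(x, y)
slope_observed = ols_slope(x[observed], y[observed])

print(f"slope using everyone (not available in practice): {slope_everyone:.2f}")
print(f"slope using only the observed subjects:           {slope_observed:.2f}")
```

With these assumed values, the regression on the observed subjects reports a slope of roughly 0.6 rather than 1.0. Nothing is wrong with the regression command; the sample it sees is simply not the sample the equation describes.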

Or it might be that \(x_2\) records participation in a new treatment for renal cancer and doctors choose the treatment for their patients only when they judge that it will benefit their patients more than conventional treatments.

If you have any or all of these problems, you will be tempted to complicate the model for the problems that the reality of the situation imposed.

I want you to proceed differently, albeit equivalently. I want you to think about fitting the equation on data that you wished that you had, in which \(x_1\), \(y\), and \(x_2\) have none of the problems I just described. Subjects did not choose \(x_1\); their schooling level was chosen randomly for them. \(y\) was observed for all subjects not because they chose to work; they were forced to work. Doctors did not choose the treatment \(x_2\) for the patients they thought would benefit; \(x_2\) was chosen randomly. None of this is possible in today’s modern world, thank goodness, but put that aside. If the data had been created by such a process, you would simply have fit the equation in the usual way. You would fit

\[

y = b_0 + b_1 \times x_1 + b_2 \times x_2 + … + b_K \times x_K + error

\]

and the coefficients you would obtain would be those that would have been observed in the alternative world.

Next, I want you to think about the data that you do have. It was created by a Data Generating Process (DGP), namely reality with all of its complications. I want you to think about all the problems that DGP is causing for you. Thinking that way is thinking the ERMs way. ERMs is simple at heart. It obtains the values of (\(b_0, b_1, …, b_K\)) for

\[

y = b_0 + b_1 \times x_1 + b_2 \times x_2 + … + b_K \times x_K + error

\]

Those values ERMs obtains are the ones that would have been observed if the data had none of the problems introduced by the DGP. You will have to tell ERMs about the DGP so that it can disentangle the coefficients from the excess correlations in the real data, but ERMs will do that and report the results for the alternative world. ERMs will also report information about the fitted DGP, but that information is mostly useless except for one thing. When it comes to making predictions about \(y\), you can obtain the predictions in the alternative world or obtain predictions with any of the complications of the DGP brought back in, whether separately or together.
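The idea of recovering alternative-world coefficients by describing the DGP can be sketched with a much simpler technique than the one ERMs uses. The Python code below is plain two-stage least squares, swapped in only because it fits in a few lines; it is not the ERMs estimator. The instrument `z` (a variable that shifts \(x_1\) but is unrelated to the error), the `confounder`, and all coefficient values are assumptions made for this illustration.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for y regressed on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def two_stage_least_squares(y, x1, z):
    """2SLS with one endogenous covariate x1 and one instrument z."""
    const = np.ones(len(y))
    # Stage 1: the part of the DGP we describe, namely how x1 arises from z.
    Z = np.column_stack([const, z])
    x1_hat = Z @ fit_ols(Z, x1)
    # Stage 2: the outcome equation, using only the variation in x1 due to z.
    return fit_ols(np.column_stack([const, x1_hat]), y)


rng = np.random.default_rng(2)
n = 100_000
confounder = rng.normal(size=n)            # unobserved
z = rng.normal(size=n)                     # hypothetical instrument
x1 = z + confounder + rng.normal(size=n)   # endogenous covariate
y = 2.0 + 1.0 * x1 + confounder + rng.normal(size=n)  # true b0 = 2, b1 = 1

naive_slope = float(fit_ols(np.column_stack([np.ones(n), x1]), y)[1])
iv_intercept, iv_slope = (float(c) for c in two_stage_least_squares(y, x1, z))

print(f"usual regression slope: {naive_slope:.2f}")
print(f"2SLS slope:             {iv_slope:.2f}  (true b1 = 1.0)")
```

In this assumed setup the usual regression slope comes out near 1.33, while the two-stage estimate is close to the true value of 1.0. The point is only that telling the estimator how \(x_1\) came about is what lets it separate \(b_1\) from the excess correlation.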

That is ERMs in a nutshell. ERMs provides

- The fitted values of (\(b_0, b_1, …, b_K\)) in a world in which endogenous variables are not endogenous, sample selection did not happen, and treatments were randomly assigned.
- Many other fitted coefficients having to do with the DGP.
- The ability to make predictions in that alternative world *and* the ability to make predictions by reintroducing any of the effects of the DGP, or even effects more or less extreme than the DGP!

I have often said that statisticians seldom answer the questions researchers ask. If a researcher asks, “What are the chances that a fitted coefficient is 0.1 or larger?”, statisticians reply, “I can’t answer that, but I can answer another question that, if you stand on your head and squint, is sort of related.” ERMs is a case where statisticians have provided you with exactly what you wanted. The only cost is that you have to think a little differently and proceed a little more cautiously.

You think about the coefficients and standard errors reported for the equation in the usual way even though they are for the alternative world that the statistician in you (and only the statistician) wished existed. If you want answers to questions that reintroduce the DGP, you must use Stata’s **predict**, **margins**, or other commands that will make the calculation using the predicted values and their standard errors that ERMs will provide. It’s easier than it sounds. For treatment-effect modelers, ERMs provides commands to calculate ATET, ATEU, and POMEANS (average treatment effect among the treated, average treatment effect among the untreated, and potential-outcome means). Obviously, if you have only a treatment-effects problem, Stata has other commands for you, but those commands cannot handle this problem: Fit an endogenous treatment-arm model *in which observations are lost to follow-up after treatment-arm assignment* and, if your data are rich enough, *account for the previous (endogenous) choice by some patients to smoke*. ERMs can do that.
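To see why ATET and ATEU are worth reporting separately when treatment is not randomly assigned, consider the following simulation sketch. It is Python with NumPy, not Stata, and it is not how ERMs computes these quantities; in a simulation both potential outcomes are known, so the effects can be read off directly. The names `baseline`, `gain`, and `judged_gain`, and the rule that a patient is treated whenever the doctor's noisy judgment of the gain is positive, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

baseline = rng.normal(size=n)            # outcome under the conventional treatment
gain = 1.0 + rng.normal(size=n)          # patient-specific benefit of the new treatment
judged_gain = gain + rng.normal(size=n)  # doctor's noisy judgment of that benefit
treated = judged_gain > 0                # treat only when a benefit is expected

y = baseline + treated * gain            # the outcome we actually observe

# Known only because this is a simulation:
ate = float(gain.mean())             # average treatment effect
atet = float(gain[treated].mean())   # ... among the treated
ateu = float(gain[~treated].mean())  # ... among the untreated

# What a simple comparison of treated and untreated patients reports:
naive = float(y[treated].mean() - y[~treated].mean())

print(f"ATE = {ate:.2f}, ATET = {atet:.2f}, ATEU = {ateu:.2f}")
print(f"naive treated-minus-untreated difference = {naive:.2f}")
```

Under these assumptions the average effect is about 1.0, but the doctors' selection pushes ATET up to roughly 1.3 and ATEU down to roughly 0.1. The naive comparison lands near ATET, which overstates what would happen if everyone were treated.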

The 258-page manual explains how. As I said, it’s easy but different. It’s worth your time.

- If you are an economist, you can fit a Heckman model *with endogenous variables*, and those endogenous variables can even be in the selection equation!
- If you are a biostatistician, understand that what the classic Heckman model handles is lost to follow-up. Your fear is that those lost to follow-up are different. If you have variables that affect being lost but not the experiment’s outcome, you can test for it and adjust for it. The error in the selection equation is allowed to be correlated with the error in the outcome equation.
- If you are somebody else, understand that the Heckman model handles MNAR—missing not at random.
- Regardless of who you are, you can not only fit models with linear outcomes, you can fit models with censored outcomes or binary outcomes or ordered outcomes such as “a little”, “more”, and “a lot”.
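For readers who prefer to see the selection story written down, the classic setup described in the bullets above can be sketched as a pair of equations. The symbols \(s\), \(w\), \(\gamma\), \(v\), and \(\rho\) are notation assumed for this sketch only; they do not appear elsewhere in this post.

\[

\begin{aligned}
y &= b_0 + b_1 \times x_1 + … + b_K \times x_K + error && \text{(observed only when } s = 1\text{)} \\
s &= 1 \ \text{if} \ w\gamma + v > 0, \ \text{and} \ 0 \ \text{otherwise} && \text{(selection, or not lost to follow-up)} \\
\rho &= \operatorname{corr}(error, v)
\end{aligned}

\]

Here \(w\) should include at least one variable that affects being observed but not the outcome. If \(\rho = 0\), those who are lost are no different from those who remain, and the usual commands are fine. Testing whether \(\rho\) differs from 0 is the test referred to above, and allowing \(\rho \neq 0\) is the adjustment.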

ERMs really can be useful.