I care about reproducible research. Anyone who has ever been a research assistant or tried to follow the path set by other researchers also cares. Sometimes, reproducing othersâ€™ results is a frustrating task; sometimes, it is outright impossible. Yet sometimes, it is satisfyingly simple. In my experience, reproducing results is easy when it involves a Stata do-file. I believe this is true even beyond my personal bias (I work for Stata and used the software regularly before that). A recent article published by the American Economic Association (AEA), Vilhuber, Turrito, and Welch (2020), shows that Stata is the preferred package among economists, and I believe reproducibility is a big reason why. Read more…

**Initial thoughts**

Nonparametric regression is similar to linear regression, Poisson regression, and logit or probit regression; it predicts a mean of an outcome for a set of covariates. If you work with the parametric models mentioned above or other models that predict means, you already understand nonparametric regression and can work with it.

The main difference between parametric and nonparametric models is the assumptions about the functional form of the mean conditional on the covariates. Parametric models assume the mean is a known function of \(\mathbf{x}\beta\). Nonparametric regression makes no assumptions about the functional form.

In practice, this means that nonparametric regression yields consistent estimates of the mean function that are robust to functional form misspecification. But we do not need to stop there. With **npregress**, introduced in Stata 15, we may obtain estimates of how the mean changes when we change discrete or continuous covariates, and we can use **margins** to answer other questions about the mean function.

Below I illustrate how to use **npregress** and how to interpret its results. As you will see, the results are interpreted in the same way you would interpret the results of a parametric model using **margins**. Read more…

**Initial thoughts**

Estimating causal relationships from data is one of the fundamental endeavors of researchers, but causality is elusive. In the presence of omitted confounders, endogeneity, omitted variables, or a misspecified model, estimates of predicted values and effects of interest are inconsistent; causality is obscured.

A controlled experiment to estimate causal relations is an alternative. Yet conducting a controlled experiment may be infeasible. Policy makers cannot randomize taxation, for example. In the absence of experimental data, an option is to use instrumental variables or a control function approach.

Stata has many built-in estimators to implement these potential solutions and tools to construct estimators for situations that are not covered by built-in estimators. Below I illustrate both possibilities for a linear model and, in a later post, will talk about nonlinear models. Read more…

For a nonlinear model with heteroskedasticity, a maximum likelihood estimator gives misleading inference and inconsistent marginal effect estimates unless I model the variance. Using a robust estimate of the variance–covariance matrix will not help me obtain correct inference.

This differs from the intuition we gain from linear regression. The estimates of the marginal effects in linear regression are consistent under heteroskedasticity and using robust standard errors yields correct inference.

If robust standard errors do not solve the problems associated with heteroskedasticity for a nonlinear model estimated using maximum likelihood, what does it mean to use robust standard errors in this context? I answer this question using simulations and illustrate the effect of heteroskedasticity in nonlinear models estimated using maximum likelihood. Read more…

I want to estimate, graph, and interpret the effects of nonlinear models with interactions of continuous and discrete variables. The results I am after are not trivial, but obtaining what I want using **margins**, **marginsplot**, and factor-variable notation is straightforward. Read more…

**Initial thoughts**

Estimating causal relationships from data is one of the fundamental endeavors of researchers. Ideally, we could conduct a controlled experiment to estimate causal relations. However, conducting a controlled experiment may be infeasible. For example, education researchers cannot randomize education attainment and they must learn from observational data.

In the absence of experimental data, we construct models to capture the relevant features of the causal relationship we have an interest in, using observational data. Models are successful if the features we did not include can be ignored without affecting our ability to ascertain the causal relationship we are interested in. Sometimes, however, ignoring some features of reality results in models that yield relationships that cannot be interpreted causally. In a regression framework, depending on our discipline or our research question, we give a different name to this phenomenon: endogeneity, omitted confounders, omitted variable bias, simultaneity bias, selection bias, etc.

Below I show how we can understand many of these problems in a unified regression framework and use simulated data to illustrate how they affect estimation and inference. Read more…

In a previous post I illustrated that the probit model and the logit model produce statistically equivalent estimates of marginal effects. In this post, I compare the marginal effect estimates from a linear probability model (linear regression) with marginal effect estimates from probit and logit models.

My simulations show that when the true model is a probit or a logit, using a linear probability model can produce inconsistent estimates of the marginal effects of interest to researchers. The conclusions hinge on the probit or logit model being the true model.

**Simulation results**

For all simulations below, I use a sample size of 10,000 and 5,000 replications. The true data-generating processes (DGPs) are constructed using Read more…

We often use probit and logit models to analyze binary outcomes. A case can be made that the logit model is easier to interpret than the probit model, but Stata’s **margins** command makes any estimator easy to interpret. Ultimately, estimates from both models produce similar results, and using one or the other is a matter of habit or preference.

I show that the estimates from a probit and logit model are similar for the computation of a set of effects that are of interest to researchers. I focus on the effects of changes in the covariates on the probability of a positive outcome for continuous and discrete covariates. I evaluate these effects on average and at the mean value of the covariates. In other words, I study the average marginal effects (AME), the average treatment effects (ATE), the marginal effects at the mean values of the covariates (MEM), and the treatment effects at the mean values of the covariates (TEM).

First, I present the results. Second, I discuss the code used for the simulations.

**Results**

In Table 1, I present the results of a simulation with 4,000 replications when the true data generating process (DGP) satisfies the assumptions of a probit model. I show the Read more…

Random-effects and fixed-effects panel-data models do not allow me to use observable information of previous periods in my model. They are static. Dynamic panel-data models use current and past information. For instance, I may model current health outcomes as a function of health outcomes in the past— a sensible modeling assumption— and of past observable and unobservable characteristics.

Today I will provide information that will help you interpret the estimation and postestimation results from Stata’s Arellano–Bond estimator **xtabond**, the most common linear dynamic panel-data estimator. Read more…

Today I will discuss Mundlak’s (1978) alternative to the Hausman test. Unlike the latter, the Mundlak approach may be used when the errors are heteroskedastic or have intragroup correlation. Read more…