Archive for the ‘Statistics’ Category

Estimation under omitted confounders, endogeneity, omitted variable bias, and related problems

Initial thoughts

Estimating causal relationships from data is one of the fundamental endeavors of researchers, but causality is elusive. In the presence of omitted confounders, endogeneity, omitted variables, or a misspecified model, estimates of predicted values and effects of interest are inconsistent; causality is obscured.

An alternative is to estimate the causal relations from a controlled experiment. Yet controlled experiments are often infeasible: policy makers cannot randomize taxation, for example. In the absence of experimental data, an option is to use instrumental variables or a control function approach.

Stata has many built-in estimators that implement these potential solutions, as well as tools for constructing estimators in situations that the built-in ones do not cover. Below, I illustrate both possibilities for a linear model; in a later post, I will discuss nonlinear models. Read more…
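To make the instrumental-variables route concrete, here is a minimal sketch with hypothetical variable names (illustrative, not from the post): y is the outcome, x1 the endogenous regressor, x2 an exogenous control, and z1 and z2 the excluded instruments.

ivregress 2sls y x2 (x1 = z1 z2), vce(robust)
estat overid                     // test of the overidentifying restriction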

Understanding truncation and censoring

Truncation and censoring are two distinct phenomena that cause our samples to be incomplete. These phenomena arise in medical sciences, engineering, social sciences, and other research fields. If we ignore truncation or censoring when analyzing our data, our estimates of population parameters will be inconsistent.

Truncation or censoring happens during the sampling process. Let’s begin by defining left-truncation and left-censoring:

Our data are left-truncated when individuals below a threshold are not present in the sample. For example, if we want to study the size of certain fish based on the specimens captured with a net, fish smaller than the mesh of the net won’t be present in our sample.

Our data are left-censored at \(\kappa\) if every individual with a value below \(\kappa\) is present in the sample, but the actual value is unknown. This happens, for example, when we have a measuring instrument that cannot detect values below a certain level.

We will focus our discussion on left-truncation and left-censoring, but the concepts we will discuss generalize to all types of censoring and truncation—right, left, and interval.

When performing estimations with truncated or censored data, we need to use tools that account for that type of incomplete data. For truncated linear regression, we can use the truncreg command, and for censored linear regression, we can use the intreg or tobit command.

In this blog post, we will analyze the characteristics of truncated and censored data and discuss using truncreg and tobit to account for the incomplete data. Read more…
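As a minimal sketch of the syntax, with hypothetical variable names and a threshold of 0:

* truncated regression: observations with y below 0 never enter the sample
truncreg y x1 x2, ll(0)

* censored regression: values below 0 are observed but recorded as 0
tobit y x1 x2, ll(0)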

Introduction to Bayesian statistics, part 2: MCMC and the Metropolis–Hastings algorithm

In this blog post, I’d like to give you a relatively nontechnical introduction to Markov chain Monte Carlo, often shortened to “MCMC”. MCMC is frequently used for fitting Bayesian statistical models. There are different variations of MCMC, and I’m going to focus on the Metropolis–Hastings (M–H) algorithm. In the interest of brevity, I’m going to omit some details, and I strongly encourage you to read the [BAYES] manual before using MCMC in practice.

Let’s continue with the coin toss example from my previous post, Introduction to Bayesian statistics, part 1: The basic concepts. We are interested in the posterior distribution of the parameter \(\theta\), which is the probability that a coin toss results in “heads”. Our prior distribution is a flat, uninformative beta distribution with parameters 1 and 1, and we will use a binomial likelihood function for the data from our experiment, which resulted in 4 heads out of 10 tosses. Read more…
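Here is a minimal sketch of how the same model might be fit with bayesmh, assuming we code the experiment as 10 Bernoulli observations with 4 successes (the observation-level Bernoulli terms multiply together to the binomial likelihood); the post itself works through the M–H algorithm by hand rather than leaning on the command.

clear
set obs 10
generate y = _n <= 4    // 4 heads and 6 tails, as in the experiment
set seed 12345          // M–H is a simulation method, so set a seed
bayesmh y, likelihood(dbernoulli({theta})) prior({theta}, beta(1,1))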

Introduction to Bayesian statistics, part 1: The basic concepts

In this blog post, I’d like to give you a relatively nontechnical introduction to Bayesian statistics. The Bayesian approach to statistics has become increasingly popular, and you can fit Bayesian models using the bayesmh command in Stata. This blog entry will provide a brief introduction to the concepts and jargon of Bayesian statistics and the bayesmh syntax. In my next post, I will introduce the basics of Markov chain Monte Carlo (MCMC) using the Metropolis–Hastings algorithm. Read more…

Long-run restrictions in a structural vector autoregression

\(\def\bfA{{\bf A}}
\def\bfB{{\bf B}}
\def\bfC{{\bf C}}\)Introduction

In this blog post, I describe Stata’s capabilities for estimating and analyzing vector autoregression (VAR) models with long-run restrictions by replicating some of the results of Blanchard and Quah (1989). Read more…
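As a hedged sketch of how such a restriction can be imposed with svar, with hypothetical variable names standing in for output growth and unemployment: setting the upper-right element of the long-run impact matrix \(\bfC\) to zero says that the second shock has no long-run effect on the level of the first variable, which is the Blanchard–Quah identification.

matrix C = (., 0 \ ., .)                  // "." marks an unrestricted element
svar d.lngdp unrate, lags(1/8) lreq(C)    // impose C[1,2] = 0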

Solving missing data problems using inverse-probability-weighted estimators

We discuss estimating population-averaged parameters when some of the data are missing. In particular, we show how to use gmm to estimate population-averaged parameters for a probit model when the process that causes some of the data to be missing is a function of observable covariates and of a random process that is independent of the outcome. This type of missing-data process is variously known as missing at random, selection on observables, or exogenous sample selection.

This is a follow-up to an earlier post where we estimated the parameters of a probit model under endogenous sample selection (http://blog.stata.com/2015/11/05/using-mlexp-to-estimate-endogenous-treatment-effects-in-a-probit-model/). In endogenous sample selection, the random process that affects which observations are missing is correlated with an unobservable random process that affects the outcome. Read more…
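The underlying logic can be sketched in two steps with hypothetical variable names; the post instead stacks both sets of moment conditions in a single gmm call so that the standard errors account for the estimated weights.

* step 1: model the probability that an observation is complete
probit complete z1 z2
predict double phat, pr
generate double ipw = 1/phat
* step 2: weighted probit on the complete cases
probit y x1 x2 [pweight=ipw] if complete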

Estimating covariate effects after gmm

In Stata 14.2, we added the ability to use margins to estimate covariate effects after gmm. In this post, I illustrate how to use margins and marginsplot after gmm to estimate covariate effects for a probit model.

Margins are statistics calculated from the predictions of a previously fit model, evaluated at fixed values of some covariates while averaging or otherwise integrating over the remaining covariates. They can be used to estimate population-averaged parameters such as the marginal mean, the average treatment effect, or the average effect of a covariate on the conditional mean. I will demonstrate why margins is useful after estimating a model by the generalized method of moments. Read more…
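A hedged sketch with hypothetical variable names, and one plausible invocation rather than the post’s exact setup: fit the probit moment conditions with gmm, then ask margins for effects on the conditional probability.

* probit moment conditions: E{z'(y - normal(xb))} = 0
gmm (y - normal({xb: x1 x2 _cons})), instruments(x1 x2) onestep
margins, dydx(x1) expression(normal(xb()))            // average effect of x1
margins, at(x1 = (-2(1)2)) expression(normal(xb()))   // profile across x1 values
marginsplot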

Quantile regression allows covariate effects to differ by quantile

Quantile regression models a quantile of the outcome as a function of covariates. Applied researchers use quantile regressions because they allow the effect of a covariate to differ across conditional quantiles. For example, another year of education may have a large effect on a low conditional quantile of income but a much smaller effect on a high conditional quantile of income. Also, another pack-year of cigarettes may have a larger effect on a low conditional quantile of bronchial effectiveness than on a high conditional quantile of bronchial effectiveness.

I use simulated data to illustrate the conditional quantile functions that quantile regression estimates and the covariate effects that can be estimated from them. Read more…
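For reference, a minimal sketch with hypothetical variable names: qreg fits one conditional quantile at a time, and sqreg fits several at once so that coefficients can be compared across quantiles.

qreg income education experience, quantile(.10)    // effect at the 10th percentile
qreg income education experience, quantile(.90)    // effect at the 90th percentile
sqreg income education experience, quantiles(.10 .50 .90) reps(100)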

Structural vector autoregression models

\(\def\bfy{{\bf y}}
\def\bfA{{\bf A}}
\def\bfB{{\bf B}}
\def\bfu{{\bf u}}
\def\bfI{{\bf I}}
\def\bfe{{\bf e}}
\def\bfC{{\bf C}}
\def\bfsig{{\boldsymbol \Sigma}}\)In my last post, I discussed estimation of the vector autoregression (VAR) model,

\begin{align}
\bfy_t &= \bfA_1 \bfy_{t-1} + \dots + \bfA_k \bfy_{t-k} + \bfe_t \tag{1}
\label{var1} \\
E(\bfe_t \bfe_t') &= \bfsig \label{var2}\tag{2}
\end{align}

where \(\bfy_t\) is a vector of \(n\) endogenous variables, \(\bfA_i\) are coefficient matrices, \(\bfe_t\) are error terms, and \(\bfsig\) is the covariance matrix of the errors.

When I discussed impulse–response analysis last time, I briefly introduced the concept of orthogonalizing the shocks in a VAR—that is, decomposing the reduced-form errors in the VAR into mutually uncorrelated shocks. In this post, I go into more detail on orthogonalization: what it is, why economists do it, and what sorts of questions we hope to answer with it. Read more…
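As a hedged sketch of the mechanics, with hypothetical variable names: fit the reduced-form VAR in (1) and (2), then examine orthogonalized impulse–responses, which by default are based on the Cholesky decomposition of \(\bfsig\) in the order the variables are listed.

var y1 y2, lags(1/2)
irf create chol, set(myirfs, replace)     // store the impulse–response results
irf graph oirf, impulse(y1) response(y1 y2)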

An ordered-probit inverse probability weighted (IPW) estimator

teffects ipw uses a multinomial logit model to estimate the weights needed to estimate the potential-outcome means (POMs) of a multivalued treatment. I show how to estimate the POMs when the weights come instead from an ordered probit model. Moment conditions define the ordered-probit estimator and the subsequent weighted averages used to estimate the POMs. I use gmm to obtain consistent standard errors by stacking the ordered-probit moment conditions and the weighted-mean moment conditions. Read more…
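The logic can be sketched in two steps with hypothetical variable names for a three-level treatment; the post stacks all the moment conditions in one gmm call so that the standard errors account for estimation of the weights.

* ordered probit for the treatment, then one weight per treatment level
oprobit treat x1 x2
predict double p1 p2 p3, pr
generate double ipw = (treat==1)/p1 + (treat==2)/p2 + (treat==3)/p3
* inverse-probability-weighted POM for each treatment level
forvalues t = 1/3 {
    mean y [pweight=ipw] if treat==`t'
}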