Archive

Archive for the ‘Statistics’ Category

Heteroskedasticity robust standard errors: Some practical considerations

Introduction

Some discussions have arisen lately with regard to which standard errors should be used by practitioners in the presence of heteroskedasticity in linear models. The discussion intrigued me, so I took a second look at the existing literature. I provide an overview of theoretical and simulation research that helps us answer this question. I also present simulation results that mimic or expand some of the existing simulation studies. I’ll share the Stata code I used for the simulations in hopes that it might be useful to those that want to explore how the various standard-error estimators perform in situations that are relevant to your research. Read more…

Bayesian threshold autoregressive models

Autoregressive (AR) models are some of the most widely used models in applied economics, among other disciplines, because of their generality and simplicity. However, the dynamic characteristics of real economic and financial data can change from one time period to another, limiting the applicability of linear time-series models. For example, the change of unemployment rate is a function of the state of the economy, whether it is expanding or contracting. A variety of models have been developed that allow time-series dynamics to depend on the regime of the system they are part of. The class of regime-dependent models include Markov-switching, smooth transition, and threshold autoregressive (TAR) models. Read more…

Using the margins command with different functional forms: Proportional versus natural logarithm changes

margins is a powerful tool to obtain predictive margins, marginal predictions, and marginal effects. It is so powerful that it can work with any functional form of our estimated parameters by using the expression() option. I am going to show you how to obtain proportional changes of an outcome with respect to changes in the covariates using two different approaches for linear and nonlinear models. Read more…

Comparing transmissibility of Omicron lineages

Monitoring lineages of the Omicron variant of the SARS-CoV-2 virus continues to be an important health consideration. The World Health Organization identifies BA.1, BA.1.1, and the most recent BA.2 as the most common sublineages. A recent study from Japan, Yamasoba et al. (2022), compares, among other characteristics, the transmissibility of these three Omicron lineages with the latest Delta variant. It identifies BA.2 to have the highest transmissibility of the four. Preprint of the study is available at bioarxiv.org. One interesting aspect of the study is the application of Bayesian multilevel models for representing lineage growth dynamics. In this post, I demonstrate how to use Stata’s bayesmh and bayesstats summary commands to perform similar analysis. Read more…

Calculating power using Monte Carlo simulations, part 5: Structural equation models

In our last four posts in this series, we showed you how to calculate power for a t test using Monte Carlo simulations, how to integrate your simulations into Stata’s power command, and how to do this for linear and logistic regression models and multilevel models. In today’s post, I’m going to show you how to estimate power for structural equation models (SEM) using simulations.

Our goal is to write a program that will calculate power for a given SEM at different sample sizes. We’ll follow the same general procedure as the previous two posts, but the way we’ll go about simulating data is a bit different. Rather than individually simulating each variable for our specified model, we’ll be simulating all our variables simultaneously from a given covariance matrix. Means for each of the variables can also be used to simulate the data if your SEM has a mean structure, such as in group analysis or growth curve analysis. Read more…

Bayesian inference using multiple Markov chains

Overview

Markov chain Monte Carlo (MCMC) is the principal tool for performing Bayesian inference. MCMC is a stochastic procedure that utilizes Markov chains simulated from the posterior distribution of model parameters to compute posterior summaries and make predictions. Given its stochastic nature and dependence on initial values, verifying Markov chain convergence can be difficult—visual inspection of the trace and autocorrelation plots are often used. A more formal method for checking convergence relies on simulating and comparing results from multiple Markov chains; see, for example, Gelman and Rubin (1992) and Gelman et al. (2013). Using multiple chains, rather than a single chain, makes diagnosing convergence easier.

As of Stata 16, bayesmh and its bayes prefix commands support a new option, nchains(), for simulating multiple Markov chains. There is also a new convergence diagnostic command, bayesstats grubin. All Bayesian postestimation commands now support multiple chains. In this blog post, I show you how to check MCMC convergence and improve your Bayesian inference using multiple chains through a series of examples. I also show you how to speed up your sampling by running multiple Markov chains in parallel. Read more…

Using the lasso for inference in high-dimensional models

Why use lasso to do inference about coefficients in high-dimensional models?

High-dimensional models, which have too many potential covariates for the sample size at hand, are increasingly common in applied research. The lasso, discussed in the previous post, can be used to estimate the coefficients of interest in a high-dimensional model. This post discusses commands in Stata 16 that estimate the coefficients of interest in a high-dimensional model. Read more…

An introduction to the lasso in Stata

Why is the lasso interesting?

The least absolute shrinkage and selection operator (lasso) estimates model coefficients and these estimates can be used to select which covariates should be included in a model. The lasso is used for outcome prediction and for inference about causal parameters. In this post, we provide an introduction to the lasso and discuss using the lasso for prediction. In the next post, we discuss using the lasso for inference about causal parameters. Read more…

Fun with frames

I have a confession. I wasn’t excited about the addition of frames to Stata 16. Yes, frames has been one of the most requested features for many years, and our website analytics show that frames is wildly popular. Adding frames was a smart decision and our customers are excited. But I have used Stata for over 20 years, and I have been perfectly happy using one dataset at a time. So I ignored frames.

Then I started working on an example for lasso using genetic data. I simulated patient data along with genetic data for each of 22 chromosomes saved in 22 separate datasets. Working with 23 datasets became cumbersome, so I thought I’d check out frames. I began by reading the manual and then tinkered with my genetic data. Along the way, I discovered a feature of frames that completely blew my mind. I’m going to show you that feature below, and I expect that it will blow your mind as well.

This blog post is not meant to be an introduction to frames. There is a detailed introduction to frames in the Stata 16 manual that will make you an expert. I simply want to show you some of the useful things that you can do with frames, including the following: Read more…

Calculating power using Monte Carlo simulations, part 4: Multilevel/longitudinal models

In my last three posts, I showed you how to calculate power for a t test using Monte Carlo simulations, how to integrate your simulations into Stata’s power command, and how to do this for linear and logistic regression models. In today’s post, I’m going to show you how to estimate power for multilevel/longitudinal models using simulations. You may want to review my earlier post titled “How to simulate multilevel/longitudinal data” before you read this post. Read more…