Statistics Archives - The Stata Blog

Prediction intervals with gradient boosting machine

20 May 2025 Aramayis Dallakyan, Senior Statistician and Software Developer

Introduction
Machine learning methods, such as ensemble decision trees, are widely used to predict outcomes based on data. However, these methods often focus on providing point predictions, which limits their ability to quantify prediction uncertainty. In many applications, such as healthcare and finance, the goal is not only to predict accurately but also to assess the reliability of those predictions. Prediction intervals, which provide lower and upper bounds such that the true response lies within them with high probability, are a reliable tool for quantifying prediction accuracy. An ideal prediction interval should meet several criteria: it should offer valid coverage (defined below) without relying on strong distributional assumptions, be informative by being as narrow as possible for each observation, and be adaptive—provide wider intervals for observations that are “difficult” to predict and narrower intervals for “easy” ones. Read more…

Categories: Statistics Tags: conformal intervals, GBM, machine learning, prediction intervals, quantile regression

Approximate statistical tests for comparing binary classifier error rates using H2OML

22 April 2025 Houssein Assaad, Associate Director, Statistics and Aramayis Dallakyan, Senior Statistician and Software Developer

Motivation

You have just trained a gradient boosting machine (GBM) and a random forest (RF) classifier on your data using Stata’s new h2oml command suite. Your GBM model achieves 87% accuracy on the testing data, and your RF model, 85%. It looks as if GBM is the preferred classifier, right? Not so fast.

Why accuracy alone isn’t enough

Accuracy, area under the curve, and root mean squared error are popular metrics, but they provide only point estimates. These numbers reflect how well a model performed on one specific testing sample, but they don’t account for the variability that can arise from sample to sample. In other words, they don’t answer this key question: Will the difference in performance between these methods hold at the population level, or could it have occurred by chance only in this particular testing dataset? Read more…

Categories: Statistics Tags: ensemble trees classification, H2O, machine learning, statistical tests

Heteroskedasticity robust standard errors: Some practical considerations

6 October 2022 Enrique Pinzon, Associate Director Econometrics

Introduction

Some discussions have arisen lately with regard to which standard errors should be used by practitioners in the presence of heteroskedasticity in linear models. The discussion intrigued me, so I took a second look at the existing literature. I provide an overview of theoretical and simulation research that helps us answer this question. I also present simulation results that mimic or expand some of the existing simulation studies. I’ll share the Stata code I used for the simulations in hopes that it might be useful to those who want to explore how the various standard-error estimators perform in situations that are relevant to their research. Read more…

Categories: Statistics Tags: heteroskedasticity-consistent standard errors, robust standard error, variance-covariance estimation

Bayesian threshold autoregressive models

18 May 2022 Nikolay Balov, Associate Director, Bayesian Statistics

Autoregressive (AR) models are some of the most widely used models in applied economics, among other disciplines, because of their generality and simplicity. However, the dynamic characteristics of real economic and financial data can change from one time period to another, limiting the applicability of linear time-series models. For example, the change in the unemployment rate is a function of the state of the economy, whether it is expanding or contracting. A variety of models have been developed that allow time-series dynamics to depend on the regime of the system they are part of. The class of regime-dependent models includes Markov-switching, smooth transition, and threshold autoregressive (TAR) models. Read more…

Categories: Statistics Tags: Bayesian, Bayesian inference, bayesmh, threshold autoregressive models, time series

Using the margins command with different functional forms: Proportional versus natural logarithm changes

5 April 2022 Chris Cheng, Staff Econometrician

margins is a powerful tool to obtain predictive margins, marginal predictions, and marginal effects. It is so powerful that it can work with any functional form of our estimated parameters by using the expression() option. I am going to show you how to obtain proportional changes of an outcome with respect to changes in the covariates using two different approaches for linear and nonlinear models. Read more…

Categories: Statistics Tags: margins, proportional change, semielasticity

Comparing transmissibility of Omicron lineages

15 March 2022 Nikolay Balov, Associate Director, Bayesian Statistics

Monitoring lineages of the Omicron variant of the SARS-CoV-2 virus continues to be an important health consideration. The World Health Organization identifies BA.1, BA.1.1, and the most recent BA.2 as the most common sublineages. A recent study from Japan, Yamasoba et al. (2022), compares, among other characteristics, the transmissibility of these three Omicron lineages with the latest Delta variant. It identifies BA.2 as having the highest transmissibility of the four. A preprint of the study is available at bioarxiv.org. One interesting aspect of the study is the application of Bayesian multilevel models for representing lineage growth dynamics. In this post, I demonstrate how to use Stata’s bayesmh and bayesstats summary commands to perform a similar analysis. Read more…

Categories: Statistics Tags: Bayesian, Bayesian inference, bayesmh, epidemiology, multilevel model, multinomial model

Calculating power using Monte Carlo simulations, part 5: Structural equation models

19 August 2021 Meghan Cain, Senior Statistician

In our last four posts in this series, we showed you how to calculate power for a t test using Monte Carlo simulations, how to integrate your simulations into Stata’s power command, and how to do this for linear and logistic regression models and multilevel models. In today’s post, I’m going to show you how to estimate power for structural equation models (SEM) using simulations.

Our goal is to write a program that will calculate power for a given SEM at different sample sizes. We’ll follow the same general procedure as the previous two posts, but the way we’ll go about simulating data is a bit different. Rather than individually simulating each variable for our specified model, we’ll be simulating all our variables simultaneously from a given covariance matrix. Means for each of the variables can also be used to simulate the data if your SEM has a mean structure, such as in group analysis or growth curve analysis. Read more…

Categories: Statistics Tags: power, random numbers, sample size, SEM, simulation

Bayesian inference using multiple Markov chains

24 February 2020 Nikolay Balov, Associate Director, Bayesian Statistics

Overview

Markov chain Monte Carlo (MCMC) is the principal tool for performing Bayesian inference. MCMC is a stochastic procedure that utilizes Markov chains simulated from the posterior distribution of model parameters to compute posterior summaries and make predictions. Given its stochastic nature and dependence on initial values, verifying Markov chain convergence can be difficult; visual inspection of the trace and autocorrelation plots is often used. A more formal method for checking convergence relies on simulating and comparing results from multiple Markov chains; see, for example, Gelman and Rubin (1992) and Gelman et al. (2013). Using multiple chains, rather than a single chain, makes diagnosing convergence easier.

As of Stata 16, bayesmh and its bayes prefix commands support a new option, nchains(), for simulating multiple Markov chains. There is also a new convergence diagnostic command, bayesstats grubin. All Bayesian postestimation commands now support multiple chains. In this blog post, I show you how to check MCMC convergence and improve your Bayesian inference using multiple chains through a series of examples. I also show you how to speed up your sampling by running multiple Markov chains in parallel. Read more…

Categories: Statistics Tags: Bayesian inference, convergence, MCMC, multiple chains, social behavior

Using the lasso for inference in high-dimensional models

9 September 2019 David Drukker, Executive Director of Econometrics and Di Liu, Senior Econometrician

Why use lasso to do inference about coefficients in high-dimensional models?

High-dimensional models, which have too many potential covariates for the sample size at hand, are increasingly common in applied research. The lasso, discussed in the previous post, can be used to estimate the coefficients of interest in a high-dimensional model. This post discusses commands in Stata 16 that estimate the coefficients of interest in a high-dimensional model. Read more…

Categories: Statistics Tags: biostatistics, econometrics, inference, lasso, machine learning, statistics

An introduction to the lasso in Stata

9 September 2019 David Drukker, Executive Director of Econometrics and Di Liu, Senior Econometrician

Why is the lasso interesting?

The least absolute shrinkage and selection operator (lasso) estimates model coefficients, and these estimates can be used to select which covariates should be included in a model. The lasso is used for outcome prediction and for inference about causal parameters. In this post, we provide an introduction to the lasso and discuss using the lasso for prediction. In the next post, we discuss using the lasso for inference about causal parameters. Read more…

Categories: Statistics Tags: biostatistics, econometrics, lasso, prediction, statistics
