Gelman–Rubin convergence diagnostic using multiple chains
As of Stata 16, see [BAYES] bayesstats grubin and Bayesian analysis: Gelman-Rubin convergence diagnostic.
The original blog posted May 26, 2016, omitted option initrandom from the bayesmh command. The code and the text of the blog entry were updated on August 9, 2018, to reflect this.
Overview
MCMC algorithms used for simulating posterior distributions are indispensable tools in Bayesian analysis. A major consideration in MCMC simulations is that of convergence. Has the simulated Markov chain fully explored the target posterior distribution so far, or do we need longer simulations? A common approach in assessing MCMC convergence is based on running and analyzing the difference between multiple chains.
For a given Bayesian model, bayesmh is capable of producing multiple Markov chains with randomly dispersed initial values by using the initrandom option, available as of the update on 19 May 2016. In this post, I demonstrate the Gelman–Rubin diagnostic as a more formal test for convergence using multiple chains. For graphical diagnostics, see Graphical diagnostics using multiple chains in [BAYES] bayesmh for more details. To compute the Gelman–Rubin diagnostic, I use an unofficial command, grubin, which can be installed by typing the following in Stata:
. net install grubin, from("http://www.stata.com/users/nbalov")
To see the help file, type
. help grubin
The Gelman–Rubin convergence diagnostic
The Gelman–Rubin diagnostic evaluates MCMC convergence by analyzing the difference between multiple Markov chains. The convergence is assessed by comparing the estimated between-chains and within-chain variances for each model parameter. Large differences between these variances indicate nonconvergence. See Gelman and Rubin (1992) and Brooks and Gelman (1997) for the detailed description of the method.
Suppose we have \(M\) chains, each of length \(N\), although the chains may be of different lengths. The same-length assumption simplifies the formulas and is used for convenience. For a model parameter \(\theta\), let \(\{\theta_{mt}\}_{t=1}^{N}\) be the \(m\)th simulated chain, \(m=1,\dots,M\). Let \(\hat\theta_m\) and \(\hat\sigma_m^2\) be the sample posterior mean and variance of the \(m\)th chain, and let the overall sample posterior mean be \(\hat\theta = (1/M)\sum_{m=1}^M\hat\theta_m\). The between-chains and within-chain variances are given by
\begin{align}
B &= \frac{N}{M-1} \sum_{m=1}^M (\hat\theta_m – \hat\theta)^2 \\
W &= \frac{1}{M} \sum_{m=1}^M \hat\sigma_m^2
\end{align}
Under certain stationarity conditions, the pooled variance
$$
\widehat V = \frac{N-1}{N} W + \frac{M+1}{MN} B
$$
is an unbiased estimator of the marginal posterior variance of \(\theta\) (Gelman and Rubin 1992). The potential scale reduction factor (PSRF) is defined to be the ratio of \(\widehat V\) and \(W\). If the \(M\) chains have converged to the target posterior distribution, then PSRF should be close to 1. Brooks and Gelman (1997) corrected the original PSRF by accounting for sampling variability as follows:
$$
R_c = \sqrt{\frac{\hat{d}+3}{\hat{d}+1}\frac{\widehat V}{W}}
$$
where \(\hat d\) is the degrees of freedom estimate of a \(t\) distribution.
PSRF estimates the potential decrease in the between-chains variability \(B\) with respect to the within-chain variability \(W\). If \(R_c\) is large, then longer simulation sequences are expected to either decrease \(B\) or increase \(W\) because the simulations have not yet explored the full posterior distribution. As Brooks and Gelman (1997) have suggested, if \(R_c < 1.2\) for all model parameters, one can be fairly confident that convergence has been reached. Otherwise, longer chains or other means for improving the convergence may be needed. Even more reassuring is to apply the more stringent condition \(R_c < 1.1\), which is the criterion I use in the examples below.
Under the normality assumption on the marginal posterior distribution of \(\theta\) and stationarity assumptions on the chain, the ratio \(B/W\) follows an F distribution with \(M-1\) numerator degrees of freedom and \(\nu\) denominator degrees of freedom. An upper confidence limit \(R_u(\alpha)\) for \(R_c\) can be derived (see section 3.7 in Gelman and Rubin [1992], where \(\nu\) is also defined):
$$
R_u = \sqrt{\frac{\hat{d}+3}{\hat{d}+1}\bigg{(}\frac{N-1}{N} W + \frac{M+1}{M} q_{1-\alpha/2}\bigg{)}}
$$
where \(\alpha\) is a prespecified confidence level and \(q_{1-\alpha/2}\) is the \((1-\alpha/2)\)th quantile of the aforementioned F distribution. We are only interested in the upper confidence limit because we are concerned with large PSRF values. By comparing \(R_c\) to \(R_u\), one can perform a formal test for convergence.
The Stata program grubin calculates and reports the Gelman–Rubin diagnostic for some or all model parameters. The program uses previously stored or saved estimation results of bayesmh. You specify estimation results using either the option estnames() or the option estfiles(). By default, grubin computes the Gelman–Rubin diagnostic for all model parameters. Alternatively, you may specify a subset of model parameters or substitutable expressions containing model parameters following the parameter specification of bayesstats summary. You may also specify a confidence level for calculating the upper confidence limit of PSRF by using the level() option. grubin is an r-class command that reports the \(R_c\) and \(R_u\) values and stores them in the matrices r(Rc) and r(Ru), respectively.
Example
To demonstrate the grubin program, I consider a Bayesian linear model applied to the well-known auto dataset.
. webuse auto (1978 Automobile Data)
I regress the mpg variable on the weight variable by assuming a normal likelihood model with an unknown variance. My Bayesian model thus has three parameters: {mpg:weight}, {mpg:_cons}, and {sigma2}. I specify a weakly informative prior, N(0, 100), for the regression coefficients, and I specify the prior InvGamma(10, 10) for the variance parameter. I block the regression parameters {mpg:} separately to increase sampling efficiency.
In the first set of runs, I simulate 3 chains of length 25. I deliberately chose a small MCMC size hoping to demonstrate lack of convergence. I initialize the 3 chains randomly by specifying the initrandom option of bayesmh. The simulation datasets are saved as sim1.dta, sim2.dta, and sim3.dta.
. set seed 14 . forvalues nchain = 1/3 { 2. quietly bayesmh mpg weight, > likelihood(normal({sigma2})) > prior({mpg:}, normal(0, 100)) > prior({sigma2}, igamma(10, 10)) > block({mpg:}) initrandom > mcmcsize(25) saving(sim`nchain') 3. quietly estimates store chain`nchain' 4. }
The Gelman–Rubin diagnostic assumes normality of the marginal posterior distributions. To improve the normal approximation, it is recommended to transform parameters that are not supported on the whole real line. Because the variance parameter {sigma2} is always positive, I apply the log transformation to normalize its marginal distribution when computing the Gelman–Rubin diagnostic. The transformed parameter is labeled as lnvar.
I now use grubin to calculate and report the Gelman–Rubin diagnostics. I use the default confidence level of 95% for the upper confidence limit.
. grubin {mpg:weight} {mpg:_cons} (lnvar:log({sigma2})), > estnames(chain1 chain2 chain3) Gelman-Rubin convergence diagnostic MCMC sample size = 25 Number of chains = 3 ----------------------------------- | Rc 95% Ru -------------+--------------------- mpg | weight | 1.007256 1.090938 _cons | 1.030188 1.097078 -------------+--------------------- lnvar | 1.221488 1.145878 -----------------------------------
The first column in the output shows the PSRF estimates \(R_c\) and the second column shows the upper confidence limits \(R_u\) for each model parameter. We see that although the \(R_c\)’s of {mpg:weight} and {mpg:_cons} are below 1.1, the \(R_c\) of {sigma2} is quite large at 1.22. Moreover, all \(R_c\) values exceed their corresponding upper confidence limits at the 95% confidence level. Clearly, short Markov chains of length 25 are not sufficient for achieving convergence for this model.
In the next series of simulations, I increase the MCMC size to 50. This time I expect to obtain converging chains.
. set seed 14 . forvalues nchain = 1/3 { 2. quietly bayesmh mpg weight, > likelihood(normal({sigma2})) > prior({mpg:}, normal(0, 100)) > prior({sigma2}, igamma(10, 10)) > block({mpg:}) initrandom > mcmcsize(50) saving(sim`nchain', replace) 3. quietly estimates store chain`nchain' 4. }
I call grubin again with a confidence level of 95%.
. grubin {mpg:weight} {mpg:_cons} (lnvar:log({sigma2})), > estnames(chain1 chain2 chain3) Gelman-Rubin convergence diagnostic MCMC sample size = 50 Number of chains = 3 ----------------------------------- | Rc 95% Ru -------------+--------------------- mpg | weight | 1.045376 1.058433 _cons | 1.083469 1.05792 -------------+--------------------- lnvar | 1.006594 1.056714 -----------------------------------
All three \(R_c\) values are below 1.1, but they still are not quite within the upper confidence limit \(R_u\). This does not necessarily mean that the chains have not converged, because \(R_u\) is computed based on the approximation of the sampling distribution of the \(R_c\) statistic by an F distribution that may not always hold. For such low \(R_c\) values—all below 1.09—I have little reason to suspect nonconvergence. Nevertheless, I run a third set of simulations using a longer chain and a more efficient simulation.
In the last set of simulations, I further increase the MCMC size to 100.
. set seed 14 . forvalues nchain = 1/3 { 2. quietly bayesmh mpg weight, > likelihood(normal({sigma2})) > prior({mpg:}, normal(0, 100)) > prior({sigma2}, igamma(10, 10)) > block({mpg:}) initrandom > mcmcsize(100) saving(sim`nchain', replace) 3. quietly estimates store chain`nchain' 4. } . grubin {mpg:weight} {mpg:_cons} (lnvar:log({sigma2})), > estnames(chain1 chain2 chain3) Gelman-Rubin convergence diagnostic MCMC sample size = 100 Number of chains = 3 ----------------------------------- | Rc 95% Ru -------------+--------------------- mpg | weight | 1.019446 1.031024 _cons | 1.003891 1.02604 -------------+--------------------- lnvar | .9993561 1.020912 -----------------------------------
This time, all the \(R_c\) values are well below 1.01 and, moreover, below their corresponding upper confidence limits. We can conclude that all chains have converged.
References
Brooks, S. P., and A. Gelman. 1997. General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics 7: 434–455.
Gelman, A., and D. B. Rubin. 1992. Inference from Iterative Simulation Using Multiple Sequences. Statistical Science 7: 457–511.