Bayesian modeling: Beyond Stata’s built-in models
This post was written jointly with Nikolay Balov, Senior Statistician and Software Developer, StataCorp.
A question on Statalist motivated us to write this blog entry.
A user asked whether the churdle command (http://www.stata.com/stata14/hurdle-models/) for fitting hurdle models, new in Stata 14, can be combined with the bayesmh command (http://www.stata.com/stata14/bayesian-analysis/) for fitting Bayesian models, also new in Stata 14.
Our initial reaction to this question was ‘No’ or, more precisely, ‘Not easily’—hurdle models are not among the likelihood models supported by bayesmh. One can write a program to compute the log likelihood of the double hurdle model and use this program with bayesmh (in the spirit of http://www.stata.com/stata14/bayesian-evaluators/), but this may seem like a daunting task if you are not familiar with Stata programming.
And then we realized: why not simply call churdle from the evaluator to compute the log likelihood? All we need is for churdle to evaluate the log likelihood at specific values of the model parameters without performing any iterations. This can be achieved by specifying churdle's options from() and iterate(0).
Let’s look at an example. We consider a simple hurdle model using a subset of the fitness dataset from [R] churdle:
. webuse fitness

. set seed 17653

. sample 10

. churdle linear hours age, select(commute) ll(0)

Iteration 0:   log likelihood = -2783.3352
Iteration 1:   log likelihood =  -2759.029
Iteration 2:   log likelihood = -2758.9992
Iteration 3:   log likelihood = -2758.9992

Cragg hurdle regression                         Number of obs     =      1,983
                                                LR chi2(1)        =       3.42
                                                Prob > chi2       =     0.0645
Log likelihood = -2758.9992                     Pseudo R2         =     0.0006

------------------------------------------------------------------------------
       hours |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
hours        |
         age |   .0051263   .0028423     1.80   0.071    -.0004446    .0106971
       _cons |   1.170932   .1238682     9.45   0.000     .9281548    1.413709
-------------+----------------------------------------------------------------
selection_ll |
     commute |  -.0655171   .1561046    -0.42   0.675    -.3714765    .2404423
       _cons |   .1421166   .0882658     1.61   0.107    -.0308813    .3151144
-------------+----------------------------------------------------------------
lnsigma      |
       _cons |   .1280215     .03453     3.71   0.000      .060344     .195699
-------------+----------------------------------------------------------------
      /sigma |   1.136577    .039246                       1.062202    1.216161
------------------------------------------------------------------------------
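Before turning to bayesmh, we can check the trick described above. With from() and iterate(0), churdle does not iterate; it simply evaluates the log likelihood at the supplied parameter values and stores it in e(ll). Here is a minimal sketch (not part of the original session) using the estimates we just obtained:

. matrix b0 = e(b)

. quietly churdle linear hours age, select(commute) ll(0) from(b0) iterate(0)

. display e(ll)

Because b0 holds churdle's own estimates, e(ll) reproduces the log likelihood reported above. The same mechanism lets an evaluator obtain the log likelihood at whatever parameter values bayesmh proposes.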
Let’s assume for a moment that we already have an evaluator, mychurdle1, that returns the corresponding log-likelihood value. We can fit a Bayesian hurdle model using bayesmh as follows:
. gen byte hours0 = (hours==0)    // dependent variable for the selection equation

. set seed 123

. bayesmh (hours age) (hours0 commute), llevaluator(mychurdle1, parameters({lnsig})) prior({hours:} {hours0:} {lnsig}, flat) saving(sim, replace) dots
  (output omitted)
We use a two-equation specification to fit this model: the main regression is specified first, and the selection regression next. The additional parameter, the log of the standard deviation associated with the main regression, is specified in llevaluator()'s suboption parameters(). All parameters are assigned flat priors so that the results are comparable with those from churdle. The MCMC results are saved in sim.dta.
Here is the actual output from bayesmh:
. bayesmh (hours age) (hours0 commute), llevaluator(mychurdle1, parameters({lns
> ig})) prior({hours:} {hours0:} {lnsig}, flat) saving(sim, replace) dots

Burn-in 2500 aaaaaaaaa1000aaaaaa...2000..... done
Simulation 10000 .........1000.........2000.........3000.........4000.........5
> 000.........6000.........7000.........8000.........9000.........10000 done

Model summary
------------------------------------------------------------------------------
Likelihood:
  hours hours0 ~ mychurdle1(xb_hours,xb_hours0,{lnsig})

Priors:
       {hours:age _cons} ~ 1 (flat)                                        (1)
  {hours0:commute _cons} ~ 1 (flat)                                        (2)
                 {lnsig} ~ 1 (flat)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_hours.
(2) Parameters are elements of the linear form xb_hours0.

Bayesian regression                              MCMC iterations  =     12,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      1,983
                                                 Acceptance rate  =      .2889
                                                 Efficiency:  min =     .05538
                                                              avg =     .06266
Log marginal likelihood = -2772.3953                          max =     .06945

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
hours        |
         age |  .0050916   .0027972   .000106   .0049923  -.0000372   .0107231
       _cons |  1.167265    .124755   .004889   1.175293   .9125105   1.392021
-------------+----------------------------------------------------------------
hours0       |
     commute | -.0621282   .1549908   .006585  -.0613511  -.3623891   .2379805
       _cons |  .1425693   .0863626   .003313   .1430396  -.0254507   .3097677
-------------+----------------------------------------------------------------
       lnsig |  .1321532   .0346446   .001472   .1326704   .0663646   .2015249
------------------------------------------------------------------------------

file sim.dta saved
The results are similar to those produced by churdle, as one would expect with noninformative priors.
If desired, we can use bayesstats summary to obtain the estimate of the standard deviation:
. bayesstats summary (sigma: exp({lnsig}))

Posterior summary statistics                      MCMC sample size =    10,000

       sigma : exp({lnsig})

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
       sigma |  1.141969   .0396264   .001685   1.141874   1.068616   1.223267
------------------------------------------------------------------------------
Let’s now talk in more detail about a log-likelihood evaluator. We will consider two evaluators: one using churdle and one directly implementing the log likelihood of the considered hurdle model.
Log-likelihood evaluator using churdle
Here we demonstrate how to write a log-likelihood evaluator that calls an existing Stata estimation command, churdle in our example, to compute the log likelihood.
program mychurdle1
        version 14.0
        args llf
        tempname b
        mat `b' = ($MH_b, $MH_p)
        capture churdle linear $MH_y1 $MH_y1x1 if $MH_touse, ///
                        select($MH_y2x1) ll(0) from(`b') iterate(0)
        if _rc {
                if (_rc==1) {   // handle break key
                        exit _rc
                }
                scalar `llf' = .
        }
        else {
                scalar `llf' = e(ll)
        }
end
The mychurdle1 program returns the log-likelihood value computed by churdle at the current values of the model parameters. The program accepts one argument, llf, a temporary scalar in which to return the log-likelihood value. We store the current parameter values (the regression coefficients of the two equations, held in vector MH_b, and the extra parameter, the log standard deviation, held in vector MH_p) in a temporary matrix b. We then specify churdle's options from() and iterate(0) to evaluate the log likelihood at those parameter values without maximizing. Finally, we store the resulting log-likelihood value in llf, or a missing value if churdle failed to evaluate the log likelihood.
Log-likelihood evaluator directly computing log likelihood
Here we demonstrate how to write a log-likelihood evaluator that computes the likelihood of the fitted hurdle model directly rather than calling churdle.
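Writing xb_i and xg_i for the linear predictors of the main and selection equations, y_i for the dependent variable hours, and sigma = exp({lnsig}), the observation-level log likelihood that the evaluator below computes is

\[
\ln L_i =
\begin{cases}
\ln\bigl\{1-\Phi(xg_i)\bigr\}, & y_i \le 0,\\
\ln\Phi(xg_i)-\ln\Phi(xb_i/\sigma)+\ln\phi\bigl\{(y_i-xb_i)/\sigma\bigr\}-\ln\sigma, & y_i > 0,
\end{cases}
\]

where \(\Phi\) and \(\phi\) are the standard normal distribution and density functions. In the code, the last two terms are computed together as log(normalden($MH_y1,`xb',`sig')), because normalden(y,m,s) is the density of a normal with mean m and standard deviation s.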
program mychurdle2
        version 14.0
        args lnf xb xg lnsig
        tempname sig
        scalar `sig' = exp(`lnsig')
        tempvar lnfj
        qui gen double `lnfj' = normal(`xg') if $MH_touse
        qui replace `lnfj' = log(1 - `lnfj') if $MH_y1 <= 0 & $MH_touse
        qui replace `lnfj' = log(`lnfj') - log(normal(`xb'/`sig'))   ///
                        + log(normalden($MH_y1,`xb',`sig'))          ///
                        if $MH_y1 > 0 & $MH_touse
        summarize `lnfj' if $MH_touse, meanonly
        if r(N) < $MH_n {
                scalar `lnf' = .
                exit
        }
        scalar `lnf' = r(sum)
end
The mychurdle2 program accepts four arguments: lnf, a temporary scalar to contain the log-likelihood value; temporary variables xb and xg, which contain the linear predictors of the main and selection equations evaluated at the current values of the model parameters; and temporary scalar lnsig, which contains the current value of the log-standard-deviation parameter. We compute and store the observation-level log likelihood in a temporary variable lnfj. Global MH_y1 contains the name of the dependent variable of the first (main) equation, global MH_touse marks the estimation sample, and global MH_n contains the number of observations in that sample. If all observation-level log-likelihood contributions are nonmissing, we store their sum, the overall log-likelihood value, in lnf; otherwise, we store a missing value.
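If you want to convince yourself that this is the same likelihood churdle maximizes, you can evaluate it by hand at the churdle estimates from the beginning of this post. The short sketch below is not part of the original session; it is meant to be run from a do-file and uses the equation names (hours, selection_ll, lnsigma) shown in the churdle output above:

quietly churdle linear hours age, select(commute) ll(0)
local sig = exp([lnsigma]_b[_cons])               // sigma from the lnsigma equation
generate double xbhat = [hours]_b[age]*age + [hours]_b[_cons]
generate double xghat = [selection_ll]_b[commute]*commute + [selection_ll]_b[_cons]
generate double lnfj  = cond(hours <= 0, log(1 - normal(xghat)),       ///
        log(normal(xghat)) - log(normal(xbhat/`sig'))                  ///
        + log(normalden(hours, xbhat, `sig')))
summarize lnfj, meanonly
display r(sum)                                    // should match e(ll) from churdle
drop xbhat xghat lnfj

The sum of the observation-level terms reproduces churdle's reported log likelihood, which is exactly what mychurdle2 returns to bayesmh at each set of proposed parameter values.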
We fit our model using the same syntax as earlier, except we use mychurdle2 as the program evaluator.
. set seed 123

. bayesmh (hours age) (hours0 commute), llevaluator(mychurdle2, parameters({lns
> ig})) prior({hours:} {hours0:} {lnsig}, flat) saving(sim, replace) dots

Burn-in 2500 aaaaaaaaa1000aaaaaa...2000..... done
Simulation 10000 .........1000.........2000.........3000.........4000.........5
> 000.........6000.........7000.........8000.........9000.........10000 done

Model summary
------------------------------------------------------------------------------
Likelihood:
  hours hours0 ~ mychurdle2(xb_hours,xb_hours0,{lnsig})

Priors:
       {hours:age _cons} ~ 1 (flat)                                        (1)
  {hours0:commute _cons} ~ 1 (flat)                                        (2)
                 {lnsig} ~ 1 (flat)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_hours.
(2) Parameters are elements of the linear form xb_hours0.

Bayesian regression                              MCMC iterations  =     12,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      1,983
                                                 Acceptance rate  =      .2889
                                                 Efficiency:  min =     .05538
                                                              avg =     .06266
Log marginal likelihood = -2772.3953                          max =     .06945

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
hours        |
         age |  .0050916   .0027972   .000106   .0049923  -.0000372   .0107231
       _cons |  1.167265    .124755   .004889   1.175293   .9125105   1.392021
-------------+----------------------------------------------------------------
hours0       |
     commute | -.0621282   .1549908   .006585  -.0613511  -.3623891   .2379805
       _cons |  .1425693   .0863626   .003313   .1430396  -.0254507   .3097677
-------------+----------------------------------------------------------------
       lnsig |  .1321532   .0346446   .001472   .1326704   .0663646   .2015249
------------------------------------------------------------------------------

file sim.dta not found; file saved
We obtain the same results as with the mychurdle1 evaluator, and we obtain them much faster.
Final remarks
Approach 1 is very straightforward. It can be applied to any Stata command that returns the log likelihood and allows you to specify parameter values at which this log likelihood must be evaluated. Without too much programming effort, you can use almost any existing Stata maximum likelihood estimation command with bayesmh. A disadvantage of approach 1 is slower execution compared with programming the likelihood directly, as in approach 2. For example, the command using the mychurdle1 evaluator from approach 1 took about 25 minutes to run, whereas the command using the mychurdle2 evaluator from approach 2 took only 20 seconds.