Bayesian modeling: Beyond Stata’s built-in models
This post was written jointly with Nikolay Balov, Senior Statistician and Software Developer, StataCorp.
A question on Statalist motivated us to write this blog entry.
A user asked whether the churdle command (http://www.stata.com/stata14/hurdle-models/) for fitting hurdle models, new in Stata 14, can be combined with the bayesmh command (http://www.stata.com/stata14/bayesian-analysis/) for fitting Bayesian models, also new in Stata 14.
Our initial reaction to this question was ‘No’ or, more precisely, ‘Not easily’—hurdle models are not among the likelihood models supported by bayesmh. One can write a program to compute the log likelihood of the double hurdle model and use this program with bayesmh (in the spirit of http://www.stata.com/stata14/bayesian-evaluators/), but this may seem like a daunting task if you are not familiar with Stata programming.
And then we realized, why not simply call churdle from the evaluator to compute the log likelihood? All we need is for churdle to evaluate the log likelihood at specific values of model parameters without performing iterations. This can be achieved by specifying churdle’s options from() and iterate(0).
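To see the trick in isolation, here is a quick interactive check of our own (the matrix name bvals and the small perturbation are arbitrary and are used only for illustration). We fit churdle once, modify the stored coefficient vector, and then ask churdle to report the log likelihood at those values without updating them:

. webuse fitness
. churdle linear hours age, select(commute) ll(0)
. matrix bvals = e(b)                    // stored parameter estimates
. matrix bvals[1,1] = bvals[1,1] + .01   // try slightly different values
. churdle linear hours age, select(commute) ll(0) from(bvals) iterate(0)
. display e(ll)                          // log likelihood at the values in bvals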
Let’s look at an example. We consider a simple hurdle model using a subset of the fitness dataset from [R] churdle:
. webuse fitness
. set seed 17653
. sample 10
. churdle linear hours age, select(commute) ll(0)
Iteration 0: log likelihood = -2783.3352
Iteration 1: log likelihood = -2759.029
Iteration 2: log likelihood = -2758.9992
Iteration 3: log likelihood = -2758.9992
Cragg hurdle regression                         Number of obs     =      1,983
                                                LR chi2(1)        =       3.42
                                                Prob > chi2       =     0.0645
Log likelihood = -2758.9992                     Pseudo R2         =     0.0006

------------------------------------------------------------------------------
       hours |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
hours        |
         age |   .0051263   .0028423     1.80   0.071    -.0004446    .0106971
       _cons |   1.170932   .1238682     9.45   0.000     .9281548    1.413709
-------------+----------------------------------------------------------------
selection_ll |
     commute |  -.0655171   .1561046    -0.42   0.675    -.3714765    .2404423
       _cons |   .1421166   .0882658     1.61   0.107    -.0308813    .3151144
-------------+----------------------------------------------------------------
lnsigma      |
       _cons |   .1280215     .03453     3.71   0.000      .060344     .195699
-------------+----------------------------------------------------------------
      /sigma |   1.136577    .039246                      1.062202    1.216161
------------------------------------------------------------------------------
Let’s assume for a moment that we already have an evaluator, mychurdle1, that returns the corresponding log-likelihood value. We can fit a Bayesian hurdle model using bayesmh as follows:
. gen byte hours0 = (hours==0) //dependent variable for the selection equation
. set seed 123
. bayesmh (hours age) (hours0 commute),              ///
        llevaluator(mychurdle1, parameters({lnsig})) ///
        prior({hours:} {hours0:} {lnsig}, flat)      ///
        saving(sim, replace) dots
(output omitted)
We use a two-equation specification to fit this model. The main regression is specified first, and the selection regression is specified next. The additional parameter, the log of the standard deviation associated with the main regression, is specified in llevaluator()’s suboption parameters(). All parameters are assigned flat priors to obtain results comparable with those from churdle. MCMC results are saved in sim.dta.
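Nothing in this setup requires flat priors, of course. Had we wanted to incorporate prior information, we could have specified informative priors instead; for example, a hypothetical variation (which we do not pursue here) with normal priors of mean 0 and variance 100 on the coefficients and variance 10 on {lnsig} would be

. bayesmh (hours age) (hours0 commute),              ///
        llevaluator(mychurdle1, parameters({lnsig})) ///
        prior({hours:} {hours0:}, normal(0, 100))    ///
        prior({lnsig}, normal(0, 10))                ///
        saving(sim, replace) dots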
Here is the actual output from bayesmh:
. bayesmh (hours age) (hours0 commute), llevaluator(mychurdle1, parameters({lns
> ig})) prior({hours:} {hours0:} {lnsig}, flat) saving(sim, replace) dots
Burn-in 2500 aaaaaaaaa1000aaaaaa...2000..... done
Simulation 10000 .........1000.........2000.........3000.........4000.........5
> 000.........6000.........7000.........8000.........9000.........10000 done
Model summary
------------------------------------------------------------------------------
Likelihood:
hours hours0 ~ mychurdle1(xb_hours,xb_hours0,{lnsig})
Priors:
{hours:age _cons} ~ 1 (flat) (1)
{hours0:commute _cons} ~ 1 (flat) (2)
{lnsig} ~ 1 (flat)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_hours.
(2) Parameters are elements of the linear form xb_hours0.
Bayesian regression                              MCMC iterations  =     12,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      1,983
                                                 Acceptance rate  =      .2889
                                                 Efficiency:  min =     .05538
                                                              avg =     .06266
Log marginal likelihood = -2772.3953                          max =     .06945

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
hours        |
         age |  .0050916   .0027972   .000106   .0049923  -.0000372   .0107231
       _cons |  1.167265    .124755   .004889   1.175293   .9125105   1.392021
-------------+----------------------------------------------------------------
hours0       |
     commute | -.0621282   .1549908   .006585  -.0613511  -.3623891   .2379805
       _cons |  .1425693   .0863626   .003313   .1430396  -.0254507   .3097677
-------------+----------------------------------------------------------------
       lnsig |  .1321532   .0346446   .001472   .1326704   .0663646   .2015249
------------------------------------------------------------------------------
file sim.dta saved
The results are similar to those produced by churdle, as one would expect with noninformative priors.
If desired, we can use bayesstats summary to obtain the estimate of the standard deviation:
. bayesstats summary (sigma: exp({lnsig}))
Posterior summary statistics                      MCMC sample size =    10,000

       sigma : exp({lnsig})

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
       sigma |  1.141969   .0396264   .001685   1.141874   1.068616   1.223267
------------------------------------------------------------------------------
Let’s now talk in more detail about the log-likelihood evaluator. We will consider two evaluators: one that calls churdle and one that implements the log likelihood of our hurdle model directly.
Log-likelihood evaluator using churdle
Here we demonstrate how to write a log-likelihood evaluator that calls an existing Stata estimation command, churdle in our example, to compute the log likelihood.
program mychurdle1
version 14.0
args llf
tempname b
mat `b' = ($MH_b, $MH_p)
capture churdle linear $MH_y1 $MH_y1x1 if $MH_touse, ///
select($MH_y2x1) ll(0) from(`b') iterate(0)
if _rc {
if (_rc==1) { // handle break key
exit _rc
}
scalar `llf' = .
}
else {
scalar `llf' = e(ll)
}
end
The mychurdle1 program returns the log-likelihood value computed by churdle at the current values of the model parameters. The program accepts one argument, a temporary scalar llf that is to contain the log-likelihood value. We store the current values of the model parameters (the regression coefficients from the two equations, stored in vector MH_b, and the extra parameter, the log standard deviation, stored in vector MH_p) in a temporary matrix b. We specify churdle’s options from() and iterate(0) to evaluate the log likelihood at these parameter values without performing any optimization. Finally, we store the resulting log-likelihood value in llf, or a missing value if churdle failed to compute the log likelihood.
Log-likelihood evaluator directly computing log likelihood
Here we demonstrate how to write a log-likelihood evaluator that computes the log likelihood of the hurdle model directly rather than by calling churdle.
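Before turning to the code, it is useful to write down what it computes. In the notation of the evaluator’s arguments below, with xb_j and xg_j denoting the main and selection linear predictors for observation j and sigma = exp(lnsig), the observation-level log likelihood of this hurdle model with lower limit 0 is

\[
\ln f(y_j) =
\begin{cases}
\ln\{1 - \Phi(xg_j)\} & \text{if } y_j \le 0 \\
\ln\Phi(xg_j) - \ln\Phi(xb_j/\sigma) + \ln\phi\{(y_j - xb_j)/\sigma\} - \ln\sigma & \text{if } y_j > 0
\end{cases}
\]

where \Phi() and \phi() are the standard normal distribution and density functions (see Methods and formulas in [R] churdle). The last two terms in the second case correspond to log(normalden($MH_y1, xb, sig)) in the program.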
program mychurdle2
version 14.0
args lnf xb xg lnsig
tempname sig
scalar `sig' = exp(`lnsig')
tempvar lnfj
qui gen double `lnfj' = normal(`xg') if $MH_touse
qui replace `lnfj' = log(1 - `lnfj') if $MH_y1 <= 0 & $MH_touse
qui replace `lnfj' = log(`lnfj') - log(normal(`xb'/`sig')) ///
+ log(normalden($MH_y1,`xb',`sig')) ///
if $MH_y1 > 0 & $MH_touse
summarize `lnfj' if $MH_touse, meanonly
if r(N) < $MH_n {
scalar `lnf' = .
exit
}
scalar `lnf' = r(sum)
end
The mychurdle2 program accepts four arguments: a temporary scalar lnf to contain the log-likelihood value; temporary variables xb and xg containing the linear predictors from the main and selection equations, evaluated at the current values of the model parameters; and a temporary scalar lnsig containing the current value of the log standard deviation parameter. We compute and store the observation-level log likelihoods in a temporary variable lnfj. Global MH_y1 contains the name of the dependent variable from the first (main) equation, global MH_touse marks the estimation sample, and global MH_n contains the number of observations in that sample. If all observation-level log-likelihood contributions are nonmissing, we store the overall log-likelihood value in lnf; otherwise, we store a missing value.
We fit our model using the same syntax as earlier, except that we now specify mychurdle2 as the log-likelihood evaluator.
. set seed 123
. bayesmh (hours age) (hours0 commute), llevaluator(mychurdle2, parameters({lns
> ig})) prior({hours:} {hours0:} {lnsig}, flat) saving(sim, replace) dots
Burn-in 2500 aaaaaaaaa1000aaaaaa...2000..... done
Simulation 10000 .........1000.........2000.........3000.........4000.........5
> 000.........6000.........7000.........8000.........9000.........10000 done
Model summary
------------------------------------------------------------------------------
Likelihood:
hours hours0 ~ mychurdle2(xb_hours,xb_hours0,{lnsig})
Priors:
{hours:age _cons} ~ 1 (flat) (1)
{hours0:commute _cons} ~ 1 (flat) (2)
{lnsig} ~ 1 (flat)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_hours.
(2) Parameters are elements of the linear form xb_hours0.
Bayesian regression                              MCMC iterations  =     12,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      1,983
                                                 Acceptance rate  =      .2889
                                                 Efficiency:  min =     .05538
                                                              avg =     .06266
Log marginal likelihood = -2772.3953                          max =     .06945

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
hours        |
         age |  .0050916   .0027972   .000106   .0049923  -.0000372   .0107231
       _cons |  1.167265    .124755   .004889   1.175293   .9125105   1.392021
-------------+----------------------------------------------------------------
hours0       |
     commute | -.0621282   .1549908   .006585  -.0613511  -.3623891   .2379805
       _cons |  .1425693   .0863626   .003313   .1430396  -.0254507   .3097677
-------------+----------------------------------------------------------------
       lnsig |  .1321532   .0346446   .001472   .1326704   .0663646   .2015249
------------------------------------------------------------------------------
(note: file sim.dta not found)
file sim.dta saved
We obtain the same results as those obtained using approach 1, and we obtain them much faster.
Final remarks
Approach 1 is very straightforward. It can be applied to any Stata command that returns the log likelihood and allows you to specify parameter values at which this log likelihood must be evaluated. Without too much programming effort, you can use almost any existing Stata maximum likelihood estimation command with bayesmh. A disadvantage of approach 1 is slower execution compared with programming the likelihood directly, as in approach 2. For example, the command using the mychurdle1 evaluator from approach 1 took about 25 minutes to run, whereas the command using the mychurdle2 evaluator from approach 2 took only 20 seconds.
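As an illustration of this generality, here is a sketch of what an approach-1 evaluator might look like for a Bayesian negative binomial regression. This is our own sketch, not something shown above: the program name mynbreg_ll is arbitrary, the extra parameter passed through llevaluator()’s parameters() suboption would be the log-overdispersion parameter, and we are assuming that nbreg, like churdle, accepts the from() and iterate() maximize options and exits without error when iterate(0) is specified; verify this in your Stata version before relying on it.

program mynbreg_ll
        version 14.0
        args llf
        tempname b
        // current regression coefficients ($MH_b) followed by the extra
        // log-overdispersion parameter ($MH_p), in the same order as nbreg's e(b)
        mat `b' = ($MH_b, $MH_p)
        capture nbreg $MH_y1 $MH_y1x1 if $MH_touse, ///
                from(`b', copy) iterate(0)
        if _rc {
                if (_rc==1) {   // handle break key
                        exit _rc
                }
                scalar `llf' = .        // evaluation failed
        }
        else {
                scalar `llf' = e(ll)    // log likelihood at the current values
        }
end

You would then call it just as before, for example, bayesmh y x1 x2, llevaluator(mynbreg_ll, parameters({lnalpha})) prior({y:} {lnalpha}, flat) saving(nbsim, replace), with y, x1, x2, and nbsim standing in for your own variable and dataset names.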