Bayesian modeling: Beyond Stata’s built-in models

This post was written jointly with Nikolay Balov, Senior Statistician and Software Developer, StataCorp.

A question on Statalist motivated us to write this blog entry.

A user asked if the churdle command (http://www.stata.com/stata14/hurdle-models/) for fitting hurdle models, new in Stata 14, can be combined with the bayesmh command (http://www.stata.com/stata14/bayesian-analysis/) for fitting Bayesian models, also new in Stata 14:

http://www.statalist.org/forums/forum/general-stata-discussion/general/1290426-comibining-bayesmh-and-churdle

Our initial reaction to this question was ‘No’ or, more precisely, ‘Not easily’—hurdle models are not among the likelihood models supported by bayesmh. One can write a program to compute the log likelihood of the double hurdle model and use this program with bayesmh (in the spirit of http://www.stata.com/stata14/bayesian-evaluators/), but this may seem like a daunting task if you are not familiar with Stata programming.

And then we realized: why not simply call churdle from the evaluator to compute the log likelihood? All we need is for churdle to evaluate the log likelihood at specific values of the model parameters without performing any optimization iterations. This can be achieved by specifying churdle's from() and iterate(0) options (we demonstrate this below, right after fitting the model).

Let’s look at an example. We consider a simple hurdle model using a subset of the fitness dataset from [R] churdle:

. webuse fitness
. set seed 17653
. sample 10
. churdle linear hours age, select(commute) ll(0)

Iteration 0:   log likelihood = -2783.3352
Iteration 1:   log likelihood =  -2759.029
Iteration 2:   log likelihood = -2758.9992
Iteration 3:   log likelihood = -2758.9992

Cragg hurdle regression                         Number of obs     =      1,983
                                                LR chi2(1)        =       3.42
                                                Prob > chi2       =     0.0645
Log likelihood = -2758.9992                     Pseudo R2         =     0.0006

------------------------------------------------------------------------------
       hours |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
hours        |
         age |   .0051263   .0028423     1.80   0.071    -.0004446    .0106971
       _cons |   1.170932   .1238682     9.45   0.000     .9281548    1.413709
-------------+----------------------------------------------------------------
selection_ll |
     commute |  -.0655171   .1561046    -0.42   0.675    -.3714765    .2404423
       _cons |   .1421166   .0882658     1.61   0.107    -.0308813    .3151144
-------------+----------------------------------------------------------------
lnsigma      |
       _cons |   .1280215     .03453     3.71   0.000      .060344     .195699
-------------+----------------------------------------------------------------
      /sigma |   1.136577    .039246                      1.062202    1.216161
------------------------------------------------------------------------------
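As promised, let's verify that churdle will simply evaluate the log likelihood at supplied parameter values when from() and iterate(0) are specified. As a quick sketch (b0 is just our name for a matrix holding the stored estimates), we can re-evaluate the log likelihood at the estimates churdle just saved in e(b):

. matrix b0 = e(b)                       // churdle's ML estimates
. churdle linear hours age, select(commute) ll(0) from(b0) iterate(0)
. display e(ll)                          // log likelihood evaluated at b0

Because b0 holds the maximum likelihood estimates, the displayed value should match the log likelihood reported above, -2758.9992. During simulation, bayesmh will ask our evaluator for the log likelihood at many different parameter values, and churdle will evaluate it in exactly this way each time.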

Let’s assume for a moment that we already have an evaluator, mychurdle1, that returns the corresponding log-likelihood value. We can fit a Bayesian hurdle model using bayesmh as follows:

. gen byte hours0 = (hours==0) //dependent variable for the selection equation
. set seed 123
. bayesmh (hours age) (hours0 commute),                  ///
        llevaluator(mychurdle1, parameters({lnsig}))     ///
        prior({hours:} {hours0:} {lnsig}, flat)          ///
        saving(sim, replace) dots

(output omitted)

We use a two-equation specification to fit this model. The main regression is specified first, and the selection regression is specified next. The additional parameter, the log of the standard deviation associated with the main regression, is specified in llevaluator()'s parameters() suboption. All parameters are assigned flat priors so that the results are comparable with those from churdle. MCMC results are saved in sim.dta.

Here is the actual output from bayesmh:

. bayesmh (hours age) (hours0 commute), llevaluator(mychurdle1, parameters({lns
> ig})) prior({hours:} {hours0:} {lnsig}, flat) saving(sim, replace) dots

Burn-in 2500 aaaaaaaaa1000aaaaaa...2000..... done
Simulation 10000 .........1000.........2000.........3000.........4000.........5
> 000.........6000.........7000.........8000.........9000.........10000 done

Model summary
------------------------------------------------------------------------------
Likelihood:
  hours hours0 ~ mychurdle1(xb_hours,xb_hours0,{lnsig})

Priors:
       {hours:age _cons} ~ 1 (flat)                                        (1)
  {hours0:commute _cons} ~ 1 (flat)                                        (2)
                 {lnsig} ~ 1 (flat)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_hours.
(2) Parameters are elements of the linear form xb_hours0.

Bayesian regression                              MCMC iterations  =     12,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      1,983
                                                 Acceptance rate  =      .2889
                                                 Efficiency:  min =     .05538
                                                              avg =     .06266
Log marginal likelihood = -2772.3953                          max =     .06945

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
hours        |
         age |  .0050916   .0027972   .000106   .0049923  -.0000372   .0107231
       _cons |  1.167265    .124755   .004889   1.175293   .9125105   1.392021
-------------+----------------------------------------------------------------
hours0       |
     commute | -.0621282   .1549908   .006585  -.0613511  -.3623891   .2379805
       _cons |  .1425693   .0863626   .003313   .1430396  -.0254507   .3097677
-------------+----------------------------------------------------------------
       lnsig |  .1321532   .0346446   .001472   .1326704   .0663646   .2015249
------------------------------------------------------------------------------

file sim.dta saved

The results are similar to those produced by churdle, as one would expect with noninformative priors.
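The flat priors are what make this comparison meaningful. If we had genuine prior information, only the prior() statements would need to change. For example, a hypothetical variant with normal priors (the means and variances below are purely illustrative, not recommendations) might look like this:

. set seed 123
. bayesmh (hours age) (hours0 commute),                  ///
        llevaluator(mychurdle1, parameters({lnsig}))     ///
        prior({hours:} {hours0:}, normal(0, 100))        ///
        prior({lnsig}, normal(0, 10))                    ///
        saving(sim, replace) dots

Recall that bayesmh's normal() prior is specified in terms of the mean and the variance.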

If desired, we can use bayesstats summary to obtain the estimate of the standard deviation:

. bayesstats summary (sigma: exp({lnsig}))

Posterior summary statistics                      MCMC sample size =    10,000

       sigma : exp({lnsig})

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
       sigma |  1.141969   .0396264   .001685   1.141874   1.068616   1.223267
------------------------------------------------------------------------------
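Before relying on any of these posterior summaries, it is good practice to check that the chain has mixed well and converged. The acceptance rate and efficiencies reported in the header are one indication; graphical checks are available through bayesgraph. For example,

. bayesgraph diagnostics {hours:age}

displays trace, autocorrelation, histogram, and kernel density plots for the coefficient on age, and the other parameters can be inspected the same way.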

Let's now talk in more detail about the log-likelihood evaluator. We will consider two evaluators: one that calls churdle and one that implements the log likelihood of our hurdle model directly.

Log-likelihood evaluator using churdle

Here we demonstrate how to write a log-likelihood evaluator that calls an existing Stata estimation command, churdle in our example, to compute the log likelihood.

program mychurdle1
        version 14.0
        args llf
        tempname b
        mat `b' = ($MH_b, $MH_p)        // current parameter values: coefficients and {lnsig}
        capture churdle linear $MH_y1 $MH_y1x1 if $MH_touse, ///
                    select($MH_y2x1) ll(0) from(`b') iterate(0)
        if _rc {
                if (_rc==1) { // handle break key
                        exit _rc
                }
                scalar `llf' = .
        }
        else {
                scalar `llf' = e(ll)
        }
end

The mychurdle1 program returns the log-likelihood value computed by churdle at the current values of the model parameters. The program accepts one argument, llf, a temporary scalar that will contain the log-likelihood value. We store the current values of the model parameters (the regression coefficients from the two equations, stored in vector MH_b, and the extra parameter, the log standard deviation, stored in vector MH_p) in a temporary matrix b. We specify churdle's from() and iterate(0) options so that the log likelihood is evaluated at those parameter values without updating them. Finally, we store the resulting log-likelihood value in llf, or a missing value if churdle failed to evaluate the log likelihood.

Log-likelihood evaluator directly computing log likelihood

Here we demonstrate how to write a log-likelihood evaluator that computes the log likelihood of our hurdle model directly rather than by calling churdle.
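For reference, the evaluator below computes the following observation-level log likelihood (the Cragg hurdle model with a probit selection equation and lower limit 0; see [R] churdle), written here in the notation of the evaluator's arguments:

$$
\ln L_j =
\begin{cases}
\ln\{1 - \Phi(xg_j)\} & \text{if } y_j \le 0 \\
\ln\Phi(xg_j) - \ln\Phi(xb_j/\sigma) + \ln\phi\{(y_j - xb_j)/\sigma\} - \ln\sigma & \text{if } y_j > 0
\end{cases}
$$

where y_j is hours, xb_j and xg_j are the linear predictors from the main and selection equations, sigma = exp(lnsig), and Phi() and phi() denote the standard normal distribution and density functions. The overall log likelihood is the sum of ln L_j over the estimation sample.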

program mychurdle2
        version 14.0
        args lnf xb xg lnsig
        tempname sig
        scalar `sig' = exp(`lnsig')             // standard deviation
        tempvar lnfj
        // observation-level log likelihood
        qui gen double `lnfj' = normal(`xg')  if $MH_touse
        qui replace `lnfj'    = log(1 - `lnfj') if $MH_y1 <= 0 & $MH_touse
        qui replace `lnfj'    = log(`lnfj') - log(normal(`xb'/`sig'))   ///
                              + log(normalden($MH_y1,`xb',`sig'))       ///
                                if $MH_y1 > 0 & $MH_touse
        summarize `lnfj' if $MH_touse, meanonly
        if r(N) < $MH_n {                       // some contributions are missing
            scalar `lnf' = .
            exit
        }
        scalar `lnf' = r(sum)                   // overall log likelihood
end

The mychurdle2 program accepts four arguments: lnf, a temporary scalar to contain the log-likelihood value; xb and xg, temporary variables containing the linear predictors from the main and selection equations evaluated at the current values of the model parameters; and lnsig, a temporary scalar containing the current value of the log standard-deviation parameter. We compute and store the observation-level log likelihoods in a temporary variable lnfj. Global MH_y1 contains the name of the dependent variable from the first (main) equation, and global MH_touse marks the estimation sample. If all observation-specific log-likelihood contributions are nonmissing, we store the overall log-likelihood value in lnf; otherwise, we store a missing value.

We fit our model using the same syntax as earlier, except that we now specify mychurdle2 as the log-likelihood evaluator.

. set seed 123
. bayesmh (hours age) (hours0 commute), llevaluator(mychurdle2, parameters({lns
> ig})) prior({hours:} {hours0:} {lnsig}, flat) saving(sim, replace) dots

Burn-in 2500 aaaaaaaaa1000aaaaaa...2000..... done
Simulation 10000 .........1000.........2000.........3000.........4000.........5
> 000.........6000.........7000.........8000.........9000.........10000 done

Model summary
------------------------------------------------------------------------------
Likelihood:
  hours hours0 ~ mychurdle2(xb_hours,xb_hours0,{lnsig})

Priors:
       {hours:age _cons} ~ 1 (flat)                                        (1)
  {hours0:commute _cons} ~ 1 (flat)                                        (2)
                 {lnsig} ~ 1 (flat)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_hours.
(2) Parameters are elements of the linear form xb_hours0.

Bayesian regression                              MCMC iterations  =     12,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =     10,000
                                                 Number of obs    =      1,983
                                                 Acceptance rate  =      .2889
                                                 Efficiency:  min =     .05538
                                                              avg =     .06266
Log marginal likelihood = -2772.3953                          max =     .06945

------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
hours        |
         age |  .0050916   .0027972   .000106   .0049923  -.0000372   .0107231
       _cons |  1.167265    .124755   .004889   1.175293   .9125105   1.392021
-------------+----------------------------------------------------------------
hours0       |
     commute | -.0621282   .1549908   .006585  -.0613511  -.3623891   .2379805
       _cons |  .1425693   .0863626   .003313   .1430396  -.0254507   .3097677
-------------+----------------------------------------------------------------
       lnsig |  .1321532   .0346446   .001472   .1326704   .0663646   .2015249
------------------------------------------------------------------------------

file sim.dta saved

We obtain the same results as with approach 1, and we obtain them much faster.
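If you would like to reproduce the timing comparison discussed below on your own machine, Stata's timer command is convenient. A small sketch that times the mychurdle2 run:

. timer clear 1
. timer on 1
. set seed 123
. bayesmh (hours age) (hours0 commute),                  ///
        llevaluator(mychurdle2, parameters({lnsig}))     ///
        prior({hours:} {hours0:} {lnsig}, flat)          ///
        saving(sim, replace) dots
. timer off 1
. timer list 1

Replacing mychurdle2 with mychurdle1 (and using a second timer) gives the corresponding timing for approach 1.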

Final remarks

Approach 1 is very straightforward. It can be applied to any Stata command that returns the log likelihood and allows you to specify parameter values at which this log likelihood must be evaluated. Without too much programming effort, you can use almost any existing Stata maximum likelihood estimation command with bayesmh. A disadvantage of approach 1 is slower execution compared with programming the likelihood directly, as in approach 2. For example, the command using the mychurdle1 evaluator from approach 1 took about 25 minutes to run, whereas the command using the mychurdle2 evaluator from approach 2 took only 20 seconds.
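To see how little needs to change for a different estimation command, here is a hypothetical sketch of an approach-1 evaluator for a probit model (untested; probit is used purely as an illustration). Only the command called inside capture changes, and because a probit model has no extra parameters, only $MH_b is needed:

program myprobit1
        version 14.0
        args llf
        tempname b
        mat `b' = ($MH_b)       // current coefficient values passed by bayesmh
        capture probit $MH_y1 $MH_y1x1 if $MH_touse, from(`b') iterate(0)
        if _rc {
                if (_rc==1) {   // handle break key
                        exit _rc
                }
                scalar `llf' = .        // tell bayesmh the evaluation failed
        }
        else {
                scalar `llf' = e(ll)    // log likelihood at the current values
        }
end

It could then be used along the lines of

. sysuse auto, clear
. set seed 123
. bayesmh foreign mpg, llevaluator(myprobit1) prior({foreign:}, flat) dots

Depending on the command, you may also need from()'s copy suboption if the command requires the initial-value matrix to carry matching coefficient names.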