
Positive log-likelihood values happen

From time to time, we get a question from a user puzzled about getting a positive log likelihood for a particular estimation. We are so used to seeing negative log-likelihood values that a positive one can look like a sign that something went wrong.

First, let me point out that there is nothing wrong with a positive log likelihood.

The likelihood is the product of the density evaluated at the observations. Usually, the density takes values that are smaller than one, so its logarithm will be negative. However, this is not true for every distribution.

For example, consider the density of a normal distribution with a small standard deviation, say 0.1.

. di normalden(0,0,.1)
3.9894228

This density concentrates all of its area in a narrow region around zero, so it must take large values near that point. Naturally, the logarithm of such a value is positive.

. di log(3.9894228)
1.3836466
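
More generally, the height of a normal density at its mean is 1/(sigma*sqrt(2*pi)), so the density exceeds one there whenever sigma is smaller than 1/sqrt(2*pi), roughly 0.399. We can verify both numbers directly:

. di 1/(.1*sqrt(2*_pi))
3.9894228

. di 1/sqrt(2*_pi)
.39894228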

In model estimation, the situation is a bit more complex. When you fit a model to a dataset, the log likelihood is evaluated at every observation. Some of these evaluations may turn out to be positive and some negative; only their sum is reported. Let me show you an example.
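
In the notation of the normal model used below, observation i contributes

    ln f(y_i) = ln normalden(y_i, xb_i, sigma)

to the log likelihood, where xb_i is the linear prediction, and the reported value is the sum of these contributions over the n observations; nothing forces that sum to be negative.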

I will start by simulating a dataset appropriate for a linear model.

clear
program drop _all
set seed 1357
set obs 100
gen x1 = rnormal()
gen x2 = rnormal()
// true coefficients are 2 and 3, the constant is 1, and the error sd is .06
gen y = 2*x1 + 3*x2 + 1 + .06*rnormal()

I will borrow the code for mynormal_lf from the book Maximum Likelihood Estimation with Stata (W. Gould, J. Pitblado, and B. Poi, 2010, Stata Press) in order to fit my model via maximum likelihood.

program mynormal_lf
        version 11.1
        // lnf holds each observation's log likelihood;
        // mu and lnsigma are the linear predictors for the mean and ln(sigma)
        args lnf mu lnsigma
        // $ML_y1 is the dependent variable declared in -ml model-
        quietly replace `lnf' = ln(normalden($ML_y1,`mu',exp(`lnsigma')))
end

ml model lf  mynormal_lf  (y = x1 x2) (lnsigma:)
ml max, nolog

The following table will be displayed:

.   ml max, nolog

                                                  Number of obs   =        100
                                                  Wald chi2(2)    =  456919.97
Log likelihood =  152.37127                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1          |   
          x1 |   1.995834    .005117   390.04   0.000     1.985805    2.005863
          x2 |   3.014579   .0059332   508.08   0.000      3.00295    3.026208
       _cons |   .9990202   .0052961   188.63   0.000       .98864      1.0094
-------------+----------------------------------------------------------------
lnsigma      |  
       _cons |  -2.942651   .0707107   -41.62   0.000    -3.081242   -2.804061
------------------------------------------------------------------------------

We can see that the estimates are close to our original parameters (the estimate of -2.94 for lnsigma corresponds to a standard deviation of about .053, near the .06 used in the simulation), and also that the log likelihood is positive.

We can obtain the log likelihood for each observation by substituting the estimates in the log-likelihood formula:

. predict double xb

. gen double lnf = ln(normalden(y, xb, exp([lnsigma]_b[_cons])))

. summ lnf, detail

                             lnf
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -1.360689      -1.574499
 5%    -.0729971       -1.14688
10%     .4198644      -.3653152       Obs                 100
25%     1.327405      -.2917259       Sum of Wgt.         100

50%     1.868804                      Mean           1.523713
                        Largest       Std. Dev.      .7287953
75%     1.995713       2.023528
90%     2.016385       2.023544       Variance       .5311426
95%     2.021751       2.023676       Skewness      -2.035996
99%     2.023691       2.023706       Kurtosis       7.114586

. di r(sum)
152.37127
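
ml max also leaves the overall log likelihood in e(ll), so we can confirm that the per-observation values really do add up to the reported figure:

. di e(ll)
152.37127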

. gen f = exp(lnf)

. summ f, detail

                              f
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .2623688       .2071112
 5%     .9296673       .3176263
10%      1.52623       .6939778       Obs                 100
25%     3.771652       .7469733       Sum of Wgt.         100

50%     6.480548                      Mean           5.448205
                        Largest       Std. Dev.      2.266741
75%     7.357449       7.564968
90%      7.51112        7.56509       Variance       5.138117
95%     7.551539       7.566087       Skewness      -.8968159
99%     7.566199        7.56631       Kurtosis       2.431257

We can see that some values for the log likelihood are negative, but most are positive, and that the sum is the value we already know. In the same way, most of the values of the likelihood are greater than one.

As an exercise, try the commands above with a larger error standard deviation, say, 1. Now the density will be flatter, and none of its values will be greater than one.
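
A minimal sketch of that exercise, reusing the mynormal_lf program already defined above (so do not rerun program drop _all); only the error term in the simulation changes:

clear
set seed 1357
set obs 100
gen x1 = rnormal()
gen x2 = rnormal()
// same coefficients as before, but the error sd is now 1
gen y = 2*x1 + 3*x2 + 1 + rnormal()
ml model lf mynormal_lf (y = x1 x2) (lnsigma:)
ml max, nolog

With the fitted sigma near 1, no density value can exceed 1/sqrt(2*pi), roughly 0.399, so every per-observation log likelihood is negative and so is the reported sum.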

In short, if you get a positive log likelihood, there is nothing wrong with that; but if you check your dispersion parameters, you will most likely find that they are small.

  • Maarten Buis (http://www.maartenbuis.nl)

    Nice post, things like these can really baffle a person for a while. It reminds me of a discussion I had a while ago where someone claimed that it was better to model proportions than percentages. He based that on looking at the BIC values, and they do result in a lot smaller BIC values:

        use http://fmwww.bc.edu/repec/bocode/c/citybudget.dta, clear
        reg governing noleft minorityleft houseval popdens
        estat ic

        gen prop_gov = governing * 100

        reg prop_gov noleft minorityleft houseval popdens
        estat ic

    It took me a while to first figure out myself what was going on, which is basically the same issue as the one discussed in this post, and then to convince that other person.
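
(To spell out the mechanics behind that BIC gap: dividing the dependent variable by 100 multiplies every density value by 100, so the log likelihood increases by n*ln(100) and the BIC, computed as -2*lnL + k*ln(n), drops by 2*n*ln(100), even though the two regressions are substantively identical.)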

  • Maarten Buis (http://www.maartenbuis.nl)

    As an aside: It would be nice for ssc to have a use sub-command. That could make examples a lot easier to read. The example above could then start with:

        ssc use citybudget, clear

  • Nick Cox

    Isabel’s point can be pushed further by underlining that probability densities have units of measurement.

    Naturally, everyone knows about units of measurement. Somewhere in your past there was, with probability 1, some fierce science teacher who was scathing when you missed out the units of measurement in some report. The point re-emerges in the standard elementary statistics course when it is underlined that the units of variance are the square of the original units of measurement, which is grounds enough for introducing its square root, the standard deviation.

    However, I’ve often noticed statistical people not using the same logic when talking about densities, and sometimes their colleagues or students can end up confused. (Occasionally, they get confused too.)

    The underlying general idea I often explain in this way. Density is amount of “stuff” in some “space”. In physics, with density in the classic sense, “stuff” is clearly mass and “space” is clearly volume. Many social scientists might more commonly think of something like population density, in which “stuff” is number of people and “space” is area. In the present example, “stuff” is probability and “space” is the support of the variable(s) in question.

    In statistics, introductory courses usually insist on a distinction between probabilities for discrete variables and probability densities for continuous variables, and ne’er the twain shall meet. (At higher levels, mathematically-oriented statisticians who have ingested large doses of measure theory sometimes insist that anything can be a density; it’s just a case of the underlying measure, which could be counting measure.)

    Focusing on the continuous case, the units of the density come from working backwards from the fact that the total probability, the integral over its support of the density function, must be 1 and must be unit-free and dimensionless. It follows that

        units of density = 1 / units of variable

    In the univariate case, the probability can be considered as the area under the density curve, and the argument can be made visual by considering rectangles with sides the density and the variable.

    So, the units of the density of “miles per gallon” are “gallons per mile”, however odd that may seem.

    In the bivariate or multivariate case, the units we are talking about are the product of the units of the individual variables, which gets messy and unintuitive, but not intrinsically difficult or problematic. If we were imagining the joint density of mpg and weight in the auto dataset, the units would be 1 / (miles per gallon * pounds).

    Away from likelihood calculations, the issue often arises when people get a density estimate from say -kdensity- and are puzzled by densities above 1. In fact, people have been known to ask how to “fix” the results, which they put down to some bug in -kdensity-.

    I’ll add a puff for a small classic of exposition:

        D. J. Finney. 1977. Dimensions of statistics. Journal of the Royal Statistical Society, Series C (Applied Statistics) 26: 285-289.

    Naturally, even if you specify your units for densities, you can still make mistakes. I read a thesis in which a candidate reported a density for his soils of 2 mg/m^3. (Think of how much a cubic metre of water weighs, and double it.) It is salutary to realise that billion-fold errors are not reserved for cosmology or finance, but are possible in your own backyard.
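
A quick way to see the -kdensity- point with simulated data (an illustrative sketch; the values are arbitrary):

clear
set obs 1000
set seed 2468
// simulate a variable with a small standard deviation (.05),
// so its true density peaks near 1/(.05*sqrt(2*_pi)), about 8
gen z = .05*rnormal()
// the kernel density estimate will likewise peak well above 1
kdensity z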