Archive

Author Archive

Using Stata’s random-number generators, part 4, details

For those interested in how pseudo random number generators work, I just wrote something on Statalist which you can see in the Statalist archives by clicking the link even if you do not subscribe:

http://www.stata.com/statalist/archive/2012-10/msg01129.html

To remind you, I’ve been writing about how to use random-number generators in parts 1, 2, and 3, and I still have one more posting I want to write on the subject. What I just wrote on Statalist, however, is about how random-number generators work, and I think you will find it interesting.

To find out more about Statalist, see

Statalist

How to successfully ask a question on Statalist

Using Stata’s random-number generators, part 3, drawing with replacement

The topic for today is drawing random samples with replacement. If you haven’t read part 1 and part 2 of this series on random numbers, do so. In the series we’ve
discussed that

  1. Stata’s runiform() function produces random numbers over the range [0,1). To produce such random numbers, type
    . generate double u = runiform()
    

     

  2. To produce continuous random numbers over [a,b), type
    . generate double u = (b-a)*runiform() + a
    

     

  3. To produce integer random numbers over [a,b], type
    . generate ui = floor((b-a+1)*runiform() + a)
    

    If b > 16,777,216, type

    . generate long ui = floor((b-a+1)*runiform() + a)
    

     

  4. To place observations in random order — to shuffle observations — type
    . set seed #
    . generate double u = runiform()
    . sort u
    

     

  5. To draw without replacement a random sample of n observations from a dataset of N observations, type
    . set seed #
    . sort variables_that_put_dta_in_unique_order
    . generate double u = runiform()
    . sort u
    . keep in 1/n
    

    If N>1,000, generate two random variables u1 and u2 in place of u, and substitute sort u1 u2 for sort u.

     

  6. To draw without replacement a P-percent random sample, type
    . set seed #
    . keep if runiform() <= P/100
    

I’ve glossed over details, but the above is the gist of it.

Today I’m going to tell you

  1. To draw a random sample of size n with replacement from a dataset of size N, type
    . set seed #
    . drop _all
    . set obs n
    . generate long obsno = floor(N*runiform()+1)
    . sort obsno
    . save obsnos_to_draw
    
    . use your_dataset, clear
    . generate long obsno = _n
    . merge 1:m obsno using obsnos_to_draw, keep(match) nogen
    

     

  2. You need to set the random-number seed only if you care about reproducibility. I’ll also mention that if N ≤ 16,777,216, it is not necessary to specify that new variable obsno be stored as long; the default float will be sufficient.

     
    The above solution works whether n<N, n=N, or n>N.

 

Drawing samples with replacement

The solution to sampling with replacement n observations from a dataset of size N is

  1. Draw n observation numbers 1, …, N with replacement. For instance, if N=4 and n=3, we might draw observation numbers 1, 3, and 3.

  2. Select those observations from the dataset of interest. For instance, select observations 1, 3, and 3.

As previously discussed in part 1, to generate random integers drawn with replacement over the range [a, b], use the formula

generate varname = floor((b-a+1)*runiform() + a)

In this case, we want a=1 and b=N, and the formula reduces to,

generate varname = floor(N*runiform() + 1)

So the first half of our solution could read

. drop _all
. set obs n
. generate obsno = floor(N*runiform() + 1)

Now we are merely left with the problem of selecting those observations from our dataset, which we can do using merge by typing

. sort obsno
. save obsnos_to_draw
. use dataset_of_interest, clear
. generate obsno = _n
. merge 1:m obsno using obsnos_to_draw, keep(match) nogen

Let’s do an example. In part 2 of this series, I had a dataset with observations corresponding to playing cards:

. use cards

. list in 1/5

     +-------------+
     | rank   suit |
     |-------------|
  1. |  Ace   Club |
  2. |    2   Club |
  3. |    3   Club |
  4. |    4   Club |
  5. |    5   Club |
     +-------------+

There are 52 observations in the dataset; I’m showing you just the first five. Let’s draw 10 cards from the deck, but with replacement.

The first step is to draw the observation numbers. We have N=52 cards in the deck, and we want to draw n=10, so we generate 10 random integers from the integers [1, 52]:

. drop _all

. set obs 10                            // we want n=10
obs was 0, now 10

. gen obsno = floor(52*runiform()+1)    // we draw from N=52


. list obsno                            // let's see what we have

     +-------+
     | obsno |
     |-------|
  1. |    42 |
  2. |    52 |
  3. |    16 |
  4. |     9 |
  5. |    40 |
     |-------|
  6. |    11 |
  7. |    34 |
  8. |    20 |
  9. |    49 |
 10. |    42 |
     +-------+

If you look carefully at the list, you will see that observation number 42 repeats. It will be easier to see the duplicate if we sort the list,

. sort obsno
. list

     +-------+
     | obsno |
     |-------|
  1. |     9 |
  2. |    11 |
  3. |    16 |
  4. |    20 |
  5. |    34 |
     |-------|
  6. |    40 |
  7. |    42 |     <- Obs. 42 repeats
  8. |    42 |     <- See?
  9. |    49 |
 10. |    52 |
     +-------+

An observation didn’t have to repeat, but it’s not surprising that one did because in drawing n=10 from N=52, we would expect one or more repeated cards about 60% of the time.

Anyway, we now know which cards we want, namely cards 9, 11, 16, 20, 34, 40, 42, 42 (again), 49, and 52.

The final step is to select those observations from cards.dta. The way to do that is to perform a one-to-many merge of cards.dta with the list above and keep the matches. Before we can do that, however, we must (1) save the list of observation numbers as a dataset, (2) load cards.dta, and (3) add a variable called obsno to it. Then we will be able to perform the merge. So let’s get that out of the way,

. save obsnos_to_draw                // 1. save the list above
file obsnos_to_draw.dta saved

. use cards                          // 2. load cards.dta

. gen obsno = _n                     // 3.  Add variable obsno to it

Now we can perform the merge:

. merge 1:m obsno using obsnos_to_draw, keep(matched) nogen

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                10
    -----------------------------------------

I’ll list the result, but let me first briefly explain the command

merge 1:m obsno using obsnos_to_draw, keep(matched) nogen

merge …, we are performing the merge command,

1:m …, the merge is one-to-many,

using obsnos_to_draw …, we merge data in memory with obsnos_todraw.dta,

, keep(matched) …, we keep observations that appear in both datasets,

nogen, do not add variable _merge to the resulting dataset; _merge reports the source of the resulting observations; we said keep(matched) so we know each came from both sources.

And here is the result:

. list

     +-------------------------+
     |  rank      suit   obsno |
     |-------------------------|
  1. |     8      Club       9 |
  2. |  Jack      Club      11 |
  3. |   Ace     Spade      16 |
  4. |     2   Diamond      20 |
  5. |     6     Spade      34 |
     |-------------------------|
  6. |     8     Spade      40 |
  7. |     9     Heart      42 |   <- Obs. 42 is here ...
  8. | Queen     Spade      49 |
  9. |  King     Spade      52 |
 10. |     9     Heart      42 |   <- and here
     +-------------------------+

We drew 10 cards — those are the observation numbers on the left. Variable obsno in our dataset records the original observation (card) number and really, we no longer need the variable. Anyway, obsno==42 appears twice, in real observations 7 and 10, and thus we drew the 9 of Hearts twice.

 

What could go wrong?

Not much can go wrong, it turns out. At this point, our generic solution is

. drop _all
. set obs n
. generate obsno = floor(n*runiform()+1)
. sort obsno
. save obsnos_to_draw

. use our_dataset
. gen obsno = _n
. merge 1:m obsno using obsnos_to_draw, keep(matched) nogen

If you study this code, there are two lines that might cause problems,

. generate obsno = floor(N*runiform()+1)

and

. generate obsno = _n

When you are looking for problems and see a generate or replace, think about rounding.

Let’s look at the right-hand side first. Both calculations produce integers over the range [1, N]. generate performs all calculations in double and the largest integer that can be stored without rounding is 9,007,199,254,740,992 (see previous blog post on precision). Stata allows datasets up to 2,147,483,646, so we can be sure that N is less than the maximum precise-integer double. There are no rounding issues on the right-hand side.

Next let’s look at the left-hand side. Variable obsno is being stored as a float because we did not instruct otherwise. The largest integer value that can be stored without rounding as a float (also covered in previous blog post on precision) is 16,777,216, and that is less than Stata’s 2,147,483,646 maximum observations. When N exceeds 16,777,216, the solution is to store obsno as a long. We could remember to use long on the rare occasion when dealing with such large datasets, but I’m going to change the generic solution to use longs in all cases, even when it’s unnecessary.

What else could go wrong? Well, we tried an example with n<N and that seemed to work. We should now try examples with n=N and n>N to verify there’s no hidden bug or assumption in our code. I’ve tried examples of both and the code works fine.

 

We’re done for today

That’s it. Drawing samples with replacement turns out to be easy, and that shouldn’t surprise us because we have a random-number generator that draws with replacement.

We could complicate the discussion and consider solutions that would run a bit more efficiently when n=N, which is of special interest in statistics because it is a key ingredient in bootstrapping, but we will not. The above solution works fine in the n=N case, and I always advise researchers to favor simple-even-if-slower solutions because they will probably save you time. Writing complicated code takes longer than writing simple code, and testing complicated code takes even longer. I know because that’s what we do at StataCorp.

Using Stata’s random-number generators, part 2, drawing without replacement

Last time I told you that Stata’s runiform() function generates rectangularly (uniformly) distributed random numbers over [0, 1), from 0 to nearly 1, and to be precise, over [0, 0.999999999767169356]. And I gave you two formulas,

  1. To generate continuous random numbers between a and b, use

    generate double u = (b-a)*runiform() + a

    The random numbers will not actually be between a and b: they will be between a and nearly b, but the top will be so close to b, namely 0.999999999767169356*b, that it will not matter.

  2. To generate integer random numbers between a and b, use

    generate ui = floor((b-a+1)*runiform() + a)

I also mentioned that runiform() can solve a variety of problems, including

  • shuffling data (putting observations in random order),
  • drawing random samples without replacement (there’s a minor detail we’ll have to discuss because runiform() itself produces values drawn with replacement),
  • drawing random samples with replacement (which is easier to do than most people realize),
  • drawing stratified random samples (with or without replacement),
  • manufacturing fictional data (something teachers, textbook authors, manual writers, and blog writers often need to do).

Today we will cover shuffling and drawing random samples without replacement — the first two topics on the list — and we will leave drawing random samples with replacement for next time. I’m going to tell you

  1. To place observations in random order — to shuffle the observations — type
    . generate double u = runiform()
    . sort u
    
  2. To draw without replacement a random sample of n observations from a dataset of N observations, type
    . set seed #
    . generate double u = runiform()
    . sort u
    . keep in 1/n
    

    I will tell you that there are good statistical reasons for setting the random-number seed even if you don’t care about reproducibility.

    If you do care about reproducibility, I will mention (1) that you need to use sort to put the original data in a known, reproducible order, before you generate the random variate u, and I will explain (2) a subtle issue that leads us to use different code for N≤1,000 and N>1,000. The code for for N≤1,000 is

    . set seed #
    . sort variables_that_put_data_in_unique_order
    . generate double u = runiform()
    . sort u
    . keep in 1/n
    

    and the code for N>1,000 is

    . set seed #
    . sort variables_that_put_data_in_unique_order
    . generate double u1 = runiform()
    . generate double u2 = runiform()
    . sort u1 u2
    . keep in 1/n
    

    You can use the N>1,000 code for the N≤1,000 case.

  3. To draw without replacement a P-percent random sample, type
    . set seed #
    . keep if runiform() <= P/100
    

    There’s no issue in this case when N is large.

As I mentioned, we’ll discuss drawing random samples with replacement next time. Today, the topic is random samples without replacement. Let’s start.

 
Shuffling data

I have a deck of 52 cards, in order, the first four of which are

. list in 1/4

     +-------------+
     | rank   suit |
     |-------------|
  1. |  Ace   Club |
  2. |    1   Club |
  3. |    2   Club |
  4. |    3   Club |
     +-------------+

Well, actually I just have a Stata dataset with observations corresponding to playing cards. To shuffle the deck — to place the observations in random order — type

. generate double u = runiform()

. sort u

Having done that, here’s your hand,

. list in 1/5

     +----------------------------+
     |  rank      suit          u |
     |----------------------------|
  1. | Queen      Club   .0445188 |
  2. |     5   Diamond   .0580662 |
  3. |     7      Club   .0610638 |
  4. |  King     Heart   .0907986 |
  5. |     6     Spade   .0981878 |
     +----------------------------+

and here’s mine:

. list in 6/10

     +---------------------------+
     | rank      suit          u |
     |---------------------------|
  6. |    8   Diamond   .1024369 |
  7. |    5      Club   .1086679 |
  8. |    8     Spade   .1091783 |
  9. |    2     Spade   .1180158 |
 10. |  Ace      Club   .1369841 |
     +---------------------------+

All I did was generate random numbers — one per observation (card) — and then place the observations in ascending order of the random values. Doing that is equivalent to shuffling the deck. I used runiform() random numbers, meaning rectangularly distributed random numbers over [0, 1), but since I’m only exploiting the random-numbers’ ordinal properties, I could have used random numbers from any continuous distribution.

This simple, elegant, and obvious solution to shuffling data will play an important part of the solution to drawing observations without replacement. I have already more than hinted at the solution when I showed you your hand and mine.

 
Drawing n observations without replacement

Drawing without replacement is exactly the same problem as dealing cards. The solution to the physical card problem is to shuffle the cards and then draw the top cards. The solution to randomly selecting n from N observations is to put the N observations in random order and keep the first n of them.

. use cards, clear

. generate double u = runiform()

. sort u

. keep in 1/5
(47 observations deleted)

. list

     +---------------------------+
     | rank      suit          u |
     |---------------------------|
  1. |  Ace   Diamond   .0064866 |
  2. |    6     Heart   .0087578 |
  3. | King     Spade    .014819 |
  4. |    3     Spade   .0955155 |
  5. | King   Diamond   .1007262 |
     +---------------------------+

. drop u

 
Reproducibility

You might later want to reproduce the analysis, meaning you do not want to draw another random sample, but you want to draw the same random sample. Perhaps you informally distributed some preliminary results and, of course, then discovered a mistake. You want to redistribute updated results and show that your mistake didn’t change results by much, and to drive home the point, you want to use the same samples as you used previously.

Part of the solution is to set the random-number seed. You might type

. set seed 49983882

. use cards, clear

. generate double u = runiform()

. sort u

. keep in 1/5

See help set seed in Stata. As a quick review, when you set the random-number seed, you set Stata’s random-number generator into a fixed, reproducible state, which is to say, the sequence of random numbers that runiform() produces is a function of the seed. Set the seed today to the same value as yesterday, and runiform() will produce the same sequence of random numbers today as it did yesterday. Thus, after setting the seed, if you repeat today exactly what you did yesterday, you will obtain the same results.

So imagine that you set the random number seed today to the value you set it to yesterday and you repeat the above commands. Even so, you might not get the same results! You will not get the same results if the observations in cards.dta are in a different order yesterday and today. Setting the seed merely ensures that if yesterday the smallest value of u was in observation 23, it will again be in observation 23 today (and it will be the same value). If yesterday, however, observation 23 was the 6 of Clubs, and today it’s the 7 of Hearts, then today you will select the 7 of Hearts in place of the 6 of Clubs.

So make sure the data are in the same order. One way to do that is put the dataset in a known order before generating the random values on which you will sort. For instance,

. set seed 49983882

. use cards, clear

. sort rank suit

. generate double u = runiform()

. sort u

. keep in 1/5

An even better solution would add the line

. by rank suit: assert _N==1

just before the generate. That line would check whether sorting on variables rank and suit uniquely orders the observations.

With cards.dta, you can argue that the assert is unnecessary, but not because you know each rank-suit combination occurs once. You have only my assurances about that. I recommend you never trust anyone’s assurances about data. In this case, however, you can argue that the assert is unnecessary because we sorted on all the variables in the dataset and thus uniqueness is not required. Pretend there are two Ace of Clubs in the deck. Would it matter that the first card was Ace of Clubs followed by Ace of Clubs as opposed to being the other way around? Of course it would not; the two states are indistinguishable.

So let’s assume there is another variable in the dataset, say whether there was a grease spot on the back of the card. Yesterday, after sorting, the ordering might have been,

     +---------------------------------+
     | rank   suit   grease          u |
     |---------------------------------|
  1. |  Ace   Club      yes   .6012949 |
  2. |  Ace   Club       no   .1859054 |
     +---------------------------------+

and today,

     +---------------------------------+
     | rank   suit   grease          u |
     |---------------------------------|
  1. |  Ace   Club       no   .6012949 |
  2. |  Ace   Club      yes   .1859054 |
     +---------------------------------+

If yesterday you selected the Ace of Clubs without grease, today you would select the Ace of Clubs with grease.

My recommendation is (1) sort on whatever variables put the data into a unique order, and then verify that, or (2) sort on all the variables in the dataset and then don’t worry whether the order is unique.

 
Ensuring a random ordering

Included in our reproducible solution but omitted from our base solution was setting the random-number seed,

. set seed 49983882

Setting the seed is important even if you don’t care about reproducibility. Each time you launch Stata, Stata sets the same random-number seed, namely 123456789, and that means that runiform() generates the same sequence of random numbers, and that means that if you generated all your random samples right after launching Stata, you would always select the same observations, at least holding N constant.

So set the seed, but don’t set it too often. You set the seed once per problem. If I wanted to draw 10,000 random samples from the same data, I could code:

  use dataset, clear
  set seed 1702213
  sort variables_that_put_data_in_unique_order
  preserve
  forvalues i=1(1)10000 {
          generate double u = runiform()
          sort u
          keep in 1/n
          drop u
          save sample`i', replace
          restore, preserve  
}

In the example I save each sample in a file. In real life, I seldom (never) save the samples; I perform whatever analysis on the samples I need and save the results, which I usually append into a single dataset. I don’t need to save the individual samples because I can recreate them.

 
And the result still might not be reproducible …

runiform() draws random-numbers with replacement. It is thus possible that two or more observations could have the same random values associated with them. Well yes, you’re thinking, I see that it’s possible, but surely it’s so unlikely that it just doesn’t happen. But it does happen:

. clear all

. set obs 100000
obs was 0, now 100000

. generate double u = runiform() 

. by u, sort: assert _N==1
1 contradiction in 99999 by-groups
assertion is false      
r(9); 

In the 100,000-observation dataset I just created, I got a duplicate! By the way, I didn’t have to look hard for such an example, I got it the first time I tried.

I have three things I want to tell you:

  1. Duplicates happen more often than you might guess.

  2. Do not panic about the duplicates. Because of how Stata is written, duplicates do not lower the quality of the sample selected. I’ll explain.

  3. Duplicates do interfere with reproducibility, however, and there is an easy way around that problem.

Let’s start with the chances of observing duplicates. I mentioned in passing last time that runiform() is a 32-bit random-number generator. That means runiform() can return any of 232 values. Their values are, in order,

          0  =  0
      1/232  =  2.32830643654e-10
      2/232  =  4.65661287308e-10 
      3/232  =  6.98491930962e-10
             .
             .
             .
 (232-2)/232  =  0.9999999995343387
 (232-1)/232  =  0.9999999997671694

So what are the chances that in N draws with replacement from an urn containing these 232 values, that all values are distinct? The probability p that all values are distinct is

      232 * (232-1) * ... *(232-N)
p  =  ----------------------------
                 N*232

Here are some values for various values of N. p is the probability that all values are unique, and 1-p is the probability of observing one or more repeated values.

------------------------------------
      N         p            1-p
------------------------------------
     50   0.999999715    0.000000285
    500   0.999970955    0.000029045
  1,000   0.999883707    0.000116293
  5,000   0.997094436    0.002905564
 10,000   0.988427154    0.011572846
 50,000   0.747490440    0.252509560
100,000   0.312187988    0.687812012
200,000   0.009498117    0.990501883
300,000   0.000028161    0.999971839
400,000   0.000000008    0.999999992
500,000   0.000000000    1.000000000
------------------------------------

In shuffling cards we generated N=52 random values. The probability of a repeated values is infinitesimal. In datasets of N=10,000, I expect to see repeated values 1% of the time. In datasets of N=50,000, I expect to see repeated values 25% of the time. By N=100,000, I expect to see repeated values more often than not. By N=500,000, I expect to see repeated value in virtually all sequences.

Even so, I promised you that this problem does not affect the randomness of the ordering. It does not because of how Stata’s sort command is written. Remember the basic solution,

 
. use dataset, clear

. generate double u = runiform()

. sort u

. keep in 1/n

Did you know sort has its own, private random-number generator built into it? It does, and sort uses its random-number generator to determine the order of tied observations. In the manuals we at StataCorp are fond of writing, “the ties will be ordered randomly” and a few sophisticated users probably took that to mean, “the ties will be ordered in a way that we at StataCorp do not know and even though they might be ordered in a way that will cause a bias in the subsequent analysis, because we don’t know, we’ll ignore the possibility.” But we meant it when wrote that the ties will be ordered randomly; we know that because we put a random number generator into sort to ensure the result. And that is why I can now write that repeated values of the runiform() function cause a reproducibility issue, but not a statistical issue.

The solution to the reproducibility issue is to draw two random numbers and use the random-number pair to order the observations:

 
. use dataset, clear

. sort varnames

. set seed #

. generate double u1 = runiform()

. generate double u2 = runiform()

. sort u1 u2

. keep in 1/n

You might wonder if we would ever need three random numbers. It is very unlikely. p, the probability of no problem, equals 1 to at least 5 digits for N=500,000. Of course, the chances of duplication are always nonzero. If you are concerned about this problem, you could add an assert to the code to verify that the two random numbers together do uniquely identify the observations:

. use dataset, clear

. sort varnames

. set seed #

. generate double u1 = runiform()

. generate double u2 = runiform()

. sort u1 u2

. by u1 u2: assert _N==1            // added line

. keep in 1/n

I do not believe that doing that is necessary.

 
Is using doubles necessary?

In the generation of random numbers in all of the above, note that I am storing them as doubles. For the reproducibility issue, that is important. As I mentioned in part 1, the 32-bit random numbers that runiform() produces will be rounded if forced into 23-bit floats.

Above I gave you a table of probabilities p that, in creating

. generate double u = runiform()

the values of u would be distinct. Here is what would happen if you instead stored u as a float:

                               u stored as
          ---------------------------------------------------------
          -------- double ----------     ----------float ----------
      N         p            1-p               p           1-p
-------------------------------------------------------------------
     50   0.999999715    0.000000285     0.999853979    0.000146021
    500   0.999970955    0.000029045     0.985238383    0.014761617
  1,000   0.999883707    0.000116293     0.942190868    0.057809132
  5,000   0.997094436    0.002905564     0.225346930    0.774653070
 10,000   0.988427154    0.011572846     0.002574145    0.997425855
 50,000   0.747490440    0.252509560     0.000000000    1.000000000
100,000   0.312187988    0.687812012     0.000000000    1.000000000
200,000   0.009498117    0.990501883     0.000000000    1.000000000
300,000   0.000028161    0.999971839     0.000000000    1.000000000
400,000   0.000000008    0.999999992     0.000000000    1.000000000
500,000   0.000000000    1.000000000     0.000000000    1.000000000
-------------------------------------------------------------------

 
Drawing without replacement P-percent random samples

We have discussed drawing without replacement n observations from N observations. The number of observations selected has been fixed. Say instead we wanted to draw a 10% random sample, meaning that we independently allow each observation to have a 10% chance of appearing in our sample. In that case, the final number of observations is expected to be 0.1*N, but it may (and probably will) vary from that. The basic solution for drawing a 10% random sample is

. keep if runiform() <= 0.10

and the basic solution for drawing a P% random sample is

. keep if runiform() <= P/100

It is unlikely to matter whether you code <= or < in the comparison. As you now know, runiform() produces values drawn from 232 possible values, and thus the chance of equality is 2-32 or roughly 0.000000000232830644. If you want a P% sample, however, theory says you should code <=.

If you care about reproducibility, you should expand the basic solution to read,

. set seed #

. use data, clear 

. sort variables_that_put_data_in_unique_order

. keep if runiform() <= P/100

Below I draw a 10% sample from the card.dta:

. set seed 838   

. use cards, clear

. sort rank suit

. keep if runiform() <= 10/100
(46 observations deleted)

. list

     +-----------------+
     |  rank      suit |
     |-----------------|
  1. |     2   Diamond |
  2. |     2     Heart |
  3. |     3      Club |
  4. |     5     Heart |
  5. |  Jack   Diamond |
     |-----------------|
  6. | Queen     Spade |
     +-----------------+

 
We’re not done, but we’re done for today

In part 3 of this series I will discuss drawing random samples with replacement.

Using Stata’s random-number generators, part 1

I want to start a series on using Stata’s random-number function. Stata in fact has ten random-number functions:

  1. runiform() generates rectangularly (uniformly) distributed random number over [0,1).
  2. rbeta(a, b) generates beta-distribution beta(a, b) random numbers.
  3. rbinomial(n, p) generates binomial(n, p) random numbers, where n is the number of trials and p the probability of a success.
  4. rchi2(df) generates χ2 with df degrees of freedom random numbers.
  5. rgamma(a, b) generates Γ(a, b) random numbers, where a is the shape parameter and b, the scale parameter.
  6. rhypergeometric(N, K, n) generates hypergeometric random numbers, where N is the population size, K is the number of in the population having the attribute of interest, and n is the sample size.
  7. rnbinomial(n, p) generates negative binomial -- the number of failures before the nth success -- random numbers, where p is the probability of a success. (n can also be noninteger.)
  8. rnormal(μ, σ) generates Gaussian normal random numbers.
  9. rpoisson(m) generates Poisson(m) random numbers.
  10. rt(df) generates Student's t(df) random numbers.

You already know that these random-number generators do not really produce random numbers; they produce pseudo-random numbers. This series is not about that, so we'll be relaxed about calling them random-number generators.

You should already know that you can set the random-number seed before using the generators. That is not required but it is recommended. You set the seed not to obtain better random numbers, but to obtain reproducible random numbers. In fact, setting the seed too often can actually reduce the quality of the random numbers! If you don't know that, then read help set seed in Stata. I should probably pull out the part about setting the seed too often, expand it, and turn it into a blog entry. Anyway, this series is not about that either.

This series is about the use of random-number generators to solve problems, just as most users usually use them. The series will provide practical advice. I'll stay away from describing how they work internally, although long-time readers know that I won't keep the promise. At least I'll try to make sure that any technical details are things you really need to know. As a result, I probably won't even get to write once that if this is the kind of thing that interests you, StataCorp would be delighted to have you join our development staff.

 

runiform(), generating uniformly distributed random numbers

Mostly I'm going to write about runiform() because runiform() can solve such a variety of problems. runiform() can be used to solve,

  • shuffling data (putting observations in random order),
  • drawing random samples without replacement (there's a minor detail we'll have to discuss because runiform() itself produces values drawn with replacement),
  • drawing random samples with replacement (which is easier to do than most people realize),
  • drawing stratified random samples (with or without replacement),
  • manufacturing fictional data (something teachers, textbook authors, manual writers, and blog writers often need to do).

runiform() generates uniformly, a.k.a. rectangularly distributed, random numbers over the interval, I quote from the manual, "0 to nearly 1".

Nearly 1? "Why not all the way to 1?" you should be asking. "And what exactly do you mean by nearly 1?"

The answer is that the generator is more useful if it omits 1 from the interval, and so we shaved just a little off. runiform() produces random numbers over [0, 0.999999999767169356].

Here are two useful formulas you should commit to memory.

  1. If you want to generate continuous random numbers between a and b, use

    generate double u = (b-a)*runiform() + a

    The random numbers will not actually be between a and b, they will be between a and nearly b, but the top will be so close to b, namely 0.999999999767169356*b, that it will not matter.

    Remember to store continuous random values as doubles.

  2. If you want to generate integer random numbers between a and b, use

    generate ui = floor((b-a+1)*runiform() + a)

    In particular, do not even consider using the formula for continuous values but rounded to integers, which is to say, round(u) = round((b-a)*runiform() + a). If you use that formula, and if b-a>1, then a and b will be under represented by 50% each in the samples you generate!

    I stored ui as a default float, so I am assuming that -16,777,216 ≤ a < b ≤ 16,777,216. If you have integers outside of that range, however, store as a long or double.

I’m going to spend the rest of this blog entry explaining the above.

First, I want to show you how I got the two formulas and why you must use the second formula for generating integer uniform deviates.

Then I want explain why we shaved a little from the top of runiform(), namely (1) while it wouldn’t matter for formula 1, it made formula 2 a little easier, (2) the code would run more quickly, (3) we could more easily prove that we had implemented the random-number generator correctly, and (4) anyone digging deeper into our random numbers would not be misled into thinking they had more than 32 bits of resolution. That last point will be important in a future blog entry.

 

Continuous uniforms over [a, b)

runiform() produces random numbers over [0, 1). It therefore obviously follows that (b-a)*runiform()+a produces number over [a, b). Substitute 0 for runiform() and the lower limit is obtained. Substitute 1 for runiform() and the upper limit is obtained.

I can tell you that in fact, runiform() produces random numbers over [0, (232-1)/232].

Thus (b-a)*runiform()+a produces random numbers over [a, ((232-1)/232)*b].

(232-1)/232) approximately equals 0.999999999767169356 and exactly equals 1.fffffffeX-01 if you will allow me to use %21x format, which Stata understands and which you can understand if you see my previous blog posting on precision.

Thus, if you are concerned about results being in the interval [a, b) rather than [a, b], you can use the formula

generate double u = ((b-a)*runiform() + a) / 1.fffffffeX-01

There are seven f’s followed by e in the hexadecimal constant. Alternatively, you could type

generate double u = ((b-a)*runiform() + a) * ((2^32-1)/2^32)

but multiplying by 1.fffffffeX-01 is less typing so I’d type that. Actually I wouldn’t type either one; the small difference between values lying in [a, b) or [a, b] is unimportant.

 

Integer uniforms over [a, b]

Whether we produce real, continuous random numbers over [a, b) or [a, b] may be unimportant, but if we want to draw random integers, the distinction is important.

runiform() produces continuous results over [0, 1).

(b-a)*runiform()+a produces continuous results over [a, b).

To produce integer results, we might round continuous results over segments of the number line:

           a    a+.5  a+1  a+1.5  a+2  a+2.5       b-1.5  b-1  b-.5    b
real line  +-----+-----+-----+-----+-----+-----------+-----+-----+-----+
int  line  |<-a->|<---a+1--->|<---a+2--->|           |<---b-1--->|<-b->|

In the diagram above, think of the numbers being produced by the continuous formula u=(b-a)*runiform()+a as being arrayed along the real line. Then imagine rounding those values, say by using Stata's round(u) function. If you rounded in that way, then

  • Values of u between a and a+0.5 will be rounded to a.
  • Values of u between a+0.5 and a+1.5 will be rounded to a+1.
  • Values of u between a+1.5 and a+2.5 will be rounded to a+2.
  • ...
  • Values of u between b-1.5 and b-0.5 will be rounded to b-1.
  • Values of u between b-0.5 and b-1 will be rounded to b.

Note that the width of the first and last intervals is half that of the other intervals. Given that u follows the rectangular distribution, we thus expect half as many values rounded to a and to b as to a+1 or a+2 or ... or b-1.

And indeed, that is exactly what we would see:

. set obs 100000
obs was 0, now 100000

. gen double u = (5-1)*runiform() + 1

. gen i = round(u)

. summarize u i 

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           u |    100000    3.005933    1.156486   1.000012   4.999983
           i |    100000     3.00489    1.225757          1          5

. tabulate i

          i |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     12,525       12.53       12.53
          2 |     24,785       24.79       37.31
          3 |     24,886       24.89       62.20
          4 |     25,284       25.28       87.48
          5 |     12,520       12.52      100.00
------------+-----------------------------------
      Total |    100,000      100.00

To avoid the problem we need to make the widths of all the intervals equal, and that is what the formula floor((b-a+1)*runiform() + a) does.

           a          a+1         a+2                     b-1          b          b+1
real line  +-----+-----+-----+-----+-----------------------+-----+-----+-----+-----+
int  line  |<--- a --->|<-- a+1 -->|                       |<-- b-1 -->|<--- b --->)

Our intervals are of equal width and thus we expect to see roughly the same number of observations in each:

. gen better = floor((5-1+1)*runiform() + 1)

. tabulate better

     better |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     19,808       19.81       19.81
          2 |     20,025       20.02       39.83
          3 |     19,963       19.96       59.80
          4 |     20,051       20.05       79.85
          5 |     20,153       20.15      100.00
------------+-----------------------------------
      Total |    100,000      100.00

So now you know why we shaved a little off the top when we implemented runiform(); it made the formula

floor((b-a+1)*runiform() + a):

easier. Our integer [a, b] formula did not have to concern itself that runiform() would sometimes — rarely — return 1. If runiform() did return the occasional 1, the simple formula above would produce the (correspondingly occasional) b+1.

 

How Stata calculates continuous random numbers

I’ve said that we shaved a little off the top, but the fact was that it was easier for us to do the shaving than not.

runiform() is based on the KISS random number generator. KISS produces 32-bit integers, meaning integers the range [0, 232-1], or [0, 4,294,967,295]. You might wonder how we converted that range to being continuous over [0, 1).

Start by thinking of the number KISS produces in its binary form:

b31b30b29b28b27b26b25b24b23b22b21b20b19b18b17b16b15b14b13b12b11b10b9b8b7b6b5b4b3b2b1b0

The corresponding integer is b31*231 + b31*230 + ... + b0*20. All we did was insert a binary point out front:

. b31b30b29b28b27b26b25b24b23b22b21b20b19b18b17b16b15b14b13b12b11b10b9b8b7b6b5b4b3b2b1b0

making the real value b31*2-1 + b30*2-2 + ... + b0*2-32. Doing that is equivalent to dividing by 2-32, except insertion of the binary point is faster. Nonetheless, if we had wanted runiform() to produce numbers over [0, 1], we could have divided by 232-1.

Anyway, if the KISS random number generator produced 3190625931, which in binary is

10111110001011010001011010001011

we converted that to

0.10111110001011010001011010001011

which equals 0.74287549 in base 10.

The largest number the KISS random number generator can produce is, of course,

11111111111111111111111111111111

and 0.11111111111111111111111111111111 equals 0.999999999767169356 in base 10. Thus, the runiform() implementation of KISS generates random numbers in the range [0, 0.999999999767169356].

I could have presented all of this mathematically in base 10: KISS produces integers in the range [0, 232-1], and in runiform() we divide by 232 to thus produce continuous numbers over the range [0, (232-1)/232]. I could have said that, but it loses the flavor and intuition of my longer explanation, and it would gloss over the fact that we just inserted the binary point. If I asked you, a base-10 user, to divide 232 by 10, you wouldn’t actually divide in the same way that they would divide by, say 9. Dividing by 9 is work. Dividing by 10 merely requires shifting the decimal point. 232 divided by 10 is obviously 23.2. You may not have realized that modern digital computers, when programmed by “advanced” programmers, follow similar procedures.

Oh gosh, I do get to say it! If this sort of thing interests you, consider a career at StataCorp. We’d love to have you.

 

Is it important that runiform() values be stored as doubles?

Sometimes it is important. It’s obviously not important when you are generating random integers using floor((b-a+1)*runiform() + a) and -16,777,216 ≤ a < b ≤ 16,777,216. Integers in that range fit into a float without rounding.

When creating continuous values, remember that runiform() produces 32 bits. floats store 23 bits and doubles store 52, so if you store the result of runiform() as a float, it will be rounded. Sometimes the rounding matters, and sometimes it does not. Next time, we will discuss drawing random samples without replacement. In that case, the rounding matters. In most other cases, including drawing random samples with replacement — something else for later — the rounding
does not matter. Rather than thinking hard about the issue, I store all my non-integer
random values as doubles.

 

Tune in for the next episode

Yes, please do tune in for the next episode of everything you need to know about using random-number generators. As I already mentioned, we’ll discuss drawing random samples without replacement. In the third installment, I’m pretty sure we’ll discuss random samples with replacement. After that, I’m a little unsure about the ordering, but I want to discuss oversampling of some groups relative to others and, separately, discuss the manufacturing of fictional data.

Am I forgetting something?

The Penultimate Guide to Precision

There have recently been occasional questions on precision and storage types on Statalist despite all that I have written on the subject, much of it posted in this blog. I take that as evidence that I have yet to produce a useful, readable piece that addresses all the questions researchers have.

So I want to try again. This time I’ll try to write the ultimate piece on the subject, making it as short and snappy as possible, and addressing every popular question of which I am aware — including some I haven’t addressed before — and doing all that without making you wade with me into all the messy details, which I know I have a tendency to do.

I am hopeful that from now on, every question that appears on Statalist that even remotely touches on the subject will be answered with a link back to this page. If I succeed, I will place this in the Stata manuals and get it indexed online in Stata so that users can find it the instant they have questions.

What follows is intended to provide everything scientific researchers need to know to judge the effect of storage precision on their work, to know what can go wrong, and to prevent that. I don’t want to raise expectations too much, however, so I will entitle it …

THE PENULTIMATE GUIDE TO PRECISION

  1. Contents

     1. Numeric types
    2. Floating-point types
    3. Integer types
    4. Integer precision
    5. Floating-point precision
    6. Advice concerning 0.1, 0.2, …
    7. Advice concerning exact data, such as currency data
    8. Advice for programmers
    9. How to interpret %21x format (if you care)
    10. Also see

  2. Numeric types

    1.1 Stata provides five numeric types for storing variables, three of them integer types and two of them floating point.

    1.2 The floating-point types are float and double.

    1.3 The integer types are byte, int, and long.

    1.4 Stata uses these five types for the storage of data.

    1.5 Stata makes all calculations in double precision (and sometimes quad precision) regardless of the type used to store the data.

  3. Floating-point types

    2.1 Stata provides two IEEE 754-2008 floating-point types: float and double.

    2.2 float variables are stored in 4 bytes.

    2.3 double variables are stored in 8 bytes.

    2.4 The ranges of float and double variables are

         Storage
         type             minimum                maximum
         -----------------------------------------------------
         float     -3.40282346639e+ 38      1.70141173319e+ 38
         double    -1.79769313486e+308      8.98846567431e+307
         -----------------------------------------------------
         In addition, float and double can record missing values 
         ., .a, .b, ..., .z.

    The above values are approximations. For those familiar with %21x floating-point hexadecimal format, the exact values are

         Storage
         type                   minimum                maximum
         ------------------------------------------------------- 
         float   -1.fffffe0000000X+07f     +1.fffffe0000000X+07e 
         double  -1.fffffffffffffX+3ff     +1.fffffffffffffX+3fe
         -------------------------------------------------------

    Said differently, and less precisely, float values are in the open interval (-2128, 2127), and double values are in the open interval (-21024, 21023). This is less precise because the intervals shown in the tables are closed intervals.

  4. Integer types

    3.1 Stata provides three integer storage formats: byte, int, and long. They are 1 byte, 2 bytes, and 4 bytes, respectively.

    3.2 Integers may also be stored in Stata’s IEEE 754-2008 floating-point storage formats float and double.

    3.3 Integer values may be stored precisely over the ranges

         storage
         type                   minimum                 maximum
         ------------------------------------------------------
         byte                      -127                     100
         int                    -32,767                  32,740
         long            -2,147,483,647           2,147,483,620
         ------------------------------------------------------
         float              -16,777,216              16,777,216
         double  -9,007,199,254,740,992   9,007,199,254,740,992
         ------------------------------------------------------
         In addition, all storage types can record missing values
         ., .a, .b, ..., .z.

    The overall ranges of float and double were shown in (2.4) and are wider than the ranges for them shown here. The ranges shown here are the subsets of the overall ranges over which no rounding of integer values occurs.

  5. Integer precision

    4.1 (Automatic promotion.) For the integer storage types — for byte, int, and long — numbers outside the ranges listed in (3.3) would be stored as missing (.) except that storage types are promoted automatically. As necessary, Stata promotes bytes to ints, ints to longs, and longs to doubles. Even if a variable is a byte, the effective range is still [-9,007,199,254,740,992, 9,007,199,254,740,992] in the sense that you could change a value of a byte variable to a large value and that value would be stored correctly; the variable that was a byte would, as if by magic, change its type to int, long, or double if that were necessary.

    4.2 (Data input.) Automatic promotion (4.1) applies after the data are input/read/imported/copied into Stata. When first reading, importing, copying, or creating data, it is your responsibility to choose appropriate storage types. Be aware that Stata’s default storage type is float, so if you have large integers, it is usually necessary to specify explicitly the types you wish to use.

    If you are unsure of the type to specify for your integer variables, specify double. After reading the data, you can use compress to demote storage types. compress never results in a loss of precision.

    4.3 Note that you can use the floating-point types float and double to store integer data.

    4.3.1 Integers outside the range [-2,147,483,647, 2,147,483,620] must be stored as doubles if they are to be precisely recorded.

    4.3.2 Integers can be stored as float, but avoid doing that unless you are certain they will be inside the range [-16,777,216, 16,777,216] not just when you initially read, import, or copy them into Stata, but subsequently as you make transformations.

    4.3.3 If you read your integer data as floats, and assuming they are within the allowed range, we recommend that you change them to an integer type. You can do that simply by typing compress. We make that recommendation so that your integer variables will benefit from the automatic promotion described in (4.1).

    4.4 Let us show what can go wrong if you do not follow our advice in (4.3). For the floating-point types — for float and double — integer values outside the ranges listed in (3.3) are rounded.

    Consider a float variable, and remember that the integer range for floats is [-16,777,216, 16,777,216]. If you tried to store a value outside the range in the variable — say, 16,777,221 — and if you checked afterward, you would discover that actually stored was 16,777,220! Here are some other examples of rounding:

         desired value                            stored (rounded)
         to store            true value             float value 
         ------------------------------------------------------
         maximum             16,777,216              16,777,216 
         maximum+1           16,777,217              16,777,216
         ------------------------------------------------------
         maximum+2           16,777,218              16,777,218
         ------------------------------------------------------
         maximum+3           16,777,219              16,777,220
         maximum+4           16,777,220              16,777,220
         maximum+5           16,777,221              16,777,220
         ------------------------------------------------------
         maximum+6           16,777,222              16,777,222
         ------------------------------------------------------
         maximum+7           16,777,223              16,777,224
         maximum+8           16,777,224              16,777,224
         maximum+9           16,777,225              16,777,224
         ------------------------------------------------------
         maximum+10          16,777,226              16,777,226
         ------------------------------------------------------

    When you store large integers in float variables, values will be rounded and no mention will be made of that fact.

    And that is why we say that if you have integer data that must be recorded precisely and if the values might be large — outside the range ±16,777,216 — do not use float. Use long or use double; or just use the compress command and let automatic promotion handle the problem for you.

    4.5 Unlike byte, int, and long, float and double variables are not promoted to preserve integer precision.

    Float values are not promoted because, well, they are not. Actually, there is a deep reason, but it has to do with the use of float variables for their real purpose, which is to store non-integer values.

    Double values are not promoted because there is nothing to promote them to. Double is Stata’s most precise storage type. The largest integer value Stata can store precisely is 9,007,199,254,740,992 and the smallest is -9,007,199,254,740,992.

    Integer values outside the range for doubles round in the same way that float values round, except at absolutely larger values.

  6. Floating-point precision

    5.1 The smallest, nonzero value that can be stored in float and double is

         Storage
         type      value          value in %21x         value in base 10
         -----------------------------------------------------------------
         float     ±2^-127    ±1.0000000000000X-07f   ±5.877471754111e-039
         double    ±2^-1022   ±1.0000000000000X-3fe   ±2.225073858507e-308
         -----------------------------------------------------------------

    We include the value shown in the third column, the value in %21x, for those who know how to read it. It is described in (9), but it is unimportant. We are merely emphasizing that these are the smallest values for properly normalized numbers.

    5.2 The smallest value of epsilon such that 1+epsilon ≠ 1 is

         Storage
         type      epsilon       epsilon in %21x        epsilon in base 10
         -----------------------------------------------------------------
         float      ±2^-23     ±1.0000000000000X-017    ±1.19209289551e-07
         double     ±2^-52     ±1.0000000000000X-034    ±2.22044604925e-16
         -----------------------------------------------------------------

    Epsilon is the distance from 1 to the next number on the floating-point number line. The corresponding unit roundoff error is u = ±epsilon/2. The unit roundoff error is the maximum relative roundoff error that is introduced by the floating-point number storage scheme.

    The smallest value of epsilon such that x+epsilon ≠ x is approximately |x|*epsilon, and the corresponding unit roundoff error is ±|x|*epsilon/2.

    5.3 The precision of the floating-point types is, depending on how you want to measure it,

         Measurement                           float              double
         ----------------------------------------------------------------
         # of binary digits                       23                  52
         # of base 10 digits (approximate)         7                  16 
    
         Relative precision                   ±2^-24              ±2^-53
         ... in base 10 (approximate)      ±5.96e-08           ±1.11e-16
         ----------------------------------------------------------------

    Relative precision is defined as

                           |x - x_as_stored|
                  ± max   ------------------    
                     x            x

    performed using infinite precision arithmetic, x chosen from the subset of reals between the minimum and maximum values that can be stored. It is worth appreciating that relative precision is a worst-case relative error over all possible numbers that can be stored. Relative precision is identical to roundoff error, but perhaps this definition is easier to appreciate.

    5.4 Stata never makes calculations in float precision, even if the data are stored as float.

    Stata makes double-precision calculations regardless of how the numeric data are stored. In some cases, Stata internally uses quad precision, which provides approximately 32 decimal digits of precision. If the result of the calculation is being stored back into a variable in the dataset, then the double (or quad) result is rounded as necessary to be stored.

    5.5 (False precision.) Double precision is 536,870,912 times more accurate than float precision. You may worry that float precision is inadequate to accurately record your data.

    Little in this world is measured to a relative accuracy of ±2-24, the accuracy provided by float precision.

    Ms. Smith, it is reported, made $112,293 this year. Do you believe that is recorded to an accuracy of ±2-24*112,293, or approximately ±0.7 cents?

    David was born on 21jan1952, so on 27mar2012 he was 21,981 days old, or 60.18 years old. Recorded in float precision, the precision is ±60.18*2-24, or roughly ±1.89 minutes.

    Joe reported that he drives 12,234 miles per year. Do you believe that Joe’s report is accurate to ±12,234*2-24, equivalent to ±3.85 feet?

    A sample of 102,400 people reported that they drove, in total, 1,252,761,600 miles last year. Is that accurate to ±74.7 miles (float precision)? If it is, each of them is reporting with an accuracy of roughly ±3.85 feet.

    The distance from the Earth to the moon is often reported as 384,401 kilometers. Recorded as a float, the precision is ±384,401*2-24, or ±23 meters, or ±0.023 kilometers. Because the number was not reported as 384,401.000, one would assume float precision would be accurate to record that result. In fact, float precision is more than sufficiently accurate to record the distance because the distance from the Earth to the moon varies from 356,400 to 406,700 kilometers, some 50,300 kilometers. The distance would have been better reported as 384,401 ±25,150 kilometers. At best, the measurement 384,401 has relative accuracy of ±0.033 (it is accurate to roughly two digits).

    Nonetheless, a few things have been measured with more than float accuracy, and they stand out as crowning accomplishments of mankind. Use double as required.

  7. Advice concerning 0.1, 0.2, …

    6.1 Stata uses base 2, binary. Popular numbers such as 0.1, 0.2, 100.21, and so on, have no exact binary representation in a finite number of binary digits. There are a few exceptions, such as 0.5 and 0.25, but not many.

    6.2 If you create a float variable containing 1.1 and list it, it will list as 1.1 but that is only because Stata’s default display format is %9.0g. If you changed that format to %16.0g, the result would appear as 1.1000000238419.

    This scares some users. If this scares you, go back and read (5.5) False Precision. The relative error is still a modest ±2-24. The number 1.1000000238419 is likely a perfectly acceptable approximation to 1.1 because the 1.1 was never measured to an accuracy of less than ±2-24 anyway.

    6.3 One reason perfectly acceptable approximations to 1.1 such as 1.1000000238419 may bother you is that you cannot select observations containing 1.1 by typing if x==1.1 if x is a float variable. You cannot because the 1.1 on the right is interpreted as double precision 1.1. To select the observations, you have to type if x==float(1.1).

    6.4 If this bothers you, record the data as doubles. It is best to do this at the point when you read the original data or when you make the original calculation. The number will then appear to be 1.1. It will not really be 1.1, but it will have less relative error, namely, ±2-53.

    6.5 If you originally read the data and stored them as floats, it is still sometimes possible to recover the double-precision accuracy just as if you had originally read the data into doubles. You can do this if you know how many decimal digits were recorded after the decimal point and if the values are within a certain range.

    If there was one digit after the decimal point and if the data are in the range [-1,048,576, 1,048,576], which means the values could be -1,048,576, -1,048,575.9, …, -1, 0, 1, …, 1,048,575.9, 1,048,576, then typing

    . gen double y = round(x*10)/10

    will recover the full double-precision result. Stored in y will be the number in double precision just as if you had originally read it that way.

    It is not possible, however, to recover the original result if x is outside the range ±1,048,576 because the float variable contains too little information.

    You can do something similar when there are two, three, or more decimal digits:

         # digits to
         right of 
         decimal pt.   range     command
         -----------------------------------------------------------------
             1      ±1,048,576   gen double y = round(x*10)/10
             2      ±  131,072   gen double y = round(x*100)/100
             3      ±   16,384   gen double y = round(x*1000)/1000
             4      ±    1,024   gen double y = round(x*10000)/10000
             5      ±      128   gen double y = round(x*100000)/100000
             6      ±       16   gen double y = round(x*1000000)/1000000
             7      ±        1   gen double y = round(x*10000000)/10000000
         -----------------------------------------------------------------

    Range is the range of x over which command will produce correct results. For instance, range = ±16 in the next-to-the-last line means that the values recorded in x must be -16 ≤ x ≤ 16.

  8. Advice concerning exact data, such as currency data

    7.1 Yes, there are exact data in this world. Such data are usually counts of something or are currency data, which you can think of as counts of pennies ($0.01) or the smallest unit in whatever currency you are using.

    7.2 Just because the data are exact does not mean you need exact answers. It may still be that calculated answers are adequate if the data are recorded to a relative accuracy of ±2-24 (float). For most analyses — even of currency data — this is often adequate. The U.S. deficit in 2011 was $1.5 trillion. Stored as a float, this amount has a (maximum) error of ±2-24*1.5e+12 = ±$89,406.97. It would be difficult to imagine that ±$89,406.97 would affect any government decision maker dealing with the full $1.5 trillion.

    7.3 That said, you sometimes do need to make exact calculations. Banks tracking their accounts need exact amounts. It is not enough to say to account holders that we have your money within a few pennies, dollars, or hundreds of dollars.

    In that case, the currency data should be converted to integers (pennies) and stored as integers, and then processed as described in (4). Assuming the dollar-and-cent amounts were read into doubles, you can convert them into pennies by typing

    . replace x = x*100

    7.4 If you mistakenly read the currency data as a float, you do not have to re-read the data if the dollar amounts are between ±$131,072. You can type

    . gen double x_in_pennies = round(x*100)

    This works only if x is between ±131,072.

  9. Advice for programmers

    8.1 Stata does all calculations in double (and sometimes quad) precision.

    Float precision may be adequate for recording most data, but float precision is inadequate for performing calculations. That is why Stata does all calculations in double precision. Float precision is also inadequate for storing the results of intermediate calculations.

    There is only one situation in which you need to exercise caution — if you create variables in the data containing intermediate results. Be sure to create all such variables as doubles.

    8.2 The same quad-precision routines StataCorp uses are available to you in Mata; see the manual entries [M-5] mean, [M-5] sum, [M-5] runningsum, and [M-5] quadcross. Use them as you judge necessary.

  10. How to interpret %21x format (if you care)

    9.1 Stata has a display format that will display IEEE 754-2008 floating-point numbers in their full binary glory but in a readable way. You probably do not care; if so, skip this section.

    9.2 IEEE 754-2008 floating-point numbers are stored as a pair of numbers (a, b) that are given the interpretation

    z = a * 2b

    where -2 < a < 2. In double precision, a is recorded with 52 binary digits. In float precision, a is recorded with 23 binary digits. For example, the number 2 is recorded in double precision as

    a = +1.0000000000000000000000000000000000000000000000000000
    b = +1

    The value of pi is recorded as

    a = +1.1001001000011111101101010100010001000010110100011000
    b = +1

    9.3 %21x presents a and b in base 16. The double-precision value of 2 is shown in %21x format as

    +1.0000000000000X+001

    and the value of pi is shown as

    +1.921fb54442d18X+001

    In the case of pi, the interpretation is

    a = +1.921fb54442d18 (base 16)
    b = +001             (base 16)

    Reading this requires practice. It helps to remember that one-half corresponds to 0.8 (base 16). Thus, we can see that a is slightly larger than 1.5 (base 10) and b = 1 (base 10), so _pi is something over 1.5*21 = 3.

    The number 100,000 in %21x is

    +1.86a0000000000X+010

    which is to say

    a = +1.86a0000000000 (base 16)
    b = +010             (base 16)

    We see that a is slightly over 1.5 (base 10), and b is 16 (base 10), so 100,000 is something over 1.5*216 = 98,304.

    9.4 %21x faithfully presents how the computer thinks of the number. For instance, we can easily see that the nice number 1.1 (base 10) is, in binary, a number with many digits to the right of the binary point:

    . display %21x 1.1
    +1.199999999999aX+000

    We can also see why 1.1 stored as a float is different from 1.1 stored as a double:

    . display %21x float(1.1)
    +1.19999a0000000X+000

    Float precision assigns fewer digits to the mantissa than does double precision, and 1.1 (base 10) in base 16 is a repeating hexadecimal.

    9.5 %21x can be used as an input format as well as an output format. For instance, Stata understands

    . gen x = 1.86ax+10

    Stored in x will be 100,000 (base 10).

    9.6 StataCorp has seen too many competent scientific programmers who, needing a perturbance for later use in their program, code something like

    epsilon = 1e-8

    It is worth examining that number:

    . display %21x 1e-8
    +1.5798ee2308c3aX-01b

    That is an ugly number that can only lead to the introduction of roundoff error in their program. A far better number would be

    epsilon = 1.0x-1b

    Stata and Mata understand the above statement because %21x may be used as input as well as output. Naturally, 1.0x-1b looks just like what it is,

    . display %21x 1.0x-1b
    +1.0000000000000X-01b

    and all those pretty zeros will reduce numerical roundoff error.

    In base 10, the pretty 1.0x-1b looks like

    . display %20.0g 1.0x-1b
    7.4505805969238e-09

    and that number may not look pretty to you, but you are not a base-2 digital computer.

    Perhaps the programmer feels that epsilon really needs to be closer to 1e-8. In %21x, we see that 1e-8 is +1.5798ee2308c3aX-01b, so if we want to get closer, perhaps we use

    epsilon = 1.6x-1b

    9.7 %21x was invented by StataCorp.

  11. Also see

    If you wish to learn more, see

    How to read the %21x format

    How to read the %21x format, part 2

    Precision (yet again), Part I

    Precision (yet again), Part II

The next leap second will be on June 30th, maybe

Leap seconds are the extra seconds inserted every so often to keep precise atomic clocks better synchronized with the rotation of the Earth. Scheduled for June 30th is the extra second 23:59:60 inserted between 23:59:59 and 00:00:00. Or maybe not.

Tomorrow or Friday a vote may be held at the International Telecommuncation Union (ITU) meeting in Geneva to abolish the leap second from the definition of UTC (Coordinated Universial Time). Which would mean StataCorp would not have to post an update to Stata to keep the %tC format working correctly.

As I’ve blogged before — scroll down to “Why Stata has two datetime encodings” in Using dates and times from other software — Stata supports both UTC time (%tC) and constant-86,400-seconds/day time (%tc). Stata does that because some data are collected using leap-second corrected time, and some uncorrected. Stata is unique or nearly unique in providing both time formats.

I read that Google does something very clever: they strech the last second of the year out when a leap second occurs, so the data they collect does not end up with ugly times like 23:59:60, and so that it can be more easily processed by software that assumes a constant 86,400 seconds per day.

The IT industry and a number of others, I gather, are pretty united about the benefits of scrapping the leap second.

The vote is predicted to go against continuing the leap second, according to The Economist magazine. The United States and France are for abolishing the leap second. Britain, Canada, and China are believed to be for continuing it. Some 192 countries will get to vote.

Whichever way the vote goes, I would like to remind readers of advice I previously offered to help alleviate the need for leap seconds: Face west and throw rocks. As I previously noted, the benefit will be transitory if the rocks land back on Earth, so you need to throw the rocks really hard. Having now thought more about this issue, a less strenuous way occurs to me: Push rocks downhill or carry them toward the poles, and preferably do both. These suggestions are designed to attack the real problem, which is that the Earth is currently rotating too slowly.

Categories: Data Management Tags: ,

Good company

Dembe, Partridge, and Geist (2011, pdf), in a paper recently published in BMC Health Services Research, report that Stata and SAS were “overwhelmingly the most commonly used software applications employed (in 46% and 42.6% of articles respectively)”. The articles referred to were those in health services research studies published in the U.S.

Good company. Both are, in our humble opinion, excellent packages, although we admit to have a preference for one of them.

We should mention that the authors report that SAS usage grew considerably during the study period, and that Stata usage held roughly constant, a conclusion that matches the results in their Table 1, an extract of which is

2007 2008 2009 2007-2009
total articles 393 374 372 1,139
included articles 282 308 287 877
% Stata used 48.3 42.6 47.4 46.0
% SAS used 37.2 43.1 47.4 42.6

The authors speculated that the growth of SAS “may have been stimulated by enhancements [...] that gave users the ability to use balanced repeated replication (BRR) and jackknife methods for variance estimation with complex survey data [...]“. Since those features were already in Stata, that sounds reasonable to us.

Let us just say, good company. Good companies.

Categories: Company Tags: , , ,

Advanced Mata: Pointers

I’m still recycling my talk called “Mata, The Missing Manual” at user meetings, a talk designed to make Mata more approachable. One of the things I say late in the talk is, “Unless you already know what pointers are and know you need them, ignore them. You don’t need them.” And here I am writing about, of all things, pointers. Well, I exaggerated a little in my talk, but just a little.

Before you take my previous advice and stop reading, let me explain: Mata serves a number of purposes and one of them is as the primary langugage we at StataCorp use to implement new features in Stata. I’m not referring to mock ups, toys, and experiments, I’m talking about ready-to-ship code. Stata 12′s Structural Equation Modeling features are written in Mata, so is Multiple Imputation, so is Stata’s optimizer that is used by nearly all estimation commands, and so are most features. Mata has a side to it that is exceedingly serious and intended for use by serious developers, and every one of those features are available to users just as they are to StataCorp developers. This is one of the reasons there are so many user-written commands are available for Stata. Even if you don’t use the serious features, you benefit.

So every so often I need to take time out and address the concerns of these user/developers. I knew I needed to do that now when Kit Baum emailed a question to me that ended with “I’m stumped.” Kit is the author of An Introduction to Stata Programming which has done more to make Mata approachable to professional researchers than anything StataCorp has done, and Kit is not often stumped.

I have a certain reptutation about how I answer most questions. “Why do you want to do that?” I invariably reply, or worse, “You don’t want to do that!” and then go on to give the answer to the question I wished they had asked. When Kit asks a question, however, I just answer it. Kit asked a question about pointers by setting up an artificial example and I have no idea what his real motivation was, so I’m not even going to try to motivate the question for you. The question is interesting in and of itself anyway.

Here is Kit’s artificial example:

real function x2(real scalar x) return(x^2)
 
real function x3(real scalar x) return(x^3) 

void function tryit() 
{
        pointer(real scalar function) scalar fn
        string rowvector                     func
        real scalar                          i

        func = ("x2", "x3")
        for(i=1;i<=length(func);i++) {
                fn = &(func[i])
                (*fn)(4)
        }
}

Kit is working with pointers, and not just pointers to variables, but pointers to functions. A pointer is the memory address, the address where the variable or function is stored. Real compilers translate names into memory addresses which is one of the reasons real compilers produce code that runs fast. Mata is a real compiler. Anyway, pointers are memory addresses, such as 58, 212,770, 427,339,488, except the values are usually written in hexadecimal rather than decimal. In the example, Kit has two functions, x2(x) and x3(x). Kit wants to create a vector of the function addresses and then call each of the functions in the vector. In the artificial example, he's calling each with an argument of 4.

The above code does not work:

: tryit()
         tryit():  3101  matrix found where function required
         <istmt>:     -  function returned error

The error message is from the Mata compiler and it's complaining about the line

        (*fn)(4)

but the real problem is earlier in the tryit() code.

One corrected version of tryit() would read,

void function tryit()
{
        pointer(real scalar function) scalar fn
        pointer(real scalar function) vector func     // <---
        real scalar                          i

        func = (&x2(), &x3())                         // <---
        for(i=1;i<=length(func);i++) {
                fn = func[i]                          // <---
                (*fn)(4)
        }
}

If you make the three changes I marked, tryit() works:

: tryit()
  16
  64

I want to explain this code and alternative ways the code could have been fixed. It will be easier if we just work interactively, so let's start all over again:

: real scalar x2(x) return(x^2)

: real scalar x3(x) return(x^3)

: func = (&x2(), &x3())

Let's take a look at what is in func:

: func
                1            2
    +---------------------------+
  1 |  0x19551ef8   0x19552048  |
    +---------------------------+

Those are memory addresses. When we typed &x2() and &x3() in the line

: func = (&x2(), &x3())

functions x2() and x3() were not called. &x2() and &x3() instead evaluate to the addresses of the functions named x2() and x3(). I can demonstrate this:

: &x2()
  0x19551ef8

0x19551ef8 is the memory address of where the function x2() is stored. 0x19551ef8 may not look like a number, but that is only because it is presented in base 16. 0x19551ef8 is in fact the number 425,008,888, and the compiled code for the function x2() starts at the 425,008,888th byte of memory and continues thereafter.

Let's assign to fn the value of the address of one of the functions, say x2(). I could do that by typing

: fn = func[1]

or by typing

: fn = &x2()

and either way, when I look at fn, it contains a memory address:

: fn
  0x19551ef8

Let's now call the function whose address we have stored in fn:

: (*fn)(2)
  4

When we call a function and want to pass 2 as an argument, we normally code f(2). In this case, we substitute (*fn) for f because we do not want to call the function named f(), we want to call the function whose address is stored in variable fn. The operator * usually means multiplication, but when * is used as a prefix, it means something different, in much the same way the minus operator - can be subtract or negate. The meaning of unary * is "the contents of". When we code *fn, we mean not the value 425,008,888 stored in fn, we mean the contents of the memory address 425,008,888, which happens to be the function x2().

We type (*fn)(2) and not *fn(2) because *fn(2) would be interpreted to mean *(fn(2)). If there were a function named fn(), that function would be called with argument 2, the result obtained, and then the star would take the contents of that memory address, assuming fn(2) returned a memory address. If it didn't, we'd get a type mismatch error.

The syntax can be confusing until you understand the reasoning behind it. Let's start with all new names. Consider something named X. Actually, there could be two different things named X and Mata would not be confused. There could be a variable named X and there could be a function named X(). To Mata, X and X() are different things, or said in the jargon, have different name spaces. In Mata, variables and functions can have the same names. Variables and functions having the same names in C is not allowed -- C has only one name space. So in C, you can type

fn = &x2

to obtain the address of variable x2 or function x2(), but in Mata, the above means the address of the variable x2, and if there is no such variable, that's an error. In Mata, to obtain the address of function x2(), you type

fn = &x2()

The syntax &x2() is a definitional nugget; there is no taking it apart to understand its logic. But we can take apart the logic of the programmer who defined the syntax. & means "address of" and &thing means to take the address of thing. If thing is a name -- &name -- that means to look up name in the variable space and return its address. If thing is name(), that means look up name in the function space and return its address. They way we formally write this grammar is

 &thing, where 

 thing  :=   name
             name()
             exp

There are three possibilities for thing; it's a name or it's a name followed by () or it's an expression. The last is not much used. &2 creates a literal 2 and then tells you the address where the 2 is stored, which might be 0x195525d8. &(2+3) creates 5 and then tells you where the 5 is stored.

But let's get back to Kit's problem. Kit coded,

func = ("x2", "x3")

and I said no, code instead

func = (&x2(), &x3())

You do not use strings to obtain pointers, you use the actual name prefixed by ampersand.

There's a subtle difference in what Kit was trying to code and what I did code, however. In what Kit tried to code, Kit was seeking "run-time binding". I, however, coded "compile-time binding". I'm about to explain the difference and show you how to achieve run-time binding, but before I do, let me tell you that

  1. You probably want compile-time binding.
  2. Compile-time binding is faster.
  3. Run-time binding is sometimes required, but when persons new to pointers think they need run-time binding, they usually do not.

Let me define compile-time and run-time binding:

  1. Binding refers to establishing addresses corresponding to names and names(). The names are said to be bound to the address.

  2. In compile-time binding, the addresses are established at the time the code is compiled.

    More correctly, compile-time binding does not really occur at the time the code is compiled, it occurs when the code is brought together for execution, an act called linking and which happens automatically in Mata. This is a fine and unimportant distiction, but I do not want you to think that all the functions have to be compiled at the same time or that the order in which they are compiled matters.

    In compile-time binding, if any functions are missing when the code is brought together for execution, and error message is issued.

  3. In run-time binding, the addresses are established at the time the code is executed (run), which happens after compilation, and after linking, and is an explicit act performed by you, the programmer.

To obtain the address of a variable or function at run-time, you use built-in function findexternal(). findexternal() takes one argument, a string scalar, containing the name of the object to be found. The function looks up that name and returns the address corresponding to it, or it returns NULL if the object cannot be found. NULL is the word used to mean invalid memory address and is in fact defined as equaling zero.

findexternal() can be used only with globals. The other variables that appear in your program might appear to have names, but those names are used solely by the compiler and, in the compiled code, these "stack-variables" or "local variables" are referred to by their addresses. The names play no other role and are not even preserved, so findexternal() cannot be used to obtain their addresses. There would be no reason you would want findexternal() to find their addresses because, in all such cases, the ampersand prefix is a perfect substitute.

Functions, however, are global, so we can look up functions. Watch:

: findexternal("x2()")
  0x19551ef8

Compare that with

: &x2()
  0x19551ef8

It's the same result, but they were produced differently. In the findexternal() case, the 0x19551ef8 result was produced after the code was compiled and assembled. The value was obtained, in fact, by execution of the findexternal() function.

In the &x2() case, the 0x19551ef8 result was obtained during the compile/assembly process. We can better understand the distinction if we look up a function that does not exist. I have no function named x4(). Let's obtain x4()'s address:

: findexternal("x4()")
  0x0

: &x4()
         <istmt>:  3499  x4() not found

I may have no function named x4(), but that didn't bother findexternal(). It merely returned 0x0, another way of saying NULL.

In the &x4() case, the compiler issued an error. The compiler, faced with evaluating &x4(), could not, and so complained.

Anyway, here is how we could write tryit() with run-time binding using the findexternal() function:

void function tryit() 
{
        pointer(real scalar function) scalar fn
        pointer(real scalar function) vector func
        real scalar                          i

        func = (findexternal("x2()"), findexternal("x3()")

        for(i=1;i<=length(func);i++) {
                fn = func[i]
                (*fn)(4)
        }
}

To obtain run-time rather than compile-time bindings, all I did was change the line

        func = (&x2(), &x3())

to be

        func = (findexternal("x2()"), findexternal("x3()")

Or we could write it this way:

void function tryit() 
{
        pointer(real scalar function) scalar fn
        string vector                        func
        real scalar                          i

        func = ("x2()", "x3()")

        for(i=1;i<=length(func);i++) {
                fn = findexternal(func[i])
                (*fn)(4)
        }
}

In this variation, I put the names in a string vector just as Kit did originally. Then I changed the line that Kit wrote,

        fn = &(func[i])

to read

        fn = findexternal(func[i])

Either way you code it, when performing run-time binding, you the programmer should deal with what is to be done if the function is not found. The loop

for(i=1;i<=length(func);i++) {
        fn = findexternal(func[i])
        (*fn)(4)
}

would better read

for(i=1;i<=length(func);i++) {
        fn = findexternal(func[i])
        if (fn!=NULL) {
                (*fn)(4)
        }
        else {
                ...
        }
}

Unlike C, if you do not include the code for the not-found case, the program will not crash if the function is not found. Mata will give you an "invalid use of NULL pointer" error message and a traceback log.

If you were writing a program in which the user of your program was to pass to you a function you were to use, such as a likelihood function to be maximized, you could write your program with compile-time binding by coding,

function myopt(..., pointer(real scalar function) scalar f, ...)
{
        ...
        ... (*f)(...) ...
        ...
}

and the user would call you program my coding myopt(..., &myfunc(), ...), or you could use run-time binding by coding

function myopt(..., string scalar fname, ...)
{
        pointer(real scalar function) scalar f
        ...

        f = findexternal(fname)
        if (f==NULL) {
                errprintf("function %s() not found\n", fname)
                exit(111)
        }
        ...
        ... (*f)(...) ...
        ...
}

and the user would call your program by coding myopt(..., "myfunc()", ...).

In this case I could be convinced to prefer the run-time binding solution for professional code because, the error being tolerated by Mata, I can write code to give a better, more professional looking error message.

Categories: Mata Tags: , ,

Use poisson rather than regress; tell a friend

Do you ever fit regressions of the form

ln(yj) = b0 + b1x1j + b2x2j + … + bkxkj + εj

by typing

. generate lny = ln(y)

. regress lny x1 x2 … xk

The above is just an ordinary linear regression except that ln(y) appears on the left-hand side in place of y.

The next time you need to fit such a model, rather than fitting a regression on ln(y), consider typing

. poisson y x1 x2 … xk, vce(robust)

which is to say, fit instead a model of the form

yj = exp(b0 + b1x1j + b2x2j + … + bkxkj + εj)

Wait, you are probably thinking. Poisson regression assumes the variance is equal to the mean,

E(yj) = Var(yj) = exp(b0 + b1x1j + b2x2j + … + bkxkj)

whereas linear regression merely assumes E(ln(yj)) = b0 + b1x1j + b2x2j + … + bkxkj and places no constraint on the variance. Actually regression does assume the variance is constant but since we are working the logs, that amounts to assuming that Var(yj) is proportional to yj, which is reasonable in many cases and can be relaxed if you specify vce(robust).

In any case, in a Poisson process, the mean is equal to the variance. If your goal is to fit something like a Mincer earnings model,

ln(incomej) = b0 + b1*educationj + b2*experiencej + b3*experiencej2 + εj

there is simply no reason to think that the the variance of the log of income is equal to its mean. If a person has an expected income of $45,000, there is no reason to think that the variance around that mean is 45,000, which is to say, the standard deviation is $212.13. Indeed, it would be absurd to think one could predict income so accurately based solely on years of schooling and job experience.

Nonetheless, I suggest you fit this model using Poisson regression rather than linear regression. It turns out that the estimated coefficients of the maximum-likelihood Poisson estimator in no way depend on the assumption that E(yj) = Var(yj), so even if the assumption is violated, the estimates of the coefficients b0, b1, …, bk are unaffected. In the maximum-likelihood estimator for Poisson, what does depend on the assumption that E(yj) = Var(yj) are the estimated standard errors of the coefficients b0, b1, …, bk. If the E(yj) = Var(yj) assumption is violated, the reported standard errors are useless. I did not suggest, however, that you type

. poisson y x1 x2 … xk

I suggested that you type

. poisson y x1 x2 … xk, vce(robust)

That is, I suggested that you specify that the variance-covariance matrix of the estimates (of which the standard errors are the square root of the diagonal) be estimated using the Huber/White/Sandwich linearized estimator. That estimator of the variance-covariance matrix does not assume E(yj) = Var(yj), nor does it even require that Var(yj) be constant across j. Thus, Poisson regression with the Huber/White/Sandwich linearized estimator of variance is a permissible alternative to log linear regression — which I am about to show you — and then I’m going to tell you why it’s better.

I have created simulated data in which

yj = exp(8.5172 + 0.06*educj + 0.1*expj – 0.002*expj2 + εj)

where εj is distributed normal with mean 0 and variance 1.083 (standard deviation 1.041). Here’s the result of estimation using regress:

 
. regress lny educ exp exp2
 
      Source |       SS       df       MS              Number of obs =    5000
-------------+------------------------------           F(  3,  4996) =   44.72
       Model |  141.437342     3  47.1457806           Prob > F      =  0.0000
    Residual |  5267.33405  4996  1.05431026           R-squared     =  0.0261
-------------+------------------------------           Adj R-squared =  0.0256
       Total |  5408.77139  4999  1.08197067           Root MSE      =  1.0268
 
------------------------------------------------------------------------------
         lny |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0716126   .0099511     7.20   0.000      .052104    .0911212
         exp |   .1091811   .0129334     8.44   0.000     .0838261    .1345362
        exp2 |  -.0022044   .0002893    -7.62   0.000    -.0027716   -.0016373
       _cons |   8.272475   .1855614    44.58   0.000     7.908693    8.636257
------------------------------------------------------------------------------

I intentionally created these data to produce a low R-squared.

We obtained the following results:

 
                   truth      est.    S.E.
        ----------------------------------
        educ      0.0600    0.0716  0.0100
        exp       0.1000    0.1092  0.0129
        exp2     -0.0020   -0.0022  0.0003
       -----------------------------------
        _cons     8.5172    8.2725  0.1856   <- unadjusted (1)
                  9.0587    8.7959     ?     <-   adjusted (2)
       -----------------------------------
       (1) To be used for predicting E(ln(yj))
       (2) To be used for predicting E(yj)

Note that the estimated coefficients are quite close to the true values. Ordinarily, we would not know the true values, except I created this artificial dataset and those are the values I used.

For the intercept, I list two values, so I need to explain. We estimated a linear regression of the form,

ln(yj) = b0 + Xjb + εj

As with all linear regressions,

 
     E(ln(yj)) = E(b0 + Xjb + εj)
               = b0 + Xjb + E(εj)
               = b0 + Xjb 

We, however, have no real interest in E(ln(yj)). We fit this log regression as a way of obtaining estimates of our real model, namely

yj = exp(b0 + Xjb + εj)

So rather than taking the expectation of ln(yj), lets take the expectation of yj:

 
E(yj) = E(exp(b0 + Xjb + εj))
      = E(exp(b0 + Xjb) * exp(εj))
      = exp(b0 + Xjb) * E(exp(εj))

E(exp(εj)) is not one. E(exp(εj)) for εj distributed N(0, σ2) is exp(σ2/2). We thus obtain

E(yj) = exp(b0 + Xjb) * exp(σ2/2)

People who fit log regressions know about this — or should — and know that to obtain predicted yj values, they must

  1. Obtain predicted values for ln(yj) = b0 + Xjb.

  2. Exponentiate the predicted log values.

  3. Multiply those exponentiated values by exp(σ2/2), where σ2 is the square of the root-mean-square-error (RMSE) of the regression.

They do in this in Stata by typing

. predict yhat

. replace yhat = exp(yhat).

. replace yhat = yhat*exp(e(rmse)^2/2)

In the table I that just showed you,

 
                   truth      est.    S.E.
        ----------------------------------
        educ      0.0600    0.0716  0.0100
        exp       0.1000    0.1092  0.0129
        exp2     -0.0020   -0.0022  0.0003
       -----------------------------------
        _cons     8.5172    8.2725  0.1856   <- unadjusted (1)
                  9.0587    8.7959     ?     <-   adjusted (2)
       -----------------------------------
       (1) To be used for predicting E(ln(yj))
       (2) To be used for predicting E(yj)

I’m setting us up to compare these estimates with those produced by poisson. When we estimate using poisson, we will not need to take logs because the Poisson model is stated in terms of yj, not ln(yj). In prepartion for that, I have included two lines for the intercept — 8.5172, which is the intercept reported by regress and is the one appropriate for making predictions of ln(y) — and 9.0587, an intercept appropriate for making predictions of y and equal to 8.5172 plus σ2/2. Poisson regression will estimate the 9.0587 result because Poisson is stated in terms of y rather than ln(y).

I placed a question mark in the column for the standard error of the adjusted intercept because, to calculate that, I would need to know the standard error of the estimated RMSE, and regress does not calculate that.

Let’s now look at the results that poisson with option vce(robust) reports. We must not forget to specify option vce(robust) because otherwise, in this model that violates the Poisson assumption that E(yj) = Var(yj), we would obtain incorrect standard errors.

 
. poisson y educ exp exp2, vce(robust) 
note: you are responsible for interpretation of noncount dep. variable
 
Iteration 0:   log pseudolikelihood = -1.484e+08  
Iteration 1:   log pseudolikelihood = -1.484e+08  
Iteration 2:   log pseudolikelihood = -1.484e+08  
 
Poisson regression                                Number of obs   =       5000
                                                  Wald chi2(3)    =      67.52
                                                  Prob > chi2     =     0.0000
Log pseudolikelihood = -1.484e+08                 Pseudo R2       =     0.0183
 
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0575636   .0127996     4.50   0.000     .0324769    .0826504
         exp |   .1074603   .0163766     6.56   0.000     .0753628    .1395578
        exp2 |  -.0022204   .0003604    -6.16   0.000    -.0029267   -.0015141
       _cons |   9.016428   .2359002    38.22   0.000     8.554072    9.478784
------------------------------------------------------------------------------

So now we can fill in the rest of our table:

 
                               regress            poisson
                   truth      est.    S.E.      est.     S.E.
        -----------------------------------------------------
        educ      0.0600    0.0716  0.0100     0.0576  0.1280
        exp       0.1000    0.1092  0.0129     0.1075  0.0164
        exp2     -0.0020   -0.0022  0.0003    -0.0022  0.0003
       ------------------------------------------------------
        _cons     8.5172    8.2725  0.1856          ?       ?   <- (1)
                  9.0587    8.7959       ?     9.0164  0.2359   <- (2)
       ------------------------------------------------------
       (1) To be used for predicting E(ln(yj))
       (2) To be used for predicting E(yj)

I told you that Poisson works, and in this case, it works well. I’ll now tell you that in all cases it works well, and it works better than log regression. You want to think about Poisson regression with the vce(robust) option as a better alternative to log regression.

How is Poisson better?

First off, Poisson handles outcomes that are zero. Log regression does not because ln(0) is -∞. You want to be careful about what it means to handle zeros, however. Poisson handles zeros that arise in correspondence to the model. In the Poisson model, everybody participates in the yj = exp(b0 + Xjb + εj) process. Poisson regression does not handle cases where some participate and others do not, and among those who do not, had they participated, would likely produce an outcome greater than zero. I would never suggest using Poisson regression to handle zeros in an earned income model because those that earned zero simply didn’t participate in the labor force. Had they participated, their earnings might have been low, but certainly they would have been greater than zero. Log linear regression does not handle that problem, either.

Natural zeros do arise in other situations, however, and a popular question on Statalist is whether one should recode those natural zeros as 0.01, 0.0001, or 0.0000001 to avoid the missing values when using log linear regression. The answer is that you should not recode at all; you should use Poisson regression with vce(robust).

Secondly, small nonzero values, however they arise, can be influential in log-linear regressions. 0.01, 0.0001, 0.0000001, and 0 may be close to each other, but in the logs they are -4.61, -9.21, -16.12, and -∞ and thus not close at all. Pretending that the values are close would be the same as pretending that that exp(4.61)=100, exp(9.21)=9,997, exp(16.12)=10,019,062, and exp(∞)=∞ are close to each other. Poisson regression understands that 0.01, 0.0001, 0.0000001, and 0 are indeed nearly equal.

Thirdly, when estimating with Poisson, you do not have to remember to apply the exp(σ2/2) multiplicative adjustment to transform results from ln(y) to y. I wrote earlier that people who fit log regressions of course remember to apply the adjustment, but the sad fact is that they do not.

Finally, I would like to tell you that everyone who estimates log models knows about the Poisson-regression alternative and it is only you who have been out to lunch. You, however, are in esteemed company. At the recent Stata Conference in Chicago, I asked a group of knowledgeable researchers a loaded question, to which the right answer was Poisson regression with option vce(robust), but they mostly got it wrong.

I said to them, “I have a process for which it is perfectly reasonable to assume that the mean of yj is given by exp(b0 + Xjb), but I have no reason to believe that E(yj) = Var(yj), which is to say, no reason to suspect that the process is Poisson. How would you suggest I estimate the model?” Certainly not using Poisson, they replied. Social scientists suggested I use log regression. Biostatisticians and health researchers suggested I use negative binomial regression even when I objected that the process was not the gamma mixture of Poissons that negative binomial regression assumes. “What else can you do?” they said and shrugged their collective shoulders. And of course, they just assumed over dispersion.

Based on those answers, I was ready to write this blog entry, but it turned out differently than I expected. I was going to slam negative binomial regression. Negative binomial regression makes assumptions about the variance, assumptions different from that made by Poisson, but assumptions nonetheless, and unlike the assumption made in Poisson, those assumptions do appear in the first-order conditions that determine the fitted coefficients that negative binomial regression reports. Not only would negative binomial’s standard errors be wrong — which vce(robust) could fix — but the coefficients would be biased, too, and vce(robust) would not fix that. I planned to run simulations showing this.

When I ran the simulations, I was surprised by the results. The negative binomial estimator (Stata’s nbreg) was remarkably robust to violations in variance assumptions as long as the data were overdispersed. In fact, negative binomial regression did about as well as Poisson regression. I did not run enough simulations to make generalizations, and theory tells me those generalizations have to favor Poisson, but the simulations suggested that if Poisson does do better, it’s not in the first four decimal places. I was impressed. And disappointed. It would have been a dynamite blog entry.

So you’ll have to content yourself with this one.

Others have preceeded me in the knowledge that Poisson regression with vce(robust) is a better alternative to log-linear regression. I direct you to Jeffery Wooldridge, Econometric Analysis of Cross Section and Panel Data, 2nd ed., chapter 18. Or see A. Colin Cameron and Pravin K. Trivedi, Microeconomics Using Stata, revised edition, chapter 17.3.2.

I first learned about this from a talk given by Austin Nichols, Regression for nonnegative skewed dependent variables, given in 2010 at the Stata Conference in Boston. That talk goes far beyond what I have presented here, and I heartily recommend it.

Precision (yet again), Part II

In part I, I wrote about precision issues in English. If you enjoyed that, you may want to stop reading now, because I’m about to go into the technical details. Actually, these details are pretty interesting.

For instance, I offered the following formula for calculating error due to float precision:

maximum_error = 2-24 X

I later mentioned that the formula is an approximation, and said that the true formula is,

maximum_error = 2-24 2floor(log2 X)

I didn’t explain how I got either formula.

I need to be more precise today than I was in my previous posting. For instance, I previously used x for two concepts, the true value and the rounded-after-storage value. Today I need to distinguish those concepts.

X is the true value.

x is the value after rounding due to storage.

The issue is the difference between x and X when X is stored in 24-binary-digit float precision.

Base 10

Although I harp on the value of learning to think in binary and hexadecimal, I admit that I, too, find it easier to think in base 10. So let’s start that way.

Say we record numbers to two digits of accuracy, which I will call d=2. Examples of d=2 numbers include

 1.0
 1.6
12
47
52*10^1 (i.e, 520, but with only two significant digits)

To say that we record numbers to two digits of accuracy is to say that, coming upon the recorded number 1, we know only that the number lies between 0.95 and 1.05; or coming upon 12, that the true number lies between 11.5 and 12.5, and so on. I assume that numbers are rounded efficiently, which is to say, stored values record midpoints of intervals.

Before we get into the math, let me note that most us would be willing to say that numbers recorded this way are accurate to 1 part in 10 or, if d=3, to 1 part in 100. If numbers are accurate to 1 part in 10^(d-1), then couldn’t we must multiply the number by 1/(10^(d-1)) to obtain the width of the interval? Let’s try:

Assume X=520 and d=2. Then 520/(10^(2-1)) = 52. The true interval, however, is (515, 525] and it has width 10. So the simple formula does not work.

The simple formula does not work yet I presented its base-2 equivalent in Part 1 and I even recommended its use! We will get to that. It turns out the smaller the base, the more accurately the simple formula approximates the true formula, but before I can show that, I need the true formula.

Let’s start by thinking about d=1.

  1. The recorded number 0 will contain all numbers between [-0.5, 0.5). The recorded number 1 will contain all numbers between [0.5, 1.5), and so on. For 0, 1, …, 9, the width of the intervals is 1.

  2. The recorded number 10 will contain all numbers between [5, 15). The recorded number 20 will contain all numbers between [15, 25), and so on, For 10, 20, …, 90, the width of the intervals is 10.

The derivation for the width of interval goes like this:

  1. If we recorded the value of X to one decimal digit, the recorded digit will will be b, the recorded value will be x = b*10p, and the power of ten will be p = floor(log10X). More importantly, W1 = 10p will be the width of the interval containing X.

  2. It therefore follows that if we recorded the value of X to two decimal digits, the interval length would be W2 = W1/10. What ever the width with one digit, adding another must reduce width by one-tenth.

  3. If we recorded the value of X to three decimal digits, the interval length would be W3 = W2/10.

  4. Thus, if d is the number of digits to which numbers are recorded, the width of the interval is 10p where p = floor(log10X) – (d-1).

The above formula is exact.

Base 2

Converting the formula

interval_width = 10floor(log10X)-(d-1)

from base 10 to base 2 is easy enough:

interval_width = 2floor(log2X)-(d-1)

In Part 1, I presented this formula for d=24 as

maximum_error = 2floor(log2X)-24 = 2 -24 2floor(log2 X)

In interval_width, it is d-1 and not d that appears in the formula. You might think I made an error and should have put -23 where I put -24 in the maximum_error formula. There is no mistake. In Part 1, the maximum error was defined as a plus-or-minus quantity and is thus half the width of the overall interval. So I divided by 2, and in effect, I did put -23 into the maximum_error formula, at least before I subracted one more from it, making it -24 again.

I started out this posting by considering and dismissing the base-10 approximation formula

interval_width = 10-(d-1) X

which in maximum-error units is

maximum_error = 10-d X

and yet in Part 1, I presented — and even recommended — its base-2, d=24 equivalent,

maximum_error = 2-24 X

It turns out that the approximation formula is not as inaccurate in base 2 and it would be in base 10. The correct formula,

maximum_error = 2floor(log2X)-d

can be written

maximum_error = 2-d 2floor(log2X

so the question becomes about the accuracy of substituting X for 2^floor(log2X). We know by examination that X ≥ 2^floor(log2X), so making the substitution will overstate the error and, in that sense, is a safe thing to do. The question becomes how much the error is overstated.

X can be written 2^(log2X) and thus we need to compare 2^(log2X) with 2^floor(log2X). The floor() function cannot reduce its argument by more than 1, and thus 2^(log2X) cannot differ from 2^floor(log2X) by more than a factor of 2. Under the circumstances, this seems a reasonable approximation.

In the case of base 10, the the floor() function reducing its argument by up to 1 results in a decrease of up to a factor of 10. That, it seems to me, is not a reasonable amount of error.

Categories: Numerical Analysis Tags: ,