Archive

Archive for August 2012

Using Stata’s random-number generators, part 3, drawing with replacement

The topic for today is drawing random samples with replacement. If you haven’t read part 1 and part 2 of this series on random numbers, do so. In the series we’ve
discussed that

  1. Stata’s runiform() function produces random numbers over the range [0,1). To produce such random numbers, type
    . generate double u = runiform()
    

     

  2. To produce continuous random numbers over [a,b), type
    . generate double u = (b-a)*runiform() + a
    

     

  3. To produce integer random numbers over [a,b], type
    . generate ui = floor((b-a+1)*runiform() + a)
    

    If b > 16,777,216, type

    . generate long ui = floor((b-a+1)*runiform() + a)
    

     

  4. To place observations in random order — to shuffle observations — type
    . set seed #
    . generate double u = runiform()
    . sort u
    

     

  5. To draw without replacement a random sample of n observations from a dataset of N observations, type
    . set seed #
    . sort variables_that_put_dta_in_unique_order
    . generate double u = runiform()
    . sort u
    . keep in 1/n
    

    If N>1,000, generate two random variables u1 and u2 in place of u, and substitute sort u1 u2 for sort u.

     

  6. To draw without replacement a P-percent random sample, type
    . set seed #
    . keep if runiform() <= P/100
    

I’ve glossed over details, but the above is the gist of it.

Today I’m going to tell you

  1. To draw a random sample of size n with replacement from a dataset of size N, type
    . set seed #
    . drop _all
    . set obs n
    . generate long obsno = floor(N*runiform()+1)
    . sort obsno
    . save obsnos_to_draw
    
    . use your_dataset, clear
    . generate long obsno = _n
    . merge 1:m obsno using obsnos_to_draw, keep(match) nogen
    

     

  2. You need to set the random-number seed only if you care about reproducibility. I’ll also mention that if N ≤ 16,777,216, it is not necessary to specify that new variable obsno be stored as long; the default float will be sufficient.

     
    The above solution works whether n<N, n=N, or n>N.

 

Drawing samples with replacement

The solution to sampling with replacement n observations from a dataset of size N is

  1. Draw n observation numbers 1, …, N with replacement. For instance, if N=4 and n=3, we might draw observation numbers 1, 3, and 3.

  2. Select those observations from the dataset of interest. For instance, select observations 1, 3, and 3.

As previously discussed in part 1, to generate random integers drawn with replacement over the range [a, b], use the formula

generate varname = floor((b-a+1)*runiform() + a)

In this case, we want a=1 and b=N, and the formula reduces to,

generate varname = floor(N*runiform() + 1)

So the first half of our solution could read

. drop _all
. set obs n
. generate obsno = floor(N*runiform() + 1)

Now we are merely left with the problem of selecting those observations from our dataset, which we can do using merge by typing

. sort obsno
. save obsnos_to_draw
. use dataset_of_interest, clear
. generate obsno = _n
. merge 1:m obsno using obsnos_to_draw, keep(match) nogen

Let’s do an example. In part 2 of this series, I had a dataset with observations corresponding to playing cards:

. use cards

. list in 1/5

     +-------------+
     | rank   suit |
     |-------------|
  1. |  Ace   Club |
  2. |    2   Club |
  3. |    3   Club |
  4. |    4   Club |
  5. |    5   Club |
     +-------------+

There are 52 observations in the dataset; I’m showing you just the first five. Let’s draw 10 cards from the deck, but with replacement.

The first step is to draw the observation numbers. We have N=52 cards in the deck, and we want to draw n=10, so we generate 10 random integers from the integers [1, 52]:

. drop _all

. set obs 10                            // we want n=10
obs was 0, now 10

. gen obsno = floor(52*runiform()+1)    // we draw from N=52


. list obsno                            // let's see what we have

     +-------+
     | obsno |
     |-------|
  1. |    42 |
  2. |    52 |
  3. |    16 |
  4. |     9 |
  5. |    40 |
     |-------|
  6. |    11 |
  7. |    34 |
  8. |    20 |
  9. |    49 |
 10. |    42 |
     +-------+

If you look carefully at the list, you will see that observation number 42 repeats. It will be easier to see the duplicate if we sort the list,

. sort obsno
. list

     +-------+
     | obsno |
     |-------|
  1. |     9 |
  2. |    11 |
  3. |    16 |
  4. |    20 |
  5. |    34 |
     |-------|
  6. |    40 |
  7. |    42 |     <- Obs. 42 repeats
  8. |    42 |     <- See?
  9. |    49 |
 10. |    52 |
     +-------+

An observation didn’t have to repeat, but it’s not surprising that one did because in drawing n=10 from N=52, we would expect one or more repeated cards about 60% of the time.

Anyway, we now know which cards we want, namely cards 9, 11, 16, 20, 34, 40, 42, 42 (again), 49, and 52.

The final step is to select those observations from cards.dta. The way to do that is to perform a one-to-many merge of cards.dta with the list above and keep the matches. Before we can do that, however, we must (1) save the list of observation numbers as a dataset, (2) load cards.dta, and (3) add a variable called obsno to it. Then we will be able to perform the merge. So let’s get that out of the way,

. save obsnos_to_draw                // 1. save the list above
file obsnos_to_draw.dta saved

. use cards                          // 2. load cards.dta

. gen obsno = _n                     // 3.  Add variable obsno to it

Now we can perform the merge:

. merge 1:m obsno using obsnos_to_draw, keep(matched) nogen

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                10
    -----------------------------------------

I’ll list the result, but let me first briefly explain the command

merge 1:m obsno using obsnos_to_draw, keep(matched) nogen

merge …, we are performing the merge command,

1:m …, the merge is one-to-many,

using obsnos_to_draw …, we merge data in memory with obsnos_todraw.dta,

, keep(matched) …, we keep observations that appear in both datasets,

nogen, do not add variable _merge to the resulting dataset; _merge reports the source of the resulting observations; we said keep(matched) so we know each came from both sources.

And here is the result:

. list

     +-------------------------+
     |  rank      suit   obsno |
     |-------------------------|
  1. |     8      Club       9 |
  2. |  Jack      Club      11 |
  3. |   Ace     Spade      16 |
  4. |     2   Diamond      20 |
  5. |     6     Spade      34 |
     |-------------------------|
  6. |     8     Spade      40 |
  7. |     9     Heart      42 |   <- Obs. 42 is here ...
  8. | Queen     Spade      49 |
  9. |  King     Spade      52 |
 10. |     9     Heart      42 |   <- and here
     +-------------------------+

We drew 10 cards — those are the observation numbers on the left. Variable obsno in our dataset records the original observation (card) number and really, we no longer need the variable. Anyway, obsno==42 appears twice, in real observations 7 and 10, and thus we drew the 9 of Hearts twice.

 

What could go wrong?

Not much can go wrong, it turns out. At this point, our generic solution is

. drop _all
. set obs n
. generate obsno = floor(n*runiform()+1)
. sort obsno
. save obsnos_to_draw

. use our_dataset
. gen obsno = _n
. merge 1:m obsno using obsnos_to_draw, keep(matched) nogen

If you study this code, there are two lines that might cause problems,

. generate obsno = floor(N*runiform()+1)

and

. generate obsno = _n

When you are looking for problems and see a generate or replace, think about rounding.

Let’s look at the right-hand side first. Both calculations produce integers over the range [1, N]. generate performs all calculations in double and the largest integer that can be stored without rounding is 9,007,199,254,740,992 (see previous blog post on precision). Stata allows datasets up to 2,147,483,646, so we can be sure that N is less than the maximum precise-integer double. There are no rounding issues on the right-hand side.

Next let’s look at the left-hand side. Variable obsno is being stored as a float because we did not instruct otherwise. The largest integer value that can be stored without rounding as a float (also covered in previous blog post on precision) is 16,777,216, and that is less than Stata’s 2,147,483,646 maximum observations. When N exceeds 16,777,216, the solution is to store obsno as a long. We could remember to use long on the rare occasion when dealing with such large datasets, but I’m going to change the generic solution to use longs in all cases, even when it’s unnecessary.

What else could go wrong? Well, we tried an example with n<N and that seemed to work. We should now try examples with n=N and n>N to verify there’s no hidden bug or assumption in our code. I’ve tried examples of both and the code works fine.

 

We’re done for today

That’s it. Drawing samples with replacement turns out to be easy, and that shouldn’t surprise us because we have a random-number generator that draws with replacement.

We could complicate the discussion and consider solutions that would run a bit more efficiently when n=N, which is of special interest in statistics because it is a key ingredient in bootstrapping, but we will not. The above solution works fine in the n=N case, and I always advise researchers to favor simple-even-if-slower solutions because they will probably save you time. Writing complicated code takes longer than writing simple code, and testing complicated code takes even longer. I know because that’s what we do at StataCorp.

Using Stata’s random-number generators, part 2, drawing without replacement

Last time I told you that Stata’s runiform() function generates rectangularly (uniformly) distributed random numbers over [0, 1), from 0 to nearly 1, and to be precise, over [0, 0.999999999767169356]. And I gave you two formulas,

  1. To generate continuous random numbers between a and b, use

    generate double u = (b-a)*runiform() + a

    The random numbers will not actually be between a and b: they will be between a and nearly b, but the top will be so close to b, namely 0.999999999767169356*b, that it will not matter.

  2. To generate integer random numbers between a and b, use

    generate ui = floor((b-a+1)*runiform() + a)

I also mentioned that runiform() can solve a variety of problems, including

  • shuffling data (putting observations in random order),
  • drawing random samples without replacement (there’s a minor detail we’ll have to discuss because runiform() itself produces values drawn with replacement),
  • drawing random samples with replacement (which is easier to do than most people realize),
  • drawing stratified random samples (with or without replacement),
  • manufacturing fictional data (something teachers, textbook authors, manual writers, and blog writers often need to do).

Today we will cover shuffling and drawing random samples without replacement — the first two topics on the list — and we will leave drawing random samples with replacement for next time. I’m going to tell you

  1. To place observations in random order — to shuffle the observations — type
    . generate double u = runiform()
    . sort u
    
  2. To draw without replacement a random sample of n observations from a dataset of N observations, type
    . set seed #
    . generate double u = runiform()
    . sort u
    . keep in 1/n
    

    I will tell you that there are good statistical reasons for setting the random-number seed even if you don’t care about reproducibility.

    If you do care about reproducibility, I will mention (1) that you need to use sort to put the original data in a known, reproducible order, before you generate the random variate u, and I will explain (2) a subtle issue that leads us to use different code for N≤1,000 and N>1,000. The code for for N≤1,000 is

    . set seed #
    . sort variables_that_put_data_in_unique_order
    . generate double u = runiform()
    . sort u
    . keep in 1/n
    

    and the code for N>1,000 is

    . set seed #
    . sort variables_that_put_data_in_unique_order
    . generate double u1 = runiform()
    . generate double u2 = runiform()
    . sort u1 u2
    . keep in 1/n
    

    You can use the N>1,000 code for the N≤1,000 case.

  3. To draw without replacement a P-percent random sample, type
    . set seed #
    . keep if runiform() <= P/100
    

    There’s no issue in this case when N is large.

As I mentioned, we’ll discuss drawing random samples with replacement next time. Today, the topic is random samples without replacement. Let’s start.

 
Shuffling data

I have a deck of 52 cards, in order, the first four of which are

. list in 1/4

     +-------------+
     | rank   suit |
     |-------------|
  1. |  Ace   Club |
  2. |    1   Club |
  3. |    2   Club |
  4. |    3   Club |
     +-------------+

Well, actually I just have a Stata dataset with observations corresponding to playing cards. To shuffle the deck — to place the observations in random order — type

. generate double u = runiform()

. sort u

Having done that, here’s your hand,

. list in 1/5

     +----------------------------+
     |  rank      suit          u |
     |----------------------------|
  1. | Queen      Club   .0445188 |
  2. |     5   Diamond   .0580662 |
  3. |     7      Club   .0610638 |
  4. |  King     Heart   .0907986 |
  5. |     6     Spade   .0981878 |
     +----------------------------+

and here’s mine:

. list in 6/10

     +---------------------------+
     | rank      suit          u |
     |---------------------------|
  6. |    8   Diamond   .1024369 |
  7. |    5      Club   .1086679 |
  8. |    8     Spade   .1091783 |
  9. |    2     Spade   .1180158 |
 10. |  Ace      Club   .1369841 |
     +---------------------------+

All I did was generate random numbers — one per observation (card) — and then place the observations in ascending order of the random values. Doing that is equivalent to shuffling the deck. I used runiform() random numbers, meaning rectangularly distributed random numbers over [0, 1), but since I’m only exploiting the random-numbers’ ordinal properties, I could have used random numbers from any continuous distribution.

This simple, elegant, and obvious solution to shuffling data will play an important part of the solution to drawing observations without replacement. I have already more than hinted at the solution when I showed you your hand and mine.

 
Drawing n observations without replacement

Drawing without replacement is exactly the same problem as dealing cards. The solution to the physical card problem is to shuffle the cards and then draw the top cards. The solution to randomly selecting n from N observations is to put the N observations in random order and keep the first n of them.

. use cards, clear

. generate double u = runiform()

. sort u

. keep in 1/5
(47 observations deleted)

. list

     +---------------------------+
     | rank      suit          u |
     |---------------------------|
  1. |  Ace   Diamond   .0064866 |
  2. |    6     Heart   .0087578 |
  3. | King     Spade    .014819 |
  4. |    3     Spade   .0955155 |
  5. | King   Diamond   .1007262 |
     +---------------------------+

. drop u

 
Reproducibility

You might later want to reproduce the analysis, meaning you do not want to draw another random sample, but you want to draw the same random sample. Perhaps you informally distributed some preliminary results and, of course, then discovered a mistake. You want to redistribute updated results and show that your mistake didn’t change results by much, and to drive home the point, you want to use the same samples as you used previously.

Part of the solution is to set the random-number seed. You might type

. set seed 49983882

. use cards, clear

. generate double u = runiform()

. sort u

. keep in 1/5

See help set seed in Stata. As a quick review, when you set the random-number seed, you set Stata’s random-number generator into a fixed, reproducible state, which is to say, the sequence of random numbers that runiform() produces is a function of the seed. Set the seed today to the same value as yesterday, and runiform() will produce the same sequence of random numbers today as it did yesterday. Thus, after setting the seed, if you repeat today exactly what you did yesterday, you will obtain the same results.

So imagine that you set the random number seed today to the value you set it to yesterday and you repeat the above commands. Even so, you might not get the same results! You will not get the same results if the observations in cards.dta are in a different order yesterday and today. Setting the seed merely ensures that if yesterday the smallest value of u was in observation 23, it will again be in observation 23 today (and it will be the same value). If yesterday, however, observation 23 was the 6 of Clubs, and today it’s the 7 of Hearts, then today you will select the 7 of Hearts in place of the 6 of Clubs.

So make sure the data are in the same order. One way to do that is put the dataset in a known order before generating the random values on which you will sort. For instance,

. set seed 49983882

. use cards, clear

. sort rank suit

. generate double u = runiform()

. sort u

. keep in 1/5

An even better solution would add the line

. by rank suit: assert _N==1

just before the generate. That line would check whether sorting on variables rank and suit uniquely orders the observations.

With cards.dta, you can argue that the assert is unnecessary, but not because you know each rank-suit combination occurs once. You have only my assurances about that. I recommend you never trust anyone’s assurances about data. In this case, however, you can argue that the assert is unnecessary because we sorted on all the variables in the dataset and thus uniqueness is not required. Pretend there are two Ace of Clubs in the deck. Would it matter that the first card was Ace of Clubs followed by Ace of Clubs as opposed to being the other way around? Of course it would not; the two states are indistinguishable.

So let’s assume there is another variable in the dataset, say whether there was a grease spot on the back of the card. Yesterday, after sorting, the ordering might have been,

     +---------------------------------+
     | rank   suit   grease          u |
     |---------------------------------|
  1. |  Ace   Club      yes   .6012949 |
  2. |  Ace   Club       no   .1859054 |
     +---------------------------------+

and today,

     +---------------------------------+
     | rank   suit   grease          u |
     |---------------------------------|
  1. |  Ace   Club       no   .6012949 |
  2. |  Ace   Club      yes   .1859054 |
     +---------------------------------+

If yesterday you selected the Ace of Clubs without grease, today you would select the Ace of Clubs with grease.

My recommendation is (1) sort on whatever variables put the data into a unique order, and then verify that, or (2) sort on all the variables in the dataset and then don’t worry whether the order is unique.

 
Ensuring a random ordering

Included in our reproducible solution but omitted from our base solution was setting the random-number seed,

. set seed 49983882

Setting the seed is important even if you don’t care about reproducibility. Each time you launch Stata, Stata sets the same random-number seed, namely 123456789, and that means that runiform() generates the same sequence of random numbers, and that means that if you generated all your random samples right after launching Stata, you would always select the same observations, at least holding N constant.

So set the seed, but don’t set it too often. You set the seed once per problem. If I wanted to draw 10,000 random samples from the same data, I could code:

  use dataset, clear
  set seed 1702213
  sort variables_that_put_data_in_unique_order
  preserve
  forvalues i=1(1)10000 {
          generate double u = runiform()
          sort u
          keep in 1/n
          drop u
          save sample`i', replace
          restore, preserve  
}

In the example I save each sample in a file. In real life, I seldom (never) save the samples; I perform whatever analysis on the samples I need and save the results, which I usually append into a single dataset. I don’t need to save the individual samples because I can recreate them.

 
And the result still might not be reproducible …

runiform() draws random-numbers with replacement. It is thus possible that two or more observations could have the same random values associated with them. Well yes, you’re thinking, I see that it’s possible, but surely it’s so unlikely that it just doesn’t happen. But it does happen:

. clear all

. set obs 100000
obs was 0, now 100000

. generate double u = runiform() 

. by u, sort: assert _N==1
1 contradiction in 99999 by-groups
assertion is false      
r(9); 

In the 100,000-observation dataset I just created, I got a duplicate! By the way, I didn’t have to look hard for such an example, I got it the first time I tried.

I have three things I want to tell you:

  1. Duplicates happen more often than you might guess.

  2. Do not panic about the duplicates. Because of how Stata is written, duplicates do not lower the quality of the sample selected. I’ll explain.

  3. Duplicates do interfere with reproducibility, however, and there is an easy way around that problem.

Let’s start with the chances of observing duplicates. I mentioned in passing last time that runiform() is a 32-bit random-number generator. That means runiform() can return any of 232 values. Their values are, in order,

          0  =  0
      1/232  =  2.32830643654e-10
      2/232  =  4.65661287308e-10 
      3/232  =  6.98491930962e-10
             .
             .
             .
 (232-2)/232  =  0.9999999995343387
 (232-1)/232  =  0.9999999997671694

So what are the chances that in N draws with replacement from an urn containing these 232 values, that all values are distinct? The probability p that all values are distinct is

      232 * (232-1) * ... *(232-N)
p  =  ----------------------------
                 N*232

Here are some values for various values of N. p is the probability that all values are unique, and 1-p is the probability of observing one or more repeated values.

------------------------------------
      N         p            1-p
------------------------------------
     50   0.999999715    0.000000285
    500   0.999970955    0.000029045
  1,000   0.999883707    0.000116293
  5,000   0.997094436    0.002905564
 10,000   0.988427154    0.011572846
 50,000   0.747490440    0.252509560
100,000   0.312187988    0.687812012
200,000   0.009498117    0.990501883
300,000   0.000028161    0.999971839
400,000   0.000000008    0.999999992
500,000   0.000000000    1.000000000
------------------------------------

In shuffling cards we generated N=52 random values. The probability of a repeated values is infinitesimal. In datasets of N=10,000, I expect to see repeated values 1% of the time. In datasets of N=50,000, I expect to see repeated values 25% of the time. By N=100,000, I expect to see repeated values more often than not. By N=500,000, I expect to see repeated value in virtually all sequences.

Even so, I promised you that this problem does not affect the randomness of the ordering. It does not because of how Stata’s sort command is written. Remember the basic solution,

 
. use dataset, clear

. generate double u = runiform()

. sort u

. keep in 1/n

Did you know sort has its own, private random-number generator built into it? It does, and sort uses its random-number generator to determine the order of tied observations. In the manuals we at StataCorp are fond of writing, “the ties will be ordered randomly” and a few sophisticated users probably took that to mean, “the ties will be ordered in a way that we at StataCorp do not know and even though they might be ordered in a way that will cause a bias in the subsequent analysis, because we don’t know, we’ll ignore the possibility.” But we meant it when wrote that the ties will be ordered randomly; we know that because we put a random number generator into sort to ensure the result. And that is why I can now write that repeated values of the runiform() function cause a reproducibility issue, but not a statistical issue.

The solution to the reproducibility issue is to draw two random numbers and use the random-number pair to order the observations:

 
. use dataset, clear

. sort varnames

. set seed #

. generate double u1 = runiform()

. generate double u2 = runiform()

. sort u1 u2

. keep in 1/n

You might wonder if we would ever need three random numbers. It is very unlikely. p, the probability of no problem, equals 1 to at least 5 digits for N=500,000. Of course, the chances of duplication are always nonzero. If you are concerned about this problem, you could add an assert to the code to verify that the two random numbers together do uniquely identify the observations:

. use dataset, clear

. sort varnames

. set seed #

. generate double u1 = runiform()

. generate double u2 = runiform()

. sort u1 u2

. by u1 u2: assert _N==1            // added line

. keep in 1/n

I do not believe that doing that is necessary.

 
Is using doubles necessary?

In the generation of random numbers in all of the above, note that I am storing them as doubles. For the reproducibility issue, that is important. As I mentioned in part 1, the 32-bit random numbers that runiform() produces will be rounded if forced into 23-bit floats.

Above I gave you a table of probabilities p that, in creating

. generate double u = runiform()

the values of u would be distinct. Here is what would happen if you instead stored u as a float:

                               u stored as
          ---------------------------------------------------------
          -------- double ----------     ----------float ----------
      N         p            1-p               p           1-p
-------------------------------------------------------------------
     50   0.999999715    0.000000285     0.999853979    0.000146021
    500   0.999970955    0.000029045     0.985238383    0.014761617
  1,000   0.999883707    0.000116293     0.942190868    0.057809132
  5,000   0.997094436    0.002905564     0.225346930    0.774653070
 10,000   0.988427154    0.011572846     0.002574145    0.997425855
 50,000   0.747490440    0.252509560     0.000000000    1.000000000
100,000   0.312187988    0.687812012     0.000000000    1.000000000
200,000   0.009498117    0.990501883     0.000000000    1.000000000
300,000   0.000028161    0.999971839     0.000000000    1.000000000
400,000   0.000000008    0.999999992     0.000000000    1.000000000
500,000   0.000000000    1.000000000     0.000000000    1.000000000
-------------------------------------------------------------------

 
Drawing without replacement P-percent random samples

We have discussed drawing without replacement n observations from N observations. The number of observations selected has been fixed. Say instead we wanted to draw a 10% random sample, meaning that we independently allow each observation to have a 10% chance of appearing in our sample. In that case, the final number of observations is expected to be 0.1*N, but it may (and probably will) vary from that. The basic solution for drawing a 10% random sample is

. keep if runiform() <= 0.10

and the basic solution for drawing a P% random sample is

. keep if runiform() <= P/100

It is unlikely to matter whether you code <= or < in the comparison. As you now know, runiform() produces values drawn from 232 possible values, and thus the chance of equality is 2-32 or roughly 0.000000000232830644. If you want a P% sample, however, theory says you should code <=.

If you care about reproducibility, you should expand the basic solution to read,

. set seed #

. use data, clear 

. sort variables_that_put_data_in_unique_order

. keep if runiform() <= P/100

Below I draw a 10% sample from the card.dta:

. set seed 838   

. use cards, clear

. sort rank suit

. keep if runiform() <= 10/100
(46 observations deleted)

. list

     +-----------------+
     |  rank      suit |
     |-----------------|
  1. |     2   Diamond |
  2. |     2     Heart |
  3. |     3      Club |
  4. |     5     Heart |
  5. |  Jack   Diamond |
     |-----------------|
  6. | Queen     Spade |
     +-----------------+

 
We’re not done, but we’re done for today

In part 3 of this series I will discuss drawing random samples with replacement.