Archive

Archive for August 2012

Using Stata’s random-number generators, part 3, drawing with replacement

The topic for today is drawing random samples with replacement. If you haven’t read part 1 and part 2 of this series on random numbers, do so. In the series we’ve
discussed that

  1. Stata’s runiform() function produces random numbers over the range [0,1). To produce such random numbers, type
    . generate double u = runiform()
    

     

  2. To produce continuous random numbers over [a,b), type
    . generate double u = (b-a)*runiform() + a
    

     

  3. To produce integer random numbers over [a,b], type
    . generate ui = floor((b-a+1)*runiform() + a)
    

    If b > 16,777,216, type

    . generate long ui = floor((b-a+1)*runiform() + a)
    

     

  4. To place observations in random order — to shuffle observations — type
    . set seed #
    . generate double u = runiform()
    . sort u
    

     

  5. To draw without replacement a random sample of n observations from a dataset of N observations, type
    . set seed #
    . sort variables_that_put_dta_in_unique_order
    . generate double u = runiform()
    . sort u
    . keep in 1/n
    

    If N>1,000, generate two random variables u1 and u2 in place of u, and substitute sort u1 u2 for sort u.

     

  6. To draw without replacement a P-percent random sample, type
    . set seed #
    . keep if runiform() <= P/100
    

I’ve glossed over details, but the above is the gist of it.

Today I’m going to tell you

  1. To draw a random sample of size n with replacement from a dataset of size N, type
    . set seed #
    . drop _all
    . set obs n
    . generate long obsno = floor(N*runiform()+1)
    . sort obsno
    . save obsnos_to_draw
    
    . use your_dataset, clear
    . generate long obsno = _n
    . merge 1:m obsno using obsnos_to_draw, keep(match) nogen
    

     

  2. You need to set the random-number seed only if you care about reproducibility. I’ll also mention that if N ≤ 16,777,216, it is not necessary to specify that new variable obsno be stored as long; the default float will be sufficient.

     
    The above solution works whether n<N, n=N, or n>N.

 

Drawing samples with replacement

The solution to sampling with replacement n observations from a dataset of size N is

  1. Draw n observation numbers 1, …, N with replacement. For instance, if N=4 and n=3, we might draw observation numbers 1, 3, and 3.

  2. Select those observations from the dataset of interest. For instance, select observations 1, 3, and 3.

As previously discussed in part 1, to generate random integers drawn with replacement over the range [a, b], use the formula

generate varname = floor((b-a+1)*runiform() + a)

In this case, we want a=1 and b=N, and the formula reduces to,

generate varname = floor(N*runiform() + 1)

So the first half of our solution could read

. drop _all
. set obs n
. generate obsno = floor(N*runiform() + 1)

Now we are merely left with the problem of selecting those observations from our dataset, which we can do using merge by typing

. sort obsno
. save obsnos_to_draw
. use dataset_of_interest, clear
. generate obsno = _n
. merge 1:m obsno using obsnos_to_draw, keep(match) nogen

Let’s do an example. In part 2 of this series, I had a dataset with observations corresponding to playing cards:

. use cards

. list in 1/5

     +-------------+
     | rank   suit |
     |-------------|
  1. |  Ace   Club |
  2. |    2   Club |
  3. |    3   Club |
  4. |    4   Club |
  5. |    5   Club |
     +-------------+

There are 52 observations in the dataset; I’m showing you just the first five. Let’s draw 10 cards from the deck, but with replacement.

The first step is to draw the observation numbers. We have N=52 cards in the deck, and we want to draw n=10, so we generate 10 random integers from the integers [1, 52]:

. drop _all

. set obs 10                            // we want n=10
obs was 0, now 10

. gen obsno = floor(52*runiform()+1)    // we draw from N=52


. list obsno                            // let's see what we have

     +-------+
     | obsno |
     |-------|
  1. |    42 |
  2. |    52 |
  3. |    16 |
  4. |     9 |
  5. |    40 |
     |-------|
  6. |    11 |
  7. |    34 |
  8. |    20 |
  9. |    49 |
 10. |    42 |
     +-------+

If you look carefully at the list, you will see that observation number 42 repeats. It will be easier to see the duplicate if we sort the list,

. sort obsno
. list

     +-------+
     | obsno |
     |-------|
  1. |     9 |
  2. |    11 |
  3. |    16 |
  4. |    20 |
  5. |    34 |
     |-------|
  6. |    40 |
  7. |    42 |     <- Obs. 42 repeats
  8. |    42 |     <- See?
  9. |    49 |
 10. |    52 |
     +-------+

An observation didn’t have to repeat, but it’s not surprising that one did because in drawing n=10 from N=52, we would expect one or more repeated cards about 60% of the time.

Anyway, we now know which cards we want, namely cards 9, 11, 16, 20, 34, 40, 42, 42 (again), 49, and 52.

The final step is to select those observations from cards.dta. The way to do that is to perform a one-to-many merge of cards.dta with the list above and keep the matches. Before we can do that, however, we must (1) save the list of observation numbers as a dataset, (2) load cards.dta, and (3) add a variable called obsno to it. Then we will be able to perform the merge. So let’s get that out of the way,

. save obsnos_to_draw                // 1. save the list above
file obsnos_to_draw.dta saved

. use cards                          // 2. load cards.dta

. gen obsno = _n                     // 3.  Add variable obsno to it

Now we can perform the merge:

. merge 1:m obsno using obsnos_to_draw, keep(matched) nogen

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                10
    -----------------------------------------

I’ll list the result, but let me first briefly explain the command

merge 1:m obsno using obsnos_to_draw, keep(matched) nogen

merge …, we are performing the merge command,

1:m …, the merge is one-to-many,

using obsnos_to_draw …, we merge data in memory with obsnos_todraw.dta,

, keep(matched) …, we keep observations that appear in both datasets,

nogen, do not add variable _merge to the resulting dataset; _merge reports the source of the resulting observations; we said keep(matched) so we know each came from both sources.

And here is the result:

. list

     +-------------------------+
     |  rank      suit   obsno |
     |-------------------------|
  1. |     8      Club       9 |
  2. |  Jack      Club      11 |
  3. |   Ace     Spade      16 |
  4. |     2   Diamond      20 |
  5. |     6     Spade      34 |
     |-------------------------|
  6. |     8     Spade      40 |
  7. |     9     Heart      42 |   <- Obs. 42 is here ...
  8. | Queen     Spade      49 |
  9. |  King     Spade      52 |
 10. |     9     Heart      42 |   <- and here
     +-------------------------+

We drew 10 cards — those are the observation numbers on the left. Variable obsno in our dataset records the original observation (card) number and really, we no longer need the variable. Anyway, obsno==42 appears twice, in real observations 7 and 10, and thus we drew the 9 of Hearts twice.

 

What could go wrong?

Not much can go wrong, it turns out. At this point, our generic solution is

. drop _all
. set obs n
. generate obsno = floor(n*runiform()+1)
. sort obsno
. save obsnos_to_draw

. use our_dataset
. gen obsno = _n
. merge 1:m obsno using obsnos_to_draw, keep(matched) nogen

If you study this code, there are two lines that might cause problems,

. generate obsno = floor(N*runiform()+1)

and

. generate obsno = _n

When you are looking for problems and see a generate or replace, think about rounding.

Let’s look at the right-hand side first. Both calculations produce integers over the range [1, N]. generate performs all calculations in double and the largest integer that can be stored without rounding is 9,007,199,254,740,992 (see previous blog post on precision). Stata allows datasets up to 2,147,483,646, so we can be sure that N is less than the maximum precise-integer double. There are no rounding issues on the right-hand side.

Next let’s look at the left-hand side. Variable obsno is being stored as a float because we did not instruct otherwise. The largest integer value that can be stored without rounding as a float (also covered in previous blog post on precision) is 16,777,216, and that is less than Stata’s 2,147,483,646 maximum observations. When N exceeds 16,777,216, the solution is to store obsno as a long. We could remember to use long on the rare occasion when dealing with such large datasets, but I’m going to change the generic solution to use longs in all cases, even when it’s unnecessary.

What else could go wrong? Well, we tried an example with n<N and that seemed to work. We should now try examples with n=N and n>N to verify there’s no hidden bug or assumption in our code. I’ve tried examples of both and the code works fine.

 

We’re done for today

That’s it. Drawing samples with replacement turns out to be easy, and that shouldn’t surprise us because we have a random-number generator that draws with replacement.

We could complicate the discussion and consider solutions that would run a bit more efficiently when n=N, which is of special interest in statistics because it is a key ingredient in bootstrapping, but we will not. The above solution works fine in the n=N case, and I always advise researchers to favor simple-even-if-slower solutions because they will probably save you time. Writing complicated code takes longer than writing simple code, and testing complicated code takes even longer. I know because that’s what we do at StataCorp.