Home > Numerical Analysis > Using Stata’s random-number generators, part 2, drawing without replacement

Using Stata’s random-number generators, part 2, drawing without replacement

Last time I told you that Stata’s runiform() function generates rectangularly (uniformly) distributed random numbers over [0, 1), from 0 to nearly 1, and to be precise, over [0, 0.999999999767169356]. And I gave you two formulas,

  1. To generate continuous random numbers between a and b, use

    generate double u = (b-a)*runiform() + a

    The random numbers will not actually be between a and b: they will be between a and nearly b, but the top will be so close to b, namely 0.999999999767169356*b, that it will not matter.

  2. To generate integer random numbers between a and b, use

    generate ui = floor((b-a+1)*runiform() + a)

I also mentioned that runiform() can solve a variety of problems, including

  • shuffling data (putting observations in random order),
  • drawing random samples without replacement (there’s a minor detail we’ll have to discuss because runiform() itself produces values drawn with replacement),
  • drawing random samples with replacement (which is easier to do than most people realize),
  • drawing stratified random samples (with or without replacement),
  • manufacturing fictional data (something teachers, textbook authors, manual writers, and blog writers often need to do).

Today we will cover shuffling and drawing random samples without replacement — the first two topics on the list — and we will leave drawing random samples with replacement for next time. I’m going to tell you

  1. To place observations in random order — to shuffle the observations — type
    . generate double u = runiform()
    . sort u
    
  2. To draw without replacement a random sample of n observations from a dataset of N observations, type
    . set seed #
    . generate double u = runiform()
    . sort u
    . keep in 1/n
    

    I will tell you that there are good statistical reasons for setting the random-number seed even if you don’t care about reproducibility.

    If you do care about reproducibility, I will mention (1) that you need to use sort to put the original data in a known, reproducible order, before you generate the random variate u, and I will explain (2) a subtle issue that leads us to use different code for N≤1,000 and N>1,000. The code for for N≤1,000 is

    . set seed #
    . sort variables_that_put_data_in_unique_order
    . generate double u = runiform()
    . sort u
    . keep in 1/n
    

    and the code for N>1,000 is

    . set seed #
    . sort variables_that_put_data_in_unique_order
    . generate double u1 = runiform()
    . generate double u2 = runiform()
    . sort u1 u2
    . keep in 1/n
    

    You can use the N>1,000 code for the N≤1,000 case.

  3. To draw without replacement a P-percent random sample, type
    . set seed #
    . keep if runiform() <= P/100
    

    There’s no issue in this case when N is large.

As I mentioned, we’ll discuss drawing random samples with replacement next time. Today, the topic is random samples without replacement. Let’s start.

 
Shuffling data

I have a deck of 52 cards, in order, the first four of which are

. list in 1/4

     +-------------+
     | rank   suit |
     |-------------|
  1. |  Ace   Club |
  2. |    1   Club |
  3. |    2   Club |
  4. |    3   Club |
     +-------------+

Well, actually I just have a Stata dataset with observations corresponding to playing cards. To shuffle the deck — to place the observations in random order — type

. generate double u = runiform()

. sort u

Having done that, here’s your hand,

. list in 1/5

     +----------------------------+
     |  rank      suit          u |
     |----------------------------|
  1. | Queen      Club   .0445188 |
  2. |     5   Diamond   .0580662 |
  3. |     7      Club   .0610638 |
  4. |  King     Heart   .0907986 |
  5. |     6     Spade   .0981878 |
     +----------------------------+

and here’s mine:

. list in 6/10

     +---------------------------+
     | rank      suit          u |
     |---------------------------|
  6. |    8   Diamond   .1024369 |
  7. |    5      Club   .1086679 |
  8. |    8     Spade   .1091783 |
  9. |    2     Spade   .1180158 |
 10. |  Ace      Club   .1369841 |
     +---------------------------+

All I did was generate random numbers — one per observation (card) — and then place the observations in ascending order of the random values. Doing that is equivalent to shuffling the deck. I used runiform() random numbers, meaning rectangularly distributed random numbers over [0, 1), but since I’m only exploiting the random-numbers’ ordinal properties, I could have used random numbers from any continuous distribution.

This simple, elegant, and obvious solution to shuffling data will play an important part of the solution to drawing observations without replacement. I have already more than hinted at the solution when I showed you your hand and mine.

 
Drawing n observations without replacement

Drawing without replacement is exactly the same problem as dealing cards. The solution to the physical card problem is to shuffle the cards and then draw the top cards. The solution to randomly selecting n from N observations is to put the N observations in random order and keep the first n of them.

. use cards, clear

. generate double u = runiform()

. sort u

. keep in 1/5
(47 observations deleted)

. list

     +---------------------------+
     | rank      suit          u |
     |---------------------------|
  1. |  Ace   Diamond   .0064866 |
  2. |    6     Heart   .0087578 |
  3. | King     Spade    .014819 |
  4. |    3     Spade   .0955155 |
  5. | King   Diamond   .1007262 |
     +---------------------------+

. drop u

 
Reproducibility

You might later want to reproduce the analysis, meaning you do not want to draw another random sample, but you want to draw the same random sample. Perhaps you informally distributed some preliminary results and, of course, then discovered a mistake. You want to redistribute updated results and show that your mistake didn’t change results by much, and to drive home the point, you want to use the same samples as you used previously.

Part of the solution is to set the random-number seed. You might type

. set seed 49983882

. use cards, clear

. generate double u = runiform()

. sort u

. keep in 1/5

See help set seed in Stata. As a quick review, when you set the random-number seed, you set Stata’s random-number generator into a fixed, reproducible state, which is to say, the sequence of random numbers that runiform() produces is a function of the seed. Set the seed today to the same value as yesterday, and runiform() will produce the same sequence of random numbers today as it did yesterday. Thus, after setting the seed, if you repeat today exactly what you did yesterday, you will obtain the same results.

So imagine that you set the random number seed today to the value you set it to yesterday and you repeat the above commands. Even so, you might not get the same results! You will not get the same results if the observations in cards.dta are in a different order yesterday and today. Setting the seed merely ensures that if yesterday the smallest value of u was in observation 23, it will again be in observation 23 today (and it will be the same value). If yesterday, however, observation 23 was the 6 of Clubs, and today it’s the 7 of Hearts, then today you will select the 7 of Hearts in place of the 6 of Clubs.

So make sure the data are in the same order. One way to do that is put the dataset in a known order before generating the random values on which you will sort. For instance,

. set seed 49983882

. use cards, clear

. sort rank suit

. generate double u = runiform()

. sort u

. keep in 1/5

An even better solution would add the line

. by rank suit: assert _N==1

just before the generate. That line would check whether sorting on variables rank and suit uniquely orders the observations.

With cards.dta, you can argue that the assert is unnecessary, but not because you know each rank-suit combination occurs once. You have only my assurances about that. I recommend you never trust anyone’s assurances about data. In this case, however, you can argue that the assert is unnecessary because we sorted on all the variables in the dataset and thus uniqueness is not required. Pretend there are two Ace of Clubs in the deck. Would it matter that the first card was Ace of Clubs followed by Ace of Clubs as opposed to being the other way around? Of course it would not; the two states are indistinguishable.

So let’s assume there is another variable in the dataset, say whether there was a grease spot on the back of the card. Yesterday, after sorting, the ordering might have been,

     +---------------------------------+
     | rank   suit   grease          u |
     |---------------------------------|
  1. |  Ace   Club      yes   .6012949 |
  2. |  Ace   Club       no   .1859054 |
     +---------------------------------+

and today,

     +---------------------------------+
     | rank   suit   grease          u |
     |---------------------------------|
  1. |  Ace   Club       no   .6012949 |
  2. |  Ace   Club      yes   .1859054 |
     +---------------------------------+

If yesterday you selected the Ace of Clubs without grease, today you would select the Ace of Clubs with grease.

My recommendation is (1) sort on whatever variables put the data into a unique order, and then verify that, or (2) sort on all the variables in the dataset and then don’t worry whether the order is unique.

 
Ensuring a random ordering

Included in our reproducible solution but omitted from our base solution was setting the random-number seed,

. set seed 49983882

Setting the seed is important even if you don’t care about reproducibility. Each time you launch Stata, Stata sets the same random-number seed, namely 123456789, and that means that runiform() generates the same sequence of random numbers, and that means that if you generated all your random samples right after launching Stata, you would always select the same observations, at least holding N constant.

So set the seed, but don’t set it too often. You set the seed once per problem. If I wanted to draw 10,000 random samples from the same data, I could code:

  use dataset, clear
  set seed 1702213
  sort variables_that_put_data_in_unique_order
  preserve
  forvalues i=1(1)10000 {
          generate double u = runiform()
          sort u
          keep in 1/n
          drop u
          save sample`i', replace
          restore, preserve  
}

In the example I save each sample in a file. In real life, I seldom (never) save the samples; I perform whatever analysis on the samples I need and save the results, which I usually append into a single dataset. I don’t need to save the individual samples because I can recreate them.

 
And the result still might not be reproducible …

runiform() draws random-numbers with replacement. It is thus possible that two or more observations could have the same random values associated with them. Well yes, you’re thinking, I see that it’s possible, but surely it’s so unlikely that it just doesn’t happen. But it does happen:

. clear all

. set obs 100000
obs was 0, now 100000

. generate double u = runiform() 

. by u, sort: assert _N==1
1 contradiction in 99999 by-groups
assertion is false      
r(9); 

In the 100,000-observation dataset I just created, I got a duplicate! By the way, I didn’t have to look hard for such an example, I got it the first time I tried.

I have three things I want to tell you:

  1. Duplicates happen more often than you might guess.

  2. Do not panic about the duplicates. Because of how Stata is written, duplicates do not lower the quality of the sample selected. I’ll explain.

  3. Duplicates do interfere with reproducibility, however, and there is an easy way around that problem.

Let’s start with the chances of observing duplicates. I mentioned in passing last time that runiform() is a 32-bit random-number generator. That means runiform() can return any of 232 values. Their values are, in order,

          0  =  0
      1/232  =  2.32830643654e-10
      2/232  =  4.65661287308e-10 
      3/232  =  6.98491930962e-10
             .
             .
             .
 (232-2)/232  =  0.9999999995343387
 (232-1)/232  =  0.9999999997671694

So what are the chances that in N draws with replacement from an urn containing these 232 values, that all values are distinct? The probability p that all values are distinct is

      232 * (232-1) * ... *(232-N)
p  =  ----------------------------
                 N*232

Here are some values for various values of N. p is the probability that all values are unique, and 1-p is the probability of observing one or more repeated values.

------------------------------------
      N         p            1-p
------------------------------------
     50   0.999999715    0.000000285
    500   0.999970955    0.000029045
  1,000   0.999883707    0.000116293
  5,000   0.997094436    0.002905564
 10,000   0.988427154    0.011572846
 50,000   0.747490440    0.252509560
100,000   0.312187988    0.687812012
200,000   0.009498117    0.990501883
300,000   0.000028161    0.999971839
400,000   0.000000008    0.999999992
500,000   0.000000000    1.000000000
------------------------------------

In shuffling cards we generated N=52 random values. The probability of a repeated values is infinitesimal. In datasets of N=10,000, I expect to see repeated values 1% of the time. In datasets of N=50,000, I expect to see repeated values 25% of the time. By N=100,000, I expect to see repeated values more often than not. By N=500,000, I expect to see repeated value in virtually all sequences.

Even so, I promised you that this problem does not affect the randomness of the ordering. It does not because of how Stata’s sort command is written. Remember the basic solution,

 
. use dataset, clear

. generate double u = runiform()

. sort u

. keep in 1/n

Did you know sort has its own, private random-number generator built into it? It does, and sort uses its random-number generator to determine the order of tied observations. In the manuals we at StataCorp are fond of writing, “the ties will be ordered randomly” and a few sophisticated users probably took that to mean, “the ties will be ordered in a way that we at StataCorp do not know and even though they might be ordered in a way that will cause a bias in the subsequent analysis, because we don’t know, we’ll ignore the possibility.” But we meant it when wrote that the ties will be ordered randomly; we know that because we put a random number generator into sort to ensure the result. And that is why I can now write that repeated values of the runiform() function cause a reproducibility issue, but not a statistical issue.

The solution to the reproducibility issue is to draw two random numbers and use the random-number pair to order the observations:

 
. use dataset, clear

. sort varnames

. set seed #

. generate double u1 = runiform()

. generate double u2 = runiform()

. sort u1 u2

. keep in 1/n

You might wonder if we would ever need three random numbers. It is very unlikely. p, the probability of no problem, equals 1 to at least 5 digits for N=500,000. Of course, the chances of duplication are always nonzero. If you are concerned about this problem, you could add an assert to the code to verify that the two random numbers together do uniquely identify the observations:

. use dataset, clear

. sort varnames

. set seed #

. generate double u1 = runiform()

. generate double u2 = runiform()

. sort u1 u2

. by u1 u2: assert _N==1            // added line

. keep in 1/n

I do not believe that doing that is necessary.

 
Is using doubles necessary?

In the generation of random numbers in all of the above, note that I am storing them as doubles. For the reproducibility issue, that is important. As I mentioned in part 1, the 32-bit random numbers that runiform() produces will be rounded if forced into 23-bit floats.

Above I gave you a table of probabilities p that, in creating

. generate double u = runiform()

the values of u would be distinct. Here is what would happen if you instead stored u as a float:

                               u stored as
          ---------------------------------------------------------
          -------- double ----------     ----------float ----------
      N         p            1-p               p           1-p
-------------------------------------------------------------------
     50   0.999999715    0.000000285     0.999853979    0.000146021
    500   0.999970955    0.000029045     0.985238383    0.014761617
  1,000   0.999883707    0.000116293     0.942190868    0.057809132
  5,000   0.997094436    0.002905564     0.225346930    0.774653070
 10,000   0.988427154    0.011572846     0.002574145    0.997425855
 50,000   0.747490440    0.252509560     0.000000000    1.000000000
100,000   0.312187988    0.687812012     0.000000000    1.000000000
200,000   0.009498117    0.990501883     0.000000000    1.000000000
300,000   0.000028161    0.999971839     0.000000000    1.000000000
400,000   0.000000008    0.999999992     0.000000000    1.000000000
500,000   0.000000000    1.000000000     0.000000000    1.000000000
-------------------------------------------------------------------

 
Drawing without replacement P-percent random samples

We have discussed drawing without replacement n observations from N observations. The number of observations selected has been fixed. Say instead we wanted to draw a 10% random sample, meaning that we independently allow each observation to have a 10% chance of appearing in our sample. In that case, the final number of observations is expected to be 0.1*N, but it may (and probably will) vary from that. The basic solution for drawing a 10% random sample is

. keep if runiform() <= 0.10

and the basic solution for drawing a P% random sample is

. keep if runiform() <= P/100

It is unlikely to matter whether you code <= or < in the comparison. As you now know, runiform() produces values drawn from 232 possible values, and thus the chance of equality is 2-32 or roughly 0.000000000232830644. If you want a P% sample, however, theory says you should code <=.

If you care about reproducibility, you should expand the basic solution to read,

. set seed #

. use data, clear 

. sort variables_that_put_data_in_unique_order

. keep if runiform() <= P/100

Below I draw a 10% sample from the card.dta:

. set seed 838   

. use cards, clear

. sort rank suit

. keep if runiform() <= 10/100
(46 observations deleted)

. list

     +-----------------+
     |  rank      suit |
     |-----------------|
  1. |     2   Diamond |
  2. |     2     Heart |
  3. |     3      Club |
  4. |     5     Heart |
  5. |  Jack   Diamond |
     |-----------------|
  6. | Queen     Spade |
     +-----------------+

 
We’re not done, but we’re done for today

In part 3 of this series I will discuss drawing random samples with replacement.

  • http://www.facebook.com/philip.jones.1257 Philip Jones

    Great post!

  • Aljar Meesters

    Can you explain why sort does not use the same seed as the other random number generators? That would make sort also foolproof with respect to reproducibility.

  • BillGould

    We did not because we we did not want reproducibility in this case.

    Consider a piece of code that, run on the same data, produces different results in different runs. Assume the random behavior is not an intentional outcome of the procedure/formulas that the code implements. Then the observed random behavior indicates a mistake in the procedure/formulas or a mistake in the code. I’ve seen both.

    In my experience, indeterministic sorts are the second most likely cause of different-results-in-different-runs problems. The most likely cause is uninitialized variables. Besides the statistical goals I mentioned in the posting, another of our goals in writing -sort- so that it randomly ordered ties was to highlight indeterministic sorts that lead to irreproducible results. At StataCorp, we run certification scripts in a loop, over and over again, looking for just such problems.

    Setting the random-number seed is a way of reproducing results from routines that are intended to produce different results in different runs. -sort- is not such a function; if it produces different results in different runs, and that matters, that is a bug.

  • Ellie

    instead of save the generated samples separately save sample`i’, replace. Is there a way to put all generated samples as different variables and save them in one data base?