Using Stata’s random-number generators, part 2, drawing without replacement
Last time I told you that Stata’s runiform() function generates rectangularly (uniformly) distributed random numbers over [0, 1), from 0 to nearly 1, and to be precise, over [0, 0.999999999767169356]. And I gave you two formulas,
- To generate continuous random numbers between a and b, use
generate double u = (b–a)*runiform() + a
The random numbers will not actually be between a and b: they will be between a and nearly b, but the top will be so close to b, namely 0.999999999767169356*b, that it will not matter.
- To generate integer random numbers between a and b, use
generate ui = floor((b–a+1)*runiform() + a)
I also mentioned that runiform() can solve a variety of problems, including
- shuffling data (putting observations in random order),
- drawing random samples without replacement (there’s a minor detail we’ll have to discuss because runiform() itself produces values drawn with replacement),
- drawing random samples with replacement (which is easier to do than most people realize),
- drawing stratified random samples (with or without replacement),
- manufacturing fictional data (something teachers, textbook authors, manual writers, and blog writers often need to do).
Today we will cover shuffling and drawing random samples without replacement — the first two topics on the list — and we will leave drawing random samples with replacement for next time. I’m going to tell you
- To place observations in random order — to shuffle the observations — type
. generate double u = runiform() . sort u
- To draw without replacement a random sample of n observations from a dataset of N observations, type
. set seed # . generate double u = runiform() . sort u . keep in 1/n
I will tell you that there are good statistical reasons for setting the random-number seed even if you don’t care about reproducibility.
If you do care about reproducibility, I will mention (1) that you need to use sort to put the original data in a known, reproducible order, before you generate the random variate u, and I will explain (2) a subtle issue that leads us to use different code for N≤1,000 and N>1,000. The code for for N≤1,000 is
. set seed # . sort variables_that_put_data_in_unique_order . generate double u = runiform() . sort u . keep in 1/n
and the code for N>1,000 is
. set seed # . sort variables_that_put_data_in_unique_order . generate double u1 = runiform() . generate double u2 = runiform() . sort u1 u2 . keep in 1/n
You can use the N>1,000 code for the N≤1,000 case.
- To draw without replacement a P-percent random sample, type
. set seed # . keep if runiform() <= P/100
There’s no issue in this case when N is large.
As I mentioned, we’ll discuss drawing random samples with replacement next time. Today, the topic is random samples without replacement. Let’s start.
Shuffling data
I have a deck of 52 cards, in order, the first four of which are
. list in 1/4 +-------------+ | rank suit | |-------------| 1. | Ace Club | 2. | 1 Club | 3. | 2 Club | 4. | 3 Club | +-------------+
Well, actually I just have a Stata dataset with observations corresponding to playing cards. To shuffle the deck — to place the observations in random order — type
. generate double u = runiform() . sort u
Having done that, here’s your hand,
. list in 1/5 +----------------------------+ | rank suit u | |----------------------------| 1. | Queen Club .0445188 | 2. | 5 Diamond .0580662 | 3. | 7 Club .0610638 | 4. | King Heart .0907986 | 5. | 6 Spade .0981878 | +----------------------------+
and here’s mine:
. list in 6/10 +---------------------------+ | rank suit u | |---------------------------| 6. | 8 Diamond .1024369 | 7. | 5 Club .1086679 | 8. | 8 Spade .1091783 | 9. | 2 Spade .1180158 | 10. | Ace Club .1369841 | +---------------------------+
All I did was generate random numbers — one per observation (card) — and then place the observations in ascending order of the random values. Doing that is equivalent to shuffling the deck. I used runiform() random numbers, meaning rectangularly distributed random numbers over [0, 1), but since I’m only exploiting the random-numbers’ ordinal properties, I could have used random numbers from any continuous distribution.
This simple, elegant, and obvious solution to shuffling data will play an important part of the solution to drawing observations without replacement. I have already more than hinted at the solution when I showed you your hand and mine.
Drawing n observations without replacement
Drawing without replacement is exactly the same problem as dealing cards. The solution to the physical card problem is to shuffle the cards and then draw the top cards. The solution to randomly selecting n from N observations is to put the N observations in random order and keep the first n of them.
. use cards, clear . generate double u = runiform() . sort u . keep in 1/5 (47 observations deleted) . list +---------------------------+ | rank suit u | |---------------------------| 1. | Ace Diamond .0064866 | 2. | 6 Heart .0087578 | 3. | King Spade .014819 | 4. | 3 Spade .0955155 | 5. | King Diamond .1007262 | +---------------------------+ . drop u
Reproducibility
You might later want to reproduce the analysis, meaning you do not want to draw another random sample, but you want to draw the same random sample. Perhaps you informally distributed some preliminary results and, of course, then discovered a mistake. You want to redistribute updated results and show that your mistake didn’t change results by much, and to drive home the point, you want to use the same samples as you used previously.
Part of the solution is to set the random-number seed. You might type
. set seed 49983882 . use cards, clear . generate double u = runiform() . sort u . keep in 1/5
See help set seed in Stata. As a quick review, when you set the random-number seed, you set Stata’s random-number generator into a fixed, reproducible state, which is to say, the sequence of random numbers that runiform() produces is a function of the seed. Set the seed today to the same value as yesterday, and runiform() will produce the same sequence of random numbers today as it did yesterday. Thus, after setting the seed, if you repeat today exactly what you did yesterday, you will obtain the same results.
So imagine that you set the random number seed today to the value you set it to yesterday and you repeat the above commands. Even so, you might not get the same results! You will not get the same results if the observations in cards.dta are in a different order yesterday and today. Setting the seed merely ensures that if yesterday the smallest value of u was in observation 23, it will again be in observation 23 today (and it will be the same value). If yesterday, however, observation 23 was the 6 of Clubs, and today it’s the 7 of Hearts, then today you will select the 7 of Hearts in place of the 6 of Clubs.
So make sure the data are in the same order. One way to do that is put the dataset in a known order before generating the random values on which you will sort. For instance,
. set seed 49983882 . use cards, clear . sort rank suit . generate double u = runiform() . sort u . keep in 1/5
An even better solution would add the line
. by rank suit: assert _N==1
just before the generate. That line would check whether sorting on variables rank and suit uniquely orders the observations.
With cards.dta, you can argue that the assert is unnecessary, but not because you know each rank-suit combination occurs once. You have only my assurances about that. I recommend you never trust anyone’s assurances about data. In this case, however, you can argue that the assert is unnecessary because we sorted on all the variables in the dataset and thus uniqueness is not required. Pretend there are two Ace of Clubs in the deck. Would it matter that the first card was Ace of Clubs followed by Ace of Clubs as opposed to being the other way around? Of course it would not; the two states are indistinguishable.
So let’s assume there is another variable in the dataset, say whether there was a grease spot on the back of the card. Yesterday, after sorting, the ordering might have been,
+---------------------------------+ | rank suit grease u | |---------------------------------| 1. | Ace Club yes .6012949 | 2. | Ace Club no .1859054 | +---------------------------------+
and today,
+---------------------------------+ | rank suit grease u | |---------------------------------| 1. | Ace Club no .6012949 | 2. | Ace Club yes .1859054 | +---------------------------------+
If yesterday you selected the Ace of Clubs without grease, today you would select the Ace of Clubs with grease.
My recommendation is (1) sort on whatever variables put the data into a unique order, and then verify that, or (2) sort on all the variables in the dataset and then don’t worry whether the order is unique.
Ensuring a random ordering
Included in our reproducible solution but omitted from our base solution was setting the random-number seed,
. set seed 49983882
Setting the seed is important even if you don’t care about reproducibility. Each time you launch Stata, Stata sets the same random-number seed, namely 123456789, and that means that runiform() generates the same sequence of random numbers, and that means that if you generated all your random samples right after launching Stata, you would always select the same observations, at least holding N constant.
So set the seed, but don’t set it too often. You set the seed once per problem. If I wanted to draw 10,000 random samples from the same data, I could code:
use dataset, clear set seed 1702213 sort variables_that_put_data_in_unique_order preserve forvalues i=1(1)10000 { generate double u = runiform() sort u keep in 1/n drop u save sample`i', replace restore, preserve }
In the example I save each sample in a file. In real life, I seldom (never) save the samples; I perform whatever analysis on the samples I need and save the results, which I usually append into a single dataset. I don’t need to save the individual samples because I can recreate them.
And the result still might not be reproducible …
runiform() draws random-numbers with replacement. It is thus possible that two or more observations could have the same random values associated with them. Well yes, you’re thinking, I see that it’s possible, but surely it’s so unlikely that it just doesn’t happen. But it does happen:
. clear all . set obs 100000 obs was 0, now 100000 . generate double u = runiform() . by u, sort: assert _N==1 1 contradiction in 99999 by-groups assertion is false r(9);
In the 100,000-observation dataset I just created, I got a duplicate! By the way, I didn’t have to look hard for such an example, I got it the first time I tried.
I have three things I want to tell you:
- Duplicates happen more often than you might guess.
- Do not panic about the duplicates. Because of how Stata is written, duplicates do not lower the quality of the sample selected. I’ll explain.
- Duplicates do interfere with reproducibility, however, and there is an easy way around that problem.
Let’s start with the chances of observing duplicates. I mentioned in passing last time that runiform() is a 32-bit random-number generator. That means runiform() can return any of 232 values. Their values are, in order,
0 = 0 1/232 = 2.32830643654e-10 2/232 = 4.65661287308e-10 3/232 = 6.98491930962e-10 . . . (232-2)/232 = 0.9999999995343387 (232-1)/232 = 0.9999999997671694
So what are the chances that in N draws with replacement from an urn containing these 232 values, that all values are distinct? The probability p that all values are distinct is
232 * (232-1) * ... *(232-N) p = ---------------------------- N*232
Here are some values for various values of N. p is the probability that all values are unique, and 1-p is the probability of observing one or more repeated values.
------------------------------------ N p 1-p ------------------------------------ 50 0.999999715 0.000000285 500 0.999970955 0.000029045 1,000 0.999883707 0.000116293 5,000 0.997094436 0.002905564 10,000 0.988427154 0.011572846 50,000 0.747490440 0.252509560 100,000 0.312187988 0.687812012 200,000 0.009498117 0.990501883 300,000 0.000028161 0.999971839 400,000 0.000000008 0.999999992 500,000 0.000000000 1.000000000 ------------------------------------
In shuffling cards we generated N=52 random values. The probability of a repeated values is infinitesimal. In datasets of N=10,000, I expect to see repeated values 1% of the time. In datasets of N=50,000, I expect to see repeated values 25% of the time. By N=100,000, I expect to see repeated values more often than not. By N=500,000, I expect to see repeated value in virtually all sequences.
Even so, I promised you that this problem does not affect the randomness of the ordering. It does not because of how Stata’s sort command is written. Remember the basic solution,
. use dataset, clear . generate double u = runiform() . sort u . keep in 1/n
Did you know sort has its own, private random-number generator built into it? It does, and sort uses its random-number generator to determine the order of tied observations. In the manuals we at StataCorp are fond of writing, “the ties will be ordered randomly” and a few sophisticated users probably took that to mean, “the ties will be ordered in a way that we at StataCorp do not know and even though they might be ordered in a way that will cause a bias in the subsequent analysis, because we don’t know, we’ll ignore the possibility.” But we meant it when wrote that the ties will be ordered randomly; we know that because we put a random number generator into sort to ensure the result. And that is why I can now write that repeated values of the runiform() function cause a reproducibility issue, but not a statistical issue.
The solution to the reproducibility issue is to draw two random numbers and use the random-number pair to order the observations:
. use dataset, clear . sort varnames . set seed # . generate double u1 = runiform() . generate double u2 = runiform() . sort u1 u2 . keep in 1/n
You might wonder if we would ever need three random numbers. It is very unlikely. p, the probability of no problem, equals 1 to at least 5 digits for N=500,000. Of course, the chances of duplication are always nonzero. If you are concerned about this problem, you could add an assert to the code to verify that the two random numbers together do uniquely identify the observations:
. use dataset, clear . sort varnames . set seed # . generate double u1 = runiform() . generate double u2 = runiform() . sort u1 u2 . by u1 u2: assert _N==1 // added line . keep in 1/n
I do not believe that doing that is necessary.
Is using doubles necessary?
In the generation of random numbers in all of the above, note that I am storing them as doubles. For the reproducibility issue, that is important. As I mentioned in part 1, the 32-bit random numbers that runiform() produces will be rounded if forced into 23-bit floats.
Above I gave you a table of probabilities p that, in creating
. generate double u = runiform()
the values of u would be distinct. Here is what would happen if you instead stored u as a float:
u stored as --------------------------------------------------------- -------- double ---------- ----------float ---------- N p 1-p p 1-p ------------------------------------------------------------------- 50 0.999999715 0.000000285 0.999853979 0.000146021 500 0.999970955 0.000029045 0.985238383 0.014761617 1,000 0.999883707 0.000116293 0.942190868 0.057809132 5,000 0.997094436 0.002905564 0.225346930 0.774653070 10,000 0.988427154 0.011572846 0.002574145 0.997425855 50,000 0.747490440 0.252509560 0.000000000 1.000000000 100,000 0.312187988 0.687812012 0.000000000 1.000000000 200,000 0.009498117 0.990501883 0.000000000 1.000000000 300,000 0.000028161 0.999971839 0.000000000 1.000000000 400,000 0.000000008 0.999999992 0.000000000 1.000000000 500,000 0.000000000 1.000000000 0.000000000 1.000000000 -------------------------------------------------------------------
Drawing without replacement P-percent random samples
We have discussed drawing without replacement n observations from N observations. The number of observations selected has been fixed. Say instead we wanted to draw a 10% random sample, meaning that we independently allow each observation to have a 10% chance of appearing in our sample. In that case, the final number of observations is expected to be 0.1*N, but it may (and probably will) vary from that. The basic solution for drawing a 10% random sample is
. keep if runiform() <= 0.10
and the basic solution for drawing a P% random sample is
. keep if runiform() <= P/100
It is unlikely to matter whether you code <= or < in the comparison. As you now know, runiform() produces values drawn from 232 possible values, and thus the chance of equality is 2-32 or roughly 0.000000000232830644. If you want a P% sample, however, theory says you should code <=.
If you care about reproducibility, you should expand the basic solution to read,
. set seed # . use data, clear . sort variables_that_put_data_in_unique_order . keep if runiform() <= P/100
Below I draw a 10% sample from the card.dta:
. set seed 838 . use cards, clear . sort rank suit . keep if runiform() <= 10/100 (46 observations deleted) . list +-----------------+ | rank suit | |-----------------| 1. | 2 Diamond | 2. | 2 Heart | 3. | 3 Club | 4. | 5 Heart | 5. | Jack Diamond | |-----------------| 6. | Queen Spade | +-----------------+
We’re not done, but we’re done for today
In part 3 of this series I will discuss drawing random samples with replacement.