How to generate random numbers in Stata
Overview
I describe how to generate random numbers and discuss some features added in Stata 14. In particular, Stata 14 includes a new default random-number generator (RNG) called the Mersenne Twister (Matsumoto and Nishimura 1998), a new function that generates random integers, the ability to generate random numbers from an interval, and several new functions that generate random variates from nonuniform distributions.
Random numbers from the uniform distribution
In the example below, we use runiform() to create a simulated dataset with 10,000 observations on a (0,1)-uniform variable. Prior to using runiform(), we set the seed so that the results are reproducible.
. set obs 10000
number of observations (_N) was 0, now 10,000

. set seed 98034

. generate u1 = runiform()
The mean of a (0,1)-uniform is .5, and the standard deviation is \(\sqrt{1/12}\approx .289\). The estimates from the simulated data reported in the output below are close to the true values.
. summarize u1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          u1 |     10,000    .5004244    .2865088   .0000502    .999969
To draw uniform variates over (a, b) instead of over (0, 1), we specify runiform(a, b). In the example below, we draw uniform variates over (1, 2) and then estimate the mean and the standard deviation, which we could compare with their theoretical values of 1.5 and \(\sqrt{(1/12)} \approx .289\).
. generate u2 = runiform(1, 2)

. summarize u2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          u2 |     10,000    1.495698    .2887136   1.000088   1.999899
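The theoretical values quoted for the uniform examples follow from the standard moments of a continuous uniform distribution. As a short worked derivation, for \(U\) uniform over \((a, b)\):

\[
E[U] = \frac{a+b}{2}, \qquad
E[U^2] = \frac{a^2 + ab + b^2}{3}, \qquad
\mathrm{Var}(U) = E[U^2] - (E[U])^2 = \frac{(b-a)^2}{12}
\]

With \(a=0\) and \(b=1\), this gives a mean of .5 and a standard deviation of \(\sqrt{1/12}\approx .289\). With \(a=1\) and \(b=2\), the mean shifts to 1.5, while the standard deviation is unchanged because the interval still has length 1.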
To draw integers uniformly over {a, a+1, …, b}, we specify runiformint(a, b). In the example below, we draw integers uniformly over {0, 1, …, 100} and then estimate the mean and the standard deviation, which we could compare with their theoretical values of 50 and \(\sqrt{(101^2-1)/12}\approx 29.155\).
. generate u3 = runiformint(0, 100)

. summarize u3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          u3 |     10,000     49.9804    29.19094          0        100
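The theoretical values for the integer case come from the discrete uniform distribution. As a brief worked calculation, for \(K\) uniform over the \(n = b - a + 1\) integers \(\{a, a+1, \ldots, b\}\):

\[
E[K] = \frac{a+b}{2}, \qquad
\mathrm{Var}(K) = \frac{n^2 - 1}{12} = \frac{(b-a+1)^2 - 1}{12}
\]

With \(a=0\) and \(b=100\), there are \(n=101\) possible values, so the mean is 50 and the standard deviation is \(\sqrt{(101^2-1)/12} = \sqrt{850} \approx 29.155\).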
Set the seed and make results reproducible
We use set seed # to obtain the same random numbers, which makes the subsequent results reproducible. An RNG generates its numbers from a recursive formula, so the “random” numbers produced are actually deterministic, but they appear to be random. Setting the seed specifies a starting place for the recursion, which causes the random numbers to be the same, as in the example below.
. drop _all

. set obs 6
number of observations (_N) was 0, now 6

. set seed 12345

. generate x = runiform()

. set seed 12345

. generate y = runiform()

. list x y

     +---------------------+
     |        x          y |
     |---------------------|
  1. | .3576297   .3576297 |
  2. | .4004426   .4004426 |
  3. | .6893833   .6893833 |
  4. | .5597356   .5597356 |
  5. | .5744513   .5744513 |
     |---------------------|
  6. | .2076905   .2076905 |
     +---------------------+
Every time Stata is launched, the seed is set to 123456789.
After generating \(N\) random numbers, the RNG wraps around and starts generating the same sequence all over again. \(N\) is called the period of the RNG. Larger periods are better because we get more random numbers before the sequence wraps. The period of Mersenne Twister is \(2^{19937}-1\), which is huge. Large periods are important when performing complicated simulation studies.
In Stata, the seed is a nonnegative integer (between 0 and \(2^{31}-1\)) that Stata maps onto the state of the RNG. The state of an RNG corresponds to a spot in the sequence. The mapping is not one to one because there are more states than seeds. If you want to pick up where you left off in the sequence, you need to restore the state, as in the example below.
. drop _all

. set obs 3
number of observations (_N) was 0, now 3

. set seed 12345

. generate x = runiform()

. local state `c(rngstate)'

. generate y = runiform()

. set rngstate `state'

. generate z = runiform()

. list

     +--------------------------------+
     |        x          y          z |
     |--------------------------------|
  1. | .3576297   .5597356   .5597356 |
  2. | .4004426   .5744513   .5744513 |
  3. | .6893833   .2076905   .2076905 |
     +--------------------------------+
After dropping the data and setting the number of observations to 3, we use generate to put random variates in x, store the state of the RNG in the local macro state, and then put random numbers in y. Next, we use set rngstate to restore the state to what it was before we generated y, and then we generate z. The random numbers in z are the same as those in y because restoring the state caused Stata to start at the same place in the sequence as before we generated y. See “Programming an estimation command in Stata: Where to store your stuff” for an introduction to local macros.
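Rather than comparing the listed values by eye, the equality of the two sets of draws can be checked programmatically. The following is a minimal sketch meant to be run as a do-file (the // comments are not valid when typed interactively). It uses the same commands as the example above and adds assert, which stops with an error if the condition fails for any observation.

```stata
* Sketch: confirm that restoring the RNG state reproduces the same draws.
clear
set obs 3
set seed 12345
generate x = runiform()
local state `c(rngstate)'     // save the state after x has been drawn
generate y = runiform()
set rngstate `state'          // return to the saved spot in the sequence
generate z = runiform()
assert y == z                 // errors out if any draw in z differs from y
```

If the do-file runs to completion without an error, the draws in y and z are identical in every observation.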
Random variates from various distributions
So far, we have talked about generating uniformly distributed random numbers. Stata also provides functions that generate random numbers from other distributions. The function names are easy to remember: the letter r followed by the name of the distribution. Some common examples are rnormal(), rbeta(), and rweibull(). In the example below, we draw 5,000 observations from a standard normal distribution and summarize the results.
. drop _all

. set seed 12345

. set obs 5000
number of observations (_N) was 0, now 5,000

. generate w = rnormal()

. summarize w

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           w |      5,000    .0008946    .9903156  -3.478898   3.653764
The estimated mean and standard deviation are close to their true values of 0 and 1.
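The other functions follow the same pattern, with the distribution's parameters passed as arguments. The sketch below, written as a do-file, draws from a normal, a beta, and a Weibull distribution. The variable names (xn, xb, xw) and the parameter values are arbitrary choices for illustration and are not taken from the examples above.

```stata
* Sketch: draws from three nonuniform distributions (parameter values are illustrative).
clear
set seed 12345
set obs 5000
generate double xn = rnormal(10, 2)     // normal with mean 10 and standard deviation 2
generate double xb = rbeta(2, 3)        // beta with shape parameters 2 and 3
generate double xw = rweibull(1.5, 2)   // Weibull with shape 1.5 and scale 2
summarize xn xb xw
```

As with the uniform examples, the sample means and standard deviations reported by summarize can be compared with the theoretical moments of each distribution.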
A note on precision
So far, we generated random numbers with the default data type of float. Generating the random numbers with type double makes ties occur less frequently. Ties can still occur with type double because the period of the Mersenne Twister is far larger than the number of distinct values available at a precision of \(2^{-53}\), so a long enough sequence of random numbers will contain repeated numbers.
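One way to see the difference is to store draws in each type and count the repeated values. The sketch below is a do-file that uses duplicates report for the counting. The sample size of 100,000 and the variable names uf and ud are arbitrary choices for illustration, and the two variables hold different draws, so only the number of ties is comparable.

```stata
* Sketch: compare how often ties occur in float versus double storage.
clear
set seed 12345
set obs 100000
generate float  uf = runiform()   // stored at float precision
generate double ud = runiform()   // stored at double precision
duplicates report uf              // tabulates how many values are repeated
duplicates report ud
```

Declaring the type explicitly in generate, as done here, is usually the simplest safeguard when ties matter, for example when sorting on a random variable.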
Conclusion
In this post, I showed how to generate random numbers using random-number functions in Stata. I also discussed how to make results reproducible by setting the seed. In subsequent posts, I will delve into other aspects of RNGs, including methods to generate random variates from other distributions and in Mata.
Reference
Matsumoto, M., and T. Nishimura. 1998. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation 8: 3–30.