How to generate random numbers in Stata

Overview

I describe how to generate random numbers and discuss some features added in Stata 14. In particular, Stata 14 includes a new default random-number generator (RNG) called the Mersenne Twister (Matsumoto and Nishimura 1998), a new function that generates random integers, the ability to generate random numbers from an interval, and several new functions that generate random variates from nonuniform distributions.

Random numbers from the uniform distribution

In the example below, we use runiform() to create a simulated dataset with 10,000 observations on a (0,1)-uniform variable. Prior to using runiform(), we set the seed so that the results are reproducible.

. set obs 10000
number of observations (_N) was 0, now 10,000

. set seed 98034

. generate u1 = runiform()

The mean of a (0,1)-uniform is .5, and the standard deviation is \(\sqrt{1/12}\approx .289\). The estimates from the simulated data reported in the output below are close to the true values.

. summarize u1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          u1 |     10,000    .5004244    .2865088   .0000502    .999969

To draw uniform variates over (a, b) instead of over (0, 1), we specify runiform(a, b). In the example below, we draw uniform variates over (1, 2) and then estimate the mean and the standard deviation, which we can compare with their theoretical values of 1.5 and \(\sqrt{1/12}\approx .289\). The standard deviation is the same as for the (0,1)-uniform because the interval still has length 1.

. generate u2 = runiform(1, 2)

. summarize u2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          u2 |     10,000    1.495698    .2887136   1.000088   1.999899
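
The theoretical values quoted above are an instance of the standard moments of a uniform variate \(U\) on \((a, b)\), written out here for reference:

\[
E(U) = \frac{a+b}{2}, \qquad \mathrm{sd}(U) = \sqrt{\frac{(b-a)^2}{12}} = \frac{b-a}{\sqrt{12}}
\]

With \(a=1\) and \(b=2\), these give \(E(U) = 1.5\) and \(\mathrm{sd}(U) = 1/\sqrt{12} \approx .289\).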

To draw integers uniformly over {a, a+1, …, b}, we specify runiformint(a, b). In the example below, we draw integers uniformly over {0, 1, …, 100} and then estimate the mean and the standard deviation, which we could compare with their theoretical values of 50 and \(\sqrt{(101^2-1)/12}\approx 29.155\).

. generate u3 = runiformint(0, 100)

. summarize u3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          u3 |     10,000     49.9804    29.19094          0        100
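
The theoretical values for the integer case come from the standard moments of a discrete uniform variate \(K\) on \(\{a, a+1, \ldots, b\}\), which takes \(n = b-a+1\) equally likely values:

\[
E(K) = \frac{a+b}{2}, \qquad \mathrm{sd}(K) = \sqrt{\frac{n^2-1}{12}}
\]

With \(a=0\) and \(b=100\), we have \(n=101\), so \(E(K) = 50\) and \(\mathrm{sd}(K) = \sqrt{(101^2-1)/12} = \sqrt{850} \approx 29.155\).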

Set the seed and make results reproducible

We use set seed # to obtain the same random numbers, which makes the subsequent results reproducible. An RNG produces its numbers from a recursive formula, so the “random” numbers are actually deterministic, although they appear to be random. Setting the seed specifies a starting place for the recursion, which causes the random numbers to be the same, as in the example below.

. drop _all

. set obs 6
number of observations (_N) was 0, now 6

. set seed 12345

. generate x = runiform()

. set seed 12345

. generate y = runiform()

. list x y

     +---------------------+
     |        x          y |
     |---------------------|
  1. | .3576297   .3576297 |
  2. | .4004426   .4004426 |
  3. | .6893833   .6893833 |
  4. | .5597356   .5597356 |
  5. | .5744513   .5744513 |
     |---------------------|
  6. | .2076905   .2076905 |
     +---------------------+

Every time Stata is launched, the seed is set to 123456789.

After generating \(N\) random numbers, the RNG wraps around and starts generating the same sequence all over again. \(N\) is called the period of the RNG. Larger periods are better because we get more random numbers before the sequence wraps. The period of Mersenne Twister is \(2^{19937}-1\), which is huge. Large periods are important when performing complicated simulation studies.

In Stata, the seed is a nonnegative integer (between 0 and \(2^{31}-1\)) that Stata maps onto the state of the RNG. The state of an RNG corresponds to a spot in the sequence. The mapping is not one-to-one because there are more states than seeds. If you want to pick up where you left off in the sequence, you need to restore the state, as in the example below.

. drop _all

. set obs 3
number of observations (_N) was 0, now 3

. set seed 12345

. generate x = runiform()

. local state `c(rngstate)'

. generate y = runiform()

. set rngstate `state'

. generate z = runiform()

. list

     +--------------------------------+
     |        x          y          z |
     |--------------------------------|
  1. | .3576297   .5597356   .5597356 |
  2. | .4004426   .5744513   .5744513 |
  3. | .6893833   .2076905   .2076905 |
     +--------------------------------+

After dropping the data and setting the number of observations to 3, we use generate to put random variates in x, store the state of the RNG in the local macro state, and then put random numbers in y. Next, we use set rngstate to restore the state to what it was before we generated y, and then we generate z. The random numbers in z are the same as those in y because restoring the state caused Stata to start at the same place in the sequence as before we generated y. See Programming an estimation command in Stata: Where to store your stuff for an introduction to local macros.

Random variates from various distributions

So far, we have talked about generating uniformly distributed random numbers. Stata also provides functions that generate random numbers from other distributions. The function names are easy to remember: the letter r followed by the name of the distribution. Some common examples are rnormal(), rbeta(), and rweibull(). In the example below, we draw 5,000 observations from a standard normal distribution and summarize the results.

. drop _all

. set seed 12345

. set obs 5000
number of observations (_N) was 0, now 5,000

. generate w = rnormal()

. summarize w

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           w |      5,000    .0008946    .9903156  -3.478898   3.653764

The estimated mean and standard deviation are close to their true values of 0 and 1.
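
The other random-variate functions take their distribution parameters as arguments. As a minimal sketch, continuing with the 5,000 observations in memory, the commands below draw from a normal with mean 10 and standard deviation 2, a beta with shape parameters 2 and 3, and a Weibull with shape 1.5 and scale 2. The variable names w2, b, and t and the parameter values are chosen only for illustration, and the output is omitted.

. * normal with mean 10 and standard deviation 2
. generate w2 = rnormal(10, 2)

. * beta with shape parameters 2 and 3
. generate b = rbeta(2, 3)

. * Weibull with shape 1.5 and scale 2
. generate t = rweibull(1.5, 2)

. summarize w2 b t

The estimated means of w2 and b could be compared with their theoretical values of 10 and \(2/(2+3) = .4\).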

A note on precision

So far, we have generated random numbers with the default data type of float. Generating the random numbers with type double makes ties occur less frequently. Ties can still occur with type double because the period of the Mersenne Twister far exceeds the roughly \(2^{53}\) distinct values available at a precision of \(2^{-53}\), so a long enough sequence of random numbers will contain repeated values.
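
One way to see the effect of the storage type is to count repeated values. The sketch below draws the same number of uniform variates stored as float and as double and then uses duplicates report to tabulate how often values repeat in each variable. The variable names uf and ud, the seed, and the sample size are illustrative choices, the default type is assumed to be float, and the output is omitted.

. drop _all

. set obs 10000

. set seed 12345

. * stored with the default type, float
. generate uf = runiform()

. * stored in double precision
. generate double ud = runiform()

. duplicates report uf

. duplicates report ud

Any repeated values should appear far less often in ud than in uf, and with a sample this small there may be none in either variable.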

Conclusion

In this post, I showed how to generate random numbers using random-number functions in Stata. I also discussed how to make results reproducible by setting the seed. In subsequent posts, I will delve into other aspects of RNGs, including methods to generate random variates from other distributions and in Mata.

Reference

Matsumoto, M., and T. Nishimura. 1998. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation 8: 3–30.