The topic for today is drawing random samples with replacement. If you haven’t read part 1 and part 2 of this series on random numbers, do so. In the series we’ve

discussed that

- Stata’s
**runiform()** function produces random numbers over the range [0,1). To produce such random numbers, type
. generate double u = runiform()

- To produce
*continuous* random numbers over [*a*,*b*), type
. generate double u = (*b*-a)*runiform() + *a*

- To produce
*integer* random numbers over [*a*,*b*], type
. generate ui = floor((*b*-a+1)*runiform() + *a*)

If *b* > 16,777,216, type

. generate long ui = floor((*b*-a+1)*runiform() + *a*)

- To place observations in random order — to shuffle observations — type
. set seed *#*
. generate double u = runiform()
. sort u

- To draw
*without* replacement a random sample of *n* observations from a dataset of *N* observations, type
. set seed *#*
. sort *variables_that_put_dta_in_unique_order*
. generate double u = runiform()
. sort u
. keep in 1/*n*

If *N*>1,000, generate two random variables u1 and u2 in place of u, and substitute **sort u1 u2** for **sort u**.

- To draw
*without* replacement a *P*-percent random sample, type
. set seed *#*
. keep if runiform() <= *P*/100

I’ve glossed over details, but the above is the gist of it.

Today I’m going to tell you

- To draw a random sample of size
*n* *with* replacement from a dataset of size *N*, type
. set seed *#*
. drop _all
. set obs *n*
. generate long obsno = floor(*N**runiform()+1)
. sort obsno
. save obsnos_to_draw
. use *your_dataset*, clear
. generate long obsno = _n
. merge 1:m obsno using obsnos_to_draw, keep(match) nogen

- You need to set the random-number seed only if you care about reproducibility. I’ll also mention that if
*N* ≤ 16,777,216, it is not necessary to specify that new variable obsno be stored as **long**; the default **float** will be sufficient.

The above solution works whether *n*<*N*, *n*=*N*, or *n*>*N*.

**Drawing samples with replacement**

The solution to sampling with replacement *n* observations from a dataset of size *N* is

- Draw
*n* observation numbers 1, …, *N* with replacement. For instance, if *N=4* and *n*=3, we might draw observation numbers 1, 3, and 3.
- Select those observations from the dataset of interest. For instance, select observations 1, 3, and 3.

As previously discussed in part 1, to generate random integers drawn with replacement over the range [*a*, *b*], use the formula

**generate** *varname* **=** **floor((***b***-***a***+1)*runiform() +** *a***)**

In this case, we want *a*=1 and *b*=*N*, and the formula reduces to,

**generate** *varname* **=** **floor(***N****runiform() + 1)**

So the first half of our solution could read

. drop _all
. set obs *n*
. generate obsno = floor(*N**runiform() + 1)

Now we are merely left with the problem of selecting those observations from our dataset, which we can do using **merge** by typing

. sort obsno
. save obsnos_to_draw
. use *dataset_of_interest*, clear
. generate obsno = _n
. merge 1:m obsno using obsnos_to_draw, keep(match) nogen

Let’s do an example. In part 2 of this series, I had a dataset with observations corresponding to playing cards:

. **use cards**
. **list in 1/5**
+-------------+
| rank suit |
|-------------|
1. | Ace Club |
2. | 2 Club |
3. | 3 Club |
4. | 4 Club |
5. | 5 Club |
+-------------+

There are 52 observations in the dataset; I’m showing you just the first five. Let’s draw 10 cards from the deck, but with replacement.

The first step is to draw the observation numbers. We have *N*=52 cards in the deck, and we want to draw *n*=10, so we generate 10 random integers from the integers [1, 52]:

. **drop _all**
. **set obs 10** // we want *n*=10
obs was 0, now 10
. **gen obsno = floor(52*runiform()+1)** // we draw from *N*=52
. **list obsno** // let's see what we have
+-------+
| obsno |
|-------|
1. | 42 |
2. | 52 |
3. | 16 |
4. | 9 |
5. | 40 |
|-------|
6. | 11 |
7. | 34 |
8. | 20 |
9. | 49 |
10. | 42 |
+-------+

If you look carefully at the list, you will see that observation number 42 repeats. It will be easier to see the duplicate if we sort the list,

. **sort obsno**
. **list**
+-------+
| obsno |
|-------|
1. | 9 |
2. | 11 |
3. | 16 |
4. | 20 |
5. | 34 |
|-------|
6. | 40 |
7. | 42 | <- *Obs. 42 repeats*
8. | 42 | <- *See?*
9. | 49 |
10. | 52 |
+-------+

An observation didn’t have to repeat, but it’s not surprising that one did because in drawing *n*=10 from *N*=52, we would expect one or more repeated cards about 60% of the time.

Anyway, we now know which cards we want, namely cards 9, 11, 16, 20, 34, 40, 42, 42 (again), 49, and 52.

The final step is to select those observations from cards.dta. The way to do that is to perform a one-to-many merge of cards.dta with the list above and keep the matches. Before we can do that, however, we must (1) save the list of observation numbers as a dataset, (2) load cards.dta, and (3) add a variable called **obsno** to it. Then we will be able to perform the merge. So let’s get that out of the way,

. **save obsnos_to_draw** // 1. save the list above
file obsnos_to_draw.dta saved
. **use cards** // 2. load cards.dta
. **gen obsno = _n** // 3. Add variable obsno to it

Now we can perform the merge:

. **merge 1:m obsno using obsnos_to_draw, keep(matched) nogen**
Result # of obs.
-----------------------------------------
not matched 0
matched 10
-----------------------------------------

I’ll list the result, but let me first briefly explain the command

**merge 1:m obsno using obsnos_to_draw, keep(matched) nogen**

**merge** …, we are performing the **merge** command,

… **1:m** …, the merge is one-to-many,

… **using obsnos_to_draw** …, we merge data in memory with obsnos_todraw.dta,

…**, keep(matched)** …, we keep observations that appear in both datasets,

… **nogen**, do not add variable _merge to the resulting dataset; _merge reports the source of the resulting observations; we said **keep(matched)** so we know each came from both sources.

And here is the result:

. list
+-------------------------+
| rank suit obsno |
|-------------------------|
1. | 8 Club 9 |
2. | Jack Club 11 |
3. | Ace Spade 16 |
4. | 2 Diamond 20 |
5. | 6 Spade 34 |
|-------------------------|
6. | 8 Spade 40 |
7. | 9 Heart 42 | <- *Obs. 42 is here ...*
8. | Queen Spade 49 |
9. | King Spade 52 |
10. | 9 Heart 42 | <- *and here*
+-------------------------+

We drew 10 cards — those are the observation numbers on the left. Variable obsno in our dataset records the original observation (card) number and really, we no longer need the variable. Anyway, obsno==42 appears twice, in real observations 7 and 10, and thus we drew the 9 of Hearts twice.

**What could go wrong?**

Not much can go wrong, it turns out. At this point, our generic solution is

. drop _all
. set obs *n*
. generate obsno = floor(*n**runiform()+1)
. sort obsno
. save obsnos_to_draw
. use *our_dataset*
. gen obsno = _n
. merge 1:m obsno using obsnos_to_draw, keep(matched) nogen

If you study this code, there are two lines that might cause problems,

. generate obsno = floor(*N**runiform()+1)

and

. generate obsno = _n

When you are looking for problems and see a **generate** or **replace**, think about rounding.

Let’s look at the right-hand side first. Both calculations produce integers over the range [1, *N*]. **generate** performs all calculations in **double** and the largest integer that can be stored without rounding is 9,007,199,254,740,992 (see previous blog post on precision). Stata allows datasets up to 2,147,483,646, so we can be sure that *N* is less than the maximum precise-integer double. There are no rounding issues on the right-hand side.

Next let’s look at the left-hand side. Variable obsno is being stored as a **float** because we did not instruct otherwise. The largest integer value that can be stored without rounding as a float (also covered in previous blog post on precision) is 16,777,216, and that is less than Stata’s 2,147,483,646 maximum observations. When *N* exceeds 16,777,216, the solution is to store obsno as a **long**. We could remember to use **long** on the rare occasion when dealing with such large datasets, but I’m going to change the generic solution to use **long**s in all cases, even when it’s unnecessary.

What else could go wrong? Well, we tried an example with *n*<*N* and that seemed to work. We should now try examples with *n*=*N* and *n*>*N* to verify there’s no hidden bug or assumption in our code. I’ve tried examples of both and the code works fine.

**We’re done for today**

That’s it. Drawing samples with replacement turns out to be easy, and that shouldn’t surprise us because we have a random-number generator that draws with replacement.

We could complicate the discussion and consider solutions that would run a bit more efficiently when *n*=*N*, which is of special interest in statistics because it is a key ingredient in bootstrapping, but we will not. The above solution works fine in the *n*=*N* case, and I always advise researchers to favor simple-even-if-slower solutions because they will probably save you time. Writing complicated code takes longer than writing simple code, and testing complicated code takes even longer. I know because that’s what we do at StataCorp.