Stata 12 Announced

We are pleased to announce a new version of Stata: Stata 12. You can order it today, it starts shipping on July 25, and you can find out about it at

Here are the highlights of what’s new:

There are other things that are new, too. Things like functions for Tukey’s Studentized range and Dunnett’s multiple range, baseline odds for logistic regression, truncated count-data regressions, probability predictions, robust and cluster-robust SEs for fixed-effects Poisson regression, and the like under General Statistics. Or under Survey Data, support for SEM, bootstrap and successive difference replicate (SDR) weights, goodness of fit after binary models, coefficient of variation, and more. Or under Panel Data, probability predictions, multiple imputation support, and more. Or under Survival Data, a goodness-of-fit statistic that is robust to censoring. Or PDF export of results and graphs.

We could go on, but you get the idea. We think Stata 12 is worth a look.

Precision (yet again), Part II

In part I, I wrote about precision issues in English. If you enjoyed that, you may want to stop reading now, because I’m about to go into the technical details. Actually, these details are pretty interesting.

For instance, I offered the following formula for calculating error due to float precision:

maximum_error = 2^(-24) * X

I later mentioned that the formula is an approximation, and said that the true formula is,

maximum_error = 2^(-24) * 2^(floor(log2(X)))

I didn’t explain how I got either formula.

I need to be more precise today than I was in my previous posting. For instance, I previously used x for two concepts, the true value and the rounded-after-storage value. Today I need to distinguish those concepts.

X is the true value.

x is the value after rounding due to storage.

The issue is the difference between x and X when X is stored in 24-binary-digit float precision.

Base 10

Although I harp on the value of learning to think in binary and hexadecimal, I admit that I, too, find it easier to think in base 10. So let’s start that way.

Say we record numbers to two digits of accuracy, which I will call d=2. Examples of d=2 numbers include

52*10^1 (i.e., 520, but with only two significant digits)

To say that we record numbers to two digits of accuracy is to say that, coming upon the recorded number 1, we know only that the number lies between 0.95 and 1.05; or coming upon 12, that the true number lies between 11.5 and 12.5, and so on. I assume that numbers are rounded efficiently, which is to say, stored values record midpoints of intervals.

Before we get into the math, let me note that most of us would be willing to say that numbers recorded this way are accurate to 1 part in 10 or, if d=3, to 1 part in 100. If numbers are accurate to 1 part in 10^(d-1), then couldn’t we just multiply the number by 1/(10^(d-1)) to obtain the width of the interval? Let’s try:

Assume X=520 and d=2. Then 520/(10^(2-1)) = 52. The true interval, however, is (515, 525] and it has width 10. So the simple formula does not work.

The simple formula does not work, yet I presented its base-2 equivalent in Part 1 and even recommended its use! We will get to that. It turns out the smaller the base, the more accurately the simple formula approximates the true formula, but before I can show that, I need the true formula.

Let’s start by thinking about d=1.

  1. The recorded number 0 will contain all numbers between [-0.5, 0.5). The recorded number 1 will contain all numbers between [0.5, 1.5), and so on. For 0, 1, …, 9, the width of the intervals is 1.

  2. The recorded number 10 will contain all numbers between [5, 15). The recorded number 20 will contain all numbers between [15, 25), and so on. For 10, 20, …, 90, the width of the intervals is 10.

The derivation for the width of interval goes like this:

  1. If we recorded the value of X to one decimal digit, the recorded digit will be b, the recorded value will be x = b*10^p, and the power of ten will be p = floor(log10(X)). More importantly, W1 = 10^p will be the width of the interval containing X.

  2. It therefore follows that if we recorded the value of X to two decimal digits, the interval length would be W2 = W1/10. Whatever the width with one digit, adding another must reduce the width to one-tenth.

  3. If we recorded the value of X to three decimal digits, the interval length would be W3 = W2/10.

  4. Thus, if d is the number of digits to which numbers are recorded, the width of the interval is 10^p where p = floor(log10(X)) - (d-1).

The above formula is exact.
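As a quick check of the formula, here is a minimal Python sketch (Python rather than Stata, purely for illustration). The function names interval_width and simple_width are my own labels, and the sketch assumes X > 0.

```python
import math

def interval_width(X, d):
    """Exact width of the interval containing X when d decimal digits are recorded."""
    p = math.floor(math.log10(X)) - (d - 1)
    return 10 ** p

def simple_width(X, d):
    """The tempting but wrong 'accurate to 1 part in 10^(d-1)' width."""
    return X / 10 ** (d - 1)

print(interval_width(520, 2))  # 10, the width of (515, 525]
print(simple_width(520, 2))    # 52.0, the simple formula's answer
print(interval_width(52, 1))   # 10, one digit recorded, as for 10, 20, ..., 90
print(interval_width(5, 1))    # 1, one digit recorded, as for 0, 1, ..., 9
```

The two functions agree only when X is an exact power of 10, which is why the simple formula does so poorly in base 10.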

Base 2

Converting the formula

interval_width = 10^(floor(log10(X)) - (d-1))

from base 10 to base 2 is easy enough:

interval_width = 2^(floor(log2(X)) - (d-1))

In Part 1, I presented this formula for d=24 as

maximum_error = 2^(floor(log2(X)) - 24) = 2^(-24) * 2^(floor(log2(X)))

In interval_width, it is d-1 and not d that appears in the formula. You might think I made an error and should have put -23 where I put -24 in the maximum_error formula. There is no mistake. In Part 1, the maximum error was defined as a plus-or-minus quantity and is thus half the width of the overall interval. So I divided by 2, and in effect, I did put -23 into the maximum_error formula, at least before I subtracted one more from it, making it -24 again.
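The bound can be checked numerically. The following is a sketch in Python, not Stata: to_float is a stand-in I wrote for rounding to 24-binary-digit float precision (it uses the standard struct module), and floor_log2 reads the binary exponent directly so that the logarithm is not itself subject to rounding. Both names are assumptions of this sketch, and it assumes X > 0.

```python
import math
import struct

def to_float(X):
    """Round a double to the nearest 24-binary-digit float, returned as a double."""
    return struct.unpack("f", struct.pack("f", float(X)))[0]

def floor_log2(X):
    """floor(log2(X)) for X > 0, taken from the binary exponent."""
    mantissa, exponent = math.frexp(X)  # X = mantissa * 2**exponent, 0.5 <= mantissa < 1
    return exponent - 1

def maximum_error(X, d=24):
    """Plus-or-minus bound: half of interval_width = 2^(floor(log2(X)) - (d-1))."""
    return 2.0 ** (floor_log2(X) - d)

for X in (1.1, 0.238, 350.0, 299792.458, 1.5e12):
    actual = abs(to_float(X) - X)
    # actual error, exact bound, and the approximation 2^(-24) * X
    print(X, actual, maximum_error(X), 2.0 ** -24 * X)
```

For each value, the actual error is no larger than the exact bound, and the exact bound is no larger than the approximation.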

I started out this posting by considering and dismissing the base-10 approximation formula

interval_width = 10^(-(d-1)) * X

which in maximum-error units is

maximum_error = 10^(-d) * X

and yet in Part 1, I presented — and even recommended — its base-2, d=24 equivalent,

maximum_error = 2^(-24) * X

It turns out that the approximation formula is not as inaccurate in base 2 as it would be in base 10. The correct formula,

maximum_error = 2^(floor(log2(X)) - d)

can be written

maximum_error = 2^(-d) * 2^(floor(log2(X)))

so the question becomes how accurate it is to substitute X for 2^(floor(log2(X))). We know by examination that X ≥ 2^(floor(log2(X))), so making the substitution will overstate the error and, in that sense, is a safe thing to do. The question becomes how much the error is overstated.

X can be written 2^(log2(X)) and thus we need to compare 2^(log2(X)) with 2^(floor(log2(X))). The floor() function cannot reduce its argument by more than 1, and thus 2^(log2(X)) cannot differ from 2^(floor(log2(X))) by more than a factor of 2. Under the circumstances, this seems a reasonable approximation.

In the case of base 10, the floor() function reducing its argument by up to 1 results in a decrease of up to a factor of 10. That, it seems to me, is not a reasonable amount of error.
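To see the two factors side by side, here is a small Python sketch. The helper names floor_log and overstatement are mine, and the sketch assumes X ≥ 1 so that the power can be found by repeated multiplication rather than by a rounded logarithm.

```python
def floor_log(X, base):
    """floor(log_base(X)) for X >= 1, found by repeated multiplication."""
    p = 0
    while base ** (p + 1) <= X:
        p += 1
    return p

def overstatement(X, base):
    """Factor by which substituting X for base^floor(log_base(X)) overstates the error."""
    return X / base ** floor_log(X, base)

for X in (1.0, 1.999, 9.99, 520.0):
    # the base-2 factor stays below 2; the base-10 factor can approach 10
    print(X, overstatement(X, 2), overstatement(X, 10))
```

At X = 9.99, for example, the base-10 substitution overstates the error by a factor of 9.99, whereas the base-2 substitution overstates it by 9.99/8, about 1.25.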

Categories: Numerical Analysis

Precision (yet again), Part I

I wrote about precision here and here, but they were pretty technical.

“Great,” coworkers inside StataCorp said to me, “but couldn’t you explain these issues in a way that doesn’t get lost in the details of how computers store binary and maybe, just maybe, write about floats and doubles from a user’s perspective instead of programmer’s perspective?”

“Mmmm,” I said clearly.

Later, when I tried, I liked the result. It contains new material, too. What follows is what I now wish I had written first. I would have still written the other two postings, but as technical appendices.

In Part 2 (forthcoming), I provide the mathematical derivations underlying what follows. There are a few interesting issues underlying what follows.

Please excuse the manualish style of what follows, but I suspect that what follows will eventually work its way into Stata’s help files or manuals, so I wrote it that way.



. generate x = 1.1

. list
  (Stata displays output showing x is 1.1 in all observations)

. count if x==1.1

Solution 1:

. count if x==float(1.1)

Solution 2:

. generate double x = 1.1

. count if x==1.1

Solution 3:

. set type double

. generate x = 1.1

. count if x==1.1


Stata works in binary. Stata stores data in float precision by default. Stata performs all calculations in double precision. Sometimes the combination results in surprises until you think more carefully about what happened.


Remarks are presented under the headings

Why count x==1.1 produces 0
How count x==float(1.1) solves the problem
How storing data as double appears to solve the problem (and does)
Float is plenty accurate to store most data
Why don’t I have these problems using Excel?


Justifications for all statements made appear in the sections below. In summary,

  1. It sometimes appears that Stata is inaccurate. That is not true and, in fact, the appearance of inaccuracy happens in part because Stata is so accurate.

  2. You can cover up this appearance of inaccuracy by storing all your data in double precision. This will double or more the size of your dataset, and so I do not recommend the double-precision solution unless your dataset is small relative to the amount of memory on your computer. In that case, there is nothing wrong with storing all your data in double precision.

    The easiest way to implement the double-precision solution is by typing set type double. After that, Stata will default to creating all new variables as doubles, at least for the remainder of the session. If all your datasets are small relative to the amount of memory on your computer, you can set type double, permanently.

  3. The double-precision solution is needlessly wasteful of memory. It is difficult to imagine data that are accurate to more than float precision. Regardless of how your data are stored, Stata does all calculations in double precision, and sometimes in quad precision.

  4. The issue of 1.1 not being equal to 1.1 arises only with “nice” decimal numbers. You just have to remember to use Stata’s float() function when dealing with such numbers.

Why count x==1.1 produces 0

Let’s trace through what happens when you type the commands

. generate x = 1.1

. count if x==1.1

Here is how it works:

  1. Some numbers have no exact finite-digit binary representation just as some numbers have no exact finite-digit decimal representation. One-third, 0.3333… (base 10), is an example of a number with no exact finite-digit decimal representation. In base 12, one-third does have an exact finite-digit representation, namely 0.4 (base 12). In base 2 (binary), base 10 numbers such as 0.1, 0.2, 0.3, 0.4, 0.6, … have no exact finite-digit representation.

  2. Computers store numbers with a finite number of binary digits. In float precision, numbers have 24 binary digits. In double precision, they have 53 binary digits.

    The decimal number 1.1 in binary is 1.000110011001… (base 2). The 1001 on the end repeats forever. Thus, 1.1 (base 10) is stored by a computer as

     1.00011001100110011001101 (base 2)

    in float, or as

     1.0001100110011001100110011001100110011001100110011010 (base 2)

    in double. There are 24 and 53 digits in the numbers above.

  3. Typing generate x = 1.1 results in 1.1 being interpreted as the longer binary number because Stata performs all calculations in double precision. New variable x is created as a float by default. When the more precise number is stored in x, it is rounded to the shorter number.

  4. Thus when you count if x==1.1, the result is 0 because 1.1 is again interpreted as the longer binary number, and the longer number is compared to the shorter number stored in x, and they are not equal.
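The same sequence of events can be imitated outside Stata. The sketch below is Python, not Stata, and to_float is a helper I wrote to mimic storing a value in float precision. Python, like Stata, does its arithmetic and comparisons in double precision.

```python
import struct

def to_float(value):
    """Round a double to float precision, returned as a double (a stand-in for float storage)."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]

x = to_float(1.1)          # like generate x = 1.1 when x is a float

print(x == 1.1)            # False: the shorter number is compared with the longer one
print(x == to_float(1.1))  # True: both sides are rounded to float precision
```

The first comparison corresponds to count if x==1.1 and the second to count if x==float(1.1).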

How count x==float(1.1) solves the problem

One way to fix the problem is to change count if x==1.1 to read count if x==float(1.1):

. generate x = 1.1

. count if x==float(1.1)

Function float() rounds results to float precision. When you type float(1.1), the 1.1 is converted to binary, double precision, namely,

 1.0001100110011001100110011001100110011001100110011010 (base 2)

and float() then rounds that long binary number to

 1.00011001100110011001101 (base 2)

or more correctly, to

 1.0001100110011001100110100000000000000000000000000000 (base 2)

because the number is still stored in double precision. Regardless, this new value is equal to the value stored in x, and so count reports that 100 observations contain float(1.1).

As an aside, when you typed generate x = 1.1, Stata acted as if you typed generate x = float(1.1). Whenever you type generate x = … and x is a float, Stata acts as if you typed generate x = float(…).
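If you want to see the binary digits for yourself, the following Python sketch prints them. It is only an illustration: to_float imitates Stata's float() function, and binary_digits is a helper of my own that expands a value between 1 and 2 exactly, using rational arithmetic so that no further rounding creeps in.

```python
import struct
from fractions import Fraction

def to_float(value):
    """Round a double to float precision, returned as a double (imitates float())."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]

def binary_digits(value, places):
    """Exact binary expansion of a double with 1 <= value < 2, to `places` digits."""
    frac = Fraction(value) - 1  # Fraction(value) is the exact value of the double
    digits = []
    for _ in range(places):
        frac *= 2
        bit = int(frac >= 1)
        digits.append(str(bit))
        frac -= bit
    return "1." + "".join(digits)

print(binary_digits(1.1, 52))            # 1.1 in double precision, 53 digits
print(binary_digits(to_float(1.1), 23))  # float(1.1), 24 digits
print(binary_digits(to_float(1.1), 52))  # float(1.1) held in a double, padded with zeros
```

The last line shows that rounding to float and then holding the result in double precision leaves the same 24 leading digits, followed by zeros.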

How storing data as double appears to solve the problem (and does)

When you type

. generate double x = 1.1

. count if x==1.1

it should be pretty obvious how the problem was solved. Stata stores

1.0001100110011001100110011001100110011001100110011010 (base 2)

in x, and then compares the stored result to

1.0001100110011001100110011001100110011001100110011010 (base 2)

and of course they are equal.

In the Summary above, I referred to this as a cover up. It is a cover up because 1.1 (base 10) is not what is stored in x. What is stored in x is the binary number just shown, and to be equal to 1.1 (base 10), the binary number needs to be suffixed with 1001, and then another 1001, and then another, and so on without end.

Stata tells you that x is equal to 1.1 because Stata converted the 1.1 in count to the same inexact binary representation as Stata previously stored in x, and those two values are equal, but neither is equal to 1.1 (base 10). This leads to an important property of digital computers:

If storage and calculation are done to the same precision, it will appear to the user as if all numbers that the user types are stored without error.

That is, it appears to you as if there is no inaccuracy in storing 1.1 in x when x is a double because Stata performs calculations in double. And it is equally true that it would appear to you as if there were no accuracy issues storing 1.1 when x is stored in float precision if Stata, observing that x is float, performed calculations involving x in float. The fact is that there are accuracy issues in both cases.

“Wait,” you are probably thinking. “I understand your argument, but I’ve always heard that float is inaccurate and double is accurate. I understand from your argument that it is only a matter of degree but, in this case, those two degrees are on opposite sides of an important line.”

“No,” I respond.

What you have heard is right with respect to calculation. What you have heard might apply to data storage too, but that is unlikely. It turns out that float provides plenty of precision to store most real measurements.

Float is plenty accurate to store most data

The misconception that float precision is inaccurate comes from the true statement that float precision is not accurate enough when it comes to making calculations with stored values. Whether float precision is accurate enough for storing values depends solely on the accuracy with which the values are measured.

Float precision provides 24 base-2 (binary) digits, and thus values stored in float precision have a maximum relative error of plus-or-minus 2^(-24) = 5.96e-08, or less than +/-1 part in 15 million.

  1. The U.S. deficit in 2011 is projected to be $1.5 trillion. Stored as a float, the number has a (maximum) error of 2^(-24) * 1.5e+12 = $89,407. That is, if the true number is 1.5 trillion, the number recorded in float precision is guaranteed to be somewhere in the range [(1.5e+12)-89,407, (1.5e+12)+89,407]. The projected U.S. deficit is not known to an accuracy of +/-$89,407.

  2. People in the US work about 40 hours per week, or roughly 0.238 of the hours in the week. 2^(-24) * 0.238 = 1.419e-08 of a week, or 8.6 milliseconds. Time worked in a week is not known to an accuracy of +/-8.6 milliseconds.

  3. A cancer survivor might live 350 days. 2^(-24) * 350 = .00002086, or 1.8 seconds. Time of death is rarely recorded to an accuracy of +/-1.8 seconds. Time of diagnosis is never recorded to such accuracy, nor could it be.

  4. The moon is said to be 384,401 kilometers from the Earth. 2^(-24) * 384,401 = 0.023 kilometers, or 23 meters. At its closest and farthest, the moon is 356,400 and 406,700 kilometers from Earth.

  5. Most fundamental constants of the universe are known to a few parts in a million, which is to say, less precisely than 1 part in 15 million, the accuracy float precision can provide. An exception is the speed of light, measured to be 299,792.458 kilometers per second. Record that as a float and you will be off by 0.01 km/s.

In all the examples except the last, the errors quoted are worst-case scenarios. The actual errors depend on the exact number and require a more tedious calculation (not shown):

  1. For the U.S. deficit, the exact error for 1.5 trillion is -$26,624, which is within the plus or minus $89,407 quoted.

  2. For fraction of the week, at 0.238 the error is -3.3 milliseconds, which is within the +/-8.6 milliseconds quoted.

  3. For cancer survival time, at 350 days the actual error is 0, which is within the +/-1.8 seconds quoted.

  4. For the distance between the Earth and moon, the actual error is 0, which is within the +/-23 meters quoted.

The actual errors may be interesting, but the maximum errors are more useful. Remember the multiplier 2^(-24). All you have to do is multiply a measurement by 2^(-24) and compare the result with the inherent error in the measurement. If 2^(-24) multiplied by the measurement is less than the inherent error, you can use float precision to store your data. Otherwise, you need to use double.
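The rule of thumb is simple enough to write down as a short function. This is a Python sketch, not a Stata feature: the names float_maximum_error and float_is_enough are mine, and the inherent-error figures passed in the last two lines are made-up values chosen only to illustrate the comparison.

```python
def float_maximum_error(measurement):
    """Approximate worst-case storage error in float precision: 2^(-24) * measurement."""
    return 2.0 ** -24 * abs(measurement)

def float_is_enough(measurement, inherent_error):
    """True if float storage error is smaller than the measurement's own error."""
    return float_maximum_error(measurement) < inherent_error

print(round(float_maximum_error(1.5e12)))      # dollars: the deficit example
print(float_maximum_error(350) * 86400)        # seconds: the survival-time example
print(float_maximum_error(384401) * 1000)      # meters: the Earth-moon example

# Hypothetical inherent errors, for illustration only
print(float_is_enough(1.5e12, 1e9))            # deficit known to +/- $1 billion: float is fine
print(float_is_enough(1_000_000.00, 0.005))    # $1 million recorded to the cent: use double
```

The last line previews why dollar-and-cent data are the usual exception: the measurement's own error is half a cent, far smaller than what float precision can hold at that magnitude.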

By the way, the formula

maximum_error = 2^(-24) * x

is an approximation. The true formula is

maximum_error = 2^(-24) * 2^(floor(log2(x)))

It can be readily proven that x ≥ 2^(floor(log2(x))) and thus the approximation formula overstates the maximum error. The approximation formula can overstate the maximum error by as much as a factor of 2.

Float precision is adequate for most data. There is one kind of data, however, where float precision may not be adequate, and that is financial data such as sales data, general ledgers, and the like. People working with dollar-and-cent data, or Euro-and-Eurocent data, or pound sterling-and-penny data, or any other currency data, usually find it best to use doubles. To avoid rounding issues, it is preferable to store the data as pennies. Float precision binary cannot store 0.01, 0.02, and the like, exactly. Integer values, however, can be stored exactly, at least up to a certain size.

Floats can store up to 16,777,215 exactly. If you stored your data in pennies, that would correspond to $167,772.15.

Doubles can store up to 9,007,199,254,740,991 exactly. If you stored your data in pennies, that would correspond to $90,071,992,547,409.91, or just over $90 trillion.
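These limits are 2^24 - 1 and 2^53 - 1, and they are easy to verify. The following Python sketch does so; to_float is again my own stand-in for float storage, and Python's built-in float type is a double.

```python
import struct

def to_float(value):
    """Round a value to float precision, returned as a double (a stand-in for float storage)."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]

FLOAT_LIMIT = 2 ** 24 - 1    # 16,777,215
DOUBLE_LIMIT = 2 ** 53 - 1   # 9,007,199,254,740,991

print(to_float(FLOAT_LIMIT) == FLOAT_LIMIT)   # True: stored exactly
print(to_float(2 ** 24 + 1) == 2 ** 24 + 1)   # False: 16,777,217 cannot be held in a float
print(float(DOUBLE_LIMIT) == DOUBLE_LIMIT)    # True: stored exactly
print(float(2 ** 53 + 1) == 2 ** 53 + 1)      # False: one past 2^53 cannot be held in a double
print(to_float(0.01) == 0.01)                 # False: one cent is not exact in binary
```

Storing pennies as integers sidesteps the last problem entirely, as long as the totals stay within the limits above.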

Why don’t I have these problems using Excel?

You do not have these problems when you use Excel because Excel stores numeric values in double precision. As I explained in How storing data as double appears to solve the problem (and does) above,

If storage and calculation are done to the same precision, it will appear to the user as if all numbers that the user types are stored without error.

You can adopt the Excel solution in Stata by typing

. set type double, permanently

You will double (or more) the amount of memory Stata uses to store your data, but if that is not of concern to you, there are no other disadvantages to adopting this solution. If you adopt this solution and later wish to change your mind, type

. set type float, permanently

That’s all for today

If you enjoyed the above, you may want to see Part II (forthcoming). As I said, there are a few technical issues underlying what is written above that may interest those interested in computer science as it applies to statistical computing.

Categories: Numerical Analysis

Stata at JSM 2011 in Miami Beach, FL

StataCorp invites you to stop by our booth, 404, at JSM 2011, July 31 – August 3, in Miami Beach, FL. StataCorp staff and developers will be on hand to answer any questions you have about Stata, from statistics to programming to licensing. You can also register to win a copy of quad-core Stata/MP.

StataCorp is also presenting three continuing education technology workshops at JSM 2011:

Survey Data Analysis with Stata
Jeffrey Pitblado, Associate Director, Statistical Software
Wednesday, August 3, 8:00 AM – 9:45 AM
Register for Activity Number CE_24T

Multiple Imputation Analysis in Stata
Yulia Marchenko, Associate Director, Biostatistics
Wednesday, August 3, 10:00 AM – 11:45 AM
Register for Activity Number CE_28T

Multilevel and Mixed Models in Stata
Bill Rising, Director of Educational Services
Wednesday, August 3, 1:00 PM – 2:45 PM
Register for Activity Number CE_32T

To register for the workshops, sign up when you register to attend JSM or go to

We look forward to seeing you in Miami Beach. Be sure to stop by booth 404 to learn more about Stata or just to visit with the people who make it.

Categories: Meetings

Merging data, part 2: Multiple-key merges

Multiple-key merges arise when more than one variable is required to uniquely identify the observations in your data. In Merging data, part 1, I discussed single-key merges such as

        . merge 1:1 personid using ...

In that discussion, each observation in the dataset could be uniquely identified on the basis of a single variable. In panel or longitudinal datasets, there are multiple observations on each person or thing and to uniquely identify the observations, we need at least two key variables, such as

        . merge 1:1 personid date using ...

In this dataset we have repeated observations on persons and, within person, the observations can be uniquely identified by the calendar date.

Just to fix ideas, let’s assume I have two datasets. The first, sample.dta, is the one of analytic interest to me. I have data on 737 persons. For each person, I have data recorded on the first and fifteenth of every month, from year 2000 to 2009. Overall, my dataset has 176,880 observations.

The second dataset contains additional information (variables) on the sample of interest. It has over 3,000 people in it and it covers a longer stretch of time. I’ve been told that most of my 737 people should be in this second dataset, but I’ve been warned that, due to data collection or data processing errors over the years, a small fraction will not be found.

“How many?” I asked Bob from data processing during a fictional conversation, my paranoia kicking in.

“I don’t know. Most. Ninety-nine percent. It’s just random stuff,” he replied, knowing how well I respond to the word random.

Let’s call this second set of data the payroll data, although if I can imagine fictional conversations, you can imagine the data are something else. They might be records from follow-up visits of a medical experiment.

In any case, I receive the data, and here is what happened when I merged the data with my sample:

        . use sample, clear

        . merge 1:1 personid date using payroll, keep(master match)

            Result                           # of obs.
            not matched                         2,352  
                from master                     2,352  (_merge==1)
                from using                          0  (_merge==2)

            matched                           174,528  (_merge==3)

In my sample data, I have 174,528 + 2,352 = 176,880 observations. Of those, 174,528 matched, which is 98.7 percent. (The reason that the number of records from the using (payroll) data that were not matched is zero is because I specified option keep(master match), meaning I discarded the unmatched payroll records. Had I not, the number would have been in the low millions.)

For many in this situation, the story would stop right here. Not for me. I want to show you how to tear into multiple-key merges to reassure yourself that things really are as they appear. You realize, of course, that I manufactured this fictional data for this blog entry and I buried a little something that once we find it, would scare you if this were a real story. So I’ll tell you now, this story is loosely based on a real story.

Step 1: Following my own advice

In Merging data, part 1, I recommended that you merge on all common variables, not just the identification variables. This blog entry is not going to rehash the previous blog entry, but I want to emphasize that everything I said in the previous entry about single-key merges applies equally to multiple-key merges. These two datasets share a variable recording the division in which the employee works, so I included it among the match variables:

        . use sample, clear

        . merge 1:1 personid date division using payroll, keep(master match)

            Result                           # of obs.
            not matched                         2,352  
                from master                     2,352  (_merge==1)
                from using                          0  (_merge==2)

            matched                           174,528  (_merge==3)

The output above matches the output when I merged on personid and date alone, so I do not appear to have a merge-gone-bad problem. These merged data are looking better and better.

Step 2: Merge on each key variable in isolation

Let’s imagine what could go wrong. Imagine that all the data for certain persons were missing, or that all the data for certain dates were missing. That might not be a problem, but it would certainly raise questions. Depending on the answers, it may be worth a footnote or concerning enough to return the data and ask for a refund.

Finding persons or dates that are entirely unmatched is a lot of work unless you know the following trick: Merge on one key variable at a time.

Let’s start with personid:

        . use sample, clear

        . sort personid

        . by personid: keep if _n==1           // <- don't skip this step
        (176143 observations deleted)

        . merge 1:m personid using payroll, keep(master match)

            Result                           # of obs.
            not matched                             0
            matched                           174,528  (_merge==3)

The output above proves that payroll.dta contains data on every person that appears in sample.dta.

Let me explain. I began by using my sample data and keeping just one observation for every value of personid. I don't care which observation I keep, I just need to keep one and only one. Then I merged on personid, keeping (1) the records that match and (2) the records from the master that do not match. I have no interest in the resulting dataset; I just wanted to see the table merge would report. merge reports that 174,528 personids matched, and 0 did not. Ergo, every value of personid that appears in sample.dta also appears in payroll.dta.
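For readers who think more easily in a general-purpose language, the logic of the trick can be sketched in Python. This is an illustration only, not how merge is implemented: the function unmatched is my own, the two small lists are made-up toy data rather than the datasets in the story, and building a set of distinct keys plays the role of by personid: keep if _n==1.

```python
def unmatched(master, using, keys):
    """Key combinations found in master but not in using (the _merge==1 cases)."""
    master_keys = {tuple(row[k] for k in keys) for row in master}
    using_keys = {tuple(row[k] for k in keys) for row in using}
    return sorted(master_keys - using_keys)

# Hypothetical toy data, not the datasets from the story
sample = [
    {"personid": 1, "date": "01jan2007"},
    {"personid": 1, "date": "15jan2007"},
    {"personid": 2, "date": "01jan2007"},
    {"personid": 2, "date": "15jan2007"},
]
payroll = [
    {"personid": 1, "date": "01jan2007"},
    {"personid": 1, "date": "15jan2007"},
    {"personid": 2, "date": "01jan2007"},
    {"personid": 3, "date": "15jan2007"},
]

print(unmatched(sample, payroll, ["personid"]))          # []: every person is found
print(unmatched(sample, payroll, ["date"]))              # []: every date is found
print(unmatched(sample, payroll, ["personid", "date"]))  # [(2, '15jan2007')]
```

In the toy data, every person and every date is found in the payroll records even though one person-date combination is missing, which is exactly the situation the one-key-at-a-time checks are designed to sort out.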

Had merge reported "not matched" = 2, that would mean there would have been two values of personid appearing in sample.dta that do not appear in payroll.dta. It would not have been an indictment of the data if two persons were not matched in their entirety, but I would certainly have looked into the issue. With the merged result in memory, I would have typed

        . list personid if _merge==1
          (output would appear) 

I would have written down the two personids list reported. Then I would have returned to my sample data and looked at the data I had on those two people:

        . use sample, clear

        . list if personid==735527
          (output would appear)

        . list if personid==29887
          (output would appear)

It might be that 735527 was with the company for only a short time and thus the missing payroll record a believable random event. If 735527 had been with the company all ten years, however, I would be back on the phone seeking an explanation. (If these were medical data, certainly you would want to know how a person who never reported for a follow-up visit is known to still be alive after ten years.)

So much for personid. Let's do the same for date:

        . use sample, clear  
        . sort date
        . by date: keep if _n==1
        (176640 observations deleted)
        . merge 1:m date using payroll, keep(master match)
            Result                           # of obs.
            not matched                             0
            matched                           236,832  (_merge==3)

Every date that appears in sample.dta also appears in payroll.dta.

Finally, let's look at division:

        . use sample, clear

        . sort division

        . by division: keep if _n==1
          (output omitted)

        . merge 1:m division using payroll, keep(master match)
          (output omitted; merge reports not matched = 0)

Every division that appears in sample.dta appears in payroll.dta.

These data are looking better and better.

If we had only two key variables, we would be done. We, however, are performing the full merge on three variables, namely personid, date, and division, and so there is one more set of comparisons we should examine.

Step 3: Merge on every pair of key variables

With three key variables, the possible pairs are (personid, date), (personid, division), and (division, date). We have already looked at (personid, date), so that just leaves (personid, division) and (division, date).

The method is the same as in Step 2 except that we type two variables where we previously typed one:

        . use sample, clear

        . sort personid division

        . by personid division: keep if _n==1
        (176143 observations deleted)

        . merge 1:m personid division using payroll, keep(master match)

            Result                           # of obs.
            not matched                             0
            matched                           174,528  (_merge==3)

We discover that every personid-division combination that appears in sample.dta also appears in payroll.dta.

Last is (division, date):

        . use sample, clear

        . sort division date

        . by division date: keep if _n==1
        (175200 observations deleted)

        . merge 1:m division date using payroll, keep(master match)

            Result                           # of obs.
            not matched                            24
                from master                        24  (_merge==1)
                from using                          0  (_merge==2)

            matched                           236,832  (_merge==3)

Surprise! Well, you're not surprised because I told you earlier we were going to find something, but if this were real life, you would be surprised after all these reassuring results.

We discover that there are 24 division-date combinations appearing in sample.dta that do not appear in payroll.dta. Let's look at the 24 missing combinations:

        . list division date if _merge==1

                | division        date |
          1129. |        5   01jan2007 |
          1130. |        5   15jan2007 |
          1131. |        5   01feb2007 |
          1132. |        5   15feb2007 |
          1133. |        5   01mar2007 |
          1134. |        5   15mar2007 |
          1135. |        5   01apr2007 |
          1136. |        5   15apr2007 |
          1137. |        5   01may2007 |
          1138. |        5   15may2007 |
          1139. |        5   01jun2007 |
          1140. |        5   15jun2007 |
          1141. |        5   01jul2007 |
          1142. |        5   15jul2007 |
          1143. |        5   01aug2007 |
          1144. |        5   15aug2007 |
          1145. |        5   01sep2007 |
          1146. |        5   15sep2007 |
          1147. |        5   01oct2007 |
          1148. |        5   15oct2007 |
          1149. |        5   01nov2007 |
          1150. |        5   15nov2007 |
          1151. |        5   01dec2007 |
          1152. |        5   15dec2007 |

If you look closely, you will notice that every payroll date in 2007 is listed. So what happened to the payroll records for division 5 in 2007? This may indeed be exactly the kind of random event that Bob had in mind during our fictional conversation. Somehow the company lost a little cluster of payroll records. The loss may mean nothing. Or it might be of critical importance. Imagine there's been an allegation that the company treats older workers poorly and imagine that division 5 has the highest average age. Not random. Not random at all.

Step 4: Merge on every triplet of key variables

So much for the fictional story.

If we had four or more key variables, we would now need to merge on every triplet of key variables, and if we had five or more key variables, we would then need to merge on every quadruplet of key variables, and if ...
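To get a feel for how many merges that implies, here is a minimal sketch, written in Python rather than Stata purely for illustration, that enumerates every triplet of key variables. The variable names are hypothetical stand-ins, not the keys of the payroll example.

```python
from itertools import combinations

# Hypothetical key variables; substitute the real keys of your merge.
keys = ["division", "date", "employee", "job"]

def key_subsets(key_vars, size):
    """Return every subset of key_vars with the given number of variables."""
    return [list(subset) for subset in combinations(key_vars, size)]

triplets = key_subsets(keys, 3)
for subset in triplets:
    # Each line names one merge to run and inspect.
    print("merge on:", " ".join(subset))
```

With four keys there are four triplets; the count grows quickly as keys are added, which is one reason to script the enumeration rather than type it.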

Forget the story. Or remember it if it scares you. Data processing and paranoia make an excellent pairing. What's important is how easy it is to take complicated, multiple-key merges apart. I've never met anyone yet who knew this trick.

Categories: Data Management

Stata Conferences and Meetings Update

Between now and the end of the year, the annual Stata Conference in the United States will take place along with five other Stata meetings in countries around the world.

Stata conferences and meetings feature talks by both Stata users and Stata developers and provide an opportunity to help shape the future of Stata development by interacting with and providing feedback directly to StataCorp personnel.

The talks range from longer presentations by invited speakers to shorter talks demonstrating the use of Stata in a variety of fields. Some talks are statistical in nature while others focus on data management, graphics, or programming in Stata. New enhancements to Stata created both by users and by StataCorp are often featured in talks.

The full schedule of upcoming meetings is

2011 Mexican Stata Users Group meeting
May 12, 2011

2011 German Stata Users Group meeting
July 1, 2011

Stata Conference Chicago 2011
July 14–15, 2011

2011 UK Stata Users Group meeting
September 15–16, 2011

2011 Spanish Stata Users Group meeting
September 22, 2011

2011 Nordic and Baltic Stata Users Group meeting
November 11, 2011

Click on any meeting title for more information, including programs and registration information.

Categories: Meetings

Merging data, part 1: Merges gone bad

Merging concerns combining datasets on the same observations to produce a result with more variables. We will call the datasets one.dta and two.dta.

When it comes to combining datasets, the alternative to merging is appending, which is combining datasets on the same variables to produce a result with more observations. Appending datasets is not the subject for today. But just to fix ideas, appending looks like this:

              | var1  var2  var3  |      one.dta
           1. | one.dta           |
           2. |                   |
            . |                   |
            . |                   |


              | var1  var2  var3  |      two.dta
           1. | two.dta           |
           2. |                   |
            . |                   |


              | var1  var2  var3  |
           1. |                   |    one.dta
           2. |                   |
            . |                   |
            . |                   |
              +                   +      +
        N1+1. |                   |    two.dta   appended
        N1+2. |                   |
            . |                   |

Merging looks like this:

      +-------------------+           +-----------+
      | var1  var2  var3  |           | var4 var5 |
      +-------------------+           +-----------+
   1. |                   |        1. |           |
   2. |                   |    +   2. |           |     =
    . |                   |         . |           |
    . |                   |         . |           |
      +-------------------+           +-----------+
        one.dta                         two.dta

                        | var1  var2  var3    var4 var5 |
                     1. |                               |
                     2. |                               |
                      . |                               |
                      . |                               |
                          one.dta           + two.dta    merged

The matching of the two datasets — deciding which observations in one.dta are combined with which observations in two.dta — could be done simply on the observation numbers: Match one.dta observation 1 with two.dta observation 1, match one.dta observation 2 with two.dta observation 2, and so on. In Stata, you could obtain that result by typing

. use one, clear

. merge 1:1 _n using two

Never do this because it is too dangerous. You are merely assuming that observation 1 matches with observation 1, observation 2 matches with observation 2, and so on. What if you are wrong? If observation 2 in one.dta is Bob and observation 2 in two.dta is Mary, you will mistakenly combine the observations for Bob and Mary and, perhaps, never notice the mistake.

The better solution is to match the observations on equal values of an identification variable. This way, the observation with id="Mary" is matched with the observation with id="Mary", id="Bob" with id="Bob", id="United States" with id="United States", and id=4934934193 with id=4934934193. In Stata, you do this by typing

. use one, clear

. merge 1:1 id using two

Things can still go wrong. For instance, id="Bob" will not match id="Bob " (with the trailing blank), but if you expected all the observations to match, you will ultimately notice the mistake. Mistakenly unmatched observations tend to get noticed because of all the missing values they cause in subsequent calculations.

It is the mistakenly combined observations that can go unnoticed.

And that is the topic for today, mistakenly matched observations, or merges gone bad.

Observations are mistakenly combined more often than many researchers realize. I’ve seen it happen. I’ve seen it happen, be discovered later, and necessitate withdrawn results. You seriously need to consider the possibility that this could happen to you. Only three things are certain in this world: death, taxes, and merges gone bad.

I am going to assume that you are familiar with merging datasets both conceptually and practically; that you already know what 1:1, m:1, 1:m, and m:n mean; and that you know the role played by “key” variables such as ID. I am going to assume you are familiar with Stata’s merge command. If any of this is untrue, read [D] merge. Type help merge in Stata and click on [D] merge at the top to take you to the full PDF manuals. We are going to pick up where the discussion in [D] merge leaves off.

Detecting when merges go bad

As I said, the topic for today is merges gone bad, by which I mean producing a merged result with the wrong records combined. It is difficult to imagine that typing

. use one, clear

. merge 1:1 id using two

could produce such a result because, to be matched, the observations had to have equal values of the ID. Bob matched with Bob, Mary matched with Mary, and so on.

Right you are. There is no problem, assuming the values in the id variable are correct and consistent between datasets. But what if id==4713 means Bob in one dataset and Mary in the other? That can happen if the id variable is simply wrong from the outset or if the id variable became corrupted in prior processing.

1. Use theory to check IDs if they are numeric

One way the id variable can become corrupted is if it is not stored properly or if it is read improperly. This can happen to both string and numeric variables, but right now, we are going to emphasize the numeric case.

Say the identification variable is Social Security number, an example of which is 888-88-8888. Social Security numbers are invariably stored in computers as 888888888, which is to say that they are run together and look a lot like the number 888,888,888. Sometimes they are even stored numerically. Say you have a raw data file containing perfectly valid Social Security numbers recorded in just this manner. Say you read the number as a float. Then 888888888 becomes 888888896, and so does every Social Security number between 888888865 and 888888927, some 63 in total. If Bob has Social Security number 888888869 and Mary has 888888921, and Bob appears in dataset one and Mary in dataset two, then Bob and Mary will be combined because they share the same rounded Social Security number.
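The rounding described above can be reproduced outside Stata. The following is a minimal sketch in Python, assuming only that the platform uses IEEE 754 single precision for the `f` format of the standard `struct` module; the helper name `as_float32` is made up for this illustration.

```python
import struct

def as_float32(value: int) -> int:
    """Round an integer to the nearest 32-bit float and return the stored value."""
    packed = struct.pack("<f", float(value))
    return int(struct.unpack("<f", packed)[0])

stored = as_float32(888888888)

# Scan a window around the number and keep every ID that collapses
# to the same stored float.
collisions = [n for n in range(888888800, 888889000) if as_float32(n) == stored]

print(stored)                          # value actually held by the float
print(collisions[0], collisions[-1])   # first and last colliding numbers
print(len(collisions))                 # how many IDs share one stored value
```

The scan reports the stored value 888888896 and the 63 numbers from 888888865 through 888888927, matching the figures in the text.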

Always be suspicious of numeric ID variables stored numerically, not just those stored as floats.

When I read raw data and store the ID variables as numeric, I worry whether I have specified a storage type sufficient to avoid rounding. When I obtain data from other sources that contain numeric ID variables, I assume that the other source improperly stored the values until proven otherwise.

Perhaps you remember that 16,777,215 is the largest integer that can be stored precisely as a float and 9,007,199,254,740,991 is the largest that can be stored precisely as a double. I never do.

Instead, I ask Stata to show me the largest theoretical ID number in hexadecimal. For Social Security numbers, the largest is 999-99-9999, so I type

. inbase 16 999999999

Stata’s inbase command converts decimal numbers to different bases. I learn that 999999999 base-10 is 3b9ac9ff base-16, but I don’t care about the details; I just want to know the number of base-16 digits required. 3b9ac9ff has 8 digits. It takes 8 base-16 digits to record 999999999. As you learned in How to read the %21x format, part 2, doubles can record 13 base-16 digits and floats can record 5.75 digits (the 0.75 part being because the last digit must be even). Those numbers I do remember. If I didn’t, I would just display a number in %21x format and count the digits to the right of the binary point. Anyway, Social Security numbers can be stored in doubles because 8<13, the number of digits double provides, but not in floats because 8 is not < 5.75, the number of digits float provides.

If Social Security numbers contained 12 digits rather than 9, the largest would be

. inbase 16 999999999999

which has 10 base-16 digits, and because 10<13, it would still fit into a double.

Anyway, if I discover that the storage type is insufficient to store the ID number, I know the ID numbers must be rounded.
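The same check can be sketched in Python for readers who want to script it. This is only an illustration: the function names are invented here, and instead of counting base-16 digits it compares the largest theoretical ID directly against 2^24 and 2^53, the points up to which floats and doubles hold every integer exactly.

```python
FLOAT_EXACT_LIMIT = 2 ** 24    # every integer up to and including this fits in a float
DOUBLE_EXACT_LIMIT = 2 ** 53   # every integer up to and including this fits in a double

def hex_digits(n: int) -> int:
    """Number of base-16 digits needed to write n."""
    return len(format(n, "x"))

def sufficient_type(largest_id: int) -> str:
    """Name a storage type that holds every ID up to largest_id without rounding."""
    if largest_id <= FLOAT_EXACT_LIMIT:
        return "float"
    if largest_id <= DOUBLE_EXACT_LIMIT:
        return "double"
    return "neither"

for largest in (999999999, 999999999999):
    print(format(largest, "x"), hex_digits(largest), sufficient_type(largest))
```

Both the 9-digit and the 12-digit cases come out as needing a double, in agreement with the digit-counting argument.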

2. Check uniqueness of IDs

I said that when I obtain data from other sources, I assume that the other source improperly stored the ID variables until proven otherwise. I should have said, until evidence accumulates to the contrary. Even if the storage type used is sufficient, I do not know what happened in previous processing of the data.

Here’s one way using datasets one.dta and two.dta to accumulate some of that evidence:

. use one, clear              // test 1
. sort id
. by id: assert _N==1

. use two, clear              // test 2
. sort id
. by id: assert _N==1

In these tests, I am verifying that the IDs really are unique in the two datasets that I have. Tests 1 and 2 are unnecessary when I plan later to merge 1:1 because the 1:1 part will cause Stata itself to check that the IDs are unique. Nevertheless, I run the tests. I do this because the datasets I merge are often subsets of the original data, and I want to use all the evidence I have to invalidate the claim that the ID variables really are unique.

Sometimes I receive datasets where it takes two variables to uniquely identify an observation. Perhaps I receive data on persons over time, along with the claim that the ID variable is name. The documentation also notes that variable date records when the observation was made. Thus, uniquely identifying each observation requires both name and date, and I type

. sort name date
. by name date: assert _N==1

I am not suspicious of only datasets I receive. I run this same test on datasets I create.
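For readers who also handle data outside Stata, the logic of by ...: assert _N==1 can be sketched in a few lines of Python. The function name and the sample records below are invented for illustration.

```python
from collections import Counter

def duplicated_keys(rows, key_vars):
    """Return the key combinations that occur in more than one row."""
    counts = Counter(tuple(row[v] for v in key_vars) for row in rows)
    return sorted(key for key, count in counts.items() if count > 1)

# Hypothetical person-by-date records.
people = [
    {"name": "Bob",  "date": "2009-01-01"},
    {"name": "Bob",  "date": "2010-01-01"},
    {"name": "Mary", "date": "2009-01-01"},
]

# name and date together identify each observation ...
assert duplicated_keys(people, ["name", "date"]) == []

# ... but name alone does not.
print(duplicated_keys(people, ["name"]))
```

An empty list plays the role of a passed assertion; anything else names the offending keys, which is often more useful than a bare failure.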

3. Merge on all common variables

At this point, I know the ID variable(s) are unique in each dataset. Now I consider the idea that the ID variables are inconsistent across datasets, which is to say that Bob in one dataset, however he is identified, means Mary in the other. Detecting such problems is always problematic, but not nearly as problematic as you might guess.

It is rare that the datasets I need to merge have no variables in common except the ID variable. If the datasets are on persons, perhaps both datasets contain each person’s sex. In that case, I could merge the two datasets and verify that the sex is the same in both. Actually, I can do something easier than that: I can add variable sex to the key variables of the merge:

. use one, clear
. merge 1:1 id sex using two

Assume I have a valid ID variable. Then adding variable sex does not affect the outcome of the merge because sex is constant within id. I obtain the same results as typing merge 1:1 id using two.

Now assume the id variable is invalid. Compared with the results of merge 1:1 id using two, Bob will no longer match with Mary even if they have the same ID. Instead I will obtain separate, unmatched observations for Bob and Mary in the merged data. Thus to complete the test that there are no such mismatches, I must verify that the id variable is unique in the merged result. The complete code reads

. use one, clear
. merge 1:1 id sex using two
. sort id
. by id: assert _N==1

And now you know why in tests 1 and 2 I checked the uniqueness of ID within dataset by hand rather than depending on merge 1:1. The 1:1 merge I just performed is on id and sex, and thus merge does not check the uniqueness of ID in each dataset. I checked by hand the uniqueness of ID in each dataset and then checked the uniqueness of the result by hand, too.
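To see why the test works, here is a minimal sketch of the same idea in Python. It is not how Stata's merge is implemented; the helper functions and the two tiny datasets are hypothetical, with id 4713 deliberately meaning Bob in one dataset and Mary in the other.

```python
from collections import Counter

def outer_merge(one, two, keys):
    """Full outer 1:1 merge of two lists of dicts on the given key variables."""
    def index(rows):
        table = {}
        for row in rows:
            key = tuple(row[v] for v in keys)
            if key in table:
                raise ValueError(f"key {key} is not unique")
            table[key] = row
        return table

    left, right = index(one), index(two)
    merged = []
    for key in sorted(set(left) | set(right)):
        row = {**left.get(key, {}), **right.get(key, {})}
        # Mimic _merge: 1 = only in one, 2 = only in two, 3 = matched.
        row["_merge"] = 3 if (key in left and key in right) else (1 if key in left else 2)
        merged.append(row)
    return merged

def ids_not_unique(merged, id_var="id"):
    """IDs that appear more than once in the merged result."""
    counts = Counter(row[id_var] for row in merged)
    return sorted(i for i, count in counts.items() if count > 1)

one = [{"id": 4713, "sex": "M", "wage": 20}, {"id": 5001, "sex": "F", "wage": 25}]
two = [{"id": 4713, "sex": "F", "age": 41}, {"id": 5001, "sex": "F", "age": 38}]

merged = outer_merge(one, two, ["id", "sex"])
suspects = ids_not_unique(merged)
print(suspects)   # IDs whose records disagree on sex
```

Merging on id alone would silently combine the two 4713 records; adding sex to the keys leaves them unmatched, and the uniqueness check on id then flags them.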

Passing the above test does not prove that the ID variable is consistent and thus the merge is correct, but if the assertion is false, I know with certainty either that I have an invalid ID variable or that sex is miscoded in one of the datasets. If my data has roughly equal numbers of males and females, then the test has a 50 percent chance of detecting a mismatched pair of observations, such as Bob and Mary. If I have just 10 mismatched observations, I have a 1 - 0.5^10 = 0.9990 probability of detecting the problem.
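The arithmetic behind that probability is short enough to sketch. The Python below assumes the mismatches are independent and that each mismatched pair agrees on the check variable with probability p_same (0.5 for sex in a balanced sample); the 0.95 in the second call is a made-up value for a poorly discriminating variable.

```python
def detection_probability(mismatches: int, p_same: float = 0.5) -> float:
    """Chance that at least one of the mismatched pairs disagrees on the check variable."""
    return 1.0 - p_same ** mismatches

print(round(detection_probability(1), 4))         # a single mismatched pair
print(round(detection_probability(10), 4))        # ten mismatched pairs
print(round(detection_probability(10, 0.95), 4))  # hypothetical weak check variable
```

The weaker the check variable separates people, the closer p_same is to 1 and the less a passed test tells you.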

I should warn you that if you want to keep just the matched observations, do not perform the merge by coding merge 1:1 id sex using two, keep(matched). You must keep the unmatched observations to perform the final part of the test, namely, that the ID numbers are unique. Then you can drop the unmatched observations.

. use one, clear
. merge 1:1 id sex using two
. sort id
. by id: assert _N==1
. keep if _merge==3

There may be more than one variable that you expect to be the same in combined observations. A convenient feature of this test is that you can add as many expected-to-be-constant variables to merge’s keylist as you wish:

. use one, clear
. merge 1:1 id sex hiredate groupnumber using two
. sort id
. by id: assert _N==1
. keep if _merge==3

It is rare that there is not at least one variable other than the ID variable that is expected to be equal, but it does happen. Even if you have expected-to-be-constant variables, they may not work as well in detecting problems as variable sex in the example above. The distribution of the variable matters. If your data are of people known to be alive in 1980 and the known-to-be-constant variable is whether born after 1900, even mismatched observations would be likely to have the same value of the variable because most people alive in 1980 were born after 1900.

4. Look at a random sample

This test is weak, but you should do it anyway, if only because it’s so easy. List some of the combined observations and look at them.

. list in 1/5

Do the combined results look like they go together?

By the way, the right way to do this is

. gen u = runiform()
. sort u
. list in 1/5
. drop u

You do not want to look at the first observations because, having small values of ID, they are probably not representative. However IDs are assigned, the process is unlikely to be randomized. Persons with low values of ID will be younger, or older; or healthier, or sicker; or ….

5. Look at a nonrandom sample

You just merged two datasets, so obviously you did that because you needed the variables and those variables are somehow related to the existing variables. Perhaps your data is on persons, and you combined the 2009 data with the 2010 data. Perhaps your data is on countries, and you added export data to your import data. Whatever you just added, it is not random. If it were, you could have saved yourself time by simply generating the new variables containing random numbers.

So generate an index that measures a new variable in terms of an old one, such as

. gen diff = income2010 - income2009

or

. gen diff = exports - imports

Then sort on the variable and look at the observations containing the most outlandish values of your index:

. sort diff
. list in  1/5
. list in -5/l

These are the observations most likely to be mistakenly combined. Do you believe those observations were combined correctly?


I admit I am not suspicious of every merge I perform. I have built up trust over time in datasets that I have worked with previously. Even so, my ability to make errors is equal to yours, and even with trustworthy datasets, I can introduce problems long before I get to the merge. You need to carefully consider the consequences of a mistake. I do not know anyone who performs merges who has not performed a merge gone bad. The question is whether he or she detected it. I hope so.

Categories: Data Management

Multiprocessor (core) software (think Stata/MP) and percent parallelization

When most people first think about software designed to run on multiple cores such as Stata/MP, they think to themselves, two cores, twice as fast; four cores, four times as fast. They appreciate that reality will somehow intrude so that two cores won’t really be twice as fast as one, but they imagine the intrusion is something like friction and nothing that an intelligently placed drop of oil can’t improve.

In fact, something inherent intrudes. In any process to accomplish something, even a physical process, some parts may be able to be performed in parallel, but there are invariably parts that just have to be performed one after the other. Anyone who cooks knows that you sometimes add some ingredients, cook a bit, and then add others, and cook some more. So it is, too, with calculating x_t = f(x_{t-1}) for t = 1 to 100 with x_0 = 1. Depending on the form of f(), sometimes there’s no alternative to calculating x_1 = f(x_0), then calculating x_2 = f(x_1), and so on.

In any calculation, some proportion p of the calculation can be parallelized and the remainder, 1-p, cannot. Consider a calculation that would take T hours if it were performed sequentially on a single core. If we had an infinite number of cores and the best possible implementation of the code in parallelized form, the execution time would fall to (1-p)T hours. The part that could be parallelized, which ordinarily would run in pT hours, would run in literally no time at all once split across an infinite number of cores, and that would still leave (1-p)T hours to go. This is known as Amdahl’s Law.

We can generalize this formula to computers with a finite number of cores, say n of them. The parallelizable part of the calculation, the part that would ordinarily run in pT hours, will run in pT/n. The unparallelizable part will still take (1-p)T hours, so we have

T_n = pT/n + (1-p)T

As n goes to infinity, T_n goes to (1-p)T.
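The formula is easy to tabulate. Here is a minimal Python sketch of it; the function names are invented for illustration, and the core counts are simply the ones Stata/MP users are likely to meet.

```python
def run_time(p: float, n: int, T: float = 1.0) -> float:
    """Amdahl's Law: run time on n cores when proportion p of the work parallelizes."""
    return p * T / n + (1.0 - p) * T

def limiting_run_time(p: float, T: float = 1.0) -> float:
    """Run time with infinitely many cores: only the serial part, (1-p)T, remains."""
    return (1.0 - p) * T

for p in (0.8, 0.9):
    times = {n: run_time(p, n) for n in (1, 2, 4, 16, 32, 64)}
    print(p, times, "limit:", round(limiting_run_time(p), 4))
```

Because the serial term (1-p)T does not shrink with n, each doubling of cores buys less than the one before it.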

Stata/MP is pretty impressively parallelized. We achieve p of 0.8 or 0.9 in many cases. We do not claim to have hit the limits of what is possible, but in most cases, we believe we are very close to those limits. Most estimation commands have p above 0.9, and linear regression is actually above 0.99! This is explained in more detail along with percentage parallelization details for all Stata commands in the Stata/MP Performance Report.

Let’s figure out the value of having more cores. Consider a calculation that would ordinarily require T = 1 hour. With p=0.8 and 2 cores, run times would fall to 0.6 hours; with p=0.9, 0.55 hours. That is very close to the 0.5 hours that would be achieved with p=1, which is not possible. For 4 cores, run times would fall to 0.4 (p=0.8) and 0.325 (p=0.9). That’s good, but nowhere near the hoped-for 0.25 that we would observe if p were 1.

In fact, to get to 0.25 with p=0.8, we need 16 cores. With 16 cores, run times fall to 0.25 (p=0.8) and 0.15625 (p=0.9). Going to 32 cores improves run times just a little, to 0.225 (p=0.8) and 0.128125 (p=0.9). Going to 64 cores, we would get 0.2125 (p=0.8) and 0.1140625 (p=0.9). There’s little gain at all because all the cores in the world combined, and more, cannot reduce run times to below 0.2 (p=0.8) and 0.1 (p=0.9).

Stata/MP supports up to 64 cores. We could make a version that supports 128 cores, but it would be a lot of work even though we would not have to write even one line of code. The work would be in running the experiments to set the tuning parameters.

It turns out there are yet other ways in which reality intrudes. In addition to some calculations such as x_t = f(x_{t-1}) not being parallelizable at all, it’s an oversimplification to say any calculation is parallelizable because there are issues of granularity and of diseconomies of scale, two related, but different, problems.

Let’s start with granularity. Consider making the calculation x_t = f(z_t) for t = 1 to 100, and let’s do that by splitting on the subscript t. If we have n=2 cores, we’ll assign the calculation for t = 1 to 50 to one core, and for t = 51 to 100 to another. If we have four cores, we’ll split t into four parts. Granularity concerns what happens when we move from n=100 to n=101 cores. This problem can be split into only 100 parallelizable parts and the minimum run time is therefore max(T/n, T/100) and not T/n, as we previously assumed.
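As a small illustration of the granularity bound, the Python sketch below evaluates max(T/n, T/100) for a few core counts. The function name is invented, and T = 1 is an arbitrary choice of units.

```python
def min_run_time(T: float, n_cores: int, n_parts: int = 100) -> float:
    """Lower bound on run time when the work splits into at most n_parts pieces."""
    return max(T / n_cores, T / n_parts)

for n in (50, 100, 101, 1000):
    # Beyond 100 cores the extra cores have nothing to do.
    print(n, min_run_time(1.0, n))
```

The bound stops improving at n = 100: the 101st core, and every core after it, sits idle.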

All problems suffer from granularity. Diseconomies of scale is a related issue, and it strikes sooner than granularity. Many, but not all, problems suffer from diseconomies of scale. Rather than calculating f(z_t) for t = 1 to 100, let’s consider calculating the sum of f(z_t) for t = 1 to 100. We’ll make this calculation in parallel in the same way as we made the previous calculation, by splitting on t. This time, however, each subprocess will report back to us the sum over the subrange. To obtain the overall sum, we will have to add sub-sums. So if we have n=2 cores, core 1 will calculate the sum over t = 1 to 50, core 2 will calculate the sum for t = 51 to 100, and then, the calculation having come back together, the master core will have to calculate the sum of two numbers. Adding two numbers can be done in a blink of an eye.

But what if we split the problem across 100 cores? We would get back 100 numbers which we would then have to sum. Moreover, what if the calculation of f(z_t) is trivial? In that case, splitting the calculation among all 100 cores might result in run times that are nearly equal to what we would observe performing the calculation on just one core, even though splitting the calculation between two cores would nearly halve the execution time, and splitting among four would nearly quarter it!

So what’s the maximum number of cores over which we should split this problem? It depends on the relative execution times of f(z_t) and the combination operator to be performed on those results (addition in this case).
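A toy cost model makes the trade-off concrete. The Python sketch below is an assumption-laden illustration, not a description of how Stata/MP is tuned: it supposes each core evaluates its share of the 100 terms at cost_f per term, and the master core then adds the sub-sums at cost_add per addition.

```python
import math

def sum_run_time(n_cores: int, n_terms: int = 100,
                 cost_f: float = 1.0, cost_add: float = 1.0) -> float:
    """Toy run time: parallel evaluation of f plus serial combination of the sub-sums."""
    per_core = math.ceil(n_terms / n_cores) * cost_f   # the busiest core's share
    combine = (n_cores - 1) * cost_add                 # master adds n_cores sub-sums
    return per_core + combine

def best_core_count(max_cores: int, **costs) -> int:
    """Core count in 1..max_cores that minimizes the toy run time."""
    return min(range(1, max_cores + 1), key=lambda n: sum_run_time(n, **costs))

for n in (1, 2, 4, 100):
    print(n, sum_run_time(n))              # trivial f: 100 cores is no better than 1

print(best_core_count(100))                # trivial f
print(best_core_count(100, cost_f=100.0))  # expensive f
```

Under these made-up costs, a trivial f is best split over about 10 cores, while an f that is 100 times costlier than an addition is worth splitting over all 100. The answer moves with the ratio of the two costs, which is what the tuning parameters have to capture.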

It is the diseconomies of scale problem that bit us in the early versions of Stata/MP, at least in beta testing. We did not adequately deal with the problem of splitting calculations among fewer cores than were available. Fixing that problem was a lot of work and, for your information, we are still working on it as hardware becomes available with more and more cores. The right way to address the issue is to have calculation-by-calculation tuning parameters, which we do. But it takes a lot of experimental work to determine the values of those tuning parameters, and the greater the number of cores, the more accurately the values need to be measured. We have the tuning parameters determined accurately enough for up to 64 cores, although there are one or two which we suspect we could improve even more. We would need to do a lot of experimentation, however, to ensure we have values adequate for 128 cores. The irony is that we would be doing that to make sure we don’t use them all except when problems are large enough!

In any case, I have seen articles predicting, and in some cases announcing, computers with hundreds of cores. For applications with p approaching 1, those are exciting announcements. In the world of statistical software, however, these announcements are exciting only for those running with immense datasets.

Graphs, maps, and geocoding

Jim Hufford, Esq. had his first Stata lesson: “This is going to be awesome when I understand what all those little letters and things mean.”

Along those lines — awesome — Jim may want to see these nice Stata scatterplots from the “wannabe economists of the Graduate Institute of International and Development Studies in Geneva” at Rigotnomics.

If you want to graph data onto maps using Stata — and see another awesome graph — see Mitch Abdon’s “Fun with maps in Stata” over at the Stata Daily.

And if you’re interested in geocoding to obtain latitudes and longitudes from human-readable addresses or locations, see Adam Ozimek’s “Computers are taking our jobs: Stata nerds only edition” over at Modeled Behavior and see the related Stata Journal article “Stata utilities for geocoding and generating travel time and travel distance information” by Adam Ozimek and Daniel Miles.

Pi is (still) wrong

See this video, by Vi Hart:

This link was passed on to me by my friend Marcello. I’ve been bold enough to make up words such as eigenaxis and eigenpoint, but it takes real courage to suggest redefining π, even when you’re right!

After seeing the video, you can go here and here to learn more about what is being proposed.

Don’t click on comments until you’ve seen the video. Ms. Hart does a better job presenting the proposal than any of us can.

Categories: Mathematics