
Precision (yet again), Part II

In part I, I wrote about precision issues in English. If you enjoyed that, you may want to stop reading now, because I’m about to go into the technical details. Actually, these details are pretty interesting.

For instance, I offered the following formula for calculating error due to float precision:

maximum_error = 2^(-24) * X

I later mentioned that the formula is an approximation, and said that the true formula is,

maximum_error = 2^(-24) * 2^(floor(log2(X)))

I didn’t explain how I got either formula.

I need to be more precise today than I was in my previous posting. For instance, I previously used x for two concepts, the true value and the rounded-after-storage value. Today I need to distinguish those concepts.

X is the true value.

x is the value after rounding due to storage.

The issue is the difference between x and X when X is stored in 24-binary-digit float precision.

Base 10

Although I harp on the value of learning to think in binary and hexadecimal, I admit that I, too, find it easier to think in base 10. So let’s start that way.

Say we record numbers to two digits of accuracy, which I will call d=2. Examples of d=2 numbers include

 1.0
 1.6
12
47
52*10^1 (i.e., 520, but with only two significant digits)

To say that we record numbers to two digits of accuracy is to say that, coming upon the recorded number 1, we know only that the number lies between 0.95 and 1.05; or coming upon 12, that the true number lies between 11.5 and 12.5, and so on. I assume that numbers are rounded efficiently, which is to say, stored values record midpoints of intervals.

Before we get into the math, let me note that most of us would be willing to say that numbers recorded this way are accurate to 1 part in 10 or, if d=3, to 1 part in 100. If numbers are accurate to 1 part in 10^(d-1), couldn’t we just multiply the number by 1/(10^(d-1)) to obtain the width of the interval? Let’s try:

Assume X=520 and d=2. Then 520/(10^(2-1)) = 52. The true interval, however, is (515, 525] and it has width 10. So the simple formula does not work.

The simple formula does not work, yet I presented its base-2 equivalent in Part 1, and I even recommended its use! We will get to that. It turns out that the smaller the base, the more accurately the simple formula approximates the true formula, but before I can show that, I need the true formula.

Let’s start by thinking about d=1.

  1. The recorded number 0 will contain all numbers in the interval [-0.5, 0.5). The recorded number 1 will contain all numbers in [0.5, 1.5), and so on. For 0, 1, …, 9, the width of the intervals is 1.

  2. The recorded number 10 will contain all numbers in [5, 15). The recorded number 20 will contain all numbers in [15, 25), and so on. For 10, 20, …, 90, the width of the intervals is 10.

The derivation for the width of interval goes like this:

  1. If we recorded the value of X to one decimal digit, the recorded digit will be b, the recorded value will be x = b*10^p, and the power of ten will be p = floor(log10(X)). More importantly, W1 = 10^p will be the width of the interval containing X.

  2. It therefore follows that if we recorded the value of X to two decimal digits, the interval length would be W2 = W1/10. Whatever the width with one digit, adding another must reduce the width by one-tenth.

  3. If we recorded the value of X to three decimal digits, the interval length would be W3 = W2/10.

  4. Thus, if d is the number of digits to which numbers are recorded, the width of the interval is 10^p where p = floor(log10(X)) - (d-1).

The above formula is exact.
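
That the formula is exact is easy to check for the X=520, d=2 example; Stata's display command does the arithmetic and returns the width of 10 that we found by hand:

. display 10^(floor(log10(520)) - (2-1))
10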

Base 2

Converting the formula

interval_width = 10^(floor(log10(X)) - (d-1))

from base 10 to base 2 is easy enough:

interval_width = 2^(floor(log2(X)) - (d-1))

In Part 1, I presented this formula for d=24 as

maximum_error = 2^(floor(log2(X)) - 24) = 2^(-24) * 2^(floor(log2(X)))

In interval_width, it is d-1 and not d that appears in the formula. You might think I made an error and should have put -23 where I put -24 in the maximum_error formula. There is no mistake. In Part 1, the maximum error was defined as a plus-or-minus quantity and is thus half the width of the overall interval. So I divided by 2 and, in effect, I did put -23 into the maximum_error formula, at least before I subtracted one more from it, making it -24 again.

I started out this posting by considering and dismissing the base-10 approximation formula

interval_width = 10^(-(d-1)) * X

which in maximum-error units is

maximum_error = 10^(-d) * X

and yet in Part 1, I presented — and even recommended — its base-2, d=24 equivalent,

maximum_error = 2^(-24) * X

It turns out that the approximation formula is not as inaccurate in base 2 as it would be in base 10. The correct formula,

maximum_error = 2^(floor(log2(X)) - d)

can be written

maximum_error = 2^(-d) * 2^(floor(log2(X)))

so the question is how accurate it is to substitute X for 2^(floor(log2(X))). We know by examination that X ≥ 2^(floor(log2(X))), so making the substitution will overstate the error and, in that sense, is a safe thing to do. The question is by how much the error is overstated.

X can be written 2^(log2(X)) and thus we need to compare 2^(log2(X)) with 2^(floor(log2(X))). The floor() function cannot reduce its argument by more than 1, and thus 2^(log2(X)) cannot differ from 2^(floor(log2(X))) by more than a factor of 2. Under the circumstances, this seems a reasonable approximation.

In base 10, the floor() function reducing its argument by up to 1 results in a decrease of up to a factor of 10. That, it seems to me, is not a reasonable amount of error.
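
To put numbers on that, take the X=520 example for base 10 and, say, X=13 for base 2:

. display 520 / 10^floor(log10(520))
5.2

. display 13 / 2^floor(ln(13)/ln(2))
1.625

In base 10, substituting X for 10^(floor(log10(X))) overstates the width by a factor of 5.2 in this example and by up to 10 in general; in base 2, the overstatement here is 1.625 and can never exceed 2.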

Categories: Numerical Analysis

Precision (yet again), Part I

I wrote about precision here and here, but they were pretty technical.

“Great,” coworkers inside StataCorp said to me, “but couldn’t you explain these issues in a way that doesn’t get lost in the details of how computers store binary and maybe, just maybe, write about floats and doubles from a user’s perspective instead of programmer’s perspective?”

“Mmmm,” I said clearly.

Later, when I tried, I liked the result. It contains new material, too. What follows is what I now wish I had written first. I would still have written the other two postings, but as technical appendices.

In Part 2 (forthcoming), I provide the mathematical derivations underlying what follows; a few of the issues that arise there are interesting in their own right.

Please excuse the manualish style of what follows, but I suspect that what follows will eventually work its way into Stata’s help files or manuals, so I wrote it that way.


Syntax

Problem:

. generate x = 1.1

. list
  (Stata displays output showing x is 1.1 in all observations)

. count if x==1.1
     0

Solution 1:

. count if x==float(1.1)
   100

Solution 2:

. generate double x = 1.1

. count if x==1.1
  100

Solution 3:

. set type double

. generate x = 1.1

. count if x==1.1
  100

Description

Stata works in binary. Stata stores data in float precision by default. Stata performs all calculations in double precision. Sometimes the combination results in surprises until you think more carefully about what happened.

Remarks

Remarks are presented under the headings

Summary
Why count x==1.1 produces 0
How count x==float(1.1) solves the problem
How storing data as double appears to solve the problem (and does)
Float is plenty accurate to store most data
Why don’t I have these problems using Excel?

Summary

Justifications for all statements made appear in the sections below. In summary,

  1. It sometimes appears that Stata is inaccurate. That is not true and, in fact, the appearance of inaccuracy happens in part because Stata is so accurate.

  2. You can cover up this appearance of inaccuracy by storing all your data in double precision. Doing so will at least double the size of your dataset, so I do not recommend the double-precision solution unless your dataset is small relative to the amount of memory on your computer. In that case, there is nothing wrong with storing all your data in double precision.

    The easiest way to implement the double-precision solution is by typing set type double. After that, Stata will default to creating all new variables as doubles, at least for the remainder of the session. If all your datasets are small relative to the amount of memory on your computer, you can type set type double, permanently.

  3. The double-precision solution is needlessly wasteful of memory. It is difficult to imagine data that are accurate to more than float precision. Regardless of how your data are stored, Stata does all calculations in double precision, and sometimes in quad precision.

  4. The issue of 1.1 not being equal to 1.1 arises only with “nice” decimal numbers. You just have to remember to use Stata’s float() function when dealing with such numbers.

Why count x==1.1 produces 0

Let’s trace through what happens when you type the commands

. generate x = 1.1

. count if x==1.1
     0

Here is how it works:

  1. Some numbers have no exact finite-digit binary representation just as some numbers have no exact finite-digit decimal representation. One-third, 0.3333… (base 10), is an example of a number with no exact finite-digit decimal representation. In base 12, one-third does have an exact finite-digit representation, namely 0.4 (base 12). In base 2 (binary), base 10 numbers such as 0.1, 0.2, 0.3, 0.4, 0.6, … have no exact finite-digit representation.

  2. Computers store numbers with a finite number of binary digits. In float precision, numbers have 24 binary digits. In double precision, they have 53 binary digits.

    The decimal number 1.1 in binary is 1.000110011001… (base 2). The 1001 on the end repeats forever. Thus, 1.1 (base 10) is stored by a computer as

     1.00011001100110011001101
    

    in float, or as

     1.0001100110011001100110011001100110011001100110011010
    

    in double. There are 24 and 53 digits in the numbers above.

  3. Typing generate x = 1.1 results in 1.1 being interpreted as the longer binary number because Stata performs all calculations in double precision. New variable x, however, is created as a float by default. When the more precise number is stored in x, it is rounded to the shorter number.

  4. Thus, when you type count if x==1.1, the 1.1 is again interpreted as the longer binary number, that longer number is compared with the shorter number stored in x, and they are not equal. The result is 0.

How count x==float(1.1) solves the problem

One way to fix the problem is to change count if x==1.1 to read count if x==float(1.1):

. generate x = 1.1

. count if x==float(1.1)
   100

Function float() rounds results to float precision. When you type float(1.1), the 1.1 is converted to binary, double precision, namely,

 1.0001100110011001100110011001100110011001100110011010 (base 2)

and float() then rounds that long binary number to

 1.00011001100110011001101 (base 2)

or more correctly, to

 1.0001100110011001100110100000000000000000000000000000 (base 2)

because the number is still stored in double precision. Regardless, this new value is equal to the value stored in x, and so count reports that 100 observations contain float(1.1).
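
If you want to see those two binary numbers for yourself, Stata's %21x display format (the subject of my earlier %21x postings) shows them in hexadecimal:

. display %21x 1.1
+1.199999999999aX+000

. display %21x float(1.1)
+1.19999a0000000X+000

The first result is 1.1 rounded to double precision; the second is 1.1 rounded to float precision and then stored as a double, which is why its mantissa ends in zeros.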

As an aside, when you typed generate x = 1.1, Stata acted as if you typed generate x = float(1.1). Whenever you type generate x = … and x is a float, Stata acts as if you typed generate x = float(…).

How storing data as double appears to solve the problem (and does)

When you type

. generate double x = 1.1

. count if x==1.1
   100

it should be pretty obvious how the problem was solved. Stata stores

1.0001100110011001100110011001100110011001100110011010 (base 2)

in x, and then compares the stored result to

1.0001100110011001100110011001100110011001100110011010 (base 2)

and of course they are equal.

In the Summary above, I referred to this as a cover up. It is a cover up because 1.1 (base 10) is not what is stored in x. What is stored in x is the binary number just shown, and to be equal to 1.1 (base 10), the binary number would need to be suffixed with 1001, and then another 1001, and then another, and so on without end.

Stata tells you that x is equal to 1.1 because Stata converted the 1.1 in count to the same inexact binary representation as Stata previously stored in x, and those two values are equal, but neither is equal to 1.1 (base 10). This leads to an important property of digital computers:

If storage and calculation are done to the same precision, it will appear to the user as if all numbers that the user types are stored without error.

That is, it appears to you as if there is no inaccuracy in storing 1.1 in x when x is a double because Stata performs calculations in double. And it is equally true that it would appear to you as if there were no accuracy issues storing 1.1 when x is stored in float precision if Stata, observing that x is float, performed calculations involving x in float. The fact is that there are accuracy issues in both cases.

“Wait,” you are probably thinking. “I understand your argument, but I’ve always heard that float is inaccurate and double is accurate. I understand from your argument that it is only a matter of degree but, in this case, those two degrees are on opposite sides of an important line.”

“No,” I respond.

What you have heard is right with respect to calculation. What you have heard might apply to data storage too, but that is unlikely. It turns out that float provides plenty of precision to store most real measurements.

Float is plenty accurate to store most data

The misconception that float precision is inaccurate comes from the true statement that float precision is not accurate enough when it comes to making calculations with stored values. Whether float precision is accurate enough for storing values depends solely on the accuracy with which the values are measured.

Float precision provides 24 base-2 (binary) digits, and thus values stored in float precision have a maximum relative error of plus-or-minus 2^(-24) = 5.96e-08, or less than +/-1 part in 15 million.

  1. The U.S. deficit in 2011 is projected to be $1.5 trillion. Stored as a float, the number has a (maximum) error of 2^(-24) * 1.5e+12 = $89,407. That is, if the true number is 1.5 trillion, the number recorded in float precision is guaranteed to be somewhere in the range [(1.5e+12)-89,407, (1.5e+12)+89,407]. The projected U.S. deficit is not known to an accuracy of +/-$89,407.

  2. People in the US work about 40 hours per week, or roughly 0.238 of the hours in the week. 2^(-24) * 0.238 = 1.42e-08 of a week, or about 8.6 milliseconds. Time worked in a week is not known to an accuracy of +/-8.6 milliseconds.

  3. A cancer survivor might live 350 days. 2^(-24) * 350 = .00002086, or 1.8 seconds. Time of death is rarely recorded to an accuracy of +/-1.8 seconds. Time of diagnosis is never recorded to such accuracy, nor could it be.

  4. The moon is said to be 384,401 kilometers from the Earth. 2^(-24) * 384,401 = 0.023 kilometers, or 23 meters. At its closest and farthest, the moon is 356,400 and 406,700 kilometers from Earth.

  5. Most fundamental constants of the universe are known to only a few parts in a million, which is to say, to less accuracy than the 1 part in 15 million that float precision can provide. An exception is the speed of light, measured to be 299,792.458 kilometers per second. Record that as a float and you will be off by about 0.01 km/s.

In all the examples except the last, the errors quoted are worst-case scenarios. The actual error depends on the exact number being stored and is more tedious to calculate (the calculations are not shown, just the results):

  1. For the U.S. deficit, the exact error for 1.5 trillion is -$26,624, which is within the plus or minus $89,407 quoted.

  2. For the fraction of the week worked, at 0.238 the actual error is about -3.3 milliseconds, which is within the +/-8.6 milliseconds quoted.

  3. For cancer survival time, at 350 days the actual error is 0, which is within the +/-1.8 seconds quoted.

  4. For the distance between the Earth and moon, the actual error is 0, which is within the +/-23 meters quoted.

The actual errors may be interesting, but the maximum errors are more useful. Remember the multiplier 2^(-24). All you have to do is multiply a measurement by 2^(-24) and compare the result with the inherent error in the measurement. If 2^(-24) multiplied by the measurement is less than the inherent error, you can use float precision to store your data. Otherwise, you need to use double.
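
If you want Stata to do that multiplication for you, display will oblige; here is the deficit example from above:

. display 2^(-24) * 1.5e+12
89406.967

That is the roughly $89,407 maximum error quoted above.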

By the way, the formula

maximum_error = 2^(-24) * x

is an approximation. The true formula is

maximum_error = 2^(-24) * 2^(floor(log2(x)))

It can be readily proven that x ≥ 2^(floor(log2(x))), and thus the approximation formula overstates the maximum error, by as much as a factor of 2.

Float precision is adequate for most data. There is one kind of data, however, where float precision may not be adequate, and that is financial data such as sales data, general ledgers, and the like. People working with dollar-and-cent data, or Euro-and-Eurocent data, or Pound Sterling-and-penny data, or any other currency data, usually find it best to use doubles. To avoid rounding issues, it is preferable to store the data as pennies. Float precision cannot store 0.01, 0.02, and the like exactly. Integer values, however, can be stored exactly, at least up to 16,777,215.

Floats can store integers up to 16,777,215 exactly. If you stored your data in pennies, that would correspond to $167,772.15.

Doubles can store integers up to 9,007,199,254,740,991 exactly. If you stored your data in pennies, that would correspond to $90,071,992,547,409.91, or just over $90 trillion.
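
You can check the float limit for yourself; the comparisons below report 1 (equal) for 16,777,215 and 0 (not equal) for 16,777,217:

. display float(16777215) == 16777215
1

. display float(16777217) == 16777217
0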

Why don’t I have these problems using Excel?

You do not have these problems when you use Excel because Excel stores numeric values in double precision. As I explained in How count x==float(1.1) solves the problem above,

If storage and calculation are done to the same precision, it will appear to the user as if all numbers that the user types are stored without error.

You can adopt the Excel solution in Stata by typing

. set type double, permanently

You will double (or more) the amount of memory Stata uses to store your data, but if that is not of concern to you, there are no other disadvantages to adopting this solution. If you adopt this solution and later wish to change your mind, type

. set type float, permanently

That’s all for today

If you enjoyed the above, you may want to see Part II (forthcoming). As I said, there are a few technical issues underlying what is written above that may interest those interested in computer science as it applies to statistical computing.

Categories: Numerical Analysis

Merging data, part 2: Multiple-key merges

Multiple-key merges arise when more than one variable is required to uniquely identify the observations in your data. In Merging data, part 1, I discussed single-key merges such as

        . merge 1:1 personid using ...

In that discussion, each observation in the dataset could be uniquely identified on the basis of a single variable. In panel or longitudinal datasets, there are multiple observations on each person or thing and to uniquely identify the observations, we need at least two key variables, such as

        . merge 1:1 personid date using ...

In this dataset we have repeated observations on persons and, within person, the observations can be uniquely identified by the calendar date.

Just to fix ideas, let’s assume I have two datasets. The first, sample.dta, is the one of analytic interest to me. I have data on 737 persons. For each person, I have data recorded on the first and fifteenth of every month, from year 2000 to 2009. Overall, my dataset has 176,880 observations.

The second dataset contains additional information (variables) on the sample of interest. It has over 3,000 people in it and it covers a longer stretch of time. I’ve been told that most of my 737 people should be in this second dataset, but I’ve been warned that, due to data collection or data processing errors over the years, a small fraction will not be found.

“How many?” I asked Bob from data processing during a fictional conversation, my paranoia kicking in.

“I don’t know. Most. Ninety-nine percent. It’s just random stuff,” he replied, knowing how well I respond to the word random.

Let’s call this second set of data the payroll data, although if I can imagine fictional conversations, you can imagine the data are something else. They might be records from follow-up visits of a medical experiment.

In any case, I receive the data, and here is what happened when I merged the data with my sample:

        . use sample, clear

        . merge 1:1 personid date using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                         2,352  
                from master                     2,352  (_merge==1)
                from using                          0  (_merge==2)

            matched                           174,528  (_merge==3)
            -----------------------------------------

In my sample data, I have 174,528 + 2,352 = 176,880 observations. Of those, 174,528 matched, which is 98.7 percent. (The number of unmatched records from the using (payroll) data is zero because I specified option keep(master match), meaning I discarded the unmatched payroll records. Had I not, the number would have been in the low millions.)

For many in this situation, the story would stop right here. Not for me. I want to show you how to tear into multiple-key merges to reassure yourself that things really are as they appear. You realize, of course, that I manufactured these fictional data for this blog entry and buried a little something that, once we find it, would scare you if this were a real story. So I’ll tell you now: this story is loosely based on a real one.

Step 1: Following my own advice

In Merging data, part 1 I recommended that you merge on all common variables, not just the identification variables. This blog entry is not going to rehash the previous one, but I want to emphasize that everything I said there about single-key merges applies equally to multiple-key merges. These two datasets share a variable recording the division in which the employee works, so I include it among the match variables:

        . use sample, clear

        . merge 1:1 personid date division using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                         2,352  
                from master                     2,352  (_merge==1)
                from using                          0  (_merge==2)

            matched                           174,528  (_merge==3)
            -----------------------------------------

The output above matches the output when I merged on personid and date alone, so I do not appear to have a merge-gone-bad problem. These merged data are looking better and better.

Step 2: Merge on each key variable in isolation

Let’s imagine what could go wrong. Imagine that all the data for certain persons were missing, or that all the data for certain dates were missing. That might not be a problem, but it would certainly raise questions. Depending on the answers, it may be worth a footnote or concerning enough to return the data and ask for a refund.

Finding persons or dates that are entirely unmatched is a lot of work unless you know the following trick: Merge on one key variable at a time.

Let’s start with personid:

        . use sample, clear

        . sort personid

        . by personid: keep if _n==1           // <- don't skip this step
        (176143 observations deleted)

        . merge 1:m personid using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                             0
            matched                           174,528  (_merge==3)
            -----------------------------------------

The output above proves that payroll.dta contains data on every person that appears in sample.dta.

Let me explain. I began by using my sample data and keeping just one observation for every value of personid. I don't care which observation I keep; I just need to keep one and only one. Then I merged on personid, keeping (1) the records that match and (2) the records from the master that do not match. I have no interest in the resulting dataset; I just wanted to see the table merge would report. merge reports that 174,528 observations matched and 0 went unmatched. Ergo, every value of personid that appears in sample.dta also appears in payroll.dta.

Had merge reported "not matched" = 2, that would have meant that two values of personid appearing in sample.dta did not appear in payroll.dta. It would not have been an indictment of the data if two persons were not matched in their entirety, but I would certainly have looked into the issue. With the merged result in memory, I would have typed

        . list personid if _merge==1
          (output would appear) 

I would have written down the two personids list reported. Then I would have returned to my sample data and looked at the data I had on those two people:

        . use sample, clear

        . list if personid==735527
          (output would appear)

        . list if personid==29887
          (output would appear)

It might be that 735527 was with the company for only a short time, making the missing payroll records a believable random event. If 735527 had been with the company all ten years, however, I would be back on the phone seeking an explanation. (If these were medical data, you would certainly want to know how a person who never reported for a follow-up visit is known to still be alive after ten years.)

So much for personid. Let's do the same for date:

        . use sample, clear  
        . sort date
        . by date: keep if _n==1
        (176640 observations deleted)
        . merge 1:m date using payroll, keep(master match)
            Result                           # of obs.
            -----------------------------------------
            not matched                             0
            matched                           236,832  (_merge==3)
            -----------------------------------------

Every date that appears in sample.dta also appears in payroll.dta.

Finally, let's look at division:

        . use sample, clear

        . sort division

        . by division: keep if _n==1
        (output omitted)

        . merge 1:m division using payroll, keep(master match)
        (output omitted; it reports "not matched" = 0)

Every division that appears in sample.dta appears in payroll.dta.

These data are looking better and better.

If we had only two key variables, we would be done. We, however, are performing the full merge on three variables, namely personid, date, and division, and so there is one more set of comparisons we should examine.

Step 3: Merge on every pair of key variables

With three key variables, the possible pairs are (personid, date), (personid, division), and (division, date). We have already looked at (personid, date), so that just leaves (personid, division) and (division, date).

The method is the same as in Step 2 except that we type two variables where we previously typed one:

        . use sample, clear

        . sort personid division

        . by personid division: keep if _n==1
        (176143 observations deleted)

        . merge 1:m personid division using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                             0
            matched                           174,528  (_merge==3)
            -----------------------------------------

We discover that every personid-division combination that appears in sample.dta also appears in payroll.dta.

Last is (division, date):

        . use sample, clear

        . sort division date

        . by division date: keep if _n==1
        (175200 observations deleted)

        . merge 1:m division date using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                            24
                from master                        24  (_merge==1)
                from using                          0  (_merge==2)

            matched                           236,832  (_merge==3)
            -----------------------------------------

Surprise! Well, you're not surprised because I told you earlier we were going to find something, but if this were real life, you would be surprised after all these reassuring results.

We discover that there are 24 division-date combinations appearing in sample.dta that do not appear in payroll.dta. Let's look at the 24 missing combinations:

        . list division date if _merge==1

                +----------------------+
                | division        date |
                |----------------------|
          1129. |        5   01jan2007 |
          1130. |        5   15jan2007 |
          1131. |        5   01feb2007 |
          1132. |        5   15feb2007 |
          1133. |        5   01mar2007 |
                |----------------------|
          1134. |        5   15mar2007 |
          1135. |        5   01apr2007 |
          1136. |        5   15apr2007 |
          1137. |        5   01may2007 |
          1138. |        5   15may2007 |
                |----------------------|
          1139. |        5   01jun2007 |
          1140. |        5   15jun2007 |
          1141. |        5   01jul2007 |
          1142. |        5   15jul2007 |
          1143. |        5   01aug2007 |
                |----------------------|
          1144. |        5   15aug2007 |
          1145. |        5   01sep2007 |
          1146. |        5   15sep2007 |
          1147. |        5   01oct2007 |
          1148. |        5   15oct2007 |
                |----------------------|
          1149. |        5   01nov2007 |
          1150. |        5   15nov2007 |
          1151. |        5   01dec2007 |
          1152. |        5   15dec2007 |
                +----------------------+

If you look closely, you will notice that every payroll date in 2007 is listed. So what happened to the payroll records for division 5 in 2007? This may indeed be exactly the kind of random event that Bob had in mind during our fictional conversation. Somehow the company lost a little cluster of payroll records. The loss may mean nothing. Or it might be of critical importance. Imagine there's been an allegation that the company treats older workers poorly, and imagine that division 5 has the highest average age. Not random. Not random at all.

Step 4: Merge on every triplet of key variables

So much for the fictional story.

If we had four or more key variables, we would now need to merge on every triplet of key variables; if we had five or more, we would then need to merge on every quadruplet; and if ...
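
Should you ever need to go that far, the pattern is the same as before; here is a sketch with hypothetical key variables key1, key2, and key3 (substitute your own):

        . use sample, clear

        . sort key1 key2 key3

        . by key1 key2 key3: keep if _n==1

        . merge 1:m key1 key2 key3 using payroll, keep(master match)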

Forget the story. Or remember it if it scares you. Data processing and paranoia make an excellent pairing. What's important is how easy it is to take complicated, multiple-key merges apart. I've never met anyone yet who knew this trick.

Categories: Data Management

Merging data, part 1: Merges gone bad

Merging concerns combining datasets on the same observations to produce a result with more variables. We will call the datasets one.dta and two.dta.

When it comes to combining datasets, the alternative to merging is appending, which is combining datasets on the same variables to produce a result with more observations. Appending datasets is not the subject for today. But just to fix ideas, appending looks like this:

              +-------------------+
              | var1  var2  var3  |      one.dta
              +-------------------+
           1. | one.dta           |
           2. |                   |
            . |                   |
            . |                   |
              +-------------------+

                        +

              +-------------------+
              | var1  var2  var3  |      two.dta
              +-------------------+
           1. | two.dta           |
           2. |                   |
            . |                   |
              +-------------------+

                       =

              +-------------------+
              | var1  var2  var3  |
              +-------------------+
           1. |                   |    one.dta
           2. |                   |
            . |                   |
            . |                   |
              +                   +      +
        N1+1. |                   |    two.dta   appended
        N1+2. |                   |
            . |                   |
              +-------------------+

Merging looks like this:


      +-------------------+           +-----------+
      | var1  var2  var3  |           | var4 var5 |
      +-------------------+           +-----------+
   1. |                   |        1. |           |
   2. |                   |    +   2. |           |     =
    . |                   |         . |           |
    . |                   |         . |           |
      +-------------------+           +-----------+
        one.dta                         two.dta


                        +-------------------+-----------+
                        | var1  var2  var3    var4 var5 |
                        +-------------------------------+
                     1. |                               |
                     2. |                               |
                      . |                               |
                      . |                               |
                        +-------------------+-----------+
                          one.dta           + two.dta    merged

The matching of the two datasets — deciding which observations in one.dta are combined with which observations in two.dta — could be done simply on the observation numbers: Match one.dta observation 1 with two.dta observation 1, match one.dta observation 2 with two.dta observation 2, and so on. In Stata, you could obtain that result by typing

. use one, clear

. merge 1:1 using two

Never do this because it is too dangerous. You are merely assuming that observation 1 matches with observation 1, observation 2 matches with observation 2, and so on. What if you are wrong? If observation 2 in one.dta is Bob and observation 2 in two.dta is Mary, you will mistakenly combine the observations for Bob and Mary and, perhaps, never notice the mistake.

The better solution is to match the observations on equal values of an identification variable. This way, the observation with id=”Mary” is matched with the observation with id=”Mary”, id=”Bob” with id=”Bob”, id=”United States” with id=”United States”, and id=4934934193 with id=4934934193. In Stata, you do this by typing

. use one, clear

. merge 1:1 id using two

Things can still go wrong. For instance, id=”Bob” will not match id=”Bob ” (with the trailing blank), but if you expected all the observations to match, you will ultimately notice the mistake. Mistakenly unmatched observations tend to get noticed because of all the missing values they cause in subsequent calculations.

It is the mistakenly combined observations that can go unnoticed.

And that is the topic for today, mistakenly matched observations, or merges gone bad.

Observations are mistakenly combined more often than many researchers realize. I’ve seen it happen. I’ve seen it happen, be discovered later, and necessitate withdrawn results. You seriously need to consider the possibility that this could happen to you. Only three things are certain in this world: death, taxes, and merges gone bad.

I am going to assume that you are familiar with merging datasets both conceptually and practically; that you already know what 1:1, m:1, 1:m, and m:n mean; and that you know the role played by “key” variables such as ID. I am going to assume you are familiar with Stata’s merge command. If any of this is untrue, read [D] merge. Type help merge in Stata and click on [D] merge at the top to take you to the full PDF manuals. We are going to pick up where the discussion in [D] merge leaves off.

Detecting when merges go bad

As I said, the topic for today is merges gone bad, by which I mean producing a merged result with the wrong records combined. It is difficult to imagine that typing

. use one, clear

. merge 1:1 id using two

could produce such a result because, to be matched, the observations had to have equal values of the ID. Bob matched with Bob, Mary matched with Mary, and so on.

Right you are. There is no problem assuming the values in the id variable are correct and consistent between datasets. But what if id==4713 means Bob in one dataset and Mary in the other? That can happen if the id variable is simply wrong from the outset or if the id variable became corrupted in prior processing.

1. Use theory to check IDs if they are numeric

One way the id variable can become corrupted is if it is not stored properly or if it is read improperly. This can happen to both string and numeric variables, but right now, we are going to emphasize the numeric case.

Say the identification variable is Social Security number, an example of which is 888-88-8888. Social Security numbers are invariably stored in computers as 888888888, which is to say that they are run together and look a lot like the number 888,888,888. Sometimes they are even stored numerically. Say you have a raw data file containing perfectly valid Social Security numbers recorded in just this manner. Say you read the number as a float. Then 888888888 becomes 888888896, and so does every Social Security number between 888888865 and 888888927, some 63 in total. If Bob has Social Security number 888888869 and Mary has 888888921, and Bob appears in dataset one and Mary in dataset two, then Bob and Mary will be combined because they share the same rounded Social Security number.
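
You can watch that rounding happen by passing the number through Stata's float() function:

. display %15.0f float(888888888)
      888888896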

Always be suspicious of numeric ID variables stored numerically, not just those stored as floats.

When I read raw data and store the ID variables as numeric, I worry whether I have specified a storage type sufficient to avoid rounding. When I obtain data from other sources that contain numeric ID variables, I assume that the other source improperly stored the values until proven otherwise.

Perhaps you remember that 16,777,215 is the largest integer that can be stored precisely as a float and 9,007,199,254,740,991 is the largest that can be stored precisely as a double. I never do.

Instead, I ask Stata to show me the largest theoretical ID number in hexadecimal. For Social Security numbers, the largest is 999-99-9999, so I type

. inbase 16 999999999
3b9ac9ff

Stata’s inbase command converts decimal numbers to different bases. I learn that 999999999 base-10 is 3b9ac9ff base-16, but I don’t care about the details; I just want to know the number of base-16 digits required. 3b9ac9ff has 8 digits. It takes 8 base-16 digits to record 999999999. As you learned in How to read the %21x format, part 2, I do remember that doubles can record 13 base-16 digits and floats can record 5.75 digits (the 0.75 part being because the last digit must be even). If I didn’t remember those numbers, I would just display a number in %21x format and count the digits to the right of the binary point. Anyway, Social Security numbers can be stored in doubles because 8<13, the number of digits double provides, but not in floats because 8 is not < 5.75, the number of digits float provides.

If Social Security numbers contained 12 digits rather than 9, the largest would be

. inbase 16 999999999999
e8d4a50fff

which has 10 base-16 digits, and because 10<13, it would still fit into a double.

Anyway, if I discover that the storage type is insufficient to store the ID number, I know the ID numbers must be rounded.

2. Check uniqueness of IDs

I said that when I obtain data from other sources, I assume that the other source improperly stored the ID variables until proven otherwise. I should have said, until evidence accumulates to the contrary. Even if the storage type used is sufficient, I do not know what happened in previous processing of the data.

Here’s one way using datasets one.dta and two.dta to accumulate some of that evidence:

. use one, clear              // test 1
. sort id
. by id: assert _N==1

. use two, clear              // test 2
. sort id
. by id: assert _N==1

In these tests, I am verifying that the IDs really are unique in the two datasets that I have. Tests 1 and 2 are unnecessary when I plan later to merge 1:1 because the 1:1 part will cause Stata itself to check that the IDs are unique. Nevertheless, I run the tests. I do this because the datasets I merge are often subsets of the original data, and I want to use all the evidence I have to invalidate the claim that the ID variables really are unique.

Sometimes I receive datasets where it takes two variables to identify the observations uniquely. Perhaps I receive data on persons over time, along with the claim that the ID variable is name. The documentation also notes that variable date records when the observation was made. Thus, to uniquely identify each of the observations requires both name and date, and I type

. sort name date
. by name date: assert _N==1

I am not suspicious only of the datasets I receive; I run this same test on datasets I create.

3. Merge on all common variables

At this point, I know the ID variable(s) are unique in each dataset. Now I consider the idea that the ID variables are inconsistent across datasets, which is to say that Bob in one dataset, however he is identified, means Mary in the other. Detecting such problems is always problematic, but not nearly as problematic as you might guess.

It is rare that the datasets I need to merge have no variables in common except the ID variable. If the datasets are on persons, perhaps both datasets contain each person’s sex. In that case, I could merge the two datasets and verify that the sex is the same in both. Actually, I can do something easier than that: I can add variable sex to the key variables of the merge:

. use one, clear
. merge 1:1 id sex using two

Assume I have a valid ID variable. Then adding variable sex does not affect the outcome of the merge because sex is constant within id. I obtain the same results as typing merge 1:1 id using two.

Now assume the id variable is invalid. Compared with the results of merge 1:1 id using two, Bob will no longer match with Mary even if they have the same ID. Instead I will obtain separate, unmatched observations for Bob and Mary in the merged data. Thus to complete the test that there are no such mismatches, I must verify that the id variable is unique in the merged result. The complete code reads

. use one, clear
. merge 1:1 id sex using two
. sort id
. by id: assert _N==1

And now you know why in test 2 I checked the uniqueness of ID within dataset by hand rather than depending on merge 1:1. The 1:1 merge I just performed is on id and sex, and thus merge does not check the uniqueness of ID in each dataset. I checked by hand the uniqueness of ID in each dataset and then checked the uniqueness of the result by hand, too.

Passing the above test does not prove that the ID variable is consistent and thus that the merge is correct, but if the assertion is false, I know with certainty either that I have an invalid ID variable or that sex is miscoded in one of the datasets. If my data have roughly equal numbers of males and females, then the test has a 50 percent chance of detecting a mismatched pair of observations, such as Bob and Mary. If I have just 10 mismatched observations, the probability of detecting the problem is 1 - 0.5^10 = 0.999.
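
If you want Stata to do that arithmetic for you:

. display 1 - 0.5^10
.99902344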

I should warn you that if you want to keep just the matched observations, do not perform the merge by coding merge 1:1 id sex using two, keep(matched). You must keep the unmatched observations to perform the final part of the test, namely, that the ID numbers are unique. Then you can drop the unmatched observations.

. use one, clear
. merge 1:1 id sex using two
. sort id
. by id: assert _N==1
. keep if _merge==3

There may be more than one variable that you expect to be the same in combined observations. A convenient feature of this test is that you can add as many expected-to-be-constant variables to merge‘s keylist as you wish:

. use one, clear
. merge 1:1 id sex hiredate groupnumber using two
. sort id
. by id: assert _N==1
. keep if _merge==3

It is rare that there is not at least one variable other than the ID variable that is expected to be equal, but it does happen. Even if you have expected-to-be-constant variables, they may not work as well in detecting problems as variable sex in the example above. The distribution of the variable matters. If your data are of people known to be alive in 1980 and the known-to-be-constant variable is whether born after 1900, even mismatched observations would be likely to have the same value of the variable because most people alive in 1980 were born after 1900.

4. Look at a random sample

This test is weak, but you should do it anyway, if only because it’s so easy. List some of the combined observations and look at them.

. list in 1/5

Do the combined results look like they go together?

By the way, the right way to do this is

. gen u = uniform()
. sort u
. list in 1/5
. drop u

You do not want to look at the first observations because, having small values of ID, they are probably not representative. However IDs are assigned, the process is unlikely to be randomized. Persons with low values of ID will be younger, or older; or healthier, or sicker; or ….

5. Look at a nonrandom sample

You just merged two datasets, so obviously you did that because you needed the variables and those variables are somehow related to the existing variables. Perhaps your data is on persons, and you combined the 2009 data with the 2010 data. Perhaps your data is on countries, and you added export data to your import data. Whatever you just added, it is not random. If it were, you could have saved yourself time by simply generating the new variables containing random numbers.

So generate an index that measures a new variable in terms of an old one, such as

. gen diff = income2010 - income2009

or

. gen diff = exports - imports

Then sort on the variable and look at the observations containing the most outlandish values of your index:

. sort diff
. list in  1/5
. list in -5/l

These are the observations most likely to be mistakenly combined. Do you believe those observations were combined correctly?

Conclusion

I admit I am not suspicious of every merge I perform. I have built up trust over time in datasets that I have worked with previously. Even so, my ability to make errors is equal to yours, and even with trustworthy datasets, I can introduce problems long before I get to the merge. You need to carefully consider the consequences of a mistake. I do not know anyone who performs merges who has not performed a merge gone bad. The question is whether he or she detected it. I hope so.

Categories: Data Management

Multiprocessor (core) software (think Stata/MP) and percent parallelization

When most people first think about software designed to run on multiple cores such as Stata/MP, they think to themselves, two cores, twice as fast; four cores, four times as fast. They appreciate that reality will somehow intrude so that two cores won’t really be twice as fast as one, but they imagine the intrusion is something like friction and nothing that an intelligently placed drop of oil can’t improve.

In fact, something inherent intrudes. In any process to accomplish something — even physical processes — some parts may be able to be performed in parallel, but there are invariably parts that just have to be performed one after the other. Anyone who cooks knows that you sometimes add some ingredients, cook a bit, and then add others, and cook some more. So it is, too, with calculating x_t = f(x_(t-1)) for t = 1 to 100 with x_0 = 1. Depending on the form of f(), sometimes there’s no alternative to calculating x_1 = f(x_0), then calculating x_2 = f(x_1), and so on.

In any calculation, some proportion p of the calculation can be parallelized and the remainder, 1-p, cannot. Consider a calculation that takes T hours if it were performed sequentially on a single core. If we had an infinite number of cores and the best possible implementation of the code in parallelized form, the execution time would fall to (1-p)T hours. The part that could be parallelized, which ordinarily would run in pT hours, would run in literally no time at all once split across an infinite number of cores, and that would still leave (1-p)T hours to go. This is known as Amdahl’s Law.

We can generalize this formula to computers with a finite number of cores, say n of them. The parallelizable part of the calculation, the part that would ordinarily run in pT hours, will run in pT/n. The unparallelizable part will still take (1-p)T hours, so we have

T_n = pT/n + (1-p)T

As n goes to infinity, T_n goes to (1-p)T.

Stata/MP is pretty impressively parallelized. We achieve p of 0.8 or 0.9 in many cases. We do not claim to have hit the limits of what is possible, but in most cases, we believe we are very close to those limits. Most estimation commands have p above 0.9, and linear regression is actually above 0.99! This is explained in more detail along with percentage parallelization details for all Stata commands in the Stata/MP Performance Report.

Let’s figure out the value of having more cores. Consider a calculation that would ordinarily require T = 1 hour. With p=0.8 and 2 cores, the run time would fall to 0.6 hours; with p=0.9, to 0.55 hours. That is very close to what would be achieved even with p=1, which is not possible. For 4 cores, run times would fall to 0.4 (p=0.8) and 0.325 (p=0.9). That’s good, but nowhere near the hoped-for 0.25 that we would observe if p were 1.

In fact, to get to 0.25, we need about 16 cores. With 16 cores, run times fall to 0.25 (p=0.8) and 0.15625 (p=0.9). Going to 32 cores improves run times just a little, to 0.225 (p=0.8) and 0.128125 (p=0.9). Going to 64 cores, we would get 0.2125 (p=0.8) and 0.1140625 (p=0.9). There’s little gain at all because all the cores in the world combined, and more, cannot reduce run times below 0.2 (p=0.8) and 0.1 (p=0.9).
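
You can reproduce the 16-core numbers by plugging p and n into the T_n formula above with T = 1:

. display 0.8/16 + 0.2
.25

. display 0.9/16 + 0.1
.15625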

Stata/MP supports up to 64 cores. We could make a version that supports 128 cores, but it would be a lot of work even though we would not have to write even one line of code. The work would be in running the experiments to set the tuning parameters.

It turns out there are yet other ways in which reality intrudes. In addition to some calculations such as x_t = f(x_(t-1)) not being parallelizable at all, it’s an oversimplification to say any calculation is parallelizable because there are issues of granularity and of diseconomies of scale, two related, but different, problems.

Let’s start with granularity. Consider making the calculation x_t = f(z_t) for t = 1 to 100, and let’s do that by splitting on the subscript t. If we have n=2 cores, we’ll assign the calculation for t = 1 to 50 to one core, and for t = 51 to 100 to another. If we have four cores, we’ll split t into four parts. Granularity concerns what happens when we move from n=100 to n=101 cores. This problem can be split into only 100 parallelizable parts, and the minimum run time is therefore max(T/n, T/100) and not T/n, as we previously assumed.

All problems suffer from granularity. Diseconomies of scale is a related issue, and it strikes sooner than granularity. Many, but not all, problems suffer from diseconomies of scale. Rather than calculating f(z_t) for t = 1 to 100, let’s consider calculating the sum of f(z_t) for t = 1 to 100. We’ll make this calculation in parallel in the same way as we made the previous calculation, by splitting on t. This time, however, each subprocess will report back to us the sum over its subrange. To obtain the overall sum, we will have to add the sub-sums. So if we have n=2 cores, core 1 will calculate the sum over t = 1 to 50, core 2 will calculate the sum for t = 51 to 100, and then, the calculation having come back together, the master core will have to calculate the sum of two numbers. Adding two numbers can be done in a blink of an eye.

But what if we split the problem across 100 cores? We would get back 100 numbers which we would then have to sum. Moreover, what if the calculation of f(z_t) is trivial? In that case, splitting the calculation among all 100 cores might result in run times that are nearly equal to what we would observe performing the calculation on just one core, even though splitting the calculation between two cores would nearly halve the execution time, and splitting among four would nearly quarter it!

So what’s the maximum number of cores over which we should split this problem? It depends on the relative execution times of f(z_t) and the combination operator to be performed on those results (addition in this case).

It is the diseconomies of scale problem that bit us in the early versions of Stata/MP, at least in beta testing. We did not adequately deal with the problem of splitting calculations among fewer cores than were available. Fixing that problem was a lot of work and, for your information, we are still working on it as hardware becomes available with more and more cores. The right way to address the issue is to have calculation-by-calculation tuning parameters, which we do. But it takes a lot of experimental work to determine the values of those tuning parameters, and the greater the number of cores, the more accurately the values need to be measured. We have the tuning parameters determined accurately enough for up to 64 cores, although there are one or two which we suspect we could improve even more. We would need to do a lot of experimentation, however, to ensure we have values adequate for 128 cores. The irony is that we would be doing that to make sure we don’t use them all except when problems are large enough!

In any case, I have seen articles predicting and in some cases, announcing, computers with hundreds of cores. For applications with p approaching 1, those are exciting announcements. In the world of statistical software, however, these announcements are exciting only for those running with immense datasets.

Graphs, maps, and geocoding

Jim Hufford, Esq. had his first Stata lesson: “This is going to be awesome when I understand what all those little letters and things mean.”

Along those lines — awesome — Jim may want to see these nice Stata scatterplots from the “wannabe economists of the Graduate Institute of International and Development Studies in Geneva” at Rigotnomics.

If you want to graph data onto maps using Stata — and see another awesome graph — see Mitch Abdon’s “Fun with maps in Stata” over at the Stata Daily.

And if you’re interested in geocoding to obtain latitudes and longitudes from human-readable addresses or locations, see Adam Ozimek’s “Computers are taking our jobs: Stata nerds only edition” over at Modeled Behavior and see the related Stata Journal article “Stata utilities for geocoding and generating travel time and travel distance information” by Adam Ozimek and Daniel Miles.

Pi is (still) wrong

See this video, by Vi Hart:

This link was passed on to me by my friend Marcello. I’ve been bold enough to make up words such as eigenaxis and eigenpoint, but it takes real courage to suggest redefining π, even when you’re right!

After seeing the video, you can go here and here to learn more about what is being proposed.

Don’t click on comments until you’ve seen the video. Ms. Hart does a better job presenting the proposal than any of us can.

Categories: Mathematics

Understanding matrices intuitively, part 2, eigenvalues and eigenvectors

Last time, I showed you a way to graph and to think about matrices. This time, I want to apply the technique to eigenvalues and eigenvectors. The point is to give you a picture that will guide your intuition, just as it was previously.

Before I go on, several people asked after reading part 1 for the code I used to generate the graphs. Here it is, both for part 1 and part 2: matrixcode.zip.

The eigenvectors and eigenvalues of matrix A are defined to be the nonzero x values and the corresponding λ values that solve

Ax = λx

I wrote a lot about Ax in the last post. Just as previously, x is a point in the original, untransformed space and Ax is its transformed value. λ on the right-hand side is a scalar.

Multiplying a point by a scalar moves the point along a line that passes through the origin and the point:

The figure above illustrates y=λx when λ>1. If λ were less than 1, the point would move toward the origin and if λ were also less than 0, the point would pass right by the origin to land on the other side. For any point x, y=λx will be somewhere on the line passing through the origin and x.

Thus Ax = λx means the transformed value Ax lies on a line passing through the origin and the original x. Points that meet that restriction are eigenvectors (or more correctly, as we will see, eigenpoints, a term I just coined), and the corresponding eigenvalues are the λ‘s that record how far the points move along the line.

Actually, if x is a solution to Ax = λx, then so is every other point on the line through 0 and x. That’s easy to see. Assume x is a solution to Ax = λx and substitute cx for x: A(cx) = c(Ax) = c(λx) = λ(cx), so cx is a solution too, for any scalar c. Thus x is not the eigenvector but is merely a point along the eigenvector.

And with that prelude, we are now in a position to interpret Ax = λx fully. Ax = λx finds the lines such that every point on the line, say, x, transformed by Ax moves to being another point on the same line. These lines are thus the natural axes of the transform defined by A.

The equation Ax = λx and the instructions “solve for nonzero x and λ” are deceptive. A more honest way to present the problem would be to transform the equation to polar coordinates. We would have said to find θ and λ such that any point on the line (r, θ) is transformed to (λr, θ). Nonetheless, Ax = λx is how the problem is commonly written.

However we state the problem, here is the picture and solution for A = (2, 1 \ 1, 2)

I used Mata’s eigensystem() function to obtain the eigenvectors and eigenvalues. In the graph, the black and green lines are the eigenvectors.

The first eigenvector is plotted in black. The “eigenvector” I got back from Mata was (0.707 \ 0.707), but that’s just one point on the eigenvector line, the slope of which is 0.707/0.707 = 1, so I graphed the line y = x. The eigenvalue reported by Mata was 3. Thus every point x along the black line moves to three times its distance from the origin when transformed by Ax. I suppressed the origin in the figure, but you can spot it because it is where the black and green lines intersect.

The second eigenvector is plotted in green. The second “eigenvector” I got back from Mata was (-0.707 \ 0.707), so the slope of the eigenvector line is 0.707/(-0.707) = -1. I plotted the line y = -x. The eigenvalue is 1, so the points along the green line do not move at all when transformed by Ax; y=λx and λ=1.
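
If you want to reproduce those numbers, here is a minimal Mata sketch. The Re() calls are only insurance: eigensystem() can hand back results stored as complex, and for this A the imaginary parts are zero.

mata:
A = (2, 1 \ 1, 2)
X = .
L = .
eigensystem(A, X, L)     // eigenvectors in the columns of X, eigenvalues in L
Re(X)                    // (0.707, -0.707 \ 0.707, 0.707), as discussed above
Re(L)                    // (3, 1)
end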

Here’s another example, this time for the matrix A = (1.1, 2 \ 3, 1):

The first “eigenvector” and eigenvalue Mata reported were… Wait! I’m getting tired of quoting the word eigenvector. I’m quoting it because computer software and the mathematical literature call it the eigenvector even though it is just a point along the eigenvector. Actually, what’s being described is not even a vector. A better word would be eigenaxis. Since this posting is pedagogical, I’m going to refer to the computer-reported eigenvector as an eigenpoint along the eigenaxis. When you return to the real world, remember to use the word eigenvector.

The first eigenpoint and eigenvalue that Mata reported were (0.640 \ 0.768) and λ = 3.5. Thus the slope of the eigenaxis is 0.768/0.640 = 1.2, and points along that line — the green line — move to 3.5 times their distance from the origin.

The second eigenpoint and eigenvalue Mata reported were (-0.625 \ 0.781) and λ = -1.4. Thus the slope is -0.781/0.625 = -1.25, and points along that line move to -1.4 times their distance from the origin, which is to say they flip sides and then move out, too. We saw this flipping in my previous posting. You may remember that I put a small circle and triangle at the bottom left and bottom right of the original grid and then let the symbols be transformed by A along with the rest of space. We saw an example like this one, where the triangle moved from the top-left of the original space to the bottom-right of the transformed space. The space was flipped in one of its dimensions. Eigenvalues save us from having to look at pictures with circles and triangles; when a dimension of the space flips, the corresponding eigenvalue is negative.

We examined near singularity last time. Let’s look again, and this time add the eigenaxes:

The blue blob going from bottom-left to top-right is both the compressed space and the first eigenaxis. The second eigenaxis is shown in green.

Mata reported the first eigenpoint as (0.789 \ 0.614) and the second as (-0.460 \ 0.888). Corresponding eigenvalues were reported as 2.78 and 0.07. I should mention that zero eigenvalues indicate singular matrices and small eigenvalues indicate nearly singular matrices. Actually, eigenvalues also reflect the scale of the matrix. A matrix that compresses the space will have all of its eigenvalues be small, and that is not an indication of near singularity. To detect near singularity, one should look at the ratio of the smallest to the largest eigenvalue, which in this case is 0.07/2.78 = 0.03.

Despite appearances, computers do not find 0.03 to be small and thus do not think of this matrix as being nearly singular. This matrix gives computers no problem; Mata can calculate the inverse of this without losing even one binary digit. I mention this and show you the picture so that you will have a better appreciation of just how squished the space can become before computers start complaining.

When do well-programmed computers complain? Say you have a matrix A and make the above graph, but you make it really big — 3 miles by 3 miles. Lay your graph out on the ground and hike out to the middle of it. Now get down on your knees and get out your ruler. Measure the spread of the compressed space at its widest part. Is it an inch? That’s not a problem. One inch is roughly 5*10^-6 of the original space (that is, 1 inch relative to the 3-mile width). If that were a problem, users would complain. It does not become problematic until the spread is around 10^-8 of the original width. Figure about 0.002 inches.

There’s more I could say about eigenvalues and eigenvectors. I could mention that rotation matrices have no eigenvectors and eigenvalues, or at least no real ones. A rotation matrix rotates the space, and thus there are no transformed points that are along their original line through the origin. I could mention that one can rebuild the original matrix from its eigenvectors and eigenvalues, and from that, one can generalize powers to matrix powers. It turns out that A^-1 has the same eigenvectors as A; its eigenvalues are λ^-1 of the original’s. Matrix AA also has the same eigenvectors as A; its eigenvalues are λ^2. Ergo, A^p can be formed by transforming the eigenvalues, and it turns out that, indeed, A^(1/2) really does, when multiplied by itself, produce A.
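
Here is a minimal Mata sketch of that rebuilding idea for the A = (2, 1 \ 1, 2) example from above, whose eigenvalues (3 and 1) are real and positive. Rebuild A from its eigenvectors and eigenvalues, but square-root the eigenvalues first, and the result multiplied by itself gives back A.

mata:
A = (2, 1 \ 1, 2)
X = .
L = .
eigensystem(A, X, L)
X = Re(X)                          // the eigenvalues of this A are real,
L = Re(L)                          // so drop the zero imaginary parts
Ahalf = X*diag(sqrt(L))*luinv(X)   // rebuild A with square-rooted eigenvalues
Ahalf*Ahalf                        // reproduces A, up to roundoff
end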

Understanding matrices intuitively, part 1

I want to show you a way of picturing and thinking about matrices. The topic for today is the square matrix, which we will call A. I’m going to show you a way of graphing square matrices, although we will have to limit ourselves to the 2 x 2 case. That will be, as they say, without loss of generality. The technique I’m about to show you could be used with 3 x 3 matrices if you had a better 3-dimensional monitor, and as will be revealed, it could be used on 3 x 2 and 2 x 3 matrices, too. If you had more imagination, we could use the technique on 4 x 4, 5 x 5, and even higher-dimensional matrices.

But we will limit ourselves to 2 x 2. A might be

[ 2     1 ]
[ 1.5   2 ]

From now on, I’ll write matrices as

A = (2, 1 \ 1.5, 2)

where commas are used to separate elements on the same row and backslashes are used to separate the rows.

To graph A, I want you to think about

y = Ax

where

y: 2 x 1,

A: 2 x 2, and

x: 2 x 1.

That is, we are going to think about A in terms of its effect in transforming points in space from x to y. For instance, if we had the point

x = (0.75 \ 0.25)

then

y = (1.75 \ 1.625)

because by the rules of matrix multiplication y[1] = 0.75*2 + 0.25*1 = 1.75 and y[2] = 0.75*1.5 + 0.25*2 = 1.625. The matrix A transforms the point (0.75 \ 0.25) to (1.75 \ 1.625). We could graph that:
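
If you would rather let Mata do that arithmetic, a quick check reproduces the result:

mata:
A = (2, 1 \ 1.5, 2)
x = (0.75 \ 0.25)
A*x                      // (1.75 \ 1.625)
end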

To get a better understanding of how A transforms the space, we could graph additional points:

I do not want you to get lost among the individual points which A could transform, however. To focus better on A, we are going to graph y = Ax for all x. To do that, I’m first going to take a grid,

One at a time, I’m going to take every point on the grid, call the point x, and run it through the transform y = Ax. Then I’m going to graph the transformed points:

Finally, I’m going to superimpose the two graphs:

In this way, I can now see exactly what A = (2, 1 \ 1.5, 2) does. It stretches the space, and skews it.
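
Here is a minimal Mata sketch of the grid idea, assuming a small 6 x 6 grid on the unit square. Each column of X is one grid point, so a single matrix multiplication transforms the entire grid; plotting the columns of Y against the columns of X produces the kind of before-and-after picture described above.

mata:
A = (2, 1 \ 1.5, 2)
vals = (0..5)/5                               // grid coordinates 0, .2, ..., 1
X = (vals # J(1, 6, 1) \ J(1, 6, 1) # vals)   // 2 x 36: every grid point as a column
Y = A*X                                       // the whole grid, transformed at once
end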

I want you to think about transforms like A as transforms of the space, not of the individual points. I used a grid above, but I could just as well have used a picture of the Eiffel Tower and, pixel by pixel, transformed it by using y = Ax. The result would be a distorted version of the original image, just as the grid above is a distorted version of the original grid. The distorted image might not be helpful in understanding the Eiffel Tower, but it is helpful in understanding the properties of A. So it is with the grids.

Notice that in the above image there are two small triangles and two small circles. I put a triangle and circle at the bottom left and top left of the original grid, and then again at the corresponding points on the transformed grid. They are there to help you orient the transformed grid relative to the original. They wouldn’t be necessary had I transformed a picture of the Eiffel tower.

I’ve suppressed the scale information in the graph, but the axes make it obvious that we are looking at the first quadrant in the graph above. I could just as well have transformed a wider area.

Regardless of the region graphed, you are supposed to imagine two infinite planes. I will graph the region that makes it easiest to see the point I wish to make, but you must remember that whatever I’m showing you applies to the entire space.

We need first to become familiar with pictures like this, so let’s see some examples. Pure stretching looks like this:

Pure compression looks like this:

Pay attention to the color of the grids. The original grid, I’m showing in red; the transformed grid is shown in blue.

A pure rotation (and stretching) looks like this:

Note the location of the triangle; this space was rotated around the origin.

Here’s an interesting matrix that produces a surprising result: A = (1, 2 \ 3, 1).

This matrix flips the space! Notice the little triangles. In the original grid, the triangle is located at the top left. In the transformed space, the corresponding triangle ends up at the bottom right! A = (1, 2 \ 3, 1) appears to be an innocuous matrix — it does not even have a negative number in it — and yet somehow, it twisted the space horribly.

So now you know what 2 x 2 matrices do. They skew, stretch, compress, rotate, and even flip 2-space. In a like manner, 3 x 3 matrices do the same to 3-space; 4 x 4 matrices, to 4-space; and so on.

Well, you are no doubt thinking, this is all very entertaining. Not really useful, but entertaining.

Okay, tell me what it means for a matrix to be singular. Better yet, I’ll tell you. It means this:

A singular matrix A compresses the space so much that the poor space is squished until it is nothing more than a line. It is because the space is so squished after transformation by y = Ax that one cannot take the resulting y and get back the original x. Several different x values get squished into that same value of y. Actually, an infinite number do, and we don’t know which you started with.

A = (2, 3 \ 2, 3) squished the space down to a line. The matrix A = (0, 0 \ 0, 0) would squish the space down to a point, namely (0 \ 0). In higher dimensions, say, k, singular matrices can squish space into k-1, k-2, …, or 0 dimensions. The number of dimensions that survive is called the rank of the matrix.
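
Mata will count those surviving dimensions for you; rank() reports exactly this number. A quick check using the matrices above:

mata:
rank((2, 3 \ 2, 3))      // 1: the space is squished down to a line
rank((0, 0 \ 0, 0))      // 0: squished down to a point
rank((2, 1 \ 1.5, 2))    // 2: full rank, no squishing
end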

Singular matrices are an extreme case of nearly singular matrices, which are the bane of my existence here at StataCorp. Here is what it means for a matrix to be nearly singular:

Nearly singular matrices result in spaces that are heavily but not fully compressed. In nearly singular matrices, the mapping from x to y is still one-to-one, but x’s that are far away from each other can end up having nearly equal y values. Nearly singular matrices cause finite-precision computers difficulty. Calculating y = Ax is easy enough, but calculating the reverse transform x = A^-1 y means taking small differences and blowing them back up, which can be a numeric disaster in the making.

So much for the pictures illustrating that matrices transform and distort space; the message is that they do. This way of thinking can provide intuition and even deep insights. Here’s one:

In the above graph of the fully singular matrix, I chose a matrix that not only squished the space but also skewed the space some. I didn’t have to include the skew. Had I chosen matrix A = (1, 0 \ 0, 0), I could have compressed the space down onto the horizontal axis. And with that, we have a picture of nonsquare matrices. I didn’t really need a 2 x 2 matrix to map 2-space onto one of its axes; a 1 x 2 matrix would have been sufficient. The implication is that, in a very deep sense, nonsquare matrices are identical to square matrices with zero rows or columns added to make them square. You might remember that; it will serve you well.

Here’s another insight:

In the linear regression formula b = (X'X)^-1 X'y, (X'X)^-1 is a square matrix, so we can think of it as transforming space. Let’s try to understand it that way.

Begin by imagining a case where it just turns out that (X'X)^-1 = I. In such a case, (X'X)^-1 would have off-diagonal elements equal to zero, and diagonal elements all equal to one. The off-diagonal elements being equal to 0 means that the variables in the data are uncorrelated; the diagonal elements all being equal to 1 means that the sum of each squared variable would equal 1. That would be true if the variables each had mean 0 and variance 1/N. Such data may not be common, but I can imagine them.

If I had data like that, my formula for calculating b would be b = (X'X)^-1 X'y = IX'y = X'y. When I first realized that, it surprised me because I would have expected the formula to be something like b = X^-1 y. I expected that because we are finding a solution to y = Xb, and b = X^-1 y is an obvious solution. In fact, that’s just what we got, because it turns out that X^-1 y = X'y when (X'X)^-1 = I. They are equal because (X'X)^-1 = I means that X'X = I, which means that X' = X^-1. For this math to work out, we need a suitable definition of inverse for nonsquare matrices. But they do exist, and in fact, everything you need to work it out is right there in front of you.

Anyway, when correlations are zero and variables are appropriately normalized, the linear regression calculation formula reduces to b = X'y. That makes sense to me (now) and yet, it is still a very neat formula. It takes something that is N x k — the data — and makes k coefficients out of it. X'y is the heart of the linear regression formula.

Let’s call b = X'y the naive formula because it is justified only under the assumption that (X'X)^-1 = I, and real X'X inverses are not equal to I. (X'X)^-1 is a square matrix and, as we have seen, that means it can be interpreted as compressing, expanding, and rotating space. (And even flipping space, although it turns out the positive-definite restriction on X'X rules out the flip.) In the formula (X'X)^-1 X'y, (X'X)^-1 is compressing, expanding, and skewing X'y, the naive regression coefficients. Thus (X'X)^-1 is the corrective lens that translates the naive coefficients into the coefficients we seek. And that means X'X is the distortion caused by the scale of the data and the correlations of the variables.

Thus I am entitled to describe linear regression as follows: I have data (y, X) to which I want to fit y = Xb. The naive calculation is b = X'y, which ignores the scale and correlations of the variables. The distortion caused by the scale and correlations of the variables is X'X. To correct for the distortion, I map the naive coefficients through (X'X)^-1.
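
Here is a minimal Mata sketch of that special case, using made-up data. The two regressors are demeaned, made orthogonal to each other, and rescaled so that X'X = I; with that done, the full formula visibly collapses to the naive one.

mata:
N  = 100
x1 = rnormal(N, 1, 0, 1)             // made-up regressors,
x2 = rnormal(N, 1, 0, 1)
x1 = x1 :- mean(x1)                  // given mean 0,
x2 = x2 :- mean(x2)
x2 = x2 - x1*(x1'*x2/(x1'*x1))       // made uncorrelated with each other,
x1 = x1/sqrt(x1'*x1)                 // and scaled so each sum of squares is 1,
x2 = x2/sqrt(x2'*x2)
X  = (x1, x2)                        // so that X'X = I (up to roundoff)
y  = rnormal(N, 1, 0, 1)             // made-up outcome
invsym(X'*X)*X'*y                    // the full formula ...
X'*y                                 // ... agrees with the naive one
end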

Intuition, like beauty, is in the eye of the beholder. When I learned that the variance matrix of the estimated coefficients was equal to s^2 (X'X)^-1, I immediately thought: s^2 — there’s the statistics. That single statistical value is then parceled out through the corrective lens that accounts for scale and correlation. If I had data that didn’t need correcting, then the variances of all the coefficients would be the same and would be identical to the variance of the residuals.

If you go through the derivation of s^2 (X'X)^-1, there’s a temptation to think that s^2 is merely something factored out from the variance matrix, probably to emphasize the connection between the variance of the residuals and standard errors. One easily loses sight of the fact that s^2 is the heart of the matter, just as X'y is the heart of (X'X)^-1 X'y. Obviously, one needs to view both s^2 and X'y through the same corrective lens.

I have more to say about this way of thinking about matrices. Look for part 2 in the near future. Update: part 2 of this posting, “Understanding matrices intuitively, part 2, eigenvalues and eigenvectors”, may now be found at http://blog.stata.com/2011/03/09/understanding-matrices-intuitively-part-2/.

How to read the %21x format, part 2

In my previous posting last week, I explained how computers store binary floating-point numbers, how Stata’s %21x display format displays with fidelity those binary floating-point numbers, how %21x can help you uncover bugs, and how %21x can help you understand behaviors that are not bugs even though they are surprising to us base-10 thinkers. The point is, it is sometimes useful to think in binary, and with %21x, thinking in binary is not difficult.

This week, I want to discuss double versus float precision.

Double (8-byte) precision provides 53 binary digits. Float (4-byte) precision provides 24. Let me show you what float precision looks like.

. display %21x sqrt(2) _newline %21x float(sqrt(2))
+1.6a09e667f3bcdX+000
+1.6a09e60000000X+000

All those zeros in the floating-point result are not really there;
%21x merely padded them on. The display would be more honest if it were

+1.6a09e6       X+000

Of course, +1.6a09e60000000X+000 is a perfectly valid way of writing +1.6a09e6X+000 — just as 1.000 is a valid way of writing 1 — but you must remember that float has fewer digits than double.

Hexadecimal 1.6a09e6 is a rounded version of 1.6a09e667f3bcd, and you can think of this in one of two ways:

     double     =  float   + extra precision
1.6a09e667f3bcd = 1.6a09e6 + 0.00000067f3bcd

or

  float   =      double     -  lost precision
1.6a09e6  = 1.6a09e667f3bcd - 0.00000067f3bcd

Note that more digits are lost than appear in the float result! The float result provides six hexadecimal digits (ignoring the 1), and seven digits appear under the heading lost precision. Double precision is more than twice float precision. To be accurate, double precision provides 53 binary digits and float provides 24, so double precision is really 53/24 = 2.208333 times float precision.

The double of double precision refers to the total number of binary digits used to store the mantissa and the exponent in z=a*2^b, which is 64 versus 32. Precision is 53 versus 24.

In this case, we obtained the floating-point from float(sqrt(2)), meaning that we rounded a more accurate double-precision result. One usually rounds when producing a less precise representation. One of the rounding rules is to round up if the digits being omitted (with a decimal point in front) exceed 1/2, meaning 0.5 in decimal. The equivalent rule in base-16 is to round up if the digits being omitted (with a hexadecimal point in front) exceed 1/2, meaning 0.8 (base-16). The lost digits were .67f3bcd, which are less than 0.8, and therefore, the last digit of the rounded result was not adjusted.

Actually, rounding to float precision is more difficult than I make out, and seeing that numbers are rounded correctly when displayed in %21x can be difficult. These difficulties have to do with the relationship between base-2 — the base in which the computer works — and base-16 — a base similar but not identical to base-2 that we humans find more readable. The fact is that %21x was designed for double precision, so it only does an adequate job of showing single precision. When %21x displays a float-precision number, it shows you the exactly equal double-precision number, and that turns out to matter.

We use base-16 because it is easier to read. But why do we use base-16 and not base-15 or base-17? We use base-16 because it is an integer power of 2, the base the computer uses. One advantage of bases being powers of each other is that base conversion can be done more easily. In fact, conversion can be done almost digit by digit. Doing base conversion is usually a tedious process. Try converting 2394 (base-10) to base-11. Well, you say, 11^3=1331, and 2*1331 = 2662 > 2394, so the first digit is 1 and the remainder is 2394-1331 = 1063. Now, repeating the process with 1063, I observe that 11^2 = 121 and that 1063 is bounded by 8*121=968 and 9*121=1089, so the second digit is 8 and I have a remainder of …. And eventually you produce the answer 1887 (base-11).
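
As an aside, Mata will check such conversions for you: inbase() writes an integer as a string of digits in another base, and frombase() converts it back.

mata:
inbase(11, 2394)         // "1887"
frombase(11, "1887")     // 2394
end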

Converting between bases when one base is a power of the other is not only easier but so easy that you can do it in your head. To convert from base-2 to base-16, group the binary digits into groups of four (because 2^4=16) and then translate each group individually.

For instance, to convert 011110100010, proceed as follows:

0111 1010 0010
--------------
   7    a    2

I’ve performed this process often enough that I hardly have to think. But here is how you should think: Divide the binary number into four-digit groups. The four columns of the binary number stand for 8, 4, 2, and 1. When you look at 0111, say to yourself 4+2+1 = 7. When you look at 1010, say to yourself 8+2 = 10, and remember that the digit for 10 in base-16 is a.

Converting back is nearly as easy:

   7    a    2
--------------
0111 1010 0010

Look at 7 and remember the binary columns 8-4-2-1. Though 7 does not contain an 8, it does contain a 4 (leaving 3), and 3 contains a 2 and a 1.

I admit that converting base-16 to base-2 is more tedious than converting base-2 to base-16, but eventually, you’ll have the four-digit binary table memorized; there are only 16 lines. Say 7 to me, and 0111 just pops into my head. Well, I’ve been doing this a long time, and anyway, I’m a geek. I suspect I carry the as-yet-undiscovered binary gene, which means I came into this world with the base-2-to-base-16 conversion table hardwired:

base-2 base-16
0000 0
0001 1
0010 2
0011 3
0100 4
0101 5
0110 6
0111 7
1000 8
1001 9
1010 a
1011 b
1100 c
1101 d
1110 e
1111 f
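
The same two Mata functions confirm the nibble-by-nibble conversion above: run the binary digits through frombase() and the result back through inbase().

mata:
frombase(2, "011110100010")    // 1954 in base 10
inbase(16, 1954)               // "7a2"
end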

Now that you can convert base-2 to base-16 — convert from binary to hexadecimal — and you can convert back again, let’s return to floating-point numbers.

Remember how floating-point numbers are stored:

z = a * 2^b, 1<=a<2 or a==0

For example,

    0.0 = 0.0000000000000000000000000000000000000000000000000000 * 2^-big
    0.5 = 1.0000000000000000000000000000000000000000000000000000 * 2^-1
    1.0 = 1.0000000000000000000000000000000000000000000000000000 * 2^0
sqrt(2) = 1.0110101000001001111001100110011111110011101111001101 * 2^0
    1.5 = 1.1000000000000000000000000000000000000000000000000000 * 2^0
    2.0 = 1.0000000000000000000000000000000000000000000000000000 * 2^1
    2.5 = 1.0100000000000000000000000000000000000000000000000000 * 2^1
    3.0 = 1.1000000000000000000000000000000000000000000000000000 * 2^1
    _pi = 1.1001001000011111101101010100010001000010110100011000 * 2^1
    etc.

In double precision, there are 53 binary digits of precision. One of the digits is written to the left of the binary point, and the remaining 52 are written to the right. Next observe that the 52 binary digits to the right of the binary point can be written in 52/4=13 hexadecimal digits. That is exactly what %21x does:

    0.0 = +0.0000000000000X-3ff
    0.5 = +1.0000000000000X-001
    1.0 = +1.0000000000000X+000
sqrt(2) = +1.6a09e667f3bcdX+000
    1.5 = +1.8000000000000X+000
    2.0 = +1.0000000000000X+001
    2.5 = +1.4000000000000X+001
    3.0 = +1.8000000000000X+001
    _pi = +1.921fb54442d18X+001

You could perform the binary-to-hexadecimal translation for yourself. Consider _pi. The first group of four binary digits after the binary point are 1001, and 9 appears after the binary point in the %21x result. The second group of four are 0010, and 2 appears in the %21x result.
The %21x result is an exact representation of the underlying binary, and thus you are equally entitled to think in either base.

In single precision, the rule is the same:

z = a * 2^b, 1<=a<2 or a==0

But this time, only 24 binary digits are provided for a, and so we have

    0.0 = 0.00000000000000000000000 * 2^-big
    0.5 = 1.00000000000000000000000 * 2^-1
    1.0 = 1.00000000000000000000000 * 2^0
sqrt(2) = 1.01101010000010011110011 * 2^0
    1.5 = 1.10000000000000000000000 * 2^0
    2.0 = 1.00000000000000000000000 * 2^1
    2.5 = 1.01000000000000000000000 * 2^1
    3.0 = 1.10000000000000000000000 * 2^1
    _pi = 1.10010010000111111011011 * 2^1
    etc.

In single precision, there are 24-1=23 binary digits of precision to the right of the binary point, and 23 is not divisible by 4. If we try to convert to base-16, we end up with

sqrt(2) = 1.0110 1010 0000 1001 1110 011   * 2^0
          1.   6    a    0    9    e    ?  * 2^0

To fill in the last digit, we could recognize that we can pad on an extra 0 because we are to the right of the binary point. For example, 1.101 == 1.1010. If we padded on the extra 0, we have

sqrt(2) = 1.0110 1010 0000 1001 1110 0110  * 2^0
          1.   6    a    0    9    e    6  * 2^0

That is precisely the result %21x shows us:

. display %21x float(sqrt(2))
+1.6a09e60000000X+000

although we might wish that %21x would omit the 0s that aren’t really there, and instead display this as +1.6a09e6X+000.

The problem with this solution is that it can be misleading because the last digit looks like it contains four binary digits when in fact it contains only three. To show how easily you can be misled, look at _pi in double and float precisions:

. display %21x _pi _newline %21x float(_pi)
+1.921fb54442d18X+001
+1.921fb60000000X+001
        ^
  digit incorrectly rounded?

The computer rounded the last digit up from 5 to 6. The digits after the rounded-up digit in the full-precision result, however, are 0.4442d18, and are clearly less than 0.8 (1/2). Shouldn’t the rounded result be 1.921fb5X+001? The answer is that yes, 1.921fb5X+001 would be a better result if we had 6*4=24 binary digits to the right of the binary point. But we have only 23 digits; correctly rounding to 23 binary digits and then translating into base-16 results in 1.921fb6X+001. Because of the missing binary digit, the last base-16 digit can only take on the values 0, 2, 4, 6, 8, a, c, and e.

The computer performs the rounding in binary. Look at the relevant piece of this double-precision number in binary:

+1.921f   b    5    4    4    42d18X+001      number
       1011 0101 0100 0100 0100               expansion into binary
       1011 01?x xxxx xxxx xxxxxxxx           thinking about rounding
       1011 011x xxxx xxxx xxxxxxxx           performing rounding
+1.921f   b    6                   X+001      convert to base-16

The part I have converted to binary in the second line is around the part to be rounded. In the third line, I’ve put x’s under the part we will have to discard to round this double into a float. The x’d out part — 10100… — is clearly greater than 1/2, so the last digit (where I put a question mark) must be rounded up. Thus, _pi in float precision rounds to 1.921fb6X+001, just as the computer said.

Float precision does not play much of a role in Stata despite the fact that most users store their data as floats. Regardless of how data are stored, Stata makes all calculations in double precision, and float provides more than enough precision for most data applications. The U.S. deficit in 2011 is projected to be $1.5 trillion. One hopes that a grand total of $26,624 — the error that would be introduced by storing this projected deficit in float precision — would not be a significant factor in any lawmaker’s decision concerning the issue. People in the U.S. are said to work about 40 hours per week, or roughly 0.238 of the hours in a week. I doubt that number is accurate to 0.4 milliseconds, the error that float would introduce in recording the fraction. A cancer survivor might live 350.1 days after a treatment, but we would introduce an error of roughly 1/2 second if we record the number as a float. One might question whether the instant of death could even conceptually be determined that accurately. The moon is said to be 384.401 thousand kilometers from the Earth. Record in 1,000s of kilometers in float, and the error is almost 1 meter. At its closest and farthest, the moon is 356,400 and 406,700 kilometers away. Most fundamental constants of the universe are known only to a few parts in a million, which is to say, to less than float precision, although we do know the speed of light in a vacuum to one decimal digit beyond float accuracy; it’s 299,792.458 kilometers per second. Round that to float and you’ll be off by 0.01 km/s.
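
You can verify the deficit example directly in Stata. float() rounds the value to float precision, and the difference from the true value is the $26,624 just mentioned:

. display float(1.5e12) - 1.5e12
26624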

The largest integer that can be recorded without rounding in float precision is 16,777,215. The largest integer that can be recorded without rounding in double precision is 9,007,199,254,740,991.

People working with dollar-and-cent data in Stata usually find it best to use doubles both to avoid rounding issues and in case the total exceeds $167,772.15. Rounding issues of 0.01, 0.02, etc., are inherent when working with binary floating point, regardless of precision. To avoid all problems, these people should use doubles and record amounts in pennies. That will have no difficulty with sums up to $90,071,992,547,409.91, which is to say, about $90 trillion. That’s nine quadrillion pennies. In my childhood, I thought a quadrillion just meant a lot, but it has a formal definition.

All of which is a long way from where I started, but now you are an expert in understanding binary floating-point numbers the way a scientific programmer needs to understand them: z=a*2^b. You are nearly all the way to understanding the IEEE 754-2008 standard. That standard merely states how a and b are packed into 32 and 64 bits, and the entire point of %21x is to avoid those details because, packed together, the numbers are unreadable by humans.

References

Cox, N. J. 2006. Tip 33: Sweet sixteen: Hexadecimal formats and precision problems. Stata Journal 6: 282-283.

Gould, William. 2006. Mata matters: Precision. Stata Journal 6: 550-560.

Linhart, J. M. 2008. Mata matters: Overflow and IEEE floating-point format. Stata Journal 8: 255-268.