Archive

Archive for June 2011

Stata 12 Announced

We are pleased to announce a new version of Stata: Stata 12. You can order it today, it starts shipping on July 25, and you can find out about it at www.stata.com/stata12/.

Here are the highlights of what’s new:

There are other things that are new, too. Things like functions for Tukey’s Studentized range and Dunnett’s multiple range, baseline odds for logistic regression, truncated count-data regressions, probability predictions, robust and cluster-robust SEs for fixed-effects Poisson regression, and the like under General Statistics. Or under Survey Data, support for SEM, bootstrap and successive difference replicate (SDR) weights, goodness of fit after binary models, coefficient of variation, and more. Or under Panel Data, probability predictions, multiple imputation support, and more. Or under Survival Data, a goodness-of-fit statistic that is robust to censoring. Or PDF export of results and graphs.

We could go on, but you get the idea. We think Stata 12 is worth a look.

Precision (yet again), Part II

In part I, I wrote about precision issues in English. If you enjoyed that, you may want to stop reading now, because I’m about to go into the technical details. Actually, these details are pretty interesting.

For instance, I offered the following formula for calculating error due to float precision:

maximum_error = 2-24 X

I later mentioned that the formula is an approximation, and said that the true formula is,

maximum_error = 2-24 2floor(log2 X)

I didn’t explain how I got either formula.

I need to be more precise today than I was in my previous posting. For instance, I previously used x for two concepts, the true value and the rounded-after-storage value. Today I need to distinguish those concepts.

X is the true value.

x is the value after rounding due to storage.

The issue is the difference between x and X when X is stored in 24-binary-digit float precision.

Base 10

Although I harp on the value of learning to think in binary and hexadecimal, I admit that I, too, find it easier to think in base 10. So let’s start that way.

Say we record numbers to two digits of accuracy, which I will call d=2. Examples of d=2 numbers include

 1.0
 1.6
12
47
52*10^1 (i.e, 520, but with only two significant digits)

To say that we record numbers to two digits of accuracy is to say that, coming upon the recorded number 1, we know only that the number lies between 0.95 and 1.05; or coming upon 12, that the true number lies between 11.5 and 12.5, and so on. I assume that numbers are rounded efficiently, which is to say, stored values record midpoints of intervals.

Before we get into the math, let me note that most us would be willing to say that numbers recorded this way are accurate to 1 part in 10 or, if d=3, to 1 part in 100. If numbers are accurate to 1 part in 10^(d-1), then couldn’t we must multiply the number by 1/(10^(d-1)) to obtain the width of the interval? Let’s try:

Assume X=520 and d=2. Then 520/(10^(2-1)) = 52. The true interval, however, is (515, 525] and it has width 10. So the simple formula does not work.

The simple formula does not work yet I presented its base-2 equivalent in Part 1 and I even recommended its use! We will get to that. It turns out the smaller the base, the more accurately the simple formula approximates the true formula, but before I can show that, I need the true formula.

Let’s start by thinking about d=1.

  1. The recorded number 0 will contain all numbers between [-0.5, 0.5). The recorded number 1 will contain all numbers between [0.5, 1.5), and so on. For 0, 1, …, 9, the width of the intervals is 1.

  2. The recorded number 10 will contain all numbers between [5, 15). The recorded number 20 will contain all numbers between [15, 25), and so on, For 10, 20, …, 90, the width of the intervals is 10.

The derivation for the width of interval goes like this:

  1. If we recorded the value of X to one decimal digit, the recorded digit will will be b, the recorded value will be x = b*10p, and the power of ten will be p = floor(log10X). More importantly, W1 = 10p will be the width of the interval containing X.

  2. It therefore follows that if we recorded the value of X to two decimal digits, the interval length would be W2 = W1/10. What ever the width with one digit, adding another must reduce width by one-tenth.

  3. If we recorded the value of X to three decimal digits, the interval length would be W3 = W2/10.

  4. Thus, if d is the number of digits to which numbers are recorded, the width of the interval is 10p where p = floor(log10X) – (d-1).

The above formula is exact.

Base 2

Converting the formula

interval_width = 10floor(log10X)-(d-1)

from base 10 to base 2 is easy enough:

interval_width = 2floor(log2X)-(d-1)

In Part 1, I presented this formula for d=24 as

maximum_error = 2floor(log2X)-24 = 2 -24 2floor(log2 X)

In interval_width, it is d-1 and not d that appears in the formula. You might think I made an error and should have put -23 where I put -24 in the maximum_error formula. There is no mistake. In Part 1, the maximum error was defined as a plus-or-minus quantity and is thus half the width of the overall interval. So I divided by 2, and in effect, I did put -23 into the maximum_error formula, at least before I subracted one more from it, making it -24 again.

I started out this posting by considering and dismissing the base-10 approximation formula

interval_width = 10-(d-1) X

which in maximum-error units is

maximum_error = 10-d X

and yet in Part 1, I presented — and even recommended — its base-2, d=24 equivalent,

maximum_error = 2-24 X

It turns out that the approximation formula is not as inaccurate in base 2 and it would be in base 10. The correct formula,

maximum_error = 2floor(log2X)-d

can be written

maximum_error = 2-d 2floor(log2X

so the question becomes about the accuracy of substituting X for 2^floor(log2X). We know by examination that X ≥ 2^floor(log2X), so making the substitution will overstate the error and, in that sense, is a safe thing to do. The question becomes how much the error is overstated.

X can be written 2^(log2X) and thus we need to compare 2^(log2X) with 2^floor(log2X). The floor() function cannot reduce its argument by more than 1, and thus 2^(log2X) cannot differ from 2^floor(log2X) by more than a factor of 2. Under the circumstances, this seems a reasonable approximation.

In the case of base 10, the the floor() function reducing its argument by up to 1 results in a decrease of up to a factor of 10. That, it seems to me, is not a reasonable amount of error.

Categories: Numerical Analysis Tags: ,

Precision (yet again), Part I

I wrote about precision here and here, but they were pretty technical.

“Great,” coworkers inside StataCorp said to me, “but couldn’t you explain these issues in a way that doesn’t get lost in the details of how computers store binary and maybe, just maybe, write about floats and doubles from a user’s perspective instead of programmer’s perspective?”

“Mmmm,” I said clearly.

Later, when I tried, I liked the result. It contains new material, too. What follows is what I now wish I had written first. I’d would have still written the other two postings, but as technical appendices.

In Part 2 (forthcoming), I provide the mathematical derivations underlying what follows. There are a few interesting issues underlying what follows.

Please excuse the manualish style of what follows, but I suspect that what follows will eventually work its way into Stata’s help files or manuals, so I wrote it that way.


Syntax

Problem:

. generate x = 1.1

. list
  (Stata displays output showing x is 1.1 in all observations)

. count if x==1.1
     0

Solution 1:

. count if x==float(1.1)
   100

Solution 2:

. generate double x = 1.1

. count if x==1.1
  100

Solution 3:

. set type double

. generate x = 1.1

. count if x==1.1
  100

Description

Stata works in binary. Stata stores data in float precision by default. Stata preforms all calculations in double precision. Sometimes the combination results in surprises until you think more carefully about what happened.

Remarks

Remarks are presented under the headings

Summary
Why count==1.1 produces 0
How count==float(1.1) solves the problem
How storing data as double appears to solve the problem (and does)
Float is plenty accurate to store most data
Why don’t I have the problems using Excel?

Summary

Justifications for all statements made appear in the sections below. In summary,

  1. It sometimes appears that Stata is inaccurate. That is not true and, in fact, the appearance of inaccuracy happens in part because Stata is so accurate.

  2. You can cover up this appearance of inaccuracy by storing all your data in double precision. This will double or more the size of your dataset, and so I do not recommend the double-precision solution unless your dataset is small relative to the amount of memory on your computer. In that case, there is nothing wrong with storing all your data in double precision.

    The easiest way to implement the double-precision solution is by typing set type double. After that, Stata will default to to creating all new variables as doubles, at least for the remainder of the session. If all your datasets are small relative to the amount of memory on your computer, you can set type double, permanently.

  3. The double-precision solution is needlessly wasteful of memory. It is difficult to imagine data that are accurate to more than float precision. Regardless of how your data are stored, Stata does all calculations in double precision, and sometimes in quad precision.

  4. The issue of 1.1 not being equal to 1.1 arises only with “nice” decimal numbers. You just have to remember to use Stata’s float() function when dealing with such numbers.

Why count x==1.1 produces 0

Let’s trace through what happens when you type the commands

. generate x = 1.1

. count if x==1.1
     0

Here is how it works:

  1. Some numbers have no exact finite-digit binary representation just as some numbers have no exact finite-digit decimal representation. One-third, 0.3333… (base 10), is an example of a number with no exact finite-digit decimal representation. In base 12, one-third does have an exact finite-digit representation, namely 0.4 (base 12). In base 2 (binary), base 10 numbers such as 0.1, 0.2, 0.3, 0.4, 0.6, … have no exact finite-digit representation.

  2. Computers store numbers with a finite number of binary digits. In float precision, numbers have 24 binary digits. In double precision, they have 53 binary digits.

    The decimal number 1.1 in binary is 1.000110011001… (base 2). The 1001 on the end repeats forever. Thus, 1.1 (base 10) is stored by a computer as

     1.00011001100110011001101
    

    in float, or as

     1.0001100110011001100110011001100110011001100110011010
    

    in double. There are 24 and 53 digits in the numbers above.

  3. Typing generate x = 1.1 results in 1.1 being interpreted as the longer binary number Stata performs all calculations in double precision. New variable x is created as a float by default. When the more precise number is stored in x, it is rounded to the shorter number.

  4. Thus when you count if x==1.1 the result is 0 because 1.1 is again interpreted as the longer binary number and the longer number is compared to shorter number stored in x, and they are not equal.

How count x==float(1.1) solves the problem

One way to fix the problem is to change count if x==1.1 to read count if x==float(1.1):

. generate x = 1.1

. count if x==float(1.1)
   100

Function float() rounds results to float precision. When you type float(1.1), the 1.1 is converted to binary, double precision, namely,

 1.0001100110011001100110011001100110011001100110011010 (base 2)

and float() then rounds that long binary number to

 1.00011001100110011001101 (base 2)

or more correctly, to

 1.0001100110011001100110000000000000000000000000000000 (base 2)

because the number is still stored in double precision. Regardless, this new value is equal to the value stored in x, and so count reports that 100 observations contain float(1.1).

As an aside, when you typed generate x = 1.1, Stata acted as if you typed generate x = float(1.1). Whenever you type generate x = … and x is a float, Stata acts if if you typed generate x = float(…).

How storing data as double appears to solve the problem (and does)

When you type

. generate double x = 1.1

. count if x==1.1
   100

it should be pretty obvious how the problem was solved. Stata stores

1.0001100110011001100110011001100110011001100110011010 (base 2)

in x, and then compares the stored result to

1.0001100110011001100110011001100110011001100110011010 (base 2)

and of course they are equal.

In the Summary above, I referred to this as a cover up. It is a cover up because 1.1 (base 10) is not what is stored in x. What is stored in x is the binary number just shown, and to be equal to 1.1 (base 10), the binary number needs to suffixed with 1001, and then another 1001, and then another, and so on without end.

Stata tells you that x is equal to 1.1 because Stata converted the 1.1 in count to the same inexact binary representation as Stata previously stored in x, and those two values are equal, but neither is equal to 1.1 (base 10). This leads to an important property of digital computers:

If storage and calculation are done to the same precision, it will appear to the user as if all numbers that the user types are stored without error.

That is, it appears to you as if there is no inaccuracy in storing 1.1 in x when x is a double because Stata performs calculations in double. And it is equally true that it would appear to you as if there were no accuracy issues storing 1.1 when x is stored in float precision if Stata, observing that x is float, performed calculations involving x in float. The fact is that there are accuracy issues in both cases.

“Wait,” you are probably thinking. “I understand your argument, but I’ve always heard that float is inaccurate and double is accurate. I understand from your argument that it is only a matter of degree but, in this case, those two degrees are on opposite sides of an important line.”

“No,” I respond.

What you have heard is right with respect to calculation. What you have heard might apply to data storage too, but that is unlikely. It turns out that float provides plenty of precision to store most real measurements.

Float is plenty accurate to store most data

The misconception that float precision is inaccurate comes from the true statement that float precision is not accurate enough when it comes to making calculations with stored values. Whether float precision is accurate enough for storing values depends solely on the accuracy with which the values are measured.

Float precision provides 24 base-2 (binary) digits, and thus values stored in float precision have a maximum relative error error of plus-or-minus 2^(-24) = 5.96e-08, or less than +/-1 part in 15 million.

  1. The U.S. deficit in 2011 is projected to be $1.5 trillion. Stored as a float, the number has a (maximum) error of 2^(-24) * 1.5e+12 = $89,407. That is, if the true number is 1.5 trillion, the number recorded in float precision is guaranteed to be somewhere in the range [(1.5e+12)-89,407, (1.5e+14)+89,407]. The projected U.S. deficit is not known to an accuracy of +/-$89,407.

  2. People in the US work about 40 hours per week, or roughly 0.238 of the hours in the week. 2^(-24) * 0.238 = 1.419e-09 of a week, or 0.1 milliseconds. Time worked in a week is not known to an accuracy of +/-0.1 milliseconds.

  3. A cancer survivor might live 350 days. 2^(-24) * 350 = .00002086, or 1.8 seconds. Time of death is rarely recorded to an accuracy of +/-1.8 seconds. Time of diagnosis is never recorded to such accuracy, nor could it be.

  4. The moon is said to be 384,401 kilometers from the Earth. 2^(-24) * 348,401 = 0.023 kilometers, or 23 meters. At its closest and farthest, the moon is 356,400 and 406,700 kilometers from Earth.

  5. Most fundamental constants of the universe are known to a few parts in a million, which is to say, less than 1 part in 15 million, the accuracy float precision can provide. An exception is the speed of light, measured to be 299,793.458 kilometers per second. Record that as a float and you will be off by 0.01 km/s.

In all the examples except the last, quoted are worst-case scenarios. The actual errors depend on the exact number and is a more tedious calculation (not shown):

  1. For the U.S. deficit, the exact error for 1.5 trillion is -$26,624, which is within the plus or minus $89,407 quoted.

  2. For fraction of the week, at 0.238 the error is -0.04 milliseconds, which is within the +/-0.1 milliseconds quoted.

  3. For cancer survival time, at 350 days the actual error is 0, which is within the +/-1.8 seconds quoted.

  4. For the distance between the Earth and moon, the actual error is 0, which is within within the +/-23 meters quoted.

The actual errors may be interesting, but the maximum errors are more useful. Remember the multiplier 2^(-24). All you have to do is multiply a measurement by 2^(-24) and compare the result with the inherent error in the measurement. If 2^(-24) multiplied by the measurement is less than the inherent error, you can use float precision to store your data. Otherwise, you need to use double.

By the way, the formula

maximum_error = 2^(-24) * x

is an approximation. The true formula is

maximum_error = 2^(-24) * 2^(floor(log2(x)))

It can be readily proven that x ≥ 2^(floor(log2(x))) and thus the approximation formula overstates the maximum error. The approximation formula can overstate the maximum error by as much as a factor of 2. Float precision is adequate for most data. There is one kind of data, however, where float precision may not be adequate, and that is financial data such as sales data, general ledgers, and the like. People working with dollar-and-cent data, or Euro-and-Eurocent data, or Pound Stirling-and-penny data, or any other currency data, usually find it best to use doubles. To avoid rounding issues, it is preferable to store the data as pennies. Float precision binary cannot store 0.01, 0.02, and the like, exactly. Integer values, however, can be stored exactly, at least up to certain 16,777,215.

Floats can store up to 16,777,215 exactly. If stored your data in pennies, that would correspond to $167,772.15.

Doubles can store up to 9,007,199,254,740,991 exactly. If you stored your data in pennies, the would correspond to $90,071,992,547,409.91, or just over $90 trillion.

Why don’t I have these problems using Excel?

You do not have these problems when you use Excel because Excel stores numeric values in double precision. As I explained in How float(1.1) solves the problem above,

If storage and calculation are done to the same precision, it will appear to the user as if all numbers that the user types are stored without error.

You can adopt the Excel solution in Stata by typing

. set type double, permanently

You will double (or more) the amount of memory Stata uses to store your data, but if that is not of concern to you, there are no other disadvantages to adopting this solution. If you adopt this solution and later wish to change your mind, type

. set type float, permanently

That’s all for today

If you enjoyed the above, you may want to see Part II (forthcoming). As I said, There are a few technical issues underlying what is written above that may interest those interested in computer science as it applies to statistical computing.

Categories: Numerical Analysis Tags: ,

Stata at JSM 2011 in Miami Beach, FL

StataCorp invites you to stop by our booth, 404, at JSM 2011, July 31 – August 3, in Miami Beach, FL. StataCorp staff and developers will be on hand to answer any questions you have about Stata, from statistics to programming to licensing. You can also register to win a copy of quad-core Stata/MP.

StataCorp is also presenting three continuing education technology workshops at JSM 2011:

Survey Data Analysis with Stata
Jeffrey Pitblado, Associate Director, Statistical Software
Wednesday, August 3, 8:00 AM – 9:45 AM
Register for Activity Number CE_24T

Multiple Imputation Analysis in Stata
Yulia Marchenko, Associate Director, Biostatistics
Wednesday, August 3, 10:00 AM – 11:45 AM
Register for Activity Number CE_28T

Multilevel and Mixed Models in Stata
Bill Rising, Director of Educational Services
Wednesday, August 3, 1:00 PM – 2:45 PM
Register for Activity Number CE_32T

To register for the workshops, sign up when you register to attend JSM or go to http://www.amstat.org/meetings/jsm/2011/onlineprogram/.

We look forward to seeing you in Miami Beach. Be sure to stop by booth 404 to learn more about Stata or just to visit with the people who make it.

Categories: Meetings Tags: , ,