Home > Numerical Analysis > Precision (yet again), Part II

## Precision (yet again), Part II

In part I, I wrote about precision issues in English. If you enjoyed that, you may want to stop reading now, because I’m about to go into the technical details. Actually, these details are pretty interesting.

For instance, I offered the following formula for calculating error due to float precision:

maximum_error = 2-24 X

I later mentioned that the formula is an approximation, and said that the true formula is,

maximum_error = 2-24 2floor(log2 X)

I didn’t explain how I got either formula.

I need to be more precise today than I was in my previous posting. For instance, I previously used x for two concepts, the true value and the rounded-after-storage value. Today I need to distinguish those concepts.

X is the true value.

x is the value after rounding due to storage.

The issue is the difference between x and X when X is stored in 24-binary-digit float precision.

Base 10

Although I harp on the value of learning to think in binary and hexadecimal, I admit that I, too, find it easier to think in base 10. So let’s start that way.

Say we record numbers to two digits of accuracy, which I will call d=2. Examples of d=2 numbers include

``` 1.0
1.6
12
47
52*10^1 (i.e, 520, but with only two significant digits)
```

To say that we record numbers to two digits of accuracy is to say that, coming upon the recorded number 1, we know only that the number lies between 0.95 and 1.05; or coming upon 12, that the true number lies between 11.5 and 12.5, and so on. I assume that numbers are rounded efficiently, which is to say, stored values record midpoints of intervals.

Before we get into the math, let me note that most us would be willing to say that numbers recorded this way are accurate to 1 part in 10 or, if d=3, to 1 part in 100. If numbers are accurate to 1 part in 10^(d-1), then couldn’t we must multiply the number by 1/(10^(d-1)) to obtain the width of the interval? Let’s try:

Assume X=520 and d=2. Then 520/(10^(2-1)) = 52. The true interval, however, is (515, 525] and it has width 10. So the simple formula does not work.

The simple formula does not work yet I presented its base-2 equivalent in Part 1 and I even recommended its use! We will get to that. It turns out the smaller the base, the more accurately the simple formula approximates the true formula, but before I can show that, I need the true formula.

Let’s start by thinking about d=1.

1. The recorded number 0 will contain all numbers between [-0.5, 0.5). The recorded number 1 will contain all numbers between [0.5, 1.5), and so on. For 0, 1, …, 9, the width of the intervals is 1.

2. The recorded number 10 will contain all numbers between [5, 15). The recorded number 20 will contain all numbers between [15, 25), and so on, For 10, 20, …, 90, the width of the intervals is 10.

The derivation for the width of interval goes like this:

1. If we recorded the value of X to one decimal digit, the recorded digit will will be b, the recorded value will be x = b*10p, and the power of ten will be p = floor(log10X). More importantly, W1 = 10p will be the width of the interval containing X.

2. It therefore follows that if we recorded the value of X to two decimal digits, the interval length would be W2 = W1/10. What ever the width with one digit, adding another must reduce width by one-tenth.

3. If we recorded the value of X to three decimal digits, the interval length would be W3 = W2/10.

4. Thus, if d is the number of digits to which numbers are recorded, the width of the interval is 10p where p = floor(log10X) – (d-1).

The above formula is exact.

Base 2

Converting the formula

interval_width = 10floor(log10X)-(d-1)

from base 10 to base 2 is easy enough:

interval_width = 2floor(log2X)-(d-1)

In Part 1, I presented this formula for d=24 as

maximum_error = 2floor(log2X)-24 = 2 -24 2floor(log2 X)

In interval_width, it is d-1 and not d that appears in the formula. You might think I made an error and should have put -23 where I put -24 in the maximum_error formula. There is no mistake. In Part 1, the maximum error was defined as a plus-or-minus quantity and is thus half the width of the overall interval. So I divided by 2, and in effect, I did put -23 into the maximum_error formula, at least before I subracted one more from it, making it -24 again.

I started out this posting by considering and dismissing the base-10 approximation formula

interval_width = 10-(d-1) X

which in maximum-error units is

maximum_error = 10-d X

and yet in Part 1, I presented — and even recommended — its base-2, d=24 equivalent,

maximum_error = 2-24 X

It turns out that the approximation formula is not as inaccurate in base 2 and it would be in base 10. The correct formula,

maximum_error = 2floor(log2X)-d

can be written

maximum_error = 2-d 2floor(log2X

so the question becomes about the accuracy of substituting X for 2^floor(log2X). We know by examination that X ≥ 2^floor(log2X), so making the substitution will overstate the error and, in that sense, is a safe thing to do. The question becomes how much the error is overstated.

X can be written 2^(log2X) and thus we need to compare 2^(log2X) with 2^floor(log2X). The floor() function cannot reduce its argument by more than 1, and thus 2^(log2X) cannot differ from 2^floor(log2X) by more than a factor of 2. Under the circumstances, this seems a reasonable approximation.

In the case of base 10, the the floor() function reducing its argument by up to 1 results in a decrease of up to a factor of 10. That, it seems to me, is not a reasonable amount of error.

Categories: Numerical Analysis Tags: