Author Archive

How to read the %21x format, part 2

In my previous posting last week, I explained how computers store binary floating-point numbers, how Stata’s %21x display format displays with fidelity those binary floating-point numbers, how %21x can help you uncover bugs, and how %21x can help you understand behaviors that are not bugs even though they are surprising to us base-10 thinkers. The point is, it is sometimes useful to think in binary, and with %21x, thinking in binary is not difficult.

This week, I want to discuss double versus float precision.

Double (8-byte) precision provides 53 binary digits. Float (4-byte) precision provides 24. Let me show you what float precision looks like.

. display %21x sqrt(2) _newline %21x float(sqrt(2))
+1.6a09e667f3bcdX+000
+1.6a09e60000000X+000

All those zeros in the float result are not really there;
%21x merely padded them on. The display would be more honest if it were

+1.6a09e6       X+000

Of course, +1.6a09e60000000X+000 is a perfectly valid way of writing +1.6a09e6X+000 — just as 1.000 is a valid way of writing 1 — but you must remember that float has fewer digits than double.

Hexadecimal 1.6a09e6 is a rounded version of 1.6a09e667f3bcd, and you can think of this in one of two ways:

     double     =  float   + extra precision
1.6a09e667f3bcd = 1.6a09e6 + 0.00000067f3bcd


  float   =      double     -  lost precision
1.6a09e6  = 1.6a09e667f3bcd - 0.00000067f3bcd

Note that more digits are lost than appear in the float result! The float result provides six hexadecimal digits (ignoring the 1), and seven digits appear under the heading lost precision. Double precision is more than twice float precision: double provides 53 binary digits and float provides 24, so double precision is really 53/24 ≈ 2.21 times float precision.

The “double” in double precision refers to the total number of binary digits used to store the mantissa and the exponent of z=a*2^b, which is 64 versus 32. Precision is 53 binary digits versus 24.

In this case, we obtained the float-precision result from float(sqrt(2)), meaning that we rounded a more accurate double-precision result. One usually rounds when producing a less precise representation. One of the rounding rules is to round up if the digits being omitted (with a decimal point in front) exceed 1/2, meaning 0.5 in decimal. The equivalent rule in base-16 is to round up if the digits being omitted (with a hexadecimal point in front) exceed 1/2, meaning 0.8 (base-16). The lost digits here were .67f3bcd, which are less than 0.8, and therefore the last digit of the rounded result was not adjusted.

Actually, rounding to float precision is more difficult than I make out, and seeing that numbers are rounded correctly when displayed in %21x can be difficult. These difficulties have to do with the relationship between base-2 — the base in which the computer works — and base-16 — a base similar but not identical to base-2 that we humans find more readable. The fact is that %21x was designed for double precision, so it only does an adequate job of showing single precision. When %21x displays a float-precision number, it shows you the exactly equal double-precision number, and that turns out to matter.

We use base-16 because it is easier to read. But why do we use base-16 and not base-15 or base-17? We use base-16 because it is an integer power of 2, the base the computer uses. One advantage of bases being powers of each other is that base conversion can be done more easily. In fact, conversion can be done almost digit by digit. Doing base conversion is usually a tedious process. Try converting 2394 (base-10) to base-11. Well, you say, 11^3=1331, and 2*1331 = 2662 > 2394, so the first digit is 1 and the remainder is 2394-1331 = 1063. Now, repeating the process with 1063, I observe that 11^2 = 121 and that 1063 is bounded by 8*121=968 and 9*121=1089, so the second digit is 8 and I have a remainder of …. And eventually you produce the answer 1887 (base-11).

Converting between bases when one is the power of another not only is easier but also is so easy you can do it in your head. To convert from base-2 to base-16, group the binary digits into groups of four (because 2^4=16) and then translate each group individually.

For instance, to convert 011110100010, proceed as follows:

0111 1010 0010
   7    a    2

I’ve performed this process often enough that I hardly have to think. But here is how you should think: Divide the binary number into four-digit groups. The four columns of the binary number stand for 8, 4, 2, and 1. When you look at 0111, say to yourself 4+2+1 = 7. When you look at 1010, say to yourself 8+2 = 10, and remember that the digit for 10 in base-16 is a.

Converting back is nearly as easy:

   7    a    2
0111 1010 0010

Look at 7 and remember the binary columns 8-4-2-1. Though 7 does not contain an 8, it does contain a 4 (leaving 3), and 3 contains a 2 and a 1.

I admit that converting base-16 to base-2 is more tedious than converting base-2 to base-16, but eventually, you’ll have the four-digit binary table memorized; there are only 16 lines. Say 7 to me, and 0111 just pops into my head. Well, I’ve been doing this a long time, and anyway, I’m a geek. I suspect I carry the as-yet-undiscovered binary gene, which means I came into this world with the base-2-to-base-16 conversion table hardwired:

base-2 base-16
0000 0
0001 1
0010 2
0011 3
0100 4
0101 5
0110 6
0111 7
1000 8
1001 9
1010 a
1011 b
1100 c
1101 d
1110 e
1111 f
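
If you would rather let the computer do the grouping, here is a minimal Mata sketch of the digit-by-digit rule just described; bin2hex() is a name I made up, not a built-in function:

mata:
// a sketch of the grouping rule above; bin2hex() is a made-up name
string scalar bin2hex(string scalar b)
{
    string scalar hexdigits, result
    real scalar   i, v

    hexdigits = "0123456789abcdef"
    while (mod(strlen(b), 4)) b = "0" + b          // left-pad to a multiple of 4 bits
    result = ""
    for (i=1; i<=strlen(b); i=i+4) {
        // weight the four bits by 8, 4, 2, and 1, then look up the hex digit
        v = 8*(substr(b,i,1)=="1") + 4*(substr(b,i+1,1)=="1") + 2*(substr(b,i+2,1)=="1") + (substr(b,i+3,1)=="1")
        result = result + substr(hexdigits, v+1, 1)
    }
    return(result)
}
bin2hex("011110100010")                            // displays 7a2
end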

Now that you can convert base-2 to base-16 — convert from binary to hexadecimal — and you can convert back again, let’s return to floating-point numbers.

Remember how floating-point numbers are stored:

z = a * 2^b, 1<=a<2 or a==0

For example,

    0.0 = 0.0000000000000000000000000000000000000000000000000000 * 2^-big
    0.5 = 1.0000000000000000000000000000000000000000000000000000 * 2^-1
    1.0 = 1.0000000000000000000000000000000000000000000000000000 * 2^0
sqrt(2) = 1.0110101000001001111001100110011111110011101111001101 * 2^0
    1.5 = 1.1000000000000000000000000000000000000000000000000000 * 2^0
    2.0 = 1.0000000000000000000000000000000000000000000000000000 * 2^1
    2.5 = 1.0100000000000000000000000000000000000000000000000000 * 2^1
    3.0 = 1.1000000000000000000000000000000000000000000000000000 * 2^1
    _pi = 1.1001001000011111101101010100010001000010110100011000 * 2^1

In double precision, there are 53 binary digits of precision. One of the digits is written to the left of the binary point, and the remaining 52 are written to the right. Next observe that the 52 binary digits to the right of the binary point can be written in 52/4=13 hexadecimal digits. That is exactly what %21x does:

    0.0 = +0.0000000000000X-3ff
    0.5 = +1.0000000000000X-001
    1.0 = +1.0000000000000X+000
sqrt(2) = +1.6a09e667f3bcdX+000
    1.5 = +1.8000000000000X+000
    2.0 = +1.0000000000000X+001
    2.5 = +1.4000000000000X+001
    3.0 = +1.8000000000000X+001
    _pi = +1.921fb54442d18X+001

You could perform the binary-to-hexadecimal translation for yourself. Consider _pi. The first group of four binary digits after the binary point are 1001, and 9 appears after the binary point in the %21x result. The second group of four are 0010, and 2 appears in the %21x result.
The %21x result is an exact representation of the underlying binary, and thus you are equally entitled to think in either base.
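
One way to convince yourself that all 52 of those binary digits to the right of the binary point really are there is to flip just the last one; a small sketch:

. display %21x 1+2^-52          // +1.0000000000001X+000: only the last of the 52 bits is set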

In single precision, the rule is the same:

z = a * 2^b, 1<=a<2 or a==0

But this time, only 24 binary digits are provided for a, and so we have

    0.0 = 0.00000000000000000000000 * 2^-big
    0.5 = 1.00000000000000000000000 * 2^-1
    1.0 = 1.00000000000000000000000 * 2^0
sqrt(2) = 1.01101010000010011110011 * 2^0
    1.5 = 1.10000000000000000000000 * 2^0
    2.0 = 1.00000000000000000000000 * 2^1
    2.5 = 1.01000000000000000000000 * 2^1
    3.0 = 1.10000000000000000000000 * 2^1
    _pi = 1.10010010000111111011011 * 2^1

In single precision, there are 24-1=23 binary digits of precision to the right of the binary point, and 23 is not divisible by 4. If we try to convert to base-16, we end up with

sqrt(2) = 1.0110 1010 0000 1001 1110 011   * 2^0
          1.   6    a    0    9    e    ?  * 2^0

To fill in the last digit, we could recognize that we can pad on an extra 0 because we are to the right of the binary point. For example, 1.101 == 1.1010. If we padded on the extra 0, we have

sqrt(2) = 1.0110 1010 0000 1001 1110 0110  * 2^0
          1.   6    a    0    9    e    6  * 2^0

That is precisely the result %21x shows us:

. display %21x float(sqrt(2))
+1.6a09e60000000X+000

although we might wish that %21x would omit the 0s that aren’t really there, and instead display this as +1.6a09e6X+000.

The problem with this solution is that it can be misleading because the last digit looks like it contains four binary digits when in fact it contains only three. To show how easily you can be misled, look at _pi in double and float precisions:

. display %21x _pi _newline %21x float(_pi)
+1.921fb54442d18X+001
+1.921fb60000000X+001
        ^
        digit incorrectly rounded?

The computer rounded the last digit up from 5 to 6. The digits after the rounded-up digit in the full-precision result, however, are 0.4442d18, and are clearly less than 0.8 (1/2). Shouldn’t the rounded result be 1.921fb5X+001? The answer is that yes, 1.921fb5X+001 would be a better result if we had 6*4=24 binary digits to the right of the binary point. But we have only 23 digits; correctly rounding to 23 binary digits and then translating into base-16 results in 1.921fb6X+001. Because of the missing binary digit, the last base-16 digit can only take on the values 0, 2, 4, 6, 8, a, c, and e.

The computer performs the rounding in binary. Look at the relevant piece of this double-precision number in binary:

+1.921f   b    5    4    4    42d18X+001      number
       1011 0101 0100 0100 0100               expansion into binary
       1011 01?x xxxx xxxx xxxxxxxx           thinking about rounding
       1011 011x xxxx xxxx xxxxxxxx           performing rounding
+1.921f   b    6                   X+001      convert to base-16

In the second line, I have expanded into binary the part of the number surrounding the point where the rounding occurs. In the third line, I’ve put x’s under the part we will have to discard to round this double into a float. The x’d-out part — 10100… — is clearly greater than 1/2, so the last retained digit (where I put a question mark) must be rounded up. Thus, _pi in float precision rounds to 1.921fb6X+001, just as the computer said.

Float precision does not play much of a role in Stata despite the fact that most users store their data as floats. Regardless of how data are stored, Stata makes all calculations in double precision, and float provides more than enough precision for most data applications. The U.S. deficit in 2011 is projected to be $1.5 trillion. One hopes that a grand total of $26,624 — the error that would be introduced by storing this projected deficit in float precision — would not be a significant factor in any lawmaker’s decision concerning the issue. People in the U.S. are said to work about 40 hours per week, or roughly 0.238 of the hours in a week. I doubt that number is accurate to 0.4 milliseconds, the error that float would introduce in recording the fraction. A cancer survivor might live 350.1 days after a treatment, but we would introduce an error of roughly 1/2 second if we record the number as a float. One might question whether the instant of death could even conceptually be determined that accurately. The moon is said to be 384.401 thousand kilometers from the Earth. Record in 1,000s of kilometers in float, and the error is almost 1 meter. At its closest and farthest, the moon is 356,400 and 406,700 kilometers away. Most fundamental constants of the universe are known only to a few parts in a million, which is to say, to less than float precision, although we do know the speed of light in a vacuum to one decimal digit beyond float accuracy; it’s 299,792.458 kilometers per second. Round that to float and you’ll be off by 0.01 km/s.
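
If you want to check figures like these for yourself, Stata will do the arithmetic; a small sketch using the deficit example above:

. display %21x 1.5e+12 _newline %21x float(1.5e+12)
. display float(1.5e+12)-1.5e+12                // 26624, the rounding error mentioned above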

The largest integer that can be recorded without rounding in float precision is 16,777,215. The largest integer that can be recorded without rounding in double precision is 9,007,199,254,740,991.

People working with dollar-and-cent data in Stata usually find it best to use doubles both to avoid rounding issues and in case the total exceeds $167,772.15. Rounding issues of 0.01, 0.02, etc., are inherent when working with binary floating point, regardless of precision. To avoid all problems, these people should use doubles and record amounts in pennies. That will have no difficulty with sums up to $90,071,992,547,409.91, which is to say, about $90 trillion. That’s nine quadrillion pennies. In my childhood, I thought a quadrillion just meant a lot, but it has a formal definition.
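
A quick way to see both limits in action follows; this is only a sketch, and dollars in the last line is a made-up variable name:

. display 16777216==float(16777216)             // 1: still exact as a float
. display 16777217==float(16777217)             // 0: float rounds it to 16,777,216
. gen double cents = round(dollars*100)         // store dollar-and-cent data as pennies in a double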

All of which is a long way from where I started, but now you are an expert in understanding binary floating-point numbers the way a scientific programmer needs to understand them: z=a*2^b. You are nearly all the way to understanding the IEEE 754-2008 standard. That standard merely states how a and b are packed into 32 and 64 bits, and the entire point of %21x is to avoid those details because, packed together, the numbers are unreadable by humans.



How to read the %21x format

%21x is a Stata display format, just as are %f, %g, %9.2f, %td, and so on. You could put %21x on any variable in your dataset, but that is not its purpose. Rather, %21x is for use with Stata’s display command for those wanting to better understand the accuracy of the calculations they make. We use %21x frequently in developing Stata.

%21x produces output that looks like this:

. display %21x 1
+1.0000000000000X+000

. display %21x 2
+1.0000000000000X+001

. display %21x 10
+1.4000000000000X+003

. display %21x sqrt(2)
+1.6a09e667f3bcdX+000

All right, I admit that the result is pretty unreadable to the uninitiated. The purpose of %21x is to show floating-point numbers exactly as the computer stores them and thinks about them. In %21x’s defense, it is more readable than how the computer really records floating-point numbers, yet it loses none of the mathematical essence. Computers really record floating-point numbers like this:

      1 = 3ff0000000000000
      2 = 4000000000000000
     10 = 4024000000000000
sqrt(2) = 3ff6a09e667f3bcd

Or more correctly, they record floating-point numbers in binary, like this:

      1 = 0011111111110000000000000000000000000000000000000000000000000000
      2 = 0100000000000000000000000000000000000000000000000000000000000000
     10 = 0100000000100100000000000000000000000000000000000000000000000000
sqrt(2) = 0011111111110110101000001001111001100110011111110011101111001101

By comparison, %21x is a model of clarity.

The above numbers are 8-byte floating point, also known as double precision, encoded in the IEEE 754-2008 binary64 format; I have written the bytes in big-endian order, from most significant to least significant. Some computers store floating-point numbers in little-endian format (bytes ordered from least significant to most significant), and then the same numbers look like this:

      1 = 000000000000f03f
      2 = 0000000000000040
     10 = 0000000000002440
sqrt(2) = cd3b7f669ea0f63f


      1 = 0000000000000000000000000000000000000000000000001111000000111111
      2 = 0000000000000000000000000000000000000000000000000000000001000000
     10 = 0000000000000000000000000000000000000000000000000010010001000000
sqrt(2) = 1100110100111011011111110110011010011110101000001111011000111111

Regardless of that, %21x produces the same output:

. display %21x 1
+1.0000000000000X+000

. display %21x 2
+1.0000000000000X+001

. display %21x 10
+1.4000000000000X+003

. display %21x sqrt(2)
+1.6a09e667f3bcdX+000

Binary computers store floating-point numbers as a number pair, (a, b); the desired number z is encoded

z = a * 2^b

For example,

 1 = 1.00 * 2^0
 2 = 1.00 * 2^1
10 = 1.25 * 2^3

The number pairs are encoded in the bit patterns, such as 00111111…01, above.

I’ve written the components a and b in decimal, but for reasons that will become clear, we need to preserve the essential binaryness of the computer’s number. We could write the numbers in binary, but they will be more readable if we represent them in base-16:

          base-10           base-16
      floating point    floating point
  1 =   1.00 * 2^0        1.00 * 2^0
  2 =   1.00 * 2^1        1.00 * 2^1
 10 =   1.25 * 2^3        1.40 * 2^3

“1.40?”, you ask, looking at the last row, which indicates 1.40*2^3 for decimal 10.

The period in 1.40 is not a decimal point; it is a hexadecimal point. The first digit after the hexadecimal point is the number for 1/16ths, the next is for 1/(16^2)=1/256ths, and so on. Thus, 1.40 hexadecimal equals 1 + 4*(1/16) + 0*(1/256) = 1.25 in decimal.
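
You can let Stata confirm that arithmetic; a small sketch:

. display 1+4/16+0/256          // 1.25
. display %21x 1.25             // +1.4000000000000X+000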

And that is how you read %21x values +1.0000000000000X+000, +1.0000000000000X+001, and +1.4000000000000X+003. To wit,

          base-16
      floating point                %21x
  1 =   1.00 * 2^0   =   +1.0000000000000X+000
  2 =   1.00 * 2^1   =   +1.0000000000000X+001
 10 =   1.40 * 2^3   =   +1.4000000000000X+003

The mantissa is shown to the left of the X, and, to the right of the X, the exponent for the 2. %21x is nothing more than a binary variation of the %e format with which we are all familiar, for example, 12 = 1.20000e+01 = 1.2*10^1. It’s such an obvious generalization that one would guess it has existed for a long time, so excuse me when I mention that we invented it at StataCorp. If I weren’t so humble, I would emphasize that this human-readable way of representing binary floating-point numbers preserves nearly every aspect of the IEEE floating-point number. Being humble, I will merely observe that 1.40x+003 is more readable than 4024000000000000.
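
To see the analogy side by side, here is a sketch using the 12 = 1.2*10^1 example from the paragraph above:

. display %12.5e 12             // 1.20000e+01, that is, 1.2 * 10^1
. display %21x  12              // +1.8000000000000X+003, that is, 1.8 (hex) * 2^3 = 1.5 * 8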

Now that you know how to read %21x, let me show you how you might use it. %21x is particularly useful for examining precision issues.

For instance, the cube root of 8 is 2; 2*2*2 = 8. And yet, in Stata, 8^(1/3) is not equal to 2:

. display 8^(1/3)
2

. assert 8^(1/3) == 2
assertion is false
r(9) ; 

. display %20.0g 8^(1/3)

I blogged about that previously; see How Stata calculates powers. The error is not much:

. display 8^(1/3)-2
-2.220e-16

In %21x format, however, we can see that the error is only one bit:

. display %21x 8^(1/3)
+1.fffffffffffffX+000

. display %21x 2
+1.0000000000000X+001

I wish the answer for 8^(1/3) had been +1.0000000000001X+000, because then the one-bit error would have been obvious to you. Instead, rather than being a bit too large, the actual answer is a bit too small — one bit too small to be exact — so we end up with +1.fffffffffffffX+000.

One bit off means being off by 2^(-52), which is 2.220e-16, the number we saw when we displayed 8^(1/3)-2 in base 10. So %21x did not reveal anything we could not have figured out in other ways. The nature of the error, however, is more obvious in %21x format than it is in a base-10 format.
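
If you want to see the one-bit difference directly in %21x, you can display it; a small sketch:

. display %21x 2-8^(1/3)        // +1.0000000000000X-034, that is, 2^-52
. display 2^-52                 // 2.220e-16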

On Statalist, the point often comes up that 0.1, 0.2, …, 0.4, 0.6, …, 0.9, 0.11, 0.12, … have no exact representation in the binary base that computers use. That becomes obvious with %21x format:

. display %21x 0.1
+1.999999999999aX-004

. display %21x 0.2
+1.999999999999aX-003

. ...

0.5 does have an exact representation, of course, as do all the negative powers of 2:

. display %21x 0.5           //  1/2

. display %21x 0.25          //  1/4

. display %21x 0.125         //  1/8

. display %21x 0.0625        // 1/16

. ...

Integers have exact representations, too:

. display %21x 1

. display %21x 2

. display %21x 3

. ...

. display %21x 10

. ...

. display %21x 10786204

. ...

%21x is a great way of becoming familiar with base-16 (equivalently, base-2), which is worth doing if you program base-16 (equivalently, base-2) computers.

Let me show you something useful that can be done with %21x.

A programmer at StataCorp has implemented a new statistical command. In four examples, the program produces the following results:


Without any additional information, I can tell you that the program has a bug, and that StataCorp will not be publishing the code until the bug is fixed!

How can I know that this program has a bug without even knowing what is being calculated? Let me show you the above results in %21x format:


Do you see what I see? It’s all those zeros. In randomly drawn problems, it would be unlikely that there would be all zeros at the end of each result. What is likely is that the results were somehow rounded, and indeed they were. The rounding in this case was due to using float (4-byte) precision inadvertently. The programmer forgot to include a double in the ado-file.

And that’s one way %21x is used.

I am continually harping on programmers at StataCorp that if they are going to program binary computers, they need to think in binary. I go ballistic when I see a comparison that’s coded as “if (abs(x-y)<1e-8) …” in an attempt to deal with numerical inaccuracy. What kind of number is 1e-8? Well, it’s this kind of number:

. display %21x 1e-8
+1.5798ee2308c3aX-01b

Why put the computer to all that work, and exactly how many digits are you, the programmer, trying to ignore? Rather than 1e-8, why not use the “nice” numbers 7.451e-09 or 3.725e-09, which is to say, 1.0x-1b or 1.0x-1c? If you do that, then I can see exactly how many digits you are ignoring. If you code 1.0x-1b, I can see you are ignoring 1b=27 binary digits. If you code 1.0x-1c, I can see you are ignoring 1c=28 binary digits. Now, how many digits do you need to ignore? How imprecise do you really think your calculation is? By the way, Stata understands numbers such as 1.0x-1b and 1.0x-1c as input, so you can type the precise number you want.
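
For example, a short sketch of the point about the “nice” numbers:

. display 1.0x-1b               // 7.451e-09
. display %21x 1.0x-1b          // +1.0000000000000X-01b: ignore 1b = 27 binary digits
. display %21x 1.0x-1c          // +1.0000000000000X-01c: ignore 1c = 28 binary digits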

As another example of thinking in binary, a StataCorp programmer once described a calculation he was making. At one point, the programmer needed to normalize a number in a particular way, and so calculated x/10^trunc(log10(x)), and held onto the 10^trunc(log10(x)) for denormalization later. Dividing by 10, 100, etc., may be easy for us humans, but it’s not easy in binary, and it can result in very small amounts of dreaded round-off error. And why even bother to calculate the log, which is an expensive operation? “Remember,” I said, “how floating-point numbers are recorded on a computer: z = a*2^b, where 1 <= |a| < 2 or a==0. Writing in C, it’s easy to extract the components. In fact, isn’t a number normalized to be between 1 and 2 even better for your purposes?” Yes, it turned out it was.

Even I sometimes forget to think in binary. Just last week I was working on a problem and Alan Riley suggested a solution. I thought a while. “Very clever,” I said. “Recasting the problem in powers of 2 will get rid of that divide that caused half the problem. Even so, there’s still the pesky subtraction.” Alan looked at me, imitating a look I so often give others. “In binary,” Alan patiently explained to me, “the difference you need is the last 19 bits of the original number. Just mask out the other digits.”

At this point, many of you may want to stop reading and go off and play with %21x. If you play with %21x long enough, you’ll eventually examine the relationship between numbers recorded as Stata floats and as Stata doubles, and you may discover something you think to be an error. I will discuss that next week in my next blog posting.

How Stata calculates powers

Excuse me, but I’m going to toot Stata’s horn.

I got an email from Nicholas Cox (an Editor of the Stata Journal) yesterday. He said he was writing something for the Stata Journal and wanted the details on how we calculated a^b. He was focusing on examples such as (-8)^(1/3), where Stata produces a missing value rather than -2, and he wanted to know if our calculation of that was exp((1/3)*ln(-8)). He didn’t say where he was going, but I answered his question.

I have rather a lot to say about this.

Nick’s supposition was correct in this particular case: for most values of a and b, Stata calculates a^b as exp(b*ln(a)). In the case of a=-8 and b=1/3, ln(-8)==. (missing), and thus (-8)^(1/3)==. as well.

You might be tempted to say Stata (or other software) should just get this right and return -2. The problem is not only that 1/3 has no exact decimal representation, but it has no exact binary representation, either. Watching for 1/3 and doing something special is problematic.

One solution to this problem, if a solution is necessary, would be to introduce a new function, say, invpower(a,b), that returns a^(1/b). Thus, cube roots would be invpower(a,3). Integers have exact numerical representations even in digital computers, so computers can watch for integers and take special actions.

Whether one should watch for such special values is an interesting question. Stata does watch for special values of b in calculating a^b. In particular, it watches for b being an integer, 1<=b<=64 or -64<=b<=-1. When Stata finds such integers, it calculates a^b by repeatedly multiplying a or 1/a. Thus Stata calculates ..., a^(-2), a^(-1), ..., a^2, a^3, ... more accurately than it calculates ..., a^(-2+epsilon), a^(-1+epsilon), ..., a^(2+epsilon), a^(3+epsilon), ....

So what could be wrong with that? Let's imagine we wish to calculate F(a,b) and that we have two ways to calculate it. The first way is via an approximation formula A(a,b) = F(a,b) + e, where e is error. Do not assume that the error has the good properties it would have in a statistical model. Typically, e/F() averages roughly 0 (I write sloppily) because otherwise we would adjust A() to achieve that. In addition, the error tends to be serially correlated because approximation formulas are usually continuous. In this case, serial correlation is actually a desirable property! Okay, that's one way we have for calculating F(). The second way is by an exact formula E(a,b) = F(a,b), but E() is valid only for certain values of b.

Great, you say, we'll use E() for the special values and use A() for the rest. We'll call that combined function G(). Now consider someone calculating a numerical derivative from results of G(). Say that they wish to calculate (F(a,b+h)-F(a,b))/h for a suitably small value of h, which they do by calculating (G(a,b+h)-G(a,b))/h. Then they obtain

  1. (A(a,b+h)-E(a,b))/h for some values, and
  2. (A(a,b+h)-A(a,b))/h for others.

There are other possibilities, for instance, they might obtain (E(a,b+h)-A(a,b))/h, but the two above will be sufficient for my purposes.

Note that in calculation (2), the serial correlation in A()’s error causes much of it to cancel in the difference, so the error in (2) is actually smaller than the error in (1)!

This can be an issue.

The error in the case of Stata’s a^b around the integers -64, -63, …, -1, 1, 2, …, 64 is small enough that we just ignored it. For instance, in 2^3, A()-E() = -1.776e-15. Had the error been large enough, we would have combined A() and E() in a different way to produce a more accurate approximation formula. To wit, you know that at certain values of b, E() is exact, so one develops an adjustment for A() that uses that information to adjust not just the specific value, but values around the specific value, too.

In this example, A(a,b) = exp(b*ln(a)), a>0, and you may be asking yourself in what sense exp(b*ln(a)) is an approximation. The answer is that exp() and ln() are approximations to the true functions, and in fact, so is * an approximation for the underlying idea of multiplication.

That Stata calculates a^b by repeated multiplication for -64<=b<=-1 and 1<=b<=64 is not something we have ever mentioned. People do not realize the extreme care we take with what might seem minor issues. It is because we do this that things work as you expect. In this case, a^3 is exactly equal to a*a*a. This is ironic because when numerical issues arise that do not have a similarly easy solution, users are disappointed: “Why don’t you fix that?” In the old Fortran days in which I grew up, one would never expect a^3 to equal a*a*a. One’s nose was constantly being rubbed into numerical issues, which reminded us not to overlook them. By the way, if issues like this interest you, consider applying to StataCorp for employment. We can fill your days with discussions like this.
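
You can see both code paths, and the claims above, from the keyboard; a small sketch using the numbers already mentioned:

. display %21x 2^3              // +1.0000000000000X+003: exactly 8, via repeated multiplication
. display exp(3*ln(2))-2^3      // -1.776e-15, the A()-E() difference mentioned above
. display 1.1^3==1.1*1.1*1.1    // 1: a^3 is exactly a*a*a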


Using dates and times from other software

Most software stores dates and times numerically, as durations from some sentinel date, but they differ on the sentinel date and on the units in which the duration is stored. Stata stores dates as the number of days since 01jan1960, and datetimes as the number of milliseconds since 01jan1960 00:00:00.000. January 3, 2011 is stored as 18,630, and 2pm on January 3 is stored as 1,609,682,400,000. Other packages use different choices for bases and units.
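
You can verify both of those values directly in Stata; a quick sketch:

    . display td(03jan2011)                    // 18630
    . display %20.0fc tc(03jan2011 14:00)      // 1,609,682,400,000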

It sometimes happens that you need to process data in Stata that were imported from other software, and you end up with a numeric variable recording a date or datetime in the other software’s encoding. It is usually possible to adjust the numeric date or datetime values to the sentinel date and units that Stata uses. Below are conversion rules for SAS, SPSS, R, Excel, and Open Office.


SAS

SAS stores dates as the number of days since 01jan1960, the same as Stata:

    . gen statadate = sasdate
    . format statadate %td

SAS stores datetimes as the number of seconds since 01jan1960 00:00:00, assuming 86,400 seconds/day. Thus, all that’s necessary is to multiply SAS datetimes by 1,000 and attach a %tc format to the result,

    . gen double statatime = (sastime*1000)
    . format statatime %tc

It is important that variables containing SAS datetimes, such as sastime above, be imported as doubles into Stata.


SPSS

SPSS stores both dates and datetimes as the number of seconds since 14oct1582 00:00:00, assuming 86,400 seconds/day. To convert SPSS datetimes to Stata datetimes, type

    . gen double statatime = (spsstime*1000) + tc(14oct1582 00:00)
    . format statatime %tc

Multiplying by 1,000 converts from seconds to milliseconds. Adding tc(14oct1582 00:00) accounts for the differing bases.

Function tc() returns the specified datetime as a Stata datetime, which is to say, the number of milliseconds between the specified datetime and 01jan1960 00:00:00.000. We need to add the difference between SPSS’s base and Stata’s base, which is tc(14oct1582 00:00) – tc(01jan1960 00:00), but tc(01jan1960) is definitionally 0, so that just leaves tc(14oct1582 00:00). tc(14oct1582), for your information, is -11,903,760,000,000.

SPSS dates are the same as SPSS datetimes, so to convert an SPSS date to a Stata date, we could type,

    . gen double statatime = (spssdate*1000) + tc(14oct1582 00:00)
    . gen statadate        = dofc(statatime)
    . format statadate %td
    . drop statatime

Function dofc() converts a Stata datetime to a Stata date. We can combine the above into,

    . gen statadate = dofc((spssdate*1000) + tc(14oct1582 00:00))
    . format statadate %td


R

R stores dates as days since 01jan1970. To convert to a Stata date,

    . gen statadate = rdate + td(01jan1970)
    . format statadate %td

Stata uses 01jan1960 as the base, R uses 01jan1970, so all you have to do is add the number of days between 01jan1960 and 01jan1970.

R stores datetimes as the number of UTC adjusted seconds since 01jan1970 00:00:00. UTC stands for Coordinated Universal Time. Rather than assuming 86,400 seconds/day, some UTC days have 86,401 seconds. Leap seconds are sometimes inserted into UTC days to keep the clock coordinated with the Earth’s rotation. Stata’s datetime %tC format is UTC time, which is to say, it accounts for these leap seconds. Thus, to convert R datetimes to Stata, you type

   . gen double statatime = rtime*1000 + tC(01jan1970 00:00)
   . format statatime %tC

Multiplying by 1,000 converts from seconds to milliseconds, and adding tC(01jan1970 00:00) accounts for the differing bases. Note the use of Stata’s tC() function rather than tc() to obtain the number of milliseconds between the differing bases. tc() returns the number of milliseconds since 01jan1960 00:00:00 assuming 86,400 seconds/day. tC() returns the number of milliseconds adjusted for leap seconds. In this case, it would not make a difference if we mistakenly typed tc() rather than tC() because no leap seconds were inserted between 1960 and 1970. Had the base year been 1980, however, the use of tC() would have been important. Nine extra seconds were inserted between 01jan1970 and 01jan1980!
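
You can see the size of the adjustment from within Stata; a quick sketch (assuming Stata’s leap-second table, the two differences work out to 9 and 0 seconds):

   . display (tC(01jan1980 00:00)-tc(01jan1980 00:00))/1000      // 9 leap seconds by 1980
   . display (tC(01jan1970 00:00)-tc(01jan1970 00:00))/1000      // 0 before 1970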

In many cases you may prefer using a time variable that ignores leap seconds. In that case, you can type

    . gen double statatime = cofC(rtime*1000 + tC(01jan1970 00:00))
    . format statatime %tc


Excel has used different date systems for different operating systems. Excel for Windows used the “1900 Date System”. Excel for Mac used the “1904 Date System”. More recently, Microsoft has been standardizing on the 1900 Date System.

If you have an Excel for Windows workbook, it is likely to be using 1900.

If you have an Excel for Mac workbook, it is likely to be using 1904, unless it came from a Windows workbook originally.

Anyway, both Excels can use either encoding. See for more information and for instructions on converting your workbook between date systems.

In any case, you are unlikely to encounter Excel numerically coded dates. If you cut-and-paste the spreadsheet into Stata’s Data editor, dates and datetimes paste as strings in human-readable form. If you use a conversion package, most know to convert the date for you.

Excel, 1900 date system

For dates on or after 01mar1900, Excel 1900 Date System stores dates as days since 30dec1899. To convert to a Stata date,

    . gen statadate = exceldate + td(30dec1899)
    . format statadate %td

Excel can store dates between 01jan1900 and 28feb1900, too, but the formula above will not handle those two months. See for more information.

For datetimes on or after 01mar1900 00:00:00, Excel 1900 Date System stores datetimes as days plus fraction of day since 30dec1899 00:00:00. To convert with a one-second resolution to a Stata datetime,

    . gen double statatime = round((exceltime+td(30dec1899))*86400)*1000
    . format statatime %tc

Excel, 1904 date system

For dates on or after 01jan1904, Excel 1904 Date System stores dates as days since 01jan1904. To convert to a Stata date,

    . gen statadate = exceldate + td(01jan1904)
    . format statadate %td

For datetimes on or after 01jan1904 00:00:00, Excel 1904 Date System stores datetimes as days plus fraction of day since 01jan1904 00:00:00. To convert with a one-second resolution to a Stata datetime,

    . gen double statatime = round((exceltime+td(01jan1904))*86400)*1000
    . format statatime %tc

Open Office

Open Office uses the Excel, 1900 Date System.

Why Stata has two datetime encodings

We have just seen that most packages assume 86,400 seconds/day, but that one instead uses UTC time, in which days have 86,400 or 86,401 seconds, depending. Stata provides both datetime encodings, called %tc and %tC. That turned out to be convenient in translating times from other packages. Stata will even let you switch from one encoding to the other by using the cofC() and Cofc() functions, so now you should be asking yourself: which one should I use?

Stata’s %tc format assumes that there are 24*60*60*1,000 ms per day — 86,400 seconds per day — just as an atomic clock does. Atomic clocks count oscillations between the nucleus and electrons of an atom and thus provide a measurement of the real passage of time.

Time of day measurements have historically been based on astronomical observation, which is a fancy way of saying, based on looking at the sun. The sun should be at its highest point at noon, right? So however you kept track of time — falling grains of sand or a wound up spring — you periodically reset your clock and then went about your business. In olden times it was understood that the 60 seconds per minute, 60 minutes per hour, 24 hours per day, were theoretical goals that no mechanical device could reproduce accurately. These days, we have more accurate definitions for measuring time. A second is 9,192,631,770 periods of the radiation corresponding to the transition between two levels of the ground state of caesium 133. Obviously we have better equipment than the ancients, so problem solved, right? Wrong. There are two problems. The formal definition of a second is just a little too short to match the length of a day, and the Earth’s rotation is slowing down.

As a result, since 1972 leap seconds have been added to atomic clocks once or twice a year to keep time measurements in synchronization with the Earth’s rotation. Unlike leap years, however, there is no formula for predicting when leap seconds will occur. The Earth may be on average slowing down, but there is a large random component to that. As a result, leap seconds are determined by committee and announced 6 months before they are inserted. Leap seconds are added, if necessary, at the end of the day on June 30 or December 31. The inserted times are designated as 23:59:60.

Unadjusted atomic clocks may accurately mark the passage of real time, but you need to understand that leap seconds are every bit as real as every other second of the year. Once a leap second is inserted, it ticks just like any other second and real things can happen during that tick.

You may have heard of terms such as GMT and UTC.

GMT is the old Greenwich Mean Time and is based on astronomical observation. GMT has been supplanted by UTC.

UTC stands for coordinated universal time and is measured by atomic clocks, occasionally corrected for leap seconds. UTC is derived from two other times, UT1 and TAI. UT1 is the mean solar time, with which UTC is kept in sync by the occasional addition of a leap second. TAI is the atomically measured pure time. TAI was set to GMT plus 10 seconds in 1958 and has been running unadjusted since then. Update 07 Jan 2010: TAI is a statistical combination of various atomic chronometers and even it has not ticked uniformly over its history; see and especially (Thanks to Steve Allen of the UCO/Lick Observatory for correcting my understanding and for the reference.)

UNK is StataCorp’s term for the time standard most people use. UNK stands for unknowing. UNK is based on a recent time observation, probably UTC, and then just assuming that there are 86,400 seconds per day after that.

The UNK standard is adequate for many purposes, and in such cases, you will want to use %tc rather than the leap second-adjusted %tC encoding. If you are using computer-timestamped data, however, you need to find out whether the timestamping system accounted for leap-second adjustments. Problems can arise even if you do not care about losing or gaining a second here and there.

For instance, you may import timestamp values from other systems recorded in the number of milliseconds that have passed since some agreed-upon date. If you choose the wrong encoding scheme, that is, if you choose %tc when you should have chosen %tC or vice versa, more recent times will be off by 24 seconds.

To avoid such problems, you may decide to import and export data by using Human Readable Forms (HRF) such as “Fri Aug 18 14:05:36 CDT 2006”. This method has advantages, but for %tC (UTC) encoding, times such as 23:59:60 are possible. Some software will refuse to decode such times.

Stata refuses to decode 23:59:60 in the %tc encoding (function clock) and accepts it with %tC (function Clock()). When %tC function Clock() sees a time with a 60th second, Clock() verifies that the time corresponds to an official leap second. Thus, when translating from printable forms, try assuming %tc and check the result for missing values. If there are none, you can assume your use of %tc is valid. If there are missing values and they are due to leap seconds and not some other error, you must use %tC function Clock() to translate from HRF. After that, if you still want to work in %tc units, use function cofC() to translate %tC values into %tc.
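
Here is a sketch of that workflow; timestr and the DMYhms mask are placeholders for your own string variable and its layout:

    . gen double tctime = clock(timestr, "DMYhms")     // try %tc first
    . count if missing(tctime) & !missing(timestr)     // failures may be 23:59:60 leap seconds
    . gen double tCtime = Clock(timestr, "DMYhms")     // leap-second-aware decoding
    . format tCtime %tC
    . gen double tctime2 = cofC(tCtime)                // back to %tc units if you prefer
    . format tctime2 %tc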

If precision matters, the best way to process %tC data is simply to treat them that way. The inconvenience is that you cannot assume that there are 86,400 seconds per day. To obtain the duration between dates, you must subtract the two time values involved. The other difficulty has to do with dealing with dates in the future. Under the %tC (UTC) encoding, there is no set value for any date more than 6 months in the future.


Stata provides two datetime encodings:

  1. %tC, also known as UTC, which accounts for leap seconds, and

  2. %tc, which ignores them (it assumes 86,400 seconds/day).

Systems vary in how they treat time variables. My advice is,

  • If you obtain data from a system that accounts for leap seconds, import using Stata’s %tC.
    1. If you later need to export data to a system that does not account for leap seconds, use Stata’s cofC() function to translate time values before exporting.

    2. If you intend to tsset the time variable and the analysis will be at the second level or finer, just tsset the %tC variable, specifying the appropriate delta() if necessary, for example, delta(1000) for seconds.

    3. If you intend to tsset the time variable and the analysis will be at coarser than the second level (minute, hour, etc.), create a %tc variable from the %tC variable (generate double tctime = cofC(tCtime)) and tsset that, specifying the appropriate delta() if necessary. You must do that because, in a %tC variable, there are not necessarily 60 seconds in a minute; some minutes have 61 seconds.
  • If you obtain data from a system that ignores leap seconds, use Stata’s %tc.
    1. If you later need to export data to a system that does account for leap seconds, use Stata’s Cofc() function to translate time values.

    2. If you intend to tsset the time variable, just tsset it, specifying the appropriate delta().

Some users prefer always to use Stata’s %tc because those values are a little easier to work with. You can do that if

  • you do not mind having up to 1 second of error and

  • you do not import or export numerical values (clock ticks) from other systems that are using leap seconds, because then there could be nearly 30 seconds of accumulated error.

There are two things to remember if you use %tC variables:

  1. The number of seconds between two dates is a function of when the dates occurred. Five days from one date is not simply a matter of adding 5*24*60*60*1,000 ms. You might need to add another 1,000 ms. Three hundred and sixty-five days from now might require adding 1,000 or 2,000 ms. The longer the span, the more you might have to add. The best way to add durations to %tC variables is to extract the components, add to them, and then reconstruct from the numerical components.

  2. You cannot accurately predict datetimes more than six months into the future. We do not know what the %tC value of 25dec2026 00:00:00 will be because every year along the way, the International Earth Rotation and Reference Systems Service (IERS) will twice announce whether a leap second will be inserted.

You can help alleviate these inconveniences. Face west and throw rocks. The benefit will be only transitory if the rocks land back on Earth, so you need to throw them really hard. I know what you’re thinking, but this does not need to be a coordinated effort.


How to successfully ask a question on Statalist

As everyone knows, I am a big proponent of Statalist, and not just for selfish reasons, although those reasons play a role. Nearly every member of the technical staff at StataCorp — me included — is a member of Statalist. Even when we don’t participate in a particular thread, we do pay attention. The discussions on Statalist play an important role in Stata’s development.

Statalist is a discussion group, not just a question-and-answer forum. Nonetheless, new members often use it to obtain answers to questions, and that works because those questions sometimes become grist for subsequent discussions. In those cases, the questioners not only get answers, they get much more.

One of the best features of Statalist is that, no matter how poorly you ask a question, you are unlikely to be flamed. Not only are the members of Statalist nice — just as are the members of most lists — they act just as nice on the list as they really are. You are unlikely to be flamed if you ask a question poorly, but you are also unlikely to get an answer.

Here is my recipe to increase the chances of you getting a helpful response. You should also read the Statalist FAQ before writing your question.

Subject line

Make the subject line of your email meaningful. Some good subject lines are:

Survival analysis

Confusion about -stcox-

Unexpected error from -stcox-

-stcox- output

The first two sentences

The first two sentences are the most important, and they are the easiest to write.

In the first sentence, state your problem in Stata terms, but do not go into details. Here are some good first sentences:

I’m having a problem with -stcox-.

I’m getting an unexpected error message from -stcox-.

I’m using -stcox- and don’t know how to interpret the result.

I’m using -stcox- and getting a result I know is wrong, so I know I’m misunderstanding something.

I want to use -stcox- but don’t know how to start.

I think I want to use -stcox-, but I’m unsure.

I want to use -stcox- but my data is complicated and I’m unsure how to proceed.

I have a complicated dataset that I somehow need to transform into a form suitable for use with -stcox-.

Stata crashed!

I’m having a problem that may be more of a statistics issue than a Stata issue.

The purpose of the first sentence is to catch the attention of members who have an interest in your topic and let the others, who were never going to answer you anyway, move on.

The second sentence is even easier to write:

I am using Stata 11.1 for Windows.

I am using Stata 10 for Mac.

Even if you are certain that it’s unimportant which version of Stata you are using, state it anyway.

Write two sentences and you are done with the first paragraph.

The second paragraph

Now write more about your problem. Try not to be overly wordy, but it’s better to be wordy than so curt as to be unclear. However you write this paragraph, be explicit. If you’re having a problem making Stata work, tell your readers exactly what you typed and exactly how Stata responded. For example,

I typed -stcox weight- and Stata responded “data not st”, r(119).

I typed -stcox weight sex- and Stata showed the usual output, except the standard error on weight was reported as dot.

The form of the second paragraph — which may extend into the third, fourth, … — depends on what you are asking. Describe the problem concisely but completely. Sacrifice conciseness for completeness if you must or you think it will help. To the extent possible, simplify your problem by getting rid of extraneous details. For instance,

I have 100,000 observations and 1,000 variables on firms, but 4
observations and 3 variables will be enough to show the problem.
My data looks like this

        firm_id     date      x
          10043       17     12
          10043       18      5
          13944       17     10
          27394       16      1

I need data that looks like this:

        date    no_of_firms   avg_x
          16              1       1
          17              2      11
          18              1       5

That is, for each date, I want the number of firms and the
average value of x.

Here’s another example for the second and subsequent paragraphs:

The substantive problem is this:  Patients enter and leave the 
hospital, sometimes more than once over the period.  I think  
in this case it would be appropriate to combine the 
separate stays so that a patient who was in for 2 days and 
later for 4 days could be treated as being simply in for 6 days,  
except I also record how many separate stays there were, too.

I'm evaluating cost, so for my purposes, I think treating 
cost as proportional to days in hospital, whatever their
distribution, will be adequate.  I'm looking at total days as a
function of number of stays.  The idea is that letting patients out
too early results in an increase in total days, and I want to
measure this.

I realize that more stays and days might also arise simply because
the patient was sicker.  Some patients die, and that obviously 
truncates stay, so I've omitted them from data.  I have disease
codes, but nothing about health status within code.  

Is there a way to incorporate this added information to improve  
the estimates?  I've got lots of data, so I was thinking of 
using death rate within disease code to somehow rank the codes 
as to seriousness of illness, and then using "seriousness" 
as an explanatory variable.  I guess my question is whether  
anyone knows a way I might do this. 

Or is there some way I could estimate the model separately within
disease code, somehow constraining the coefficient on number of 
stays to be the same?  I saw something in the manual about 
stratified estimates, but I'm unsure if this is the same thing.

You’re asking someone to invest their time, so invest yours

Before you hit the send key, read what you have written, and improve it. You are asking for someone to invest their time helping you. Help them by making your problem easy to understand.

The easier your problem is to understand, the more likely you are to get a response. Said differently, if you write in a disorganized way so that potential responders must work just to understand you, much less provide you with an answer, you are unlikely to get a response.

Sparkling prose is not required. Proper grammar is not even required, so nonnative English speakers can relax. My advice is that, unless you are often praised for how clearly and entertainingly you write, write short sentences. Organization is more important than the style of the individual sentences.

Avoid or explain jargon. Do not assume that the person who responds to your question will be in the same field as you. When dealing with a substantive problem, avoid jargon except for statistical jargon that is common across fields, or explain it. Potential responders like it when you teach them something new, and that makes them more likely to respond.


Write as if you are writing to a colleague whom you know well. Assume interest in your problem. The same thing said negatively: Do not write to list members as you might write to your research assistant, employee, servant, slave, or family member. Nothing is more likely to get you ignored than to write, “I’m busy and really I don’t have time to filter through all the Statalist postings, so respond to me directly, and soon. I need an answer by tomorrow.”

The positive approach, however, works. Just as when writing to a colleague, in general you do not need to apologize, beg, or play on sympathies. Sometimes when I write to colleagues, I do feel the need to explain that I know what I’m asking is silly. “I should know this,” I’ll write, or, “I can’t remember, but …”, or, “I know I should understand, but I don’t”. You can do that on Statalist, but it’s not required. Usually when I write to colleagues I know well, I just jump right in. The same rule works with Statalist.

What’s appropriate

Questions appropriate for Stata’s Technical Services are not appropriate for Statalist, and vice versa. Some questions aren’t appropriate for either one, but those are rare. If you ask an inappropriate question, and ask it well, someone will usually direct you to a better source.

Who can ask, and how

You must join Statalist to send questions. Yes, you can join, ask a question, get your answer, and quit, but if you do, don’t mention this at the outset. List members know this happens, but if you mention it when you ask the question, you’ll sound superior and condescending. Also, stick around for a few days after you get your response, because sometimes your question will generate discussion. If it does, you should participate. You should want to stick around and participate because if there is subsequent discussion, the final result is usually better than the initial reply.

I’ve previously written on how to join (and quit) Statalist. See


Mata, the missing manual, available at SSC

I gave a 1.5 hour talk on Mata at the 2010 UK Stata Users Group Meeting in September. The slides are available in pdf form here. The talk was well received, which of course pleased me. If you’re interested in Mata, I predict you will find the slides useful even if you didn’t attend the meeting.

The problem with the Mata Reference Manual is that, even though it tells you all the details, it never tells you how to put it all together, nor does it motivate you. We developers at StataCorp love the manual for just that reason: It gets right to the details that are so easy to forget.

Anyway, in outline, the talk and slides work like this

  1. They start with the mechanics of including Mata code. The talk begins gently, picking up about where Stata’s NetCourse 151 leaves off, and ends up discussing big — really big — systems.
  2. Next is a section on appropriate and inappropriate use of Mata.
  3. That’s followed by Mata concepts, from basic to advanced.
  4. And the talk includes a section on debugging!

I was nervous about how the talk would be received before I gave it. It’s been on my to-do list to write a book on Mata, but I never really found a way to approach the subject. The problem is that it’s all so obvious to me that I tend to launch immediately into tedious details. I wrote drafts of a few chapters more than once, and even I didn’t want to reread them.

I don’t know why this overview approach didn’t occur to me earlier. My excuse is that it’s a strange (I claim novel) combination of basic and advanced material, but it seems to work. I titled the talk “missing manual” with the implied promise that I would write that book if the talk was well received. It was. Nowadays, I’m not promising when. Real Soon Now.

The materials for all the talks, not just mine, are available at the SSC(*) and on For the UK 2010, go to or For other User Group Meetings, it’s easiest to start at the Stata page Meeting Proceedings.

If you have questions on the material, the appropriate place to post them is Statalist. I’m a member and am likely to reply, and that way, others who might be interested get to see the exchange, too. Please use “Mata missing manual” as the subject so that it will be easy for nonmembers to search the Statalist Archives and find the thread.

Finally, my “Mata, the missing manual” talk has no connection with the fine Missing-Manual series, “the book that should have been in the box”, created by Pogue Press and O’Reilly Media, whose website is

* The SSC is the Statistical Software Components archive, often called the Boston College Archive, provided by The SSC has become the premier Stata download site for user-written software on the Internet and also archives proceedings of Stata Users Group meetings and conferences.



I just want to take a moment to plug Statalist. I’m a member and I hope to convince you to join Statalist, too, but even if I don’t succeed, you need to know about the web-based Statalist Archives because they’re a great resource for finding answers to questions about Stata, and you don’t have to join Statalist to access them.

Statalist’s Archives are found at, or you can click on “Statalist archives” on the right of this blog page, under Links.

Once at the Archives page, you can click on a year and month to get an idea of the flavor of Statalist. More importantly, you can search the archives. The search is Powered by Google and works well for highly specific, directed inquiries. For generic searches such as random numbers or survival analysis, however, I prefer to go to Advanced Search and ask that the results be sorted by date instead of relevance. It’s usually the most recent postings that are the most interesting, and by-date results are listed in just that order.

Anyway, the next time you are puzzling over something in Stata, I suggest you try searching the Statalist Archives.


More on welcome

When Stata first started back in 1985, communicating with users–well, back then they were potential users because we didn’t have any users yet–was nearly impossible.

From the beginning, we were very modern. Back in 1985, there were competing packages, but no one (not even me) expected personal computers to replace the mainframe. Back then, about the best that could be said about the available statistical packages is that they worked (sometimes) for some problems. What made Stata different was our belief and attitude that personal computers could actually be better than the mainframe for some problems. That in itself was a radical idea! In the mainstream, mainframe computer world, there was a popular saying: Little computers for little minds.

And we’ve stayed modern since then. Stata was (in 1999) the first statistical package to have online updating and an automated, modern, Internet way to handle user-written code. Modern Statas not only have that, but can use datasets directly off the web. But we have fallen behind! It’s 2010, and StataCorp doesn’t have a corporate blog!

Well, we do now.

Well, that may not be the most exciting announcement we’ve ever made. But our blog will be authored by the same people who develop Stata, support Stata, and yes, sell Stata. It will be useful, and it might be more entertaining than you suspect. If it is, that will be because of the people writing it.
