
How to read the %21x format

%21x is a Stata display format, just as are %f, %g, %9.2f, %td, and so on. You could put %21x on any variable in your dataset, but that is not its purpose. Rather, %21x is for use with Stata’s display command for those wanting to better understand the accuracy of the calculations they make. We use %21x frequently in developing Stata.

%21x produces output that looks like this:

. display %21x 1
+1.0000000000000X+000

. display %21x 2
+1.0000000000000X+001

. display %21x 10
+1.4000000000000X+003

. display %21x sqrt(2)
+1.6a09e667f3bcdX+000

All right, I admit that the result is pretty unreadable to the uninitiated. The purpose of %21x is to show floating-point numbers exactly as the computer stores them and thinks about them. In %21x’s defense, it is more readable than how the computer really records floating-point numbers, yet it loses none of the mathematical essence. Computers really record floating-point numbers like this:

      1 = 3ff0000000000000
      2 = 4000000000000000
     10 = 4024000000000000
sqrt(2) = 3ff6a09e667f3bcd

Or more correctly, they record floating-point numbers in binary, like this:

      1 = 0011111111110000000000000000000000000000000000000000000000000000
      2 = 0100000000000000000000000000000000000000000000000000000000000000
     10 = 0100000000100100000000000000000000000000000000000000000000000000
sqrt(2) = 0011111111110110101000001001111001100110011111110011101111001101

By comparison, %21x is a model of clarity.

The above numbers are 8-byte floating point, also known as double precision, encoded in binary64 IEEE 754-2008 big endian format. Big endian means that the bytes are ordered, left to right, from most significant to least significant. Some computers store floating-point numbers in little endian format — bytes ordered from least significant to most significant — and then the same numbers look like this:

      1 = 000000000000f03f
      2 = 0000000000000040
     10 = 0000000000002440
sqrt(2) = cd3b7f669ea0f63f

or:

      1 = 0000000000000000000000000000000000000000000000001111000000111111
      2 = 0000000000000000000000000000000000000000000000000000000001000000
     10 = 0000000000000000000000000000000000000000000000000010010001000000
sqrt(2) = 1100110100111011011111110110011010011110101000001111011000111111

Regardless of the byte order, %21x produces the same output:

. display %21x 1
+1.0000000000000X+000

. display %21x 2
+1.0000000000000X+001

. display %21x 10
+1.4000000000000X+003

. display %21x sqrt(2)
+1.6a09e667f3bcdX+000

Binary computers store floating-point numbers as a number pair, (a, b); the desired number z is encoded as

z = a * 2^b

For example,

 1 = 1.00 * 2^0
 2 = 1.00 * 2^1
10 = 1.25 * 2^3

The number pairs are encoded in the bit patterns, such as 00111111…01, above.

I’ve written the components a and b in decimal, but for reasons that will become clear, we need to preserve the essential binaryness of the computer’s number. We could write the numbers in binary, but they will be more readable if we represent them in base-16:

     base-10 floating point    base-16 floating point
     ------------------------------------------------
      1 = 1.00 * 2^0            1 = 1.00 * 2^0
      2 = 1.00 * 2^1            2 = 1.00 * 2^1
     10 = 1.25 * 2^3           10 = 1.40 * 2^3

“1.40?”, you ask, looking at the last row, which indicates 1.40*2^3 for decimal 10.

The period in 1.40 is not a decimal point; it is a hexadecimal point. The first digit after the hexadecimal point is the number for 1/16ths, the next is for 1/(16^2)=1/256ths, and so on. Thus, 1.40 hexadecimal equals 1 + 4*(1/16) + 0*(1/256) = 1.25 in decimal.
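
If you want to check that arithmetic in Stata itself, you can; 1.25 is exactly representable in binary, so no rounding muddies the picture:

. display 1 + 4/16 + 0/256
1.25

. display %21x 1.25
+1.4000000000000X+000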

And that is how you read %21x values +1.0000000000000X+000, +1.0000000000000X+001, and +1.4000000000000X+003. To wit,

     base-10    base-16 floating point    %21x
     ---------------------------------------------------------
      1      =  1.00 * 2^0            =  +1.0000000000000X+000
      2      =  1.00 * 2^1            =  +1.0000000000000X+001
     10      =  1.40 * 2^3            =  +1.4000000000000X+003

The mantissa is shown to the left of the X, and, to the right of the X, the exponent for the 2. %21x is nothing more than a binary variation of the %e format with which we are all familiar, for example, 12 = 1.20000e+01 = 1.2*10^1. It’s such an obvious generalization that one would guess it has existed for a long time, so excuse me when I mention that we invented it at StataCorp. If I weren’t so humble, I would emphasize that this human-readable way of representing binary floating-point numbers preserves nearly every aspect of the IEEE floating-point number. Being humble, I will merely observe that 1.40x+003 is more readable than 4024000000000000.

Now that you know how to read %21x, let me show you how you might use it. %21x is particularly useful for examining precision issues.

For instance, the cube root of 8 is 2; 2*2*2 = 8. And yet, in Stata, 8^(1/3) is not equal to 2:

. display 8^(1/3)
2

. assert 8^(1/3) == 2
assertion is false
r(9);

. display %20.0g 8^(1/3)
1.99999999999999978

I blogged about that previously; see How Stata calculates powers. The error is not much:

. display 8^(1/3)-2
-2.220e-16

In %21x format, however, we can see that the error is only one bit:

. display %21x 8^(1/3)
+1.fffffffffffffX+000

. display %21x 2    
+1.0000000000000X+001

I wish the answer for 8^(1/3) had been +1.0000000000001X+000, because then the one-bit error would have been obvious to you. Instead, rather than being a bit too large, the actual answer is a bit too small — one bit too small to be exact — so we end up with +1.fffffffffffffX+000.

One bit off means being off by 2^(-52), which is 2.220e-16, and which is the number we saw when we displayed 8^(1/3)-2 in base 10. So %21x did not reveal anything we could not have figured out in other ways. The nature of the error, however, is more obvious in %21x format than in a base-10 format.
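
To see the one-bit gap another way, note that the output above says 8^(1/3) came out exactly one unit in the last place below 2, something we can ask Stata to confirm (if your version of Stata computed the power differently, the assert would fail):

. display %21x 2 - 2^(-52)
+1.fffffffffffffX+000

. assert 8^(1/3) == 2 - 2^(-52)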

On Statalist, the point often comes up that 0.1, 0.2, …, 0.4, 0.6, …, 0.9, 0.11, 0.12, … have no exact representation in the binary base that computers use. That becomes obvious with %21x format:

. display %21x 0.1
+1.999999999999aX-004

. display %21x 0.2
+1.999999999999aX-003

. ...

0.5 does have an exact representation, of course, as do all the negative powers of 2:

. display %21x 0.5           //  1/2
+1.0000000000000X-001

. display %21x 0.25          //  1/4
+1.0000000000000X-002

. display %21x 0.125         //  1/8
+1.0000000000000X-003

. display %21x 0.0625        // 1/16
+1.0000000000000X-004

. ...

Integers have exact representations, too:

. display %21x 1
+1.0000000000000X+000

. display %21x 2
+1.0000000000000X+001

. display %21x 3
+1.8000000000000X+001

. ...

. display %21x 10
+1.4000000000000X+003

. ...

. display %21x 10786204
+1.492b380000000X+017

. ...

%21x is a great way of becoming familiar with base-16 (equivalently, base-2), which is worth doing if you program base-16 (equivalently, base-2) computers.

Let me show you something useful that can be done with %21x.

A programmer at StataCorp has implemented a new statistical command. In four examples, the program produces the following results:

41.8479499816895
 6.7744922637939
 0.1928647905588
 1.6006311178207

Without any additional information, I can tell you that the program has a bug, and that StataCorp will not be publishing the code until the bug is fixed!

How can I know that this program has a bug without even knowing what is being calculated? Let me show you the above results in %21x format:

+1.4ec89a0000000X+005
+1.b191480000000X+002
+1.8afcb20000000X-003
+1.99c2f60000000X+000

Do you see what I see? It’s all those zeros. In randomly drawn problems, it would be unlikely for every result to end in zeros. What is likely is that the results were somehow rounded, and indeed they were. The rounding in this case was due to inadvertently using float (4-byte) precision: the programmer forgot to specify double in the ado-file.
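
You can reproduce the telltale pattern yourself: Stata’s float() function rounds a double to float precision, and %21x makes the rounding visible. The trailing zeros below are the same signature:

. display %21x 0.1
+1.999999999999aX-004

. display %21x float(0.1)
+1.99999a0000000X-004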

And that’s one way %21x is used.

I am continually harping on programmers at StataCorp that if they are going to program binary computers, they need to think in binary. I go ballistic when I see a comparison that’s coded as “if (abs(x-y)<1e-8) …” in an attempt to deal with numerical inaccuracy. What kind of number is 1e-8? Well, it’s this kind of number:

. display %21x 1e-8
+1.5798ee2308c3aX-01b

Why put the computer to all that work, and exactly how many digits are you, the programmer, trying to ignore? Rather than 1e-8, why not use the “nice” numbers 7.451e-09 or 3.725e-09, which is to say, 1.0x-1b or 1.0x-1c? If you do that, then I can see exactly how many digits you are ignoring. If you code 1.0x-1b, I can see you are ignoring 1b=27 binary digits. If you code 1.0x-1c, I can see you are ignoring 1c=28 binary digits. Now, how many digits do you need to ignore? How imprecise do you really think your calculation is? By the way, Stata understands numbers such as 1.0x-1b and 1.0x-1c as input, so you can type the precise number you want.
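
For example (sqrt(2)^2 is a handy stand-in for a calculation carrying a little round-off):

. display %21x 1.0x-1b
+1.0000000000000X-01b

. display abs(sqrt(2)^2 - 2) < 1.0x-1b     // 1 means true: within 2^-27
1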

As another example of thinking in binary, a StataCorp programmer once described a calculation he was making. At one point, the programmer needed to normalize a number in a particular way, and so calculated x/10^trunc(log10(x)), and held onto the 10^trunc(log10(x)) for denormalization later. Dividing by 10, 100, etc., may be easy for us humans, but it’s not easy in binary, and it can result in very small amounts of dreaded round-off error. And why even bother to calculate the log, which is an expensive operation? “Remember,” I said, “how floating-point numbers are recorded on a computer: z = a*2^b, where 0 <= |a| < 2. Writing in C, it’s easy to extract the components. In fact, isn’t a number normalized to be between 0 and 2 even better for your purposes?” Yes, it turned out it was.

Even I sometimes forget to think in binary. Just last week I was working on a problem and Alan Riley suggested a solution. I thought a while. “Very clever,” I said. “Recasting the problem in powers of 2 will get rid of that divide that caused half the problem. Even so, there’s still the pesky subtraction.” Alan looked at me, imitating a look I so often give others. “In binary,” Alan patiently explained to me, “the difference you need is the last 19 bits of the original number. Just mask out the other digits.”

At this point, many of you may want to stop reading and go off and play with %21x. If you play with %21x long enough, you’ll eventually examine the relationship between numbers recorded as Stata floats and as Stata doubles, and you may discover something you think to be an error. I will discuss that next week in my next blog posting.

How Stata calculates powers

Excuse me, but I’m going to toot Stata’s horn.

I got an email from Nicholas Cox (an Editor of the Stata Journal) yesterday. He said he was writing something for the Stata Journal and wanted the details on how we calculated a^b. He was focusing on examples such as (-8)^(1/3), where Stata produces a missing value rather than -2, and he wanted to know if our calculation of that was exp((1/3)*ln(-8)). He didn’t say where he was going, but I answered his question.

I have rather a lot to say about this.

Nick’s supposition was correct: in this particular case, and for most values of a and b, Stata calculates a^b as exp(b*ln(a)). In the case of a=-8 and b=1/3, ln(-8)==. (missing), and thus (-8)^(1/3)==. as well.

You might be tempted to say Stata (or other software) should just get this right and return -2. The problem is not only that 1/3 has no exact decimal representation, but it has no exact binary representation, either. Watching for 1/3 and doing something special is problematic.

One solution to this problem, if a solution is necessary, would be to introduce a new function, say, invpower(a,b), that returns a^(1/b). Thus, cube roots would be invpower(a,3). Integers have exact numerical representations even in digital computers, so computers can watch for integers and take special actions.

Whether one should watch for such special integer values is an interesting question. Stata does watch for special values of b in calculating a^b. In particular, it watches for integer b, 1<=b<=64 or -64<=b<=-1. When Stata finds such integers, it calculates a^b by repeatedly multiplying a or 1/a. Thus Stata calculates ..., a^(-2), a^(-1), ..., a^2, a^3, ... more accurately than it calculates ..., a^(-2+epsilon), a^(-1+epsilon), ..., a^(2+epsilon), a^(3+epsilon), ....
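
You can watch the integer-exponent path at work. The first two results must match exactly because both are computed by multiplication; run the third yourself and compare the last hex digits; as discussed below, the generic formula comes out about 1.776e-15 away from 8:

. display %21x 2^3
+1.0000000000000X+003

. display %21x 2*2*2
+1.0000000000000X+003

. display %21x exp(3*ln(2))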

So what could be wrong with that?

Let's imagine we wish to make a calculation of F(a,b) and that we have two ways to calculate it.

The first way is via an approximation formula A(a,b) = F(a,b) + e, where e is error. Do not assume that the error has the good properties it would have in a statistical model. Typically, e/F() averages roughly 0 (I write sloppily) because, were that not so, we would adjust the formula to make it so. In addition, the error tends to be serially correlated because approximation formulas are usually continuous. In this case, serial correlation is actually a desirable property!

Okay, that's one way we have for calculating F(). The second way is by an exact formula E(a,b) = F(a,b), but E() is only valid for certain values of b.

Great, you say, we'll use E() for the special values and use A() for the rest. We'll call that function G().

Now consider someone calculating a numerical derivative from results of G(). Say that they wish to calculate (F(a,b+h)-F(a,b))/h for a suitably small value of h, which they do by calculating (G(a,b+h)-G(a,b))/h. Then they obtain

  1. (A(a,b+h)-E(a,b))/h for some values, and
  2. (A(a,b+h)-A(a,b))/h for others.

There are other possibilities, for instance, they might obtain (E(a,b-h)-A(a,b))/h, but the two above will be sufficient for my purposes.

Note that in calculation (2), the serial correlation in the errors works in our favor: the two errors largely cancel, making (2) more accurate than (1)!

This can be an issue.

The error in the case of Stata’s a^b around the integers -64, -63, …, -1, 1, 2, …, 64 is small enough that we just ignored it. For instance, in 2^3, A()-E() = -1.776e-15. Had the error been large enough to matter, we would have combined A() and E() in a different way to produce a more accurate approximation formula. To wit, you know that at certain values of b, E() is exact, so one develops an adjustment for A() that uses that information to adjust not just the specific values, but values around them, too.

In this example, A(a,b) = exp(b*ln(a)), a>0, and you may be asking yourself in what sense exp(b*ln(a)) is an approximation. The answer is that exp() and ln() are approximations to the true functions, and in fact, so is * an approximation for the underlying idea of multiplication.

That Stata calculates a^b by repeated multiplication for -64<=b<=-1 and 1<=b<=64 is not something we have ever mentioned. People do not realize the extreme lengths we go to on what might seem minor issues. It is because we do this that things work as you expect; in this case, a^3 is exactly equal to a*a*a. The irony is that when numerical issues arise that do not have a similarly easy solution, users are disappointed: “Why don’t you fix that?” In the old Fortran days in which I grew up, one would never expect a^3 to equal a*a*a. One’s nose was constantly being rubbed into numerical issues, which reminded us not to overlook them.

By the way, if issues like this interest you, consider applying to StataCorp for employment. We can fill your days with discussions like this.


Using dates and times from other software

Most software stores dates and times numerically, as durations from some sentinel date, but they differ on the sentinel date and on the units in which the duration is stored. Stata stores dates as the number of days since 01jan1960, and datetimes as the number of milliseconds since 01jan1960 00:00:00.000. January 3, 2011 is stored as 18,630, and 2pm on January 3 is stored as 1,609,682,400,000. Other packages use different choices for bases and units.
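
You can confirm those two constants directly in Stata; the td() and tc() functions interpret date and datetime literals, and the %15.0f format merely avoids scientific notation:

. display td(03jan2011)
18630

. display %15.0f tc(03jan2011 14:00)
  1609682400000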

It sometimes happens that you need to process in Stata data imported from other software and end up with a numerical variable recording a date or datetime in the other software’s encoding. It is usually possible to adjust the numeric date or datetime values to the sentinel date and units that Stata uses. Below are conversion rules for SAS, SPSS, R, Excel, and Open Office.

 
SAS

SAS stores dates as the number of days since 01jan1960, the same as Stata:

    . gen statadate = sasdate
    . format statadate %td

SAS stores datetimes as the number of seconds since 01jan1960 00:00:00, assuming 86,400 seconds/day. Thus, all that’s necessary is to multiply SAS datetimes by 1,000 and attach a %tc format to the result,

    . gen double statatime = (sastime*1000)
    . format statatime %tc

It is important that variables containing SAS datetimes, such as sastime above, be imported as doubles into Stata.

 
SPSS

SPSS stores both dates and datetimes as the number of seconds since 14oct1582 00:00:00, assuming 86,400 seconds/day. To convert SPSS datetimes to Stata datetimes, type

    . gen double statatime = (spsstime*1000) + tc(14oct1582 00:00)
    . format statatime %tc

Multiplying by 1,000 converts from seconds to milliseconds. Adding tc(14oct1582 00:00) accounts for the differing bases.

Function tc() returns the specified datetime as a Stata datetime, which is to say, the number of milliseconds between the specified datetime and 01jan1960 00:00:00.000. We need to add the difference between SPSS’s base and Stata’s base, which is tc(14oct1582 00:00) – tc(01jan1960 00:00), but tc(01jan1960) is definitionally 0, so that just leaves tc(14oct1582 00:00). tc(14oct1582), for your information, is -11,903,760,000,000.

SPSS dates are the same as SPSS datetimes, so to convert an SPSS date to a Stata date, we could type,

    . gen double statatime = (spssdate*1000) + tc(14oct1582 00:00)
    . gen statadate        = dofc(statatime)
    . format statadate %td
    . drop statatime

Function dofc() converts a Stata datetime to a Stata date. We can combine the above into,

    . gen statadate = dofc((spssdate*1000) + tc(14oct1582 00:00))
    . format statadate %td

 
R

R stores dates as days since 01jan1970. To convert to a Stata date,

    . gen statadate = rdate + td(01jan1970)
    . format statadate %td

Stata uses 01jan1960 as the base, R uses 01jan1970, so all you have to do is add the number of days between 01jan1960 and 01jan1970.

R stores datetimes as the number of UTC-adjusted seconds since 01jan1970 00:00:00. UTC stands for Coordinated Universal Time. Rather than assuming 86,400 seconds/day, some UTC days have 86,401 seconds. Leap seconds are sometimes inserted into UTC days to keep the clock coordinated with the Earth’s rotation. Stata’s datetime %tC format is UTC time, which is to say, it accounts for these leap seconds. Thus, to convert R datetimes to Stata, you type

    . gen double statatime = rtime*1000 + tC(01jan1970 00:00)
    . format statatime %tC

Note the use of Stata’s tC() function rather than tc() to obtain the number of milliseconds between the differing bases. tc() returns the number of milliseconds since 01jan1960 00:00:00 assuming 86,400 seconds/day. tC() returns the number of milliseconds adjusted for leap seconds. In this case, it would not make a difference if we mistakenly typed tc() rather than tC() because no leap seconds were inserted between 1960 and 1970. Had the base year been 1980, however, the use of tC() would have been important. Nine extra seconds were inserted between 01jan1970 and 01jan1980!

In many cases you may prefer using a time variable that ignores leap seconds. In that case, you can type

    . gen double statatime = cofC(rtime*1000 + tC(01jan1970 00:00))
    . format statatime %tc

 
Excel

Excel has used different date systems for different operating systems. Excel for Windows used the “1900 Date System”. Excel for Mac used the “1904 Date System”. More recently, Microsoft has been standardizing on the 1900 Date System.

If you have an Excel for Windows workbook, it is likely to be using 1900.

If you have an Excel for Mac workbook, it is likely to be using 1904, unless it came from a Windows workbook originally.

Anyway, both Excels can use either encoding. See http://support.microsoft.com/kb/214330 for more information and for instructions on converting your workbook between date systems.

In any case, you are unlikely to encounter numerically coded Excel dates. If you cut and paste a spreadsheet into Stata’s Data Editor, dates and datetimes paste as strings in human-readable form, and most conversion packages know to convert the dates for you.

 
Excel, 1900 date system

For dates on or after 01mar1900, Excel 1900 Date System stores dates as days since 30dec1899. To convert to a Stata date,

    . gen statadate = exceldate + td(30dec1899)
    . format statadate %td

Excel can store dates between 01jan1900 and 28feb1900, too, but the formula above will not handle those two months. See http://www.cpearson.com/excel/datetime.htm for more information.

For datetimes on or after 01mar1900 00:00:00, Excel 1900 Date System stores datetimes as days plus fraction of day since 30dec1899 00:00:00. To convert with a one-second resolution to a Stata datetime,

    . gen double statatime = round((exceltime+td(30dec1899))*86400)*1000
    . format statatime %tc

 
Excel, 1904 date system

For dates on or after 01jan1904, Excel 1904 Date System stores dates as days since 01jan1904. To convert to a Stata date,

 
    . gen statadate = exceldate + td(01jan1904)
    . format statadate %td

For datetimes on or after 01jan1904 00:00:00, Excel 1904 Date System stores datetimes as days plus fraction of day since 01jan1904 00:00:00. To convert with a one-second resolution to a Stata datetime,

 
    . gen double statatime = round((exceltime+td(01jan1904))*86400)*1000
    . format statatime %tc

 
Open Office

Open Office uses the Excel 1900 Date System.

 
Why Stata has two datetime encodings

We have just seen that most packages assume 86,400 seconds/day but that one instead uses UTC time, in which days have 86,400 or 86,401 seconds, depending. Stata provides both datetime encodings, called %tc and %tC. That turned out to be convenient for translating times from other packages. Stata will even let you switch from one to the other using the cofC() and Cofc() functions, so now you should be asking yourself: which should I use?

Stata’s %tc format assumes that there are 24*60*60*1,000 ms per day — 86,400 seconds per day — just as an atomic clock does. Atomic clocks count oscillations between the nucleus and electrons of an atom and thus provide a measurement of the real passage of time.

Time of day measurements have historically been based on astronomical observation, which is a fancy way of saying, based on looking at the sun. The sun should be at its highest point at noon, right? So however you kept track of time — falling grains of sand or a wound-up spring — you periodically reset your clock and then went about your business. In olden times it was understood that the 60 seconds per minute, 60 minutes per hour, 24 hours per day were theoretical goals that no mechanical device could reproduce accurately. These days, we have more accurate definitions for measuring time. A second is 9,192,631,770 periods of the radiation corresponding to the transition between two levels of the ground state of caesium 133. Obviously we have better equipment than the ancients, so problem solved, right? Wrong. There are two problems: the formal definition of a second is just a little too short to match the length of a day, and the Earth’s rotation is slowing down.

As a result, since 1972 leap seconds have been added to atomic clocks once or twice a year to keep time measurements in synchronization with the Earth’s rotation. Unlike leap years, however, there is no formula for predicting when leap seconds will occur. The Earth may be slowing down on average, but there is a large random component to that. As a result, leap seconds are determined by committee and announced 6 months before they are inserted. Leap seconds are added, if necessary, at the end of the day on June 30 and December 31 of the year. The inserted times are designated 23:59:60.

Unadjusted atomic clocks may accurately mark the passage of real time, but you need to understand that leap seconds are every bit as real as every other second of the year. Once a leap second is inserted, it ticks just like any other second and real things can happen during that tick.

You may have heard of terms such as GMT and UTC.

GMT is the old Greenwich Mean Time and is based on astronomical observation. GMT has been supplanted by UTC.

UTC stands for Coordinated Universal Time and is measured by atomic clocks, occasionally corrected for leap seconds. UTC is derived from two other times, UT1 and TAI. UT1 is the mean solar time, with which UTC is kept in sync by the occasional addition of a leap second. TAI is the atomically measured pure time. TAI was set to GMT plus 10 seconds in 1958 and has been running unadjusted since then. Update 07 Jan 2010: TAI is a statistical combination of various atomic chronometers and even it has not ticked uniformly over its history; see http://www.ucolick.org/~sla/leapsecs/timescales.html and especially http://www.ucolick.org/~sla/leapsecs/dutc.html#TAI. (Thanks to Steve Allen of the UCO/Lick Observatory for correcting my understanding and for the reference.)

UNK is StataCorp’s term for the time standard most people use. UNK stands for unknowing. UNK is based on a recent time observation, probably UTC, and then just assuming that there are 86,400 seconds per day after that.

The UNK standard is adequate for many purposes, and in such cases, you will want to use %tc rather than the leap second-adjusted %tC encoding. If you are using computer-timestamped data, however, you need to find out whether the timestamping system accounted for leap-second adjustments. Problems can arise even if you do not care about losing or gaining a second here and there.

For instance, you may import timestamp values from other systems recorded as the number of milliseconds that have passed since some agreed-upon date. If you choose the wrong encoding scheme, that is, if you choose %tc when you should choose %tC or vice versa, more recent times will be off by 24 seconds.

To avoid such problems, you may decide to import and export data by using Human Readable Forms (HRF) such as “Fri Aug 18 14:05:36 CDT 2006”. This method has advantages, but for %tC (UTC) encoding, times such as 23:59:60 are possible. Some software will refuse to decode such times.

Stata refuses to decode 23:59:60 in the %tc encoding (function clock()) and accepts it in the %tC encoding (function Clock()). When the %tC function Clock() sees a time with a 60th second, Clock() verifies that the time corresponds to an official leap second. Thus, when translating from printable forms, try assuming %tc and check the result for missing values. If there are none, you can assume your use of %tc is valid. If there are missing values and they are due to leap seconds and not some other error, you must use the %tC function Clock() to translate from HRF. After that, if you still want to work in %tc units, use function cofC() to translate %tC values into %tc.
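
Here is a minimal sketch of that workflow. The variable timestr and the mask "DMYhms" are hypothetical; substitute whatever matches your data:

. gen double t = clock(timestr, "DMYhms")
. list timestr if missing(t) & !missing(timestr)    // inspect the failures

. * if the only failures are legitimate 23:59:60 leap seconds, use Clock():
. gen double T = Clock(timestr, "DMYhms")
. format T %tC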

If precision matters, the best way to process %tC data is simply to treat them that way. The inconvenience is that you cannot assume that there are 86,400 seconds per day. To obtain the duration between dates, you must subtract the two time values involved. The other difficulty has to do with dealing with dates in the future. Under the %tC (UTC) encoding, there is no set value for any date more than 6 months in the future.
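
For example, the duration in seconds between two %tC values is just their difference divided by 1,000, leap seconds included. Given the nine leap seconds inserted during the 1970s, mentioned earlier, the result below should be 315,532,809 rather than the 315,532,800 that 3,652 days of 86,400 seconds would give:

. display %15.0f (tC(01jan1980 00:00) - tC(01jan1970 00:00))/1000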

 
Advice

Stata provides two datetime encodings:

  1. %tC, also known as UTC, which accounts for leap seconds, and
  2. %tc, which ignores them (it assumes 86,400 seconds/day).

Systems vary in how they treat time variables. My advice is,

  • If you obtain data from a system that accounts for leap seconds, import using Stata’s %tC.
    1. If you later need to export data to a system that does not account for leap seconds, use Stata’s cofC() function to translate time values before exporting.
    2. If you intend to tsset the time variable and the analysis will be at the second level or finer, just tsset the %tC variable, specifying the appropriate delta() if necessary, for example, delta(1000) for seconds.
    3. If you intend to tsset the time variable and the analysis will be at coarser than the second level (minute, hour, etc.), create a %tc variable from the %tC variable (generate double tctime = cofC(tCtime)) and tsset that, specifying the appropriate delta() if necessary. You must do that because, in a %tC variable, there are not necessarily 60 seconds in a minute; some minutes have 61 seconds.
  • If you obtain data from a system that ignores leap seconds, use Stata’s %tc.
    1. If you later need to export data to a system that does account for leap seconds, use Stata’s Cofc() function to translate time values.
    2. If you intend to tsset the time variable, just tsset it, specifying the appropriate delta().

Some users prefer always to use Stata’s %tc because those values are a little easier to work with. You can do that if

  • you do not mind having up to 1 second of error and
  • you do not import or export numerical values (clock ticks) from other systems that are using leap seconds, because then there could be nearly 30 seconds of accumulated error.

There are two things to remember if you use %tC variables:

  1. The number of seconds between two dates is a function of when the dates occurred. Five days from one date is not simply a matter of adding 5*24*60*60*1,000 ms. You might need to add another 1,000 ms. Three hundred and sixty-five days from now might require adding 1,000 or 2,000 ms. The longer the span, the more you might have to add. The best way to add durations to %tC variables is to extract the components, add to them, and then reconstruct the datetime from the numerical components, as sketched after this list.

  2. You cannot accurately predict datetimes more than six months into the future. We do not know what the %tC value of 25dec2026 00:00:00 will be because every year along the way, the International Earth Rotation and Reference Systems Service (IERS) will twice announce whether a leap second will be inserted.
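
Here is the sketch promised in (1), assuming T is an existing %tC variable and we want the same clock time five days later. dofC(), hhC(), mmC(), and ssC() extract the date and time-of-day components, and Cdhms() reassembles them into a %tC value. (If the original time is a 23:59:60 leap second and the target day has none, Cdhms() returns missing, which is the encoding telling you the moment does not exist.)

. gen double T5 = Cdhms(dofC(T) + 5, hhC(T), mmC(T), ssC(T))
. format T5 %tC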

You can help alleviate these inconveniences. Face west and throw rocks. The benefit will be only transitory if the rocks land back on Earth, so you need to throw them really hard. I know what you’re thinking, but this does not need to be a coordinated effort.


How to successfully ask a question on Statalist

As everyone knows, I am a big proponent of Statalist, and not just for selfish reasons, although those reasons play a role. Nearly every member of the technical staff at StataCorp — me included — is a member of Statalist. Even when we don’t participate in a particular thread, we do pay attention. The discussions on Statalist play an important role in Stata’s development.

Statalist is a discussion group, not just a question-and-answer forum. Nonetheless, new members often use it to obtain answers to questions, and that works because those questions sometimes become grist for subsequent discussions. In those cases, the questioners not only get answers, they get much more.

One of the best features of Statalist is that, no matter how poorly you ask a question, you are unlikely to be flamed. Not only are the members of Statalist nice — just as are the members of most lists — they act just as nice on the list as they really are. You are unlikely to be flamed if you ask a question poorly, but you are also unlikely to get an answer.

Here is my recipe to increase the chances of you getting a helpful response. You should also read the Statalist FAQ before writing your question.

Subject line

Make the subject line of your email meaningful. Some good subject lines are:

Survival analysis

Confusion about -stcox-

Unexpected error from -stcox-

-stcox- output

The first two sentences

The first two sentences are the most important, and they are the easiest to write.

In the first sentence, state your problem in Stata terms, but do not go into details. Here are some good first sentences:

I’m having a problem with -stcox-.

I’m getting an unexpected error message from -stcox-.

I’m using -stcox- and don’t know how to interpret the result.

I’m using -stcox- and getting a result I know is wrong, so I know I’m misunderstanding something.

I want to use -stcox- but don’t know how to start.

I think I want to use -stcox-, but I’m unsure.

I want to use -stcox- but my data is complicated and I’m unsure how to proceed.

I have a complicated dataset that I somehow need to transform into a form suitable for use with -stcox-.

Stata crashed!

I’m having a problem that may be more of a statistics issue than a Stata issue.

The purpose of the first sentence is to catch the attention of members who have an interest in your topic and let the others, who were never going to answer you anyway, move on.

The second sentence is even easier to write:

I am using Stata 11.1 for Windows.

I am using Stata 10 for Mac.

Even if you are certain that it’s unimportant which version of Stata you are using, state it anyway.

Write two sentences and you are done with the first paragraph.

The second paragraph

Now write more about your problem. Try not to be overly wordy, but it’s better to be wordy than curt to the point of being unclear. However you write this paragraph, be explicit. If you’re having a problem making Stata work, tell your readers exactly what you typed and exactly how Stata responded. For example,

I typed -stcox weight- and Stata responded “data not st”, r(119).

I typed -stcox weight sex- and Stata showed the usual output, except the standard error on weight was reported as dot.

The form of the second paragraph — which may extend into the third, fourth, … — depends on what you are asking. Describe the problem concisely but completely. Sacrifice conciseness for completeness if you must or you think it will help. To the extent possible, simplify your problem by getting rid of extraneous details. For instance,

I have 100,000 observations and 1,000 variables on firms, but 4
observations and 3 variables will be enough to show the problem.
My data looks like this

        firm_id     date      x
        -----------------------
          10043       17     12
          10043       18      5
          13944       17     10
          27394       16      1
        -----------------------

I need data that looks like this:

        date    no_of_firms   avg_x
        ---------------------------
          16              1       1
          17              2      11
          18              1      12

That is, for each date, I want the number of firms and the
average value of x.
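
(An aside for readers: one way to perform that transformation, assuming the variables really are named firm_id, date, and x, is Stata’s collapse command,

. collapse (count) no_of_firms = firm_id (mean) avg_x = x, by(date)

where (count) counts the nonmissing values of firm_id within each date and (mean) averages x.)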

Here’s another example for the second and subsequent paragraphs:

The substantive problem is this:  Patients enter and leave the 
hospital, sometimes more than once over the period.  I think  
in this case it would be appropriate to combine the 
separate stays so that a patient who was in for 2 days and 
later for 4 days could be treated as being simply in for 6 days,  
except I also record how many separate stays there were, too.

I'm evaluating cost, so for my purposes, I think treating 
cost as proportional to days in hospital, whatever their
distribution, will be adequate.  I'm looking at total days as a
function of number of stays.  The idea is that letting patients out
too early results in an increase in total days, and I want to
measure this.

I realize that more stays and days might also arise simply because
the patient was sicker.  Some patients die, and that obviously 
truncates stay, so I've omitted them from data.  I have disease
codes, but nothing about health status within code.  

Is there a way to incorporate this added information to improve  
the estimates?  I've got lots of data, so I was thinking of 
using death rate within disease code to somehow rank the codes 
as to seriousness of illness, and then using "seriousness" 
as an explanatory variable.  I guess my question is whether  
anyone knows a way I might do this. 

Or is there some way I could estimate the model separately within
disease code, somehow constraining the coefficient on number of 
stays to be the same?  I saw something in the manual about 
stratified estimates, but I'm unsure if this is the same thing.

You’re asking someone to invest their time, so invest yours

Before you hit the send key, read what you have written, and improve it. You are asking for someone to invest their time helping you. Help them by making your problem easy to understand.

The easier your problem is to understand, the more likely you are to get a response. Said differently, if you write in a disorganized way so that potential responders must work just to understand you, much less provide you with an answer, you are unlikely to get a response.

Sparkling prose is not required. Proper grammar is not even required, so nonnative English speakers can relax. My advice is that, unless you are often praised for how clearly and entertainingly you write, write short sentences. Organization is more important than the style of the individual sentences.

Avoid or explain jargon. Do not assume that the person who responds to your question will be in the same field as you. When dealing with a substantive problem, avoid jargon, except for statistical jargon that is common across fields, or else explain it. Potential responders like it when you teach them something new, and that makes them more likely to respond.

Tone

Write as if you are writing to a colleague whom you know well. Assume interest in your problem. The same thing said negatively: do not write to list members as you might write to your research assistant, employee, servant, slave, or family member. Nothing is more likely to get you ignored than to write, “I’m busy and really I don’t have time to filter through all the Statalist postings, so respond to me directly, and soon. I need an answer by tomorrow.”

The positive approach, however, works. Just as when writing to a colleague, in general you do not need to apologize, beg, or play on sympathies. Sometimes when I write to colleagues, I do feel the need to explain that I know what I’m asking is silly. “I should know this,” I’ll write, or, “I can’t remember, but …”, or, “I know I should understand, but I don’t”. You can do that on Statalist, but it’s not required. Usually when I write to colleagues I know well, I just jump right in. The same rule works with Statalist.

What’s appropriate

Questions appropriate for Stata’s Technical Services are not appropriate for Statalist, and vice versa. Some questions aren’t appropriate for either one, but those are rare. If you ask an inappropriate question, and ask it well, someone will usually direct you to a better source.

Who can ask, and how

You must join Statalist to send questions. Yes, you can join, ask a question, get your answer, and quit, but if you do, don’t mention this at the outset. List members know this happens, but if you mention it when you ask the question, you’ll sound superior and condescending. Also, stick around for a few days after you get your response, because sometimes your question will generate discussion. If it does, you should participate. You should want to stick around and participate because if there is subsequent discussion, the final result is usually better than the initial reply.

I’ve previously written on how to join (and quit) Statalist. See http://blog.stata.com/2010/11/08/statalist/.


Mata, the missing manual, available at SSC

I gave a 1.5 hour talk on Mata at the 2010 UK Stata Users Group Meeting in September. The slides are available in pdf form here. The talk was well received, which of course pleased me. If you’re interested in Mata, I predict you will find the slides useful even if you didn’t attend the meeting.

The problem with the Mata Reference Manual is that, even though it tells you all the details, it never tells you how to put it all together, nor does it motivate you. We developers at StataCorp love the manual for just that reason: It gets right to the details that are so easy to forget.

Anyway, in outline, the talk and slides work like this:

  1. They start with the mechanics of including Mata code. It begins gently, picking up where Stata’s NetCourse 151 leaves off, and ends up discussing big — really big — systems.
  2. Next is a section on appropriate and inappropriate use of Mata.
  3. That’s followed by Mata concepts, from basic to advanced.
  4. And the talk includes a section on debugging!

I was nervous about how the talk would be received before I gave it. It’s been on my to-do list to write a book on Mata, but I never really found a way to approach the subject. The problem is that it’s all so obvious to me that I tend to launch immediately into tedious details. I wrote drafts of a few chapters more than once, and even I didn’t want to reread them.

I don’t know why this overview approach didn’t occur to me earlier. My excuse is that it’s a strange (I claim novel) combination of basic and advanced material, but it seems to work. I titled the talk “missing manual” with the implied promise that I would write that book if the talk was well received. It was. Now I’m just not promising when. Real Soon Now.

The materials for all the talks, not just mine, are available at the SSC(*) and on www.stata.com. For the 2010 UK meeting, go to http://ideas.repec.org/s/boc/usug10.html or http://www.stata.com/meeting/uk10/abstracts.html. For other Users Group meetings, it’s easiest to start at the Stata page Meeting Proceedings.

If you have questions on the material, the appropriate place to post them is Statalist. I’m a member and am likely to reply, and that way, others who might be interested get to see the exchange, too. Please use “Mata missing manual” as the subject so that it will be easy for nonmembers to search the Statalist Archives and find the thread.

Finally, my “Mata, the missing manual” talk has no connection with the fine Missing-Manual series, “the book that should have been in the box”, created by Pogue Press and O’Reilly Media, whose website is http://missingmanuals.com/.

* The SSC is the Statistical Software Components archive, often called the Boston College Archive, provided by http://www.repec.org/. The SSC has become the premier Stata download site for user-written software on the Internet and also archives proceedings of Stata Users Group meetings and conferences.


Statalist

I just want to take a moment to plug Statalist. I’m a member and I hope to convince you to join Statalist, too, but even if I don’t succeed, you need to know about the web-based Statalist Archives because they’re a great resource for finding answers to questions about Stata, and you don’t have to join Statalist to access them.

Statalist’s Archives are found at http://www.stata.com/statalist/archive/, or you can click on “Statalist archives” on the right of this blog page, under Links.

Once at the Archives page, you can click on a year and month to get an idea of the flavor of Statalist. More importantly, you can search the archives. The search is powered by Google and works well for highly specific, directed inquiries. For generic searches such as random numbers or survival analysis, however, I prefer to go to Advanced Search and ask that the results be sorted by date instead of relevance. It’s usually the most recent postings that are the most interesting, and by-date results are listed in just that order.

Anyway, the next time you are puzzling over something in Stata, I suggest you give the Archives a try.


More on welcome

When Stata first started back in 1985, communicating with users–well, back then they were potential users because we didn’t have any users yet–was nearly impossible.

From the beginning, we were very modern. Back in 1985, there were competing packages, but no one (not even me) expected personal computers to replace the mainframe. Back then, about the best that could be said about the available statistical packages was that they worked (sometimes) for some problems. What made Stata different was our belief and attitude that personal computers could actually be better than the mainframe for some problems. That in itself was a radical idea! In the mainstream, mainframe-computer world, there was a popular saying: Little computers for little minds.

And we’ve stayed modern since then. Stata was (in 1999) the first statistical package to have online updating and an automated, modern, Internet way to handle user-written code. Modern Statas not only have that, but can use datasets directly off the web. But we have fallen behind! It’s 2010, and StataCorp doesn’t have a corporate blog!

Well, we do now.

Well, that may not be the most exciting announcement we’ve ever made. But our blog will be authored by the same people who develop Stata, support Stata, and yes, sell Stata. It will be useful, and it might be more entertaining than you suspect. If it is, that will be because of the people writing it.
