Archive

Archive for January 2011

Stata Conference Chicago 2011 Call for Presentations

The 2011 Stata Conference will be held on July 14 and 15 at the University of Chicago’s Gleacher Center. I’ve enjoyed meeting many enthusiastic Stata users at previous Stata Conferences, and I’m looking forward to seeing both familiar and new faces this year in Chicago.

The organizing committee recently posted a call for presentations on Statalist. That posting is included below.

To submit an abstract for a presentation, or to register for the conference, visit the conference webpage.

From: Phil Schumm <pschumm@uchicago.edu>
To: Statalist <statalist@hsphsun2.harvard.edu>
Subject: st: Stata Conference 2011 in Chicago
Date: Wed, 15 Dec 2010 08:21:32 -0500

On behalf of the organizing committee, I would like to invite everyone to participate in the Stata Conference 2011, to be held July 14-15th in Chicago. The meeting will be held in the University of Chicago’s Gleacher Center, right on the Chicago river in the heart of downtown. Chicago is a great place to visit in the summer, and the location of the conference will make it easy to take advantage of all the city has to offer.

Below is the call for presentations. This year’s organizing committee consists of Lisa Barrow, Scott Long, Richard Williams, and myself. Please contact one of us if you would like to discuss an idea for a presentation or have questions about the program format. For those of you who have not attended a Stata users group meeting before, giving a presentation is a great opportunity to share what you are doing in Stata with others, and to get feedback from interested (and knowledgeable) users and from StataCorp developers. And, as an added bonus, if your abstract is accepted for presentation, the conference registration fee will be waived (presenting author only).

We look forward to seeing everyone in Chicago!

– Phil

    Announcement and call for presentations

The Stata Conference 2011 will be held at the University of Chicago Graduate School of Business’ Gleacher Center. The Gleacher Center is located downtown on the bank of the Chicago River, just steps from Michigan Avenue and within walking distance of most downtown attractions.

Stata users’ meetings are enjoyable and rewarding for Stata users at all levels and from all disciplines. This year’s program will consist of a mixture of user presentations, longer talks by invited presenters, and talks by StataCorp developers. In addition, the program will include the ever-popular “Wishes and Grumbles” session in which users have an opportunity to share their comments and suggestions directly with developers from StataCorp.

All users are encouraged to submit abstracts for possible presentations. Presentations on any Stata-related topic will be considered, including (but not limited to) the following:

  • new user-written commands, including commands for modeling and estimation, graphical analysis, data management or reporting
  • use or evaluation of existing Stata commands
  • methods for teaching statistics with Stata or Stata use itself
  • case studies of Stata use in novel areas or applications
  • surveys or critiques of Stata facilities in specific fields
  • comparisons of Stata to other software, or use of Stata together with other software

User presentations should be either 15 or 25 minutes long, each followed by 5 minutes for questions. Longer talks will be at the discretion of the scientific committee.

    Submission guidelines

Please submit an abstract of no more than 200 words (ASCII text, no math symbols) by using the web submission form at http://repec.org/chi11/chi11.php. All abstracts must be received by March 14, 2011. Please make sure to include a short, informative title, and to indicate whether you wish to be considered for a short (15-minute) or long (25-minute) presentation. In addition, if your presentation has multiple authors, please identify the presenter. The conference registration fee will be waived for the presenter.

If you would like to discuss an idea for a presentation or have questions about the program format, please contact a member of the scientific organizing committee. This year’s committee consists of

Lisa Barrow (Federal Reserve Bank of Chicago) <lbarrow@frbchi.org>
Scott Long (Indiana University) <jslong@indiana.edu>
Phil Schumm (University of Chicago) <pschumm@uchicago.edu>
Rich Williams (Notre Dame) <richard.a.williams.5@nd.edu>

Presenters will be asked to provide electronic materials related to their talk (a copy of the presentation and any programs/datasets, where applicable) to the organizers so that the materials can be posted on the StataCorp website and in the Stata Users Group RePEc archive.

How Stata calculates powers

Excuse me, but I’m going to toot Stata’s horn.

I got an email from Nicholas Cox (an Editor of the Stata Journal) yesterday. He said he was writing something for the Stata Journal and wanted the details on how we calculated a^b. He was focusing on examples such as (-8)^(1/3), where Stata produces a missing value rather than -2, and he wanted to know if our calculation of that was exp((1/3)*ln(-8)). He didn’t say where he was going, but I answered his question.

I have rather a lot to say about this.

Nick’s supposition was correct, in this particular case, and for most values of a and b, Stata calculates a^b as exp(b*ln(a)). In the case of a=-8 and b=1/3, ln(-8)==., and thus (-8)^(1/3)==..

You might be tempted to say Stata (or other software) should just get this right and return -2. The problem is not only that 1/3 has no exact decimal representation, but it has no exact binary representation, either. Watching for 1/3 and doing something special is problematic.

One solution to this problem, if a solution is necessary, would be to introduce a new function, say, invpower(a,b), that returns a^(1/b). Thus, cube roots would be invpower(a,3). Integers have exact numerical representations even in digital computers, so computers can watch for integers and take special actions.

Whether one should watch integers for such special values is an interesting question. Stata does watch for special values of b in calculating a^b. In particular, it watches for b as an integer, 1<=b<=64 or -64<=b<=-1. When Stata finds such integers, it calculates a^b by repeatedly multiplying a or 1/a. Thus Stata calculates ..., a^(-2), a^(-1), ..., a^2, a^3, ... more accurately than it calculates ..., a^(-2+epsilon), a^(-1+epsilon), ..., a^(2+epsilon), a^(3+epsilon), ....

So what could be wrong with that?

Let's imagine we wish to make a calculation of F(a,b) and that we have two ways to calculate it.

The first way is via an approximation formula A(a,b) = F(a,b) + e, where e is error. Do not assume that error has good properties that a statistical model would have. Typically, e/F() has roughly average 0 (I write sloppily) because otherwise we would add values to achieve that. In addition, the error tends to be serially correlated because approximation formulas are usually continuous. In this case, serial correlation is actually a desirable property!

Okay, that's one way we have for calculating F(). The second way is by an exact formula E(a,b) = F(a,b), but E() is only valid for certain values of b.

Great, you say, we'll use E() for the special values and use A() for the rest. We'll call that function G().

Now consider someone calculating a numerical derivative from results of G(). Say that they wish to calculate (F(a,b+h)-F(a,b))/h for a suitably small value of h, which they do by calculating (G(a,b+h)+G(a,b))/h. Then they obtain

  1. (A(a,b+h)-E(b))/h for some values, and
  2. (A(a,b+h)-A(b))/h for others.

There are other possibilities, for instance, they might obtain (E(a,b-h)-A(b))/h, but the two above will be sufficient for my purposes.

Note that in calculation (2), the serial correlation in the error actually reduces the error in (2) relative to (1)!

This can be an issue.

The error in the case of Stata’s a^b around the integers -64, -63, …, -1, 1, 2, …, 64 is small enough that we just ignored it. For instance, in 2^3, A()-E() = -1.776e-15. Had the error been large enough, we would have combined A() and E() in a different way to produce a more accurate approximation formula. To wit, you know that at certain values of b, E() is exact, so one develops an adjust for A() that uses that information to adjust not just the specific value, but values around the specific value, too.

In this example, A(a,b) = exp(b*ln(a)), a>0, and you may be asking yourself in what sense exp(b*ln(a)) is an approximation. The answer is that exp() and ln() are approximations to the true functions, and in fact, so is * an approximation for the underlying idea of multiplication.

That Stata calculates a^b by repeated multiplication for -64<=b<=-1 and 1<=b<=64 is not something we have ever mentioned. People do not realize the extreme caution we go to on what might seem the minor issues. It is because we do this that things work as you expect. In this case, a^3 is exactly equal to a*a*a. This is ironic because when numerical issues arise that do not have a similarly easy solution, users are disappointed. Why don’t you fix that? In the old Fortran days in which I grew up, one would never expect a^3 to equal a*a*a. One’s nose was constantly being rubbed into numerical issues, which reminded us not to overlook them.

By the way, if issues like this interest you, consider applying to StataCorp for employment. We can fill your days with discussions like this.

Categories: Numerical Analysis Tags: ,

Using dates and times from other software

Most software stores dates and times numerically, as durations from some sentinel date, but they differ on the sentinel date and on the units in which the duration is stored. Stata stores dates as the number of days since 01jan1960, and datetimes as the number of milliseconds since 01jan1960 00:00:00.000. January 3, 2011 is stored as 18,630, and 2pm on January 3 is stored as 1,609,682,400,000. Other packages use different choices for bases and units.

It sometimes happens that you need to process in Stata data imported from other software and end up with a numerical variable recording a date or datetime in the other software’s encoding. It is usually possible to adjust the numeric date or datetime values to the sentinel date and units that Stata uses. Below are conversion rules for SAS, SPSS, R, Excel, and Open Office.

 
SAS

SAS stores dates as the number of days since 01jan1960, the same as Stata:

    . gen statadate = sasdate
    . format statadate %td

SAS stores datetimes as the number of seconds since 01jan1960 00:00:00, assuming 86,400 seconds/day. Thus, all that’s necessary is to multiply SAS datetimes by 1,000 and attach a %tc format to the result,

    . gen double statatime = (sastime*1000)
    . format statatime %tc

It is important that variables containing SAS datetimes, such as sastime above, be imported as doubles into Stata.

 
SPSS

SPSS stores both dates and datetimes as the number of seconds since 14oct1582 00:00:00, assuming 86,400 seconds/day. To convert SPSS datetimes to Stata datetimes, type

    . gen double statatime = (spsstime*1000) + tc(14oct1582 00:00)
    . format statatime %tc

Multiplying by 1,000 converts from seconds to milliseconds. Adding tc(14oct1582 00:00) accounts for the differing bases.

Function tc() returns the specified datetime as a Stata datetime, which is to say, the number of milliseconds between the specified datetime and 01jan1960 00:00:00.000. We need to add the difference between SPSS’s base and Stata’s base, which is tc(14oct1582 00:00) – tc(01jan1960 00:00), but tc(01jan1960) is definitionally 0, so that just leaves tc(14oct1582 00:00). tc(14oct1582), for your information, is -11,903,760,000,000.

SPSS dates are the same as SPSS datetimes, so to convert an SPSS date to a Stata date, we could type,

    . gen double statatime = (spssdate*1000) + tc(14oct1582 00:00)
    . gen statadate        = dofc(statatime)
    . format statadate %td
    . drop statatime

Function dofc() converts a Stata datetime to a Stata date. We can combine the above into,

    . gen statadate = dofc((spsstime*1000) + tc(14oct1582 00:00))
    . format statadate %td

 
R

R stores dates as days since 01jan1970. To convert to a Stata date,

    . gen statadate = rdate - td(01jan1970)
    . format statadate %td

Stata uses 01jan1960 as the base, R uses 01jan1970, so all you have to do is subtract the number of days between 01jan1970 and 01jan1960.

R stores datetimes as the number of UTC adjusted seconds since 01jan1970 00:00:00. UTC stands for Universal Time Coordinated. Rather than assuming 86,400 seconds/day, some UTC days have 86,401 seconds. Leap seconds are sometimes inserted into UTC days to keep the clock coordinated with the Earth’s rotation. Stata’s datetime %tC format is UTC time, which is to say, it accounts for these leap seconds. Thus, to convert R datetimes to Stata, you type

   . gen double statatime = rtime - tC(01jan1970 00:00)
   . format statatime %tC

Note the use of Stata’s tC() function rather than tc() to obtain the number of milliseconds between the differing bases. tc() returns the number of seconds since 01jan1960 00:00:00 assuming 86,400 seconds/day. tC() returns the number of seconds adjusted for leap seconds. In this case, it would not make a difference if we mistakenly typed tc() rather than tC() because no leap seconds were inserted between 1960 and 1970. Had the base year been 1980, however, the use of tC() would have been important. Nine extra seconds were were inserted between 01jan1970 and 01jan1980!

In many cases you may prefer using a time variable that ignores leap seconds. In that case, You can type

    . gen double statatime = cofC(rtime - tC(01jan1970 00:00))
    . format statatime %tc

 
Excel

Excel has used different date systems for different operating systems. Excel for Windows used the “1900 Date System”. Excel for Mac used the “1904 Date System”. More recently, Microsoft has been standardizing on the 1900 Date System.

If you have an Excel for Windows workbook, it is likely to be using 1900.

If you have an Excel for Mac workbook, it is likely to be using 1904, unless it came from a Windows workbook originally.

Anyway, both Excels can use either encoding. See http://support.microsoft.com/kb/214330 for more information and for instrbuctions on converting your workbook between date systems.

In any case, you are unlikely to encounter Excel numerically coded dates. If you cut-and-paste the spreadsheet into Stata’s Data editor, dates and datetimes paste as strings in human-readable form. If you use a conversion package, most know to convert the date for you.

 
Excel, 1900 date system

For dates on or after 01mar1900, Excel 1900 Date System stores dates as days since 30dec1899. To convert to a Stata date,

    . gen statadate = exceldate + td(30dec1899)
    . format statadate %td

Excel can store dates between 01jan1900 and 28feb1900, too, but the formula above will not handle those two months. See http://www.cpearson.com/excel/datetime.htm for more information.

For datetimes on or after 01may1900 00:00:00, Excel 1900 Date System stores datetimes as days plus fraction of day since 30dec1899 00:00:00. To convert with a one-second resolution to a Stata datetime,

    . gen double statatime = round((exceltime+td(30dec1899))*86400)*1000
    . format statatime %tc

 
Excel, 1904 date system

For dates on or after 01jan1904, Excel 1904 Date System stores dates as days since 01jan1904. To convert to a Stata date,

 
    . gen statadate = exceldate + td(01jan1904)
    . format statadate %td

For datetimes on or after 01jan1904 00:00:00, Excel 1904 Date System stores datetimes as days plus fraction of day since 01jan1904 00:00:00. To convert with a one-second resolution to a Stata datetime,

 
    . gen double statatime = round((exceltime+td(01jan1904))*86400)*1000
    . format statatime %tc

 
Open Office

Open Office uses the Excel, 1900 Date System.

 
Why Stata has two datetime encodings

We have just seen that most packages assume 86,400 seconds/day, but that one instead uses UTC time, in which days have 86,400 or 86,401 seconds, depending. Stata provides both datetime encodings, called %tc and %tC. That turned out to be convenient in translating times from other packages. Stata will even let you switch from one to the other using the cofC() and Cofc functions, so you know you should be asking, which should I use?

Stata’s %tc format assumes that there are 24*60*60*1,000 ms per day — 86,400 seconds per day — just as an atomic clock does. Atomic clocks count oscillations between the nucleus and electrons of an atom and thus provide a measurement of the real passage of time.

Time of day measurements have historically been based on astronomical observation, which is a fancy way of saying, based on looking at the sun. The sun should be at its highest point at noon, right? So however you kept track of time — falling grains of sand or a wound up spring — you periodically reset your clock and then went about your business. In olden times it was understood that the 60 seconds per minute, 60 minutes per hour, 24 hours per day, were theoretical goals that no mechanical device could reproduce accurately. These days, we have have more accurate definitions for measuring time. A second is 9,192,631,770 periods of the radiation corresponding to the transition between two levels of the ground state of caesium 133. Obviously we have better equipment than the ancients, so problem solved, right? Wrong. There are two problems. The formal definition of a second is just a little too short to match length of a day, and the Earth’s rotation is slowing down.

As a result, since 1972 leap seconds have been added to atomic clocks once or twice a year to keep time measurements in synchronization with the earth’s rotation. Unlike leap years, however, there is no formula for predicting when leap seconds will occur. The Earth may be on average slowing down, but there is a large random component to that. As a result, leap seconds are determined by committee and announced 6 months before they are inserted. Leap seconds are added, if necessary, on the end of the day on June 30 and December 31 of the year. The inserted times are designated as 23:59:60.

Unadjusted atomic clocks may accurately mark the passage of real time, but you need to understand that leap seconds are every bit as real as every other second of the year. Once a leap second is inserted, it ticks just like any other second and real things can happen during that tick.

You may have heard of terms such as GMT and UTC.

GMT is the old Greenwich Mean Time and is based on astronomical observation. GMT has been supplanted by UTC.

UTC stands for coordinated universal time and is measured by atomic clocks, occasionally corrected for leap seconds. UTC is derived from two other times, UT1 and TAI. UT1 is the mean solar time, with which UTC is kept in sync by the occasional addition of a leap second. TAI is the atomically measured pure time. TAI was set to GMT plus 10 seconds in 1958 and has been running unadjusted since then. Update 07 Jan 2010: TAI is a statistical combination of various atomic chronometers and even it has not ticked uniformly over its history; see http://www.ucolick.org/~sla/leapsecs/timescales.html and especially http://www.ucolick.org/~sla/leapsecs/dutc.html#TAI. (Thanks to Steve Allen of the UCO/Lick Observatory for correcting my understanding and for the reference.)

UNK is StataCorp’s term for the time standard most people use. UNK stands for unknowing. UNK is based on a recent time observation, probably UTC, and then just assuming that there are 86,400 seconds per day after that.

The UNK standard is adequate for many purposes, and in such cases, you will want to use %tc rather than the leap second-adjusted %tC encoding. If you are using computer-timestamped data, however, you need to find out whether the timestamping system accounted for leap-second adjustments. Problems can arise even if you do not care about losing or gaining a second here and there.

For instance, you may import timestamp values from other systems recorded in the number of milliseconds that have passed since some agreed upon date. If you choose the wrong encoding scheme, if you chose tc when you should choose %tC, or vice versa, more recent times will be off by 24 seconds.

To avoid such problems, you may decide to import and export data by using Human Readable Forms (HRF) such as “Fri Aug 18 14:05:36 CDT 2006″. This method has advantages, but for %tC (UTC) encoding, times such as 23:59:60 are possible. Some software will refuse to decode such times.

Stata refuses to decode 23:59:60 in the %tc encoding (function clock) and accepts it with %tC (function Clock()). When %tC function Clock() sees a time with a 60th second, Clock() verifies that the time corresponds to an official leap second. Thus, when translating from printable forms, try assuming %tc and check the result for missing values. If there are none, you can assume your use of %tc is valid. If there are missing values and they are due to leap seconds and not some other error, you must use %tC function Clock() to translate from HRF. After that, if you still want to work in %tc units, use function cofC() to translate %tC values into %tc.

If precision matters, the best way to process %tC data is simply to treat them that way. The inconvenience is that you cannot assume that there are 86,400 seconds per day. To obtain the duration between dates, you must subtract the two time values involved. The other difficulty has to do with dealing with dates in the future. Under the %tC (UTC) encoding, there is no set value for any date more than 6 months in the future.

 
Advice

Stata provides two datetime encodings:

  1. %tC, also known as UTC, which accounts for leap seconds, and
     

  2. %tc, which ignores them (it assumes 86,400 seconds/day).

Systems vary in how they treat time variables. My advice is,

  • If you obtain data from a system that accounts for leap seconds, import using Stata’s %tC.
    1. If you later need to export data to a system that does not account for leap seconds, use Stata’s cofC() function to translate time values before exporting.
       

    2. If you intend to tsset the time variable and the analysis will be at the second level or finer, just tsset the %tC variable, specifying the appropriate delta() if necessary, for example, delta(1000) for seconds.
       

    3. If you intend to tsset the time variable and the analysis will be at coarser than the second level (minute, hour, etc.), create a %tc variable from the %tC variable (generate double tctime = cofC(tCtime<) and tsset that, specifying the appropriate delta() if necessary. You must do that because, in a %tC variable, there are not necessarily 60 seconds in a minute; some minutes have 61 seconds.
  • If you obtain data from a system that ignores leap seconds, use Stata’s %tc.
    1. If you later need to export data to a system that does account for leap seconds, use Stata’s Cofc() function to translate time values.
       

    2. If you intend to tsset the time variable, just tsset it, specifying the appropriate delta().

Some users prefer always to use Stata’s %tc because those values are a little easier to work with. You can do that if

  • you do not mind having up to 1 second of error and
     

  • you do not import or export numerical values (clock ticks) from other systems that are using leap seconds, because then there could be nearly 30 seconds of accumulated error.

There are two things to remember if you use %tC variables:

  1. The number of seconds between two dates is a function of when the dates occurred. Five days from one date is not simply a matter of adding 5*24*60*60*1,000 ms. You might need to add another 1,000 ms. Three hundred and sixty-five days from now might require adding 1,000 or 2,000 ms. The longer the span, the more you might have to add. The best way to add durations to %tC variables is to extract the components, add to them, and then reconstruct from the numerical components.
     

  2. You cannot accurately predict datetimes more than six months into the future. We do not know what the %tC value will be of 25dec2026 00:00:00 because every year along the way, the International Earth Rotation Reference Systems Service (IERS) will twice announce whether there will be the insertion of a leap second.

You can help alleviate these inconveniences. Face west and throw rocks. The benefit will be only transitory if the rocks land back on Earth, so you need to throw them really hard. I know what you’re thinking, but this does not need to be a coordinated effort.

Categories: Data Management Tags: ,