Archive for the ‘Data Management’ Category

Using import excel with real world data

Stata 12’s new import excel command can help you easily import real-world Excel files into Stata. Excel files often contain header and footer information in the first few and last few rows of a sheet, and you may not want that information loaded. Also, the column labels used in the sheet are often not valid Stata variable names and therefore cannot be used as variable names directly. Both of these issues are easily solved with import excel.

Let’s start by looking at an Excel spreadsheet, metro_gdp.xls, that we downloaded from the Bureau of Economic Analysis website.

Microsoft Excel screenshot

 

As you can see, the first five rows of the Excel file contain a description of the data, and rows 374 through 381 contain footer notes. We don’t want to load these rows into Stata. import excel has a cellrange() option that can help us avoid unwanted information being loaded.

With cellrange(), you specify the upper left cell and the lower right cell (using standard Excel notation) of the area of data you want loaded. In the file metro_gdp.xls, we want all the data from column A row 6 (upper left cell) to column L row 373 (lower right cell) loaded into Stata. To do this, we type

. import excel metro_gdp.xls, cellrange(A6:L373) clear

In Stata, we open the Data Editor to inspect the loaded data.

Stata Data Editor

 

The first row of the data we loaded contained column labels. Because of these labels, import excel loaded all the data as strings. import excel again has an easy fix. We need to specify the firstrow option to tell import excel that the first row of data contains the variable names.

. import excel metro_gdp.xls, cellrange(A6:L373) firstrow clear

We again open the Data Editor to inspect the data.

Stata Data Editor

 

The data are now in the correct format, but we are missing the year column labels. Stata does not allow variable names that begin with a number, so import excel has to use the Excel column names (C, D, …) as variable names instead of 2001, 2002, …. One simple solution is to rename the column headers in Excel to something like y2001, y2002, etc., before loading. You can also do the renaming in Stata. import excel saves the values in the first row of data as variable labels so that the information is not lost. If we describe the data, we see all the column labels from the Excel file saved as variable labels.

. describe

Contains data
  obs:           367
 vars:            12
 size:        37,067
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
Fips            str5   %9s                    Fips
Area            str56  %56s                   Area
C               long   %10.0g                 2001
D               long   %10.0g                 2002
E               long   %10.0g                 2003
F               long   %10.0g                 2004
G               long   %10.0g                 2005
H               long   %10.0g                 2006
I               long   %10.0g                 2007
J               long   %10.0g                 2008
K               long   %10.0g                 2009
L               long   %10.0g                 2010
-------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

We want to grab the variable label for each variable by using the extended macro function :variable label varname, create a valid lowercase variable name from that label by using the strtoname() and lower() functions, and rename the variable to the new name by using rename. We can do this with a foreach loop.

foreach var of varlist _all {
        local label : variable label `var'             // grab the variable label, e.g., "2001"
        local new_name = lower(strtoname("`label'"))   // make a valid, lowercase name, e.g., "_2001"
        rename `var' `new_name'
}

Now when we describe our data, they look like this:

. describe

Contains data
  obs:           367
 vars:            12
 size:        37,067                          
-------------------------------------------------------------------------------
              storage  display     value      
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
fips            str5   %9s                    Fips
area            str56  %56s                   Area
_2001           long   %10.0g                 2001
_2002           long   %10.0g                 2002
_2003           long   %10.0g                 2003
_2004           long   %10.0g                 2004
_2005           long   %10.0g                 2005
_2006           long   %10.0g                 2006
_2007           long   %10.0g                 2007
_2008           long   %10.0g                 2008
_2009           long   %10.0g                 2009
_2010           long   %10.0g                 2010
-------------------------------------------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved

One last thing we might want to do is to rename the year variables from _20## to y20##, which we can easily accomplish with rename:

. rename (_*) (y*)

. describe

Contains data
  obs:           367
 vars:            12
 size:        37,067                          
-------------------------------------------------------------------------------
              storage  display     value      
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
fips            str5   %9s                    Fips
area            str56  %56s                   Area
y2001           long   %10.0g                 2001
y2002           long   %10.0g                 2002
y2003           long   %10.0g                 2003
y2004           long   %10.0g                 2004
y2005           long   %10.0g                 2005
y2006           long   %10.0g                 2006
y2007           long   %10.0g                 2007
y2008           long   %10.0g                 2008
y2009           long   %10.0g                 2009
y2010           long   %10.0g                 2010
-------------------------------------------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved

The next leap second will be on June 30th, maybe

Leap seconds are the extra seconds inserted every so often to keep precise atomic clocks better synchronized with the rotation of the Earth. Scheduled for June 30th is the extra second 23:59:60 inserted between 23:59:59 and 00:00:00. Or maybe not.

Tomorrow or Friday, a vote may be held at the International Telecommunication Union (ITU) meeting in Geneva to abolish the leap second from the definition of UTC (Coordinated Universal Time). That would mean StataCorp would not have to post an update to Stata to keep the %tC format working correctly.

As I’ve blogged before — scroll down to “Why Stata has two datetime encodings” in Using dates and times from other software — Stata supports both UTC time (%tC) and constant-86,400-seconds/day time (%tc). Stata does that because some data are collected using leap-second corrected time, and some uncorrected. Stata is unique or nearly unique in providing both time formats.
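
As a quick illustration (a sketch using the leap second that was inserted at the end of 2008; your output should look something like this), Stata’s Clock() function, which produces leap second-aware %tC values, accepts the time 23:59:60, while clock(), which produces %tc values, returns missing for it:

    . display %tC Clock("31dec2008 23:59:60", "DMYhms")
    31dec2008 23:59:60

    . display %tc clock("31dec2008 23:59:60", "DMYhms")
    .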

I read that Google does something very clever: they stretch the last second of the year out when a leap second occurs, so the data they collect do not end up with ugly times like 23:59:60 and can be more easily processed by software that assumes a constant 86,400 seconds per day.

The IT industry and a number of others, I gather, are pretty united about the benefits of scrapping the leap second.

The vote is predicted to go against continuing the leap second, according to The Economist magazine. The United States and France are for abolishing the leap second. Britain, Canada, and China are believed to be for continuing it. Some 192 countries will get to vote.

Whichever way the vote goes, I would like to remind readers of advice I previously offered to help alleviate the need for leap seconds: Face west and throw rocks. As I previously noted, the benefit will be transitory if the rocks land back on Earth, so you need to throw the rocks really hard. Having now thought more about this issue, a less strenuous way occurs to me: Push rocks downhill or carry them toward the poles, and preferably do both. These suggestions are designed to attack the real problem, which is that the Earth is currently rotating too slowly.


Merging data, part 2: Multiple-key merges

Multiple-key merges arise when more than one variable is required to uniquely identify the observations in your data. In Merging data, part 1, I discussed single-key merges such as

        . merge 1:1 personid using ...

In that discussion, each observation in the dataset could be uniquely identified on the basis of a single variable. In panel or longitudinal datasets, there are multiple observations on each person or thing and to uniquely identify the observations, we need at least two key variables, such as

        . merge 1:1 personid date using ...

In this dataset we have repeated observations on persons and, within person, the observations can be uniquely identified by the calendar date.

Just to fix ideas, let’s assume I have two datasets. The first, sample.dta, is the one of analytic interest to me. I have data on 737 persons. For each person, I have data recorded on the first and fifteenth of every month, from year 2000 to 2009. Overall, my dataset has 176,880 observations.

The second dataset contains additional information (variables) on the sample of interest. It has over 3,000 people in it and it covers a longer stretch of time. I’ve been told that most of my 737 people should be in this second dataset, but I’ve been warned that, due to data collection or data processing errors over the years, a small fraction will not be found.

“How many?” I asked Bob from data processing during a fictional conversation, my paranoia kicking in.

“I don’t know. Most. Ninety-nine percent. It’s just random stuff,” he replied, knowing how well I respond to the word random.

Let’s call this second set of data the payroll data, although if I can imagine fictional conversations, you can imagine the data are something else. They might be records from follow-up visits of a medical experiment.

In any case, I receive the data, and here is what happened when I merged the data with my sample:

        . use sample, clear

        . merge 1:1 personid date using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                         2,352  
                from master                     2,352  (_merge==1)
                from using                          0  (_merge==2)

            matched                           174,528  (_merge==3)
            -----------------------------------------

In my sample data, I have 174,528 + 2,352 = 176,880 observations. Of those, 174,528 matched, which is 98.7 percent. (The reason that the number of records from the using (payroll) data that were not matched is zero is that I specified option keep(master match), meaning I discarded the unmatched payroll records. Had I not, the number would have been in the low millions.)

For many in this situation, the story would stop right here. Not for me. I want to show you how to tear into multiple-key merges to reassure yourself that things really are as they appear. You realize, of course, that I manufactured this fictional data for this blog entry and I buried a little something that once we find it, would scare you if this were a real story. So I’ll tell you now, this story is loosely based on a real story.

Step 1: Following my own advice

In Merging data, part 1, I recommended that you merge on all common variables, not just the identification variables. This blog entry is not going to rehash the previous one, but I want to emphasize that everything I said there about single-key merges applies equally to multiple-key merges. These two datasets share a variable recording the division in which the employee works, so I include it among the match variables:

        . use sample, clear

        . merge 1:1 personid date division using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                         2,352  
                from master                     2,352  (_merge==1)
                from using                          0  (_merge==2)

            matched                           174,528  (_merge==3)
            -----------------------------------------

The output above matches the output when I merged on personid and date alone, so I do not appear to have a merge-gone-bad problem. These merged data are looking better and better.

Step 2: Merge on each key variable in isolation

Let’s imagine what could go wrong. Imagine that all the data for certain persons were missing, or that all the data for certain dates were missing. That might not be a problem, but it would certainly raise questions. Depending on the answers, it may be worth a footnote or concerning enough to return the data and ask for a refund.

Finding persons or dates that are entirely unmatched is a lot of work unless you know the following trick: Merge on one key variable at a time.

Let’s start with personid:

        . use sample, clear

        . sort personid

        . by personid: keep if _n==1           // <- don't skip this step
        (176143 observations deleted)

        . merge 1:m personid using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                             0
            matched                           174,528  (_merge==3)
            -----------------------------------------

The output above proves that payroll.dta contains data on every person that appears in sample.dta.

Let me explain. I began with my sample data and kept just one observation for every value of personid. I don't care which observation I keep; I just need to keep one and only one per person. Then I merged on personid, keeping (1) the records that match and (2) the records from the master that do not match. I have no interest in the resulting dataset; I just wanted to see the table merge would report. merge reports 174,528 matched observations and, more importantly, 0 unmatched observations from the master. Ergo, every value of personid that appears in sample.dta also appears in payroll.dta.

Had merge reported "not matched" = 2, that would mean there would have been two values of personid appearing in sample.dta that do not appear in payroll.dta. It would not have been an indictment of the data if two persons were not matched in their entirety, but I would certainly have looked into the issue. With the merged result in memory, I would have typed

        . list personid if _merge==1
          (output would appear) 

I would have written down the two personids list reported. Then I would have returned to my sample data and looked at the data I had on those two people:

        . use sample, clear

        . list if personid==735527
          (output would appear)

        . list if personid==29887
          (output would appear)

It might be that 735527 was with the company for only a short time, making the missing payroll records a believable random event. If 735527 had been with the company all ten years, however, I would be back on the phone seeking an explanation. (If these were medical data, you would certainly want to know how a person who never reported for a follow-up visit is known to still be alive after ten years.)

So much for personid. Let's do the same for date:

        . use sample, clear  
        . sort date
        . by date: keep if _n==1
        (176640 observations deleted)
        . merge 1:m date using payroll, keep(master match)
            Result                           # of obs.
            -----------------------------------------
            not matched                             0
            matched                           236,832  (_merge==3)
            -----------------------------------------

Every date that appears in sample.dta also appears in payroll.dta.

Finally, let's look at division:

        . use sample, clear

        . sort division

        . by division: keep if _n==1
        (output omitted)

        . merge 1:m division using payroll, keep(master match)
        (output omitted)

merge again reports 0 unmatched observations from the master: every division that appears in sample.dta also appears in payroll.dta.

These data are looking better and better.

If we had only two key variables, we would be done. We, however, are performing the full merge on three variables, namely personid, date, and division, and so there is one more set of comparisons we should examine.

Step 3: Merge on every pair of key variables

With three key variables, the possible pairs are (personid, date), (personid, division), and (division, date). We have already looked at (personid, date), so that just leaves (personid, division) and (division, date).

The method is the same as in Step 2 except that we type two variables where we previously typed one:

        . use sample, clear

        . sort personid division

        . by personid division: keep if _n==1
        (176143 observations deleted)

        . merge 1:m personid division using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                             0
            matched                           174,528  (_merge==3)
            -----------------------------------------

We discover that every personid-division combination that appears in sample.dta also appears in payroll.dta.

Last is (division, date):

        . use sample, clear

        . sort division date

        . by division date: keep if _n==1
        (175200 observations deleted)

        . merge 1:m division date using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                            24
                from master                        24  (_merge==1)
                from using                          0  (_merge==2)

            matched                           236,832  (_merge==3)
            -----------------------------------------

Surprise! Well, you're not surprised because I told you earlier we were going to find something, but if this were real life, you would be surprised after all these reassuring results.

We discover that there are 24 division-date combinations appearing in sample.dta that do not appear in payroll.dta. Let's look at the 24 missing combinations:

        . list division date if _merge==1

                +----------------------+
                | division        date |
                |----------------------|
          1129. |        5   01jan2007 |
          1130. |        5   15jan2007 |
          1131. |        5   01feb2007 |
          1132. |        5   15feb2007 |
          1133. |        5   01mar2007 |
                |----------------------|
          1134. |        5   15mar2007 |
          1135. |        5   01apr2007 |
          1136. |        5   15apr2007 |
          1137. |        5   01may2007 |
          1138. |        5   15may2007 |
                |----------------------|
          1139. |        5   01jun2007 |
          1140. |        5   15jun2007 |
          1141. |        5   01jul2007 |
          1142. |        5   15jul2007 |
          1143. |        5   01aug2007 |
                |----------------------|
          1144. |        5   15aug2007 |
          1145. |        5   01sep2007 |
          1146. |        5   15sep2007 |
          1147. |        5   01oct2007 |
          1148. |        5   15oct2007 |
                |----------------------|
          1149. |        5   01nov2007 |
          1150. |        5   15nov2007 |
          1151. |        5   01dec2007 |
          1152. |        5   15dec2007 |
                +----------------------+

If you look closely, you will notice that every payroll date in 2007 is listed. So what happened to the payroll records for division 5 in 2007? This may indeed be exactly the kind of random event that Bob had in mind during our fictional conversation. Somehow the company lost a little cluster of payroll records. The loss may mean nothing. Or it might be of critical importance. Imagine there's been an allegation that the company treats older workers poorly, and imagine that division 5 has the highest average age. Not random. Not random at all.
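
By the way, you need not rely on eyeballing the listing. With the merged result still in memory, the following assert (assuming division is numeric, as the listing suggests) completes silently only if every unmatched combination is division 5 in 2007:

        . assert division==5 & year(date)==2007 if _merge==1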

Step 4: Merge on every triplet of key variables

So much for the fictional story.

If we had four or more key variables, we would now need to merge on every triplet of key variables, and if we had five or more key variables, we then need to merge on every quadruplet of key variables, and if ...

Forget the story. Or remember it if it scares you. Data processing and paranoia make an excellent pairing. What's important is how easy it is to take complicated, multiple-key merges apart. I've never met anyone yet who knew this trick.


Merging data, part 1: Merges gone bad

Merging concerns combining datasets on the same observations to produce a result with more variables. We will call the datasets one.dta and two.dta.

When it comes to combining datasets, the alternative to merging is appending, which is combining datasets on the same variables to produce a result with more observations. Appending datasets is not the subject for today. But just to fix ideas, appending looks like this:

              +-------------------+
              | var1  var2  var3  |      one.dta
              +-------------------+
           1. | one.dta           |
           2. |                   |
            . |                   |
            . |                   |
              +-------------------+

                        +

              +-------------------+
              | var1  var2  var3  |      two.dta
              +-------------------+
           1. | two.dta           |
           2. |                   |
            . |                   |
              +-------------------+

                       =

              +-------------------+
              | var1  var2  var3  |
              +-------------------+
           1. |                   |    one.dta
           2. |                   |
            . |                   |
            . |                   |
              +                   +      +
        N1+1. |                   |    two.dta   appended
        N1+2. |                   |
            . |                   |
              +-------------------+

Merging looks like this:


      +-------------------+           +-----------+
      | var1  var2  var3  |           | var4 var5 |
      +-------------------+           +-----------+
   1. |                   |        1. |           |
   2. |                   |    +   2. |           |     =
    . |                   |         . |           |
    . |                   |         . |           |
      +-------------------+           +-----------+
        one.dta                         two.dta


                        +-------------------+-----------+
                        | var1  var2  var3    var4 var5 |
                        +-------------------------------+
                     1. |                               |
                     2. |                               |
                      . |                               |
                      . |                               |
                        +-------------------+-----------+
                          one.dta           + two.dta    merged

The matching of the two datasets — deciding which observations in one.dta are combined with which observations in two.dta — could be done simply on the observation numbers: Match one.dta observation 1 with two.dta observation 1, match one.dta observation 2 with two.dta observation 2, and so on. In Stata, you could obtain that result by typing

. use one, clear

. merge 1:1 using two

Never do this because it is too dangerous. You are merely assuming that observation 1 matches with observation 1, observation 2 matches with observation 2, and so on. What if you are wrong? If observation 2 in one.dta is Bob and observation 2 in two.dta is Mary, you will mistakenly combine the observations for Bob and Mary and, perhaps, never notice the mistake.

The better solution is to match the observations on equal values of an identification variable. This way, the observation with id=”Mary” is matched with the observation with id=”Mary”, id=”Bob” with id=”Bob”, id=”United States” with id=”United States”, and id=4934934193 with id=4934934193. In Stata, you do this by typing

. use one, clear

. merge 1:1 id using two

Things can still go wrong. For instance, id=”Bob” will not match id=”Bob ” (with the trailing blank), but if you expected all the observations to match, you will ultimately notice the mistake. Mistakenly unmatched observations tend to get noticed because of all the missing values they cause in subsequent calculations.

It is the mistakenly combined observations that can go unnoticed.

And that is the topic for today, mistakenly matched observations, or merges gone bad.

Observations are mistakenly combined more often than many researchers realize. I’ve seen it happen. I’ve seen it happen, be discovered later, and necessitate withdrawn results. You seriously need to consider the possibility that this could happen to you. Only three things are certain in this world: death, taxes, and merges gone bad.

I am going to assume that you are familiar with merging datasets both conceptually and practically; that you already know what 1:1, m:1, 1:m, and m:n mean; and that you know the role played by “key” variables such as ID. I am going to assume you are familiar with Stata’s merge command. If any of this is untrue, read [D] merge. Type help merge in Stata and click on [D] merge at the top to take you to the full PDF manuals. We are going to pick up where the discussion in [D] merge leaves off.

Detecting when merges go bad

As I said, the topic for today is merges gone bad, by which I mean producing a merged result with the wrong records combined. It is difficult to imagine that typing

. use one, clear

. merge 1:1 id using two

could produce such a result because, to be matched, the observations had to have equal values of the ID. Bob matched with Bob, Mary matched with Mary, and so on.

Right you are. There is no problem assuming the values in the id variable are correct and consistent between datasets. But what if id==4713 means Bob in one dataset and Mary in the other? That can happen if the id variable is simply wrong from the outset or if the id variable became corrupted in prior processing.

1. Use theory to check IDs if they are numeric

One way the id variable can become corrupted is if it is not stored properly or if it is read improperly. This can happen to both string and numeric variables, but right now, we are going to emphasize the numeric case.

Say the identification variable is Social Security number, an example of which is 888-88-8888. Social Security numbers are invariably stored in computers as 888888888, which is to say that they are run together and look a lot like the number 888,888,888. Sometimes they are even stored numerically. Say you have a raw data file containing perfectly valid Social Security numbers recorded in just this manner. Say you read the number as a float. Then 888888888 becomes 888888896, and so does every Social Security number between 888888865 and 888888927, some 63 in total. If Bob has Social Security number 888888869 and Mary has 888888921, and Bob appears in dataset one and Mary in dataset two, then Bob and Mary will be combined because they share the same rounded Social Security number.
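
You can see this rounding directly in Stata. The float() function rounds a value to float precision, and both of these Social Security numbers land on the same float (a quick check; the %12.0f format merely keeps display from switching to scientific notation):

. display %12.0f float(888888869)
   888888896

. display %12.0f float(888888921)
   888888896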

Always be suspicious of numeric ID variables stored numerically, not just those stored as floats.

When I read raw data and store the ID variables as numeric, I worry whether I have specified a storage type sufficient to avoid rounding. When I obtain data from other sources that contain numeric ID variables, I assume that the other source improperly stored the values until proven otherwise.

Perhaps you remember that 16,777,215 is the largest integer that can be stored precisely as a float and 9,007,199,254,740,991 is the largest that can be stored precisely as a double. I never do.

Instead, I ask Stata to show me the largest theoretical ID number in hexadecimal. For Social Security numbers, the largest is 999-99-9999, so I type

. inbase 16 999999999
3b9ac9ff

Stata’s inbase command converts decimal numbers to different bases. I learn that 999999999 base-10 is 3b9ac9ff base-16, but I don’t care about the details; I just want to know the number of base-16 digits required. 3b9ac9ff has 8 digits. It takes 8 base-16 digits to record 999999999. As you learned in How to read the %21x format, part 2, I do remember that doubles can record 13 base-16 digits and floats can record 5.75 digits (the 0.75 part being because the last digit must be even). If I didn’t remember those numbers, I would just display a number in %21x format and count the digits to the right of the binary point. Anyway, Social Security numbers can be stored in doubles because 8<13, the number of digits double provides, but not in floats because 8 is not < 5.75, the number of digits float provides.
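
For example (a quick check, not something you need to repeat), displaying the largest Social Security number in %21x format shows a mantissa with 8 nonzero base-16 digits, agreeing with the inbase result:

. display %21x 999999999
+1.dcd64ff800000X+01d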

If Social Security numbers contained 12 digits rather than 9, the largest would be

. inbase 16 999999999999
e8d4a50fff

which has 10 base-16 digits, and because 10<13, it would still fit into a double.

Anyway, if I discover that the storage type is insufficient to store the ID number, I know the ID numbers must be rounded.

2. Check uniqueness of IDs

I said that when I obtain data from other sources, I assume that the other source improperly stored the ID variables until proven otherwise. I should have said, until evidence accumulates to the contrary. Even if the storage type used is sufficient, I do not know what happened in previous processing of the data.

Here’s one way using datasets one.dta and two.dta to accumulate some of that evidence:

. use one, clear              // test 1
. sort id
. by id: assert _N==1

. use two, clear              // test 2
. sort id
. by id: assert _N==1

In these tests, I am verifying that the IDs really are unique in the two datasets that I have. Tests 1 and 2 are unnecessary when I plan later to merge 1:1, because the 1:1 part will cause Stata itself to check that the IDs are unique. Nevertheless, I run the tests. I do this because the datasets I merge are often subsets of the original data, and I want to use all the evidence I have to invalidate the claim that the ID variables really are unique.

Sometimes I receive datasets where it takes two variables to form a unique ID. Perhaps I receive data on persons over time, along with the claim that the ID variable is name. The documentation also notes that variable date records when the observation was made. Thus, to uniquely identify each of the observations requires both name and date, and I type

. sort name date
. by name date: assert _N==1

I am not suspicious of only datasets I receive. I run this same test on datasets I create.
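
By the way, Stata’s isid command performs the same uniqueness check in a single line; it exits with an error if the listed variables do not uniquely identify the observations. It is a shortcut for, not a replacement of, the reasoning above:

. isid id            // same as sort id; by id: assert _N==1
. isid name date     // for the persons-over-time data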

3. Merge on all common variables

At this point, I know the ID variable(s) are unique in each dataset. Now I consider the idea that the ID variables are inconsistent across datasets, which is to say that Bob in one dataset, however he is identified, means Mary in the other. Detecting such problems is always problematic, but not nearly as problematic as you might guess.

It is rare that the datasets I need to merge have no variables in common except the ID variable. If the datasets are on persons, perhaps both datasets contain each person’s sex. In that case, I could merge the two datasets and verify that the sex is the same in both. Actually, I can do something easier than that: I can add variable sex to the key variables of the merge:

. use one, clear
. merge 1:1 id sex using two

Assume I have a valid ID variable. Then adding variable sex does not affect the outcome of the merge because sex is constant within id. I obtain the same results as typing merge 1:1 id using two.

Now assume the id variable is invalid. Compared with the results of merge 1:1 id using two, Bob will no longer match with Mary even if they have the same ID. Instead I will obtain separate, unmatched observations for Bob and Mary in the merged data. Thus to complete the test that there are no such mismatches, I must verify that the id variable is unique in the merged result. The complete code reads

. use one, clear
. merge 1:1 id sex using two
. sort id
. by id: assert _N==1

And now you know why in test 2 I checked the uniqueness of ID within dataset by hand rather than depending on merge 1:1. The 1:1 merge I just performed is on id and sex, and thus merge does not check the uniqueness of ID in each dataset. I checked by hand the uniqueness of ID in each dataset and then checked the uniqueness of the result by hand, too.

Passing the above test does not prove that the ID variable is consistent and thus that the merge is correct, but if the assertion is false, I know with certainty either that I have an invalid ID variable or that sex is miscoded in one of the datasets. If my data have roughly equal numbers of males and females, then the test has a 50 percent chance of detecting a mismatched pair of observations, such as Bob and Mary. If I have just 10 mismatched observations, I have a 1 - 0.5^10 = 0.9990 probability of detecting the problem.
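
If you like, you can let Stata do that arithmetic:

. display 1 - 0.5^10
.99902344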

I should warn you that if you want to keep just the matched observations, do not perform the merge by coding merge 1:1 id sex using two, keep(matched). You must keep the unmatched observations to perform the final part of the test, namely, that the ID numbers are unique. Then you can drop the unmatched observations.

. use one, clear
. merge 1:1 id sex using two
. sort id
. by id: assert _N==1
. keep if _merge==3

There may be more than one variable that you expect to be the same in combined observations. A convenient feature of this test is that you can add as many expected-to-be-constant variables to merge's keylist as you wish:

. use one, clear
. merge 1:1 id sex hiredate groupnumber using two
. sort id
. by id: assert _N==1
. keep if _merge==3

It is rare that there is not at least one variable other than the ID variable that is expected to be equal, but it does happen. Even if you have expected-to-be-constant variables, they may not work as well in detecting problems as variable sex in the example above. The distribution of the variable matters. If your data are of people known to be alive in 1980 and the known-to-be-constant variable is whether born after 1900, even mismatched observations would be likely to have the same value of the variable because most people alive in 1980 were born after 1900.

4. Look at a random sample

This test is weak, but you should do it anyway, if only because it’s so easy. List some of the combined observations and look at them.

. list in 1/5

Do the combined results look like they go together?

By the way, the right way to do this is

. gen u = uniform()
. sort u
. list in 1/5
. drop u

You do not want to look at the first observations because, having small values of ID, they are probably not representative. However IDs are assigned, the process is unlikely to be randomized. Persons with low values of ID will be younger, or older; or healthier, or sicker; or ….

5. Look at a nonrandom sample

You just merged two datasets, so obviously you did that because you needed the variables and those variables are somehow related to the existing variables. Perhaps your data is on persons, and you combined the 2009 data with the 2010 data. Perhaps your data is on countries, and you added export data to your import data. Whatever you just added, it is not random. If it were, you could have saved yourself time by simply generating the new variables containing random numbers.

So generate an index that measures a new variable in terms of an old one, such as

. gen diff = income2010 - income2009

or

. gen diff = exports - imports

Then sort on the variable and look at the observations containing the most outlandish values of your index:

. sort diff
. list in  1/5
. list in -5/l

These are the observations most likely to be mistakenly combined. Do you believe those observations were combined correctly?

Conclusion

I admit I am not suspicious of every merge I perform. I have built up trust over time in datasets that I have worked with previously. Even so, my ability to make errors is equal to yours, and even with trustworthy datasets, I can introduce problems long before I get to the merge. You need to carefully consider the consequences of a mistake. I do not know anyone who performs merges who has not performed a merge gone bad. The question is whether he or she detected it. I hope so.


Graphs, maps, and geocoding

Jim Hufford, Esq. had his first Stata lesson: “This is going to be awesome when I understand what all those little letters and things mean.”

Along those lines — awesome — Jim may want to see these nice Stata scatterplots from the “wannabe economists of the Graduate Institute of International and Development Studies in Geneva” at Rigotnomics.

If you want to graph data onto maps using Stata — and see another awesome graph — see Mitch Abdon’s “Fun with maps in Stata” over at the Stata Daily.

And if you’re interested in geocoding to obtain latitudes and longitudes from human-readable addresses or locations, see Adam Ozimek’s “Computers are taking our jobs: Stata nerds only edition” over at Modeled Behavior and see the related Stata Journal article “Stata utilities for geocoding and generating travel time and travel distance information” by Adam Ozimek and Daniel Miles.

Using dates and times from other software

Most software stores dates and times numerically, as durations from some sentinel date, but they differ on the sentinel date and on the units in which the duration is stored. Stata stores dates as the number of days since 01jan1960, and datetimes as the number of milliseconds since 01jan1960 00:00:00.000. January 3, 2011 is stored as 18,630, and 2pm on January 3 is stored as 1,609,682,400,000. Other packages use different choices for bases and units.
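
You can verify those two values from within Stata by using the td() and tc() pseudofunctions (a quick check; the %15.0f format merely keeps display from switching to scientific notation):

    . display td(03jan2011)
    18630

    . display %15.0f tc(03jan2011 14:00)
      1609682400000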

It sometimes happens that data you import into Stata from other software arrive with a numeric variable recording a date or datetime in the other software’s encoding. It is usually possible to adjust the numeric date or datetime values to the sentinel date and units that Stata uses. Below are conversion rules for SAS, SPSS, R, Excel, and Open Office.

 
SAS

SAS stores dates as the number of days since 01jan1960, the same as Stata:

    . gen statadate = sasdate
    . format statadate %td

SAS stores datetimes as the number of seconds since 01jan1960 00:00:00, assuming 86,400 seconds/day. Thus, all that’s necessary is to multiply SAS datetimes by 1,000 and attach a %tc format to the result,

    . gen double statatime = (sastime*1000)
    . format statatime %tc

It is important that variables containing SAS datetimes, such as sastime above, be imported as doubles into Stata.

 
SPSS

SPSS stores both dates and datetimes as the number of seconds since 14oct1582 00:00:00, assuming 86,400 seconds/day. To convert SPSS datetimes to Stata datetimes, type

    . gen double statatime = (spsstime*1000) + tc(14oct1582 00:00)
    . format statatime %tc

Multiplying by 1,000 converts from seconds to milliseconds. Adding tc(14oct1582 00:00) accounts for the differing bases.

Function tc() returns the specified datetime as a Stata datetime, which is to say, the number of milliseconds between the specified datetime and 01jan1960 00:00:00.000. We need to add the difference between SPSS’s base and Stata’s base, which is tc(14oct1582 00:00) – tc(01jan1960 00:00), but tc(01jan1960) is definitionally 0, so that just leaves tc(14oct1582 00:00). tc(14oct1582), for your information, is -11,903,760,000,000.

SPSS dates are the same as SPSS datetimes, so to convert an SPSS date to a Stata date, we could type,

    . gen double statatime = (spssdate*1000) + tc(14oct1582 00:00)
    . gen statadate        = dofc(statatime)
    . format statadate %td
    . drop statatime

Function dofc() converts a Stata datetime to a Stata date. We can combine the above into,

    . gen statadate = dofc((spssdate*1000) + tc(14oct1582 00:00))
    . format statadate %td

 
R

R stores dates as days since 01jan1970. To convert to a Stata date,

    . gen statadate = rdate - td(01jan1970)
    . format statadate %td

Stata uses 01jan1960 as the base, R uses 01jan1970, so all you have to do is subtract the number of days between 01jan1970 and 01jan1960.

R stores datetimes as the number of UTC-adjusted seconds since 01jan1970 00:00:00. UTC stands for Coordinated Universal Time. Rather than assuming 86,400 seconds/day, some UTC days have 86,401 seconds. Leap seconds are sometimes inserted into UTC days to keep the clock coordinated with the Earth’s rotation. Stata’s datetime %tC format is UTC time, which is to say, it accounts for these leap seconds. Thus, to convert R datetimes to Stata, you type

   . gen double statatime = rtime - tC(01jan1970 00:00)
   . format statatime %tC

Note the use of Stata’s tC() function rather than tc() to obtain the number of milliseconds between the differing bases. tc() returns the number of milliseconds since 01jan1960 00:00:00 assuming 86,400 seconds/day. tC() returns the number of milliseconds adjusted for leap seconds. In this case, it would not make a difference if we mistakenly typed tc() rather than tC() because no leap seconds were inserted between 1960 and 1970. Had the base year been 1980, however, the use of tC() would have been important. Nine extra seconds were inserted between 01jan1970 and 01jan1980!

In many cases, you may prefer using a time variable that ignores leap seconds. In that case, you can type

    . gen double statatime = cofC(rtime - tC(01jan1970 00:00))
    . format statatime %tc

 
Excel

Excel has used different date systems for different operating systems. Excel for Windows used the “1900 Date System”. Excel for Mac used the “1904 Date System”. More recently, Microsoft has been standardizing on the 1900 Date System.

If you have an Excel for Windows workbook, it is likely to be using 1900.

If you have an Excel for Mac workbook, it is likely to be using 1904, unless it came from a Windows workbook originally.

Anyway, both Excels can use either encoding. See http://support.microsoft.com/kb/214330 for more information and for instructions on converting your workbook between date systems.

In any case, you are unlikely to encounter Excel numerically coded dates. If you cut-and-paste the spreadsheet into Stata’s Data editor, dates and datetimes paste as strings in human-readable form. If you use a conversion package, most know to convert the date for you.

 
Excel, 1900 date system

For dates on or after 01mar1900, Excel 1900 Date System stores dates as days since 30dec1899. To convert to a Stata date,

    . gen statadate = exceldate + td(30dec1899)
    . format statadate %td

Excel can store dates between 01jan1900 and 28feb1900, too, but the formula above will not handle those two months. See http://www.cpearson.com/excel/datetime.htm for more information.

For datetimes on or after 01mar1900 00:00:00, Excel 1900 Date System stores datetimes as days plus fraction of day since 30dec1899 00:00:00. To convert with a one-second resolution to a Stata datetime,

    . gen double statatime = round((exceltime+td(30dec1899))*86400)*1000
    . format statatime %tc

 
Excel, 1904 date system

For dates on or after 01jan1904, Excel 1904 Date System stores dates as days since 01jan1904. To convert to a Stata date,

 
    . gen statadate = exceldate + td(01jan1904)
    . format statadate %td

For datetimes on or after 01jan1904 00:00:00, Excel 1904 Date System stores datetimes as days plus fraction of day since 01jan1904 00:00:00. To convert with a one-second resolution to a Stata datetime,

 
    . gen double statatime = round((exceltime+td(01jan1904))*86400)*1000
    . format statatime %tc

 
Open Office

Open Office uses the Excel, 1900 Date System.

 
Why Stata has two datetime encodings

We have just seen that most packages assume 86,400 seconds/day but that one instead uses UTC time, in which days have 86,400 or 86,401 seconds, depending. Stata provides both datetime encodings, called %tc and %tC. That turned out to be convenient when translating times from other packages. Stata will even let you switch from one to the other using the cofC() and Cofc() functions, so you know you should be asking: which should I use?

Stata’s %tc format assumes that there are 24*60*60*1,000 ms per day — 86,400 seconds per day — just as an atomic clock does. Atomic clocks count oscillations between the nucleus and electrons of an atom and thus provide a measurement of the real passage of time.

Time-of-day measurements have historically been based on astronomical observation, which is a fancy way of saying, based on looking at the sun. The sun should be at its highest point at noon, right? So however you kept track of time — falling grains of sand or a wound-up spring — you periodically reset your clock and then went about your business. In olden times it was understood that the 60 seconds per minute, 60 minutes per hour, and 24 hours per day were theoretical goals that no mechanical device could reproduce accurately. These days, we have more accurate definitions for measuring time. A second is 9,192,631,770 periods of the radiation corresponding to the transition between two levels of the ground state of caesium 133. Obviously we have better equipment than the ancients, so problem solved, right? Wrong. There are two problems. The formal definition of a second is just a little too short to match the length of a day, and the Earth’s rotation is slowing down.

As a result, since 1972 leap seconds have been added to atomic clocks once or twice a year to keep time measurements in synchronization with the Earth’s rotation. Unlike leap years, however, there is no formula for predicting when leap seconds will occur. The Earth may on average be slowing down, but there is a large random component to that. As a result, leap seconds are determined by committee and announced 6 months before they are inserted. Leap seconds are added, if necessary, at the end of the day on June 30 or December 31. The inserted second is designated 23:59:60.

Unadjusted atomic clocks may accurately mark the passage of real time, but you need to understand that leap seconds are every bit as real as every other second of the year. Once a leap second is inserted, it ticks just like any other second and real things can happen during that tick.

You may have heard of terms such as GMT and UTC.

GMT is the old Greenwich Mean Time and is based on astronomical observation. GMT has been supplanted by UTC.

UTC stands for coordinated universal time and is measured by atomic clocks, occasionally corrected for leap seconds. UTC is derived from two other times, UT1 and TAI. UT1 is the mean solar time, with which UTC is kept in sync by the occasional addition of a leap second. TAI is the atomically measured pure time. TAI was set to GMT plus 10 seconds in 1958 and has been running unadjusted since then. Update 07 Jan 2010: TAI is a statistical combination of various atomic chronometers and even it has not ticked uniformly over its history; see http://www.ucolick.org/~sla/leapsecs/timescales.html and especially http://www.ucolick.org/~sla/leapsecs/dutc.html#TAI. (Thanks to Steve Allen of the UCO/Lick Observatory for correcting my understanding and for the reference.)

UNK is StataCorp’s term for the time standard most people use. UNK stands for unknowing. UNK is based on a recent time observation, probably UTC, and then just assuming that there are 86,400 seconds per day after that.

The UNK standard is adequate for many purposes, and in such cases, you will want to use %tc rather than the leap second-adjusted %tC encoding. If you are using computer-timestamped data, however, you need to find out whether the timestamping system accounted for leap-second adjustments. Problems can arise even if you do not care about losing or gaining a second here and there.

For instance, you may import timestamp values from other systems recorded as the number of milliseconds that have passed since some agreed-upon date. If you choose the wrong encoding scheme, %tc when you should have chosen %tC or vice versa, more recent times will be off by 24 seconds.

To avoid such problems, you may decide to import and export data by using Human Readable Forms (HRF) such as “Fri Aug 18 14:05:36 CDT 2006”. This method has advantages, but for %tC (UTC) encoding, times such as 23:59:60 are possible. Some software will refuse to decode such times.

Stata refuses to decode 23:59:60 in the %tc encoding (function clock()) and accepts it with %tC (function Clock()). When the %tC function Clock() sees a time with a 60th second, Clock() verifies that the time corresponds to an official leap second. Thus, when translating from printable forms, try assuming %tc and check the result for missing values. If there are none, you can assume your use of %tc is valid. If there are missing values and they are due to leap seconds and not some other error, you must use the %tC function Clock() to translate from HRF. After that, if you still want to work in %tc units, use function cofC() to translate the %tC values into %tc.
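
Here is what that workflow might look like in practice. This is only a sketch: timestr stands for a hypothetical string variable holding your human-readable timestamps, and the mask "DMYhms" stands for whatever matches your data:

    . gen double t = clock(timestr, "DMYhms")      // try %tc first
    . count if missing(t) & !missing(timestr)      // any failures? perhaps leap seconds
    . gen double T = Clock(timestr, "DMYhms")      // decode as %tC instead
    . format T %tC
    . gen double t2 = cofC(T)                      // back to %tc units, if that is what you want
    . format t2 %tc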

If precision matters, the best way to process %tC data is simply to treat them that way. The inconvenience is that you cannot assume that there are 86,400 seconds per day. To obtain the duration between dates, you must subtract the two time values involved. The other difficulty has to do with dealing with dates in the future. Under the %tC (UTC) encoding, there is no set value for any date more than 6 months in the future.

 
Advice

Stata provides two datetime encodings:

  1. %tC, also known as UTC, which accounts for leap seconds, and
     

  2. %tc, which ignores them (it assumes 86,400 seconds/day).

Systems vary in how they treat time variables. My advice is,

  • If you obtain data from a system that accounts for leap seconds, import using Stata’s %tC.
    1. If you later need to export data to a system that does not account for leap seconds, use Stata’s cofC() function to translate time values before exporting.
       

    2. If you intend to tsset the time variable and the analysis will be at the second level or finer, just tsset the %tC variable, specifying the appropriate delta() if necessary, for example, delta(1000) for seconds.
       

    3. If you intend to tsset the time variable and the analysis will be at coarser than the second level (minute, hour, etc.), create a %tc variable from the %tC variable (generate double tctime = cofC(tCtime)) and tsset that, specifying the appropriate delta() if necessary. You must do that because, in a %tC variable, there are not necessarily 60 seconds in a minute; some minutes have 61 seconds.
  • If you obtain data from a system that ignores leap seconds, use Stata’s %tc.
    1. If you later need to export data to a system that does account for leap seconds, use Stata’s Cofc() function to translate time values.
       

    2. If you intend to tsset the time variable, just tsset it, specifying the appropriate delta().

Some users prefer always to use Stata’s %tc because those values are a little easier to work with. You can do that if

  • you do not mind having up to 1 second of error and
     

  • you do not import or export numerical values (clock ticks) from other systems that are using leap seconds, because then there could be nearly 30 seconds of accumulated error.

There are two things to remember if you use %tC variables:

  1. The number of seconds between two dates is a function of when the dates occurred. Five days from one date is not simply a matter of adding 5*24*60*60*1,000 ms. You might need to add another 1,000 ms. Three hundred and sixty-five days from now might require adding 1,000 or 2,000 ms. The longer the span, the more you might have to add. The best way to add durations to %tC variables is to extract the components, add to them, and then reconstruct from the numerical components; see the sketch after this list.
     

  2. You cannot accurately predict datetimes more than six months into the future. We do not know what the %tC value of 25dec2026 00:00:00 will be because, every year along the way, the International Earth Rotation and Reference Systems Service (IERS) will twice announce whether a leap second will be inserted.
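
For example, here is one way to add five days to a %tC variable, say T, by extracting the components and rebuilding the result with Cmdyhms(). This is only a sketch; it assumes none of the original times falls exactly on a leap second (23:59:60), a case that would need extra care:

    . gen d = dofC(T) + 5                 // shift the date part by five days
    . gen double T5 = Cmdyhms(month(d), day(d), year(d), hhC(T), mmC(T), ssC(T))
    . format T5 %tC
    . drop d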

You can help alleviate these inconveniences. Face west and throw rocks. The benefit will be only transitory if the rocks land back on Earth, so you need to throw them really hard. I know what you’re thinking, but this does not need to be a coordinated effort.


Connection string support added to odbc command

Stata’s odbc command allows you to import data from and export data to any ODBC data source on your computer. ODBC is a standardized way for applications to read data from and write data to different data sources such as databases and spreadsheets.

Until now, before you could use the odbc command, you had to add a named data source (DSN) to the computer via the ODBC Data Source Administrator. If you did not have administrator privileges on your computer, you could not do this.

In the update to Stata 11 released 4 November 2010, a new option, connectionstring(), was added to the odbc command. This option allows you to specify an ODBC data source on the fly using an ODBC connection string instead of having to first add a data source (DSN) to the computer. A connection string lets you specify all necessary parameters to establish a connection between Stata and the ODBC source. Connection strings have a standard syntax for all drivers but there are also driver-specific keyword/value pairs that you can specify. The three standard things that you will probably need in a connection string are DRIVER, SERVER, and DATABASE. For example,

odbc load, … ///
connectionstring("DRIVER={SQL Server};SERVER=myserver;DATABASE=db;")

If you also need to specify a username and password to get access to your database you would type

odbc load, … ///
conn("DRIVER={SQL Server};SERVER=server;DATABASE=db;UID=id;PWD=pwd;")

Again, there are driver-specific keyword/value pairs you can add to the connection string. You can search the Internet for “connection string” plus your database name to find what other options you can specify in the connection string. Just remember to separate each keyword/value pair with a semicolon. You can read more about connection string syntax on Microsoft’s website.
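
Putting it together, a complete command might look like the following. This is a sketch: the table name employees and the server, database, user, and password values are placeholders for whatever your ODBC driver and database require.

odbc load, table("employees") clear ///
    connectionstring("DRIVER={SQL Server};SERVER=myserver;DATABASE=db;UID=myid;PWD=mypwd;")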

To get this capability in your copy of Stata 11, simply type update all and follow the instructions to complete the update. You can then type help odbc to read more about the connectionstring() option.