Home > Data Management > Merging data, part 1: Merges gone bad

Merging data, part 1: Merges gone bad

Merging concerns combining datasets on the same observations to produce a result with more variables. We will call the datasets one.dta and two.dta.

When it comes to combining datasets, the alternative to merging is appending, which is combining datasets on the same variables to produce a result with more observations. Appending datasets is not the subject for today. But just to fix ideas, appending looks like this:

              +-------------------+
              | var1  var2  var3  |      one.dta
              +-------------------+
           1. | one.dta           |
           2. |                   |
            . |                   |
            . |                   |
              +-------------------+

                        +

              +-------------------+
              | var1  var2  var3  |      two.dta
              +-------------------+
           1. | two.dta           |
           2. |                   |
            . |                   |
              +-------------------+

                       =

              +-------------------+
              | var1  var2  var3  |
              +-------------------+
           1. |                   |    one.dta
           2. |                   |
            . |                   |
            . |                   |
              +                   +      +
        N1+1. |                   |    two.dta   appended
        N2+2. |                   |
            . |                   |
              +-------------------+

Merging looks like this:


      +-------------------+           +-----------+
      | var1  var2  var3  |           | var4 var5 |
      +-------------------+           +-----------+
   1. |                   |        1. |           |
   2. |                   |    +   2. |           |     =
    . |                   |         . |           |
    . |                   |         . |           |
      +-------------------+           +-----------+
        one.dta                         two.dta


                        +-------------------+-----------+
                        | var1  var2  var3    var4 var5 |
                        +-------------------------------+
                     1. |                               |
                     2. |                               |
                      . |                               |
                      . |                               |
                        +-------------------+-----------+
                          one.dta           + two.dta    merged

The matching of the two datasets — deciding which observations in one.dta are combined with which observations in two.dta — could be done simply on the observation numbers: Match one.dta observation 1 with two.dta observation 1, match one.dta observation 2 with two.dta observation 2, and so on. In Stata, you could obtain that result by typing

. use one, clear

. merge 1:1 using two

Never do this because it is too dangerous. You are merely assuming that observation 1 matches with observation 1, observation 2 matches with observation 2, and so on. What if you are wrong? If observation 2 in one.dta is Bob and observation 2 in two.dta is Mary, you will mistakenly combine the observations for Bob and Mary and, perhaps, never notice the mistake.

The better solution is to match the observations on equal values of an identification variable. This way, the observation with id=”Mary” is matched with the observation with id=”Mary”, id=”Bob” with id=”Bob”, id=”United States” with id=”United States”, and id=4934934193 with id=4934934193. In Stata, you do this by typing

. use one, clear

. merge 1:1 id using two

Things can still go wrong. For instance, id=”Bob” will not match id=”Bob ” (with the trailing blank), but if you expected all the observations to match, you will ultimately notice the mistake. Mistakenly unmatched observations tend to get noticed because of all the missing values they cause in subsequent calculations.

It is the mistakenly combined observations that can go unnoticed.

And that is the topic for today, mistakenly matched observations, or merges gone bad.

Observations are mistakenly combined more often than many researchers realize. I’ve seen it happen. I’ve seen it happen, be discovered later, and necessitate withdrawn results. You seriously need to consider the possibility that this could happen to you. Only three things are certain in this world: death, taxes, and merges gone bad.

I am going to assume that you are familiar with merging datasets both conceptually and practically; that you already know what 1:1, m:1, 1:m, and m:n mean; and that you know the role played by “key” variables such as ID. I am going to assume you are familiar with Stata’s merge command. If any of this is untrue, read [D] merge. Type help merge in Stata and click on [D] merge at the top to take you to the full PDF manuals. We are going to pick up where the discussion in [D] merge leaves off.

Detecting when merges go bad

As I said, the topic for today is merges gone bad, by which I mean producing a merged result with the wrong records combined. It is difficult to imagine that typing

. use one, clear

. merge 1:1 id using two

could produce such a result because, to be matched, the observations had to have equal values of the ID. Bob matched with Bob, Mary matched with Mary, and so on.

Right you are. There is no problem assuming the values in the id variable are correct and consistent between datasets. But what if id==4713 means Bob in one dataset and Mary in the other? That can happen if the id variable is simply wrong from the outset or if the id variable became corrupted in prior processing.

1. Use theory to check IDs if they are numeric

One way the id variable can become corrupted is if it is not stored properly or if it is read improperly. This can happen to both string and numeric variables, but right now, we are going to emphasize the numeric case.

Say the identification variable is Social Security number, an example of which is 888-88-8888. Social Security numbers are invariably stored in computers as 888888888, which is to say that they are run together and look a lot like the number 888,888,888. Sometimes they are even stored numerically. Say you have a raw data file containing perfectly valid Social Security numbers recorded in just this manner. Say you read the number as a float. Then 888888888 becomes 888888896, and so does every Social Security number between 888888865 and 888888927, some 63 in total. If Bob has Social Security number 888888869 and Mary has 888888921, and Bob appears in dataset one and Mary in dataset two, then Bob and Mary will be combined because they share the same rounded Social Security number.

Always be suspicious of numeric ID variables stored numerically, not just those stored as floats.

When I read raw data and store the ID variables as numeric, I worry whether I have specified a storage type sufficient to avoid rounding. When I obtain data from other sources that contain numeric ID variables, I assume that the other source improperly stored the values until proven otherwise.

Perhaps you remember that 16,775,215 is the largest integer that can be stored precisely as a float and 9,007,199,254,740,991 is the largest that can be stored precisely as a double. I never do.

Instead, I ask Stata to show me the largest theoretical ID number in hexadecimal. For Social Security numbers, the largest is 999-99-9999, so I type

. inbase 16 999999999
3b9ac9ff

Stata’s inbase command converts decimal numbers to different bases. I learn that 999999999 base-10 is 3b9ac9ff base-16, but I don’t care about the details; I just want to know the number of base-16 digits required. 3b9ac9ff has 8 digits. It takes 8 base-16 digits to record 999999999. As you learned in How to read the %21x format, part 2, I do remember that doubles can record 13 base-16 digits and floats can record 5.75 digits (the 0.75 part being because the last digit must be even). If I didn’t remember those numbers, I would just display a number in %21x format and count the digits to the right of the binary point. Anyway, Social Security numbers can be stored in doubles because 8<13, the number of digits double provides, but not in floats because 8 is not < 5.75, the number of digits float provides.

If Social Security numbers contained 12 digits rather than 9, the largest would be

. inbase 16 999999999999
38d4a50fff

which has 10 base-16 digits, and because 10<13, it would still fit into a double.

Anyway, if I discover that the storage type is insufficient to store the ID number, I know the ID numbers must be rounded.

2. Check uniqueness of IDs

I said that when I obtain data from other sources, I assume that the other source improperly stored the ID variables until proven otherwise. I should have said, until evidence accumulates to the contrary. Even if the storage type used is sufficient, I do not know what happened in previous processing of the data.

Here’s one way using datasets one.dta and two.dta to accumulate some of that evidence:

. use one, clear              // test 1
. sort id
. by id: assert _N==1

. use two, clear              // test 2
. sort id . by id: assert _N==1 

In these tests, I am verifying that the IDs really are unique in the two datasets that I have. Tests 1 and 2 are unnecessary when I plan later to merge 1:1 because the 1:1 part will cause Stata itself to check that the IDs are unique. Nevertheless, I run the tests. I do this because the datasets I merge are often subsets of the original data, and I want to use all the evidence I have to invalidate the claim that the ID variables really are unique.Sometimes I receive datasets where it takes two variables to make sure I am calling a unique ID. Perhaps I receive data on persons over time, along with the claim that the ID variable is name. The documentation also notes that variable date records when the observation was made. Thus, to uniquely identify each of the observations requires both name and date, and I type

. sort name date
. by name date: assert _N==1

I am not suspicious of only datasets I receive. I run this same test on datasets I create.

3. Merge on all common variables

At this point, I know the ID variable(s) are unique in each dataset. Now I consider the idea that the ID variables are inconsistent across datasets, which is to say that Bob in one dataset, however he is identified, means Mary in the other. Detecting such problems is always problematic, but not nearly as problematic as you might guess.

It is rare that the datasets I need to merge have no variables in common except the ID variable. If the datasets are on persons, perhaps both datasets contain each person’s sex. In that case, I could merge the two datasets and verify that the sex is the same in both. Actually, I can do something easier than that: I can add variable sex to the key variables of the merge:

. use one, clear
. merge 1:1 id sex using two

Assume I have a valid ID variable. Then adding variable sex does not affect the outcome of the merge because sex is constant within id. I obtain the same results as typing merge 1:1 id using two.

Now assume the id variable is invalid. Compared with the results of merge 1:1 id using two, Bob will no longer match with Mary even if they have the same ID. Instead I will obtain separate, unmatched observations for Bob and Mary in the merged data. Thus to complete the test that there are no such mismatches, I must verify that the id variable is unique in the merged result. The complete code reads

. use one, clear
. merge 1:1 id sex using two
. sort id
. by id: assert _N==1

And now you know why in test 2 I checked the uniqueness of ID within dataset by hand rather than depending on merge 1:1. The 1:1 merge I just performed is on id and sex, and thus merge does not check the uniqueness of ID in each dataset. I checked by hand the uniqueness of ID in each dataset and then checked the uniqueness of the result by hand, too.

Passing the above test does not prove that that the ID variable is consistent and thus the merge is correct, but if the assertion is false, I know with certainty either that I have an invalid ID variable or that sex is miscoded in one of the datasets. If my data has roughly equal number of males and females, then the test has a 50 percent chance of detecting a mismatched pair of observations, such as Bob and Mary. If I have just 10 mismatched observations, I have a 1-0.910 = 0.9990 probability of detecting the problem.

I should warn you that if you want to keep just the matched observations, do not perform the merge by coding merge 1:1 id sex using two, keep(matched). You must keep the unmatched observations to perform the final part of the test, namely, that the ID numbers are unique. Then you can drop the unmatched observations.

. use one, clear
. merge 1:1 id sex using two
. sort id
. by id: assert _N==1
. keep if _merge==3

There may be more than one variable that you expect to be the same in combined observations. A convenient feature of this test is that you can add as many expected-to-be-constant variables to merge‘s keylist as you wish:

. use one, clear
. merge 1:1 id sex hiredate groupnumber using two
. sort id
. by id: assert _N==1
. keep if _merge==3

It is rare that there is not at least one variable other than the ID variable that is expected to be equal, but it does happen. Even if you have expected-to-be-constant variables, they may not work as well in detecting problems as variable sex in the example above. The distribution of the variable matters. If your data are of people known to be alive in 1980 and the known-to-be-constant variable is whether born after 1900, even mismatched observations would be likely to have the same value of the variable because most people alive in 1980 were born after 1900.

4. Look at a random sample

This test is weak, but you should do it anyway, if only because it’s so easy. List some of the combined observations and look at them.

. list in 1/5

Do the combined results look like they go together?

By the way, the right way to do this is

. gen u = uniform()
. sort u
. list in 1/5
. drop u

You do not want to look at the first observations because, having small values of ID, they are probably not representative. However IDs are assigned, the process is unlikely to be randomized. Persons with low values of ID will be younger, or older; or healthier, or sicker; or ….

5. Look at a nonrandom sample

You just merged two datasets, so obviously you did that because you needed the variables and those variables are somehow related to the existing variables. Perhaps your data is on persons, and you combined the 2009 data with the 2010 data. Perhaps your data is on countries, and you added export data to your import data. Whatever you just added, it is not random. If it were, you could have saved yourself time by simply generating the new variables containing random numbers.

So generate an index that measures a new variable in terms of an old one, such as

. gen diff = income2010 - income2009

or

. gen diff = exports - imports

Then sort on the variable and look at the observations containing the most outlandish values of your index:

. sort diff
. list in  1/5
. list in -5/l

These are the observations most likely to be mistakenly combined. Do you believe those observations were combined correctly?

Conclusion

I admit I am not suspicious of every merge I perform. I have built up trust over time in datasets that I have worked with previously. Even so, my ability to make errors is equal to yours, and even with trustworthy datasets, I can introduce problems long before I get to the merge. You need to carefully consider the consequences of a mistake. I do not know anyone who performs merges who has not performed a merge gone bad. The question is whether he or she detected it. I hope so.

Categories: Data Management Tags: ,
  • http://www.facebook.com/profile.php?id=541518548 Darcy McMaughan Moudouni

    Nice write-up! Appreciate the explanation.

  • Marc

    I am using the program -unique- from SSC for checking the uniqueness of id variables…nnice reminder about the pitfalls of merging

  • Clyde Schechter

    Since the default data type when creating new variables in Stata is float, it is dangerous to store social security numbers (or other lengthy identifiers) as doubles. I think this type of identifier is better stored as a string variable. Stata string variables have the nice property that their storage types are automatically revised to accommodate the necessary number of characters when new ones are created.nnThe only drawback I have encountered with storing lengthy ID variables as strings is that sometimes I want to use the ID as a clustering variable in analysis of longitudinal or other repeated-measures data. In that case, using -egen, group(ID)- to create a corresponding numeric variable is a simple expedient. (And the data storage type needed for that is typically smaller because it depends on the number of distinct units, not the length of the identifier string.)

  • Sam Clark

    I manage my data in relational databases, and I’ve always been curious why Stata does not follow widely accepted and well thought-out standards like the relational theory of data and SQL, or others. For example, the JOIN functionality of SQL is so much more robust and intuitive than Stata’s merge command. To accommodate Stata’s idiosyncratic way of doing things I simply do all data management in a relational database use SQL scripts and only use Stata to actually do statistics … because Stata really doesn’t do data management well *at all*.

  • Pingback: Dateien zusammenfügen | Soziobloge

  • Pingback: The Stata Blog » Merging data, part 2: Multiple-key merges

  • Pingback: Stata: Data Merging « At Worlds End

  • Jing

    Thanks for writing this. I just came across this issue today. Wasn’t sure why my merge looked wrong until I found your article.