
## Merging data, part 2: Multiple-key merges

Multiple-key merges arise when more than one variable is required to uniquely identify the observations in your data. In Merging data, part 1, I discussed single-key merges such as

        . merge 1:1 personid using ...


In that discussion, each observation in the dataset could be uniquely identified on the basis of a single variable. In panel or longitudinal datasets, there are multiple observations on each person or thing, and to uniquely identify the observations we need at least two key variables, such as

        . merge 1:1 personid date using ...


In this dataset we have repeated observations on persons and, within person, the observations can be uniquely identified by the calendar date.
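In Stata, `isid personid date` verifies that the pair of keys uniquely identifies the observations. For readers working in other tools, here is a rough pandas sketch of the same check, using hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy panel: two people observed on two dates each.
df = pd.DataFrame({
    "personid": [1, 1, 2, 2],
    "date":     ["01jan2000", "15jan2000", "01jan2000", "15jan2000"],
    "y":        [1.0, 2.0, 3.0, 4.0],
})

# personid alone does not uniquely identify the observations ...
assert df.duplicated(subset=["personid"]).any()

# ... but the (personid, date) pair does, so a 1:1 merge on both
# keys is well defined.
assert not df.duplicated(subset=["personid", "date"]).any()
```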

Just to fix ideas, let’s assume I have two datasets. The first, sample.dta, is the one of analytic interest to me. I have data on 737 persons. For each person, I have data recorded on the first and fifteenth of every month, from year 2000 to 2009. Overall, my dataset has 176,880 observations.

The second dataset contains additional information (variables) on the sample of interest. It has over 3,000 people in it and it covers a longer stretch of time. I’ve been told that most of my 737 people should be in this second dataset, but I’ve been warned that, due to data collection or data processing errors over the years, a small fraction will not be found.

“How many?” I asked Bob from data processing during a fictional conversation, my paranoia kicking in.

“I don’t know. Most. Ninety-nine percent. It’s just random stuff,” he replied, knowing how well I respond to the word random.

Let’s call this second set of data the payroll data, although if I can imagine fictional conversations, you can imagine the data are something else. They might be records from follow-up visits of a medical experiment.

In any case, I receive the data, and here is what happened when I merged the data with my sample:

        . use sample, clear

        . merge 1:1 personid date using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                         2,352
                from master                     2,352  (_merge==1)
                from using                          0  (_merge==2)

            matched                           174,528  (_merge==3)
            -----------------------------------------

In my sample data, I have 174,528 + 2,352 = 176,880 observations. Of those, 174,528 matched, which is 98.7 percent. (The number of records from the using (payroll) data that were not matched is zero because I specified option keep(master match), meaning I discarded the unmatched payroll records. Had I not, the number would have been in the low millions.)
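For readers who work in pandas rather than Stata, the same keep-masters-and-matches merge with an audit trail can be sketched as follows. The data below are hypothetical stand-ins for sample.dta and payroll.dta; `indicator=True` adds a `_merge` column analogous to Stata's, and `validate="1:1"` raises an error if the keys are not unique on either side:

```python
import pandas as pd

# Hypothetical stand-ins for sample.dta and payroll.dta.
sample = pd.DataFrame({
    "personid": [1, 1, 2, 2],
    "date":     ["01jan2000", "15jan2000", "01jan2000", "15jan2000"],
    "x":        [10, 11, 12, 13],
})
payroll = pd.DataFrame({
    "personid": [1, 1, 2, 3],
    "date":     ["01jan2000", "15jan2000", "01jan2000", "01jan2000"],
    "wage":     [50, 51, 52, 99],
})

# Left join on both keys keeps masters and matches only, like
# Stata's keep(master match); unmatched payroll rows are dropped.
merged = sample.merge(payroll, on=["personid", "date"],
                      how="left", validate="1:1", indicator=True)

# Tabulate the _merge column, analogous to Stata's merge report.
print(merged["_merge"].value_counts())
```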

For many in this situation, the story would stop right here. Not for me. I want to show you how to tear into multiple-key merges to reassure yourself that things really are as they appear. You realize, of course, that I manufactured these fictional data for this blog entry, and I buried a little something that, once we find it, would scare you if this were a real story. So I'll tell you now: this story is loosely based on a real one.

### Step 1: Following my own advice

In Merging data, part 1, I recommended that you merge on all common variables, not just the identification variables. This blog entry is not going to rehash the previous one, but I want to emphasize that everything I said there about single-key merges applies equally to multiple-key merges. These two datasets share a variable recording the division in which the employee works, so I include it among the match variables:

        . use sample, clear

        . merge 1:1 personid date division using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                         2,352
                from master                     2,352  (_merge==1)
                from using                          0  (_merge==2)

            matched                           174,528  (_merge==3)
            -----------------------------------------


The output above matches the output when I merged on personid and date alone, so I do not appear to have a merge-gone-bad problem. These merged data are looking better and better.

### Step 2: Merge on each key variable in isolation

Let's imagine what could go wrong. Imagine that all the data for certain persons were missing, or that all the data for certain dates were missing. That might not be a problem, but it would certainly raise questions. Depending on the answers, the issue might merit a footnote, or it might be concerning enough to return the data and ask for a refund.

Finding persons or dates that are entirely unmatched is a lot of work unless you know the following trick: Merge on one key variable at a time.

        . use sample, clear

        . sort personid

        . by personid: keep if _n==1           // <- don't skip this step
        (176143 observations deleted)

        . merge 1:m personid using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                             0
            matched                           174,528  (_merge==3)
            -----------------------------------------


The output above proves that payroll.dta contains data on every person that appears in sample.dta.

Let me explain. I began by using my sample data and keeping just one observation for every value of personid. I don't care which observation I keep; I just need to keep one and only one. Then I merged on personid, keeping (1) the records that match and (2) the records from the master that do not match. I have no interest in the resulting dataset; I just wanted to see the table merge would report. merge reports that not a single master observation went unmatched. Ergo, every value of personid that appears in sample.dta also appears in payroll.dta.
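The same trick translates directly to other tools. Here is a hedged pandas sketch on hypothetical toy data: deduplicate the master on the key, then left-merge with an indicator and inspect the unmatched rows:

```python
import pandas as pd

# Hypothetical toy data: person 3 appears in sample but not payroll.
sample = pd.DataFrame({"personid": [1, 1, 2, 2, 3],
                       "date":     [1, 2, 1, 2, 1]})
payroll = pd.DataFrame({"personid": [1, 1, 2, 2, 2],
                        "wage":     [50, 51, 52, 53, 54]})

# Keep one observation per personid -- which one does not matter.
one_per_person = sample.drop_duplicates(subset="personid")

# 1:m merge on personid alone, keeping masters and matches.
check = one_per_person.merge(payroll, on="personid",
                             how="left", validate="1:m", indicator=True)

# Any personid in sample but not in payroll shows up as "left_only".
missing = check.loc[check["_merge"] == "left_only", "personid"]
print(missing.tolist())   # -> [3]
```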

Had merge reported "not matched" = 2, that would have meant there were two values of personid appearing in sample.dta that do not appear in payroll.dta. It would not have been an indictment of the data if two persons were not matched in their entirety, but I would certainly have looked into the issue. With the merged result in memory, I would have typed

        . list personid if _merge==1
        (output would appear)


I would have written down the two personids list reported. Then I would have returned to my sample data and looked at the data I had on those two people:

        . use sample, clear

        . list if personid==735527
        (output would appear)

        . list if personid==29887
        (output would appear)


It might be that 735527 was with the company for only a short time, and thus the missing payroll records are a believable random event. If 735527 had been with the company all ten years, however, I would be back on the phone seeking an explanation. (If these were medical data, certainly you would want to know how a person who never reported for a follow-up visit is known to still be alive after ten years.)

So much for personid. Let's do the same for date:

        . use sample, clear

        . sort date

        . by date: keep if _n==1
        (176640 observations deleted)

        . merge 1:m date using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                             0
            matched                           236,832  (_merge==3)
            -----------------------------------------


Every date that appears in sample.dta also appears in payroll.dta.

Finally, let's look at division:

        . use sample, clear

        . sort division

        . by division: keep if _n==1
        (176873 observations deleted)

        . merge 1:m division using payroll, keep(master match)
        (output omitted; merge reported "not matched" = 0)


Every division that appears in sample.dta also appears in payroll.dta.

These data are looking better and better.

If we had only two key variables, we would be done. We, however, are performing the full merge on three variables, namely personid, date, and division, and so there is one more set of comparisons we should examine.

### Step 3: Merge on every pair of key variables

With three key variables, the possible pairs are (personid, date), (personid, division), and (division, date). We have already looked at (personid, date), so that just leaves (personid, division) and (division, date).

The method is the same as in Step 2 except that we type two variables where we previously typed one:

        . use sample, clear

        . sort personid division

        . by personid division: keep if _n==1
        (176143 observations deleted)

        . merge 1:m personid division using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                             0
            matched                           174,528  (_merge==3)
            -----------------------------------------


We discover that every personid-division combination that appears in sample.dta also appears in payroll.dta.

Last is (division, date):

        . use sample, clear

        . sort division date

        . by division date: keep if _n==1
        (175200 observations deleted)

        . merge 1:m division date using payroll, keep(master match)

            Result                           # of obs.
            -----------------------------------------
            not matched                            24
                from master                        24  (_merge==1)
                from using                          0  (_merge==2)

            matched                           236,832  (_merge==3)
            -----------------------------------------


Surprise! Well, you're not surprised because I told you earlier we were going to find something, but if this were real life, you would be surprised after all these reassuring results.

We discover that there are 24 division-date combinations appearing in sample.dta that do not appear in payroll.dta. Let's look at the 24 missing combinations:

        . list division date if _merge==1

              +----------------------+
              | division        date |
              |----------------------|
        1129. |        5   01jan2007 |
        1130. |        5   15jan2007 |
        1131. |        5   01feb2007 |
        1132. |        5   15feb2007 |
        1133. |        5   01mar2007 |
              |----------------------|
        1134. |        5   15mar2007 |
        1135. |        5   01apr2007 |
        1136. |        5   15apr2007 |
        1137. |        5   01may2007 |
        1138. |        5   15may2007 |
              |----------------------|
        1139. |        5   01jun2007 |
        1140. |        5   15jun2007 |
        1141. |        5   01jul2007 |
        1142. |        5   15jul2007 |
        1143. |        5   01aug2007 |
              |----------------------|
        1144. |        5   15aug2007 |
        1145. |        5   01sep2007 |
        1146. |        5   15sep2007 |
        1147. |        5   01oct2007 |
        1148. |        5   15oct2007 |
              |----------------------|
        1149. |        5   01nov2007 |
        1150. |        5   15nov2007 |
        1151. |        5   01dec2007 |
        1152. |        5   15dec2007 |
              +----------------------+


If you look closely, you will notice that every payroll date in 2007 is listed. So what happened to the payroll records for division 5 in 2007? This may indeed be exactly the kind of random event that Bob had in mind during our fictional conversation. Somehow the company lost a little cluster of payroll records. The loss may mean nothing. Or it might be of critical importance. Imagine there's been an allegation that the company treats older workers poorly, and imagine that division 5 has the highest average age. Not random. Not random at all.

### Step 4: Merge on every triplet of key variables

So much for the fictional story.

If we had four or more key variables, we would now need to merge on every triplet of key variables; if we had five or more, we would then need to merge on every quadruplet; and if ...
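The growing checklist of subset merges can be automated. Here is a pandas/itertools sketch that, for hypothetical toy data, counts the key-value combinations of every nonempty proper subset of the keys that appear in the master but are absent from the using data; the toy data mirror the story, in that only the (date, division) check flags a problem:

```python
import itertools
import pandas as pd

def unmatched_by_subset(master, using, keys):
    """For each nonempty proper subset of the key variables, count the
    distinct key-value combinations present in master but absent from
    using -- the multi-key analogue of Steps 2 and 3 above."""
    report = {}
    for r in range(1, len(keys)):
        for subset in itertools.combinations(keys, r):
            cols = list(subset)
            probe = master[cols].drop_duplicates().merge(
                using[cols].drop_duplicates(),
                on=cols, how="left", indicator=True)
            report[subset] = int((probe["_merge"] == "left_only").sum())
    return report

# Hypothetical toy data: every single key checks out, but the
# (date, division) combination (2007, 5) is missing from payroll.
sample = pd.DataFrame({"personid": [1, 2, 3],
                       "date":     [2006, 2007, 2007],
                       "division": [4, 5, 4]})
payroll = pd.DataFrame({"personid": [1, 2, 2, 3],
                        "date":     [2006, 2006, 2007, 2007],
                        "division": [4, 5, 4, 4]})

report = unmatched_by_subset(sample, payroll,
                             ["personid", "date", "division"])
for subset, n in report.items():
    print(subset, "unmatched combinations:", n)
```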

Forget the story. Or remember it if it scares you. Data processing and paranoia make an excellent pairing. What's important is how easy it is to take complicated, multiple-key merges apart. I've never met anyone yet who knew this trick.

• Clyde Schechter

Great cautionary tale.

There is another, simple way that merging data sets can unveil errors in the data. I am currently working on a clinical trial in which I receive data updates every 6 weeks (the outcomes under study are cumulative over time). I start with a -merge 1:1 study_id using ..., update replace- command when I get them. Last week, to my surprise, I got an error message telling me that the records in the using file were not uniquely identified by study_id! (This would have been caught at an earlier stage if I were the one creating the update file, but it comes to me from another place.) So I refused to further use the data until we found out why this was happening.

It turns out that the source of the problem is that several study participants had been mistakenly crossed over into other arms of the study and were now being reported twice: once for each arm. This of course is a complete protocol violation that would have gone unnoticed for a while but for this.

You can never be too paranoid with data. You can never have too many checks and asserts in a program.

• Anonymous

Clyde provides an excellent example. The idea behind both my posting and Clyde's example is, rather than using -merge- to create the final merged result that you want and then studying that result to identify problems, to use -merge- to create results in which you have no direct interest but in which it will be easy to identify problems should there be any.

I suspect that Clyde, receiving updates every 6 weeks, really wants to -append- datasets. Before doing that, however, he uses -merge- to verify something about the incoming data.

In my examples, I want a fully merged result, but I create a sequence of partially merged results just because it will be easier for me to identify problems from them, if there are problems.

You might wonder how Clyde and I were able to figure out the appropriate way to detect the problem before we even knew what the problem was. The answer is we thought about things that would be true of our data under the assumption of no problems, and then we verified those things were true. It was during the verification process that we discovered the particular problems that we did not anticipate.

Verifying that what should be true is true is a powerful tool, and not just with -merge-. We at StataCorp use the principle in testing statistical code. Say you wrote a linear regression estimator and were one of the first to do so. How would you know your results were correct? Answer: think of everything that should be true about the results conditional on them being correct, and then verify those properties. The mean of the residuals should be 0. Is it? The variance of the predicted y should be less than the variance of the observed y. Is it? y-bar should equal X-bar times the coefficients. Does it?

The rule is not to try to imagine what could go wrong and then look for it. Too many different things could go wrong. Instead, assume nothing is wrong and verify that. Make up a long enough list of what should be true and, if there are errors, you will discover them in the act of verification, even if you never imagined the particular error you discover.

As Clyde says, -assert- is very useful when it comes to verifying what should be true.
