Merging data, part 2: Multiple-key merges
Multiple-key merges arise when more than one variable is required to uniquely identify the observations in your data. In Merging data, part 1, I discussed single-key merges such as
. merge 1:1 personid using ...
In that discussion, each observation in the dataset could be uniquely identified on the basis of a single variable. In panel or longitudinal datasets, there are multiple observations on each person or thing, and to uniquely identify the observations we need at least two key variables, such as
. merge 1:1 personid date using ...
In this dataset we have repeated observations on persons and, within person, the observations can be uniquely identified by the calendar date.
Just to fix ideas, let’s assume I have two datasets. The first, sample.dta, is the one of analytic interest to me. I have data on 737 persons. For each person, I have data recorded on the first and fifteenth of every month, from year 2000 to 2009. Overall, my dataset has 176,880 observations.
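By the way, before any merging it is worth letting Stata confirm that personid and date really do uniquely identify the observations in sample.dta. A minimal check, using the filename and variable names above; isid says nothing when the check passes and exits with an error when it does not:

. use sample, clear
. isid personid date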
The second dataset contains additional information (variables) on the sample of interest. It has over 3,000 people in it and it covers a longer stretch of time. I’ve been told that most of my 737 people should be in this second dataset, but I’ve been warned that, due to data collection or data processing errors over the years, a small fraction will not be found.
“How many?” I asked Bob from data processing during a fictional conversation, my paranoia kicking in.
“I don’t know. Most. Ninety-nine percent. It’s just random stuff,” he replied, knowing how well I respond to the word random.
Let’s call this second set of data the payroll data, although if I can imagine fictional conversations, you can imagine the data are something else. They might be records from follow-up visits of a medical experiment.
In any case, I receive the data, and here is what happened when I merged the data with my sample:
. use sample, clear

. merge 1:1 personid date using payroll, keep(master match)

    Result                           # of obs.
    -----------------------------------------
    not matched                          2,352
        from master                      2,352  (_merge==1)
        from using                           0  (_merge==2)

    matched                            174,528  (_merge==3)
    -----------------------------------------
In my sample data, I have 174,528 + 2,352 = 176,880 observations. Of those, 174,528 matched, which is 98.7 percent. (The number of records from the using (payroll) data that were not matched is zero because I specified option keep(master match), meaning I discarded the unmatched payroll records. Had I not, that number would have been in the low millions.)
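If you prefer to let Stata do that arithmetic, the _merge variable that merge leaves behind makes it a short calculation. A quick sketch, run immediately after the merge above:

. quietly count
. local total = r(N)
. quietly count if _merge==3
. display "matched " %4.1f 100*r(N)/`total' " percent of the sample observations"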
For many in this situation, the story would stop right here. Not for me. I want to show you how to tear into multiple-key merges to reassure yourself that things really are as they appear. You realize, of course, that I manufactured these fictional data for this blog entry and buried a little something that, once we find it, would scare you if this were a real story. So I’ll tell you now, this story is loosely based on a real story.
Step 1: Following my own advice
In Merging data, part 1 I recommended that you merge on all common variables, not just the identification variables. This blog entry is not going to rehash the previous one, but I want to emphasize that everything I said there about single-key merges applies equally to multiple-key merges. These two datasets share a variable recording the division in which the employee works, so I include it among the match variables:
. use sample, clear

. merge 1:1 personid date division using payroll, keep(master match)

    Result                           # of obs.
    -----------------------------------------
    not matched                          2,352
        from master                      2,352  (_merge==1)
        from using                           0  (_merge==2)

    matched                            174,528  (_merge==3)
    -----------------------------------------
The output above matches the output I obtained when I merged on personid and date alone, so I do not appear to have a merge-gone-bad problem. These merged data are looking better and better.
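Incidentally, if you are not sure which variables two datasets have in common, you can let Stata work it out. A rough sketch; describe's varlist option leaves the variable names behind in r(varlist), and the intersection will of course include the key variables themselves:

. use sample, clear
. quietly describe, varlist
. local svars `r(varlist)'
. use payroll, clear
. quietly describe, varlist
. local pvars `r(varlist)'
. display "`: list svars & pvars'"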
Step 2: Merge on each key variable in isolation
Let’s imagine what could go wrong. Imagine that all the data for certain persons were missing, or that all the data for certain dates were missing. That might not be a problem, but it would certainly raise questions. Depending on the answers, it might merit only a footnote, or it might be concerning enough to return the data and ask for a refund.
Finding persons or dates that are entirely unmatched is a lot of work unless you know the following trick: Merge on one key variable at a time.
Let’s start with personid:
. use sample, clear

. sort personid

. by personid: keep if _n==1        // <- don't skip this step
(176143 observations deleted)

. merge 1:m personid using payroll, keep(master match)

    Result                           # of obs.
    -----------------------------------------
    not matched                              0
    matched                            174,528  (_merge==3)
    -----------------------------------------
The output above proves that payroll.dta contains data on every person that appears in sample.dta.
Let me explain. I began by using my sample data and keeping just one observation for every value of personid. I don't care which observation I keep; I just need to keep one and only one. Then I merged on personid, keeping (1) the records that match and (2) the records from the master that do not match. I have no interest in the resulting dataset; I just wanted to see the table merge would report. merge reports that zero observations from my master data went unmatched. Ergo, every value of personid that appears in sample.dta also appears in payroll.dta.
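If you run checks like this from a do-file, you can make them self-enforcing. assert stops the do-file the moment its condition fails, so an unmatched person cannot slip by unnoticed. A minimal sketch, using the same filenames and commands as above:

. use sample, clear
. sort personid
. by personid: keep if _n==1
. merge 1:m personid using payroll, keep(master match)
. assert _merge==3        // stops the do-file if any sample person went unmatched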
Had merge reported "not matched" = 2, that would have meant that two values of personid appearing in sample.dta did not appear in payroll.dta. It would not have been an indictment of the data if two persons went unmatched in their entirety, but I would certainly have looked into the issue. With the merged result in memory, I would have typed
. list personid if _merge==1
  (output would appear)
I would have written down the two personids list reported. Then I would have returned to my sample data and looked at the data I had on those two people:
. use sample, clear

. list if personid==735527
  (output would appear)

. list if personid==29887
  (output would appear)
It might be that 735527 was with the company for only a short time, which would make the missing payroll records a believable random event. If 735527 had been with the company all ten years, however, I would be back on the phone seeking an explanation. (If these were medical data, you would certainly want to know how a person who never reported for a follow-up visit is known to still be alive after ten years.)
So much for personid. Let's do the same for date:
. use sample, clear

. sort date

. by date: keep if _n==1
(176640 observations deleted)

. merge 1:m date using payroll, keep(master match)

    Result                           # of obs.
    -----------------------------------------
    not matched                              0
    matched                            236,832  (_merge==3)
    -----------------------------------------
Every date that appears in sample.dta also appears in payroll.dta.
Finally, let's look at division:
. use sample, clear

. sort division

. by division: keep if _n==1

. merge 1:m division using payroll, keep(master match)
  (output would appear, with "not matched" equal to 0)
Every division that appears in sample.dta also appears in payroll.dta.
These data are looking better and better.
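With only three key variables, typing the three checks is no trouble, but if you had more keys you could put the whole of Step 2 in a loop. A rough sketch for a do-file, using the filenames and key variables of this example:

// check each key variable in isolation; report how many of its values go unmatched
foreach key in personid date division {
        use sample, clear
        bysort `key': keep if _n==1                       // one observation per value
        merge 1:m `key' using payroll, keep(master match)
        quietly count if _merge==1
        display as result "`key': " r(N) " unmatched value(s)"
}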
If we had only two key variables, we would be done. We, however, are performing the full merge on three variables, namely personid, date, and division, and so there is one more set of comparisons we should examine.
Step 3: Merge on every pair of key variables
With three key variables, the possible pairs are (personid, date), (personid, division), and (division, date). We have already looked at (personid, date), so that just leaves (personid, division) and (division, date).
The method is the same as in Step 2 except that we type two variables where we previously typed one:
. use sample, clear

. sort personid division

. by personid division: keep if _n==1
(176143 observations deleted)

. merge 1:m personid division using payroll, keep(master match)

    Result                           # of obs.
    -----------------------------------------
    not matched                              0
    matched                            174,528  (_merge==3)
    -----------------------------------------
We discover that every personid-division combination that appears in sample.dta also appears in payroll.dta.
Last is (division, date):
. use sample, clear

. sort division date

. by division date: keep if _n==1
(175200 observations deleted)

. merge 1:m division date using payroll, keep(master match)

    Result                           # of obs.
    -----------------------------------------
    not matched                             24
        from master                         24  (_merge==1)
        from using                           0  (_merge==2)

    matched                            236,832  (_merge==3)
    -----------------------------------------
Surprise! Well, you're not surprised because I told you earlier we were going to find something, but if this were real life, you would be surprised after all these reassuring results.
We discover that there are 24 division-date combinations appearing in sample.dta that do not appear in payroll.dta. Let's look at the 24 missing combinations:
. list division date if _merge==1

       +----------------------+
       | division        date |
       |----------------------|
 1129. |        5   01jan2007 |
 1130. |        5   15jan2007 |
 1131. |        5   01feb2007 |
 1132. |        5   15feb2007 |
 1133. |        5   01mar2007 |
       |----------------------|
 1134. |        5   15mar2007 |
 1135. |        5   01apr2007 |
 1136. |        5   15apr2007 |
 1137. |        5   01may2007 |
 1138. |        5   15may2007 |
       |----------------------|
 1139. |        5   01jun2007 |
 1140. |        5   15jun2007 |
 1141. |        5   01jul2007 |
 1142. |        5   15jul2007 |
 1143. |        5   01aug2007 |
       |----------------------|
 1144. |        5   15aug2007 |
 1145. |        5   01sep2007 |
 1146. |        5   15sep2007 |
 1147. |        5   01oct2007 |
 1148. |        5   15oct2007 |
       |----------------------|
 1149. |        5   01nov2007 |
 1150. |        5   15nov2007 |
 1151. |        5   01dec2007 |
 1152. |        5   15dec2007 |
       +----------------------+
If you look closely, you will notice that every payroll date in 2007 is listed. So what happened to the payroll records for division 5 in 2007? This may indeed be exactly the kind of random event that Bob had in mind during our fictional conversation. Somehow the company lost a little cluster of payroll records. The loss may mean nothing. Or it might be of critical importance. Imagine there's been an allegation that the company treats older workers poorly and imagine that division 5 has the highest average age. Not random. Not random at all.
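With more divisions or a longer span of years, a pattern like this is easier to spot in a cross-tabulation than in a long listing. A quick sketch, assuming the merged result from the division-date check is still in memory and that date is a Stata daily date so that year() applies:

. generate year = year(date)
. tabulate division year if _merge==1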
Step 4: Merge on every triplet of key variables
So much for the fictional story.
If we had four or more key variables, we would now need to merge on every triplet of key variables; if we had five or more, we would then need to merge on every quadruplet; and if ...
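The looping is mechanical, though. Here is a rough sketch of the pairwise pass as a do-file, under the same filenames and key variables as above; one more nested loop would produce the triplets:

// check every pair of key variables
local keys personid date division
local n : word count `keys'
forvalues i = 1/`=`n'-1' {
        forvalues j = `=`i'+1'/`n' {
                local pair `: word `i' of `keys'' `: word `j' of `keys''
                use sample, clear
                bysort `pair': keep if _n==1              // one observation per combination
                merge 1:m `pair' using payroll, keep(master match)
                quietly count if _merge==1
                display as result "`pair': " r(N) " unmatched combination(s)"
        }
}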
Forget the story. Or remember it if it scares you. Data processing and paranoia make an excellent pairing. What's important is how easy it is to take complicated, multiple-key merges apart. I've never met anyone yet who knew this trick.