公卫百科  > 所属分类  >  SAS   
[0] 评论[0] 编辑

match_merge

1. Introduction
When you have two data files, you can combine them by merging them side by side, matching up observations based on an identifier. For example, below we have a data file containing information on dads and we have a file containing information on family income called faminc. We would like to match merge the files together so we have the dads observation on the same line with the faminc observation based on the key variable famid. 公卫论坛

dads

famid name inc
2 Art 22000
1 Bill 30000
3 Paul 25000
faminc

famid faminc96 faminc97 faminc98
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
After match merging the files, they would look like this.
公卫人


famid name inc faminc96 faminc97 faminc98
1 Bill 30000 40000 40500 41000
2 Art 22000 45000 45400 45800
3 Paul 25000 75000 76000 77000
2. One-to-one merge
There are three steps to match merge the dads file with the faminc file (this is called a one-to-one merge because there is a one to one correspondence between the dads and faminc records). These three steps are illustrated in the SAS program merge1.sas below. 公卫考场

Use proc sort to sort dads on famid and save that file (we will call it dads2)
Use proc sort to sort faminc on famid and save that file (we will call it faminc2)
merge the dads2 and faminc2 files based on famid
These three steps are illustrated in the program below.

公卫论坛



* We first created the dads and faminc data files below ;

DATA dads;
INPUT famid name $ inc ;
CARDS;
2 Art 22000
1 Bill 30000
3 Paul 25000
;
RUN;
DATA faminc;
INPUT famid faminc96 faminc97 faminc98 ;
CARDS; 公卫家园
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
* 1. Sort the dads file by "famid" & save sorted file as dads2 ;
PROC SORT DATA=dads OUT=dads2;
BY famid;
RUN;
* 2. Sort faminc by "famid" & save sorted file as faminc2 ;
公卫人

PROC SORT DATA=faminc OUT=faminc2;
BY famid;
RUN;
* 3. Merge dads2 and faminc2 by famid in a data step ;
DATA dadfam ;
MERGE dads2 faminc2;
BY famid;
RUN:
* Let's do a proc print and look at the results. ;
PROC PRINT DATA=dadfam; 公卫家园
RUN;
The output of the program is shown below.

OBS FAMID NAME INC FAMINC96 FAMINC97 FAMINC98

1 1 Bill 30000 40000 40500 41000
2 2 Art 22000 45000 45400 45800
3 3 Paul 25000 75000 76000 77000
The output from shows that the match merge worked properly. The dad and faminc are merged side by side. The next example considers a one-to-many merge where one observation in one file may have multiple matching records in another file. We will see that kind of merge is really no different from the one-to-one merge we saw here. 公卫人

3. One-to-many merge
Imagine that we had a file with dads like we saw in the previous example, and we had a file with kids where a dad could have more than one kid. Matching up the "dads" with the "kids" is called a "one-to-many" merge since you are matching one dad observation to possibly many kids records. The dads and kids records are shown below.
公卫家园


dads

famid name inc
2 Art 22000
1 Bill 30000
3 Paul 25000
kids

famid kidname birth age wt sex
1 Beth 1 9 60 f
1 Bob 2 6 40 m
1 Barb 3 3 20 f
2 Andy 1 8 80 m
2 Al 2 6 50 m
2 Ann 3 2 20 f
3 Pete 1 6 60 m 公卫论坛
3 Pam 2 4 40 f
3 Phil 3 2 20 m
After matching the dads with the kids you get a file that looks like the one below. Bill is matched up with his kids Beth, Bob and Barb; Art is matched up with Andy Al, and Ann; and Paul is matched up with Pete, Pam and Phil. 公卫论坛

dadkid

FAMID NAME INC MOMDAD KIDNAME BIRTH AGE WT SEX

1 Bill 30000 dad Beth 1 9 60 f
1 Bill 30000 dad Bob 2 6 40 m
1 Bill 30000 dad Barb 3 3 20 f
2 Art 22000 dad Andy 1 8 80 m
2 Art 22000 dad Al 2 6 50 m
2 Art 22000 dad Ann 3 2 20 f 公卫百科
3 Paul 25000 dad Pete 1 6 60 m
3 Paul 25000 dad Pam 2 4 40 f
3 Paul 25000 dad Phil 3 2 20 m
Just like the "one-to-one" merge, we follow the same three steps for a "one-to-many" merge. These three steps are illustrated in the SAS program merge2.sas below. 公卫家园

Use proc sort to sort dads on famid and save that file (we will call it dads2)
Use proc sort to sort kids on famid and save that file (we will call it kids2)
merge the dads2 and kids2 files based on famid
The program below illustrates these steps. 公卫百科

* first we make the "dads" data file ;
DATA dads;
INPUT famid name $ inc ;
CARDS;
2 Art 22000
1 Bill 30000
3 Paul 25000
;
RUN;
* Next we make the "kids" data file ;
DATA kids;
INPUT famid kidname $ birth age wt sex $ ;
公卫百科

CARDS;
1 Beth 1 9 60 f
1 Bob 2 6 40 m
1 Barb 3 3 20 f
2 Andy 1 8 80 m
2 Al 2 6 50 m
2 Ann 3 2 20 f
3 Pete 1 6 60 m
3 Pam 2 4 40 f
3 Phil 3 2 20 m
;
RUN;
* 1. sort "dads" on famid and save the sorted file as "dads2" ; 公卫论坛
PROC SORT DATA=dads OUT=dads2;
BY famid;
RUN;
* 2. sort "kids" on famid and save the sorted file as "kids2" ;
PROC SORT DATA=kids OUT=kids2;
BY famid;
RUN;
* 3. merge "dads2" and "kids2" based on famid, creating "dadkid" ;
DATA dadkid; 公卫人
MERGE dads2 kids2;
BY famid;
RUN;
* Let's do a PROC PRINT of "dadkid" to see if the merge worked ;
PROC PRINT DATA=dadkid;
RUN;
The output of the program is shown below.

OBS FAMID NAME INC MOMDAD KIDNAME BIRTH AGE WT SEX 公卫人

1 1 Bill 30000 dad Beth 1 9 60 f
2 1 Bill 30000 dad Bob 2 6 40 m
3 1 Bill 30000 dad Barb 3 3 20 f
4 2 Art 22000 dad Andy 1 8 80 m
5 2 Art 22000 dad Al 2 6 50 m
6 2 Art 22000 dad Ann 3 2 20 f
7 3 Paul 25000 dad Pete 1 6 60 m
8 3 Paul 25000 dad Pam 2 4 40 f 公卫人
9 3 Paul 25000 dad Phil 3 2 20 m
The output shows just what we hoped to see, the dads merged along side of their kids. You might have wondered what would have happened if the merge statement had reversed the order of the files, had we changed step 3 to look like below.
公卫百科


* 3. merge "dads2" and "kids2" based on famid, creating "dadkid" ;
DATA dadkid;
MERGE kids2 dads2;
BY famid;
RUN;
* Let's do a PROC PRINT of "dadkid" see what happens ;
PROC PRINT DATA=dadkid;
RUN;
The output with the modified step 3 is shown below.

公卫考场



OBS FAMID KIDNAME BIRTH AGE WT SEX NAME INC MOMDAD

1 1 Beth 1 9 60 f Bill 30000 dad
2 1 Bob 2 6 40 m Bill 30000 dad
3 1 Barb 3 3 20 f Bill 30000 dad
4 2 Andy 1 8 80 m Art 22000 dad
5 2 Al 2 6 50 m Art 22000 dad
6 2 Ann 3 2 20 f Art 22000 dad

公卫考场


7 3 Pete 1 6 60 m Paul 25000 dad
8 3 Pam 2 4 40 f Paul 25000 dad
9 3 Phil 3 2 20 m Paul 25000 dad
This output shows what happened when we switched the order of kids2 and dads2 in the merge statement. The merge results are basically the same, except that the order of the variables is modified -- the kids variables are on the left and the dads variables are at the right. Other than that, the results are the same. 公卫百科

4. Problems to look out for
These examples cover situations where there are no complications. We show some examples of complications that can arise and how you can solve them below.

4.1 Mismatching records in one-to-one merge
The two data files have may have records that do not match. Below we illustrate this by including an extra dad (Karl in famid 4) that does not have a corresponding family, and there are two extra families (5 and 6) in the family file that do not have a corresponding dad. 公卫人
DATA dads;
INPUT famid name $ inc;
DATALINES;
2 Art 22000
1 Bill 30000
3 Paul 25000
4 Karl 95000
;
RUN;

DATA faminc;
INPUT famid faminc96 faminc97 faminc98;
DATALINES;
3 75000 76000 77000
1 40000 40500 41000 公卫论坛
2 45000 45400 45800
5 55000 65000 70000
6 22000 24000 28000
;
RUN;

PROC SORT DATA=dads;
BY famid;
RUN;

PROC SORT DATA=faminc;
BY famid;
RUN;

DATA merge121;
MERGE dads(IN=fromdadx) faminc(IN=fromfamx);
公卫人

BY famid;
fromdad = fromdadx;
fromfam = fromfamx;
RUN;
As you see above, we use the in option to create a 0/1 variable fromdadx that indicates whether the resulting file contains a record with data from the dads file. Likewise, we use IN option to create a 0/1 variable fromfamx that indicates if the observation came from the faminc file. The fromdadx and fromfamx variables are temporary, so we make copies of them in fromdad and fromfam so we have copies of these variables that stay with the file. We can then use proc print and proc freq to identify the mismatching records.

公卫家园


PROC PRINT DATA=merge121;
RUN;
PROC FREQ DATA=merge121;
TABLES fromdad*fromfam;
RUN;

The output below illustrates that there were mismatching records. For famid 4, the value of fromdad is 1 and fromfam is 0, as we would expect since there was data from dads for famid 4, but no data from faminc. Also, as we expect, this record has valid data for the variables from the dads file (name and inc) and missing data for the variables from faminc (faminc96 faminc97 and faminc98). We see the reverse pattern for famid's5 and 6. 公卫百科
OBS FAMID NAME INC FAMINC96 FAMINC97 FAMINC98 FROMDAD FROMFAM

1 1 Bill 30000 40000 40500 41000 1 1
2 2 Art 22000 45000 45400 45800 1 1
3 3 Paul 25000 75000 76000 77000 1 1
4 4 Karl 95000 . . . 1 0
5 5 . 55000 65000 70000 0 1
6 6 . 22000 24000 28000 0 1 公卫百科
A closer look at the fromdad and fromfam variables reveals that there are three records that have matching data: one that has data from the dads only, and two records that have data from the faminc file only. The crosstab table below confirms this.
TABLE OF FROMDAD BY FROMFAM

公卫家园



FROMDAD FROMFAM

Frequency|
Percent |
Row Pct |
Col Pct | 0| 1| Total
---------+--------+--------+
0 | 0 | 2 | 2
| 0.00 | 33.33 | 33.33
| 0.00 | 100.00 |
| 0.00 | 40.00 |
---------+--------+--------+
1 | 1 | 3 | 4

公卫考场


| 16.67 | 50.00 | 66.67
| 25.00 | 75.00 |
| 100.00 | 60.00 |
---------+--------+--------+
Total 1 5 6
16.67 83.33 100.00
You may want to use this strategy to check the matching of the two files. If there are unexpected mismatched records, then you should investigate to understand the cause of the mismatched records.

公卫百科


Use the where statement in a proc print to eliminate some of the non-matching records.

4.2 Variables with the same name, but different information
Below we have the files with the information about the dads and family, but look more closely at the names of the variables. In the dads file, there is a variable called inc98, and in the family file there are variables inc96, inc97 and inc98. Let's attempt to merge these files and see what happens. 公卫百科
DATA dads;
INPUT famid name $ inc98;
DATALINES;
2 Art 22000
1 Bill 30000
3 Paul 25000
;
RUN;

DATA faminc;
INPUT famid inc96 inc97 inc98;
DATALINES;
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800 公卫论坛
;
RUN;

PROC SORT DATA=dads;
BY famid;
RUN;

PROC SORT DATA=faminc;
BY famid;
RUN;

DATA merge121;
MERGE faminc dads;
BY famid;
RUN;
PROC PRINT DATA=merge121;
RUN;
The results are shown below. As you see, the variable inc98 has the data from the dads file, the file that appears last on the merge statement. When you merge files that have the same variable, SAS will use the values from the file that appears last on the merge statement.

公卫人


OBS FAMID INC96 INC97 INC98 NAME

1 1 40000 40500 30000 Bill
2 2 45000 45400 22000 Art
3 3 75000 76000 25000 Paul
There are a couple of ways you can solve this problem.
Solution #1. The most obvious solution is to choose variable names in the original files that will not conflict with each other. However, you may have files where the names have already been chosen. 公卫家园

Solution #2. You can rename the variables in a data step using the rename option (which renames the variables before doing the merging). This allows you to select variable names that do not conflict with each other, as illustrated below.

DATA merge121;
公卫考场

MERGE faminc(RENAME=(inc96=faminc96 inc97=faminc97 inc98=faminc98))
dads(RENAME=(inc98=dadinc98));
BY famid;
RUN;
PROC PRINT DATA=merge121;
RUN;

As you can see below, the variables were renamed as specified.
OBS FAMID FAMINC96 FAMINC97 FAMINC98 NAME DADINC98
公卫家园


1 1 40000 40500 41000 Bill 30000
2 2 45000 45400 45800 Art 22000
3 3 75000 76000 77000 Paul 25000
5. For more information
For information on concatenating data files, see the SAS Learning Module on Concatenating Data Files in SAS.

附件列表


您所在的用户组无法下载或查看附件

0

词条内容仅供参考,如果您需要解决具体问题
(尤其在法律、医学等领域),建议您咨询相关领域专业人士。

如果您认为本词条还有待完善,请 编辑

上一篇 PROC_APPEND    下一篇 汪宁

标签

同义词

暂无同义词