公卫百科  > 所属分类  >  SAS   
[2] 评论[0] 编辑

PROC SORT NODUP

A common interview question for SAS jobs is "What is the difference between proc sort nodup and proc sort nodupkey?". The answer the interviewer is expecting is usually "proc sort nodup gets rid of duplicate records but proc sort nodupkey gets rid of other records with the same sort key". However, this is not correct.

公卫考场

"nodup" is an alias for "noduprecs" which appears to mean "no duplicate records" but there is no way sas can know about these duplicate records unless they, by chance, land next to each other in sequence. That is a matter of chance. Take a look at "nodup" at work. Note the record with the "extra" value of 3. It's still there after the "nodup" sort.

公卫百科


data test1;
  input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodup data=test1;
by id1 id2;
run;
options nocenter;
proc print data=test1; 公卫考场
run;
Obs    id1    id2    extra 公卫论坛

 1     aa     ab       3
 2     aa     ab       1
 3     aa     ab       2 公卫考场
 4     aa     ab       3 公卫论坛

Now look again where the two records with an "extra" value of 3 are next to each other in the input dataset. This time it has been removed by "nodup". 
data test2;
  input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 3 公卫考场
aa ab 2
aa ab 1
;
proc sort nodup data=test2;
by id1 id2;
run;
options nocenter;
proc print data=test2;
run;
Obs    id1    id2    extra 公卫百科

 1     aa     ab       3
 2     aa     ab       2
 3     aa     ab       1

公卫家园

If you sort "nodupkey" then you will only be left with one record with that key combination in the above case as you can see below. 
data test3;
  input id1 $ id2 $ extra ;
cards;
aa ab 3
aa ab 3
aa ab 2
aa ab 1 公卫百科
;
proc sort nodupkey data=test3;
by id1 id2;
run;
options nocenter;
proc print data=test3;
run;
Obs    id1    id2    extra 公卫百科

 1     aa     ab       3 公卫考场

It is a big mistake to think sorting "nodup" will remove duplicate records. Sometime it will, sometime it won't. The only way you can be sure of removing duplicate records is to "proc sort nodupkey" and include enough key variables to be sure you will lose the duplicates you want to lose. In the case shown above, then if we knew of the same "extra" values being duplicates we wanted to remove then this variable should be included in the list of sort variables and then "nodupkey" will remove the duplicates as shown below. 公卫论坛
data test4;
   input id1 $ id2 $ extra ;
 cards;
 aa ab 3
 aa ab 1
 aa ab 2
 aa ab 3
 ;
 proc sort nodupkey data=test4;
 by id1 id2 extra;
 run;

公卫论坛

 options nocenter;
 proc print data=test4;
 run;
Obs    id1    id2    extra

公卫考场

 1     aa     ab       1
 2     aa     ab       2
 3     aa     ab       3

公卫考场

MORE DETAILS

公卫百科

http://www.datasavantconsulting.com/roland/

公卫百科

http://www2.sas.com/proceedings/sugi31/030-31.pdf 公卫人

http://www2.sas.com/proceedings/sugi30/037-30.pdf 公卫人

http://www2.sas.com/proceedings/forum2007/069-2007.pdf 公卫论坛

附件列表


2

词条内容仅供参考,如果您需要解决具体问题
(尤其在法律、医学等领域),建议您咨询相关领域专业人士。

如果您认为本词条还有待完善,请 编辑

上一篇 化妆品    下一篇 选择性偏倚

标签

同义词

暂无同义词