
How to eliminate duplicate records

Question asked by richard_moon (Employee) on Oct 1, 2013
Latest reply on May 29, 2018 by kieran802410
Any ideas on how to remove duplicate records as part of a process?  Example: we read in a set of CSV records consisting of a header record and 8 data records, of which 4 are duplicates.  Through a series of process steps we want to throw out the duplicates so that only the 4 unique records are passed to the final step in the process.  In the sample data below, George Smith appears 3 times, and Mary Jones and Sam Shea each appear twice; each should appear only once.  For simplicity, we can use the Email value as the unique key: a record is deemed a duplicate solely on the basis of a matching Email value, with no need to compare every field.  (A sketch of the core logic follows the sample data.)
 
First Name,Last Name,Shoe Size,City,State,Zip,Hat Size,Email
George,Smith,9.5,Jersey City,NJ,07302,Medium,george.smith@gmail.com
Mary,Jones,6.5,Austin,TX,78610,Small,mary.jones@gmail.com
Mary,Jones,6.5,Austin,TX,78610,Small,mary.jones@gmail.com
Sam,Shea,8.5,Wilton Manors,FL,33334,Medium,sam.shea@gmail.com
Able,Toleap,7.0,Dublin,OH,43017,Large,able.toleap@gmail.com
George,Smith,9.5,Jersey City,NJ,07302,Medium,george.smith@gmail.com
Sam,Shea,8.5,Wilton Manors,FL,33334,Medium,sam.shea@gmail.com
George,Smith,9.5,Jersey City,NJ,07302,Medium,george.smith@gmail.com
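
Setting any particular integration tool aside, the core logic is: stream the records and keep a set of Email values already seen, writing out only the first occurrence of each.  A minimal sketch in Python, assuming the column layout in the sample above (the file names are placeholders):

import csv

def dedupe_by_email(in_path, out_path, key="Email"):
    """Copy rows from in_path to out_path, keeping only the first
    occurrence of each value in the key column ("Email" here)."""
    seen = set()
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[key] not in seen:   # first time this Email is seen
                seen.add(row[key])
                writer.writerow(row)   # duplicates are silently skipped

dedupe_by_email("customers.csv", "customers_unique.csv")

Run against the 8 sample records, this would emit the header plus the 4 unique rows (George Smith, Mary Jones, Sam Shea, Able Toleap), each once.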


Just to give a little more context around the business problem we are trying to solve: there is significant performance overhead in attempting to push duplicate records into the target system, so we want to eliminate the duplicates before we reach the target system's Connector step.  The input file could be quite large (over 100k records with almost 200 fields per record), so process performance is a consideration.
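
One point worth noting for a file that size: the sketch above streams the rows and holds only the Email values in memory, never the full 200-field records, so memory grows with the number of unique keys rather than the file size.  If the dedup key were a long composite rather than a single Email field, one could hold a fixed-size digest of the key instead; a small variant, assuming the same row structure (function and field names are illustrative only):

import hashlib

def key_digest(row, fields=("Email",)):
    """Fixed-size stand-in for a possibly long, composite dedup key.
    Substitute key_digest(row) for row[key] in the sketch above: the
    seen-set then costs roughly 16 bytes per unique record."""
    raw = "|".join(row[f] for f in fields)
    return hashlib.md5(raw.encode()).digest()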

Kudos to anyone who can lead us to a simple, performant solution! :-)
