Fuzzy Data Fun
-
Wednesday, September 17, 2008 7:40 PM
Note: I am not at all opposed to a T-SQL or third party suggestion in lieu of SSIS.
I have a data set that is full of fuzzy dups that need to be linked to each other. The fuzzy grouping task in SSIS appears to do this. However, I have several records that need to be tied together using several different linkages. It appears that the fuzzy grouping task will only let you choose one fuzzy join. I am looking for a logical OR join. I also do not believe I will get intended results if I pass this result set to another fuzzy grouping task to group new linking dups with previous groupings. It seems it will relink all the records causing even more dups. Here's what the data looks like:ORIGINAL_KEY FIRST_NAME LAST_NAME ADDRESS UNIQUE_IDENTIFIER
1 Betty NULL 484 Duck Pond Rd NULL
2 NULL Lancaster 484 Duck Pond Rd 1234
3 Patrick NULL NULL 5678
4 Betsy Lanacster 484 Duck Pong Rd NULL
5 Patrick Davis 1600 Penn Ave 5678
Here is the logic at hand for record linkages:
If a unique_identifier exists, then any join to it is a perfect match.
OR
If there is a fuzzy match for first_name AND address combination between records.
OR
If there is a fuzzy match for last_name AND address combination between records.
If these rules were used in a fuzzy grouping, the only linking records I would get would be records with original_key 3 and 5.
This is what I need the data to end up looking like:
ORIGINAL_KEY FUZZY_MAP_KEY FIRST_NAME LAST NAME ADDRESS UNIQUE IDENTIFIER
1 4 Betty NULL 484 Duck Pond Rd NULL
2 4 NULL Lancaster 484 Duck Pond Rd 1234
3 5 Patrick NULL NULL 5678
4 4 Betsy Lanacster 484 Duck Pong Rd NULL
5 5 Patrick Davis 1600 Penn Ave 5678
I know this could probably be solved with several temp objects, but I am looking for a simpler solution that won't turn into a recursive nightmare. Any help is greatly appreciated.
All Replies
-
Wednesday, September 17, 2008 8:17 PM
I think this is what you are after:
Code Snippetdeclare
@table table(
ORIGINAL_KEY
int,FIRST_NAME
varchar(30),LAST_NAME
varchar(30),ADDRESS
varchar(50),UNIQUE_IDENTIFIER
int)
insert
into @table select 1, 'Betty', NULL, '484 Duck Pond Rd', NULLinsert
into @table select 2, NULL, 'Lancaster', '484 Duck Pond Rd', 1234insert
into @table select 3, 'Patrick', NULL, NULL, 5678insert
into @table select 4, 'Betsy', 'Lanacster', '484 Duck Pond Rd', NULLinsert
into @table select 5, 'Patrick', 'Davis', '1600 Penn Ave', 5678select
t1.original_key, max(t2.original_key) as Fuzzy_key, t1.first_name, t1.last_name, t1.address, t1.unique_identifierfrom
@table t1join
@table t2 on ((t1.unique_identifier = t2.unique_identifier) or (t1.first_name = t2.first_name or t1.address = t2.address) or (t1.last_name = t2.last_name or t1.address = t2.address))group
by t1.original_key, t1.first_name, t1.last_name, t1.address, t1.unique_identifierThat is assuming that your record #4 entry was supposed to be Pond and not Pong in your sample data.
-
Wednesday, September 17, 2008 8:45 PM
Thank you for the post. Sorry, I should have been more specific. The reason I am using fuzzy matching in SSIS, is because the quality of data is sub par. I.E. - I have two records:
Name Address Zip Code
John Doe 123 Typo Lane 34205
Jon Doe 123 Typo Ln 34205
To the naked eye, this is the same person and these are the same records. However, writing SQL to figure that out can be nightmarish. I can join on zip, but doing a compare on address and name would require significant logic (especially when the customer stores symbols in their name column). SSIS has a task that can match the two records, which surprisingly works pretty well. However, it gets more complicated using the task, since my data can be matched on several different field combinations. Using their task, I can only choose one join. What I really need is the ability to choose several OR joins. I am looking into making a custom SSIS component based on the fuzzy grouping task where I can override the join logic. I think that writing it in T-SQL would be super complicated trying to fit every scenario. I don't think a third-party tool is an option, but will investigate as a last resort.


