Wednesday, September 17, 2008 7:40 PM
Note: I am not at all opposed to a T-SQL or third party suggestion in lieu of SSIS.
ORIGINAL_KEY FIRST_NAME LAST_NAME ADDRESS UNIQUE_IDENTIFIER
1 Betty NULL 484 Duck Pond Rd NULL
2 NULL Lancaster 484 Duck Pond Rd 1234
3 Patrick NULL NULL 5678
4 Betsy Lanacster 484 Duck Pong Rd NULL
5 Patrick Davis 1600 Penn Ave 5678
Here is the logic at hand for record linkages:
If a unique_identifier exists, then any join to it is a perfect match.
If there is a fuzzy match for first_name AND address combination between records.
If there is a fuzzy match for last_name AND address combination between records.
If these rules were used in a fuzzy grouping, the only linking records I would get would be records with original_key 3 and 5.
This is what I need the data to end up looking like:
ORIGINAL_KEY FUZZY_MAP_KEY FIRST_NAME LAST NAME ADDRESS UNIQUE IDENTIFIER
1 4 Betty NULL 484 Duck Pond Rd NULL
2 4 NULL Lancaster 484 Duck Pond Rd 1234
3 5 Patrick NULL NULL 5678
4 4 Betsy Lanacster 484 Duck Pong Rd NULL
5 5 Patrick Davis 1600 Penn Ave 5678
I know this could probably be solved with several temp objects, but I am looking for a simpler solution that won't turn into a recursive nightmare. Any help is greatly appreciated.
Wednesday, September 17, 2008 8:17 PM
I think this is what you are after:Code Snippet
insertinto @table select 1, 'Betty', NULL, '484 Duck Pond Rd', NULL
insertinto @table select 2, NULL, 'Lancaster', '484 Duck Pond Rd', 1234
insertinto @table select 3, 'Patrick', NULL, NULL, 5678
insertinto @table select 4, 'Betsy', 'Lanacster', '484 Duck Pond Rd', NULL
insertinto @table select 5, 'Patrick', 'Davis', '1600 Penn Ave', 5678
selectt1.original_key, max(t2.original_key) as Fuzzy_key, t1.first_name, t1.last_name, t1.address, t1.unique_identifier
join@table t2 on ((t1.unique_identifier = t2.unique_identifier) or (t1.first_name = t2.first_name or t1.address = t2.address) or (t1.last_name = t2.last_name or t1.address = t2.address))
groupby t1.original_key, t1.first_name, t1.last_name, t1.address, t1.unique_identifier
That is assuming that your record #4 entry was supposed to be Pond and not Pong in your sample data.
Wednesday, September 17, 2008 8:45 PM
Thank you for the post. Sorry, I should have been more specific. The reason I am using fuzzy matching in SSIS, is because the quality of data is sub par. I.E. - I have two records:
Name Address Zip Code
John Doe 123 Typo Lane 34205
Jon Doe 123 Typo Ln 34205
To the naked eye, this is the same person and these are the same records. However, writing SQL to figure that out can be nightmarish. I can join on zip, but doing a compare on address and name would require significant logic (especially when the customer stores symbols in their name column). SSIS has a task that can match the two records, which surprisingly works pretty well. However, it gets more complicated using the task, since my data can be matched on several different field combinations. Using their task, I can only choose one join. What I really need is the ability to choose several OR joins. I am looking into making a custom SSIS component based on the fuzzy grouping task where I can override the join logic. I think that writing it in T-SQL would be super complicated trying to fit every scenario. I don't think a third-party tool is an option, but will investigate as a last resort.