none
Fuzzy Data Fun

    Question

  • Note:  I am not at all opposed to a T-SQL or third party suggestion in lieu of SSIS. 

     

    I have a data set that is full of fuzzy dups that need to be linked to each other.  The fuzzy grouping task in SSIS appears to do this.  However, I have several records that need to be tied together using several different linkages.  It appears that the fuzzy grouping task will only let you choose one fuzzy join.  I am looking for a logical OR join.  I also do not believe I will get intended results if I pass this result set to another fuzzy grouping task to group new linking dups with previous groupings. It seems it will relink all the records causing even more dups.  Here's what the data looks like:

     

    ORIGINAL_KEY   FIRST_NAME    LAST_NAME     ADDRESS                  UNIQUE_IDENTIFIER

    1                         Betty                NULL                484 Duck Pond Rd       NULL

    2                         NULL                Lancaster          484 Duck Pond Rd       1234

    3                         Patrick              NULL                NULL                           5678      

    4                         Betsy                Lanacster         484 Duck Pong Rd        NULL

    5                         Patrick              Davis                1600 Penn Ave              5678 

     

    Here is the logic at hand for record linkages:

     

    If a unique_identifier exists, then any join to it is a perfect match.

    OR

    If there is a fuzzy match for first_name AND address combination between records.

    OR

    If there is a fuzzy match for last_name AND address combination between records.

     

    If these rules were used in a fuzzy grouping, the only linking records I would get would be records with original_key 3 and 5.

     

    This is what I need the data to end up looking like:

     

     

    ORIGINAL_KEY   FUZZY_MAP_KEY     FIRST_NAME    LAST NAME     ADDRESS                 UNIQUE IDENTIFIER

    1                         4                              Betty                NULL                484 Duck Pond Rd      NULL

    2                         4                              NULL                Lancaster          484 Duck Pond Rd      1234

    3                         5                              Patrick              NULL                NULL                          5678   

    4                         4                              Betsy                Lanacster         484 Duck Pong Rd      NULL                         

    5                         5                              Patrick              Davis                1600 Penn Ave             5678 

     

    I know this could probably be solved with several temp objects, but I am looking for a simpler solution that won't turn into a recursive nightmare.  Any help is greatly appreciated.

     

    Wednesday, September 17, 2008 7:40 PM

All replies

  • I think this is what you are after:

     

    Code Snippet

    declare @table table

    (

    ORIGINAL_KEY int,

    FIRST_NAME varchar(30),

    LAST_NAME varchar(30),

    ADDRESS varchar(50),

    UNIQUE_IDENTIFIER int

    )

    insert into @table select 1, 'Betty', NULL, '484 Duck Pond Rd', NULL

    insert into @table select 2, NULL, 'Lancaster', '484 Duck Pond Rd', 1234

    insert into @table select 3, 'Patrick', NULL, NULL, 5678

    insert into @table select 4, 'Betsy', 'Lanacster', '484 Duck Pond Rd', NULL

    insert into @table select 5, 'Patrick', 'Davis', '1600 Penn Ave', 5678

    select t1.original_key, max(t2.original_key) as Fuzzy_key, t1.first_name, t1.last_name, t1.address, t1.unique_identifier

    from @table t1

    join @table t2 on ((t1.unique_identifier = t2.unique_identifier) or

    (t1.first_name = t2.first_name or t1.address = t2.address) or

    (t1.last_name = t2.last_name or t1.address = t2.address))

    group by t1.original_key, t1.first_name, t1.last_name, t1.address, t1.unique_identifier

     

     

    That is assuming that your record #4 entry was supposed to be Pond and not Pong in your sample data.

    Wednesday, September 17, 2008 8:17 PM
  • Thank you for the post.  Sorry, I should have been more specific.  The reason I am using fuzzy matching in SSIS, is because the quality of data is sub par.  I.E. - I have two records:

     

    Name             Address             Zip Code

    John Doe        123 Typo Lane    34205

    Jon Doe          123 Typo Ln       34205

     

    To the naked eye, this is the same person and these are the same records.  However, writing SQL to figure that out can be nightmarish.  I can join on zip, but doing a compare on address and name would require significant logic (especially when the customer stores symbols in their name column). SSIS has a task that can match the two records, which surprisingly works pretty well.  However, it gets more complicated using the task, since my data can be matched on several different field combinations.  Using their task, I can only choose one join. What I really need is the ability to choose several OR joins.  I am looking into making a custom SSIS component based on the fuzzy grouping task where I can override the join logic. I think that writing it in T-SQL would be super complicated trying to fit every scenario.  I don't think a third-party tool is an option, but will investigate as a last resort. Sad

    Wednesday, September 17, 2008 8:45 PM