Answered Fuzzy Grouping Transform Corrupts Pass-through Data

  • 2 August 2005 21:55
     
     
    We are working with a client and are using the Fuzzy Grouping transform for de-duplication and hierarchy creation on a national account list.

    I've found that if a large number of pass-through columns are sent to the Fuzzy Grouping transform, it randomly corrupts the char columns.

    Our workaround was to pass through only the ID columns and then build out the needed attributes from views against the Fuzzy Grouping output. However, the product team should take a look at this.
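
    The workaround above can be sketched as follows. This is only an illustration under stated assumptions, not the poster's actual setup: it uses Python's built-in sqlite3 in place of SQL Server, and the table, view, and column names (source_accounts, fuzzy_group_output, grouped_accounts, record_id, group_key) are hypothetical. The idea is that only the ID travels through the transform, and every other attribute is read from the untouched source table through a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical source table holding the full attributes.
    CREATE TABLE source_accounts (record_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    INSERT INTO source_accounts VALUES
        (1, 'Acme Corp', '1 Main St'),
        (2, 'ACME Corporation', '1 Main Street'),
        (3, 'Widget Co', '9 Elm Rd');

    -- Hypothetical stand-in for the transform output: only the ID was
    -- passed through, plus the group key assigned to each record.
    CREATE TABLE fuzzy_group_output (record_id INTEGER, group_key INTEGER);
    INSERT INTO fuzzy_group_output VALUES (1, 1), (2, 1), (3, 3);

    -- View that rebuilds the attributes from the source by joining on the ID.
    CREATE VIEW grouped_accounts AS
        SELECT g.group_key, g.record_id, s.name, s.address
        FROM fuzzy_group_output AS g
        JOIN source_accounts AS s ON s.record_id = g.record_id;
    """
)

rows = conn.execute(
    "SELECT group_key, record_id, name, address "
    "FROM grouped_accounts ORDER BY group_key, record_id"
).fetchall()
for row in rows:
    print(row)
```

    Because the name and address values never pass through the transform in this arrangement, they cannot be altered by it; only the ID column remains exposed.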


    By corruption I mean that random characters from other records showed up in character columns (we had address and name corruption in about 10% of a 1.5-million-record dataset).

    Thanks.

    Michael Slater
    Software Architects

All replies

  • 15 August 2005 21:47
     
     Answer
    Michael,

    Thanks for your post.  We have been unable to reproduce the problem you have reported in the new test cases that we have created for this issue.  We would very much like to get to the bottom of what you are seeing.  Can you please contact me directly so that we might work with you to find a better repro case that can be used to further investigate and fix this problem?

    Please send an email to KrisGan@microsoft.com

    Thanks,
    Kris Ganjam
  • 24 February 2008 15:02
     
     

     

    Hello,

     

    Would you please tell me what that issue turned out to be? I am facing a similar problem of data corruption in a column when it is set to 'Pass Through' in the Fuzzy Grouping transform. The column is nvarchar(4000), and some of its entries contain asterisks and other special characters. Could the size or the special characters be the reason?

     

    Best Regards,

    Katara

  • 27 April 2012 21:03
     
     

    I am having a similar issue with a relatively small (~120,000) recordset.

    I have one varchar pass-through column that contains a vendor ID (we are trying to match similar names and addresses across different vendor IDs to identify potential duplicates).

    I'm unable to use ~3000 results because these particular records have corrupted IDs.

    Here's an actual example:

    1. The input table to the Fuzzy Grouping transform has a single record with vendor ID "ZX4010".
    2. A SELECT * WHERE Vendor_ID LIKE '%ZX4010%' returns only the one record mentioned above (used as a sanity check for the following observations).
    3. Run Fuzzy Grouping, matching on vendor name, address, etc., and using Vendor_ID as a pass-through column.
    4. In the result set, run the same query as in (2) against the resulting table and find 9 records with the following Vendor_IDs:

    ZX40101
    ZX40102
    ZX40103
    ZX40104
    ZX40105
    ZX40106
    ZX40107
    ZX40108
    ZX40109

    This is an actual example (except that the Vendor_ID is made up to protect client info). Notice that a single, sequentially incremented digit is being appended to the end of the Vendor_ID. This extra digit and incremental numbering appear in our actual result set.
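
    The sanity check in steps (2) and (4) can be extended to the whole table by listing every output ID that has no exact match in the input. The sketch below is a hedged illustration only: it uses Python's built-in sqlite3 rather than SQL Server, the table names vendor_input and grouping_output are hypothetical, and the IDs are the made-up ones from the example above plus one invented clean ID (AB1000).

```python
import sqlite3

def find_corrupted_ids(conn):
    """Return output vendor IDs that do not exist in the input table."""
    cursor = conn.execute(
        """
        SELECT o.vendor_id
        FROM grouping_output AS o
        LEFT JOIN vendor_input AS i ON i.vendor_id = o.vendor_id
        WHERE i.vendor_id IS NULL      -- no exact match in the input
        ORDER BY o.vendor_id
        """
    )
    return [row[0] for row in cursor]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical input to the transform.
    CREATE TABLE vendor_input (vendor_id TEXT);
    INSERT INTO vendor_input VALUES ('ZX4010'), ('AB1000');

    -- Hypothetical output showing the appended-digit pattern described above.
    CREATE TABLE grouping_output (vendor_id TEXT);
    INSERT INTO grouping_output VALUES ('ZX40101'), ('ZX40102'), ('AB1000');
    """
)

corrupted = find_corrupted_ids(conn)
print(corrupted)
```

    An exact-match anti-join like this flags any altered pass-through value, whereas the LIKE '%ZX4010%' check only confirms the pattern for one known ID.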

    I'm sharing this to try to help discover the source of the problem.

    Best Regards