Fuzzy Lookup misses possible match

  • Question

  • I have a Fuzzy Lookup task that compares a source list of contacts to a reference list of contacts using the default settings. To test it, I added seed contacts to the source list that I knew should produce fairly high similarity hits. All of the seeded contacts but one came back with the expected high similarity values. When I looked for the one that didn't, I found that a different match had come back for it, with a very low similarity of .17. After some research I found the reason was the MaxOutputMatchesPerInput setting, which was set to 1. I set it to 3 and reran the package, and sure enough the seeded contact that was missing before now showed up. I thought the best match would always be returned when MaxOutputMatchesPerInput is set to 1, but that is not the case in my testing.

    For example, Donna Mizeman was in the reference list, and I added Don Miseman to the source list to seed it. With MaxOutputMatchesPerInput set to 1, the only match that came back was something like Dieman Abdul ...., with a similarity of .17. With MaxOutputMatchesPerInput set to 3, the best match is the seeded one, with a similarity of .72.

    Anyone have an explanation for this?

    -Mike
    Thursday, April 10, 2008 3:29 PM

Answers

  • Hi Mike,

     

    For efficiency reasons, Fuzzy Lookup (FL) uses a heuristic to stop its search early when MaxOutputMatchesPerInput is low. It starts with a set of candidate reference rows that have full words or substrings of length 4 in common with the input row. In your example, the only word or substring in common is "eman". The reference table may well contain very many rows with the substring "eman". For a large reference table this could run into the many thousands, so there would be many candidate rows, and FL would have no way to find the best one without looking at each and every one of them. To avoid this, FL stops its search early when the candidates it looks up stop improving the results rapidly enough.
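
    To make the candidate step concrete, here is a rough Python sketch of picking candidates by shared words or length-4 substrings. It is only an illustration and not FL's actual index: the function names `keys` and `candidates` are made up, the substrings are taken within each word (an assumption), and every reference name except "Donna Mizeman" is invented sample data.

```python
def keys(text, q=4):
    """Whole words plus every length-q substring of each word, lowercased.

    Assumption: substrings are taken inside words only; the real FL
    tokenization may differ.
    """
    found = set()
    for word in text.lower().replace(",", " ").split():
        found.add(word)
        found.update(word[i:i + q] for i in range(len(word) - q + 1))
    return found


def candidates(input_row, reference_rows, q=4):
    """Reference rows that share at least one word or q-gram with the input."""
    wanted = keys(input_row, q)
    return [row for row in reference_rows if wanted & keys(row, q)]


# Hypothetical reference list; only "Donna Mizeman" comes from the question.
reference = ["Donna Mizeman", "Peter Freeman", "Anna Coleman", "Alice Jones"]

# Every row containing "eman" becomes a candidate, not just the seeded one.
print(candidates("Don Miseman", reference))
```

    Even in this tiny list, three of the four rows qualify purely because of "eman", which is why the candidate set can get so large on a real table.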

     

    For example, say the input is "foobar, new york" and the reference table contains "fobaar, new york". The only words or length-4 substrings the two have in common come from "new york", so the number of candidates FL has to look at might be huge. Say MaxOutputMatchesPerInput is 1. We might first compare with "apple, new york", which gives a score of .2, then "bronx, new york", which might give a score of .3 (improving the best match), then "cat, new york" with a score of .1 (not improving the best match), and so on. If we go for a long time without improving the best match, we stop. Increasing MaxOutputMatchesPerInput gives a better chance that the results will keep improving as we search, because we are now keeping a larger result set, so the search runs longer before it gives up.
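
    The effect can be sketched in a few lines of Python. This is a simplified stand-in for the heuristic, not FL's real stopping rule: the function `search`, the `patience` parameter, and the scores for the "dog" and "egg" rows and for "fobaar" are all made up for illustration. The search stops once `patience` candidates in a row fail to improve the kept result set.

```python
import heapq


def search(scored_candidates, max_matches, patience):
    """Keep the top `max_matches` rows; stop after `patience` straight misses.

    Hypothetical simplification: a candidate "improves the results" if it
    gets into the kept set, otherwise it counts as a miss.
    """
    best = []   # min-heap of (score, row), at most max_matches entries
    misses = 0
    for row, score in scored_candidates:
        if len(best) < max_matches:
            heapq.heappush(best, (score, row))
            misses = 0
        elif score > best[0][0]:
            heapq.heapreplace(best, (score, row))  # evict the weakest kept row
            misses = 0
        else:
            misses += 1
            if misses >= patience:
                break   # give up before seeing the remaining candidates
    return sorted(best, reverse=True)


# Candidates in the order they happen to be looked up (scores are illustrative).
scored = [
    ("apple, new york", 0.2),
    ("bronx, new york", 0.3),
    ("cat, new york", 0.1),
    ("dog, new york", 0.15),
    ("egg, new york", 0.25),
    ("fobaar, new york", 0.9),
]

top1 = search(scored, max_matches=1, patience=2)
top3 = search(scored, max_matches=3, patience=2)
print(top1)  # stops after "dog", so the true best row is never examined
print(top3)  # keeps improving the set of 3 and reaches "fobaar, new york"
```

    With one kept match, two misses in a row end the search before the true best row is reached. With three kept matches, the weaker rows still improve the result set, so the search continues long enough to find it.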

     

    Hope that helps.

     

    -Kris

    Wednesday, April 16, 2008 9:32 PM