Newbie Question for C# fuzzy search

  • Question

  • I have two lists of companies, with their registered information, taken from two different web sites.  Some entries in the two lists refer to the same company, but the registered information differs slightly.  I want to do some kind of fuzzy search to find those companies: their names and their representatives' names look almost the same to a human reader, but an exact string comparison does not match them.

            using System.Collections.Generic;

            public struct Company
            {
                public int ID { get; set; }
                public string Name { get; set; }
                public string Representer { get; set; }
            }

            class Program
            {
                static void Main(string[] args)
                {
                    // Companies as registered on the first web site.
                    var list1 = new List<Company>
                    {
                        new Company { ID = 10, Name = "A Company LTD", Representer = "Watson Jr." },
                        new Company { ID = 11, Name = "A & B FC", Representer = "Diego A. Maradona" },
                        new Company { ID = 12, Name = "ABB LTD", Representer = "Ulrich Spiesshofer" }
                    };

                    // Companies as registered on the second web site.
                    var list2 = new List<Company>
                    {
                        new Company { ID = 20, Name = "A Company Limited", Representer = "Watson Junior" },
                        new Company { ID = 21, Name = "A and B Football Club", Representer = "Diego Armando Maradona" },
                        new Company { ID = 22, Name = "GM LTD", Representer = "Mary Barra" }
                    };
                }
            }

    In the examples above, the first company in list1 (ID=10) and the first company in list2 (ID=20) are actually the same company, as a human reader can see.  Likewise, the second company in list1 (ID=11) and the second company in list2 (ID=21) are obviously the same company.

    The last company in list1 (ID=12) and the last company in list2 (ID=22) are entirely different companies.

    Are there any tools in .NET that I can use for this kind of fuzzy search, to find the matches between the two lists?  The lists may differ in length (for example, list1 could have 10 items and list2 could have 20), or they may have the same number of items, as in this example where both have 3.  At least one item appears in both lists, and under some conditions every item may appear in both.  An exact string comparison or a regular-expression search will fail, because the two lists spell the same names differently.



    • Edited by zydjohn Saturday, May 19, 2018 10:30 PM typos
    Saturday, May 19, 2018 10:29 PM

All replies

  • Hi zydjohn,

    Thank you for posting here.

    For fuzzy search, you will need to define your own matching algorithm.

    Please download the source code from the CodeProject article below. It shows a simple implementation of fuzzy string search that you can use as a reference.

    https://www.codeproject.com/Articles/36869/Fuzzy-Search
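    As a rough illustration of what "your own algorithm" could look like, here is a minimal sketch based on Levenshtein edit distance. This is an assumption about a reasonable starting point, not necessarily what the linked article does; the class name FuzzyMatch and the Similarity helper are hypothetical names chosen for this example.

```csharp
using System;

public static class FuzzyMatch
{
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, or substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + cost);
            }
            // Reuse the two row buffers instead of allocating a full matrix.
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Case-insensitive similarity in [0, 1]; 1.0 means identical.
    public static double Similarity(string a, string b)
    {
        a = a.ToLowerInvariant();
        b = b.ToLowerInvariant();
        int longest = Math.Max(a.Length, b.Length);
        return longest == 0 ? 1.0 : 1.0 - (double)Levenshtein(a, b) / longest;
    }
}
```

    Two records could then be treated as candidates for the same company when Similarity exceeds some threshold on both Name and Representer. The threshold has to be tuned against real data; no particular value is implied here.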

    Best Regards,

    Wendy



    Monday, May 21, 2018 8:31 AM
    Moderator
  • This is a hard problem.  If you KNOW the kinds of variations you are likely to see, you can run the strings through a transform that converts "Football Club" to "FC", "Limited" to "Ltd", "and" to "&", and so on.  That way, you can convert both strings to some kind of canonical format before doing the comparison.
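    A minimal sketch of such a transform is shown here. The rule table is hypothetical: it covers only the variants in the question's sample data (the "junior" rule is an extra assumption), and the names NameNormalizer and Canonicalize are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class NameNormalizer
{
    // Hypothetical substitution table; extend it with the variants
    // that actually occur in the real data.
    private static readonly (string Pattern, string Replacement)[] Rules =
    {
        (@"\bfootball club\b", "fc"),
        (@"\blimited\b", "ltd"),
        (@"\bjunior\b", "jr"),
        (@"\band\b", "&"),
    };

    public static string Canonicalize(string name)
    {
        string s = name.ToLowerInvariant();
        foreach (var (pattern, replacement) in Rules)
            s = Regex.Replace(s, pattern, replacement);

        // Drop punctuation other than '&', then collapse runs of whitespace.
        s = Regex.Replace(s, @"[^\w\s&]", "");
        s = Regex.Replace(s, @"\s+", " ").Trim();
        return s;
    }
}
```

    With these rules, "A and B Football Club" and "A & B FC" both reduce to the same string, so an ordinary equality test matches them. Names such as "Diego A. Maradona" and "Diego Armando Maradona" still differ after this step, so a table of known variants only covers part of the problem.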

    Canonicalization won't help with spelling errors, although phonetic algorithms such as Soundex can help with those.
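    For reference, a sketch of American Soundex for a single word follows. The class name Phonetic is hypothetical, and the code assumes plain A to Z input; any other character is skipped.

```csharp
using System.Text;

public static class Phonetic
{
    // Soundex digit for each letter 'A'..'Z'; '0' marks letters that are
    // not coded (the vowels plus H, W, and Y).
    private const string Codes = "01230120022455012623010202";

    public static string Soundex(string word)
    {
        var result = new StringBuilder(4);
        char lastCode = '0';
        foreach (char raw in word.ToUpperInvariant())
        {
            if (raw < 'A' || raw > 'Z') continue;   // skip non-letters
            char code = Codes[raw - 'A'];
            if (result.Length == 0)
            {
                result.Append(raw);                 // the first letter is kept as-is
                lastCode = code;
            }
            else
            {
                // Append a digit unless it repeats the previous code.
                if (code != '0' && code != lastCode) result.Append(code);
                // H and W do not separate duplicate codes; vowels do.
                if (raw != 'H' && raw != 'W') lastCode = code;
            }
            if (result.Length == 4) break;
        }
        // Pad short codes with zeros; an input with no letters yields "".
        return result.Length == 0 ? "" : result.ToString().PadRight(4, '0');
    }
}
```

    Words that sound alike get the same four-character code, so "Robert" and "Rupert" both map to R163. Soundex works on one word at a time and was designed for English surnames, so multi-word company names would need to be split into words and compared word by word.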


    Tim Roberts, Driver MVP Providenza & Boekelheide, Inc.

    Monday, May 21, 2018 6:28 PM