How to design an optimal-performance database for a name-matching application

  • Question

  • I intend to develop a web-based application which uses SQL Server 2005 at the back end and Visual Studio (.NET 2.0) at the front end.

    The first requirement of the application is as follows:

    Requirement 1: Carry out a search (in SQL Server) for a particular name, entered from the front-end .NET application, against a huge database of about 1 million records.

    Let us consider the exact situation here.

    We have a huge database (about 1 million records) of people involved in loan, credit card, or other kinds of fraud against banks. It is a combined database for all the banks in a country, and each bank contributes its list of defaulters to it.

    Now, if a new customer comes to avail themselves of the services of a bank, the bank wants to ensure that this customer has never appeared in the defaulter list before.

    But one important search criterion for matching against this defaulters' database is that we should be able to carry out a sounds-like search (let me explain with the example below).

    If a person is named "Mohammed Ali", then our application should be able to find variations which sound similar to the person's original name, i.e.

    "Mohammad Ali"

    "Mohamad Ali"

    Banks expect this requirement to be met because a new customer may come to the bank with fake ID proof or forged documents to avail themselves of its services. In this case there will be no predefined unique ID or other identifier for the customer, and the application will have to rely solely on name-matching logic.

     

    My concern is mostly with the performance (response time) of the application, and especially with the database design (for optimal performance), rather than with the logic of the search.

    Tuesday, July 31, 2007 6:32 AM

All replies

  • If response time is your primary non-functional requirement, don't do this in the database. Load all the rows into memory and use a text-matching library.

     

    If you need richer capabilities, try looking at ClearForest [http://www.clearforest.com/]

    Tuesday, July 31, 2007 7:31 AM
  • Thanks, Udi, for your support.

    But do you not think that:

    1) Fetching about 1 million names into memory will incur additional overhead, since it will take a long time to fetch the data from the database (as is often said, data fetching is a costly affair in programming)?

    2) The time required to fetch the names from the database may cause a longer delay than searching the database for a particular name would?

    I would appreciate it if you could back up your thoughts and offer a solution to the concerns mentioned above.

       

    Tuesday, July 31, 2007 9:12 AM
  • I would have to agree with Udi on this.

     

    I worked on a project for a large examination board that wanted to provide results online. They required searching and filtering functionality for over 2 million candidates during the examination period. The performance was achieved by loading the data into an 'in-memory database' structure and developing searching and filtering algorithms on top of it - you have to remember it is cheaper to add more hardware (memory, processors) to a web server than to upgrade a database server.

    Obviously the use of an in-memory database might not be suitable if the data changes frequently, but from what you have outlined it sounds as though the data will be updated on a daily basis, and if you are using SQL Server 2005 you can take advantage of 'Notification Services' to alert you to changes in the data.

    As for the details of the name searching - pattern matching for different spellings etc. - is this extra data already stored in the database, or are you expected to generate it?

     

    HTH

     

    Ollie Riches

     

     

    Tuesday, July 31, 2007 9:49 AM
  • Thank you, Ollie, for giving your time.

    Now I'm quite satisfied with your positive arguments, but I would like to know how much memory you used there for the 2 million records (due to some practical intricacies).

    Although your experience stands, the situation is more demanding here. I need your opinion on the points mentioned below.

    1) We want to apply the search to each attribute of the record, i.e. we apply pattern matching to the name attributes (salutation, first name, middle name and last name), and a simple search can also be applied (depending on the end user) to the rest of the attributes (passport number, age, DOB, driving licence number, etc.).

    Queries for these extra attributes are appended with an "AND" clause to the query written for name matching.

    So it can become quite hefty, and memory requirements can increase many-fold.

    In your case it could have been simple integer values (i.e. students' roll numbers) that needed searching, which should not have posed any difficulty.

    2) Yes, our database will be updated at the end of each day.

    3) Yes, this extra data is already stored in the database, as you asked.

    Tuesday, July 31, 2007 11:07 AM
  • Hi,

     

    First off, 1,000,000 records isn't what I'd call a huge database. It's a fair number of records, but it isn't huge. So if you have a reasonably fast database server with a good network connection to your web server, I'd start off leaving it all in the database. Don't do premature optimization! (Personal experience: I've created SQL Server based customer-search systems for 1.5 million customers and 20 users with no performance problems. You don't say how many users your system will have, so this may not be relevant!)

     

    Where you will have problems is if you have to scan all 1 million records each time. I've just run a quick test on my test database:

    select * from Customer where soundex(CustomerSurname) = soundex('mohammed')

    This returns lots of rows (1,500) and takes about twenty seconds to run on my dual-processor server. This is because SOUNDEX is only useful when you can live with lots of false positives, but it illustrates the problem.

     

    So if it were me, I'd start by storing the SOUNDEX() value (or whatever it is you plan to use for sounds-like) for the fields I was interested in, in an indexed column in the database; this ought to speed up the response times. If this isn't a viable approach, then the in-memory option might be the only way to go. What it boils down to is: can you encode the name information in such a way that you can use the database to search for sound-alike names, rather than looping through 1 million records and checking each one against the new name?

     

    Jackson

     

     

     

    Tuesday, July 31, 2007 11:18 AM
  • Thanks, Jackson, for your support.

    I will first try the inputs you people have provided on my database, and I will let you know the results tomorrow.

     

    Tuesday, July 31, 2007 11:34 AM
  • In our case we were searching on simple string- and integer-based values, and the combination of clauses in the search was minimal, so the memory footprint was relatively small - less than 5 KB per candidate - and the total memory footprint was manageable. The advantage for us was that it scaled very well in a web cluster, because we had a relatively poor database infrastructure compared to the web cluster.

     

    If performance (of searches) is your key indicator of acceptability, then I would suggest testing the two different solutions listed in this thread and making an informed decision.

     

    How many users will be using the system concurrently?

     

    HTH

     

    Ollie Riches

     

    Tuesday, July 31, 2007 12:15 PM
  • Fetching a million rows and loading them into memory is something that you can do when the server starts up, and thus probably won't affect your response times. Also, in terms of total time, this shouldn't take more than a couple of seconds. You also minimize round trips to your DB server this way, thus following the "chunky over chatty" rule. Be aware that indexing the columns of your table will cause that data to be loaded into the memory of the DB server, taking up a similar amount of memory to that which would have been used in the middle tier. If you don't index those columns, you'll find that you'll be performing full table scans on every request - which will hurt your response times significantly.
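
    A minimal sketch of that start-up load, assuming a hypothetical Defaulters table and illustrative column names (neither is from the thread); in ASP.NET the Load call could go in Application_Start:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    // Illustrative record holding only the searchable name parts.
    public class NameRecord
    {
        public int Id;
        public string FirstName;
        public string MiddleName;
        public string LastName;
    }

    public static class NameCache
    {
        // Populated once at application start-up; read-only afterwards.
        public static readonly List<NameRecord> Records = new List<NameRecord>();

        public static void Load(string connectionString)
        {
            // One chunky round trip instead of many chatty ones.
            using (SqlConnection conn = new SqlConnection(connectionString))
            using (SqlCommand cmd = new SqlCommand(
                "SELECT UID, FirstName, MiddleName, LastName FROM Defaulters", conn))
            {
                conn.Open();
                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        NameRecord r = new NameRecord();
                        r.Id = reader.GetInt32(0);
                        r.FirstName = reader.IsDBNull(1) ? "" : reader.GetString(1);
                        r.MiddleName = reader.IsDBNull(2) ? "" : reader.GetString(2);
                        r.LastName = reader.IsDBNull(3) ? "" : reader.GetString(3);
                        Records.Add(r);
                    }
                }
            }
        }
    }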

    Tuesday, July 31, 2007 3:17 PM
  • First, I would suggest separating out the different kinds of searches/filtering at the UI level as well. Filter by the other attributes after the first text matching part.

     

    Second, the amount of data in [Salutation, First Name, Middle Name, Last Name] will probably come in under 1 KB per record, even in Unicode, so there is no problem keeping that data for all one million records resident in memory. If the whole record comes in under 2 KB, then keeping everything in memory runs in under 2 GB.

     

    Let me know if you have any other questions.

    Tuesday, July 31, 2007 3:27 PM
  • Hello, Udi.

    As you suggested, yes, we are separating searches/filters at the UI level and filtering on the rest of the attributes after we find something in the name columns [First Name, Middle Name, Last Name].

    UID | First Name | Middle Name | Last Name
    ----+------------+-------------+----------
    001 | Abdul      |             | Khan
    002 | Mohammad   | Abdul       | Khan
    003 | Khan       | Abdul       |
    004 | Haneef     | Abdul       | Khan

    Here I'm giving a specimen copy of the actual data table (I have dropped the salutation for the sake of simplicity). If I enter

    First Name:      Mohammad

    Middle Name:   Abdul

    Last Name:      Khan

    from my front-end .NET application, then all four records should be found by the application. As per the search criteria, we need to match a minimum of two name parts, i.e. one of the pairs (first name, middle name), (first name, last name) or (last name, middle name); but if all three match, that is the best match.

     

    The application should also be able to find any permutation applied to the name parts, i.e. if we consider the case where I entered from the front end:

    First Name:      Mohammad

    Middle Name:   Abdul

    Last Name:      Khan

    then the record

    003 | Khan | Abdul |

    should be displayed.

     

    In essence, we need to display any person's name which has at least two parts in common with the name entered for the search, irrespective of the position of the first, middle and last names.
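
    Expressed as code, that two-of-three rule might look like this minimal sketch (reusing the illustrative NameRecord type from the earlier in-memory example; exact matching only, before any sounds-like handling is added):

    using System;
    using System.Collections.Generic;

    public static class NameMatcher
    {
        // True when at least two of the entered name parts appear anywhere
        // among the record's parts, irrespective of position.
        public static bool MatchesTwoOfThree(NameRecord record,
            string first, string middle, string last)
        {
            HashSet<string> recordParts = new HashSet<string>(
                StringComparer.OrdinalIgnoreCase)
                { record.FirstName, record.MiddleName, record.LastName };
            recordParts.Remove("");          // ignore empty name parts

            int hits = 0;
            foreach (string part in new[] { first, middle, last })
                if (!string.IsNullOrEmpty(part) && recordParts.Contains(part))
                    hits++;

            return hits >= 2;                // all three matching is the best match
        }
    }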

     

    Now my question is: do you suggest some technique (algorithm) which can handle this mess efficiently? (Although there is more dirt we need to handle in this application.)

     

    As you suggested loading all the name fields into memory at server initialization, I would like to know whether a plain record set would be sufficient, or whether we should go for some custom data structure (e.g. a binary tree) for better performance.

     

    Wednesday, August 1, 2007 5:23 AM
  • From the example you give above, it looks like DataSets & DataTables will provide the searching and filtering functionality you require. In case you didn't know, the System.Data.DataTable class has a 'Select' method that allows you to filter the rows in memory.

     

    check out:

    http://msdn2.microsoft.com/en-us/library/way3dy9w.aspx

     

    If this works for you, then all you need is to build up a set of static filtering criteria, or even allow dynamic querying of the data.
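
    A small self-contained sketch of that 'Select' filtering (the column names follow the specimen table earlier in the thread and are assumptions):

    using System;
    using System.Data;

    class DataTableFilterDemo
    {
        static void Main()
        {
            DataTable names = new DataTable("Defaulters");
            names.Columns.Add("UID", typeof(int));
            names.Columns.Add("FirstName", typeof(string));
            names.Columns.Add("MiddleName", typeof(string));
            names.Columns.Add("LastName", typeof(string));
            names.Rows.Add(2, "Mohammad", "Abdul", "Khan");
            names.Rows.Add(3, "Khan", "Abdul", null);

            // 'Select' evaluates the filter expression in memory against every row.
            DataRow[] hits = names.Select(
                "FirstName = 'Khan' OR MiddleName = 'Khan' OR LastName = 'Khan'");

            Console.WriteLine(hits.Length);  // 2: both rows contain 'Khan' somewhere
        }
    }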

     

    HTH

     

    Ollie Riches

     

     

    Wednesday, August 1, 2007 8:28 AM
  • In the database you could take the name fields that you want to search on and normalize them out to a separate table, so you would end up with a data structure like this:

     

    Customer

    001 | 1 | NULL | 2
    002 | 3 | 1    | 2
    003 | 2 | 1    | NULL

    CustomerName

    001 | 1 | Abdul
    002 | 2 | Mohammad
    003 | 3 | Khan

     

    Then a select on the CustomerName table will find all possible customers, and if you group by the customer ID and count the number of matches, you should find all customers that match two or more of the given names.

     

    Something like: select CustomerId from CustomerName where CustomerName in ('first', 'second', 'last') group by CustomerId having count(CustomerId) >= 2 - this hasn't been tested, and it's very rare that I get SQL right the first time.

     

    I'd be tempted to keep the names in the Customer table as well as in the CustomerName table - it uses some extra storage but simplifies the SQL on the Customer table when you are not searching. (A few triggers to keep the two tables in step, perhaps?)

     

    If you do go with the in-memory solution, then this kind of reverse data structure may be what you want to build: a Dictionary keyed on the name, linked to a SortedList that holds the IDs of the customers. You look up the two or more names and look for overlap in the sorted lists (a set would be better; you'd just need the intersections of the sets of customer IDs).
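
    A rough sketch of that reverse index, counting how many of the entered parts each customer matches (all names here are illustrative):

    using System;
    using System.Collections.Generic;

    public class NameIndex
    {
        // Reverse index: name part -> set of customer IDs containing that part.
        private readonly Dictionary<string, HashSet<int>> index =
            new Dictionary<string, HashSet<int>>(StringComparer.OrdinalIgnoreCase);

        public void Add(int customerId, params string[] nameParts)
        {
            foreach (string part in nameParts)
            {
                if (string.IsNullOrEmpty(part)) continue;
                HashSet<int> ids;
                if (!index.TryGetValue(part, out ids))
                    index[part] = ids = new HashSet<int>();
                ids.Add(customerId);
            }
        }

        // Customer IDs matching at least minMatches of the entered parts.
        public List<int> Search(string[] enteredParts, int minMatches)
        {
            Dictionary<int, int> counts = new Dictionary<int, int>();
            foreach (string part in enteredParts)
            {
                HashSet<int> ids;
                if (part == null || !index.TryGetValue(part, out ids)) continue;
                foreach (int id in ids)
                {
                    int c;
                    counts.TryGetValue(id, out c);
                    counts[id] = c + 1;
                }
            }
            List<int> result = new List<int>();
            foreach (KeyValuePair<int, int> kv in counts)
                if (kv.Value >= minMatches) result.Add(kv.Key);
            return result;
        }
    }

    For the specimen data, index.Add(3, "Khan", "Abdul", null) followed by Search(new[] { "Mohammad", "Abdul", "Khan" }, 2) would return customer 3, since two of the three entered parts appear in that name.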

     

     

     

     

    Wednesday, August 1, 2007 8:44 AM
  • As Udi mentioned, one option is to cache all the names on the server.
    Someone suggested Soundex - Soundex is nice but limited (e.g. it is highly dependent on the first letter) and it is not tuned for particular languages.
    You may want to look at something like Celatro (http://www.celatro.com), which has a .NET library that uses fuzzy matching with language-specific enhancements. They even have an Arabic name-matching demo online (which seems to be what you are looking for), as well as a short white paper: http://www.celatro.com/docs/Celatro_ANS.pdf

    Also, depending on your time-to-market, you may want to check out LINQ for filtering and searching through the cached data.

    In addition, you may want to try to clean the list and eliminate duplicates offline (to make it shorter). See http://msdn.microsoft.com/msdnmag/issues/05/09/sqlserver2005/default.aspx on MSDN for an explanation of using fuzzy lookup with SSIS.

    HTH
    Arnon
    Wednesday, August 1, 2007 9:01 AM
  • Hello Jackson,

    At least theoretically we find your design good, but in the "CustomerName" table, how can we maintain the customer ID while storing each part of the name separately? (Please let me know whether I have correctly perceived what you intended with your design.)

    After reading your query, I think your CustomerName table structure could be something like this:

    CustomerName

    001 | Abdul
    001 | Mohammad
    002 | Khan
    002 | Abdul

    Then we will try to run the query you suggested.

    Write back to me if I got it right, or let me know if you intended otherwise.

     

    Wednesday, August 1, 2007 10:36 AM
  • Thanks, Arnon.

    We have gone through the Celatro website but couldn't find a .NET library available for download to test its capabilities. Do such trial libraries exist, or do we need to purchase first?

    Wednesday, August 1, 2007 10:44 AM
  • Hi,

    1. I would suggest you explore the full-text search capabilities available in SQL Server (see Full-Text Search in SQL Server Books Online). With full-text search you may search for the inflectional forms of words, proximity terms, etc. Full-text queries can also use a thesaurus to find synonyms of search terms. For each supported language there is a single thesaurus file you may easily modify (it is in an XML format). I don't know what the limitations of these files are, but I think it may be a good substitute for SOUNDEX.

    Since the performance of full-text search in SQL Server is much better than the table scan caused by using SOUNDEX in the WHERE clause of your SQL statement, I would definitely give the full-text search technology a try.
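
    As an illustration, assuming a full-text index already exists on a hypothetical Defaulters table and the thesaurus file maps the spelling variants (all object names here are assumptions), such a query could be issued from .NET like this:

    using System;
    using System.Data.SqlClient;

    public static class FullTextSearchDemo
    {
        public static void FindVariants(string connectionString)
        {
            using (SqlConnection conn = new SqlConnection(connectionString))
            // FORMSOF(THESAURUS, ...) expands the term via the language's thesaurus file.
            using (SqlCommand cmd = new SqlCommand(
                "SELECT UID, FirstName, MiddleName, LastName FROM Defaulters " +
                "WHERE CONTAINS(FirstName, 'FORMSOF(THESAURUS, Mohammad)')", conn))
            {
                conn.Open();
                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        Console.WriteLine("{0} {1}", reader[1], reader[3]);
                }
            }
        }
    }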

     

    2. If you still use SOUNDEX, be aware of its key drawback: the input string must be contiguous with no spaces, and if the first character of the input string is not correct, the probability of a match being made drops to zero. Look at the following article for more details: http://msdn.microsoft.com/msdnmag/issues/05/09/sqlserver2005/default.aspx

     

    3. I agree with Arnon that LINQ may be a very good in-memory alternative for this task. You may also consider writing a custom CLR function in SQL Server 2005 that implements one of the SOUNDEX-style algorithms.
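
    A bare-bones sketch of such a CLR scalar function (the encoding below is a simplified classic Soundex, purely illustrative; a production version would use a better-tuned phonetic algorithm):

    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public class PhoneticFunctions
    {
        // Deterministic, so the result can be stored and indexed in a column.
        [SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlString SimpleSoundex(SqlString name)
        {
            if (name.IsNull || name.Value.Length == 0) return SqlString.Null;
            string s = name.Value.ToUpperInvariant();

            // Soundex digit class for each letter A..Z ('0' = ignored).
            const string map = "01230120022455012623010202";

            char[] result = { s[0], '0', '0', '0' };
            char prev = (s[0] >= 'A' && s[0] <= 'Z') ? map[s[0] - 'A'] : '0';
            int pos = 1;
            for (int i = 1; i < s.Length && pos < 4; i++)
            {
                char c = s[i];
                char code = (c >= 'A' && c <= 'Z') ? map[c - 'A'] : '0';
                if (code != '0' && code != prev) result[pos++] = code;
                prev = code;
            }
            return new SqlString(new string(result));
        }
    }

    Once deployed with CREATE ASSEMBLY and CREATE FUNCTION, its output can be kept in an indexed column - the same pre-computation idea Jackson suggested above.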

     

    Wednesday, August 1, 2007 12:56 PM
  • Gagan

     

    The CustomerName table is just a lookup table - the idea is to trade off disk space for time: you use more disk space because you're storing extra data, but your queries run much faster. The table structure has to allow the same CustomerId to appear several times, linked to several names. To maintain this, I would create some triggers that insert/update/delete rows in the CustomerName table whenever the Customer table is updated. That way everything stays in step.

     

    There are lots of different ways the tables could be structured, and I think the way you propose would work. You might have problems if someone has the same name twice (http://en.wikipedia.org/wiki/Major_Major_Major_Major), but then so would my solution. A better design might be to add a sequence to the CustomerName table to ensure a unique primary key, and to create a (clustered?) index on the CustomerName column. I'd use something like this:

     

    CREATE TABLE [dbo].[CustomerName](
        [CustomerNameId] [int] IDENTITY(1,1) NOT NULL,
        [CustomerID] [int] NOT NULL,
        [CustomerName] [varchar](50) NOT NULL,
        CONSTRAINT [PK_CustomerName] PRIMARY KEY NONCLUSTERED
        (
            [CustomerNameId] ASC
        ) ON [PRIMARY]
    ) ON [PRIMARY]

    CREATE CLUSTERED INDEX [IX_CustomerName] ON [dbo].[CustomerName]
    (
        [CustomerName] ASC
    ) ON [PRIMARY]

     

     

    Wednesday, August 1, 2007 1:01 PM
  •  Gagan566188 wrote:

     

    Thanks, Arnon.

    We have gone through the Celatro website but couldn't find a .NET library available for download to test its capabilities. Do such trial libraries exist, or do we need to purchase first?

     

    Why not just contact them and ask? :)

     

    Arnon

    Wednesday, August 1, 2007 7:08 PM