SQL Server Developer Center > SQL Server Forums > Data Mining > Is data mining techniques better than using fulltext , soundex , pattern to search records? heres my scenario, please advice.

Answered Is data mining techniques better than using fulltext , soundex , pattern to search records? heres my scenario, please advice.

  • Friday, July 30, 2010 1:21 AM
     
     
    <!-- /* Font Definitions */ @font-face {font-family:Wingdings; panose-1:5 0 0 0 0 0 0 0 0 0; mso-font-charset:2; mso-generic-font-family:auto; mso-font-pitch:variable; mso-font-signature:0 268435456 0 0 -2147483648 0;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {mso-style-parent:""; margin:0in; margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:12.0pt; font-family:"Times New Roman"; mso-fareast-font-family:"Times New Roman";} p {mso-margin-top-alt:auto; margin-right:0in; mso-margin-bottom-alt:auto; margin-left:0in; mso-pagination:widow-orphan; font-size:12.0pt; font-family:"Times New Roman"; mso-fareast-font-family:"Times New Roman";} @page Section1 {size:8.5in 11.0in; margin:1.0in 1.25in 1.0in 1.25in; mso-header-margin:.5in; mso-footer-margin:.5in; mso-paper-source:0;} div.Section1 {page:Section1;} /* List Definitions */ @list l0 {mso-list-id:27487746; mso-list-template-ids:77652700;} @list l0:level1 {mso-level-number-format:bullet; mso-level-text:; mso-level-tab-stop:.5in; mso-level-number-position:left; text-indent:-.25in; mso-ansi-font-size:10.0pt; font-family:Symbol;} @list l1 {mso-list-id:297423348; mso-list-type:hybrid; mso-list-template-ids:1760099356 67698703 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;} @list l1:level1 {mso-level-tab-stop:66.0pt; mso-level-number-position:left; margin-left:66.0pt; text-indent:-.25in;} @list l2 {mso-list-id:1372992100; mso-list-template-ids:-636470380;} @list l2:level1 {mso-level-number-format:bullet; mso-level-text:; mso-level-tab-stop:.5in; mso-level-number-position:left; text-indent:-.25in; mso-ansi-font-size:10.0pt; font-family:Symbol;} ol {margin-bottom:0in;} ul {margin-bottom:0in;} -->

    Hello Gurus,

          I have a scenario wherein a lists of NAMES and Birthday should be matched with records on a table. I believe that data mining is the proper approach for this.

              Table Details

    • The source table/file/lists of names/birthday is around 50,000 records. (this changes depending on the request of users in real time wherein via web, its uploaded to the MSSQL2008 Server, then a process is to be done and sends back the results to the client via web page or may be a text file downloaded on the client. 

    Simultaneously, a total of 10 users can upload and expects a result in a few minutes as much as possible.)

    • The table to search is around 5 Million records and growing by 1 to 2% per month.

          The requirement is that a given a Name and Birthday, the process should find a match using like, pattern, sounddex ,   contains (full text)  etc.

    I think doing this via stored procedure may not be the best approach since this will take much time and resources on the server and it’s a real time request.    

    If real time results may not be possible, atleast an extract that can be given to the client after a few hours is acceptable provided that its not longer than 3 hrs  (i hope!).

    Ex.    Will search for   

              Column 1                            Column 2

              Mike Allen Gusto               Feb 29, 1930

             The result list should find the ff:

    1.      Mike Cruz ….

    2.      Mike villa …

    3.      Victor Allen …

    4.      Allenaida …

    5.      … Feb 10, 1930

    6.      … feb 29, 1930

    7.      Gustos Mickey …

    8.      Gusto, Allen Mike

    9.      Mike Gusto …

    10.  Gust Mike …

    11.  Mr. Gusto …

    12.    … February 29, 1930

    13.    … Feb. 1930

    14.    Mikel Gustos …

         Thank you and hopping for a positive reply soon.   

     

Answers

  • Friday, July 30, 2010 7:42 AM
     
     Answered
    I dont think that this really is an issue of data mining, i will agree that you have a huge dataset but data mining is all about finding hidden patterns this is something you need to check in the list based upon the identifer but due to very large dataset you are trying to aviod this, many optimization techniques like indexing, query optimization, database tuning, partitioning etc and etc are needed to be applied here rather then brining this problem to data mining algos.. its like you are using a gun to kill a lizard. thats my feedback if you like mark it as an answer :) cheers

All Replies

  • Friday, July 30, 2010 7:42 AM
     
     Answered
    I dont think that this really is an issue of data mining, i will agree that you have a huge dataset but data mining is all about finding hidden patterns this is something you need to check in the list based upon the identifer but due to very large dataset you are trying to aviod this, many optimization techniques like indexing, query optimization, database tuning, partitioning etc and etc are needed to be applied here rather then brining this problem to data mining algos.. its like you are using a gun to kill a lizard. thats my feedback if you like mark it as an answer :) cheers
  • Friday, July 30, 2010 8:03 AM
     
     
    Use table partisioning , Then a good query will do it.
    Bineesh Thomas Software Solution Architect/BI Consultant
  • Friday, July 30, 2010 8:11 AM
     
     

    OIC, it may be overkill but its the algos in data mining I want to fully utilize to address this problem. Am a beginner in DataM so am not fully aware of its prowess....

    I would like to do away the approach of using statements such as " Select name from table Where contains(CriteriaAccountHolder, @crit) and contains(CriteriaAccountHolderBday, @bday)  "  then i will also follow another script that will check for "sounds like" "patterns" of the same criteria and so on...  then compiles all the result and sent it back to the client. I believe that even with an optimized table,  this will eat up time and memory.

    So If i can make use of this data mining tool, it may be a more optimal approach. (even though like u said is too much just to kill a lizard hehehe)

    Per Microsoft's on site visit , my db and tables are healthy/ok, its just that am very much challenged about solving the problem using data mining and optimistic that its algos will bring about a much faster and accurate results.

    I Am just not sure how to apply it on my problem. 

    Suggestions/comments please...

    Thanks Much!

     

  • Friday, July 30, 2010 8:18 AM
     
     

    See for Data Mining you would be needing to shape your data + tables in the format which DM Algos could understand and DM is required when you need to find out something which you can't directly see in your Data. So DM would take in a lot of efforts and expertise and will deviate you from the orignal problem.

    Right now you issue can be easily held by designing a solution equally participated by Resource Governor, Query Optimization, Indexing, Tuning, Partioning, Full text search..

    So please don't waste your time and efforts on something which is not gonna end up in something which you have desired.

    SQL queries are run into DataStores having TBs of Data!! so your recordset is not the largest we have on planet.

  • Friday, July 30, 2010 8:34 AM
     
     
    Use table partisioning , Then a good query will do it.
    Bineesh Thomas Software Solution Architect/BI Consultant


    How to do this please...

    Thanks Much!

  • Friday, July 30, 2010 8:37 AM
     
     Proposed Answer

    See for Data Mining you would be needing to shape your data + tables in the format which DM Algos could understand and DM is required when you need to find out something which you can't directly see in your Data. So DM would take in a lot of efforts and expertise and will deviate you from the orignal problem.

    Right now you issue can be easily held by designing a solution equally participated by Resource Governor, Query Optimization, Indexing, Tuning, Partioning, Full text search..

    So please don't waste your time and efforts on something which is not gonna end up in something which you have desired.

    SQL queries are run into DataStores having TBs of Data!! so your recordset is not the largest we have on planet.


    ok, noted.  thank you for this info, atleast now i have a deeper understanding on when to use DM. thank you very much.
    • Proposed As Answer by jibran jamshad Friday, July 30, 2010 8:49 AM
    •  
  • Friday, July 30, 2010 11:34 AM
     
     
    if i have answered so this should be marked as the answer :).. cheers,
  • Monday, August 02, 2010 1:16 AM
     
     
    if i have answered so this should be marked as the answer :).. cheers,

    How do i do this partitioning...

    Is this what it means?

             Select NameOf , Age from Table where NameOf > a and NameOf < m

                Union All

             Select NameOf , Age from Table where NameOf > n and NameOf < z

     

     

    • Marked As Answer by Kira76 Monday, August 02, 2010 1:30 AM
    • Unmarked As Answer by Kira76 Monday, August 02, 2010 4:56 AM
    •  
  • Wednesday, August 11, 2010 6:53 PM
    Answerer
     
     

    Have you looked at the Fuzzy Lookup transformation in SQL Server Integration Services?

    It might do a decent job of finding the best match candidates.

    Here is the link to the Books Online content that describes how to set it up:

    http://msdn.microsoft.com/en-us/library/ms137786(SQL.100).aspx

    But I rather like this article, that walks through a task that is similar to yours:

    http://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services