locked
Database Design Help for a Search Engine RRS feed

  • Question

  • Hi,

    I am trying to create a database design and want some suggestions from experts. Actually it is a manual form of Full Text Search System.

    What we have is a list of Users probably between 100,000 to 300,000. Each user has a list of one or more keywords for example Developer, Architect, System Analyst, etc.

    We want a search engine that should bring out the users on a particular query. So far I have created a table as,

    User_ID    int             User_Name    varchar(255)              Word    varchar(30)
    ----------------------------------------------------------------------------------------------------
    1                               Abc                                               Developer
    1                               Abc                                               Programmer
    1                               Abc                                               System
    1                               Abc                                               Analyst

    This table has a composite primary key User_ID and Word.

    Note that I have split the System Analyst Keyword into 2 words this because if someone tries to search like Analyst System or System Analyst both should return the same result.

    What I have done is when user gives me a keyword I split it into words and searches in table.

    Query:

     

    DECLARE @tblWords table(Word varchar(30) not null primary key);
    
    INSERT INTO @tblWords(Word)
    	SELECT 'Developer'
    	UNION ALL
    	SELECT 'System'
    	UNION ALL
    	SELECT 'Analyst'
    
    SELECT	TOP 10 [User_ID], Min([User_Name]) AS UserName
    FROM	dbo.tbl_UserWords
    WHERE	Word IN (SELECT Word FROM @tblWords)
    GROUP BY [User_ID]
    HAVING Count([User_ID]) = 3
    ORDER BY UserName
    

    I have created suggested indices by Tuning Advisor which are 

    1) Word (asc)
    2) Word (asc), User_ID (asc) include (User_Name)

    The speed of query is good but we haven't tested it on a huge scale plus when I look in the Executed Query Plan,

    1) Index Seek on table and clustered index scan on temp. table, where Word = scalar operation      returns 2160 rows
    2) Nested Loops (inner join)  with the temporary table again returns 2160 rows
    3) Sort on User_ID return 2160 rows
    4) Stream Aggregate   output:  Group by User_ID  |  Min(User_name)   |   Total Words count            rows = 2041
    5) Compute Scalar
    6) Filter  Count(User_ID) = 3   returns 117 rows
    7) Sort (Top N Sort)   Order By User_Name  returns 10 rows
    8) Select

    Problems:

    1) As we can see the Query Plan index seek is fetching 2160 rows on first step which can be a larger in some cases
    2) This will be implemented on a website which will have 1000s of queries per minute and obviously the number of users will grow day by day.

    Question:

    Does anyone have a better idea?

    Thanks

     

    Saturday, February 12, 2011 12:44 PM

Answers

  • Hi

    So you want to return a user who is developer, system or analyst  or all of them? Instead of IN predicate I would probably use JOIN  but test it before..

    Actually having search table (table variable) is  a good idea to solve such problems .See Joe Celko approaches to deal with such problems

    CREATE TABLE PilotSkills
    (pilot CHAR(15) NOT NULL,
     plane CHAR(15) NOT NULL,
     PRIMARY KEY (pilot, plane));

    PilotSkills
    pilot    plane
    =========================
    'Celko'    'Piper Cub'
    'Higgins'  'B-52 Bomber'
    'Higgins'  'F-14 Fighter'
    'Higgins'  'Piper Cub'
    'Jones'    'B-52 Bomber'
    'Jones'    'F-14 Fighter'
    'Smith'    'B-1 Bomber'
    'Smith'    'B-52 Bomber'
    'Smith'    'F-14 Fighter'
    'Wilson'   'B-1 Bomber'
    'Wilson'   'B-52 Bomber'
    'Wilson'   'F-14 Fighter'
    'Wilson'   'F-17 Fighter'

    CREATE TABLE Hangar
    (plane CHAR(15) NOT NULL PRIMARY KEY);

    Hangar
    plane
    =============
    'B-1 Bomber'
    'B-52 Bomber'
    'F-14 Fighter'

    PilotSkills DIVIDED BY Hangar
    pilot
    =============================
    'Smith'
    'Wilson'

    SELECT DISTINCT pilot 
      FROM PilotSkills AS PS1
     WHERE NOT EXISTS
           (SELECT *
              FROM Hangar
             WHERE NOT EXISTS
                   (SELECT *
                      FROM PilotSkills AS PS2
                     WHERE (PS1.pilot = PS2.pilot)
                       AND (PS2.plane = Hangar.plane)));
    ---
    SELECT PS1.pilot
       FROM PilotSkills AS PS1, Hangar AS H1
      WHERE PS1.plane = H1.plane
      GROUP BY PS1.pilot
     HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar);

     


    Best Regards, Uri Dimant SQL Server MVP http://dimantdatabasesolutions.blogspot.com/ http://sqlblog.com/blogs/uri_dimant/
    Sunday, February 13, 2011 8:01 AM

All replies

  • Hi

    So you want to return a user who is developer, system or analyst  or all of them? Instead of IN predicate I would probably use JOIN  but test it before..

    Actually having search table (table variable) is  a good idea to solve such problems .See Joe Celko approaches to deal with such problems

    CREATE TABLE PilotSkills
    (pilot CHAR(15) NOT NULL,
     plane CHAR(15) NOT NULL,
     PRIMARY KEY (pilot, plane));

    PilotSkills
    pilot    plane
    =========================
    'Celko'    'Piper Cub'
    'Higgins'  'B-52 Bomber'
    'Higgins'  'F-14 Fighter'
    'Higgins'  'Piper Cub'
    'Jones'    'B-52 Bomber'
    'Jones'    'F-14 Fighter'
    'Smith'    'B-1 Bomber'
    'Smith'    'B-52 Bomber'
    'Smith'    'F-14 Fighter'
    'Wilson'   'B-1 Bomber'
    'Wilson'   'B-52 Bomber'
    'Wilson'   'F-14 Fighter'
    'Wilson'   'F-17 Fighter'

    CREATE TABLE Hangar
    (plane CHAR(15) NOT NULL PRIMARY KEY);

    Hangar
    plane
    =============
    'B-1 Bomber'
    'B-52 Bomber'
    'F-14 Fighter'

    PilotSkills DIVIDED BY Hangar
    pilot
    =============================
    'Smith'
    'Wilson'

    SELECT DISTINCT pilot 
      FROM PilotSkills AS PS1
     WHERE NOT EXISTS
           (SELECT *
              FROM Hangar
             WHERE NOT EXISTS
                   (SELECT *
                      FROM PilotSkills AS PS2
                     WHERE (PS1.pilot = PS2.pilot)
                       AND (PS2.plane = Hangar.plane)));
    ---
    SELECT PS1.pilot
       FROM PilotSkills AS PS1, Hangar AS H1
      WHERE PS1.plane = H1.plane
      GROUP BY PS1.pilot
     HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar);

     


    Best Regards, Uri Dimant SQL Server MVP http://dimantdatabasesolutions.blogspot.com/ http://sqlblog.com/blogs/uri_dimant/
    Sunday, February 13, 2011 8:01 AM
  • Thanks for reply Uri,

    Yes I want to extract out the users having all the given words written with them. For example if we gave Developer and Designer 2 words. And a particular user has only Developer written with him. I don't want that particular user to be extracted out in the results, I want those users who have both developer and designer attached with them.

    I have already tried joining the Temporary Table with the main table but the Query Plan is same.

    Your idea seems to be a bit more resource hungry. Plus you are suggesting more than 1 table to be used.

    What I have done already is used a de-normalized form for faster response. This is actually kinda Search Engine Index which I have created.

    Do you know any article / book of Joe in which he has discussed such scenarios.

    My goal is to extract out only particular records in first place rather sorting filtering, etc. as we have seen in the Query Plan.

    Monday, February 14, 2011 7:20 AM
  • No,but having additional table is general approach which has no problem in terms of perfomance
    Best Regards, Uri Dimant SQL Server MVP http://dimantdatabasesolutions.blogspot.com/ http://sqlblog.com/blogs/uri_dimant/
    Monday, February 14, 2011 8:39 AM
  • Obviously we will be adding more tables in normal/general case and doing joins etc. But my problem is a bit different as,

    1) Number of rows (data) is huge. Literally it may cross 10,000,000 rows
    2) Queries per second are also big

    User wants a faster response like normal search engines do.

    Thanks for giving your time.

    Monday, February 14, 2011 9:58 AM
  • Hi

    There is no miracle, you will have to define proper indexes to speed up the queries.


    Best Regards, Uri Dimant SQL Server MVP http://dimantdatabasesolutions.blogspot.com/ http://sqlblog.com/blogs/uri_dimant/
    Monday, February 14, 2011 10:15 AM
  • I have now modified my Query and have added Inner Hash Join Query Hint.

    Now I am creating the Dynamic Query on runtime when user gives me the words and finally my Query looks like this

    SELECT	P.[User_ID], P.[User_Name]
    FROM	dbo.tbl_UserWords AS P
    		INNER HASH JOIN (
    			SELECT	[User_ID]
    			FROM	dbo.tbl_UserWords
    			WHERE	Word = 'System'
    					) AS S ON P.[User_ID] = S.[User_ID]
    		INNER HASH JOIN (
    			SELECT	[User_ID]
    			FROM	dbo.tbl_UserWords
    			WHERE	Word = 'Analyst'
    					) AS A on S.[User_ID] = A.[User_ID]
    WHERE	Word = 'Developer'
    ORDER BY [User_Name]
    

    Before creating the dynamic query I first check the minimum number of records in the given words. For example in above case I got min. records in word Developer so I put it in the outer loop and then System and then Analyst.

    According to executed Query Plan, database engine first extracts out User_IDs of Word Developer then hash match with User_IDs of word System and then with User_IDs of word Analyst. After this extraction at the end sort is performed on the extracted result set.

    Any comments?

    Thanks

    Monday, February 21, 2011 10:47 AM