Answered Large Lookup table

  • Saturday, March 09, 2013 12:06 PM
     
     

    Hi all, I just wanted to ask some advice please.

    I have an address lookup table containing all UK addresses (approx 28 million) , we will have a number of concurrent queries accessing this table to match an address against 3 fields (e.g. Postcode, town, street), these fields form a PK on the lookup table.

    A file containing up to  60k records will be processed by creating multiple threads which will issue a query via a stored proc to match against the lookup table for each address in the file.

    I would like to know the best approach to optimise access times on this table.

    I am currently considering some sort of partitioning on the table based on the first two characters of the post code which will give aprrox 125 partitions each having about 220K addresses.

    Many thanks for your help in advance!

All Replies

  • Saturday, March 09, 2013 12:22 PM
    Moderator
     
     

    For quick assistance post query & DDL.

    Optimization article:

    http://www.sqlusa.com/articles/query-optimization/

    What is @@version?  What is the HW platform?  Disk configuration?


    Kalman Toth Database & OLAP Architect
    Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012

  • Saturday, March 09, 2013 12:44 PM
     
      Has Code

    Hi, I'm not too sure what disk configuration is as this is a greenfield project so we can change it for the most approopiate design, we are using SQL2008R2.

    It is just a simple table and select proc,

    CREATE TABLE [dbo].[Address](
    
          [Line1] [varchar](500) NOT NULL,
    
          [Town] [varchar](250) NOT NULL,
    
          [Postcode] [varchar](10) NOT NULL,
    
    )
    
    
    SET ANSI_NULLS ON
    GO
    
    SET QUOTED_IDENTIFIER ON
    GO
    
    CREATE PROCEDURE [dbo].[AddressSearch]
         @pCurrentUserLoginID bigint
    	,@pPostcode	varchar(100)
    	,@pTown varchar(100)
    	,@Line1 varchar(100)
    --------------------------------------------------------------------------------------------------------------------------------
    --   Address search
    --   Any changes made manually will be lost next time this file is regenerated.
    --------------------------------------------------------------------------------------------------------------------------------
    AS
    BEGIN
    	SET NOCOUNT ON;
    
        BEGIN TRY
    
    		SELECT *
    		FROM Address
    		WHERE Postcode = @pPostcode
    	      AND Town = @pTown
    		  AND Line1 = @Line1
    		
        END TRY
        
        BEGIN CATCH
        
            -- Write error to ErrorLog table
            EXEC [dbo].[usp_ErrorLog_Insert] @pCurrentUserLoginID;
        
            RETURN(1);
        
        END CATCH;
    
        RETURN(0);
    
    END
    
    GO
    Thanks!



  • Saturday, March 09, 2013 12:56 PM
     
     

    Can you give few more details like

    what is the operation done on the Lookup table with 60k records, means the select query on Lookup table or modifications are happening on the table?

    Also the what is the order of the PK and does it mataches with the join condition used in the match?

    Instead of issuing the query between the file records and the loopkup table , can copy the data to a staging table and try querying with the lookup table.


    Thanks Sarat --Please use Marked as Answer if my post solved your problem and use Vote As Helpful if the post was useful.

  • Saturday, March 09, 2013 3:40 PM
     
     

    Amended

    Hi, essentialy the process is thus, a csv file is loaded into memory in a .net web app, and the app loops through the csv data calling the look up proc for each record which it then essentially returns a flag to show whether or not the address is valid (i.e. matches exactly with the lookup data).

    Just to mention that web appp calls are done using multi threads.

    Any other info required please let us know.

    Many thanks!


    • Edited by rockyboy1 Saturday, March 09, 2013 4:07 PM
    •  
  • Saturday, March 09, 2013 4:52 PM
     
     

    It is possible that partitioning over multiple disk can give better scale, but for 60000 lookups, I am not sure it is worth the pain.

    What is worth looking at is to change the procedure from just returning the address for a single triplet, is to have to accept a set of data. Since you are on SQL 2008, the closest at hand is a table variable:

    CREATE TYPE addresssearchtype AS TABLE
       (postcode varchar(100) NOT NULL,
        town     varchar(100) NOT NULL,
        line1    varchar(100) NOT NULL)

    go
    CREATE PROCEDURE AddressSearchMulti
       @pCurrentUserLoginID bigint,
       @searchdata          addresssearchtype READONLY AS
    BEGIN TRY
       SELECT A.*
       FROM   Addresses A
       WHERE  EXISTS (SELECT *
                      FROM   @searchdata sd
                      WHERE  sd.postcode = A.Postcode
                        AND  sd.town     = A.Town
                        AND  sd.line1    = A.Line1)
    END TRY

    Now you only need one roundtrip to the server, which shold be more efficient. Essentially, you can make a single call, and no need for multiple threads.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
  • Saturday, March 09, 2013 5:02 PM
     
     

    Hi, thanks for the reply, apologies but I think I may not of been clear with the process.

    The issue is that the web app has already been developed however there are serious performance issues as it is having to call a proc for each of the 60K records in the csv file which has been loaded into memory and match against the 28Million addres records in the lookup table.

    Thanks

  • Saturday, March 09, 2013 5:07 PM
     
     Answered Has Code

    Instead of doing  row-by-row validation, you can dump the CSV file in a table and match it against the lookup data by joining on the three columns. This can help you update a flag column once and then query against that flag column for invalid addresses.

    CREATE TABLE [dbo].[CSVDump](
    [Line1] [varchar](500) NOT NULL,
    [Town] [varchar](250) NOT NULL,
    [Postcode] [varchar](10) NOT NULL,
    [IsValid] BIT NULL
    )
    GO
    
    UPDATE d
    SET d.[IsValid] = 1
    FROM [dbo].[CSVDump] AS d
    INNER JOIN [dbo].[Address] AS a
    ON d.[Postcode] = a.[Postcode]
    AND d.[Town] = a.[Town]
    AND d.[Line1] = a.[Line1]

    Please mark this reply as the answer or vote as helpful, as appropriate, to make it useful for other readers.
    Thanks!
    Aalam | Blog (http://aalamrangi.wordpress.com)

    • Marked As Answer by rockyboy1 Saturday, March 09, 2013 5:22 PM
    •  
  • Saturday, March 09, 2013 5:08 PM
     
     

    Sounds like the rewrite with a table variable is exactly what you need.

    What analysis have you performed to come to the conclusion that partitioning the table would help? Partitioning would really only help if you spread out the data over multiple I/O channels, of which you may have more than one, but not 125.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
  • Saturday, March 09, 2013 5:23 PM
     
     

    Thanks for the suggestions, I will discuss amending the row by row process with the others and see if its not too late to amend the solution.

    Thanks again!

  • Saturday, March 09, 2013 5:36 PM
     
      Has Code

    You didn't post the full table DDL with indexes, or indicate what the performance problems are, so all we can do is guess.  It would help if you could tell us how big the table is, how much memory your database sever has, what the STATISTICS_IO and STATISTICS_TIME and query plan are for running this stored procedure, and whether the performance problems relate to CPU utilization or disk waits.

    Partitioning by postal code may help if you introduce enough partitions to reduce the number of levels in the PK index from 4 to 3.  It also might help to make the PK a non-clustered index, if there are a lot of non-key columns on the table, and ordering of the key columns may make a difference.

    If you are doing individual address lookups, you can optimize the traffic to and from the database.  Returning a SELECT * single-row result is less efficient than using an output parameter.  EG:

    CREATE PROCEDURE [dbo].[AddressSearch]
         @pCurrentUserLoginID bigint
    	,@pPostcode	varchar(100)
    	,@pTown varchar(100)
    	,@Line1 varchar(100)
            ,@IsValid bit out
    --------------------------------------------------------------------------------------------------------------------------------
    --   Address search
    --   Any changes made manually will be lost next time this file is regenerated.
    --------------------------------------------------------------------------------------------------------------------------------
    AS
    BEGIN
    	SET NOCOUNT ON;
    
        BEGIN TRY
        
        if exists
        (
    		  SELECT *
    		  FROM Address
    		  WHERE Postcode = @pPostcode
    	        AND Town = @pTown
    		    AND Line1 = @Line1
        )
        begin
          set @IsValid = 1;
        end
        else
        begin
          set @IsValid = 0;
        end;
        return 0;
    		
        END TRY
        
        BEGIN CATCH
        
            -- Write error to ErrorLog table
            EXEC [dbo].[usp_ErrorLog_Insert] @pCurrentUserLoginID;
        
            RETURN(1);
        
        END CATCH;
    
    
    
    END
    

     


    David http://blogs.msdn.com/b/dbrowne/