Large Lookup table
-
Saturday, March 09, 2013 12:06 PM
Hi all, I just wanted to ask some advice please.
I have an address lookup table containing all UK addresses (approx 28 million) , we will have a number of concurrent queries accessing this table to match an address against 3 fields (e.g. Postcode, town, street), these fields form a PK on the lookup table.
A file containing up to 60k records will be processed by creating multiple threads which will issue a query via a stored proc to match against the lookup table for each address in the file.
I would like to know the best approach to optimise access times on this table.
I am currently considering some sort of partitioning on the table based on the first two characters of the post code which will give aprrox 125 partitions each having about 220K addresses.
Many thanks for your help in advance!
All Replies
-
Saturday, March 09, 2013 12:22 PMModerator
For quick assistance post query & DDL.
Optimization article:
http://www.sqlusa.com/articles/query-optimization/
What is @@version? What is the HW platform? Disk configuration?
Kalman Toth Database & OLAP Architect
Paperback / Kindle: Pass SQL Exam 70-461 & Job Interview: Programming SQL Server 2012 -
Saturday, March 09, 2013 12:44 PM
Hi, I'm not too sure what disk configuration is as this is a greenfield project so we can change it for the most approopiate design, we are using SQL2008R2.
It is just a simple table and select proc,
CREATE TABLE [dbo].[Address]( [Line1] [varchar](500) NOT NULL, [Town] [varchar](250) NOT NULL, [Postcode] [varchar](10) NOT NULL, ) SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO CREATE PROCEDURE [dbo].[AddressSearch] @pCurrentUserLoginID bigint ,@pPostcode varchar(100) ,@pTown varchar(100) ,@Line1 varchar(100) -------------------------------------------------------------------------------------------------------------------------------- -- Address search -- Any changes made manually will be lost next time this file is regenerated. -------------------------------------------------------------------------------------------------------------------------------- AS BEGIN SET NOCOUNT ON; BEGIN TRY SELECT * FROM Address WHERE Postcode = @pPostcode AND Town = @pTown AND Line1 = @Line1 END TRY BEGIN CATCH -- Write error to ErrorLog table EXEC [dbo].[usp_ErrorLog_Insert] @pCurrentUserLoginID; RETURN(1); END CATCH; RETURN(0); END GOThanks!
-
Saturday, March 09, 2013 12:56 PM
Can you give few more details like
what is the operation done on the Lookup table with 60k records, means the select query on Lookup table or modifications are happening on the table?
Also the what is the order of the PK and does it mataches with the join condition used in the match?
Instead of issuing the query between the file records and the loopkup table , can copy the data to a staging table and try querying with the lookup table.
Thanks Sarat --Please use Marked as Answer if my post solved your problem and use Vote As Helpful if the post was useful.
-
Saturday, March 09, 2013 3:40 PM
Amended
Hi, essentialy the process is thus, a csv file is loaded into memory in a .net web app, and the app loops through the csv data calling the look up proc for each record which it then essentially returns a flag to show whether or not the address is valid (i.e. matches exactly with the lookup data).
Just to mention that web appp calls are done using multi threads.
Any other info required please let us know.
Many thanks!
- Edited by rockyboy1 Saturday, March 09, 2013 4:07 PM
-
Saturday, March 09, 2013 4:52 PM
It is possible that partitioning over multiple disk can give better scale, but for 60000 lookups, I am not sure it is worth the pain.
What is worth looking at is to change the procedure from just returning the address for a single triplet, is to have to accept a set of data. Since you are on SQL 2008, the closest at hand is a table variable:
CREATE TYPE addresssearchtype AS TABLE
(postcode varchar(100) NOT NULL,
town varchar(100) NOT NULL,
line1 varchar(100) NOT NULL)go
CREATE PROCEDURE AddressSearchMulti
@pCurrentUserLoginID bigint,
@searchdata addresssearchtype READONLY AS
BEGIN TRY
SELECT A.*
FROM Addresses A
WHERE EXISTS (SELECT *
FROM @searchdata sd
WHERE sd.postcode = A.Postcode
AND sd.town = A.Town
AND sd.line1 = A.Line1)
END TRYNow you only need one roundtrip to the server, which shold be more efficient. Essentially, you can make a single call, and no need for multiple threads.
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se -
Saturday, March 09, 2013 5:02 PM
Hi, thanks for the reply, apologies but I think I may not of been clear with the process.
The issue is that the web app has already been developed however there are serious performance issues as it is having to call a proc for each of the 60K records in the csv file which has been loaded into memory and match against the 28Million addres records in the lookup table.
Thanks
-
Saturday, March 09, 2013 5:07 PM
Instead of doing row-by-row validation, you can dump the CSV file in a table and match it against the lookup data by joining on the three columns. This can help you update a flag column once and then query against that flag column for invalid addresses.
CREATE TABLE [dbo].[CSVDump]( [Line1] [varchar](500) NOT NULL, [Town] [varchar](250) NOT NULL, [Postcode] [varchar](10) NOT NULL, [IsValid] BIT NULL ) GO UPDATE d SET d.[IsValid] = 1 FROM [dbo].[CSVDump] AS d INNER JOIN [dbo].[Address] AS a ON d.[Postcode] = a.[Postcode] AND d.[Town] = a.[Town] AND d.[Line1] = a.[Line1]
Please mark this reply as the answer or vote as helpful, as appropriate, to make it useful for other readers.
Thanks!
Aalam | Blog (http://aalamrangi.wordpress.com)- Marked As Answer by rockyboy1 Saturday, March 09, 2013 5:22 PM
-
Saturday, March 09, 2013 5:08 PM
Sounds like the rewrite with a table variable is exactly what you need.
What analysis have you performed to come to the conclusion that partitioning the table would help? Partitioning would really only help if you spread out the data over multiple I/O channels, of which you may have more than one, but not 125.
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se -
Saturday, March 09, 2013 5:23 PM
Thanks for the suggestions, I will discuss amending the row by row process with the others and see if its not too late to amend the solution.
Thanks again!
-
Saturday, March 09, 2013 5:36 PM
You didn't post the full table DDL with indexes, or indicate what the performance problems are, so all we can do is guess. It would help if you could tell us how big the table is, how much memory your database sever has, what the STATISTICS_IO and STATISTICS_TIME and query plan are for running this stored procedure, and whether the performance problems relate to CPU utilization or disk waits.
Partitioning by postal code may help if you introduce enough partitions to reduce the number of levels in the PK index from 4 to 3. It also might help to make the PK a non-clustered index, if there are a lot of non-key columns on the table, and ordering of the key columns may make a difference.
If you are doing individual address lookups, you can optimize the traffic to and from the database. Returning a SELECT * single-row result is less efficient than using an output parameter. EG:
CREATE PROCEDURE [dbo].[AddressSearch] @pCurrentUserLoginID bigint ,@pPostcode varchar(100) ,@pTown varchar(100) ,@Line1 varchar(100) ,@IsValid bit out -------------------------------------------------------------------------------------------------------------------------------- -- Address search -- Any changes made manually will be lost next time this file is regenerated. -------------------------------------------------------------------------------------------------------------------------------- AS BEGIN SET NOCOUNT ON; BEGIN TRY if exists ( SELECT * FROM Address WHERE Postcode = @pPostcode AND Town = @pTown AND Line1 = @Line1 ) begin set @IsValid = 1; end else begin set @IsValid = 0; end; return 0; END TRY BEGIN CATCH -- Write error to ErrorLog table EXEC [dbo].[usp_ErrorLog_Insert] @pCurrentUserLoginID; RETURN(1); END CATCH; END
David http://blogs.msdn.com/b/dbrowne/

