Looking for a technique to make date range joins perform better

  • Question

  • Hi, we run SQL Server 2017 Standard. I've come across this challenge in the past but never solicited help on the matter.

    Often I come across situations where individual records in table A match only a SINGLE record in table B, based on
    a datetime range in table B that covers a single datetime column from table A. And indexes don't seem capable of
    making the match as efficient as in other business scenarios where a datetime range isn't involved.

    What do I mean? Let's say table A is the incident table (see index also) shown in the code block. It has 30.5 million
    records that are newer (by start datetime) than two years ago, and about 27 million that are older than two years ago.

    And table B is the jobs table (see index also; notice no physical PK defined) shown in the code block. It has 3.3 million
    records, of which about 1.7 million jobs get a match as explained below. I should mention that presently EndTime is
    null on jobs that aren't finished yet.

    It would be great if the following query ran much faster than 8-10 minutes. I suspect that if the match criteria
    involved only one or two column equalities, it would. But date ranges seem to perform poorly. I verified that the
    indexes shown in the code block are being used in the execution plan, and that the execution plan is doing an index
    scan rather than a seek, which at these volumes kind of makes sense to me. Now that I look at the NC index on the
    jobs table, I have to admit that I don't know what we were thinking.

    SELECT COUNT(*) FROM
    (

    SELECT
           inc.pk,
           inc.StartTime,
           inc.EndTime,
           inc.machine

    FROM dbo.incident inc WITH (NOLOCK)
    JOIN dbo.jobs job WITH (NOLOCK) ON inc.machine = job.machine
                              AND inc.StartTime >= job.StartTime
                              AND (inc.StartTime < job.EndTime OR job.EndTime IS NULL)
                              AND job.building = inc.building
                              AND (job.EndTime >= DATEADD(MONTH, -24, GETDATE()) OR job.EndTime IS NULL)

    WHERE
    inc.StartTime >= DATEADD(MONTH, -24, GETDATE())
    ) x

    USE [mydatabase]
    GO

    SET ANSI_NULLS ON
    GO
    
    SET QUOTED_IDENTIFIER ON
    GO
    
    CREATE TABLE [dbo].[incident](
    	[Id01] [int] NULL,
    	[EndTime] [datetime] NULL,
    	[Id02] [int] NULL,
    	[machine] [int] NULL,
    	[Id03] [int] NULL,
    	[Id04] [int] NULL,
    	[Id05] [int] NULL,
    	[Id06] [int] NULL,
    	[Id07] [int] NULL,
    	[StartTime] [datetime] NULL,
    	[pk] [int] NOT NULL,
    	[Id08] [int] NULL,
    	[building] [varchar](25) NOT NULL,
     CONSTRAINT [PK_incident] PRIMARY KEY CLUSTERED 
    (
    	[pk] ASC,
    	[building] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO
    
    USE [mydatabase]
    GO
    
    CREATE NONCLUSTERED INDEX [IDX_Incident_NCI02] ON [dbo].[Incident]
    (
    	[StartTime] ASC
    )
    INCLUDE ( 	[Id01],
    	[EndTime],
    	[machine],
    	[Id02],
    	[Id03],
    	[Id04],
    	[Id05],
    	[Id06],
    	[Id07]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO
    
    
    
    USE [mydatabase]
    GO
    
    SET ANSI_NULLS ON
    GO
    
    SET QUOTED_IDENTIFIER ON
    GO
    
    CREATE TABLE [dbo].[jobs](
    	[Id01] [int] NULL,
    	[EndTime] [datetime] NULL,
    	[Id02] [int] NULL,
    	[machine] [int] NULL,
    	[jobId] [int] NOT NULL, -- jobId and building together are unique
    	[StartTime] [datetime] NULL,
    	[building] [varchar](25) NOT NULL
    ) ON [PRIMARY]
    GO
    
    USE [mydatabase]
    GO
    
    SET ANSI_PADDING ON
    GO
    
    CREATE NONCLUSTERED INDEX [IDX_Jobs_NCI02] ON [dbo].[Jobs]
    (
    	[StartTime] ASC,
    	[EndTime] ASC,
    	[machine] ASC,
    	[building] ASC
    )
    INCLUDE ( 	[id02]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO


    • Edited by db042190 Wednesday, August 14, 2019 12:44 PM want to mention index scans
    Tuesday, August 13, 2019 8:22 PM

All replies

  • You are right. Those types of queries are not any fun.

    What you can try:

    1) Add an index on jobs.EndTime. Maybe SQL Server will perform index intersection on both indexes. (But since both
    of them will have an ugly hit rate, not likely.)

    2) As much as I hate to say it, it is probably better to have a "magic" date for jobs that are not completed, for
    instance 9999-12-31. The queries are easier to write, and SQL Server may be more interested in the index. (A sketch
    follows after this list.)

    3) There is a little-used database option that you can try:
       ALTER DATABASE db SET DATE_CORRELATION_OPTIMIZATION ON
    It is intended for scenarios like this, I think. I've never tried it myself, but this blog post may shed some light on it:
    https://www.red-gate.com/simple-talk/sql/t-sql-programming/the-query-optimizer-date-correlation-optimisation/
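
    A minimal sketch of what the magic date in suggestion 2 could look like, using the table and column names from the
    question (untested, so treat it as a sketch only): with the sentinel in place, the OR ... IS NULL branches drop out
    of the join.

       -- Backfill open-ended jobs with a sentinel date; this assumes no real
       -- EndTime will ever legitimately be 9999-12-31.
       UPDATE dbo.jobs
       SET    EndTime = '99991231'
       WHERE  EndTime IS NULL;

       -- The range test becomes a plain closed-open interval check.
       SELECT COUNT(*)
       FROM dbo.incident inc
       JOIN dbo.jobs job ON inc.machine  = job.machine
                        AND inc.building = job.building
                        AND inc.StartTime >= job.StartTime
                        AND inc.StartTime <  job.EndTime
       WHERE inc.StartTime >= DATEADD(MONTH, -24, GETDATE());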


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    • Marked as answer by db042190 Thursday, August 15, 2019 12:17 PM
    Tuesday, August 13, 2019 9:08 PM
  • This is all pretty interesting. I ran a baseline with just the two years of data from each table, i.e. I loaded only
    two years of data into each table and put 12/31/9999 in place of null end dates. Not much, if any, improvement.

    Then I removed the previous index from the jobs table and replaced it with one that had building and machine before
    the two dates. Results were terrible, twice as bad as before.

    Finally, I removed the latter index, put the old jobs index back, and added another index to jobs on just EndTime.
    I noticed the new index was not being used in the estimated execution plan. I ran anyway but saw no improvement.
    (Both experimental indexes are sketched below.)
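
    Roughly, the two experiments amounted to this (the index names here are made up):

       -- First experiment: equality columns ahead of the two dates.
       CREATE NONCLUSTERED INDEX IDX_Jobs_Experiment ON dbo.jobs
       (building, machine, StartTime, EndTime)
       INCLUDE (Id02);

       -- Second experiment: a single-column index on just EndTime, kept
       -- alongside the original IDX_Jobs_NCI02.
       CREATE NONCLUSTERED INDEX IDX_Jobs_EndTime ON dbo.jobs (EndTime);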

    Also, I read the link on date correlation optimization and concluded it might be great for relationships where an
    FK applies, but in our case that doesn't make any sense, so I passed on that idea. The relationship between these
    two tables is exactly what this query is trying to establish.

    What I think we are going to do is carry the one column needed from jobs in the incident table so downstream systems need not join these two tables going forward.   Our loads are incremental so adding this field may not be painful.


    • Edited by db042190 Wednesday, August 14, 2019 7:50 PM additional info
    Wednesday, August 14, 2019 7:49 PM
  • What I think we are going to do is carry the one column needed from jobs in the incident table so downstream systems need not join these two tables going forward.   Our loads are incremental so adding this field may not be painful.

    Yeah, these situations often require a special solution, which you can only find if you have a good understanding
    of what the tables are used for.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    • Marked as answer by db042190 Thursday, August 15, 2019 12:17 PM
    Wednesday, August 14, 2019 9:03 PM
  • I couldn't let this go. Here is what I did after still feeling it was running way too long and staring at the
    execution plan just a bit longer...

    1. I limited my table rows to two years of data.

    2. I ensured that no dates used in the join criteria were null, i.e. I set EndTime in jobs to 12/31/9999 where it
       was null. This was Erland's idea, and my evidence suggests it makes a difference.

    3. I created a nonclustered index on incident over machine, building and StartTime, and covered the other fields I
       needed in my select.

    4. I created a nonclustered index on jobs over machine, building, StartTime and EndTime, and covered Id02, which I
       actually need in my select.

    5. I changed the join condition a little to match on machine, building, and incident StartTime being >= job
       StartTime and < job EndTime. (The indexes from steps 3 and 4 are sketched below.)

    I got this to run in 14 seconds.
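
    Roughly, the step 3 and step 4 indexes look like this (index names are my shorthand); the step 5 join is then the
    same closed-open interval test sketched in the first reply:

       -- Step 3: equality columns lead the key, the range column comes last.
       CREATE NONCLUSTERED INDEX IDX_Incident_NCI03 ON dbo.incident
       (machine, building, StartTime)
       INCLUDE (pk, EndTime);

       -- Step 4: same pattern on jobs, covering the one column the select needs.
       CREATE NONCLUSTERED INDEX IDX_Jobs_NCI03 ON dbo.jobs
       (machine, building, StartTime, EndTime)
       INCLUDE (Id02);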

    I tried all sorts of other things, including converting the three dates used in the join condition to non-null
    bigints as shown in the cast below (or 999999999999 if the date was null), and consistently saw run times of about
    10 seconds or less. Start times were never null, and no date was ever earlier than Jan 1, 2000. In steps 3 and 4
    above, I used these bigint "counters" instead of the dates in the nonclustered index key columns.

    cast(datediff(ss, '01/01/2000', StartTime) as bigint) * 1000 + datepart(ms, StartTime)

    The execution plan suddenly showed a nested loops join instead of a merge join (the merge join followed sorts in
    the original plan) and a seek on the incident NC index instead of a scan on the same. I think it was the sort
    fields in the latter that gave me the idea to change the index key columns the way I did, i.e. to make machine and
    building more prominent. SQL eliminated the sorts in the new plan. I may be wrong, but one takeaway for me is that
    SQL does better (maybe it wants to) joining on more sparse (less granular) criteria when ranges are involved in
    the predicate.
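
    One way the bigint "counters" could be wired in without changing the loads is persisted computed columns; this is
    only a sketch with made-up column and index names, not exactly what we ran:

       -- CONVERT with style 112 keeps the expressions deterministic, which
       -- PERSISTED requires. The CASE guards against a DATEDIFF overflow if
       -- EndTime already holds the 9999-12-31 sentinel; 999999999999 stands
       -- in for open-ended jobs.
       ALTER TABLE dbo.jobs ADD
           StartCtr AS (CAST(DATEDIFF(ss, CONVERT(datetime, '20000101', 112), StartTime) AS bigint) * 1000
                        + DATEPART(ms, StartTime)) PERSISTED,
           EndCtr   AS (CASE WHEN EndTime IS NULL
                                  OR EndTime >= CONVERT(datetime, '99990101', 112)
                             THEN CAST(999999999999 AS bigint)
                             ELSE CAST(DATEDIFF(ss, CONVERT(datetime, '20000101', 112), EndTime) AS bigint) * 1000
                                  + DATEPART(ms, EndTime)
                        END) PERSISTED;

       -- The counters then replace the dates in the index key columns.
       CREATE NONCLUSTERED INDEX IDX_Jobs_Ctr ON dbo.jobs
       (machine, building, StartCtr, EndCtr)
       INCLUDE (Id02);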


    • Edited by db042190 Thursday, August 22, 2019 3:12 PM clarity
    Wednesday, August 21, 2019 3:37 PM
  • I got this to run in 14 seconds

    From 8-10 minutes originally? Good work!


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Wednesday, August 21, 2019 9:12 PM