Identifying Datasets

  • Question

  • Hi, I have a requirement where I have to identify data sets. It looks like this:

    cart_no  STATUS  DATE_TIME   BILL_NO
    1        L       2009-01-01  NULL   *
    1        L       2009-01-02  NULL
    1        L       2009-01-02  NULL
    1        E       2009-01-03  NULL
    1        E       2009-01-03  NULL   **
    1        L       2009-01-04  NULL
    1        L       2009-01-05  NULL
    1        L       2009-01-06  NULL
    1        L       2009-01-07  NULL
    1        E       2009-01-07  NULL
    1        E       2009-01-08  NULL
    1        E       2009-01-09  NULL

    I want to identify the record marked ** (actually the date_time on that record);
    then I will update the bill_no on the rows from the first record (*) through that record, using a value from a different table.

    (I have a table XYZ which has SHIP_DATE and BILL_NO. For the data set above, the rows between * and ** fall between two ship dates from XYZ, and that is the range for which I need to update BILL_NO. I have a running query which works fine as long as there is a next ship date. The problem is when a ship date is missing; then I need to look at the STATUS column and recognise the data set, i.e. the rows between the first L and the last E before the next set of L.)


    Thursday, December 3, 2009 8:43 PM

Answers

  • Update. I've come up with a query that should perform better and takes the cart_no into account. I've also applied an indexing strategy that allows the optimizer to service the query using 2 index seeks compared to 3 in your query. The main downside is that it's far less readable and hence may be harder from a maintenance perspective.

    First, the setup code. The following is a CREATE TABLE statement with a clustered index on the date column as well as a non-clustered index to aid the query:

    CREATE TABLE dbo.cart
    ( cart_no INT NOT NULL
    , [STATUS] CHAR(1) NOT NULL
    , DATE_TIME DATETIME NOT NULL
    , BILL_NO INT
    , CONSTRAINT [PK_cart_dt] PRIMARY KEY CLUSTERED(DATE_TIME)
    )
    
    CREATE INDEX [IX_cart_cn_st_dt] ON dbo.cart(cart_no, [Status], DATE_TIME desc)

    Next INSERT the sample data. Here I've added a time component to make the DATE_TIME values unique:

    INSERT INTO dbo.cart(cart_no, [STATUS], DATE_TIME) 
    VALUES (1, 'L', '2009-01-01 01:00:00')
         , (1, 'L', '2009-01-02 01:00:00')
         , (1, 'L', '2009-01-02 02:00:00')
         , (1, 'E', '2009-01-03 01:00:00')
         , (1, 'E', '2009-01-03 02:00:00')
         , (1, 'L', '2009-01-04 01:00:00')
         , (1, 'L', '2009-01-05 01:00:00')
         , (1, 'L', '2009-01-06 01:00:00')
         , (1, 'L', '2009-01-07 01:00:00')
         , (1, 'E', '2009-01-07 02:00:00')
         , (1, 'E', '2009-01-08 01:00:00')
         , (1, 'E', '2009-01-09 01:00:00')

    And finally, here are the 2 queries. Switch on the actual execution plan and run them as a batch. On my machine the comparison of old vs. new was 57% vs. 43%, so not a huge difference at these volumes; however, I would strongly encourage you to test this with much larger volumes to get a true representation:

    --original
    select MAX(date_time) from cart
    where date_time< (select MIN(date_time) from cart where status = 'L' and date_time> 
    (SELECT MIN(date_time) from cart where status = 'E' and date_time >='2009-01-01'))
    
    --new
    SELECT      MIN(group_date)
    FROM        cart          c1
    CROSS APPLY
    (
      SELECT TOP (1) c2.DATE_TIME
      FROM      cart          c2
      WHERE     c2.cart_no   = 1
      AND       c2.[STATUS]  = 'E'
      AND       c1.DATE_TIME > c2.DATE_TIME  
      ORDER BY  c2.DATE_TIME DESC
    )           x(group_date)
    WHERE       cart_no      = 1
    AND         [STATUS]     = 'L'
    GROUP BY    [STATUS]
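
    As a side note, the I/O figures below are the kind of per-statement output that SET STATISTICS IO produces; assuming that's how they were captured, something like this gathers them when the two queries are run as a batch:

    -- Enable per-statement I/O statistics (Scan count, logical reads, etc. appear on the Messages tab)
    SET STATISTICS IO ON;
    -- ... run the --original and --new queries above here ...
    SET STATISTICS IO OFF;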

    Here are the statistics for the 2 queries:

    --original
    Scan count 3
    logical reads 6
    physical reads 0
    read-ahead reads 0
    lob logical reads 0
    lob physical reads 0
    lob read-ahead reads 0

    --new
    Scan count 8
    logical reads 16
    physical reads 0
    read-ahead reads 0
    lob logical reads 0
    lob physical reads 0
    lob read-ahead reads 0

    It's interesting that the second query performs better yet appears to be doing more work in terms of the scan count and logical reads (I'm still trying to work out why).

    Adam Tappis. (MCSD VB6 & .NET, MCDBA SQL 2000)
    • Marked as answer by devlssis Wednesday, January 27, 2010 10:57 PM
    Friday, December 4, 2009 5:13 PM

All replies

  • There is nothing specific that I can use to identify it except scrolling down the status column from 'L' to the max of 'E' in the first set of L and E.
    Thursday, December 3, 2009 10:09 PM
  • You really need to explain your requirement better. It is not clear from your explanation what business logic you are trying to apply.
    Adam Tappis. (MCSD VB6 & .NET, MCDBA SQL 2000)
    Thursday, December 3, 2009 10:13 PM
  • I just want the date on the record marked **, which is the max(date) record in the first set of 'E' status rows.

    --sample data

    select * into BL_TEST from (
    select 1 as cart_no, 'L' AS STATUS, '2009-01-01' AS DATE_TIME,NULL AS BILL_NO
    UNION ALL
    select 1 as cart_no, 'L' AS STATUS, '2009-01-02' AS DATE_TIME,NULL AS BILL_NO
    UNION ALL
    select 1 as cart_no, 'L' AS STATUS, '2009-01-02' AS DATE_TIME,NULL AS BILL_NO
    UNION ALL
    select 1 as cart_no, 'E' AS STATUS, '2009-01-03' AS DATE_TIME,NULL AS BILL_NO
    UNION ALL
    select 1 as cart_no, 'E' AS STATUS, '2009-01-03' AS DATE_TIME,NULL AS BILL_NO
    UNION ALL
    select 1 as cart_no, 'L' AS STATUS, '2009-01-04' AS DATE_TIME,NULL AS BILL_NO
    UNION ALL
    select 1 as cart_no, 'L' AS STATUS, '2009-01-05' AS DATE_TIME,NULL AS BILL_NO
    UNION ALL
    select 1 as cart_no, 'L' AS STATUS, '2009-01-06' AS DATE_TIME,NULL AS BILL_NO
    UNION ALL
    select 1 as cart_no, 'L' AS STATUS, '2009-01-07' AS DATE_TIME,NULL AS BILL_NO
    UNION ALL
    select 1 as cart_no, 'E' AS STATUS, '2009-01-07' AS DATE_TIME,NULL AS BILL_NO
    UNION ALL
    select 1 as cart_no, 'E' AS STATUS, '2009-01-08' AS DATE_TIME,NULL AS BILL_NO
    UNION ALL
    select 1 as cart_no, 'E' AS STATUS, '2009-01-09' AS DATE_TIME,NULL AS BILL_NO

    ) a

    This is what I did:

    select MAX(date_time) from bl_test
    where date_time< (select MIN(date_time) from bl_test where status = 'L' and date_time>
    (SELECT MIN(date_time) from bl_test where status = 'E' and  date_time >='2009-01-01'))

    The output is the date I wanted:
    2009-01-03

    Is there a better way to do it?

    Thursday, December 3, 2009 10:38 PM
  • The issue I have with this table is that you seem to be relying on the order in which data was entered into the table without a column that can be used for explicit ordering. Without so much as an INT identity column you can't rely on the order in which the values are represented.

    Your problem looks like a classic islands and gaps problem, with a twist in that you have multiple groups with the same value, i.e. status, with the only element to distinguish them being the order in which they were entered.

    Take the following sample rows and look at the date.

    cart_no  STATUS  DATE_TIME   BILL_NO
    1        L       2009-01-01  NULL
    1        L       2009-01-02  NULL
    1        E       2009-01-02  NULL
    1        E       2009-01-02  NULL
    1        L       2009-01-02  NULL
    1        L       2009-01-03  NULL

    Rows 2 and 5 above are identical, yet they are in different groups; however, there is no combination of columns that you can place in an ORDER BY clause to guarantee the order of these rows. Hence, unless you can add a column that guarantees the order, I don't think you can come up with a query that gives the correct answer 100% of the time.
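
    That said, if a reliable ordering column does exist (the dbo.cart setup in the marked answer above, for instance, makes DATE_TIME unique), the first 'E' group can be isolated with the classic gaps-and-islands ROW_NUMBER trick. This is only a sketch of that approach, assuming SQL Server 2005 or later and the dbo.cart table from the answer, not the posted solution:

    -- Sketch only: assumes SQL Server 2005+ and that DATE_TIME orders rows reliably per cart.
    -- Consecutive rows with the same STATUS form an "island"; the difference between the two
    -- ROW_NUMBER sequences is constant within an island, so it serves as a group key.
    WITH ordered AS
    (
        SELECT  cart_no, [STATUS], DATE_TIME,
                ROW_NUMBER() OVER (PARTITION BY cart_no ORDER BY DATE_TIME)
              - ROW_NUMBER() OVER (PARTITION BY cart_no, [STATUS] ORDER BY DATE_TIME) AS grp
        FROM    dbo.cart
    ),
    islands AS
    (
        SELECT  cart_no, [STATUS], grp,
                MIN(DATE_TIME) AS island_start,
                MAX(DATE_TIME) AS island_end
        FROM    ordered
        GROUP BY cart_no, [STATUS], grp
    )
    SELECT  cart_no, island_end        -- last row of the first 'E' island (the ** record)
    FROM
    (
        SELECT  cart_no, island_end,
                ROW_NUMBER() OVER (PARTITION BY cart_no ORDER BY island_start) AS rn
        FROM    islands
        WHERE   [STATUS] = 'E'
    ) AS first_e
    WHERE   rn = 1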

    Adam Tappis. (MCSD VB6 & .NET, MCDBA SQL 2000)
    Thursday, December 3, 2009 11:02 PM
  • Perhaps this table has a PK (identity column) which was not shown. In this case the problem is simpler.
    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Thursday, December 3, 2009 11:04 PM
  • There is no primary key in this data; it is unique on cart number and the date (it's actually a datetime stamp), so the order in which the records were entered can be taken from the timestamp. As for the statuses, it is always a set of L records (for loaded) followed by E records (for empty).
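
    (For reference, since (cart_no, DATE_TIME) is stated to be unique, a hypothetical constraint like the following would document and enforce that on a table like the marked answer's dbo.cart; the constraint name is made up for illustration:)

    -- Hypothetical: documents and enforces the stated uniqueness of (cart_no, DATE_TIME)
    ALTER TABLE dbo.cart
        ADD CONSTRAINT UQ_cart_cartno_datetime UNIQUE (cart_no, DATE_TIME);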
    Friday, December 4, 2009 2:32 PM
  • Can you confirm which version of SQL Server you are targeting?
    Adam Tappis. (MCSD VB6 & .NET, MCDBA SQL 2000)
    Friday, December 4, 2009 2:46 PM
  • I've been playing with a few queries and can't really find a faster approach to finding this value than the query you present. I do have a few questions regarding your query though:

    1) Your subqueries are based purely on DATE_TIME; however, your table has a cart_no column, so I suspect that your table contains data for more than one cart_no and that this column should be included in the subqueries, i.e. they become correlated subqueries.
    2) I'm curious about the following filter in your subquery: date_time >= '2009-01-01'. Will this be a parameter that is passed in?
    3) Is this really the full complexity of the problem, or have you simplified it down for the purpose of clarity?
    4) What are the likely volumes in this table and the distribution of the data, e.g. lots of carts with few items or few carts with lots of items?
    5) Your original question refers to isolating a range of rows for update, yet your query focuses only on the boundary. Can you post the full query?

    With the current understanding I can only suggest an indexing strategy that will make your query faster rather than a more optimal query.
    Adam Tappis. (MCSD VB6 & .NET, MCDBA SQL 2000)
    Friday, December 4, 2009 4:19 PM
  • Adam, you are really amazing; all your assumptions (the full story) from what I said (the trailer) are 100% correct. I wish I was your student.

    1) Your subqueries are based purely on DATE_TIME; however, your table has a cart_no column, so I suspect that your table contains data for more than one cart_no and that this column should be included in the subqueries, i.e. they become correlated subqueries.

    -- There are multiple carts in the source data. This is actually data about the movement of loaded and empty cars on a train (3rd-party source).

    2) I'm curious about the following filter in your subquery: date_time >= '2009-01-01'. Will this be a parameter that is passed in?

    -- This is the date that I get from the table XYZ I was talking about, where I have the Bill_no.

    3) Is this really the full complexity of the problem, or have you simplified it down for the purpose of clarity?

    -- I just put the simplest form of the issue.

    4) What are the likely volumes in this table and the distribution of the data, e.g. lots of carts with few items or few carts with lots of items?

    -- Many cars with many records in each status: 14.5 million records for 6 months, with a lot of junk data (e.g. the train moved for a minute and stopped, which results in 4 records for that location, of which the least arrival and max departure records are correct and the other 2 are junk).

    5) Your original question refers to isolating a range of rows for update, yet your query focuses only on the boundary. Can you post the full query?


    UPDATE  CLM
    SET     CLM_BL_SEQ_NO   = BL.BL_SEQ_NO,
            CLM_BL_ID       = BL.BL_ID,
            CLM_BL_NO       = BL.BL_NO,
            CLM_BL_DT       = BL.BL_DT,
            CLM_BL_SYS_CD   = BL.BL_SYS_CD,
            CLM_BL_TRAF_TYP = BL.BL_TRAF_TYP
    FROM    EDW_TRANSPORT.EDW_CLMS_FCT CLM
    INNER JOIN
    (
        SELECT  BL.BL_SEQ_NO, BL.BL_ID, BL.BL_NO, BL.BL_DT, BL.BL_SYS_CD, BL.BL_TRAF_TYP,
                CH_CAR_NO, CH_SHIP_DT AS START_DT
        FROM    EDW_TRANSPORT.EDW_FA_RAIL_CHARGE_CAR_AGG AGG
        INNER JOIN EDW_TRANSPORT.EDW_FA_BL_FCT BL
                ON AGG.CH_BL_ID = BL.BL_ID
    ) BL
            ON BL.CH_CAR_NO = CLM.CLM_CAR_NO
    WHERE   CLM_DATE BETWEEN bl.START_DT AND --BL.END_DT - 1
            -- find the end date for that trip: first find the least empty-status date,
            -- then find the next least loaded-status date; one day less is the end date
            (   SELECT  MIN(CLM_DATE) - 1
                FROM    EDW_TRANSPORT.EDW_CLMS_FCT
                WHERE   CLM_CAR_NO  = CLM.CLM_CAR_NO
                AND     CLM_L_E_STA = 69
                AND     CLM_DATE >  -- (step 2: get to the next start date of L)
                        (   SELECT  MIN(CLM_DATE)
                            FROM    EDW_TRANSPORT.EDW_CLMS_FCT
                            WHERE   CLM_CAR_NO  = CLM.CLM_CAR_NO
                            AND     CLM_L_E_STA = 70
                            AND     CLM_DATE > bl.START_DT )  -- (step 1: get to the next set of E)
            )
    -- this is to make sure that it looks at the BL data from the day the CLM has data; BL has a lot more history data
    AND     bl.START_DT >= (   SELECT  MIN(CLM_DATE)
                               FROM    EDW_TRANSPORT.EDW_CLMS_FCT
                               WHERE   CLM_CAR_NO = CLM.CLM_CAR_NO )

    Leave me a question if you have any and I'll reply tomorrow.

    Thanks Adam once again


    Friday, December 4, 2009 10:43 PM
  • Sorry, I've not had a chance to look at this since; been busy at work. Will post back in a couple of days.
    Adam Tappis. (MCSD VB6 & .NET, MCDBA SQL 2000)
    Tuesday, December 8, 2009 1:32 AM