locked
Efficient way to join huge tables RRS feed

  • Question

  • I have a table with 20M rows, and each row has 5 columns: time, id, value, value_lst, value_nxt. For each id and time, there is a value for the status. The first three columns are known and I want to know value_lst and value_nxt, i.e.: the values of the last and the next periods for a specific time and id, and have the following query to create tables and calculate value

    create table tab1 (id nvarchar(12),time int, value nvarchar(8),

    value_lst nvarchar(8), value_nxt nvarchar(8))


    insert tab1 with (tablock) (id,time,value) 
    select id,time,value
    from tab0

    update a1
    set  a1.value_lst = b1.value,
    a1.value_nxt = c1.value
    from tab1 a1
    left join tab1 b1
    on a1.id = b1.id
    and a1.time = b1.time + 1
    left join tab1 c1
    on a1.id = c1.id
    and a1.time = c1.time - 1

     

    Tab0 is the source table where I get id/time/value.

    It seems that the query takes forever and the log file increased by more than 10 GB. I'm wondering what's the most efficient way to write this query? I know using index will speed up the joining process, but how can I reduce the logging?

    I'm using SQL server 2016 on Win10 64bit.

    Thanks,

    Jason


    • Edited by Jason_86 Wednesday, December 14, 2016 11:39 PM Add more code
    Wednesday, December 14, 2016 8:09 PM

Answers

  • If you insist on updating the entire table, the log file will indeed grow. The log can be kept in check by updating in batches, but that may not be the easiest with operation you want to do.

    It should however to be possible to reduce the execution time by using LAG/LEAD or FIRST_VALUE/LAST_VALUE. I have some difficulties, though, to find coherence in your description. You first say that there are three columns, time, id and value, and then suddently a fourth column hits the scene. And a fifth and a sixth column in value_last and value_next. In the text you about next periods for a specific time and id, but the query only partitions by id.

    Since I don't like guessing games, please post:

    1) CREATE TABLE statement for your table.
    2) INSERT statements with sample data, enough to demonstrate all angles of the problem.
    3) The desired result given the sample.

    Wednesday, December 14, 2016 10:14 PM
  • There is still no sample data, and it is not clear whether it is a requirement that it must be time + 1 (or for that matter what time + 1 means).

    But here is a query which ignores the +1 bit and only takes the the "next time" be that +1 or +99.

    SELECT id, time, value,
           value_lst = LAG(value) OVER(PARTITION BY id ORDER BY time),
           value_nxt = LEAD(value) OVER(PARTITION BY id ORDER BY time)
    FROM   tab0

    Thursday, December 15, 2016 8:12 AM

All replies

  • If you insist on updating the entire table, the log file will indeed grow. The log can be kept in check by updating in batches, but that may not be the easiest with operation you want to do.

    It should however to be possible to reduce the execution time by using LAG/LEAD or FIRST_VALUE/LAST_VALUE. I have some difficulties, though, to find coherence in your description. You first say that there are three columns, time, id and value, and then suddently a fourth column hits the scene. And a fifth and a sixth column in value_last and value_next. In the text you about next periods for a specific time and id, but the query only partitions by id.

    Since I don't like guessing games, please post:

    1) CREATE TABLE statement for your table.
    2) INSERT statements with sample data, enough to demonstrate all angles of the problem.
    3) The desired result given the sample.

    Wednesday, December 14, 2016 10:14 PM
  • Thank you so much for your comments! I have updated the code and problem description. Can you take another look?

    Jason

    Wednesday, December 14, 2016 11:41 PM
  • There is still no sample data, and it is not clear whether it is a requirement that it must be time + 1 (or for that matter what time + 1 means).

    But here is a query which ignores the +1 bit and only takes the the "next time" be that +1 or +99.

    SELECT id, time, value,
           value_lst = LAG(value) OVER(PARTITION BY id ORDER BY time),
           value_nxt = LEAD(value) OVER(PARTITION BY id ORDER BY time)
    FROM   tab0

    Thursday, December 15, 2016 8:12 AM