Answered by:
Efficient way to join huge tables

Question
-
I have a table with 20M rows, and each row has 5 columns: time, id, value, value_lst, value_nxt. For each id and time, there is a value for the status. The first three columns are known and I want to know value_lst and value_nxt, i.e.: the values of the last and the next periods for a specific time and id, and have the following query to create tables and calculate value
create table tab1 (id nvarchar(12),time int, value nvarchar(8),
value_lst nvarchar(8), value_nxt nvarchar(8))
insert tab1 with (tablock) (id,time,value)
select id,time,value
from tab0
update a1
set a1.value_lst = b1.value,
a1.value_nxt = c1.value
from tab1 a1
left join tab1 b1
on a1.id = b1.id
and a1.time = b1.time + 1
left join tab1 c1
on a1.id = c1.id
and a1.time = c1.time - 1Tab0 is the source table where I get id/time/value.
It seems that the query takes forever and the log file increased by more than 10 GB. I'm wondering what's the most efficient way to write this query? I know using index will speed up the joining process, but how can I reduce the logging?
I'm using SQL server 2016 on Win10 64bit.
Thanks,
Jason
- Edited by Jason_86 Wednesday, December 14, 2016 11:39 PM Add more code
Wednesday, December 14, 2016 8:09 PM
Answers
-
If you insist on updating the entire table, the log file will indeed grow. The log can be kept in check by updating in batches, but that may not be the easiest with operation you want to do.
It should however to be possible to reduce the execution time by using LAG/LEAD or FIRST_VALUE/LAST_VALUE. I have some difficulties, though, to find coherence in your description. You first say that there are three columns, time, id and value, and then suddently a fourth column hits the scene. And a fifth and a sixth column in value_last and value_next. In the text you about next periods for a specific time and id, but the query only partitions by id.
Since I don't like guessing games, please post:
1) CREATE TABLE statement for your table.
2) INSERT statements with sample data, enough to demonstrate all angles of the problem.
3) The desired result given the sample.- Proposed as answer by Ed Price - MSFTMicrosoft employee Thursday, December 15, 2016 1:54 AM
- Unproposed as answer by Ed Price - MSFTMicrosoft employee Thursday, December 15, 2016 1:54 AM
- Proposed as answer by Ed Price - MSFTMicrosoft employee Thursday, December 15, 2016 1:54 AM
- Marked as answer by Ed Price - MSFTMicrosoft employee Wednesday, January 4, 2017 2:28 AM
Wednesday, December 14, 2016 10:14 PM -
There is still no sample data, and it is not clear whether it is a requirement that it must be time + 1 (or for that matter what time + 1 means).
But here is a query which ignores the +1 bit and only takes the the "next time" be that +1 or +99.
SELECT id, time, value,
value_lst = LAG(value) OVER(PARTITION BY id ORDER BY time),
value_nxt = LEAD(value) OVER(PARTITION BY id ORDER BY time)
FROM tab0- Marked as answer by Ed Price - MSFTMicrosoft employee Wednesday, January 4, 2017 2:28 AM
Thursday, December 15, 2016 8:12 AM
All replies
-
If you insist on updating the entire table, the log file will indeed grow. The log can be kept in check by updating in batches, but that may not be the easiest with operation you want to do.
It should however to be possible to reduce the execution time by using LAG/LEAD or FIRST_VALUE/LAST_VALUE. I have some difficulties, though, to find coherence in your description. You first say that there are three columns, time, id and value, and then suddently a fourth column hits the scene. And a fifth and a sixth column in value_last and value_next. In the text you about next periods for a specific time and id, but the query only partitions by id.
Since I don't like guessing games, please post:
1) CREATE TABLE statement for your table.
2) INSERT statements with sample data, enough to demonstrate all angles of the problem.
3) The desired result given the sample.- Proposed as answer by Ed Price - MSFTMicrosoft employee Thursday, December 15, 2016 1:54 AM
- Unproposed as answer by Ed Price - MSFTMicrosoft employee Thursday, December 15, 2016 1:54 AM
- Proposed as answer by Ed Price - MSFTMicrosoft employee Thursday, December 15, 2016 1:54 AM
- Marked as answer by Ed Price - MSFTMicrosoft employee Wednesday, January 4, 2017 2:28 AM
Wednesday, December 14, 2016 10:14 PM -
Thank you so much for your comments! I have updated the code and problem description. Can you take another look?
Jason
Wednesday, December 14, 2016 11:41 PM -
There is still no sample data, and it is not clear whether it is a requirement that it must be time + 1 (or for that matter what time + 1 means).
But here is a query which ignores the +1 bit and only takes the the "next time" be that +1 or +99.
SELECT id, time, value,
value_lst = LAG(value) OVER(PARTITION BY id ORDER BY time),
value_nxt = LEAD(value) OVER(PARTITION BY id ORDER BY time)
FROM tab0- Marked as answer by Ed Price - MSFTMicrosoft employee Wednesday, January 4, 2017 2:28 AM
Thursday, December 15, 2016 8:12 AM