Update 700 Million Records Quickly

  • Question

  • I have a table which contains an ID column with a clustered index on it.
    I have two other columns named Old_Column and New_Column. For rows where New_Column is null, I want to copy the value from Old_Column into New_Column, and I want to update all values in New_Column to be right-justified and zero-filled. I know we have to do it in batches. I wrote an UPDATE statement, but when I run it on the table the query runs forever, even though records get updated only for the first batch.

    Here is my query: 

    WHILE (1 = 1)
        BEGIN
            BEGIN TRANSACTION

            UPDATE TOP (100000) Table1
            SET New_Column = (CASE WHEN New_Column IS NULL
                                   THEN REPLICATE('0', 10 - LEN(Old_Column)) + Old_Column
                                   ELSE REPLICATE('0', 10 - LEN(New_Column)) + New_Column
                              END)

            IF @@ROWCOUNT = 0
                BEGIN
                    COMMIT TRANSACTION
                    BREAK
                END
            COMMIT TRANSACTION
        END
    GO

    The problem is the query runs forever, and I have about 700 million records.
    Any suggestions, guys?


    mike
    Tuesday, December 28, 2010 4:28 PM

Answers

  • Here is a query I use when I have to update our large tables. It has the advantage that you can estimate when it will finish, based on the execution so far and not just on the last 100,000 records or so.

     

    DECLARE	@CurID INT = (SELECT MIN(ID) FROM dbo.Table1),
    	@MaxID INT = (SELECT MAX(ID) FROM dbo.Table1),
    	@BatchSize INT = 100000,
    	@StartTime DATETIME = GETDATE(),
    	@FinishTime VARCHAR(43) = 'unknown at this time'
    
    DECLARE @Batches FLOAT = CEILING(1E0 * (@MaxID - @CurID) / @BatchSize), -- @CurID still equals MIN(ID) at this point
    	@Batch INT = 1
    
    WHILE @CurID <= @MaxID
    	BEGIN
    		RAISERROR('Now working with ID %d of %d. Estimated completion time is %s.', 10, 1, @CurID, @MaxID, @FinishTime) WITH NOWAIT
    
    		UPDATE	dbo.Table1
    		SET	New_Column = RIGHT('0000000000' + COALESCE(New_Column, Old_Column), 10)
    		WHERE	ID >= @CurID
    			AND ID < @CurID + @BatchSize
    			AND (LEN(New_Column) <> 10 OR New_Column IS NULL)
    			AND (New_Column IS NOT NULL OR Old_Column IS NOT NULL)
    
    		SELECT	@CurID += @BatchSize,
    			@Batch += 1,
    			@FinishTime = CONVERT(VARCHAR(40), DATEADD(SECOND, CAST(@Batches * DATEDIFF(SECOND, @StartTime, GETDATE()) / @Batch AS INT), @StartTime), 120)
    	END
    
    
    • Marked as answer by Mike Rodney Wednesday, December 29, 2010 3:52 PM
    Tuesday, December 28, 2010 10:25 PM

All replies

  • Probably your problem is with @@ROWCOUNT, because your UPDATE statement always updates records, so @@ROWCOUNT will never be equal to 0.
    Tuesday, December 28, 2010 4:33 PM
  • Do you have a primary key in your table, or better yet a column that will tell us whether a record was already updated or not? Say, a ModifiedDate column?

    Your current query will be updating the same first 100K records over and over again. We need a WHERE condition so that records are updated only once. We can base the WHERE condition on a ModifiedDate column, on an update flag, or on the primary key.

    In the worst-case scenario, you can use this WHERE clause:

    WHERE New_Column IS NULL
       OR (New_Column IS NOT NULL AND New_Column <> REPLICATE('0', 10 - LEN(New_Column)) + New_Column)

    Although I don't understand why you need to do this update at all. Why not let the formatting be done on the client, or when you're selecting the data?
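
    For example, the padded form could be produced at query time instead (just a sketch using the column names from your post; it assumes the values are never longer than 10 characters):

    -- Pad at SELECT time instead of rewriting 700 million rows
    SELECT ID,
           RIGHT(REPLICATE('0', 10) + COALESCE(New_Column, Old_Column), 10) AS PaddedValue
    FROM   Table1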


    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Tuesday, December 28, 2010 4:35 PM
    Moderator
  • Probably your problem is with @@ROWCOUNT, because your UPDATE statement always updates records, so @@ROWCOUNT will never be equal to 0.

    Along with this suggestion: also run it on just a part of the 700 million records and check the results.

    Do it for only 100 records first, update with this query, and check the result. If it looks good, go ahead and run it for all your records.

    Tuesday, December 28, 2010 4:36 PM
  • WHILE (1 = 1) will always be TRUE, so the code will run over and over. An UPDATE without a WHERE condition will always update all records, so @@ROWCOUNT will never be 0 as long as there are rows in the table. Look at Naomi's post; there is a solution there.
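
    For illustration, a minimal fix along those lines (a sketch, assuming a correctly padded value is always exactly 10 characters) is to add a WHERE clause so each batch only touches rows that still need work, letting @@ROWCOUNT eventually reach 0:

    WHILE (1 = 1)
    BEGIN
        BEGIN TRANSACTION

        -- Only touch rows that still need padding
        UPDATE TOP (100000) Table1
        SET    New_Column = RIGHT(REPLICATE('0', 10) + COALESCE(New_Column, Old_Column), 10)
        WHERE  (New_Column IS NULL OR LEN(New_Column) <> 10)
               -- skip rows where both columns are NULL; they can
               -- never be padded and would be picked up forever
               AND (New_Column IS NOT NULL OR Old_Column IS NOT NULL)

        IF @@ROWCOUNT = 0
        BEGIN
            COMMIT TRANSACTION
            BREAK
        END
        COMMIT TRANSACTION
    END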
    Tuesday, December 28, 2010 4:44 PM
  • Thanks, it works fine when I gave it a WHERE condition. But when I update millions of records, will there be any way to improve its performance? The primary key column is ID and it's an identity column, so I guess we need to use that somewhere to improve performance.

    I have to update only historical data, so I will only have to run the script once. The formatting is done for future data on the client side.


    mike
    Tuesday, December 28, 2010 4:55 PM
  • Try

     

    declare @Size int, @Loops int, @I int

    set @Size = 100000

    select @Loops = count(*) / @Size from Table1

    set @I = 0

    while @I <= @Loops
    begin
        ;with cte as
         (select ID, Old_Column, New_Column,
                 case when New_Column is null
                      then replicate('0', 10 - len(Old_Column)) + Old_Column
                      else replicate('0', 10 - len(New_Column)) + New_Column
                 end as FixedColumn,
                 row_number() over (order by ID) as Row
          from Table1)
        update cte
        set New_Column = FixedColumn
        where Row between @I * @Size + 1 and (@I + 1) * @Size

        set @I = @I + 1
    end
    



    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    • Marked as answer by Mike Rodney Wednesday, December 29, 2010 3:52 PM
    Tuesday, December 28, 2010 5:27 PM
    Moderator
  • Thanks a lot! Actually, I tried that and it's taking almost the same time. Is there any way we can loop through the ID column, which has the clustered index on it, to increase the performance? I will be updating 700 million records at a time, so it might take hours.
    Right now it's taking about 15 to 17 seconds to update 1 million records.


    mike
    Tuesday, December 28, 2010 7:38 PM
  • If your ID doesn't have many gaps and runs from 1 to 700 million, then you may try

    update ... where ID between @i * @Size and (@i + 1) * @Size - 1

    instead of the cte/row_number() approach I suggested.
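
    Spelled out, that loop might look like this (a sketch; it assumes ID is a dense IDENTITY starting at 1 and reuses the same CASE expression):

    declare @Size int, @Loops int, @I int

    set @Size = 100000
    select @Loops = count(*) / @Size from Table1
    set @I = 0

    while @I <= @Loops
    begin
        -- seek on the clustered ID index instead of numbering rows
        update Table1
        set    New_Column = case when New_Column is null
                                 then replicate('0', 10 - len(Old_Column)) + Old_Column
                                 else replicate('0', 10 - len(New_Column)) + New_Column
                            end
        where  ID between @I * @Size and (@I + 1) * @Size - 1

        set @I = @I + 1
    end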


    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Tuesday, December 28, 2010 7:44 PM
    Moderator
  • I'm not sure that you can get a better time than you already have. Maybe you can try some simpler commands to solve your problem; that may buy you some time even if the code gets longer.

    Generally, code optimization is very hard and takes time for testing every line of code.

    Tuesday, December 28, 2010 7:48 PM
  • Thanks, that really helped and got what I was looking for. I tried it on sample data of 2 million and it took only 12 seconds (more than 50% improvement).

    I really appreciate your help. I will now try it on a larger set of data. Should I run it with a small batch size or a larger batch size, and how does that affect the performance and time, given the large amount of data? I know we also have to make sure our transaction log is not affected.

    Thanks again!


    mike
    Tuesday, December 28, 2010 8:20 PM
  • Oh no, when I look through it carefully there are some gaps where data is not updated. Not really sure why it does that, though there are no gaps in the ID column.
    mike
    Tuesday, December 28, 2010 8:26 PM
  • Can you post the latest code you tried and which IDs were not updated? I don't see why. You can also add some info in the loop, like:

    Print 'Updating ' + cast(@i * @Size as varchar(10)) + ' - ' + cast((@i + 1) * @Size as varchar(10)) + ' records'


    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Tuesday, December 28, 2010 8:31 PM
    Moderator
  • Below is the code for the problem above:

    declare @Size int, @Loops int, @I int

    set @Size = 500000

    select @Loops = count(*) / @Size from Table1

    set @I = 0

    while @I <= @Loops
    begin
        update Table1
        set New_Column = (case when New_Column is null
                               then replicate('0', 10 - len(Old_Column)) + Old_Column
                               else replicate('0', 10 - len(New_Column)) + New_Column
                          end)
        where ID between @I * @Size and (@I + 1) * @Size - 1

        set @I = @I + 1
    end


    mike
    Tuesday, December 28, 2010 8:33 PM
  • SET   New_Column = ( case when ( New_Column is null )
                      then Replicate('0', 10 - Len(Old_Column))
                         + Old_Column
                      else Replicate('0', 10 - Len(New_Column))
                         + New_Column
                    End )
    

    maybe you can try this

    UPDATE Table1
    SET    New_Column = Replicate('0', 10 - Len(Old_Column)) + Old_Column
    WHERE  New_Column IS NULL

    Tuesday, December 28, 2010 8:57 PM
  • Thanks

    Yeah, it runs fine but takes 1 minute to update 2 million records, so that's the best we can do. Also, can you tell me how I can play around with the batch size to reduce the time?


    mike
    Tuesday, December 28, 2010 9:27 PM
  • Thanks, that's nice, as we can keep track of the time execution will take for each batch.
    mike
    Wednesday, December 29, 2010 3:46 PM
  • Thanks, it works fine when I gave it a WHERE condition. But when I update millions of records, will there be any way to improve its performance? The primary key column is ID and it's an identity column, so I guess we need to use that somewhere to improve performance.

    I have to update only historical data, so I will only have to run the script once. The formatting is done for future data on the client side.


    mike

    You say "records" which is a sequential file concept instead of "rows", which is relational concept. You use TOP which is a sorted file concept. You have a vague. mystical generic  "id" that mimics the physical record number of a magnetic tape file. Since it is an IDENTITY table property (NOT a column!)  it cannot be a key by definition. But at least give it a valid name that describes it -- "physical_insertion_attempt_count"  and label it as meta-data that should not be in a table. 

    Have you considered switching your mindset to RDBMS? That is the lack of theory that is stalling you. Now onto "practical programming"

    Create a table with the right constraints on it, such as a LIKE predicate in a CHECK() constraint for those zeros. Call up the sales guy at an ETL software company and ask for a demo. It will stuff data into the table faster than anything we can do in T-SQL. That is what ETL lives to do. 
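
    For instance, such a constraint might look like this (a sketch; the table and constraint names are illustrative):

    -- enforce "exactly 10 digits" declaratively at load time
    CREATE TABLE Foobar
    (new_column CHAR(10) NOT NULL
        CONSTRAINT new_column_zero_filled
        CHECK (new_column LIKE
               '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'));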

     


    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice / Data, Measurements and Standards in SQL / SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL
    • Proposed as answer by Doron_F Saturday, March 26, 2011 11:49 PM
    Wednesday, December 29, 2010 9:37 PM
  • Hi Mike,

    Maybe it is a bit late to answer you, but I found a way to move or update hundreds of millions of records using an MS SQL script in less than 1 hour. You can basically track the progress 1 million records at a time and also figure out the time it takes to process all the records. Please see this article:
    http://www.dfarber.com/computer-consulting-blog/2011/1/14/processing-hundreds-of-millions-records-got-much-easier.aspx - it works like a charm for me.

    Regards,

    Doron

     

     

    Saturday, March 26, 2011 11:57 PM