locked
running total 5 million rows RRS feed

  • Question

  • What's the most efficient way to implement this for a table with 5 million rows?

    I'm introducing a new field named "CumulativePayments" it should be populated using this formula:

    CumulativePayments = Payment on current row + SUM(Payment) (for all previous rows where A+B+C+D+E is the same as on the current row)

    A,B,C,D,E are all varchar fields

    Payment and CumulativePayment are MONEY type

    there is an identity column (id) that can be used to identify previous rows i.e. id is less than current row id

    Friday, October 8, 2010 1:37 PM

Answers

    • Proposed as answer by Naomi N Friday, October 8, 2010 1:47 PM
    • Marked as answer by Kalman Toth Thursday, October 14, 2010 6:12 AM
    Friday, October 8, 2010 1:42 PM
  • Just so you will know and can help pressure Microsoft into catching up with ANSI/ISO Standards, the complete Window clause has a [ROWS | RANGE] option that makes this trivial. 

     SELECT loan_nbr,
            SUM(payment_amt)
                OVER (PARTITION BY loan_nbr
                      ORDER BY payment_nbr  -- or payment_date?
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
            AS cumulative_payment_amt
       FROM Loan_History;

    Don't use MONEY -- it has rounding errors that show up when you multiply and divide, nor does it conform to accounting rules about decimal places. Why do you think that having an IDENTITY property (it is not a column) with the vague generic name "id" is a good idea? Why was Dr. Codd wrong about keys in tables? 

    --CELKO-- Please post DDL, use ISO-11179 naming rules and format code so we can read it
    • Marked as answer by Kalman Toth Thursday, October 14, 2010 6:12 AM
    Friday, October 8, 2010 7:43 PM
  • No quirky update here, and runs in 2:45 for 5 million records

    -- Prepare sample data
    CREATE TABLE	#Temp
    		(
    			ID INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    			c1 INT NOT NULL,
    			c2 INT NOT NULL,
    			c3 INT NOT NULL,
    			c4 INT NOT NULL,
    			c5 INT NOT NULL,
    			v INT NOT NULL,
    			rt INT
    		)
    
    -- Initialize sample data for 45 seconds
    INSERT		#Temp
    		(
    			c1,
    			c2,
    			c3,
    			c4,
    			c5,
    			v
    		)
    SELECT		v1.number AS c1,
    		v2.number AS c2,
    		v3.number AS c3,
    		v4.number AS c4,
    		v5.number AS c5,
    		v6.number AS v
    FROM		master.dbo.spt_values AS v1
    INNER JOIN	master.dbo.spt_values AS v2 ON v2.type = 'P'
    			AND v2.number BETWEEN 1 AND 10
    INNER JOIN	master.dbo.spt_values AS v3 ON v3.type = 'P'
    			AND v3.number BETWEEN 1 AND 10
    INNER JOIN	master.dbo.spt_values AS v4 ON v4.type = 'P'
    			AND v4.number BETWEEN 1 AND 10
    INNER JOIN	master.dbo.spt_values AS v5 ON v5.type = 'P'
    			AND v5.number BETWEEN 1 AND 10
    INNER JOIN	master.dbo.spt_values AS v6 ON v6.type = 'P'
    			AND v6.number BETWEEN 1 AND 50
    WHERE		v1.type = 'P'
    		AND v1.number BETWEEN 1 AND 10
    ORDER BY	NEWID()
    
    -- Create index for 15 seconds
    CREATE NONCLUSTERED INDEX IX_Peso ON #Temp (c1, c2, c3, c4, c5) INCLUDE (v)
    
    -- Start the running total, 2:45
    UPDATE		t
    SET		t.rt = f.rt
    FROM		#Temp AS t
    INNER JOIN	(
    			SELECT		c1,
    					c2,
    					c3,
    					c4,
    					c5,
    					MIN(ID) AS minID
    			FROM		#Temp
    			GROUP BY	c1,
    					c2,
    					c3,
    					c4,
    					c5
    		) AS x ON x.c1 = t.c1
    			AND x.c2 = t.c2
    			AND x.c3 = t.c3
    			AND x.c4 = t.c4
    			AND x.c5 = t.c5
    CROSS APPLY	(
    			SELECT	SUM(w.v)
    			FROM	#Temp AS w
    			WHERE	w.ID >= x.minID
    				AND w.ID <= t.ID
    				AND w.c1 = t.c1
    				AND w.c2 = t.c2
    				AND w.c3 = t.c3
    				AND w.c4 = t.c4
    				AND w.c5 = t.c5
    		) AS f(rt)
    
    • Marked as answer by Kalman Toth Thursday, October 14, 2010 6:13 AM
    Friday, October 8, 2010 11:47 PM

All replies

    • Proposed as answer by Naomi N Friday, October 8, 2010 1:47 PM
    • Marked as answer by Kalman Toth Thursday, October 14, 2010 6:12 AM
    Friday, October 8, 2010 1:42 PM
  • could you guide me on adapting the example from that thread to suit my criteria i.e. take account of A+B+C+D+E rather than all rows as per the example in the thread?
    Friday, October 8, 2010 2:14 PM
  • try this.... tell me how much time is taking..

    SELECT	ROW_NUMBER() OVER (partition by a,b,c,d,e ORDER BY Id) as rn,a,b,c,d,e,val,RunningTotal=CONVERT(int,0)
    INTO #Sales 
    FROM TableName
    
    CREATE CLUSTERED INDEX idxSOID on #Sales (a,b,c,d,e) 
    
    DECLARE @RunningTotal int = 0
    
    BEGIN -- SINGLE-UPDATE FAST RUNNING-TOTAL
    	DBCC DROPCLEANBUFFERS
    	SET @RunningTotal = 0.0
    	UPDATE #Sales SET @RunningTotal = RunningTotal = case when rn = 1 then 0 else @RunningTotal end + val
    	OPTION (MAXDOP 1) 
    -- 1 sec
    
    END
    select * from #Sales
    
    

     

     

    Friday, October 8, 2010 2:50 PM
  • 1 min 50 secs - excellent
    Friday, October 8, 2010 3:08 PM
  • Not satisfied ??? what about this?? lets try this... if levels are less, it will perform better i think....

    ;with cte as
    (
    	select *,ROW_NUMBER() over (PARTITION by a,b,c,d,e order by ID) as rn,A,B,C,D,E,Val from TableName
    ),
    cte1 as
    (
    	select A,B,C,D,E,Val,Val as Total from cte where rn = 1 
    	union all
    	select c.A,c.B,c.C,c.D,c.E,c.Val,Total + c.Val from cte1 c1
    	inner join cte c on c.A = c1.A and c.B = c1.B and c.C = c1.C and c.D = c1.D and c.E = c1.E and c.rn = c1.rn + 1
    )
    select * from cte1
    

     

    Friday, October 8, 2010 3:17 PM
  • Take also a look here

    Running Totals Yet again - CLR will save the day


    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Friday, October 8, 2010 5:53 PM
  • Just so you will know and can help pressure Microsoft into catching up with ANSI/ISO Standards, the complete Window clause has a [ROWS | RANGE] option that makes this trivial. 

     SELECT loan_nbr,
            SUM(payment_amt)
                OVER (PARTITION BY loan_nbr
                      ORDER BY payment_nbr  -- or payment_date?
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
            AS cumulative_payment_amt
       FROM Loan_History;

    Don't use MONEY -- it has rounding errors that show up when you multiply and divide, nor does it conform to accounting rules about decimal places. Why do you think that having an IDENTITY property (it is not a column) with the vague generic name "id" is a good idea? Why was Dr. Codd wrong about keys in tables? 

    --CELKO-- Please post DDL, use ISO-11179 naming rules and format code so we can read it
    • Marked as answer by Kalman Toth Thursday, October 14, 2010 6:12 AM
    Friday, October 8, 2010 7:43 PM
  • No quirky update here, and runs in 2:45 for 5 million records

    -- Prepare sample data
    CREATE TABLE	#Temp
    		(
    			ID INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    			c1 INT NOT NULL,
    			c2 INT NOT NULL,
    			c3 INT NOT NULL,
    			c4 INT NOT NULL,
    			c5 INT NOT NULL,
    			v INT NOT NULL,
    			rt INT
    		)
    
    -- Initialize sample data for 45 seconds
    INSERT		#Temp
    		(
    			c1,
    			c2,
    			c3,
    			c4,
    			c5,
    			v
    		)
    SELECT		v1.number AS c1,
    		v2.number AS c2,
    		v3.number AS c3,
    		v4.number AS c4,
    		v5.number AS c5,
    		v6.number AS v
    FROM		master.dbo.spt_values AS v1
    INNER JOIN	master.dbo.spt_values AS v2 ON v2.type = 'P'
    			AND v2.number BETWEEN 1 AND 10
    INNER JOIN	master.dbo.spt_values AS v3 ON v3.type = 'P'
    			AND v3.number BETWEEN 1 AND 10
    INNER JOIN	master.dbo.spt_values AS v4 ON v4.type = 'P'
    			AND v4.number BETWEEN 1 AND 10
    INNER JOIN	master.dbo.spt_values AS v5 ON v5.type = 'P'
    			AND v5.number BETWEEN 1 AND 10
    INNER JOIN	master.dbo.spt_values AS v6 ON v6.type = 'P'
    			AND v6.number BETWEEN 1 AND 50
    WHERE		v1.type = 'P'
    		AND v1.number BETWEEN 1 AND 10
    ORDER BY	NEWID()
    
    -- Create index for 15 seconds
    CREATE NONCLUSTERED INDEX IX_Peso ON #Temp (c1, c2, c3, c4, c5) INCLUDE (v)
    
    -- Start the running total, 2:45
    UPDATE		t
    SET		t.rt = f.rt
    FROM		#Temp AS t
    INNER JOIN	(
    			SELECT		c1,
    					c2,
    					c3,
    					c4,
    					c5,
    					MIN(ID) AS minID
    			FROM		#Temp
    			GROUP BY	c1,
    					c2,
    					c3,
    					c4,
    					c5
    		) AS x ON x.c1 = t.c1
    			AND x.c2 = t.c2
    			AND x.c3 = t.c3
    			AND x.c4 = t.c4
    			AND x.c5 = t.c5
    CROSS APPLY	(
    			SELECT	SUM(w.v)
    			FROM	#Temp AS w
    			WHERE	w.ID >= x.minID
    				AND w.ID <= t.ID
    				AND w.c1 = t.c1
    				AND w.c2 = t.c2
    				AND w.c3 = t.c3
    				AND w.c4 = t.c4
    				AND w.c5 = t.c5
    		) AS f(rt)
    
    • Marked as answer by Kalman Toth Thursday, October 14, 2010 6:13 AM
    Friday, October 8, 2010 11:47 PM
  • try this.... tell me how much time is taking..

     

    SELECT	ROW_NUMBER() OVER (partition by a,b,c,d,e ORDER BY Id) as rn,a,b,c,d,e,val,RunningTotal=CONVERT(int,0)
    
    INTO #Sales 
    
    FROM TableName
    
    
    
    CREATE CLUSTERED INDEX idxSOID on #Sales (a,b,c,d,e) 
    
    
    
    DECLARE @RunningTotal int = 0
    
    
    
    BEGIN -- SINGLE-UPDATE FAST RUNNING-TOTAL
    
    	DBCC DROPCLEANBUFFERS
    
    	SET @RunningTotal = 0.0
    
    	UPDATE #Sales SET @RunningTotal = RunningTotal = case when rn = 1 then 0 else @RunningTotal end + val
    
    	OPTION (MAXDOP 1) 
    
    -- 1 sec
    
    
    
    END
    
    select * from #Sales
    
    
    
    

     

     

     


    Come on, we all know that there is no guarantee that this approach will work in the next CU / SP / version. Also, alert the users about all conditions that should be met, like no table partitioning, etc.

    I would not recommend it, unless Microsoft states the contrary.

     


    AMB

    Some guidelines for posting questions...

    Saturday, October 9, 2010 1:20 AM
  • Just so you will know and can help pressure Microsoft into catching up with ANSI/ISO Standards, the complete Window clause has a [ROWS | RANGE] option that makes this trivial. 

    You can submit a suggestion at Connect:

    https://connect.microsoft.com/SQLServer?wa=wsignin1.0


    Kalman Toth, SQL Server & Business Intelligence Training; SQL 2008 GRAND SLAM
    • Proposed as answer by Naomi N Tuesday, November 9, 2010 8:26 PM
    Thursday, October 14, 2010 6:15 AM