In my Actual Execution Plan, why is the Actual Rows number 3280% of Estimated Rows, and does it matter?

  • Question

  • SQL Server 2016. My query is against a Fact table with 4.7 billion rows. It has a clustered columnstore index and three nonclustered rowstore indexes. The query is using one of those, which makes sense in this instance.

    I'm joining the Fact table to a temp table containing the required account keys. There are 11 of those. The nonclustered index has AccountKey as its leading column and is covering.

    The index was built about 4 hours ago and the table has not changed since.

    This is the first part of the plan:

    [Screenshot of the actual execution plan: a parallel Nested Loops join with an Index Seek on the fact table; estimated rows 6,902,456 vs. actual rows 226,465,089.]

    It takes several minutes to run, which is not really acceptable for one query in a complex SSRS report for end-users.

    I'm not clear on which join type is best for this set of data - perhaps Nested Loops is not suitable for the 11 account keys being joined to the 200 million transaction rows?

    Apart from AccountKey, the other part of the join is that TransactionDate falls within a range of 24 months. The table is partitioned on TransactionDate with one month per partition. I organised the columnstore by first creating a clustered index on TransactionDate, then creating the columnstore with DROP_EXISTING = ON, which was a tip I found to help maximise segment elimination when querying on TransactionDate. Not that this is relevant to the current query, I suppose.

    I have added an OPTION( RECOMPILE ) to the SELECT in case that helped. The actual number of rows returned varies enormously from this, our largest merchant, down to the smallest.

    So, can anyone suggest why the stats don't seem to be helping get the row count right? And is Nested Loops the best type of join, here?

    Nick Ryan MIS Programmer Analyst, ANZ Bank


    • Edited by Nick Ryan Thursday, November 14, 2019 12:57 AM
    Thursday, November 14, 2019 12:36 AM

All replies

  • The .sqlplan file for the actual execution would help. It is possible that 6,902,456 is the estimate for each row from the outer source #ChildCompanies, and 226,465,089 is the total number of rows, in which case the estimate is only 2X off.

    Still, I would be concerned with the exact number of rows in the Fact table that each row from the temp table matches. The reason is that the parallel loop join will have one thread work on each row from the temp table to find matches. Hence, if there is strong skew, your query time is the time taken by the thread with the most matches.

    The actual plan detail will show how many rows each thread touched. If there is strong asymmetry, try forcing a hash join.
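
    As a rough sketch of what forcing the hash join could look like (table and column names are taken from the query Nick posts further down the thread; this is a stripped-down illustration, not a tested fix):

    -- Sketch only: a HASH join hint repartitions the fact rows across threads
    -- instead of assigning one thread per AccountKey from the temp table.
    SELECT
        f.CardNumberKey,
        COUNT(1) AS TransactionCount
    FROM
        dbo.FactMerchantIntelligence f
        INNER HASH JOIN #ChildCompanies cc ON cc.AccountKey = f.AccountKey
    WHERE
        f.TransactionDate <= @CurrentMonthEnd
        AND f.TransactionDate >= @23MonthsAgoStart
    GROUP BY
        f.CardNumberKey
    OPTION ( RECOMPILE );

    The same effect can be had without touching the join syntax by adding OPTION ( HASH JOIN ) at the end of the statement, which forces every join in the query to be a hash join.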


    jchang

    Thursday, November 14, 2019 1:56 AM
  • I'm not sure what you mean by the sqlplan for the actual execution? The screenshot I show above is of the Actual Execution Plan.

    Where should I look to see the detail in your last paragraph?

    Something interesting is that if I hover over the Index Seek, it shows only 627,496 as the estimated number of rows to be read.

    I ran a cut down version of the query and found that of the 11 account keys, only 5 have transactions and almost all of the 200 million are on one account. The smallest number was 35.


    Nick Ryan MIS Programmer Analyst, ANZ Bank



    • Edited by Nick Ryan Thursday, November 14, 2019 2:29 AM
    Thursday, November 14, 2019 2:04 AM
  • When you run the query with the actual execution plan, right-click in the graph and choose the first option.

    Also, right-click on the index seek into the fact table/index, choose the last item (Properties); on the right, look for Actual Number of Rows and expand it.

    But as you said above, one thread will have 200M rows. If it shows one thread with 200M, then this is an oversight introduced in SQL Server 2005 that was supposed to have been fixed.

    Force the hash join to get even distribution of work between threads - but check that this actually reduces execution time; it might depend on the degree of parallelism.
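
    If the GUI route is awkward, one alternative (an assumption on my part, not part of the advice above) is to capture the actual plan as XML and read the per-thread counts from it:

    -- Returns the actual execution plan as XML alongside the results; in the XML,
    -- each operator's RunTimeInformation contains one RunTimeCountersPerThread
    -- element per thread, with its ActualRows count.
    SET STATISTICS XML ON;

    -- run the problem INSERT ... SELECT here

    SET STATISTICS XML OFF;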

    Edit.

    Statistics on indexes need to be full scan. It could be that the average child company has 627K rows, hence the estimate for 11 companies is 6.9M, even though there is one with 200M.
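
    A sketch of the full-scan statistics update being suggested, using the index name Nick posts later in the thread:

    -- FULLSCAN reads every row when rebuilding the histogram, rather than sampling,
    -- which matters when the AccountKey distribution is heavily skewed.
    UPDATE STATISTICS dbo.FactMerchantIntelligence ix_AccountKey_TransactionDate_Includes
    WITH FULLSCAN;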

    SQL Server is not good at handling skewed distributions. I would like to see the full SQL plus the index definition for the one in use.

    It might be that you have to write this as two queries, one specifying the specific 200M ID value, then UNION ALL for the other companies (if the forced hash option cannot use a viable index).
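
    The rough shape of that two-query approach might be something like this (sketch only; @BigAccountKey is a hypothetical placeholder for the account holding the ~200M rows, and the aggregation from the original query would then be applied over the combined result):

    DECLARE @BigAccountKey int;  -- hypothetical: set to the AccountKey with ~200M rows

    -- Handle the one huge account separately, then the rest via the temp table.
    SELECT f.CardNumberKey, f.TransactionDate, f.Transaction_Amount
    FROM dbo.FactMerchantIntelligence f
    WHERE f.AccountKey = @BigAccountKey
        AND f.TransactionDate >= @23MonthsAgoStart
        AND f.TransactionDate <= @CurrentMonthEnd

    UNION ALL

    SELECT f.CardNumberKey, f.TransactionDate, f.Transaction_Amount
    FROM dbo.FactMerchantIntelligence f
        JOIN #ChildCompanies cc ON cc.AccountKey = f.AccountKey
    WHERE f.AccountKey <> @BigAccountKey
        AND f.TransactionDate >= @23MonthsAgoStart
        AND f.TransactionDate <= @CurrentMonthEnd;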


    jchang


    • Edited by jchang61 Thursday, November 14, 2019 1:05 PM
    • Marked as answer by Nick Ryan Thursday, November 14, 2019 7:37 PM
    Thursday, November 14, 2019 4:36 AM
  • So, can anyone suggest why the stats don't seem to be helping get the row count right?

    Maybe outdated statistics? See UPDATE STATISTICS (Transact-SQL).

    Olaf Helper



    Thursday, November 14, 2019 9:42 AM
  • I'm not sure how that can be. I only created the index a few hours before I ran the query, and other than that the table has not been updated in several weeks.

    Nick Ryan MIS Programmer Analyst, ANZ Bank

    Thursday, November 14, 2019 6:44 PM
  • Oh, I see what you mean regarding the plan. I thought you meant it would be helpful for me. I can't see a way I can provide the .sqlplan file here, though; it's quite large.

    That is a very good point regarding the skew of the distribution. This is a very large chain of supermarkets (by New Zealand standards) and the one account that has all the transactions is the one containing all the supermarket sites. The others are head office, online ordering, etc. with only a few. The majority of our merchant customers are much smaller.

    This is the query text up to the one that causes the problem.

    DECLARE
    	@ID	varchar(20) = '460272',
    	@AccountSiteRetailer varchar(8) = 'Account',
    	@Run_Date varchar(8) = '20181101'
    
    	SET NOCOUNT ON;
    
    	DECLARE @CurrentMonthEnd	date,	-- End date for extract
    			@23MonthsAgoStart	date,	-- Start date for extract
    			@CurrentMonth		int,
    			@Key				int = @ID;
    
    	-- Store the various date ranges required for the report (current month start and end, previous month start and 
    	-- end, 1 year ago, 2 years ago)
    	SET @CurrentMonthEnd = EOMONTH( CAST(@Run_Date AS date ));
    	SET @CurrentMonth = YEAR( @CurrentMonthEnd ) * 12 + MONTH( @CurrentMonthEnd );
    
    	SELECT
    		@23MonthsAgoStart = First_Date_Of_Month
    	FROM
    		dbo.DimDate
    	WHERE
    		DateKey = DATEADD(MONTH, -23, CAST(@Run_Date AS date));
    
    	-- If Account level ....
    
    	IF @AccountSiteRetailer = 'Account'
    	BEGIN
    		-- Create a table of the Account IDs from DimAccount for the chosen account number and all under that in the hierarchy.
    		DROP TABLE IF EXISTS #ChildCompanies;
    
    		WITH CTE_ChildCompanies (Account_Parent_ID, Account_ID, AccountKey)
    		AS
    		(
    		 -- Anchor member definition
    			SELECT Account_Parent_ID, Account_ID, AccountKey
    			FROM dbo.DimAccount
    			WHERE Account_ID = CAST( @Key AS int )
    
    			UNION ALL
    
    		-- Recursive member definition
    			SELECT e.Account_Parent_ID, e.Account_ID, e.AccountKey
    			FROM dbo.DimAccount e
    			INNER JOIN CTE_ChildCompanies d ON e.Account_Parent_ID = d.Account_ID
    		)
    		SELECT
    			AccountKey
    		INTO
    			#ChildCompanies
    		FROM
    			CTE_ChildCompanies;
    	END;
    
    	-- Create a table summarised by month for the last 12 months showing no. of new customers and no. of 
    	-- returning customers for each. A new customer is defined as one who has not used their card here 
    	-- in the preceding 12 months.
    
    	-- Extract goes back 24 months in order to be able to check for new cards for the first of the
    	-- 12 months.
    	
    	DROP TABLE IF EXISTS #FirstStage;
    
    	CREATE TABLE #FirstStage
    	(
    		CardNumberKey int NOT NULL,
    		Months int NOT NULL,
    		TransactionCount int NOT NULL,
    		TransactionAmount money NOT NULL
    	)
    
    	IF @AccountSiteRetailer = 'Account'
    		BEGIN
    			INSERT INTO #FirstStage( CardNumberKey, Months, TransactionCount, TransactionAmount )
    			SELECT
    				f.CardNumberKey,
    				YEAR( f.TransactionDate ) * 12 + MONTH( f.TransactionDate ) Months,
    				COUNT(1) TransactionCount,
    				SUM(f.Transaction_Amount) TransactionAmount
    			FROM
    				dbo.FactMerchantIntelligence f
    				JOIN #ChildCompanies cc ON cc.AccountKey = f.AccountKey
    				JOIN dbo.DimTransaction t ON t.TransactionKey = f.TransactionKey
    			WHERE
    				f.TransactionDate <= @CurrentMonthEnd
    				AND f.TransactionDate >= @23MonthsAgoStart
    				AND t.Transaction_Type IN( 'Purchase','Purchase with Cash','Completion','Goods and Services','Goods and Services with Cash' )
    			GROUP BY
    				f.CardNumberKey,
    				YEAR( f.TransactionDate ) * 12 + MONTH( f.TransactionDate )
    			OPTION( RECOMPILE );
    		END

    And the index:

    USE [MI]
    GO
    
    /****** Object:  Index [ix_AccountKey_TransactionDate_Includes]    Script Date: 15/11/2019 8:26:44 AM ******/
    CREATE NONCLUSTERED INDEX [ix_AccountKey_TransactionDate_Includes] ON [dbo].[FactMerchantIntelligence]
    (
    	[AccountKey] ASC,
    	[TransactionDate] ASC
    )
    INCLUDE([SiteKey],[CardNumberKey],[TransactionKey],[Transaction_Amount]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
    GO

    EDIT

    I tried running the query with just the one account; the optimiser chose a Hash Match and the actual number of rows was 99% of the estimate, so your theory about the skew of the numbers causing the problem was correct.

    It still took over a minute, but that's a lot better.

    I'll try forcing a hash join, but I'll then have to check that against the majority of cases just to make sure it doesn't make them worse.


    Nick Ryan MIS Programmer Analyst, ANZ Bank


    • Edited by Nick Ryan Thursday, November 14, 2019 7:37 PM
    Thursday, November 14, 2019 7:28 PM
  • I would check the actual execution plan on the high-row-count operation against FactMerchantIntelligence: are the rows evenly distributed among the worker threads? (There might be one control thread.)

    Also, what is the degree of parallelism relative to your system, and is soft-NUMA in effect? On a 2-socket system with 18-core processors, the default soft-NUMA setting could be 6 cores per node, in which case MAXDOP would be 6. Given that yours is a DW environment, I would recommend setting MAXDOP to the number of cores on one processor (scaling is weaker across multiple processors, but you could try that too).

    I am thinking that forcing a hash join in the SQL as written would not be a good idea. Given that the single account produces the correct plan, I might go with a cursor loop around the rows in #ChildCompanies, assigning AccountKey to a variable.

    Your query inside the cursor loop would then not have a join to the temp table, and the SARG would be:

    WHERE f.AccountKey = @AccountKey
        AND f.TransactionDate <= @CurrentMonthEnd
        AND f.TransactionDate >= @23MonthsAgoStart
        AND t.Transaction_Type IN( 'Purchase','Purchase with Cash','Completion','Goods and Services','Goods and Services with Cash' )
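
    A sketch of that cursor shape (the cursor and variable names are illustrative; the loop body is lifted from the query posted above):

    -- Sketch only: one INSERT per AccountKey, so each execution gets a plan sized
    -- for that account's row count (OPTION(RECOMPILE) keeps the estimates honest).
    DECLARE @AccountKey int;

    DECLARE account_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT AccountKey FROM #ChildCompanies;

    OPEN account_cursor;
    FETCH NEXT FROM account_cursor INTO @AccountKey;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        INSERT INTO #FirstStage( CardNumberKey, Months, TransactionCount, TransactionAmount )
        SELECT
            f.CardNumberKey,
            YEAR( f.TransactionDate ) * 12 + MONTH( f.TransactionDate ),
            COUNT(1),
            SUM( f.Transaction_Amount )
        FROM
            dbo.FactMerchantIntelligence f
            JOIN dbo.DimTransaction t ON t.TransactionKey = f.TransactionKey
        WHERE
            f.AccountKey = @AccountKey
            AND f.TransactionDate <= @CurrentMonthEnd
            AND f.TransactionDate >= @23MonthsAgoStart
            AND t.Transaction_Type IN( 'Purchase','Purchase with Cash','Completion','Goods and Services','Goods and Services with Cash' )
        GROUP BY
            f.CardNumberKey,
            YEAR( f.TransactionDate ) * 12 + MONTH( f.TransactionDate )
        OPTION( RECOMPILE );

        FETCH NEXT FROM account_cursor INTO @AccountKey;
    END;

    CLOSE account_cursor;
    DEALLOCATE account_cursor;

    Note that, as comes up further down the thread, grouping per account can leave more than one row per CardNumberKey/Months pair in #FirstStage, so a final re-aggregation over #FirstStage would still be needed.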


    jchang

    Friday, November 15, 2019 1:58 PM
  • You are not using AccountKey in your WHERE clause, so your index should not be used.

    A better index would be on: 

    TransactionDate, TransactionKey, AccountKey INCLUDE CardNumberKey, Transaction_Amount
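
    A sketch of that suggested index (the index name is illustrative):

    CREATE NONCLUSTERED INDEX ix_TransactionDate_TransactionKey_AccountKey
    ON dbo.FactMerchantIntelligence ( TransactionDate, TransactionKey, AccountKey )
    INCLUDE ( CardNumberKey, Transaction_Amount );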

    Friday, November 15, 2019 3:00 PM
  • The AccountKey SARG comes via the Nested Loops join from the temp table. The issue is that the method SQL Server employs for parallel execution plans with a loop join is to assign one thread to each value from the outer source (the temp table in this case). This is a very efficient method in terms of having minimal inter-thread communication (Parallelism Repartition Streams). However, it sucks when the rows in the inner source are not uniformly distributed, which is the case here.

    I would be mildly curious as to how much overhead the YEAR and MONTH functions generate.
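
    For what it's worth, a single DATEDIFF gives the same month bucketing as the two-function expression, just offset by a constant (illustrative only; whether it measurably changes the overhead would need testing):

    -- Both expressions map a date to a sequential month number; they differ only by
    -- a constant offset, so either works as a GROUP BY key.
    DECLARE @d date = '20181101';
    SELECT
        YEAR( @d ) * 12 + MONTH( @d )  AS YearTimes12PlusMonth,  -- 24227
        DATEDIFF( MONTH, 0, @d )       AS MonthsSince1900;       -- 1426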


    jchang

    Friday, November 15, 2019 3:30 PM
  • I'm not sure what some of those questions mean. The SQL Server has 8 processors, but I don't know how to tell how many cores each has, nor do I know what's meant by 2 sockets. MAXDOP for the server is set to 0 and I haven't used a MAXDOP hint on my query.

    I don't know what soft-NUMA is.

    I had actually come up with the cursor idea myself and noted that each run through came up with a different execution plan, each more suited to the number of rows, with the estimate being close to actual - so that's a good suggestion.

    UPDATE

    I asked the DBA responsible for our servers to explain that lot to me. Hopefully when she's not too busy, she'll let me know.

    SECOND UPDATE

    Didn't think the cursor approach through. Because the INSERT had a GROUP BY, inserting in separate stages meant the end result had multiple rows where there should have been one.


    Nick Ryan MIS Programmer Analyst, ANZ Bank



    • Edited by Nick Ryan Tuesday, November 19, 2019 1:00 AM
    Sunday, November 17, 2019 8:26 PM
  • It was using that index. I can't remember the term for it, but AccountKey is the column out of that lot with the most unique values. According to, I think, Brent Ozar, that's the column that should be the leading part of my index.

    I also saw advice that said I should pick columns that are in my WHERE clause first, and then columns that are in my JOIN. I think that although AccountKey was in the latter, the optimiser was clever enough to realise this was the best index to use.


    Nick Ryan MIS Programmer Analyst, ANZ Bank

    Sunday, November 17, 2019 8:45 PM
  • FYI: I tried removing the functions from stage 1 and just used TransactionDate, so that I had it available to create those columns at a later stage.

    On 3 runs of each, it took much the same amount of time.


    Nick Ryan MIS Programmer Analyst, ANZ Bank

    Sunday, November 17, 2019 9:00 PM