Optimizing Union All Queries

  • Question

• Hi Friends,

    Could anyone help improve the following T-SQL query, which runs inside a stored procedure, queries multiple tables, and returns the combined results? It is currently very slow: in SQL Data Warehouse it takes 15 minutes to return just 1 million rows. The T-SQL loops over the daily tables, queries each one, and combines the results with UNION ALL to produce the final result set. The main performance issue occurs at the join; please see the SET @sqlJoin = @sqltable + ' LEFT ...' line below.

    The full query is given below.

     

    IF EXISTS (SELECT * 
    FROM sys.objects 
    WHERE object_id = OBJECT_ID(N'[REPORT].[ExportHistoryToCsv]'
    ))

      DROP PROCEDURE [REPORT].[ExportHistoryToCsv]

    GO

    CREATE PROC [REPORT].[ExportHistoryToCsv]
      @EstablishmentId [INT],
      @HashedUsernames [VARCHAR](8000),
      @SortColumn [VARCHAR](40),
      @SortOrder [VARCHAR](4),
      @StartDatetime [DATETIME2](3),
      @EndDatetime [DATETIME2](3),
      @PageNumber [INT],
      @RowspPage [INT],
      @ContentTypes [VARCHAR](400),
      @Url [VARCHAR](8000),
      @FilterListIds [VARCHAR](1000),
      @ActionType [SMALLINT],
      @UserTypes [VARCHAR](20),
      @ClientIP [VARCHAR](15),
      @Datacenter [VARCHAR](100),
      @externaltablename [VARCHAR](40),
      @blobpath [VARCHAR](1000)
    AS

    BEGIN
      BEGIN TRY
        DECLARE @sql nvarchar(max)
        DECLARE @sqlselect nvarchar(max)
        DECLARE @sqltable nvarchar(max)
        DECLARE @sqlJoin nvarchar(max)
        DECLARE @sqlwhere nvarchar(max)
        DECLARE @sqlorderby nvarchar(max)
        DECLARE @counter int
        DECLARE @pos int
        DECLARE @len int
        DECLARE @mime varchar(400)
        DECLARE @found bit
        DECLARE @params nvarchar(4000)

        SET @sql = ''
        SET @sqlselect = ''
        SET @sqltable = ''
        SET @sqlwhere = ''
        SET @sqlorderby = ''


        SET @SortOrder =
                        CASE @SortOrder
                          WHEN 'desc' THEN 'DESC'
                          ELSE 'ASC'
                        END

        SET @sqlorderby =
                         CASE @SortColumn
                           WHEN 'time' THEN '[logTime] ' + @SortOrder
                           WHEN 'action' THEN '[Action] ' + @SortOrder + ', [logTime] DESC'
                           WHEN 'ip' THEN '[ClientIP] ' + @SortOrder + ', [logTime] DESC'
                           ELSE '[logTime] ' + @SortOrder
                         END


        SET @sqlselect = 'SELECT CONCAT([protocol],''://'',[hostname],[port],[urlParam]) as [Url],
          CASE [Action] WHEN 2 THEN ''Denied'' ELSE ''Allowed'' END as [SnAction],
          [FilterLists].[Name] as [FilterListName],
          [clientIP] as [ClientIp],
          [TestNetUsers].[Name] as [UserFullName],
          [logTime] as [LogTime]
          FROM '

        SET @sqlwhere += ' WHERE [establishmentId] = @EstablishmentId AND [hostname] IS NOT NULL AND [hostname] != '''''


      
        IF (@EstablishmentId IS NULL
          OR @EstablishmentId = 0)
        BEGIN
          THROW 200000, 'EstablishmentId parameter is null or 0.', 1;
        END

        IF (@StartDatetime IS NOT NULL) 
        BEGIN
          SET @sqlwhere += ' AND [logTime] >= @StartDatetime'
        END
        ELSE
        BEGIN
          THROW 200001, 'StartDatetime parameter is null or invalid.', 1;
        END

        IF (@EndDatetime IS NOT NULL)
        BEGIN
          SET @sqlwhere += ' AND [logTime] <= @EndDatetime'
        END
        ELSE
        BEGIN
          THROW 200002, 'EndDatetime parameter is null or invalid.', 1;
        END

        
        
        

        DECLARE @endDate date
        DECLARE @tempDate date
        DECLARE @dataSet bit
        DECLARE @tableName varchar(100)

        SET @endDate = @EndDatetime
        SET @tempDate = @StartDatetime
        SET @dataSet = 0
        SET @tableName =
                        CASE @Datacenter
                          WHEN 's' THEN 's'
                          ELSE 'c'
                        END

        WHILE @tempDate <= @endDate
        BEGIN 
          SET @sqltable = @tableName + CONVERT(varchar(10), @tempDate, 112)
          SET @sqlJoin = @sqltable + ' LEFT JOIN [REPORT].[TestNetFilterLists] FilterLists ON ' +
            @sqltable + '.[filterListId] = [FilterLists].[Id] LEFT JOIN [REPORT].[TestNetUsers] TestNetUsers ON ' +
            @sqltable + '.[userHash] = [TestNetUsers].[HasheduserName] '
          
          IF EXISTS (SELECT 1 FROM sys.tables WHERE NAME = @sqltable)
          BEGIN
            IF @dataSet = 0 --executed only once; builds the outer paging query
            BEGIN
              SET @sql += 'SELECT
                [Url],
                [SnAction],
                [FilterListName],
                [ClientIp],
                [UserFullName],
                [LogTime] FROM
                (SELECT ROW_NUMBER() OVER(ORDER BY ' + @sqlorderby + ') AS [Row],
                [Url],
                [SnAction],
                [FilterListName],
                [ClientIp],
                [UserFullName],
                [LogTime] FROM ('
                + @sqlselect
                + @sqlJoin
                + @sqlwhere
              SET @dataSet = 1 --without this, every later table rebuilds the outer query instead of appending a UNION ALL
            END
            ELSE
            BEGIN
              SET @sql += ' UNION ALL ' + @sqlselect + @sqlJoin + @sqlwhere
            END
          END

          SET @tempDate = DATEADD(dd, 1, @tempDate)
        END

        IF @sql != ''
        BEGIN
          SET @sql += ') AS [data] ) AS [RowData]
            WHERE [RowData].[Row] BETWEEN ((@PageNumber - 1) * @RowspPage + 1) AND (@PageNumber * @RowspPage)'

          PRINT @sql;

          SET @params = N'@externaltablename [VARCHAR](40), @blobpath [VARCHAR](1000), @EstablishmentId [INT], @StartDatetime [DATETIME2](3), @EndDatetime [DATETIME2](3), @PageNumber [INT], @RowspPage [INT], @Url [VARCHAR](8000), @ActionType [SMALLINT], @ClientIP [VARCHAR](15)'

     --Output to blob using polybase
     SET @sql = 'IF EXISTS (SELECT * 
    FROM sys.objects 
    WHERE object_id = OBJECT_ID(N''[REPORT].['+@externaltablename+']''
    ))
    BEGIN
                DROP EXTERNAL TABLE [REPORT].['+@externaltablename+'] 
       END

          CREATE EXTERNAL TABLE [REPORT].['+@externaltablename+']
    WITH 
    (
    LOCATION='''+@blobpath+''',
    DATA_SOURCE = ReportStorage,
    FILE_FORMAT = CommaDelimitedFileExport,
    REJECT_TYPE = percentage,
    REJECT_VALUE = 1,
    REJECT_SAMPLE_VALUE = 1000
    ) AS '+ @sql;


          EXEC sp_executesql @sql,
                             @params,
                             @externaltablename,
                             @blobpath,
                             @EstablishmentId,
                             @StartDatetime,
                             @EndDatetime,
                             @PageNumber,
                             @RowspPage,
                             @Url,
                             @ActionType,
                             @ClientIP

          SELECT 1
        END
        --executed if none of the tables exists and will return empty result
        ELSE
        BEGIN
        SET @params = N'@externaltablename [VARCHAR](40),@blobpath [VARCHAR](1000)'

    SET @sql = 'IF EXISTS (SELECT * 
    FROM sys.objects 
    WHERE object_id = OBJECT_ID(N''[REPORT].['+@externaltablename+']''
    ))
    BEGIN
                DROP EXTERNAL TABLE [REPORT].['+@externaltablename+'] 
       END

          CREATE EXTERNAL TABLE [REPORT].['+@externaltablename+']
    WITH 
    (
    LOCATION='''+@blobpath+''',
    DATA_SOURCE = ReportStorage,
    FILE_FORMAT = CommaDelimitedFileExport,
    REJECT_TYPE = percentage,
    REJECT_VALUE = 1,
    REJECT_SAMPLE_VALUE = 1000
    ) AS SELECT
    [UserHash] = 1,
    [Url] = 1,
    [SNAction] = 1,
    [LogTime] = 1
    WHERE 1 = 0'


          EXEC sp_executesql @sql,
                             @params,
                             @externaltablename,
                             @blobpath

          SELECT 1
        END
      END TRY
      BEGIN CATCH
        THROW;
      END CATCH
    END

    GO

    Wednesday, March 1, 2017 1:17 PM

Answers

  • The T-SQL may have some optimizations that could be made, but we most often see the biggest issues in the data distribution and the indexes.

    The main table should have an index and a hash-based distribution.

    If it doesn't, the query has to do a full table scan for every join.

    I would look at the following documentation to get a better understanding of how to optimize your table structure.

    Distributed Tables

    Indexes
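
    A minimal sketch of that advice, using table and column names assumed from this thread (the daily table name and join key are illustrative; adjust to your actual schema), would be a CTAS rebuild of one daily log table with a hash distribution on the join key:

    ```sql
    -- Rebuild a daily log table distributed by hash on the join key,
    -- so rows that join on userHash land on the same compute node
    -- and the join does not need a shuffle move.
    CREATE TABLE [c20170301_hashed]
    WITH
    (
        CLUSTERED COLUMNSTORE INDEX,
        DISTRIBUTION = HASH([userHash])
    )
    AS
    SELECT * FROM [c20170301];

    -- Statistics on join and filter columns help the distributed optimizer.
    CREATE STATISTICS [stats_userHash] ON [c20170301_hashed] ([userHash]);
    CREATE STATISTICS [stats_logTime] ON [c20170301_hashed] ([logTime]);
    ```

    Distributing both sides of a join on the same key is what avoids the data movement between compute nodes discussed earlier in the thread.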

    • Marked as answer by 1-----------1 Tuesday, March 7, 2017 4:54 AM
    Monday, March 6, 2017 4:45 PM

All replies

  • Are you using the same distribution hash key for all three of those tables?

    My guess is that data shuffle between the compute nodes is what is taking the time.

    Also, what indexes do you have on those tables and what is their rowcount?

    What size DWU are you running at and what data volume size?

    -Casey

    Wednesday, March 1, 2017 5:14 PM
  • Hi Casey,

    The index is a clustered columnstore, created as:

    CREATE TABLE [Report].[TestNetUsers]
    WITH
    (   
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = ROUND_ROBIN
    )
    AS
    SELECT * FROM [dbo].[TestUsersExternal]; -- loading data using PolyBase

    UPDATE STATISTICS [Report].[TestNetUsers] ([stats_hashname]);

    The row count for the table is 2500+.
    The DWU size ranges from 100 to 1000. At DWU 100 it returns rows within 6 minutes, while at DWU 1000 it took 10+ minutes, which is strange.

    I still need to check the hash key distribution; I am not sure whether the tables use the same distribution key.


    Vishnu VG

    Thursday, March 2, 2017 7:49 AM
  • Could you send the Table DDL for the other 2 tables that you are using for the joins?

    Also, looking at the table TestNetUsers, I would use a clustered index instead of a CCI.

    A CCI is potentially less performant than a clustered index for tables with fewer than 60 million rows.

    How much data do you have in your database?
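
    A quick sketch of that change, reusing the CTAS pattern already shown in this thread (names taken from the earlier post), might look like:

    ```sql
    -- Small dimension table (~2,500 rows): a rowstore clustered index
    -- typically outperforms a clustered columnstore at this size.
    CREATE TABLE [Report].[TestNetUsers]
    WITH
    (
        CLUSTERED INDEX ([HasheduserName]),
        DISTRIBUTION = ROUND_ROBIN
    )
    AS
    SELECT * FROM [dbo].[TestUsersExternal];
    ```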

    Friday, March 3, 2017 1:01 AM
  • Hi Casey. Thank you so much for the suggestions.

    These are the two tables used in the join, each with around 2,000 entries. I changed the index to a heap, since a clustered columnstore is not efficient for so few rows. I will try a clustered index as you suggested.

    CREATE TABLE [Report].[TestNetFilterLists]
    WITH ( HEAP , DISTRIBUTION = ROUND_ROBIN)

    CREATE TABLE [Report].[TestNetUsers]
    WITH ( HEAP , DISTRIBUTION = ROUND_ROBIN
    )

    The main table has millions of log records, but unfortunately it is a historic table without any indexes. I doubt there is any way to improve performance when querying multiple tables other than looping and UNION ALL. Do you find any issues with the T-SQL queries so far, or is it just the data movement causing the delay?

               

    Vishnu VG



    Monday, March 6, 2017 5:36 AM
  • The T-SQL may have some optimizations that could be made, but we most often see the biggest issues in the data distribution and the indexes.

    The main table should have an index and a hash-based distribution.

    If it doesn't, the query has to do a full table scan for every join.

    I would look at the following documentation to get a better understanding of how to optimize your table structure.

    Distributed Tables

    Indexes

    • Marked as answer by 1-----------1 Tuesday, March 7, 2017 4:54 AM
    Monday, March 6, 2017 4:45 PM
  • Yup Thanks Casey. Got it.

    I've added indexes and statistics to the sub-tables, and it is now performing much better: the total time is down to just 2 minutes. As you suggested, I will look into adding indexes and distribution optimizations.

    As of now this solved the issue.
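
    For anyone finding this later, the kind of statistics that help here, using the join columns from the original query (a sketch, not necessarily the exact commands used), look like:

    ```sql
    -- Create statistics on the columns used in the joins so the
    -- distributed query optimizer can estimate row counts.
    CREATE STATISTICS [stats_Id] ON [REPORT].[TestNetFilterLists] ([Id]);
    CREATE STATISTICS [stats_HasheduserName] ON [REPORT].[TestNetUsers] ([HasheduserName]);
    ```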


    Vishnu VG

    Tuesday, March 7, 2017 4:55 AM