locked
Query performance unions vs temp tables RRS feed

  • Question

  • I've seen several cases where a query in this format:

    select col1,col2,col3

    from

    (

    select col1,col2,col2 from table1

    union

    select col1,col2,col3 from table 2

    )

    is much slower than

    select col1,col2,col3 into #tmptable from table1

    insert into #tmptable select col1,col2,col3 from table2 left join #tmptable on col1=col1 where #tmptable.col1 is null

    select col1,col2,col3 from #tmptable

    Especially in cases where the top table (table1) is large and the bottom table (table2) is small or empty. Our tempdb is segregated onto it's own disk, but the data drives are very fast drives and we're not seeing any IO waits. The DMVs just show the query as runnable, but the select into version runs as much as 10x faster in some cases.

    Can someone explain this behavior?

     

    Thanks

    Tuesday, January 25, 2011 7:14 PM

Answers

  • How about something like this instead of using the temp table?

    SELECT col1,col2,col3
    FROM Table1
    UNION ALL
    SELECT col1,col2,col3
    FROM Table2
    WHERE NOT EXISTS (SELECT * FROM Table1 WHERE col1=Table2.col1 AND col2=Table2.col2 AND col3=Table2.col3)

    It effectively returns the same result without having any temporary storage.

     


    --Brad (My Blog)
    • Marked as answer by GPage Tuesday, January 25, 2011 8:56 PM
    Tuesday, January 25, 2011 8:52 PM

All replies

  • Do you really want a UNION (which eliminates all duplicates)?  Or do you really want a UNION ALL?

    The UNION ALL is much faster.


    --Brad (My Blog)
    Tuesday, January 25, 2011 7:16 PM
  • We want to avoid dups so a union is being used instead of a union all. That's also why the left join in the insert statement to avoid dups going into the temp table.
    Tuesday, January 25, 2011 7:28 PM
  • Understood.

    Are you sure that Table1 and Table2 each don't have any duplicates in them to start with?

    The UNION will eliminate duplicates in either table, but the temptable solution will not.

    In other words, if Table1 consists of 1million rows, all with the exact same data, and Table2 consists of no rows at all, then the UNION query will end up with only 1 row, but the temptable solution will end up with 1million rows.

     


    --Brad (My Blog)
    • Proposed as answer by Gert-Jan Strik Tuesday, January 25, 2011 8:00 PM
    • Unproposed as answer by GPage Tuesday, January 25, 2011 8:48 PM
    Tuesday, January 25, 2011 7:38 PM
  • Right,

    In this case both tables have been de-dupped within the scope of the table itself prior to this statment (there is a primary key on each table), but they have not been de-duped in relation to each other.

     

    Tuesday, January 25, 2011 8:46 PM
  • How about something like this instead of using the temp table?

    SELECT col1,col2,col3
    FROM Table1
    UNION ALL
    SELECT col1,col2,col3
    FROM Table2
    WHERE NOT EXISTS (SELECT * FROM Table1 WHERE col1=Table2.col1 AND col2=Table2.col2 AND col3=Table2.col3)

    It effectively returns the same result without having any temporary storage.

     


    --Brad (My Blog)
    • Marked as answer by GPage Tuesday, January 25, 2011 8:56 PM
    Tuesday, January 25, 2011 8:52 PM
  • Yes I think that would work.

    I think the implied answer is that it's the sort and filter imposed by the union that's causing the problem. I basically did the same thing you have there in this format:

    select col1,col2,col3 from table1

    union all

    select col1,col2,col3 from table2 t2 left join table1 t1 on t2.col1=t1.col1

    where t1.col1 is null

    Is there an advantage to using the not exists over the left join?

    Tuesday, January 25, 2011 8:57 PM
  • A follow up question would be why doesn't the union show up as more expensive in the query plan. There's a distinct sort after the union but the execution plan shows it as 0% of the cost of the query, while an index scan is showing 90% of the cost. However with the sort removed the query returns in seconds instead of minutes.
    Tuesday, January 25, 2011 9:00 PM
  • NOT EXISTS is better because it will do an "Semi Join"... meaning that it's just checking if rows exist in Table1 or not... it doesn't care about actual data in Table1.

    The LEFT JOIN will actually bring in columns from Table1 and then it will filter the result (looking for the Col1 IS NULL)... that's a lot more work.

     


    --Brad (My Blog)
    Tuesday, January 25, 2011 9:06 PM
  • Yes, there may be performance improvement with using NOT EXISTS vs. LEFT JOIN. Gail Shaw performed a series of tests, check

    LEFT JOIN vs NOT EXISTS
    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Tuesday, January 25, 2011 9:07 PM
  • A follow up question would be why doesn't the union show up as more expensive in the query plan. There's a distinct sort after the union but the execution plan shows it as 0% of the cost of the query, while an index scan is showing 90% of the cost. However with the sort removed the query returns in seconds instead of minutes.


    Remember those costs are only estimates based on what it can tell from the statistics of your tables... and the optimizer, while amazing, is not perfect at estimating everything.

    How many rows are we talking about from each table?  Can you post an actual (not estimated) execution plan?  How do the estimated numbers compare to the actual numbers?

    It's hard to answer your question without knowing more about the plan.

     


    --Brad (My Blog)
    Tuesday, January 25, 2011 9:10 PM
  • This is in a data warehouse so we're dealing with very large numbers of records in some cases. In this case it must be one of those "your mileage may vary" sort of things. When I use the left join it returns all rows (1.4M) in 30s. With the not exists it returns the records in about a minute and a half so 3x as long. In this case table1 and table2 have the same structure where table1 has 1.4M rows and table2 has 0. Changing the UNION to a UNION ALL with a filter for rows that already exist cuts the time down from over an hour to 30s. The stats should be up to date since we do an update stats after loading the staging tables. We don't do a fullscan, but the distribution is pretty uniform and my understanding is that you shouldn't have to do a fullscan unless your data distribution is scewed in some fashion. In the actual plan it shows the index scan on table1 as being the most expensive by a very large margin (the distinct sort is 0%) even though the distinct sort is what's causing the problem.

    I appreciate the help.

    Tuesday, January 25, 2011 9:25 PM