SQL Optimisation

  • Question

  • Hi,

    I am currently working on a very messy ETL implementation (straight sprocs instead of SSIS). The code is badly written and formatted, and some tasks take forever: the last one I optimised took in excess of 4 hrs 50 min to calculate ~2700 rows, and went down to 4 min 51 sec after I changed a few things.

    I am noticing a pattern throughout the whole code and I'd like to get some opinions on better ways to implement it.

    The code pretty much does the following (pseudocode):

    INSERT INTO reportingTable (col1, col2, col3)
    SELECT col1, col2, col3
    FROM opdb.dbo.table1
    INNER JOIN opdb.dbo.table2 ON ...
    INNER JOIN opdb.dbo.table3 ON ...
    ...
    ...
    LEFT JOIN opdb.dbo.table13 ON ...
    LEFT JOIN table14 ON ...
    WHERE blah = blahblah

    A major performance optimisation was adding OPTION (FORCE ORDER) and reordering the JOINs to eliminate as many rows as possible before doing the larger ones, placing the OUTER joins lower, removing CROSS JOINs, etc.
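    For illustration, here is a minimal sketch of the pattern (table and column names are made up, not from the actual code): the smaller, more selective joins go first, the large tables and the OUTER joins go last, and the hint keeps the optimiser from reordering them:

    INSERT INTO reportingTable (col1, col2, col3)
    SELECT s.col1, f.col2, l.col3
    FROM opdb.dbo.smallFilterTable AS s            -- most selective table first
    INNER JOIN opdb.dbo.mediumTable AS m
        ON m.filterKey = s.filterKey
    INNER JOIN opdb.dbo.largeFactTable AS f        -- large table joined after the row count is reduced
        ON f.mediumKey = m.mediumKey
    LEFT JOIN dbo.optionalLookup AS l              -- outer joins placed last
        ON l.factKey = f.factKey
    WHERE s.someFlag = 1
    OPTION (FORCE ORDER);                          -- join in exactly the order written above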

    It seems to me that SQL Server 2005 (SP3) gets somewhat confused by the large amount of data and does not create the best execution plan. It also tends to blow up tempdb (on one occasion it grew to 220 GB) because of the excessive amount of data passing through (I sometimes have to use COUNT_BIG because otherwise I get an arithmetic overflow).
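    (For reference on the overflow: COUNT returns an INT, so it fails with an arithmetic overflow once the count exceeds 2,147,483,647, while COUNT_BIG returns a BIGINT. The table name below is made up.)

    SELECT COUNT(*)     FROM opdb.dbo.someHugeTable;  -- can fail with arithmetic overflow on very large counts
    SELECT COUNT_BIG(*) FROM opdb.dbo.someHugeTable;  -- returns BIGINT, so no overflow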

    Also, someone decided to create views in the reporting db which pretty much do a SELECT * from an operational table in the operational database, add more joins inside, and sometimes reference other (nested) views.
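    To make that concrete, the views look roughly like this (purely illustrative object names, not the real ones):

    -- A view in the reporting database that reaches into the operational database,
    -- does a SELECT *, adds its own joins, and sits on top of yet another view
    CREATE VIEW dbo.vCustomerOrders
    AS
    SELECT o.*
    FROM opdb.dbo.orders AS o
    INNER JOIN opdb.dbo.customers AS c
        ON c.customerId = o.customerId
    INNER JOIN dbo.vCustomerDetails AS d           -- nested view
        ON d.customerId = c.customerId;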

    Now, after fixing some issues, I am running out of options: some queries still run for more than an hour or two and still use more than 10-20 GB of tempdb.

    I am wondering if there are any performance/tempdb issues in SQL Server 2005 with:

    1. Multiple joins between tables from two or more databases
    2. Joining multiple views based on a different database
    3. Using OPTION (FORCE ORDER)

    Also, it would be interesting to know whether non-clustered indexes on the joined tables could reduce tempdb usage.
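    As an example of the kind of index I have in mind (hypothetical table and columns), a non-clustered index on the join key that covers the selected column can let the optimiser seek instead of scanning and spooling the whole table:

    CREATE NONCLUSTERED INDEX IX_largeFactTable_mediumKey
        ON opdb.dbo.largeFactTable (mediumKey)
        INCLUDE (col3);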

    Please give me some general ideas of what else would be good to try (apart from re-writing everything).

    Thanks!

    Boyan Penev --- http://www.bp-msbi.com
    Wednesday, December 16, 2009 5:22 AM

Answers

  • I often use table variables, temporary tables, or specifically created permanent tables to hold the "core" set of data; this saves the optimiser from having to re-process the same data repeatedly and from applying the WHERE clause to large, repeated sets of data. I then join off that core object and filter further, which can be much more efficient.

    For complex queries I'll map out the join structure like an ER diagram and identify the fields used in the joins and in the WHERE clause. Then I start breaking the one big statement up into multiple statements using the structures mentioned above. The point where you start left joining in the big statement can indicate how the temp structures might be applied.

    It can take a bit of experimentation to get right, so I would advise running this in a non-production environment.
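    A minimal sketch of the idea, with made-up table and column names: materialise the heavily filtered "core" rows once, optionally index them, then join the small result off to the remaining tables.

    -- Materialise the filtered "core" set once
    SELECT t1.keyCol, t1.col1, t2.col2
    INTO #coreData
    FROM opdb.dbo.table1 AS t1
    INNER JOIN opdb.dbo.table2 AS t2
        ON t2.keyCol = t1.keyCol
    WHERE t1.someFilter = 'blahblah';

    -- Index the temp table on the key used by the later joins
    CREATE CLUSTERED INDEX IX_coreData_keyCol ON #coreData (keyCol);

    -- Join the small core set to the rest and load the reporting table
    INSERT INTO reportingTable (col1, col2, col3)
    SELECT c.col1, c.col2, t3.col3
    FROM #coreData AS c
    INNER JOIN opdb.dbo.table3 AS t3
        ON t3.keyCol = c.keyCol;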

    • Marked as answer by Boyan Penev Friday, December 18, 2009 12:56 AM
    Wednesday, December 16, 2009 6:56 AM
  • Hi,

    Using a permanent table for both the insert and the select in big queries will reduce performance.

    1. The best option would be to use a table variable: do the insert operation there first and then shift the data to the permanent table.

    2. One more thing: keep your transactions split up.

        If such a bulk insert happens within a single BEGIN/COMMIT transaction, it will lead to isolation problems, so keep your transactions small. To achieve this you can go for batch inserts in a WHILE loop, so the transactions are split up (see the sketch below).

    3. You can also split up the SELECT queries: store the output of each SELECT in a table variable, proceed with the next table in the same way, and at the end do the insert into the permanent table.
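    A rough sketch of the batched-insert idea (the staging temp table and column names are hypothetical; written for SQL Server 2005 syntax): each iteration moves one batch inside its own small transaction.

    DECLARE @batchSize INT;
    DECLARE @rowsMoved INT;
    SET @batchSize = 50000;
    SET @rowsMoved = 1;

    WHILE @rowsMoved > 0
    BEGIN
        BEGIN TRANSACTION;

        -- Delete one batch from the staging table and insert the deleted rows
        -- into the permanent table in the same statement
        DELETE TOP (@batchSize)
        FROM #stagingRows
        OUTPUT deleted.col1, deleted.col2, deleted.col3
            INTO reportingTable (col1, col2, col3);

        SET @rowsMoved = @@ROWCOUNT;   -- capture before COMMIT resets @@ROWCOUNT

        COMMIT TRANSACTION;
    END;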

    Thanks,


    Gnanasekar Babu Note: Please click on the vote button if the answer helps you
    • Marked as answer by Boyan Penev Friday, December 18, 2009 12:56 AM
    Wednesday, December 16, 2009 7:09 AM

All replies

  • First, thank you for your replies. Based on them, tomorrow when I get back to work, I will try the following:

    1. Break up the source data query and process it in a WHILE loop which cycles by year - this splits it into at least 4-5 sections (it cannot be a delta load because historical data may change).
    2. Create table variables for some of the tables - especially for the ones which get filtered (say in the WHERE clause or the JOINs); and maybe also try to keep the table vars relatively small (so my tempdb doesn't blow up).
    3. Join the table variables with the rest of the physical tables (if any are left out).
    4. Insert each year into the main table sequentially.
    5. Time the query.
    6. Post the results here and see if the whole operation is successful :)

    Thanks again. Also, I am working in a dev environment (pre-prod, sort of a prod sandbox replica), so there are no worries about breaking things. The only concern is that I will start getting emails about excessive disk usage from my queries, but I think this will be very easy to handle.

    Boyan Penev --- http://www.bp-msbi.com
    Wednesday, December 16, 2009 10:09 AM
  • Alright,

    Thinking further about the process, putting a lot of data in table variables would actually impact my tempdb space, which is critical, so I decided against it.

    What I did instead was to loop through my data by year, thus making sure that when the data volume grows (after some more years of inputs), the tempdb usage remains fairly constant.

    This change also seems to help with performance: the server does not have much memory, and according to a DBA around here, breaking up the transactions leads to less swapping, which did indeed result in faster queries.
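    For completeness, a minimal sketch of the year-by-year loop (table, column, and key names are hypothetical; it assumes the main source table has a date column to split the load on):

    DECLARE @year INT;
    DECLARE @maxYear INT;

    SELECT @year = MIN(YEAR(transactionDate)),
           @maxYear = MAX(YEAR(transactionDate))
    FROM opdb.dbo.table1;

    WHILE @year <= @maxYear
    BEGIN
        -- Load one year at a time so tempdb usage stays roughly constant per iteration
        INSERT INTO reportingTable (col1, col2, col3)
        SELECT t1.col1, t2.col2, t3.col3
        FROM opdb.dbo.table1 AS t1
        INNER JOIN opdb.dbo.table2 AS t2 ON t2.keyCol = t1.keyCol
        INNER JOIN opdb.dbo.table3 AS t3 ON t3.keyCol = t1.keyCol
        WHERE YEAR(t1.transactionDate) = @year;

        SET @year = @year + 1;
    END;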

    Thanks for the help.

    Boyan Penev --- http://www.bp-msbi.com
    Friday, December 18, 2009 12:55 AM