Ask a questionAsk a question
 

QuestionTempDB Contention

  • Thursday, December 04, 2008 6:31 PMGlenn Wellington Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    Hi all,

    I was wondering if anyone had any advice for me regarding the following issues that I've seen in our TempDB as of late. 

    These past few mornings we've been seeing a lot of random blocks at around 10:00AM (Yes, I am looking into what's currently running, but nothing has really changed in our production environment as of late.) that have a lot of wait types of PAGELATCH_EX (not PAGEIOLATCH_EX).  The wait type description states that it is not because of an IO request.  (tempDB is split into 16 files (we have 16 cores) on fast (15K) RAID 1/0 space. )  The wait resource is 2:15:<page id> for all of the blocks (100+)

    Does anyone know why SQL would be using the exact same tempDB file for all it's processes?  Isn't SQL set up to use multiple files at once to reduce contention?  (Hence have 16 files for 16 cores?)

    Thanks in advance,
    Glenn

All Replies

  • Thursday, December 04, 2008 8:27 PMJonathan KehayiasMVP, ModeratorUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     

    Glenn,

     

    Have a look at Paul Randals blog post on Multiple Files:

     

    http://www.sqlskills.com/blogs/paul/post/Search-Engine-QA-12-Should-you-create-multiple-files-for-a-user-DB-on-a-multi-core-box.aspx

     

    His recommendation is not one file per core, but perhaps 1/4 to 1/2.  Also are all of the tempdb files sized the same, and are they on separate disks, or all on the same disk?  Also is this SQL 2000/2005/2008.  If it is 2000, then you may need to add the -T1118 flag as per Pauls Blog post.  This is not needed in SQL 2005/2008.

  • Thursday, December 04, 2008 10:33 PMGlenn Wellington Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    Thanks for your reply, John. 

    We are running SQL 2005 with the latest service pack.  All of our tempDB files are size the same so no growths have happened recently.  The files are all sitting on the same set of disks.  I believe there's 40 15K RAID 1/0 disks supporting it.  (We run an extremely heavy I/O platform.) 

    I'm going to learn more about the SGAM/IAM pages.  I have a feeling that's what might be killing us.  I have a perfmon window open with the tempDB metrics up to keep an eye on it.   On avg. we have about 400 work tables active and 1400 tables in tempDB at any given time. 

    Thanks again.
  • Friday, December 05, 2008 2:06 AMPaul RandalMVP, ModeratorUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     

    Couple more things to check:

    1. Are you explicitly using either of the snapshot isolation levels? This will make very heavy use of tempdb - adding to the load.
    2. Are you making heavy use of triggers? If so, my team rewrote the implementation of triggers between 2000 and 2005, and now they use the version store in tempdb.

    Either of these could be contributing significantly to the perf problems in tempdb.

     

    Thanks

     

  • Friday, December 05, 2008 2:46 AMJonathan KehayiasMVP, ModeratorUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     

    Glenn,

     

    Is tempdb on a dedicated LUN/DASD disk that has 40 disks behind it, or is this a shared LUN with the data and log files as well?

     

    Paul,

     

    What would the recommended number of tempdb files be for the number of cores involved? Four, eight, or sixteen?  Is it possible that there is so much contention that 16 files is actually needed here, and would there be any benefit from spreading the files across multiple LUNS/DASD disks?

     

  • Friday, December 05, 2008 3:23 AMPaul RandalMVP, ModeratorUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     

     

    Ah - I missed the fact that all the contention is on a single file in tempdb. Hmm. Can you see I/O activity happening on the 16 physical files, or is it confined to a single file? What's the page ID within file 15 that the latch waits are on? This sounds like classic tempdb contention but it shouldn't happen on a single file.

     

     

    Jon's question - 8 should be sufficient. The SQL Customer Advisory Team (who deal with very large customers) have found that going above 8 files on a >8 core box for tempdb doesn't get any gain. Now, that's a generalization for which there will be exceptions.

     

    Checkout this whitepaper: http://www.microsoft.com/technet/prodtechnol/sql/2005/workingwithtempdb.mspx

  • Monday, December 08, 2008 11:35 PMGlenn Wellington Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    Paul,

    1) Both snapshot isolation levels are off. 

    2) We try to avoid heavy trigger usage as they usually lead to problems.  The SPIDs that usually are blocked are ones that just create #tables, exec a select into and then create an index.  (Probably not the most intelligent of ways to do it, but it's been working fine for a while.)

    We've reviewed the tempDB white papers and things look good on our end. 

    Also, as far as IOs go on our tempDB LUN, it's fairly tame at only about 300 IOs with maybe 1 or 2 queued at most.  The response time is also within expected values.  I took at look at our SAN and the SP that the LUNs are sitting on and numbers look ok to me as well.  40%-50% CPU utiliziation and then about 50% dirty pages.

    For pageIDs, it's usually one page ID and then a few minutes later it will slam another one.  We've gotten pageId 103 fairly often (low number is a sys table possibly?)  Other than that, it jumps around.

    John,

    tempDB is a dedicated LUN with 40 disks supporting it.  The underlying RAID groups are also dedicated to that LUN so there isn't any contention on the SAN side. 

    The server that is currently running into the tempDB issues is a 4 socket, quad core machine with 128GB of RAM connected to an CX3-80 so we decided to set up tempDB as 16 files, one for each core. 

    All,

    Also to shine more light on this issue, SQL had a stack dump earlier today which was in the middle of this:
    FCheckAndCleanupCachedTempTable

    What we have is a procedure that creates a #table, calls another procedure and selects into that tempDB.  We think that that issue was that the original #table fell out of scope (or something to that effect) because the inner exec took too long so tempDB went through and cleaned up the outer query's temp table.  This in turn caused SQL to be very confused by the fact that the original table was missing. 

    Our thinking is that the temp tables may be small enough to fit multiple tables on one page, so when we go back and create indexes on them it may slam that page/system table to process everything.  Or the fact, that we are creating so many temp tables that we may be pushing tempDB over the edge. 

    We've been keeping an eye on some tempDB metrics.  Here's what we have so far, our average is about 1400 temp tables in the database, avg. creation rate of .5 temp tables with about 2 or so tables set up for destruction.  Also, during calm times, we have about 400 work tables active. 

    However, when tempDB is hosed, we see these numbers usually:
    active temp tables: 2500
    avg. creation rate: 14
    avg. tables for destruction: 80
    active work tables: 1200

    Thanks again for your help guys.  Any advice/insight is greatly appreciated. 

    EDIT:  We also followed up with breaking the inner exec and placing all the logic into the calling query.  We just pushed this to production today so we're going to sit back and see what happens.  Hopefully this will resolve things....
  • Monday, December 08, 2008 11:57 PMJonathan KehayiasMVP, ModeratorUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     

    Glenn,

     

    This is well beyond my knowledge, but so this post is complete, can you provide the output of SELECT @@VERSION for the SQL Instance?  I would expect that to be a logical next request from Paul potentially.

  • Tuesday, December 09, 2008 12:57 AMGlenn Wellington Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    John,

    Thanks for your help.  We're currently running 9.0.3282 with SQL 2005 SP2 with CU9. 

    Microsoft SQL Server 2005 - 9.00.3282.00 (X64)  
    Aug  5 2008 00:48:00  
    Copyright (c) 1988-2005 Microsoft Corporation 
    Enterprise Edition (64-bit) on Windows NT 5.2 (Build 3790: Service Pack 2)
  • Wednesday, December 10, 2008 3:45 PMGlenn Wellington Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    So far our environment is looking stable after the edit from yesterday. 

    Could this be a bug in SQL or were we just doing something "not so smart" and seeing the repercussions of it?

    -Glenn
  • Sunday, October 25, 2009 8:26 PMSerhad Erdem Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    Nowadays  I have the same problem exactly Glenn described above with SQL 2005 + SP2 + CU10. I have only 9 GB oltp hospital database but with 250 concurrent users using Hospital MIS application. When I looked  os_waiting_tasks DMV,  I saw many sessions waiting  resource like 2:15:999  .....  which means a tempdb resource.Also I saw the average disk queue length is 20 which is quite high during problem.
    We use 8 cores cpu system , 16 GB RAM and 4 datafiles for tempdb , probability of a sudden bottleneck of a slow SAN scsi disk subsystem can explain this situation but our hardware is very new and fast MSA 20 with RAID5 15 rpm disks.
    After a sql server service restart  the problem is disappearing and then raising again.After the restart , for 5 minutes users can work fast enough but then the query durations become very high. Finally I started to think that a user session (or a part of application)  is using tempdb very aggressive like creating  a huge temp table without any index tempdb and  SQL Server cannot handle this situation due to poor application design.

    Any comments ?
    Thanks in advance...

  • Wednesday, October 28, 2009 6:59 AMHaris Chris Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    Serhad,

    The processor is Hyperthreaded or 8 Cores?
    The Disk Queue length 20 is on the disk where tempdb resides or where data file resides.
  • Friday, November 06, 2009 9:45 PMSerhad Erdem Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    Thanks for your reply Haris.
    Servers have 8 cores.
    Disk queue was occuring on the disk volume where only tempdb files resides.
    But finaly I found the problem. it is due to wrong execution plan of  SQL 2005 query optimizer.
    I simply added "with recompile" option to several stored procedures , query durations are fine now :)
    No more disk queues and waiting tasks.