Any way to tell if "too many" tempdb files is the cause of the "write" latency?

  • Question

  • Hello, the 8 tempdb files are currently of equal size and live on a raid setup. 

    1. 8 x 64GB files of equal size, no autogrowth (pre-sized at 64GB)
    2. 2 sockets, 16 cores, 32 logical processors (NUMA nodes 1 and 2)
    3. MAXDOP set to 8
    4. The 8 tempdb files were created following the guidance in the 500 various articles on tempdb.

    I have experimented with 16 and 32 tempdb files, which had adverse effects (slower query run times).
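    One quick sanity check (a sketch; run against your own instance) is to confirm the data files really are equal-sized with identical growth settings:

    ```sql
    -- All tempdb data files should be the same size with matching growth
    -- settings, otherwise allocation round-robin becomes uneven.
    SELECT  file_id,
            name,
            size * 8 / 1024 AS size_mb,   -- size is stored in 8KB pages
            growth,
            is_percent_growth
    FROM    tempdb.sys.database_files
    WHERE   type_desc = 'ROWS'
    ORDER BY file_id;
    ```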

    1. The “write latency” on non-tempdb databases is 5-30ms.
    2. The “write latency” for tempdb is often 1000-1200ms, with a read latency of 1-20ms.
    3. Bytes per cluster: 64KB
    4. Partition offset: currently 128MB rather than the recommended 1024KB … which could be a related issue.
    5. I am also reaching out to the server admin about possibly adding a battery-backed write cache to the RAID config.
    6. Normal file transfers are 500-1500MB/s. Activity Monitor in SQL Server is often slower for tempdb queries (30-150ms).

    Is there a way to benchmark/reproduce that writing to multiple tempdb files is the culprit, and possibly reduce to 4 files or 2 files?

    Other ideas?
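    One way I could imagine testing this (a sketch; the temp table name and row counts below are arbitrary) is to generate a repeatable burst of tempdb write pressure with a SELECT ... INTO loop and compare the latency numbers under 2, 4, and 8 files:

    ```sql
    -- Hypothetical repro: hammer tempdb with writes so per-file latency
    -- can be compared before/after changing the number of tempdb files.
    SET NOCOUNT ON;
    DECLARE @i int = 1;
    WHILE @i <= 10
    BEGIN
        IF OBJECT_ID('tempdb..#spill') IS NOT NULL DROP TABLE #spill;

        SELECT TOP (1000000)
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n,
               REPLICATE('x', 100) AS filler   -- ~100 bytes per row
        INTO   #spill
        FROM   sys.all_columns AS a
               CROSS JOIN sys.all_columns AS b;

        SET @i += 1;
    END;
    ```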

    Utilized the common test query from various articles, which shows 1000-1200ms write latency in tempdb only:

    SELECT
        [ReadLatency] =
            CASE WHEN [num_of_reads] = 0
                THEN 0 ELSE ([io_stall_read_ms] / [num_of_reads]) END,
        [WriteLatency] =
            CASE WHEN [num_of_writes] = 0
                THEN 0 ELSE ([io_stall_write_ms] / [num_of_writes]) END,
        [Latency] =
            CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0)
                THEN 0 ELSE ([io_stall] / ([num_of_reads] + [num_of_writes])) END,
        [AvgBPerRead] =
            CASE WHEN [num_of_reads] = 0
                THEN 0 ELSE ([num_of_bytes_read] / [num_of_reads]) END,
        [AvgBPerWrite] =
            CASE WHEN [num_of_writes] = 0
                THEN 0 ELSE ([num_of_bytes_written] / [num_of_writes]) END,
        [AvgBPerTransfer] =
            CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0)
                THEN 0 ELSE
                    (([num_of_bytes_read] + [num_of_bytes_written]) /
                    ([num_of_reads] + [num_of_writes])) END,
        LEFT ([mf].[physical_name], 2) AS [Drive],
        DB_NAME ([vfs].[database_id]) AS [DB],
        [mf].[physical_name]
    FROM sys.dm_io_virtual_file_stats (NULL, NULL) AS [vfs]
    JOIN sys.master_files AS [mf]
        ON [vfs].[database_id] = [mf].[database_id]
        AND [vfs].[file_id] = [mf].[file_id]
    -- WHERE [vfs].[file_id] = 2 -- log files
    -- ORDER BY [Latency] DESC
    -- ORDER BY [ReadLatency] DESC
    ORDER BY [WriteLatency] DESC;

    • Edited by xxjergerxx Friday, June 23, 2017 5:56 PM
    Friday, June 23, 2017 5:54 PM

All replies

  • You can always try to reduce the number of tempdb files, but I don't think that's the problem.  Does tempdb have its own set of disks (i.e. not on the system drive or shared with other databases)?  What does the average disk queue length look like?  Have you looked at what's hitting tempdb, to see if there are any query optimizations that can be made?
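    For that last point, something like this (a sketch) shows which sessions are allocating tempdb pages right now:

    ```sql
    -- Sessions currently allocating tempdb pages, with their most recent SQL.
    SELECT  su.session_id,
            su.internal_objects_alloc_page_count * 8 AS internal_kb,
            su.user_objects_alloc_page_count * 8     AS user_kb,
            t.text AS most_recent_sql
    FROM    sys.dm_db_session_space_usage AS su
    JOIN    sys.dm_exec_connections AS c
            ON c.session_id = su.session_id
    OUTER APPLY sys.dm_exec_sql_text(c.most_recent_sql_handle) AS t
    WHERE   su.session_id > 50        -- skip system sessions
    ORDER BY su.internal_objects_alloc_page_count DESC;
    ```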
    Friday, June 23, 2017 7:01 PM
  • a. Tempdb is on its own set of disks/RAID dedicated to the server, although I am not certain of any details beyond that. In various benchmarks you can often see 1000-3000MB/s, except for 4KB writes (50-200MB/s). I do not believe it has a battery-backed write cache at this time, and I have reached out to see if we could test this.

    b. the partition offset is at 128MB, whereas MS recommends an offset of 1024KB; this could be related.

    c. the tempdb "work" in general always has this issue... at times I've seen the write latency as low as 150ms, but often it is much higher. It happens with simple queries (SELECT INTO), queries with joins, and queries that take full advantage of good clustered indexes... the only consistency is that the write latency is often high. It does not appear to be query-specific.

    d. CXPACKET waits always occur except when MAXDOP = 1. I keep MAXDOP set to 8 (no performance increase by lowering it).

    Thank you for the reply. I may test reducing the number of tempdb files this weekend off hours.

    • Edited by xxjergerxx Friday, June 23, 2017 7:20 PM
    Friday, June 23, 2017 7:06 PM
  • Did the write latency go from good to bad when you added more tempdb files? Or do you get the same lousy numbers with 8 files as well?

    I would expect the latter, and it certainly looks like a problem in the I/O subsystem. The battery, or lack thereof, could matter. Is this a SAN or is it local storage?

    As for the latency query you use, that is data accumulated since the server started, so it may be misleading. It's better to run it twice and then work with the delta. Here is a query from SQL Server MVP Paul Randal to do this:
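    The idea, roughly (this is a reconstruction of the snapshot-and-delta approach, not necessarily Randal's exact script, and the five-minute window is arbitrary):

    ```sql
    -- Snapshot the cumulative counters, wait while the workload runs,
    -- then compute write latency for just that interval.
    SELECT  database_id, file_id, num_of_writes, io_stall_write_ms
    INTO    #snap
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL);

    WAITFOR DELAY '00:05:00';   -- measurement window

    SELECT  DB_NAME(vfs.database_id) AS db,
            vfs.file_id,
            (vfs.io_stall_write_ms - s.io_stall_write_ms)
              / NULLIF(vfs.num_of_writes - s.num_of_writes, 0) AS write_latency_ms
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN    #snap AS s
            ON s.database_id = vfs.database_id AND s.file_id = vfs.file_id
    ORDER BY write_latency_ms DESC;

    DROP TABLE #snap;
    ```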

    Friday, June 23, 2017 7:29 PM
  • a. I was informed the RAID setup is SSD, and therefore a battery-backed write cache most likely would not be added. I believe it is local storage (HP RAID).

    b. I do not recall the write latency when adding additional tempdb files; I just know the total run times of some common queries took about 10-30% longer when I had more files.

    Thanks for the script, I will test it. Off the top of my head, I usually see the write latency at 150ms consistently... today it was up to 1000ms, but that may have been distorted by a resource-intensive system reading from the database. Even at 150ms it is too high.

    Friday, June 23, 2017 8:47 PM
  • 150 ms is absolutely too high for local SSDs. Well, 150 ms is never acceptable, but if it were a SAN, I would know why. :-) Something must be wrong here.

    Friday, June 23, 2017 9:11 PM
  • Another thing to check is to make sure you don't have any antivirus software scanning the tempdb files (or any of the files associated with SQL Server).

    Regarding your CXPACKET waits (which appear to make up a chunk of your total waits), have you considered increasing the cost threshold for parallelism from its default?
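    For example (50 is only an illustrative value; the right number depends on your workload):

    ```sql
    -- Raise the cost threshold so only genuinely expensive plans go parallel.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'cost threshold for parallelism', 50;  -- example value
    RECONFIGURE;
    ```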

    Friday, June 23, 2017 10:49 PM
  • OP,

    If you are writing large volumes of data to - well, anything - I've had a suspicion about SQL Server since at least 2008 that it (and Windows, and perfmon) sometimes report bogus numbers.

    Say you have a SQL operation that decides to write 1GB to tempdb, and it's going to take 1.0 seconds to do it. Even though it starts writing with zero latency and uses 100% of your bandwidth, it might report the average latency as 500ms. It reports queue times instead of latency.

    Even an SSD won't fix that, if that's the case.

    Though I've never had it confirmed.

    This would be some mismatch between what SQL Server thinks is queuing and what Windows thinks is queuing.

    What I have seen in perfmon are these horrible latency spikes that suddenly vanish ... leading to the above suspicion.


    Saturday, June 24, 2017 12:01 AM
  • Thank you all for the feedback.

    a. cost threshold for parallelism is set to 50

    b. antivirus is not interfering with the tempdb

    We are considering adding an SSD drive for BPE (buffer pool extension), which most likely will not resolve the concern.
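    If we do test BPE, the setup itself is small (the path and size below are placeholders, not our actual values):

    ```sql
    -- Placeholder path/size; BPE should live on fast local SSD storage.
    ALTER SERVER CONFIGURATION
    SET BUFFER POOL EXTENSION ON
        (FILENAME = 'E:\SSDCACHE\bpe.bpe', SIZE = 64 GB);
    ```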

    • Edited by xxjergerxx Monday, June 26, 2017 3:42 PM
    Monday, June 26, 2017 3:37 PM
  • Ran the tools below, which show 50-100MB/s per thread read and write (x16 or x32 threads):

    Possibly the write latency, as others said, is a false alarm. Hoping that BPE helps!

    Tuesday, June 27, 2017 2:53 PM
  • Have you received any reported issues for databases that reside on the server?  With the high latency I'd expect an application owner to complain about performance.  If there haven't been any complaints, it would be easier to say it's a false alarm.

    The SSD for BPE might reduce some of the tempdb usage; however, without knowing what's causing the latency, you'd really just be throwing something at it and hoping it works.  The main reason I'd look at an SSD BPE is if I'm seeing memory-related problems and I've already maxed out the server memory.  I'd definitely recommend testing on a non-production server before implementing in a prod environment.

    Have you run the tool on other servers and drives that don't have the high latency statistics?  Running the tool on one drive doesn't really give you the full picture; you need something to measure the results against.  We still use SQLIO as a baselining tool for server setup and whenever drives are added or replaced.  We keep records of all our SQL servers that we can compare against.  It's helped us identify problems a couple of times before SQL Server was even installed.

    Tuesday, June 27, 2017 10:14 PM
  • Good morning,

    The database is more for research/data summaries. 

    a. large bulk loads (100-300 million rows)

    b. reference tables with 1-200 million rows joined to the main table

    c. indexing for optimizing joins or later summaries

    It is less customer/end-user utilized, except when pulling data to another system (rare).

    I'm not sure of any additional methods to test the high latency, or how to build a list of ideas to reduce it.

    Wednesday, June 28, 2017 1:55 PM
  • Josh, you were right: some of it was skew from rebooting, and the 1000ms was an overstatement. After the last server patch window and a couple of restarts, the read latency for both tempdb and the other databases is often 10ms or less (sometimes 20-30ms on less-used databases, or possibly from 3rd-party tools like SAS).

    However, write latency still remains 3-6x higher than read latency, although it now averages 30-45ms for this week instead of 1000ms. Read speed is much faster (4000-6000MB/s in CrystalDiskMark) while write speed is 500-700MB/s.

    The databases have the default cluster (NTFS allocation unit) size of 4KB rather than 64KB; however, the tempdb volume is set up with a 64KB cluster size.

    Anyone have before/after benchmarks of 4kb cluster size vs 64kb?

    Friday, August 4, 2017 1:59 PM
  • OP, along the same lines, can you tell if the write latency is worst only during large ETL?

    On older SANs (or controllers) you might even have write caching on, but with only a small buffer that could be overrun.

    Newer enterprise-level SANs tend to have multi-gigabyte write buffers and are much harder, if still not impossible, to overrun.  Even an SSD might need a big RAM write buffer.


    • Edited by JRStern Monday, August 7, 2017 2:35 AM
    Monday, August 7, 2017 2:35 AM
  • Hello, I appreciate the feedback; I think you may be describing what I am observing.

    a. after the monthly patch reboot, the write latency hovered at 20-40ms

    b. while running our large ETL process today (300-million-row base table, 10 reference tables of various sizes up to 300 million rows)... I noticed it spike to 72ms (average)... an alternative query states "1300ms", which may not be accurate.

    The tempdb I believe is on direct-attached RAID (however it is listed as HP 10Gb iSCSI 560SFP+), while the databases live on the SAN (Nimble). Neither has a battery-backed write cache to my knowledge.
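    To double-check which volumes the files actually sit on (the paths returned will of course differ per server):

    ```sql
    -- Which drive/mount point does each database file live on?
    SELECT  DB_NAME(database_id) AS db,
            type_desc,
            physical_name,
            size * 8 / 1024 AS size_mb
    FROM    sys.master_files
    ORDER BY physical_name;
    ```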

    Collection of queries:

    -- Also remember IO stalls are a good indication of disk IO pressure.
    SELECT  DB_NAME(vfs.database_id) AS database_name , vfs.database_id , vfs.file_id ,
            io_stall_read_ms / NULLIF(num_of_reads, 0) AS avg_read_latency ,
            io_stall_write_ms / NULLIF(num_of_writes, 0) AS avg_write_latency ,
            io_stall / NULLIF(num_of_reads + num_of_writes, 0) AS avg_total_latency ,
            num_of_bytes_read / NULLIF(num_of_reads, 0) AS avg_bytes_per_read ,
            num_of_bytes_written / NULLIF(num_of_writes, 0) AS avg_bytes_per_write ,
            vfs.io_stall , vfs.num_of_reads , vfs.num_of_bytes_read ,vfs.io_stall_read_ms ,
            vfs.num_of_writes ,vfs.num_of_bytes_written , vfs.io_stall_write_ms ,
            size_on_disk_bytes / 1024 / 1024. AS size_on_disk_mbytes , physical_name
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    	JOIN sys.master_files AS mf 
    		ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
    ORDER BY avg_total_latency DESC;
    WITH [Waits] AS
        (SELECT
            [wait_type],
            [wait_time_ms] / 1000.0 AS [WaitS],
            ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS],
            [signal_wait_time_ms] / 1000.0 AS [SignalS],
            [waiting_tasks_count] AS [WaitCount],
           100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
            ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
        FROM sys.dm_os_wait_stats
        WHERE [wait_type] NOT IN (
            N'CHKPT', N'CLR_AUTO_EVENT',
            -- Maybe uncomment these four if you have mirroring issues
            N'EXECSYNC', N'FSAGENT'
            -- Maybe uncomment these six if you have AG issues
            )
        AND [waiting_tasks_count] > 0
        )
    SELECT
        MAX ([W1].[wait_type]) AS [WaitType],
        CAST (MAX ([W1].[WaitS]) AS DECIMAL (16,2)) AS [Wait_S],
        CAST (MAX ([W1].[ResourceS]) AS DECIMAL (16,2)) AS [Resource_S],
        CAST (MAX ([W1].[SignalS]) AS DECIMAL (16,2)) AS [Signal_S],
        MAX ([W1].[WaitCount]) AS [WaitCount],
        CAST (MAX ([W1].[Percentage]) AS DECIMAL (5,2)) AS [Percentage],
        CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgWait_S],
        CAST ((MAX ([W1].[ResourceS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgRes_S],
        CAST ((MAX ([W1].[SignalS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgSig_S],
        CAST ('' + MAX ([W1].[wait_type]) as XML) AS [Help/Info URL]
    FROM [Waits] AS [W1]
    INNER JOIN [Waits] AS [W2]
        ON [W2].[RowNum] <= [W1].[RowNum]
    GROUP BY [W1].[RowNum]
    HAVING SUM ([W2].[Percentage]) - MAX( [W1].[Percentage] ) < 95; -- percentage threshold
    DECLARE @Reset bit = 0;
    IF NOT EXISTS (SELECT NULL FROM tempdb.sys.objects
                   WHERE name LIKE '%#fileStats%')
            SET @Reset = 1;  -- force a reset
    IF @Reset = 1
    BEGIN
            IF EXISTS (SELECT NULL FROM tempdb.sys.objects
                       WHERE name LIKE '%#fileStats%')
                    DROP TABLE #fileStats;
            SELECT  database_id, file_id, num_of_reads, num_of_writes,
                    io_stall_read_ms, io_stall_write_ms, io_stall
            INTO #fileStats
            FROM sys.dm_io_virtual_file_stats(null, null);
    END;
    SELECT  DB_NAME(vfs.database_id) AS database_name,
            --vfs.database_id ,
            vfs.file_id ,
            (vfs.io_stall_read_ms - history.io_stall_read_ms)
             / NULLIF((vfs.num_of_reads - history.num_of_reads), 0) AS avg_read_latency ,
            (vfs.io_stall_write_ms - history.io_stall_write_ms)
             / NULLIF((vfs.num_of_writes - history.num_of_writes), 0) AS avg_write_latency
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
                    JOIN sys.master_files AS mf
                            ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
                    RIGHT OUTER JOIN #fileStats history
                            ON history.database_id = vfs.database_id AND history.file_id = vfs.file_id
    ORDER BY avg_write_latency DESC;
    SELECT
        [ReadLatency] =
            CASE WHEN [num_of_reads] = 0
                THEN 0 ELSE ([io_stall_read_ms] / [num_of_reads]) END,
        [WriteLatency] =
            CASE WHEN [num_of_writes] = 0
                THEN 0 ELSE ([io_stall_write_ms] / [num_of_writes]) END,
        [Latency] =
            CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0)
                THEN 0 ELSE ([io_stall] / ([num_of_reads] + [num_of_writes])) END,
        [AvgBPerRead] =
            CASE WHEN [num_of_reads] = 0
                THEN 0 ELSE ([num_of_bytes_read] / [num_of_reads]) END,
        [AvgBPerWrite] =
            CASE WHEN [num_of_writes] = 0
                THEN 0 ELSE ([num_of_bytes_written] / [num_of_writes]) END,
        [AvgBPerTransfer] =
            CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0)
                THEN 0 ELSE
                    (([num_of_bytes_read] + [num_of_bytes_written]) /
                    ([num_of_reads] + [num_of_writes])) END,
        LEFT ([mf].[physical_name], 2) AS [Drive],
        DB_NAME ([vfs].[database_id]) AS [DB],
        [mf].[physical_name]
    FROM sys.dm_io_virtual_file_stats (NULL, NULL) AS [vfs]
    JOIN sys.master_files AS [mf]
        ON [vfs].[database_id] = [mf].[database_id]
        AND [vfs].[file_id] = [mf].[file_id]
    -- WHERE [vfs].[file_id] = 2 -- log files
    -- ORDER BY [Latency] DESC
    -- ORDER BY [ReadLatency] DESC
    ORDER BY [WriteLatency] DESC;
    SELECT dm_ws.wait_duration_ms,
    DB_NAME(dm_r.database_id) AS DatabaseName
    -- Optional columns
    FROM sys.dm_os_waiting_tasks dm_ws
    INNER JOIN sys.dm_exec_requests dm_r ON dm_ws.session_id = dm_r.session_id
    INNER JOIN sys.dm_exec_sessions dm_es ON dm_es.session_id = dm_r.session_id
    CROSS APPLY sys.dm_exec_sql_text (dm_r.sql_handle) dm_t
    CROSS APPLY sys.dm_exec_query_plan (dm_r.plan_handle) dm_qp
    WHERE dm_es.is_user_process = 1

    Monday, August 7, 2017 8:49 PM