none
High IO Performance Problem RRS feed

  • Question

  • We have a periodic problem on our sql server that I'm totally stumped on. This is on Windows Server 2008 R2, SQL Server 2012 sp3, clustered server, attached to a SAN. All latest updates and patches are installed. Every so often (maybe once a month totally random) for multiple hours the Disk IO will start spiking on and off. This has no correlation to load on the server. The IO will spike for roughly 90 seconds and then go back down for about the same period of time and repeat this over and over until it just stops. Failing over the cluster and restarts have no affect on the problem. I've narrowed it down to one database on one drive. This particular db has a lot of blob objects and is our largest db (2 TB) . When montioring perf mon the Avg. Disk sec/Transfer for this drive spikes up from the normal .010 to around 0.8. It will stay up around 0.8 for 90 seconds and then drop back to normal. We have checked all the SAN hardware even switching to totally new hardware to try to isolate but it doesn't fix it. Eventually the problem just goes away as mysteriously as it appears. I'm totally stumped any random thoughts appreciated, thanks.
    Friday, February 10, 2017 2:24 PM

All replies

  • You need to run extended events over an extended time period to see exactly what is causing this IO problem.

    Here is the code that I would use. This is based on a sample from Jonathan Kehayias. Note, here I am gathering stored procedures, queries, etc that use over 20,000 logical reads and are in database id 6.

    To figure out your database ID do the following:

    Select db_id('ProblemDatabaseName').

    This will create xel files in the c:\temp directory. Ensure you have 10 Gigs or so there, or modify the path.

    CREATE EVENT SESSION [EE_ExpensiveQueries1] ON SERVER 
    ADD EVENT sqlserver.rpc_completed(
        ACTION(sqlserver.plan_handle,sqlserver.sql_text)
        WHERE ([sqlserver].[database_id]=(6) AND [logical_reads]>(20000))) 
    ADD TARGET package0.event_file(SET filename=N'C:\TEMP\EE_ExpensiveQueries1.xel',metadatafile=N'C:\temp\EE_ExpensiveQueries1.xem')
    WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=1 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=ON)
    GO
    ALTER EVENT SESSION EE_ExpensiveQueries1  on server  state=START

    Now after you have the IO spike do the following:

    alter event session ee_expensiveQueries1 on server state=stop

    Then read your events like this:

    SET QUOTED_IDENTIFIER ON
    declare @table table(PK int identity primary key, EventData xml)
    
    SELECT  convert(xml,event_data) FROM sys.fn_xe_file_target_read_file('C:\temp\EE_ExpensiveQueries1*.xel', 'C:\temp\EE_ExpensiveQueries1*.xem', null, null)
    
    select
    t.EventData.value('(/event/@name)[1]','varchar(100)')[name],
    t.EventData.value('(/event/@package)[1]','varchar(100)')[package],
    t.EventData.value('(/event/@timestamp)[1]','datetime')[TimeStamp],
    t.EventData.value('(/event/data[@name="duration"]/value)[1]','bigint')[duration],
    t.EventData.value('(/event/data[@name="physical_reads"]/value)[1]','bigint')[physical_reads],
    t.EventData.value('(/event/data[@name="logical_reads"]/value)[1]','bigint')[logical_reads],
    t.EventData.value('(/event/data[@name="row_count"]/value)[1]','bigint')[row_count],
    t.EventData.value('(/event/data[@name="statement"]/value)[1]','varchar(max)')[statement],
    t.EventData.value('(/event/data[@name="object_name"]/value)[1]','varchar(500)')[object_name]
    
    from @table t
    where t.EventData.value('(/event/@timestamp)[1]','datetime')>getdate()-1
    order by (t.EventData.value('(/event/data[@name="logical_reads"]/value)[1]','bigint')+t.EventData.value('(/event/data[@name="physical_reads"]/value)[1]','bigint') ) desc

    Friday, February 10, 2017 3:07 PM
  • Although this is useful and I will check this out but there is no correlation to sql transactions and the Disk Queue Length getting saturated.  It doesn't matter what load is on the server the Avg. Disk sec/Tranfer will spike for 90 seconds then drop back down to normal for around 90 seconds and then spike again it does this over and over regardless of load on the server. 
    Friday, February 10, 2017 3:16 PM
  • How have you narrowed it down to one database? I am guessing you are using the dmv sys.dm_io_virtual_file_stats.

    Do any of the other dmvs reveal workloads which will help narrow down what query is causing the IO. Here is one of my favorites for this from Glenn Berry.


    SELECT TOP(50) OBJECT_NAME(qt.objectid, dbid) AS [SP Name],
    (qs.total_logical_reads + qs.total_logical_writes) /qs.execution_count AS [Avg IO], qs.execution_count AS [Execution Count],
    SUBSTRING(qt.[text],qs.statement_start_offset/2, 
    	(CASE 
    		WHEN qs.statement_end_offset = -1 
    	 THEN LEN(CONVERT(nvarchar(max), qt.[text])) * 2 
    		ELSE qs.statement_end_offset 
    	 END - qs.statement_start_offset)/2) AS [Query Text]	
    FROM sys.dm_exec_query_stats AS qs WITH (NOLOCK)
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
    WHERE qt.[dbid] = DB_ID()
    ORDER BY [Avg IO] DESC OPTION (RECOMPILE);
    

    You would need to run this as part of a job which persists the results to a table every 5 min or so.

    Is this causing performance problems for you, or just something which is puzzling to you?

    Friday, February 10, 2017 3:28 PM
  • BTW - could it be the automatic updating of statistics?

    Perhaps also log this:

    SELECT
    OBJECT_NAME([sp].[object_id]) AS "Table",
    [sp].[stats_id] AS "Statistic ID",
    [s].[name] AS "Statistic",
    [sp].[last_updated] AS "Last Updated",
    [sp].[rows],
    [sp].[rows_sampled],
    [sp].[unfiltered_rows],
    [sp].[modification_counter] AS "Modifications"
    FROM [sys].[stats] AS [s]
    OUTER APPLY sys.dm_db_stats_properties ([s].[object_id],[s].[stats_id]) AS [sp]
     

    Friday, February 10, 2017 3:31 PM
  • Just throwing out random thoughts:

    Is the SAN shared with other servers / database servers, or is it totally dedicated to this one SQL Server machine?

    Are there any other processes like enterprise backups running that could account for, or contribute to, elevated disk I/O activity?

    Have you had the SAN vendor analyze this disk behavior?

    Are virus scanning processes configured correctly to exclude files / file systems to avoid unwanted scans & elevated disk activity?
    *How to choose antivirus software to run on computers that are running SQL Server
    *Virus scanning recommendations for Enterprise computers that are running currently supported versions of Windows

    Curious, but not directly related:

    What is SAN HBA queue depth set to?
    SQL Server and HBA Queue Depth

    Are latest applicable storport drivers/fixes installed?
    *Computer intermittently performs poorly or stops responding when the Storport driver is used in Windows Server 2008 R2
    *Add a fix to improve the logging capabilities of the Storport.sys driver to troubleshoot poor disk I/O performance in Windows Server 2008 R2

    What are disk format block size? 4k or 64k?

    What is RAID array? 

    Are Data/Log files on separate LUN's?

    *Storage Top 10 Best Practices
    *SAN Storage Best Practices

    Hope that helps,


    Phil Streiff, MCDBA, MCITP, MCSA

    • Edited by philfactor Friday, February 10, 2017 5:02 PM
    Friday, February 10, 2017 4:42 PM
  • How have you narrowed it down to one database? I am guessing you are using the dmv sys.dm_io_virtual_file_stats.

    Do any of the other dmvs reveal workloads which will help narrow down what query is causing the IO. Here is one of my favorites for this from Glenn Berry.


    SELECT TOP(50) OBJECT_NAME(qt.objectid, dbid) AS [SP Name],
    (qs.total_logical_reads + qs.total_logical_writes) /qs.execution_count AS [Avg IO], qs.execution_count AS [Execution Count],
    SUBSTRING(qt.[text],qs.statement_start_offset/2, 
    	(CASE 
    		WHEN qs.statement_end_offset = -1 
    	 THEN LEN(CONVERT(nvarchar(max), qt.[text])) * 2 
    		ELSE qs.statement_end_offset 
    	 END - qs.statement_start_offset)/2) AS [Query Text]	
    FROM sys.dm_exec_query_stats AS qs WITH (NOLOCK)
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
    WHERE qt.[dbid] = DB_ID()
    ORDER BY [Avg IO] DESC OPTION (RECOMPILE);

    You would need to run this as part of a job which persists the results to a table every 5 min or so.

    Is this causing performance problems for you, or just something which is puzzling to you?

    The high IO is not correlated to any queries or sql transactions, we have sort of narrowed it to a database in that we moved one db to all new SAN hardware to see if it mitigated the problem and the problem followed that database to the new infrastructure.
    Friday, February 10, 2017 5:46 PM
  • Just throwing out random thoughts:

    Is the SAN shared with other servers / database servers, or is it totally dedicated to this one SQL Server machine?

    Dedicated

    Are there any other processes like enterprise backups running that could account for, or contribute to, elevated disk I/O activity?

    No

    Have you had the SAN vendor analyze this disk behavior?

    No not sure that is option for us anymore, plus finding a time when it is actually occurring can be problematic as we can go months with everything running fine.

    Are virus scanning processes configured correctly to exclude files / file systems to avoid unwanted scans & elevated disk activity?
    *How to choose antivirus software to run on computers that are running SQL Server
    *Virus scanning recommendations for Enterprise computers that are running currently supported versions of Windows

    Not positive but we did eliminate anti virus as cause at some point
    Curious, but not directly related:

    What is SAN HBA queue depth set to?
    SQL Server and HBA Queue Depth

    Are latest applicable storport drivers/fixes installed?
    *Computer intermittently performs poorly or stops responding when the Storport driver is used in Windows Server 2008 R2
    *Add a fix to improve the logging capabilities of the Storport.sys driver to troubleshoot poor disk I/O performance in Windows Server 2008 R2

    Will look into these 2

    What are disk format block size? 4k or 64k?

    64K I think

    What is RAID array? 

    RAID 6

    Are Data/Log files on separate LUN's?

    Yes

    *Storage Top 10 Best Practices
    *SAN Storage Best Practices

    Hope that helps,


    Phil Streiff, MCDBA, MCITP, MCSA

    Thanks for your input
    Friday, February 10, 2017 6:08 PM

  • What is SAN HBA queue depth set to?
    SQL Server and HBA Queue Depth

    Pretty sure its the default of 32, we have Emulex AJ763B
    Friday, February 10, 2017 6:23 PM
  • Hi,

    Please, check you're not running VMware snapshots or alike in that particular volume.

    Had similar symptoms months ago, and was related to VMware snapshots.

    Friday, February 10, 2017 6:24 PM
  • How large are your blob objects?  

    Could it be that when they happen to age out and the lazywriter gets to them, they cause that much IO?

    You could try doing a checkpoint, and see if that (a) causes the same spike in IO, and (b) eliminates the random occurrence.

    Josh

    Friday, February 10, 2017 6:25 PM
  • Hi,

    Please, check you're not running VMware snapshots or alike in that particular volume.

    Had similar symptoms months ago, and was related to VMware snapshots.

    Thanks for the input, its not a virtual machine though
    Friday, February 10, 2017 7:24 PM
  • How large are your blob objects?  

    Could it be that when they happen to age out and the lazywriter gets to them, they cause that much IO?

    You could try doing a checkpoint, and see if that (a) causes the same spike in IO, and (b) eliminates the random occurrence.

    Josh

    This looks promising, I'm running a trace to the error log to see if the 2 are related I did get an IO spike running a checkpoint.  Thanks I'll let you know
    Friday, February 10, 2017 8:30 PM
  • It seems like its a checkpoint issue but changing the target recovery time for that specific database did nothing at all.  I tried various values and nothing changed the behavior.

    ALTER DATABASE dBName
    SET TARGET_RECOVERY_TIME = 60 SECONDS WITH NO_WAIT

    From errorlog this was right after the IO returned to normal

    2017-02-10 16:03:07.82 spid18s     FlushCache: cleaned up 3006 bufs with 774 writes in 68592 ms (avoided 803 new dirty bufs) for db 76:0

    2017-02-10 16:03:07.82 spid18s                 average throughput:   0.34 MB/sec, I/O saturation: 2325, context switches 4676

    2017-02-10 16:03:07.82 spid18s                 last target outstanding: 8000, avgWriteLatency 13

    Saturday, February 11, 2017 2:08 AM
  • Hi, have you excluded ghost record cleanups from your investigation.
    Saturday, February 11, 2017 3:13 AM
  • Hi, have you excluded ghost record cleanups from your investigation.
    Have not yet, the problem has now disappeared again.  So I'll likely have to wait till it potentially reoccurs down the road but I'll put this on the list to investigate.  We are making some changes to the underneath db to better handle the blob objects and reduce the size of those records.  We also tested the HBA Queue Length setting on the SAN and are changing that from the default of 32 to 64 (its a dedicated SAN) and seeing about a 15% performance bump in our load tests.  I also used Diskspd to test the IO and saw a similar improvement.  Thanks for all the ideas I'll update this thread if I ever figure it out.
    Friday, February 17, 2017 1:14 PM