none
SQL Server 2012 SP1 Fail Over Cluster Node MUUUUUCH Slower Than Others

    General discussion

  • This is a relatively new installation... To get geared up for Windows Server 2012 R2 GA, I put the RTM of said on 3 _identically configured_ servers (2 x 4 core Xeon E5620 2.4GHz, 64G of RAM, local RAID 10 system disk). I then created Gen 3 Hyper-V virtual machines on each physical server with 8 cores and 16G of RAM each. While the system vhdx files are located on the physical server local disk, each virtual machine is accessing shared NAS disk exposed using 2012 R2 iSCSI/MPIO configured on a 4th physical box. Each of the Hyper-V host machines have 2x1Gb interfaces, and the iSCSI NAS machine has 1x4-port 1Gb (NAS VLAN), and 2x1Gb interfaces (LAN VLAN). I am using teaming on these interfaces to create 2 logical interfaces suitable for multipath I/O. Every physical and virtual network interface has jumbo frames enabled (9014 bytes).

    (Hopefully that's descriptive enough to communicate the basic configuration)

    Failover Clustering is configured on the VMs (db01, db02, and db03). Cluster validation was successfully run without any errors or warnings. SQL Server 2012 SP1 is installed in failover clustering mode, with the following named instances on each node:

    db01 - SQ1\db01 (Active, DB Engine), SQ2\db02 (Failover 2, DB Engine), SQ3\db03 (Failover 1, Analysis)
    db02 - SQ2\db02 (Active, DB Engine), SQ1\db01 (Failover 1, DB Engine), SQ3\db03 (Failover 2, Analysis)
    db03 - SQ3\db03 (Active, Analysis), SQ1\db01 (Failover 2, DB Engine), SQ2\db02 (Failover 1, DB Engine)

    When db01 is the owner node of any of the named SQL instances, performance is quite good, and the system is very stable... this is the case even if db01 is the owner node of both the DB Engine instances (up to a point, of course... both instances are very actively used by SharePoint 2013, Lync Server 2013, Team Foundation 2012, etc so there's a point at which load will make the node bog down and sometimes queue and time out on requests... No big deal, I would expect this...)

    However, if the owner node configuration is as shown above, the SQ2\db02 named instance is orders of magnitude slower. Requests regularly time out. Management Studio connections, for example; Activity Monitor throws timeout errors constantly... SP site collections that have content databases hosted on the SQ2 instance are hugely slower than ones that aren't. In some cases the timeouts are so pervasive and lengthy that dependent applications crash and/or throw error messages...

    I've run Passmark on all of the physical servers and the VMs and the CPU/RAM/Disk performance on all comparable systems are as good as identical. Additionally, as part of my troubleshooting process, I have changed the network packet size to 8972 bytes in SQL server since jumbo frames is enabled everywhere, but that didn't change anything.

    I'm stumped and would LOVE to know what I could be missing here that's causing this issue. Thanks very much for the help in advance.

    Wednesday, October 02, 2013 8:34 AM

All replies

  • hello,

    To start off with, what is the MAXDOP and MAX Server memory run value? Also what is the Windows power plan setting has been set to?

    Wednesday, October 02, 2013 3:59 PM
  • Thanks for the reply...

    SQ1\DB01: MAXDOP = 1; Max Server Memory = 2147483647 MB (default)
    SQ2\DB02: MAXDOP = 1; Max Server Memory = 2147483647 MB (default)

    db01 and db02 (the VMs) both have their power setting set to "balanced", default configuration with the exception of a GPO that turns off the display after 30 minutes. The same is true of vh01 and vh02 (the Hyper-V physical hosts) and ss01 (the iSCSI NAS host).

    Anticipating what I'm confident will be your next suggestion, I have changed all the servers in question (including the 3rd db and vh nodes) to "high performance". I will do a little testing to see where we are now and respond back. Thanks again!

    Wednesday, October 02, 2013 5:46 PM
  • FYI, globally modifying the active power scheme (I spent some time to create a GPO to set the option for all servers, both hosts and VMs) does not change the poor performance of the SQ2\DB02 instance when the owner node is db02... :-/
    Wednesday, October 02, 2013 6:49 PM
  • Also, just for grins and giggles, I changed the owner node to db03, and it exhibits the same poor performance as db02.
    Wednesday, October 02, 2013 6:52 PM
  • Hello,

    Is the issue specific to only few queries or overall SQL performance is slow? If overall performance is slow,  what are the major wait types you see?

    Use the following query to get the wait stats details:

    select * from sys.dm_os_wait_stats

    Thursday, October 03, 2013 3:41 AM
  • Overall very slow... handshake requests from SharePoint, Lync, SSMS, etc. regularly time out... so it's like the server is acting too busy to respond, but there's basically nothing happening on the owner node, the physical host, or the named instance.

    To execute your suggested query, I changed the owner node of SQ2\DB02 to db02 and waited for the instance to resume responding to requests. Additionally, I modified the query to show only rows with non-zero results. As I am far from an expert in SQL Server, I can't intelligently determine what field value denotes a "major" wait type, so I've included these below. Thanks again for the response.

    wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms
    LCK_M_SCH_M, 1815, 6127, 151, 150
    LCK_M_S, 37, 9359, 475, 6
    LCK_M_U, 13, 1, 0, 0
    LCK_M_X, 459, 224188, 4151, 90
    LATCH_SH, 185, 90, 53, 4
    LATCH_UP, 1, 0, 0, 0
    LATCH_EX, 209, 652, 103, 8
    PAGELATCH_SH, 1775, 278, 49, 15
    PAGELATCH_EX, 2032, 76, 14, 27
    PAGEIOLATCH_SH, 146998, 79981, 57, 1307
    PAGEIOLATCH_UP, 235, 163, 7, 6
    PAGEIOLATCH_EX, 2169, 1264, 17, 21
    LAZYWRITER_SLEEP, 23198, 233616, 1004, 1181
    IO_COMPLETION, 1186, 2281, 29, 15
    ASYNC_IO_COMPLETION, 2, 60, 53, 0
    ASYNC_NETWORK_IO, 212, 7, 0, 3
    SLEEP_BPOOL_FLUSH, 20, 189, 72, 0
    CHKPT, 1, 1526, 1526, 0
    SLEEP_DBSTARTUP, 7, 745, 110, 51
    SLEEP_MASTERDBREADY, 1, 1520, 1520, 0
    SLEEP_DCOMSTARTUP, 2, 181, 114, 0
    SLEEP_TASK, 16453, 401017, 1038, 2691
    SLEEP_SYSTEMTASK, 1, 1526, 1526, 0
    RESOURCE_SEMAPHORE, 44, 605570, 25012, 12
    OLEDB, 2, 0, 0, 0
    THREADPOOL, 28, 59, 35, 0
    SOS_SCHEDULER_YIELD, 12241, 427, 7, 406
    ONDEMAND_TASK_QUEUE, 1, 0, 0, 0
    LOGMGR_QUEUE, 3651, 235069, 147, 31
    REQUEST_FOR_DEADLOCK_SEARCH, 47, 230375, 5015, 230375
    CHECKPOINT_QUEUE, 132, 229831, 6088, 70
    KSOURCE_WAKEUP, 1, 0, 0, 0
    SQLTRACE_INCREMENTAL_FLUSH_SLEEP, 59, 232818, 4015, 0
    BROKER_TRANSMITTER, 2, 0, 0, 0
    BROKER_MASTERSTART, 1, 0, 0, 0
    BROKER_EVENTHANDLER, 28, 91, 34, 2
    WRITELOG, 791, 640, 7, 58
    CMEMTHREAD, 7492, 463, 57, 116
    MSQL_XP, 14, 19, 4, 0
    CLR_MANUAL_EVENT, 49, 2243, 171, 1
    CLR_AUTO_EVENT, 65, 458760, 8468, 1
    RESOURCE_SEMAPHORE_MUTEX, 1, 0, 0, 0
    EE_PMOLOCK, 1, 0, 0, 0
    RESOURCE_SEMAPHORE_QUERY_COMPILE, 2426, 1035435, 2348, 288
    BROKER_TASK_STOP, 8, 80030, 10012, 0
    SNI_TASK_COMPLETION, 1, 0, 0, 0
    DAC_INIT, 1, 0, 0, 0
    XE_TIMER_EVENT, 54, 230266, 5000, 229999
    XE_DISPATCHER_WAIT, 40, 230278, 19999, 0
    BROKER_TO_FLUSH, 114, 117094, 1038, 5
    PREEMPTIVE_OS_GENERICOPS, 22, 152, 103, 0
    PREEMPTIVE_OS_AUTHENTICATIONOPS, 1010, 303, 3, 0
    PREEMPTIVE_OS_DECRYPTMESSAGE, 199, 7, 0, 0
    PREEMPTIVE_OS_DELETESECURITYCONTEXT, 98, 30, 4, 0
    PREEMPTIVE_OS_QUERYCONTEXTATTRIBUTES, 201, 5, 0, 0
    PREEMPTIVE_OS_AUTHORIZATIONOPS, 255, 11, 0, 0
    PREEMPTIVE_OS_LOOKUPACCOUNTSID, 175, 48, 1, 0
    PREEMPTIVE_OS_REVERTTOSELF, 201, 6, 0, 0
    PREEMPTIVE_OS_CLUSTEROPS, 1, 43, 43, 0
    PREEMPTIVE_OS_COMOPS, 1, 73, 73, 0
    PREEMPTIVE_OS_CRYPTOPS, 3, 69, 68, 0
    PREEMPTIVE_OS_CRYPTACQUIRECONTEXT, 33, 5, 0, 0
    PREEMPTIVE_OS_CRYPTIMPORTKEY, 51, 6, 3, 0
    PREEMPTIVE_OS_DOMAINSERVICESOPS, 1, 138, 138, 0
    PREEMPTIVE_OS_FILEOPS, 223, 846, 10, 0
    PREEMPTIVE_OS_CREATEFILE, 136, 7, 0, 0
    PREEMPTIVE_OS_DELETEFILE, 1, 0, 0, 0
    PREEMPTIVE_FILESIZEGET, 102, 5, 0, 0
    PREEMPTIVE_OS_GETDISKFREESPACE, 2, 0, 0, 0
    PREEMPTIVE_OS_GETFILEATTRIBUTES, 128, 9, 0, 0
    PREEMPTIVE_OS_WRITEFILE, 57, 86, 2, 0
    PREEMPTIVE_OS_WRITEFILEGATHER, 23, 150, 28, 0
    PREEMPTIVE_OS_LIBRARYOPS, 3, 218, 213, 0
    PREEMPTIVE_OS_GETPROCADDRESS, 14, 0, 0, 0
    PREEMPTIVE_OS_LOADLIBRARY, 5, 29, 14, 0
    PREEMPTIVE_OS_PIPEOPS, 1, 104, 104, 0
    PREEMPTIVE_OS_REPORTEVENT, 103, 25, 1, 0
    PREEMPTIVE_OS_QUERYREGISTRY, 3, 0, 0, 0
    PREEMPTIVE_OS_SQMLAUNCH, 1, 0, 0, 0
    PREEMPTIVE_SB_STOPENDPOINT, 1, 0, 0, 0
    PREEMPTIVE_XE_CALLBACKEXECUTE, 484916, 1529, 1, 0
    PREEMPTIVE_XE_DISPATCHER, 1, 0, 0, 0
    PREEMPTIVE_XE_GETTARGETSTATE, 12, 55, 5, 0
    PREEMPTIVE_XE_SESSIONCOMMIT, 9, 1, 1, 0
    PREEMPTIVE_XE_TARGETINIT, 4, 2, 2, 0
    PREEMPTIVE_XE_TIMERRUN, 1, 0, 0, 0
    PREEMPTIVE_LOCKMONITOR, 1, 0, 0, 0
    WRITE_COMPLETION, 98, 148, 12, 0
    FT_IFTSHC_MUTEX, 4, 128, 128, 0
    FT_IFTS_RWLOCK, 1, 0, 0, 0
    FT_IFTS_SCHEDULER_IDLE_WAIT, 4, 180008, 60004, 0
    PERFORMANCE_COUNTERS_RWLOCK, 30, 7, 1, 0
    PWAIT_ALL_COMPONENTS_INITIALIZED, 3, 3354, 1123, 0
    PREEMPTIVE_SP_SERVER_DIAGNOSTICS, 2, 0, 0, 0
    SP_SERVER_DIAGNOSTICS_SLEEP, 235, 233176, 1015, 233176
    LOGPOOL_CONSUMERSET, 1, 0, 0, 0
    HADR_FILESTREAM_IOMGR_IOCOMPLETION, 460, 232698, 519, 29
    DIRTY_PAGE_SYNC, 1, 0, 0, 0
    DIRTY_PAGE_POLL, 2150, 235040, 116, 8
    FFT_RECOVERY, 23, 951, 61, 4
    • Edited by ramz_dc Thursday, October 03, 2013 5:39 PM
    Thursday, October 03, 2013 5:37 PM
  • Update:

    In doing some research, I thought it might be a good idea to check lock performance counters on the databases. When SQ2\DB02 is running on the db02 node, the 'Lock Requests/sec' counter bounces all over the place between 10,000 and 30,000 on average... I even saw peaks well over _50,000_... I'm no expert, but that seems bad... I also noted that the CPU utilization on the db02 node never goes below 20%. By comparison, when I move SQ2\DB02 to the db01 node (which, by the way, is already the owner node of the SQ1\DB01 instance), the Lock Requests/sec counter never gets higher than 2,000 or so... and the CPU utilization (% processor time) rarely gets over 5%...

    I'm starting to suspect a hardware issue or a subtle configuration difference between the host nodes... Will continue to investigate...

    Thursday, October 03, 2013 8:43 PM
  • MIND. BLOWN.

    So, out of curiosity, I moved SQ1\DB01 to the db02 node and watched the counters...

    ...

    and...

    ...

    Both instances hosted on the db02 node perform perfectly, Lock Requests/sec go to what look like normal numbers on _BOTH_ instances and the % Processor Time on db02 goes to < 5% on average.

    Then I moved SQ1\DB01 back to db01... now BOTH SQL instances are showing high Lock Requests and % Processor Time...

    Is this a clustering issue...!?

    Thursday, October 03, 2013 8:53 PM
  • Confirmed this behavior now. It doesn't seem to matter which node the db instances are on, as long as they are on the SAME node. Where analysis services is doesn't seem to matter based on what I can tell. So... what would cause SQL instances to behave badly when they are not collocated on the same failover cluster node? Are they getting lonely or something...?
    Thursday, October 03, 2013 9:08 PM
  • 'Lock Requests/sec' is and CPU utilization are both throughput metrics. 

    Check out this query for locks and running workload here:

    http://blogs.msdn.com/b/dbrowne/archive/2013/01/18/my-favorite-query-for-investigating-sql-server-performance.aspx

    these

    PAGEIOLATCH_SH, 146998, 79981, 57, 1307

    LCK_M_X, 459, 224188, 4151, 90

    Are worrying, but I would like to see the output of the wait sample in the above query.

    David


    David http://blogs.msdn.com/b/dbrowne/


    Thursday, October 03, 2013 9:14 PM
  • Appreciate the guidance... Here are the results of the query you reference:

    Query results with SQL instances on their respective (separated) nodes

    SQ1:

    SP_SERVER_DIAGNOSTICS_SLEEP, 1006.2
    CPU_USED, 25
    PREEMPTIVE_XE_GETTARGETSTATE, 15
    WRITELOG, 12.6
    PREEMPTIVE_OS_WRITEFILE, 0.4

    SQ2:

    SP_SERVER_DIAGNOSTICS_SLEEP, 1009.4
    CPU_USED, 781.25
    RESOURCE_SEMAPHORE_QUERY_COMPILE, 82.8
    LCK_M_X, 10.4
    SOS_SCHEDULER_YIELD, 9.4
    PREEMPTIVE_XE_CALLBACKEXECUTE, 7.8
    WRITELOG, 4.8
    LCK_M_SCH_M, 0.6
    PAGEIOLATCH_UP, 0.4
    PREEMPTIVE_OS_WRITEFILE, 0.4
    CMEMTHREAD, 0.2

    Query results with SQL instances on a single node

    SQ1:

    SP_SERVER_DIAGNOSTICS_SLEEP, 976.4
    CPU_USED, 50.05
    PREEMPTIVE_XE_GETTARGETSTATE, 16.4
    WRITELOG, 7.6
    ASYNC_NETWORK_IO, 3.4
    PAGEIOLATCH_EX, 2
    PREEMPTIVE_OS_WRITEFILEGATHER, 1.2
    PREEMPTIVE_OS_WRITEFILE, 0.4
    PREEMPTIVE_OS_FILEOPS, 0.4
    PREEMPTIVE_OS_FLUSHFILEBUFFERS, 0.2
    PREEMPTIVE_OS_GETDISKFREESPACE, 0.2
    SOS_SCHEDULER_YIELD, 0.2

    SQ2:

    SP_SERVER_DIAGNOSTICS_SLEEP, 1012.6
    CPU_USED, 43.8
    WRITELOG, 4
    PAGEIOLATCH_UP, 1.2
    PREEMPTIVE_OS_FILEOPS, 1
    SOS_SCHEDULER_YIELD, 0.8
    PAGELATCH_EX, 0.8
    PREEMPTIVE_OS_WRITEFILEGATHER, 0.8
    PREEMPTIVE_OS_WRITEFILE, 0.4
    IO_COMPLETION, 0.2

    Thursday, October 03, 2013 10:57 PM
  • SQ2 is chewing up CPU 781.25 ms/sec of CPU time, perhaps compiling plans.  It this only just after failover, or does it continue like that?

     Does the Batch Requests/sec show high throughput http://technet.microsoft.com/en-us/library/ms190911.aspx ?   Does the "Top Queries by Total CPU" report show high query CPU use?

    eg http://www.johnsansom.com/how-to-identify-the-most-costly-sql-server-queries-using-dmvs/

    David


    David http://blogs.msdn.com/b/dbrowne/


    Thursday, October 03, 2013 11:46 PM
  • As long as the SQL instances are on separate nodes, it continues like that.

    I looked at the Batch Request/sec on both instances while on separate nodes, and:

    SQ1 (on db01) bounces from 0 - 10
    SQ2 (on db02) bounces from 0 - 40 (but not all that often, to be clear)

    I frankly don't know what constitutes 'high' query CPU use, but:

    SQ1 (on db01) returns 10-15 results with a "Total CPU" ranging from 79-72376
    SQ2 (on db02) returns 2-4 results with a "Total CPU" ranging from 86-142118

    Friday, October 04, 2013 12:40 AM
  • Reran that query thing again, and one of the results stuck out at me:

    SQ2 - Total CPU = 1667342 with only 6 executions...

    Friday, October 04, 2013 12:55 AM
  • sys.dm_exec_query_stats.total_cpuis in microseconds.  So 1667342 is only 1.6sec.

    http://msdn.microsoft.com/en-us/library/ms189741.aspx

    David


    David http://blogs.msdn.com/b/dbrowne/

    Friday, October 04, 2013 2:52 AM
  • Gotcha... and thinking through it, I think I understand why the statistics don't seem as bad when I collocate the instances: each server process is sharing processor time, each instance is effectively only getting 500ms of available CPU time per second of operation. If I really was onto something, feel free to correct me... lol...
    Friday, October 04, 2013 3:07 AM
  • I'm not sure why this was flagged as answered... I'm still having inexplicably poor performance with the instance...
    Friday, October 04, 2013 3:49 PM
  • " think I understand why the statistics don't seem as bad when I collocate the instances: each server process is sharing processor time, each instance is effectively only getting 500ms of available CPU time per second of operation"

    Right.  When you co-locate the instances they may be resource starved, and unable to do much useful work.  This goes for both CPU and Memory.

    David


    David http://blogs.msdn.com/b/dbrowne/

    Friday, October 04, 2013 4:16 PM
  • Thanks, David... did you have any further thoughts on why this is happening?
    Friday, October 04, 2013 8:34 PM
  • Do you still have "Max server memory" set to the default?  This will cause serious memory contention when both servers are on the same node.  This needs to be configured to a reasonable value to allow both to run and the node to not massively page swap.

    I would suggest looking at this:

    http://sqlblog.com/blogs/aaron_bertrand/archive/2009/09/18/managing-active-active-cluster-failovers-with-different-hardware.aspx


    Friday, October 04, 2013 8:45 PM
  • the default value is still set, yes. That said, while either node has 12GB available, their memory demand has yet to exceed 2GB, so I don't believe we have an issue of memory contention...
    Wednesday, October 09, 2013 3:24 PM
  • Different hardware...? but I have identical hardware on every node...
    Thursday, October 10, 2013 2:51 PM
  • any more ideas on this...?
    Friday, October 11, 2013 5:11 PM
  • If anyone stumbles across this post someday and wonders if and how it was resolved, the solution appears to be to NOT use the 'supported' dynamic memory feature of Hyper-V... Once I turned it off, everything started working like a champ. Sigh.
    Saturday, October 19, 2013 6:56 AM
  • Thanks for this information.

    I was having a helluva time with slow failover and cluster disks not coming online....turned off the 'supported' dynamic memory feature on both hosts and back to normal speed.

    Much appreciated, thanks for posting.


    -chief-

    Thursday, October 24, 2013 4:29 PM