locked
Mirroring failover RRS feed

  • Question

  • Hi,

    My Mirroring failed over suddenly, to look at the possible reasons for this, I checkd the  error logs and found the following

    2014-06-10 12:16:57.28 spid38s     SQL Server has encountered 1 occurrence(s) of cachestore flush for the 'SQL Plans' cachestore (part of plan cache) due to some database maintenance or reconfigure operations.
    2014-06-10 12:16:57.28 spid36s     SQL Server has encountered 1 occurrence(s) of cachestore flush for the 'Bound Trees' cachestore (part of plan cache) due to some database maintenance or reconfigure operations.
    2014-06-10 12:16:57.30 Logon       Error: 18456, Severity: 14, State: 16.
    2014-06-10 12:16:57.30 Logon       Login failed for user 'XXXX'. [CLIENT: xxx.xxx.xx.xx}
    2014-06-10 12:16:57.30 Logon       Error: 18456, Severity: 14, State: 16.
    2014-06-10 12:16:57.30 Logon       Login failed for user 'Domain\xxxx'. [CLIENT: <local machine>]
    2014-06-10 12:16:57.30 Logon       Error: 18456, Severity: 14, State: 16.
    2014-06-10 12:16:57.30 Logon       Login failed for user 'User1'. [CLIENT: xxx.xx.xx.xxx}
    2014-06-10 12:16:57.30 spid38s     Starting up database 'DB1'
    2014-06-10 12:16:57.30 spid35s     Starting up database 'DB2'.
    2014-06-10 12:16:57.31 spid23s     The mirrored database "DB1" is changing roles from "PRINCIPAL" to "MIRROR" due to Role Syncronization.

    I don't know what to conclude from this.

    Can some one please help me here

    Apart from the error log, how can I get more info on the failover., The Mirroring Monitor does not give me much information either

    Tuesday, June 10, 2014 8:41 AM

Answers

  • 2014-06-10 12:16:34.31 spid11s     Timeout occurred while waiting for latch: class 'ACCESS_METHODS_HOBT_COUNT', id 00000000921FA968, type 4, Task 0x0000000006710328 : 0, waittime 300, flags 0x1018, owning task 0x0000000000000000. Continuing to wait.
    2014-06-10 12:16:34.33 spid11s     Using 'dbghelp.dll' version '4.0.5'
    2014-06-10 12:16:34.45 spid11s     **Dump thread - spid = 11, PSS = 0x0000000080846B60, EC = 0x0000000080846B70
    2014-06-10 12:16:34.46 spid11s     ***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL.1\

    Hi,

    Above is the reason why mirroring failed over. My guess is you wrere running DBCC CHECKDB operation because such error comes when DBCC CHECKDB runs and since it is highly resource intensive and uses CPU and memory heavily your system becomes slow and if resources are very less sometimes it freezes. What happenend in your system is memory or some resource was not sufficient for CHECKDB so it spilled processess to tempdb used it as a temp resource for storage but soon templog was out of space freezing your system. A viable solution would be to run checkdb at time when load is very less or during maintenance window check below link for more details regarding such error

    http://blogs.msdn.com/b/psssql/archive/2008/07/10/sql-server-reports-operating-system-error-1450-or-1452-or-665-retries.aspx

    Can you confirm checkdb was running ? What is version and edition of SQL server is it patched to latest Service pack

    Select @@Version


    Please mark this reply as answer if it solved your issue or vote as helpful if it helped so that other forum members can benefit from it.

    My TechNet Wiki Articles

    • Marked as answer by tracycai Tuesday, June 24, 2014 2:18 AM
    Tuesday, June 10, 2014 11:07 AM
  • Hi,

    If the CPU usage is going more than 50% then the chances of false failover is more.

    (http://msdn.microsoft.com/en-us/library/ms366349%28v=sql.100%29.aspx)

    If you plan to use high-safety mode with automatic failover, the normal load on each failover partner should be less than 50 percent of the CPU. If your work load overloads the CPU, a failover partner might be unable to ping the other server instances in the mirroring session. This causes a unnecessary failover. If you cannot keep the CPU usage under 50 percent, we recommend that you use either high-safety mode without automatic failover or high-performance mode.

    In case of the false failures, you can increase the mirroring time-out period and monitor.

    (http://msdn.microsoft.com/en-us/library/ms190913.aspx)

    Even if the other server is actually proceeding correctly, a time-out is considered a failure. If the time-out value for a session is too short for the regular responsiveness of either partner, false failures can occur. A false failure occurs when one server instance successfully contacts another whose response time is so slow that its pings are not received before the time-out period expires.

    In high-performance mode sessions, the time-out period is always 10 seconds. This is generally enough to avoid false failures. In high-safety mode sessions, the default time-out period is 10 seconds, but you can change the duration. To avoid false failures, we recommend that the mirroring time-out period always be 10 seconds or more.

    To change the time-out value (high-safety mode only)

    To view the current time-out value

    Thanks.


    Tracy Cai
    TechNet Community Support

    • Marked as answer by tracycai Tuesday, June 24, 2014 2:18 AM
    Wednesday, June 11, 2014 7:27 AM
  • It seems to be very true that aspects shared by Shanky and Tracy matters a lot here  in this case. As if all resources are heavily used or above average like that of CPU 50% more, due to DBCC checkdb usage server resources were started running out of options then failover was forced on setup.

    One previous thread also points towards this point only, check below:

    http://social.msdn.microsoft.com/Forums/sqlserver/en-US/10e01de1-b48a-4435-819a-fb5149e27434/latch-waiting-error-in-sql-server-2008

    Kindly confirm the pointers mentioned by Shanky is with you! 


    Santosh Singh

    • Marked as answer by tracycai Tuesday, June 24, 2014 2:18 AM
    Thursday, June 12, 2014 9:00 AM
  • Hi Umesh,

    From the logs it is very clear as to why there was a failover. There was a "Latch - timeout" which occurred which caused the stack dump. When the Stack dump was happening the system went into a hung state or was unresponsive for sometime causing the database to go unavailable which caused the failover.

    2014-06-10 12:16:34.31 spid11s     Timeout occurred while waiting for latch: class 'ACCESS_METHODS_HOBT_COUNT', id 00000000921FA968, type 4, Task 0x0000000006710328 : 0, waittime 300, flags 0x1018, owning task 0x0000000000000000. Continuing to wait.

    2014-06-10 12:16:34.46 spid11s     ***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0016.txt
    2014-06-10 12:16:34.46 spid11s     *

    2014-06-10 12:16:55.33 spid41s     Database mirroring is inactive for database 'DB1'.

    2014-06-10 12:16:57.30 spid38s     Starting up database 'DB1'
    2014-06-10 12:16:57.30 spid35s Starting up database 'DB2'.
    2014-06-10 12:16:57.31 spid23s     The mirrored database "DB1" is changing roles from "PRINCIPAL" to "MIRROR" due to Role Syncronization.

    Now the main question is why there was "Latch-Timeout". To understand the cause we need to look into the stack dump which is generated. If this is a recurring issue i would recommend you to open a case with Microsoft to look at the dump files.

    Now the message before the stack dump and time out is very important.

    2014-06-10 12:10:23.82 spid74      Autogrow of file 'templog' in database 'tempdb' was cancelled by user or timed out after 51234 milliseconds.  Use ALTER DATABASE to set a smaller FILEGROWTH value for this file or to explicitly set a new file size.

    This could very well be the cause of the "Latch-timeout".  The autogrow was timeout after 51 seconds. It seems like you havent set proper autogrow configuration for tempdb. Review the tempdb growth and set proper autogrowth in MB's. It would be good to pre-allocate the space and not wait for autogrowth to increase the log file for tempdb. Autogrowths are only for the unforseen situations where an unexpected load needs more room for data/log files. You should always pre-allocate the storage.

    Have a read on the below documents for more information regarding "Latch-timeouts"

    http://mssqlwiki.com/tag/a-time-out-occurred-while-waiting-for-buffer-latch/

    http://sqlactions.com/2013/02/25/latch-timeout-to-worry-or-not-to/#more-3279

    HTH,


    Regards, Ashwin Menon My Blog - http:\\sqllearnings.com

    • Marked as answer by tracycai Tuesday, June 24, 2014 2:18 AM
    Thursday, June 12, 2014 10:15 AM

All replies

  • Umesh,

    It seems a connection issue or network issue to me . I want you to put search more details in errorlog there must be some more entry before what you have posted can you post that as well.

    Is mirroring sync or Async is witness configured ?


    Please mark this reply as answer if it solved your issue or vote as helpful if it helped so that other forum members can benefit from it.

    My TechNet Wiki Articles

    Tuesday, June 10, 2014 9:13 AM
  • Yes! Mirroring is in Sync and Witness is configures...

    This is the part of the log before the one attached above

    2014-06-10 12:10:02.84 Backup      Log was backed up. Database: DB1, creation date(time): 2009/05/22(10:21:43), first LSN: 1779375:12212:1, last LSN: 1779383:52130:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {'G:\DBBackups\DB1\DB1_backup_201406101210.trn'}). This is an informational message only. No user action is required.
    2014-06-10 12:10:23.82 spid74      Autogrow of file 'templog' in database 'tempdb' was cancelled by user or timed out after 51234 milliseconds.  Use ALTER DATABASE to set a smaller FILEGROWTH value for this file or to explicitly set a new file size.
    2014-06-10 12:16:34.31 spid11s     Timeout occurred while waiting for latch: class 'ACCESS_METHODS_HOBT_COUNT', id 00000000921FA968, type 4, Task 0x0000000006710328 : 0, waittime 300, flags 0x1018, owning task 0x0000000000000000. Continuing to wait.
    2014-06-10 12:16:34.33 spid11s     Using 'dbghelp.dll' version '4.0.5'
    2014-06-10 12:16:34.45 spid11s     **Dump thread - spid = 11, PSS = 0x0000000080846B60, EC = 0x0000000080846B70
    2014-06-10 12:16:34.46 spid11s     ***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0016.txt
    2014-06-10 12:16:34.46 spid11s     * *******************************************************************************
    2014-06-10 12:16:34.46 spid11s     *
    2014-06-10 12:16:34.46 spid11s     * BEGIN STACK DUMP:
    2014-06-10 12:16:34.46 spid11s     *   06/10/14 12:16:34 spid 11
    2014-06-10 12:16:34.46 spid11s     *
    2014-06-10 12:16:34.46 spid11s     * Latch timeout
    2014-06-10 12:16:34.46 spid11s     *
    2014-06-10 12:16:34.46 spid11s     * 
    2014-06-10 12:16:34.46 spid11s     * *******************************************************************************
    2014-06-10 12:16:34.46 spid11s     * -------------------------------------------------------------------------------
    2014-06-10 12:16:34.46 spid11s     * Short Stack Dump
    2014-06-10 12:16:34.53 spid11s     Stack Signature for the dump is 0x0000000059A77AA2
    2014-06-10 12:16:55.27 Logon       Error: 18456, Severity: 14, State: 27.
    2014-06-10 12:16:55.27 Logon       Login failed for user 'xxxxx1'. [CLIENT: xxx.xxx.xx.xx}
    2014-06-10 12:16:55.27 Logon       Error: 18456, Severity: 14, State: 27.
    2014-06-10 12:16:55.27 Logon       Login failed for user 'xxxxx2'. [CLIENT: xxx.xxx.xx.xx}
    2014-06-10 12:16:55.27 Logon       Error: 18456, Severity: 14, State: 27.
    2014-06-10 12:16:55.27 Logon       Login failed for user ''xxxxx3'. [CLIENT: xxx.xxx.xx.xx}
    2014-06-10 12:16:55.27 spid70      Error: 18056, Severity: 20, State: 27.
    2014-06-10 12:16:55.27 spid70      The client was unable to reuse a session with SPID 70, which had been reset for connection pooling. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
    2014-06-10 12:16:55.27 spid65      Error: 18056, Severity: 20, State: 27.
    .
    .
    .
    2014-06-10 12:16:55.32 Logon       Login failed for user 'User1'. [CLIENT: xxx.xxx.xxx.xx]
    2014-06-10 12:16:55.33 spid41s     Database mirroring is inactive for database 'DB1'. This is an informational message only. No user action is required.

    .
    .
    .
    .

    Tuesday, June 10, 2014 10:23 AM
  • 2014-06-10 12:16:34.31 spid11s     Timeout occurred while waiting for latch: class 'ACCESS_METHODS_HOBT_COUNT', id 00000000921FA968, type 4, Task 0x0000000006710328 : 0, waittime 300, flags 0x1018, owning task 0x0000000000000000. Continuing to wait.
    2014-06-10 12:16:34.33 spid11s     Using 'dbghelp.dll' version '4.0.5'
    2014-06-10 12:16:34.45 spid11s     **Dump thread - spid = 11, PSS = 0x0000000080846B60, EC = 0x0000000080846B70
    2014-06-10 12:16:34.46 spid11s     ***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL.1\

    Hi,

    Above is the reason why mirroring failed over. My guess is you wrere running DBCC CHECKDB operation because such error comes when DBCC CHECKDB runs and since it is highly resource intensive and uses CPU and memory heavily your system becomes slow and if resources are very less sometimes it freezes. What happenend in your system is memory or some resource was not sufficient for CHECKDB so it spilled processess to tempdb used it as a temp resource for storage but soon templog was out of space freezing your system. A viable solution would be to run checkdb at time when load is very less or during maintenance window check below link for more details regarding such error

    http://blogs.msdn.com/b/psssql/archive/2008/07/10/sql-server-reports-operating-system-error-1450-or-1452-or-665-retries.aspx

    Can you confirm checkdb was running ? What is version and edition of SQL server is it patched to latest Service pack

    Select @@Version


    Please mark this reply as answer if it solved your issue or vote as helpful if it helped so that other forum members can benefit from it.

    My TechNet Wiki Articles

    • Marked as answer by tracycai Tuesday, June 24, 2014 2:18 AM
    Tuesday, June 10, 2014 11:07 AM
  • Hi,

    If the CPU usage is going more than 50% then the chances of false failover is more.

    (http://msdn.microsoft.com/en-us/library/ms366349%28v=sql.100%29.aspx)

    If you plan to use high-safety mode with automatic failover, the normal load on each failover partner should be less than 50 percent of the CPU. If your work load overloads the CPU, a failover partner might be unable to ping the other server instances in the mirroring session. This causes a unnecessary failover. If you cannot keep the CPU usage under 50 percent, we recommend that you use either high-safety mode without automatic failover or high-performance mode.

    In case of the false failures, you can increase the mirroring time-out period and monitor.

    (http://msdn.microsoft.com/en-us/library/ms190913.aspx)

    Even if the other server is actually proceeding correctly, a time-out is considered a failure. If the time-out value for a session is too short for the regular responsiveness of either partner, false failures can occur. A false failure occurs when one server instance successfully contacts another whose response time is so slow that its pings are not received before the time-out period expires.

    In high-performance mode sessions, the time-out period is always 10 seconds. This is generally enough to avoid false failures. In high-safety mode sessions, the default time-out period is 10 seconds, but you can change the duration. To avoid false failures, we recommend that the mirroring time-out period always be 10 seconds or more.

    To change the time-out value (high-safety mode only)

    To view the current time-out value

    Thanks.


    Tracy Cai
    TechNet Community Support

    • Marked as answer by tracycai Tuesday, June 24, 2014 2:18 AM
    Wednesday, June 11, 2014 7:27 AM
  • It seems to be very true that aspects shared by Shanky and Tracy matters a lot here  in this case. As if all resources are heavily used or above average like that of CPU 50% more, due to DBCC checkdb usage server resources were started running out of options then failover was forced on setup.

    One previous thread also points towards this point only, check below:

    http://social.msdn.microsoft.com/Forums/sqlserver/en-US/10e01de1-b48a-4435-819a-fb5149e27434/latch-waiting-error-in-sql-server-2008

    Kindly confirm the pointers mentioned by Shanky is with you! 


    Santosh Singh

    • Marked as answer by tracycai Tuesday, June 24, 2014 2:18 AM
    Thursday, June 12, 2014 9:00 AM
  • Hi Umesh,

    From the logs it is very clear as to why there was a failover. There was a "Latch - timeout" which occurred which caused the stack dump. When the Stack dump was happening the system went into a hung state or was unresponsive for sometime causing the database to go unavailable which caused the failover.

    2014-06-10 12:16:34.31 spid11s     Timeout occurred while waiting for latch: class 'ACCESS_METHODS_HOBT_COUNT', id 00000000921FA968, type 4, Task 0x0000000006710328 : 0, waittime 300, flags 0x1018, owning task 0x0000000000000000. Continuing to wait.

    2014-06-10 12:16:34.46 spid11s     ***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0016.txt
    2014-06-10 12:16:34.46 spid11s     *

    2014-06-10 12:16:55.33 spid41s     Database mirroring is inactive for database 'DB1'.

    2014-06-10 12:16:57.30 spid38s     Starting up database 'DB1'
    2014-06-10 12:16:57.30 spid35s Starting up database 'DB2'.
    2014-06-10 12:16:57.31 spid23s     The mirrored database "DB1" is changing roles from "PRINCIPAL" to "MIRROR" due to Role Syncronization.

    Now the main question is why there was "Latch-Timeout". To understand the cause we need to look into the stack dump which is generated. If this is a recurring issue i would recommend you to open a case with Microsoft to look at the dump files.

    Now the message before the stack dump and time out is very important.

    2014-06-10 12:10:23.82 spid74      Autogrow of file 'templog' in database 'tempdb' was cancelled by user or timed out after 51234 milliseconds.  Use ALTER DATABASE to set a smaller FILEGROWTH value for this file or to explicitly set a new file size.

    This could very well be the cause of the "Latch-timeout".  The autogrow was timeout after 51 seconds. It seems like you havent set proper autogrow configuration for tempdb. Review the tempdb growth and set proper autogrowth in MB's. It would be good to pre-allocate the space and not wait for autogrowth to increase the log file for tempdb. Autogrowths are only for the unforseen situations where an unexpected load needs more room for data/log files. You should always pre-allocate the storage.

    Have a read on the below documents for more information regarding "Latch-timeouts"

    http://mssqlwiki.com/tag/a-time-out-occurred-while-waiting-for-buffer-latch/

    http://sqlactions.com/2013/02/25/latch-timeout-to-worry-or-not-to/#more-3279

    HTH,


    Regards, Ashwin Menon My Blog - http:\\sqllearnings.com

    • Marked as answer by tracycai Tuesday, June 24, 2014 2:18 AM
    Thursday, June 12, 2014 10:15 AM