SQL AOAG -- Not able to failover automatically.

  • Question

  • Hello,

    I have created a SQL Server 2012 AlwaysOn Availability Group with 2 nodes. I can fail over manually.

    In the properties of the AG I can see that the Availability Mode is Synchronous Commit and the Failover Mode is Automatic.

    If I right-click the AG and click Failover, the failover succeeds.

    In my Cluster Manager I have configured the quorum setting to include the 2 servers as well as 1 file share (which is accessible and up and running).

    Now if I stop the SQL Server service on one of the machines, I can see that the AG on the 2nd machine goes into the "resolving" state... but it never comes out.

    In Failover Cluster Manager I can see errors like:

    Clustered role 'SP2013SQLAG' has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state.  No additional attempts will be made to bring the role online or fail it over to another node in the cluster.  Please check the events associated with the failure.  After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

    1. The Cluster service failed to bring clustered service or application 'SP2013SQLAG' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
    2. Cluster resource 'SP2013SQLAG' of type 'SQL Server Availability Group' in clustered role 'SP2013SQLAG' failed.
    3. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

    What is causing the automatic failover to error out when manual failover is working fine?

    val it: unit=()

    Tuesday, May 21, 2013 11:08 PM


All replies

  • Some screenshots

    Tuesday, May 21, 2013 11:22 PM
  • I do notice that when I successfully join a DB to the availability group, I get a warning.

    My understanding was that the quorum configuration should be good if each node has a vote along with a file share (screenshot above).

    Again, the manual failover works fine, but when I shut down a SQL Server instance, the AG on the other node just hangs at "resolving" and never takes control.

    Wednesday, May 22, 2013 4:36 AM
    • Edited by SQL24 Wednesday, May 22, 2013 11:03 AM
    • Marked as answer by MSDN Student Wednesday, May 22, 2013 8:59 PM
    Wednesday, May 22, 2013 6:43 AM
  • Take a look at your cluster log for more information. Based on your process, you might need to set a higher value for the "Maximum failures in the specified period" property of the clustered role. Check out this blog post for more details.

    Edwin Sarmiento SQL Server MVP

    Wednesday, May 22, 2013 1:09 PM
  • I went through the article.

    1. My failure count was set to 1; I increased it to 3.

    2. There were no diagnostics failures for the NT Authority\SYSTEM account, so I believe this issue does not apply to me.

    3. When I run this query on my secondary node:

    SELECT database_name, is_failover_ready
    FROM sys.dm_hadr_database_replica_cluster_states
    WHERE replica_id IN (SELECT replica_id FROM sys.dm_hadr_availability_replica_states)

    The response I get for all databases is that is_failover_ready value is 0.

    So I believe this is the issue. But how do I fix this? Why is my database not synchronized? What should I do to make it synchronized? Just wait, or do something so that the nodes reach the synchronized state?
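
    One way to see why a database reports is_failover_ready = 0 is to look at its synchronization state on each replica. The following is a minimal sketch, not a definitive diagnostic. It assumes it is run on the primary replica (a secondary only reports its own local rows), and the column aliases are illustrative:

    ```sql
    -- Per-database synchronization state for every replica in the AG.
    -- Run on the primary; a secondary returns only its local databases.
    SELECT ar.replica_server_name,
           DB_NAME(drs.database_id)        AS database_name,
           drs.is_local,
           drs.synchronization_state_desc,   -- SYNCHRONIZED is needed for automatic failover
           drs.synchronization_health_desc,
           drs.is_suspended,
           drs.suspend_reason_desc
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas AS ar
      ON ar.replica_id = drs.replica_id
    ORDER BY ar.replica_server_name, database_name;
    ```

    If the secondary's databases show SYNCHRONIZING or NOT SYNCHRONIZING rather than SYNCHRONIZED, that would be consistent with is_failover_ready being 0. A suspended database (is_suspended = 1) will not catch up on its own until data movement is resumed.
    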

    Wednesday, May 22, 2013 2:50 PM
  • Did you maybe change your availability mode to Async?
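
    The configured modes can also be read from the catalog views instead of the AG properties dialog. This is a rough sketch that assumes it is run on an instance hosting a replica of the group; the column aliases are illustrative:

    ```sql
    -- Configured availability mode and failover mode for each replica.
    SELECT ag.name                   AS availability_group,
           ar.replica_server_name,
           ar.availability_mode_desc,  -- SYNCHRONOUS_COMMIT or ASYNCHRONOUS_COMMIT
           ar.failover_mode_desc       -- AUTOMATIC or MANUAL
    FROM sys.availability_replicas AS ar
    JOIN sys.availability_groups AS ag
      ON ag.group_id = ar.group_id
    ORDER BY ag.name, ar.replica_server_name;
    ```

    For automatic failover, both the primary and the failover target should show SYNCHRONOUS_COMMIT and AUTOMATIC.
    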

    Wednesday, May 22, 2013 2:55 PM
  • Many things were wrong on my side.

    1. I increased the failure count as specified in the article and now the failover worked correctly.

    Wednesday, May 22, 2013 8:59 PM