Error While creating Availability Group (Error 19435, 41044)

  • Question

  • Dear all,

    I have a big issue with a new availability group installation/configuration. Creation fails with an error and the group is not created...

    It seems that the group goes online and is then killed by the failover cluster, but I don't see why. I have searched the web for this issue and have tried everything proposed:

    1. Granted privileges to NT AUTHORITY\SYSTEM (CONNECT SQL, VIEW SERVER STATE, ALTER ANY AVAILABILITY GROUP)

    2. Made the agent/engine service account a local admin on Windows and on the SQL Server instance

    3. Deleted my cluster and recreated it

    4. Tried creating the group without the listener

    5. Made sure both nodes have exactly the same hardware configuration (HDD / RAM / CPU)

    Here is the log from SQL Server (from SSMS):

    08/13/2019 09:07:01,spid55,Unknown,Always On: WSFC AG integrity check failed for AG 'AG-SQLIPSN-DEV' with error 41044, severity 16, state 1.
    08/13/2019 09:07:01,spid55,Unknown,Error: 19435, Severity: 16, State: 1.
    08/13/2019 09:07:01,spid55,Unknown,The state of the local availability replica in availability group 'AG-SQLIPSN-DEV' has changed from 'RESOLVING_NORMAL' to 'NOT_AVAILABLE'.  The state changed because either the associated availability group has been deleted, or the local availability replica has been removed from another SQL Server instance.  For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.
    08/13/2019 09:06:01,spid55,Unknown,The state of the local availability replica in availability group 'AG-SQLIPSN-DEV' has changed from 'NOT_AVAILABLE' to 'RESOLVING_NORMAL'.  The state changed because the local availability replica is joining the availability group.  For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.
    08/13/2019 09:04:39,spid15s,Unknown,Always On: The availability replica manager is waiting for the instance of SQL Server to allow client connections. This is an informational message only. No user action is required.
    08/13/2019 09:04:39,spid15s,Unknown,Always On Availability Groups: Local Windows Server Failover Clustering node is online. This is an informational message only. No user action is required.

    Here are the logs from the cluster:

    EVENT ID : 1254 Error - Clustered role 'AG-SQLIPSN-DEV' has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

    EVENT ID : 1205 Error - The Cluster service failed to bring clustered role 'AG-SQLIPSN-DEV' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

    EVENT ID : 1069 Error - Cluster resource 'AG-SQLIPSN-DEV' of type 'SQL Server Availability Group' in clustered role 'AG-SQLIPSN-DEV' failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

    I am totally lost as to why it doesn't work. My previous AlwaysOn configuration went through without any issue, and we did the same thing for this one...

    The only thing I can think of is to begin the whole process again (deleting everything: DNS records, AD records, quorum share, cluster) and start over, but I am not sure it would work...

    I hope someone can help.

    Best Regards,

    Jon

    Tuesday, August 13, 2019 8:12 AM

All replies

  • We will need to see the cluster log to diagnose this.

    Create a directory called c:\temp on one of your cluster nodes.

    Then, from an elevated (administrator) command prompt, type this:

    powershell -ExecutionPolicy Bypass Get-ClusterLog -UseLocalTime -Destination c:\temp

    After a while your cluster logs will be written to the c:\temp directory.

    You will need to post the logs somewhere we can access them, perhaps pastebin.

    Tuesday, August 13, 2019 2:02 PM
  • Will do that today.

    Thank you.

    Wednesday, August 14, 2019 7:16 AM
  • EDIT: Here are the logs. They were too big for pastebin, so I have shared them through Dropbox links instead:

    Cluster01_Node01

    Cluster01_Node02

    Wednesday, August 14, 2019 11:25 AM
  • Did you add [NT AUTHORITY\SYSTEM] and its permissions on both replicas?

    In the cluster log we can find:

    [=== Performance ===]
    Failed to connect to the health service[=== System ===]

    "

    The [NT AUTHORITY\SYSTEM] account is used by SQL Server AlwaysOn health detection to connect to the SQL Server computer and to monitor health. When you create an availability group, health detection is initiated when the primary replica in the availability group comes online. If the [NT AUTHORITY\SYSTEM] account does not exist or does not have sufficient permissions, health detection cannot be initiated, and the availability group cannot come online during the creation process.

    Make sure that these permissions exist on each SQL Server computer that could host the primary replica of the availability group. 
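
    One way to confirm this on each replica is to list the server-level permissions that the login actually holds. This is a minimal sketch using the standard catalog views; it assumes the login is named exactly NT AUTHORITY\SYSTEM on the instance and needs to be run on every node separately.

    ```sql
    -- Lists the server-level permissions held by NT AUTHORITY\SYSTEM on this instance.
    -- A row with NULL permission columns means the login exists but has no explicit grants.
    -- No rows at all means the login does not exist on this instance.
    SELECT pr.name            AS principal_name,
           pe.permission_name,
           pe.state_desc
    FROM sys.server_principals AS pr
    LEFT JOIN sys.server_permissions AS pe
           ON pe.grantee_principal_id = pr.principal_id
    WHERE pr.name = N'NT AUTHORITY\SYSTEM'
    ORDER BY pe.permission_name;
    ```

    If the grants are in place, CONNECT SQL, VIEW SERVER STATE and ALTER ANY AVAILABILITY GROUP should each appear with a state of GRANT.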



    • Edited by baleng Wednesday, August 14, 2019 12:09 PM
    Wednesday, August 14, 2019 11:45 AM
  • Yes, I did use the following commands (found on the Web) :

    GRANT ALTER ANY AVAILABILITY GROUP TO [NT AUTHORITY\SYSTEM];

    GRANT CONNECT SQL TO [NT AUTHORITY\SYSTEM];

    GRANT VIEW SERVER STATE TO [NT AUTHORITY\SYSTEM];

    with the following result

    Should NT AUTHORITY\SYSTEM also be sysadmin? Or does it need other privileges?

    • Edited by Ptitsuisse Wednesday, August 14, 2019 1:01 PM
    Wednesday, August 14, 2019 1:00 PM
  • It should not be NT AUTHORITY that you add on both sides. If your SQL Server service is running under a virtual or built-in account (NT Service, Local System, Local Service, Network Service, etc.), you will need to issue the grant statements for the other node. So on Node 1 you would add grants for Node 2, and on Node 2 you would add grants for Node 1.
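
    As a rough sketch of what that could look like, assuming the grant in question is the connect permission on the database mirroring endpoint: when the service runs under a built-in or virtual account, it authenticates to the other node as the computer account. DOMAIN\NODE2$ and Hadr_endpoint below are placeholder names, not values taken from this thread.

    ```sql
    -- Run on Node 1; mirror it on Node 2 using Node 1's computer account.
    -- DOMAIN\NODE2$ is a hypothetical computer account for the other node.
    -- Hadr_endpoint is a placeholder; look up the real endpoint name first.
    SELECT name, state_desc
    FROM sys.database_mirroring_endpoints;

    CREATE LOGIN [DOMAIN\NODE2$] FROM WINDOWS;

    GRANT CONNECT ON ENDPOINT::[Hadr_endpoint] TO [DOMAIN\NODE2$];
    ```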
    Wednesday, August 14, 2019 1:53 PM
  • Judging from what I see in the logs, it looks like your file share quorum witness does not exist or your cluster nodes do not have rights to access it.

    \\VSI-QUOSQL-P01.Domain.ch\SQLIPSNDEV$

    Is this the correct name?
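
    You can also check how SQL Server itself sees the quorum from either node. This is a minimal sketch using the cluster DMVs; it only reports the state that the instance sees and does not test access to the share.

    ```sql
    -- Quorum type and state as seen by this SQL Server instance.
    SELECT cluster_name,
           quorum_type_desc,
           quorum_state_desc
    FROM sys.dm_hadr_cluster;

    -- Cluster members (nodes and witness), their state, and their quorum votes.
    SELECT member_name,
           member_type_desc,
           member_state_desc,
           number_of_quorum_votes
    FROM sys.dm_hadr_cluster_members;
    ```

    A file share witness that is listed as a member but reported as down would be consistent with what the cluster log suggests.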

    Wednesday, August 14, 2019 2:01 PM
  • It looks like you shut down one of the nodes. The second node could not connect to the quorum share, did not know what to do, and the databases went into resolving mode. You then brought the other node back online. You did this more than once, and your failure policies were exhausted. After you fail over, you must reset recent events in your cluster (right-click on the cluster to find this option) to reset the clock.

    [System] 00000e50.000017d0::2019/08/12-14:55:57.305 ERR   Clustered role 'AlwaysOn-SQLIPSN-DEV' has exceeded its failover threshold.  It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state.  No additional attempts will be made to bring the role online or fail it over to another node in the cluster.  Please check the events associated with the failure.  After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

    Furthermore, you do not seem to be able to update the cluster entries in DNS.

    ERR   Cluster network name resource failed registration of one or more associated DNS names(s) because the access to update the secure DNS Zone was denied.

    Wednesday, August 14, 2019 2:07 PM
  • Thank you for all your input.
    I'll try to check the DNS and the privileges later today.
    Thursday, August 15, 2019 6:13 AM
  • Hi Jon,

    Any update? Did you resolve your issue? If so, please mark the useful reply as the answer. It will make it easier for other community members to find it.

    In addition, if you have any other questions, please feel free to ask.
    Thanks for your contribution.

    Best regards,
    Cathy

    Wednesday, August 21, 2019 10:01 AM
  • Hi,

    I am really sorry for the delay; I had an unplanned project to deal with. No, it still does not work.

    What I have tried since then:

    - Reset the events

    - Installed a new quorum server (I found that the old one was producing DNS errors and I was unable to correct them). On the share I have only given access to the Cluster Computer Node.

    - Uninstalled the Failover Cluster and everything related to it (DNS and AD links) so that I could recreate it from scratch (on the same servers though).

    About the DNS access, I am a bit unsure what access it needs. I have just recreated one A host record. Does it need any special privileges?

    Now when I try to create my availability group I still get the same issue, and I am totally lost... Could it be because one of the SQL Server instances was installed as Standard Edition and then upgraded to Enterprise Edition?
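
    To compare the two instances, a query along these lines could be run on each node. It is only a sketch based on the documented SERVERPROPERTY names, and it shows the current edition and whether Always On is enabled, not the upgrade history.

    ```sql
    -- Run on each node and compare the results side by side.
    SELECT SERVERPROPERTY('ServerName')        AS server_name,
           SERVERPROPERTY('Edition')           AS edition,
           SERVERPROPERTY('ProductVersion')    AS product_version,
           SERVERPROPERTY('IsHadrEnabled')     AS is_hadr_enabled,     -- 1 = Always On enabled
           SERVERPROPERTY('HadrManagerStatus') AS hadr_manager_status; -- 1 = started and running
    ```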

    Again, thank you all for your inputs.

    Have a nice day,

    Jon

    Thursday, August 29, 2019 11:01 AM