How to get AlwaysOn connection errors list


  • We have implemented AlwaysOn, but want to write specific retry logic when we open a connection in order that after retrying periodically, we can also try to connect to another specific node instead.  This way we can handle transient communication failures due to failover, network reliability, etc.

    The problem is if I Try..Catch on SqlException on the SqlConnection.Open() method, I want to look for specific communications link errors, and not errors resulting from failed logins, etc.  I cannot find a definitive list of communication link errors anywhere.  sys.messages only includes server-side errors, and not winsock errors such as error numbers 10060 and 10061.  Even looking through sys.messages, there's no easy way to filter just those that have to do with failover or network/host unavailable. 

    Is there any such published list out there, as I'd really rather not guess if I have the full list of possible errors or not?



    Friday, August 23, 2013 10:04 PM

All replies

  • If you are using an AG Listener, you can pretty much just retry once on any connection failure.  The default connection timeout of 15sec should cover the AG Listener failover time.



    Friday, August 23, 2013 11:51 PM
  • Hi David,

    Thanks for the reply.  Yes, I know I could rely on the ApplicationIntent=readonly feature.  That's not the issue though.  We have specific applications that we want to point to specific nodes of our AlwaysOn Availability Group.  We have a multi-site configuration, and want to point applications which live on the same subnet to the node that is at that subnet.  Using just the listener, all applications would point to whichever secondar(y/ies) are included in the read-only routing list, and there is no control over which of them is used, even within the same application, which is not what we want.  We want to have control over which node is addressed, in order to minimize traffic on the communications link.  This can only be accomplished by pointing to the node directly, and that is why we want to write our own "failover" logic.  I don't want to blanketly retry on all SQLConnection errors, such as login credentials, etc.  I only want to retry on those specific to network only.

    Thanks for trying though!  Hopefully, this clarifies our use-case and someone can point me in the right direction! 


    Wednesday, August 28, 2013 3:50 PM
  • Cool.  I was just working with another customer on a similar configuration.

    There's an awesome feature in Windows DNS that helps here: Subnet Prioritization

    Assuming that

    1.       Your DNS servers are Windows Servers
    2.       Your application servers and their local readable replica are either on the same subnet, or have IP addresses that agree on the network portion of the address.

    For an Availibility Group with an Availibility Group Listener called AG, create a two DNS A records for the hostname AG_R, one with the IP address of each readable replica. 

    Windows DNS will prioritize addresses list that is returned, listing the local replica’s IP address first.   

    Then in your read-only connection string set MultiSubnetFailover=False and ConnectionTimeout=30.  This will cause the SQL Client to attempt to connect only to the first returned IP address and only attempt the second if the first is unavailable.  By default the failure to connect to the first IP will take 20sec, so the longer ConnectionTimeout (15sec is the default) is necessary to allow the connection to eventually succeed.  Changing DNS TTL is not necessary as the DNS records are not changing.

    If you want to solve this problem in application code, I still don't think you need a list of special error codes.  For each Application Server configure both the name of its preferred readable replica, and the AG Listener.  For read-only connections try to connect to the preferred readable replica, and if that fails for any reason, then try just once to connect to the AG Listener with ReadOnly intent.  This should cover almost any legitimate retry scenario.




    Wednesday, August 28, 2013 4:14 PM
  • Hi David, thanks for the reply again!

    That does sound pretty cool.  Is it a feature of Windows 2008 R2?  That's what our DNS servers are on.

    I will definitely investigate this as a way of resolving our issues and let you know how it works out.



    Wednesday, August 28, 2013 4:24 PM
  • Yes, Windows Server 2008 R2 DNS supports Subnet Prioritization. 



    Wednesday, August 28, 2013 4:44 PM
  • Okay, I'm looking into it. 

    The only other thing I'd like to know is how WFCS deals with multiple A records in the DNS.  Will it be able to properly resolve and failover still, and is there any delay in the primary node connection, since it would have to resolve among multiple IPs and multiple subnets?

    My Active Directory knowledge is pretty limited, so I'd like to understand the internals of this.  Also, is this the behaviour by default, because I'd rather not have to ask the LAN team to make server-level changes to the DNS for round-robin/subnet prioritization, as I don't know how that would affect the rest of the network.


    Wednesday, August 28, 2013 4:49 PM
  • WSFC will not know anything about these A records.  The only thing the cluster registers in DNS is the Client Access Point for your AG Listener.  This would be a completely different hostname.

    Say your cluster spans two subnets and,  Your AG Listener has hostname of and two IP addresses, one on each subnet.  The cluster will register that hostname in DNS with both IP addresses, and the clients will always get both IP addresses when they lookup the AG Listener name.  Then when connecting to the AG Listener the client, configured with MultiSubnetFailvoer connection string parameter will attempt to connect to both IP addresses simultaneously and go with whichever one responds.

    For the read-only connections you are bypassing the AG Listener and connecting directly to the replicas.  You create a new hostname, say, and add two A records for that hostname, containing the IP address of readable replicas on each subnet.  Then you configure the read-only connection string to not use MultiSubnetFailover and so it will query DNS, and get a list of 2 IP addresses back.  The list will have been ordered by the DNS server to put the IP address that is on the same subnet as the application server first.  The application server will then connect to the first IP address on the list.  If that fails, it will try to connect to the second one.  This is why the read-only connection string needs to have MultiSubnetFailover=false.  If it was true then the client would try to connect to both IP addresses, and might end up getting connected to the wrong one.  If the other readable replica is sufficiently far away you might consistently get connected to the local one, but it's not guaranteed.

    DNS Subnet Prioritization is on by default, and you should validate that it's working in your environment by running from application servers in each subnet:


    Several times, and checking that the local replica's IP address is listed first always in both subnets.  The reason you want to do this several times is that the DNS server will rotate or "Round Robin" the order of the IP addresses, but this happens _after_ subnet prioritization.  So you should consistently see the local replia's IP listed first.



    Wednesday, August 28, 2013 5:09 PM
  • Okay, I see.  Thanks for explaining it.  I didn't realize at first you meant creating another hostname on the network for resolution.  I will have to test this out and see, because I would like to guarantee which subnet I get, if possible.  Otherwise, I have already written the application code which tries the preferred replica first, and then defaults to the ApplicationIntent=ReadOnly, MultiSubnetFailover=True connection.  I was just looking for a way to be smarter about attempting it, since I know we get some that are due to login failure, etc.

    Thanks for the help!


    Wednesday, August 28, 2013 5:16 PM
  • Hi David,

    I tried adding the DNS A records for the 2 IPs on multiple subnets, and although it seemed like it was connecting against the right subnet, if an error occurred, (such as pausing the SQL Server instance) there was no retry of the second IP.


    Thursday, August 29, 2013 4:42 PM
  • Okay, it seems it works if the SQL Service is stopped, and not when it is alive but not accepting connections.

    I guess I can always add retry logic on top of it, and then divert to the actual server instance if that fails.


    Thursday, August 29, 2013 5:36 PM