none
Connection not getting redirected to the failover partner RRS feed

  • Question

  • Hi,

    I am using Enterprise Library 4.1 to connect to a SQL Server 2008 R2 database with a mirror.
    My connection string itself is as below (with dummy values for all fields):

    "Data Source=ServerA;Failover Partner=ServerB;Initial Catalog=DBName;User ID=uid;Password=pwd;Connect Timeout=5"
    

     

    Sample .NET code is as below (with modifications to instantiate all variables locally to provide context):

     

    SqlDatabase db = new SqlDatabase(ConnectionString);
    SqlConnection cn = (SqlConnection)db.CreateConnection();
    cnInitTime = String.Format("{0:MMM dd, yyyy, HH:mm:ss.fff}", DateTime.Now); //To store the time when the connection was attempted
    cn.Open();
    
    //A dummy method to test whether connection has indeed been established. Added out of desperation :-)
    if (cn != null && cn.State == ConnectionState.Open)
    { 
    cmd = new SqlCommand("PRINT 'ASDF'", cn);    cmd.ExecuteNonQuery();          
    
    }
    

     

    And here is what I observed happening:
    Initially when ServerA is active, everything works well.
    When Server A fails and the requests get routed to ServerB, I get a network error. Which I can understand happening on the first request. The problem is this persists for approx two minutes or so before I can get access to the ServerB from within my code.
    After two minutes with ServerB acting as primary and Server A as mirror, life is good. And when ServerB goes down, the connection seamlessly switches to ServerA without any issues.
    Digging into this further I noticed that after this line below:

     

    SqlConnection cn = (SqlConnection)db.CreateConnection();
    

     

    the datasource of the connection is always ServerA regardless of which server is active. However, when the database (on ServerA or ServerB) is serving requests normally, the datasource switchs on cn.Open(); But immediately after failing over to ServerB and for the next couple of minutes, this switch does not happen which means that the code is still attempting to connect to ServerA despite that failing.


    Prior to posting this, I tried handling the exception raised and clearing the connection pool and reinitiating the connection hoping that this is something transient that will go away.
     catch (SqlException se)
                {
                    try
                    {
                        
                        if (cn != null)
                        {
                            SqlConnection.ClearPool(cn);
                            cn = new SqlConnection(connectionstring);
                            cn.Open();
                            cmd = new SqlCommand("PRINT 'ASDF'", cn);
                            cmd.ExecuteNonQuery();
                        }
                    }

    }
    This still gives me a network error.
    This is driving me nuts and I would really appreciate it if you could shed some light on what I am doing wrong here.


    Thanks

    Wednesday, August 25, 2010 5:47 AM

Answers

  • Issue #1:

    When you fail over to the secondary, it may take a period of time before the database is available.  For example the database has to recover the transaction log.  If the database is under heavy load, it can take longer.  If the database is under light or zero load, it should be relatively quick, like 10 seconds from my experience.

    Issue #2:

    The SqlConnection class enables connection pooling by default, meaning when a failover occurs you can have a pool full of "bad" connections.   The pool is not fully optimized to handle this situation, this is something I've been pushing the ADO.NET team to fix in the next major release.  So what happens is you start pulling these "bad" connections from the pool and commands start failing, which I agree is annoying.

    There are several solutions here, one is to flush the pool as you mentioned.  Another is to disable pooling, but this is not optimal for applications that really need the performance boost of pooling.

    Note when you flush the pool, it is an asychronous operation, so it will take a short period of time to take effect. 

    A faster alternative to flushing the pool is to change the connection string slightly, which will create a new pool.  For example you can use a static int for pool id, then do something like this:

     

     catch (SqlException se)
                {
                    try
                    {
                        
                        if (cn != null)
                        {
                            Interlocked.Increment(ref nextPoolId); <- This is some static int initially set to 1
                            cn = new SqlConnection(connectionstring + ";ApplicationName=MyApp" + nextPoolId);
                            cn.Open();
                            cmd = new SqlCommand("PRINT 'ASDF'", cn);
                            cmd.ExecuteNonQuery();
                        }
                    }
    }

    What this will do is ensure the next SqlConnection will start with a fresh pool.  The old pool with the dead connections will eventually be cleaned up and destroyed (in 4 to 8 mins).

    As I mentioned, what we are working on is having the pool detect "bad" connections and NEVER hand them out.  We can do this by posting a zero byte read on the underlying socket which will fail fast if the network connection is dead.

    Currently the SQL Native Client driver for OLEDB has this feature built into the pool, but we have not ported this to ADO.NET yet.

    Monday, October 4, 2010 5:29 PM
    Moderator

All replies

  • Issue #1:

    When you fail over to the secondary, it may take a period of time before the database is available.  For example the database has to recover the transaction log.  If the database is under heavy load, it can take longer.  If the database is under light or zero load, it should be relatively quick, like 10 seconds from my experience.

    Issue #2:

    The SqlConnection class enables connection pooling by default, meaning when a failover occurs you can have a pool full of "bad" connections.   The pool is not fully optimized to handle this situation, this is something I've been pushing the ADO.NET team to fix in the next major release.  So what happens is you start pulling these "bad" connections from the pool and commands start failing, which I agree is annoying.

    There are several solutions here, one is to flush the pool as you mentioned.  Another is to disable pooling, but this is not optimal for applications that really need the performance boost of pooling.

    Note when you flush the pool, it is an asychronous operation, so it will take a short period of time to take effect. 

    A faster alternative to flushing the pool is to change the connection string slightly, which will create a new pool.  For example you can use a static int for pool id, then do something like this:

     

     catch (SqlException se)
                {
                    try
                    {
                        
                        if (cn != null)
                        {
                            Interlocked.Increment(ref nextPoolId); <- This is some static int initially set to 1
                            cn = new SqlConnection(connectionstring + ";ApplicationName=MyApp" + nextPoolId);
                            cn.Open();
                            cmd = new SqlCommand("PRINT 'ASDF'", cn);
                            cmd.ExecuteNonQuery();
                        }
                    }
    }

    What this will do is ensure the next SqlConnection will start with a fresh pool.  The old pool with the dead connections will eventually be cleaned up and destroyed (in 4 to 8 mins).

    As I mentioned, what we are working on is having the pool detect "bad" connections and NEVER hand them out.  We can do this by posting a zero byte read on the underlying socket which will fail fast if the network connection is dead.

    Currently the SQL Native Client driver for OLEDB has this feature built into the pool, but we have not ported this to ADO.NET yet.

    Monday, October 4, 2010 5:29 PM
    Moderator
  • I am having the same issue....

    Is there an microsoft update for this?

    Tuesday, May 10, 2011 9:08 PM
  • Note from re-reading the post, I realized a pooling fix may not solve all of this issue.   I did some further digging with the ADO.NET team and I found that the pooling code will automatically flush the pool when it encounters dead connections.   So you may get a few network errors at first but very quickly the pool will be flushed and fresh new connection attempts will occur.  So very quickly (within seconds) the pool should recover and you should start connecting to ServerB.

     

    The problem in this post may be slightly different.  One issue could be network latency to the remote server ServerB or that the remote server ServerB has less CPU capacity and thus logins are taking longer to complete during failover.  This causes the database mirroring algorithm to keep failing to log into the remote server ServerB and flip flopping back to ServerA.   You can verify this by examining logins/sec perfcounters on both ServerA and ServerB during a failover.   I suspect once ServerB CPU stabilizes after failover then logins can proceed and clients then start connecting.  To resolve this you can increase the connection timeout to a larger value, try 60 seconds for example instead of the default of 15.

     

    A pooling fix will eliminate the initial network errors but it may not fully solve the problem here. 

     

    Matt


    Matt
    Friday, July 29, 2011 5:22 PM
    Moderator
  • Hello,

    I am experiencing this exact problem.

    When executing a backend fail-back from the Failover Partner to the Principal, the SqlConnection seamlessly switches over (sub-second) to the new Primary.

    However, when executing a backend failover from the Principal to the Failover Partner, the Connection times out without trying to switch over to the Failover Partner. I have verified this using Wireshark: no Connection attempts are made by the SqlConnection to the Failover Partner -- instead, it keeps trying to access the Primary until a timeout is reached. Looking at the Wireshark messages, the Primary sends a "login failed" message back to the client. This is also the case when doing a fail-back: the Failovar Partner sends a "login failed" message, after which the client immediately reconnects to the Primary partner and things run smooth.

    If I catch the timeout and use a different SqlConnection.ApplicationName (such as incrementing some number) when trying to connect to the database environment, the connection immediately works post a fail-back. Subsequent SqlCommand attempts using the new ApplicationName works great, whereas the previous one does not. After a while, the pool (or something) times out and the SqlConnection using the previous ApplicationName gets flushed, making subsequent calls (using the previous ApplicationName) operational.

    In conclusion, considering it works one way (fail-back) but not the other (failover). This behavior is very weird.

    Kind regards,

    Ulf

    Sweden

    Thursday, August 22, 2013 8:05 AM