none
Should I retry on Sql Azure Timeout?

    Question

  • I've been seeing intermittent SqlException - "Timeout expired" in our application quite regularly.  We have implemented transient retry logic as per

    http://social.technet.microsoft.com/wiki/contents/articles/sql-azure-connection-management-in-sql-azure.aspx

    and

    http://msdn.microsoft.com/en-us/library/4cff491e-9359-4454-bd7c-fb72c4c452ca

    but neither of those documents mentioned the -2 timeout error and as a result we don't retry.  But the regularity with which we see these exceptions seems to tell me that I'm doing something wrong and perhaps I should retry.

    Additionally, the first document above infers that the second is the definitive resource, yet the second doc does not mention error code 40143 (which we also had not implemented as a retry-able error but appears that we should).

    I'd really like to get the definitive list (so that my app would stop sending me SMS alerts in the middle of the night!)

    For the record (and to pass on some code) , my app is testing as follows:

     

    	public static class SqlExceptionExtensions
    	{
    		public static bool IsRetryable( this SqlException that )
    		{
    			//for error descriptions go to:
    			//http://social.technet.microsoft.com/wiki/contents/articles/sql-azure-connection-management-in-sql-azure.aspx
    			//http://msdn.microsoft.com/en-us/library/4cff491e-9359-4454-bd7c-fb72c4c452ca
    			var sqlErrorCodesToRetry = new[]
    			{
    				-2 /*Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.*/
    				, 20 /*The instance of SQL Server you attempted to connect to does not support encryption. (PMcE: amazingly, this is transient)*/
    				, 64 /*A connection was successfully established with the server, but then an error occurred during the login process.*/
    				, 233 /*The client was unable to establish a connection because of an error during connection initialization process before login*/
    				, 10053 /*A transport-level error has occurred when receiving results from the server.*/
    				, 10054 /*A transport-level error has occurred when sending the request to the server.*/
    				, 10060 /*A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible.*/
    				, 40143 /*The service has encountered an error processing your request. Please try again.*/
    				, 40197 /*The service has encountered an error processing your request. Please try again.*/
    				, 40501 /*The service is currently busy. Retry the request after 10 seconds.*/
    				, 40613 /*Database '%.*ls' on server '%.*ls' is not currently available. Please retry the connection later.*/
    			};
    
    			return that.Errors.Cast<SqlError>().Any( sqlError => sqlErrorCodesToRetry.Contains( sqlError.Number ) );
    		}
    	}
    


     

     

    Wednesday, July 27, 2011 9:50 AM

All replies

  • Have you had a look at the Transient Fault Handling Framework (http://blogs.msdn.com/b/appfabriccat/archive/2010/10/28/best-practices-for-handling-transient-conditions-in-sql-azure-client-applications.aspx)?  It does a nice job of providing a reusable, extensible library for such transient faults.

    The list of SQL error codes in that framework appears to be the same as what you have though.

    Wednesday, July 27, 2011 1:08 PM
  • Thanks Michael,

    A lot of my inspiration actually came from that article, however it does not list -2 (Timeout) as a valid sql error code for retrying (in my code above, I actually added the Timeout check  after discovering that it was not being tested)

    Indeed, the SqlAzureTransientErrorDetectionStrategy class in that download does NOT retry on timeout, so hence my question here....

     

    Pete

    Wednesday, July 27, 2011 2:37 PM
  • Hi Pete,

    I built a system that highly depends on SQL Azure availability; I spent a lot of time fixing the application for many different types of failures and found that certain types of errors could be happening at various levels in the communication stack. I built the following TryOpen methods to centralize the communication attempts to SQL Azure. Here are the 2 methods; I only call the first one from the parent code. Note that I check for the Exception itself, the InnerException if there is one, and Socket Exceptions as well.  Depending on the exception, the error may be in the Number or the ErrorCode property.

    public static bool IsTransientError(Exception ex)
    {
        if (ex is SqlException)
          return IsTransientError((SqlException)ex);
        if (ex.InnerException is SqlException)
          return IsTransientError((SqlException)ex.InnerException);
        else if (ex is InvalidOperationException && ex.InnerException != null && ex.InnerException.Message.ToLower().StartsWith("timeout expired"))
          return true;
        else if (ex is System.Net.Sockets.SocketException)
        {
          if (((System.Net.Sockets.SocketException)ex).ErrorCode == 10054)
            return true;
        }
        else if (ex is TimeoutException)
          return true;
        return false;
    }
    
    public static bool IsTransientError(SqlException ex)
    {
       if (ex.Number == 40197 ||
          ex.Number == 40501 ||
          ex.Number == 10053 ||
          ex.Number == 10054 ||
          ex.Number == 10060 ||
          ex.Number == 40613 ||
          ex.Number == 40143 ||
          ex.Number == 233 ||
          ex.Number == 64 ||
          ex.Number == 20)
          return true;
    
        return false;
    }
    

     


    Herve Roggero, MVP SQL Azure Co-Author: Pro SQL Azure http://www.herveroggero.com
    • Proposed as answer by Arunraj.C Saturday, July 30, 2011 8:52 PM
    Friday, July 29, 2011 5:09 PM
  • Herve,

    Thanks for your response - I appreciate you sharing your code.

    I had not considered the SocketException and lowlevel TimeoutException - I was only examining SqlExceptions.  I'll need to add similar checks on my next deployment.  Although, I would be a little scared about the way you examine the exception message text - that seems brittle to me.

    I notice though that you don't check for SqlError of -2 (Timeout) - would that be because in that case the SqlException would contain an inner TimeoutException as well as the SqlError -2?  

    Pete

    Saturday, July 30, 2011 9:36 AM
  • Hi Peter - you are welcome.

    Yes - checking for a text message is always dangerous and something I avoid as the primary method; I consider this more of a heuristic approach and makes the assumption that a timeout is always something to retry.  SqlError -2 is, as you mention, a timeout. I am not sure exactly which condition in the code I posted catches it, but I thought it was accounted for somehow. I remember seeing it come across as well.

    Also, I do not believe this code is perfect nor complete; but so far it works... :)  If you find a way to improve it, please do share on this post for my benefit, and others.  I would be curious to know if the error you are experiencing is caught by the routines I posted.

    Thank you - I look forward to your response. 

     


    Herve Roggero, MVP SQL Azure Co-Author: Pro SQL Azure http://www.herveroggero.com
    • Marked as answer by Peter McEvoy Sunday, July 31, 2011 9:30 AM
    • Unmarked as answer by Peter McEvoy Friday, September 30, 2011 10:18 AM
    Saturday, July 30, 2011 8:14 PM
  • Thanks again, It's such a pity that we are both unsure if our code catches all cases, despite the fact that there is a guidance article (and indeed a framework) that should address the issues we are talking about and that we are seeing.  It all seems a little anecdotal for me...

     

    Best of luck

    Pete

    Sunday, July 31, 2011 9:33 AM
  • Pete,

    Let me clarify one important thing.

    The Transient Fault Handling Framework is not configured to handle the timeout errors by default. We believe that this type of exceptions is NOT something we should encourage the developers to compensate by a retry. Timeouts may have a very distinct nature and can be the result of poor schema design, lack of indexes, suboptimal query plans, blockings, timeout settings misconfiguration, and more. It is unwise to always retry when you get a timeout. Therefore, we have decided not to include the general timeout errors into the framework.
     
    Instead, I would recommend looking into the queries being executed and try to determine the reasons behind timeouts. If you are 100% confident that your queries are generating a good execution plan, you have all the right indexes and everything is configured correctly, and you are still seeing timeouts, you have all the source code in your possession. The implementation of the SqlAzureTransientErrorDetectionStrategy class can be easily modified to catch the timeout exceptions.

    To re-iterate, this is not a general recommendation. We leave the timeout handling at your own discretion.

    Valery

    Tuesday, August 09, 2011 3:28 PM
  • Thanks for the feedback.  I appreciate that there are cases where poorly designed queries would cause Timeouts and that your framework should not make retry decisions as a rule.  

    However, my issue is that I am seeing an unusual amount of Timeouts on simple, optimized queries that lead me to question the guidance and explanations outlined in the "Sql Azure Connection Management" article I pasted in the OP.

    That article definitely makes it explicit to the developer that when writing code that connects to sql azure that we need to handle connections in a specific way that is different to the way that we would (normally) write on-premise code.  It does not mention anything inherent in the environment that can cause Timeouts.

    Yet since moving to SQL Azure, our application does see a lot more timeouts.  

    For example, simple username/password lookups (on a table with indexes and only a few thousand rows) seemed to _occasionally_ exceed the default CommandTimeout (which is 30seconds)!  I stress this is only ocassional.

    Since writing my OP, I have reduced my ConnectionTimeout to 5 seconds and my CommandTimeout to 5 sec (and in the password lookup case, 2 seconds - although I makes me sick to think that a request thread could be in the DB for more than 2 seconds).  

    The reason I have reduced from the defaults on both, is that I'd rather fail-fast-and-retry while the user is still waiting for the page to render than hang around for 30 seconds because some load-balancer is trying to swap instances

    Pete

     


    Tuesday, August 09, 2011 4:40 PM
  • Pete,

    Thanks for the additional clarifications.

    Note though that timeouts could also be the result of SQL connection pool exhaustion.

    I hear what you are saying about ocassional nature of the timeouts. How about this? We can get our best support engineers to assist you with the root cause analysis. Please get in touch with us via our blog at AppFabricCAT.com so that we can start the ball rolling.

    Valery

    Tuesday, August 09, 2011 4:56 PM
  • Pete, Valery,

    I'd be interested in what the outcome of this investigation was as we are experiencing what appears to be the same behaviour. an Update to a SQL Azure table with just 51 rows in, times out periodically.

    We have an active ticket with Microsoft with regard to the issue, that has shown the actual execution time of the SP is on average 140ms with a top time of 158ms. There is only one other SP that accesses the same table and that also runs equaly fast. Both SPs are run on a scheduled basis, (approx every 5 seconds) with the default timeout for the command, (linq). Of these scheduled calls approximately 2-3 will fail with a tiemout exception in any 24 hour period.

    Blocking does not seem likely, rather it appears that the timeout exception could be masking something else, hence my interest.

    thanks

    Niall 



    • Edited by Niall Hines Friday, September 23, 2011 9:59 AM
    Friday, September 23, 2011 9:55 AM
  • I have an open ticket as well and have been working with a chap called Evan.  Investigations are still ongoing, although things have slowed down the last two weeks due to me being committed to other things so I have not been able to get the information Evan needs.

    Gathering information is difficult as the app is in production and we have not got a new build ready yet to deploy. 

    I hope to get something soon and get it closed out.

    Will post here any outcome.

    Friday, September 23, 2011 12:26 PM
  • We seem to have made some progress. MS Support performed a reconfiguration of the databases which they define as :

    Reconfiguration 

    The process of replacing failed replicas is called reconfiguration. Reconfiguration can be required due to failed hardware or to an operating system crash, or to a problem with the instance of SQL Server running on the node in the data center. Reconfiguration can also be necessary when an upgrade is performed, whether for the operating system, for SQL Server, or for SQL Azure.

    Since then we have not had any of the timeouts, (they were daily previously) but since it has been only a few days, we will continue to keep an eye on it.

    Thursday, September 29, 2011 8:47 AM
  • Well that sure is good news (for you at least!).  We have little progress:  we dont appear to have blocking queries yet still get the timeouts at least daily.  

    If reconfiguration fixes for you over the next week, I'd appreciate if you let me know...  What kind of downtime did you experience while they were fixing your DB?  Is there any risk?

    Thursday, September 29, 2011 9:59 AM
  • There was a period of db outage, about 20seconds according to MS but approx 60 seconds according to the errors I can see in our logs.

    We didn't get any prior notification, which would have been nice! Fortunately at the time of the reconfiguration all the db calls were heartbeat rather than business processes, although the outage can be mitigated if the previously discussed retry strategies are in place. The errors were at least detectable as something other than sql timeouts.

    Still no timeouts so far, and no data issues detected with our DBs. The real test will be if it remains healthy.

    Thursday, September 29, 2011 1:10 PM
  • Niall,

    Co-incidentally, we also have not seen any SQL Timeouts since Monday 26th.  Prior to that, over the last two months (August, September) we only had two separate days when no timeouts occurred

    I am not aware if MS has "reconfigured" our DB instance and we have taken no corrective action ourselves.  

    I will keep an eye on things here and report back.  

    Niall, I wonder if perhaps I could contact you directly off-list?  If so, can you email me at: pmcevoy "at" inishtech "dot" com

     

    Pete

    Friday, September 30, 2011 10:39 AM
  • We are also still clear, the fact we both are since the 26th doesn't feel like a coincidence to me.

    I'll drop you a mail with my contact details.

    Niall

    Tuesday, October 04, 2011 10:53 AM
  • I just recently came across this thread and have found it very helpful, especially this link: http://windowsazurecat.com/2010/10/best-practices-for-handling-transient-conditions-in-sql-azure-client-applications/

    I am using the Entity Framework 4.1 with ASP.NET 4.0, hosted in Windows Azure with a SQL Azure database.  I have to say, I find it somewhat bizarre how much wrapper code is required to use SQL Azure with the Entity Framework.  It seems to me these are two flagship Microsoft development technologies that naturally should function more elegantly together.  Perhaps, the entity framework connection string could simply support a property such as "IsSqlAzure=true", and all of this retry policy logic would be handled automatically (since we aren't writing custom handlers for each exception in the list anyway).

    Does anyone else have any thoughts on this?  Is anyone aware of anything on the Microsoft roadmap which integrates these two technologies better?  Perhaps these technologies are so new that I'm not giving enough credit to how far these technologies have come in the first place.  That said, the goal and promise of the Entity Framework to simplify data access seems to not be fully realized as it pertains to SQL Azure.



    Wednesday, October 05, 2011 8:34 PM
  • Update - I spoke to the ADO.NET team lead about this.  Implementing retry logic "under the hood" is in the backlog for the ADO.NET team, and as of March 2012, it has a lead time of 6-18 months before they plan on implementing it.  It sounded like the ADO.NET is currently overwhelmed with other issues.  Just thought I'd provide an update to anyone currently realizing how much effort it is to wrap every atomic operation in their data layer with retry logic.  Good luck!


    Friday, June 22, 2012 3:27 PM
  • Pete, Valery,

    I'd be interested in what the outcome of this investigation was as we are experiencing what appears to be the same behaviour. an Update to a SQL Azure table with just 51 rows in, times out periodically.

    We have an active ticket with Microsoft with regard to the issue, that has shown the actual execution time of the SP is on average 140ms with a top time of 158ms. There is only one other SP that accesses the same table and that also runs equaly fast. Both SPs are run on a scheduled basis, (approx every 5 seconds) with the default timeout for the command, (linq). Of these scheduled calls approximately 2-3 will fail with a tiemout exception in any 24 hour period.

    Blocking does not seem likely, rather it appears that the timeout exception could be masking something else, hence my interest.

    thanks

    Niall 



    Hi All,

    I'm wondering if someone gets to the bottom of the issue?

    It seems that we're experiencing similar problems (periodically, queries which run 50-150 ms produce timeouts 30+ seconds).

    Peter, Niall, please share output of your investigations if any.

    Do you experience any issues after 'magic Reconfiguration' ?

    Regards, Oleksii




    Monday, June 03, 2013 5:43 PM
  • This thread has been very useful. However, there has not been any clear cut solution to the timeout problem.

    We are having serious Timeout issues on Azure Sql, connected using Entity Framework.

    Please update this thread with what worked for you. Thanks a lot.

    Regards,
    Gautam Jain

    Friday, August 02, 2013 1:30 AM
  • Hello Gautam,

    Does all of your code use the The Transient Fault Handling Application Block?

    It's easily available via NUGET.

    For example, every time you access the database, the command must be wrapped in a retry policy like this:

    using (MyDB db = new MyDB())
    {
    	User u = retryPolicy.ExecuteAction<User>(() => db.Users.FirstOrDefault(c => c.Id == userid));
    }

    I have been using SQL Azure for two years, and it works pretty well now that all of my code looks like that.  I do still see intermittent failures however, even with the retry policy.  So I'm not sure you'll ever get perfect performance, however, you shouldn't be having constant issues if your queries and workload are reasonable, and you are using the transient fault framework.

    Hope this helps.

    Matt




    Friday, August 02, 2013 5:05 AM
  • My azure app's been seeing a lot of "The semaphore timeout period has expired" (error code 121) errors lately, which are similarly omitted from the SqlDatabaseTransientErrorDetectionStrategy.

    From what I can tell from my searching, my error is commonly associated with non-azure SQL databases having NAS issues or similar IO problems, not any fault of the queries themselves. Does that sound accurate?

    1. should the strategy also consider error code 121 as transient?

    2. are we expected to implement our own strategy to consider 121 as transient, or wait for a code fix?

    3. Or is this error a fatal thing never supposed to happen in azure, and we wait for an azure fix?

    It just seems odd that this error could be a common/transient thing in azure and be ignored by the strategy by default.

    Friday, October 18, 2013 9:00 PM
  • How relevant. I too have been seeing way more of these errors in the last 2 days than I have in literally 10 months. SQL Database has very much smoothed out for us over the last year to a point where I am quite happy with it. But the last few days have brought back nightmares of the stability issues we've had in the past.

    I am going to go out on a limb and guess that a service update was rolled out over the last few days to the core SQL Database service. I am basing that on this instability as well as a number of failover events we've seen (40197 errors, we log these so we know when it happens). The failovers in particular are a sign of an update, especially when it happens more than once in a short period of time.

    So far today things seem a little more stable. I am hopeful the update itself is not causing stability issues and instead it was just the rollout of the update.

    Anyone else also seen a lot of SQL Database errors starting say Wednesday 2013-10-16?




    Friday, October 18, 2013 9:08 PM
  • Thanks for corroborating my observations, George. For comparison: my logs show the largest batch of errors lasting from

    2013-10-16 17:28:18.070
    to
    2013-10-17 03:48:11.697

    (utc)

    Friday, October 18, 2013 9:15 PM
  • In looking at my logs, for us they seemed to last for this period (UTC):

    2013-10-16 23:13:06

    through

    2013-10-18 00:09:38

    Friday, October 18, 2013 9:25 PM
  • We saw a lot of timeouts 2013-10-17 12:00 UTC to 2013-10-17 15:00 UTC. Has worked fine for months but this thursday our application became unusuable due to these timeouts.

    What logs are you looking at? I don't see any reconfiguration events for my database for this time period when I look at sys.event_log.

    When the issue occured for us, the memory usage reported in sys.resource_stats jumped up and down like crazy. The database consumed 5GB memory, 5 minutes later 1GB, then up to 5 again, and so on. During the same time, the IO avg_physical_read_iops went through the roof (normally at around 1-10 but went up to 300) which explains why we got the timeouts.

    I have an open support case with Microsoft but so far haven't gotten any explanation yet.

    • Edited by Nitramafve Monday, October 21, 2013 7:08 AM
    Monday, October 21, 2013 7:03 AM
  • I see the reconfiguration/failover events in my own logging. I log (using Loggly) any events that are pertinent and 40197, 40143, 40166 are always indicative of a failover. I also have seen many 10054, -1 and -2 errors lately.

    Do let us know if you get any reasonable response from Microsoft. If it continues any longer for me I will also open a ticket.

    Monday, October 21, 2013 2:14 PM
  • I got a reply from Windows Azure support now.

    Our database ran into performance problems due to high resource usage by another tenant. The problem lasted 10/16/2013 2.05 AM PST thru 10/17/2013 6.13 PM UTC (note the two different time zones - this was the reply I got and I'm not sure if it's a mistake in their reply or actual correct time). The operations team performed a failover to fix the issue.

    We are using a Premium database, which according to specs comes with reserved resources (8GB of RAM), so I've asked a follow up question on how we can run out of resources due to other tenants if resource are reserved.

    I'm located in the North Europe datacenter. If the issue on my db was in fact due to a noisy tenant, it sounds unlikely that it would be the exact same cause of your issue. Unless we're all living in a single happy database server...

    • Edited by Nitramafve Monday, October 21, 2013 6:54 PM
    Monday, October 21, 2013 6:53 PM
  • I would agree that your issue sounds different than mine. I am disappointed to hear that the Premium tier also experiences this sort of thing. I know it is still in Preview, but still....
    Monday, October 21, 2013 7:18 PM