Astonishing amount of downtime on SQL Azure - North Central data center

General Discussion

  • Sunday, February 19, 2012 7:34 PM
     
     

    SQL Azure in the North Central data center has been severely slow (pretty much not working) for about 36 hours now.  We've been in contact with MS tech support and they've said that there is a problem in that data center with SQL Azure.  However, the service dashboard doesn't show any problem at all.  What's the point of the dashboard? :-)

    I am absolutely shocked at how long this outage has been, and it's making me question whether Azure is a good place to be.  My company's road map has us moving all our applications to Azure, but I'm not so sure now.  I hate to say it, but it's kind of a joke how long it's been down.

    We currently host most of our stuff at another hosting company. The thing in dealing with them - a small company - is there is a more personal connection and accountability. This has come in quite handy in times of crisis in the past. That's not happening in this time of crisis with Windows Azure.

    Ugh, Azure, please don't make all my fears come true. :-)


All Replies

  • Monday, February 20, 2012 12:07 AM
     
     

    While there may very well be several machines in North Central currently misbehaving (there are always a few, thus the automatic failover requirements of our code), there are no data center wide problems.

    Please email me at evanba at microsoft dot com privately with your case number and I will check into what is going on.

  • Monday, February 20, 2012 6:35 AM
     
     
    Evan, we are having similar issues to Paul on North Central. Very slow response; repeatedly getting timeouts even though we use retry logic. We've been running our CRM system on SQL Azure for 6 months or so now quite happily, but the system is currently unusable. I can't even log in to the SQL Azure Management Portal. There may be a few dodgy machines, but failover doesn't seem to be picking them all up.

    CC

  • Monday, February 20, 2012 7:29 PM
     
     

    ContainsCaffeine,

    Even if the automatic failover doesn't detect it, we have manual processes in place to help detect problematic machines.  That being said, I would be lying if I said that our automatic and manual processes have detected every single failure in the past.

    If you see problems persist, please engage our support organization by opening up a free support case at https://support.microsoft.com/oas/default.aspx?gprid=14919&st=1&wfxredirect=1&sd=gn.

    Evan

  • Monday, February 20, 2012 7:58 PM
     
     

    My expectation of a cloud service is that it is so rarely unavailable that a persistent problem is one that is dealt with neither by retry logic nor by a second attempt at retry logic (that's a lot of retries). Would you agree, Evan? Would that be a persistent problem in cloud terms?


    CC

  • Monday, February 20, 2012 8:07 PM
     
     
    The publicly stated SLA is that the database is accessible in 99.9% of the 5-minute intervals in a 30-day period.  There is currently no performance SLA, although a significant change in performance would tend to indicate something that should be investigated.
  • Monday, February 20, 2012 8:58 PM
     
     
    Is 'accessible' defined anywhere? We use the retry logic referred to by the Azure team as best practise. If 1% of our users manage to make a deathly slow connection and 99% fail in a given 5-minute interval using that logic, does that mean it is accessible in that 5-minute interval?

    CC


    Edit: Upon rereading my post seems a bit narky. That isn't the intention. I'm still quite positive about Azure. I guess I am trying to refine expectations and maybe refine retry logic and develop a process to follow on failure that is in line with reasonable expectations. So a definition of 'accessible' would be helpful.
    • Edited by FSL_Info Monday, February 20, 2012 9:59 PM
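For readers trying to refine their own retry logic as discussed above, here is a minimal sketch of one common pattern, retry with exponential backoff and jitter, in Python. It is not the retry logic recommended by the Azure team (that code is not shown in this thread). The function name `retry_with_backoff`, its parameters, and the choice of `TimeoutError` and `ConnectionError` as the "transient" errors are all assumptions made for illustration; a real client would catch whatever exceptions its database driver raises.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=30.0,
                       transient=(TimeoutError, ConnectionError),
                       sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Hypothetical helper: 'operation' stands in for whatever opens the
    connection or runs the query. Only exceptions listed in 'transient'
    are retried; anything else propagates immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except transient:
            if attempt == max_attempts:
                # Out of attempts: surface the failure to the caller so it
                # can be logged and escalated (for example, a support case).
                raise
            # Double the wait on each failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 50% random jitter so many clients do not retry
            # in lockstep against a struggling server.
            sleep(delay + random.uniform(0, delay / 2))
```

With the defaults shown, the waits add up to well under a minute, so a policy like this smooths over brief failovers but cannot hide an outage that lasts hours. That is the point at which a failure process (alerting, opening a support case) has to take over.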
  • Sunday, February 26, 2012 9:01 PM
     
     

    The definition can be found at http://www.windowsazure.com/en-us/support/sla/ and is focused on connectivity:

    SQL Azure customers will have connectivity between the database and our Internet gateway. SQL Azure will maintain a “Monthly Availability” of 99.9% during a calendar month. “Monthly Availability Percentage” for a specific customer database is the ratio of the time the database was available to customer to the total time in a month. Time is measured in 5-minute intervals in a 30-day monthly cycle. Availability is always calculated for a full month. An interval is marked as unavailable if the customer’s attempts to connect to a database are rejected by the SQL Azure gateway.


    There is also a link there to the official SLA document that goes into more detail http://go.microsoft.com/fwlink/p/?LinkId=159706&clcid=0x409 and explains exactly how we calculate availability. 

    Given the scenario you describe, that 1% would mean that your database counts as available as per the letter of the SLA.  That being said, I don't know that I have ever seen a scenario where only some connections fail.  Everything I have seen is pretty binary - either the database is accessible or it is not.  If it is not available, there is either a service issue or you have hit a throttling threshold.

    Evan 
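To make the interval arithmetic in the SLA definition quoted above concrete, here is a rough sketch in Python. It is not the official calculation (the linked SLA document is the authority); it only applies the quoted rules: 5-minute intervals, a 30-day cycle, and a 99.9% target. The names `TOTAL_INTERVALS`, `monthly_availability`, and `meets_sla` are made up for this example.

```python
INTERVAL_MINUTES = 5
DAYS_IN_CYCLE = 30

# 30 days * 24 hours * 60 minutes / 5-minute intervals = 8640 intervals
TOTAL_INTERVALS = DAYS_IN_CYCLE * 24 * 60 // INTERVAL_MINUTES


def monthly_availability(unavailable_intervals):
    """Percentage of 5-minute intervals in which the database was available."""
    if not 0 <= unavailable_intervals <= TOTAL_INTERVALS:
        raise ValueError("unavailable_intervals must be between 0 and %d"
                         % TOTAL_INTERVALS)
    available = TOTAL_INTERVALS - unavailable_intervals
    return 100.0 * available / TOTAL_INTERVALS


def meets_sla(unavailable_intervals, target_percent=99.9):
    """True if the month's availability reaches the target percentage."""
    return monthly_availability(unavailable_intervals) >= target_percent


if __name__ == "__main__":
    # A two-hour outage covers 120 / 5 = 24 intervals.
    for bad in (0, 8, 9, 24):
        print(bad, round(monthly_availability(bad), 4), meets_sla(bad))
```

Under these assumptions, 0.1% of 8640 intervals is 8.64, so a month with up to 8 unavailable intervals (40 minutes) still reaches 99.9%, while 9 or more does not. Note that this counts only intervals in which connections were rejected by the gateway; slow but successful connections do not count against availability under the quoted definition.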

  • Sunday, February 26, 2012 9:56 PM
     
     

    Thank you Evan; that is really helpful. There have been two periods in the last six months when we have had trouble connecting (http://social.msdn.microsoft.com/Forums/en-US/ssdsgetstarted/thread/6e9db81c-5e03-44cc-aba5-0283bfa734bb). In both cases the problems continued for a couple of hours. In both cases others also reported they were experiencing problems. In both cases the Azure Dashboard showed no problems.

    I appreciate it is unrealistic to expect every problem that might apply to an individual user to be detected immediately. However, when multiple users are experiencing similar problems at the same time, it suggests to me that something is going wrong, perhaps at the AppFabric layer rather than the database layer, and so maybe this needs to be beefed up. The fact that when problems have occurred at this layer they have persisted for hours suggests that perhaps error detection at this layer isn't what it should be.

    Our clients use our main POS app in a retail environment. For the time being we are only running an in-house app on Azure. I understand that nothing can ever be error free. But retailers cannot lose their POS software for a couple of hours at a time. Before we can switch our retail clients across to Azure we need to be confident all issues will be detected and resolved promptly.


    CC


    • Edited by FSL_Info Sunday, February 26, 2012 9:57 PM
  • Monday, February 27, 2012 1:05 AM
     
     

    So, out of curiosity, has the problem for ContainsCaffeine or Paul been resolved yet?

    I'm considering moving a fairly heavy application to Azure that is used to run a web application used by close to ten thousand users. The price is very attractive for the growth we're experiencing and our tests so far have gone just fine... Uptime reports from CloudHarmony show that SQL Azure is very stable - as much so as our local ones, to say the least. Seeing this, though, makes me nervous about using Azure SQL.  36 hours to fix a problem like this is *very* scary.

    I'm okay with hiccups in automatic monitoring - but if a customer reports an issue and lets a human at MSFT know about the problem, I would expect the problem would start being looked into fairly quickly and be able to be resolved in far far less than 36 hours (or even 8 hours).

    Thoughts? :)

    Dan

  • Monday, February 27, 2012 2:50 AM
     
     

    It depends on what you mean by resolved, Dan. Our experience of both incidents was that connectivity returned after a couple of hours. My concern is that in both cases the Azure dashboard remained unaware there was ever a problem.

    So, can we connect? Yes. However both issues remain unexplained. We won't be moving our main retail app to Azure until we see these issues appearing promptly on the Azure dashboard and promptly resolved.


    CC

  • Tuesday, February 28, 2012 2:20 PM
     
     

    Just to add on to this thread - we are experiencing the same nervousness that everyone else is here.  We have not witnessed any of these problems with the other Azure storage offerings (Blob/Table/Queue), but the sporadic connectivity and performance issues with SQL Azure are making us hold off on official deployments until it seems more stable.

    Knowing that Blob/Table/Queue also run on a shared-hardware, multi-tenant system, can anyone explain why those services don't have similar issues? I'm sure there is a technical reason, but regardless, the outward customer expectation is that SQL Azure would behave the same way.

  • Tuesday, February 28, 2012 3:41 PM
     
     

    Any potential of getting an answer from a Microsoft employee on this one?

    I'm super-excited to use Azure. It's going to cut operating costs down a ton, but like @SagerCat stated, I'm not comfortable having my company go that direction until there's a greater sense of reliability. According to CloudHarmony, SQL Azure has uptime of 100.00% over the last 12 months with the last outage only lasting 2 minutes, 114 days ago.  According to this thread, that's not the case.

    Dan

  • Wednesday, February 29, 2012 8:56 AM
     
     

    I don't know how useful measures of downtime are for comparing cloud services with other web services. I see yet another post has just appeared "Is SQL Azure down right now?" by Onkar. Yet again the dashboard doesn't think so yet it seems it isn't there for Onkar.

    For 5 years or so we ran our in-house CRM app on SQL Server hosted by a premium web host. Every now and again we would be notified that the server would be out of action for 30 mins or so for maintenance. We would typically be given 4 weeks notice the service would be unavailable. Once or twice in that 5 years the server was taken down at short notice for emergency maintenance. I don't know that it was ever unavailable without notice or for more than 30 mins.

    That's pretty good I reckon. We moved that app to SQL Azure because we were looking for a solution for the POS app we provide to our clients. Our clients are retailers scattered around the globe. We needed a server that would be always up. We thought before moving an app that is mission critical for our clients we would run our internal app on Azure for a while.

    It seems that although Azure is always up, periodically we lose connectivity for one or two hours without warning and without an issue appearing on the Azure dashboard. We couldn't survive inflicting that level of unreliability on our clients. A number of others have reported similar experiences with SQL Azure.

    So if we wanted the maximum reliability right now, we would move our SQL Server service back to our web host. The service isn't always there, but it is nearly always there and it is never not there for more than 30 mins or so and never not there without notice or for no reason.

    But we will leave our in house CRM app on Azure as it is not mission critical and it gives us the opportunity to see how Azure is getting along.


    CC

  • Thursday, March 01, 2012 12:13 AM
     
     

    Trust me - we totally get that the service as a whole has not been as robust as we would like it to be over the past 6 months.  Rest assured we are working extremely hard to fix the problems.

    Relative to the Dashboard comments, let me flip this back around to the other folks on the thread.  Right now, our Dashboard only shows problems that affect multiple servers in the data center and/or multiple customers simultaneously.  This is why some of the reported outages in the thread weren't posted - although it was clearly highly impactful to the affected customers, the problem was not widespread.  My own personal belief here is that if we posted on the Dashboard for every scenario like that, we would have lots of customers think they were impacted and go into disaster mode, yet not actually be impacted.  I would be happy to get feedback on that belief, though, so I can work that feedback into our long term plans.

    Also, addressing some of the specific long outages referenced, I totally agree.  Any outage that lasts more than 5 minutes is an SLA violation (see the full definition I posted earlier for the specifics) and you should definitely engage our Support organization.  If the outage continues even after engaging Support and you don't have a satisfactory explanation and ETA, please ask the Support person to whom you are talking to escalate the issue until you get someone who can give you the information you need to run your business (and it may even be me if you catch me on shift :)).  It may not resolve the outage, but it should allow you to get an ETA and good explanation.


  • Thursday, March 01, 2012 9:15 PM
     
     

    Thanks for that Evan. I would not expect a problem on an individual server to appear on the dashboard. My understanding is that where there is a server failure, service is switched to another server and the customer should experience no more than 90 secs interruption to service.

    So if the Azure team is aware of issues on individual servers, why haven't those servers been swapped out, rather than the Azure team working to fix the problem? If someone logs a job and the issue relates to an individual server, that customer should be swapped to another server and be back up immediately ... the Azure team can think about what went wrong later.

    This suggests to me that the outages we are talking about are not explained by individual server issues. There is something at another layer going wrong and not being reported on the dashboard.

    The other part to this is that threads like this one always seem to be about North Central US. We don't see multiple users complaining that South Central, Asia or Ireland have suddenly gone away. This has me wondering if North Central might be the busiest data centre and at times there is some level of the service which isn't responding to the load as it should. Am I right, do you think? Would we get greater reliability if we moved to the Ireland data centre?


    CC