Table Storage Outages

    Question

  • We had an outage today of 2 minutes. We had a 3 minute outage on Friday, and another 4 minute outage on Thursday.

    The Azure Frontline support team have been completely useless, simply stating that this is normal behaviour and that we are within SLA.

    Load balancing outages are supposed to be within 10-20 seconds, which a re-try policy will deal with. However outages of 3-10 minutes are not normal behaviour and are completely unacceptable.

    Jai, I have been CC'ing you the communication between us & support, and you have also done nothing.

    We have staked our business on this technology and are now left at the mercy of Microsoft, who are doing absolutely nothing about these issues.

    Anthony.

    Edit 8 Dec 2010:

    This forum post was written out of desperation to get Microsoft to respond & fix issues we were having with table storage. We were not getting a suitable response from the Azure support team and the problems could not go on any longer.

    Prior to these issues beginning last week, Table Storage availability has been excellent - with a maximum outage time of 10-20 seconds per day. This is considered normal behaviour due to load balancing, which is addressed by implementing a retry policy. 
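
    To illustrate the kind of retry policy described above, here is a minimal Python sketch of retrying with exponential backoff. It is not the Azure storage client's own retry mechanism: `TransientStorageError`, `retry_with_backoff`, and all of the default values are hypothetical names and numbers chosen for illustration only.

```python
import random
import time


class TransientStorageError(Exception):
    """Hypothetical stand-in for an OperationTimedOut / ServerBusy response."""


def retry_with_backoff(operation, max_attempts=6, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep, jitter=random.random):
    """Call operation(), retrying transient failures with exponential backoff.

    All defaults are assumptions: 6 attempts give sleeps of at most
    0.5 + 1 + 2 + 4 + 8 = 15.5 seconds in total.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientStorageError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # The delay doubles on each attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Scale by a random factor in [0.5, 1.0) so that many clients
            # do not all retry at the same instant.
            sleep(delay * (0.5 + jitter() / 2))
```

    With these assumed defaults the caller waits at most about 15.5 seconds before giving up, which fits a 10-20 second load-balancing window. It would not ride out an interruption lasting several minutes unless the attempt count and delays were made much larger.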

    • Edited by ants_super Tuesday, December 07, 2010 8:51 PM
    Tuesday, December 07, 2010 1:50 AM

Answers

  • Thanks Brad. When compiling the list of failed requests above, I did discover a delay in our logging which made us think the outage was for 10 minutes when in fact it was for 2 minutes. I will amend my first post and make sure we always go off the timestamps on the errors in future.

    The timestamps on Friday's errors started at 01:04:14 and finished at 01:07:06, which is almost 3 minutes, and Thursday 00:00:21 to 00:04:44.

    To wrap this up, this is not the kind of discussion I wanted to have on a public forum, but desperate times called for desperate measures. Dealing with frontline support has been like banging my head on a brick wall. It's good to know that you are receiving the information, but the communication that we have received is that everything has been normal and that our application needs to "back off". When we are dealing with our customers screaming at us, I'm sure you can understand my situation.

    I will continue sending problems to support and hopefully the feedback loop will improve.

    Cheers,
    Anthony.

     

    Tuesday, December 07, 2010 8:40 PM

All replies

  • I'm not aware of any storage outages (in any data center) in quite some time.

    If you want to loop me in (I think Jai's out of the office with little to no email access), I'd be happy to make sure the right people are engaged to figure out what you're seeing.  My email address is Steve.Marx@microsoft.com.

    Tuesday, December 07, 2010 2:45 AM
  • I've sent you the communication so far. To clarify, the issue is OperationTimedOut / ServerBusy errors, which we consider an outage given how long the errors keep occurring.

    Anthony.

    Tuesday, December 07, 2010 3:03 AM
  •  

    Hi Anthony

     

    In looking at your account for today, it has 100% availability except for 1 hour in which there were 2 requests that timed out.  Our investigation showed that those 2 requests timed out due to a server crash.

     

    We did receive your prior requests last week through support, and with our monthly upgrade going out this week, we did make a few tweaks that should help your usage.  

     

    Note, Jai is on vacation, so probably not responding to emails at this time. 

     

    Our goal is to provide the best availability and service possible, but there will be the occasional load spike, server crash or HW failure that needs to be dealt with.  The system is built to automatically deal with these issues, but there can be a few timeout/server busy errors when those types of issues occur.

     

    Brad

    Tuesday, December 07, 2010 4:55 AM
    Moderator
  • 2 request timeouts? Perhaps somehow you missed these:

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:1c34c92a-03dc-491d-92f3-7ac4ed21d024
    Time:2010-12-07T01:27:10.9960627Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:08cee7a7-e685-4a47-be53-a02ffc1c586f
    Time:2010-12-07T01:27:11.5089583Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:d493b262-08fe-45c6-b1c1-266041ec87ce
    Time:2010-12-07T01:27:20.9330689Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:27e9a798-5d2f-4fee-9511-01cd8a35e634
    Time:2010-12-07T01:27:31.8893042Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:981a4937-5456-4b28-8637-a164b889df1c
    Time:2010-12-07T01:27:34.4275227Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:39e16ec6-c02b-4975-8766-4551c0d3a171
    Time:2010-12-07T01:27:26.4185945Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:2c9aa452-2db6-419b-9888-bceea659c08f
    Time:2010-12-07T01:27:34.6657737Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:a01aad6b-0509-4af0-881c-c4d69c871bdc
    Time:2010-12-07T01:27:39.5327828Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:e35a414e-a150-4ee1-861e-0a1a9941e7f6
    Time:2010-12-07T01:27:41.1299138Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:e1411190-5d2b-408d-87b9-8a77bf54bb91
    Time:2010-12-07T01:27:43.1116692Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:d7ebf157-dbac-4bfa-a3a1-f00d3f5a8379
    Time:2010-12-07T01:27:46.3163337Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:8de30093-105c-44d3-8e0a-de97bee0902a
    Time:2010-12-07T01:27:46.2676134Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:e16c3ec4-8c53-45ee-8b06-16eca720cfd0
    Time:2010-12-07T01:27:47.0083702Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:8556697e-577e-4e1e-9b19-43d9da9d4ae4
    Time:2010-12-07T01:27:51.1333316Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:ce883a0a-8f03-41a1-b758-a7e16379f1b0
    Time:2010-12-07T01:27:52.3898977Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:bf1e460c-8c76-441d-913b-e6e0cd2e863f
    Time:2010-12-07T01:27:52.7749658Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:d04d0ad5-18b0-43b8-8f4b-d59e7bbbe924
    Time:2010-12-07T01:27:54.8629409Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:18101bc8-8005-429f-8749-1e9d18b6962e
    Time:2010-12-07T01:27:59.8055016Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:3628c426-d8c3-4dac-97d0-630595c757b1
    Time:2010-12-07T01:27:59.8561231Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:8c4e3ac8-178d-4a52-aa25-813da9480d47
    Time:2010-12-07T01:28:07.9199781Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:fd5c42ef-884b-46d0-9f82-892b825cf02d
    Time:2010-12-07T01:28:15.0704580Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:8f2e816f-8363-4b14-a792-18f8b9d35bd9
    Time:2010-12-07T01:28:20.0967047Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:161b208d-a830-4713-b322-a24631aa193a
    Time:2010-12-07T01:28:24.3678030Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:b70b7795-f604-4ce9-a227-eecb9ead5235
    Time:2010-12-07T01:28:25.7195644Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:56368194-5566-4a41-85b5-04c86ace6620
    Time:2010-12-07T01:28:43.6823672Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:c34176a2-d10a-4f2b-bbd3-f2f376560532
    Time:2010-12-07T01:28:51.9999082Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:5bfb7fd3-91f0-486f-a1c3-b76af56ef7ae
    Time:2010-12-07T01:28:54.5837811Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:64a296ac-1f55-49ad-898f-75e7edc435f9
    Time:2010-12-07T01:29:23.1239268Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:dd9264ce-d56c-4604-b3b2-44ea65e2004c
    Time:2010-12-07T01:29:25.2838979Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:ed817fa8-ea97-475a-9eff-40f41de7bd17
    Time:2010-12-07T01:29:21.5319169Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:32ec5037-9283-422c-bd03-e30b54aa8966
    Time:2010-12-07T01:29:21.4620148Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:f8b8a541-03f2-41ab-bb60-ec0d035dfee0
    Time:2010-12-07T01:29:11.5408254Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:8d23fffe-0dd3-4969-86eb-12c00e335582
    Time:2010-12-07T01:29:03.6006659Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:c03368ca-0820-43f4-9c71-287c40790ce7
    Time:2010-12-07T01:29:16.2906102Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:0994ea33-5b60-4a6f-8cde-c1fb97ad592b
    Time:2010-12-07T01:29:24.2163316Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:d93b8b40-141d-45b7-9fbf-50bcae81959f
    Time:2010-12-07T01:29:12.4544897Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:029b8d15-9892-4f91-8b19-a20a2d28bc0b
    Time:2010-12-07T01:29:16.2423402Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:76b0082d-c84f-4174-aeaa-1829ddaacd83
    Time:2010-12-07T01:29:25.7215357Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:6a8101fb-280b-416b-9fd0-2b278bc78866
    Time:2010-12-07T01:29:24.2628161Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:f2de981a-c839-475a-aa07-36c497d6101d
    Time:2010-12-07T01:29:19.3740270Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:f9314ed9-d7a2-4f58-972b-8da8b003b358
    Time:2010-12-07T01:29:06.9188003Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:2a55ccf6-689f-47e5-9b4c-b18e36c51a6e
    Time:2010-12-07T01:29:26.6865016Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:a394e560-2e6a-4011-9023-97cebddc7def
    Time:2010-12-07T01:29:08.1286198Z

    OperationTimedOut
    Operation could not be completed within the specified time.
    RequestId:7866d267-ce6f-4259-8e15-aa8996512001
    Time:2010-12-07T01:29:16.8464512Z

    I fully understand that load issues occur, and your documentation states that we can expect 10-20 seconds downtime which is acceptable with a retry policy. But look at the timestamps above...
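
    To make the point about the timestamps concrete, here is a small Python sketch that measures the error window directly from the `Time:` lines of error responses like those listed above. The function names and the regular expression are illustrative assumptions, and the three sample entries are copied from the list above (the first, one out-of-order entry, and the latest).

```python
import re
from datetime import datetime, timezone

# Matches lines such as "Time:2010-12-07T01:27:10.9960627Z".
TIME_RE = re.compile(r"Time:(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_error_times(log_text):
    """Return the UTC timestamp of every 'Time:' line found in log_text."""
    times = []
    for match in TIME_RE.finditer(log_text):
        seconds, fraction = match.groups()
        # The timestamps carry 7 fractional digits; datetime keeps only 6,
        # so truncate (or pad) the fraction to microseconds.
        micros = (fraction or "0")[:6].ljust(6, "0")
        stamp = datetime.strptime(f"{seconds}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
        times.append(stamp.replace(tzinfo=timezone.utc))
    return times


def error_window(log_text):
    """Return (first, last, duration) of the logged errors, or None if empty."""
    times = parse_error_times(log_text)
    if not times:
        return None
    # Entries are not necessarily in order, so take min and max explicitly.
    first, last = min(times), max(times)
    return first, last, last - first


SAMPLE = """
OperationTimedOut
RequestId:1c34c92a-03dc-491d-92f3-7ac4ed21d024
Time:2010-12-07T01:27:10.9960627Z

OperationTimedOut
RequestId:2a55ccf6-689f-47e5-9b4c-b18e36c51a6e
Time:2010-12-07T01:29:26.6865016Z

OperationTimedOut
RequestId:39e16ec6-c02b-4975-8766-4551c0d3a171
Time:2010-12-07T01:27:26.4185945Z
"""

first, last, duration = error_window(SAMPLE)
print(duration)  # 0:02:15.690439
```

    Measuring from the timestamps inside the error responses, rather than from the time the application wrote its own log entry, avoids overstating the window when logging is delayed.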

    Anthony.

    Tuesday, December 07, 2010 5:19 AM
  • As an aside, if the request timeouts you have logged are far fewer than what we have received, perhaps the problem is a network issue between the VMs and the storage account.

    The reason I mention this is that we have also received a high number of timeouts today communicating with our externally hosted mail server, and they are adamant they have had no outages.

    Our VMs & storage account are configured in the same hosting center / affinity group.

    Anthony.

    Tuesday, December 07, 2010 5:39 AM
  • Hi Anthony

     

    Sorry, in looking at the account after your posting today, we were focused on the requests that hit the data servers, based on what we were working on for you last week, where we saw just 2 request timeouts as mentioned above.

     

    In looking at the request IDs, there was a different issue that occurred with the authentication server today, which lasted about 2 minutes, and these requests timed out/failed because they could not be authenticated.  We identified the issue, and are putting a workaround in place to avoid having the issue occur, as well as getting a fix out for it.

     

    Note, this is different than the load balancing we looked at for your account mid last week, and as mentioned, for that we did do some tweaks in this upcoming monthly upgrade so you should notice improvements this week for that.

     

    Brad

    Tuesday, December 07, 2010 7:36 AM
    Moderator
  • Thanks Brad,

    When will this workaround be in place? We're in Australia and about to hit another day of heavy usage in 13 hours from now, and we can't afford for these errors to occur again.

    It also concerns me that you say the errors last week were simply due to load balancing, as the outages on Thursday & Friday were both around 4 minutes in duration which is far greater than the 10 or so seconds we should be expecting. Are the "tweaks" you made going to reduce this back down to 10 seconds?

    Is there a chance that you didn't see the same failed requests last week that I sent to you today?

    Anthony.

    Tuesday, December 07, 2010 8:00 AM
  • Sorry, didn’t see your last reply before checking out last night.  The workaround was put in place when I sent my last reply 8 hours ago.

     

    We relooked at what happened on Thursday and Friday.  Thursday didn’t have any issue with the account authentication, but the Friday incident was in fact due to the same account authentication not being available for 2½ minutes.  The load balancing investigation from Thursday threw off the one done on Friday, as it initially did on Monday.  Sorry about that.  But the full set of request IDs you sent on Monday allowed us to quickly narrow in on the issue.

     

    The interesting thing is that we weren’t seeing the same number of minutes for the incident, and we are not sure why at this time.  For example, on Monday the account authentication had an issue for 2 minutes, but the above post said it was 10 minutes, so we are not sure whether there is something else going on, but after looking we couldn’t find anything else on our side.

     

    If there are any other issues, please send them to support, since they do get to us immediately, and that is the fastest way to get the issue to our active on call dev.  Also please send the request ID list for as many of the requests as you can, like you did above, since that really helped us to quickly narrow in on the issue.

     

    Thanks

    Brad

     

    Tuesday, December 07, 2010 3:49 PM
    Moderator