We're developing a solution that will involve deployments in all 6 Azure datacenters. Each of these deployments will read a centralized custom configuration file from blob storage in a private container. The deployments poll once per minute to see if the config file has changed (by looking at the LastModified date). I now have test deployments set up in 3 datacenters (US-NorthCentral, EU-North, AS-East), and the config file resides in US-NorthCentral.
I am seeing what I consider an abnormal number of failures to read the config file from the AS-East datacenter. In the last 12 hours I've seen 42 failures (remember, the query runs every minute). I have seen zero failures from EU-North and US-NorthCentral. The failures are mostly spread out, perhaps every 10, 20, or 30 minutes, though with no clear pattern.
Even though there is great distance between AS-East and US-NorthCentral, and connectivity can be unreliable in general, my expectation is that communications between Azure datacenters should be very reliable. I assume Microsoft has some huge and reliable pipes between datacenters. Is this a reasonable expectation? 42 failures out of 720 (5.8%) seems very high to me.
By the way, if it makes any difference, the error I am seeing is this:
System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because
connected host has failed to respond
The code uses CloudBlob.FetchAttributes() to get the LastModified time.
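For reference, the per-minute check is essentially the following (a minimal sketch, not the actual code; configBlob and lastSeen are placeholder names):

configBlob.FetchAttributes(); // HEAD-style request that populates the blob's properties
if (configBlob.Properties.LastModifiedUtc > lastSeen)
{
    lastSeen = configBlob.Properties.LastModifiedUtc;
    // config changed: re-read the blob and apply the new settings
}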
This is not surprising; we see failures when calling Azure storage endpoints in the East and West US from non-Azure servers inside the US. However, it's not the end of the world: when coding for anything internet-facing (as the storage endpoints are), code for expected failures and how to deal with them. ;)
*I should add that we don't see numbers as high as 5%, but it happens.
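The pattern is nothing fancy; roughly this (a sketch, assuming you're calling the old StorageClient API directly; blob is a placeholder):

for (int attempt = 1; attempt <= 3; attempt++)
{
    try
    {
        blob.FetchAttributes();
        break; // success, stop retrying
    }
    catch (System.Net.WebException)
    {
        if (attempt == 3) throw; // out of attempts, surface the failure
        System.Threading.Thread.Sleep(TimeSpan.FromSeconds(5 * attempt)); // back off before retrying
    }
}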
I wouldn't have been as surprised if either end were outside the Azure network. But I really expect that when traversing the various Microsoft datacenters there would be huge pipes involved (dark fiber, anyone?), ensuring very high bandwidth and reliable connectivity.
I do have retry logic in place (aside from the built-in RetryPolicies of the StorageClient), and my code is resilient to such failures. I'm just trying to gauge what I should expect, and am a little disappointed. Anyone from Microsoft care to comment?
When were you seeing this? Our service communicates /heavily/ across data centers to Table Storage within all six data centers. In general things are very good, and in the last year we've not seen any major timeouts on a frequent basis (we communicate much, much more often than once per minute and on larger data volumes). But after the Feb 29th outage, things have been worse, with a few periods where storage was downright slow and timing out frequently, and once the outages even made it to the service dashboard.
They're improving on a daily basis, however, so I'm hopeful that within a short amount of time we'll stop seeing timeouts.
Overall, my guess is that the errors you're seeing are storage errors and not cross-datacenter communication errors. I might be wrong.
I was seeing a rash of them from about 6:30 PM Pacific on 3/8/2012 until about 5:30 AM Pacific on 3/9, at least a couple of times an hour (58 errors in 11 hours, an 8.8% error rate). Since then I have only seen it occur 9 times (in 27 hours, a 0.56% error rate).
I would really like to think this is a short-term issue that Microsoft is working on. I have to say it doesn't leave a comforting feeling to think that level of error is what I can expect in general. It sounds like you're saying I should expect better service based on your experience, and I hope you are right!
Yeah, as I mentioned, we do utilize the standard RetryPolicy that is inherent in StorageClient (the Azure .NET SDK), and we have also built our code to be resilient to failures. But the errors we see are what remains after all retries by the StorageClient code.
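For clarity, that policy is the one you can also set explicitly on the client; roughly (a sketch of the SDK's stock exponential policy, using its own default constants rather than anything I've tuned):

cloudBlobClient.RetryPolicy = RetryPolicies.RetryExponential(
    RetryPolicies.DefaultClientRetryCount, // the SDK's default retry count
    RetryPolicies.DefaultClientBackoff);   // the SDK's default delta backoff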
So in the last 3.5 days (5,040 minutes) we have seen 157 failures to access a blob file in US-NorthCentral from a service running in AS-East. That is a 3.1% failure rate. By contrast, we have seen a 0.0% failure rate when accessing the same blob file from both EU-North and (not surprisingly) US-NorthCentral.
I don't see any pattern in the failures. I would guess they average about 3 per hour, but then there are hours where we see a rash of a dozen or more.
It's true that latencies may be higher when communicating between continents. For this scenario, we suggest increasing your timeout values on the blob requests. You can do this by using a BlobRequestOptions object with Timeout set to an appropriate value. Here's a code sample that sets the request timeout to 5 minutes.
CloudBlob largeBlob = cloudBlobClient.GetBlobReference(blobname);
BlobRequestOptions options = new BlobRequestOptions();
options.Timeout = new TimeSpan(0, 5, 0); // 5-minute request timeout
largeBlob.FetchAttributes(options);      // pass the options along with the request
Thanks for the suggestion, Jeff. As I understand it, the default timeout is 90 seconds. Since we are making this query every minute, I had lowered the timeout to 30 seconds. I'm going to go ahead and raise it to 60 and see if it makes any difference, but I'm not very confident it will.
So can I take your response as an acknowledgement that this sort of error is to be expected when crossing continents, even when staying within the Microsoft network infrastructure?
So I raised the timeout to 60 seconds, and it appears to have made a significant difference. I have been running the same processes in all 6 datacenters for the last 4 days and have not seen a single failure. So unless something changed on the Azure backend, it looks like my lowering the timeout to 30 seconds was at the root of the problem. I was surprised it made such a huge difference, but I am glad to see it clear up.
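For anyone who finds this thread later, the entire change was the timeout on the per-request options (following Jeff's sample above; configBlob is a placeholder name):

BlobRequestOptions options = new BlobRequestOptions();
options.Timeout = TimeSpan.FromSeconds(60); // was 30; the SDK default is 90
configBlob.FetchAttributes(options);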