I've been having some problems with Table Storage today, including some intermittent but very long timeouts. While diagnosing this I wrote a test which executes a lot of concurrent reads against table storage. I set the TableServiceContext.Timeout property to 1 (second) for this. Every now and then one of my queries times out, but the client waits basically forever for it. Also, my $MetricsTransactionsTable shows no server timeouts. So I'm confused about a few things:
- Why am I getting occasional timeouts?
- Why is the client waiting forever for a response, rather than timing out after 1 second?
- Why doesn't the $MetricsTransactionsTable show any server timeouts?
The test code is a bit complicated, because it uses some of my production code. But the basics are:
- Create a table and populate with 2000 entities. They're all in the same partition.
- Set TableServiceContext.Timeout = 1;
- Create a separate CloudTableQuery<> to retrieve each entity, using a point query (PartitionKey and RowKey)
- Execute all the queries using BeginExecuteSegmented()
- Wait until they're all completed.
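The shape of the test described above can be sketched in a language-neutral way. This is only an illustration in Python with asyncio, not the actual .NET test code: `point_query` is a hypothetical stand-in for a single point query, and `run_test` is an assumed helper name. It starts every query at once, gives each one a client-side deadline, and reports which ones were still pending when the deadline passed.

```python
import asyncio


async def point_query(row_key):
    """Hypothetical stand-in for one point query; not a real storage call."""
    await asyncio.sleep(0.01)  # pretend network round trip
    return {"PartitionKey": "p", "RowKey": row_key}


async def run_test(row_keys, deadline_seconds, query=point_query):
    """Start every query at once and report which ones miss the deadline."""

    async def guarded(key):
        try:
            # wait_for cancels the query if it has not finished in time.
            return key, await asyncio.wait_for(query(key), deadline_seconds)
        except asyncio.TimeoutError:
            return key, None  # still pending when the deadline passed

    # gather returns results in the same order as row_keys.
    results = await asyncio.gather(*(guarded(k) for k in row_keys))
    completed = [k for k, r in results if r is not None]
    hung = [k for k, r in results if r is None]
    return completed, hung
```

Calling `asyncio.run(run_test(keys, 1.0))` returns the two lists, so a run in which some requests never come back shows up as a non-empty `hung` list instead of a test that waits forever.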
When I run the above test using an HTTP connection and Fiddler, I can sometimes capture the failed request. It appears that the server just hangs forever and never actually returns. There's no HTTP response code, since the request never comes back. Here is a copy and paste of the Fiddler Statistics tab for the request. I can supply the full trace if needed.
Note that the Bytes Received is incorrect - Fiddler supplies a fake HTTP 504 response after the server times out. That is not actually a response from Table Storage.
Request Count: 1
Bytes Sent: 439 (headers:439; body:0)
Bytes Received: 638 (headers:126; body:512)
Determine Gateway: 0ms
DNS Lookup: 0ms
TCP/IP Connect: 34ms
HTTPS Handshake: 0ms
Overall Elapsed: 00:02:12.4785000
- Edited by Brian Reischl Tuesday, March 27, 2012 11:10 PM
You’ve already found the answer to part of your question. As for why you get occasional timeouts in the first place, this can happen for many reasons: network issues, the service being busy with too many requests, a coding logic error that only affects certain requests, and so on. When it happens, you need to retry the request after a while. As for why $MetricsTransactionsTable doesn’t show any timeouts, it may be that no server timeout occurred. A client timeout is different from a server timeout. When the client times out, it simply abandons the request. The service may actually return a correct response, but the client no longer cares about it.
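The "time out on the client, then retry after a while" advice can be sketched as follows. This is a minimal illustration in Python with asyncio, not the storage client library's own retry policy; the function name, parameters, and the doubling back-off are all assumptions made for the example.

```python
import asyncio


async def call_with_timeout_and_retry(operation, timeout_seconds=1.0,
                                      attempts=3, backoff_seconds=0.5):
    """Run operation() under a client-side deadline, retrying on timeout.

    `operation` is any zero-argument callable that returns an awaitable.
    All names and defaults here are hypothetical.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(operation(), timeout_seconds)
        except asyncio.TimeoutError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Wait a little longer before each new attempt.
                await asyncio.sleep(backoff_seconds * (2 ** attempt))
    raise last_error  # every attempt timed out
```

Retrying like this is reasonable for reads such as point queries, because repeating them has no side effects. It only bounds how long the caller waits; it does not explain why the original request hung.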
I understand the need to have client side timeouts and retries. But clearly the request happened, and it should've returned and been seen by Fiddler regardless of any coding errors I may have. But it doesn't look like it did.
I also think it's interesting that it ran for 2 minutes before Fiddler killed it. The published maximum time a table storage query should take is 30 seconds. I can't think of any coding error on my side that would cause that.
I wrote a little console app that can reproduce this problem sometimes. Basically it fires off 2000 point queries at once, using BeginExecuteSegmented(). About 1 time out of 5 some of the queries will hang. This happens even when running in the Azure data center.
It's not clear to me whether this is a client problem or a server problem. Certainly a timeout would mask the problem, but I'd like to figure out how to actually solve it.
If anyone wants to give it a look, I've put up my test client.
The source code is here: http://dl.dropbox.com/u/425717/StorageTester-src.zip
The compiled console application is here: http://dl.dropbox.com/u/425717/StorageTester.zip
I removed my storage connection string from them, so you'll have to add one into the config file before running it.
If you're testing locally, check your network environment. Perhaps a proxy or something else in your network is blocking certain requests. Consult your IT department for more details. This may also be caused by too many pending requests in your client application. Increasing ServicePointManager.DefaultConnectionLimit may help.
I get the same results running locally or using a Large instance in the same data center as the storage account.
I already have the connection limit set to 48 in the app.config file. Setting it back to the default stops the timeouts (as far as I can tell), but obviously it runs much more slowly. And I'm not sure whether that actually solves the problem, or just masks it. And anyway it's not an option for an actual production scenario.
If you make 48 requests simultaneously, please make sure they finish quickly. Each machine has limited bandwidth, so if you have too many requests, some of them may have to wait until bandwidth is available, and may occasionally time out.
They all included PartitionKey and RowKey, and the entities themselves had no other data in them. So they're about as fast as they can be.
I would not be surprised if I got a server timeout or 503 Busy response. I'm surprised that sometimes a query hangs for several minutes.
Hi Brian - just wanted to see if this was resolved. If not, I'm happy to take a look into it. A couple of questions about your test:
1) Do all of the other requests succeed quickly?
2) What happens if you only send, say, 100 requests? 2000 requests/second on a single partition is beyond the stated scalability targets for Table Storage partitions (500 requests per second is the stated target), so it'd be interesting to see what happens if you go at a lower rate.
Let me know, and we can try to see if we can resolve this.
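One way to run the lower-rate experiment suggested above is to pace how fast requests are started instead of firing them all at once. The following is a rough sketch in Python with asyncio, not tied to any storage client; `run_paced` and its parameters are hypothetical names, and `operations` is assumed to be a list of zero-argument callables that each return an awaitable.

```python
import asyncio


async def run_paced(operations, max_per_second):
    """Start operations evenly spaced so that no more than
    max_per_second of them begin in any one second."""
    interval = 1.0 / max_per_second
    loop = asyncio.get_running_loop()
    start = loop.time()
    tasks = []
    for i, op in enumerate(operations):
        # Sleep until this operation's scheduled start time.
        delay = start + i * interval - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)
        tasks.append(asyncio.ensure_future(op()))
    # Requests still overlap; only their start times are spread out.
    return await asyncio.gather(*tasks)
```

With `max_per_second=500`, 2000 requests would be started over roughly four seconds, which keeps the start rate at the stated per-partition target and makes it easier to tell throttling apart from a genuine hang.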
Thanks for checking in. I was able to reproduce the issue for a few days, and then it just sort of disappeared. I was using the same test code the whole time, so something on the service-side must have changed. I don't truly know if it happened again, as the client-side timeout code I've implemented since would have masked the problem.
From what I recall, it would happen if I did chunks of 100 requests. It would just fail more consistently with 2000 requests.
I did have a support ticket open, so you can check out some of the history there. But there wasn't any resolution, since the problem just disappeared. It was ticket# 112032784435912.