Table Storage timeouts not handled by client-side timeout

    Question

  • I've been having some problems with Table Storage today, including some intermittent but very long timeouts. While diagnosing this I wrote a test which executes a lot of concurrent reads against table storage. I set the TableServiceContext.Timeout property to 1 (second) for this. Every now and then one of my queries times out, but the client waits basically forever for it. Also, my $MetricsTransactionsTable shows no server timeouts. So I'm confused about a few things:

    1. Why am I getting occasional timeouts?
    2. Why is the client waiting forever for a response, rather than timing out after 1 second?
    3. Why doesn't the $MetricsTransactionsTable show any server timeouts?

    Details

    The test code is a bit complicated, because it uses some of my production code. But the basics are:

    1. Create a table and populate with 2000 entities. They're all in the same partition.
    2. Set TableServiceContext.Timeout = 1;
    3. Create a separate CloudTableQuery<> to retrieve each entity, using a point query (PartitionKey and RowKey)
    4. Execute all the queries using BeginExecuteSegmented()
    5. Wait until they're all completed. 
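
    For reference, a minimal sketch of that flow against the 2012-era Microsoft.WindowsAzure.StorageClient library (v1.x). The entity type, table name, row key format and connection string below are placeholders, not the actual test code, so treat it as an illustration of the steps above.

    // Sketch only: reconstructs the steps above with the v1.x storage client
    // (Microsoft.WindowsAzure.StorageClient). Entity type, table name and
    // connection string are placeholders.
    using System;
    using System.Data.Services.Client;
    using System.Linq;
    using System.Net;
    using System.Threading;
    using Microsoft.WindowsAzure;
    using Microsoft.WindowsAzure.StorageClient;

    class TestEntity : TableServiceEntity
    {
        public TestEntity() { }
        public TestEntity(string pk, string rk) : base(pk, rk) { }
    }

    class Program
    {
        static void Main()
        {
            ServicePointManager.DefaultConnectionLimit = 48; // raised from the default of 2

            var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true"); // placeholder connection string
            var tableClient = account.CreateCloudTableClient();
            const string tableName = "TimeoutTest";
            tableClient.CreateTableIfNotExist(tableName);

            // 1. Create a table and populate it with 2000 entities, all in the same
            //    partition (written in batches of 100, the entity group transaction limit).
            for (int start = 0; start < 2000; start += 100)
            {
                var writeContext = tableClient.GetDataServiceContext();
                for (int i = start; i < start + 100; i++)
                    writeContext.AddObject(tableName, new TestEntity("P1", i.ToString("D5")));
                writeContext.SaveChangesWithRetries(SaveChangesOptions.Batch);
            }

            // 2-4. One point query (PartitionKey + RowKey) per entity, all started
            //      with BeginExecuteSegmented(), each on a context with Timeout = 1.
            int outstanding = 2000;
            var allDone = new ManualResetEvent(false);
            for (int i = 0; i < 2000; i++)
            {
                var readContext = tableClient.GetDataServiceContext();
                readContext.Timeout = 1; // seconds

                string rowKey = i.ToString("D5");
                CloudTableQuery<TestEntity> query =
                    (from e in readContext.CreateQuery<TestEntity>(tableName)
                     where e.PartitionKey == "P1" && e.RowKey == rowKey
                     select e).AsTableServiceQuery();

                query.BeginExecuteSegmented(ar =>
                {
                    try { query.EndExecuteSegmented(ar); }
                    catch (Exception ex) { Console.WriteLine("Query failed: " + ex.Message); }
                    finally
                    {
                        if (Interlocked.Decrement(ref outstanding) == 0)
                            allDone.Set();
                    }
                }, null);
            }

            // 5. Wait until they're all completed.
            allDone.WaitOne();
        }
    }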

    Trace

    When I run the above test using an HTTP connection and Fiddler, I can sometimes capture the failed request. It appears that the server just hangs forever and never actually returns. There's no HTTP response code, since the request never comes back. Here is a copy and paste of the Fiddler Statistics tab for the request. I can supply the full trace if needed. 

    Note that the Bytes Received is incorrect - Fiddler supplies a fake HTTP 504 response after the server times out. That is not actually a response from Table Storage. 

    Request Count:   1
    Bytes Sent:      439 (headers:439; body:0)
    Bytes Received:  638 (headers:126; body:512)


    ACTUAL PERFORMANCE
    --------------
    ClientConnected: 16:37:09.974
    ClientBeginRequest: 16:37:09.974
    GotRequestHeaders: 16:37:09.974
    ClientDoneRequest: 16:37:09.974
    Determine Gateway: 0ms
    DNS Lookup: 0ms
    TCP/IP Connect: 34ms
    HTTPS Handshake: 0ms
    ServerConnected: 16:37:10.008
    FiddlerBeginRequest: 16:37:10.008
    ServerGotRequest: 16:37:10.008
    ServerBeginResponse: 00:00:00.000
    GotResponseHeaders: 00:00:00.000
    ServerDoneResponse: 16:39:22.438
    ClientBeginResponse: 16:39:22.452
    ClientDoneResponse: 16:39:22.452


    Overall Elapsed: 00:02:12.4785000


    27 March 2012 22:58

All Replies

  • Well, I suppose this bit from the documentation for HttpWebRequest.Timeout might explain why my client is not timing out. 

    "The Timeout property has no effect on asynchronous requests made with the BeginGetResponse or BeginGetRequestStream method."

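    One common workaround is therefore to enforce the timeout on the caller's side: register a wait on the IAsyncResult and Abort() the request if it hasn't completed in time. A minimal sketch against a raw HttpWebRequest (not the storage client), with a placeholder URL:

    // Sketch: enforcing a client-side timeout on an asynchronous HttpWebRequest by
    // aborting it when a registered wait times out. The URL is a placeholder; in the
    // real test the requests are table storage point queries.
    using System;
    using System.Net;
    using System.Threading;

    class AsyncTimeoutSketch
    {
        static void Main()
        {
            var request = (HttpWebRequest)WebRequest.Create("http://example.com/"); // placeholder

            IAsyncResult result = request.BeginGetResponse(ar =>
            {
                try
                {
                    using (var response = (HttpWebResponse)request.EndGetResponse(ar))
                        Console.WriteLine("Status: " + response.StatusCode);
                }
                catch (WebException ex)
                {
                    // Status == RequestCanceled means the Abort() below fired,
                    // i.e. a client-side timeout.
                    Console.WriteLine("Request failed: " + ex.Status);
                }
            }, null);

            // HttpWebRequest.Timeout is ignored for Begin*/End* calls, so enforce the
            // timeout manually: abort the request if it hasn't completed in 1 second.
            ThreadPool.RegisterWaitForSingleObject(
                result.AsyncWaitHandle,
                (state, timedOut) => { if (timedOut) ((HttpWebRequest)state).Abort(); },
                request,
                TimeSpan.FromSeconds(1),
                true);

            Console.ReadLine(); // keep the console app alive while the request completes
        }
    }
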
    27 March 2012 23:39
  • You've already found the answer to part of your question. As for why you get occasional timeouts in the first place, this can happen for a number of reasons: network issues, the service being busy with too many requests, a coding logic error that only affects certain requests, and so on. When it happens, you need to retry the request after a while. As for why $MetricsTransactionsTable doesn't show any timeouts, that may be because there was no server timeout. A client timeout is different from a server timeout: when the client times out, it simply discards the request. The service may still return a correct response, but the client no longer cares about it.
    28 March 2012 11:32
  • I understand the need to have client-side timeouts and retries. But clearly the request was made, and it should have returned and been seen by Fiddler regardless of any coding errors on my part. It doesn't look like it did.

    I also think it's interesting that it ran for 2 minutes before Fiddler killed it. The published maximum time a table storage query should take is 30 seconds. I can't think of any coding error on my side that would cause that. 


    28 March 2012 15:05
  • I wrote a little console app that can reproduce this problem sometimes. Basically it fires off 2000 point queries at once, using BeginExecuteSegmented(). About 1 time out of 5 some of the queries will hang. This happens even when running in the Azure data center. 

    It's not clear to me whether this is a client problem or a server problem. Certainly a timeout would mask the problem, but I'd like to figure out how to actually solve it. 

    If anyone wants to give it a look, I've put up my test client.

    The source code is here: http://dl.dropbox.com/u/425717/StorageTester-src.zip

    The compiled console application is here: http://dl.dropbox.com/u/425717/StorageTester.zip

    I removed my storage connection string from them, so you'll have to add one into the config file before running it. 

    29 March 2012 16:58
  • If you're testing locally, check your network environment. Perhaps a proxy or something else in your network is blocking certain requests; consult your IT department for more details. This may also be caused by too many pending requests in your client application. Increasing ServicePointManager.DefaultConnectionLimit may help.
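
    For example, a small sketch (the limit has to be raised before the first outgoing request; the value 48 is just an example to tune for the workload):

    using System.Net;

    static class ConnectionTuning
    {
        // Call once at startup, before the first outgoing request is made.
        public static void Apply()
        {
            // The default is 2 concurrent connections per host, so hundreds of
            // parallel table requests end up queuing behind each other.
            ServicePointManager.DefaultConnectionLimit = 48; // example value

            // app.config equivalent:
            // <system.net>
            //   <connectionManagement>
            //     <add address="*" maxconnection="48" />
            //   </connectionManagement>
            // </system.net>
        }
    }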

    29 March 2012 17:05
  • I get the same results running locally or using a Large instance in the same data center as the storage account. 

    I already have the connection limit set to 48 in the app.config file. Setting it back to the default stops the timeouts (as far as I can tell), but obviously the test runs much more slowly. I'm not sure whether that actually solves the problem or just masks it, and in any case it's not an option for an actual production scenario. 

    29 March 2012 17:10
  • Hi,

    If you make 48 requests simultaneously, please make sure they finish quickly. Each machine has limited bandwidth, so if you have too many requests, some of them may have to wait until bandwidth is available, and may occasionally time out.

    Best Regards,

    Ming Xu.



    02 April 2012 10:52
  • They all included PartitionKey and RowKey, and the entities themselves had no other data in them. So they're about as fast as they can be. 

    I would not be surprised if I got a server timeout or 503 Busy response. I'm surprised that sometimes a query hangs for several minutes. 

    02 April 2012 14:58
  • Hi Brian - just wanted to see if this was resolved. If not, I'm happy to take a look into it. A couple of questions about your test:

    1) Do all of the other requests succeed quickly?

    2) What happens if you only send, say, 100 requests? 2000 requests/second on a single partition is beyond the stated scalability targets for Table Storage partitions (500 requests per second is the stated target), so it'd be interesting to see what you observe at a lower rate - see the sketch below.

    Let me know, and we can try to see if we can resolve this.
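
    As a rough sketch of what "going at a lower rate" could look like - gating BeginExecuteSegmented() behind a semaphore so only a fixed number of queries are in flight at once. The names and the query/entity types here are illustrative, building on the earlier sketch rather than your actual test app:

    // Sketch: run the point queries with at most maxInFlight outstanding at a time
    // instead of starting all 2000 at once. Builds on the query objects from the
    // earlier sketch; names are illustrative.
    using System;
    using System.Collections.Generic;
    using System.Threading;
    using Microsoft.WindowsAzure.StorageClient;

    static class ThrottledRunner
    {
        public static void Run<T>(IEnumerable<CloudTableQuery<T>> queries, int maxInFlight)
        {
            using (var gate = new SemaphoreSlim(maxInFlight, maxInFlight))
            using (var pending = new CountdownEvent(1))
            {
                foreach (var query in queries)
                {
                    gate.Wait();        // block until one of the maxInFlight slots is free
                    pending.AddCount();
                    var q = query;      // per-iteration copy for the closure
                    q.BeginExecuteSegmented(ar =>
                    {
                        try { q.EndExecuteSegmented(ar); }
                        catch (Exception) { /* log or retry in a real test */ }
                        finally { gate.Release(); pending.Signal(); }
                    }, null);
                }
                pending.Signal();       // balance the initial count of 1
                pending.Wait();         // wait until every query has completed
            }
        }
    }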


    -Jeff

    21 May 2012 6:44
  • Jeff,

      Thanks for checking in. I was able to reproduce the issue for a few days, and then it just sort of disappeared. I was using the same test code the whole time, so something on the service-side must have changed. I don't truly know if it happened again, as the client-side timeout code I've implemented since would have masked the problem. 

      From what I recall, it would happen if I did chunks of 100 requests. It would just fail more consistently with 2000 requests. 

      I did have a support ticket open, so you can check out some of the history there. But there wasn't any resolution, since the problem just disappeared. It was ticket# 112032784435912.

    Thanks!
    BKR

    21 May 2012 16:22