Doubts about ServiceBus reliability

  • Question

  • Hi,

    I hope the subject will catch your attention! :)

    I am currently using ServiceBus in production to log events asynchronously. Events are enqueued by multiple Azure web roles, and dequeued by a single worker role. I'm only using one queue at the moment.

    Looking at my app logs, I've seen for quite a long time that dequeuing (that is, calling Receive or ReceiveAsync) sometimes throws an exception. I would say it happens 2 to 3 times a week, and the exception is not always the same. Some exceptions I got are:

    • The server was unable to process the request; please retry the operation. If the problem persists, please contact your Service Bus administrator and provide the tracking id..<some_id>
    • Error during communication with Service Bus. Check the connection information, then retry.
    • Could not connect to net.tcp://<my_sb_address>:9354/. The connection attempt lasted for a time span of 00:00:21.0287258. TCP error code 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond <some_IP>:9354.
    • 50002: Provider Internal Error.TrackingId:<some_id>,TimeStamp:3/20/2014 9:35:09 AM

    I think no messages were lost, though I'm not 100% sure: I did lose some messages, but I believe that was caused by a bad deployment I made.

    So I'm wondering if these exceptions just reflect transient errors, in which case I would wonder why they are not managed by the ServiceBus client's default RetryPolicy (which is 10x exponential retry), or if they are "real" issues, in which case I would look for guidance on what I may be doing wrong...

    Thanks in advance for your help.
    Thomas

    Tuesday, March 25, 2014 2:22 AM

Answers

  • Hi Thomas,

    We track to a 99.9% reliability SLA, meaning that 1 out of every 1,000 operations can fail. Typically it's no more than 1-2 operations out of every 10,000. When you have a receive operation pending, consider every minute of wait time an operation.
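
    As a rough illustration of that arithmetic, here is a minimal sketch that assumes a single receiver with one receive pending around the clock, counts each minute of waiting as one operation, and ignores send traffic entirely. The function and variable names are made up for this example.

```python
# Back-of-the-envelope estimate of failed receive operations per week.
# Assumption: one receive is pending 24/7 and each minute of wait time
# counts as one operation, as described in the reply above.

MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes


def expected_failures(operations: int, failure_rate: float) -> float:
    """Expected number of failed operations at a given failure rate."""
    return operations * failure_rate


ops_per_week = MINUTES_PER_WEEK

# SLA bound: 1 failure per 1,000 operations.
sla_bound = expected_failures(ops_per_week, 1 / 1_000)

# Typical range quoted in the reply: 1-2 failures per 10,000 operations.
typical_low = expected_failures(ops_per_week, 1 / 10_000)
typical_high = expected_failures(ops_per_week, 2 / 10_000)

print(f"operations per week:   {ops_per_week}")
print(f"SLA bound (0.1%):      {sla_bound:.1f} failures/week")
print(f"typical (0.01-0.02%):  {typical_low:.1f} to {typical_high:.1f} failures/week")
```

    Under these assumptions the typical rate works out to roughly 1 to 2 failed receives per week and the SLA bound to about 10, which is in the same range as the 2 to 3 exceptions per week reported in the question.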

    There's a good number of reasons for these sorts of errors even when Service Bus is perfectly nominal. A Service Bus cluster consists of 32 to 96 nodes (depending on which DC you use), and those may go up and down or be moved for resource management reasons or for patches (you'll notice that there is never any planned downtime). As that happens, individual operations may become the victim of an internal failover to the next node. More rarely, storage operations may fail as messages get offloaded, and that makes the Sends fail. When we say we accepted a message, we have it. Even if the system were completely down, we'd still give it back eventually.

    Retries only apply to Send operations. We don't mask errors on receive, so that you get to see them in your logs. With OnMessage/OnMessageAsync, however, there's a robust loop that will swallow such exceptions.
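
    For anyone calling Receive directly, the general idea of such a loop can be sketched as follows. This is a language-neutral illustration in Python, not the Service Bus client API: `receive`, `handle`, `TransientReceiveError` and the backoff parameters are all hypothetical stand-ins, and which exceptions count as transient is an assumption you would have to map onto your own client library.

```python
import itertools
import logging
import time
from typing import Any, Callable, Optional

log = logging.getLogger("receiver")


class TransientReceiveError(Exception):
    """Hypothetical stand-in for a transient error raised by a receive call."""


def receive_loop(
    receive: Callable[[], Optional[Any]],
    handle: Callable[[Any], None],
    max_iterations: Optional[int] = None,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Keep receiving; log transient failures and retry with capped backoff.

    Returns the number of messages handled. With max_iterations=None the
    loop runs until an unexpected (non-transient) exception propagates.
    """
    iterations = itertools.count() if max_iterations is None else range(max_iterations)
    consecutive_failures = 0
    handled = 0
    for _ in iterations:
        try:
            message = receive()
        except TransientReceiveError as exc:
            # Log the error so it stays visible, then back off and retry.
            consecutive_failures += 1
            delay = min(max_delay, base_delay * 2 ** (consecutive_failures - 1))
            log.warning("receive failed (%s); retrying in %.1fs", exc, delay)
            sleep(delay)
            continue
        consecutive_failures = 0  # a successful call resets the backoff
        if message is not None:  # None models a receive that timed out empty
            handle(message)
            handled += 1
    return handled
```

    The loop deliberately catches only the exception type treated as transient, so that configuration or programming errors still surface instead of being retried forever.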

    Best regards
    Clemens 

    Wednesday, April 2, 2014 5:42 AM

All replies

  • Hi Clemens,

    Thanks a lot, I could not expect a better reply!

    Thomas

    Thursday, April 3, 2014 2:02 AM