Azure Cache - Famous ErrorCode<ERRCA0017>:SubStatus<ES0006> frustration
-
Thursday, 12 April 2012 16:48
Hi,
For the last couple of weeks, we've been struggling to fix the famous ErrorCode<ERRCA0017>:SubStatus<ES0006> cache error. We haven't changed anything related to the Azure cache in our code, yet it started to happen on all our roles and instances across two subscriptions in the South Central DC.
We're seeing a huge number of errors with the following DataCacheException:
ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.)
Stack Trace:
at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody)
at Microsoft.ApplicationServer.Caching.DataCache.InternalGet(String key, DataCacheItemVersion& version, String region, IMonitoringListener listener)
at Microsoft.ApplicationServer.Caching.DataCache.<>c__DisplayClass49.<Get>b__48()
at XShared.ServerCache.GetObjectForKey(String key) in C:\Work\Azure\X\X\ServerCache.cs:line 129
Inner Exception:
Could not connect to net.tcp://x.cache.windows.net:22233/. The connection attempt lasted for a time span of 00:00:21.0300386. TCP error code 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 70.37.90.22:22233
Stack Trace:
at System.ServiceModel.Channels.SocketConnectionInitiator.Connect(Uri uri, TimeSpan timeout)
at System.ServiceModel.Channels.BufferedConnectionInitiator.Connect(Uri uri, TimeSpan timeout)
at System.ServiceModel.Channels.ConnectionPoolHelper.EstablishConnection(TimeSpan timeout)
at System.ServiceModel.Channels.ClientFramingDuplexSessionChannel.OnOpen(TimeSpan timeout)
at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
at Microsoft.ApplicationServer.Caching.CacheResolverChannel.Open(TimeSpan timeout)
at System.Runtime.Remoting.Messaging.StackBuilderSink._PrivateProcessMessage(IntPtr md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)
at System.Runtime.Remoting.Messaging.StackBuilderSink.PrivateProcessMessage(RuntimeMethodHandle md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)
at System.Runtime.Remoting.Messaging.StackBuilderSink.AsyncProcessMessage(IMessage msg, IMessageSink replySink)
Exception rethrown at [0]:
at System.Runtime.Remoting.Proxies.RealProxy.EndInvokeHelper(Message reqMsg, Boolean bProxyCase)
at System.Runtime.Remoting.Proxies.RemotingProxy.Invoke(Object NotUsed, MessageData& msgData)
at Microsoft.ApplicationServer.Caching.CacheResolverChannel.OpenDelegate.EndInvoke(IAsyncResult result)
at Microsoft.ApplicationServer.Caching.ChannelContainer.Opened(IAsyncResult ar)
Code & Configuration:
We initialize one instance of DataCacheFactory and one instance of DataCache per role instance; these are retained in a singleton:
    private void Init()
    {
        try
        {
            if (cacheFactory == null)
            {
                // It seems enabling local cache will clear all trace listeners, so to work
                // around it we save these listeners and restore them later
                List<TraceListener> tracelisteners = new List<TraceListener>();
                foreach (TraceListener tracelistener in System.Diagnostics.Trace.Listeners)
                {
                    tracelisteners.Add(tracelistener);
                }

                cacheFactory = new DataCacheFactory();

                // restore trace listeners
                System.Diagnostics.Trace.Listeners.Clear();
                foreach (TraceListener tracelistener in tracelisteners)
                {
                    System.Diagnostics.Trace.Listeners.Add(tracelistener);
                }
            }

            if (cache == null)
            {
                cache = cacheFactory.GetDefaultCache();
            }
        }
        catch (Exception ex)
        {
            TraceLog.Error(ex.Message, TraceLogRole.Shared, "ServerCache", "Init");
        }
    }
web.config snippet with cache configuration below:
    <dataCacheClients>
      <dataCacheClient name="default" maxConnectionsToServer="2">
        <hosts>
          <host name="x.cache.windows.net" cachePort="22233" />
        </hosts>
        <localCache isEnabled="true" sync="TimeoutBased" objectCount="1000000" ttlValue="60" />
        <transportProperties receiveTimeout="45000"/>
        <securityProperties mode="Message">
          <messageSecurity authorizationInfo="...">
          </messageSecurity>
        </securityProperties>
      </dataCacheClient>
    </dataCacheClients>
Cache data retrieval method with primitive retry policy (one subsequent attempt only):
    public object GetObjectForKey(string key)
    {
        try
        {
            object res = null;
            try
            {
                res = cache.Get(key);
            }
            catch (DataCacheException ex)
            {
                // Simple retry policy
                TraceLog.Warning("Cache returned Code " + ex.ErrorCode.ToString() + " and SubStatus " + ex.SubStatus.ToString() + ". Retrying.", TraceLogRole.Shared, "ServerCache", "GetObjectForKey");
                res = cache.Get(key);
            }
            return res;
        }
        catch (DataCacheException ex)
        {
            TraceLog.Error(ex.Message + Environment.NewLine + "StackTrace:" + Environment.NewLine + ex.StackTrace, TraceLogRole.Shared, "ServerCache", "GetObjectForKey");
            if (ex.InnerException != null)
            {
                TraceLog.Error("InnerException: " + ex.InnerException.Message + Environment.NewLine + "StackTrace:" + Environment.NewLine + ex.InnerException.StackTrace, TraceLogRole.Shared, "ServerCache", "GetObjectForKey");
            }
        }
        return null;
    }
Summary:
- We've tried almost everything we found online (in docs, best practices, blog posts, forums...) regarding the code and config, to no avail
- The cache is in the same data center as the role instances
- We have the latest Azure SDK installed and the latest libraries are being used, including the Microsoft.ApplicationServer.Caching.*.dll assemblies
- These errors happen only after a short idle time (1-5 minutes; the interval varies)
- We contacted Azure customer support several days ago, only to learn that "Azure cache services is working fine", according to their conclusion.
- Yes, we're aware of Akshat Sharma's blog post. It didn't help us (see the web.config).
Currently, we're in the development phase, nearing the public beta. If this occurred in any later phase, it would be devastating for us. As official support has not found a solution so far, I'd like to ask anyone here to share any experiences, insights or opinions on this particular issue.
Thank you!
All Replies
-
Friday, 13 April 2012 05:52
Hi Pavel,
This is one of the cases where you would use the Transient Fault Handling Application Block to implement a retry mechanism. Could you try implementing the following code?
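The snippet attached to this reply is not preserved in this copy of the thread. Purely as an illustration of the retry idea, and not the Transient Fault Handling Application Block's own API, a hand-rolled retry with exponential backoff might look roughly like the sketch below. `RetryHelper`, `Execute` and all parameter names are hypothetical; only `System` types are used so the block compiles on its own.

```csharp
using System;
using System.Threading;

// Hypothetical helper for illustration; not part of any Microsoft library.
public static class RetryHelper
{
    // Runs `operation`, making at most `maxAttempts` attempts in total.
    // A failed attempt is retried only if `isTransient` returns true for the
    // exception. The pause between attempts starts at `initialDelay` and
    // doubles after every failure.
    public static T Execute<T>(
        Func<T> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts,
        TimeSpan initialDelay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException("maxAttempts");

        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception ex)
            {
                // Give up on the last attempt or on errors that are not transient.
                if (attempt >= maxAttempts || !isTransient(ex)) throw;
            }

            Thread.Sleep(delay);
            delay = TimeSpan.FromTicks(delay.Ticks * 2);
        }
    }
}
```

In the `GetObjectForKey` method from the question, the inner try/catch could then be replaced by something like `RetryHelper.Execute(() => cache.Get(key), ex => ex is DataCacheException, 3, TimeSpan.FromMilliseconds(200))`; the attempt count and delay here are arbitrary placeholders, not recommended values.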
Sandrino
Sandrino Di Mattia | Twitter: http://twitter.com/sandrinodm | Azure Blog: http://fabriccontroller.net/blog | Blog: http://sandrinodimattia.net/blog
- Marked as answer by Arwind - MSFT (Moderator), Wednesday, 18 April 2012 06:05
-
Friday, 13 April 2012 08:21 (Moderator)
Hi,
Would you like to share a test application so that we can help test it? If it works fine, I think there is no serious problem with the application, and the issue may be caused by the environment.
BR,
Arwind
Please mark the replies as answers if they help or unmark if not. If you have any feedback about my replies, please contact msdnmg@microsoft.com Microsoft One Code Framework
-
Tuesday, 24 April 2012 21:02
If I were you, I'd use the cache as a write-through cache, basically making two requests, one to the source and another to the cache, and returning whichever comes back first.
In your stack trace, it seems you have a timeout after 21 seconds, which is an eternity really, so you badly need a better failover than "just retry" to get any proper behaviour from your application.
Rx has a method for this called Amb; something along those lines is probably what you want.
Netflix encapsulates the above in an executable 'command' object instance which fails over if a request takes longer than the 95th percentile of the time to complete.
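As a rough sketch of the failover idea above, the cache read can be given a time budget and the source used when that budget is exceeded. This version uses `Task.WhenAny` from the .NET Task Parallel Library rather than Rx's `Amb`, and it falls back sequentially instead of issuing both requests at once. `CacheFallback`, `GetAsync` and the delegate names are hypothetical.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of any Microsoft library.
public static class CacheFallback
{
    // Starts the cache read and waits at most `cacheBudget` for it.
    // If the cache answers successfully in time, its value is returned.
    // If it is too slow or fails, the value is read from the source instead.
    public static async Task<T> GetAsync<T>(
        Func<Task<T>> readFromCache,
        Func<Task<T>> readFromSource,
        TimeSpan cacheBudget)
    {
        Task<T> cacheTask = readFromCache();
        Task first = await Task.WhenAny(cacheTask, Task.Delay(cacheBudget));

        if (first == cacheTask && cacheTask.Status == TaskStatus.RanToCompletion)
        {
            return cacheTask.Result;
        }

        // Observe a late cache failure so it is not left as an unobserved exception.
        Task ignored = cacheTask.ContinueWith(
            t => { var observed = t.Exception; },
            TaskContinuationOptions.OnlyOnFaulted);

        return await readFromSource();
    }
}
```

A synchronous call such as `cache.Get(key)` would need to be wrapped, for example with `Task.Run(() => cache.Get(key))`, before being passed as `readFromCache`. The budget is a design choice: a value near the usual upper latency of a healthy cache read keeps most hits on the cache path while capping the damage of a 21-second connection timeout.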