none
Troubleshooting AppFabric Scaling Issues (Intermittent ErrorCode<ERRCA0017>:SubStatus<ES0006> Errors)

    Question

  • We've implemented AppFabric Windows Server Cache for our web application.  Initially, we were able to use the cache without any issues.  We then increased traffic roughly 100 fold, and began experiencing intermittent exceptions.  The exceptions occur roughly once per 2 days, for about a minute at a time.

    Our configuration:
      • 9 web servers inserting/retrieving objects in cache:
                • Mostly temporal 500 byte operational type objects
                • Using 1 named region
                • Objects stored with tags
                • Retrieved in bulk for a given tag
              • Cache Cluster:
                • 1 host (lead) AppFabric 1.1 (version reported by get-cachehost is 3)
                • SQL configuration provider
                • 96GB of RAM on host, the default 50% (48GB) allocated to AppFabric
                • Cache Host Config:
                <?xml version="1.0" encoding="utf-8"?>
                <configuration>
                    <configSections>
                        <section name="dataCache" type="Microsoft.ApplicationServer.Caching.DataCacheSection, Microsoft.ApplicationServer.Caching.Core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35" />
                    </configSections>
                    <dataCache size="Small">
                        <caches partitionCount="32">
                            <cache consistency="StrongConsistency" name="AdMatter" minSecondaries="0">
                                <policy>
                                    <eviction type="Lru" />
                                    <expiration defaultTTL="10" isExpirable="true" />
                                </policy>
                            </cache>
                            <cache consistency="StrongConsistency" name="default" minSecondaries="0">
                                <policy>
                                    <eviction type="Lru" />
                                    <expiration defaultTTL="10" isExpirable="true" />
                                </policy>
                            </cache>
                        </caches>
                        <hosts>
                            <host replicationPort="22236" arbitrationPort="22235" clusterPort="22234"
                                hostId="1633955643" size="49145" leadHost="true" account="ADMATTER\SD-CCH-ADM-001$"
                                cacheHostName="AppFabricCachingService" name="SD-CCH-ADM-001.admatter.local"
                                cachePort="22233" />
                        </hosts>
                        <advancedProperties>
                            <securityProperties mode="None" protectionLevel="None" />
                            <transportProperties maxBufferPoolSize="16777216" maxBufferSize="16777216" receiveTimeout="900000" />
                        </advancedProperties>
                        <deploymentSettings>
                            <deploymentMode value="RoutingClient" />
                        </deploymentSettings>
                    </dataCache>
                </configuration>
                • Cache Client Config:
                <dataCacheClient>
                    <hosts>
                      <host name="SD-CCH-ADM-001.admatter.local" cachePort="22233" />
                    </hosts>
                    <localCache isEnabled="false" sync="TimeoutBased" objectCount="100000" ttlValue="300" />
                    <clientNotification pollInterval="60" />
                    <securityProperties mode="None" protectionLevel="None" />
                	<transportProperties connectionBufferSize="131072" maxBufferPoolSize="16777216" maxBufferSize="16777216"
                	                     maxOutputDelay="2" channelInitializationTimeout="60000" receiveTimeout="900000" />
                </dataCacheClient>

            The errors in the order that they occur (the exceptions are occur for each of the nine webservers during the 1 minute period):

    1.  System.Net.Sockets.SocketException : An existing connection was forcibly closed by the remote host
      • Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0016>:SubStatus<ES0001>:The connection was terminated, possibly due to server or network problems or serialized Object size is greater than MaxBufferSize on server. Result of the request is unknown. ---> System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '00:15:00'. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host --- End of inner exception stack trace --- at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result) at System.ServiceModel.Channels.FramingDuplexSessionChannel.EndReceive(IAsyncResult result) at Microsoft.ApplicationServer.Caching.WcfClientChannel.CompleteProcessing(IAsyncResult result) --- End of inner exception stack trace --- at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody) at Microsoft.ApplicationServer.Caching.DataCache.GetNextBatch(String region, DataCacheTag[] tags, GetByTagsOperation op, IMonitoringListener listener, Byte[][]& state, Boolean& more) at Microsoft.ApplicationServer.Caching.CacheEnumerator.MoveNext() at System.Linq.Enumerable.WhereSelectEnumerableIterator`2.MoveNext() at System.Linq.Enumerable.<ExceptIterator>d__99`1.MoveNext() at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection) at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source) 
    2. Microsoft.ApplicationServer.Caching.DataCacheException  :
      • Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.) at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody) at Microsoft.ApplicationServer.Caching.DataCache.GetNextBatch(String region, DataCacheTag[] tags, GetByTagsOperation op, IMonitoringListener listener, Byte[][]& state, Boolean& more) at Microsoft.ApplicationServer.Caching.CacheEnumerator.MoveNext() at System.Linq.Enumerable.WhereSelectEnumerableIterator`2.MoveNext() at System.Linq.Enumerable.<ExceptIterator>d__99`1.MoveNext() at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection) at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
    3. Microsoft.ApplicationServer.Caching.DataCacheException
      • Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out. at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody) at Microsoft.ApplicationServer.Caching.DataCache.GetNextBatch(String region, DataCacheTag[] tags, GetByTagsOperation op, IMonitoringListener listener, Byte[][]& state, Boolean& more) at Microsoft.ApplicationServer.Caching.CacheEnumerator.MoveNext() at System.Linq.Enumerable.WhereSelectEnumerableIterator`2.MoveNext() at System.Linq.Enumerable.<ExceptIterator>d__99`1.MoveNext() at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection) at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
    We have also created a tracelog session on the caching server to capture more information to diagnose the issue - any suggestions on how to analyze this would be appreciated (I can make this available if need be).

    We also monitored various AppFabric, CLR, and Network performance counters, below is a screenshot of the event as it occurs:

    AppFabric Perfmon

    Thanks in advance for any thoughts or advice you can share on resolving this issue.

    UPDATE 1

    The following are the exceptions occurring continuously on the AppFabric Caching Server during the intermittent errors (abstracted from tracelogs) :

    System.TimeoutException: Sending to via http://www.w3.org/2005/08/addressing/anonymous timed out after 00:00:15. The time allotted to this operation may have been a portion of a longer timeout. ---> System.TimeoutException: Cannot claim lock within the allotted timeout of 00:00:15. The time allotted to this operation may have been a portion of a longer timeout. --- End of inner exception stack trace --- at System.ServiceModel.Channels.FramingDuplexSessionChannel.OnSend(Message message, TimeSpan timeout) at System.ServiceModel.Channels.OutputChannel.Send(Message message, TimeSpan timeout) at Microsoft.ApplicationServer.Caching.ReplyContext.Reply(Message message, TimeSpan timeout)

    System.ServiceModel.CommunicationObjectFaultedException: The communication object, System.ServiceModel.Channels.ServerSessionPreambleConnectionReader+ServerFramingDuplexSessionChannel, cannot be used for communication because it is in the Faulted state. at System.ServiceModel.Channels.CommunicationObject.ThrowIfDisposedOrNotOpen() at System.ServiceModel.Channels.OutputChannel.Send(Message message, TimeSpan timeout) at Microsoft.ApplicationServer.Caching.ReplyContext.Reply(Message message, TimeSpan timeout)

    System.ServiceModel.CommunicationObjectAbortedException: The communication object, System.ServiceModel.Channels.ServerSessionPreambleConnectionReader+ServerFramingDuplexSessionChannel, cannot be used for communication because it has been Aborted. at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result) at System.ServiceModel.Channels.FramingDuplexSessionChannel.OnEndSend(IAsyncResult result) at Microsoft.ApplicationServer.Caching.ReplyContext.EndSend(IAsyncResult result)

    System.ServiceModel.CommunicationException: The socket connection was aborted because an asynchronous send to the socket did not complete within the allotted timeout of 00:00:00.0238080. The time allotted to this operation may have been a portion of a longer timeout. ---> System.ObjectDisposedException: The socket connection has been disposed. Object name: 'System.ServiceModel.Channels.SocketConnection'. --- End of inner exception stack trace --- at System.ServiceModel.Channels.SocketConnection.ThrowIfNotOpen() at System.ServiceModel.Channels.SocketConnection.BeginRead(Int32 offset, Int32 size, TimeSpan timeout, WaitCallback callback, Object state) at System.ServiceModel.Channels.SessionConnectionReader.BeginReceive(TimeSpan timeout, WaitCallback callback, Object state) at System.ServiceModel.Channels.SynchronizedMessageSource.ReceiveAsyncResult.PerformOperation(TimeSpan timeout) at System.ServiceModel.Channels.SynchronizedMessageSource.SynchronizedAsyncResult`1..ctor(SynchronizedMessageSource syncSource, TimeSpan timeout, AsyncCallback callback, Object state) at System.ServiceModel.Channels.FramingDuplexSessionChannel.BeginReceive(TimeSpan timeout, AsyncCallback callback, Object state) at Microsoft.ApplicationServer.Caching.WcfServerChannel.CompleteProcessing(IAsyncResult result)

    UPDATE 2

     - Based on this and this we increased maxConnectionsToServer to 3. As a result we gained 50% more client requests/sec as recorded by the AppFabric Caching:Cache perf counter, but the intermittent errors did not stop occuring

     - We increased the maxBufferSize and maxBufferPoolSize to 2147483647 (int32.max) on the Cache Server configuration. So far we are able to handle 300x traffic volume w/o errors. We will continue to increase traffic volume and monitor.

    Unfortunately the actions we've take thus far have not eliminated the intermittent errors -albeit at 500x traffic now.

    UPDATE 3

    Based on Jeff-ITGuy's recommendations we added two more hosts with 16GB each to the cluster and enabled HighAvailability mode (viaSecondaries=1). Currently the original host remains in the cluster with 96GB ram - all hosts havecacheSize = 12GB. On the cache clients we increase the MaxConnectionToServer to 12 (1 per core). Below are our findings:

    • Occasionally we get (once or twice every 10 minutes):
      • ErrorCode<ERRCA0017>:SubStatus<ES0005>:There is a temporary failure. Please retry later. (There was a contention on the store.)
      • ErrorCode<ERRCA0017>:SubStatus<ES0004>:There is a temporary failure. Please retry later. (Replication queue was full. This may happen during reconfiguration of cluster hosts.)
  • The original 96GB cache hosts still experiences 1 minute outages as described above. The new cache hosts haven't experienced the outage

    We plan to remove 80GB ram from the original cache host. More updates to follow.

  • Edited by cmacho Tuesday, September 11, 2012 1:29 AM Added detail
Wednesday, September 05, 2012 1:31 AM

Answers

  • You appear to be encountering an issue nearly identical to one I'm working with Microsoft at the moment. If it's the same issue, it is probably caused by GC taking a long time and causing delays in the response time for AppFabric. From your perf counters it looked like GC time shot up when you started getting the problem so it probably is the same issue.

    This issue is being investigated actively by Microsoft. In the meantime, in order to alleviate the problem (at least from our findings) you can run more servers with less memory (shrink the size of the memory space that GC is working against) and you can increase the RequestTimeout on your client. By default that is set to 15000 (15 secs) but we have tried raising it to 30000 which helped eliminate some of the issues. This is NOT a good long term solution in my opinion, just passing on information. I've seen the issue with servers having only 24gb memory (with 12gb for cache) and it only really got noticeably better when we tried 8gb servers with 4gb set to cache - that caused GC to do MUCH better.

    Hope that helps, but if this is the issue I think it is there's no solution yet.

    • Marked as answer by cmacho Wednesday, September 12, 2012 10:18 PM
    Thursday, September 06, 2012 7:00 PM

All replies

  • We are currently experiencing the exact same problem. We have used the AppFabric Cache in almost one year now without any problems, but the last couple of weeks we have been experiencing intermittent exceptions of error code ERRCA00017:ES0006 and ERRCA00018:ES0001. However, we have no increase in traffic, and the peak size for the last month is only 1.22 Mb. We are only using the cache for storing session state.

    This is getting a bit serious for us, since our users are losing data every time the exception is thrown.

    Thursday, September 06, 2012 10:03 AM
  • You appear to be encountering an issue nearly identical to one I'm working with Microsoft at the moment. If it's the same issue, it is probably caused by GC taking a long time and causing delays in the response time for AppFabric. From your perf counters it looked like GC time shot up when you started getting the problem so it probably is the same issue.

    This issue is being investigated actively by Microsoft. In the meantime, in order to alleviate the problem (at least from our findings) you can run more servers with less memory (shrink the size of the memory space that GC is working against) and you can increase the RequestTimeout on your client. By default that is set to 15000 (15 secs) but we have tried raising it to 30000 which helped eliminate some of the issues. This is NOT a good long term solution in my opinion, just passing on information. I've seen the issue with servers having only 24gb memory (with 12gb for cache) and it only really got noticeably better when we tried 8gb servers with 4gb set to cache - that caused GC to do MUCH better.

    Hope that helps, but if this is the issue I think it is there's no solution yet.

    • Marked as answer by cmacho Wednesday, September 12, 2012 10:18 PM
    Thursday, September 06, 2012 7:00 PM
  • Thanks for your reply!  Sounds like we could definitely experiencing the same shortcoming of AppFabric.  We are in the process of provisioning two more hosts with 16GB of RAM each and removing 80GB from the current host.  In the mean time we've tried the following:

      • Based on this and this we increased maxConnectionsToServer to 3. As a result we gained 50% more client requests/sec as recorded by the AppFabric Caching:Cache perf counter, but the intermittent errors did not stop occuring
      • We increased the maxBufferSize and maxBufferPoolSize to 2147483647 (int32.max) on the Cache Server configuration. So far we are able to handle 300x traffic volume w/o errors. We will continue to increase traffic volume and monitor.

    More updates to follow.


    Thanks,

    Friday, September 07, 2012 5:30 PM
  • @cmacho - Looking at your perf counter data, there is a big dip in object count just before the errors start happening. Looks like there is a big wave of object expirations that are happening at the same time. This in turn is causing a lot of GC to happen, hence the high %time in GC and the other issues.

    In addition to the workarounds you are considering, if there is a way you can stagger expiration of items, that might help as well.

    Thanks

    Jagan Peri-MSFT

    Sunday, September 09, 2012 7:00 AM
  • Yes, I neglected to mention changes to maxConnectionsToServer and the maxBufferSize that our team has made. We've also changed receiveTimeout on the transportproperties element as well - to 86400000 - although this may or may not help this particular issue.

    The maxConnectionsToServer we now have set to match the number of cores on our AppFabric servers - in our case, 6. That was a recommendation from Microsoft so you may want to try that for your value.

    Monday, September 10, 2012 1:04 PM
  • In regards to the error that was mentioned in Update #3:

    • Occasionally we get (once or twice every 10 minutes):
      • ErrorCode<ERRCA0017>:SubStatus<ES0005>:There is a temporary failure. Please retry later. (There was a contention on the store.)
      • ErrorCode<ERRCA0017>:SubStatus<ES0004>:There is a temporary failure. Please retry later. (Replication queue was full. This may happen during reconfiguration of cluster hosts.)

    We actually saw this error come in groups from one of our web applications. We have clusters of servers (50+ nodes) running this particular application and all of them were logging these errors in the event log when trying to add data to the app fabric cache. After reviewing a few weeks worth of event logs and documenting the times, we saw a pattern. These errors would come in groups about every other day. After thinking about what could be somewhat scheduled or automated to make this pattern happen, it came to me. I checked the application pool the application was assigned to, and the default was set for it to recycle every 1740 minutes - or 29 hours.

    We've removed the scheduled recycling on the application pool and so far, no errors. We will continue to monitor this for the next week to confirm the issue is resolved.

    Since I hadn't come across much researching this online, except for this thread, I figured I'd throw this out here in case anyone else can benefit from our experience.

    Monday, September 23, 2013 3:59 PM