AppFabric Cache: getting ErrorCode<ERRCA0016>:SubStatus<ES0001> continuously

  • Question

  • Hello,

    We have a load-balanced server environment using AppFabric Cache with a cluster of 2 cache nodes. We're not using High Availability because of its Windows Server Enterprise requirement.

    Client configuration is:

     

    <dataCacheClient requestTimeout="60000" channelOpenTimeout="20000" maxConnectionsToServer="3">
     <hosts>
     <host name="xxx.xxx.xxx.xxx" cachePort="pppp" />
     <host name="xxx.xxx.xxx.zzz" cachePort="pppp"/>
     </hosts>
     <securityProperties mode="None" protectionLevel="None" />
     <localCache isEnabled="false" />
     <transportProperties maxOutputDelay="2"
        channelInitializationTimeout="60000"
        receiveTimeout="60000"
        maxBufferPoolSize="2147483647"
        maxBufferSize="2147483647"/>
    </dataCacheClient>
    

     

    Server configuration is:

     

    <?xml version="1.0" encoding="utf-8"?>
    <configuration>
     <configSections>
     <section name="dataCache" type="Microsoft.ApplicationServer.Caching.DataCacheSection, Microsoft.ApplicationServer.Caching.Core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35" />
     </configSections>
     <dataCache size="Small">
     <caches>
      <cache consistency="StrongConsistency" name="xxx1Cache">
      <policy>
       <eviction type="Lru" />
       <expiration defaultTTL="10" isExpirable="true" />
      </policy>
      </cache>
      <cache consistency="StrongConsistency" name="xxx2Cache">
      <policy>
       <eviction type="Lru" />
       <expiration defaultTTL="10" isExpirable="true" />
      </policy>
      </cache>
      <cache consistency="StrongConsistency" name="xxx3Cache">
      <policy>
       <eviction type="Lru" />
       <expiration defaultTTL="10" isExpirable="true" />
      </policy>
      </cache>
      <cache consistency="StrongConsistency" name="xxx4Cache">
      <policy>
       <eviction type="Lru" />
       <expiration defaultTTL="10" isExpirable="true" />
      </policy>
      </cache>
      <cache consistency="StrongConsistency" name="xxx5Cache">
      <policy>
       <eviction type="Lru" />
       <expiration defaultTTL="10" isExpirable="true" />
      </policy>
      </cache>
      <cache consistency="StrongConsistency" name="xxx6Cache">
      <policy>
       <eviction type="Lru" />
       <expiration defaultTTL="10" isExpirable="true" />
      </policy>
      </cache>
     </caches>
     <hosts>
      <host replicationPort="22236" arbitrationPort="22235" clusterPort="22234"
      hostId="1373089720" size="2041" leadHost="true" account="MG\FAB02$"
      cacheHostName="AppFabricCachingService" name="FAB02" cachePort="22233" />
      <host replicationPort="22236" arbitrationPort="22235" clusterPort="22234"
      hostId="2058316954" size="2041" leadHost="false" account="MG\FAB01$"
      cacheHostName="AppFabricCachingService" name="FAB01" cachePort="22233" />
     </hosts>
     <advancedProperties>
      <securityProperties mode="None" protectionLevel="None" />
      <transportProperties maxBufferPoolSize="2147483647" maxBufferSize="2147483647" />
     </advancedProperties>
     </dataCache>
    </configuration>
    

     

    We are using this method for retry-on-failure implementation:

     

    private void WithRetry(Action method)
    {
        int tryCount = 0;
        bool done = false;
        do
        {
            try
            {
                method();
                done = true;
            }
            catch (DataCacheException ex)
            {
                if (ex.ErrorCode == DataCacheErrorCode.KeyDoesNotExist)
                {
                    done = true;
                }
                else if ((ex.ErrorCode == DataCacheErrorCode.Timeout ||
                          ex.ErrorCode == DataCacheErrorCode.RetryLater ||
                          ex.ErrorCode == DataCacheErrorCode.ConnectionTerminated)
                         && tryCount < MaxTryCount)
                {
                    tryCount++;
                }
                else
                {
                    Global.Tracer.Error("Data Cache Error. CODE: " + ex.ErrorCode, ex);
                    throw;
                }
            }
        }
        while (!done);
    }
    

     

    And the ICache indexer property using it this way for example:

     

    public object this[string key]
    {
        get
        {
            object result = null;
            WithRetry(() =>
            {
                try
                {
                    result = cacheInstance.Get(key, this.regionName);
                }
                catch (DataCacheException ex)
                {
                    if (ex.ErrorCode != DataCacheErrorCode.RegionDoesNotExist) throw;
                }
            });
            return result;
        }
        set
        {
            WithRetry(() =>
            {
                cacheInstance.Put(key, value, this.regionName);
            });
        }
    }
    

     

    As far as we know this should be correct, but there are random Cache exceptions in the Event Log:

    Data Cache Error. CODE: 16 - Exception: Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0016>:SubStatus<ES0001>:The connection was terminated, possibly due to server or network problems or serialized Object size is greater than MaxBufferSize on server. Result of the request is unknown.
       at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody)
       at Microsoft.ApplicationServer.Caching.DataCache.InternalGet(String key, DataCacheItemVersion& version, String region)
       at Microsoft.ApplicationServer.Caching.DataCache.Get(String key, String region)


    ...

    There are about 10-20 of the above cache errors per day in our logs.

    Does anybody know what the cause of these errors is? Are they by design? Or did we miss something in the configs?



    Monday, August 8, 2011 8:27 AM

Answers

  • We were in contact by email and worked out a solution for this problem. Let me quote it for you:

    "Based on the current observations, could you try out couple of things :
     
    1. Remove “maxConnectionsToServer = 5” from client configuration.
    2. Add <transportProperties receiveTimeout="900000" /> both in the client configuration and the cluster configuration.
     
    Let me know if that helps."

    --Akshat

    It helped. Thanks.
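
    For readers applying this advice, here is a rough sketch of how the client configuration from the question might look afterwards. It assumes all other attributes stay as originally posted; note that the posted config had maxConnectionsToServer="3" (not 5) and receiveTimeout="60000". Host names and ports are the placeholders from the question.

```xml
<!-- Sketch only: the question's client config with the two suggested changes applied. -->
<!-- maxConnectionsToServer has been removed, so the client uses its default value. -->
<dataCacheClient requestTimeout="60000" channelOpenTimeout="20000">
  <hosts>
    <host name="xxx.xxx.xxx.xxx" cachePort="pppp" />
    <host name="xxx.xxx.xxx.zzz" cachePort="pppp" />
  </hosts>
  <securityProperties mode="None" protectionLevel="None" />
  <localCache isEnabled="false" />
  <!-- receiveTimeout raised from 60000 to 900000, as in the quoted advice. -->
  <!-- Keeping the other transport attributes is an assumption, not part of the advice. -->
  <transportProperties maxOutputDelay="2"
      channelInitializationTimeout="60000"
      receiveTimeout="900000"
      maxBufferPoolSize="2147483647"
      maxBufferSize="2147483647" />
</dataCacheClient>
```

    On the cluster side, the same receiveTimeout attribute would presumably be added to the existing <transportProperties> element under <advancedProperties> in the server configuration posted in the question.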

    • Marked as answer by unbornchikken Wednesday, September 28, 2011 7:04 AM
    Wednesday, September 28, 2011 7:04 AM

All replies

  • We have an application experiencing similar connection-terminated behavior, with all buffer sizes likewise set to int.MaxValue, in a cluster where other applications apparently do not exhibit the same problem.
    Microsoft Community Contributor MCTS: .NET 4.0 Service Communication Applications
    Monday, August 8, 2011 2:56 PM
  • This is really a serious problem. We had to move to a much slower SQL-based ASP.NET session provider, because the above errors affect the AF Session Provider as well. Maybe it is related to the number of cache requests per second?

    Could increasing the maxConnectionsToServer setting resolve it? What is the recommended value for this scenario?

    Tuesday, August 9, 2011 7:37 AM
  • Our issue appears to have been related to channel initialization and receive timeout values in the data cache client config. Try increasing them.
    Microsoft Community Contributor MCTS: .NET 4.0 Service Communication Applications
    Tuesday, August 9, 2011 12:41 PM
  • You can see that our client and server configurations set those values to the maximum allowed amount.
    Tuesday, August 9, 2011 2:34 PM
  • Can you post the size of your largest object?
    Thursday, August 11, 2011 1:33 PM
  • Those are session and small cacheable business objects. Only a few kilobytes maximum per entry.

    Thursday, August 11, 2011 2:02 PM
  • Can you try without setting the receiveTimeout in the client configuration? Is the cache sparsely used or used pretty frequently? What is the approximate number of requests per second?

     

     

    • Marked as answer by unbornchikken Monday, August 15, 2011 7:20 AM
    • Unmarked as answer by unbornchikken Monday, September 12, 2011 7:23 AM
    Thursday, August 11, 2011 7:53 PM
  • Hi,

    Thanks for trying to help us! We've started to consider buying an NCache licence because of the errors above. Our site is pretty unstable, you know. :S

    OK, this morning (in Hungary) I'll give removing the receiveTimeout setting a try. Hope it helps; I'm going to check the logs at the end of the workday and reply!

    The cache is used about as frequently as possible, because we're using AppFabric Cache and Session (we moved to SQL Session recently because of the AF errors) on a popular website which has gigabytes of data to show.







    • Marked as answer by unbornchikken Monday, August 15, 2011 7:20 AM
    • Unmarked as answer by unbornchikken Monday, August 15, 2011 7:20 AM
    Friday, August 12, 2011 7:25 AM
  • Hi,

     

    Let us know if the issue still exists. We'll drill down deeper if the problem still persists.

     

    Thanks,

    Akshat

     

    Friday, August 12, 2011 11:54 AM
  • It works. The whole weekend has passed with no errors.

    Could you please explain what the cause of this was? As far as I can tell, a high receive timeout should not affect connections or data download performance.

    Monday, August 15, 2011 7:20 AM
  • I have the same problem as you, and it is very serious for my program. My largest object is only about 2,516 KB (roughly 2.5 MB).

    Here is my configuration on the server:

     

     

      <dataCacheClient channelOpenTimeout="20000" name="default">
        <transportProperties maxOutputDelay="2" channelInitializationTimeout="60000" maxBufferPoolSize="2147483647" maxBufferSize="2147483647">
        </transportProperties>
        <hosts>
          <host name="xxxx.xxxx.xxxx" cachePort="xxxxx" />
        </hosts>
        <securityProperties mode="Message">
          <messageSecurity authorizationInfo="xxxxxx">
          </messageSecurity>
        </securityProperties>
      </dataCacheClient>

     

    But I frequently get this error:

    ErrorCode<ERRCA0016>:SubStatus<ES0001>:The connection was terminated, possibly due to server or network problems or serialized Object size is greater than MaxBufferSize on server. Result of the request is unknown.
       at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody)
       at Microsoft.ApplicationServer.Caching.DataCache.InternalPut(String key, Object value, DataCacheItemVersion oldVersion, TimeSpan timeout, DataCacheTag[] tags, String region, IMonitoringListener listener)
       at Microsoft.ApplicationServer.Caching.DataCache.<>c__DisplayClass25.<Put>b__24()
       at Microsoft.ApplicationServer.Caching.DataCache.Put(String key, Object value, TimeSpan timeout)

    My cache on Windows Azure is 4 GB.

    Does anyone have a solution to overcome this issue? If not, I am considering using another cache instead of the Windows Azure cache, because it is unstable.

     

     

     



    Saturday, August 20, 2011 4:14 AM
  • Dear Akshat,

    I have to reopen this thread, because the errors described above are in our logs again.

    I removed the receiveTimeout setting. Just for the record, the first version of the config declared it as 60000 instead of 600000 because of a typo. I tried to set everything to the defaults manually.

    The same errors happen even when there is only one cache server in the cluster.

    The cache is used pretty frequently. The number of requests per second is about 1000 or higher, because we're also using the AF cache as the session provider. But I think that is not a good idea, because we're getting 500 errors because of cache failures.

    Monday, September 12, 2011 7:23 AM
  • For the cache on Azure, please try the steps described at http://blogs.msdn.com/b/akshar/archive/2011/05/01/azure-appfabric-caching-errorcode-lt-errca0017-gt-substatus-lt-es0006-gt-what-to-do.aspx and make sure that you are using the latest version of the Azure AppFabric SDK.

    Also is your application deployed in the same data center as your cache? For any issues feel free to escalate an issue at :

    https://support.microsoft.com/oas/default.aspx?gprid=14924&st=1&wfxredirect=1&sd=gn

     

    Thanks,

    Akshat

     

     

    Monday, September 12, 2011 7:36 AM
  • But we are using Windows Server AppFabric. Please find my first post again.
    Monday, September 12, 2011 10:53 AM
  • Hi,

     

    Sorry for the confusion; my previous reply was meant for the other post that I saw on this thread. Has there been any modification to any configs or client code? I believe your servers were running fine previously.

    Would it be possible for you to collect the traces (both client and server side) for a time window when you see the exception, and share the download details with akshar<.a-t.>microsoft.com? I could informally help you figure out the issue. If you think the information might be confidential, please contact customer support for AppFabric via the official channel.

     

    Details for collecting the traces are here : http://msdn.microsoft.com/en-us/library/ff921010.aspx

     

    Thanks,

    Akshat

    Monday, September 12, 2011 11:36 AM
  • My mistake. Ok, I'm starting to collect traces right now. Thanks for your help!
    Monday, September 12, 2011 2:13 PM
  • Forgot to reply about this:

    I believe your servers were running fine previously.

    They were, but there wasn't too much traffic. Right now the site is running, cache hits happen very frequently, and there have been about 1-10 errors per day recently.
    Monday, September 12, 2011 2:38 PM