AppFabric Caching Service service terminated unexpectedly after SQL restart

    Question

  • I have an AppFabric (RTM) installation in a test environment.  It's a three-node cache cluster with an ASP.NET cache-enabled application pointing at it.  All three cache hosts are running Enterprise edition, and they use SQL Server as the configuration provider.

    If the SQL instance is brought down for a couple of minutes and then back up again (which would effectively be the case in a Windows Clustering scenario when the active/passive nodes switch around - during patch management rollout for example) I see the DistributedCacheService.exe process crashing.

    The Service Control Manager reports:

    The AppFabric Caching Service service terminated unexpectedly.  It has done this x time(s).

    And there's an event from a source of "Microsoft-Windows Server AppFabric Caching" which states:

    AppFabric Caching service crashed.{Lease with external store expired: Microsoft.Fabric.Federation.ExternalRingStateStoreException: Renew lease failed . . . . . .

    There's also an event from the .NET Runtime source that states:

    Application: DistributedCacheService.exe
    Framework Version: v4.0.30319
    Description: The process was terminated due to an unhandled exception.
    Exception Info: Microsoft.ApplicationServer.Caching.ConfigStoreException
    Stack:
       at Microsoft.ApplicationServer.Caching.SqlServerCustomProvider.BeginTransaction()
       at Microsoft.ApplicationServer.Caching.ClusterConfigDictionaryReader.GetConfigs[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]](System.String)

    A SQL trace tells me that the caching hosts poll the SQL database every 60 seconds, but the crash occurs after the SQL instance is running again, not while it's down.

    The fact the .NET runtime says it's an unhandled exception would lead me to believe that it's a bug.

    Any thoughts?

    All help gratefully received.

    Mark

    Tuesday, May 31, 2011 12:58 PM

Answers

  • I've found that you can mitigate this problem by increasing the lease period.

    You'll need to stop the cache cluster and edit DistributedCacheService.exe.config on each of the cache hosts.
    This file is normally in %SystemRoot%\System32\AppFabric.
    In the section <configuration> <fabric> <section>, add the following key:

    <key name="ExternalRingStateUpdateTimeout" value="xxx" />

    Where xxx is a value in seconds.  There will be three attempts at lease renewal during this period, so if you set it to 360 (6 minutes), attempts will be made at minutes 2, 4 and 6.
    You can only afford one of these to fail (due to SQL being unavailable).  If two fail, it'll crash on the 3rd attempt regardless.
    So with a setting of 360 you can afford a ~3 minute SQL outage.  Setting it to 540 (9 minutes) means you can afford a ~5 minute outage.
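
    To get a feel for the renewal timing described above, here is a rough Python sketch.  It is only a simplified model, not AppFabric code: it assumes the three renewal attempts are evenly spaced across one lease period, and that an attempt fails only if it lands inside the SQL outage window.  The function names are made up for illustration.

```python
def renewal_times(timeout_s, attempts=3):
    """Seconds at which renewals are attempted, assuming even spacing."""
    step = timeout_s / attempts
    return [step * n for n in range(1, attempts + 1)]


def failed_renewals(timeout_s, outage_start_s, outage_len_s, attempts=3):
    """Count the renewal attempts that fall inside the outage window."""
    outage_end_s = outage_start_s + outage_len_s
    return sum(
        1
        for t in renewal_times(timeout_s, attempts)
        if outage_start_s <= t < outage_end_s
    )


def survives(timeout_s, outage_start_s, outage_len_s):
    """Assumed rule from the post: at most one failed renewal is tolerated."""
    return failed_renewals(timeout_s, outage_start_s, outage_len_s) <= 1


if __name__ == "__main__":
    print(renewal_times(360))       # attempts at 2, 4 and 6 minutes
    print(survives(360, 130, 180))  # 3 minute outage that hits one attempt
    print(survives(360, 100, 180))  # same length, but it hits two attempts
```

    In this model, whether a given outage is survivable depends on where it falls relative to the renewal attempts as well as on its length, so treat the "~3 minute" and "~5 minute" figures as approximate.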
    Wednesday, June 15, 2011 1:05 PM

All replies

  • I have seen this same behavior and have in fact reported it to Microsoft Premier Support. I was told it's "expected behavior", that if the DB is not accessible for 60 seconds the cache server will "crash itself" and (if you have your restart settings set properly) it will try to restart itself. I was pointed to this article, which essentially tells you how to set up clustering: http://msdn.microsoft.com/en-us/library/ee790826(WS.10).aspx.

    For what it's worth, from what I've heard there are better options/behavior coming in the next service pack/release of AppFabric.

    And also, IMHO that's not "expected behavior"; that's what we refer to as "a bug". ;-)

    Wednesday, June 01, 2011 7:53 PM