DistributedCacheService.exe crashing with System.Runtime.CallbackException-System.NullReferenceException

  • Question

  • I have two distributed cache servers clustered, and both are showing DistributedCacheService.exe crashing at unpredictable times, on average 4-7 times per week on each server. The Windows Server 2012 R2 event logs show a pattern of event IDs 1026, 1000, and 7031:

    Log Name:      Application

    Source:        .NET Runtime

    Event ID:      1026

    Description:

    Application: DistributedCacheService.exe

    Framework Version: v4.0.30319

    Description: The process was terminated due to an unhandled exception.

    Exception Info: System.Runtime.CallbackException

    Log Name:      Application

    Source:        Application Error

    Event ID:      1000

    Description:

    Faulting application name: DistributedCacheService.exe, version: 1.0.4632.0, time stamp: 0x4eafeccf

    Faulting module name: KERNELBASE.dll, version: 6.3.9600.17055, time stamp: 0x532954fb

    Exception code: 0xe0434352

    Fault offset: 0x0000000000005bf8

    Log Name:      System

    Source:        Service Control Manager

    Event ID:      7031

    Description:

    The AppFabric Caching Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.

    The full crash dump created by WER shows a System.Runtime.CallbackException with an inner exception of a System.NullReferenceException as the cause of the crash.

    System.Runtime.CallbackException   

    Async Callback threw an exception.

       System.Runtime.AsyncResult.Complete(Boolean)

       System.Runtime.AsyncResult.AsyncCompletionWrapperCallback(System.IAsyncResult)

       System.Runtime.Fx+AsyncThunk.UnhandledExceptionFrame(System.IAsyncResult)

       System.Runtime.AsyncResult.Complete(Boolean)

       System.ServiceModel.Channels.ServerSessionPreambleConnectionReader+ServerFramingDuplexSessionChannel+OpenAsyncResult.OnWriteUpgradeResponse(System.Object)

       System.Runtime.Fx+WaitThunk.UnhandledExceptionFrame(System.Object)

       System.ServiceModel.Channels.SocketConnection.OnSendAsync(System.Object, System.Net.Sockets.SocketAsyncEventArgs)

       System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)

       System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)

       System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)

       System.Net.Sockets.SocketAsyncEventArgs.FinishOperationSuccess(System.Net.Sockets.SocketError, Int32, System.Net.Sockets.SocketFlags)

       System.Net.Sockets.SocketAsyncEventArgs.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)

       System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)

       

    System.NullReferenceException

    Object reference not set to an instance of an object.

       System.Runtime.AsyncResult.End[[System.__Canon, mscorlib]](System.IAsyncResult)

       System.ServiceModel.Channels.CommunicationObject.EndOpen(System.IAsyncResult)

       Microsoft.ApplicationServer.Caching.WcfServerChannel.OnOpen(System.IAsyncResult)

       System.Runtime.AsyncResult.Complete(Boolean)

       

    I opened the dump in windbg.exe and loaded the SOS extension to probe deeper (!do) into the stack trace . . .

    • Microsoft.ApplicationServer.Caching.WcfServerChannel
    • System.NullReferenceException
    • System.String    Async Callback threw an exception.
    • System.Runtime.CallbackException
    • System.NullReferenceException
    • System.String    Async Callback threw an exception.
    • System.Runtime.CallbackException
    • System.NullReferenceException
    • System.ServiceModel.Channels.BufferedConnection
    • System.ServiceModel.Channels.InitialServerConnectionReader+UpgradeConnectionAsyncResult
    • System.NullReferenceException
    • System.NullReferenceException
    • System.Runtime.Fx+AsyncThunk
    • System.ServiceModel.Channels.ConnectionStream
    • System.Runtime.CallbackException
    • System.Runtime.CallbackException
    • System.Runtime.Fx+WaitThunk
    • System.NullReferenceException
    • System.NullReferenceException
    • System.Runtime.IOThreadTimer+TimerManager
    • System.Runtime.CallbackException
    • System.Net.FixedSizeReader
    • System.ServiceModel.Channels.SocketConnection

    . . . but cannot figure out what precisely is null.

    I can see from the Installed Updates page that it has both AppFabric v1.1 CU1 (kb 2671763) and CU4 (kb 2800726 and version 1.1.2016.32). So installing CU5 (http://support.microsoft.com/en-us/kb/2932678) is something we can do. Does anyone have any additional ideas for me? 

    Also, these servers are probably running on VMware hosts that are not using reserved memory (they are using dynamic memory), so I may try to move to physical servers.

    When I run !do on the System.Net.Sockets.Socket object just prior to the exception/crash, I can see that the remote endpoint is using ephemeral port 57284 as part of the System.Net.IPEndPoint, but I can't coax the IP address out of the dump. My debugging kung fu isn't that good.
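
    As a cross-check outside the dump, while the service is up, something like the following could list who is attached to the cache port and who owns a given ephemeral port. This is only a sketch and assumes the in-box NetTCPIP module on Server 2012 R2; 22233 is the cache port and 57284 is just the example port seen in the dump.

    # Sketch: live connections on the AppFabric cache port, and the owner of one ephemeral port.
    Get-NetTCPConnection -LocalPort 22233 -ErrorAction SilentlyContinue |
        Select-Object LocalAddress, LocalPort, RemoteAddress, RemotePort, State

    Get-NetTCPConnection -RemotePort 57284 -ErrorAction SilentlyContinue |
        Select-Object LocalAddress, LocalPort, RemoteAddress, RemotePort, State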

    !dae consistently shows a zillion exceptions that might be related and don't look too healthy. What could cause communication problems like this?

     
    1 exceptions: 0x0000000085ecb038
        In Generation: 1 from .NET v4.0.30319.34014
        HResult: 0x80131501
        Type: System.Runtime.CallbackException
        Message: Async Callback threw an exception.
          Inner Exception: 0x00000001860a3708
        Stack Trace:
        SP               IP               Function
        000000001dc89ea0 00007ffe984cf38e System.Runtime.AsyncResult.Complete(Boolean)
        000000001dc8e420 00007ffe98438f82 System.Runtime.AsyncResult.AsyncCompletionWrapperCallback(System.IAsyncResult)
        000000001dc8e4a0 00007ffe984d0abd System.Runtime.Fx+AsyncThunk.UnhandledExceptionFrame(System.IAsyncResult)
        000000001dc8e4f0 00007ffe984cf31c System.Runtime.AsyncResult.Complete(Boolean)
        000000001dc8e5a0 00007ffe97c83fae System.ServiceModel.Channels.ServerSessionPreambleConnectionReader+ServerFramingDuplexSessionChannel+OpenAsyncResult.OnWriteUpgradeResponse(System.Object)
        000000001dc8e620 00007ffe984d37d1 System.Runtime.Fx+WaitThunk.UnhandledExceptionFrame(System.Object)
        000000001dc8e670 00007ffe979fc2ab System.ServiceModel.Channels.SocketConnection.OnSendAsync(System.Object, System.Net.Sockets.SocketAsyncEventArgs)
        000000001dc8e6d0 00007ffe9d488355 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
        000000001dc8e830 00007ffe9d4880c9 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
        000000001dc8e860 00007ffe9d4880a7 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
        000000001dc8e8b0 00007ffe9c519d16 System.Net.Sockets.SocketAsyncEventArgs.FinishOperationSuccess(System.Net.Sockets.SocketError, Int32, System.Net.Sockets.SocketFlags)
        000000001dc8ea10 00007ffe9c5190c0 System.Net.Sockets.SocketAsyncEventArgs.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
        000000001dc8ea60 00007ffe9d525db6 System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)

    1 exceptions: 0x000000008acb8d28
        In Generation: 0 from .NET v4.0.30319.34014
        HResult: 0x80004005
        Type: System.Net.Sockets.SocketException
        Message: An existing connection was forcibly closed by the remote host
        Stack Trace:

    1 exceptions: 0x00000001860a3708
        In Generation: 1 from .NET v4.0.30319.34014
        HResult: 0x80004003
        Type: System.NullReferenceException
        Message: Object reference not set to an instance of an object.
        Stack Trace:
        SP               IP               Function
        000000001dc8e250 00007ffe984cf8aa System.Runtime.AsyncResult.End[[System.__Canon, mscorlib]](System.IAsyncResult)
        000000001dc8e290 00007ffe9783c750 System.ServiceModel.Channels.CommunicationObject.EndOpen(System.IAsyncResult)
        000000001dc8e2c0 00007ffe403846d7 Microsoft.ApplicationServer.Caching.WcfServerChannel.OnOpen(System.IAsyncResult)
        000000001dc8e370 00007ffe9843580a System.Runtime.AsyncResult.Complete(Boolean)

    2 exceptions: 0x000000007fff12c8 0x000000007fff1368
        In Generation: 2 from .NET v4.0.30319.34014
        HResult: 0x80131530
        Type: System.Threading.ThreadAbortException
        Message: <null>
        Stack Trace:

    3 exceptions: 0x000000018aeab718 0x000000028a2ee0d8 0x000000038c1b62e0
        In Generation: 0 from .NET v4.0.30319.34014
        HResult: 0x80131501
        Type: System.ServiceModel.CommunicationException
        Message: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '00:10:00'.
          Inner Exception: 0x000000018aeab578
        Stack Trace:

    633 exceptions: 0x00000000834137d8 0x0000000085e5d6d0 0x0000000085e5dda8 0x0000000085e5e6d0 0x0000000085e827d0 0x0000000085e88be0 0x0000000085e90be0 0x0000000085e91558 0x0000000085ea5ca0 0x0000000085ec0920 ...
        In mixed Generations from .NET v4.0.30319.34014
        HResult: 0x80131620
        Type: System.IO.IOException
        Message: The read operation failed, see inner exception.
          Inner Exception: 0x0000000083413600
        Stack Trace:
        SP               IP               Function
        0000000021dac1e0 00007ffe9cd7b401 System.Net.Security.NegotiateStream.EndRead(System.IAsyncResult)
        0000000021dac220 00007ffe97038eb0 System.ServiceModel.Channels.StreamConnection.EndRead()

    654 exceptions: 0x000000008340d850 0x0000000085e047c8 0x0000000085e055b0 0x0000000085e12d48 0x0000000085e1b570 0x0000000085e286f8 0x0000000085e36a10 0x0000000085e3c850 0x0000000085e488b0 0x0000000085e50208 ...
        In mixed Generations from .NET v4.0.30319.34014
        HResult: 0x80004005
        Type: System.Net.Sockets.SocketException
        Message: An existing connection was forcibly closed by the remote host
        Stack Trace:
        SP               IP               Function
        0000000021dae760 00007ffe9803b0aa System.ServiceModel.Channels.SocketConnection.HandleReceiveAsyncCompleted()
        0000000021dae7a0 00007ffe9703b5e4 System.ServiceModel.Channels.SocketConnection.OnReceiveAsync(System.Object, System.Net.Sockets.SocketAsyncEventArgs)

    1284 exceptions: 0x0000000083413600 0x0000000083413930 0x0000000085e36e08 0x0000000085e4bc98 0x0000000085e50600 0x0000000085e5ade8 0x0000000085e5d4f8 0x0000000085e5d828 0x0000000085e5db80 0x0000000085e5df00 ...
        In mixed Generations from .NET v4.0.30319.34014
        HResult: 0x80131501
        Type: System.ServiceModel.CommunicationException
        Message: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '00:10:00'.
          Inner Exception: 0x000000008340d850
        Stack Trace:
        SP               IP               Function
        0000000021dae5b0 00007ffe984cf8aa System.Runtime.AsyncResult.End[[System.__Canon, mscorlib]](System.IAsyncResult)
        0000000021dae5f0 00007ffe97c7fc83 System.ServiceModel.Channels.ConnectionStream+ReadAsyncResult.End(System.IAsyncResult)
        0000000021dae620 00007ffe9c516890 System.Net.FixedSizeReader.ReadCallback(System.IAsyncResult)

     

    Thanks!


    • Edited by C. T. Haun Tuesday, March 31, 2015 3:39 PM more info
    Friday, March 27, 2015 7:35 PM

All replies

  • Did you ever find a solution for this?

    or any idea where to start looking for the error?

    I have the same issue on our SharePoint/Project 2013 server.

    Tuesday, May 26, 2015 7:31 AM
  • Here are the actions I have taken so far.

    First, I added two different servers as distributed cache hosts (Add-SPDistributedCacheServiceInstance) and removed the problematic servers (Remove-SPDistributedCacheServiceInstance) as cache hosts. The new servers were physical servers (no need to worry about the VMware team leaving the virtual servers running with dynamic/unreserved memory like they like to do). The new servers had more RAM and no problems with low memory conditions. I also updated the cache size to about 15% of our RAM on the new servers (Update-SPDistributedCacheSize -CacheSizeInMB 3072). I also increased three settings, based on some concerns seen in the ULS logs, with these cmdlets:

    $DLTC = Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache

    $DLTC.requestTimeout = "4500"

    $DLTC.channelOpenTimeOut = "4500"

    $DLTC.MaxBufferSize = "40000000"

    Set-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache $DLTC
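
    To double-check that the new values stuck (just a quick sanity check, assuming the SharePoint snap-in is loaded in the same elevated shell):

    # Sketch: confirm the DistributedLogonTokenCache client settings after the change.
    Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache |
        Format-List RequestTimeout, ChannelOpenTimeOut, MaxBufferSize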

    Second, I upgraded from AppFabric CU4 to CU5. I didn't know CU 6 was out. I'm planning to upgrade to CU6 soon.

    Third, I set all the recommended exclusions for our antivirus scanner per http://support.microsoft.com/kb/952167 ("Certain folders may have to be excluded from antivirus scanning when you use a file-level antivirus program in SharePoint.") I was very cautious and thorough about this. In addition to those exclusions, I added the additional exclusion for C:\Program Files\AppFabric 1.1 for Windows Server\DistributedCacheService.exe.

    Fourth, since these crashes seemed to be related to communications problems, I upgraded the NIC drivers and firmware from 2012 versions to 2014 versions.

    The above actions definitely helped immensely. But the crashes of distributedcacheservice.exe/AppFabric continued happening every few days. And distributedcacheservice.exe is still showing signs (in the crash dumps with !dae and in DebugDiag CLR exception logging) of many System.ServiceModel.CommunicationException and System.Net.Sockets.SocketException errors.

    We are still seeing these events every few days:


    Symptoms/Errors

    Log Name:      System
    Source:        Service Control Manager
    Date:          7/26/2015 3:06:12 AM
    Event ID:      7031
    Task Category: None
    Level:         Error
    Description:
    The AppFabric Caching Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.


    Log Name:      Application
    Source:        .NET Runtime
    Date:          7/26/2015 3:05:35 AM
    Event ID:      1026
    Task Category: None
    Level:         Error
    Keywords:      Classic
    Description:
    Application: DistributedCacheService.exe
    Framework Version: v4.0.30319
    Description: The process was terminated due to an unhandled exception.
    Exception Info: System.Runtime.CallbackException
    Stack:
       at System.Runtime.Fx+WaitThunk.UnhandledExceptionFrame(System.Object)
       at System.ServiceModel.Channels.SocketConnection.OnSendAsync(System.Object, System.Net.Sockets.SocketAsyncEventArgs)
       at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
       at System.Net.Sockets.SocketAsyncEventArgs.FinishOperationSuccess(System.Net.Sockets.SocketError, Int32, System.Net.Sockets.SocketFlags)
       at System.Net.Sockets.SocketAsyncEventArgs.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
       at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)


    Log Name:      Application
    Source:        Application Error
    Date:          7/26/2015 3:05:37 AM
    Event ID:      1000
    Task Category: (100)
    Level:         Error
    Keywords:      Classic
    Description:
    Faulting application name: DistributedCacheService.exe, version: 1.0.4632.0, time stamp: 0x4eafeccf
    Faulting module name: KERNELBASE.dll, version: 6.3.9600.17415, time stamp: 0x54505737
    Exception code: 0xe0434352
    Fault offset: 0x0000000000008b9c
    Faulting process id: 0x45b4
    Faulting application start time: 0x01d0c59602fa61e8
    Faulting application path: C:\Program Files\AppFabric 1.1 for Windows Server\DistributedCacheService.exe
    Faulting module path: C:\Windows\system32\KERNELBASE.dll
    Report Id: 18ab034b-336d-11e5-80e8-0017a477105c

    Log Name:      Application
    Source:        Windows Error Reporting
    Date:          7/26/2015 3:05:43 AM
    Event ID:      1001
    Task Category: None
    Level:         Information
    Keywords:      Classic
    Description:
    Fault bucket , type 0
    Event Name: CLR20r3
    Response: Not available
    Cab Id: 0

    Problem signature:
    P1: DistributedCacheService.exe
    P2: 1.0.4632.0
    P3: 4eafeccf
    P4: System
    P5: 4.0.30319.34239
    P6: 53e4531e
    P7: 2b90
    P8: 7b
    P9: System.Runtime.CallbackException
    P10:

    Attached files:
    C:\Users\SVCSPIPService\AppData\Local\Temp\WER8B25.tmp.appcompat.txt
    C:\Users\SVCSPIPService\AppData\Local\Temp\WER8BD2.tmp.WERInternalMetadata.xml
    C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_DistributedCache_bda0ea38d8f2df8f62f824f859e966bf8a7d71f1_95ad5b95_cab_77698be0\triagedump.dmp
    C:\Users\SVCSPIPService\AppData\Local\Temp\WER8BE2.tmp.WERDataCollectionFailure.txt
    WERGenerationLog.txt

    These files may be available here:
    C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_DistributedCache_bda0ea38d8f2df8f62f824f859e966bf8a7d71f1_95ad5b95_cab_77698be0

    Analysis symbol:
    Rechecking for solution: 0
    Report Id: 18ab034b-336d-11e5-80e8-0017a477105c
    Report Status: 4

    The distributedcacheservice.exe process is also complaining very frequently that the remote host seems to be closing its socket connection forcibly and prematurely. This leads to thousands of .NET exceptions in the distributedcacheservice.exe process and eventually to a hard crash of the process.


    The last distributed cache crash left a crash dump behind that sheds a small amount of light on what is causing the crash. The activity that led to the crash looks like normal work: Microsoft.ApplicationServer.Caching.WcfServerChannel was trying to make or maintain a socket connection between net.tcp://{servername.domainname}.com:22233 and some unknown address on a normal ephemeral port. I can see the port but I cannot fish out the remote endpoint. Either it shows up blank like this:

    • 0000000000000000 m_RemoteEndPoint
    • 0000000000000000 m_SocketAddress

    Or it shows up unintelligible like this:

    • 00000000c58ef1f0 m_RemoteEndPoint
    • 00000000c58ef190 m_Address
    • 2761342986 m_Address

    I was able to fish one IP address out, and it seemed local: 0.0.0.0 (the unspecified/any address). But I’m not sure if that corresponds to the remote endpoint. Maybe the communication was from one socket on Web1 to another socket on Web1.
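
    For what it's worth, the raw m_Address value can be turned back into a readable address from PowerShell. This is only a sketch, and it assumes the field holds an IPv4 address packed in network byte order (lowest byte = first octet), which is how System.Net.IPAddress stores it:

    # Decode the raw m_Address long fished out of the dump (2761342986 in the output above).
    $raw = 2761342986
    (New-Object System.Net.IPAddress ([int64]$raw)).IPAddressToString
    # If the byte-order assumption holds, this prints 10.192.150.164.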


    The worker thread that caused the crash had four complaints as it began its tailspin into a crashing death:

    1.

    Type: System.ServiceModel.CommunicationException
    Message: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '00:10:00'.

    2.

    Type: System.Net.Sockets.SocketException
    Message: An existing connection was forcibly closed by the remote host

    3.

        Type: System.NullReferenceException
        Message: Object reference not set to an instance of an object.
        Stack Trace:
        SP               IP               Function
        0000000027d8e260 00007ff94e72f8da System.Runtime.AsyncResult.End[[System.__Canon, mscorlib]](System.IAsyncResult)
        0000000027d8e2a0 00007ff9505fb1f0 System.ServiceModel.Channels.CommunicationObject.EndOpen(System.IAsyncResult)
        0000000027d8e2d0 00007ff8fdf11767 Microsoft.ApplicationServer.Caching.WcfServerChannel.OnOpen(System.IAsyncResult)
        0000000027d8e380 00007ff94e69580a System.Runtime.AsyncResult.Complete(Boolean)

    4.
        Type: System.Runtime.CallbackException
        Message: Async Callback threw an exception.
        Stack Trace:
        SP               IP               Function
        0000000027d89ef0 00007ff94e72f3be System.Runtime.AsyncResult.Complete(Boolean)
        0000000027d8e430 00007ff94e698f82 System.Runtime.AsyncResult.AsyncCompletionWrapperCallback(System.IAsyncResult)
        0000000027d8e4b0 00007ff94e730aed System.Runtime.Fx+AsyncThunk.UnhandledExceptionFrame(System.IAsyncResult)
        0000000027d8e500 00007ff94e72f34c System.Runtime.AsyncResult.Complete(Boolean)
        0000000027d8e5b0 00007ff950a42b3e System.ServiceModel.Channels.ServerSessionPreambleConnectionReader+ServerFramingDuplexSessionChannel+OpenAsyncResult.OnWriteUpgradeResponse(System.Object)
        0000000027d8e630 00007ff94e733801 System.Runtime.Fx+WaitThunk.UnhandledExceptionFrame(System.Object)
        0000000027d8e680 00007ff9507badbb System.ServiceModel.Channels.SocketConnection.OnSendAsync(System.Object, System.Net.Sockets.SocketAsyncEventArgs)
        0000000027d8e6e0 00007ff95b0039a5 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
        0000000027d8e840 00007ff95b003719 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
        0000000027d8e870 00007ff95b0036f7 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
        0000000027d8e8c0 00007ff95a089d16 System.Net.Sockets.SocketAsyncEventArgs.FinishOperationSuccess(System.Net.Sockets.SocketError, Int32, System.Net.Sockets.SocketFlags)
        0000000027d8ea20 00007ff95a0890c0 System.Net.Sockets.SocketAsyncEventArgs.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
        0000000027d8ea70 00007ff95b078de6 System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)

    My knee-jerk ideas about this were . . .

    We could try to figure out how to increase the local socket timeout but that’s not going to help if the remote host is in fact forcibly closing the connection.

    I can set up some WCF logging, but I’m not sure it would tell us much. It might tell us which server or IP is the host doing the forcible closure of the socket. But the logging could have an impact on performance, and I don’t want to just leave it running on one or both distributed cache servers 24x7 until the next solid crash.

    I'm running a perfmon capture that includes the Network Interface object and the "Packets Outbound Errors" counter so we can see if there are any errors that might cause this problem. If there are zero packets outbound errors, we can ignore the network card settings. If there are two or more errors here, we should get serious about the following two recommendations.
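
    For a quick spot-check of that same counter from PowerShell before getting to the two recommendations below (just a sketch using the in-box Get-Counter cmdlet; the perfmon capture remains the real data source):

    # Sketch: sample "Packets Outbound Errors" on every NIC, 12 samples 5 seconds apart.
    Get-Counter -Counter '\Network Interface(*)\Packets Outbound Errors' -SampleInterval 5 -MaxSamples 12 |
        ForEach-Object { $_.CounterSamples | Select-Object Path, CookedValue }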

    First, while I am glad to see that Windows Server 2012 R2 has TCP Chimney Offload disabled by default, there is still room to disable RSS as a precaution.

    Here is one recommendation for disabling RSS:  http://ficility.net/tag/windows-server-2012-network-tcp/

    For Windows Server 2012, the TCP default global configuration has changed. By default, the Chimney Offload State is already configured as it should be, with no change needed. You still need to change Receive-Side Scaling:

    Before:
    TCP Global Parameters
    ----------------------------------------------
    Receive-Side Scaling State          : enabled
    Chimney Offload State               : disabled
    NetDMA State                        : disabled
    Direct Cache Access (DCA)           : disabled
    Receive Window Auto-Tuning Level    : normal
    Add-On Congestion Control Provider  : none
    ECN Capability                      : enabled
    RFC 1323 Timestamps                 : disabled
    Initial RTO                         : 3000
    Receive Segment Coalescing State    : enabled


    To disable RSS, then re-check:

    netsh int tcp set global rss=disabled
    netsh int tcp show global

    After:

    TCP Global Parameters
    ----------------------------------------------
    Receive-Side Scaling State          : disabled
    Chimney Offload State               : disabled
    NetDMA State                        : disabled
    Direct Cache Access (DCA)           : disabled
    Receive Window Auto-Tuning Level    : normal
    Add-On Congestion Control Provider  : none
    ECN Capability                      : enabled
    RFC 1323 Timestamps                 : disabled
    Initial RTO                         : 3000
    Receive Segment Coalescing State    : enabled


    Second, also as a precaution, we could disable TCP offloading in the NIC properties of the distributed cache servers.  
    Example:  http://www.rackspace.com/knowledge_center/article/disabling-tcp-offloading-in-windows-server-2012

    Disabling TCP Offloading in Windows Server 2012
    Article ID: 3855
    Last updated on July 18, 2014
    ________________________________________
    TCP offload engine is a function used in network interface cards (NIC) to offload processing of the entire TCP/IP stack to the network controller. By moving some or all of the processing to dedicated hardware, a TCP offload engine frees the system's main CPU for other tasks. However, TCP offloading has been known to cause some issues, and disabling it can help avoid these issues.
    NOTE: We recommend keeping TCP offloading enabled in any source images that you use to build new servers, and then disabling TCP offloading in the source image after the new server is built. If TCP offloading is disabled on an image, a server build from that image might fail. We are working on a solution for this issue. However, as this is a multiple vendor issue the resolution will depend on the vendors' cooperative efforts.
    Disable TCP Offloading
    1. In the Windows server, open the Control Panel and select Network Settings > Change Adapter Settings.
     
    2. Right-click on each of the adapters (private and public), select Configure from the Networking menu, and then click the Advanced tab. The TCP offload settings are listed for the Citrix adapter.
     
    3. Disable each of the following TCP offload options, and then click OK:
    o IPv4 Checksum Offload
    o Large Receive Offload
    o Large Send Offload
    o TCP Checksum Offload

          =======================

    As of 7/30/2015, I have not disabled RSS or any form of TCP offloading. But I'm keeping that in mind as a possibility for later.
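
    If we do go after the offload settings later, here is a rough PowerShell sketch (it assumes the in-box NetAdapter module on Server 2012 R2; the exact offload names a driver exposes vary, which is why the advanced-property listing comes first):

    # Sketch: review what offload-related settings the NIC drivers actually expose.
    Get-NetAdapterAdvancedProperty -Name "*" |
        Where-Object { $_.DisplayName -like '*Offload*' } |
        Select-Object Name, DisplayName, DisplayValue

    # Checksum offload and Large Send Offload have dedicated cmdlets.
    Disable-NetAdapterChecksumOffload -Name "*"
    Disable-NetAdapterLso -Name "*"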

    I asked the security team to see if they could detect any interference from endpoint protection.  (No response from them on that yet.)

    I'm planning on upgrading soon to AppFabric CU6. . . . Update: CU6 did not solve the problem either.


    • Marked as answer by C. T. Haun Friday, June 12, 2015 4:55 PM
    • Unmarked as answer by C. T. Haun Wednesday, September 30, 2015 7:19 PM
    • Edited by C. T. Haun Wednesday, September 30, 2015 7:20 PM improvement
    Friday, June 12, 2015 4:55 PM
  • The following action seemed like it caused the problem to go away for two or three weeks.  I thought it was solved. But the crashes did return.

    The distributed cache crashes totally disappeared (for two weeks) after unchecking the box for

    "Allow the computer to turn off this device to save power"

    in the POWER MANAGEMENT settings of the physical NICs.

    We unchecked this setting in all the NIC properties on distributed cache host servers, the SQL servers, and all other physical servers in the farm. I think power management was causing sockets to be forcibly closed and distributed cache really couldn't handle that.

    The fact that the crashes were usually at night (when almost no one is using our farm) could fit with the idea of NICs going to sleep (which I would imagine would happen when activity is low?).

    Here are my notes on that setting. . .  

    From a Microsoft SharePoint RAAS report (Risk Assessment as a Service) I saw mention of http://support.microsoft.com/kb/2740020.  This article is helpful, but it doesn’t apply specifically to Win2012 R2, and the convenient Fix-It tool in that KB article doesn’t work on Win2012. The method it gives for turning the NIC power-saving settings off manually in the registry is painful and, in my opinion, leaves too much room for missing something. So I’m left thinking that the simplest and surest way to go is through the NIC properties, per below.

    1. Open "Change adapter settings."
    2. Right-click the first NIC to bring up the menu and select Properties.
    3. Click the Configure button.
    4. Switch to the Power Management tab.
    5. Uncheck “Allow the computer to turn off this device to save power.”
    6. Repeat this process for the other NICs.

    However, unchecking “allow the computer to turn off this device to save power” CAN possibly cause RDP access to fail.  Therefore this should be done through iLO rather than RDP.

    Per http://blogs.technet.com/b/exchange/archive/2013/10/22/do-you-have-a-sleepy-nic.aspx
    CAUTION Be careful when you change this setting. If it's enabled and you decide to disable it, you must plan for this modification as it will likely interrupt network traffic. It may seem odd that by just making a seemingly non-impacting change that the NIC will reset itself, but it definitely can. Trust me; I had a customer ‘test’ this during the day by accident… oops!


    We could use PowerShell instead. 

    http://blogs.technet.com/b/heyscriptingguy/archive/2014/04/09/windows-server-2012-r2-network-cmdlets-part-3.aspx
    I can even fight my old nemesis: Power management. What if a vendor didn’t quite implement the power management right, and this was causing network adapters to prematurely go to sleep? Well, I can dive into that advanced property with Windows PowerShell right now!
    Get-NetAdapter | Disable-NetAdapterPowerManagement
    There! No more sleeping on MY network now! All thanks to Windows Powershell!

    Compare with: http://blogs.technet.com/b/heyscriptingguy/archive/2014/01/16/working-with-network-adapter-power-settings.aspx
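
    And to verify where the power-management settings ended up on every adapter afterwards (same NetAdapter module, just a sketch):

    # Sketch: show the current NIC power-management settings on this host.
    Get-NetAdapter | Get-NetAdapterPowerManagement | Format-List *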

    Wednesday, September 30, 2015 7:21 PM
  • Here are some additional event log errors we are seeing. . .


    So it turns out Distributed Cache does sort of have its own event log: Event Viewer > Applications and Services Logs > Microsoft > Windows > Application Server-Applications.

     


    I think the following type of crash (the Event ID 110 crash) is maybe a lighter-weight crash which *maybe* does not hurt the newsfeeds (?) and which does not trigger a crash dump from Windows Error Reporting.

    Log Name:      Microsoft-Windows-Application Server-System Services/Admin
    Source:        Microsoft-Windows Server AppFabric Caching
    Date:          10/3/2015 11:26:23 AM
    Event ID:      110
    Task Category: (1)
    Level:         Error
    Keywords:     
    User:          domain\serviceaccountnamehere
    Computer:      DistCacheServer1.xyz.domain.com
    Description:
    AppFabric Caching service crashed.{4cad2928000000000000000000000000 lease relationship with 56df2dd4000000000000000000000000 failed}

    That crash seems to be very common on our distributed cache servers. And it is always preceded, a second (or less) beforehand, by one of the following:

    Log Name:      Microsoft-Windows-Application Server-System Services/Admin
    Source:        Microsoft-Windows-Fabric
    Date:          10/3/2015 11:26:23 AM
    Event ID:      1
    Task Category: None
    Level:         Error
    Keywords:     
    User:          domain\serviceaccountnamehere
    Computer:      DistCacheServer1.xyz.domain.com
    Description:
    {4cad2928000000000000000000000000-56df2dd4000000000000000000000000} subject relationship {130883394653745840} expired {DistCacheServer1.xyz.domain.com}-{DistCacheServer2.xyz.domain.com}

    These “subject relationship . . . expired” errors can occur without causing the 110 crash. But when the 110 crash does happen, it seems that the “subject relationship . . . expired” error happens right before it.


    This second type of crash (the Event ID 111 crash) is less frequent than the first type, and I think it has a bigger impact on My Sites.  It does seem to trigger a WER crash dump.


    Log Name:      Microsoft-Windows-Application Server-System Services/Admin
    Source:        Microsoft-Windows Server AppFabric Caching
    Date:          9/25/2015 7:26:33 AM
    Event ID:      111
    Task Category: (1)
    Level:         Error
    Keywords:     
    User:          domain\serviceaccountnamehere
    Computer:      DistCacheServer1.xyz.domain.com
    Description:
    AppFabric Caching service crashed with exception {System.Runtime.CallbackException: Async Callback threw an exception. ---> System.NullReferenceException: Object reference not set to an instance of an object.
       at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result)
       at System.ServiceModel.Channels.CommunicationObject.EndOpen(IAsyncResult result)
       at Microsoft.ApplicationServer.Caching.WcfServerChannel.OnOpen(IAsyncResult result)
       at System.Runtime.AsyncResult.Complete(Boolean completedSynchronously)
       --- End of inner exception stack trace ---
       at System.Runtime.AsyncResult.Complete(Boolean completedSynchronously)
       at System.Runtime.AsyncResult.AsyncCompletionWrapperCallback(IAsyncResult result)
       at System.Runtime.Fx.AsyncThunk.UnhandledExceptionFrame(IAsyncResult result)
       at System.Runtime.AsyncResult.Complete(Boolean completedSynchronously)
       at System.ServiceModel.Channels.ServerSessionPreambleConnectionReader.ServerFramingDuplexSessionChannel.OpenAsyncResult.OnWriteUpgradeResponse(Object asyncState)
       at System.Runtime.Fx.WaitThunk.UnhandledExceptionFrame(Object state)
       at System.ServiceModel.Channels.SocketConnection.OnSendAsync(Object sender, SocketAsyncEventArgs eventArgs)
       at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Net.Sockets.SocketAsyncEventArgs.FinishOperationSuccess(SocketError socketError, Int32 bytesTransferred, SocketFlags flags)
       at System.Net.Sockets.SocketAsyncEventArgs.CompletionPortCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* nativeOverlapped)
       at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* pOVERLAP)}. Check debug log for more information


    Interestingly, as of October 5th, the last time we saw that type of crash on DistributedCacheServer1 was September 25th.

    Here is a warning we saw on DistributedCacheServer1 back on Sept 28th but have not seen since. . .

    Log Name:      Microsoft-Windows-Application Server-System Services/Admin
    Source:        Microsoft-Windows-Fabric
    Date:          9/15/2015 4:11:07 PM
    Event ID:      6
    Task Category: None
    Level:         Warning
    Keywords:     
    User:          domain\serviceaccountnamehere
    Computer:      DistCacheServer1.xyz.domain.com
    Description:
    {4cad2928000000000000000000000000} failed to refresh lookup table, exception: {Microsoft.Fabric.Common.OperationCompletedException: Operation completed with an exception ---> System.TimeoutException: The operation has timed out.
       --- End of inner exception stack trace ---
       at Microsoft.Fabric.Common.OperationContext.End()
       at Microsoft.Fabric.Federation.FederationSite.EndRoutedSendReceive(IAsyncResult ar)
       at Microsoft.Fabric.Data.ReliableServiceManager.EndRefreshLookupTable(IAsyncResult ar)}

    We did see it recently on Web 2 however. . .

    Log Name:      Microsoft-Windows-Application Server-System Services/Admin
    Source:        Microsoft-Windows-Fabric
    Date:          10/3/2015 4:50:59 AM
    Event ID:      6
    Task Category: None
    Level:         Warning
    Keywords:     
    User:          domain\serviceaccountnamehere
    Computer:      DistCacheServer2.xyz.domain.com
    Description:
    {56df2dd4000000000000000000000000} failed to refresh lookup table, exception: {Microsoft.Fabric.Common.OperationCompletedException: Operation completed with an exception ---> System.TimeoutException: The operation has timed out.
     
    Here is another variation of that:

    Log Name:      Microsoft-Windows-Application Server-System Services/Admin
    Source:        Microsoft-Windows-Fabric
    Date:          10/2/2015 9:08:13 AM
    Event ID:      6
    Task Category: None
    Level:         Warning
    Keywords:     
    User:          domain\serviceaccountnamehere
    Computer:      DistCacheServer2.xyz.domain.com
    Description:
    {56df2dd4000000000000000000000000} failed to refresh lookup table, exception: {Microsoft.Fabric.Common.OperationCompletedException: Operation completed with an exception ---> Microsoft.Fabric.Federation.SiteNodeShutdownException: SiteNode has already shutdown
       --- End of inner exception stack trace ---
       at Microsoft.Fabric.Common.OperationContext.End()
       at Microsoft.Fabric.Federation.FederationSite.EndRoutedSendReceive(IAsyncResult ar)
       at Microsoft.Fabric.Data.ReliableServiceManager.EndRefreshLookupTable(IAsyncResult ar)}


    Another Variation:

    Log Name:      Microsoft-Windows-Application Server-System Services/Admin
    Source:        Microsoft-Windows-Fabric
    Date:          10/1/2015 2:39:55 PM
    Event ID:      6
    Task Category: None
    Level:         Warning
    Keywords:     
    User:          domain\serviceaccountnamehere
    Computer:      DistCacheServer2.xyz.domain.com
    Description:
    {56df2dd4000000000000000000000000} failed to refresh lookup table, exception: {Microsoft.Fabric.Common.OperationCompletedException: Operation completed with an exception ---> Microsoft.Fabric.Federation.RoutingException: The target node explicitly aborted the operation
       --- End of inner exception stack trace ---
       at Microsoft.Fabric.Common.OperationContext.End()
       at Microsoft.Fabric.Federation.FederationSite.EndRoutedSendReceive(IAsyncResult ar)
       at Microsoft.Fabric.Data.ReliableServiceManager.EndRefreshLookupTable(IAsyncResult ar)}

    Monday, October 5, 2015 4:02 PM
  • Hi Christopher!

    It seems like you are affected by an issue that is fixed by CU7 for AppFabric, which was released recently.

    The Event ID 111 entries in your AppFabric event log (Event Viewer: Applications and Services Logs - Microsoft - Windows - Application Server-System Services - Microsoft-Windows-Application Server-System Services/Admin) match the error details that this update addresses.

    Cumulative Update 7 for Microsoft AppFabric 1.1 for Windows Server
    https://support.microsoft.com/en-us/kb/3092423

    Best regards from Germany,
    Christian



    Wednesday, December 9, 2015 7:32 AM
  • Christian,

    Thanks very much for following up with me on this.

    I'll plan on trying the new CU out soon!

    Christopher

    Thursday, December 10, 2015 3:06 PM