none
Connectivity Issues - 17830/10054

    Question

  • Hi,

    Running Enterprise Edition, 4 Node Cluster. SQL 2008 R2 Sp1+CU4.

    Seeing widespread but intermittent Login Timeouts from client processes. Nothing reported in SQL Errorlogs, Windows Event Logs or network switch logs. These issues are intermittent but occur a couple of times per day. Have spent time upgrading firmware on network switches etc but these issues endure.

    The only thing I really see server side is in the ring buffer. I see bursts of LoginTimers events, perhaps 60-70 at a time. They all have SniConsumerError=17830, and TdsInputBufferError=10054. Not ruling out further network issues, but would appreciate any pointers.

    From sys.messages: 17830: Network error code 0x%x occurred while establishing a connection; the connection has been closed. This may have been caused by client or server login timeout expiration. Time spent during login: total %d ms, enqueued %d ms, network writes %d ms, network reads %d ms, establishing SSL %d ms, negotiating SSPI %d ms, validating login %d ms, including user-defined login processing %d ms.%.*ls

    10054: Connection reset by peer. [NT only]

    From KBArticle: This can happen if a sockets client (which includes SPX) is actively sending or receiving data to or from SQL Server, and the network connection is lost. This error is usually seen in conjunction with error 17824 when on sockets-based connections. For example, an Microsoft Windows NT client using the NWLink SPX transport could be selecting data from SQL Server, and if the network cable is unplugged from the SQL Server computer, you will see error 10054.

    Here is an example of the xml from the ring buffer:

    <Record id="231" type="RING_BUFFER_CONNECTIVITY" time="1094487444">
      <ConnectivityTraceRecord>
        <RecordType>LoginTimers</RecordType>
        <Spid>0</Spid>
        <SniConnectionId>ADD632AE-4D3D-4076-A71E-9C853F0CAA72</SniConnectionId>
        <SniConsumerError>17830</SniConsumerError>
        <SniProvider>7</SniProvider>
        <State>11</State>
        <RemoteHost>10.10.66.49</RemoteHost>
        <RemotePort>55844</RemotePort>
        <LocalHost>10.10.64.16</LocalHost>
        <LocalPort>54436</LocalPort>
        <RecordTime>3/5/2012 14:30:21.228</RecordTime>
        <TdsBuffersInformation>
          <TdsInputBufferError>10054</TdsInputBufferError>
          <TdsOutputBufferError>0</TdsOutputBufferError>
          <TdsInputBufferBytes>0</TdsInputBufferBytes>
        </TdsBuffersInformation>
        <LoginTimers>
          <TotalLoginTimeInMilliseconds>83</TotalLoginTimeInMilliseconds>
          <LoginTaskEnqueuedInMilliseconds>44</LoginTaskEnqueuedInMilliseconds>
          <NetworkWritesInMilliseconds>35</NetworkWritesInMilliseconds>
          <NetworkReadsInMilliseconds>3</NetworkReadsInMilliseconds>
          <SslProcessingInMilliseconds>57</SslProcessingInMilliseconds>
          <SspiProcessingInMilliseconds>0</SspiProcessingInMilliseconds>
          <LoginTriggerAndResourceGovernorProcessingInMilliseconds>0</LoginTriggerAndResourceGovernorProcessingInMilliseconds>
        </LoginTimers>
      </ConnectivityTraceRecord>
      <Stack>
        <frame id="0">0X0000000001570EBB</frame>
        <frame id="1">0X000000000156DCA6</frame>
        <frame id="2">0X000000000157292E</frame>
        <frame id="3">0X0000000000829A94</frame>
        <frame id="4">0X000000000021A8A8</frame>
        <frame id="5">0X00000000001DF070</frame>
        <frame id="6">0X00000000001DEE46</frame>
        <frame id="7">0X00000000001DEC8B</frame>
        <frame id="8">0X00000000007A923A</frame>
        <frame id="9">0X00000000007A9389</frame>
        <frame id="10">0X00000000007A7201</frame>
        <frame id="11">0X00000000007A8402</frame>
        <frame id="12">0X00000000747437D7</frame>
        <frame id="13">0X0000000074743894</frame>
        <frame id="14">0X000000007710652D</frame>
        <frame id="15">0X000000007733C521</frame>
      </Stack>
    </Record>

    Monday, March 05, 2012 3:28 PM

All replies

  • When this happens do you see very high cpu times? Can you also start collecting sys.dm_os_schedulers,sys.dm_os_waiting_Tasks and a server side trace and/orxevent to see what's running?(you might be impacting performance but that's the only way to capture what's going on)

    I have seen similar issues and at the time of login timer errors it was a very busy sql server and unable to service incoming logins(had to do with some preemptive threads taking lot of cpu time and making other threads to wait)

    How many cpu's do you have? Does the Server have any databases that are mirrored?


    Thanks

    Wednesday, March 07, 2012 2:38 AM
  • Thanks for the response, I appreciate it. 

    I don't think this is resource starvation to be honest. This server has 4x8core L7555 Xeons, and 128GB RAM. There are 3 SQL instances running on it, each with a Numa node allocated (one free), and each with an appropriate amount of RAM. Page Life Expectancy is okay (500+). CPU utilisation across the nodes maxes out at 40%, with no single core being overloaded for long.

    I've seen a couple of instances of Non Yielding workers since this post, which is a worry, but managed to recover without a restart.

    At this stage, i think I might be seeing an issue with the Network cards getting overload with Multi cast packets. We have some multicast systems on the network, and we think that our switches can filter this stuff out from the SQL machines more effectively. We 're seeing billions of discarded packets, and network card/switch saturation might explain a lot of the things i have been seeing.

    Also been seeing plenty of these:

    TCP connection closed but a child process of SQL Server may be holding a duplicate of the connection's socket.  Consider enabling the TcpAbortiveClose SQL Server registry setting and restarting SQL Server. If the problem persists, contact Technical Support.

    and these:

    The client was unable to reuse a session with SPID 66, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.

    Can you suggest any dmv queries that might help identify network waits?

    I'm planning on implementing xevents, but not yet, and carefully when I do as these are production servers. 

    Thanks again

    Thursday, March 08, 2012 6:10 PM
  • I cannot think of dmv to monitor network waits, I normally capture it from perfmon check Network Interface(output queue length,errors etc).

    You can check if you notice any queuing when you have the timeouts.

    For the connection pool error did you check this?

    http://blogs.msdn.com/b/psssql/archive/2010/08/03/how-it-works-error-18056-the-client-was-unable-to-reuse-a-session-with-spid-which-had-been-reset-for-connection-pooling.aspx

    Friday, March 09, 2012 1:25 AM
  • I am facing the same problem. Has anyone answer of this?

    Manish

    Friday, May 18, 2012 8:52 AM
    • Proposed as answer by LearnerSql Tuesday, May 29, 2012 4:30 AM
    Friday, May 25, 2012 7:03 PM
  • Hey, only just noticed this. 

    In the end all of my problems came down to 2 things. 

    1. TCP Offloading. I suggest  you to consider turning this off at the windows level. check out this:

    http://blogs.technet.com/b/brad_rutkowski/archive/2007/08/10/how-to-know-if-tcp-offload-is-working.aspx

    This solved many problems for me.

    2. Affinity Mask. We were using Numa affinity in SQL. Turning this off and letting the various SQL instances fight for resources ended up improving lots of things. sp_configure 'affinity mask', 0

    HTH

    Mark

    Tuesday, June 12, 2012 7:33 AM