A NetTcpBinding WCF service in a Windows Azure worker role refuses client connections at some stage


  • 15 August 2012 0:27
     
     

    We have been stuck with this problem for a couple of weeks, hopefully someone here could give us a clue:

    We have a NetTcpBinding WCF service (duplex channel with callbacks, InstanceContextMode = InstanceContextMode.Single, ConcurrencyMode = ConcurrencyMode.Multiple, reliable session enabled) deployed as a Windows Azure worker role. It works fine right after it starts: clients connect to it without any issue. But after it has been running for a couple of days, new clients fail to connect with the following exception, while the already established connections keep working (they can still invoke service methods and receive events with no problem).

    System.ServiceModel.EndpointNotFoundException: Could not connect to net.tcp://myserive.mycompany.com:443/MyServer. The connection attempt lasted for a time span of 00:00:02.6718750. TCP error code 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 157.55.143.80:443.  ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 157.55.143.80:443

       at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)

       at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)

       at System.ServiceModel.Channels.SocketConnectionInitiator.Connect(Uri uri, TimeSpan timeout)

       --- End of inner exception stack trace ---

    We took a memory dump of the WCF service and analysed it with WinDbg. There seems to be no issue with memory, CPU, threads, deadlocks, etc. We also confirmed that the current number of connections has not reached the serviceThrottling maxConcurrentSessions value (there are only tens of active connections and we set maxConcurrentSessions="1000").

    We have enabled WCF tracing on a test client. The error message indicates that it fails to open a socket connection to the server:

    <Exception><ExceptionType>System.ServiceModel.EndpointNotFoundException, System.ServiceModel, Version=3.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089</ExceptionType><Message>Could not connect to net.tcp://myserive.mycompany.com:443/MyServer. The connection attempt lasted for a time span of 00:00:20.9990592. TCP error code 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 157.55.143.80:443. </Message><StackTrace>   at System.ServiceModel.Channels.SocketConnectionInitiator.TraceConnectFailure(Socket socket, SocketException socketException, Uri remoteUri, TimeSpan timeSpentInConnect)

       at System.ServiceModel.Channels.SocketConnectionInitiator.Connect(Uri uri, TimeSpan timeout)

       at System.ServiceModel.Channels.BufferedConnectionInitiator.Connect(Uri uri, TimeSpan timeout)

       at System.ServiceModel.Channels.TracingConnectionInitiator.Connect(Uri uri, TimeSpan timeout)

       at System.ServiceModel.Channels.ConnectionPoolHelper.EstablishConnection(TimeSpan timeout)

       at System.ServiceModel.Channels.ClientFramingDuplexSessionChannel.OnOpen(TimeSpan timeout)

       at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)

    We also ran "netstat" on the server. It shows that the server is listening on the NetTcpBinding port (443 in our scenario) and that there are some established TCP connections between the clients and the server.

    We could not figure out why the WCF service refuses socket connections after it has been running for some time.

    Any help appreciated.

All replies

  • 15 August 2012 5:31
    Moderator
     
     

    Hi,

    >there are only tens of active connections

    Could you post the exact number of the active connections? This is a very important clue.

    Please also enable server side WCF tracing to see whether you can get some useful information from the trace log when the problem occurs.

    Please also post your serviceThrottling and binding configurations.


    Allen Chen [MSFT]
    MSDN Community Support | Feedback to us
    Get or Request Code Sample from Microsoft
    Please remember to mark the replies as answers if they help and unmark them if they provide no help.


    • Edited by Allen Chen - MSFT (Moderator) 15 August 2012 5:31
    • Marked as answer by Allen Chen - MSFT (Moderator) 5 September 2012 1:59
    • Unmarked as answer by Allen Chen - MSFT (Moderator) 6 September 2012 1:32
  • 16 August 2012 2:26
     
     Proposed answer

    >Could you post the exact number of the active connections? This is a very important clue.

    The following is the memory dump of the FlowThrottle for sessions. It shows that maxConcurrentSessions is 1000 and that the number of currently active sessions is 12.

    0:000> !do 0000000002275b08
    Name:        System.ServiceModel.Dispatcher.FlowThrottle
    MethodTable: 000007ff0107d298
    EEClass:     000007ff01092728
    Size:        88(0x58) bytes
    File:        D:\windows\Microsoft.Net\assembly\GAC_MSIL\System.ServiceModel\v4.0_4.0.0.0__b77a5c561934e089\System.ServiceModel.dll
    Fields:
                  MT    Field   Offset                 Type VT     Attr            Value Name
    000007feeebdc758  4002ec7       40         System.Int32  1 instance             1000 capacity
    000007feeebdc758  4002ec8       44         System.Int32  1 instance               12 count
    000007feeebdd588  4002ec9       4c       System.Boolean  1 instance                0 warningIssued
    000007feeebdc758  4002eca       48         System.Int32  1 instance               70 warningRestoreLimit
    000007feeebd59c8  4002ecb        8        System.Object  0 instance 0000000002275b60 mutex
    000007feeebecbb0  4002ecc       10 ...ding.WaitCallback  0 instance 0000000002275ac8 release
    000007ff010e0b90  4002ecd       18 ...bject, mscorlib]]  0 instance 0000000002275b78 waiters
    000007feeebd6870  4002ece       20        System.String  0 instance 0000000002275a38 propertyName
    000007feeebd6870  4002ecf       28        System.String  0 instance 0000000002275a80 configName
    000007feeebe81c8  4002ed0       30        System.Action  0 instance 00000000022e02d8 acquired
    000007feeebe81c8  4002ed1       38        System.Action  0 instance 00000000022e0318 released

    Further analysis of the NetTcpChannel socket connection pool (ConnectionAcceptor) shows that the problem occurs because the TCP socket connection pool is full. The connections counter has reached maxPendingConnections, and the ConnectionAcceptor does not accept new socket connections in this situation.

    But we do not know why the number of connections in the socket connection pool reached maxPendingConnections (10). The temporary solution is to increase maxPendingConnections to 50 (maxConnections="50" in the binding configuration tag).

    0:000> !do 00000000023d5d00
    Name:        System.ServiceModel.Channels.ConnectionAcceptor
    MethodTable: 000007ff013c1be0
    EEClass:     000007ff013b11a8
    Size:        88(0x58) bytes
    File:        D:\windows\Microsoft.Net\assembly\GAC_MSIL\System.ServiceModel\v4.0_4.0.0.0__b77a5c561934e089\System.ServiceModel.dll
    Fields:
                  MT    Field   Offset                 Type VT     Attr            Value Name
    000007feeebdc758  400035e       38         System.Int32  1 instance                1 maxAccepts
    000007feeebdc758  400035f       3c         System.Int32  1 instance               10 maxPendingConnections
    000007feeebdc758  4000360       40         System.Int32  1 instance               10 connections
    000007feeebdc758  4000361       44         System.Int32  1 instance                0 pendingAccepts
    000007ff013c03e0  4000362        8 ...onnectionListener  0 instance 00000000023d5ab8 listener
    000007feeebf8b50  4000363       10 System.AsyncCallback  0 instance 00000000023d5df0 acceptCompletedCallback
    000007feeec042f0  4000364       18 ...bject, mscorlib]]  0 instance 00000000023d5e30 scheduleAcceptCallback
    000007feeebe81c8  4000365       20        System.Action  0 instance 00000000023d5d58 onConnectionDequeued
    000007feeebdd588  4000366       48       System.Boolean  1 instance                0 isDisposed
    000007ff013c21a8  4000367       28 ...AvailableCallback  0 instance 00000000023d5cc0 callback
    000007ff013c0e98  4000368       30 ...els.ErrorCallback  0 instance 00000000023d5ba0 errorCallback
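
    For reference, a minimal sketch of the same workaround done in code rather than in configuration, assuming the service builds its NetTcpBinding programmatically. BindingFactory and CreateBinding are hypothetical names, and 50 is simply the value mentioned above. NetTcpBinding.MaxConnections is the property behind the maxConnections attribute; raising it only postpones the refusal if the counter keeps growing.

```csharp
using System.ServiceModel;

// Hypothetical helper, not taken from the service discussed in this thread.
public static class BindingFactory
{
    public static NetTcpBinding CreateBinding(int maxConnections = 50)
    {
        var binding = new NetTcpBinding();

        // The service in this thread uses a reliable session.
        binding.ReliableSession.Enabled = true;

        // Same setting as maxConnections="50" in the binding configuration.
        // The default is 10, which matches the maxPendingConnections value in the dump.
        binding.MaxConnections = maxConnections;

        return binding;
    }
}
```

    The returned binding would then be passed to ServiceHost.AddServiceEndpoint in place of the configuration-based one.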

    • Proposed as answer by Veerendra Kumar 16 August 2012 7:25
  • 16 August 2012 7:20
    Moderator
     
     

    Hi,

    Thanks for the update. Do you mean you've resolved this issue after setting maxConnections to a larger value?


    Allen Chen [MSFT]

  • 16 August 2012 9:43
     
     

    Hi Allen,

    We would need a couple of days to check the result of setting maxConnections to a larger value.

    To be honest, we are still looking for a more elegant solution. We have noticed that if WCF tracing is enabled on the server, the warning "Maximum number of pending connections has been reached." is written when the connections counter reaches the value of maxConnections. Maybe we could restart the ServiceHost when this warning appears.
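
    One way to react to that warning might be a custom trace listener attached to the System.ServiceModel trace source. The sketch below is only an illustration under assumptions: PendingConnectionsListener and LimitReached are hypothetical names, the id 0x40024 is the trace code that appears in the decompiled HandleCompletedAccept quoted later in this thread, and whether WCF forwards that code unchanged as the listener's id argument has not been verified here, so the message text is checked as a fallback.

```csharp
using System;
using System.Diagnostics;

// Hypothetical listener; register it under the System.ServiceModel source in <system.diagnostics>.
public sealed class PendingConnectionsListener : TraceListener
{
    // Trace code seen in the decompiled HandleCompletedAccept (assumed to arrive as 'id').
    public const int MaxPendingConnectionsTraceId = 0x40024;
    public const string WarningText = "Maximum number of pending connections has been reached";

    // Raised whenever the warning is observed; the subscriber decides what to do
    // (for example, schedule a ServiceHost restart).
    public static event Action LimitReached;

    public override void TraceData(TraceEventCache eventCache, string source,
        TraceEventType eventType, int id, object data)
    {
        bool match = id == MaxPendingConnectionsTraceId
            || (data != null && data.ToString().Contains(WarningText));

        if (eventType == TraceEventType.Warning && match)
        {
            Action handler = LimitReached;
            if (handler != null)
            {
                handler();
            }
        }
    }

    // This listener only watches for one event, so the text output methods do nothing.
    public override void Write(string message) { }
    public override void WriteLine(string message) { }
}
```

    The handler should hand the restart off to another thread rather than recycle the ServiceHost inside the trace callback, since the callback runs on the WCF thread that accepted the connection.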

    • Marked as answer by Allen Chen - MSFT (Moderator) 5 September 2012 1:59
    • Unmarked as answer by Allen Chen - MSFT (Moderator) 6 September 2012 1:33
  • 22 August 2012 1:23
    Moderator
     
     

    Hello,

    Is there any update of this issue? Does the change make it work?


    Allen Chen [MSFT]

  • 5 September 2012 22:38
     
     

    Sorry for my late response. We have spent more than two weeks observing it.

    It seems that changing maxConnections to 100 does not resolve the problem completely. The connections counter still reaches 100 after one week.

    And as expected, we got the "Maximum number of pending connections has been reached." warning in the WCF trace files.

    I have decompiled System.ServiceModel.dll to try to understand the logic behind the connections counter of the System.ServiceModel.Channels.ConnectionAcceptor class, but it is more complicated than I expected. I can see when the counter increases (when a new socket connection is accepted), but I could not figure out when, or under what conditions, it decreases.

    I am wondering whether anyone with a deep understanding of it could explain what situations could increase the connections counter without ever decreasing it.

    Another thing we have noticed from the output of a simple "netstat" command is that the server has only about 25 TCP connections from clients. So how can the connections counter reach 100 in this situation?


    • Edited by Roger NZ 6 September 2012 1:11
  • 6 September 2012 3:31
    Moderator
     
     

    Hi,

    >Just wondering anyone who has a deep understanding of it could explain what sort of potential situations could increase the connections counter value without decreasing it.

    If you use Reflector to read the code (posted at the end of this reply), you can see that the counter is decremented in OnConnectionDequeued (this.connections--). From your description, this.connections++ is executed but this.connections-- is not. Considering this, here are my thoughts:

    Assume no exception is thrown by the other code after this.connections++. Then flag is true and connection is not null when execution reaches:

        if (connection != null)
        {
            this.callback(connection, this.onConnectionDequeued); // trigger OnConnectionDequeued
        }

    so this.callback(connection, this.onConnectionDequeued) will be executed.

    Assuming this.callback(connection, this.onConnectionDequeued) triggers OnConnectionDequeued properly, the counter must be decremented unless the thread is waiting for ThisLock to be released.

    Suggestions:

    • It may be caused by exceptions thrown in this method and caught by external code. Please check whether there are exceptions in the WCF trace or the event log, focusing on those that may be thrown in this method.
    • It may be caused by ThisLock. Please analyze the dump, check the call stacks of all managed threads, and see whether any threads are waiting for ThisLock to be released.
    • If neither of the above is the root cause, the only remaining possibility seems to be that this.callback(connection, this.onConnectionDequeued) somehow does not trigger OnConnectionDequeued. In that case I think the dump file needs further investigation. I suggest you contact our support at http://www.windowsazure.com/en-us/support/contact/.

    >Another thing we have noticed, from the result of a simple "Netstat" command, that the server only has about 25 tcp connections from clients - so why the connections counter can reach 100 in this situation?

    The connections counter is an application-level counter. It is possible that it is not updated correctly because of behaviors that the WCF developers did not anticipate.

    private void HandleCompletedAccept(IAsyncResult result)
    {
        IConnection connection = null;
        lock (this.ThisLock)
        {
            bool flag = false;
            Exception exception = null;
            try
            {
                if (!this.isDisposed)
                {
                    connection = this.listener.EndAccept(result);
                    if (connection != null)
                    {
                        if (DiagnosticUtility.ShouldTraceWarning && ((this.connections + 1) >= this.maxPendingConnections))
                        {
                            TraceUtility.TraceEvent(TraceEventType.Warning, 0x40024, SR.GetString("TraceCodeMaxPendingConnectionsReached"), new StringTraceRecord("MaxPendingConnections", this.maxPendingConnections.ToString(CultureInfo.InvariantCulture)), this, null);
                        }
                        this.connections++;
                    }
                }
                flag = true;
            }
            catch (CommunicationException exception2)
            {
                if (DiagnosticUtility.ShouldTraceInformation)
                {
                    DiagnosticUtility.ExceptionUtility.TraceHandledException(exception2, TraceEventType.Information);
                }
            }
            catch (Exception exception3)
            {
                if (Fx.IsFatal(exception3))
                {
                    throw;
                }
                if ((this.errorCallback == null) && !ExceptionHandler.HandleTransportExceptionHelper(exception3))
                {
                    throw;
                }
                exception = exception3;
            }
            finally
            {
                if (!flag)
                {
                    connection = null;
                }
                this.pendingAccepts--;
            }
            if ((exception != null) && (this.errorCallback != null))
            {
                this.errorCallback(exception);
            }
        }
        this.AcceptIfNecessary(false);
        if (connection != null)
        {
            this.callback(connection, this.onConnectionDequeued); //trigger OnConnectionDequeued
        }
    }

     

    private void OnConnectionDequeued()
    {
        lock (this.ThisLock)
        {
            this.connections--;
        }
        this.AcceptIfNecessary(false);
    }
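
    To make the accounting easier to follow, here is a toy model of the counter only. It is not WCF code: ToyAcceptor, TryAccept and Simulate are hypothetical names, and the rule that every Nth accepted connection never gets dequeued is an assumption made purely for illustration. It shows how a counter that is incremented on accept and decremented only from the dequeue callback can fill up even though few connections are really open.

```csharp
using System;

// Toy model: mirrors only the connections++ / connections-- bookkeeping shown above.
public sealed class ToyAcceptor
{
    private readonly object thisLock = new object();
    private readonly int maxPendingConnections;
    private int connections;

    public ToyAcceptor(int maxPendingConnections)
    {
        this.maxPendingConnections = maxPendingConnections;
    }

    public int Connections
    {
        get { lock (thisLock) { return connections; } }
    }

    // Returns the dequeue callback when the connection is accepted,
    // or null when the counter is at its limit and nothing more is accepted.
    public Action TryAccept()
    {
        lock (thisLock)
        {
            if (connections >= maxPendingConnections)
            {
                return null;
            }
            connections++;
        }
        return OnConnectionDequeued;
    }

    private void OnConnectionDequeued()
    {
        lock (thisLock)
        {
            connections--;
        }
    }
}

public static class ToyDemo
{
    // Runs 'attempts' connection attempts; every 'leakEvery'-th attempt, if accepted,
    // never invokes its dequeue callback. Returns how many attempts were refused.
    public static int Simulate(ToyAcceptor acceptor, int attempts, int leakEvery)
    {
        int refused = 0;
        for (int i = 0; i < attempts; i++)
        {
            Action dequeued = acceptor.TryAccept();
            if (dequeued == null)
            {
                refused++;
            }
            else if (i % leakEvery != 0)
            {
                dequeued(); // normal path: the counter goes back down
            }
        }
        return refused;
    }
}
```

    In this model the counter only ever drops through the callback, so each connection whose callback is skipped occupies a slot permanently, regardless of whether its socket still exists. That would be consistent with netstat showing far fewer connections than the counter.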


    Allen Chen [MSFT]








  • 7 September 2012 2:03
     
     

    Thanks, Allen, for your quick response.

    >> Maybe caused by unhandled exceptions in this method (that is caught by external code). Please check out whether there're exceptions in WCF trace/Event log. Focus on the exceptions that may be thrown in this method.

    In the WCF trace there is no exception thrown in the HandleCompletedAccept method.

    >> Maybe caused by ThisLock. Please analyze dump to check callstack of all managed threads and see whether some threads are waiting for ThisLock release.

    No threads are waiting on ThisLock.

    >>If none of above is the root cause, the only possibility seems is this.callback(connection, this.onConnectionDequeued) somehow does not trigger OnConnectionDequeued.

    I took a closer look at this.callback(connection, this.onConnectionDequeued). It does not trigger OnConnectionDequeued directly. Instead, it passes OnConnectionDequeued as an Action to the ConnectionDequeuedCallback property of the InitialServerConnectionReader class. At some stage, I guess when the connection is dequeued from the underlying InputQueue<TChannel>, OnConnectionDequeued is triggered. What I could not figure out is what conditions have to be satisfied for that dequeue to happen.


    • Edited by Roger NZ 7 September 2012 3:41
  • 7 September 2012 6:49
    Moderator
     
     

    Hi,

    Yes, it's dequeued from InputQueue<TChannel>.Dequeue(TimeSpan). The condition you're looking for may vary.

    You could first look at all InputQueue<TChannel> objects in the managed heap, dump their fields, and read the code in Reflector to simulate what the method does. You will probably be able to figure out why InputQueue<T>.InvokeDequeuedCallback is not called. If InputQueue<TChannel>.Dequeue(TimeSpan) seems fine, look at the objects that use InputQueue<TChannel>.Dequeue(TimeSpan). For example, objects of this type (found via Reflector's Analyze function):

    System.ServiceModel.Dispatcher.DuplexChannelBinder+AutoCloseDuplexSessionChannel

    Keep doing the same thing and you'll probably be able to find out the root cause.


    Allen Chen [MSFT]

  • 9 September 2012 22:54
     
     

    Hi - I am not sure whether this issue has been resolved yet. I had a similar experience. In my case the WCF service was not a duplex channel, but we were facing exactly the same issue as you are, and it turned out that database connections were the cause. Some of the calls were long-running and held their TCP connections for a long time, leaving no TCP connections for incoming calls. We increased the number of TCP connections to the database as well.

    An important point: the number of active connections you get from the WCF trace covers only WCF connections; it does not tell you about the other active TCP connections on the server.

    You might want to check this. Also (sorry if this offends you): if you do not have a WCF proxy pool on the client, make sure you close the proxies after you are done using them.

    Let us know how this goes.


    MkMahesh

  • 19 December 2012 8:34
     
     
    No idea why I was reading this (just browsing)... but you can also see this issue when the client does not close its connections after it has finished calling, particularly if you are making frequent calls. Don't forget to close your connections after use ;)
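
    For completeness, a minimal sketch of the usual close-or-abort pattern for WCF client proxies, which is what "close your connections" typically means on the client. ProxyCleanup and CloseOrAbort are hypothetical names; the pattern relies only on ICommunicationObject, which generated proxies and channels implement. Whether unclosed proxies are actually the cause in this thread is not established.

```csharp
using System;
using System.ServiceModel;

// Hypothetical helper for releasing a client proxy or channel.
public static class ProxyCleanup
{
    public static void CloseOrAbort(ICommunicationObject proxy)
    {
        if (proxy == null)
        {
            return;
        }

        try
        {
            if (proxy.State == CommunicationState.Faulted)
            {
                // Close() throws on a faulted channel, so release it with Abort().
                proxy.Abort();
            }
            else
            {
                proxy.Close();
            }
        }
        catch (CommunicationException)
        {
            proxy.Abort();
        }
        catch (TimeoutException)
        {
            proxy.Abort();
        }
        catch (Exception)
        {
            // Unexpected failure: still release the connection, then rethrow.
            proxy.Abort();
            throw;
        }
    }
}
```

    A client would call ProxyCleanup.CloseOrAbort(proxy) in a finally block after its last service call instead of relying on Dispose, which calls Close and can throw if the channel has faulted.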