service broker error : Connection handshake failed

    Question

  •     When I configure SSB on two machines to send messages, I get an error message in SQL Profiler on the target machine:
        "Connection handshake failed. There is already an existing connection with the same peer and this connection lost the arbitration. State 80."

        Then I get another message, "This message could not be delivered because it is a duplicate.", but I am sure the routes are configured correctly.

        How can I solve this problem? Thanks. I tested it on version 9.0.1399.
    Friday, November 11, 2005 10:38 AM

Answers

  • I feared that having both the non-clustered instance and the virtual instance listening on port 4022 might create a conflict in the case of a SQL failover from 192.168.10.3 to 192.168.10.4.

    So I actually installed two SQL instances in the configuration you described. It took me a while to get the cluster up, but I have now tested SSB connectivity between them and, even though I also see the 'arbitration lost' error messages, I could not reproduce the problem you described. In my case dialog message traffic flows fine even after the arbitration event occurs.

    Is 192.168.10.3/4 the private (heartbeat) address of the nodes, or the public address?

    Thursday, November 17, 2005 7:29 AM
    Moderator

All replies

  • It seems from the description that the initiator connects to the target and sends the message, then the target does not send back the acknowledgements. This causes the initiator to retry the message again and again, and each time the message is dropped as duplicate by the target.
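
    One way to check for this on the initiator, as a rough sketch: messages that have been sent but not yet acknowledged remain in sys.transmission_queue, and the transmission_status column holds the last transport error, if any. The TOP value below is arbitrary.

```sql
-- Run in the initiator database.
-- Rows stay here until the target acknowledges them;
-- transmission_status is empty when no transport error was recorded.
SELECT TOP (50)
       conversation_handle,
       to_service_name,
       enqueue_time,
       message_sequence_number,
       transmission_status
FROM sys.transmission_queue
ORDER BY enqueue_time;
```

    If the same rows are still present after repeated retries, the acknowledgements are most likely not making it back.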

    The 'arbitration lost' handshake failure happens because the two instances attempt to open multiple connections between them. In this case, one of the connections is closed and the traffic should be redirected onto the connection that is left open. This scenario usually happens when there is some form of Network Address Translation (NAT) between the two hosts or when the initiator has multiple IP addresses or NICs.
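
    To see which connections are currently open between the two instances, the sys.dm_broker_connections view can be queried on either side. This is only a minimal sketch: it requires VIEW SERVER STATE permission, and the column list is just a selection.

```sql
-- Run on either instance; lists the Service Broker transport connections.
SELECT connection_id,
       state_desc,
       login_state_desc,
       is_accept,            -- 1 = accepted from the peer, 0 = opened by this instance
       connect_time,
       last_activity_time
FROM sys.dm_broker_connections;
```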

    Do the Profiler events appear repeatedly, say every minute or so? Can you also profile the initiator and tell me if any event is traced there?

    Can you describe the network configuration? Name and IP addresses of the initiator, name and IP addresses of the target, names used in the routes etc?
    If you're reluctant to post these on a public forum, send them to me at remus.rusanu@... at my address at work (microsoftdotcom)

    Thanks,
    ~ Remus

    Friday, November 11, 2005 7:54 PM
    Moderator
  •     Hi Remus, the 'arbitration lost' message happens only once, but the 'duplicate message' error happens frequently after 'arbitration lost' occurs.

        My network configuration is as follows: 192.168.10.3 and 192.168.10.4 form a SQL Server cluster whose virtual IP is 192.168.10.104, with instance name 'Inst01'. The other instance is '192.168.10.4\Inst02'. '192.168.10.104\Inst01' is the initiator and '192.168.10.4\Inst02' is the target. Is there any problem with this configuration?

        Not every message fails, though: messages sent before 'arbitration lost' succeed, but messages sent after 'arbitration lost' fail.
    Sunday, November 13, 2005 5:24 AM
  • If I understand correctly, you installed one instance on a virtual cluster and a second instance on one of the two physical machines that form the virtual cluster, right?

    I do not see a problem with this configuration, but I'm gonna have to test it for myself because there are some gotchas there (since you are addressing the virtual cluster from one of the nodes).

    You do not have to use IP addresses in the routes (you can, but it is not required). You should probably use the computer's DNS name (or the cluster virtual name).
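
    As an illustration only, a route that uses a DNS name might look like the sketch below. RouteToTarget, //example/TargetService and targethost are made-up names, and the port must be the one the remote Service Broker endpoint listens on.

```sql
-- Hypothetical names throughout; substitute your own service and host names.
-- The address names the host and the Service Broker endpoint port,
-- not the SQL Server instance name.
CREATE ROUTE RouteToTarget
    WITH SERVICE_NAME = N'//example/TargetService',
         ADDRESS      = N'TCP://targethost:4022';
```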

    Can you answer some more questions for me please:
    1. What TCP ports did you configure for the two endpoints on Inst01 and Inst02 (CREATE ENDPOINT)?
    2. When you do the experiment, which one of the nodes of the cluster is the active one for the SQL instance Inst01? 192.168.10.3 or 192.168.10.4 ?
    3. Can you post the exact CREATE ROUTE statements you used, both on Inst01 and on Inst02?
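
    The same information can also be read back from the catalog views, which may be easier than locating the original statements. A sketch, to be run on each instance:

```sql
-- Service Broker endpoint and the TCP port it listens on (server scope).
SELECT e.name, e.state_desc, t.port
FROM sys.service_broker_endpoints AS e
JOIN sys.tcp_endpoints AS t
  ON t.endpoint_id = e.endpoint_id;

-- Routes as they currently exist (run in each broker-enabled database).
SELECT name, remote_service_name, broker_instance, address
FROM sys.routes;
```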

    Thanks,
    ~ Remus




    Sunday, November 13, 2005 5:45 AM
    Moderator
  •     You are right. I just want to send messages from the virtual SQL instance 192.168.10.104\Inst01 to 192.168.10.4\Inst02.
        I changed the CREATE ROUTE statements to use machine names and tested Service Broker again, and got these error messages:

    target:
    Broker:connection
        An error occurred while receiving data: '64(The specified network name is no longer available.)'.
    Broker:message undeliverable:
        This message could not be delivered because the 'receive broker error' action cannot be performed in the 'CLOSED' state.

    initiator:
    Broker:connection
        A new connection was established with the same peer and this connection lost the arbitration. State 79.
    Broker:message undeliverable:
        This message could not be delivered because the 'receive end conversation' action cannot be performed in the 'DISCONNECTED_INBOUND' state.



        Other information for you:

        1. Both endpoints use the default TCP port, 4022.
        2. 192.168.10.3 is active and 192.168.10.4 is passive.
        3. The CREATE ROUTE statements are generated from a config table, so it is difficult to post them here, but I think they should not be the problem because I tested them before with a single network interface card.

    Monday, November 14, 2005 4:18 AM
  • Could you try to separate the ports, have one instance use 4022 and the other instance use another port, say 4023?
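
    Roughly, the change would look like the sketch below. BrokerEndpoint, RouteToTarget and targethost are placeholder names, not taken from your setup; the real endpoint name can be found in sys.service_broker_endpoints.

```sql
-- On the instance that should move to the new port (placeholder endpoint name).
ALTER ENDPOINT BrokerEndpoint
    AS TCP (LISTENER_PORT = 4023);

-- On the other instance, point the route at the new port (placeholder names).
ALTER ROUTE RouteToTarget
    WITH ADDRESS = N'TCP://targethost:4023';
```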

    I'm sorry I have to ask you to use this trial and error approach, but I just don't have a cluster available for testing right now. I'll try to get one set up tomorrow.

    HTH,
    ~ Remus

    Monday, November 14, 2005 9:00 AM
    Moderator
  •     First, the two instances can send messages and reply normally before the error occurs, so I think the routes should not be the problem.
        Second, the error still occurs after I change the port of 192.168.10.4\Inst02 to 4023.
        BTW, is there anything to watch out for in a cluster environment?

        Thanks for your help.
    Monday, November 14, 2005 10:00 AM
  • I feared that having both the non-clustered instance and the virtual instance listening on port 4022 might create a conflict in the case of a SQL failover from 192.168.10.3 to 192.168.10.4.

    So I actually installed two SQL instances in the configuration you described. It took me a while to get the cluster up, but I have now tested SSB connectivity between them and, even though I also see the 'arbitration lost' error messages, I could not reproduce the problem you described. In my case dialog message traffic flows fine even after the arbitration event occurs.

    Is 192.168.10.3/4 the private (heartbeat) address of the nodes, or the public address?

    Thursday, November 17, 2005 7:29 AM
    Moderator
  • here are the details:

    sb initiator

    sb machine – single private ip address
    sb endpoint – port 4025
    sb route – public IP address of target: 4026

    initiator firewall

    fixed nat – map public ip address of sb machine to sb machine private address
    allow incoming tcp: 4025 connection to sb machine private address

    sb target

    sb machine – single private ip address
    sb endpoint – port 4026
    sb route – public IP address of initiator: 4025

    target firewall

    fixed nat – map public ip address of sb machine to sb machine private address
    allow incoming tcp: 4026 connection to sb machine private address

    i get the following error messages on both ends:

    duplicate messages
    connection handshake failed. state 80
    new connection, arbitration failed 79

    the application messages are being delivered, but i would like to see a clean trace with only the expected events.

    it appears that the ack timeout may be low, and connections are closed 90 secs after inactivity. would it help if i used the same port numbers on both ends?

    thanks.

    Thursday, November 17, 2005 7:03 PM
  • here's the fix that worked for me:

    use the same SSB endpoint TCP port number on both ends if you have firewalls that do stateful inspection. i am now getting a clean trace with no duplicate messages or handshake failures.
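
    for anyone who wants to try this, a rough sketch of the endpoint definition follows. BrokerEndpoint and port 4022 are only examples, and windows authentication is an assumption; use whatever authentication your endpoints already have.

```sql
-- Run on both the initiator and the target instance so the listener ports match.
-- Endpoint name, port, and AUTHENTICATION = WINDOWS are illustrative assumptions.
CREATE ENDPOINT BrokerEndpoint
    STATE = STARTED
    AS TCP (LISTENER_PORT = 4022)
    FOR SERVICE_BROKER (AUTHENTICATION = WINDOWS);
```

    an instance can have only one service broker endpoint, so if one already exists, change its port with ALTER ENDPOINT instead. the route addresses and firewall rules on both sides then need to use the same port as well.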

    hope this helps
    Thursday, November 17, 2005 8:42 PM
  •     192.168.10.3 and 192.168.10.4 are the public IPs. Both machines have another network interface card with heartbeat IPs 192.168.0.1 and 192.168.0.2.

        I have tested with the same endpoint port and with different endpoint ports, but I get the same error messages.
    Friday, November 18, 2005 7:22 AM
  • Can you download the script from http://www.gotdotnet.com/codegallery/releases/checkForDownload.aspx?id=9f7ae2af-31aa-44dd-9ee8-6b6b6d3d6319&releaseid=49374ab0-f1e0-42b3-b7d1-433c7af60aab, run it, and send me the output at remus.rusanu at microsoft dot com so I can investigate the issue? This script will collect SSB-related info from your machine (services, queues, routes, certificates, endpoints etc.) and accumulate all of it in one XML result. Make sure you capture the entire output (SQL Server Management Studio might truncate the output by default).
    If possible, please also send me the captured Profiler traces showing the problem happening.

    Thanks,
    ~ Remus
    Thursday, December 01, 2005 4:37 AM
    Moderator
  • Sorry for seeing your reply so late. I have sent you an email; please check it. Thanks.
    Friday, December 09, 2005 7:19 AM