Service Bus Message Broker stuck in "Starting" state

  • Question

  • I have a Workflow Manager and Service Bus farm configured and working. I am trying to add a second server to the farm. During configuration, as the second host is being added through Add-SBHost, an attempt is made to stop and start the Service Bus services on the first host. The start fails because the Service Bus Message Broker service gets stuck in the "Starting" state. It retries 3 times, each attempt taking around 7 minutes, and each failure ends with a message that looks like this:

    ServiceBus Broker service failed to start, retry count 1.  Exception message: Operation timed out..  Stack Trace:    at System.Fabric.FabricRuntime.Create(Action fabricExitCallback)
       at Microsoft.Cloud.ServiceBus.MessageContainerHost.Ring.Join()
       at Microsoft.Cloud.ServiceBus.MessageContainerHost.MessageContainerHostComponent.Open()
       at Microsoft.Cloud.HostingModel.ComponentHost.OpenComponent(IComponent component, RequestTracker tracker)
       at Microsoft.Cloud.HostingModel.ComponentHost.Open()
       at Microsoft.ServiceBus.MessageBroker.Backend.OnStart(String[] args)

    This error is preceded by 2 others that present no more information than the above does:

    TrackingId: f8ff41e9-53e0-48fc-b4de-b7505d4d5f7b, SubsystemId: NoSystemTracker. Failed to open Messaging Host Component System.TimeoutException: Operation timed out. ---> System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80071BFF
       at System.Fabric.Interop.NativeRuntime.FabricEndCreateRuntime(IFabricAsyncOperationContext context)
       at System.Fabric.FabricRuntime.NativeFabricRuntimeFactory.InitializeFabricRuntimeEndWrapper(FabricRuntime runtime, AsyncCallOutAdapter adapter)
       at System.Fabric.Interop.Utility.FinishNativeAsyncInvoke[TResult](String functionTag, Func`2 endFunc, TaskCompletionSource`1 source, AsyncCallOutAdapter adapter, Boolean expectCompletedSynchronously)
       --- End of inner exception stack trace ---
       at System.Fabric.FabricRuntime.Create(Action fabricExitCallback)
       at Microsoft.Cloud.ServiceBus.MessageContainerHost.Ring.Join()
       at Microsoft.Cloud.ServiceBus.MessageContainerHost.MessageContainerHostComponent.Open()

    and...

    Windows Fabric Runtime create failed Exception System.TimeoutException: Operation timed out. ---> System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80071BFF
       at System.Fabric.Interop.NativeRuntime.FabricEndCreateRuntime(IFabricAsyncOperationContext context)
       at System.Fabric.FabricRuntime.NativeFabricRuntimeFactory.InitializeFabricRuntimeEndWrapper(FabricRuntime runtime, AsyncCallOutAdapter adapter)
       at System.Fabric.Interop.Utility.FinishNativeAsyncInvoke[TResult](String functionTag, Func`2 endFunc, TaskCompletionSource`1 source, AsyncCallOutAdapter adapter, Boolean expectCompletedSynchronously)
       --- End of inner exception stack trace ---
       at System.Fabric.FabricRuntime.Create(Action fabricExitCallback)
       at Microsoft.Cloud.ServiceBus.MessageContainerHost.Ring.Join()

    The Windows Fabric Runtime creation itself seems to be retried 6 times, each try taking around 70 seconds and each time it fails, the following information message is posted to the log:

    Windows Fabric Runtime create failed (Retries left 6) Exception System.TimeoutException: Operation timed out. ---> System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80071BFF
       at System.Fabric.Interop.NativeRuntime.FabricEndCreateRuntime(IFabricAsyncOperationContext context)
       at System.Fabric.FabricRuntime.NativeFabricRuntimeFactory.InitializeFabricRuntimeEndWrapper(FabricRuntime runtime, AsyncCallOutAdapter adapter)
       at System.Fabric.Interop.Utility.FinishNativeAsyncInvoke[TResult](String functionTag, Func`2 endFunc, TaskCompletionSource`1 source, AsyncCallOutAdapter adapter, Boolean expectCompletedSynchronously)
       --- End of inner exception stack trace ---
       at System.Fabric.FabricRuntime.Create(Action fabricExitCallback)
       at Microsoft.Cloud.ServiceBus.MessageContainerHost.Ring.Join()

    Can anyone throw some light on what could be causing this failure? I am hoping the answer is not uninstall everything and reinstall.

    Thursday, June 6, 2013 2:42 PM

All replies

  • This issue can have many, many causes:

    1. The SPN of the machine(s) is not set up correctly (Kerberos issues, …).
    2. The machine is not correctly joined to AD. For example, you are running a copy of a VM image (same hostname, but different SID) that has not been added to AD.
    3. More reasons: http://blogs.msdn.com/b/sguyge/archive/2013/04/29/running-service-bus-server-on-a-legacy-domain-without-a-fully-qualified-domain-name.aspx

    Damir


    Damir Dobric
    developers.de
    daenet.de
    daenet.eu
    daenet.com

    Sunday, June 9, 2013 4:18 PM
  • Thanks for responding, Damir. I have verified that the fully qualified domain name has been specified correctly for all machines. So that doesn't seem to be the problem here. We are not using Kerberos and I am unable to verify the settings in AD but I would imagine these should be OK given that when the primary server was added to the SB farm, everything did work fine. It wasn't as if the Service Bus Message Broker never started. It only started causing this problem when the second server was added and the services were stopped and restarted on primary.

    So it looks like there is no way to identify a root cause here? My only option is to reinstall the farm from scratch?

    Monday, June 10, 2013 4:17 PM
  • I've seen this happen a few times on reboot and have had to restart the services.

    1. Find the corresponding process in Task Manager and kill it.

    2. Start the service again using services.msc.

    At times this works; at other times I've also had to stop the Gateway and Fabric services (in Task Manager) to get it up and running.

    A re-install is not needed.

    One time I found that the corresponding SQL Server service had not started properly after the reboot. Once SQL Server was started and all 3 services (Fabric, Gateway, Broker) were restarted, everything started working.

    It's happened at other times and I've not found an underlying issue for those cases.
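The restart sequence described in this reply could be sketched roughly as below. This is only an illustration, not a tested procedure: the three service names and the Fabric, Gateway, Broker start order are assumptions (the thread does not give them), so check them in services.msc first. Note also that `net stop` may not work on a service stuck in "Starting", which is why the reply kills the process first. By default the script only prints the plan.

```python
import subprocess

# Assumed service names; the thread only says "fabric, gateway, broker".
# Verify the real names in services.msc before running anything.
SERVICES_IN_START_ORDER = [
    "FabricHostSvc",               # Windows Fabric host (assumption)
    "Service Bus Gateway",         # assumption
    "Service Bus Message Broker",  # assumption
]


def restart_plan(services):
    """Return net.exe commands: stop in reverse order, then start in order."""
    stops = [["net", "stop", name] for name in reversed(services)]
    starts = [["net", "start", name] for name in services]
    return stops + starts


def run_plan(plan):
    """Run each command and report its exit code (needs an elevated prompt)."""
    for cmd in plan:
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(" ".join(cmd), "-> exit code", result.returncode)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in restart_plan(SERVICES_IN_START_ORDER):
        print(" ".join(cmd))
```

To actually execute the plan, call `run_plan(restart_plan(SERVICES_IN_START_ORDER))` from an elevated prompt, after confirming that SQL Server is running.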

    Thursday, June 27, 2013 4:56 PM
  • Happened again, this time with Service Bus 1.1 

    The issue is the Service Bus Message Broker service would not start, or would stay stuck at 'Starting'.

    Restarted all the other services, restarted SQL Server.

    Neither of these helped or changed the outcome.

    Ultimately, I used Service Bus Configurator to leave the farm, then ran it again and rejoined the existing farm. Once that finished, the Message Broker service would run again.

    What changed? AFAIK the only change was time passed. On Friday it was working, then on Monday it wasn't.

    Monday, December 23, 2013 8:17 PM
  • In my case, Proxifier added an additional host entry and Windows Fabric crashed.

    In PowerShell, run:

    [system.net.dns]::GetHostAddresses('localhost')

    If the result is

    Address            :
    AddressFamily      : InterNetworkV6
    ScopeId            : 0
    IsIPv6Multicast    : False
    IsIPv6LinkLocal    : False
    IsIPv6SiteLocal    : False
    IsIPv6Teredo       : False
    IsIPv4MappedToIPv6 : False
    IPAddressToString  : ::1
    
    Address            : 16777343
    AddressFamily      : InterNetwork
    ScopeId            :
    IsIPv6Multicast    : False
    IsIPv6LinkLocal    : False
    IsIPv6SiteLocal    : False
    IsIPv6Teredo       : False
    IsIPv4MappedToIPv6 : False
    IPAddressToString  : 127.0.0.1

    then it is OK to configure Service Bus.

    But if you see four records

    PS > [system.net.dns]::GetHostAddresses('localhost')
    
    
    Address            :
    AddressFamily      : InterNetworkV6
    ScopeId            : 0
    IsIPv6Multicast    : False
    IsIPv6LinkLocal    : False
    IsIPv6SiteLocal    : False
    IsIPv6Teredo       : False
    IsIPv4MappedToIPv6 : False
    IPAddressToString  : ::1
    
    Address            :
    AddressFamily      : InterNetworkV6
    ScopeId            : 0
    IsIPv6Multicast    : False
    IsIPv6LinkLocal    : False
    IsIPv6SiteLocal    : False
    IsIPv6Teredo       : False
    IsIPv4MappedToIPv6 : False
    IPAddressToString  : ::1
    
    Address            : 16777343
    AddressFamily      : InterNetwork
    ScopeId            :
    IsIPv6Multicast    : False
    IsIPv6LinkLocal    : False
    IsIPv6SiteLocal    : False
    IsIPv6Teredo       : False
    IsIPv4MappedToIPv6 : False
    IPAddressToString  : 127.0.0.1
    
    Address            : 16777343
    AddressFamily      : InterNetwork
    ScopeId            :
    IsIPv6Multicast    : False
    IsIPv6LinkLocal    : False
    IsIPv6SiteLocal    : False
    IsIPv6Teredo       : False
    IsIPv4MappedToIPv6 : False
    IPAddressToString  : 127.0.0.1
    

    then there is a problem.
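The same duplicate check can be automated. The following is a minimal sketch, assuming that Python's `socket.getaddrinfo` surfaces the same duplicated entries as the .NET call above (not verified against Proxifier). The function names `resolve` and `duplicate_addresses` are made up for this illustration.

```python
import socket
from collections import Counter


def duplicate_addresses(addresses):
    """Return the addresses that appear more than once, sorted."""
    counts = Counter(addresses)
    return sorted(addr for addr, n in counts.items() if n > 1)


def resolve(host="localhost"):
    """Resolve host to a list of IP address strings.

    Restricting to SOCK_STREAM avoids getting each address once per socket type.
    """
    infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM)
    return [info[4][0] for info in infos]


if __name__ == "__main__":
    try:
        found = resolve("localhost")
    except socket.gaierror as exc:
        print("Could not resolve localhost:", exc)
    else:
        print("Resolved:", found)
        dupes = duplicate_addresses(found)
        if dupes:
            print("Duplicate entries (possible Winsock provider problem):", dupes)
        else:
            print("No duplicate entries.")
```

A healthy machine should report no duplicates; the four-record case shown above would list both `127.0.0.1` and `::1`.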

    If you don't have Proxifier but still see four records, check the Winsock providers:

    netsh.exe  winsock show catalog

    One of the providers is causing the trouble.

    In my case:

    PS > netsh  winsock show catalog
    
    Winsock Catalog Provider Entry
    ------------------------------------------------------
    Entry Type:                         Layered Chain Entry
    Description:                        PROXIFIER MSAFD Tcpip [TCP/IP]
    Provider ID:                        {8529F118-12D6-4F6B-A26B-91B03C503E41}
    Provider Path:                      %SystemRoot%\system32\PrxerDrv.dll
    Catalog Entry ID:                   1027
    Version:                            2
    Address Family:                     2
    Max Address Length:                 16
    Min Address Length:                 16
    Socket Type:                        1
    Protocol:                           6
    Service Flags:                      0x20066
    Protocol Chain Length:              2
    Protocol Chain: 1026 : 1001



    nigo

    Tuesday, October 21, 2014 5:29 AM
  • After HOURS of running into this problem, what worked for me was enabling TLS 1.0 in the registry.

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.0\Client]
    "Enabled"=dword:00000001

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.0\Server]
    "Enabled"=dword:00000001

    I had previously disabled these on all my SP servers while trying to get Office Online Server working.
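Before changing the registry, it may help to see what is currently set. The sketch below reads the two `Enabled` values from the keys quoted above using Python's standard `winreg` module (Windows only). The helper names `describe` and `read_enabled` are made up for this illustration, and treating a missing value as "OS default applies" is an assumption about SCHANNEL behaviour.

```python
import sys

# Registry path taken from the keys quoted in this reply (under HKEY_LOCAL_MACHINE).
SCHANNEL_TLS10 = (
    r"SYSTEM\CurrentControlSet\Control\SecurityProviders"
    r"\SCHANNEL\Protocols\TLS 1.0"
)


def describe(enabled):
    """Turn a raw 'Enabled' DWORD (or None if absent) into a readable label."""
    if enabled is None:
        return "not set (OS default applies)"  # assumption about the default
    return "enabled" if enabled != 0 else "disabled"


def read_enabled(role):
    """Read the 'Enabled' value for role 'Client' or 'Server'; None if missing."""
    import winreg  # imported here because the module exists only on Windows

    try:
        with winreg.OpenKey(
            winreg.HKEY_LOCAL_MACHINE, SCHANNEL_TLS10 + "\\" + role
        ) as key:
            value, _value_type = winreg.QueryValueEx(key, "Enabled")
            return value
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    if sys.platform == "win32":
        for role in ("Client", "Server"):
            print("TLS 1.0", role + ":", describe(read_enabled(role)))
    else:
        print("This check only works on Windows.")
```

If either role reports "disabled", that matches the situation described in this reply.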

    Tuesday, March 5, 2019 1:26 PM
  • Hi,

    Thanks. This fixed my issue: I was not able to start workflows on a list manually, but after enabling TLS 1.0 it worked.

    Wednesday, October 14, 2020 8:55 AM