Can't Start 2nd Host in AppFabric Cluster

    Question

  • I've been having issues getting an AppFabric cluster running with 2 hosts.  I've verified that it works with only 1 host (i.e., the cache item count increases as I navigate the web app), but when I try joining the 2nd host to the cluster I get the following exception:

    Failed to read remote registry key from host 289851-cache2: Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCAdmin026>:SubStatus<ES0001>:Remote registry access failed on host 289851-cache2. Check if the required permissions are available. ---> System.IO.IOException: The network path was not found.

     

      at Microsoft.Win32.RegistryKey.Win32ErrorStatic(Int32 errorCode, String str)

      at Microsoft.Win32.RegistryKey.OpenRemoteBaseKey(RegistryHive hKey, String machineName, RegistryView view)

      at Microsoft.ApplicationServer.Caching.AdminApi.CacheAdmin.GetRemoteRegistryKey(String hostName, Boolean writable)

      --- End of inner exception stack trace ---

      at Microsoft.ApplicationServer.Caching.AdminApi.CacheAdmin.GetRemoteRegistryKey(String hostName, Boolean writable)

      at Microsoft.ApplicationServer.Caching.AdminApi.CacheAdmin.GetServerVersion(String hostName),DistributedCache.CacheAdmin,Error

     

    Where 289851-cache2 is the name of the second host.

    Even though the exception said check permissions, I've changed the cache connection account to be my domain account and granted access everywhere - the account is an admin on both AppFabric host machines, has full access to the network share, and I run the AppFabric PowerShell as administrator. But after I run Start-CacheCluster the first host will have a status of UP while the second host will have a status of STARTING indefinitely. I need to restart the machine to do anything else since it won't let me stop, start, or kill the process even via task manager or through services in the control panel.

    We're running 64-bit Windows Server 2008 with WindowsServerAppFabricSetup_x64_6.0 installed (not 6.1, since I believe that requires Windows Server 2008 R2), and the cluster configuration is stored on a network share.

    The DistributedCacheAgent.config of host 1:

          <host replicationPort="22236" arbitrationPort="22235" clusterPort="22234" hostId="1739552749" size="1228" leadHost="true" account="<domain account>" name="localhost" cacheHostName="AppFabricCachingService" cachePort="22233" />

    The DistributedCacheAgent.config of host 2:

          <host replicationPort="22236" arbitrationPort="22235" clusterPort="22234" hostId="1739552749" size="1228" leadHost="false" account="<domain account>" name="localhost" cacheHostName="AppFabricCachingService" cachePort="22233" />

    Thanks in advance.

    Monday, August 30, 2010 9:15 PM

Answers

  • I have resolved it!!!

    Basically the ports that AppFabric was configured to use - 22233, 22234, 22235, 22236 - were blocked which was causing the final problem.

    I thought I would detail here how I troubleshot and resolved this, in the hope that it helps someone else.

    First of all I installed nmap from http://nmap.org/download.html and ran the following command through the Zenmap gui to do a port scan:

    nmap -p 22233-22236 ***.**.96.*** -P0

    Which resulted in the following output:

     

    PORT      STATE SERVICE
    22233/tcp open  unknown
    22234/tcp open  unknown
    22235/tcp open  unknown
    22236/tcp open  unknown
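
A scan like the one above can also be approximated without nmap. The following is a minimal Python sketch, not part of the original post: it attempts a plain TCP connection to each of the four AppFabric ports and reports whether the connection succeeded. The function name `probe_ports` and the 2-second timeout are illustrative assumptions. Unlike nmap, it cannot tell "closed" from "filtered"; it only distinguishes "connected" from "did not connect".

```python
import socket

# Ports the cluster in this thread was configured to use
# (cache, cluster, arbitration, replication).
APPFABRIC_PORTS = (22233, 22234, 22235, 22236)


def probe_ports(host, ports=APPFABRIC_PORTS, timeout=2.0):
    """Return {port: bool}, True when a TCP connection to host:port succeeded."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the name and performs the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Refused, timed out, unreachable, or the name did not resolve.
            results[port] = False
    return results


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    for port, is_open in probe_ports(target).items():
        print(f"{port}/tcp {'open' if is_open else 'not reachable'}")
```

Running it against both the host name and the raw IP address of the other cache host shows whether the two behave differently.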
    

    So according to that, all the ports are open and everything should be good. That then got me thinking - how does AppFabric know what IP address to connect to? This must come from either the name or account attribute of the host entry in ClusterConfig.xml. If this is the case then it must be using a DNS entry to resolve the name to an IP address. Our servers are set up with an internal and a much more locked down external IP address, so if the server name happens to be resolving to the external IP then that could be the cause of the problem. The next thing I did was ping the name of one of the cache servers, and sure enough the host name was resolving to the external IP address. I then ran nmap again against both the host name and the external IP address, and the two results were pretty much identical to each other:

     

    PORT      STATE    SERVICE
    22233/tcp filtered unknown
    22234/tcp filtered unknown
    22235/tcp filtered unknown
    22236/tcp filtered unknown
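
The name-resolution check described above (which address does the other host's name actually resolve to?) can be sketched in Python. This is an illustrative sketch, not something from the original post: `resolved_addresses` and `classify` are hypothetical helper names, and treating the standard private and loopback ranges as "internal" is an assumption that may not match every network layout.

```python
import ipaddress
import socket


def resolved_addresses(hostname):
    """Return the sorted, de-duplicated IP addresses the name resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def classify(address):
    """Label an address 'internal' if it is private or loopback, else 'external'."""
    # Drop any IPv6 scope suffix such as '%eth0' before parsing.
    ip = ipaddress.ip_address(address.split("%", 1)[0])
    return "internal" if (ip.is_private or ip.is_loopback) else "external"


if __name__ == "__main__":
    import sys

    name = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    for addr in resolved_addresses(name):
        print(f"{name} -> {addr} ({classify(addr)})")
```

If a cache host's name comes back labelled "external" when run from the other cache host, that is the situation described in this post.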
    
    

    None of the ports were available - result! I then simply added a hosts file entry on both servers to point the other server's name at its internal IP. After doing that I was able to start the cache cluster, and both hosts are up. I'm not suggesting that adding host entries is the proper solution to this problem; I'm not a security expert and have no idea what problems this may cause, so I will now go to my ISP and determine what we should really do.

    In summary we had two problems that were preventing the cache cluster from working.

    1. The Remote Registry service wasn't running on either of our servers; it is required so that AppFabric can access the registry of a machine remotely.
    2. AppFabric was unable to connect to any of the configured ports.

    Hopefully in later releases of AppFabric we can get some better logging in place as, in particular for the second problem, we had pretty much nothing to go on.

    Thanks again for all your help Jason. Hopefully this will help someone else in the future.


    http://www.sharpcoder.co.uk
    • Proposed as answer by S1mm0 Wednesday, September 01, 2010 4:00 PM
    • Marked as answer by Lester Gloria Wednesday, September 01, 2010 6:41 PM
    Wednesday, September 01, 2010 4:00 PM
  • I looked closer at what was going on behind the scenes when you get this exception. It's trying to open this registry key on the remote host:

       HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\AppFabric\v1.0\Version

    One thing I noticed is that this key is constructed based on the version of AppFabric, in this case v1.0. So if the versions of AppFabric on the two cache hosts did not match (for example, if one machine had a Beta install and the other RTM), this is an exception you could get, since the registry key wouldn't exist on the remote machine. I don't know if this is happening, so this is just a guess. However, you state that you're using the same RTM setup on both hosts, so I'll move on.

    What I would suggest is to do the following from each cache host:

    1. On the start menu, get to the Run dialog.
    2. Launch RegEdit.exe.
    3. In the registry editor, go to the File menu.
    4. Click Connect Network Registry.
    5. Type in the name of the other server and then connect.
    6. Now navigate to the path above. See if you have any access errors or if that path doesn't exist.
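
The manual RegEdit test above could also be scripted. The following is a hedged Python sketch using the Windows-only `winreg` standard library module; it is not from the original post. It treats the quoted path as a key, as the post describes it (if `Version` turns out to be a value under `v1.0`, the path would need adjusting). The helper names and the hint texts are illustrative; the numeric codes are the standard Win32 codes for "file not found" (2), "access denied" (5), and "network path not found" (53).

```python
import sys

# Key path quoted in the post, relative to HKEY_LOCAL_MACHINE.
APPFABRIC_VERSION_KEY = r"SOFTWARE\Microsoft\AppFabric\v1.0\Version"

# Win32 error codes that are plausible when a remote registry open fails.
WIN32_HINTS = {
    2: "not found: the key does not exist on the remote host",
    5: "access denied: check the account's permissions on the remote host",
    53: "network path not found: check name resolution, firewall, "
        "and that the Remote Registry service is running",
}


def explain(winerror):
    """Map a Win32 error code to a troubleshooting hint."""
    return WIN32_HINTS.get(winerror, f"unrecognized Win32 error {winerror}")


def check_remote_key(host, key_path=APPFABRIC_VERSION_KEY):
    """Try to open key_path under HKLM on host; return 'ok' or a hint."""
    import winreg  # Windows-only module, imported lazily

    try:
        with winreg.ConnectRegistry(rf"\\{host}", winreg.HKEY_LOCAL_MACHINE) as hive:
            with winreg.OpenKey(hive, key_path):
                return "ok"
    except OSError as exc:
        return explain(getattr(exc, "winerror", None))


if __name__ == "__main__" and sys.platform == "win32":
    print(check_remote_key(sys.argv[1] if len(sys.argv) > 1 else "localhost"))
```

As with the manual steps, it would need to be run from each cache host against the other.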

    Let me know what this test gives you. Hopefully it will shed some light. And please remember to do the remote registry test going from each machine to the other.

    Thanks!

    Jason Roth

    • Marked as answer by Lester Gloria Wednesday, September 01, 2010 6:42 PM
    Tuesday, August 31, 2010 9:52 PM

All replies

  • Hi. I'd like to help you get this working. First, I have a few questions.

    1. Are you running the RTM version of AppFabric that was released in July? (http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=467e5aa5-c25b-4c80-a6d2-9f8fb0f337d2)
    2. Where are you finding the DistributedCacheAgent.config file? In my C:\Windows\System32\AppFabric directory, I have a DistributedCacheService.exe.config file. Is this what you're looking at?
    3. In your XML snippet, you have account="<domain account>". However, if you are working in a domain environment, the only supported configuration is to use the "NT AUTHORITY\NETWORK SERVICE" account here.

    My other concern is whether the appropriate firewall ports have been opened up. If you used the UI to configure AppFabric Caching on each host, there would have been an option near the end for the Windows Firewall. But if you didn't use the UI, didn't check the checkbox, or have a third-party firewall, you'll have to take some action to get the machines to work together properly. By the way, if you're using scripting, we have a white paper that covers the correct way to script the configuration of AppFabric Caching: http://msdn.microsoft.com/en-us/library/ff921027.aspx.

    Thanks.

    Jason Roth

     

    Tuesday, August 31, 2010 2:09 AM
  • Thanks Jason, here's some of the info you asked:

    1. I'm running RTM, specifically WindowsServerAppFabricSetup_x64_6.0.exe (we're running Windows Server 2008, not R2).
    2. That's the path where I'm looking at the config. I've also exported the cluster config to make sure the hosts are properly configured.
    3. I wasn't aware that only the NETWORK SERVICE account is supported, but I had it originally set to NETWORK SERVICE account and it was still giving me the same error.  I will change it back.

    I have been using the UI to configure AppFabric Caching on each host, but there is no firewall setup since the last step has all the options greyed out.

    I'm currently working on a different environment to see if it's a specific environmental issue, since I've only been testing in 1 environment which is more restrictive.

    Lester

    Tuesday, August 31, 2010 2:52 PM
  • Lester,

    OK. Let me know if the new environment has the same problem. If the firewall options are greyed out, it is possible that the rules were already set up for the firewall. You might want to check the following:

    1. Start | Administrative Tools | Windows Firewall with Advanced Security.
    2. Click on Inbound rules.
    3. Look for two rules for "AppFabric Caching Service (TCP-In)".
    4. Look for rules for "Remote Service Management" (there might be three of these).
    5. Make sure all of these rules are enabled on the problem machine.

    Again, if you have some other firewall installed other than the Windows Firewall, there may be manual steps required for that firewall to allow these same types of actions.

    Jason

    Tuesday, August 31, 2010 3:20 PM
  • Okay, so the new environment has the same problem with the same error message.

    I checked the firewall settings, the 1 rule for AppFabric Caching Service (TCP-In) and the 3 rules for Remote Service Management were all disabled, so I enabled them.  But, Windows Firewall is disabled in the first place so it shouldn't make a difference.

    Based on the exception text, I'm trying to understand why it's trying to read the remote registry and what I have to do to give it proper access.  It seems possible that the error is either:

    1. It doesn't have the appropriate permission to access the registry, OR
    2. The network path is invalid but it's throwing a misleading exception message about permissions.

    If there was some way to tell what network path it's using or what account it's connecting as, that should help lead me in the right direction (although I've added the NETWORK SERVICE account as a local admin on both boxes).

    Thanks,
    Lester

    Tuesday, August 31, 2010 6:26 PM
  • I'll try to do some more research on this. Are these machines part of a domain or is this a workgroup?

    Jason

    Tuesday, August 31, 2010 7:42 PM
  • Hi Jason

    I work with Lester and have been looking into this problem with him. Thanks for your help so far...

    I have managed to resolve the "Remote registry access" problem. I discovered that this was because the "Remote Registry" service wasn't running on either of the servers. I have started this service on both of the servers and this has, I hesitate to say, moved us forward. We are still unable to get the second host to start however. I can run the Start-CacheCluster cmdlet on either one of our servers and the first host (which is also configured to be a lead) starts but the second one always times out and is left in the starting state.
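
Checking whether the Remote Registry service is running on each host can be scripted. The sketch below is an assumption-laden illustration in Python, not something from the original post: it shells out to the Windows `sc` command to query the `RemoteRegistry` service and parses the `STATE` line. The helper names are hypothetical, and the parsing assumes the English-language layout of `sc query` output.

```python
import re
import subprocess


def parse_sc_state(sc_output):
    """Extract the state name (e.g. 'RUNNING') from `sc query` output, or None."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None


def remote_registry_state(host):
    """Query the RemoteRegistry service on host with sc.exe (Windows only)."""
    completed = subprocess.run(
        ["sc", rf"\\{host}", "query", "RemoteRegistry"],
        capture_output=True,
        text=True,
        check=False,  # a failed query simply yields no STATE line
    )
    return parse_sc_state(completed.stdout)
```

A result other than `RUNNING` for either cache host would match the first problem described in this thread.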

    The output I get from the cmdlet (when run on the second problematic server) is:

    Start-CacheCluster : ErrorCode<ERRCAdmin003>:SubStatus<ES0001>:Time-out occurred on net.tcp://289851-cache2:22233.
    At line:1 char:19
    + Start-CacheCluster <<<<  -Verbose
        + CategoryInfo          : NotSpecified: (:) [Start-CacheCluster], DataCacheException
        + FullyQualifiedErrorId : ERRCAdmin003,Microsoft.ApplicationServer.Caching.Commands.StartCacheClusterCommand
    
    
    HostName : CachePort      Service Name            Service Status Version Info
    --------------------      ------------            -------------- ------------
    289843-cache1:22233       AppFabricCachingService UP             1 [1,1][1,1]
    289851-cache2:22233       AppFabricCachingService STARTING       1 [1,1][1,1]
    VERBOSE: Cluster started successfully.
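
Note that the cmdlet reports "Cluster started successfully" even though one host is still STARTING, so the status column is the thing to check. As a rough sketch (Python, not part of the original post; `hosts_not_up` is a hypothetical helper and the column layout is assumed to match the output above), captured output could be filtered for hosts that never reached UP:

```python
def hosts_not_up(table_text):
    """Return [(host:port, status)] for table rows whose status is not UP."""
    pending = []
    for line in table_text.splitlines():
        parts = line.split()
        # Data rows look like: host:port  service-name  status  version-info...
        if len(parts) < 3 or ":" not in parts[0]:
            continue
        # The text after the last colon must be a port number.
        if not parts[0].rsplit(":", 1)[1].isdigit():
            continue
        if parts[2] != "UP":
            pending.append((parts[0], parts[2]))
    return pending
```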

    I have switched on verbose logging, but unfortunately this tells us very little (which is why I was hesitant in saying we had moved forward):

    Starting lead host 289843-cache1:AppFabricCachingService,DistributedCache.CacheAdmin,Verbose,2010-8-31 19:20:04.490
    Cluster position: Status=Starting host 289843-cache1:AppFabricCachingService, PercentComplete=0, ClusterOperationComplete=False,DistributedCache.AdminPS,Verbose,2010-8-31 19:20:04.506
    Lead hosts started = 0, Quorum count = 1, Timeout remaining = 120,DistributedCache.CacheAdmin,Verbose,2010-8-31 19:20:04.833
    Lead hosts started = 0, Quorum count = 1, Timeout remaining = 119,DistributedCache.CacheAdmin,Verbose,2010-8-31 19:20:05.847
    Lead hosts started = 0, Quorum count = 1, Timeout remaining = 118,DistributedCache.CacheAdmin,Verbose,2010-8-31 19:20:06.861
    Lead hosts started = 0, Quorum count = 1, Timeout remaining = 117,DistributedCache.CacheAdmin,Verbose,2010-8-31 19:20:07.875
    Cluster position: Status=Started host 289843-cache1:AppFabricCachingService, PercentComplete=50, ClusterOperationComplete=False,DistributedCache.AdminPS,Verbose,2010-8-31 19:20:08.889
    Starting non-lead host 289851-cache2:AppFabricCachingService,DistributedCache.CacheAdmin,Verbose,2010-8-31 19:20:08.904
    Cluster position: Status=Starting host 289851-cache2:AppFabricCachingService, PercentComplete=50, ClusterOperationComplete=False,DistributedCache.AdminPS,Verbose,2010-8-31 19:20:08.904
    Cluster position: Status=Cluster started., PercentComplete=50, ClusterOperationComplete=True,DistributedCache.AdminPS,Verbose,2010-8-31 19:21:09.565
    Cluster started successfully.,DistributedCache.AdminPS,Verbose,2010-8-31 19:21:09.846
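
Although the log says little, its timestamps do carry some information. The following Python sketch (an illustration, not from the original post) assumes each line has the form `message,source,level,timestamp`, as in the excerpt above, and splits from the right because the message itself can contain commas. Applied to the "Starting non-lead host" line and the final "Cluster started." line, it gives a gap of about 60.7 seconds with nothing logged in between, which is consistent with the second host simply timing out.

```python
from datetime import datetime


def parse_log_line(line):
    """Split 'message,source,level,timestamp' into a dict."""
    # The message may contain commas, so take the last three fields from the right.
    message, source, level, stamp = line.strip().rsplit(",", 3)
    return {
        "message": message,
        "source": source,
        "level": level,
        "time": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f"),
    }


def seconds_between(first_line, second_line):
    """Elapsed seconds between the timestamps of two log lines."""
    delta = parse_log_line(second_line)["time"] - parse_log_line(first_line)["time"]
    return delta.total_seconds()


if __name__ == "__main__":
    start = ("Starting non-lead host 289851-cache2:AppFabricCachingService,"
             "DistributedCache.CacheAdmin,Verbose,2010-8-31 19:20:08.904")
    end = ("Cluster position: Status=Cluster started., PercentComplete=50, "
           "ClusterOperationComplete=True,DistributedCache.AdminPS,Verbose,"
           "2010-8-31 19:21:09.565")
    print(f"{seconds_between(start, end):.3f} s")  # prints "60.661 s"
```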

    I have tried re-installing AppFabric on both servers but this hasn't helped. Interestingly, I was unable to use the setup.exe to uninstall AppFabric from the problem server. I'm not sure if this suggests the server may have ended up in a strange state or whether it was just caused because the service was still "starting".

    Thanks

    Simon

     


    http://www.sharpcoder.co.uk
    Wednesday, September 01, 2010 12:39 AM
  • Simon/Lester,

    Thanks for letting me know the resolution to the registry exception. That should help others who find this thread.

    Are you using SQL Server for your provider? If so, I would suggest setting up the cluster to use a shared network folder (XML) for the provider just as a test. If this works, then we can look closer at the SQL Server that you're using for the configuration store.

    Other than that, I'll try to think of some other ways that we can figure out what's causing this. Thanks for your patience.

    Jason Roth

    Wednesday, September 01, 2010 10:56 AM
  • Hi Jason,

    We identified that as a simplification that we could make to our setup early on, so we are already using a shared network folder rather than SQLServer - thanks for the idea though.

    I'm going to try reinstalling the second instance of AppFabric again as I'm not happy that it was properly uninstalled previously and then failing that, I'm going to try to install it on another server in our environment to try to determine if it is now something specific to this one server or if it is the entire environment in general.

    Just one further thing to add. I managed to get things set up with two servers in the second environment that Lester talks about earlier in this thread (I only did what Lester did, so I'm not sure what caused the initial problem he had). This shows, however, that there isn't any kind of fundamental problem with the clustered cache approach; it is something specific to how the problem environment is set up. It's just a little difficult to troubleshoot due to the lack of information in the log file.

    Thanks for your continued help with this problem.

    Simon


    http://www.sharpcoder.co.uk
    Wednesday, September 01, 2010 1:20 PM
  • One further thing to add. The environment I have got things working on is using windows server 2008 R2 machines and v6.1 of the AppFabric installer. The environment that we are having problems with is using windows 2008 servers and therefore we are having to use v6.0 of the AppFabric installer.
    http://www.sharpcoder.co.uk
    Wednesday, September 01, 2010 1:27 PM
  • A little bit more information...

    Before trying to install AppFabric on a different machine, I thought I would try changing the configuration to swap which server was configured to be the lead host. The problematic server was configured to not be the lead previously. Changing this configuration and attempting to start the cluster again resulted in the first server not starting this time. Therefore whatever is wrong with our environment appears to affect the host that has lead set to false.

    I also tried setting both hosts to be lead. This resulted in neither of them starting.

    Thanks

    Simon


    http://www.sharpcoder.co.uk
    Wednesday, September 01, 2010 2:12 PM
  • I'm getting the same error: Time-out occurred on net.tcp://servername:22233 , etc ...

    I've done as you said above without any luck - any other ideas?

    Primary host is up and running, second is attached but won't start.

    Thursday, November 18, 2010 9:52 PM
  • Can you go to the secondary host, run powershell on that host, and try start-cachehost there?

    Can you look at your firewall and make sure that you've enabled "Remote Service Management"?

    Can you ping the remote server by server name from the first cache host and vice versa?

    Sorry for not having exact answers, but I'm not sure why it's timing out without more information. Thanks!

    Jason Roth

    Friday, November 19, 2010 12:38 PM
  • I have the same problem and have tried all of the above, but it did not work.

    After joining a new cache host, I cannot start the cache cluster, and I can't start the individual host either.

    Our environment has 3 servers running Windows Server 2008 R2 SP1, with an XML config store for the cluster.

    I checked the firewall and Remote Service Management as you said, and everything is OK.

    I also granted permissions for the Network Service account in the DistributedCache.exe.config file.

    I can ping the remote servers all around.

    Anything else I can try, please!!!


    Friday, January 11, 2013 10:16 AM