locked
ASDK New Install - Unable to deploy a VM - Timeouts RRS feed

  • Question

  • I have a new install of ASDK, and in fact I have had the same problem on 2 deployments and tried to redeploy. In both cases, the installation went fine and without error. After registering ASDK and downloading the Windows 2016 and Ubuntu images from the marketplace and then creating a Plan/Offer and a new user subscription, I attempt to deploy a VM. after 45 minutes, the deployment fails with the status of Conflict and the following:

    { "id": "/subscriptions/0c63d591-e872-42fb-83fc-dde3ddf70053/resourceGroups/TestRG/providers/Microsoft.Resources/deployments/Canonical.UbuntuServer1804LTS-20200306175242/operations/7093BA7A30EB79EB", "operationId": "7093BA7A30EB79EB", "properties": { "provisioningOperation": "Create", "provisioningState": "Failed", "timestamp": "2020-03-06T18:40:00.3894862Z", "duration": "PT46M36.4369885S", "trackingId": "0c4c2eed-b618-4a78-93aa-655a2de6dcfc", "statusCode": "Conflict", "statusMessage": { "status": "Failed", "error": { "code": "ResourceDeploymentFailure", "message": "The resource operation completed with terminal provisioning state 'Failed'.", "details": [ { "code": "VmProvisioningTimeout", "message": "VM 'UbuntuTestVM' failed to provision with timeout." } ] } }, "targetResource": { "id": "/subscriptions/0c63d591-e872-42fb-83fc-dde3ddf70053/resourceGroups/TestRG/providers/Microsoft.Compute/virtualMachines/UbuntuTestVM", "resourceType": "Microsoft.Compute/virtualMachines", "resourceName": "UbuntuTestVM" } }}

    This is happening for all VMs I try to deploy, linux or windows, all marketplace images. However, if I look in Hyper-V Manager, the VMs are happily up and running without issue (well, until the system eventually turns them off in what I assume is a garbage collection action).  System resources during this time are good (CPU is not pegged, over 90gb of ram free, storage is a NVMe SSD). 

    When I look inside the VM itself and look in the AzureWindows logs, I see the following:

    1. [00000010] [03/06/2020 18:56:51.94] [TRACE] Checking if service 'WindowsAzureGuestAgent' is running
    2. [00000010] [03/06/2020 18:56:51.94] [TRACE] ServiceControllerStatus for WindowsAzureGuestAgent is Running
    3. [00000010] [03/06/2020 18:56:51.97] [WARN]  Fabric address in configuration file is null
    4. [00000010] [03/06/2020 18:56:51.98] [INFO]  Checking if the environment is OnPrem.
    5. [00000010] [03/06/2020 18:56:52.00] [INFO]  Try to fetch identity data of the VM from metadata server at http://169.254.169.254/Microsoft.Compute/identity.
    6. [00000010] [03/06/2020 18:56:52.19] [INFO]  Identity endpoint not found response, environment is not OnPrem.
    7. [00000010] [03/06/2020 18:56:52.70] [INFO]  Initializing ControlSystem.
    8. [00000010] [03/06/2020 18:56:53.14] [INFO]  Attempting to discover fabric address on interface Ethernet.
    9. [00000010] [03/06/2020 18:56:53.16] [INFO]  Discovered fabric address on interface Ethernet.
    10. [00000010] [03/06/2020 18:56:53.16] [INFO]  Using fabric address 168.63.129.16 from DHCP.
    11. [00000010] [03/06/2020 18:56:53.16] [INFO]  Using fabric URI http://168.63.129.16/.
    12. [00000010] [03/06/2020 18:56:55.50] [ERROR] GetVersions() failed with exception: Microsoft.ServiceModel.Web.WebProtocolException: Server Error: Not Found (NotFound) ---> System.Net.WebException: The remote server returned an error: (404) Not Found.
    13.    at System.Net.HttpWebRequest.GetResponse()
    14.    at System.ServiceModel.Channels.HttpChannelFactory`1.HttpRequestChannel.HttpChannelRequest.WaitForReply(TimeSpan timeout)
    15.    --- End of inner exception stack trace ---
    16.    at Microsoft.ServiceModel.Web.WebHttpChannelProxy`1.Invoke(IMessage msg)
    17.    at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
    18.    at Microsoft.WindowsAzure.RoleContainer.Protocol.IControlSystem.GetVersions()
    19.    at Microsoft.WindowsAzure.RoleContainer.Protocol.ControlSystem.SelectProtocolVersion().

    What I believe is happening is that the VM provisions succesfully, but does not seem to be able to inform the Stack/Fabric that it is up and running. 

    As I said, this is already the second deployment were this occurs. Does anyone have any tips or things to check for this?


    Thanks

    <style><br _moz_dirty="" /></style>

    Friday, March 6, 2020 7:08 PM

Answers

  • So, I believe I have resolved this issue. It is probably related to the fact that it is a Nested install, and I just used the CloudBuilder.vhdx as the OS drive directly without having a "host" OS.

    From looking at all of the logs, I found that the RDAgent.exe service was responsible for running the Wireserver. This service was up and running, and all tests were passing (test-azurestack was all pass). Restarting the service would also not generate any errors.

    However, I then looked at C:\Logs\AgentLogs and found the following:

    [2020/03/07, 22:29:55.915,  INFO, 00028992] Agent: Started.
    [2020/03/07, 22:29:55.915, ERROR, 00028992] Failing Win32 status detected 0x00000002 (hr=0x80070002) [x:\bt\1036515\repo\src\inc\rdcommonregistry.h] line 169
    
    [2020/03/07, 22:29:55.915,  INFO, 00028992] InitializeCurrentOS: Unable to get current os from registry hr=0x80070002. Trying to determine it.
    [2020/03/07, 22:29:56.087,  INFO, 00028992] MI_Operation_GetInstance() returned 0 instances of class MSFT_DiskImage
    [2020/03/07, 22:29:56.087, FATAL, 00028992] Failed to find OSImageName using GetNodeDiskImages.
    [2020/03/07, 22:29:56.087, ERROR, 00028992] <- InitializeCurrentOS=0x80070490
    [2020/03/07, 22:29:56.087, ERROR, 00028992] <- GetCurrentOS=0x80070490
    [2020/03/07, 22:29:56.087, ERROR, 00028992] Failing HRESULT detected 0x80070490 [x:\bt\1036515\repo\src\agent\agentcore\main.cpp] line 5490

    This was an important clue! The process was trying to determine the correct OS, first by checking for a registry value, and if that fails, trying to determine it by listing all the disks and finding the one that was a MSFT_DiskImage, as would be expected if you were booting in CloudBuilder.vhdx on a host OS. In my case, CloudBuilder.vhdx simply added as the boot drive to my VM hosting ASDK.

    Time to bust out the old Process Monitor and see what the RDAgent.exe process was looking for in Registry. And there i found it trying and failing to find a key called OsImageName in \HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows Azure.

    I added a REG_SZ key with the name OsImageName and put in the date of "Svr2016", which is a common shortened version of the identifier for Server 2016 (total guess..). Restarted the RDAgent service and BOOM! no more error and the log rapidly started filling with entries related to WireContainer and WireServer.

    I restarted some of the VMs that were in a failed state from the portal, and within a few minutes they correctly reported back that they were running. New deployments now also actually finish correctly by informing the portal that the VMs are up and running.

    I will have to get a guide together on installing the latest version of ASDK 1910 in a nested Hyper-V VM, as the older guides out there on the web no longer cover all of the required tweaks and settings, like this one. Hopefully this is the last of the problems with this install!

    <style><br _moz_dirty="" /></style>

    Saturday, March 7, 2020 11:17 PM

All replies

  • I have found some additional information that may be relevant. I see that the 168.63.129.16 address in azurestack is redirected to the ASDK host. I looked into the IIS logs under inetpub and see the requests from the VMs in the logs and returning a 404

    2020-03-06 20:04:24 127.0.0.1 GET /00394b99-ebfb-4e8a-8053-80e9854123f0/ comp=versions 80 - 127.0.0.1 WALinuxAgent/2.2.40 - 404 0 2 1

    Can anyone who has a working ASDK tell me if there are supposed to by anything under c:\inetpub\wwwroot ? i only have iisstart.html, .png and empty folders for aspnet_client.

    Thanks

    <style><br _moz_dirty="" /></style>

    Friday, March 6, 2020 9:03 PM
  • That address you refer to is used as the recursive resolver. In order for a VM to deploy successfully, it must be up and running, and report it's status to Azure Stack.

    The status reporting is NOT required for the machine to boot and running. Even though it might be in a "failed" state, try to log into the VM. I will be able to get you my version of the wwwroot folder soon, but this is likely an issue with the agent communicating with Azure Stack. 

    Does this happen for both Windows and Linux VMs?

    If so, Validate your ASDK deployment and see if there are any failed roles.

    Saturday, March 7, 2020 1:18 AM
  • Travis, Thanks for the reply. I can confirm that the VMs are up and running and are reachable, however the portal is not aware of this. Yes, this is for both Linux and Windows VMs. I have deployed from scratch again and am running into the same issue again. The problem is that because the deployments never succeed, I have not been able to install any of the extra RPs as they fail out (App Service, SQL, etc).

    Where is the Wireserver supposed to be running in an ASDK environment? As in, what VM?

    <style><br _moz_dirty="" /></style>

    Saturday, March 7, 2020 8:58 PM
  • So, I believe I have resolved this issue. It is probably related to the fact that it is a Nested install, and I just used the CloudBuilder.vhdx as the OS drive directly without having a "host" OS.

    From looking at all of the logs, I found that the RDAgent.exe service was responsible for running the Wireserver. This service was up and running, and all tests were passing (test-azurestack was all pass). Restarting the service would also not generate any errors.

    However, I then looked at C:\Logs\AgentLogs and found the following:

    [2020/03/07, 22:29:55.915,  INFO, 00028992] Agent: Started.
    [2020/03/07, 22:29:55.915, ERROR, 00028992] Failing Win32 status detected 0x00000002 (hr=0x80070002) [x:\bt\1036515\repo\src\inc\rdcommonregistry.h] line 169
    
    [2020/03/07, 22:29:55.915,  INFO, 00028992] InitializeCurrentOS: Unable to get current os from registry hr=0x80070002. Trying to determine it.
    [2020/03/07, 22:29:56.087,  INFO, 00028992] MI_Operation_GetInstance() returned 0 instances of class MSFT_DiskImage
    [2020/03/07, 22:29:56.087, FATAL, 00028992] Failed to find OSImageName using GetNodeDiskImages.
    [2020/03/07, 22:29:56.087, ERROR, 00028992] <- InitializeCurrentOS=0x80070490
    [2020/03/07, 22:29:56.087, ERROR, 00028992] <- GetCurrentOS=0x80070490
    [2020/03/07, 22:29:56.087, ERROR, 00028992] Failing HRESULT detected 0x80070490 [x:\bt\1036515\repo\src\agent\agentcore\main.cpp] line 5490

    This was an important clue! The process was trying to determine the correct OS, first by checking for a registry value, and if that fails, trying to determine it by listing all the disks and finding the one that was a MSFT_DiskImage, as would be expected if you were booting in CloudBuilder.vhdx on a host OS. In my case, CloudBuilder.vhdx simply added as the boot drive to my VM hosting ASDK.

    Time to bust out the old Process Monitor and see what the RDAgent.exe process was looking for in Registry. And there i found it trying and failing to find a key called OsImageName in \HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows Azure.

    I added a REG_SZ key with the name OsImageName and put in the date of "Svr2016", which is a common shortened version of the identifier for Server 2016 (total guess..). Restarted the RDAgent service and BOOM! no more error and the log rapidly started filling with entries related to WireContainer and WireServer.

    I restarted some of the VMs that were in a failed state from the portal, and within a few minutes they correctly reported back that they were running. New deployments now also actually finish correctly by informing the portal that the VMs are up and running.

    I will have to get a guide together on installing the latest version of ASDK 1910 in a nested Hyper-V VM, as the older guides out there on the web no longer cover all of the required tweaks and settings, like this one. Hopefully this is the last of the problems with this install!

    <style><br _moz_dirty="" /></style>

    Saturday, March 7, 2020 11:17 PM
  • Thanks for the follow up! Impressive detective work to get the agent working properly!

    1910 has had a few issues with VMs and statuses. 

    Tuesday, March 10, 2020 2:22 AM
  • Brilliant! Thank you for tracking this down and documenting it.

    I was having the same issue with ASDK 2002, using a Nested install (base OS is W2019, created a VM on top of that to install ASDK into). I was stopping at step 305.1, RdAgent issues.

    Also for ASDK 2002, I found that I had to create a D: drive after getting the ASDK Hyper-V setup. If I did not have a D: drive, it was stopping earlier in the process, as there were some directories that the installer was creating (D:\AzureStack and D:\Data). I created D: as a 200GB drive, it looks like the install used about 3GB.

    Thanks again. I agree that the Nested install docs need to be updated.

    Craig Pelkie

    Monday, June 8, 2020 10:28 PM