Azure-hosted Ubuntu 13.10 servers stop responding to SSH after update & reboot

    Question

  • We've been happily running two Ubuntu 13.10 servers on Azure for the past month. Today, we updated one of them and rebooted - and now we cannot SSH in to it at all. We've rebooted ten times, even done a full shutdown, and still can't get in. So we set up another machine to get back up and running, but after installing all the usual packages (node.js, mapnik), we restarted the brand new machine and now that has stopped responding to SSH as well.

    I just went to our main production machine and restarted it - it's been very happy for a month, and we've logged in every day without a problem - and now that is unreachable too. Node (started via Upstart) is running (I can make HTTP requests and get responses perfectly fine), but I can't SSH in to the box.

    I've searched around, and everything I can find online describing a similar issue just suggests restarting the servers repeatedly or, worse, deleting the VMs and manually digging around to try to determine the root cause.

    This really doesn't give me hope - we have three machines all showing the same problem on the same day, after months of running Linux VMs without any trouble. Our older VMs (more than a month old, but regularly updated) are running without any problems.

    Any suggestions or thoughts as to why three have failed in one day? And how can we get them back?

    Thursday, April 10, 2014 9:17 PM

Answers

  • If you are using the Ubuntu 13.10 image, then you might have hit a known issue.

    After the upgrade, the end of the file /etc/ssh/sshd_config contains an invalid parameter (actually two parameters smashed together on one line): “UsePAM yesClientAliveInterval 180”

    This will cause the SSH server to fail to start. It can be remedied ahead of time by simply editing this file after upgrading, but before rebooting.

    However, to fix this after a reboot the only option is to mount the VHD to another VM and fix the /etc/ssh/sshd_config file.
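
    For example, a quick way to check for and repair the smashed line after upgrading but before rebooting (a minimal sketch; the sed pattern assumes the exact broken text shown above):

    # look for the smashed line (no output means the file is fine)
    grep -n 'UsePAM yesClientAliveInterval' /etc/ssh/sshd_config

    # split it back into two lines, keeping a backup at sshd_config.bak
    sudo sed -i.bak 's/^UsePAM yesClientAliveInterval 180$/UsePAM yes\nClientAliveInterval 180/' /etc/ssh/sshd_config

    # test-parse the repaired config; no output and exit status 0 mean sshd will start cleanly
    sudo sshd -t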


    This posting is provided AS IS, with no warranties, and confers no rights.

    • Proposed as answer by Pradeep M G Saturday, April 12, 2014 7:15 AM
    • Marked as answer by NautilyticsDev Monday, April 14, 2014 8:45 PM
    Saturday, April 12, 2014 7:15 AM

All replies

  • Hi,

    For this issue, we would suggest you raise an incident case instead:

    http://azure.microsoft.com/en-us/support/options/

    Regards.


    Vivian Wang

    Saturday, April 12, 2014 5:34 AM
  • Thanks for the replies - I have a ticket in with support, but it's been several days with no solution as yet (though I appreciate there has been a weekend in the way). I would love to attach a VHD, but since all the affected machines are in the same region (Western Europe), I don't have any accessible machines to attach those VHDs to. I could create a new machine there and make a point of not upgrading it, but that got me thinking...

    All the affected machines are in the same region.

    I just created a new machine in East US and updated and upgraded it without trouble. Given that all of our Linux machines in the East US region are unaffected, I may just move development over to the US for now (since development is based in the US anyway), with fingers crossed that our production machine will continue running until a solution is found.

    Monday, April 14, 2014 1:50 PM
  • Thanks Pradeep - after further discussion with Microsoft support, it does indeed sound like the sshd_config file is the issue. Rather than juggle machines and hard disks to recover, we ended up spinning up two fresh machines in this instance (taking the time to create a useful disk image), but hopefully others can benefit from your advice.
    Monday, April 14, 2014 8:47 PM
  • You, sir! I could give you my firstborn child for this answer ;)

    BTW it also affects Ubuntu 12.04.

    Wednesday, April 16, 2014 2:26 AM
  • Since we just had another two machines in a different region go down for the same reason (we're now seeing the trouble on Ubuntu 13.10 boxes on East US storage), I've documented the steps required, first to prevent the problem and second to fix it.

    To prevent losing access, immediately after you run your apt-get updates, check the config file at:

    /etc/ssh/sshd_config

    If the following line appears:

    UsePAM yesClientAliveInterval 180

    add a line break so that it becomes two separate lines, and save the file:

    UsePAM yes

    ClientAliveInterval 180
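
    Before rebooting, you can also ask sshd to test-parse the file (it exits with an error and typically reports the offending line number when the config is invalid):

    sudo sshd -t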

    If you don't check, and you restart with the invalid configuration, you'll lose access. To restore access you will need another machine to which you can attach the hard disk. In our case it meant using another one of our boxes in the same region, but you could also download the VHD, repair it, and re-upload it.

    The steps we took to repair the file were as follows (the shell commands for steps 7-13 are collected in the sketch after this list):

    1. Shut down the troubled machine.
    2. Delete the virtual machine from the Azure panel, but KEEP the hard disk.
    3. Shut down a working machine.
    4. Attach the hard disk from the troubled machine to the working machine.
    5. Start the working machine.
    6. Connect to the working machine.
    7. View the attached drives using the command sudo fdisk -l
    8. Make a note of the second disk listed (assuming the first is the current OS) - you need to mount this drive to access it (it could be named /dev/sdc1).
    9. Create a directory for the mounted drive (where its files will appear once mounted): sudo mkdir /media/brokenmachinename
    10. Mount the drive: sudo mount /dev/sdc1 /media/brokenmachinename
    11. Edit the config: sudo pico /media/brokenmachinename/etc/ssh/sshd_config
    12. Fix the broken config as described above by adding the missing line break.
    13. Unmount the drive: sudo umount /media/brokenmachinename
    14. Shut down the machine and detach the drive from the Azure panel.
    15. Restart the working machine without the other drive attached, and make sure it works and you can still connect.
    16. Create a new VM for the old drive (which should now be fixed).
    17. Select "+ NEW", "Compute", "Virtual Machine", "From Gallery", "My Disks", then select your VHD. (It may take a moment for the released hard disk to show up.)
    18. Boot the new machine and check that you can connect.
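
    As a minimal sketch, the shell commands for steps 7-13 look like this (assuming the attached disk shows up as /dev/sdc1; the mount point name is just an example):

    # list attached disks and identify the one from the troubled machine
    sudo fdisk -l

    # create a mount point and mount the troubled machine's OS disk
    sudo mkdir /media/brokenmachinename
    sudo mount /dev/sdc1 /media/brokenmachinename

    # repair the smashed line (the same sed fix as earlier; editing with pico works too)
    sudo sed -i.bak 's/^UsePAM yesClientAliveInterval 180$/UsePAM yes\nClientAliveInterval 180/' /media/brokenmachinename/etc/ssh/sshd_config

    # unmount before detaching the disk in the Azure panel
    sudo umount /media/brokenmachinename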

    Hope this helps anyone else who has the same trouble!

    Wednesday, May 7, 2014 6:34 PM