VM stopped responding, disk IO causing 'freeze'

  • General discussion

  • Anyone else having problems with their VMs in the Europe region?

    A few hours ago, mine started running extremely slow, basically unusable.

    Thought it could have been a DDOS attack, or the machine being overloaded.

    I eventually stopped Apache, so nothing was running, and even removed the port 80 endpoint.

    Still, the machine would just freeze for 15-30 minutes when cat'ing a file or doing anything IO related, then work for a few minutes, then freeze again.

    From what I can see, it happens when disk IO is performed: the process goes into 'D' state, which suggests it's waiting for the disks to return data.
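    For anyone wanting to check the same thing, here is a minimal sketch (not something from my machine, just an illustration) of finding processes in 'D' (uninterruptible sleep) state by reading /proc/<pid>/stat. The function names are hypothetical, and it assumes a Linux /proc filesystem.

```python
import os


def parse_stat_state(stat_line):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split around the LAST ')'.
    left, _, right = stat_line.rpartition(")")
    pid_text, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_text), comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, command) for every process currently in state 'D'."""
    blocked = []
    if not os.path.isdir(proc_root):
        return blocked  # not a Linux system, nothing to scan
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_state(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable while scanning
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

    Calling uninterruptible_tasks() a few times during a freeze should show whether the same processes stay stuck waiting on IO.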

    I presume there is a problem with the storage system, as this looks similar to what happened last month, which turned out to be a storage system problem. Although the Azure status page said only the US was affected, it also affected Europe.

    Anyone else having problems this evening?

    I'm on a Linux VM, Ubuntu.

    Thanks, Rob.

    • Changed type Vivian_Wang Tuesday, August 12, 2014 2:12 AM
    Thursday, July 31, 2014 7:08 PM

All replies

  • Well, after rebooting, the machine seems ok now.

    Checking out /var/log/syslog, I can see that the system disk was getting IO errors, then it remounted things in Read Only mode.

    After searching the internet, I found this problem reported back in 2012, and there doesn't seem to be a response from MS saying that it's been fixed.

    Can someone look into this and see whether it has been fixed, and if not, why? My problem seems identical to this one:

    http://social.msdn.microsoft.com/Forums/windowsazure/en-US/cae5d9d5-65a3-41b7-83d6-3cc24c418c18/vm-becomes-unresponsive-or-disk-becomes-readonly?forum=WAVirtualMachinesforWindows

    It seems like the MS guy looking into it just 'stopped' responding to people :(

    Thanks, Rob.


    • Edited by Rob Donovan Thursday, July 31, 2014 8:06 PM
    Thursday, July 31, 2014 8:06 PM
  • Hi,

    Thanks for your feedback.

    Maybe you could check the service status history:

    https://azure.microsoft.com/en-us/status/#history

    Regards.


    Vivian Wang

    Friday, August 1, 2014 3:14 AM
  • Well, it's been happening again all night to my machine, so it's not fixed.

    There is nothing on the Azure status page, but that link I gave describes the problem exactly, and no one from MS seems to have responded as to whether it has been fixed or not. Why?

    While I've been asleep, my machine and web sites have basically been unusable :(

    Can someone please look into this problem, as it seems you have a major problem with disk IO or something.

    All I get is very slow performance and tons of IO errors in /var/log/syslog:

    Aug  1 05:46:48 localhost kernel: [36574.655822] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:46:52 localhost kernel: [36578.673982] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:46:52 localhost kernel: [36578.876261] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:46:58 localhost kernel: [36585.015004] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:04 localhost kernel: [36591.044913] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:09 localhost kernel: [36596.416288] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:16 localhost kernel: [36602.835093] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:26 localhost kernel: [36613.482907] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:26 localhost kernel: [36613.537066] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:26 localhost kernel: [36613.567137] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:27 localhost kernel: [36613.685954] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:33 localhost kernel: [36619.651648] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:34 localhost kernel: [36620.675480] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:34 localhost kernel: [36620.721300] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:40 localhost kernel: [36626.870829] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:46 localhost kernel: [36633.359322] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:53 localhost kernel: [36639.675089] JBD2: Detected IO errors while flushing file data on sda1-8
    Aug  1 05:47:59 localhost kernel: [36645.863615] JBD2: Detected IO errors while flushing file data on sda1-8
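    To see how often these errors occur, a small sketch like the following can count the JBD2 messages per minute and per journal. It assumes the syslog line layout shown above; the regex and function name are illustrative, not part of any tool.

```python
import re
from collections import Counter

# Matches lines such as:
# "Aug  1 05:46:48 localhost kernel: [36574.655822] JBD2: Detected IO errors
#  while flushing file data on sda1-8"
JBD2_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+kernel:.*"
    r"JBD2: Detected IO errors while flushing file data on (?P<journal>\S+)"
)


def jbd2_errors_per_minute(lines):
    """Count JBD2 IO-error messages keyed by (minute, journal)."""
    counts = Counter()
    for line in lines:
        match = JBD2_LINE.search(line.strip())
        if match:
            # Collapse the padded day-of-month ("Aug  1") to single spaces.
            minute = " ".join(match.group("stamp").split())
            counts[(minute, match.group("journal"))] += 1
    return counts
```

    Feeding it the lines of /var/log/syslog gives a quick picture of whether the errors are constant or come in bursts.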


    Friday, August 1, 2014 5:52 AM
  • Well, it had also corrupted my disks, so I ran fsck to correct that.

    (But because Azure doesn't give you access to the console, you can't do that so easily on the root device. You have to force the fix during reboot by editing /etc/default/rcS and setting FSCKFIX=yes, then running touch /forcefsck and rebooting.)
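    The two steps in that workaround could be scripted roughly as below. This is only a sketch: the helper names are made up, it must run as root, and it assumes an Ubuntu release of that era that still honours /etc/default/rcS and /forcefsck (newer releases may handle this differently). Reboot afterwards.

```python
import re

# Matches an existing FSCKFIX line, commented out or not.
FSCKFIX_LINE = re.compile(r"^#?[ \t]*FSCKFIX=.*$", re.MULTILINE)


def enable_fsckfix(rcs_text):
    """Return the rcS contents with FSCKFIX=yes set (appended if missing)."""
    if FSCKFIX_LINE.search(rcs_text):
        return FSCKFIX_LINE.sub("FSCKFIX=yes", rcs_text, count=1)
    separator = "" if (not rcs_text or rcs_text.endswith("\n")) else "\n"
    return rcs_text + separator + "FSCKFIX=yes\n"


def force_fsck_on_next_boot(rcs_path="/etc/default/rcS", flag_path="/forcefsck"):
    """Set FSCKFIX=yes and create the /forcefsck flag file (needs root)."""
    with open(rcs_path) as fh:
        text = fh.read()
    with open(rcs_path, "w") as fh:
        fh.write(enable_fsckfix(text))
    # Equivalent of `touch /forcefsck`.
    open(flag_path, "a").close()
```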

    That fixed the fsck errors, and the machine was fine for 30 minutes. Now syslog is reporting errors again, and the machine is freezing.

    Sar is also reporting massive await times on the device.

    How do I get this resolved? It's clearly not something I've done, and MS only gives out billing support, which is just ludicrous. How do I get MS to fix or look at a technical problem? I'm sorry, but I'm not paying for support when it's not my problem. :(

    sar:

    07:05:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
    07:15:01 AM    dev8-0      4.33    105.39     17.74     28.42      0.11     26.44     21.67      9.39
    07:15:01 AM   dev8-16      0.40      0.86      2.84      9.24      0.00      4.20      0.97      0.04
    07:25:02 AM    dev8-0      8.51    565.68     18.77     68.71      0.07      8.41      4.47      3.80
    07:25:02 AM   dev8-16      2.11      2.01     39.68     19.80      0.02      8.18      0.53      0.11
    07:35:01 AM    dev8-0      2.60     86.33     10.63     37.26      0.02      8.02      5.57      1.45
    07:35:01 AM   dev8-16      0.19      1.51      0.00      8.00      0.00     11.54      1.68      0.03
    07:47:48 AM    dev8-0      1.42     24.90     10.20     24.75      7.55   5322.78    462.15     65.53
    07:47:48 AM   dev8-16      1.09      8.03      1.24      8.51      0.00      3.98      0.60      0.07
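    The 07:47:48 row for dev8-0 is the one that matters: await jumps to 5322.78 ms. A rough sketch for pulling such rows out of sar device output automatically is below. It assumes the column layout shown above, and the 1000 ms threshold is an arbitrary choice for illustration.

```python
def parse_sar_dev(lines):
    """Parse `sar -d` style lines into a list of dicts, one per device row."""
    header = None
    rows = []
    for line in lines:
        fields = line.split()
        if "DEV" in fields:
            # Column names from DEV onwards; the timestamp precedes them.
            header = fields[fields.index("DEV"):]
            continue
        if header is None or len(fields) < len(header):
            continue
        values = fields[-len(header):]
        row = {"time": " ".join(fields[:-len(header)]), "DEV": values[0]}
        try:
            for name, text in zip(header[1:], values[1:]):
                row[name] = float(text)
        except ValueError:
            continue  # not a data row
        rows.append(row)
    return rows


def slow_intervals(rows, await_ms=1000.0):
    """Return the rows whose average wait per request is at least await_ms."""
    return [row for row in rows if row["await"] >= await_ms]
```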

    syslog:

    Aug  1 07:46:06 localhost kernel: [ 2873.633314] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633329] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633334] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633338] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633342] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633345] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633349] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633352] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633355] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633359] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633362] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633366] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633370] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633373] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633377] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633380] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2873.633383] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:46:06 localhost kernel: [ 2876.680536] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x4
    Aug  1 07:47:29 localhost kernel: [ 2969.722676] INFO: task jbd2/sda1-8:236 blocked for more than 120 seconds.
    Aug  1 07:47:29 localhost kernel: [ 2969.725433] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Aug  1 07:47:29 localhost kernel: [ 2969.728180] jbd2/sda1-8     D 0000000000000000     0   236      2 0x00000000
    Aug  1 07:47:29 localhost kernel: [ 2969.728186]  ffff880109ec9ce0 0000000000000046 ffff880109ec9cb0 0000000300000001
    Aug  1 07:47:29 localhost kernel: [ 2969.728190]  ffff880109ec9fd8 ffff880109ec9fd8 ffff880109ec9fd8 0000000000013700
    Aug  1 07:47:29 localhost kernel: [ 2969.728194]  ffff880109dc16f0 ffff880109b5c4d0 ffff880109ec9cf0 ffff880109ec9df8
    Aug  1 07:47:29 localhost kernel: [ 2969.728198] Call Trace:
    Aug  1 07:47:29 localhost kernel: [ 2969.728207]  [<ffffffff8165b5df>] schedule+0x3f/0x60
    Aug  1 07:47:29 localhost kernel: [ 2969.728212]  [<ffffffff8126226c>] jbd2_journal_commit_transaction+0x18c/0x1270
    Aug  1 07:47:29 localhost kernel: [ 2969.728217]  [<ffffffff8165d6ee>] ? _raw_spin_lock_irqsave+0x2e/0x40
    Aug  1 07:47:29 localhost kernel: [ 2969.728222]  [<ffffffff81077da8>] ? lock_timer_base.isra.29+0x38/0x70
    Aug  1 07:47:29 localhost kernel: [ 2969.728226]  [<ffffffff8108b830>] ? add_wait_queue+0x60/0x60
    Aug  1 07:47:29 localhost kernel: [ 2969.728230]  [<ffffffff81266fdb>] kjournald2+0xbb/0x220
    Aug  1 07:47:29 localhost kernel: [ 2969.728233]  [<ffffffff8108b830>] ? add_wait_queue+0x60/0x60
    Aug  1 07:47:29 localhost kernel: [ 2969.728236]  [<ffffffff81266f20>] ? commit_timeout+0x10/0x10
    Aug  1 07:47:29 localhost kernel: [ 2969.728239]  [<ffffffff8108ad8c>] kthread+0x8c/0xa0
    Aug  1 07:47:29 localhost kernel: [ 2969.728243]  [<ffffffff81667bf4>] kernel_thread_helper+0x4/0x10
    Aug  1 07:47:29 localhost kernel: [ 2969.728246]  [<ffffffff8108ad00>] ? flush_kthread_worker+0xa0/0xa0
    Aug  1 07:47:29 localhost kernel: [ 2969.728249]  [<ffffffff81667bf0>] ? gs_change+0x13/0x13
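    For reference, the hv_storvsc lines can be decoded with a small lookup, sketched below. As far as I know, cmd 0x2a is the SCSI WRITE(10) opcode, scsi status 0x2 is CHECK CONDITION, and srb status 0x4 corresponds to SRB_STATUS_ERROR in the Windows storage headers; the tables here are deliberately partial and the function name is made up.

```python
import re

STORVSC_LINE = re.compile(
    r"hv_storvsc \S+: cmd (0x[0-9a-fA-F]+) scsi status (0x[0-9a-fA-F]+) "
    r"srb status (0x[0-9a-fA-F]+)"
)

# Partial tables: only a few common values are listed.
SCSI_OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)", 0x35: "SYNCHRONIZE CACHE(10)"}
SCSI_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
SRB_STATUS = {0x01: "SRB_STATUS_SUCCESS", 0x04: "SRB_STATUS_ERROR"}


def decode_storvsc(line):
    """Translate one hv_storvsc log line into readable names, or None."""
    match = STORVSC_LINE.search(line)
    if not match:
        return None
    cmd, scsi, srb = (int(text, 16) for text in match.groups())
    srb &= 0x3F  # drop the flag bits (autosense valid, queue frozen)
    return {
        "cmd": SCSI_OPCODES.get(cmd, hex(cmd)),
        "scsi_status": SCSI_STATUS.get(scsi, hex(scsi)),
        "srb_status": SRB_STATUS.get(srb, hex(srb)),
    }
```

    Read that way, every line above is a write to the disk that the host reported as failed, which fits the journal thread then hanging in 'D' state.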


    Friday, August 1, 2014 7:55 AM
  • Well, it's another 24hrs of my VM being unusable.

    Have to say, Microsoft's technical support for this product is appalling.

    How can I be charged money for a product that doesn't work, and get no support for it, when their hardware is failing?

    The only option I have now is to move to another company, who will support their product better.

    It's useless for me to use Azure, as there seems to be no support for technical issues unless I pay 120 Euros for Microsoft to fix their own problems, which I object to.

    If it was technical support for things I have done or can't understand, then I could see that I need to pay for that (although I don't understand why I need to be tied into a 6 month contract).

    But when the hardware or something to do with the setup of the system is failing, then there is no way I should have to pay to get that resolved.

    I understand that it's obviously not affecting 1000s of people, or more people would be reporting it, but it looks like this was a problem 2 years ago, and there seems to be no response from MS on it at all, apart from one guy who started looking into it and then just disappeared.

    It's sad, as it seems to be a nice setup, but without good support it is of no use to me.

    :(

    Saturday, August 2, 2014 7:53 AM
  • Hi,

    For this issue, you may open a support ticket:

    http://azure.microsoft.com/en-us/support/options/

    Regards.


    Vivian Wang

    Friday, August 8, 2014 2:26 AM
  • Hi,

    I can't create a support ticket, as I am only allowed to create billing support tickets.

    And I refuse to pay you 120 Euros for you to fix a problem with YOUR hardware.

    I think you need some better support for your product. I understand that if it's something I have done wrong or something I don't understand, then I would expect to pay you for support, but there is nothing I can do about disk IO problems being logged in the kernel.

    I've used/programmed computers for 28 years and have a great understanding of how things work, even down to the Linux kernel, so paying for your support isn't something I need. Especially as in the past, when I have talked to your support in India, they just didn't understand me or weren't able to troubleshoot problems well.

    I have now had to migrate my machine over to a new VM, and this has fixed the disk IO problem, obviously, as it will pick a new storage container. It also fixed another problem that I have had open with your support, about why my machine had bad IO speed for the past 3 months.

    Other people seem to be having similar problems though, so I would suggest you get someone who knows what they are doing to look into it.

    For a company that is supposed to be at the forefront of computing, it's really bad that I seem to get quite a bit of downtime and problems on Azure :(

    Regards,

    Rob Donovan.

    Tuesday, August 12, 2014 8:10 AM