Inconsistent VHD access performance

    Question

  • Hello,

    I am programming a web role that accesses a VHD.

    The access performance of the VHD is critical for my scenario.

    However, what I observe is that the first access of the VHD takes 15 times longer than the average access time. Also, occasionally there are spikes in VHD access time, which badly affect my web role's performance:

    •Run 1   -> Done! It took 00:00:00.2956  <-- Spike! (1st access)
    •Run 2   -> Done! It took 00:00:00.0277
    •Run 3   -> Done! It took 00:00:00.3428  <-- Spike!
    •Run 4   -> Done! It took 00:00:00.0148
    •Run 5   -> Done! It took 00:00:00.0164
    •Run 6   -> Done! It took 00:00:00.0138
    •Run 7   -> Done! It took 00:00:00.0169
    •Run 8   -> Done! It took 00:00:00.0143
    •Run 9   -> Done! It took 00:00:00.0164
    •Run 10 -> Done! It took 00:00:00.0087
    •Run 11 -> Done! It took 00:00:00.0092
    •Run 12 -> Done! It took 00:00:00.2366  <-- Spike!
    •Run 13 -> Done! It took 00:00:00.0302
    •Run 14 -> Done! It took 00:00:00.0148
    •Run 15 -> Done! It took 00:00:00.0184
    •Run 16 -> Done! It took 00:00:00.0164
    •Run 17 -> Done! It took 00:00:00.0148
    •Run 18 -> Done! It took 00:00:00.0302
    •Run 19 -> Done! It took 00:00:00.0159
    •Run 20 -> Done! It took 00:00:00.0164
    •Run 21 -> Done! It took 00:00:00.0153
    •Run 22 -> Done! It took 00:00:00.0153
    •Run 23 -> Done! It took 00:00:00.0174
    •Run 24 -> Done! It took 00:00:00.0148
    •Run 25 -> Done! It took 00:00:00.0159
    •Run 26 -> Done! It took 00:00:00.0159
    •Run 27 -> Done! It took 00:00:00.0159
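
    To make "spike" concrete, one option is to compare each run against the median of all runs. The following is a minimal Python sketch over the timings listed above; the function name `find_spikes` and the 5x threshold are illustrative assumptions, not part of the original measurement code.

```python
from statistics import median

# Per-run durations in seconds, copied from runs 1-27 above.
timings = [
    0.2956, 0.0277, 0.3428, 0.0148, 0.0164, 0.0138, 0.0169, 0.0143, 0.0164,
    0.0087, 0.0092, 0.2366, 0.0302, 0.0148, 0.0184, 0.0164, 0.0148, 0.0302,
    0.0159, 0.0164, 0.0153, 0.0153, 0.0174, 0.0148, 0.0159, 0.0159, 0.0159,
]


def find_spikes(durations, factor=5.0):
    """Return (run_number, duration) pairs slower than factor * median."""
    baseline = median(durations)
    return [
        (run, d)
        for run, d in enumerate(durations, start=1)
        if d > factor * baseline
    ]


baseline = median(timings)
spikes = find_spikes(timings)

print(f"median: {baseline:.4f} s")
for run, d in spikes:
    print(f"run {run}: {d:.4f} s ({d / baseline:.1f}x the median)")
```

    With this threshold the sketch flags runs 1, 3 and 12, the same runs marked as spikes in the list. The median is used instead of the mean because the mean is itself pulled upward by the slow runs.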

    Can anyone explain why I do not observe consistent VHD access time behaviour?

    I really have to achieve it.

    Your help is much appreciated!

    Kind Regards

    Vlad

    Thursday, January 13, 2011 12:48 PM

All replies

  • Hi,

    My first guess would be reads going to the page blob which backs the VHD instead of the local cache, other CPU or disk activity on the machine, or .NET garbage collection, but it could be any number of things, so we would need a lot more information to even begin investigating.
    1. What do you mean when you say you are accessing a VHD?  Are you referring to the Windows Azure Drive functionality, or accessing a VM Role?  I assume you are referring to Azure Drive.
    2. What is the size of the VHD in blob storage?  What size cache are you using?
    3. What do your tests measure?  Read/write?  How much data are you using?
    4. What API are you using to access the VHD?  Can you share the code?
    5. How are you measuring performance?
    6. What information have you seen from perfmon, especially around CPU utilization, disk IO, paging, .NET garbage collection, etc?
    7. What results do you get using the same code on your local machine?
    8. What size VM are you using and do you notice a performance difference when using a larger VM?


    bill boyce
    Friday, January 14, 2011 9:16 PM
  • Hi Bill,

     1) By accessing a VHD I mean that I create new files in the VHD and copy them to another location within the VHD. Yes, I am talking about Windows Azure Drive and not about VM Role.

     2) The size of a VHD is about 250 MB. The size of cache is 128 MB.

     3) The files I create in the VHD are just empty files, and I just copy them to another directory. So, the file size is definitely not an issue here. I measured the access times using Stopwatch.

     4) I just mount the VHD and then do the file creation and copying. I can send you my whole solution. Can you give me your email address, please?

     5) I use Stopwatch to measure performance.

     6) I do not know how to use perfmon etc. in Azure. Any hints are much appreciated.

     7) Same results with spikes.

     8) I am using a small VM. I have not tried using a larger VM.
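
    The measurement described in answers 3) to 5) can be sketched roughly as follows. This is a Python stand-in, not the poster's .NET code (which was not shared in the thread): `time.perf_counter` plays the role of Stopwatch, a temporary directory stands in for the mounted drive, and the function names and the warm-up phase are assumptions added for illustration.

```python
import shutil
import tempfile
import time
from pathlib import Path


def time_create_and_copy(root: Path, name: str) -> float:
    """Create an empty file under root/src, copy it to root/dst, return seconds."""
    src = root / "src" / name
    dst = root / "dst" / name
    start = time.perf_counter()
    src.touch()            # create the empty file
    shutil.copy(src, dst)  # copy it to the other directory
    return time.perf_counter() - start


def benchmark(root: Path, runs: int = 27, warmup: int = 3) -> list:
    """Time `runs` create-and-copy operations after `warmup` untimed ones."""
    (root / "src").mkdir(parents=True, exist_ok=True)
    (root / "dst").mkdir(parents=True, exist_ok=True)
    # Warm-up runs are executed but their timings are discarded, so that
    # one-time costs (first access, cold caches) do not skew the results.
    for i in range(warmup):
        time_create_and_copy(root, f"warmup{i}.tmp")
    return [time_create_and_copy(root, f"run{i}.tmp") for i in range(1, runs + 1)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for run, seconds in enumerate(benchmark(Path(tmp)), start=1):
            print(f"Run {run} -> Done! It took {seconds:.4f} s")
```

    Discarding a few warm-up runs separates the expected first-access cost from the later, unexplained spikes, which are the more interesting part of the problem.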

    Kind Regards

    Vlad

    Tuesday, January 18, 2011 9:01 AM
  • Hi,

    I think the answer to question 7 is the key to your next troubleshooting steps.  If you see the same spikes on your local machine, then this indicates that the problem is not Azure related and is not related to the VHD. 

    When running locally the Azure CloudDrive APIs simulate the VHD in the cloud by just writing directly to local files on your hard drive in a temporary folder.  This means there is no VHD, caching, clouddrive driver, etc involved.  This is beneficial for your troubleshooting since you can remove Azure from the picture and have a much less complex scenario. 

     

    My first guess as to why this is happening is .NET garbage collection.  If you hit a GC during one of your test runs then that particular test run will spike.  There are a host of other reasons why this might occur (physical hard drive caching, OS IO caching, VM working set reduction, competing IO requests, CPU scheduling, etc), but I think the most likely scenario is garbage collection. 
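
    One way to test the garbage collection hypothesis is to record, for every timed run, how many collections happened while it was executing, and then see whether the slow runs are the ones with a non-zero count. The sketch below shows the idea in Python using the `gc` module; it is an analogy only, since the code in this thread is .NET, where `GC.CollectionCount` read before and after each run would serve the same purpose. The helper names are hypothetical.

```python
import gc
import time


def total_collections() -> int:
    """Sum the collection counters of all garbage collector generations."""
    return sum(gen["collections"] for gen in gc.get_stats())


def timed_with_gc(operation):
    """Run operation once; return (elapsed_seconds, collections_during_run)."""
    before = total_collections()
    start = time.perf_counter()
    operation()
    elapsed = time.perf_counter() - start
    return elapsed, total_collections() - before


def allocate_many_lists():
    # Hypothetical workload: allocating many container objects tends to
    # trigger automatic collections in CPython.
    return [[i] for i in range(100_000)]


if __name__ == "__main__":
    for run in range(1, 6):
        elapsed, collections = timed_with_gc(allocate_many_lists)
        print(f"Run {run}: {elapsed:.4f} s, {collections} collection(s)")
```

    If the spikes show up in runs where no collection occurred, garbage collection can be ruled out and the other candidates (OS IO caching, competing IO requests, CPU scheduling) become more likely.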

    Given that this problem occurs on your local machine I would recommend removing Azure from the equation and using perfmon on your local machine to see what else is happening at the time of the spikes.  You may also have some luck posting this same issue (but not including the Azure complexity) in the Common Language Runtime forum (http://social.msdn.microsoft.com/Forums/en-US/clr/threads) or the Windows Perfmon and Diagnostic Tools forum (http://social.msdn.microsoft.com/Forums/en-US/perfmon/threads).

    Hope this helps


    bill boyce
    Thursday, January 20, 2011 4:44 PM
  • Hi Bill,

    A colleague of mine has another scenario and also observes access time spikes while working with a VHD.

    Also, when working with DevFabric on-premise, I definitely use a VHD, which I attach (it usually gets attached as a:). Then I copy the VHD to Azure and observe the same behaviour with the spikes.

    Kind Regards

    Vlad

    Thursday, January 20, 2011 5:18 PM
  • You may want to consider checking the Azure Support web page to get more in-depth support on this issue. There are various support options to consider. They would be able to work with you and dissect the issue fully.

    http://www.microsoft.com/windowsazure/support/


    bill boyce
    Monday, January 24, 2011 3:13 PM