I am working on a project where we need to download the same two large blobs (one 780 MB, the other 40 MB) on multiple worker roles. As the number of worker roles increases, this process keeps slowing down. The maximum time (from the start on some processor to the end on all processors) looks like the following:
#workers    File Handling Time
   1             265.00
   2             311.00
   4             633.00
   8             632.00
  16             737.00
  32             835.00
  64             910.00
  94            1134.00
I understand that this problem is not Azure specific, as reading from the hard disk is a bottleneck here. I was just wondering if there is anything that can be done. Thanks in advance for your responses.
Hi, just to close this thread, there are a number of factors that may have played a part in this, and we wanted to post them here.
VM size: http://blogs.msdn.com/b/avkashchauhan/archive/2011/01/30/windows-azure-network-io-capability-for-each-vm-size.aspx
With an XL VM you have the entire machine's NIC available to you; with a Small, Medium, etc., you are co-located with other VMs and must share the NIC. This means a single Small instance has a maximum theoretical throughput of 12.5 MB/s. If the workers are homogeneous (i.e. performing the same workload), we would recommend using an XL VM with multiple threads for concurrency (see the sketch below). This could also help with hitting the maximum throughput per blob partition, since the blob would be downloaded 8x fewer times.
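One way to do this is with plain HTTP range requests against the blob's URL. The sketch below is a rough illustration, not taken from the product documentation: it assumes a blob that is readable without a signature (e.g. a public container or a SAS URL), and the URL, local path, and thread count are placeholders.

using System;
using System.IO;
using System.Net;
using System.Threading.Tasks;

class ParallelBlobDownload
{
    const long ChunkSize = 4 * 1024 * 1024;  // match the 4 MB block size

    static void Main()
    {
        string blobUrl = "http://myaccount.blob.core.windows.net/data/payload.bin";
        string localPath = @"C:\scratch\payload.bin";

        // HEAD request to learn the blob length.
        var head = (HttpWebRequest)WebRequest.Create(blobUrl);
        head.Method = "HEAD";
        long length;
        using (WebResponse resp = head.GetResponse())
            length = resp.ContentLength;

        // Pre-size the local file so each range can be written at its offset.
        using (var fs = new FileStream(localPath, FileMode.Create))
            fs.SetLength(length);

        int chunks = (int)((length + ChunkSize - 1) / ChunkSize);

        // Each iteration downloads one 4 MB range; several run concurrently,
        // which keeps multiple connections busy on the instance's NIC.
        Parallel.For(0, chunks,
            new ParallelOptions { MaxDegreeOfParallelism = 8 }, i =>
        {
            long from = i * ChunkSize;
            long to = Math.Min(from + ChunkSize, length) - 1;

            var req = (HttpWebRequest)WebRequest.Create(blobUrl);
            req.AddRange(from, to);  // HTTP Range: bytes=from-to

            using (WebResponse resp = req.GetResponse())
            using (Stream body = resp.GetResponseStream())
            using (var fs = new FileStream(localPath, FileMode.Open,
                                           FileAccess.Write, FileShare.Write))
            {
                fs.Seek(from, SeekOrigin.Begin);
                body.CopyTo(fs);
            }
        });
    }
}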
A blob read stream begins by fetching the list of blocks and then issuing subsequent downloads corresponding to each block (4 MB by default). This has some overhead, as you must issue the GetBlockList call and then wait on the turnaround of the multiple requests that follow. Also, whenever possible use the concrete-type method GetBlockBlobReference to avoid the added latency of an internal FetchAttributes call to determine the blob type, as below.
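For example, with the v1.x Microsoft.WindowsAzure.StorageClient library (a sketch only; the development storage account and the paths are placeholders):

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class ConcreteTypeDownload
{
    static void Main()
    {
        // Development storage here; substitute your real account.
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudBlobClient client = account.CreateCloudBlobClient();

        // GetBlobReference hands back the abstract CloudBlob, so the client
        // may need a FetchAttributes round trip to discover the blob type.
        // GetBlockBlobReference returns the concrete CloudBlockBlob directly.
        CloudBlockBlob blob = client.GetBlockBlobReference("container/large-blob.bin");
        blob.DownloadToFile(@"C:\scratch\large-blob.bin");
    }
}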
Co-locate your data:
When a compute instance is co-located with the storage account (in the same data center, e.g. North Central US), the transactions happen over the internal network, which is both free (no bandwidth charge; you still pay for transactions) and much higher performance. We always recommend making sure the compute and storage are in the same location.
Stagger Requests to a single partition:
A blob's partition is its canonical name ([accountname][containername][blobname]), so issuing several concurrent requests to the exact same blob may exceed the partition target of 60 MB/s. If you need to scale faster (more concurrent reads), consider issuing a few CopyBlob operations at startup to temporarily duplicate the blob, then cleaning the copies up later; see the sketch below.
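A rough sketch of that duplicate-and-spread pattern, again with the v1.x StorageClient library (blob names and the copy count are placeholders):

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class BlobFanOut
{
    const int CopyCount = 4;  // reads are now spread over 4 partitions

    static void Main()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudBlobClient client = account.CreateCloudBlobClient();
        CloudBlockBlob source = client.GetBlockBlobReference("data/payload.bin");

        // Startup: server-side copies; each copy is its own partition.
        for (int i = 0; i < CopyCount; i++)
        {
            CloudBlockBlob copy = client.GetBlockBlobReference("data/payload-copy-" + i);
            copy.CopyFromBlob(source);
        }
    }
}

Each worker would then read copy number (workerId % CopyCount) rather than the original blob, and the copies can be deleted with Delete() once startup is finished.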
I have tried downloading files with 1 up to 24 threads on all kinds of instances, and with files ranging in size from 2 KB to 25 MB. This is the result for an 11 MB file. I don't understand why I see these spikes (thread #8, LargeInstanceTime); before that point the download time is a lot lower. Are the files somehow cached in the blob store by the infrastructure? Or is there another explanation?
Hi, sorry for the delay in response.
It's interesting that you see that spike in your large-instance test but not in any of the others. A couple of questions:
1) How many requests per second were you making? Remember that as you use more threads, your transaction count will increase. Please let us know the rate per instance and how many instances you were running. If your rate is too high, you may be reaching the scalability target for requests per second on a blob (discussed in "Windows Azure Storage Abstractions and their Scalability Targets"), especially if you had many requests going to a single blob, as that could hit the scalability target for that single partition. Consider duplicating your blob and making requests to different copies of it if you plan on accessing the same blob from multiple machines.
2) For 11 MB files, I wouldn't expect to see great improvement past a few threads. We normally recommend a 4 MB request size within the same data center, so an 11 MB file splits into only about three requests, and using more than 3 threads wouldn't likely see much improvement (and I see in your tests that it doesn't).
Let me know what your request rate is and we can continue to try to determine the issue from there.