Azure blob store performance question

  • Question

  • Is there any chance of a direct-connection API (i.e., a raw TCP/IP socket interface) being released for Azure storage?

    I _love_ the flexibility that the ATOM/REST interface provides, but when communicating with it from within the same data center, it is entirely too slow to use without a comprehensive multi-server caching strategy. We have a case of iterating through 400-500 datasets stored in blob storage (BSON format, ~1-2 MB each), and it's taking anywhere from 45 to 90 seconds, with most of the time spent waiting on I/O. **yuck**
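
    For reference, the access pattern is just a straight loop over the dataset names (simplified sketch below; the container reference and names are illustrative, and BSON deserialization happens afterwards). Each iteration pays a full round trip before the next one starts, so 400-500 fetches at roughly 100-200 ms apiece accounts for essentially all of that 45-90 seconds:

    ```csharp
    using System.Collections.Generic;
    using Microsoft.WindowsAzure.StorageClient;

    // Current serial pattern (simplified). Nothing overlaps: every blob fetch
    // waits for the previous one to finish before its HTTP request even starts.
    static List<byte[]> FetchSerial(CloudBlobContainer container, IEnumerable<string> names)
    {
        var results = new List<byte[]>();
        foreach (string name in names)                 // 400-500 dataset blobs per run
        {
            CloudBlob blob = container.GetBlobReference(name);
            results.Add(blob.DownloadByteArray());     // ~1-2 MB of BSON, mostly I/O wait
        }
        return results;
    }
    ```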

    After moving to WADS from MongoDB, we're seriously considering moving back to Mongo and just hosting it in 3-4 worker roles (it was _way_ faster...as in hardly noticed the delay). Ideas???

    Thursday, November 11, 2010 1:13 AM

Answers

  • When latency matters, it matters.  But when you're doing lots of operations serially, it's easy for a latency problem to become a throughput problem.  You quoted an overall time for a whole batch of operations... if your goal is to lower that overall time, you'll need to make sure you're parallelizing properly on the client so that latency doesn't affect your throughput.  (You don't want the worker instance to be sitting idle while waiting for the previous operation to complete.)

    If you want to do a single operation with lower latency than is provided by Windows Azure storage, there are always a few .NET-specific tips for the client (like turning off Expect100Continue and UseNagleAlgorithm), but generally, there's an HTTP request involved no matter what you do.
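
    Concretely, those settings are a couple of lines set once at client startup, before the first request goes out; raising the default per-host connection limit usually comes up in the same breath if you're going to parallelize, so it's included below (the exact value is just a starting point to experiment with):

    ```csharp
    using System.Net;

    static void TuneStorageClient()
    {
        // Call once, early (e.g., in RoleEntryPoint.OnStart), before any storage requests.
        ServicePointManager.Expect100Continue = false;    // skip the 100-Continue handshake round trip
        ServicePointManager.UseNagleAlgorithm = false;    // Nagle buffering adds latency for small payloads
        ServicePointManager.DefaultConnectionLimit = 48;  // default is 2 per host; raise it before fanning out requests
    }
    ```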

    Nowhere in this thread was I recommending you add more worker role instances to solve this problem.

    Monday, November 15, 2010 8:13 PM

All replies

  • Try reading this post and the posts contained within it to get an idea of performance. 
    Friday, November 12, 2010 3:55 PM
  • I don't know of plans for a non-HTTP based interface for storage.

    It's important to keep in mind the difference between throughput and latency.  A roundtrip to storage, involving a new HTTP connection and a network, is going to have significant latency (milliseconds, but significant when added up).  However, if you use async operations and multiple threads, I suspect you can get the throughput you need.

    I think the key (as in the recent WA Storage blog post about table storage) is parallelizing the work on the client so that you're doing all these operations at the same time.
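
    Something along these lines, roughly (an untested sketch using .NET 4's Parallel.ForEach; the right degree of parallelism is something to measure for your instance size, and you'd raise ServicePointManager.DefaultConnectionLimit to match):

    ```csharp
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;
    using Microsoft.WindowsAzure.StorageClient;

    // Fetch many blobs concurrently instead of one at a time. Each request still
    // pays the per-call latency, but the requests overlap, so elapsed time is
    // driven by aggregate throughput rather than the sum of the latencies.
    static IList<byte[]> FetchParallel(CloudBlobContainer container,
                                       IEnumerable<string> names, int maxParallel)
    {
        var results = new ConcurrentBag<byte[]>();
        Parallel.ForEach(names,
            new ParallelOptions { MaxDegreeOfParallelism = maxParallel },
            name => results.Add(container.GetBlobReference(name).DownloadByteArray()));
        return results.ToList();
    }
    ```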

    This series of blog posts by Rob Gillen about maximizing blob throughput is excellent: http://rob.gillenfamily.net/2010/09/13/maximizing-throughput-in-windows-azure-e28093-part-1/

    Friday, November 12, 2010 6:21 PM
  • Thanks Steve / SparkCode...

    Unfortunately, in the post you pointed out [1], Rob seems to state that for internal-to-Azure downloads it is best to just download in a non-parallel manner:

    "Therefore, from the data and tests we’ve run so far, using a blocked or chunked approach and parallelized transfers works well for external-to-Azure uploads and downloads as well as uploads (compute to blob storage) for internal-to-Azure movements. Internal-to-Azure downloads (blob storage to compute targets) should be performed using the standard/non-parallelized approach. "

    This supports the scenarios we are experiencing. We have 2-4 worker roles (which we call "analysis engines") that process these datasets based on messages picked up from Azure queues. Even while uploading in parallel to save the compute results back to blob store (which seems silly - they're generally under 50 KB), we still end up waiting on dataset retrievals from ADS (typically averaging 60-90 msec each, not counting deserialization time).
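
    For context, each engine's inner loop looks roughly like this (heavily simplified; the queue/blob names, message format, and Analyze step are placeholders, not our actual code):

    ```csharp
    using Microsoft.WindowsAzure.StorageClient;

    // Rough shape of one "analysis engine" iteration: pull a work item off the
    // queue, fetch the datasets it names, analyze, write the (small) results back.
    static void ProcessNextJob(CloudQueue jobs, CloudBlobContainer data, CloudBlobContainer results)
    {
        CloudQueueMessage msg = jobs.GetMessage();
        if (msg == null) return;                                   // nothing queued right now

        foreach (string name in msg.AsString.Split(','))           // datasets named in the message
        {
            byte[] bson = data.GetBlobReference(name).DownloadByteArray();    // ~60-90 ms apiece
            byte[] output = Analyze(bson);                                    // placeholder for the real work
            results.GetBlobReference(name + ".out").UploadByteArray(output);  // typically < 50 KB
        }

        jobs.DeleteMessage(msg);
    }

    static byte[] Analyze(byte[] bson) { return bson; }            // stand-in; analysis omitted
    ```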

    While I can appreciate the "throughput vs. latency" argument, in this case latency is our killer, which is what the post referred to by SparkCode pointed out very nicely.

    Are there any other possibilities with Azure storage? If not, I think our short-term path to moving back to MongoDB is our best option.

    Thanks!

    Friday, November 12, 2010 7:03 PM
  • Just to add on to my previous post, I think the real issue here (for us) is not large, multi-megabyte reads/writes. It is the rapid reading of many small items that doesn't seem to work well currently.

    Thanks (again)...

    Friday, November 12, 2010 7:34 PM
  • Is it just me, or does it seem odd/frustrating/counter-intuitive that MS ties network perf/bandwidth to the size of the machine? This seems to favor building applications out of large, multicore machines (more $$), instead of larger numbers of smaller machines that each have _the same I/O perf_.

    Very interesting...

    Friday, November 12, 2010 8:12 PM
  • I'd have to go reread, but I thought Rob's conclusion was that when downloading a single blob, it didn't improve performance to do parallel downloads, but I believe for your scenario (downloading lots of small blobs), doing it in parallel is a must.

    Also, just to make 100% sure I'm not confused, when you say "ADS" you mean Windows Azure blob storage, right?

    Friday, November 12, 2010 10:30 PM
  • It makes sense if you consider that what's really happening is that there's a server (with certain network capacity and disk I/O capabilities), and you're getting some fraction of it (up to and including the entire server).  The bigger fraction you have, the bigger portion of network and disk I/O you get.  And yes, also the more cores you get.

    In general, I think network and disk I/O scale about linearly with the cores, so two small instances should yield the same total network bandwidth as one medium instance.  (I can't swear to that, as it has to do with overhead of virtualization and other factors too.)  The general advice is to test various configurations for your specific app and see what yields the best price performance.

    Friday, November 12, 2010 10:33 PM
    In his PDC 10 presentation, Hoi Vo has a slide (#32) showing network bandwidth scaling linearly with core count for everything except Extra Small, which he describes as throttled at 5 Mbps. He pretty much repeats Steve's admonition to test the performance of your service so that you get the best price/performance for it (rather than someone else's).

    Friday, November 12, 2010 10:46 PM
  • Thanks for the replies...these all seem like valid points, except for when raw latency matters (**laugh**). Anyway, something tells me that Rob's findings on parallel downloads being better had a lot to do with his testing being done only on XL (8-core) instances, where you can actually get true parallel activity. However, in our initial designs, we're limiting ourselves to many single-core boxes (Small instances), where adding a plethora of threads doesn't necessarily mean faster response (it actually slowed a few things down).

    To be fair, the only step I haven't tested yet is breaking the analysis step down across multiple roles. Currently, it's a single sequential job step (that can run on many instances at once, but each is a single unit of work). This will require a higher level of job abstraction to coordinate jobs, but should yield true parallelism in the job.
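
    Roughly what I have in mind for that coordination layer (purely hypothetical at this point; the naming scheme and the "job header" blob are just one way to do it):

    ```csharp
    using System.Collections.Generic;
    using Microsoft.WindowsAzure.StorageClient;

    // Hypothetical fan-out: one queue message per dataset instead of one per job,
    // so any idle instance can pick up a piece. A small "job header" blob records
    // how many pieces to expect; a coordinator can compare that against the number
    // of result blobs written to decide when the job is complete.
    static void FanOutJob(string jobId, IList<string> datasetNames,
                          CloudQueue workItems, CloudBlobContainer jobState)
    {
        jobState.GetBlobReference(jobId + "/expected")
                .UploadText(datasetNames.Count.ToString());

        foreach (string name in datasetNames)
            workItems.AddMessage(new CloudQueueMessage(jobId + "|" + name));
    }
    ```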

    I'll get busy and post any results I find...

    Thanks!

    P.S. Still dreaming of Mongo-ish response times in ADS....**whistling**

     

    Friday, November 12, 2010 11:55 PM
  • Some additional thoughts...

    Even though I can appreciate why the network bandwidth is related to machine size as the VMs are allocated, one could argue that having to purchase extra CPU *just* to get bandwidth makes no sense (remember, nobody should actually have to know/understand that these are VMs, how the hypervisor works, etc). I don't see any reason why someone couldn't desire a Small (or even XS) role, but still want a full 100 Mbps (or even 1 Gbps) connection. We can build that in-house today easily, and in the XS case, for not much money at all.

    Back to the latency vs. throughput argument...I'm sorry, I'm just not convinced of the jedi-hand-wave phrase "no, don't worry about latency - we have throughput!". It's a simple matter of economics - yes, I can get tremendous throughput by running jobs across 128 workers, but I'm not paying for that (or if I'm okay with that much money, I'll just build physical servers on site). 

    I realize that Azure is a bit "nascent", but these are the types of issues that anyone porting an on-site application will run into. We're used to SQL server-like response times (at the very least), and maybe 4-8 boxes for large apps. The trouble is, when comparing costs, we'll also estimate 4-8 roles - not 64, or even 128.

    Thoughts? Comments?

    Monday, November 15, 2010 3:31 PM
  • James Hamilton of AWS has a great blog on datacenter infrastructure. He has a recent post on network infrastructure in datacenters which goes some way to explaining why things are the way they are with bandwidth.

    -- In a classic network design, there is more bandwidth within a rack and more within an aggregation router than across the core. This is because the network is over-subscribed.

    -- Continuing on the over-subscription problem mentioned above, data intensive workloads like MapReduce and high performance computing workloads run poorly on oversubscribed networks.

    -- The network equipment business model is broken. We love the server business model where we have competition at the CPU level, more competition at the server level, and an open source solution for control software.  In the networking world, it’s a vertically integrated stack and this slows innovation and artificially holds margins high. It’s a mainframe business model. (my emphasis)

    With regard to the issue that "someone couldn't desire a Small (or even XS) role, but still want a full 100 Mbps (or even 1 Gbps) connection", I think the answer there is that it is still early days and that if there is enough demand for something it will probably come at some point. AWS, for example, periodically rolls out new instance types (Cluster GPU instances today) - but even AWS does not have all that many instance types, even though it has been available far longer than Azure.

     

    Monday, November 15, 2010 4:35 PM
  • Neil/Steve,

    Thanks for the feedback - much appreciated!

    In the spirit of playing along, what then, in the Azure world, is a suggested solution for network-bound / HPC-like systems?  We can't change the reality that we need to move a large amount of data quickly between nodes, and we can't just add threads to a single larger role, since the datasets will not all fit in memory at the same time. Of course there is also the "add more RAM" avenue, but even that runs out at some point.
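
    To make the memory constraint concrete, what we probably need is bounded pipelining rather than unbounded parallelism; a hypothetical sketch (the analyze callback and the capacity of 8 are placeholders) that keeps only a handful of datasets in RAM while overlapping downloads with processing:

    ```csharp
    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.WindowsAzure.StorageClient;

    // Overlap downloads with analysis while holding at most 8 datasets in memory.
    // (More producer tasks could be added if the downloads themselves need to overlap.)
    static void DownloadAndAnalyze(CloudBlobContainer data, IEnumerable<string> names,
                                   Action<byte[]> analyze)
    {
        using (var pending = new BlockingCollection<byte[]>(boundedCapacity: 8))
        {
            var producer = Task.Factory.StartNew(() =>
            {
                foreach (string name in names)
                    pending.Add(data.GetBlobReference(name).DownloadByteArray());
                pending.CompleteAdding();
            });

            foreach (byte[] dataset in pending.GetConsumingEnumerable())
                analyze(dataset);                          // analysis overlaps the next download

            producer.Wait();
        }
    }
    ```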

    If there is any advice to offer, I would say we're all ears.

    As always, thanks again for your time. Conversation is the most important "feature" a platform can offer.

     

    Monday, November 15, 2010 8:10 PM
  • When latency matters, it matters.  But when you're doing lots of operations serially, it's easy for a latency problem to become a throughput problem.  You quoted an overall time for a whole batch of operations... if your goal is to lower that overall time, you'll need to make sure you're parallelizing properly on the client so that latency doesn't affect your throughput.  (You don't want the worker instance to be sitting idle while waiting for the previous operation to complete.)

    If you want to do a single operation with lower latency than is provided by Windows Azure storage, there are always a few .NET-specific tips for the client (like turning off Expect100Continue and UseNagleAlgorithm), but generally, there's an HTTP request involved no matter what you do.

    Nowhere in this thread was I recommending you add more worker role instances to solve this problem.

    Monday, November 15, 2010 8:13 PM
  • Oops...I may have misspoken (somewhere) - I meant add parallel threads, not machines. Thanks for correcting me!
    Monday, November 15, 2010 8:31 PM