Azure Batch vs HDInsight vs Data Lake Analytics?

  • Question

  • Hey, I just did some tests on Azure Batch. I am still confused about the differences between Azure Batch, HDInsight, and Data Lake. I know they are totally different in terms of underlying infrastructure, but if we look at them from a high level, they are all distributed architectures that can process big data, right? For example, say we want to count the words in a large text file. Can someone give a summary of these three technologies? Thanks in advance.
    Wednesday, January 20, 2016 9:28 AM

All replies

  • Great question!

    It's not super clear cut, and you can run most HPC and Big Data workloads on any of them, but hopefully I can describe the strengths of each, as certain things are easier on one platform than another.

    HDInsight/Hadoop is primarily designed for processing unstructured and semi-structured data, typically text files. As I'm sure you're aware, it uses the Map-Reduce paradigm to do this, which is a powerful but sometimes limiting model. One of Hadoop's primary strengths is the ability to host and manage your data as a core part of the system (in the HDFS file system) and to schedule the compute close to that data where possible. This benefit is sometimes negated in cloud scenarios where customers want to use HDInsight temporarily, purely for job execution rather than long-term persistence, i.e., start a cluster, transfer data to the cloud, do some processing, move the results back on-premises, and shut the cluster down. The other consideration with Hadoop is that it is a very complex system, and it can be quite onerous to manage and tune for the best performance; I believe the HDInsight team has done a lot to make this easier, however.
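
    As a concrete (and purely illustrative) version of the word-count example from the question, a Hadoop Streaming job on HDInsight could use a pair of small Python scripts like the following; the file names are my own, and the streaming jar path varies by distribution:

        # wordcount_mapper.py - read lines from stdin, emit "word<TAB>1" pairs
        import sys

        for line in sys.stdin:
            for word in line.strip().split():
                print("%s\t%d" % (word.lower(), 1))

        # wordcount_reducer.py - input arrives sorted by word; sum the counts
        import sys

        current_word, current_count = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t", 1)
            if word == current_word:
                current_count += int(count)
            else:
                if current_word is not None:
                    print("%s\t%d" % (current_word, current_count))
                current_word, current_count = word, int(count)
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))

    The two scripts would then be submitted with the hadoop-streaming jar, passing them via -mapper and -reducer and pointing -input/-output at locations in the cluster's default storage.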

    Azure Batch has many similarities but is designed for general-purpose scale-out computing. Its strength is in taking existing executables and running them across many VMs. You simply add tasks through the REST API; each task specifies the command line to execute and any dependent input files, and Batch takes care of the rest by scheduling the tasks to VMs in a pool. You can manage pools of VMs explicitly and have them scale up and down automatically as the amount of work increases or decreases. Another strength is that if you need to get hold of a very large number of VMs (1,000 to 100,000), Batch will automatically hunt out capacity across many clusters in the datacentre to service the request.
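
    To make the "add tasks, specify a command line" model concrete, here is a minimal sketch using the azure-batch Python SDK, assuming a pool named "mypool" already exists; the account name, key, URL, and IDs are placeholders, and exact parameter names differ a little between SDK versions:

        # Submit a job containing a few command-line tasks to an existing Batch pool.
        import azure.batch.batch_auth as batch_auth
        import azure.batch.models as batchmodels
        from azure.batch import BatchServiceClient

        credentials = batch_auth.SharedKeyCredentials("mybatchaccount", "<account-key>")
        client = BatchServiceClient(
            credentials, batch_url="https://mybatchaccount.westus.batch.azure.com")

        # A job is just a container for tasks and is bound to a pool.
        client.job.add(batchmodels.JobAddParameter(
            id="wordcount-job",
            pool_info=batchmodels.PoolInformation(pool_id="mypool")))

        # Each task is an arbitrary command line; Batch schedules them onto pool VMs.
        tasks = [batchmodels.TaskAddParameter(
                     id="task-%d" % i,
                     command_line="/bin/bash -c 'wc -w input-%d.txt'" % i)
                 for i in range(10)]
        client.task.add_collection("wordcount-job", tasks)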

    In summary, I would say that if you have a large amount of data that you anticipate keeping in the cloud and processing repeatedly over time, and in particular if that data is text based, then HDInsight would be a good solution.

    Otherwise, I would suggest Azure Batch is simpler and likely better suited.

    I haven't mentioned Data Lake yet. It is primarily a large clustered file system. It supports the HDFS protocol, so you can think of it as a Hadoop HDFS file system, but the underlying mechanics and capabilities are very different (more sophisticated, with more performance tuning available). With Data Lake you can of course schedule your Map-Reduce jobs to run against it, so again, it is *very* well suited to long-term storage with ongoing data processing.
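
    As a small, purely illustrative example of that HDFS-style view, a Data Lake Store account can be browsed like a file system, for instance with the azure-datalake-store Python package (the store name and credentials below are placeholders; the service also exposes a WebHDFS-compatible REST endpoint):

        # Browse an Azure Data Lake Store account as a file system.
        from azure.datalake.store import core, lib

        # Azure AD service-principal credentials; the values are placeholders.
        token = lib.auth(tenant_id="<tenant-id>",
                         client_id="<client-id>",
                         client_secret="<client-secret>")
        adls = core.AzureDLFileSystem(token, store_name="mydatalakestore")

        print(adls.ls("/"))                    # list the root directory
        with adls.open("/data/input.txt", "rb") as f:
            print(f.read(100))                 # read the first 100 bytes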

    On Azure Batch we're looking at being able to use Data Lake as a storage layer and will be evaluating its performance, as we have customers who want to store data in an HDFS-style file system but use Azure Batch for certain kinds of processing.

    It’s a bit of a grey area but hope that helps a little. If you have a particular scenario that you want more specific guidance on, it would be great if you can share some more details.

    Best

    Dave


    Principal Group Software Engineering Manager, Azure Big Compute

    • Marked as answer by Mars.zhai Thursday, January 21, 2016 3:24 AM
    Thursday, January 21, 2016 2:24 AM
  • Thanks a lot, Dave. It's really helpful.

    In my understanding, it's more about the workflow and the user's existing architecture, right? If customers have existing Map-Reduce jobs and want to migrate them, they would likely prefer HDInsight as an extension of their architecture. And since HDInsight is 100% compatible with Hadoop, users can draw on resources from the Hadoop community.

    Azure Batch can break through the Map-Reduce limitation and take more advantage of the scalability of the cloud. It's a simpler architecture, and users don't need to go deep into sophisticated code such as the core of Hadoop, right? But I think we could simplify the development cycle of Azure Batch: this 'upload executables and dependencies' model doesn't seem very friendly in the cloud world, and it doesn't match the theme of 'productivity'. :)

    I know little about Data Lake Analytics. It sounds fantastic, but I haven't found any material on its underlying mechanism. Please kindly send a link if you know of any.

    Thanks again, Dave!

    Best Regards,

    Ran

    Thursday, January 21, 2016 3:24 AM
  • Hi Ran,

    Yes - you nailed it! If you have existing Map-Reduce code, and in particular if you are already using Hadoop (and are happy with it :) ), then HDInsight is the most natural fit, given that it is Hadoop under the covers (based on the Hortonworks distro).

    Whether you're packaging a JAR with your Map-Reduce logic or have an existing executable, you need to deploy it to the cloud service at some point, so this is a common requirement across any of these cloud systems. The primary use case for Azure Batch is enabling existing applications to scale massively in the cloud, so customers expect the application executables and dependencies to be managed along with this. We're about to roll out a new Azure Batch feature for managing application packages explicitly in a simple, declarative way. You still need to upload the application package, obviously; however, it's treated as a first-class citizen by the system, which makes managing and running jobs with the application simpler. We also have some additional features coming a bit later that make it really easy to enable an existing application end to end.
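
    As a rough sketch of how an application package ends up being referenced once the feature is available (the IDs, version, and environment-variable form below are illustrative, not exact), a task points at a previously uploaded, versioned package, and Batch unpacks it on the node before the command runs:

        import azure.batch.models as batchmodels

        # Reference a pre-uploaded application package ("myapp", version "1.0").
        # Batch downloads and extracts the package on the compute node; its install
        # location is exposed through an AZ_BATCH_APP_PACKAGE_* environment variable
        # (the exact variable name depends on the OS and the package id/version).
        task = batchmodels.TaskAddParameter(
            id="run-myapp",
            command_line="/bin/bash -c '$AZ_BATCH_APP_PACKAGE_myapp/myapp --help'",
            application_package_references=[
                batchmodels.ApplicationPackageReference(
                    application_id="myapp", version="1.0")])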

    I'd recommend this blog post for a better understanding of Azure Data Lake and how it fits into the broader ecosystem. Cosmos is the name of the internal system it's based on, which has been around at Microsoft for a long time.

    https://azure.microsoft.com/en-us/blog/behind-the-scenes-of-azure-data-lake-bringing-microsoft-s-big-data-experience-to-hadoop/

    All the best
    Dave


    Principal Group Software Engineering Manager, Azure Big Compute


    Thursday, January 21, 2016 3:57 AM
  • Do you have plans to add diagnostics or telemetry of some kind for Azure Batch? Right now there is none, and that makes the workloads we run in Azure Batch a nightmare to debug or profile.

    Please mark the response as an answer if it solves your question, or vote it as helpful if you find it helpful. http://thoughtorientedarchitecture.blogspot.com/

    Friday, January 22, 2016 12:13 PM
  • We do have plans, but it would be very useful if you could provide your specific requests. For example, would you like to be able to configure Azure Diagnostics on all your pool nodes, and what metrics would you like captured?

    The main mechanism available currently is to instrument your task code to write to stdout, stderr, or a custom log file. You can then use the Batch APIs or Batch Explorer, and shortly the portal as well, to list and display the task files.
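
    As a rough sketch of that approach with the azure-batch Python SDK (the job and task IDs are placeholders, and "client" is a BatchServiceClient), a task's stdout can be pulled back once the task has run:

        # Fetch stdout.txt for a task from the Batch service and decode it.
        def read_task_stdout(client, job_id, task_id):
            chunks = client.file.get_from_task(job_id, task_id, "stdout.txt")
            return b"".join(chunks).decode("utf-8", errors="replace")

        print(read_task_stdout(client, "wordcount-job", "task-0"))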

    Tuesday, January 26, 2016 7:57 PM
  • Also, with Azure Batch you pay as you go for the compute resources, while with HDInsight you pay for the cluster from the time it is provisioned, whether or not it is running jobs.
    Wednesday, January 17, 2018 6:23 AM