Azure Databricks vs Azure HDInsight

    Question

  • Hi All, 

    I want to know which use cases people think are best suited to Azure Databricks vs Azure HDInsight.

    What are the pros and cons of each of these services? What are the most common limitations? 

    I know in the complete architecture we should be looking at leveraging both services, but it would be good to know what to use when. 

    Thanks

       

    Tuesday, March 26, 2019 10:54 PM

All replies

  • Hi Ricky6789,

    Azure Databricks is a premium Spark offering that is ideal for customers who want their data scientists to collaborate easily and run their Spark-based workloads efficiently and at industry-leading performance.

    Azure HDInsight brings both Hadoop and Spark under the same umbrella and enables enterprises to manage both using the same set of tools, e.g. Ambari and Apache Ranger. It also offers an industry-standard notebook experience with support for both Jupyter and Zeppelin notebooks. Enterprises that want this ease of manageability across all their big data workloads can choose to use HDInsight.

    Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP). Azure HDInsight makes it easy, fast, and cost-effective to process massive amounts of data. You can use the most popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more. For more details, refer to Azure HDInsight Documentation.

    Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. For more details, refer to Azure Databricks Documentation.

    Here is a comparison of Azure HDInsight vs Databricks.

    Hope this helps.

    Wednesday, March 27, 2019 10:05 AM
    Moderator
  • Thank you for your reply. 

    So to put it simply  -

    Azure Databricks is targeted at data scientists: lots of ad-hoc jobs, sharing and analytics. It enables self-service users to process huge volumes at scale.

    Azure HDInsight is for traditional Hadoop+Spark use cases: production-ready data pipelines at an enterprise scale, offered by YARN and others.

    What is confusing me is Azure Data Factory - Mapping Data Flow. The new release converts the mapping flow to Databricks to fulfill data integration use cases. Why has Microsoft decided to use Databricks for this when it is more targeted at data scientists? Do Databricks workspaces scale to handle 1000+ concurrent jobs, like HDI could? As Azure Databricks is a newer offering, does it have all the fine-grained security features offered by HDI?

    thanks 


     

    Wednesday, March 27, 2019 4:59 PM
  • Hi Ricky6789,

    If you want to use both Spark and Hadoop and manage both with the same set of OSS tools, HDInsight is the right service to choose.

    If you want a premium collaboration experience for your data scientists, Databricks is a great option.

    Why has Microsoft decided to use Databricks for this when it is more targeted to Data scientists?

    Databricks is the preferred product over HDI, unless the customer already has a mature Hadoop ecosystem established. More and more, I find that the majority of workloads are Spark, so Databricks is the better option. For pure Spark workloads, Databricks greatly outperforms HDI. Although Databricks clusters can be up and running indefinitely, they are intended to be created and terminated as necessary, so if you wish to use other Apache Hadoop data transformation tools and have them available 24/7, then HDI may be a better option than Databricks.

    For more details, refer to “What product to use to transform my data?”.

    Do Databricks workspaces scale to handle 1000+ concurrent jobs, like HDI could?

    These are the limits in Azure Databricks:

    • The number of jobs is limited to 1000.
    • The number of jobs a workspace can create in an hour is limited to 1000 (includes “run now”). This limit also affects “jobs” created by the REST API and notebook workflows.
    • The number of actively concurrent runs a workspace can create is limited to 150.
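
    As a rough illustration, the three limits listed above can be checked against a planned workload with a few lines of Python. This is only a sketch: the function name `check_workload` and its inputs are hypothetical, it does not call any Databricks API, and the numbers are simply the ones quoted in this reply, which may have changed since.

```python
# Limits as quoted in this reply (March 2019). These are assumptions for the
# sketch; confirm the current values in the Azure Databricks documentation.
MAX_JOBS_PER_WORKSPACE = 1000
MAX_JOBS_CREATED_PER_HOUR = 1000
MAX_CONCURRENT_RUNS = 150


def check_workload(total_jobs, jobs_created_per_hour, peak_concurrent_runs):
    """Return a message for each quoted limit the planned workload exceeds.

    Hypothetical helper: it only compares planned numbers with the limits above.
    """
    checks = [
        ("total jobs", total_jobs, MAX_JOBS_PER_WORKSPACE),
        ("jobs created per hour", jobs_created_per_hour, MAX_JOBS_CREATED_PER_HOUR),
        ("peak concurrent runs", peak_concurrent_runs, MAX_CONCURRENT_RUNS),
    ]
    return [
        f"{name}: {value} exceeds the quoted limit of {limit}"
        for name, value, limit in checks
        if value > limit
    ]


if __name__ == "__main__":
    # Example plan: 1200 jobs defined, 300 created per hour, 200 running at once.
    for problem in check_workload(1200, 300, 200):
        print(problem)
```

    An empty result means the plan fits within the quoted limits for a single workspace; anything else suggests splitting the work across workspaces or reconsidering the design.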

    For more details, refer to “Making Data Scientists productive in Azure”.

    Hope this helps.

    Thursday, March 28, 2019 8:06 AM
    Moderator
  • Hi Ricky6789,

    Just checking in to see if the above answer helped. If it answers your query, please click “Mark as Answer” and up-vote it. If you have any further queries, do let us know.

    Tuesday, April 2, 2019 10:58 AM
    Moderator