Data Factory vs. HDInsight

  • Question

  • I was at a gov't hackfest and watched a guy do a design that included HDInsight, but I felt it was too complex.

    Under what conditions is Data Factory a better choice than HDInsight?

    I felt he could have used Data Factory and done the data transformations there too, not just raw compute. Also, can't Data Factory take SSIS packages and manipulate them?

    Friday, March 9, 2018 10:49 AM

Answers

  • ADF does very little transformation on its own. You can merge text files, map columns between different schemas, and convert between some text file formats. Anything beyond that has to be done in the source, in the sink, or in a transform activity.

    Spark was just an example of a compute instance where you can do transformations. I personally use Data Lake Analytics for batch transform jobs. You can trigger the U-SQL job in an activity in ADF.

    If you want to run SSIS packages, your only choice as far as I know is to run them on an Azure-SSIS integration runtime in ADF. You can't run SSIS packages in HDInsight.

    I'm talking about ADFv2 here. Don't know anything about v1.
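
    As a rough sketch of what "run it in an activity in ADF" can look like, the Python snippet below builds an ADFv2-style pipeline definition as a plain dictionary: one activity runs an SSIS package on an Azure-SSIS integration runtime, and a second triggers a U-SQL job once the first succeeds. The activity type names (`ExecuteSSISPackage`, `DataLakeAnalyticsU-SQL`) and property names follow the ADFv2 JSON format as best recalled and should be checked against the current documentation. Every resource name (`MyAzureSsisIR`, `MyDataLakeAnalytics`, `MyScriptStore`, the package path and script path) is a hypothetical placeholder, not something from this thread.

```python
import json

# All resource names below are hypothetical placeholders.
run_ssis = {
    "name": "RunLegacyPackage",
    "type": "ExecuteSSISPackage",
    "typeProperties": {
        # SSIS packages execute on an Azure-SSIS integration runtime.
        "connectVia": {
            "referenceName": "MyAzureSsisIR",
            "type": "IntegrationRuntimeReference",
        },
        "packageLocation": {"packagePath": "Folder/Project/Package.dtsx"},
    },
}

run_usql = {
    "name": "BatchTransform",
    "type": "DataLakeAnalyticsU-SQL",
    # The Data Lake Analytics account that runs the job.
    "linkedServiceName": {
        "referenceName": "MyDataLakeAnalytics",
        "type": "LinkedServiceReference",
    },
    # Only start once the SSIS package has finished successfully.
    "dependsOn": [
        {"activity": run_ssis["name"], "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "scriptPath": "scripts/transform.usql",
        "scriptLinkedService": {
            "referenceName": "MyScriptStore",
            "type": "LinkedServiceReference",
        },
    },
}

pipeline = {
    "name": "SsisThenUsql",
    "properties": {"activities": [run_ssis, run_usql]},
}

print(json.dumps(pipeline, indent=2))
```

    In both activities ADF only schedules and monitors the work. The transformation itself happens on the integration runtime or in Data Lake Analytics.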
    • Edited by Molotch Friday, March 16, 2018 1:25 PM spelling
    • Marked as answer by ResidentX10 Tuesday, October 9, 2018 1:53 PM
    Friday, March 16, 2018 1:24 PM

All replies

  • You can design SSIS packages with built-in transformation tasks and deploy them to SSIS in ADF, where they are executed on the Azure-SSIS Integration Runtime (IR). Furthermore, you can combine SSIS executions with other data integration activities in ADF pipelines.
    Monday, March 12, 2018 9:09 AM
  • So really ADF is better than HDInsight, right? That was my question.
    Plus HDInsight is complex to set up, but all of this can be managed in the SSIS package if you use ADF, right?
    Monday, March 12, 2018 9:54 AM
  • ADF is a managed orchestrator with prebuilt connectors, logging, triggers and scheduling. HDInsight is a managed YARN cluster. They are different things. If you want to orchestrate an ETL pipeline, use ADF; if you want to run YARN applications, use HDInsight.

    Let's say you want to copy data from a database to a file store or to another database somewhere. That's an ADF job. Let's say you want to run a transformation job in Spark. That's an HDInsight job (though I'd use Azure Databricks for Spark).

    Since HDInsight is a compute cluster, you can do everything there, but why would you? Use the right tool to solve the problem.
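
    To make the split concrete, here is a minimal Python sketch of an ADFv2-style pipeline definition in which ADF does the copy itself and hands the transformation to a Spark job on HDInsight. The activity types (`Copy`, `HDInsightSpark`) and property names follow the ADFv2 JSON format as best recalled, so verify them against the documentation before relying on them. The dataset, linked service and script names are hypothetical placeholders.

```python
import json


def ref(name, kind):
    """Build an ADF-style reference object, e.g. a DatasetReference."""
    return {"referenceName": name, "type": kind}


# Step 1: data movement, handled by ADF's own Copy activity.
# Dataset names are hypothetical.
copy_step = {
    "name": "CopyOrdersToBlob",
    "type": "Copy",
    "inputs": [ref("OnPremSqlOrders", "DatasetReference")],
    "outputs": [ref("RawOrdersBlob", "DatasetReference")],
    "typeProperties": {
        "source": {"type": "SqlSource"},
        "sink": {"type": "BlobSink"},
    },
}

# Step 2: transformation, delegated to a Spark job on an HDInsight cluster.
# The cluster linked service and script path are hypothetical.
spark_step = {
    "name": "TransformOrders",
    "type": "HDInsightSpark",
    "linkedServiceName": ref("MyHDInsightCluster", "LinkedServiceReference"),
    # Run only after the copy has succeeded.
    "dependsOn": [
        {"activity": copy_step["name"], "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "rootPath": "adfspark",
        "entryFilePath": "transform_orders.py",
    },
}

pipeline = {
    "name": "CopyThenTransform",
    "properties": {"activities": [copy_step, spark_step]},
}

print(json.dumps(pipeline, indent=2))
```

    The pipeline is the orchestration layer and the cluster is just one of the compute targets it can call, so the two services complement each other rather than compete.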

    Friday, March 16, 2018 8:05 AM
  • Moving along, I have SQL on-premises with lots of SSIS packages. Can HDInsight manipulate them, or do I have to use ADF for that? I thought transformation was handled by ADF, not Spark?
    Friday, March 16, 2018 10:04 AM
  • I'm closing this unless there are other comments. Thanks. I wasn't getting notifications so I'm just now updating all my old unclosed posts. 
    Tuesday, October 9, 2018 1:54 PM