Automation of a Spark Job in Azure

  • Question

  • I have a class that does some extract, transform, and load work on a dataset spread across different JSON files.

    This process works fine, but I have to run the project (a Scala project in IntelliJ) manually every month.

    So I'm trying to automate the process, but I haven't found documentation or a tutorial explaining which service is best suited to accomplish this.

    Is there any recommendation to follow in order to automate an HDInsight job in Azure? I've tried the following without success:

    • Azure Data Factory
    • Azure Logic Apps

    The process should:

    • Create an HDInsight Spark cluster
    • Run the process (a Scala class)
    • Delete the HDInsight Spark cluster created in the first step

    Thanks!

    Wednesday, December 11, 2019 8:58 PM

All replies

  • Hello,

    Azure Data Factory is the best solution for this use case. When you say, "I've tried with this one without success", what have you tried? What was the outcome?

    Meanwhile, you may refer to the SO thread, which addresses a similar issue.

    Hope this helps. Do let us know if you have any further queries.

    ----------------------------------------------------------------------------------------

    Do click on "Mark as Answer" and Upvote on the post that helps you, this can be beneficial to other community members.

    Thursday, December 12, 2019 8:20 AM
    Moderator
  • My problem with Azure Data Factory is that I could not find a way to create an HDInsight Spark cluster on demand and delete it after the data is processed. As it stands, I would have to create the cluster manually before running the factory.

    The links below are what I've found about creating an on-demand HDInsight cluster:

    • How to create Azure on demand HD insight Spark cluster using Data Factory
    • Access datalake from Azure datafactory V2 using on demand HD Insight cluster
    Thursday, December 12, 2019 12:44 PM
  • Hello,

    The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted; this behavior is by design. With the on-demand HDInsight linked service, an HDInsight cluster is created every time a slice needs to be processed, unless there is an existing live cluster (within the timeToLive window), and it is deleted when the processing is done.
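    For reference, here is a minimal sketch of what such an on-demand HDInsight Spark linked service could look like in Data Factory v2. All names, IDs, and the referenced storage linked service (AzureBlobStorageLinkedService here) are placeholders to adapt to your environment:

        {
            "name": "HDInsightOnDemandLinkedService",
            "properties": {
                "type": "HDInsightOnDemand",
                "typeProperties": {
                    "clusterType": "spark",
                    "clusterSize": 4,
                    "timeToLive": "00:15:00",
                    "version": "3.6",
                    "osType": "Linux",
                    "hostSubscriptionId": "<subscription id>",
                    "clusterResourceGroup": "<resource group>",
                    "tenant": "<tenant id>",
                    "servicePrincipalId": "<service principal id>",
                    "servicePrincipalKey": {
                        "type": "SecureString",
                        "value": "<service principal key>"
                    },
                    "linkedServiceName": {
                        "referenceName": "AzureBlobStorageLinkedService",
                        "type": "LinkedServiceReference"
                    }
                }
            }
        }

    With a definition like this, Data Factory creates the cluster when an activity that references it runs, keeps it alive for the timeToLive window, and then deletes it automatically.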

    As more activities run, you will see many containers in your Azure blob storage. If you do not need them for troubleshooting the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: adf<yourdatafactoryname>-<linkedservicename>-<datetimestamp>. Use a tool such as Microsoft Azure Storage Explorer to copy the different kinds of logs out to another storage account and delete these containers from your Azure blob storage.

    • How to create Azure on demand HD insight Spark cluster using Data Factory
    • Access datalake from Azure datafactory V2 using on demand HD Insight cluster

    This tutorial, Create on-demand Apache Hadoop clusters in HDInsight using Azure Data Factory, walks you through creating an on-demand HDInsight cluster.
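    In such a pipeline, running your Scala class could be expressed as an HDInsightSpark activity that points at the assembly jar containing it, which you would upload to blob storage beforehand. A minimal sketch, where the container ("adfspark"), jar path, main class name, and storage linked service are all placeholders:

        {
            "name": "RunMonthlyEtl",
            "type": "HDInsightSpark",
            "linkedServiceName": {
                "referenceName": "HDInsightOnDemandLinkedService",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "rootPath": "adfspark",
                "entryFilePath": "jars/my-etl-assembly.jar",
                "className": "com.example.MonthlyEtlJob",
                "sparkJobLinkedService": {
                    "referenceName": "AzureBlobStorageLinkedService",
                    "type": "LinkedServiceReference"
                }
            }
        }

    A monthly schedule trigger on the pipeline would then replace the manual run from IntelliJ.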

    Hope this helps. Do let us know if you have any further queries.

    ----------------------------------------------------------------------------------------

    Do click on "Mark as Answer" and Upvote on the post that helps you, this can be beneficial to other community members

    Monday, December 16, 2019 5:07 AM
    Moderator
  • Hello,

    Just checking in to see whether you have had a chance to look at the previous response, and whether it helped. If the issue persists, do share the details so we can investigate further.

    Thursday, December 19, 2019 8:12 AM
    Moderator