Oozie with Spark and ADLS RRS feed

  • Question

  • I am using a Spark 2.0 on Linux (HDI 3.5) cluster with ADLS in the background to store the data. Now I want to create a sample Oozie job but am struggling with setting everything up.

    I have already enabled the Ambari view for Workflows, which gives me a rather simple UI to create a simple workflow. So far my workflow contains only one Spark step, which runs a Python "Hello World" script. Whenever I try to submit the job I receive this error message:

    Error occurred while saving workflow.
    Oozie error. Please check the workflow.

    My current guess is that there is something wrong with the configuration of the Name Node and the Job Tracker/Resource Manager.

    For the Name Node I tried to use adl://myadls.azuredatalakestore.net/clusters/mycluster, which is basically the root path of my cluster in ADLS.
    I thought this would be the equivalent of wasb://container_name@storage_name.blob.core.windows.net for a cluster on a blob storage account?

    For the JobTracker/Resource Manager I used the server name from Ambari --> YARN --> Active ResourceManager, something like hn1-myclus.fsxg4kdzabcdeeflqozoca1iwe.fx.internal.cloudapp.net

    What would be the correct settings here?

    The default settings are ${resourceManager} and ${nameNode}, but I don't know where those variables are set or what their values should be, so I hardcoded the values described above.
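    For reference, ${resourceManager} and ${nameNode} are not set anywhere by magic: they are parameters that Oozie substitutes from the job.properties file you pass at submission time. A minimal sketch of such a file, assuming the ADLS account and ResourceManager host names from above (the paths and the port are illustrative assumptions, adjust to your cluster):

```properties
# job.properties -- values here are substituted into workflow.xml at submission

# nameNode: the default file system of the cluster (here an ADLS account root)
nameNode=adl://myadls.azuredatalakestore.net

# resourceManager: the YARN ResourceManager address as host:port
# (port 8050 is the HDP default; verify in Ambari --> YARN --> Configs)
resourceManager=hn1-myclus.fsxg4kdzabcdeeflqozoca1iwe.fx.internal.cloudapp.net:8050

# where the workflow.xml lives -- on the default file system, not locally
# (hypothetical path under the cluster's storage root)
oozie.wf.application.path=${nameNode}/clusters/mycluster/workflows/sparkhello
oozie.use.system.libpath=true
```

    Note that nameNode is usually just the file-system root, without the /clusters/mycluster suffix; the cluster path belongs in properties like oozie.wf.application.path.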

    Any ideas or samples for this scenario?

    Thanks in advance,

    Gerhard Brueckl
    blogging @ http://blog.gbrueckl.at
    working @ http://www.pmOne.com

    Wednesday, August 30, 2017 3:14 PM

All replies

  • You may refer to the documentation Use Oozie with Hadoop to define and run a workflow on Linux-based HDInsight for the configuration, in case you haven't checked it already, and see if that helps.


    Wednesday, August 30, 2017 5:57 PM
  • Sure, I read the documentation, but it was not working as expected.

    The issue seems to be that when the cluster runs on ADLS, Oozie tries to create a folder in the root of the ADLS account to store the logs, e.g. adl://myadls.azuredatalakestore.net/user/myuser/....

    To work around this issue I created the /user/ folder beforehand and granted the cluster service principal the appropriate permissions.
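    The workaround above can be sketched with the HDFS CLI from a cluster head node (the account name and user are the hypothetical ones from this thread; the cluster service principal additionally needs ACLs on the ADLS folder, which you can grant in the Azure portal):

```shell
# create the home directory Oozie expects under the ADLS account root
hdfs dfs -mkdir -p adl://myadls.azuredatalakestore.net/user/myuser

# make the submitting user the owner and allow others to traverse it
hdfs dfs -chown myuser adl://myadls.azuredatalakestore.net/user/myuser
hdfs dfs -chmod 755 adl://myadls.azuredatalakestore.net/user/myuser
```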

    However, the questions would be:
    a) why isn't it creating this folder in the cluster's root directory?
    b) where can I change where it creates this folder?

    A more general question would be whether it is possible to reference files from ADLS (e.g. job.xml) or if only local files are supported.
    The command "oozie job -config job.xml -submit" refers to a local "job.xml" file; what if I want to store my job files in ADLS (local files get deleted when I kill the cluster)?
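    For what it's worth, the -config argument of the oozie CLI does read a local properties file, but that file is only a small set of pointers; the workflow.xml, the Python script, and anything else the workflow references can live on the default file system (ADLS here) via oozie.wf.application.path, so they survive cluster deletion. A sketch, with hypothetical paths and the account names from this thread:

```shell
# the properties file stays local, but it only carries pointers into ADLS
cat > sparkhello.properties <<'EOF'
nameNode=adl://myadls.azuredatalakestore.net
resourceManager=hn1-myclus.fsxg4kdzabcdeeflqozoca1iwe.fx.internal.cloudapp.net:8050
oozie.wf.application.path=${nameNode}/clusters/mycluster/workflows/sparkhello
EOF

# workflow.xml and the Hello-World script go to ADLS, not the local disk
hdfs dfs -put -f workflow.xml helloworld.py \
    adl://myadls.azuredatalakestore.net/clusters/mycluster/workflows/sparkhello/

# submit and start the job; -oozie points at the Oozie server on the head node
oozie job -oozie http://localhost:11000/oozie -config sparkhello.properties -run
```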

    Kind regards,

    Gerhard Brueckl
    blogging @ http://blog.gbrueckl.at
    working @ http://www.pmOne.com

    Thursday, August 31, 2017 8:00 AM