Data Factory Pipeline to invoke Spark Program

  • Question

  • Hi,

    I have a Data Factory that I want to use to run an HDInsightSpark program. The program is a .py file stored in a folder on my Blob storage. It reads three .txt files from another folder (input) on my Blob storage, runs some data transformation on them, and writes the result to a .txt file in an output folder in the same storage account. In my PySpark code I read the data with the textFile method and write the output with the saveAsTextFile method. When I wrote the code and ran it manually in a Jupyter notebook, it worked fine.
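
    A minimal sketch of a script with this shape follows, for readers who want something concrete to reason about. It is not the original code: the transform function is a hypothetical placeholder (the post does not say what the transformation is), and the function names and command-line layout are assumptions made for illustration.

```python
import sys


def transform(line):
    # Hypothetical placeholder: the real transformation is not given in the post.
    return line.strip().upper()


def run_job(sc, input_paths, output_path):
    """Read every input file, transform each line, and write the result."""
    # One RDD per input file, combined into a single RDD of lines.
    lines = sc.union([sc.textFile(path) for path in input_paths])
    # saveAsTextFile writes a directory of part files at output_path,
    # not a single .txt file.
    lines.map(transform).saveAsTextFile(output_path)


def main(argv):
    if len(argv) < 3:
        print("usage: job.py <output_path> <input_path> [<input_path> ...]")
        return 1
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="TransformTextFiles")
    try:
        run_job(sc, argv[2:], argv[1])
    finally:
        sc.stop()
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

    Run with an output path followed by the three input paths, the script reads the inputs, applies the placeholder transformation, and writes the output directory. Keeping run_job separate from the SparkContext setup also lets the same logic be called from a notebook that already has a context.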

    In my Data Factory setup, I have two linked services: one for the storage account (Storage Linked Service) and one for HDInsight (HDInsight Linked Service). I also have four datasets: three input datasets that each point to the input folder and name one of the input .txt files, and one output dataset that points to the output folder and names the output .txt file I want to write. Finally, I have one pipeline with a single activity of type "HDInsightSpark", whose rootPath and entryFilePath point to my Python code folder and my .py file, respectively. In the inputs section of this activity I list the names of the three input datasets, and in the outputs section I list the name of the output dataset, as named in its dataset definition.
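
    The pipeline described above can be sketched as a small Python helper that assembles the pipeline JSON. This assumes the JSON layout used by Data Factory version 1 for an HDInsightSpark activity. Every name and path in the example call (pipeline, activity, linked service, datasets, rootPath, entryFilePath) is hypothetical, and scheduling fields such as the pipeline's active period are left out.

```python
import json


def build_spark_pipeline(pipeline_name, activity_name, hdinsight_linked_service,
                         root_path, entry_file_path,
                         input_datasets, output_dataset):
    """Return a pipeline definition containing one HDInsightSpark activity."""
    activity = {
        "name": activity_name,
        "type": "HDInsightSpark",
        "linkedServiceName": hdinsight_linked_service,
        "typeProperties": {
            # Folder that holds the code, and the script inside that folder.
            "rootPath": root_path,
            "entryFilePath": entry_file_path,
        },
        # Datasets are referenced by name only; they are defined separately.
        "inputs": [{"name": name} for name in input_datasets],
        "outputs": [{"name": output_dataset}],
    }
    return {"name": pipeline_name, "properties": {"activities": [activity]}}


# All values below are hypothetical placeholders.
pipeline = build_spark_pipeline(
    pipeline_name="SparkPipeline",
    activity_name="RunTransform",
    hdinsight_linked_service="HDInsightLinkedService",
    root_path="mycontainer/code",
    entry_file_path="transform.py",
    input_datasets=["InputDataset1", "InputDataset2", "InputDataset3"],
    output_dataset="OutputDataset",
)
print(json.dumps(pipeline, indent=2))
```

    Writing the definition this way makes it easy to check that the names in the inputs and outputs sections match the dataset names exactly.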

    My question (thank you for bearing with me through the context): under the Monitor & Manage icon in the Data Factory dashboard, none of the activities listed in the Activity Window are running. They all say "the upstream dependencies are not ready". How can I fix this and run my program?

    Tuesday, June 13, 2017 1:43 PM