Process multiple Gen 2 folders to their respective SQL tables. Each folder can have one to many files which need to be processed.

  • Question

  • Requirement: 
    Directory 1 -> File 1, File 2, File 3 -> Processed to Table 1
    Directory 2 -> File 1, File 2, File 3 -> Processed to Table 2.

    Directory and table information can be retrieved from the configuration table in SQL DB using a Lookup activity, and each folder's childItems can be retrieved through the Get Metadata activity.

    In order to meet the above requirement, how could I join the Lookup output with the Get Metadata output on a common folder name before passing the result to the ForEach activity?
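    For illustration, the join being asked about amounts to matching each Lookup row to the childItems that Get Metadata returned for that row's folder. A minimal Python sketch of that pairing (field names are hypothetical, not actual ADF output):

    ```python
    # Minimal sketch of the desired join (field names are hypothetical):
    # match each config row from the Lookup to the files that the
    # Get Metadata activity found in that row's folder.
    lookup_rows = [
        {"SourceFolderName": "FolderA", "TABLE_NAME": "TableA"},
        {"SourceFolderName": "FolderB", "TABLE_NAME": "TableB"},
    ]
    metadata_child_items = {
        "FolderA": [{"name": "File1.csv", "type": "File"}],
        "FolderB": [{"name": "File2.csv", "type": "File"}],
    }

    def join_on_folder(rows, child_items):
        """Pair each configuration row with the files in its folder."""
        return [
            {"table": row["TABLE_NAME"],
             "files": [c["name"] for c in child_items.get(row["SourceFolderName"], [])]}
            for row in rows
        ]

    print(join_on_folder(lookup_rows, metadata_child_items))
    ```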

    Thanks

    Tuesday, November 12, 2019 2:02 PM

All replies

  • Hi Sam cit,

    Your requirement looks very similar to the issue below. The only difference, and the challenge here, is that you have multiple folders and each folder has multiple files in it.

     https://social.msdn.microsoft.com/Forums/en-US/e4fcb782-4c82-4aa5-aaa4-8538b2a18cc6/process-multiple-gen-2-folder-to-their-respective-sql-tables-each-folder-can-have-one-to-many-file?forum=AzureDataFactory


    One solution could be:

    Create a parent pipeline which gets the list of folder names from the source using a GetMetadataActivity + FilterActivity (optional, in case your root has a mix of folders and files and you want to keep only the folders) + ForEachActivity + ExecutePipelineActivity inside the ForEach, and then pass each folder name to the child pipeline as a parameter.

      

    PARENT PIPELINE:


    • GetMetadataActivity: This is to get the list of child items from the source.
    • FilterActivity: This is to filter the output of GetMetadataActivity down to the list of folder names only (optional, depending on your scenario).
    Items: @activity('getFolderName').output.childItems
    Condition: @equals(item().type, 'Folder')

    • ForEachActivity: This is to loop through the list of folder names.
    Items: @activity('filterFolders').output.value
    • ExecutePipelineActivity: This is to execute the child pipeline, passing the current folder name as an input parameter from the parent pipeline.
    Input parameter to the child pipeline: @item().name (the current folder in the ForEach iteration)



    CHILD PIPELINE:

    Please follow the solution provided in this MSDN thread to create your child pipeline and implement the rest of the logic. It will have a GetMetadataActivity + ForEachActivity (inside the ForEach you will have a LookupActivity to get the corresponding sink details from the SQL table, plus a CopyActivity to copy the file data to the corresponding table).


    • GetMetadataActivity: This is to get the list of file names from the folder (the folder name is the input parameter value received from the parent pipeline).


    • ForEachActivity: This is to loop through the list of files returned as output of GetMetadataActivity in the previous step.

    ForEach Items: @activity('getMetaDataFileNameList').output.childItems

    • In this scenario there should be a slight change in the Lookup query while following the previous MSDN thread.

    Lookup Activity Query: @concat('SELECT TABLE_SCHEMA, TABLE_NAME FROM SinkTablesForFiles WHERE SourceFolderName = ', pipeline().parameters.apostropheChar, pipeline().parameters.folderName, pipeline().parameters.apostropheChar)
    
    
    This will generate the below SQL query for each iteration to get the respective table names.
    
    "sqlReaderQuery": "SELECT TABLE_SCHEMA, TABLE_NAME FROM SinkTablesForFiles WHERE SourceFolderName = 'FolderA'"
    
    "sqlReaderQuery": "SELECT TABLE_SCHEMA, TABLE_NAME FROM SinkTablesForFiles WHERE SourceFolderName = 'FolderB'"
    Note: The above query is constructed based on the below sample SQL table.

    Id | SourceFolderName | TABLE_SCHEMA | TABLE_NAME
    ---|------------------|--------------|-----------
    1  | FolderA          | dbo          | TableA
    1  | FolderB          | dbo          | TableB
    1  | FolderC          | dbo          | TableC



    Hope this helps to achieve your requirement. Please feel free to ask if you have any questions.


    Thank you

    If a post helps to resolve your issue, please click "Mark as Answer" and/or "Vote as helpful" on that post. By marking a post as Answered and/or Helpful, you help others find the answer faster.


    Wednesday, November 13, 2019 2:28 AM
    Moderator
  • Hi Sam cit,

    Just checking to see if the above answer was helpful. If this answers your query, please consider clicking "Mark as Answer" and "Up-Vote", as it might be beneficial to other community members reading this thread. And if you have any further query, do let us know.


    Thank you


    Thursday, November 14, 2019 2:24 AM
    Moderator
    Thank you Kranthi for your suggestion. I am in the process of testing it; in the meantime I would like clarification on the below.

    The Metadata activity (getFolderName) in the parent pipeline lists the folder name, such as "FolderA", but in order to keep the dataset dynamic I require the folder name with its absolute path, for example: "root/country/department/project/FolderA".

    The absolute path needs to be passed to the Copy activity in the child pipeline and used to configure the source dataset, which would otherwise require hard-coding the directory path.

    Would appreciate your clarification.

    Thursday, November 14, 2019 3:58 AM
  • Hi Sam cit,

    Thanks for your response. It's a great question. In that case we need to create three different datasets.

    For example, I am assuming the folder structure as below:

    kranthi/input/MultipleCSVFiles/FolderA/File1.csv

    kranthi/input/MultipleCSVFiles/FolderB/File2.csv

    kranthi/input/MultipleCSVFiles/FolderB/File3.csv


    • Dataset1: This is used in Metadata activity (getFolderName) in Parent pipeline to get the folder names, here we point the dataset folder path to 'kranthi/input/MultipleCSVFiles'
    • Dataset2: This is used in the Metadata activity (getMetaDataFileNameList) in the child pipeline; here we point the dataset to 'kranthi/input/MultipleCSVFiles/FolderA' in the first iteration. The folder name is received from the parent pipeline as an input parameter (parameter name: folderName), and the dataset's folder path is a dynamic expression built from that parameter. (The screenshot of the expression in the original post is not available.)
    • Dataset3: This is used as the Copy activity source. Create two dataset parameters, one to pass the file name and the other to pass the folder name, and pass both values from the pipeline. (The screenshots of this configuration in the original post are not available.)
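      As a hedged sketch of the dataset parameterization described above (the parameter names folderName and fileName, and the root path, are assumptions), the dynamic paths behave like:

      ```python
      # Hypothetical reconstruction of the dataset's dynamic paths:
      # concatenate a fixed root with the folderName parameter, and
      # append the fileName parameter for the full file path.
      def dataset_folder_path(folder_name, root="kranthi/input/MultipleCSVFiles"):
          """Folder path as Dataset2/Dataset3 would resolve it."""
          return root + "/" + folder_name

      def dataset_file_path(folder_name, file_name):
          """Full file path for the Copy activity source."""
          return dataset_folder_path(folder_name) + "/" + file_name

      print(dataset_file_path("FolderA", "File1.csv"))
      ```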

      Hope this helps. Please let me know if you have any questions.

    Thank you



    Friday, November 15, 2019 8:23 AM
    Moderator
  • Just use a data flow, without requiring all this complex coding logic. Enable rowUrl and you will get what you need.
    • Proposed as answer by dataflowuser Friday, November 15, 2019 7:05 PM
    Friday, November 15, 2019 7:05 PM
  • Hi Sam cit,

    Following up to see if the above suggestion was helpful? Let us know if you have any further query. 

    Thank you


    Monday, November 18, 2019 8:28 PM
    Moderator
  • Hi Sam cit,

    Following up again to see if the above suggestion was helpful. If any of the replies answers your query, please consider clicking "Mark as Answer" and "Up-Vote" on that reply, as it might be beneficial to other community members reading this thread. And if you have any further query, do let us know.

    Thank you


    Tuesday, November 19, 2019 6:33 PM
    Moderator
  • Hi Kranthi,

    Sorry for the late reply, and thanks for your clarification. As per your suggestion I have hard-coded the directory path and am still in the process of building it. Could you please clarify the three questions below?

    1) If, for example, FolderA has multiple subfolders based on date, such as FolderA/YY/MM/DD/sales1.csv, and I enable recursive copy, would I get the file name as YY/MM/DD/sales1.csv or only sales1.csv?

    2) Also, let's say in FolderA I have multiple files which I want to process based on LastModified date. I have come across a few options; which one do you suggest?
    a) Metadata, "Modified datetime start and end date" ? 
    b) or should I use Metadata "Field List" Argument Last Modified ?
    c) Filter activity
    d) If condition

    3) In the child and parent pipelines, after the ForEach activity, I would like to capture the prior activity's status ("Succeeded" or "Failed") and its errors. I do not find any such output from the ForEach activity, e.g. @activity('forEachFileName').Status or @activity('forEachFileName').ErrorMessage.

    Appreciate your help

    Thank you.



    • Edited by Sam cit Wednesday, November 20, 2019 3:20 PM
    Wednesday, November 20, 2019 12:24 PM
  • Hi Kranthi,

    Could you please clarify on above. 

    Thank you.

    Wednesday, December 4, 2019 6:25 PM
  • Hello Sam cit.  My apologies for the delay.  I will be taking over this case from Kranthi.
    Friday, December 6, 2019 8:20 PM
    Moderator
    For Question #2, I recommend option A. Option A (use the Modified datetime start and Modified datetime end filters of the Get Metadata activity, which fetches filenames, not folders) is the simplest and easiest to implement. While the other options could work, they would increase the number of steps and the time taken.

    Similar to how you passed the folder name to child pipeline, you can pass the desired time bounds to the child pipeline and reference them for use in the Modified datetime's of Get Metadata.

    I am working on the other questions. I will get back to you with my findings.

    Saturday, December 7, 2019 1:21 AM
    Moderator
  • For Question #3, the ForEach activity itself does not expose output, but I have a better solution. It provides more granular detail by capturing the status of the activities inside the ForEach activity.

    1. Create a list-type variable.
    2. Go to the For Each's inner activities.
    3. Create an Append Variable activity for each inner activity you want to capture.
    4. Connect each inner activity you wish to track to its own separate Append Variable activity. Use the red 'on failure' dependency to capture failures only, or the blue 'on completion' dependency to capture both successes and failures.


    If this solution does not satisfy you and you still want to get the output of the For Each itself, you can request / upvote the feature in the feedback forum.
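    The four steps above can be simulated in Python to show what accumulates in the list-type variable (a sketch with hypothetical item names and statuses, not ADF code):

    ```python
    # Simulates the Append Variable pattern: each ForEach iteration
    # appends its outcome to a list-type variable, so the statuses of
    # the inner activities survive past the loop.
    statuses = []  # stands in for the pipeline's list-type variable

    def run_iteration(item, copy_file):
        """Run one inner activity and record Succeeded/Failed."""
        try:
            copy_file(item)
            statuses.append({"item": item, "status": "Succeeded"})
        except Exception as exc:
            statuses.append({"item": item, "status": "Failed", "error": str(exc)})

    def copy_ok(name):
        pass  # pretend the copy succeeded

    def copy_bad(name):
        raise ValueError("bad row")  # pretend the copy failed

    run_iteration("File1.csv", copy_ok)
    run_iteration("File2.csv", copy_bad)
    print(statuses)
    ```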

    Monday, December 9, 2019 7:53 PM
    Moderator
  • I am having a little difficulty with question #1:

    1) If for example in FolderA I have multiple subfolder based on date such as, FolderA/YY/MM/DD/sales1.csv and I enable the Recursively copy, would I get the file name as YY/MM/DD/sales1.csv or only sales1.csv ?

    Since you are copying from blob to SQL, I do not see how the name becomes a concern on the output side. Recursive copy relates to how much of the source is selected, and is most visible in blob-to-blob or other file-store copies. Take, for example, a source whose contents look like:

    /container/rootfile.txt
    /container/folderA/fileA.txt
    /container/folderB/fileB.txt
    /container/folderA/subfolder/subfile.txt

    With the source targeted at container, and Recursive is off, only rootfile.txt would be copied.
    With the source targeted at container, and Recursive is on, all files would be copied.
    With the source targeted at folderA, and Recursive is off, only fileA.txt would be copied.
    With the source targeted at folderA, and Recursive is on, fileA.txt and subfile.txt would be copied.

    The folders may also be copied as part of the process.
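    The four cases above can be modeled with a short Python sketch (a simulation of the selection behaviour, not ADF code):

    ```python
    # Models which files a copy source selects, depending on the
    # targeted folder and the Recursive flag.
    files = [
        "/container/rootfile.txt",
        "/container/folderA/fileA.txt",
        "/container/folderB/fileB.txt",
        "/container/folderA/subfolder/subfile.txt",
    ]

    def select(files, folder, recursive):
        """Return the files a copy from `folder` would pick up."""
        prefix = folder.rstrip("/") + "/"
        picked = [f for f in files if f.startswith(prefix)]
        if not recursive:
            # non-recursive: keep only files directly under the folder
            picked = [f for f in picked if "/" not in f[len(prefix):]]
        return picked

    print(select(files, "/container", recursive=False))
    ```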

    Tuesday, December 10, 2019 9:51 PM
    Moderator
  • @Sam cit please let me know whether this answers your questions.  If I do not hear back, I will assume the issue is solved.
    Thursday, December 12, 2019 6:34 PM
    Moderator
  • Since I haven't heard back, I will assume the case resolved.
    Saturday, December 14, 2019 1:13 AM
    Moderator
  • Hi Martin,

    Thanks for your suggestion on Question # 3. I would try this approach and get back.

    Regarding question 1, I would like to know: in the scenario where the source is targeted at folderA and Recursive is on, would a Metadata activity filtered by "Modified datetime start and end date" output the path as "subfolder/subfile.txt" or as "subfile.txt"?

    Appreciate your help.



    • Edited by Sam cit Monday, December 16, 2019 1:58 PM
    Monday, December 16, 2019 1:56 PM
  • My apologies.  I was under the impression the question was regarding copy data, not get metadata.  I'll try it out today.
    Monday, December 16, 2019 6:15 PM
    Moderator
  • There is no "Recursive" option for Get Metadata activity.

    When Get Metadata Activity is aimed at FolderA and the field is childItems, "FileA.txt" and "subfolder" are returned.  Not "FolderA/FileA.txt" and "FolderA/subfolder".

    If the fieldList contains both 'Child Items' and 'Item Name', the result looks like:

    {
        "itemName": "FolderA",
        "childItems": [
            {
                "name": "FileA.txt",
                "type": "File"
            },
            {
                "name": "subfolder",
                "type": "Folder"
            }
        ],
    ...

    This can be used to create "FolderA/FileA.txt" by concatenation.

    @concat(
    activity('Get Metadata1').output.itemName,
    '/',
    activity('Get Metadata1').output.childItems[0].name)
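    The same concatenation, mirrored in Python against the sample output above (a simulation, not ADF code):

    ```python
    # Mirrors the @concat expression: build a relative path from the
    # Get Metadata output's itemName and the first childItems entry.
    output = {
        "itemName": "FolderA",
        "childItems": [
            {"name": "FileA.txt", "type": "File"},
            {"name": "subfolder", "type": "Folder"},
        ],
    }
    path = output["itemName"] + "/" + output["childItems"][0]["name"]
    print(path)  # FolderA/FileA.txt
    ```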

    Monday, December 16, 2019 7:31 PM
    Moderator