locked
Blob File Filters - Urgent Please RRS feed

  • Question

  • Hi

    I have data files being inserted into a Blob Container. The file names are all beginning with a GUID and then the name of a Stored Procedure. for Example as follows

    21f9e7d6-785a-45a9-afaf-31b81e10a55a_custom.usp_CustomMasterData_TvpMergeV3.json

    SO, I need a file Filter for my Source in my ADF Pipeline which will filter out the GUID and leave: 

    CustomMasterData_TvpMergeV3.json

    Right now with *.json it is reading ALL of the Blob files (2 different record "types).

    I have read https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-blob-storage#folder-and-file-filter-examples

    But it doesn't talk about using the ??? and other file filters as shown in the help tip. 

    Can someone please assist?

    Thanks!
    Mike Kiser


    Mike Kiser

    Wednesday, August 7, 2019 9:50 PM

Answers

  • Hello Mike Kiser and thank you for your inquiry.  Since this is urgent, I will get right to the matter.

    Wildcard character ? matches 0 or 1 characters.
    Wildcard character * matches 0 or 1 or many characters.

    There are two parts to your ask:
    1: Identify the files with GUID
    2: Remove the GUID from the name, leaving the human-readable filename behind.

    I notice that there is an underscore between the guid and filename.  You could do like:
    *_*.json
    However there is vulnerability here.  Filename of "foo_bar.json" would also be picked up.  Since you say ALL of the files have GUID I won't worry about that.  Please confirm this assumption.  When you say there are 'two different record types', do you mean "foo.json" versus "foo.txt" ?

    Since you want to do a transform, I'm afraid we must do a "GetMetadata" to fetch the filenames, then "ForEach" them.
    Inside the ForEach, the copy activity is placed.  Given your example
    some-Guid_storedprocedurename.usp_filename.json
    I extract the filename by the dynamic expression

    @last(split(pipeline().parameters.filename,'.usp_'))

    Example code for the split

    {
        "name": "Clean_up_filename",
        "properties": {
            "activities": [
                {
                    "name": "Set Variable1",
                    "type": "SetVariable",
                    "dependsOn": [],
                    "userProperties": [],
                    "typeProperties": {
                        "variableName": "pretty_filename",
                        "value": {
                            "value": "@last(split(pipeline().parameters.filename,'.usp_'))",
                            "type": "Expression"
                        }
                    }
                }
            ],
            "parameters": {
                "raw_filename": {
                    "type": "string",
                    "defaultValue": "21f9e7d6-785a-45a9-afaf-31b81e10a55a_custom.usp_CustomMasterData_TvpMergeV3.json"
                }
            },
            "variables": {
                "pretty_filename": {
                    "type": "String"
                }
            },
            "annotations": []
        }
    }


    Wednesday, August 7, 2019 11:59 PM
  • Wow, Martin!! I so much appreciate this. Yes, the 2nd file name type is

    51ffa21b-3b45-4786-87a1-57d1007a435c_scada.usp_HourlyGas_TvpMerge.json

    and as I already mentioned the other type of filename is

    neweraf6-c3ad-4ea0-85fc-044e9dd3f6cf_custom.usp_CustomMasterData_TvpMergeV3.json

    They are both .json files but have a different "table (column) structure.

    so both file names have GUIDs in front of the Stored Proc name in the json file. 

    Let me absorb your reply tonight and try to implement it and I'll let you know how it goes. Yes, I have a demo of this new oil and gas ADF system on Monday to Executives...you know how that goes LOL

    Thanks again so much and I will keep you updated; I really appreciate your time for replying!
    Mike Kiser


    Mike Kiser



    • Edited by EMKISER Thursday, August 8, 2019 1:44 AM
    • Marked as answer by EMKISER Friday, August 9, 2019 2:43 PM
    Thursday, August 8, 2019 12:20 AM
  • If you need more explanations or details, let me know.  I will check this thread again tomorrow, but not on weekends.
    • Marked as answer by EMKISER Friday, August 9, 2019 2:43 PM
    Thursday, August 8, 2019 8:21 PM
  • Thanks Martin! This was exactly what I was needing. 

    I really appreciate your response!

    Mike Kiser


    Mike Kiser

    • Marked as answer by EMKISER Friday, August 9, 2019 9:16 PM
    Friday, August 9, 2019 2:43 PM
  • Thanks! I am good for now. Have a great weekend!
    Mike Kiser

    Mike Kiser

    • Marked as answer by EMKISER Friday, August 9, 2019 9:16 PM
    Friday, August 9, 2019 2:44 PM
  • You too!
    • Marked as answer by EMKISER Friday, August 9, 2019 9:16 PM
    Friday, August 9, 2019 8:58 PM

All replies

  • Hello Mike Kiser and thank you for your inquiry.  Since this is urgent, I will get right to the matter.

    Wildcard character ? matches 0 or 1 characters.
    Wildcard character * matches 0 or 1 or many characters.

    There are two parts to your ask:
    1: Identify the files with GUID
    2: Remove the GUID from the name, leaving the human-readable filename behind.

    I notice that there is an underscore between the guid and filename.  You could do like:
    *_*.json
    However there is vulnerability here.  Filename of "foo_bar.json" would also be picked up.  Since you say ALL of the files have GUID I won't worry about that.  Please confirm this assumption.  When you say there are 'two different record types', do you mean "foo.json" versus "foo.txt" ?

    Since you want to do a transform, I'm afraid we must do a "GetMetadata" to fetch the filenames, then "ForEach" them.
    Inside the ForEach, the copy activity is placed.  Given your example
    some-Guid_storedprocedurename.usp_filename.json
    I extract the filename by the dynamic expression

    @last(split(pipeline().parameters.filename,'.usp_'))

    Example code for the split

    {
        "name": "Clean_up_filename",
        "properties": {
            "activities": [
                {
                    "name": "Set Variable1",
                    "type": "SetVariable",
                    "dependsOn": [],
                    "userProperties": [],
                    "typeProperties": {
                        "variableName": "pretty_filename",
                        "value": {
                            "value": "@last(split(pipeline().parameters.filename,'.usp_'))",
                            "type": "Expression"
                        }
                    }
                }
            ],
            "parameters": {
                "raw_filename": {
                    "type": "string",
                    "defaultValue": "21f9e7d6-785a-45a9-afaf-31b81e10a55a_custom.usp_CustomMasterData_TvpMergeV3.json"
                }
            },
            "variables": {
                "pretty_filename": {
                    "type": "String"
                }
            },
            "annotations": []
        }
    }


    Wednesday, August 7, 2019 11:59 PM
  • Wow, Martin!! I so much appreciate this. Yes, the 2nd file name type is

    51ffa21b-3b45-4786-87a1-57d1007a435c_scada.usp_HourlyGas_TvpMerge.json

    and as I already mentioned the other type of filename is

    neweraf6-c3ad-4ea0-85fc-044e9dd3f6cf_custom.usp_CustomMasterData_TvpMergeV3.json

    They are both .json files but have a different "table (column) structure.

    so both file names have GUIDs in front of the Stored Proc name in the json file. 

    Let me absorb your reply tonight and try to implement it and I'll let you know how it goes. Yes, I have a demo of this new oil and gas ADF system on Monday to Executives...you know how that goes LOL

    Thanks again so much and I will keep you updated; I really appreciate your time for replying!
    Mike Kiser


    Mike Kiser



    • Edited by EMKISER Thursday, August 8, 2019 1:44 AM
    • Marked as answer by EMKISER Friday, August 9, 2019 2:43 PM
    Thursday, August 8, 2019 12:20 AM
  • If you need more explanations or details, let me know.  I will check this thread again tomorrow, but not on weekends.
    • Marked as answer by EMKISER Friday, August 9, 2019 2:43 PM
    Thursday, August 8, 2019 8:21 PM
  • Thanks Martin! This was exactly what I was needing. 

    I really appreciate your response!

    Mike Kiser


    Mike Kiser

    • Marked as answer by EMKISER Friday, August 9, 2019 9:16 PM
    Friday, August 9, 2019 2:43 PM
  • Thanks! I am good for now. Have a great weekend!
    Mike Kiser

    Mike Kiser

    • Marked as answer by EMKISER Friday, August 9, 2019 9:16 PM
    Friday, August 9, 2019 2:44 PM
  • You too!
    • Marked as answer by EMKISER Friday, August 9, 2019 9:16 PM
    Friday, August 9, 2019 8:58 PM