Pipeline to Copy On Premise File System to Azure Data Lake Store

    Question

  • Hello,

    I have about 100 GB of data in the form of CSV files, organized in a folder/sub-folder structure where each sub-folder contains multiple files. I want to migrate/copy this entire hierarchy to Data Lake Store, and I have tried the following.

    Steps for creating Source:

    1) Created a gateway using Data Management Gateway. The JSON for the gateway is:

    {
        "name": "FileSystemGateWay",
        "properties": {
            "description": "FileSystemGateWay",
            "hostServiceUri": "<Host URI for Service running on host machine>",
            "dataFactoryName": "<Factory name>",
            "status": "Online",
            "versionStatus": "UpToDate",
            "version": "1.9.5865.2",
            "registerTime": "2016-02-18T08:51:46.0196586Z",
            "lastConnectTime": "2016-02-22T05:43:44.2630273Z"
        }
    }

    2) Created a linked service using Data Factory for the file system and set the gateway mentioned in the step above:

    {
        "name": "FolderDataStoreLinkService",
        "properties": {
            "description": "",
            "hubName": "hub name",
            "type": "OnPremisesFileServer",
            "typeProperties": {
                "host": "\\hostname",
                "gatewayName": "FileSystemGateway",
                "userId": "domain\username",
                "password": "password"
            }
        }
    }

    3) Created a new dataset for the file system with the linked service created in the step above, setting the type to 'FileShare' and the folder path on my host machine that contains all the data:

    {
        "name": "OnPremisesFileDataSet",
        "properties": {
            "published": false,
            "type": "FileShare",
            "linkedServiceName": "<Link Service Name>",
            "typeProperties": {
                "folderPath": "DataLake\\Temp\\"
            },
            "availability": {
                "frequency": "Hour",
                "interval": 2
            },
            "external": true,
            "policy": {}
        }
    }

    Steps for creating Destination:

    1) Created a linked service for Azure Data Lake Store with the authorized store URI. The JSON for the linked service is:

    {
        "name": "<Link Service Name>",
        "properties": {
            "description": "",
            "hubName": "<Hub Name>",
            "type": "AzureDataLakeStore",
            "typeProperties": {
                "dataLakeStoreUri": "<Store uri>",
                "subscriptionId": "<subscription ID>",
                "resourceGroupName": "RG_Name"
            }
        }
    }

    2) Created a dataset for Azure Data Lake with the destination linked service. The JSON for the dataset is:

    {
        "name": "DataLakeDataSet",
        "properties": {
            "published": false,
            "type": "AzureDataLakeStore",
            "linkedServiceName": "<Link Service Name>",
            "typeProperties": {
                "folderPath": "\\input\\"
            },
            "availability": {
                "frequency": "Hour",
                "interval": 1
            }
        }
    }

    I have also tried the typeProperties JSON as:

            "typeProperties": {
                "fileName": "*.*",
                "folderPath": "input/",
                "format": {
                    "type": "TextFormat",
                    "rowDelimiter": "\n",
                    "columnDelimiter": "\t"
                }
            }

    After this, I created a pipeline connecting the source to the destination. The JSON for the pipeline is:

    {
        "name": "PipelineTemplate",
        "properties": {
            "description": "Copy File from File System to Lake",
            "activities": [
                {
                    "type": "Copy",
                    "typeProperties": {
                        "source": {
                            "type": "FileSystemSource"
                        },
                        "sink": {
                            "type": "AzureDataLakeStoreSink",
                            "writeBatchSize": 0,
                            "writeBatchTimeout": "00:00:00"
                        }
                    },
                    "inputs": [
                        {
                            "name": "<SourceDataset Name>"
                        }
                    ],
                    "outputs": [
                        {
                            "name": "<DestinationDataSet Name>"
                        }
                    ],
                    "scheduler": {
                        "frequency": "Hour",
                        "interval": 1
                    },
                    "name": "OnpremisesFileSystemtoStore",
                    "description": "copy activity"
                }
            ],
            "start": "2016-02-16T20:00:00Z",
            "end": "2016-02-16T21:00:00Z",
            "isPaused": false,
            "hubName": "Factory hub name",
            "pipelineMode": "Scheduled"
        }
    }

    However, nothing has worked for me. It seems that this source file system setting works for a single file only. If I want to migrate or move the entire folder structure to Data Lake Store, what exact settings do I need so that it creates an identical replica of the file system in my store?

    Currently, folder partitioning is possible by year/month/day/hour, but my folder structure is Region/Unit, with day-wise files under each unit, so I am not sure how I can apply partitioning to it. partitionedBy supports only one type, DateTime. How can we apply partitioning based on string values?
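
    For reference, the only partitioning I can find documented is DateTime-based, along the lines of the following sketch (the folder and token names here are just placeholders, not from my actual setup):

            "typeProperties": {
                "folderPath": "input/{Slice}/",
                "partitionedBy": [
                    {
                        "name": "Slice",
                        "value": {
                            "type": "DateTime",
                            "date": "SliceStart",
                            "format": "yyyyMMdd"
                        }
                    }
                ]
            },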



    Monday, February 22, 2016 5:56 AM

Answers

  • Hi Manthan,

    There are additional parameters you can use:
    https://azure.microsoft.com/en-us/documentation/articles/data-factory-onprem-file-system-connector/

    Specifically in the Pipeline parameters:
                    "typeProperties": {
                        "source": {
                            "type": "FileSystemSource",
                            "recursive": true
                        },
                        "sink": {
                            "type": "AzureDataLakeStoreSink",
                            "copyBehavior": "PreserveHierarchy",
                            "writeBatchSize": 0,
                            "writeBatchTimeout": "00:00:00"
                        }
                    },
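
    As a side note from the same documentation, copyBehavior also accepts FlattenHierarchy and MergeFiles if you would rather have every file land in a single target folder or be merged into one file, e.g.:

                        "sink": {
                            "type": "AzureDataLakeStoreSink",
                            "copyBehavior": "MergeFiles"
                        }
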
    Let us know if you still have any issues.

    Thanks,
    Sachin Sheth
    Program Manager
    Azure Data Lake

    Tuesday, February 23, 2016 7:35 PM

All replies

  • Hi Sachin,

    Thank you for your response.

    Yes, while trying other options I found the same property, which has helped me keep the hierarchy and the file/folder names in the same format. This is a great help and confirms that we are going in the right direction.

    Thank you,

    Manthan Upadhyay

    Wednesday, February 24, 2016 4:35 AM
  • Hi Sachin,

    Is it possible to do this but restrict the files copied with a wildcard such as *.csv? My client has a source structure that contains both data files and standard collaboration files (doc, xlsx, etc.) which they would not want copied.

    Thanks!

    David

    Wednesday, March 16, 2016 7:16 PM
  • Hi David,

    Please see if the fileFilter property helps. Here is the documentation on the property: 

    https://azure.microsoft.com/en-us/documentation/articles/data-factory-onprem-file-system-connector/#on-premises-file-system-dataset-type-properties

    Specify a filter to be used to select a subset of files in the folderPath rather than all files. 

    Allowed values are: * (multiple characters) and ? (single character).

    Example 1: "fileFilter": "*.log"
    Example 2: "fileFilter": "2014-1-?.txt"

    Note: fileFilter is applicable to an input FileShare dataset.
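
    For example, a quick sketch of how it would look plugged into the input dataset from the original question (reusing those names; adjust the folder and linked service to your environment):

    {
        "name": "OnPremisesFileDataSet",
        "properties": {
            "published": false,
            "type": "FileShare",
            "linkedServiceName": "FolderDataStoreLinkService",
            "typeProperties": {
                "folderPath": "DataLake\\Temp\\",
                "fileFilter": "*.csv"
            },
            "availability": {
                "frequency": "Hour",
                "interval": 2
            },
            "external": true,
            "policy": {}
        }
    }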

    HTH,

    Sreedhar

    Thursday, June 30, 2016 7:57 PM