ADF v2: Getting list of SFTP files with "Get Metadata" activity times out with large file count in folder

    Question

  • Scenario
    I am using a "Get Metadata" activity to get a list of files in an SFTP directory from an outside vendor.  I then use a Filter activity to filter that list of files based on a date that is part of each file name.

    If I have a few files (like 10) it works fine; if I have a thousand files it also works fine.

    Problem:

    The problem is that the vendor keeps a rolling history window of a few months, which is typically 12,600 files in the SFTP directory. When there is some large number of files in a single directory (I haven't narrowed down the exact number), the "Get Metadata" activity that lists the files over the SFTP connection TIMES OUT.

    I have tried setting the timeout to 10 minutes, 15 minutes, etc., with no luck.

    Questions:

    Is there some hard-capped limitation? Am I missing something? Is there a setting that I can change?  I can't change the folder name/location or the number of files in this location; I have already asked the vendor.

    I attached an image below of what the pipeline looks like.

    Thanks!
    Matt


    • Edited by mschandler Friday, June 22, 2018 8:35 PM
    Friday, June 22, 2018 8:25 PM

Answers

  • Unfortunately, Get Metadata today has a limitation that the size of its output cannot exceed 1 MB.

    Extending this size limit is in our backlog, and we are currently working on a fix to return an error instead of an unhelpful timeout. Is it possible for you to separate the files into different subfolders for now?

    Monday, June 25, 2018 2:37 AM

All replies

  • Unfortunately, Get Metadata today has a limitation that the size of its output cannot exceed 1 MB.

    Extending this size limit is in our backlog, and we are currently working on a fix to return an error instead of an unhelpful timeout. Is it possible for you to separate the files into different subfolders for now?

    Monday, June 25, 2018 2:37 AM
  • This is really bad news for us.  Are there any workarounds you can think of?

    We have already asked the vendor (a very large multinational organization) and they refuse, saying no one else has issues; however, everyone else is on-premises as far as we know.

    We have a production system running on this now and are stuck!

    Is there another way to get a list of files?  I only need to see how many files have a particular date in the file name, for example:

    fileA_zone1_20180624.csv
    fileA_zone1_20180625.csv
    fileB_zone1_20180625.csv
    fileB_zone2_20180625.csv
    fileB_zone3_20180625.csv

    In this scenario, if I were looking for files with a date of June 25, 2018, it would be the last 4 files.
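For illustration, that filtering step can be sketched in plain Python (the `files_for_date` helper is hypothetical, not an ADF feature; the file names mirror the examples above):

```python
from datetime import date

# Illustrative helper: keep only files whose name ends with _YYYYMMDD.csv
# for the target date.
def files_for_date(names, target):
    suffix = "_" + target.strftime("%Y%m%d") + ".csv"
    return [n for n in names if n.endswith(suffix)]

names = [
    "fileA_zone1_20180624.csv",
    "fileA_zone1_20180625.csv",
    "fileB_zone1_20180625.csv",
    "fileB_zone2_20180625.csv",
    "fileB_zone3_20180625.csv",
]
print(files_for_date(names, date(2018, 6, 25)))  # the last 4 files
```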

    ~_~_~_~_~_

    Would any other field list accomplish what I need, or does only "Child Items" return a list of files?

    • Edited by mschandler Monday, June 25, 2018 1:35 PM
    Monday, June 25, 2018 1:21 PM
  • Sorry for the bad news. 

    What's the purpose of filtering files by date? Is it followed by a Copy activity? If so, the Copy activity has built-in file filter support to achieve this.

    If not, I'm afraid you currently need a custom activity to get your work done:
     https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity
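As a rough idea of what such a custom activity could do, here is a minimal sketch using the third-party paramiko library to list the SFTP directory and count that day's files. The host, credentials, path, and the `count_for_date` helper are all placeholders for illustration, not a definitive implementation:

```python
def count_for_date(listing, yyyymmdd):
    """Illustrative helper: count file names carrying the given date stamp."""
    return sum(1 for name in listing if name.endswith("_%s.csv" % yyyymmdd))

def list_remote_dir(host, username, password, path):
    """Connect over SFTP and return the file names in `path`."""
    import paramiko  # third-party: pip install paramiko
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username, password=password)
    try:
        sftp = client.open_sftp()
        return sftp.listdir(path)
    finally:
        client.close()

# Example call (placeholder connection details):
# listing = list_remote_dir("sftp.example.com", "user", "secret", "/incoming")
# print(count_for_date(listing, "20180625"))
```

Unlike Get Metadata, a custom activity like this only returns what you compute from the listing, so the 1 MB output limit is not a concern.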

    Besides, you can submit your feedback here: https://feedback.azure.com/forums/270578-data-factory, and we will reprioritize our backlog items accordingly.

    Tuesday, June 26, 2018 9:00 AM
  • Chunchun, thanks for the follow-up.  The purpose is that this vendor drops our daily files into a folder, and each file has a date in its name corresponding to the date we want to process.

    There are exactly 205 files to process each day, and it is a significant processing event, so we run the processing once per day.

    The vendor puts the files in the folder *sometime* during the day, and we need to process the data as soon as possible so that the executives can report on it.  So we have a looping process that checks whether the 205 files (again, based on the date) are available; if not, a "Wait" activity pauses for 15 minutes, then we check again.  The idea is that we process as soon as those 205 files are available.  We don't want to wait until evening if the files were available at 2 AM on a given day.
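The polling pattern described above (check, wait 15 minutes, check again) could be sketched like this in Python; `EXPECTED_COUNT` and the `count_today` callable are stand-ins for the real pipeline pieces, not actual ADF code:

```python
import time

EXPECTED_COUNT = 205  # files the vendor drops each day, per the post above

def wait_for_files(count_today, interval_seconds=900, max_tries=96):
    """Poll until all expected files exist.

    count_today() returns how many of today's files are currently present.
    Returns the attempt index on success; raises TimeoutError otherwise.
    """
    for attempt in range(max_tries):
        if count_today() >= EXPECTED_COUNT:
            return attempt  # files ready; start processing
        time.sleep(interval_seconds)  # the 15-minute "Wait" step
    raise TimeoutError("expected files never arrived")
```

With a 900-second interval and 96 tries, the loop gives up after roughly a day, mirroring the daily processing window.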

    Hope this makes sense.

    Tuesday, June 26, 2018 12:53 PM
  • Hi

    I'm trying to do a very similar thing, but I'm struggling to read a list of files from SFTP using the Get Metadata activity. How did you set it up?

    Monday, July 16, 2018 3:21 PM
  • Are you using the ADF V2 UI? It would be easy for you to set up a Get Metadata activity with the UI.
    Tuesday, July 17, 2018 2:54 AM