U-SQL Fileset Wildcard not pulling all of the files

    Question

  • I have a U-SQL job that has been running for about 6 months. When it was built, everything worked and was validated to include all data. Every day, ADF jobs pull data from source systems and place it in a date-based folder structure in the Data Lake, along the lines of Source System/source table/Year/Month/Day/filename.tsv. The U-SQL job pulls data from about 15 different file structures, then aggregates and transforms the data to send downstream. While debugging some missing data, I found the following:

    These 2 statements should pull the same number of files (a file per day up until May 26th)   

    FROM "/sourcesystem/app_defect/{*}/{*}/{*}/app_defect.tsv"

        returns 139 streams (stops May 9th, 2016 even though there are files daily up to May 26)

    FROM "/sourcesystem/app_inspection/{*}/{*}/{*}/app_inspection.tsv"

        returns 47 streams (stops Jan 9th, 2016 even though there are files daily up to May 26, 2016)

    How can I debug why all of the files are not being picked up?

       

       

    Friday, May 27, 2016 6:47 PM

All replies

  • I think we resolved this issue in discussions with you?

    The issue was that the pattern did not match files whose names had additional parts. Changing the pattern to include an additional wildcard helped:

    /sourcesystem/app_inspection/{*}/{*}/{*}/app_inspection{*}.tsv


    Michael Rys

    Thursday, June 16, 2016 12:00 AM
    Moderator
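
One way to see why the extra wildcard matters is to simulate the two patterns against a file listing outside of U-SQL. The Python sketch below is only an illustration, not how U-SQL resolves file sets: it assumes each {*} matches any run of characters (possibly empty) within a single path segment, and the file names with a `_part1` suffix are hypothetical, since the thread does not give the real names. `fileset_to_regex` and `split_matches` are helper names made up for this example.

```python
import re

def fileset_to_regex(pattern):
    """Turn a file set pattern into a compiled regex.

    Assumption (not taken from the U-SQL docs): each {*} matches any run of
    characters, possibly empty, that does not cross a '/' boundary.
    """
    literal_parts = pattern.split("{*}")
    return re.compile("[^/]*".join(re.escape(part) for part in literal_parts))

def split_matches(pattern, paths):
    """Return (matched, missed) lists of paths for the given pattern."""
    rx = fileset_to_regex(pattern)
    matched = [p for p in paths if rx.fullmatch(p)]
    missed = [p for p in paths if not rx.fullmatch(p)]
    return matched, missed

ORIGINAL = "/sourcesystem/app_inspection/{*}/{*}/{*}/app_inspection.tsv"
FIXED = "/sourcesystem/app_inspection/{*}/{*}/{*}/app_inspection{*}.tsv"

# Hypothetical listing: the "_part1" names stand in for files whose names
# have "additional parts"; the real names are not given in the thread.
paths = [
    "/sourcesystem/app_inspection/2016/01/08/app_inspection.tsv",
    "/sourcesystem/app_inspection/2016/01/09/app_inspection.tsv",
    "/sourcesystem/app_inspection/2016/01/10/app_inspection_part1.tsv",
    "/sourcesystem/app_inspection/2016/05/26/app_inspection_part1.tsv",
]

for name, pattern in (("original", ORIGINAL), ("fixed", FIXED)):
    matched, missed = split_matches(pattern, paths)
    print(f"{name}: {len(matched)} matched, {len(missed)} missed")
    for path in missed:
        print("  missed:", path)
```

Running this against a real directory listing exported from the Data Lake (in place of the hypothetical `paths`) would show which files the original pattern skips, which is usually enough to spot a naming change such as the one described in the reply.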