How to query multiple file? Need query for one month files.


  • I have Azure Data factory copying data to Azure Data Lake Storage every hour(or every day) and it is creating file for each hour.

    File format is followings "outTable216, 12, 14, 06, 00.csv", "outTable216, 12, 14, 07, 00.csv". 12 stand here for December. Each file have columns CustomerName and SalesAmount. We can use name of file or timestamp of to determine December files. It is also possible to query all files and then filter December, but I suppose that is very big load if there are whole year files in Azure Data Lake Storage.

    I would like to make query for all files created in December and SUM all SalesAmount of December and store to output file. I know how to make query for single file, but don't how to make query for multiple files from December.

    Could somebody advice me?


    Wednesday, December 14, 2016 8:26 AM


  • You want to use a file set query.

    In your extract statement, use something like the following:

    @input = EXTRACT ... // your regular schema
                   , month int // virtual column
             FROM "/path/outTable216, {month}, {*}.csv"
             USING Extractors.Csv();
    @December = SELECT * FROM @input WHERE month == 12;

    This will only extract data from the files with 12 in the location of the month.

    If you want to do time ranges, you can also specify a dateTime virtual column and pattern and then do dateTime comparisons. The virtual column and pattern would looks something like

    @input = EXTRACT ... // your regular schema
                   , date DateTime // virtual column
             FROM "/path/outTable{date:yyyy}, {date:MM}, {date:dd}, {date:HH}, {*}.csv"
             USING Extractors.Csv();

    Michael Rys

    • Marked as answer by Kenny_I Thursday, December 15, 2016 7:05 AM
    Wednesday, December 14, 2016 8:41 AM