USQL - Loading a dynamic set of files

    Question

  • I'm trying to write a U-SQL query that accepts a variable set of files coming from Azure Blob storage. However, I can't find a variable syntax that can hold multiple files. I've tried a comma-separated string and a SQL.ARRAY<>.

    I would like to do something like this:

    DECLARE @filelist <sometype> = {@"wasb://container@storageaccount/filename1.json", @"wasb://container@storageaccount/filename2.json"}

    SELECT [columns]
    FROM @filelist

    The background behind this question is that I'm trying to find a way to do an incremental update from Blob storage to Azure Data Lake. I can query a list of new/updated files from an Azure SQL table and would like to use this list as input to U-SQL to copy these updated files to the Data Lake.

    Wednesday, July 27, 2016 11:02 AM

All replies

  • Dear Simon

    Here is the link to the relevant documentation. You can provide a list of files as an explicit list, but that list has to be enumerated in place and cannot be parameterized as an array. Feel free to file a feature request at http://aka.ms/adlfeedback.
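
    For illustration, an explicit in-place list (reusing the container and account names from the fileset example below; the concrete filenames are hypothetical) would look something like:

    @data = EXTRACT ... your schema ...
            FROM "wasb://container@account/folder/filename1.json",
                 "wasb://container@account/folder/filename2.json"
            USING ....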

    Alternatively you can use the fileset syntax, e.g.

    EXTRACT ... FROM "wasb://container@account/folder/filename{*}.json" USING ....


    Michael Rys

    Wednesday, July 27, 2016 10:26 PM
    Moderator
  • Hello Michael,

    Thanks for your answer so far. I also tried the fileset syntax, however I don't see a way to specify multiple patterns. To solve my problem I would have to be able to pass a list of filenames as a pattern, or specify multiple patterns. However, in my opinion, a variable with an array of input files would offer more flexibility.

    For now, I'm going to try to generate a U-SQL query that contains an EXTRACT statement for each file that I need to load. The disadvantage is that I don't see how to use a generated U-SQL query in a Data Factory flow, and I really prefer to use DF for the orchestration of my data flows.

    I'll submit a feature request. Any other tips are greatly appreciated.

    Thursday, July 28, 2016 6:26 AM
  • Hi Simon,

    If you want to orchestrate this whole process from ADF you could try something like:

    1. Create a T-SQL stored procedure in your ASQLDB that generates your U-SQL query and saves the result in an ASQLDB table. Execute this SP from ADF.

    2. Create an ADF pipeline that reads the generated U-SQL query from the ASQLDB table and moves it to blob storage.

    3. Create an ADF activity that executes the U-SQL query stored in the file on blob storage. For an example, see: https://azure.microsoft.com/nl-nl/documentation/articles/data-factory-usql-activity/#data-lake-analytics-u-sql-activity


    Jorg Klein's Microsoft Business Intelligence Blog



    • Edited by Jorg Klein MVP Thursday, July 28, 2016 1:01 PM
    • Marked as answer by Simon_Z Thursday, August 18, 2016 12:39 PM
    Thursday, July 28, 2016 12:45 PM
  • If you need more than one pattern, you should be able to say something like:

    @data = EXTRACT ... your schema ...
                 FROM "/path/{date:yyyy}/{filename}.csv", "/otherpath/{date2:yyyy}/{date2:MM}/file{fileno}.csv"
                 ....


    Or if you want to select from a list of filenames, something like:

    @d =
        EXTRACT date string,
                time string,
                author string,
                tweet string, 
                filename string
        FROM "/Samples/Data/Tweets/{filename}Tweets.csv"
        USING Extractors.Csv();
    
    @r =
        SELECT *
        FROM @d
        WHERE filename IN("iC", "MikeDoesBigData", "mwinkle");

    works.

    You can also write something like:

    DECLARE @tweeters SqlArray<string> = new SqlArray<string>{"iC", "MikeDoesBigData", "mwinkle"};
                                       
    @d =
        EXTRACT date string,
                time string,
                author string,
                tweet string, 
                filename string
        FROM "/Samples/Data/Tweets/{filename}Tweets.csv"
        USING Extractors.Csv();
    
    @r =
        SELECT *
        FROM @d
        WHERE  @tweeters.Contains(filename);
    
    OUTPUT @r
    TO "/output/result1.csv"
    USING Outputters.Csv();
    

    But note that this approach does not provide partition elimination: the array lookup is not constant-folded, so all files matching the pattern are read and the filter is applied afterwards.


    Michael Rys

    Thursday, July 28, 2016 10:07 PM
    Moderator
  • Thanks for your help Michael.

    I've spent some time experimenting and ended up with the solution Jorg suggested. I want to do a recurring partial load based on changed files, so I need to point to a changing set of files (in different folders) for each execution. I cannot take a complete folder, because the folders contain up to 10,000 different files.

    That means my only option is to generate a query before each execution. Following Jorg's advice, I now generate a new query for each execution in an ASQLDB. I copy this query to a file in blob storage using the Data Factory and execute the copied query using an Azure Data Lake activity in the Data Factory. To me it's quite a complicated solution, but it seems to work for now.

    @data =
        EXTRACT date string,
                time string,
                author string,
                tweet string,
                filename string
        FROM "generatedfolder/generatedfilename",
             "generatedfolder1/generatedfilename1",
             "generatedfolder2/generatedfilename2"
        USING Extractors.Csv();
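
    To actually land the data in the Data Lake, the generated script presumably also ends with an OUTPUT of the extracted rowset; a minimal sketch, with a hypothetical target path:

    // @data is the rowset extracted above; the output path is a placeholder.
    OUTPUT @data
    TO "/output/incremental_load.csv"
    USING Outputters.Csv();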




    • Edited by Simon_Z Thursday, August 18, 2016 1:10 PM
    Thursday, August 18, 2016 1:05 PM
  • Dear Simon

    That works. Or you could write a pattern and then pass the parameters for the pattern.

    For example

    @data =
        EXTRACT date string, time string, ..., folder string, filename string, folderno int, fileno int
        FROM "/.../{folder}{folderno}/{filename}{fileno}.csv"
        USING Extractors.Csv();

    @data =
        SELECT date, time, ...
        FROM @data
        WHERE folder IN ("folder1", "folder2")
              AND folderno BETWEEN @first AND @last
              AND filename IN ("filenameA", "filenameB")
              AND fileno BETWEEN @firstfileno AND @lastfileno;
    

    where you pass the names as parameters to your script.
    The main problem I see is that we currently do not constant-fold lookups in constant arrays, only in the IN list... Otherwise you could do it with an arbitrary list of filenames.
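
    As a rough sketch of that parameter passing (assuming the script is submitted from a tool such as an ADF U-SQL activity that can inject its own DECLARE statements), the bounds used in the WHERE clause above could be given overridable defaults; the values here are hypothetical:

    // EXTERNAL defaults are used unless the submitting tool injects its own
    // DECLARE statements for these names; the values below are placeholders.
    DECLARE EXTERNAL @first int = 1;
    DECLARE EXTERNAL @last int = 2;
    DECLARE EXTERNAL @firstfileno int = 1;
    DECLARE EXTERNAL @lastfileno int = 10;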


    Michael Rys

    Friday, August 19, 2016 5:56 AM
    Moderator
  • Hi Michael,

    Is there a way to pass the EXTRACT file definition dynamically, so that I can read multiple files with different definitions in a single script?

    Thanks,

    Vijay

    Thursday, August 23, 2018 9:27 AM