none
Split a file based on it's contents

    Question

  • I have a U-SQL script that generates a rather larger file.

    A small sample:

    "UA-1234", 1, 1, 2
    "UA-1234", 1, 2, 3
    "UA-5678", 5, 6, 7
    "UA-5678", 7, 8, 9
    "UA-1234", 3, 2, 3
    "UA-9876", 1, 2, 3

    What I'd like to do, given that sample, is split the file into smaller files based on the first column. So in this example, I'd end up with 3 files:

    UA-1234.csv:
    "UA-1234", 1, 1, 2
    "UA-1234", 1, 2, 3
    "UA-1234", 3, 2, 3

    UA-5678.csv:
    "UA-5678", 5, 6, 7
    "UA-5678", 7, 8, 9

    UA-9876.csv:
    "UA-9876", 1, 2, 3

    Is this possible with straight U-SQL?

    Thanks,
    Matt

    Saturday, March 26, 2016 6:49 PM

Answers

  • Hi Matt

    This feature is not yet available but is being worked on. How many files would you need to generate?

    Today, you would write a procedure that parameterizes the file names as follows and then script gen the invocation with every possible value.

    DROP PROCEDURE IF EXISTS WriteToFile;
    DROP TYPE IF EXISTS T;
    CREATE TYPE T AS TABLE(a string, b int, c int, d int);
    CREATE PROCEDURE WriteToFile(@rowset T, @fname string, @path string = "/output")
    AS
    BEGIN
    DECLARE @loc = @path + "/" + @fname + ".csv";
       
    @x =
        SELECT *
        FROM @rowset
        WHERE a == @fname;
    OUTPUT @x
    TO @loc
    USING Outputters.Csv();
    END;
    @t =
        SELECT *
        FROM(
            VALUES
            (
                "UA-1234",
                1,
                1,
                2
            ),
            (
                "UA-1234",
                1,
                2,
                3
            ),
            (
                "UA-5678",
                5,
                6,
                7
            ),
            (
                "UA-5678",
                7,
                8,
                9
            ),
            (
                "UA-1234",
                3,
                2,
                3
            ),
            (
                "UA-9876",
                1,
                2,
                3
            )) AS t(a,b,c,d);
    WriteToFile
    (
        @t,
        "UA-1234",
        DEFAULT
    );


    Michael Rys

    • Marked as answer by mdpurecars Monday, March 28, 2016 11:52 PM
    Saturday, March 26, 2016 10:00 PM
    Moderator

All replies

  • Hi Matt

    This feature is not yet available but is being worked on. How many files would you need to generate?

    Today, you would write a procedure that parameterizes the file names as follows and then script gen the invocation with every possible value.

    DROP PROCEDURE IF EXISTS WriteToFile;
    DROP TYPE IF EXISTS T;
    CREATE TYPE T AS TABLE(a string, b int, c int, d int);
    CREATE PROCEDURE WriteToFile(@rowset T, @fname string, @path string = "/output")
    AS
    BEGIN
    DECLARE @loc = @path + "/" + @fname + ".csv";
       
    @x =
        SELECT *
        FROM @rowset
        WHERE a == @fname;
    OUTPUT @x
    TO @loc
    USING Outputters.Csv();
    END;
    @t =
        SELECT *
        FROM(
            VALUES
            (
                "UA-1234",
                1,
                1,
                2
            ),
            (
                "UA-1234",
                1,
                2,
                3
            ),
            (
                "UA-5678",
                5,
                6,
                7
            ),
            (
                "UA-5678",
                7,
                8,
                9
            ),
            (
                "UA-1234",
                3,
                2,
                3
            ),
            (
                "UA-9876",
                1,
                2,
                3
            )) AS t(a,b,c,d);
    WriteToFile
    (
        @t,
        "UA-1234",
        DEFAULT
    );


    Michael Rys

    • Marked as answer by mdpurecars Monday, March 28, 2016 11:52 PM
    Saturday, March 26, 2016 10:00 PM
    Moderator
  • I can't really say how many files there would be on a given run, but it would probably be more than 100, maybe around 1000.

    It would really all depend on the number of distinct first column values, the "UA-xxx"s.

    We wouldn't really know which of these would be included before hand since we're aggregating 60 days worth of information and there isn't a set number of those first column values.

    I know we can get a distinct list of the first column values. Is there a way to "loop" through those to call the WriteToFile procedure in the example?

    Thanks,

    Matt


    Saturday, March 26, 2016 10:15 PM
  • You can only loop through in for example Powershell. Or write one script that creates the script that contains the calls and then runs that second script (after downloading it to submit).

    Once the feature that we are working on to allow file set patterns in the OUTPUT statement has been completed, you should be able to do it in one script.


    Michael Rys

    Monday, March 28, 2016 6:30 PM
    Moderator