none
U-SQL adding an extra column to rowset

    Question

  • Hi,

    Let's say I have 2 files like this:

    File 1 (has 1 column called "id"):

    "id"

    1

    2

    5

    File 2:

    "id"   "name"      "age"

    1        a              20

    2        b              30

    3        c              40  

    4        d              40

    5        e              20

    And I want to output a file like this:

    id       name        age          newColumn

    1        a              20                 1

    2        b              30                 1

    3        c              40                 0

    4        d              40                 0

    5        e              20                 1

    So basically, I want to output the data from file2 but with an extra column which has a '1' if a row with the same id is in file1 and a '0' otherwise.

    How can I do this in U-SQL? The only way to add a column that I have come across is altering a table, but when I read the data from a file I get a rowset. Also, I am not sure how to check, for each row in file2, if a row with the same id appears in file1.

    Any tips would be appreciated. Thanks!

    Friday, June 23, 2017 8:40 PM

Answers

  • You can use ANTISEMIJOIN and SEMIJOIN to get your result. I use constant rowsets to model your input:

    @r1 =
    SELECT id
    FROM(
        VALUES
        (
            1
        ),
        (
            2
        ),
        (
            5
        )) AS T(id);
    
    @r2 =
    SELECT *
    FROM(
        VALUES
        (
            1,
            "a",
            20
        ),
        (
            2,
            "b",
            30
        ),
        (
            3,
            "c",
            40
        ),
        (
            4,
            "d",
            40
        ),
        (
            5,
            "e",
            20
        )) AS T(id, name, age);
    
    @res1 =
    SELECT @r2.id, name, age,
           1 AS newColumn
    FROM @r2
          SEMIJOIN
             @r1 ON @r1.id == @r2.id;
    
    @res2 =
    SELECT @r2.id, name, age,
           0 AS newColumn
    FROM @r2
          ANTISEMIJOIN
             @r1 ON @r1.id == @r2.id;
    
    @res =
    SELECT *
    FROM @res1
    UNION ALL
    SELECT *
    FROM @res2;
    
    OUTPUT @res
    TO "/output/result.csv"
    USING Outputters.Csv();


    Michael Rys

    Friday, June 23, 2017 10:09 PM
    Moderator

All replies

  • You can use ANTISEMIJOIN and SEMIJOIN to get your result. I use constant rowsets to model your input:

    @r1 =
    SELECT id
    FROM(
        VALUES
        (
            1
        ),
        (
            2
        ),
        (
            5
        )) AS T(id);
    
    @r2 =
    SELECT *
    FROM(
        VALUES
        (
            1,
            "a",
            20
        ),
        (
            2,
            "b",
            30
        ),
        (
            3,
            "c",
            40
        ),
        (
            4,
            "d",
            40
        ),
        (
            5,
            "e",
            20
        )) AS T(id, name, age);
    
    @res1 =
    SELECT @r2.id, name, age,
           1 AS newColumn
    FROM @r2
          SEMIJOIN
             @r1 ON @r1.id == @r2.id;
    
    @res2 =
    SELECT @r2.id, name, age,
           0 AS newColumn
    FROM @r2
          ANTISEMIJOIN
             @r1 ON @r1.id == @r2.id;
    
    @res =
    SELECT *
    FROM @res1
    UNION ALL
    SELECT *
    FROM @res2;
    
    OUTPUT @res
    TO "/output/result.csv"
    USING Outputters.Csv();


    Michael Rys

    Friday, June 23, 2017 10:09 PM
    Moderator
  • Thank you so much!
    Friday, June 23, 2017 10:56 PM