none
U-SQL JOIN not ignoring NULL Values (With example code/data)

    Question

  • I have been getting incorrect results in a process and in troubleshooting the problem I realized that my NULL values are not being ignored when joined on.  For instance, in SQL, if you JOIN on a NULL value it returns no record.  But, in U-SQL it seems to still return the record.  Likely because it's joining on \N == \N instead of NULL = NULL.  How do I get around this or fix this issue? 

    Here's test data/code to show the

    DECLARE @Prod_dir string = "/TestData/";
    DECLARE @FileName string = "employee.csv";

    @data = 
        SELECT * FROM 
            ( VALUES
            ("02143",   30,     100,  "Smith"),
            ("98052",   25,     50,   "Andersen"),
            ("02139",   25,     60,   (string)null),
            ("61801",   23,     80,   (string)null)
            ) AS T(PersonId, Age, Salary, LastName);

    @output =
        SELECT a.*, b.PersonId AS NewID, b.Age AS NewAge, b.Salary AS NewSalary, b.LastName AS NewLastName
        FROM @data AS a
             INNER JOIN @data AS b
    ON a.LastName == b.LastName;

    OUTPUT @output
        TO @Prod_dir + "TEST_"+ @FileName
        ORDER BY PersonId ASC
        USING Outputters.Csv(nullEscape: "\\N", quoting:true, outputHeader:true);

    Tuesday, May 1, 2018 7:17 PM

Answers

  • The only way I've found around this is to add a NULL check on each of my joins and convert the join to a CROSS JOIN with a self reference to itself on ID=ID. For instance:

    CROSS JOIN ON (a.LastName == b.LastName AND a.LastName IS NOT NULL) OR (a.PersonId = b.PersonId) 

    The alternative is to do multiple Selects and UNION them together. 

    Neither of these solutions are ideal.  It seems to be a Bug that the NULL is not actually handled as a true NULL.

    • Marked as answer by FrankMn Wednesday, May 9, 2018 2:08 PM
    Tuesday, May 1, 2018 8:40 PM

All replies

  • The only way I've found around this is to add a NULL check on each of my joins and convert the join to a CROSS JOIN with a self reference to itself on ID=ID. For instance:

    CROSS JOIN ON (a.LastName == b.LastName AND a.LastName IS NOT NULL) OR (a.PersonId = b.PersonId) 

    The alternative is to do multiple Selects and UNION them together. 

    Neither of these solutions are ideal.  It seems to be a Bug that the NULL is not actually handled as a true NULL.

    • Marked as answer by FrankMn Wednesday, May 9, 2018 2:08 PM
    Tuesday, May 1, 2018 8:40 PM
  • You will achieve your desired result by adding the following line to your SELECT statement:

    WHERE b.LastName IS NOT NULL;



    Wednesday, May 2, 2018 5:08 PM