none
Using BETWEEN (or equivalent) in a JOIN?

    Question

  • In T-SQL, I can write a query such as the following, which will join records from table 'b' on a common key which fall within a range from table 'a':

    SELECT
    FROM a
    JOIN b ON a.id = b.id
    AND b.EventDate BETWEEN a.StartDate AND a.EndDate

    Is there an equivalent in U-SQL? So far, I've encountered limitation upon limitation, e.g.:

    • The Join Condition >= is not supported.
    • The Join Condition <= is not supported.
    • Joining on a condition other than == is not supported.

    As an attempted workaround, I tried embedding the comparison logic within a ternary expression, e.g.:

    AND b.eventTime == (b.eventTime >= a.windowStart ? b.eventTime : b.eventNull)
    AND b.eventTime == (b.eventTime <= a.windowEnd ? b.eventTime : b.eventNull)

    But, this, too, returned an error:

    • The expressions used in the comparison operator inside a join condition are not columns.

    If there's a way to implement this logic, please advise. Moving this condition to the WHERE clause is a highly objectionable last resort.


    Jason

    Monday, November 23, 2015 4:59 PM

Answers

  • Hi Jason

    The full specification on how to write joins in U-SQL can be found here:

    https://msdn.microsoft.com/en-us/library/azure/mt621318.aspx

     (note that the docs may be reorganized and the link may not be the correct one after the reorg, so you can always start from http://aka.ms/usql_reference).

    So as Saveen points out, the right way is to move any non-equijoin conditions to the WHERE clause.

    The reason why we are syntactically more restrictive than normal SQL dialects is that we want to make it clearer where the optimizer and execution framework can provide you efficient processing of the join in the scale-out, distributed environment, and where you have to understand that the processing is more expensive.

    Unlike "single-box" SQL implementations, where data is local to the processing nodes, the data and processing is parallelized across many nodes. Doing a non-equijoin will require data movement between the nodes and thus is less efficient than an equijoin.

    In your example, moving the BETWEEN into the WHERE clause is not too bad, assuming your join predicate is selective, since it will apply the filter in the after the join has occurred.

    We could do the rewrite for you, but at that point in time you would still get the same plan and not understand the cost implications.

    So the recommended way to write it is:

    SELECT
    FROM a
    JOIN b ON a.id == b.id
    WHERE b.EventDate BETWEEN a.StartDate AND a.EndDate;

    I hope this helps

    Michael


    Michael Rys

    • Marked as answer by Jason L Brugger Wednesday, November 25, 2015 12:52 PM
    Wednesday, November 25, 2015 3:30 AM
    Moderator

All replies

  • Jason, could you elaborate on why moving the condition to the WHERE clause is undesirable?
    Monday, November 23, 2015 6:21 PM
    Moderator
  • It seems that a lot of unnecessary results would be returned by the JOIN and then discarded by the WHERE clause. Is this not the case? The above is a simplified example. Does it make a difference for OUTER joins?

    Jason

    Monday, November 23, 2015 6:57 PM
  • Hi Jason

    The full specification on how to write joins in U-SQL can be found here:

    https://msdn.microsoft.com/en-us/library/azure/mt621318.aspx

     (note that the docs may be reorganized and the link may not be the correct one after the reorg, so you can always start from http://aka.ms/usql_reference).

    So as Saveen points out, the right way is to move any non-equijoin conditions to the WHERE clause.

    The reason why we are syntactically more restrictive than normal SQL dialects is that we want to make it clearer where the optimizer and execution framework can provide you efficient processing of the join in the scale-out, distributed environment, and where you have to understand that the processing is more expensive.

    Unlike "single-box" SQL implementations, where data is local to the processing nodes, the data and processing is parallelized across many nodes. Doing a non-equijoin will require data movement between the nodes and thus is less efficient than an equijoin.

    In your example, moving the BETWEEN into the WHERE clause is not too bad, assuming your join predicate is selective, since it will apply the filter in the after the join has occurred.

    We could do the rewrite for you, but at that point in time you would still get the same plan and not understand the cost implications.

    So the recommended way to write it is:

    SELECT
    FROM a
    JOIN b ON a.id == b.id
    WHERE b.EventDate BETWEEN a.StartDate AND a.EndDate;

    I hope this helps

    Michael


    Michael Rys

    • Marked as answer by Jason L Brugger Wednesday, November 25, 2015 12:52 PM
    Wednesday, November 25, 2015 3:30 AM
    Moderator