none
Best way to compare date ranges in U-SQL?

    Question

  • Does anyone know the best way to join 2 tables on overlapping date ranges?

    What I have is 2 tables with start and end date columns. I'd like to join the tables where the date ranges overlap. I've tried some queries but they are very slow.



    • Edited by Tony Thul Wednesday, April 17, 2019 4:14 PM
    Wednesday, April 17, 2019 2:52 PM

All replies

  • Depends on the use-case.

    When it's required for further calendar calculus, then often using a calendar table as base table and join the other tables can simplify logic. But for the cost of larger sets.

    But basically it is a non-equi join over the negative critera: NOT non-overlapping. Which is simply

    .. INNER JOIN .. ON NOT (B.ToDate < A.FromDate OR A.ToDate < B.FromDate)

    EDIT:
    oops. This applies to T-SQL.. so not sure whether it should be done in U-SQL.
    Wednesday, April 17, 2019 3:03 PM
  • Right. U-SQL will not join on multiple conditions
    Wednesday, April 17, 2019 4:14 PM
  • Hi

    It would be useful to see what you have tried.

    In general, U-SQL does not allow non-equijoins in the ON clause because it wants to make sure that you are aware of the performance implications of such joins in a scale out environment. What you should do is to move the range condition into the WHERE clause (in case of an INNER JOIN, OUTER JOIN may need extra null handling if you want T-SQL). See the U-SQL Join documentation for more details.

    Also, if you want to do any kind of join or predicate, having the right data organization in your input rowsets is important. If you are operating on data extracted from files, you will cause a lot of data shuffles. If you however stored the data in a U-SQL table with the right distribution (e.g., a RANGE distribution), you may be able to avoid them and get better performance. Also, if you have a "reference table", you may want to cause a BROADCAST join where the smaller reference data gets shipped to each of the join nodes.

    Alternatively, you could try to write a custom combiner. A combiner is normally a black box for the optimizer, so it is mainly suggested if the comparison logic is based on complex .NET logic where you want to apply the join over multiple values. 

    And finally, you may not really want to do a join in the first place. E.g., if you want to expand your data points into a daily row data point, a CROSS APPLY may be more appropriate (there are examples here and on stack overflow).


    Michael Rys

    Thursday, April 18, 2019 9:27 PM
    Moderator
  • You can try to extract two tables in two variables and then do the join between these variable. One more thing to make sure that you convert your date in int while comparing like 23 Jan 2019 should be 20190123.

    This may be useful. let me know.

    Friday, April 26, 2019 10:44 AM