none
Fill in sparse data?

    Question

  • I have a dataset that looks like this:

    Date Value
    2016-12-01 5
    2016-12-08 6
    2016-12-31 7

    In this example, I am missing 12/2 - 12/7 and 12/9 - 12/30. What I'd like to do is transform it so that the missing dates are filled in and have the value of the previous existing, so I'd end up with:

    Date Value
    2016-12-01 5
    2016-12-02 5
    2016-12-03 5
    ... 
    2016-12-08 6
    2016-12-09 6
    2016-12-10 6
    ...
    2016-12-31 7

    Any ideas on how to tackle this with straight U-SQL?

    The script will be running in a data factory, so I know I can write some code and do something with azure batch if needed, but I'd prefer to stay with U-SQL if possible.

    Thanks,

    Matt

    Thursday, June 2, 2016 9:28 PM

All replies

  • For exploding the dates, you can use a LINQ expression. Then I essentially did a cross join and some filtering to get your desired results (the code below assumes the value is in ascending order, and that can be tweaked if the assumption is not right).

    @data =

        EXTRACT date DateTime,

                value int

        FROM @"/rukmanig/dates.txt"

        USING Extractors.Tsv();

     

    @minmax =

        SELECT MIN<DateTime>(date).Value AS min_date,

               MAX<DateTime>(date).Value AS max_date

        FROM @data;

     

    @alldates =

        SELECT new SQL.ARRAY<DateTime>(System.Linq.Enumerable.Range(0, (int)(max_date - min_date).TotalDays + 1).

                                                                    Select(i => new DateTime(min_date.AddDays(i).Ticks)).ToList()) AS all_dates

        FROM @minmax;

     

    @alldates =

        SELECT date

        FROM @alldates

             CROSS APPLY

                 EXPLODE(all_dates) AS T(date);

     

    @output =

        SELECT @alldates.date,

                     @data.date AS ref_date,

               @data.value

        FROM @alldates

             CROSS JOIN

                 @data;

     

    @output =

        SELECT date.ToShortDateString() AS date,

               MAX(value) AS value

        FROM @output

        WHERE date >= ref_date

           GROUP BY date;

     

     

    OUTPUT @output

    TO "/rukmanig/outputdates.txt"

    USING Outputters.Tsv();

    Thursday, June 2, 2016 10:34 PM