locked
Find duration of a gap in the values RRS feed

  • Question

  • Hi,

    I'm trying to spot for how long I had missing values in my data stream, so using the published pattern I hacked some queries but now I stuck at computing the aggregation duration of each gap in the stream. I guess I need to join the results on themselves to do that but.. no clue how to do that.

    Any help would be extremely welcome !

    Here is the current T-SQL code:

    Input data is as simple a timestamp, itemid and value sampled on a hourly basis

    -- Detect missing values in the hopping window
    WITH missingvalues AS
    (
        SELECT 
            System.Timestamp as WindowEnd,
            ItemId,
            TopOne() OVER (ORDER BY timestamp DESC) AS lastEvent
        FROM 
            flattenmsgs TIMESTAMP BY timestamp
        GROUP BY 
            HOPPINGWINDOW(hour, 24, 1), ItemId
    ),
    
    -- select only missing values
    gaps AS
    (
        SELECT WindowEnd as MissingTimeStamp, ItemId
        FROM 
            missingvalues    
        WHERE 
            WindowEnd <> lastEvent.timestamp
    ),
    
    -- find duration between each gap (aggregate gaps over one hour)
    gapsDuration AS
    (
     ???
    )
    
    -- debug
    SELECT *
    INTO output
    FROM gapsDuration

    The table "gaps"is then filled with a list of timestamp, id for each hour a sample is missing.

    ex:

    [  
       {  
          "missingtimestamp":"2014-03-30T11:00:00.0000000Z",
          "Itemid":"100_IND-0000000999_999_Resource-1.WHr"
       },
       {  
          "missingtimestamp":"2014-03-30T12:00:00.0000000Z",
          "Itemid":"100_IND-0000000999_999_Resource-1.WHr"
       },
       {  
          "missingtimestamp":"2014-03-30T13:00:00.0000000Z",
          "Itemid":"100_IND-0000000999_999_Resource-1.WHr"
       },
       {  
          "missingtimestamp":"2014-03-30T18:00:00.0000000Z",
          "Itemid":"100_IND-0000000999_999_Resource-1.WHr"
       },
       {  
          "missingtimestamp":"2014-03-30T19:00:00.0000000Z",
          "Itemid":"100_IND-0000000999_999_Resource-1.WHr"
       },
    ...
    ]

    You can see that there are 2 gaps, one of 3 hours then another of 2 hours.

    So the next query, table "gapsDuration", has to aggregate them to compute the overall duration of each gap (means returns : ItemID, starttime, endtime, duration)

    Thanks ! And even if there is a better solution to spot the gaps rather than filtering the "filled values"... tell me !

    matt




    • Edited by matthieu.b Wednesday, November 2, 2016 6:23 PM
    Wednesday, November 2, 2016 6:04 PM

All replies

  • Take a look at LAG function that can be used to retrieve previous element

    Here is the query that returns gap duration per ItemId. Please note that this query assumes that max duration of the gap is 1 day. If there is no previous event within a day, LAG will return NULL. You can adjust this value as needed.

    SELECT
        ItemId,
        timestamp,
        DATEDIFF(
            second,
            LAG(timestamp) OVER (PARTITION BY ItemId LIMIT DURATION(day, 1)),
            timestamp) AS gapDuration
    FROM flattenmsgs TIMESTAMP BY timestamp


    • Marked as answer by matthieu.b Thursday, November 3, 2016 1:56 PM
    • Unmarked as answer by matthieu.b Thursday, November 3, 2016 3:27 PM
    Wednesday, November 2, 2016 7:53 PM
  • Interesting !

    I tried, but I don't understand some results:

    [  
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "timestamp":"2014-03-30T06:00:00.0000000Z",
          "gapduration":null
       },
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "timestamp":"2014-03-30T07:00:00.0000000Z",
          "gapduration":1
       },
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "timestamp":"2014-03-30T08:00:00.0000000Z",
          "gapduration":1
       },
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "timestamp":"2014-03-30T09:00:00.0000000Z",
          "gapduration":1
       },
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "timestamp":"2014-03-30T10:00:00.0000000Z",
          "gapduration":1
       },
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "timestamp":"2014-03-30T16:00:00.0000000Z",
          "gapduration":6
       },
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "timestamp":"2014-03-30T18:00:00.0000000Z",
          "gapduration":1
       },
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "timestamp":"2014-03-30T19:00:00.0000000Z",
          "gapduration":1
       },
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "timestamp":"2014-03-30T20:00:00.0000000Z",
          "gapduration":1
       },

    As you can see,  it counts 6 hours between 16h and 10h (so 5 gaps) and 1 hour (so 0 gap) when there is not gap so normal duration but also 1 hour between 16h and 18h instead of 2 hours. (so 1 gap).

    do you see why ?

    Wednesday, November 2, 2016 8:59 PM
  • Can you share your query? I seem to be getting slightly different results. Note that the query returns DATEDIFF between previous event and current event. Based on that you can calculate how many "gaps" you have.

    Thursday, November 3, 2016 5:29 AM
  • Hi,

    I found why, in fact to simplify my post, I "forgot" the WHERE clause in the query which was filtering the invalid data (WHERE state = 'Valid') and at 17:00:00 there was a row which has state = 'Invalid' so I guess the LAG cannot manage the filtering afterwards so to make it running I added a prior filter :

    -- Filter valid cumulated values
    WITH filtered AS
    (
        SELECT
            itemid, timestamp as t
        FROM 
            flattenmsgs
        WHERE
            state = 'Valid'
    ), 
    
    -- find duration between each gap (aggregate gaps over one hour)
    duration AS
    (
        SELECT
            itemid,
            t, 
            DATEDIFF(
                hour,
                LAG(t) OVER (PARTITION BY itemid LIMIT DURATION(day, 1)), t)-1 AS gapDuration
        FROM 
            filtered     
    ),
    
    -- keep only not null duration
    gapsDuration AS
    (
        SELECT
            itemid,
            DATEADD(hour, -1*gapDuration, t) as StartTime,        
            DATEADD(hour, -1, t) as EndTime,
            gapDuration as Duration
        FROM 
                duration
        WHERE
            gapDuration > 0
    )

    So now I have the expected data, as you shown:

    [  
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "starttime":"2014-03-30T11:00:00.0000000Z",
          "endtime":"2014-03-30T15:00:00.0000000Z",
          "duration":5
       },
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "starttime":"2014-03-30T17:00:00.0000000Z",
          "endtime":"2014-03-30T17:00:00.0000000Z",
          "duration":1
       },
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "starttime":"2014-03-30T21:00:00.0000000Z",
          "endtime":"2014-03-31T01:00:00.0000000Z",
          "duration":5
       },
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "starttime":"2014-03-31T07:00:00.0000000Z",
          "endtime":"2014-03-31T11:00:00.0000000Z",
          "duration":5
       },
       {  
          "itemid":"100_IND-0000000999_999_Resource-1.WHr",
          "starttime":"2014-03-31T17:00:00.0000000Z",
          "endtime":"2014-03-31T21:00:00.0000000Z",
          "duration":5
       }
    ]

    • Edited by matthieu.b Thursday, November 3, 2016 1:30 PM
    Thursday, November 3, 2016 1:20 PM
  • Then.... next step is for each gap, from the oldest to the newest, to retrieve the 12 past values to perform interpolation and fill the gap.

    I tried something like this but no results are shown (not even 0.... white screen)

    -- Then for each gap start time, extract the 12 previous values
    -- How to select the first gap ? Something like MIN(gapsDuration.startTime) ?
    PreviousValues AS
    (
        SELECT 
            flattenmsgs.ItemId,
            flattenmsgs.value,
            flattenmsgs.timestamp
        FROM
            flattenmsgs
            INNER JOIN gapsDuration
            ON DATEDIFF(hour, flattenmsgs, gapsDuration) BETWEEN 0 AND 13  
            AND flattenmsgs.timestamp < gapsDuration.startTime
            AND flattenmsgs.timestamp > DATEADD(hour, -13, gapsDuration.startTime)
    )

    Or said differently, my question is "how to loop over each gap, extract the 12 hours previous data and accordingly call my AzureML web service to do the inference for the missing hours" ?

    What is the strategy ? Sending each gap on another SA instance to perform inference one by one ? Maybe better than trying to fix my code... 




    • Edited by matthieu.b Thursday, November 3, 2016 3:55 PM
    Thursday, November 3, 2016 3:26 PM