Restarting job with a tumbling window, delay in output

  • Question

  • Hi there, I'm pretty new to SA jobs, so the question and assumptions may be totally wrong!

    We are facing a problem with a job that seems to have a hard time processing "old data" after the job has been restarted.

    The job basically counts rows on a grouping expression with a tumbling window of 2 hours.
    After restarting the job with a start time a few days in the past, we expected it to immediately process and emit the data that falls before the current tumbling window. What we see instead is that output is only emitted when a tumbling window expires, and the job starts emitting from the data where it was last stopped.
    At the current ingress rate it will take an unacceptably long time before the job "reaches" the current tumbling window.
    Shouldn't the job "burst" out results immediately for data that belongs to past tumbling windows?
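
    For reference, the core of what we are doing is just a windowed count, roughly like the minimal sketch below ([input] and [output] are placeholder names; the real query follows further down):

     -- Minimal sketch only; [input] and [output] are placeholders for our Event Hubs.
     -- A 2-hour tumbling window emits one result row per group only when the window closes;
     -- System.Timestamp on the output is the window end time.
     SELECT System.Timestamp AS WindowEnd, COUNT(*) AS val
     INTO [output]
     FROM [input] TIMESTAMP BY EventEnqueuedUTCTime
     GROUP BY TumblingWindow(hour, 2)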

    Any ideas on what could be wrong, or possible workarounds?

    Below is our SA query; some highlights:

    • Both input and output are Event Hubs.
    • The input hub has 4 partitions.
    • The output hub has 2 partitions.
    • At the moment we have the SUs set to 18.
    • The query uses a UDF.

    Best regards Niclas

    Our SA job Query is:

    WITH events as (
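     -- Raw input: events from the [ads] Event Hub, timestamped by enqueue time and partition-aligned.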
     SELECT * FROM [ads] TIMESTAMP BY EventEnqueuedUTCTime PARTITION BY PartitionId
    ),
    
    ListEvents as (
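     -- Flatten the nested events array: CROSS APPLY GetArrayElements yields one row per array element, keeping only elements with domain = 'ads'.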
     SELECT *, events.PartitionId FROM events PARTITION BY PartitionId
     CROSS APPLY GetArrayElements(events.events) AS e
     WHERE e.ArrayValue.domain = 'ads'
    ),
    
    partitioned AS (
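    -- Per-partition counts over 2-hour tumbling windows; udf.getDimensionRecord presumably packs the grouping fields and the date parts of the window end into a single dimensions record.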
    SELECT
        System.TimeStamp as Timestamp,
        e.ArrayValue.domain,
        e.ArrayValue.subdomain,
        CONCAT(e.ArrayValue.object.type,'-',e.ArrayValue.verb) as metric,
        udf.getDimensionRecord(e.ArrayValue.app.channelid,
                             e.ArrayValue.app.remarketing,
                             e.ArrayValue.object.officeid,
                             e.ArrayValue.object.agencyid,
                             e.ArrayValue.app.usedposition,
                             e.ArrayValue.object.id,
                             e.ArrayValue.object.agentid,
                             e.ArrayValue.object.advertisingtype,
                            DATEPART("yyyy", System.TimeStamp),
                            DATEPART("qq",   System.TimeStamp),
                            DATEPART("mm",   System.TimeStamp),
                            DATEPART("dd",   System.TimeStamp),
                            DATEPART("wk",   System.TimeStamp),
                            DATEPART("dw",   System.TimeStamp),
                            DATEPART("hh",   System.TimeStamp)) AS dimensions,
        count(*) as val
    FROM
    ListEvents PARTITION BY PartitionId
        GROUP BY
        e.ArrayValue.object.id,
        e.ArrayValue.domain,
        e.ArrayValue.subdomain,
        CONCAT(e.ArrayValue.object.type, '-', e.ArrayValue.verb),
        e.ArrayValue.app.channelid,
        e.ArrayValue.app.remarketing,
        e.ArrayValue.object.officeid,
        e.ArrayValue.object.agencyid,
        e.ArrayValue.object.agentid,
        e.ArrayValue.object.advertisingtype,
        e.ArrayValue.app.usedposition,
        PartitionId,
        TumblingWindow(Hour, 2)
    )
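    -- The final SELECT below merges the per-partition results into overall totals per group and 2-hour window.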
    
    
    SELECT
        Timestamp,
        domain,
        subdomain,
        metric,
        dimensions,
        SUM(val) as val
    INTO
        [metric]
    FROM
        partitioned
    GROUP BY
        Timestamp,
        domain,
        subdomain,
        metric,
        dimensions,
        TumblingWindow(Hour, 2)


    Tuesday, February 27, 2018 4:25 PM