scale question, partition a step

  • Question

  • Hi All,

    It's me again... :)

    I have a question regarding scaling jobs. The documentation here states that the input source must be partitioned. What does that mean? I have an Event Hub with 16 partitions, and to spread my events across all partitions I'm setting the PartitionKey to a value between 0 and 15 on all published events. Is that the correct approach? Is that all I need to do to get a "partitioned input source"?
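
    For illustration only (this is not the actual publisher code, and the names are made up), the key-assignment scheme described above can be sketched in Python: cycle the PartitionKey through "0".."15" so events spread evenly across the keys. One caveat worth hedging: Event Hubs hashes the partition key to pick a physical partition, so 16 distinct keys spread load but are not guaranteed to map one-to-one onto the 16 partitions.

```python
# Hypothetical sketch of round-robin partition-key assignment.
NUM_PARTITIONS = 16

def partition_key_for(sequence_number):
    """Return a partition key in "0".."15" for the nth published event."""
    return str(sequence_number % NUM_PARTITIONS)

# Each published event would carry PartitionKey = partition_key_for(n).
# With a round-robin counter, every key value is used equally often.
keys = [partition_key_for(n) for n in range(64)]
```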

    When I get it up and running as described above, I get errors in the operation log when I try to use partitioned queries with the "Partition By PartitionId" statement. The error indicates that one of the partitions is currently empty. The known issues and limitations document here states:

    "When running a partitioned query with a non-partitioned sub-query as the second step, if one of the Event Hub partitions on the input is completely empty, the query will not generate results. An error for this will be reflected in the Operation Logs for the job. Please make sure all Event Hub partitions have incoming events at all times to avoid this problem."

    Does this mean that if I can't guarantee incoming events in all partitions at all times, I shouldn't use partitioned queries during preview?

    BR, Max 

    Thursday, November 6, 2014 2:53 PM

Answers

  • I don't see anything egregious.  The way I would approach "debugging" the situation is:

    - Replace minutes with seconds to see results sooner

    - Does the query work if you remove "partition by"? 

    - Does the first step work without the second?

    Thanks,

    --Lev

    Friday, November 7, 2014 6:00 PM

All replies

  • Hi Max!

    We do have some temporary limitations in this area, but what you are doing (spreading events more or less evenly to all partitions) is correct.

    Is your job actually not producing output, or are you just seeing errors in the log?  A few of our log messages (like "Partition is empty" or "partition is no longer empty") are incorrectly designated as failures even though they are warnings.  They are safe to ignore if your job is actually working well.

    Thanks,

    --Lev

    Thursday, November 6, 2014 4:16 PM
  • Ahh... I just saw the error in the operation log and assumed that it didn't work. I'll get it up and running again to see if the output is persisted or not.

    BUT if I understand you correctly, all I have to do to spread events across all partitions is set PartitionKey to a value between 0 and (number of partitions in the event hub - 1). Correct? And by doing so I have a "partitioned input source"?

    BR, Max.


    Thursday, November 6, 2014 8:09 PM
  • Correct.
    Thursday, November 6, 2014 9:08 PM
  • Hi,

    I've tried it now and it doesn't seem to work; the output isn't persisted to the database. I keep getting these errors:

    "A partition is not receiving events. Please check that all partitions have, or are receiving data."

    "DryShardError errors are occuring too rapidly. They are being suppressed for the next few minutes"

    As I said earlier I distribute my events across my 16 partitions in the event hub by setting PartitionKey on my events to a value between 0 and 15.

    My partitioned query:

    CREATE TABLE input(
        PartnerId BIGINT,
        ClientId BIGINT,
        AppId NVARCHAR(max),
        GeoProfile BIGINT,
        PartitionKey NVARCHAR(max)
    );
    With Step1 AS
    (
        SELECT
            PartnerId,
            AppId,
            Count(*) AS Count,
            Sum(GeoProfile) AS CountGeoProfiled 
        FROM
            input Partition By PartitionId
        GROUP BY
            TumblingWindow(minute, 5), PartnerId, AppId
    )
    SELECT
        DateAdd(minute,-5,System.TimeStamp) AS Date, 
        PartnerId,
        AppId,
        SUM(Count) AS Count,
        SUM(CountGeoProfiled) AS CountGeoProfiled,
        (SUM(Count) - SUM(CountGeoProfiled)) AS CountGeoNoneProfiled    
    FROM
        Step1
    GROUP BY
        TumblingWindow(minute, 5), PartnerId, AppId
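
    For what it's worth, here is an illustrative Python model (not Stream Analytics itself, and the sample numbers are invented) of what the two steps above compute: step 1 aggregates Count(*) and Sum(GeoProfile) within each Event Hub partition, and step 2 combines those per-partition partials into global totals per (PartnerId, AppId).

```python
from collections import defaultdict

# Sample events: (PartitionId, PartnerId, AppId, GeoProfile), GeoProfile in {0, 1}.
events = [
    (0, 1, "app", 1),
    (0, 1, "app", 0),
    (1, 1, "app", 1),
    (15, 2, "app", 0),
]

# Step 1: per-partition Count(*) and Sum(GeoProfile), keyed by (PartitionId, PartnerId, AppId).
step1 = defaultdict(lambda: [0, 0])
for pid, partner, app, geo in events:
    step1[(pid, partner, app)][0] += 1
    step1[(pid, partner, app)][1] += geo

# Step 2: drop the partition dimension and sum the partials globally.
final = defaultdict(lambda: [0, 0])
for (pid, partner, app), (cnt, geo) in step1.items():
    final[(partner, app)][0] += cnt
    final[(partner, app)][1] += geo

# (Count, CountGeoProfiled, CountGeoNoneProfiled) per (PartnerId, AppId).
rows = {k: (c, g, c - g) for k, (c, g) in final.items()}
```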

    Any ideas?

    BR, Max




    • Edited by MaxErixon Friday, November 7, 2014 4:07 PM
    Friday, November 7, 2014 3:58 PM
  • Max,

    Did you manage to get this to work?

    Thanks,

    --Lev

    Thursday, November 13, 2014 6:33 PM
  • I have the same issue, and the output is not being persisted.

    The inner query in the With clause works correctly when I run it on its own.

    WITH Step1 AS (
        SELECT DeviceId, System.TimeStamp AS WinEndTime,
            Avg(Temperature) AS [Average], Sum(Temperature) AS [Sum], Count(Temperature) AS [Count],
            Min(Temperature) AS [Min], Max(Temperature) AS [Max]
        FROM EventHubInput Partition By PartitionId
        GROUP BY DeviceId, TumblingWindow(second, 5)
    )
    SELECT DeviceId, System.TimeStamp AS WinEndTime,
        Sum([Sum]) / Sum([Count]) AS [Average], Sum([Sum]) AS [Sum], Sum([Count]) AS [Count],
        Min([Min]) AS [Min], Max([Max]) AS [Max]
    FROM Step1
    GROUP BY DeviceId, TumblingWindow(second, 5)


    ITWorx Research

    Monday, December 8, 2014 12:34 PM
  • Does the query work if you remove the 'Partition By' clause? 

    How are you sending input to Event Hub?  Are you sending it to all partitions?  Are you sending it "live" when the job is running, or did you send some in the past?

    Monday, December 8, 2014 4:20 PM