Track message progress in storage queue

  • Question

  • I have a Windows Azure storage queue that drives processing requested by our customers (delivery of the generated content comprises fulfillment of an order). The whole queue implementation is working fine, but now I'd like to add more visibility to the process and, at the very least, be able to guarantee that a message is still queued for processing. Can anyone tell me whether either of these two capabilities is possible with Azure Storage queues?

    1) Verify whether a given message is or is not still in the queue. This is the baseline minimum I'd like to see, if possible. I already use a storage table to track when a work item was enqueued, and I update that record with the dequeue time and the processing completion time as each event occurs. But I would like to be able to take a message's ID and verify that the message hasn't fallen through the cracks, either in my code (e.g. a failure path prevented some aspect of the process from working as designed) or in Azure (say, an outage causing queue data loss). Does such a feature exist?

    2) Determine a message's position in the queue (something like an ApproximateMessagesAhead for a message, to correspond to ApproximateMessageCount for the queue itself). This would give me a much more satisfying way to report estimated progress and completion time.

    For #2, I realize that I could perhaps approximate this with the message log table I am using, but I would prefer a solution based on the actual state of the queue. The possibilities get pretty complex: suppose each message to be processed requires 15 minutes of CPU time, but at some point processing starts failing and a big block of queue messages returns to visibility. Now one of those messages might have gone from position 5, to 3, to 1, then been dequeued, and suddenly find itself back at position 18. In such a circumstance you're already looking at a failure scenario, so you can't rely on the processing node having updated the storage table with the proper state info.

    Any insight and/or recommendations would be welcome. Thanks!

    Friday, October 10, 2014 5:06 PM

All replies

  • I have what sounds like a similar system, mine is for managing outbound email dispatching/notifications.

    I couldn't use an Azure queue because the longest lifespan of a queue item was 7 days, and I sometimes queue notifications for weeks in advance.  I also have to change or delete the notifications periodically, such as when an email changes.  And I can't have two emails sent out for the same thing; I would be hunted down by my customers if that occurred.

    I needed to monitor the queue progress to know if my service provider was having a problem.  That meant either a new table alongside a queue to track delivery stats, or some inter-role communication.  Development time was a concern (it had to be done two days before yesterday) and it had to work the first time.  I was going to use AutoScale if needed, so there was no way I was going to build an inter-role communication system to report provider failures.  Oh, and I had to audit the good/bad deliveries, which meant a table anyway.

    So, here's what I did:

    In my service, I use a storage table as a queue, with a PartitionKey built from the date/time, like .ToString("yyyyMMddHHmmss"), plus some uniqueness in case of duplicate UTC times.  A queued item's PKey looks like "Q20140101150000[uid]"

    Each time I query the table for work, I look for the top 10 items older than "Q" + UTCNow in the above format.

    This duplicates the operation of a queue, because I have the opportunity to make an item hidden by changing its PartitionKey to a later time and re-adding it to the table.  Also, if nothing 'de-queues' the item, the item persists forever.  I could schedule weeks, months or years in advance without worrying about queue lifespan limits!
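    A minimal sketch of that key scheme and the "top 10 items older than now" query, using a plain in-memory dict in place of the storage table (the helper names are my own, not SDK calls):

```python
import uuid
from datetime import datetime, timezone

def queue_key(due_utc: datetime) -> str:
    # "Q" + yyyyMMddHHmmss + a unique suffix, so duplicate UTC times still
    # produce distinct keys.
    return "Q" + due_utc.strftime("%Y%m%d%H%M%S") + uuid.uuid4().hex

def due_items(table: dict, now_utc: datetime, top: int = 10):
    # Emulates "top 10 items older than 'Q' + UtcNow": because the timestamp
    # portion is fixed-width, lexicographic order on the key matches
    # chronological order, so a simple key-range comparison works.
    cutoff = "Q" + now_utc.strftime("%Y%m%d%H%M%S")
    return sorted(k for k in table if k.startswith("Q") and k < cutoff)[:top]
```

    In the real table the same comparison would be expressed as a key-range filter in the query, so the service only returns the due items.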

    I scan the table periodically, every couple of seconds if I don't have work, so as not to pound the table and create millions of table accesses that I'd be paying for.  Of course, if I find a queued item, I immediately run another table scan after doing the work, to grab and dequeue as many items as possible while there is still work to do.

    To prevent multiple roles from grabbing the same item: when Azure instance '0' reads the table, it immediately sets an "in progress" flag property on the work item, indicating it's being worked on, and tries to save it back to the table.  If the ETag has changed and the update fails, another Azure role instance "owns" the job, so I ignore the item and grab the next one to work on.  If the table update succeeds and the flag wasn't already set, my instance owns the work for the item and can process it.

    Once the work represented by the item is done, I change the PartitionKey so it is no longer in my "queue", replacing the 'Q' prefix with "A" (for audit), and add the necessary delivery info to the item so it can be used to audit the delivery that just occurred.  If sending the email failed, I reset the "in progress" flag, optionally set the PartitionKey to a new queued time, and update the item in the table.
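    That claim-by-flag-and-ETag sequence can be sketched with an in-memory stand-in for the table (all class and method names here are mine, not Azure SDK calls; the real thing would be a conditional entity update with If-Match):

```python
class PreconditionFailed(Exception):
    """Raised when a conditional update loses the race."""

class Table:
    def __init__(self):
        self.rows = {}  # key -> (etag, entity)

    def insert(self, key, entity):
        self.rows[key] = (0, dict(entity))

    def get(self, key):
        etag, entity = self.rows[key]
        return etag, dict(entity)

    def update(self, key, entity, if_match):
        # Conditional update: fails if someone else wrote since we read.
        etag, _ = self.rows[key]
        if etag != if_match:
            raise PreconditionFailed()
        self.rows[key] = (etag + 1, dict(entity))

def try_claim(table: Table, key) -> bool:
    etag, item = table.get(key)
    if item.get("in_progress"):
        return False              # another instance already owns it
    item["in_progress"] = True
    try:
        table.update(key, item, if_match=etag)
        return True               # we own the work now
    except PreconditionFailed:
        return False              # lost the race to another instance
```

    The key point is that the flag write and the ownership check are one atomic conditional update, so two instances can never both believe they own the item.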

    I also then have unlimited management ops I can perform while items are queued:

    • One scan for Q+UtcNow or older shows me backlogged items.
    • One scan for A+UtcNow (i.e. keys < 'Q') shows delivery dates and audited items.
    • It's fast, and it duplicates (and extends) the queue functionality.
    • It's easy to change items mid-queue.
    • It's simple to find the location of any item by scanning the table.
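    That last point also covers the original question #2: with fixed-width "Q"+timestamp keys, an item's position falls out of one scan. A rough sketch (my own helper, assuming the key format above):

```python
def queue_position(keys, key: str) -> int:
    # Position = number of queued items sorted ahead of this one.
    # Lexicographic order on the fixed-width "Q"+timestamp keys matches
    # chronological (i.e. dequeue) order.
    queued = sorted(k for k in keys if k.startswith("Q"))
    return queued.index(key)
```

    Unlike a storage queue, this is an exact position at the moment of the scan, not an approximation.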

    Anyway, I don't know if it suits your needs, but it's another way to do it.  So far, it's worked perfectly for me.

    Darin R.

    Friday, October 10, 2014 6:15 PM
  • Hm, this sounds like it might be a solution. I like the way you avoid race conditions when dequeuing records. It doesn't even sound like too big a hassle to implement, though it's a bit of a bitter pill to swallow given that queues basically work for me aside from this one, seemingly natural, additional requirement. Overall, a very good idea.

    My concerns mostly revolve around efficiency and ease of extension. You're looking at creating and querying against a great number of partition keys, possibly coming in at a 1:1 partition/row ratio, right? I believe that Azure may split data across storage nodes by partition, so you might end up with data scattered all over the place that needs to be read each time you query the 'queue'. But perhaps this is mitigated by the Q/A distinction in the PartitionKey: if the number of Q records is manageably small, reading potentially all of them to find not-in-process items may not be an issue. Assuming you only need to read the A records by exact key, that is. I could see that working.

    Do you have a solution for the automatic return to visibility after a set time? Thinking out loud here: seems like the 'in process flag' could include a 'timeout' date/time in UTC, and anything looking for the flag could ignore it if the timeout date/time has elapsed.

    Hm. I'll need to think about it a bit more. Again, it seems like a lot of extra work (and a lot less efficient in the normal case) compared to a standard storage queue. It would be simpler to just peek the entire queue, though I'm a little fuzzy as to whether THAT is possible either.

    Friday, October 10, 2014 7:21 PM
  • Actually, one more question: you mentioned multiple instances, but do you have a solution for automatic scale-out? We have tested and plan to use queue-based autoscale for our Cloud Service, which would apparently go away in this case.

    Of course, I could do CPU-based scaling, but each work item (video compositing) effectively pegs the CPU of the instances I have for many minutes at a time. I'd probably end up spinning up extra VMs even with nothing in the queue.

    Friday, October 10, 2014 7:47 PM
  • A note on PartitionKey and how this could be made better: if you set the PK to just "Q" and use the RowKey for the UTC queue time plus some uniqueness, that would also distinguish the Queue from the Archive, and it would make table scans faster and more "partitionable."
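    Under that revision the backlog check becomes a single-partition RowKey range query. A rough in-memory illustration (my own helper names, not Table service filter syntax):

```python
def due_rowkeys(entities: dict, now_str: str, top: int = 10):
    # entities maps (PartitionKey, RowKey) -> entity.
    # With the PK fixed at "Q", the whole queue lives in one partition and
    # the scan reduces to: PartitionKey eq 'Q' and RowKey lt <now>.
    hits = sorted(rk for (pk, rk) in entities if pk == "Q" and rk < now_str)
    return hits[:top]
```

    The trade-off is the usual one: a single "Q" partition gives cheap ordered range scans, at the cost of concentrating all queue traffic on one partition.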

    I actually mis-wrote my original a bit.  I use UTC yyyyMMdd as the PK and don't use the time at all, so there are fewer items per partition than I may have indicated in my previous post.  I do this because events for me are date-based, and the time isn't an issue so long as delivery is made.

    Also, in terms of auto-visibility: if processing fails, the item will come up in the very next table scan (since its UTC is behind), so you always have the ability to check flags or the date/time.

    pseudo-steps to incorporate visibility:

    1. Scan the table for Q item(s).  If none, sleep and go to 1.
    2. If !item.IsBeingProcessedFlag, or item.UTC (parsed from the PK, an item DateTime property, the Timestamp, etc.) is older than the max-work-length threshold (the visibility timeout), go to 4.
    3. Go to 1.
    4. Mark IsBeingProcessed and save to update the ETag.  If the save fails, go to 1.
    5. Do the work.
    6. Work done?  Change the PartitionKey: delete the item and re-save it with a new "A" PartitionKey, then go to 1.
    7. Work failed?  Depends on the work: mark the item as not being processed, optionally set a new PK for a later UTC (visibility), save, and go to 1.
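    Those steps might be sketched as follows, using an in-memory dict as a stand-in for the table (all names are mine; the ETag check from step 4 is elided, and a real version would use conditional updates and handle transient errors):

```python
from datetime import datetime, timedelta, timezone

VISIBILITY = timedelta(minutes=15)  # assumed max work length

def fmt(dt: datetime) -> str:
    return dt.strftime("%Y%m%d%H%M%S")

def run_once(table: dict, now: datetime, do_work) -> bool:
    """One pass of the worker loop. table maps key -> entity dict.
    Returns True if an item was processed and archived."""
    cutoff = "Q" + fmt(now)
    for key in sorted(k for k in table if k.startswith("Q") and k < cutoff):  # step 1
        item = table[key]
        started = item.get("in_progress_since")
        if started is not None and now - started < VISIBILITY:                # steps 2-3
            continue  # someone owns it and hasn't timed out yet
        item["in_progress_since"] = now                                       # step 4 (ETag elided)
        try:
            do_work(item)                                                     # step 5
        except Exception:
            item["in_progress_since"] = None                                  # step 7: requeue
            continue
        del table[key]                                                        # step 6: Q -> A
        table["A" + key[1:]] = item
        return True
    return False
```

    The visibility timeout falls out naturally: a crashed worker leaves `in_progress_since` set, and once that timestamp is older than the threshold the next scan is free to reclaim the item.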

    Darin R.

    Friday, October 10, 2014 7:48 PM
  • AutoScale stumped me.  I'd like to use it, but for me queued items reflect future things and don't necessarily mean work that can be done right now.

    This looked intriguing and something I might think about in the future: http://blogs.ugidotnet.org/corrado/archive/2011/02/09/how-to-programmatically-change-the-number-of-azure-role-instances.aspx

    Could you check the queue length and workload in the component that queues the work, and scale programmatically from there, instead of reacting to the queue in the queue-reader role?

    Darin R.

    Friday, October 10, 2014 8:44 PM
  • Ah, I see. That's a difference between our scenarios, then; I want to process everything in the queue as soon as resources are available, so the standard queue autoscale implementation at least addresses my situation.

    I'll file that link away for future reference, thanks. It's good to know I have the ability to scale things automatically myself if I need to. I guess if I were going to do this myself using a table (as opposed to queue) implementation, I would probably have a separate worker role or webjob monitoring the number of work items periodically, to reduce complexity and control the frequency the check is performed.

    (I guess one upside of this is that I could scale the way I want instead of the way autoscale wants me to. I haven't played with it enough to be sure, but the step-up and step-down didn't produce very good scaling characteristics in my limited testing so far.)

    More to think about, I guess. Thanks for your help so far. I'd still be interested to find out if anyone has a queue-based workaround, but this gives me another option.

    Friday, October 10, 2014 9:30 PM