Get Table Storage data using Partition Range without knowing used Partitions

  • Question

  • I need to process all rows in my table storage using background workers on the Azure Cloud and would like to have each background worker process a range of partition keys.  What's a good way of setting that range without knowing which partitions have items?

    Currently our partition key is just an unsigned int (obtained from a hash of a large composite key), but we can't really, say, fire off a background worker for every 100 partitions and end up with millions of messages stuck in the message queue (where probably a lot of them would process nothing).

    Should I keep track of every partition key I've created?  Or something else?  Perhaps a smaller key?  Is there a general number of partitions I should aim for if I have maybe 100-200 million rows?

    Thanks.


    • Edited by ilovejerky Thursday, October 25, 2012 6:42 PM
    Thursday, October 25, 2012 6:41 PM

Answers

  • Since your partition keys are effectively random (being derived from a hash), you can split the set of partition keys into ranges such as (0 - 100000, 100001 - 200000, ...) so that background workers can work in parallel, each on its own exclusive range of partitions, without any duplicate work being done.
    • Marked as answer by Johnson - MSFT Thursday, December 6, 2012 11:34 AM
    Thursday, October 25, 2012 9:03 PM
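
A minimal sketch of the range split described in the answer, in Python. It assumes the hash is a 32-bit unsigned int and that keys are stored as zero-padded 10-digit strings. Neither is stated in the thread, but the padding matters because Table Storage compares PartitionKey values as strings, so unpadded numbers would not sort numerically. The names `split_key_space` and `range_filter` are hypothetical helpers, not part of any Azure SDK.

```python
from typing import Iterator, Tuple

KEY_SPACE = 2**32  # assumption: partition keys are hashes in [0, 2**32)
KEY_WIDTH = 10     # digits needed to print 2**32 - 1 (4294967295)


def split_key_space(num_ranges: int,
                    key_space: int = KEY_SPACE) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (low, high) bounds that cover [0, key_space) with no overlap."""
    if not 1 <= num_ranges <= key_space:
        raise ValueError("num_ranges must be between 1 and key_space")
    for i in range(num_ranges):
        # Integer arithmetic keeps the ranges contiguous and nearly equal in size.
        low = i * key_space // num_ranges
        high = (i + 1) * key_space // num_ranges - 1
        yield low, high


def range_filter(low: int, high: int, width: int = KEY_WIDTH) -> str:
    """Build an OData filter string selecting the partitions in [low, high]."""
    # Zero-padding makes string order match numeric order (assumed key format).
    return (f"PartitionKey ge '{low:0{width}d}' "
            f"and PartitionKey le '{high:0{width}d}'")


if __name__ == "__main__":
    # One queue message per range: each worker queries only its own slice.
    for low, high in split_key_space(4):
        print(range_filter(low, high))
```

With this approach the number of queue messages equals the number of ranges you choose rather than the number of partitions, and because hashed keys are spread roughly evenly, each range should hold a similar share of the rows. A worker whose range happens to be empty simply gets no results back.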

All replies

  • I don't suppose you could give a bit more of the bigger picture? We might be able to suggest a more manageable solution to the root challenge you're trying to address.

    Thursday, October 25, 2012 6:48 PM
  • Hm, I'm not sure if there's more to give.

    I guess in simpler words, I need to work on groups of data, where each group is denoted by the partition key.  I want to work on them in parallel in background workers, however I'm not sure how to tell the background workers which partition key to use since the partition key is just some random unsigned int.

    Thursday, October 25, 2012 7:37 PM