Reading and conditionally updating N rows, where N > 100,000 for DNA Sequence processing
-
12 September 2012 3:05
I have a proof-of-concept application that uses Azure tables to associate DNA sequences with "something".
Table 1 is the master table. It uniquely lists every DNA sequence. The PK is a load balanced hash of the RK. The RK is the unique encoded value of the DNA sequence.
Additional tables are created per subject. Each subject has a list of N DNA sequences, each of which has one reference in the Master table, where N is > 100,000. The PK is a load balanced hash, and the RK is the unique value of the DNA sequence. Assume that the quantity of RKs here is many orders of magnitude smaller than in the Master table.
It is possible for many tables to reference the same DNA sequence, but in this case only one entry will be present in the Master table.
My Azure dilemma:
I need to lock the reference in the Master table as I work with the data. I need to handle timeouts, and prevent other threads from overwriting my data while one C# thread is working with the information. Other threads need to realise that the record is locked, move on to other unlocked records, and do the work there.
Ideally I'd like to get some progress report of how my computation is going, and have the option to cancel the process (and unwind the locks).
What is the best approach for this?
I'm looking at these code snippets for inspiration:
http://stackoverflow.com/q/4535740/328397
- Edited by ChrisLaMont 12 September 2012 3:33
- Edited by ChrisLaMont 12 September 2012 3:35
- Edited by ChrisLaMont 12 September 2012 12:24
All Replies
-
12 September 2012 21:18
I'd use storage queues in addition, and lease each queue message for as long as a thread is working on it. That way you also get a timeout/retry mechanism, and since a queue message can be updated in both its visibility timeout and its content, you can store progress information in the message itself. To get going you just need an instance that creates the queue messages from your master table data, which shouldn't be a big deal.
Just a high-level idea, but it may help as a starting point.
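To make the idea concrete, here is a minimal in-memory sketch of the pattern in Python. It is not the Azure storage API: `InMemoryQueue`, `Message`, and every method name are hypothetical stand-ins, and the clock is injectable only so the behaviour can be checked without waiting. It shows a message becoming invisible while a worker holds it, the worker writing progress into the content and extending the timeout, and the message reappearing with its last saved progress if the worker stops updating it.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Message:
    content: str
    visible_at: float = 0.0             # message is claimable once the clock reaches this
    pop_receipt: Optional[str] = None   # token proving who currently holds the message


class InMemoryQueue:
    """Hypothetical stand-in for a queue with visibility timeouts."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._messages: List[Message] = []

    def put(self, content: str) -> None:
        self._messages.append(Message(content))

    def get(self, visibility_timeout: float) -> Optional[Tuple[Message, str]]:
        """Claim the first visible message and hide it for visibility_timeout seconds."""
        now = self._clock()
        for msg in self._messages:
            if msg.visible_at <= now:
                msg.visible_at = now + visibility_timeout
                msg.pop_receipt = uuid.uuid4().hex
                return msg, msg.pop_receipt
        return None  # everything is currently held by some worker

    def _check(self, msg: Message, pop_receipt: str) -> None:
        if msg.pop_receipt != pop_receipt:
            raise ValueError("stale pop receipt: another worker holds this message")

    def update(self, msg: Message, pop_receipt: str, content: str,
               visibility_timeout: float) -> str:
        """Save progress in the content, extend the timeout, return a new receipt."""
        self._check(msg, pop_receipt)
        msg.content = content
        msg.visible_at = self._clock() + visibility_timeout
        msg.pop_receipt = uuid.uuid4().hex
        return msg.pop_receipt

    def delete(self, msg: Message, pop_receipt: str) -> None:
        """Remove a finished message; only the current holder may do so."""
        self._check(msg, pop_receipt)
        self._messages.remove(msg)
```

A worker would loop on `get`, call `update` after each batch of rows (for example with content such as `"seq-A|progress=50"`), and call `delete` when done. A cancel request could be handled by the worker calling `update` with a timeout of 0 so the message becomes visible to others immediately.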
- Marked as Answer by Jiang Yun (Moderator) 20 September 2012 6:34
-
14 September 2012 12:32
Hi,
I'd recommend using Azure Blob Leases to manage concurrency.
Here's an article with more details about it: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/06/12/new-blob-lease-features-infinite-leases-smaller-lease-times-and-more.aspx
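As a rough illustration of the lease pattern (not the blob API itself), the sketch below keeps one lease per record key in memory. `LeaseTable`, `claim_first_unlocked`, and the method names are hypothetical, and the rule that an expired lease cannot be renewed is a simplification made for this sketch. A worker that finds a record leased simply moves on to the next one, and a lease that is never renewed or released expires on its own.

```python
import time
import uuid
from typing import Callable, Dict, Iterable, Optional, Tuple


class LeaseTable:
    """Hypothetical in-memory stand-in: one time-limited lease per record key."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._leases: Dict[str, Tuple[str, float]] = {}  # key -> (lease_id, expires_at)

    def try_acquire(self, key: str, duration: float) -> Optional[str]:
        """Return a lease id, or None if someone else holds an unexpired lease."""
        now = self._clock()
        held = self._leases.get(key)
        if held is not None and held[1] > now:
            return None
        lease_id = uuid.uuid4().hex
        self._leases[key] = (lease_id, now + duration)
        return lease_id

    def renew(self, key: str, lease_id: str, duration: float) -> bool:
        """Extend a lease we still hold; fails if it expired or belongs to another worker."""
        now = self._clock()
        held = self._leases.get(key)
        if held is None or held[0] != lease_id or held[1] <= now:
            return False
        self._leases[key] = (lease_id, now + duration)
        return True

    def release(self, key: str, lease_id: str) -> bool:
        """Give the lease back early, e.g. when the work finishes or is cancelled."""
        held = self._leases.get(key)
        if held is not None and held[0] == lease_id:
            del self._leases[key]
            return True
        return False


def claim_first_unlocked(table: LeaseTable, keys: Iterable[str],
                         duration: float) -> Optional[Tuple[str, str]]:
    """Skip records that are locked and lease the first free one."""
    for key in keys:
        lease_id = table.try_acquire(key, duration)
        if lease_id is not None:
            return key, lease_id
    return None
```

The worker would call `renew` periodically while it processes the record and `release` when it finishes. Unwinding after a cancel is then just a `release` per held key.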
Best Regards,
Carlos Sardo
- Proposed as Answer by Carlos Sardo 14 September 2012 12:32
- Marked as Answer by Jiang Yun (Moderator) 20 September 2012 6:34
-
19 September 2012 0:27
Hi Chris - I agree with the previous two responses (thanks!) - Queues and Blob Leases are great options here. Taking a lease on a blob is a great way to manage concurrency, and queues are a great way to make sure that work is done only once, in a fault-tolerant way. Carlos already referenced the leases blog post, but here's the one about using "Update Message" for queues to update the contents: http://blogs.msdn.com/b/windowsazurestorage/archive/2011/09/15/windows-azure-queues-improved-leases-progress-tracking-and-scheduling-of-future-work.aspx
Also, just make sure you're using retries for everything!
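For what "retries for everything" might look like, here is a small generic sketch in Python. The helper `with_retries` and its parameters are hypothetical, not part of any storage client library, and the exponential backoff with random jitter is just one common policy chosen for illustration.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(operation: Callable[[], T],
                 attempts: int = 5,
                 base_delay: float = 0.5,
                 max_delay: float = 30.0,
                 retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation(); on a listed exception, back off and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            # Exponential backoff capped at max_delay, with full jitter so that
            # many workers do not all retry at the same moment.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

Each storage call (table read, conditional update, queue get/update/delete) could be wrapped as `with_retries(lambda: do_call(), retry_on=(TimeoutError,))`, where `do_call` is whatever operation you are making. Only retry errors that are actually transient; a failed conditional update usually means another worker won and should not be blindly retried.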
-Jeff
- Marked as Answer by Jiang Yun (Moderator) 20 September 2012 6:34