Reading and conditionally updating N rows (N > 100,000) for DNA sequence processing


  • I have a proof-of-concept application that uses Azure tables to associate DNA sequences with "something".

    Table 1 is the master table. It uniquely lists every DNA sequence. The PK is a load-balanced hash of the RK, and the RK is the unique encoded value of the DNA sequence.

    Additional tables are created per subject. Each subject has a list of N DNA sequences, each of which has exactly one reference in the master table, where N > 100,000. The PK is again a load-balanced hash, and the RK is the unique value of the DNA sequence. Assume that the number of distinct RKs here is many orders of magnitude smaller than in the master table.

    It is possible for many subject tables to reference the same DNA sequence, but even in that case only one entry is present in the master table.
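To make the schema concrete, the master-table rows look roughly like this (a sketch against the Microsoft.WindowsAzure.Storage table SDK; `SequenceEntity`, `Payload`, and the MD5-prefix partitioning scheme are just illustrative names/choices, not anything final):

```csharp
using System.Security.Cryptography;
using System.Text;
using Microsoft.WindowsAzure.Storage.Table;

// One row per unique DNA sequence in the master table.
// The PK is a hash of the RK so inserts spread evenly across partitions.
public class SequenceEntity : TableEntity
{
    public SequenceEntity() { } // required by the SDK for deserialisation

    public SequenceEntity(string encodedSequence)
    {
        RowKey = encodedSequence;
        PartitionKey = HashForPartition(encodedSequence);
    }

    // The "something" the sequence is associated with.
    public string Payload { get; set; }

    // First byte of an MD5 hash as hex gives 256 evenly-loaded partitions.
    public static string HashForPartition(string rowKey)
    {
        using (var md5 = MD5.Create())
        {
            byte[] h = md5.ComputeHash(Encoding.UTF8.GetBytes(rowKey));
            return h[0].ToString("x2");
        }
    }
}
```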

    My Azure dilemma:

    I need to lock the reference in the master table while I work with the data. I need to handle timeouts and prevent other threads from overwriting my data while one C# thread is working with the information. Other threads need to realise that the record is locked, move on to other, unlocked records, and do the work there.
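The pattern I keep coming back to is a lease column combined with the conditional (ETag/If-Match) update the table service already provides: read the row, take it only if it is unlocked or its lease has expired, and let the If-Match on the replace make the take atomic; an HTTP 412 means another thread won the race and this thread should move on. A sketch, assuming the 2.x storage SDK — `LockedBy` and `LockExpiryUtc` are columns I would add, and the entity/class names are placeholders:

```csharp
using System;
using System.Net;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

// Minimal entity carrying the lease columns.
public class LockableEntity : TableEntity
{
    public string LockedBy { get; set; }
    public DateTime? LockExpiryUtc { get; set; }
}

public static class SequenceLock
{
    // Try to take a time-limited lease on one master-table row.
    // Returns the locked entity, or null if another worker holds it.
    public static LockableEntity TryAcquire(
        CloudTable master, string pk, string rk, string owner, TimeSpan ttl)
    {
        var retrieve = TableOperation.Retrieve<LockableEntity>(pk, rk);
        var entity = (LockableEntity)master.Execute(retrieve).Result;
        if (entity == null)
            return null;

        // Held by a live owner? Skip it. The TTL means a crashed worker's
        // lock expires on its own instead of blocking the row forever.
        if (entity.LockedBy != null && entity.LockExpiryUtc > DateTime.UtcNow)
            return null;

        entity.LockedBy = owner;
        entity.LockExpiryUtc = DateTime.UtcNow + ttl;
        try
        {
            // Replace sends If-Match: <ETag>, so it only succeeds if nobody
            // has changed the row since we read it.
            master.Execute(TableOperation.Replace(entity));
            return entity;
        }
        catch (StorageException e)
        {
            if (e.RequestInformation.HttpStatusCode ==
                (int)HttpStatusCode.PreconditionFailed)
                return null; // lost the race; try another record
            throw;
        }
    }
}
```

Releasing the lock would be the mirror image: null out `LockedBy`/`LockExpiryUtc` and do another ETag-conditioned Replace.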

    Ideally I'd also like some progress report on how the computation is going, and the option to cancel the process (and unwind the locks).
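For progress and cancellation I was picturing a shared counter plus a `CancellationToken`; each worker releases whatever lease it holds (in a `finally`) before honouring the cancel. A rough sketch — `processOne` is a placeholder for the per-record acquire/work/release step:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public class ProgressRunner
{
    private int _done;

    // Safe to poll from a UI/reporting thread while Run is in flight.
    public int Done { get { return Thread.VolatileRead(ref _done); } }

    public void Run(int total, CancellationToken token, Action<int> processOne)
    {
        var options = new ParallelOptions { CancellationToken = token };
        Parallel.For(0, total, options, i =>
        {
            // processOne should take the lease, do the work, and release
            // the lease in a finally block so cancellation unwinds cleanly.
            processOne(i);
            Interlocked.Increment(ref _done);
        });
    }
}
```

The caller wraps `Run` in a `try`/`catch (OperationCanceledException)`, since that is how `Parallel.For` surfaces a cancelled token; any rows whose leases weren't released still expire via the TTL.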

    What is the best approach for this?

    I'm looking at these code snippets for inspiration:

    • Edited by ChrisLaMont Wednesday, September 12, 2012 12:24 PM
    Wednesday, September 12, 2012 3:05 AM


All replies