Fetching Differential Data from Azure Table Storage

  • Question

  • Hi,

    I have a scenario where I need to retrieve data from an Azure table, apply some logic, and insert the results into a different table using a worker process. Since the worker process runs on a 30-minute interval, how do I fetch only the data from table storage that was not fetched in the previous run?


    ali.khan

    Sunday, August 24, 2014 1:34 PM

Answers

  • Hi fak87,

    Thank you for posting the issue to our forum.

    >>>as the worker process has an interval of 30 minutes how to fetch the next data from table storage which was not fetched in the previous run of the worker process. 

    From your description, I understand that you want to fetch data from the Table service using a worker process, which retrieves one or more entities from the storage table every 30 minutes, and that you do not want to fetch data that was already retrieved in an earlier run. For this, you can add a property to each entity you retrieve and update it in the storage table. For example, add a boolean property named "IsAccessed". When an entity is processed by your worker, set it to true. On the next run, filter for entities whose IsAccessed is false.
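    A minimal sketch of this filter-and-mark flow, assuming the Azure Storage .NET client library (Microsoft.WindowsAzure.Storage); the "LogEntity" type and its properties are illustrative names, not an existing schema:

    ```csharp
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    public class LogEntity : TableEntity
    {
        public bool IsAccessed { get; set; }
        public string Payload { get; set; }
    }

    // ...
    var table = CloudStorageAccount.Parse(connectionString)
                                   .CreateCloudTableClient()
                                   .GetTableReference("mytable");

    // Fetch only entities that have not been processed yet.
    var query = new TableQuery<LogEntity>().Where(
        TableQuery.GenerateFilterConditionForBool("IsAccessed", QueryComparisons.Equal, false));

    foreach (var entity in table.ExecuteQuery(query))
    {
        // ... apply your logic, then mark the entity as processed ...
        entity.IsAccessed = true;
        table.Execute(TableOperation.Merge(entity));
    }
    ```

    Merge updates only the properties present on the entity, leaving the rest of the stored entity untouched.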

    Because the Table service is schemaless, entities in the same table can have different properties, even different numbers of properties, so you can add the new property dynamically; see:

    http://blog.smarx.com/posts/adding-a-property--column--in-windows-azure-tables

    http://code.msdn.microsoft.com/windowsazure/How-to-CRUD-table-storage-ebefd270

    You can then filter entities from the storage table using the REST API or the .NET library with LINQ:

    http://msdn.microsoft.com/en-us/library/azure/dd894031.aspx

    http://msdn.microsoft.com/en-us/library/azure/dd894039.aspx

    Best Regards,

    Fuxiang



    Monday, August 25, 2014 6:33 AM

All replies

  • Hi Fak87,

    If you can make a change to how the data is submitted from the data producer, there are two efficient ways to handle this task.

    1) Use the partition key to represent a 30 minute chunk.

    // Round the current time down to the start of its 30-minute chunk
    // (see http://stackoverflow.com/questions/1393696/rounding-datetime-objects).
    var timeChunk = RoundDownTo30MinuteChunk(DateTime.UtcNow);

    // Reverse-tick key, zero-padded to 19 digits so the newest chunk sorts first.
    string partitionKey = string.Format("{0:D19}", DateTime.MaxValue.Ticks - timeChunk.Ticks);

    You'll then query the table for all the rows with the PartitionKey from the chunk starting 60 minutes ago. That is, if it is now 02:35:00, you'll want to query for the chunk that corresponds to 01:30:00 - 02:00:00. That way you have confidence that all the data has been uploaded for that chunk.

    This solution is efficient but risky, because your worker may go down and actually miss a chunk of data. Which means you'll need to separately track which periods have and have not been processed.
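    Putting the pieces above together, a sketch of the chunked query, assuming the Storage .NET client; the rounding helper is the one named in the snippet, written out here as one plausible implementation:

    ```csharp
    using System;
    using Microsoft.WindowsAzure.Storage.Table;

    static DateTime RoundDownTo30MinuteChunk(DateTime t)
    {
        long ticksPerChunk = TimeSpan.FromMinutes(30).Ticks;
        return new DateTime(t.Ticks - (t.Ticks % ticksPerChunk), t.Kind);
    }

    // Target the chunk that started 60 minutes ago, so its data is complete.
    // E.g. at 02:35 this yields the 01:30 chunk.
    var chunkStart = RoundDownTo30MinuteChunk(DateTime.UtcNow.AddMinutes(-60));
    string partitionKey = string.Format("{0:D19}", DateTime.MaxValue.Ticks - chunkStart.Ticks);

    var query = new TableQuery<DynamicTableEntity>().Where(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey));
    // foreach (var entity in table.ExecuteQuery(query)) { ... process ... }
    ```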

    2) A better solution is to use queues. Have the data producer put a message in a queue each time it finishes with a chunk of data. For example: if the producer uploads a set of 100 rows, it would then put a message in a queue with a unique partitionkey for those rows (or rowkey range or some other unique identifier). Then your worker would periodically poll the queue for new messages. When a message arrives it will process the corresponding rows. This is a much more effective solution because you can have your worker poll continuously and immediately process the rows (rather than waiting until a specific period of time has passed). You can also make use of more than one worker, which will allow you to scale up and process more data as your service grows. This is called the competing consumers pattern, and it is one of the most effective methods of processing data. See Competing Consumers Pattern http://msdn.microsoft.com/en-us/library/dn568101.aspx for details.
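    A sketch of the queue hand-off, again assuming the Storage .NET client; the queue name and message format are illustrative:

    ```csharp
    using System;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Queue;

    var queue = CloudStorageAccount.Parse(connectionString)
                                   .CreateCloudQueueClient()
                                   .GetQueueReference("chunks-ready");
    queue.CreateIfNotExists();

    // Producer: after uploading a batch of rows, announce its partition key.
    queue.AddMessage(new CloudQueueMessage(partitionKey));

    // Worker: poll for new batches and process the corresponding rows.
    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromMinutes(5)); // hidden from other workers while processing
    if (msg != null)
    {
        string keyToProcess = msg.AsString;
        // ... query the table for PartitionKey == keyToProcess and process the rows ...
        queue.DeleteMessage(msg); // delete only after successful processing
    }
    ```

    The visibility timeout is what makes multiple competing workers safe: if a worker crashes mid-batch, its message reappears on the queue and another worker picks it up.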

    Regards,

    Brent

    Monday, September 1, 2014 2:23 PM
  • Hi Fuxiang,

    Annotating the original table isn't going to be a very efficient solution. Queries against non-key fields will slow down over time. If the table is very small and stays small, then annotating the processed rows with IsAccessed = true will work. But if the table has thousands or millions of rows then the query for IsAccessed == false will become prohibitively slow. Querying the table for IsAccessed == false is a table scan, and every row in the table will need to be checked for every query.

    See http://stackoverflow.com/questions/4831989/azure-table-storage-how-fast-can-i-table-scan for a succinct explanation of why table scans are slow.

    In Azure Table Storage, it is important to use the PartitionKey and/or RowKey as part of the querying to allow scale-out and keep the queries fast.
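    The contrast is visible in the filters themselves; an illustrative sketch, assuming the Storage .NET client, with example key values:

    ```csharp
    using Microsoft.WindowsAzure.Storage.Table;

    // Fast: the PartitionKey pins the query to a single partition.
    var partitionQuery = new TableQuery<DynamicTableEntity>().Where(
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "0000123456789012345"),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, "0000")));

    // Slow: no key in the filter, so the service must scan every row in the table.
    var tableScan = new TableQuery<DynamicTableEntity>().Where(
        TableQuery.GenerateFilterConditionForBool("IsAccessed", QueryComparisons.Equal, false));
    ```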

    Regards,
    Brent

    Monday, September 1, 2014 2:34 PM