locked
How to process historical data RRS feed

  • Question

  • I have a query that aggregates records into five minute results using an application timestamp value.

    I had originally pushed the historical records into the same event hub input as live results were going in. This essentially failed with lots of out of order and late input events.

    I then created a new temporary input event hub and query to isolate historical record processing from live/current record processing. Initially this did not produce any output records for the input records because the job was started from the current time. I then stopped the query and started it with a job start time of way back when the historical records started (May 2015). The query then picked up the historical records, but 76018 of the 77875 records were identified as being late and only 30 output records were generated. I also found that my process for pumping in historical records into the event hub couldn't keep up with the ASA query which overran the historical time point and then no more input records were found by the query.

    Is there an effective way of processing old historical records?

    Sunday, September 6, 2015 6:47 AM

Answers

  • It's not possible today, because ASA only supports near real time event processing today, except for running the query in the browser to test with sample data.

    That said, we are working on a feature to allow non near real time events processed without triggering the late arrival policy. There is no ETA at this time though.

    • Marked as answer by Zafar Abbas Monday, September 21, 2015 4:59 PM
    Tuesday, September 8, 2015 4:29 PM

All replies

  • It's not possible today, because ASA only supports near real time event processing today, except for running the query in the browser to test with sample data.

    That said, we are working on a feature to allow non near real time events processed without triggering the late arrival policy. There is no ETA at this time though.

    • Marked as answer by Zafar Abbas Monday, September 21, 2015 4:59 PM
    Tuesday, September 8, 2015 4:29 PM
  • Hi Rory,

    I am interested to understand your scenario for processing historical data.

    Could you explain where is the data coming from and how frequently you push to Event Hub?

    Do you push this data to EventHub in a particular schedule?

    When you try to process the historic data using ASA, do you want your job to run for a batch?

    If yes, how would you like to define the boundaries of your batch?

    Over all, I am curious why would you want to use ASA for processing historical data compared to traditional tools like HIVE/PIG or SQL Database.

    Tuesday, November 17, 2015 2:59 AM
  • Hi Dipanjan,

    My results come in each minute and get aggregated to five minutes and 60 minute aggregates. The minute results also get pushed into table storage. I need the ability to read a time period of those stored results and push them through ASA again if I ever want to reprocess a segment of time. The issue here is that the timestamp of those records being processed may be months in the past compared to when they are reprocessed.

    The ideal scenario would be for ASA to batch records per aggregate period with a tolerance irrespective of when the timestamp of those records are. As such, the ASA query could process over both current and historical data.

    I have fallen back to SQL Azure to run this aggregation because this historical processing was not available with ASA. Ideally I would like to move away from SQL Azure to more suitable designs (for example it is difficult to avoid deadlocks in SQL).

    The other reason I went back to SQL Azure was because ASA only supports hour aggregates against the start of the hour, however I need aggregates relative to three points of the hour:

    1. On the UTC hour (supported)
    2. 30 minute offset (not supported)
    3. 45 minute offset (not supported)
    Tuesday, November 17, 2015 7:56 AM
  • Hi Dipanjan,

    I just ran across this old thread.  We have a need for processing historical data in Stream Analytics as well.  Has any progress been made on supporting this in ASA, since this last reply?

    Our use case is that we need to process historical usage information from public cloud hosts.  We want to aggregate this data on a daily basis [TumblingWindow(day, 1)], from hourly usage details downloaded via API from multiple cloud hosts (Azure, EC2, etc.), so we can bill customers for this summarized usage.

    We are sending this usage data to Event Hub from a daemon (Azure WebJob) that collects all the data on a recurring basis.

    Using the 'Test' functionality for our query actually works perfectly, and gives us the aggregated results we need.

    It just seems the late arrival logic is causing problems, and we don't get any output events.

    Thanks,
    Kirk

    Friday, February 26, 2016 6:38 AM
  • As I mentioned in a different thread, it is possible to process data older than 20 days with Stream Analyticsou only need to set the late arrival setting to -1. Just as is explained in the Microsoft documentation below:

    https://docs.microsoft.com/en-us/azure/templates/microsoft.streamanalytics/streamingjobs

    Setting the value to -1 is equivalent to setting the late arrival setting to indefinitely.

    This value can not be set by using the SA Gui in the Azure Portal however, but can be set by either deploying a job using an ARM template and power shell. Or by changing the value of the late arrival setting for your job by editing the job in resources.azure.com.

    Monday, September 24, 2018 12:43 PM