How long is the input data stored?

  • Question

  • We've been playing with Stream Analytics and it looks great. But there's one simple question we haven't found an answer to: How long is the input data stored?

    For example,

    1. If we start collecting data today through Event Hub and output the results of the job to SQL Server

    2. And then a year later realize that we want to run additional queries against the input data, which we haven't previously stored in SQL Server

    Can we just create a new job with a different query, configure the "Start Output" to point to the previous year, and off we go? Or is the input data lost at some point?

    Monday, November 10, 2014 6:41 AM

All replies

  • Assuming that your input is Event Hub, you have two complementary options.

    For testing and short-term replay, you can configure Event Hub to keep the data for relatively short periods of time (up to a week); the default is 24 hours.

    But for longer periods, you should persist the data in Blobs.  Stream Analytics makes this very easy: just set up a simple job whose input is Event Hub, transformation is a simple "select everything you are interested in preserving", and the output is blobs.  Let this job run forever.  It will not interfere with your other live computations.
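
    A minimal sketch of such an archiving job's query, assuming the Event Hub input alias is [eventhub-input] and the blob output alias is [blob-archive] (both are placeholder names you would define on the job's Inputs and Outputs pages):

        -- Pass-through archive query: copy every incoming event to blob storage.
        -- [eventhub-input] and [blob-archive] are hypothetical alias names.
        SELECT *
        INTO [blob-archive]
        FROM [eventhub-input]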

    Then, whenever you want to do computations on the past, just create a new job with the input pointed at the blob container you created (Stream Analytics supports 'Blobs' as an input), set your Start Output time, and off you go.
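
    For instance, a replay job that recomputes an hourly aggregate over the archived events could look like the sketch below; the [blob-archive-input] and [sql-output] aliases and the DeviceId/Reading fields are illustrative assumptions, not names from this thread:

        -- Hypothetical replay query over the archived blobs:
        -- an hourly average reading per device, written back to SQL.
        SELECT DeviceId,
               AVG(Reading) AS AvgReading,
               System.Timestamp AS WindowEnd
        INTO [sql-output]
        FROM [blob-archive-input]
        GROUP BY DeviceId, TumblingWindow(hour, 1)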

    Since the data is stored in your regular blob storage account in a standard format, you can also use other tools to do computations on it, like HDInsight, if you like.

    Thanks,

    --Lev

    Monday, November 10, 2014 4:34 PM
  • Thank you for the reply. I wonder, though: is Stream Analytics doing some data preservation automatically? To elaborate:

    1. We created our test Event Hub and Stream Analytics job on the 2nd of November. The job uses the Event Hub as input and outputs the data to a SQL Server table.

    2. If we now stop the job and delete everything from the table, we get a clean state.

    3. Now, while the job is stopped, we can configure its "Start output". If we set it to the 2nd of November and restart the job, all the data is recreated in the table.

    As the Event Hub only has a data retention of 3 days, how is it possible that Stream Analytics can recreate everything starting from the 2nd of November?

    If we keep deleting the data from the SQL Server table every day, is there a point in time when we lose some of the input data?

    Tuesday, November 11, 2014 6:38 PM
  • No, Stream Analytics does not preserve input data.  Your experience above is caused by Event Hub going "above and beyond the call of duty": probably because you have not generated all that many events, it didn't feel the need to truncate.

    But it is not safe to rely on that behavior, and the moment you ramp up your usage, Event Hub will start enforcing retention.

    • Marked as answer by MikaelAd Thursday, November 13, 2014 6:06 PM
    Wednesday, November 12, 2014 12:20 AM
  • Excellent, thank you for your help!
    Thursday, November 13, 2014 6:06 PM