Handling / Persisting high performance data streams (event streams)

Respondido Handling / Persisting high performance data streams (event streams)

  • domingo, 6 de setembro de 2009 08:20
     
     
    Seems like the best forum to start ;)

    I am in the process of developping a system that records and redistributes high performance data streams. The application is pretty large, but this one part (which is pretty isolated) is collecting update events from a number of sources and has to redistribute them to a number of end points. It ALSO has to store them for later playback and analysis / aggregation.

    What makes that thing a little tricky is that we talk of about 250.000 events in about 1.5 million streams. Sadly, that is 250.000 events PER SECOND ;) And yes, this is a financial data stream application - the CME (Chicago Mercantile Exchange) stream alone has about 30.000 events per second during peak times, in about 400.000 "streams" (symbols). My current approach would be to have a separate server that collects those events an dkeeps the "current" data of different types there for easy retrieval via API (which is not so complicated - bascially a hashtable with symbol/data holder, and the data holder has x slots for the different data types (last trade, last best bid/ask, last orde book on bid / ask side, last open price etc.).

    There are, on top of that, occasional heavy spikes of similar events - we talk of around 1.000.000 per second from an exchange, but those are REALLY time limited. Basically.... when a market opens, a number of events (status changeto open, open price, initial first trade possibly) are generated RIGHT THERE.

    THe main problem I have is persisting this hugh amount of informations for later analysis. The data on most streams is pretty low (REALLY low on actually the majority which may not really change at all - talk of heavy out of the money options) while SOME of them will be really active.

    Anyone any idea? Or a better forum? ;)

    * Is StreamInsight an alternative platform to build that around?
    * How would you solve the persistence side of that?

Todas as Respostas

  • quarta-feira, 16 de setembro de 2009 06:50
     
     Respondido

    Great question for this forum.

    Microsoft StreamInsight at this point focuses on processing data streams while the data is in flight and while it can be kept in main memory. In essence, the current version of StreamInsight provides you with a main memory query processor for streaming data that arrives with high rates and where you want to continuously process the data with low latency.

    Many end-to-end scenarios indeed involve data persistence at some point. However, the specifics of how and where you want to store the data depend a lot on your scenario and your application domain. The StreamInsight adapter SDK provides you with the framework to plug in your domain-specific data stores on the input or the output side. We do not prescribe a specific persistence solution. To just quickly mention some of the options: you could choose among Microsoft SQL Server (see one of our sample adapters) or adapters to platforms provided by partners that target specific domains (see the TechEd 2009 session for an example how this might work). This also allows you to plug in existing persisted data that you might want to correlate to live data feeds for comparison in queries. Note that the queries are agnostic of the fact whether the data is persisted or not: the adapters make this transparent and the binding process allows you to point the same query to different data sources as long as they return the types required by the query.

    Given the data rates you mention, another interesting option might be to consider using StreamInsight to preprocess the data to throttle down the data rate before persisting it. StreamInsight can allow you to reason much better for what data you really want to pay the price of persisting it.

    Hope this helps.

    Best regards,
    Torsten
    MS StreamInsight Team

    Disclaimer: This posting is provided "AS IS" with no warranties, and confers no rights.


    This posting is provided "AS IS" with no warranties, and confers no rights.
  • quinta-feira, 17 de setembro de 2009 05:17
     
     
    Given the data rates you mention, another interesting option might be to consider using StreamInsight to preprocess the data to throttle down the data rate before persisting it. StreamInsight can allow you to reason much better for what data you really want to pay the price of persisting it.

    Thanks for the answer.

    Filtering sadly is NOT an option - with SOME minor regards for obviousy error data, but this is not relevant by any means (maybe 1 of 1 million entries would be filtered). THe data must be persisted as it is for later analysis.... see it as a hugh lab environment, just that we dont talk of "Temperatures" but of "Money". The data streams in question are the complete data feeds for some financial exchanges. CME (the main one I am interested in) - the Chicago Mercantile Exchange - trades around 400.000 symbols (most of them options) and some get a couple of thousand updates per second - most of them changes in offers, but.... the recording is to allow real back testing of complex trading strategies (and the data needs to be forked off to existing strategies and visualization).... so I simply do not see how I can kill a significant part of it ;) If I do... there are a dozen other exchanges waiting ;( I simply do not see how I can filter that down. Especially as the big peaks are open/close, when all the status updates come in (400.000 symbols switching from "Open" to "Halted" means 400.000 updates, followed by 400.000 close price indications). During "normal" market (half an hour post open) there are about 10.000 - 15.000 updates sent per second.

    Thanks for the news, though. I definitely have to have a look at StreamInsight - possibly a platform for the information coordination step, feeding in all input and then sending it out to some other terminals.
  • quinta-feira, 12 de novembro de 2009 19:12
     
     
    Given the data rates you mention, another interesting option might be to consider using StreamInsight to preprocess the data to throttle down the data rate before persisting it. StreamInsight can allow you to reason much better for what data you really want to pay the price of persisting it.

    Thanks for the answer.

    Filtering sadly is NOT an option - with SOME minor regards for obviousy error data, but this is not relevant by any means (maybe 1 of 1 million entries would be filtered). THe data must be persisted as it is for later analysis.... see it as a hugh lab environment, just that we dont talk of "Temperatures" but of "Money". The data streams in question are the complete data feeds for some financial exchanges. CME (the main one I am interested in) - the Chicago Mercantile Exchange - trades around 400.000 symbols (most of them options) and some get a couple of thousand updates per second - most of them changes in offers, but.... the recording is to allow real back testing of complex trading strategies (and the data needs to be forked off to existing strategies and visualization).... so I simply do not see how I can kill a significant part of it ;) If I do... there are a dozen other exchanges waiting ;( I simply do not see how I can filter that down. Especially as the big peaks are open/close, when all the status updates come in (400.000 symbols switching from "Open" to "Halted" means 400.000 updates, followed by 400.000 close price indications). During "normal" market (half an hour post open) there are about 10.000 - 15.000 updates sent per second.

    Thanks for the news, though. I definitely have to have a look at StreamInsight - possibly a platform for the information coordination step, feeding in all input and then sending it out to some other terminals.
    Have you arrived at a way forward with this problem?

    Does a supported data update rate of (say) 250,000 updates per second from managed code sound like it may be sufficient?

    I'm curious about what solution you have settled on.

    Hugh



    Hugh Moran - http://www.morantex.com
  • quinta-feira, 10 de dezembro de 2009 09:55
     
     
    Dear,
    i think i have a same problem here,
    we have a messaging platform [SMS, USSD, WAP, MMS], that platform is able to process approximately 20,000 message per second
    The problem here is that i need a way to be able to store these messages into the data base, and for every message i must return with a unique ID for it to be able to query the status of it.
    in mean while, i have 5 basic sources to insert the messages into the DB, and i must persist data for all of them.

    finally, i must be able to query, update, analyze at same time while sending.

    so if there are a solution with Stream Insight, kindly give me a start point to be able to handle that problem,

    H
  • quarta-feira, 11 de abril de 2012 14:57
     
     

    Really?

    Is that the best answer? Cut down the speed of streams, delete/kill some events?

    I am facing the exact same problem, so how do we actually save this kind of HUGE data into database?

  • quarta-feira, 11 de abril de 2012 20:14
     
     

    Typically, yes, that is the best answer. If you look at historians ... products designed specifically to store this type of data ... such as OSISoft PI System, they always downsample/compress the data before storing it.

    There are a couple of things to keep in mind/consider ... first, whenever you are storing data, you have to deal with the ability of the disk to keep up with the data rates. Even SSDs aren't fast enough to do it. You can add more disks to a striped RAID array (RAID 1 or RAID 10) to improve throughput but you still won't be able to keep up. The real bottleneck on any storage system (Sql Server/Oracle/PI/whatever) is the disk. If you are doing a transacted write, you'll have a synchronous write to the log before the call returns. If you are writing single events at a time, your overhead for the transaction are even higher.

    Also, in most cases, it isn't necessary ... or even desirable ... to store data at those rates. There's also the cost of storage to consider. For example, when you look at the number of events coming from, say, all the smart meters in the Houston area, you are talking about some 10 million readings every 15 minutes, which averages to about 10000/second. The storage required for that gets to be pretty massive pretty quickly ... and that level of detail typically has limited, if any, value to the business. Offshore wells are no different ... 1 single supermajor will create terabytes of data every day ... and that's after downsampling and compressing. And that doesn't even touch their onshore operations. Saving everything is, quite simply, cost-prohibitive.

    Then ... if your database server is on a different machine, you have the wire to consider. While gigabit networks are commonplace now, they are still limited in bandwidth. You can only shove so much into that limited pipe and no more. As with transacted writes, writing single events at a time simply increases the overhead associated with the network.

    Which leads to the next question ... have you determined what your actual bottleneck is? Is it disk throughput? Network bandwidth? Something else? Have you used the StreamInsight performance counters to watch the output queues and see what they are doing? What are the event rates that you are trying to write?

    All of that said, it's still possible to save relatively large volumes of data to a data store. We've tested our Sql Server recorder output adapter at 30K events/second on a dual core laptop. But the database gets REALLY big REALLY fast. So we limit how much is written in real applications.


    DevBiker (aka J Sawyer)
    Microsoft MVP - Sql Server (StreamInsight)


    Ruminations of J.net


    If I answered your question, please mark as answer.
    If my post was helpful, please mark as helpful.