Azure Data Storage/Analysis Technology Choices

  • Question

  • I've been investigating the Azure offerings for an app idea that I have. I'm having trouble figuring out how some of the Big Data pieces fit together. My concept is basically a sensor-based system where the sensor data will vary in structure, and is thus well suited to storage in a dynamic schema. Some realtime* reporting will be required along two dimensions of the sensor data, but deeper analysis (not realtime) will be required along many dimensions.

    I have experience with SQL Server/SSIS/SSAS, so Table Storage, Hadoop, and MongoDB concepts are new to me.

    Am I correct in thinking:

        • MongoDB is good for storing and querying (in realtime) data that requires a dynamic schema, as long as the data you're querying lives in a single shard?
        • Hadoop is not really meant for persisting data but rather running Map Reduce operations to derive new insights from data (but not in realtime)?
        • Since Hadoop is not intended as a primary means of persisting data, you wouldn't typically keep a large Hadoop cluster hanging around with a populated HDFS data store; rather, you'd spin up a Hadoop cluster and populate HDFS when you wanted to do a batch of analysis, and then you'd typically persist the results elsewhere?
        • Although MapReduce operations can also be run directly on MongoDB, the performance of MapReduce on MongoDB is not as good as on Hadoop, especially as data sizes grow?
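    For intuition, the MapReduce pattern that Hadoop (and MongoDB's built-in map-reduce) applies can be sketched in plain Python over a handful of hypothetical sensor readings — the field names (`sensor`, `temp`, `humidity`) are illustrative assumptions, not from any real schema:

    ```python
    from collections import defaultdict

    # Hypothetical readings with a dynamic schema: documents need not share
    # the same fields, which is what makes document stores attractive here.
    readings = [
        {"sensor": "A", "temp": 21.5},
        {"sensor": "A", "temp": 22.0, "humidity": 40},
        {"sensor": "B", "temp": 19.0},
    ]

    # Map phase: emit (key, value) pairs from each document.
    mapped = [(r["sensor"], r["temp"]) for r in readings]

    # Shuffle phase: group emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: fold each group into a derived insight (average temp).
    averages = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    print(averages)  # {'A': 21.75, 'B': 19.0}
    ```

    Hadoop's value is running the same three phases in parallel across a cluster on data far too large for one machine, which is why it wins over MongoDB's map-reduce as data sizes grow.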

    If the above is accurate, then in my scenario would the following architecture make sense:

    1. Temporarily store incoming sensor readings in a queue (implemented in SQL Azure Federation, Azure Tables, Azure Queue(s), or a MongoDB cluster?)
    2. Use a worker process to move the queued readings into two distinct MongoDB persistence clusters (each cluster respectively sharded along one of the two realtime reporting dimensions)
    3. Periodically spin up a Hadoop cluster, populate it with data from one of the MongoDB persistence clusters (it shouldn't matter which one), run a batch of MapReduce jobs, and (directly or indirectly) persist the results into SSAS
    4. End-user realtime queries will be run against one of the two MongoDB persistence clusters, and results will be limited to data from a single shard from the queried cluster.
    5. End-user non-realtime reporting will be run against the SSAS database, and will be limited by the schema of the SSAS database
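    A minimal sketch of steps 1, 2, and 4 — readings landing on a queue, a worker fanning them out to two stores keyed by different reporting dimensions, and a realtime query that stays within one shard. A stdlib `queue.Queue` stands in for the Azure/MongoDB queue, dicts stand in for the two sharded clusters, and the dimension names `sensor` and `region` are placeholder assumptions:

    ```python
    import queue

    # Step 1: incoming readings are buffered on a queue.
    incoming = queue.Queue()
    for r in [
        {"sensor": "A", "region": "east", "temp": 21.5},
        {"sensor": "B", "region": "west", "temp": 19.0},
        {"sensor": "A", "region": "west", "temp": 22.0},
    ]:
        incoming.put(r)

    # The two "clusters": each dict maps a shard-key value to the readings
    # that shard would hold, one cluster per realtime reporting dimension.
    by_sensor, by_region = {}, {}

    def drain(q):
        """Step 2: worker loop moving each queued reading into both stores."""
        while not q.empty():
            reading = q.get()
            by_sensor.setdefault(reading["sensor"], []).append(reading)
            by_region.setdefault(reading["region"], []).append(reading)

    drain(incoming)

    # Step 4: a realtime query hits one cluster and touches a single shard.
    print(len(by_sensor["A"]))  # 2 readings for sensor A
    ```

    The point of double-writing is that each realtime query pattern gets a cluster whose shard key matches it, so the query is routed to one shard instead of scattering across all of them.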

    ???

    *I'm using 'realtime' in a relative sense - it's OK for things to be a minute or two stale, and I realize that with a queue, things won't be truly 'realtime'.


    • Edited by cpnet Saturday, September 29, 2012 12:16 AM clarification
    Saturday, September 29, 2012 12:08 AM
