MPP versus SSIS

  • Question

  • I’m new to parallel data warehouse (PDW) technology, which uses the concept of massively parallel processing (MPP). In the SMP environment I routinely did most of my work in SSIS where possible. The approach was Extract, Transform, Load (ETL). However, when I started with the PDW I was advised that instead of ETL, the approach in PDW is ELT. I’m now faced with a dilemma. Can I default my approach to stored procedures now that I have the brute force of MPP? The data that I’m working with is largely stored in the PDW, and most of it is hash distributed. I find that the general approach in this new environment is that the data is bulk-loaded first into a landing area and then transformed using CTAS. These are flat tables, not star schemas. Can someone advise me on best practice in this regard – SSIS (ETL) or stored procedures on the MPP appliance (ELT) – given the context I have presented above?

    Thanks,


    Mpumelelo

    Friday, March 3, 2017 1:26 PM

All replies

  • Hi Mpumelelo,

    This is a good question, and one that I see quite a lot, so it's worth exploring the reasons why you would pick one data loading method over another.

    With MPP technology, such as PDW or Azure SQL Data Warehouse, the architecture is a "shared nothing" design. What does that mean exactly? Well, each compute node has its own storage, and the strength of this design is that each compute node can work independently.

    With all of these independent nodes, each with its own storage and data, you need a "co-ordinator", and that is where the control node comes into play. The control node manages the workload across the compute nodes, as in a typical distributed compute architecture.

    Now, to your question: if you think about how an SMP architecture works, you only have the option of loading data sequentially. You have one path to the storage and one "compute node".

    With MPP you have multiple compute nodes and multiple data stores.

    If you use a row-by-row tool such as SSIS, it will load the data sequentially, regardless of whether you have an MPP or SMP architecture. Certainly with an MPP architecture the data load will go through the control node to the underlying compute nodes.

    If you use something like PolyBase, you engage the compute nodes directly and bypass the bottleneck of the control node.
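
    For illustration only (the storage account, credential, and object names below are placeholders I've made up, not anything from your environment), a PolyBase external table over landed text files might look roughly like this:

        -- Assumes a database master key and a database-scoped credential
        -- (StorageCredential) for the storage account already exist.
        CREATE EXTERNAL DATA SOURCE LandingStorage
        WITH (
            TYPE = HADOOP,
            LOCATION = 'wasbs://landing@yourstorageaccount.blob.core.windows.net',
            CREDENTIAL = StorageCredential
        );

        CREATE EXTERNAL FILE FORMAT PipeDelimited
        WITH (
            FORMAT_TYPE = DELIMITEDTEXT,
            FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
        );

        CREATE EXTERNAL TABLE dbo.SalesOrder_ext
        (
            OrderKey    BIGINT        NOT NULL,
            CustomerKey INT           NOT NULL,
            OrderDate   DATE          NOT NULL,
            OrderAmount DECIMAL(18,2) NOT NULL
        )
        WITH (
            LOCATION    = '/salesorder/',
            DATA_SOURCE = LandingStorage,
            FILE_FORMAT = PipeDelimited
        );

    Selecting from (or CTAS-ing from) dbo.SalesOrder_ext makes the compute nodes read their share of the files in parallel rather than funnelling the rows through the control node.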

    So the ELT methodology is the way to go with an MPP architecture. Remove the single-instance control node bottleneck and you will be able to load data at speed. In the case of Azure SQL DWH, you can scale out the compute nodes to achieve near-linear load performance.
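
    As a small aside, scaling Azure SQL DWH up or down is a one-liner (the database name and DWU level here are placeholders):

        -- Run while connected to the master database of the logical server.
        ALTER DATABASE MySqlDw MODIFY (SERVICE_OBJECTIVE = 'DW1000');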

    With regard to loading data: yes, load your data as "flat files" into a staging area. From there, do your transforms into your data model.
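
    A minimal sketch of that pattern, reusing the made-up names from above (the staging and target table names are also placeholders):

        -- Land the external data as-is into a staging table; the load runs
        -- in parallel on the compute nodes.
        CREATE TABLE dbo.SalesOrder_stage
        WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
        AS
        SELECT * FROM dbo.SalesOrder_ext;

        -- Transform into the model with a second CTAS, hash-distributed on
        -- the key you expect to join on.
        CREATE TABLE dbo.FactOrder
        WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX)
        AS
        SELECT OrderKey, CustomerKey, OrderDate, OrderAmount
        FROM   dbo.SalesOrder_stage;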

    Hash distribution allows you to "group" certain data elements together on one compute node. For example, you may have a report that requires all orders for a particular customer. Now imagine that the data for Customer A is on Compute Node 1, but all of the order data for Customer A is distributed, round-robin, over all of the compute nodes. The control node would have to pull all of the order data from the compute nodes and join it to the customer data. If you were to hash distribute your order data on the Customer Key (assuming this is the PK/FK join), then your Customer and related Order data would reside on the same compute node and, hey presto, it's fast to query.
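
    To make that concrete (again with placeholder names, and reusing the FactOrder table from the CTAS sketch above), distributing both sides of the join on CustomerKey might look like this:

        CREATE TABLE dbo.DimCustomer
        (
            CustomerKey  INT           NOT NULL,
            CustomerName NVARCHAR(200) NOT NULL
        )
        WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

        -- Because dbo.FactOrder is also distributed on CustomerKey, each
        -- compute node already holds the matching customer and order rows,
        -- so this join needs no cross-node data movement.
        SELECT c.CustomerName, SUM(o.OrderAmount) AS TotalOrderAmount
        FROM   dbo.FactOrder   AS o
        JOIN   dbo.DimCustomer AS c ON c.CustomerKey = o.CustomerKey
        GROUP BY c.CustomerName;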

    You do have to put a bit of thought into the data model when it comes to a distributed architecture, but the reward for a well-designed data model is performance.

    I hope that helps answer your question, but feel free to comment further.

    Kind Regards,

    Paul Duffett

    Microsoft Azure Data Solution Architect


    Friday, March 3, 2017 2:04 PM
  • Moreover, even for SMP SQL Server, using SSIS for data movement and stored procedures for transformation is a valid (IMO preferred) approach.

    David


    Microsoft Technology Center - Dallas
    My blog

    Friday, March 3, 2017 2:50 PM
  • Hi Paul,

    Thank you for your answer. I still have additional questions related to what you have said in your response. Unfortunately I'm tied up at the moment. I will ask them over the weekend when I am free.

    Kind regards,


    Mpumelelo

    Friday, March 3, 2017 4:56 PM
  • Hi Paul,

    You have mentioned that “with an MPP architecture the data load will go through the control node to the underlying compute nodes.

    If you use something like PolyBase, you engage the compute nodes directly and bypass the bottleneck of the control node”.

    What I would like to know is whether it is possible to use PolyBase on PDW tables, i.e. tables in individual databases. My simple understanding is that PolyBase links big data/Hadoop with the SQL environment. Since PDW is already on the SQL side (so to speak), how then is it possible to engage PolyBase to access the compute nodes? Or is it just a question of system configuration to achieve a bypass of the control node?

    Kind regards,


    Mpumelelo

    Saturday, March 4, 2017 11:37 PM
  • When accessing data that is already loaded into SQL DW Tables, the compute nodes do (almost) all the work.  The control node creates a distributed query plan, made up of queries that run on the compute nodes and data movement tasks for the compute nodes to exchange data.  At the end the compute nodes send almost-final results to the control node for final aggregation and transmission to the client.
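
    If you want to see this for yourself, prefixing a query with EXPLAIN returns the distributed (DSQL) plan as XML, including the per-node queries and any data movement steps such as SHUFFLE_MOVE or BROADCAST_MOVE (the table names below are just the placeholders used earlier in the thread):

        EXPLAIN
        SELECT c.CustomerName, SUM(o.OrderAmount) AS TotalOrderAmount
        FROM   dbo.FactOrder   AS o
        JOIN   dbo.DimCustomer AS c ON c.CustomerKey = o.CustomerKey
        GROUP BY c.CustomerName;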

    David


    Microsoft Technology Center - Dallas
    My blog

    Sunday, March 5, 2017 1:55 PM
  • Thanks David. So, in other words, if the data is in SQL DW tables in a PDW environment, it's only the control node and compute nodes that do the work? PolyBase only comes into play for data in HDInsight (big data). Is that correct?

    Kind regards,


    Mpumelelo

    Sunday, March 5, 2017 3:49 PM
  • >Is that correct?

    Yes. PolyBase is for interacting with external data sources.

    David


    Microsoft Technology Center - Dallas
    My blog

    Sunday, March 5, 2017 4:50 PM
  • SOME external data sources.

    We are also migrating everything to PDW/Azure DW, but a big missing piece for me is that there is no way to directly address an Azure SQL Database from within the DW.

    Some of our departments are also making the effort to migrate to Azure, but to Azure SQL Database because of the more OLTP characteristics of their apps.

    But on the BI side we just can't extract that data directly, which for me is a real gap in the Azure environment.

    Monday, March 13, 2017 7:54 AM