Newbie getting asked to research this big data thing - no more ETL ? RRS feed

  • Question

  • Im a .NET developer but ive also been down the data warehouse road with SSAS & SSIS.  What has been capturing my attention lately is the blog posts about Hadoop, "big data" etc.  Some of these products promise "no ETL".   Hmmm, really?  As of late, there seems to be two main competing products - Amazon Redshift and Azure Data Warehouse.  Do each of these not require you to write a big monster ETL process?  

    As for my teams needs - all of our databases together wont reach 1 TB but the data is big enough that ETL processes using a "flush and fill" approach generate a lot of Tlog usage and traffic in general on a nightly basis.

    Ive booked some time on Pluralsight to research these, but my basic question before I get started regarding the Azure solution (SQL Data Warehouse) - is does it also let you avoid the ETL step (no need for an SSIS package) ?

    • Edited by shiftbit Sunday, December 20, 2015 1:27 AM hdfgh
    Sunday, December 20, 2015 1:20 AM


  • This is a case of multiple technologies and groups solving similar/overlapping problems, each with their own terminology and more important, their own agenda. ETL is just a generic term for preparing data for final consumption. It is not unique or specific to traditional DW environments. It's just about getting data in the right place and form for consumption. In particular, Transform is often incorrectly labeled as normalizing the data for a specific schema. That's just one possible task, it could be anything you want to do to get the data in the form you want to consume it in.

    In the materials you read, be wary of no ETL claims. Ask the author if that means there will be no data movement or transformation at all. Unless you're running your reports, analytics, etc.... against the source where the data was born, there will be data movement. Plus you're probably transforming the data within the reporting app before producing the user consumed report. The difference is where you move it to and how you move it. There's definitely real scenario of running reporting/analytical or other non-OLTP workloads against production OLTP systems but that's not a trivial thing to get right and isn't really what hadoop is about.

    As for Redshift and Azure SQL Data Warehouse, these are basically relational data warehouse solutions provided as cloud services instead of software/appliances you run on your local premises. You still need to load data into these DWs before you can take advantage of their immense compute capacities. You also have the add benefits of not having to manage the infrastructure, hardware, software and capacity (e.g. SQL DW has a cool pause capability to put your DW in snooze mode so you don't incur compute charges when not used).

    Even if you decide to move away from a traditional relational data warehouse and use Hadoop, you'd still be moving data from your source system (e.g. SAP, Dynamics, etc...) to Hadoop storage (HDFS, AzureBlob, S3, FTP, ...). It's VERY different from DW storage plus with Hadoop you can actually write your ETL jobs within Hadoop or use existing ETL tools (most support Hadoop). See HDInsight if you're interested in the Microsoft Hadoop based solution in Azure (https://azure.microsoft.com/en-us/solutions/hadoop/). Amazon has EMR that's also based on hadoop.

    No great genius has ever existed without some touch of madness. - Aristotle

    Tuesday, December 22, 2015 10:07 PM