SQL Server Developer Center > SQL Server Forums > SQL Server Data Warehousing > one size fits all architecture for star's etl
Ask a questionAsk a question
 

Questionone size fits all architecture for star's etl

  • Monday, September 14, 2009 1:41 PMdb042188 Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     

    is there a whitepaper somewhere that describes a flexible architecture for a star's etl?  I have many goals I'd
    like to accomplish with the architecture of my etl but dont necessarily see anybody discussing them on a macro level.
    Here are some of my goals:

    1) I may wish to update previously recorded facts with dimensions like "close date". 

    2) I may wish to delete instances of some dimensions that are no longer in use

    3) I may be stuck to MOLAP for a while, but dont want to introduce obstacles to ROLAP and HOLAP.

    4) Types I and II SCDs are something I want to account for

    5) Because various measure groups share conformed dimensions, I'd like some kind of precedence between dim
       recording process(es) and all fact processing that depends on those conformed dims

    6) I'd like sufficient decoupling to target/run one object's ETL if necessary

    7) I'm sometimes forced to deal with "deltas" (change between yesterday and today) in the absence of transactions
       that record changes in the OLTP, maybe that issue will dictate how I architect things

    8) Maybe I should consider generating all/some of the ETL automatically, perhaps based on metatdata about my realtionships

    9) Wherever Incremental Processing Options make sense, I want to use them, but maybe with an option to do full
       now and then

    10) I want to append time dims periodically as I approach boundary conditions.  For instance, I presently limit
        my time dim entries to no later than one year from today.  So I'd want to insert new rows there maybe once a
        month.  I do this to limit the perfomance issues I believe are present in "In Flight Transaction" scenarios
        when using MDX.  Unfortunately I use the max time dim for accounts whose close date is really null, so I'd
        probably want to update "open" facts across the board with a new close date any time this boundary changes.

    11) Reusability (for instance between new fact realtionships and old fact relationship expectations/changes)

    12) Flexibility

    13) Server and DB flexibility

    14) maybe restartability  

    15) I think I'd like a layered/precedence approach that looks somethingh like this:

     a) somehow record anything of a Type II nature in a sub package, perhaps one subpackage per Type II dim.
            b) record existing fact dim "expectations", perhaps one sub package per measure group, keep a key to be
               used for identifying "change" scenarios.  Remember close dates.
            c) slide any "Today" tables to "Yesterday" for delta scenarios, maybe one sub package per scenario.
            d) record new facts and their dim expectations with new pks that start after highest existing one, maybe
               new sub package for every measure group.  Remember close dates.  
            e) use all previous dim related data to record and process dim changes, but wait till all facts are done
               running
            f) resolve fks on pre recorded and new facts and process.  Dont use Identity on Fact pks as they've already
               been assigned in what looks like it will be a staging table

    16) notifications, good and bad

    17) anything that might support rerunning X number of days of ETL to bring us forward from a known problem
        point to current on one or more objects


    Does anybody have ideas/strategies/experiences/articles they'd like to share?

    • Edited bydb042188 Monday, September 14, 2009 1:53 PMblurb about ssas forum no longer true
    • Moved byEric WisdahlModeratorWednesday, September 16, 2009 1:13 PMMore appropriate for the data warehousing forum? (From:SQL Server Integration Services)
    •  

All Replies

  • Monday, September 14, 2009 1:44 PMSudeep Raj Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    For SSAS put ur post in this forum:
    http://social.msdn.microsoft.com/Forums/en-US/sqlanalysisservices/threads
    Hope this helps !!
    Please close the threads once answered

    Sudeep      My Blog
  • Monday, September 14, 2009 1:54 PMdb042188 Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    its been there for days with no response and very few views.  I just moved it here.  I removed the blurb about thinking the ssas forum was the better choice.
  • Wednesday, September 16, 2009 6:59 AMDuane DouglasModeratorUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    this forum probably isn't a good place to seek advice on high-level etl design.  as you probably already know, entire books have been devoted to etl design...while this forum focuses on the nuances of ssis...with etl as a backdrop, not a focal point.

    anyway, microsoft published an etl design whitepaper in 2005 for its "project real" case study, which implements a star schema: http://msdn.microsoft.com/en-us/library/cc966422.aspx . bear in mind that this document pertains to ssis 2005, not ssis 2008.  this is an important distinction because ssis 2008 includes a number of improvements over ssis 2005 which may simplify your etl design.

    your original post is a valid one -- i just haven't encountered an appropriate place to engage in such a discussion.

    hth

    Duane Douglas, MCAD, MCITP | http://www.ssisbi.com | Please mark the post(s) that answered your question.

    SSIS Business Intelligence

  • Wednesday, September 16, 2009 1:16 PMEric WisdahlModeratorUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    I have moved this to the Data Warehousing forum as it seems like the more appropriate place for this discussion.  In all reality, I'm not positive that this is the best place either.  As Duane said, this is a very broad question that entire books (even libraries) have been written.
    Please mark answered posts. Thanks for your time.
  • Wednesday, September 16, 2009 5:33 PMdb042188 Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    there's a lot of info in the Project Real documentation.  I've met one of the main players and he is very impressive.  These docs are written mostly at a micro level but I did take away the following:

    1) be prepared for daily , weekly etc schedules
    2) consider the "inferred member" approach for dimension "place holding"
    3) use schemas to help manage the different roles your tables will play
    4) use views to assist in olap processes
    5) use stored procs and udf's as necessary
    6) store all of the packages in one solution
    7) consider "load group" components that orchestrate dim and fact packages
    8) use naming conventions for components (eg 3 letter component type prefix followed by 3-4 word description)  
    9) top to bottom, left to right
    10) various levels of auditing (etl load ids are germane to some if not all of this) 
    11) leverage ON ERROR 
    12) store etl load ids on fact records
    13) separate package for each dim that has its own source
    14) leverage config features 

    Their suggested flow seems to go like this (from the technical link):

    1) check for consistency of all dims and facts
    2) convert data
    3) check dim changes and handle them
    4) insert new dims when they arrive
    5) create partitions
    6) insert facts
    7) maintain audit

    ...but maybe this isnt really their chronology, after all they maintain audits in a variety of places, some early in pkgs, some late and some in between other components.

    So, in hopes of starting some discussion, does anyone out there subscribe to the idea of dealing with all (historic and new, across all measure groups) facts to record dim "expectations" before dealing with dim updates? Perhaps as a more flexible approach to etl architecture?  Thus allowing 1) unused dim deletes and 2) a "vote" from every measure group on shared dimensions before dimensions are dealt with.       

  • Tuesday, October 20, 2009 7:53 PMdb042188 Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    the template we're going to try goes like this....

    1st layer  runs a pre audit

    2nd  layer gets control dates, high current fact PKs etc, stores them in variables.  This is a sequence container with parallel  EXEC SQL TASKS (est) right now. 
     
    3rd layer slides what was "today" tables to "yesterday" etc.  This is for shops that require a Delta approach to recognizing new facts.  It is a sequence container that contains parallel EXECUTE PACKAGE TASKS (ept).  There is one ept for each measure group.   
       

    4th layer cuts new facts to a stage area.  It is a sequence container containg multiple ept's, one for each measure group.
     

    5th layer cuts old facts to same stage area.  It is a sequence container containg multiple ept's, one for each measure group.

    6th layer records dim expectations.  It is a sequence container containg multiple ept's, one for each dim.

    7th layer  builds dims and deletes "orphaned" dim rows.  It is a sequence container containg multiple ept's, one for each dim.

    8th layer resolves fks on facts.   It is a sequence container containg multiple ept's, one for each measure group.

    9th layer processes dims update.

    10th layer processes facts full.

    11th layer runs post audit.

    12th layer notifies.