one size fits all architecture for star's etl
is there a whitepaper somewhere that describes a flexible architecture for a star's etl? I have many goals I'd
like to accomplish with the architecture of my etl but dont necessarily see anybody discussing them on a macro level.
Here are some of my goals:1) I may wish to update previously recorded facts with dimensions like "close date".
2) I may wish to delete instances of some dimensions that are no longer in use
3) I may be stuck to MOLAP for a while, but dont want to introduce obstacles to ROLAP and HOLAP.
4) Types I and II SCDs are something I want to account for
5) Because various measure groups share conformed dimensions, I'd like some kind of precedence between dim
recording process(es) and all fact processing that depends on those conformed dims6) I'd like sufficient decoupling to target/run one object's ETL if necessary
7) I'm sometimes forced to deal with "deltas" (change between yesterday and today) in the absence of transactions
that record changes in the OLTP, maybe that issue will dictate how I architect things8) Maybe I should consider generating all/some of the ETL automatically, perhaps based on metatdata about my realtionships
9) Wherever Incremental Processing Options make sense, I want to use them, but maybe with an option to do full
now and then10) I want to append time dims periodically as I approach boundary conditions. For instance, I presently limit
my time dim entries to no later than one year from today. So I'd want to insert new rows there maybe once a
month. I do this to limit the perfomance issues I believe are present in "In Flight Transaction" scenarios
when using MDX. Unfortunately I use the max time dim for accounts whose close date is really null, so I'd
probably want to update "open" facts across the board with a new close date any time this boundary changes.11) Reusability (for instance between new fact realtionships and old fact relationship expectations/changes)
12) Flexibility
13) Server and DB flexibility
14) maybe restartability
15) I think I'd like a layered/precedence approach that looks somethingh like this:
a) somehow record anything of a Type II nature in a sub package, perhaps one subpackage per Type II dim.
b) record existing fact dim "expectations", perhaps one sub package per measure group, keep a key to be
used for identifying "change" scenarios. Remember close dates.
c) slide any "Today" tables to "Yesterday" for delta scenarios, maybe one sub package per scenario.
d) record new facts and their dim expectations with new pks that start after highest existing one, maybe
new sub package for every measure group. Remember close dates.
e) use all previous dim related data to record and process dim changes, but wait till all facts are done
running
f) resolve fks on pre recorded and new facts and process. Dont use Identity on Fact pks as they've already
been assigned in what looks like it will be a staging table16) notifications, good and bad
17) anything that might support rerunning X number of days of ETL to bring us forward from a known problem
point to current on one or more objects
Does anybody have ideas/strategies/experiences/articles they'd like to share?- Edited bydb042188 Monday, September 14, 2009 1:53 PMblurb about ssas forum no longer true
- Moved byEric WisdahlModeratorWednesday, September 16, 2009 1:13 PMMore appropriate for the data warehousing forum? (From:SQL Server Integration Services)
All Replies
- For SSAS put ur post in this forum:
http://social.msdn.microsoft.com/Forums/en-US/sqlanalysisservices/threads
Hope this helps !!
Please close the threads once answered
Sudeep My Blog - its been there for days with no response and very few views. I just moved it here. I removed the blurb about thinking the ssas forum was the better choice.
- this forum probably isn't a good place to seek advice on high-level etl design. as you probably already know, entire books have been devoted to etl design...while this forum focuses on the nuances of ssis...with etl as a backdrop, not a focal point.
anyway, microsoft published an etl design whitepaper in 2005 for its "project real" case study, which implements a star schema: http://msdn.microsoft.com/en-us/library/cc966422.aspx . bear in mind that this document pertains to ssis 2005, not ssis 2008. this is an important distinction because ssis 2008 includes a number of improvements over ssis 2005 which may simplify your etl design.
your original post is a valid one -- i just haven't encountered an appropriate place to engage in such a discussion.
hth
Duane Douglas, MCAD, MCITP | http://www.ssisbi.com | Please mark the post(s) that answered your question. - I have moved this to the Data Warehousing forum as it seems like the more appropriate place for this discussion. In all reality, I'm not positive that this is the best place either. As Duane said, this is a very broad question that entire books (even libraries) have been written.
Please mark answered posts. Thanks for your time. - there's a lot of info in the Project Real documentation. I've met one of the main players and he is very impressive. These docs are written mostly at a micro level but I did take away the following:
1) be prepared for daily , weekly etc schedules
2) consider the "inferred member" approach for dimension "place holding"
3) use schemas to help manage the different roles your tables will play
4) use views to assist in olap processes
5) use stored procs and udf's as necessary
6) store all of the packages in one solution
7) consider "load group" components that orchestrate dim and fact packages
8) use naming conventions for components (eg 3 letter component type prefix followed by 3-4 word description)
9) top to bottom, left to right
10) various levels of auditing (etl load ids are germane to some if not all of this)
11) leverage ON ERROR
12) store etl load ids on fact records
13) separate package for each dim that has its own source
14) leverage config features
Their suggested flow seems to go like this (from the technical link):
1) check for consistency of all dims and facts
2) convert data
3) check dim changes and handle them
4) insert new dims when they arrive
5) create partitions
6) insert facts
7) maintain audit
...but maybe this isnt really their chronology, after all they maintain audits in a variety of places, some early in pkgs, some late and some in between other components.
So, in hopes of starting some discussion, does anyone out there subscribe to the idea of dealing with all (historic and new, across all measure groups) facts to record dim "expectations" before dealing with dim updates? Perhaps as a more flexible approach to etl architecture? Thus allowing 1) unused dim deletes and 2) a "vote" from every measure group on shared dimensions before dimensions are dealt with. - the template we're going to try goes like this....
1st layer runs a pre audit
2nd layer gets control dates, high current fact PKs etc, stores them in variables. This is a sequence container with parallel EXEC SQL TASKS (est) right now.
3rd layer slides what was "today" tables to "yesterday" etc. This is for shops that require a Delta approach to recognizing new facts. It is a sequence container that contains parallel EXECUTE PACKAGE TASKS (ept). There is one ept for each measure group.
4th layer cuts new facts to a stage area. It is a sequence container containg multiple ept's, one for each measure group.
5th layer cuts old facts to same stage area. It is a sequence container containg multiple ept's, one for each measure group.
6th layer records dim expectations. It is a sequence container containg multiple ept's, one for each dim.
7th layer builds dims and deletes "orphaned" dim rows. It is a sequence container containg multiple ept's, one for each dim.
8th layer resolves fks on facts. It is a sequence container containg multiple ept's, one for each measure group.
9th layer processes dims update.
10th layer processes facts full.
11th layer runs post audit.12th layer notifies.



