Should I rebuild this fact table every day?
We have a problem with one of our fact tables. One of the sources of data for this fact table allows retrospective data entry (e.g. this sale was completed 2 months ago) and modification of historical data (e.g. this code "ABC" last week was really an "XYZ"). The biggest problem is that our source database doesn't have any auditing so potentially anything can change any time without us knowing.
When we designed this fact table, we weren't aware users were making retrospective edits. We're simply just adding whatever activity has happened on the day to the fact table. It was a poor assumption on our part. As a result, we're missing a significant amount of fact data and some of it is incorrect.
So the question is, should I just rebuild the whole table every night at the end of the ETL? Or should I put in some change detection logic? For reference, this fact table is just over 4 million records and should grow roughly 20-30k records each day. Is this a "fully rebuildable" size?
Another thing is that there are other columns in this fact table that aren't subject to this problem. Should I just rebuild these as well or split it into seperate tables?
Any opinions or guidance much appreciated.
Answers
There's no need to apologize for how your business operates - those are business processes that need to be modeled. Unless you can give the business a better process to work from that will still solve the issues they face, the current business model is "correct".
Now, the only thing left in the decision making process is whether or not an account can be in the same phase for two different periods of time. In my example of sales orders, it's not possible for an order to be "pending payment after shipment is completed" twice, in two different distinct time periods. For an "account", I feel differently. You may suspend an account for a couple weeks one year, then reinstate it, then suspend it again (for a different reason) later.
If your accounts don't behave like I've described my sales orders, then an accumulating snapshot fact table won't work for you. Your requirements are particularly difficult - but again, don't seem that alien. I would recommend reading through Kimball Design Tip #35, then go over to the Kimball Forums to present your case. When you do - don't use "phase 1" type language. Describe your dimension as an account dimension, and the attribute you want to track as a "status", and that retroactive changes to status are made.
- Marked As Answer byJin ChenMSFT, ModeratorMonday, October 19, 2009 3:24 AM
All Replies
- It sounds as though you don't actually have a well-structured fact table - coming from a Kimball methodology. The short answer is that you should not have to rebuild your fact table every night, and adding "pre-dated" entries to the fact table at any time should be just fine.
The reason I say it doesn't sound like you have a well-structured fact table (or in fact, a data warehouse at all) is that changes to "codes" should have NO impact on the fact table. Why? Because you shouldn't be storing those codes in the fact table. They should be in dimension tables, linked to the fact table by surrogate (meaningless integer) keys. The dimension tables handle changes to the "codes" in the way you need them handled - by being what's called "slowly changing dimensions".
Perhaps I've misunderstood your question - some sample data would surely help - but I think you should read "The Data Warehouse Toolkit" by Kimball and Ross. It's "the bible"...
- Thanks for the response.
I think I've explained the problem badly.
The problem with the backdated transactions is that I don't know how to detect them. Each day, the fact table is appended to with stuff that has an entry date of today. The problem is that someone can go in today and input an entry date of last week. Unfortunately, we didn't realise users could do that when we designed the ETL.
The other problem is that a code I used before to calculate a fact can change. We're not storing these in the fact table, it's something that is used to determine a fact calculation. For example, I might have a rule that in the ETL that says if it was code ABC, then perform some calculation but if it was XYZ, then do a different calculation. The problem is that if someone goes into the source db and changes the history so the ABC is now XYZ, the fact is now incorrect.
How should I redo this fact table to cater for the above two problem?
Any guidance much appreciated. - The fact that the users can "backdate" transactions should be irrelevant - unless your problem is that you're basing your "differential extract" based on that transaction date. (Differential extract meaning - "just give me the rows that were added to the source system since the last time I extracted".) If that's the case, then you need to add/find some other mechanism to mark the rows you've extracted. The simplest thing to do would be to add a date/time column to the source data that has a default value of "now" - something you could (technically) easily add in a SQL Server hosted DB... but that may be problematic if you don't have control over the source system. If you can't alter the source system, then your preferred recourse is to determine a compound key for the source data - a subset of the columns from the source that uniquely identifies each row. If you can get that - 100% reliably! - then you can (ugh) to an intersection comparison of your current fact table with the source data to get the "rows you need to extract". You may be able to reduce the comparison operation if you can say (with 100% guarantee) that backdates will only "go back" a few months or so, so you don't have to recheck EVERY key. Failing that, a full reload is your only option.
For your second problem - it sounds like you're hardcoding dynamic calculation rules into your ETL. Some calculation rules are OK to be hardcoded in ETL if the source system has "hardcoded" them. For example, if the source system fills a "customer type" column with a code from an internal (unmodifiable by any end-user) code, then your only risk is when you receive an update from the source system vendor - they may change code structure or meaning, but that likely means your ETL has to be reexamined/redesigned. If the "code" values are stored in a dynamic list, such as a "customer type" may be, then it's possible that those codes may change as business processes change. If you're saying that a fact from a year ago "would have been calculated differently" now that you know such a code has been changed - then it's probably best if you try to store a "base" value only in the fact table, and store some kind of "factor" or other dynamic property in a dimension table. Your end-user "fact" would then have to be a calculated value based on that "base" (unchanging) fact, and an attribute from the dimension table.
Hope that helps...
- Actually, I like using ROWVERSION as a datatype (a.k.a. TIMESTAMP) better for this purpose because it automatically updates for each insert/update.
http://technet.microsoft.com/en-us/library/ms182776.aspx
That would be my preferred method assuming you can add the column to the table without impacting the application.
Phil Brammer | http://www.ssistalk.com Thanks Phil, I was not aware of ROWVERSION, that's extremely helpful.
I've made some progress on this by changing my fact table and ETL moving the calculation logic to report level, as per your suggestion. However, I'm only halfway there and I found some new related problems. I think I should explain with more concrete examples. Sorry, this is a bit of a long post.
In the source database, there is a table which records the lifetime of an account and the "phases" it goes through (e.g. open, closed, paused, on hold, etc). This table is populated automatically as account phases change. In fact, it looks a lot like a slowly changing dimension:
tblPhase
AccountRef, Phase, From, To
AAA, 1, 10-Jan-09, 10-Feb-09
AAA, 2, 11-Feb-09, NULL
BBB, 1, 12-Jul-08, 01-Jul-09
BBB, 3, 02-Jul-09, NULL
... etc ...
I have an account dimension, dim_account, with Phase as an type 2 attribute. It isn't the only type 2 attribute but if it was, dim_account would be an exact mirror of the tblPhase.
One of the requirements of the business is that they need to be able to report on the number of accounts and length of time of phases at any point in time. So what I did was build a fact table that week level (which is the level we report at) and it looks a bit like:
fact_phase
WeekKey, AccountKey, PhaseKey, Days
1, 1, 1, 7
2, 1, 1, 7
3, 1, 2, 3
3, 1, 1, 4 (a phase changing midweek)
... etc ...
This table is being populated as each week happens. Everything was working fine until I discovered that people can change the Phase table manually in three ways:
1. Change the Todate of a phase.
2. Change the phase code itself.
3. Add in a new phase retrospectively.
Remember I said earlier that my account dimension is a mirror of the phase table? If the Phase table changes retrospecitvely, it impacts both dim_account and fact_phase if, for example, an existing phase was shortened and a new one was slotted in.
A further problem is that I have another slowly changing dimension which contains the Account dimension key as a foreign key, which is also a type 2 attribute. (Was this a mistake?) So if the datefrom and dateto changes in dim_account, I will need to rekey the related dimension which means I have to also rekey facts related to that dimension. What a mess.
Given all these problems / issues, I think I have three options:
1. Rebuild fact table completely every night, remove phase type 2 attribute from the account dimension, delete the extra dimension records generated by this attribute and run an update to rekey all the impacted fact tables.
2. Modify the ETL to put in change detection on the phase table to pick up changed dates and phases. (e.g. using something like ROWVERSION) And then apply those changes to dim_account and fact_phase and all other impacted fact tables.
3. Some combination of 1 and 2.
All the options feel wrong to me which makes me think that original design is wrong.
Any further insight much appreciated.Without completely understanding what you've written - although I thank you very much for being verbose and complete, it's refreshing :) - this is what I'd suggest.
First, you only need the one account dimension. Have your phase attribute be a type 2 attribute in it, and leave it at that.
Second, you need a fact table called an Accumulating Snapshot Fact Table (See The Data Warehouse Toolkit p128+). It would look something like this:
AccountRef Phase1_date_start Phase1_date_end Phase2_date_start Phase2_date_end Phase3_date_start Phase3_date_end AAA 2009-01-10 2009-02-10 2009-02-11 NULL NULL NULL BBB 2008-07-12 2009-07-01 NULL NULL 2009-07-02 NULL
That structure allows you to easily construct "lag" measurements (number of days).
- Thanks for the response Todd.
Yes, sorry for the brain dump. It all makes sense in my head but it doesn't always come out the right way. And I need to get myself a copy of that book, "The Data Warehouse Toolkit". Many people make references to it.
I understand your Snapshotting suggestion and it would certainly work. I'm guessing there should be a snapshot DateKey as a column too. My only concern is that if we decided to add a new type of phase I'd have to add more columns. But I can accept that because it doesn't happen very often.
The bit that I'm still unclear about is keeping the phase attribute in dim_account as a Type 2.
E.g. if I have this data in tblPhase
tblPhase
AccountRef, Phase, From, To
AAA, 1, 10-Jan-09, 10-Feb-09
AAA, 2, 11-Feb-09, NULL
my dimension would look like this:
dim_account
AccountKey, phase, AccountRef, DateFrom, DateTo
1, 1 AAA 10-Jan-09, 10-Feb-09
2, 2 AAA 11-Feb-09, NULL
This is fine until Joe User comes along today and changes the history in tblPhase to:
tblPhase
AccountRef, Phase, From, To
AAA, 1, 10-Jan-09, 10-Feb-09
AAA, 2, 11-Feb-09, 10-Jul-09
AAA, 3, 11-Jul-09, 20-Jul-09
AAA, 2, 21-Jul-09, NULL
If this happens, the accumulating snaphot for the fact_phase would be fine. But what do I do with dim_account?
EDIT: I just thought of the answer (I think). Maybe I should put the phase dates as attributes of the dimension too.
EDIT2: I realise I'm talking nonsense, the above doesn't make sense. So I'm still stuck. - You are incorrect in assuming that the Snapshot Fact table would have a DateKey column. That doesn't make sense for it - the relevant dates are in the facts themselves. You could theoretically keep the date facts as FKs to a date dimension - but that typically doesn't help you out when calculating derivative facts (like # days span). It may help you if you need to count "working days" - but only in cases where you have an unusual calendar that may vary by division within your company.
And yes, the Data Warehouse Toolkit is a "bible" as far as I'm concerned. You can also take courses from the Kimball Group - they are fantastic.
For your dimension table - does your source OLTP table look like this:
AccountRef, Phase, From, To
AAA, 1, 10-Jan-09, 10-Feb-09
AAA, 2, 11-Feb-09, NULL
Or does it look like this:
AccountRef, Phase
AAA, 1
Without completely understanding what you've written - although I thank you very much for being verbose and complete, it's refreshing :) - this is what I'd suggest.
First, you only need the one account dimension. Have your phase attribute be a type 2 attribute in it, and leave it at that.
Second, you need a fact table called an Accumulating Snapshot Fact Table (See The Data Warehouse Toolkit p128+). It would look something like this:AccountRef Phase1_date_start Phase1_date_end Phase2_date_start Phase2_date_end Phase3_date_start Phase3_date_end AAA 2009-01-10 2009-02-10 2009-02-11 NULL NULL NULL BBB 2008-07-12 2009-07-01 NULL NULL 2009-07-02 NULL
That structure allows you to easily construct "lag" measurements (number of days).

What I don't like about this is the constraint on how wide the table needs to be and the changes that may occur. Plus a table structure like this is hard to query because you have to work with multiple columns and structure your logic accordingly. Never mind when someone inserts a record (back-fills) for a period that is in the middle of two periods already in the table. Now you have to move data from a set of columns to another set. All in all, I see the logic in trying to implement such a structure very painful.
Phil Brammer | http://www.ssistalk.com | Twitter: http://twitter.com/PhilBrammer- I can only imagine that you've never had to service a query such as "calculate the average lag time of sales orders per customer group". Calculating lag times within a row is insanely easier than inter-row calculations. In order to answer the above, it's a simple calculated column to get a "lag days" for each row, then simple aggregation to get an average over a particular group. If the table were structured with one row per date or date period, it would require (at least) one preparatory pass by a set-based operation to calculate lag times, and be much more complicated.
In fact, the accumulating snapshot table is smaller than an equivalent "transaction" table if most (but not all) of the date fields are expected to be populated - which is typically the case. In most cases, the business isn't going to "invent" another event in this scenario - but even if it does, all we're talking about is appending columns that default to NULL, and populating from there. No changes to queries are required - which is probably not the case with the other table structure.
An accumulating snapshot table is well suited to lag calculations, I happen to agree with the Kimball Group on this one wholeheartedly.
- In *this* case it may work out well because a phase can change at most six times during a week. But what if a similar table is needed for monthly/quarterly?
The pains I'm talking about are from a maintenance perspective. I fully understand the performance/ease of use merits of such a structure. It's just problematic to maintain - especially on data that changes frequently.
It's good for aggregation tables that don't change much perhaps, but again, you have to be careful of how you *use* it. In other words, if the assumption is that the first set of columns (of time) must occur before the second set, when a user back-fills in a "phase" that started before the dates in the first set of columns, now you have to rebuild the row so that the dates are in order across the columns. Perhaps it's not a big deal...
Phil Brammer | http://www.ssistalk.com | Twitter: http://twitter.com/PhilBrammer - Simple example:
ID = 1
Start = 01/01/2009
End = 01/10/2009
ID = 1
Start = 01/11/2009
End = 01/17/2009
Insert those into Kimball's structure:
ID Phase1_Start Phase1_End Phase2_Start Phase2_End Phase3_Start Phase3_End Phase4_Start Phase4_End
1 01/01/2009 01/10/2009 01/11/2009 01/17/2009
Now a user makes some changes:
ID = 1
Start = 01/01/2009 <-- no change
End = 01/03/2009 <-- new end
Begin insert:
ID = 1
Start = 01/04/2009
End = 01/10/2009
Now what do you do with the arrangement of data?
Do you change the end of phase1 and then insert the new phase at the next available slot in the table?
ID Phase1_Start Phase1_End Phase2_Start Phase2_End Phase3_Start Phase3_End Phase4_Start Phase4_End
1 01/01/2009 01/03/2009 01/11/2009 01/17/2009 01/04/2009 01/10/2009
OR do you do this? (Correct end date, move existing "Phase2" to Phase3 and insert new phase)
ID Phase1_Start Phase1_End Phase2_Start Phase2_End Phase3_Start Phase3_End Phase4_Start Phase4_End
1 01/01/2009 01/03/2009 01/04/2009 01/10/2009 01/11/2009 01/17/2009
Phil Brammer | http://www.ssistalk.com | Twitter: http://twitter.com/PhilBrammer - Ah - I see where we're crossed up. You're missing the "phase" in the source data. Here's your scenario rephrased as I understand how the OP is having to deal with his data:
To start, he has this DB table:
ID Phase Start End
1 One 2009-01-01 2009-01-10
1 Two 2009-01-11 2009-01-17
He imports into his fact table and inserts the following row (assuming this is the first load ever):
ID PhaseOne_Start PhaseOne_End PhaseTwo_Start PhaseTwo_End
1 2009-01-01 2009-01-10 2009-01-11 2009-01-17
If the user makes some changes - the important fact is "to which row(s) in the source OLTP table"? Your confusion is easily answered when you require they identify the phase. If you don't identify that - then it doesn't matter what DW table structure they have - you're hosed attempting to infer what they mean. If they make this change:
ID Phase Start End
1 One 2009-01-01 2009-01-03
1 Two 2009-01-04 2009-01-17
Then you update the row in the fact table (because ID #1 already exists, and there's only one row per ID) to:
ID PhaseOne_Start PhaseOne_End PhaseTwo_Start PhaseTwo_End
1 2009-01-01 2009-01-03 2009-01-04 2009-01-17
But like I said - if they don't specify what phase(s) they're changing dates on, the DW structure is irrelevant. You're hosed.
- It seems I misunderstood what an accumulating snapshot was. I get it now - thanks.
The number of phases an account goes through can vary between 1 and anything. Does this mean that the fact table"grows" in columns to as wide as the highest number of phases in an account? E.g. if one account had 20 phases (not uncommon for some of the craziest accounts), then the table will have 40 date columns? And then if it went into the 21st phase, I'd have to create two new columns? (one for start and one for end) - Perhaps Phil understood your requirements better than I did.
The accumulating snapshot fact table works best when your phases are actual "things". How do I describe this? We've been presenting examples with "Phase One" and "Phase Two" because we don't know your business. In my business, we're a distributor. We don't deal in phases of anything, but we do have an accumulated snapshot fact table for our "orders". It looks (kindof) like this:
Order # Entered Confirmed First_Ship Last_Ship Invoiced Invoice_First_Payment Invoice_Paid_Full
1000 2009-01-10 2009-01-10 2009-01-11 2009-01-12 2009-01-12 2009-02-10 2009-02-10
The above table makes it easy for us to see if there's an "unacceptable" lag between the date the order is confirmed, and when it's completely shipped to the customer. Or from when shipment's complete, and when the invoice goes out. Or in how long the customer takes to pay.
But the salient point from my example is that those dates "mean something". They aren't "meaningless" like "Action1" and "Action2". The kind of table I'm describing from my experience lends itself well to an accumulated snapshot because the columns are usually very static. The business guys know what the "stages" or "events" are, and those won't change unless the business analysis need changes. When that happens, you may need to add columns to the table - but only to represent a new, specific event.
Are your "phases" specific, name-able events?

- Based on the OP's point #3 a few posts up: "Add in a new phase retrospectively", I see phase numbers being changed to accomodate the new phase, which could be between existing phases. As such, the 'view of reality' for a given [AccountRef] has just been altered and now impacts downstream tables.
It sounds like the phase numbers are generic, and have no real meaning, other than 1 comes before 2, 2 comes before 3, etc... And at any time a user can enter a new phase that goes anywhere (not just at the end, a la "insert only") along the timeline for a given account.
Phil Brammer | http://www.ssistalk.com | Twitter: http://twitter.com/PhilBrammer ... it doesn't seem like arbitrary phases would add any particular business value to a process...
Ha! I wish more systems were designed with that very thought of yours in mind...
Phil Brammer | http://www.ssistalk.com | Twitter: http://twitter.com/PhilBrammer- Hi Phil, Todd, thanks for all your helpful replies so far.
I realise I should've been clearer on something. The phases are not sequential phase numbers, they're PhaseID's for stuff like "on hold, active, inactive" etc. I mentioned this in my second post, but granted, it was a long brain dump so it's easy to miss. I used numbers 1,2,3... because they're arbitrary, but it seems to have confused things further because it implies sequence. Perhaps I should've just used the words. Sorry for leading you up the garden path, guys.
I agree that in a business sense, it's impossible to invent an event in the past... however, it still happens anyway where I am. It happens the most when people set accounts to the wrong phases or durations and needs to be corrected. e.g. "this account was not actually on hold for 2 weeks, it was only for one week". Or simply if someone just forgets e.g. "oops, I forgot to change this account from inactive to active last week, I need to do it now". It's really sloppy and a real pain. I know there are arguments for fixing the source and business processes but this whole backdating and retrospective editing has been going on for many many years and is completely ingrained in the culture. Plus, we don't own or maintain the source database and they will charge an absolute fortune for tightening it up.
So going back to this problem... I'm still stuck. :-)
My fact_phase table still looks like this:
fact_phase
WeekKey, AccountKey, PhaseKey, Days
1, 1, 1, 7
2, 1, 1, 7
3, 1, 2, 3
3, 1, 1, 4 (a phase changing midweek)
... etc ...
And I still have this problem with the type 2 attribute Phase in dim_account which I'm contemplating just removing (even though I feel it belongs there). There's no need to apologize for how your business operates - those are business processes that need to be modeled. Unless you can give the business a better process to work from that will still solve the issues they face, the current business model is "correct".
Now, the only thing left in the decision making process is whether or not an account can be in the same phase for two different periods of time. In my example of sales orders, it's not possible for an order to be "pending payment after shipment is completed" twice, in two different distinct time periods. For an "account", I feel differently. You may suspend an account for a couple weeks one year, then reinstate it, then suspend it again (for a different reason) later.
If your accounts don't behave like I've described my sales orders, then an accumulating snapshot fact table won't work for you. Your requirements are particularly difficult - but again, don't seem that alien. I would recommend reading through Kimball Design Tip #35, then go over to the Kimball Forums to present your case. When you do - don't use "phase 1" type language. Describe your dimension as an account dimension, and the attribute you want to track as a "status", and that retroactive changes to status are made.
- Marked As Answer byJin ChenMSFT, ModeratorMonday, October 19, 2009 3:24 AM
Now, the only thing left in the decision making process is whether or not an account can be in the same phase for two different periods of time. In my example of sales orders, it's not possible for an order to be "pending payment after shipment is completed" twice, in two different distinct time periods. For an "account", I feel differently. You may suspend an account for a couple weeks one year, then reinstate it, then suspend it again (for a different reason) later.
If your accounts don't behave like I've described my sales orders, then an accumulating snapshot fact table won't work for you. Your requirements are particularly difficult - but again, don't seem that alien. I would recommend reading through Kimball Design Tip #35, then go over to the Kimball Forums to present your case. When you do - don't use "phase 1" type language. Describe your dimension as an account dimension, and the attribute you want to track as a "status", and that retroactive changes to status are made.

Thanks Todd.
Yes, the account phases don't work like your described sales order. Accounts can leave a phase and re-enter it at a later date.
I'll swat up on the Kimball links and if I still can't work it out, I'll head over to the Kimball forums and state the problem as you advised.
Many thanks for the help. I'm sure it won't be the last time you hear from me!
Thanks again.


