locked
Dimensional desgin - allocating facts RRS feed

  • Question

  •      

               

                 

                              

            
    • Edited by kiwiNspain Thursday, April 11, 2013 7:05 PM
    Thursday, April 11, 2013 2:11 PM

Answers

  • Your discussion demonstrates the problem of writing in a casual manner and not stating clearly the relationships between the entities nor the definition of said entities.  Without that information, readers must assume which generally leads to misunderstanding. 

    First, you did not post actually DDL.  This means we have to guess what your actual (or proposed) primary keys are as well as the natural keys.  Based on your naming patterns, we can likely assume the relationships - but it is better to formally define them to avoid ambiguity (and to more easily identify errors or oversights on your end).

    You said "I have a 4 level process".  What exactly does that mean?  Is "from XMLs" important here?  "Process" is generally a term associated with work flow and not a data model.  The confusion starts early.   In addition, you did not explain the purpose of each table.  Why is this important?  Without knowing what your model represents and the real-world system on which it is based, it is impossible to answer questions like "is there another way to design the fact table".  I will also point out that you posted/discussed 4 tables (plural) but use the term "fact table" (singular).  After re-reading this a couple of times, perhaps the goal here is to generate some sort of datawarehouse, denormalizing in the process of doing that.  That generally is where the terms "fact" and "dimension" come into play - but such a discussion would be based on an existing database and it seems you are still modeling that. 

    Some other statements (and your picture) are also puzzling.  You said that you can sum cost and income (and what exactly does "income" mean?) at the billing detail level.  Your tables do not support that; cost is stored at the billing level, not the billing detail level. In addition, your comments about the billing detail table give an example of cost for a computer as the sum of the cost of the associated items - yet that does not match your tables.  Either that or that sentence fragment/example was intended to be some sort of column? 

    In any event, the time to worry about performance is AFTER you have established the correctness of your data model.  A poorly designed data model will generally lead to bad performance - and quite often bad data/measures.   As a general comment, the use of DISTINCT in any query should make you pause and reconsider something - the correctness of the query, the understanding of the model, or the correctness of the model.  This directive is often a big giant flag that something is wrong.  If, on the other hand, the data model is fixed at this point, then there are techniques which can be used to increase performance.  Just keep in mind that every benefit has a cost.  Faster select queries will generally cause slower insert/update/delete queries - as well as extra maintenance.

    Thursday, April 11, 2013 6:03 PM
  • Toro07, the text of your question went away. Should this thread be removed? I hate to, because Scott's answer is good :)

    >>  As a general comment, the use of DISTINCT in any query should make you pause and reconsider something - the correctness of the query, the understanding of the model, or the correctness of the model. <<

    And replacing it with a GROUP BY with no aggregation is not a replacement :)


    Louis

    Without good requirements, my advice is only guesses. Please don't hold it against me if my answer answers my interpretation of your questions.

    Thursday, April 11, 2013 9:41 PM

All replies

  • Your discussion demonstrates the problem of writing in a casual manner and not stating clearly the relationships between the entities nor the definition of said entities.  Without that information, readers must assume which generally leads to misunderstanding. 

    First, you did not post actually DDL.  This means we have to guess what your actual (or proposed) primary keys are as well as the natural keys.  Based on your naming patterns, we can likely assume the relationships - but it is better to formally define them to avoid ambiguity (and to more easily identify errors or oversights on your end).

    You said "I have a 4 level process".  What exactly does that mean?  Is "from XMLs" important here?  "Process" is generally a term associated with work flow and not a data model.  The confusion starts early.   In addition, you did not explain the purpose of each table.  Why is this important?  Without knowing what your model represents and the real-world system on which it is based, it is impossible to answer questions like "is there another way to design the fact table".  I will also point out that you posted/discussed 4 tables (plural) but use the term "fact table" (singular).  After re-reading this a couple of times, perhaps the goal here is to generate some sort of datawarehouse, denormalizing in the process of doing that.  That generally is where the terms "fact" and "dimension" come into play - but such a discussion would be based on an existing database and it seems you are still modeling that. 

    Some other statements (and your picture) are also puzzling.  You said that you can sum cost and income (and what exactly does "income" mean?) at the billing detail level.  Your tables do not support that; cost is stored at the billing level, not the billing detail level. In addition, your comments about the billing detail table give an example of cost for a computer as the sum of the cost of the associated items - yet that does not match your tables.  Either that or that sentence fragment/example was intended to be some sort of column? 

    In any event, the time to worry about performance is AFTER you have established the correctness of your data model.  A poorly designed data model will generally lead to bad performance - and quite often bad data/measures.   As a general comment, the use of DISTINCT in any query should make you pause and reconsider something - the correctness of the query, the understanding of the model, or the correctness of the model.  This directive is often a big giant flag that something is wrong.  If, on the other hand, the data model is fixed at this point, then there are techniques which can be used to increase performance.  Just keep in mind that every benefit has a cost.  Faster select queries will generally cause slower insert/update/delete queries - as well as extra maintenance.

    Thursday, April 11, 2013 6:03 PM
  • Toro07, the text of your question went away. Should this thread be removed? I hate to, because Scott's answer is good :)

    >>  As a general comment, the use of DISTINCT in any query should make you pause and reconsider something - the correctness of the query, the understanding of the model, or the correctness of the model. <<

    And replacing it with a GROUP BY with no aggregation is not a replacement :)


    Louis

    Without good requirements, my advice is only guesses. Please don't hold it against me if my answer answers my interpretation of your questions.

    Thursday, April 11, 2013 9:41 PM