Composite keys

    Question

  •  

    Hi, I'm new to Analysis Services and have been reading the Kimball book in addition to the book by Melomed, Gorbach, Berger and Bateman. Unfortunately, none of these books adequately explains the role of composite keys within multi-dimensional modelling. Coming from a relational background, I know conceptually what a composite key is, although the only time I have ever used one was when specifying the primary key for an associative table bridging a many-to-many relationship.

     

    I would greatly appreciate it if someone could provide an explanation, preferably with examples, illustrating the implications of failing to specify, or incorrectly specifying, the composite key for a dimension. I would also appreciate any general rules of thumb and guidelines on when to use composite keys (especially situations where they MUST be used), and what the implications of using them are for the correctness of queries and for the data hierarchies themselves.

     

    I also understand the obvious performance penalty involved, but I am more interested in the implications for correctness, along with some simple examples.

     

    I have searched for layman's examples on Google to no avail and, in general, am disappointed in the literature on Analysis Services available on the market. I apologize that this is a relatively simple question, but I'm having a hard time adjusting to thinking in "multi-dimensional" terms.

     

    Thank you kindly!

    Phil

    Monday, June 02, 2008 8:41 PM

All replies

  •  

    Hi Phil, I have never come across a situation where we need to create a composite key for a dimension. Within the context of (multi-)dimensional modeling, I think the discussion about composite keys has always been about the fact table. One side argues that the primary key of the fact table should be the composite of the dimension key columns. The other side argues for defining a new column, called a 'fact key', as the primary key of the fact table. The latter approach is especially useful (in terms of performance) on certain database platforms such as SQL Server (because of its clustered index implementation) and Teradata (because of its Primary Key concept* - see below). I think Bob Becker from the Kimball Group once wrote up both sides of the argument; I read the article on the Kimball Group's web site, but sorry, I don't have the URL. He also mentioned a functional use of the fact key column: it enables self-reference within the fact table, i.e. one fact table row can refer to another.
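    To make the two sides concrete, here is a minimal sketch in T-SQL. The table and column names are illustrative only (not from any article); the point is the shape of the primary key in each design:

    ```sql
    -- Design 1: primary key is the composite of the dimension key columns
    create table fact_sales_composite (
        date_key     int not null,
        product_key  int not null,
        store_key    int not null,
        sales_amount money,
        constraint pk_fact_sales_composite
            primary key (date_key, product_key, store_key)
    );

    -- Design 2: a surrogate 'fact key' column as the primary key.
    -- This also enables one fact row to reference another.
    create table fact_sales_surrogate (
        fact_key         int identity(1,1) primary key,
        date_key         int not null,
        product_key      int not null,
        store_key        int not null,
        related_fact_key int null references fact_sales_surrogate (fact_key),
        sales_amount     money
    );
    ```

    Note that in neither design does a dimension table get a composite key; the question only arises on the fact side.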

     

    As far as I'm aware, a dimension should have a surrogate (meaningless) key, that is, an incremental/identity int or bigint (assuming we use the SQL Server platform). There is one exception to this rule, which I think either Ralph Kimball himself or one of the Kimball Group associates mentioned in one of their monthly articles: the time dimension having a smart (non-surrogate) key. The main argument is partitioning (this is specific to SMP and doesn't apply to MPP platforms). I mean physical horizontal table partitioning, not vertical partitioning (e.g. col1 to col10 on table1, col11 to col20 on table2), and not view partitioning (as in SQL 2000). Because the fact table is partitioned by date, there is a need to define the date key on the fact table (and therefore on the date dimension) as a smart key, for example 20080613 for today. This enables us to define partition boundaries. Take the SQL 2005 platform (it works principally the same way on Oracle 10g, just with different syntax): a date key in YYYYMMDD format enables us to define left/right range points for the partition function, e.g.:

    create partition function pf_campaign_subscription (int)

    as range left for values

    (20080101, 20080201, 20080301, 20080401, ...)

    Compare this with the alternative surrogate key solution:

    create partition function pf_campaign_subscription (int)

    as range left for values

    (10365, 10396, 10425, 10456, 10486, 10517, 10547, 10578, 10609, 10639, 10670, 10700, 10731)

     

    That is the only exception to the surrogate key concept for dimension tables that I know of (it may well be that there are other valid implementations of 'smart' keys that I don't know about). Even for that one exception, is there perhaps a need to split the date key into two, e.g. YYYYMM and DD? I don't know. In the insurance world we have Risk Number and Declaration Number, and in the retail banking world we have Sort Code and Account Number, but those are natural keys. We don't use these columns as the primary keys of dimension tables; we create data warehouse surrogate keys. Of course, if you have to create an OLAP cube from an ODS, that's a different story, as the PKs of ODS tables are natural keys.
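    As a sketch of the usual pattern (column names are illustrative, based on the retail banking example above): the dimension gets a meaningless identity key, and the multi-column natural key is kept as ordinary attributes rather than being promoted to a composite primary key:

    ```sql
    -- Surrogate key dimension: the natural key (sort code + account number)
    -- stays as attributes; the identity column is the primary key.
    create table dim_account (
        account_key  int identity(1,1) primary key,  -- surrogate key
        sort_code    char(6)     not null,           -- natural key, part 1
        account_no   char(8)     not null,           -- natural key, part 2
        account_name varchar(100)
    );
    ```

    A unique index on (sort_code, account_no) can still enforce the natural key without making it the dimension's primary key.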

     

    The case of having date and time in one dimension table has been discussed a few times by Ralph Kimball and his associates, and it has always ended with separating the two into different dimensions, i.e. a date dimension and a time-of-day dimension. Sometimes it is better to replace the time-of-day dimension with a timestamp column on the fact table, depending on the purpose and application. So no, we don't have a case here where the dimension key is YYYYMMDDHHMMSS.

     

    But then I think about the telecom industry: a phone call fact table where we could have thousands of calls every second, especially in a large country such as the USA. Would we define a datetime dimension with a YYYYMMDDHHMMSS dimension key for partitioning reasons? No, it's still better to split it into two dimensions: a date dimension with a YYYYMMDD key and a time-of-day dimension with an HHMMSS key. Only the date key is used for partitioning; the time key is not. Or, alternatively, define the time as a timestamp column on the fact table.
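    The two alternatives for the phone call example can be sketched like this (table and column names are illustrative, not from any reference design):

    ```sql
    -- Split design: separate smart keys for date and time of day
    create table fact_phone_call (
        date_key     int not null,  -- e.g. 20080613 (YYYYMMDD); used for partitioning
        time_key     int not null,  -- e.g. 214505  (HHMMSS); not used for partitioning
        caller_key   int not null,
        duration_sec int not null
    );

    -- Alternative: keep only the date key and store the call time
    -- as a datetime column directly on the fact table
    create table fact_phone_call_ts (
        date_key      int not null,  -- still a smart key, for partitioning
        call_datetime datetime not null,
        caller_key    int not null,
        duration_sec  int not null
    );
    ```

    Either way, the partition function ranges over date_key alone, exactly as in the pf_campaign_subscription example above.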

     

    So, no, Phil, I can't think of any case where a dimension table would have a composite key.

     

    Kind regards,

    Vincent

     

    *Teradata's definition of Primary Key is different from that of relational modeling. In the Teradata world, because it is a Massively Parallel Processing architecture, the Primary Key is used for data distribution, i.e. to distribute data as evenly as possible across all nodes.

    Friday, June 13, 2008 9:15 PM