none
Data warehouse Fact table Design help RRS feed

  • Question

  •  

     

     

    Hi All,

     

    We have a Data warehouse Fact table design issue.

     

    Each user will have a random set of FieldName and FieldValue. Below is how data looks in our DWH Staging area.

     

    UserID                  FieldName                          FieldValue

    1                              Gender                                M

    1                              Language                             English

    1                              StartDate                             03/07/2008

    1                              Region                                  East

    1                              zipcode                                12345

    2                              State                                     NY

    2                              PhoneNumber                  1234567890

    3                              Enddate                               03/04/2008

    3                              Language                             Spanish

    ………………………………………………

    ……………………………………………..

     

    If you count distinct FieldName it will be nearly 2000. What is the correct design in Data warehouse for this data.

     

    I have two options in my mind.

     

    Option1:

     

    I can create a fact table with the same structure as in DWH staging(as mentioned above ) but the fact table ends up with nearly 3 or 4 billion records. I am afraid about the performance of such a huge table having non-fact data.

     

    Option 2:

     

    I am thinking to have a dimension with all distinct field names and pivot the fact table with one record per userid with multiple columns.

     

    Something like this…

     

    DimID    FieldName

    1                     Gender

    2                     Language

    3                     Startdate

    4                     Region

    5                     Zipcode

    6                     State

    7                     PhoneNumber

    8                     Enddate

    ………………………….

     

     

    Fact Table

     

    Userid    01          02               03               04           05             06          07                08  …..2000

    1             M            English     03/07/2008    East         12345       NULL     NULL               NULL

    2             NULL     NULL         NULL                NULL       NULL        NY          1234567890  NULL

    3             NULL     Spanish    NULL                NULL       NULL        NULL     NULL               03/04/2008

     

     

    The problem here is the fact table is going to have nearly 2000 columns with varchar(50) each. I am afraid that this table will reach multiple page allocations and will affect the query performance to a great extent.

     

    Which option do you suggest? or Is there any other way of designing this kind of data in DWH?

     

    Any help is greatly appreciated

     

    Thanks

    Kambala

     

     

     

     

     

     

     

     

     

     

    Thursday, May 5, 2011 7:18 PM

Answers

  • It doesn't seem like you have any particular facts to track - and the "staging" area design is particularly unappealing.

    Do you expect these characteristics of userids to change over time, and what kind of tracking will your users want on these changes?  Will they want to know when certain columns have changed, or will they want changes made retroactively?

    As a starting point, my suggestion would be to treat this data as a single dimension and/or "factless fact table".  More like your option 2.  However, 2000 columns is prohibitive in just about any table design - even a data warehouse dimension.  I would suggest you break up the columns into groupings that would tend to be queried together, or that are otherwise related to one another logically. 

    For example, group your "region", "state", and "zipcode" columns together in one table - called "DimUserLocation".  Group your "gender", "name", and "birthday" type columns in another table called "DimUserIdentity".  You may end up with ten to twenty such tables.  Each table would need to have the business key as a set of columns, as well as an independent surrogate key.  (This means that the surrogate key value for "Bob" in DimUserLocation would NOT be related to the key value for "Bob" in DimUserIdentity.)  You may then need a factless fact table to relate the surrogate keys together if queries would use attributes from separate tables.


    Todd McDermid's Blog Talk to me now on
    Monday, May 9, 2011 5:40 PM

All replies

  • You can't design a data warehouse with that information.  The information you need is:

    What questions are users wanting to answer with this data?

    Only then can you determine if and how you can design your fact and dimension tables.


    Todd McDermid's Blog Talk to me now on
    Friday, May 6, 2011 3:25 PM
  • There are many query requests around this data. Most general one is reporting all users and their fieldname and values in a pivoted form ex: all filednames are column names and thier values as rows in the column.

    In other cases, we also want to filter the data using these fieldname, values as dimensions mostly using SSRS.

    Any thoughts on how can we handle this?

     

    --Kambala

    Monday, May 9, 2011 1:12 PM
  • It doesn't seem like you have any particular facts to track - and the "staging" area design is particularly unappealing.

    Do you expect these characteristics of userids to change over time, and what kind of tracking will your users want on these changes?  Will they want to know when certain columns have changed, or will they want changes made retroactively?

    As a starting point, my suggestion would be to treat this data as a single dimension and/or "factless fact table".  More like your option 2.  However, 2000 columns is prohibitive in just about any table design - even a data warehouse dimension.  I would suggest you break up the columns into groupings that would tend to be queried together, or that are otherwise related to one another logically. 

    For example, group your "region", "state", and "zipcode" columns together in one table - called "DimUserLocation".  Group your "gender", "name", and "birthday" type columns in another table called "DimUserIdentity".  You may end up with ten to twenty such tables.  Each table would need to have the business key as a set of columns, as well as an independent surrogate key.  (This means that the surrogate key value for "Bob" in DimUserLocation would NOT be related to the key value for "Bob" in DimUserIdentity.)  You may then need a factless fact table to relate the surrogate keys together if queries would use attributes from separate tables.


    Todd McDermid's Blog Talk to me now on
    Monday, May 9, 2011 5:40 PM
  • As a starting point, i suggest you to treat this data as a single dimension. I would advic you to break up the columns into groupings that would resulting to be queried together, or that are otherwise related to each another logically.
    Tuesday, August 23, 2011 10:39 AM