locked
Adding a Varchar(max) column to a Data Warehouse fact table having 200 Million rows RRS feed

  • Question

  • Hi All,

    We have a fact table with 200 Million rows in our Data warehouse. Now we need to add a new column to this fact table which is of length varchar(max) (maximun length of the data in that column is around 2000) which doesn't change the grain of the fact table. Will this affect the performance of querying the other small columns in that fact table? If so to what extent? we are having a second thought of creating a new separate fact table of the same grain and adding this lengthy column to the new fact table so that we can use this new fact table only when we want to query this lengthy column.

    I am not sure whether a big fact table can hold such a lengthy column or not?

    Any ideas are greatly appreciated.

    Thanks

    Kambala

    Thursday, April 28, 2011 7:35 PM

Answers

  • Thanks for the info.

    No - you absolutely don't want to store "responses" as text in the fact table.  EVER.

    You DO want to make a dimension table.  And I think that if you looked at the text you'd be collecting from each call... you'd find that you have duplicates, which provides the opportunity to only store one row in the dimension table and reference that row several times from the fact table.

    There are two reasons to not place this text in the fact table which really apply to your situation.

    First, there are two types of columns in a fact table.  Foremost, there are things that are meant to be aggregated.  Call length, for example.  The second type of column contains "filter" type information to subdivide, group, or limit the rows you're adding up.  For example, which call center the call was handled in.  When you use those two types of columns together, you're able to answer questions such as "what was the average call duration for call center A compared to all other call centers?"  Storing a "response" plain-text column doesn't fit EITHER of those purposes.

    Second, there are hundreds of millions of rows in a fact table.  Every single one of them is necessary - you can't deduplicate or aggregate any of them, or you'll lose important granularity information.  Fact tables should therefore have as few columns as possible, with as small a data type as possible.  Why?  I/O.  If you have a row that contains four surrogate keys (date, time, employee, customer) and one fact (call duration in seconds), you have a 12 byte wide row (4 bytes for an INT key for date, 2 bytes for SMALLINT for time, employee and customer, and a SMALLINT for call duration).  That means a 100 million row table will be (at least) 1.2 billion bytes (1.2GB).  If you add on a CHAR(100) (not even a 1000, or MAX), now your row size is 112 bytes, or 11.2GB.  Answering the "what was the average call duration" question I posed above will now almost certainly require disk reads (unless your server has 16GB or more RAM, and even then, I would doubt it).  A disk read compared to a memory read is at least 100x slower.  So if your query on the 12 byte wide row took 1/2 a second, the query on the 112 byte wide row will take at least a minute.  Bump it up to 1000 characters...

    Now, I'm not certain how a VARCHAR(MAX) gets stored in a table - definitely off-row, but on-page?  I don't know.  Unless those VARCHARs get stored completely off-page (with just a reference pointer), then you are still saddled with reading them from disk, whether the CPU has to process them or not.  Even if they are stored completely off-page... how is that different than doing that work manually by building a "response" dimension with a surrogate key?  It's different in that SQL Server will NEVER read or cache the responses (and waste RAM) untill and unless someone actually queries something that depends on the response.  And since the response dimension is HUGE and unstructured - they're either not likely to do it, and/or it's going to take FOREVER to serve the query.  Since it's going to take forever, it's either useless and should get dumped - or you're going to have to optimize the storage to help queries.  Optimize how?  You need to find out what questions/queries they're likely to issue against it.  Term extraction, response standardization and categorization are likely what you'll be looking at.


    Todd McDermid's Blog Talk to me now on
    • Marked as answer by kambala Monday, May 2, 2011 1:07 PM
    Friday, April 29, 2011 5:14 PM
  • Date of birth will have a VERY SMALL set of answers.  Yes, your other questions will of course generate a ton of unique responses.

    But the real question here is how are your users going to query this information? 

    There is absolutely no reason to store it if they aren't querying it - and it's not possible to know how to store it if you don't know how they're going to query it.

    You need to be able to answer that question before you can keep any of it in your Data Warehouse: What business questions are going to be asked of this data?


    Todd McDermid's Blog Talk to me now on


    • Marked as answer by kambala Monday, May 2, 2011 1:11 PM
    Friday, April 29, 2011 9:11 PM

All replies

  • Um. Yes.

    I would be extremely wary of doing anything like this.  It's standard practice to only include numeric columns in fact tables.  The reason behind this is because you can't aggregate (sum, average) character values - nevermind the absolute explosion of space required to host it.

    It's quite typical that this information doesn't really belong in the fact table, and instead should reside in a dimension table as an attribute.

    If you can provide more information about how this massive chunk of text is to be used, perhaps we could suggest more appropriate solutions.


    Todd McDermid's Blog Talk to me now on
    Thursday, April 28, 2011 9:01 PM
  • Please follow what Todd has suggested..it is not a good practice to have varchar in fact tables. BTW could you please explai what kind of reporting you do using the varchar(max) field ? so that we could be able to analyze more on this.

    Friday, April 29, 2011 6:55 AM
  • Thanks a lot for the response. 

    The scenario is: We make calls to people and the call details are stored in a fact like callId, Duration etc.

    Each call will have a response recorded and need to be stored. Initially I thought of creating a response dimension and adding the key to the fact. This case would make the dimension very huge like more than 100 million. So, I was thinking for an other option where I could fit that column in a fact table.

    Reporting scenario is: we report all people and their call responses in an excel format.

    Any help in designing is very much appreciated.

     

    Thanks,

    Kambala

    Friday, April 29, 2011 1:07 PM
  • Thanks for the info.

    No - you absolutely don't want to store "responses" as text in the fact table.  EVER.

    You DO want to make a dimension table.  And I think that if you looked at the text you'd be collecting from each call... you'd find that you have duplicates, which provides the opportunity to only store one row in the dimension table and reference that row several times from the fact table.

    There are two reasons to not place this text in the fact table which really apply to your situation.

    First, there are two types of columns in a fact table.  Foremost, there are things that are meant to be aggregated.  Call length, for example.  The second type of column contains "filter" type information to subdivide, group, or limit the rows you're adding up.  For example, which call center the call was handled in.  When you use those two types of columns together, you're able to answer questions such as "what was the average call duration for call center A compared to all other call centers?"  Storing a "response" plain-text column doesn't fit EITHER of those purposes.

    Second, there are hundreds of millions of rows in a fact table.  Every single one of them is necessary - you can't deduplicate or aggregate any of them, or you'll lose important granularity information.  Fact tables should therefore have as few columns as possible, with as small a data type as possible.  Why?  I/O.  If you have a row that contains four surrogate keys (date, time, employee, customer) and one fact (call duration in seconds), you have a 12 byte wide row (4 bytes for an INT key for date, 2 bytes for SMALLINT for time, employee and customer, and a SMALLINT for call duration).  That means a 100 million row table will be (at least) 1.2 billion bytes (1.2GB).  If you add on a CHAR(100) (not even a 1000, or MAX), now your row size is 112 bytes, or 11.2GB.  Answering the "what was the average call duration" question I posed above will now almost certainly require disk reads (unless your server has 16GB or more RAM, and even then, I would doubt it).  A disk read compared to a memory read is at least 100x slower.  So if your query on the 12 byte wide row took 1/2 a second, the query on the 112 byte wide row will take at least a minute.  Bump it up to 1000 characters...

    Now, I'm not certain how a VARCHAR(MAX) gets stored in a table - definitely off-row, but on-page?  I don't know.  Unless those VARCHARs get stored completely off-page (with just a reference pointer), then you are still saddled with reading them from disk, whether the CPU has to process them or not.  Even if they are stored completely off-page... how is that different than doing that work manually by building a "response" dimension with a surrogate key?  It's different in that SQL Server will NEVER read or cache the responses (and waste RAM) untill and unless someone actually queries something that depends on the response.  And since the response dimension is HUGE and unstructured - they're either not likely to do it, and/or it's going to take FOREVER to serve the query.  Since it's going to take forever, it's either useless and should get dumped - or you're going to have to optimize the storage to help queries.  Optimize how?  You need to find out what questions/queries they're likely to issue against it.  Term extraction, response standardization and categorization are likely what you'll be looking at.


    Todd McDermid's Blog Talk to me now on
    • Marked as answer by kambala Monday, May 2, 2011 1:07 PM
    Friday, April 29, 2011 5:14 PM
  • Thank you for your explanation Todd.

    I agree that putting the response in the fact table is not a good idea. 

     And I think that if you looked at the text you'd be collecting from each call... you'd find that you have duplicates, which provides the opportunity to only store one row in the dimension table and reference that row several times from the fact table

    But these responses I am talking about are not limited. Different people answers a question in different ways.

    For example:

    What is your date of birth? we will have many distinct answers.

    How are you doing today? we will have many different answers with different lengths.

    what is your address? we will have many different answers. 

    When we consider all these, adding the response in the dimension table will raise to nearly 100 Million distinct responses. So I am really confused whether having such a big dimension will work or not?

     

    Please Let me know what you think.

     

    Thanks

    Kambala

     

    Friday, April 29, 2011 5:51 PM
  • Date of birth will have a VERY SMALL set of answers.  Yes, your other questions will of course generate a ton of unique responses.

    But the real question here is how are your users going to query this information? 

    There is absolutely no reason to store it if they aren't querying it - and it's not possible to know how to store it if you don't know how they're going to query it.

    You need to be able to answer that question before you can keep any of it in your Data Warehouse: What business questions are going to be asked of this data?


    Todd McDermid's Blog Talk to me now on


    • Marked as answer by kambala Monday, May 2, 2011 1:11 PM
    Friday, April 29, 2011 9:11 PM
  • Hey Todd,

    Thanks so much for your valuable suggestions. I can now completely rule out the option of having that column in fact table. I prefer having it as a dimension.

    The type of query which might need this bulk responses is a daily report with the list of all calls and their corresponding questions/responses.

    Other than this case, there is no other place I might need to query that bulk dimension.

    Does this sound reasonable?

     

    Thanks,

    Kambala

    Monday, May 2, 2011 1:15 PM
  • Completely.

    Adding it to the fact table would kill the rest of the capabilities of your data warehouse.


    Todd McDermid's Blog Talk to me now on
    Monday, May 2, 2011 3:09 PM