SSAS 2008 - Degenerate Dimension (Distinct Count)

  • Question

  • Hello,

    I have a data warehouse with a Sales Transaction table.  This table contains the Order #, the item being sold, the quantity, the price, etc.

    Order # is my degenerate dimension.

    My question is how can I get a unique count of the Order # in the Cube without:

    (1) Creating a new Distinct Count Measure group (exponentially increases the cube's processing time)
    OR
    (2) Creating a new Dimension with the Order # as the primary key and creating a new Calculated Measure (DistinctCount([Order #].[Order #])) that will give me the distinct count of the Order #.
    This cannot be done because the size of this degenerate dimension (350,000,000 records and rapidly growing) is greater than the SQL Server threshold of 4GB.

    Any insight will be appreciated.

    Thanks
    Vitaly

    Saturday, March 3, 2012 3:23 PM

Answers

  • Something that surprises many cube designers is that Distinct Count can be more expensive than a dimension on the same field.  If you have a dimension on that field, each member's value will only be stored once in the cube.  If you create a Distinct Count on the field, there is no dimension, but each member's value will need to be stored in every cell (and aggregation) that the member is related to.  This is the nature of Distinct Count: SSAS must have every value available to determine the distinct values.  So, if you cannot create a dimension on the field, chances are you will have some performance issues with a Distinct Count.

    As Stephen suggests, a ROLAP dimension is definitely worth a try.  Distinct Count performance might be interesting, and it would be worth tuning the SQL tables to be as skinny as possible and well indexed, with sufficient memory to retain the table in memory.

    Also, you might like to think outside the square and reduce the size of the big dimension by, for example, only including orders that have been used in the last 12 months (or some other way of reducing the size without losing too much valuable information).  Facts older than 12 months (or whatever the filter logic is) can all connect to the same member, and they won't be counted in distinct counts.  In the 12-month illustration, you will only be able to query the distinct count for the last 12 months of data.  (Just an idea, don't take it as a best practice.)


    Richard

    • Marked as answer by Challen Fu Wednesday, March 14, 2012 9:29 AM
    Monday, March 12, 2012 4:08 AM

All replies

  • No calculated measure, no distinct count...

    1. If there is only one row per order and you do not want a distinct count, you can create a measure with the Count aggregation type.

    2. If there are multiple rows per order in the transaction layer and you do not want to use a calculated measure or a distinct count, then you probably have to move the logic to the ETL layer or the report layer.

    ETL layer.

    In the table you could add a column, say OrderBit: for the first row you insert for each order, set the flag to 1; the rest of the rows for the same order get 0.  Then you can add a measure using the SUM aggregation type instead of a Distinct Count.
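    The OrderBit idea could be populated in T-SQL roughly like this (a sketch only: the table and column names FactSales, OrderNumber, LineNumber and OrderBit are assumptions, not from this thread):

    ```sql
    -- Flag the first row of each order with 1 and the rest with 0,
    -- so a SUM(OrderBit) measure returns the distinct order count.
    -- FactSales / OrderNumber / LineNumber / OrderBit are hypothetical names.
    WITH numbered AS (
        SELECT OrderBit,
               ROW_NUMBER() OVER (PARTITION BY OrderNumber
                                  ORDER BY LineNumber) AS rn
        FROM dbo.FactSales
    )
    UPDATE numbered
    SET OrderBit = CASE WHEN rn = 1 THEN 1 ELSE 0 END;
    ```

    Note that SUM(OrderBit) only equals the true distinct count when whole orders fall inside the slice being queried; filtering out the flagged row of an order drops that order from the total.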

    Report Layer

    In a report you will mostly end up grouping the orders or showing a summary per order, so you can use the row count in a subtotal, or a sum over the grouped rows, to get the count of orders.


    vinu

    Saturday, March 3, 2012 5:13 PM
  • Vinu,

    1. Most orders have more than one Transaction Item, so we can't use a straight line-item Count measure from the cube.

    2. We have implemented an algorithm at the ETL layer: a column that is set to 1/(# of times a particular product occurs on an order), so that you can sum this number and get the distinct number of orders.  It's not the best solution, and I was hoping there was a built-in option at the cube level that I was overlooking.

    Thank you for your input.

    Vitaly

    Saturday, March 3, 2012 6:23 PM
  • Any other ideas? Really looking for a Cube level solution not ETL or Reporting Layer.

    Thanks
    Monday, March 5, 2012 8:10 PM
  • Hello Vitaly,

    Most order fact tables have an order line number column, so a distinct count of order numbers should be equivalent to a simple count of rows where order line number = 1.  More generally, you should look for an attribute of an order line that is always present, takes a unique value within the set of lines for one order, and takes that same value in every other order's set of lines (as line number 1 does).
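    In SQL terms, this could be exposed as a named calculation in the data source view (a sketch only; table and column names are assumptions):

    ```sql
    -- IsFirstLine is 1 exactly once per order, so a plain SUM
    -- measure on it counts distinct orders without Distinct Count.
    -- FactSales / OrderNumber / OrderLineNumber are hypothetical names.
    SELECT OrderNumber,
           OrderLineNumber,
           CASE WHEN OrderLineNumber = 1 THEN 1 ELSE 0 END AS IsFirstLine
    FROM dbo.FactSales;
    ```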

    Regards,

    Philip

    Monday, March 5, 2012 8:34 PM
  • Vitaly,

    You want to count the number of orders when filtering by sales items in the cube.  This tells me that you must have something in the cube to identify the order; in other words, you must have a dimension or a Distinct Count on Order.  A dimension on Order is likely to be more efficient, since it will only store each OrderId once, while the Distinct Count will have to repeat it in every cell and aggregation where it is appropriate.  You could optimise an Order dimension, a little, by only keeping one member for each unique set of sales items.  Here are a few tips on optimising the order dimension: http://richardlees.blogspot.com.au/2009/11/optimizing-basket-analysis.html.

    Another option you might consider, if you only need indicative numbers, is data mining.  Data mining can grab the essence of the data, which it can use to estimate the number of orders.  See http://richardlees.blogspot.com.au/2009/05/data-mining-as-enhancement-to-basket.html for more info.

    Hope that helps,

    Richard

    http://RichardLees.com.au



    Tuesday, March 6, 2012 5:47 AM
  • Richard,

    Thank you for your response.

    My initial idea was to create an Orders dimension with all the unique Order #s (the degenerate dimension).  The problem is that this dimension would have 350,000,000 members, which SSAS cannot handle because of the 4GB limit on the string store the dimension is held in.  Every time I process this dimension, with a single primary key (Order #), it fails when the size of the dimension reaches 4GB.  Had I been able to create this dimension, I would have used the DistinctCount function within a calculated measure.

    I actually use this method with other dimensions that are much smaller (5,000 records).  Unfortunately the Order # dimension is simply too large for that approach.

    The reason I need to get this number is for quick cube analysis in Performance Point so the data mining does not apply. I can easily run a SQL query on the Transaction Fact table and get the result but the problem is that it takes too long and is not part of the cube.

    Vitaly 
    Tuesday, March 6, 2012 2:24 PM
  • The dimension is the correct way to go. 

    Change the StorageMode of the dimension to ROLAP and it will build in about 5 seconds without giving you any memory errors.

    This is because SSAS refers back to the database table when it needs information about the dimension, rather than trying to build a large MOLAP object to store the information in.

    -------------------------------------------------------------------------

    I gave this a go in my cube.  My situation is slightly different, but you may find that it performs like a dog with three legs.

    I am trying to get a drillthrough, for which I need a fact dimension so that I can return some of the fields from the fact table in my drillthrough.  If my fact dimension is ROLAP, then the drillthrough action tries to read the whole dimension from the database when the MDX runs, and this fails because there are 1,000,000,000 records.

    Currently, I am bodging it so that I can build my fact dimension as a MOLAP on the most recent 100,000,000 records.  Anything older than that won't allow drillthrough.

    Maybe you should go back to having a Distinct Count measure group and accept the trade-off in processing time.  You don't have to process the whole cube every night; you can partition it by date.  I think there is no answer that will make this perform amazingly well.
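    Partitioning by date means each partition binds to a date-sliced source query, something like this sketch (table, column and date range are assumptions):

    ```sql
    -- Source query for one monthly partition of the measure group;
    -- only the current partition needs reprocessing each night.
    -- FactSales / OrderDate and the dates are hypothetical.
    SELECT *
    FROM dbo.FactSales
    WHERE OrderDate >= '20120201'
      AND OrderDate <  '20120301';
    ```

    For Distinct Count measure groups specifically, Microsoft's published guidance is to partition on non-overlapping ranges of the distinct-count column itself (here, Order #), which lets SSAS process and scan partitions in parallel without cross-partition duplicates.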

    • Edited by Stephen P Friday, March 9, 2012 6:59 AM Update
    Friday, March 9, 2012 4:55 AM
  • It turns out that for my purposes (drillthrough) I can add a measure and specify no aggregates.  The value of that measure at the grain level of the cube is then available to me for drillthrough.

    Good luck with your distinct count

    Monday, March 12, 2012 7:13 AM
  • Vitaly, what solution did you go with?

    I'm facing the same issue with two DDs (orders and Items).

    Due to the granularity, one order or item DD member can be repeated multiple times, so a simple count won't work.

    The workaround I am currently testing divides 1 by the number of lines that make up an order/item, e.g. if one order spans 10 lines, then each line will represent 0.1 of an order.  Once summed in the cube you will obviously get 1.

    Tuesday, April 16, 2013 8:14 AM