MDX to fetch distinct counts

  • Question

  • Hi all,

I have a fact table that stores Orders.
    It contains an Order ID, order date, a customer, quantity, a product and an Item number within the order.
    The order number, date and customer are repeated for each product in an order.
    I need to produce a result set containing each distinct product and the count of distinct other products that were ever bought together with it.
    For instance, if the data looks like

Order    Item no    Product
    1        1          Apples
    1        2          Oranges
    2        1          Apples
    2        2          Beer
    3        1          Oranges
    3        2          Wine
    4        1          Apples
    4        2          Oranges


    I need to return the following list (possibly sliced by dates, customers or item number).

Apples     2    (bought together with Oranges and Beer)
    Oranges    2    (Apples and Wine)
    Beer       1    (Apples)
    Wine       1    (Oranges)
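To make the requirement concrete, here is the counting logic in plain Python (a language-neutral sketch of what the cube needs to compute, using the sample data above; not SSAS code):

```python
from collections import defaultdict

# Sample orders from the post: order number -> products in that order.
orders = {
    1: ["Apples", "Oranges"],
    2: ["Apples", "Beer"],
    3: ["Oranges", "Wine"],
    4: ["Apples", "Oranges"],
}

# For each product, collect the distinct set of OTHER products
# that ever appeared in the same order.
bought_with = defaultdict(set)
for products in orders.values():
    for p in products:
        for q in products:
            if q != p:
                bought_with[p].add(q)

# The desired result set: product -> distinct co-purchase count.
counts = {p: len(others) for p, others in bought_with.items()}
```

Running this against the sample data reproduces the expected list: Apples 2, Oranges 2, Beer 1, Wine 1.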


In my particular case an order always has exactly 2 items (each order has an Item no = 1 and an Item no = 2, and it is important to differentiate between item 1 and item 2), so if it makes it easier I can store it like

OrderID, Customer, Date, Product1, Product1 quantity, Product2, Product2 quantity.

I'm looking for a way to model this: what is the MDX to fetch such a result set, would the count be stored as a calculation, and what are the performance implications, assuming a very large number of products and orders?

    Much appreciated



    Monday, July 12, 2010 4:41 PM

Answers

  • Hi,

Using data mining is a good choice, especially for basket analysis, as Tom said. For an MDX approach, you can refer to the sample below in [Adventure Works]:

    1)    Make sure you have created a degenerate dimension in your cube; in [Adventure Works] we can use the dimension [Internet Sales Order Details], because we need to analyze the transactions.

    2)    Run the query below against [Adventure Works]:

with member measures.x as
        exists(
            [Product].[Product].[Product],
            {exists([Internet Sales Order Details].[Sales Order Number].[Sales Order Number],
                    [Product].[Product].currentmember,
                    "Internet Sales")},
            "Internet Sales"
        ).count - 1,
        Non_Empty_Behavior = { [Measures].[Internet Order Quantity] }
    select non empty {[Date].[Date].[Date]} * {measures.x} on 0,
        non empty [Product].[Product].[Product] on 1
    from [Adventure Works]
    where [Date].[Calendar].[Month].&[2003]&[8];

3)    To verify the first cell (20030801, All-Purpose Bike Stand), you can run the T-SQL queries below against the data source:

select * from dbo.FactInternetSales
    where ProductKey = 486 and OrderDateKey = 20030801

    select distinct ProductKey from dbo.FactInternetSales
    where SalesOrderNumber = 'SO51882' or SalesOrderNumber = 'SO51903'

     

Regarding performance: since you need to analyze at the lowest granularity, the transaction level, performance could be poor if you have a very large fact table (make sure you have created partitions). This is another reason to use data mining.
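The EXISTS chain in the query above works in two steps for each product: the inner EXISTS finds the order numbers containing the current product, the outer EXISTS finds the distinct products appearing in those orders, and ".count - 1" excludes the product itself. A plain-Python sketch of that same logic (the fact rows are made-up sample data, not Adventure Works values):

```python
# Hypothetical fact rows: (sales_order_number, product).
fact = [
    ("SO1", "Apples"), ("SO1", "Oranges"),
    ("SO2", "Apples"), ("SO2", "Beer"),
    ("SO3", "Oranges"), ("SO3", "Wine"),
]

def co_purchase_count(product):
    # Inner EXISTS: order numbers that contain the current product.
    orders = {o for o, p in fact if p == product}
    # Outer EXISTS: distinct products appearing in those orders.
    products = {p for o, p in fact if o in orders}
    # .count - 1: exclude the product itself.
    return len(products) - 1
```

For example, co_purchase_count("Apples") walks orders SO1 and SO2 and counts Oranges and Beer, giving 2.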

     

    Hope this helps,

    Raymond

     


    Raymond Li - MSFT
    • Marked as answer by Raymond-Lee Tuesday, July 20, 2010 8:17 AM
    Wednesday, July 14, 2010 6:20 AM
I agree data mining can be useful for answering these questions. However, data mining produces predictions, not accurate results. Is that OK for you? I tend to find data mining complementary to basket analysis, rather than a substitute.

    You might find my blog on optimizing basket analysis useful: http://richardlees.blogspot.com/2009/11/optimizing-basket-analysis.html  If you need to perform basket analysis and you have more than a few dozen million OrderIDs, performance will be challenging. Fortunately, there are ways to optimise cubes to support basket analysis, as described in the post above.


    Richard
    • Marked as answer by Raymond-Lee Tuesday, July 20, 2010 8:17 AM
    Thursday, July 15, 2010 7:05 AM

All replies

  • Hi Jon,

Using the fact-table structure from the bottom of your post, with the differentiation between Product1 and Product2, is a good idea. I suggest you also add a unique integer field identifying Product1 and one identifying Product2. You need these for your distinct count measures, as they have to be based on integer fields. So your fact table should look like:

    OrderID, Customer, Date, Product1, Product1_IntKey, Product1 quantity, Product2, Product2_IntKey, Product2 quantity

    Assuming you have only one Product dimension, (re)use it as a role-playing dimension in your UDM, mapped to both Product1 and Product2.

    Add the fact table to your cube and add a distinct count measure on your Product2_IntKey. This allows you to easily get a distinct count of the Product2 members for whatever slice on Date, Customers and Product1 members.

    You do not need any complex MDX for this, as it can be used like any other regular measure.

    (If you are also interested in a distinct count of the Product1 members when slicing on Product2, then also add Product1_IntKey as a distinct count measure.)

    There could be a performance implication when you have a lot of facts/combinations. When that's an issue, consider adding the distinct count measures to separate fact tables in your ETL; however, doing so, you will probably lose some flexibility.
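What the distinct count measure on Product2_IntKey computes for a Product1 slice can be sketched in plain Python (the flattened fact rows and key values are invented for illustration):

```python
# Hypothetical flattened fact rows, one per order:
# (order_id, product1, product1_key, product2, product2_key)
fact = [
    (1, "Apples",  10, "Oranges", 20),
    (2, "Apples",  10, "Beer",    30),
    (3, "Oranges", 20, "Wine",    40),
    (4, "Apples",  10, "Oranges", 20),
]

def distinct_product2(slice_product1=None):
    """Distinct count of Product2_IntKey, optionally sliced on Product1."""
    rows = fact if slice_product1 is None else [
        r for r in fact if r[1] == slice_product1
    ]
    return len({r[4] for r in rows})
```

Slicing on Apples counts the distinct keys {20, 30} and returns 2, which matches the "bought together with Oranges and Beer" figure in the question; with no slice, all three Product2 keys are counted.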

     

    hth,

    Cees


Please remember to mark the replies as answers or at least vote helpful if they help, and unmark them if they provide no help.
    Monday, July 12, 2010 5:38 PM
  • Hi Jon,

I would recommend the following: use the MS data mining method Association Rules; it's the kind of method that suits your kind of basket analysis best. Please also consider the data mining add-in for MS Excel.

    If for some reason using data mining methods is not an option, you should build your basket as described by Ralph Kimball in his book "The Data Warehouse Toolkit - 2nd Edition". Chapter 2, "Retail Sales", contains a short section about "Market Basket Analysis".

    It's about defining a basket consisting of a pair of products that were bought together. The basket is your order. If you have a distinct count measure on your order number, you can determine how often product A was bought together with product B, or how often a product from product group I was bought together with products from product group II...

    As Cees already mentioned in his post, you'd need your product dimension as a role-playing dimension.

    All this happens in a new fact table that is fed by your "normal" fact table.
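The ETL step that feeds that new basket fact table can be sketched in plain Python (sample data and column layout are invented; the idea is one row per pair of products bought in the same order, in Kimball's style):

```python
from itertools import combinations

# "Normal" fact rows: (order_id, product).
fact = [
    (1, "Apples"), (1, "Oranges"),
    (2, "Apples"), (2, "Beer"),
]

# Group products by order.
by_order = {}
for order_id, product in fact:
    by_order.setdefault(order_id, []).append(product)

# Basket fact table: one row per unordered pair of products
# bought together in the same order.
basket_fact = [
    (order_id, a, b)
    for order_id, products in by_order.items()
    for a, b in combinations(sorted(products), 2)
]
```

With a distinct count measure over the order_id column of such a table, "how often was A bought with B" becomes a regular slice-and-count.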

    Hope this helps

    Tom

    Monday, July 12, 2010 6:58 PM
  • Thanks Cees, Tom.

    Let me first say (for Cees) that if yesterday's game were to be played today, I'd cheer for the Dutch.

I don't like the idea of a role-playing dimension for product (my bad, I was not very clear in my initial post) because Product needs to be seen as one dimension. Apples and oranges have the same chance of being either the first or second item in an order, and in most cases this should be treated as one dimension: if someone browses the cube and pulls in the dimension, I need to make sure that it contains both apples and oranges, even though at that point in time all apples might be the first item and all oranges the second. I see item number (1 or 2) as a separate dimension, which someone might want to slice on (for instance, show me all products that were bought as the primary product in an order), but in the general case they want to see all products.

    I'll definitely take a look at data mining, I'm just not familiar with it at all.

     

    Thanks again

    Monday, July 12, 2010 7:30 PM