AVERAGEX of Distinct Counts - Performance Killer

  • Question

  • I have a fairly standard model:

    'Date Table' - 2,000 rows
    'Fact Table' - 40,000 rows
    'Employee Table' - 500 rows

    Unfortunately I experience very poor performance doing an AVERAGEX of a distinct count.  Measures below:

    Distinct Count of Weeks With Hours:=CALCULATE(DISTINCTCOUNT('Date Table'[WeekNameInCalendar]), 'Fact Table')
    Query Plan (4ms):

    AddColumns: IterPhyOp IterCols(0)(''[temp])
    SingletonTable: IterPhyOp IterCols(0)(''[temp])
    Spool: LookupPhyOp Integer #Records=1 #KeyCols=62 #ValueCols=1 DominantValue=BLANK
    AggregationSpool<Cache>: SpoolPhyOp #Records=1
    VertipaqResult: IterPhyOp #FieldCols=0 #ValueCols=1

    Avg Distinct Count of Weeks With Hours Per Employee:=AVERAGEX (
        SUMMARIZE (
            'Fact Table',
            'Employee Table'[TK_KEY]
        ),
        [Distinct Count of Weeks With Hours]
    )

    Query Plan (cold: 1,800ms, hot: 250ms):

    AddColumns: IterPhyOp IterCols(0)(''[temp])
    SingletonTable: IterPhyOp IterCols(0)(''[temp])
    Spool_Average: LookupPhyOp Double #Records=1 #KeyCols=0 #ValueCols=2 DominantValue=BLANK
    AggregationSpool<Average>: SpoolPhyOp #Records=1
    CrossApply: IterPhyOp IterCols(0)('Employee Table'[TK_KEY])
    Spool_UniqueHashLookup: IterPhyOp LookupCols(0)('Employee Table'[TK_KEY]) #Records=503 #KeyCols=62 #ValueCols=0
    AggregationSpool<Cache>: SpoolPhyOp #Records=503
    VertipaqResult: IterPhyOp #FieldCols=1 #ValueCols=0
    Spool_Iterator<Spool>: IterPhyOp IterCols(0)('Employee Table'[TK_KEY]) #Records=503 #KeyCols=1 #ValueCols=1
    AggregationSpool<Copy>: SpoolPhyOp #Records=503
    Spool_Iterator<Spool>: IterPhyOp #Records=1 #KeyCols=62 #ValueCols=1
    AggregationSpool<Cache>: SpoolPhyOp #Records=1
    ... the last 3 lines repeat about 1,500 times

    Can anyone suggest a method to tweak this?

    Monday, September 28, 2015 1:40 PM

Answers

  • Replace your SUMMARIZE() with VALUES('Employee Table'[TK_KEY]) to avoid the expensive navigation of the relationship between your fact table and the employee dimension.
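
    A sketch of that rewrite, reusing the measure and column names from the question (not tested against the actual model):

    Avg Distinct Count of Weeks With Hours Per Employee:=AVERAGEX (
        VALUES ( 'Employee Table'[TK_KEY] ),
        [Distinct Count of Weeks With Hours]
    )

    Employees without fact rows should return BLANK from the inner measure, which AVERAGEX skips, so the result should generally match the SUMMARIZE() version.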

    Beyond that, there's not much room for measure optimization. You might try profiling a rewrite of your [Distinct Count of Weeks With Hours] that uses COUNTROWS(VALUES()) instead of DISTINCTCOUNT(), but there's rarely a significant performance difference between these constructs.
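
    For reference, a sketch of that variant, keeping the filter argument from the original measure:

    Distinct Count of Weeks With Hours:=CALCULATE (
        COUNTROWS ( VALUES ( 'Date Table'[WeekNameInCalendar] ) ),
        'Fact Table'
    )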

    What you're asking for is an expensive calculation. AVERAGEX() (like all X functions, along with FILTER() and some others) is a row-by-row iterator: it steps through the table passed as the first argument and evaluates the expression in the second argument once per row of that table. Your [Distinct Count of Weeks With Hours] takes 4ms, and you're asking to evaluate it several hundred times, stepping through 'Employee Table'. If every row of 'Employee Table' is in context, for example at a grand total, then you're asking for roughly 500 evaluations of a 4ms measure.

    GNet Group BI Consultant

    • Marked as answer by Simon Nuss Thursday, October 1, 2015 12:58 PM
    Wednesday, September 30, 2015 8:24 PM

All replies

  • Can you explain what you're trying to do exactly? Not in terms of DAX functions to use, but what you want to achieve - I don't think I understand completely.

    A good start could be to perform the AVERAGEX over the Employee table, not your Fact table (even summarized).
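
    A minimal sketch of that suggestion, iterating the dimension table directly with the questioner's existing measure:

    Avg Distinct Count of Weeks With Hours Per Employee:=AVERAGEX (
        'Employee Table',
        [Distinct Count of Weeks With Hours]
    )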

    Wednesday, September 30, 2015 8:18 PM
    Answerer
  • Great, thanks Greg!  I should have thought of that sooner, silly me.

    I did find a white paper from Alberto and a few other resources on distinct counts. If your suggestion doesn't pan out, I'll look into those.

    Thursday, October 1, 2015 12:48 PM