Calculated measure maxing out RAM

  • Question

  • I wonder if anyone could kindly help me.

    I have a calculated measure in Power Pivot that takes so long to run that it maxes out my RAM when running it in Excel 2016 (for some reason it actually manages to run in Excel 2013, but still takes a long time). Can someone please help me optimise the measure? I have shared it below. It is the Unique Leads (MTD) measure that I need to run in a pivot table broken down by each day in a month. The Unique Leads (MTD) measure calculates the month-to-date unique leads, where a unique lead is defined as an email address that has not appeared in the 'application' table so far in a particular month. Please let me know if you need more information!

    Unique Leads (MTD):=[CumulativeUniqueApps]-CALCULATE([CumulativeUniqueApps], filter(allexcept('application', Affiliate[Affiliate], 'Lead Type'[Loan Type]), CALCULATE([CumulativeUniqueApps], filter('application', EARLIER([Date])< [Date]))))

    CumulativeUniqueApps:=calculate(DISTINCTCOUNT([Email]), filter(allexcept('application', Affiliate[Affiliate], 'Lead Type'[Loan Type]), CALCULATE(DISTINCTCOUNT([Email]), filter('application', earlier([Date])<=[Date]))))
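    To make the intended logic concrete, here is a rough sketch of the calculation in Python (sample data, column names, and values are made up purely to illustrate the definition, not taken from my model):

    ```python
    from collections import defaultdict

    # Each row: (date, email, affiliate, loan_type); dates as "YYYY-MM-DD" strings.
    rows = [
        ("2015-11-01", "a@x.com", "Aff1", "Personal"),
        ("2015-11-01", "b@x.com", "Aff1", "Personal"),
        ("2015-11-02", "a@x.com", "Aff1", "Personal"),  # repeat within month: not a new lead
        ("2015-11-02", "c@x.com", "Aff2", "Auto"),
    ]

    def unique_leads_mtd(rows):
        """Per day, count emails not seen earlier in the same month
        for the same affiliate and loan type."""
        seen = defaultdict(set)        # (month, affiliate, loan_type) -> emails seen so far
        daily_new = defaultdict(int)   # date -> count of new unique leads that day
        for date, email, affiliate, loan_type in sorted(rows):
            key = (date[:7], affiliate, loan_type)  # month bucket
            if email not in seen[key]:
                seen[key].add(email)
                daily_new[date] += 1
        return dict(daily_new)

    print(unique_leads_mtd(rows))  # {'2015-11-01': 2, '2015-11-02': 1}
    ```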

    Friday, November 6, 2015 11:35 AM


All replies

  • First, www.daxformatter.com is your friend.

    So, let's look at the measures we need to optimize:

    Unique Leads (MTD) :=
    [CumulativeUniqueApps]
        - CALCULATE (
            [CumulativeUniqueApps],
            FILTER (
                ALLEXCEPT ( 'application', Affiliate[Affiliate], 'Lead Type'[Loan Type] ),
                CALCULATE (
                    [CumulativeUniqueApps],
                    FILTER ( 'application', EARLIER ( [Date] ) < [Date] )
                )
            )
        )

    CumulativeUniqueApps :=
    CALCULATE (
        DISTINCTCOUNT ( [Email] ),
        FILTER (
            ALLEXCEPT ( 'application', Affiliate[Affiliate], 'Lead Type'[Loan Type] ),
            CALCULATE (
                DISTINCTCOUNT ( [Email] ),
                FILTER ( 'application', EARLIER ( [Date] ) <= [Date] )
            )
        )
    )
    First, it's considered a best practice to always refer to fully qualified column names. It makes it much easier to reason about your model structure. E.g. I have no idea what tables [Email] and [Date] belong to, so I'll have to make assumptions moving forward. Wrong assumptions lead to poor suggestions and incorrect or nonfunctioning results.

    So, you've got some heavily nested FILTER()s, which force a table to be created in memory that can then be iterated over row-by-row in a single thread. That's where your memory problems come from.

    You're also going through a lot of contortions to get a cumulative total.

    Here's a pretty bog standard way to do this, assuming you have a proper date dimension. Please let us know where assumptions divorce from reality.

    DistinctEmailCount:=
    DISTINCTCOUNT(Application[Email])
    
    CumulativeUniqueApps:=
    CALCULATE(
        [DistinctEmailCount]
        ,FILTER(
            ALL(DimDate)
            ,DimDate[Date] <= MAX(DimDate[Date])
        )
    )
    
    UniqueLeadsMTD:=
    TOTALMTD(
        [DistinctEmailCount]
        ,DimDate[Date]
    )
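    To see what these two measures evaluate to, here's a rough sketch of the logic in Python (made-up sample data, and DimDate semantics simplified to date strings; this is an illustration, not your model):

    ```python
    # (date, email) rows; "a" appears only in October, "b" only in November.
    rows = [
        ("2015-10-30", "a@x.com"),
        ("2015-11-01", "b@x.com"),
        ("2015-11-02", "b@x.com"),
    ]

    def cumulative_unique(rows, as_of):
        # Mirrors FILTER(ALL(DimDate), DimDate[Date] <= MAX(DimDate[Date])):
        # distinct emails over all time up to as_of.
        return len({email for date, email in rows if date <= as_of})

    def unique_mtd(rows, as_of):
        # Mirrors TOTALMTD: distinct emails from the 1st of as_of's month through as_of.
        month_start = as_of[:8] + "01"
        return len({email for date, email in rows if month_start <= date <= as_of})

    print(cumulative_unique(rows, "2015-11-02"))  # 2 (a and b, over all time)
    print(unique_mtd(rows, "2015-11-02"))         # 1 (only b within November)
    ```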

    You'll have to explain what the purpose of your ALLEXCEPT()s is. In the examples I've given, the only filter context we are modifying is that of the date dimension, so you don't need to explicitly preserve context on any other fields or tables.

    GNet Group BI Consultant

    Friday, November 6, 2015 3:07 PM
  • Many thanks for your reply.

    I should've explained that when counting unique leads, we need to take into consideration where a lead comes from (which affiliate) and the lead type, i.e. a unique email is only unique to a particular affiliate for a certain lead type. That's why I had the ALLEXCEPT ( 'application', Affiliate[Affiliate], 'Lead Type'[Loan Type] ).

    Also, the Unique Leads (MTD) measure is actually supposed to count the number of 'new' unique emails each day, i.e. if an email has appeared on previous days within the same month then it is not considered a unique lead. Does that make sense?





    Friday, November 6, 2015 5:55 PM
  • It makes sense, but if you take an ALL(DimDate), you won't affect the current context on those other dimensions. Additionally, by avoiding an ALLEXCEPT() on your fact table you avoid iterating over every row in your fact table, which becomes the memory hog. Have you tried the measures I suggested?


    GNet Group BI Consultant

    Friday, November 6, 2015 6:07 PM
  • Sorry, I just modified my previous reply to clarify a caveat on the Unique Leads (MTD) measure.
    Friday, November 6, 2015 6:15 PM
  • So, you've not addressed whether the first two measures I provided meet your needs or not.

    As for the unique leads in month, this is something where a helper column will make your life a lot easier. You know you want to count distinct emails on the first day of the month that they appear.

    So add a bit flag that indicates whether a specific email is the first occurrence in a month. This will be a lot easier and faster than anything else I could suggest. The best place to do this is the ETL layer; failing that, a view on top of the table in your data warehouse; failing that, the query you write to populate the table; and, if none of those are options, a calculated column in Power Pivot.

    Your table would have an additional column which has a 1 for the first instance of an email in a month and 0 otherwise. Then you can just sum this field. Much easier, faster, and less RAM-hungry.
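    That helper-column logic can be sketched in Python like so (illustrative sample data; per the clarification above, affiliate and loan type would be part of the key as well, omitted here for brevity):

    ```python
    # Derive a FirstInMonth flag per row: 1 if this is the first row for this
    # (month, email) combination in date order, else 0. Summing the flag per day
    # then yields the daily count of new unique emails for that month.
    rows = [  # (date, email), assumed sorted by date
        ("2015-11-01", "a@x.com"),
        ("2015-11-01", "b@x.com"),
        ("2015-11-02", "a@x.com"),   # already seen in November -> 0
        ("2015-12-01", "a@x.com"),   # new month -> 1 again
    ]

    def first_in_month_flags(rows):
        seen = set()
        flags = []
        for date, email in rows:
            key = (date[:7], email)  # month bucket + email
            flags.append(0 if key in seen else 1)
            seen.add(key)
        return flags

    print(first_in_month_flags(rows))  # [1, 1, 0, 1]
    ```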

    GNet Group BI Consultant

    Friday, November 6, 2015 6:31 PM
  • P.S. the measure you suggested gives a running total of unique emails, but what I want is to count each day's new unique emails
    Friday, November 6, 2015 6:32 PM
  • So, if you want the count of new emails on a given day, a bit flag for each indicator (first ever, and first in month) will make your life much easier, and your measure evaluation much faster.

    GNet Group BI Consultant

    Friday, November 6, 2015 6:38 PM
  • I can see where you're coming from with calculated columns, but I am struggling to design the formula. Would you be able to help?

    P.S. I have a date column in the same application table where the email addresses are found

    Friday, November 6, 2015 6:40 PM
  • First - a date dimension is a wonderful thing. Use one. There's a lot of material on what should go into a date dimension that's easy to find online.

    Again, a calculated column should be the last choice for implementing this logic. You'll see compression, but not as effective as a "native" column coming in from the source. Additionally, if your data source is a SQL database, the query to write is much simpler than the equivalent in DAX. I'd be happy to help with the SQL necessary in the query if you need.

    If you're using Power Query, I could take a stab at the M you'd need, but that's not really my forte.

    If you must implement this logic in a calculated column, it'll look like this:

    FirstEver=
    IF(
        CALCULATE(
            COUNTROWS('Application')
            ,ALLEXCEPT(
                'Application'
                ,'Application'[Email]
                ,Affiliate[Affiliate]
                ,LeadType[LoanType]
            )
            ,'Application'[date] < EARLIER('Application'[date])
        ) > 0
        ,0
        ,1
    )

    FirstInMonth=
    IF(
        CALCULATE(
            COUNTROWS('Application')
            ,ALLEXCEPT(
                'Application'
                ,'Application'[Email]
                ,Affiliate[Affiliate]
                ,LeadType[LoanType]
            )
            ,'Application'[date] < EARLIER('Application'[date])
            ,YEAR('Application'[date]) = YEAR(EARLIER('Application'[date]))
            ,MONTH('Application'[date]) = MONTH(EARLIER('Application'[date]))
        ) > 0
        ,0
        ,1
    )
    Again, you should use a date dimension with a month index field in it, because then you can skip the last two lines of that CALCULATE() and instead just include DimDate[MonthKey] in the ALLEXCEPT().

    Edit: Missed Application[Email] in column definitions.

    Additionally, the measure can become a sum of these two fields. This works with the Power Pivot calculated column solution only if you can guarantee an email won't appear twice in one day with the same LeadType[LoanType] and Affiliate[Affiliate] values; it works with the T-SQL solution I've got in mind regardless. If you can't make that guarantee and you use the calculated column solution, then the measure becomes
    CALCULATE(
        DISTINCTCOUNT(Application[Email])
        ,Application[FirstEver] = 1
    )



    GNet Group BI Consultant


    • Edited by greggyb Friday, November 6, 2015 7:27 PM missed a field in formula
    Friday, November 6, 2015 7:21 PM
  • Hi Greg, many thanks for the calculated column solution. I shall give it a try when I'm back at work next week. My ETL is actually MySQL + Power Query. I don't think the flag can be implemented in MySQL in this case, can it? If not, would it be more efficient to implement it in Power Query than in Power Pivot?
    Friday, November 6, 2015 8:28 PM
  • So, the specific SQL solution I was thinking of would use window functions, which, though part of the SQL standard since SQL:2003, are not available in MySQL.

    That being said, you can still get to the necessary bit flag field without too much difficulty. I'd need to know what tables you have in the database to write an appropriate query, but essentially you'd join Application to itself on email, affiliate, and loan type, with a < join on date.

    Again, I'm not sure on how I might do this in M for Power Query. I also don't know what the performance would be like compared to Power Pivot, but it's generally a best practice to perform this sort of work pre-Power Pivot. It's not essential that you avoid it in Power Pivot, but you've got a SQL query and an M query both happening, so you may as well use them to do this work.

    GNet Group BI Consultant

    Friday, November 6, 2015 9:02 PM
  • We have a transactional database and I did think about doing a self-join, but I wasn't sure how to implement the query. What did you mean by a < join?

    Btw, although it is the least efficient option, am I right in saying that using calculated measures in this case is actually more flexible than the other methods? E.g. with calculated measures I can easily display each day's new unique emails over a specific date range in a pivot table without modifying my measures, but if I were to use the other methods then I would have to add more columns to cater for the new requirements, right?

    If using calculated measures also has its advantages in this case, then I think I will keep my measures as well. Having said that, is it still possible to optimise the measures I gave you?

    Friday, November 6, 2015 11:33 PM
  • You're certainly correct that a measure offers more flexibility in choosing the date range. The original requirement was very specific that you wanted all time and within the same month. There is a tradeoff between flexibility and speed of evaluation. With the fairly static requirement you originally posted, helper columns make more sense. If you want to be able to select an arbitrary time frame, then the specific type of helper column I discussed would be useless.

    That being said, you will also face tradeoffs in writing measures to handle arbitrary date selections. Again, as described originally, your measures are specific in requirement, and would be best implemented in a different manner than if the goal is to select arbitrary date ranges.

    As for the SQL with an inequality join, you could do something (roughly) like the following, with more equijoins as necessary for the grouping:

    SELECT
      a.email
      ,a.date
      ,COUNT(a2.email) AS PriorRecordCount
    FROM Application a
      LEFT JOIN Application a2
        ON a2.email = a.email
        AND a2.date < a.date
    GROUP BY a.email, a.date

    In this case you get every record from Application, along with a count of rows where the same email occurred on an earlier date.

    This, of course, is hard-coded in the query and as inflexible as a calculated column.
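    If it helps to see the same inequality-join logic procedurally, here is a rough Python equivalent (sample rows are made up):

    ```python
    # For each (email, date) row, count earlier-dated rows with the same email --
    # the same PriorRecordCount the LEFT JOIN with a2.date < a.date produces.
    rows = [  # (email, date)
        ("a@x.com", "2015-11-01"),
        ("a@x.com", "2015-11-02"),
        ("b@x.com", "2015-11-02"),
    ]

    def prior_record_counts(rows):
        return [
            (email, date, sum(1 for e2, d2 in rows if e2 == email and d2 < date))
            for email, date in rows
        ]

    print(prior_record_counts(rows))
    # [('a@x.com', '2015-11-01', 0), ('a@x.com', '2015-11-02', 1), ('b@x.com', '2015-11-02', 0)]
    ```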


    The measure pattern you want to use is identical to new customer analysis, as thoroughly explained in this DAX Pattern post.
    Here's a sample of how you'd adapt it to your CumulativeUniqueApps:

    DistinctEmails:=DISTINCTCOUNT('Application'[Email])
    
    CumulativeUniqueApps:=COUNTROWS(
    	FILTER(
    		ADDCOLUMNS(
    			SUMMARIZE(
    				'Application'
    				,'Application'[Email]
    				,Affiliate[Affiliate]
    				,LeadType[LoanType]
    			)
    			,"PreviousRecordCount"
    			,CALCULATE(
    				[DistinctEmails]
    				,FILTER(
    					ALL('Application'[date])
    					,'Application'[date] < MIN('Application'[date])
    				)
    			)
    		)
    		,[PreviousRecordCount] = 0
    	)
    )
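    The shape of what that DAX evaluates can be sketched in Python as follows (the filter context is simulated with explicit start/end arguments, grouping columns omitted for brevity; data is made up):

    ```python
    # New-customer pattern: count emails visible in the current period whose
    # "PreviousRecordCount" (rows strictly before the period) is zero.
    rows = [  # (date, email)
        ("2015-11-01", "a@x.com"),
        ("2015-11-02", "a@x.com"),
        ("2015-11-02", "b@x.com"),
    ]

    def new_emails_in_period(rows, start, end):
        in_period = {email for date, email in rows if start <= date <= end}
        return sum(
            1 for email in in_period
            # PreviousRecordCount = 0: no row for this email before the period start
            if not any(date < start for date, e in rows if e == email)
        )

    print(new_emails_in_period(rows, "2015-11-02", "2015-11-02"))  # 1: only b is new
    ```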




    GNet Group BI Consultant

    Saturday, November 7, 2015 5:09 AM
  • After reading the new customer analysis pattern I have two questions:

    1. In the DAX pattern example the author just used the COUNTROWS function within CALCULATE. Any reason why you used DISTINCTCOUNT in my case?
    2. If I'm using the measure along with dropping the affiliate and loan type fields into my pivot table, am I right in saying that I can avoid grouping by them in my measure and just use VALUES(Application[Email]) like in the DAX pattern example?
    Monday, November 9, 2015 12:20 AM
  • Mostly because it seems like you'd still derive value from having a separate distinct count measure. There's no functional difference and I'd be surprised if you see any major performance impact.

    Without context of how you'd be using/presenting the measure, I stuck as close to your original in the new suggestion. In a pivot table as you've described, VALUES() alone is fine.

    GNet Group BI Consultant

    Monday, November 9, 2015 3:58 PM
  • Hi there,

    I wonder if you could help again; this is related to the new/returning customers DAX pattern.

    So I've got the following measure that counts the number of unique/new emails, which I would drop in a pivot table grouped by date and affiliate.

    Unique Posts:=COUNTROWS(
        FILTER(
            ADDCOLUMNS(
                VALUES('PTUK Posts'[Email])
                ,"PreviousRecordCount"
                ,CALCULATE(
                    COUNTROWS('PTUK Posts')
                    ,FILTER(
                        ALL('PTUK Posts'[DateTime_Posted])
                        ,'PTUK Posts'[DateTime_Posted] < MIN('PTUK Posts'[DateTime_Posted])
                    )
                )
            )
            ,[PreviousRecordCount] = 0
        )
    )

    The DateTime_Posted column contains the date and time when an email is posted by an affiliate, and I have another column in the table that contains just the time part.

    Now I wonder if it's possible to modify my measure so that, on any given day, an email is only counted as unique if it has not been posted by the same affiliate within 5 minutes of the first time it gets posted that day. E.g. if an affiliate first posts an email at 11:50, then posts the same email 6 times from 12:00-12:05, and then the same email again 7 times from 13:00-13:05 on the same day, that would count as three unique posts for that day. At the moment the measure would only count 1 unique post for that day.
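    To pin down the rule I mean, here's a rough Python sketch. It assumes one reading of the rule (a new "unique post" starts whenever a post falls more than 5 minutes after the first post of the current burst, with the 5-minute boundary inclusive), and the timestamps are assumed to be for a single email and affiliate on a single day:

    ```python
    from datetime import datetime, timedelta

    def unique_posts(timestamps, window_minutes=5):
        """Count 'bursts': a post starts a new burst (a new unique post) when it
        falls more than window_minutes after the first post of the current burst."""
        times = sorted(datetime.strptime(t, "%H:%M") for t in timestamps)
        count, burst_start = 0, None
        for t in times:
            if burst_start is None or t - burst_start > timedelta(minutes=window_minutes):
                burst_start = t
                count += 1
        return count

    # The example above: first post 11:50, six posts 12:00-12:05, seven posts 13:00-13:05.
    posts = (["11:50"]
             + ["12:00", "12:01", "12:02", "12:03", "12:04", "12:05"]
             + ["13:00", "13:01", "13:01", "13:02", "13:03", "13:04", "13:05"])
    print(unique_posts(posts))  # 3
    ```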

    I don't have to implement this in a measure if you think it would be better done in SQL or a calculated column.

    Thanks so much for your help in advance.

    Wednesday, December 9, 2015 6:59 PM