How to create an incrementing sub counter for "sort by column" purposes

  • Question

  • OK, so this is something I could easily solve in SQL Server in my case, but it's interesting that it seemed to be tricky in DAX.

    I have a table ComponentPeriod. In it we have the combination of a component (e.g. A, B, C) and a period (2014 Q1, 2014 Q2, 2014 January, etc.).

    I want the periods to be in descending order (2015 Q4, 2015 Dec, 2015 Nov, 2015 Oct, 2015 Q3 ... etc), so I need to create a sequential number series to make this happen (the client tools can only sort by a single column, so I guess the technique I'm looking for is used a lot to produce these kinds of "order by" columns).

    I have done this in the past using

    Period_Sequence =
    CALCULATE (
        COUNTROWS ( Period ),
        FILTER (
            Period,
            Period[Start_Date] <= EARLIER ( Period[Start_Date] )
                && Period[Duration_String] = EARLIER ( Period[Duration_String] )
        )
    )


    This was fine when I was referring to a table where periods were distinct, but now I have denormalised this into ComponentPeriod, so I need something a little more sophisticated.

    What's the best way to get a sequence, perhaps with some partitions, across a subset of distinct columns (I guess using SUMMARIZE or similar)?

    e.g. I want:

    Period_Name, Type, Sequence
    2015 Q4, Quarter, 1
    2015 Dec, Month, 2
    2015 Nov, Month, 3
    2015 Oct, Month, 4
    2015 Q3, Quarter, 5
    2015 Sep, Month, 6
    etc.

    Note that there may be multiple records in ComponentPeriod that have the period 2015 Q4, but I want them all to have the same Sequence value of 1.

    I've got as far as:

    Period_Sequence Desc =
    CALCULATE (
        COUNTROWS ( SUMMARIZE ( ComponentPeriod, ComponentPeriod[Period_End_Date] ) ),
        FILTER (
            ComponentPeriod,
            ComponentPeriod[Period_End_Date] >= EARLIER ( ComponentPeriod[Period_End_Date] )
        )
    )

    But this doesn't distinguish between the different types.

    I need an equivalent of the T-SQL:

    ROW_NUMBER() OVER (ORDER BY Period_End_Date DESC, CASE Duration_String WHEN 'P3M' THEN 1 WHEN 'P1M' THEN 2 END ASC)

    Thanks for any tips!

    Monday, November 16, 2015 5:09 PM

Answers

  • Before any discussion toward the (well-defined, thank you for that) requirement, I'd like to ask why you're combining date granularities like you are. I know this may seem a bit hypocritical, since I just told you in another thread to store the several granularities in a single fact table.

    These situations are very different, though. In the other thread you are collapsing a logical header/detail fact table combination into a single denormalized fact table at the finer grain of the detail table.

    In this construction you are denormalizing your date hierarchy into a fact table. This is highly abnormal. My question would be why you need to pre-aggregate your quarter values when they can, presumably, be calculated by an aggregation of your monthly values (or indeed, from daily values if those are also being recorded).

    The primary goal should be to maintain your fact table at the finest granularity you can, and use aggregates and measures to roll up to coarser granularities. When we denormalize a header/detail pair, we achieve this - we store everything at the detail level, even if it necessitates repeating header values.

    When you denormalize a dimensional hierarchy like this, you impose a large cognitive (and potentially run-time speed) overhead in calculating even trivial measures, because they now must be hierarchy-aware and evaluate conditionals at run-time.

    Secondly, across a lot of your recent questions it looks like you're trying to simultaneously normalize and denormalize data. I'm not trying to be critical, but to better understand your needs and hopefully be helpful. You are building these sorts of bridge tables, which we'd normally expect for a M2M relationship, but your models are pretty easy to maintain as a star schema. In a dimensional model, the fact table is typically sufficient to track the relationships between all of the dimensions. We know that some combination of dimension values is valid when it appears in the fact table, and we don't need to maintain this relationship exhaustively elsewhere. Without a more thorough understanding of your model and reporting requirements, I can't say anything for certain, and there are situations where this is necessary, but I wouldn't be doing my job (I work as a BI consultant - the forums are mainly for practice and when I'm bored) if I didn't ask these sorts of questions and challenge the design.

    Thirdly, I appreciate a challenge as much as anyone, and that's usually why I'm on here. That being said, it's usually best not to fight your tools. The sort of requirement you have is exactly the sort of thing that SQL excels at and that can be quite difficult in DAX. If you can solve your problem in SQL, I'd suggest doing that. I would always implement this sort of index in SQL rather than DAX if I were working on this project. DAX is primarily an aggregation and analytical query language, not well suited for row-based queries or logic. ROW_NUMBER() is by definition working on rows, and is the sort of thing SQL is very well optimized for.

    All that being said, here's some DAX that'll get you there based on how I understand your requirement. Either of these should work for you (the second variation uses variables, which are available in Power BI Desktop and Excel 2016, one or both of which I believe you are using):
    Period_Sequence = 
    COUNTROWS(
        FILTER(
            // One row per distinct (PeriodEndDate, Quarter) combination, with a
            // combined sort key so each quarter lands ahead of the months it contains
            SUMMARIZE(
                ComponentPeriod
                ,ComponentPeriod[PeriodEndDate]
                ,ComponentPeriod[Quarter]
                ,"EndDate"
                ,ComponentPeriod[PeriodEndDate] + ComponentPeriod[Quarter]
            )
            // Count the combinations at or after the current row's key: the latest
            // period gets 1, and rows sharing a period share the same value
            ,[EndDate] >= EARLIER(ComponentPeriod[PeriodEndDate]) + EARLIER(ComponentPeriod[Quarter])
        )
    )
    
    Period_Sequence_Var =
    VAR VEndDate = ComponentPeriod[PeriodEndDate] + ComponentPeriod[Quarter]
    RETURN
    COUNTROWS(
    	FILTER(
    		SUMMARIZE(
    			ComponentPeriod
    			,ComponentPeriod[PeriodEndDate]
    			,ComponentPeriod[Quarter]
    			,"EndDate"
    			,ComponentPeriod[PeriodEndDate] + ComponentPeriod[Quarter]
    		)
    		,[EndDate] >= VEndDate
    	)
    )

    I find the variable version a bit cleaner than the EARLIER() version. There's some good introductory reading on variables in this article.
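
    As an aside, a RANKX-based variant gives the same result (a sketch only, assuming the same [PeriodEndDate] and [Quarter] columns); DENSE ties keep rows that share a period on the same sequence value:

    Period_Sequence_Rank =
    RANKX(
        // Distinct (PeriodEndDate, Quarter) combinations to rank over
        SUMMARIZE(
            ComponentPeriod
            ,ComponentPeriod[PeriodEndDate]
            ,ComponentPeriod[Quarter]
        )
        // Same combined sort key as above; DESC so the latest period gets 1
        ,ComponentPeriod[PeriodEndDate] + ComponentPeriod[Quarter]
        ,
        ,DESC
        ,DENSE
    )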

    Edit: Adding an image of my sample data to make the formula easier to follow - [Quarter] is just a 1 at the Quarter level, which increases the [Period_End_Date] and makes the sorting much simpler.
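
    As a sketch of how such a flag might be derived - assuming ComponentPeriod has a [Type] column with "Quarter"/"Month" values, like the example output in the question:

    Quarter =
    // Hypothetical helper column: 1 for quarter-level rows, 0 for month-level rows.
    // Assumes a [Type] column as shown in the question's example output.
    IF( ComponentPeriod[Type] = "Quarter", 1, 0 )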



    GNet Group BI Consultant


    • Edited by greggyb Monday, November 16, 2015 6:00 PM
    • Marked as answer by Brett Gerhardi(H) Monday, November 23, 2015 4:44 PM
    Monday, November 16, 2015 5:57 PM

All replies

  • Thanks once again Greg.

    It is always difficult to know how much detail to put in these questions, but I'm happy to explain further, and I appreciate your attention to detail and depth.

    Ultimately, I guess I was searching for a solution to the problem in DAX, as the underlying problem is the restriction in Power BI that requires a single column to sort another column on. As the problem comes from Power BI, I'd quite like to be able to solve it in Power BI, as that seems somewhat nicer; it's really just luck that I happen to have the data in SQL Server. I might not be so lucky next time, but the single-column sort requirement will remain! I'm hungry for knowledge with this tool right now, and when a short search didn't throw up an answer I came here, given how helpful this forum has been and how much I'd already learned from posts here.

    Writing this now, having spent some time away from the problem, I realise that Power Query (M) was probably the better tool for the job to fulfil my architectural goal of being data-source agnostic. I need to look into the solution there a bit further too, perhaps.

    So I now have the relationships as follows, which seems to work quite well as long as I have the two-way relationships set.

    You will see that I have two dimensions, SchemePeriod and ComponentPeriod; these concepts actually come from the source data. A Scheme can have multiple Components, and they can have different Periods (so a Scheme could be a quarter and a Component could be a month inside that quarter). However, Period is a central table that they both share, and it can hold any time interval between two dates.

    The way I'm doing this is as much like a star as I can, as I have learned so far that many-to-many can create problems for the semantic model. As soon as you have some bridge between two points (e.g. SchemePeriod -> Period and ComponentPeriod -> Period), we seem to lose the ability to have the two-way relationships, and then my other functionality is impaired (i.e. a slicer on Period would no longer update based on other dimension slicer values).

    I have built multidimensional models in my time and read the Kimball books, and he was clear that it is fine to denormalise dimension tables and to prefer a star over a snowflake. In this case I'm not sure why you interpret what I'm doing as creating a fact table. I believe I'm just denormalising SchemePeriod/ComponentPeriod to include attributes from other tables in my OLTP model?

    In my model I believe that SchemeParticipant and ComponentParticipant are the fact tables, and all the others are dimensions as such.

    Regarding the simultaneous normalise/denormalise observation: really I am trying to make as few changes to the source data as possible. A couple of times on my learning journey I think I've gone too far (due to an issue which I didn't realise could be solved in a different way, rather than deviating further from the source), and so I like to correct that in my next version and try to implement what I think I've learned ... to check I got it right!

    Taking a step back, the work I'm doing is about evaluating the tools and the compromises so I understand it all. I need to make a lot of mistakes to achieve this, and fortunately for me, Greg, you are around to keep me honest :)

    And again you have delivered. Variables eh.. interesting.. I'll leave that lesson for tomorrow I think!

    Thanks again!

    Monday, November 16, 2015 8:06 PM
  • So, the reason I say you've got OLTP mixed in with your dimensional model goes back to your source.

    It's common to store data as a table of details/attributes about a thing, and then a separate table mapping between two of these. In your source, I'm imagining you have something like your current tables: Scheme, SchemePeriod, Component, ComponentPeriod. This is where my comment about OLTP comes from. The naming patterns and the schema both look like 3NF constructions to me.

    I'll simplify a bit below by only referring to a subset of tables. I'll also make some up, but I hope the names are clear enough.

    If I were designing this model, I would have a fact table, ParticipantMeasure. I would then have as keys in that table PeriodKey, ComponentKey, SchemeKey, and ParticipantKey (note my answer on your other thread about combining facts - this isn't necessary, but it will simplify the current discussion as well).

    Thus I would have the following dimensions: Period, Component, Scheme, Participant.

    What you are doing is replicating the logical dimension of Period into two separate physical tables, those being SchemePeriod and ComponentPeriod. With these tables you have one logical dimension duplicated in two places. This is denormalization, in that you are breaking normal form, but a denormalized model is not a synonym for a dimensional model.

    Dimensions are attributes of facts, and the fact table acts as a bridge between all of the dimensions (technically speaking, you can look at a fact table as a bridge between multiple dimensions, all of which are participating in a classical many-to-many relationship). In dimensional modeling, we don't typically maintain bridge tables unless we have an active many-to-many relationship.

    What you're doing in ComponentPeriod is maintaining a mapping between components and periods. This is no longer a dimension. A dimension is a bag of attributes related to one independent business concept. A bridge table is a mapping between two dimensions. You've got both in one table.

    I understand that a single component can have multiple periods and vice versa, but I'm sure a single achievement can be associated with multiple measure types and vice versa (even if not, your model would support it if it were possible).

    I think your goal is to be able to maintain bi-directional filters on every relationship, because that makes your model easier to reason about, which is fine, but I think it's what is forcing you into the duplicated period dimension.

    I think if you split period out into its own dimension, and then combine your SchemeParticipant and ParticipantMeasure tables into a single flattened fact, you'd find your life a whole lot simpler, and certainly less confusing to any end users (e.g. "Which Period fields do I use? They're in two tables."). The only wrinkle there is that [Reward_Value] would not make sense with some naive aggregations (sum and average would not make sense, though min and max would), but that is easily taken care of with a measure like I provided an example for in your other thread.
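
    As a rough sketch of that kind of measure (the table and key names below are assumptions for illustration, not taken from your model): with a flattened fact that repeats [Reward_Value] across its detail rows, you can sum one value per header-level key instead of summing the raw column:

    Reward Value :=
    SUMX(
        // One row per header-level key (hypothetical FlatFact table and key name),
        // so the repeated [Reward_Value] is only picked up once per header
        VALUES( FlatFact[SchemeParticipantKey] )
        // Context transition yields the single repeated value for that key
        ,CALCULATE( MAX( FlatFact[Reward_Value] ) )
    )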

    I hope this clears up my comments and helps.

    Also, yes, Power Query is a better tool than Power Pivot to do the sort index column in. I was hung up on the SQL vs Power Pivot question.

    GNet Group BI Consultant

    Monday, November 16, 2015 8:44 PM
  • Ok, yes I understand what you are saying.

    One of the things that I am struggling with is: how have the model rules changed in Tabular vs Multidimensional? In Multidimensional, things were pretty strict: you had to have fact and factless tables for bridges, and dimensions were clear. I could still have created the ComponentPeriod/SchemePeriod dimensions there, but in fact I would probably have created a single dimension with the hierarchy Scheme -> Component... and perhaps I should be doing this in Tabular, but currently Power BI doesn't support hierarchies.

    Underlying this is "what is the point of the Tabular model?", which stems from "what was wrong with the Multidimensional model?". My assumption was that Multidimensional was too strict and unfamiliar for "normal" people to understand for self-service BI. So as a result we have more flexibility, and I'm trying to explore that flexibility. You seem to be saying that I should build Tabular models the same way I built Multidimensional models? Is that right? If not, what are the exceptions in your experience?

    The reason why I ended up combining Period into ComponentPeriod is that I wanted to allow users to build visualisations based on both Scheme Periods and Component Periods. This way, rather than having four dimensions in the model (Scheme, Component, Period(Scheme), Period(Component), with the four referencing keys in the ParticipantMeasure fact), I just had two containing the same information (which seemed more user-friendly), and I believed that the dimensional model coped OK with it.

    I'm still not sure about combining the two fact tables together; my feeling is that it may make things more complex as the model evolves and more information needs to be put into it, and I'm still not entirely sure what problem it solves right now. The two related fact tables seem to be working OK so far, so what issue do you think I will hit with this in future?

    My intention with Power BI is to expose this to end users and allow them to build their own visualisations and ask questions in natural language, so I'm trying to make it simple, but also keep development as lightweight as possible so that changes in the underlying source can be reflected with as little work as possible. Hence my to-ing and fro-ing on approach.

    Thanks again Greg

    Tuesday, November 17, 2015 11:21 AM
  • My expectation is that you would have a single period dimension - if that is incorrect then I'd have to revisit my thoughts around the model design.

    That being said, joining fact tables together is a big red flag in my experience, because you then force context to flow through the single most expensive relationship (between a fact table of size X rows, and the detail table of size 2X+ rows). This is a performance red flag, and often a sign of sub-optimal design. There are certainly always exceptions to rules, and I've worked with a data set where fact-fact relationships were the answer, because of scope and time constraints, but eventually we revisited that model and were able to flatten the fact table (in a different manner than I've suggested to you), and performance improved, along with the reasonability (aka ease of being able to reason about) of the model.

    If it turns out that you need to maintain two facts at different granularities (which, if you have two periodkeys, probably does make more sense), then the normal pattern is not to join the facts, but to use conformed dimensions across the facts. There's a good overview of this design pattern in this DAX Patterns article. In this way, we maintain relatively cheap relationships only between dimensions and facts.
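
    To illustrate the shape of that pattern (a sketch only - the fact table names come from earlier in the thread, and the measures are placeholders rather than anything from your actual model): both facts relate to one conformed Period dimension, and a single Period slicer filters both without any fact-to-fact relationship.

    // Hypothetical measures over the two facts, each related to one shared Period dimension
    Scheme Rows := COUNTROWS( SchemeParticipant )
    Component Rows := COUNTROWS( ComponentParticipant )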

    I can tell you that Q&A will have some issues with repeated dimension fields - it should handle most cases just fine, but you will hit unexpected behavior there. Additionally, even with a hand-built report, it would require users to label it appropriately, with a title indicating which period is in consideration. This is going to be something to be aware of and defensive against no matter what if you have multiple copies of period in any form.

    As for Tabular vs Multidimensional. The question you ask is what are the rules. The rules in Tabular are relatively minimal compared to Multidimensional. The engine does not force a specific design implementation. That being said, dimensional models are good not because they conform to Multidimensional design rules, but because they enhance usability of a solution and are easier to reason about. Multidimensional was designed around dimensional modelling concepts, not the other way around. "What can I throw out from Multidimensional?" The answer is nearly everything. "What should I throw out from Multidimensional?" The answer is almost nothing.

    If you can create a single Period dimension with the separate fact tables like in the article I've linked, that would be the easiest model to deal with in my mind, and certainly avoids ambiguity in labels and in Q&A use.

    As for hierarchies, all of Microsoft's visualization tools can deal with arbitrary run-time hierarchies. You can bring any fields from any dimensions into a pivot table's row and column labels or onto the axis of a chart (in Excel or Power View), and you get drill-down capabilities as if it were a defined Multidimensional hierarchy. In Multidimensional, hierarchies are necessary for performance and functional reasons. In Tabular, they're just a group of columns.

    GNet Group BI Consultant

    Tuesday, November 17, 2015 3:37 PM
  • In the normalised view there is just one Period table, but really, while SchemePeriod and ComponentPeriod periods may overlap somewhat, they wouldn't really be compared against each other, so I think it makes more sense for me right now to keep them separate. I'll see how we get on with this approach, and I may revisit it when I start to see the problems that you suspect I will hit.

    I've taken your advice and removed the second fact table, and I am now trying to create all the measures in the most granular table (there were a couple more than in the models I showed, which were left out to keep the problem space clear). It will also get me more familiar with the functions, so it's a good learning approach in addition.

    I understand that really each model should ideally have just one fact table and dimensions in the traditional star-schema arrangement, and I was trying to use the flexibility of the Tabular approach in my core design, which is probably bad form and will likely impair use of this flexibility of the model in the future. Thanks for your guidance on this.

    The problem I had with the two fact tables joined together via their common dimensions was that slicers then had reduced functionality: they didn't remove values that were no longer valid for the measures shown on the report, because they were considering the entire model (including the unused fact table). I'm aware of a new "values" field in the custom Chiclet slicer that I thought afterwards might allow me to produce the same behaviour with a more complex model, but I realised this after I'd changed to the one fact table, so I haven't tried it yet.

    Hierarchies in Multidimensional allowed aggregations to be built based on those defined hierarchies. Is this somehow magically solved in Tabular without them being necessary?

    Thursday, November 19, 2015 11:34 AM
  • Regarding the hierarchy differences in Tabular and Multidimensional. Tabular's engine is designed to perform aggregations very quickly, whereas Multidimensional was being developed in a time of vastly slower processors and insufficient RAM for large data sets to reside in memory - aggregates are a performance feature in Multidimensional. The expectation with Tabular is that all aggregates are calculated at run-time, due to the efficiency of this engine and the fact that all model data is resident in RAM. It's just a different design decision.

    By removing the aggregations defined on hierarchies, much of the reason for hierarchies was removed. We can certainly debate the merits of this decision, but that's the short story behind the differences.

    As for the model, I am glad to have been helpful. Model design is not something that can be completely codified. There are just patterns and problems, and it's sometimes a struggle to figure out the right pattern. I hope my suggestions have a positive effect for your model.

    GNet Group BI Consultant

    Thursday, November 19, 2015 3:23 PM