none
modeling facts involved in multi overlapping date based relationships RRS feed

  • Question

  • Hi we run 2017 STD. I have an environment with a Project Starts entity/table, a Staff Team Schedule entity/table and a Non Project entity/table.  I'd like to build relationships amongst them to support the ability to quickly update a downstream mart where the things they represent are consolidated.  Basically resource id and any overlap in from and to dates relates them.

    In each table, by far the most critical fields outside of the primary keys are start and end times of sql server data type datetime2(7).  I show more below in the code block.

    Non Project "durations" take productive time away from Projects (Starts).

    The relationships that make the most sense to me from a simplicity perspective are Project Starts <->Staff Team Schedule  and Project Starts<->Non Project.  The important thing is that these would be many to many.  Its also important that I be able to flag changes in this data to aid our incremental etl in performing very fast.   I usually use row versions to help with stuff like that.

    I just read some documentation on the graph database.  It would have been nice to use that technology here except for 1) no temporal capability in graph db, 2) the functions are too new to us so we'd feel a little at risk, 3) I've read complaints about their ability to be extracted from anywhere but ssms, ie errors get generated if you select info from them in other viz tools.

    I dont think hierarchy id's would work here.   I think xml  or json docs would be doable in spite of the many to many relationships but again we don't really want to use the languages necessary to shred those technologies. 

    so i'm thinking about temporal junction records amongst the entities after adding a project starts table at the day granularity (so a 2 day project duration would have 2 such records) level.  starts would be related to this new table 1 to many.  and it would really be this new table that participates in the relationships I described earlier.   A junction record would simply be a pk col for each entity being related along with all the usual row versioning and special dates you see in the code block.  There would be no such thing as a physical delete.

    The reason I'm posting here is to learn if the community can recommend something beside junction records to create these relationships in what feels like a warehouse being set up.  Maybe something really fast.  There are 3.5 million project starts, 62 million non project items, 5.7 million team schedules.  I was thinking about one table instead of multiple junction records but I think the redundancy would be unwieldy.  

    CREATE TABLE [dbo].[Project_Starts](
        [ID] [int] IDENTITY(1,1) NOT NULL,
    	[Project_Id] [int] NOT NULL,
    	[Start_Time] datetime2(7) NULL,
    	[End_Time] datetime2(7) NULL,
    	Resource_Id int,
    	[RowVersion] [timestamp] NOT NULL,
    	[RVcast]  AS (CONVERT([bigint],[Rowversion])),
    	[RecordChangeDate] [datetime2](7) NULL,
    	[RecordDeleteDate] [datetime2](7) NULL,
    	[RecordPendedDate] [datetime2](7) NULL,
    	[RecordCreateDate] [datetime2](7) GENERATED ALWAYS AS ROW START NOT NULL,
    	[RecordEndDate] [datetime2](7) GENERATED ALWAYS AS ROW END NOT NULL,
    PRIMARY KEY CLUSTERED 
    (
    	[ID] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
    	PERIOD FOR SYSTEM_TIME ([RecordCreateDate], [RecordEndDate])
    ) ON [PRIMARY]
    WITH
    (
    SYSTEM_VERSIONING = ON ( HISTORY_TABLE = [dbo].[Project_Starts_temporalhistory] )
    )
    GO
    
    
    CREATE TABLE [dbo].[Staff_Team_Schedule](
        [ID] [int] IDENTITY(1,1) NOT NULL,
    	Schedule_Id int,
    	[Start_Time] datetime2(7) NULL,
    	[End_Time] datetime2(7) NULL,
    	[Team_Name] varchar(100),
    	Resource_Id int,
    	[RowVersion] [timestamp] NOT NULL,
    	[RVcast]  AS (CONVERT([bigint],[Rowversion])),
    	[RecordChangeDate] [datetime2](7) NULL,
    	[RecordDeleteDate] [datetime2](7) NULL,
    	[RecordPendedDate] [datetime2](7) NULL,
    	[RecordCreateDate] [datetime2](7) GENERATED ALWAYS AS ROW START NOT NULL,
    	[RecordEndDate] [datetime2](7) GENERATED ALWAYS AS ROW END NOT NULL,
    PRIMARY KEY CLUSTERED 
    (
    	[ID] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
    	PERIOD FOR SYSTEM_TIME ([RecordCreateDate], [RecordEndDate])
    ) ON [PRIMARY]
    WITH
    (
    SYSTEM_VERSIONING = ON ( HISTORY_TABLE = [dbo].[Staff_Team_Schedule_temporalhistory] )
    )
    GO
    
    
    CREATE TABLE [dbo].[Non_Project](
        [ID] [int] IDENTITY(1,1) NOT NULL,
    	[Non_Project_Id] [int] NOT NULL,
    	[Description] varchar(100),
    	[Start_Time] datetime2(7) NULL,
    	[End_Time] datetime2(7) NULL,
    	Resource_Id int,
    	[RowVersion] [timestamp] NOT NULL,
    	[RVcast]  AS (CONVERT([bigint],[Rowversion])),
    	[RecordChangeDate] [datetime2](7) NULL,
    	[RecordDeleteDate] [datetime2](7) NULL,
    	[RecordPendedDate] [datetime2](7) NULL,
    	[RecordCreateDate] [datetime2](7) GENERATED ALWAYS AS ROW START NOT NULL,
    	[RecordEndDate] [datetime2](7) GENERATED ALWAYS AS ROW END NOT NULL,
    PRIMARY KEY CLUSTERED 
    (
    	[ID] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
    	PERIOD FOR SYSTEM_TIME ([RecordCreateDate], [RecordEndDate])
    ) ON [PRIMARY]
    WITH
    (
    SYSTEM_VERSIONING = ON ( HISTORY_TABLE = [dbo].[Non_Project_temporalhistory] )
    )
    GO




    • Edited by db042190 Thursday, December 5, 2019 1:04 PM changed title slightly
    Wednesday, November 27, 2019 6:00 PM

All replies

  • I might delete this post because the hierarchical relationships I described while valid, aren't really justified in a data model for this problem. It seems to me cutting to the chase as I show below is the way to accomplish my goals which are to model project starts as "durations" during which productive things should have happened but "sliceable" on staff dimensions and Non Project stuff as "durations" during which productive things were interrupted and sliceable on 1) staff dimensions and 2) the productive things (project # ids) they interrupted.

    I should be able to aggregate the durations.  There are other dimensions including resource id which was shown in the original post.  There is actually another fact table of "counts" of deliverables produced during the project starts but it wasn't mentioned because its so simple.  And there are standard conversion rates that allow different projects to be compared and aggregated together.  And there are goals which allow for variance calculations. 

    So how would I relate a project start to multiple staff shifts?  By artificially introducing multiple records for the same project start that coincide with the staff schedule start and stop times a particular slice (fact record) of project start relates to.   There would be similar artificial generation for Non Project durations in both of its date based relationships.

    Basically I'd have a star schema whose fact records are artificially generated to guarantee a one to one relationship with their dims.  And row versions to help trigger incremental updates to downstream technologies like tabular etc. 

    So what if the user slices on date filters that don't exactly coincide with the artificially generated start and stop times?  I'm picturing some prorating being done on such boundary data points when visuals are being generated.




    • Edited by db042190 Tuesday, December 3, 2019 7:47 PM more details
    Tuesday, December 3, 2019 7:21 PM
  • I found the info at https://stackoverflow.com/questions/325933/determine-whether-two-date-ranges-overlap very useful. especially the period relationship enum and associated bar graphic shown by user687474.  I don't know how well this will carry over from 2 to n overlapping date based relationships which I expect in Non Project durations but i'll cross that bridge when I come to it.

    In the mean time I've put together the "splitting" code for project start durations shown in the code block.  it doesn't include some incremental change candidate key filters that should help performance.  ps relates to project starts.  ts to team schedules.  for now the data is flattened rather than shaped in a star schema.  for now the date relationships are purposely left as raw and dumbed down as much as possible for extensibility purposes.  notice the dash based graphic right of the date relationship case statement, this would allow the developer to see what overlap relationship is being represented in DateRangeRelationship.  These may align better if you paste the query into ssms.

    I expect performance to be pretty bad during initial load.

    declare @dt datetime = getdate() ;with DRR as ( select 'start inside' DateRangeRelationship,'ps start - ts end' split --notice similarity to enclosing start touching --union all --select 'start inside' DateRangeRelationship,'ts end - ps end' split union all select 'inside start touching' DateRangeRelationship,'ps start - ps end' split --notice similarity to exact match and inside and inside end touching union all select 'enclosing start touching' DateRangeRelationship,'ps start - ts end' split --union all --select 'enclosing start touching' DateRangeRelationship,'ts end - ps end' split --union all --select 'enclosing' DateRangeRelationship,'ps start - ts start' split union all select 'enclosing' DateRangeRelationship,'ts start - ts end' split --union all --select 'enclosing' DateRangeRelationship,'ts end - ps end' split --union all --select 'enclosing end touching' DateRangeRelationship,'ps start - ts start' split --notice similarity to end inside union all select 'enclosing end touching' DateRangeRelationship,'ts start - ps end' split union all select 'exact match' DateRangeRelationship,'ps start - ps end' split union all select 'inside' DateRangeRelationship,'ps start - ps end' split union all select 'inside end touching' DateRangeRelationship,'ps start - ps end' split --union all --select 'end inside' DateRangeRelationship,'ps start - ts start' split union all select 'end inside' DateRangeRelationship,'ts start - ps end' split

    union all
    select 'missing team schedule' DateRangeRelationship,'ps start - ps end' split ) select y.*, drr.split, case when drr.split like 'ps start -%' then ps_starttime when drr.split like 'ts end -%' then ts_endtime when drr.split like 'ts start -%' then ts_starttime end split_starttime, case when drr.split like '%- ts end' then ts_endtime when drr.split like '%- ps end' then ps_EndTimeOverride when drr.split like '%- ts start' then ts_starttime end split_endtime from ( select x.*, --------------- PS case when ts_starttime < ps_starttime and ts_endtime < ps_EndTimeOverride then 'start inside' ------ ts when ts_starttime = ps_starttime and ts_endtime > ps_EndTimeOverride then 'inside start touching' -----------------ts when ts_starttime = ps_starttime and ts_endtime < ps_EndTimeOverride then 'enclosing start touching' ------ ts when ts_starttime > ps_starttime and ts_endtime < ps_EndTimeOverride then 'enclosing' ------ ts when ts_starttime > ps_starttime and ts_endtime = ps_EndTimeOverride then 'enclosing end touching' ----------- ts when ts_starttime = ps_starttime and ts_endtime = ps_EndTimeOverride then 'exact match' --------------- ts when ts_starttime < ps_starttime and ts_endtime > ps_EndTimeOverride then 'inside' --------------------ts when ts_starttime < ps_starttime and ts_endtime = ps_EndTimeOverride then 'inside end touching' ------------------ ts when ts_starttime > ps_starttime and ts_endtime > ps_EndTimeOverride then 'end inside'                             ----TS
    else 'missing team schedule' end DateRangeRelationship --                                             ???????????????  TS from ( select ps.project_id, ts.schedule_id, ps.resource_id, ps.start_time ps_starttime, ps.end_time ps_endtime, ts.start_time ts_starttime, ts.end_time ts_endtime, ts.team_name, case when ps.end_time < '12/31/9999' then ps.end_time else @dt end ps_EndTimeOverride from project_starts ps left join staff_team_schedule ts on ps.resource_id=ts.resource_id and --this elegant pair of conditions works biz wise in my case. sql is currently doing a clustered index -- scan on both which i hope will change. ps.start_time < ts.end_time and case when ps.end_time = '12/31/9999' then @dt else ps.end_time end > ts.start_time where ps.recorddeletedate is null and ts.recorddeletedate is null ) x ) y join DRR on y.DateRangeRelationship = drr.DateRangeRelationship --order by project_id,ts_starttime


      





    • Edited by db042190 Wednesday, December 11, 2019 6:13 PM team should have been a left join if missing items are to be recorded properly
    Friday, December 6, 2019 3:49 PM
  • the gap filling exercise started at https://social.msdn.microsoft.com/Forums/en-US/9426fd72-ea19-431c-9473-840e13888891/finding-gaps-in-date-ranges?forum=transactsql would be my next step which i'll try at another time.  Now I'm going see what happens when there are 3 date overlaps possible. Specifically non project, team and project. non project would be the anchor.   project was the anchor in the 1st exercise.  I'll post back here.
    Wednesday, December 11, 2019 12:22 PM
  • I looked at this briefly and think if you have n overlapping entities with date ranges, you have to (maybe should) apply the join I showed earlier n - 1 times.  So for non project overlapped with project starts overlapped with team schedules, I'd need to do the join twice.  Or come up with some sort of left join based recursion having each pass incorporate a different table and having each write record a recursion level number. Only recursion level 3's would be of interest when its done.

    Even if there was something slicker that could be done in one non recursive query, I suspect the cardinality would be a show stopper in our environment.  A recursion might also be a show stopper.

    It also seems that if you don't apply gap filling (perhaps always after stacking) after each of the first n-2 iterations, that problems (more and more missing pieces of info) will propagate down in a more and more undesirable way.

    With that said I'm not sure that the cte called DRR wasn't overkill.  If I keep it, and really want to maintain its auditability potential, i'd have to add a union all for each missing x where x is another non anchor table in the splitting algorithm.  and somehow deal with the fact that the anchor in pass 1 is now np, the left join is on perhaps ts, and the anchor in pass 2 is a permutation of the latter and the left join would then be on ps.

     
    • Edited by db042190 Wednesday, December 11, 2019 6:34 PM spelling and more reasons why the cte may not be useful
    Wednesday, December 11, 2019 6:12 PM
  • i coded this last overlap as a separate join. Its all pretty ugly and i have only 3 overlapping entities. If i had added stacking and gap filling it would have been much uglier.

    short of making this a black box or buying a 3rd party solution, i cant see our staff wanting to support something like this.  this felt like a rabbit hole and i have to believe there is a better way. 

    I do not see a 3rd party solution to this.  Instead, it appears this may be a sub feature of 3rd party software out there but not the main purpose of any software out there. 

     
    • Edited by db042190 Friday, December 13, 2019 3:19 PM more on 3rd party
    Friday, December 13, 2019 1:13 PM