none
getting an average but not including top and bottom 5% RRS feed

  • Question

  • Hello,
    I'm working on a web app that gets data from sql server 2008 and have really only had to do very basic TSQL stuff to get what I need. Now I have to do something that is a little more complicated, at least for my TSQL experience level.

    I need to get an average, but not include a portion of data from the low and high extremes.

    TableA
    ----------
    PKJobID
    StartTime
    EndTime

    in example table above StartTime and EndTime are both datetime fields. What I need to do is calculate the average time it takes for a job to complete based on its start and end times, but not include the longest 5% and the shortest 5%. Could anyone show a TSQL example of how to accomplish this?
    Friday, August 28, 2009 8:32 PM

Answers

  • I would try something like this:


    declare @rowsCount int
    declare @first_row int
    declare @last_row int
    
    select @rowsCount =  count(*) from TableA
    set @first_row = 0.05*@rowsCount -- integer part, i.e.: if @rowsCount = 25, @first_row = floor(1.25) = 1
    set @last_row = @rowsCount - @first_row 
    
    select @first_row,@last_row
    
    select avg(t.execTimeInMiliseconds)
    from 
    (select datediff(ms,endtime,starttime) as execTimeInMiliseconds,row_number() over (order by datediff(ms,endtime,starttime)) as rownum from tableA) t
    where t.rownum between @first_row and @last_row



    Tell me if it works - I'm too lazy to create this table, fill it with sample data and check if syntax is ok ;-)

    Piotr
    • Proposed as answer by Brad_SchulzModerator Friday, August 28, 2009 11:23 PM
    • Marked as answer by c0pe Monday, August 31, 2009 3:11 PM
    Friday, August 28, 2009 9:11 PM
  • select avg(execTimeInMiliseconds)
    from
    (select datediff(ms,endtime,starttime) as execTimeInMiliseconds,ntile(20) over (order by datediff(ms,endtime,starttime)) as s from tableA) as t
    where s between 2 and 19
    • Proposed as answer by Brad_SchulzModerator Sunday, August 30, 2009 3:14 PM
    • Marked as answer by c0pe Monday, August 31, 2009 7:05 PM
    Sunday, August 30, 2009 2:03 PM
  • I did the GROUP BY because I misunderstood your table structure... I didn't think it was a unique ID, I thought it was a JobID (that referenced another table).

    Here's an example addressing what you just described above, though I kept the StartTime/EndTime concept:

    use tempdb
    go
    create table TableA (StartTime datetime,EndTime datetime)
    insert TableA select '00:00:00','00:00:10'  --10 seconds
        union all select '00:00:20','00:00:35'  --15
        union all select '00:01:00','00:01:20'  --20
        union all select '00:01:30','00:01:42'  --12
        union all select '00:02:00','00:02:15'  --15
        union all select '00:02:20','00:02:22'  --2
        union all select '00:02:30','00:02:32'  --2
        union all select '00:03:00','00:04:15'  --75
        union all select '00:05:00','00:05:21'  --21
        union all select '00:05:30','00:05:45'  -15
    go
    
    select AvgTimeOfAllEntries = avg(datediff(second,StartTime,EndTime))
    from TableA 
    /*
    AvgTimeOfAllEntries
    -------------------
                     18
    */
    
    ;with JobTimes as
    (
      select JobTime=datediff(second,StartTime,EndTime)
      from TableA
    )
    ,JobStats as
    (
      select MinTime=min(JobTime)
            ,MaxTime=max(JobTime)
            ,TimeRange=max(JobTime)-min(JobTime)+1
      from JobTimes
    )
    select AvgTimeExcluding5PctExtremes = avg(JobTime)
    from JobTimes,JobStats
    where JobTime between MinTime+TimeRange*0.05 and MaxTime-TimeRange*0.05
    /*
    AvgTimeExcluding5PctExtremes
    ----------------------------
                              15
    */
    

    So what I'm doing is using a CTE (Common Table Expression) to set up a "derived table" of Job Times... it just calculates the seconds between StartTime and EndTime.  Then I create a second CTE that calculates the Minimum Time found in the first CTE, and the Maximum Time found, and the Range of all the times.  Finally, I connect the two of them (FROM JobTimes,JobStats) and ONLY pull out the ones with the JobTimes that exclude the 5% extremes.

    If you want to see how it works, execute this query, which is pretty much the same, but it just lists the columns (instead of calculating the average) and it DOES NOT exclude the extremes:

    ;with JobTimes as
    (
      select JobTime=datediff(second,StartTime,EndTime)
      from TableA
    )
    ,JobStats as
    (
      select MinTime=min(JobTime)
            ,MaxTime=max(JobTime)
            ,TimeRange=max(JobTime)-min(JobTime)+1
      from JobTimes
    )
    select *,RangeStart=MinTime+TimeRange*0.05,RangeEnd=MaxTime-TimeRange*0.05 
    from JobTimes,JobStats
    /*
        JobTime     MinTime     MaxTime   TimeRange  RangeStart  RangeEnd
    ----------- ----------- ----------- ----------- ----------- ---------
             10           2          75          74        5.70     71.30
             15           2          75          74        5.70     71.30
             20           2          75          74        5.70     71.30
             12           2          75          74        5.70     71.30
             15           2          75          74        5.70     71.30
              2           2          75          74        5.70     71.30
              2           2          75          74        5.70     71.30
             75           2          75          74        5.70     71.30
             21           2          75          74        5.70     71.30
             15           2          75          74        5.70     71.30
    */
                            

    So you can see here that the RangeStart and RangeEnd are 5.70 thru 71.30.  The WHERE clause in the first query I posted pulls out ONLY those times that fall into that range... so the JobTime values of 2 and 2 and 75 are eliminated when calculating the AVG.

    Hope this helps...


    --Brad (My Blog)
    • Marked as answer by c0pe Monday, August 31, 2009 7:03 PM
    Monday, August 31, 2009 5:31 PM
    Moderator

All replies

  • I would try something like this:


    declare @rowsCount int
    declare @first_row int
    declare @last_row int
    
    select @rowsCount =  count(*) from TableA
    set @first_row = 0.05*@rowsCount -- integer part, i.e.: if @rowsCount = 25, @first_row = floor(1.25) = 1
    set @last_row = @rowsCount - @first_row 
    
    select @first_row,@last_row
    
    select avg(t.execTimeInMiliseconds)
    from 
    (select datediff(ms,endtime,starttime) as execTimeInMiliseconds,row_number() over (order by datediff(ms,endtime,starttime)) as rownum from tableA) t
    where t.rownum between @first_row and @last_row



    Tell me if it works - I'm too lazy to create this table, fill it with sample data and check if syntax is ok ;-)

    Piotr
    • Proposed as answer by Brad_SchulzModerator Friday, August 28, 2009 11:23 PM
    • Marked as answer by c0pe Monday, August 31, 2009 3:11 PM
    Friday, August 28, 2009 9:11 PM
  • It depends on what you mean by "longest 5%" and "shortest 5%".

    For example, let's say a job has the following times:  1 sec, 7 sec, 8 sec, 9 sec, 20 sec. 

    The range of that is 1 to 20, which is a total range of 20 seconds (i.e. the max time minus the min time +1:  20-1+1)

    So do you want to get rid of anything that is at the extreme 5%'s?  In other words, 5% of the range of 20 is 20*0.05 = 1.00, so if anything is outside the range of min+1.00 to max-1.00, or 1+1.00 to 20-1.00, or 2 to 19, then it should be thrown out?  Then that would get rid of the 1 and 20 in our sample, leaving only the 7, 8, and 9.  And the average of those is 8.

    (However, if you average ALL 5 values, you get an average of 9).

    Is this what you're looking for?

    --Brad (My Blog)
    Friday, August 28, 2009 9:30 PM
    Moderator
  • Also, are you excluding the longest/shortest 5% of ALL rows in the table, or is it on a job-by-job basis?

    --Brad (My Blog)
    Friday, August 28, 2009 9:32 PM
    Moderator
  • If what I described above is what you want, then I think this will do it for you (it is untested... just off the top of my head):

    (Note this only works in SQL2005 and beyond):

    ;with JobTimes as
    (
      select PKJobID
            ,JobTime=datediff(second,StartTime,EndTime)
      from TableA
    )
    select PKJobID
           AvgTimeExcluding5PctExtremes = avg(JobTime)
    from JobTimes A
    cross apply (select MinTime=min(JobTime)
                       ,MaxTime=max(JobTime)
                       ,TimeRange=max(JobTime)-min(JobTime)+1
                 from JobTimes B
                 where PKJobID=A.PKJobID) F
    where JobTime between MinTime+TimeRange*0.05 and MaxTime-TimeRange*0.05
    group by PKJobID
    



    --Brad (My Blog)
    Friday, August 28, 2009 9:36 PM
    Moderator
  • hi, thanks for the replies.

    to clarify what I mean by shortest and longest 5%. Lets say there are 100 records in the table, each record represents a job. 5% of 100 is 5, so from this list of 100 records, sorted by how long the job took, I would not want to use the first or last 5 from the list. So its based on job, not min and max times.

    Any explanation you could give with the examples would be greatly appreciated also.

    thanks!
    Friday, August 28, 2009 9:56 PM
  • I think, based on your reply, that Piotr's post does exactly what you're looking for.

    --Brad (My Blog)
    Friday, August 28, 2009 10:07 PM
    Moderator
  • thanks again,

    I will try Piotr's approach on Monday... I'm out for now.
    Friday, August 28, 2009 10:13 PM
  • That's excellent, Peso... A perfect use for NTILE.
    --Brad (My Blog)
    Sunday, August 30, 2009 3:14 PM
    Moderator
  • I don't know why... but there's something strange there because there are only 6 people... should be 10.

    Sometimes when maintenance is done on the site (like it was last Friday night), things sometimes get reset or just downright out of whack.

    I'll see if I can ask around and find out.
    --Brad (My Blog)
    Monday, August 31, 2009 1:53 PM
    Moderator
  • Peso, thanks for the reply. I'm going to read up on 'ntile' and 'over', which I have never used, in sql books online, but could you provide an explanation as to how your solution works... I've really only done basic SELECT, UPDATE, DELETE queries with WHERE clauses... since this may be a part of my app I really want to understand it.

    thanks!
    Monday, August 31, 2009 1:54 PM
  • Hi Piotr, this seems to work, no syntax errors... I did not do a manual check of the math though... I'll need to get a smaller set of data to do that on, hehe. I do want to understand this and the only part I don't currently really understand is the 'row_number()' and 'over' parts... I have never used them. I took different parts of this and executed seperately to see what each is doing and so I understand the end result of each part, I am just not totally solid on 'row_number()' and 'over'... I think books online is a little over my head on that topic, I read it, and still am unsure. Anyway you could provide a basic explanation?

    thanks again for the great solution. Now I just want to be sure I totally understand it so that I can use again in the future without simply 'copy/pasting'.
    Monday, August 31, 2009 3:11 PM
  • Hi Brad,

    I think you were ahead of me with your original reply... Piotr gave me exactly what I asked for, but what you described is actually what I *should* be doing to more accurately meet my goal. Basically I need to take an average but I want to remove both extreme ends of the spectrum to get a more realistic/accurate average. What I'm averaging is the time a job takes to complete given start and end times. But for the sake of simplicity, lets just say I have a table with two fields, key and value.

    SampleTable
    -------------
    1, 10
    2, 15
    3, 20
    4, 12
    5, 15
    6, 2
    7, 2
    8, 75
    9, 21
    10, 15

    I want to get an average not including both low and high extreme ends. I will wind up playing around with the working query to see what percent is a good one for me to use in my case as for the percent to clip off the top and bottom, but anything is fine for the example. I think your solution is what I'm looking for, except for the grouping. Could you give another example using this SampleTable that would use a percent to clip off the extreme low and high? Any explanation would be appreciated as well, as I need to understand anything I wind up using. After some studying I do understand Piotr's code completely now, but realized its not what I should use... a good learning experience none the less though.
    Monday, August 31, 2009 4:49 PM
  • row_number() is a function which returns numbered set of rows ;-) I don't know how to explain it better so I will give You an example:

    Let's say we have a table "Person" and it has 3 columns:
    Id, Name and DateOfBirth

    we have 5 rows:

    (1,'John','1980-10-03')
    (3,'Mary','1985-11-23')
    (4,'Tom','1982-05-07')
    (8,'Phil','1977-02-15')
    (11,'Lilly','1954-10-14')

    when You use row_number() as an additional column You will get numbers for those rows:

    select *,row_number() [over...(will explain it below)] as rownum from Person
    as a result You will get:

    (1,'John','1980-10-03',1 )
    (3,'Mary','1985-11-23',2 )
    (4,'Tom','1982-05-07',3 )
    (8,'Phil','1977-02-15',4 )
    (11,'Lilly','1954-10-14',5 )

    so You have another column which You can use to filter Your resultset (You must use temporary named result set - another name for this is Common Table Expression - CTE).
    Why? let's say You want to know who is 4th oldest person You can simply order by DataOfBirth and then add condition: where rownum = 4 (rownum is an allias for numbers column)

    But we can't order by DateOfBirth (asc) affter applying row_number() because we would get this:

    (11,'Lilly','1954-10-14',5 )
    (8,'Phil','1977-02-15',4 )
    (1,'John','1980-10-03',1 )
    (4,'Tom','1982-05-07',3 )
    (3,'Mary','1985-11-23',2 )

    so instead of Tom You would get Phil as 4th oldest person which is not correct.
    in order to get correct row numbers You use over (order by column_name)
    in this case it would be:

    select *,row_number() over (order by DateOfBirth) as rownum from Person

    to get 4th oldest person syntax is:
    WITH cte  --alias for temporary named result set
    as
    (
       select *,row_number() over (order by DateOfBirth) as rownum from Person
    )
    Select * from cte where rownum = 4

    I hope than makes it a bit more clear.

    But to be fair... I think that Peso's answer is simplier than mine and it's more elegant ;-) And You should mark it as an answer as well :-)
    I forgot about function NTILE (bnever actually used it)

    NTILE(NN) is a function that works similar to ROW_NUMBER() but it "divides" table into NN parts - every row in a part has the same number. I.e.: If we have a table with 100 rows and use NTILE(4) You will have 25 rows with number 1, 25 with numer 2, 25 threes and 25 fours - and then to select 50% rows from the middle of Your table You could take only twos and threes.





    Monday, August 31, 2009 4:57 PM
  • I did the GROUP BY because I misunderstood your table structure... I didn't think it was a unique ID, I thought it was a JobID (that referenced another table).

    Here's an example addressing what you just described above, though I kept the StartTime/EndTime concept:

    use tempdb
    go
    create table TableA (StartTime datetime,EndTime datetime)
    insert TableA select '00:00:00','00:00:10'  --10 seconds
        union all select '00:00:20','00:00:35'  --15
        union all select '00:01:00','00:01:20'  --20
        union all select '00:01:30','00:01:42'  --12
        union all select '00:02:00','00:02:15'  --15
        union all select '00:02:20','00:02:22'  --2
        union all select '00:02:30','00:02:32'  --2
        union all select '00:03:00','00:04:15'  --75
        union all select '00:05:00','00:05:21'  --21
        union all select '00:05:30','00:05:45'  -15
    go
    
    select AvgTimeOfAllEntries = avg(datediff(second,StartTime,EndTime))
    from TableA 
    /*
    AvgTimeOfAllEntries
    -------------------
                     18
    */
    
    ;with JobTimes as
    (
      select JobTime=datediff(second,StartTime,EndTime)
      from TableA
    )
    ,JobStats as
    (
      select MinTime=min(JobTime)
            ,MaxTime=max(JobTime)
            ,TimeRange=max(JobTime)-min(JobTime)+1
      from JobTimes
    )
    select AvgTimeExcluding5PctExtremes = avg(JobTime)
    from JobTimes,JobStats
    where JobTime between MinTime+TimeRange*0.05 and MaxTime-TimeRange*0.05
    /*
    AvgTimeExcluding5PctExtremes
    ----------------------------
                              15
    */
    

    So what I'm doing is using a CTE (Common Table Expression) to set up a "derived table" of Job Times... it just calculates the seconds between StartTime and EndTime.  Then I create a second CTE that calculates the Minimum Time found in the first CTE, and the Maximum Time found, and the Range of all the times.  Finally, I connect the two of them (FROM JobTimes,JobStats) and ONLY pull out the ones with the JobTimes that exclude the 5% extremes.

    If you want to see how it works, execute this query, which is pretty much the same, but it just lists the columns (instead of calculating the average) and it DOES NOT exclude the extremes:

    ;with JobTimes as
    (
      select JobTime=datediff(second,StartTime,EndTime)
      from TableA
    )
    ,JobStats as
    (
      select MinTime=min(JobTime)
            ,MaxTime=max(JobTime)
            ,TimeRange=max(JobTime)-min(JobTime)+1
      from JobTimes
    )
    select *,RangeStart=MinTime+TimeRange*0.05,RangeEnd=MaxTime-TimeRange*0.05 
    from JobTimes,JobStats
    /*
        JobTime     MinTime     MaxTime   TimeRange  RangeStart  RangeEnd
    ----------- ----------- ----------- ----------- ----------- ---------
             10           2          75          74        5.70     71.30
             15           2          75          74        5.70     71.30
             20           2          75          74        5.70     71.30
             12           2          75          74        5.70     71.30
             15           2          75          74        5.70     71.30
              2           2          75          74        5.70     71.30
              2           2          75          74        5.70     71.30
             75           2          75          74        5.70     71.30
             21           2          75          74        5.70     71.30
             15           2          75          74        5.70     71.30
    */
                            

    So you can see here that the RangeStart and RangeEnd are 5.70 thru 71.30.  The WHERE clause in the first query I posted pulls out ONLY those times that fall into that range... so the JobTime values of 2 and 2 and 75 are eliminated when calculating the AVG.

    Hope this helps...


    --Brad (My Blog)
    • Marked as answer by c0pe Monday, August 31, 2009 7:03 PM
    Monday, August 31, 2009 5:31 PM
    Moderator
  • thank you so much for the detailed explanation, I really appreciate it.... and I understand now. :)
    Monday, August 31, 2009 7:03 PM
  • got it, thank you for the detailed explanation, I really appreciate it. :)
    Monday, August 31, 2009 7:04 PM