locked
SQL Agent Jobs failure count Mismatch RRS feed

  • Question

  • Hi All,

    I need some help in the below query. There are 2 queries.

    1st query which will tell me the monthly wise job report which includes columns such as TotalJobExecutions, Success Count, Failure count, Retry count, Cancel count.

    2nd query, is completly concentrated on Failure count. Here i wanted to find out the reasons for the job failures.

    These 2 queries is showing correct output on my local instance. bascially, I installed a sql instance and created 3 jobs and made them fail for few runs. On my local instance, the queries and counts are matching perfectly. But, when I run the set of queries, the counts are mismatching. i.e. The sum(failure count) column in the 1st output doesnt match up with the Sum(totalCount) in the 2nd output.

    Note: also I am considering only last 3 months worth data.i.e. i am interested in last 3 months job failures. Please check with WHERE clause conditions.
    Now I wanted to understand when I ran the same 2 queries in PROD server, why there is count mistmatch. I am completely not able to understand. How can I fix this? Is there anything am I missing in the 2nd query. Also, in the 2nd query I am considering step id =0 (i.e. job outcome) like I did in 1st query.Instead, to get the exact error message, I am considering the failed step to know the exact error message and I am joining the messageid to sys.sysmessages dmv to know the actual reasons for the job failures. (stepid<>0)

    Is there any special scenario or extra condition I am missing in the 2nd query. I am literally confused here. does running rerunning any failed step from point of failure will that a make a difference in the counts? I have also tested that case as well still counts show me correct on local instant but on prod where we have around 100 jobs which is showing wrong results.
    or else if a job is has multiple schedules does it make any difference????

    All I am looking for is, if I sum up the values of "FailureCount" column values, that should match if I sum TotalCount column values in the 2 query output.

    Can anybody help ?

    Query 1
    ==============
    use msdb
    go
    ;
    with MyCte
    as
    (
    SELECT j.name JobName,
    h.run_status,
    msdb.dbo.agent_datetime(h.run_date, h.run_time) rundatetime,
    case h.run_status when 0 then 'failed'
    when 1 then 'Succeded'
    when 2 then 'Retry'
    when 3 then 'Cancelled'
    when 4 then 'In Progress'
    end as ExecutionStatus
    --h.message MessageGenerated
    FROM sysjobhistory h inner join sysjobs j
    ON j.job_id = h.job_id
    where h.step_id = 0
    and msdb.dbo.agent_datetime(h.run_date, h.run_time) between '2016-06-01 00:00:00.000' and '2016-09-21 23:59:59.000'
    )
    select
    @@servername AS Servername, --//CHANGE THE SERVERNAME
    [Month]=cast(month(rundatetime) as varchar(2))+'/'+cast(year(rundatetime) as varchar(4)) ,
    [TotalExecutions] = count(*)
    ,SuccessCount = sum(case when ExecutionStatus = 'Succeded' then 1 else 0 end)
    ,FailureCount = sum(case when ExecutionStatus = 'failed' then 1 else 0 end)
    ,RetryCount = sum(case when ExecutionStatus = 'Retry' then 1 else 0 end)
    ,CancelledCount = sum(case when ExecutionStatus = 'Cancelled' then 1 else 0 end)
    from MyCte
    group by cast(month(rundatetime) as varchar(2))+'/'+cast(year(rundatetime) as varchar(4))
    go


    Query 2
    ==============
    ;
    with FailureReasons
    AS
    (
    SELECT [JobName] = JOB.name,
    msdb.dbo.agent_datetime(HIST.run_date, HIST.run_time) as RunDateTime,
    [Step id] = HIST.step_id,
    [StepName] = HIST.step_name,
    HIST.run_status,
    [Status] = CASE WHEN HIST.run_status = 0 THEN 'Failed'
    WHEN HIST.run_status = 1 THEN 'Succeeded'
    WHEN HIST.run_status = 2 THEN 'Retry'
    WHEN HIST.run_status = 3 THEN 'Canceled'
    END,
    HIST.sql_severity,
    HIST.sql_message_id,
    HIST.message

    FROM sysjobs JOB
    INNER JOIN sysjobhistory HIST ON HIST.job_id = JOB.job_id
    where HIST.run_status = 0 -- means only failed
    and msdb.dbo.agent_datetime(HIST.run_date, HIST.run_time) between '2016-06-01 00:00:00.000' and '2016-09-21 23:59:59.000'
    and step_id <> 0
    )
    SELECT t1.sql_message_id "Error #",t2.[Description], [TotalCount] = COUNT(1)
    from FailureReasons t1
    inner join sys.sysmessages t2 on (t1.sql_message_id = t2.error)
    where t2.msglangid = 1033
    group by t1.sql_message_id,t2.[Description]
    ORDER BY [TotalCount] DESC

    Sample output


    Thanks,

    Sam

    Wednesday, September 21, 2016 6:39 PM

Answers

  • Hi Samantha v,

    Correct me if I’m wrong, a single SQL Agent job could generate multiple rows that matches the condition in your where clause in your second query when you set ‘On failure action’ to something like ‘Go to the next step’. To confirm that, I would suggest you check all your SQL Agent Job steps on your server first see if there’s any of them is configured to go to next step or other steps.

    If you have any other questions, please let me know.

    Regards,
    Lin
    Thursday, September 22, 2016 7:23 AM