Azure Batch - taskStateMonitor Error "A Task Was Canceled"

  • Question

  • We are using a taskStateMonitor in our Batch application to tell us when a job is finished. We modeled it after the sample Azure Batch Application. Very rarely we get an exception that says "A task was canceled". This ends the taskStateMonitor, so the logic that checks for failed tasks runs, but it finds none, so our function returns "job completed successfully", which triggers an automatic file download.

    Our exception is caught in the outer Try/Catch.

    Is there any way to tell why we get this message? The job continues to run to completion just fine.

    We are leaning towards writing our own monitor that takes into account the JobState as well as the state of all the tasks in the job.

            '------------------------------------------
            'Create Task Monitor
            '------------------------------------------
            Try
                Dim detail As New ODATADetailLevel(selectClause:="id,state")
                Dim tasks As List(Of CloudTask) = Await batchClient.JobOperations.ListTasks(CurProject.Job.ID, detail).ToListAsync()
                Dim taskStateMonitor As TaskStateMonitor = batchClient.Utilities.CreateTaskStateMonitor()
                Dim TimeoutExept As Boolean = False
                Try
                    Await taskStateMonitor.WhenAll(tasks, TaskState.Completed, timeout)
                Catch generatedExceptionName As TimeoutException 'We should never hit this
                    TimeoutExept = True
                    Status_Message("One or more tasks failed to reach the Completed state within the timeout period.")
                    Return False
                End Try

                '------------------------------------------
                'Check Completed Tasks for Success/Failure
                '------------------------------------------
                detail.SelectClause = "id,state,executionInfo"

                For Each task As CloudTask In tasks
                    Await task.RefreshAsync(detail)

                    If task.ExecutionInformation.Result = TaskExecutionResult.Failure Then
                        allTasksSuccessful = False
                        Status_Message(String.Format("WARNING: Task [{0}] encountered a failure: {1}", task.Id, task.ExecutionInformation.FailureInformation.Message))
                        If task.ExecutionInformation.ExitCode <> 0 Then
                            Status_Message(String.Format("WARNING: Task [{0}] returned a non-zero exit code - this may indicate task execution or completion failure.", task.Id))
                            If task.Id.ToUpper.Contains("PSIZIPRESULTS") Then
                                Status_Message("   Zip Results Taskfailed." & "See " & CurProject.Job.ID & "\BatchWork\Wrapup\ for log files.")
                            End If
                        End If
                    End If
                Next

                'If allTasksSuccessful Then
                '    Status_Message("Job Completed Successfully - All tasks completed successfully.")
                'End If

            Catch ex As Exception
                'Dim Msg As String = GetExceptionMsg(ex)
                Status_Message("MonitorJobTasks: " & ex.Message)
            End Try

            Return allTasksSuccessful

    Thursday, October 4, 2018 6:54 PM

All replies

  • How often would you say you see this message about the task being canceled?

    What have you done so far to isolate the issue? 

     
    Friday, October 5, 2018 7:01 PM
  • There is an unfortunate collision in terminology between the layers in the .net sdk.

    In this case "a task was canceled" does not refer to a Batch "task" (cloudtask).  It refers to a TPL task... and is the generic text returned by the timeout mechanism.

    What is the value of "timeout" in your code?  Chances are very good you need to increase this value.

    There are two timeouts in play in the client/server architecture.  1: client-side timeout, 2: server-side timeout.

    I will poke around a bit and provide better visibility into which is most likely in play.
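
    In the meantime, a minimal sketch (reusing the tasks, taskStateMonitor, and timeout variables from your code above) of catching that TPL cancellation separately, so it does not fall through to the generic handler and get reported as success:

            Try
                Await taskStateMonitor.WhenAll(tasks, TaskState.Completed, timeout)
            Catch ex As TimeoutException
                'The TimeSpan passed to WhenAll elapsed before every task reached Completed.
                Status_Message("Monitor timed out: " & ex.Message)
                Return False
            Catch ex As OperationCanceledException
                'Also covers TaskCanceledException ("A task was canceled") - a TPL cancellation,
                'not a failed CloudTask. Treat it as "monitoring ended early", not as success.
                Status_Message("Monitor was canceled before completion: " & ex.Message)
                Return False
            End Try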

    d

    Monday, October 8, 2018 9:55 PM
  • It's so intermittent we can't recreate it. We know the message is coming from the task monitor subroutine.
    Tuesday, October 9, 2018 1:28 PM
  • We got it again. This time not all the tasks got submitted to Azure Batch.

    Could it be we are starting up the task monitor too soon after submitting the job?

    Could you specify what timeouts you are talking about? I could increase them and see if that helps.

    Thanks!

    Scott

    Tuesday, October 9, 2018 1:32 PM
  • Task submission is unrelated to the TSM.  If you start the TSM before tasks are submitted, you are entering race conditions that you will need to resolve.  The suggested usage is to monitor only tasks that have already been submitted. In the sources you include above, the authoritative list of tasks comes from ListTasks… so, by definition, all of those tasks have already been submitted.

    The timeout value I mention is the one in your code above... the argument to the WhenAll() method.

    There are 4 effects in play with TSM:

    1.  Timeout on the method signature
    2.  Retry policy in effect
    3.  "client-side" timeout
    4.  "server-side" timeout

    1 is the maximum time the TSM is given to monitor.  The error you cite will result when that time limit is reached.  This value is enforced across (and including) multiple calls to the service.

    3 is the maximum time any single given call is allowed to take.  This is independent of what the service may or may not be doing.  When this limit is reached, you will see the error you cite.

    2 retries errors... in theory including "3".

    4 is the maximum amount of time you (the sdk) tell the service it can take.

    By default, 3 and 4 are set to useful defaults by the sdk.  If you wish to override them you can do so via Interceptors.
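
    For 3, something along these lines can raise the client-side per-request limit via an interceptor. Treat it as a sketch only: it assumes the Timeout property on the protocol request object is the client-side limit, and the exact member may differ by SDK version.

            'Sketch: raise the client-side (per call) timeout for every request made by this client.
            'Requires Imports Microsoft.Azure.Batch (RequestInterceptor lives there).
            batchClient.CustomBehaviors.Add(New RequestInterceptor(
                Sub(req)
                    req.Timeout = TimeSpan.FromMinutes(10) 'assumed client-side per-request limit
                End Sub))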

    The error you cite is generally associated with the client-side timeout as defined by the CancellationToken: with the signature in the sources included above, a CancellationToken is created for you.

    First step would be for you to examine the value of the timespan you pass in.  Larger task collections require more time to monitor.  If the lower layers of the service are particularly busy, more time could be required as well.

    Adding a timer around the call could help identify which timeout is being triggered as well.
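
    A rough sketch of such a timer, using the names from your code above (Stopwatch is in System.Diagnostics):

            Dim sw As Stopwatch = Stopwatch.StartNew()
            Try
                Await taskStateMonitor.WhenAll(tasks, TaskState.Completed, timeout)
            Finally
                sw.Stop()
                'Elapsed time close to "timeout" points at the overall monitor limit (1);
                'a much shorter elapsed time points at a per-call client-side timeout (3).
                Status_Message(String.Format("WhenAll ran for {0} against a {1} limit.", sw.Elapsed, timeout))
            End Try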

    d



    • Edited by DarylMsft Tuesday, October 9, 2018 6:30 PM
    Tuesday, October 9, 2018 6:26 PM
  • Thank you so much for the help. I set the total timeout for the monitor to a large number.

           Dim allTasksSuccessful As Boolean = True

            Dim timeout As TimeSpan = TimeSpan.FromHours(500) ' Jobs are already set with a timeout so we don't need this.

    The jobs we will be submitting have thousands of tasks, so that could be taking a long time on the server side. We really only need to know the JOB status, but I don't think there is a JOB status monitor.

    I was thinking that it might be better not to use the TSM at all, and to monitor the JOB status myself in the application. Then, if a particular JOB status calls for it, we can look through the tasks to find failed ones, etc.

    We are using the TSM because it was used in the initial sample Azure Batch App we started testing with.

    What do you think?

    Tuesday, October 9, 2018 6:48 PM
  • There is no JSM because, by design, the job is mostly unaware of the tasks under it.  There is one job feature that might be of interest: OnAllTasksComplete.

    This seems to be a match for your use of the TSM targeting TaskState.Completed.  At least it would make job.state relevant to "completion" of all tasks.  Polling the job state would not be very effective without this sort of approach.

    TSM efficiency is a function of the length of the task IDs: the underlying REST API call takes a list of IDs with a maximum length limit, so shorter IDs pack better and result in fewer calls.  The TSM is fine with a million tasks, but prudent use of select (which you do above) helps a lot.

    Also of interest would be GetTaskCounts().  This API returns a convergent approximation of the state distributions of the tasks in a job.  The values returned are not atomically consistent but can be very helpful when establishing trends.
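
    To make that concrete, a minimal sketch of both ideas, assuming a bound CloudJob named job (CommitChangesAsync is assumed here to be the way to push the OnAllTasksComplete change to an already-committed job):

            'Ask the service to terminate the job automatically once every task reaches Completed.
            job.OnAllTasksComplete = OnAllTasksComplete.TerminateJob
            Await job.CommitChangesAsync()

            'Approximate (not atomically consistent) distribution of task states, useful for trends.
            Dim counts As TaskCounts = Await job.GetTaskCountsAsync()
            Console.WriteLine("Active: {0}  Running: {1}  Succeeded: {2}  Failed: {3}",
                              counts.Active, counts.Running, counts.Succeeded, counts.Failed)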

    d

    Thursday, October 11, 2018 1:07 AM
  • Thanks. We use GetTaskCounts to monitor the overall counts of the tasks in each state and show them on the local front end application.

    We also set the job to terminate when all tasks are complete. We are just using the TSM to update the front-end application, tell the user the job is done, and download results. The TSM has just been flaky once in a while, reporting that "task was canceled" error.

    What would be the problem with monitoring the Job status, since we have it set to terminate when all tasks complete? We also have it terminate if a task returns a non-zero return code. We use AutoPools as well.

       CurProject.Job.Last_Status.Taskcounts = Await CurProject.Job.CloudJob.GetTaskCountsAsync()

        CurProject.Job.CloudJob.OnAllTasksComplete = OnAllTasksComplete.TerminateJob

             

    Thursday, October 11, 2018 3:41 PM
  • Setting OnAllTasksComplete supports the relevance of job polling.  Since you set it, I think monitoring the job is fine.

    The additional step of setting a JobAction on the task also supports the relevance of job polling.  My main point is that the job state is decoupled from the task states by default and only features like these connect the job state to tasks.
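
    A bare-bones polling sketch along those lines, reusing the objects from your snippet above (the poll interval is arbitrary, and it assumes OnAllTasksComplete = TerminateJob has already been committed on the job):

            'Poll until the service terminates the job (OnAllTasksComplete = TerminateJob).
            Do
                Await Task.Delay(TimeSpan.FromSeconds(30))
                Await CurProject.Job.CloudJob.RefreshAsync()
            Loop Until CurProject.Job.CloudJob.State.HasValue AndAlso
                       CurProject.Job.CloudJob.State.Value = JobState.Completed

            'Once the job is Completed, check individual tasks for failures as in your monitor code.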

    d

    Thursday, October 11, 2018 7:31 PM