How to mark a job as completed?

  • Question

  • I am using an IJobManager to specify a program that controls my job. Once the JobManager app has determined that all of the work is complete, I would like my JobManager app to mark the cloud job as complete. I see the IJobManager.KillJobOnCompletion property, but that's not what I'm looking for. Is there a way to manually set the job as complete?
    Friday, February 6, 2015 6:15 PM

All replies

  • You can use the Terminate() call on the ICloudJob (or WorkItemManager.TerminateJob)... but I think you actually do want the KillJobOnCompletion property of the JobManager.

    If you terminate the job from within your job manager (while the job manager is still running), then the job manager will be forcefully terminated by the Batch system, because that job manager is part of the job you just terminated.  This means your job manager will have a nonzero exit code (because it was terminated due to the terminate-job call).

    The cleanest way to accomplish what you want, I think, is to set KillJobOnCompletion to true; then, when the job manager realizes that all the work is done, it gracefully exits and the Batch system ends the job for you.

    Can you explain more why KillJobOnCompletion isn't what you want?

    Thanks,

    -Matt

    Friday, February 6, 2015 6:30 PM
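
    A minimal sketch of the two approaches described in the reply above, against the preview-era Azure Batch .NET API used in this thread. Only ICloudJob.Terminate(), WorkItemManager.TerminateJob, IJobManager, and KillJobOnCompletion are named in the thread; the other calls and signatures (IWorkItemManager.GetJob, how the manager object is obtained) are assumptions based on that preview SDK and may differ in your version.

        // Sketch only: preview-era Microsoft.Azure.Batch SDK assumed.
        using Microsoft.Azure.Batch;

        static class JobCompletionSketch
        {
            // Option 1: explicitly end the job by terminating it.
            // If this runs inside the job manager, the job manager itself is forcefully
            // terminated along with the job and exits with a nonzero exit code.
            public static void EndJobExplicitly(IWorkItemManager wiManager,
                                                string workItemName, string jobName)
            {
                ICloudJob job = wiManager.GetJob(workItemName, jobName);   // assumed signature
                job.Terminate();                                            // named in the thread
            }

            // Option 2 (the recommendation above): let the Batch service end the job when
            // the job manager exits cleanly.
            public static void ConfigureJobManager(IJobManager jobManager)
            {
                jobManager.KillJobOnCompletion = true;   // property named in the thread
                // ...the job manager process then simply returns 0 once all work is done.
            }
        }
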
  • Matt,

    I am currently using the KillJobOnCompletion property. I had a job that ran 60 TVMs for 12 hours before the job was marked as completed, but something happened and the job manager task was marked as completed due to a server error (SchedulingError.Category=ServerError, Message=840A020A). The RetryCount on the job manager is 0, even though I have retries configured. So it looks like the server error resulted in my job manager getting marked as completed, when it actually had more work to do.

    Because I had set KillJobOnCompletion=True, this left me in a state where there are quite a few active tasks in the "completed" job that aren't running. If the job wasn't marked as completed, I could manually start up a new job manager task to continue the batch in this same job.

    My thought is that if my job manager can control when the job is marked complete, then if something happens to the Azure Batch system/server, I can still have an active job to work with. Also, if the job hadn't been marked as completed, all my active tasks would have continued to be processed.

    Friday, February 6, 2015 7:21 PM
  • Hi ccoxtn,

    I understand your scenario -- it sounds to me like something may have gone wrong, since the error you specified means that the retry limit was reached.  Given that you said there were 0 retries even though you had retries configured, that doesn't sound right.

    If you share the following information with us, we can check what happened to your job manager in our service logs.

    Region:

    Account name:

    Work item name:

    Job name (nice to have):

    Approximate time the job was running (start/end times are best):

    Edit: To clarify, what you tried to do should work - you don't need to handle ending the job yourself. If you specify retries for the job manager, we should only terminate the job manager (and thus the job) after the retries have been exhausted.

    Thanks,

    -Matt


    Saturday, February 7, 2015 12:18 AM
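
    As a rough illustration of the retry point in the edit above: retries for the job manager are configured on its task constraints. The names below (TaskConstraints, MaxTaskRetryCount) are assumptions taken from the Batch .NET API and may be spelled differently in the preview SDK used in this thread; only IJobManager itself is named here.

        using System;
        using Microsoft.Azure.Batch;   // preview-era SDK assumed

        static class JobManagerRetrySketch
        {
            // Assumed property/type names: the retry count is what the service exhausts
            // before it completes the job manager (and thus the job).
            public static void EnableJobManagerRetries(IJobManager jobManager)
            {
                jobManager.TaskConstraints = new TaskConstraints(
                    maxWallClockTime: null,
                    retentionTime: null,
                    maxTaskRetryCount: 3);   // retry the job manager up to 3 times
            }
        }
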
  • Matt,

    Here are some details on the job I was running:

    Region: South Central US
    Account Name: Pegasus
    Work Item Name: ExportLeadsToRedshift
    Job Name: job-000000001
    Job CreationTime: 2/6/2015 2:00:51 AM
    ExecutionInformation.EndTime: 2/6/2015 4:17:48 PM
    JobManager task name is LeadsJobManager, and it shows ExecutionInformation.RetryCount=0, ExecutionInformation.SchedulingError.Message=840A020A, ExecutionInformation.StartTime= 2/6/2015 3:50:30 PM.

    Monday, February 9, 2015 7:38 PM
  • Hi ccoxtn,

    Sorry for the long delay in replies.

    We investigated the issue you saw and have root caused the behavior.

    Here is what happened:  Your job had been running for a long time when the VM it was running on rebooted. Your JM was then rescheduled on another VM, and ~20-30 minutes later that VM was also rebooted. The same thing happened several times over the course of a few hours, and each time the Batch system rescheduled the JM on another VM.

    Internally, the Batch system has a concept of an “internal retry”, which is invoked when a VM with an active task goes bad. This can occur through a reboot or through a loss of contact with the Batch service due to, for example, network issues – in any distributed system this can sometimes happen.  When a task is retried due to this sort of failure, it is counted as an internal retry.  After the retry count crosses a certain threshold, the Batch service stops performing internal retries.  Your JM hit this threshold and was moved to the completed state with the scheduling error you saw.

    That is the “why” of it.

    Now, on to what we’re going to do to make this experience better for you:

    1. Our design is pretty conservative with the internal retry limits, to prevent a misconfigured task that crashes its VM from causing havoc in your pool by being scheduled on (and crashing) a ton of VMs. We will push a hotfix to reduce the probability of you hitting this internal retry limit.
    2. In the longer term, we are considering allowing users to set a policy for this behavior.
    3. We will also improve error reporting: the internal retry count wasn’t surfaced to you, and the error code you saw was not useful.  We will work on making these improvements.

    For the time being, you can do the following (which we talked about earlier in this thread):

    You can turn the KillJobOnJMCompletion flag off and just explicitly terminate your job at the end of your job manager.  This way, even if the job manager has issues completing, the rest of your work won’t get terminated.

    What you saw should be a relatively rare occurrence, so if you decide to leave KillJobOnJMCompletion on and this issue reoccurs, please let us know.

    Sorry for the inconvenience and hopefully the workaround I suggested works until we deploy our next hotfix.

    -Matt and the Azure Batch team

    • Proposed as answer by Yiding Zhou - MSFT Tuesday, June 2, 2015 5:21 AM
    • Marked as answer by ccoxtn Tuesday, June 2, 2015 3:03 PM
    Friday, February 13, 2015 1:25 AM
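
    A minimal sketch of the workaround described in the reply above, with KillJobOnJMCompletion turned off and the job manager ending the job itself. Only TerminateJob and the flag are named in the thread; the client setup (BatchClient.Connect, BatchCredentials, OpenWorkItemManager) and passing the names and credentials on the command line are assumptions based on the preview SDK.

        using Microsoft.Azure.Batch;   // preview-era Batch .NET SDK assumed

        static class LeadsJobManagerSketch
        {
            // args: <accountUrl> <accountName> <accountKey> <workItemName> <jobName>
            static int Main(string[] args)
            {
                // ... submit tasks and wait until all of the job's work is complete ...

                // Because KillJobOnJMCompletion is off, a job manager failure no longer
                // takes the rest of the job's active tasks with it; only this explicit
                // call marks the job as completed.
                var credentials = new BatchCredentials(args[1], args[2]);   // assumed type
                using (IBatchClient client = BatchClient.Connect(args[0], credentials))
                {
                    IWorkItemManager wiManager = client.OpenWorkItemManager();
                    wiManager.TerminateJob(args[3], args[4]);   // named in the thread
                }
                return 0;
            }
        }
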
  • Matt,

    Thanks for the follow-up. For now I will leave KillJobOnJMCompletion on, as it has been working well for me since my initial report. In the case I reported, I was able to pull information about the unfinished tasks and restart them in a new job.

    For the future, if you do allow users to set a policy, it would be nice to have the ability to set the job to disabled instead of completed when problems like these occur. Ideally, I would be able to fix any problems on my end, or wait for Azure service problems to be resolved, and then resume the job where it left off.

    Friday, February 13, 2015 3:38 PM
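
    For reference, a rough sketch of the recovery ccoxtn describes above (pulling the unfinished tasks out of the completed job and restarting them under a new job). None of these list/add calls are named in the thread; ListTasks, AddTask, CloudTask, and TaskState are assumptions based on the preview Batch .NET API, and a real version would also need to carry over each task's resource files and environment settings.

        using System.Linq;
        using Microsoft.Azure.Batch;   // preview-era SDK assumed

        static class RequeueUnfinishedTasksSketch
        {
            public static void Requeue(IWorkItemManager wiManager, string workItemName,
                                       string completedJobName, string newJobName)
            {
                // Assumed API: enumerate the old job's tasks and keep those that never
                // ran to completion.
                var unfinished = wiManager.ListTasks(workItemName, completedJobName)
                                          .Where(t => t.State != TaskState.Completed);

                foreach (ICloudTask task in unfinished)
                {
                    // Re-create each task by name and command line in the new job.
                    wiManager.AddTask(workItemName, newJobName,
                                      new CloudTask(task.Name, task.CommandLine));
                }
            }
        }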