TaskStateMonitor - Supervisor goes down scenario

  • Question

  • Hi,

    I am using the TaskStateMonitor to monitor the completion of my tasks. I am hosting the TaskStateMonitor along with the batch client in a web role, so the TaskStateMonitor acts as the supervisor for all the tasks. What happens if the application that hosts the TaskStateMonitor goes down?

    1.) One option I can think of is to query the Batch REST API. But is it possible to get the completion status by polling the Batch REST API?

    Are there any other options to reliably monitor the status?


    Please mark the response as answers if it solves your question or vote as helpful if you find it helpful. http://thoughtorientedarchitecture.blogspot.com/

    Thursday, October 15, 2015 2:47 PM

Answers

  • You asked: "Now if the JobManager VM goes down, I would assume the Batch service would restart another VM instance and run the JobManager. But in this case will the environment variables along with their values be preserved when a new VM with the JobManager is spun up?"

    To clarify: if you have a pool with 10 VMs and a single job with a JobManager running on that pool, the JobManager runs on 1 of the 10 VMs and the other 9 VMs are available for normal tasks.  If the JobManager's VM goes down, the JM will preempt a task on one of the remaining 9 VMs.  As for the environment variable question: yes, the environment variables are repopulated even if the task is being "rescheduled"; they are set whenever the task starts to run, regardless of how many times it has run before.  Generally, the easiest way to determine whether the JobManager has already run is to list all the tasks in the job (ListTasks) and check that the right number are there and in the right state.  For more complex JobManagers, you might need to implement some state management using persistent storage such as Azure Storage, or maybe even something like Redis.
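
    As a rough illustration of the "right number, right state" check, here is a minimal Python sketch. It does not call the Batch SDK: the input dictionary stands in for whatever a ListTasks call returns, and the function name, the state strings, and the returned labels are all assumptions made for this example.

```python
from typing import Dict


def submission_status(task_states: Dict[str, str], expected_count: int) -> str:
    """Classify how far a previous JobManager run got.

    task_states maps task id -> state and stands in for a ListTasks result
    (hypothetical shape, not the SDK's return type).
    """
    if not task_states:
        return "nothing-submitted"
    if len(task_states) < expected_count:
        return "partially-submitted"
    # "completed" is a placeholder state name chosen for this sketch.
    if all(state == "completed" for state in task_states.values()):
        return "all-completed"
    return "all-submitted"
```

    A restarted JobManager could branch on this result: submit everything, submit only what is missing, or go straight to monitoring.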

    You asked: "Also as a different question, when the Batch service creates new VMs for the tasks, would it create the VMs in the same region as the Batch account? Or in an arbitrary region?"

    They are always in the same region as the Batch account.

    Lastly, you asked: "Thirdly, the CloudTask object has something called an AffinityInformation property. What can this be used for?"

    It is used for "soft" affinity.  Basically, you can put the name of a VM in there when you submit the task, and we will try to schedule the task on the requested VM.  But if that VM is busy at task scheduling time, the task will end up scheduled elsewhere (thus the "soft" affinity).  In the future we may support hard affinity (i.e. the task will not be scheduled at all unless it can be scheduled on the requested VM).  If you need a "hard" affinity feature, we can discuss some workarounds in the meantime.
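
    To make the soft/hard distinction concrete, here is a toy Python model of the placement decision described above. It is not the Batch scheduler and says nothing about how the service actually picks a fallback VM; the function name, the `node_busy` dictionary, and the `hard` flag are assumptions introduced only for this sketch.

```python
from typing import Dict, Optional


def pick_node(node_busy: Dict[str, bool],
              preferred: Optional[str] = None,
              hard: bool = False) -> Optional[str]:
    """Toy placement rule: try the preferred node first.

    node_busy maps a (hypothetical) VM name to True if it is busy.
    With hard=False (soft affinity) a busy preferred node means we fall
    back to any idle node. With hard=True we return None instead, which
    models "do not schedule at all".
    """
    if preferred is not None and node_busy.get(preferred) is False:
        return preferred
    if hard and preferred is not None:
        return None
    for node, busy in node_busy.items():
        if not busy:
            return node  # first idle node; the real fallback policy is unknown
    return None
```

    With soft affinity, `pick_node({"vm-a": False, "vm-b": True}, "vm-b")` falls back to `"vm-a"`, while the same call with `hard=True` returns `None`.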

    Hope that helps.


    Thursday, October 15, 2015 9:37 PM

All replies

  • Hi Haripraghash,

    Can you clarify more about what you mean when you say: "But is it possible to get the completion status with polling the Batch REST API"?  That's exactly what TaskStateMonitor does under the hood (it just looks at the status of the tasks and waits for them all to reach the desired state using the basic REST API operations such as ListTasks).
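
    The polling pattern described here can be sketched in a few lines of Python. This is a simplified model, not the TaskStateMonitor implementation: `list_task_states` is a hypothetical callable standing in for a ListTasks request, and the state strings, timeout, and interval are placeholder values.

```python
import time
from typing import Callable, Dict


def wait_for_tasks(list_task_states: Callable[[], Dict[str, str]],
                   desired_state: str = "completed",
                   timeout_s: float = 60.0,
                   poll_interval_s: float = 1.0) -> bool:
    """Poll until every task reports desired_state.

    Returns True once all tasks are in the desired state, or False if the
    timeout expires first. An empty task list counts as "not done yet".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        states = list_task_states()  # stands in for one ListTasks call
        if states and all(s == desired_state for s in states.values()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


def make_fake_service(polls_until_done: int) -> Callable[[], Dict[str, str]]:
    """In-memory stand-in for the service: tasks finish after N polls."""
    calls = {"n": 0}

    def list_task_states() -> Dict[str, str]:
        calls["n"] += 1
        state = "completed" if calls["n"] >= polls_until_done else "running"
        return {"task-1": state, "task-2": state}

    return list_task_states
```

    The loop keeps no state of its own between polls, so a supervisor that crashes and restarts can simply call it again and pick up where the service currently stands.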

    One thing you might want to look into is using the JobManager feature of CloudJob.  The JobManager is a "special" task which starts before all other tasks and is primarily used for communication and task submission in the Batch service.  The JobManager has some "special" properties; for example, we will always try to schedule it (so if the VM it is running on goes down, the scheduler will preempt another task to schedule the JobManager as long as there is one VM free).

    You can take a look at a simplistic sample of how to submit a JobManager here.

    So your workflow could be something like this:

    1) The WebRole decides to kick off a job and submits a CloudJob with a JobManager task.  If you need to control how many tasks will be submitted based on some parameters your WebRole knows, it can include these as environment variables or as part of the JobManager command line.

    2) The JobManager task starts for your job, reads its environment variables (or command line), and starts submitting new tasks for your job.

    3) The JobManager uses TaskStateMonitor to monitor for task completions, and when the tasks are all completed the JobManager exits.  You can also use the JobManager to do some simple coordination/staging if you like.

    The one thing to be aware of with the JobManager is that technically for full robustness you should make it idempotent -- which usually means when the JM starts, it needs to check the status of its job to see if it has already submitted the required tasks or not.
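
    One way to get that idempotency is to derive task ids deterministically from the job parameters and submit only the ids that are not already present. The Python sketch below shows the idea under stated assumptions: `TASK_COUNT` is a hypothetical environment variable set by the web role, and `list_existing_ids` and `submit_task` are hypothetical callables standing in for the real list and add operations.

```python
import os
from typing import Callable, Iterable, List


def task_count_from_env(name: str = "TASK_COUNT", default: int = 0) -> int:
    """Read the number of tasks to submit (the variable name is an assumption)."""
    return int(os.environ.get(name, default))


def expected_task_ids(count: int) -> List[str]:
    """Deterministic ids, so every restart computes the same set."""
    return [f"task-{i:04d}" for i in range(count)]


def submit_missing_tasks(task_count: int,
                         list_existing_ids: Callable[[], Iterable[str]],
                         submit_task: Callable[[str], None]) -> List[str]:
    """Submit only the tasks that a previous run did not already create."""
    existing = set(list_existing_ids())  # snapshot of what the job holds now
    submitted = []
    for task_id in expected_task_ids(task_count):
        if task_id not in existing:
            submit_task(task_id)
            submitted.append(task_id)
    return submitted
```

    If the JobManager is preempted after submitting two of four tasks, the rescheduled run sees the two existing ids and submits only the remaining two; a third run submits nothing.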

    Thursday, October 15, 2015 5:22 PM
  • Thank you for your response.

    Now I understand that the JM is reliable.

    Coming to the idempotency part: suppose the host application (web role) sets a few environment variables on the JobManagerTask object. The Batch service then hosts the JobManager on a VM, and the JobManager in turn creates new tasks based on those environment variables. Now if the JobManager VM goes down, I would assume the Batch service would restart another VM instance and run the JobManager. But in this case will the environment variables along with their values be preserved when a new VM with the JobManager is spun up?

    Also as a different question, when the Batch service creates new VMs for the tasks, would it create the VMs in the same region as the Batch account? Or in an arbitrary region?

    Thirdly, the CloudTask object has something called an AffinityInformation property. What can this be used for?




    Thursday, October 15, 2015 8:27 PM