Task Scheduling and Idle Nodes

  • Question

  • I have an Azure Batch job that creates hundreds of tasks. I am using an auto pool with a target dedicated value of 8, a max tasks per node value of 4, and a scheduling policy of "spread". When I monitor the nodes in the pool using either the heat map or the details in Batch Explorer, I notice that one node seems to be doing most of the work, while the other nodes just sit there idle. I can see that there are hundreds of tasks that are "Active" for this job, but only 3 are ever "Running" at any given time (the 4th being the job manager).

    I am at a bit of a loss on where to start debugging this or how to solve this issue. Can anyone provide some direction or insight into why this might possibly be happening? 
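    For context, the two Batch node fill policies differ roughly as follows: "spread" places each new task on the node with the fewest running tasks, while "pack" fills one node up to max tasks per node before moving to the next. This is a minimal illustrative sketch, not the service's actual scheduler; the `assign` function and its parameters are made up for illustration:

```python
# Toy model of the two node-fill policies (illustrative only, not the
# real Azure Batch scheduler).
def assign(tasks, nodes=8, max_tasks_per_node=4, policy="spread"):
    running = [0] * nodes  # tasks currently placed on each node
    for _ in range(tasks):
        # Only nodes with a free task slot are candidates.
        candidates = [i for i in range(nodes) if running[i] < max_tasks_per_node]
        if not candidates:
            break  # pool saturated; remaining tasks would stay Active
        if policy == "spread":
            # Least-loaded node first.
            target = min(candidates, key=lambda i: running[i])
        else:  # "pack"
            # Fill the lowest-numbered node that still has a slot.
            target = candidates[0]
        running[target] += 1
    return running

print(assign(10, policy="spread"))  # [2, 2, 1, 1, 1, 1, 1, 1]
print(assign(10, policy="pack"))    # [4, 4, 2, 0, 0, 0, 0, 0]
```

    Under a healthy "spread" policy with 8 nodes and 4 slots each, all 8 nodes should be busy long before hundreds of tasks queue up as Active.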

    Thank you,

    RJ

    EDIT: I meant to add, here are the numbers of tasks that have run on each node as of the most recent run of this job:

    Node 1 - 10
    Node 2 - 85
    Node 3 - 19
    Node 4 - 373
    Node 5 - 76
    Node 6 - 12
    Node 7 - 56
    Node 8 - 85

    When the job first starts, it seems to spread the work out fairly evenly, but then, as you can see from the numbers above, it starts to favor a single node and eventually all the other nodes just sit there, idle.
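    As a quick sanity check on how skewed those counts are (an illustrative snippet; the list is just the per-node totals above, compared against the even split a spread policy should approach):

```python
# Per-node task totals reported above (nodes 1-8).
counts = [10, 85, 19, 373, 76, 12, 56, 85]
total = sum(counts)               # 716 tasks overall
even_share = total / len(counts)  # 89.5 per node if spread evenly
busiest = max(counts)
print(f"busiest node ran {busiest} of {total} tasks ({busiest / total:.0%})")
# busiest node ran 373 of 716 tasks (52%)
```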

    • Edited by RJ Regenold Tuesday, February 2, 2016 5:16 PM Added more information
    Tuesday, February 2, 2016 5:11 PM

Answers

  • Hi RJ,

    Sorry for the delayed response.  We uncovered a bug impacting scheduling to VMs in certain scenarios (including the case you shared with us).  In these cases, the VM may become idle and not accept any new tasks even when there are new tasks to schedule on it.  We are rolling out a hotfix to all impacted regions ASAP (rollout will likely occur tomorrow, 2/5/2016).


    Thank you for bringing this issue up with us – we apologize for the inconvenience this issue has caused.

    Thanks,

    -Matt

    • Marked as answer by RJ Regenold Friday, February 5, 2016 1:12 AM
    Thursday, February 4, 2016 11:47 PM

All replies

  • Hi RJ Regenold,

    We can definitely take a look at the odd behavior you're observing.  It would be helpful if you could share some additional information with us so we can take a look into our logs.

    What region is your account in?

    What is your account name in that region?

    What is the name of the job which had this behavior?

    Around what time did this job run?

    Tasks shouldn't be sitting in the queue in Active state when there are still VMs in the pool which are in Idle state.  Can you confirm that the Nodes in the pool are all in state "Idle" even when the job has many active tasks?

    Thanks,

    -Matt

    Tuesday, February 2, 2016 9:32 PM
  • Hi, Matt. Thanks for the reply. Here is the additional info you requested:

    Region: North Central US

    Account name: supplylogixbeta

    Job name: shipping-20160202-20160202041

    Job start time: 2/2/2016 16:10:03 UTC

    I can confirm that there were at least 40 tasks that said "Active" while 7 of the 8 nodes were in an Idle state. The one node that was processing tasks was running the job manager task and 3 other tasks.

    Please let me know if you would like any additional information and I'll be happy to provide it. Thanks again for the reply!

    RJ

    Tuesday, February 2, 2016 10:24 PM
  • (See the marked answer above.)

    Thursday, February 4, 2016 11:47 PM
  • Not a problem! I really appreciate how quickly you guys identified the problem and fixed it. Thanks!
    Friday, February 5, 2016 1:12 AM