Low priority nodes stay idle or unusable, why?

  • Question

  • Hello,

    I noticed some unwanted behaviour using low priority nodes in a pool.

    I have allocated 2 low priority nodes. After some time, one becomes unusable, and when I launch a job, the other stays idle forever...

    I am forced to define a node as dedicated (which works correctly), because the low priority ones seem to become useless after some time, as if they get too old to execute jobs... To make them work again I have to scale to 0 and back to 2.

    Is this a bug in the behaviour of low priority nodes?

    Is there a way to automatically kill and recreate them via the auto scale formula?

    Thanks in advance

    Thomas

    Monday, June 11, 2018 4:21 PM

All replies

  • Hi Thomas,

    Low priority nodes should remain idle as long as no task has been scheduled on them. Simply creating a job should not move any nodes in the associated pool to a new state.

    Also, low priority nodes may be preempted at any time; see the documentation for more information: https://docs.microsoft.com/en-us/azure/batch/batch-low-pri-vms

    When preempted, low priority nodes will not be available for task scheduling. However, Batch pools automatically seek the target number of low priority nodes: if nodes are preempted, Batch attempts to replace the lost capacity and return to the target. So you do not need to kill the nodes or write an auto scale formula for this; Batch handles this complexity for you and will reallocate the nodes when they are available.
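
    If you want to confirm what Batch itself reports for the pool, something along these lines should show it (a rough sketch using the azure-batch Python SDK; the account name, key, endpoint, and pool id are placeholders, and depending on your SDK version the client's URL argument may be base_url rather than batch_url):

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

# Placeholder credentials and endpoint -- substitute your own.
credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(
    credentials,
    batch_url="https://mybatchaccount.northeurope.batch.azure.com")

# Pool-level view: allocation state and low priority node counts.
pool = client.pool.get("mypool")
print("allocation state:", pool.allocation_state)
print("low priority nodes: %d current / %d target"
      % (pool.current_low_priority_nodes, pool.target_low_priority_nodes))

# Node-level view: each node's state ('idle', 'unusable', 'preempted', ...).
for node in client.compute_node.list("mypool"):
    print(node.id, node.state, node.scheduling_state)
```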

    Thanks,

    Jake


    Monday, June 11, 2018 6:22 PM
  • Hello Jake, 

    Thanks for answering, but it does not behave this way. This morning one node is still unusable, the other is still idle, and my task has still not been executed... No nodes were preempted... From my point of view Batch seems broken.

    Thanks

    Thomas

    Tuesday, June 12, 2018 8:56 AM
    We have also been seeing this behaviour over the last few days (since the major North Europe outage on the 19th, although that might be a coincidence).

    In our application we have a pool per customer, each of which has a couple of low priority nodes always running. In the last few days we have had a number of customers complaining to us that their jobs aren't running. In each case they have a couple of low priority nodes sitting idle, and a job with a task in the active state but not running anywhere (our customers often run single tasks on the pool, in addition to large scale studies).

    Our priority is to get the customer up and running again, so in each case we've adjusted the autoscale formula to replace the running nodes with dedicated ones, which pick up the job as soon as they are online (see the sketch below). Simply scaling up the pool would also work, but switching to dedicated nodes guarantees the faulty nodes are taken offline. Once the old nodes are removed we can switch the pool back to low priority and the new low priority nodes work fine. This indicates, as Thomas says, that it's an issue that occurs after the nodes have been online for some time.
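
    The swap itself looks roughly like this (a sketch using the azure-batch Python SDK; the account details, pool id, node counts, and evaluation interval are just example values):

```python
import datetime
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

# Placeholder account details -- substitute your own.
client = BatchServiceClient(
    SharedKeyCredentials("mybatchaccount", "<account-key>"),
    batch_url="https://mybatchaccount.northeurope.batch.azure.com")

# Replace the low priority nodes with dedicated ones so the faulty
# nodes are deallocated and the waiting tasks get picked up.
client.pool.enable_auto_scale(
    "customer-pool",
    auto_scale_formula="$TargetDedicatedNodes = 2; $TargetLowPriorityNodes = 0;",
    auto_scale_evaluation_interval=datetime.timedelta(minutes=5))

# Once the old nodes are gone, the same call with the counts reversed
# switches the pool back to low priority nodes.
```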

    I'm unsure if this only affects low priority nodes: we only have low priority nodes running constantly.  However we are switching everyone over to dedicated nodes today at our expense to see if that eliminates the problem.  I'll report back.

    Note that we probably aren't aware every time this issue occurs: if the customer runs many tasks, the pool will scale up anyway, hiding the problem with the faulty nodes (and most likely taking them offline when the pool scales down, as they are guaranteed not to be busy).

    We're keeping our own pool on low priority nodes, so if we can catch it in this state I'll isolate the pool so you can investigate the issue while it's occurring.


    Thursday, June 28, 2018 8:46 AM
  • I have an example of this occurring right now... I have a pool with 3 low priority nodes that are in the idle state, I have two Batch jobs in the Active state targeting the pool, and they each have one task in the Active state that isn't being picked up by the three idle nodes.  I've isolated the pool so we should be able to keep it in this state as long as the nodes don't get preempted.
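
    For reference, this is roughly how I'm checking the stuck state (a sketch with the azure-batch Python SDK; the pool and job ids below are placeholders, not the real ones):

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

# Placeholder account details -- substitute your own.
client = BatchServiceClient(
    SharedKeyCredentials("mybatchaccount", "<account-key>"),
    batch_url="https://mybatchaccount.northeurope.batch.azure.com")

# The three nodes all report 'idle'...
for node in client.compute_node.list("isolated-pool"):
    print("node", node.id, node.state)

# ...while the two jobs targeting the pool each have a task stuck in 'active'.
for job_id in ("job-1", "job-2"):
    for task in client.task.list(job_id):
        print("job", job_id, "task", task.id, task.state)
```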

    I'm not sure what details I can securely give in this forum, but if there is an email address I can use then I'll send all the info over for someone to take a look.

    I can't open a support ticket unfortunately as I can't purchase a support plan right now (it says "If you recently canceled a support plan, try again at a later date" when I try and upgrade).

    Friday, June 29, 2018 7:52 AM
  • Update: I'm told this issue has now been fixed.
    • Proposed as answer by JamesThurley Monday, July 16, 2018 4:49 PM
    Monday, July 16, 2018 4:49 PM