Azure Batch Windows Containers Failing To Start

  • Question

  • I'm using the Azure Batch .NET SDK to create my Azure Batch pool and job. I have a custom Docker image in Azure Container Registry that my tasks use. It appears that when I specify a large number of tasks per node, some of the tasks fail to create a container on startup. This is only a guess, because the task fails with a null exit code. The failure information isn't helpful: I have output files specified, so the failure message only says that it couldn't find any output files. There is no stdout, so I know that my app in the container is not being launched. Other tasks start up fine, and eventually the node is running the full number of tasks per node. My Docker image is about 12 GB, so I'm wondering whether Docker can't get enough resources to start all of those containers right away. I haven't been able to find any log files to confirm my suspicion.

    In this case, since the exit code is null, the task is not retried. I've been getting around this by re-activating the task in my code so that it gets scheduled again and completes fine later.
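The workaround described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the .NET code actually used: `TaskSnapshot`, `needs_reactivation`, and `reactivate_failed_starts` are hypothetical names, the `reactivate` callback stands in for whatever SDK call re-activates a task, and the cap on re-activations is an assumption added here so that a task that never starts is not re-queued forever.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class TaskSnapshot:
    """Hypothetical, minimal view of a Batch task as seen by client code."""
    task_id: str
    state: str                # e.g. "active", "running", "completed"
    exit_code: Optional[int]  # None when no exit code was reported


def needs_reactivation(task: TaskSnapshot) -> bool:
    """True for a completed task that has no exit code at all."""
    return task.state == "completed" and task.exit_code is None


def reactivate_failed_starts(
    tasks: Iterable[TaskSnapshot],
    reactivate: Callable[[str], None],
    attempts: Dict[str, int],
    max_attempts: int = 3,
) -> List[str]:
    """Re-activate tasks that never started; return the ids that were re-activated.

    `attempts` counts re-activations per task id so that a task that keeps
    failing is not re-activated forever (the cap is an assumption added here).
    """
    reactivated = []
    for task in tasks:
        if not needs_reactivation(task):
            continue
        if attempts.get(task.task_id, 0) >= max_attempts:
            continue
        reactivate(task.task_id)  # placeholder for the real SDK call
        attempts[task.task_id] = attempts.get(task.task_id, 0) + 1
        reactivated.append(task.task_id)
    return reactivated
```

Keeping the decision (`needs_reactivation`) separate from the side effect makes it easy to log which tasks were re-queued and how often, which may help when reporting the problem.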

    Here's a sample log event for the failure from the Azure diagnostic log:

    "Tenant": "9dd2ce90e3fc494fac0a87f94b23d486",
        "time": "2018-05-08T17:10:59.2929160Z",
        "resourceId": "/SUBSCRIPTIONS/1226C487-759D-40A9-8C73-1947D9156E89/RESOURCEGROUPS/SAFEGUARD/PROVIDERS/MICROSOFT.BATCH/BATCHACCOUNTS/WTSSAFEGUARD",
        "category": "ServiceLog",
        "operationName": "TaskFailEvent",
        "operationVersion": "2017-06-01",
        "properties": {"jobId":"safeguardjob_86aa74b9-7c0d-46a3-8256-525dfe4dfb28","id":"topntask26","taskType":"User","systemTaskVersion":0,"nodeInfo":{"poolId":"safeguardpool_86aa74b9-7c0d-46a3-8256-525dfe4dfb28","nodeId":"tvm-2661886182_1-20180508t164947z"},"multiInstanceSettings":{"numberOfInstances":1},"constraints":{"maxTaskRetryCount":3},"executionInfo":{"startTime":"2018-05-08T17:06:35.037Z","endTime":"2018-05-08T17:07:55.281Z","exitCode":0,"retryCount":0,"requeueCount":0}}

    The log shows an exitCode of zero, but in code the task's ExecutionInformation.ExitCode is null.
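For comparing many such events, the `properties` value can be parsed with a few lines of Python. This is only a sketch: `PROPERTIES` is a trimmed copy of the event above, and `summarize` is a hypothetical helper name. It pulls out the exit code, the retry counters, and the time between `startTime` and `endTime`.

```python
import json
from datetime import datetime

# Trimmed copy of the "properties" value from the log event above.
PROPERTIES = (
    '{"id":"topntask26","constraints":{"maxTaskRetryCount":3},'
    '"executionInfo":{"startTime":"2018-05-08T17:06:35.037Z",'
    '"endTime":"2018-05-08T17:07:55.281Z","exitCode":0,'
    '"retryCount":0,"requeueCount":0}}'
)

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def summarize(properties_json: str) -> dict:
    """Extract the fields that matter for diagnosing a TaskFailEvent."""
    props = json.loads(properties_json)
    info = props["executionInfo"]
    start = datetime.strptime(info["startTime"], TIME_FORMAT)
    end = datetime.strptime(info["endTime"], TIME_FORMAT)
    return {
        "task_id": props["id"],
        "exit_code": info.get("exitCode"),  # None if the field is absent
        "retry_count": info["retryCount"],
        "max_retries": props["constraints"]["maxTaskRetryCount"],
        "duration_seconds": (end - start).total_seconds(),
    }


if __name__ == "__main__":
    print(summarize(PROPERTIES))
```

For this event the summary shows an exit code of 0, no retries used out of a maximum of 3, and roughly 80 seconds between start and end.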

    Wednesday, May 9, 2018 3:06 PM

All replies

  • Hi,

    Do your tasks all use the same container image? If so, it would be better to specify the container image at the pool level, especially when MaxTasksPerNode is greater than 1. Otherwise the Docker engine will try to download the same image multiple times in parallel, which may cause conflicts.
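As a rough sketch of what specifying the image at the pool level means, the Python below builds a partial container configuration and checks that every task image is covered by it. The key names (`containerImageNames`, `containerRegistries`, `registryServer`, `userName`, `password`) follow the Batch REST pool schema as I understand it and should be checked against the API version in use. Other required fields, such as the container type, are deliberately left out, and the helper names and the image and registry values in the usage example are hypothetical.

```python
from typing import Dict, Iterable, List


def build_container_configuration(image: str, registry_server: str,
                                  user_name: str, password: str) -> Dict:
    """Build a partial pool-level container configuration (assumed key names)."""
    return {
        # Images listed here are meant to be pulled once per node
        # rather than once per task.
        "containerImageNames": [image],
        "containerRegistries": [
            {
                "registryServer": registry_server,
                "userName": user_name,
                "password": password,
            }
        ],
    }


def images_not_prefetched(task_images: Iterable[str],
                          container_configuration: Dict) -> List[str]:
    """Return the task images that the pool configuration does not list."""
    prefetched = set(container_configuration.get("containerImageNames", []))
    return sorted(set(task_images) - prefetched)


if __name__ == "__main__":
    config = build_container_configuration(
        "example.azurecr.io/myapp:latest",  # hypothetical image name
        "example.azurecr.io",               # hypothetical registry
        "registry-user",
        "registry-password",
    )
    # An empty list means every task image is listed at the pool level.
    print(images_not_prefetched(["example.azurecr.io/myapp:latest"] * 4, config))
```

If the check returns any image names, those tasks would still trigger their own pulls on the node, which is the situation described above.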

    The Batch service should expose a better error message. Based on the information provided, the pool has already been deleted, so we couldn't get information from the node to see what happened.

    Could you file an incident through the Azure portal about this issue? That way we can better assist you in figuring out the problem or reproducing the issue.

    Xing


    Thursday, May 10, 2018 6:37 PM
  • Any update on this? 
    Tuesday, May 15, 2018 2:59 AM
    Moderator