Azure Batch - cloud job commit issue

  • Question

  • Hi,

    We are using the Azure Batch service and are facing an intermittent issue while creating a cloud job through the Azure Batch API. To explain in detail: we use the Batch service to process large client files by programmatically creating a task for each file; in each task we create a Batch cloud job and its tasks, then use TasksMonitor to wait for them and read the tasks' output.

    So say, if 4 client files are being processed, the system creates 4 parallel tasks, each of which internally creates a job and tasks for its file. The problem is that sometimes the Batch API gets stuck while committing the cloud job and doesn't proceed any further, and this also stalls the processing of the other files, i.e. if one job is stuck during commit, the other files, whatever their status, are stuck too and none of them make progress. We initially thought the issue might be a restriction on the number of web requests we make from the application, so we increased the limit by setting maxconnection in app.config under the system.net section, but this didn't solve the problem.
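    Roughly, the orchestration looks like this (simplified sketch; ProcessFile and the other names are illustrative, not our exact code):

        // One worker task per uploaded client file; each worker creates its own
        // Batch job and tasks through the Batch API and waits for their output.
        Task[] workers = clientFiles
            .Select(file => Task.Run(() => ProcessFile(file)))
            .ToArray();

        Task.WaitAll(workers);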

    So, we are stuck and unable to proceed, and have no clue what the issue is. Please help.

    Thanks in advance.

    Regards,

    Harish

    Tuesday, February 28, 2017 1:24 AM

All replies

  • Hi Harish,

    When you say job commit gets stuck, what exactly do you mean? Do you mean control never returns from a call to job.Commit()? Is it possible that job.Commit() is throwing an exception that halts progress?
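    For example, wrapping the call so that any failure surfaces explicitly would rule that out (just a sketch; the logging call is illustrative):

        try
        {
            job.Commit();
        }
        catch (BatchException ex)
        {
            // A BatchException carries the server-side error details; log it
            // rather than letting the worker task swallow it silently.
            Console.Error.WriteLine("Job commit failed: " + ex);
            throw;
        }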

    It might be helpful to see the code snippet around where you think job.Commit() is getting stuck.

    Another option, if you can reliably reproduce this issue, is to run Fiddler while the code is running and watch the actual HTTP traffic, which might give an indication of what's going on.

    -Matt

    Tuesday, February 28, 2017 6:00 PM
  • Hi Matt,

    Yes, the control never returns after calling job.Commit(). Below is the code block where we create a job; the control gets stuck at cloudJob.Commit().

        using (BatchClient batchClient = BatchClient.Open(batchSharedKeyCredentials))
        {
            string poolId = GetConfigurationValue("PoolId");
            PoolInformation poolInformation = new PoolInformation { PoolId = poolId };

            CloudJob cloudJob = batchClient.JobOperations.CreateJob(jobId, poolInformation);

            // Pass the storage connection string through to the job's tasks.
            EnvironmentSetting environmentSetting = new EnvironmentSetting(
                "USR_STORAGE_CONN_STR",
                GetConfigurationValue("Azure_Storage_ConnStr"));
            cloudJob.CommonEnvironmentSettings = new List<EnvironmentSetting> { environmentSetting };

            // Control never returns from this call.
            cloudJob.Commit();

            CloudStorageAccount linkedStorageAccount =
                CloudStorageAccount.Parse(GetConfigurationValue("Azure_Storage_ConnStr"));

            cloudJob.PrepareOutputStorageAsync(linkedStorageAccount).Wait();
        }

    Is it possible that this issue occurs when multiple threads (System.Threading.Tasks) are trying to create jobs and tasks at the same moment? Because, as I mentioned earlier, we create tasks based on the number of files uploaded by the client, and each task tries to create a job followed by its tasks in parallel.

    Regards,

    Harish

    Tuesday, February 28, 2017 6:56 PM
    You do not mention your hosting model, but a common deadlock with the asynchronous "await" pattern is triggered in "message pumping environments" (ASP.NET request contexts, UI threads) by blocking with ".Wait()", which is what the last statement does with PrepareOutputStorageAsync(...).Wait(). Best practice here is to apply .ConfigureAwait(continueOnCapturedContext: false) before blocking, rather than calling .Wait() directly on the task.
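    Something along these lines, using the names from your snippet (a sketch only; whether it helps depends on your hosting model):

        // Block without resuming on the captured synchronization context, which is
        // the part that typically deadlocks in message-pumping environments.
        cloudJob.PrepareOutputStorageAsync(linkedStorageAccount)
                .ConfigureAwait(continueOnCapturedContext: false)
                .GetAwaiter()
                .GetResult();

        // Preferred, if the calling method can be made async:
        //     await cloudJob.PrepareOutputStorageAsync(linkedStorageAccount).ConfigureAwait(false);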

    I do realize that you specifically say it is the .Commit() call, but I mention the above because it is the most common cause of deadlock. If you are not in ASP.NET and/or something like a UI, then we still have no root cause (RCA).

    In races between "AddJob" and "AddTask(s)" the winner is cleanly defined: if no job exists yet, the AddTask(s) calls fail; if a job with the given id already exists, AddJob fails; otherwise AddJob comes in first and succeeds, and AddTask(s) comes in after and also succeeds. In other words, the loser of a race gets an error back rather than hanging.
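    So if your parallel workers ever raced on the same job id, you would see an exception rather than a hang, and something like this is enough to handle it (sketch only; the "JobExists" error-code string is my assumption, check it against what you actually get back):

        try
        {
            cloudJob.Commit();
        }
        catch (BatchException ex)
            when (ex.RequestInformation?.BatchError?.Code == "JobExists")
        {
            // Another worker committed a job with this id first; treat the job as
            // already created and continue on to adding tasks.
        }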

    d

    Friday, March 3, 2017 9:52 PM
  • Hi Harish -- in addition to what Daryl mentioned above, can you confirm that no exception is being thrown from .Commit? I don't see where it would be bubbled up to if one were thrown based on the code snippet you posted.

    How are you determining that .Commit is deadlocking (through the debugger?)? One thing you can do is share with us the client-request-id associated with this request (along with the Azure Batch region your account is in, and roughly the time the request with that client-request-id occurred), and we can take a look at our server logs to see if anything suspicious shows up there.

    To gather the client-request-id, the easiest thing to do is to set a simple request-id generation function and record the result, something like this:

        Guid clientRequestId;
        var requestIdGenerator = new ClientRequestIdProvider(req =>
        {
            clientRequestId = Guid.NewGuid();
            return clientRequestId;
        });

        job.Commit(additionalBehaviors: new[] { requestIdGenerator });
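    Then record the generated id somewhere you can retrieve it, along with a timestamp, for example (illustrative logging only):

        // Log the id and UTC time so it can be correlated against the server logs.
        Console.WriteLine("Commit client-request-id: {0} at {1:o} UTC", clientRequestId, DateTime.UtcNow);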
    -Matt

    Monday, March 6, 2017 6:29 PM