Retrying a task after a transient scheduling error

  • Question

  • I've just experienced a single task failing (out of 1000 total) with a scheduling error that I believe is transient: the failure was BlobAccessDenied, but if I paste the BlobSource URL into my browser, the file downloads fine.

    Will setting a MaxTaskRetryCount cause the task to be retried after a SchedulingError? The docs say "The Batch service retries a task if its exit code is nonzero", but in this case there isn't an exit code, so the behaviour isn't clear.
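
    For reference, this is roughly how I'd set the retry count with the .NET SDK (a trimmed sketch - the task ID and command line are placeholders, not my real code):

        // Sketch (Microsoft.Azure.Batch): opting a task into retries.
        // The docs say retries apply to nonzero exit codes; whether this
        // also covers scheduling errors is exactly what I'm unsure about.
        var task = new CloudTask("example-task", "cmd /c mytool.exe")
        {
            Constraints = new TaskConstraints(
                maxWallClockTime: null,
                retentionTime: null,
                maxTaskRetryCount: 3)
        };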

    Also, the error caused another dependent task not to run - it is just sitting there in the "Active" state. Is it correct that a dependent task will simply sit forever in the "Active" state if there is a scheduling error in a task it depends on?

    Thanks,

    James.



    • Edited by James Thurley Wednesday, August 17, 2016 6:05 PM Typo
    Wednesday, August 17, 2016 5:38 PM

Answers

  • Hi James,

    MaxTaskRetryCount is intended to let you re-run tasks in the event of an intermittent failure in the task itself (i.e. if the task is flaky or in some rare cases fails to execute properly and you just want to retry it). That means MaxTaskRetryCount won't retry tasks that hit a scheduling error - when the service encounters a scheduling error, it deems the task unretryable. In the specific case of BlobAccessDenied, Batch interprets the failure as unrecoverable, because we assume that if Azure Storage has rejected us with BlobAccessDenied, that access will not reappear later.

    There are a couple of things to check:

    1) Did the SAS you specified to the Azure Batch service for that blob expire? (I assume no, given that you said you can access the same URL provided to Batch in your browser and it still works)

    2) Was there any more detail than "BlobAccessDenied" in the details collection of the scheduling error?
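
    If it helps, something like this will dump everything the service recorded (a rough sketch against the .NET SDK - property names may vary slightly between SDK versions):

        // Rough sketch (Microsoft.Azure.Batch): print a task's scheduling error.
        CloudTask task = await batchClient.JobOperations.GetTaskAsync(jobId, taskId);
        var error = task.ExecutionInformation?.SchedulingError;
        if (error != null)
        {
            Console.WriteLine($"{error.Category}: {error.Code} - {error.Message}");
            foreach (var detail in error.Details ?? Enumerable.Empty<NameValuePair>())
            {
                Console.WriteLine($"  {detail.Name} = {detail.Value}");
            }
        }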

    As for your task dependencies question: Currently if a task fails to run to completion successfully, its dependent tasks will not be scheduled (and thus will stay in the active state). We have plans to allow the user to have more control over what constitutes a "successful" run of a task (so for example if TaskB depends on TaskA, you could say on TaskA that scheduling errors count as "success" and then TaskB would run, but by default TaskB would not run).
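
    For anyone else reading, dependencies are declared roughly like this (a sketch - the IDs and command lines are placeholders, and the job has to opt in via UsesTaskDependencies):

        // Sketch: TaskB is only scheduled once TaskA completes successfully.
        CloudJob job = batchClient.JobOperations.CreateJob(jobId, poolInformation);
        job.UsesTaskDependencies = true;
        await job.CommitAsync();

        var taskA = new CloudTask("TaskA", "cmd /c produce-results.exe");
        var taskB = new CloudTask("TaskB", "cmd /c post-process.exe")
        {
            DependsOn = TaskDependencies.OnId("TaskA")
        };
        await batchClient.JobOperations.AddTaskAsync(jobId, new[] { taskA, taskB });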

    Is the dependent task getting stuck in the Active state what you expect, or would you prefer to see some other behavior?

    Thanks,

    -Matt

    • Marked as answer by James Thurley Monday, August 22, 2016 2:36 PM
    Thursday, August 18, 2016 5:37 PM
    Owner

All replies

  • Hi Matt,

    Thanks for the reply, all makes sense.

    To answer your questions:

    1) The SAS definitely didn't expire, it was valid for 24 hours and worked in the browser.

    2) Yep, the full details were: Category=UserError, Code=BlobAccessDenied, Message="Access for one of the specified Azure Blob(s) is denied". It then had a details section listing the blob in question, and I copied the full URL from those details when I tested the SAS in the browser.

    I've run tens of thousands of tasks with this code and this is the first one that has produced this error, so I suspect it was some transient glitch somewhere.

    With regards to the dependent task, I always want to run the dependent task even if some of the tasks it depends on fail. The task does some post-processing and handles missing results by design - see the question I asked a short while ago (using my other account). As suggested there, I force my exit codes to zero even when there is a failure, to ensure the dependent task runs - but the scheduling error seems to be something I can't control for the moment.
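
    Concretely, my task entry point does something like this (a simplified sketch of the real thing):

        using System;
        using System.IO;

        public static class Program
        {
            public static int Main(string[] args)
            {
                try
                {
                    RunWork(args);   // the real work (placeholder)
                }
                catch (Exception ex)
                {
                    // Record the failure for the post-processing task to find,
                    // but deliberately exit 0 so Batch treats the task as
                    // succeeded and still schedules the dependent task.
                    File.WriteAllText("error.txt", ex.ToString());
                }
                return 0;
            }

            private static void RunWork(string[] args) { /* ... */ }
        }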

    Assuming this scheduling error doesn't start cropping up more frequently, I'm happy to wait until there is a feature for specifying that a dependent task should always run.

    Thanks,

    James.

    Monday, August 22, 2016 2:36 PM
  • Hi James,

    I am just curious - did the SAS that failed have a StartTime set on it?

    Also, did the task that failed run at any "interesting" point in your job (i.e. was it the first task to run, or the last)?

    -Matt

    Monday, August 22, 2016 7:43 PM
    Owner
  • Hi Matt, sorry for the late reply.  The SAS didn't have a start time set, only an expiry time. This was the exact SAS that failed:

    sv=2015-07-08&sr=b&sig=mrOF%2BGP1lD23xCvxi9600%2Fm1IsGrMWOGCvOIjsWNhpM%3D&se=2016-08-18T17%3A00%3A13Z&sp=r

    It was task number 36 out of 1000.  I add the tasks to Azure Batch in batches of 50, so it would have been in the first set of tasks to be scheduled - so I guess it could theoretically have been the first to run. I think I had between 50 and 100 VMs in the pool ready to go.

    There shouldn't be any kind of race condition between uploading and scheduling either: For each task I upload the file it needs (await blob.UploadTextAsync(content)) and once the upload is complete I create the ResourceFile for it.  Once I have all the resource files for the batch of 50 tasks I send them all to Azure Batch using JobOperations.AddTaskAsync(batchJobId, batchTasks).  
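
    In code the flow looks roughly like this (trimmed - container, content, commandLine and taskNumber come from my real code):

        // Trimmed sketch of the per-task flow (classic Azure Storage + Batch .NET SDKs).
        var blob = container.GetBlockBlobReference("input-" + taskNumber + ".json");
        await blob.UploadTextAsync(content);   // wait for the upload to finish

        var sas = blob.GetSharedAccessSignature(new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read,
            SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddHours(24)
        });

        // Only after the upload succeeds is the ResourceFile created.
        batchTasks.Add(new CloudTask("task-" + taskNumber, commandLine)
        {
            ResourceFiles = new List<ResourceFile>
            {
                new ResourceFile(blob.Uri + sas, "input-" + taskNumber + ".json")
            }
        });

        // After accumulating a batch of 50 tasks:
        await batchClient.JobOperations.AddTaskAsync(batchJobId, batchTasks);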

    I've run about 40,000 tasks since this failure and haven't seen it again.

    Thanks,

    James.


    Monday, August 29, 2016 12:02 PM
  • Hi James,

    I've run into the same BlobAccessDenied problem. Do you mean you fixed it by using JobOperations.AddTaskAsync instead of JobOperations.Add?

    Thanks,

    Shuangbei

    Thursday, June 20, 2019 7:14 AM
  • Hi Shuangbei - no, I always used the async version of the method. I haven't seen this occur since, so I think it was just a glitch in my case.

    James.

    Thursday, June 20, 2019 8:21 AM
  • Thanks for your reply.

    It turned out to be the SAS expiry time.

    Shuangbei

    Friday, June 21, 2019 8:10 AM