Time out running .NET code in Azure Batch

  • Question

  • Is anyone else getting timeouts running .NET code with a Batch account scheduled from Data Factory? The code has been live for months, but today one of our environments failed: the one in the West Europe region. Our other three environments are in North Europe and are running OK. 

    There is nothing showing on the service status page https://azure.microsoft.com/en-gb/status/history/, but the last time this happened, on 27th June 2018, it took a week before anything showed up. That time the error affected all of our environments, but the service status said they only had a problem with one.
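
    Since nothing is showing on the status page, the failure details reported by the Batch tasks themselves are probably the best clue we have. Below is a minimal sketch of pulling those with the Microsoft.Azure.Batch client (the account URL, key and job ID are placeholders, not our real setup):

        using System;
        using Microsoft.Azure.Batch;
        using Microsoft.Azure.Batch.Auth;
        using Microsoft.Azure.Batch.Common;

        var credentials = new BatchSharedKeyCredentials(
            "https://<account>.<region>.batch.azure.com", "<account>", "<key>");

        using (BatchClient client = BatchClient.Open(credentials))
        {
            // Inspect every task in the job and report any that completed with a
            // non-zero exit code or carry failure information (e.g. a timeout).
            foreach (CloudTask task in client.JobOperations.ListTasks("<adf-job-id>"))
            {
                if (task.State == TaskState.Completed &&
                    (task.ExecutionInformation?.ExitCode != 0 ||
                     task.ExecutionInformation?.FailureInformation != null))
                {
                    Console.WriteLine(
                        $"{task.Id}: exit {task.ExecutionInformation?.ExitCode}, " +
                        $"{task.ExecutionInformation?.FailureInformation?.Code} " +
                        $"{task.ExecutionInformation?.FailureInformation?.Message}");
                }
            }
        }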

    This is the service status message from last time:

    RCA - App Service - West Europe

    Summary of impact: Between 16:00 UTC on 27 Jun 2018 and 13:00 UTC on 28 Jun 2018, a subset of customers using App Service in West Europe may have received HTTP 500-level response codes, timeouts or high latency when accessing App Service (Web, Mobile and API Apps) deployments hosted in this region.

    Root cause and mitigation: During a recent platform deployment several App Service scale units in West Europe encountered a backend performance regression due to a modification in the telemetry collection systems. Due to this regression, customers with .NET applications running large workloads may have encountered application slowness. The root cause of this was an inefficiency in the telemetry collection pipeline which caused overall virtual machine performance degradation and slowdown. The issue was detected automatically, and the engineering team was engaged. A mitigation to remove the inefficiency causing the issue was applied at 10:00 UTC on June 28. After further review, a secondary mitigation was applied to a subset of VMs at 22:00 UTC on June 28. More than 90% of the impacted customers saw mitigation at this time. After additional monitoring, a final mitigation was applied to a single remaining scale unit at 15:00 UTC on June 29. All customers were mitigated at this time.

    Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
    • Removing the changes that caused the regression
    • Reviewing and if necessary adjusting the performance regression detection and alerting


    thanks

    Andrew


    Thursday, August 2, 2018 10:18 AM

All replies

  • Hi Andrew, I took a look and found that a small issue did occur on August 2nd regarding Virtual Machines in North Europe. Although you saw issues with your Batch account, Batch pools do run on VMs, so it is possible you were also impacted. 

    The issue was related to an unhealthy network infrastructure component that caused connectivity issues, which would be consistent with the timeout errors you were seeing. 

    Are you still seeing any problems, or is everything back to normal? From my reports, the issue on August 2nd has been resolved. 
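
    If it helps to confirm from your side, here is a rough sketch (the account details and pool ID are placeholders) that lists the state of the compute nodes in your pool along with any node-level errors; unusable nodes or node errors would point at the kind of infrastructure issue described above:

        using System;
        using Microsoft.Azure.Batch;
        using Microsoft.Azure.Batch.Auth;

        var credentials = new BatchSharedKeyCredentials(
            "https://<account>.<region>.batch.azure.com", "<account>", "<key>");

        using (BatchClient client = BatchClient.Open(credentials))
        {
            // Print each node's current state; Errors is null when the node
            // has nothing to report.
            foreach (ComputeNode node in client.PoolOperations.ListComputeNodes("<pool-id>"))
            {
                Console.WriteLine($"{node.Id}: {node.State}");

                if (node.Errors != null)
                {
                    foreach (ComputeNodeError error in node.Errors)
                    {
                        Console.WriteLine($"  {error.Code}: {error.Message}");
                    }
                }
            }
        }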

    Tuesday, August 7, 2018 7:57 PM