App Service restarts causing app to be inaccessible (502.3 Bad Gateway returned)

  • Question

  • We've hit an issue a few times over the past few weeks where anything that causes our app service to restart renders the application inaccessible, returning 502.3 Bad Gateway results.

    Background:

    It's a Web API running on ASP.NET Core 2.0.1 targeting .NET Framework 4.6.2. Prior to the last incident, we were running on a single Standard S2 instance, which has always hovered between 5-20% CPU and around 33% RAM usage. It's now running scaled out on two S2 instances (more information on that below).

    Causes:

    So far the issue seems to have been triggered by a number of things:

    - App Deployment (VSTS deployment to a staging slot, followed by a swap)

    - Updates to application settings

    - General restarts out of our control (app service initiated)

    The common thread seems to be things that cause the app service to restart.

    We've been able to decouple the issue from any change to the site's code or configuration by deploying commits identical to what is currently running, and by making configuration changes that have no bearing on the application code (for instance, adding unused app settings).
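
    For illustration only (our real deployments go through VSTS, and the resource names here are placeholders), the two "no-op" triggers above boil down to operations like these Azure CLI calls:

        # Hypothetical repro of the restart triggers described above;
        # the resource group, app and slot names are placeholders.

        # 1. An app setting the application never reads still recycles the worker process:
        az webapp config appsettings set \
          --resource-group my-rg --name my-api \
          --settings NOOP_MARKER=$(date +%s)

        # 2. Redeploying the identical commit to staging and swapping it into production:
        az webapp deployment slot swap \
          --resource-group my-rg --name my-api \
          --slot staging --target-slot production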

    When the problem occurs, we see in the Kudu event log that the web process has restarted successfully; however, requests no longer appear to reach the application (this is based on Application Insights request metrics, which may not be low-level enough to accurately capture what's happening). External requests to the application simply spin until they finally receive a 502.3 Bad Gateway/timeout response. The app does not resolve the issue on its own, or perhaps we haven't waited long enough, but the outages have lasted for several minutes.
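
    To make "spin until 502.3" concrete, this is the kind of external probe we watch during an incident (the hostname and health path are placeholders):

        # Illustrative probe only; hostname and path are placeholders.
        # During an incident this hangs for a couple of minutes and then
        # reports HTTP 502, even though Kudu shows the worker process running.
        curl --silent --output /dev/null \
             --write-out "HTTP %{http_code} after %{time_total}s\n" \
             --max-time 300 \
             https://my-api.azurewebsites.net/api/health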

    Existing Remedies:

    Two different things have appeared to correct the issue and set things straight:

    - Moving the application to a different app service plan

    - Scaling the existing app service plan out to additional instances

    In both cases, without any additional deployments/config changes, the app has come back online. 

    My working theory (really, a guess) is that there is something going on with the front-end load balancer that App Service uses. I don't have any visibility or evidence here, but I suspect both of these operations change the view the load balancer has of our app server, and requests are then able to make it through to a site that was really up and running the whole time.

    Potentially resetting its health or availability state? (I experienced a lot of this back in the day using ARR/IIS on servers for hosting, where a bad health or availability state for nodes in a web farm would have exactly the same result.)

    Anyone have any thoughts or additional information I can provide to help diagnose?

    Thanks,

    Andrew


    • Edited by Andrew-V, Thursday, December 14, 2017 5:49 PM (more information provided)
    Thursday, December 14, 2017 5:44 PM

All replies

  • There are three troubleshooting steps for 502 Bad Gateway errors:

    • Observe and monitor application behavior
    • Collect data (see the sketch below)
    • Mitigate the issue

    For more details, you can refer to this document: https://docs.microsoft.com/en-us/azure/app-service/app-service-web-troubleshoot-http-502-http-503.
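
    As a rough sketch of the data-collection step (resource names are placeholders), the diagnostic logs that document describes can be enabled and streamed from the Azure CLI:

        # Placeholder names; turns on web server logs, detailed error pages
        # and failed request tracing, then streams the log output.
        az webapp log config \
          --resource-group my-rg --name my-api \
          --web-server-logging filesystem \
          --detailed-error-messages true \
          --failed-request-tracing true

        az webapp log tail --resource-group my-rg --name my-api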

    -----------------------------------------------------------------------------------------------------

    Do click on "Mark as Answer" on the post that helps you, this can be beneficial to other community members.

    • Proposed as answer by Swikruti Bose Thursday, December 14, 2017 7:49 PM
    Thursday, December 14, 2017 7:49 PM
  • This sounds a lot like the problems we have been having lately, but for us a manual App Service restart will usually correct the issue. It's a very frustrating situation for us, because high availability is critical for our service. We have also contacted Azure support regarding these issues, but no clue or solution from there so far. Have you resolved your issues?
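
    For what it's worth, the manual restart we fall back on is just the standard restart operation, for example from the Azure CLI (names are placeholders):

        # Placeholder names; equivalent to the "Restart" button in the portal.
        az webapp restart --resource-group my-rg --name my-api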
    Wednesday, December 27, 2017 12:02 PM
  • Well, we have some things to look at. My colleague has been talking to the aspnetcore folks on a couple of GitHub issues with very similar symptoms:

    https://github.com/aspnet/AspNetCoreModule/issues/278

    https://github.com/aspnet/AspNetCoreModule/issues/260

    We've applied the setting described in the second issue to prevent app warm-up, and while we haven't experienced the problem since then, our app has also been under much less traffic. Given enough time, or confirmation that this has fixed it, I will check back in here; it does appear the ASP.NET team is aware of an issue.
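
    For anyone finding this later, I won't restate the exact change here (follow the second issue for that), but the workaround is roughly of this shape: a site-level XDT transform that suppresses the IIS application-initialization warm-up. This is a sketch only; the element below is the standard IIS applicationInitialization section, and the exact transform recommended in the issue may differ:

        <?xml version="1.0" encoding="utf-8"?>
        <!-- Sketch only (applicationHost.xdt in the site root): removes the
             IIS applicationInitialization (warm-up) section. Check the linked
             GitHub issue for the exact change recommended there. -->
        <configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
          <system.webServer>
            <applicationInitialization xdt:Transform="Remove" />
          </system.webServer>
        </configuration>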

    Wednesday, December 27, 2017 3:48 PM
  • Thank you for the quick reply! Our application is actually based on ASP.NET 4.6, so I'm not sure these GitHub issues apply to us, but we will definitely check them out.

    Wednesday, December 27, 2017 10:48 PM
  • @karelg Could you share the ticket number?

    -----------------------------------------------------------------------------------------------------

    Do click on "Mark as Answer" on the post that helps you, this can be beneficial to other community members.

    Thursday, December 28, 2017 6:38 AM
  • @Swikruti Bose The ticket number is 117121817339464.

    Thursday, December 28, 2017 1:19 PM
  • Are there any updates on this issue? Sadly, for us the problem has not disappeared; we are still experiencing service outages. Azure has not logged anything that could be related to these incidents and is falsely reporting that everything is working, even though the app was clearly unavailable and requests did not even reach the application.
    Monday, January 8, 2018 12:42 PM
  • I have checked the SR details; it looks like the engineers are actively engaged on this case. I would suggest you continue there for better assistance.

    -----------------------------------------------------------------------------------------------------

    Do click on "Mark as Answer" on the post that helps you, this can be beneficial to other community members.

    Tuesday, January 9, 2018 9:30 AM
  • Was there ever any update on this issue? It still seems to occur.

    We have a web service where 1400 users were accessing it simultaneously, across 3 instances and 8 subdomains (so roughly 58 users should have been hitting each one, which is not a lot of people), and they pretty much all crashed with these errors.

    If standard practice to avoid restarts crashing the app is to add a transform file, then the defaults need to change. It shouldn't take total service unavailability and some Google searching to find a resolution when it's common behaviour.

    Tuesday, September 18, 2018 10:42 AM
  • There is a feedback item on a solution for a similar issue that you could up-vote. All of the feedback you share in these forums is monitored and reviewed by the Microsoft engineering teams responsible for building Azure.
    Wednesday, September 19, 2018 6:06 AM