Azure AppGateway stops handling/routing connections after some time

    Question

  • We have an Azure App Gateway set up that routes traffic to 3 of our application nodes. Recently, the App Gateway has started running into an issue where, after a few hours of working normally, it abruptly stops receiving/handling connections. Short of setting up a new App Gateway, nothing we have tried gets the requests flowing again.

    And when we do set up a new App Gateway, the problem resurfaces after a few hours and it's back to square one. We have a number of App Gateways set up, and the problem seems to be unique to just this one gateway. Any advice on how we can troubleshoot this to see what's going on and get it addressed?

    Friday, May 24, 2019 3:43 AM

All replies

  • Hi Aravind, 

    I see contradictory statements here. 

     "when we do set up a new App Gateway, the problem re-surfaces after a few hours "

     "We have a number of App Gateways setup and the problem seems to be unique to just this one Gateway"

    Do you have this issue with any Gateway that you deploy, or just this one Gateway?

    Did you enable WAF for this Gateway?

    Is it a V2 SKU?

    Regards, 

    Msrini

    Friday, May 24, 2019 4:19 AM
    Moderator
  • Thanks for getting back to me.

    We have multiple App Gateways that are all pretty much configured the same way, one gateway for each of our services. The problem I'm reporting is specific to this one gateway for one of those services. And even after trying one new app gateway after another for that service, the same issue keeps cropping up after a few hours - where the gateway stops routing traffic. 

    Does that help clear any potential contradiction?

    It's a Standard SKU, so not V2, and WAF is not enabled.

    Best,

    Arvind

    Friday, May 24, 2019 5:27 AM
  • Thank you for clarifying. 

    So, the Application Gateway stops accepting traffic for one specific service. In order to comment further, I need logs from the backend resource and the Application Gateway. Since those logs cannot be shared in this forum, can you raise a support ticket?

    This requires deeper investigation with logs from your end as well as the Application Gateway end. If you don't have a support plan, let me know.
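
    In the meantime, it is worth making sure diagnostic logging is enabled on the Application Gateway so the data is already there the next time this happens. A rough Azure CLI sketch - the resource IDs below are placeholders for your actual gateway and Log Analytics workspace:

    az monitor diagnostic-settings create \
        --name appgw-diagnostics \
        --resource <application-gateway-resource-id> \
        --workspace <log-analytics-workspace-resource-id> \
        --logs '[{"category": "ApplicationGatewayAccessLog", "enabled": true},
                 {"category": "ApplicationGatewayPerformanceLog", "enabled": true}]'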

    Regards, 

    Msrini

    Friday, May 24, 2019 6:33 AM
    Moderator
  • Hi Msrini,

    Thank you for getting back to me, and for your assistance - appreciate the help

    The Backend Health blade shows the following error for all 3 nodes that are sitting behind this App Gateway:

    "Unable to retrieve health status data. Check presence of NSG/UDR blocking access to ports 65503-65534 from Internet to Application Gateway."

    However, if the ports were blocked, the gateway should never have worked at all - it's strange that the issue shows up only after the service has been operational for several hours. We raised a support ticket on this same issue a couple of weeks ago, but they responded a few days later, and by then the health issue had "fixed itself" and everything seemed to be working again. We are currently in the process of moving to an EA on our Azure plan, so we're working out the support tier/plan as part of that. Are there any specific logs we can send you? I can make sure those logs are active while we wait for this issue to recur.
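
    For what it's worth, this is roughly how we're checking whether any NSG rule could be blocking that port range on the App Gateway subnet - the resource group, VNet, subnet and NSG names below are placeholders, not our real ones:

    # Find the NSG attached to the App Gateway subnet
    az network vnet subnet show -g <resource-group> --vnet-name <vnet-name> \
        -n <appgw-subnet-name> --query networkSecurityGroup.id

    # List its rules and look for anything denying TCP 65503-65534 from the Internet
    az network nsg rule list -g <resource-group> --nsg-name <nsg-name> -o table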

    Also, we are exploring moving our AG to V2. Not sure if that'll help avoid the issue, but if we have to spin up a new App Gateway again, it sounds like that would at least be faster. However, it looks like setting up a V2 instance is going to need some work on our end in terms of configuring the VPN, etc., so we're working through that right now.

    Thanks!

    Friday, May 24, 2019 5:05 PM
  • Do you have an ExpressRoute circuit connected to this VNet which advertises a 0.0.0.0/0 route?

    Regards, 

    Msrini

    Friday, May 24, 2019 5:19 PM
    Moderator
  • No, we don't - we haven't configured any ExpressRoute circuits at this point. Is that something we need?
    Friday, May 24, 2019 7:01 PM
  • Hi Arvind,

    When it 'stops handling connections' what does the end user see?  What is the HTTP response?

    Thanks,

    Matt

    Friday, May 24, 2019 7:04 PM
  • Hi Matt - the connection just times out; there isn't a response that goes back and the client just handles it like it would a timeout.

    The app nodes themselves are functioning just fine - the issue really is that the requests don't seem to get routed to the nodes. And when we monitor the App Gateway traffic, we don't see requests actually coming through, which is confusing because we know the requests are being sent.
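
    For reference, we're pulling the gateway's traffic numbers roughly like this - the resource ID is a placeholder, and I'm assuming Throughput is the right metric name for the Standard SKU:

    az monitor metrics list --resource <application-gateway-resource-id> \
        --metric Throughput --interval PT5M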

    Thanks for your help,

    Arvind

    Friday, May 24, 2019 7:11 PM
  • OK, thanks. From your previous responses, is this site accessed entirely over VPN? When this happens, can you access the site over the public internet, or is it locked down?

    If this is over VPN and it intermittently stops working, then this may be VPN config.

    If you can confirm this is over VPN, I can probably offer further guidance.

    Thanks,

    Matt

    Friday, May 24, 2019 7:18 PM
  • No, actually it is not over VPN - the app gateway is on the public internet and clients connect to it over the public domain only. The gateway then routes the calls to one of the app nodes, that's it. It's a pretty straightforward setup, and when we first set up the App Gateway, everything is fine. Once the issue starts appearing (where it is no longer routing incoming requests), it's like the App Gateway is dead - even simple changes to the App Gateway config are just left in the "Updating" state forever. So far, we've had no choice but to set up a new App Gateway to get around the problem.

    Thanks,

    Arvind

    Friday, May 24, 2019 7:23 PM
  • That is weird. And new App Gateways do the same thing? Is there anything common between them, like using the same IP?
    Friday, May 24, 2019 7:29 PM
  • Exactly, very weird. The new App Gateways do the same thing but only after several hours - things work fine when we first provision, until the issue surfaces again. 

    And other than the App Gateway properties (backend pool, health probes etc.), there isn't anything else that's common. The IPs have all been different for each instance of the gateway, of course

    Friday, May 24, 2019 7:34 PM
  • Hi,

    Thanks for the quick response. And when this happens, I assume it doesn't resolve itself and you have to recreate it? Are all ports in the range 65503-65534 open from the Internet to the Application Gateway, per that health status error?

    When this happens does the graph on the overview blade of the AG stop showing requests?

    Thanks,

    Matt

    Friday, May 24, 2019 7:45 PM
  • Thanks for the quick responses as well, Matt - and appreciate the assistance.

    Yes, that is correct - it doesn't resolve itself, and we have to recreate the App Gateway. And when this happens, you are right, the graph on the AG overview blade stops showing requests. We'll check on the ports and confirm that those are open.

    Friday, May 24, 2019 8:25 PM
  • Hmmm. Kinda sounds like a routing issue, but it's odd that it happens on multiple AGs. Are all the troublesome AGs deployed to the same VNet? Maybe there is an issue there?
    Friday, May 24, 2019 9:58 PM
  • I have seen issues like this before. The Application Gateway became unresponsive, and the reason was the kind of traffic being sent. If I remember correctly, the backend was a Bot service, and it caused the Application Gateway to queue requests until it became unresponsive.

    Can you share the Service Request number with me so that I can follow up with the support team to take a look again?

    Regards, 

    Msrini

    Saturday, May 25, 2019 5:47 AM
    Moderator
  • Thanks Msrini - here are the SR details:

    Title: Unknown backend health status for last 6 hours
    Support request ID: 119041626000543
    Created on: Tue, Apr 16, 2019, 7:06:19 AM UTC

    As for the backend in our case, it's a service that's just handling regular web requests - actually a fairly busy service that handles requests all the time. Let me know if more specifics on the service would be useful.

    Saturday, May 25, 2019 2:10 PM
  • Yeah, wondering about the VNet as well - so we are going to try moving it off there to see if that brings some relief. Will keep you posted if we do manage to get to the bottom of this. We do have a couple of other AGs (for other apps) in the same VNet, but the request throughput on those is lower.

    It's odd that it only seems to happen on this AG (so maybe there's a traffic pattern there), and it's also odd that it doesn't happen right away but only after it has been operational for some time.

    Saturday, May 25, 2019 2:12 PM
  • As you mention this is a very busy site... do you know roughly how many connections it may have? Since this stops working after a while, I'm wondering if it is worth looking at TCP port exhaustion on the backends when this happens.

    On Windows you can get a TCP count with:

    netstat -ano | find /c "TCP"

    Sunday, May 26, 2019 10:45 AM
  • Hmm, interesting point. Let me see if we can schedule something to monitor and log that over time. They are all Ubuntu nodes.
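
    Something like this is what I have in mind for the monitoring - a rough sketch only; the log path and interval are arbitrary:

    # /etc/cron.d entry on each node: log the count of established TCP sockets every 5 minutes
    # (ss prints one header line, so the number is the count plus one; adjust as needed)
    */5 * * * * root echo "$(date -Is) $(ss -tan state established | wc -l)" >> /var/log/tcp-conn-count.log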

    Sunday, May 26, 2019 4:13 PM
  • Ah OK, that may make it easier to see (it may not even be this, but worth checking). I think Linux would log this to /var/log/messages as "too many open files", but I can't be sure. I've asked our Linux team on Slack how they would check this, as you'll likely be able to rule it out from the current logs.

    I'll come back to you...

    Sunday, May 26, 2019 4:21 PM
  • I've been told this for checking port exhaustion on Linux, regarding what I said earlier:

    "typically too many open files errors would show up
    but also tcp connection failures
    you can also check the limit by looking at the web server pid in /proc/$PID/limits and x-check that against the count of open file handles you find using `lsof`"
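
    Put into commands, that would be roughly the following - the process name is just an example; substitute whatever your web server runs as:

    # PID of the web server (nginx here is only an example)
    PID=$(pgrep -o nginx)

    # The per-process open-file limit...
    grep "Max open files" /proc/$PID/limits

    # ...versus how many handles it currently has open
    lsof -p $PID | wc -l

    # Kernel counters for failed/dropped TCP connections
    netstat -s | grep -i -E "fail|overflow|drop"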

    As I said, it may not be this, but I have seen similar behavior on Windows where the site just seems to time out on the LB.

    Thanks,
    Matt

    Sunday, May 26, 2019 4:45 PM
  • Thanks Matt. We've definitely seen the "Too many open files" errors in the past (and yes, those typically have to do with sudden spikes in request volume), but I don't think we saw anything on the nodes in this instance that indicated a problem like that. Will check and see if we can trace some history there. Since the last reset (touch wood!), we've increased the capacity of the App GW and the issue hasn't recurred. Until it recurs, we won't be able to investigate further, but obviously I'm praying it doesn't happen again anytime soon :)


    Tuesday, May 28, 2019 7:00 PM