Site going down intermittently with 500 errors

  • Question

  • I've been struggling to figure out what's going on with my website for 2 days to no avail. The issue first became apparent when I started seeing 500 errors for every single API endpoint in my app. Quite literally every request that comes in is failing. I tried to access the Deployments panel to redeploy/sync and see if that fixed anything, but was confronted with a blank page. I have also tried restarting the website, with no change.

    When I discovered the blank Deployments page I thought I might just need to re-link Bitbucket, but that fails on every attempt. Authorization is successful and the webhook is created in the Bitbucket repository, but something after that step fails before completion. The portal sometimes claims that the link was successful, but that's never been the case.

    The website in question I'd rather not post publicly, but I've created a dummy site called rendezvoustest. The only other website is the one in question. 

    I've continued investigating and have found that Kudu is occasionally accessible, but more often than not it either fails to show anything at all or returns a "502 - Web server received an invalid response while acting as a gateway or proxy server." error. When I've been able to access it, I've attempted to get Diagnostic dumps, but that has proved fruitless. Furthermore, the Process Explorer shows nothing, or at least never finishes refreshing. A couple of times I have been able to access the website files through Kudu, but that often ends in a timeout error.

    I've tried linking to GitHub and a local git repo as well, without success. I'm able to run my web server locally without issue, so I'm pretty confident it's not something on my end. I can't think of anything else to try at the moment and am looking for some guidance or a miracle worker.


    • Moved by David Ebbo Friday, July 3, 2015 5:09 PM Not git related
    • Edited by David Ebbo Friday, July 3, 2015 7:04 PM Clarified title
    Wednesday, March 4, 2015 7:16 PM

All replies

  • That sounds like some general site health issue that is not related to being linked to Bitbucket. If you cannot reliably access Kudu, then many things are bound to not work.

    Is the site itself running normally, or is it also having issues?

    You can use this technique to share the site name with the team without revealing it publicly.

    Wednesday, March 4, 2015 8:41 PM
  • I covered those quite well in the OP but to recap:

    The site itself is also having issues. Every request is returning a "502 - Web server received an invalid response while acting as a gateway or proxy server" error. Even the root of the app which simply returns a string that says "Running" is failing.

    The dummy site I've already created based on the technique for not revealing the site name is called rendezvoustest.

    Wednesday, March 4, 2015 9:18 PM
  • Sorry, I missed that in the OP! We will be investigating the issue.
    Wednesday, March 4, 2015 10:08 PM
  • The issue should be mitigated for now. Let us know if you encounter it again.

    Suwatch

    Thursday, March 5, 2015 1:13 AM
  • Thank you. Will an issue like this always require manual intervention? Is there something I can do to prevent or at least mitigate it happening again?
    Thursday, March 5, 2015 2:22 PM
  • This issue should not be happening in the first place, and we are looking at preventing it. So hopefully the question of 'how to fix it after it happens' would become moot :)

    But as far as mitigating, one thing that can sometimes help is to scale to a different VM size (e.g. Medium to Large or vice versa), and then back to your original size.

    Thursday, March 5, 2015 6:37 PM
  • Just wanted to revisit this and thank you for the tip on scaling then back to the original size to resolve this issue. Had it come up again on the same instance as a month ago and was able to resolve it that way.
    Friday, April 17, 2015 2:22 PM
  • Saw this issue again. Got it to work by changing VM sizes again but thought I'd make note of the occurrence.
    Wednesday, July 1, 2015 9:40 PM
  • Can someone please help me understand what causes this problem? I've now gone down 3 times in the past 48 hours because Kudu seemingly just locks up. A few months ago it was noted that this shouldn't be happening and that you're working on a fix that prevents it, but I'd like to understand what's causing it. Are there certain API calls, HTTP requests or anything else that might have an impact on this?

    I'm at a point where if something doesn't change I'm going to have to find a new platform to run my stuff on.

    Thursday, July 2, 2015 2:23 PM
  • Early in the thread, you mentioned that when that happened the site was also down with 500 errors. Is that still the case each time? You only mention Kudu locking up, so I'm not clear that we're talking about the same thing. If the site is also down, we need to redirect the focus of this thread to a general site issue instead of having the focus be on Kudu/Bitbucket.

    Other question: are other sites in the same App Service Plan affected, or just the one site?

    Your dummy site rendezvoustest does not seem to exist. Could you create it and keep it alive until this thread is complete? This helps us identify your account/subscription.

    thanks,
    David

    Friday, July 3, 2015 5:22 AM
  • When I refer to Kudu locking up I mean I suddenly can't access things controlled by Kudu on the backend like log streaming, deployment settings, etc. The application was running perfectly for 18 hours until it happened again around 3am this morning. Restarting the app has no effect and cycling the VM Size of the App Service Plan is the only thing that I've been able to use to remedy the outage. I'm not seeing any crazy exceptions or crashes on my end that I've been able to find which is why this has become so frustrating. As this was happening I was also unable to access myapp.scm.azurewebsites.net. Sometimes that's the case, other times I can access the root but not something like /api/deployments.

    Currently this is the only site within the App Service Plan so unfortunately I can't say if any other apps are affected. I have 3 total slots for the site and all 3 are affected equally though. I've recreated the rendezvoustest app within the same service plan and am configuring the deployment integration now. 

    Friday, July 3, 2015 11:32 AM
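
    As a rough illustration of the distinction being discussed here (site down, scm down, or only some scm paths such as /api/deployments failing), a minimal Python sketch that probes each endpoint and labels the result. The helper names (`classify`, `probe`, `endpoints`, `report`) are hypothetical, `myapp` is just the placeholder used in the post, and this is not an official Azure or Kudu tool.

```python
import urllib.error
import urllib.request


def classify(status):
    """Map an HTTP status code (or None for no response) to a coarse label."""
    if status is None:
        return "unreachable"
    if 500 <= status <= 599:
        return "server error"
    if 400 <= status <= 499:
        return "client error"
    return "ok"


def probe(url, timeout=10):
    """Return the HTTP status for url, or None if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, ...


def endpoints(site):
    """Build the URLs to check for a site name (placeholder hostnames)."""
    return [
        f"http://{site}.azurewebsites.net/",
        f"https://{site}.scm.azurewebsites.net/",
        f"https://{site}.scm.azurewebsites.net/api/deployments",
    ]


def report(site, fetch=probe):
    """Return {url: label} for every endpoint of the given site."""
    return {url: classify(fetch(url)) for url in endpoints(site)}
```

    Calling `report("myapp")` during an outage and again afterwards would give a simple record of which endpoints were failing. The scm site normally asks for credentials, so a "client error" there most likely just means it is responding; a "server error" or "unreachable" is the interesting case.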
  • Sorry, but I'm still not quite clear here. The fact that there are issues with the scm site during those periods has been established. What I'm asking is whether your actual site is equally affected during those same periods. I assume that it isn't, otherwise, you would likely focus on this rather than on the scm (since scm is obviously the lesser issue). But I would like to confirm this.

    One more question: when you say 3am, what time zone are you referring to?

    Thanks for recreating the test site, as this will help.

    thanks,
    David

    Friday, July 3, 2015 1:56 PM
  • Note that right now, I see your deploy slot being down (http://xxx-deploy.azurewebsites.net/). But the scm site is fine (https://xxx-deploy.scm.azurewebsites.net/).

    The reason the site is down seems to be an app issue. Your error logs have these Node entries: "Error: Cannot find module 'mime-db'". You can see the details by going into the D:\home\LogFiles\Application folder in Kudu Console.

    This sounds unrelated to what you are describing, but I thought I'd mention it, since it is something you'll want to look into regardless.

    Friday, July 3, 2015 3:36 PM
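
    For anyone wanting to check their own logs for this kind of entry, a small Python sketch that scans a folder of log files for Node's "Cannot find module" errors. It assumes the logs have been downloaded locally (for example over FTP); the function names are hypothetical and the regular expression only matches the message format quoted above.

```python
import re
from pathlib import Path

# Matches entries such as: Error: Cannot find module 'mime-db'
MISSING_MODULE = re.compile(r"Cannot find module '([^']+)'")


def missing_modules(text):
    """Return the distinct module names reported as missing, in first-seen order."""
    seen = []
    for name in MISSING_MODULE.findall(text):
        if name not in seen:
            seen.append(name)
    return seen


def scan_log_dir(log_dir):
    """Map each log file name under log_dir to the missing modules it mentions."""
    results = {}
    for path in sorted(Path(log_dir).rglob("*")):
        if path.is_file():
            found = missing_modules(path.read_text(errors="replace"))
            if found:
                results[path.name] = found
    return results
```

    Pointing `scan_log_dir` at a local copy of the LogFiles\Application folder would list which files mention a missing module and which module it was.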
  • Yes, my actual site is completely unresponsive during those times. My clients are unable to access any of the functionality which is the focus of this. Every time I've referred to outages or my site being down, that's what I mean. Without my site going down the first time months ago I'd have never even learned about scm. I only keep mentioning it because every time I'm unable to access my site, I'm also unable to access some or all of scm. Thus my assumption so far is that something on the backend is causing my site to go down. Lately I've noticed 503 Service Unavailable responses when I try to access the APIs I host through the site.

    3am was a reference to EST.

    As for my deploy slot, that is an accepted, albeit not quite understood, problem as well. The production and deploy slots are both running the same commit from my master branch, but it seems that after every new deploy mime-db isn't installed as part of `npm install`. When I then go into my deployment settings and force a sync with the repo, all works well again. It seems repeatable, but I haven't deployed enough since I first noticed it to have a full grasp of the circumstances that trigger it. I do know it always has to do with newrelic though.

    Sorry if I've appeared to be unclear.


    Friday, July 3, 2015 4:26 PM
  • Thanks for the additional info. So at this point, let's completely switch the investigation to a general site availability issue, and remove any focus on bitbucket/kudu/scm, as that has been the main source of confusion.

    And just to confirm, you are saying that around 3am EST (aka 7am UTC), all requests to the following sites were failing with 500 level errors (I'm replacing the site name by xxx here):

    • http://xxx.azurewebsites.net/
    • http://xxx-alpha.azurewebsites.net/
    • http://xxx-deploy.azurewebsites.net/


    Is that correct?

    Friday, July 3, 2015 4:46 PM
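
    On the time zone arithmetic: the 7am UTC figure matches 3am Eastern daylight time (UTC-4), which is what the US East Coast observes in July; read strictly as standard time (UTC-5), 3am would be 8am UTC. A short Python sketch with fixed offsets shows both readings. The date is taken from the post that reported the outage, and the variable names are only for illustration.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")  # US Eastern daylight time
EST = timezone(timedelta(hours=-5), "EST")  # US Eastern standard time


def to_utc(local):
    """Convert a timezone-aware datetime to UTC."""
    return local.astimezone(timezone.utc)


# "Around 3am this morning", posted on Friday, July 3, 2015
outage_edt = to_utc(datetime(2015, 7, 3, 3, 0, tzinfo=EDT))
outage_est = to_utc(datetime(2015, 7, 3, 3, 0, tzinfo=EST))

print(outage_edt.strftime("%H:%M UTC"))  # 07:00 UTC
print(outage_est.strftime("%H:%M UTC"))  # 08:00 UTC
```

    Sticking to UTC when comparing outage times against platform logs avoids this ambiguity.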
  • FYI I renamed the thread and moved it to the Web Apps forum to prevent more confusion if others look at it.

    Friday, July 3, 2015 7:10 PM
  • Sounds good. I'm not so sure there is actually confusion with focusing on kudu/scm (certainly bitbucket isn't an issue, that was just how I first found the problem a few months ago) as the two seem to be directly linked from my vantage point. 

    It's not necessarily all 3 slots that do it at the same time. For instance, right now xxx.azurewebsites.net is exhibiting the problem and has been since about 5:40pm UTC (I'll stick to using that for the sake of clarity) while xxx-alpha.azurewebsites.net isn't exhibiting the problem. 

    Saturday, July 4, 2015 2:14 AM
  • Yes, I was able to observe the issue and gained some initial knowledge about it, but more investigation is needed. I have involved other engineers to help.

    Note that today is a holiday, so I can't guarantee response time (though we're doing what we can). There is the option of opening a support ticket, which has people around the clock.

    David

    Saturday, July 4, 2015 3:24 AM
  • Any status updates here? I recycle the site and within a matter of hours it starts happening again. Not really sure where to go from here. I'll continue to investigate on my side to see if I can narrow anything down but I'm not sure what angle to take anymore.
    Tuesday, July 7, 2015 2:41 AM
  • No real breakthrough, but I'll share some thinking about what might be happening:

    • Scaling up/down gets you a clean new VM
    • Something that the site does at runtime slowly causes some kind of resource to be exhausted. It's not CPU or memory, as those are fine. Maybe some type of handles.
    • This eventually causes the VM to become bad in some way

    This is speculative, and what leads me to think this is that switching VM always fixes it, and that we have not seen this pattern in other sites.

    What doesn't really add up is that normally, killing the process (or restarting the site) would cause resources to be reclaimed, while in your case it doesn't seem to help.

    Question: is there anything unusual or notable about what the site is doing at runtime?



    • Edited by David Ebbo Tuesday, July 7, 2015 4:04 AM
    Tuesday, July 7, 2015 3:23 AM
  • We made a potential temporary mitigation. Would you please let us know whether things are working better now? It is only a mitigation, but we will be able to learn from it.

    thanks,
    David

    Tuesday, July 7, 2015 6:02 AM
  • Everything seems to be humming along since you replied about 6 hours ago. 

    As for what the application does, it is essentially built to accept uploads of specially formatted zip files, stores the files in a local temp folder for some processing, writes to a database and then cleans up the temp files. Can't really think of anything it does that I'd consider unusual. 

    Tuesday, July 7, 2015 12:22 PM
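
    The upload flow described above (unpack to a temp folder, process, clean up) can be sketched as follows. This is only an illustration in Python of the general pattern, not the poster's actual code (the real app runs on Node); `process_upload` and `handle_file` are hypothetical names, and the database write is left to whatever `handle_file` does.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def process_upload(zip_bytes, handle_file):
    """Extract an uploaded zip into a temp folder, process it, always clean up."""
    results = []
    # TemporaryDirectory removes the folder on exit, even if processing raises
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
            archive.extractall(tmp)
        for path in sorted(Path(tmp).rglob("*")):
            if path.is_file():
                results.append(handle_file(path))
    return results
```

    The point of the context manager is that temp files cannot pile up when a request fails halfway through, which is one of the slower ways a site can run out of some resource over time.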
  • Had this reappear just about an hour ago after what amounted to 48 hours of total uptime. Any further insights, particularly in regards to the type of work the application is performing?
    Thursday, July 9, 2015 5:23 PM
  • We found something that appears to be related to the issue. In your production slot, you have a gigantic newrelic_agent.log (over 10GB!!), which is causing some extreme perf issues. It's in the site\wwwroot\bin folder. Presumably, it's a log file that is not vital, so you should try simply deleting it (or back it up first if needed).

    Kudu Console may not work because of the issue, but FTP should work. If the file is in use, you may have to stop the site first.

    You may also want to check the other two slots in case they have similar files. It's just not healthy for a log file to grow that large :)

    Please let us know how that works out.

    • Proposed as answer by David Ebbo Friday, July 10, 2015 1:32 PM
    • Marked as answer by GridSmart Dev Wednesday, July 15, 2015 2:26 AM
    Thursday, July 9, 2015 8:57 PM
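
    To check a site (or each slot) for oversized files like this one, a minimal Python sketch that walks a folder and lists the biggest files above a threshold. The function names are hypothetical, and the path in the usage note is an assumption about where the site content is mounted; adjust it to wherever the files can actually be reached.

```python
import os


def file_sizes(root):
    """Yield (path, size_in_bytes) for every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                yield path, os.path.getsize(path)
            except OSError:
                continue  # file vanished or is inaccessible; skip it


def largest(sizes, min_bytes, top=10):
    """Return up to `top` (path, size) pairs of at least min_bytes, biggest first."""
    big = [(p, s) for p, s in sizes if s >= min_bytes]
    return sorted(big, key=lambda item: item[1], reverse=True)[:top]
```

    For example, `largest(file_sizes(r"D:\home\site\wwwroot"), 100 * 1024**2)` would list files of 100 MB or more, which should make a runaway log such as newrelic_agent.log stand out immediately.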
  • You're certainly right, that shouldn't be happening at all. In fact I'm not even remotely sure why that log file gets generated by newrelic in a production instance. Thanks for catching that! Definitely going to make sure that doesn't happen anymore. I'll report back if I have any further issues after getting that cleaned up.
    Friday, July 10, 2015 12:18 PM