Occasional poor search performance, 503 error

  • Question

  • Our Azure Search Basic-tier instance performs well for us most of the time. However, we see occasional slowdowns, indicated by long dependency durations reported by Application Insights and also verified interactively.

    For instance, Application Insights reports that between 21:51 and 21:56 UTC today, the average response time from search was 20s with 6 out of the 10 search requests during that time period failing with a response code of 0 (connection aborted?) after 20+ seconds.

    Our index has fewer than 1,500 documents and is under 30 MB.

    I have turned on search diagnostics, and there was not a high volume of requests during this time. We're talking 3 or fewer queries per minute, if the operation log is to be believed!

    The only clue I see in the diagnostics logs is a single 503 error around the time of this slowdown. Why would a 503 happen under such light load?

    Our usage is a web application that calls Azure Search directly from clients' web browsers via AJAX. I have observed a few of these slowdowns personally: I see 10+ second time-to-first-byte from Azure Search to my browser, yet the Azure Search operation log shows those same requests were serviced in less than 2 seconds.
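
    For context, here is a minimal sketch of the kind of client-side call our pages make, with an explicit timeout so an aborted connection surfaces quickly instead of hanging for 20+ seconds. The service name, index name, api-version, and query key are placeholders, and it uses fetch with AbortController rather than the AJAX library we actually use:

```typescript
// Sketch only: the kind of client-side query our pages issue, with an explicit
// timeout so a hung connection is aborted rather than left hanging for 20+ seconds.
// SERVICE, INDEX, API_VERSION, and QUERY_KEY are placeholders.
const SERVICE = "myservice";          // hypothetical service name
const INDEX = "myindex";              // hypothetical index name
const API_VERSION = "2016-09-01";     // whichever api-version you target
const QUERY_KEY = "<query-api-key>";  // query key (not the admin key)

async function runSearch(term: string, timeoutMs = 5000): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const url =
      `https://${SERVICE}.search.windows.net/indexes/${INDEX}/docs` +
      `?api-version=${API_VERSION}&search=${encodeURIComponent(term)}`;
    const res = await fetch(url, {
      headers: { "api-key": QUERY_KEY },
      signal: controller.signal,
    });
    if (!res.ok) {
      throw new Error(`Search returned HTTP ${res.status}`);
    }
    return await res.json();
  } finally {
    clearTimeout(timer);
  }
}
```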


    • Edited by kjkruege Wednesday, May 11, 2016 11:29 PM
    Wednesday, May 11, 2016 11:27 PM

All replies

  • If you could share the name of your service I can look in more detail at the timeframe you specified.

    You mentioned that you see a significant difference between the time measured in your browser and the time reported in the search logs. That might point at a network issue as well. That said, let's look at our internal telemetry first to see if there's anything else going on from our side. Once I get your service name I'll look into it. If you'd rather not post it here, feel free to email it to me at Pablo dot Castro AT usual microsoft domain.


    This posting is provided "AS IS" with no warranties, and confers no rights.

    Thursday, May 12, 2016 1:08 AM
    Moderator
  • Our instance name is "teachengineering"

    I agree that it appears there could be a networking aspect to this. If so, I do think it is closer to the server than the clients, because the clients experiencing the slowdowns are geographically dispersed. I have witnessed the slowdown personally as a client from two different ISPs.

    Thursday, May 12, 2016 4:52 PM
  • Thanks, I just looked at your service and I can see what happened. This is a single-unit service with no replicas configured. While we do our best to ensure availability for single-unit services, it is not guaranteed (services with 2 or more replicas are backed by an availability SLA; details here).

    Sometimes a glitch can catch a search unit and disrupt its availability; with multiple replicas, the other units take over. In your case, since the only unit was briefly unavailable, you saw rejected connections (sometimes this problem manifests as status 503 in HTTP calls, depending on the nature of the issue). In this particular case, there was an ongoing deployment where we pushed new bits, which caused the search unit to be briefly unreachable.
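
    If you want to smooth over these brief blips on the client side, a small retry wrapper around the query call is one option. This is only a sketch, assuming a failed attempt shows up either as a thrown network error (the status-0 case you saw) or as an HTTP 503; runQuery stands in for whatever function actually issues your search request:

```typescript
// Sketch only: retry a query on transient failures (connection aborts, HTTP 503).
// runQuery stands in for whatever function actually issues the search request.
async function searchWithRetry(
  runQuery: () => Promise<Response>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const res = await runQuery();
      // Only 503 is treated as transient here; any other status is returned as-is.
      if (res.status !== 503) {
        return res;
      }
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      // Network-level failure; in the browser this is the "status 0" case.
      lastError = err;
    }
    // Exponential backoff before the next attempt.
    await new Promise((resolve) =>
      setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1))
    );
  }
  throw lastError;
}
```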

    Let me know if there's anything else I can look into.


    This posting is provided "AS IS" with no warranties, and confers no rights.

    Thursday, May 12, 2016 7:33 PM
    Moderator
  • Thanks, Pablo. That makes sense.  Just a few clarifying questions:

    1) I'm sure it varies, but how often are you typically pushing bits that would interrupt a service unit?

    2) In addition to failed requests, I also see slow successful requests around the same time. I assume this could be attributed to instance shut-down / warm-up time?

    3) Would the free tier be similarly impacted by the same updates? Or does the free tier perhaps run on multiple units behind the scenes? (I realize there would be other downsides to the free tier because of its shared nature).

    Thursday, May 12, 2016 9:46 PM
  • 1) No fixed schedule, but roughly weekly

    2) Yes, you're probably catching the edges of the process

    3) Free tier services don't see this particular issue; those indexes run in a shared environment that tends to have higher availability than single-replica services that aren't configured for high availability. (If you decide to add a replica, one way to do it programmatically is sketched below.)
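
    If you do decide to add a replica for the availability SLA, you can change the replica count in the portal or programmatically through the Azure Resource Manager REST API. The following is a rough sketch only: the subscription ID, resource group, bearer token, and api-version are placeholders, and it assumes the api-version you target supports PATCH on the search service resource:

```typescript
// Sketch only: raise the replica count through the Azure Resource Manager REST API.
// Subscription ID, resource group, api-version, and the AAD bearer token are
// placeholders; acquiring the token is out of scope for this sketch.
const SUBSCRIPTION_ID = "<subscription-id>";
const RESOURCE_GROUP = "<resource-group>";
const SERVICE_NAME = "teachengineering";
const ARM_TOKEN = "<aad-bearer-token>";

async function setReplicaCount(replicaCount: number): Promise<void> {
  const url =
    `https://management.azure.com/subscriptions/${SUBSCRIPTION_ID}` +
    `/resourceGroups/${RESOURCE_GROUP}/providers/Microsoft.Search` +
    `/searchServices/${SERVICE_NAME}?api-version=2020-08-01`;
  const res = await fetch(url, {
    method: "PATCH",
    headers: {
      Authorization: `Bearer ${ARM_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ properties: { replicaCount } }),
  });
  if (!res.ok) {
    throw new Error(`Replica update failed: HTTP ${res.status}`);
  }
}
```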


    This posting is provided "AS IS" with no warranties, and confers no rights.

    Friday, May 13, 2016 12:23 AM
    Moderator