Understanding Application Insights availability test time window and alerting on 15 minutes of downtime

    Question

  • I'd like to create an Application Insights availability test that pings my web app from multiple locations and sends an alert if the pings have been failing continuously for 15 minutes. How should I configure the test frequency and the alert's "Alert failure time window" to achieve this?

    I'm asking because the obvious answer would be to set the test frequency to the lowest value (5 minutes) and the alert time window to 15 minutes. However, this causes the alert to fire instantly: I have configured 3 as the "Alert location threshold", and I can see from the AI logs that the alert is sent out immediately after the third failed availability request.

    Why is this? How can I achieve the behavior I'm looking for?

    Thank you in advance!

    Friday, January 26, 2018 3:02 PM

Answers

  • Thank you for reporting the confusing term "alert failure time window". The current alerting system looks back over the trailing 15 minutes (the alert failure time window) and counts the number of locations that have reported failures. As soon as that count reaches the configured location threshold, the criterion is met and the alert fires. The confusing part, therefore, is that the window is not how long a failing test must keep failing before the alert fires; it is the lookback window in which failures are counted.

    The "alert failure time window" is perhaps better understood with what it takes for the alert to resolve - if the number of locations reporting failures drops below the threshold in the last 15 minutes, then the alert auto-resolves.

    We will look into changing the term or adding documentation to clarify this. Meanwhile, to make the alerting more resilient, you have the following options:

    1) Use more test locations, and set the alert location threshold to at least roughly one third of the number of test locations.

    2) Use alerting on availability metrics, such as the aggregate availability percentage.

    3) If about 3-5 minutes of additional latency in receiving the alert is acceptable, you can use alerts on log queries, where you can express elaborate criteria for how many consecutive failures occurred over what period of time (see the sketch after this list).
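
    For example, a minimal sketch of such a log query, assuming the standard availabilityResults schema and a hypothetical test name:

        // Bucket the last 15 minutes into 5-minute bins and keep only the
        // bins that contained at least one failed ping. If all three bins
        // come back, the test has been failing for the whole window.
        availabilityResults
        | where timestamp > ago(15m)
        | where name == "my-availability-test"   // hypothetical test name
        | summarize failedPings = countif(toint(success) == 0) by bin(timestamp, 5m)
        | where failedPings > 0

    Paired with a log alert rule that evaluates every 5 minutes and fires when the number of returned rows is 3, this approximates "failing continuously for 15 minutes", at the cost of the extra evaluation latency mentioned above.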

    Tuesday, June 5, 2018 5:42 PM

All replies

  • Thanks, will look into that!
    Tuesday, June 5, 2018 11:29 PM
  • We're now evaluating log queries, though they don't really work as I'd expect; see this thread.

    Can you elaborate on what you mean by #2? How can such metrics be queried? I checked Azure Monitor but can't seem to find any option to query the percentage directly. I can calculate the percentage in a query as well (like this), though I'm not sure it won't have the same issues I mentioned in the linked thread.
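
    For reference, the kind of calculation I mean looks roughly like this (assuming the standard availabilityResults schema; this is just a sketch, not necessarily the query linked above):

        // Aggregate availability over the trailing 15 minutes, as the
        // percentage of passed pings across all test locations.
        availabilityResults
        | where timestamp > ago(15m)
        | summarize availabilityPct = 100.0 * countif(toint(success) == 1) / count()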

    Tuesday, August 28, 2018 1:33 PM
  • Things now seem to have improved: a log search alert fires reliably. If it queries the last 15 minutes every 5 minutes, it will correctly notice a downtime within 15-20 minutes. However, the alerts still permanently remain in the fired state.
    Friday, December 28, 2018 5:06 PM