Business as ‘unusual’

At a recent conference, an on-line retailer talked about a time when they had a problem with their website: the first they knew about it was when they discovered they’d taken no orders for 24 hours.  A few days later, an in-passing chat with Red Gate’s head of IT revealed that we’d recently had a similar problem with an external website, albeit not as catastrophic as not being able to sell anything!

In both cases, there was no obvious ‘problem’, nothing that triggered an alert; but clearly it wasn’t a case of “business as usual”, and in the first case they had a business-critical issue.

It’s clear that monitoring for divergence from “the norm” is very important – monitoring needs to be more than just looking for quantifiable ‘exceptions’ like a job failure, or a long-running query.  Indeed, SQL Response v1 already has some alerts for behaviour that diverges from the norm – for example, “Job duration unusual” and “CPU utilization unusual”.

So we’re thinking about what we’ll do to further support this in SQL Response v2.  There are two aspects to this, and we’d like to know your views on both of them:

1) What are the key activities that you’d want to know about, if they diverge from normal behavior?

For example, a database table might normally grow at 100 MB a day; so when growth slows to 1 MB a day, clearly this isn’t going to trigger a “Low Disk Space” alert, but it would trigger a “Database table growth unusual” or “Database growth unusual” alert.
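To make the idea concrete, here is a minimal sketch of one way such a check could work: compare the latest daily growth figure with the mean and standard deviation of recent history, and flag it if it is too far away in either direction. This is not how SQL Response does it; the function name `is_unusual`, the three-standard-deviation threshold and the sample figures are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_unusual(history, latest, threshold=3.0):
    """Return True if `latest` is more than `threshold` standard
    deviations away from the mean of `history` (hypothetical rule)."""
    if len(history) < 2:
        return False  # not enough data to define a norm yet
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is perfectly flat: any different value is a divergence.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Made-up daily growth figures for one table, in MB.
daily_growth_mb = [98, 103, 100, 97, 102, 101, 99]

print(is_unusual(daily_growth_mb, 1))    # True: growth has collapsed
print(is_unusual(daily_growth_mb, 100))  # False: business as usual
```

Because the check is two-sided, it catches growth that is unusually low as well as unusually high, which a fixed “Low Disk Space” threshold would not.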

And secondly…

2) How far do you want to be able to manage different “norms”?

As can be seen from the comments to another blog post, this can get very complicated very quickly.  It’s not possible to auto-configure how normal behavior varies over time, as this might mask problems – for example, a rogue process that runs periodically will itself come to look like “normal” behavior.

What’s normal during the working day may not be normal during the maintenance window at night; what’s normal on a Friday isn’t normal on a Sunday, and isn’t normal for the last Friday of the month (when payroll runs)… and all of this variability in the norm can itself vary by activity.
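One way to picture “different norms” is to keep a separate baseline per context instead of a single global one. The sketch below buckets samples by day of the week and by working hours versus off-hours. The bucketing rule, the class name `ContextualBaseline` and the CPU figures are hypothetical, and a real scheme would need more cases (the last Friday of the month, for instance).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


def context_of(ts):
    """Hypothetical bucketing: (weekday number, working/off-hours)."""
    period = "working" if 9 <= ts.hour < 18 else "off-hours"
    return (ts.weekday(), period)  # weekday(): Monday is 0


class ContextualBaseline:
    """Keeps one history of samples per context bucket."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.samples = defaultdict(list)

    def record(self, ts, value):
        self.samples[context_of(ts)].append(value)

    def is_unusual(self, ts, value):
        history = self.samples.get(context_of(ts), [])
        if len(history) < 2:
            return False  # no norm defined for this context yet
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.threshold


baseline = ContextualBaseline()
# Made-up CPU utilization (%): busy Friday mornings, quiet Sunday nights.
for day, cpu in [(2, 68), (9, 72), (16, 70)]:   # Fridays in October 2009
    baseline.record(datetime(2009, 10, day, 11, 0), cpu)
for day, cpu in [(4, 9), (11, 11), (18, 10)]:   # Sundays in October 2009
    baseline.record(datetime(2009, 10, day, 3, 0), cpu)

print(baseline.is_unusual(datetime(2009, 10, 23, 11, 0), 70))  # False
print(baseline.is_unusual(datetime(2009, 10, 25, 3, 0), 70))   # True
```

The same reading of 70% is normal on a Friday morning and unusual on a Sunday night, which is the kind of variability a single global norm cannot express.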

The flip-side of defining the ‘norm’ is to define exceptions – e.g., don’t alert on long-running queries between 1am and 2am.  This might make configuration easier, but risks not alerting to actual problems.
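As a rough sketch of the exception approach, the snippet below suppresses one alert type during a configured time window. The `SUPPRESSION_WINDOWS` table, the alert name and the function names are assumptions for illustration, not an existing SQL Response setting.

```python
from datetime import datetime, time

# Hypothetical configuration: alert type -> list of (start, end) windows.
SUPPRESSION_WINDOWS = {
    "Long-running query": [(time(1, 0), time(2, 0))],
}


def in_window(t, start, end):
    """True if time `t` falls in [start, end); handles windows that
    cross midnight, such as 23:00 to 01:00."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def should_alert(alert_type, ts):
    """False if `ts` falls inside a suppression window for this alert."""
    for start, end in SUPPRESSION_WINDOWS.get(alert_type, []):
        if in_window(ts.time(), start, end):
            return False
    return True


print(should_alert("Long-running query", datetime(2009, 10, 21, 1, 30)))  # False
print(should_alert("Long-running query", datetime(2009, 10, 21, 2, 0)))   # True
print(should_alert("Low Disk Space", datetime(2009, 10, 21, 1, 30)))      # True
```

This is simple to configure, but it shows the risk described above: a genuinely problematic query that runs at 1:30am is dropped silently, exactly like the expected maintenance ones.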

So there’s a trade-off here between configurability, usability and functionality.

Please post your thoughts here.

2 Comments

  1. Kev Riley
    Posted October 21, 2009 at 11:26 am

    Might be asking for too much here, but could the app ‘learn’ what was normal?

    After an initial running-in period, followed by user acceptance that the observed behaviour was ‘normal’, the app would have a benchmark to then refer to.

    Every time the behaviour strayed too far from the ‘norm’ then alerts would fire, but as well as clearing the alert, the user would have the option to merge or add this occurrence into the normal behaviour pattern.

    Sounds like futuristic AI monitoring!!

    I have struggled in the past to efficiently define the ‘job duration unusual’ alerts, resulting in many false-positives, so I know how difficult this can be. Conversely you don’t want to make configuring the alerts too complex, otherwise you put a barrier to usability.

    • Posted October 22, 2009 at 5:11 pm

      We’ve thought about having some form of heuristics in the system to learn about ‘typical’ behaviour, but we don’t foresee us doing this for v2. One of the problems is that, during the learning period, the alerting process would be very error-prone, and you’d get false alarms as well as missed issues. It could also become very confusing to understand what the definition of ‘norm’ is.

      So for v2 we’re discussing user-defined maintenance windows, and possibly other mechanisms to define exceptions. We’re curious as to how far people would need to go with this, bearing in mind the possible trade-off with additional configuration.

      Did you have any thoughts on the first question, about what things you would want to monitor for divergence from the norm?

      Thanks very much for the reply!
