New Alerts Pt 1 (of 3)

I’ve spent the last few weeks investigating the alerts that the new v2 product will offer, in particular the new alerts that are not currently available in v1.

SQL Response v2 will provide a number of Overviews for SQL Server Instances summarising activity on those servers. For example, a number of Perfmon counters (Buffer Cache Hit Ratio etc) will be shown, allowing the user to track changes in these values over time.

However, this does not mean that a DBA wants to be alerted every time one of these counters goes above or below a given threshold. Many are interesting pieces of information that help an investigation (and hence should be shown on an Overview), but should not trigger alerts.

Also, of course, a number of alerts should be generated by binary events, rather than on continuous values crossing thresholds. For example, the v1 alert “SQL Server unreachable” is precisely this.
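To make the distinction concrete, here is a minimal sketch in Python of the two kinds of alert. It is purely illustrative and not how SQL Response implements anything: the class names `ThresholdRule` and `EventRule`, and the limit of 90 used in the usage lines, are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ThresholdRule:
    """Alert when a continuous counter crosses a limit (hypothetical rule type)."""
    counter: str
    limit: float
    alert_when_below: bool = False  # some counters are a problem when they are LOW

    def check(self, value: float) -> Optional[str]:
        # Compare the sampled value against the limit in the configured direction.
        breached = value < self.limit if self.alert_when_below else value > self.limit
        if breached:
            direction = "below" if self.alert_when_below else "above"
            return f"{self.counter} is {direction} {self.limit}: {value}"
        return None


@dataclass
class EventRule:
    """Alert on a yes/no condition; no threshold is involved (hypothetical rule type)."""
    name: str
    is_triggered: Callable[[], bool]

    def check(self) -> Optional[str]:
        return self.name if self.is_triggered() else None


# Usage: the limit of 90.0 is an arbitrary value chosen for illustration.
cache_rule = ThresholdRule("Buffer Cache Hit Ratio", limit=90.0, alert_when_below=True)
print(cache_rule.check(85.0))

# The lambda stands in for a real connectivity check.
unreachable = EventRule("SQL Server unreachable", is_triggered=lambda: True)
print(unreachable.check())
```

The point of keeping the two rule types separate is that only the first needs a tunable limit; the second either fires or it doesn't.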

So we’re very interested in getting feedback on what you would like to be alerted on. To make this slightly easier, I’m splitting this post into three parts:

  • Perfmon counters (machine and SQL Server),
  • Non-perfmon SQL Server alerts (deadlocks, problem queries etc),
  • The rest! (Here, I’ll cover areas such as replication, mirroring, job management, auditing and so on).

First of all – Perfmon counters. Of course, there are a myriad of these, and alerting on all of them would overwhelm a DBA. In particular, many counters are heavily correlated: a single underlying problem can cause several Perfmon counters to spike at once.
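One way to picture the overlap problem is to fold spikes that occur close together in time into a single group, so that one underlying problem produces one notification rather than several. The sketch below is a hypothetical illustration only: the function name `group_spikes`, the 60-second window and the sample data are all assumptions, not part of the product.

```python
from typing import List, Tuple

Spike = Tuple[float, str]  # (timestamp in seconds, counter name)


def group_spikes(spikes: List[Spike], window: float = 60.0) -> List[List[Spike]]:
    """Group spikes that start within `window` seconds of the first spike in a group."""
    groups: List[List[Spike]] = []
    for spike in sorted(spikes):
        # Join the current group if this spike is close enough to the group's first spike.
        if groups and spike[0] - groups[-1][0][0] <= window:
            groups[-1].append(spike)
        else:
            groups.append([spike])
    return groups


# Usage with made-up sample data: two counters spike together, a third much later.
sample = [
    (0.0, "Avg. Disk Queue Length"),
    (5.0, "% Processor Time"),
    (300.0, "Pages/sec"),
]
for group in group_spikes(sample):
    print([name for _, name in group])
```

Grouping purely by time is a crude heuristic; it says nothing about which counter reflects the root cause, which is exactly the question this post is asking.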

Which values do you look at regularly? I’m interested in anything that relates either to SQL Server or to the underlying OS/hardware. There are the standard four categories, if it helps:

  • Processor,
  • Disk I/O,
  • Memory,
  • Network

Any other feedback would be most welcome, though! What I’ve found particularly interesting in this area is that many DBAs consider certain counters to be either less informative than is commonly believed or, worse, misleading. Are there any counters that you think are particularly misleading? What are the alternatives that really capture the problem on the server?

Thank you again for your help.


2 Comments

  1. Jonathan Allen
    Posted March 31, 2010 at 3:16 pm

    Not sure if it’s still relevant, but I was browsing the blog and noticed this hadn’t any comments. In order of preference I would use Disk I/O, then CPU, then Memory, then Network I/O – probably judging from experience as to where most problems have previously been located. I have never known a NIC to be an issue. Inadequate Memory got blamed by M$ years ago when there was a call in to them, but I have my reservations about whether it was actually the case. I have seen plenty of systems go slow when the CPU gets monopolised by a particular application (Disk I/O is usually high too) and have seen a server stop totally due to Disk I/O when the software mirror broke (CPU was high too).

    Hope the down tools week has been good.

    • Ben Rees
      Posted March 31, 2010 at 3:33 pm

      Cheers! Yes, the commenting for this topic was a little… slow.

      And yes, down-tools is going really well. There are a few SQL monitoring related activities going on – we’ll then have a process to see what should go into the final product…
