Our first monitoring dashboard entry!

We’re delighted to announce that we’ve received our first entry to the design a dashboard competition. It’s a magnificent effort from Merrill Aldrich, who sent us a fully functional .NET application to demonstrate his ideas.

SQL Server Monitoring Dashboard by Merrill Aldrich

As can be seen, the UI neatly displays the health state of the SQL instances. In his explanatory notes, Merrill stated that

“most third-party monitoring solutions try to monitor everything and gather way too much information, when the most important things to watch can be gleaned from a few key performance data points”

He lists the following as the key data points to monitor:

    CPU Utilization
    Page Life Expectancy and Pending Memory Grants (i.e. Memory Pressure)
    Disk Latency and Disk Free Space
    Duration and number of Blocked Processes
    Failed Agent jobs
    Missing Backups
    Errors from the log

Merrill says he’s happy to drill into more detail when necessary but these are the key indicators. Do you agree?

Although he humbly suggests it’s ‘barely a mock-up’, we immediately installed his application and added 3 test machines to be monitored…

SQL Server Monitoring Dashboard by Merrill Aldrich (2)

(There would also be a text area to display warnings from all monitored instances, such as failed agent jobs or errors from the log.)

Thanks Merrill, we’re really impressed by the calibre of this submission! We like how the colour of the entire graph changes when the current level rises above a threshold value. The UI is uncluttered, yet it drew our eyes to the issues as soon as they occurred. Thanks for sharing this with us.
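To make the colour-change idea concrete, here is a minimal sketch of a threshold check that could drive it. It is not taken from Merrill's application: the `Threshold` type, the `colour_for` function, the metric names and every limit shown are illustrative assumptions that would need tuning for each instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical warning/critical limits for one metric."""
    warning: float
    critical: float
    higher_is_worse: bool = True


def colour_for(value: float, t: Threshold) -> str:
    """Return 'green', 'amber' or 'red' for the current reading."""
    warning, critical = t.warning, t.critical
    if not t.higher_is_worse:
        # Negate everything so the same comparisons work for metrics such
        # as Page Life Expectancy, where a LOW value signals trouble.
        value, warning, critical = -value, -warning, -critical
    if value >= critical:
        return "red"
    if value >= warning:
        return "amber"
    return "green"


# Assumed limits, for illustration only.
THRESHOLDS = {
    "cpu_percent": Threshold(warning=75, critical=90),
    "page_life_expectancy_s": Threshold(
        warning=1000, critical=300, higher_is_worse=False
    ),
}

print(colour_for(80, THRESHOLDS["cpu_percent"]))
print(colour_for(200, THRESHOLDS["page_life_expectancy_s"]))
```

Keeping the limits in one table, separate from the comparison logic, is what would let each DBA adjust them without touching the display code.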

So now tell us what you think. What elements of this design appeal? Is there anything missing that you couldn’t live without? Post your comments, or send us your own design (sketch, scribble, scrawl, or functional application) and we’ll post it for discussion.


4 Comments

  1. Posted September 29, 2009 at 8:40 pm | Permalink

    One thing I miss in a monitoring solution is telling it about my disk configuration. Since it will have no way to know how many disks, their speed, and what RAID config they are in, I would like to tell it. So, disk queue lengths would be divided by number of disks based on RAID type. That way the colours for good, warning, and bad have meaning to me. I see so much red in current monitoring software that I just ignore most all of it.
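The per-disk normalisation suggested in this comment could be sketched roughly as follows. The mapping from RAID level to "effective" disks is an assumption chosen for illustration (it counts only the disks that hold distinct data), and the function names are hypothetical; a real tool would presumably let the administrator override the figure.

```python
def effective_disks(raid_level: str, disk_count: int) -> int:
    """Assumed number of disks sharing the I/O queue for a RAID level."""
    rules = {
        "RAID0": lambda n: n,        # striping only: every disk counts
        "RAID1": lambda n: n // 2,   # mirrored pairs
        "RAID5": lambda n: n - 1,    # one disk's worth of parity
        "RAID10": lambda n: n // 2,  # striped mirrors
    }
    if raid_level not in rules:
        raise ValueError(f"unknown RAID level: {raid_level}")
    # Never return less than one disk, to avoid dividing by zero.
    return max(1, rules[raid_level](disk_count))


def queue_length_per_disk(
    avg_queue_length: float, raid_level: str, disk_count: int
) -> float:
    """Divide Avg. Disk Queue Length by the effective disk count."""
    return avg_queue_length / effective_disks(raid_level, disk_count)


# A queue length of 12 on an 8-disk RAID 10 array is 3.0 per disk.
print(queue_length_per_disk(12.0, "RAID10", 8))
```

The resulting per-disk figure, rather than the raw counter, is what the good, warning and bad colours would then be compared against.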

  2. Priya.Sinha
    Posted September 30, 2009 at 11:28 am | Permalink

    Thanks a lot for your helpful comments. Just to confirm … for disk you would like to see the following information:

    - Type of RAID
    - Disk speed (RPM)
    - Ratio of Physical Disk: Avg. Disk Queue Length to the number of disks

    Is there any other important information you would like to see? If so, please let us know.

  3. jonathan allen
    Posted October 1, 2009 at 11:29 am | Permalink

    Chuck has a good point about seeing red all the time and becoming numb to it all.

    This, in my view, points to either a chronic (i.e. failing) architecture or alert thresholds that are set inappropriately. The alerts need to bring attention to deviations from the norm in order to be effective in directing your attention to the most pressing issues. Being able to configure the alerts to represent your needs is therefore vital for the software to prove beneficial, as is being able to slice and dice the various measures and metrics so that any significant variation can be investigated and, where necessary, adjusted or otherwise countered.

    With v1 there are cycles in the job histories that cannot be allowed for, which means I have scheduled jobs that are permanently in High Alert mode because it doesn’t cope with my requirements. Short details: we have replication running hourly, so lots of hourly SQL Agent jobs. We have 500 staff working mainly 9-5, so the jobs have a ‘normal’ daily load: medium at 9, rising to high in the late morning, a lunchtime dip, the biggest peak of the day mid-afternoon, and a trail-off to the end of the day. The execution time tracks this activity, as there is more work to do. However, overnight we have DML scripts that increment records for various reasons, and the data gets a huge change. Replication jobs execute to track these changes and the job time goes through the roof. The result: almost every execution of these jobs is out of the “last 10 executions” average.

    A solution would be to allow different schedules: I would like to know if the overnight job varies from its norm, and I would like to know if the daily job executions differ from theirs. I could create two jobs on the server, but this creates an increased SQL admin overhead just to get the monitoring tool to behave; in essence, the tail wagging the dog.

    It’s a tricky issue to handle, and the team that solves the whole monitoring/alerting issue will have a gold standard product to be proud of.

    Personally I would have an interest in the RAID config, HDD spindle speed etc. that Chuck mentions, but probably only from an audit point of view. The speed would indicate thrashing and may be an indicator of fragmentation or RAID failing over after a HDD failure, but they wouldn’t be the key indicators I would run to if there was a slowdown.

    HDD space used or rather changes in % free would be useful – an early warning that a log file is growing unchecked etc.

    Jonathan
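One way to picture the "different schedules" idea from this comment is to keep a separate rolling history for each time window, so that overnight runs are compared only with other overnight runs. The sketch below is an assumption-laden illustration, not a description of any existing product: the 09:00 to 17:00 split, the history of 10 runs, the 1.5x tolerance and all names are hypothetical.

```python
from collections import defaultdict, deque
from datetime import datetime
from statistics import mean


def window_for(start: datetime) -> str:
    """Assumed split: 09:00-17:00 is 'day', everything else 'overnight'."""
    return "day" if 9 <= start.hour < 17 else "overnight"


class JobBaseline:
    """Rolling per-window history of job durations (hypothetical helper)."""

    def __init__(self, history_size: int = 10, tolerance: float = 1.5):
        self.tolerance = tolerance
        # One bounded history per (job name, window) pair.
        self._history = defaultdict(lambda: deque(maxlen=history_size))

    def record(self, job: str, start: datetime, duration_s: float) -> bool:
        """Store a run; return True if it deviates from its window's norm."""
        runs = self._history[(job, window_for(start))]
        unusual = False
        if runs:
            avg = mean(runs)
            low, high = avg / self.tolerance, avg * self.tolerance
            unusual = not (low <= duration_s <= high)
        runs.append(duration_s)
        return unusual


baseline = JobBaseline()
baseline.record("replication", datetime(2009, 10, 1, 10), 60.0)
baseline.record("replication", datetime(2009, 10, 1, 23), 900.0)
# A 950 s overnight run is judged against the overnight history only.
print(baseline.record("replication", datetime(2009, 10, 2, 0), 950.0))
```

With a single shared history, the 950-second run would be far outside the daytime average; with per-window histories it is treated as normal for its window, which avoids the need to create duplicate jobs on the server.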

  4. Colin Millerchip
    Posted October 2, 2009 at 3:05 pm | Permalink

    Hi Jonathan,

    Interesting point you make about “deviations from the norm”. We’ve been giving some thought to this already, and I’m working on a blog post to explore this further and see what people’s thoughts are on it.

    Thanks,

    Colin.
