What metrics do you use? Part two: Disk

This is part two of our series of blog posts about which counters you use when monitoring your servers. Click here for part one, about memory counters.

Suggested Counters

  • Disk space available

    The amount of disk space available, in GB, per logical disk.

  • Logical disk idle time

    The percentage of elapsed time that the disk is not servicing any read or write requests.

  • Logical disk average read latency

    Average time, in seconds, of a read of data from the disk.

  • Logical disk average write latency

    Average time, in seconds, of a write of data to the disk.

  • Logical disk transfers/sec

    The rate of read and write operations on a disk.
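To make the definitions above concrete, here is a minimal Python sketch of how the latency, transfer-rate and idle-time values could be derived from two snapshots of cumulative per-disk counters. The `DiskSample` type and its field names are assumptions made for illustration only; they are not how any particular monitoring tool or operating system exposes these counters.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """Hypothetical snapshot of cumulative counters for one logical disk."""
    timestamp: float      # when the snapshot was taken, in seconds
    reads: int            # read operations completed so far
    writes: int           # write operations completed so far
    read_seconds: float   # total time spent servicing reads so far
    write_seconds: float  # total time spent servicing writes so far
    busy_seconds: float   # total time the disk was servicing any request


def derive_metrics(prev: DiskSample, curr: DiskSample) -> dict:
    """Turn two cumulative snapshots into per-interval metrics."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")

    reads = curr.reads - prev.reads
    writes = curr.writes - prev.writes
    read_time = curr.read_seconds - prev.read_seconds
    write_time = curr.write_seconds - prev.write_seconds
    busy = curr.busy_seconds - prev.busy_seconds

    return {
        # Average seconds per operation; 0.0 when no operations occurred.
        "avg_read_latency_s": read_time / reads if reads else 0.0,
        "avg_write_latency_s": write_time / writes if writes else 0.0,
        # Combined read and write operations per second.
        "transfers_per_sec": (reads + writes) / elapsed,
        # Share of the interval in which the disk serviced no requests.
        "idle_percent": max(0.0, 100.0 * (1.0 - busy / elapsed)),
    }


if __name__ == "__main__":
    before = DiskSample(0.0, 0, 0, 0.0, 0.0, 0.0)
    after = DiskSample(10.0, 100, 50, 0.5, 1.0, 2.0)
    print(derive_metrics(before, after))
```

Working from differences between snapshots means each value describes only the sampling interval, which is what you would want to plot or alert on.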

Note: all these counters are for logical disks only. We are not currently proposing to capture any metrics for the physical disk drive. Is this something you’d need to see?

Are these counters useful to you? Are there any other disk counters that you look at that we’ve not mentioned here? Let us know!

10 Comments

  1. Tom
    Posted September 14, 2009 at 3:52 pm | Permalink

    Here’s a quick mock-up of how these metrics could be represented in the UI.

    Click on the thumbnail below to see the full image:

    1_disk.png

  2. Jonathan Allen
    Posted September 16, 2009 at 2:11 pm | Permalink

    Kind of related to Disk use is replication (OK, not that related but the logs are stored on a share before transfer … ) and that got me thinking about HA metrics.

    Can we have something for mirroring and replication stats please? Something to let me set thresholds for Send Queue, Redo Queue, Average Delay, Time Behind etc and check they are behaving would be most useful. Also alerts for replication jobs that are on the wobble would make the admin life easier too.

    Jonathan

  3. Mark Allison
    Posted September 17, 2009 at 10:19 am | Permalink

    Would be nice to also have the option of displaying disk space available as a percentage, and to be able to alert on percentage free.

    I also monitor Avg Disk Read Queue Length and Avg Disk Write Queue Length – although with our SAN attached disks this is becoming nonsensical to monitor as sometimes we see queue lengths in the thousands – not very useful. I do find this metric useful on servers with directly attached storage.

    Perhaps also have one called % busy, or something very high level, so that if something is critical and budget is required a screenshot can be sent to management, e.g. “Look, Bob, Disk D on our trading system is busy 95% of the time – we need to do something about our disk I/O.” Other tools have a metric called Disk Load, and it’s measured as a percentage. I don’t use this myself, but I do publish it in reports to management who don’t have a clue what a disk queue is.

  4. Merrill Aldrich
    Posted September 18, 2009 at 11:31 pm | Permalink

    I find the latency counters are the most useful on a dashboard. I want to know about slowness – that’s what would matter most.

  5. Tom
    Posted September 21, 2009 at 1:20 pm | Permalink

    Thank you all for your great feedback!

    Jonathan

    Replication and mirroring alerts are definitely something we’re planning to implement.

    We’ve not got round to thinking about how we’ll cater for them in the monitoring side of the UI yet though.

    I think mirroring will be easier than replication!

    Mark

    We’ll definitely have a disk free type alert. There are mixed feelings about whether it should be a percentage or an absolute value. Maybe we’ll allow both!

    Sounds like Disk Load is something worth us looking into.

    Merrill

    That’s interesting to hear. Would the read and write latencies we’ve listed be sufficient or are there any other metrics that would help?

    Thanks Again,

    Tom

  6. Merrill Aldrich
    Posted September 21, 2009 at 5:51 pm | Permalink

    The ones listed would work.

  7. PDinCA
    Posted September 22, 2009 at 1:07 am | Permalink

    Not “counter-based” but:

    Can you consider a “Free space life expectancy” for each disk that’s based on a user-defined duration from your historical disk usage store? If I have a 90% used disk but it’s highly stable, I don’t need a “red light”. If SR2 considers that, based on usage, I’ll be in trouble in X, Y, Z days, then flash me the X, Y, Z-level yellow, orange or red light, please…

    If fragmentation exceeds an acceptable level, where the file-type mix makes it relevant, flag it for me.

    Please keep the “noise level” low, again due to lack of DBA, so the hit-list of things to attend to is abundantly clear.

    • Tom
      Posted September 29, 2009 at 9:50 am | Permalink

      We have been discussing a type of alert that would fire when your disk space, based on current growth rates, looks like it will run out within a fixed time period.

      e.g. “In 7 days disk space will reach 90%”

      With continuous alerts and multi-threshold alerting, we should also cater for escalating the warning.

      It’s a mission objective for Response to only fire useful alerts, and we’ll do everything we can to achieve this!
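The growth-based alert discussed in this thread can be sketched in Python as follows. This is only an illustration under stated assumptions: it fits a least-squares straight line to a history of (day, percent used) readings and projects when that line crosses a threshold. The function names, the 90% default and the 7/14/30-day escalation levels are all hypothetical, and a real product may well use a different forecasting method.

```python
from typing import Optional, Sequence, Tuple


def days_until_threshold(
    history: Sequence[Tuple[float, float]],
    threshold_pct: float = 90.0,
) -> Optional[float]:
    """Estimate days from the latest reading until usage reaches threshold_pct.

    history holds (day, used_percent) pairs, oldest first.
    Returns None when usage is stable or shrinking, or when there is too
    little data to fit a line.
    """
    n = len(history)
    if n < 2:
        return None

    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in history)
    if sxx == 0:
        return None  # all readings taken on the same day

    # Least-squares slope: percentage points of growth per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in history) / sxx
    if slope <= 0:
        return None  # a full but stable disk raises no forecast alert

    intercept = mean_y - slope * mean_x
    crossing_day = (threshold_pct - intercept) / slope
    return max(0.0, crossing_day - history[-1][0])


def severity(days_left: Optional[float],
             levels=((7, "red"), (14, "orange"), (30, "yellow"))) -> Optional[str]:
    """Map a forecast to an escalation level (hypothetical day limits)."""
    if days_left is None:
        return None
    for limit, colour in levels:
        if days_left <= limit:
            return colour
    return None


if __name__ == "__main__":
    growing = [(0, 76.0), (1, 78.0), (2, 80.0), (3, 82.0)]
    stable = [(0, 90.0), (1, 90.0), (2, 90.0)]
    print(days_until_threshold(growing), severity(days_until_threshold(growing)))
    print(days_until_threshold(stable), severity(days_until_threshold(stable)))
```

Returning `None` for a flat or shrinking trend is what keeps a 90% full but stable disk from raising a warning, which is the low-noise behaviour asked for above.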

  8. John
    Posted October 1, 2009 at 4:07 pm | Permalink

    On your disk free alert levels: “We’ll definitely have a disk free type alert. There’s mixed feelings whether it should be a percentage or absolute value – Maybe we’ll allow both!” Make sure I can set it to different levels for each disk. I may only need a small amount on my system drives but want to have a different level for my SQL data disks or logs.

    • Tom.Randle
      Posted October 1, 2009 at 4:48 pm | Permalink

      We’ll definitely allow you to configure on a per disk basis. What would be really nice though is if we could set the thresholds automatically based on the type of use.
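As a rough illustration of per-disk thresholds that accept either an absolute value or a percentage, here is a minimal Python sketch. The `FreeSpaceThreshold` type, the `breaches` and `check_disk` helpers, and the example drive letters and limits are assumptions made up for this example; only `shutil.disk_usage` is a real standard-library call.

```python
import shutil
from dataclasses import dataclass
from typing import Optional

GB = 1024 ** 3


@dataclass
class FreeSpaceThreshold:
    """Hypothetical per-disk limits; either or both may be set."""
    min_free_gb: Optional[float] = None
    min_free_percent: Optional[float] = None


def breaches(total_bytes: int, free_bytes: int,
             threshold: FreeSpaceThreshold) -> bool:
    """Return True if free space is below any configured limit."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_gb = free_bytes / GB
    free_percent = 100.0 * free_bytes / total_bytes
    if threshold.min_free_gb is not None and free_gb < threshold.min_free_gb:
        return True
    if (threshold.min_free_percent is not None
            and free_percent < threshold.min_free_percent):
        return True
    return False


# Hypothetical configuration: a small absolute limit for the system drive,
# percentage limits for data and log disks.
THRESHOLDS = {
    "C:\\": FreeSpaceThreshold(min_free_gb=5),
    "D:\\": FreeSpaceThreshold(min_free_percent=20),
    "L:\\": FreeSpaceThreshold(min_free_gb=10, min_free_percent=15),
}


def check_disk(path: str, threshold: FreeSpaceThreshold) -> bool:
    """Check a mounted path against its threshold using live figures."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return breaches(usage.total, usage.free, threshold)
```

Keeping the comparison in `breaches` separate from the live lookup in `check_disk` makes the rule easy to test with fixed numbers, and leaves room for the limits to be filled in automatically by disk role later.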
