USE Method (Utilization, Saturation, Errors)
Introduction
The USE Method is a powerful methodology for analyzing system performance and troubleshooting issues in computer systems. Developed by Brendan Gregg, a renowned performance engineer, this approach provides a systematic way to identify performance bottlenecks by examining three key aspects of every resource: Utilization, Saturation, and Errors.
In this guide, we'll explore how to implement the USE Method within Grafana to create effective monitoring dashboards that help quickly pinpoint system problems.
What is the USE Method?
The USE Method directs you to examine:
- Utilization: The percentage of time a resource is busy servicing work
- Saturation: The degree to which a resource has extra work queued up
- Errors: Count of error events
For every major system resource, you should be able to answer these three questions:
- What is the resource's utilization?
- What is the resource's saturation?
- Are there any errors related to this resource?
Key Resources to Monitor
The USE Method typically applies to these major hardware resources:
- CPU
- Memory
- Network Interfaces
- Storage I/O
- Storage Capacity
- GPU (if applicable)
Let's see how to apply this method using Grafana.
Implementing USE Method in Grafana
Prerequisites
To implement the USE Method in Grafana, you'll need:
- A running Grafana instance
- Data sources configured (like Prometheus, InfluxDB, etc.)
- Node exporter or similar agents running on target systems
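Before building panels, it helps to confirm that Prometheus is actually scraping your node exporter targets. A quick sanity-check query, assuming the scrape job is named node (adjust the job label to match your configuration):
up{job="node"}
A value of 1 means the target is up and being scraped; 0 means the scrape is failing.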
Creating a USE Dashboard
Step 1: Set Up Dashboard Structure
Start by creating a new dashboard in Grafana and organizing it into row sections for each resource type:
CPU → Utilization, Saturation, Errors
Memory → Utilization, Saturation, Errors
Disk → Utilization, Saturation, Errors
Network → Utilization, Saturation, Errors
Step 2: Configure CPU Panels
Let's start with CPU monitoring panels:
CPU Utilization
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
This query shows the percentage of CPU time spent in non-idle modes, which is the CPU utilization.
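When utilization is high, it is often useful to break the busy time down by CPU mode. A small variant of the same metric (no extra assumptions beyond node exporter being present):
avg by (instance, mode) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100
This shows the average percentage of CPU time spent in each non-idle mode, which hints at whether the load is application code (user), kernel work (system), or waiting on I/O (iowait).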
CPU Saturation
node_load1 / on(instance) count by(instance) (node_cpu_seconds_total{mode="idle"})
This divides the 1-minute load average by the number of CPUs; sustained values above 1 mean there are more runnable tasks than CPUs, which indicates saturation.
CPU Errors
increase(node_context_switches_total[5m])
Context switches are not errors in themselves, but an unusually high rate can point to CPU scheduling problems, so treat this panel as a proxy signal.
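If you want something closer to a real CPU fault signal, thermal throttling events are worth charting where available. This is a sketch that assumes node_exporter's thermal_throttle collector is enabled and exposes node_cpu_core_throttles_total on your hosts; verify the metric exists before relying on it:
increase(node_cpu_core_throttles_total[5m])
Sustained throttling usually points to cooling or power delivery problems rather than workload behavior.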
Step 3: Configure Memory Panels
Memory Utilization
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
This shows the percentage of used memory.
Memory Saturation
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100
and node_memory_SwapTotal_bytes > 0
This measures swap usage as a percentage on hosts that have swap configured; heavy swap use indicates memory saturation.
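Swap space in use is a fairly slow-moving signal; active swapping is usually a clearer sign of memory saturation. A sketch that assumes the default vmstat collector is enabled:
rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m])
This shows pages swapped in and out per second; sustained non-zero values mean the system is actively paging to disk.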
Memory Errors
increase(node_vmstat_pgmajfault[5m])
Major page faults are not errors in the strict sense, but a sustained high rate means the system is constantly reading pages back in from disk and is under memory pressure.
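For a more direct memory error signal, out-of-memory kills can be tracked. This assumes a kernel that exposes oom_kill in /proc/vmstat (4.13 or newer) and a node_exporter version that exports it; check that the metric exists in your environment:
increase(node_vmstat_oom_kill[5m])
Any non-zero value here means the kernel killed a process because memory ran out.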
Step 4: Configure Disk I/O Panels
Disk Utilization
rate(node_disk_io_time_seconds_total[5m]) * 100
This shows the percentage of time the disk was busy.
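On hosts with many block devices it is common to restrict the panel to the disks you actually care about. A hedged example; the device regex is only an illustration and should be adapted to your device naming:
rate(node_disk_io_time_seconds_total{device=~"sd[a-z]+|nvme[0-9]+n[0-9]+"}[5m]) * 100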
Disk Saturation
rate(node_disk_io_time_weighted_seconds_total[5m])
This metric includes queue waiting time, indicating saturation.
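Average request latency is a useful companion signal, since it rises sharply once a device saturates. A sketch using the per-device read counters (writes are analogous with the corresponding write metrics):
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])
This is the average time per read over the window; compare it against the device's normal baseline.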
Disk Errors
node_exporter does not expose a dedicated disk error counter on Linux. For genuine error visibility, collect SMART health data (for example with the smartmon.sh textfile-collector script from the node_exporter repository) or watch the kernel log for I/O errors, and treat sudden latency spikes in the panels above as an early warning sign.
Step 5: Configure Network Panels
Network Utilization
(sum by (instance) (rate(node_network_receive_bytes_total{device!="lo"}[5m])) +
sum by (instance) (rate(node_network_transmit_bytes_total{device!="lo"}[5m]))) /
sum by (instance) (node_network_speed_bytes{device!="lo"}) * 100
This shows combined receive and transmit traffic as a percentage of the total available bandwidth across non-loopback interfaces.
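The aggregated form above can hide one saturated interface on a host with several NICs. A per-interface variant, with the caveat that node_network_speed_bytes is only meaningful for interfaces that report a link speed (virtual interfaces often do not):
(rate(node_network_receive_bytes_total{device!="lo"}[5m]) +
rate(node_network_transmit_bytes_total{device!="lo"}[5m]))
/ node_network_speed_bytes{device!="lo"} * 100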
Network Saturation
sum by (instance) (rate(node_network_receive_drop_total[5m]) +
rate(node_network_transmit_drop_total[5m]))
Dropped packets indicate network saturation.
Network Errors
sum by (instance) (rate(node_network_receive_errs_total[5m]) +
rate(node_network_transmit_errs_total[5m]))
This counts network interface errors.
Visualizing the USE Method
To make your USE dashboard intuitive, consider using these visualization techniques:
Traffic Light Indicators
For each metric, create a threshold to show status:
- Green: Normal operation (e.g., utilization < 70%)
- Yellow: Warning (e.g., utilization 70-90%)
- Red: Critical (e.g., utilization > 90%)
Time Series with Thresholds
Use time series panels with colored threshold lines to show trends and indicate when metrics cross into warning or critical territory.
Practical Example: Troubleshooting with USE
Let's walk through a typical scenario using our USE dashboard:
- Alert Triggered: A service is responding slowly
- USE Dashboard Check:
- CPU Utilization is 95% (Red)
- CPU Saturation shows load average of 3.5 on a 4-core system (Yellow)
- No significant errors
- Diagnosis: The system is CPU-bound
- Investigation: Use additional tools like top or perf to identify the specific processes consuming CPU
- Resolution: Scale up the instance or optimize the code
Best Practices
- Set Appropriate Thresholds: Different systems have different baseline behavior
- Use Relative Time Ranges: Compare current metrics to historical patterns
- Combine with RED Method: For service-level monitoring, combine USE (resources) with RED (Rate, Errors, Duration); example queries follow this list
- Add Annotations: Mark deployments and configuration changes on your graphs
- Create Resource-Specific Dashboards: Have deep-dive dashboards for each resource type
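As a sketch of what the RED side can look like, here are example queries that assume a service instrumented with a typical Prometheus client library, exposing a request counter named http_requests_total (with a status label) and a latency histogram named http_request_duration_seconds; substitute your application's real metric names:
sum(rate(http_requests_total[5m]))
sum(rate(http_requests_total{status=~"5.."}[5m]))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
These cover request rate, error rate, and 95th-percentile duration; placing them next to the USE panels links service-level symptoms to resource-level causes.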
Common Pitfalls
- Ignoring Baseline Performance: Always know what "normal" looks like
- Alert Fatigue: Don't set thresholds too low
- Missing Context: Include metadata like instance type and configuration details
- Over-reliance on Single Metrics: Always consider the three aspects together (USE)
Example USE Dashboard in Grafana
Here's how to set up a basic CPU panel group that follows the USE method:
const dashboard = {
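// Trimmed example: datasource references and panel ids are omitted for brevity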
"panels": [
{
"title": "CPU Utilization",
"type": "gauge",
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 0 },
"options": {
"thresholds": {
"steps": [
{ "value": null, "color": "green" },
{ "value": 70, "color": "yellow" },
{ "value": 85, "color": "red" }
]
}
},
"targets": [
{
"expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"refId": "A"
}
]
},
{
"title": "CPU Saturation (Load Average / CPU Count)",
"type": "timeseries",
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 0 },
"options": {
"thresholds": {
"steps": [
{ "value": null, "color": "green" },
{ "value": 0.7, "color": "yellow" },
{ "value": 1, "color": "red" }
]
}
},
"targets": [
{
"expr": "node_load1 / on(instance) count by(instance) (node_cpu_seconds_total{mode=\"idle\"})",
"refId": "A"
}
]
},
{
"title": "CPU Errors (Context Switches)",
"type": "timeseries",
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 0 },
"targets": [
{
"expr": "increase(node_context_switches_total[5m])",
"refId": "A"
}
]
}
]
}
Summary
The USE Method provides a systematic approach to monitoring and troubleshooting system performance issues by examining Utilization, Saturation, and Errors for each system resource.
By implementing this methodology in Grafana, you can create comprehensive dashboards that help quickly identify bottlenecks and system problems before they affect your users.
Remember these key points:
- Check Utilization, Saturation, and Errors for every resource
- Set appropriate thresholds based on your system's characteristics
- Combine multiple metrics for a complete picture
- Use visual cues to highlight problems quickly
Exercises
- Create a basic USE dashboard for your own system using Grafana and Prometheus
- Identify which resources are most utilized during peak load periods
- Set up alerts based on the USE metrics with appropriate thresholds (a starting-point expression follows this list)
- Perform a mock troubleshooting exercise using only the USE dashboard
- Extend your dashboard to include application-specific resources beyond hardware
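As a starting point for the alerting exercise, here is one possible alert expression for CPU saturation, reusing the query from earlier; the 0.9 threshold is only an example and should be tuned to your workload:
node_load1 / on(instance) count by(instance) (node_cpu_seconds_total{mode="idle"}) > 0.9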