CPU Utilization

Introduction

CPU utilization is a fundamental metric in system performance monitoring that measures the percentage of time a CPU spends executing non-idle tasks. Understanding CPU utilization is crucial for identifying performance bottlenecks, capacity planning, and ensuring the optimal operation of your applications and infrastructure.

In this guide, we'll explore how to effectively monitor and visualize CPU utilization using Grafana, a popular open-source analytics and monitoring platform. Whether you're managing a small application or a large-scale distributed system, mastering CPU utilization monitoring in Grafana will help you maintain peak performance and quickly address potential issues.

Understanding CPU Utilization

What is CPU Utilization?

CPU utilization represents the percentage of time that a CPU is busy processing instructions, as opposed to waiting (idle state). It's typically expressed as a percentage from 0% (completely idle) to 100% (fully utilized).

CPU utilization can be broken down into several components:

User time: CPU time spent running user-space processes
System time: CPU time spent on kernel operations
I/O wait: Time spent waiting for I/O operations to complete
Idle time: Time when the CPU is not processing any instructions
Nice time: Time spent running processes with adjusted priorities
Interrupt time: Time spent handling hardware interrupts

Why Monitor CPU Utilization?

Monitoring CPU utilization helps you:

Identify performance bottlenecks: Consistently high CPU usage may indicate applications that need optimization
Plan capacity: Understanding usage patterns helps determine when to scale resources
Detect anomalies: Unexpected CPU spikes may signal problems like infinite loops or resource leaks
Optimize costs: Proper resource allocation ensures you're not over-provisioning infrastructure
Track application health: Changes in CPU patterns can indicate application issues before they impact users

Setting Up CPU Utilization Monitoring in Grafana

Prerequisites

Before proceeding, ensure you have:

A running Grafana instance (version 8.0 or higher recommended)
A data source configured (Prometheus, InfluxDB, or other supported time-series database)
Metrics collection set up for your systems (via node_exporter, Telegraf, or similar tools)

Creating a Basic CPU Utilization Dashboard

Let's create a simple yet effective CPU utilization dashboard in Grafana:

Log in to your Grafana instance
Click on the "+" icon in the sidebar and select "Dashboard"
Click "Add new panel"
Configure the panel to display CPU utilization

For Prometheus Data Source (with node_exporter)

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

This query calculates the percentage of CPU utilization by subtracting the idle percentage from 100%.

For InfluxDB (with Telegraf)

SELECT mean("usage_idle")
FROM "cpu"
WHERE $timeFilter
GROUP BY time($__interval), "host"

To get the utilization percentage, you'd use:

SELECT 100 - mean("usage_idle")
FROM "cpu"
WHERE $timeFilter
GROUP BY time($__interval), "host"

Visualizing Different CPU Metrics

For a more comprehensive view, create separate panels for different CPU metrics:

User CPU Time

For Prometheus:

rate(node_cpu_seconds_total{mode="user"}[5m]) * 100

System CPU Time

For Prometheus:

rate(node_cpu_seconds_total{mode="system"}[5m]) * 100

I/O Wait Time

For Prometheus:

rate(node_cpu_seconds_total{mode="iowait"}[5m]) * 100

Advanced CPU Utilization Visualizations

Multi-Core CPU Visualization

Modern systems often have multiple CPU cores. To visualize utilization across all cores:

avg by (mode, cpu) (irate(node_cpu_seconds_total[5m])) * 100

This query can be used with a heatmap visualization to create a color-coded map of all CPU cores and their states.

CPU Saturation Alerts

Set up alerts for when CPU utilization exceeds thresholds:

Edit your CPU utilization panel
Go to the "Alert" tab
Configure conditions like:
- "Alert when CPU utilization is above 80% for 5 minutes"
- "Alert when CPU utilization increases by more than 30% in 1 minute"

For example, a Prometheus alert query might look like:

avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) * 100 > 80

CPU Usage Patterns and Anomaly Detection

Grafana can help identify unusual CPU patterns:

Create a panel with historical CPU data
Add a moving average to identify the baseline
Use Grafana's outlier detection to highlight anomalies

Practical Examples

Example 1: Diagnosing High CPU Usage

Let's walk through a typical scenario of diagnosing high CPU usage:

Observe the dashboard: You notice CPU utilization has spiked to 95% on one of your servers
Check specific metrics: Look at the breakdown between user, system, and wait times
Correlate with other metrics: Check if memory usage, network traffic, or disk I/O also show anomalies
Identify processes: Use a top processes panel to identify which applications are consuming CPU
Take action: Based on findings, you might restart a service, optimize code, or scale resources

Example 2: Capacity Planning

Here's how to use CPU utilization data for capacity planning:

Collect historical data: Monitor CPU utilization over weeks or months
Analyze patterns: Identify peak usage times and trends
Create projections: Use Grafana's prediction functions to forecast future usage
Set thresholds: Establish when to add resources based on projected growth
Document findings: Create reports showing current usage and future needs

To visualize CPU trends in Grafana, you can use the following query with a trend line:

avg_over_time(cpu_usage_percent[$__interval]) and predict_linear(cpu_usage_percent[12h], 86400)

Creating a Complete CPU Dashboard

Let's combine everything we've learned to create a comprehensive CPU dashboard:

Dashboard Structure

Overview row:
- CPU Utilization gauge (current value)
- CPU Usage trend (last 24 hours)
- Core count and basic system info
Detailed Metrics row:
- Per-core utilization heatmap
- User/System/I/O Wait breakdowns
- Context switches and interrupts
Correlation row:
- CPU vs. Memory usage
- CPU vs. Disk I/O
- CPU vs. Network traffic

Sample Dashboard Visualization

Here's a visual representation of how the different metrics relate:

Troubleshooting High CPU Utilization

When you encounter high CPU utilization, follow these steps:

Determine the type of high utilization:
- User CPU: Application issues
- System CPU: Kernel or driver issues
- I/O Wait: Disk or network bottlenecks
Check for specific patterns:
- Consistent high usage: Underprovisioned resources or inefficient applications
- Spikes: Periodic jobs or resource contention
- Gradual increase: Memory leaks or growing workloads
Common solutions:
- Optimize application code
- Scale resources (vertical or horizontal)
- Distribute workload
- Set resource limits
- Upgrade hardware

Grafana CPU Troubleshooting Dashboard

Create a specialized troubleshooting dashboard with:

# Top processes by CPU usage
topk(10, sum by (process) (rate(process_cpu_seconds_total[5m])))

# CPU usage trend with annotations for deployments
rate(node_cpu_seconds_total{mode!="idle"}[5m])

# Context switches per second
rate(node_context_switches_total[5m])

Best Practices for CPU Utilization Monitoring

Set appropriate thresholds: Not all high CPU is bad; understand your workload's characteristics
Monitor all components: CPU usage alone doesn't tell the full story
Correlate metrics: Connect CPU metrics with application performance
Use templates: Create reusable dashboard templates for different server types
Implement multi-level alerts: Set warning and critical thresholds
Track historical trends: Compare current usage to past patterns
Document baselines: Know what "normal" looks like for your systems

Summary

CPU utilization monitoring is a critical aspect of performance management in any IT environment. With Grafana, you can:

Visualize real-time and historical CPU usage
Break down utilization by type (user, system, I/O wait)
Monitor multiple cores and servers simultaneously
Set alerts for abnormal conditions
Correlate CPU metrics with other system indicators
Plan capacity based on usage trends

By following the guidelines in this tutorial, you'll be able to create comprehensive CPU monitoring dashboards that help maintain optimal performance and quickly address issues before they impact users.

Additional Resources

Explore more about Grafana visualizations
Learn about PromQL for advanced queries
Understand system performance fundamentals

Exercises

Create a basic CPU utilization dashboard showing overall usage
Add panels to show breakdowns of user, system, and I/O wait times
Configure alerts for when CPU usage exceeds 85% for more than 10 minutes
Create a dashboard that correlates CPU usage with memory, disk, and network metrics
Build a capacity planning report using historical CPU data

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Understanding CPU Utilization​

What is CPU Utilization?​

Why Monitor CPU Utilization?​

Setting Up CPU Utilization Monitoring in Grafana​

Prerequisites​

Creating a Basic CPU Utilization Dashboard​

For Prometheus Data Source (with node_exporter)​

For InfluxDB (with Telegraf)​

Visualizing Different CPU Metrics​

User CPU Time​

System CPU Time​

I/O Wait Time​

Advanced CPU Utilization Visualizations​

Multi-Core CPU Visualization​

CPU Saturation Alerts​

CPU Usage Patterns and Anomaly Detection​

Practical Examples​

Example 1: Diagnosing High CPU Usage​

Example 2: Capacity Planning​

Creating a Complete CPU Dashboard​

Dashboard Structure​

Sample Dashboard Visualization​

Troubleshooting High CPU Utilization​

Grafana CPU Troubleshooting Dashboard​

Best Practices for CPU Utilization Monitoring​

Summary​

Additional Resources​

Exercises​

Introduction

Understanding CPU Utilization

What is CPU Utilization?

Why Monitor CPU Utilization?

Setting Up CPU Utilization Monitoring in Grafana

Prerequisites

Creating a Basic CPU Utilization Dashboard

For Prometheus Data Source (with node_exporter)

For InfluxDB (with Telegraf)

Visualizing Different CPU Metrics

User CPU Time

System CPU Time

I/O Wait Time

Advanced CPU Utilization Visualizations

Multi-Core CPU Visualization

CPU Saturation Alerts

CPU Usage Patterns and Anomaly Detection

Practical Examples

Example 1: Diagnosing High CPU Usage

Example 2: Capacity Planning

Creating a Complete CPU Dashboard

Dashboard Structure

Sample Dashboard Visualization

Troubleshooting High CPU Utilization

Grafana CPU Troubleshooting Dashboard

Best Practices for CPU Utilization Monitoring

Summary

Additional Resources

Exercises