Alert Conditions
Introduction
Alert conditions are the heart of Grafana's alerting system. They define the specific criteria that must be met before an alert is triggered. Understanding how to properly configure alert conditions allows you to create precise monitoring rules that notify you only when truly necessary, reducing alert fatigue while ensuring you catch important events in your systems.
In this guide, we'll explore how alert conditions work in Grafana Alerting, how to configure them effectively, and provide practical examples for common monitoring scenarios.
What Are Alert Conditions?
Alert conditions are expressions that evaluate your time series data and determine when an alert should fire. They typically include:
- A query or metric - The data you want to monitor
- A condition - The logical comparison to perform
- A threshold - The value that triggers the alert
- Duration - How long the condition must be true before firing
When these conditions are met, Grafana will change the alert state to "Pending" and then to "Firing" after the configured evaluation period.
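To make these four parts concrete, here is a minimal sketch of a complete rule in Prometheus-style alerting YAML. The group name, alert name, metric, and threshold are all hypothetical; Grafana-managed rules configure the same parts through the UI rather than a file, but the anatomy is identical:

# Minimal sketch of a complete alert rule (hypothetical names and threshold)
groups:
  - name: example-rules
    rules:
      - alert: HighLoadAverage
        # Query, condition, and threshold in a single expression
        expr: avg by(instance) (node_load5) > 4
        # Duration: how long the condition must hold before the alert fires
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High 5-minute load average on {{ $labels.instance }}"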
Basic Alert Condition Structure
In Grafana, alert conditions are configured as part of alert rules. Let's look at how to create a basic alert condition:
// Basic alert condition structure:
// A (query/metric), a condition with threshold B, and a duration C
avg() OF query(A, 5m, now) IS ABOVE 80 FOR 5m
This condition reads: "if the average of query A over the last 5 minutes is above 80, and that stays true for 5 minutes, fire the alert." Note that the 5m inside query(A, 5m, now) is the query window, while the trailing FOR 5m is how long the breach must persist before the alert leaves the Pending state.
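If you write conditions directly in PromQL rather than the classic editor, the same idea might look like the sketch below. The metric name is hypothetical, and the trailing duration is configured separately as the rule's FOR period:

# Average of the last 5 minutes above 80 (hypothetical metric name)
avg_over_time(some_metric[5m]) > 80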
Creating Alert Conditions
Step 1: Access the Alerting Interface
Navigate to Grafana's alerting section by clicking the bell icon in the left sidebar, then selecting "Alert rules."
Step 2: Create a New Alert Rule
Click the "New alert rule" button to begin creating an alert with conditions.
Step 3: Define Your Query
First, you need to define what data you want to monitor. This is done through Grafana's query editor:
# Example PromQL query for CPU usage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Step 4: Set the Condition
After defining your query, you need to set the condition that will trigger the alert:
// Alert condition for CPU usage over 80%
IS ABOVE 80
Step 5: Define Evaluation Behavior
Set how long the condition must be true before the alert fires:
// Alert will fire if condition is true for 5 minutes
FOR 5m
Types of Alert Conditions
Grafana supports several types of conditions for different monitoring needs:
Threshold Conditions
The most common type of alert condition compares a metric against a fixed threshold.
// Simple threshold condition
IS ABOVE 90
IS BELOW 10
IS OUTSIDE RANGE 10 TO 90
IS WITHIN RANGE 10 TO 90
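These keywords come from Grafana's classic condition editor. If you express the condition directly in PromQL instead, range checks can be written with boolean operators; a sketch using a hypothetical metric name:

# Outside the range 10 to 90: below the floor OR above the ceiling
some_metric < 10 or some_metric > 90
# Within the range 10 to 90: above the floor AND below the ceiling
some_metric > 10 and some_metric < 90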
Multiple Conditions with Math and Functions
You can create more complex conditions using math, aggregations, and functions:
# Complex condition example
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100 > 5
This condition triggers when the error rate (HTTP 5xx status codes) exceeds 5% of total requests.
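PromQL's and operator lets you combine several criteria into a single condition. As a sketch, the following only fires on a high error rate when there is also meaningful traffic, so a single failed request on a quiet service does not page anyone (the 1 request/second floor is an assumption to tune for your service):

# Error rate above 5% AND at least 1 request/second of total traffic
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100 > 5
and
sum(rate(http_requests_total[5m])) > 1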
Time-Based Conditions
Some alerts need to consider patterns over time:
# Fire when a target's up metric has reported no samples for 5 minutes
# (a count_over_time(...) == 0 check would silently match nothing here,
# because an empty range returns no result rather than zero)
absent_over_time(up[5m])
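Another time-based pattern is catching a brief spike anywhere inside a window, rather than only at the evaluation instant. A sketch with a hypothetical gauge metric:

# Fire if queue depth exceeded 1000 at any point in the last 10 minutes
max_over_time(queue_depth[10m]) > 1000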
Practical Examples
Let's explore some real-world examples of alert conditions for common monitoring scenarios:
Example 1: High CPU Usage Alert
This alert triggers when CPU usage remains above 85% for more than 5 minutes:
# CPU usage alert
avg by(instance) (100 - (rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)) > 85
Example 2: Service Availability Alert
This alert fires when an endpoint returns non-200 status codes:
# Service availability alert
sum(rate(http_requests_total{status!="200",route="/api/important"}[5m])) > 0
Example 3: Disk Space Alert
Alert when disk space is running low (less than 10% free):
# Low disk space alert
100 * (node_filesystem_free_bytes / node_filesystem_size_bytes) < 10
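A useful refinement here is to alert on the trend rather than the current level. As a sketch, predict_linear fits a line to the last six hours of free-space data and fires if that line crosses zero within the next four hours (the window and horizon are assumptions to tune):

# Predict free bytes 4 hours ahead from a 6-hour linear fit
predict_linear(node_filesystem_free_bytes[6h], 4 * 3600) < 0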
Example 4: Spike Detection
Alert on sudden spikes in error rates:
# Error rate spike detection
increase(app_exceptions_total[5m]) > 10
Best Practices for Alert Conditions
Follow these guidelines to create effective alert conditions:
1. Avoid Alert Noise
Configure thresholds carefully to avoid alert fatigue:
# Instead of a fixed threshold that might be too sensitive
http_request_duration_seconds > 0.1
# Consider using percentiles for more robust alerting
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
2. Use Appropriate Time Windows
Choose evaluation periods that match your system's behavior:
// For metrics that fluctuate frequently, use longer evaluation periods
FOR 15m
// For critical systems that need immediate attention
FOR 1m
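As an alternative (or complement) to a longer FOR period, you can smooth the signal itself before comparing it to the threshold. This sketch uses a PromQL subquery to average the CPU expression from earlier over 15 minutes at 1-minute resolution, so short blips never cross the line (the window sizes are assumptions):

# Average the instantaneous CPU-usage expression over 15m before comparing
avg_over_time(
  (avg by(instance) (100 - rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100))[15m:1m]
) > 85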
3. Include Labels and Annotations
Make your alerts informative by adding context:
# Example labels and annotations configuration
labels:
  severity: warning
  category: performance
annotations:
  summary: "High CPU usage on {{ $labels.instance }}"
  description: "CPU usage is above 85% for 5 minutes on {{ $labels.instance }}"
Advanced Alert Conditions
For more complex monitoring needs, Grafana offers advanced condition configurations:
Multi-Dimensional Alerts
Alert on specific dimensions of your metrics:
# Alert on high latency for specific API endpoints
max by(endpoint) (http_request_duration_seconds{endpoint=~"/api/.*"}) > 1
Alerting on Absent Metrics
Detect when metrics stop reporting:
# Alert when a metric stops reporting
absent(up{job="important-service"} == 1)
Relative Change Alerts
Alert on significant changes from normal:
# Alert when traffic drops by more than 50% compared to last hour
sum(rate(http_requests_total[5m])) < sum(rate(http_requests_total[5m] offset 1h)) * 0.5
Visualizing Alert Conditions
Grafana's graphs are also the best tool for choosing alert conditions in the first place: plot the alert query over a representative time range and see how often it would have crossed your proposed threshold before you commit to it.
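For example, before settling on a latency threshold, you might graph several percentiles of the same histogram side by side and pick a line that the healthy baseline rarely crosses (this assumes a Prometheus histogram named http_request_duration_seconds):

# Graph these three queries together to see where a sensible threshold sits
histogram_quantile(0.50, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))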
Debugging Alert Conditions
When your alert conditions aren't working as expected, Grafana provides several debugging tools:
- Test Rule: You can test your alert rule before saving it to see how it evaluates
- State History: View the history of state changes for your alert
- Silence: Temporarily silence alerts while you work on fixing issues
Example of using the alert testing feature:
// Alert test
// In the Grafana UI, you can simulate how your alert would behave
// using historical data by clicking "Test Rule"
Summary
Alert conditions are the foundation of effective monitoring in Grafana. By understanding how to create precise, actionable conditions, you can build an alerting system that notifies you of real problems while avoiding unnecessary alerts. Remember these key points:
- Alert conditions consist of queries, thresholds, and durations
- Choose appropriate thresholds and evaluation periods for your specific use case
- Use labels and annotations to provide context for alert notifications
- Leverage Grafana's advanced features for complex monitoring needs
With the knowledge from this guide, you should be able to create effective alert conditions that help you maintain the health and performance of your systems.
Exercises
- Create an alert condition that triggers when memory usage exceeds 90% for more than 10 minutes.
- Design an alert condition that detects when your application's error rate exceeds 1% of total requests.
- Build a multi-condition alert that fires when both CPU usage is high and disk space is low.
- Create an alert condition that detects when a service hasn't reported metrics for 5 minutes.