Alert Grouping and Routing
Introduction
When monitoring complex systems with Prometheus, it's common to experience a flood of alerts when something goes wrong. For example, if a database server fails, you might receive dozens of related alerts about high latency, failed queries, and connection errors. This phenomenon, known as an "alert storm," can lead to notification fatigue and make it difficult to identify the root cause of the issue.
Alertmanager, a component of the Prometheus ecosystem, solves this problem through two powerful mechanisms: Alert Grouping and Alert Routing. These features help organize alerts intelligently and ensure they reach the right teams through appropriate notification channels.
In this guide, we'll explore how to configure and use these capabilities to build an effective alerting system.
Alert Grouping
Alert grouping combines related alerts into a single notification, reducing noise and providing context for troubleshooting.
How Grouping Works
Alertmanager groups alerts based on labels. Alerts with the same grouping labels are combined into a single notification. By default, Alertmanager groups alerts by the `alertname` label, but you can customize this behavior.
Configuring Alert Groups
Grouping is configured in the Alertmanager configuration file. Here's a basic example:
```yaml
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
```
Let's break down these configuration options:
- `group_by`: The labels used to group alerts together
- `group_wait`: How long to buffer alerts of the same group before sending the initial notification
- `group_interval`: How long to wait before sending a notification about new alerts added to an existing group
- `repeat_interval`: How long to wait before re-sending a notification that has already been sent
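If you ever need to effectively disable grouping for a route, Alertmanager also accepts the special value `...`, which groups by all labels so that every distinct label set becomes its own notification. A minimal sketch:

```yaml
route:
  # Special value: group by all labels, i.e. each unique alert gets its own group.
  group_by: ['...']
```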
Example: Grouping in Action
Consider these alerts triggered simultaneously:
- Alert 1: `{ alertname="HighCPUUsage", cluster="prod-us-east", service="api", instance="server1" }`
- Alert 2: `{ alertname="HighCPUUsage", cluster="prod-us-east", service="api", instance="server2" }`
- Alert 3: `{ alertname="HighCPUUsage", cluster="prod-us-east", service="database", instance="db1" }`
With the grouping configuration above, Alertmanager would create two notification groups:
- Group 1: Alert 1 and Alert 2 (same `alertname`, `cluster`, and `service`)
- Group 2: Alert 3 (different `service`)
Instead of receiving three separate notifications, you would receive two more meaningful grouped notifications.
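For context, alerts like these originate from Prometheus alerting rules rather than from Alertmanager itself. The sketch below is a hypothetical rule that could produce the HighCPUUsage alerts above; the expression, threshold, and annotation text are illustrative, and the `cluster` and `service` labels are assumed to come from the scraped targets.

```yaml
groups:
  - name: cpu-alerts
    rules:
      - alert: HighCPUUsage
        # Fire when average non-idle CPU on an instance stays above 90% for 10 minutes.
        expr: 100 - (avg by (instance, cluster, service) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          description: 'CPU usage on {{ $labels.instance }} has been above 90% for 10 minutes.'
```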
Alert Routing
Alert routing directs different types of alerts to appropriate teams and notification channels based on labels.
Understanding Routing Trees
Alertmanager uses a tree structure for routing alerts. The configuration starts with a top-level route and can have nested child routes. Each route can specify:
- Which alerts it handles (using matchers)
- How alerts are grouped
- Where notifications are sent (receivers)
- How notifications are throttled
Configuring Alert Routes
Here's an example routing configuration:
```yaml
route:
  receiver: 'default-receiver'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        service: 'database'
      receiver: 'database-team'
      group_by: ['alertname', 'cluster', 'instance']
      group_wait: 45s
    - match:
        severity: 'critical'
      receiver: 'pager-duty'
      group_wait: 10s

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'
  - name: 'database-team'
    email_configs:
      - to: '[email protected]'
    slack_configs:
      - channel: '#db-alerts'
  - name: 'pager-duty'
    pagerduty_configs:
      - service_key: '<your-pagerduty-service-key>'
```
In this configuration:
- All alerts go to the `default-receiver` by default
- Alerts with `service: database` go to the `database-team` receiver
- Alerts with `severity: critical` go to the `pager-duty` receiver (see the note on matching order below)
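One subtlety worth noting: child routes are evaluated in order, and by default the first matching route wins. In the configuration above, a critical database alert matches the `service: database` route first and would never reach `pager-duty`. Setting `continue: true` on a route tells Alertmanager to keep evaluating the sibling routes that follow, as in this sketch:

```yaml
routes:
  - match:
      service: 'database'
    receiver: 'database-team'
    # Keep evaluating the remaining sibling routes after this one matches,
    # so a critical database alert also reaches 'pager-duty' below.
    continue: true
  - match:
      severity: 'critical'
    receiver: 'pager-duty'
```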
Matchers
Routes use matchers to determine which alerts they should handle. Matchers support several operations:
- `=`: Exact match
- `!=`: Negative match
- `=~`: Regex match
- `!~`: Negative regex match
Example of more complex matching:
```yaml
routes:
  - matchers:
      - service=~"api|web"
      - environment="production"
    receiver: 'prod-webapp-team'
```
This route matches alerts where the service label matches "api" or "web" AND the environment label equals "production".
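The negative operators work the same way. Here is a small sketch (the label values and receiver name are illustrative) that routes alerts from non-development environments that are not owned by the platform team:

```yaml
routes:
  - matchers:
      - environment!~"dev|staging"   # exclude non-production environments
      - team!="platform"             # exclude the platform team's own alerts
    receiver: 'prod-catchall'
```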
Receiver Types
Alertmanager supports various notification channels:
- Email
- Slack
- PagerDuty
- Webhook
- OpsGenie
- VictorOps
- Pushover
- And more...
You can configure multiple notification methods for each receiver.
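For tools without a built-in integration, the generic webhook receiver posts grouped alerts as JSON to an HTTP endpoint you operate; the URL below is a placeholder.

```yaml
receivers:
  - name: 'custom-webhook'
    webhook_configs:
      - url: 'http://alert-gateway.internal:8080/alertmanager'  # placeholder endpoint
        send_resolved: true  # also send a notification when the alerts resolve
```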
Putting It All Together: A Real-World Example
Let's look at a more comprehensive example that demonstrates both grouping and routing:
```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  receiver: 'operations-team'          # fallback receiver for anything not matched below
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - service=~"api|frontend"
      receiver: 'application-team'
      group_by: ['alertname', 'service', 'instance']
    - matchers:
        - service=~"database|cache"
      receiver: 'infrastructure-team'
      group_by: ['alertname', 'service', 'instance']
      routes:
        - matchers:
            - severity="critical"
          receiver: 'database-oncall'
          group_wait: 10s
          continue: true               # keep evaluating sibling routes after this match
    - matchers:
        - team="security"
      receiver: 'security-team'
      group_wait: 1m

receivers:
  - name: 'operations-team'
    email_configs:
      - to: '[email protected]'
    slack_configs:
      - channel: '#ops-alerts'
  - name: 'application-team'
    email_configs:
      - to: '[email protected]'
    slack_configs:
      - channel: '#app-alerts'
  - name: 'infrastructure-team'
    email_configs:
      - to: '[email protected]'
    slack_configs:
      - channel: '#infra-alerts'
  - name: 'database-oncall'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
  - name: 'security-team'
    email_configs:
      - to: '[email protected]'
    slack_configs:
      - channel: '#security-alerts'
```
This configuration:
- Routes alerts to different teams based on service labels
- Sends critical database alerts to an on-call engineer via PagerDuty
- Uses different grouping strategies for different types of alerts
- Customizes wait times for each route
Advanced Features
Inhibition Rules
Inhibition rules allow you to suppress certain alerts when other alerts are firing. This is useful when you have a high-level alert that makes lower-level alerts redundant.
Example:
```yaml
inhibit_rules:
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - severity="warning"
    equal: ['cluster', 'instance']
```
This rule suppresses warning-level alerts for any cluster and instance where a NodeDown alert is currently firing.
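Another common pattern, sketched below with assumed severity labels, is to let a critical alert inhibit its warning-level counterpart so the same problem does not notify you twice:

```yaml
inhibit_rules:
  # If a critical alert is firing, suppress the matching warning
  # for the same alert name on the same cluster and instance.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'cluster', 'instance']
```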
Time-Based Routing
You can implement time-based routing using time intervals. This is useful for directing alerts to different teams based on working hours or on-call schedules.
Example:
```yaml
routes:
  - matchers:
      - severity="critical"
    receiver: 'daytime-oncall'
    group_wait: 30s
    active_time_intervals: ['workday-hours']    # route is only active during this interval
  - matchers:
      - severity="critical"
    receiver: 'nighttime-oncall'
    group_wait: 30s
    active_time_intervals: ['non-work-hours']

time_intervals:
  - name: 'workday-hours'
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
  - name: 'non-work-hours'
    time_intervals:
      # Split the overnight window into evening and morning ranges.
      - weekdays: ['monday:friday']
        times:
          - start_time: '17:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '09:00'
      - weekdays: ['saturday', 'sunday']
```
Mute Timings
Mute timings allow you to silence notifications during specific time periods, such as planned maintenance windows.
```yaml
time_intervals:
  - name: 'maintenance-window'
    time_intervals:
      - weekdays: ['wednesday']
        times:
          - start_time: '22:00'
            end_time: '23:59'

route:
  receiver: 'operations-team'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  mute_time_intervals: ['maintenance-window']   # silence notifications during the window
```
Best Practices
- Keep Related Alerts Together: Group alerts that are likely to occur together and should be handled by the same team.
- Route by Severity: Consider routing alerts based on severity level to ensure critical issues get immediate attention.
- Route by Team Responsibility: Direct alerts to the teams that are responsible for the systems generating them.
- Document Your Routing Logic: Create a diagram of your routing tree to help teams understand where alerts will go.
- Test Your Configuration: Use the `amtool` command-line utility to validate and test your Alertmanager configuration:

```bash
amtool check-config alertmanager.yml
```

- Start Simple: Begin with a basic configuration and add complexity as needed. It's easier to understand and troubleshoot a simpler routing tree.
- Use Templating: Customize notification messages using templates to include relevant information:

```yaml
receivers:
  - name: 'ops-team'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: >
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          {{ end }}
```
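Larger templates are usually kept in separate files and loaded through the top-level `templates` field; the path below is illustrative:

```yaml
templates:
  - '/etc/alertmanager/templates/*.tmpl'  # illustrative path to custom notification templates
```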
Summary
Alert grouping and routing are powerful features in Alertmanager that help tame notification noise and ensure alerts reach the right teams. By properly configuring these features, you can create an alerting system that:
- Reduces alert fatigue by grouping related notifications
- Directs alerts to the appropriate teams based on service, severity, or other criteria
- Customizes notification behavior for different types of alerts
- Uses appropriate communication channels for different situations
Effective alert management is crucial for maintaining system reliability and enabling quick response to issues. By investing time in properly configuring Alertmanager, you'll create a more manageable and useful alerting system.
Additional Resources
- Official Alertmanager Documentation
- Prometheus Alerting Rules Configuration
- Common Alerting Patterns
Exercises
1. Configure Alertmanager to group alerts by environment and service, with a 1-minute wait time.
2. Create a routing configuration that sends:
   - Database alerts to the database team
   - Frontend alerts to the frontend team
   - Critical alerts to both PagerDuty and Slack
3. Define an inhibition rule that suppresses disk space warnings when a "HostDown" alert is firing for the same instance.
4. Create a time-based routing configuration that sends alerts to different teams during business hours versus after hours.