Advanced Alert Routing
Introduction
Alert routing is a critical aspect of any monitoring system. It ensures that alerts are delivered to the appropriate teams or individuals based on various criteria such as severity, service, time of day, and on-call schedules. Prometheus, through its Alertmanager component, provides powerful and flexible alert routing capabilities.
In this guide, we'll explore advanced alert routing techniques in Prometheus Alertmanager, enabling you to create sophisticated notification pipelines that direct alerts efficiently and intelligently across your organization.
Prerequisites
Before diving into advanced alert routing, you should have:
- A working Prometheus setup
- Alertmanager installed and configured
- Basic understanding of Prometheus alerts and rules
- Familiarity with YAML configuration
Understanding Alertmanager Routing
Alertmanager uses a tree-based routing configuration that determines how alerts are processed. This configuration is defined in the alertmanager.yml file.
The routing tree starts with a top-level route (the root), which can have nested child routes. Each route can specify:
- Matching criteria for alerts
- Grouping behavior
- Notification receivers
- Timing parameters (repeat interval, group wait, etc.)
- Child routes for further routing
Here's a schematic representation of a routing tree:
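In text form (the receiver and label names below are purely illustrative), such a tree might look like this:

root route ─────────────────────────► receiver: default-receiver
├─ match: team=database ────────────► receiver: database-team
│  └─ match: severity=critical ─────► receiver: database-oncall
└─ match: team=frontend ────────────► receiver: frontend-team
   └─ match: severity=critical ─────► receiver: frontend-oncall

An alert enters at the root and descends along matching branches; by default, the receiver of the deepest matching route is the one that gets notified.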
Basic Routing Configuration
Let's start with a basic routing configuration to understand the core concepts:
route:
  receiver: 'default-receiver'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'team-critical'
    - match:
        team: database
      receiver: 'database-team'
In this example:
- All alerts first enter the root route
- Alerts with severity: critical are sent to team-critical
- Alerts with team: database are sent to database-team
- All other alerts go to the default-receiver
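If you're running Alertmanager v0.22 or newer, the same routes can also be written with the matchers syntax, which supersedes match and match_re. A quick sketch of the equivalent configuration:

route:
  receiver: 'default-receiver'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'team-critical'
    - matchers:
        - team="database"
      receiver: 'database-team'

Both forms currently work, so the examples in this guide stick with match/match_re for brevity.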
Advanced Routing Techniques
Now, let's explore more sophisticated routing strategies.
Routing by Service and Severity
A common pattern is to route alerts based on both the service and its severity:
route:
  receiver: 'default-receiver'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        service: database
      receiver: 'database-team'
      routes:
        - match:
            severity: critical
          receiver: 'database-oncall'
        - match:
            severity: warning
          receiver: 'database-slack'
    - match:
        service: api
      receiver: 'api-team'
      routes:
        - match:
            severity: critical
          receiver: 'api-oncall'
This configuration creates a hierarchical routing structure where:
- Alerts are first matched by service
- Within each service, they're further routed based on severity
Using match_re for Regular Expression Matching
For more flexible matching, Alertmanager supports regular expressions via match_re:
route:
  receiver: 'default-receiver'
  routes:
    - match_re:
        service: (api|web|auth)
      receiver: 'frontend-team'
    - match_re:
        service: (database|cache|queue)
      receiver: 'backend-team'
This routes alerts for multiple services to specific teams using pattern matching.
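Keep in mind that Alertmanager anchors these regular expressions, so the pattern must match the entire label value. With the newer matchers syntax (v0.22+), regex matching uses the =~ operator; a sketch of the equivalent routes:

route:
  receiver: 'default-receiver'
  routes:
    - matchers:
        - service=~"api|web|auth"
      receiver: 'frontend-team'
    - matchers:
        - service=~"database|cache|queue"
      receiver: 'backend-team'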
Time-Based Routing
You can implement time-based routing to handle different notification paths during business hours versus outside hours:
route:
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'daytime-oncall'
      routes:
        - match:
            timeperiod: afterhours
          receiver: 'nighttime-oncall'
For this to work, something outside Alertmanager has to attach the timeperiod label to your alerts (for example, by setting it in your Prometheus alerting rules or in a relabeling step), since Alertmanager itself does not add time-of-day labels.
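Alternatively, recent Alertmanager releases support time-based routing natively: time_intervals (v0.22+) define named time windows, and routes can reference them via mute_time_intervals or, in newer versions, active_time_intervals. A minimal sketch of the same day/night split, assuming a business-hours window:

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

route:
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'daytime-oncall'
      active_time_intervals: ['business-hours']  # only notify during business hours
      continue: true                             # also evaluate the next route
    - match:
        severity: critical
      receiver: 'nighttime-oncall'
      mute_time_intervals: ['business-hours']    # silenced during business hours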
Routing by Environment
Different environments often require different handling:
route:
  receiver: 'default-receiver'
  routes:
    - match:
        environment: production
      receiver: 'prod-team'
      group_by: ['alertname', 'cluster', 'service']
      routes:
        - match:
            severity: critical
          receiver: 'prod-oncall'
          group_wait: 0s  # Immediate notification for production criticals
          repeat_interval: 1h
    - match:
        environment: staging
      receiver: 'dev-team'
      group_by: ['alertname', 'service']
      group_wait: 1m
      repeat_interval: 12h
This configuration applies different grouping and timing parameters based on the environment.
Implementing Inhibition Rules
Inhibition rules allow you to suppress notifications for certain alerts if other alerts are already firing. This reduces alert noise during outages.
inhibit_rules:
  - source_match:
      severity: 'critical'
      alertname: 'ClusterDown'
    target_match:
      severity: 'warning'
    equal: ['cluster']
This rule suppresses all warning alerts for a cluster if there's already a critical ClusterDown
alert for that same cluster.
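In Alertmanager v0.22 and later, the same rule can also be written with the newer matchers-based fields (source_matchers and target_matchers); a sketch:

inhibit_rules:
  - source_matchers:
      - severity="critical"
      - alertname="ClusterDown"
    target_matchers:
      - severity="warning"
    equal: ['cluster']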
Practical Example: Multi-Team, Multi-Environment Setup
Let's create a comprehensive example for a company with multiple teams and environments:
global:
  resolve_timeout: 5m

route:
  receiver: 'default-email'
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Production environment routing
    - match:
        environment: production
      receiver: 'prod-alerts'
      group_by: ['alertname', 'job', 'instance']
      routes:
        # Database team routing
        - match:
            team: database
          receiver: 'database-slack'
          routes:
            - match:
                severity: critical
              receiver: 'database-pager'
              group_wait: 0s
              repeat_interval: 1h
        # Frontend team routing
        - match:
            team: frontend
          receiver: 'frontend-slack'
          routes:
            - match:
                severity: critical
              receiver: 'frontend-pager'
              group_wait: 0s
              repeat_interval: 1h
    # Staging environment routing
    - match:
        environment: staging
      receiver: 'staging-slack'
      group_wait: 1m
      repeat_interval: 12h
    # Development environment routing
    - match:
        environment: development
      receiver: 'dev-slack'
      group_wait: 5m
      repeat_interval: 24h

# Receiver definitions
receivers:
  - name: 'default-email'
    email_configs:
      - to: '[email protected]'
  - name: 'prod-alerts'
    slack_configs:
      - channel: '#prod-alerts'
  - name: 'database-slack'
    slack_configs:
      - channel: '#db-alerts'
  - name: 'database-pager'
    pagerduty_configs:
      - service_key: '<your-pagerduty-key>'
  - name: 'frontend-slack'
    slack_configs:
      - channel: '#frontend-alerts'
  - name: 'frontend-pager'
    pagerduty_configs:
      - service_key: '<your-pagerduty-key>'
  - name: 'staging-slack'
    slack_configs:
      - channel: '#staging-alerts'
  - name: 'dev-slack'
    slack_configs:
      - channel: '#dev-alerts'

# Inhibition rules
inhibit_rules:
  - source_match:
      severity: 'critical'
      alertname: 'ServiceDown'
    target_match:
      severity: 'warning'
    equal: ['job', 'instance']
This configuration implements:
- Environment-based routing (production, staging, development)
- Team-based sub-routes (database, frontend)
- Severity-based routing within teams
- Different timing parameters based on environment
- Different notification channels (email, Slack, PagerDuty)
- Inhibition rules to reduce alert noise
Verifying Your Routing Configuration
Before deploying your configuration, you can use the Alertmanager API to test routing:
curl -X POST -H "Content-Type: application/json" -d '[{
  "labels": {
    "alertname": "DiskFull",
    "severity": "critical",
    "team": "database",
    "environment": "production"
  }
}]' http://alertmanager:9093/api/v1/alerts
Note that newer Alertmanager releases have removed the deprecated v1 API, so you may need to POST to http://alertmanager:9093/api/v2/alerts instead; the same JSON payload works. You can also use the Alertmanager UI to check the status of active alerts and verify they've been routed correctly.
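If you have amtool installed, it can also evaluate the routing tree offline, without sending any test alerts. Assuming your configuration lives in alertmanager.yml, commands along these lines display the tree and print which receiver an alert with the given labels would reach (check your amtool version for the exact flags):

amtool config routes show --config.file=alertmanager.yml
amtool config routes test --config.file=alertmanager.yml \
  environment=production team=database severity=critical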
Common Routing Patterns and Best Practices
Tiered Escalation
Create escalation tiers that gradually increase the urgency of notifications:
route:
  receiver: 'slack'
  routes:
    - match:
        severity: critical
      receiver: 'tier1'
      group_wait: 0s
      continue: true  # keep evaluating the sibling routes below
    - match:
        severity: critical
      receiver: 'tier2'
      group_wait: 10m  # delay the tier-2 notification by 10 minutes
The continue: true parameter is crucial here. Normally the first matching route wins; with continue: true, Alertmanager keeps evaluating the sibling routes that follow, so the same alert is delivered to both tier1 and tier2. Because the tier2 route uses group_wait: 10m, its first notification is delayed by ten minutes, so short-lived alerts typically never reach the second tier. Keep in mind that this escalation is purely time-based; Alertmanager does not track acknowledgements, so acknowledgement-aware escalation is usually handled by the paging service itself.
Geographical Routing
For global teams, route alerts based on geographical regions:
route:
  receiver: 'global-team'
  routes:
    - match:
        region: us-east
      receiver: 'us-team'
    - match:
        region: eu-west
      receiver: 'eu-team'
    - match:
        region: ap-south
      receiver: 'asia-team'
Service Level Objectives (SLO) Routing
Route alerts differently based on whether they affect SLOs:
route:
  receiver: 'default'
  routes:
    - match:
        affects_slo: 'true'
      receiver: 'slo-violations'
      group_by: ['slo_name', 'service']
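The affects_slo and slo_name labels have to come from the alerting rules themselves. As a rough sketch (the metric names and threshold here are illustrative assumptions, not part of this guide), a Prometheus alerting rule that produces such labels might look like this:

groups:
  - name: slo-alerts
    rules:
      - alert: ApiErrorBudgetBurn
        # Hypothetical recording rules; replace with your own expressions
        expr: job:http_errors:rate5m{job="api"} / job:http_requests:rate5m{job="api"} > 0.05
        for: 10m
        labels:
          severity: critical
          service: api
          affects_slo: 'true'
          slo_name: api-availability
        annotations:
          summary: 'API error rate is burning the availability error budget'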
Troubleshooting Routing Issues
If alerts aren't being routed as expected, check:
- Label inconsistencies: Ensure alert labels match your routing criteria
- Route order: Routes are evaluated in order. More specific routes should appear before general ones
- Configuration validation: Run amtool check-config alertmanager.yml to validate your configuration
- Inhibition issues: Check if alerts are being inhibited by other active alerts
- Grouping conflicts: Ensure your group_by settings don't conflict with your routing needs
Summary
Advanced alert routing in Prometheus Alertmanager is a powerful capability that helps organizations direct alerts to the right people at the right time. Key takeaways include:
- Alertmanager uses a tree-based routing structure
- Routing can be based on labels like severity, team, service, or environment
- Regular expressions provide flexible matching with match_re
- Timing parameters control notification behavior
- Inhibition rules reduce alert noise
- Complex organizational structures can be represented with nested routes
By mastering these advanced routing techniques, you can create a notification system that balances responsiveness with minimal alert fatigue, ensuring your teams can effectively respond to issues.
Additional Resources
- Alertmanager Configuration Documentation
- Alertmanager GitHub Repository
- Prometheus Alerts Best Practices
Exercises
- Create a routing configuration for a company with three teams (infrastructure, application, and security) across two environments.
- Implement time-based routing that sends alerts to different teams based on work hours in different time zones.
- Design an inhibition rule that suppresses service-specific alerts when there's a broader infrastructure problem.
- Create a multi-tier escalation policy that starts with Slack notifications and escalates to PagerDuty after 15 minutes for unacknowledged critical alerts.