Prometheus Alert Manager
Introduction
Alert Manager is a critical component in the Prometheus ecosystem that handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing alerts to the correct receiver integration such as email, PagerDuty, or Slack. Alert Manager also handles silencing and inhibition of alerts.
Think of Alert Manager as the communication center for your monitoring system - Prometheus generates the alerts based on metrics, but Alert Manager decides what to do with those alerts, who should receive them, and how they should be organized.
Understanding Alert Manager's Role
Before diving into Alert Manager, it's important to understand the full alerting workflow in Prometheus:
- Prometheus Server: Evaluates alert rules and generates alerts
- Alert Manager: Receives alerts, processes them, and sends notifications
- Receivers: Various notification channels configured to receive alerts
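You normally never push alerts by hand, but it helps to see what Prometheus actually sends: a JSON list of alerts posted to Alert Manager's HTTP API. A minimal sketch using curl (assuming Alert Manager is listening on localhost:9093; the label and annotation values are made up for illustration):
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[
    {
      "labels": {"alertname": "HighCPULoad", "instance": "web1.example.com", "severity": "warning"},
      "annotations": {"summary": "High CPU load on web1"},
      "generatorURL": "http://prometheus:9090/graph"
    }
  ]'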
Installation and Setup
Installing Alert Manager
You can download the pre-compiled binary from the Prometheus downloads page or use Docker to run Alert Manager:
# Download and extract Alert Manager
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar xvfz alertmanager-0.25.0.linux-amd64.tar.gz
cd alertmanager-0.25.0.linux-amd64/
# Run Alert Manager
./alertmanager --config.file=alertmanager.yml
Using Docker:
docker run --name alertmanager -d -p 9093:9093 \
  -v $(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager:latest
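Whichever method you use, you can confirm the service is up before wiring it into Prometheus. Alert Manager exposes simple health and readiness endpoints (assuming the default port 9093):
# Both should return 200
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9093/-/healthy
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9093/-/ready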
Basic Configuration
Alert Manager is configured using a YAML file. Here's a simple configuration example:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: '[email protected]'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
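Before starting Alert Manager with a new or edited file, it is worth validating it. The amtool utility that ships in the same release archive can do this (covered in more detail later in this guide):
amtool check-config alertmanager.yml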
Core Concepts
1. Grouping
Alert Manager groups similar alerts together to reduce noise. For example, if a network outage affects multiple servers, you don't want to receive dozens of separate notifications.
route:
  # Group alerts by cluster and alertname
  group_by: ['cluster', 'alertname']
  # Wait 30s to collect similar alerts before sending notification
  group_wait: 30s
  # Wait 5m before sending notification about new alerts for the same group
  group_interval: 5m
Example scenario: A database cluster with 5 nodes experiences high CPU usage. Instead of receiving 5 separate alerts, you receive one alert stating "High CPU usage on database cluster (5 affected instances)".
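Grouping can also be effectively disabled: the special value '...' groups by all labels, so every distinct label set becomes its own notification. A minimal sketch (useful mainly for low-volume setups or webhook receivers that do their own aggregation):
route:
  # Treat every distinct label set as its own group, i.e. no aggregation
  group_by: ['...']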
2. Routing
Routing determines which team or person should receive particular alerts.
route:
  # Default receiver if no matching child routes
  receiver: 'operations-team'
  # Child routes for specific alert types
  routes:
    - match:
        service: 'database'
      receiver: 'database-team'
    - match:
        service: 'frontend'
      receiver: 'frontend-team'
    - match_re:
        service: 'api|backend'
      receiver: 'backend-team'
In this example:
- Database alerts go to the database team
- Frontend alerts go to the frontend team
- API and backend alerts go to the backend team
- All other alerts go to the operations team
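One thing to keep in mind: routes are evaluated top-down and, by default, the first matching child route wins. If an alert should reach more than one team, set continue: true on the earlier route so evaluation carries on to its siblings. A small sketch reusing the receiver names above:
route:
  receiver: 'operations-team'
  routes:
    - match:
        service: 'database'
      receiver: 'database-team'
      # Keep evaluating sibling routes so other matching teams are notified too
      continue: true
    - match:
        severity: 'critical'
      receiver: 'operations-team'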
3. Receivers
Receivers define how notifications should be sent. Alert Manager supports multiple notification methods:
receivers:
  - name: 'operations-team'
    email_configs:
      - to: '[email protected]'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
  - name: 'database-team'
    pagerduty_configs:
      - service_key: 'database-team-key'
  - name: 'frontend-team'
    webhook_configs:
      - url: 'http://example.org/notify'
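Each integration also accepts per-receiver options. A common one is send_resolved, which controls whether a follow-up notification is sent once an alert stops firing; the default differs between integrations. For example:
receivers:
  - name: 'frontend-team'
    webhook_configs:
      - url: 'http://example.org/notify'
        # Also notify when the alert resolves
        send_resolved: true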
4. Silences
Silences temporarily suppress notifications based on matchers. This is useful during maintenance windows or when dealing with known issues.
You can create silences through the Alert Manager UI or via the API:
# Using the amtool CLI against a running Alert Manager
amtool silence add alertname="HighCPULoad" instance="server1.example.org" \
  --comment="Maintenance window" --duration=2h \
  --alertmanager.url=http://localhost:9093
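amtool can also list and remove silences (again assuming the default local address):
# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Expire (remove) a silence by ID
amtool silence expire <silence-id> --alertmanager.url=http://localhost:9093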
5. Inhibition
Inhibition allows suppressing certain alerts when other alerts are active. For example, you might want to suppress all warning-level alerts related to a system when a critical alert is already firing.
inhibit_rules:
  - source_match:
      severity: 'critical'
      app: 'mysql'
    target_match:
      severity: 'warning'
      app: 'mysql'
    # Only inhibit if these labels match between both alerts
    equal: ['env', 'instance']
Integrating Alert Manager with Prometheus
To configure Prometheus to send alerts to Alert Manager, add the following to your prometheus.yml:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'

rule_files:
  - 'alert_rules.yml'
Then, define alert rules in alert_rules.yml:
groups:
  - name: example
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
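Rule files can be validated without restarting anything by using promtool, which ships with Prometheus (not with Alert Manager):
# Validate the rule file and the overall Prometheus configuration
promtool check rules alert_rules.yml
promtool check config prometheus.yml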
Practical Examples
Example 1: High-Availability Configuration
For production environments, you'll want to run multiple Alert Manager instances for high availability:
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager1:9093'
            - 'alertmanager2:9093'
            - 'alertmanager3:9093'
Each Alert Manager instance also needs to know about its peers. Clustering is configured with command-line flags rather than in alertmanager.yml:
# On each instance, list the cluster members (gossip uses port 9094 by default)
./alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager1:9094 \
  --cluster.peer=alertmanager2:9094 \
  --cluster.peer=alertmanager3:9094
Example 2: Escalating Notifications
Here's how to configure escalation for critical alerts:
route:
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'critical-alerts'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'
  - name: 'critical-alerts'
    email_configs:
      - to: '[email protected]'
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'
    # Alert Manager itself does not track acknowledgements; time-based escalation
    # (e.g. paging a secondary on-call after 10 minutes) is handled by the paging
    # provider, such as OpsGenie
    opsgenie_configs:
      - api_key: '<opsgenie-api-key>'
        note: 'Alert has been firing for over 10 minutes'
        description: '{{ .CommonAnnotations.description }}'
        message: '{{ .CommonAnnotations.summary }}'
        priority: 'P1'
Example 3: Time-Based Routing
Sending alerts to different teams based on business hours:
route:
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'daytime-oncall'
      # Only active Mon-Fri 9AM-5PM
      active_time_intervals:
        - weekdays
      # Time intervals only mute notifications, they do not change matching,
      # so continue on to the night route as well
      continue: true
    - match:
        severity: critical
      receiver: 'nighttime-oncall'
      # Muted during business hours, i.e. covers all other times
      mute_time_intervals:
        - weekdays

time_intervals:
  - name: weekdays
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
        location: 'America/New_York'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'
  - name: 'daytime-oncall'
    email_configs:
      - to: '[email protected]'
  - name: 'nighttime-oncall'
    email_configs:
      - to: '[email protected]'
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'
Alert Manager UI
Alert Manager provides a web interface (typically at http://localhost:9093) where you can:
- View current alerts
- Create and manage silences
- Test alert routing
- View the status of Alert Manager itself
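The same information is available over the HTTP API, which is handy for scripting. A few read-only examples against a local instance (v2 API, default port):
# Currently firing alerts
curl -s http://localhost:9093/api/v2/alerts

# Existing silences
curl -s http://localhost:9093/api/v2/silences

# Version, configuration and cluster status
curl -s http://localhost:9093/api/v2/status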
Best Practices
- Keep it Simple: Start with a basic configuration and expand as needed
- Use Labels Effectively: Design a good labeling strategy for routing
- Configure Reasonable Timeouts: Set appropriate group_wait, group_interval, and repeat_interval values
- Implement High Availability: Run multiple Alert Manager instances
- Test Your Setup: Use tools like amtool to verify configuration
- Document Alert Policies: Ensure team members know how to respond to different alerts
- Avoid Alert Fatigue: Group similar alerts and only alert on actionable issues
Testing with amtool
The amtool command-line utility is useful for testing Alert Manager configurations:
# Check if configuration is valid
amtool check-config alertmanager.yml
# Test alert routing
amtool config routes test --config.file=alertmanager.yml \
  alertname=HighCPULoad severity=critical instance=web1.example.com

# Manually fire a test alert against a running Alert Manager
amtool alert add alertname=TestAlert severity=warning instance=test1 \
  --annotation=summary="Test alert" \
  --annotation=description="This is a test alert" \
  --alertmanager.url=http://localhost:9093
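To avoid repeating --alertmanager.url on every call, amtool can read defaults from a small configuration file; the per-user location is typically ~/.config/amtool/config.yml (shown here as an assumption, adjust for your system):
# ~/.config/amtool/config.yml
alertmanager.url: "http://localhost:9093"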
Troubleshooting
Common issues and their solutions:
- Alerts not being received:
  - Check network connectivity between Prometheus and Alert Manager
  - Verify that route configurations match the alert labels
  - Check that receivers are properly configured
- Duplicate alerts:
  - Ensure Alert Manager instances are clustered and can reach each other; deduplication relies on the cluster
  - Confirm each Prometheus server sends alerts to every Alert Manager instance directly, not through a load balancer
- Alert Manager not starting:
  - Validate your configuration with amtool check-config
  - Check for syntax errors in YAML files
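When tracing a delivery problem end to end, it helps to look at both sides of the connection. A couple of quick diagnostic calls (hostnames are placeholders for your own setup):
# Ask Prometheus which Alert Managers it is currently sending to
curl -s http://prometheus:9090/api/v1/alertmanagers

# Ask Alert Manager about its own version, configuration and cluster state
curl -s http://alertmanager:9093/api/v2/status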
Summary
Alert Manager is a powerful component of the Prometheus ecosystem that handles the delivery, grouping, deduplication, silencing, and inhibition of alerts. By understanding its core concepts and implementing best practices, you can create an effective alerting system that notifies the right people at the right time without causing alert fatigue.
The key concepts to remember:
- Grouping: Combining similar alerts to reduce noise
- Routing: Sending alerts to the appropriate teams
- Receivers: Defining notification methods (email, Slack, PagerDuty, etc.)
- Silences: Temporarily suppressing alerts
- Inhibition: Preventing less important alerts when critical alerts are active
Exercises
- Set up a local Alert Manager instance that sends notifications to a Slack channel
- Configure alert grouping to combine similar alerts from multiple instances
- Create a time-based routing configuration for business hours vs. off-hours
- Implement an inhibition rule that suppresses warning alerts when related critical alerts are firing
- Set up high availability with multiple Alert Manager instances