Prometheus Alert Manager

Introduction

Alert Manager is a critical component in the Prometheus ecosystem that handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing alerts to the correct receiver integration such as email, PagerDuty, or Slack. Alert Manager also handles silencing and inhibition of alerts.

Think of Alert Manager as the communication center for your monitoring system - Prometheus generates the alerts based on metrics, but Alert Manager decides what to do with those alerts, who should receive them, and how they should be organized.

Understanding Alert Manager's Role

Before diving into Alert Manager, it's important to understand the full alerting workflow in Prometheus:

  1. Prometheus Server: Evaluates alert rules and generates alerts
  2. Alert Manager: Receives alerts, processes them, and sends notifications
  3. Receivers: Various notification channels configured to receive alerts

Installation and Setup

Installing Alert Manager

You can download the pre-compiled binary from the Prometheus downloads page or use Docker to run Alert Manager:

bash
# Download and extract Alert Manager
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar xvfz alertmanager-0.25.0.linux-amd64.tar.gz
cd alertmanager-0.25.0.linux-amd64/

# Run Alert Manager
./alertmanager --config.file=alertmanager.yml

Using Docker:

bash
docker run --name alertmanager -d -p 9093:9093 \
-v $(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
prom/alertmanager:latest
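
If you manage containers with Docker Compose, an equivalent setup might look like the sketch below (assuming alertmanager.yml sits next to the compose file; adjust image tag and paths to your environment):

yaml
# docker-compose.yml (illustrative sketch)
services:
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - '9093:9093'
    volumes:
      # Mount the local configuration file into the container
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml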

Basic Configuration

Alert Manager is configured using a YAML file. Here's a simple configuration example:

yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: '[email protected]'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

Core Concepts

1. Grouping

Alert Manager groups similar alerts together to reduce noise. For example, if a network outage affects multiple servers, you don't want to receive dozens of separate notifications.

yaml
route:
  # Group alerts by cluster and alertname
  group_by: ['cluster', 'alertname']

  # Wait 30s to collect similar alerts before sending notification
  group_wait: 30s

  # Wait 5m before sending notification about new alerts for the same group
  group_interval: 5m

Example scenario: A database cluster with 5 nodes experiences high CPU usage. Instead of receiving 5 separate alerts, you receive one alert stating "High CPU usage on database cluster (5 affected instances)".
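
In the notification itself, the whole group is exposed to templates as .Alerts, so a single message can summarize every affected node. As a hedged sketch (the receiver name, channel, and title template here are our own choices, and the api_url is a placeholder), a Slack receiver could surface the group size like this:

yaml
receivers:
  - name: 'slack-grouped'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'  # placeholder webhook URL
        channel: '#alerts'
        # .Alerts holds every alert in the group, so one message covers all affected instances
        title: '{{ .CommonLabels.alertname }} ({{ len .Alerts }} affected instances)'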

2. Routing

Routing determines which team or person should receive particular alerts.

yaml
route:
  # Default receiver if no matching child routes
  receiver: 'operations-team'

  # Child routes for specific alert types
  routes:
    - match:
        service: 'database'
      receiver: 'database-team'

    - match:
        service: 'frontend'
      receiver: 'frontend-team'

    - match_re:
        service: 'api|backend'
      receiver: 'backend-team'

In this example:

  • Database alerts go to the database team
  • Frontend alerts go to the frontend team
  • API and backend alerts go to the backend team
  • All other alerts go to the operations team
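
Child routes are evaluated top to bottom, and by default the first matching route wins. If an alert should reach more than one team, you can set continue: true on a route so evaluation carries on to the routes below it. A small sketch (the service matcher and team names here are hypothetical):

yaml
route:
  receiver: 'operations-team'
  routes:
    - match:
        service: 'database'
      receiver: 'database-team'
      # Keep evaluating later routes, so this alert can match more than one
      continue: true

    - match:
        severity: 'critical'
      receiver: 'oncall-team'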

3. Receivers

Receivers define how notifications should be sent. Alert Manager supports multiple notification methods:

yaml
receivers:
  - name: 'operations-team'
    email_configs:
      - to: '[email protected]'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

  - name: 'database-team'
    pagerduty_configs:
      - service_key: 'database-team-key'

  - name: 'frontend-team'
    webhook_configs:
      - url: 'http://example.org/notify'

4. Silences

Silences temporarily suppress notifications based on matchers. This is useful during maintenance windows or when dealing with known issues.

You can create silences through the Alert Manager UI or via the API:

bash
# Using the amtool CLI
amtool silence add alertname="HighCPULoad" instance="server1.example.org" --comment="Maintenance window" --duration=2h

5. Inhibition

Inhibition allows suppressing certain alerts when other alerts are active. For example, you might want to suppress all warning-level alerts related to a system when a critical alert is already firing.

yaml
inhibit_rules:
  - source_match:
      severity: 'critical'
      app: 'mysql'
    target_match:
      severity: 'warning'
      app: 'mysql'
    # Only inhibit if these labels match between both alerts
    equal: ['env', 'instance']
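
Recent Alert Manager releases (v0.22 and later) also accept a matchers-style syntax for inhibition rules. If you prefer it, the same rule could be written as the sketch below (source_match/target_match still work but are deprecated):

yaml
inhibit_rules:
  - source_matchers:
      - severity = "critical"
      - app = "mysql"
    target_matchers:
      - severity = "warning"
      - app = "mysql"
    equal: ['env', 'instance']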

Integrating Alert Manager with Prometheus

To configure Prometheus to send alerts to Alert Manager, add the following to your prometheus.yml:

yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'

rule_files:
  - 'alert_rules.yml'

Then, define alert rules in alert_rules.yml:

yaml
groups:
  - name: example
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Practical Examples

Example 1: High-Availability Configuration

For production environments, you'll want to run multiple Alert Manager instances for high availability:

yaml
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager1:9093'
            - 'alertmanager2:9093'
            - 'alertmanager3:9093'

Each Alert Manager instance must also know about its peers. Clustering is configured with command-line flags rather than in alertmanager.yml:

bash
# Run on alertmanager1; repeat on each instance, listing the other two as peers
./alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager2:9094 \
  --cluster.peer=alertmanager3:9094

Example 2: Escalating Notifications

Here's how to route critical alerts to a receiver that pages the on-call services. Note that acknowledgement-based escalation itself is handled by the paging service (PagerDuty or OpsGenie), not by Alert Manager:

yaml
route:
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'critical-alerts'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'

  - name: 'critical-alerts'
    email_configs:
      - to: '[email protected]'
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'

    # OpsGenie applies its own escalation policy if the alert is not acknowledged
    opsgenie_configs:
      - api_key: '<opsgenie-api-key>'
        note: 'Alert has been firing for over 10 minutes'
        description: '{{ .CommonAnnotations.description }}'
        message: '{{ .CommonAnnotations.summary }}'
        priority: 'P1'

Example 3: Time-Based Routing

Sending alerts to different teams based on business hours:

yaml
route:
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'daytime-oncall'
      # Only notify during Mon-Fri 9AM-5PM (New York time)
      active_time_intervals:
        - weekdays
      # Keep evaluating the next route so off-hours alerts reach the night team
      continue: true

    - match:
        severity: critical
      receiver: 'nighttime-oncall'
      # Muted during business hours, active at all other times
      mute_time_intervals:
        - weekdays

time_intervals:
  - name: weekdays
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
        location: 'America/New_York'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'

  - name: 'daytime-oncall'
    email_configs:
      - to: '[email protected]'

  - name: 'nighttime-oncall'
    email_configs:
      - to: '[email protected]'
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'

Alert Manager UI

Alert Manager provides a web interface (typically at http://localhost:9093) where you can:

  1. View current alerts
  2. Create and manage silences
  3. Filter and group alerts by label
  4. View the status of Alert Manager itself

Best Practices

  1. Keep it Simple: Start with a basic configuration and expand as needed
  2. Use Labels Effectively: Design a good labeling strategy for routing
  3. Configure Reasonable Timeouts: Set appropriate group_wait, group_interval, and repeat_interval values
  4. Implement High Availability: Run multiple Alert Manager instances
  5. Test Your Setup: Use tools like amtool to verify configuration
  6. Document Alert Policies: Ensure team members know how to respond to different alerts
  7. Avoid Alert Fatigue: Group similar alerts and only alert on actionable issues

Testing with amtool

The amtool command-line utility is useful for testing Alert Manager configurations:

bash
# Check if configuration is valid
amtool check-config alertmanager.yml

# Test alert routing
amtool config routes test --config.file=alertmanager.yml \
alertname=HighCPULoad severity=critical instance=web1.example.com

# Manually trigger an alert
amtool alert add alertname=TestAlert severity=warning instance=test1 \
--annotation=summary="Test alert" \
--annotation=description="This is a test alert"

Troubleshooting

Common issues and their solutions:

  1. Alerts not being received:

    • Check network connectivity between Prometheus and Alert Manager
    • Verify route configurations match the alert labels
    • Check that receivers are properly configured
  2. Duplicate alerts:

    • Ensure Alert Manager instances can communicate with each other
    • Check for proper deduplication settings
  3. Alert Manager not starting:

    • Validate your configuration with amtool check-config
    • Check for syntax errors in YAML files

Summary

Alert Manager is a powerful component of the Prometheus ecosystem that handles the delivery, grouping, deduplication, silencing, and inhibition of alerts. By understanding its core concepts and implementing best practices, you can create an effective alerting system that notifies the right people at the right time without causing alert fatigue.

The key concepts to remember:

  • Grouping: Combining similar alerts to reduce noise
  • Routing: Sending alerts to the appropriate teams
  • Receivers: Defining notification methods (email, Slack, PagerDuty, etc.)
  • Silences: Temporarily suppressing alerts
  • Inhibition: Preventing less important alerts when critical alerts are active

Exercises

  1. Set up a local Alert Manager instance that sends notifications to a Slack channel
  2. Configure alert grouping to combine similar alerts from multiple instances
  3. Create a time-based routing configuration for business hours vs. off-hours
  4. Implement an inhibition rule that suppresses warning alerts when related critical alerts are firing
  5. Set up high availability with multiple Alert Manager instances
