Prometheus Alert Manager

Introduction

Alert Manager is a critical component in the Prometheus ecosystem that handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing alerts to the correct receiver integration such as email, PagerDuty, or Slack. Alert Manager also handles silencing and inhibition of alerts.

Think of Alert Manager as the communication center for your monitoring system - Prometheus generates the alerts based on metrics, but Alert Manager decides what to do with those alerts, who should receive them, and how they should be organized.

Understanding Alert Manager's Role

Before diving into Alert Manager, it's important to understand the full alerting workflow in Prometheus:

  1. Prometheus Server: Evaluates alert rules and generates alerts
  2. Alert Manager: Receives alerts, processes them, and sends notifications
  3. Receivers: Various notification channels configured to receive alerts

Installation and Setup

Installing Alert Manager

You can download the pre-compiled binary from the Prometheus downloads page or use Docker to run Alert Manager:

bash
# Download and extract Alert Manager
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar xvfz alertmanager-0.25.0.linux-amd64.tar.gz
cd alertmanager-0.25.0.linux-amd64/

# Run Alert Manager
./alertmanager --config.file=alertmanager.yml

Using Docker:

bash
docker run --name alertmanager -d -p 9093:9093 \
-v $(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
prom/alertmanager:latest
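
If you manage containers with Docker Compose, an equivalent setup might look like the sketch below (assuming alertmanager.yml sits next to the compose file; adjust image tag and paths to your environment):

yaml
# docker-compose.yml (illustrative sketch)
services:
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - '9093:9093'
    volumes:
      # Mount the local configuration file into the container
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml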

Basic Configuration

Alert Manager is configured using a YAML file. Here's a simple configuration example:

yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: '[email protected]'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

Core Concepts

1. Grouping

Alert Manager groups similar alerts together to reduce noise. For example, if a network outage affects multiple servers, you don't want to receive dozens of separate notifications.

yaml
route:
  # Group alerts by cluster and alertname
  group_by: ['cluster', 'alertname']

  # Wait 30s to collect similar alerts before sending notification
  group_wait: 30s

  # Wait 5m before sending notification about new alerts for the same group
  group_interval: 5m

Example scenario: A database cluster with 5 nodes experiences high CPU usage. Instead of receiving 5 separate alerts, you receive one alert stating "High CPU usage on database cluster (5 affected instances)".
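
In the notification itself, the whole group is exposed to templates as .Alerts, so a single message can summarize every affected node. As a hedged sketch (the receiver name, channel, and title template here are our own choices, and the api_url is a placeholder), a Slack receiver could surface the group size like this:

yaml
receivers:
  - name: 'slack-grouped'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'  # placeholder webhook URL
        channel: '#alerts'
        # .Alerts holds every alert in the group, so one message covers all affected instances
        title: '{{ .CommonLabels.alertname }} ({{ len .Alerts }} affected instances)'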

2. Routing

Routing determines which team or person should receive particular alerts.

yaml
route:
  # Default receiver if no matching child routes
  receiver: 'operations-team'

  # Child routes for specific alert types
  routes:
    - match:
        service: 'database'
      receiver: 'database-team'

    - match:
        service: 'frontend'
      receiver: 'frontend-team'

    - match_re:
        service: 'api|backend'
      receiver: 'backend-team'

In this example:

  • Database alerts go to the database team
  • Frontend alerts go to the frontend team
  • API and backend alerts go to the backend team
  • All other alerts go to the operations team
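
Child routes are evaluated top to bottom, and by default the first matching route wins. If an alert should reach more than one team, you can set continue: true on a route so evaluation carries on to the routes below it. A small sketch (the service matcher and team names here are hypothetical):

yaml
route:
  receiver: 'operations-team'
  routes:
    - match:
        service: 'database'
      receiver: 'database-team'
      # Keep evaluating later routes, so this alert can match more than one
      continue: true

    - match:
        severity: 'critical'
      receiver: 'oncall-team'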

3. Receivers

Receivers define how notifications should be sent. Alert Manager supports multiple notification methods:

yaml
receivers:
  - name: 'operations-team'
    email_configs:
      - to: '[email protected]'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

  - name: 'database-team'
    pagerduty_configs:
      - service_key: 'database-team-key'

  - name: 'frontend-team'
    webhook_configs:
      - url: 'http://example.org/notify'

4. Silences

Silences temporarily suppress notifications based on matchers. This is useful during maintenance windows or when dealing with known issues.

You can create silences through the Alert Manager UI or via the API:

bash
# Using the amtool CLI
amtool silence add alertname="HighCPULoad" instance="server1.example.org" --comment="Maintenance window" --duration=2h

5. Inhibition

Inhibition allows suppressing certain alerts when other alerts are active. For example, you might want to suppress all warning-level alerts related to a system when a critical alert is already firing.

yaml
inhibit_rules:
  - source_match:
      severity: 'critical'
      app: 'mysql'
    target_match:
      severity: 'warning'
      app: 'mysql'
    # Only inhibit if these labels match between both alerts
    equal: ['env', 'instance']
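
Recent Alert Manager releases (v0.22 and later) also accept a matchers-style syntax for inhibition rules. If you prefer it, the same rule could be written as the sketch below (source_match/target_match still work but are deprecated):

yaml
inhibit_rules:
  - source_matchers:
      - severity = "critical"
      - app = "mysql"
    target_matchers:
      - severity = "warning"
      - app = "mysql"
    equal: ['env', 'instance']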

Integrating Alert Manager with Prometheus

To configure Prometheus to send alerts to Alert Manager, add the following to your prometheus.yml:

yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'

rule_files:
  - 'alert_rules.yml'

Then, define alert rules in alert_rules.yml:

yaml
groups:
  - name: example
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Practical Examples

Example 1: High-Availability Configuration

For production environments, you'll want to run multiple Alert Manager instances for high availability:

yaml
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager1:9093'
            - 'alertmanager2:9093'
            - 'alertmanager3:9093'

Each Alert Manager instance must also know about its peers. Clustering is configured with command-line flags rather than in alertmanager.yml:

bash
# Run on alertmanager1; repeat on each instance, listing the other two as peers
./alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager2:9094 \
  --cluster.peer=alertmanager3:9094

Example 2: Escalating Notifications

Here's how to route critical alerts to a receiver that pages the on-call services. Note that acknowledgement-based escalation itself is handled by the paging service (PagerDuty or OpsGenie), not by Alert Manager:

yaml
route:
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'critical-alerts'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'

  - name: 'critical-alerts'
    email_configs:
      - to: '[email protected]'
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'

    # OpsGenie applies its own escalation policy if the alert is not acknowledged
    opsgenie_configs:
      - api_key: '<opsgenie-api-key>'
        note: 'Alert has been firing for over 10 minutes'
        description: '{{ .CommonAnnotations.description }}'
        message: '{{ .CommonAnnotations.summary }}'
        priority: 'P1'

Example 3: Time-Based Routing

Sending alerts to different teams based on business hours:

yaml
route:
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'daytime-oncall'
      # Only notify during Mon-Fri 9AM-5PM (New York time)
      active_time_intervals:
        - weekdays
      # Keep evaluating the next route so off-hours alerts reach the night team
      continue: true

    - match:
        severity: critical
      receiver: 'nighttime-oncall'
      # Muted during business hours, active at all other times
      mute_time_intervals:
        - weekdays

time_intervals:
  - name: weekdays
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
        location: 'America/New_York'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'

  - name: 'daytime-oncall'
    email_configs:
      - to: '[email protected]'

  - name: 'nighttime-oncall'
    email_configs:
      - to: '[email protected]'
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'

Alert Manager UI

Alert Manager provides a web interface (typically at http://localhost:9093) where you can:

  1. View current alerts
  2. Create and manage silences
  3. Filter and group alerts by label
  4. View the status of Alert Manager itself

Best Practices

  1. Keep it Simple: Start with a basic configuration and expand as needed
  2. Use Labels Effectively: Design a good labeling strategy for routing
  3. Configure Reasonable Timeouts: Set appropriate group_wait, group_interval, and repeat_interval values
  4. Implement High Availability: Run multiple Alert Manager instances
  5. Test Your Setup: Use tools like amtool to verify configuration
  6. Document Alert Policies: Ensure team members know how to respond to different alerts
  7. Avoid Alert Fatigue: Group similar alerts and only alert on actionable issues

Testing with amtool

The amtool command-line utility is useful for testing Alert Manager configurations:

bash
# Check if configuration is valid
amtool check-config alertmanager.yml

# Test alert routing
amtool config routes test --config.file=alertmanager.yml \
alertname=HighCPULoad severity=critical instance=web1.example.com

# Manually trigger an alert
amtool alert add alertname=TestAlert severity=warning instance=test1 \
--annotation=summary="Test alert" \
--annotation=description="This is a test alert"

Troubleshooting

Common issues and their solutions:

  1. Alerts not being received:

    • Check network connectivity between Prometheus and Alert Manager
    • Verify route configurations match the alert labels
    • Check that receivers are properly configured
  2. Duplicate alerts:

    • Ensure Alert Manager instances can communicate with each other
    • Check for proper deduplication settings
  3. Alert Manager not starting:

    • Validate your configuration with amtool check-config
    • Check for syntax errors in YAML files

Summary

Alert Manager is a powerful component of the Prometheus ecosystem that handles the delivery, grouping, deduplication, silencing, and inhibition of alerts. By understanding its core concepts and implementing best practices, you can create an effective alerting system that notifies the right people at the right time without causing alert fatigue.

The key concepts to remember:

  • Grouping: Combining similar alerts to reduce noise
  • Routing: Sending alerts to the appropriate teams
  • Receivers: Defining notification methods (email, Slack, PagerDuty, etc.)
  • Silences: Temporarily suppressing alerts
  • Inhibition: Preventing less important alerts when critical alerts are active

Exercises

  1. Set up a local Alert Manager instance that sends notifications to a Slack channel
  2. Configure alert grouping to combine similar alerts from multiple instances
  3. Create a time-based routing configuration for business hours vs. off-hours
  4. Implement an inhibition rule that suppresses warning alerts when related critical alerts are firing
  5. Set up high availability with multiple Alert Manager instances
