Alertmanager Configuration
Introduction
Alertmanager is a critical component in the Prometheus ecosystem that handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing alerts to the correct receiver integration, such as email, Slack, or PagerDuty. Additionally, it handles silencing and inhibition of alerts to reduce noise and alert fatigue.
In this guide, we'll explore how to configure Alertmanager effectively to ensure your team receives the right alerts at the right time through the right channels.
Understanding Alertmanager's Role
Before diving into configuration, let's understand where Alertmanager fits in the Prometheus ecosystem:
When an alert rule is triggered in Prometheus, it sends the alert to Alertmanager, which then processes it according to its configuration to determine:
- How to group similar alerts
- Where to send the notification
- When to silence or inhibit certain alerts
- How to handle repeated notifications
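For context, here is a hedged sketch of the Prometheus side of that hand-off. The rule name, metric, and file paths are illustrative placeholders, and it assumes Alertmanager is listening on its default port 9093:

# prometheus.yml (excerpt): tell Prometheus where Alertmanager listens
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'alert_rules.yml'

# alert_rules.yml (excerpt): an example rule whose firing alerts are sent to Alertmanager
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        # assumes an http_requests_total counter with a status label exists
        expr: sum(rate(http_requests_total{status="500"}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          description: 'HTTP 5xx error ratio has been above 5% for 10 minutes'

Once a rule like this fires, everything that follows in this guide decides what happens to the resulting alert.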
Basic Configuration Structure
Alertmanager's configuration is defined in a YAML file, typically named alertmanager.yml. Let's look at the basic structure:
global:
  # Global configuration parameters that apply to all other sections

templates:
  # Templates for notifications

route:
  # The routing tree for notifications

receivers:
  # The different receiving endpoints

inhibit_rules:
  # Rules for inhibiting certain alerts when others are firing

time_intervals:
  # Time intervals that can be used in routes
Global Configuration
The global section defines parameters that apply as defaults across the entire configuration:
global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.example.org:587'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  slack_api_url: 'https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
Key parameters include:
- resolve_timeout: How long to wait before resolving an alert when it's no longer firing
- Integration-specific settings like SMTP for email or API URLs for services like Slack
Configuring Routes
The routing tree is the heart of Alertmanager configuration. It determines how alerts are grouped and which receiver should get which alerts.
route:
  # The root route with common parameters
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-emails'

  # Child routes for specific cases
  routes:
    - match:
        severity: critical
      receiver: 'team-pagers'
      continue: true
    - match:
        service: database
      receiver: 'database-team'
    - match_re:
        service: ^(frontend|backend)$
      receiver: 'web-team'
Let's break down these parameters:
- group_by: Groups alerts with matching labels, reducing notification noise
- group_wait: How long to buffer alerts of the same group before sending the initial notification
- group_interval: How long to wait before sending a notification about new alerts added to a group
- repeat_interval: How long to wait before resending an alert notification if it's still firing
- receiver: The default receiver for this route
- routes: Child routes with specific matching criteria
- match / match_re: Match labels exactly or with regular expressions
- continue: If true, continue matching sibling routes even after this one matches
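Reasoning about a routing tree by hand gets error-prone quickly. Recent versions of amtool can report which receivers a given label set would be routed to; a quick check might look like this (the file path and labels are just examples):

# Show which receiver(s) an alert with these labels would be routed to
amtool config routes test --config.file=/path/to/alertmanager.yml severity=critical service=database

With the route above, an alert carrying both labels should match 'team-pagers' (because of continue: true) and then 'database-team'.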
Setting Up Receivers
Receivers define where notifications are sent:
receivers:
  - name: 'team-emails'
    email_configs:
      - to: '[email protected]'
        send_resolved: true

  - name: 'team-pagers'
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'
        send_resolved: true

  - name: 'web-team'
    slack_configs:
      - channel: '#web-alerts'
        send_resolved: true
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'
Each receiver can have multiple configurations of the same type or mix different types. Common receiver types include:
- email_configs: For email notifications
- slack_configs: For Slack notifications
- pagerduty_configs: For PagerDuty notifications
- webhook_configs: For custom webhook integrations
- victorops_configs: For VictorOps notifications
- pushover_configs: For Pushover notifications
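For systems that none of the built-in integrations cover, webhook_configs simply POSTs the grouped alerts as JSON to an HTTP endpoint you operate. A minimal sketch (the URL is a placeholder for your own service):

receivers:
  - name: 'custom-webhook'
    webhook_configs:
      - url: 'http://alert-handler.example.org:8080/notify'   # your own HTTP endpoint
        send_resolved: true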
Inhibition Rules
Inhibition rules allow you to suppress notifications for certain alerts when other alerts are already firing, which helps reduce alert fatigue:
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
This example inhibits all warning alerts that have the same alertname, cluster, and service labels as critical alerts that are already firing.
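Another common pattern is to mute everything on a host that is already known to be down, so a single outage doesn't fan out into dozens of notifications. A hedged sketch, assuming your own rules emit a NodeDown alert and attach an instance label:

inhibit_rules:
  - source_match:
      alertname: 'NodeDown'          # assumed alert name from your own rules
    target_match_re:
      severity: 'warning|info'       # suppress lower-severity alerts...
    equal: ['instance']              # ...but only on the same instance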
Time-Based Routing with Time Intervals
Time intervals let you define specific time periods that can be used in routing:
time_intervals:
  - name: workdays
    time_intervals:
      - weekdays: ['monday:friday']
        location: 'America/New_York'

  - name: workhours
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '17:00'
        weekdays: ['monday:friday']
        location: 'America/New_York'
You can then use these in your routes:
route:
  # ...other configuration...
  routes:
    - match:
        severity: critical
      receiver: 'team-pagers'
    - match:
        severity: warning
      receiver: 'team-emails'
      active_time_intervals:
        - workhours
This ensures warning alerts only trigger the 'team-emails' receiver during work hours.
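Routes also support the inverse, mute_time_intervals, which suppresses notifications during the named intervals instead of outside them. A short sketch, assuming an additional interval named offhours has been defined alongside workhours:

route:
  # ...other configuration...
  routes:
    - match:
        severity: warning
      receiver: 'team-emails'
      mute_time_intervals:
        - offhours   # assumed interval covering nights and weekends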
Templates for Notification Customization
Templates allow you to customize the format of your notifications:
templates:
- '/etc/alertmanager/templates/*.tmpl'
A simple template file (email.tmpl) might look like:
{{ define "email.subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }}
{{ end }}
{{ define "email.text" }}
{{ if gt (len .Alerts.Firing) 0 }}
Firing Alerts:
{{ range .Alerts.Firing }}
- {{ .Labels.alertname }}: {{ .Annotations.description }}
{{ end }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
Resolved Alerts:
{{ range .Alerts.Resolved }}
- {{ .Labels.alertname }}: {{ .Annotations.description }}
{{ end }}
{{ end }}
{{ end }}
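Defining templates is only half the job; receivers have to reference them by name. A minimal sketch of wiring the email.subject and email.text definitions above into an email receiver:

receivers:
  - name: 'team-emails'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: '{{ template "email.subject" . }}'
        text: '{{ template "email.text" . }}'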
Complete Example Configuration
Here's a more complete example configuration that ties everything together:
global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.example.org:587'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  slack_api_url: 'https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-emails'
  routes:
    - match:
        severity: critical
      receiver: 'team-pagers'
      group_wait: 10s
      repeat_interval: 1h
    - match:
        service: database
      receiver: 'database-team'
      group_by: ['alertname', 'instance', 'cluster']
    - match_re:
        service: ^(frontend|backend)$
      receiver: 'web-team'
      active_time_intervals:
        - workhours

receivers:
  - name: 'team-emails'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
  - name: 'team-pagers'
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'
        send_resolved: true
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true
  - name: 'database-team'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
    slack_configs:
      - channel: '#db-alerts'
        send_resolved: true
  - name: 'web-team'
    slack_configs:
      - channel: '#web-alerts'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

time_intervals:
  - name: workhours
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '17:00'
        weekdays: ['monday:friday']
        location: 'America/New_York'
Running and Verifying Alertmanager
To run Alertmanager with your configuration:
alertmanager --config.file=/path/to/alertmanager.yml
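If you prefer running it in a container, a roughly equivalent invocation with the official prom/alertmanager image might look like this (the host path is a placeholder):

# Run Alertmanager in Docker, mounting your configuration file into the container
docker run -d --name alertmanager \
  -p 9093:9093 \
  -v /path/to/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager --config.file=/etc/alertmanager/alertmanager.yml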
You can verify your configuration without starting Alertmanager using:
amtool check-config /path/to/alertmanager.yml
This will validate your configuration file and report any errors.
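To check routing end to end, you can also push a synthetic alert straight into Alertmanager's v2 API and watch where it lands; something along these lines, assuming the default port 9093:

# Fire a test alert; it should appear in the UI and be routed like any other alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "warning"}, "annotations": {"description": "Manual test alert"}}]'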
Best Practices for Alertmanager Configuration
When setting up your Alertmanager, consider these best practices:
- Group related alerts: Use meaningful group_by parameters to reduce notification noise
- Set appropriate timing parameters:
  - Use shorter group_wait for critical alerts
  - Use longer repeat_interval for non-critical alerts
- Route based on severity: Route critical alerts to high-priority channels like pagers
- Use inhibition rules: Suppress less important alerts when critical alerts are active
- Implement time-based routing: Don't wake people up for non-critical issues
- Test your configuration: Use amtool to verify configuration and test alert routing
Practical Example: Multi-Team Setup
Let's look at a practical example for an organization with multiple teams:
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'job']
  routes:
    - match:
        team: infrastructure
      receiver: 'infrastructure-team'
      routes:
        - match:
            severity: critical
          receiver: 'infrastructure-pager'
    - match:
        team: application
      receiver: 'application-team'
      routes:
        - match:
            severity: critical
          receiver: 'application-pager'
    - match:
        team: database
      receiver: 'database-team'
      routes:
        - match:
            severity: critical
          receiver: 'database-pager'
This configuration:
- Groups alerts by alertname and job
- Routes alerts to teams based on the team label
- For each team, routes critical alerts to a pager receiver
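The routing tree above refers to receivers that are not shown; a hedged sketch of what they might look like (the addresses, channels, and service keys are placeholders, not part of the original example):

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'
  - name: 'infrastructure-team'
    slack_configs:
      - channel: '#infra-alerts'
  - name: 'infrastructure-pager'
    pagerduty_configs:
      - service_key: '<infrastructure-service-key>'
  - name: 'application-team'
    slack_configs:
      - channel: '#app-alerts'
  - name: 'application-pager'
    pagerduty_configs:
      - service_key: '<application-service-key>'
  - name: 'database-team'
    email_configs:
      - to: '[email protected]'
  - name: 'database-pager'
    pagerduty_configs:
      - service_key: '<database-service-key>'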
Summary
Alertmanager is a powerful tool for managing alerts in your monitoring system. By configuring it properly, you can ensure the right people receive the right notifications at the right time, reducing alert fatigue and improving response times to critical issues.
In this guide, we've covered:
- The basic structure of Alertmanager configuration
- How to configure routing trees for alert notification
- Setting up different types of receivers
- Using inhibition rules to reduce notification noise
- Implementing time-based routing
- Customizing notification templates
- Best practices for effective alerting
With these tools, you can build an alerting system that balances the need for timely notifications with the importance of not overwhelming your team with unnecessary alerts.
Exercises
- Basic Configuration: Create a simple Alertmanager configuration that sends all alerts to your email address.
- Advanced Routing: Set up a routing tree that sends critical alerts to a Slack channel during work hours and to PagerDuty after hours.
- Template Customization: Create a custom template for Slack notifications that includes alert details and links to your dashboards.
- Inhibition Rules: Configure inhibition rules that suppress warning alerts when related critical alerts are firing.
- Time-Based Routing: Set up different routing behaviors for weekdays versus weekends.