Alert Grouping and Routing

Introduction

When monitoring complex systems with Prometheus, it's common to experience a flood of alerts when something goes wrong. For example, if a database server fails, you might receive dozens of related alerts about high latency, failed queries, and connection errors. This phenomenon, known as an "alert storm," can lead to notification fatigue and make it difficult to identify the root cause of an issue.

Alertmanager, a component of the Prometheus ecosystem, solves this problem through two powerful mechanisms: Alert Grouping and Alert Routing. These features help organize alerts intelligently and ensure they reach the right teams through appropriate notification channels.

In this guide, we'll explore how to configure and use these capabilities to build an effective alerting system.

Alert Grouping

Alert grouping combines related alerts into a single notification, reducing noise and providing context for troubleshooting.

How Grouping Works

Alertmanager groups alerts based on labels. Alerts with the same grouping labels are combined into a single notification. By default, Alertmanager groups alerts by the alertname label, but you can customize this behavior.

Configuring Alert Groups

Grouping is configured in the Alertmanager configuration file. Here's a basic example:

yaml
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

Let's break down these configuration options:

  • group_by: The labels used to group alerts together; alerts that share the same values for all of these labels are combined into one group
  • group_wait: How long to buffer a newly created group before sending the initial notification, so that related alerts arriving shortly afterwards can be included
  • group_interval: How long to wait before sending a notification about new alerts added to a group that has already been notified about
  • repeat_interval: How long to wait before re-sending a notification for a group whose alerts are still firing
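
If you ever need to disable aggregation entirely, Alertmanager also accepts the special value '...' in group_by, which groups by all labels so that every distinct alert becomes its own notification. A minimal sketch:

yaml
route:
  group_by: ['...']  # group by all labels, effectively disabling aggregation
  group_wait: 30s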

Example: Grouping in Action

Consider these alerts triggered simultaneously:

Alert 1: { alertname="HighCPUUsage", cluster="prod-us-east", service="api", instance="server1" }
Alert 2: { alertname="HighCPUUsage", cluster="prod-us-east", service="api", instance="server2" }
Alert 3: { alertname="HighCPUUsage", cluster="prod-us-east", service="database", instance="db1" }

With the grouping configuration above, Alertmanager would create two notification groups:

  1. Group 1: Alert 1 and Alert 2 (same alertname, cluster, and service)
  2. Group 2: Alert 3 (different service)

Instead of receiving three separate notifications, you would receive two more meaningful grouped notifications.
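
One way to see this behaviour for yourself (a sketch, assuming a local Alertmanager is listening on http://localhost:9093 and amtool is installed) is to fire the three example alerts and watch how they are grouped:

bash
# Fire the three example alerts against a local Alertmanager
AM_URL=http://localhost:9093
amtool alert add alertname=HighCPUUsage cluster=prod-us-east service=api instance=server1 --alertmanager.url=$AM_URL
amtool alert add alertname=HighCPUUsage cluster=prod-us-east service=api instance=server2 --alertmanager.url=$AM_URL
amtool alert add alertname=HighCPUUsage cluster=prod-us-east service=database instance=db1 --alertmanager.url=$AM_URL

Once group_wait has elapsed, the configured receiver should get two grouped notifications rather than three individual ones.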

Alert Routing

Alert routing directs different types of alerts to appropriate teams and notification channels based on labels.

Understanding Routing Trees

Alertmanager uses a tree structure for routing alerts. The configuration starts with a top-level route and can have nested child routes. Each route can specify:

  • Which alerts it handles (using matchers)
  • How alerts are grouped
  • Where notifications are sent (receivers)
  • How notifications are throttled

Configuring Alert Routes

Here's an example routing configuration:

yaml
route:
  receiver: 'default-receiver'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
  - match:
      service: 'database'
    receiver: 'database-team'
    group_by: ['alertname', 'cluster', 'instance']
    group_wait: 45s

  - match:
      severity: 'critical'
    receiver: 'pager-duty'
    group_wait: 10s

receivers:
- name: 'default-receiver'
  email_configs:
  - to: '[email protected]'

- name: 'database-team'
  email_configs:
  - to: '[email protected]'
  slack_configs:
  - channel: '#db-alerts'

- name: 'pager-duty'
  pagerduty_configs:
  - service_key: '<your-pagerduty-service-key>'

In this configuration:

  1. All alerts go to the default-receiver by default
  2. Alerts with service: database are sent to the database-team receiver
  3. Alerts with severity: critical are sent to the pager-duty receiver

Child routes are evaluated in order, and an alert stops at the first matching route unless that route sets continue: true. In this example, a critical database alert would therefore be delivered only to database-team.
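
To check where a particular label set would end up, amtool can walk the routing tree for you (a sketch, assuming the configuration is saved as alertmanager.yml):

bash
# Show which receiver(s) an alert with these labels would be routed to
amtool config routes test --config.file=alertmanager.yml service=database severity=critical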

Matchers

Routes use matchers to determine which alerts they should handle. Matchers support four operators:

  • =: Exact match
  • !=: Negative match
  • =~: Regex match
  • !~: Negative regex match

Example of more complex matching:

yaml
routes:
- matchers:
  - service=~"api|web"
  - environment="production"
  receiver: 'prod-webapp-team'

This route matches alerts where the service label matches "api" or "web" AND the environment label equals "production".
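
As another illustration (a sketch with made-up label values and a hypothetical 'prod-oncall' receiver), negative matchers are handy for keeping non-production noise away from a paging channel:

yaml
routes:
- matchers:
  - severity="critical"
  - environment!~"dev|staging"  # exclude non-production environments
  receiver: 'prod-oncall'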

Receiver Types

Alertmanager supports various notification channels:

  • Email
  • Slack
  • PagerDuty
  • Webhook
  • OpsGenie
  • VictorOps
  • Pushover
  • And more...

You can configure multiple notification methods for each receiver.
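
For instance, a single receiver can fan out to several channels at once. A sketch (the receiver name and webhook URL are placeholders):

yaml
receivers:
- name: 'ops-team'
  slack_configs:
  - channel: '#ops-alerts'
  webhook_configs:
  - url: 'http://alert-bridge.internal:8080/hooks/alertmanager'  # placeholder endpoint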

Putting It All Together: A Real-World Example

Let's look at a more comprehensive example that demonstrates both grouping and routing:

yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  receiver: 'operations-team'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
  - matchers:
    - service=~"api|frontend"
    receiver: 'application-team'
    group_by: ['alertname', 'service', 'instance']

  - matchers:
    - service=~"database|cache"
    receiver: 'infrastructure-team'
    group_by: ['alertname', 'service', 'instance']
    routes:
    - matchers:
      - severity="critical"
      receiver: 'database-oncall'
      group_wait: 10s
      continue: true

  - matchers:
    - team="security"
    receiver: 'security-team'
    group_wait: 1m

receivers:
- name: 'operations-team'
  email_configs:
  - to: '[email protected]'
  slack_configs:
  - channel: '#ops-alerts'

- name: 'application-team'
  email_configs:
  - to: '[email protected]'
  slack_configs:
  - channel: '#app-alerts'

- name: 'infrastructure-team'
  email_configs:
  - to: '[email protected]'
  slack_configs:
  - channel: '#infra-alerts'

- name: 'database-oncall'
  pagerduty_configs:
  - service_key: 'your-pagerduty-service-key'

- name: 'security-team'
  email_configs:
  - to: '[email protected]'
  slack_configs:
  - channel: '#security-alerts'

This configuration:

  1. Routes alerts to different teams based on service labels
  2. Sends critical database alerts to an on-call engineer via PagerDuty
  3. Uses different grouping strategies for different types of alerts
  4. Customizes wait times for each route

Visual Representation of Routing
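
If you want to see the shape of this routing tree without drawing it by hand, amtool can render it as text (a sketch, assuming the configuration above is saved as alertmanager.yml):

bash
# Print the routing tree defined in the configuration file
amtool config routes show --config.file=alertmanager.yml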

Advanced Features

Inhibition Rules

Inhibition rules allow you to suppress certain alerts when other alerts are firing. This is useful when you have a high-level alert that makes lower-level alerts redundant.

Example:

yaml
inhibit_rules:
- source_matchers:
  - alertname="NodeDown"
  target_matchers:
  - severity="warning"
  equal: ['cluster', 'instance']

This rule suppresses warning-level alerts on any cluster and instance where a NodeDown alert is currently firing.
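
Another common pattern (a sketch, independent of the example above) is to let a critical alert silence the warning-level version of the same alert on the same instance:

yaml
inhibit_rules:
- source_matchers:
  - severity="critical"
  target_matchers:
  - severity="warning"
  equal: ['alertname', 'instance']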

Time-Based Routing

You can implement time-based routing using time intervals: define named intervals at the top level and reference them from a route with active_time_intervals, so the route only applies during those periods. This is useful for directing alerts to different teams based on working hours or on-call schedules.

Example:

yaml
routes:
- matchers:
  - severity="critical"
  receiver: 'daytime-oncall'
  group_wait: 30s
  active_time_intervals: ['workday-hours']

- matchers:
  - severity="critical"
  receiver: 'nighttime-oncall'
  group_wait: 30s
  active_time_intervals: ['non-work-hours']

time_intervals:
- name: 'workday-hours'
  time_intervals:
  - weekdays: ['monday:friday']
    times:
    - start_time: '09:00'
      end_time: '17:00'

- name: 'non-work-hours'
  time_intervals:
  - weekdays: ['monday:friday']
    times:
    - start_time: '17:00'
      end_time: '24:00'
    - start_time: '00:00'
      end_time: '09:00'
  - weekdays: ['saturday', 'sunday']

Mute Timings

Mute timings allow you to silence notifications during specific time periods, such as planned maintenance windows.

yaml
time_intervals:
- name: 'maintenance-window'
  time_intervals:
  - weekdays: ['wednesday']
    times:
    - start_time: '22:00'
      end_time: '23:59'

route:
  receiver: 'operations-team'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  mute_time_intervals: ['maintenance-window']

Best Practices

  1. Keep Related Alerts Together: Group alerts that are likely to occur together and should be handled by the same team.

  2. Route by Severity: Consider routing alerts based on severity level to ensure critical issues get immediate attention.

  3. Route by Team Responsibility: Direct alerts to the teams that are responsible for the systems generating them.

  4. Document Your Routing Logic: Create a diagram of your routing tree to help teams understand where alerts will go.

  5. Test Your Configuration: Use the amtool command-line utility to validate and test your Alertmanager configuration:

bash
amtool check-config alertmanager.yml

  6. Start Simple: Begin with a basic configuration and add complexity as needed. It's easier to understand and troubleshoot a simpler routing tree.

  7. Use Templating: Customize notification messages using templates to include relevant information:

yaml
receivers:
- name: 'ops-team'
  slack_configs:
  - channel: '#alerts'
    title: '{{ .GroupLabels.alertname }}'
    text: >
      {{ range .Alerts }}
      *Alert:* {{ .Labels.alertname }}
      *Description:* {{ .Annotations.description }}
      *Severity:* {{ .Labels.severity }}
      {{ end }}

Summary

Alert grouping and routing are powerful features in Alertmanager that help tame notification noise and ensure alerts reach the right teams. By properly configuring these features, you can create an alerting system that:

  • Reduces alert fatigue by grouping related notifications
  • Directs alerts to the appropriate teams based on service, severity, or other criteria
  • Customizes notification behavior for different types of alerts
  • Uses appropriate communication channels for different situations

Effective alert management is crucial for maintaining system reliability and enabling quick response to issues. By investing time in properly configuring Alertmanager, you'll create a more manageable and useful alerting system.

Exercises

  1. Configure Alertmanager to group alerts by environment and service, with a 1-minute wait time.

  2. Create a routing configuration that sends:

    • Database alerts to the database team
    • Frontend alerts to the frontend team
    • Critical alerts to both PagerDuty and Slack
  3. Define an inhibition rule that suppresses disk space warnings when a "HostDown" alert is firing for the same instance.

  4. Create a time-based routing configuration that sends alerts to different teams during business hours versus after hours.


