Alert Grouping and Routing

Introduction

When monitoring complex systems with Prometheus, it's common to experience a flood of alerts when something goes wrong. For example, if a database server fails, you might receive dozens of related alerts about high latency, failed queries, and connection errors. This phenomenon, known as an "alert storm," can lead to notification fatigue and make it difficult to identify the root cause of an issue.

Alertmanager, a component of the Prometheus ecosystem, solves this problem through two powerful mechanisms: Alert Grouping and Alert Routing. These features help organize alerts intelligently and ensure they reach the right teams through appropriate notification channels.

In this guide, we'll explore how to configure and use these capabilities to build an effective alerting system.

Alert Grouping

Alert grouping combines related alerts into a single notification, reducing noise and providing context for troubleshooting.

How Grouping Works

Alertmanager groups alerts based on labels. Alerts with the same grouping labels are combined into a single notification. By default, Alertmanager groups alerts by the alertname label, but you can customize this behavior.

Configuring Alert Groups

Grouping is configured in the Alertmanager configuration file. Here's a basic example:

yaml
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

Let's break down these configuration options:

  • group_by: The labels used to group alerts together; alerts that share the same values for all of these labels are combined into one group
  • group_wait: How long to buffer a newly created group before sending the initial notification, so that related alerts arriving shortly afterwards can be included
  • group_interval: How long to wait before sending a notification about new alerts added to a group that has already been notified about
  • repeat_interval: How long to wait before re-sending a notification for a group whose alerts are still firing
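
If you ever need to disable aggregation entirely, Alertmanager also accepts the special value '...' in group_by, which groups by all labels so that every distinct alert becomes its own notification. A minimal sketch:

yaml
route:
  group_by: ['...']  # group by all labels, effectively disabling aggregation
  group_wait: 30s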

Example: Grouping in Action

Consider these alerts triggered simultaneously:

Alert 1: { alertname="HighCPUUsage", cluster="prod-us-east", service="api", instance="server1" }
Alert 2: { alertname="HighCPUUsage", cluster="prod-us-east", service="api", instance="server2" }
Alert 3: { alertname="HighCPUUsage", cluster="prod-us-east", service="database", instance="db1" }

With the grouping configuration above, Alertmanager would create two notification groups:

  1. Group 1: Alert 1 and Alert 2 (same alertname, cluster, and service)
  2. Group 2: Alert 3 (different service)

Instead of receiving three separate notifications, you would receive two more meaningful grouped notifications.
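
One way to see this behaviour for yourself (a sketch, assuming a local Alertmanager is listening on http://localhost:9093 and amtool is installed) is to fire the three example alerts and watch how they are grouped:

bash
# Fire the three example alerts against a local Alertmanager
AM_URL=http://localhost:9093
amtool alert add alertname=HighCPUUsage cluster=prod-us-east service=api instance=server1 --alertmanager.url=$AM_URL
amtool alert add alertname=HighCPUUsage cluster=prod-us-east service=api instance=server2 --alertmanager.url=$AM_URL
amtool alert add alertname=HighCPUUsage cluster=prod-us-east service=database instance=db1 --alertmanager.url=$AM_URL

Once group_wait has elapsed, the configured receiver should get two grouped notifications rather than three individual ones.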

Alert Routing

Alert routing directs different types of alerts to appropriate teams and notification channels based on labels.

Understanding Routing Trees

Alertmanager uses a tree structure for routing alerts. The configuration starts with a top-level route and can have nested child routes. Each route can specify:

  • Which alerts it handles (using matchers)
  • How alerts are grouped
  • Where notifications are sent (receivers)
  • How notifications are throttled

Configuring Alert Routes

Here's an example routing configuration:

yaml
route:
  receiver: 'default-receiver'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
  - match:
      service: 'database'
    receiver: 'database-team'
    group_by: ['alertname', 'cluster', 'instance']
    group_wait: 45s

  - match:
      severity: 'critical'
    receiver: 'pager-duty'
    group_wait: 10s

receivers:
- name: 'default-receiver'
  email_configs:
  - to: '[email protected]'

- name: 'database-team'
  email_configs:
  - to: '[email protected]'
  slack_configs:
  - channel: '#db-alerts'

- name: 'pager-duty'
  pagerduty_configs:
  - service_key: '<your-pagerduty-service-key>'

In this configuration:

  1. All alerts go to the default-receiver by default
  2. Alerts with service: database are sent to the database-team receiver
  3. Alerts with severity: critical are sent to the pager-duty receiver

Child routes are evaluated in order, and an alert stops at the first matching route unless that route sets continue: true. In this example, a critical database alert would therefore be delivered only to database-team.
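
To check where a particular label set would end up, amtool can walk the routing tree for you (a sketch, assuming the configuration is saved as alertmanager.yml):

bash
# Show which receiver(s) an alert with these labels would be routed to
amtool config routes test --config.file=alertmanager.yml service=database severity=critical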

Matchers

Routes use matchers to determine which alerts they should handle. Matchers support four operators:

  • =: Exact match
  • !=: Negative match
  • =~: Regex match
  • !~: Negative regex match

Example of more complex matching:

yaml
routes:
- matchers:
  - service=~"api|web"
  - environment="production"
  receiver: 'prod-webapp-team'

This route matches alerts where the service label matches "api" or "web" AND the environment label equals "production".
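
As another illustration (a sketch with made-up label values and a hypothetical 'prod-oncall' receiver), negative matchers are handy for keeping non-production noise away from a paging channel:

yaml
routes:
- matchers:
  - severity="critical"
  - environment!~"dev|staging"  # exclude non-production environments
  receiver: 'prod-oncall'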

Receiver Types

Alertmanager supports various notification channels:

  • Email
  • Slack
  • PagerDuty
  • Webhook
  • OpsGenie
  • VictorOps
  • Pushover
  • And more...

You can configure multiple notification methods for each receiver.
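
For instance, a single receiver can fan out to several channels at once. A sketch (the receiver name and webhook URL are placeholders):

yaml
receivers:
- name: 'ops-team'
  slack_configs:
  - channel: '#ops-alerts'
  webhook_configs:
  - url: 'http://alert-bridge.internal:8080/hooks/alertmanager'  # placeholder endpoint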

Putting It All Together: A Real-World Example

Let's look at a more comprehensive example that demonstrates both grouping and routing:

yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  receiver: 'operations-team'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
  - matchers:
    - service=~"api|frontend"
    receiver: 'application-team'
    group_by: ['alertname', 'service', 'instance']

  - matchers:
    - service=~"database|cache"
    receiver: 'infrastructure-team'
    group_by: ['alertname', 'service', 'instance']
    routes:
    - matchers:
      - severity="critical"
      receiver: 'database-oncall'
      group_wait: 10s
      continue: true

  - matchers:
    - team="security"
    receiver: 'security-team'
    group_wait: 1m

receivers:
- name: 'operations-team'
  email_configs:
  - to: '[email protected]'
  slack_configs:
  - channel: '#ops-alerts'

- name: 'application-team'
  email_configs:
  - to: '[email protected]'
  slack_configs:
  - channel: '#app-alerts'

- name: 'infrastructure-team'
  email_configs:
  - to: '[email protected]'
  slack_configs:
  - channel: '#infra-alerts'

- name: 'database-oncall'
  pagerduty_configs:
  - service_key: 'your-pagerduty-service-key'

- name: 'security-team'
  email_configs:
  - to: '[email protected]'
  slack_configs:
  - channel: '#security-alerts'

This configuration:

  1. Routes alerts to different teams based on service labels
  2. Sends critical database alerts to an on-call engineer via PagerDuty
  3. Uses different grouping strategies for different types of alerts
  4. Customizes wait times for each route

Visual Representation of Routing
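
If you want to see the shape of this routing tree without drawing it by hand, amtool can render it as text (a sketch, assuming the configuration above is saved as alertmanager.yml):

bash
# Print the routing tree defined in the configuration file
amtool config routes show --config.file=alertmanager.yml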

Advanced Features

Inhibition Rules

Inhibition rules allow you to suppress certain alerts when other alerts are firing. This is useful when you have a high-level alert that makes lower-level alerts redundant.

Example:

yaml
inhibit_rules:
- source_matchers:
  - alertname="NodeDown"
  target_matchers:
  - severity="warning"
  equal: ['cluster', 'instance']

This rule suppresses warning-level alerts on any cluster and instance where a NodeDown alert is currently firing.
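
Another common pattern (a sketch, independent of the example above) is to let a critical alert silence the warning-level version of the same alert on the same instance:

yaml
inhibit_rules:
- source_matchers:
  - severity="critical"
  target_matchers:
  - severity="warning"
  equal: ['alertname', 'instance']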

Time-Based Routing

You can implement time-based routing using time intervals: define named intervals at the top level and reference them from a route with active_time_intervals, so the route only applies during those periods. This is useful for directing alerts to different teams based on working hours or on-call schedules.

Example:

yaml
routes:
- matchers:
  - severity="critical"
  receiver: 'daytime-oncall'
  group_wait: 30s
  active_time_intervals: ['workday-hours']

- matchers:
  - severity="critical"
  receiver: 'nighttime-oncall'
  group_wait: 30s
  active_time_intervals: ['non-work-hours']

time_intervals:
- name: 'workday-hours'
  time_intervals:
  - weekdays: ['monday:friday']
    times:
    - start_time: '09:00'
      end_time: '17:00'

- name: 'non-work-hours'
  time_intervals:
  - weekdays: ['monday:friday']
    times:
    - start_time: '17:00'
      end_time: '24:00'
    - start_time: '00:00'
      end_time: '09:00'
  - weekdays: ['saturday', 'sunday']

Mute Timings

Mute timings allow you to silence notifications during specific time periods, such as planned maintenance windows.

yaml
time_intervals:
- name: 'maintenance-window'
  time_intervals:
  - weekdays: ['wednesday']
    times:
    - start_time: '22:00'
      end_time: '23:59'

route:
  receiver: 'operations-team'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  mute_time_intervals: ['maintenance-window']

Best Practices

  1. Keep Related Alerts Together: Group alerts that are likely to occur together and should be handled by the same team.

  2. Route by Severity: Consider routing alerts based on severity level to ensure critical issues get immediate attention.

  3. Route by Team Responsibility: Direct alerts to the teams that are responsible for the systems generating them.

  4. Document Your Routing Logic: Create a diagram of your routing tree to help teams understand where alerts will go.

  5. Test Your Configuration: Use the amtool command-line utility to validate and test your Alertmanager configuration:

bash
amtool check-config alertmanager.yml

  6. Start Simple: Begin with a basic configuration and add complexity as needed. It's easier to understand and troubleshoot a simpler routing tree.

  7. Use Templating: Customize notification messages using templates to include relevant information:

yaml
receivers:
- name: 'ops-team'
  slack_configs:
  - channel: '#alerts'
    title: '{{ .GroupLabels.alertname }}'
    text: >
      {{ range .Alerts }}
      *Alert:* {{ .Labels.alertname }}
      *Description:* {{ .Annotations.description }}
      *Severity:* {{ .Labels.severity }}
      {{ end }}

Summary

Alert grouping and routing are powerful features in Alertmanager that help tame notification noise and ensure alerts reach the right teams. By properly configuring these features, you can create an alerting system that:

  • Reduces alert fatigue by grouping related notifications
  • Directs alerts to the appropriate teams based on service, severity, or other criteria
  • Customizes notification behavior for different types of alerts
  • Uses appropriate communication channels for different situations

Effective alert management is crucial for maintaining system reliability and enabling quick response to issues. By investing time in properly configuring Alertmanager, you'll create a more manageable and useful alerting system.

Exercises

  1. Configure Alertmanager to group alerts by environment and service, with a 1-minute wait time.

  2. Create a routing configuration that sends:

    • Database alerts to the database team
    • Frontend alerts to the frontend team
    • Critical alerts to both PagerDuty and Slack
  3. Define an inhibition rule that suppresses disk space warnings when a "HostDown" alert is firing for the same instance.

  4. Create a time-based routing configuration that sends alerts to different teams during business hours versus after hours.


