
Alertmanager Configuration

Introduction

Alertmanager is a critical component in the Prometheus and Grafana ecosystem that handles alerts sent by client applications such as Prometheus servers or Grafana's alerting system. It takes care of deduplicating, grouping, and routing alerts to the correct receiver. When using Grafana Loki for log monitoring, properly configuring Alertmanager ensures that the right people are notified about issues at the right time.

In this guide, we'll learn how to configure Alertmanager to work with Grafana Loki, understand its key components, and see practical examples of how to set up effective alerting workflows.

Alertmanager Basics

Alertmanager processes alerts generated by monitoring systems and handles:

  • Deduplication: Collapses duplicate copies of the same alert sent by multiple sources (for example, a highly available Prometheus pair) into one
  • Grouping: Combines related alerts into a single notification
  • Routing: Directs alerts to the appropriate team or notification channel
  • Silencing: Temporarily mutes alerts during maintenance or known issues (see the amtool example after this list)
  • Inhibition: Suppresses alerts when certain other alerts are already firing
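
Silences, for instance, don't have to be pre-configured; they can be created ad hoc with amtool. The matcher, duration, and Alertmanager URL below are only illustrative:

bash
# Mute LokiDown alerts for two hours during planned maintenance
amtool silence add alertname=LokiDown \
  --comment="planned maintenance" \
  --duration=2h \
  --alertmanager.url=http://alertmanager:9093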

Alertmanager Configuration File

Alertmanager is configured via a YAML file, typically called alertmanager.yml. The configuration file consists of several main sections:

yaml
global:
  # Global defaults such as SMTP settings, integration API URLs, and the resolve timeout

route:
  # The routing tree for alert notifications

receivers:
  # The notification integrations (email, Slack, etc.)

inhibit_rules:
  # Rules for muting alerts when others are firing

templates:
  # Paths to custom notification template files

Let's examine each section in detail:

Global Configuration

The global section defines default parameters that apply to all alerts unless overridden elsewhere:

yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

  • resolve_timeout: How long Alertmanager waits after last hearing about an alert before marking it resolved (only used when alerts don't carry their own end time)
  • SMTP settings: For email notifications
  • Messaging service settings: Such as Slack webhook URLs

Route Configuration

The route section defines how alerts are routed to receivers:

yaml
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-emails'

  routes:
    - match:
        service: 'loki'
      receiver: 'loki-team'
      routes:
        - match:
            severity: 'critical'
          receiver: 'loki-oncall'

Key routing parameters:

  • group_by: Labels to group alerts by
  • group_wait: Initial delay before sending a notification for a new group
  • group_interval: Interval between sending updated notifications for changed groups
  • repeat_interval: How long to wait before resending notifications for unchanged groups
  • receiver: Default receiver for this route
  • routes: Nested routing trees for more specific routing
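
A quick way to sanity-check the resulting tree is to have amtool print it; the config path below is a placeholder:

bash
amtool config routes show --config.file=/path/to/alertmanager.yml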

Receivers Configuration

The receivers section defines how notifications should be sent:

yaml
receivers:
  - name: 'team-emails'
    email_configs:
      - to: '[email protected]'
        send_resolved: true

  - name: 'loki-team'
    slack_configs:
      - channel: '#loki-alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'loki-oncall'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
        description: '{{ .CommonAnnotations.summary }}'

Each receiver can include one or more configurations for different notification channels, such as:

  • email_configs
  • slack_configs
  • pagerduty_configs
  • webhook_configs
  • victorops_configs
  • pushover_configs
  • And more
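
As a small illustration, a webhook receiver only needs an HTTP endpoint to POST alert payloads to; the URL below is a placeholder:

yaml
receivers:
  - name: 'custom-webhook'
    webhook_configs:
      - url: 'http://alert-handler.internal:8080/alerts'  # placeholder endpoint
        send_resolved: true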

Inhibition Rules

Inhibition rules allow you to suppress certain alerts when others are firing:

yaml
inhibit_rules:
  - source_match:
      severity: 'critical'
      alertname: 'LokiDown'
    target_match:
      severity: 'warning'
    equal: ['job', 'instance']

This rule would suppress any warning-level alerts from the same job and instance if a critical "LokiDown" alert is already firing.

Integrating with Grafana Loki

When using Alertmanager with Grafana Loki, you need to:

  1. Configure Loki rules to generate alerts
  2. Set up Grafana to forward alerts to Alertmanager
  3. Configure Alertmanager to handle Loki alerts

Example: Loki Rules

Here's how you might define a Loki rule that generates alerts:

yaml
groups:
  - name: loki_rules
    rules:
      - alert: HighErrorRate
        expr: sum(rate({app="myapp"} |= "error" [5m])) / sum(rate({app="myapp"}[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
          service: loki
        annotations:
          summary: High error rate detected
          description: "Error rate is above 5% for more than 10 minutes for app {{ $labels.app }}"
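
For these alerts to actually reach Alertmanager, Loki's ruler has to be pointed at it. A minimal ruler block might look like the sketch below; the paths, URL, and storage choices are assumptions for a simple single-node setup:

yaml
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules      # where rule files like the group above live (assumed path)
  rule_path: /tmp/loki-rules      # scratch directory used by the ruler (assumed path)
  alertmanager_url: http://alertmanager:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true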

Example: Grafana Configuration

In your Grafana configuration (grafana.ini or environment variables):

ini
[unified_alerting]
enabled = true

[unified_alerting.alertmanager]
enabled = true

[alerting]
alertmanager_url = http://alertmanager:9093

Practical Examples

Let's look at some real-world examples of Alertmanager configurations for different use cases.

Example 1: Basic Loki Alert Routing

yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXX'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default-receiver'
  routes:
    - match:
        service: 'loki'
      receiver: 'loki-team'
      routes:
        - match:
            severity: 'critical'
          receiver: 'loki-oncall'
          # continue: true lets the alert keep being evaluated against the sibling route below
          continue: true
        - match:
            environment: 'production'
            severity: 'warning'
          receiver: 'loki-oncall'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#general-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'loki-team'
    slack_configs:
      - channel: '#loki-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'loki-oncall'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
        description: '{{ .CommonAnnotations.summary }}'
    slack_configs:
      - channel: '#loki-alerts'
        title: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'

Example 2: Time-based Routing

This example routes alerts differently during business hours vs. off-hours:

yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'team-email'
  routes:
    - match:
        service: 'loki'
      receiver: 'loki-business-hours'
      routes:
        - match:
            severity: 'critical'
          receiver: 'loki-oncall'
        - match_re:
            severity: 'warning|critical'
          receiver: 'loki-oncall'
          # This route is only considered during the named time intervals
          active_time_intervals: ['offhours', 'weekends']

time_intervals:
  - name: 'business-hours'
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
  - name: 'offhours'
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '17:00'
            end_time: '24:00'
  - name: 'weekends'
    time_intervals:
      - weekdays: ['saturday', 'sunday']

receivers:
  - name: 'team-email'
    email_configs:
      - to: '[email protected]'

  - name: 'loki-business-hours'
    slack_configs:
      - channel: '#loki-alerts'

  - name: 'loki-oncall'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
    slack_configs:
      - channel: '#loki-oncall'

Visualizing Alert Routing

To see how alerts flow through a configuration like Example 1, picture the routing tree: every alert enters at the top-level route, alerts labeled service: 'loki' descend into the Loki route, critical alerts (and production warnings) descend further to the on-call receiver, and anything that matches no child route falls back to the default receiver.

Best Practices

When configuring Alertmanager with Loki, consider these best practices:

  1. Group alerts intelligently: Choose labels that help correlate related issues
  2. Use proper timing settings:
    • Short group_wait for critical alerts
    • Longer repeat_interval for non-critical alerts
  3. Implement a severity hierarchy:
    • critical: Requires immediate action
    • warning: Needs attention but not immediate
    • info: For informational purposes
  4. Create good alert descriptions:
    • What's happening
    • Why it matters
    • Potential remediation steps
  5. Test your alerts: Use the Alertmanager API to send test alerts (see the curl example after this list)
  6. Use templates for consistent notifications:
    • Keep format consistent across channels
    • Include relevant links to dashboards or runbooks
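
For item 5, a synthetic alert can be pushed straight to Alertmanager's v2 API with curl; the URL and labels below are only examples:

bash
# Post a one-off test alert (an array of alert objects) to the v2 API
curl -XPOST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "critical", "service": "loki"},
        "annotations": {"summary": "Test alert sent with curl"}}]'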

Alertmanager Templates

Templates allow you to customize notification content. Create a template file:

{{ define "slack.custom.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.custom.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Details:*
{{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
{{ end }}
{{ end }}
{{ end }}

Then reference it in your configuration:

yaml
templates:
  - '/etc/alertmanager/templates/custom.tmpl'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.custom.title" . }}'
        text: '{{ template "slack.custom.text" . }}'

Common Troubleshooting

If you're having issues with your Alertmanager configuration:

  1. Check the Alertmanager logs:

    bash
    docker logs alertmanager
  2. Validate your configuration:

    bash
    amtool check-config /path/to/alertmanager.yml
  3. Test routing rules:

    bash
    amtool config routes test --config.file=/path/to/alertmanager.yml \
    --verify.receivers=loki-oncall service=loki severity=critical
  4. View current alerts:

    bash
    curl -s http://alertmanager:9093/api/v2/alerts | jq
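  5. Check whether an active silence is muting your notifications (the Alertmanager URL here is an example):

    bash
    amtool silence query --alertmanager.url=http://alertmanager:9093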

Summary

Alertmanager is a powerful tool for managing alerts generated by Grafana Loki and other monitoring systems. By properly configuring routes, receivers, and templates, you can ensure that alerts are sent to the right people at the right time, reducing alert fatigue and improving incident response.

Key takeaways:

  • Use proper grouping to reduce noise
  • Set up routing trees to direct alerts to appropriate receivers
  • Customize notifications for each channel
  • Implement inhibition rules to reduce duplicate alerts
  • Use templates for consistent notifications
  • Test your configuration before deploying to production

Exercises

  1. Create an Alertmanager configuration that routes different severity alerts to different Slack channels.
  2. Set up time-based routing for a team that has different on-call schedules during business hours vs. nights and weekends.
  3. Create custom templates for email and Slack notifications that include links to relevant dashboards.
  4. Configure inhibition rules to suppress lower-priority alerts when a related critical alert is firing.

