Alertmanager Configuration
Introduction
Alertmanager is a critical component in the Prometheus and Grafana ecosystem that handles alerts sent by client applications such as Prometheus servers or Grafana's alerting system. It takes care of deduplicating, grouping, and routing alerts to the correct receiver. When using Grafana Loki for log monitoring, properly configuring Alertmanager ensures that the right people are notified about issues at the right time.
In this guide, we'll learn how to configure Alertmanager to work with Grafana Loki, understand its key components, and see practical examples of how to set up effective alerting workflows.
Alertmanager Basics
Alertmanager processes alerts generated by monitoring systems and handles:
- Deduplication: Removes duplicate alerts from multiple similar sources
- Grouping: Combines related alerts into a single notification
- Routing: Directs alerts to the appropriate team or notification channel
- Silencing: Temporarily mutes alerts during maintenance or known issues
- Inhibition: Suppresses alerts when certain other alerts are already firing
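Silences, for instance, are just label matchers with an expiry. They are usually created in the Alertmanager UI, but `amtool` can do the same; a minimal sketch, assuming Alertmanager listens on http://alertmanager:9093:

```bash
# Mute the HighErrorRate alert for two hours of planned maintenance
amtool silence add alertname=HighErrorRate \
  --comment="Planned maintenance" \
  --duration=2h \
  --alertmanager.url=http://alertmanager:9093
```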
Alertmanager Configuration File
Alertmanager is configured via a YAML file, typically called `alertmanager.yml`. The configuration file consists of several main sections:
```yaml
global:
  # Global settings such as SMTP defaults and notification service API URLs
route:
  # The routing tree for alert notifications
receivers:
  # The notification integrations (email, Slack, etc.)
inhibit_rules:
  # Rules for muting alerts when others are firing
templates:
  # Custom notification templates
```
Let's examine each section in detail:
Global Configuration
The `global` section defines default parameters that apply to all alerts unless overridden elsewhere:
```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
```
- `resolve_timeout`: How long Alertmanager waits without receiving updates for an alert before treating it as resolved
- SMTP settings: Defaults for email notifications
- Messaging service settings: Such as Slack webhook URLs
Route Configuration
The `route` section defines how alerts are routed to receivers:
```yaml
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-emails'
  routes:
    - match:
        service: 'loki'
      receiver: 'loki-team'
      routes:
        - match:
            severity: 'critical'
          receiver: 'loki-oncall'
```
Key routing parameters:
- `group_by`: Labels to group alerts by
- `group_wait`: Initial delay before sending a notification for a new group
- `group_interval`: How long to wait before sending an updated notification when a group changes
- `repeat_interval`: How long to wait before resending notifications for unchanged groups
- `receiver`: Default receiver for this route
- `routes`: Nested routing trees for more specific routing
Receivers Configuration
The `receivers` section defines how notifications should be sent:
```yaml
receivers:
  - name: 'team-emails'
    email_configs:
      - to: 'team@example.org'
        send_resolved: true
  - name: 'loki-team'
    slack_configs:
      - channel: '#loki-alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'loki-oncall'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
        description: '{{ .CommonAnnotations.summary }}'
```
Each receiver can have multiple configurations for different notification channels like:
- `email_configs`
- `slack_configs`
- `pagerduty_configs`
- `webhook_configs`
- `victorops_configs`
- `pushover_configs`
- And more
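For example, a generic `webhook_configs` receiver only needs an HTTP endpoint that accepts Alertmanager's JSON payload; a minimal sketch (the receiver name and URL below are made up):

```yaml
receivers:
  - name: 'ops-webhook'
    webhook_configs:
      - url: 'http://hooks.internal.example:8080/alertmanager'  # placeholder endpoint
        send_resolved: true
        max_alerts: 10  # include at most 10 alerts per payload (0 means no limit)
```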
Inhibition Rules
Inhibition rules allow you to suppress certain alerts when others are firing:
```yaml
inhibit_rules:
  - source_match:
      severity: 'critical'
      alertname: 'LokiDown'
    target_match:
      severity: 'warning'
    equal: ['job', 'instance']
```
This rule would suppress any warning-level alerts from the same job and instance if a critical "LokiDown" alert is already firing.
Integrating with Grafana Loki
When using Alertmanager with Grafana Loki, you need to:
- Configure Loki rules to generate alerts
- Set up Grafana to forward alerts to Alertmanager
- Configure Alertmanager to handle Loki alerts
Example: Loki Rules
Here's how you might define a Loki rule that generates alerts:
```yaml
groups:
  - name: loki_rules
    rules:
      - alert: HighErrorRate
        expr: sum(rate({app="myapp"} |= "error" [5m])) / sum(rate({app="myapp"}[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
          service: loki
        annotations:
          summary: High error rate detected
          description: "Error rate is above 5% for more than 10 minutes for app {{ $labels.app }}"
```
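On the Loki side, the ruler component evaluates these rules and pushes firing alerts to Alertmanager. A minimal ruler snippet, assuming local rule storage under /loki/rules and Alertmanager reachable at http://alertmanager:9093 (the paths and URL are placeholders for your deployment):

```yaml
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules        # where rule group files like the one above live
  rule_path: /tmp/loki/rules-temp   # scratch directory the ruler uses during evaluation
  alertmanager_url: http://alertmanager:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true                  # lets you manage rules through the ruler API
```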
Example: Grafana Configuration
In your Grafana configuration (`grafana.ini` or the corresponding environment variables):
```ini
[unified_alerting]
enabled = true

[unified_alerting.alertmanager]
enabled = true

[alerting]
alertmanager_url = http://alertmanager:9093
```
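In recent Grafana versions you can also point Grafana at an external Alertmanager by provisioning it as a data source; a sketch, assuming the built-in Alertmanager data source type and a plain Prometheus-style Alertmanager (the file path and URL are placeholders):

```yaml
# e.g. /etc/grafana/provisioning/datasources/alertmanager.yaml
apiVersion: 1

datasources:
  - name: Alertmanager
    type: alertmanager
    url: http://alertmanager:9093
    access: proxy
    jsonData:
      # "prometheus" tells Grafana this is a vanilla Prometheus Alertmanager
      implementation: prometheus
```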
Practical Examples
Let's look at some real-world examples of Alertmanager configurations for different use cases.
Example 1: Basic Loki Alert Routing
```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXX'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default-receiver'
  routes:
    - match:
        service: 'loki'
      receiver: 'loki-team'
      routes:
        - match:
            severity: 'critical'
          receiver: 'loki-oncall'
          continue: true
        - match:
            environment: 'production'
            severity: 'warning'
          receiver: 'loki-oncall'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#general-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'loki-team'
    slack_configs:
      - channel: '#loki-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'loki-oncall'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
        description: '{{ .CommonAnnotations.summary }}'
    slack_configs:
      - channel: '#loki-alerts'
        title: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
```
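Note the `continue: true` on the critical sub-route: by default Alertmanager stops at the first matching route, and `continue: true` tells it to keep evaluating the sibling routes as well, so an alert can be delivered through more than one branch of the tree.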
Example 2: Time-based Routing
This example routes alerts differently during business hours vs. off-hours:
```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'team-email'
  routes:
    - match:
        service: 'loki'
      receiver: 'loki-business-hours'
      routes:
        - match:
            severity: 'critical'
          receiver: 'loki-oncall'
        - match_re:
            severity: 'warning|critical'
          receiver: 'loki-oncall'
          active_time_intervals: ['offhours', 'weekends']

time_intervals:
  - name: 'business-hours'
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
  - name: 'offhours'
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '17:00'
            end_time: '24:00'
  - name: 'weekends'
    time_intervals:
      - weekdays: ['saturday', 'sunday']

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.org'
  - name: 'loki-business-hours'
    slack_configs:
      - channel: '#loki-alerts'
  - name: 'loki-oncall'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
    slack_configs:
      - channel: '#loki-oncall'
```
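The `active_time_intervals` field restricts the last sub-route to the named intervals: outside business hours and on weekends, warning-level Loki alerts page `loki-oncall`, while during business hours they fall through to the `loki-business-hours` Slack receiver. Critical alerts always reach `loki-oncall` via the first sub-route.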
Visualizing Alert Routing
To better understand how alerts are routed through Alertmanager, let's visualize the routing tree with a Mermaid diagram:
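The diagram below is a sketch of how an alert flows through the routing tree from Example 1 above (receiver names are taken from that configuration; the `continue` branch is simplified):

```mermaid
graph TD
    A[Alert arrives] --> B{service = loki?}
    B -- no --> C[default-receiver Slack channel]
    B -- yes --> D{severity = critical?}
    D -- yes --> E[loki-oncall PagerDuty and Slack]
    D -- no --> F{production warning?}
    F -- yes --> E
    F -- no --> G[loki-team Slack channel]
```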
Best Practices
When configuring Alertmanager with Loki, consider these best practices:
- Group alerts intelligently: Choose labels that help correlate related issues
- Use proper timing settings:
  - Short `group_wait` for critical alerts
  - Longer `repeat_interval` for non-critical alerts
- Implement a severity hierarchy:
  - `critical`: Requires immediate action
  - `warning`: Needs attention but not immediately
  - `info`: For informational purposes
- Create good alert descriptions:
  - Include what's happening
  - Why it matters
  - Potential remediation steps
- Test your alerts: Use the Alertmanager API to send test alerts
- Use templates for consistent notifications:
  - Keep format consistent across channels
  - Include relevant links to dashboards or runbooks
Alertmanager Templates
Templates allow you to customize notification content. Create a template file:
```
{{ define "slack.custom.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.custom.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Details:*
{{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
{{ end }}
{{ end }}
{{ end }}
```
Then reference it in your configuration:
```yaml
templates:
  - '/etc/alertmanager/templates/custom.tmpl'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.custom.title" . }}'
        text: '{{ template "slack.custom.text" . }}'
```
Common Troubleshooting
If you're having issues with your Alertmanager configuration:
- Check the Alertmanager logs:

  ```bash
  docker logs alertmanager
  ```

- Validate your configuration:

  ```bash
  amtool check-config /path/to/alertmanager.yml
  ```

- Test routing rules:

  ```bash
  amtool config routes test --config.file=/path/to/alertmanager.yml \
    --verify.receivers=loki-oncall service=loki severity=critical
  ```

- View current alerts:

  ```bash
  curl -s http://alertmanager:9093/api/v1/alerts | jq
  ```
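To check routing and notification delivery end to end, you can also push a synthetic alert; a sketch that reuses the Loki labels from the routing examples above (the alert name and Alertmanager URL are placeholders):

```bash
# Fire a test alert carrying the labels the loki-oncall route matches on
amtool alert add TestLokiAlert \
  service=loki severity=critical \
  --annotation=summary="Synthetic test alert" \
  --alertmanager.url=http://alertmanager:9093
```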
Summary
Alertmanager is a powerful tool for managing alerts generated by Grafana Loki and other monitoring systems. By properly configuring routes, receivers, and templates, you can ensure that alerts are sent to the right people at the right time, reducing alert fatigue and improving incident response.
Key takeaways:
- Use proper grouping to reduce noise
- Set up routing trees to direct alerts to appropriate receivers
- Customize notifications for each channel
- Implement inhibition rules to reduce duplicate alerts
- Use templates for consistent notifications
- Test your configuration before deploying to production
Exercises
- Create an Alertmanager configuration that routes different severity alerts to different Slack channels.
- Set up time-based routing for a team that has different on-call schedules during business hours vs. nights and weekends.
- Create custom templates for email and Slack notifications that include links to relevant dashboards.
- Configure inhibition rules to suppress lower-priority alerts when a related critical alert is firing.