Alertmanager Configuration
Introduction
Alertmanager is a critical component in the Prometheus and Grafana ecosystem that handles alerts sent by client applications such as Prometheus servers or Grafana's alerting system. It takes care of deduplicating, grouping, and routing alerts to the correct receiver. When using Grafana Loki for log monitoring, properly configuring Alertmanager ensures that the right people are notified about issues at the right time.
In this guide, we'll learn how to configure Alertmanager to work with Grafana Loki, understand its key components, and see practical examples of how to set up effective alerting workflows.
Alertmanager Basics
Alertmanager processes alerts generated by monitoring systems and handles:
- Deduplication: Removes duplicate alerts from multiple similar sources
- Grouping: Combines related alerts into a single notification
- Routing: Directs alerts to the appropriate team or notification channel
- Silencing: Temporarily mutes alerts during maintenance or known issues
- Inhibition: Suppresses alerts when certain other alerts are already firing
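Silences, for instance, are just label matchers with an expiry. They are usually created in the Alertmanager UI, but `amtool` can do the same; a minimal sketch, assuming Alertmanager listens on http://alertmanager:9093:

```bash
# Mute the HighErrorRate alert for two hours of planned maintenance
amtool silence add alertname=HighErrorRate \
  --comment="Planned maintenance" \
  --duration=2h \
  --alertmanager.url=http://alertmanager:9093
```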
Alertmanager Configuration File
Alertmanager is configured via a YAML file, typically called `alertmanager.yml`. The configuration file consists of several main sections:
```yaml
global:
  # Global settings such as SMTP defaults and notification service API URLs
route:
  # The routing tree for alert notifications
receivers:
  # The notification integrations (email, Slack, etc.)
inhibit_rules:
  # Rules for muting alerts when others are firing
templates:
  # Custom notification templates
```
Let's examine each section in detail:
Global Configuration
The `global` section defines default parameters that apply to all alerts unless overridden elsewhere:
```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
```
- `resolve_timeout`: How long Alertmanager waits without receiving updates for an alert before treating it as resolved
- SMTP settings: Defaults for email notifications
- Messaging service settings: Such as Slack webhook URLs
Route Configuration
The `route` section defines how alerts are routed to receivers:
```yaml
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-emails'
  routes:
    - match:
        service: 'loki'
      receiver: 'loki-team'
      routes:
        - match:
            severity: 'critical'
          receiver: 'loki-oncall'
```
Key routing parameters:
- `group_by`: Labels to group alerts by
- `group_wait`: Initial delay before sending a notification for a new group
- `group_interval`: How long to wait before sending an updated notification when a group changes
- `repeat_interval`: How long to wait before resending notifications for unchanged groups
- `receiver`: Default receiver for this route
- `routes`: Nested routing trees for more specific routing
Receivers Configuration
The `receivers` section defines how notifications should be sent:
```yaml
receivers:
  - name: 'team-emails'
    email_configs:
      - to: 'team@example.org'
        send_resolved: true
  - name: 'loki-team'
    slack_configs:
      - channel: '#loki-alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'loki-oncall'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
        description: '{{ .CommonAnnotations.summary }}'
```
Each receiver can have multiple configurations for different notification channels like:
- `email_configs`
- `slack_configs`
- `pagerduty_configs`
- `webhook_configs`
- `victorops_configs`
- `pushover_configs`
- And more
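For example, a generic `webhook_configs` receiver only needs an HTTP endpoint that accepts Alertmanager's JSON payload; a minimal sketch (the receiver name and URL below are made up):

```yaml
receivers:
  - name: 'ops-webhook'
    webhook_configs:
      - url: 'http://hooks.internal.example:8080/alertmanager'  # placeholder endpoint
        send_resolved: true
        max_alerts: 10  # include at most 10 alerts per payload (0 means no limit)
```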
Inhibition Rules
Inhibition rules allow you to suppress certain alerts when others are firing:
```yaml
inhibit_rules:
  - source_match:
      severity: 'critical'
      alertname: 'LokiDown'
    target_match:
      severity: 'warning'
    equal: ['job', 'instance']
```
This rule would suppress any warning-level alerts from the same job and instance if a critical "LokiDown" alert is already firing.
Integrating with Grafana Loki
When using Alertmanager with Grafana Loki, you need to:
- Configure Loki rules to generate alerts
- Set up Grafana to forward alerts to Alertmanager
- Configure Alertmanager to handle Loki alerts
Example: Loki Rules
Here's how you might define a Loki rule that generates alerts:
```yaml
groups:
  - name: loki_rules
    rules:
      - alert: HighErrorRate
        expr: sum(rate({app="myapp"} |= "error" [5m])) / sum(rate({app="myapp"}[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
          service: loki
        annotations:
          summary: High error rate detected
          description: "Error rate is above 5% for more than 10 minutes for app {{ $labels.app }}"
```
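On the Loki side, the ruler component evaluates these rules and pushes firing alerts to Alertmanager. A minimal ruler snippet, assuming local rule storage under /loki/rules and Alertmanager reachable at http://alertmanager:9093 (the paths and URL are placeholders for your deployment):

```yaml
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules        # where rule group files like the one above live
  rule_path: /tmp/loki/rules-temp   # scratch directory the ruler uses during evaluation
  alertmanager_url: http://alertmanager:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true                  # lets you manage rules through the ruler API
```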
Example: Grafana Configuration
In your Grafana configuration (`grafana.ini` or the corresponding environment variables):
```ini
[unified_alerting]
enabled = true

[unified_alerting.alertmanager]
enabled = true

[alerting]
alertmanager_url = http://alertmanager:9093
```
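In recent Grafana versions you can also point Grafana at an external Alertmanager by provisioning it as a data source; a sketch, assuming the built-in Alertmanager data source type and a plain Prometheus-style Alertmanager (the file path and URL are placeholders):

```yaml
# e.g. /etc/grafana/provisioning/datasources/alertmanager.yaml
apiVersion: 1

datasources:
  - name: Alertmanager
    type: alertmanager
    url: http://alertmanager:9093
    access: proxy
    jsonData:
      # "prometheus" tells Grafana this is a vanilla Prometheus Alertmanager
      implementation: prometheus
```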
Practical Examples
Let's look at some real-world examples of Alertmanager configurations for different use cases.
Example 1: Basic Loki Alert Routing
```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/TXXXXXXXX/BXXXXXXXX/XXXXXXXX'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default-receiver'
  routes:
    - match:
        service: 'loki'
      receiver: 'loki-team'
      routes:
        - match:
            severity: 'critical'
          receiver: 'loki-oncall'
          continue: true
        - match:
            environment: 'production'
            severity: 'warning'
          receiver: 'loki-oncall'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#general-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'loki-team'
    slack_configs:
      - channel: '#loki-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'loki-oncall'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
        description: '{{ .CommonAnnotations.summary }}'
    slack_configs:
      - channel: '#loki-alerts'
        title: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
```
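Note the `continue: true` on the critical sub-route: by default Alertmanager stops at the first matching route, and `continue: true` tells it to keep evaluating the sibling routes as well, so an alert can be delivered through more than one branch of the tree.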
Example 2: Time-based Routing
This example routes alerts differently during business hours vs. off-hours:
```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'team-email'
  routes:
    - match:
        service: 'loki'
      receiver: 'loki-business-hours'
      routes:
        - match:
            severity: 'critical'
          receiver: 'loki-oncall'
        - match_re:
            severity: 'warning|critical'
          receiver: 'loki-oncall'
          active_time_intervals: ['offhours', 'weekends']

time_intervals:
  - name: 'business-hours'
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
  - name: 'offhours'
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '17:00'
            end_time: '24:00'
  - name: 'weekends'
    time_intervals:
      - weekdays: ['saturday', 'sunday']

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.org'
  - name: 'loki-business-hours'
    slack_configs:
      - channel: '#loki-alerts'
  - name: 'loki-oncall'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
    slack_configs:
      - channel: '#loki-oncall'
```
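The `active_time_intervals` field restricts the last sub-route to the named intervals: outside business hours and on weekends, warning-level Loki alerts page `loki-oncall`, while during business hours they fall through to the `loki-business-hours` Slack receiver. Critical alerts always reach `loki-oncall` via the first sub-route.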
Visualizing Alert Routing
To better understand how alerts are routed through Alertmanager, let's visualize the routing tree with a Mermaid diagram:
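The diagram below is a sketch of how an alert flows through the routing tree from Example 1 above (receiver names are taken from that configuration; the `continue` branch is simplified):

```mermaid
graph TD
    A[Alert arrives] --> B{service = loki?}
    B -- no --> C[default-receiver Slack channel]
    B -- yes --> D{severity = critical?}
    D -- yes --> E[loki-oncall PagerDuty and Slack]
    D -- no --> F{production warning?}
    F -- yes --> E
    F -- no --> G[loki-team Slack channel]
```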
Best Practices
When configuring Alertmanager with Loki, consider these best practices:
- Group alerts intelligently: Choose labels that help correlate related issues
- Use proper timing settings:
  - Short `group_wait` for critical alerts
  - Longer `repeat_interval` for non-critical alerts
- Implement a severity hierarchy:
  - `critical`: Requires immediate action
  - `warning`: Needs attention but not immediately
  - `info`: For informational purposes
- Create good alert descriptions:
  - Include what's happening
  - Why it matters
  - Potential remediation steps
- Test your alerts: Use the Alertmanager API to send test alerts
- Use templates for consistent notifications:
  - Keep format consistent across channels
  - Include relevant links to dashboards or runbooks
Alertmanager Templates
Templates allow you to customize notification content. Create a template file:
```
{{ define "slack.custom.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.custom.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Details:*
{{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
{{ end }}
{{ end }}
{{ end }}
```
Then reference it in your configuration:
```yaml
templates:
  - '/etc/alertmanager/templates/custom.tmpl'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.custom.title" . }}'
        text: '{{ template "slack.custom.text" . }}'
```
Common Troubleshooting
If you're having issues with your Alertmanager configuration:
- Check the Alertmanager logs:

  ```bash
  docker logs alertmanager
  ```

- Validate your configuration:

  ```bash
  amtool check-config /path/to/alertmanager.yml
  ```

- Test routing rules:

  ```bash
  amtool config routes test --config.file=/path/to/alertmanager.yml \
    --verify.receivers=loki-oncall service=loki severity=critical
  ```

- View current alerts:

  ```bash
  curl -s http://alertmanager:9093/api/v1/alerts | jq
  ```
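To check routing and notification delivery end to end, you can also push a synthetic alert; a sketch that reuses the Loki labels from the routing examples above (the alert name and Alertmanager URL are placeholders):

```bash
# Fire a test alert carrying the labels the loki-oncall route matches on
amtool alert add TestLokiAlert \
  service=loki severity=critical \
  --annotation=summary="Synthetic test alert" \
  --alertmanager.url=http://alertmanager:9093
```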
Summary
Alertmanager is a powerful tool for managing alerts generated by Grafana Loki and other monitoring systems. By properly configuring routes, receivers, and templates, you can ensure that alerts are sent to the right people at the right time, reducing alert fatigue and improving incident response.
Key takeaways:
- Use proper grouping to reduce noise
- Set up routing trees to direct alerts to appropriate receivers
- Customize notifications for each channel
- Implement inhibition rules to reduce duplicate alerts
- Use templates for consistent notifications
- Test your configuration before deploying to production
Exercises
- Create an Alertmanager configuration that routes different severity alerts to different Slack channels.
- Set up time-based routing for a team that has different on-call schedules during business hours vs. nights and weekends.
- Create custom templates for email and Slack notifications that include links to relevant dashboards.
- Configure inhibition rules to suppress lower-priority alerts when a related critical alert is firing.