Alert Routing
Introduction
Alert routing is a crucial component in any monitoring and alerting system, including Grafana Loki. It ensures that the right alerts reach the right people at the right time. Without proper alert routing, important notifications might go unnoticed, or teams might be overwhelmed with alerts that aren't relevant to them.
In this guide, we'll explore how alert routing works in Grafana Loki, how to configure routing trees, and best practices for implementing an effective alert routing strategy. By the end, you'll understand how to ensure critical alerts are delivered to the appropriate teams and individuals.
Understanding Alert Routing
Alert routing determines the path that an alert takes from the moment it fires in Grafana Loki until it reaches its intended recipients. This process involves several components:
- Alert Rules: Define conditions that trigger alerts
- Alert Instances: Individual alerts generated when conditions are met
- Notification Policies: Define how alerts are grouped, timed, muted, and routed
- Contact Points: Specify where alerts are sent (email, Slack, etc.)
- Routing Trees: Hierarchical structures that match alerts to notification policies
Put simply, the flow is: alert rule → alert instance → notification policy (routing tree) → contact point.
Configuring Alert Routing in Grafana Loki
Setting Up Contact Points
Before configuring routing, you need to set up contact points - destinations where alerts will be delivered. Common contact points include:
- Slack
- Webhook
- PagerDuty
- OpsGenie
- Discord
Here's an example of configuring a Slack contact point using the Grafana UI:
- Navigate to Alerting → Contact points
- Click "New contact point"
- Select "Slack" as the integration
- Fill in the required fields:
  - Name: team-backend-slack
  - Slack webhook URL: your webhook URL
  - Message settings: customize as needed
Alternatively, you can configure contact points using YAML:
apiVersion: 1
contactPoints:
  - name: slack-notifications
    receivers:
      - uid: slack-backend-team
        type: slack
        settings:
          url: https://hooks.slack.com/services/your-webhook-url
          title: '{{ template "slack.default.title" . }}'
          text: '{{ template "slack.default.message" . }}'
Creating Notification Policies
Notification policies determine how alerts are processed. They specify:
- Grouping of similar alerts
- Timing for notifications (when to send, resend, etc.)
- Which contact point to use
- Muting timings (when to suppress alerts)
Here's an example of a notification policy configuration:
apiVersion: 1
policies:
  - orgId: 1
    receiver: slack-notifications
    group_by: ['alertname', 'job']
    repeat_interval: 4h
    routes:
      - receiver: backend-team-email
        group_by: ['alertname', 'severity']
        matchers:
          - team = "backend"
        mute_time_intervals:
          - weekends
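The timing behavior mentioned above is controlled by three policy fields: group_wait (how long to wait before sending the first notification for a new group), group_interval (how long to wait before notifying about changes to an existing group), and repeat_interval (how long to wait before re-sending a notification that is still firing). Here's a minimal sketch of the default policy above with these fields set; the durations are illustrative values, not recommendations:
apiVersion: 1
policies:
  - orgId: 1
    receiver: slack-notifications
    group_by: ['alertname', 'job']
    group_wait: 30s       # wait before sending the first notification for a new group
    group_interval: 5m    # wait before notifying about new alerts added to an existing group
    repeat_interval: 4h   # wait before re-sending a notification that is still firing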
Building Routing Trees
Routing trees define the hierarchy of nested notification policies and use label matchers to determine which route an alert should take. The routing tree itself is the nested routes block in the notification policy above; what decides where an alert lands are the labels attached to it. Here's a simple example of alert rules that attach the team and severity labels used for routing:
apiVersion: 1
groups:
  - name: Loki Rules
    rules:
      - name: high_error_rate
        condition: rate(loki_error_total[5m]) > 0.1
        labels:
          severity: critical
          team: backend
      - name: api_latency
        condition: rate(api_request_duration_seconds_sum[5m]) / rate(api_request_duration_seconds_count[5m]) > 0.5
        labels:
          severity: warning
          team: api
With this configuration, alerts carrying the label team=backend match the nested route in the notification policy above and are delivered to the backend-team-email contact point, while all other alerts fall back to the default slack-notifications contact point.
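Matchers are not limited to exact equality; negative (!=) and regular-expression (=~, !~) forms are also available. As a small sketch, a hypothetical route that catches several teams with one regex matcher might look like the following; the route name and both contact points are assumptions for illustration:
apiVersion: 1
policies:
  - orgId: 1
    receiver: default-notifications
    routes:
      # Hypothetical route: matches alerts whose team label is "backend" or "api",
      # except purely informational ones
      - receiver: platform-oncall
        matchers:
          - team =~ "backend|api"
          - severity != "info"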
Practical Example: Setting Up Team-Based Alert Routing
Let's walk through a real-world example of configuring alert routing for different teams in an organization.
Scenario
You have a microservices architecture with three teams:
- Backend Team: Responsible for core services
- Frontend Team: Manages user interface components
- Database Team: Handles database operations
Each team needs to receive alerts relevant to their domain.
Step 1: Create Contact Points
First, set up contact points for each team:
apiVersion: 1
contactPoints:
  - name: backend-team
    receivers:
      - uid: backend-slack
        type: slack
        settings:
          url: https://hooks.slack.com/services/backend-webhook
  - name: frontend-team
    receivers:
      - uid: frontend-slack
        type: slack
        settings:
          url: https://hooks.slack.com/services/frontend-webhook
  - name: database-team
    receivers:
      - uid: database-slack
        type: slack
        settings:
          url: https://hooks.slack.com/services/database-webhook
  - name: all-teams
    receivers:
      - uid: all-teams-email
        type: email
        settings:
          addresses: [email protected]
Step 2: Create Notification Policies with Routing
Now, set up a routing tree with notification policies:
apiVersion: 1
policies:
  - orgId: 1
    receiver: all-teams
    group_by: ['alertname']
    routes:
      - receiver: backend-team
        matchers:
          - team = "backend"
        group_by: ['alertname', 'service']
        continue: false
      - receiver: frontend-team
        matchers:
          - team = "frontend"
        group_by: ['alertname', 'component']
        continue: false
      - receiver: database-team
        matchers:
          - team = "database"
        group_by: ['alertname', 'database']
        continue: false
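To see how matching plays out, here is a walk-through of how a hypothetical alert instance would traverse the tree above; the label values are made up for illustration:
# A hypothetical alert instance with labels:
#   alertname = "BackendServiceDown", team = "backend", service = "auth"
#
# 1. The default policy (receiver: all-teams) is the entry point.
# 2. The first nested route matches team = "backend", so the alert is grouped by
#    ['alertname', 'service'] and sent to the backend-team contact point.
# 3. continue: false stops evaluation, so no other route (and not all-teams) is notified.
#
# An alert with no team label matches none of the nested routes and falls back
# to the default all-teams email.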
Step 3: Configure Alert Rules with Appropriate Labels
Configure alert rules with team labels to ensure proper routing:
apiVersion: 1
groups:
  - name: Services
    rules:
      - name: BackendServiceDown
        expr: up{job="backend"} == 0
        for: 5m
        labels:
          team: backend
          severity: critical
        annotations:
          summary: Backend service is down
          description: "Backend service {{ $labels.instance }} has been down for more than 5 minutes."
      - name: FrontendErrors
        expr: sum(rate(frontend_error_total[5m])) > 10
        for: 5m
        labels:
          team: frontend
          severity: warning
        annotations:
          summary: High frontend error rate
          description: "Frontend is experiencing high error rate ({{ $value }} errors per second)."
      - name: DatabaseHighLatency
        expr: rate(database_query_duration_seconds_sum[5m]) / rate(database_query_duration_seconds_count[5m]) > 1
        for: 10m
        labels:
          team: database
          severity: warning
        annotations:
          summary: Database query latency is high
          description: "Database queries are taking longer than 1 second on average."
Advanced Routing Techniques
Time-Based Routing
Sometimes you want to route alerts differently based on time of day, for example during business hours versus after hours. One way to do this in Grafana is to attach a mute timing to a route, so that route only notifies outside the muted window:
apiVersion: 1
policies:
  - orgId: 1
    receiver: slack-during-hours
    group_by: ['alertname']
    routes:
      # Critical alerts are routed to PagerDuty; the mute timing suppresses these
      # pages during business hours (muted alerts are not re-routed to the parent),
      # so pages only go out on weekday evenings, nights, and weekends.
      - receiver: pagerduty-after-hours
        matchers:
          - severity = "critical"
        mute_time_intervals:
          - business-hours
muteTimes:
  - orgId: 1
    name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
Severity-Based Routing
Route alerts based on severity level. Setting continue: true means an alert that matches a route keeps being evaluated against the remaining routes instead of stopping at the first match:
apiVersion: 1
policies:
  - orgId: 1
    receiver: default-notifications
    group_by: ['alertname']
    routes:
      - receiver: critical-alerts
        matchers:
          - severity = "critical"
        continue: true
      - receiver: warning-alerts
        matchers:
          - severity = "warning"
        continue: true
Nested Routing
For complex organizations, nested routing allows for hierarchical team structures:
apiVersion: 1
policies:
  - orgId: 1
    receiver: default-team
    routes:
      - receiver: engineering
        matchers:
          - department = "engineering"
        routes:
          - receiver: backend-team
            matchers:
              - team = "backend"
          - receiver: frontend-team
            matchers:
              - team = "frontend"
      - receiver: operations
        matchers:
          - department = "operations"
        routes:
          - receiver: sre-team
            matchers:
              - team = "sre"
Best Practices for Alert Routing
- Keep It Simple: Start with a simple routing tree and expand as needed
- Consistent Labeling: Use consistent labels across all alerts
- Documentation: Document the routing structure for team reference
- Testing: Test routing configurations before deploying to production
- Redundancy: Include backup contact points for critical alerts (see the sketch after this list)
- Regular Reviews: Periodically review and update routing configurations
- Avoid Alert Fatigue: Use proper grouping, timing, and mute timings to prevent overwhelming teams
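One way to implement the redundancy recommendation is to attach more than one integration to a single contact point, so every notification sent to it is delivered over both channels. A minimal sketch, with hypothetical names, webhook URL, and address:
apiVersion: 1
contactPoints:
  - name: critical-redundant
    receivers:
      # Primary channel: Slack
      - uid: critical-slack
        type: slack
        settings:
          url: https://hooks.slack.com/services/critical-webhook
      # Backup channel: email, delivered alongside the Slack message
      - uid: critical-email
        type: email
        settings:
          addresses: [email protected]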
Common Pitfalls to Avoid
- Overlapping Routes: Ambiguous routing rules can send alerts to multiple teams
- Missing Routes: Alerts without matching routes may go unnoticed
- Too Much Granularity: Overly complex routing trees are hard to maintain
- Inconsistent Labels: Inconsistent labeling can break routing rules
- No Default Route: Always have a catch-all route for unmatched alerts
Summary
Alert routing is a critical component in Grafana Loki's monitoring and alerting system. Proper configuration ensures that the right people are notified when issues arise, allowing for faster response times and reduced mean time to resolution (MTTR).
By implementing a well-designed alert routing strategy using contact points, notification policies, and routing trees, you can create an effective alerting system that minimizes alert fatigue while ensuring critical issues are addressed promptly.
Exercises
- Set up a basic alert routing configuration with at least two different contact points.
- Create a routing tree that sends different alerts to different teams based on service labels.
- Implement a time-based routing policy that sends critical alerts to a different contact point during non-business hours.
- Design an alert routing strategy for your organization or a hypothetical company with at least three teams.