Prometheus Alerting Overview
Introduction
Alerting is a critical component of any monitoring system. While metrics collection and visualization help you understand your systems, alerts actively notify you when something requires attention. Prometheus provides a powerful and flexible alerting system that integrates seamlessly with its monitoring capabilities.
In this guide, you'll learn the fundamentals of Prometheus alerting, including how alert rules are defined, how AlertManager processes alerts, and how to configure notifications across various channels. Whether you're monitoring a small application or a complex distributed system, understanding Prometheus alerting will help you respond quickly to issues before they impact your users.
Alerting Components in Prometheus
Prometheus alerting consists of two main components that work together:
- Prometheus Server: Evaluates alert rules and generates alerts
- AlertManager: Handles alert grouping, deduplication, silencing, inhibition, and sending notifications
At a high level, the Prometheus server evaluates alert rules against the metrics it has scraped and pushes any alerts that fire to the AlertManager, which then groups, deduplicates, and routes them to your notification channels.
Prometheus Server's Role
The Prometheus server is responsible for:
- Storing alert rule definitions
- Periodically evaluating metrics against these rules
- Generating alerts when conditions are met
- Sending fired alerts to the AlertManager
AlertManager's Role
The AlertManager handles:
- Receiving alerts from one or more Prometheus servers
- Grouping similar alerts together
- Eliminating duplicate alerts
- Silencing alerts during maintenance windows
- Inhibiting lower-priority alerts when higher-priority alerts are active (see the sketch after this list)
- Routing alerts to appropriate notification channels
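Inhibition and silencing are configured on the AlertManager side. As a minimal sketch (assuming an otherwise standard alertmanager.yml), the following inhibit_rules block mutes warning-level alerts whenever a critical alert with the same alertname and instance is already firing:

inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    # Only inhibit when both alerts share these label values
    equal: ['alertname', 'instance']

Silences, by contrast, are usually created ad hoc through the AlertManager UI or with amtool (for example, amtool silence add alertname=HighCPULoad --comment="maintenance" --duration=2h) rather than in the configuration file.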
Defining Alert Rules
In Prometheus, alert rules are defined using PromQL (Prometheus Query Language) and are typically stored in YAML files. Each rule has a name, a condition, and optional labels and annotations.
Here's a basic structure of an alert rule:
groups:
  - name: example
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU Load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
Let's break down the key components:
- alert: The name of the alert, used in notifications and the Prometheus UI
- expr: A PromQL expression that determines when the alert should fire
- for: The duration the condition must be true before firing (prevents flapping)
- labels: Key-value pairs used for routing and classification
- annotations: Human-readable information to provide context about the alert
Example Alert Rules
Let's look at some common alert rules for different scenarios:
1. Service Availability Alert
- alert: ServiceDown
  expr: up{job="my-service"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.job }} is down"
    description: "Service {{ $labels.job }} on {{ $labels.instance }} has been down for more than 2 minutes."
This alert fires when a service being monitored (with job="my-service") reports as down for more than 2 minutes.
2. High Latency Alert
- alert: HighRequestLatency
  expr: http_request_duration_seconds{quantile="0.9"} > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High request latency on {{ $labels.instance }}"
    description: "{{ $labels.instance }} has a 90th percentile latency of {{ $value }} seconds for the past 10 minutes."
This alerts when the 90th percentile of HTTP request durations exceeds 1 second for 10 minutes.
3. Disk Space Alert
- alert: LowDiskSpace
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Low disk space on {{ $labels.instance }}"
    description: "Disk usage is above 90% on {{ $labels.instance }} mounted at {{ $labels.mountpoint }}."
This rule triggers when available disk space falls below 10% for 5 minutes.
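On real hosts this expression also matches pseudo-filesystems such as tmpfs, which can cause spurious alerts. A variant that narrows the selection might look like this (a sketch; the fstype pattern is an assumption to adapt to your environment):

- alert: LowDiskSpace
  expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 10
  for: 5m
  labels:
    severity: warning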
Configuring AlertManager
The AlertManager is configured through a YAML file, typically named alertmanager.yml. The configuration defines how alerts should be processed and routed to receivers.
Here's a basic AlertManager configuration:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'team-emails'
  routes:
    - match:
        severity: critical
      receiver: 'pager-duty'
      continue: true

receivers:
  - name: 'team-emails'
    email_configs:
      - to: 'team@example.org'
  - name: 'pager-duty'
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'
Let's break down the key sections:
Global Configuration
The global section defines parameters that apply to all alerts, such as SMTP settings for email notifications and the timeout for resolving alerts.
Routing Configuration
The route section defines how alerts are routed to receivers:
- group_by: Specifies how alerts should be grouped together
- group_wait: How long to wait before sending the first notification for a new group, so related alerts can be batched together
- group_interval: How long to wait before notifying about new alerts added to a group that has already been notified
- repeat_interval: How long to wait before re-sending a notification for alerts that are still firing
- receiver: The default receiver for all alerts
- routes: Sub-routes for specialized routing based on alert labels
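If you want to confirm where a given alert would end up, amtool (shipped with AlertManager) can evaluate a label set against your routing tree. A quick check might look like this (the label values are examples):

amtool config routes test --config.file=alertmanager.yml severity=critical alertname=ServiceDown

This prints the receiver(s) that would match, which is an easy way to verify sub-routes before deploying a configuration change.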
Receivers Configuration
The receivers section defines notification channels. Built-in integrations include:
- Email
- Slack
- PagerDuty
- Webhook
- OpsGenie
- VictorOps
- And many others via integrations
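As one concrete example of these integrations, a receiver that forwards alerts to an HTTP endpoint can be declared with webhook_configs (a sketch; the URL is a placeholder for your own service):

receivers:
  - name: 'ops-webhook'
    webhook_configs:
      - url: 'http://example.internal:5001/alerts'  # hypothetical endpoint that accepts AlertManager's JSON payload
        send_resolved: true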
Practical Example: Complete Alerting Setup
Let's walk through a practical example of setting up alerting for a web application:
1. First, create an alert rules file (prometheus-rules.yml):
groups:
  - name: webapp-alerts
    rules:
      - alert: WebAppDown
        expr: up{job="webapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Web application is down"
          description: "The web application has been down for more than 1 minute."

      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{job="webapp", status=~"5.."}[5m])) / sum(rate(http_requests_total{job="webapp"}[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% ({{ $value | humanizePercentage }}) for the past 2 minutes."

      - alert: SlowResponses
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="webapp"}[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow response times detected"
          description: "95th percentile of response times is above 2 seconds for the past 5 minutes."
2. Configure your Prometheus server to load these rules (prometheus.yml):
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "prometheus-rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'webapp'
    static_configs:
      - targets: ['webapp:8080']
3. Configure AlertManager (alertmanager.yml):
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pager-duty-critical'
      continue: true

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'pager-duty-critical'
    pagerduty_configs:
      - service_key: '<your-pagerduty-key>'
        send_resolved: true
4. Start all services (using Docker Compose, for example):
version: '3'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus-rules.yml:/etc/prometheus/prometheus-rules.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - '9090:9090'

  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    ports:
      - '9093:9093'

  webapp:
    image: your-webapp-image
    ports:
      - '8080:8080'
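Once the stack is up, you can confirm that both components are healthy before testing any alerts (ports assume the Compose file above):

curl -s http://localhost:9090/-/healthy   # Prometheus
curl -s http://localhost:9093/-/healthy   # AlertManager

You should also see the webapp target listed as UP at http://localhost:9090/targets.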
Alert States in Prometheus
Alerts in Prometheus have three possible states:
- Inactive: The alert condition is not met
- Pending: The alert condition is met, but hasn't been true for the duration specified in the for field
- Firing: The alert condition has been true for the duration specified in the for field and is actively firing
You can view the current state of all alerts in the Prometheus UI under the "Alerts" tab:
http://your-prometheus-server:9090/alerts
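Prometheus also exposes alert state as a synthetic time series named ALERTS, which you can query like any other metric. For example, to list everything currently firing (a simple sketch):

ALERTS{alertstate="firing"}

The same information is available programmatically from the HTTP API at /api/v1/alerts, which is handy for scripted checks.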
Best Practices for Prometheus Alerting
1. Follow the Three Alert Levels
Structure your alerts into three severity levels:
- Critical: Requires immediate human attention (pages someone)
- Warning: Requires attention soon but not immediately
- Info: Informational alerts that don't require immediate action
2. Use Meaningful Alert Names
Make alert names descriptive and specific:
# Bad
- alert: HighCPU
# Good
- alert: InstanceHighCPUUsage
3. Add Useful Context in Annotations
Include actionable information in your alert annotations:
annotations:
  summary: "High memory usage on {{ $labels.instance }}"
  description: 'Memory usage is at {{ $value | printf "%.2f" }}%. Consider checking for memory leaks or increasing capacity.'
  dashboard: "https://grafana.example.com/d/memory-dashboard"
  runbook: "https://wiki.example.com/runbooks/high-memory"
4. Set Appropriate Thresholds
Avoid alert fatigue by setting thresholds that balance sensitivity and specificity:
# Too sensitive, might cause alert fatigue
- expr: cpu_usage_percent > 50

# Better balance
- expr: cpu_usage_percent > 80
  for: 15m
5. Use the 'for' Clause Appropriately
The for clause helps reduce noise from transient spikes:
# Without 'for' - will trigger on brief spikes
- alert: HighErrorRate
  expr: error_rate > 0.01

# With 'for' - only triggers if sustained
- alert: HighErrorRate
  expr: error_rate > 0.01
  for: 5m
Troubleshooting Alerts
If your alerts aren't working as expected, check the following:
- Alert Rules: Verify your alert expressions in the Prometheus UI
- Alert States: Check the "Alerts" section in the Prometheus UI
- AlertManager Status: Check the "Status" > "AlertManager" section in Prometheus
- Connectivity: Ensure Prometheus can connect to AlertManager
- Configuration Syntax: Validate your YAML files with a YAML linter or the bundled check commands (see the commands after this list)
- Logs: Check logs from both Prometheus and AlertManager
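For the configuration-syntax check in particular, the bundled tools catch most problems before a reload (commands assume the file names used earlier in this guide):

promtool check config prometheus.yml    # validates prometheus.yml and any rule_files it references
amtool check-config alertmanager.yml    # validates the AlertManager configuration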
Summary
Prometheus alerting provides a powerful system for detecting issues in your infrastructure and applications. By defining PromQL-based alert rules in Prometheus and configuring routing and notification in AlertManager, you can create a comprehensive alerting strategy.
Key takeaways:
- Prometheus evaluates alert rules while AlertManager handles notification delivery
- Alert rules consist of a name, expression, duration, labels, and annotations
- AlertManager supports grouping, routing, and multiple notification channels
- Well-designed alerts should be actionable and avoid alert fatigue
Exercises
- Basic Alert: Create an alert rule that fires when a service is down for more than 1 minute.
- Advanced Expression: Create an alert rule that detects when your application's error rate exceeds 5% over a 5-minute period.
- Routing Configuration: Configure AlertManager to send critical alerts to PagerDuty and warning alerts to Slack.
- Alert Templates: Create custom notification templates for different alert severity levels.
- Test Your Alerts: Simulate conditions that would trigger your alerts to verify they work correctly.
By mastering Prometheus alerting, you'll be able to detect and respond to issues quickly, ensuring better reliability and performance for your systems.