
Prometheus Alerting Rules

Introduction

Alerting rules are a critical component of the Prometheus monitoring ecosystem. They let you define conditions that, when met, trigger alerts to notify you about potential issues in your systems. Unlike recording rules, which pre-compute expressions, alerting rules are specifically designed to identify problematic situations and initiate notifications through Alertmanager.

In this guide, we'll explore how to create effective alerting rules in Prometheus, understand their syntax, and implement them in real-world scenarios. By the end, you'll be able to set up comprehensive alerting for your infrastructure.

Understanding Alerting Rules

Alerting rules in Prometheus follow a declarative approach, where you define:

  1. The condition to evaluate (a PromQL expression)
  2. The duration the condition must be true before firing an alert
  3. Labels to classify the alert
  4. Annotations to provide human-readable details

When an alerting rule's condition first becomes true, the alert enters the "pending" state. Once the condition has remained true for the duration given in the for field, the alert transitions to the "firing" state and is sent to Alertmanager. If no for duration is set, the alert fires as soon as the condition is true.

Basic Structure of Alerting Rules

Alerting rules are defined in YAML files with the following structure:

yaml
groups:
- name: example
  rules:
  - alert: HighCPULoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU load detected"
      description: "CPU load is above 90% for more than 5 minutes."

Let's break down the components:

  • groups: Rules are organized into named groups
  • rules: List of individual alerting rules
  • alert: The name of the alert
  • expr: The PromQL expression that determines when the alert should fire
  • for: Optional duration the condition must be true before firing
  • labels: Additional labels for routing and classification
  • annotations: Human-readable information about the alert

Creating Your First Alerting Rule

Let's create a simple alerting rule that fires when an instance is down:

yaml
groups:
- name: instance_availability
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

This rule checks if the up metric (which Prometheus automatically generates for each target) equals 0, indicating the target is down. If this condition persists for 1 minute, an alert fires.

Notice the use of template variables like {{ $labels.instance }} in the annotations. These reference labels from the alert's time series and allow you to create dynamic alert messages.

Configuring Prometheus to Load Alerting Rules

To use alerting rules, you need to configure Prometheus to load them. Add this to your prometheus.yml:

yaml
rule_files:
  - "alert_rules.yml"

This tells Prometheus to load rules from the file alert_rules.yml.
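
In a more complete setup, you'll typically also tell Prometheus how often to evaluate rules and where to send fired alerts. Here's a minimal sketch of a prometheus.yml covering both; the rule file name and the Alertmanager address (alertmanager:9093) are placeholders you would replace with your own:

yaml
global:
  evaluation_interval: 15s        # how often alerting and recording rules are evaluated

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093" # placeholder address of your Alertmanager

After changing the configuration, restart Prometheus or trigger a reload (send SIGHUP, or POST to /-/reload if Prometheus was started with --web.enable-lifecycle) so the rule files are picked up.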

Advanced Alerting Rule Techniques

Using Template Variables

Templates make your alerts more informative by including details from the alert context:

yaml
annotations:
  summary: "High CPU on {{ $labels.instance }}"
  description: "CPU usage is {{ $value | printf \"%.2f\" }}% for 5 minutes."

Available variables:

  • $labels: Label key/value pairs from the alert's time series
  • $value: The evaluated value of the alert expression
  • $externalLabels: The globally configured external labels

Multi-condition Alerts

For more complex scenarios, you can use PromQL to create sophisticated conditions:

yaml
expr: (node_memory_MemFree_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 < 10

This expression alerts when free memory (including cache) falls below 10% of total memory.
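
You can also chain several conditions in a single rule with PromQL's logical operators. The rule below is a sketch (assuming node_exporter filesystem metrics) that only fires when a filesystem is both below 15% free space and, based on the last hour's trend, predicted by predict_linear() to run out within four hours:

yaml
- alert: DiskWillFillSoon
  expr: |
    (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes * 100 < 15)
    and
    (predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[1h], 4 * 3600) < 0)
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} predicted to fill within 4 hours"

Requiring both conditions keeps the alert quiet for filesystems that are small but stable, while still catching those that are actively filling up.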

Alert Grouping and Inhibition

You can define relationships between alerts using labels:

yaml
groups:
- name: node_alerts
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
      service: "{{ $labels.job }}"
    # ...

  - alert: HighCPULoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
    for: 5m
    labels:
      severity: warning
      service: "{{ $labels.job }}"
    # ...

By using consistent labels like service, Alertmanager can group related alerts together.
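
Grouping and inhibition themselves are configured in Alertmanager, not in the rule file. As a rough sketch of an alertmanager.yml fragment (the receiver name is a placeholder), you could group notifications by the shared service label and mute warning alerts for any service that already has a critical alert firing:

yaml
route:
  receiver: default-receiver      # placeholder; must match a receiver defined elsewhere in the file
  group_by: ["service", "alertname"]
  group_wait: 30s
  group_interval: 5m

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ["service"]            # only inhibit when both alerts share the same service label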

Best Practices for Alerting Rules

1. Alert on Symptoms, Not Causes

Focus alerts on user-visible symptoms:

yaml
# Better: Alert on high error rate (a symptom)
- alert: APIHighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
  for: 5m

# Avoid: Alert on specific causes
- alert: DatabaseConnectionPoolExhausted
  expr: db_connections_current / db_connections_max > 0.9
  for: 5m

2. Use Appropriate Thresholds

Set thresholds that balance between false positives and missed issues:

yaml
# Multiple severity levels with different thresholds
- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  # ...

- alert: CriticalCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
  for: 5m
  labels:
    severity: critical
  # ...

3. Include Meaningful Context

Add useful information to help troubleshoot issues:

yaml
annotations:
  summary: "Memory usage critical on {{ $labels.instance }}"
  description: "Memory usage is at {{ $value | printf \"%.2f\" }}%. Top memory consumers: {{ with printf \"sort_desc(topk(3, process_resident_memory_bytes{instance=\\\"%s\\\"}))\" $labels.instance | query }}{{ range . }}{{ .Labels.process_name }}: {{ humanize .Value }}B, {{ end }}{{ end }}"
  dashboard: "https://grafana.example.com/d/abc123/node-metrics?var-instance={{ $labels.instance }}"

4. Apply Rate and Aggregation Functions Correctly

When alerting on counters, use rate() to handle resets:

yaml
# Good: Using rate() and aggregation to handle counter resets
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
  for: 5m

# Avoid: Direct comparison of raw counter values
- alert: TooManyErrors
  expr: http_requests_total{status=~"5.."} > 100
  for: 5m

Real-world Examples

Let's explore some practical alerting rules for common scenarios:

Service Availability Monitoring

yaml
groups:
- name: availability
  rules:
  - alert: ServiceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service {{ $labels.job }} is down"
      description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."

  - alert: HighLatency
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency on {{ $labels.service }}"
      description: "Service {{ $labels.service }} has 95th percentile latency above 1s for 5 minutes."

Resource Utilization Alerts

yaml
groups:
- name: resources
  rules:
  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host {{ $labels.instance }} is low on memory"
      description: "Node memory is filling up (< 10% left: {{ $value | printf \"%.2f\" }}%)."

  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host {{ $labels.instance }} is low on disk space"
      description: "Disk space is filling up (< 10% left on {{ $labels.device }}, {{ $value | printf \"%.2f\" }}%)."

  - alert: HostHighCPULoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host {{ $labels.instance }} has high CPU load"
      description: "CPU load is > 80% for 5 minutes (current value: {{ $value | printf \"%.2f\" }}%)."

Application-specific Alerts

For a web application:

yaml
groups:
- name: application
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate on {{ $labels.service }}"
      description: "Service {{ $labels.service }} has an error rate above 5% (current: {{ $value | humanizePercentage }})."

  - alert: ApplicationLatency
    expr: histogram_quantile(0.95, sum(rate(application_request_duration_seconds_bucket[5m])) by (le, endpoint)) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Slow response time on {{ $labels.endpoint }}"
      description: "Endpoint {{ $labels.endpoint }} has 95th percentile latency above 2s."

Visualizing Alerts

Prometheus provides a built-in UI to view the status of your alerting rules. You can access it at /alerts on your Prometheus server (e.g., http://localhost:9090/alerts).

The Alerts page lists each rule group and shows every rule's current state (inactive, pending, or firing) along with the labels of any active alert instances.

Testing Alerting Rules

You can test alerting rules before deploying them using the promtool utility:

bash
promtool check rules alert_rules.yml

This validates the syntax of your rules file.

To test if specific metrics would trigger an alert:

bash
promtool test rules alert_test.yml

Where alert_test.yml contains test cases:

yaml
rule_files:
  - alert_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="api", instance="instance-1"}'
        values: '1 1 1 0 0 0'
    alert_rule_test:
      - eval_time: 4m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              job: api
              instance: instance-1
              severity: critical

Summary

Alerting rules are a powerful feature in Prometheus that help you detect and respond to issues in your systems. By defining appropriate conditions, thresholds, and annotations, you can create an effective alerting strategy that balances between catching real problems and avoiding alert fatigue.

Key takeaways:

  1. Alerting rules define conditions that, when met for a specified duration, trigger alerts
  2. Rules include conditions (expressions), duration, labels, and annotations
  3. Use template variables to create dynamic, informative alerts
  4. Follow best practices like alerting on symptoms, setting appropriate thresholds, and including context
  5. Test your rules before deploying them to production

Exercises

  1. Create an alerting rule that fires when a service's error rate exceeds 10% for 5 minutes.
  2. Design alerts with multiple severity levels (warning, critical) for disk usage.
  3. Create an alert that combines multiple metrics (e.g., high CPU and memory usage together).
  4. Set up alerting for slow database queries in your application.
