Alert Rules Best Practices
Introduction
Alert rules are a critical component of any Prometheus monitoring setup. They define the conditions under which your system should notify operators about potential issues. While setting up alerts might seem straightforward, designing an effective alerting strategy requires careful consideration to ensure alerts are meaningful, actionable, and don't cause alert fatigue.
This guide covers best practices for creating and managing Prometheus alert rules, helping you build a robust alerting system that surfaces real problems without overwhelming your team with false positives.
Understanding Prometheus Alerting Architecture
Before diving into best practices, let's understand how Prometheus alerting works:
Prometheus evaluates alert rules against your metrics data. When conditions are met, alerts are sent to Alertmanager, which handles grouping, inhibition, silencing, and notification routing.
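To make that flow concrete, here is a minimal sketch of how the pieces are wired together in prometheus.yml; the rule file name and the Alertmanager address are placeholders for illustration.

```yaml
# prometheus.yml (illustrative snippet - file name and target address are placeholders)
rule_files:
  - "alerts.yml"                             # alert rules evaluated by Prometheus

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]     # Alertmanager handles grouping, inhibition, silencing, routing
```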
General Alert Rule Best Practices
1. Alert on Symptoms, Not Causes
Focus your alerts on user-visible symptoms rather than internal causes:
- Good: Alert on high error rates, service unavailability, or slow response times
- Avoid: Alerting on specific implementation details like CPU usage (unless directly impacting service)
```yaml
# Good - Alerting on a symptom (high error rate)
- alert: HighErrorRate
  # Aggregate both sides so the 5xx series and the total series can be divided
  expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High HTTP error rate"
    description: "Error rate is {{ $value | humanizePercentage }} for the past 5 minutes"

# Avoid - Alerting on a potential cause without direct user impact
- alert: HighCPUUsage
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage"
    description: "CPU usage is {{ $value | humanizePercentage }} for the past 5 minutes"
```
2. Make Alerts Actionable
Every alert should have a clear, documented action that responders can take. If there's no action to take, it shouldn't be an alert.
```yaml
# Actionable alert with clear next steps
- alert: HighMemoryUsage
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
    description: "Memory usage is {{ $value | humanizePercentage }} for the past 15 minutes"
    runbook_url: "https://example.com/runbooks/high-memory-usage"
```
3. Use Appropriate Alert Severities
Define consistent severity levels across your organization and use them appropriately:
- Critical: Immediate action required; service is down or severely impacted
- Warning: Needs attention soon but not immediately; degraded service
- Info: Something to be aware of but not urgent
```yaml
# Critical alert - Service is down
- alert: ServiceDown
  expr: up{job="api-service"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Service down: {{ $labels.job }}"
    description: "{{ $labels.job }} has been down for more than 2 minutes"

# Warning alert - Service is degraded
- alert: SlowResponseTime
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Slow response time on {{ $labels.instance }}"
    description: "95th percentile response time is above 2s for the past 10 minutes"
```
4. Choose Appropriate Timeframes
Use the `for` clause to reduce noise from transient spikes:
- Critical alerts: Short duration (30s - 5m) to enable quick response
- Warning alerts: Longer duration (5m - 30m) to reduce false positives
```yaml
# Critical alert with a shorter duration
- alert: HighErrorRate
  expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.2
  for: 2m  # Short duration for a critical issue
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} for the past 2 minutes"

# Warning alert with a longer duration
- alert: ElevatedErrorRate
  expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
  for: 15m  # Longer duration for a warning
  labels:
    severity: warning
  annotations:
    summary: "Elevated error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} for the past 15 minutes"
```
Technical Alert Design Best Practices
1. Use Rate Functions for Counter Metrics
When creating alerts on counter metrics, use the `rate()` or `increase()` functions rather than raw counter values:
```yaml
# Good - Using a rate function on a counter metric
- alert: HighErrorRate
  expr: rate(http_errors_total[5m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate"
    description: "Error rate is {{ $value }} errors/second for the past 5 minutes"

# Bad - Using the raw counter value (counters only ever increase)
- alert: ManyErrors
  expr: http_errors_total > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Many errors"
    description: "Total errors: {{ $value }}"
```
2. Handle Missing Data Appropriately
Consider what should happen when metrics are absent:
```yaml
# Alert when a service is not reporting metrics (missing data)
- alert: ServiceMetricsMissing
  expr: absent(up{job="api-service"})
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Service metrics missing"
    description: "No metrics received from {{ $labels.job }} for the past 5 minutes"

# Use absent_over_time for metrics that should always exist
- alert: ImportantMetricMissing
  expr: absent_over_time(http_requests_total{job="api-service"}[15m])
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Important metric missing"
    description: "http_requests_total metric is missing for {{ $labels.job }}"
```
3. Use Appropriate Aggregation
Choose the right aggregation method based on your metric and alert purpose:
```yaml
# Specific instance alerting
- alert: InstanceHighMemory
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
    description: "Memory usage is {{ $value | humanizePercentage }}"

# Cluster-level alerting (aggregated)
- alert: ClusterHighMemory
  expr: avg by (cluster) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) > 0.85
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High average memory usage on {{ $labels.cluster }}"
    description: "Average memory usage is {{ $value | humanizePercentage }}"
```
4. Design Meaningful Labels and Annotations
Use labels for grouping and routing; use annotations for human-readable context:
```yaml
# Well-designed labels and annotations
- alert: APIHighLatency
  expr: histogram_quantile(0.95, rate(api_request_duration_seconds_bucket[5m])) > 1
  for: 10m
  labels:
    severity: warning
    team: backend
    service: api
  annotations:
    summary: "High API latency on {{ $labels.instance }}"
    description: "95th percentile latency is {{ $value }} seconds for {{ $labels.endpoint }}"
    dashboard_url: "https://grafana.example.com/d/abc123/api-dashboard?var-instance={{ $labels.instance }}"
    runbook_url: "https://example.com/runbooks/high-api-latency"
```
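The grouping-and-routing half of this split happens in Alertmanager. As a rough sketch of how the `team` label above can drive notification routing (the receiver names and webhook URLs here are made up for illustration):

```yaml
# alertmanager.yml routing sketch - receivers and URLs are illustrative placeholders
route:
  receiver: default
  group_by: ['alertname', 'team']
  routes:
    - matchers:
        - team="backend"
      receiver: backend-oncall

receivers:
  - name: default
    webhook_configs:
      - url: 'https://example.com/hooks/default'
  - name: backend-oncall
    webhook_configs:
      - url: 'https://example.com/hooks/backend'
```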
Practical Examples
Example 1: Service Availability Alerting
```yaml
# Critical alert for service unavailability
- alert: ServiceDown
  expr: up{job="important-service"} == 0
  for: 1m
  labels:
    severity: critical
    team: sre
  annotations:
    summary: "Service {{ $labels.job }} is down"
    description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"
    runbook_url: "https://example.com/runbooks/service-down"

# Warning alert for degraded service
- alert: HighErrorRate
  expr: sum by (job, instance) (rate(http_requests_total{code=~"5.."}[5m])) / sum by (job, instance) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: warning
    team: sre
  annotations:
    summary: "High error rate on {{ $labels.job }}"
    description: "Error rate is {{ $value | humanizePercentage }} for the past 5 minutes on {{ $labels.instance }}"
    dashboard_url: "https://grafana.example.com/d/abc123/service-dashboard?var-job={{ $labels.job }}"
```
Example 2: Database Performance Alerting
```yaml
# Warning alert for slow database queries
- alert: SlowDatabaseQueries
  expr: histogram_quantile(0.95, rate(database_query_duration_seconds_bucket[5m])) > 0.5
  for: 10m
  labels:
    severity: warning
    team: database
  annotations:
    summary: "Slow database queries on {{ $labels.instance }}"
    description: "95th percentile query time is {{ $value }} seconds for the past 10 minutes"

# Critical alert for database connection saturation
- alert: DatabaseConnectionsSaturated
  expr: sum(postgres_stat_activity_count) / max(postgres_settings_max_connections) > 0.8
  for: 5m
  labels:
    severity: critical
    team: database
  annotations:
    summary: "Database connections near limit"
    description: "{{ $value | humanizePercentage }} of available database connections are in use"
    runbook_url: "https://example.com/runbooks/database-connections"
```
Example 3: Multi-level Disk Space Alerting
```yaml
# Warning level disk space alert
- alert: DiskSpaceFillingUp
  expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes > 0.85
  for: 30m
  labels:
    severity: warning
    team: infrastructure
  annotations:
    summary: "Disk space filling up on {{ $labels.instance }}"
    description: "Disk {{ $labels.device }} mounted on {{ $labels.mountpoint }} is {{ $value | humanizePercentage }} full"

# Critical level disk space alert
- alert: DiskSpaceCritical
  expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes > 0.95
  for: 5m
  labels:
    severity: critical
    team: infrastructure
  annotations:
    summary: "Critical disk space on {{ $labels.instance }}"
    description: "Disk {{ $labels.device }} mounted on {{ $labels.mountpoint }} is {{ $value | humanizePercentage }} full"
    runbook_url: "https://example.com/runbooks/disk-full"
```
Alert Maintenance Best Practices
1. Document Your Alerts
Maintain documentation for all alerts, including:
- The reason for the alert
- Expected action to take
- Links to relevant dashboards and runbooks
2. Regularly Review Alert Rules
Set up a periodic review process:
- Are alerts still relevant?
- Do thresholds need adjustment?
- Are there false positives?
- Are there gaps in coverage?
3. Test Alert Rules Before Deploying
Use Prometheus's testing capabilities to verify alert behavior:
```bash
# Validate the syntax of your alert rule files before deploying
promtool check rules alerts.yml

# Evaluate a rule expression against a running Prometheus server
promtool query instant http://prometheus:9090 'sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.1'
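Beyond syntax checks, promtool can unit-test rules against synthetic series. Here is a minimal sketch, assuming the symptom-based HighErrorRate rule from earlier is saved in alerts.yml; the input series values are fabricated to produce a 20% error rate:

```yaml
# alerts_test.yml - run with: promtool test rules alerts_test.yml
# Assumes the HighErrorRate rule shown earlier lives in alerts.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 1 error/s vs 4 successes/s -> 20% error rate, above the 10% threshold
      - series: 'http_requests_total{job="api-service", status="500"}'
        values: '0+60x10'
      - series: 'http_requests_total{job="api-service", status="200"}'
        values: '0+240x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: warning
              job: api-service
```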
4. Use Version Control for Alert Rules
Store alert rules in version control:
- Document changes with meaningful commit messages
- Use pull requests for reviews
- Consider automated testing in CI/CD pipelines (see the sketch below)
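One lightweight way to wire this into CI (the file paths and image tag here are assumptions) is to run promtool from the official prom/prometheus image on every change:

```bash
# CI step sketch: validate rule files using the promtool binary shipped in the prom/prometheus image
docker run --rm \
  -v "$(pwd)":/rules \
  --entrypoint /bin/promtool \
  prom/prometheus:latest \
  check rules /rules/alerts.yml
```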
Common Alerting Antipatterns to Avoid
1. Alert Overload
Problem: Too many alerts causing fatigue and missed important issues.
Solution: Regularly audit alerts and eliminate non-actionable ones.
2. Poor Alert Descriptions
Problem: Vague alert descriptions lead to confusion during incidents.
Solution: Include specific details, values, and links to runbooks.
3. Static Thresholds for Dynamic Systems
Problem: Fixed thresholds don't account for normal variations in system behavior.
Solution: Consider using dynamic thresholds or percentile-based alerting.
```yaml
# Instead of a fixed threshold
- alert: HighCPUUsage
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.8

# Consider a relative threshold that compares current usage to recent history
- alert: AbnormalCPUUsage
  expr: |
    avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      > avg_over_time((avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])))[1d:5m]) * 1.5
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Abnormal CPU usage on {{ $labels.instance }}"
    description: "CPU usage is more than 50% higher than its 24-hour average"
```
Summary
Effective Prometheus alerting requires thoughtful design and regular maintenance. By following these best practices, you can build an alerting system that:
- Notifies you of real problems that need attention
- Minimizes false positives and alert fatigue
- Provides clear, actionable information to responders
- Adapts to the changing needs of your systems
Remember that good alerting is an ongoing process that requires regular review and refinement. Start with a small set of critical alerts and expand as you gain experience with your specific system's behavior.
Additional Resources
- Prometheus Alerting Documentation
- Google's Site Reliability Engineering Book - Chapter on Monitoring
- Prometheus Alert Rules Examples GitHub Repository
- AlertManager Configuration Documentation
Exercises
- Review your existing alert rules and classify them as symptom-based or cause-based. Convert at least one cause-based alert to focus on user-visible symptoms.
- Create a multi-level alert for a service of your choice with appropriate warning and critical thresholds.
- For a counter metric in your system (such as `http_requests_total`), write an appropriate alert rule using the `rate()` function.
- Design an alert rule that uses templating in annotations to provide specific, actionable information.
- Create a dashboard visualization that helps you determine appropriate thresholds for a metric you're considering alerting on.