Alert Rules Best Practices

Introduction

Alert rules are a critical component of any Prometheus monitoring setup. They define the conditions under which your system should notify operators about potential issues. While setting up alerts might seem straightforward, designing an effective alerting strategy requires careful consideration: alerts must be meaningful and actionable without causing alert fatigue.

This guide covers best practices for creating and managing Prometheus alert rules, helping you build a robust alerting system that surfaces real problems without overwhelming your team with false positives.

Understanding Prometheus Alerting Architecture

Before diving into best practices, let's understand how Prometheus alerting works:

Prometheus evaluates alert rules against your metrics data. When conditions are met, alerts are sent to Alertmanager, which handles grouping, inhibition, silencing, and notification routing.
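
The sketch below shows how those pieces are typically wired together. It is only an illustration: the file paths, the Alertmanager address, the receiver name, and the grouping labels are placeholders, not values this guide depends on.

```yaml
# prometheus.yml (excerpt) - Prometheus loads rule files and forwards firing
# alerts to Alertmanager. Paths and targets are placeholders.
rule_files:
  - "rules/alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
---
# alertmanager.yml (a separate file) - grouping, routing, and inhibition live
# here, not in Prometheus. Silences are created at runtime via the
# Alertmanager UI or API rather than in this file.
route:
  receiver: default-receiver
  group_by: ["alertname", "job"]   # bundle related alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: default-receiver
    # notification integrations (email, Slack, PagerDuty, ...) go here

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ["alertname", "instance"]   # mute the warning while the matching critical fires
```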

General Alert Rule Best Practices

1. Alert on Symptoms, Not Causes

Focus your alerts on user-visible symptoms rather than internal causes:

  • Good: Alert on high error rates, service unavailability, or slow response times
  • Avoid: Alerting on specific implementation details like CPU usage (unless it directly impacts the service)

```yaml
# Good - Alerting on symptom (high error rate)
- alert: HighErrorRate
  expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High HTTP error rate"
    description: "Error rate is {{ $value | humanizePercentage }} for the past 5 minutes"

# Avoid - Alerting on potential cause without direct user impact
- alert: HighCPUUsage
  expr: avg by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage"
    description: "CPU usage is {{ $value | humanizePercentage }} for the past 5 minutes"
```

2. Make Alerts Actionable

Every alert should have a clear, documented action that responders can take. If there's no action to take, it shouldn't be an alert.

```yaml
# Actionable alert with clear next steps
- alert: HighMemoryUsage
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
    description: "Memory usage is {{ $value | humanizePercentage }} for the past 15 minutes"
    runbook_url: "https://example.com/runbooks/high-memory-usage"
```

3. Use Appropriate Alert Severities

Define consistent severity levels across your organization and use them appropriately:

  • Critical: Immediate action required; service is down or severely impacted
  • Warning: Needs attention soon but not immediately; degraded service
  • Info: Something to be aware of but not urgent

```yaml
# Critical alert - Service is down
- alert: ServiceDown
  expr: up{job="api-service"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Service down: {{ $labels.job }}"
    description: "{{ $labels.job }} has been down for more than 2 minutes"

# Warning alert - Service is degraded
- alert: SlowResponseTime
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Slow response time on {{ $labels.instance }}"
    description: "95th percentile response time is above 2s for the past 10 minutes"
```

4. Choose Appropriate Timeframes

Use the for clause to reduce noise from transient spikes:

  • Critical alerts: Short duration (30s - 5m) to enable quick response
  • Warning alerts: Longer duration (5m - 30m) to reduce false positives

```yaml
# Critical alert with shorter duration
- alert: HighErrorRate
  expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.2
  for: 2m # Short duration for critical issue
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} for the past 2 minutes"

# Warning alert with longer duration
- alert: ElevatedErrorRate
  expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
  for: 15m # Longer duration for warning
  labels:
    severity: warning
  annotations:
    summary: "Elevated error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} for the past 15 minutes"
```

Technical Alert Design Best Practices

1. Use Rate Functions for Counter Metrics

When creating alerts on counter metrics, use the rate() or increase() functions rather than raw counter values:

```yaml
# Good - Using rate function for counter metric
- alert: HighErrorRate
  expr: rate(http_errors_total[5m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate"
    description: "Error rate is {{ $value }} errors/second for the past 5 minutes"

# Bad - Using raw counter value
- alert: ManyErrors
  expr: http_errors_total > 1000 # Bad practice - counters always increase
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Many errors"
    description: "Total errors: {{ $value }}"
```

2. Handle Missing Data Appropriately

Consider what should happen when metrics are absent:

```yaml
# Alert when a service is not reporting metrics (missing data)
- alert: ServiceMetricsMissing
  expr: absent(up{job="api-service"})
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Service metrics missing"
    description: "No metrics received from {{ $labels.job }} for the past 5 minutes"

# Use absent_over_time for metrics that should always exist
- alert: ImportantMetricMissing
  expr: absent_over_time(http_requests_total{job="api-service"}[15m])
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Important metric missing"
    description: "http_requests_total metric is missing for {{ $labels.job }}"
```

3. Use Appropriate Aggregation

Choose the right aggregation method based on your metric and alert purpose:

```yaml
# Specific instance alerting
- alert: InstanceHighMemory
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
    description: "Memory usage is {{ $value | humanizePercentage }}"

# Cluster-level alerting (aggregated)
- alert: ClusterHighMemory
  expr: avg by (cluster) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) > 0.85
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High average memory usage on {{ $labels.cluster }}"
    description: "Average memory usage is {{ $value | humanizePercentage }}"
```

4. Design Meaningful Labels and Annotations

Use labels for grouping and routing; use annotations for human-readable context:

```yaml
# Well-designed labels and annotations
- alert: APIHighLatency
  expr: histogram_quantile(0.95, rate(api_request_duration_seconds_bucket[5m])) > 1
  for: 10m
  labels:
    severity: warning
    team: backend
    service: api
  annotations:
    summary: "High API latency on {{ $labels.instance }}"
    description: "95th percentile latency is {{ $value }} seconds for {{ $labels.endpoint }}"
    dashboard_url: "https://grafana.example.com/d/abc123/api-dashboard?var-instance={{ $labels.instance }}"
    runbook_url: "https://example.com/runbooks/high-api-latency"
```

Practical Examples

Example 1: Service Availability Alerting

```yaml
# Critical alert for service unavailability
- alert: ServiceDown
  expr: up{job="important-service"} == 0
  for: 1m
  labels:
    severity: critical
    team: sre
  annotations:
    summary: "Service {{ $labels.job }} is down"
    description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"
    runbook_url: "https://example.com/runbooks/service-down"

# Warning alert for degraded service
- alert: HighErrorRate
  expr: sum by (job, instance) (rate(http_requests_total{code=~"5.."}[5m])) / sum by (job, instance) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: warning
    team: sre
  annotations:
    summary: "High error rate on {{ $labels.job }}"
    description: "Error rate is {{ $value | humanizePercentage }} for the past 5 minutes on {{ $labels.instance }}"
    dashboard_url: "https://grafana.example.com/d/abc123/service-dashboard?var-job={{ $labels.job }}"
```

Example 2: Database Performance Alerting

```yaml
# Warning alert for slow database queries
- alert: SlowDatabaseQueries
  expr: histogram_quantile(0.95, rate(database_query_duration_seconds_bucket[5m])) > 0.5
  for: 10m
  labels:
    severity: warning
    team: database
  annotations:
    summary: "Slow database queries on {{ $labels.instance }}"
    description: "95th percentile query time is {{ $value }} seconds for the past 10 minutes"

# Critical alert for database connection saturation
- alert: DatabaseConnectionsSaturated
  expr: sum(postgres_stat_activity_count) / max(postgres_settings_max_connections) > 0.8
  for: 5m
  labels:
    severity: critical
    team: database
  annotations:
    summary: "Database connections near limit"
    description: "{{ $value | humanizePercentage }} of available database connections are in use"
    runbook_url: "https://example.com/runbooks/database-connections"
```

Example 3: Multi-level Disk Space Alerting

```yaml
# Warning level disk space alert
- alert: DiskSpaceFillingUp
  expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes > 0.85
  for: 30m
  labels:
    severity: warning
    team: infrastructure
  annotations:
    summary: "Disk space filling up on {{ $labels.instance }}"
    description: "Disk {{ $labels.device }} mounted on {{ $labels.mountpoint }} is {{ $value | humanizePercentage }} full"

# Critical level disk space alert
- alert: DiskSpaceCritical
  expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes > 0.95
  for: 5m
  labels:
    severity: critical
    team: infrastructure
  annotations:
    summary: "Critical disk space on {{ $labels.instance }}"
    description: "Disk {{ $labels.device }} mounted on {{ $labels.mountpoint }} is {{ $value | humanizePercentage }} full"
    runbook_url: "https://example.com/runbooks/disk-full"
```

Alert Maintenance Best Practices

1. Document Your Alerts

Maintain documentation for all alerts, including:

  • The reason for the alert
  • Expected action to take
  • Links to relevant dashboards and runbooks

2. Regularly Review Alert Rules

Set up a periodic review process:

  • Are alerts still relevant?
  • Do thresholds need adjustment?
  • Are there false positives?
  • Are there gaps in coverage?

3. Test Alert Rules Before Deploying

Use Prometheus's testing capabilities to verify alert behavior:

```bash
# Test your alert rules before deploying
promtool check rules alerts.yml

# Test a specific rule expression against a running Prometheus server
promtool query instant http://prometheus:9090 'sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.1'
```
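
promtool can also unit-test alerting rules against synthetic series with promtool test rules. The sketch below assumes the HighErrorRate rule shown earlier is saved in alerts.yml; the job label and the input series values are made up purely to exercise the rule.

```yaml
# alerts_test.yml - run with: promtool test rules alerts_test.yml
# Assumes the HighErrorRate rule from earlier lives in alerts.yml next to this file;
# the input series below are synthetic and exist only for the test.
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 30 errors/min against 60 total requests/min -> 50% error rate, well above 10%
      - series: 'http_requests_total{job="api-service", status="500"}'
        values: '0+30x20'
      - series: 'http_requests_total{job="api-service", status="200"}'
        values: '0+30x20'
    alert_rule_test:
      - eval_time: 15m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: warning
              job: api-service
            exp_annotations:
              summary: "High HTTP error rate"
              description: "Error rate is 50% for the past 5 minutes"
```

A passing test gives you confidence that the expression, the for duration, and the annotation templates behave as intended before the rule reaches production.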

4. Use Version Control for Alert Rules

Store alert rules in version control:

  • Document changes with meaningful commit messages
  • Use pull requests for reviews
  • Consider automated testing in CI/CD pipelines (one possible workflow is sketched below)
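
As one possible setup, the GitHub Actions workflow below runs promtool on every pull request that touches rule files. The workflow name, the rules/ directory layout, the file names, and the pinned image tag are assumptions for illustration; adapt them to your repository and CI system.

```yaml
# .github/workflows/validate-alert-rules.yml - a sketch, assuming alerts.yml and
# alerts_test.yml live under rules/ and that you pin a specific Prometheus image tag.
name: validate-alert-rules
on:
  pull_request:
    paths:
      - "rules/**"

jobs:
  promtool:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check rule syntax
        run: |
          docker run --rm -v "$PWD/rules:/rules" -w /rules --entrypoint /bin/promtool \
            prom/prometheus:v2.53.0 check rules alerts.yml
      - name: Run rule unit tests
        run: |
          docker run --rm -v "$PWD/rules:/rules" -w /rules --entrypoint /bin/promtool \
            prom/prometheus:v2.53.0 test rules alerts_test.yml
```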

Common Alerting Antipatterns to Avoid

1. Alert Overload

Problem: Too many alerts causing fatigue and missed important issues.

Solution: Regularly audit alerts and eliminate non-actionable ones.

2. Poor Alert Descriptions

Problem: Vague alert descriptions lead to confusion during incidents.

Solution: Include specific details, values, and links to runbooks.

3. Static Thresholds for Dynamic Systems

Problem: Fixed thresholds don't account for normal variations in system behavior.

Solution: Consider using dynamic thresholds or percentile-based alerting.

```yaml
# Instead of a fixed threshold
- alert: HighCPUUsage
  expr: avg by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.8

# Consider a relative threshold comparing to history
- alert: AbnormalCPUUsage
  expr: |
    avg by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      > avg_over_time((avg by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])))[1d:5m]) * 1.5
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Abnormal CPU usage on {{ $labels.instance }}"
    description: "CPU usage is more than 50% higher than the 24-hour average"
```

Summary

Effective Prometheus alerting requires thoughtful design and regular maintenance. By following these best practices, you can build an alerting system that:

  • Notifies you of real problems that need attention
  • Minimizes false positives and alert fatigue
  • Provides clear, actionable information to responders
  • Adapts to the changing needs of your systems

Remember that good alerting is an ongoing process that requires regular review and refinement. Start with a small set of critical alerts and expand as you gain experience with your specific system's behavior.

Exercises

  1. Review your existing alert rules and classify them as symptom-based or cause-based. Convert at least one cause-based alert to focus on user-visible symptoms.

  2. Create a multi-level alert for a service of your choice with appropriate warning and critical thresholds.

  3. For a counter metric in your system (like http_requests_total), write an appropriate alert rule using the rate() function.

  4. Design an alert rule that uses templating in annotations to provide specific, actionable information.

  5. Create a dashboard visualization that helps you determine appropriate thresholds for a metric you're considering alerting on.


