Alert Rules Best Practices
Introduction
Alert rules are a critical component of any Prometheus monitoring setup. They define the conditions under which your system should notify operators about potential issues. While setting up alerts might seem straightforward, designing an effective alerting strategy requires careful consideration to ensure alerts are meaningful, actionable, and don't cause alert fatigue.
This guide covers best practices for creating and managing Prometheus alert rules, helping you build a robust alerting system that surfaces real problems without overwhelming your team with false positives.
Understanding Prometheus Alerting Architecture
Before diving into best practices, let's understand how Prometheus alerting works:
Prometheus evaluates alert rules against your metrics data. When conditions are met, alerts are sent to Alertmanager, which handles grouping, inhibition, silencing, and notification routing.
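To make that flow concrete, here is a minimal sketch of how the pieces are wired together in prometheus.yml; the rule file name and the Alertmanager address are placeholders for illustration.

```yaml
# prometheus.yml (illustrative snippet - file name and target address are placeholders)
rule_files:
  - "alerts.yml"                             # alert rules evaluated by Prometheus

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]     # Alertmanager handles grouping, inhibition, silencing, routing
```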
General Alert Rule Best Practices
1. Alert on Symptoms, Not Causes
Focus your alerts on user-visible symptoms rather than internal causes:
- Good: Alert on high error rates, service unavailability, or slow response times
- Avoid: Alerting on specific implementation details like CPU usage (unless directly impacting service)
```yaml
# Good - Alerting on a symptom (high error rate)
- alert: HighErrorRate
  # Aggregate both sides so the 5xx series and the total series can be divided
  expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High HTTP error rate"
    description: "Error rate is {{ $value | humanizePercentage }} for the past 5 minutes"

# Avoid - Alerting on a potential cause without direct user impact
- alert: HighCPUUsage
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage"
    description: "CPU usage is {{ $value | humanizePercentage }} for the past 5 minutes"
```
2. Make Alerts Actionable
Every alert should have a clear, documented action that responders can take. If there's no action to take, it shouldn't be an alert.
```yaml
# Actionable alert with clear next steps
- alert: HighMemoryUsage
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
    description: "Memory usage is {{ $value | humanizePercentage }} for the past 15 minutes"
    runbook_url: "https://example.com/runbooks/high-memory-usage"
```
3. Use Appropriate Alert Severities
Define consistent severity levels across your organization and use them appropriately:
- Critical: Immediate action required; service is down or severely impacted
- Warning: Needs attention soon but not immediately; degraded service
- Info: Something to be aware of but not urgent
```yaml
# Critical alert - Service is down
- alert: ServiceDown
  expr: up{job="api-service"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Service down: {{ $labels.job }}"
    description: "{{ $labels.job }} has been down for more than 2 minutes"

# Warning alert - Service is degraded
- alert: SlowResponseTime
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Slow response time on {{ $labels.instance }}"
    description: "95th percentile response time is above 2s for the past 10 minutes"
```
4. Choose Appropriate Timeframes
Use the `for` clause to reduce noise from transient spikes:
- Critical alerts: Short duration (30s - 5m) to enable quick response
- Warning alerts: Longer duration (5m - 30m) to reduce false positives
```yaml
# Critical alert with a shorter duration
- alert: HighErrorRate
  expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.2
  for: 2m  # Short duration for a critical issue
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} for the past 2 minutes"

# Warning alert with a longer duration
- alert: ElevatedErrorRate
  expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
  for: 15m  # Longer duration for a warning
  labels:
    severity: warning
  annotations:
    summary: "Elevated error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} for the past 15 minutes"
```
Technical Alert Design Best Practices
1. Use Rate Functions for Counter Metrics
When creating alerts on counter metrics, use the `rate()` or `increase()` functions rather than raw counter values:
```yaml
# Good - Using a rate function on a counter metric
- alert: HighErrorRate
  expr: rate(http_errors_total[5m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate"
    description: "Error rate is {{ $value }} errors/second for the past 5 minutes"

# Bad - Using the raw counter value (counters only ever increase)
- alert: ManyErrors
  expr: http_errors_total > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Many errors"
    description: "Total errors: {{ $value }}"
```
2. Handle Missing Data Appropriately
Consider what should happen when metrics are absent:
```yaml
# Alert when a service is not reporting metrics (missing data)
- alert: ServiceMetricsMissing
  expr: absent(up{job="api-service"})
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Service metrics missing"
    description: "No metrics received from {{ $labels.job }} for the past 5 minutes"

# Use absent_over_time for metrics that should always exist
- alert: ImportantMetricMissing
  expr: absent_over_time(http_requests_total{job="api-service"}[15m])
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Important metric missing"
    description: "http_requests_total metric is missing for {{ $labels.job }}"
```
3. Use Appropriate Aggregation
Choose the right aggregation method based on your metric and alert purpose:
```yaml
# Specific instance alerting
- alert: InstanceHighMemory
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
    description: "Memory usage is {{ $value | humanizePercentage }}"

# Cluster-level alerting (aggregated)
- alert: ClusterHighMemory
  expr: avg by (cluster) ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) > 0.85
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High average memory usage on {{ $labels.cluster }}"
    description: "Average memory usage is {{ $value | humanizePercentage }}"
```
4. Design Meaningful Labels and Annotations
Use labels for grouping and routing; use annotations for human-readable context:
```yaml
# Well-designed labels and annotations
- alert: APIHighLatency
  expr: histogram_quantile(0.95, rate(api_request_duration_seconds_bucket[5m])) > 1
  for: 10m
  labels:
    severity: warning
    team: backend
    service: api
  annotations:
    summary: "High API latency on {{ $labels.instance }}"
    description: "95th percentile latency is {{ $value }} seconds for {{ $labels.endpoint }}"
    dashboard_url: "https://grafana.example.com/d/abc123/api-dashboard?var-instance={{ $labels.instance }}"
    runbook_url: "https://example.com/runbooks/high-api-latency"
```
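The grouping-and-routing half of this split happens in Alertmanager. As a rough sketch of how the `team` label above can drive notification routing (the receiver names and webhook URLs here are made up for illustration):

```yaml
# alertmanager.yml routing sketch - receivers and URLs are illustrative placeholders
route:
  receiver: default
  group_by: ['alertname', 'team']
  routes:
    - matchers:
        - team="backend"
      receiver: backend-oncall

receivers:
  - name: default
    webhook_configs:
      - url: 'https://example.com/hooks/default'
  - name: backend-oncall
    webhook_configs:
      - url: 'https://example.com/hooks/backend'
```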
Practical Examples
Example 1: Service Availability Alerting
```yaml
# Critical alert for service unavailability
- alert: ServiceDown
  expr: up{job="important-service"} == 0
  for: 1m
  labels:
    severity: critical
    team: sre
  annotations:
    summary: "Service {{ $labels.job }} is down"
    description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"
    runbook_url: "https://example.com/runbooks/service-down"

# Warning alert for degraded service
- alert: HighErrorRate
  expr: sum by (job, instance) (rate(http_requests_total{code=~"5.."}[5m])) / sum by (job, instance) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: warning
    team: sre
  annotations:
    summary: "High error rate on {{ $labels.job }}"
    description: "Error rate is {{ $value | humanizePercentage }} for the past 5 minutes on {{ $labels.instance }}"
    dashboard_url: "https://grafana.example.com/d/abc123/service-dashboard?var-job={{ $labels.job }}"
```
Example 2: Database Performance Alerting
```yaml
# Warning alert for slow database queries
- alert: SlowDatabaseQueries
  expr: histogram_quantile(0.95, rate(database_query_duration_seconds_bucket[5m])) > 0.5
  for: 10m
  labels:
    severity: warning
    team: database
  annotations:
    summary: "Slow database queries on {{ $labels.instance }}"
    description: "95th percentile query time is {{ $value }} seconds for the past 10 minutes"

# Critical alert for database connection saturation
- alert: DatabaseConnectionsSaturated
  expr: sum(postgres_stat_activity_count) / max(postgres_settings_max_connections) > 0.8
  for: 5m
  labels:
    severity: critical
    team: database
  annotations:
    summary: "Database connections near limit"
    description: "{{ $value | humanizePercentage }} of available database connections are in use"
    runbook_url: "https://example.com/runbooks/database-connections"
```
Example 3: Multi-level Disk Space Alerting
```yaml
# Warning level disk space alert
- alert: DiskSpaceFillingUp
  expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes > 0.85
  for: 30m
  labels:
    severity: warning
    team: infrastructure
  annotations:
    summary: "Disk space filling up on {{ $labels.instance }}"
    description: "Disk {{ $labels.device }} mounted on {{ $labels.mountpoint }} is {{ $value | humanizePercentage }} full"

# Critical level disk space alert
- alert: DiskSpaceCritical
  expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes > 0.95
  for: 5m
  labels:
    severity: critical
    team: infrastructure
  annotations:
    summary: "Critical disk space on {{ $labels.instance }}"
    description: "Disk {{ $labels.device }} mounted on {{ $labels.mountpoint }} is {{ $value | humanizePercentage }} full"
    runbook_url: "https://example.com/runbooks/disk-full"
```
Alert Maintenance Best Practices
1. Document Your Alerts
Maintain documentation for all alerts, including:
- The reason for the alert
- Expected action to take
- Links to relevant dashboards and runbooks
2. Regularly Review Alert Rules
Set up a periodic review process:
- Are alerts still relevant?
- Do thresholds need adjustment?
- Are there false positives?
- Are there gaps in coverage?
3. Test Alert Rules Before Deploying
Use Prometheus's testing capabilities to verify alert behavior:
```bash
# Validate the syntax of your alert rule files before deploying
promtool check rules alerts.yml

# Evaluate a rule expression against a running Prometheus server
promtool query instant http://prometheus:9090 'sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.1'
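Beyond syntax checks, promtool can unit-test rules against synthetic series. Here is a minimal sketch, assuming the symptom-based HighErrorRate rule from earlier is saved in alerts.yml; the input series values are fabricated to produce a 20% error rate:

```yaml
# alerts_test.yml - run with: promtool test rules alerts_test.yml
# Assumes the HighErrorRate rule shown earlier lives in alerts.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 1 error/s vs 4 successes/s -> 20% error rate, above the 10% threshold
      - series: 'http_requests_total{job="api-service", status="500"}'
        values: '0+60x10'
      - series: 'http_requests_total{job="api-service", status="200"}'
        values: '0+240x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: warning
              job: api-service
```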
4. Use Version Control for Alert Rules
Store alert rules in version control:
- Document changes with meaningful commit messages
- Use pull requests for reviews
- Consider automated testing in CI/CD pipelines (see the sketch below)
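One lightweight way to wire this into CI (the file paths and image tag here are assumptions) is to run promtool from the official prom/prometheus image on every change:

```bash
# CI step sketch: validate rule files using the promtool binary shipped in the prom/prometheus image
docker run --rm \
  -v "$(pwd)":/rules \
  --entrypoint /bin/promtool \
  prom/prometheus:latest \
  check rules /rules/alerts.yml
```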
Common Alerting Antipatterns to Avoid
1. Alert Overload
Problem: Too many alerts causing fatigue and missed important issues.
Solution: Regularly audit alerts and eliminate non-actionable ones.
2. Poor Alert Descriptions
Problem: Vague alert descriptions lead to confusion during incidents.
Solution: Include specific details, values, and links to runbooks.
3. Static Thresholds for Dynamic Systems
Problem: Fixed thresholds don't account for normal variations in system behavior.
Solution: Consider using dynamic thresholds or percentile-based alerting.
```yaml
# Instead of a fixed threshold
- alert: HighCPUUsage
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.8

# Consider a relative threshold that compares current usage to recent history
- alert: AbnormalCPUUsage
  expr: |
    avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      > avg_over_time((avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])))[1d:5m]) * 1.5
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Abnormal CPU usage on {{ $labels.instance }}"
    description: "CPU usage is more than 50% higher than its 24-hour average"
```
Summary
Effective Prometheus alerting requires thoughtful design and regular maintenance. By following these best practices, you can build an alerting system that:
- Notifies you of real problems that need attention
- Minimizes false positives and alert fatigue
- Provides clear, actionable information to responders
- Adapts to the changing needs of your systems
Remember that good alerting is an ongoing process that requires regular review and refinement. Start with a small set of critical alerts and expand as you gain experience with your specific system's behavior.
Additional Resources
- Prometheus Alerting Documentation
- Google's Site Reliability Engineering Book - Chapter on Monitoring
- Prometheus Alert Rules Examples GitHub Repository
- AlertManager Configuration Documentation
Exercises
- Review your existing alert rules and classify them as symptom-based or cause-based. Convert at least one cause-based alert to focus on user-visible symptoms.
- Create a multi-level alert for a service of your choice with appropriate warning and critical thresholds.
- For a counter metric in your system (such as `http_requests_total`), write an appropriate alert rule using the `rate()` function.
- Design an alert rule that uses templating in annotations to provide specific, actionable information.
- Create a dashboard visualization that helps you determine appropriate thresholds for a metric you're considering alerting on.