PromQL Best Practices
Introduction
PromQL (Prometheus Query Language) is the powerful query language that makes Prometheus such an effective monitoring system. While the basics of PromQL are straightforward, writing efficient, maintainable, and accurate queries requires understanding some key principles and patterns. This guide covers essential best practices that will help you write better PromQL queries, avoid common pitfalls, and get the most value from your monitoring data.
Optimize for Query Efficiency
Use Rate() Correctly
When calculating rates from counters, use the rate() function appropriately:
# Good practice: a range comfortably larger than the scrape interval
rate(http_requests_total[5m])
# Not ideal: too short a range can lead to inaccurate or empty results
rate(http_requests_total[10s])
The general guideline is to use a range at least 4 times the scrape interval to ensure enough samples for accurate calculations.
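For example, assuming a 15-second scrape interval, a one-minute range is about the smallest window that still captures four samples:
# Assuming a 15s scrape interval: [1m] covers roughly 4 samples
rate(http_requests_total[1m])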
Prefer rate() over irate() for Dashboards
While irate() gives the instant rate between the last two data points, rate() provides a more stable view by averaging over a time window:
# Better for dashboards - smoother, less sensitive to outliers
rate(http_requests_total[5m])
# More variable, can show spikes that rate() smooths out
irate(http_requests_total[5m])
Use irate() when graphing volatile, fast-moving counters where you want to see momentary changes. Prefer rate() for alerting and for most dashboard visualizations, since its averaging produces more stable results.
Avoid Expensive Operations
Minimize the use of expensive operations, especially on large result sets:
# Expensive: sorting a large topk result
sort(topk(1000, sum by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))))
# Better: Narrow down before applying expensive operations
topk(10, sum by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
Functions like sort(), topk() with large values, and *_over_time() functions with long ranges can significantly impact query performance.
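If you do need a long look-back, narrow the selector first so the expensive function operates on fewer series. A small sketch, reusing the metric and label from the examples above:
# Expensive: a day of samples for every series of the metric
max_over_time(memory_usage_bytes[24h])
# Cheaper: restrict the series before applying the long-range function
max_over_time(memory_usage_bytes{service="auth-api"}[24h])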
Improve Query Readability and Maintainability
Use Labels Effectively
Leverage meaningful labels to make your queries more specific and easier to understand:
# Too broad
http_requests_total
# Better: filtered by specific service and environment
http_requests_total{service="auth-api", env="production"}
Well-structured labels make your queries more precise and easier to debug.
Write Clear Multi-step Queries with Comments
For complex queries, use comments and split logic into clear steps:
# Get HTTP error rate percentage by service
# 1. Calculate rate of 5xx responses
(
sum by(service) (rate(http_requests_total{status=~"5.."}[5m]))
/
# 2. Calculate rate of all responses
sum by(service) (rate(http_requests_total[5m]))
) * 100
Clear comments and formatting make complex queries easier to understand and maintain.
Follow Consistent Naming Conventions
Establish consistent naming patterns for your metrics and queries:
# Consistent naming pattern for request metrics
api_requests_total{path="/users", method="GET"}
api_request_duration_seconds{path="/users", method="GET"}
# Inconsistent naming (avoid this)
api_requests{path="/users", method="GET"}
api_latency_ms{path="/users", method="GET"}
Consistent naming makes it easier to discover related metrics and build dashboards.
Handle Edge Cases Correctly
Account for Counter Resets
Counters can reset when services restart. The rate() function handles this automatically, but be aware when doing manual calculations:
# Handles counter resets correctly
rate(process_cpu_seconds_total[5m])
# Dangerous: a manual difference doesn't handle counter resets
process_cpu_seconds_total - process_cpu_seconds_total offset 5m
Always use rate(), irate(), or increase() when working with counters; these functions detect and correct for resets, whereas delta() is meant for gauges and does not.
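When you want an absolute count over a window rather than a per-second rate, increase() is the reset-safe option:
# Reset-safe number of requests served over the last hour
increase(http_requests_total[1h])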
Use absent() for Alerting on Missing Metrics
When alerting, check for missing metrics to avoid false negatives:
# Alert triggers only if metric exists and is above threshold
http_errors_total > 100
# Better for alerting: handles case where metric disappears
http_errors_total > 100 or absent(http_errors_total)
This pattern ensures your alerts fire even if a service stops reporting metrics entirely.
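In an alerting rule file, the missing-metric case can also be a dedicated alert. A minimal sketch, assuming an illustrative group name, alert name, and job label:
groups:
  - name: availability
    rules:
      - alert: ErrorMetricMissing
        # Fires once the metric has been absent for 5 minutes (job label is an example)
        expr: absent(http_errors_total{job="api-server"})
        for: 5m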
Handle Highly Dynamic Labels Carefully
Labels that have high cardinality (many possible values) can cause performance issues:
# Problematic: user_id has high cardinality
requests_total{user_id="12345"}
# Better: aggregate first, then filter
sum by(endpoint) (requests_total) > 1000
Avoid using high-cardinality labels in grouping operations, and be cautious about the multiplicative growth in series count when several labeled dimensions are combined.
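If a high-cardinality label such as user_id is already being exported, aggregate it away as early as possible. A sketch reusing the metric from the example above (the threshold is arbitrary):
# Collapse per-user series before comparing against a threshold
sum without(user_id) (rate(requests_total[5m])) > 10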
Effective Data Visualization Patterns
Calculate Percentages Correctly
When calculating percentages, ensure proper aggregation:
# Correct way to calculate percentage of error responses
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100
# Incorrect: division happens per time series before summing
sum(
rate(http_requests_total{status=~"5.."}[5m])
/
rate(http_requests_total[5m])
* 100
)
Aggregate the numerator and denominator separately before division to get accurate percentages.
Use Recording Rules for Complex or Frequent Queries
For complex queries that are used frequently in dashboards, define recording rules:
# In a rules file loaded via rule_files in prometheus.yml
groups:
  - name: http_requests
    rules:
      - record: job:http_requests:rate5m
        expr: sum by(job) (rate(http_requests_total[5m]))
Then in your dashboards or alerts, use the pre-computed metric:
job:http_requests:rate5m{job="api-server"}
Recording rules improve performance and ensure consistency across dashboards.
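For completeness, the rules file itself has to be referenced from prometheus.yml (the file name below is an example):
# prometheus.yml
rule_files:
  - "recording_rules.yml"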
Monitoring Your Services Effectively
Focus on the Four Golden Signals
Structure your monitoring queries around these fundamental aspects:
- Latency: How long it takes to service a request
  histogram_quantile(0.95, sum by(le, service) (rate(http_request_duration_seconds_bucket[5m])))
- Traffic: The demand placed on your system
  sum by(service) (rate(http_requests_total[5m]))
- Errors: The rate of failed requests
  sum by(service) (rate(http_requests_total{status=~"5.."}[5m]))
- Saturation: How "full" your service is
  (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100
These four signals provide a comprehensive view of service health.
Use Rate for Counters, Gauge for Direct Values
Match the function to the metric type:
# For counter metrics (always increasing)
rate(http_requests_total[5m])
# For gauge metrics (can go up and down)
memory_usage_bytes
Understanding the metric type helps you choose appropriate PromQL functions.
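Gauges have their own range functions when you do want smoothing or change over time:
# Smooth a gauge over the last 10 minutes
avg_over_time(memory_usage_bytes[10m])
# Net change of a gauge over the last hour (can be negative)
delta(memory_usage_bytes[1h])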
Practical Examples
Example 1: Service Health Dashboard Queries
Here's a set of queries for a comprehensive service dashboard:
# Request volume by endpoint
sum by(endpoint) (rate(http_requests_total[5m]))
# 95th percentile latency by endpoint
histogram_quantile(0.95, sum by(endpoint, le) (rate(http_request_duration_seconds_bucket[5m])))
# Error percentage by endpoint
(
sum by(endpoint) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by(endpoint) (rate(http_requests_total[5m]))
) * 100
# Service saturation (CPU usage)
1 - (
sum by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
/
sum by(instance) (rate(node_cpu_seconds_total[5m]))
)
Example 2: Identifying Hotspots in a Microservice Architecture
Locate which services are generating the most load:
# Top 5 services by CPU usage
topk(5,
sum by(service) (
rate(process_cpu_seconds_total[5m])
)
)
# Top 5 services by memory usage
topk(5,
sum by(service) (
process_resident_memory_bytes
)
)
# Top 5 most called services
topk(5,
sum by(service) (
rate(http_requests_total[5m])
)
)
Example 3: Effective Alert Query Patterns
Reliable alerting requires carefully crafted queries:
# Alert on error rate exceeding 5% over 5 minutes
(
sum by(service) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by(service) (rate(http_requests_total[5m]))
) > 0.05
# Alert on service latency - 95th percentile request taking more than 500ms
histogram_quantile(0.95, sum by(service, le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
# Alert on disk space - less than 10% free
(
node_filesystem_avail_bytes / node_filesystem_size_bytes
) * 100 < 10
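In practice the disk-space alert usually needs to exclude pseudo filesystems; the fstype pattern below is a common starting point, so adjust it to your environment:
# Exclude tmpfs and overlay mounts before checking free space
(
node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
/
node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
) * 100 < 10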
Advanced PromQL Techniques
Subqueries for Trend Analysis
Analyze trends over time using subqueries:
# Maximum 5-minute error rate observed over the past hour
max_over_time(
rate(http_errors_total[5m])[1h:5m]
)
This shows the maximum error rate observed in 5-minute windows over the past hour.
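Subqueries work with the other *_over_time functions as well. As a sketch, this flags series whose 5-minute error rate never dropped below one error per second during the past hour (the threshold is arbitrary):
# Sustained errors: the lowest 5-minute rate over the past hour is still above 1/s
min_over_time(rate(http_errors_total[5m])[1h:5m]) > 1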
Using Binary Operators for Comparative Analysis
Compare metrics using binary operators:
# Compare current CPU usage to usage 1 day ago
(
sum by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
/
sum by(instance) (rate(node_cpu_seconds_total[5m]))
)
/
(
sum by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m] offset 1d))
/
sum by(instance) (rate(node_cpu_seconds_total[5m] offset 1d))
)
This query shows how current CPU utilization compares to the same time yesterday.
Summary
Writing effective PromQL queries requires more than just syntax knowledge—it demands an understanding of metric types, query performance, and monitoring best practices. By following these best practices, you can:
- Create more efficient queries that don't overload your Prometheus server
- Build more readable and maintainable monitoring configurations
- Develop more accurate and useful dashboards and alerts
- Avoid common pitfalls that lead to incorrect or misleading results
Remember that effective monitoring is an iterative process. Start with basic metrics and gradually refine your queries as you gain more understanding of your systems' behavior.
Additional Resources
- Prometheus Query Functions Documentation
- Prometheus Query Operators Documentation
- Prometheus Best Practices
- PromLabs PromQL Cheat Sheet
Exercises
- Convert a counter metric query to show a per-second rate.
- Write a query to find the 3 endpoints with the highest error rates.
- Create a query that compares current system load to the load from one week ago.
- Write a recording rule for a complex query you use frequently.
- Optimize a query that uses high-cardinality labels to improve performance.