# PromQL Rate Function

## Introduction
The `rate()` function is one of the most fundamental and frequently used functions in PromQL (Prometheus Query Language). It's essential for analyzing how counter metrics change over time, allowing you to calculate the per-second average rate of increase of a time series within a specified time window.

In monitoring systems like Prometheus, many metrics are stored as counters: values that only increase over time (except when they reset, typically after a process restart). Examples include total HTTP requests received, bytes sent, or errors encountered. While raw counter values tell you the total count since the start, the rate of change often provides more actionable insights.

This is where the `rate()` function comes in: it transforms monotonically increasing counter values into per-second rates that help you understand system behavior over time.
## Syntax and Basic Usage
The basic syntax of the `rate()` function is:

```promql
rate(counter_metric[time_range])
```

Where:

- `counter_metric` is a counter-type metric
- `time_range` is the time window (or "lookback window") used to calculate the rate
The `rate()` function:
- Takes a range vector as input (a time series with values over a time range)
- Calculates the per-second average rate of increase over that time range
- Returns an instant vector with the calculated rate values
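To make the input and output types concrete, here is a minimal sketch using the `http_requests_total` counter that appears throughout this guide:

```promql
# Range vector selector: the raw counter samples from the last 5 minutes
http_requests_total[5m]

# rate() consumes that range vector and returns an instant vector:
# one per-second rate value per time series
rate(http_requests_total[5m])
```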
## How Rate Calculation Works
The `rate()` function uses the following approach to calculate the per-second rate:
- It takes the first and last data points within the specified time range
- Calculates the difference between these values
- Divides this difference by the time difference in seconds
- Accounts for counter resets (when a counter goes back to zero after a process restart)
The formula can be represented as:

```
rate ≈ (last_value - first_value) / time_difference_in_seconds
```

There is additional handling for counter resets, plus extrapolation when the first and last samples do not align exactly with the window boundaries.
### Example: Basic Rate Calculation
Consider a counter metric `http_requests_total` that tracks the total number of HTTP requests. To calculate the per-second rate of HTTP requests over the last 5 minutes:

```promql
rate(http_requests_total[5m])
```
If the counter had these values:
- 100 at t=0 seconds
- 160 at t=300 seconds (5 minutes)
The calculation would be:

```
rate = (160 - 100) / 300 = 0.2 requests per second
```
## Counter Resets and How Rate Handles Them
One important feature of `rate()` is its ability to handle counter resets. When a service restarts, its counters typically reset to zero. The `rate()` function detects these resets and still calculates the rate correctly.
For example, if a counter had these values:
- 100 at t=0 seconds
- 0 at t=150 seconds (after a service restart)
- 50 at t=300 seconds
The `rate()` function detects the reset (the drop from 100 to 0), compensates by treating the samples as if the counter had kept increasing (effectively 100, 100, 150), and calculates:

```
rate = ((100 - 100) + (50 - 0)) / 300 ≈ 0.17 requests per second
```

Only the increase actually observed inside the window, 50 after the restart, counts toward the rate.
This ability to handle counter resets makes `rate()` robust for real-world monitoring scenarios where services may restart.
## Best Practices for Time Range Selection
The time range you select affects the sensitivity and accuracy of your rate calculations:
- Too short (e.g., `[30s]`): more responsive to sudden changes, but more susceptible to noise and scrape gaps
- Too long (e.g., `[1h]`): smoother, but might mask important short-term variations
General guidelines:
- For high-frequency metrics: 1-5 minutes
- For standard metrics: 5-15 minutes
- For slow-changing metrics: 15+ minutes
A common starting point is `[5m]`, which balances responsiveness and stability:

```promql
rate(http_requests_total[5m])
```
## `rate()` vs. `irate()`
PromQL offers two main functions for calculating rates:

- `rate()`: Calculates the per-second average rate over the entire time range
- `irate()`: Calculates the per-second instant rate using only the last two data points
Here's a comparison:
| Function | Calculation Method | Use Case | Advantages | Disadvantages |
|---|---|---|---|---|
| `rate()` | Average over the entire range | General monitoring, dashboards | Smooths out spikes, better for graphing | May miss short-lived spikes |
| `irate()` | Only the last two samples | Alerting, detecting sudden changes | More responsive to sudden changes | Noisier, less stable |
For most dashboard visualizations, `rate()` is preferred because it provides a more stable signal.
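As a quick side-by-side on the same metric used throughout this guide, the two queries differ only in the function name, yet they can return noticeably different values during traffic spikes:

```promql
# Average per-second rate over the last 5 minutes: smooth and dashboard-friendly
rate(http_requests_total[5m])

# Per-second rate from only the last two samples in the window: reacts quickly, but noisier
irate(http_requests_total[5m])
```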
## Real-World Examples

### Example 1: HTTP Request Rate by Endpoint

This query calculates the per-second rate of HTTP requests for each endpoint:

```promql
rate(http_requests_total{job="api-server"}[5m])
```

You can add label filters to focus on specific endpoints:

```promql
rate(http_requests_total{job="api-server", endpoint="/api/users"}[5m])
```
### Example 2: Error Rates and Success Rates

Calculate the per-second rate of errors:

```promql
rate(http_errors_total[5m])
```

Calculate the error percentage (combining two rate calculations):

```promql
rate(http_errors_total[5m]) / rate(http_requests_total[5m]) * 100
```
### Example 3: Network Traffic Throughput

Calculate network throughput in MB/s:

```promql
rate(network_bytes_transferred{interface="eth0"}[5m]) / (1024 * 1024)
```
### Example 4: CPU Usage Rate

Calculate CPU usage from a counter of total CPU seconds consumed:

```promql
rate(process_cpu_seconds_total{job="app-server"}[5m]) * 100
```

This gives the process's CPU usage as a percentage of a single CPU core (it can exceed 100% for multi-threaded processes).
## Visualizing Rate Data

Rate data is typically visualized on time-series graphs, where turning an ever-increasing counter into a per-second rate makes trends and anomalies much easier to spot.
A typical dashboard might include the following panels (a sketch of matching queries follows this list):

- Raw request count (using the counter directly)
- Request rate (using `rate()`)
- Error rate (using `rate()` on error counters)
- Success percentage (calculated from rates)
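Here is a minimal sketch of what those panel queries could look like, reusing the `http_requests_total` and `http_errors_total` counters from the examples above and assuming `http_errors_total` counts failed requests:

```promql
# Raw request count: the counter value itself
http_requests_total

# Request rate
rate(http_requests_total[5m])

# Error rate
rate(http_errors_total[5m])

# Success percentage
(1 - rate(http_errors_total[5m]) / rate(http_requests_total[5m])) * 100
```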
## Common Pitfalls and Solutions

### 1. Using Rate with Non-Counter Metrics
The `rate()` function is designed specifically for counter metrics. Using it with gauge metrics will produce incorrect results.

❌ Incorrect:

```promql
rate(node_memory_MemFree_bytes[5m]) # MemFree is a gauge, not a counter
```
✅ Correct approach for gauges: use functions like `delta()` or `deriv()` instead:

```promql
delta(node_memory_MemFree_bytes[5m]) # Change over 5m, not a per-second rate
```
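Since `deriv()` is mentioned above but not shown, here is a brief sketch of that alternative; it returns a per-second slope for a gauge, computed with simple linear regression over the window:

```promql
# Per-second rate of change of a gauge over the last 5 minutes
deriv(node_memory_MemFree_bytes[5m])
```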
### 2. Time Range Too Small
If your time range is too small, the window may not contain the two or more data points `rate()` needs, especially if your scrape interval is close to the range.
❌ Potential issue:

```promql
rate(http_requests_total[10s]) # If scrape interval is 15s, this may not work reliably
```
✅ Better approach:

```promql
rate(http_requests_total[1m]) # Ensure multiple data points in the range
```
The general rule is to use a time range at least 4 times your scrape interval.
### 3. Alerting on Spurious Spikes
Alerting on `rate()` can sometimes trigger false alarms due to temporary spikes.
❌ Sensitive to spikes:

```promql
rate(http_errors_total[1m]) > 5
```
✅ More robust alerting:

```promql
rate(http_errors_total[5m]) > 5
```
For alerting specifically on spikes, `irate()` might be appropriate with properly chosen thresholds.
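Another way to make alerts more robust, building on the error-ratio query from Example 2, is to alert on the error percentage rather than the absolute error rate; the 5% threshold here is only an illustrative assumption:

```promql
# Fire when more than 5% of requests fail, averaged over the last 5 minutes
rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.05
```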
## Advanced Usage: Combining with Other Functions
The `rate()` function is often combined with other PromQL functions for more sophisticated analyses:
### Aggregating Rates Across Instances

```promql
sum by (instance) (rate(http_requests_total[5m]))
```

This calculates the per-second request rate for every matching time series, then sums those rates so you get one total per instance (for example, summing across the endpoints served by the same instance).
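A couple of related aggregations help show how the `by` clause controls the grouping; these reuse the same metric and the `endpoint` label from Example 1:

```promql
# One total request rate across all instances and label combinations
sum(rate(http_requests_total[5m]))

# One series per endpoint instead of per instance
sum by (endpoint) (rate(http_requests_total[5m]))
```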
### Moving Averages of Rates

```promql
avg_over_time(rate(http_requests_total[5m])[1h:15m])
```

This subquery evaluates the 5-minute rate every 15 minutes over the last hour, and `avg_over_time` then averages those values into a 1-hour moving average.
### Predicting Future Values

```promql
predict_linear(rate(http_requests_total[6h])[1h:], 3600)
```

This fits a linear trend to the last hour of rate values (each computed over a 6-hour window) and predicts what the rate will be 1 hour (3600 seconds) from now.
## Summary
The `rate()` function is a cornerstone of PromQL that transforms counter metrics into more actionable per-second rates. Key points to remember:

- Use `rate()` only with counter metrics
- Select an appropriate time range that balances responsiveness and stability
- Remember that `rate()` handles counter resets automatically
- Use `rate()` for visualization and general monitoring; consider `irate()` for alerting on sudden changes
- Combine it with aggregation and other functions for more sophisticated analyses
By mastering the `rate()` function, you'll be able to extract meaningful insights from your time-series data and build effective monitoring dashboards.
## Exercises
- Calculate the per-second rate of HTTP requests over the last 10 minutes.
- Compare the error rates between different service endpoints.
- Create a query that shows the percentage of CPU usage per container.
- Build a query that calculates the ratio of errors to total requests over 5 minutes.
- Implement a query that predicts what your request rate will be in 4 hours based on the current trend.