PromQL Functions
Introduction
Functions are a fundamental component of PromQL (Prometheus Query Language) that allow you to transform, aggregate, and manipulate time series data. In this guide, we'll explore the various functions available in PromQL, understand their syntax, and learn how to use them effectively in your monitoring queries.
PromQL functions enhance your ability to extract meaningful insights from your metrics data. They can help you calculate rates, aggregate values, predict trends, and transform data into more useful formats. Understanding these functions is essential for building powerful monitoring dashboards and alerting rules.
Function Categories
PromQL functions can be broadly categorized into several types based on their purpose:
- Aggregation Functions: Combine multiple time series into fewer series
- Counter Functions: Work with counter metrics (continuously increasing values)
- Mathematical Functions: Perform mathematical operations on time series data
- Rate Functions: Calculate rates of change
- Time Functions: Manipulate timestamps or perform time-based calculations
- Label Manipulation: Modify, add, or remove labels from time series
The sections below cover the most commonly used of these, along with histogram and vector-matching helpers.
Aggregation Functions
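Aggregation functions (the Prometheus documentation calls them aggregation operators) combine the elements of an instant vector into a smaller set of series. By default they collapse all labels; the by and without clauses control which labels are kept.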
sum
The sum function adds the values of multiple time series together.
sum(http_requests_total)
This query sums the values of every http_requests_total series, across all instances and label combinations, and returns a single time series.
You can also aggregate by specific labels:
sum by (job) (http_requests_total)
This groups the series by the job label, summing the values for each job.
avg
The avg function calculates the average value across multiple time series:
avg(node_cpu_seconds_total{mode="idle"})
As with sum, you can use by or without to specify grouping:
avg by (instance) (node_cpu_seconds_total{mode="idle"})
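The text mentions without but only shows by, so here is a minimal sketch of the complementary form. It reuses the http_requests_total metric from above and assumes the series carry an instance label:

```promql
# Sum over instances, keeping every other label (job, path, status, ...)
sum without (instance) (http_requests_total)
```

Use by when you know which labels you want to keep, and without when you know which ones you want to drop.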
Other Aggregation Functions
PromQL includes several other aggregation operators:
- min: Selects the smallest value
- max: Selects the largest value
- count: Counts the number of elements in the vector
- stddev: Calculates the population standard deviation
- stdvar: Calculates the population standard variance
- topk: Selects the k largest elements
- bottomk: Selects the k smallest elements
- quantile: Calculates the φ-quantile (0 ≤ φ ≤ 1)
Example using topk:
# Find the 3 busiest HTTP endpoints
topk(3, sum by (path) (rate(http_requests_total[5m])))
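As a further sketch, two of the other operators listed above can be used in the same way. The metric name follows the earlier examples, and the job label is assumed to be present on the series:

```promql
# How many http_requests_total series each job exposes
count by (job) (http_requests_total)

# Median per-series request rate within each job
quantile by (job) (0.5, rate(http_requests_total[5m]))
```

Note that quantile, like topk, takes its parameter as the first argument inside the parentheses.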
Counter Functions
Counter functions are designed to work with counter metrics, which always increase over time (except when they reset to zero, like during a restart).
rate
The rate function calculates the per-second average rate of increase over a time window:
rate(http_requests_total[5m])
This calculates the per-second rate of HTTP requests over the last 5 minutes.
irate
The irate function calculates the per-second instant rate from the last two data points in the window:
irate(http_requests_total[5m])
irate is more responsive to recent changes but can be more volatile than rate.
increase
The increase function calculates the total increase in a counter over a time window:
increase(http_requests_total[1h])
This gives the total number of requests over the past hour. Because Prometheus extrapolates to the edges of the window, the result may not be a whole number.
When to use which counter function?
- Use rate for regular graphing and alerting on counters.
- Use irate for fast-changing counters or when you need to see rapid changes.
- Use increase when you want the absolute increase rather than a per-second rate.
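To make the relationship between these functions concrete, the sketch below reuses http_requests_total. It relies on increase being rate multiplied by the number of seconds in the window, and adds resets, which counts how often the counter dropped within the window:

```promql
# Total increase over the last hour...
increase(http_requests_total[1h])

# ...is the per-second rate scaled by the window length in seconds
rate(http_requests_total[1h]) * 3600

# Number of counter resets (for example, restarts) seen in the same window
resets(http_requests_total[1h])
```

The first two queries should return the same values, which is why the choice between them is mostly about the unit you want to read.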
Mathematical Functions
PromQL offers various mathematical functions to transform your data.
abs, ceil, floor, round
Basic mathematical operations:
abs(temperature - 273.15) # Convert Kelvin to Celsius, then take the absolute value (sub-zero readings lose their sign)
ceil(cpu_usage) # Round up CPU usage
floor(memory_fraction) # Round down memory usage
round(response_time_seconds) # Round to nearest integer
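Two related variations may be useful here. round accepts an optional second argument giving the multiple to round to, and clamp_min is often a better fit than abs when the goal is to keep a value from going below zero. The metric names are the illustrative ones used above, not standard exporter metrics:

```promql
# Round to the nearest 0.1 instead of the nearest integer
round(response_time_seconds, 0.1)

# Convert Kelvin to Celsius, but never report less than 0
clamp_min(temperature - 273.15, 0)
```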
Trigonometric and Other Mathematical Functions
PromQL supports several advanced mathematical functions:
- sqrt: Square root
- ln, log2, log10: Logarithms
- exp: Exponential function
- sin, cos, tan: Trigonometric functions
Example:
sqrt(process_resident_memory_bytes / 1024 / 1024)
This calculates the square root of the memory usage in MB.
Histogram Functions
Prometheus often stores latency and size metrics as histograms. PromQL provides functions to analyze these histograms.
histogram_quantile
The histogram_quantile function calculates quantiles from histogram metrics:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
This query returns the 95th percentile of HTTP request durations over the last 5 minutes.
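For comparison, an average latency can be derived from the same histogram without histogram_quantile. This sketch assumes a classic histogram, which exposes _sum and _count series alongside the _bucket series:

```promql
# Average request duration over the last 5 minutes:
# total seconds spent divided by total requests served
  sum(rate(http_request_duration_seconds_sum[5m]))
/
  sum(rate(http_request_duration_seconds_count[5m]))
```

Comparing the average with the 95th percentile is a quick way to see whether a small share of slow requests is skewing latency.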
Time and Date Functions
These functions help with time-based calculations.
time
The time function returns the query evaluation time as a scalar, in seconds since the Unix epoch (January 1, 1970 UTC):
time()
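day_of_week, day_of_month, day_of_year
These functions return the day number in UTC. day_of_week starts at 0 (Sunday), while day_of_month and day_of_year start at 1. They take an optional instant vector of timestamps and default to the query evaluation time, so a scalar such as time() must be wrapped in vector() if you pass it explicitly:
day_of_week() # 0 = Sunday, 6 = Saturday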
hour, minute, month, year
Extract components from a timestamp (in UTC). As above, the argument is an optional instant vector that defaults to the evaluation time:
hour() # 0-23
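The following sketch shows two common ways these functions are combined with ordinary metrics. process_start_time_seconds and up are widely exposed, but treat their presence in your setup as an assumption. All date functions work in UTC:

```promql
# Seconds since each process started (its uptime)
time() - process_start_time_seconds

# Targets that are down, reported only Monday to Friday (UTC)
up == 0 and on() (day_of_week() > 0 and day_of_week() < 6)
```

The on() with an empty label list lets the label-less result of day_of_week() match every series on the left-hand side.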
Vector Matching Functions
These functions help when working with multiple time series that need to be combined.
vector
Returns a scalar as an instant vector with no labels:
vector(1)
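on and ignoring
These modifiers control which labels are used to match series on the two sides of a binary operator. The two sides below have different method values, so method must be left out of the matching labels or nothing would match:
api_requests_total{method="GET"} / on(instance) api_requests_total{method="POST"}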
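As a sketch of the complementary modifier, ignoring lists the labels to exclude from matching rather than the labels to include. The first query assumes the two sides differ only in the method label. The second shows a common use of vector: supplying a default value when an expression returns no series.

```promql
# Ratio of GET to POST requests, matching on every label except method
api_requests_total{method="GET"} / ignoring(method) api_requests_total{method="POST"}

# Return 0 instead of an empty result when no 5xx series exist yet
sum(rate(http_requests_total{status=~"5.."}[5m])) or vector(0)
```

The or fallback works because both sides carry no labels: sum without a grouping clause drops them all, and vector(0) never has any.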
Practical Examples
Let's look at some real-world examples of combining PromQL functions to answer common monitoring questions.
Example 1: Error Rate Calculation
Calculate the percentage of HTTP requests that resulted in 5xx errors:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
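The same calculation can be broken down by a label, as long as both sides of the division are grouped identically. This sketch assumes the series carry a job label:

```promql
# Percentage of requests answered with a 5xx status, per job
  sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/
  sum by (job) (rate(http_requests_total[5m]))
* 100
```

Jobs that have not served any 5xx responses will be missing from the result rather than shown as 0, because the numerator has no series for them.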
Example 2: CPU Usage per Core
Calculate the non-idle CPU utilization of each instance as a percentage, averaged across its cores:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Example 3: Predicting Resource Exhaustion
Predict when disk space will run out based on current usage trend:
(node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
and
predict_linear(node_filesystem_avail_bytes[6h], 24 * 3600) < 0
This alerts when:
- Less than 10% disk space is available, AND
- The disk is predicted to fill up within the next 24 hours based on the last 6 hours of data.
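A related question is how long the disk has left. One possible sketch, not taken from the example above, uses deriv, which estimates the per-second slope of a gauge with the same kind of linear regression that predict_linear relies on:

```promql
# Estimated seconds until the filesystem is full,
# only for filesystems whose free space is currently shrinking
  node_filesystem_avail_bytes
/
  (-deriv(node_filesystem_avail_bytes[6h]) > 0)
```

The > 0 filter drops filesystems whose free space is flat or growing, so they do not produce negative or infinite estimates. The estimate shares the linear-trend assumption and should be read as a rough guide.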
Example 4: Detecting Service Degradation
Detect when a service's 95th percentile latency exceeds a threshold:
histogram_quantile(0.95, sum(rate(api_request_duration_seconds_bucket[5m])) by (le)) > 0.5
This alerts when the 95th percentile of API request duration exceeds 500ms.
Common Function Combinations
Some function combinations are frequently used together:
Rate then Sum
sum(rate(http_requests_total[5m])) by (path)
This pattern first calculates the rate for each time series, then sums them by path.
Aggregation then Histogram Quantile
histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])) by (le, job))
This calculates request duration quantiles per job.
Function Pitfalls and Gotchas
Extrapolation Issues with predict_linear
The predict_linear function assumes a linear trend, which might not always be accurate for real-world data. Use with caution for long-term predictions.
Aggregation Order Matters
The order of operations can significantly affect results:
sum(rate(counter[5m])) # Calculate rate first, then sum
rate(sum(counter)[5m:]) # Sum first, then calculate rate via a subquery (usually incorrect: the sum hides individual counter resets)
Histogram Quantiles Are Approximate
The histogram_quantile function provides an approximation based on bucket boundaries. The accuracy depends on how your histogram buckets are defined.
Summary
PromQL functions form the backbone of effective monitoring with Prometheus. They enable you to:
- Transform raw metrics into meaningful insights
- Calculate rates of change and trends
- Aggregate data across multiple dimensions
- Make predictions based on historical data
- Create powerful alerting rules
By combining these functions appropriately, you can build comprehensive monitoring solutions that help you understand system behavior and detect problems early.
Additional Resources
To deepen your understanding of PromQL functions:
- Experiment with the Prometheus expression browser in your own environment
- Try writing queries that answer specific questions about your systems
- Refer to the official Prometheus documentation for the complete function reference
Exercises
- Write a PromQL query to find the 3 most CPU-intensive processes on your system.
- Create a query to calculate the memory usage growth rate and predict when you might run out of memory.
- Write a query to detect when the ratio of errors to total requests exceeds 5% over a 5-minute window.
- Create a dashboard panel showing the 95th percentile request latency broken down by endpoint.
- Write an alert expression that triggers when any instance has less than 10% disk space remaining and is predicted to run out within 48 hours.