PromQL Aggregation Operators
Introduction
When working with Prometheus, you'll often need to summarize data across multiple time series to get a higher-level view of your system's performance. This is where aggregation operators come into play. These powerful functions allow you to combine multiple time series into fewer series or even a single value, making it easier to understand trends, identify outliers, and create meaningful dashboards.
In this guide, we'll explore the various aggregation operators available in PromQL (Prometheus Query Language), understand how they work, and learn how to apply them in real-world monitoring scenarios.
Understanding Aggregation in Prometheus
Before diving into specific operators, it's important to understand what aggregation means in the context of Prometheus time series data.
Prometheus stores data as time series, where each series is uniquely identified by its metric name and a set of key-value pairs called labels. For example:
http_requests_total{method="GET", status="200", instance="10.0.0.1", job="web"}
When you have multiple time series for the same metric (e.g., http_requests_total from different instances), aggregation operators allow you to combine these series in various ways:
- Calculating the sum across all instances
- Finding the average value
- Identifying minimum or maximum values
- Counting the number of series
- And more!
Basic Syntax
Aggregation operators in PromQL follow this general syntax:
<aggregation_operator>([parameter,] <vector expression>) [by|without (<label list>)]
Where:
- <aggregation_operator> is the function you want to apply (sum, avg, min, etc.)
- [parameter,] is an optional parameter that some operators accept (for example, K in topk)
- <vector expression> is the input data to aggregate
- [by|without (<label list>)] is an optional clause that controls which labels to include or exclude
Common Aggregation Operators
sum
The sum operator adds the values of all time series that match the vector selector.
Example: Total HTTP requests across all instances
sum(http_requests_total)
Input:
http_requests_total{instance="server1", job="web"} 100
http_requests_total{instance="server2", job="web"} 200
http_requests_total{instance="server3", job="web"} 300
Output:
{} 600
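The semantics above can be sketched in Python (an illustrative simulation with the sample data from this example, not Prometheus code): with no by or without clause, summing collapses all labeled samples into a single series with an empty label set.

```python
# Hypothetical sketch of how sum() aggregates labeled samples.
# Each sample is a (labels, value) pair; without a by/without clause,
# every label is dropped and a single value remains.
samples = [
    ({"instance": "server1", "job": "web"}, 100),
    ({"instance": "server2", "job": "web"}, 200),
    ({"instance": "server3", "job": "web"}, 300),
]

total = sum(value for _, value in samples)
print({}, total)  # -> {} 600
```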
avg
The avg operator calculates the average value across all selected time series.
Example: Average CPU usage across all instances
avg(node_cpu_usage_percent)
Input:
node_cpu_usage_percent{instance="server1", job="node"} 75
node_cpu_usage_percent{instance="server2", job="node"} 25
node_cpu_usage_percent{instance="server3", job="node"} 50
Output:
{} 50
min and max
These operators find the minimum or maximum value across all selected time series.
Example: Find the most loaded instance
max(node_load1)
Input:
node_load1{instance="server1", job="node"} 0.5
node_load1{instance="server2", job="node"} 1.2
node_load1{instance="server3", job="node"} 0.8
Output:
{} 1.2
count
The count operator returns the number of time series in the input vector, regardless of their sample values.
Example: Count the number of scrape targets
count(up)
Input:
up{instance="server1", job="web"} 1
up{instance="server2", job="web"} 1
up{instance="server3", job="web"} 0
up{instance="server4", job="db"} 1
Output:
{} 4
Note that the result is 4, not 3: count(up) counts every series, including the target that is down (value 0). To count only healthy instances, use sum(up) or count(up == 1).
group
The group operator creates a new time series with a value of 1 for each group of input time series, based on the labels specified in the by clause.
Example: Group servers by job
group(node_cpu_seconds_total) by (job)
Input:
node_cpu_seconds_total{cpu="0", mode="idle", instance="server1", job="web"} 1000
node_cpu_seconds_total{cpu="0", mode="idle", instance="server2", job="web"} 2000
node_cpu_seconds_total{cpu="0", mode="idle", instance="server3", job="db"} 3000
Output:
{job="web"} 1
{job="db"} 1
topk and bottomk
These operators select the K largest or smallest elements from the input vector by sample value. Unlike the other aggregation operators, they return the selected series with their original labels intact.
Example: Find the 2 instances with the highest memory usage
topk(2, node_memory_usage_bytes)
Input:
node_memory_usage_bytes{instance="server1", job="node"} 1000000
node_memory_usage_bytes{instance="server2", job="node"} 2500000
node_memory_usage_bytes{instance="server3", job="node"} 1500000
Output:
node_memory_usage_bytes{instance="server2", job="node"} 2500000
node_memory_usage_bytes{instance="server3", job="node"} 1500000
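The behavior of topk can be sketched in Python (an illustrative analogy, not the actual implementation): pick the K series with the largest values, keeping their labels.

```python
import heapq

samples = [
    ({"instance": "server1", "job": "node"}, 1_000_000),
    ({"instance": "server2", "job": "node"}, 2_500_000),
    ({"instance": "server3", "job": "node"}, 1_500_000),
]

# topk(2, ...): keep the two series with the largest sample values,
# labels intact.
top2 = heapq.nlargest(2, samples, key=lambda s: s[1])
for labels, value in top2:
    print(labels, value)
```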
quantile
The quantile operator calculates the φ-quantile (0 ≤ φ ≤ 1) of the sample values in the input vector.
Example: Find the 90th percentile of request durations
quantile(0.9, http_request_duration_seconds)
Input:
http_request_duration_seconds{path="/api", method="GET"} 0.1
http_request_duration_seconds{path="/home", method="GET"} 0.2
http_request_duration_seconds{path="/login", method="POST"} 0.3
http_request_duration_seconds{path="/data", method="GET"} 0.4
http_request_duration_seconds{path="/admin", method="GET"} 0.5
Output:
{} 0.46
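The output value 0.46 comes from linear interpolation between the two closest sorted sample values. A sketch of that interpolation (an illustrative reimplementation, not Prometheus source code):

```python
def prom_quantile(phi, values):
    """phi-quantile with linear interpolation between the closest
    ranks, mirroring the behavior of PromQL's quantile()."""
    vals = sorted(values)
    rank = phi * (len(vals) - 1)          # fractional 0-based rank
    lower = int(rank)
    upper = min(lower + 1, len(vals) - 1)
    weight = rank - lower
    return vals[lower] * (1 - weight) + vals[upper] * weight

durations = [0.1, 0.2, 0.3, 0.4, 0.5]
# rank = 0.9 * 4 = 3.6, so the result interpolates between 0.4 and 0.5
print(prom_quantile(0.9, durations))  # -> 0.46 (within float rounding)
```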
Grouping with by and without Clauses
You can use the by and without clauses to control which labels are preserved in the result of an aggregation operation.
Using the by Clause
The by clause specifies which labels to keep, discarding all others.
Example: Sum HTTP requests by status code
sum(http_requests_total) by (status_code)
Input:
http_requests_total{instance="server1", path="/api", status_code="200"} 100
http_requests_total{instance="server2", path="/api", status_code="200"} 150
http_requests_total{instance="server1", path="/home", status_code="404"} 10
http_requests_total{instance="server2", path="/login", status_code="500"} 5
Output:
{status_code="200"} 250
{status_code="404"} 10
{status_code="500"} 5
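The grouping step can be sketched as bucketing samples on the kept label (a hypothetical Python simulation of this example's data):

```python
from collections import defaultdict

samples = [
    ({"instance": "server1", "path": "/api", "status_code": "200"}, 100),
    ({"instance": "server2", "path": "/api", "status_code": "200"}, 150),
    ({"instance": "server1", "path": "/home", "status_code": "404"}, 10),
    ({"instance": "server2", "path": "/login", "status_code": "500"}, 5),
]

# sum(...) by (status_code): bucket on the kept label, drop the rest.
grouped = defaultdict(int)
for labels, value in samples:
    grouped[labels["status_code"]] += value

print(dict(grouped))  # -> {'200': 250, '404': 10, '500': 5}
```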
Using the without Clause
The without clause specifies which labels to remove, keeping all others.
Example: Calculate the average CPU usage without instance information
avg(node_cpu_usage_percent) without (instance)
Input:
node_cpu_usage_percent{instance="server1", job="web", datacenter="us-east"} 70
node_cpu_usage_percent{instance="server2", job="web", datacenter="us-east"} 60
node_cpu_usage_percent{instance="server3", job="db", datacenter="us-west"} 50
Output:
{job="web", datacenter="us-east"} 65
{job="db", datacenter="us-west"} 50
Real-World Usage Examples
Let's explore some practical applications of aggregation operators in real-world monitoring scenarios.
Monitoring Service Health
Calculate the percentage of healthy instances for each service:
sum(up) by (job) / count(up) by (job) * 100
This query:
- Groups instances by job
- Calculates the sum of up instances (1 = up, 0 = down)
- Divides by the total count of instances
- Multiplies by 100 to get a percentage
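This works because up is 1 for a healthy target and 0 for a down one. The arithmetic can be sketched in Python (hypothetical sample data, reusing the up values from the count example):

```python
from collections import defaultdict

up_samples = [
    ({"instance": "server1", "job": "web"}, 1),
    ({"instance": "server2", "job": "web"}, 1),
    ({"instance": "server3", "job": "web"}, 0),
    ({"instance": "server4", "job": "db"}, 1),
]

# sum(up) by (job) and count(up) by (job), computed together.
sums, counts = defaultdict(int), defaultdict(int)
for labels, value in up_samples:
    sums[labels["job"]] += value
    counts[labels["job"]] += 1

health = {job: sums[job] / counts[job] * 100 for job in counts}
print(health)  # web is about 66.7% healthy, db is 100%
```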
Resource Utilization Across Clusters
Find the average CPU utilization grouped by datacenter and service:
avg(node_cpu_usage_percent) by (datacenter, service)
Identifying Outliers
Find instances with memory usage more than 2 standard deviations from the mean:
node_memory_usage_bytes >
  scalar(avg(node_memory_usage_bytes) + 2 * stddev(node_memory_usage_bytes))
The scalar() wrapper is needed because the per-instance series on the left cannot be label-matched against the single aggregated series on the right.
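PromQL's stddev is the population standard deviation (dividing by N, not N-1). The threshold logic can be sketched as follows, using the memory values from the earlier topk example (illustrative data, not real metrics):

```python
import math

memory = {"server1": 1_000_000, "server2": 2_500_000, "server3": 1_500_000}

values = list(memory.values())
mean = sum(values) / len(values)
# Population standard deviation, matching PromQL's stddev.
stddev = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
threshold = mean + 2 * stddev

outliers = {k: v for k, v in memory.items() if v > threshold}
print(outliers)  # empty: no value here exceeds mean + 2 * stddev
```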
Service Level Objectives (SLOs)
Calculate the 99th percentile of request latencies for API endpoints:
quantile(0.99, http_request_duration_seconds{handler=~"/api/.*"})
Capacity Planning
Find the top 5 services by CPU usage growth over the past week (a subquery, which requires parentheses around the inner expression):
topk(5, deriv((sum(rate(container_cpu_usage_seconds_total[5m])) by (service))[1w:1h]))
Advanced Aggregation Patterns
Multi-level Aggregation
You can chain multiple aggregation operations for more complex analyses:
# For each job, average the per-instance maximum CPU usage
avg(max(node_cpu_usage_percent) by (instance, job)) by (job)
Note that the inner aggregation must keep the job label; otherwise the outer by (job) clause has nothing left to group on.
Combining with Range Vectors
Aggregation operators can be combined with range vectors for time-based aggregation:
# Average CPU usage over the last 5 minutes for each instance
avg_over_time(node_cpu_usage_percent[5m])
Note that avg_over_time is a function, not an aggregation operator, so it does not take a by clause; it preserves all labels and returns one averaged series per input series.
Filtering Before Aggregation
You can filter time series before applying aggregation:
# Sum of all HTTP 5xx errors
sum(http_requests_total{status_code=~"5.."})
Common Pitfalls and Best Practices
Cardinality Explosion
Avoid grouping by high-cardinality labels (like user IDs) as this can create too many time series and overload Prometheus:
# Bad practice - could create millions of time series
sum(http_requests_total) by (user_id)
Missing Labels
Be aware that when you use aggregation operators, any labels not included in the by clause will be dropped from the result:
# Original labels like "instance" will be lost
sum(node_cpu_usage_percent) by (job)
Correct Handling of Counter Resets
When aggregating counters, use rate() or increase() first to handle counter resets properly:
# Correct way to sum rates across instances
sum(rate(http_requests_total[5m])) by (status_code)
Order of Operations
Pay attention to the order of operations. The following queries give different results:
# Calculate rate first, then sum across instances
sum(rate(http_requests_total[5m]))
# Calculate sum first, then rate over a subquery
# (loses information about individual counter resets)
rate(sum(http_requests_total)[5m:])
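The difference can be made concrete with a small simulation (hypothetical sample values). Applying a reset-aware increase per series and then summing recovers the true traffic; diffing the pre-summed series hides the reset:

```python
def increase(samples):
    """Reset-aware increase: a drop in a counter is treated as a
    restart from zero, mirroring how Prometheus handles resets."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur if cur < prev else cur - prev
    return total

a = [100, 150, 30, 80]    # this counter resets between 150 and 30
b = [200, 210, 220, 230]  # no reset

per_series_then_sum = increase(a) + increase(b)  # 130 + 30 = 160
summed = [x + y for x, y in zip(a, b)]           # [300, 360, 250, 310]
sum_then_diff = summed[-1] - summed[0]           # 10: the reset is hidden
print(per_series_then_sum, sum_then_diff)
```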
Summary
PromQL aggregation operators are powerful tools for transforming and summarizing time series data. They allow you to reduce the dimensionality of your data, making it easier to derive meaningful insights and create effective dashboards.
Key points to remember:
- Use the sum, avg, min, max, and count operators for basic aggregation
- Apply topk, bottomk, and quantile for more specific analyses
- Control label preservation with by and without clauses
- Be mindful of cardinality and the order of operations
- Apply aggregation operators after rate calculations when working with counters
By mastering these operators, you'll be able to extract valuable insights from your Prometheus metrics and build more effective monitoring systems.
Exercises
- Write a PromQL query to show the total number of HTTP requests by status code and method.
- Create a query to find the three services with the highest error rates.
- Write a query to calculate the 95th percentile of request latencies for each endpoint.
- Use aggregation to find the ratio of CPU usage to memory usage across all instances.
- Create a query to show the percentage of disk space used, averaged across all instances in each datacenter.