PromQL Aggregation Operators
Introduction
When working with Prometheus, you'll often need to summarize data across multiple time series to get a higher-level view of your system's performance. This is where aggregation operators come into play. These powerful functions allow you to combine multiple time series into fewer series or even a single value, making it easier to understand trends, identify outliers, and create meaningful dashboards.
In this guide, we'll explore the various aggregation operators available in PromQL (Prometheus Query Language), understand how they work, and learn how to apply them in real-world monitoring scenarios.
Understanding Aggregation in Prometheus
Before diving into specific operators, it's important to understand what aggregation means in the context of Prometheus time series data.
Prometheus stores data as time series, where each series is uniquely identified by its metric name and a set of key-value pairs called labels. For example:
http_requests_total{method="GET", status="200", instance="10.0.0.1", job="web"}
When you have multiple time series for the same metric (e.g., http_requests_total from different instances), aggregation operators allow you to combine these series in various ways:
- Calculating the sum across all instances
- Finding the average value
- Identifying minimum or maximum values
- Counting the number of series
- And more!
Basic Syntax
Aggregation operators in PromQL follow this general syntax:
<aggregation_operator>([parameter,] <vector expression>) [by|without (<label list>)]
Where:
- <aggregation_operator> is the function you want to apply (sum, avg, min, etc.)
- [parameter,] is an optional parameter that some operators accept (for example, K in topk)
- <vector expression> is the input data to aggregate
- [by|without (<label list>)] is an optional clause that controls which labels to include or exclude
Common Aggregation Operators
sum
The sum operator adds the values of all time series that match the vector selector.
Example: Total HTTP requests across all instances
sum(http_requests_total)
Input:
http_requests_total{instance="server1", job="web"} 100
http_requests_total{instance="server2", job="web"} 200
http_requests_total{instance="server3", job="web"} 300
Output:
{} 600
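The semantics above can be sketched in Python (an illustrative simulation with the sample data from this example, not Prometheus code): with no by or without clause, summing collapses all labeled samples into a single series with an empty label set.

```python
# Hypothetical sketch of how sum() aggregates labeled samples.
# Each sample is a (labels, value) pair; without a by/without clause,
# every label is dropped and a single value remains.
samples = [
    ({"instance": "server1", "job": "web"}, 100),
    ({"instance": "server2", "job": "web"}, 200),
    ({"instance": "server3", "job": "web"}, 300),
]

total = sum(value for _, value in samples)
print({}, total)  # -> {} 600
```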
avg
The avg operator calculates the average value across all selected time series.
Example: Average CPU usage across all instances
avg(node_cpu_usage_percent)
Input:
node_cpu_usage_percent{instance="server1", job="node"} 75
node_cpu_usage_percent{instance="server2", job="node"} 25
node_cpu_usage_percent{instance="server3", job="node"} 50
Output:
{} 50
min and max
These operators find the minimum or maximum value across all selected time series.
Example: Find the most loaded instance
max(node_load1)
Input:
node_load1{instance="server1", job="node"} 0.5
node_load1{instance="server2", job="node"} 1.2
node_load1{instance="server3", job="node"} 0.8
Output:
{} 1.2
count
The count operator returns the number of time series in the input vector, regardless of their sample values.
Example: Count the number of scrape targets
count(up)
Input:
up{instance="server1", job="web"} 1
up{instance="server2", job="web"} 1
up{instance="server3", job="web"} 0
up{instance="server4", job="db"} 1
Output:
{} 4
Note that the result is 4, not 3: count(up) counts every series, including the target that is down (value 0). To count only healthy instances, use sum(up) or count(up == 1).
group
The group operator creates a new time series with a value of 1 for each group of input time series, based on the labels specified in the by clause.
Example: Group servers by job
group(node_cpu_seconds_total) by (job)
Input:
node_cpu_seconds_total{cpu="0", mode="idle", instance="server1", job="web"} 1000
node_cpu_seconds_total{cpu="0", mode="idle", instance="server2", job="web"} 2000
node_cpu_seconds_total{cpu="0", mode="idle", instance="server3", job="db"} 3000
Output:
{job="web"} 1
{job="db"} 1
topk and bottomk
These operators select the K largest or smallest elements from the input vector by sample value. Unlike the other aggregation operators, they return the selected series with their original labels intact.
Example: Find the 2 instances with the highest memory usage
topk(2, node_memory_usage_bytes)
Input:
node_memory_usage_bytes{instance="server1", job="node"} 1000000
node_memory_usage_bytes{instance="server2", job="node"} 2500000
node_memory_usage_bytes{instance="server3", job="node"} 1500000
Output:
node_memory_usage_bytes{instance="server2", job="node"} 2500000
node_memory_usage_bytes{instance="server3", job="node"} 1500000
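The behavior of topk can be sketched in Python (an illustrative analogy, not the actual implementation): pick the K series with the largest values, keeping their labels.

```python
import heapq

samples = [
    ({"instance": "server1", "job": "node"}, 1_000_000),
    ({"instance": "server2", "job": "node"}, 2_500_000),
    ({"instance": "server3", "job": "node"}, 1_500_000),
]

# topk(2, ...): keep the two series with the largest sample values,
# labels intact.
top2 = heapq.nlargest(2, samples, key=lambda s: s[1])
for labels, value in top2:
    print(labels, value)
```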
quantile
The quantile operator calculates the φ-quantile (0 ≤ φ ≤ 1) of the sample values in the input vector.
Example: Find the 90th percentile of request durations
quantile(0.9, http_request_duration_seconds)
Input:
http_request_duration_seconds{path="/api", method="GET"} 0.1
http_request_duration_seconds{path="/home", method="GET"} 0.2
http_request_duration_seconds{path="/login", method="POST"} 0.3
http_request_duration_seconds{path="/data", method="GET"} 0.4
http_request_duration_seconds{path="/admin", method="GET"} 0.5
Output:
{} 0.46
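The output value 0.46 comes from linear interpolation between the two closest sorted sample values. A sketch of that interpolation (an illustrative reimplementation, not Prometheus source code):

```python
def prom_quantile(phi, values):
    """phi-quantile with linear interpolation between the closest
    ranks, mirroring the behavior of PromQL's quantile()."""
    vals = sorted(values)
    rank = phi * (len(vals) - 1)          # fractional 0-based rank
    lower = int(rank)
    upper = min(lower + 1, len(vals) - 1)
    weight = rank - lower
    return vals[lower] * (1 - weight) + vals[upper] * weight

durations = [0.1, 0.2, 0.3, 0.4, 0.5]
# rank = 0.9 * 4 = 3.6, so the result interpolates between 0.4 and 0.5
print(prom_quantile(0.9, durations))  # -> 0.46 (within float rounding)
```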
Grouping with by and without Clauses
You can use the by and without clauses to control which labels are preserved in the result of an aggregation operation.
Using the by Clause
The by clause specifies which labels to keep, discarding all others.
Example: Sum HTTP requests by status code
sum(http_requests_total) by (status_code)
Input:
http_requests_total{instance="server1", path="/api", status_code="200"} 100
http_requests_total{instance="server2", path="/api", status_code="200"} 150
http_requests_total{instance="server1", path="/home", status_code="404"} 10
http_requests_total{instance="server2", path="/login", status_code="500"} 5
Output:
{status_code="200"} 250
{status_code="404"} 10
{status_code="500"} 5
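The grouping step can be sketched as bucketing samples on the kept label (a hypothetical Python simulation of this example's data):

```python
from collections import defaultdict

samples = [
    ({"instance": "server1", "path": "/api", "status_code": "200"}, 100),
    ({"instance": "server2", "path": "/api", "status_code": "200"}, 150),
    ({"instance": "server1", "path": "/home", "status_code": "404"}, 10),
    ({"instance": "server2", "path": "/login", "status_code": "500"}, 5),
]

# sum(...) by (status_code): bucket on the kept label, drop the rest.
grouped = defaultdict(int)
for labels, value in samples:
    grouped[labels["status_code"]] += value

print(dict(grouped))  # -> {'200': 250, '404': 10, '500': 5}
```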
Using the without Clause
The without clause specifies which labels to remove, keeping all others.
Example: Calculate the average CPU usage without instance information
avg(node_cpu_usage_percent) without (instance)
Input:
node_cpu_usage_percent{instance="server1", job="web", datacenter="us-east"} 70
node_cpu_usage_percent{instance="server2", job="web", datacenter="us-east"} 60
node_cpu_usage_percent{instance="server3", job="db", datacenter="us-west"} 50
Output:
{job="web", datacenter="us-east"} 65
{job="db", datacenter="us-west"} 50
Real-World Usage Examples
Let's explore some practical applications of aggregation operators in real-world monitoring scenarios.
Monitoring Service Health
Calculate the percentage of healthy instances for each service:
sum(up) by (job) / count(up) by (job) * 100
This query:
- Groups instances by job
- Calculates the sum of up instances (1 = up, 0 = down)
- Divides by the total count of instances
- Multiplies by 100 to get a percentage
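This works because up is 1 for a healthy target and 0 for a down one. The arithmetic can be sketched in Python (hypothetical sample data, reusing the up values from the count example):

```python
from collections import defaultdict

up_samples = [
    ({"instance": "server1", "job": "web"}, 1),
    ({"instance": "server2", "job": "web"}, 1),
    ({"instance": "server3", "job": "web"}, 0),
    ({"instance": "server4", "job": "db"}, 1),
]

# sum(up) by (job) and count(up) by (job), computed together.
sums, counts = defaultdict(int), defaultdict(int)
for labels, value in up_samples:
    sums[labels["job"]] += value
    counts[labels["job"]] += 1

health = {job: sums[job] / counts[job] * 100 for job in counts}
print(health)  # web is about 66.7% healthy, db is 100%
```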
Resource Utilization Across Clusters
Find the average CPU utilization grouped by datacenter and service:
avg(node_cpu_usage_percent) by (datacenter, service)
Identifying Outliers
Find instances with memory usage more than 2 standard deviations from the mean:
node_memory_usage_bytes >
  scalar(avg(node_memory_usage_bytes) + 2 * stddev(node_memory_usage_bytes))
The scalar() wrapper is needed because the per-instance series on the left cannot be label-matched against the single aggregated series on the right.
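PromQL's stddev is the population standard deviation (dividing by N, not N-1). The threshold logic can be sketched as follows, using the memory values from the earlier topk example (illustrative data, not real metrics):

```python
import math

memory = {"server1": 1_000_000, "server2": 2_500_000, "server3": 1_500_000}

values = list(memory.values())
mean = sum(values) / len(values)
# Population standard deviation, matching PromQL's stddev.
stddev = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
threshold = mean + 2 * stddev

outliers = {k: v for k, v in memory.items() if v > threshold}
print(outliers)  # empty: no value here exceeds mean + 2 * stddev
```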
Service Level Objectives (SLOs)
Calculate the 99th percentile of request latencies for API endpoints:
quantile(0.99, http_request_duration_seconds{handler=~"/api/.*"})
Capacity Planning
Find the top 5 services by CPU usage growth over the past week (a subquery, which requires parentheses around the inner expression):
topk(5, deriv((sum(rate(container_cpu_usage_seconds_total[5m])) by (service))[1w:1h]))
Advanced Aggregation Patterns
Multi-level Aggregation
You can chain multiple aggregation operations for more complex analyses:
# For each job, average the per-instance maximum CPU usage
avg(max(node_cpu_usage_percent) by (instance, job)) by (job)
Note that the inner aggregation must keep the job label; otherwise the outer by (job) clause has nothing left to group on.
Combining with Range Vectors
Aggregation operators can be combined with range vectors for time-based aggregation:
# Average CPU usage over the last 5 minutes for each instance
avg_over_time(node_cpu_usage_percent[5m])
Note that avg_over_time is a function, not an aggregation operator, so it does not take a by clause; it preserves all labels and returns one averaged series per input series.
Filtering Before Aggregation
You can filter time series before applying aggregation:
# Sum of all HTTP 5xx errors
sum(http_requests_total{status_code=~"5.."})
Common Pitfalls and Best Practices
Cardinality Explosion
Avoid grouping by high-cardinality labels (like user IDs) as this can create too many time series and overload Prometheus:
# Bad practice - could create millions of time series
sum(http_requests_total) by (user_id)
Missing Labels
Be aware that when you use aggregation operators, any labels not included in the by clause will be dropped from the result:
# Original labels like "instance" will be lost
sum(node_cpu_usage_percent) by (job)
Correct Handling of Counter Resets
When aggregating counters, use rate() or increase() first to handle counter resets properly:
# Correct way to sum rates across instances
sum(rate(http_requests_total[5m])) by (status_code)
Order of Operations
Pay attention to the order of operations. The following queries give different results:
# Calculate rate first, then sum across instances
sum(rate(http_requests_total[5m]))
# Calculate sum first, then rate over a subquery
# (loses information about individual counter resets)
rate(sum(http_requests_total)[5m:])
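The difference can be made concrete with a small simulation (hypothetical sample values). Applying a reset-aware increase per series and then summing recovers the true traffic; diffing the pre-summed series hides the reset:

```python
def increase(samples):
    """Reset-aware increase: a drop in a counter is treated as a
    restart from zero, mirroring how Prometheus handles resets."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur if cur < prev else cur - prev
    return total

a = [100, 150, 30, 80]    # this counter resets between 150 and 30
b = [200, 210, 220, 230]  # no reset

per_series_then_sum = increase(a) + increase(b)  # 130 + 30 = 160
summed = [x + y for x, y in zip(a, b)]           # [300, 360, 250, 310]
sum_then_diff = summed[-1] - summed[0]           # 10: the reset is hidden
print(per_series_then_sum, sum_then_diff)
```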
Summary
PromQL aggregation operators are powerful tools for transforming and summarizing time series data. They allow you to reduce the dimensionality of your data, making it easier to derive meaningful insights and create effective dashboards.
Key points to remember:
- Use the sum, avg, min, max, and count operators for basic aggregation
- Apply topk, bottomk, and quantile for more specific analyses
- Control label preservation with by and without clauses
- Be mindful of cardinality and the order of operations
- Apply aggregation operators after rate calculations when working with counters
By mastering these operators, you'll be able to extract valuable insights from your Prometheus metrics and build more effective monitoring systems.
Exercises
- Write a PromQL query to show the total number of HTTP requests by status code and method.
- Create a query to find the three services with the highest error rates.
- Write a query to calculate the 95th percentile of request latencies for each endpoint.
- Use aggregation to find the ratio of CPU usage to memory usage across all instances.
- Create a query to show the percentage of disk space used, averaged across all instances in each datacenter.