PromQL Operators
Introduction
PromQL (Prometheus Query Language) is the query language of Prometheus, an open-source monitoring and alerting toolkit. Operators are the building blocks that let you manipulate, combine, and transform time series data. Understanding them is essential for writing queries that analyze metrics and drive visualizations or alerts.
In this guide, we'll explore the various types of operators in PromQL, their syntax, and how to apply them in real-world monitoring scenarios.
Operator Types in PromQL
PromQL operators can be categorized into several groups:
- Arithmetic Operators: Perform mathematical calculations on time series
- Comparison Operators: Compare values and filter time series
- Logical/Set Operators: Combine different time series
- Vector Matching Operators: Control how time series are matched during operations
- Aggregation Operators: Summarize and reduce time series data
Let's explore each category in detail.
Arithmetic Operators
Arithmetic operators perform mathematical operations on time series data. These operators can be applied between:
- Two scalars (producing a scalar)
- A scalar and a vector (applying the operation to each element)
- Two vectors (applying the operation to elements whose label sets match)
Basic Arithmetic Operators
Operator | Description | Example |
---|---|---|
+ | Addition | node_memory_free_bytes + node_memory_cached_bytes |
- | Subtraction | node_memory_total_bytes - node_memory_free_bytes |
* | Multiplication | node_network_transmit_bytes_total * 8 (convert bytes to bits) |
/ | Division | node_cpu_seconds_total / 60 (convert seconds to minutes) |
% | Modulo | node_cpu_seconds_total % 60 |
^ | Exponentiation | node_disk_read_bytes_total ^ 2 |
Example: Calculating Memory Usage Percentage
(node_memory_total_bytes - node_memory_free_bytes) / node_memory_total_bytes * 100
This query calculates the percentage of memory used by:
- Subtracting free memory from total memory to get used memory
- Dividing used memory by total memory
- Multiplying by 100 to get a percentage
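The metric names in the example above are simplified. As a sketch, assuming a Linux host scraped by a recent node_exporter that exposes node_memory_MemTotal_bytes and node_memory_MemAvailable_bytes (names may differ on other platforms or exporter versions), the same calculation might look like this:

```promql
# Used memory as a percentage of total, per instance.
# Assumption: MemAvailable is used instead of MemFree so that
# reclaimable cache is not counted as "used".
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes
  * 100
```

Because / and * share a precedence level and are left-associative, the division is evaluated before the multiplication by 100.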
Comparison Operators
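Comparison operators compare values. By default they act as filters: when a vector is compared with a scalar or with another vector, only the series for which the comparison is true are kept, with their original values, and all other series are dropped. This makes them useful for filtering and thresholding.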
Available Comparison Operators
Operator | Description | Example |
---|---|---|
== | Equal | node_cpu_seconds_total == 0 |
!= | Not equal | node_cpu_seconds_total != 0 |
> | Greater than | http_requests_total > 100 |
< | Less than | http_requests_total < 10 |
>= | Greater than or equal | node_memory_usage_percentage >= 90 |
<= | Less than or equal | node_memory_usage_percentage <= 10 |
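Using Comparison with the bool Modifier
With the bool modifier, a comparison does not drop any series. It returns 1 for each series where the comparison is true and 0 where it is false:
http_requests_total > bool 100
This returns every http_requests_total series, with the value 1 if its original value is greater than 100 and 0 otherwise.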
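To make the difference between a plain comparison and a bool comparison concrete, here is a small sketch using the same http_requests_total metric. The job label is an assumption about how the metric is labeled, and each query is meant to be run on its own:

```promql
# Plain comparison: acts as a filter. Series at or below 100 are dropped,
# and the remaining series keep their original values.
http_requests_total > 100

# bool comparison: no series are dropped. Each series gets the value 1 or 0.
http_requests_total > bool 100

# Because bool yields 0/1, the results can be summed to count
# how many series per job are above the threshold.
sum by (job) (http_requests_total > bool 100)

# Comparing two scalars requires bool; this evaluates to the scalar 1.
1 > bool 0
```

The plain form is usually what an alert expression wants, since an alert fires for every series the expression returns. The bool form is handy when the 0/1 result feeds further arithmetic.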
Example: Finding High CPU Usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
This query:
- Calculates the CPU idle rate over 5 minutes
- Converts to percentage and subtracts from 100 to get CPU usage percentage
- Filters instances where CPU usage is above 80%
Logical/Set Operators
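Logical/set operators work only between instant vectors. They select series based on whether a series with a matching label set exists on the other side; sample values are never compared.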
Binary Logical Operators
Operator | Description | Example |
---|---|---|
and | Intersection | up == 1 and rate(http_requests_total[5m]) > 0 |
or | Union | node_filesystem_avail_bytes{mountpoint="/"} < 10737418240 or node_filesystem_avail_bytes{mountpoint="/data"} < 21474836480 |
unless | Complement | http_requests_total unless on(instance) node_boot_time < 1600000000 |
Example: Detecting Problems in Production
(instance:requests:rate5m > 100) and (instance:errors:rate5m / instance:requests:rate5m > 0.05)
This query identifies instances that have both high request rates (over 100 per second) and an error rate exceeding 5%.
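The three set operators can be sketched side by side as follows. This assumes that http_requests_total and errors_total carry instance and job labels matching those of up, which depends on how your targets are scraped. Each query is meant to be run on its own:

```promql
# and: instances that are up AND currently serving traffic.
# The result carries the labels and values of the left-hand side.
(up == 1) and on(instance) (sum by (instance) (rate(http_requests_total[5m])) > 0)

# unless: instances that are up but NOT serving traffic.
(up == 1) unless on(instance) (sum by (instance) (rate(http_requests_total[5m])) > 0)

# or: error rate per job, falling back to 0 for jobs that have
# request series but no error series at all.
sum by (job) (rate(errors_total[5m]))
  or
sum by (job) (rate(http_requests_total[5m])) * 0
```

In the last query, * binds tighter than or, so the right-hand operand is the request rate multiplied by zero. It only contributes series for jobs missing from the left-hand side.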
Vector Matching Operators
When performing operations between two vectors, PromQL needs to know how to match the time series. Vector matching operators provide this control.
One-to-One Matching
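By default, a binary operation pairs each series on the left with the one series on the right that has exactly the same label set (the metric name is ignored). Series without a partner are dropped from the result:
request_success_count{job="api"} / request_count{job="api"}
Many-to-One and One-to-Many Matching
When several series on one side match a single series on the other, the match must be declared explicitly with group_left or group_right, which names the side with the higher cardinality. Here each handler's request rate is divided by the total for its job:
sum by (job, handler) (rate(http_requests_total[5m])) / on(job) group_left sum by (job) (rate(http_requests_total[5m]))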
Vector Matching Keywords
Operator | Description | Example |
---|---|---|
on | Match only on specified labels | request_count{job="api"} / on(job) request_count{job="api", status="success"} |
ignoring | Ignore specified labels when matching | request_count{job="api", path="/"} / ignoring(path) request_count{job="api"} |
group_left | Many-to-one matching | request_count{job="api", path="/"} / ignoring(path) group_left request_count{job="api"} |
group_right | One-to-many matching | request_count{job="api"} / ignoring(path) group_right request_count{job="api", path="/"} |
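A common use of group_left is copying a label from an "info"-style metric onto another series. The sketch below assumes a hypothetical metric build_info with labels instance, job, and version and a constant value of 1, with exactly one series per instance and job. Neither the metric nor its labels come from this guide:

```promql
# Request rate per instance, enriched with the version label
# taken from the (hypothetical) build_info metric.
sum by (instance, job) (rate(http_requests_total[5m]))
  * on(instance, job) group_left(version)
build_info
```

Multiplying by a series whose value is 1 leaves the rate unchanged. on(instance, job) restricts matching to those two labels, and group_left(version) copies version from the right-hand side into the result. If build_info had more than one series per instance and job, the query would fail with a many-to-many matching error.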
Example: Calculating Error Ratio with Labels
sum(rate(http_requests_total{status="error"}[5m])) by (job, handler)
/
sum(rate(http_requests_total[5m])) by (job, handler)
This query:
- Calculates the rate of error requests over 5 minutes, grouped by job and handler
- Calculates the total rate of requests over 5 minutes, grouped by job and handler
- Divides to get the error ratio for each job and handler combination
Aggregation Operators
Aggregation operators combine multiple time series into fewer time series based on labels.
Available Aggregation Operators
Operator | Description | Example |
---|---|---|
sum | Sum of all values | sum(http_requests_total) |
min | Minimum value | min(node_cpu_seconds_total) |
max | Maximum value | max(node_cpu_seconds_total) |
avg | Average value | avg(node_cpu_seconds_total) |
stddev | Population standard deviation | stddev(node_cpu_seconds_total) |
stdvar | Population variance | stdvar(node_cpu_seconds_total) |
count | Count of elements | count(up == 1) |
count_values | Count of elements that share the same value | count_values("version", build_version) |
bottomk | Bottom k elements | bottomk(3, node_cpu_seconds_total) |
topk | Top k elements | topk(3, node_cpu_seconds_total) |
quantile | φ-quantile (0 ≤ φ ≤ 1) | quantile(0.95, http_request_duration_seconds) |
Modifying Aggregations with by and without
- by: Keep only specified labels
- without: Remove specified labels
# Sum requests by job
sum by (job) (http_requests_total)
# Sum requests, removing path label
sum without (path) (http_requests_total)
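As a further sketch, the grouping clause can be written before or after the aggregated expression, and without is often the more convenient choice when a metric has many labels. The second example assumes node_cpu_seconds_total has cpu and mode labels, as node_exporter exposes it. Each query is meant to be run on its own:

```promql
# These two forms are equivalent.
sum by (job) (http_requests_total)
sum(http_requests_total) by (job)

# Non-idle CPU time per second for each host: drop the cpu and mode
# labels and keep everything else (instance, job, ...).
sum without (cpu, mode) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
```

An aggregation with neither by nor without collapses everything into a single series with no labels.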
Example: Finding Top CPU-Consuming Instances
topk(5, 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
This query:
- Calculates CPU usage percentage for each instance (100 minus the idle percentage)
- Returns the 5 instances with the highest CPU usage
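topk and bottomk also accept by and without, in which case the selection happens inside each group. A brief sketch, assuming node_exporter's node_cpu_seconds_total and node_filesystem_avail_bytes metrics, with each query run on its own:

```promql
# The single busiest CPU core on each instance.
topk by (instance) (1, 1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))

# The 5 filesystems with the least available space.
bottomk(5, node_filesystem_avail_bytes)
```

Unlike sum or avg, topk and bottomk return the selected input series with their original labels. In a graph, the selection is re-evaluated at every step, so more than k distinct series may appear over the whole time range.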
Operator Precedence
Like in mathematics, PromQL operators follow a precedence order, listed here from highest to lowest:
- Grouping: ( ... )
- Function calls, aggregations
- Exponentiation: ^
- Multiplication, division, modulo: *, /, %
- Addition, subtraction: +, -
- Comparison: ==, !=, <=, <, >=, >
- Intersection and complement: and, unless
- Union: or
Operators on the same level are left-associative, except ^, which is right-associative.
Example: Operator Precedence
sum(rate(http_requests_total[5m])) by (job) > 10 or sum(rate(errors_total[5m])) by (job) > 5
This evaluates as:
- Calculate rate(http_requests_total[5m]) and rate(errors_total[5m])
- Apply sum...by aggregations
- Apply > comparisons
- Combine with or
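Precedence and associativity are easy to check with scalar expressions, which Prometheus evaluates like a calculator. In the sketch below, the metric names a, b, and c in the last query are placeholders rather than real metrics, and each query is meant to be run on its own:

```promql
# * binds tighter than +: evaluated as 1 + (2 * 3) = 7
1 + 2 * 3

# Same precedence level, left-associative: (2 * 3) % 2 = 0
2 * 3 % 2

# ^ is right-associative: 2 ^ (3 ^ 2) = 2 ^ 9 = 512
2 ^ 3 ^ 2

# and binds tighter than or: parsed as a or (b and c)
a or b and c
```

When in doubt, add parentheses. They cost nothing at query time and make the intent explicit.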
Real-world Examples
Example 1: Monitoring Service Health
# Alert when error rate is over 5% in the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service) > 0.05
Example 2: Disk Space Prediction
# Returns filesystems predicted to run out of space within the next 24 hours, based on the trend of the last 6 hours
predict_linear(node_filesystem_avail_bytes[6h], 24 * 3600) < 0
Example 3: Calculating Percentiles for Request Durations
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler))
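histogram_quantile needs the le (less than or equal) bucket label to survive aggregation, which is why the query above groups by le. As a sketch of two related queries, assuming the same http_request_duration_seconds histogram and that it defines a bucket with le="0.5" (an assumption; check which buckets your application actually exposes):

```promql
# 95th percentile latency across all handlers: aggregate away
# everything except le before computing the quantile.
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Fraction of requests completed within 0.5 seconds.
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  /
sum(rate(http_request_duration_seconds_count[5m]))
```

The quantile is estimated by interpolating within a bucket, so its accuracy depends on how closely the bucket boundaries surround the value of interest.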
PromQL Operator Cheat Sheet
Here's a quick reference for PromQL operators:
Category | Operators | Usage |
---|---|---|
Arithmetic | + , - , * , / , % , ^ | node_memory_total_bytes - node_memory_free_bytes |
Comparison | == , != , > , < , >= , <= | http_requests_total > 100 |
Logical | and , or , unless | up == 1 and rate(http_requests_total[5m]) > 0 |
Vector Matching | on , ignoring , group_left , group_right | request_count / on(job) request_success_count |
Aggregation | sum , min , max , avg , stddev , count , topk , etc. | sum by (job) (http_requests_total) |
Summary
PromQL operators are powerful tools for transforming and analyzing time series data in Prometheus. By mastering these operators, you can:
- Perform mathematical calculations on metrics
- Filter time series based on conditions
- Combine metrics from different sources
- Aggregate data to reduce dimensionality
- Create meaningful visualizations and alerts
Remember that effective PromQL queries often combine multiple operators to extract valuable insights from your monitoring data.
Exercises
- Write a PromQL query to calculate the percentage of disk space used for each filesystem.
- Create a query to find the 3 busiest CPUs across all your instances.
- Write a query to calculate the ratio of HTTP 500 errors to total requests, grouped by service and endpoint.
- Use vector matching to join metrics from two different sources based on common labels.
- Create an alert expression that fires when memory usage is high AND disk space is running low.