Complex PromQL Queries

Introduction

PromQL (Prometheus Query Language) is the powerful query language that makes Prometheus such a valuable tool in the monitoring ecosystem. While basic PromQL queries help you retrieve and visualize simple metrics, complex queries unlock the full potential of your monitoring data by enabling sophisticated analysis, correlation between metrics, and advanced alerting conditions.

In this guide, we'll explore advanced PromQL techniques that go beyond the basics. You'll learn how to construct multi-step queries, use advanced functions, and create complex expressions that reveal deeper insights into your system's behavior and performance.

Prerequisites

Before diving into complex PromQL, you should:

Understand basic PromQL syntax and simple queries
Be familiar with Prometheus metric types (counters, gauges, histograms, summaries)
Have a running Prometheus instance to experiment with

Building Blocks of Complex Queries

Quick Recap of PromQL Basics

Let's start with a quick refresher on the basic components of PromQL:

http_requests_total{status="200", handler="/api/users"}[5m]

This query:

Selects the http_requests_total metric
Filters for HTTP status 200 and the /api/users handler
Uses a time range selector [5m] to get the last 5 minutes of data

Now, let's move beyond the basics.

Advanced Operators

Binary Operators

Binary operators allow you to perform calculations between two metrics or between a metric and a scalar value.

Arithmetic Operators

+ (addition)
- (subtraction)
* (multiplication)
/ (division)
% (modulo)
^ (power/exponentiation)

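Example: Calculate each container's CPU usage as a percentage of the node's CPU cores. The counter must go through rate() first, and the two metrics are matched on instance (assuming both carry that label, as they do when scraped from cAdvisor):

rate(container_cpu_usage_seconds_total[5m]) / on(instance) group_left machine_cpu_cores * 100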

Comparison Operators

== (equal)
!= (not equal)
> (greater than)
< (less than)
>= (greater than or equal)
<= (less than or equal)

Example: Find instances where CPU usage exceeds 80%, that is, where the idle fraction averaged across all cores is below 20%:

avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2

Logical Operators

and (intersection)
or (union)
unless (complement)

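Example: Alert when a node is both low on available memory and high in CPU usage. The two sides carry different label sets, so they are matched explicitly on instance:

(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1 and on(instance) (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2)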
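
The logical operators match series by their label sets, not by their values. The Python sketch below models an instant vector as a list of (labels, value) pairs and mimics and, or, and unless with an optional on list. The helper names and sample data are illustrative assumptions, not Prometheus internals.

```python
def signature(labels, on=None):
    """Labels used for matching: every label, or only those named in `on`."""
    names = sorted(labels) if on is None else on
    return tuple((name, labels.get(name, "")) for name in names)


def vector_and(left, right, on=None):
    """Keep left-hand series that have a matching series on the right."""
    right_keys = {signature(labels, on) for labels, _ in right}
    return [(l, v) for l, v in left if signature(l, on) in right_keys]


def vector_unless(left, right, on=None):
    """Keep left-hand series that have no matching series on the right."""
    right_keys = {signature(labels, on) for labels, _ in right}
    return [(l, v) for l, v in left if signature(l, on) not in right_keys]


def vector_or(left, right, on=None):
    """All left-hand series, plus right-hand series without a match on the left."""
    left_keys = {signature(labels, on) for labels, _ in left}
    extra = [(l, v) for l, v in right if signature(l, on) not in left_keys]
    return list(left) + extra


# Hypothetical sample data: memory series have one label, CPU series have two.
low_memory = [({"instance": "a"}, 0.05), ({"instance": "b"}, 0.08)]
busy_cpu = [({"instance": "a", "cpu": "0"}, 0.1)]

strict = vector_and(low_memory, busy_cpu)                      # label sets differ
by_instance = vector_and(low_memory, busy_cpu, on=["instance"])
```

With the full label set nothing matches, because the CPU series carries an extra cpu label. Restricting the match to instance returns the series for instance a.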

Vector Matching

One of the most powerful aspects of PromQL is its ability to match and join different time series together.

Types of Vector Matching

One-to-one: Each series from the left side matches with exactly one series from the right side.
One-to-many/many-to-one: These are specified using group_left and group_right modifiers.

Example: Calculate the ratio of errors to total requests for each API endpoint:

sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint) / 
sum(rate(http_requests_total[5m])) by (endpoint)

Vector Matching Modifiers

on: Specify which labels to match on
ignoring: Specify which labels to ignore when matching
group_left: Perform a many-to-one join
group_right: Perform a one-to-many join

Example: Calculate the percentage of CPU usage per application, including node information:

sum by (application, instance) (rate(process_cpu_seconds_total[5m])) / 
  on(instance) group_left
  node_cpu_cores * 100
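
To make the many-to-one join concrete, here is a small Python sketch of what a division with on(instance) group_left does. Series are modeled as (labels, value) pairs; the function name and the sample values are assumptions made for illustration only.

```python
def match_key(labels, on):
    """Values of the matching labels; a missing label counts as empty."""
    return tuple(labels.get(name, "") for name in on)


def divide_group_left(left, right, on):
    """Model of: left / on(<on>) group_left right."""
    right_by_key = {}
    for labels, value in right:
        key = match_key(labels, on)
        if key in right_by_key:
            # The "one" side must be unique per matching key.
            raise ValueError(f"duplicate series on the 'one' side for {key}")
        right_by_key[key] = value
    # Each left-hand series keeps its own labels; unmatched series are dropped.
    return [
        (labels, value / right_by_key[match_key(labels, on)])
        for labels, value in left
        if match_key(labels, on) in right_by_key
    ]


# Hypothetical data: CPU seconds per second by application, and cores per node.
cpu_by_app = [
    ({"application": "api", "instance": "node-a"}, 0.5),
    ({"application": "worker", "instance": "node-a"}, 1.0),
    ({"application": "api", "instance": "node-c"}, 0.25),
]
cores = [({"instance": "node-a"}, 4.0), ({"instance": "node-b"}, 8.0)]

result = divide_group_left(cpu_by_app, cores, on=["instance"])
```

Both applications on node-a are divided by the same core count, while the series on node-c disappears because nothing on the right-hand side matches it.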

Subqueries

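Subqueries let you evaluate an instant query repeatedly over a time range at a fixed resolution. The result is a range vector that functions such as max_over_time() or deriv() can consume. The step is optional and defaults to the global evaluation interval. Subqueries are formatted as: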

<instant_query>[<range>:<step>]

Example: Calculate the 5-minute rate of HTTP requests, evaluated every minute for the last hour:

rate(http_requests_total[5m])[1h:1m]
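
As a rough mental model, a subquery runs the inner expression once per step and collects the results into a range. The Python sketch below assumes evaluation timestamps aligned to multiples of the step and treats the start of the window as exclusive; boundary handling may differ between Prometheus versions, and the function names are hypothetical.

```python
def subquery_timestamps(eval_time, range_seconds, step_seconds):
    """Timestamps (in seconds) at which the inner expression is evaluated."""
    start = eval_time - range_seconds
    # Align to a multiple of the step, then move inside the window.
    first = (start // step_seconds) * step_seconds
    if first <= start:  # assumption: window start is exclusive
        first += step_seconds
    return list(range(first, eval_time + 1, step_seconds))


def evaluate_subquery(inner, eval_time, range_seconds, step_seconds):
    """Evaluate `inner(t)` at every step, producing (timestamp, value) samples."""
    return [
        (t, inner(t))
        for t in subquery_timestamps(eval_time, range_seconds, step_seconds)
    ]


# A 5-minute window at 1-minute resolution, evaluated at t=1000.
stamps = subquery_timestamps(1000, 300, 60)
```

Because the steps are aligned to the step size rather than to the evaluation time, the samples do not necessarily end exactly at the evaluation time.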

Advanced Aggregation Operators

PromQL offers various aggregation operators to combine time series:

sum, min, max, avg
stddev, stdvar (standard deviation and variance)
count, count_values
bottomk, topk (smallest/largest k elements)
quantile

Example: Find the top 3 endpoints with the highest error rates:

topk(3, sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m])))

Time-Based Functions

Rate and Increase

For counter metrics, rate() and increase() are essential functions:

# Average rate of increase per second over the last 5 minutes
rate(http_requests_total[5m])

# Total increase over the last 5 minutes
increase(http_requests_total[5m])

Time Series Prediction

Predict future values using linear regression:

predict_linear(node_filesystem_free_bytes[6h], 4 * 3600)

This predicts the disk space available in 4 hours based on the last 6 hours of data.
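
The idea behind predict_linear() is a least-squares line fitted through the samples in the range and extended into the future. The Python sketch below illustrates that calculation under simplifying assumptions: timestamps are in seconds, the prediction is made relative to eval_time, and the sample values are made up.

```python
def predict_linear(samples, seconds_ahead, eval_time):
    """Fit a least-squares line through (timestamp, value) samples and
    return the value it predicts `seconds_ahead` seconds after eval_time."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    # Measure time relative to eval_time so the intercept is the value "now".
    xs = [t - eval_time for t, _ in samples]
    ys = [v for _, v in samples]
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return intercept + slope * seconds_ahead


# Hypothetical free-space samples (in GiB), one per hour, shrinking by 100 per hour.
samples = [(0, 1000.0), (3600, 900.0), (7200, 800.0)]
predicted = predict_linear(samples, 4 * 3600, eval_time=7200)
```

With a steady loss of 100 GiB per hour, the line predicts roughly 400 GiB free four hours after the last sample. Real data is rarely this linear, so the prediction is only as good as the trend in the chosen range.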

Historical Data Analysis

Compare current metrics with past values:

# Compare current request rate with the rate 24 hours ago
rate(http_requests_total[5m]) / 
  rate(http_requests_total[5m] offset 24h)

Real-World Complex Query Examples

Example 1: Service Level Objectives (SLOs)

Calculate the error budget consumption for an API with a 99.9% availability target:

# First, calculate the error rate
sum(rate(http_requests_total{status=~"5.."}[1h])) / 
sum(rate(http_requests_total[1h]))

# Then, track against the SLO target of 0.1% errors (99.9% availability)
(sum(rate(http_requests_total{status=~"5.."}[1h])) / 
sum(rate(http_requests_total[1h]))) / 0.001

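This calculates how fast we are consuming our error budget, expressed as a multiple of the allowed error rate (often called the burn rate). A value of 1 means errors are occurring at exactly the rate the SLO permits; a value of 2 means the budget is being spent twice as fast as it can be sustained.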
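
The arithmetic behind the division by 0.001 can be sketched in a few lines of Python. The 30-day budget window and the function names are assumptions for illustration; the guide itself only fixes the 99.9% target.

```python
def burn_rate(error_ratio, slo_target):
    """Observed error ratio as a multiple of the error ratio the SLO allows."""
    allowed_error_ratio = 1.0 - slo_target
    return error_ratio / allowed_error_ratio


def days_until_budget_exhausted(rate, window_days=30):
    """Days until the whole budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# 0.2% of requests failing against a 99.9% availability target.
current = burn_rate(error_ratio=0.002, slo_target=0.999)
remaining = days_until_budget_exhausted(2.0)
```

An error ratio of 0.2% against a 0.1% allowance gives a burn rate of about 2, which would use up a 30-day budget in 15 days if it continued.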

Example 2: Kubernetes Resource Optimization

Identify pods that are consistently using less than 20% of their CPU requests:

avg_over_time(
  (sum by (pod) (rate(container_cpu_usage_seconds_total[5m])) / 
   sum by (pod) (kube_pod_container_resource_requests{resource="cpu"}))
[1d:5m]) < 0.2

This helps identify over-provisioned resources that could be scaled down to save costs.

Example 3: Database Query Performance Monitoring

Calculate the 95th percentile of query latency and its trend:

# Calculate 95th percentile query time
histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket[5m])) by (le, query_type))

# See if it's trending up or down
deriv(histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket[5m])) by (le, query_type))[1h:])
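
histogram_quantile() estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the requested rank. The Python sketch below shows that idea for classic histograms with 0 < q <= 1; it is a simplified model with made-up bucket values, not the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.
    The last bucket is expected to have an upper bound of +Inf."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    # The first bucket is assumed to start at zero.
    lower, prev_count = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical query-latency buckets in seconds.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
```

Here the 95th of 100 observations falls halfway through the 0.5s to 1.0s bucket, so the estimate is about 0.75s. The result is only as precise as the bucket boundaries, which is why bucket layout matters for latency SLOs.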

Example 4: Network Traffic Anomaly Detection

Detect sudden spikes in network traffic that deviate from the norm:

abs(
  (rate(node_network_transmit_bytes_total[5m]) - 
   avg_over_time(rate(node_network_transmit_bytes_total[5m])[1d:1h])
  ) / 
  avg_over_time(rate(node_network_transmit_bytes_total[5m])[1d:1h])
) > 0.3

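This alerts when current network traffic deviates by more than 30% from the average traffic over the past day, sampled once per hour. It does not compare against the same time of day; for that, compare with an offset such as offset 1d.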

Common Patterns and Best Practices

Rate then Sum, Not Sum then Rate

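For counters, always apply rate() before aggregation operations. rate() detects counter resets per series; summing first merges the series, so a single restart looks like a drop in the total and the reset can no longer be corrected properly.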

✅ Correct:

sum(rate(http_requests_total[5m]))

❌ Incorrect:

rate(sum(http_requests_total)[5m])

Handle Counter Resets

When a service restarts, counters reset to zero. PromQL functions like rate(), increase(), and irate() automatically handle these resets, but be careful with manual calculations.
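
The following Python sketch shows both the reset handling and why aggregating before rate() breaks it. It is a simplified model that ignores the extrapolation Prometheus applies at the edges of the range; the function name and counter values are made up.

```python
def increase(values):
    """Total increase of a counter across consecutive samples.
    A drop is treated as a reset to zero, so the new value counts in full."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Two hypothetical instances; service_a restarts between the 2nd and 3rd scrape.
service_a = [100, 160, 20, 80]
service_b = [200, 260, 320, 380]

# Rate then sum: resets are corrected per series before adding up.
rate_then_sum = increase(service_a) + increase(service_b)

# Sum then rate: the reset shows up as a drop in the combined series,
# which is misread as the whole total restarting from zero.
summed = [a + b for a, b in zip(service_a, service_b)]
sum_then_rate = increase(summed)
```

Per series, the increases are 140 and 180, for a correct total of 320. Applied to the summed series, the same logic reports 580, because the drop from 420 to 340 is treated as a reset of the entire sum.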

Use Labels Effectively

Structure your metrics with thoughtful labels to enable powerful queries later:

# Bad - difficult to filter and aggregate
http_requests_total

# Good - enables powerful filtering and aggregation
http_requests_total{service="payment-api", endpoint="/process", method="POST", status_code="200"}

Keep Cardinality Under Control

While labels are powerful, too many unique combinations can cause performance issues. For example, avoid labels like user_id that could have millions of values.

Debugging Complex Queries

When a complex query doesn't return the expected results, try these approaches:

Break it down: Execute each part of the query separately to see intermediate results
Check for missing data: Use absent() to verify if metrics exist
Examine label matching: Ensure that your vector matching is working as expected
Verify time ranges: Confirm that your time windows capture the data you're interested in

Visualizing Complex Queries

Complex PromQL queries truly shine when visualized in dashboards. Here are some tips:

Use appropriate visualization types:
- Use gauges for current values against thresholds
- Use graphs for rates and trends
- Use heatmaps for histograms
Add context with multiple panels:
- Show related metrics side by side
- Include both the raw data and calculated values
Effective dashboard layout:
- Group related metrics
- Order panels from high-level overview to detailed metrics

Case Study: Building a Comprehensive Service Dashboard

Let's combine everything we've learned to create a comprehensive service monitoring dashboard. Here's a diagram of the components we'll monitor:

[Diagram: components monitored by the dashboard]

For this architecture, we'll create PromQL queries to monitor:

Overall Service Health - Error rate across all services:

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

Request Latency - 95th percentile response time per service:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))

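Resource Utilization - CPU usage per service relative to its CPU limit (container_spec_cpu_quota is expressed in microseconds per period, so it is divided by container_spec_cpu_period to obtain cores):

sum(rate(process_cpu_seconds_total[5m])) by (service) / 
  sum(container_spec_cpu_quota / container_spec_cpu_period) by (service)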

Database Performance - Query throughput and latency:

rate(database_queries_total[5m])

histogram_quantile(0.95, sum(rate(database_query_duration_seconds_bucket[5m])) by (query_type, le))

Cache Efficiency - Cache hit ratio:

sum(rate(cache_hits_total[5m])) / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))

Error Budget - SLO compliance tracking (availability over the last 30 days, to compare against the SLO target):

1 - (sum(increase(http_requests_total{status=~"5.."}[30d])) / sum(increase(http_requests_total[30d])))

Summary

Complex PromQL queries are powerful tools for extracting meaningful insights from your monitoring data. In this guide, we've covered:

Advanced operators for arithmetic, comparison, and logical operations
Vector matching to combine and correlate different metrics
Subqueries for analyzing patterns over time
Aggregation operators for summarizing data
Time-based functions for rate calculations and predictions
Real-world examples and best practices
Techniques for debugging and visualizing complex queries

With these advanced PromQL skills, you can build more sophisticated monitoring dashboards, create precise alerting rules, and gain deeper insights into your systems' behavior.

Exercises

Write a PromQL query to calculate the average memory usage per Kubernetes namespace over the last hour.
Create a query that shows the 99th percentile of API request latency for each endpoint, but only for endpoints that have processed more than 100 requests in the last 5 minutes.
Write a query to predict when a disk will reach 90% capacity based on the growth rate over the last week.
Create a dashboard panel that shows a ratio of errors to total requests for your top 5 most-used API endpoints.
Write a query to detect if any of your services are experiencing an unusual increase in error rates compared to their historical baseline.

Additional Resources

Prometheus Documentation - PromQL
PromLabs PromQL Cheat Sheet
PromQL for Humans
Grafana Labs: PromQL Examples
Robust Perception Blog - Expert articles on Prometheus
