PromQL Common Use Cases
Introduction
PromQL (Prometheus Query Language) is a powerful functional query language that lets you select and aggregate time series data stored in Prometheus. While the syntax might seem intimidating at first, mastering a set of common use cases will help you quickly solve practical monitoring challenges.
This guide covers the most frequent PromQL patterns you'll need when monitoring your systems. We'll explore real-world examples for resource utilization, error rates, service availability, and more, providing you with ready-to-use queries you can adapt to your environment.
Basic Metric Selection and Filtering
Selecting Metrics with Labels
One of the most common operations is selecting metrics and filtering them by their labels.
http_requests_total{job="api-server", environment="production"}
This query selects the http_requests_total metric, but only for time series where the job label equals "api-server" and the environment label equals "production".
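Label matchers also support negation with !=. For example, a minimal variation of the query above that excludes production traffic instead of selecting it:
http_requests_total{job="api-server", environment!="production"}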
Using Regular Expressions for Label Matching
You can use regular expressions to match multiple label values:
http_requests_total{job=~".*server", environment!~"test|staging"}
This selects http_requests_total metrics where:
- The job label matches any value ending with "server"
- The environment label is neither "test" nor "staging"
Rate Calculations
Calculating Request Rates
To calculate the rate of HTTP requests over the last 5 minutes:
rate(http_requests_total{job="api-server"}[5m])
The output is measured in requests per second. For instance:
{job="api-server", instance="10.0.0.1:9090", path="/api/users"} 12.34
{job="api-server", instance="10.0.0.2:9090", path="/api/users"} 10.21
This tells us the first instance is handling 12.34 requests per second, while the second is handling 10.21.
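If you care about overall throughput rather than per-instance figures, you can aggregate the per-series rates; a common variation of the query above is:
sum(rate(http_requests_total{job="api-server"}[5m]))
This returns a single value: the total requests per second handled by all api-server instances combined.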
Calculating Error Rates
To calculate the error rate (HTTP 5xx responses) as a percentage:
sum(rate(http_requests_total{job="api-server", status=~"5.."}[5m])) by (instance)
/
sum(rate(http_requests_total{job="api-server"}[5m])) by (instance)
* 100
This gives you the percentage of 5xx errors for each instance:
{instance="10.0.0.1:9090"} 2.5
{instance="10.0.0.2:9090"} 1.7
This indicates that instance 10.0.0.1 has a 2.5% error rate, while 10.0.0.2 has a 1.7% error rate.
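The same expression works well as an alert condition. Appending a comparison keeps only the instances whose error rate crosses a threshold (the 5% here is just an illustrative value):
sum(rate(http_requests_total{job="api-server", status=~"5.."}[5m])) by (instance)
/
sum(rate(http_requests_total{job="api-server"}[5m])) by (instance)
* 100
> 5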
Aggregation Operations
Finding the Top 5 CPU Users
To identify which pods are consuming the most CPU:
topk(5, sum(rate(container_cpu_usage_seconds_total[5m])) by (pod))
Sample output:
{pod="search-indexer-67d8b9f88d-2xvqp"} 3.52
{pod="database-primary-0"} 2.14
{pod="api-gateway-75d4f9b675-f9d7x"} 1.87
{pod="cache-6b6b986b9c-t2jxz"} 1.65
{pod="log-collector-84569d887-6zjpm"} 1.23
This shows that the search indexer pod is using the most CPU (3.52 cores), followed by the database primary (2.14 cores), and so on.
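The same pattern works for other grouping dimensions. For instance, assuming the metric carries a namespace label (as it does when scraped from cAdvisor in Kubernetes), you can rank namespaces instead of pods:
topk(5, sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace))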
Calculating Percentiles for Response Times
To calculate the 95th percentile of HTTP request durations:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
Sample output:
{service="authentication"} 0.642
{service="payment-processing"} 1.257
{service="user-profile"} 0.381
This tells us that:
- 95% of authentication service requests complete in 0.642 seconds or less
- Payment processing is slower, with 95% of requests completing in 1.257 seconds or less
- The user profile service is the fastest, with 95% of requests completing in 0.381 seconds or less
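Histogram metrics also expose _sum and _count series, which you can combine to get the mean request duration per service as a complement to the percentiles:
sum(rate(http_request_duration_seconds_sum[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)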
Resource Utilization Monitoring
Memory Usage Percentage
To calculate the percentage of memory used by containers:
sum(container_memory_usage_bytes{namespace="production"}) by (pod)
/
sum(container_spec_memory_limit_bytes{namespace="production"}) by (pod)
* 100
Sample output:
{pod="web-server-6fd7db4f76-gps2j"} 68.4
{pod="cache-5599d789c5-trlmk"} 92.7
{pod="backend-api-6b9f758b7c-lmn45"} 45.3
This shows the cache pod is running close to its memory limit at 92.7%, which might be concerning, while the backend API is comfortably using only 45.3% of its limit.
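To surface only the pods that are running hot, append a comparison to the same expression (the 80% threshold is an arbitrary example):
sum(container_memory_usage_bytes{namespace="production"}) by (pod)
/
sum(container_spec_memory_limit_bytes{namespace="production"}) by (pod)
* 100
> 80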
Disk Space Usage
To monitor disk usage across your infrastructure:
100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"})
Sample output:
{instance="app-server-01:9100", mountpoint="/"} 72.5
{instance="app-server-02:9100", mountpoint="/"} 45.3
{instance="db-server-01:9100", mountpoint="/"} 89.7
This indicates the database server's disk is 89.7% full, which may need attention soon.
Service Health and Availability
Uptime and Service Availability
To calculate service availability as a percentage over the past week:
sum_over_time(up{job="api-gateway"}[7d]) / count_over_time(up{job="api-gateway"}[7d]) * 100
Sample output:
{instance="api-gateway-prod-1"} 99.97
{instance="api-gateway-prod-2"} 100.00
{instance="api-gateway-prod-3"} 99.82
This shows your api-gateway instances have excellent uptime, with instance 2 having perfect 100% availability and instances 1 and 3 experiencing minimal downtime.
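Because up is always either 1 or 0, the same availability calculation can be written more compactly with avg_over_time:
avg_over_time(up{job="api-gateway"}[7d]) * 100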
Target Scrape Health
To monitor how many Prometheus targets are unhealthy:
count(up == 0)
A result of 5 would mean 5 targets are currently down. Note that if every target is up, this query returns an empty result rather than 0, which matters if you alert on it.
To see what percentage of your targets are healthy:
avg(up) * 100
A result of 98.2 would mean 98.2% of your targets are currently up.
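To see exactly which targets are down rather than just how many, query the filtered vector directly; the result lists the labels (job, instance, and so on) of every target that failed its most recent scrape:
up == 0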
Alerting Thresholds and Prediction
Finding Pods About to Run Out of Memory
To identify pods that will reach memory limits within an hour based on current growth rate:
predict_linear(container_memory_usage_bytes{namespace="production"}[30m], 3600)
>
container_spec_memory_limit_bytes{namespace="production"} * 0.9
This query returns the pods that are predicted to exceed 90% of their memory limit within the next hour.
Detecting Unusual Latency Increases
To detect when response times suddenly increase:
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
>
(rate(http_request_duration_seconds_sum[1h] offset 1h)
/
rate(http_request_duration_seconds_count[1h] offset 1h)) * 2
This query flags any time series whose current 5-minute average latency is more than double its 1-hour average latency from an hour ago.
Advanced Patterns
Delta and Increase for Counter Analysis
To see the total number of HTTP errors in the last hour:
increase(http_requests_total{status=~"5.."}[1h])
Sample output:
{job="api-server", instance="10.0.0.1:9090", path="/api/users", status="500"} 37
{job="api-server", instance="10.0.0.1:9090", path="/api/orders", status="503"} 18
{job="api-server", instance="10.0.0.2:9090", path="/api/users", status="500"} 21
This shows that in the last hour, the /api/users endpoint on instance 10.0.0.1 had 37 HTTP 500 errors, while the /api/orders endpoint had 18 HTTP 503 errors.
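To roll these counts up per endpoint and status code instead of per individual series, wrap the increase in a sum:
sum(increase(http_requests_total{status=~"5.."}[1h])) by (path, status)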
Calculating Query Performance Ratios
To find slow query ratios in your database:
sum(rate(database_queries_total{status="slow"}[5m])) by (database)
/
sum(rate(database_queries_total[5m])) by (database)
* 100
Sample output:
{database="users"} 1.2
{database="products"} 5.7
{database="orders"} 3.4
This indicates that 5.7% of queries to the products database are classified as slow, which might warrant investigation.
Heat Map with Histogram Quantiles
To visualize response time distributions at different percentiles:
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
This set of queries gives you the 50th, 90th, 95th, and 99th percentile response times for each service, which you can use to create a heat map in Grafana.
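If you use Grafana's native heatmap panel, you typically feed it the raw bucket rates grouped by le rather than pre-computed quantiles. A sketch for a single service (the label value is illustrative):
sum(rate(http_request_duration_seconds_bucket{service="authentication"}[5m])) by (le)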
Time Range Selection Techniques
Current vs. Last Week Comparison
To compare current request rates with the same period last week:
sum(rate(http_requests_total[1h])) by (service)
/
sum(rate(http_requests_total[1h] offset 7d)) by (service)
Sample output:
{service="web-app"} 1.34
{service="api"} 0.92
{service="admin-portal"} 1.67
This shows that compared to last week, the web-app traffic is up by 34%, the API traffic is down by 8%, and the admin portal traffic has increased by 67%.
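If you prefer an absolute change in requests per second rather than a ratio, subtract the offset series instead of dividing by it:
sum(rate(http_requests_total[1h])) by (service)
-
sum(rate(http_requests_total[1h] offset 7d)) by (service)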
Business Hours Focus
To focus on metrics during business hours only (9 AM to 5 PM, Monday to Friday):
sum(rate(http_requests_total{service="customer-portal"}[5m]))
and on()
(day_of_week() >= 1 and day_of_week() <= 5 and hour() >= 9 and hour() < 17)
This query returns the customer portal request rate only when the evaluation time falls within business hours; outside that window the result is empty. Note that hour() and day_of_week() operate on UTC timestamps, so shift the boundaries if your business hours are defined in a different time zone.
Summary
In this guide, we've explored common PromQL use cases that form the foundation of effective monitoring with Prometheus:
- Basic metric selection and filtering with label matchers
- Rate calculations for request and error monitoring
- Aggregation operations to summarize metrics across dimensions
- Resource utilization monitoring for CPU, memory, and disk space
- Service health and availability tracking
- Setting up proactive alerting based on predictions
- Advanced patterns for deeper analysis
- Time range selection techniques for comparative analysis
By mastering these patterns, you'll be able to extract meaningful insights from your metrics and build effective dashboards and alerts.
Additional Resources
Here are some exercises to further strengthen your PromQL skills:
- Exercise: Create a query to show the top 3 services with the highest error rates in the last 15 minutes.
- Exercise: Write a query to predict which nodes will run out of disk space in the next 24 hours based on the current growth rate.
- Exercise: Develop a query to show the 95th percentile of API response times, grouped by endpoint and HTTP method.
For more advanced PromQL techniques, refer to:
- The official Prometheus documentation
- PromQL for Humans - A user-friendly guide to PromQL
- PromLabs PromQL Cheat Sheet