PromQL Common Use Cases
Introduction
PromQL (Prometheus Query Language) is a powerful functional query language that lets you select and aggregate time series data stored in Prometheus. While the syntax might seem intimidating at first, mastering a set of common use cases will help you quickly solve practical monitoring challenges.
This guide covers the most frequent PromQL patterns you'll need when monitoring your systems. We'll explore real-world examples for resource utilization, error rates, service availability, and more, providing you with ready-to-use queries you can adapt to your environment.
Basic Metric Selection and Filtering
Selecting Metrics with Labels
One of the most common operations is selecting metrics and filtering them by their labels.
http_requests_total{job="api-server", environment="production"}
This query selects the http_requests_total metric, but only for time series where the job label equals "api-server" and the environment label equals "production".
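Label matchers also support negation with !=. For example, a minimal variation of the query above that excludes production traffic instead of selecting it:
http_requests_total{job="api-server", environment!="production"}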
Using Regular Expressions for Label Matching
You can use regular expressions to match multiple label values:
http_requests_total{job=~".*server", environment!~"test|staging"}
This selects http_requests_total metrics where:
- The job label matches any value ending with "server"
- The environment label is neither "test" nor "staging"
Rate Calculations
Calculating Request Rates
To calculate the rate of HTTP requests over the last 5 minutes:
rate(http_requests_total{job="api-server"}[5m])
The output is measured in requests per second. For instance:
{job="api-server", instance="10.0.0.1:9090", path="/api/users"} 12.34
{job="api-server", instance="10.0.0.2:9090", path="/api/users"} 10.21
This tells us the first instance is handling 12.34 requests per second, while the second is handling 10.21.
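If you care about overall throughput rather than per-instance figures, you can aggregate the per-series rates; a common variation of the query above is:
sum(rate(http_requests_total{job="api-server"}[5m]))
This returns a single value: the total requests per second handled by all api-server instances combined.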
Calculating Error Rates
To calculate the error rate (HTTP 5xx responses) as a percentage:
sum(rate(http_requests_total{job="api-server", status=~"5.."}[5m])) by (instance)
/
sum(rate(http_requests_total{job="api-server"}[5m])) by (instance)
* 100
This gives you the percentage of 5xx errors for each instance:
{instance="10.0.0.1:9090"} 2.5
{instance="10.0.0.2:9090"} 1.7
This indicates that instance 10.0.0.1 has a 2.5% error rate, while 10.0.0.2 has a 1.7% error rate.
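The same expression works well as an alert condition. Appending a comparison keeps only the instances whose error rate crosses a threshold (the 5% here is just an illustrative value):
sum(rate(http_requests_total{job="api-server", status=~"5.."}[5m])) by (instance)
/
sum(rate(http_requests_total{job="api-server"}[5m])) by (instance)
* 100
> 5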
Aggregation Operations
Finding the Top 5 CPU Users
To identify which pods are consuming the most CPU:
topk(5, sum(rate(container_cpu_usage_seconds_total[5m])) by (pod))
Sample output:
{pod="search-indexer-67d8b9f88d-2xvqp"} 3.52
{pod="database-primary-0"} 2.14
{pod="api-gateway-75d4f9b675-f9d7x"} 1.87
{pod="cache-6b6b986b9c-t2jxz"} 1.65
{pod="log-collector-84569d887-6zjpm"} 1.23
This shows that the search indexer pod is using the most CPU (3.52 cores), followed by the database primary (2.14 cores), and so on.
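The same pattern works for other grouping dimensions. For instance, assuming the metric carries a namespace label (as it does when scraped from cAdvisor in Kubernetes), you can rank namespaces instead of pods:
topk(5, sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace))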
Calculating Percentiles for Response Times
To calculate the 95th percentile of HTTP request durations:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
Sample output:
{service="authentication"} 0.642
{service="payment-processing"} 1.257
{service="user-profile"} 0.381
This tells us that:
- 95% of authentication service requests complete in 0.642 seconds or less
- Payment processing is slower, with 95% of requests completing in 1.257 seconds or less
- The user profile service is the fastest, with 95% of requests completing in 0.381 seconds or less
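Histogram metrics also expose _sum and _count series, which you can combine to get the mean request duration per service as a complement to the percentiles:
sum(rate(http_request_duration_seconds_sum[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)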
Resource Utilization Monitoring
Memory Usage Percentage
To calculate the percentage of memory used by containers:
sum(container_memory_usage_bytes{namespace="production"}) by (pod)
/
sum(container_spec_memory_limit_bytes{namespace="production"}) by (pod)
* 100
Sample output:
{pod="web-server-6fd7db4f76-gps2j"} 68.4
{pod="cache-5599d789c5-trlmk"} 92.7
{pod="backend-api-6b9f758b7c-lmn45"} 45.3
This shows the cache pod is running close to its memory limit at 92.7%, which might be concerning, while the backend API is comfortably using only 45.3% of its limit.
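To surface only the pods that are running hot, append a comparison to the same expression (the 80% threshold is an arbitrary example):
sum(container_memory_usage_bytes{namespace="production"}) by (pod)
/
sum(container_spec_memory_limit_bytes{namespace="production"}) by (pod)
* 100
> 80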
Disk Space Usage
To monitor disk usage across your infrastructure:
100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"})
Sample output:
{instance="app-server-01:9100", mountpoint="/"} 72.5
{instance="app-server-02:9100", mountpoint="/"} 45.3
{instance="db-server-01:9100", mountpoint="/"} 89.7
This indicates the database server's disk is 89.7% full, which may need attention soon.
Service Health and Availability
Uptime and Service Availability
To calculate service availability as a percentage over the past week:
sum_over_time(up{job="api-gateway"}[7d]) / count_over_time(up{job="api-gateway"}[7d]) * 100
Sample output:
{instance="api-gateway-prod-1"} 99.97
{instance="api-gateway-prod-2"} 100.00
{instance="api-gateway-prod-3"} 99.82
This shows your api-gateway instances have excellent uptime, with instance 2 having perfect 100% availability and instances 1 and 3 experiencing minimal downtime.
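Because up is always either 1 or 0, the same availability calculation can be written more compactly with avg_over_time:
avg_over_time(up{job="api-gateway"}[7d]) * 100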
Target Scrape Health
To monitor how many Prometheus targets are unhealthy:
count(up == 0)
A result of 5 would mean 5 targets are currently down. Note that if every target is up, this query returns an empty result rather than 0, which matters if you alert on it.
To see what percentage of your targets are healthy:
avg(up) * 100
A result of 98.2 would mean 98.2% of your targets are currently up.
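To see exactly which targets are down rather than just how many, query the filtered vector directly; the result lists the labels (job, instance, and so on) of every target that failed its most recent scrape:
up == 0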
Alerting Thresholds and Prediction
Finding Pods About to Run Out of Memory
To identify pods that will reach memory limits within an hour based on current growth rate:
predict_linear(container_memory_usage_bytes{namespace="production"}[30m], 3600)
>
container_spec_memory_limit_bytes{namespace="production"} * 0.9
This query returns the pods that are predicted to exceed 90% of their memory limit within the next hour.
Detecting Unusual Latency Increases
To detect when response times suddenly increase:
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
>
(rate(http_request_duration_seconds_sum[1h] offset 1h)
/
rate(http_request_duration_seconds_count[1h] offset 1h)) * 2
This query flags any time series whose current 5-minute average latency is more than double its 1-hour average latency from an hour ago.
Advanced Patterns
Delta and Increase for Counter Analysis
To see the total number of HTTP errors in the last hour:
increase(http_requests_total{status=~"5.."}[1h])
Sample output:
{job="api-server", instance="10.0.0.1:9090", path="/api/users", status="500"} 37
{job="api-server", instance="10.0.0.1:9090", path="/api/orders", status="503"} 18
{job="api-server", instance="10.0.0.2:9090", path="/api/users", status="500"} 21
This shows that in the last hour, the /api/users endpoint on instance 10.0.0.1 had 37 HTTP 500 errors, while the /api/orders endpoint had 18 HTTP 503 errors.
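To roll these counts up per endpoint and status code instead of per individual series, wrap the increase in a sum:
sum(increase(http_requests_total{status=~"5.."}[1h])) by (path, status)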
Calculating Query Performance Ratios
To find slow query ratios in your database:
sum(rate(database_queries_total{status="slow"}[5m])) by (database)
/
sum(rate(database_queries_total[5m])) by (database)
* 100
Sample output:
{database="users"} 1.2
{database="products"} 5.7
{database="orders"} 3.4
This indicates that 5.7% of queries to the products database are classified as slow, which might warrant investigation.
Heat Map with Histogram Quantiles
To visualize response time distributions at different percentiles:
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
This set of queries gives you the 50th, 90th, 95th, and 99th percentile response times for each service, which you can use to create a heat map in Grafana.
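If you use Grafana's native heatmap panel, you typically feed it the raw bucket rates grouped by le rather than pre-computed quantiles. A sketch for a single service (the label value is illustrative):
sum(rate(http_request_duration_seconds_bucket{service="authentication"}[5m])) by (le)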
Time Range Selection Techniques
Current vs. Last Week Comparison
To compare current request rates with the same period last week:
sum(rate(http_requests_total[1h])) by (service)
/
sum(rate(http_requests_total[1h] offset 7d)) by (service)
Sample output:
{service="web-app"} 1.34
{service="api"} 0.92
{service="admin-portal"} 1.67
This shows that compared to last week, the web-app traffic is up by 34%, the API traffic is down by 8%, and the admin portal traffic has increased by 67%.
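If you prefer an absolute change in requests per second rather than a ratio, subtract the offset series instead of dividing by it:
sum(rate(http_requests_total[1h])) by (service)
-
sum(rate(http_requests_total[1h] offset 7d)) by (service)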
Business Hours Focus
To focus on metrics during business hours only (9 AM to 5 PM, Monday to Friday):
sum(rate(http_requests_total{service="customer-portal"}[5m]))
and on()
(day_of_week() >= 1 and day_of_week() <= 5 and hour() >= 9 and hour() < 17)
This query returns the customer portal request rate only when the evaluation time falls within business hours; outside that window the result is empty. Note that hour() and day_of_week() operate on UTC timestamps, so shift the boundaries if your business hours are defined in a different time zone.
Summary
In this guide, we've explored common PromQL use cases that form the foundation of effective monitoring with Prometheus:
- Basic metric selection and filtering with label matchers
- Rate calculations for request and error monitoring
- Aggregation operations to summarize metrics across dimensions
- Resource utilization monitoring for CPU, memory, and disk space
- Service health and availability tracking
- Setting up proactive alerting based on predictions
- Advanced patterns for deeper analysis
- Time range selection techniques for comparative analysis
By mastering these patterns, you'll be able to extract meaningful insights from your metrics and build effective dashboards and alerts.
Additional Resources
Here are some exercises to further strengthen your PromQL skills:
- Exercise: Create a query to show the top 3 services with the highest error rates in the last 15 minutes.
- Exercise: Write a query to predict which nodes will run out of disk space in the next 24 hours based on the current growth rate.
- Exercise: Develop a query to show the 95th percentile of API response times, grouped by endpoint and HTTP method.
For more advanced PromQL techniques, refer to:
- The official Prometheus documentation
- PromQL for Humans - A user-friendly guide to PromQL
- PromLabs PromQL Cheat Sheet