Query Performance Analysis
Introduction
Query performance analysis is a critical skill when working with Prometheus, especially as your monitoring system scales. In this guide, we'll explore how to identify, analyze, and optimize slow-performing queries in Prometheus. Understanding these concepts will help you build efficient dashboards and alerts while reducing the load on your Prometheus server.
When your monitoring system grows with more metrics, longer retention periods, and complex queries, performance bottlenecks can emerge. This guide will equip you with the knowledge to diagnose and address these issues.
Understanding Query Performance Metrics
Prometheus provides several key metrics to help you understand query performance:
Key Performance Indicators
- Query execution time: How long your query takes to complete
- Memory usage: How much memory is allocated during query execution
- Number of samples processed: How many data points Prometheus needs to analyze
- Series cardinality: The number of unique time series involved in the query
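Several of these indicators can be read directly from Prometheus's own exported metrics. For example, the number of series currently held in the TSDB head block is a rough first proxy for overall cardinality:
# Active series in the TSDB head block
prometheus_tsdb_head_series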
Let's look at how to access these metrics.
Examining Query Performance Stats
Prometheus exposes internal metrics about query performance that you can monitor.
Using the Query Stats API
Prometheus can return detailed statistics for an individual query. Add the stats parameter to a request against the /api/v1/query endpoint and the response will include timings for each execution phase along with sample counts:
curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(http_requests_total[5m])' --data-urlencode 'stats=all'
Example output (the result array is elided to highlight the stats object):
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [ ... ],
    "stats": {
      "timings": {
        "evalTotalTime": 0.032778034,
        "resultSortTime": 4.2126e-06,
        "queryPreparationTime": 9.1389e-05,
        "innerEvalTime": 0.032686895,
        "execQueueTime": 4.101e-05,
        "execTotalTime": 0.03282
      },
      "samples": {
        "totalQueryableSamples": 84001524,
        "peakSamples": 42007
      }
    }
  }
}
Using the /metrics Endpoint
Prometheus exposes internal metrics that you can query to analyze performance:
# 90th percentile query execution time
prometheus_engine_query_duration_seconds{quantile="0.9"}
# Number of queries currently being executed or waiting
prometheus_engine_queries
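In recent Prometheus versions the duration summary also carries a slice label that splits execution time by stage (inner_eval, prepare_time, queue_time, result_sort), which lets you tell evaluation time apart from time spent waiting in the queue:
# 90th percentile query duration, broken down by execution stage
max by (slice) (prometheus_engine_query_duration_seconds{quantile="0.9"})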
Common Performance Issues and Solutions
1. High Cardinality Problems
High cardinality occurs when a metric has many unique label combinations, leading to thousands or millions of time series.
Example of a high cardinality issue:
http_requests_total{path="/api/v1/endpoint"}
If path has unique values for every user ID or session, you might have thousands of time series for this single metric.
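Before restructuring any labels, it helps to measure the problem. Two simple queries (reusing the http_requests_total metric from the example above) show how many series the metric produces and how much of that comes from the path label:
# How many series does this metric currently have?
count(http_requests_total)
# How many distinct values does the path label take?
count(count by (path) (http_requests_total))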
Solutions:
# Instead of tracking every unique path, use a label with fewer values
http_requests_total{path_group="api"}
# Or aggregate to reduce cardinality
sum(http_requests_total) by (status_code, method)
2. Inefficient Regular Expressions
Regular expressions can be computationally expensive, especially with large datasets.
Example of an inefficient regex query:
{job=~".*api.*"}
Optimized alternative:
{job=~"api|backend-api|user-api"}
3. Range Queries with Long Time Windows
Queries that analyze long time ranges can be very expensive.
Example of a resource-intensive range query:
rate(http_requests_total[30d])
Optimized alternatives:
# Use a shorter time window
rate(http_requests_total[5m])
# Use recording rules for common patterns
job:http_requests:rate5m
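If a panel genuinely needs a long window, a common pattern is to aggregate the pre-computed 5m rate over that window instead of re-rating the raw counters (this assumes a recording rule like the one shown in the next section is already in place):
# Average the pre-recorded 5m rate over 30 days instead of rate()-ing raw samples
avg_over_time(job:http_requests:rate5m[30d])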
Query Optimization Techniques
1. Using Recording Rules
Recording rules pre-compute expensive expressions and save results as new time series, significantly improving dashboard performance.
In your prometheus.yml configuration:
rule_files:
  - "recording_rules.yml"
And in recording_rules.yml:
groups:
  - name: http_requests
    interval: 5m
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
Then use the pre-computed metric in your dashboards:
job:http_requests_total:rate5m{job="api"}
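Once the rule has been running for a while, a quick sanity check is to compare the recorded series against the raw expression; the difference should stay close to zero (small deviations are expected because the rule is only evaluated every 5 minutes):
# Recorded series minus the equivalent raw expression - should be near zero
job:http_requests_total:rate5m{job="api"} - sum(rate(http_requests_total{job="api"}[5m])) by (job)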
2. Limiting Time Range and Resolution
Adjust the time range and resolution to match what the panel actually displays. In dashboards the step is normally set by the query_range API call (Grafana derives it from the panel width), while inside PromQL a subquery lets you control both the window and the resolution explicitly:
# A subquery: the 5m rate evaluated over the last hour at 5m resolution
rate(http_requests_total[5m])[1h:5m]
3. Avoiding Suboptimal Functions
Some functions require more resources than others:
# More expensive
quantile_over_time(0.95, http_request_duration_seconds[1h])
# Less expensive alternative
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))
Visualizing Query Performance
You can create a dashboard to monitor your Prometheus query performance using these queries:
# Query execution duration
prometheus_engine_query_duration_seconds{quantile="0.9"}
# Number of concurrent queries
prometheus_engine_queries
# Memory usage
process_resident_memory_bytes{job="prometheus"}
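It is also worth watching how close the engine gets to its concurrency limit (the --query.max-concurrency flag); Prometheus exports the configured limit as a metric, so saturation is a simple ratio:
# Fraction of the query concurrency limit currently in use
prometheus_engine_queries / prometheus_engine_queries_concurrent_max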
Practical Example: Optimizing a Dashboard
Let's walk through a complete example of optimizing a dashboard with slow queries.
Original Dashboard Queries
# Original query - slow and resource-intensive
sum(rate(http_requests_total[1h])) by (service, endpoint, status_code)
Step 1: Identify the Problem
Using the Query Stats API, we see this query processes millions of samples and takes over 10 seconds.
Step 2: Analyze and Optimize
We can see that the high cardinality comes from having too many dimensions and a long lookback window.
Step 3: Implement Solutions
- Create a recording rule:
groups:
  - name: http
    interval: 5m
    rules:
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service, status_code)
- Use the recording rule in your dashboard (a filtered example follows after this list):
# Optimized query
service:http_requests:rate5m
- Results:
- Query time reduced from 10s to 0.1s
- Memory usage reduced by 80%
- Dashboard load time improved significantly
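As noted above, the recorded series can still be filtered and re-aggregated per panel; the service value below is purely illustrative:
# Per-service breakdown by status code, built from the recorded series
sum(service:http_requests:rate5m{service="checkout"}) by (status_code)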
Performance Testing Tool - promtool
Prometheus ships with a command-line tool called promtool that can run queries against a live server, which makes it easy to time individual expressions outside of a dashboard.
time promtool query instant http://localhost:9090 'rate(http_requests_total[5m])'
The command prints the resulting series, and the time wrapper reports the wall-clock duration of the round trip. For detailed statistics such as sample counts, use the stats parameter on the query API shown earlier.
Summary
In this guide, we've covered the fundamentals of query performance analysis in Prometheus:
- Understanding query performance metrics and how to access them
- Identifying common performance issues like high cardinality and inefficient expressions
- Implementing optimization techniques using recording rules and query refinement
- Tools and methods for ongoing performance monitoring
Mastering query performance analysis is essential for maintaining a healthy and responsive monitoring system as your infrastructure grows. By applying these techniques, you can ensure your Prometheus queries remain efficient and your dashboards load quickly.
Additional Resources
Here are some exercises to further your understanding:
- Set up a dashboard to monitor your Prometheus server's query performance
- Identify the top 5 slowest queries in your environment
- Create recording rules for common dashboard queries
- Analyze a slow query and optimize it to improve performance
Further Reading
- Prometheus documentation on Query Optimization
- Prometheus documentation on Recording Rules
- Prometheus documentation on Storage