Performance Optimization
Introduction
Performance optimization is a critical aspect of managing Prometheus in production environments. As your monitoring needs grow and the volume of metrics increases, you may encounter challenges related to query performance, storage consumption, and resource utilization. This guide will explore various techniques to optimize your Prometheus deployment, ensuring it remains efficient, responsive, and reliable even at scale.
Performance optimization for Prometheus involves balancing three key aspects:
- Query efficiency - Ensuring queries complete quickly
- Storage optimization - Managing disk space consumption
- Resource utilization - Properly allocating CPU and memory resources
By understanding and implementing the techniques covered in this guide, you'll be able to maintain a high-performing monitoring system even as your infrastructure grows.
Understanding Prometheus Performance Factors
Before diving into specific optimization strategies, it's important to understand the factors that impact Prometheus performance:
Cardinality
Cardinality refers to the number of unique time series in your Prometheus database. High cardinality is the most common cause of performance problems.
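To see why cardinality dominates, note that every distinct combination of label values creates its own time series. Here is a minimal sketch of the worst-case arithmetic, assuming every combination actually occurs (the label names and counts are illustrative):

```python
# Rough cardinality model: each combination of label values that occurs
# creates one time series. In the worst case (all combinations occur),
# the series count is the product of the per-label value counts.
from math import prod

def worst_case_series(label_value_counts):
    """Upper bound on time series for one metric name."""
    return prod(label_value_counts.values())

# A metric with bounded labels stays manageable...
bounded = {"endpoint": 50, "method": 5, "status": 10}
print(worst_case_series(bounded))  # 2500

# ...but one unbounded label (e.g. user_id) multiplies everything.
unbounded = {**bounded, "user_id": 100_000}
print(worst_case_series(unbounded))  # 250000000
```

A single unbounded label turns a few thousand series into hundreds of millions, which is why label design matters more than any tuning flag.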
Scrape Interval and Retention
The frequency at which you collect metrics and how long you store them directly impacts resource consumption.
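These two settings can be turned into a back-of-the-envelope capacity estimate. The sketch below assumes an average of roughly 1.5 bytes per sample after TSDB compression, a commonly cited ballpark rather than a guarantee; real usage varies with series churn:

```python
def ingestion_rate(active_series, scrape_interval_s):
    """Samples ingested per second."""
    return active_series / scrape_interval_s

def disk_estimate_bytes(active_series, scrape_interval_s,
                        retention_days, bytes_per_sample=1.5):
    """Very rough on-disk estimate; churn and compression change the real figure."""
    samples = ingestion_rate(active_series, scrape_interval_s) * retention_days * 86_400
    return samples * bytes_per_sample

# 100k active series scraped every 15s, kept for 15 days:
print(round(ingestion_rate(100_000, 15)))                      # 6667 (samples/s)
print(f"{disk_estimate_bytes(100_000, 15, 15) / 1e9:.0f} GB")  # 13 GB
```

Doubling the scrape interval roughly halves both ingestion rate and disk usage, which is why per-job intervals (covered later) are a cheap optimization lever.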
Query Complexity
Complex PromQL queries, especially those with many operations or high cardinality, can consume significant resources.
Query Optimization Techniques
1. Optimize Your PromQL Queries
Inefficient queries can consume excessive resources. Here are some strategies to optimize your PromQL queries:
Avoid using .* in regex matchers
Inefficient:
http_requests_total{path=~"/api/.*"}
Better:
http_requests_total{path=~"/api/[^/]*"}
Limit the time range when possible
Inefficient:
rate(http_requests_total[24h])
Better:
rate(http_requests_total[5m])
Use aggregation to reduce the number of returned series
Inefficient:
http_requests_total{job="api-server"}
Better (if you only need totals by endpoint):
sum by(endpoint) (http_requests_total{job="api-server"})
2. Use Recording Rules for Complex Queries
Recording rules allow you to precompute frequently used or computationally expensive expressions and save their result as a new time series.
Example recording rule configuration:
groups:
- name: example
rules:
- record: job:http_requests_total:rate5m
expr: sum by(job) (rate(http_requests_total[5m]))
After adding this rule, you can query job:http_requests_total:rate5m instead of calculating the rate expression each time.
3. Implement Query Caching
Prometheus itself does not cache query results, and there is no query_cache section in prometheus.yml. To speed up frequently repeated dashboard queries, place a caching layer in front of Prometheus instead, such as the Thanos Query Frontend (which splits and caches range-query responses) or Grafana's built-in data source query caching. For expensive expressions, recording rules remain the built-in way to avoid recomputation.
Storage Optimization
1. Adjust Data Retention
By default, Prometheus stores data for 15 days. Retention is set with command-line flags rather than in prometheus.yml; adjust it based on your needs:
# Command-line flags
--storage.tsdb.retention.time=30d    # Keep data for 30 days
--storage.tsdb.retention.size=500GB  # Optionally cap total block storage
When both limits are set, whichever is reached first triggers deletion of the oldest blocks.
2. Understand TSDB Compression
The time series database (TSDB) compresses samples automatically, using delta-of-delta encoding for timestamps and XOR encoding for values (the "Gorilla" technique), which averages roughly 1-2 bytes per sample in practice. There is no tunable setting such as compression_factor; to reduce disk usage, focus on cardinality, scrape intervals, and retention instead.
3. Manage Block Sizes
Prometheus writes incoming data into two-hour blocks, which background compaction later merges into larger blocks (up to 10% of the retention window, capped at 31 days). The advanced flags --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration can override these durations, but they exist mainly for testing and for remote-storage setups such as Thanos; most deployments should leave them at their defaults.
Cardinality Management
1. Design Metrics with Cardinality in Mind
Choose labels carefully to avoid explosion in the number of time series:
High cardinality (avoid):
http_requests_total{user_id="12345", session_id="abcdef", request_id="xyz123"}
Better approach:
http_requests_total{endpoint="/api/users", status_code="200"}
2. Use Exemplars for High-Cardinality Data
For high-cardinality data like request IDs or user IDs, use exemplars instead of labels:
// Go example using the official Prometheus client (client_golang).
// Counters attach exemplars via AddWithExemplar; histograms use
// ObserveWithExemplar. The server must run with
// --enable-feature=exemplar-storage and expose the OpenMetrics format.
httpRequestsCounter.WithLabelValues("/api/users", "200").(prometheus.ExemplarAdder).
	AddWithExemplar(1, prometheus.Labels{
		"request_id": "xyz123",
		"user_id":    "12345",
	})
3. Implement a Metric Naming Convention
Establish a consistent naming convention to avoid duplicate metrics with slightly different names:
<namespace>_<subsystem>_<name>_<unit>
For example: http_server_requests_seconds_count
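A convention is easier to keep when it is enforced mechanically, for example in CI. Below is a hypothetical validator for the pattern above; the regex and the unit list are illustrative assumptions, not an official Prometheus rule:

```python
import re

# Hypothetical checker for the <namespace>_<subsystem>_<name>_<unit> pattern:
# lowercase words separated by underscores, ending in a recognized suffix.
_VALID = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
_SUFFIXES = ("seconds", "bytes", "ratio", "total", "count", "info")

def follows_convention(name: str) -> bool:
    """True if the metric name matches the (assumed) house convention."""
    return bool(_VALID.match(name)) and name.rsplit("_", 1)[-1] in _SUFFIXES

print(follows_convention("http_server_requests_seconds_count"))  # True
print(follows_convention("HTTPRequests"))                        # False
```

Running such a check against the output of a /metrics endpoint catches naming drift before duplicate metrics accumulate.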
Resource Management
1. Hardware Sizing Guidelines
Based on your metrics volume:
| Metrics Volume | CPU Recommendations | Memory Recommendations | Storage Recommendations |
|---|---|---|---|
| < 10K series | 2 CPU cores | 4 GB RAM | 50 GB SSD |
| 10K-100K series | 4-8 CPU cores | 8-16 GB RAM | 100-500 GB SSD |
| 100K-1M series | 8-16 CPU cores | 32-64 GB RAM | 500 GB-2 TB SSD |
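If you want this guidance available to tooling (say, a provisioning script), a small lookup helper might look like the sketch below. The thresholds and specs simply mirror the table's rough guidance; they are not official Prometheus requirements:

```python
# Sizing tiers taken from the guideline table above (rough guidance only).
TIERS = [
    (10_000,    "2 CPU cores, 4 GB RAM, 50 GB SSD"),
    (100_000,   "4-8 CPU cores, 8-16 GB RAM, 100-500 GB SSD"),
    (1_000_000, "8-16 CPU cores, 32-64 GB RAM, 500 GB-2 TB SSD"),
]

def recommend(active_series: int) -> str:
    """Return the first tier whose upper bound exceeds the series count."""
    for limit, spec in TIERS:
        if active_series < limit:
            return spec
    return "consider sharding or remote storage"

print(recommend(50_000))  # 4-8 CPU cores, 8-16 GB RAM, 100-500 GB SSD
```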
2. Tune Scrape Configuration
Adjust scrape configurations to balance between monitoring frequency and resource consumption:
scrape_configs:
- job_name: 'critical-services'
scrape_interval: 10s
scrape_timeout: 5s
- job_name: 'less-critical-services'
scrape_interval: 30s
scrape_timeout: 10s
3. Configure Memory Management
Prometheus bounds query resource usage through command-line flags rather than a query section in prometheus.yml:
# Command-line flags
--query.max-samples=50000000  # Maximum samples a single query may load into memory
--query.timeout=2m            # Maximum query execution time
Queries that exceed either limit are aborted instead of exhausting the server.
Practical Examples
Case Study: Optimizing a High-Cardinality Environment
Consider a web service that initially tracked HTTP requests with detailed labels:
http_requests_total{path="/api/user/profile", method="GET", status="200", user_id="12345", region="us-west", device="mobile", app_version="2.3.4"}
This approach created millions of time series and led to performance issues.
Solution:
- Removed high-cardinality labels from metrics:
http_requests_total{path="/api/user/profile", method="GET", status="200"}
- Created specific aggregated metrics for important dimensions:
http_requests_by_region_total{path="/api/user/profile", method="GET", status="200", region="us-west"}
http_requests_by_device_total{path="/api/user/profile", method="GET", status="200", device="mobile"}
- Used exemplars to track specific requests for troubleshooting.
Result: Query performance improved by 90%, and storage requirements decreased by 75%.
Case Study: Query Optimization
A dashboard with this query was taking 30+ seconds to load:
sum by(service, endpoint) (rate(http_request_duration_seconds_count{environment="production"}[5m]))
Solution:
- Created a recording rule to pre-compute this expression:
groups:
- name: http_metrics
rules:
- record: job:http_request_duration_seconds_count:rate5m
expr: sum by(service, endpoint) (rate(http_request_duration_seconds_count{environment="production"}[5m]))
- Updated the dashboard to use the pre-computed metric:
job:http_request_duration_seconds_count:rate5m
Result: Dashboard loading time reduced to under 1 second.
Monitoring Prometheus Itself
To optimize Prometheus, you should monitor its own performance:
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
Key metrics to watch:
- prometheus_tsdb_head_series - Number of active time series
- prometheus_engine_query_duration_seconds - Query execution time
- prometheus_tsdb_storage_blocks_bytes - Disk usage
- process_resident_memory_bytes - Memory usage
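These metrics can also be pulled programmatically through Prometheus's instant-query HTTP API (/api/v1/query), which is handy for scripted health checks. A stdlib-only sketch; the server address is an assumption for your setup:

```python
import json
import urllib.parse
import urllib.request

def instant_query_url(base: str, promql: str) -> str:
    """Build a URL for Prometheus's instant-query HTTP API."""
    return f"{base}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"

def fetch_value(base: str, promql: str) -> float:
    """Run a query and return the first result's value (needs a live server)."""
    with urllib.request.urlopen(instant_query_url(base, promql)) as resp:
        body = json.load(resp)
    return float(body["data"]["result"][0]["value"][1])

url = instant_query_url("http://localhost:9090", "prometheus_tsdb_head_series")
print(url)  # http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series
```

For example, alerting when fetch_value(...) for prometheus_tsdb_head_series jumps sharply is a simple early-warning signal for a cardinality explosion.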
Summary
Performance optimization is an ongoing process that involves:
- Designing metrics carefully to avoid cardinality problems
- Optimizing queries using efficient PromQL expressions and recording rules
- Managing storage through proper retention policies and compression
- Allocating appropriate resources based on your metrics volume
By implementing these best practices, you can ensure your Prometheus deployment remains efficient and reliable even as your monitoring needs grow.
Additional Resources
- Prometheus Documentation: Storage
- Prometheus Documentation: Query Optimization
- PromCon Talks on Performance
- Grafana Labs: Cardinality is (almost) killing you
Exercises
- Analyze your current Prometheus metrics and identify high-cardinality series using:
  topk(10, count by (__name__, job) (count by (__name__, job, instance) ({__name__!=""})))
- Implement at least two recording rules for your most frequently used or complex queries.
- Create a Prometheus dashboard to monitor the performance of your Prometheus server itself.
- Review your retention settings and calculate the optimal storage based on your metrics volume and business requirements.