Memory Usage Optimization
Introduction
Prometheus is a powerful monitoring and alerting toolkit, but as your monitoring infrastructure grows, memory usage can become a critical concern. Unoptimized Prometheus instances may consume excessive memory, leading to performance degradation, OOM (Out of Memory) errors, or even complete system crashes.
In this guide, we'll explore how to identify memory issues in Prometheus, understand the main factors that affect memory consumption, and implement effective strategies to optimize memory usage without sacrificing monitoring capabilities.
Understanding Prometheus Memory Usage
Before diving into optimization techniques, it's important to understand what drives memory consumption in Prometheus.
Key Memory Consumers in Prometheus
- Time Series Database (TSDB): Stores all the metric data
- Sample Ingestion: Processing incoming metrics
- Query Execution: Memory used during query operations
- Service Discovery: Tracking targets to scrape
- Alerting Rules: Evaluating alert conditions
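To get a rough sense of how these components contribute on your own instance, you can query a few of Prometheus's self-metrics. The following is a minimal sketch, assuming Prometheus scrapes itself under the job name "prometheus", is reachable at localhost:9090, and that curl and jq are installed:
#!/bin/bash
# memory_consumers.sh - rough overview of what is driving memory on this instance
PROM="http://localhost:9090"
# Helper: run an instant query against the Prometheus API and print the first value
query() {
  curl -s -G "$PROM/api/v1/query" --data-urlencode "query=$1" \
    | jq -r '.data.result[0].value[1]'
}
echo "Resident memory (bytes):     $(query 'process_resident_memory_bytes{job="prometheus"}')"
echo "Active series in head block: $(query 'prometheus_tsdb_head_series')"
echo "Chunks in head block:        $(query 'prometheus_tsdb_head_chunks')"
echo "Discovered scrape targets:   $(query 'count(up)')"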
Identifying Memory Issues
Before optimizing, you need to determine if you have a memory problem and identify its root cause.
Signs of Memory Issues
- Prometheus process crashes with OOM (Out of Memory) errors
- High and growing memory usage over time
- Slow query response times
- Metric collection delays
Monitoring Prometheus Itself
Prometheus can monitor itself! Use these metrics to track memory usage:
# Memory usage of the Prometheus process
process_resident_memory_bytes{job="prometheus"}
# Related metrics that help explain memory growth
prometheus_tsdb_head_series
prometheus_tsdb_head_chunks_created_total
prometheus_engine_queries_concurrent_max
Let's create a simple script to check Prometheus memory usage:
#!/bin/bash
# prometheus_memory_check.sh
PROMETHEUS_URL="http://localhost:9090"
# Get current memory usage (jq -r strips the surrounding quotes so bc can do the math;
# the job selector assumes the default self-scrape job name)
MEM_USAGE=$(curl -s -G "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}' \
  | jq -r '.data.result[0].value[1]')
MEM_USAGE_MB=$(echo "$MEM_USAGE / 1024 / 1024" | bc)
echo "Current Prometheus memory usage: ${MEM_USAGE_MB}MB"
# Check if it's above threshold
if (( $(echo "$MEM_USAGE_MB > 2000" | bc -l) )); then
  echo "WARNING: Memory usage is high (>2GB)"
fi
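To run the check regularly, make the script executable and schedule it, for example from cron (the path and log location below are placeholders; adjust them to your environment):
chmod +x prometheus_memory_check.sh
# Example crontab entry: run every 5 minutes and append output to a log
# */5 * * * * /opt/scripts/prometheus_memory_check.sh >> /var/log/prometheus_memory_check.log 2>&1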
Memory Optimization Strategies
Now let's explore the key strategies to optimize Prometheus memory usage:
1. Optimize Scrape Configuration
The more metrics you collect, the more memory you need. Be selective!
scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:9100']
    # Use relabeling to drop unnecessary metrics
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_cpu_.*'
        action: keep
This configuration focuses on keeping only CPU-related metrics from the node exporter.
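Before applying a trimmed-down scrape configuration like this, it's worth validating it with promtool, which ships with Prometheus (the config path below is a common default; adjust it to your installation):
# Validate the configuration file before applying it
promtool check config /etc/prometheus/prometheus.yml
# Reload the running server without a restart (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload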
2. Use metric_relabel_configs to Filter Metrics
Relabeling allows you to filter metrics before they're stored:
metric_relabel_configs:
  # Drop high-cardinality metrics
  - source_labels: [__name__]
    regex: 'http_requests_total'
    action: drop
  # Keep only metrics with specific labels
  - source_labels: [environment]
    regex: 'production'
    action: keep
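One caveat: a keep rule discards every sample that does not match, including metrics that lack the environment label entirely, so test such rules carefully. To confirm a dropped metric is no longer being ingested, you can run an instant query against it after the change (a sketch using the default local endpoint and the metric from the example above):
# Count current series for the dropped metric; once no new samples have arrived
# within the lookback window (5 minutes by default), this returns an empty result
curl -s -G "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=count(http_requests_total)' | jq '.data.result'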
3. Adjust Retention Time
By default, Prometheus stores data for 15 days. Retention primarily affects disk usage, but shortening it can also ease memory pressure from queries that span long time ranges. Note that retention is configured with command-line flags, not in prometheus.yml:
# Startup flags (add these to the Prometheus command line or service unit)
--storage.tsdb.retention.time=5d      # keep 5 days instead of the 15-day default
--storage.tsdb.retention.size=10GB    # optional: also cap total on-disk size
4. Optimize Storage Block Size
The TSDB cuts a new block every 2 hours by default, and the in-memory head block is one of the largest memory consumers. Block durations are controlled by advanced command-line flags rather than prometheus.yml, and changing them is rarely needed outside of special setups (for example, when running Thanos):
# Advanced startup flags; the minimum defaults to 2h and the maximum
# defaults to 10% of the retention period
--storage.tsdb.min-block-duration=1h
--storage.tsdb.max-block-duration=10h
5. Vertical Scaling
Sometimes the simplest solution is to increase available memory:
# Run Prometheus with memory limits (using Docker)
docker run -d --name prometheus \
--memory="4g" --memory-swap="4g" \
-p 9090:9090 \
prom/prometheus:v2.37.0
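If you are running a Prometheus release built with Go 1.19 or later, you can additionally set the Go runtime's GOMEMLIMIT soft memory limit slightly below the hard container limit, which makes the garbage collector free memory more aggressively as the limit is approached instead of risking an OOM kill. A hedged example (the image tag and the 3750MiB value are illustrative):
# Give the Go runtime a soft limit just below the hard container limit
docker run -d --name prometheus \
  --memory="4g" --memory-swap="4g" \
  -e GOMEMLIMIT=3750MiB \
  -p 9090:9090 \
  prom/prometheus:latest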
6. Horizontal Scaling (Federation)
Split your Prometheus setup into multiple instances:
# In the aggregating Prometheus
scrape_configs:
  - job_name: 'prometheus-federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'
        - '{job="api"}'
    static_configs:
      - targets:
          - 'prometheus-app:9090'
          - 'prometheus-infra:9090'
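You can preview what a federated scrape would pull before pointing the aggregating instance at it (the hostname and matcher come from the example above; substitute your own):
# Show the first series the /federate endpoint would expose for the node job
curl -s -G "http://prometheus-app:9090/federate" \
  --data-urlencode 'match[]={job="node"}' | head -n 20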
Practical Example: Reducing Memory in High-Cardinality Environments
Let's look at a real-world scenario where a Prometheus instance is monitoring a Kubernetes cluster with high-cardinality metrics.
Problem:
- Prometheus using 8GB of memory and growing
- Many high-cardinality metrics from Kubernetes pod labels
- Query performance degrading
Solution:
Step 1: Identify high-cardinality metrics:
topk(10, count by (__name__)({__name__=~".+"}))
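Alternatively, the TSDB status endpoint reports the top contributors to series count and label cardinality directly (assuming the default local address; the jq filter just extracts the relevant fields):
# Top metric names and label names by series/value count, straight from the TSDB head
curl -s http://localhost:9090/api/v1/status/tsdb \
  | jq '.data.seriesCountByMetricName, .data.labelValueCountByLabelName'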
Step 2: Create a more selective scrape configuration:
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
    metric_relabel_configs:
      # Drop high-cardinality metrics
      - source_labels: [__name__]
        regex: 'container_tasks_state|kube_pod_container_status_last_terminated_reason'
        action: drop
      # Limit label values to reduce cardinality
      - source_labels: [pod]
        regex: '.+'
        target_label: pod
        replacement: 'redacted'
        action: replace
Step 3: Implement recording rules to pre-aggregate data:
# rules.yml
groups:
  - name: k8s-aggregation
    rules:
      - record: job:container_cpu_usage_seconds_total:sum
        expr: sum by (job) (rate(container_cpu_usage_seconds_total[5m]))
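Before loading the rule file, validate it with promtool and make sure it is referenced under rule_files in prometheus.yml; then reload the server (the reload endpoint requires --web.enable-lifecycle):
promtool check rules rules.yml
curl -X POST http://localhost:9090/-/reload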
Results:
After implementing these changes:
- Memory usage reduced from 8GB to 2GB
- Query performance improved by 65%
- No loss of critical monitoring coverage
Memory Management Tools
Here are some tools to help manage Prometheus memory:
1. Go pprof Heap Profiles
Prometheus exposes Go's built-in profiling endpoints, so the standard go tool pprof can show exactly where memory is being allocated:
# Grab a heap profile from the running Prometheus
curl -s http://localhost:9090/debug/pprof/heap > heap.prof
# Summarize the biggest allocators (requires the Go toolchain)
go tool pprof -top heap.prof
# Or explore the profile interactively in a browser
go tool pprof -http=:8081 heap.prof
2. Prometheus Memory Control Script
Create a monitoring script that restarts Prometheus if memory exceeds threshold:
#!/bin/bash
# prom_memory_control.sh
MAX_MEMORY_MB=4000
PROMETHEUS_URL="http://localhost:9090"
while true; do
  # Query resident memory of the Prometheus process (jq -r strips the quotes;
  # the job selector assumes the default self-scrape job name)
  MEM_USAGE=$(curl -s -G "$PROMETHEUS_URL/api/v1/query" \
    --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}' \
    | jq -r '.data.result[0].value[1]')
  MEM_USAGE_MB=$(echo "$MEM_USAGE / 1024 / 1024" | bc)
  echo "$(date): Prometheus memory usage: ${MEM_USAGE_MB}MB"
  if (( $(echo "$MEM_USAGE_MB > $MAX_MEMORY_MB" | bc -l) )); then
    echo "WARNING: Memory exceeds ${MAX_MEMORY_MB}MB threshold, restarting service"
    systemctl restart prometheus
  fi
  sleep 300
done
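A restart loop like this is a blunt instrument. If Prometheus runs under systemd, a cleaner alternative is to let systemd enforce the limit itself with its MemoryMax directive (a sketch assuming a unit named prometheus.service on a cgroup v2 system; pick a limit that fits your host):
# Create a drop-in that caps the service's memory usage
sudo mkdir -p /etc/systemd/system/prometheus.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/prometheus.service.d/memory.conf
[Service]
MemoryMax=4G
EOF
sudo systemctl daemon-reload
sudo systemctl restart prometheus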
Summary
Memory usage optimization is crucial for running Prometheus reliably in production environments. By implementing the strategies discussed in this guide, you can significantly reduce memory consumption while maintaining effective monitoring.
Key takeaways:
- Monitor Prometheus itself to understand memory usage patterns
- Focus on high-cardinality metrics, which are the biggest memory consumers
- Use relabeling to filter unnecessary metrics before storage
- Consider federation for horizontal scaling
- Regularly review and adjust retention policies based on actual needs
Additional Resources
- Official Documentation: The Prometheus storage documentation provides detailed information about TSDB memory usage.
- Cardinality Explorer: Try Cardinality Explorer to find high-cardinality metrics in your setup.
Exercises
- Use topk(10, count by (__name__)({__name__=~".+"})) to identify the metrics with the highest cardinality in your environment.
- Create a recording rule that pre-aggregates a high-cardinality metric in your system.
- Implement a metric relabeling configuration to drop at least three non-essential high-cardinality metrics.
- Set up a federated Prometheus architecture with one instance focused on application metrics and another on infrastructure metrics.