Memory Usage Optimization
Introduction
Prometheus is a powerful monitoring and alerting toolkit, but as your monitoring infrastructure grows, memory usage can become a critical concern. Unoptimized Prometheus instances may consume excessive memory, leading to performance degradation, OOM (Out of Memory) errors, or even complete system crashes.
In this guide, we'll explore how to identify memory issues in Prometheus, understand the main factors that affect memory consumption, and implement effective strategies to optimize memory usage without sacrificing monitoring capabilities.
Understanding Prometheus Memory Usage
Before diving into optimization techniques, it's important to understand what drives memory consumption in Prometheus.
Key Memory Consumers in Prometheus
- Time Series Database (TSDB): Stores all the metric data
- Sample Ingestion: Processing incoming metrics
- Query Execution: Memory used during query operations
- Service Discovery: Tracking targets to scrape
- Alerting Rules: Evaluating alert conditions
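To get a rough sense of how these components contribute on your own instance, you can query a few of Prometheus's self-metrics. The following is a minimal sketch, assuming Prometheus scrapes itself under the job name "prometheus", is reachable at localhost:9090, and that curl and jq are installed:
#!/bin/bash
# memory_consumers.sh - rough overview of what is driving memory on this instance
PROM="http://localhost:9090"
# Helper: run an instant query against the Prometheus API and print the first value
query() {
  curl -s -G "$PROM/api/v1/query" --data-urlencode "query=$1" \
    | jq -r '.data.result[0].value[1]'
}
echo "Resident memory (bytes):     $(query 'process_resident_memory_bytes{job="prometheus"}')"
echo "Active series in head block: $(query 'prometheus_tsdb_head_series')"
echo "Chunks in head block:        $(query 'prometheus_tsdb_head_chunks')"
echo "Discovered scrape targets:   $(query 'count(up)')"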
Identifying Memory Issues
Before optimizing, you need to determine if you have a memory problem and identify its root cause.
Signs of Memory Issues
- Prometheus process crashes with OOM (Out of Memory) errors
- High and growing memory usage over time
- Slow query response times
- Metric collection delays
Monitoring Prometheus Itself
Prometheus can monitor itself! Use these metrics to track memory usage:
# Memory usage of the Prometheus process
process_resident_memory_bytes{job="prometheus"}
# Related metrics that help explain memory growth
prometheus_tsdb_head_series
prometheus_tsdb_head_chunks_created_total
prometheus_engine_queries_concurrent_max
Let's create a simple script to check Prometheus memory usage:
#!/bin/bash
# prometheus_memory_check.sh
PROMETHEUS_URL="http://localhost:9090"
# Get current memory usage (jq -r strips the surrounding quotes so bc can do the math;
# the job selector assumes the default self-scrape job name)
MEM_USAGE=$(curl -s -G "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}' \
  | jq -r '.data.result[0].value[1]')
MEM_USAGE_MB=$(echo "$MEM_USAGE / 1024 / 1024" | bc)
echo "Current Prometheus memory usage: ${MEM_USAGE_MB}MB"
# Check if it's above threshold
if (( $(echo "$MEM_USAGE_MB > 2000" | bc -l) )); then
  echo "WARNING: Memory usage is high (>2GB)"
fi
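To run the check regularly, make the script executable and schedule it, for example from cron (the path and log location below are placeholders; adjust them to your environment):
chmod +x prometheus_memory_check.sh
# Example crontab entry: run every 5 minutes and append output to a log
# */5 * * * * /opt/scripts/prometheus_memory_check.sh >> /var/log/prometheus_memory_check.log 2>&1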
Memory Optimization Strategies
Now let's explore the key strategies to optimize Prometheus memory usage:
1. Optimize Scrape Configuration
The more metrics you collect, the more memory you need. Be selective!
scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:9100']
    # Use relabeling to drop unnecessary metrics
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_cpu_.*'
        action: keep
This configuration focuses on keeping only CPU-related metrics from the node exporter.
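Before applying a trimmed-down scrape configuration like this, it's worth validating it with promtool, which ships with Prometheus (the config path below is a common default; adjust it to your installation):
# Validate the configuration file before applying it
promtool check config /etc/prometheus/prometheus.yml
# Reload the running server without a restart (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload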
2. Use metric_relabel_configs to Filter Metrics
Relabeling allows you to filter metrics before they're stored:
metric_relabel_configs:
  # Drop high-cardinality metrics
  - source_labels: [__name__]
    regex: 'http_requests_total'
    action: drop
  # Keep only metrics with specific labels
  - source_labels: [environment]
    regex: 'production'
    action: keep
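One caveat: a keep rule discards every sample that does not match, including metrics that lack the environment label entirely, so test such rules carefully. To confirm a dropped metric is no longer being ingested, you can run an instant query against it after the change (a sketch using the default local endpoint and the metric from the example above):
# Count current series for the dropped metric; once no new samples have arrived
# within the lookback window (5 minutes by default), this returns an empty result
curl -s -G "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=count(http_requests_total)' | jq '.data.result'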
3. Adjust Retention Time
By default, Prometheus stores data for 15 days. Retention primarily affects disk usage, but shortening it can also ease memory pressure from queries that span long time ranges. Note that retention is configured with command-line flags, not in prometheus.yml:
# Startup flags (add these to the Prometheus command line or service unit)
--storage.tsdb.retention.time=5d      # keep 5 days instead of the 15-day default
--storage.tsdb.retention.size=10GB    # optional: also cap total on-disk size
4. Optimize Storage Block Size
The TSDB cuts a new block every 2 hours by default, and the in-memory head block is one of the largest memory consumers. Block durations are controlled by advanced command-line flags rather than prometheus.yml, and changing them is rarely needed outside of special setups (for example, when running Thanos):
# Advanced startup flags; the minimum defaults to 2h and the maximum
# defaults to 10% of the retention period
--storage.tsdb.min-block-duration=1h
--storage.tsdb.max-block-duration=10h
5. Vertical Scaling
Sometimes the simplest solution is to increase available memory:
# Run Prometheus with memory limits (using Docker)
docker run -d --name prometheus \
--memory="4g" --memory-swap="4g" \
-p 9090:9090 \
prom/prometheus:v2.37.0
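If you are running a Prometheus release built with Go 1.19 or later, you can additionally set the Go runtime's GOMEMLIMIT soft memory limit slightly below the hard container limit, which makes the garbage collector free memory more aggressively as the limit is approached instead of risking an OOM kill. A hedged example (the image tag and the 3750MiB value are illustrative):
# Give the Go runtime a soft limit just below the hard container limit
docker run -d --name prometheus \
  --memory="4g" --memory-swap="4g" \
  -e GOMEMLIMIT=3750MiB \
  -p 9090:9090 \
  prom/prometheus:latest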
6. Horizontal Scaling (Federation)
Split your Prometheus setup into multiple instances:
# In the aggregating Prometheus
scrape_configs:
  - job_name: 'prometheus-federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'
        - '{job="api"}'
    static_configs:
      - targets:
          - 'prometheus-app:9090'
          - 'prometheus-infra:9090'
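You can preview what a federated scrape would pull before pointing the aggregating instance at it (the hostname and matcher come from the example above; substitute your own):
# Show the first series the /federate endpoint would expose for the node job
curl -s -G "http://prometheus-app:9090/federate" \
  --data-urlencode 'match[]={job="node"}' | head -n 20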
Practical Example: Reducing Memory in High-Cardinality Environments
Let's look at a real-world scenario where a Prometheus instance is monitoring a Kubernetes cluster with high-cardinality metrics.
Problem:
- Prometheus using 8GB of memory and growing
- Many high-cardinality metrics from Kubernetes pod labels
- Query performance degrading
Solution:
Step 1: Identify high-cardinality metrics:
topk(10, count by (__name__)({__name__=~".+"}))
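Alternatively, the TSDB status endpoint reports the top contributors to series count and label cardinality directly (assuming the default local address; the jq filter just extracts the relevant fields):
# Top metric names and label names by series/value count, straight from the TSDB head
curl -s http://localhost:9090/api/v1/status/tsdb \
  | jq '.data.seriesCountByMetricName, .data.labelValueCountByLabelName'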
Step 2: Create a more selective scrape configuration:
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
    metric_relabel_configs:
      # Drop high-cardinality metrics
      - source_labels: [__name__]
        regex: 'container_tasks_state|kube_pod_container_status_last_terminated_reason'
        action: drop
      # Limit label values to reduce cardinality
      - source_labels: [pod]
        regex: '.+'
        target_label: pod
        replacement: 'redacted'
        action: replace
Step 3: Implement recording rules to pre-aggregate data:
# rules.yml
groups:
  - name: k8s-aggregation
    rules:
      - record: job:container_cpu_usage_seconds_total:sum
        expr: sum by (job) (rate(container_cpu_usage_seconds_total[5m]))
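Before loading the rule file, validate it with promtool and make sure it is referenced under rule_files in prometheus.yml; then reload the server (the reload endpoint requires --web.enable-lifecycle):
promtool check rules rules.yml
curl -X POST http://localhost:9090/-/reload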
Results:
After implementing these changes:
- Memory usage reduced from 8GB to 2GB
- Query performance improved by 65%
- No loss of critical monitoring coverage
Memory Management Tools
Here are some tools to help manage Prometheus memory:
1. Go pprof Heap Profiles
Prometheus exposes Go's built-in profiling endpoints, so the standard go tool pprof can show exactly where memory is being allocated:
# Grab a heap profile from the running Prometheus
curl -s http://localhost:9090/debug/pprof/heap > heap.prof
# Summarize the biggest allocators (requires the Go toolchain)
go tool pprof -top heap.prof
# Or explore the profile interactively in a browser
go tool pprof -http=:8081 heap.prof
2. Prometheus Memory Control Script
Create a monitoring script that restarts Prometheus if memory exceeds threshold:
#!/bin/bash
# prom_memory_control.sh
MAX_MEMORY_MB=4000
PROMETHEUS_URL="http://localhost:9090"
while true; do
  # Query resident memory of the Prometheus process (jq -r strips the quotes;
  # the job selector assumes the default self-scrape job name)
  MEM_USAGE=$(curl -s -G "$PROMETHEUS_URL/api/v1/query" \
    --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}' \
    | jq -r '.data.result[0].value[1]')
  MEM_USAGE_MB=$(echo "$MEM_USAGE / 1024 / 1024" | bc)
  echo "$(date): Prometheus memory usage: ${MEM_USAGE_MB}MB"
  if (( $(echo "$MEM_USAGE_MB > $MAX_MEMORY_MB" | bc -l) )); then
    echo "WARNING: Memory exceeds ${MAX_MEMORY_MB}MB threshold, restarting service"
    systemctl restart prometheus
  fi
  sleep 300
done
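A restart loop like this is a blunt instrument. If Prometheus runs under systemd, a cleaner alternative is to let systemd enforce the limit itself with its MemoryMax directive (a sketch assuming a unit named prometheus.service on a cgroup v2 system; pick a limit that fits your host):
# Create a drop-in that caps the service's memory usage
sudo mkdir -p /etc/systemd/system/prometheus.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/prometheus.service.d/memory.conf
[Service]
MemoryMax=4G
EOF
sudo systemctl daemon-reload
sudo systemctl restart prometheus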
Summary
Memory usage optimization is crucial for running Prometheus reliably in production environments. By implementing the strategies discussed in this guide, you can significantly reduce memory consumption while maintaining effective monitoring.
Key takeaways:
- Monitor Prometheus itself to understand memory usage patterns
- Focus on high-cardinality metrics, which are the biggest memory consumers
- Use relabeling to filter unnecessary metrics before storage
- Consider federation for horizontal scaling
- Regularly review and adjust retention policies based on actual needs
Additional Resources
- Official Documentation: The Prometheus storage documentation provides detailed information about TSDB memory usage.
- Cardinality Explorer: Try Cardinality Explorer to find high-cardinality metrics in your setup.
Exercises
- Use topk(10, count by (__name__)({__name__=~".+"})) to identify the metrics with the highest cardinality in your environment.
- Create a recording rule that pre-aggregates a high-cardinality metric in your system.
- Implement a metric relabeling configuration to drop at least three non-essential high-cardinality metrics.
- Set up a federated Prometheus architecture with one instance focused on application metrics and another on infrastructure metrics.