Storage Issues in Prometheus
Introduction
Prometheus stores its time-series data on disk in a custom format called the Time Series Database (TSDB). While this storage engine is optimized for metrics data and performs well in most situations, you may encounter various storage-related issues that can affect your Prometheus deployment. This guide will help you understand, identify, and resolve common storage problems in Prometheus.
Understanding Prometheus Storage
Before diving into troubleshooting, it's important to understand how Prometheus stores data:
- Prometheus uses a local time-series database (TSDB) to store all its metrics
- Data is stored in blocks covering two-hour windows by default
- Each block contains all time series for that time window
- Older blocks get compacted into larger blocks over time
- A write-ahead log (WAL) protects against crashes
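This structure is visible directly in the data directory. The listing below is illustrative (block names are random ULIDs and the exact set of files varies by version), but the overall layout is what you should expect to see:

data/
  01BKGV7JBM69T2G1BGBGM6KB12/   # one directory per block
    chunks/                     # compressed sample data
    index                       # index for looking up series by label
    meta.json                   # block metadata: time range, compaction level
    tombstones                  # markers for deleted series
  chunks_head/                  # memory-mapped chunks of the in-memory head block
  wal/                          # write-ahead log segments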
Common Storage Issues
Disk Space Exhaustion
One of the most frequent issues is running out of disk space. Prometheus continuously collects metrics, and without proper retention settings, your disk can fill up quickly.
Symptoms
- Prometheus service crashes or fails to start
- Log entries indicating "no space left on device"
- Increasing latency in queries as the disk gets fuller
Troubleshooting Steps
- Check available disk space:
df -h /path/to/prometheus/data
- Review the current storage usage by Prometheus:
du -sh /path/to/prometheus/data/*
- Examine Prometheus storage metrics:
prometheus_tsdb_storage_blocks_bytes
prometheus_tsdb_wal_segment_current
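These metrics can also be turned into rough capacity estimates. The queries below are a sketch; the bytes-per-sample figure is an average derived from compaction statistics and will vary with your data:

# Total on-disk size of persisted blocks, in GiB
prometheus_tsdb_storage_blocks_bytes / 1024^3
# Average bytes per sample, estimated from compaction statistics
rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1h])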
Solutions
- Configure an appropriate retention period. Retention is controlled by command-line flags, not by settings in prometheus.yml:
# In the Prometheus startup command or systemd unit
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=15d
- Set up disk space alerts before the situation becomes critical (a trend-based variant using predict_linear is sketched at the end of this list):
- alert: PrometheusStorageAlmostFull
  expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus storage almost full"
    description: "The filesystem holding Prometheus data is {{ $value | humanize }}% full."
- Consider implementing external storage solutions:
# prometheus.yml
remote_write:
- url: "https://remote-storage-endpoint/api/v1/write"
remote_read:
- url: "https://remote-storage-endpoint/api/v1/read"
Data Corruption
Data corruption can occur due to unexpected shutdowns, hardware failures, or filesystem issues.
Symptoms
- Error messages about blocks being corrupted
- Missing metrics or incomplete data
- Prometheus failing to start with TSDB errors
Troubleshooting
- Check Prometheus logs for corruption-related errors:
grep -i "corrupt\|invalid\|error" /path/to/prometheus/log
- Use promtool to inspect the database (promtool tsdb analyze reports block health, churn, and cardinality):
promtool tsdb analyze /path/to/prometheus/data
Solutions
- If corruption is detected, you may need to delete the corrupted blocks:
# First, stop Prometheus
systemctl stop prometheus
# Then, remove the corrupted block (example block ID)
rm -rf /path/to/prometheus/data/01FCXYZ123456789
# Restart Prometheus
systemctl start prometheus
- Implement regular backups of your Prometheus data:
#!/bin/bash
# Example backup script: stop Prometheus, archive the data directory, restart
BACKUP_DIR="/backup/prometheus"
DATA_DIR="/path/to/prometheus/data"
systemctl stop prometheus
tar -czf "$BACKUP_DIR/prometheus-data-$(date +%Y%m%d).tar.gz" "$DATA_DIR"
systemctl start prometheus
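If stopping Prometheus for every backup is not acceptable, the TSDB snapshot endpoint can produce a consistent copy while the server keeps running. This requires starting Prometheus with --web.enable-admin-api; the host, port, and paths below are illustrative:

# Trigger a snapshot; the response contains the snapshot name
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Snapshots are written under <data-dir>/snapshots/<name> and can be archived from there
tar -czf /backup/prometheus/snapshot-$(date +%Y%m%d).tar.gz /path/to/prometheus/data/snapshots/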
Storage Performance Issues
As your metrics volume grows, you might notice performance degradation.
Symptoms
- Slow query responses
- High CPU/disk I/O during compaction
- Increasing WAL replay time during restarts
Troubleshooting
- Monitor disk I/O performance:
iostat -xd 5
- Check Prometheus performance metrics:
prometheus_tsdb_head_active_appenders
rate(prometheus_tsdb_compactions_total[5m])
prometheus_tsdb_head_chunks
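Slow queries are often a cardinality problem rather than a disk problem, so it is also worth checking how many series the head block is carrying. The second query below scans every series and is expensive; run it sparingly or on a test instance:

# Number of active series in the head block
prometheus_tsdb_head_series
# Top 10 metric names by series count (expensive)
topk(10, count by (__name__)({__name__=~".+"}))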
Solutions
- Use faster storage (SSDs instead of HDDs):
# In systemd unit file
[Service]
ExecStart=/usr/local/bin/prometheus --storage.tsdb.path=/fast/ssd/path
- Adjust retention and, if necessary, block durations. These are command-line flags rather than prometheus.yml settings:
# In the Prometheus startup command or systemd unit
prometheus \
  --storage.tsdb.path=/path/to/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=24h
- Shard your Prometheus deployment so that each instance scrapes only a subset of targets (a hashmod-based sketch follows).
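A common way to shard is hashmod-based relabeling, where each Prometheus instance keeps only the targets whose address hashes to its shard number. The sketch below assumes two shards and is shown for shard 0; the job name and service-discovery settings are placeholders:

# prometheus.yml on shard 0 of 2
scrape_configs:
  - job_name: 'node'
    file_sd_configs:
      - files: ['targets/*.json']
    relabel_configs:
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: '0'          # shard 1 would use '1'
        action: keep

Each shard then stores and compacts only its own subset of series, which reduces per-instance disk and memory pressure.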
Out of Memory During Compaction
Compaction processes consolidate blocks of time-series data, which can be memory-intensive.
Symptoms
- Out of memory errors during compaction
- Prometheus crashing with memory-related errors
- High memory usage spikes
Troubleshooting
- Monitor memory usage during compactions:
watch -n 1 "free -m"
- Check the relevant Prometheus metrics:
process_resident_memory_bytes{job="prometheus"}
prometheus_tsdb_compactions_total
prometheus_tsdb_compaction_duration_seconds
Solutions
- Increase available memory or limit Prometheus memory usage (a GOMEMLIMIT variant is sketched after this list):
# In systemd unit file
[Service]
ExecStart=/usr/local/bin/prometheus
MemoryMax=8G
- Use --storage.tsdb.max-block-duration to control block sizes:
prometheus --storage.tsdb.path=/path/to/data --storage.tsdb.max-block-duration=2h
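In addition to a hard systemd cap, you can give the Go runtime a soft memory target so it collects garbage more aggressively before the cgroup limit is hit. GOMEMLIMIT is a Go runtime mechanism rather than a Prometheus flag; the values below are illustrative and should sit comfortably below MemoryMax:

# In the systemd unit file: GOMEMLIMIT is the Go runtime's soft target,
# MemoryMax is the hard cgroup cap
[Service]
Environment=GOMEMLIMIT=6GiB
MemoryMax=8G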
WAL Corruption
The Write-Ahead Log (WAL) is critical for Prometheus data integrity, but it can become corrupted.
Symptoms
- Errors mentioning WAL corruption or truncation
- Prometheus failing to start after a crash
- Messages about invalid segment or checkpoint markers
Troubleshooting
- Examine WAL-specific errors in logs:
grep -i "wal" /path/to/prometheus/log
- Check the WAL directory structure:
ls -la /path/to/prometheus/data/wal
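A healthy WAL directory contains sequentially numbered segment files plus occasional checkpoint directories; the exact names below are illustrative:

# Example of a healthy WAL layout
checkpoint.00000121/  00000122  00000123  00000124

Error messages about WAL corruption usually reference a segment number, which you can match against this listing.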
Solutions
- In cases of WAL corruption, you might need to remove the WAL directory:
# Warning: This will lose recent data that hasn't been compacted yet
systemctl stop prometheus
rm -rf /path/to/prometheus/data/wal
systemctl start prometheus
- Tune WAL segment size and compression via command-line flags (not prometheus.yml):
prometheus \
  --storage.tsdb.wal-segment-size=128MB \
  --storage.tsdb.wal-compression
Best Practices for Prometheus Storage
1. Calculate Your Storage Requirements
Before deploying Prometheus, estimate your storage needs:
storage_size = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
For example, with:
- 10,000 time series
- 1 sample every 15s per time series
- 2 bytes per sample (approximate)
- 15-day retention
(15 days * 86,400 s/day) * (10,000 / 15 samples/s) * 2 bytes/sample ≈ 1.73 GB
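Rather than guessing the ingestion rate, you can read it from a running Prometheus and plug the measured values into the formula above. These queries use standard TSDB metrics:

# Samples ingested per second, averaged over 5 minutes
rate(prometheus_tsdb_head_samples_appended_total[5m])
# Active series currently held in memory
prometheus_tsdb_head_series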
2. Implement Proper Retention Policies
Balance retention with storage constraints. Retention is controlled by command-line flags, not prometheus.yml; when both are set, whichever limit is reached first applies:
# Time-based retention
prometheus --storage.tsdb.retention.time=15d
# OR size-based retention
prometheus --storage.tsdb.retention.size=500GB
3. Use Efficient Labels
Excessive label cardinality can bloat your storage:
# Avoid high cardinality labels like this
- job_name: 'api'
static_configs:
- targets: ['api.example.com:8080']
labels:
user_id: "12345" # Bad: high cardinality
request_id: "abc123" # Bad: high cardinality
# Instead, use lower cardinality labels
- job_name: 'api'
static_configs:
- targets: ['api.example.com:8080']
labels:
service: "api"
environment: "production"
4. Consider External Storage
For long-term storage, consider remote write/read:
# prometheus.yml
remote_write:
- url: "https://thanos-receive.example.com/api/v1/receive"
remote_read:
- url: "https://thanos-store.example.com/api/v1/read"
Troubleshooting Tools
Promtool
promtool provides utilities for working with Prometheus data:
# Analyze a TSDB for block health, churn, and label cardinality
promtool tsdb analyze /path/to/prometheus/data
# List the blocks in a TSDB
promtool tsdb list /path/to/prometheus/data
# Dump samples from a TSDB
promtool tsdb dump /path/to/prometheus/data
prometheus-storage-migrator
Third-party tools such as prometheus-storage-migrator can copy data between storage layouts; check the tool's own documentation for the exact flags before relying on the example below:
# Example migrating data
./prometheus-storage-migrator \
--input.storage-path=/old/prometheus/data \
--output.storage-path=/new/prometheus/data
Summary
Storage issues in Prometheus typically fall into a few categories:
- Disk space exhaustion
- Data corruption
- Performance degradation
- Memory issues during compaction
- WAL problems
By understanding the underlying storage architecture and implementing best practices, you can maintain a healthy Prometheus deployment with minimal storage-related issues.
Exercises
- Calculate the storage requirements for your specific environment based on the formula provided.
- Set up alerts for detecting storage issues before they become critical.
- Practice simulating and recovering from a corrupted block using a test Prometheus instance.
- Experiment with different retention settings and measure their impact on query performance.
- Implement a backup strategy for your Prometheus data directory.