Prometheus Storage Architecture
Introduction
Prometheus is a powerful open-source monitoring and alerting system that collects and stores time series data as metrics. One of the core components that makes Prometheus efficient is its storage architecture. In this guide, we'll explore how Prometheus organizes, compresses, and manages metrics data on disk, providing insights into its performance characteristics and operational considerations.
The storage architecture is fundamental to understanding how Prometheus balances speed and efficiency when dealing with potentially millions of time series.
Core Storage Components
Prometheus's storage architecture consists of several key components:
Time Series Database (TSDB)
At its heart, Prometheus uses a custom-built Time Series Database (TSDB) optimized for the specific workloads of metrics collection. This TSDB is designed to:
- Handle high volumes of time series data with minimal overhead
- Provide fast query performance for common monitoring patterns
- Efficiently compress data to reduce disk usage
- Support Prometheus's retention and compaction requirements
The TSDB was completely rewritten in Prometheus 2.0, significantly improving performance and resource usage compared to the earlier versions.
Data Organization
Prometheus organizes data in time blocks, with each block containing data for a specific time range:
- Head Block: The most recent, in-memory, mutable block where new samples are written
- Persisted Blocks: Immutable blocks on disk containing older data
- WAL (Write-Ahead Log): Ensures data durability even in case of crashes
Let's explore each of these in detail.
The Head Block
The head block is where all incoming samples are initially stored. It's kept in memory for fast access and has several important characteristics:
- Accepts new data points (samples) for currently active time series
- Flushes completed chunks to disk as memory-mapped files (the chunks_head/ directory) to keep memory usage down; durability itself comes from the WAL
- Is periodically compacted and written to disk as an immutable persisted block
- Typically represents the most recent 2 hours of data (configurable)
Here's a simple view of the head block's role in the write path:

incoming samples → WAL (durability) → head block (in memory) → compaction → persisted blocks (on disk)
Memory Management
The head block's memory usage is primarily determined by:
- The number of active time series
- The number of labels per series
- The churn rate (creation of new series)
It's important to monitor Prometheus's memory usage, as a sudden increase in active series can lead to memory pressure.
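For example, the following queries (run in the expression browser of a Prometheus server that scrapes its own metrics; the names below are the standard TSDB metrics exposed by recent Prometheus versions) show the number of active series and the rate at which new series are being created:

# Number of series currently held in the head block
prometheus_tsdb_head_series

# Rate of new series creation (churn) over the last five minutes
rate(prometheus_tsdb_head_series_created_total[5m])

A sustained, high churn rate is often a sign of labels that change on every scrape, such as timestamps or request IDs embedded in label values.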
Persisted Blocks
Once the data in the head block grows old enough (by default, when the head spans roughly three hours), its oldest two hours are compacted and written to disk as a persisted block. These blocks have the following properties:
- Immutable (never change once written)
- Highly compressed to save disk space
- Organized by time ranges (e.g., 2-hour blocks)
- Include metadata, index, and chunk files
Each persisted block contains:
- Index: Fast lookup of series by labels
- Chunks: The actual compressed time series data
- Tombstones: Markers for deleted series
- Metadata: Information about the block
Let's look at how data is stored on disk:
data/
├── 01BKGV7JC0RY8A6MACW02A2PJD/
│   ├── chunks/
│   │   ├── 000001
│   │   ├── 000002
│   │   └── ...
│   ├── index
│   ├── meta.json
│   └── tombstones
├── 01BKGTZQ1SYQJTR4PB43C8PD98/
│   ├── chunks/
│   │   ├── 000001
│   │   ├── 000002
│   │   └── ...
│   ├── index
│   ├── meta.json
│   └── tombstones
├── chunks_head/
└── wal/
    ├── 000000001
    ├── 000000002
    └── ...
The directory name (e.g., 01BKGV7JC0RY8A6MACW02A2PJD) is a ULID: a unique, lexicographically time-sortable identifier for the block.
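To give a sense of what the metadata contains, a block's meta.json looks roughly like the following. The field names match the TSDB format, but the statistics shown here are made-up illustrative values:

{
  "ulid": "01BKGV7JC0RY8A6MACW02A2PJD",
  "minTime": 1628079300000,
  "maxTime": 1628086500000,
  "stats": {
    "numSamples": 115000000,
    "numSeries": 845429,
    "numChunks": 2391233
  },
  "compaction": {
    "level": 1,
    "sources": ["01BKGV7JC0RY8A6MACW02A2PJD"]
  },
  "version": 1
}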
The Write-Ahead Log (WAL)
The WAL is critical for data durability. It works as follows:
- Before new samples are added to the head block, they're first written to the WAL
- In case of a crash, Prometheus can recover the in-memory state by replaying the WAL
- WAL segments are deleted once their corresponding data has been successfully persisted to disk
This ensures that no data is lost due to unexpected shutdowns or crashes.
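To see how much space the WAL currently occupies, you can simply inspect the data directory (adjust the path to match your --storage.tsdb.path setting):

# Size of the write-ahead log on disk; segments are 128 MB each by default
du -sh /path/to/data/wal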
Compaction Process
To manage disk space and improve query performance, Prometheus regularly compacts blocks. This process:
- Merges smaller blocks into larger ones
- Applies retention policies (removing old data)
- Optimizes indexes and deduplicates data
- Handles tombstones (series marked for deletion)
The compaction process occurs in the background and is controlled by several configuration parameters.
Here's a simplified view of the compaction process, in which three adjacent 2-hour blocks are merged into one larger block:

[2h block] + [2h block] + [2h block]  →  [6h block]  →  merged further over time, up to a maximum block size
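Compaction activity can be observed through Prometheus's own metrics; for example, the following counters are exposed by recent versions and can be queried in the expression browser:

# Total number of compactions since the server started
prometheus_tsdb_compactions_total

# Compactions that failed (should normally be zero)
prometheus_tsdb_compactions_failed_total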
Data Compression
One of the key features of Prometheus's storage is its efficient compression algorithms:
- Delta-of-delta encoding: Stores timestamps as differences of differences, which are tiny for regularly spaced scrapes
- XOR compression: Efficiently encodes differences between sample values
- Variable bit-length encoding: Uses fewer bits for smaller numbers
- Dictionary compression: For label values and metric names
These techniques allow Prometheus to achieve very high compression ratios, often 10:1 or better compared to raw data.
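A raw sample is 16 bytes (an 8-byte timestamp plus an 8-byte float64 value), while well-behaved series often compress to one or two bytes per sample. You can estimate the ratio achieved on your own server with a query like the one below; it relies on the TSDB's compaction histograms, so it only reflects data that has already been compacted into blocks:

# Average bytes per sample in compacted chunks
prometheus_tsdb_compaction_chunk_size_bytes_sum / prometheus_tsdb_compaction_chunk_samples_sum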
Practical Example: Configuring Storage
Let's look at a practical example of configuring Prometheus storage. Note that the TSDB settings (data path, retention, WAL compression) are command-line flags passed when starting Prometheus, while remote write is configured in prometheus.yml:

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/path/to/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=30GB \
  --storage.tsdb.wal-compression

And the accompanying prometheus.yml:

global:
  scrape_interval: 15s

# Configuration for remote write
remote_write:
  - url: "http://remote-storage:8080/write"
    queue_config:
      capacity: 10000
      max_shards: 200

In this configuration:
- Data is stored in /path/to/data
- Retention is 15 days or 30GB, whichever limit is reached first
- WAL compression is enabled (it is already the default in recent Prometheus versions)
- Remote write sends samples to an external storage system
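After editing the configuration file, it's worth validating it before a restart; promtool ships alongside Prometheus (adjust the path to wherever your prometheus.yml lives):

promtool check config /etc/prometheus/prometheus.yml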
Performance Considerations
When running Prometheus in production, consider these storage-related factors:
Disk I/O
Prometheus is I/O intensive, especially during:
- Writing new blocks
- Compaction processes
- High query loads
Using SSDs instead of HDDs can significantly improve performance.
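Prometheus also reports how much disk its persisted blocks occupy, which is handy for capacity planning; the metric below is exposed by recent versions:

# Total size of all persisted blocks on disk, in bytes
prometheus_tsdb_storage_blocks_bytes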
Memory Usage
Memory requirements grow with:
- Number of active time series
- Cardinality of labels
- Query complexity
A rough rule of thumb is to budget a few kilobytes of RAM per active time series; the exact figure depends heavily on label sizes, churn, and query load.
Example Memory Calculation
Memory (bytes) ≈ active_series × ~3 KB + head block and query overhead
For 1 million series: 1,000,000 × 3 KB ≈ 3 GB, plus overhead
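To see what your own server actually uses per series, you can divide resident memory by the head series count. Both metrics are exposed by Prometheus about itself; the job label below assumes the self-scrape job is named prometheus, and the result is only an approximation because resident memory also includes scrape and query buffers:

# Approximate bytes of resident memory per active series
process_resident_memory_bytes{job="prometheus"} / prometheus_tsdb_head_series{job="prometheus"}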
Optimizing Storage Performance
Here are some tips to improve Prometheus storage performance:
- Reduce cardinality: Limit high-cardinality labels that create too many time series (see the query after this list for finding the worst offenders)
- Use appropriate retention: Only keep data as long as needed
- Monitor disk usage: Set alerts for approaching storage limits
- Consider remote storage: For long-term data retention
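A quick way to find which metrics contribute the most series, and are therefore the best candidates for cardinality reduction, is a query like the one below. It scans every series, so it can be expensive on very large servers; run it sparingly:

# Top 10 metric names by number of series
topk(10, count by (__name__)({__name__=~".+"}))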
Example: Checking Storage Status
You can check the status of your Prometheus storage using API endpoints or the expression browser.
Using the API:
curl -s http://localhost:9090/api/v1/status/tsdb | jq .
Example output:
{
  "status": "success",
  "data": {
    "headStats": {
      "numSeries": 845429,
      "numLabelPairs": 8312023,
      "chunkCount": 2391233,
      "minTime": 1628079300000,
      "maxTime": 1628086500000
    },
    "seriesCountByMetricName": [
      {
        "name": "up",
        "value": 942
      },
      {
        "name": "node_cpu_seconds_total",
        "value": 23520
      }
    ],
    "labelValueCountByLabelName": [
      {
        "name": "instance",
        "value": 132
      },
      {
        "name": "job",
        "value": 22
      }
    ]
  }
}
This output provides insights into:
- The number of active series in the head block
- Series counts by metric name
- Label cardinality information
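For blocks already persisted to disk (as opposed to the in-memory head), promtool can analyze a data directory and print cardinality statistics. It's safest to point it at a copy of the data directory or a stopped server rather than a live instance:

promtool tsdb analyze /path/to/data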
Advanced Topic: Local vs Remote Storage
While Prometheus's local storage is highly optimized, it has limitations:
- Not designed for long-term historical data
- Limited by single-node scalability
- No built-in clustering or replication; high availability means running duplicate, independently scraping instances
For these reasons, Prometheus supports integration with remote storage systems through:
- Remote Write: Sends samples to a compatible remote storage
- Remote Read: Queries data from remote storage when it is not available locally (a minimal example follows the list below)
Popular remote storage options include:
- Thanos
- Cortex
- VictoriaMetrics
- Grafana Mimir and other systems that implement the remote write protocol
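As a sketch, a minimal remote_read section in prometheus.yml looks like this; the URL is a placeholder for whatever read endpoint your remote storage exposes:

remote_read:
  - url: "http://remote-storage:8080/read"
    read_recent: false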
Summary
Prometheus's storage architecture is a carefully designed system that balances performance, efficiency, and reliability:
- Uses a custom TSDB optimized for time series metrics
- Organizes data in head blocks (in-memory) and persisted blocks (on disk)
- Implements WAL for crash recovery
- Uses sophisticated compression to minimize disk usage
- Provides configurable retention and compaction
- Offers remote storage options for long-term data and scalability
Understanding the storage architecture helps you:
- Better configure and optimize Prometheus for your environment
- Troubleshoot performance issues
- Make informed decisions about scaling and long-term storage needs
- Plan for resource requirements as your monitoring needs grow
Additional Resources
For more in-depth learning:
- Prometheus documentation on Storage
- The TSDB format documentation
- Blog post: Prometheus 2.0 Storage Layer
Exercises
- Configure a Prometheus instance with different retention settings and observe the impact on disk usage.
- Use Prometheus's API to examine the details of your storage blocks.
- Experiment with remote write to an external storage system.
- Create a dashboard that monitors Prometheus's own storage metrics.
- Simulate high cardinality and observe its impact on storage performance.