Thanos Overview

Introduction

Thanos is an open-source project that extends Prometheus capabilities by adding highly available, long-term storage to your monitoring setup. If you've been working with Prometheus and found yourself limited by single-node deployments or short data retention periods, Thanos provides an elegant solution to these challenges.

The name "Thanos" echoes the Greek "thanatos" ("death"), a nod to the mortality (limitations) of standard Prometheus deployments. Just as Prometheus, in Greek mythology, gave fire to humanity, Thanos gives longevity and resilience to Prometheus monitoring systems.

What Problems Does Thanos Solve?

Before diving into how Thanos works, let's understand the key limitations of standard Prometheus that Thanos addresses:

Limited Storage Capacity: Prometheus stores data on local disk only, which makes long-term retention expensive and fragile.
Lack of High Availability: Prometheus runs as a single node with no built-in clustering or replication.
No Global Query View: Each Prometheus server answers queries only for its own data, so querying across multiple instances is hard.
Resource Limitations: A single Prometheus server can ingest and query only as many series as one machine's CPU, memory, and disk allow.

Thanos Architecture

Thanos extends Prometheus by adding several components that work together:

Core Components

Sidecar: Runs alongside Prometheus, serves its recent data to the Querier, and uploads completed TSDB blocks to object storage.
Store Gateway: Serves historical metrics from object storage.
Querier: Implements the Prometheus query API and aggregates results from multiple sources.
Compactor: Compacts and downsamples data in object storage and applies retention policies to it.
Ruler: Evaluates recording and alerting rules against data in Thanos.
Receiver: Ingests data in Prometheus remote-write format, serves it to the Querier, and uploads it to object storage.

Setting Up Thanos with Prometheus

Let's walk through a basic setup to understand how Thanos works with Prometheus.

Prerequisites

Running Prometheus instance(s)
Object storage (like AWS S3, Google Cloud Storage, MinIO)
Basic knowledge of Kubernetes (for production deployments)

Step 1: Deploy Prometheus with Thanos Sidecar

First, let's configure Prometheus to work with the Thanos sidecar:

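# prometheus.yaml
global:
  scrape_interval: 15s
  external_labels:
    region: us-east-1
    replica: "1"  # Must differ on each replica of an HA pair

TSDB storage settings are not part of the Prometheus configuration file; they are passed as command-line flags. Setting the minimum and maximum block duration to the same value disables local compaction, so the sidecar can upload each 2h block as soon as it is complete.

Now, let's run Prometheus with the Thanos sidecar:

# Start Prometheus with short local retention and fixed 2h blocks
prometheus \
  --config.file=prometheus.yaml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=2d \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=2h

# Start Thanos sidecar (it reads blocks from the same TSDB directory)
thanos sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=bucket.yaml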
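
Thanos tells Prometheus instances apart by their external labels, so the two members of an HA pair should share every external label except the one that names the replica. The check below is a minimal sketch of that rule, not part of Thanos; the function name and the default label name `replica` are assumptions taken from the example configuration above.

```python
def differs_only_by_replica(labels_a, labels_b, replica_label="replica"):
    """Return True if two external-label sets describe an HA pair.

    An HA pair shares every external label except the replica label,
    which must be present in both sets and hold different values.
    """
    if replica_label not in labels_a or replica_label not in labels_b:
        return False
    if labels_a[replica_label] == labels_b[replica_label]:
        return False
    rest_a = {k: v for k, v in labels_a.items() if k != replica_label}
    rest_b = {k: v for k, v in labels_b.items() if k != replica_label}
    return rest_a == rest_b


if __name__ == "__main__":
    prom_1 = {"region": "us-east-1", "replica": "1"}
    prom_2 = {"region": "us-east-1", "replica": "2"}
    print(differs_only_by_replica(prom_1, prom_2))
```

Running a check like this against the `external_labels` of each replica's configuration before deployment can catch a copy-pasted replica value early.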

Step 2: Configure Object Storage

Create a configuration file for your object storage:

# bucket.yaml
type: S3
config:
  bucket: "thanos"
  endpoint: "s3.amazonaws.com"
  access_key: "ACCESS_KEY"
  secret_key: "SECRET_KEY"
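
To keep credentials out of version control, the same file can be generated at deploy time. The sketch below renders the S3 configuration shown above from function arguments; the function name and the environment variable names are hypothetical, and the `insecure` field (useful for a plain-HTTP endpoint such as a local MinIO) is an addition not present in the original file.

```python
import json
import os


def render_bucket_config(bucket, endpoint, access_key, secret_key, insecure=False):
    """Render an S3 object storage config as YAML text.

    json.dumps is used only to quote the string values; for typical ASCII
    values a JSON string is also a valid double-quoted YAML scalar.
    """
    lines = [
        "type: S3",
        "config:",
        f"  bucket: {json.dumps(bucket)}",
        f"  endpoint: {json.dumps(endpoint)}",
        f"  access_key: {json.dumps(access_key)}",
        f"  secret_key: {json.dumps(secret_key)}",
        f"  insecure: {'true' if insecure else 'false'}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # THANOS_S3_ACCESS_KEY and THANOS_S3_SECRET_KEY are hypothetical names.
    print(render_bucket_config(
        bucket="thanos",
        endpoint="s3.amazonaws.com",
        access_key=os.environ.get("THANOS_S3_ACCESS_KEY", "ACCESS_KEY"),
        secret_key=os.environ.get("THANOS_S3_SECRET_KEY", "SECRET_KEY"),
    ))
```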

Step 3: Set Up Thanos Querier

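The querier component lets you query data across all your Prometheus servers and the store gateway. The --query.replica-label flag names the external label that distinguishes HA replicas, so the querier can deduplicate their series. Recent Thanos releases favor --endpoint over --store; the flags below match the v0.24.0 image used later in this guide.

thanos query \
  --query.replica-label=replica \
  --store=sidecar-1:10901 \
  --store=sidecar-2:10901 \
  --store=store-gateway:10901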
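
Each store advertises the time range it can serve, and the querier only needs to contact stores whose range overlaps the query window. The following is a simplified sketch of that selection, not the querier's actual implementation; the `Store` type, the function name, and the time ranges are hypothetical.

```python
from dataclasses import dataclass

DAY = 86_400  # seconds


@dataclass
class Store:
    name: str      # address given to the querier
    min_time: int  # oldest sample the store can serve (Unix seconds)
    max_time: int  # newest sample the store can serve (Unix seconds)


def stores_for_range(stores, start, end):
    """Return the names of stores whose data overlaps [start, end]."""
    return [s.name for s in stores if s.min_time <= end and s.max_time >= start]


# Hypothetical ranges: sidecars hold the last 2 days, the gateway holds
# everything already uploaded (here, up to 2 hours ago).
NOW = 100 * DAY
STORES = [
    Store("sidecar-1:10901", NOW - 2 * DAY, NOW),
    Store("sidecar-2:10901", NOW - 2 * DAY, NOW),
    Store("store-gateway:10901", 0, NOW - 2 * 3600),
]

if __name__ == "__main__":
    # A query over the last hour only needs the sidecars.
    print(stores_for_range(STORES, NOW - 3600, NOW))
    # A query over one day last month only needs the store gateway.
    print(stores_for_range(STORES, NOW - 30 * DAY, NOW - 29 * DAY))
```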

Step 4: Deploy Store Gateway

The store gateway gives access to historical data stored in your object storage:

thanos store \
  --objstore.config-file=bucket.yaml

Step 5: Set Up Compactor

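The compactor compacts and downsamples blocks in your object storage and enforces retention. Run exactly one compactor against a given bucket, and add --wait to keep it running instead of exiting after a single pass:

thanos compact \
  --wait \
  --data-dir=/compact \
  --objstore.config-file=bucket.yaml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=1y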
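
Downsampling replaces many raw samples with a few aggregates per time window, which is why long-range queries over 5m or 1h data stay fast. The sketch below illustrates the idea with count, sum, min, and max per window; it is a simplification under that assumption, not the compactor's actual code, and the function name is hypothetical.

```python
def downsample(samples, window):
    """Aggregate (timestamp, value) samples into fixed-size windows.

    Returns a dict keyed by window start time; each entry holds the
    count, sum, min, and max of the samples that fell into that window.
    """
    buckets = {}
    for ts, value in samples:
        start = ts - (ts % window)
        agg = buckets.get(start)
        if agg is None:
            buckets[start] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            agg["count"] += 1
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    return buckets


if __name__ == "__main__":
    # Ten samples, one per minute, with values 1..10.
    raw = [(i * 60, i + 1) for i in range(10)]
    # Downsample to 5-minute (300 s) windows.
    for start, agg in sorted(downsample(raw, 300).items()):
        print(start, agg)
```

Keeping several aggregates per window, rather than a single average, lets functions such as min, max, and sum still return meaningful results on downsampled data.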

Practical Example: High Availability Prometheus Setup

Let's see how Thanos enables a highly available Prometheus deployment:

# docker-compose.yml
version: '3.7'

services:
  prometheus-1:
    image: prom/prometheus:v2.32.0
    volumes:
      - ./prometheus-1.yaml:/etc/prometheus/prometheus.yml
      - prometheus-1-data:/prometheus  # Shared with thanos-sidecar-1
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.min-block-duration=2h'
      - '--storage.tsdb.max-block-duration=2h'
      - '--web.enable-lifecycle'
    ports:
      - "9091:9090"

  prometheus-2:
    image: prom/prometheus:v2.32.0
    volumes:
      - ./prometheus-2.yaml:/etc/prometheus/prometheus.yml
      - prometheus-2-data:/prometheus  # Shared with thanos-sidecar-2
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.min-block-duration=2h'
      - '--storage.tsdb.max-block-duration=2h'
      - '--web.enable-lifecycle'
    ports:
      - "9092:9090"

  thanos-sidecar-1:
    image: quay.io/thanos/thanos:v0.24.0
    command:
      - 'sidecar'
      - '--tsdb.path=/prometheus'
      - '--prometheus.url=http://prometheus-1:9090'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
      - '--objstore.config-file=/etc/thanos/bucket.yaml'
    volumes:
      - ./bucket.yaml:/etc/thanos/bucket.yaml
      - prometheus-1-data:/prometheus  # The sidecar must see the TSDB blocks
    depends_on:
      - prometheus-1

  thanos-sidecar-2:
    image: quay.io/thanos/thanos:v0.24.0
    command:
      - 'sidecar'
      - '--tsdb.path=/prometheus'
      - '--prometheus.url=http://prometheus-2:9090'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
      - '--objstore.config-file=/etc/thanos/bucket.yaml'
    volumes:
      - ./bucket.yaml:/etc/thanos/bucket.yaml
      - prometheus-2-data:/prometheus  # The sidecar must see the TSDB blocks
    depends_on:
      - prometheus-2

  thanos-querier:
    image: quay.io/thanos/thanos:v0.24.0
    command:
      - 'query'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
      - '--query.replica-label=replica'  # Deduplicate the HA pair
      - '--store=thanos-sidecar-1:10901'
      - '--store=thanos-sidecar-2:10901'
      - '--store=thanos-store-gateway:10901'
    ports:
      - "10902:10902"
    depends_on:
      - thanos-sidecar-1
      - thanos-sidecar-2
      - thanos-store-gateway

  thanos-store-gateway:
    image: quay.io/thanos/thanos:v0.24.0
    command:
      - 'store'
      - '--data-dir=/data'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
      - '--objstore.config-file=/etc/thanos/bucket.yaml'
    volumes:
      - ./bucket.yaml:/etc/thanos/bucket.yaml

volumes:
  prometheus-1-data:
  prometheus-2-data:

With this setup, you have:

Two independent Prometheus servers monitoring the same targets
Thanos sidecars uploading metrics to object storage
A global query view through the Thanos querier
Long-term storage via the object storage
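
Because both Prometheus servers scrape the same targets, the querier sees every series twice, once per replica. Thanos Query can collapse these when it is told which external label marks replicas (the --query.replica-label flag). The sketch below shows the idea only: it assumes both replicas produce samples at identical timestamps, which real replicas rarely do, and it is not the algorithm Thanos uses. The function name and data layout are hypothetical.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that differ only in the replica label.

    Each series is a dict: {"labels": {...}, "samples": [(ts, value), ...]}.
    When several replicas have a sample for the same timestamp, the first
    one seen wins; gaps in one replica are filled from the others.
    """
    merged = {}
    for series in series_list:
        # Identity of a series is its label set without the replica label.
        key = tuple(sorted(
            (k, v) for k, v in series["labels"].items() if k != replica_label
        ))
        by_ts = merged.setdefault(key, {})
        for ts, value in series["samples"]:
            by_ts.setdefault(ts, value)
    return [
        {"labels": dict(key), "samples": sorted(by_ts.items())}
        for key, by_ts in merged.items()
    ]


if __name__ == "__main__":
    replica_1 = {
        "labels": {"job": "api", "replica": "1"},
        "samples": [(0, 1), (15, 2), (45, 4)],  # missed the scrape at t=30
    }
    replica_2 = {
        "labels": {"job": "api", "replica": "2"},
        "samples": [(0, 1), (15, 2), (30, 3), (45, 4)],
    }
    print(deduplicate([replica_1, replica_2]))
```

The merged series has no gap at t=30, which is the practical benefit of running two replicas: an outage or missed scrape on one is covered by the other.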

Querying Data in Thanos

Thanos implements the same query API as Prometheus, so you can use PromQL just as you would with Prometheus:

# Total number of HTTP requests across all instances
sum(http_requests_total)

# Peak CPU utilization per instance over the last year, sampled at 1h steps
max_over_time((1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))[1y:1h])

The key difference is that Thanos can query across:

Multiple Prometheus instances (horizontal scaling)
Extended time ranges (historical data in object storage)
Deduplicated data (from redundant Prometheus servers)
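
Since the querier exposes the Prometheus HTTP API, any Prometheus client can talk to it. The sketch below builds an instant-query URL and adds the `dedup` and `partial_response` parameters that Thanos accepts on top of the standard ones. The base URL assumes the querier's HTTP port from the Docker Compose example (10902); the function names are hypothetical, and `run_query` needs a reachable querier to return anything.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql, dedup=True, partial_response=False):
    """Build an instant-query URL for the Thanos Query HTTP API."""
    params = {
        "query": promql,
        "dedup": "true" if dedup else "false",
        "partial_response": "true" if partial_response else "false",
    }
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def run_query(base_url, promql, timeout=10):
    """Send the query and return the decoded JSON response body."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the URL; call run_query() against a live querier.
    print(build_query_url("http://localhost:10902", "sum(http_requests_total)"))
```

Setting `dedup=False` is handy when debugging an HA pair, because it returns the series from each replica separately.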

Production Considerations

When deploying Thanos in production, consider the following:

Resource Requirements: Different components have varying resource needs:
- Querier: High memory for concurrent queries
- Store Gateway: Memory based on number of blocks
- Compactor: CPU intensive during compaction
Network Traffic: Object storage can generate significant egress costs:
- Use caching effectively
- Consider placement of components to minimize cross-region traffic
Security:
- Secure access to object storage
- Implement TLS for all Thanos component communications
- Use authentication for Thanos API endpoints
Data Retention:
- Configure appropriate retention policies
- Balance storage costs against data availability needs
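
A back-of-the-envelope estimate helps when weighing retention against storage cost. The sketch below multiplies ingestion rate, retention, and bytes per sample; the 2 bytes per sample default is an assumed rough figure for compressed raw data, and the result ignores downsampled blocks and index overhead. The function name and the example numbers are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def estimate_raw_storage_bytes(samples_per_second, retention_days, bytes_per_sample=2.0):
    """Estimate object storage used by raw-resolution data.

    bytes_per_sample is an assumed average after compression; measure
    your own blocks before relying on the result.
    """
    return samples_per_second * retention_days * SECONDS_PER_DAY * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical workload: 100,000 samples/s kept at raw resolution for 30 days.
    total = estimate_raw_storage_bytes(100_000, 30)
    print(f"{total / 1e9:.1f} GB")
```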

Alternatives and Comparison

How does Thanos compare to other solutions?

Feature	Thanos	Cortex	M3DB	VictoriaMetrics
Architecture	Sidecars + Components	Microservices	Database	Single Binary/Cluster
Storage	Object Storage	Various	Custom	Various
HA	Yes	Yes	Yes	Yes
Complexity	Medium	High	High	Low
Prometheus Compatibility	Native	Native	Partial	High

Summary

Thanos extends Prometheus with:

Long-term metrics storage using object storage
High availability through redundant Prometheus instances
Global query view across multiple Prometheus servers
Downsampling for efficient long-term data storage

These capabilities allow you to build a robust, scalable monitoring system that can grow with your infrastructure while maintaining the familiar Prometheus experience.

Additional Resources

Exercises

Set up a basic Thanos deployment with two Prometheus instances and query data across both.
Configure Thanos to store metrics in MinIO (a self-hosted S3-compatible object store).
Write a recording rule in Thanos Ruler and verify it's working.
Create a Grafana dashboard that visualizes data from your Thanos setup.
Simulate a Prometheus outage and verify that historical data remains available through Thanos.
