High Availability Setup
Introduction
In production environments, monitoring systems like Prometheus are critical infrastructure components that should never fail. A monitoring outage can leave your team blind to application or infrastructure issues. High Availability (HA) in Prometheus refers to configuring your monitoring system to eliminate single points of failure, ensuring continuous operation even when individual components fail.
This guide explains how to implement high availability for Prometheus, covering redundant server deployments, data replication strategies, and load balancing approaches. Whether you're managing a small cluster or a large-scale infrastructure, these principles will help you build a resilient monitoring system.
Understanding Prometheus High Availability Challenges
Before diving into implementation, it's important to understand the specific challenges of making Prometheus highly available:
- Stateful Nature: Prometheus stores time-series data locally by default
- Single-Node Architecture: Prometheus was designed as a single-node application
- Lack of Native Clustering: Unlike some databases, Prometheus doesn't have built-in clustering
- Consistency Requirements: Alerting requires consistent data views to avoid duplicate or missed alerts
High Availability Patterns for Prometheus
Pattern 1: Simple Redundancy (Active-Active)
The simplest approach to Prometheus HA is running identical, independent Prometheus servers that scrape the same targets.
Implementation Steps:
- Deploy multiple Prometheus instances with identical scrape configurations:
```yaml
# prometheus-1.yml
global:
  scrape_interval: 15s
# Load the shared alerting rules (defined in the next step)
rule_files:
  - /etc/prometheus/alert-rules.yml
scrape_configs:
  - job_name: 'web-servers'
    static_configs:
      - targets: ['web1:9090', 'web2:9090', 'web3:9090']
```
- Configure identical alerting rules on each Prometheus server:
```yaml
# alert-rules.yml
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
```
- Set up Alertmanager deduplication to handle duplicate alerts from multiple Prometheus servers:
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'instance', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'team-email'
receivers:
  - name: 'team-email'
    email_configs:
      - to: '[email protected]'
```
Pattern 2: Advanced Redundancy with Thanos
For larger deployments, Thanos extends Prometheus to provide:
- Long-term storage via object storage (S3, GCS, etc.)
- Global query view across Prometheus instances
- Centralized alerting and rule evaluation
Implementation Steps:
- Deploy Prometheus with Thanos sidecar:
```yaml
# prometheus.yml
global:
  external_labels:
    region: us-east-1
    replica: replica-1   # give each replica a different value so Thanos can deduplicate
```

The local TSDB path and retention are set with command-line flags rather than in prometheus.yml: start each server with `--storage.tsdb.path=/prometheus` and `--storage.tsdb.retention.time=2d` (short local retention; long-term data lives in object storage).
- Run the Thanos sidecar alongside Prometheus:
```bash
thanos sidecar \
  --tsdb.path /prometheus \
  --prometheus.url http://localhost:9090 \
  --objstore.config-file bucket_config.yaml
```
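If the sidecar is also uploading blocks to object storage, Thanos recommends pinning Prometheus's block durations so local compaction does not interfere with uploads (these are Prometheus flags; verify them against your Prometheus version):

```bash
--storage.tsdb.min-block-duration=2h
--storage.tsdb.max-block-duration=2h
```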
- Configure the object storage for Thanos:
```yaml
# bucket_config.yaml
type: S3
config:
  bucket: "thanos-metrics"
  endpoint: "s3.amazonaws.com"
  access_key: "${ACCESS_KEY}"
  secret_key: "${SECRET_KEY}"
```
- Deploy Thanos Querier for a unified, deduplicated query view across both replicas (the replica label matches the external label set above):

```bash
thanos query \
  --query.replica-label replica \
  --store 10.0.0.1:10901 \
  --store 10.0.0.2:10901 \
  --store 10.0.0.3:10901
```
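Once the sidecars have uploaded blocks, historical data in the bucket is served by a Thanos Store Gateway, which the Querier can use as an additional store endpoint. A minimal sketch, reusing bucket_config.yaml from above (the /var/thanos/store cache directory is an arbitrary choice):

```bash
# Serves blocks from object storage over the Store API (gRPC on :10901).
thanos store \
  --data-dir /var/thanos/store \
  --objstore.config-file bucket_config.yaml \
  --grpc-address 0.0.0.0:10901
```

Add its address as another `--store` flag on the Querier to make historical data queryable alongside the live sidecars.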
Pattern 3: Cortex/Mimir for Multi-Tenant HA Prometheus
For the most advanced setups, especially when dealing with multi-tenancy requirements, Cortex (or its successor Grafana Mimir) provides a horizontally scalable, highly available Prometheus-compatible monitoring system:
Implementation Steps:
- Configure Prometheus as an agent for remote-write:
```yaml
# prometheus-agent.yml
global:
  scrape_interval: 15s
  external_labels:
    cluster: production
    __replica__: replica-1
remote_write:
  - url: http://cortex:9009/api/v1/push
    basic_auth:
      username: "user"
      password: "password"
scrape_configs:
  - job_name: 'web-servers'
    static_configs:
      - targets: ['web1:9090', 'web2:9090']
```
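Recent Prometheus releases (v2.32+) can run this configuration in a dedicated agent mode, which keeps only the WAL needed for remote_write and disables local querying, rule evaluation, and long-term storage on the scraping node:

```bash
prometheus \
  --enable-feature=agent \
  --config.file=prometheus-agent.yml
```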
- Deploy Cortex/Mimir components:
```yaml
# cortex-config.yaml (option names vary between Cortex releases and Grafana Mimir;
# check the documentation for the version you deploy)
distributor:
  shard_by_all_labels: true
  pool:
    health_check_ingesters: true

ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul
        prefix: collectors/
      replication_factor: 3

storage:
  engine: blocks

blocks_storage:
  backend: s3
  s3:
    bucket_name: cortex-blocks
    endpoint: s3.amazonaws.com
```
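Because both Prometheus replicas remote-write the same samples, the distributor has to deduplicate them using the `cluster` and `__replica__` labels set above. A hedged sketch of the relevant settings (option names follow Cortex; Grafana Mimir names differ slightly, so verify against your version):

```yaml
limits:
  accept_ha_samples: true        # enable HA deduplication for tenants
  ha_cluster_label: cluster      # defaults shown explicitly for clarity
  ha_replica_label: __replica__

distributor:
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      store: consul              # the tracker needs a KV store to elect the active replica
```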
Alertmanager High Availability
Prometheus is only part of the monitoring stack - Alertmanager also needs to be highly available:
- Run each Alertmanager in cluster mode. Clustering is configured with command-line flags rather than in alertmanager.yml; on each node, point `--cluster.peer` at the other instances (shown here for alertmanager-1):

```bash
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-2:9094 \
  --cluster.peer=alertmanager-3:9094
```

The alertmanager.yml itself stays identical on every node:

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'team-email'
receivers:
  - name: 'team-email'
    email_configs:
      - to: '[email protected]'
```
- Configure Prometheus to send alerts to all Alertmanager instances:
```yaml
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093
            - alertmanager-2:9093
            - alertmanager-3:9093
```
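If the replicas carry a distinguishing external label such as `replica` (as in the Thanos pattern above), alerts from the two servers would differ by that label and Alertmanager could not deduplicate them. Prometheus's `alert_relabel_configs` can strip it before alerts are sent; a minimal sketch, assuming the label is named `replica`:

```yaml
# prometheus.yml (add alongside the alertmanagers block shown above)
alerting:
  alert_relabel_configs:
    - action: labeldrop
      regex: replica   # drop the per-replica label so both servers emit identical alerts
```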
Practical Examples: High Availability Deployment
Example 1: Docker Compose Setup for Simple HA
Here's a Docker Compose example for a simple HA setup with two Prometheus instances and a clustered Alertmanager:
```yaml
version: '3'

services:
  prometheus-1:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus-1.yml:/etc/prometheus/prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus1_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - "9091:9090"

  prometheus-2:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus-2.yml:/etc/prometheus/prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus2_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - "9092:9090"

  alertmanager-1:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager-1.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.listen-address=:9094'
      - '--cluster.peer=alertmanager-2:9094'
      - '--cluster.peer=alertmanager-3:9094'
    ports:
      - "9093:9093"
      - "9094:9094"

  alertmanager-2:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager-2.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.listen-address=:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-3:9094'
    ports:
      - "9095:9093"
      - "9096:9094"

  alertmanager-3:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager-3.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.listen-address=:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-2:9094'
    ports:
      - "9097:9093"
      - "9098:9094"

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
    ports:
      - "3000:3000"

volumes:
  prometheus1_data:
  prometheus2_data:
  grafana_data:
```
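The Compose file mounts a `grafana-datasources.yml` provisioning file that isn't shown above. A minimal sketch that registers both Prometheus replicas as separate data sources (the `prometheus-1`/`prometheus-2` hostnames match the Compose service names):

```yaml
# grafana-datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus-1
    type: prometheus
    access: proxy
    url: http://prometheus-1:9090
    isDefault: true
  - name: Prometheus-2
    type: prometheus
    access: proxy
    url: http://prometheus-2:9090
```

With both registered, dashboards can be switched to the second replica if the first becomes unavailable.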
Example 2: Thanos Deployment with Kubernetes
Here's a simplified example of a Prometheus deployment with Thanos on Kubernetes:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  serviceName: "prometheus"
  replicas: 2
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus"
            - "--storage.tsdb.retention.time=2d"
            - "--web.enable-lifecycle"
          volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus/
            - name: prometheus-data
              mountPath: /prometheus
          ports:
            - containerPort: 9090
              name: prometheus
        - name: thanos-sidecar
          image: thanosio/thanos:latest
          args:
            - "sidecar"
            - "--tsdb.path=/prometheus"
            - "--prometheus.url=http://localhost:9090"
            - "--objstore.config-file=/etc/thanos/bucket.yml"
          volumeMounts:
            - name: prometheus-data
              mountPath: /prometheus
            - name: thanos-config
              mountPath: /etc/thanos
          ports:
            - containerPort: 10901
              name: grpc
            - containerPort: 10902
              name: http
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
        - name: thanos-config
          secret:
            secretName: thanos-objstore
  volumeClaimTemplates:
    - metadata:
        name: prometheus-data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
```
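For a Thanos Querier to reach every sidecar regardless of which pods are currently running, a headless Service exposing the sidecar's gRPC port is a common companion to this StatefulSet. A sketch, assuming the `app: prometheus` label used above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-thanos-grpc
spec:
  clusterIP: None        # headless: one DNS record per pod
  selector:
    app: prometheus
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc
```

Thanos Query can then discover all sidecars through DNS service discovery, e.g. `--store=dnssrv+_grpc._tcp.prometheus-thanos-grpc.<namespace>.svc.cluster.local`.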
Best Practices for Prometheus HA
- Geographical Distribution: For critical systems, deploy Prometheus servers across different data centers or availability zones
- Resource Isolation: Ensure each Prometheus instance has dedicated resources to prevent resource contention
- Consistent Configuration: Use configuration management tools like Ansible, Chef, or Puppet to ensure consistent configuration across all Prometheus instances
- Monitoring your Monitoring: Set up external monitoring for your Prometheus instances themselves
- Regular Backup: Even with HA, maintain regular backups of your Prometheus data
- Testing Failover: Regularly test failover scenarios to ensure your HA setup works as expected
- Implementing Health Checks: Use health checks to detect and replace failed Prometheus instances automatically
- Scaling Considerations:
  - Consider federation for large-scale deployments
  - Use recording rules to pre-compute expensive queries (see the sketch after this list)
  - Implement proper retention policies
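As referenced above, recording rules keep expensive dashboard queries fast and produce the same pre-computed series on every replica. A small sketch (the group, rule, and metric names are illustrative):

```yaml
# recording-rules.yml
groups:
  - name: precomputed
    interval: 1m
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Load the file on every replica via `rule_files`, just like the alerting rules.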
Diagnosing HA Issues
Common issues in Prometheus HA setups and how to diagnose them:
| Issue | Symptoms | Diagnosis |
|---|---|---|
| Alertmanager cluster split-brain | Duplicate notifications | `curl -s alertmanager:9093/api/v2/status` and compare the peer list on each node |
| Data inconsistency | Different query results between replicas | Compare `prometheus_tsdb_head_series` between instances |
| Network partitioning | Isolated nodes | Inspect gossip traffic on the cluster port (9094) with `tcpdump` |
| Storage bottlenecks | High query or ingestion latency | Watch `prometheus_tsdb_head_samples_appended_total` and compaction duration metrics |
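A quick way to run these checks from a shell, assuming the ports published in the Docker Compose example above (adjust hosts and ports for your environment):

```bash
# Compare in-memory series counts between the two replicas; a large divergence
# suggests scrape or ingestion problems on one of them.
for port in 9091 9092; do
  curl -s "http://localhost:${port}/metrics" | grep '^prometheus_tsdb_head_series '
done

# Inspect Alertmanager cluster membership; every node should report the same peers.
curl -s http://localhost:9093/api/v2/status | jq '.cluster'
```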
Summary
High availability for Prometheus is crucial for maintaining continuous monitoring in production environments. The approaches range from simple redundant setups to advanced distributed systems like Thanos or Cortex/Mimir.
Key takeaways:
- Simple HA can be achieved with redundant Prometheus servers and deduplicating Alertmanager
- Thanos provides an excellent middle ground with long-term storage and global query view
- Cortex/Mimir offers the most advanced HA and multi-tenancy features
- Always consider the entire monitoring pipeline, including Alertmanager
Implementing HA requires careful planning, but it provides the peace of mind that your monitoring system will stay online even when components fail. Remember that the HA approach you choose should match your organization's scale, complexity, and reliability requirements.
Additional Resources
- Prometheus High Availability Documentation
- Thanos Project Documentation
- Grafana Mimir Documentation
- CNCF Prometheus Certification
Exercises
- Basic HA Setup: Deploy two Prometheus servers with identical configurations scraping the same targets. Configure Grafana to use both as data sources.
- Alertmanager Clustering: Set up a three-node Alertmanager cluster and test its behavior when one node fails.
- Thanos Implementation: Deploy Prometheus with Thanos sidecar and set up a MinIO server as object storage. Query historical data using Thanos Querier.
- Failure Simulation: In your HA setup, simulate failure of various components (Prometheus, Alertmanager, storage) and observe system behavior.
- Load Testing: Generate high metric loads and evaluate how your HA setup handles increased pressure, measuring query performance across redundant instances.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)