Prometheus Architecture
Introduction
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It has become a cornerstone of cloud-native monitoring solutions, especially in Kubernetes environments. Understanding Prometheus architecture is essential for effectively implementing and maintaining a robust monitoring system.
In this guide, we'll explore the core components of Prometheus, how they interact with each other, and the fundamental concepts that make Prometheus a powerful monitoring solution.
Prometheus Architecture Overview
At a high level, Prometheus operates using several key components working together to collect, store, and query metrics data, as well as to trigger alerts when specific conditions are met.
Let's explore each component in detail.
Core Components
1. Prometheus Server
The Prometheus server is the central component that performs the actual monitoring work. It consists of several sub-components:
Time Series Database (TSDB)
The TSDB is where Prometheus stores all its metrics data. It's optimized for:
- High write throughput
- Efficient storage of time-series data
- Fast query execution
// Simplified representation of how data is stored in the TSDB
{
  metric: "http_requests_total",
  labels: {
    method: "GET",
    endpoint: "/api/v1/users",
    status: "200"
  },
  samples: [
    [1616161616, 10],  // [timestamp, value]
    [1616161676, 15],
    [1616161736, 18]
  ]
}
Data Retrieval (Scraper)
This component is responsible for:
- Pulling metrics from monitored targets (HTTP endpoints)
- Processing and storing the metrics in the database
- Managing service discovery to find targets dynamically
HTTP Server
Provides:
- A web UI for querying data and visualizing metrics
- API endpoints for external systems to query data
- Administrative interfaces for managing Prometheus
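For example, an external system can call the `/api/v1/query` endpoint and parse the JSON it returns. The Python sketch below uses an illustrative hard-coded response (not data from a live server) to show the shape of an instant-query result:

```python
import json

# A sample response in the shape returned by GET /api/v1/query?query=up
# (the values here are made up for illustration).
sample = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "job": "node", "instance": "localhost:9100"},
       "value": [1616161616.0, "1"]}
    ]
  }
}
""")

def extract_values(response):
    """Flatten an instant-vector response into (labels, value) pairs."""
    if response["status"] != "success":
        raise RuntimeError("query failed")
    return [(r["metric"], float(r["value"][1]))
            for r in response["data"]["result"]]

for labels, value in extract_values(sample):
    print(labels.get("instance"), value)
```

Note that sample values arrive as strings (`"1"`), so clients must convert them to numbers themselves.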
2. Exporters
Exporters are specialized applications that translate metrics from systems that don't natively expose Prometheus-format metrics into a format Prometheus can scrape. Common exporters include:
- Node Exporter: System metrics (CPU, memory, disk, network)
- MySQL Exporter: MySQL server metrics
- Blackbox Exporter: Probing of endpoints over HTTP, HTTPS, DNS, TCP and ICMP
Example of metrics exposed by Node Exporter:
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 8570.6
node_cpu_seconds_total{cpu="0",mode="system"} 47.79
node_cpu_seconds_total{cpu="0",mode="user"} 38.42
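The exposition format above is plain text and straightforward to parse. The following toy Python parser (a simplification; real clients use the official parsers, which handle escaping, histograms, and more line types) extracts the samples from the Node Exporter output shown above:

```python
import re

exposition = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 8570.6
node_cpu_seconds_total{cpu="0",mode="system"} 47.79
node_cpu_seconds_total{cpu="0",mode="user"} 38.42
"""

# Matches: metric_name{label="value",...} sample_value
SAMPLE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+([\d.eE+-]+)$')

def parse(text):
    samples = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE metadata lines
        m = SAMPLE_RE.match(line)
        if m:
            name, raw_labels, value = m.groups()
            labels = dict(kv.split("=", 1) for kv in raw_labels.split(","))
            labels = {k: v.strip('"') for k, v in labels.items()}
            samples.append((name, labels, float(value)))
    return samples

for name, labels, value in parse(exposition):
    print(name, labels["mode"], value)
```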
3. Push Gateway
While Prometheus primarily uses a pull model (scraping metrics from targets), the Push Gateway allows for a push model in certain scenarios:
- Short-lived jobs that may not exist long enough to be scraped
- Batch jobs where metrics need to be collected after completion
- Jobs behind network restrictions that can't be directly scraped
Example of pushing metrics to the Push Gateway:
echo "some_metric 3.14" | curl --data-binary @- http://pushgateway:9091/metrics/job/some_job
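The same push can be prepared programmatically. This Python sketch (standard library only, with the actual network call commented out since it assumes a reachable Pushgateway at the hypothetical host `pushgateway:9091`) builds the equivalent request:

```python
import urllib.request

# Metric in the text exposition format; the trailing newline is required.
payload = "some_metric 3.14\n"

# The job name is part of the URL path, exactly as in the curl example.
url = "http://pushgateway:9091/metrics/job/some_job"
req = urllib.request.Request(url, data=payload.encode(), method="POST")

# Uncomment once a Pushgateway is actually reachable at that address:
# urllib.request.urlopen(req)
print(req.get_method(), req.full_url)
```

In practice, the official client libraries (e.g. `push_to_gateway` in the Python `prometheus_client` package) handle the formatting and HTTP details for you.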
4. Alert Manager
The Alert Manager handles alerts sent by the Prometheus server:
- Deduplicates, groups, and routes alerts to the correct receiver
- Silences and inhibits alerts when needed
- Integrates with various notification channels (email, Slack, PagerDuty, etc.)
Example Alert Manager configuration:
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'team-X-mails'
receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: '[email protected]'
Data Flow in Prometheus
Let's understand how data flows through the Prometheus ecosystem:
- Collection: Prometheus server scrapes metrics from targets (or receives them via the Push Gateway)
- Storage: The scraped metrics are stored in the TSDB
- Querying: Users can query the stored metrics using PromQL through the HTTP API
- Alerting: Prometheus evaluates alert rules against the data and sends alerts to the Alert Manager
- Visualization: Tools like Grafana can query Prometheus and visualize the metrics
PromQL: Prometheus Query Language
PromQL is the query language used to retrieve and process time series data in Prometheus. It's powerful for creating complex queries and aggregations.
Basic PromQL examples:
# Select the current value of the HTTP request counter
http_requests_total
# Rate of HTTP requests over the last 5 minutes
rate(http_requests_total[5m])
# 95th percentile of request durations
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
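To build intuition for histogram_quantile, here is a toy Python re-implementation of its core interpolation step (a simplification of the real PromQL function, using made-up cumulative bucket counts):

```python
def histogram_quantile(q, buckets):
    """Toy sketch of PromQL's histogram_quantile():
    buckets = sorted (upper_bound, cumulative_count) pairs,
    ending with the +Inf bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the open-ended bucket
            # Linear interpolation inside the bucket, as PromQL does
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count

# Cumulative counts from hypothetical *_bucket series:
# 40 requests took <= 0.1s, 90 took <= 0.5s, all 100 took <= 1.0s
buckets = [(0.1, 40), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

The 95th request out of 100 falls halfway between counts 90 and 100, so the result is interpolated halfway between the 0.5s and 1.0s bounds.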
Scraping and Service Discovery
Prometheus uses a pull-based model to collect metrics, which offers several advantages:
- Centralized control of scrape frequency and timeout settings
- Better visibility into targets' health (if it can't scrape, something is wrong)
- No need for authentication/authorization on the target side in many cases
Configuration example for scraping targets:
scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
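The relabel_configs rule above keeps only targets whose prometheus.io/scrape pod annotation matches the regex. A small Python sketch of that keep action (with made-up discovered targets) illustrates the semantics:

```python
import re

# Targets discovered from Kubernetes, with pod annotations mapped to
# __meta_kubernetes_* labels (label names as in the config above).
discovered = [
    {"__address__": "10.0.0.1:8080",
     "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true"},
    {"__address__": "10.0.0.2:8080",
     "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "false"},
    {"__address__": "10.0.0.3:8080"},  # annotation absent
]

def keep(targets, source_label, regex):
    """Mimic a relabel_configs rule with action: keep."""
    pattern = re.compile(f"^(?:{regex})$")  # Prometheus anchors its regexes
    return [t for t in targets
            if pattern.match(t.get(source_label, ""))]

kept = keep(discovered,
            "__meta_kubernetes_pod_annotation_prometheus_io_scrape",
            "true")
print([t["__address__"] for t in kept])  # only the first pod survives
```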
Storage Efficiency
Prometheus is designed for efficient storage of time-series data:
- Uses a custom time-series database optimized for append operations
- Applies delta encoding and compression to reduce storage requirements
- Implements a block-based storage model with chunks of samples
Even so, high cardinality (many unique label combinations) is the main driver of memory and disk usage, so keep label sets bounded when designing metrics.
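To see why delta encoding helps, consider scrape timestamps: they are nearly evenly spaced, so their deltas are small and repetitive. This toy Python sketch shows the idea (the real TSDB goes further, using double-delta timestamps and XOR-compressed values, following Facebook's Gorilla paper):

```python
def delta_encode(samples):
    """Store the first value plus successive differences."""
    deltas = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    """Rebuild the original samples from the deltas."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# Timestamps from three scrapes at a 60-second interval
timestamps = [1616161616, 1616161676, 1616161736]
encoded = delta_encode(timestamps)
print(encoded)  # [1616161616, 60, 60] -- small, repetitive deltas compress well
assert delta_decode(encoded) == timestamps
```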
Practical Example: Monitoring a Web Application
Let's walk through a practical example of how Prometheus architecture components work together to monitor a web application:
- Set up exporters: Install Node Exporter on your servers and add application-specific metrics to your code
// Example of adding Prometheus metrics in a Node.js (Express) application
const express = require('express');
const client = require('prom-client');

const app = express();

const httpRequestDurationSeconds = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

// Register the metric
client.register.registerMetric(httpRequestDurationSeconds);

// Add middleware to track request duration
app.use((req, res, next) => {
  const end = httpRequestDurationSeconds.startTimer();
  res.on('finish', () => {
    end({
      method: req.method,
      // req.route is only set for matched routes; fall back to the raw path
      route: req.route ? req.route.path : req.path,
      status_code: res.statusCode
    });
  });
  next();
});

// Expose the metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
- Configure Prometheus to scrape these exporters:
scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      - targets: ['node1:9100', 'node2:9100']
  - job_name: 'web-app'
    scrape_interval: 10s
    static_configs:
      - targets: ['webapp:8080']
- Set up alert rules for important metrics:
groups:
  - name: example
    rules:
      - alert: HighRequestLatency
        expr: rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "App is experiencing high latency for HTTP requests (> 500ms). VALUE = {{ $value }}, LABELS = {{ $labels }}"
- Configure Alert Manager to route notifications:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web-hooks'
receivers:
  - name: 'web-hooks'
    slack_configs:
      # Requires slack_api_url in the global section (or api_url set here)
      - channel: '#alerts'
        send_resolved: true
- Create Grafana dashboards to visualize the metrics:
{
  "title": "Web Application Dashboard",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "sum(rate(http_request_duration_seconds_count[1m])) by (route)",
          "legendFormat": "{{route}}"
        }
      ]
    },
    {
      "title": "Response Time",
      "type": "graph",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))",
          "legendFormat": "{{route}} (p95)"
        }
      ]
    }
  ]
}
Best Practices for Prometheus Architecture
- Retention period: Configure appropriate data retention based on your needs and storage capacity. Retention is set with a command-line flag, not in prometheus.yml:

--storage.tsdb.retention.time=15d

- Scrape interval: Balance between data granularity and system load:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

- Federation: For large-scale deployments, consider federation to create a hierarchy of Prometheus servers:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'source-prometheus-1:9090'
          - 'source-prometheus-2:9090'

- High Availability: Deploy Prometheus in pairs for redundancy, running the same scrape configuration on both instances

- Resource allocation: Prometheus is memory-intensive, so allocate sufficient resources:

# Docker/Kubernetes resource recommendation for a medium workload
resources:
  requests:
    memory: "2Gi"
    cpu: "500m"
  limits:
    memory: "4Gi"
    cpu: "1000m"
Summary
Prometheus architecture consists of several components working together to provide a robust monitoring system:
- Prometheus Server: The core component that scrapes and stores time series data
- Exporters: Convert metrics from various systems into a format Prometheus can consume
- Push Gateway: Allows short-lived jobs to push metrics, which Prometheus then scrapes from the gateway
- Alert Manager: Handles alerts, including deduplication, grouping, and routing
- PromQL: A powerful query language for analyzing the collected metrics
The pull-based model, efficient storage, and flexible query language make Prometheus particularly well-suited for dynamic environments like containers and microservices.
Exercises
- Install Prometheus and Node Exporter on your local machine and configure Prometheus to scrape Node Exporter metrics.
- Write PromQL queries to:
- Find the CPU usage rate
- Calculate the 95th percentile of memory usage
- Count the number of running processes
- Create a simple alert rule that triggers when disk usage exceeds 80%.
- Set up Grafana and create a dashboard with system metrics from Prometheus.
- Instrument a simple application with Prometheus client library in your preferred language.
Additional Resources
- Prometheus Documentation
- PromQL Cheat Sheet
- Grafana Dashboards for Prometheus
- Prometheus Operator for Kubernetes
- SRE Books by Google
By understanding Prometheus architecture, you'll be well-equipped to implement effective monitoring solutions for your infrastructure and applications, enabling better visibility, troubleshooting, and capacity planning.