Prometheus Internals
Introduction
When you're using Prometheus for monitoring, understanding its internal architecture can help you optimize your setup, troubleshoot issues more effectively, and make better decisions about scaling your monitoring infrastructure. In this guide, we'll take a deep dive into how Prometheus works under the hood, exploring its core components, data flow, and storage mechanisms.
Although Prometheus ships as a single binary, internally it is a sophisticated system of interacting components that handle everything from scraping metrics to storing time series data and serving queries. By understanding these internals, you'll gain insight into how Prometheus achieves its reliability, efficiency, and scalability.
Core Components
Prometheus consists of several key internal components that work together to provide its monitoring capabilities:
- Service Discovery - Finds targets to monitor
- Scraping Engine - Collects metrics from targets
- Storage Subsystem - Manages the time-series database (TSDB)
- PromQL Engine - Processes queries against stored data
- Alert Manager Interface - Evaluates alerting rules and sends firing alerts to Alertmanager
- HTTP Server - Serves the UI and API endpoints
Let's explore each of these components in detail:
Service Discovery
Before Prometheus can collect metrics, it needs to know what to monitor. This is where service discovery comes in.
How Service Discovery Works
Prometheus supports multiple service discovery mechanisms:
```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
```
The service discovery process:
- Prometheus reads configuration from the config file
- The service discovery manager initializes the configured discovery mechanisms
- Discovery mechanisms provide target information (endpoints to scrape)
- Targets are processed and relabeled according to configured rules
- The final list of targets is passed to the scrape manager
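To make the relabeling step more concrete, here is a small, self-contained Go sketch of how a replace-style relabel rule rewrites a discovered target's label set. It is a simplified illustration rather than Prometheus's actual relabel package; the `relabelConfig` type and `applyReplace` function are invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// relabelConfig is a simplified stand-in for Prometheus's relabel_config.
type relabelConfig struct {
	SourceLabels []string // labels whose values are joined with the separator
	Separator    string
	Regex        *regexp.Regexp
	TargetLabel  string
	Replacement  string
}

// applyReplace applies a single "replace" action to a target's labels.
func applyReplace(labels map[string]string, cfg relabelConfig) {
	// Join the source label values, e.g. "__address__" -> "10.0.0.5:9100".
	values := make([]string, 0, len(cfg.SourceLabels))
	for _, name := range cfg.SourceLabels {
		values = append(values, labels[name])
	}
	joined := strings.Join(values, cfg.Separator)

	// Only rewrite the target label if the regex matches.
	if m := cfg.Regex.FindStringSubmatchIndex(joined); m != nil {
		result := cfg.Regex.ExpandString(nil, cfg.Replacement, joined, m)
		labels[cfg.TargetLabel] = string(result)
	}
}

func main() {
	// Labels as they might arrive from Kubernetes pod discovery.
	target := map[string]string{
		"__address__":                "10.0.0.5:9100",
		"__meta_kubernetes_pod_name": "node-exporter-abc12",
		"__meta_kubernetes_namespace": "monitoring",
	}

	// Copy the discovered pod name into a plain "pod" label.
	applyReplace(target, relabelConfig{
		SourceLabels: []string{"__meta_kubernetes_pod_name"},
		Separator:    ";",
		Regex:        regexp.MustCompile("(.+)"),
		TargetLabel:  "pod",
		Replacement:  "$1",
	})

	fmt.Println(target["pod"]) // node-exporter-abc12
}
```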
Scraping Engine
The scraping engine is responsible for collecting metrics from the discovered targets at configured intervals.
Scrape Process
For each target, Prometheus:
- Initiates an HTTP request to the `/metrics` endpoint
- Receives a text-based response with metrics
- Parses the response in the Prometheus exposition format
- Applies relabeling rules (if configured)
- Adds internal labels like `job` and `instance`
- Sends the processed samples to the storage subsystem
Here's a code example showing a basic scraper implementation:
```go
// scrape collects metrics from a single target and hands them to storage.
func (s *scraper) scrape(ctx context.Context) {
	start := time.Now()

	// Make the HTTP request to the target's metrics endpoint.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, s.url.String(), nil)
	if err != nil {
		// Handle error
		return
	}
	resp, err := s.client.Do(req)
	if err != nil {
		// Handle error
		return
	}
	defer resp.Body.Close()

	// Parse the response in either protobuf or text exposition format.
	var metricFamilies map[string]*dto.MetricFamily
	if strings.HasPrefix(resp.Header.Get("Content-Type"), "application/vnd.google.protobuf") {
		metricFamilies, err = parseProtobufResponse(resp.Body)
	} else {
		metricFamilies, err = parseTextResponse(resp.Body)
	}
	if err != nil {
		// Handle error
		return
	}

	// Process and store metrics
	// ...

	// Record how long the scrape took.
	duration := time.Since(start)
	s.metrics.duration.Observe(duration.Seconds())
}
```
Storage Subsystem (TSDB)
The storage subsystem is the heart of Prometheus, managing how time-series data is written, stored, and accessed.
TSDB Architecture
The storage layer consists of:
- Write-Ahead Log (WAL) - Ensures durability of recent data
- Head Block - In-memory storage for recent samples
- Persistent Blocks - Immutable blocks of older data on disk
- Indexes - For efficient query execution
- Compaction - Process that optimizes storage
Data Flow Through Storage
When metrics are collected:
- New samples are written to the write-ahead log (WAL) for durability
- Samples are stored in the in-memory head block
- Periodically (every 2 hours by default), the head block is compacted into a persistent block
- Persistent blocks are further compacted to optimize storage
- Old blocks are deleted according to retention settings
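The ordering of that write path (durability first, then the in-memory head) can be sketched in a few lines of Go. This is a deliberately simplified illustration, not the real TSDB appender: the `sample`, `headBlock`, and `appendSample` names are invented here, and the actual WAL uses a batched binary format.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// sample is one (series, timestamp, value) triple, as produced by a scrape.
type sample struct {
	series string  // e.g. `http_requests_total{job="api"}`
	ts     int64   // milliseconds since epoch
	value  float64
}

// headBlock mimics the in-memory head: recent samples per series.
type headBlock struct {
	wal    *bufio.Writer // write-ahead log on disk
	series map[string][]sample
}

// appendSample writes the sample to the WAL first, then to memory.
// If Prometheus crashes, the head can be rebuilt by replaying the WAL.
func (h *headBlock) appendSample(s sample) error {
	// 1. Durability: record the sample in the WAL.
	if _, err := fmt.Fprintf(h.wal, "%s %d %g\n", s.series, s.ts, s.value); err != nil {
		return err
	}
	if err := h.wal.Flush(); err != nil {
		return err
	}
	// 2. Serve queries: keep the sample in the in-memory head block.
	h.series[s.series] = append(h.series[s.series], s)
	return nil
}

func main() {
	f, err := os.Create("wal-segment-000000001")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	head := &headBlock{wal: bufio.NewWriter(f), series: map[string][]sample{}}
	if err := head.appendSample(sample{`up{job="prometheus"}`, 1607533730610, 1}); err != nil {
		panic(err)
	}
	fmt.Println("samples in head:", len(head.series[`up{job="prometheus"}`]))
}
```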
Here's an example of the directory structure of a Prometheus TSDB:
```
data/
├── 01BKGV7JC0RY8A6MACW02A2PJD/
│   ├── chunks/
│   │   ├── 000001
│   │   └── 000002
│   ├── index
│   └── meta.json
├── 01BKGTZQ1SYQJTR4PB43C8PD98/
│   ├── chunks/
│   │   ├── 000001
│   │   └── 000002
│   ├── index
│   └── meta.json
├── chunks_head/
├── wal/
│   ├── 000000001
│   └── 000000002
└── queries.active
```
Each block contains:
- Chunks: The actual metric data
- Index: For fast lookups
- Meta.json: Block metadata including time range
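One easy way to poke at these blocks is to read a block's meta.json directly. The Go sketch below prints the time range of one block; the ulid, minTime, and maxTime field names match what recent Prometheus versions write, but treat that as an assumption, check a meta.json in your own data directory, and substitute a block ULID that actually exists there.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// blockMeta holds a subset of the fields found in a block's meta.json.
type blockMeta struct {
	ULID    string `json:"ulid"`
	MinTime int64  `json:"minTime"` // milliseconds since epoch
	MaxTime int64  `json:"maxTime"`
}

func main() {
	raw, err := os.ReadFile("data/01BKGV7JC0RY8A6MACW02A2PJD/meta.json")
	if err != nil {
		panic(err)
	}

	var meta blockMeta
	if err := json.Unmarshal(raw, &meta); err != nil {
		panic(err)
	}

	fmt.Printf("block %s covers %s to %s\n",
		meta.ULID,
		time.UnixMilli(meta.MinTime).UTC().Format(time.RFC3339),
		time.UnixMilli(meta.MaxTime).UTC().Format(time.RFC3339))
}
```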
PromQL Engine
The query engine processes PromQL expressions and retrieves data from the storage subsystem.
Query Execution Flow
When a PromQL query is executed:
- The query string is parsed into an abstract syntax tree (AST)
- The query is validated for correctness
- A query plan is created to optimize execution
- The execution engine retrieves data from storage components
- Results are processed according to query operators
- The formatted results are returned to the client
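You can try the parsing step yourself, since the PromQL parser is importable as a Go library. A minimal sketch, assuming `github.com/prometheus/prometheus/promql/parser` is available as a dependency in your module:

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/promql/parser"
)

func main() {
	// Parse a query string into an abstract syntax tree (AST).
	expr, err := parser.ParseExpr(`rate(http_requests_total{job="api-server"}[5m])`)
	if err != nil {
		panic(err)
	}

	// The root of the AST knows its result type (here: an instant vector)
	// and can print itself back as PromQL.
	fmt.Println("type:", expr.Type())
	fmt.Println("expr:", expr.String())
}
```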
Let's look at a simple PromQL query and how it's processed:
```promql
rate(http_requests_total{job="api-server"}[5m])
```

- The query is parsed into an AST with a `rate()` function containing a matrix selector
- The engine identifies that it needs 5 minutes of data for the `http_requests_total` metric with the label `job="api-server"`
- It retrieves the raw samples from storage
- It calculates the per-second rate of increase for each time series
- The results are returned as an instant vector
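To give a feel for the last two steps, here is a simplified Go sketch that computes a per-second rate from raw counter samples. The real rate() implementation also handles counter resets and extrapolates to the edges of the window, so treat this as an approximation for intuition only.

```go
package main

import "fmt"

// sample is a (timestamp, value) pair for one time series.
type sample struct {
	ts    int64   // milliseconds since epoch
	value float64 // counter value, e.g. http_requests_total
}

// simpleRate approximates rate() over a window of samples:
// the increase between the first and last sample divided by the
// elapsed time in seconds. (Real PromQL also corrects for counter
// resets and extrapolates to the window boundaries.)
func simpleRate(window []sample) float64 {
	if len(window) < 2 {
		return 0
	}
	first, last := window[0], window[len(window)-1]
	increase := last.value - first.value
	seconds := float64(last.ts-first.ts) / 1000
	return increase / seconds
}

func main() {
	// 5 minutes of samples scraped every 60s for one series.
	window := []sample{
		{ts: 0, value: 100},
		{ts: 60_000, value: 160},
		{ts: 120_000, value: 220},
		{ts: 180_000, value: 280},
		{ts: 240_000, value: 340},
	}
	fmt.Printf("per-second rate: %.2f\n", simpleRate(window)) // 1.00
}
```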
Alert Manager Interface
Prometheus evaluates alert rules internally but delegates alert grouping, deduplication, routing, and notification to Alertmanager.
Alert Evaluation Process
For each alert rule:
- The rule manager schedules evaluations at regular intervals
- PromQL expressions in the rule are executed against the database
- Results are compared against the defined thresholds
- If the condition holds, the alert enters the pending state; once it has held for the configured `for` duration, it fires
- Firing alerts are sent to the configured Alertmanager
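A heavily simplified version of that loop might look like the Go sketch below. The `evaluateExpr` function is a stand-in for running the rule's PromQL expression, the pending/firing bookkeeping is schematic, and while /api/v2/alerts is Alertmanager's standard endpoint, treat the exact payload fields used here as an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// alert mirrors the rough shape of an alert sent to Alertmanager's v2 API
// (labels, annotations, start time); field names are an assumption here.
type alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
}

// evaluateExpr is a stand-in for executing the rule's PromQL expression;
// it returns true when the rule condition currently holds.
func evaluateExpr() bool { return true }

func main() {
	var pendingSince time.Time
	const forDuration = 10 * time.Minute

	for range time.Tick(15 * time.Second) { // rule evaluation interval
		if !evaluateExpr() {
			pendingSince = time.Time{} // condition cleared, reset state
			continue
		}
		if pendingSince.IsZero() {
			pendingSince = time.Now() // alert enters "pending"
		}
		if time.Since(pendingSince) < forDuration {
			continue // not yet held for the full `for` duration
		}

		// Alert is firing: send it to Alertmanager.
		payload, _ := json.Marshal([]alert{{
			Labels:      map[string]string{"alertname": "HighRequestLatency", "severity": "page"},
			Annotations: map[string]string{"summary": "High request latency"},
			StartsAt:    pendingSince,
		}})
		resp, err := http.Post("http://alertmanager:9093/api/v2/alerts",
			"application/json", bytes.NewReader(payload))
		if err != nil {
			fmt.Println("failed to notify Alertmanager:", err)
			continue
		}
		resp.Body.Close()
	}
}
```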
Example of an alert rule configuration:
```yaml
groups:
  - name: example
    rules:
      - alert: HighRequestLatency
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
```
HTTP Server
The HTTP server provides several interfaces for interacting with Prometheus:
- Web UI - For visualizing metrics and alerts
- HTTP API - For programmatic access
- Metrics endpoint - Prometheus itself exposes metrics about its operation
Key API Endpoints
- `/api/v1/query` - Execute instant queries
- `/api/v1/query_range` - Execute range queries
- `/api/v1/series` - Find time series matching a label set
- `/api/v1/labels` - Get all label names
- `/api/v1/targets` - Get target information
- `/metrics` - Prometheus' own metrics
Example request to the query API:

```bash
curl 'http://localhost:9090/api/v1/query?query=up'
```
Response:
```json
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "up",
          "job": "prometheus",
          "instance": "localhost:9090"
        },
        "value": [1607533730.610, "1"]
      }
    ]
  }
}
```
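The same query can also be issued programmatically. The Go sketch below calls /api/v1/query and decodes just the fields shown in the response above; the `queryResponse` struct is a minimal assumption for this example, not the full API schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// queryResponse models only the fields used here; the full API
// response contains more (e.g. error fields, other result types).
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [timestamp, value-as-string]
		} `json:"result"`
	} `json:"data"`
}

func main() {
	resp, err := http.Get("http://localhost:9090/api/v1/query?query=" + url.QueryEscape("up"))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var qr queryResponse
	if err := json.NewDecoder(resp.Body).Decode(&qr); err != nil {
		panic(err)
	}

	for _, r := range qr.Data.Result {
		fmt.Printf("%s{instance=%q} = %v\n", r.Metric["__name__"], r.Metric["instance"], r.Value[1])
	}
}
```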
Performance Considerations
Understanding Prometheus internals helps optimize performance:
Memory Usage
Memory is primarily consumed by:
- The head block for recent samples
- Query execution
- Series cardinality (unique time series)
To control memory usage:
- Limit series cardinality by avoiding high-cardinality labels
- Adjust `--storage.tsdb.max-block-duration` and `--storage.tsdb.min-block-duration`
- Consider federation or sharding for large deployments
Disk Usage
Disk usage is affected by:
- Retention period
- Scrape interval
- Number of metrics
- Cardinality of metrics
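A rough capacity estimate ties these factors together: needed disk is approximately retention time x ingested samples per second x bytes per sample, and Prometheus typically achieves on the order of 1-2 bytes per sample after compression. The Go sketch below does that arithmetic with made-up input numbers; substitute values from your own setup.

```go
package main

import "fmt"

func main() {
	// Assumed inputs: replace with numbers from your own setup.
	const (
		activeSeries   = 500_000 // unique time series being scraped
		scrapeInterval = 15.0    // seconds
		retentionDays  = 15.0    // --storage.tsdb.retention.time
		bytesPerSample = 2.0     // compressed size, roughly 1-2 bytes
	)

	samplesPerSecond := activeSeries / scrapeInterval
	retentionSeconds := retentionDays * 24 * 60 * 60
	diskBytes := samplesPerSecond * retentionSeconds * bytesPerSample

	fmt.Printf("ingestion: %.0f samples/s\n", samplesPerSecond)
	fmt.Printf("estimated disk: %.1f GiB\n", diskBytes/(1<<30))
}
```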
To optimize disk usage:
- Set appropriate retention with `--storage.tsdb.retention.time`
- Use recording rules to pre-aggregate data you query frequently
- Consider external storage solutions for long-term data
Scaling Prometheus
Knowledge of internals helps with scaling strategies:
- Vertical scaling: Larger machines for single instances
- Functional sharding: Different Prometheus servers for different services
- Hierarchical federation: Higher-level Prometheus instances collect from lower-level ones
- Remote storage: Offloading data to external systems
Real-world Example: Troubleshooting High Cardinality
Let's walk through a common issue: a Prometheus instance experiencing memory pressure due to high cardinality.
Problem Identification
Symptoms:
- Increasing memory usage
- Slow query performance
- Potential OOM (Out of Memory) crashes
Analysis
- Check the number of time series via the TSDB status API:

```bash
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.headStats.numSeries'
```

- Identify high-cardinality metrics:

```promql
topk(10, count by (__name__)({__name__=~".+"}))
```

- For a specific metric, find high-cardinality labels:

```promql
topk(10, count by (some_suspected_label) (your_metric_name))
```
Solution
After identifying that an application was adding unique request IDs as labels:
- Relabel to drop the high-cardinality label (note that `labeldrop` matches label names with `regex` rather than using `source_labels`):

```yaml
scrape_configs:
  - job_name: 'problem-app'
    static_configs:
      - targets: ['problem-app:8080']
    metric_relabel_configs:
      - regex: request_id
        action: labeldrop
```
- Reload the Prometheus configuration (send SIGHUP, or POST to /-/reload if the lifecycle API is enabled) to apply the change
- Monitor series count to verify improvement
Summary
In this guide, we've explored the internal architecture of Prometheus:
- Service Discovery: How Prometheus finds targets to monitor
- Scraping Engine: How metrics are collected from targets
- Storage Subsystem: How time-series data is stored and managed
- PromQL Engine: How queries are processed
- Alert Manager Interface: How alerts are evaluated
- HTTP Server: How the API and UI are exposed
Understanding these internals gives you deeper insight into how Prometheus works, helping you:
- Set up more efficient monitoring
- Troubleshoot performance issues
- Make better decisions about scaling
- Optimize storage and memory usage
Additional Resources
- Prometheus Storage documentation
- Prometheus Configuration documentation
- PromQL documentation
- Prometheus GitHub repository
Exercises
- Set up a local Prometheus instance and examine the directory structure of its data folder.
- Write PromQL queries to investigate the internal metrics of Prometheus (hint: look at metrics starting with `prometheus_`).
- Experiment with different retention settings and observe the impact on disk usage.
- Introduce a high-cardinality metric and observe its impact on Prometheus memory usage.
- Configure Prometheus to use remote write and analyze how this affects its performance.