Prometheus Federation
Introduction
Prometheus Federation is a powerful technique that allows you to create scalable, hierarchical monitoring systems by enabling one Prometheus server to scrape selected time series from another Prometheus server. This approach is particularly useful when you need to monitor large infrastructures or when you want to create a centralized view of metrics collected by multiple distributed Prometheus instances.
Federation helps solve several challenges that emerge when monitoring grows beyond a single Prometheus server:
- Scalability: Distribute the load across multiple Prometheus servers
- Organizational boundaries: Collect metrics across different teams or departments
- Hierarchical views: Create aggregated dashboards while maintaining detailed local monitoring
- Geographic distribution: Handle metrics collection across different data centers or regions
How Federation Works
At its core, federation allows one Prometheus server (the higher-level, federating server) to scrape selected metrics from other Prometheus servers. This is achieved through a special federation endpoint on each source server that exposes its stored time series in a format that can be scraped like any other target.
Setting Up Federation
The /federate Endpoint
Prometheus exposes a special endpoint at /federate that allows other Prometheus servers to scrape selected time series. This endpoint requires a match[] parameter to specify which metrics should be included in the federation.
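For example, you can query the endpoint directly to see what a federating server would receive. The hostname below is illustrative; the -g flag stops curl from interpreting the brackets and braces in the URL as globs:

curl -g 'http://prometheus-1:9090/federate?match[]={job="prometheus"}'

The response uses the standard Prometheus text exposition format, with each sample followed by its original timestamp.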
Basic Federation Configuration
To set up federation, you'll need to add a scrape configuration to your central Prometheus server's configuration file (prometheus.yml):
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus-1:9090'
          - 'prometheus-2:9090'
Let's break down this configuration:
- job_name: A name for this federation job
- scrape_interval: How often to scrape the federated metrics
- honor_labels: When true, this retains the original labels from the source Prometheus
- metrics_path: The federation endpoint
- params: Contains the match parameters that filter which metrics to federate
- static_configs: Defines the Prometheus servers to scrape metrics from
Match Selectors
The match[] parameter is crucial as it determines which metrics will be federated. You can use Prometheus's powerful label matching syntax:
- {job="prometheus"}: Select all metrics associated with the "prometheus" job
- {__name__=~"job:.*"}: Select all metrics with names starting with "job:"
- {instance=~".*"}: Select all metrics from all instances
- {__name__=~"node_.*", environment="production"}: Select all node metrics from the production environment
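Multiple match[] parameters can be supplied in a single request (or as multiple entries in the params section of the scrape config), and the federation endpoint returns the union of all series matched by at least one selector. A hypothetical request combining two selectors:

curl -g 'http://prometheus-1:9090/federate?match[]={job="prometheus"}&match[]={__name__=~"job:.*"}'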
Hierarchical Federation Patterns
Two-Level Federation
The most common pattern is a two-level hierarchy with multiple Prometheus servers at the first level monitoring different parts of your infrastructure, and a single Prometheus server at the second level federating selected metrics.
# Example configuration for central Prometheus server
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="critical-services"}'              # Only federate important metrics
        - 'up'                                     # Always include service availability
        - 'instance:node_cpu_utilization:avg5m'    # Include pre-aggregated metrics
    static_configs:
      - targets:
          - 'team-a-prometheus:9090'
          - 'team-b-prometheus:9090'
        labels:
          datacenter: 'east'
Multi-Level Federation
For larger deployments, you might need more than two levels. In such a setup:
- Local Prometheus servers monitor specific applications
- Regional Prometheus servers federate from local servers
- Global Prometheus federates from regional servers
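As a rough sketch, a regional server's configuration might look like the following; the hostnames, the region label, and the instance:.* rule-name pattern are assumptions for illustration, not taken from a specific setup:

# regional-prometheus.yml (illustrative)
scrape_configs:
  - job_name: 'federate-local'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"instance:.*"}'   # only pre-aggregated recording rules
        - 'up'                          # service availability from each local server
    static_configs:
      - targets:
          - 'app-a-prometheus:9090'
          - 'app-b-prometheus:9090'
        labels:
          region: 'eu-west'

The global Prometheus would then federate from each regional server in exactly the same way, typically matching only the regional aggregation rules.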
Federation Best Practices
1. Federate Aggregated Metrics
It's generally not a good idea to federate all raw metrics, as this can lead to performance issues. Instead, create recording rules in your source Prometheus servers to pre-aggregate metrics, and then federate these aggregated metrics:
# recording rules in source Prometheus
groups:
  - name: aggregate
    rules:
      - record: instance:node_cpu_utilization:avg5m
        expr: avg by (instance)(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
Then federate the aggregated metrics:
# in federated Prometheus
params:
  'match[]':
    - 'instance:node_cpu_utilization:avg5m'
2. Use Label Rewriting
Use honor_labels: true to preserve original labels, but be careful about label collisions. For more control, you can use relabeling:
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: region
        replacement: 'us-west'
3. Be Selective About What You Federate
Avoid federating high-cardinality metrics. Focus on:
- Service-level indicators (SLIs)
- Critical alerts
- Key business metrics
- Pre-aggregated metrics
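One way to stay selective is to lean on the recording-rule naming convention (level:metric:operation) and federate by name pattern. A minimal sketch of such a params section; the name patterns are assumptions about your own rule names, while ALERTS is Prometheus's built-in alert series:

params:
  'match[]':
    - '{__name__=~"instance:.*|job:.*"}'   # recording-rule style names only
    - 'ALERTS{alertstate="firing"}'        # currently firing alerts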
4. Consider Federation Scrape Intervals
Use longer scrape intervals for federation than for direct scraping. This reduces load while still providing useful data for longer-term trends.
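A minimal sketch of how that looks in practice; the interval values are only illustrative:

global:
  scrape_interval: 15s          # direct scraping stays frequent
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 60s        # the federation job polls less often
    metrics_path: '/federate'
    # match[] params and targets as shown in the earlier examples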
Practical Example: Multi-Datacenter Monitoring
Let's walk through a complete example of setting up federation for monitoring multiple datacenters:
- Local Prometheus servers in each datacenter collect detailed metrics:
# prometheus-dc1.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter-1:9100'
          - 'node-exporter-2:9100'
  - job_name: 'api-service'
    static_configs:
      - targets:
          - 'api-service-1:8080'
          - 'api-service-2:8080'

# Recording rules for aggregation
rule_files:
  - 'recording-rules.yml'
- Recording rules to pre-aggregate the metrics:
# recording-rules.yml
groups:
  - name: aggregation
    rules:
      - record: dc:node_cpu_utilization:avg5m
        expr: avg by (job)(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      - record: dc:api_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(api_request_duration_seconds_bucket[5m])) by (le, job))
- Central Prometheus federates from each datacenter:
# central-prometheus.yml
global:
  scrape_interval: 30s    # Longer interval for federation

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - 'up'                                    # Service availability
        - 'dc:node_cpu_utilization:avg5m'         # Aggregated CPU usage
        - 'dc:api_request_duration_seconds:p95'   # API p95 latency
    static_configs:
      - targets: ['prometheus-dc1:9090']
        labels:
          datacenter: 'us-east'
      - targets: ['prometheus-dc2:9090']
        labels:
          datacenter: 'us-west'
- Querying federated metrics in Grafana:
With this setup, you could create a Grafana dashboard that shows datacenter-level metrics with PromQL queries like:
avg by (datacenter)(dc:node_cpu_utilization:avg5m)
This would show the average CPU utilization for each datacenter, allowing you to compare their performance at a glance.
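In the same way, the latency rule federated above can be charted per datacenter; for example, this query surfaces the highest p95 API latency across jobs in each datacenter:

max by (datacenter)(dc:api_request_duration_seconds:p95)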
Troubleshooting Federation
Common Issues
- Metrics not appearing in federated Prometheus
  - Check your match[] selectors
  - Verify connectivity between Prometheus servers
  - Check for label collisions if honor_labels is false
- High cardinality issues
  - Federate fewer metrics or use more specific selectors
  - Use recording rules to aggregate metrics before federation
- High load on source Prometheus
  - Increase the federation scrape interval
  - Be more selective with what you're federating
Debugging Federation
To debug federation issues, check the following:
- Manually test the federate endpoint (the -g flag keeps curl from mangling the brackets and braces):
  curl -g "http://source-prometheus:9090/federate?match[]={job='node'}"
- Check the target status in the federated Prometheus web UI (or via the targets API, sketched below)
- Look for scrape errors in the federated Prometheus logs
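If you prefer the command line over the web UI, the same target status is available from the Prometheus HTTP API (the hostname here is illustrative):

curl 'http://central-prometheus:9090/api/v1/targets'

Targets whose lastError field is non-empty usually point directly at the federation problem.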
Summary
Prometheus Federation is a powerful feature for building scalable monitoring systems. By creating hierarchical monitoring setups, you can effectively monitor large, distributed infrastructures while maintaining a centralized view of important metrics.
Key takeaways:
- Federation allows one Prometheus server to scrape metrics from other Prometheus servers
- Use selective matching to federate only the metrics you need
- Pre-aggregate metrics using recording rules before federation
- Consider hierarchical patterns for large-scale deployments
- Be mindful of performance implications and follow best practices
Exercises
- Set up a basic two-Prometheus federation on your local machine using Docker Compose
- Create recording rules to aggregate CPU and memory usage, then federate only these aggregated metrics
- Experiment with different match[] selectors to understand how they filter metrics
- Add a third level to your federation hierarchy and observe how metrics flow through the system
- Design a federation setup for a hypothetical company with three datacenters and multiple application teams