Service Discovery Troubleshooting
Introduction
Service discovery is a critical component of Prometheus's monitoring architecture. It enables Prometheus to automatically find and scrape targets without manual configuration. However, when service discovery doesn't work as expected, it can lead to missing metrics, incomplete monitoring, and potential blind spots in your observability setup.
This guide will help you understand, identify, and troubleshoot common issues with Prometheus service discovery mechanisms, ensuring your monitoring system captures all the targets it should.
Common Service Discovery Issues
Before diving into specific troubleshooting techniques, let's understand the common issues that can occur with service discovery:
- Configuration errors - Syntax issues or incorrect parameters in your service discovery configuration
- Network connectivity problems - Prometheus unable to reach service discovery endpoints
- Authorization issues - Incorrect or missing credentials for accessing discovery APIs
- Target relabeling problems - Errors in relabeling rules leading to dropped targets
- Mismatched expectations - Service discovery working correctly but not finding what you expect
Troubleshooting Process Overview
When facing service discovery issues, follow this systematic approach:
Step 1: Verify Prometheus Configuration
The first step is to validate that your Prometheus configuration is correct. Configuration errors are a common source of service discovery issues.
Check Configuration Syntax
Ensure your configuration file follows the correct YAML syntax and matches the expected structure for your service discovery mechanism.
You can validate your configuration using the Prometheus command line tool:
promtool check config prometheus.yml
Example output for a valid configuration:
prometheus.yml SUCCESS
Example output for an invalid configuration:
Checking prometheus.yml
FAILED: parsing YAML file prometheus.yml: yaml: line 42: did not find expected key
Verify Service Discovery Parameters
Each service discovery method has specific parameters. Let's examine common parameters for Kubernetes service discovery as an example:
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        api_server: 'https://kubernetes.default.svc:443'
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
Common misconfigurations include:
- Incorrect API server URL
- Missing or wrong path to credential files
- Incorrect role (should be one of: node, service, pod, endpoints, endpointslice, ingress)
Step 2: Check Service Discovery Status
Prometheus exposes information about its service discovery status through its web interface and API.
Using the Prometheus Web UI
- Navigate to your Prometheus web interface (usually http://your-prometheus-instance:9090)
- Go to Status > Targets to see all the discovered targets
- Look for targets in the Down state or missing targets you expect to see
Using the Prometheus API
You can query the Prometheus API to get information about targets:
curl http://your-prometheus-instance:9090/api/v1/targets | jq
The response will include detailed information about each target, including:
- Discovery group
- Current state (up/down)
- Labels
- Last scrape time and duration
- Error information (if any)
Analyzing Target States
When examining targets, pay attention to:
- Up targets: Successfully discovered and scraped
- Down targets: Discovered but failing to scrape
- Missing targets: Not discovered at all (you'll need to infer these)
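A quick way to surface down targets and their scrape errors is to filter the targets API response with jq. The sketch below runs the filter against a small sample payload written to /tmp/targets.json (the sample values are made up for illustration); in a live setup you would pipe `curl -s http://your-prometheus-instance:9090/api/v1/targets` into the same jq filter.

```shell
# Sample of the /api/v1/targets response shape (illustrative data).
cat <<'EOF' > /tmp/targets.json
{"data":{"activeTargets":[
  {"labels":{"job":"node-exporter","instance":"host1:9100"},"health":"up","lastError":""},
  {"labels":{"job":"node-exporter","instance":"host2:9100"},"health":"down","lastError":"connection refused"}
]}}
EOF

# Print only the down targets, with job, instance, and the last scrape error.
jq -r '.data.activeTargets[]
       | select(.health == "down")
       | "\(.labels.job) \(.labels.instance): \(.lastError)"' /tmp/targets.json
```

This prints one line per down target, which is much easier to scan than the full JSON when you have hundreds of targets.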
Step 3: Enable Debug Logging
Increasing Prometheus's log verbosity can provide valuable insights into service discovery issues.
Setting Debug Log Level
Start Prometheus with the --log.level=debug flag:
prometheus --config.file=prometheus.yml --log.level=debug
Or set it in your deployment configuration:
args:
  - "--config.file=/etc/prometheus/prometheus.yml"
  - "--storage.tsdb.path=/prometheus"
  - "--log.level=debug"
Analyzing Service Discovery Logs
Look for logs related to service discovery. For example, with Kubernetes service discovery:
level=debug ts=2023-05-15T10:23:45.123Z caller=kubernetes.go:267 component="discovery manager scrape" discovery=kubernetes msg="kubernetes service discovery started"
Error messages will provide clues about what's going wrong:
level=error ts=2023-05-15T10:24:12.456Z caller=kubernetes.go:321 component="discovery manager scrape" discovery=kubernetes msg="error retrieving pod objects" err="Get \"https://kubernetes.default.svc:443/api/v1/pods\": dial tcp: lookup kubernetes.default.svc on 10.96.0.10:53: no such host"
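Debug logging is verbose, so it helps to filter for discovery-related error lines only. A minimal sketch, run here against two sample log lines saved to /tmp/prom.log (in practice you would point grep at your actual Prometheus log or pipe `kubectl logs` into it):

```shell
# Two sample Prometheus log lines (illustrative, modeled on the format above).
cat <<'EOF' > /tmp/prom.log
level=debug ts=2023-05-15T10:23:45.123Z component="discovery manager scrape" discovery=kubernetes msg="kubernetes service discovery started"
level=error ts=2023-05-15T10:24:12.456Z component="discovery manager scrape" discovery=kubernetes msg="error retrieving pod objects" err="no such host"
EOF

# Keep only error-level lines emitted by the discovery manager.
grep 'level=error' /tmp/prom.log | grep 'discovery manager'
```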
Step 4: Check Network Connectivity
Service discovery mechanisms often rely on external APIs, so network connectivity is crucial.
Testing API Connectivity
For Kubernetes service discovery, test connectivity to the Kubernetes API:
curl -k -v https://kubernetes.default.svc:443/api/v1/pods
For file-based discovery, ensure the files are accessible:
cat /path/to/targets.json
For DNS-based discovery, test DNS resolution:
dig SRV _prometheus._tcp.example.com
Checking Authorization
Many service discovery mechanisms require proper authorization. Verify that Prometheus has the necessary permissions.
For Kubernetes:
# Test token permissions
kubectl auth can-i list pods --as=system:serviceaccount:monitoring:prometheus
Expected output for proper permissions:
yes
Step 5: Troubleshoot Specific Service Discovery Methods
Let's look at troubleshooting steps for common service discovery mechanisms.
Kubernetes Service Discovery
Common issues with Kubernetes service discovery include:
- RBAC permissions: Ensure the Prometheus service account has the necessary permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
- Label selection: Check if your relabel configurations match the actual labels on your resources
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
Verify pod annotations:
kubectl get pods -n your-namespace -o jsonpath='{.items[*].metadata.annotations}'
File-Based Service Discovery
For file-based discovery, check:
- File permissions: Ensure Prometheus can read the file
ls -la /path/to/file_sd_config.json
- File format: Validate JSON format
jq . /path/to/file_sd_config.json
Example of a valid file_sd config:
[
  {
    "targets": ["host1:9100", "host2:9100"],
    "labels": {
      "env": "production",
      "job": "node-exporter"
    }
  }
]
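Beyond `jq .`, you can assert the structural shape file_sd expects: a top-level array whose elements each carry a `targets` array. A small sketch, run against a sample file written to the hypothetical path /tmp/file_sd.json:

```shell
# Write a sample file_sd file (same shape as the example above).
cat <<'EOF' > /tmp/file_sd.json
[
  {"targets": ["host1:9100", "host2:9100"],
   "labels": {"env": "production", "job": "node-exporter"}}
]
EOF

# Assert the file_sd shape: a top-level array in which every element
# has a "targets" key whose value is an array. jq -e sets the exit code
# from the result, so this works in scripts and CI checks too.
jq -e 'type == "array" and all(.[]; (.targets | type) == "array")' /tmp/file_sd.json \
  && echo "file_sd shape OK"
```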
DNS-Based Service Discovery
For DNS discovery issues:
- DNS resolution: Verify SRV records are correctly set up
dig SRV _prometheus._tcp.example.com
Expected output with proper DNS configuration:
;; ANSWER SECTION:
_prometheus._tcp.example.com. 86400 IN SRV 0 5 9100 node1.example.com.
_prometheus._tcp.example.com. 86400 IN SRV 0 5 9100 node2.example.com.
- DNS configuration in Prometheus:
scrape_configs:
  - job_name: 'dns-discovery'
    dns_sd_configs:
      - names:
          - '_prometheus._tcp.example.com'
        type: 'SRV'
Step 6: Troubleshooting Target Relabeling
Relabeling rules can sometimes cause targets to be dropped unintentionally.
Understanding Relabeling Actions
Key relabeling actions that can cause targets to disappear:
- keep: keeps targets for which the regex matches the concatenated source_labels
- drop: drops targets for which the regex matches the concatenated source_labels
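Note that Prometheus fully anchors relabeling regexes, so `regex: true` matches only the exact string true, not a value that merely contains it. The shell sketch below mimics a keep action over some made-up sample values of the scrape annotation, using grep's whole-line match as the anchored-regex analogue:

```shell
# Mimic `action: keep` with `regex: true` over sample values of
# __meta_kubernetes_pod_annotation_prometheus_io_scrape.
# grep -x requires the whole line to match, like Prometheus's anchored regex,
# so "truthy" is dropped even though it contains "true".
printf '%s\n' true false truthy true | grep -x -E 'true'
```

Only the two exact `true` values survive, which is why a pod annotated with anything other than the literal string "true" is dropped by this rule.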
Debug Relabeling Rules
To debug relabeling, temporarily remove or simplify rules and observe the effect on discovered targets.
Before relabeling (complex rules):
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_pod_label_app]
    action: keep
    regex: important-app
Simplified for debugging:
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
Inspecting Original Target Labels
To understand why relabeling might be dropping targets, inspect the original labels before relabeling.
Enable debug logging and look for entries like:
level=debug ts=2023-05-15T12:34:56.789Z caller=target.go:489 msg="before relabeling" lset="target=10.0.1.2:9090,__meta_kubernetes_pod_annotation_prometheus_io_scrape=false"
This shows the target had __meta_kubernetes_pod_annotation_prometheus_io_scrape=false, which would cause it to be dropped by a keep action looking for true.
Step 7: Advanced Debugging Techniques
For persistent issues, try these advanced debugging techniques:
Using the up Metric for Missing Targets
Query the up metric to see which targets are being successfully scraped:
up
This returns 1 for targets that are up and 0 for targets that were discovered but failed their last scrape.
Checking Service Discovery Metrics
Prometheus exposes metrics about its own service discovery process:
prometheus_sd_discovered_targets
This shows the count of targets discovered by each service discovery mechanism.
prometheus_target_scrape_pool_targets
This shows the number of targets in each scrape pool.
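These self-monitoring metrics are also visible on Prometheus's own /metrics endpoint. The sketch below extracts the per-config discovered-target counts from a sample /metrics excerpt (the config names and counts are illustrative); in practice `curl -s http://your-prometheus-instance:9090/metrics` would supply the input:

```shell
# Sample excerpt of Prometheus's own /metrics output (illustrative values).
cat <<'EOF' > /tmp/metrics.txt
prometheus_sd_discovered_targets{config="kubernetes-pods",name="scrape"} 12
prometheus_sd_discovered_targets{config="node-exporter",name="scrape"} 3
EOF

# Split on braces, quotes, and spaces so $3 is the config name and the
# last field is the sample value, then print "config count" per line.
awk -F'[{}" ]+' '/^prometheus_sd_discovered_targets/ {print $3, $NF}' /tmp/metrics.txt
```

A config showing 0 here tells you the discovery mechanism itself found nothing, before any relabeling was applied, which distinguishes a discovery problem from a relabeling problem.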
Creating a Test Configuration
Create a minimal test configuration focusing only on the problematic service discovery method:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'test-discovery'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
Run this with debug logging enabled to isolate the issue.
Common Solutions to Service Discovery Problems
Here are solutions to frequently encountered service discovery issues:
Kubernetes Service Discovery Issues
- No targets found
  - Check RBAC permissions
  - Verify Prometheus is running in-cluster or has proper kubeconfig
  - Ensure pods have necessary annotations or labels
- Targets found but not scraped
  - Check pod/service network policies
  - Verify port configurations
  - Check for restrictive relabeling rules
File-Based Service Discovery Issues
- No targets from file_sd_configs
  - Check file permissions
  - Validate JSON format
  - Ensure file path is correct
  - Check for dynamic file updates if using a generator
DNS-Based Service Discovery Issues
- No targets from dns_sd_configs
  - Verify DNS records exist
  - Check DNS server connectivity
  - Ensure correct record type (A, AAAA, SRV)
  - Check for DNS timeouts in Prometheus logs
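Given SRV answers like the ones shown earlier, you can derive the host:port pairs Prometheus should end up scraping. The sketch below converts sample records in `dig +short SRV` format (priority, weight, port, target); in a live check, `dig +short SRV _prometheus._tcp.example.com` would supply the input:

```shell
# `dig +short SRV` prints: priority weight port target
# Turn each record into a host:port scrape target, stripping the trailing dot
# that DNS appends to fully qualified names.
printf '%s\n' \
  '0 5 9100 node1.example.com.' \
  '0 5 9100 node2.example.com.' |
awk '{host = $4; sub(/\.$/, "", host); print host ":" $3}'
```

Comparing this list against the Targets page quickly shows whether a gap is a DNS problem (record missing here) or a Prometheus problem (record present here but target absent in the UI).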
Practical Example: Troubleshooting Kubernetes Service Discovery
Let's walk through a complete example of troubleshooting Kubernetes service discovery:
Scenario
You've deployed Prometheus in your Kubernetes cluster, but it's not discovering any pod targets despite having the correct configuration.
Step 1: Check Configuration
Your configuration looks like this:
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
Validate with promtool:
promtool check config prometheus.yml
# Output: prometheus.yml SUCCESS
Step 2: Check Discovered Targets
Check the Targets page in Prometheus UI or use the API:
curl http://prometheus:9090/api/v1/targets | jq '.data.activeTargets | length'
# Output: 0
No targets are discovered.
Step 3: Enable Debug Logging
Modify Prometheus deployment to enable debug logging:
args:
  - "--config.file=/etc/prometheus/prometheus.yml"
  - "--log.level=debug"
Check logs:
level=debug ts=2023-05-16T08:45:12.345Z caller=kubernetes.go:267 component="discovery manager scrape" discovery=kubernetes msg="kubernetes service discovery started"
level=error ts=2023-05-16T08:45:13.456Z caller=kubernetes.go:321 component="discovery manager scrape" discovery=kubernetes msg="error retrieving pod objects" err="pods is forbidden: User \"system:serviceaccount:monitoring:prometheus\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
This indicates a permissions issue.
Step 4: Fix RBAC Permissions
Create proper RBAC configuration:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
Apply the configuration:
kubectl apply -f prometheus-rbac.yaml
Step 5: Check If Targets Are Found
After fixing permissions, check targets again:
curl http://prometheus:9090/api/v1/targets | jq '.data.activeTargets | length'
# Output: 0
Still no targets.
Step 6: Check Pod Annotations
Check if any pods have the required annotation:
kubectl get pods --all-namespaces -o json | jq '.items[] | select(.metadata.annotations."prometheus.io/scrape" == "true") | .metadata.name'
# Output: (empty)
No pods have the annotation.
Step 7: Apply Annotations to Test Pod
Create a test pod with the correct annotation:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  containers:
    - name: test-container
      image: nginx
      ports:
        - containerPort: 8080
Apply the configuration:
kubectl apply -f test-pod.yaml
Step 8: Verify Discovery
Check targets again:
curl http://prometheus:9090/api/v1/targets | jq '.data.activeTargets | length'
# Output: 1
Success! The target is now discovered.
Summary
In this guide, we covered:
- Common service discovery issues in Prometheus
- A systematic approach to troubleshooting these issues
- Specific techniques for debugging different service discovery mechanisms
- How to use Prometheus logs and metrics to identify problems
- Solutions to frequently encountered issues
- A practical example of troubleshooting Kubernetes service discovery
Remember that effective troubleshooting requires a methodical approach. Start with configuration validation, move to checking connectivity and permissions, and use debug logging to gain insights into issues. Understanding the specific requirements of each service discovery mechanism is crucial for successful monitoring.
Additional Resources
- Prometheus Configuration Documentation
- Prometheus Service Discovery Documentation
- Kubernetes Service Discovery in Prometheus
- Prometheus Relabeling Documentation
Exercises
- Set up a test environment with Prometheus and a misconfigured service discovery mechanism. Practice troubleshooting to identify and fix the issue.
- Create a custom relabeling configuration that filters targets based on specific criteria. Test how changes affect target discovery.
- Implement multiple service discovery mechanisms in a single Prometheus instance and troubleshoot potential conflicts.
- Write a shell script that automatically checks common service discovery issues and reports potential problems.