
Common Prometheus Issues

Introduction

When working with Prometheus as your monitoring solution, you may encounter various challenges that can affect your metrics collection, storage, and querying capabilities. This guide explores the most common issues faced by Prometheus users and provides practical solutions to resolve them efficiently. Whether you're experiencing issues with scraping targets, query performance, or storage limitations, this comprehensive troubleshooting guide will help you maintain a healthy Prometheus environment.

Issue 1: Target Scraping Failures

One of the most common issues in Prometheus is failing to scrape metrics from targets.

Symptoms

  • Targets showing as "down" in the Prometheus targets page
  • Missing metrics for specific services
  • Error messages in Prometheus logs related to scraping
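
Before digging into specific causes, you can list failing targets directly in PromQL via the automatically generated `up` metric (1 if the last scrape succeeded, 0 otherwise):

```promql
# All targets whose most recent scrape failed
up == 0

# Number of failing targets per job
count by (job) (up == 0)
```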

Common Causes and Solutions

Network Connectivity Issues

If Prometheus cannot reach your targets due to network problems:

```bash
# Check connectivity from the Prometheus server to the target
curl -v http://target-host:port/metrics
```

If the curl command fails, check for:

  • Firewall rules blocking connections
  • Network segmentation issues
  • Incorrect target URL configuration

Authentication Failures

For targets requiring HTTP basic authentication (in production, prefer `password_file` over an inline password so credentials stay out of prometheus.yml):

```yaml
scrape_configs:
  - job_name: 'secured-endpoint'
    basic_auth:
      username: 'prometheus'
      password: 'secret_password'
    static_configs:
      - targets: ['localhost:8080']
```

TLS Certificate Issues

When connecting to HTTPS endpoints:

```yaml
scrape_configs:
  - job_name: 'https-endpoint'
    scheme: https
    tls_config:
      ca_file: /path/to/ca.crt
      cert_file: /path/to/client.crt
      key_file: /path/to/client.key
      insecure_skip_verify: false
    static_configs:
      - targets: ['secure-host:443']
```

Timeouts During Scraping

Adjust timeout settings for slow endpoints. Note that `scrape_timeout` must not exceed `scrape_interval`, or Prometheus will reject the configuration:

```yaml
scrape_configs:
  - job_name: 'slow-endpoint'
    scrape_interval: 30s
    scrape_timeout: 15s
    static_configs:
      - targets: ['slow-host:8080']
```

Issue 2: High Cardinality Problems

High cardinality occurs when a metric has a large number of label combinations, leading to performance issues.

Symptoms

  • Slow query performance
  • High memory usage
  • Storage size growing rapidly
  • "scrape_samples_exceeded" errors

Analyzing Cardinality

Check the cardinality of your metrics:

```promql
count by (__name__)({__name__=~".+"})
```

For a specific metric:

```promql
count by (__name__) ({__name__="http_requests_total"})
```
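
To rank metrics by series count and find the worst offenders:

```promql
# Top 10 metric names by number of time series
topk(10, count by (__name__)({__name__=~".+"}))
```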

Solutions

Reduce Label Usage

Before:

```go
httpRequestsTotal := prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{"method", "path", "status", "user_id", "session_id"}, // Too many labels
)
```

After:

```go
httpRequestsTotal := prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{"method", "status_code"}, // Reduced labels
)
```

Use label_replace in Queries Instead

Rather than exporting extra labels from your application, you can sometimes derive the grouping you need at query time with label_replace:

```promql
sum by (job, instance, method) (
  label_replace(http_request_duration_seconds_count, "method", "$1", "path", ".*/api/(.*)/.*")
)
```

Configure Metric Relabeling

```yaml
scrape_configs:
  - job_name: 'high-cardinality-service'
    static_configs:
      - targets: ['service:8080']
    metric_relabel_configs:
      # Drop an entire metric by name
      - source_labels: [__name__]
        regex: 'http_requests_total'
        action: drop
      # Drop a high-cardinality label from all metrics
      # (labeldrop matches label names against regex, not source_labels)
      - regex: 'user_id'
        action: labeldrop
```
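
As an additional safety net, you can cap how many samples Prometheus will accept from a single scrape with `sample_limit`; the limit value below is an assumption you should tune for your environment:

```yaml
scrape_configs:
  - job_name: 'high-cardinality-service'
    sample_limit: 10000  # the scrape is treated as failed if the target exposes more samples than this
    static_configs:
      - targets: ['service:8080']
```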

Issue 3: Storage and Retention Issues

Managing Prometheus's storage efficiently is crucial for long-term operation.

Symptoms

  • Disk space alerts
  • Gaps in historical data
  • Slow startup time
  • Queries for older data failing

Managing Storage Space

Configure Storage Retention

Data retention and the storage path are set with command-line flags rather than in prometheus.yml (the configuration file's `storage:` block does not cover retention):

```bash
prometheus --storage.tsdb.path=/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB
```

When both limits are set, whichever threshold is reached first causes the oldest data to be removed.
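
To see which metrics are actually consuming space, query the TSDB status endpoint (the same information is exposed in the web UI's TSDB status page):

```bash
# Returns head-block statistics, including series counts per metric name
curl -s http://localhost:9090/api/v1/status/tsdb
```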

Storage Optimization

When running out of space, you can:

  1. Reduce scrape frequency for less critical targets.
  2. Implement downsampling with recording rules:

```yaml
groups:
  - name: downsampling
    interval: 5m
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

  3. Consider using remote storage for long-term data retention:

```yaml
remote_write:
  - url: "http://remote-storage-server:9201/write"

remote_read:
  - url: "http://remote-storage-server:9201/read"
```

Issue 4: Query Performance Problems

Slow PromQL queries can impact the usability of your dashboards and alerting.

Symptoms

  • Dashboards loading slowly
  • Query timeouts
  • High CPU usage during queries

Query Optimization Techniques

Avoid Using Large Time Ranges

Instead of:

```promql
rate(http_requests_total[30d])
```

Use:

```promql
rate(http_requests_total[5m])
```

Use Subqueries Carefully

Subqueries re-evaluate the inner expression at every resolution step, which becomes expensive over long ranges. Inefficient:

```promql
max_over_time(rate(http_requests_total[5m])[1h:])
```
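
A cheaper pattern is to precompute the inner rate with a recording rule and apply the outer function to the recorded series, for example reusing the job:http_requests_total:rate5m rule from the downsampling example above:

```promql
# Reads pre-aggregated samples instead of recomputing rate() at every step
max_over_time(job:http_requests_total:rate5m[1h])
```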

Use Recording Rules for Complex Queries

In prometheus.yml:

```yaml
rule_files:
  - "recording_rules.yml"
```

In recording_rules.yml:

```yaml
groups:
  - name: http_requests
    interval: 1m
    rules:
      - record: job:http_request_errors:rate5m
        expr: sum by (job) (rate(http_request_errors_total[5m]))
```

Then query the pre-computed value:

```promql
job:http_request_errors:rate5m{job="api-server"}
```

Issue 5: Alertmanager Configuration Problems

Misconfigurations in Prometheus and Alertmanager commonly lead to missed, duplicated, or misrouted alerts.

Symptoms

  • Missing alerts
  • Duplicate alerts
  • Incorrect alert routing
  • Alerts firing without any notification being delivered

Common Alertmanager Issues

Alert Not Firing

Check if the alert rule is correctly defined:

```yaml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate:5m > 0.5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} for {{ $labels.job }}"
```

Verify the alert is in Prometheus's "Alerts" tab and check its status.
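
You can also inspect rule evaluation and active alerts over Prometheus's HTTP API:

```bash
# Rule groups, their health, and last evaluation errors
curl -s http://localhost:9090/api/v1/rules

# Alerts currently pending or firing in Prometheus
curl -s http://localhost:9090/api/v1/alerts
```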

Alertmanager Not Receiving Alerts

Check Prometheus configuration:

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
```

Incorrect Alert Routing

Review your alertmanager.yml configuration:

```yaml
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'team-emails'
  routes:
    - match:
        severity: critical
      receiver: 'pager'
      continue: true

receivers:
  - name: 'team-emails'
    email_configs:
      - to: '[email protected]'
  - name: 'pager'
    pagerduty_configs:
      - service_key: '<key>'
```
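
If you have amtool (shipped with Alertmanager) available, you can validate the configuration and check where a given label set would be routed; the label values below are just examples:

```bash
# Validate alertmanager.yml syntax and semantics
amtool check-config alertmanager.yml

# Show which receiver(s) this label set would be routed to
amtool config routes test --config.file=alertmanager.yml severity=critical alertname=HighErrorRate
```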

To debug routing and delivery end to end, push a synthetic alert to the Alertmanager API. The endpoint expects a JSON array of alerts (on older versions that still serve the v1 API, use /api/v1/alerts instead):

```bash
curl -X POST -H "Content-Type: application/json" -d '[
  {
    "labels": {
      "alertname": "TestAlert",
      "service": "test-service",
      "severity": "critical",
      "instance": "test-instance"
    },
    "annotations": {
      "summary": "Test alert"
    }
  }
]' http://localhost:9093/api/v2/alerts
```

Issue 6: Federation and Remote Storage Problems

Issues related to scaling Prometheus with federation or remote storage.

Symptoms

  • Missing data in federated Prometheus
  • Errors in remote_write or remote_read
  • Performance degradation
  • Increased network traffic

Federation Issues

Configuration for Federation

On the Prometheus server that performs the federation (the one scraping the /federate endpoint of the source servers):

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'source-prometheus-1:9090'
          - 'source-prometheus-2:9090'
```

Common issues include:

  • Firewall blocking federation connections
  • Source Prometheus unavailable
  • Match parameters not correctly defined
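
To verify the source side independently of the federating server, query a source Prometheus's /federate endpoint directly with the same match[] selector:

```bash
# Should return the matching series in the text exposition format
curl -G 'http://source-prometheus-1:9090/federate' \
  --data-urlencode 'match[]={job="prometheus"}'
```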

Remote Storage Issues

In prometheus.yml:

```yaml
remote_write:
  - url: "http://remote-write-endpoint/api/v1/write"
    name: remote_storage
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'temp.*'
        action: drop
```

Troubleshoot with:

```bash
# Check if remote endpoint is reachable
curl -v http://remote-write-endpoint/api/v1/write

# Check Prometheus logs for remote_write errors
grep "remote_write" /path/to/prometheus/logs
```

Issue 7: Service Discovery Problems

Issues related to dynamic target discovery.

Symptoms

  • Missing targets
  • Stale targets not being removed
  • Too many targets being discovered
  • Inconsistent labeling of targets

Common Service Discovery Issues

File-based SD Issues

Check your file_sd_configs file format:

```json
[
  {
    "targets": ["host1:9100", "host2:9100"],
    "labels": {
      "env": "production",
      "job": "node"
    }
  }
]
```
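
Also confirm that the scrape job actually points at the file; a minimal sketch (the job name and file path are assumptions for illustration):

```yaml
scrape_configs:
  - job_name: 'node'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'  # assumed location of the target file(s)
        refresh_interval: 5m
```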

Kubernetes SD Issues

For Kubernetes service discovery:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```
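
To see what service discovery actually produced, including targets filtered out by relabeling, inspect the targets endpoint (the same data is shown on the /targets page):

```bash
# "activeTargets" lists scraped targets; "droppedTargets" lists those removed by relabeling
curl -s http://localhost:9090/api/v1/targets
```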

Verify RBAC permissions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
```

Debugging Techniques

Using Prometheus Debug Endpoints

Prometheus exposes several debug endpoints:

  • /debug/pprof/: Runtime profiling data
  • /flags (or the /api/v1/status/flags API): Command-line flag information
  • /metrics: Prometheus's own metrics

```bash
# Check Prometheus's own metrics
curl http://localhost:9090/metrics | grep scrape_duration_seconds
```

Log Analysis

Check Prometheus logs for errors:

```bash
# For systemd-based systems
journalctl -u prometheus.service -f

# For container deployments
kubectl logs -f prometheus-pod
# or
docker logs -f prometheus-container
```

Look for common error patterns like:

  • "error scraping target"
  • "failed to evaluate rule"
  • "WAL corruption"

Analyzing Prometheus Configuration

Validate your configuration:

```bash
promtool check config prometheus.yml
```

Test alert rules:

```bash
promtool check rules alerts.yml
```
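
`promtool check rules` only validates syntax. To check that an alert actually fires against synthetic data, you can write a unit test and run it with `promtool test rules`. A minimal sketch, assuming alerts.yml defines an InstanceDown alert with `for: 5m` and no extra labels (if your rule adds labels such as severity, include them under exp_labels):

```yaml
# tests.yml — run with: promtool test rules tests.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    # Simulate a target that is down for the whole test window
    input_series:
      - series: 'up{job="api-server", instance="host:9100"}'
        values: '0 0 0 0 0 0 0 0 0 0 0'
    alert_rule_test:
      - eval_time: 10m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              job: api-server
              instance: host:9100
```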

Test metrics:

```bash
promtool query instant http://localhost:9090 'up'
```

Advanced Troubleshooting Workflows

Process for Diagnosing Prometheus Issues

A systematic approach works from the outside in: confirm the target is up on the /targets page, check the Prometheus logs for scrape or rule-evaluation errors, validate the configuration with promtool, inspect Prometheus's own metrics for cardinality and storage pressure, and only then drill into individual queries and alert routing.

Summary

This guide has covered the most common issues encountered when working with Prometheus:

  1. Target Scraping Failures: Network, authentication, and timeout issues
  2. High Cardinality Problems: Identifying and mitigating excessive label combinations
  3. Storage and Retention Issues: Managing disk space and implementing retention policies
  4. Query Performance Problems: Optimizing PromQL queries and using recording rules
  5. Alert Manager Configuration Problems: Ensuring alerts are correctly defined and routed
  6. Federation and Remote Storage Problems: Scaling Prometheus effectively
  7. Service Discovery Problems: Ensuring dynamic targets are correctly discovered

By systematically addressing these issues using the techniques and solutions provided, you can maintain a reliable and efficient Prometheus monitoring system.

Additional Resources

Exercises

  1. Target Discovery Debugging: Set up a Prometheus instance with file-based service discovery, intentionally introduce an error in the configuration, and practice debugging the issue.

  2. Cardinality Analysis: Write a PromQL query to identify the metrics with the highest cardinality in your Prometheus instance and develop a plan to mitigate potential issues.

  3. Alert Rule Testing: Create an alert rule with the following conditions: trigger when any instance has been down for more than 5 minutes, but only during business hours (9 AM to 5 PM). Test the rule using promtool.

  4. Storage Optimization: Analyze your current Prometheus storage usage and implement a retention strategy that balances historical data availability with storage constraints.


