Remote Storage Troubleshooting
Introduction
Prometheus excels at collecting and storing metrics, but its local storage is designed for operational use rather than long-term archival. For extended retention, higher availability, or a more scalable monitoring infrastructure, Prometheus offers remote storage integrations. These integrations can, however, encounter issues that require troubleshooting.
This guide will help you understand, diagnose, and resolve common problems with Prometheus remote storage configurations, helping ensure your metrics are properly stored and accessible for the long term.
Understanding Remote Storage in Prometheus
Before diving into troubleshooting, let's understand how remote storage works in Prometheus.
Prometheus can be configured to write samples to remote storage systems and read from them using remote read and write endpoints. This allows Prometheus to:
- Store data long-term in external systems
- Scale horizontally beyond a single Prometheus server
- Integrate with existing data infrastructure
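In prometheus.yml, the two directions are configured separately via remote_write and remote_read; a minimal sketch (the endpoint URLs are placeholders for your remote storage system):

```yaml
# Forward all scraped samples to remote storage
remote_write:
  - url: "https://remote-endpoint/api/v1/write"

# Allow PromQL queries to fetch historical data back from remote storage
remote_read:
  - url: "https://remote-endpoint/api/v1/read"
```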
Common Remote Storage Issues and Solutions
Connection Problems
Symptoms:
- Error logs showing connection failures
- Missing data in remote storage
- Error messages like connection refused or timeout
Troubleshooting Steps:
- Verify network connectivity:
# Test basic connectivity
curl -v http://remote-storage-endpoint:port/api/v1/write
# Check if the endpoint is reachable from the Prometheus server
telnet remote-storage-endpoint port
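The checks above can be wrapped in a small helper. The URL is a placeholder; the example call deliberately targets a closed local port to demonstrate the failure path:

```shell
# Report whether a remote write endpoint accepts connections at all.
# Any HTTP response (even 400/405 to an empty request) means the network
# path works; "unreachable" points at DNS, firewall, or routing problems.
check_endpoint() {
  if curl -s -o /dev/null --max-time 5 "$1"; then
    echo "reachable: $1"
  else
    echo "unreachable: $1 (curl exit $?)"
  fi
}

check_endpoint "http://localhost:1/api/v1/write"  # port 1: nothing listening
```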
- Check firewall settings:
- Ensure firewall rules allow communication between Prometheus and the remote storage
- Verify any network policies in container environments like Kubernetes
- Validate TLS configuration:
- If using HTTPS, ensure certificates are valid and trusted
- Check for TLS handshake errors in logs
Solution Example:
If your Prometheus server is failing to connect to a remote endpoint, first check the configuration:
remote_write:
- url: "https://remote-write-endpoint/api/v1/write"
tls_config:
ca_file: /path/to/ca.crt
insecure_skip_verify: false
basic_auth:
username: "prometheus"
password: "secret"
Verify that:
- The URL is correct
- Certificate paths exist and are readable by Prometheus
- Authentication credentials are valid
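The file-related checks can be scripted. The path below comes from the example config above; adjust it to your deployment, and run the script as the user Prometheus runs as so the readability check is meaningful:

```shell
# Verify each given path exists and is readable by the current user.
check_files() {
  for f in "$@"; do
    if [ -r "$f" ]; then
      echo "OK: $f"
    else
      echo "MISSING or unreadable: $f"
    fi
  done
}

check_files /path/to/ca.crt
```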
Queue Management Issues
Symptoms:
- Logs showing queue is full, discarding samples
- Growing queue size metrics
- High memory usage on Prometheus
Troubleshooting Steps:
- Monitor queue metrics:
# Samples currently waiting in the queue (a gauge, so no rate() needed)
prometheus_remote_storage_samples_pending
# Timestamp of the most recently sent sample (shows how far behind sending is)
prometheus_remote_storage_queue_highest_sent_timestamp_seconds
# Samples dropped before sending, e.g. by relabeling (a counter)
rate(prometheus_remote_storage_samples_dropped_total[5m])
- Check remote write configuration parameters:
remote_write:
- url: "https://remote-write-endpoint/api/v1/write"
queue_config:
capacity: 10000
max_shards: 200
max_samples_per_send: 500
batch_send_deadline: 5s
min_backoff: 30ms
max_backoff: 5s
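Capacity and sharding interact: shards drain the queue in parallel, each sending batches of up to max_samples_per_send. A rough back-of-envelope for the throughput ceiling implied by the settings above (the round-trip latency is an assumption, not a measured value):

```shell
# Rough throughput ceiling implied by the queue_config example above.
MAX_SHARDS=200
MAX_SAMPLES_PER_SEND=500
# Assumption: each shard keeps one request in flight and a round trip
# takes about 100 ms, i.e. roughly 10 requests per second per shard.
REQUESTS_PER_SEC_PER_SHARD=10

CEILING=$(( MAX_SHARDS * MAX_SAMPLES_PER_SEND * REQUESTS_PER_SEC_PER_SHARD ))
echo "approximate ceiling: ${CEILING} samples/second"
```

If your scrape volume approaches this ceiling, raising max_shards (given spare CPU) usually helps more than raising capacity, which only buys buffering time.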
Solution:
If the queue is consistently full:
- Increase queue capacity:
queue_config:
  capacity: 50000  # Increased from the default
- Adjust sharding to process more samples in parallel:
queue_config:
  max_shards: 300  # Increased based on available CPUs
- Balance the batch size and send deadline:
queue_config:
  max_samples_per_send: 1000
  batch_send_deadline: 10s
Performance and Capacity Issues
Symptoms:
- High CPU and memory usage
- Slow query performance
- Increasing backlog of samples
Troubleshooting with Prometheus Metrics:
Monitor these metrics to identify bottlenecks:
# Remote storage operation latency
rate(prometheus_remote_storage_sent_batch_duration_seconds_sum[5m]) / rate(prometheus_remote_storage_sent_batch_duration_seconds_count[5m])
# Samples read in from the WAL and samples that failed to send
rate(prometheus_remote_storage_samples_in_total[5m])
rate(prometheus_remote_storage_samples_failed_total[5m])
# Current shard count and queued samples
prometheus_remote_storage_shards
prometheus_remote_storage_samples_pending
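These queue metrics are also good alerting inputs. A sketch of a rule file (the alert name, threshold, and 10-minute window are illustrative choices, not official recommendations):

```yaml
groups:
  - name: remote-storage
    rules:
      # Fires when samples have been failing to send for a sustained period
      - alert: PrometheusRemoteWriteFailures
        expr: rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Remote write to {{ $labels.url }} is failing"
```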
Solutions:
- Set a request timeout, size the queue, and drop high-cardinality metrics before sending:
remote_write:
- url: "https://remote-write-endpoint/api/v1/write"
remote_timeout: 30s
queue_config:
capacity: 100000
write_relabel_configs:
# Drop high-cardinality metrics
- source_labels: [__name__]
regex: 'test_metric_to_drop.*'
action: drop
- Use write_relabel_configs to reduce data volume:
write_relabel_configs:
# Keep only important metrics for long-term storage
- source_labels: [__name__]
regex: 'node_cpu.*|node_memory.*|http_requests_total'
action: keep
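A quick way to preview the effect of that keep rule without touching Prometheus: relabeling regexes are fully anchored (RE2), which grep -E approximates closely enough for simple patterns like this one. The sample metric names are illustrative:

```shell
# Preview which metric names the keep-regex above would retain.
# Prometheus anchors relabeling regexes, emulated here with ^(...)$.
REGEX='node_cpu.*|node_memory.*|http_requests_total'

for m in node_cpu_seconds_total node_memory_MemFree_bytes http_requests_total http_request_duration_seconds; do
  if echo "$m" | grep -Eq "^(${REGEX})$"; then
    echo "keep: $m"
  else
    echo "drop: $m"
  fi
done
```

Note how http_request_duration_seconds is dropped even though http_requests_total matches: anchoring means the regex must cover the whole metric name.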
Authentication and Authorization Issues
Symptoms:
- HTTP 401 or 403 errors in logs
- Remote storage rejecting writes
- Authentication-related error messages
Troubleshooting Steps:
- Check credentials in Prometheus configuration:
remote_write:
- url: "https://remote-write-endpoint/api/v1/write"
basic_auth:
username: "prometheus"
password_file: /etc/prometheus/remote_storage_password
- Verify token-based authentication:
remote_write:
- url: "https://remote-write-endpoint/api/v1/write"
bearer_token_file: /path/to/bearer/token  # newer Prometheus versions prefer authorization: credentials_file
- Test authentication manually:
# Using basic auth
curl -v -u username:password https://remote-write-endpoint/api/v1/write
# Using bearer token
curl -v -H "Authorization: Bearer $(cat /path/to/token)" https://remote-write-endpoint/api/v1/write
Solution Example:
If using an OAuth2 token that needs regular refresh:
remote_write:
- url: "https://remote-write-endpoint/api/v1/write"
oauth2:
client_id: "prometheus"
client_secret: "secret"
token_url: "https://auth-provider/token"
scopes: ["write:metrics"]
Inconsistent Data or Missing Samples
Symptoms:
- Gaps in graphs when querying from remote storage
- Different results when querying local vs. remote data
- Missing recent data
Troubleshooting Steps:
- Check for sample filtering and relabeling:
- Review write_relabel_configs to ensure metrics aren't being dropped unintentionally
- Verify time synchronization:
- Ensure NTP is properly configured on all systems
- Check timestamp-related metrics
- Check read preferences:
remote_read:
- url: "https://remote-read-endpoint/api/v1/read"
read_recent: true # When true, recent data is also fetched from remote storage instead of only the local TSDB
Practical Example:
To diagnose data inconsistency between local and remote storage:
# Samples read from the WAL vs. samples successfully sent to remote storage
# (prometheus_remote_storage_samples_total carries remote_name and url labels,
# so they must be ignored when subtracting)
prometheus_remote_storage_samples_in_total - ignoring(remote_name, url) group_right() prometheus_remote_storage_samples_total
# Check for failures
rate(prometheus_remote_storage_samples_failed_total[5m])
Debugging with Log Analysis
For many remote storage issues, logs are your best diagnostic tool. Enable verbose logging when troubleshooting; note that the log level is a command-line flag, not a prometheus.yml setting:
# Start Prometheus with debug logging
prometheus --config.file=prometheus.yml --log.level=debug
Key log patterns to look for:
level=error ts=... component="remote storage" err="..."
level=warn ts=... component="remote storage" msg="Remote storage..."
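To experiment with these patterns without a live server, you can grep a saved log file. The lines below are illustrative stand-ins for real Prometheus output, following the logfmt shape shown above:

```shell
# Write a couple of stand-in log lines (illustrative, not real server output)
cat <<'EOF' > /tmp/prom-sample.log
level=info ts=2024-01-01T00:00:00Z msg="Server is ready to receive web requests."
level=error ts=2024-01-01T00:01:00Z component="remote storage" err="server returned HTTP status 500"
EOF

# Keep only remote-storage error lines
grep 'component="remote storage"' /tmp/prom-sample.log | grep 'level=error'
```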
End-to-End Testing
When troubleshooting complex remote storage setups, create a simple end-to-end test:
- Configure a simple Prometheus instance with remote storage
- Generate predictable metrics, for example by pushing a known value to a Pushgateway that Prometheus scrapes
- Verify the full path, from collection through remote storage to query
Example testing script:
#!/bin/bash
# Push a test metric to a Pushgateway (assumed to listen on localhost:9091
# and to be scraped by Prometheus)
echo 'test_metric{label="value"} 123.4' | curl --data-binary @- http://localhost:9091/metrics/job/test
# Wait for scrape and remote_write
sleep 15
# Query local Prometheus
LOCAL_VALUE=$(curl -s 'http://localhost:9090/api/v1/query?query=test_metric' | jq '.data.result[0].value[1]')
# Query remote storage directly (if API compatible)
REMOTE_VALUE=$(curl -s 'http://remote-storage:9090/api/v1/query?query=test_metric' | jq '.data.result[0].value[1]')
echo "Local value: $LOCAL_VALUE"
echo "Remote value: $REMOTE_VALUE"
if [ "$LOCAL_VALUE" = "$REMOTE_VALUE" ]; then
echo "✅ Values match"
else
echo "❌ Values don't match"
fi
Best Practices for Remote Storage
- Implement Redundancy: Configure multiple remote storage endpoints for critical metrics
- Monitor the Monitoring: Set up alerting for remote storage issues
- Rate Limiting: Implement client-side rate limiting to avoid overwhelming remote endpoints
- Data Reduction: Use relabeling to reduce data volume sent to remote storage
- Regular Testing: Periodically validate that data can be successfully queried from remote storage
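The redundancy practice above translates directly into configuration: each remote_write entry gets its own independent queue, so a failing endpoint does not block delivery to the other. A sketch (URLs are placeholders):

```yaml
remote_write:
  - url: "https://primary-remote-endpoint/api/v1/write"
    queue_config:
      max_shards: 100
  - url: "https://secondary-remote-endpoint/api/v1/write"
    queue_config:
      max_shards: 100
```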
Summary
Troubleshooting Prometheus remote storage requires understanding how the remote write and read protocols work, monitoring key metrics, and systematically testing connectivity, authentication, and data consistency.
Remember these key points:
- Monitor queue metrics to spot bottlenecks early
- Use relabeling to control what data gets sent to remote storage
- Configure appropriate retry and backoff settings
- Keep authentication credentials secure and regularly updated
- Test the complete path from collection to remote query
By following the structured approach in this guide, you'll be able to identify and resolve most common remote storage issues, ensuring your Prometheus metrics are preserved and accessible for the long term.
Additional Resources
- Prometheus Remote Storage Documentation
- Remote Write Tuning Guide
- Common Remote Storage Solutions:
- Thanos
- Cortex
- VictoriaMetrics
- Grafana Mimir
Practice Exercises
- Configure Prometheus with a local remote storage endpoint (like VictoriaMetrics) and troubleshoot connection issues
- Experiment with different queue settings and observe their impact on performance
- Implement write relabeling to selectively store only specific metrics remotely
- Set up a multi-remote-write configuration for redundancy and compare behavior during failures
- Create a dashboard to monitor the health of your remote storage integration