Remote Storage Troubleshooting
Introduction
Prometheus excels at collecting and storing metrics, but its local storage is designed for operational use rather than long-term archival. For extended retention, higher availability, or a more scalable monitoring infrastructure, Prometheus offers remote storage integrations. These integrations can, however, encounter issues that require troubleshooting.
This guide will help you understand, diagnose, and resolve common problems with Prometheus remote storage configurations, helping ensure your metrics are properly stored and accessible for the long term.
Understanding Remote Storage in Prometheus
Before diving into troubleshooting, let's understand how remote storage works in Prometheus.
Prometheus can be configured to write samples to remote storage systems and read from them using remote read and write endpoints. This allows Prometheus to:
- Store data long-term in external systems
- Scale horizontally beyond a single Prometheus server
- Integrate with existing data infrastructure
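In prometheus.yml, the two directions are configured separately via remote_write and remote_read; a minimal sketch (the endpoint URLs are placeholders for your remote storage system):

```yaml
# Forward all scraped samples to remote storage
remote_write:
  - url: "https://remote-endpoint/api/v1/write"

# Allow PromQL queries to fetch historical data back from remote storage
remote_read:
  - url: "https://remote-endpoint/api/v1/read"
```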
Common Remote Storage Issues and Solutions
Connection Problems
Symptoms:
- Error logs showing connection failures
- Missing data in remote storage
- Error messages like connection refused or timeout
Troubleshooting Steps:
- Verify network connectivity:
# Test basic connectivity
curl -v http://remote-storage-endpoint:port/api/v1/write
# Check if the endpoint is reachable from the Prometheus server
telnet remote-storage-endpoint port
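The checks above can be wrapped in a small helper. The URL is a placeholder; the example call deliberately targets a closed local port to demonstrate the failure path:

```shell
# Report whether a remote write endpoint accepts connections at all.
# Any HTTP response (even 400/405 to an empty request) means the network
# path works; "unreachable" points at DNS, firewall, or routing problems.
check_endpoint() {
  if curl -s -o /dev/null --max-time 5 "$1"; then
    echo "reachable: $1"
  else
    echo "unreachable: $1 (curl exit $?)"
  fi
}

check_endpoint "http://localhost:1/api/v1/write"  # port 1: nothing listening
```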
- Check firewall settings:
- Ensure firewall rules allow communication between Prometheus and the remote storage
- Verify any network policies in container environments like Kubernetes
- Validate TLS configuration:
- If using HTTPS, ensure certificates are valid and trusted
- Check for TLS handshake errors in logs
Solution Example:
If your Prometheus server is failing to connect to a remote endpoint, first check the configuration:
remote_write:
- url: "https://remote-write-endpoint/api/v1/write"
tls_config:
ca_file: /path/to/ca.crt
insecure_skip_verify: false
basic_auth:
username: "prometheus"
password: "secret"
Verify that:
- The URL is correct
- Certificate paths exist and are readable by Prometheus
- Authentication credentials are valid
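The file-related checks can be scripted. The path below comes from the example config above; adjust it to your deployment, and run the script as the user Prometheus runs as so the readability check is meaningful:

```shell
# Verify each given path exists and is readable by the current user.
check_files() {
  for f in "$@"; do
    if [ -r "$f" ]; then
      echo "OK: $f"
    else
      echo "MISSING or unreadable: $f"
    fi
  done
}

check_files /path/to/ca.crt
```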
Queue Management Issues
Symptoms:
- Logs showing queue is full, discarding samples
- Growing queue size metrics
- High memory usage on Prometheus
Troubleshooting Steps:
- Monitor queue metrics:
# Samples currently waiting in the queue (a gauge, so no rate() needed)
prometheus_remote_storage_samples_pending
# Timestamp of the most recently sent sample (shows how far behind sending is)
prometheus_remote_storage_queue_highest_sent_timestamp_seconds
# Samples dropped before sending, e.g. by relabeling (a counter)
rate(prometheus_remote_storage_samples_dropped_total[5m])
- Check remote write configuration parameters:
remote_write:
- url: "https://remote-write-endpoint/api/v1/write"
queue_config:
capacity: 10000
max_shards: 200
max_samples_per_send: 500
batch_send_deadline: 5s
min_backoff: 30ms
max_backoff: 5s
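Capacity and sharding interact: shards drain the queue in parallel, each sending batches of up to max_samples_per_send. A rough back-of-envelope for the throughput ceiling implied by the settings above (the round-trip latency is an assumption, not a measured value):

```shell
# Rough throughput ceiling implied by the queue_config example above.
MAX_SHARDS=200
MAX_SAMPLES_PER_SEND=500
# Assumption: each shard keeps one request in flight and a round trip
# takes about 100 ms, i.e. roughly 10 requests per second per shard.
REQUESTS_PER_SEC_PER_SHARD=10

CEILING=$(( MAX_SHARDS * MAX_SAMPLES_PER_SEND * REQUESTS_PER_SEC_PER_SHARD ))
echo "approximate ceiling: ${CEILING} samples/second"
```

If your scrape volume approaches this ceiling, raising max_shards (given spare CPU) usually helps more than raising capacity, which only buys buffering time.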
Solution:
If the queue is consistently full:
- Increase queue capacity:
queue_config:
  capacity: 50000  # Increased from the default
- Adjust sharding to process more samples in parallel:
queue_config:
  max_shards: 300  # Increased based on available CPUs
- Balance the batch size and send deadline:
queue_config:
  max_samples_per_send: 1000
  batch_send_deadline: 10s
Performance and Capacity Issues
Symptoms:
- High CPU and memory usage
- Slow query performance
- Increasing backlog of samples
Troubleshooting with Prometheus Metrics:
Monitor these metrics to identify bottlenecks:
# Remote storage operation latency
rate(prometheus_remote_storage_sent_batch_duration_seconds_sum[5m]) / rate(prometheus_remote_storage_sent_batch_duration_seconds_count[5m])
# Samples read in from the WAL and samples that failed to send
rate(prometheus_remote_storage_samples_in_total[5m])
rate(prometheus_remote_storage_samples_failed_total[5m])
# Current shard count and queued samples
prometheus_remote_storage_shards
prometheus_remote_storage_samples_pending
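These queue metrics are also good alerting inputs. A sketch of a rule file (the alert name, threshold, and 10-minute window are illustrative choices, not official recommendations):

```yaml
groups:
  - name: remote-storage
    rules:
      # Fires when samples have been failing to send for a sustained period
      - alert: PrometheusRemoteWriteFailures
        expr: rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Remote write to {{ $labels.url }} is failing"
```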
Solutions:
- Set a request timeout, size the queue, and drop high-cardinality metrics before sending:
remote_write:
- url: "https://remote-write-endpoint/api/v1/write"
remote_timeout: 30s
queue_config:
capacity: 100000
write_relabel_configs:
# Drop high-cardinality metrics
- source_labels: [__name__]
regex: 'test_metric_to_drop.*'
action: drop
- Use write_relabel_configs to reduce data volume:
write_relabel_configs:
# Keep only important metrics for long-term storage
- source_labels: [__name__]
regex: 'node_cpu.*|node_memory.*|http_requests_total'
action: keep
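A quick way to preview the effect of that keep rule without touching Prometheus: relabeling regexes are fully anchored (RE2), which grep -E approximates closely enough for simple patterns like this one. The sample metric names are illustrative:

```shell
# Preview which metric names the keep-regex above would retain.
# Prometheus anchors relabeling regexes, emulated here with ^(...)$.
REGEX='node_cpu.*|node_memory.*|http_requests_total'

for m in node_cpu_seconds_total node_memory_MemFree_bytes http_requests_total http_request_duration_seconds; do
  if echo "$m" | grep -Eq "^(${REGEX})$"; then
    echo "keep: $m"
  else
    echo "drop: $m"
  fi
done
```

Note how http_request_duration_seconds is dropped even though http_requests_total matches: anchoring means the regex must cover the whole metric name.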
Authentication and Authorization Issues
Symptoms:
- HTTP 401 or 403 errors in logs
- Remote storage rejecting writes
- Authentication-related error messages
Troubleshooting Steps:
- Check credentials in Prometheus configuration:
remote_write:
- url: "https://remote-write-endpoint/api/v1/write"
basic_auth:
username: "prometheus"
password_file: /etc/prometheus/remote_storage_password
- Verify token-based authentication:
remote_write:
- url: "https://remote-write-endpoint/api/v1/write"
bearer_token_file: /path/to/bearer/token  # newer Prometheus versions prefer authorization: credentials_file
- Test authentication manually:
# Using basic auth
curl -v -u username:password https://remote-write-endpoint/api/v1/write
# Using bearer token
curl -v -H "Authorization: Bearer $(cat /path/to/token)" https://remote-write-endpoint/api/v1/write
Solution Example:
If using an OAuth2 token that needs regular refresh:
remote_write:
- url: "https://remote-write-endpoint/api/v1/write"
oauth2:
client_id: "prometheus"
client_secret: "secret"
token_url: "https://auth-provider/token"
scopes: ["write:metrics"]
Inconsistent Data or Missing Samples
Symptoms:
- Gaps in graphs when querying from remote storage
- Different results when querying local vs. remote data
- Missing recent data
Troubleshooting Steps:
- Check for sample filtering and relabeling:
- Review write_relabel_configs to ensure metrics aren't being dropped unintentionally
- Verify time synchronization:
- Ensure NTP is properly configured on all systems
- Check timestamp-related metrics
- Check read preferences:
remote_read:
- url: "https://remote-read-endpoint/api/v1/read"
read_recent: true # When true, recent data is also fetched from remote storage instead of only the local TSDB
Practical Example:
To diagnose data inconsistency between local and remote storage:
# Samples read from the WAL vs. samples successfully sent to remote storage
# (prometheus_remote_storage_samples_total carries remote_name and url labels,
# so they must be ignored when subtracting)
prometheus_remote_storage_samples_in_total - ignoring(remote_name, url) group_right() prometheus_remote_storage_samples_total
# Check for failures
rate(prometheus_remote_storage_samples_failed_total[5m])
Debugging with Log Analysis
For many remote storage issues, logs are your best diagnostic tool. Enable verbose logging when troubleshooting; note that the log level is a command-line flag, not a prometheus.yml setting:
# Start Prometheus with debug logging
prometheus --config.file=prometheus.yml --log.level=debug
Key log patterns to look for:
level=error ts=... component="remote storage" err="..."
level=warn ts=... component="remote storage" msg="Remote storage..."
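To experiment with these patterns without a live server, you can grep a saved log file. The lines below are illustrative stand-ins for real Prometheus output, following the logfmt shape shown above:

```shell
# Write a couple of stand-in log lines (illustrative, not real server output)
cat <<'EOF' > /tmp/prom-sample.log
level=info ts=2024-01-01T00:00:00Z msg="Server is ready to receive web requests."
level=error ts=2024-01-01T00:01:00Z component="remote storage" err="server returned HTTP status 500"
EOF

# Keep only remote-storage error lines
grep 'component="remote storage"' /tmp/prom-sample.log | grep 'level=error'
```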
End-to-End Testing
When troubleshooting complex remote storage setups, create a simple end-to-end test:
- Configure a simple Prometheus instance with remote storage
- Generate predictable metrics, for example by pushing a known value to a Pushgateway that Prometheus scrapes
- Verify the full path, from collection through remote storage to query
Example testing script:
#!/bin/bash
# Push a test metric to a Pushgateway (assumed to listen on localhost:9091
# and to be scraped by Prometheus)
echo 'test_metric{label="value"} 123.4' | curl --data-binary @- http://localhost:9091/metrics/job/test
# Wait for scrape and remote_write
sleep 15
# Query local Prometheus
LOCAL_VALUE=$(curl -s 'http://localhost:9090/api/v1/query?query=test_metric' | jq '.data.result[0].value[1]')
# Query remote storage directly (if API compatible)
REMOTE_VALUE=$(curl -s 'http://remote-storage:9090/api/v1/query?query=test_metric' | jq '.data.result[0].value[1]')
echo "Local value: $LOCAL_VALUE"
echo "Remote value: $REMOTE_VALUE"
if [ "$LOCAL_VALUE" = "$REMOTE_VALUE" ]; then
echo "✅ Values match"
else
echo "❌ Values don't match"
fi
Best Practices for Remote Storage
- Implement Redundancy: Configure multiple remote storage endpoints for critical metrics
- Monitor the Monitoring: Set up alerting for remote storage issues
- Rate Limiting: Implement client-side rate limiting to avoid overwhelming remote endpoints
- Data Reduction: Use relabeling to reduce data volume sent to remote storage
- Regular Testing: Periodically validate that data can be successfully queried from remote storage
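The redundancy practice above translates directly into configuration: each remote_write entry gets its own independent queue, so a failing endpoint does not block delivery to the other. A sketch (URLs are placeholders):

```yaml
remote_write:
  - url: "https://primary-remote-endpoint/api/v1/write"
    queue_config:
      max_shards: 100
  - url: "https://secondary-remote-endpoint/api/v1/write"
    queue_config:
      max_shards: 100
```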
Summary
Troubleshooting Prometheus remote storage requires understanding how the remote write and read protocols work, monitoring key metrics, and systematically testing connectivity, authentication, and data consistency.
Remember these key points:
- Monitor queue metrics to spot bottlenecks early
- Use relabeling to control what data gets sent to remote storage
- Configure appropriate retry and backoff settings
- Keep authentication credentials secure and regularly updated
- Test the complete path from collection to remote query
By following the structured approach in this guide, you'll be able to identify and resolve most common remote storage issues, ensuring your Prometheus metrics are preserved and accessible for the long term.
Additional Resources
- Prometheus Remote Storage Documentation
- Remote Write Tuning Guide
- Common Remote Storage Solutions:
- Thanos
- Cortex
- VictoriaMetrics
- Grafana Mimir
Practice Exercises
- Configure Prometheus with a local remote storage endpoint (like VictoriaMetrics) and troubleshoot connection issues
- Experiment with different queue settings and observe their impact on performance
- Implement write relabeling to selectively store only specific metrics remotely
- Set up a multi-remote-write configuration for redundancy and compare behavior during failures
- Create a dashboard to monitor the health of your remote storage integration