Monitoring Gaps
Introduction
When working with Grafana Loki, you might encounter situations where your log data is incomplete: there are gaps where expected logs are missing. These monitoring gaps can lead to incomplete visualizations, inaccurate alerts, and missing critical information during incident investigations.
In this guide, we'll explore what monitoring gaps are, why they occur, how to identify them, and most importantly, how to resolve them in your Grafana Loki setup.
What Are Monitoring Gaps?
Monitoring gaps are periods where expected log data is missing from your Loki storage. These gaps might be:
- Temporal gaps: Missing logs for specific time periods
- Source gaps: Missing logs from specific sources (e.g., a particular pod, instance, or application)
- Label gaps: Missing logs for particular label combinations
- Content gaps: Missing specific types of log entries
Common Causes of Monitoring Gaps
1. Agent Issues
- Agent downtime: Promtail, Fluentd, or other agents might crash or stop running
- Configuration errors: Improper file paths or incorrect glob patterns
- Resource constraints: Agent running out of memory or CPU
2. Network Problems
- Connectivity issues: Network interruptions between agents and Loki
- Rate limiting: Network throttling causing drops
- Timeout issues: Requests taking too long and being terminated
3. Loki Server Issues
- Service unavailability: Loki components being down
- Rate limiting: Loki rejecting logs due to ingestion limits
- Out of resources: Insufficient storage or memory
4. Data Issues
- Log rotation: Logs being rotated before collection
- Application issues: Application not generating logs during certain periods
- Log format changes: Changes breaking the parsing logic
Identifying Monitoring Gaps
Let's explore how to detect gaps in your monitoring data:
Visual Inspection
The simplest way to identify gaps is through visual inspection in Grafana dashboards:
- Create a time series visualization showing log volume
- Look for drops to zero or significant decreases
- Check if the pattern correlates with specific events (deployments, maintenance, etc.)
Using LogQL to Detect Gaps
You can use LogQL queries to identify potential gaps in your data:
sum by (instance) (count_over_time({job="myapp"}[1m]))
This query counts logs per instance in 1-minute windows. Zeros or unexpected drops indicate potential gaps.
For more sophisticated gap detection, you can use a query like:
sum by (instance) (count_over_time({job="myapp"}[5m] offset 5m))
unless
sum by (instance) (count_over_time({job="myapp"}[5m]))
This returns instances that had log lines in the previous 5-minute window but none in the most recent 5 minutes, i.e. sources that have recently gone quiet.
Implementing a Gap Detection Dashboard
Let's create a more structured approach with a dedicated gap detection dashboard:
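The exact panels depend on your labels and data sources, but a reasonable starting point is a small set of panels driven by queries like the ones below. The myapp job label is just an example, the first query assumes it runs in a Grafana panel where the $__interval variable is available, and the last one queries Loki's own metrics through a Prometheus data source:
# Panel 1: log volume per instance (watch for drops to zero)
sum by (instance) (count_over_time({job="myapp"}[$__interval]))
# Panel 2: instances that logged in the previous window but not the current one
sum by (instance) (count_over_time({job="myapp"}[15m] offset 15m))
unless
sum by (instance) (count_over_time({job="myapp"}[15m]))
# Panel 3: samples discarded by Loki, broken down by reason (Prometheus data source)
sum by (reason) (rate(loki_discarded_samples_total[5m]))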
A Practical Example: Debugging Agent Issues
Let's walk through a common scenario where Promtail (Loki's agent) is causing gaps:
Step 1: Check Promtail Status
First, verify if Promtail is running:
kubectl get pods -n loki | grep promtail
Example output:
promtail-abc12 1/1 Running 0 2d
promtail-def34 1/1 Running 0 2d
promtail-ghi56 0/1 CrashLoopBackOff 3 2d
This shows one of our Promtail pods is crashing.
Step 2: Check Promtail Logs
Examine the logs of the failing pod:
kubectl logs promtail-ghi56 -n loki
Example output:
level=error ts=2023-06-10T15:04:32.654Z caller=promtail.go:182 msg="error creating targets" error="open /var/log/pods: no such file or directory"
This shows a configuration issue - the pod can't access the expected log path.
Step 3: Fix the Configuration
Update the Promtail DaemonSet to ensure proper volume mounts:
volumes:
  - name: varlog
    hostPath:
      path: /var/log
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers
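The container spec also needs matching volumeMounts so Promtail can read those host paths inside the pod. A minimal sketch is below; the container name and read-only flags are assumptions, and the mount paths mirror the hostPath entries above:
containers:
  - name: promtail
    volumeMounts:
      # Host log directories mounted read-only into the Promtail container
      - name: varlog
        mountPath: /var/log
        readOnly: true
      - name: varlibdockercontainers
        mountPath: /var/lib/docker/containers
        readOnly: true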
Step 4: Verify the Fix
After applying the changes, check if logs are flowing again:
sum(rate({job="promtail"}[5m])) by (instance)
Real-World Scenario: Debugging Loki Ingestion Issues
Another common cause of monitoring gaps is Loki rejecting logs due to rate limiting or resource constraints.
Identifying the Issue
Check Loki's metrics to see if there are rejected samples:
sum by (reason) (rate(loki_discarded_samples_total[5m]))
Example output might show spikes in discarded samples with reasons such as "rate_limited" or "per_stream_rate_limit".
Understanding Loki's Rate Limiting
Loki implements several rate limits to protect itself:
- Global rate limits: Cap the total ingestion rate
- Per-tenant rate limits: Cap each tenant's ingestion rate
- Per-stream rate limits: Restrict ingestion for each unique label combination (stream)
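To see which of these limits is actually being hit, and for which tenant, you can break Loki's discarded-samples metric down by both labels; treat the exact label names as an assumption to verify against your Loki version:
# Discarded samples per tenant and per rejection reason
sum by (tenant, reason) (rate(loki_discarded_samples_total[5m]))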
Adjusting Rate Limits
You can modify Loki's configuration to accommodate your log volume:
limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 5MB
  per_stream_rate_limit_burst: 10MB
Using the Global Rate-Limiting Strategy
In production deployments that run more than one distributor, consider the global rate-limiting strategy: the configured per-tenant limit is enforced cluster-wide, with each distributor taking an equal share that is recalculated as distributors are added or removed:
limits_config:
  ingestion_rate_strategy: "global"
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
Preventing Monitoring Gaps
Here are strategies to minimize monitoring gaps in your Loki setup:
1. Implement Buffer and Retry Logic
Configure your agents with appropriate buffering and retry logic:
clients:
  - url: http://loki:3100/loki/api/v1/push
    backoff_config:
      min_period: 500ms
      max_period: 5m
      max_retries: 10
    batchwait: 1s
    batchsize: 1048576
2. Set Up Proper Monitoring of Your Monitoring
Monitor the health of your monitoring system itself:
- Track agent status and resource usage
- Monitor Loki component health
- Set up alerts for sudden drops in log volume (a sample rule sketch follows this list)
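As a concrete starting point for the first and last bullets, the sketch below shows Prometheus-style alert rules; the promtail job label, the scraped metric names, and the thresholds are assumptions to adapt to your environment:
groups:
  - name: log-pipeline-health
    rules:
      # Agent health: fires when a scraped Promtail target disappears
      - alert: PromtailDown
        expr: up{job="promtail"} == 0
        for: 5m
        labels:
          severity: warning
      # Ingestion volume: fires when received lines drop well below the level an hour ago
      - alert: LogVolumeDrop
        expr: |
          sum(rate(loki_distributor_lines_received_total[10m]))
            < 0.5 * sum(rate(loki_distributor_lines_received_total[10m] offset 1h))
        for: 15m
        labels:
          severity: warning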
3. Implement High Availability
For critical environments, implement redundancy (see the example after this list):
- Run multiple Promtail instances per host
- Deploy Loki in a distributed mode
- Use multiple storage backends
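One simple redundancy pattern is to have each agent push to more than one Loki endpoint: Promtail's clients section accepts multiple entries, and each client keeps its own buffer and retry state. The URLs below are placeholders:
clients:
  # Primary Loki endpoint
  - url: http://loki-primary:3100/loki/api/v1/push
  # Secondary Loki endpoint; every log line is sent to both
  - url: http://loki-secondary:3100/loki/api/v1/push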
4. Regular Validation
Implement regular validation tests:
- Inject test logs at known intervals (see the canary sketch after this list)
- Verify end-to-end log delivery
- Audit log completeness for critical systems
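A lightweight way to inject test logs is a cron-driven heartbeat. The example below assumes Promtail is already scraping syslog under a job="syslog" label; Grafana Labs also ships a dedicated Loki Canary component that automates this pattern:
# Cron entry: write a tagged heartbeat line to syslog every minute
* * * * * logger -t loki-canary "heartbeat $(date -u +%s)"
You can then alert when the heartbeat stops arriving, using a query such as:
count_over_time({job="syslog"} |= "loki-canary" [5m])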
Hands-On Exercise: Building a Gap Detection System
Let's create a simple system to detect and alert on monitoring gaps:
1. Create a LogQL query to detect gaps:
# Find instances that have stopped logging
sum by (instance) (count_over_time({job="myapp"}[15m] offset 15m))
unless
sum by (instance) (count_over_time({job="myapp"}[15m]))
2. Set up a Grafana alert with this query:
- Navigate to Grafana Alerting
- Create a new alert rule using the query above
- Set the condition to alert when the result is greater than 0
- Add appropriate notification channels
3. Create a recovery procedure document to follow when gaps are detected
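If you prefer to manage the alert as code rather than through the Grafana UI, Loki's ruler accepts Prometheus-style rule files with LogQL expressions. A minimal sketch is below; the group name, severity label, and 5-minute hold are assumptions:
groups:
  - name: gap-detection
    rules:
      - alert: InstanceStoppedLogging
        expr: |
          sum by (instance) (count_over_time({job="myapp"}[15m] offset 15m))
          unless
          sum by (instance) (count_over_time({job="myapp"}[15m]))
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} has stopped sending logs"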
Summary
Monitoring gaps in Grafana Loki can undermine the reliability of your observability platform. By understanding the common causes, implementing detection mechanisms, and following the preventive measures outlined in this guide, you can ensure your log data remains complete and reliable.
Remember these key points:
- Regularly validate your log collection pipeline
- Monitor the components of your monitoring system
- Implement proper buffering and retry mechanisms
- Set up alerts for unexpected changes in log volume
By addressing monitoring gaps proactively, you'll build a more robust observability system that provides accurate insights when you need them most.