Log Ingestion Problems
Introduction
Log ingestion is the process of collecting, parsing, and storing log data from various sources into Grafana Loki. While Loki is designed to be efficient and reliable, various issues can occur during the ingestion process that prevent logs from being properly collected, labeled, or stored.
In this guide, we'll explore common log ingestion problems in Grafana Loki, how to identify them, and most importantly, how to solve them. Whether you're seeing missing logs, experiencing performance issues, or encountering error messages, this troubleshooting guide will help you get your logging pipeline back on track.
Common Log Ingestion Problems
Let's explore the most frequent issues you might encounter when ingesting logs into Loki.
1. No Logs Appearing in Loki
One of the most common issues is when logs simply don't appear in Loki's interface even though your applications are generating them.
Potential Causes
- Agent Configuration Issues: Promtail or other log agents might be misconfigured
- Connectivity Problems: Network issues between your agents and Loki
- Rate Limiting: Loki might be throttling ingest requests
- Labels and Selectors: Incorrect query selectors when viewing logs
Diagnosing the Issue
First, check if your log agent (like Promtail) is actually sending logs:
```bash
# Check Promtail's status and logs
sudo systemctl status promtail
sudo journalctl -u promtail -f
```
Verify Promtail's configuration file has the correct targets:
```yaml
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
```
Check connectivity to your Loki instance:
```bash
# Test connection from agent to Loki
curl -v http://loki:3100/ready
```
Solution
- Check agent logs for error messages related to sending logs
- Verify configurations:
  - Ensure log paths are correct
  - Check the Loki URL and credentials
  - Validate label configurations
- Examine network connectivity between your agents and Loki
- Verify Loki is healthy by checking its ready and metrics endpoints
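To rule the agent out entirely, you can also push a test line straight to Loki's push API and then query it back. This is a minimal sketch: the `loki:3100` address and the `ingest-test` job label are placeholders for your environment, and `date +%s%N` assumes GNU date.

```bash
# Push one test line directly to Loki (a 204 status code means it was accepted)
curl -s -o /dev/null -w "%{http_code}\n" -X POST \
  -H "Content-Type: application/json" \
  "http://loki:3100/loki/api/v1/push" \
  --data-raw "{\"streams\": [{\"stream\": {\"job\": \"ingest-test\"}, \"values\": [[\"$(date +%s%N)\", \"hello from curl\"]]}]}"

# Query it back (query_range defaults to roughly the last hour if no range is given)
curl -G -s "http://loki:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="ingest-test"}' \
  --data-urlencode 'limit=5'
```

If the test line shows up but your application logs don't, the problem is almost certainly on the agent side rather than in Loki itself.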
2. Rate Limiting and Throttling Issues
Loki includes rate limiting to protect itself from being overwhelmed by log volume.
Symptoms
- Logs appearing with a delay
- Logs being dropped
- Error messages about rate limits or throttling
- Inconsistent log ingestion
Diagnosing Rate Limiting
Check Loki's logs for rate limiting messages:
```bash
# Find rate limiting messages in Loki logs
kubectl logs -f deployment/loki -n loki | grep -i "rate limit"
```
Review metrics for rate limiting:
```bash
# Query rate limiting metrics
curl -s http://loki:3100/metrics | grep -i rate_limit
```
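Loki also exposes a counter for samples it has refused; if it is climbing, logs are being dropped rather than merely delayed. The `reason` label values vary somewhat between Loki versions, so treat the exact names as illustrative.

```bash
# Discarded samples, broken down by reason (e.g. rate limiting, line too long)
curl -s http://loki:3100/metrics | grep 'loki_discarded_samples_total'
```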
Solution
- Increase limits in Loki configuration:
```yaml
limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 10MB
  per_stream_rate_limit_burst: 15MB
```
- Implement batching in your agent configuration to smooth out ingestion rates (see the sketch after this list)
- Add more Loki distributors to handle higher throughput
- Consider upgrading to a clustered Loki setup for better scaling
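Here is a hedged sketch of the Promtail client options that control batching and retries; the values are illustrative starting points rather than recommendations, and assume a reasonably recent Promtail.

```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
    # Buffer log entries for up to 5s (or until ~1 MB accumulates) before pushing,
    # so Loki sees fewer, larger requests.
    batchwait: 5s
    batchsize: 1048576
    # Back off and retry failed pushes instead of giving up immediately.
    backoff_config:
      min_period: 500ms
      max_period: 5m
      max_retries: 10
```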
3. Label Cardinality Issues
High cardinality labels can severely impact Loki's performance. This happens when you have labels with many potential values.
Symptoms
- Slow queries
- Out of memory errors
- High disk usage
- Error messages about series limits
Here's what a high cardinality problem might look like in Loki logs:
```
level=warn ts=2023-10-15T14:22:30.123Z caller=metrics.go:111 msg="too many series labels matched" error="per-user series limit exceeded: limit 10000 series, matched: 12500 series"
```
Diagnosing Cardinality Issues
Use Loki's metrics to identify high cardinality:
```bash
# Check series count metrics
curl -s http://loki:3100/metrics | grep 'loki_ingester_memory_series'
```
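Loki's label API can also reveal which label is exploding. The `app` label below is only an example; substitute whichever label you suspect.

```bash
# List the label names Loki currently knows about
curl -s http://loki:3100/loki/api/v1/labels

# List the values for a single label; a very long result is a cardinality red flag
curl -s http://loki:3100/loki/api/v1/label/app/values
```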
Examine your label configuration in Promtail:
```yaml
scrape_configs:
  - job_name: app
    static_configs:
      - targets:
          - localhost
        labels:
          app: myapp
          # High cardinality label example:
          request_id: ${REQUEST_ID}  # This could generate unlimited unique values!
```
Solution
- Remove high cardinality labels from your configuration
- Use dynamic labels sparingly, especially for values like:
  - User IDs
  - Session IDs
  - Request IDs
  - IP addresses
  - Timestamps
- Keep high cardinality data in the log content, not in labels
- Increase series limits if absolutely necessary (but fix the root cause first!)
```yaml
limits_config:
  max_label_name_length: 1024
  max_label_value_length: 2048
  max_label_names_per_series: 30
  max_global_streams_per_user: 15000
```
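To keep identifiers like request IDs out of labels without losing them, extract only the low-cardinality fields into labels and leave everything else in the log line. This is a sketch assuming JSON-formatted application logs with hypothetical `level` and `request_id` fields:

```yaml
pipeline_stages:
  # Parse the JSON line and pull out only the field we want as a label
  - json:
      expressions:
        level: level
  # Promote the low-cardinality field to a label; request_id stays in the
  # log content, where LogQL can still filter on it at query time, e.g.
  # {app="myapp"} | json | request_id="abc123"
  - labels:
      level:
```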
4. Log Line Size Limits
Loki enforces maximum log line sizes to prevent performance issues.
Symptoms
- Truncated logs
- Missing log entries
- Error messages about size limits
Diagnosing Size Limit Issues
Check for error messages about line size:
```bash
# Look for size limit errors in Promtail
journalctl -u promtail | grep -i "size limit"
```
Solution
- Increase line size limits in Loki configuration:
```yaml
limits_config:
  max_line_size: 512000  # Increased from default
```
- Configure line splitting in your log agent
- Consider restructuring very large log messages
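If you are unsure whether a given file actually contains oversized lines, a quick scan can confirm it before you touch any limits. This assumes Loki's common 256 KB default for `max_line_size` (check your own configuration) and uses a hypothetical log path:

```bash
# Count lines larger than 256 KiB in a suspect log file
awk 'length($0) > 262144 { n++ } END { print n+0, "oversized lines" }' /var/log/myapp/app.log
```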
5. Timestamp Parsing Issues
Loki requires logs within each stream to arrive in chronological order (recent Loki versions can tolerate some out-of-order writes, but only within a limited time window).
Symptoms
- "Entry out of order" errors
- Missing logs
- Duplicate timestamps
Diagnosing Timestamp Issues
Check for timestamp-related errors in Loki logs:
```bash
# Find timestamp errors
kubectl logs deployment/loki -n loki | grep -i "out of order"
```
Review your timestamp parsing configuration:
```yaml
scrape_configs:
  - job_name: app_logs
    pipeline_stages:
      - regex:
          expression: '(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})'
      - timestamp:
          source: timestamp
          format: '2006-01-02 15:04:05.000'
```
Solution
- Configure proper timestamp parsing in your agent
- Increase the `reject_old_samples_max_age` threshold in the Loki config:

```yaml
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h  # 1 week
```
- Ensure logs are sent in a timely manner from source to Loki
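Before adjusting any limits, it can help to confirm whether the source file itself is out of order. A minimal shell check, assuming a sortable (ISO-8601-style) timestamp is the first field on every line and using a hypothetical log path:

```bash
# If the timestamp column differs from its sorted copy, entries are out of order
diff <(awk '{print $1}' /var/log/myapp/app.log) \
     <(awk '{print $1}' /var/log/myapp/app.log | sort) > /dev/null \
  && echo "timestamps are in order" \
  || echo "timestamps are out of order"
```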
Troubleshooting Workflow
Let's put together a systematic approach to troubleshooting log ingestion issues:

1. Confirm the application is actually writing logs at the source.
2. Check the agent (e.g. Promtail): is it running, are there errors in its logs, and does its scrape configuration match the log paths or targets?
3. Verify connectivity from the agent to Loki and confirm Loki reports ready.
4. Look for rate limiting, line size, cardinality, or out-of-order errors in Loki's logs and metrics.
5. Review `limits_config` and your label configuration, adjust, and re-test with a known log line.
Advanced Troubleshooting Techniques
Using Loki's API for Diagnostics
Loki provides several API endpoints that can help with troubleshooting:
```bash
# Check if Loki is ready
curl http://loki:3100/ready

# Get build information
curl http://loki:3100/loki/api/v1/status/buildinfo

# Check configured limits
curl http://loki:3100/config
```
Monitoring Loki's Metrics
Loki exposes Prometheus metrics that can help identify ingestion issues:
```bash
# Get all metrics
curl http://loki:3100/metrics

# Filter for ingestion-related metrics
curl -s http://loki:3100/metrics | grep 'loki_distributor_bytes_received_total'
```
Key metrics to monitor:

- `loki_distributor_bytes_received_total`: Total bytes received per tenant
- `loki_distributor_lines_received_total`: Total lines received per tenant
- `loki_distributor_ingester_append_failures_total`: Failed append requests to ingesters
- `loki_ingester_memory_streams`: Number of streams in memory
- `loki_ingester_memory_series`: Number of series in memory
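If these metrics are scraped by Prometheus, queries along the following lines can drive a Grafana dashboard or alerts. They are sketches built on the counters above; adapt the rate window and any per-tenant breakdown to your setup.

```promql
# Approximate ingestion throughput in MB/s across all tenants
sum(rate(loki_distributor_bytes_received_total[5m])) / 1e6

# Log lines ingested per second
sum(rate(loki_distributor_lines_received_total[5m]))

# Failed appends to ingesters; this should stay at or near zero
sum(rate(loki_distributor_ingester_append_failures_total[5m]))
```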
Debugging with Promtail
Enable debug logging in Promtail for more detailed information:
```yaml
server:
  http_listen_port: 9080
  log_level: debug  # Set to debug for more verbose logging

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push
```
You can also use Promtail's dry-run feature to test configurations without actually sending logs:
```bash
promtail --dry-run --config.file=promtail-config.yaml
```
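Recent Promtail versions also accept an `--inspect` flag, which prints how each pipeline stage changes the labels, timestamp, and line. Combined with a dry run it is a safe way to debug pipeline stages; verify that your Promtail build supports the flag first.

```bash
# Dry run with per-stage inspection of label, timestamp, and line changes
promtail --dry-run --inspect --config.file=promtail-config.yaml
```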
Common Log Ingestion Error Messages
| Error Message | Possible Cause | Solution |
|---|---|---|
| `server returned HTTP status 429 Too Many Requests` | Rate limiting | Increase rate limits or batch logs |
| `entry out of order, rejecting` | Timestamp issues | Fix timestamp parsing or increase max age |
| `per-user series limit exceeded` | High cardinality | Reduce label cardinality |
| `max size of a single log line exceeded` | Log line too big | Increase max line size or split logs |
| `failed to connect to loki` | Network/availability | Check network and Loki status |
Real-World Example: Troubleshooting a Kubernetes Application
Let's see a practical example of troubleshooting log ingestion for a Kubernetes application.
Scenario
You've deployed a microservice application in Kubernetes and set up Promtail as a DaemonSet to collect container logs. However, you notice that logs from one particular service aren't appearing in Loki.
Step 1: Check if Promtail is collecting the logs
```bash
# Check Promtail pods
kubectl get pods -n logging

# Check Promtail logs
kubectl logs -f promtail-abcd1 -n logging
```
Step 2: Verify Promtail configuration
```yaml
scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```
Step 3: Test direct log queries
```bash
# Query logs with minimal filtering
curl -G -s "http://loki:3100/loki/api/v1/query_range" \
  --data-urlencode "query={namespace=\"my-namespace\"}" \
  --data-urlencode "start=1609459200000000000" \
  --data-urlencode "end=1609545600000000000" \
  --data-urlencode "limit=10"
```
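If you have `logcli` installed, the same check is shorter. This assumes the machine you run it from can reach Loki at the address in `LOKI_ADDR`:

```bash
# Point logcli at the Loki instance
export LOKI_ADDR=http://loki:3100

# Query the last hour of logs for the namespace
logcli query '{namespace="my-namespace"}' --since=1h --limit=10
```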
Step 4: Examine Loki metrics for ingestion issues
```bash
# Check if logs are being received
curl -s http://loki:3100/metrics | grep 'loki_distributor_lines_received_total'
```
Step 5: Solution implementation
After investigating, we discover that the application emits multiline JSON logs that aren't being parsed correctly. We update the Promtail configuration to handle this:
```yaml
scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - json:
          expressions:
            timestamp: time
            level: level
            message: message
      - timestamp:
          source: timestamp
          format: RFC3339
```
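If each JSON document really spans several physical lines, the `json` stage alone will not see complete documents; Promtail's `multiline` stage can join the lines first. This sketch assumes every document starts with an opening brace at the beginning of a line:

```yaml
pipeline_stages:
  # Join physical lines into one logical entry before parsing
  - multiline:
      firstline: '^\{'
      max_wait_time: 3s
  - json:
      expressions:
        timestamp: time
        level: level
        message: message
  - timestamp:
      source: timestamp
      format: RFC3339
```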
Summary
Successfully troubleshooting log ingestion problems in Grafana Loki requires systematic investigation and a good understanding of how the logging pipeline works. In this guide, we've covered:
- Common log ingestion issues and their symptoms
- How to diagnose each type of problem
- Specific solutions for each category of issue
- A systematic troubleshooting workflow
- Advanced techniques for deeper investigation
- Real-world examples of log ingestion problems
Remember that effective log ingestion troubleshooting comes down to understanding:
- The path logs take from source to Loki
- Loki's architecture and how it processes logs
- The configuration options that affect log ingestion
- How to use metrics and logs to identify bottlenecks or failures
Additional Resources
- Grafana Loki Documentation
- Promtail Configuration Reference
- Loki Troubleshooting Guide
- Best Practices for Labels
Exercises
- Set up a local Loki and Promtail instance, then deliberately misconfigure one aspect to practice troubleshooting.
- Create a high cardinality situation and observe its effects on Loki's performance.
- Experiment with different rate limits to find the optimal configuration for your log volume.
- Implement a multiline logging pipeline and ensure timestamps are correctly parsed.
- Create a monitoring dashboard in Grafana to track key Loki ingestion metrics.