Ingestion Performance
Introduction
Ingestion performance is a critical aspect of Grafana Loki that directly impacts how efficiently your logging system operates at scale. When we talk about "ingestion performance," we're referring to how quickly and reliably Loki can receive, process, and store incoming log data from various sources.
As your logging needs grow, understanding and optimizing ingestion performance becomes increasingly important. Poor ingestion performance can lead to log backpressure, dropped logs, increased latency, and even system instability.
In this guide, we'll explore the factors that affect Loki's ingestion performance, common bottlenecks, and practical techniques to optimize and monitor your ingestion pipeline.
Understanding the Loki Ingestion Pipeline
Before diving into optimization, let's understand how logs flow through Loki's ingestion pipeline:
- Log Sources: Applications, servers, and services generate logs
- Log Agents: Tools like Promtail collect and forward logs to Loki
- Distributor: Receives logs, validates them, and distributes them to ingesters
- Ingester: Compresses logs, builds indexes, and writes to storage
- Storage: The final destination for log data and indexes
Each component in this pipeline can affect ingestion performance, so optimizing requires a holistic approach.
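To make the front half of this pipeline concrete, here is a minimal sketch of a Promtail configuration that tails application log files and pushes them to Loki's push API. The hostname loki, the file path, and the labels are illustrative assumptions, not values prescribed by Loki:
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push # received by the distributor

scrape_configs:
  - job_name: app_logs
    static_configs:
      - targets:
          - localhost
        labels:
          app: frontend                # keep labels few and low-cardinality
          __path__: /var/log/app/*.log # files Promtail should tail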
Key Factors Affecting Ingestion Performance
1. Label Cardinality
One of the most significant factors affecting Loki's ingestion performance is label cardinality.
What is Label Cardinality?
Label cardinality refers to the number of unique combinations of label values in your log streams. High cardinality occurs when you have labels with many possible values, creating numerous unique label combinations.
For example, labels like:
- user_id
- request_id
- pod_id
can each have thousands or millions of unique values, quickly producing high cardinality.
Impact on Performance
High cardinality directly impacts Loki's performance by:
- Increasing memory usage in ingesters
- Slowing down index lookups
- Growing index size exponentially
- Reducing query performance
Example of High vs. Low Cardinality
❌ High Cardinality (Avoid):
scrape_configs:
  - job_name: app_logs
    static_configs:
      - targets:
          - localhost
        labels:
          app: frontend
          user_id: "{{.user_id}}"       # Unique per user!
          request_id: "{{.request_id}}" # Unique per request!
✅ Lower Cardinality (Better):
scrape_configs:
  - job_name: app_logs
    static_configs:
      - targets:
          - localhost
        labels:
          app: frontend
          component: auth
          environment: production
2. Chunk Size and Retention
Loki stores logs in "chunks," which impact both ingestion and query performance.
Chunk Settings
The key configurations that affect chunk behavior are:
- chunk_target_size: Target size of each chunk
- chunk_idle_period: How long to wait before closing a chunk if no new logs arrive
- max_chunk_age: Maximum duration to keep a chunk open
limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

chunk_store_config:
  max_look_back_period: 720h # 30 days

schema_config:
  configs:
    - from: 2020-07-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
Finding the Right Balance
- Too small chunks: More chunks to manage, higher overhead
- Too large chunks: More memory usage, potential ingestion delays
- Too short chunk periods: More chunks, higher storage overhead
- Too long chunk periods: Higher memory pressure on ingesters
A good starting point for most setups (these settings live under the ingester block, and chunk_target_size is specified in bytes):
ingester:
  chunk_target_size: 1048576 # ~1MB (value is in bytes)
  chunk_idle_period: 30m
  max_chunk_age: 1h
3. Ingestion Rate Limits
Loki includes configurable rate limits to protect your system from being overwhelmed:
limits_config:
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 6
  per_tenant_override_config: /etc/loki/overrides.yaml
These settings control:
- ingestion_rate_mb: Maximum sustained ingestion rate in MB per second
- ingestion_burst_size_mb: Maximum allowed burst size in MB
You can also configure per-tenant rate limits using override files.
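As a sketch of what such an override file can look like (the tenant IDs and values here are hypothetical), per-tenant limits are listed under an overrides key in the file referenced by per_tenant_override_config:
# /etc/loki/overrides.yaml
overrides:
  tenant-a:
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 30
  tenant-b:
    ingestion_rate_mb: 2
    ingestion_burst_size_mb: 4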
Monitoring and Adjusting Rate Limits
When setting rate limits, start with conservative values and adjust based on:
- Monitoring metrics like loki_distributor_bytes_received_total
- Watching for rate_limited errors in logs
- Analyzing usage patterns over time
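If you run Prometheus alongside Loki, a simple alerting rule can surface rate limiting before logs are silently dropped. The rule below is a sketch that assumes Loki's loki_discarded_samples_total metric is scraped; verify the exact reason label value and tune the threshold and duration for your deployment:
groups:
  - name: loki-ingestion
    rules:
      - alert: LokiTenantRateLimited
        # reason label value assumed to be "rate_limited"; check your Loki version
        expr: 'sum by (tenant) (rate(loki_discarded_samples_total{reason="rate_limited"}[5m])) > 0'
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Loki is rate limiting tenant {{ $labels.tenant }}"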
Optimizing Ingestion Performance
1. Hardware Resources
Ensure your Loki deployment has adequate resources:
Resource Requirements
| Component      | CPU    | Memory | Disk I/O |
|----------------|--------|--------|----------|
| Distributor    | Medium | Low    | Low      |
| Ingester       | High   | High   | Medium   |
| Query Frontend | Medium | Medium | Low      |
| Storage        | Low    | Low    | High     |
Tips:
- Scale ingesters horizontally as they handle most of the ingestion workload
- Provide ingesters with ample memory, as they buffer data in memory
- Use high-performance storage for the index
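On Kubernetes, these guidelines translate into resource requests and limits on the component pods. The fragment below sketches an ingester container; the CPU and memory figures are assumptions to size against your own workload rather than recommended values:
# Fragment of an ingester pod spec (values are illustrative)
containers:
  - name: loki
    args:
      - -target=ingester
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        memory: 10Gi # leave headroom above typical usage to absorb bursts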
2. Efficient Label Usage
Optimizing your labels is one of the most effective ways to improve ingestion performance:
Best Practices:
- Keep labels to a minimum: Only use labels needed for querying
- Avoid high-cardinality labels: Don't use unique IDs as labels
- Use structured logging: Keep detailed information in the log message, not labels
- Standardize labels: Use consistent naming across applications
Example: Efficient Label Strategy
scrape_configs:
  - job_name: app_logs
    static_configs:
      - targets:
          - localhost
        labels:
          env: "production"
          app: "payment-service"
          component: "api" # Limited set of values: api, worker, cron, etc.
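To follow the structured-logging practice above, Promtail's pipeline stages can parse fields from the log line without promoting them to labels. This sketch assumes JSON-formatted logs and a hypothetical file path; only the low-cardinality component field becomes a label, while user_id stays in the line and remains queryable at read time with LogQL's json parser:
scrape_configs:
  - job_name: app_logs
    pipeline_stages:
      - json:
          expressions:
            component: component
      - labels:
          component: # promote only the low-cardinality field to a label
      # user_id is NOT promoted; query it at read time, e.g.:
      # {app="payment-service"} | json | user_id="1234"
    static_configs:
      - targets:
          - localhost
        labels:
          app: payment-service
          __path__: /var/log/payment/*.log # hypothetical path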
3. Batch Processing
Configure your log agents to send logs in batches rather than individually:
Promtail Batch Configuration
clients:
  - url: http://loki:3100/loki/api/v1/push
    batchwait: 1s      # Wait up to 1s to batch logs
    batchsize: 1048576 # Send batches up to ~1MB (value is in bytes)
This significantly reduces network overhead and Loki processing costs.
4. Horizontally Scale Components
For high-volume environments, scale out your Loki components:
# Example Kubernetes configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki-distributor
spec:
  replicas: 3 # Scale based on load
  selector:
    matchLabels:
      app: loki-distributor
  template:
    metadata:
      labels:
        app: loki-distributor
    spec:
      containers:
        - name: loki
          image: grafana/loki:latest
          args:
            - -target=distributor
Scale components based on their specific bottlenecks:
- Distributors: Scale based on incoming request rate
- Ingesters: Scale based on memory pressure and storage throughput
- Query Frontend: Scale based on query load
Monitoring Ingestion Performance
To effectively optimize, you need to monitor key metrics:
Essential Metrics to Watch
- Ingestion Rates:
  - loki_distributor_bytes_received_total
  - loki_distributor_lines_received_total
- Ingestion Latency:
  - loki_distributor_request_duration_seconds
- Resources:
  - container_memory_usage_bytes (for Loki components)
  - container_cpu_usage_seconds_total
- Errors and Drops:
  - loki_distributor_client_request_errors_total
  - loki_ingester_chunk_age_seconds (to detect stuck chunks)
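These are Prometheus metrics, so Prometheus (or another compatible scraper) needs to scrape each Loki component's /metrics endpoint. A minimal sketch, assuming the components expose their default HTTP port 3100 and are reachable under the hostnames shown:
scrape_configs:
  - job_name: loki
    static_configs:
      - targets: # hostnames are illustrative
          - loki-distributor:3100
          - loki-ingester:3100
          - loki-query-frontend:3100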
Creating a Monitoring Dashboard
Set up a Grafana dashboard specifically for ingestion performance:
# Example PromQL queries for your dashboard

# Ingestion rate by tenant
sum by (tenant) (rate(loki_distributor_bytes_received_total[5m]))

# Ingestion latency, 99th percentile
histogram_quantile(0.99, sum by (le) (rate(loki_distributor_request_duration_seconds_bucket[5m])))

# Error rate
sum(rate(loki_distributor_client_request_errors_total[5m])) by (tenant, status_code)
Troubleshooting Common Ingestion Issues
1. High Latency or Timeouts
Symptoms:
- Slow log ingestion
- Timeout errors from clients
Potential Fixes:
- Increase resources for distributors and ingesters
- Review label cardinality and reduce if necessary
- Check network connectivity between components
- Increase timeout settings in client configurations
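For the last point, Promtail's client block exposes a request timeout and retry backoff settings. The values below are an illustrative sketch to adapt, not required defaults:
clients:
  - url: http://loki:3100/loki/api/v1/push
    timeout: 30s         # allow slower responses before failing the push
    backoff_config:
      min_period: 500ms  # initial retry delay
      max_period: 5m     # cap on exponential backoff between retries
      max_retries: 10    # stop retrying a failed batch after this many attempts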
2. Rate Limiting Errors
Symptoms:
- Errors containing "rate limit exceeded"
- Logs being dropped
Potential Fixes:
- Increase rate limits if resources allow
- Add more ingesters to handle the load
- Implement log sampling for high-volume, low-importance logs
- Spread log transmission over time when possible
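One way to implement the sampling/filtering suggestion is to drop known low-value lines at the agent, before they reach Loki at all. This sketch uses Promtail's drop stage with an illustrative regular expression; adjust it to whatever marks low-importance logs in your format:
scrape_configs:
  - job_name: app_logs
    pipeline_stages:
      - drop:
          expression: "level=(debug|trace)" # drop lines matching this regex
          drop_counter_reason: low_importance
    static_configs:
      - targets:
          - localhost
        labels:
          app: frontend
          __path__: /var/log/app/*.log # hypothetical path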
3. Memory Pressure on Ingesters
Symptoms:
- OOM errors
- High memory usage on ingesters
Potential Fixes:
- Reduce chunk target size or max age
- Scale out ingesters horizontally
- Reduce label cardinality
- Increase memory limits for ingester pods
Real-World Example: Scaling Loki for High-Volume Production
Let's look at a case study of optimizing Loki for a system handling 100GB of logs per day:
Initial Setup
# Before optimization
limits_config:
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 6

schema_config:
  configs:
    - from: 2020-07-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
Issues Encountered:
- Frequent rate limiting errors
- High memory usage on ingesters
- Slow query performance
Optimized Configuration
# After optimization
limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 15
  per_tenant_override_config: /etc/loki/overrides.yaml

chunk_store_config:
  max_look_back_period: 720h

schema_config:
  configs:
    - from: 2020-07-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h

ingester:
  chunk_idle_period: 30m
  chunk_target_size: 1572864 # ~1.5MB (value is in bytes)
  max_chunk_age: 1h
Infrastructure Changes:
- Scaled from 3 to 5 ingesters
- Increased memory allocation for ingesters
- Implemented a log pre-processor to standardize labels
- Set up tenant-specific rate limits
Result:
- 3x improvement in ingestion throughput
- 40% reduction in storage requirements
- 60% faster query performance
- Zero log drops during normal operation
Summary
Optimizing Loki's ingestion performance requires a systematic approach focused on:
- Managing label cardinality - The most important factor for Loki performance
- Tuning chunk settings - Finding the right balance for your workload
- Setting appropriate rate limits - Protecting your system while allowing needed throughput
- Efficient resource allocation - Giving components the resources they need
- Monitoring and iterative improvement - Continuously tracking metrics and refining
By applying these principles, you can build a Loki deployment that efficiently handles your log volume while maintaining reliability and query performance.
Next Steps
- Practice implementing the configuration changes described in this guide
- Set up a test environment and experiment with different label strategies
- Create a monitoring dashboard with the metrics mentioned in this guide
- Try simulating high load to see how your Loki deployment responds
Exercise: Audit Your Labels
Review your current Loki labels and identify opportunities for optimization by answering these questions:
- How many unique label combinations do you have?
- Which labels have the highest cardinality?
- Are all your labels necessary for querying?
- Could any high-cardinality labels be moved into the log content instead?
Document your findings and create an optimization plan for your labels.