Resource Limits
When running Grafana Loki in a multi-tenant environment, implementing proper resource limits is crucial to ensure fairness, prevent abuse, and maintain system stability. This guide will explain how resource limits work in Loki and how to configure them effectively.
Introduction to Resource Limits
In a multi-tenant Loki deployment, different tenants (teams, applications, or environments) share the same infrastructure. Without proper limits, a single tenant could:
- Consume excessive storage by sending too much log data
- Overload the system with too many queries
- Degrade performance for other tenants through resource-intensive operations
Resource limits allow you to control and restrict how much of Loki's resources each tenant can use, ensuring fair allocation and preventing "noisy neighbor" problems.
Core Limit Types in Loki
Loki supports several types of limits that can be applied globally or per-tenant:
1. Ingestion Limits
Ingestion limits control how much data tenants can send to Loki:
- Rate limits: Maximum volume of log data per second, measured in MB
- Size limits: Maximum size of individual log entries
- Stream limits: Maximum number of active streams per tenant
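As a rough sketch, the three ingestion controls above could map onto limits_config keys along the following lines. The max_line_size key is not covered elsewhere in this guide, and all values are illustrative rather than recommendations, so check the configuration reference for your Loki version before relying on them.

```yaml
limits_config:
  # Rate limit: per-tenant ingestion volume in MB per second
  ingestion_rate_mb: 4
  # Size limit: largest single log line accepted (illustrative value)
  max_line_size: 256KB
  # Stream limit: maximum number of active streams per tenant
  max_streams_per_user: 10000
```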
2. Query Limits
Query limits protect the query path from excessive load:
- Query timeout: Maximum duration for query execution
- Query parallelism: Number of shards that can be queried simultaneously
- Cardinality limits: Restrictions on the number of series that can be processed
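A minimal sketch of how these query controls might look in limits_config is shown here. The query_timeout and max_query_series keys are not used elsewhere in this guide, the values are illustrative, and the location of the timeout setting has changed between Loki releases, so treat this as an assumption to verify against your version.

```yaml
limits_config:
  # Query timeout: stop queries that run longer than this
  query_timeout: 1m
  # Query parallelism: upper bound on sub-queries scheduled in parallel
  max_query_parallelism: 32
  # Cardinality: maximum number of unique series a metric query may return
  max_query_series: 500
```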
3. Storage Limits
Storage limits control how much data each tenant can store:
- Retention period: How long logs are kept before deletion
- Per-tenant storage quotas: Maximum storage capacity allocated
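Retention could be sketched as follows, assuming the compactor is the component that enforces it. The 744h value is illustrative, and depending on the Loki version the compactor needs further settings (such as its working directory and object store) that are omitted here. This sketch covers retention only; a byte-based quota is not shown.

```yaml
compactor:
  # Retention is only applied when it is enabled on the compactor
  retention_enabled: true

limits_config:
  # Default retention for all tenants (illustrative value)
  retention_period: 744h
```

A per-tenant retention_period can then be set through the same override mechanism used for the other limits.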
Configuring Global Limits
Loki's global limits apply to all tenants and serve as default values. They are defined in the limits_config section of Loki's configuration.
limits_config:
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 6
  max_entries_limit_per_query: 5000
  max_label_name_length: 1024
  max_label_value_length: 2048
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_query_length: 721h
  max_query_parallelism: 14
  max_streams_per_user: 10000
Per-Tenant Overrides
You can override global limits for specific tenants using the overrides configuration:
limits_config:
  # Global limits as shown above
  # Path to the file that holds the per-tenant overrides
  per_tenant_override_config: /etc/loki/overrides.yaml

# In overrides.yaml:
overrides:
  tenant1:
    ingestion_rate_mb: 10
    ingestion_burst_size_mb: 15
    max_streams_per_user: 20000
  tenant2:
    max_query_length: 168h
    max_query_parallelism: 32
Understanding Critical Limits
Let's examine some of the most important limits in detail:
Ingestion Rate Limits
Rate limits are defined in MB per second and include a "burst" capacity to handle temporary spikes:
limits_config:
  ingestion_rate_mb: 4        # 4 MB/s steady state
  ingestion_burst_size_mb: 6  # 6 MB burst capacity
When a tenant exceeds these limits, Loki will reject their log entries with a 429 (Too Many Requests) HTTP response.
Stream Limits
The max_streams_per_user setting controls the maximum number of active streams a tenant can have:
limits_config:
  max_streams_per_user: 10000
This limit prevents cardinality explosions, which can happen when log entries have too many unique label combinations.
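In recent Loki versions, stream limits can be expressed both per ingester and across the whole cluster, and the number of label names per stream can be capped as well. The sketch below assumes those related keys are available in your version; max_global_streams_per_user and max_label_names_per_series do not appear elsewhere in this guide, and the values are illustrative only.

```yaml
limits_config:
  # Cap on active streams for one tenant on a single ingester
  max_streams_per_user: 10000
  # Cap on active streams for one tenant across the cluster
  max_global_streams_per_user: 50000
  # Cap on the number of label names attached to a single stream
  max_label_names_per_series: 15
```

Limiting the number of label names complements the stream limit, since every additional label multiplies the number of possible label combinations.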
Measuring and Monitoring Limits
Loki exposes metrics that help you monitor resource usage and limit enforcement:
- loki_distributor_bytes_received_total: Total bytes received per tenant
- loki_ingester_streams_created_total: Number of streams created per tenant
- loki_request_duration_seconds: Request latency by tenant
- loki_discarded_samples_total: Samples discarded due to rate limiting
You can use these metrics to set up alerts and create dashboards to monitor tenant behavior.
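For example, an alert on discarded samples might look like the Prometheus rule sketched below. The group name, alert name, threshold, and durations are hypothetical, and the rule assumes the metric carries tenant and reason labels, so adjust it to what your Loki version actually exposes.

```yaml
groups:
  - name: loki-tenant-limits                  # hypothetical group name
    rules:
      - alert: LokiTenantDiscardingSamples    # hypothetical alert name
        # Per-tenant rate of discarded samples, broken down by reason
        expr: sum by (tenant, reason) (rate(loki_discarded_samples_total[5m])) > 0
        # Only fire if the condition holds for 10 minutes
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Tenant {{ $labels.tenant }} is having samples discarded ({{ $labels.reason }})"
```

Grouping by reason helps separate rate-limit rejections from other causes, such as lines that exceed a size limit.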
Dynamic Limit Configuration
For more flexibility, you can use Loki's runtime configuration feature to update limits without restarting:
runtime_config:
  file: /etc/loki/runtime.yaml
Loki reloads this file periodically, so you can adjust limits on the fly as your workloads and requirements change.
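A minimal sketch of the two pieces involved is shown here, assuming the runtime file uses the same overrides layout as the per-tenant overrides shown earlier. The period value and the tenant entry are illustrative.

```yaml
# Main Loki configuration
runtime_config:
  file: /etc/loki/runtime.yaml
  # How often Loki re-reads the file (illustrative value)
  period: 10s

# In runtime.yaml:
overrides:
  tenant1:
    ingestion_rate_mb: 10
    ingestion_burst_size_mb: 15
```

Editing runtime.yaml and waiting for the next reload is then enough for the new values to apply to tenant1.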
Practical Example: Multi-Environment Setup
Consider a scenario where you're running Loki for development, staging, and production environments:
limits_config:
  # Default limits for all tenants
  ingestion_rate_mb: 2
  ingestion_burst_size_mb: 4
  max_query_length: 24h
  # Path to the file that holds the per-tenant overrides
  per_tenant_override_config: /etc/loki/overrides.yaml

# In overrides.yaml:
overrides:
  dev:
    ingestion_rate_mb: 1
    max_query_length: 48h
  staging:
    ingestion_rate_mb: 3
    max_streams_per_user: 5000
  production:
    ingestion_rate_mb: 10
    ingestion_burst_size_mb: 20
    max_query_length: 72h
    max_query_parallelism: 32
    max_streams_per_user: 50000
This configuration allocates the most resources to production while limiting development and staging to smaller quotas.
Limit Visualization
[Diagram: how resource limits work in a multi-tenant environment]
Best Practices for Resource Limits
- Start conservative: Begin with stricter limits and gradually increase them based on observed usage patterns.
- Monitor limit errors: Track the frequency of rate limit errors per tenant to identify problematic workloads.
- Implement graduated limits: Provide different tiers of limits based on tenant requirements or service levels.
- Regular review: Periodically review and adjust limits based on system capacity and tenant needs.
- Communicate limits: Ensure tenants understand their resource allocations and limits.
- Use data to inform limits: Base limits on actual usage patterns rather than arbitrary values.
Implementing a Limit Strategy
Here's a step-by-step approach to implementing resource limits:
1. Baseline measurement: Monitor your Loki deployment without strict limits to understand natural usage patterns.
2. Global limit configuration: Set sensible global limits based on your total system capacity.
3. Tenant classification: Categorize tenants based on their expected workload and importance.
4. Per-tenant overrides: Configure specific limits for high-priority or high-volume tenants.
5. Monitoring setup: Create dashboards to track resource usage and limit violations.
6. Adjustment cycle: Regularly review and adjust limits based on monitoring data.
Summary
Resource limits are an essential aspect of running Loki in a multi-tenant environment. They protect your infrastructure from overload, ensure fair resource allocation, and help maintain predictable performance.
By understanding and properly configuring ingestion, query, and storage limits, you can create a stable and reliable logging system that serves multiple tenants effectively.
Exercises
- Configure basic resource limits for a Loki deployment with three tenants of different priorities.
- Set up a Grafana dashboard to monitor tenant resource usage and limit violations.
- Implement dynamic limit configuration and practice updating limits at runtime.
- Design a graduated limit strategy with at least three different service tiers.
- Create an incident response plan for dealing with tenants that consistently reach or exceed their resource limits.