Scaling Guidelines
Introduction
Grafana Loki is designed to be horizontally scalable, allowing you to start small and grow your deployment as your logging needs increase. Whether you're running Loki on a single machine or across a distributed Kubernetes cluster, understanding how to effectively scale your implementation is crucial for maintaining performance and reliability.
This guide covers essential scaling considerations and best practices to help you grow your Loki deployment efficiently. We'll explore component-specific scaling approaches, resource optimization techniques, and architectural patterns that enable Loki to handle increasing log volumes.
Scaling Fundamentals
Before diving into specific scaling strategies, let's understand the fundamental aspects that influence Loki's scalability.
Key Scaling Dimensions
Loki scales across several dimensions:
- Query load: The number and complexity of queries
- Ingest volume: The amount of log data being sent to Loki
- Retention period: How long data is stored
- Tenant count: Number of separate organizations/projects using the same Loki instance
Monolithic vs. Microservices Deployment
Loki supports two primary deployment modes (a configuration sketch follows the list):
- Monolithic mode: All Loki components run in a single process
- Microservices mode: Components are separated and can be scaled independently
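The deployment mode is chosen per process with Loki's -target flag. As a minimal sketch (the image tag and service names are assumptions, not part of this guide's reference setup), a monolithic instance runs every component with -target=all, while microservices mode runs one process per component:

```yaml
# Monolithic mode: a single container runs all components
loki:
  image: grafana/loki:2.8.2                     # assumed version
  args: ["-config.file=/etc/loki/loki.yaml", "-target=all"]

# Microservices mode: one container (or Deployment) per component
loki-distributor:
  image: grafana/loki:2.8.2
  args: ["-config.file=/etc/loki/loki.yaml", "-target=distributor"]
loki-ingester:
  image: grafana/loki:2.8.2
  args: ["-config.file=/etc/loki/loki.yaml", "-target=ingester"]
loki-querier:
  image: grafana/loki:2.8.2
  args: ["-config.file=/etc/loki/loki.yaml", "-target=querier"]
```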
Scaling for Different Deployment Sizes
Let's examine scaling guidelines based on deployment size.
Small Deployments (Up to 100GB/day)
For small environments, a monolithic deployment is typically sufficient:
```yaml
loki:
  config: |
    auth_enabled: false
    server:
      http_listen_port: 3100
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
      chunk_idle_period: 15m
      chunk_retain_period: 30s
      max_transfer_retries: 0
    schema_config:
      configs:
        - from: 2020-05-15
          store: boltdb-shipper
          object_store: filesystem
          schema: v11
          index:
            prefix: index_
            period: 24h
    storage_config:
      boltdb_shipper:
        active_index_directory: /data/loki/index
        cache_location: /data/loki/cache
        cache_ttl: 24h
        shared_store: filesystem
      filesystem:
        directory: /data/loki/chunks
    limits_config:
      ingestion_rate_mb: 10
      ingestion_burst_size_mb: 20
      max_global_streams_per_user: 5000
```
Resource Guidelines (a matching Kubernetes resources sketch follows the list):
- CPU: 2-4 cores
- Memory: 4-8GB
- Storage: SSD for index data
- Network: 1Gbps
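On Kubernetes, these guidelines map roughly onto the following pod resources. This is a sketch against a single-binary Helm deployment; the storage class name is an assumption:

```yaml
loki:
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 4
      memory: 8Gi
  persistence:
    enabled: true
    size: 100Gi
    storageClassName: fast-ssd   # assumption: an SSD-backed StorageClass for index data
```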
Medium Deployments (100GB-1TB/day)
For medium-sized deployments, consider moving to microservices mode:
```yaml
# distributor configuration
distributor:
  replicas: 2
  resources:
    limits:
      cpu: 1
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 500Mi

# ingester configuration
ingester:
  replicas: 3
  resources:
    limits:
      cpu: 2
      memory: 8Gi
    requests:
      cpu: 1
      memory: 4Gi

# querier configuration
querier:
  replicas: 2
  resources:
    limits:
      cpu: 2
      memory: 4Gi
    requests:
      cpu: 1
      memory: 2Gi
```
Key Considerations:
- Use replicated ingesters (replication_factor: 2-3)
- Implement separate object storage (S3, GCS, etc.)
- Add a query frontend with query result caching (see the sketch below)
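A minimal sketch of enabling result caching at the query frontend, assuming a memcached service is already running (the host and service names are assumptions):

```yaml
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      memcached_client:
        host: memcached-frontend.loki.svc.cluster.local   # assumed memcached service
        service: memcached-client                         # assumed port name
        timeout: 500ms

frontend:
  compress_responses: true
  log_queries_longer_than: 10s
```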
Large Deployments (1TB+/day)
For enterprise-scale deployments:
```yaml
# Additional specialized microservice components
queryFrontend:
  replicas: 3
compactor:
  replicas: 2
indexGateway:
  replicas: 3
ruler:
  replicas: 2
```
Advanced Scaling Techniques:
- Implement tenant isolation with resource limits per tenant
- Separate the read and write paths into their own pools so they can be scaled independently
- Add index and chunk caching layers (see the cache sketch below)
- Consider Loki's alternative index stores (the Cortex-derived Cassandra, Bigtable, or DynamoDB backends) if boltdb-shipper becomes a bottleneck
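As a sketch of the caching layers, assuming dedicated memcached deployments for chunks and index queries (the hostnames are assumptions):

```yaml
chunk_store_config:
  chunk_cache_config:
    memcached:
      batch_size: 256
      parallelism: 10
    memcached_client:
      host: memcached-chunks.loki.svc.cluster.local          # assumed memcached service
      service: memcached-client

storage_config:
  index_queries_cache_config:
    memcached:
      batch_size: 100
      parallelism: 10
    memcached_client:
      host: memcached-index-queries.loki.svc.cluster.local   # assumed memcached service
      service: memcached-client
```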
Component-Specific Scaling Guidelines
Scaling Distributors
Distributors handle incoming log streams and are generally CPU-bound:
```yaml
# Helm values: replica count and pod resources
distributor:
  replicas: ${DISTRIBUTOR_REPLICAS}
  resources:
    limits:
      cpu: ${DISTRIBUTOR_CPU_LIMIT}
      memory: ${DISTRIBUTOR_MEMORY_LIMIT}

# Loki config: raise server limits for higher per-request throughput
server:
  grpc_server_max_recv_msg_size: 10485760  # 10MB
  http_server_write_timeout: 1m
  http_server_read_timeout: 1m
```
Scaling Indicators:
- High CPU utilization
- Increasing request latency
- HTTP 429 responses from rate limiting (an example alert follows)
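When 429s start appearing, an alert gives early warning before producers begin dropping logs. A minimal Prometheus alerting rule sketch (the route label value, rule group name, and threshold are assumptions):

```yaml
groups:
  - name: loki-ingestion                     # assumed rule group name
    rules:
      - alert: LokiPushRateLimited
        # loki_request_duration_seconds_count carries a status_code label
        expr: sum(rate(loki_request_duration_seconds_count{route="loki_api_v1_push", status_code="429"}[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Loki is rate-limiting pushes; scale distributors or raise ingestion limits."
```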
Scaling Ingesters
Ingesters are the most resource-intensive component and require careful scaling:
```yaml
# Helm values: replica count and pod resources
ingester:
  replicas: ${INGESTER_REPLICAS}
  resources:
    limits:
      cpu: ${INGESTER_CPU_LIMIT}
      memory: ${INGESTER_MEMORY_LIMIT}

# Loki config: performance tuning for the ingester component
ingester:
  chunk_target_size: 1536000   # ~1.5MB target chunk size before flushing
  chunk_idle_period: 30m
  max_chunk_age: 1h
  lifecycler:
    ring:
      replication_factor: 3
```
Memory Sizing Formula:
Memory per ingester ≈ (daily_ingest_volume * replication_factor * in_memory_window / 24h) / ingester_count
where the in-memory window is roughly max_chunk_age plus chunk_retain_period. For example, 100GB/day with replication_factor 3, a 2h window, and 3 ingesters needs about (100GB * 3 * 2/24) / 3 ≈ 8GB per ingester for chunk data alone; leave generous headroom for index, queries, and garbage collection.
Scaling Queriers
Queriers become important as your query load increases:
```yaml
# Helm values: replica count
querier:
  replicas: ${QUERIER_REPLICAS}

# Loki config: querier block
querier:
  max_concurrent: 20
  query_timeout: 2m
  engine:
    timeout: 1m
    max_look_back_period: 12h
```
Scaling Indicators:
- Query timeouts
- High query latency
- High memory usage during query execution
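When these indicators appear, splitting and sharding queries at the query frontend often helps before adding replicas. A sketch of the relevant knobs (note that in newer Loki releases split_queries_by_interval lives under limits_config rather than query_range):

```yaml
query_range:
  # Break long time ranges into smaller subqueries that run in parallel
  split_queries_by_interval: 30m
  parallelise_shardable_queries: true

frontend:
  # Per-tenant queue depth before the frontend rejects queries
  max_outstanding_per_tenant: 2048

querier:
  # Subqueries a single querier executes concurrently
  max_concurrent: 16
```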
Storage Optimization for Scaling
Storage choices significantly impact scalability:
Object Storage Considerations
```yaml
storage_config:
  aws:
    s3: s3://region/bucket
    s3forcepathstyle: true
  boltdb_shipper:
    active_index_directory: /loki/index
    shared_store: s3
    cache_location: /loki/cache
    cache_ttl: 24h
```
Best Practices:
- Use dedicated SSD volumes for active index directories
- Implement caching for frequently accessed chunks
- Consider data lifecycle policies to manage older data (a retention sketch follows this list)
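For lifecycle management, Loki can expire old data itself. With the boltdb-shipper setup used in this guide, the table manager handles retention (14 days shown as an example; newer releases use the compactor's retention settings instead):

```yaml
table_manager:
  retention_deletes_enabled: true
  retention_period: 336h   # 14 days; must be a multiple of the 24h index period
```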
Index Storage Optimization
```yaml
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h  # boltdb-shipper requires a 24h index period
```
Tuning Options:
- Keep the 24h index period required by boltdb-shipper; tune cache_ttl and cache sizes rather than the period
- Implement index caching for frequently queried time ranges
- Use index gateways for large deployments (see the sketch below)
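A sketch of wiring queriers to an index gateway: the gateway runs as its own target (for example, loki -target=index-gateway), and the boltdb-shipper client on queriers and rulers points at it (the service address is an assumption):

```yaml
storage_config:
  boltdb_shipper:
    index_gateway_client:
      # Assumed Kubernetes service for the index gateway's gRPC port
      server_address: dns:///loki-index-gateway.loki.svc.cluster.local:9095
```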
Multi-Tenancy Scaling
For environments with multiple teams or services:
```yaml
limits_config:
  # Default limits applied to every tenant
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  # Per-tenant overrides live in a separate file
  per_tenant_override_config: /etc/loki/tenant-limits.yaml
  # Maximum active streams per tenant, enforced across the whole cluster
  max_global_streams_per_user: 10000
```
Example tenant limits file (the file referenced by per_tenant_override_config; note the required top-level overrides key):
```yaml
overrides:
  tenant1:
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 30
    max_streams_per_user: 15000
  tenant2:
    ingestion_rate_mb: 5
    ingestion_burst_size_mb: 10
    max_streams_per_user: 5000
```
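Tenants are identified by the X-Scope-OrgID header on every request. With Promtail, the header is set per client via tenant_id (the push URL is an assumption):

```yaml
# Promtail client configuration (sketch)
clients:
  - url: http://loki-distributor.loki.svc.cluster.local:3100/loki/api/v1/push   # assumed service address
    tenant_id: tenant1   # sent as the X-Scope-OrgID header
```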
Load Balancing and High Availability
Load balancing spreads traffic across replicas, while autoscaling adds replicas as load grows. The example below scales distributors on CPU utilization.
Configuration Example:
```yaml
# Using Kubernetes for automatic scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: loki-distributor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: loki-distributor
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
Monitoring Your Loki Deployment
Monitoring is essential to understand when and how to scale:
```yaml
# Prometheus metrics are exposed on the same HTTP port as the API
server:
  http_listen_port: 3100

# Runtime configuration that can be reloaded without a restart (e.g. per-tenant limits)
runtime_config:
  file: /etc/loki/runtime-config.yaml

# Important metrics to watch
# - loki_distributor_bytes_received_total
# - loki_ingester_memory_chunks
# - loki_chunk_store_index_entries_per_chunk
# - loki_query_frontend_queries_total
```
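These metrics are served on each component's /metrics endpoint. A minimal Prometheus scrape config sketch (the job name and target address are assumptions):

```yaml
scrape_configs:
  - job_name: loki                      # assumed job name
    metrics_path: /metrics
    static_configs:
      - targets: ["loki:3100"]          # assumed address of the Loki HTTP port
```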
Create a Grafana dashboard to visualize these key metrics:
Useful Prometheus queries:
```promql
sum(rate(loki_distributor_bytes_received_total[5m])) by (tenant)
sum(loki_ingester_memory_chunks)
histogram_quantile(0.99, sum(rate(loki_query_frontend_query_duration_seconds_bucket[5m])) by (le))
```
Practical Example: Scaling Exercise
Let's walk through scaling a Loki deployment from handling 50GB/day to 500GB/day:
- Initial setup: Monolithic Loki with local storage
- First scaling step: migrate to object storage such as S3 (a schema migration sketch follows this list)
- Second scaling step: Split into microservices
- Third scaling step: Add replication and query caching
- Final adjustments: Implement autoscaling and per-tenant limits
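For the object-storage migration in step 2, Loki supports adding a new schema_config period instead of rewriting old data: keep the existing filesystem entry and add a new one with a future from date that points at S3 (the dates are assumptions):

```yaml
schema_config:
  configs:
    # Existing period: data written before the cutover stays on the filesystem
    - from: 2020-05-15
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
    # New period: from this (future) date onward, index and chunks go to S3
    - from: 2023-01-01               # assumed cutover date after the config rollout
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
```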
Exercise: Calculate Required Resources
Given:
- Log volume: 500GB/day
- Retention: 14 days
- Peak query rate: 100 queries/minute
- 5 tenants with varying log volumes
Calculate:
- Required ingester count and sizing
- Storage requirements
- Query resource allocation
Solution Outline:
1. Total storage = 500GB * 14 days * 1.3 (index and overhead) = ~9.1TB
2. Ingester memory = (500GB/day * 3 (replication) * 2h/24h in memory) / 6 (ingesters) = ~21GB per ingester, or ~125GB across the pool
3. Querier count = (peak_QPS * avg_query_duration) / max_concurrent = ~4-6 queriers at ~1.7 QPS (100 queries/minute), since long-range queries are split into many subqueries
Troubleshooting Scaling Issues
Common scaling issues and solutions:
| Problem | Symptoms | Solution |
|---|---|---|
| High ingestion latency | Slow log delivery, timeouts | Increase distributor replicas, check rate limits |
| Query timeouts | Slow dashboards, failed queries | Increase querier resources, implement query frontend caching |
| Out of memory errors | Ingester crashes, restarts | Increase memory limits, check chunk settings |
| High disk I/O | Slow queries, high latency | Use SSD for active index, implement index caching |
Summary
Scaling Grafana Loki effectively requires:
- Understanding your log volume and query patterns
- Choosing the right deployment architecture for your size
- Properly sizing and configuring individual components
- Implementing appropriate storage solutions
- Monitoring performance metrics to identify bottlenecks
- Applying component-specific optimizations
By following these guidelines, you can ensure your Loki deployment grows smoothly alongside your organization's logging needs.
Additional Resources
- Official Loki scaling documentation
- Grafana Loki capacity planning guide
- Community best practices for large-scale deployments
Practice Exercises
- Design a Loki deployment for handling 250GB/day with 30-day retention
- Calculate the resource requirements for 10 tenants with varying log volumes
- Create a monitoring dashboard with key scaling metrics
- Implement a scaling plan to migrate from monolithic to microservices architecture