High Availability Setup

Introduction

High Availability (HA) is a critical characteristic of any production-ready logging system. In this guide, we'll explore how to set up Grafana Loki in a high availability configuration to ensure your logging infrastructure remains operational even when individual components fail.

Grafana Loki is designed with scalability and resilience in mind, allowing you to deploy it in distributed configurations that eliminate single points of failure. By understanding and implementing HA concepts with Loki, you'll be able to build logging infrastructure that meets enterprise requirements for uptime and reliability.

Understanding High Availability in Loki

Loki's architecture is composed of several microservices that can be deployed in various configurations. For high availability, we need to deploy multiple instances of each component and ensure they can work together seamlessly.

The main components that need to be considered for HA are:

  1. Distributor - Handles incoming log streams
  2. Ingester - Writes log data to long-term storage
  3. Querier - Handles queries from Grafana or LogCLI
  4. Query Frontend - Optimizes and distributes query load
  5. Compactor - Optimizes storage in the background

Let's examine how these components work together in an HA setup:
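At a high level, writes flow from your log agents through a load balancer to the distributors, which replicate each stream across several ingesters before it is flushed to object storage; reads flow from Grafana or LogCLI through the query frontend to the queriers, which fetch data from both the ingesters and object storage. Here is a simplified sketch of the two paths (your exact topology may differ depending on deployment mode):

text
Write path:  log agents (e.g. Promtail) -> load balancer -> distributor -> ingesters -> object storage
Read path:   Grafana / LogCLI -> query frontend -> querier -> ingesters + object storage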

Prerequisites

Before setting up Loki in HA mode, ensure you have:

  • Kubernetes cluster or multiple servers for deployment
  • Shared object storage (S3, GCS, Azure Blob, etc.)
  • A key-value store like Consul or etcd for ring management
  • Load balancer for distributing traffic

Implementation Steps

Step 1: Configure Storage Backend

First, we need to configure a shared storage backend that all Loki components can access. This is crucial for HA as it allows components to access data regardless of which instance processed it originally.

Here's an example configuration using S3:

yaml
storage_config:
  aws:
    s3: s3://access_key:secret_access_key@region/bucket_name
    s3forcepathstyle: true
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: s3

Step 2: Set Up Ring Membership

Loki uses a "ring" for distributed coordination between components. We'll configure this to use Consul:

yaml
kvstore:
  store: consul
  consul:
    host: consul:8500
    prefix: loki/

ring:
  kvstore:
    store: consul
    consul:
      host: consul:8500
      prefix: loki/ring/

Step 3: Configure Components for HA

Each component needs to be configured for HA operation. Here's an example for the distributor:

yaml
distributor:
  ring:
    kvstore:
      store: consul
      consul:
        host: consul:8500
        prefix: loki/ring/
    replication_factor: 3

And for the ingester. With replication_factor: 3, each log stream is written to three ingesters and a write succeeds once a quorum of two acknowledges it, so losing a single ingester does not interrupt ingestion:

yaml
ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul
        consul:
          host: consul:8500
          prefix: loki/ring/
      replication_factor: 3
    final_sleep: 0s
  chunk_idle_period: 1h
  max_chunk_age: 1h
  chunk_target_size: 1048576
  chunk_retain_period: 30s
  max_transfer_retries: 0

Step 4: Deploy Multiple Instances

Using Kubernetes, we can deploy multiple instances of each component. Here's a simplified example for the distributor:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki-distributor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: loki
      component: distributor
  template:
    metadata:
      labels:
        app: loki
        component: distributor
    spec:
      containers:
        - name: loki
          image: grafana/loki:latest
          args:
            - "-target=distributor"
            - "-config.file=/etc/loki/config.yaml"
          ports:
            - containerPort: 3100
              name: http
          volumeMounts:
            - name: config
              mountPath: /etc/loki
      volumes:
        - name: config
          configMap:
            name: loki-config

Step 5: Configure Load Balancers

Set up load balancers to distribute traffic to your multiple Loki instances:

yaml
apiVersion: v1
kind: Service
metadata:
  name: loki-distributor
spec:
  type: ClusterIP
  selector:
    app: loki
    component: distributor
  ports:
    - port: 3100
      targetPort: 3100

Practical Example: Complete HA Configuration

Let's put it all together with a full configuration example for a production environment:

yaml
auth_enabled: true

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: consul
        consul:
          host: consul:8500
          prefix: loki/ring/
      replication_factor: 3
    final_sleep: 0s
  chunk_idle_period: 1h
  chunk_target_size: 1536000
  max_chunk_age: 1h

schema_config:
  configs:
    - from: 2020-07-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  aws:
    s3: s3://access_key:secret_access_key@region/bucket_name
    s3forcepathstyle: true
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: s3

compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 10m

limits_config:
  retention_period: 744h
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

frontend:
  compress_responses: true
  log_queries_longer_than: 10s

frontend_worker:
  frontend_address: loki-query-frontend:9095

ruler:
  storage:
    type: local
    local:
      directory: /loki/rules
  rule_path: /loki/rules
  alertmanager_url: http://alertmanager:9093
  ring:
    kvstore:
      store: consul
      consul:
        host: consul:8500
        prefix: loki/ruler/
  enable_api: true

Verifying Your HA Setup

Once deployed, you can verify your HA setup by:

  1. Testing component resilience - Temporarily shutting down individual pods to confirm the system remains available (see the example after this list)
  2. Monitoring the ring status - Using the /ring endpoint to verify all instances are healthy
  3. Load testing - Generating high volumes of log traffic to ensure proper distribution
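One way to run the resilience test is to delete a single pod and confirm the remaining replicas keep serving traffic. A minimal sketch, assuming pod names and labels similar to the manifests above:

bash
# Simulate an ingester failure (the pod name is an example)
kubectl delete pod loki-ingester-0

# The remaining replicas should stay Ready and keep accepting writes
kubectl get pods -l app=loki,component=ingester

# From a pod inside the cluster, the distributor service should still report ready
curl http://loki-distributor:3100/ready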

You can check the ring status with:

bash
curl http://loki:3100/ring

Example output:

json
{
  "status": "success",
  "data": {
    "ingester": {
      "tokens": [
        {
          "id": "ingester-1",
          "state": "ACTIVE",
          "address": "10.0.0.1:3100",
          "timestamp": "2023-05-17T15:20:30Z"
        },
        {
          "id": "ingester-2",
          "state": "ACTIVE",
          "address": "10.0.0.2:3100",
          "timestamp": "2023-05-17T15:20:30Z"
        },
        {
          "id": "ingester-3",
          "state": "ACTIVE",
          "address": "10.0.0.3:3100",
          "timestamp": "2023-05-17T15:20:30Z"
        }
      ]
    }
  }
}

Performance Considerations

When running Loki in HA mode, consider these performance optimizations:

  1. Resource allocation - Ensure each component has adequate CPU and memory
  2. Storage performance - Use high-performance storage backends
  3. Network latency - Deploy components in the same region to minimize latency
  4. Query caching - Enable and tune query caching for better performance (a sketch follows this list)
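For the caching item, Loki's query results cache reduces repeated work for overlapping range queries. A minimal sketch of enabling an in-process results cache (field names follow the Loki 2.x configuration; newer releases expose an embedded_cache option instead, so check your version's docs):

yaml
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_items: 1024
        validity: 24h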

Troubleshooting Common Issues

Issue: Ingesters Not Joining the Ring

Solution: Check Consul connectivity and permissions. Ensure all ingesters can reach the KV store:

bash
kubectl exec -it loki-ingester-0 -- curl "consul:8500/v1/kv/loki/ring/?recurse=true"

Issue: Log Data Not Distributed Evenly

Solution: Verify your replication factor and make sure distributors are configured correctly:

bash
kubectl logs deploy/loki-distributor | grep "distributing"

Issue: Queries Failing in HA Mode

Solution: Check that queriers can access all ingesters and the storage backend:

bash
kubectl logs deploy/loki-querier | grep "error"

Advanced HA Features

Multi-Zone Deployment

For even higher availability, you can deploy Loki across multiple availability zones:

yaml
ingester:
  lifecycler:
    ring:
      zone_awareness_enabled: true
    availability_zone: us-east-1a
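Zone awareness only helps if replicas actually land in different zones, so pair it with pod scheduling that spreads Loki pods across zones. A minimal sketch using a Kubernetes topology spread constraint on the ingester pod template (labels are assumed to match the earlier manifests):

yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: loki
        component: ingester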

Tenant Isolation

In multi-tenant environments, consider isolating resources per tenant:

yaml
limits_config:
  per_tenant_override_config: /etc/loki/tenant-overrides.yaml
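The overrides file itself maps tenant IDs to the limits you want to change for them. A minimal sketch (the tenant names and values are illustrative):

yaml
overrides:
  tenant-a:
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 40
  tenant-b:
    ingestion_rate_mb: 5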

Autoscaling

Configure Kubernetes Horizontal Pod Autoscalers (HPA) to scale components based on load:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: loki-distributor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: loki-distributor
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60

Summary

Setting up Grafana Loki in a high availability configuration is essential for production environments where reliability and scalability are critical. By deploying multiple instances of each component, configuring shared storage, and implementing proper coordination through rings, you can build a robust logging infrastructure that can withstand component failures.

Key takeaways:

  • Use shared object storage for log data
  • Configure a distributed key-value store for ring management
  • Deploy multiple instances of each component
  • Use load balancers to distribute traffic
  • Implement monitoring to detect and address issues quickly

With these practices in place, your Loki deployment will be resilient and capable of handling production workloads reliably.

Additional Resources

  • Practice setting up a simple HA configuration on a local Kubernetes cluster using Minikube
  • Experiment with different replication factors and observe their impact on resilience and resource usage
  • Try simulating failures by shutting down components and observing how the system responds

Exercises

  1. Deploy a basic Loki HA setup with 3 distributors, 3 ingesters, and 2 queriers
  2. Implement a monitoring dashboard to track the health of your Loki components
  3. Create a disaster recovery plan for your Loki deployment, including backup and restore procedures
  4. Configure alerting to notify you when components fail or performance degrades
  5. Benchmark your HA setup to determine maximum log ingestion rates

