# Loki Health Checks

## Introduction
Health checks are essential components of any reliable system, and Grafana Loki is no exception. Health checks provide a way to monitor the operational status of Loki components, allowing you to detect and respond to issues before they cause significant problems. In this guide, we'll explore what health checks are in the context of Loki, why they're important, and how to implement and use them effectively.
## What Are Loki Health Checks?
Health checks in Loki are endpoint-based monitoring tools that report on the status and availability of different Loki components. These checks help you:
- Identify when components are experiencing issues
- Automate recovery procedures
- Provide visibility into system health
- Support load balancing decisions
- Enable proactive monitoring and alerting
## Loki's Built-in Health Check Endpoints
Loki provides several built-in HTTP endpoints for health checking:
### 1. `/ready` Endpoint

The `/ready` endpoint indicates whether a Loki component is ready to receive traffic.
```bash
# Example request
curl http://localhost:3100/ready

# Example output when healthy
ready
```
When a component is not ready, it will return a non-200 HTTP status code.
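For scripting, it is often enough to capture just the status code. A minimal sketch, assuming Loki is listening on `localhost:3100`:

```bash
# Print only the HTTP status code returned by the readiness endpoint
status=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:3100/ready)
echo "Loki /ready returned HTTP ${status}"
```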
### 2. `/metrics` Endpoint

The `/metrics` endpoint exposes Prometheus metrics that can be used for health monitoring.
```bash
# Example request
curl http://localhost:3100/metrics

# Partial example output
# HELP loki_distributor_received_samples_total The total number of received samples.
# TYPE loki_distributor_received_samples_total counter
loki_distributor_received_samples_total 12345
```
### 3. `/healthz` Endpoint

The `/healthz` endpoint provides a basic check to determine if the service is running.
```bash
# Example request
curl http://localhost:3100/healthz

# Example output when healthy
OK
```
## Implementing Custom Health Checks
Besides the built-in endpoints, you can also create custom health checks for Loki. Here's how to set up a comprehensive health monitoring system:
### Step 1: Configure Loki Components to Expose Health Endpoints
Ensure all Loki components have their health check endpoints properly configured in your Loki configuration file:
```yaml
server:
  http_listen_port: 3100
  # HTTP server timeouts that affect health check responsiveness
  graceful_shutdown_timeout: 5s
  http_server_read_timeout: 30s
  http_server_write_timeout: 30s
  http_server_idle_timeout: 120s

distributor:
  ring:
    kvstore:
      store: inmemory

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /tmp/loki/boltdb-shipper-active
    cache_location: /tmp/loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: filesystem
  filesystem:
    directory: /tmp/loki/chunks

compactor:
  working_directory: /tmp/loki/compactor
  shared_store: filesystem
```
### Step 2: Create Prometheus Alerting Rules
Create Prometheus alerting rules to monitor Loki component health:
```yaml
groups:
  - name: loki_health_alerts
    rules:
      - alert: LokiComponentDown
        expr: up{job=~"loki.*"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Loki component {{ $labels.job }} on {{ $labels.instance }} is down"
          description: "The Loki component has been down for more than 5 minutes."

      - alert: LokiHighErrorRate
        expr: sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[5m])) by (job, route) / sum(rate(loki_request_duration_seconds_count[5m])) by (job, route) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Loki {{ $labels.job }} has high error rate on {{ $labels.route }}"
          description: "The Loki component {{ $labels.job }} has a high error rate (>5%) on the {{ $labels.route }} endpoint."
```
### Step 3: Implement Health Check Probes in Kubernetes
If you're running Loki in Kubernetes, you can implement health check probes in your deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki
spec:
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
        - name: loki
          image: grafana/loki:latest
          ports:
            - containerPort: 3100
              name: http-metrics
          readinessProbe:
            httpGet:
              path: /ready
              port: http-metrics
            initialDelaySeconds: 15
            timeoutSeconds: 1
          livenessProbe:
            httpGet:
              path: /healthz
              port: http-metrics
            initialDelaySeconds: 30
            timeoutSeconds: 1
          startupProbe:
            httpGet:
              path: /ready
              port: http-metrics
            failureThreshold: 30
            periodSeconds: 10
```
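Once the Deployment is applied, you can verify that the probes pass. A small sketch, assuming the manifest above is saved as `loki-deployment.yaml` and applied to the current namespace:

```bash
# Apply the manifest and check that the pod becomes Ready
kubectl apply -f loki-deployment.yaml
kubectl get pods -l app=loki

# Probe failures are reported as events in the pod description
kubectl describe pod -l app=loki | grep -A 5 Events
```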
## Monitoring Loki Health Metrics
Besides the basic health check endpoints, Loki exposes detailed metrics that provide insights into its operational health:
### Key Health Metrics to Monitor

- **Query Performance**

  ```promql
  histogram_quantile(0.99, sum(rate(loki_query_frontend_query_duration_seconds_bucket[5m])) by (le))
  ```

- **Ingestion Rate**

  ```promql
  sum(rate(loki_distributor_bytes_received_total[5m]))
  ```

- **Memory Usage**

  ```promql
  container_memory_usage_bytes{container="loki"}
  ```

- **Storage Operations**

  ```promql
  sum(rate(loki_ingester_chunk_size_bytes_bucket[5m])) by (le)
  ```

- **Request Failures**

  ```promql
  sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[5m])) by (route)
  ```
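You can spot-check any of these expressions outside of Grafana by running them against the Prometheus HTTP API. A minimal sketch, assuming Prometheus is reachable on `localhost:9090`:

```bash
# Evaluate the ingestion-rate expression with an instant query
curl -s -G "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(loki_distributor_bytes_received_total[5m]))'
```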
## Creating a Loki Health Dashboard

You can create a comprehensive health dashboard in Grafana to monitor Loki, with panels built from the key metrics above: query latency, ingestion rate, memory usage, storage operations, and request failures.
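Dashboards can also be created programmatically through Grafana's dashboard API. The sketch below pushes a minimal one-panel dashboard; it assumes the `admin`/`admin` credentials used in the Docker Compose example later in this guide and a Prometheus data source configured as the default:

```bash
# Create a minimal Loki health dashboard via the Grafana HTTP API
curl -s -X POST -u admin:admin \
  -H 'Content-Type: application/json' \
  http://localhost:3000/api/dashboards/db \
  --data '{
    "dashboard": {
      "title": "Loki Health",
      "panels": [
        {
          "title": "Request error rate (5xx)",
          "type": "timeseries",
          "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
          "targets": [
            {"expr": "sum(rate(loki_request_duration_seconds_count{status_code=~\"5..\"}[5m])) by (route)"}
          ]
        }
      ]
    },
    "overwrite": true
  }'
```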
## Practical Example: Setting Up Comprehensive Loki Health Monitoring
Let's walk through a practical example of setting up a comprehensive health monitoring system for Loki.
### 1. Create a Docker Compose Setup for Loki with Prometheus and Grafana
```yaml
version: '3'

services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    depends_on:
      - loki

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    depends_on:
      - prometheus
      - loki
```
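With these files in place, you can bring up the stack and confirm that Loki answers its health endpoints from the host. A minimal smoke test, assuming the Compose file is in the current directory:

```bash
# Start the stack and verify Loki's readiness endpoint
docker compose up -d   # or: docker-compose up -d
curl -s http://localhost:3100/ready
```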
### 2. Configure Prometheus to Scrape Loki Metrics
Create a `prometheus.yml` file:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'loki'
    static_configs:
      - targets: ['loki:3100']
```
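After Prometheus picks up this configuration, you can confirm that the Loki target is being scraped and reported as healthy:

```bash
# List scrape targets and their health as seen by Prometheus
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'
```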
### 3. Create a Health Check Script
Create a health check script that can be run periodically:
```bash
#!/bin/bash

LOKI_HOST="localhost:3100"
ALERT_THRESHOLD=500

# Check if Loki is ready
ready_check=$(curl -s -o /dev/null -w "%{http_code}" "http://${LOKI_HOST}/ready")
if [ "$ready_check" -ne 200 ]; then
  echo "CRITICAL: Loki readiness check failed with HTTP code $ready_check"
  exit 2
fi

# Check if Loki is healthy
health_check=$(curl -s -o /dev/null -w "%{http_code}" "http://${LOKI_HOST}/healthz")
if [ "$health_check" -ne 200 ]; then
  echo "CRITICAL: Loki health check failed with HTTP code $health_check"
  exit 2
fi

# Check query performance (anchor the grep so the # HELP / # TYPE lines are skipped)
query_time=$(curl -s "http://${LOKI_HOST}/metrics" | grep '^loki_query_frontend_query_duration_seconds_sum' | head -n1 | awk '{print $2}')
if [ -n "$query_time" ] && (( $(echo "$query_time > $ALERT_THRESHOLD" | bc -l) )); then
  echo "WARNING: Query performance is degraded, taking ${query_time}s"
  exit 1
fi

echo "OK: All Loki health checks passed"
exit 0
```
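To run the script periodically, a cron entry is the simplest option. A sketch, assuming the script above is saved as `/usr/local/bin/loki-health-check.sh`:

```bash
# Install the script and schedule it every 5 minutes via cron
chmod +x /usr/local/bin/loki-health-check.sh
( crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/loki-health-check.sh >> /var/log/loki-health.log 2>&1" ) | crontab -
```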
## Troubleshooting Common Health Check Issues
### Issue: Loki Components Show as Not Ready
Possible Causes:
- Recently started components haven't completed initialization
- Configuration errors
- Resource constraints
Solutions:
- Check Loki logs for initialization errors
- Review configuration for mistakes
- Ensure adequate resources are allocated
### Issue: High Error Rates in Requests
Possible Causes:
- Incorrect query syntax
- Resource exhaustion
- Rate limits being hit
Solutions:
- Review query patterns for optimization
- Scale up resources
- Adjust rate limits in configuration, for example:
```yaml
limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 10MB
  per_stream_rate_limit_burst: 15MB
```
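To confirm whether rate limiting is actually the cause, Loki's own metrics report rejected log entries along with the reason. A quick check, assuming Loki is reachable on `localhost:3100`:

```bash
# Log entries discarded by Loki, broken down by reason (e.g. rate_limited)
curl -s http://localhost:3100/metrics | grep '^loki_discarded_samples_total'
```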
### Issue: High Memory Usage
Possible Causes:
- Too many in-memory queries
- Large query ranges
- Inefficient chunk caching
Solutions:
- Limit concurrent queries
- Restrict query time ranges
- Tune cache size, for example:
```yaml
query_range:
  split_queries_by_interval: 30m
  cache_results: true
  max_retries: 5

limits_config:
  max_query_parallelism: 16
  max_query_series: 10000
```
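To watch the effect of these limits, Loki's built-in Go runtime metrics give a quick view of heap usage without needing cAdvisor or container-level metrics:

```bash
# Inspect Loki's heap usage directly from its metrics endpoint
curl -s http://localhost:3100/metrics | grep '^go_memstats_heap_inuse_bytes'
```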
## Proactive Health Management with Alerting
Beyond basic health checks, implementing proactive alerting helps prevent issues before they impact users.
### 1. Set Up Grafana Alerting for Loki
Configure alerting rules in Grafana to notify you of potential issues:
An example Grafana alert rule, expressed as JSON:

```json
{
  "alertName": "LokiQueryLatencyHigh",
  "expr": "histogram_quantile(0.99, sum(rate(loki_query_frontend_query_duration_seconds_bucket[5m])) by (le)) > 10",
  "for": "5m",
  "labels": {
    "severity": "warning"
  },
  "annotations": {
    "summary": "Loki query latency is high",
    "description": "99th percentile query latency is above 10 seconds for the last 5 minutes."
  }
}
```
### 2. Implement Automated Recovery Actions
You can set up automated recovery actions triggered by your monitoring system:
```bash
#!/bin/bash
# Example recovery script for Loki OOM issues

LOKI_CONTAINER=$(docker ps -q --filter name=loki)

# docker stats prints MEM % in the 7th column (e.g. "92.50%"); strip the % sign before comparing
mem_pct=$(docker stats --no-stream "$LOKI_CONTAINER" | awk 'NR>1 {gsub(/%/, "", $7); print $7}')

if (( $(echo "$mem_pct > 90" | bc -l) )); then
  echo "Loki memory usage high, restarting container..."
  docker restart "$LOKI_CONTAINER"

  # Notify administrators
  curl -X POST -H 'Content-type: application/json' \
    --data '{"text":"Loki container restarted due to high memory usage"}' \
    "$WEBHOOK_URL"
fi
```
## Summary
Health checks are a critical component of maintaining a reliable and performant Loki deployment. By implementing proper health check endpoints, monitoring key metrics, setting up alerting, and having automated recovery procedures, you can ensure that your Loki installation remains healthy and available.
In this guide, we've covered:
- The importance of health checks in Loki
- Built-in health check endpoints
- Implementing custom health checks
- Monitoring key health metrics
- Creating a health dashboard
- Practical implementation examples
- Troubleshooting common health issues
- Proactive health management with alerting
## Additional Resources
To further enhance your understanding of Loki health checks:
- Explore the Loki documentation for detailed information
- Review Prometheus alerting best practices
- Practice implementing health checks in a test environment before deploying to production
- Experiment with different alert thresholds to find the right balance for your environment
## Exercises
- Set up a local Loki instance and configure the built-in health check endpoints
- Create a Grafana dashboard to visualize key Loki health metrics
- Implement a Prometheus alerting rule for Loki component availability
- Write a script to periodically check Loki health and send notifications on failure
- Simulate a failure scenario and test your recovery procedures