Ruler Service in Grafana Loki
Introduction
The Ruler service is a powerful component in Grafana Loki's architecture that extends Loki beyond just log storage and querying. It enables you to automatically evaluate LogQL expressions over time and take actions based on the results. This functionality is split into two main capabilities:
- Recording rules - Pre-compute frequently used queries and store their results
- Alerting rules - Trigger notifications when specific conditions in your logs are met
This guide will walk you through understanding, configuring, and using the Ruler service effectively in your Loki deployment.
Understanding the Ruler Service
The Ruler service in Loki closely follows the design of Prometheus' rule evaluation system, making it familiar for users who have experience with Prometheus. It periodically evaluates expressions defined in rule files and can:
- Store the results of recording rules as new time series in the storage
- Fire alerts when alert conditions are met
Enabling the Ruler Service
Before using the Ruler, you need to make sure it's properly configured in your Loki deployment. The Ruler needs a backend to store rule definitions, which can be local files or objects in a supported object store.
Basic Configuration
Here's a simple configuration to enable the Ruler service with local file storage:
```yaml
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules
  rule_path: /loki/rules-temp
  alertmanager_url: http://alertmanager:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true
```
This configuration:
- Uses local storage for rules (rule file placement is sketched after this list)
- Sets a temporary path for rule evaluation
- Connects to Alertmanager for handling alerts
- Enables the Ruler API to manage rules programmatically
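With local storage, the Ruler loads rule files from a per-tenant subdirectory of the configured directory; in single-tenant deployments (`auth_enabled: false`) the tenant ID is `fake`. A minimal sketch of placing a rule file under that assumption:

```bash
# Rule files live under <ruler.storage.local.directory>/<tenant_id>/.
# With auth_enabled: false the tenant ID is "fake".
mkdir -p /loki/rules/fake
cp rules.yaml /loki/rules/fake/rules.yaml
```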
Creating Recording Rules
Recording rules allow you to pre-compute frequently used or computationally expensive queries. This improves performance and reduces the load on your system.
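Note that the Ruler does not store the resulting series back in Loki: it pushes them to a Prometheus-compatible store via remote write, so recording rules also need a `remote_write` block in the Ruler configuration. A minimal sketch, with a placeholder endpoint URL:

```yaml
ruler:
  remote_write:
    enabled: true
    client:
      # Placeholder URL; point this at your Prometheus, Mimir, or Thanos
      # remote-write endpoint.
      url: http://prometheus:9090/api/v1/write
```

The pre-computed series can then be queried and graphed from that store like any other metric.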
Rule File Format
Recording rules are defined in YAML files with a specific structure. Each `expr` must be a LogQL metric query:

```yaml
groups:
  - name: example_recording_rules
    interval: 1m
    rules:
      - record: job:log_lines:rate5m
        expr: sum by (job) (rate({app="production"}[5m]))
      - record: job:error_lines:rate5m
        expr: sum by (job) (rate({app="production"} |= "error" [5m]))
```
Example: Creating a Recording Rule to Track Error Rates
Let's create a recording rule that calculates the rate of error logs for different services:
```yaml
groups:
  - name: service_errors
    interval: 2m
    rules:
      - record: service:error_rate:5m
        expr: sum by (service) (rate({app="production"} |= "error" [5m]))
```
This rule:
- Creates a new metric called `service:error_rate:5m`
- Calculates the rate of logs containing "error" for each service over 5-minute windows
- Evaluates every 2 minutes
Once the rule has been evaluated and its results remote-written, you can query the pre-computed metric from the Prometheus-compatible store (for example, in Grafana):

```
service:error_rate:5m{service="authentication"}
```
Setting Up Alerting Rules
Alerting rules define conditions that trigger alerts when they're met. These alerts can be sent to an Alertmanager instance for notification routing.
Alert Rule Format
Alert rules follow a similar format to recording rules but include additional fields. The expression is again a LogQL metric query, here compared against a threshold:

```yaml
groups:
  - name: example_alert_rules
    interval: 1m
    rules:
      - alert: HighErrorRate
        expr: sum by (job) (rate({app="production"} |= "error" [5m])) > 0.5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors per second"
```
Example: Creating an Alert for High Error Rates
Let's create an alert that triggers when a service's error rate exceeds a threshold:
```yaml
groups:
  - name: service_alerts
    interval: 1m
    rules:
      - alert: ServiceHighErrorRate
        expr: sum by (service) (rate({app="production"} |= "error" [5m])) > 0.1
        for: 5m
        labels:
          severity: critical
          team: "{{ $labels.service }}"
        annotations:
          summary: "{{ $labels.service }} has a high error rate"
          description: "{{ $labels.service }} is experiencing {{ $value }} errors per second for the last 5 minutes"
```
This alert:
- Triggers when a service's error rate exceeds 0.1 errors per second
- Requires the condition to hold for 5 minutes before firing
- Includes dynamic labels and annotations based on the evaluated data, which Alertmanager can route on (sketched below)
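Because the alert carries `severity` and `team` labels, the Alertmanager it is sent to can use them for notification routing. A minimal, hypothetical Alertmanager routing fragment as an illustration (receiver names are placeholders, not part of this guide):

```yaml
# Hypothetical Alertmanager config fragment: route critical alerts to an
# on-call receiver, everything else to a default receiver.
route:
  receiver: default
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall
receivers:
  - name: default
  - name: oncall
```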
Managing Rules via the API
Loki provides a REST API to manage rules programmatically, allowing you to create, update, and delete rules without modifying files directly.
Listing Rule Groups
The Ruler's configuration API returns all rule groups for the tenant as YAML, keyed by namespace:

```bash
curl -X GET http://loki:3100/loki/api/v1/rules
```

Example response:

```yaml
namespace:
  - name: service_errors
    interval: 2m
    rules:
      - record: service:error_rate:5m
        expr: sum by (service) (rate({app="production"} |= "error" [5m]))
```
Creating or Updating Rules
Rules can be created or updated by POSTing to the Ruler API. The request body is a single rule group in YAML form, and the final URL segment (here literally `namespace`) is the namespace it is stored under. Note that the `local` rule store is read-only; managing rules through the API requires an object-store backend:

```bash
curl -X POST http://loki:3100/loki/api/v1/rules/namespace \
  -H "Content-Type: application/yaml" \
  --data-binary @rules.yaml
```
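For illustration, the `rules.yaml` payload referenced above might contain a single group such as the one defined earlier:

```yaml
# Example payload for POST /loki/api/v1/rules/{namespace}: one rule group.
name: service_errors
interval: 2m
rules:
  - record: service:error_rate:5m
    expr: sum by (service) (rate({app="production"} |= "error" [5m]))
```

A rule group can be removed again with a DELETE request to the same API:

```bash
curl -X DELETE http://loki:3100/loki/api/v1/rules/namespace/service_errors
```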
Real-World Use Cases
Case 1: Service Health Monitoring
Create recording rules to track service health metrics and alerting rules to notify teams of potential issues:
```yaml
groups:
  - name: service_health
    interval: 1m
    rules:
      # Recording rule for service response times
      - record: service:response_time:p95
        expr: |
          quantile_over_time(0.95,
            {app="web-service"}
              | json
              | unwrap response_time_ms
            [5m]
          )
      # Alert when response time is too high. Alert expressions must be LogQL,
      # so the query is repeated here rather than referencing the recorded metric.
      - alert: SlowServiceResponse
        expr: |
          quantile_over_time(0.95,
            {app="web-service"}
              | json
              | unwrap response_time_ms
            [5m]
          ) > 500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Service response time is degraded"
          description: "95th percentile response time is {{ $value }}ms (threshold: 500ms)"
```
Case 2: Security Monitoring
Set up rules to detect security-related events in your logs:
```yaml
groups:
  - name: security_monitoring
    interval: 1m
    rules:
      # Recording rule tracking failed logins
      - record: security:failed_logins:rate5m
        expr: sum by (username) (rate({app="auth-service"} |= "login failed" [5m]))
      # Alert on brute force attempts. Alert expressions must be LogQL,
      # so the query is repeated instead of referencing the recorded metric.
      - alert: PossibleBruteForceAttempt
        expr: sum by (username) (rate({app="auth-service"} |= "login failed" [5m])) > 0.2
        for: 5m
        labels:
          severity: critical
          category: security
        annotations:
          summary: "Possible brute force attack detected"
          description: "User {{ $labels.username }} has {{ $value }} failed login attempts per second"
```
Best Practices
Optimizing Rule Performance
- Limit the number of rules: Each rule consumes resources when evaluated
- Use appropriate intervals: Balance between freshness and resource usage
- Optimize LogQL queries: Complex queries in rules can be expensive
- Use recording rules for complex queries: Pre-compute expensive expressions
Rule Organization
- Group related rules together: Makes management easier
- Use namespaces to organize rules: Separate rules by team or application
- Follow consistent naming conventions: Makes rules more discoverable
Alert Design
- Avoid alert fatigue: Only alert on actionable conditions
- Include context in annotations: Help responders understand the alert
- Set appropriate thresholds: Tune based on normal behavior patterns
- Use the `for` clause wisely: Prevents flapping alerts for transient conditions
Troubleshooting the Ruler Service
Common Issues
- Rules not evaluating: Check Ruler service logs and configuration
- Alertmanager not receiving alerts: Verify the `alertmanager_url` configuration
- High resource usage: Rules may be too complex or evaluated too frequently
Debugging Steps
- Check rule evaluation status through the Prometheus-compatible rules endpoint, which reports each rule's health and last evaluation:

```bash
curl -X GET http://loki:3100/prometheus/api/v1/rules | jq
```

- Verify rule syntax by running the expression as an instant query against Loki (URL-encoding it in the process):

```bash
curl -G http://loki:3100/loki/api/v1/query --data-urlencode 'query=<your rule expression>'
```

- Review Ruler logs for any errors or warnings:

```bash
docker logs <loki-container> 2>&1 | grep -i ruler
```
Summary
The Ruler service extends Loki's capabilities from a log storage and query system to a complete monitoring solution. By using recording rules and alerting rules, you can:
- Pre-compute expensive queries to improve performance
- Automate anomaly detection in your logs
- Create alerts for important conditions in your applications
- Build a comprehensive monitoring system alongside metrics
The Ruler's design will feel familiar to Prometheus users, making it easy to adopt if you're already using Prometheus for metrics monitoring.
Exercises
- Create a recording rule that tracks the rate of HTTP 500 errors across different services.
- Set up an alert that triggers when a specific log pattern appears more than 10 times in 5 minutes.
- Modify an existing alert to include more context in its annotations.
- Create a dashboard in Grafana that visualizes data from your recording rules.
- Write a script that uses the Ruler API to dynamically update alert thresholds based on time of day.