Ruler Service in Grafana Loki
Introduction
The Ruler service is a powerful component in Grafana Loki's architecture that extends Loki beyond just log storage and querying. It enables you to automatically evaluate LogQL expressions over time and take actions based on the results. This functionality is split into two main capabilities:
- Recording rules - Pre-compute frequently used queries and store their results
- Alerting rules - Trigger notifications when specific conditions in your logs are met
This guide will walk you through understanding, configuring, and using the Ruler service effectively in your Loki deployment.
Understanding the Ruler Service
The Ruler service in Loki closely follows the design of Prometheus' rule evaluation system, making it familiar for users who have experience with Prometheus. It periodically evaluates expressions defined in rule files and can:
- Store the results of recording rules as new time series in the storage
- Fire alerts when alert conditions are met
Enabling the Ruler Service
Before using the Ruler, you need to make sure it's properly configured in your Loki deployment. The Ruler needs a backend to store rule definitions, which can be local files or objects in a supported object store.
Basic Configuration
Here's a simple configuration to enable the Ruler service with local file storage:
```yaml
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules
  rule_path: /loki/rules-temp
  alertmanager_url: http://alertmanager:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true
```
This configuration:
- Uses local storage for rules (rule file placement is sketched after this list)
- Sets a temporary path for rule evaluation
- Connects to Alertmanager for handling alerts
- Enables the Ruler API to manage rules programmatically
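With local storage, the Ruler loads rule files from a per-tenant subdirectory of the configured directory; in single-tenant deployments (`auth_enabled: false`) the tenant ID is `fake`. A minimal sketch of placing a rule file under that assumption:

```bash
# Rule files live under <ruler.storage.local.directory>/<tenant_id>/.
# With auth_enabled: false the tenant ID is "fake".
mkdir -p /loki/rules/fake
cp rules.yaml /loki/rules/fake/rules.yaml
```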
Creating Recording Rules
Recording rules allow you to pre-compute frequently used or computationally expensive queries. This improves performance and reduces the load on your system.
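Note that the Ruler does not store the resulting series back in Loki: it pushes them to a Prometheus-compatible store via remote write, so recording rules also need a `remote_write` block in the Ruler configuration. A minimal sketch, with a placeholder endpoint URL:

```yaml
ruler:
  remote_write:
    enabled: true
    client:
      # Placeholder URL; point this at your Prometheus, Mimir, or Thanos
      # remote-write endpoint.
      url: http://prometheus:9090/api/v1/write
```

The pre-computed series can then be queried and graphed from that store like any other metric.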
Rule File Format
Recording rules are defined in YAML files with a specific structure. Each `expr` must be a LogQL metric query:

```yaml
groups:
  - name: example_recording_rules
    interval: 1m
    rules:
      - record: job:log_lines:rate5m
        expr: sum by (job) (rate({app="production"}[5m]))
      - record: job:error_lines:rate5m
        expr: sum by (job) (rate({app="production"} |= "error" [5m]))
```
Example: Creating a Recording Rule to Track Error Rates
Let's create a recording rule that calculates the rate of error logs for different services:
```yaml
groups:
  - name: service_errors
    interval: 2m
    rules:
      - record: service:error_rate:5m
        expr: sum by (service) (rate({app="production"} |= "error" [5m]))
```
This rule:
- Creates a new metric called `service:error_rate:5m`
- Calculates the rate of logs containing "error" for each service over 5-minute windows
- Evaluates every 2 minutes
Once the rule has been evaluated and its results remote-written, you can query the pre-computed metric from the Prometheus-compatible store (for example, in Grafana):

```
service:error_rate:5m{service="authentication"}
```
Setting Up Alerting Rules
Alerting rules define conditions that trigger alerts when they're met. These alerts can be sent to an Alertmanager instance for notification routing.
Alert Rule Format
Alert rules follow a similar format to recording rules but include additional fields. The expression is again a LogQL metric query, here compared against a threshold:

```yaml
groups:
  - name: example_alert_rules
    interval: 1m
    rules:
      - alert: HighErrorRate
        expr: sum by (job) (rate({app="production"} |= "error" [5m])) > 0.5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors per second"
```
Example: Creating an Alert for High Error Rates
Let's create an alert that triggers when a service's error rate exceeds a threshold:
```yaml
groups:
  - name: service_alerts
    interval: 1m
    rules:
      - alert: ServiceHighErrorRate
        expr: sum by (service) (rate({app="production"} |= "error" [5m])) > 0.1
        for: 5m
        labels:
          severity: critical
          team: "{{ $labels.service }}"
        annotations:
          summary: "{{ $labels.service }} has a high error rate"
          description: "{{ $labels.service }} is experiencing {{ $value }} errors per second for the last 5 minutes"
```
This alert:
- Triggers when a service's error rate exceeds 0.1 errors per second
- Requires the condition to hold for 5 minutes before firing
- Includes dynamic labels and annotations based on the evaluated data, which Alertmanager can route on (sketched below)
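Because the alert carries `severity` and `team` labels, the Alertmanager it is sent to can use them for notification routing. A minimal, hypothetical Alertmanager routing fragment as an illustration (receiver names are placeholders, not part of this guide):

```yaml
# Hypothetical Alertmanager config fragment: route critical alerts to an
# on-call receiver, everything else to a default receiver.
route:
  receiver: default
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall
receivers:
  - name: default
  - name: oncall
```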
Managing Rules via the API
Loki provides a REST API to manage rules programmatically, allowing you to create, update, and delete rules without modifying files directly.
Listing Rule Groups
The Ruler's configuration API returns all rule groups for the tenant as YAML, keyed by namespace:

```bash
curl -X GET http://loki:3100/loki/api/v1/rules
```

Example response:

```yaml
namespace:
  - name: service_errors
    interval: 2m
    rules:
      - record: service:error_rate:5m
        expr: sum by (service) (rate({app="production"} |= "error" [5m]))
```
Creating or Updating Rules
Rules can be created or updated by POSTing to the Ruler API. The request body is a single rule group in YAML form, and the final URL segment (here literally `namespace`) is the namespace it is stored under. Note that the `local` rule store is read-only; managing rules through the API requires an object-store backend:

```bash
curl -X POST http://loki:3100/loki/api/v1/rules/namespace \
  -H "Content-Type: application/yaml" \
  --data-binary @rules.yaml
```
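For illustration, the `rules.yaml` payload referenced above might contain a single group such as the one defined earlier:

```yaml
# Example payload for POST /loki/api/v1/rules/{namespace}: one rule group.
name: service_errors
interval: 2m
rules:
  - record: service:error_rate:5m
    expr: sum by (service) (rate({app="production"} |= "error" [5m]))
```

A rule group can be removed again with a DELETE request to the same API:

```bash
curl -X DELETE http://loki:3100/loki/api/v1/rules/namespace/service_errors
```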
Real-World Use Cases
Case 1: Service Health Monitoring
Create recording rules to track service health metrics and alerting rules to notify teams of potential issues:
```yaml
groups:
  - name: service_health
    interval: 1m
    rules:
      # Recording rule for service response times
      - record: service:response_time:p95
        expr: |
          quantile_over_time(0.95,
            {app="web-service"}
              | json
              | unwrap response_time_ms
            [5m]
          )
      # Alert when response time is too high. Alert expressions must be LogQL,
      # so the query is repeated here rather than referencing the recorded metric.
      - alert: SlowServiceResponse
        expr: |
          quantile_over_time(0.95,
            {app="web-service"}
              | json
              | unwrap response_time_ms
            [5m]
          ) > 500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Service response time is degraded"
          description: "95th percentile response time is {{ $value }}ms (threshold: 500ms)"
```
Case 2: Security Monitoring
Set up rules to detect security-related events in your logs:
```yaml
groups:
  - name: security_monitoring
    interval: 1m
    rules:
      # Recording rule tracking failed logins
      - record: security:failed_logins:rate5m
        expr: sum by (username) (rate({app="auth-service"} |= "login failed" [5m]))
      # Alert on brute force attempts. Alert expressions must be LogQL,
      # so the query is repeated instead of referencing the recorded metric.
      - alert: PossibleBruteForceAttempt
        expr: sum by (username) (rate({app="auth-service"} |= "login failed" [5m])) > 0.2
        for: 5m
        labels:
          severity: critical
          category: security
        annotations:
          summary: "Possible brute force attack detected"
          description: "User {{ $labels.username }} has {{ $value }} failed login attempts per second"
```
Best Practices
Optimizing Rule Performance
- Limit the number of rules: Each rule consumes resources when evaluated
- Use appropriate intervals: Balance between freshness and resource usage
- Optimize LogQL queries: Complex queries in rules can be expensive
- Use recording rules for complex queries: Pre-compute expensive expressions
Rule Organization
- Group related rules together: Makes management easier
- Use namespaces to organize rules: Separate rules by team or application
- Follow consistent naming conventions: Makes rules more discoverable
Alert Design
- Avoid alert fatigue: Only alert on actionable conditions
- Include context in annotations: Help responders understand the alert
- Set appropriate thresholds: Tune based on normal behavior patterns
- Use the `for` clause wisely: Prevents flapping alerts for transient conditions
Troubleshooting the Ruler Service
Common Issues
- Rules not evaluating: Check Ruler service logs and configuration
- Alertmanager not receiving alerts: Verify the `alertmanager_url` configuration
- High resource usage: Rules may be too complex or evaluated too frequently
Debugging Steps
- Check rule evaluation status through the Prometheus-compatible rules endpoint, which reports each rule's health and last evaluation:

```bash
curl -X GET http://loki:3100/prometheus/api/v1/rules | jq
```

- Verify rule syntax by running the expression as an instant query against Loki (URL-encoding it in the process):

```bash
curl -G http://loki:3100/loki/api/v1/query --data-urlencode 'query=<your rule expression>'
```

- Review Ruler logs for any errors or warnings:

```bash
docker logs <loki-container> 2>&1 | grep -i ruler
```
Summary
The Ruler service extends Loki's capabilities from a log storage and query system to a complete monitoring solution. By using recording rules and alerting rules, you can:
- Pre-compute expensive queries to improve performance
- Automate anomaly detection in your logs
- Create alerts for important conditions in your applications
- Build a comprehensive monitoring system alongside metrics
The Ruler's design will feel familiar to Prometheus users, making it easy to adopt if you're already using Prometheus for metrics monitoring.
Exercises
- Create a recording rule that tracks the rate of HTTP 500 errors across different services.
- Set up an alert that triggers when a specific log pattern appears more than 10 times in 5 minutes.
- Modify an existing alert to include more context in its annotations.
- Create a dashboard in Grafana that visualizes data from your recording rules.
- Write a script that uses the Ruler API to dynamically update alert thresholds based on time of day.