Runbooks Creation
Introduction
Runbooks are documented procedures that provide step-by-step instructions for responding to specific incidents or alerts in your monitoring system. They serve as crucial reference material during incidents, helping teams respond effectively and consistently even under pressure. In the context of Grafana Loki, runbooks connect your alerting system to actionable response plans, turning alerts into resolved incidents faster.
What is a Runbook?
A runbook is a documented procedure that:
- Provides clear, step-by-step instructions for handling specific alerts or incidents
- Helps standardize incident response processes
- Reduces mean time to resolution (MTTR)
- Empowers team members of all experience levels to respond effectively
- Serves as both documentation and a learning resource
Why Runbooks Matter for Loki Monitoring
When monitoring logs with Grafana Loki, you'll set up alerts for various conditions - high error rates, specific error patterns, performance thresholds, etc. Without runbooks:
- Alert fatigue becomes common as teams receive notifications without clear next steps
- Resolution times increase as responders investigate from scratch each time
- Knowledge remains siloed with experienced team members
- Incident response quality varies based on who's responding
Effective runbooks transform Loki alerts from notifications into actionable information that leads directly to resolution.
Creating Effective Runbooks
Step 1: Identify Alert Scenarios
First, identify the specific alert conditions that warrant runbooks. For Loki, common scenarios include:
- High error rates in application logs
- Specific error patterns in logs (e.g., authentication failures)
- Missing logs or logging gaps
- Performance-related log entries
- Security-related log patterns
For each scenario, document:
- Alert name and description
- Alert severity level
- Systems/services affected
- Team responsible for response
Step 2: Establish the Runbook Structure
Create a consistent structure for all your runbooks. A good structure includes:
# [Alert Name] Runbook
## Alert Description
[Brief description of what triggered the alert]
## Impact
[Description of business/user impact]
## Investigation Steps
1. [First step]
2. [Second step]
3. [Third step]
## Resolution Steps
1. [First resolution step]
2. [Second resolution step]
3. [Third resolution step]
## Prevention
[Steps to prevent future occurrences]
## Additional Resources
[Links to dashboards, documentation, etc.]
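To keep every runbook on the same structure, the template above can be generated rather than copied by hand. The following is a minimal sketch; the function name `new_runbook` and the `runbooks/` path in the usage note are illustrative assumptions, not part of any tool.

```bash
#!/usr/bin/env bash
# new_runbook: print a runbook skeleton for the given alert name.
# The section list mirrors the template above; the function name is hypothetical.
new_runbook() {
  local alert_name="$1"
  cat <<EOF
# ${alert_name} Runbook
## Alert Description
[Brief description of what triggered the alert]
## Impact
[Description of business/user impact]
## Investigation Steps
1. [First step]
## Resolution Steps
1. [First resolution step]
## Prevention
[Steps to prevent future occurrences]
## Additional Resources
[Links to dashboards, documentation, etc.]
EOF
}
```

For example, `new_runbook "High Error Rate" > runbooks/high-error-rate.md` would create a new file ready to be filled in.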
Step 3: Write Clear Investigation Steps
The investigation section should guide the responder through troubleshooting. For Loki alerts, include:
- Specific LogQL queries to run for further investigation
- Dashboards to check
- Related metrics to examine
- Common causes of this alert
Example for a "High Error Rate" alert:
## Investigation Steps
1. Open Grafana Explore with the Loki data source at [dashboard URL]
2. Run this LogQL query to see the overall error rate:
   ```logql
   sum(rate({app="myapp"} |= "ERROR" [5m]))
   ```
3. Identify which service is generating the most errors:
   ```logql
   sum by (service) (
     rate({app="myapp"} |= "ERROR" | pattern "<_> - <service> - <_>" [5m])
   )
   ```
4. Check for deployment events that coincide with the error spike:
   ```logql
   {app="myapp"} |= "Deployment"
   ```
5. Check system metrics for the affected service
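Responders sometimes need to run these queries from a terminal rather than from Grafana. The sketch below builds the error-rate query and sends it to Loki's instant-query endpoint (`/loki/api/v1/query`). The function names and the `LOKI_URL` variable (defaulting to `http://loki:3100`) are assumptions for illustration.

```bash
#!/usr/bin/env bash
# error_rate_query: build the LogQL error-rate query for one app.
# The second argument is the rate window and defaults to 5m.
error_rate_query() {
  local app="$1" window="${2:-5m}"
  printf 'sum(rate({app="%s"} |= "ERROR" [%s]))' "$app" "$window"
}

# run_instant_query: send a LogQL instant query to Loki's HTTP API.
# LOKI_URL is an assumed environment variable, e.g. http://loki:3100.
run_instant_query() {
  local query="$1"
  # -G sends the data as URL parameters; --data-urlencode escapes the query.
  curl -s -G "${LOKI_URL:-http://loki:3100}/loki/api/v1/query" \
    --data-urlencode "query=${query}"
}
```

A runbook step could then read `run_instant_query "$(error_rate_query myapp)"`, which avoids hand-escaping braces and quotes in the URL.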
Step 4: Include Resolution Steps
Resolution steps should be specific and actionable:
## Resolution Steps
1. If errors are concentrated in the authentication service:
   a. Check the auth database connection:
      ```bash
      # TCP-level probe only: curl cannot speak the PostgreSQL protocol,
      # so look for "connection refused" or a timeout rather than a valid response
      kubectl exec -it auth-service-pod -- curl db-host:5432
      ```
   b. Verify auth service logs for connection issues:
      ```logql
      {app="auth-service"} |= "connection"
      ```
   c. Restart the auth service if needed:
      ```bash
      kubectl rollout restart deployment/auth-service
      ```
2. If errors are distributed across services:
   a. Check for network issues:
      ```bash
      kubectl get events --field-selector type=Warning
      ```
   b. Verify external dependency availability
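A restart is only a resolution once the new pods are actually healthy, so it can help to pair the restart with a check. The sketch below wraps `kubectl rollout restart` and `kubectl rollout status`; the function name and the default namespace and timeout are assumptions.

```bash
#!/usr/bin/env bash
# restart_and_verify: restart a deployment and wait until the rollout completes.
# Returns non-zero if the restart fails or the rollout does not finish in time.
restart_and_verify() {
  local deployment="$1" namespace="${2:-default}" timeout="${3:-120s}"
  kubectl rollout restart "deployment/${deployment}" -n "$namespace" || return 1
  # rollout status blocks until the rollout succeeds, fails, or times out
  kubectl rollout status "deployment/${deployment}" -n "$namespace" --timeout="$timeout"
}
```

For the example above this would be `restart_and_verify auth-service`. A non-zero exit tells the responder to move on to the next resolution step.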
Step 5: Link Runbooks to Alerts
In Grafana Alerting, configure alerts to reference your runbooks:
1. In your alert rule, add annotations:
   ```yaml
   annotations:
     runbook_url: "https://your-wiki.example.com/runbooks/high-error-rate"
     summary: "High error rate detected in application logs"
     description: "Error rate exceeded threshold of 5 errors/second for 5 minutes"
   ```
2. Use alert templating to add context:
   ```yaml
   annotations:
     description: "{{ $labels.app }} is showing {{ $value }} errors per second"
     runbook_url: "https://your-wiki.example.com/runbooks/{{ $labels.alert_type }}"
   ```
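A templated `runbook_url` only helps if a page exists for every value the label can take. The sketch below expands the URL pattern for a list of `alert_type` values and checks that each page responds. The function names and the `RUNBOOK_BASE_URL` variable are assumptions; the default base URL is the placeholder used above.

```bash
#!/usr/bin/env bash
# runbook_url_for: expand the runbook URL pattern for one alert_type value.
runbook_url_for() {
  printf '%s/%s' "${RUNBOOK_BASE_URL:-https://your-wiki.example.com/runbooks}" "$1"
}

# check_runbook_urls: report every alert_type whose runbook page is unreachable.
# Returns non-zero if at least one URL fails.
check_runbook_urls() {
  local alert_type url missing=0
  for alert_type in "$@"; do
    url="$(runbook_url_for "$alert_type")"
    # -f makes curl fail on HTTP errors (4xx/5xx); -L follows redirects
    if ! curl -fsSL --max-time 10 -o /dev/null "$url"; then
      echo "missing runbook: ${url}"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_runbook_urls high-error-rate log-volume-drop` on a schedule would surface broken links before an incident does.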
Practical Example: Creating a "Log Volume Drop" Runbook
Let's create a complete runbook for a scenario where Loki detects a significant drop in log volume, which could indicate service problems or logging pipeline issues.
# Log Volume Drop Runbook
## Alert Description
This alert triggers when the volume of logs from a service drops significantly (>50%) compared to the baseline for that time period.
## Impact
Missing logs may indicate:
- Service is down or degraded
- Logging pipeline failure
- Configuration change affecting logging
This impacts our ability to monitor service health and troubleshoot issues.
## Investigation Steps
1. Verify if the service is running:
   ```bash
   kubectl get pods -l app=affected-service
   ```
2. Check if logs are being generated but not ingested:
   a. Check the service directly:
      ```bash
      kubectl logs -l app=affected-service --tail=20
      ```
   b. Check Loki ingestion metrics:
      ```promql
      rate(loki_distributor_bytes_received_total[5m])
      ```
3. Check for recent deployments or changes:
   ```logql
   {app="ci-cd"} |= "deployment" |= "affected-service"
   ```
4. Examine Loki components health:
   ```bash
   kubectl get pods -n loki
   ```
5. Check storage backend status:
   ```promql
   loki_boltdb_shipper_upload_failures_total
   ```
## Resolution Steps
1. If service is not running:
   a. Check for events:
      ```bash
      kubectl describe pod -l app=affected-service
      ```
   b. Start the service if needed:
      ```bash
      kubectl scale deployment affected-service --replicas=1
      ```
2. If logging pipeline is broken:
   a. Check Promtail/Fluentd/Vector configurations
   b. Restart log agents:
      ```bash
      kubectl rollout restart daemonset promtail -n loki
      ```
3. If Loki components are failing:
   a. Check resource usage:
      ```bash
      kubectl top pods -n loki
      ```
   b. Restart affected components:
      ```bash
      kubectl rollout restart statefulset/loki -n loki
      ```
4. If storage issues are occurring:
   a. Check storage capacity
   b. Verify permissions
   c. Check network connectivity to storage
## Prevention
1. Implement log volume monitoring:
   ```logql
   sum(rate({app="affected-service"}[5m])) by (app)
   ```
2. Create dashboards showing log volume by service
3. Set up alerts for log pipeline components
4. Document logging configuration
## Additional Resources
- Loki Troubleshooting Documentation
- Logging Infrastructure Diagram
- Loki Components Dashboard
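The alert condition in this runbook (a drop of more than 50% against a baseline) is simple arithmetic, and a responder may want to recompute it by hand from two query results. The sketch below assumes the baseline and current volumes have already been fetched as plain numbers; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# volume_drop_percent: percentage drop of the current volume relative to baseline.
# Prints 0 when the baseline is zero or negative, since no drop can be computed.
volume_drop_percent() {
  local baseline="$1" current="$2"
  awk -v b="$baseline" -v c="$current" 'BEGIN {
    if (b <= 0) { print "0"; exit }
    printf "%.1f", (b - c) / b * 100
  }'
}

# is_volume_drop: succeed when the drop exceeds the threshold (default 50 percent).
is_volume_drop() {
  local threshold="${3:-50}"
  awk -v d="$(volume_drop_percent "$1" "$2")" -v t="$threshold" \
    'BEGIN { exit !(d > t) }'
}
```

With a baseline of 1200 lines and a current count of 480, `volume_drop_percent 1200 480` prints `60.0`, which is above the 50% threshold, so `is_volume_drop 1200 480` succeeds.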
Testing and Improving Runbooks
Runbooks should be living documents that improve over time:
1. Conduct runbook drills - Practice using runbooks in simulated incidents
2. Post-incident reviews - After real incidents, update runbooks with new learnings
3. Rotation practice - Have new team members follow runbooks to identify unclear steps
4. Version control - Keep runbooks in a version-controlled repository
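Once runbooks live in a repository, a small check can run on every change to confirm that each file still has the agreed sections. This is a sketch that assumes the section names from the structure in Step 2; the function name `lint_runbook` is hypothetical.

```bash
#!/usr/bin/env bash
# lint_runbook: report required sections that are missing from a runbook file.
# Returns non-zero if any section heading is absent.
lint_runbook() {
  local file="$1" section status=0
  for section in "Alert Description" "Impact" "Investigation Steps" \
                 "Resolution Steps" "Prevention" "Additional Resources"; do
    # -x matches the whole line, -F treats the heading as a fixed string
    if ! grep -qxF "## ${section}" "$file"; then
      echo "${file}: missing section '${section}'"
      status=1
    fi
  done
  return "$status"
}
```

In a CI job this could be called as `for f in runbooks/*.md; do lint_runbook "$f" || exit 1; done`.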
Integrating Runbooks with Grafana Loki Alerting
Alert Rule Configuration
When setting up Loki alert rules, include runbook information:
```yaml
groups:
  - name: loki_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate({app="production", level="error"}[5m])) > 5
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: High error rate detected
          description: "Error rate is {{ $value }} errors/sec"
          runbook_url: https://internal-docs/runbooks/high-error-rate
```
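To make sure no rule ships without a runbook link, a rule file can be scanned for alerts that lack a `runbook_url` annotation. The sketch below is a line-based check with `awk`, not a YAML parser, so it assumes the one-key-per-line layout shown above; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# alerts_without_runbook: print the names of alert rules in a rule file
# that have no runbook_url annotation before the next rule starts.
alerts_without_runbook() {
  awk '
    /^[[:space:]]*-[[:space:]]*alert:/ {
      # A new rule begins: report the previous one if it had no runbook_url
      if (name != "" && !has_url) print name
      name = $NF; has_url = 0
    }
    /^[[:space:]]*runbook_url:/ { has_url = 1 }
    END { if (name != "" && !has_url) print name }
  ' "$1"
}
```

An empty result means every rule in the file links to a runbook; any printed name is a rule that still needs one.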
Using Contact Points
Configure contact points to include runbook links in notifications. The example below uses Grafana's file-provisioning format; the `uid` value is a placeholder:
```yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-alerts
    receivers:
      - uid: slack-alerts-uid  # placeholder; any unique identifier
        type: slack
        settings:
          url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX
          text: |
            :alert: {{ .Status | toUpper }}: {{ .CommonLabels.alertname }}
            {{ .CommonAnnotations.summary }}
            Severity: {{ .CommonLabels.severity }}
            Runbook: {{ .CommonAnnotations.runbook_url }}
```
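Before relying on a contact point during an incident, it can be worth posting a test message to the webhook to confirm that the runbook link arrives and is clickable. The sketch below posts a JSON body with a `text` field, which is what Slack incoming webhooks accept. The function names are hypothetical, and no JSON escaping is performed, so the arguments are assumed to contain no double quotes or backslashes.

```bash
#!/usr/bin/env bash
# slack_test_payload: build a JSON body that mimics the notification layout.
# Assumes the arguments contain no double quotes or backslashes.
slack_test_payload() {
  local alertname="$1" runbook_url="$2"
  # \\n in the format string emits a literal \n, which JSON reads as a newline
  printf '{"text":"TEST FIRING: %s\\nRunbook: %s"}' "$alertname" "$runbook_url"
}

# send_test_notification: post the test payload to a Slack incoming webhook.
send_test_notification() {
  local webhook_url="$1"; shift
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(slack_test_payload "$@")" "$webhook_url"
}
```

A call would look like `send_test_notification "$SLACK_WEBHOOK_URL" HighErrorRate https://internal-docs/runbooks/high-error-rate`, where `SLACK_WEBHOOK_URL` is an assumed variable holding the webhook address.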
Automation Opportunities
As your runbooks mature, look for automation opportunities:
- Automated diagnostics - Scripts that gather information automatically
- Self-healing systems - Auto-remediation for common issues
- ChatOps integration - Run diagnostic commands from chat tools
Example of a simple diagnostic script for a Loki logging issue:
```bash
#!/bin/bash
# loki-diagnostics.sh - Run from a runbook to gather diagnostics

echo "=== Checking Loki Pods ==="
kubectl get pods -n loki

echo "=== Checking Loki Ingestion Rate ==="
# -G with --data-urlencode lets curl URL-encode the query expression
curl -s -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(loki_distributor_bytes_received_total[5m]))'

echo "=== Checking Recent Log Volume ==="
# Loki's query endpoint takes the LogQL expression as a "query" parameter
curl -s -G http://loki:3100/loki/api/v1/query \
  --data-urlencode 'query=sum(count_over_time({app="myapp"}[1h]))'

echo "=== Checking Loki Component Errors ==="
curl -s -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(loki_request_errors_total[5m])) by (component)'
```
Best Practices Summary
- Keep it simple - Runbooks should be easy to follow under pressure
- Be specific - Include exact commands and queries
- Include context - Explain why steps are important
- Anticipate variations - Cover different possible causes
- Use formatting - Make runbooks scannable with clear sections
- Include verification - Add steps to confirm resolution
- Link related resources - Connect to dashboards and documentation
- Update regularly - Review after incidents and as systems change
Summary
Effective runbooks transform Grafana Loki from a monitoring tool into an incident response system. By creating clear, actionable instructions tied to specific alerts, you reduce mean time to resolution and build team confidence. Start with your most critical alerts, create structured runbooks, and continuously improve them based on real incident experience.
Exercises
- Identify the top three most critical Loki alerts in your environment and create runbooks for them.
- Configure a Loki alert rule that includes a runbook URL in its annotation.
- Conduct a runbook drill with your team for a common incident scenario.
- Review a recent incident and document how you would improve the relevant runbook.
- Create a template for future runbooks based on the examples in this guide.