Runbooks Creation

Introduction

Runbooks are documented procedures that provide step-by-step instructions for responding to specific incidents or alerts in your monitoring system. They serve as crucial reference material during incidents, helping teams respond effectively and consistently even under pressure. In the context of Grafana Loki, runbooks connect your alerting system to actionable response plans, turning alerts into resolved incidents faster.

What is a Runbook?

A runbook is a documented procedure that:

  • Provides clear, step-by-step instructions for handling specific alerts or incidents
  • Helps standardize incident response processes
  • Reduces mean time to resolution (MTTR)
  • Empowers team members of all experience levels to respond effectively
  • Serves as both documentation and a learning resource

Why Runbooks Matter for Loki Monitoring

When monitoring logs with Grafana Loki, you'll set up alerts for various conditions: high error rates, specific error patterns, performance thresholds, and so on. Without runbooks:

  1. Alert fatigue becomes common as teams receive notifications without clear next steps
  2. Resolution times increase as responders investigate from scratch each time
  3. Knowledge remains siloed with experienced team members
  4. Incident response quality varies based on who's responding

Effective runbooks transform Loki alerts from notifications into actionable information that leads directly to resolution.

Creating Effective Runbooks

Step 1: Identify Alert Scenarios

First, identify the specific alert conditions that warrant runbooks. For Loki, common scenarios include:

  • High error rates in application logs
  • Specific error patterns in logs (e.g., authentication failures)
  • Missing logs or logging gaps
  • Performance-related log entries
  • Security-related log patterns

For each scenario, document the following (an example inventory entry is sketched after this list):

  • Alert name and description
  • Alert severity level
  • Systems/services affected
  • Team responsible for response
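
One lightweight way to capture this inventory is a small YAML record kept alongside each runbook. The schema below is purely illustrative (it is not a Grafana or Loki format) and simply keeps the scenario metadata reviewable in version control:

```yaml
# Illustrative inventory entry for one alert scenario (not a Grafana/Loki schema).
- alert: HighErrorRate
  description: Error rate in production application logs exceeds 5 errors/sec for 5 minutes
  severity: warning
  affected_services:
    - checkout
    - payments
  responsible_team: backend
  runbook: runbooks/high-error-rate.md
```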

Step 2: Establish the Runbook Structure

Create a consistent structure for all your runbooks. A good structure includes:

```markdown
# [Alert Name] Runbook

## Alert Description
[Brief description of what triggered the alert]

## Impact
[Description of business/user impact]

## Investigation Steps
1. [First step]
2. [Second step]
3. [Third step]

## Resolution Steps
1. [First resolution step]
2. [Second resolution step]
3. [Third resolution step]

## Prevention
[Steps to prevent future occurrences]

## Additional Resources
[Links to dashboards, documentation, etc.]
```

Step 3: Write Clear Investigation Steps

The investigation section should guide the responder through troubleshooting. For Loki alerts, include:

  • Specific LogQL queries to run for further investigation
  • Dashboards to check
  • Related metrics to examine
  • Common causes of this alert

Example for a "High Error Rate" alert:

Investigation Steps

1. Access the Grafana Loki Explorer at [dashboard URL]

2. Run this LogQL query to see the distribution of errors:

```logql
rate({app="myapp"} |= "ERROR" [5m])
```

3. Identify which service is generating the most errors:

```logql
sum by (service) (
  rate({app="myapp"} |= "ERROR" | pattern "<_> - <service> - <_>" [5m])
)
```

4. Check for deployment events that coincide with the error spike:

```logql
{app="myapp"} |= "Deployment"
```

5. Check system metrics for the affected service

Step 4: Include Resolution Steps

Resolution steps should be specific and actionable:

Resolution Steps

1. If errors are concentrated in the authentication service:

   a. Check the auth database connection:

   ```bash
   kubectl exec -it auth-service-pod -- curl db-host:5432
   ```

   b. Verify auth service logs for connection issues:

   ```logql
   {app="auth-service"} |= "connection"
   ```

   c. Restart the auth service if needed:

   ```bash
   kubectl rollout restart deployment/auth-service
   ```

2. If errors are distributed across services:

   a. Check for network issues:

   ```bash
   kubectl get events --field-selector type=Warning
   ```

   b. Verify external dependency availability...

Step 5: Link Runbooks to Alerts

In Grafana Alerting, configure alerts to reference your runbooks:

1. In your alert rule, add annotations:

```yaml
annotations:
  runbook_url: "https://your-wiki.example.com/runbooks/high-error-rate"
  summary: "High error rate detected in application logs"
  description: "Error rate exceeded threshold of 5 errors/second for 5 minutes"
```

2. Use alert templating to add context:

```yaml
annotations:
  description: "{{ $labels.app }} is showing {{ $value }} errors per second"
  runbook_url: "https://your-wiki.example.com/runbooks/{{ $labels.alert_type }}"
```

Practical Example: Creating a "Log Volume Drop" Runbook

Let's create a complete runbook for a scenario where Loki detects a significant drop in log volume, which could indicate service problems or logging pipeline issues.

Log Volume Drop Runbook

Alert Description

This alert triggers when the volume of logs from a service drops significantly (>50%) compared to the baseline for that time period.

Impact

Missing logs may indicate:

  • Service is down or degraded
  • Logging pipeline failure
  • Configuration change affecting logging

This impacts our ability to monitor service health and troubleshoot issues.

Investigation Steps

1. Verify that the service is running:

```bash
kubectl get pods -l app=affected-service
```

2. Check whether logs are being generated but not ingested:

   a. Check the service directly:

   ```bash
   kubectl logs -l app=affected-service --tail=20
   ```

   b. Check Loki ingestion metrics:

   ```promql
   rate(loki_distributor_bytes_received_total[5m])
   ```

3. Check for recent deployments or changes:

```logql
{app="ci-cd"} |= "deployment" |= "affected-service"
```

4. Examine the health of the Loki components:

```bash
kubectl get pods -n loki
```

5. Check storage backend status:

```promql
loki_boltdb_shipper_upload_failures_total
```

Resolution Steps

1. If the service is not running:

   a. Check for events:

   ```bash
   kubectl describe pod -l app=affected-service
   ```

   b. Start the service if needed:

   ```bash
   kubectl scale deployment affected-service --replicas=1
   ```

2. If the logging pipeline is broken:

   a. Check Promtail/Fluentd/Vector configurations

   b. Restart the log agents:

   ```bash
   kubectl rollout restart daemonset promtail -n loki
   ```

3. If Loki components are failing:

   a. Check resource usage:

   ```bash
   kubectl top pods -n loki
   ```

   b. Restart the affected components:

   ```bash
   kubectl rollout restart statefulset/loki -n loki
   ```

4. If storage issues are occurring:

   a. Check storage capacity

   b. Verify permissions

   c. Check network connectivity to storage

Prevention

1. Implement log volume monitoring:

```logql
sum(rate({app="affected-service"}[5m])) by (app)
```

2. Create dashboards showing log volume by service

3. Set up alerts for log pipeline components

4. Document logging configuration

Additional Resources

  • Loki Troubleshooting Documentation
  • Logging Infrastructure Diagram
  • Loki Components Dashboard
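
To tie this runbook back to an alert, a Loki ruler rule along the following lines could detect the drop and point responders at the runbook. This is a sketch only: the 50% threshold, the one-day baseline via `offset`, and the labels are illustrative and should be tuned to your traffic patterns.

```yaml
groups:
  - name: log_volume_alerts
    rules:
      - alert: LogVolumeDrop
        # Fires when the current 30m log volume is less than half of the
        # same window one day earlier (illustrative baseline).
        expr: |
          sum by (app) (count_over_time({app="affected-service"}[30m]))
            <
          sum by (app) (count_over_time({app="affected-service"}[30m] offset 1d)) * 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Log volume for {{ $labels.app }} dropped by more than 50%
          runbook_url: https://internal-docs/runbooks/log-volume-drop
```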

Testing and Improving Runbooks

Runbooks should be living documents that improve over time:

1. Conduct runbook drills - Practice using runbooks in simulated incidents
2. Post-incident reviews - After real incidents, update runbooks with new learnings
3. Rotation practice - Have new team members follow runbooks to identify unclear steps
4. Version control - Keep runbooks in a version-controlled repository
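
If the runbooks and alert rules live in the same repository, the version-control point above can also be enforced in CI. A hypothetical GitHub Actions job like the one below (the `alert-rules/` directory and the grep pattern are assumptions) fails the build when a runbook_url no longer resolves:

```yaml
name: runbook-link-check
on: [push, pull_request]

jobs:
  check-runbook-links:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Verify every runbook_url responds
        run: |
          # Extract runbook_url values from alert rule files and probe each one.
          grep -rhoE 'runbook_url: *"?[^" ]+' alert-rules/ \
            | sed -E 's/runbook_url: *"?//' \
            | sort -u \
            | while read -r url; do
                curl -sfL -o /dev/null "$url" \
                  || { echo "Broken runbook link: $url"; exit 1; }
              done
```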

Integrating Runbooks with Grafana Loki Alerting

Alert Rule Configuration

When setting up Loki alert rules, include runbook information:

```yaml
groups:
  - name: loki_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate({app="production", level="error"}[5m])) > 5
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: High error rate detected
          description: Error rate is {{ $value }} errors/sec
          runbook_url: https://internal-docs/runbooks/high-error-rate
```
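
For the Loki ruler to evaluate rule files like the one above, the ruler component must be enabled and pointed at the rule storage and at Alertmanager. A minimal sketch for a single-binary deployment follows; the paths and the Alertmanager URL are assumptions for your environment:

```yaml
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules      # rule files live under /loki/rules/<tenant>/
  rule_path: /loki/rules-temp     # scratch space used during evaluation
  alertmanager_url: http://alertmanager:9093
  enable_api: true
```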

Using Contact Points

Configure contact points to include runbook links in notifications:

```yaml
contact_points:
  - name: slack-alerts
    receivers:
      - type: slack
        settings:
          url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX
          message: |
            :alert: {{ .Status | upper }}: {{ .CommonLabels.alertname }}
            {{ .CommonAnnotations.summary }}
            Severity: {{ .CommonLabels.severity }}
            Runbook: {{ .CommonAnnotations.runbook_url }}
```
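
If your Grafana instance is managed with file provisioning, an equivalent contact point can be declared in Grafana's alerting provisioning format. The sketch below is an assumption-laden example; verify the schema and the Slack `settings` field names against your Grafana version:

```yaml
# Hypothetical Grafana alerting provisioning file carrying the runbook link.
apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-alerts
    receivers:
      - uid: slack-alerts-runbooks
        type: slack
        settings:
          url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX
          text: |
            {{ .CommonAnnotations.summary }}
            Severity: {{ .CommonLabels.severity }}
            Runbook: {{ .CommonAnnotations.runbook_url }}
```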

Automation Opportunities

As your runbooks mature, look for automation opportunities:

  1. Automated diagnostics - Scripts that gather information automatically
  2. Self-healing systems - Auto-remediation for common issues
  3. ChatOps integration - Run diagnostic commands from chat tools

Example of a simple diagnostic script for a Loki logging issue:

```bash
#!/bin/bash
# loki-diagnostics.sh - Run from a runbook to gather diagnostics

echo "=== Checking Loki Pods ==="
kubectl get pods -n loki

echo "=== Checking Loki Ingestion Rate ==="
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(loki_distributor_bytes_received_total[5m]))'

echo "=== Checking Recent Log Volume ==="
curl -sG http://loki:3100/loki/api/v1/query \
  --data-urlencode 'query=sum(count_over_time({app="myapp"}[1h]))'

echo "=== Checking Loki Component Errors ==="
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(loki_request_errors_total[5m])) by (component)'
```

Best Practices Summary

  1. Keep it simple - Runbooks should be easy to follow under pressure
  2. Be specific - Include exact commands and queries
  3. Include context - Explain why steps are important
  4. Anticipate variations - Cover different possible causes
  5. Use formatting - Make runbooks scannable with clear sections
  6. Include verification - Add steps to confirm resolution
  7. Link related resources - Connect to dashboards and documentation
  8. Update regularly - Review after incidents and as systems change

Summary

Effective runbooks transform Grafana Loki from a monitoring tool into an incident response system. By creating clear, actionable instructions tied to specific alerts, you reduce mean time to resolution and build team confidence. Start with your most critical alerts, create structured runbooks, and continuously improve them based on real incident experience.

Exercises

  1. Identify the top three most critical Loki alerts in your environment and create runbooks for them.
  2. Configure a Loki alert rule that includes a runbook URL in its annotation.
  3. Conduct a runbook drill with your team for a common incident scenario.
  4. Review a recent incident and document how you would improve the relevant runbook.
  5. Create a template for future runbooks based on the examples in this guide.

