Runbooks Creation

Introduction

Runbooks are documented procedures that provide step-by-step instructions for responding to specific incidents or alerts in your monitoring system. They serve as crucial reference material during incidents, helping teams respond effectively and consistently even under pressure. In the context of Grafana Loki, runbooks connect your alerting system to actionable response plans, turning alerts into resolved incidents faster.

What is a Runbook?

A runbook is a documented procedure that:

  • Provides clear, step-by-step instructions for handling specific alerts or incidents
  • Helps standardize incident response processes
  • Reduces mean time to resolution (MTTR)
  • Empowers team members of all experience levels to respond effectively
  • Serves as both documentation and a learning resource

Why Runbooks Matter for Loki Monitoring

When monitoring logs with Grafana Loki, you'll set up alerts for various conditions: high error rates, specific error patterns, performance thresholds, and so on. Without runbooks:

  1. Alert fatigue becomes common as teams receive notifications without clear next steps
  2. Resolution times increase as responders investigate from scratch each time
  3. Knowledge remains siloed with experienced team members
  4. Incident response quality varies based on who's responding

Effective runbooks transform Loki alerts from notifications into actionable information that leads directly to resolution.

Creating Effective Runbooks

Step 1: Identify Alert Scenarios

First, identify the specific alert conditions that warrant runbooks. For Loki, common scenarios include:

  • High error rates in application logs
  • Specific error patterns in logs (e.g., authentication failures)
  • Missing logs or logging gaps
  • Performance-related log entries
  • Security-related log patterns

For each scenario, document the following (an example inventory entry is sketched after this list):

  • Alert name and description
  • Alert severity level
  • Systems/services affected
  • Team responsible for response
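
One lightweight way to capture this inventory is a small YAML record kept alongside each runbook. The schema below is purely illustrative (it is not a Grafana or Loki format) and simply keeps the scenario metadata reviewable in version control:

```yaml
# Illustrative inventory entry for one alert scenario (not a Grafana/Loki schema).
- alert: HighErrorRate
  description: Error rate in production application logs exceeds 5 errors/sec for 5 minutes
  severity: warning
  affected_services:
    - checkout
    - payments
  responsible_team: backend
  runbook: runbooks/high-error-rate.md
```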

Step 2: Establish the Runbook Structure

Create a consistent structure for all your runbooks. A good structure includes:

```markdown
# [Alert Name] Runbook

## Alert Description
[Brief description of what triggered the alert]

## Impact
[Description of business/user impact]

## Investigation Steps
1. [First step]
2. [Second step]
3. [Third step]

## Resolution Steps
1. [First resolution step]
2. [Second resolution step]
3. [Third resolution step]

## Prevention
[Steps to prevent future occurrences]

## Additional Resources
[Links to dashboards, documentation, etc.]
```

Step 3: Write Clear Investigation Steps

The investigation section should guide the responder through troubleshooting. For Loki alerts, include:

  • Specific LogQL queries to run for further investigation
  • Dashboards to check
  • Related metrics to examine
  • Common causes of this alert

Example for a "High Error Rate" alert:

Investigation Steps

1. Access the Grafana Loki Explorer at [dashboard URL]

2. Run this LogQL query to see the distribution of errors:

```logql
rate({app="myapp"} |= "ERROR" [5m])
```

3. Identify which service is generating the most errors:

```logql
sum by (service) (
  rate({app="myapp"} |= "ERROR" | pattern "<_> - <service> - <_>" [5m])
)
```

4. Check for deployment events that coincide with the error spike:

```logql
{app="myapp"} |= "Deployment"
```

5. Check system metrics for the affected service

Step 4: Include Resolution Steps

Resolution steps should be specific and actionable:

Resolution Steps

1. If errors are concentrated in the authentication service:

   a. Check the auth database connection:

   ```bash
   kubectl exec -it auth-service-pod -- curl db-host:5432
   ```

   b. Verify auth service logs for connection issues:

   ```logql
   {app="auth-service"} |= "connection"
   ```

   c. Restart the auth service if needed:

   ```bash
   kubectl rollout restart deployment/auth-service
   ```

2. If errors are distributed across services:

   a. Check for network issues:

   ```bash
   kubectl get events --field-selector type=Warning
   ```

   b. Verify external dependency availability...

Step 5: Link Runbooks to Alerts

In Grafana Alerting, configure alerts to reference your runbooks:

1. In your alert rule, add annotations:

```yaml
annotations:
  runbook_url: "https://your-wiki.example.com/runbooks/high-error-rate"
  summary: "High error rate detected in application logs"
  description: "Error rate exceeded threshold of 5 errors/second for 5 minutes"
```

2. Use alert templating to add context:

```yaml
annotations:
  description: "{{ $labels.app }} is showing {{ $value }} errors per second"
  runbook_url: "https://your-wiki.example.com/runbooks/{{ $labels.alert_type }}"
```

Practical Example: Creating a "Log Volume Drop" Runbook

Let's create a complete runbook for a scenario where Loki detects a significant drop in log volume, which could indicate service problems or logging pipeline issues.

Log Volume Drop Runbook

Alert Description

This alert triggers when the volume of logs from a service drops significantly (>50%) compared to the baseline for that time period.

Impact

Missing logs may indicate:

  • Service is down or degraded
  • Logging pipeline failure
  • Configuration change affecting logging

This impacts our ability to monitor service health and troubleshoot issues.

Investigation Steps

1. Verify that the service is running:

```bash
kubectl get pods -l app=affected-service
```

2. Check whether logs are being generated but not ingested:

   a. Check the service directly:

   ```bash
   kubectl logs -l app=affected-service --tail=20
   ```

   b. Check Loki ingestion metrics:

   ```promql
   rate(loki_distributor_bytes_received_total[5m])
   ```

3. Check for recent deployments or changes:

```logql
{app="ci-cd"} |= "deployment" |= "affected-service"
```

4. Examine the health of the Loki components:

```bash
kubectl get pods -n loki
```

5. Check storage backend status:

```promql
loki_boltdb_shipper_upload_failures_total
```

Resolution Steps

1. If the service is not running:

   a. Check for events:

   ```bash
   kubectl describe pod -l app=affected-service
   ```

   b. Start the service if needed:

   ```bash
   kubectl scale deployment affected-service --replicas=1
   ```

2. If the logging pipeline is broken:

   a. Check Promtail/Fluentd/Vector configurations

   b. Restart the log agents:

   ```bash
   kubectl rollout restart daemonset promtail -n loki
   ```

3. If Loki components are failing:

   a. Check resource usage:

   ```bash
   kubectl top pods -n loki
   ```

   b. Restart the affected components:

   ```bash
   kubectl rollout restart statefulset/loki -n loki
   ```

4. If storage issues are occurring:

   a. Check storage capacity

   b. Verify permissions

   c. Check network connectivity to storage

Prevention

1. Implement log volume monitoring:

```logql
sum(rate({app="affected-service"}[5m])) by (app)
```

2. Create dashboards showing log volume by service

3. Set up alerts for log pipeline components

4. Document logging configuration

Additional Resources

  • Loki Troubleshooting Documentation
  • Logging Infrastructure Diagram
  • Loki Components Dashboard
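
To tie this runbook back to an alert, a Loki ruler rule along the following lines could detect the drop and point responders at the runbook. This is a sketch only: the 50% threshold, the one-day baseline via `offset`, and the labels are illustrative and should be tuned to your traffic patterns.

```yaml
groups:
  - name: log_volume_alerts
    rules:
      - alert: LogVolumeDrop
        # Fires when the current 30m log volume is less than half of the
        # same window one day earlier (illustrative baseline).
        expr: |
          sum by (app) (count_over_time({app="affected-service"}[30m]))
            <
          sum by (app) (count_over_time({app="affected-service"}[30m] offset 1d)) * 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Log volume for {{ $labels.app }} dropped by more than 50%
          runbook_url: https://internal-docs/runbooks/log-volume-drop
```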

Testing and Improving Runbooks

Runbooks should be living documents that improve over time:

1. Conduct runbook drills - Practice using runbooks in simulated incidents
2. Post-incident reviews - After real incidents, update runbooks with new learnings
3. Rotation practice - Have new team members follow runbooks to identify unclear steps
4. Version control - Keep runbooks in a version-controlled repository
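
If the runbooks and alert rules live in the same repository, the version-control point above can also be enforced in CI. A hypothetical GitHub Actions job like the one below (the `alert-rules/` directory and the grep pattern are assumptions) fails the build when a runbook_url no longer resolves:

```yaml
name: runbook-link-check
on: [push, pull_request]

jobs:
  check-runbook-links:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Verify every runbook_url responds
        run: |
          # Extract runbook_url values from alert rule files and probe each one.
          grep -rhoE 'runbook_url: *"?[^" ]+' alert-rules/ \
            | sed -E 's/runbook_url: *"?//' \
            | sort -u \
            | while read -r url; do
                curl -sfL -o /dev/null "$url" \
                  || { echo "Broken runbook link: $url"; exit 1; }
              done
```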

Integrating Runbooks with Grafana Loki Alerting

Alert Rule Configuration

When setting up Loki alert rules, include runbook information:

```yaml
groups:
  - name: loki_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate({app="production", level="error"}[5m])) > 5
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: High error rate detected
          description: Error rate is {{ $value }} errors/sec
          runbook_url: https://internal-docs/runbooks/high-error-rate
```
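
For the Loki ruler to evaluate rule files like the one above, the ruler component must be enabled and pointed at the rule storage and at Alertmanager. A minimal sketch for a single-binary deployment follows; the paths and the Alertmanager URL are assumptions for your environment:

```yaml
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules      # rule files live under /loki/rules/<tenant>/
  rule_path: /loki/rules-temp     # scratch space used during evaluation
  alertmanager_url: http://alertmanager:9093
  enable_api: true
```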

Using Contact Points

Configure contact points to include runbook links in notifications:

```yaml
contact_points:
  - name: slack-alerts
    receivers:
      - type: slack
        settings:
          url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX
          message: |
            :alert: {{ .Status | upper }}: {{ .CommonLabels.alertname }}
            {{ .CommonAnnotations.summary }}
            Severity: {{ .CommonLabels.severity }}
            Runbook: {{ .CommonAnnotations.runbook_url }}
```
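
If your Grafana instance is managed with file provisioning, an equivalent contact point can be declared in Grafana's alerting provisioning format. The sketch below is an assumption-laden example; verify the schema and the Slack `settings` field names against your Grafana version:

```yaml
# Hypothetical Grafana alerting provisioning file carrying the runbook link.
apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-alerts
    receivers:
      - uid: slack-alerts-runbooks
        type: slack
        settings:
          url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX
          text: |
            {{ .CommonAnnotations.summary }}
            Severity: {{ .CommonLabels.severity }}
            Runbook: {{ .CommonAnnotations.runbook_url }}
```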

Automation Opportunities

As your runbooks mature, look for automation opportunities:

  1. Automated diagnostics - Scripts that gather information automatically
  2. Self-healing systems - Auto-remediation for common issues
  3. ChatOps integration - Run diagnostic commands from chat tools

Example of a simple diagnostic script for a Loki logging issue:

```bash
#!/bin/bash
# loki-diagnostics.sh - Run from a runbook to gather diagnostics

echo "=== Checking Loki Pods ==="
kubectl get pods -n loki

echo "=== Checking Loki Ingestion Rate ==="
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(loki_distributor_bytes_received_total[5m]))'

echo "=== Checking Recent Log Volume ==="
curl -sG http://loki:3100/loki/api/v1/query \
  --data-urlencode 'query=sum(count_over_time({app="myapp"}[1h]))'

echo "=== Checking Loki Component Errors ==="
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(loki_request_errors_total[5m])) by (component)'
```

Best Practices Summary

  1. Keep it simple - Runbooks should be easy to follow under pressure
  2. Be specific - Include exact commands and queries
  3. Include context - Explain why steps are important
  4. Anticipate variations - Cover different possible causes
  5. Use formatting - Make runbooks scannable with clear sections
  6. Include verification - Add steps to confirm resolution
  7. Link related resources - Connect to dashboards and documentation
  8. Update regularly - Review after incidents and as systems change

Summary

Effective runbooks transform Grafana Loki from a monitoring tool into an incident response system. By creating clear, actionable instructions tied to specific alerts, you reduce mean time to resolution and build team confidence. Start with your most critical alerts, create structured runbooks, and continuously improve them based on real incident experience.

Exercises

  1. Identify the top three most critical Loki alerts in your environment and create runbooks for them.
  2. Configure a Loki alert rule that includes a runbook URL in its annotation.
  3. Conduct a runbook drill with your team for a common incident scenario.
  4. Review a recent incident and document how you would improve the relevant runbook.
  5. Create a template for future runbooks based on the examples in this guide.

