
Grafana OnCall

Introduction

Grafana OnCall is a powerful incident response and alert management tool within the Grafana ecosystem. It helps DevOps, SRE (Site Reliability Engineering), and IT teams handle critical alerts efficiently by streamlining the alerting workflow, managing on-call schedules, and coordinating incident responses.

Designed to minimize response times and reduce alert fatigue, Grafana OnCall bridges the gap between monitoring systems that detect issues and the teams responsible for resolving them. It's an essential component for organizations aiming to maintain high system availability and implement effective incident management practices.

What is Grafana OnCall?

Grafana OnCall is an open-source incident response tool that:

  1. Centralizes alerts from multiple monitoring systems
  2. Manages on-call rotations to ensure the right people are notified at the right time
  3. Coordinates incident responses through escalation policies and automated workflows
  4. Provides insights into alert patterns to help teams improve their incident management processes

Originally developed as a standalone product called "Amixr," it was acquired by Grafana Labs and integrated into the Grafana ecosystem, offering both open-source and enterprise versions.

Key Features of Grafana OnCall

Alert Management

Grafana OnCall serves as a central hub for alerts from various monitoring systems, including:

  • Grafana Alerting
  • Prometheus
  • Zabbix
  • Nagios
  • AWS CloudWatch
  • Google Cloud Monitoring
  • And many others through webhooks

This centralization gives teams a single interface to manage all their alerts regardless of their source.

On-Call Scheduling

One of the most critical features of Grafana OnCall is its ability to manage on-call rotations. Teams can create schedules that:

  • Define who is on-call at any given time
  • Include primary and backup responders
  • Account for different time zones
  • Support rotation patterns (daily, weekly, custom)
  • Handle holidays and time-off
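Grafana OnCall computes all of this for you, but the core logic of a simple weekly rotation is easy to sketch. Here's a minimal Python illustration (the names and dates are made up, and a real schedule also handles time zones, overrides, and time-off):

```python
from datetime import date

def on_call_for(day: date, rotation_start: date, members: list[str]) -> str:
    """Return who is on call on `day` in a simple weekly rotation.

    Members take one-week shifts in order, starting at `rotation_start`
    (assumed to fall on the handoff day, e.g. a Monday).
    """
    if day < rotation_start:
        raise ValueError("date precedes the start of the rotation")
    weeks_elapsed = (day - rotation_start).days // 7
    return members[weeks_elapsed % len(members)]

members = ["Alice", "Bob", "Charlie"]
start = date(2023, 1, 2)  # a Monday

print(on_call_for(date(2023, 1, 4), start, members))   # week 0 -> Alice
print(on_call_for(date(2023, 1, 9), start, members))   # week 1 -> Bob
print(on_call_for(date(2023, 1, 23), start, members))  # week 3 wraps around -> Alice
```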

Escalation Policies

Escalation policies ensure critical alerts never go unaddressed:

  1. When an alert is triggered, it notifies the primary on-call person
  2. If the alert isn't acknowledged within a defined time period, it escalates to additional team members
  3. Multiple escalation steps can be configured to eventually reach team leads or managers if necessary
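The steps above can be modeled as a list of (wait time, responder) pairs. Here's a minimal Python sketch — not OnCall's actual implementation — of who has been paged a given number of minutes after an alert fires, assuming no one acknowledges it:

```python
def notified_after(minutes_elapsed: int, chain: list[tuple[int, str]]) -> list[str]:
    """Return everyone notified `minutes_elapsed` minutes after the alert fired.

    `chain` is a list of (wait_before_step_minutes, responder) pairs; in a
    real escalation, acknowledging the alert halts any further steps.
    """
    notified, step_due_at = [], 0
    for wait, responder in chain:
        step_due_at += wait
        if minutes_elapsed >= step_due_at:
            notified.append(responder)
    return notified

chain = [(0, "primary on-call"), (15, "secondary on-call"), (15, "team lead")]
print(notified_after(0, chain))   # ['primary on-call']
print(notified_after(20, chain))  # ['primary on-call', 'secondary on-call']
print(notified_after(40, chain))  # ['primary on-call', 'secondary on-call', 'team lead']
```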

Multi-Channel Notifications

Grafana OnCall supports multiple notification methods to ensure alerts reach the right person:

  • Mobile push notifications (via the Grafana OnCall mobile app)
  • SMS messages
  • Phone calls with automated voice messages
  • Slack or Microsoft Teams messages
  • Email notifications

Integration with Incident Management Tools

Grafana OnCall works seamlessly with incident management tools like:

  • Jira
  • ServiceNow
  • PagerDuty
  • OpsGenie

This integration enables a complete incident response workflow from detection to resolution.

Setting Up Grafana OnCall

Let's walk through the basic steps to set up Grafana OnCall in your Grafana environment.

Prerequisites

  • Grafana instance (version 8.0 or higher)
  • Grafana OnCall plugin installed
  • Admin permissions in Grafana

Installation

There are two main ways to install Grafana OnCall:

  1. As a Grafana plugin (recommended for most users)
  2. As a standalone application (for advanced use cases)

Plugin Installation

To install Grafana OnCall as a plugin:

  1. Navigate to your Grafana instance
  2. Go to Configuration > Plugins
  3. Search for "OnCall"
  4. Click "Install"

Alternatively, you can install it via the Grafana CLI:

```bash
grafana-cli plugins install grafana-oncall-app
```

After installation, you'll need to enable the plugin:

  1. Go to Configuration > Plugins
  2. Find OnCall in the list
  3. Click "Enable"

Basic Configuration

Once installed, you'll need to configure the following components:

1. Integration Setup

First, set up integrations with your monitoring systems:

  1. Navigate to the OnCall tab in Grafana
  2. Go to "Integrations"
  3. Click "Add Integration"
  4. Select your monitoring system (e.g., Grafana Alerting, Prometheus)
  5. Follow the specific setup instructions for that integration

Here's an example of the alert payload a webhook integration can send to Grafana OnCall:

```javascript
// Example webhook payload for Grafana OnCall
{
  "alert_uid": "unique-alert-id-123",
  "title": "High CPU Usage",
  "message": "Server CPU usage is above 90% for the last 5 minutes",
  "image_url": "https://example.com/graph.png", // Optional
  "link_to_upstream_details": "https://monitoring.example.com/alert/123",
  "severity": "critical"
}
```

2. Team Configuration

Next, set up your teams:

  1. Go to "Teams" in the OnCall interface
  2. Click "Add Team"
  3. Enter team name and description
  4. Add team members with their contact information
  5. Save the team configuration

3. On-Call Schedules

Create on-call schedules for your teams:

  1. Go to "Schedules"
  2. Click "New Schedule"
  3. Configure the schedule details:
    • Schedule name
    • Time zone
    • Rotation frequency (daily, weekly, custom)
    • Team members in the rotation
  4. Save the schedule

Here's an illustrative example of a weekly rotation schedule (the field names below are a sketch — real schedules are built through the OnCall UI or API):

```yaml
schedule_name: "Weekly Production Support"
timezone: "America/New_York"
rotation_type: "weekly"
starts_at: "2023-01-02T09:00:00Z"
members:
  - name: "Alice Smith"
    user_id: "alice123"
  - name: "Bob Johnson"
    user_id: "bob456"
  - name: "Charlie Brown"
    user_id: "charlie789"
handoff_day: "Monday"
handoff_time: "09:00"
```

4. Escalation Chains

Set up escalation chains to ensure alerts are handled promptly:

  1. Go to "Escalation Chains"
  2. Click "New Escalation Chain"
  3. Add escalation steps:
    • Who to notify
    • Wait time before escalation
    • Notification methods
  4. Save the escalation chain

Testing Your Setup

Before relying on your OnCall configuration in production, test it:

  1. Go to "Integrations"
  2. Select the integration you want to test
  3. Click "Send Test Alert"
  4. Verify that notifications are received as expected
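If you prefer to script this check, you can also POST a test payload directly to the integration's webhook URL. Below is a hedged sketch using only the Python standard library; the URL is a placeholder you would copy from your integration's settings page:

```python
import json
import urllib.request

# Placeholder URL -- copy the real endpoint from your OnCall
# integration's settings page (this one will not resolve).
WEBHOOK_URL = "https://oncall.example.com/integrations/webhook/REPLACE_ME"

def build_test_alert() -> dict:
    """Assemble a minimal test payload for a webhook integration."""
    return {
        "alert_uid": "test-alert-001",
        "title": "Test alert from setup verification",
        "message": "If you received this, the integration is wired up correctly.",
        "severity": "warning",
    }

def send_alert(payload: dict) -> None:
    """POST the payload as JSON; raises if the server returns an error."""
    request = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

# Uncomment once WEBHOOK_URL points at a real integration:
# send_alert(build_test_alert())
```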

Practical Example: Setting Up a Complete OnCall Workflow

Let's walk through a complete example of setting up an on-call workflow for a web application monitoring scenario.

Scenario

You have a web application that's monitored by Prometheus. You want to set up an on-call rotation for your team to handle any critical alerts.

Step 1: Create a Team

First, create a team called "Web Application Support":

  1. Navigate to Teams > Add Team
  2. Enter team details:
    • Name: Web Application Support
    • Description: Team responsible for maintaining web application availability
  3. Add team members with their contact information

Step 2: Create an On-Call Schedule

Create a weekly rotation schedule:

  1. Go to Schedules > New Schedule
  2. Configure schedule details:
    • Name: Web App Weekly Rotation
    • Rotation type: Weekly
    • Start time: Monday 9:00 AM
    • Team members: Alice, Bob, Charlie (rotating weekly)

Step 3: Create an Escalation Chain

Set up a three-tier escalation policy:

  1. Go to Escalation Chains > New Escalation Chain
  2. Name: Web App Critical Response
  3. Add steps:
    • Step 1: Notify primary on-call via mobile app and SMS
    • Wait 15 minutes
    • Step 2: Notify primary on-call via phone call and secondary on-call via mobile app
    • Wait 15 minutes
    • Step 3: Notify team manager via phone call

Step 4: Set Up Prometheus Integration

Integrate Prometheus alerts:

  1. Go to Integrations > Add Integration
  2. Select Prometheus
  3. Configure the webhook URL
  4. Add the webhook URL to your Prometheus alertmanager configuration:
```yaml
# Prometheus alertmanager.yml
receivers:
  - name: 'grafana-oncall'
    webhook_configs:
      - url: 'https://oncall.example.com/integrations/prometheus/abcdef123456'
        send_resolved: true
```

Step 5: Configure Alert Routes

Create routing rules to determine which alerts go to which teams:

  1. Go to Alert Routes
  2. Create a new route:
    • Name: Web App Critical Issues
    • Conditions: severity = "critical" AND service = "web-app"
    • Team: Web Application Support
    • Escalation Chain: Web App Critical Response
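Conceptually, a route is just a set of label conditions that each incoming alert is matched against. Here's a minimal Python sketch of that matching logic — illustrative only, since OnCall evaluates routes internally:

```python
def matches_route(alert: dict, conditions: dict) -> bool:
    """True if every route condition matches the alert's labels exactly."""
    return all(alert.get(key) == value for key, value in conditions.items())

# Conditions for the "Web App Critical Issues" route above
route = {"severity": "critical", "service": "web-app"}

print(matches_route({"severity": "critical", "service": "web-app"}, route))  # True
print(matches_route({"severity": "warning", "service": "web-app"}, route))   # False
```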

Step 6: Test the Workflow

Test the complete workflow:

  1. Trigger a test alert from Prometheus
  2. Observe how the alert is routed through Grafana OnCall
  3. Verify that the correct person is notified
  4. Test the escalation process

Best Practices for Using Grafana OnCall

1. Alert Quality Matters

  • Configure meaningful alerts that represent real issues
  • Include context in alert messages (what happened, impact, possible causes)
  • Set appropriate severity levels for different alert types

2. Design Sustainable On-Call Rotations

  • Ensure fair distribution of on-call responsibilities
  • Consider time zones when designing global schedules
  • Build in handover processes between shifts
  • Avoid overloading any individual with too many on-call hours

3. Document Response Procedures

  • Create runbooks for common alerts
  • Link procedures directly in alert notifications when possible
  • Document escalation paths and when to involve different teams

4. Regularly Review and Improve

  • Analyze alert patterns to identify noisy alerts
  • Track metrics like MTTD (Mean Time To Detect) and MTTR (Mean Time To Resolve)
  • Conduct post-incident reviews to improve both systems and response processes
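MTTR, for example, is simply the average time between detection and resolution. If you export incident timestamps from OnCall, a small helper like this (an illustrative sketch, not part of OnCall) computes it:

```python
from datetime import datetime

def mean_minutes(pairs: list[tuple[datetime, datetime]]) -> float:
    """Average gap in minutes between (detected, resolved) timestamp pairs."""
    gaps = [(resolved - detected).total_seconds() / 60 for detected, resolved in pairs]
    return sum(gaps) / len(gaps)

# Hypothetical incidents: detected at, resolved at
incidents = [
    (datetime(2023, 3, 1, 9, 0), datetime(2023, 3, 1, 9, 45)),   # 45 minutes
    (datetime(2023, 3, 5, 14, 0), datetime(2023, 3, 5, 14, 15)), # 15 minutes
]
print(mean_minutes(incidents))  # (45 + 15) / 2 = 30.0
```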

5. Train Your Team

  • Ensure everyone understands how to use the OnCall system
  • Practice responding to incidents through drills or game days
  • Cross-train team members on different systems

Advanced Features

1. Outgoing Webhooks

Grafana OnCall can trigger webhooks to external systems when alerts are received, acknowledged, or resolved. This enables automated actions like:

  • Creating tickets in a service desk system
  • Executing runbooks through automation platforms
  • Updating status pages

Example webhook configuration:

```javascript
// Outgoing webhook configuration
{
  "event_type": ["alert_received", "alert_acknowledged"],
  "url": "https://automation.example.com/webhook",
  "http_method": "POST",
  "custom_headers": {
    "Authorization": "Bearer abc123def456"
  }
}
```

2. Maintenance Mode

When performing scheduled maintenance, you can place services in maintenance mode to prevent alert notifications:

  1. Go to Integrations
  2. Select the integration to put in maintenance
  3. Click "Maintenance"
  4. Set the duration and reason
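Conceptually, a maintenance window is just a time interval during which notifications for an integration are suppressed. A tiny Python sketch of that check (illustrative, not OnCall's code):

```python
from datetime import datetime, timedelta

def in_maintenance(now: datetime, window_start: datetime, duration: timedelta) -> bool:
    """True if `now` falls inside the maintenance window, meaning
    notifications for alerts arriving now should be suppressed."""
    return window_start <= now < window_start + duration

start = datetime(2023, 6, 1, 22, 0)  # 2-hour window starting 10 PM
print(in_maintenance(datetime(2023, 6, 1, 22, 30), start, timedelta(hours=2)))  # True
print(in_maintenance(datetime(2023, 6, 2, 0, 30), start, timedelta(hours=2)))   # False
```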

3. Alert Grouping and Deduplication

Configure how alerts are grouped to reduce notification noise:

  • Group related alerts by service, host, or custom attributes
  • Set auto-resolution times for transient issues
  • Configure rate limiting for repetitive alerts
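The core idea behind grouping is a fingerprint: alerts that share the same key attributes collapse into one group, and therefore one notification thread. A minimal Python sketch of that bucketing (the attribute names are illustrative):

```python
from collections import defaultdict

def group_alerts(alerts: list[dict], keys: tuple = ("service", "host")) -> dict:
    """Bucket alerts by a fingerprint built from the chosen attribute keys,
    so repeats of the same issue land in the same group."""
    groups = defaultdict(list)
    for alert in alerts:
        fingerprint = tuple(alert.get(k) for k in keys)
        groups[fingerprint].append(alert)
    return dict(groups)

alerts = [
    {"service": "web-app", "host": "web-1", "title": "High CPU"},
    {"service": "web-app", "host": "web-1", "title": "High CPU"},  # duplicate
    {"service": "web-app", "host": "web-2", "title": "High CPU"},
]
grouped = group_alerts(alerts)
print(len(grouped))  # 2 groups: web-1 (2 alerts) and web-2 (1 alert)
```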

Troubleshooting Common Issues

Issue: Alerts Not Coming Through

Potential causes and solutions:

  1. Integration misconfiguration

    • Verify webhook URLs and API keys
    • Check for typos in configuration files
  2. Network issues

    • Ensure monitoring systems can reach Grafana OnCall
    • Check firewall rules
  3. Routing problems

    • Verify alert routes are correctly configured
    • Check that conditions match the incoming alerts

Issue: Notifications Not Being Sent

Potential causes and solutions:

  1. Contact information issues

    • Verify phone numbers and email addresses
    • Check if users have verified their contact methods
  2. Service provider problems

    • Check if SMS or call services are functioning
    • Verify API limits haven't been exceeded

Summary

Grafana OnCall provides a comprehensive solution for managing on-call responsibilities and handling incidents efficiently. By centralizing alerts, managing on-call schedules, and coordinating responses, it helps teams:

  • Reduce mean time to resolution (MTTR)
  • Decrease alert fatigue
  • Ensure critical issues are never missed
  • Improve the on-call experience for team members

As part of the Grafana ecosystem, it integrates seamlessly with other Grafana tools, making it a powerful addition to any observability strategy.

Exercises

  1. Basic Setup Exercise:

    • Install Grafana OnCall in your test environment
    • Configure a basic webhook integration
    • Create a simple on-call schedule for your team
  2. Escalation Policy Design:

    • Design an escalation policy for a critical service
    • Include at least three escalation levels
    • Test the escalation by simulating an unacknowledged alert
  3. Integration Challenge:

    • Set up Grafana OnCall to receive alerts from at least two different monitoring systems
    • Configure routing rules to direct different alert types to appropriate teams
    • Test end-to-end alert flow from each system
  4. Automation Exercise:

    • Configure an outgoing webhook from Grafana OnCall to an automation tool
    • Create an automation that performs a simple remediation action when triggered
    • Test the complete workflow from alert to automated response

