Alert Provisioning
Introduction
Alert provisioning is a powerful feature in Grafana Alerting that allows you to define, manage, and deploy alert rules as code instead of manually creating them through the Grafana UI. This approach follows the "Infrastructure as Code" (IaC) principle, enabling you to automate your alerting setup, maintain version control, and ensure consistency across different Grafana environments.
In this guide, you'll learn how to provision alert rules, notification policies, and contact points in Grafana using YAML configuration files. By the end, you'll be able to create and manage your alerting infrastructure efficiently using version-controlled code.
What is Alert Provisioning?
Alert provisioning refers to the process of automatically creating and updating alert configurations from externally managed files. Rather than manually clicking through the Grafana interface to set up alerts, you define them in configuration files that Grafana automatically loads when it starts or when triggered to reload.
Key benefits of alert provisioning include:
- Version control: Track changes to your alerting setup over time
- Automation: Integrate alert creation into CI/CD pipelines
- Consistency: Ensure identical alert configurations across development, staging, and production environments
- Backup & recovery: Easily restore alert configurations if needed
- Scalability: Manage large numbers of alerts more efficiently
Provisioning Methods
Grafana supports two main approaches for provisioning alerts:
- File-based provisioning: Using YAML files in the Grafana provisioning directory
- API-based provisioning: Using the Grafana HTTP API to manage alert resources
This guide will focus primarily on file-based provisioning, as it aligns better with the IaC philosophy.
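For reference, the API-based approach uses Grafana's Alerting provisioning HTTP API. Here is a minimal sketch, assuming a token with sufficient permissions in the GRAFANA_TOKEN environment variable and an instance at grafana.example.com (both placeholders):

# List alert rules managed through provisioning
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  https://grafana.example.com/api/v1/provisioning/alert-rules

# List provisioned contact points
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  https://grafana.example.com/api/v1/provisioning/contact-points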
Prerequisites
Before you start provisioning alerts, make sure you have:
- Grafana v9.0 or later installed
- Access to the Grafana server's file system or configuration management system
- Basic understanding of YAML syntax
- Familiarity with Grafana alerting concepts (alert rules, notification policies, contact points)
File-based Alert Provisioning
Directory Structure
By default, Grafana looks for provisioning files in the following directory:
/etc/grafana/provisioning/
For alert provisioning specifically, you'll need to create the following structure:
/etc/grafana/provisioning/alerting/
├── alert_rules/
│   └── example_rules.yaml
├── contact_points/
│   └── example_contacts.yaml
└── notification_policies/
    └── example_policies.yaml
If you're using Docker, you can mount these directories from your host machine to the container.
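As a minimal sketch, a docker-compose service definition could mount a local provisioning directory like this (the host path ./provisioning and the port mapping are assumptions; adapt them to your setup):

services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      # Host directory containing alerting/, datasources/, dashboards/, etc.
      - ./provisioning:/etc/grafana/provisioning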
Provisioning Alert Rules
Alert rules define the conditions that trigger alerts. Let's create a simple alert rule that monitors CPU usage.
Create a file named cpu_alert.yaml in the alert_rules directory:
apiVersion: 1
groups:
  - name: CPU Alerts
    folder: Server Monitoring
    interval: 60s
    rules:
      # uid uniquely identifies the rule so Grafana can update it on re-provisioning
      - uid: high-cpu-usage
        title: High CPU Usage
        condition: C
        for: 5m
        annotations:
          summary: High CPU usage detected
          description: CPU usage has exceeded 80% for 5 minutes
        labels:
          severity: warning
          category: resource
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: PBFA97CFB590B2093
            model:
              expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              intervalMs: 1000
              maxDataPoints: 43200
              refId: A
          - refId: B
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: PBFA97CFB590B2093
            model:
              expr: 80
              intervalMs: 1000
              maxDataPoints: 43200
              refId: B
          # C is a server-side expression that compares the query (A) against the threshold (B)
          - refId: C
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: __expr__
            model:
              type: math
              expression: $A > $B
              intervalMs: 1000
              maxDataPoints: 43200
              refId: C
Let's break down this configuration:
- apiVersion: Specifies the version of the provisioning file format
- groups: Contains a list of alert rule groups
- name: The name of the group
- folder: The Grafana folder where the alert rules will be stored
- interval: How often the rules in this group are evaluated
- rules: The list of alert rules in this group
- uid: A unique identifier for the alert rule, used to track it across provisioning runs
- title: The name of the alert rule
- condition: The reference ID of the condition expression
- for: The duration for which the condition must be true before the alert fires
- annotations: Additional information about the alert
- labels: Key-value pairs for categorization and routing
- data: The queries and expressions that define the alert condition
Provisioning Contact Points
Contact points define how alerts are delivered to recipients. Let's create a file named contact_points.yaml in the contact_points directory:
apiVersion: 1
contactPoints:
  - name: email-team
    receivers:
      - uid: email-team-receiver
        type: email
        settings:
          # Placeholder address; replace with your team's email
          addresses: [email protected]
          singleEmail: false
  - name: slack-alerts
    receivers:
      - uid: slack-alerts-receiver
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR_SLACK_WEBHOOK_URL
          title: "{{ .CommonLabels.alertname }}"
          text: "{{ .CommonAnnotations.description }}"
This configuration defines two contact points:
- An email contact point that sends alerts to [email protected] (replace this with your team's address)
- A Slack contact point that posts alerts to a Slack channel using a webhook
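Other receiver types follow the same pattern: change the type and supply the settings that integration expects. As a rough sketch, a generic webhook contact point might look like this (the URL is a placeholder, and the receiving service decides what to do with the payload):

apiVersion: 1
contactPoints:
  - name: webhook-alerts
    receivers:
      - uid: webhook-alerts-receiver
        type: webhook
        settings:
          # Placeholder endpoint; Grafana POSTs the alert payload here
          url: https://alerts.example.com/grafana-webhook
          httpMethod: POST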
Provisioning Notification Policies
Notification policies determine how alerts are routed to contact points based on labels. Let's create a file named notification_policies.yaml in the notification_policies directory:
apiVersion: 1
policies:
  - orgId: 1
    receiver: slack-alerts
    group_by: ['alertname', 'category']
    routes:
      - receiver: email-team
        group_by: ['alertname', 'category', 'instance']
        matchers:
          - severity = critical
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 4h
        mute_time_intervals: []
This configuration:
- Sets slack-alerts as the default receiver for all alerts
- Groups alerts by alertname and category
- Creates a route that:
  - Sends alerts with severity: critical to the email-team contact point
  - Groups these critical alerts by alertname, category, and instance
  - Waits 30s before sending the first notification
  - Waits 5m before sending updates for the same group
  - Repeats notifications every 4h if the alert persists
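Routes can also be nested to build multi-tier routing (something the exercises below ask you to try). Here is a sketch, where database-oncall is a hypothetical contact point you would provision separately:

apiVersion: 1
policies:
  - orgId: 1
    receiver: slack-alerts
    group_by: ['alertname', 'category']
    routes:
      - receiver: email-team
        matchers:
          - severity = critical
        routes:
          # Critical database alerts go one level deeper, to a dedicated receiver
          - receiver: database-oncall
            matchers:
              - category = database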
Applying Provisioned Configurations
Grafana automatically loads provisioning files at startup. If you've added or modified provisioning files while Grafana is running, you can apply the changes in two ways:
1. Restart Grafana
The simplest way to apply changes is to restart the Grafana service:
sudo systemctl restart grafana-server
Or for Docker:
docker restart grafana
2. API Reload (Grafana v9.1+)
For a zero-downtime reload, you can use the Grafana API (requires admin access):
curl -X POST -H "Authorization: Bearer YOUR_API_KEY" http://your-grafana-instance/api/admin/provisioning/alerting/reload
Real-world Example: Monitoring a Web Service
Let's create a more comprehensive example that monitors a web service, checking for high response times, error rates, and availability.
Create a file named web_service_alerts.yaml in the alert_rules directory:
apiVersion: 1
groups:
  - name: Web Service Monitoring
    folder: Application Monitoring
    interval: 30s
    rules:
      - uid: web-api-high-response-time
        title: High Response Time
        condition: C
        for: 2m
        annotations:
          summary: High response time detected
          description: Average response time has exceeded 500ms for 2 minutes
          dashboard_url: /d/abc123/web-service-overview
        labels:
          severity: warning
          category: performance
          service: web-api
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: PBFA97CFB590B2093
            model:
              expr: avg(rate(http_request_duration_seconds_sum{job="web-api"}[5m]) / rate(http_request_duration_seconds_count{job="web-api"}[5m])) * 1000
              intervalMs: 1000
              maxDataPoints: 43200
              refId: A
          - refId: B
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: PBFA97CFB590B2093
            model:
              expr: 500
              intervalMs: 1000
              maxDataPoints: 43200
              refId: B
          - refId: C
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: __expr__
            model:
              type: math
              expression: $A > $B
              refId: C
      - uid: web-api-high-error-rate
        title: High Error Rate
        condition: C
        for: 1m
        annotations:
          summary: High error rate detected
          description: Error rate has exceeded 5% for 1 minute
          dashboard_url: /d/abc123/web-service-overview
        labels:
          severity: critical
          category: errors
          service: web-api
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: PBFA97CFB590B2093
            model:
              expr: sum(rate(http_requests_total{job="web-api",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="web-api"}[5m])) * 100
              intervalMs: 1000
              maxDataPoints: 43200
              refId: A
          - refId: B
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: PBFA97CFB590B2093
            model:
              expr: 5
              intervalMs: 1000
              maxDataPoints: 43200
              refId: B
          - refId: C
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: __expr__
            model:
              type: math
              expression: $A > $B
              refId: C
This example includes two alert rules:
- A warning alert when response times exceed 500ms
- A critical alert when the error rate exceeds 5%
Both alerts include links to a dashboard for further investigation.
Best Practices
1. Use Version Control
Store your provisioning files in a Git repository to track changes and collaborate with team members:
git init
git add provisioning/
git commit -m "Initial alert provisioning configuration"
2. Validate YAML Files
Before applying changes, validate your YAML files to avoid syntax errors:
yamllint provisioning/alerting/alert_rules/*.yaml
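If you keep the files in Git, the same check can run in CI before anything reaches the server. Here is a sketch of a GitHub Actions workflow (GitHub Actions is an assumption; any CI system works the same way):

# .github/workflows/validate-provisioning.yaml
name: Validate provisioning files
on: [push, pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint alerting YAML
        run: |
          pip install yamllint
          yamllint provisioning/alerting/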
3. Use Templates for Reusable Configuration
For teams managing many similar services, consider using a templating system like Jsonnet or Helm to generate your YAML files from reusable templates.
4. Include Documentation
Add comments in your YAML files to explain the purpose of each alert and any complex expressions:
rules:
  - uid: high-cpu-usage
    title: High CPU Usage
    # This alert detects sustained high CPU usage that might indicate
    # a resource constraint or runaway process
    condition: C
    for: 5m
    ...
5. Use Meaningful Naming Conventions
Adopt a consistent naming convention for your alert rules, groups, and contact points to make them easier to manage:
[Service/Component]_[Metric]_[Condition]
For example: API_ResponseTime_High or Database_DiskSpace_Low.
Troubleshooting
Common Issues
- Alert rules not appearing in Grafana
  - Check file permissions on the provisioning directories
  - Verify the YAML syntax is correct
  - Look for errors in the Grafana logs: /var/log/grafana/grafana.log
- Alert evaluation errors
  - Ensure datasource UIDs are correct for your environment
  - Verify expressions are valid for your data source type
- Notifications not being sent
  - Check that contact point settings are correct (URLs, email addresses)
  - Verify notification policy routes are matching your alert labels
  - Check Grafana's alerting logs for delivery errors
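When digging into the logs, filtering for provisioning- and alerting-related entries narrows things down quickly. For example (log path as above; adjust it if you log elsewhere):

# Show recent provisioning and alerting entries from the Grafana log
grep -iE "provisioning|ngalert" /var/log/grafana/grafana.log | tail -n 50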
Summary
In this guide, you've learned how to provision Grafana alerts using configuration files, enabling you to manage your alerting infrastructure as code. You've seen how to provision alert rules, contact points, and notification policies, as well as how to apply changes and follow best practices.
Alert provisioning brings the benefits of infrastructure as code to your monitoring setup, making it more maintainable, reproducible, and scalable. By mastering these techniques, you'll be able to build robust alerting systems that evolve with your applications.
Exercises
To reinforce your understanding of alert provisioning, try these exercises:
- Create an alert rule that monitors memory usage and triggers when it exceeds 90%
- Set up a multi-tier notification policy that routes different severity alerts to different contact points
- Provision a contact point that integrates with a service you use (PagerDuty, OpsGenie, etc.)
- Create a template for generating alert rules for multiple similar services
- Set up a CI pipeline that validates your provisioning files before deploying them