PagerDuty Integration
Introduction
In a production environment, promptly responding to critical issues is essential. While Prometheus excels at monitoring and detecting anomalies, you need a reliable way to notify on-call engineers about these issues. This is where PagerDuty comes in.
PagerDuty is an incident management platform that helps teams detect and respond to critical infrastructure problems quickly. When integrated with Prometheus, it enables you to:
- Route alerts to the right team members based on schedules and escalation policies
- Track incident acknowledgment and resolution
- Coordinate response efforts among team members
- Analyze incident patterns to improve system reliability
This guide will walk you through setting up a Prometheus integration with PagerDuty to streamline your incident response process.
Prerequisites
Before starting this integration, you should have:
- A running Prometheus instance
- Alertmanager configured and connected to Prometheus
- A PagerDuty account (free trial or paid)
- Basic understanding of Prometheus alerting rules
How PagerDuty Works with Prometheus
When you integrate Prometheus with PagerDuty, the workflow typically follows this pattern:
- Prometheus evaluates alerting rules against the metrics it scrapes
- When a rule fires, Prometheus sends the alert to Alertmanager
- Alertmanager groups, deduplicates, and routes the alert to its PagerDuty receiver using the integration key
- PagerDuty creates an incident and notifies the on-call engineer according to the service's escalation policy
- When the alert resolves, Alertmanager notifies PagerDuty (if send_resolved is enabled) and the incident is resolved automatically
Setting Up the Integration
Step 1: Create a Service in PagerDuty
- Log in to your PagerDuty account
- Navigate to Services and click + New Service
- Enter a name for your service (e.g., "Prometheus Alerts")
- Select an escalation policy (or create a new one)
- Under Integration Settings, select Prometheus as the integration type
- Click Add Service
- After creating the service, you'll receive an Integration Key. Save this key; you'll need it for the Alertmanager configuration
Step 2: Configure Alertmanager for PagerDuty
Update your Alertmanager configuration file (alertmanager.yml) to include PagerDuty as a receiver:
```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'pagerduty-notifications'

receivers:
  - name: 'pagerduty-notifications'
    pagerduty_configs:
      - service_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY>'
        send_resolved: true
        details:
          custom_details: '{{ template "pagerduty.default.description" . }}'
```
Replace <YOUR_PAGERDUTY_INTEGRATION_KEY> with the integration key you received from PagerDuty. Note that service_key is used with PagerDuty's Prometheus integration type; if you created an Events API v2 integration instead, use routing_key in its place.
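Before reloading Alertmanager, it is worth validating the file. A minimal sketch using amtool, assuming it is installed alongside Alertmanager and the configuration lives at /etc/alertmanager/alertmanager.yml:

```bash
# Validate the Alertmanager configuration (file path is an assumption)
amtool check-config /etc/alertmanager/alertmanager.yml
```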
Step 3: Create Custom Templates (Optional)
For more detailed notifications, you can create custom templates in Alertmanager. Create a templates.tmpl file:
{{ define "pagerduty.default.description" }}
{{ range .Alerts }}
Alert: {{ .Labels.alertname }}
Severity: {{ .Labels.severity }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
{{ if .Labels.instance }}Instance: {{ .Labels.instance }}{{ end }}
{{ end }}
{{ end }}
Then reference this file in your Alertmanager configuration:
```yaml
global:
  resolve_timeout: 5m

templates:
  - '/etc/alertmanager/templates.tmpl'

# Rest of the configuration...
```
Step 4: Define Prometheus Alert Rules
Create alert rules in Prometheus that will trigger notifications to PagerDuty. Here's an example of a high CPU usage alert:
```yaml
groups:
  - name: cpu-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 85% for more than 5 minutes on {{ $labels.instance }}"
```
Step 5: Reload Configurations
Reload your Prometheus and Alertmanager configurations:
```bash
curl -X POST http://localhost:9090/-/reload   # Prometheus
curl -X POST http://localhost:9093/-/reload   # Alertmanager
```
Prometheus only exposes the /-/reload endpoint when it is started with the --web.enable-lifecycle flag; Alertmanager exposes it by default.
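If the lifecycle endpoint is not enabled, sending SIGHUP to the processes also reloads their configuration. A minimal sketch, assuming both run as plain host processes (adjust for containers or service managers):

```bash
# Reload via SIGHUP instead of the HTTP endpoint
kill -HUP "$(pidof prometheus)"
kill -HUP "$(pidof alertmanager)"
```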
Testing the Integration
To test whether your integration is working correctly:
- Send a test alert directly to Alertmanager's API, as shown in the sketch below, or temporarily modify an alert rule so that it fires under normal conditions
- Verify in the PagerDuty dashboard that an incident was created, and that it resolves once the alert clears
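One way to exercise the full Alertmanager-to-PagerDuty path without waiting for a real incident is to post a synthetic alert to Alertmanager's v2 API. A minimal sketch, assuming Alertmanager listens on localhost:9093; the PagerDutyTest alert name is made up for this test:

```bash
# Fire a synthetic critical alert; Alertmanager should route it to PagerDuty
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
        "labels": {"alertname": "PagerDutyTest", "severity": "critical"},
        "annotations": {"summary": "Test alert for the PagerDuty integration"}
      }]'
```

Because no endsAt timestamp is set, Alertmanager treats the alert as resolved after the configured resolve_timeout (5m above), so the PagerDuty incident should also resolve automatically when send_resolved is enabled.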
Real-World Example: High-Latency API Monitoring
Let's create a practical example of monitoring API response times and alerting when they exceed acceptable thresholds:
- First, set up Prometheus to scrape your API endpoint metrics:
```yaml
scrape_configs:
  - job_name: 'api-monitoring'
    metrics_path: '/metrics'
    scrape_interval: 15s
    static_configs:
      - targets: ['api.example.com:9090']
```
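Before depending on alerts for this job, you can confirm that Prometheus is actually scraping the target, for example via the query API. A minimal sketch, assuming Prometheus runs on localhost:9090:

```bash
# A value of 1 means the last scrape of the api-monitoring target succeeded
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="api-monitoring"}'
```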
- Define alert rules for high API latency:
```yaml
groups:
  - name: api-alerts
    rules:
      - alert: APIHighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api-monitoring"}[5m])) by (le, endpoint)) > 0.5
        for: 10m
        labels:
          severity: warning
          team: api
        annotations:
          summary: "High API latency detected"
          description: "95th percentile of API response time is above 500ms for endpoint {{ $labels.endpoint }}"
      - alert: APICriticalLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api-monitoring"}[5m])) by (le, endpoint)) > 1
        for: 5m
        labels:
          severity: critical
          team: api
        annotations:
          summary: "Critical API latency detected"
          description: "95th percentile of API response time is above 1s for endpoint {{ $labels.endpoint }}"
```
- Configure routing in Alertmanager to send only critical alerts to PagerDuty:
```yaml
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-notifications'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: '[email protected]'
  - name: 'pagerduty-notifications'
    pagerduty_configs:
      - service_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY>'
        send_resolved: true
```
This configuration ensures that only critical issues page your on-call engineers, while less severe warnings go to email and can be handled during business hours.
Customizing PagerDuty Incidents
You can customize how PagerDuty incidents are created by including additional information in your Alertmanager configuration:
```yaml
receivers:
  - name: 'pagerduty-notifications'
    pagerduty_configs:
      - service_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY>'
        send_resolved: true
        description: '{{ .CommonLabels.alertname }}'
        client: 'Prometheus Alertmanager'
        client_url: 'https://alertmanager.example.com'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          summary: '{{ .CommonAnnotations.summary }}'
          description: '{{ .CommonAnnotations.description }}'
```
Advanced: Using Event Transformers
PagerDuty supports Event Transformers that can modify the content of incoming alerts. This is useful for:
- Consolidating similar alerts
- Adding custom fields
- Changing severity based on time of day
- Filtering out certain types of alerts
You can configure Event Transformers in the PagerDuty web interface under your service settings.
Troubleshooting
Common issues and their solutions:
- Alerts not reaching PagerDuty
  - Verify the integration key is correct
  - Check Alertmanager logs for errors
  - Ensure your network allows outbound connections to PagerDuty's API
- Duplicate alerts in PagerDuty
  - Review your group_by settings in Alertmanager
  - Check your repeat_interval configuration
- Resolved alerts not clearing in PagerDuty
  - Ensure send_resolved: true is set in your configuration
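When alerts never show up in PagerDuty, it helps to confirm whether they reached Alertmanager at all. A minimal sketch, assuming Alertmanager on localhost:9093 and a systemd unit named alertmanager (both assumptions):

```bash
# List the alerts Alertmanager currently knows about
curl -s http://localhost:9093/api/v2/alerts

# Watch the Alertmanager logs for PagerDuty delivery errors
journalctl -u alertmanager -f
```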
Summary
Integrating Prometheus with PagerDuty creates a powerful incident management pipeline that helps your team respond quickly to critical issues. The key benefits include:
- Automated alert routing based on severity and type
- Structured on-call schedules and escalation policies
- Incident tracking and coordination
- Historical analysis of incidents for continuous improvement
By following this guide, you've learned how to:
- Set up a PagerDuty service for Prometheus alerts
- Configure Alertmanager to send notifications to PagerDuty
- Create meaningful alert rules that trigger appropriate responses
- Test and troubleshoot the integration
Additional Resources
- PagerDuty's official Prometheus integration guide
- Prometheus Alertmanager documentation
- PagerDuty API documentation
Exercises
- Set up a test environment with Prometheus and Alertmanager
- Create a free PagerDuty account and integrate it with your test environment
- Define custom alert rules for metrics that matter to your application
- Configure different routing rules based on alert severity
- Test the end-to-end flow by triggering test alerts