PrometheusRule Custom Resources
Introduction
When monitoring applications in Kubernetes with Prometheus, you'll often need to define alerting and recording rules. While Prometheus traditionally uses a file-based approach for rule definitions, the Prometheus Operator introduces a more Kubernetes-native way to manage these rules through PrometheusRule custom resources.
PrometheusRule resources allow you to define, organize, and manage Prometheus rules declaratively using Kubernetes objects. This integration brings several benefits:
- Declarative Configuration: Define rules as Kubernetes resources
- Version Control: Track rule changes in your Git repository alongside other Kubernetes manifests
- Dynamic Updates: Rules are automatically reloaded without restarting Prometheus
- Namespace Organization: Group rules logically by namespace
In this guide, we'll explore how PrometheusRule custom resources work, how to define them, and how to integrate them with your Prometheus setup in Kubernetes.
Prerequisites
Before diving into PrometheusRule custom resources, you should have:
- A Kubernetes cluster
- Prometheus Operator installed
- Basic understanding of Prometheus alerting and recording rules
- Basic understanding of Kubernetes Custom Resources
Understanding PrometheusRule Structure
A PrometheusRule resource is a Kubernetes custom resource that contains groups of Prometheus recording and alerting rules. Let's examine its structure:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rule
  namespace: monitoring
  labels:
    # These labels must match the ruleSelector of your Prometheus resource
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: example
    rules:
    - alert: HighRequestLatency
      expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
      for: 10m
      labels:
        severity: page
      annotations:
        summary: High request latency
        description: Job {{ $labels.job }} has a latency of {{ $value }} seconds.
The key components are:
- apiVersion and kind: Identify this as a PrometheusRule resource
- metadata.labels: Critical for rule discovery (more on this later)
- spec.groups: Contains one or more rule groups
- rules: The actual alerting or recording rules
Rule Types
PrometheusRule resources can contain two types of rules:
Recording Rules
Recording rules precompute frequently used or computationally expensive expressions and save their results as new time series. This improves performance for dashboards and queries.
- record: job:request_duration_seconds_count:avg_rate5m
  expr: avg(rate(request_duration_seconds_count[5m])) by (job)
Alerting Rules
Alerting rules define conditions that trigger alerts when they become true:
- alert: InstanceDown
  expr: up == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Instance {{ $labels.instance }} down"
    description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
Creating Your First PrometheusRule
Let's walk through creating a basic PrometheusRule resource that monitors CPU usage in your cluster:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-alert-rules
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: cpu-alerts
    rules:
    - alert: HighCPUUsage
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: High CPU usage detected
        description: "CPU usage is above 80% on instance {{ $labels.instance }}"
    - alert: CriticalCPUUsage
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: Critical CPU usage detected
        description: "CPU usage is above 95% on instance {{ $labels.instance }}"
To apply this rule:
kubectl apply -f cpu-alert-rules.yaml
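Once applied, you can confirm that the resource was created and inspect its contents with standard kubectl commands (the name cpu-alert-rules matches the manifest above):
kubectl get prometheusrules -n monitoring
kubectl describe prometheusrule cpu-alert-rules -n monitoring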
Rule Discovery in Prometheus Operator
The Prometheus Operator uses label selectors to discover PrometheusRule resources. This selector is defined in your Prometheus custom resource:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  # ... other settings
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  # ... other settings
With this configuration, Prometheus will discover every PrometheusRule resource labeled prometheus: k8s and role: alert-rules. Which namespaces are searched is governed by the separate ruleNamespaceSelector field: if it is left unset, only the Prometheus object's own namespace is considered, while an empty selector ({}) matches all namespaces the operator has permission to access.
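For example, to also pick up rules from any namespace carrying a monitoring: enabled label (an illustrative label; use whatever convention fits your cluster), you could extend the Prometheus resource like this:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  ruleNamespaceSelector:
    matchLabels:
      monitoring: enabled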
Best Practices for Organizing Rules
As your monitoring needs grow, you'll want to organize your rules effectively:
Namespacing
Group related rules by namespace. For example:
- The monitoring namespace for cluster-level rules
- Application-specific rules in their respective namespaces
Rule Grouping
Within a PrometheusRule, organize related rules into groups:
spec:
  groups:
  - name: node-alerts
    rules:
    # Node-related alerts
  - name: kubernetes-apps
    rules:
    # Application-related alerts
Rules in the same group are evaluated sequentially, which can be important for recording rules that build on each other.
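For example, within one group a later recording rule can reuse the series produced by the rule before it, since rules in a group are evaluated in order at the same evaluation timestamp (a sketch; the metric names are illustrative):
spec:
  groups:
  - name: request-rates
    rules:
    - record: instance:http_requests:rate5m
      expr: rate(http_requests_total[5m])
    - record: job:http_requests:rate5m
      expr: sum by (job) (instance:http_requests:rate5m)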
Labeling Strategy
Use consistent labels to categorize your rules:
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
    tier: infrastructure  # Additional categorization
    component: node       # Further categorization
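Because ruleSelector is a standard Kubernetes label selector, these extra labels can also drive finer-grained selection via matchExpressions (a sketch; the tier values are illustrative):
spec:
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
    matchExpressions:
    - key: tier
      operator: In
      values: ["infrastructure", "application"]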
Advanced Examples
Application-Specific Monitoring Rules
Here's a PrometheusRule for monitoring a web application:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: webapp-rules
  namespace: web-app
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: webapp-availability
    rules:
    - alert: WebAppDown
      expr: up{job="webapp"} == 0
      for: 2m
      labels:
        severity: critical
        team: frontend
      annotations:
        summary: Web application down
        description: The web application instance {{ $labels.instance }} has been down for more than 2 minutes.
        runbook_url: https://wiki.example.com/runbooks/webapp-down
    - alert: HighErrorRate
      expr: sum(rate(http_requests_total{job="webapp",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="webapp"}[5m])) > 0.05
      for: 2m
      labels:
        severity: warning
        team: frontend
      annotations:
        summary: High error rate detected
        description: Error rate is above 5% ({{ $value | humanizePercentage }})
  - name: webapp-performance
    rules:
    - record: job:request_latency_p95:5m
      expr: histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{job="webapp"}[5m])) by (le, instance))
    - alert: SlowResponses
      expr: job:request_latency_p95:5m > 2
      for: 10m
      labels:
        severity: warning
        team: frontend
      annotations:
        summary: Slow response times
        description: 95th percentile of request latency is above 2 seconds ({{ $value }} seconds)
Database Monitoring Rules
Here's an example for monitoring a database:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: database-rules
  namespace: database
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: postgres-alerts
    rules:
    - alert: PostgresqlDown
      expr: pg_up{job="postgresql"} == 0
      for: 1m
      labels:
        severity: critical
        team: db-admins
      annotations:
        summary: PostgreSQL instance down
        description: PostgreSQL instance {{ $labels.instance }} is down
    - alert: PostgresqlHighConnections
      expr: sum by (datname) (pg_stat_activity_count{job="postgresql"}) > 150
      for: 5m
      labels:
        severity: warning
        team: db-admins
      annotations:
        summary: Too many connections
        description: Database {{ $labels.datname }} has more than 150 connections
    - alert: PostgresqlSlowQueries
      expr: pg_slow_queries{job="postgresql"} > 5
      for: 5m
      labels:
        severity: warning
        team: db-admins
      annotations:
        summary: Slow queries detected
        description: There are {{ $value }} slow queries on instance {{ $labels.instance }}
Visualizing Rule Relationships
To see how PrometheusRule resources fit into the Prometheus Operator ecosystem, follow the path a rule takes: the operator watches for PrometheusRule objects matching a Prometheus resource's ruleSelector, renders their rule groups into standard Prometheus rule files stored in operator-managed ConfigMaps, mounts those files into the Prometheus pods, and triggers a configuration reload so the new rules take effect without a restart.
Debugging and Testing Rules
Checking Rule Syntax
You can validate your PrometheusRule resources before applying them:
kubectl apply --dry-run=client -f my-rule.yaml
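The dry-run above only checks the manifest, not the PromQL expressions inside it. For a deeper check, you can extract the spec (which follows the standard Prometheus rule-file format) and run it through promtool, assuming promtool and yq are installed locally:
kubectl get prometheusrule cpu-alert-rules -n monitoring -o yaml | yq '.spec' > /tmp/rules.yaml
promtool check rules /tmp/rules.yaml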
Testing Rule Expressions
You can test rule expressions directly in the Prometheus UI:
- Access your Prometheus UI (typically at http://<prometheus-host>:9090)
- Go to the "Graph" tab
- Enter your rule expression and click "Execute"
Checking Rule Status
To see if your rules are being loaded correctly:
- In the Prometheus UI, go to "Status" > "Rules"
- Verify your rules appear with the correct evaluation status
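The same information is available through the Prometheus HTTP API, which is handy for scripted checks:
curl http://<prometheus-host>:9090/api/v1/rules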
Common Patterns and Anti-Patterns
Recommended Patterns
✅ Use templating: Make use of Helm or Kustomize to template your PrometheusRule resources
✅ Include runbook links: Add runbook_url annotations to help responders
✅ Group related rules: Keep related rules in the same group
✅ Use meaningful labels: Label your alerts logically to help with routing
Anti-Patterns to Avoid
❌ Noisy alerts: Avoid alerting on metrics that fluctuate naturally
❌ Missing "for" duration: Always include a for clause to prevent flapping
❌ Lack of severity levels: Always include a severity label
❌ Hard-coded thresholds: Parameterize thresholds through your templating tool (for example Helm values or Kustomize overlays) instead of scattering magic numbers; see the sketch below
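As an illustration of the last point, a Helm template can pull thresholds from values.yaml rather than hard-coding them (a sketch; the .Values.cpuAlert keys are hypothetical names for this example):
# templates/cpu-alert-rules.yaml (excerpt)
- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > {{ .Values.cpuAlert.warningThreshold }}
  for: {{ .Values.cpuAlert.forDuration | default "5m" }}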
Integration with AlertManager
When an alerting rule defined in a PrometheusRule fires, Prometheus sends the resulting alert to AlertManager, which routes and deduplicates it based on the labels you define:
# In an AlertmanagerConfig resource
spec:
  route:
    groupBy: ['alertname', 'job']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: 'default-receiver'
    routes:
    - matchers:
      - name: severity
        value: critical
      receiver: 'critical-alerts'
  receivers:
  - name: 'default-receiver'
    # receiver configuration...
  - name: 'critical-alerts'
    # receiver configuration...
Summary
PrometheusRule custom resources provide a Kubernetes-native way to manage Prometheus alerting and recording rules. They offer several advantages over the traditional file-based approach:
- Declarative, version-controlled rule definitions
- Dynamic updates without Prometheus restarts
- Logical organization by namespace and labels
- Easy integration with the entire Kubernetes ecosystem
By mastering PrometheusRule resources, you can build comprehensive monitoring systems that detect issues early and provide valuable insights into your applications and infrastructure.
Additional Resources
- Prometheus Operator Documentation
- Prometheus Alerting Rules
- Prometheus Recording Rules
- AlertManager Configuration
Exercises
- Create a PrometheusRule that monitors memory usage in your cluster.
- Develop a PrometheusRule for a specific application with both recording and alerting rules.
- Set up a rule that monitors API server latency and alerts when response times exceed certain thresholds.
- Create a recording rule that aggregates request count metrics by endpoint and status code.
- Design an alerting strategy with different severity levels and appropriate timing thresholds.