Recording Rules Best Practices
Introduction
Recording rules are a powerful Prometheus feature that lets you precompute frequently used or computationally expensive expressions and save their results as new time series. These precomputed series can then be queried far more cheaply than re-evaluating the original expression on every request.
In this guide, we'll explore best practices for creating and managing recording rules in Prometheus, helping you optimize your monitoring system and improve query performance.
Why Use Recording Rules?
Before diving into best practices, let's understand why recording rules are essential:
- Performance Optimization: Complex queries can be expensive to compute on demand. Recording rules precompute these expressions at regular intervals.
- Dashboard Efficiency: Grafana dashboards that use precomputed metrics load faster and put less strain on your Prometheus server.
- Query Simplification: Recording rules can turn complex expressions into simple metrics, making them easier to use.
- Alerting Reliability: Basing alerts on precomputed series keeps alert evaluation fast and reliable.
At a high level, Prometheus evaluates each rule's expression on a fixed schedule and writes the result back into its local storage as a new time series, which dashboards and alerts can then read directly.
Basic Recording Rule Syntax
Recording rules in Prometheus are defined in a YAML configuration file. Here's the basic structure:
```yaml
groups:
  - name: example
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
```
Each rule consists of:
- `record`: The name of the new time series to create
- `expr`: The PromQL expression to evaluate
Recording Rules Best Practices
1. Use Consistent Naming Conventions
Follow a consistent naming pattern for your recording rules to make them easily identifiable and understandable.
Recommended pattern: `level:metric:operations`
- `level`: The aggregation level (e.g., `instance`, `job`, `cluster`)
- `metric`: The base metric name
- `operations`: The operations applied (e.g., `rate5m`, `sum`, `ratio`)
Example:

```yaml
# Good
- record: job:http_requests_total:rate5m
  expr: sum(rate(http_requests_total[5m])) by (job)

# Avoid
- record: http_request_rate_by_job
  expr: sum(rate(http_requests_total[5m])) by (job)
```
2. Group Related Rules Together
Organize your recording rules into logical groups. Rules in the same group are evaluated sequentially, so you can reference earlier rules within the same group.
```yaml
groups:
  - name: http_metrics
    interval: 30s
    rules:
      # Base rates
      - record: instance:http_requests:rate5m
        expr: rate(http_requests_total[5m])
      # Job-level aggregation (uses the rule above)
      - record: job:http_requests:rate5m
        expr: sum by (job) (instance:http_requests:rate5m)
```
3. Set Appropriate Evaluation Intervals
Choose rule evaluation intervals based on the volatility of the data and the query requirements:
- High-frequency metrics: 15-30s intervals
- Standard metrics: 30s-1m intervals
- Slow-changing metrics: 5m intervals
```yaml
groups:
  - name: critical_metrics
    interval: 15s
    rules:
      - record: instance:cpu:usage:rate1m
        expr: rate(node_cpu_seconds_total{mode="user"}[1m])

  - name: standard_metrics
    interval: 1m
    rules:
      - record: job:http_errors:ratio5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)
```
4. Optimize Rule Expressions
Write efficient PromQL expressions to minimize the computational load:
Use Appropriate Time Windows
```yaml
# Good for recent trends
- record: job:http_requests:rate5m
  expr: sum(rate(http_requests_total[5m])) by (job)

# Good for longer-term patterns
- record: job:http_requests:rate1h
  expr: sum(rate(http_requests_total[1h])) by (job)
```
Limit Label Cardinality
High cardinality labels can cause performance issues. Be selective about which labels to include.
```yaml
# Better - retains only necessary labels
- record: api:request_duration_seconds:p95
  expr: histogram_quantile(0.95, sum by (job, method, path, le) (rate(http_request_duration_seconds_bucket[5m])))

# Avoid - includes all labels
- record: api:request_duration_seconds:p95_all_labels
  expr: histogram_quantile(0.95, sum by (job, method, path, status, user, region, le) (rate(http_request_duration_seconds_bucket[5m])))
```
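Once a rule is live, you can check how many series it actually produces. A quick sketch (the rule name is the one from the example above; `prometheus_tsdb_head_series` is Prometheus's own metric for total in-memory series):

```promql
# Number of series produced by the recorded metric
count(api:request_duration_seconds:p95)

# Total series currently in the head block, for overall context
prometheus_tsdb_head_series
```

If the recorded series count is a large fraction of the total, the rule's label set is probably too wide.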
5. Build Complex Metrics Incrementally
For complex calculations, break down the computation into multiple steps:
```yaml
groups:
  - name: error_rates
    interval: 30s
    rules:
      # Step 1: Calculate total requests per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Step 2: Calculate error requests per job
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      # Step 3: Calculate the error ratio from the previous results
      - record: job:http_errors:ratio5m
        expr: job:http_errors:rate5m / job:http_requests:rate5m
```
6. Add Context with Labels and Comments
Use labels and YAML comments to provide context and documentation for your rules:
```yaml
groups:
  - name: service_metrics
    interval: 30s
    rules:
      # Tracks request latency at the 95th percentile for each service
      - record: service:request_latency_seconds:p95
        expr: histogram_quantile(0.95, sum by (service, le) (rate(request_duration_seconds_bucket[5m])))
        labels:
          purpose: "SLO tracking"
          team: "platform"
```
7. Test Rules Before Deployment
Before deploying recording rules to production, test them using the Prometheus `/api/v1/query` endpoint to ensure they produce the expected results.

Example test using curl:

```bash
curl -s 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=sum(rate(http_requests_total[5m])) by (job)'
```
You can also use the Prometheus expression browser to validate the output of your rules.
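You can also validate and unit-test rule files offline with `promtool`, which ships with Prometheus: `promtool check rules recording_rules.yml` verifies syntax, and `promtool test rules` evaluates rules against synthetic data. A minimal test file might look like this (file names and series values are illustrative):

```yaml
# recording_rules_test.yml -- run with: promtool test rules recording_rules_test.yml
rule_files:
  - recording_rules.yml

tests:
  - interval: 1m
    input_series:
      # One job serving a steady 1 request/second (counter grows by 60 per minute)
      - series: 'http_requests_total{job="api", instance="a"}'
        values: '0+60x10'
    promql_expr_test:
      - expr: job:http_requests:rate5m
        eval_time: 10m
        exp_samples:
          - labels: 'job:http_requests:rate5m{job="api"}'
            value: 1
```

Running the whole rule file through `promtool` in CI catches syntax errors and broken expressions before they ever reach a live server.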
8. Implement Recording Rules for SLIs/SLOs
Recording rules are perfect for Service Level Indicators (SLIs) and Service Level Objectives (SLOs):
```yaml
groups:
  - name: slos
    interval: 30s
    rules:
      # Availability SLI - percentage of successful requests
      - record: service:availability:ratio5m
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)
      # Latency SLI - 90th percentile request duration
      - record: service:latency_seconds:p90
        expr: histogram_quantile(0.9, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
      # SLO error budget consumption rate (assuming a 99% SLO target)
      - record: service:error_budget:consumption_rate
        expr: (1 - service:availability:ratio5m) / (1 - 0.99)
```
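A consumption rate of 1 means the service is burning its error budget exactly as fast as the SLO allows. The recorded series above can feed a burn-rate alert; the threshold below follows common multiwindow burn-rate practice and is an assumption you should tune for your own SLO window:

```yaml
groups:
  - name: slo_alerts
    rules:
      # Fires when the error budget is being consumed more than 14x faster
      # than the 99% SLO allows (a typical "fast burn" threshold -- adjust
      # to match your SLO period and alerting policy).
      - alert: ErrorBudgetFastBurn
        expr: service:error_budget:consumption_rate > 14
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning too fast for {{ $labels.service }}"
```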
9. Create Rules for Frequently Used PromQL Patterns
Identify common query patterns in your dashboards and alerts, and create recording rules for them:
```yaml
groups:
  - name: common_patterns
    interval: 30s
    rules:
      # CPU usage per container
      - record: container:cpu:usage_seconds:rate5m
        expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (container, pod, namespace)
      # Memory usage per container
      - record: container:memory:usage_bytes
        expr: container_memory_usage_bytes
      # Network traffic per instance
      - record: instance:network:receive_bytes:rate5m
        expr: sum(rate(node_network_receive_bytes_total[5m])) by (instance)
```
10. Monitor Your Recording Rules
Create alerts to monitor the health of your recording rules evaluation:
```yaml
groups:
  - name: prometheus_rules
    rules:
      - alert: PrometheusRuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus rule evaluation failing"
          description: "Prometheus has recorded {{ $value }} rule evaluation failures in the last 5 minutes."
```
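Evaluation-failure counters won't catch a rule whose input series quietly disappear: the rule still "succeeds" while producing nothing. It can therefore be worth alerting when a critical recorded series stops existing (the series name below is the one used in earlier examples):

```yaml
groups:
  - name: recorded_series_health
    rules:
      # Fires if the recorded series produces no data at all,
      # e.g. because its source metric stopped being scraped.
      - alert: RecordedSeriesMissing
        expr: absent(job:http_requests:rate5m)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Recorded series job:http_requests:rate5m is missing"
```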
Practical Example: Creating a Recording Rules Configuration
Let's put these best practices together in a complete example for monitoring a web application:
```yaml
groups:
  # HTTP traffic metrics
  - name: http_metrics
    interval: 30s
    rules:
      # Base rates by instance and job
      - record: instance:http_requests:rate5m
        expr: rate(http_requests_total[5m])
      - record: job:http_requests:rate5m
        expr: sum by (job) (instance:http_requests:rate5m)
      # Status code breakdowns
      - record: job:http_requests:rate5m:status
        expr: sum by (job, status) (rate(http_requests_total[5m]))
      # Error ratios
      - record: job:http_errors:ratio5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))

  # Latency metrics
  - name: latency_metrics
    interval: 30s
    rules:
      # Percentile latencies by job and endpoint
      - record: job:http_latency:p50
        expr: histogram_quantile(0.5, sum by (job, endpoint, le) (rate(http_request_duration_seconds_bucket[5m])))
      - record: job:http_latency:p90
        expr: histogram_quantile(0.9, sum by (job, endpoint, le) (rate(http_request_duration_seconds_bucket[5m])))
      - record: job:http_latency:p99
        expr: histogram_quantile(0.99, sum by (job, endpoint, le) (rate(http_request_duration_seconds_bucket[5m])))

  # System metrics
  - name: system_metrics
    interval: 1m
    rules:
      - record: instance:cpu:usage:rate5m
        expr: rate(node_cpu_seconds_total{mode="user"}[5m])
      - record: instance:memory:usage:percent
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
```
Implementing Recording Rules in Prometheus
Now that we've covered the best practices, let's see how to implement recording rules in your Prometheus setup:
1. Create a rules file (e.g., `recording_rules.yml`) with your rules configuration.

2. Update your Prometheus configuration (`prometheus.yml`) to include the rules file:

   ```yaml
   rule_files:
     - "recording_rules.yml"
   ```

3. Reload your Prometheus configuration without restarting. Note that the `/-/reload` endpoint only works when Prometheus is started with the `--web.enable-lifecycle` flag; otherwise, send the process a SIGHUP instead.

   ```bash
   curl -X POST http://prometheus:9090/-/reload
   ```

4. Verify your rules are loaded in the Prometheus UI by navigating to Status > Rules.
Performance Considerations
When implementing recording rules, keep these performance considerations in mind:
- Memory Usage: Each recorded time series consumes memory. Monitor Prometheus memory usage when adding new rules.
- Rule Evaluation Time: Monitor `prometheus_rule_group_evaluation_duration_seconds` to ensure rule evaluation doesn't take too long.
- Storage Impact: Recording rules create new time series that require additional storage. Plan your storage capacity accordingly.
- Cardinality Explosion: Be careful with high-cardinality labels in recording rules to avoid excessive time series creation.
Summary
Recording rules are a powerful tool for optimizing Prometheus performance and making your metrics more accessible. By following these best practices, you can create an efficient and maintainable recording rules configuration:
- Use consistent naming conventions
- Group related rules together
- Set appropriate evaluation intervals
- Optimize rule expressions and limit cardinality
- Build complex metrics incrementally
- Provide context with labels and comments
- Test rules before deployment
- Implement rules for SLIs/SLOs
- Create rules for common query patterns
- Monitor the health of your recording rules
By applying these practices, you'll improve query performance, reduce load on your Prometheus server, and create a more responsive monitoring system.
Exercises
1. Review your existing Prometheus dashboards and identify three frequently used or computationally expensive queries that could benefit from recording rules.

2. Create recording rules for calculating the 95th percentile latency for your most critical services.

3. Implement a hierarchical set of recording rules that calculate error rates at different aggregation levels (instance, job, and service).

4. Set up monitoring for your recording rules to ensure they're evaluating successfully.