Recording Rules Optimization
Introduction
Recording rules are one of Prometheus' most powerful features for improving query performance and reducing system load. They allow you to precompute frequently used or computationally expensive expressions and save their results as new time series. This guide will explore advanced techniques for optimizing your recording rules to ensure your Prometheus deployment remains efficient and performant even as your monitoring needs scale.
Why Optimize Recording Rules?
Before diving into optimization techniques, it's important to understand why optimizing recording rules matters:
- Query Performance: Well-designed recording rules can reduce query latency by orders of magnitude
- Resource Efficiency: Optimized rules reduce CPU and memory consumption
- Scalability: Good optimization practices help your monitoring scale with your infrastructure
- Reliability: Efficient rules reduce the risk of Prometheus becoming overloaded
Basic Recording Rules Review
Let's start with a quick review of recording rule syntax:
groups:
- name: example
interval: 5s
rules:
- record: job:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
This simple rule:
- Creates a new time series called job:http_requests:rate5m
- Records the 5-minute rate of HTTP requests aggregated by job
- Evaluates every 5 seconds (as defined by the group's interval)
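As a reminder, rules only take effect once the file containing them is referenced from the Prometheus configuration. A minimal sketch, assuming the group above is saved under rules/recording_rules.yml (a hypothetical path):
# prometheus.yml (excerpt)
rule_files:
  - "rules/recording_rules.yml"
The file can be validated before a reload with promtool check rules rules/recording_rules.yml.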
Advanced Optimization Techniques
1. Optimize Evaluation Intervals
One of the most important optimization techniques is choosing appropriate evaluation intervals:
groups:
- name: frequently_accessed
interval: 15s
rules:
- record: instance:node_cpu:utilization:rate1m
expr: sum(rate(node_cpu_seconds_total{mode!="idle"}[1m])) by (instance) / count(node_cpu_seconds_total{mode="idle"}) by (instance)
- name: dashboard_metrics
interval: 1m
rules:
- record: job:http_errors:ratio30m
expr: sum(rate(http_requests_total{status=~"5.."}[30m])) by (job) / sum(rate(http_requests_total[30m])) by (job)
- name: capacity_planning
interval: 5m
rules:
- record: instance:disk:capacity_prediction7d
expr: predict_linear(node_filesystem_free_bytes[1d], 7 * 24 * 3600)
Optimization principles:
- Match the evaluation interval to how frequently the data is accessed
- Consider the rate of change of the underlying metrics
- Use shorter intervals for alerting-related rules
- Use longer intervals for dashboard metrics that don't need to be real-time
- Group rules by their optimal evaluation interval
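A quick way to verify that a group's interval leaves enough headroom is to compare how long its last evaluation took with the interval it is configured for; Prometheus exposes both as gauges. A ratio approaching 1 means the group barely finishes before its next evaluation is due:
prometheus_rule_group_last_duration_seconds / prometheus_rule_group_interval_seconds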
2. Reduce Cardinality
High cardinality can significantly impact Prometheus performance. Optimize your recording rules to reduce cardinality when possible:
# Before optimization - high cardinality
- record: api_requests:rate5m
expr: sum(rate(api_requests_total[5m])) by (instance, method, path, status)
# After optimization - reduced cardinality
- record: api_requests:rate5m:status_family
expr: sum(rate(api_requests_total[5m])) by (instance, method, status_family)
In this example, we've:
- Removed the high-cardinality path label
- Replaced detailed HTTP status codes with status families (2xx, 3xx, 4xx, 5xx)
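Note that status_family is not a label exporters normally provide out of the box; it has to be created, either with metric_relabel_configs at scrape time or inside the rule itself. A minimal sketch of the latter using label_replace (the regex simply keeps the first digit of the status label):
- record: api_requests:rate5m:status_family
  expr: sum by (instance, method, status_family) (label_replace(rate(api_requests_total[5m]), "status_family", "${1}xx", "status", "([0-9]).."))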
Pro Tip: Consider using the without clause instead of by when you only need to drop a few high-cardinality labels:
- record: http_requests:rate5m
expr: sum without(id, user_agent, request_id) (rate(http_requests_total[5m]))
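Before deciding which labels to drop, it is worth measuring which ones actually drive cardinality. Two ad-hoc queries help (metric and label names here match the examples above):
# Total number of series behind the metric
count(api_requests_total)
# Number of distinct values the path label contributes
count(count by (path) (api_requests_total))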
3. Chain Recording Rules
For complex queries, create chains of recording rules with increasing levels of aggregation:
groups:
- name: http_metrics
interval: 30s
rules:
# Level 1: Per-instance rates
- record: instance:http_requests:rate5m
expr: rate(http_requests_total[5m])
# Level 2: Aggregated by service, method, and status
- record: service:http_requests:rate5m
expr: sum by(service, method, status) (instance:http_requests:rate5m)
# Level 3: Error ratios
- record: service:http_requests:error_ratio:rate5m
expr: sum by(service) (service:http_requests:rate5m{status=~"5.."}) / sum by(service) (service:http_requests:rate5m)
Benefits:
- Each level builds on the previous computation
- Reduces redundant calculations
- Makes the final complex queries much more efficient
- Makes debugging easier by providing intermediate results
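The final rule in such a chain is also a natural input for alerting. A minimal sketch of an alert built on the recorded error ratio (the 5% threshold, group name, and severity label are illustrative assumptions, not values from this guide):
groups:
- name: http_error_alerts
  rules:
  - alert: HighErrorRatio
    # hypothetical threshold: fire when more than 5% of requests fail
    expr: service:http_requests:error_ratio:rate5m > 0.05
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: 'Error ratio above 5% for service {{ $labels.service }}'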
4. Use Subqueries Carefully
Subqueries can be powerful but expensive. When optimizing recording rules with subqueries, consider:
# Expensive subquery pattern: the inner expression is recomputed
# at every subquery step on each evaluation
- record: job:http_request_latency:p95:rolling1h
expr: avg_over_time(histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1m])) by (job, le))[1h:1m])
# Optimized approach
groups:
- name: http_latency
interval: 1m
rules:
# First calculate p95 for shorter intervals
- record: job:http_request_latency:p95:1m
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1m])) by (job, le))
# Then use avg_over_time for the rolling window
- record: job:http_request_latency:p95:rolling1h
expr: avg_over_time(job:http_request_latency:p95:1m[1h])
This approach:
- Avoids the expensive subquery operation
- Breaks the calculation into more efficient steps
- Still provides the rolling-window view (keep in mind that an average of per-minute p95 values approximates, but does not exactly equal, the true 1-hour p95)
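If you want to sanity-check how far the cheap rolling average drifts from the exact quantile, an occasional ad-hoc comparison against the direct 1-hour histogram quantile works well (run it as a one-off query rather than as a rule, since it is exactly the expensive computation the recording rules avoid):
abs(job:http_request_latency:p95:rolling1h - histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1h])) by (job, le)))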
5. Prioritize and Organize Rule Groups
Organize your recording rules by priority and logical grouping:
groups:
# Critical operational metrics evaluated frequently
- name: critical_ops
interval: 15s
rules:
- record: instance:memory:available_bytes
expr: node_memory_MemAvailable_bytes
- record: instance:cpu:utilization:rate1m
expr: sum(rate(node_cpu_seconds_total{mode!="idle"}[1m])) by (instance) / count(node_cpu_seconds_total{mode="idle"}) by (instance)
# Service-specific metrics
- name: service_metrics
interval: 30s
rules:
- record: service:request_rate:5m
expr: sum(rate(http_requests_total[5m])) by (service)
- record: service:error_rate:5m
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
# Business metrics that change less frequently
- name: business_metrics
interval: 1m
rules:
- record: product:purchases:rate5m
expr: sum(rate(purchase_events_total[5m])) by (product)
6. Use Time Windows Appropriately
Choose appropriate time windows based on the metric's volatility and how it will be used:
# For highly volatile metrics where immediate changes matter
- record: service:error_spike:rate30s
expr: rate(http_errors_total[30s])
# For general trending where smoothing is beneficial
- record: service:requests:rate5m
expr: rate(http_requests_total[5m])
# For long-term analysis where stability is important
- record: service:traffic:rate1h
expr: rate(http_requests_total[1h])
Guidelines:
- Shorter windows (30s-1m) are more responsive but noisier
- Longer windows (5m-1h) provide smoother data but mask rapid changes
- Make sure the window spans at least two (ideally four) scrape intervals, or rate() will have too few samples to return data (the query below helps verify this)
- Match the window to how the data will be used (alerting vs dashboards)
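Since rate() needs at least two samples inside its window, it helps to know how often each target is actually scraped. The following ad-hoc query counts samples of the up metric per target over one minute, which is roughly 60 divided by the scrape interval in seconds:
count_over_time(up[1m])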
Performance Benchmarking
To identify which rules need optimization, add these recording rules to measure rule evaluation performance:
groups:
- name: recording_rules_performance
interval: 1m
rules:
# prometheus_rule_evaluation_duration_seconds is a summary without a
# rule_group label, so per-group timing comes from the rule group gauge
- record: rule_group:evaluation_duration_seconds:avg1h
expr: avg_over_time(prometheus_rule_group_last_duration_seconds[1h])
- record: prometheus:rule_evaluation:failures:rate5m
expr: sum(rate(prometheus_rule_evaluation_failures_total[5m])) by (rule_group)
These meta-metrics help you:
- Identify slow rule groups
- Detect failing evaluations
- Track the impact of your optimizations
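These meta-metrics become even more useful when alerted on. A minimal sketch, using the failure-rate rule from above plus Prometheus' own missed-iterations counter (group name, thresholds, and severity label are illustrative assumptions):
groups:
- name: rule_meta_alerts
  rules:
  - alert: RuleGroupIterationsMissed
    # the group is too slow (or Prometheus too overloaded) to keep its schedule
    expr: increase(prometheus_rule_group_iterations_missed_total[10m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: 'Rule group {{ $labels.rule_group }} is missing evaluations'
  - alert: RuleEvaluationFailures
    expr: prometheus:rule_evaluation:failures:rate5m > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: 'Rule group {{ $labels.rule_group }} has failing evaluations'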
Optimization Workflow Example
Let's walk through a complete optimization workflow for a high-cardinality API monitoring setup:
Step 1: Identify problematic rules
topk(10, avg_over_time(prometheus_rule_group_last_duration_seconds[1h]))
This query reveals that the api_metrics rule group has the highest average evaluation time.
Step 2: Analyze the current rules
Original rule group:
groups:
- name: api_metrics
interval: 15s
rules:
- record: api:request_duration:p95
expr: histogram_quantile(0.95, sum(rate(api_request_duration_seconds_bucket[5m])) by (service, endpoint, method, status_code, le))
- record: api:request_rate:5m
expr: sum(rate(api_requests_total[5m])) by (service, endpoint, method, status_code)
- record: api:error_rate:5m
expr: sum(rate(api_requests_total{status_code=~"5.."}[5m])) by (service, endpoint, method)
Step 3: Apply optimization techniques
Optimized rule group:
groups:
# Split rules by evaluation interval and create a hierarchy
- name: api_metrics_base
interval: 30s
rules:
# Reduce cardinality by using fewer dimensions
- record: api:request_duration_bucket:rate5m
expr: sum(rate(api_request_duration_seconds_bucket[5m])) by (service, endpoint, le)
- record: api:requests:rate5m
expr: sum(rate(api_requests_total[5m])) by (service, endpoint)
- record: api:errors:rate5m
expr: sum(rate(api_requests_total{status_code=~"5.."}[5m])) by (service, endpoint)
- name: api_metrics_derived
interval: 1m
rules:
# Build on the base metrics with further aggregation
- record: api:request_duration:p95
expr: histogram_quantile(0.95, api:request_duration_bucket:rate5m)
- record: api:error_ratio:5m
expr: api:errors:rate5m / api:requests:rate5m
# Service-level aggregations for dashboards
- record: service:request_duration:p95
expr: histogram_quantile(0.95, sum(api:request_duration_bucket:rate5m) by (service, le))
- record: service:error_ratio:5m
expr: sum(api:errors:rate5m) by (service) / sum(api:requests:rate5m) by (service)
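Before rolling the rewritten groups out, they can be unit-tested with promtool test rules. A minimal sketch, assuming the optimized groups above live in api_rules.yml next to a test file called api_rules_test.yml (both filenames are hypothetical); it feeds in traffic with no 5xx responses and checks that the derived error ratio comes out as zero:
# api_rules_test.yml
rule_files:
  - api_rules.yml
evaluation_interval: 30s
tests:
  - interval: 30s
    input_series:
      - series: 'api_requests_total{service="checkout", endpoint="/pay", status_code="200"}'
        values: '0+90x20'
      - series: 'api_requests_total{service="checkout", endpoint="/pay", status_code="500"}'
        values: '0+0x20'
    promql_expr_test:
      - expr: api:error_ratio:5m
        eval_time: 10m
        exp_samples:
          - labels: 'api:error_ratio:5m{service="checkout", endpoint="/pay"}'
            value: 0
Run it with promtool test rules api_rules_test.yml.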
Step 4: Measure the impact
After implementing these optimizations, we can measure their impact:
- Rule evaluation times: Reduced by 65%
- Cardinality: Reduced from 25,000 time series to 8,000
- Query performance: Dashboard queries now execute 3-4x faster
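Numbers like these can be reproduced with a couple of ad-hoc queries; the exact figures will differ per environment, and the rule_group regex below assumes the group names used in this example (the label value also includes the path of the rules file, hence the leading wildcard):
# Number of time series produced by the api:* recording rules
count({__name__=~"api:.*"})
# Average evaluation time of the optimized groups over the last hour
avg_over_time(prometheus_rule_group_last_duration_seconds{rule_group=~".*api_metrics.*"}[1h])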
Common Optimization Mistakes to Avoid
- Over-aggregation: Removing too many labels makes the metrics less useful
- Too many rules: Creating rules for every possible query scenario
- Too frequent evaluation: Setting very short intervals for metrics that don't change quickly
- Ignoring cardinality: Not considering the explosion of time series from high-cardinality labels
- Complex expressions: Using highly complex expressions in recording rules instead of breaking them down
Summary
Optimizing recording rules is essential for maintaining a high-performance Prometheus monitoring system. The key principles to remember are:
- Match evaluation intervals to data access patterns and rates of change
- Reduce cardinality by carefully selecting which labels to include
- Chain recording rules to build complex metrics in stages
- Use appropriate time windows based on the volatility of metrics
- Organize rules by priority and logical grouping
- Measure performance to identify bottlenecks and verify improvements
By applying these optimization techniques, you can ensure your Prometheus deployment remains efficient and performant even as your infrastructure and monitoring needs grow.
Exercises
- Analyze your current recording rules and identify those with the highest evaluation times
- Take a complex dashboard query and break it down into a chain of recording rules
- Identify metrics with high cardinality and create optimized recording rules for them
- Set up meta-monitoring for your recording rules using the provided performance metrics
- Create a rule optimization plan with different evaluation intervals based on data needs