Cardinality Control
Introduction
When working with Prometheus, one of the most critical aspects to understand and manage is cardinality. Cardinality refers to the number of unique time series that Prometheus has to track. A time series in Prometheus is uniquely identified by its metric name and the combination of its label key-value pairs.
High cardinality is one of the leading causes of performance issues in Prometheus deployments. As the number of unique time series grows, Prometheus requires more memory and processing power, which can eventually lead to system instability or failure.
In this guide, we'll explore what cardinality is, why it matters, and how to effectively control it to maintain a healthy Prometheus installation.
Understanding Cardinality
What Is Cardinality?
Cardinality is a measure of the unique elements in a set. In Prometheus terms, it refers to the number of unique time series created by your metrics and their labels.
For example, consider a simple metric:
http_requests_total{path="/api", method="GET", status="200"} 24
This represents a single time series. If we add another metric with different label values:
http_requests_total{path="/home", method="GET", status="200"} 12
We now have two unique time series. As the number of unique combinations of labels grows, so does the cardinality.
Why Cardinality Matters
Prometheus stores each time series in memory and on disk. Each unique time series:
- Consumes memory
- Requires CPU time for processing
- Takes up disk space
- Increases query complexity and duration
A Prometheus server with uncontrolled cardinality can experience:
- Memory exhaustion
- Slow query performance
- Longer scrape and rule-evaluation times
- Eventual system crashes
The relationship between labels and cardinality is multiplicative: the total number of series for a metric is the product of the number of distinct values each of its labels can take.
Common Causes of High Cardinality
1. Using High-Variability Labels
Using labels with highly variable values is the most common cause of cardinality explosions:
# BAD: Using a user ID as a label
api_requests_total{user_id="12345"} 1
If your system has thousands or millions of users, this creates a new time series for each user!
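Before refactoring, it helps to measure how bad an existing label already is; a quick sketch, assuming the api_requests_total metric and user_id label from the example above:
# Number of distinct user_id values currently exposed on this metric
count(count by (user_id) (api_requests_total))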
2. Using Timestamps or Continuously Changing Values as Labels
# BAD: Using timestamp in a label
request_processed{timestamp="2023-10-26T15:34:12Z"} 1
This creates a new time series for every single data point, defeating the purpose of a time series database.
3. Cartesian Explosion with Multiple Labels
When you combine multiple labels, you multiply their possible values:
# 5 services × 10 endpoints × 4 methods × 5 status codes = 1,000 time series
http_requests_total{service="...", endpoint="...", method="...", status="..."}
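For an existing metric you can estimate this worst case directly: the product of the number of distinct values per label is the upper bound on its series count. A sketch, assuming the service, endpoint, method, and status labels from the example above:
# Worst-case series count: the product of distinct values per label
count(count by (service) (http_requests_total))
  * count(count by (endpoint) (http_requests_total))
  * count(count by (method) (http_requests_total))
  * count(count by (status) (http_requests_total))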
Best Practices for Controlling Cardinality
1. Use Labels Judiciously
Only add labels that you will actually query on:
# Good: Limited, useful labels
http_requests_total{service="payment-api", endpoint="/process", status_code="200"} 42
# Avoid: Unnecessary labels that won't be used for querying
http_requests_total{service="payment-api", endpoint="/process", status_code="200", request_id="abc123", user_agent="Mozilla..."} 42
2. Limit the Number of Possible Values
For each label, try to keep the set of possible values small and stable:
# Good: Status code grouped into categories
http_requests_total{status="success"} 42
http_requests_total{status="error"} 7
# Instead of:
http_requests_total{status="200"} 30
http_requests_total{status="201"} 12
http_requests_total{status="400"} 5
http_requests_total{status="404"} 1
http_requests_total{status="500"} 1
3. Monitor Cardinality with PromQL
You can use Prometheus's query language to monitor the cardinality of your metrics:
# Count unique time series for a metric
count(http_requests_total)
# Count unique combinations of specific labels
count(count by(job, instance, service, endpoint) (http_requests_total))
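The same idea can be pointed at scrape jobs instead of metrics; a rough sketch for spotting which jobs contribute the most series overall:
# Total number of series, broken down by scrape job
count by (job) ({__name__!=""})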
4. Implement Filtering at the Collection Level
Use Prometheus relabeling to filter out unnecessary labels before they enter the database:
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:8080']
    metric_relabel_configs:
      # Drop the high-cardinality label from every series scraped from this job
      # (labeldrop matches against label names)
      - regex: 'high_cardinality_label'
        action: labeldrop
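Relabeling can also discard entire series rather than a single label; a sketch, assuming a hypothetical debug_.* metric-name pattern you have decided not to store (this sits under the same scrape config as above):
metric_relabel_configs:
  # Drop every series whose metric name matches the pattern
  - source_labels: [__name__]
    regex: 'debug_.*'
    action: drop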
5. Use Histograms and Summaries Appropriately
Histograms and summaries can help reduce cardinality while still providing detailed information about distributions:
# Instead of tracking each exact duration:
request_duration_seconds{path="/api", duration="0.123"} 1
request_duration_seconds{path="/api", duration="0.456"} 1
# Use a histogram with predefined buckets:
request_duration_seconds_bucket{path="/api", le="0.1"} 12
request_duration_seconds_bucket{path="/api", le="0.5"} 45
request_duration_seconds_bucket{path="/api", le="1.0"} 67
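The same histogram can be declared in application code; a minimal sketch, assuming the Node.js prom-client library used in the complete example later in this guide and durations measured in seconds:
// Assumes the prom-client library, as in the other examples
const prometheus = require('prom-client');

// One series per bucket per label combination, no matter how many
// distinct durations are observed
const requestDuration = new prometheus.Histogram({
  name: 'request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['path'],
  buckets: [0.1, 0.5, 1.0] // predefined buckets, not raw values
});

// Record one observation (hypothetical value)
requestDuration.observe({ path: '/api' }, 0.123);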
Practical Examples
Example 1: Refactoring High-Cardinality Metrics
Let's look at a real-world example of refactoring metrics to reduce cardinality:
Before (High Cardinality):
# Creates a time series for every unique user ID
login_attempts_total{user_id="user-1234", result="success"} 1
login_attempts_total{user_id="user-5678", result="failure"} 1
After (Controlled Cardinality):
# Only tracks success/failure counts
login_attempts_total{result="success"} 1056
login_attempts_total{result="failure"} 43
# If user-specific tracking is needed, use a separate counter with aggregation
user_login_failures_total 43
# For specific problematic users, track them separately
repeated_login_failures_total{user_id="user-5678"} 12
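Dropping user_id does not cost you the usual operational questions; for example, the login failure ratio over the last five minutes can still be computed from the aggregated counters:
# Fraction of login attempts that failed over the last 5 minutes
sum(rate(login_attempts_total{result="failure"}[5m]))
  / sum(rate(login_attempts_total[5m]))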
Example 2: Monitoring Cardinality Growth
Set up alerts to monitor cardinality growth:
groups:
  - name: CardinalityAlerts
    rules:
      - alert: HighCardinalityMetric
        expr: |
          count by(__name__) ({__name__!=""}) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High cardinality metric detected"
          description: "The metric {{ $labels.__name__ }} has more than 10,000 time series."
Example 3: Using Recording Rules to Aggregate Data
Recording rules can pre-aggregate high-cardinality data:
groups:
  - name: aggregation-rules
    interval: 5m
    rules:
      - record: http_requests_by_service_total
        expr: sum by(service) (http_requests_total)
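Dashboards and alerts can then read the pre-aggregated series by its recorded name instead of scanning every underlying series, for example:
# Top services by request count, using the recorded aggregate
topk(5, http_requests_by_service_total)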
Implementation: A Complete Example
Here's a complete example showing how to refactor a high-cardinality metric system:
- Original high-cardinality setup:
// In your application code
const counter = new prometheus.Counter({
  name: 'api_requests_total',
  help: 'Total API requests',
  labelNames: ['endpoint', 'user_id', 'status_code', 'region', 'version']
});

// This creates potentially millions of time series!
counter.inc({
  endpoint: '/users',
  user_id: '12345',
  status_code: '200',
  region: 'us-east-1',
  version: '1.2.3'
});
- Refactored low-cardinality approach:
// In your application code
const requestCounter = new prometheus.Counter({
  name: 'api_requests_total',
  help: 'Total API requests',
  labelNames: ['endpoint', 'status_class', 'region'] // Reduced labels
});

const userErrorCounter = new prometheus.Counter({
  name: 'user_api_errors_total',
  help: 'API errors by user',
  labelNames: ['user_id'] // Separate metric for user-specific tracking
});

// Main counter with fewer labels
requestCounter.inc({
  endpoint: '/users',
  status_class: '2xx', // Group status codes
  region: 'us-east'    // Use broader regions
});

// Only track errors by user
if (statusCode >= 400) {
  userErrorCounter.inc({ user_id: '12345' });
}
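To keep the status grouping above consistent wherever the counter is incremented, a small helper can do the mapping; statusClass is a hypothetical name, not part of any client library:
// Map a numeric HTTP status code onto a small, bounded set of classes
function statusClass(statusCode) {
  if (statusCode >= 500) return '5xx';
  if (statusCode >= 400) return '4xx';
  if (statusCode >= 300) return '3xx';
  return '2xx';
}

// e.g. requestCounter.inc({ endpoint: '/users', status_class: statusClass(statusCode), region: 'us-east' });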
- Add monitoring for cardinality:
# Prometheus recording rule
groups:
  - name: cardinality
    interval: 5m
    rules:
      - record: metric_cardinality
        # Copy the metric name into its own label so the recorded series
        # remain distinguishable after the rule name replaces __name__
        expr: label_replace(count by(__name__) ({__name__!=""}), "metric_name", "$1", "__name__", "(.+)")
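With that rule in place, a query such as the following (a sketch, assuming the metric_cardinality rule above has been recording for at least a day) shows how much each metric's series count has changed over the past 24 hours:
# Change in per-metric series count over the last 24 hours
metric_cardinality - metric_cardinality offset 1d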
Summary
Controlling cardinality is essential for maintaining a performant and reliable Prometheus monitoring system. Remember these key points:
- Be selective with labels: Only use labels you'll query on
- Limit unique values: Group similar values into broader categories
- Monitor cardinality: Set up alerts for unexpected growth
- Use aggregation: Leverage recording rules to pre-aggregate data
- Implement filtering: Use relabeling to drop high-cardinality labels
By following these best practices, you can prevent cardinality explosions and ensure your Prometheus deployment remains scalable and efficient.
Additional Resources
- Prometheus Documentation on Cardinality
- Grafana Labs Blog: Cardinality is Key
- SoundCloud's Blog on Monitoring with Prometheus
Exercises
- Analyze your current metrics and identify any with potentially high cardinality. Create a plan to refactor them.
- Write a PromQL query that identifies your top 10 metrics by cardinality.
- Implement a recording rule that tracks the cardinality growth of your metrics over time.
- Create a Grafana dashboard that visualizes the cardinality of your Prometheus metrics.
- Practice refactoring this high-cardinality metric:
api_request_duration_seconds{path="/api/v1/users", method="GET", status="200", user_agent="Mozilla...", client_ip="192.168.1.1"}