Alert Time Windows
Introduction
When setting up alerting in Prometheus, one challenge you'll quickly encounter is striking a balance between detecting genuine issues and avoiding alert fatigue. Alert time windows are a powerful feature of Prometheus's alerting system that let you control when alerts fire based on how long a problem condition persists.
In this guide, we'll explore how time windows work in Prometheus alerting rules, why they're crucial for building robust monitoring systems, and how to configure them effectively for your specific needs.
Understanding Alert Time Windows
The Basics
Alert time windows in Prometheus are primarily controlled through two parameters in your alerting rules:
- `for` - Specifies how long a condition must be true before the alert fires
- `keep_firing_for` - Determines how long an alert continues firing after the condition is no longer true
These parameters help you manage the timing aspects of your alerts, making them more resistant to temporary spikes or brief issues that might resolve themselves.
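As a quick preview, here is a minimal sketch of how both parameters sit inside a rule; the metric name, threshold, and durations below are placeholders, not recommendations:

```yaml
groups:
  - name: example
    rules:
      - alert: ExampleAlert
        # hypothetical metric and threshold, for illustration only
        expr: some_metric > 100
        for: 5m             # condition must hold for 5 minutes before the alert fires
        keep_firing_for: 2m # alert keeps firing for 2 minutes after the condition clears
        labels:
          severity: warning
```

Both parameters are covered in detail below.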
The Problem with Instant Alerting
Let's first understand why we need time windows. Consider this simple alerting rule without any time window:
```yaml
groups:
  - name: example
    rules:
      - alert: HighCPUUsage
        expr: cpu_usage_percent > 90
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 90%"
```
The issue with this rule is that any momentary spike in CPU above 90% will trigger an alert, even if it lasts for just a few seconds. This can lead to:
- Alert noise and fatigue for your team
- Wasted time investigating transient issues
- Decreased trust in your alerting system
Using the `for` Parameter
The `for` parameter solves this problem by requiring the condition to be true continuously for a specified duration before firing the alert.
Basic Syntax
```yaml
groups:
  - name: example
    rules:
      - alert: HighCPUUsage
        expr: cpu_usage_percent > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage has been above 90% for 5 minutes"
```
In this example, the CPU must consistently exceed 90% for 5 minutes before an alert is triggered. This significantly reduces false alarms from brief spikes.
How It Works
Let's walk through how the `for` parameter works:
- When the condition first becomes true, Prometheus puts the alert in a "pending" state
- If the condition stays true for the entire `for` duration, the alert transitions to "firing"
- If the condition becomes false at any point during the `for` period, the timer resets
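You can observe these state transitions yourself: Prometheus exposes active alerts as the synthetic `ALERTS` time series, with an `alertstate` label that is either `pending` or `firing`. Using the alert name from the example above, queries like the following show which stage a rule is in:

```promql
# Alerts currently waiting out their `for` duration
ALERTS{alertname="HighCPUUsage", alertstate="pending"}

# Alerts that have satisfied the `for` duration and are firing
ALERTS{alertname="HighCPUUsage", alertstate="firing"}
```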
Practical Example: Memory Usage Alert
Here's a practical example for monitoring memory usage:
```yaml
groups:
  - name: memory_alerts
    rules:
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage has been above 85% for 10 minutes on {{ $labels.instance }}"
```
This alert will fire only if memory usage remains above 85% continuously for 10 minutes, which helps filter out temporary memory spikes that might occur during normal operations.
The `keep_firing_for` Parameter
While `for` helps prevent false positives from brief spikes, `keep_firing_for` (introduced in Prometheus 2.42.0) addresses the opposite problem: alerts that rapidly flip between firing and resolved states.
Basic Syntax
```yaml
groups:
  - name: example
    rules:
      - alert: HighCPUUsage
        expr: cpu_usage_percent > 90
        for: 5m
        keep_firing_for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage has been above 90% for 5 minutes"
```
With this configuration, even if CPU usage drops below 90%, the alert will continue firing for 2 more minutes before resolving.
How It Works
Once the underlying expression stops being true, the alert remains in the firing state for the `keep_firing_for` duration; if the expression becomes true again within that window, the alert simply keeps firing rather than resolving and re-firing. This makes the parameter particularly useful for:
- Preventing "flapping" alerts that rapidly toggle between firing and resolved
- Ensuring alerts remain visible long enough for humans to notice them
- Giving transient recoveries time to fail again without generating multiple alerts
Designing Effective Alert Time Windows
Factors to Consider
When setting time windows for your alerts, consider:
- Service Level Objectives (SLOs): How quickly do you need to respond to issues?
- Impact of the issue: High-impact problems might warrant quicker alerts
- False positive cost: What's the cost of investigating false alarms?
- Recovery patterns: How does your service typically recover from issues?
Common Time Window Patterns
| Alert Type | Recommended `for` | Recommended `keep_firing_for` | Rationale |
|---|---|---|---|
| Critical service outage | 1m-2m | 1m | Quick response needed, minimal delay acceptable |
| High error rates | 5m | 2m | Filter brief error spikes, allow for quick recovery |
| Resource usage warnings | 10m-15m | 5m | Give time for automated scaling, reduce noise |
| Slow queries/latency | 5m-10m | 3m | Accommodate traffic variations and brief slowdowns |
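As an illustration of the first row, a critical outage alert might look like the following sketch; the `up{job="my-service"}` selector and the group name are placeholders for your own setup:

```yaml
groups:
  - name: availability_alerts
    rules:
      - alert: ServiceOutage
        expr: up{job="my-service"} == 0
        for: 2m              # fast detection: outage must persist for 2 minutes
        keep_firing_for: 1m  # short grace period so a flapping target doesn't resolve and re-fire
        labels:
          severity: critical
        annotations:
          summary: "Service outage on {{ $labels.instance }}"
```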
Example: Multi-level CPU Alerts
```yaml
groups:
  - name: cpu_alerts
    rules:
      - alert: CPUUsageWarning
        expr: cpu_usage_percent > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage warning"
          description: "CPU usage has been above 80% for 15 minutes on {{ $labels.instance }}"
      - alert: CPUUsageCritical
        expr: cpu_usage_percent > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage"
          description: "CPU usage has been above 95% for 5 minutes on {{ $labels.instance }}"
```
This example creates a two-tier alerting system:
- A warning alert for sustained moderate CPU usage (80%+ for 15 minutes)
- A critical alert for extreme CPU usage (95%+ for 5 minutes)
Advanced Techniques
Combining with Percentile-based Alerts
For metrics with significant variability, combine time windows with percentile-based expressions:
```yaml
groups:
  - name: latency_alerts
    rules:
      - alert: HighLatency
        # 95th percentile computed over a 5-minute rate window
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High API latency"
          description: "95th percentile latency has been above 2 seconds for 10 minutes"
```
This alerts only when the 95th percentile of request latency exceeds 2 seconds for 10 minutes, making it robust against individual slow requests.
Rate-based Alerts with Time Windows
For error rate monitoring, combine `rate()` functions with appropriate time windows:
```yaml
groups:
  - name: error_rate_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        keep_firing_for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate has been above 5% for 5 minutes"
```
This alerts when the error rate exceeds 5% for 5 minutes, using a 5-minute rate calculation window and keeping the alert firing for 2 minutes after recovery.
Common Pitfalls and Best Practices
Pitfalls to Avoid
- Setting windows too short: Can lead to noisy alerts and fatigue
- Setting windows too long: Might delay critical notifications too much
- Ignoring service patterns: Not considering how your service normally behaves
- One-size-fits-all approach: Using the same time windows for all alerts
Best Practices
- Start conservative: Begin with longer windows and adjust based on experience
- Document your reasoning: Explain why you chose specific time windows
- Review regularly: Analyze alert patterns and adjust time windows accordingly
- Test different scenarios: Simulate spikes and outages to validate your settings
- Consider business hours: You might want different windows during work hours vs. off-hours
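For the last point, time-of-day behavior is typically handled in Alertmanager routing rather than in the rule's time window. As a rough sketch (the receiver names and hours are placeholders, and a reasonably recent Alertmanager that supports `time_intervals` is assumed), warning-severity notifications could be limited to business hours while critical alerts page at any time:

```yaml
# Alertmanager configuration excerpt (sketch, not a complete config)
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "09:00"
            end_time: "17:00"

route:
  receiver: pager            # critical alerts use the default receiver at all hours
  routes:
    - matchers:
        - severity = "warning"
      receiver: team-chat    # hypothetical lower-priority receiver
      active_time_intervals: ["business-hours"]  # warnings only notify during business hours
```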
Implementing Alert Time Windows in Prometheus Rules Files
Here's a complete example of a Prometheus rules file with various alerts and appropriate time windows:
```yaml
groups:
  - name: service_alerts
    rules:
      - alert: ServiceDown
        expr: up{job="my-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{job="my-service", status=~"5.."}[5m])) by (instance) / sum(rate(http_requests_total{job="my-service"}[5m])) by (instance) > 0.05
        for: 5m
        keep_firing_for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate has been above 5% for 5 minutes on {{ $labels.instance }}."

      - alert: SlowResponses
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="my-service"}[5m])) by (le, instance)) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow responses on {{ $labels.instance }}"
          description: "95th percentile of response time has been above 1 second for 10 minutes on {{ $labels.instance }}."

      - alert: HighCPUUsage
        expr: avg by(instance) (rate(process_cpu_seconds_total{job="my-service"}[3m])) * 100 > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been above 80% for 15 minutes on {{ $labels.instance }}."

      - alert: MemoryNearExhaustion
        # The on(instance) matcher assumes the service and the node exporter share the same
        # instance label; adjust the matching to fit your own label scheme.
        expr: process_resident_memory_bytes{job="my-service"} / on(instance) node_memory_MemTotal_bytes * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage high on {{ $labels.instance }}"
          description: "Memory usage has been above 80% for 10 minutes on {{ $labels.instance }}."
```
Summary
Alert time windows are a crucial feature for building a reliable and actionable alerting system with Prometheus. By carefully configuring the `for` and `keep_firing_for` parameters, you can:
- Reduce alert noise and fatigue
- Filter out transient issues
- Prevent alert flapping
- Ensure timely notification of genuine problems
Remember these key points:
- Use the `for` parameter to specify how long a condition must be true before alerting
- Use the `keep_firing_for` parameter to prevent alerts from resolving too quickly
- Adjust time windows based on the criticality and behavior of what you're monitoring
- Regularly review and refine your time windows based on experience
Additional Resources
- Prometheus Alerting Documentation
- AlertManager Documentation
- SRE Book: Practical Alerting from Google
Exercises
- Exercise 1: Review your existing alerting rules and identify candidates for adding or adjusting time windows.
- Exercise 2: Create a multi-level alerting rule (warning/critical) for a service metric with appropriate time windows.
- Exercise 3: Analyze your alert history to identify alerts that fire too frequently or not frequently enough, and adjust their time windows accordingly.
- Exercise 4: Design an error rate alerting rule that uses both `for` and `keep_firing_for` parameters.