RED Method (Rate, Errors, Duration)
Introduction
The RED Method is a powerful monitoring pattern that helps you understand the health and performance of your services. Inspired by Google's "Four Golden Signals," the RED Method was popularized by Tom Wilkie (now at Grafana Labs) and focuses on three key metrics:
- Rate: The number of requests per second
- Errors: The number of failed requests per second
- Duration: The distribution of request latencies
This pattern is particularly effective for monitoring microservices and service-oriented architectures where understanding service behavior from a client perspective is crucial.
Why Use the RED Method?
The RED Method provides a user-centric view of your services by measuring what matters most to your users:
- Simplicity: Focuses on just three metrics that give you a comprehensive view of service health
- Consistency: Can be applied uniformly across all services in your architecture
- Completeness: Covers all critical aspects of service behavior
- User-centric: Aligns with what users actually experience
The Three Pillars of RED
Let's explore each component of the RED Method in detail:
Rate (Requests per Second)
Rate measures how many requests your service receives per second. This metric helps you understand:
- Service load and traffic patterns
- Usage trends over time
- Capacity planning requirements
- Potential issues with service discovery or load balancing
Implementation in Prometheus/Grafana:
For a service instrumented with Prometheus, you might track the rate using a counter that increments with each request:
// Example using a Node.js service with Prometheus client
const prometheus = require('prom-client');

// Create a counter for tracking requests
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// In your request handler
app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequestsTotal.inc({
      method: req.method,
      route: req.route?.path || 'unknown',
      status_code: res.statusCode
    });
  });
  next();
});
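For Prometheus to collect these metrics, the service also needs to expose them over HTTP. A minimal sketch using prom-client's default registry (assuming prom-client v13 or later, where register.metrics() returns a promise):
// Expose a /metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', prometheus.register.contentType);
  res.end(await prometheus.register.metrics());
});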
In Grafana, you can visualize this with a PromQL query:
rate(http_requests_total[5m])
This shows the per-second rate of requests, averaged over the last 5 minutes.
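Because the counter carries method, route, and status_code labels, this query returns one series per label combination. To see total traffic per endpoint, aggregate it:
sum(rate(http_requests_total[5m])) by (route)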
Errors (Failed Requests per Second)
Errors track how many requests are failing. This metric helps you:
- Detect issues that affect users
- Identify problematic service dependencies
- Monitor SLAs and SLOs
- Trigger alerts when error rates exceed thresholds
Implementation in Prometheus/Grafana:
Using the same counter from before, you can filter for error responses:
// The counter is already tracking status codes in the previous example
// No additional instrumentation needed if you're capturing status codes
In Grafana, you can visualize the error rate with:
rate(http_requests_total{status_code=~"5.."}[5m])
Or calculate the error percentage:
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
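It is often worth tracking client errors (4xx) separately as well, since a spike there usually points at a misbehaving client or a breaking API change rather than a server fault:
rate(http_requests_total{status_code=~"4.."}[5m])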
Duration (Request Latency)
Duration measures how long your service takes to process requests. This metric helps you:
- Understand user experience (slow responses frustrate users)
- Identify performance bottlenecks
- Track performance impacts from deployments or changes
- Set realistic SLOs for response time
Implementation in Prometheus/Grafana:
For duration, you'll want to use a histogram to capture the distribution of latencies:
// Create a histogram for tracking request duration
const httpRequestDurationSeconds = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10] // buckets in seconds
});

// In your request handler
app.use((req, res, next) => {
  const end = httpRequestDurationSeconds.startTimer();
  res.on('finish', () => {
    end({
      method: req.method,
      route: req.route?.path || 'unknown',
      status_code: res.statusCode
    });
  });
  next();
});
In Grafana, you can visualize different aspects of the latency distribution:
# Median (50th percentile) request duration
histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# 95th percentile request duration
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# 99th percentile request duration
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
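Because a Prometheus histogram also exposes _sum and _count series, you can derive the average latency as a complement to the percentiles:
# Average request duration
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))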
Creating a RED Method Dashboard in Grafana
Let's put all of this together into a comprehensive Grafana dashboard that monitors a service using the RED Method:
Setting Up Your Dashboard
Here's a step-by-step guide to creating a RED Method dashboard in Grafana:
- Create a new dashboard in Grafana
- Add a row labeled "Service Overview"
- Add the following panels:
Rate Panel (Requests Per Second)
Add a time series panel with:
sum(rate(http_requests_total[5m])) by (service)
Configure:
- Title: "Request Rate"
- Description: "Number of requests per second"
- Unit: "requests/sec"
Error Rate Panel
Add a time series panel with:
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
Configure:
- Title: "Error Rate"
- Description: "Number of failed requests per second"
- Unit: "requests/sec"
- Threshold: Set color to red when values are > 0
Error Percentage Panel
Add a gauge panel with:
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
Configure:
- Title: "Error Percentage"
- Description: "Percentage of requests that are failing"
- Unit: "percent"
- Thresholds:
- 0-1%: Green
- 1-5%: Yellow
- >5%: Red
Duration Panels
Add a time series panel with multiple queries:
# 50th percentile (median)
histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# 95th percentile
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# 99th percentile
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Configure:
- Title: "Request Duration"
- Description: "Distribution of request latencies"
- Unit: "seconds"
- Legend: set a per-query legend such as "p50", "p95", and "p99" (the output of histogram_quantile does not carry a quantile label you could template on)
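To reuse this dashboard across services, consider a Grafana template variable. A sketch, assuming your metrics carry a service label: define a variable named service backed by a label_values query, then reference it as $service in each panel:
label_values(http_requests_total, service)
sum(rate(http_requests_total{service="$service"}[5m]))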
Setting Up Alerts
The RED Method is perfect for alerting. Here are some recommended alerting strategies:
Rate-Based Alerts
- Sudden drop in traffic: Alert when traffic drops more than 50% compared to the same time period in the previous day or week
- Unexpected traffic spike: Alert when traffic exceeds 2x the normal volume
# Example alert expression for traffic drop
rate(http_requests_total[5m]) < 0.5 * rate(http_requests_total[5m] offset 1d)
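A matching expression for the spike case could compare against the same time last week, assuming your traffic has weekly seasonality:
# Example alert expression for a traffic spike
rate(http_requests_total[5m]) > 2 * rate(http_requests_total[5m] offset 1w)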
Error-Based Alerts
- High error rate: Alert when error percentage exceeds 5% for more than 5 minutes
- Persistent errors: Alert when any errors occur continuously for more than 15 minutes
# Example alert expression for high error rate
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
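To enforce the "for more than 5 minutes" condition, the expression can be wrapped in a Prometheus alerting rule. A minimal sketch; the group name, severity label, and summary text are illustrative:
groups:
  - name: red-method-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for more than 5 minutes"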
Duration-Based Alerts
- Slow responses: Alert when the 95th percentile latency exceeds your SLO threshold
- Latency trend: Alert when latency is consistently increasing over a longer period
# Example alert expression for slow responses
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
Practical Example: Monitoring an API Gateway
Let's see how the RED Method would be applied to monitor an API gateway in a microservices architecture:
- Rate: Track requests per second across all API endpoints, potentially segmented by endpoint, consumer, or backend service
- Errors: Monitor 4xx and 5xx responses, distinguishing between client errors (4xx) and server errors (5xx)
- Duration: Measure request latency including gateway processing time and upstream service time
Sample Dashboard Configuration
One possible layout for the gateway dashboard, building on the queries above (the exact panel mix is illustrative):
- Row 1 (Overview): total request rate, overall error percentage, p95 latency
- Row 2 (Rate): requests per second by endpoint and by consumer
- Row 3 (Errors): 5xx rate by backend service, 4xx rate by endpoint
- Row 4 (Duration): p50/p95/p99 latency, split by upstream service
Using the RED Method with Other Data Sources
While Prometheus is commonly used, you can implement the RED Method with other data sources:
Using Graphite
# Rate example in Graphite
summarize(sumSeries(stats.counters.myapp.requests.*.count), "1min", "sum", false)
# Error rate example in Graphite
summarize(sumSeries(stats.counters.myapp.requests.*.error.count), "1min", "sum", false)
Using InfluxDB
# Rate example in InfluxQL
SELECT count("value") FROM "requests" WHERE time > now() - 5m GROUP BY time(1m)
# Duration example in InfluxQL
SELECT mean("duration"), percentile("duration", 95) FROM "requests" WHERE time > now() - 5m GROUP BY time(1m)
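An error-rate query can follow the same shape, assuming failed requests are written with a status tag:
# Error rate example in InfluxQL (assumes a "status" tag on the measurement)
SELECT count("value") FROM "requests" WHERE "status" =~ /5../ AND time > now() - 5m GROUP BY time(1m)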
RED Method vs. Other Monitoring Patterns
Let's compare the RED Method with other common monitoring approaches:
| Pattern | Focus | Best For | Limitations |
|---|---|---|---|
| RED Method | Request-focused, client perspective | Microservices, API monitoring | Less insight into resource usage |
| USE Method | Resource utilization | Infrastructure, system monitoring | Less insight into user experience |
| Four Golden Signals | Comprehensive (includes saturation) | Balanced monitoring approach | Slightly more complex to implement |
The RED Method is most similar to Google's "Four Golden Signals" (Latency, Traffic, Errors, Saturation), but omits saturation for simplicity.
Best Practices and Common Pitfalls
Best Practices
- Consistent implementation: Apply the RED Method uniformly across all services
- Meaningful segmentation: Break down metrics by relevant dimensions (endpoint, customer tier, region)
- Histogram buckets: Choose histogram buckets that align with your SLOs
- Comprehensive error tracking: Track all types of failures, not just HTTP 5xx responses (see the sketch after this list)
- Contextual dashboards: Include service dependencies and business context in your dashboards
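For example, failures that never produce an HTTP response, such as timeouts against a dependency, can be counted with a dedicated counter. A minimal sketch; the metric name, labels, and callPaymentService helper are hypothetical:
// Hypothetical counter for failures that never surface as an HTTP status code
const backendFailuresTotal = new prometheus.Counter({
  name: 'backend_failures_total',
  help: 'Failures while calling downstream dependencies',
  labelNames: ['dependency', 'error_type']
});

// Around a dependency call
try {
  await callPaymentService(order); // hypothetical downstream call
} catch (err) {
  backendFailuresTotal.inc({ dependency: 'payment', error_type: err.code || 'unknown' });
  throw err;
}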
Common Pitfalls
- Overaggregation: Aggregating metrics too broadly can hide important problems
- Ignoring percentiles: Focusing only on averages masks the tail latency affecting some users
- Alert fatigue: Setting thresholds too tight leads to noisy alerts
- Missing client errors: Focusing only on server errors (5xx) while ignoring client errors (4xx)
- Inconsistent implementation: Implementing differently across services makes comparison difficult
Summary
The RED Method provides a simple, effective approach to monitoring services from a user-centric perspective:
- Rate: Shows how much your service is being used
- Errors: Shows how often your service fails
- Duration: Shows how long your service takes to respond
By consistently implementing these three key metrics across all your services, you create a uniform observability framework that helps you quickly identify and diagnose issues that affect your users.
Exercises
- Implement the RED Method for a sample service using Prometheus and Grafana
- Create a Grafana dashboard with dynamic variables to switch between different services
- Set up alerts for each of the RED metrics with appropriate thresholds
- Extend your RED dashboard to include breakdown by endpoint and user type
- Compare how the same service looks when monitored with both the RED Method and the USE Method