Four Golden Signals

Introduction

The Four Golden Signals are a monitoring methodology introduced by Google's Site Reliability Engineering (SRE) team. They represent the most critical metrics you should monitor to maintain the health and performance of your services or applications. This pattern provides a focused approach to observability by concentrating on four key metrics that, when monitored together, offer a comprehensive view of your system's health.

As you build your monitoring dashboards in Grafana, understanding and implementing the Four Golden Signals will help you identify issues quickly and ensure your services meet user expectations.

What Are the Four Golden Signals?

The Four Golden Signals consist of:

Latency - How long it takes to service a request
Traffic - The demand on your system
Errors - The rate of failed requests
Saturation - How "full" your system is

Let's explore each of these signals in detail and learn how to implement them in Grafana.

1. Latency

What is Latency?

Latency measures how long it takes to process a request. In other words, it's the time between when a user makes a request and when they receive a response.

Why Monitor Latency?

Directly impacts user experience
Helps identify performance bottlenecks
Provides early warning of system degradation

How to Measure Latency in Grafana

Latency should be measured at different percentiles, not just averages. A common approach is to track p50 (median), p90, p95, and p99:

# Prometheus query example for latency in PromQL
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))

Visualizing Latency

In Grafana, you can visualize latency using:

Time Series Panel - To show how latency changes over time
Stat Panel - To display current percentiles
Heatmap - To visualize latency distribution

Here's a simple PromQL query to create a latency panel in Grafana:

# For a time series panel showing latency percentiles
histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket{service="user-api"}[5m])))

2. Traffic

What is Traffic?

Traffic represents the demand placed on your system. It measures how much your service is being used, typically expressed as a rate of requests per second.

Why Monitor Traffic?

Helps understand usage patterns
Correlates with other signals during incident analysis
Enables capacity planning

How to Measure Traffic in Grafana

Traffic is usually measured as a rate of requests over time:

# Prometheus query example for traffic measurement
sum(rate(http_requests_total{job="api"}[5m]))

Visualizing Traffic

In Grafana, traffic can be visualized using:

Time Series Panel - To show traffic patterns over time
Bar Gauge - To display current traffic levels
Stat Panel - To show the total or average request rate

Example query:

# Traffic by endpoint
sum by(endpoint) (rate(http_requests_total{job="api"}[5m]))

3. Errors

What are Errors?

Errors represent the rate of failed requests. These can be explicit failures (like HTTP 500 errors) or implicit failures (like wrong content being served).

Why Monitor Errors?

Directly impacts user experience
Indicates system malfunctions
Helps identify problematic components

How to Measure Errors in Grafana

Errors are typically measured as a rate or as a percentage of total requests:

# Error rate
sum(rate(http_requests_total{job="api", status=~"5.."}[5m]))

# Error percentage
sum(rate(http_requests_total{job="api", status=~"5.."}[5m])) / sum(rate(http_requests_total{job="api"}[5m])) * 100

Visualizing Errors

In Grafana, errors can be visualized using:

Time Series Panel - To show error rates over time
Stat Panel - To display current error percentage
Gauge - To visualize error rates against thresholds

4. Saturation

What is Saturation?

Saturation measures how "full" your service is. It shows how much of the available resources (CPU, memory, disk, network) are being utilized relative to their capacity.

Why Monitor Saturation?

Predicts impending resource exhaustion
Helps with capacity planning
Identifies potential bottlenecks before they impact users

How to Measure Saturation in Grafana

Saturation measurements vary by resource type:

# CPU saturation
avg by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))

# Memory saturation
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes

# Disk saturation
node_filesystem_avail_bytes / node_filesystem_size_bytes

# Connection pool saturation
max_connections - available_connections / max_connections

Visualizing Saturation

In Grafana, saturation can be visualized using:

Gauge Panels - To show utilization percentages
Time Series - To show resource utilization over time
Bar Gauges - To compare utilization across instances

Implementing the Four Golden Signals Dashboard

Now let's bring all these concepts together by creating a comprehensive Four Golden Signals dashboard in Grafana.

Dashboard Structure

A well-organized dashboard might include:

A row for each Golden Signal
Multiple visualizations per signal
Consistent color schemes
Thresholds for alerting

Example Dashboard JSON

Here's a simplified example of how you might structure your dashboard:

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 1,
  "links": [],
  "panels": [
    {
      "collapsed": false,
      "datasource": null,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 10,
      "panels": [],
      "title": "Latency",
      "type": "row"
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {},
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 1
      },
      "hiddenSeries": false,
      "id": 2,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.5.7",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket{job=\"api\"}[5m])))",
          "interval": "",
          "legendFormat": "p95",
          "refId": "A"
        },
        {
          "expr": "histogram_quantile(0.50, sum by(le) (rate(http_request_duration_seconds_bucket{job=\"api\"}[5m])))",
          "interval": "",
          "legendFormat": "p50",
          "refId": "B"
        }
      ],
      "thresholds": [],
      "timeRegions": [],
      "title": "Request Latency",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "s",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }
    // Additional panels would be defined here for the other signals
  ],
  "refresh": "5s",
  "schemaVersion": 27,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Four Golden Signals",
  "uid": "abcdefghij",
  "version": 1
}

Real-World Applications

Case Study: E-commerce Website

For an e-commerce website, the Four Golden Signals might be implemented as:

Latency
- Page load time
- Checkout process duration
- API response times
Traffic
- Number of page views
- Number of active sessions
- Rate of transactions
Errors
- Failed checkouts
- API errors
- 5xx HTTP responses
Saturation
- Database connection pool usage
- CPU usage of web servers
- Queue depths for background jobs

Case Study: Microservices Architecture

In a microservices environment:

Each service would have its own Four Golden Signals dashboard, and an aggregated view might also be created for the system as a whole.

Setting Alerts Based on the Four Golden Signals

Effective monitoring isn't complete without alerting. Here's how you might set up alerts for each signal:

Latency Alerts
- Alert when p95 latency exceeds a threshold (e.g., 500ms)
- Alert on sudden increases in latency (e.g., 50% jump)
Traffic Alerts
- Alert on sudden drops in traffic (might indicate an upstream issue)
- Alert on unexpected spikes (might indicate a DDoS attack)
Error Alerts
- Alert when error rate exceeds a threshold (e.g., 1%)
- Alert on specific critical errors
Saturation Alerts
- Alert when resource usage exceeds thresholds (e.g., 80% CPU, 90% memory)
- Alert on resource exhaustion predictions

In Grafana, you can set up these alerts using Alerting rules:

# Example alert rule for high latency
- name: HighLatency
  rules:
  - alert: APIHighLatency
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High API Latency"
      description: "95th percentile latency for API requests is above 500ms for 5 minutes."

Best Practices

Keep It Simple
- Start with basic implementations of the Four Golden Signals
- Add complexity as your understanding and needs evolve
Use Consistent Units
- Use seconds for latency (not milliseconds in some places and seconds in others)
- Be consistent with how you measure traffic across services
Set Appropriate Thresholds
- Base thresholds on historical data and business requirements
- Adjust thresholds as your system evolves
Correlate Signals
- Look for relationships between signals (e.g., increased latency during high traffic)
- Use these correlations for root cause analysis
Iterate and Improve
- Regularly review your dashboards and alert thresholds
- Add service-specific metrics to complement the Four Golden Signals

Summary

The Four Golden Signals provide a powerful framework for monitoring your systems effectively:

Latency tells you how fast your system is responding
Traffic shows the demand on your system
Errors reveal where your system is failing
Saturation indicates how close you are to resource exhaustion

By implementing these signals in Grafana, you gain a comprehensive view of your system's health and performance. This approach helps you detect issues early, troubleshoot effectively, and ensure your services meet user expectations.

Remember that the Four Golden Signals are a starting point, not the end goal. As you become more familiar with monitoring, you can expand your dashboards to include service-specific metrics that complement these core signals.

Additional Resources

Google's SRE Book chapter on monitoring: Monitoring Distributed Systems
Prometheus documentation for metrics collection: Prometheus Docs
Grafana dashboarding tutorials: Grafana Tutorials

Exercises

Create a basic Four Golden Signals dashboard for a web application you're familiar with.
Identify which metrics in your current monitoring setup map to each of the Four Golden Signals.
Set up alerts for each of the Four Golden Signals based on historical performance data.
Add service-specific metrics to your dashboard that complement the Four Golden Signals.
Simulate an incident (e.g., high load) and observe how the Four Golden Signals react.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

What Are the Four Golden Signals?​

1. Latency​

What is Latency?​

Why Monitor Latency?​

How to Measure Latency in Grafana​

Visualizing Latency​

2. Traffic​

What is Traffic?​

Why Monitor Traffic?​

How to Measure Traffic in Grafana​

Visualizing Traffic​

3. Errors​

What are Errors?​

Why Monitor Errors?​

How to Measure Errors in Grafana​

Visualizing Errors​

4. Saturation​

What is Saturation?​

Why Monitor Saturation?​

How to Measure Saturation in Grafana​

Visualizing Saturation​

Implementing the Four Golden Signals Dashboard​

Dashboard Structure​

Example Dashboard JSON​

Real-World Applications​

Case Study: E-commerce Website​

Case Study: Microservices Architecture​

Setting Alerts Based on the Four Golden Signals​

Best Practices​

Summary​

Additional Resources​

Exercises​