SLA/SLO Monitoring with Prometheus
Introduction
In today's reliability-focused software landscape, measuring and maintaining service quality is crucial. Two key concepts guide this process: Service Level Agreements (SLAs) and Service Level Objectives (SLOs).
An SLA is a formal agreement between a service provider and customer that defines the expected level of service. It typically includes financial penalties if commitments aren't met.
An SLO is a specific, measurable target for service performance that a team aims to achieve, usually defined in terms of availability or performance metrics.
Prometheus, with its powerful querying capabilities and time-series database, provides an excellent foundation for implementing SLA/SLO monitoring. In this guide, we'll explore how to establish, measure, and alert on SLOs using Prometheus.
Understanding Service Level Indicators (SLIs)
Before diving into SLOs, we need to understand Service Level Indicators (SLIs). An SLI is a quantitative measure of some aspect of the provided service.
Common SLIs include:
- Availability: Percentage of successful requests
- Latency: Response time for requests
- Error rate: Percentage of failed requests
- Throughput: Number of requests handled per second
- Saturation: How "full" your service is (e.g., CPU usage)
These concepts build on each other: SLIs are the raw measurements, SLOs set targets on those measurements, and SLAs wrap one or more SLOs in a contractual commitment.
Implementing SLO Monitoring with Prometheus
Step 1: Define Your SLIs
First, identify which metrics will serve as your SLIs. For a web service, these might include:
# Request success rate (as a ratio)
sum(rate(http_requests_total{code=~"2.."}[5m])) / sum(rate(http_requests_total[5m]))
# Latency (90th percentile in seconds)
histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
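The error-rate and throughput SLIs from the list above can be written in the same style; a minimal sketch, assuming the same http_requests_total counter with a code label:
# Error rate (as a ratio of all requests)
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Throughput (requests per second)
sum(rate(http_requests_total[5m]))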
Step 2: Set SLO Targets
Next, define your SLO targets. For example:
- 99.9% availability over 30 days (allowing for ~43 minutes of downtime)
- 95% of requests complete in under 200ms
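The latency target can also be checked directly as the fraction of requests served within the threshold, rather than via a quantile. A sketch, assuming the histogram has a bucket boundary at 0.2s (the instrumentation example later in this guide defines one):
# Fraction of requests completing in under 200ms (target: at least 0.95)
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))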
Step 3: Implement Error Budgets
Error budgets are a powerful concept that goes hand-in-hand with SLOs. An error budget represents the amount of unreliability you can tolerate within your SLO.
For example, if your availability SLO is 99.9% over 30 days, your error budget is 0.1% of requests over that period; if the service handles 10 million requests in that window, that's 10,000 failed requests. Once you've used up this budget, you might want to freeze new deployments until reliability improves.
PromQL has no variables, so the whole calculation is expressed as a single query. Let's calculate the remaining error budget, in absolute requests, for a 99.9% SLO over 30 days:
# Remaining error budget = allowed errors - actual errors
# allowed errors = (1 - SLO target) * total requests over the window
(1 - 0.999) * sum(increase(http_requests_total[30d]))
-
sum(increase(http_requests_total{code=~"5.."}[30d]))
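In practice it's convenient to precompute the budget with recording rules so dashboards and alerts can reuse the result. A minimal sketch; the rule names below are illustrative, not standard, and evaluating 30-day ranges on every rule evaluation can be expensive on large setups:
groups:
  - name: error_budget_rules
    rules:
      # Observed error ratio over the 30-day SLO window
      - record: job:http_error_ratio:30d
        expr: |
          sum by (job) (increase(http_requests_total{code=~"5.."}[30d]))
          /
          sum by (job) (increase(http_requests_total[30d]))
      # Fraction of the error budget still remaining (1 = untouched, 0 = exhausted)
      - record: job:error_budget_remaining:ratio_30d
        expr: 1 - job:http_error_ratio:30d / (1 - 0.999)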
Step 4: Create Burn Rate Alerts
Rather than alerting only after you've violated an SLO, which is often too late, implement "burn rate" alerts that notify you when you're consuming your error budget too quickly. A burn rate of N means errors are arriving N times faster than the rate that would exactly exhaust the budget at the end of the SLO window; at a burn rate of 24, for example, a 30-day budget is gone in 30 / 24 = 1.25 days.
Here's how to create a burn rate alert in Prometheus:
groups:
- name: SLO_alerts
rules:
- alert: ErrorBudgetBurnRate
expr: |
sum(rate(http_requests_total{code=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
>
(1 - 0.999) * 24 # Burning 24x faster than allowed
for: 5m
labels:
severity: warning
annotations:
summary: "Error budget burning too fast"
description: "Service is consuming error budget at an accelerated rate"
Practical Example: Monitoring API Availability
Let's implement a complete example for monitoring an API service:
- First, we'll instrument our API to expose Prometheus metrics:
const express = require('express');
const client = require('prom-client');
const app = express();
// Create a Registry to register metrics to
const register = new client.Registry();
// Define a counter for HTTP requests
const httpRequestsTotal = new client.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'code'],
registers: [register]
});
// Define a histogram for request duration
const httpRequestDurationSeconds = new client.Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'route', 'code'],
buckets: [0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5],
registers: [register]
});
// Middleware to measure request duration
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
httpRequestsTotal.inc({
method: req.method,
route: req.path,
code: res.statusCode
});
httpRequestDurationSeconds.observe(
{
method: req.method,
route: req.path,
code: res.statusCode
},
duration
);
});
next();
});
// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
// Example API endpoint
app.get('/api/data', (req, res) => {
res.json({ message: 'Success' });
});
app.listen(3000, () => {
console.log('Server listening on port 3000');
});
- Configure Prometheus to scrape these metrics:
scrape_configs:
- job_name: 'api'
scrape_interval: 5s
static_configs:
- targets: ['api-server:3000']
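Once Prometheus is scraping the target, it's worth confirming data is arriving before building queries on top of it; the built-in up metric is the quickest check (the job label matches the scrape config above):
# 1 if the last scrape of the target succeeded, 0 otherwise
up{job="api"}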
- Define PromQL queries for our SLIs:
# Availability SLI (success rate as a percentage)
(
sum(rate(http_requests_total{job="api",code=~"2.."}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))
) * 100
# Latency SLI (95th percentile)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))
- Create recording rules to calculate SLO compliance:
groups:
- name: slo_rules
rules:
- record: job:slo_availability:ratio_rate5m
expr: |
sum(rate(http_requests_total{job="api",code=~"2.."}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))
- record: job:slo_latency:95percentile_rate5m
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))
- Set up alert rules for SLO violations:
groups:
- name: slo_alerts
rules:
- alert: AvailabilitySLOViolation
expr: job:slo_availability:ratio_rate5m < 0.999
for: 15m
labels:
severity: critical
annotations:
summary: "Availability SLO violation"
description: "Service availability is below 99.9% over the last 15 minutes"
- alert: LatencySLOViolation
expr: job:slo_latency:95percentile_rate5m > 0.2
for: 15m
labels:
severity: critical
annotations:
summary: "Latency SLO violation"
description: "95th percentile latency is above 200ms over the last 15 minutes"
Multi-Window, Multi-Burn-Rate Alerts
For more sophisticated SLO monitoring, you can implement multi-window, multi-burn-rate alerts. The idea is to pair fast-burning, short-window alerts that page quickly on severe breakage with slow-burning, long-window alerts that catch gradual degradation, so both sudden outages and slow leaks of error budget are detected:
groups:
- name: slo_burn_rate_alerts
rules:
- alert: ErrorBudgetBurningFast
expr: |
sum(rate(http_requests_total{job="api",code=~"5.."}[1h]))
/
sum(rate(http_requests_total{job="api"}[1h]))
> 14.4 * (1 - 0.999) # 14.4x burn rate = depleting 30-day budget in ~2 days
for: 1h
labels:
severity: critical
annotations:
summary: "Error budget burning too fast"
description: "Service is burning 2-week error budget in ~2 days"
- alert: ErrorBudgetBurningSlow
expr: |
sum(rate(http_requests_total{job="api",code=~"5.."}[6h]))
/
sum(rate(http_requests_total{job="api"}[6h]))
> 6 * (1 - 0.999) # 6x burn rate = depleting 30-day budget in ~5 days
for: 6h
labels:
severity: warning
annotations:
summary: "Error budget burning steadily"
description: "Service is burning 30-day error budget in ~5 days"
Visualizing SLOs in Grafana
To complete your SLO monitoring setup, create a dedicated SLO dashboard in Grafana:
// Simplified SLO dashboard panel JSON for Grafana (abridged; exact field layout varies between Grafana versions)
{
"panels": [
{
"title": "API Availability SLO",
"type": "gauge",
"datasource": "Prometheus",
"targets": [
{
"expr": "job:slo_availability:ratio_rate5m * 100",
"legendFormat": "Current"
}
],
"options": {
"thresholds": [
{ "value": 99.5, "color": "red" },
{ "value": 99.9, "color": "green" }
],
"min": 99,
"max": 100
}
},
{
"title": "API Latency SLO",
"type": "gauge",
"datasource": "Prometheus",
"targets": [
{
"expr": "job:slo_latency:95percentile_rate5m * 1000",
"legendFormat": "95th percentile (ms)"
}
],
"options": {
"thresholds": [
{ "value": 0, "color": "green" },
{ "value": 200, "color": "red" }
],
"min": 0,
"max": 500
}
},
{
"title": "Error Budget Remaining",
"type": "stat",
"datasource": "Prometheus",
"targets": [
{
"expr": "100 - (sum(increase(http_requests_total{job=\"api\",code=~\"5..\"}[30d])) / sum(increase(http_requests_total{job=\"api\"}[30d])) * 100) / (100 - 99.9) * 100",
"legendFormat": "% Remaining"
}
],
"options": {
"colorMode": "value",
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": 0, "color": "red" },
{ "value": 20, "color": "orange" },
{ "value": 50, "color": "green" }
]
}
}
}
]
}
Best Practices for SLO Implementation
- Start simple: Begin with 1-3 key SLOs that directly impact user experience
- Use percentiles, not averages: Averages hide outliers that affect user experience
- Set realistic targets: Base SLOs on historical performance rather than aspirational perfection
- Implement error budgets: Convert SLOs into error budgets to make tradeoffs clear
- Use multi-burn-rate alerts: Alert on varying consumption rates to catch both fast and slow degradations
- Review and adjust: Regularly review your SLOs and adjust as needed
Summary
Implementing SLA/SLO monitoring with Prometheus provides a powerful way to:
- Quantify service reliability in terms users care about
- Set clear, measurable targets for service performance
- Create a balance between reliability and innovation through error budgets
- Detect and respond to reliability issues before they impact users
By following the steps outlined in this guide, you can establish a comprehensive SLO monitoring system that helps maintain reliable services while providing the necessary data to make informed decisions about reliability investments.
Additional Resources
- Site Reliability Engineering book
- Prometheus documentation
- Grafana SLO panel plugin
- Alerting on SLOs with Prometheus
Exercises
- Define SLIs and SLOs for a service you're familiar with.
- Implement basic availability and latency monitoring using Prometheus.
- Create an error budget based on your SLOs and set up burn rate alerts.
- Build a Grafana dashboard that visualizes your SLOs and error budget consumption.
- Design a multi-window, multi-burn-rate alerting strategy for a critical service.