Kong Service Monitoring
Introduction
Monitoring your Kong API Gateway is crucial for maintaining service reliability, troubleshooting issues, and optimizing performance. Kong Service Monitoring provides insights into the health and performance of your API services, allowing you to proactively identify and resolve potential problems before they impact your users.
In this guide, we'll explore how to set up comprehensive monitoring for Kong services, interpret metrics, and use this data to maintain a robust API infrastructure.
Why Monitor Kong Services?
Before diving into implementation, let's understand why monitoring is essential:
- Service Reliability - Detect and address failures quickly
- Performance Optimization - Identify bottlenecks and optimization opportunities
- Capacity Planning - Make informed scaling decisions based on usage patterns
- Security Insights - Spot unusual traffic patterns that might indicate security issues
- Compliance & SLAs - Ensure your services meet required availability standards
Monitoring Architecture
Kong's monitoring architecture consists of several components working together: Kong itself exposes health and metric data through its Status API and the Prometheus plugin, a Prometheus server scrapes and stores those metrics, Grafana visualizes them on dashboards, and Prometheus alerting rules turn them into notifications. The rest of this guide walks through each of these layers.
Setting Up Basic Monitoring
Let's start with the basics. Kong offers built-in monitoring capabilities through its status API.
Step 1: Enable the Status API
First, ensure the Status API is enabled in your Kong configuration:
# In kong.conf
status_listen = 0.0.0.0:8100
status_access_log = logs/status_access.log
status_error_log = logs/status_error.log
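After changing kong.conf, reload or restart Kong so the new listener takes effect. A minimal example, assuming the kong CLI is on your PATH:
# Regenerate the nginx configuration and reload gracefully
kong reload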
Step 2: Access Status Endpoints
Once enabled, you can access various status endpoints:
# Get basic node information
curl http://localhost:8100/status
# Response example
{
  "server": {
    "total_requests": 3242,
    "connections_active": 32,
    "connections_accepted": 3298,
    "connections_handled": 3298,
    "connections_reading": 0,
    "connections_writing": 12,
    "connections_waiting": 20
  },
  "database": {
    "reachable": true
  },
  "memory": {
    "lua_shared_dicts": {
      "kong": {
        "allocated_slabs": "3.46 MiB",
        "capacity": "5.00 MiB"
      },
      "kong_db_cache": {
        "allocated_slabs": "11.87 MiB",
        "capacity": "128.00 MiB"
      }
    }
  }
}
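For a quick liveness probe before you have a full metrics stack, you can script against this endpoint. Below is a minimal sketch, assuming jq is installed and the Status API listens on localhost:8100 as configured above:
# Exit non-zero if Kong reports its database as unreachable
reachable=$(curl -s http://localhost:8100/status | jq -r '.database.reachable')
if [ "$reachable" != "true" ]; then
  echo "Kong cannot reach its database" >&2
  exit 1
fi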
Advanced Monitoring with Prometheus and Grafana
For production environments, using Prometheus and Grafana provides more robust monitoring.
Step 1: Enable the Prometheus Plugin
The Prometheus plugin ships bundled with Kong, so there is nothing separate to install. Just make sure bundled plugins are enabled in kong.conf (this is the default):
# In kong.conf
plugins = bundled
Step 2: Configure the Plugin
Add the following to your Kong configuration:
# In your Kong declarative configuration file
_format_version: "2.1"

plugins:
  - name: prometheus
    config:
      status_code_metrics: true
      latency_metrics: true
      upstream_health_metrics: true
      bandwidth_metrics: true
Or if you're using the Admin API:
curl -X POST http://localhost:8001/plugins/ \
  --data "name=prometheus" \
  --data "config.status_code_metrics=true" \
  --data "config.latency_metrics=true" \
  --data "config.upstream_health_metrics=true" \
  --data "config.bandwidth_metrics=true"
Step 3: Access Prometheus Metrics
Once configured, Kong exposes metrics at the /metrics endpoint:
curl http://localhost:8001/metrics
This returns Prometheus-formatted metrics:
# HELP kong_bandwidth Total bandwidth in bytes consumed per service in Kong
# TYPE kong_bandwidth counter
kong_bandwidth{type="egress",service="users-service"} 1277
kong_bandwidth{type="ingress",service="users-service"} 726
# HELP kong_http_status HTTP status codes per service in Kong
# TYPE kong_http_status counter
kong_http_status{service="users-service",code="200"} 15
kong_http_status{service="auth-service",code="401"} 3
Step 4: Set Up Prometheus
Create a prometheus.yml
file to scrape Kong metrics:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kong'
    scrape_interval: 5s
    static_configs:
      - targets: ['kong:8001']
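The kong:8001 target assumes Prometheus can resolve your Kong node by the hostname kong (for example, when both run on the same Docker network); adjust it to your environment. One way to validate and run this configuration locally, assuming Docker and the official prom/prometheus image:
# Check the scrape configuration for syntax errors
promtool check config prometheus.yml

# Start Prometheus with the configuration mounted in
docker run -d --name prometheus \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus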
Step 5: Set Up Grafana Dashboard
Create a Grafana dashboard to visualize Kong metrics. Here's a simple example of JSON configuration for a dashboard panel:
{
  "datasource": "Prometheus",
  "fieldConfig": {
    "defaults": {
      "color": {"mode": "palette-classic"},
      "custom": {
        "axisLabel": "",
        "axisPlacement": "auto",
        "barAlignment": 0,
        "drawStyle": "line",
        "fillOpacity": 10,
        "gradientMode": "none",
        "hideFrom": {"legend": false, "tooltip": false, "viz": false},
        "lineInterpolation": "smooth",
        "lineWidth": 2,
        "pointSize": 5,
        "scaleDistribution": {"type": "linear"},
        "showPoints": "never",
        "spanNulls": true,
        "stacking": {"group": "A", "mode": "none"},
        "thresholdsStyle": {"mode": "off"}
      },
      "mappings": [],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"color": "green", "value": null},
          {"color": "red", "value": 80}
        ]
      },
      "unit": "short"
    },
    "overrides": []
  },
  "targets": [
    {
      "expr": "sum(increase(kong_http_status{code=~\"5..\"}[5m])) by (service)",
      "legendFormat": "{{service}}",
      "refId": "A"
    }
  ],
  "title": "5xx Errors by Service"
}
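This JSON describes a single panel rather than a full dashboard, so in practice you would paste it into a panel's JSON editor or embed it in a complete dashboard definition; prebuilt Kong dashboards are also available on grafana.com if you prefer a ready-made starting point.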
Key Metrics to Monitor
Focus on these critical metrics for effective Kong service monitoring:
1. Request Rate
Track the number of requests per service:
sum(rate(kong_http_status[1m])) by (service)
2. Error Rate
Monitor the percentage of requests resulting in errors:
sum(rate(kong_http_status{code=~"5.."}[1m])) by (service) / sum(rate(kong_http_status[1m])) by (service) * 100
3. Latency
Track request processing time:
histogram_quantile(0.95, sum(rate(kong_latency_bucket{type="request"}[5m])) by (service, le))
4. Upstream Health
Monitor the health of your backend services:
kong_upstream_target_health{state="healthy"} == 1
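You can test any of these expressions outside Grafana by querying Prometheus's HTTP API directly; a sketch, assuming Prometheus is reachable on localhost:9090:
# Run the request-rate query against the Prometheus API
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(kong_http_status[1m])) by (service)'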
Setting Up Alerts
Proactive monitoring requires alerts. Here's how to set up basic alerts in Prometheus:
# In prometheus/alerts.yml
groups:
  - name: kong_alerts
    rules:
      - alert: KongHighErrorRate
        expr: sum(rate(kong_http_status{code=~"5.."}[5m])) by (service) / sum(rate(kong_http_status[5m])) by (service) * 100 > 5
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Service {{ $labels.service }} has a 5xx error rate above 5% (current value: {{ $value }}%)"
      - alert: KongHighLatency
        expr: histogram_quantile(0.95, sum(rate(kong_latency_bucket{type="request"}[5m])) by (service, le)) > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "Service {{ $labels.service }} has a 95th percentile latency above 500ms (current value: {{ $value }}ms)"
Real-world Example: Monitoring Service Degradation
Let's walk through a real-world scenario:
1. Symptom: Users report intermittent slowness in the authentication service
2. Investigation: Looking at the monitoring dashboard, you notice:
   - Increased latency in the auth service
   - Normal request volume
   - No increase in error rates
   - High CPU usage on the upstream service
3. Diagnosis: By correlating these metrics, you determine that the auth service database is under-provisioned
4. Resolution:
   - Scale up the database resources
   - Monitor the metrics to confirm improvement
   - Set up better alerting thresholds to catch this issue earlier next time
Code Example: Creating a Health Check Probe
Here's an example of setting up a custom health check for deeper monitoring:
-- custom-health-check.lua
local _M = {}

function _M.execute()
  local health = {
    status = "ok",
    details = {}
  }

  -- Check database connectivity
  local ok, err = kong.db.connector:connect()
  if not ok then
    health.status = "unhealthy"
    health.details.database = {
      status = "unhealthy",
      message = err
    }
  else
    health.details.database = {
      status = "ok"
    }
  end

  -- Check memory usage
  local memory_stats = collectgarbage("count") * 1024 -- in bytes
  health.details.memory = {
    status = "ok",
    current_usage = memory_stats
  }

  if memory_stats > 500000000 then -- 500MB threshold
    health.status = "degraded"
    health.details.memory.status = "warning"
  end

  return health
end

return _M
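A module like this is not wired up by Kong automatically; you would typically expose it from a custom plugin or a serverless function attached to a dedicated route, then point your orchestrator's readiness or liveness probe at that route. The exact wiring depends on how you package custom code in your deployment.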
Monitoring Best Practices
- Start Simple: Begin with basic metrics before adding complexity
- Focus on the Four Golden Signals:
- Latency
- Traffic
- Errors
- Saturation
- Layer Your Monitoring:
- Infrastructure metrics (CPU, memory)
- Kong internal metrics
- Business metrics (transactions, users)
- Set Meaningful Thresholds: Based on normal operation patterns
- Document Runbooks: Create clear procedures for responding to alerts
- Regularly Review: Adjust monitoring based on new services and requirements
Troubleshooting Common Issues
Problem: High Latency Spikes
Possible Causes:
- Database connection issues
- Upstream service degradation
- Kong plugin overhead
Investigation Steps:
# Check latency metrics (time spent in Kong vs. the upstream)
curl http://localhost:8001/metrics | grep kong_latency
# Check DB health
curl http://localhost:8001/status | jq .database
# Check upstream health
curl http://localhost:8001/upstreams/{upstream_id}/health
Problem: High Error Rates
Possible Causes:
- Upstream service failures
- Route configuration issues
- Rate limiting triggering
Investigation Steps:
# Check error codes by service
curl http://localhost:8001/metrics | grep kong_http_status
# Check specific service configuration
curl http://localhost:8001/services/{service_name} | jq .
Summary
Effective Kong service monitoring combines:
- Proper Setup: Configuring Kong, Prometheus, and visualization tools
- Key Metrics: Tracking the right indicators for your services
- Alerting: Setting up proactive notifications for issues
- Troubleshooting Workflows: Having clear procedures when problems arise
- Continuous Improvement: Regularly refining your monitoring strategy
By implementing the practices in this guide, you'll gain visibility into your Kong API Gateway, ensuring reliable service for your users and peace of mind for your team.
Exercises
- Set up basic Kong monitoring with the Status API and fetch service metrics
- Install and configure the Prometheus plugin for Kong
- Create a simple Grafana dashboard showing request rates and error percentages
- Set up an alert for when latency exceeds acceptable thresholds
- Simulate a service degradation and use monitoring to diagnose the issue