
Kubernetes Monitoring Stack

Introduction

When managing Kubernetes clusters, having visibility into what's happening inside is crucial. A monitoring stack allows you to collect metrics, visualize data, and set up alerts for your Kubernetes infrastructure. This guide will walk you through building a comprehensive monitoring solution for your Kubernetes clusters using popular open-source tools.

Think of a monitoring stack as the dashboard of a car - it shows you important information about what's happening "under the hood" of your Kubernetes environment, helping you detect issues before they become problems.

Why Monitoring Matters in Kubernetes

Kubernetes environments are dynamic and complex. Without proper monitoring:

  • You won't know if your applications are healthy
  • Resource bottlenecks may go undetected
  • Troubleshooting becomes difficult
  • Scaling decisions lack data

Core Components of a Kubernetes Monitoring Stack

A typical Kubernetes monitoring stack consists of these primary components:

  • Prometheus for collecting and storing metrics
  • Grafana for visualizing metrics in dashboards
  • AlertManager for routing and delivering alerts
  • The EFK stack (Elasticsearch, Fluentd, Kibana) for collecting and analyzing logs
  • kube-state-metrics for metrics about the state of Kubernetes objects

Let's break down each component and how they work together.

Setting Up Prometheus for Metrics Collection

Prometheus is the de facto standard for Kubernetes metrics collection.

Installing Prometheus with Helm

Helm makes it easy to install Prometheus on your Kubernetes cluster:

bash
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus
helm install prometheus prometheus-community/prometheus -n monitoring --create-namespace
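
Once the install finishes, it's worth confirming that the Prometheus pods actually came up before moving on (using the namespace created above):

bash
# List everything the chart deployed into the monitoring namespace
kubectl get pods -n monitoring

# The Prometheus server is exposed inside the cluster as a Service
kubectl get svc -n monitoring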

Understanding Prometheus Configuration

Prometheus uses a YAML configuration file. Here's a simplified example:

yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

How Prometheus Works

Prometheus works by "scraping" metrics from HTTP endpoints exposed by your applications and Kubernetes components. It stores these metrics in a time-series database.

To make your application scrape-able by Prometheus, you need to:

  1. Expose metrics on an HTTP endpoint (commonly /metrics)
  2. Format the metrics in the Prometheus format
  3. Add the right annotations to your Kubernetes resources

Here's an example of a Pod with Prometheus annotations:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: example-app
      image: example-app:latest
      ports:
        - containerPort: 8080
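
To confirm that Prometheus has discovered and scraped a target, you can port-forward to the Prometheus server and inspect its targets page. This is a quick sanity check, assuming the standalone chart installed earlier (its Service is typically named prometheus-server and listens on port 80):

bash
# Forward the Prometheus server Service to localhost:9090
kubectl port-forward svc/prometheus-server 9090:80 -n monitoring

# Then open http://localhost:9090/targets in a browser, or query the API:
curl -s http://localhost:9090/api/v1/targets | head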

Setting Up Grafana for Visualization

Grafana turns your metrics into visual dashboards.

Installing Grafana

bash
# Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Grafana
helm install grafana grafana/grafana -n monitoring

Getting the Grafana Password

bash
kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
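
To reach the Grafana UI locally, port-forward its Service (with the standalone chart, the Service is named after the release, i.e. grafana, and listens on port 80):

bash
# Forward Grafana to http://localhost:3000 and log in as admin with the password above
kubectl port-forward svc/grafana 3000:80 -n monitoring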

Connecting Grafana to Prometheus

Once installed, you'll need to:

  1. Log in to Grafana UI
  2. Go to Configuration > Data Sources
  3. Add a new Prometheus data source with URL: http://prometheus-server.monitoring.svc.cluster.local

Creating Your First Dashboard

Let's create a simple dashboard to monitor CPU and memory usage:

  1. In Grafana, click "Create Dashboard"
  2. Add a new panel
  3. Use this PromQL query for CPU usage:
sum(rate(container_cpu_usage_seconds_total{pod=~"$pod"}[5m])) by (pod)
  4. Add another panel with this query for memory usage (both queries can be sanity-checked against the Prometheus API, as shown below):
sum(container_memory_usage_bytes{pod=~"$pod"}) by (pod)
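
You can sanity-check these queries outside Grafana by calling the Prometheus HTTP API directly (this assumes the prometheus-server port-forward from earlier is still running on localhost:9090; the $pod variable is a Grafana template variable, so it is dropped here):

bash
# Evaluate the CPU query against the Prometheus query API
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)'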

Adding Alerting with AlertManager

AlertManager handles alerts sent by Prometheus and routes them to the right receiver (email, Slack, etc.).

Configuring AlertManager

Create an alertmanager.yaml configuration file:

yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'email-notifications'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager'
        auth_password: 'password'
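
Before loading this configuration, you can validate its syntax with amtool, the command-line tool that ships with Alertmanager:

bash
# Check the Alertmanager configuration for errors
amtool check-config alertmanager.yaml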

Creating Alert Rules in Prometheus

Alert rules tell Prometheus when to fire alerts. Here's an example rule:

yaml
groups:
  - name: example
    rules:
      - alert: HighCPUUsage
        expr: sum(rate(container_cpu_usage_seconds_total{namespace!="kube-system"}[5m])) by (pod) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "Pod {{ $labels.pod }} has high CPU usage for more than 5 minutes."

Collecting and Analyzing Logs with the EFK Stack

While metrics show you what is happening, logs tell you why. The EFK stack (Elasticsearch, Fluentd, Kibana) is commonly used for log management in Kubernetes.

Installing Fluentd

bash
# Create a namespace for logging
kubectl create namespace logging

# Apply Fluentd configuration
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - namespaces
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
  - kind: ServiceAccount
    name: fluentd
    namespace: logging
EOF
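
The manifest above only creates the ServiceAccount and RBAC; Fluentd itself runs as a DaemonSet so that one collector pod runs on every node and tails the container logs under /var/log. Below is a minimal sketch of such a DaemonSet. The image tag, the FLUENT_ELASTICSEARCH_* environment variables, and the elasticsearch-master Service name are assumptions based on the fluent/fluentd-kubernetes-daemonset project and the elastic/elasticsearch chart defaults, so check them against the versions you install:

bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      containers:
        - name: fluentd
          # Assumed image tag; see the fluent/fluentd-kubernetes-daemonset docs
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          env:
            # Assumed Service name from the elastic/elasticsearch chart defaults
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "elasticsearch-master.logging.svc.cluster.local"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
EOF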

Installing Elasticsearch and Kibana

bash
# Add Elastic Helm repository
helm repo add elastic https://helm.elastic.co
helm repo update

# Install Elasticsearch
helm install elasticsearch elastic/elasticsearch -n logging

# Install Kibana
helm install kibana elastic/kibana -n logging
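
Elasticsearch can take a few minutes to become ready. Once the pods are up, port-forward Kibana to browse the collected logs (the Service names below are the charts' defaults for these release names):

bash
# Wait for the logging pods to become ready
kubectl get pods -n logging

# Kibana listens on port 5601
kubectl port-forward svc/kibana-kibana 5601:5601 -n logging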

Integrating kube-state-metrics

To get additional metrics about the state of Kubernetes objects, you can use kube-state-metrics:

bash
# Install kube-state-metrics
helm install kube-state-metrics prometheus-community/kube-state-metrics -n monitoring

This provides metrics like:

  • Pod status (running, pending, failed)
  • Deployment status (desired vs available replicas)
  • Node capacity and allocatable resources
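
Each of these is exposed as a Prometheus metric that you can query directly. For example, with a port-forward to the Prometheus server on localhost:9090, the standard kube-state-metrics series look like this:

bash
# Pods that are currently not in the Running phase
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=kube_pod_status_phase{phase!="Running"} == 1'

# Deployments whose available replicas lag behind the desired count
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=kube_deployment_spec_replicas - kube_deployment_status_replicas_available > 0'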

Creating a Custom All-in-One Monitoring Stack

For beginners, setting up each component individually can be complex. Let's create a simple all-in-one setup:

bash
# Create a values.yaml file for the Prometheus Stack
cat > monitoring-values.yaml <<EOF
grafana:
  enabled: true
  adminPassword: admin-password

prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        memory: 512Mi
        cpu: 500m
      limits:
        memory: 2Gi
        cpu: 2000m

alertmanager:
  enabled: true
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
    receivers:
      - name: 'null'

nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true
EOF

# Install the Prometheus Stack with our custom values
helm install monitoring prometheus-community/kube-prometheus-stack \
  -f monitoring-values.yaml \
  -n monitoring \
  --create-namespace
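
The chart deploys several components at once (Prometheus, Grafana, AlertManager, node-exporter, kube-state-metrics). You can confirm they are all running before continuing:

bash
# Pods labelled with the "monitoring" release name
kubectl get pods -n monitoring -l "release=monitoring"

# Some subchart pods (e.g. Grafana) use different labels, so list everything too
kubectl get pods -n monitoring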

Practical Exercise: Monitor a Sample Application

Let's put everything together by deploying and monitoring a sample application:

  1. Deploy a sample application:
bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  labels:
    app: sample-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "80"
    spec:
      containers:
        - name: sample-app
          image: nginx:latest
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: sample-app
spec:
  selector:
    app: sample-app
  ports:
    - port: 80
      targetPort: 80
EOF
  2. Port-forward to access Grafana:
bash
kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring
  3. Import a Kubernetes dashboard in Grafana (Dashboard ID: 10000)

  4. Create an alert if the sample app's pod count drops below 3:

yaml
groups:
  - name: sample-app
    rules:
      - alert: SampleAppPodCount
        expr: count(kube_pod_info{namespace="default", pod=~"sample-app.*"}) < 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Sample App pod count is below desired"
          description: "The number of running pods for the Sample App is below 3."

Best Practices for Kubernetes Monitoring

  1. Monitor the Four Golden Signals:

    • Latency
    • Traffic
    • Errors
    • Saturation
  2. Implement Proper Resource Requests and Limits: This helps the monitoring system identify when containers are approaching their limits.

  3. Use Labels Consistently: Labels like app, environment, and team make it easier to filter and group metrics.

  4. Set Up Relevant Alerts: Only alert on actionable issues to avoid alert fatigue.

  5. Keep History in Mind: Configure appropriate retention periods for metrics and logs.

Troubleshooting Common Issues

"No Data Points" in Grafana

  • Check if Prometheus is correctly scraping metrics
  • Verify the PromQL query syntax
  • Ensure time ranges are set correctly

High Cardinality Issues

  • Avoid using high-cardinality labels
  • Use recording rules for common queries (see the sketch below)
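
A recording rule precomputes an expensive expression into a new, cheap-to-query series. A minimal sketch (the rule and file names here are just examples):

bash
cat > recording-rules.yaml <<EOF
groups:
  - name: recording-rules
    rules:
      # Precompute per-pod CPU usage so dashboards query one small series set
      - record: pod:container_cpu_usage_seconds:rate5m
        expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
EOF

# Validate the file before loading it into Prometheus
promtool check rules recording-rules.yaml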

Memory Usage Problems in Prometheus

  • Increase memory limits
  • Reduce retention period
  • Limit the number of time series with better relabeling

Summary

Setting up a monitoring stack for Kubernetes is essential for maintaining a healthy and efficient cluster. In this guide, we've covered:

  • Setting up Prometheus for metrics collection
  • Visualizing data with Grafana
  • Configuring alerts with AlertManager
  • Collecting logs with the EFK stack
  • Integrating kube-state-metrics for detailed Kubernetes object metrics
  • Creating an all-in-one monitoring solution

By implementing this monitoring stack, you'll gain visibility into your Kubernetes environment, helping you identify and resolve issues before they affect your applications.


Exercises for Practice

  1. Deploy the monitoring stack on a local Kubernetes cluster (like Minikube or Docker Desktop).
  2. Create a custom dashboard in Grafana for a specific application.
  3. Set up an alert that notifies you when pod restarts exceed a threshold.
  4. Configure log collection for a specific namespace.
  5. Create a recording rule in Prometheus to optimize a frequently used query.

