Kubernetes Monitoring Stack
Introduction
When managing Kubernetes clusters, having visibility into what's happening inside is crucial. A monitoring stack allows you to collect metrics, visualize data, and set up alerts for your Kubernetes infrastructure. This guide will walk you through building a comprehensive monitoring solution for your Kubernetes clusters using popular open-source tools.
Think of a monitoring stack as the dashboard of a car - it shows you important information about what's happening "under the hood" of your Kubernetes environment, helping you detect issues before they become problems.
Why Monitoring Matters in Kubernetes
Kubernetes environments are dynamic and complex. Without proper monitoring:
- You won't know if your applications are healthy
- Resource bottlenecks may go undetected
- Troubleshooting becomes difficult
- Scaling decisions lack data
Core Components of a Kubernetes Monitoring Stack
A typical Kubernetes monitoring stack consists of these primary components:
- Prometheus for metrics collection and storage
- Grafana for visualization and dashboards
- AlertManager for alert routing and notifications
- The EFK stack (Elasticsearch, Fluentd, Kibana) for log collection and analysis
- kube-state-metrics for metrics about the state of Kubernetes objects
Let's break down each component and how they work together.
Setting Up Prometheus for Metrics Collection
Prometheus is the de facto standard for Kubernetes metrics collection.
Installing Prometheus with Helm
Helm makes it easy to install Prometheus on your Kubernetes cluster:
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install Prometheus
helm install prometheus prometheus-community/prometheus -n monitoring --create-namespace
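Once the install finishes, it's worth checking that the Prometheus components came up before moving on (assuming the monitoring namespace used above):
# List the pods and services created by the chart
kubectl get pods -n monitoring
kubectl get svc -n monitoring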
Understanding Prometheus Configuration
Prometheus uses a YAML configuration file. Here's a simplified example:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
How Prometheus Works
Prometheus works by "scraping" metrics from HTTP endpoints exposed by your applications and Kubernetes components. It stores these metrics in a time-series database.
To make your application scrapeable by Prometheus, you need to:
- Expose metrics on an HTTP endpoint (commonly /metrics)
- Format the metrics in the Prometheus exposition format
- Add the right annotations to your Kubernetes resources
Here's an example of a Pod with Prometheus annotations:
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: example-app
      image: example-app:latest
      ports:
        - containerPort: 8080
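To confirm the annotations point at a working endpoint, you can port-forward to the pod and fetch the metrics yourself. This assumes example-app really does serve Prometheus-formatted metrics on port 8080 at /metrics:
# Forward local port 8080 to the pod
kubectl port-forward pod/example-app 8080:8080
# In another terminal, fetch the metrics endpoint
curl http://localhost:8080/metrics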
Setting Up Grafana for Visualization
Grafana turns your metrics into visual dashboards.
Installing Grafana
# Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Grafana
helm install grafana grafana/grafana -n monitoring
Getting the Grafana Password
kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
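The chart doesn't expose Grafana outside the cluster by default, so the simplest way to reach the UI is a port-forward (assuming the release name grafana and the monitoring namespace used above):
# Forward local port 3000 to the Grafana service
kubectl port-forward svc/grafana 3000:80 -n monitoring
# Then open http://localhost:3000 and log in as admin with the password retrieved above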
Connecting Grafana to Prometheus
Once installed, you'll need to:
- Log in to Grafana UI
- Go to Configuration > Data Sources
- Add a new Prometheus data source with URL:
http://prometheus-server.monitoring.svc.cluster.local
Creating Your First Dashboard
Let's create a simple dashboard to monitor CPU and memory usage:
- In Grafana, click "Create Dashboard"
- Add a new panel
- Use this PromQL query for CPU usage:
sum(rate(container_cpu_usage_seconds_total{pod=~"$pod"}[5m])) by (pod)
- Add another panel with this query for memory usage:
sum(container_memory_usage_bytes{pod=~"$pod"}) by (pod)
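Both queries reference a $pod dashboard variable, which has to be defined before the panels return data. A common approach (assuming kube-state-metrics is installed, as described later in this guide) is to add a dashboard variable of type Query, backed by the Prometheus data source, with:
label_values(kube_pod_info, pod)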
Adding Alerting with AlertManager
AlertManager handles alerts sent by Prometheus and routes them to the right receiver (email, Slack, etc.).
Configuring AlertManager
Create an alertmanager.yaml configuration file:
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'email-notifications'
receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager'
        auth_password: 'password'
Creating Alert Rules in Prometheus
Alert rules tell Prometheus when to fire alerts. Here's an example rule:
groups:
  - name: example
    rules:
      - alert: HighCPUUsage
        expr: sum(rate(container_cpu_usage_seconds_total{namespace!="kube-system"}[5m])) by (pod) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "Pod {{ $labels.pod }} has high CPU usage for more than 5 minutes."
Collecting and Analyzing Logs with the EFK Stack
While metrics show you what is happening, logs tell you why. The EFK stack (Elasticsearch, Fluentd, Kibana) is commonly used for log management in Kubernetes.
Installing Fluentd
# Create a namespace for logging
kubectl create namespace logging
# Apply Fluentd configuration
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - namespaces
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
  - kind: ServiceAccount
    name: fluentd
    namespace: logging
EOF
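The manifest above only creates the service account and RBAC permissions. Fluentd itself runs as a DaemonSet so that one collector pod lands on every node. A minimal sketch, assuming the community fluent/fluentd-kubernetes-daemonset image and the elasticsearch-master service created by the Helm chart in the next step:
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          env:
            # Assumed to match the Elasticsearch service installed below
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "elasticsearch-master.logging.svc.cluster.local"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
          volumeMounts:
            # Node-level log directories that Fluentd tails
            - name: varlog
              mountPath: /var/log
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
EOF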
Installing Elasticsearch and Kibana
# Add Elastic Helm repository
helm repo add elastic https://helm.elastic.co
helm repo update
# Install Elasticsearch
helm install elasticsearch elastic/elasticsearch -n logging
# Install Kibana
helm install kibana elastic/kibana -n logging
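Once the pods are ready, you can reach the Kibana UI with a port-forward. The service name below assumes the chart defaults with the release name kibana; check kubectl get svc -n logging if yours differs:
# Forward local port 5601 to the Kibana service
kubectl port-forward svc/kibana-kibana 5601:5601 -n logging
# Then open http://localhost:5601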
Integrating kube-state-metrics
To get additional metrics about the state of Kubernetes objects, you can use kube-state-metrics:
# Install kube-state-metrics
helm install kube-state-metrics prometheus-community/kube-state-metrics -n monitoring
This provides metrics like:
- Pod status (running, pending, failed)
- Deployment status (desired vs available replicas)
- Node capacity and allocatable resources
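These metrics are ordinary Prometheus time series, so you can query them like anything else Prometheus scrapes. A few example PromQL queries using kube-state-metrics metric names:
# Pods that are not in the Running phase, grouped by namespace and phase
sum(kube_pod_status_phase{phase!="Running"}) by (namespace, phase)
# Deployments whose available replicas lag behind the desired count
kube_deployment_spec_replicas - kube_deployment_status_replicas_available
# Allocatable CPU per node
kube_node_status_allocatable{resource="cpu"}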
Creating a Custom All-in-One Monitoring Stack
For beginners, setting up each component individually can be complex. Let's create a simple all-in-one setup:
# Create a values.yaml file for the Prometheus Stack
cat > monitoring-values.yaml <<EOF
grafana:
  enabled: true
  adminPassword: admin-password
prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        memory: 512Mi
        cpu: 500m
      limits:
        memory: 2Gi
        cpu: 2000m
alertmanager:
  enabled: true
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
    receivers:
      - name: 'null'
nodeExporter:
  enabled: true
kubeStateMetrics:
  enabled: true
EOF
# Install the Prometheus Stack with our custom values
helm install monitoring prometheus-community/kube-prometheus-stack \
-f monitoring-values.yaml \
-n monitoring \
--create-namespace
Practical Exercise: Monitor a Sample Application
Let's put everything together by deploying and monitoring a sample application:
- Deploy a sample application:
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  labels:
    app: sample-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
      annotations:
        # nginx serves on port 80; it doesn't expose Prometheus metrics natively,
        # so these annotations only illustrate the scrape configuration
        prometheus.io/scrape: "true"
        prometheus.io/port: "80"
    spec:
      containers:
        - name: sample-app
          image: nginx:latest
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: sample-app
spec:
  selector:
    app: sample-app
  ports:
    - port: 80
      targetPort: 80
EOF
- Port-forward to access Grafana:
kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring
- Import a Kubernetes dashboard in Grafana (Dashboard ID: 10000)
- Create an alert if the sample app's pod count drops below 3:
groups:
  - name: sample-app
    rules:
      - alert: SampleAppPodCount
        expr: count(kube_pod_info{namespace="default", pod=~"sample-app.*"}) < 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Sample App pod count is below desired"
          description: "The number of running pods for the Sample App is below 3."
Best Practices for Kubernetes Monitoring
- Monitor the Four Golden Signals:
  - Latency
  - Traffic
  - Errors
  - Saturation
- Implement Proper Resource Requests and Limits: This helps the monitoring system identify when containers are approaching their limits (see the example after this list).
- Use Labels Consistently: Labels like app, environment, and team make it easier to filter and group metrics.
- Set Up Relevant Alerts: Only alert on actionable issues to avoid alert fatigue.
- Keep History in Mind: Configure appropriate retention periods for metrics and logs.
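For example, a container spec with explicit requests and limits gives the monitoring stack a baseline to compare actual usage against (the values below are illustrative):
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi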
Troubleshooting Common Issues
"No Data Points" in Grafana
- Check if Prometheus is correctly scraping metrics
- Verify the PromQL query syntax
- Ensure time ranges are set correctly
High Cardinality Issues
- Avoid using high-cardinality labels
- Use recording rules for common queries
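A recording rule precomputes an expensive expression on a schedule and stores the result as a new, cheap-to-query series. A small sketch (the rule name namespace:container_cpu_usage:rate5m is just a naming convention, not anything required):
groups:
  - name: recording-rules
    rules:
      - record: namespace:container_cpu_usage:rate5m
        expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)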
Memory Usage Problems in Prometheus
- Increase memory limits
- Reduce retention period
- Limit the number of time series with better relabeling
Summary
Setting up a monitoring stack for Kubernetes is essential for maintaining a healthy and efficient cluster. In this guide, we've covered:
- Setting up Prometheus for metrics collection
- Visualizing data with Grafana
- Configuring alerts with AlertManager
- Collecting logs with the EFK stack
- Integrating kube-state-metrics for detailed Kubernetes object metrics
- Creating an all-in-one monitoring solution
By implementing this monitoring stack, you'll gain visibility into your Kubernetes environment, helping you identify and resolve issues before they affect your applications.
Additional Resources
- Prometheus Documentation
- Grafana Documentation
- Kubernetes Monitoring with Prometheus
- Helm Chart Repository for Prometheus Stack
Exercises for Practice
- Deploy the monitoring stack on a local Kubernetes cluster (like Minikube or Docker Desktop).
- Create a custom dashboard in Grafana for a specific application.
- Set up an alert that notifies you when pod restarts exceed a threshold.
- Configure log collection for a specific namespace.
- Create a recording rule in Prometheus to optimize a frequently used query.