
Kubernetes Monitoring with Grafana

Introduction

Kubernetes has become the de facto standard for container orchestration, enabling organizations to deploy, scale, and manage containerized applications efficiently. However, with the complexity and distributed nature of Kubernetes environments, monitoring becomes crucial to ensure optimal performance, resource utilization, and troubleshooting capabilities.

In this guide, we'll explore how to effectively monitor Kubernetes clusters using Grafana and Prometheus. We'll cover the fundamental monitoring concepts, set up a basic monitoring stack, and create dashboards to visualize key metrics. By the end, you'll have a solid understanding of Kubernetes monitoring patterns that you can apply to your own environments.

Why Monitor Kubernetes?

Before diving into the technical details, let's understand why monitoring Kubernetes is essential:

  1. Resource Optimization: Monitor CPU, memory, and storage usage to optimize resource allocation.
  2. Performance Tracking: Track application and system performance to identify bottlenecks.
  3. Troubleshooting: Quickly identify and resolve issues before they impact users.
  4. Capacity Planning: Forecast resource needs based on historical data and trends.
  5. Security Monitoring: Detect unusual patterns that might indicate security issues.

Kubernetes Monitoring Architecture

A typical Kubernetes monitoring stack consists of several components working together:

  • Metrics Exporters: Components that collect metrics from various parts of the Kubernetes cluster
  • Prometheus: Time-series database that scrapes and stores metrics
  • Grafana: Visualization platform for creating dashboards and alerts
  • Alert Manager: Handles alerts from Prometheus and routes them to the correct receivers
  • Loki: Log aggregation system designed to work with Grafana (optional but recommended)
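
The key idea is a pull model: each exporter serves its metrics as plain text over HTTP, Prometheus scrapes those endpoints on a schedule, and Grafana queries Prometheus to render dashboards. As a quick illustration (a sketch that assumes the stack from the next section is already installed and that the node-exporter DaemonSet is named monitoring-prometheus-node-exporter; the name may differ in your cluster):

bash
# In one terminal: forward the node-exporter port
kubectl port-forward -n monitoring ds/monitoring-prometheus-node-exporter 9100:9100

# In another terminal: view the raw metrics that Prometheus scrapes
curl -s http://localhost:9100/metrics | head -n 20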

Setting Up the Monitoring Stack

Let's look at how to set up a basic monitoring stack for Kubernetes using Helm charts.

Prerequisites

  • A running Kubernetes cluster
  • Helm installed
  • kubectl configured to access your cluster

Installing Prometheus and Grafana

bash
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (includes Prometheus, Grafana, and Alertmanager)
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

This command installs the kube-prometheus-stack, which includes:

  • Prometheus server
  • Alertmanager
  • Grafana
  • Various exporters like node-exporter, kube-state-metrics
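
The chart ships with sensible defaults, but most settings can be overridden with a values file. A minimal sketch, assuming the standard kube-prometheus-stack value keys grafana.adminPassword and prometheus.prometheusSpec.retention (check the chart's values for your version):

bash
# custom-values.yaml is a hypothetical file you create locally
cat <<EOF > custom-values.yaml
grafana:
  adminPassword: "change-me"
prometheus:
  prometheusSpec:
    retention: 15d
EOF

helm upgrade monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f custom-values.yaml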

Verifying the Installation

bash
# Check that all pods are running
kubectl get pods -n monitoring

# Expected output:
# NAME READY STATUS RESTARTS AGE
# alertmanager-monitoring-kube-prometheus-alertmanager-0 2/2 Running 0 2m
# monitoring-grafana-59cb7d7c5c-z6l8p 2/2 Running 0 2m
# monitoring-kube-prometheus-operator-6b8c6878f7-nf7gg 1/1 Running 0 2m
# monitoring-kube-state-metrics-55b6f7dcfb-szzbq 1/1 Running 0 2m
# monitoring-prometheus-node-exporter-gzlbq 1/1 Running 0 2m
# prometheus-monitoring-kube-prometheus-prometheus-0 2/2 Running 0 2m

Accessing Grafana

bash
# Forward Grafana service port to your local machine
kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring

Now you can access Grafana at http://localhost:3000. The default credentials are:

  • Username: admin
  • Password: prom-operator
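
If the password has been overridden (or you simply prefer not to rely on the default), the Grafana chart stores the admin credentials in a Kubernetes secret. A sketch, assuming the secret follows the chart's usual <release>-grafana naming:

bash
# Read the Grafana admin password from the secret created by the chart
kubectl get secret monitoring-grafana -n monitoring \
  -o jsonpath="{.data.admin-password}" | base64 --decode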

Key Metrics to Monitor in Kubernetes

When monitoring Kubernetes, focus on these key metric categories:

1. Node-level Metrics

These metrics provide insights into the health and resource utilization of your Kubernetes nodes:

  • CPU utilization
  • Memory usage
  • Disk space
  • Network traffic
  • System load
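
A few example queries for these, using metric names exposed by the bundled node-exporter (adjust the label filters to your environment):

Node CPU utilization (fraction of time busy, per node):

1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

Node memory utilization:

1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

Available filesystem space:

node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}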

2. Pod-level Metrics

Pod metrics help you understand how your applications are performing:

  • CPU and memory requests/limits vs. actual usage
  • Pod restart count
  • Pod status (Running, Pending, Failed, etc.)
  • Container restart count
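
A couple of example queries, using cAdvisor and kube-state-metrics names from the stack installed above (the limit ratio is only meaningful for pods that define CPU limits):

Container restarts per pod:

sum(kube_pod_container_status_restarts_total{namespace="default"}) by (pod)

CPU usage as a fraction of the configured limit:

sum(rate(container_cpu_usage_seconds_total{namespace="default",container!=""}[5m])) by (pod) / sum(kube_pod_container_resource_limits{namespace="default",resource="cpu"}) by (pod)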

3. Kubernetes API Server Metrics

These metrics indicate the health of your control plane:

  • API request rate and latency
  • etcd performance
  • Controller manager and scheduler metrics
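
For example, request rate by response code and 99th-percentile request latency, using the API server's built-in metric names:

API request rate by code:

sum(rate(apiserver_request_total[5m])) by (code)

99th-percentile API request latency by verb:

histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))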

4. Custom Application Metrics

Application-specific metrics that provide insights into business logic:

  • Request rate, error rate, and duration (RED method)
  • Business-specific metrics (transactions, users, etc.)
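
As a sketch, suppose your application exposes a counter named http_requests_total (with a status label) and a histogram named http_request_duration_seconds; both names are hypothetical and depend on how you instrument your code. The RED signals then look like:

Request rate:

sum(rate(http_requests_total[5m]))

Error rate:

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

95th-percentile duration:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))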

Creating Kubernetes Monitoring Dashboards

Now, let's create a basic dashboard to monitor our Kubernetes cluster. We'll do this by importing a pre-configured dashboard and then customizing it.

Importing a Dashboard

  1. In Grafana, click on "+" icon and select "Import"
  2. Enter dashboard ID 10856 (Kubernetes Cluster Monitoring via Prometheus)
  3. Select your Prometheus data source and click "Import"

This gives you a comprehensive, ready-made dashboard for monitoring your Kubernetes cluster.

Creating Custom Dashboards

Let's create a simple custom dashboard for monitoring pod resources:

  1. Click "Create" and select "Dashboard"
  2. Add a new panel
  3. Use the following PromQL queries:

Pod CPU Usage:

sum(rate(container_cpu_usage_seconds_total{namespace="default",container!=""}[5m])) by (pod)

Pod Memory Usage:

sum(container_memory_working_set_bytes{namespace="default",container!=""}) by (pod)

Pod Network Traffic:

sum(rate(container_network_receive_bytes_total{namespace="default"}[5m])) by (pod)

Setting Up Alerts

Let's set up a basic alert for high CPU usage:

  1. In Grafana, navigate to Alerting > Alert Rules
  2. Click "New alert rule"
  3. Configure the following:
    • Name: "High CPU Usage"
    • Query: sum(rate(container_cpu_usage_seconds_total{namespace="default",container!=""}[5m])) by (pod) > 0.8
    • Evaluation interval: 1m
    • For: 5m
  4. Add a notification message: Pod {{$labels.pod}} in namespace {{$labels.namespace}} has high CPU usage: {{$value}}
  5. Save the alert
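
If you prefer to keep alerts in version control rather than clicking through the UI, the Prometheus Operator bundled with kube-prometheus-stack also accepts PrometheusRule resources. A minimal sketch of the same alert (the release: monitoring label is an assumption; it must match your Prometheus ruleSelector):

bash
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: high-cpu-usage
  namespace: monitoring
  labels:
    release: monitoring   # assumption: must match the operator's ruleSelector
spec:
  groups:
    - name: pod-cpu
      rules:
        - alert: HighCPUUsage
          expr: sum(rate(container_cpu_usage_seconds_total{namespace="default",container!=""}[5m])) by (pod) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} has high CPU usage: {{ $value }}"
EOF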

Practical Example: Monitoring a Web Application

Let's walk through monitoring a sample web application:

1. Deploy a Sample Application

bash
# Create a namespace
kubectl create namespace sample-app

# Deploy a sample application
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-web-app
  namespace: sample-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-web-app
  template:
    metadata:
      labels:
        app: sample-web-app
      annotations:
        # Scrape hints are illustrative here; plain nginx does not expose /metrics
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "80"
    spec:
      containers:
        - name: sample-web-app
          image: nginx
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: sample-web-app
  namespace: sample-app
spec:
  selector:
    app: sample-web-app
  ports:
    - name: http
      port: 80
      targetPort: 80
EOF
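
One caveat: pod annotations like prometheus.io/scrape are only honored if your Prometheus configuration includes an annotation-based scrape job. The Prometheus Operator deployed by kube-prometheus-stack instead discovers targets through ServiceMonitor objects. A minimal sketch for the Service above (it assumes your Prometheus selects ServiceMonitors labeled release: monitoring and relies on the http port name in the Service; it is illustrative only, since plain nginx serves no /metrics endpoint):

bash
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-web-app
  namespace: sample-app
  labels:
    release: monitoring   # assumption: must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: sample-web-app
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
EOF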

2. Monitor Application Metrics

Now, create a dashboard specifically for this application:

  1. Create a new dashboard

  2. Add panels for:

    • Pod status
    sum(kube_pod_status_phase{namespace="sample-app"}) by (phase)
    • CPU usage
    sum(rate(container_cpu_usage_seconds_total{namespace="sample-app",container!=""}[5m])) by (pod)
    • Memory usage
    sum(container_memory_working_set_bytes{namespace="sample-app",container!=""}) by (pod)
    • Network traffic
    sum(rate(container_network_receive_bytes_total{namespace="sample-app"}[5m])) by (pod)

This gives you a comprehensive view of your application's performance and resource utilization.

Best Practices for Kubernetes Monitoring

  1. Use Labels Effectively: Properly label your Kubernetes resources to make querying and filtering easier.

  2. Configure Resource Requests and Limits: This helps Kubernetes make better scheduling decisions and provides meaningful utilization metrics (see the example after this list).

  3. Follow the RED Method:

    • Rate: Number of requests per second
    • Error rate: Percentage of requests that fail
    • Duration: Distribution of response times
  4. Use the USE Method for Resources:

    • Utilization: Percentage of time the resource is busy
    • Saturation: Queued work the resource cannot yet service (e.g., load average, queue depth)
    • Errors: Count of error events
  5. Set Up Proper Retention Policies: Configure data retention based on your needs, balancing between storage requirements and data availability.

  6. Implement Multi-Level Alerting: Different severity levels for different thresholds.
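
Tying point 2 back to the sample application, requests and limits can be added without editing the manifest. A sketch (the CPU and memory values below are illustrative, not a recommendation):

bash
# Set requests and limits on the sample deployment from the previous section
kubectl set resources deployment sample-web-app -n sample-app \
  --requests=cpu=100m,memory=128Mi \
  --limits=cpu=500m,memory=256Mi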

Troubleshooting Common Issues

Issue: Missing Metrics

Solution: Check if exporters are running and correctly configured:

bash
kubectl get pods -n monitoring | grep exporter

Issue: High Cardinality Issues

Solution: Review and optimize your labels to reduce the number of time series:

bash
# Forward the Prometheus service locally (the service name may differ for your release)
kubectl port-forward svc/monitoring-kube-prometheus-prometheus 9090:9090 -n monitoring

# Check metrics and label cardinality
curl -s http://localhost:9090/api/v1/status/tsdb | jq .

Issue: Grafana Dashboard Loading Slowly

Solution: Optimize queries by adding time range constraints and avoiding high-cardinality labels:

sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)

Instead of:

sum(rate(container_cpu_usage_seconds_total[5m])) by (pod, container, namespace, node)

Summary

In this guide, we've covered:

  1. The importance of monitoring Kubernetes environments
  2. Setting up a basic monitoring stack with Prometheus and Grafana
  3. Key metrics to track at various levels of your Kubernetes cluster
  4. Creating dashboards and alerts for effective monitoring
  5. Best practices and troubleshooting tips

By implementing these monitoring patterns, you'll gain better visibility into your Kubernetes clusters, helping you optimize performance, troubleshoot issues, and plan for future capacity needs.

Additional Resources

Here are some exercises to help you practice Kubernetes monitoring:

  1. Exercise: Deploy a stateful application (like a database) and create a custom dashboard for it.
  2. Exercise: Set up alerts for different severity levels based on resource utilization thresholds.
  3. Exercise: Implement log monitoring alongside metrics using Loki and Grafana.

For further learning, explore:

  • Prometheus Query Language (PromQL) for more advanced queries
  • Service meshes like Istio for more detailed service-level monitoring
  • Custom metrics using the Prometheus client libraries

