Kubernetes Node Maintenance
Introduction
In a Kubernetes cluster, nodes are the worker machines that run your containerized applications. Just like any other infrastructure component, nodes require regular maintenance — whether it's for security updates, kernel upgrades, hardware repairs, or resource scaling. However, performing maintenance on nodes that are actively running workloads can lead to application downtime if not handled correctly.
This guide will walk you through the proper procedures for performing maintenance on Kubernetes nodes while minimizing disruption to your applications.
Why Node Maintenance Matters
Proper node maintenance is crucial for several reasons:
- Security: Regular updates help protect your cluster from vulnerabilities
- Stability: Kernel and OS updates can improve node performance and reliability
- Resource management: Scaling up or down requires adding or removing nodes
- Hardware maintenance: Physical servers occasionally need repairs or replacements
Kubernetes provides specific tools to help you safely maintain nodes without disrupting your workloads.
Prerequisites
Before you begin, make sure you have:
- A running Kubernetes cluster
- kubectl installed and configured to communicate with your cluster
- Appropriate permissions to manage nodes in your cluster
Understanding Node States
Kubernetes nodes can be in different states that affect how workloads are scheduled:
- Ready: The node is healthy and available to accept new pods
- NotReady: The node is unhealthy and cannot accept new pods
- SchedulingDisabled: The node is cordoned (marked as unschedulable) but existing pods continue to run
Let's see how to check the current state of nodes in your cluster:
# List all nodes with their status
kubectl get nodes
# Output
NAME STATUS ROLES AGE VERSION
worker-node-1 Ready <none> 45d v1.26.5
worker-node-2 Ready <none> 45d v1.26.5
worker-node-3 Ready <none> 45d v1.26.5
You can get more detailed information about a specific node with:
kubectl describe node worker-node-1
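If you only need a quick programmatic check of a node's health, you can pull just the Ready condition with a JSONPath query; a small sketch (adjust the node name to your own):
# Print only the Ready condition of a node ("True" means healthy)
kubectl get node worker-node-1 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'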
Key Node Maintenance Operations
1. Cordoning a Node
Cordoning a node marks it as unschedulable, which prevents new pods from being scheduled on it. This is the first step in the maintenance process.
# Mark a node as unschedulable
kubectl cordon worker-node-1
# Verify the node is cordoned
kubectl get nodes
After running the cordon command, you'll see SchedulingDisabled in the STATUS column:
NAME STATUS ROLES AGE VERSION
worker-node-1 Ready,SchedulingDisabled <none> 45d v1.26.5
worker-node-2 Ready <none> 45d v1.26.5
worker-node-3 Ready <none> 45d v1.26.5
Important: Cordoning a node only prevents new pods from being scheduled. Existing pods continue to run.
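Under the hood, cordoning simply sets the node's spec.unschedulable field to true. As a sketch of the equivalent (kubectl cordon remains the recommended way), you could patch the field directly:
# Equivalent effect to "kubectl cordon": mark the node unschedulable via a patch
kubectl patch node worker-node-1 -p '{"spec":{"unschedulable":true}}'
# And the reverse, equivalent to "kubectl uncordon"
kubectl patch node worker-node-1 -p '{"spec":{"unschedulable":false}}'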
2. Draining a Node
Draining a node safely evicts all pods from the node, allowing them to be rescheduled on other nodes. This is the crucial step before performing actual maintenance.
# Drain a node (will also cordon if not already done)
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data
Let's understand the flags:
- --ignore-daemonsets: DaemonSet pods are designed to run on all nodes and will be recreated after maintenance, so they can safely be ignored during draining.
- --delete-emptydir-data: Allows eviction of pods using emptyDir volumes. Data in these volumes will be lost!
You'll see output showing the pods being evicted:
node/worker-node-1 cordoned
evicting pod default/nginx-deployment-66b6c48dd5-7bqxz
evicting pod kube-system/coredns-74ff55c5b-7vxsh
...
node/worker-node-1 drained
Warning: Pods using local storage (emptyDir) will lose their data when evicted. Plan accordingly!
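kubectl drain also accepts flags for previewing and bounding the operation. A more cautious invocation might look like this (a sketch; tune the values to your workloads):
# Preview which pods would be evicted without actually evicting them
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data --dry-run=client
# Bound the total drain time and the per-pod termination grace period
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data --timeout=300s --grace-period=60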
3. Performing Maintenance
With the node drained, you can now safely perform maintenance operations. This might include:
# SSH into the node
ssh user@worker-node-1
# Update the system
sudo apt update && sudo apt upgrade -y
# Restart the node if necessary
sudo reboot
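If the maintenance involved a reboot, you can wait for the node to report Ready again before proceeding; for example:
# Block until the node reports the Ready condition (or the timeout expires)
kubectl wait --for=condition=Ready node/worker-node-1 --timeout=10m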
4. Uncordoning a Node
Once maintenance is complete and the node is back online, make it schedulable again:
# Mark the node as schedulable
kubectl uncordon worker-node-1
# Verify the node is uncordoned
kubectl get nodes
The node will return to the normal Ready state:
NAME STATUS ROLES AGE VERSION
worker-node-1 Ready <none> 45d v1.26.5
worker-node-2 Ready <none> 45d v1.26.5
worker-node-3 Ready <none> 45d v1.26.5
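To watch workloads land on the node again, you can list pods by node. Note that Kubernetes does not rebalance existing pods automatically; the node fills up as new pods are created or other pods are rescheduled:
# List pods from all namespaces currently scheduled on this node
kubectl get pods -A -o wide --field-selector spec.nodeName=worker-node-1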
Pod Disruption Budgets (PDBs)
To ensure high availability during node maintenance, you should configure Pod Disruption Budgets (PDBs) for your applications. PDBs define how many replicas of an application can be down at once.
Here's an example PDB that ensures at least 2 replicas of your application remain available at all times:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
You can apply this using:
kubectl apply -f pdb.yaml
With a PDB in place, the kubectl drain command will respect these constraints and ensure your application maintains minimum availability.
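You can also check how much disruption a PDB currently allows before starting a drain; with the example PDB above:
# ALLOWED DISRUPTIONS shows how many pods can be evicted right now
kubectl get pdb app-pdb
# Detailed view, including currently healthy and desired healthy pods
kubectl describe pdb app-pdb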
Automating Node Maintenance
For larger clusters, you might want to automate maintenance with tools like:
- kured (Kubernetes Reboot Daemon): Automatically coordinates node reboots after updates
- Cluster Autoscaler: Automatically adjusts the size of your cluster
- Node Problem Detector: Identifies node issues automatically
Here's a quick example of setting up kured:
# Install kured using Helm
helm repo add kured https://weaveworks.github.io/kured
helm install kured kured/kured --namespace kube-system
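By default, kured watches for the /var/run/reboot-required sentinel file that Debian/Ubuntu package updates create when a reboot is needed, then drains and reboots nodes one at a time. A quick way to check whether a node is waiting for a reboot (assuming a Debian/Ubuntu node):
# Present only when installed updates require a reboot
ls /var/run/reboot-required
# Lists the packages that triggered the reboot requirement (if present)
cat /var/run/reboot-required.pkgs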
Real-World Maintenance Workflow
Let's walk through a complete node maintenance scenario for replacing a failing disk:
1. Identify the node that needs maintenance:
   kubectl get nodes
   kubectl describe node worker-node-1
2. Cordon the node to prevent new workloads:
   kubectl cordon worker-node-1
3. Drain the node safely:
   kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data
4. Perform the hardware maintenance:
   # SSH to the node
   ssh admin@worker-node-1
   # Shut down the node
   sudo shutdown -h now
   # Replace the disk physically, then boot the node back up
5. Verify the node is healthy:
   kubectl get nodes
   kubectl describe node worker-node-1
6. Make the node schedulable again:
   kubectl uncordon worker-node-1
7. Monitor as workloads return to the node:
   kubectl get pods -o wide
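For repeatable maintenance, the cordon/drain/uncordon steps can be wrapped in a small shell script. Here's a minimal sketch (not production-ready; the node name is passed as an argument and the maintenance itself is still done manually):
#!/usr/bin/env bash
# Minimal node maintenance wrapper (sketch)
# Usage: ./maintain-node.sh worker-node-1
set -euo pipefail

NODE="$1"

echo "Cordoning ${NODE}..."
kubectl cordon "${NODE}"

echo "Draining ${NODE}..."
kubectl drain "${NODE}" --ignore-daemonsets --delete-emptydir-data --timeout=600s

read -rp "Perform maintenance on ${NODE}, then press Enter to continue..."

echo "Waiting for ${NODE} to become Ready..."
kubectl wait --for=condition=Ready "node/${NODE}" --timeout=15m

echo "Uncordoning ${NODE}..."
kubectl uncordon "${NODE}"
echo "Done."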
Node Maintenance Best Practices
- Plan maintenance during low-traffic periods when possible
- Always use PDBs for critical applications to maintain availability
- Perform rolling maintenance on one node at a time
- Monitor cluster capacity before draining nodes to ensure there are enough resources for rescheduled pods (a quick capacity check follows this list)
- Have alerting in place to notify you of any issues during maintenance
- Document your maintenance procedures for consistency and knowledge sharing
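To check remaining capacity before a drain, compare what's already requested against what's allocatable, and look at live usage if metrics-server is installed (a quick sketch):
# Requested vs. allocatable resources on each node
kubectl describe nodes | grep -A 7 "Allocated resources"
# Live CPU/memory usage (requires metrics-server)
kubectl top nodes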
Troubleshooting Common Issues
Pods Won't Evict During Drain
If pods are stuck during eviction, check:
# Get details of the pod
kubectl describe pod stuck-pod-name
# Check for PDB issues
kubectl get pdb
kubectl describe pdb my-pdb
Common causes include:
- Insufficient resources on other nodes
- Pod Disruption Budget constraints
- StatefulSets with local storage
- Pods with nodeName directly specified
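If a pod is still stuck after ruling these out and you accept the risk (any unsaved data is lost), a last-resort option is to force-delete it; a sketch using the pod name from above:
# Skip graceful termination and remove the pod object immediately (use with care)
kubectl delete pod stuck-pod-name --grace-period=0 --force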
Node Won't Return to Ready State
If a node doesn't return to Ready after maintenance:
# Check node status
kubectl describe node worker-node-1
# Check kubelet logs on the node
ssh admin@worker-node-1
sudo journalctl -u kubelet
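Common remedies once you're on the node include restarting the kubelet or the container runtime; a sketch assuming systemd and containerd:
# Restart the kubelet and confirm it comes back healthy
sudo systemctl restart kubelet
sudo systemctl status kubelet
# If the container runtime is the problem (assuming containerd)
sudo systemctl restart containerd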
Summary
Proper node maintenance is essential for keeping your Kubernetes cluster healthy, secure, and performant. By following the steps outlined in this guide, you can perform maintenance tasks with minimal disruption to your applications:
- Cordon the node to prevent new workloads
- Drain the node to safely evict existing pods
- Perform the necessary maintenance
- Uncordon the node to make it schedulable again
Remember that Kubernetes was designed to handle node failures and maintenance gracefully, but proper preparation and procedures are still necessary to ensure smooth operations.
Additional Resources
- Kubernetes Official Documentation on Draining Nodes
- Pod Disruption Budgets
- Kured - Kubernetes Reboot Daemon
Practice Exercises
- Set up a test cluster and practice cordoning and draining nodes
- Create PDBs for a sample application and observe how they affect the drain process
- Automate a node maintenance workflow using shell scripts
- Simulate a node failure and practice recovery procedures