Kubernetes Node Maintenance
Introduction
In a Kubernetes cluster, nodes are the worker machines that run your containerized applications. Just like any other infrastructure component, nodes require regular maintenance — whether it's for security updates, kernel upgrades, hardware repairs, or resource scaling. However, performing maintenance on nodes that are actively running workloads can lead to application downtime if not handled correctly.
This guide will walk you through the proper procedures for performing maintenance on Kubernetes nodes while minimizing disruption to your applications.
Why Node Maintenance Matters
Proper node maintenance is crucial for several reasons:
- Security: Regular updates help protect your cluster from vulnerabilities
- Stability: Kernel and OS updates can improve node performance and reliability
- Resource management: Scaling up or down requires adding or removing nodes
- Hardware maintenance: Physical servers occasionally need repairs or replacements
Kubernetes provides specific tools to help you safely maintain nodes without disrupting your workloads.
Prerequisites
Before you begin, make sure you have:
- A running Kubernetes cluster
- kubectl installed and configured to communicate with your cluster
- Appropriate permissions to manage nodes in your cluster
Understanding Node States
Kubernetes nodes can be in different states that affect how workloads are scheduled:
- Ready: The node is healthy and available to accept new pods
- NotReady: The node is unhealthy and cannot accept new pods
- SchedulingDisabled: The node is cordoned (marked as unschedulable) but existing pods continue to run
Let's see how to check the current state of nodes in your cluster:
# List all nodes with their status
kubectl get nodes
# Output
NAME STATUS ROLES AGE VERSION
worker-node-1 Ready <none> 45d v1.26.5
worker-node-2 Ready <none> 45d v1.26.5
worker-node-3 Ready <none> 45d v1.26.5
You can get more detailed information about a specific node with:
kubectl describe node worker-node-1
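If you only need a quick programmatic check of a node's health, you can pull just the Ready condition with a JSONPath query; a small sketch (adjust the node name to your own):
# Print only the Ready condition of a node ("True" means healthy)
kubectl get node worker-node-1 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'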
Key Node Maintenance Operations
1. Cordoning a Node
Cordoning a node marks it as unschedulable, which prevents new pods from being scheduled on it. This is the first step in the maintenance process.
# Mark a node as unschedulable
kubectl cordon worker-node-1
# Verify the node is cordoned
kubectl get nodes
After running the cordon command, you'll see SchedulingDisabled in the STATUS column:
NAME STATUS ROLES AGE VERSION
worker-node-1 Ready,SchedulingDisabled <none> 45d v1.26.5
worker-node-2 Ready <none> 45d v1.26.5
worker-node-3 Ready <none> 45d v1.26.5
Important: Cordoning a node only prevents new pods from being scheduled. Existing pods continue to run.
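Under the hood, cordoning simply sets the node's spec.unschedulable field to true. As a sketch of the equivalent (kubectl cordon remains the recommended way), you could patch the field directly:
# Equivalent effect to "kubectl cordon": mark the node unschedulable via a patch
kubectl patch node worker-node-1 -p '{"spec":{"unschedulable":true}}'
# And the reverse, equivalent to "kubectl uncordon"
kubectl patch node worker-node-1 -p '{"spec":{"unschedulable":false}}'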
2. Draining a Node
Draining a node safely evicts all pods from the node, allowing them to be rescheduled on other nodes. This is the crucial step before performing actual maintenance.
# Drain a node (will also cordon if not already done)
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data
Let's understand the flags:
- --ignore-daemonsets: DaemonSet pods are designed to run on all nodes and will be recreated after maintenance, so they can safely be ignored during draining.
- --delete-emptydir-data: Allows eviction of pods using emptyDir volumes. Data in these volumes will be lost!
You'll see output showing the pods being evicted:
node/worker-node-1 cordoned
evicting pod default/nginx-deployment-66b6c48dd5-7bqxz
evicting pod kube-system/coredns-74ff55c5b-7vxsh
...
node/worker-node-1 drained
Warning: Pods using local storage (emptyDir) will lose their data when evicted. Plan accordingly!
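kubectl drain also accepts flags for previewing and bounding the operation. A more cautious invocation might look like this (a sketch; tune the values to your workloads):
# Preview which pods would be evicted without actually evicting them
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data --dry-run=client
# Bound the total drain time and the per-pod termination grace period
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data --timeout=300s --grace-period=60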
3. Performing Maintenance
With the node drained, you can now safely perform maintenance operations. This might include:
# SSH into the node
ssh user@worker-node-1
# Update the system
sudo apt update && sudo apt upgrade -y
# Restart the node if necessary
sudo reboot
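If the maintenance involved a reboot, you can wait for the node to report Ready again before proceeding; for example:
# Block until the node reports the Ready condition (or the timeout expires)
kubectl wait --for=condition=Ready node/worker-node-1 --timeout=10m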
4. Uncordoning a Node
Once maintenance is complete and the node is back online, make it schedulable again:
# Mark the node as schedulable
kubectl uncordon worker-node-1
# Verify the node is uncordoned
kubectl get nodes
The node will return to the normal Ready state:
NAME STATUS ROLES AGE VERSION
worker-node-1 Ready <none> 45d v1.26.5
worker-node-2 Ready <none> 45d v1.26.5
worker-node-3 Ready <none> 45d v1.26.5
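To watch workloads land on the node again, you can list pods by node. Note that Kubernetes does not rebalance existing pods automatically; the node fills up as new pods are created or other pods are rescheduled:
# List pods from all namespaces currently scheduled on this node
kubectl get pods -A -o wide --field-selector spec.nodeName=worker-node-1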
Pod Disruption Budgets (PDBs)
To ensure high availability during node maintenance, you should configure Pod Disruption Budgets (PDBs) for your applications. PDBs define how many replicas of an application can be down at once.
Here's an example PDB that ensures at least 2 replicas of your application remain available at all times:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
You can apply this using:
kubectl apply -f pdb.yaml
With a PDB in place, the kubectl drain command will respect these constraints and ensure your application maintains minimum availability.
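You can also check how much disruption a PDB currently allows before starting a drain; with the example PDB above:
# ALLOWED DISRUPTIONS shows how many pods can be evicted right now
kubectl get pdb app-pdb
# Detailed view, including currently healthy and desired healthy pods
kubectl describe pdb app-pdb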
Automating Node Maintenance
For larger clusters, you might want to automate maintenance with tools like:
- kured (Kubernetes Reboot Daemon): Automatically coordinates node reboots after updates
- Cluster Autoscaler: Automatically adjusts the size of your cluster
- Node Problem Detector: Identifies node issues automatically
Here's a quick example of setting up kured:
# Install kured using Helm
helm repo add kured https://weaveworks.github.io/kured
helm install kured kured/kured --namespace kube-system
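By default, kured watches for the /var/run/reboot-required sentinel file that Debian/Ubuntu package updates create when a reboot is needed, then drains and reboots nodes one at a time. A quick way to check whether a node is waiting for a reboot (assuming a Debian/Ubuntu node):
# Present only when installed updates require a reboot
ls /var/run/reboot-required
# Lists the packages that triggered the reboot requirement (if present)
cat /var/run/reboot-required.pkgs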
Real-World Maintenance Workflow
Let's walk through a complete node maintenance scenario for replacing a failing disk:
1. Identify the node that needs maintenance:
   kubectl get nodes
   kubectl describe node worker-node-1
2. Cordon the node to prevent new workloads:
   kubectl cordon worker-node-1
3. Drain the node safely:
   kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data
4. Perform the hardware maintenance:
   # SSH to the node
   ssh admin@worker-node-1
   # Shut down the node
   sudo shutdown -h now
   # Replace the disk physically, then boot the node back up
5. Verify the node is healthy:
   kubectl get nodes
   kubectl describe node worker-node-1
6. Make the node schedulable again:
   kubectl uncordon worker-node-1
7. Monitor as workloads return to the node:
   kubectl get pods -o wide
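For repeatable maintenance, the cordon/drain/uncordon steps can be wrapped in a small shell script. Here's a minimal sketch (not production-ready; the node name is passed as an argument and the maintenance itself is still done manually):
#!/usr/bin/env bash
# Minimal node maintenance wrapper (sketch)
# Usage: ./maintain-node.sh worker-node-1
set -euo pipefail

NODE="$1"

echo "Cordoning ${NODE}..."
kubectl cordon "${NODE}"

echo "Draining ${NODE}..."
kubectl drain "${NODE}" --ignore-daemonsets --delete-emptydir-data --timeout=600s

read -rp "Perform maintenance on ${NODE}, then press Enter to continue..."

echo "Waiting for ${NODE} to become Ready..."
kubectl wait --for=condition=Ready "node/${NODE}" --timeout=15m

echo "Uncordoning ${NODE}..."
kubectl uncordon "${NODE}"
echo "Done."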
Node Maintenance Best Practices
- Plan maintenance during low-traffic periods when possible
- Always use PDBs for critical applications to maintain availability
- Perform rolling maintenance on one node at a time
- Monitor cluster capacity before draining nodes to ensure there are enough resources for rescheduled pods (a quick capacity check follows this list)
- Have alerting in place to notify you of any issues during maintenance
- Document your maintenance procedures for consistency and knowledge sharing
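To check remaining capacity before a drain, compare what's already requested against what's allocatable, and look at live usage if metrics-server is installed (a quick sketch):
# Requested vs. allocatable resources on each node
kubectl describe nodes | grep -A 7 "Allocated resources"
# Live CPU/memory usage (requires metrics-server)
kubectl top nodes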
Troubleshooting Common Issues
Pods Won't Evict During Drain
If pods are stuck during eviction, check:
# Get details of the pod
kubectl describe pod stuck-pod-name
# Check for PDB issues
kubectl get pdb
kubectl describe pdb my-pdb
Common causes include:
- Insufficient resources on other nodes
- Pod Disruption Budget constraints
- StatefulSets with local storage
- Pods with nodeName directly specified
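If a pod is still stuck after ruling these out and you accept the risk (any unsaved data is lost), a last-resort option is to force-delete it; a sketch using the pod name from above:
# Skip graceful termination and remove the pod object immediately (use with care)
kubectl delete pod stuck-pod-name --grace-period=0 --force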
Node Won't Return to Ready State
If a node doesn't return to Ready after maintenance:
# Check node status
kubectl describe node worker-node-1
# Check kubelet logs on the node
ssh admin@worker-node-1
sudo journalctl -u kubelet
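Common remedies once you're on the node include restarting the kubelet or the container runtime; a sketch assuming systemd and containerd:
# Restart the kubelet and confirm it comes back healthy
sudo systemctl restart kubelet
sudo systemctl status kubelet
# If the container runtime is the problem (assuming containerd)
sudo systemctl restart containerd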
Summary
Proper node maintenance is essential for keeping your Kubernetes cluster healthy, secure, and performant. By following the steps outlined in this guide, you can perform maintenance tasks with minimal disruption to your applications:
- Cordon the node to prevent new workloads
- Drain the node to safely evict existing pods
- Perform the necessary maintenance
- Uncordon the node to make it schedulable again
Remember that Kubernetes was designed to handle node failures and maintenance gracefully, but proper preparation and procedures are still necessary to ensure smooth operations.
Additional Resources
- Kubernetes Official Documentation on Draining Nodes
- Pod Disruption Budgets
- Kured - Kubernetes Reboot Daemon
Practice Exercises
- Set up a test cluster and practice cordoning and draining nodes
- Create PDBs for a sample application and observe how they affect the drain process
- Automate a node maintenance workflow using shell scripts
- Simulate a node failure and practice recovery procedures