Docker Swarm High Availability
Introduction
High Availability (HA) is a critical feature in any production-grade container orchestration system. In the context of Docker Swarm, high availability refers to the ability of the system to remain operational even when individual nodes or components fail.
Docker Swarm, Docker's native clustering and orchestration solution, provides several built-in features to ensure high availability. This guide will explore how to design, implement, and maintain a highly available Docker Swarm cluster, minimizing downtime and ensuring your applications remain accessible even during infrastructure failures.
Understanding High Availability Concepts
Before diving into the technical details, let's understand some key concepts related to high availability in Docker Swarm:
- Fault Tolerance: The ability of a system to continue operating properly even when some of its components fail
- Redundancy: Duplication of critical components to provide backup in case of failure
- Load Balancing: Distribution of workloads across multiple nodes to optimize resource usage
- Failover: The automatic switching to a redundant system when the primary system fails
- Quorum: The minimum number of manager nodes that must be available for the cluster to function properly
Docker Swarm Architecture for High Availability
A Docker Swarm cluster consists of two types of nodes:
- Manager Nodes: Responsible for orchestrating the cluster, maintaining the desired state, and scheduling services
- Worker Nodes: Execute the containers assigned to them by manager nodes
Manager Node Redundancy
For high availability, it's crucial to have multiple manager nodes. Docker Swarm uses the Raft consensus algorithm to maintain cluster state consistency across manager nodes.
Key considerations for manager nodes:
- Deploy an odd number of manager nodes (3, 5, or 7)
- Distribute manager nodes across different availability zones or failure domains
- Manager nodes should have stable network connectivity between them
Manager quorum requirements:
| Total Manager Nodes | Quorum (Majority) | Maximum Failures Tolerated |
|---|---|---|
| 1 | 1 | 0 |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
While you can technically run 2, 4, or 6 manager nodes, an even number tolerates no more failures than the next lower odd number while increasing the quorum size, so it adds Raft overhead without improving fault tolerance.
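You can check at any time how many managers the cluster has and whether they are reachable; for example, on a manager node:
# Number of managers as reported by the engine
docker info --format '{{.Swarm.Managers}}'
# Reachability of each manager (MANAGER STATUS column shows Leader, Reachable, or Unreachable)
docker node ls --filter "role=manager"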
Setting Up a Highly Available Swarm Cluster
Let's walk through the process of creating a highly available Docker Swarm cluster with multiple manager and worker nodes.
Prerequisites
- Multiple servers (physical or virtual) with Docker installed
- Network connectivity between all nodes
- Basic understanding of Docker and command-line operations
Step 1: Initialize the Swarm on the First Manager Node
# On the first manager node
docker swarm init --advertise-addr <MANAGER-IP>
Example output:
Swarm initialized: current node (dxn1zf6l61qsb1josjja83ngz) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-49nj1cmql0jkz5s954yi3oex3nedyz0fb0xx14ie39trti4wxv-8vxv8rssmk743ojnwacrr2e7c 192.168.99.100:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
Step 2: Add More Manager Nodes
Generate a manager join token on the first manager:
# On the first manager node
docker swarm join-token manager
Example output:
To add a manager to this swarm, run the following command:
docker swarm join --token SWMTKN-1-49nj1cmql0jkz5s954yi3oex3nedyz0fb0xx14ie39trti4wxv-2tviqzd7f9f1u3lb0s1ec7t67 192.168.99.100:2377
Run the provided command on each server that should become a manager:
# On each additional manager node
docker swarm join --token SWMTKN-1-49nj1cmql0jkz5s954yi3oex3nedyz0fb0xx14ie39trti4wxv-2tviqzd7f9f1u3lb0s1ec7t67 192.168.99.100:2377
Step 3: Add Worker Nodes
Generate a worker join token on any manager:
# On any manager node
docker swarm join-token worker
Run the provided command on each server that should become a worker:
# On each worker node
docker swarm join --token SWMTKN-1-49nj1cmql0jkz5s954yi3oex3nedyz0fb0xx14ie39trti4wxv-8vxv8rssmk743ojnwacrr2e7c 192.168.99.100:2377
Step 4: Verify the Swarm Cluster
Check the status of your Swarm cluster:
# On any manager node
docker node ls
Example output:
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
dxn1zf6l61qsb1josjja83ngz * manager1 Ready Active Leader 20.10.8
6ltenfc51yfr1la31kp4lr02v manager2 Ready Active Reachable 20.10.8
8jv3tlfpvyej4f2dnuqcm7vge manager3 Ready Active Reachable 20.10.8
a37md5239ny976w3qyydnr45t worker1 Ready Active 20.10.8
b96xc9mknqr3fctx9b7welxr5 worker2 Ready Active 20.10.8
High Availability Service Deployment
Now that we have a highly available swarm infrastructure, let's look at how to deploy services with high availability in mind.
Replicated Services for Redundancy
Deploy services with multiple replicas to ensure redundancy:
docker service create \
--name webapp \
--replicas 6 \
--publish 80:80 \
nginx:latest
Using Placement Constraints for Availability
Distribute service instances across nodes to maximize availability:
docker service create \
--name webapp \
--replicas 6 \
--publish 80:80 \
--placement-pref 'spread=node.labels.zone' \
nginx:latest
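Note that the spread preference above relies on each node carrying a node.labels.zone label. If your nodes are not labeled yet, add the labels from a manager first (the zone names below are placeholders):
# On a manager node, record each node's availability zone as a label
docker node update --label-add zone=zone-a worker1
docker node update --label-add zone=zone-b worker2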
Rolling Updates for Zero Downtime
Configure services for rolling updates to avoid downtime during deployments:
docker service create \
--name webapp \
--replicas 6 \
--publish 80:80 \
--update-parallelism 2 \
--update-delay 10s \
nginx:latest
Later, when updating:
docker service update --image nginx:1.21 webapp
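While the update runs, you can watch old tasks being replaced by new ones and, if something goes wrong, roll back to the previous service definition:
# Watch the rolling update progress (old and new task versions are listed)
docker service ps webapp
# Revert to the previous version of the service if the update misbehaves
docker service update --rollback webapp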
Advanced High Availability Configuration
Let's explore some advanced techniques for ensuring high availability in Docker Swarm.
Implementing Health Checks
Add health checks to containers to automatically detect and handle failures:
version: '3.8'
services:
  webapp:
    image: nginx:latest
    deploy:
      replicas: 4
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 1m
      timeout: 10s
      retries: 3
      start_period: 30s
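A roughly equivalent health check can be attached when creating a service from the CLI (this sketch assumes curl is available inside the image):
docker service create \
  --name webapp \
  --replicas 4 \
  --health-cmd "curl -f http://localhost" \
  --health-interval 1m \
  --health-timeout 10s \
  --health-retries 3 \
  --health-start-period 30s \
  nginx:latest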
Setting Up Manager Node Drain
When performing maintenance on a node, drain it first so its tasks are rescheduled onto other nodes. Keep in mind that a drained manager still takes part in the Raft quorum; if maintenance will take a manager offline for an extended period, consider demoting it first:
# Put the node in drain state
docker node update --availability drain manager2
# Perform maintenance...
# Make the node active again
docker node update --availability active manager2
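Before starting maintenance, confirm that no tasks are still running on the drained node:
# Tasks previously on the node should now show a desired state of Shutdown
docker node ps manager2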
Configuring Backup and Restore for Swarm State
Regularly back up the Raft data in /var/lib/docker/swarm/ so you can recover from the loss of all managers. Stop Docker on the manager you are backing up so the data is consistent; with three or more managers, the cluster retains quorum while one is stopped:
# On a manager node, stop Docker, back up the Swarm state, then restart Docker
systemctl stop docker
tar -czvf swarm-backup.tar.gz /var/lib/docker/swarm/
systemctl start docker
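If you want to keep timestamped backups (for example from a cron job), a minimal sketch is shown below; the backup directory is an assumed location:
#!/bin/bash
# Minimal swarm state backup script for one manager node (run as root)
set -e
BACKUP_DIR=/opt/swarm-backups   # assumed backup location
mkdir -p "$BACKUP_DIR"
systemctl stop docker           # stop the engine so the Raft data is consistent
tar -czvf "$BACKUP_DIR/swarm-$(date +%Y%m%d-%H%M%S).tar.gz" /var/lib/docker/swarm/
systemctl start docker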
To restore on a new node (after losing all managers):
# Stop Docker
systemctl stop docker
# Remove any existing swarm state and restore the backup
# (the archive stores paths relative to /, so extract from /)
rm -rf /var/lib/docker/swarm/
tar -xvzf swarm-backup.tar.gz -C /
# Start Docker
systemctl start docker
# Re-initialize the swarm so this node does not try to contact the old managers
docker swarm init --force-new-cluster
Real-World Example: Highly Available Web Application
Let's put everything together by deploying a complete, highly available web application stack with a front-end, API, and database.
Create a Docker Compose File for the Stack
Create a file named webapp-stack.yml:
version: '3.8'

services:
  frontend:
    image: nginx:latest
    ports:
      - "80:80"
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
      restart_policy:
        condition: on-failure
        max_attempts: 3
      placement:
        preferences:
          - spread: node.labels.zone
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  api:
    image: myapp/api:latest
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
        max_attempts: 3
      placement:
        preferences:
          - spread: node.labels.zone
    environment:
      - DB_HOST=db
      - DB_USER=myuser
      - DB_PASSWORD_FILE=/run/secrets/db_password
    secrets:
      - db_password

  db:
    image: mysql:8.0
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager
          - node.labels.db == true
      restart_policy:
        condition: on-failure
    environment:
      - MYSQL_ROOT_PASSWORD_FILE=/run/secrets/db_root_password
      - MYSQL_DATABASE=myapp
      - MYSQL_USER=myuser
      - MYSQL_PASSWORD_FILE=/run/secrets/db_password
    volumes:
      - db-data:/var/lib/mysql
    secrets:
      - db_root_password
      - db_password

volumes:
  db-data:
    driver: local

secrets:
  db_password:
    external: true
  db_root_password:
    external: true
Create the Required Secrets
# Create secrets for the database passwords (printf avoids storing a trailing newline)
printf '%s' "mysecretpassword" | docker secret create db_password -
printf '%s' "myrootsecretpassword" | docker secret create db_root_password -
Deploy the Stack
docker stack deploy -c webapp-stack.yml myapp
Monitor the Deployment
# Check the stack services
docker stack services myapp
# Check service logs
docker service logs myapp_frontend
docker service logs myapp_api
docker service logs myapp_db
Testing High Availability
To verify that your high availability setup works as expected, you should perform the following tests:
Test 1: Node Failure Simulation
Shut down a worker node and observe how the services are rescheduled:
# On the worker node
sudo shutdown -h now
Then check the service status on a manager:
docker service ls
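Within a short time the replicas that were running on the failed worker should be rescheduled onto the remaining nodes. You can follow this at the task level (the service name assumes the myapp stack deployed earlier):
# Failed tasks show a desired state of Shutdown; replacements appear as Running on other nodes
docker service ps myapp_frontend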
Test 2: Manager Node Failure
Shut down a manager node (not the leader) and observe the cluster behavior:
# First, identify the leader
docker node ls
# On a non-leader manager node
sudo shutdown -h now
Check the swarm status:
docker node ls
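The stopped manager should be reported as Down with a MANAGER STATUS of Unreachable, while the remaining managers keep quorum and continue to accept commands. You can also query a single node's manager reachability directly (the node name is an example):
# Reports "reachable" or "unreachable" for a manager node
docker node inspect --format '{{ .ManagerStatus.Reachability }}' manager2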
Test 3: Network Partition Test
Isolate a node from the rest of the swarm using firewall rules:
# On a specific node
sudo iptables -A INPUT -p tcp --dport 2377 -j DROP
sudo iptables -A OUTPUT -p tcp --dport 2377 -j DROP
After testing, restore connectivity:
sudo iptables -D INPUT -p tcp --dport 2377 -j DROP
sudo iptables -D OUTPUT -p tcp --dport 2377 -j DROP
Best Practices for Docker Swarm High Availability
- Always use odd numbers of manager nodes (3, 5, or 7)
- Distribute manager nodes across different failure domains (racks, availability zones, etc.)
- Keep the manager count small (typically 3-5) for performance reasons
- Backup the swarm state regularly
- Label your nodes for better placement control
- Plan for network failures by ensuring connectivity between manager nodes
- Implement proper monitoring of your swarm cluster
- Use health checks for all critical services
- Configure resource constraints to prevent resource exhaustion (see the example after this list)
- Practice disaster recovery scenarios regularly
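As an example of the resource constraints mentioned above, reservations and limits can be set per service; the values below are placeholders to adapt to your workload:
docker service create \
  --name webapp \
  --replicas 6 \
  --reserve-cpu 0.25 \
  --reserve-memory 128M \
  --limit-cpu 0.5 \
  --limit-memory 256M \
  nginx:latest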
Monitoring and Maintenance
Setting Up Monitoring
A complete HA solution requires proper monitoring. Consider adding:
- Prometheus for metrics collection
- Grafana for visualization
- Node exporter for host-level metrics
- cAdvisor for container metrics
Example Prometheus configuration for Docker Swarm:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    deploy:
      placement:
        constraints:
          - node.role == manager
Regular Maintenance Tasks
- Rotate the swarm CA certificate when your security policy requires it (node certificates expire after 90 days by default and are renewed automatically):
docker swarm ca --rotate
- Check swarm health:
docker node ls
docker service ls
- Update Docker Engine on nodes (one at a time):
# On each node, one at a time, starting with workers
docker node update --availability drain <NODE-ID>
# Perform update
docker node update --availability active <NODE-ID>
Troubleshooting Common Issues
Lost Quorum
If you lose quorum (a majority of the manager nodes are unavailable), existing tasks keep running but the swarm can no longer schedule or update services. The preferred recovery is to bring the failed managers back online; if that is impossible, you can force a new cluster from a surviving manager:
# On the remaining manager
docker swarm init --force-new-cluster
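The --force-new-cluster flag re-creates the swarm with this node as the only manager while keeping existing services and worker nodes. Afterwards, remove the stale entries for the lost managers and add managers back until you again have an odd number; a rough sequence (node IDs are placeholders) is:
# Remove stale entries for the lost managers (demote first if they are still listed as managers)
docker node demote <OLD-MANAGER-ID>
docker node rm <OLD-MANAGER-ID>
# Generate a manager join token and join (or promote) replacement managers
docker swarm join-token manager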
Network Connectivity Issues
Check connectivity between nodes:
# Test connectivity to the swarm ports
nc -zv <manager-ip> 2377   # cluster management traffic (TCP)
nc -zv <manager-ip> 7946   # node-to-node communication (TCP)
nc -zvu <manager-ip> 7946  # node-to-node communication (UDP)
nc -zvu <manager-ip> 4789  # overlay network / VXLAN traffic (UDP)
Service Deployment Failures
Check service logs for details:
docker service logs <service-name>
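Scheduling problems (for example, no node satisfying a placement constraint) do not appear in the container logs; inspect the task list and its error column instead:
# Show each task's state and full error message
docker service ps --no-trunc <service-name>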
Summary
High availability in Docker Swarm is achieved through:
- Multiple manager nodes using Raft consensus
- Service replication across worker nodes
- Placement strategies to distribute workloads
- Health checks to detect and handle failures
- Rolling updates for zero-downtime deployments
- Regular backups of the swarm state
- Comprehensive monitoring and maintenance
By following the practices outlined in this guide, you can build a robust, highly available Docker Swarm cluster capable of withstanding various failure scenarios while maintaining service availability.
Exercises
- Set up a 3-manager, 2-worker Docker Swarm cluster on local virtual machines or cloud providers
- Deploy a web application with 5 replicas and verify load balancing
- Simulate failure of a worker node and observe how Docker reschedules containers
- Implement health checks for a service and test failover when the health check fails
- Create a backup of your swarm state and practice restoring it on a new cluster
- Design a disaster recovery plan for a critical application running on Docker Swarm