Docker Swarm High Availability
Introduction
High Availability (HA) is a critical feature in any production-grade container orchestration system. In the context of Docker Swarm, high availability refers to the ability of the system to remain operational even when individual nodes or components fail.
Docker Swarm, Docker's native clustering and orchestration solution, provides several built-in features to ensure high availability. This guide will explore how to design, implement, and maintain a highly available Docker Swarm cluster, minimizing downtime and ensuring your applications remain accessible even during infrastructure failures.
Understanding High Availability Concepts
Before diving into the technical details, let's understand some key concepts related to high availability in Docker Swarm:
- Fault Tolerance: The ability of a system to continue operating properly even when some of its components fail
- Redundancy: Duplication of critical components to provide backup in case of failure
- Load Balancing: Distribution of workloads across multiple nodes to optimize resource usage
- Failover: The automatic switching to a redundant system when the primary system fails
- Quorum: The minimum number of manager nodes that must be available for the cluster to function properly
Docker Swarm Architecture for High Availability
A Docker Swarm cluster consists of two types of nodes:
- Manager Nodes: Responsible for orchestrating the cluster, maintaining the desired state, and scheduling services
- Worker Nodes: Execute the containers assigned to them by manager nodes
Manager Node Redundancy
For high availability, it's crucial to have multiple manager nodes. Docker Swarm uses the Raft consensus algorithm to maintain cluster state consistency across manager nodes.
Key considerations for manager nodes:
- Deploy an odd number of manager nodes (3, 5, or 7)
- Distribute manager nodes across different availability zones or failure domains
- Manager nodes should have stable network connectivity between them
Manager quorum requirements:
| Total Manager Nodes | Quorum (Majority) | Maximum Failures Tolerated |
|---|---|---|
| 1 | 1 | 0 |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
While you can technically run 2, 4, or 6 manager nodes, an even number tolerates no more failures than the next lower odd number while increasing the quorum size, so it adds Raft overhead without improving fault tolerance.
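You can check at any time how many managers the cluster has and whether they are reachable; for example, on a manager node:
# Number of managers as reported by the engine
docker info --format '{{.Swarm.Managers}}'
# Reachability of each manager (MANAGER STATUS column shows Leader, Reachable, or Unreachable)
docker node ls --filter "role=manager"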
Setting Up a Highly Available Swarm Cluster
Let's walk through the process of creating a highly available Docker Swarm cluster with multiple manager and worker nodes.
Prerequisites
- Multiple servers (physical or virtual) with Docker installed
- Network connectivity between all nodes
- Basic understanding of Docker and command-line operations
Step 1: Initialize the Swarm on the First Manager Node
# On the first manager node
docker swarm init --advertise-addr <MANAGER-IP>
Example output:
Swarm initialized: current node (dxn1zf6l61qsb1josjja83ngz) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-49nj1cmql0jkz5s954yi3oex3nedyz0fb0xx14ie39trti4wxv-8vxv8rssmk743ojnwacrr2e7c 192.168.99.100:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
Step 2: Add More Manager Nodes
Generate a manager join token on the first manager:
# On the first manager node
docker swarm join-token manager
Example output:
To add a manager to this swarm, run the following command:
docker swarm join --token SWMTKN-1-49nj1cmql0jkz5s954yi3oex3nedyz0fb0xx14ie39trti4wxv-2tviqzd7f9f1u3lb0s1ec7t67 192.168.99.100:2377
Run the provided command on each server that should become a manager:
# On each additional manager node
docker swarm join --token SWMTKN-1-49nj1cmql0jkz5s954yi3oex3nedyz0fb0xx14ie39trti4wxv-2tviqzd7f9f1u3lb0s1ec7t67 192.168.99.100:2377
Step 3: Add Worker Nodes
Generate a worker join token on any manager:
# On any manager node
docker swarm join-token worker
Run the provided command on each server that should become a worker:
# On each worker node
docker swarm join --token SWMTKN-1-49nj1cmql0jkz5s954yi3oex3nedyz0fb0xx14ie39trti4wxv-8vxv8rssmk743ojnwacrr2e7c 192.168.99.100:2377
Step 4: Verify the Swarm Cluster
Check the status of your Swarm cluster:
# On any manager node
docker node ls
Example output:
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
dxn1zf6l61qsb1josjja83ngz * manager1 Ready Active Leader 20.10.8
6ltenfc51yfr1la31kp4lr02v manager2 Ready Active Reachable 20.10.8
8jv3tlfpvyej4f2dnuqcm7vge manager3 Ready Active Reachable 20.10.8
a37md5239ny976w3qyydnr45t worker1 Ready Active 20.10.8
b96xc9mknqr3fctx9b7welxr5 worker2 Ready Active 20.10.8
High Availability Service Deployment
Now that we have a highly available swarm infrastructure, let's look at how to deploy services with high availability in mind.
Replicated Services for Redundancy
Deploy services with multiple replicas to ensure redundancy:
docker service create \
--name webapp \
--replicas 6 \
--publish 80:80 \
nginx:latest
Using Placement Constraints for Availability
Distribute service instances across nodes to maximize availability:
docker service create \
--name webapp \
--replicas 6 \
--publish 80:80 \
--placement-pref 'spread=node.labels.zone' \
nginx:latest
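Note that the spread preference above relies on each node carrying a node.labels.zone label. If your nodes are not labeled yet, add the labels from a manager first (the zone names below are placeholders):
# On a manager node, record each node's availability zone as a label
docker node update --label-add zone=zone-a worker1
docker node update --label-add zone=zone-b worker2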
Rolling Updates for Zero Downtime
Configure services for rolling updates to avoid downtime during deployments:
docker service create \
--name webapp \
--replicas 6 \
--publish 80:80 \
--update-parallelism 2 \
--update-delay 10s \
nginx:latest
Later, when updating:
docker service update --image nginx:1.21 webapp
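While the update runs, you can watch old tasks being replaced by new ones and, if something goes wrong, roll back to the previous service definition:
# Watch the rolling update progress (old and new task versions are listed)
docker service ps webapp
# Revert to the previous version of the service if the update misbehaves
docker service update --rollback webapp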
Advanced High Availability Configuration
Let's explore some advanced techniques for ensuring high availability in Docker Swarm.
Implementing Health Checks
Add health checks to containers to automatically detect and handle failures:
version: '3.8'
services:
  webapp:
    image: nginx:latest
    deploy:
      replicas: 4
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 1m
      timeout: 10s
      retries: 3
      start_period: 30s
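A roughly equivalent health check can be attached when creating a service from the CLI (this sketch assumes curl is available inside the image):
docker service create \
  --name webapp \
  --replicas 4 \
  --health-cmd "curl -f http://localhost" \
  --health-interval 1m \
  --health-timeout 10s \
  --health-retries 3 \
  --health-start-period 30s \
  nginx:latest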
Setting Up Manager Node Drain
When performing maintenance on a node, drain it first so its tasks are rescheduled onto other nodes. Keep in mind that a drained manager still takes part in the Raft quorum; if maintenance will take a manager offline for an extended period, consider demoting it first:
# Put the node in drain state
docker node update --availability drain manager2
# Perform maintenance...
# Make the node active again
docker node update --availability active manager2
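Before starting maintenance, confirm that no tasks are still running on the drained node:
# Tasks previously on the node should now show a desired state of Shutdown
docker node ps manager2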
Configuring Backup and Restore for Swarm State
Regularly back up the Raft data in /var/lib/docker/swarm/ so you can recover from the loss of all managers. Stop Docker on the manager you are backing up so the data is consistent; with three or more managers, the cluster retains quorum while one is stopped:
# On a manager node, stop Docker, back up the Swarm state, then restart Docker
systemctl stop docker
tar -czvf swarm-backup.tar.gz /var/lib/docker/swarm/
systemctl start docker
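If you want to keep timestamped backups (for example from a cron job), a minimal sketch is shown below; the backup directory is an assumed location:
#!/bin/bash
# Minimal swarm state backup script for one manager node (run as root)
set -e
BACKUP_DIR=/opt/swarm-backups   # assumed backup location
mkdir -p "$BACKUP_DIR"
systemctl stop docker           # stop the engine so the Raft data is consistent
tar -czvf "$BACKUP_DIR/swarm-$(date +%Y%m%d-%H%M%S).tar.gz" /var/lib/docker/swarm/
systemctl start docker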
To restore on a new node (after losing all managers):
# Stop Docker
systemctl stop docker
# Remove any existing swarm state and restore the backup
# (the archive stores paths relative to /, so extract from /)
rm -rf /var/lib/docker/swarm/
tar -xvzf swarm-backup.tar.gz -C /
# Start Docker
systemctl start docker
# Re-initialize the swarm so this node does not try to contact the old managers
docker swarm init --force-new-cluster
Real-World Example: Highly Available Web Application
Let's put everything together by deploying a complete, highly available web application stack with a front-end, API, and database.
Create a Docker Compose File for the Stack
Create a file named webapp-stack.yml:
version: '3.8'

services:
  frontend:
    image: nginx:latest
    ports:
      - "80:80"
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
      restart_policy:
        condition: on-failure
        max_attempts: 3
      placement:
        preferences:
          - spread: node.labels.zone
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  api:
    image: myapp/api:latest
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
        max_attempts: 3
      placement:
        preferences:
          - spread: node.labels.zone
    environment:
      - DB_HOST=db
      - DB_USER=myuser
      - DB_PASSWORD_FILE=/run/secrets/db_password
    secrets:
      - db_password

  db:
    image: mysql:8.0
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager
          - node.labels.db == true
      restart_policy:
        condition: on-failure
    environment:
      - MYSQL_ROOT_PASSWORD_FILE=/run/secrets/db_root_password
      - MYSQL_DATABASE=myapp
      - MYSQL_USER=myuser
      - MYSQL_PASSWORD_FILE=/run/secrets/db_password
    volumes:
      - db-data:/var/lib/mysql
    secrets:
      - db_root_password
      - db_password

volumes:
  db-data:
    driver: local

secrets:
  db_password:
    external: true
  db_root_password:
    external: true
Create the Required Secrets
# Create secrets for the database passwords (printf avoids storing a trailing newline)
printf '%s' "mysecretpassword" | docker secret create db_password -
printf '%s' "myrootsecretpassword" | docker secret create db_root_password -
Deploy the Stack
docker stack deploy -c webapp-stack.yml myapp
Monitor the Deployment
# Check the stack services
docker stack services myapp
# Check service logs
docker service logs myapp_frontend
docker service logs myapp_api
docker service logs myapp_db
Testing High Availability
To verify that your high availability setup works as expected, you should perform the following tests:
Test 1: Node Failure Simulation
Shut down a worker node and observe how the services are rescheduled:
# On the worker node
sudo shutdown -h now
Then check the service status on a manager:
docker service ls
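Within a short time the replicas that were running on the failed worker should be rescheduled onto the remaining nodes. You can follow this at the task level (the service name assumes the myapp stack deployed earlier):
# Failed tasks show a desired state of Shutdown; replacements appear as Running on other nodes
docker service ps myapp_frontend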
Test 2: Manager Node Failure
Shut down a manager node (not the leader) and observe the cluster behavior:
# First, identify the leader
docker node ls
# On a non-leader manager node
sudo shutdown -h now
Check the swarm status:
docker node ls
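The stopped manager should be reported as Down with a MANAGER STATUS of Unreachable, while the remaining managers keep quorum and continue to accept commands. You can also query a single node's manager reachability directly (the node name is an example):
# Reports "reachable" or "unreachable" for a manager node
docker node inspect --format '{{ .ManagerStatus.Reachability }}' manager2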
Test 3: Network Partition Test
Isolate a node from the rest of the swarm using firewall rules:
# On a specific node
sudo iptables -A INPUT -p tcp --dport 2377 -j DROP
sudo iptables -A OUTPUT -p tcp --dport 2377 -j DROP
After testing, restore connectivity:
sudo iptables -D INPUT -p tcp --dport 2377 -j DROP
sudo iptables -D OUTPUT -p tcp --dport 2377 -j DROP
Best Practices for Docker Swarm High Availability
- Always use odd numbers of manager nodes (3, 5, or 7)
- Distribute manager nodes across different failure domains (racks, availability zones, etc.)
- Keep the manager count small (typically 3-5) for performance reasons
- Backup the swarm state regularly
- Label your nodes for better placement control
- Plan for network failures by ensuring connectivity between manager nodes
- Implement proper monitoring of your swarm cluster
- Use health checks for all critical services
- Configure resource constraints to prevent resource exhaustion (see the example after this list)
- Practice disaster recovery scenarios regularly
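As an example of the resource constraints mentioned above, reservations and limits can be set per service; the values below are placeholders to adapt to your workload:
docker service create \
  --name webapp \
  --replicas 6 \
  --reserve-cpu 0.25 \
  --reserve-memory 128M \
  --limit-cpu 0.5 \
  --limit-memory 256M \
  nginx:latest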
Monitoring and Maintenance
Setting Up Monitoring
A complete HA solution requires proper monitoring. Consider adding:
- Prometheus for metrics collection
- Grafana for visualization
- Node exporter for host-level metrics
- cAdvisor for container metrics
Example Prometheus configuration for Docker Swarm:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    deploy:
      placement:
        constraints:
          - node.role == manager
Regular Maintenance Tasks
- Rotate the swarm CA certificate when your security policy requires it (node certificates expire after 90 days by default and are renewed automatically):
docker swarm ca --rotate
- Check swarm health:
docker node ls
docker service ls
- Update Docker Engine on nodes (one at a time):
# On each node, one at a time, starting with workers
docker node update --availability drain <NODE-ID>
# Perform update
docker node update --availability active <NODE-ID>
Troubleshooting Common Issues
Lost Quorum
If you lose quorum (a majority of the manager nodes are unavailable), existing tasks keep running but the swarm can no longer schedule or update services. The preferred recovery is to bring the failed managers back online; if that is impossible, you can force a new cluster from a surviving manager:
# On the remaining manager
docker swarm init --force-new-cluster
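The --force-new-cluster flag re-creates the swarm with this node as the only manager while keeping existing services and worker nodes. Afterwards, remove the stale entries for the lost managers and add managers back until you again have an odd number; a rough sequence (node IDs are placeholders) is:
# Remove stale entries for the lost managers (demote first if they are still listed as managers)
docker node demote <OLD-MANAGER-ID>
docker node rm <OLD-MANAGER-ID>
# Generate a manager join token and join (or promote) replacement managers
docker swarm join-token manager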
Network Connectivity Issues
Check connectivity between nodes:
# Test connectivity to the swarm ports
nc -zv <manager-ip> 2377   # cluster management traffic (TCP)
nc -zv <manager-ip> 7946   # node-to-node communication (TCP)
nc -zvu <manager-ip> 7946  # node-to-node communication (UDP)
nc -zvu <manager-ip> 4789  # overlay network / VXLAN traffic (UDP)
Service Deployment Failures
Check service logs for details:
docker service logs <service-name>
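Scheduling problems (for example, no node satisfying a placement constraint) do not appear in the container logs; inspect the task list and its error column instead:
# Show each task's state and full error message
docker service ps --no-trunc <service-name>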
Summary
High availability in Docker Swarm is achieved through:
- Multiple manager nodes using Raft consensus
- Service replication across worker nodes
- Placement strategies to distribute workloads
- Health checks to detect and handle failures
- Rolling updates for zero-downtime deployments
- Regular backups of the swarm state
- Comprehensive monitoring and maintenance
By following the practices outlined in this guide, you can build a robust, highly available Docker Swarm cluster capable of withstanding various failure scenarios while maintaining service availability.
Exercises
- Set up a 3-manager, 2-worker Docker Swarm cluster on local virtual machines or cloud providers
- Deploy a web application with 5 replicas and verify load balancing
- Simulate failure of a worker node and observe how Docker reschedules containers
- Implement health checks for a service and test failover when the health check fails
- Create a backup of your swarm state and practice restoring it on a new cluster
- Design a disaster recovery plan for a critical application running on Docker Swarm