Docker Swarm Health Checks

Introduction

When running applications in a distributed environment like Docker Swarm, ensuring that your services remain healthy and available is critical. Health checks provide a mechanism to verify that your applications are functioning correctly and to automatically recover from failures when they occur.

In this guide, we'll explore how Docker Swarm implements health checks, how to configure them for your services, and best practices for maintaining a robust containerized environment.

What are Health Checks?

Health checks are automated tests that Docker performs to determine if a container is functioning properly. These tests can be as simple as checking if a process is running or as complex as verifying that your application can process transactions correctly.

When a health check fails, Docker Swarm can take remedial action, such as:

Restarting the unhealthy container
Rescheduling the service on a different node
Preventing traffic from being routed to the unhealthy container

Let's visualize the health check process in Docker Swarm:

Types of Health Checks in Docker Swarm

Docker Swarm supports several types of health checks:

1. Command-based Health Checks

These execute a command inside the container. If the command returns a zero exit code, the container is considered healthy.

2. HTTP Health Checks

These send an HTTP request to a specified endpoint. If the endpoint returns a successful status code (usually 200-399), the container is considered healthy.

3. TCP Health Checks

These attempt to establish a TCP connection to a specific port. If the connection succeeds, the container is considered healthy.

Configuring Health Checks

Health checks can be configured in your Docker Compose file or directly when creating services with the Docker CLI.

Health Check Configuration in Docker Compose

Here's how to define a health check in a Docker Compose file (version 3.x):

yaml
version: '3.8'
services:
  web:
    image: nginx:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    deploy:
      replicas: 3
      restart_policy:
        condition: on-failure

Let's break down the health check options:

test: The command to run to check health
interval: How often to run the health check (default: 30s)
timeout: Maximum time a health check can take before considered failing (default: 30s)
retries: Number of consecutive failures needed to consider the container unhealthy (default: 3)
start_period: Initial period to wait before starting health checks, allowing for container startup time (default: 0s)

Deploying Services with Health Checks via CLI

You can also specify health checks when creating services directly with the Docker CLI:

bash
docker service create \
  --name web \
  --replicas 3 \
  --health-cmd "curl -f http://localhost || exit 1" \
  --health-interval 30s \
  --health-timeout 10s \
  --health-retries 3 \
  --health-start-period 40s \
  nginx:latest

Real-World Health Check Examples

Let's look at some practical examples for different types of applications:

Example 1: Web Application Health Check

yaml
version: '3.8'
services:
  web_app:
    image: my-web-app:latest
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 20s
      timeout: 5s
      retries: 3
      start_period: 30s
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
      restart_policy:
        condition: on-failure

This example checks if the web application responds correctly to a /health endpoint, which is a common pattern for API health checks.

Example 2: Database Health Check

yaml
version: '3.8'
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: app_db
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d app_db"]
      interval: 30s
      timeout: 5s
      retries: 5
      start_period: 80s
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure

This example uses pg_isready to verify that PostgreSQL is ready to accept connections.

Example 3: Redis Cache Health Check

yaml
version: '3.8'
services:
  redis:
    image: redis:latest
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure

This simple health check verifies that Redis responds to the ping command, indicating it's operational.

Implementation Best Practices

To get the most out of Docker Swarm health checks, follow these best practices:

1. Create Dedicated Health Endpoints

For web applications, create a dedicated /health or /healthz endpoint that performs appropriate checks:

javascript
// Express.js example
app.get('/health', (req, res) => {
  // Check database connection
  const dbHealthy = checkDatabaseConnection();
  
  // Check cache connection
  const cacheHealthy = checkCacheConnection();
  
  if (dbHealthy && cacheHealthy) {
    res.status(200).send('Healthy');
  } else {
    res.status(500).send('Unhealthy');
  }
});

2. Make Health Checks Lightweight

Health checks should be quick and lightweight to avoid consuming unnecessary resources:

Avoid expensive database queries
Don't perform full application logic tests
Keep network calls to a minimum

3. Set Appropriate Timing Parameters

Balance between quick detection and avoiding false positives:

interval: Frequent enough to catch issues quickly, but not so frequent as to waste resources
timeout: Long enough for normal operations, but short enough to detect real hangs
start_period: Give containers enough time to initialize before checking health

4. Implement Multi-Stage Health Checks

For complex applications, consider different levels of health checks:

Liveness: Basic check that the application is running
Readiness: Check that the application is ready to accept traffic
Deep health: Comprehensive check of all critical dependencies

Monitoring Health Status

To check the health status of your services, use the following commands:

bash
# Check health status of all containers in a service
docker service ps my_service

# Detailed inspection of a service
docker service inspect my_service

# View container health events
docker events --filter type=container

Example output of docker service ps:

ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE            ERROR               PORTS
tkv7igzw5nwp        web.1               nginx:latest        worker1             Running             Running 2 minutes ago                        
l8x3syu4zpln        web.2               nginx:latest        worker2             Running             Running 2 minutes ago                        
9mw79070dhjc        web.3               nginx:latest        worker3             Running             Running 2 minutes ago                        

If a container becomes unhealthy and gets restarted, you'll see it in the output:

ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE            ERROR               PORTS
tkv7igzw5nwp        web.1               nginx:latest        worker1             Running             Running 2 minutes ago                        
l8x3syu4zpln        web.2               nginx:latest        worker2             Running             Running 2 minutes ago                        
jqu72v2cqidg        web.3               nginx:latest        worker3             Running             Running 30 seconds ago                       
9mw79070dhjc         \_ web.3           nginx:latest        worker3             Shutdown            Failed 35 seconds ago    "task: unhealthy"   

Common Health Check Pitfalls

Avoid these common mistakes when implementing health checks:

Too strict checks: Health checks that fail for minor issues can cause unnecessary restarts
Too lenient checks: Checks that pass even when the application is malfunctioning defeat the purpose
Resource-intensive checks: Heavy health checks can impact application performance
Incorrect timeout values: If timeouts are too short, healthy containers may be marked unhealthy

Troubleshooting Health Checks

If your health checks aren't working as expected:

Check container logs:
bash
```
docker service logs <service_name>
```

Manually run the health check command:

bash
docker exec <container_id> curl -f http://localhost/health

Inspect the service configuration:

bash
docker service inspect --pretty <service_name>

Temporarily disable and test manually: You can disable health checks and monitor manually while troubleshooting.

Summary

Docker Swarm health checks are a powerful feature for maintaining the availability and reliability of your containerized applications. By properly configuring health checks, you can:

Automatically detect and recover from failures
Prevent routing traffic to unhealthy containers
Gain visibility into the health of your distributed system

Remember to create lightweight, appropriate health checks tailored to your application's needs and to set reasonable timing parameters that balance between quick detection and avoiding false positives.

Additional Resources

To deepen your understanding of Docker Swarm health checks, consider exploring:

Docker's official documentation on health checks
Service orchestration patterns for high availability
Monitoring solutions that integrate with Docker health data (like Prometheus)

Exercises

Implement a basic HTTP health check for a web application of your choice
Create a multi-stage health check that verifies both the application and its database
Set up a monitoring dashboard to visualize the health status of your swarm services
Simulate a failure scenario and observe how Docker Swarm responds to unhealthy containers

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

What are Health Checks?​

Types of Health Checks in Docker Swarm​

1. Command-based Health Checks​

2. HTTP Health Checks​

3. TCP Health Checks​

Configuring Health Checks​

Health Check Configuration in Docker Compose​

Deploying Services with Health Checks via CLI​

Real-World Health Check Examples​

Example 1: Web Application Health Check​

Example 2: Database Health Check​

Example 3: Redis Cache Health Check​

Implementation Best Practices​

1. Create Dedicated Health Endpoints​

2. Make Health Checks Lightweight​

3. Set Appropriate Timing Parameters​

4. Implement Multi-Stage Health Checks​

Monitoring Health Status​

Common Health Check Pitfalls​

Troubleshooting Health Checks​

Summary​

Additional Resources​

Exercises​