Nginx Backend Failover

Introduction

When running web applications in a production environment, ensuring high availability is crucial. If a backend server goes down, your users shouldn't experience service disruptions. This is where backend failover comes into play.

Backend failover is a critical feature of Nginx's load balancing capabilities that automatically redirects traffic away from failed servers to healthy ones. In this guide, we'll explore how to implement robust failover mechanisms using Nginx to maintain service availability even when backend servers fail.

Understanding Backend Failover

Backend failover is a high-availability technique that ensures continuity of service when one or more backend servers become unavailable. Nginx constantly monitors the health of backend servers and automatically routes traffic to operational servers when failures are detected.

Key Benefits

Improved Reliability: Minimize downtime during server failures
Enhanced User Experience: Users remain unaffected by backend issues
Simplified Maintenance: Perform server maintenance without service interruption
Disaster Recovery: Quick recovery from unexpected server failures

Basic Failover Configuration

Let's start with a basic failover configuration in Nginx:

nginx
http {
    upstream backend_servers {
        server backend1.example.com:8080;
        server backend2.example.com:8080 backup;
    }

    server {
        listen 80;
        server_name example.com;

        location / {
            proxy_pass http://backend_servers;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}

In this example:

We define an upstream group called backend_servers with two backend servers
The second server has the backup parameter, designating it as a backup server
Nginx will only route traffic to the backup server if the primary server is unavailable

How It Works

All traffic is initially directed to backend1.example.com
If backend1.example.com fails, traffic is automatically redirected to backend2.example.com
When backend1.example.com becomes available again, traffic returns to it

Active Health Checks

For more robust failover, Nginx Plus provides active health checks that proactively monitor backend servers:

nginx
http {
    upstream backend_servers {
        server backend1.example.com:8080;
        server backend2.example.com:8080;
        server backend3.example.com:8080;

        health_check interval=5s fails=3 passes=2;
    }

    server {
        listen 80;
        server_name example.com;

        location / {
            proxy_pass http://backend_servers;
            health_check uri=/health;
        }
    }
}

In this configuration:

Nginx checks the /health endpoint every 5 seconds
A server is considered failed after 3 consecutive failed checks
A server is marked as healthy after 2 consecutive successful checks

Custom Health Check Parameters

You can customize health checks with various parameters:

nginx
upstream backend_servers {
    server backend1.example.com:8080;
    server backend2.example.com:8080;
    
    health_check interval=10s 
                 fails=3 
                 passes=2 
                 uri=/status 
                 match=health_check_status;
}

match health_check_status {
    status 200;
    body ~ "OK";
}

This configuration checks for both an HTTP 200 status code and the text "OK" in the response body.

Passive Health Checks

For Nginx open-source, we can implement passive health checks using the max_fails and fail_timeout parameters:

nginx
http {
    upstream backend_servers {
        server backend1.example.com:8080 max_fails=3 fail_timeout=30s;
        server backend2.example.com:8080 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 80;
        server_name example.com;

        location / {
            proxy_pass http://backend_servers;
            proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        }
    }
}

In this setup:

max_fails=3: A server is considered unavailable after 3 failed attempts
fail_timeout=30s: After failure, the server is marked unavailable for 30 seconds
proxy_next_upstream: Specifies conditions when Nginx should try the next server

Real-world Example: E-commerce Application

Let's look at a practical example for an e-commerce application with different backend services:

nginx
http {
    # Product catalog service
    upstream product_service {
        server product1.example.com:8080 weight=3;
        server product2.example.com:8080 weight=1;
        server product3.example.com:8080 backup;

        least_conn;
        keepalive 32;
    }

    # Payment processing service
    upstream payment_service {
        server payment1.example.com:8080;
        server payment2.example.com:8080 backup;

        keepalive 16;
    }

    server {
        listen 80;
        server_name shop.example.com;

        # Product catalog endpoints
        location /api/products {
            proxy_pass http://product_service;
            proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
            proxy_next_upstream_timeout 10s;
            proxy_next_upstream_tries 3;
            
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }

        # Payment processing endpoints
        location /api/payments {
            proxy_pass http://payment_service;
            proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
            
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
}

This configuration:

Creates separate upstream groups for different services
Uses weighted load balancing for the product service
Employs dedicated backup servers for each service
Sets different connection parameters based on service requirements
Configures timeout and retry settings for failover scenarios

Visualizing Failover in Action

Let's visualize how failover works with a diagram:

When a server fails:

Advanced Failover Techniques

Slow-start Recovery

Nginx Plus offers a "slow start" feature that gradually increases traffic to a recovered server:

nginx
upstream backend {
    server backend1.example.com:8080 slow_start=30s;
    server backend2.example.com:8080;
    server backup.example.com:8080 backup;
}

When backend1 recovers from a failure, Nginx gradually increases traffic over 30 seconds, preventing it from being overwhelmed.

Circuit Breaker Pattern

We can implement a basic circuit breaker pattern using Nginx variables and maps:

nginx
http {
    # Define a map to track server state
    map $upstream_addr $upstream_status_key {
        default "server_$upstream_addr";
        "~[^:]+$" "server_$upstream_addr";
    }

    # Circuit breaker logic using Nginx Keyval module
    keyval_zone zone=circuit_breaker:10m timeout=60s;
    keyval $upstream_status_key $server_failures zone=circuit_breaker;

    server {
        listen 80;
        
        location / {
            set $upstream_endpoint "http://backend_servers";
            
            # Check if server has too many failures
            access_by_lua_block {
                local failures = tonumber(ngx.var.server_failures) or 0
                
                if failures >= 5 then
                    ngx.log(ngx.WARN, "Circuit open for ", ngx.var.upstream_status_key)
                    ngx.var.upstream_endpoint = "http://backup_servers"
                end
            }
            
            proxy_pass $upstream_endpoint;
            
            # Track failures
            header_filter_by_lua_block {
                if ngx.status >= 500 then
                    local failures = tonumber(ngx.var.server_failures) or 0
                    ngx.var.server_failures = failures + 1
                end
            }
        }
    }
}

This advanced technique requires the Nginx Lua module but provides more sophisticated failover control.

Common Troubleshooting Issues

Problem: All Servers Marked as Failed

If all your servers are being marked as failed, check:

Health check endpoints are properly implemented
Network connectivity between Nginx and backend servers
Timeouts are configured appropriately

nginx
# More lenient health check configuration
health_check interval=10s 
             fails=5 
             passes=2 
             timeout=5s;

Problem: Premature Failover

If Nginx is failing over too aggressively:

Increase the max_fails threshold
Adjust fail_timeout duration
Verify your backend server's response times

nginx
# Less aggressive failover configuration
server backend1.example.com:8080 max_fails=5 fail_timeout=60s;

Summary

Nginx backend failover provides essential high-availability capabilities for your web applications. By implementing proper failover mechanisms, you can ensure your services remain available even when individual servers experience issues.

Key takeaways:

Use backup servers for basic failover scenarios
Implement health checks to detect server failures automatically
Configure timeout and retry parameters to fine-tune failover behavior
Consider advanced techniques like slow-start and circuit breakers for critical applications

Additional Resources and Exercises

Exercises

Basic Failover Configuration: Set up a simple Nginx environment with two backend servers and configure one as a backup. Test failover by stopping the primary server.
Health Check Implementation: Configure passive health checks using max_fails and fail_timeout. Test different values to see how they affect failover behavior.
Advanced Configuration: Create a multi-service configuration similar to the e-commerce example with different failover policies for each service.

Further Learning

Explore Nginx Plus features for enhanced failover capabilities
Learn about centralized monitoring solutions that complement Nginx's failover mechanisms
Study distributed system patterns like circuit breakers, bulkheads, and retries

Remember that a well-designed failover strategy is just one part of a comprehensive high-availability solution. Consider implementing additional reliability patterns like retries, caching, and graceful degradation for robust web applications.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Understanding Backend Failover​

Key Benefits​

Basic Failover Configuration​

How It Works​

Active Health Checks​

Custom Health Check Parameters​

Passive Health Checks​

Real-world Example: E-commerce Application​

Visualizing Failover in Action​

Advanced Failover Techniques​

Slow-start Recovery​

Circuit Breaker Pattern​

Common Troubleshooting Issues​

Problem: All Servers Marked as Failed​

Problem: Premature Failover​

Summary​

Additional Resources and Exercises​

Exercises​

Further Learning​