Redis Failover

Introduction

In production environments, ensuring your Redis instances remain available during hardware failures, network issues, or maintenance periods is critical. Redis failover is the process of automatically switching to a redundant or standby Redis server when the primary system becomes unavailable. This guide will walk you through Redis failover concepts, implementation strategies, and best practices to maintain high availability for your Redis deployments.

Why Failover Matters

Imagine your application suddenly loses connection to its Redis instance. Without proper failover mechanisms:

User sessions could be lost
Cache data becomes unavailable
Queue processing stops
Your application might crash or become unresponsive

A proper failover strategy ensures that when problems occur, your system can automatically recover with minimal downtime.

Redis Replication: The Foundation of Failover

Before understanding failover, we need to grasp Redis replication.

How Redis Replication Works

Redis replication allows you to create copies (replicas) of a Redis instance (primary). In a primary-replica setup:

The primary Redis instance processes write commands from clients
Replicas connect to the primary and receive a copy of the data
Replicas can serve read requests, distributing the read load

Let's set up a basic Redis replication:

# On the primary Redis instance
$ redis-cli
127.0.0.1:6379> INFO replication
# Replication
role:master
connected_slaves:0

# On the replica Redis instance
$ redis-cli
127.0.0.1:6380> REPLICAOF 127.0.0.1 6379
OK

# Check replication status on the primary
127.0.0.1:6379> INFO replication
# Replication
role:master
connected_slaves:1
slave0:ip=127.0.0.1,port=6380,state=online,offset=42,lag=0

While replication provides data redundancy, it doesn't automatically handle failover. If the primary fails, manual intervention is required to promote a replica to become the new primary.
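
As a rough illustration of that manual step, the sketch below parses INFO replication output like the sample above and picks the online replica with the highest replication offset, which is usually the safest candidate to promote. The function names and the second replica line in the sample are assumptions made for this example; they are not part of Redis or any client library.

```python
def parse_replicas(info_text):
    """Extract replica entries (slave0, slave1, ...) from INFO replication text."""
    replicas = []
    for line in info_text.splitlines():
        name, sep, fields = line.strip().partition(":")
        # Replica lines look like: slave0:ip=...,port=...,state=...,offset=...,lag=...
        if not sep or not name.startswith("slave") or not name[5:].isdigit():
            continue
        entry = dict(pair.split("=", 1) for pair in fields.split(","))
        for numeric in ("port", "offset", "lag"):
            entry[numeric] = int(entry[numeric])
        replicas.append(entry)
    return replicas


def most_up_to_date(replicas):
    """Return the online replica with the highest offset, or None if none is online."""
    online = [r for r in replicas if r["state"] == "online"]
    return max(online, key=lambda r: r["offset"]) if online else None


# Sample output; the slave1 line is hypothetical data added for the example.
SAMPLE = """# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6380,state=online,offset=42,lag=0
slave1:ip=127.0.0.1,port=6381,state=online,offset=28,lag=1
"""

best = most_up_to_date(parse_replicas(SAMPLE))
print(best["ip"], best["port"])
```

Promoting the chosen replica would then mean running REPLICAOF NO ONE on it and pointing the remaining replicas and the application at the new primary.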

Failover Approaches in Redis

There are several ways to implement failover for Redis:

Redis Sentinel
Redis Cluster
Orchestration tools (like Kubernetes)

Let's explore each method.

Redis Sentinel: Automated Failover

Redis Sentinel is the official high-availability solution for Redis deployments that do not use Redis Cluster. It's a distributed system that monitors Redis instances and performs automatic failover when needed.

How Sentinel Works

Sentinel provides:

Monitoring: Continuously checks if primaries and replicas are working as expected
Notification: Alerts administrators about problems
Automatic failover: Promotes a replica to primary when the original primary fails
Configuration provider: Clients connect to Sentinels to find the current primary address

Setting Up Redis Sentinel

Let's implement a basic Sentinel setup with one primary and two replicas:

First, set up your Redis instances (one primary, two replicas)
Create a sentinel.conf file:

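# sentinel.conf
port 26379
# Monitor the primary at 127.0.0.1:6379; 2 Sentinels must agree it is down (quorum)
sentinel monitor mymaster 127.0.0.1 6379 2
# Consider the primary down after 5 seconds without a valid reply
sentinel down-after-milliseconds mymaster 5000
# Failover timeout in milliseconds
sentinel failover-timeout mymaster 60000
# Reconfigure one replica at a time to sync with the new primary
sentinel parallel-syncs mymaster 1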

Start Sentinel instances:

$ redis-server sentinel.conf --sentinel

For high availability, run at least 3 Sentinel instances on different machines, each with its own writable configuration file, since Sentinel rewrites that file to record state.
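
To see why three is the practical minimum, it helps to separate two thresholds: the quorum (the last argument of sentinel monitor) decides when the primary is marked as down, while actually starting a failover also requires votes from a majority of all Sentinels. The sketch below models that rule in plain Python; the function names are made up for this example and it does not talk to a real Sentinel.

```python
def majority(total_sentinels):
    """Smallest number of Sentinels that forms a majority."""
    return total_sentinels // 2 + 1


def can_mark_down(agreeing, quorum):
    """Marking the primary as objectively down only needs `quorum` Sentinels."""
    return agreeing >= quorum


def can_fail_over(reachable, total_sentinels, quorum):
    """Starting a failover needs a majority of votes, and at least `quorum`."""
    return reachable >= max(majority(total_sentinels), quorum)


# (total Sentinels, quorum, Sentinels still able to talk to each other)
for total, quorum, reachable in [(3, 2, 2), (3, 2, 1), (5, 2, 2)]:
    print(
        total, quorum, reachable,
        can_mark_down(reachable, quorum),
        can_fail_over(reachable, total, quorum),
    )
```

In the last case, two of five Sentinels can agree the primary is down, but they cannot fail over because they are not a majority. This is what keeps the minority side of a network partition from promoting its own primary.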

Simulating Failover with Sentinel

Let's see what happens during a failover:

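# Simulate an unresponsive primary by blocking it for 30 seconds
# (longer than the 5000 ms down-after-milliseconds configured above)
$ redis-cli -p 6379 DEBUG SLEEP 30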

# Check Sentinel logs
$ tail -f sentinel.log
... # Sentinel detects the primary is down
... # A new primary is elected
... # Configuration is updated

Connecting Applications to Sentinel

Applications need to be Sentinel-aware. Here's how to connect using Redis clients:

// Node.js example with ioredis
const Redis = require('ioredis');

const redis = new Redis({
  sentinels: [
    { host: '127.0.0.1', port: 26379 },
    { host: '127.0.0.1', port: 26380 },
    { host: '127.0.0.1', port: 26381 }
  ],
  name: 'mymaster' // The Sentinel master name
});

redis.set('key', 'value', (err) => {
  // ioredis resolves the current primary through Sentinel before sending the command
  if (err) {
    console.error('Write failed:', err);
    return;
  }
  console.log('Value set on the current primary');
});

# Python example with redis-py
from redis.sentinel import Sentinel

sentinel = Sentinel([
    ('127.0.0.1', 26379),
    ('127.0.0.1', 26380),
    ('127.0.0.1', 26381)
], socket_timeout=0.1)

# Get the current primary
master = sentinel.master_for('mymaster', socket_timeout=0.1)
master.set('key', 'value')  # This goes to the primary

# Get a replica for read operations
# (replication is asynchronous, so reads may be slightly stale)
slave = sentinel.slave_for('mymaster', socket_timeout=0.1)
value = slave.get('key')    # This goes to a replica

Redis Cluster: Sharding with Automatic Failover

Redis Cluster is a distributed implementation that provides:

Data sharding across multiple Redis nodes
Built-in failover capabilities
No need for external Sentinel processes

Setting Up Redis Cluster

Configure Redis instances for cluster mode:

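# redis-7000.conf (create one file per node, changing the port in each)
port 7000
cluster-enabled yes
# Each node needs its own cluster state file; Redis writes this file itself
cluster-config-file nodes-7000.conf
# Milliseconds a node can be unreachable before it is considered failing
cluster-node-timeout 5000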

Start the Redis nodes:

$ redis-server ./redis-7000.conf
$ redis-server ./redis-7001.conf
# ... and so on for all nodes

Create the cluster:

$ redis-cli --cluster create 127.0.0.1:7000 127.0.0.1:7001 \
  127.0.0.1:7002 127.0.0.1:7003 127.0.0.1:7004 127.0.0.1:7005 \
  --cluster-replicas 1

This creates a cluster with 3 primary nodes and 3 replica nodes (one replica per primary).
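
Sharding in Redis Cluster works by mapping every key to one of 16384 hash slots (CRC16 of the key, modulo 16384), and each primary owns a range of those slots. The following sketch reproduces that mapping in Python, including the hash tag rule that lets related keys land in the same slot. It is an illustration only: the key_slot function is written for this example, and binascii.crc_hqx with an initial value of 0 is used on the assumption that it matches the CRC16 variant Redis uses.

```python
from binascii import crc_hqx

HASH_SLOTS = 16384


def key_slot(key: bytes) -> int:
    """Compute the Redis Cluster hash slot for a key."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty {...} section counts as a hash tag.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc_hqx(key, 0) % HASH_SLOTS


print(key_slot(b"foo"))
# Keys sharing a hash tag map to the same slot, so they live on the same node.
print(key_slot(b"{user1000}.following") == key_slot(b"{user1000}.followers"))
```

When a primary fails and its replica is promoted, the replica takes over the same slot range, so this mapping does not change during failover.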

Automatic Failover in Redis Cluster

Redis Cluster automatically handles failover when a primary node goes down:

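Cluster nodes continuously ping each other
If a primary does not respond within cluster-node-timeout, the nodes that cannot reach it flag it as possibly failing
Once a majority of primaries agree, the node is marked as failed
One of its replicas requests votes from the remaining primaries and, on winning a majority, is promoted to primary
The cluster continues to operate, with the promoted replica serving the failed primary's hash slots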
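
When a failed primary has more than one replica, the Redis Cluster specification describes a staggered start: each replica waits 500 ms, plus a random 0 to 500 ms, plus 1000 ms multiplied by its rank, where rank 0 is the replica with the most up-to-date replication offset. The sketch below models that delay so the effect is easy to see. The function name and the replica names are hypothetical, and the jitter is injectable only to make the example deterministic.

```python
import random


def election_delays_ms(replica_offsets, jitter=lambda: random.randint(0, 500)):
    """Map each replica to the delay (ms) before it asks the primaries for votes."""
    delays = {}
    for name, offset in replica_offsets.items():
        # Rank = number of sibling replicas with a more recent offset.
        rank = sum(1 for other in replica_offsets.values() if other > offset)
        delays[name] = 500 + jitter() + rank * 1000
    return delays


# Hypothetical replicas of one failed primary and their replication offsets.
offsets = {"replica-a": 1000, "replica-b": 900, "replica-c": 950}
print(election_delays_ms(offsets, jitter=lambda: 0))
```

The replica with the freshest data tends to start its election first, which makes it the most likely one to be promoted.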

Connecting to Redis Cluster

Applications need to use cluster-aware clients:

// Node.js with ioredis
const Redis = require('ioredis');

const cluster = new Redis.Cluster([
  { port: 7000, host: '127.0.0.1' },
  { port: 7001, host: '127.0.0.1' }
  // Only need to specify a few nodes, client will discover the rest
]);

cluster.set('key', 'value', (err) => {
  if (err) return console.error('Write failed:', err);
  console.log('Key stored in the appropriate shard');
});

# Python with redis-py 4.1 or later, which includes cluster support
# (the older redis-py-cluster package used `from rediscluster import RedisCluster`)
from redis.cluster import ClusterNode, RedisCluster

startup_nodes = [
    ClusterNode("127.0.0.1", 7000),
    ClusterNode("127.0.0.1", 7001),
]

rc = RedisCluster(startup_nodes=startup_nodes, decode_responses=True)
rc.set('key', 'value')

Orchestration-Based Failover

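Container orchestration platforms such as Kubernetes can restart failed Redis pods, but on their own they do not promote a replica to primary. A typical setup combines several pieces:

Deploy Redis with StatefulSets
Use readiness/liveness probes to detect failures
Leverage operators like Redis Operator or Redis Enterprise Operator, or run Sentinel alongside Redis, to handle promotion

Example Kubernetes manifest (simplified; it starts three independent Redis pods and does not configure replication or failover by itself):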

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: "redis"
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:6.2
        ports:
        - containerPort: 6379
          name: redis
        readinessProbe:
          exec:
            command: ["redis-cli", "ping"]

Best Practices for Redis Failover

Use multiple Sentinels: Run at least 3 Sentinel instances on different machines
Tune timeouts carefully: Balance quick failover against false positives when setting down-after-milliseconds (Sentinel) or cluster-node-timeout (Cluster)
Test failover regularly: Simulate failures to ensure your system recovers properly
Monitor the entire system: Use tools like Prometheus and Grafana
Implement client-side retry logic: Handle temporary disconnections gracefully
Back up your data: Failover doesn't protect against data corruption
Document procedures: Have clear manual recovery procedures
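
To make the client-side retry advice concrete, here is a minimal retry helper with exponential backoff. It is a sketch under stated assumptions: with_retry and flaky_write are names invented for this example, the delays are arbitrary, and the default exception types are Python built-ins. A real client library usually raises its own connection error classes, which would need to be passed in through the retryable parameter.

```python
import time


def with_retry(operation, retryable=(ConnectionError, TimeoutError),
               attempts=5, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retryable:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            sleep(min(max_delay, base_delay * 2 ** attempt))


# Simulated write that fails twice, as it might while a failover is in progress.
calls = {"count": 0}


def flaky_write():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("primary unavailable")
    return "OK"


print(with_retry(flaky_write))
```

The total retry window should be longer than the expected failover time. With the Sentinel settings shown earlier, detection alone takes about five seconds.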

Common Failover Issues and Solutions

Issue	Cause	Solution
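Split-brain	Network partition causing multiple primaries	Run an odd number of Sentinels (at least three) so only the majority side can fail over, and set min-replicas-to-write so an isolated primary stops accepting writes
Data loss	Asynchronous replication	Use the WAIT command for critical writes to confirm replica acknowledgment (this reduces the risk but does not eliminate it)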
Flapping	Unstable network causing repeated failovers	Increase detection timeouts
Client disconnections	Clients not reconnecting after failover	Use client libraries with failover support
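
The WAIT suggestion in the table can be sketched as a small helper that writes a key and then checks how many replicas acknowledged the write. The helper assumes a client object exposing set(key, value) and wait(num_replicas, timeout_ms), which is how redis-py names these commands. FakeClient is a hypothetical stand-in so the example runs without a server.

```python
class FakeClient:
    """Hypothetical stand-in for a Redis client, for demonstration only."""

    def __init__(self, acked_replicas):
        self.acked_replicas = acked_replicas
        self.data = {}

    def set(self, key, value):
        self.data[key] = value
        return True

    def wait(self, num_replicas, timeout_ms):
        # A real WAIT blocks until enough replicas acknowledge or the timeout hits,
        # then returns the number of replicas that acknowledged.
        return self.acked_replicas


def set_with_ack(client, key, value, min_replicas=1, timeout_ms=100):
    """Write a key and report whether at least min_replicas acknowledged it."""
    client.set(key, value)
    return client.wait(min_replicas, timeout_ms) >= min_replicas


print(set_with_ack(FakeClient(acked_replicas=1), "order:42", "paid"))
print(set_with_ack(FakeClient(acked_replicas=0), "order:42", "paid"))
```

WAIT only covers writes sent on the same connection, so a pooled client may need to issue both commands over one connection. A False result does not undo the write; it only tells the application that the write could be lost if the primary fails now.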

Real-World Example: E-commerce Application

Consider how a typical e-commerce application might implement Redis failover with Sentinel. In this setup:

The app connects to Redis through Sentinel
Session data, product caches, and cart data are stored in Redis
If the primary fails, Sentinel promotes a replica
The application continues to function with minimal disruption

Summary

Redis failover is essential for maintaining high availability in production environments. In this guide, we've covered:

The importance of Redis failover
Redis replication as the foundation for failover
Redis Sentinel for automated monitoring and failover
Redis Cluster for sharded data with built-in failover
Orchestration-based approaches with Kubernetes
Best practices and common issues

By implementing proper failover strategies, you can ensure your Redis deployments remain resilient and available, even when facing hardware failures, network issues, or during maintenance.

Additional Resources

Here are some exercises to reinforce your understanding:

Set up a Redis primary with two replicas on your local machine
Configure a three-node Sentinel system and test failover
Create a simple Redis Cluster with three primaries and three replicas
Write a client application that handles Redis failover gracefully
Design a backup strategy for your Redis deployment
