Distributed Systems

Introduction

A distributed system is a collection of independent computers that appear to users as a single coherent system. These computers communicate with each other over a network to achieve a common goal. Distributed systems have become increasingly important in modern software development, powering everything from search engines and social media platforms to banking systems and e-commerce websites.

In this tutorial, we'll explore the fundamental concepts of distributed systems, understand their advantages and challenges, and look at some real-world applications.

Why Distributed Systems?

Traditional single-server systems have several limitations:

Limited scalability: A single server can only handle a certain amount of load
Single point of failure: If the server goes down, the entire system becomes unavailable
Resource constraints: Memory, CPU, and storage are limited to what a single machine can provide

Distributed systems address these limitations by distributing data and computations across multiple machines, offering:

Scalability: Add more machines to handle increased load
Fault tolerance: System continues to function even if some components fail
Higher performance: Parallel processing across multiple machines
Geographic distribution: Place servers closer to users for lower latency

Key Characteristics of Distributed Systems

Multiple computers share resources like hardware, software, and data.

2. Openness

Systems use standard interfaces that allow different components to work together.

3. Concurrency

Multiple processes execute simultaneously across different machines.

4. Scalability

The system can grow by adding more resources to handle increased load.

5. Fault Tolerance

The system continues to function correctly despite failures of components.

6. Transparency

The complexity of the system is hidden from the end-user, making multiple machines appear as a single system.

Challenges in Distributed Systems

Distributed systems introduce several challenges that don't exist in single-system applications:

Network Failures

Networks are inherently unreliable. Messages can be lost, delayed, or delivered out of order.

// Example: Simple network request with retry logic
async function fetchWithRetry(url, maxRetries = 3) {
  let retries = 0;
  
  while (retries < maxRetries) {
    try {
      const response = await fetch(url);
      return await response.json();
    } catch (error) {
      console.log(`Request failed, attempt ${retries + 1}/${maxRetries}`);
      retries++;
      
      if (retries === maxRetries) {
        throw new Error('Maximum retries reached');
      }
      
      // Wait before retrying (exponential backoff)
      await new Promise(resolve => setTimeout(resolve, 1000 * Math.pow(2, retries)));
    }
  }
}

// Usage
fetchWithRetry('https://api.example.com/data')
  .then(data => console.log('Data received:', data))
  .catch(error => console.error('Failed to fetch data:', error));

/* Output:
If successful on first try:
Data received: {id: 1, name: "example"}

If failing twice before succeeding:
Request failed, attempt 1/3
Request failed, attempt 2/3
Data received: {id: 1, name: "example"}

If all attempts fail:
Request failed, attempt 1/3
Request failed, attempt 2/3
Request failed, attempt 3/3
Failed to fetch data: Error: Maximum retries reached
*/

Latency

Communication between distributed components takes time, which can affect overall system performance.

Partial Failures

Some parts of the system may fail while others continue to function, making it difficult to determine the system's state.

Consistency

Ensuring that all nodes in the system see the same data at the same time is challenging.

Clock Synchronization

Different machines have different clocks, which can cause timing issues in distributed algorithms.

CAP Theorem

The CAP theorem, formulated by Eric Brewer, states that a distributed system cannot simultaneously provide more than two of the following guarantees:

Consistency: All nodes see the same data at the same time
Availability: Every request receives a response (success or failure)
Partition Tolerance: The system continues to operate despite network partitions

This fundamental theorem guides the design decisions in distributed systems.

Key Concepts in Distributed Systems

Distributed Computing Models

Client-Server

The most common model where clients request services from servers.

// Server-side code (Node.js)
const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, {'Content-Type': 'application/json'});
  res.end(JSON.stringify({
    message: 'Hello from the server!',
    timestamp: new Date()
  }));
});

server.listen(3000, () => {
  console.log('Server running on port 3000');
});

// Client-side code
fetch('http://localhost:3000')
  .then(response => response.json())
  .then(data => console.log(data))
  .catch(error => console.error('Error:', error));

/* Output (client):
{
  message: "Hello from the server!",
  timestamp: "2025-03-17T14:30:45.123Z"
}
*/

Peer-to-Peer (P2P)

Nodes act as both clients and servers, sharing resources directly without a central server.

Microservices

Applications are built as collections of loosely coupled services that communicate over a network.

Replication

Maintaining copies of data across multiple nodes to improve reliability and performance.

Types of Replication:

Active Replication: All replicas process each request
Passive Replication: One primary node processes requests and updates backups

Consistency Models

Different approaches to handling data consistency:

Strong Consistency

All reads reflect the most recent write. This provides the most intuitive behavior but can impact availability and performance.

Eventual Consistency

System will eventually become consistent if no new updates are made. This improves availability but may return stale data.

// Example: Simple in-memory data store with eventual consistency
class DistributedCache {
  constructor(nodeId, peers = []) {
    this.nodeId = nodeId;
    this.peers = peers;
    this.data = {};
    this.versionMap = {}; // Track versions of each key
  }

  // Write data locally and propagate to peers
  async set(key, value) {
    const version = (this.versionMap[key] || 0) + 1;
    this.data[key] = value;
    this.versionMap[key] = version;
    
    console.log(`[Node ${this.nodeId}] Set ${key}=${value} (version ${version})`);
    
    // Asynchronously propagate to peers (eventual consistency)
    this.peers.forEach(peer => {
      // Simulate network delay
      setTimeout(() => {
        peer.receiveUpdate(key, value, version, this.nodeId);
      }, Math.random() * 200); // Random delay up to 200ms
    });
    
    return true;
  }

  // Receive updates from peers
  receiveUpdate(key, value, version, fromNodeId) {
    const currentVersion = this.versionMap[key] || 0;
    
    // Only update if the received version is newer
    if (version > currentVersion) {
      this.data[key] = value;
      this.versionMap[key] = version;
      console.log(`[Node ${this.nodeId}] Updated ${key}=${value} (version ${version}) from Node ${fromNodeId}`);
    } else {
      console.log(`[Node ${this.nodeId}] Ignored update for ${key} (received version ${version}, current version ${currentVersion})`);
    }
  }

  // Read data from local store
  get(key) {
    const value = this.data[key];
    const version = this.versionMap[key] || 0;
    console.log(`[Node ${this.nodeId}] Get ${key}=${value} (version ${version})`);
    return value;
  }
}

// Example usage
const node1 = new DistributedCache(1);
const node2 = new DistributedCache(2);
const node3 = new DistributedCache(3);

// Set up the peer relationships
node1.peers = [node2, node3];
node2.peers = [node1, node3];
node3.peers = [node1, node2];

// Demonstrate eventual consistency
async function demo() {
  await node1.set('user', 'Alice');
  
  // Immediately after setting, nodes might be inconsistent
  console.log('Immediate state:');
  console.log(`Node 1: ${node1.get('user')}`);
  console.log(`Node 2: ${node2.get('user')}`);
  console.log(`Node 3: ${node3.get('user')}`);
  
  // Wait for eventual consistency
  await new Promise(resolve => setTimeout(resolve, 500));
  
  console.log('After synchronization:');
  console.log(`Node 1: ${node1.get('user')}`);
  console.log(`Node 2: ${node2.get('user')}`);
  console.log(`Node 3: ${node3.get('user')}`);
}

demo();

/* Example Output:
[Node 1] Set user=Alice (version 1)
Immediate state:
[Node 1] Get user=Alice (version 1)
Alice
[Node 2] Get user=undefined (version 0)
undefined
[Node 3] Get user=undefined (version 0)
undefined
[Node 2] Updated user=Alice (version 1) from Node 1
[Node 3] Updated user=Alice (version 1) from Node 1

After synchronization:
[Node 1] Get user=Alice (version 1)
Alice
[Node 2] Get user=Alice (version 1)
Alice
[Node 3] Get user=Alice (version 1)
Alice
*/

Causal Consistency

If one process reads a value and then writes a new value, any process that sees the new value will also see the previous value.

Partitioning (Sharding)

Dividing data across multiple nodes to improve scalability and performance.

Consensus Algorithms

Ensuring that all nodes in a distributed system agree on a single value or decision.

Paxos and Raft

Popular consensus algorithms that ensure agreement among distributed nodes even in the presence of failures.

Distributed Transactions

Maintaining ACID properties (Atomicity, Consistency, Isolation, Durability) across multiple systems.

Two-Phase Commit (2PC)

A protocol that ensures all participants in a distributed transaction either commit or abort the transaction.

Real-World Distributed Systems Examples

1. Google's Search Engine

Google's search engine is a massive distributed system that indexes billions of web pages and processes millions of search queries per second.

Key components:

Crawler: Distributed system that finds and downloads web pages
Indexer: Processes and stores information about web pages
Query processor: Handles user searches and returns relevant results

2. Amazon's E-commerce Platform

Amazon's platform handles millions of customers, products, and transactions using a distributed architecture.

Product catalog: Distributed database storing millions of products
Order processing: Distributed system handling transactions
Recommendation system: Analyzes user behavior across multiple systems

3. Banking Systems

Modern banking systems use distributed architectures to handle transactions across multiple branches and ATMs.

Transaction processing: Ensuring consistency across distributed nodes
Replication: Maintaining data integrity and availability
Fault tolerance: Ensuring the system continues to work even if some components fail

Building a Simple Distributed Calculator

Let's implement a simple distributed calculator system where different servers handle different operations.

// math-service.js - A simple distributed calculator service

const express = require('express');
const axios = require('axios');
const app = express();
app.use(express.json());

// Service registry - in a real system, this would be a separate service
const serviceRegistry = {
  addition: 'http://localhost:3001',
  subtraction: 'http://localhost:3002',
  multiplication: 'http://localhost:3003',
  division: 'http://localhost:3004'
};

// Gateway service - routes requests to appropriate services
app.post('/calculate', async (req, res) => {
  try {
    const { operation, a, b } = req.body;
    
    if (!serviceRegistry[operation]) {
      return res.status(400).json({ error: `Unsupported operation: ${operation}` });
    }
    
    console.log(`Routing ${operation} request to ${serviceRegistry[operation]}`);
    
    const response = await axios.post(`${serviceRegistry[operation]}/compute`, { a, b });
    res.json({ result: response.data.result });
  } catch (error) {
    console.error('Error processing request:', error.message);
    res.status(500).json({ error: 'Service unavailable' });
  }
});

// Addition service
const additionService = express();
additionService.use(express.json());
additionService.post('/compute', (req, res) => {
  const { a, b } = req.body;
  console.log(`Addition service: ${a} + ${b}`);
  res.json({ result: a + b });
});

// Subtraction service
const subtractionService = express();
subtractionService.use(express.json());
subtractionService.post('/compute', (req, res) => {
  const { a, b } = req.body;
  console.log(`Subtraction service: ${a} - ${b}`);
  res.json({ result: a - b });
});

// Start services
app.listen(3000, () => console.log('Gateway service running on port 3000'));
additionService.listen(3001, () => console.log('Addition service running on port 3001'));
subtractionService.listen(3002, () => console.log('Subtraction service running on port 3002'));

/* Usage:
POST http://localhost:3000/calculate
{
  "operation": "addition",
  "a": 5,
  "b": 3
}

Response:
{
  "result": 8
}
*/

This simple example demonstrates several distributed system concepts:

Service discovery: The gateway knows where to find each service
Load distribution: Different services handle different operations
Failure isolation: If one service fails, others can still function

Best Practices for Distributed Systems

1. Design for Failure

Assume components will fail and design your system to handle these failures gracefully.

2. Implement Retry Logic

When services fail, implement retry logic with exponential backoff to avoid overwhelming the system.

3. Use Circuit Breakers

Prevent cascading failures by using circuit breakers that stop requests to failing services.

// Example: Circuit Breaker pattern
class CircuitBreaker {
  constructor(service, options = {}) {
    this.service = service;
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 30000; // 30 seconds
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF-OPEN
    this.failureCount = 0;
    this.lastFailureTime = null;
  }
  
  async call(request) {
    if (this.state === 'OPEN') {
      // Check if it's time to try again
      const timeSinceLastFailure = Date.now() - this.lastFailureTime;
      if (timeSinceLastFailure > this.resetTimeout) {
        this.state = 'HALF-OPEN';
        console.log('Circuit half-open, attempting request');
      } else {
        console.log('Circuit open, failing fast');
        throw new Error('Service unavailable (circuit open)');
      }
    }
    
    try {
      const response = await this.service(request);
      
      // If successful in HALF-OPEN state, reset the circuit
      if (this.state === 'HALF-OPEN') {
        this.state = 'CLOSED';
        this.failureCount = 0;
        console.log('Request succeeded, circuit closed');
      }
      
      return response;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }
  
  recordFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    
    if (this.state === 'HALF-OPEN' || this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      console.log(`Circuit opened after ${this.failureCount} failures`);
    }
  }
}

// Example usage
const userService = new CircuitBreaker(
  async (userId) => {
    // Simulate a service that sometimes fails
    if (Math.random() < 0.3) {
      throw new Error('Service error');
    }
    return { id: userId, name: 'User ' + userId };
  },
  { failureThreshold: 3, resetTimeout: 10000 }
);

async function getUserData(userId) {
  try {
    return await userService.call(userId);
  } catch (error) {
    console.error(`Error fetching user ${userId}: ${error.message}`);
    return null;
  }
}

// Test the circuit breaker
async function testCircuitBreaker() {
  for (let i = 1; i <= 10; i++) {
    console.log(`Request ${i}:`);
    const result = await getUserData(i);
    console.log(result);
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
}

testCircuitBreaker();

/* Example Output:
Request 1:
{ id: 1, name: 'User 1' }
Request 2:
Error fetching user 2: Service error
null
Request 3:
Error fetching user 3: Service error
null
Request 4:
Error fetching user 4: Service error
Circuit opened after 3 failures
null
Request 5:
Error fetching user 5: Service unavailable (circuit open)
null
...
*/

4. Implement Health Checks

Regularly check the health of your services and take appropriate actions when they're unhealthy.

5. Monitor and Log Everything

Use comprehensive monitoring and logging to understand the behavior of your distributed system.

6. Use Asynchronous Communication When Possible

Asynchronous communication helps decouple services and improve resilience.

Summary

Distributed systems offer significant advantages in terms of scalability, reliability, and performance, but they also introduce new challenges related to consistency, failure handling, and complexity.

Key points to remember:

Distributed systems consist of multiple independent computers working together
They provide benefits like scalability, fault tolerance, and performance
The CAP theorem states that we can only guarantee two of: consistency, availability, and partition tolerance
Common patterns include replication, partitioning, and consensus algorithms
Real-world applications include search engines, e-commerce platforms, and banking systems

By understanding these fundamental concepts, you're now better equipped to design and work with distributed systems in your own applications.

Further Learning

To deepen your understanding of distributed systems, consider exploring:

Distributed databases like Cassandra, MongoDB, and CockroachDB
Message queues like RabbitMQ and Apache Kafka
Container orchestration with Kubernetes
Service mesh technologies like Istio

Exercises

Design a simple distributed key-value store that replicates data across three nodes
Implement a basic load balancer that distributes requests across multiple servers
Create a simple distributed chat application using WebSockets
Experiment with different consistency models and observe their trade-offs
Implement a circuit breaker pattern for a REST API client

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Why Distributed Systems?​

Key Characteristics of Distributed Systems​

1. Resource Sharing​

2. Openness​

3. Concurrency​

4. Scalability​

5. Fault Tolerance​

6. Transparency​

Challenges in Distributed Systems​

Network Failures​

Latency​

Partial Failures​

Consistency​

Clock Synchronization​

CAP Theorem​

Key Concepts in Distributed Systems​

Distributed Computing Models​

Client-Server​

Peer-to-Peer (P2P)​

Microservices​

Replication​

Types of Replication:​

Consistency Models​

Strong Consistency​

Eventual Consistency​

Causal Consistency​

Partitioning (Sharding)​

Consensus Algorithms​

Paxos and Raft​

Distributed Transactions​

Two-Phase Commit (2PC)​

Real-World Distributed Systems Examples​

1. Google's Search Engine​

2. Amazon's E-commerce Platform​

3. Banking Systems​

Building a Simple Distributed Calculator​

Best Practices for Distributed Systems​

1. Design for Failure​

2. Implement Retry Logic​

3. Use Circuit Breakers​

4. Implement Health Checks​

5. Monitor and Log Everything​

6. Use Asynchronous Communication When Possible​

Summary​

Further Learning​

Exercises​