TensorFlow Multi-Agent Reinforcement Learning
Introduction
Multi-Agent Reinforcement Learning (MARL) extends traditional reinforcement learning by incorporating multiple agents that learn simultaneously within a shared environment. Unlike single-agent reinforcement learning where one agent interacts with the environment, MARL deals with scenarios where multiple agents need to cooperate or compete to achieve their objectives.
In this tutorial, we'll explore how to implement MARL using TensorFlow, Google's open-source machine learning library. By the end of this guide, you'll understand:
- The core concepts behind multi-agent reinforcement learning
- The challenges specific to multi-agent learning scenarios
- How to implement basic MARL algorithms using TensorFlow
- How to evaluate and visualize multi-agent behavior
Prerequisites
Before diving in, you should have:
- Basic understanding of reinforcement learning concepts
- Familiarity with TensorFlow basics
- Python programming experience
- Understanding of neural networks fundamentals
Let's start by installing the necessary libraries:
# Install required packages
!pip install tensorflow==2.12.0 matplotlib numpy "pettingzoo[mpe]"
# Import necessary libraries
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from pettingzoo.mpe import simple_spread_v3
import random
from collections import deque
Understanding Multi-Agent Reinforcement Learning
Key Concepts
- Multi-Agent Environment: An environment where multiple agents act simultaneously
- Cooperation vs. Competition: Agents may work together or against each other
- Partial Observability: Agents may have incomplete information about the environment
- Credit Assignment: Determining which agent's actions led to rewards
- Non-stationarity: The environment appears non-stationary from each agent's perspective since other agents are also learning and changing their behaviors
Types of MARL Scenarios
- Cooperative: Agents work together to achieve a common goal
- Competitive: Agents compete against each other for resources or objectives
- Mixed: Both cooperative and competitive elements exist
Building a Simple Multi-Agent System
Let's build a cooperative multi-agent system using TensorFlow and PettingZoo, a Python library that provides a unified API for multi-agent environments.
Step 1: Set Up the Environment
We'll use the simple_spread environment from PettingZoo's MPE (Multi-Particle Environment) collection, where agents must work together to cover all landmarks while avoiding collisions.
def create_environment():
env = simple_spread_v3.env(N=3, local_ratio=0.5, max_cycles=25)
env.reset()
return env
env = create_environment()
print(f"Agents: {env.possible_agents}")
print(f"Observation space: {env.observation_space('agent_0')}")
print(f"Action space: {env.action_space('agent_0')}")
Output:
Agents: ['agent_0', 'agent_1', 'agent_2']
Observation space: Box(-inf, inf, (18,), float32)
Action space: Discrete(5)
Step 2: Implement a Simple Q-Network Agent
We'll create a Q-Network agent using TensorFlow:
class QAgent:
def __init__(self, state_size, action_size, agent_id):
self.state_size = state_size
self.action_size = action_size
self.agent_id = agent_id
self.memory = deque(maxlen=2000)
self.gamma = 0.95 # discount rate
self.epsilon = 1.0 # exploration rate
self.epsilon_min = 0.01
self.epsilon_decay = 0.995
self.learning_rate = 0.001
self.model = self._build_model()
def _build_model(self):
model = tf.keras.Sequential([
tf.keras.layers.Dense(24, input_dim=self.state_size, activation='relu'),
tf.keras.layers.Dense(24, activation='relu'),
tf.keras.layers.Dense(self.action_size, activation='linear')
])
        model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
return model
def remember(self, state, action, reward, next_state, done):
self.memory.append((state, action, reward, next_state, done))
def act(self, state):
if np.random.rand() <= self.epsilon:
return random.randrange(self.action_size)
        act_values = self.model.predict(state.reshape(1, -1), verbose=0)
return np.argmax(act_values[0])
def replay(self, batch_size):
if len(self.memory) < batch_size:
return
minibatch = random.sample(self.memory, batch_size)
for state, action, reward, next_state, done in minibatch:
target = reward
if not done:
                target = reward + self.gamma * np.amax(self.model.predict(next_state.reshape(1, -1), verbose=0)[0])
            target_f = self.model.predict(state.reshape(1, -1), verbose=0)
            target_f[0][action] = target
            self.model.fit(state.reshape(1, -1), target_f, epochs=1, verbose=0)
if self.epsilon > self.epsilon_min:
self.epsilon *= self.epsilon_decay
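Before wiring agents into the environment loop, it can help to sanity-check a single agent on dummy data. A minimal sketch, assuming the 18-dimensional observations and 5 discrete actions reported by simple_spread above:
# Sanity-check a single agent on random data (shapes match simple_spread above)
agent = QAgent(state_size=18, action_size=5, agent_id="agent_0")

dummy_state = np.random.randn(18).astype(np.float32)
print(f"Chosen action: {agent.act(dummy_state)}")  # epsilon-greedy, so mostly random at first

# Store a few fake transitions, then run one replay update
dummy_next = np.random.randn(18).astype(np.float32)
for _ in range(64):
    agent.remember(dummy_state, 0, -1.0, dummy_next, False)
agent.replay(batch_size=32)  # one training pass over a sampled minibatch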
Step 3: Set Up the Multi-Agent System
Now, let's create multiple agents and have them interact in the environment:
def train_marl(episodes=200):
    env = create_environment()
    state_size = env.observation_space('agent_0').shape[0]
    action_size = env.action_space('agent_0').n
    agents = {agent_id: QAgent(state_size, action_size, agent_id)
              for agent_id in env.possible_agents}
    batch_size = 32
    rewards_history = []
    for e in range(episodes):
        env.reset()
        total_reward = 0
        # Buffer each agent's last (observation, action) so its transition can be
        # completed the next time the agent is selected; in the AEC API, env.last()
        # reports the observation and the reward accumulated since the agent last acted.
        last_transition = {agent_id: None for agent_id in env.possible_agents}
        for agent_id in env.agent_iter():
            observation, reward, termination, truncation, info = env.last()
            done = termination or truncation
            total_reward += reward
            # Complete and store the transition started on this agent's previous turn
            if last_transition[agent_id] is not None:
                prev_obs, prev_action = last_transition[agent_id]
                agents[agent_id].remember(prev_obs, prev_action, reward, observation, done)
                # Train the agent
                if len(agents[agent_id].memory) > batch_size:
                    agents[agent_id].replay(batch_size)
            if done:
                env.step(None)  # Agent is done, pass None action
                last_transition[agent_id] = None
            else:
                action = agents[agent_id].act(observation)
                env.step(action)
                last_transition[agent_id] = (observation, action)
        rewards_history.append(total_reward / len(env.possible_agents))
        if e % 10 == 0:
            print(f"Episode: {e}, Avg Reward: {rewards_history[-1]:.3f}, Epsilon: {agents['agent_0'].epsilon:.3f}")
    return agents, rewards_history
# Train our multi-agent system
trained_agents, rewards = train_marl(episodes=100)
# Plot the rewards
plt.figure(figsize=(10, 6))
plt.plot(rewards)
plt.title('Average Reward per Episode')
plt.xlabel('Episode')
plt.ylabel('Average Reward')
plt.grid(True)
plt.show()
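Per-episode rewards in MARL are typically noisy, so a moving average often makes the learning trend easier to read. A small sketch using the rewards list returned above (the 10-episode window is an arbitrary choice):
# Smooth the reward curve with a 10-episode moving average
window = 10
smoothed = np.convolve(rewards, np.ones(window) / window, mode='valid')

plt.figure(figsize=(10, 6))
plt.plot(rewards, alpha=0.3, label='Raw')
plt.plot(range(window - 1, len(rewards)), smoothed, label=f'{window}-episode average')
plt.title('Smoothed Average Reward per Episode')
plt.xlabel('Episode')
plt.ylabel('Average Reward')
plt.legend()
plt.grid(True)
plt.show()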
Challenges in Multi-Agent Reinforcement Learning
Non-Stationarity
In MARL, each agent faces a non-stationary environment because other agents are also learning and changing their policies. This means that what worked before may not work later.
Solution: Prioritized Experience Replay
One simple mitigation is to sample transitions in proportion to their TD error, so the agent revisits experiences whose value estimates have drifted as the other agents change their behavior:
class ImprovedQAgent(QAgent):
def __init__(self, state_size, action_size, agent_id):
super().__init__(state_size, action_size, agent_id)
self.importance_weights = np.ones(2000) / 2000 # Initialize weights for all possible memories
def replay(self, batch_size):
if len(self.memory) < batch_size:
return
# Sample based on importance weights
indices = np.random.choice(
min(len(self.memory), 2000),
batch_size,
p=self.importance_weights[:len(self.memory)] / np.sum(self.importance_weights[:len(self.memory)])
)
minibatch = [self.memory[i] for i in indices]
errors = []
for state, action, reward, next_state, done in minibatch:
target = reward
if not done:
                target = reward + self.gamma * np.amax(self.model.predict(next_state.reshape(1, -1), verbose=0)[0])
            target_f = self.model.predict(state.reshape(1, -1), verbose=0)
old_val = target_f[0][action]
target_f[0][action] = target
# Calculate TD error for importance weighting
error = abs(old_val - target)
errors.append(error)
self.model.fit(state.reshape(1, -1), target_f, epochs=1, verbose=0)
# Update importance weights
for i, error in zip(indices, errors):
self.importance_weights[i] = error + 0.01 # Add small constant to avoid zero weights
if self.epsilon > self.epsilon_min:
self.epsilon *= self.epsilon_decay
Credit Assignment Problem
In cooperative scenarios, it's hard to determine which agent's actions led to good outcomes. This is known as the credit assignment problem.
Solution: Centralized Training with Decentralized Execution
During training, a centralized critic can observe the joint state of all agents and estimate a team value, while each agent still acts only on its own local observation at execution time:
class CentralizedCritic:
def __init__(self, state_size, num_agents):
self.state_size = state_size * num_agents # Combined state of all agents
self.num_agents = num_agents
self.learning_rate = 0.001
self.model = self._build_model()
def _build_model(self):
model = tf.keras.Sequential([
tf.keras.layers.Dense(64, input_dim=self.state_size, activation='relu'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(1, activation='linear') # Value function
])
        model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
return model
def predict_value(self, states):
# states should be a combined state from all agents
        return self.model.predict(states.reshape(1, -1), verbose=0)[0][0]
def train(self, states, target_value):
self.model.fit(states.reshape(1, -1), np.array([[target_value]]), epochs=1, verbose=0)
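To give a feel for how this fits together, here is a minimal sketch (not part of the training loop above) of building a joint state from per-agent observations and training the critic on a one-step bootstrapped target; the variable names are illustrative:
# Illustrative use of the centralized critic with 3 agents and 18-dim observations
num_agents, obs_dim = 3, 18
critic = CentralizedCritic(state_size=obs_dim, num_agents=num_agents)

# Joint state: concatenation of every agent's observation at time t and t+1
joint_state = np.random.randn(num_agents * obs_dim).astype(np.float32)
next_joint_state = np.random.randn(num_agents * obs_dim).astype(np.float32)
team_reward = -1.5   # e.g. the summed reward of all agents at this step
gamma = 0.95

# One-step TD target using the critic's own value estimate of the next joint state
td_target = team_reward + gamma * critic.predict_value(next_joint_state)
critic.train(joint_state, td_target)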
Real-World Applications
Traffic Control System
Multi-agent reinforcement learning can be used to optimize traffic light timings across a city. Each traffic light acts as an agent that learns to manage traffic flow based on local observations.
# Simplified example of a traffic control MARL setup
class TrafficLightAgent(QAgent):
    def __init__(self, state_size, action_size, agent_id, neighbor_ids):
        # state_size must be the size of the *combined* cooperative state
        # (own observation plus all neighbor observations), since that is
        # what the Q-network receives in cooperative_act
        super().__init__(state_size, action_size, agent_id)
        self.neighbor_ids = neighbor_ids  # IDs of neighboring traffic lights

    def get_cooperative_state(self, own_state, neighbor_states):
        # Combine own state with relevant neighbor states
        return np.concatenate([own_state] + list(neighbor_states))

    def cooperative_act(self, own_state, neighbor_states):
        cooperative_state = self.get_cooperative_state(own_state, neighbor_states)
        return self.act(cooperative_state)
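A hypothetical usage sketch: an intersection with two neighbors, where each light observes a 4-dimensional local state (say, queue lengths per approach), so the cooperative state fed to the network is 12-dimensional. The sizes and IDs here are illustrative, not part of any real traffic dataset:
# Hypothetical intersection with two neighboring lights
local_state_size = 4                     # e.g. queue length on each approach
num_neighbors = 2
cooperative_state_size = local_state_size * (1 + num_neighbors)

light = TrafficLightAgent(state_size=cooperative_state_size,
                          action_size=2,  # e.g. which road gets the green phase
                          agent_id="light_0",
                          neighbor_ids=["light_1", "light_2"])

own_state = np.random.rand(local_state_size)
neighbor_states = [np.random.rand(local_state_size) for _ in range(num_neighbors)]
print(f"Chosen phase: {light.cooperative_act(own_state, neighbor_states)}")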
Autonomous Vehicles
Multiple autonomous vehicles can learn to navigate efficiently in traffic while avoiding collisions and optimizing travel time.
def autonomous_vehicle_scenario():
# Define a simplified grid environment for vehicles
grid_size = 10
num_vehicles = 5
# Initialize vehicles at random positions
vehicle_positions = np.random.randint(0, grid_size, size=(num_vehicles, 2))
# Define target positions
target_positions = np.random.randint(0, grid_size, size=(num_vehicles, 2))
# Create agents for each vehicle
agents = {}
state_size = 4 # x, y, target_x, target_y
action_size = 4 # up, down, left, right
for i in range(num_vehicles):
agents[f"vehicle_{i}"] = QAgent(state_size, action_size, f"vehicle_{i}")
# Visualization function (simplified)
def visualize_scenario():
plt.figure(figsize=(8, 8))
plt.scatter(vehicle_positions[:, 0], vehicle_positions[:, 1], c='blue', s=200, label='Vehicles')
plt.scatter(target_positions[:, 0], target_positions[:, 1], c='green', s=200, marker='*', label='Targets')
# Draw lines between vehicles and targets
for i in range(num_vehicles):
plt.plot([vehicle_positions[i, 0], target_positions[i, 0]],
[vehicle_positions[i, 1], target_positions[i, 1]], 'k--', alpha=0.3)
plt.grid(True)
plt.xlim(-1, grid_size)
plt.ylim(-1, grid_size)
plt.legend()
plt.title("Autonomous Vehicles Scenario")
plt.show()
# Call visualization
visualize_scenario()
return agents, vehicle_positions, target_positions
Robotics and Swarm Intelligence
Multi-agent systems are often used in robotics, especially in swarm robotics where multiple robots need to coordinate to achieve complex tasks.
# Example of a swarm robotics task: cooperative object lifting
class SwarmAgent(QAgent):
def __init__(self, state_size, action_size, agent_id, strength=1.0):
super().__init__(state_size, action_size, agent_id)
self.strength = strength # How much weight this robot can lift
def coordinate_lift(self, nearby_agents, object_weight):
# Check if the combined strength of nearby agents is sufficient
total_strength = self.strength + sum(agent.strength for agent in nearby_agents)
return total_strength >= object_weight
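A quick illustration of the coordination check (the state and action sizes are placeholders):
# Three robots with different lifting strengths try to move a heavy object
robots = [SwarmAgent(state_size=6, action_size=5, agent_id=f"robot_{i}", strength=s)
          for i, s in enumerate([1.0, 1.5, 2.0])]

object_weight = 4.0
leader = robots[0]
can_lift = leader.coordinate_lift(nearby_agents=robots[1:], object_weight=object_weight)
print(f"Combined strength sufficient: {can_lift}")  # 1.0 + 1.5 + 2.0 >= 4.0 -> True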
Advanced Techniques in MARL
Parameter Sharing
When agents have the same action and observation spaces, we can use parameter sharing to improve learning efficiency.
def create_shared_network(state_size, action_size):
shared_network = tf.keras.Sequential([
tf.keras.layers.Dense(64, input_dim=state_size, activation='relu'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(action_size, activation='linear')
])
    shared_network.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
return shared_network
# Create agents with shared parameters
def create_shared_agents(num_agents, state_size, action_size):
    shared_network = create_shared_network(state_size, action_size)
    agents = {}
    for i in range(num_agents):
        agent_id = f"agent_{i}"
        agents[agent_id] = QAgent(state_size, action_size, agent_id)
        # Replace the per-agent network built in QAgent.__init__ with the
        # shared network, so all agents train the same set of weights
        agents[agent_id].model = shared_network
    return agents
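Because every agent now holds a reference to the same Keras model, an update made while training one agent is immediately visible to all of them. A quick check:
# All agents reference the same underlying Keras model object
shared_agents = create_shared_agents(num_agents=3, state_size=18, action_size=5)
print(shared_agents["agent_0"].model is shared_agents["agent_2"].model)  # True

# Consequently they produce identical Q-values for the same observation
state = np.random.randn(18).astype(np.float32)
q0 = shared_agents["agent_0"].model.predict(state.reshape(1, -1), verbose=0)
q2 = shared_agents["agent_2"].model.predict(state.reshape(1, -1), verbose=0)
print(np.allclose(q0, q2))  # True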
Curriculum Learning
We can use curriculum learning to gradually increase the difficulty of the tasks the agents need to solve.
def curriculum_training(num_episodes=1000, threshold=0.0):
    # Start with a simple environment and gradually raise the difficulty
    difficulty = 0.1
    max_difficulty = 1.0
    difficulty_increment = 0.05
    agents = {}  # Initialize your agents here
    for episode in range(num_episodes):
        # create_environment_with_difficulty and train_episode are placeholders
        # for your own environment factory and per-episode training loop
        env = create_environment_with_difficulty(difficulty)
        reward = train_episode(env, agents)
        # Increase difficulty if the agents are performing above the threshold
        if reward > threshold:
            difficulty = min(difficulty + difficulty_increment, max_difficulty)
            print(f"Increasing difficulty to {difficulty:.2f}")
        # Log progress
        if episode % 10 == 0:
            print(f"Episode {episode}, Reward: {reward}, Difficulty: {difficulty:.2f}")
Summary
In this tutorial, we've explored Multi-Agent Reinforcement Learning (MARL) using TensorFlow:
- We learned about the fundamental concepts in MARL including cooperation, competition, and the challenges specific to multi-agent systems.
- We implemented a basic multi-agent Q-learning system using TensorFlow.
- We addressed common challenges in MARL such as non-stationarity and the credit assignment problem.
- We explored real-world applications of MARL in traffic control, autonomous vehicles, and swarm robotics.
- We discussed advanced techniques like parameter sharing and curriculum learning.
MARL is a rapidly evolving field with many exciting research directions and practical applications. As systems become more complex and interconnected, the ability to model and train multiple learning agents becomes increasingly important.
Further Resources
- Books:
  - "Multi-Agent Machine Learning: A Reinforcement Learning Approach" by H.M. Schwartz
  - "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto
- Online Courses:
  - DeepMind's Multi-Agent RL Course
  - Berkeley's CS285: Deep Reinforcement Learning
- Libraries and Frameworks:
  - PettingZoo - Multi-agent environments
  - Ray RLlib - Scalable RL library with MARL support
Exercises
- Modify the Q-learning agents to use a Deep Q-Network (DQN) architecture with experience replay.
- Implement a competitive MARL environment where agents compete for limited resources.
- Create a visualization tool to better understand agent behaviors and interactions.
- Extend the traffic control example to handle more realistic traffic patterns.
- Implement a multi-agent system using parameter sharing and compare its performance to independent learning.
Happy coding and exploring the fascinating world of Multi-Agent Reinforcement Learning!