
TensorFlow Multi-Agent Reinforcement Learning

Introduction

Multi-Agent Reinforcement Learning (MARL) extends traditional reinforcement learning by incorporating multiple agents that learn simultaneously within a shared environment. Unlike single-agent reinforcement learning where one agent interacts with the environment, MARL deals with scenarios where multiple agents need to cooperate or compete to achieve their objectives.

In this tutorial, we'll explore how to implement MARL using TensorFlow, Google's open-source machine learning library. By the end of this guide, you'll understand:

  • The core concepts behind multi-agent reinforcement learning
  • The challenges specific to multi-agent learning scenarios
  • How to implement basic MARL algorithms using TensorFlow
  • How to evaluate and visualize multi-agent behavior

Prerequisites

Before diving in, you should have:

  • Basic understanding of reinforcement learning concepts
  • Familiarity with TensorFlow basics
  • Python programming experience
  • Understanding of neural network fundamentals

Let's start by installing the necessary libraries:

python
# Install required packages
!pip install tensorflow==2.12.0 matplotlib numpy "pettingzoo[mpe]"

# Import necessary libraries
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from pettingzoo.mpe import simple_spread_v3
import random
from collections import deque

Understanding Multi-Agent Reinforcement Learning

Key Concepts

  1. Multi-Agent Environment: An environment where multiple agents act simultaneously
  2. Cooperation vs. Competition: Agents may work together or against each other
  3. Partial Observability: Agents may have incomplete information about the environment
  4. Credit Assignment: Determining which agent's actions led to rewards
  5. Non-stationarity: The environment appears non-stationary from each agent's perspective since other agents are also learning and changing their behaviors

Types of MARL Scenarios

  1. Cooperative: Agents work together to achieve a common goal
  2. Competitive: Agents compete against each other for resources or objectives
  3. Mixed: Both cooperative and competitive elements exist

Building a Simple Multi-Agent System

Let's build a cooperative multi-agent system using TensorFlow and PettingZoo, a Python library that provides a unified API for multi-agent environments.

Step 1: Set Up the Environment

We'll use the simple_spread environment from PettingZoo's MPE (Multi-Particle Environment) collection, where agents must work together to cover all landmarks while avoiding collisions.

python
def create_environment():
    env = simple_spread_v3.env(N=3, local_ratio=0.5, max_cycles=25)
    env.reset()
    return env

env = create_environment()
print(f"Agents: {env.possible_agents}")
print(f"Observation space: {env.observation_space('agent_0')}")
print(f"Action space: {env.action_space('agent_0')}")

Output:

Agents: ['agent_0', 'agent_1', 'agent_2']
Observation space: Box(-inf, inf, (18,), float32)
Action space: Discrete(5)
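
Before adding learning agents, it helps to step through one episode with random actions, just to get a feel for PettingZoo's agent-environment cycle (AEC) API. This is a minimal sketch that reuses the env created above:

python
# Step through one episode with random actions to explore the AEC API
env.reset()
for agent_id in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None  # finished agents must be stepped with a None action
    else:
        action = env.action_space(agent_id).sample()  # random exploration for now
    env.step(action)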

Step 2: Implement a Simple Q-Network Agent

We'll create a Q-Network agent using TensorFlow:

python
class QAgent:
    def __init__(self, state_size, action_size, agent_id):
        self.state_size = state_size
        self.action_size = action_size
        self.agent_id = agent_id
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95        # discount rate
        self.epsilon = 1.0       # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(24, input_dim=self.state_size, activation='relu'),
            tf.keras.layers.Dense(24, activation='relu'),
            tf.keras.layers.Dense(self.action_size, activation='linear')
        ])
        model.compile(loss='mse',
                      optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # Epsilon-greedy action selection
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state.reshape(1, -1), verbose=0)
        return np.argmax(act_values[0])

    def replay(self, batch_size):
        if len(self.memory) < batch_size:
            return

        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = reward + self.gamma * np.amax(
                    self.model.predict(next_state.reshape(1, -1), verbose=0)[0])
            target_f = self.model.predict(state.reshape(1, -1), verbose=0)
            target_f[0][action] = target
            self.model.fit(state.reshape(1, -1), target_f, epochs=1, verbose=0)

        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
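
As a quick sanity check, we can instantiate a single agent with the environment's dimensions and ask it for an action on a random observation (a small sketch; at this point epsilon is 1.0, so the choice is purely exploratory):

python
# Sanity check: build one agent and query it for an action
state_size = env.observation_space('agent_0').shape[0]   # 18 for simple_spread with N=3
action_size = env.action_space('agent_0').n              # 5 discrete actions

test_agent = QAgent(state_size, action_size, 'agent_0')
dummy_observation = np.random.randn(state_size).astype(np.float32)
print("Chosen action:", test_agent.act(dummy_observation))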

Step 3: Set Up the Multi-Agent System

Now, let's create multiple agents and have them interact in the environment:

python
def train_marl(episodes=200):
    env = create_environment()
    state_size = env.observation_space('agent_0').shape[0]
    action_size = env.action_space('agent_0').n

    agents = {agent_id: QAgent(state_size, action_size, agent_id)
              for agent_id in env.possible_agents}

    batch_size = 32
    rewards_history = []

    for e in range(episodes):
        env.reset()
        total_reward = 0
        for agent_id in env.agent_iter():
            observation, reward, termination, truncation, info = env.last()
            done = termination or truncation

            if not done:
                action = agents[agent_id].act(observation)
                env.step(action)
                # Note: after step(), env.last() refers to the next agent in the cycle;
                # we use it here as a simple approximation of this agent's next observation.
                next_observation, next_reward, next_termination, next_truncation, next_info = env.last()

                # Store experience in replay memory
                agents[agent_id].remember(observation, action, reward, next_observation, done)

                # Train the agent
                if len(agents[agent_id].memory) > batch_size:
                    agents[agent_id].replay(batch_size)

                total_reward += reward
            else:
                env.step(None)  # Agent is done, pass None action

        rewards_history.append(total_reward / len(env.possible_agents))
        if e % 10 == 0:
            print(f"Episode: {e}, Avg Reward: {rewards_history[-1]:.3f}, Epsilon: {agents['agent_0'].epsilon:.3f}")

    return agents, rewards_history

# Train our multi-agent system
trained_agents, rewards = train_marl(episodes=100)

# Plot the rewards
plt.figure(figsize=(10, 6))
plt.plot(rewards)
plt.title('Average Reward per Episode')
plt.xlabel('Episode')
plt.ylabel('Average Reward')
plt.grid(True)
plt.show()
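
To check how well the trained agents cooperate, we can also run a few greedy episodes with exploration switched off. The sketch below assumes the create_environment helper and the trained_agents dictionary from above:

python
def evaluate_agents(agents, episodes=5):
    env = create_environment()
    episode_rewards = []

    # Temporarily disable exploration so agents act greedily
    saved_epsilons = {aid: a.epsilon for aid, a in agents.items()}
    for a in agents.values():
        a.epsilon = 0.0

    for _ in range(episodes):
        env.reset()
        total_reward = 0.0
        for agent_id in env.agent_iter():
            observation, reward, termination, truncation, info = env.last()
            total_reward += reward
            if termination or truncation:
                env.step(None)
            else:
                env.step(agents[agent_id].act(observation))
        episode_rewards.append(total_reward / len(env.possible_agents))

    # Restore exploration rates
    for aid, a in agents.items():
        a.epsilon = saved_epsilons[aid]
    return episode_rewards

print("Greedy evaluation rewards:", evaluate_agents(trained_agents))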

Challenges in Multi-Agent Reinforcement Learning

Non-Stationarity

In MARL, each agent faces a non-stationary environment because other agents are also learning and changing their policies. This means that what worked before may not work later.

Solution: Experience Replay with Importance Sampling

python
class ImprovedQAgent(QAgent):
    def __init__(self, state_size, action_size, agent_id):
        super().__init__(state_size, action_size, agent_id)
        # Initialize weights for all possible memory slots (memory maxlen is 2000)
        self.importance_weights = np.ones(2000) / 2000

    def replay(self, batch_size):
        if len(self.memory) < batch_size:
            return

        # Sample based on importance weights
        weights = self.importance_weights[:len(self.memory)]
        indices = np.random.choice(
            len(self.memory),
            batch_size,
            p=weights / np.sum(weights)
        )

        minibatch = [self.memory[i] for i in indices]
        errors = []

        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = reward + self.gamma * np.amax(
                    self.model.predict(next_state.reshape(1, -1), verbose=0)[0])
            target_f = self.model.predict(state.reshape(1, -1), verbose=0)
            old_val = target_f[0][action]
            target_f[0][action] = target

            # Calculate TD error for importance weighting
            error = abs(old_val - target)
            errors.append(error)

            self.model.fit(state.reshape(1, -1), target_f, epochs=1, verbose=0)

        # Update importance weights with the new TD errors
        for i, error in zip(indices, errors):
            self.importance_weights[i] = error + 0.01  # small constant avoids zero weights

        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
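
To try prioritized sampling in the earlier training loop, it is enough to swap the class used when the agent dictionary is built inside train_marl (a sketch of the one-line change):

python
# Inside train_marl, build the agents with the improved class instead of QAgent
agents = {agent_id: ImprovedQAgent(state_size, action_size, agent_id)
          for agent_id in env.possible_agents}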

Credit Assignment Problem

In cooperative scenarios, it's hard to determine which agent's actions led to good outcomes. This is known as the credit assignment problem.

Solution: Centralized Training with Decentralized Execution

python
class CentralizedCritic:
    def __init__(self, state_size, num_agents):
        self.state_size = state_size * num_agents  # Combined state of all agents
        self.num_agents = num_agents
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, input_dim=self.state_size, activation='relu'),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(1, activation='linear')  # Value function
        ])
        model.compile(loss='mse',
                      optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
        return model

    def predict_value(self, states):
        # states should be the concatenated observations of all agents
        return self.model.predict(states.reshape(1, -1), verbose=0)[0][0]

    def train(self, states, target_value):
        self.model.fit(states.reshape(1, -1), np.array([[target_value]]), epochs=1, verbose=0)
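
A sketch of how this critic could be used during training: concatenate every agent's observation into one joint state and regress the critic toward a one-step TD target built from the shared team reward. The variable values below are illustrative placeholders, not part of the earlier training loop:

python
# Illustrative use of the centralized critic (values are placeholders)
state_size = 18    # per-agent observation size in simple_spread
num_agents = 3
critic = CentralizedCritic(state_size, num_agents)

# Joint state: all agents' observations concatenated
joint_state = np.random.randn(state_size * num_agents).astype(np.float32)
next_joint_state = np.random.randn(state_size * num_agents).astype(np.float32)
team_reward = -1.2   # e.g. the summed reward of all agents at this step
gamma = 0.95

# One-step TD target from the shared reward and the critic's own bootstrap value
td_target = team_reward + gamma * critic.predict_value(next_joint_state)
critic.train(joint_state, td_target)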

Real-World Applications

Traffic Control System

Multi-agent reinforcement learning can be used to optimize traffic light timings across a city. Each traffic light acts as an agent that learns to manage traffic flow based on local observations.

python
# Simplified example of a traffic control MARL setup
class TrafficLightAgent(QAgent):
    def __init__(self, state_size, action_size, agent_id, neighbor_ids):
        # state_size must cover the combined (own + neighbor) state fed to the network
        super().__init__(state_size, action_size, agent_id)
        self.neighbor_ids = neighbor_ids  # IDs of neighboring traffic lights

    def get_cooperative_state(self, own_state, neighbor_states):
        # Combine own state with relevant neighbor states
        return np.concatenate([own_state] + list(neighbor_states))

    def cooperative_act(self, own_state, neighbor_states):
        cooperative_state = self.get_cooperative_state(own_state, neighbor_states)
        return self.act(cooperative_state)
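
For example, with a hypothetical 4-dimensional local observation (say, queue lengths per approach) and two neighbours, the agent's network must be built for the 12-dimensional combined state:

python
# Hypothetical sizes: 4 local features per intersection, 2 neighbours, 4 signal phases
local_state_size = 4
num_neighbors = 2
combined_state_size = local_state_size * (1 + num_neighbors)

light = TrafficLightAgent(combined_state_size, action_size=4,
                          agent_id='intersection_0',
                          neighbor_ids=['intersection_1', 'intersection_2'])

own_state = np.random.rand(local_state_size)
neighbor_states = [np.random.rand(local_state_size) for _ in range(num_neighbors)]
print("Selected signal phase:", light.cooperative_act(own_state, neighbor_states))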

Autonomous Vehicles

Multiple autonomous vehicles can learn to navigate efficiently in traffic while avoiding collisions and optimizing travel time.

python
def autonomous_vehicle_scenario():
    # Define a simplified grid environment for vehicles
    grid_size = 10
    num_vehicles = 5

    # Initialize vehicles at random positions
    vehicle_positions = np.random.randint(0, grid_size, size=(num_vehicles, 2))

    # Define target positions
    target_positions = np.random.randint(0, grid_size, size=(num_vehicles, 2))

    # Create agents for each vehicle
    agents = {}
    state_size = 4   # x, y, target_x, target_y
    action_size = 4  # up, down, left, right

    for i in range(num_vehicles):
        agents[f"vehicle_{i}"] = QAgent(state_size, action_size, f"vehicle_{i}")

    # Visualization function (simplified)
    def visualize_scenario():
        plt.figure(figsize=(8, 8))
        plt.scatter(vehicle_positions[:, 0], vehicle_positions[:, 1], c='blue', s=200, label='Vehicles')
        plt.scatter(target_positions[:, 0], target_positions[:, 1], c='green', s=200, marker='*', label='Targets')

        # Draw lines between vehicles and their assigned targets
        for i in range(num_vehicles):
            plt.plot([vehicle_positions[i, 0], target_positions[i, 0]],
                     [vehicle_positions[i, 1], target_positions[i, 1]], 'k--', alpha=0.3)

        plt.grid(True)
        plt.xlim(-1, grid_size)
        plt.ylim(-1, grid_size)
        plt.legend()
        plt.title("Autonomous Vehicles Scenario")
        plt.show()

    # Call visualization
    visualize_scenario()

    return agents, vehicle_positions, target_positions

Robotics and Swarm Intelligence

Multi-agent systems are often used in robotics, especially in swarm robotics where multiple robots need to coordinate to achieve complex tasks.

python
# Example of a swarm robotics task: cooperative object lifting
class SwarmAgent(QAgent):
    def __init__(self, state_size, action_size, agent_id, strength=1.0):
        super().__init__(state_size, action_size, agent_id)
        self.strength = strength  # How much weight this robot can lift

    def coordinate_lift(self, nearby_agents, object_weight):
        # Check if the combined strength of nearby agents is sufficient
        total_strength = self.strength + sum(agent.strength for agent in nearby_agents)
        return total_strength >= object_weight
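
A small illustration with made-up numbers: three robots of different strengths deciding whether they can jointly lift a 2.5-unit object:

python
# Hypothetical swarm: can three robots jointly lift a 2.5-unit object?
robots = [SwarmAgent(state_size=6, action_size=4, agent_id=f"robot_{i}", strength=s)
          for i, s in enumerate([1.0, 0.8, 1.2])]

can_lift = robots[0].coordinate_lift(nearby_agents=robots[1:], object_weight=2.5)
print("Combined lift possible:", can_lift)  # 1.0 + 0.8 + 1.2 = 3.0 >= 2.5 -> True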

Advanced Techniques in MARL

Parameter Sharing

When agents have the same action and observation spaces, we can use parameter sharing to improve learning efficiency.

python
def create_shared_network(state_size, action_size):
    shared_network = tf.keras.Sequential([
        tf.keras.layers.Dense(64, input_dim=state_size, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(action_size, activation='linear')
    ])
    shared_network.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
    return shared_network

# Create agents with shared parameters
def create_shared_agents(num_agents, state_size, action_size):
    shared_network = create_shared_network(state_size, action_size)

    agents = {}
    for i in range(num_agents):
        agent_id = f"agent_{i}"
        agents[agent_id] = QAgent(state_size, action_size, agent_id)
        agents[agent_id].model = shared_network  # every agent points at the same network

    return agents
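
Because every agent holds a reference to the same Keras model, a gradient update made through any one agent immediately affects all of them. A quick check confirms the sharing (a sketch assuming a 3-agent setup):

python
# Verify that the agents really share one set of weights
shared_agents = create_shared_agents(num_agents=3, state_size=18, action_size=5)
models = [a.model for a in shared_agents.values()]
print(all(m is models[0] for m in models))  # True: every agent points at the same network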

Curriculum Learning

We can use curriculum learning to gradually increase the difficulty of the tasks the agents need to solve.

python
def curriculum_training(num_episodes=1000):
    # Start with a simple environment
    difficulty = 0.1
    max_difficulty = 1.0
    difficulty_increment = 0.05
    threshold = -10.0  # reward level at which we consider the task "solved"; tune for your setup

    agents = {}  # Initialize your agents here

    for episode in range(num_episodes):
        # Adjust environment difficulty based on performance
        # (create_environment_with_difficulty and train_episode are placeholders to implement)
        env = create_environment_with_difficulty(difficulty)

        # Train for one episode
        reward = train_episode(env, agents)

        # Increase difficulty if agents are performing well
        if reward > threshold:
            difficulty = min(difficulty + difficulty_increment, max_difficulty)
            print(f"Increasing difficulty to {difficulty}")

        # Log progress
        if episode % 10 == 0:
            print(f"Episode {episode}, Reward: {reward}, Difficulty: {difficulty}")

Summary

In this tutorial, we've explored Multi-Agent Reinforcement Learning (MARL) using TensorFlow:

  1. We learned about the fundamental concepts in MARL including cooperation, competition, and the challenges specific to multi-agent systems.
  2. We implemented a basic multi-agent Q-learning system using TensorFlow.
  3. We addressed common challenges in MARL such as non-stationarity and the credit assignment problem.
  4. We explored real-world applications of MARL in traffic control, autonomous vehicles, and swarm robotics.
  5. We discussed advanced techniques like parameter sharing and curriculum learning.

MARL is a rapidly evolving field with many exciting research directions and practical applications. As systems become more complex and interconnected, the ability to model and train multiple learning agents becomes increasingly important.

Further Resources

  1. Books:

    • "Multi-Agent Machine Learning: A Reinforcement Learning Approach" by H.M. Schwartz
    • "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto
  2. Online Courses:

    • DeepMind's Multi-Agent RL Course
    • Berkeley's CS285: Deep Reinforcement Learning
  3. Libraries and Frameworks:

    • PettingZoo: the multi-agent environment API used in this tutorial
    • TF-Agents: TensorFlow's library for reinforcement learning

Exercises

  1. Modify the Q-learning agents to use a Deep Q-Network (DQN) architecture with experience replay.
  2. Implement a competitive MARL environment where agents compete for limited resources.
  3. Create a visualization tool to better understand agent behaviors and interactions.
  4. Extend the traffic control example to handle more realistic traffic patterns.
  5. Implement a multi-agent system using parameter sharing and compare its performance to independent learning.

Happy coding and exploring the fascinating world of Multi-Agent Reinforcement Learning!


