TensorFlow Multi-Agent Reinforcement Learning
Introduction
Multi-Agent Reinforcement Learning (MARL) extends traditional reinforcement learning by incorporating multiple agents that learn simultaneously within a shared environment. Unlike single-agent reinforcement learning where one agent interacts with the environment, MARL deals with scenarios where multiple agents need to cooperate or compete to achieve their objectives.
In this tutorial, we'll explore how to implement MARL using TensorFlow, Google's open-source machine learning library. By the end of this guide, you'll understand:
- The core concepts behind multi-agent reinforcement learning
- The challenges specific to multi-agent learning scenarios
- How to implement basic MARL algorithms using TensorFlow
- How to evaluate and visualize multi-agent behavior
Prerequisites
Before diving in, you should have:
- Basic understanding of reinforcement learning concepts
- Familiarity with TensorFlow basics
- Python programming experience
- Understanding of neural networks fundamentals
Let's start by installing the necessary libraries:
# Install required packages
!pip install tensorflow==2.12.0 matplotlib numpy "pettingzoo[mpe]"
# Import necessary libraries
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from pettingzoo.mpe import simple_spread_v3
import random
from collections import deque
Understanding Multi-Agent Reinforcement Learning
Key Concepts
- Multi-Agent Environment: An environment where multiple agents act simultaneously
- Cooperation vs. Competition: Agents may work together or against each other
- Partial Observability: Agents may have incomplete information about the environment
- Credit Assignment: Determining which agent's actions led to rewards
- Non-stationarity: The environment appears non-stationary from each agent's perspective since other agents are also learning and changing their behaviors
Types of MARL Scenarios
- Cooperative: Agents work together to achieve a common goal
- Competitive: Agents compete against each other for resources or objectives
- Mixed: Both cooperative and competitive elements exist
Building a Simple Multi-Agent System
Let's build a cooperative multi-agent system using TensorFlow and PettingZoo, a Python library that provides a unified API for multi-agent environments.
Step 1: Set Up the Environment
We'll use the simple_spread environment from PettingZoo's MPE (Multi-Particle Environment) collection, where agents must work together to cover all landmarks while avoiding collisions.
def create_environment():
env = simple_spread_v3.env(N=3, local_ratio=0.5, max_cycles=25)
env.reset()
return env
env = create_environment()
print(f"Agents: {env.possible_agents}")
print(f"Observation space: {env.observation_space('agent_0')}")
print(f"Action space: {env.action_space('agent_0')}")
Output:
Agents: ['agent_0', 'agent_1', 'agent_2']
Observation space: Box(-inf, inf, (18,), float32)
Action space: Discrete(5)
Step 2: Implement a Simple Q-Network Agent
We'll create a Q-Network agent using TensorFlow:
class QAgent:
def __init__(self, state_size, action_size, agent_id):
self.state_size = state_size
self.action_size = action_size
self.agent_id = agent_id
self.memory = deque(maxlen=2000)
self.gamma = 0.95 # discount rate
self.epsilon = 1.0 # exploration rate
self.epsilon_min = 0.01
self.epsilon_decay = 0.995
self.learning_rate = 0.001
self.model = self._build_model()
def _build_model(self):
model = tf.keras.Sequential([
tf.keras.layers.Dense(24, input_dim=self.state_size, activation='relu'),
tf.keras.layers.Dense(24, activation='relu'),
tf.keras.layers.Dense(self.action_size, activation='linear')
])
        model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
return model
def remember(self, state, action, reward, next_state, done):
self.memory.append((state, action, reward, next_state, done))
def act(self, state):
if np.random.rand() <= self.epsilon:
return random.randrange(self.action_size)
        act_values = self.model.predict(state.reshape(1, -1), verbose=0)
return np.argmax(act_values[0])
def replay(self, batch_size):
if len(self.memory) < batch_size:
return
minibatch = random.sample(self.memory, batch_size)
for state, action, reward, next_state, done in minibatch:
target = reward
if not done:
                target = reward + self.gamma * np.amax(self.model.predict(next_state.reshape(1, -1), verbose=0)[0])
            target_f = self.model.predict(state.reshape(1, -1), verbose=0)
            target_f[0][action] = target
            self.model.fit(state.reshape(1, -1), target_f, epochs=1, verbose=0)
if self.epsilon > self.epsilon_min:
self.epsilon *= self.epsilon_decay
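Before wiring agents into the environment loop, it can help to sanity-check a single agent on dummy data. A minimal sketch, assuming the 18-dimensional observations and 5 discrete actions reported by simple_spread above:
# Sanity-check a single agent on random data (shapes match simple_spread above)
agent = QAgent(state_size=18, action_size=5, agent_id="agent_0")

dummy_state = np.random.randn(18).astype(np.float32)
print(f"Chosen action: {agent.act(dummy_state)}")  # epsilon-greedy, so mostly random at first

# Store a few fake transitions, then run one replay update
dummy_next = np.random.randn(18).astype(np.float32)
for _ in range(64):
    agent.remember(dummy_state, 0, -1.0, dummy_next, False)
agent.replay(batch_size=32)  # one training pass over a sampled minibatch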
Step 3: Set Up the Multi-Agent System
Now, let's create multiple agents and have them interact in the environment:
def train_marl(episodes=200):
    env = create_environment()
    state_size = env.observation_space('agent_0').shape[0]
    action_size = env.action_space('agent_0').n
    agents = {agent_id: QAgent(state_size, action_size, agent_id)
              for agent_id in env.possible_agents}
    batch_size = 32
    rewards_history = []
    for e in range(episodes):
        env.reset()
        total_reward = 0
        # Buffer each agent's last (observation, action) so its transition can be
        # completed the next time the agent is selected; in the AEC API, env.last()
        # reports the observation and the reward accumulated since the agent last acted.
        last_transition = {agent_id: None for agent_id in env.possible_agents}
        for agent_id in env.agent_iter():
            observation, reward, termination, truncation, info = env.last()
            done = termination or truncation
            total_reward += reward
            # Complete and store the transition started on this agent's previous turn
            if last_transition[agent_id] is not None:
                prev_obs, prev_action = last_transition[agent_id]
                agents[agent_id].remember(prev_obs, prev_action, reward, observation, done)
                # Train the agent
                if len(agents[agent_id].memory) > batch_size:
                    agents[agent_id].replay(batch_size)
            if done:
                env.step(None)  # Agent is done, pass None action
                last_transition[agent_id] = None
            else:
                action = agents[agent_id].act(observation)
                env.step(action)
                last_transition[agent_id] = (observation, action)
        rewards_history.append(total_reward / len(env.possible_agents))
        if e % 10 == 0:
            print(f"Episode: {e}, Avg Reward: {rewards_history[-1]:.3f}, Epsilon: {agents['agent_0'].epsilon:.3f}")
    return agents, rewards_history
# Train our multi-agent system
trained_agents, rewards = train_marl(episodes=100)
# Plot the rewards
plt.figure(figsize=(10, 6))
plt.plot(rewards)
plt.title('Average Reward per Episode')
plt.xlabel('Episode')
plt.ylabel('Average Reward')
plt.grid(True)
plt.show()
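Per-episode rewards in MARL are typically noisy, so a moving average often makes the learning trend easier to read. A small sketch using the rewards list returned above (the 10-episode window is an arbitrary choice):
# Smooth the reward curve with a 10-episode moving average
window = 10
smoothed = np.convolve(rewards, np.ones(window) / window, mode='valid')

plt.figure(figsize=(10, 6))
plt.plot(rewards, alpha=0.3, label='Raw')
plt.plot(range(window - 1, len(rewards)), smoothed, label=f'{window}-episode average')
plt.title('Smoothed Average Reward per Episode')
plt.xlabel('Episode')
plt.ylabel('Average Reward')
plt.legend()
plt.grid(True)
plt.show()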
Challenges in Multi-Agent Reinforcement Learning
Non-Stationarity
In MARL, each agent faces a non-stationary environment because other agents are also learning and changing their policies. This means that what worked before may not work later.
Solution: Prioritized Experience Replay
One simple mitigation is to sample transitions in proportion to their TD error, so the agent revisits experiences whose value estimates have drifted as the other agents change their behavior:
class ImprovedQAgent(QAgent):
def __init__(self, state_size, action_size, agent_id):
super().__init__(state_size, action_size, agent_id)
self.importance_weights = np.ones(2000) / 2000 # Initialize weights for all possible memories
def replay(self, batch_size):
if len(self.memory) < batch_size:
return
# Sample based on importance weights
indices = np.random.choice(
min(len(self.memory), 2000),
batch_size,
p=self.importance_weights[:len(self.memory)] / np.sum(self.importance_weights[:len(self.memory)])
)
minibatch = [self.memory[i] for i in indices]
errors = []
for state, action, reward, next_state, done in minibatch:
target = reward
if not done:
                target = reward + self.gamma * np.amax(self.model.predict(next_state.reshape(1, -1), verbose=0)[0])
            target_f = self.model.predict(state.reshape(1, -1), verbose=0)
old_val = target_f[0][action]
target_f[0][action] = target
# Calculate TD error for importance weighting
error = abs(old_val - target)
errors.append(error)
self.model.fit(state.reshape(1, -1), target_f, epochs=1, verbose=0)
# Update importance weights
for i, error in zip(indices, errors):
self.importance_weights[i] = error + 0.01 # Add small constant to avoid zero weights
if self.epsilon > self.epsilon_min:
self.epsilon *= self.epsilon_decay
Credit Assignment Problem
In cooperative scenarios, it's hard to determine which agent's actions led to good outcomes. This is known as the credit assignment problem.
Solution: Centralized Training with Decentralized Execution
During training, a centralized critic can observe the joint state of all agents and estimate a team value, while each agent still acts only on its own local observation at execution time:
class CentralizedCritic:
def __init__(self, state_size, num_agents):
self.state_size = state_size * num_agents # Combined state of all agents
self.num_agents = num_agents
self.learning_rate = 0.001
self.model = self._build_model()
def _build_model(self):
model = tf.keras.Sequential([
tf.keras.layers.Dense(64, input_dim=self.state_size, activation='relu'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(1, activation='linear') # Value function
])
        model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
return model
def predict_value(self, states):
# states should be a combined state from all agents
        return self.model.predict(states.reshape(1, -1), verbose=0)[0][0]
def train(self, states, target_value):
self.model.fit(states.reshape(1, -1), np.array([[target_value]]), epochs=1, verbose=0)
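To give a feel for how this fits together, here is a minimal sketch (not part of the training loop above) of building a joint state from per-agent observations and training the critic on a one-step bootstrapped target; the variable names are illustrative:
# Illustrative use of the centralized critic with 3 agents and 18-dim observations
num_agents, obs_dim = 3, 18
critic = CentralizedCritic(state_size=obs_dim, num_agents=num_agents)

# Joint state: concatenation of every agent's observation at time t and t+1
joint_state = np.random.randn(num_agents * obs_dim).astype(np.float32)
next_joint_state = np.random.randn(num_agents * obs_dim).astype(np.float32)
team_reward = -1.5   # e.g. the summed reward of all agents at this step
gamma = 0.95

# One-step TD target using the critic's own value estimate of the next joint state
td_target = team_reward + gamma * critic.predict_value(next_joint_state)
critic.train(joint_state, td_target)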
Real-World Applications
Traffic Control System
Multi-agent reinforcement learning can be used to optimize traffic light timings across a city. Each traffic light acts as an agent that learns to manage traffic flow based on local observations.
# Simplified example of a traffic control MARL setup
class TrafficLightAgent(QAgent):
    def __init__(self, state_size, action_size, agent_id, neighbor_ids):
        # state_size must be the size of the *combined* cooperative state
        # (own observation plus all neighbor observations), since that is
        # what the Q-network receives in cooperative_act
        super().__init__(state_size, action_size, agent_id)
        self.neighbor_ids = neighbor_ids  # IDs of neighboring traffic lights

    def get_cooperative_state(self, own_state, neighbor_states):
        # Combine own state with relevant neighbor states
        return np.concatenate([own_state] + list(neighbor_states))

    def cooperative_act(self, own_state, neighbor_states):
        cooperative_state = self.get_cooperative_state(own_state, neighbor_states)
        return self.act(cooperative_state)
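A hypothetical usage sketch: an intersection with two neighbors, where each light observes a 4-dimensional local state (say, queue lengths per approach), so the cooperative state fed to the network is 12-dimensional. The sizes and IDs here are illustrative, not part of any real traffic dataset:
# Hypothetical intersection with two neighboring lights
local_state_size = 4                     # e.g. queue length on each approach
num_neighbors = 2
cooperative_state_size = local_state_size * (1 + num_neighbors)

light = TrafficLightAgent(state_size=cooperative_state_size,
                          action_size=2,  # e.g. which road gets the green phase
                          agent_id="light_0",
                          neighbor_ids=["light_1", "light_2"])

own_state = np.random.rand(local_state_size)
neighbor_states = [np.random.rand(local_state_size) for _ in range(num_neighbors)]
print(f"Chosen phase: {light.cooperative_act(own_state, neighbor_states)}")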
Autonomous Vehicles
Multiple autonomous vehicles can learn to navigate efficiently in traffic while avoiding collisions and optimizing travel time.
def autonomous_vehicle_scenario():
# Define a simplified grid environment for vehicles
grid_size = 10
num_vehicles = 5
# Initialize vehicles at random positions
vehicle_positions = np.random.randint(0, grid_size, size=(num_vehicles, 2))
# Define target positions
target_positions = np.random.randint(0, grid_size, size=(num_vehicles, 2))
# Create agents for each vehicle
agents = {}
state_size = 4 # x, y, target_x, target_y
action_size = 4 # up, down, left, right
for i in range(num_vehicles):
agents[f"vehicle_{i}"] = QAgent(state_size, action_size, f"vehicle_{i}")
# Visualization function (simplified)
def visualize_scenario():
plt.figure(figsize=(8, 8))
plt.scatter(vehicle_positions[:, 0], vehicle_positions[:, 1], c='blue', s=200, label='Vehicles')
plt.scatter(target_positions[:, 0], target_positions[:, 1], c='green', s=200, marker='*', label='Targets')
# Draw lines between vehicles and targets
for i in range(num_vehicles):
plt.plot([vehicle_positions[i, 0], target_positions[i, 0]],
[vehicle_positions[i, 1], target_positions[i, 1]], 'k--', alpha=0.3)
plt.grid(True)
plt.xlim(-1, grid_size)
plt.ylim(-1, grid_size)
plt.legend()
plt.title("Autonomous Vehicles Scenario")
plt.show()
# Call visualization
visualize_scenario()
return agents, vehicle_positions, target_positions
Robotics and Swarm Intelligence
Multi-agent systems are often used in robotics, especially in swarm robotics where multiple robots need to coordinate to achieve complex tasks.
# Example of a swarm robotics task: cooperative object lifting
class SwarmAgent(QAgent):
def __init__(self, state_size, action_size, agent_id, strength=1.0):
super().__init__(state_size, action_size, agent_id)
self.strength = strength # How much weight this robot can lift
def coordinate_lift(self, nearby_agents, object_weight):
# Check if the combined strength of nearby agents is sufficient
total_strength = self.strength + sum(agent.strength for agent in nearby_agents)
return total_strength >= object_weight
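A quick illustration of the coordination check (the state and action sizes are placeholders):
# Three robots with different lifting strengths try to move a heavy object
robots = [SwarmAgent(state_size=6, action_size=5, agent_id=f"robot_{i}", strength=s)
          for i, s in enumerate([1.0, 1.5, 2.0])]

object_weight = 4.0
leader = robots[0]
can_lift = leader.coordinate_lift(nearby_agents=robots[1:], object_weight=object_weight)
print(f"Combined strength sufficient: {can_lift}")  # 1.0 + 1.5 + 2.0 >= 4.0 -> True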
Advanced Techniques in MARL
Parameter Sharing
When agents have the same action and observation spaces, we can use parameter sharing to improve learning efficiency.
def create_shared_network(state_size, action_size):
shared_network = tf.keras.Sequential([
tf.keras.layers.Dense(64, input_dim=state_size, activation='relu'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(action_size, activation='linear')
])
    shared_network.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
return shared_network
# Create agents with shared parameters
def create_shared_agents(num_agents, state_size, action_size):
    shared_network = create_shared_network(state_size, action_size)
    agents = {}
    for i in range(num_agents):
        agent_id = f"agent_{i}"
        agents[agent_id] = QAgent(state_size, action_size, agent_id)
        # Replace the per-agent network built in QAgent.__init__ with the
        # shared network, so all agents train the same set of weights
        agents[agent_id].model = shared_network
    return agents
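Because every agent now holds a reference to the same Keras model, an update made while training one agent is immediately visible to all of them. A quick check:
# All agents reference the same underlying Keras model object
shared_agents = create_shared_agents(num_agents=3, state_size=18, action_size=5)
print(shared_agents["agent_0"].model is shared_agents["agent_2"].model)  # True

# Consequently they produce identical Q-values for the same observation
state = np.random.randn(18).astype(np.float32)
q0 = shared_agents["agent_0"].model.predict(state.reshape(1, -1), verbose=0)
q2 = shared_agents["agent_2"].model.predict(state.reshape(1, -1), verbose=0)
print(np.allclose(q0, q2))  # True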
Curriculum Learning
We can use curriculum learning to gradually increase the difficulty of the tasks the agents need to solve.
def curriculum_training(num_episodes=1000, threshold=0.0):
    # Start with a simple environment and gradually raise the difficulty
    difficulty = 0.1
    max_difficulty = 1.0
    difficulty_increment = 0.05
    agents = {}  # Initialize your agents here
    for episode in range(num_episodes):
        # create_environment_with_difficulty and train_episode are placeholders
        # for your own environment factory and per-episode training loop
        env = create_environment_with_difficulty(difficulty)
        reward = train_episode(env, agents)
        # Increase difficulty if the agents are performing above the threshold
        if reward > threshold:
            difficulty = min(difficulty + difficulty_increment, max_difficulty)
            print(f"Increasing difficulty to {difficulty:.2f}")
        # Log progress
        if episode % 10 == 0:
            print(f"Episode {episode}, Reward: {reward}, Difficulty: {difficulty:.2f}")
Summary
In this tutorial, we've explored Multi-Agent Reinforcement Learning (MARL) using TensorFlow:
- We learned about the fundamental concepts in MARL including cooperation, competition, and the challenges specific to multi-agent systems.
- We implemented a basic multi-agent Q-learning system using TensorFlow.
- We addressed common challenges in MARL such as non-stationarity and the credit assignment problem.
- We explored real-world applications of MARL in traffic control, autonomous vehicles, and swarm robotics.
- We discussed advanced techniques like parameter sharing and curriculum learning.
MARL is a rapidly evolving field with many exciting research directions and practical applications. As systems become more complex and interconnected, the ability to model and train multiple learning agents becomes increasingly important.
Further Resources
- Books:
  - "Multi-Agent Machine Learning: A Reinforcement Learning Approach" by H.M. Schwartz
  - "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto
- Online Courses:
  - DeepMind's Multi-Agent RL Course
  - Berkeley's CS285: Deep Reinforcement Learning
- Libraries and Frameworks:
  - PettingZoo - Multi-agent environments
  - Ray RLlib - Scalable RL library with MARL support
Exercises
- Modify the Q-learning agents to use a Deep Q-Network (DQN) architecture with experience replay.
- Implement a competitive MARL environment where agents compete for limited resources.
- Create a visualization tool to better understand agent behaviors and interactions.
- Extend the traffic control example to handle more realistic traffic patterns.
- Implement a multi-agent system using parameter sharing and compare its performance to independent learning.
Happy coding and exploring the fascinating world of Multi-Agent Reinforcement Learning!