TensorFlow Actor-Critic
Introduction
Actor-Critic methods represent a powerful family of reinforcement learning algorithms that combine the best aspects of two fundamental approaches: policy-based methods (like policy gradients) and value-based methods (like Q-learning). As the name suggests, Actor-Critic consists of two components:
- Actor: Determines which actions to take (policy-based)
- Critic: Evaluates how good those actions are (value-based)
This combined approach helps address limitations of using either method alone, offering more stable learning and improved sample efficiency. In this guide, we'll explore how to implement Actor-Critic algorithms using TensorFlow, walking through the concepts and providing practical code examples.
Prerequisites
Before diving into Actor-Critic methods, you should be familiar with:
- Basic reinforcement learning concepts (states, actions, rewards, policies)
- Python programming and TensorFlow basics
- Neural networks fundamentals
Understanding Actor-Critic
The Core Concept
Actor-Critic methods belong to the policy gradient family of reinforcement learning algorithms but use a critic to guide the actor's learning. Here's how it works:
- Actor: A policy network that maps states to actions, determining how the agent should behave in each state
- Critic: A value network that estimates the value function, evaluating how good a state or state-action pair is
The actor aims to maximize expected rewards by taking actions, while the critic provides feedback on those actions to help the actor improve its policy.
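In the simplest one-step setting, both components are updated from the same temporal-difference (TD) error. Written out in standard notation (this is the textbook one-step Actor-Critic update, with actor $\pi_\theta$, critic $V_w$, and discount factor $\gamma$):

$$\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)$$

$$w \leftarrow w + \alpha_{\text{critic}}\,\delta_t\,\nabla_w V_w(s_t), \qquad \theta \leftarrow \theta + \alpha_{\text{actor}}\,\delta_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

The TD error $\delta_t$ acts as an estimate of the advantage, and it is exactly the quantity the TensorFlow implementation below uses to weight the policy gradient.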
Why Actor-Critic?
Actor-Critic methods offer several advantages:
- Reduced variance: The critic helps reduce the variance in policy gradient updates
- Improved sample efficiency: Learning is often more efficient than pure policy gradient methods
- Continuous action spaces: Well-suited for environments with continuous action spaces
- Online learning: Can learn on a step-by-step basis without waiting for episode completion
Building an Actor-Critic Model in TensorFlow
Let's implement a basic Actor-Critic algorithm for a simple environment. We'll use the CartPole-v1 environment from OpenAI Gym as our example.
Setting Up the Environment
First, let's install and import the necessary packages:
# Install required packages
!pip install tensorflow gym matplotlib
# Import necessary libraries
import tensorflow as tf
import gym
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
Creating the Actor-Critic Network
We'll create a shared network architecture with two output heads: one for the actor (policy) and one for the critic (value):
class ActorCriticNetwork:
    def __init__(self, state_dim, action_dim):
        # Initialize network parameters
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.learning_rate = 0.001
        self.model = self.build_model()
        self.optimizer = Adam(learning_rate=self.learning_rate)

    def build_model(self):
        # Shared layers
        state_input = Input(shape=(self.state_dim,))
        dense1 = Dense(64, activation='relu')(state_input)
        dense2 = Dense(64, activation='relu')(dense1)

        # Actor head (policy network)
        policy_output = Dense(self.action_dim, activation='softmax')(dense2)

        # Critic head (value network)
        value_output = Dense(1)(dense2)

        # Create the model
        model = Model(inputs=state_input, outputs=[policy_output, value_output])
        return model

    def train(self, state, action, reward, next_state, done):
        state = np.array([state], dtype=np.float32)
        next_state = np.array([next_state], dtype=np.float32)

        with tf.GradientTape() as tape:
            # Get policy and value predictions
            policy, value = self.model(state)
            _, next_value = self.model(next_state)

            # Calculate the TD target and advantage
            # (stop_gradient so the bootstrapped target is treated as a constant)
            target = reward + (1 - done) * 0.99 * tf.stop_gradient(next_value)
            advantage = target - value

            # Convert action to one-hot encoding
            action_onehot = tf.one_hot(action, self.action_dim)

            # Policy loss: negative log probability weighted by the advantage
            # (the advantage is detached so the actor loss does not update the critic)
            log_prob = tf.math.log(tf.reduce_sum(policy * action_onehot) + 1e-10)
            actor_loss = -log_prob * tf.stop_gradient(advantage)

            # Critic loss: squared TD error
            critic_loss = tf.square(advantage)

            # Total loss
            total_loss = actor_loss + critic_loss

        # Get gradients and apply updates
        grads = tape.gradient(total_loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
        return total_loss

    def get_action(self, state):
        # Get policy prediction
        policy, _ = self.model.predict(np.array([state], dtype=np.float32), verbose=0)
        # Sample an action from the policy distribution
        action = np.random.choice(self.action_dim, p=policy[0])
        return action
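As a quick sanity check, you can instantiate the class and sample an action for a dummy state. The dimensions below correspond to CartPole's 4-dimensional observation and 2 discrete actions; the snippet is only illustrative:

# Quick sanity check of the network defined above
network = ActorCriticNetwork(state_dim=4, action_dim=2)
dummy_state = np.zeros(4, dtype=np.float32)
print("Sampled action:", network.get_action(dummy_state))  # prints 0 or 1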
Training the Actor-Critic Agent
Now, let's create a function to train our Actor-Critic agent in the CartPole environment:
def train_actor_critic(episodes=1000, render=False):
    # Create environment (newer Gym versions need render_mode set at creation time)
    env = gym.make('CartPole-v1', render_mode='human' if render else None)

    # Get state and action dimensions
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    # Create Actor-Critic network
    actor_critic = ActorCriticNetwork(state_dim, action_dim)

    # Initialize tracking variables
    episode_rewards = []

    # Training loop
    for episode in range(episodes):
        # Reset environment
        state, _ = env.reset()
        done = False
        episode_reward = 0

        # Episode loop (with render_mode='human', Gym renders automatically at each step,
        # so no explicit env.render() call is needed)
        while not done:
            # Select action
            action = actor_critic.get_action(state)

            # Take action and observe next state and reward
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Train the model
            loss = actor_critic.train(state, action, reward, next_state, done)

            # Update state and episode reward
            state = next_state
            episode_reward += reward

        # Track episode rewards
        episode_rewards.append(episode_reward)

        # Print progress
        if (episode + 1) % 10 == 0:
            avg_reward = np.mean(episode_rewards[-10:])
            print(f"Episode {episode + 1}/{episodes}, Average Reward: {avg_reward:.2f}")

    # Close environment
    env.close()
    return episode_rewards

# Train the agent
rewards = train_actor_critic(episodes=500)

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(rewards)
plt.title('Actor-Critic Training Progress')
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.grid(True)
plt.show()
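Once training finishes, it can be useful to watch how the learned policy behaves. A minimal evaluation sketch, assuming you modify train_actor_critic to also return the trained agent (e.g. return episode_rewards, actor_critic):

def evaluate(agent, episodes=5):
    # Run a few episodes with the trained policy and report the total reward
    env = gym.make('CartPole-v1')
    for ep in range(episodes):
        state, _ = env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = agent.get_action(state)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward
        print(f"Evaluation episode {ep + 1}: reward = {total_reward:.1f}")
    env.close()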
Expected Output
When you run the training code, you should see output similar to:
Episode 10/500, Average Reward: 23.50
Episode 20/500, Average Reward: 35.20
Episode 30/500, Average Reward: 54.70
Episode 40/500, Average Reward: 89.30
...
Episode 490/500, Average Reward: 475.80
Episode 500/500, Average Reward: 489.60
You'll also see a plot of rewards increasing over training episodes, often approaching the maximum episode reward of 500 in CartPole-v1 after sufficient training; the exact numbers will vary from run to run.
Advanced Actor-Critic Methods
The basic implementation above introduces the core concepts, but there are several advanced Actor-Critic variants that offer improved performance:
Advantage Actor-Critic (A2C)
A2C is a synchronous, deterministic variant of the Asynchronous Advantage Actor-Critic (A3C) algorithm. It uses multiple workers to gather experience in parallel, then updates the policy in a synchronized manner.
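The main computational difference from the one-step update used earlier is that A2C forms its advantage from n-step (or full-rollout) returns. A hedged sketch of how such returns might be computed from a worker's rollout (the function name and arguments are illustrative):

def n_step_returns(rewards, values, bootstrap_value, gamma=0.99):
    # `rewards` and `values` are lists collected during a rollout;
    # `bootstrap_value` is the critic's estimate for the state following the rollout
    returns = []
    discounted = bootstrap_value
    for r in reversed(rewards):
        discounted = r + gamma * discounted
        returns.insert(0, discounted)
    returns = np.array(returns, dtype=np.float32)
    advantages = returns - np.array(values, dtype=np.float32)
    return returns, advantages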
Asynchronous Advantage Actor-Critic (A3C)
A3C uses multiple parallel agents (workers) with their own copy of the environment, each updating a global network asynchronously. This allows for more diverse experience collection and faster learning.
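Conceptually, each worker periodically syncs its weights from the global network, computes gradients on its own experience, and applies them directly to the global parameters. A rough sketch of that pattern (this is not a complete A3C implementation; compute_actor_critic_loss is a hypothetical helper that would combine the actor and critic losses as in the earlier example):

def worker_update(global_model, global_optimizer, local_model, states, actions, returns):
    # Sync the local copy with the current global weights
    local_model.set_weights(global_model.get_weights())
    with tf.GradientTape() as tape:
        loss = compute_actor_critic_loss(local_model, states, actions, returns)  # hypothetical helper
    grads = tape.gradient(loss, local_model.trainable_variables)
    # Apply the locally computed gradients to the *global* network
    global_optimizer.apply_gradients(zip(grads, global_model.trainable_variables))
    return loss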
Soft Actor-Critic (SAC)
SAC introduces entropy regularization to encourage exploration and is particularly effective for continuous action spaces. Here's a simplified example of the SAC critic loss calculation:
def soft_q_update(critic_network, target_network, actor_network, states, actions, rewards, next_states, dones):
    # Note: this sketch assumes `gamma` and `critic_optimizer` are defined elsewhere,
    # and omits the twin critics used in full SAC implementations.

    # Calculate target Q-value with entropy term
    next_actions, next_log_probs = actor_network(next_states)
    next_q_values = target_network(next_states, next_actions)

    # Subtract entropy term (alpha * log_prob)
    alpha = 0.2  # Temperature parameter
    next_q_values = next_q_values - alpha * next_log_probs

    # Calculate target using the Bellman equation
    targets = rewards + (1 - dones) * gamma * next_q_values

    # Update critic
    with tf.GradientTape() as tape:
        current_q_values = critic_network(states, actions)
        critic_loss = tf.reduce_mean(tf.square(current_q_values - targets))
    gradients = tape.gradient(critic_loss, critic_network.trainable_variables)
    critic_optimizer.apply_gradients(zip(gradients, critic_network.trainable_variables))

    return critic_loss
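The actor in SAC is then updated to maximize the entropy-regularized Q-value. A similarly simplified sketch of that update (it assumes an actor_optimizer is defined alongside critic_optimizer, and likewise omits the twin critics and squashed-Gaussian details of a full SAC implementation):

def soft_policy_update(actor_network, critic_network, states, alpha=0.2):
    with tf.GradientTape() as tape:
        actions, log_probs = actor_network(states)
        q_values = critic_network(states, actions)
        # Maximizing Q + entropy is equivalent to minimizing (alpha * log_prob - Q)
        actor_loss = tf.reduce_mean(alpha * log_probs - q_values)
    gradients = tape.gradient(actor_loss, actor_network.trainable_variables)
    actor_optimizer.apply_gradients(zip(gradients, actor_network.trainable_variables))
    return actor_loss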
Proximal Policy Optimization (PPO)
PPO is typically implemented in an Actor-Critic setup: a critic estimates advantages, and the actor is trained with a clipped objective that prevents overly large policy updates:
def ppo_loss(old_policy, new_policy, actions, advantages, clip_ratio=0.2):
    # Note: `action_dim` is assumed to be defined in the surrounding scope

    # Convert actions to one-hot
    action_masks = tf.one_hot(actions, depth=action_dim)

    # Calculate probabilities of the taken actions
    old_probs = tf.reduce_sum(old_policy * action_masks, axis=1)
    new_probs = tf.reduce_sum(new_policy * action_masks, axis=1)

    # Calculate the ratio of new to old probabilities
    ratio = new_probs / (old_probs + 1e-10)

    # Calculate surrogate losses
    surrogate1 = ratio * advantages
    surrogate2 = tf.clip_by_value(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages

    # Take the minimum to clip the objective
    actor_loss = -tf.reduce_mean(tf.minimum(surrogate1, surrogate2))
    return actor_loss
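In practice, this clipped actor loss is combined with a critic (value) loss and often an entropy bonus, and the advantages are commonly estimated with Generalized Advantage Estimation (GAE). A hedged sketch of GAE over a collected rollout (argument names are illustrative):

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    # Work backwards through the rollout, accumulating the GAE recursion
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + np.array(values, dtype=np.float32)
    return advantages, returns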
Real-World Application: Robotic Control
Let's see how we could apply Actor-Critic methods to a more complex task such as robotic control. We'll use the LunarLanderContinuous-v2 environment, which has a continuous action space (note that it requires the Box2D extras: pip install gym[box2d]):
import tensorflow as tf
import gym
import numpy as np
from tensorflow.keras.layers import Dense, Input, Concatenate, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import tensorflow_probability as tfp  # requires: pip install tensorflow-probability

# Set up the environment
env = gym.make('LunarLanderContinuous-v2')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
action_bound = env.action_space.high[0]

# Actor network for continuous actions
def build_actor(state_dim, action_dim, action_bound):
    inputs = Input(shape=(state_dim,))
    x = Dense(256, activation='relu')(inputs)
    x = Dense(256, activation='relu')(x)

    # Output mean and log_std for a Gaussian policy
    mu = Dense(action_dim, activation='tanh')(x)
    mu = Lambda(lambda m: m * action_bound)(mu)

    log_std = Dense(action_dim, activation='tanh')(x)
    # Rescale the tanh output from [-1, 1] to [-5, 2] to keep log_std bounded
    log_std = Lambda(lambda l: -5.0 + 0.5 * (l + 1.0) * (2.0 - (-5.0)))(log_std)

    # Return mu and log_std; the Normal distribution is constructed at sampling time
    model = Model(inputs=inputs, outputs=[mu, log_std])
    return model

# Critic network
def build_critic(state_dim, action_dim):
    # State as input
    state_input = Input(shape=(state_dim,))
    s = Dense(256, activation='relu')(state_input)

    # Action as input
    action_input = Input(shape=(action_dim,))
    a = Dense(256, activation='relu')(action_input)

    # Combine state and action
    combined = Concatenate()([s, a])

    # Output a single Q-value
    q = Dense(256, activation='relu')(combined)
    q = Dense(1)(q)

    model = Model(inputs=[state_input, action_input], outputs=q)
    return model
With these networks, we could implement the Soft Actor-Critic (SAC) algorithm, which is particularly effective for continuous control tasks like robotic manipulation.
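To act with this Gaussian actor, you sample from the distribution defined by mu and log_std and squash the sample into the valid action range. A minimal, hedged sampling helper under those assumptions (it uses the tanh squashing and log-probability correction commonly used in SAC; `actor` is a model built with build_actor above):

def sample_action(actor, state):
    # Sample a squashed Gaussian action for a single state
    state = np.asarray(state, dtype=np.float32)[None, :]
    mu, log_std = actor(state)
    std = tf.exp(log_std)
    dist = tfp.distributions.Normal(mu, std)
    raw_action = dist.sample()
    action = tf.tanh(raw_action) * action_bound  # squash into [-action_bound, action_bound]
    # Log-probability with the tanh change-of-variables correction
    log_prob = dist.log_prob(raw_action) - tf.math.log(1.0 - tf.tanh(raw_action) ** 2 + 1e-6)
    log_prob = tf.reduce_sum(log_prob, axis=-1)
    return action[0].numpy(), log_prob[0].numpy()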
Summary
Actor-Critic methods combine policy-based and value-based approaches to create powerful reinforcement learning algorithms. In this guide, we've explored:
- The fundamental concepts of Actor-Critic methods
- How to implement a basic Actor-Critic algorithm using TensorFlow
- Advanced variants like A2C, A3C, SAC, and PPO
- An application to robotic control with continuous actions
These methods form the backbone of many state-of-the-art reinforcement learning systems used in robotics, autonomous vehicles, game playing, and more.
Additional Resources and Exercises
Resources for Further Learning
- Spinning Up in Deep RL - OpenAI's educational resource on deep reinforcement learning
- TensorFlow Reinforcement Learning Documentation
- Sutton & Barto's Reinforcement Learning Book - Chapter 13 covers Actor-Critic methods
Practice Exercises
- Basic: Modify the basic Actor-Critic implementation to use separate networks for the actor and critic instead of a shared network.
- Intermediate: Implement the Advantage Actor-Critic (A2C) algorithm with n-step returns instead of 1-step returns.
- Advanced: Implement the Soft Actor-Critic (SAC) algorithm for the LunarLanderContinuous-v2 environment and compare its performance with the basic Actor-Critic implementation.
- Challenge: Apply an Actor-Critic method to a more complex environment like MuJoCo or PyBullet, which simulate physics and robotics tasks.
By mastering Actor-Critic methods, you'll have a powerful tool in your reinforcement learning toolkit that can be applied to a wide range of challenging problems.