TensorFlow Policy Gradients

Introduction

Policy gradient methods are a fundamental class of algorithms in reinforcement learning that directly optimize the policy of an agent. Unlike value-based methods like Q-learning, policy gradient methods learn a parameterized policy that directly maps states to actions. This approach offers several advantages, particularly for environments with continuous action spaces or when we want stochastic policies.

In this tutorial, we'll explore how to implement policy gradient methods using TensorFlow. We'll start with the basics and gradually build up to more sophisticated implementations. By the end of this tutorial, you'll understand:

  • What policy gradients are and how they differ from value-based methods
  • The mathematical foundations of policy gradient methods
  • How to implement REINFORCE (a basic policy gradient algorithm)
  • Advanced techniques for improving policy gradient methods
  • Real-world applications of policy gradients

Foundational Concepts

What is a Policy?

In reinforcement learning, a policy is a strategy that an agent follows to determine its actions. Formally, a policy π is a mapping from states to actions (or to a probability distribution over actions):

  • Deterministic policy: π(s) = a
  • Stochastic policy: π(a|s) = P(A=a|S=s)

Policy gradient methods typically use neural networks to represent the policy πθ, where θ represents the parameters (weights) of the neural network.

The Reinforcement Learning Objective

In reinforcement learning, our goal is to find a policy that maximizes the expected return (cumulative reward):

J(θ) = Eτ~πθ[R(τ)]

Where τ represents a trajectory (a sequence of states, actions, and rewards) and R(τ) is the total reward accumulated along that trajectory, typically the discounted sum Σₜ γᵗ rₜ.

The Policy Gradient Theorem

The policy gradient theorem gives us a way to compute the gradient of the expected return with respect to the policy parameters:

∇θJ(θ) = Eτ~πθ[Σₜ ∇θ log πθ(aₜ|sₜ) · R(τ)]

This allows us to perform gradient ascent to improve the policy.
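
To see what this estimator looks like in code, here is a minimal sketch of a single score-function gradient step on a toy problem; the 4-dimensional observation, the three actions, and the return value are all made-up illustrations, not tied to any particular environment.

python
import tensorflow as tf
import tensorflow_probability as tfp

# Toy example: one state, three discrete actions, a made-up return.
logits_layer = tf.keras.layers.Dense(3)        # stands in for the policy network
state = tf.constant([[0.1, -0.4, 0.7, 0.2]])   # hypothetical 4-dim observation
sampled_return = 5.0                           # hypothetical R(tau)

with tf.GradientTape() as tape:
    logits = logits_layer(state)
    dist = tfp.distributions.Categorical(logits=logits)
    action = dist.sample()
    # Loss = -log pi(a|s) * R, so minimizing the loss ascends the objective.
    loss = -tf.reduce_sum(dist.log_prob(action)) * sampled_return

grads = tape.gradient(loss, logits_layer.trainable_variables)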

Implementing REINFORCE with TensorFlow

REINFORCE is one of the simplest policy gradient algorithms. Let's implement it step by step using TensorFlow.

Step 1: Set Up the Environment and Dependencies

python
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np
import gym
import matplotlib.pyplot as plt

# Note: the code below uses the newer Gym API (gym >= 0.26 or gymnasium),
# where env.reset() returns (observation, info) and env.step() returns
# (observation, reward, terminated, truncated, info).

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

Step 2: Define the Policy Network

python
class PolicyNetwork(tf.keras.Model):
    def __init__(self, num_actions, hidden_size=128):
        super(PolicyNetwork, self).__init__()
        self.dense1 = tf.keras.layers.Dense(hidden_size, activation='relu')
        self.dense2 = tf.keras.layers.Dense(hidden_size, activation='relu')
        self.policy_logits = tf.keras.layers.Dense(num_actions)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        logits = self.policy_logits(x)
        return logits

    def action_value(self, obs):
        logits = self.call(obs)
        action_probs = tf.nn.softmax(logits)
        dist = tfp.distributions.Categorical(probs=action_probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob
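
Before wiring the network into a training loop, it can be sanity-checked on a dummy observation. The snippet below assumes a CartPole-style task with 4 observation features and 2 discrete actions; the printed values are only illustrative.

python
# Instantiate a policy for an environment with 2 discrete actions
policy = PolicyNetwork(num_actions=2)

# A single fake observation with 4 features (batch dimension first)
dummy_obs = tf.constant([[0.02, -0.01, 0.03, 0.04]], dtype=tf.float32)

action, log_prob = policy.action_value(dummy_obs)
print(action.numpy(), log_prob.numpy())  # e.g. [1] and [-0.69...] for an untrained policy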

Step 3: Implement REINFORCE Algorithm

python
def reinforce(env_name, num_episodes=1000, learning_rate=0.01, gamma=0.99):
    # Create environment
    env = gym.make(env_name)

    # Initialize policy network
    num_actions = env.action_space.n
    policy_net = PolicyNetwork(num_actions)
    optimizer = tf.keras.optimizers.Adam(learning_rate)

    # For tracking progress
    episode_rewards = []

    for episode in range(num_episodes):
        states, actions, rewards = [], [], []

        # Start a new episode
        state, _ = env.reset()
        done = False
        episode_reward = 0

        # Collect a trajectory with the current policy
        while not done:
            # Convert state to a batched tensor
            state_tensor = tf.convert_to_tensor([state], dtype=tf.float32)

            # Sample an action from the policy
            action, _ = policy_net.action_value(state_tensor)
            action = int(action.numpy()[0])

            # Take the action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Store transition data
            states.append(state)
            actions.append(action)
            rewards.append(reward)

            state = next_state
            episode_reward += reward

        episode_rewards.append(episode_reward)

        # Compute discounted rewards (reward-to-go)
        discounted_rewards = []
        cumulative = 0.0
        for reward in reversed(rewards):
            cumulative = reward + gamma * cumulative
            discounted_rewards.insert(0, cumulative)

        # Convert to tensor and normalize
        discounted_rewards = tf.convert_to_tensor(discounted_rewards, dtype=tf.float32)
        discounted_rewards = (discounted_rewards - tf.math.reduce_mean(discounted_rewards)) / \
                             (tf.math.reduce_std(discounted_rewards) + 1e-9)

        states_tensor = tf.convert_to_tensor(states, dtype=tf.float32)
        actions_tensor = tf.convert_to_tensor(actions, dtype=tf.int32)

        # Compute loss and update policy. The log-probabilities are recomputed
        # inside the tape so that gradients flow back to the network parameters.
        with tf.GradientTape() as tape:
            logits = policy_net(states_tensor)
            dist = tfp.distributions.Categorical(logits=logits)
            log_probs = dist.log_prob(actions_tensor)
            loss = -tf.reduce_sum(log_probs * discounted_rewards)

        grads = tape.gradient(loss, policy_net.trainable_variables)
        optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))

        if episode % 10 == 0:
            print(f"Episode {episode}, Reward: {episode_reward}")

    return episode_rewards, policy_net

Step 4: Train and Evaluate

python
# Train the agent
rewards, policy = reinforce('CartPole-v1', num_episodes=500)

# Plot learning curve
plt.figure(figsize=(10, 6))
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('REINFORCE Learning Curve')
plt.savefig('reinforce_learning_curve.png')
plt.show()

# Example output:
# Episode 0, Reward: 21.0
# Episode 10, Reward: 46.0
# Episode 20, Reward: 71.0
# ...
# Episode 490, Reward: 500.0
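
Once training has finished, you may want to watch how the learned policy behaves. The following is a minimal evaluation sketch that reuses the policy object returned above; choosing the arg-max action instead of sampling is an evaluation-time convenience, not part of REINFORCE itself.

python
# Run one greedy evaluation episode with the trained policy
env = gym.make('CartPole-v1')
state, _ = env.reset()
done, total_reward = False, 0.0

while not done:
    state_tensor = tf.convert_to_tensor([state], dtype=tf.float32)
    logits = policy(state_tensor)
    action = int(tf.argmax(logits, axis=1).numpy()[0])  # greedy instead of sampling
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward

print(f"Evaluation return: {total_reward}")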

Understanding the Code Step by Step

  1. Policy Network: We implement a simple neural network that maps states to action probabilities. The action_value method samples an action based on these probabilities and returns the log probability.

  2. REINFORCE Algorithm:

    • We collect trajectories by running the policy in the environment
    • After each episode, we compute discounted rewards
    • We normalize the rewards to reduce variance
    • We compute the loss using the policy gradient theorem
    • We update the policy using gradient ascent
  3. Discounted Rewards: For each time step t, we compute the discounted sum of the remaining rewards in the episode, Rₜ = Σₖ γᵏ rₜ₊ₖ, where the sum runs over the remaining steps k = 0, 1, 2, … (a short worked example follows this list).

  4. Loss Function: Our loss is the negative sum of log probabilities multiplied by the discounted rewards. Minimizing this loss is equivalent to maximizing the expected reward.
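
To make the discounted reward-to-go computation concrete, here is a tiny worked example with made-up rewards [1, 1, 1] and γ = 0.9:

python
# Discounted reward-to-go for rewards [1, 1, 1] with gamma = 0.9:
#   R_2 = 1
#   R_1 = 1 + 0.9 * 1   = 1.9
#   R_0 = 1 + 0.9 * 1.9 = 2.71
gamma = 0.9
rewards = [1.0, 1.0, 1.0]

returns, cumulative = [], 0.0
for r in reversed(rewards):
    cumulative = r + gamma * cumulative
    returns.insert(0, cumulative)

print(returns)  # approximately [2.71, 1.9, 1.0]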

Improving Policy Gradient Methods

The basic REINFORCE algorithm suffers from high variance in gradient estimates. Here are some techniques to improve it:

Adding a Baseline (Advantage Function)

We can reduce variance by subtracting a baseline from the returns. A common baseline is the value function, which leads to the advantage function:

A(s, a) = Q(s, a) - V(s)

Here's how to modify our code to include a value function baseline:

python
class ActorCriticNetwork(tf.keras.Model):
    def __init__(self, num_actions, hidden_size=128):
        super(ActorCriticNetwork, self).__init__()
        self.shared_layers = tf.keras.Sequential([
            tf.keras.layers.Dense(hidden_size, activation='relu'),
            tf.keras.layers.Dense(hidden_size, activation='relu')
        ])

        # Policy (actor) head
        self.policy_logits = tf.keras.layers.Dense(num_actions)

        # Value (critic) head
        self.value = tf.keras.layers.Dense(1)

    def call(self, inputs):
        x = self.shared_layers(inputs)
        logits = self.policy_logits(x)
        value = self.value(x)
        return logits, value

    def action_value(self, obs):
        logits, value = self.call(obs)
        action_probs = tf.nn.softmax(logits)
        dist = tfp.distributions.Categorical(probs=action_probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob, value

With this network, we can implement the Actor-Critic algorithm, which is a more advanced policy gradient method that reduces variance.
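
As a rough illustration of how the baseline enters the update, here is a minimal sketch of a single actor-critic gradient step on one batch of transitions. The batch tensors (states, actions, returns) are assumed to come from a rollout like the one in reinforce() above; this is not a complete training loop.

python
net = ActorCriticNetwork(num_actions=2)
optimizer = tf.keras.optimizers.Adam(0.001)

def actor_critic_step(states, actions, returns):
    # states: float32 tensor [N, obs_dim]; actions: int32 [N]; returns: float32 [N]
    with tf.GradientTape() as tape:
        logits, values = net(states)
        values = tf.squeeze(values, axis=1)

        # Advantage = observed return minus the critic's estimate (the baseline).
        advantages = returns - values

        dist = tfp.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(actions)

        # Actor: policy gradient weighted by the advantage. stop_gradient keeps
        # the critic from being trained through the actor loss.
        actor_loss = -tf.reduce_mean(log_probs * tf.stop_gradient(advantages))

        # Critic: regress the value estimate toward the observed return.
        critic_loss = tf.reduce_mean(tf.square(advantages))

        loss = actor_loss + 0.5 * critic_loss

    grads = tape.gradient(loss, net.trainable_variables)
    optimizer.apply_gradients(zip(grads, net.trainable_variables))
    return loss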

Real-World Applications

Policy gradient methods are widely used in various domains:

1. Robotics

Policy gradients are used to train robots to perform complex tasks, such as:

python
# Example: Training a robotic arm for object manipulation
def create_custom_robot_env():
    # Simplified example of setting up a robotics environment.
    # FetchReach requires the gymnasium-robotics / MuJoCo packages and has a
    # goal-based (Dict) observation space with continuous actions, so the
    # discrete REINFORCE agent above would need adaptation (e.g. a Gaussian
    # policy head and observation flattening) before it could train on it.
    env = gym.make('FetchReach-v1')
    return env

robot_rewards, robot_policy = reinforce('FetchReach-v1', num_episodes=2000)

2. Game Playing

Policy gradients have been used in game-playing agents:

python
# Example: Training an agent to play Atari games
# (requires the stable-baselines3 package with its Atari extras installed)
from stable_baselines3 import PPO

# Initialize the model
model = PPO("CnnPolicy", "PongNoFrameskip-v4", verbose=1)

# Train the model
model.learn(total_timesteps=1000000)

# Save the model
model.save("ppo_pong")
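
After training, the agent can be rolled forward to inspect its behaviour. The sketch below assumes it runs in the same session as the training code above and reuses the vectorized environment that stable-baselines3 created from the environment id; a saved model could instead be reloaded with PPO.load.

python
# Evaluate the trained agent for a few thousand frames
vec_env = model.get_env()
obs = vec_env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = vec_env.step(action)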

3. Autonomous Vehicles

Training autonomous vehicles to navigate complex environments:

python
# Simplified example for autonomous vehicle navigation
def create_navigation_environment():
    # In practice, this would be a complex simulation environment.
    # CarRacing-v2 uses image observations and (by default) continuous actions,
    # so the dense, discrete REINFORCE agent above would need a convolutional
    # policy and a discrete action wrapper before it could train on it.
    env = gym.make('CarRacing-v2')
    return env

# Train navigation policy
navigation_rewards, navigation_policy = reinforce('CarRacing-v2', num_episodes=5000)

Proximal Policy Optimization (PPO)

Let's implement a more advanced policy gradient algorithm called Proximal Policy Optimization (PPO). PPO improves stability by clipping the probability ratio between the new and old policies, so that a single update cannot move the policy too far:

python
def ppo_loss(old_log_probs, log_probs, advantages, epsilon=0.2):
    # Probability ratio between the new and old policies
    ratio = tf.exp(log_probs - old_log_probs)
    surrogate1 = ratio * advantages
    surrogate2 = tf.clip_by_value(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -tf.reduce_mean(tf.minimum(surrogate1, surrogate2))

class PPOAgent:
    def __init__(self, state_dim, action_dim, hidden_size=64):
        self.actor = PolicyNetwork(action_dim, hidden_size)
        self.critic = tf.keras.Sequential([
            tf.keras.layers.Dense(hidden_size, activation='relu', input_shape=(state_dim,)),
            tf.keras.layers.Dense(hidden_size, activation='relu'),
            tf.keras.layers.Dense(1)
        ])

        self.actor_optimizer = tf.keras.optimizers.Adam(0.001)
        self.critic_optimizer = tf.keras.optimizers.Adam(0.001)

    def update(self, states, actions, rewards, old_log_probs, next_states, dones, gamma=0.99):
        # Convert to tensors
        states = tf.convert_to_tensor(states, dtype=tf.float32)
        next_states = tf.convert_to_tensor(next_states, dtype=tf.float32)
        actions = tf.convert_to_tensor(actions, dtype=tf.int32)
        rewards = tf.convert_to_tensor(rewards, dtype=tf.float32)
        old_log_probs = tf.convert_to_tensor(old_log_probs, dtype=tf.float32)
        dones = tf.convert_to_tensor(dones, dtype=tf.float32)

        # Bootstrapped TD targets; the critic's estimate of the next state is
        # treated as a constant target (no gradient flows through it).
        next_values = tf.squeeze(self.critic(next_states), axis=1) * (1.0 - dones)
        td_targets = rewards + gamma * tf.stop_gradient(next_values)

        # Update critic (value function)
        with tf.GradientTape() as tape:
            values = tf.squeeze(self.critic(states), axis=1)
            critic_loss = tf.reduce_mean(tf.square(td_targets - values))

        critic_grads = tape.gradient(critic_loss, self.critic.trainable_variables)
        self.critic_optimizer.apply_gradients(zip(critic_grads, self.critic.trainable_variables))

        # Advantages for the policy update
        advantages = td_targets - values

        # Update actor (policy)
        with tf.GradientTape() as tape:
            logits = self.actor(states)
            dist = tfp.distributions.Categorical(logits=logits)
            log_probs = dist.log_prob(actions)

            actor_loss = ppo_loss(old_log_probs, log_probs, advantages)

        actor_grads = tape.gradient(actor_loss, self.actor.trainable_variables)
        self.actor_optimizer.apply_gradients(zip(actor_grads, self.actor.trainable_variables))

        return actor_loss, critic_loss
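
PPOAgent only defines the update step. A complete training loop would repeatedly collect a rollout with the current policy, record the log-probabilities it produced (these become old_log_probs), and then call update. The sketch below shows the shape of such a loop for CartPole under those assumptions; it omits common refinements such as running several PPO epochs per batch and normalizing advantages.

python
env = gym.make('CartPole-v1')
agent = PPOAgent(state_dim=env.observation_space.shape[0],
                 action_dim=env.action_space.n)

for iteration in range(100):
    states, actions, rewards, old_log_probs, next_states, dones = [], [], [], [], [], []
    state, _ = env.reset()
    done = False

    # Collect one episode with the current policy
    while not done:
        state_tensor = tf.convert_to_tensor([state], dtype=tf.float32)
        action, log_prob = agent.actor.action_value(state_tensor)
        action = int(action.numpy()[0])

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        states.append(state)
        actions.append(action)
        rewards.append(reward)
        old_log_probs.append(float(log_prob.numpy()[0]))
        next_states.append(next_state)
        dones.append(float(done))

        state = next_state

    actor_loss, critic_loss = agent.update(states, actions, rewards,
                                           old_log_probs, next_states, dones)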

Summary

In this tutorial, we've explored policy gradient methods in TensorFlow, covering:

  1. The basic concepts of policy gradient methods
  2. The mathematical foundations behind the policy gradient theorem
  3. Implementation of the REINFORCE algorithm
  4. Advanced techniques like Actor-Critic and PPO
  5. Real-world applications of policy gradients

Policy gradient methods are a powerful class of reinforcement learning algorithms that directly optimize the policy. While they can be more challenging to implement than value-based methods, they offer several advantages:

  • They can naturally handle continuous action spaces
  • They can learn stochastic policies
  • They optimize the policy objective directly, rather than deriving a policy indirectly from a learned value function

As you continue your reinforcement learning journey, you'll find that policy gradient methods form the foundation of many state-of-the-art algorithms.

Exercises

  1. Basic Exercise: Implement REINFORCE for the LunarLander-v2 environment and compare its performance with CartPole.

  2. Intermediate Exercise: Add a baseline to the REINFORCE algorithm to reduce variance. Compare the learning curves with and without the baseline.

  3. Advanced Exercise: Implement Proximal Policy Optimization (PPO) for a continuous action space environment like MountainCarContinuous-v0.

  4. Challenge: Design and implement a custom environment with unique challenges and train a policy gradient agent to solve it.

Happy learning and experimenting with policy gradient methods in TensorFlow!


