TensorFlow Policy Gradients

Introduction

Policy gradient methods are a fundamental class of algorithms in reinforcement learning that directly optimize the policy of an agent. Unlike value-based methods like Q-learning, policy gradient methods learn a parameterized policy that directly maps states to actions. This approach offers several advantages, particularly for environments with continuous action spaces or when we want stochastic policies.

In this tutorial, we'll explore how to implement policy gradient methods using TensorFlow. We'll start with the basics and gradually build up to more sophisticated implementations. By the end of this tutorial, you'll understand:

  • What policy gradients are and how they differ from value-based methods
  • The mathematical foundations of policy gradient methods
  • How to implement REINFORCE (a basic policy gradient algorithm)
  • Advanced techniques for improving policy gradient methods
  • Real-world applications of policy gradients

Foundational Concepts

What is a Policy?

In reinforcement learning, a policy is a strategy that an agent follows to determine its actions. Formally, a policy π is a mapping from states to actions (or to a probability distribution over actions):

  • Deterministic policy: π(s) = a
  • Stochastic policy: π(a|s) = P(A=a|S=s)

Policy gradient methods typically use neural networks to represent the policy πθ, where θ represents the parameters (weights) of the neural network.

The Reinforcement Learning Objective

In reinforcement learning, our goal is to find a policy that maximizes the expected return (cumulative reward):

J(θ) = Eτ~πθ[R(τ)]

Where τ represents a trajectory (a sequence of states, actions, and rewards) and R(τ) is the total reward accumulated along that trajectory, typically the discounted sum Σₜ γᵗ rₜ.

The Policy Gradient Theorem

The policy gradient theorem gives us a way to compute the gradient of the expected return with respect to the policy parameters:

∇θJ(θ) = Eτ~πθ[Σₜ ∇θ log πθ(aₜ|sₜ) · R(τ)]

This allows us to perform gradient ascent to improve the policy.
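
To see what this estimator looks like in code, here is a minimal sketch of a single score-function gradient step on a toy problem; the 4-dimensional observation, the three actions, and the return value are all made-up illustrations, not tied to any particular environment.

python
import tensorflow as tf
import tensorflow_probability as tfp

# Toy example: one state, three discrete actions, a made-up return.
logits_layer = tf.keras.layers.Dense(3)        # stands in for the policy network
state = tf.constant([[0.1, -0.4, 0.7, 0.2]])   # hypothetical 4-dim observation
sampled_return = 5.0                           # hypothetical R(tau)

with tf.GradientTape() as tape:
    logits = logits_layer(state)
    dist = tfp.distributions.Categorical(logits=logits)
    action = dist.sample()
    # Loss = -log pi(a|s) * R, so minimizing the loss ascends the objective.
    loss = -tf.reduce_sum(dist.log_prob(action)) * sampled_return

grads = tape.gradient(loss, logits_layer.trainable_variables)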

Implementing REINFORCE with TensorFlow

REINFORCE is one of the simplest policy gradient algorithms. Let's implement it step by step using TensorFlow.

Step 1: Set Up the Environment and Dependencies

python
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np
import gym
import matplotlib.pyplot as plt

# Note: the code below uses the newer Gym API (gym >= 0.26 or gymnasium),
# where env.reset() returns (observation, info) and env.step() returns
# (observation, reward, terminated, truncated, info).

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

Step 2: Define the Policy Network

python
class PolicyNetwork(tf.keras.Model):
    def __init__(self, num_actions, hidden_size=128):
        super(PolicyNetwork, self).__init__()
        self.dense1 = tf.keras.layers.Dense(hidden_size, activation='relu')
        self.dense2 = tf.keras.layers.Dense(hidden_size, activation='relu')
        self.policy_logits = tf.keras.layers.Dense(num_actions)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        logits = self.policy_logits(x)
        return logits

    def action_value(self, obs):
        logits = self.call(obs)
        action_probs = tf.nn.softmax(logits)
        dist = tfp.distributions.Categorical(probs=action_probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob
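
Before wiring the network into a training loop, it can be sanity-checked on a dummy observation. The snippet below assumes a CartPole-style task with 4 observation features and 2 discrete actions; the printed values are only illustrative.

python
# Instantiate a policy for an environment with 2 discrete actions
policy = PolicyNetwork(num_actions=2)

# A single fake observation with 4 features (batch dimension first)
dummy_obs = tf.constant([[0.02, -0.01, 0.03, 0.04]], dtype=tf.float32)

action, log_prob = policy.action_value(dummy_obs)
print(action.numpy(), log_prob.numpy())  # e.g. [1] and [-0.69...] for an untrained policy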

Step 3: Implement REINFORCE Algorithm

python
def reinforce(env_name, num_episodes=1000, learning_rate=0.01, gamma=0.99):
    # Create environment
    env = gym.make(env_name)

    # Initialize policy network
    num_actions = env.action_space.n
    policy_net = PolicyNetwork(num_actions)
    optimizer = tf.keras.optimizers.Adam(learning_rate)

    # For tracking progress
    episode_rewards = []

    for episode in range(num_episodes):
        states, actions, rewards = [], [], []

        # Start a new episode
        state, _ = env.reset()
        done = False
        episode_reward = 0

        # Collect a trajectory with the current policy
        while not done:
            # Convert state to a batched tensor
            state_tensor = tf.convert_to_tensor([state], dtype=tf.float32)

            # Sample an action from the policy
            action, _ = policy_net.action_value(state_tensor)
            action = int(action.numpy()[0])

            # Take the action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Store transition data
            states.append(state)
            actions.append(action)
            rewards.append(reward)

            state = next_state
            episode_reward += reward

        episode_rewards.append(episode_reward)

        # Compute discounted rewards (reward-to-go)
        discounted_rewards = []
        cumulative = 0.0
        for reward in reversed(rewards):
            cumulative = reward + gamma * cumulative
            discounted_rewards.insert(0, cumulative)

        # Convert to tensor and normalize
        discounted_rewards = tf.convert_to_tensor(discounted_rewards, dtype=tf.float32)
        discounted_rewards = (discounted_rewards - tf.math.reduce_mean(discounted_rewards)) / \
                             (tf.math.reduce_std(discounted_rewards) + 1e-9)

        states_tensor = tf.convert_to_tensor(states, dtype=tf.float32)
        actions_tensor = tf.convert_to_tensor(actions, dtype=tf.int32)

        # Compute loss and update policy. The log-probabilities are recomputed
        # inside the tape so that gradients flow back to the network parameters.
        with tf.GradientTape() as tape:
            logits = policy_net(states_tensor)
            dist = tfp.distributions.Categorical(logits=logits)
            log_probs = dist.log_prob(actions_tensor)
            loss = -tf.reduce_sum(log_probs * discounted_rewards)

        grads = tape.gradient(loss, policy_net.trainable_variables)
        optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))

        if episode % 10 == 0:
            print(f"Episode {episode}, Reward: {episode_reward}")

    return episode_rewards, policy_net

Step 4: Train and Evaluate

python
# Train the agent
rewards, policy = reinforce('CartPole-v1', num_episodes=500)

# Plot learning curve
plt.figure(figsize=(10, 6))
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('REINFORCE Learning Curve')
plt.savefig('reinforce_learning_curve.png')
plt.show()

# Example output:
# Episode 0, Reward: 21.0
# Episode 10, Reward: 46.0
# Episode 20, Reward: 71.0
# ...
# Episode 490, Reward: 500.0
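
Once training has finished, you may want to watch how the learned policy behaves. The following is a minimal evaluation sketch that reuses the policy object returned above; choosing the arg-max action instead of sampling is an evaluation-time convenience, not part of REINFORCE itself.

python
# Run one greedy evaluation episode with the trained policy
env = gym.make('CartPole-v1')
state, _ = env.reset()
done, total_reward = False, 0.0

while not done:
    state_tensor = tf.convert_to_tensor([state], dtype=tf.float32)
    logits = policy(state_tensor)
    action = int(tf.argmax(logits, axis=1).numpy()[0])  # greedy instead of sampling
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward

print(f"Evaluation return: {total_reward}")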

Understanding the Code Step by Step

  1. Policy Network: We implement a simple neural network that maps states to action probabilities. The action_value method samples an action based on these probabilities and returns the log probability.

  2. REINFORCE Algorithm:

    • We collect trajectories by running the policy in the environment
    • After each episode, we compute discounted rewards
    • We normalize the rewards to reduce variance
    • We compute the loss using the policy gradient theorem
    • We update the policy using gradient ascent
  3. Discounted Rewards: For each time step t, we compute the discounted sum of the remaining rewards in the episode, Rₜ = Σₖ γᵏ rₜ₊ₖ, where the sum runs over the remaining steps k = 0, 1, 2, … (a short worked example follows this list).

  4. Loss Function: Our loss is the negative sum of log probabilities multiplied by the discounted rewards. Minimizing this loss is equivalent to maximizing the expected reward.
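
To make the discounted reward-to-go computation concrete, here is a tiny worked example with made-up rewards [1, 1, 1] and γ = 0.9:

python
# Discounted reward-to-go for rewards [1, 1, 1] with gamma = 0.9:
#   R_2 = 1
#   R_1 = 1 + 0.9 * 1   = 1.9
#   R_0 = 1 + 0.9 * 1.9 = 2.71
gamma = 0.9
rewards = [1.0, 1.0, 1.0]

returns, cumulative = [], 0.0
for r in reversed(rewards):
    cumulative = r + gamma * cumulative
    returns.insert(0, cumulative)

print(returns)  # approximately [2.71, 1.9, 1.0]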

Improving Policy Gradient Methods

The basic REINFORCE algorithm suffers from high variance in gradient estimates. Here are some techniques to improve it:

Adding a Baseline (Advantage Function)

We can reduce variance by subtracting a baseline from the returns. A common baseline is the value function, which leads to the advantage function:

A(s, a) = Q(s, a) - V(s)

Here's how to modify our code to include a value function baseline:

python
class ActorCriticNetwork(tf.keras.Model):
    def __init__(self, num_actions, hidden_size=128):
        super(ActorCriticNetwork, self).__init__()
        self.shared_layers = tf.keras.Sequential([
            tf.keras.layers.Dense(hidden_size, activation='relu'),
            tf.keras.layers.Dense(hidden_size, activation='relu')
        ])

        # Policy (actor) head
        self.policy_logits = tf.keras.layers.Dense(num_actions)

        # Value (critic) head
        self.value = tf.keras.layers.Dense(1)

    def call(self, inputs):
        x = self.shared_layers(inputs)
        logits = self.policy_logits(x)
        value = self.value(x)
        return logits, value

    def action_value(self, obs):
        logits, value = self.call(obs)
        action_probs = tf.nn.softmax(logits)
        dist = tfp.distributions.Categorical(probs=action_probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob, value

With this network, we can implement the Actor-Critic algorithm, which is a more advanced policy gradient method that reduces variance.
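
As a rough illustration of how the baseline enters the update, here is a minimal sketch of a single actor-critic gradient step on one batch of transitions. The batch tensors (states, actions, returns) are assumed to come from a rollout like the one in reinforce() above; this is not a complete training loop.

python
net = ActorCriticNetwork(num_actions=2)
optimizer = tf.keras.optimizers.Adam(0.001)

def actor_critic_step(states, actions, returns):
    # states: float32 tensor [N, obs_dim]; actions: int32 [N]; returns: float32 [N]
    with tf.GradientTape() as tape:
        logits, values = net(states)
        values = tf.squeeze(values, axis=1)

        # Advantage = observed return minus the critic's estimate (the baseline).
        advantages = returns - values

        dist = tfp.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(actions)

        # Actor: policy gradient weighted by the advantage. stop_gradient keeps
        # the critic from being trained through the actor loss.
        actor_loss = -tf.reduce_mean(log_probs * tf.stop_gradient(advantages))

        # Critic: regress the value estimate toward the observed return.
        critic_loss = tf.reduce_mean(tf.square(advantages))

        loss = actor_loss + 0.5 * critic_loss

    grads = tape.gradient(loss, net.trainable_variables)
    optimizer.apply_gradients(zip(grads, net.trainable_variables))
    return loss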

Real-World Applications

Policy gradient methods are widely used in various domains:

1. Robotics

Policy gradients are used to train robots to perform complex tasks, such as:

python
# Example: Training a robotic arm for object manipulation
def create_custom_robot_env():
    # Simplified example of setting up a robotics environment.
    # FetchReach requires the gymnasium-robotics / MuJoCo packages and has a
    # goal-based (Dict) observation space with continuous actions, so the
    # discrete REINFORCE agent above would need adaptation (e.g. a Gaussian
    # policy head and observation flattening) before it could train on it.
    env = gym.make('FetchReach-v1')
    return env

robot_rewards, robot_policy = reinforce('FetchReach-v1', num_episodes=2000)

2. Game Playing

Policy gradients have been used in game-playing agents:

python
# Example: Training an agent to play Atari games
# (requires the stable-baselines3 package with its Atari extras installed)
from stable_baselines3 import PPO

# Initialize the model
model = PPO("CnnPolicy", "PongNoFrameskip-v4", verbose=1)

# Train the model
model.learn(total_timesteps=1000000)

# Save the model
model.save("ppo_pong")
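
After training, the agent can be rolled forward to inspect its behaviour. The sketch below assumes it runs in the same session as the training code above and reuses the vectorized environment that stable-baselines3 created from the environment id; a saved model could instead be reloaded with PPO.load.

python
# Evaluate the trained agent for a few thousand frames
vec_env = model.get_env()
obs = vec_env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = vec_env.step(action)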

3. Autonomous Vehicles

Training autonomous vehicles to navigate complex environments:

python
# Simplified example for autonomous vehicle navigation
def create_navigation_environment():
    # In practice, this would be a complex simulation environment.
    # CarRacing-v2 uses image observations and (by default) continuous actions,
    # so the dense, discrete REINFORCE agent above would need a convolutional
    # policy and a discrete action wrapper before it could train on it.
    env = gym.make('CarRacing-v2')
    return env

# Train navigation policy
navigation_rewards, navigation_policy = reinforce('CarRacing-v2', num_episodes=5000)

Proximal Policy Optimization (PPO)

Let's implement a more advanced policy gradient algorithm called Proximal Policy Optimization (PPO). PPO improves stability by clipping the probability ratio between the new and old policies, so that a single update cannot move the policy too far:

python
def ppo_loss(old_log_probs, log_probs, advantages, epsilon=0.2):
    # Probability ratio between the new and old policies
    ratio = tf.exp(log_probs - old_log_probs)
    surrogate1 = ratio * advantages
    surrogate2 = tf.clip_by_value(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -tf.reduce_mean(tf.minimum(surrogate1, surrogate2))

class PPOAgent:
    def __init__(self, state_dim, action_dim, hidden_size=64):
        self.actor = PolicyNetwork(action_dim, hidden_size)
        self.critic = tf.keras.Sequential([
            tf.keras.layers.Dense(hidden_size, activation='relu', input_shape=(state_dim,)),
            tf.keras.layers.Dense(hidden_size, activation='relu'),
            tf.keras.layers.Dense(1)
        ])

        self.actor_optimizer = tf.keras.optimizers.Adam(0.001)
        self.critic_optimizer = tf.keras.optimizers.Adam(0.001)

    def update(self, states, actions, rewards, old_log_probs, next_states, dones, gamma=0.99):
        # Convert to tensors
        states = tf.convert_to_tensor(states, dtype=tf.float32)
        next_states = tf.convert_to_tensor(next_states, dtype=tf.float32)
        actions = tf.convert_to_tensor(actions, dtype=tf.int32)
        rewards = tf.convert_to_tensor(rewards, dtype=tf.float32)
        old_log_probs = tf.convert_to_tensor(old_log_probs, dtype=tf.float32)
        dones = tf.convert_to_tensor(dones, dtype=tf.float32)

        # Bootstrapped TD targets; the critic's estimate of the next state is
        # treated as a constant target (no gradient flows through it).
        next_values = tf.squeeze(self.critic(next_states), axis=1) * (1.0 - dones)
        td_targets = rewards + gamma * tf.stop_gradient(next_values)

        # Update critic (value function)
        with tf.GradientTape() as tape:
            values = tf.squeeze(self.critic(states), axis=1)
            critic_loss = tf.reduce_mean(tf.square(td_targets - values))

        critic_grads = tape.gradient(critic_loss, self.critic.trainable_variables)
        self.critic_optimizer.apply_gradients(zip(critic_grads, self.critic.trainable_variables))

        # Advantages for the policy update
        advantages = td_targets - values

        # Update actor (policy)
        with tf.GradientTape() as tape:
            logits = self.actor(states)
            dist = tfp.distributions.Categorical(logits=logits)
            log_probs = dist.log_prob(actions)

            actor_loss = ppo_loss(old_log_probs, log_probs, advantages)

        actor_grads = tape.gradient(actor_loss, self.actor.trainable_variables)
        self.actor_optimizer.apply_gradients(zip(actor_grads, self.actor.trainable_variables))

        return actor_loss, critic_loss
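
PPOAgent only defines the update step. A complete training loop would repeatedly collect a rollout with the current policy, record the log-probabilities it produced (these become old_log_probs), and then call update. The sketch below shows the shape of such a loop for CartPole under those assumptions; it omits common refinements such as running several PPO epochs per batch and normalizing advantages.

python
env = gym.make('CartPole-v1')
agent = PPOAgent(state_dim=env.observation_space.shape[0],
                 action_dim=env.action_space.n)

for iteration in range(100):
    states, actions, rewards, old_log_probs, next_states, dones = [], [], [], [], [], []
    state, _ = env.reset()
    done = False

    # Collect one episode with the current policy
    while not done:
        state_tensor = tf.convert_to_tensor([state], dtype=tf.float32)
        action, log_prob = agent.actor.action_value(state_tensor)
        action = int(action.numpy()[0])

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        states.append(state)
        actions.append(action)
        rewards.append(reward)
        old_log_probs.append(float(log_prob.numpy()[0]))
        next_states.append(next_state)
        dones.append(float(done))

        state = next_state

    actor_loss, critic_loss = agent.update(states, actions, rewards,
                                           old_log_probs, next_states, dones)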

Summary

In this tutorial, we've explored policy gradient methods in TensorFlow, covering:

  1. The basic concepts of policy gradient methods
  2. The mathematical foundations behind the policy gradient theorem
  3. Implementation of the REINFORCE algorithm
  4. Advanced techniques like Actor-Critic and PPO
  5. Real-world applications of policy gradients

Policy gradient methods are a powerful class of reinforcement learning algorithms that directly optimize the policy. While they can be more challenging to implement than value-based methods, they offer several advantages:

  • They can naturally handle continuous action spaces
  • They can learn stochastic policies
  • They optimize the policy objective directly, rather than deriving a policy indirectly from a learned value function

As you continue your reinforcement learning journey, you'll find that policy gradient methods form the foundation of many state-of-the-art algorithms.

Exercises

  1. Basic Exercise: Implement REINFORCE for the LunarLander-v2 environment and compare its performance with CartPole.

  2. Intermediate Exercise: Add a baseline to the REINFORCE algorithm to reduce variance. Compare the learning curves with and without the baseline.

  3. Advanced Exercise: Implement Proximal Policy Optimization (PPO) for a continuous action space environment like MountainCarContinuous-v0.

  4. Challenge: Design and implement a custom environment with unique challenges and train a policy gradient agent to solve it.

Happy learning and experimenting with policy gradient methods in TensorFlow!


