TensorFlow Actor-Critic
Introduction
Actor-Critic methods represent a powerful family of reinforcement learning algorithms that combine the best aspects of two fundamental approaches: policy-based methods (like policy gradients) and value-based methods (like Q-learning). As the name suggests, Actor-Critic consists of two components:
- Actor: Determines which actions to take (policy-based)
- Critic: Evaluates how good those actions are (value-based)
This combined approach helps address limitations of using either method alone, offering more stable learning and improved sample efficiency. In this guide, we'll explore how to implement Actor-Critic algorithms using TensorFlow, walking through the concepts and providing practical code examples.
Prerequisites
Before diving into Actor-Critic methods, you should be familiar with:
- Basic reinforcement learning concepts (states, actions, rewards, policies)
- Python programming and TensorFlow basics
- Neural networks fundamentals
Understanding Actor-Critic
The Core Concept
Actor-Critic methods belong to the policy gradient family of reinforcement learning algorithms but use a critic to guide the actor's learning. Here's how it works:
- Actor: A policy network that maps states to actions, determining how the agent should behave in each state
- Critic: A value network that estimates the value function, evaluating how good a state or state-action pair is
The actor aims to maximize expected rewards by taking actions, while the critic provides feedback on those actions to help the actor improve its policy.
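In the simplest one-step setting, both components are updated from the same temporal-difference (TD) error. Written out in standard notation (this is the textbook one-step Actor-Critic update, with actor $\pi_\theta$, critic $V_w$, and discount factor $\gamma$):

$$\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)$$

$$w \leftarrow w + \alpha_{\text{critic}}\,\delta_t\,\nabla_w V_w(s_t), \qquad \theta \leftarrow \theta + \alpha_{\text{actor}}\,\delta_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

The TD error $\delta_t$ acts as an estimate of the advantage, and it is exactly the quantity the TensorFlow implementation below uses to weight the policy gradient.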
Why Actor-Critic?
Actor-Critic methods offer several advantages:
- Reduced variance: The critic helps reduce the variance in policy gradient updates
- Improved sample efficiency: Learning is often more efficient than pure policy gradient methods
- Continuous action spaces: Well-suited for environments with continuous action spaces
- Online learning: Can learn on a step-by-step basis without waiting for episode completion
Building an Actor-Critic Model in TensorFlow
Let's implement a basic Actor-Critic algorithm for a simple environment. We'll use the CartPole-v1 environment from OpenAI Gym as our example.
Setting Up the Environment
First, let's install and import the necessary packages:
# Install required packages
!pip install tensorflow gym matplotlib
# Import necessary libraries
import tensorflow as tf
import gym
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
Creating the Actor-Critic Network
We'll create a shared network architecture with two output heads: one for the actor (policy) and one for the critic (value):
class ActorCriticNetwork:
    def __init__(self, state_dim, action_dim):
        # Initialize network parameters
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.learning_rate = 0.001
        self.model = self.build_model()
        self.optimizer = Adam(learning_rate=self.learning_rate)

    def build_model(self):
        # Shared layers
        state_input = Input(shape=(self.state_dim,))
        dense1 = Dense(64, activation='relu')(state_input)
        dense2 = Dense(64, activation='relu')(dense1)

        # Actor head (policy network)
        policy_output = Dense(self.action_dim, activation='softmax')(dense2)

        # Critic head (value network)
        value_output = Dense(1)(dense2)

        # Create the model
        model = Model(inputs=state_input, outputs=[policy_output, value_output])
        return model

    def train(self, state, action, reward, next_state, done):
        state = np.array([state], dtype=np.float32)
        next_state = np.array([next_state], dtype=np.float32)

        with tf.GradientTape() as tape:
            # Get policy and value predictions
            policy, value = self.model(state)
            _, next_value = self.model(next_state)

            # Calculate the TD target and advantage
            # (stop_gradient so the bootstrapped target is treated as a constant)
            target = reward + (1 - done) * 0.99 * tf.stop_gradient(next_value)
            advantage = target - value

            # Convert action to one-hot encoding
            action_onehot = tf.one_hot(action, self.action_dim)

            # Policy loss: negative log probability weighted by the advantage
            # (the advantage is detached so the actor loss does not update the critic)
            log_prob = tf.math.log(tf.reduce_sum(policy * action_onehot) + 1e-10)
            actor_loss = -log_prob * tf.stop_gradient(advantage)

            # Critic loss: squared TD error
            critic_loss = tf.square(advantage)

            # Total loss
            total_loss = actor_loss + critic_loss

        # Get gradients and apply updates
        grads = tape.gradient(total_loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
        return total_loss

    def get_action(self, state):
        # Get policy prediction
        policy, _ = self.model.predict(np.array([state], dtype=np.float32), verbose=0)
        # Sample an action from the policy distribution
        action = np.random.choice(self.action_dim, p=policy[0])
        return action
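As a quick sanity check, you can instantiate the class and sample an action for a dummy state. The dimensions below correspond to CartPole's 4-dimensional observation and 2 discrete actions; the snippet is only illustrative:

# Quick sanity check of the network defined above
network = ActorCriticNetwork(state_dim=4, action_dim=2)
dummy_state = np.zeros(4, dtype=np.float32)
print("Sampled action:", network.get_action(dummy_state))  # prints 0 or 1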
Training the Actor-Critic Agent
Now, let's create a function to train our Actor-Critic agent in the CartPole environment:
def train_actor_critic(episodes=1000, render=False):
    # Create environment (newer Gym versions need render_mode set at creation time)
    env = gym.make('CartPole-v1', render_mode='human' if render else None)

    # Get state and action dimensions
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    # Create Actor-Critic network
    actor_critic = ActorCriticNetwork(state_dim, action_dim)

    # Initialize tracking variables
    episode_rewards = []

    # Training loop
    for episode in range(episodes):
        # Reset environment
        state, _ = env.reset()
        done = False
        episode_reward = 0

        # Episode loop (with render_mode='human', Gym renders automatically at each step,
        # so no explicit env.render() call is needed)
        while not done:
            # Select action
            action = actor_critic.get_action(state)

            # Take action and observe next state and reward
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Train the model
            loss = actor_critic.train(state, action, reward, next_state, done)

            # Update state and episode reward
            state = next_state
            episode_reward += reward

        # Track episode rewards
        episode_rewards.append(episode_reward)

        # Print progress
        if (episode + 1) % 10 == 0:
            avg_reward = np.mean(episode_rewards[-10:])
            print(f"Episode {episode + 1}/{episodes}, Average Reward: {avg_reward:.2f}")

    # Close environment
    env.close()
    return episode_rewards

# Train the agent
rewards = train_actor_critic(episodes=500)

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(rewards)
plt.title('Actor-Critic Training Progress')
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.grid(True)
plt.show()
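Once training finishes, it can be useful to watch how the learned policy behaves. A minimal evaluation sketch, assuming you modify train_actor_critic to also return the trained agent (e.g. return episode_rewards, actor_critic):

def evaluate(agent, episodes=5):
    # Run a few episodes with the trained policy and report the total reward
    env = gym.make('CartPole-v1')
    for ep in range(episodes):
        state, _ = env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = agent.get_action(state)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward
        print(f"Evaluation episode {ep + 1}: reward = {total_reward:.1f}")
    env.close()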
Expected Output
When you run the training code, you should see output similar to:
Episode 10/500, Average Reward: 23.50
Episode 20/500, Average Reward: 35.20
Episode 30/500, Average Reward: 54.70
Episode 40/500, Average Reward: 89.30
...
Episode 490/500, Average Reward: 475.80
Episode 500/500, Average Reward: 489.60
You'll also see a plot of rewards increasing over training episodes, often approaching the maximum episode reward of 500 in CartPole-v1 after sufficient training; the exact numbers will vary from run to run.
Advanced Actor-Critic Methods
The basic implementation above introduces the core concepts, but there are several advanced Actor-Critic variants that offer improved performance:
Advantage Actor-Critic (A2C)
A2C is a synchronous, deterministic variant of the Asynchronous Advantage Actor-Critic (A3C) algorithm. It uses multiple workers to gather experience in parallel, then updates the policy in a synchronized manner.
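The main computational difference from the one-step update used earlier is that A2C forms its advantage from n-step (or full-rollout) returns. A hedged sketch of how such returns might be computed from a worker's rollout (the function name and arguments are illustrative):

def n_step_returns(rewards, values, bootstrap_value, gamma=0.99):
    # `rewards` and `values` are lists collected during a rollout;
    # `bootstrap_value` is the critic's estimate for the state following the rollout
    returns = []
    discounted = bootstrap_value
    for r in reversed(rewards):
        discounted = r + gamma * discounted
        returns.insert(0, discounted)
    returns = np.array(returns, dtype=np.float32)
    advantages = returns - np.array(values, dtype=np.float32)
    return returns, advantages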
Asynchronous Advantage Actor-Critic (A3C)
A3C uses multiple parallel agents (workers) with their own copy of the environment, each updating a global network asynchronously. This allows for more diverse experience collection and faster learning.
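Conceptually, each worker periodically syncs its weights from the global network, computes gradients on its own experience, and applies them directly to the global parameters. A rough sketch of that pattern (this is not a complete A3C implementation; compute_actor_critic_loss is a hypothetical helper that would combine the actor and critic losses as in the earlier example):

def worker_update(global_model, global_optimizer, local_model, states, actions, returns):
    # Sync the local copy with the current global weights
    local_model.set_weights(global_model.get_weights())
    with tf.GradientTape() as tape:
        loss = compute_actor_critic_loss(local_model, states, actions, returns)  # hypothetical helper
    grads = tape.gradient(loss, local_model.trainable_variables)
    # Apply the locally computed gradients to the *global* network
    global_optimizer.apply_gradients(zip(grads, global_model.trainable_variables))
    return loss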
Soft Actor-Critic (SAC)
SAC introduces entropy regularization to encourage exploration and is particularly effective for continuous action spaces. Here's a simplified example of the SAC critic loss calculation:
def soft_q_update(critic_network, target_network, actor_network, states, actions, rewards, next_states, dones):
    # Note: this sketch assumes `gamma` and `critic_optimizer` are defined elsewhere,
    # and omits the twin critics used in full SAC implementations.

    # Calculate target Q-value with entropy term
    next_actions, next_log_probs = actor_network(next_states)
    next_q_values = target_network(next_states, next_actions)

    # Subtract entropy term (alpha * log_prob)
    alpha = 0.2  # Temperature parameter
    next_q_values = next_q_values - alpha * next_log_probs

    # Calculate target using the Bellman equation
    targets = rewards + (1 - dones) * gamma * next_q_values

    # Update critic
    with tf.GradientTape() as tape:
        current_q_values = critic_network(states, actions)
        critic_loss = tf.reduce_mean(tf.square(current_q_values - targets))
    gradients = tape.gradient(critic_loss, critic_network.trainable_variables)
    critic_optimizer.apply_gradients(zip(gradients, critic_network.trainable_variables))

    return critic_loss
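The actor in SAC is then updated to maximize the entropy-regularized Q-value. A similarly simplified sketch of that update (it assumes an actor_optimizer is defined alongside critic_optimizer, and likewise omits the twin critics and squashed-Gaussian details of a full SAC implementation):

def soft_policy_update(actor_network, critic_network, states, alpha=0.2):
    with tf.GradientTape() as tape:
        actions, log_probs = actor_network(states)
        q_values = critic_network(states, actions)
        # Maximizing Q + entropy is equivalent to minimizing (alpha * log_prob - Q)
        actor_loss = tf.reduce_mean(alpha * log_probs - q_values)
    gradients = tape.gradient(actor_loss, actor_network.trainable_variables)
    actor_optimizer.apply_gradients(zip(gradients, actor_network.trainable_variables))
    return actor_loss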
Proximal Policy Optimization (PPO)
PPO is typically implemented in an Actor-Critic setup: a critic estimates advantages, and the actor is trained with a clipped objective that prevents overly large policy updates:
def ppo_loss(old_policy, new_policy, actions, advantages, clip_ratio=0.2):
    # Note: `action_dim` is assumed to be defined in the surrounding scope

    # Convert actions to one-hot
    action_masks = tf.one_hot(actions, depth=action_dim)

    # Calculate probabilities of the taken actions
    old_probs = tf.reduce_sum(old_policy * action_masks, axis=1)
    new_probs = tf.reduce_sum(new_policy * action_masks, axis=1)

    # Calculate the ratio of new to old probabilities
    ratio = new_probs / (old_probs + 1e-10)

    # Calculate surrogate losses
    surrogate1 = ratio * advantages
    surrogate2 = tf.clip_by_value(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages

    # Take the minimum to clip the objective
    actor_loss = -tf.reduce_mean(tf.minimum(surrogate1, surrogate2))
    return actor_loss
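In practice, this clipped actor loss is combined with a critic (value) loss and often an entropy bonus, and the advantages are commonly estimated with Generalized Advantage Estimation (GAE). A hedged sketch of GAE over a collected rollout (argument names are illustrative):

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    # Work backwards through the rollout, accumulating the GAE recursion
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + np.array(values, dtype=np.float32)
    return advantages, returns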
Real-World Application: Robotic Control
Let's see how we could apply Actor-Critic methods to a more complex task such as robotic control. We'll use the LunarLanderContinuous-v2 environment, which has a continuous action space (note that it requires the Box2D extras: pip install gym[box2d]):
import tensorflow as tf
import gym
import numpy as np
from tensorflow.keras.layers import Dense, Input, Concatenate, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import tensorflow_probability as tfp  # requires: pip install tensorflow-probability

# Set up the environment
env = gym.make('LunarLanderContinuous-v2')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
action_bound = env.action_space.high[0]

# Actor network for continuous actions
def build_actor(state_dim, action_dim, action_bound):
    inputs = Input(shape=(state_dim,))
    x = Dense(256, activation='relu')(inputs)
    x = Dense(256, activation='relu')(x)

    # Output mean and log_std for a Gaussian policy
    mu = Dense(action_dim, activation='tanh')(x)
    mu = Lambda(lambda m: m * action_bound)(mu)

    log_std = Dense(action_dim, activation='tanh')(x)
    # Rescale the tanh output from [-1, 1] to [-5, 2] to keep log_std bounded
    log_std = Lambda(lambda l: -5.0 + 0.5 * (l + 1.0) * (2.0 - (-5.0)))(log_std)

    # Return mu and log_std; the Normal distribution is constructed at sampling time
    model = Model(inputs=inputs, outputs=[mu, log_std])
    return model

# Critic network
def build_critic(state_dim, action_dim):
    # State as input
    state_input = Input(shape=(state_dim,))
    s = Dense(256, activation='relu')(state_input)

    # Action as input
    action_input = Input(shape=(action_dim,))
    a = Dense(256, activation='relu')(action_input)

    # Combine state and action
    combined = Concatenate()([s, a])

    # Output a single Q-value
    q = Dense(256, activation='relu')(combined)
    q = Dense(1)(q)

    model = Model(inputs=[state_input, action_input], outputs=q)
    return model
With these networks, we could implement the Soft Actor-Critic (SAC) algorithm, which is particularly effective for continuous control tasks like robotic manipulation.
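To act with this Gaussian actor, you sample from the distribution defined by mu and log_std and squash the sample into the valid action range. A minimal, hedged sampling helper under those assumptions (it uses the tanh squashing and log-probability correction commonly used in SAC; `actor` is a model built with build_actor above):

def sample_action(actor, state):
    # Sample a squashed Gaussian action for a single state
    state = np.asarray(state, dtype=np.float32)[None, :]
    mu, log_std = actor(state)
    std = tf.exp(log_std)
    dist = tfp.distributions.Normal(mu, std)
    raw_action = dist.sample()
    action = tf.tanh(raw_action) * action_bound  # squash into [-action_bound, action_bound]
    # Log-probability with the tanh change-of-variables correction
    log_prob = dist.log_prob(raw_action) - tf.math.log(1.0 - tf.tanh(raw_action) ** 2 + 1e-6)
    log_prob = tf.reduce_sum(log_prob, axis=-1)
    return action[0].numpy(), log_prob[0].numpy()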
Summary
Actor-Critic methods combine policy-based and value-based approaches to create powerful reinforcement learning algorithms. In this guide, we've explored:
- The fundamental concepts of Actor-Critic methods
- How to implement a basic Actor-Critic algorithm using TensorFlow
- Advanced variants like A2C, A3C, SAC, and PPO
- An application to robotic control with continuous actions
These methods form the backbone of many state-of-the-art reinforcement learning systems used in robotics, autonomous vehicles, game playing, and more.
Additional Resources and Exercises
Resources for Further Learning
- Spinning Up in Deep RL - OpenAI's educational resource on deep reinforcement learning
- TensorFlow Reinforcement Learning Documentation
- Sutton & Barto's Reinforcement Learning Book - Chapter 13 covers Actor-Critic methods
Practice Exercises
- Basic: Modify the basic Actor-Critic implementation to use separate networks for the actor and critic instead of a shared network.
- Intermediate: Implement the Advantage Actor-Critic (A2C) algorithm with n-step returns instead of 1-step returns.
- Advanced: Implement the Soft Actor-Critic (SAC) algorithm for the LunarLanderContinuous-v2 environment and compare its performance with the basic Actor-Critic implementation.
- Challenge: Apply an Actor-Critic method to a more complex environment like MuJoCo or PyBullet, which simulate physics and robotics tasks.
By mastering Actor-Critic methods, you'll have a powerful tool in your reinforcement learning toolkit that can be applied to a wide range of challenging problems.