TensorFlow Learning Rate
Introduction
The learning rate is one of the most critical hyperparameters when training neural networks with TensorFlow. It controls how much we adjust our model weights in response to the estimated error each time the model weights are updated. If the learning rate is too small, training will take too long or might get stuck; if it's too large, training might diverge or oscillate without reaching the optimal solution.
In this tutorial, we'll explore:
- What learning rate is and why it matters
- How to set the learning rate in TensorFlow
- Learning rate schedules and decay strategies
- Adaptive learning rate optimizers
- Practical tips for selecting the best learning rate for your models
Understanding Learning Rate
What is Learning Rate?
The learning rate (often denoted as α or lr) is a small positive value, typically between 0.0001 and 0.1, that controls the step size during optimization. During backpropagation, the gradients indicate the direction in which the loss increases, so the weights are moved in the opposite direction; the learning rate determines how large a step to take.
Mathematically, for a weight parameter w, the update rule is:
w_new = w_old - learning_rate * gradient
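To make this update rule concrete, here is a minimal sketch of a single gradient-descent step on one trainable variable using tf.GradientTape; the variable w and the toy loss are made up purely for illustration.
import tensorflow as tf

learning_rate = 0.1
w = tf.Variable(3.0)  # a single trainable weight (illustrative)

with tf.GradientTape() as tape:
    loss = (w - 1.0) ** 2  # toy loss with its minimum at w = 1

gradient = tape.gradient(loss, w)
w.assign_sub(learning_rate * gradient)  # w_new = w_old - learning_rate * gradient
print(w.numpy())  # 3.0 - 0.1 * 2 * (3.0 - 1.0) = 2.6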
Why is Learning Rate Important?
The learning rate directly impacts:
- Training speed: higher rates can speed up convergence, but only up to a certain point
- Training stability: rates that are too high cause the loss to oscillate or diverge (see the small sketch after this list)
- Final model performance: a well-chosen rate schedule can lead to better generalization
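As a quick illustration of the stability point above, here is a minimal sketch, assuming nothing beyond plain gradient descent on the toy loss f(w) = w²; for this loss, any learning rate above 1.0 makes the iterates grow instead of shrink. The helper run_gd is hypothetical and exists only for this demonstration.
import tensorflow as tf

def run_gd(learning_rate, steps=5):
    # Plain gradient descent on f(w) = w**2, starting from w = 1
    w = tf.Variable(1.0)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = w ** 2
        grad = tape.gradient(loss, w)
        w.assign_sub(learning_rate * grad)
    return w.numpy()

print(run_gd(0.1))  # converges towards 0
print(run_gd(1.5))  # diverges: |w| doubles every step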
Setting Learning Rate in TensorFlow
Basic Usage with Optimizers
In TensorFlow, you typically set the learning rate when creating an optimizer:
import tensorflow as tf
# Creating an optimizer with a fixed learning rate
learning_rate = 0.01
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
# Use the optimizer when compiling a model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(
    optimizer=optimizer,
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
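As a quick check, the configured rate can be read back from the optimizer, and compiling with an optimizer name instead of an object falls back to that optimizer's default rate (0.01 for SGD, 0.001 for Adam). This is a minimal sketch reusing the model and optimizer defined above.
# Confirm the configured learning rate from the optimizer's config
print(optimizer.get_config()['learning_rate'])  # ~0.01

# Shorthand: pass the optimizer by name and accept its default learning rate
model.compile(
    optimizer='adam',  # equivalent to tf.keras.optimizers.Adam(learning_rate=0.001)
    loss='categorical_crossentropy',
    metrics=['accuracy']
)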
Monitoring the Effect of Different Learning Rates
Let's see how different learning rates affect model training:
import matplotlib.pyplot as plt
histories = {}
learning_rates = [0.1, 0.01, 0.001, 0.0001]
# Generate some sample data
import numpy as np
x_train = np.random.random((1000, 20))
y_train = np.random.randint(0, 2, (1000, 1))
x_val = np.random.random((200, 20))
y_val = np.random.randint(0, 2, (200, 1))
for lr in learning_rates:
    print(f"Training with learning rate: {lr}")
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    history = model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        epochs=10,
        verbose=0
    )
    histories[lr] = history.history
# Plot the training curves
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
for lr, history in histories.items():
    plt.plot(history['loss'], label=f'LR = {lr}')
plt.title('Training Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.subplot(1, 2, 2)
for lr, history in histories.items():
    plt.plot(history['val_loss'], label=f'LR = {lr}')
plt.title('Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.tight_layout()
plt.savefig('learning_rate_comparison.png') # Save the figure
plt.show()
This code would generate two plots showing how different learning rates affect training and validation loss over time. Typically, you'll see that:
- Very high learning rates (0.1) might cause unstable training
- Very low learning rates (0.0001) might learn too slowly
- Moderate learning rates (0.01, 0.001) often perform best
Learning Rate Schedules
In practice, it's often beneficial to change the learning rate during training. TensorFlow provides several learning rate schedules:
Step Decay
Reduces the learning rate by a fixed factor at regular intervals; with staircase=True, ExponentialDecay applies the drop in discrete steps every decay_steps optimizer steps rather than continuously:
initial_learning_rate = 0.1
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=100,
    decay_rate=0.96,
    staircase=True
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
Time-Based Decay
Gradually reduces the learning rate over time; here the same ExponentialDecay schedule is used with staircase=False, so the rate decays smoothly at every step rather than in discrete drops:
initial_learning_rate = 0.1
decay_rate = 0.96
decay_steps = 100
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=decay_steps,
    decay_rate=decay_rate,
    staircase=False
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
Cosine Decay
Uses a cosine function to gradually reduce the learning rate:
initial_learning_rate = 0.1
decay_steps = 1000
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate, decay_steps=decay_steps
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
Custom Learning Rate Schedule
You can also create custom schedules by subclassing LearningRateSchedule:
class CustomLearningRateSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, initial_learning_rate):
        self.initial_learning_rate = initial_learning_rate

    def __call__(self, step):
        # Custom logic to adjust the learning rate based on the step
        # (cast the step, which arrives as an integer tensor, to float)
        step = tf.cast(step, tf.float32)
        return self.initial_learning_rate / (1 + 0.1 * step)

    def get_config(self):
        return {"initial_learning_rate": self.initial_learning_rate}
# Use the custom schedule
lr_schedule = CustomLearningRateSchedule(0.01)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
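Since a LearningRateSchedule is just a callable that maps a step number to a rate, you can sanity-check it directly before training. This is a small sketch using the custom schedule defined above; the step values are arbitrary.
# Evaluate the schedule at a few steps to confirm the decay behaves as intended
for step in [0, 10, 100, 1000]:
    print(step, float(lr_schedule(step)))
# Expected pattern: 0.01 at step 0, 0.005 at step 10,
# ~0.00091 at step 100, and ~0.000099 at step 1000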
Using Callbacks to Adjust Learning Rate
TensorFlow also provides callbacks to adjust learning rates during training:
# Reduce learning rate when a metric has stopped improving
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,    # multiply the learning rate by 0.2 (reduce by 80%)
    patience=3,    # number of epochs with no improvement after which the learning rate is reduced
    min_lr=0.0001  # lower bound on the learning rate
)

model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=50,
    callbacks=[reduce_lr]
)
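Another built-in option is tf.keras.callbacks.LearningRateScheduler, which sets the learning rate at the start of every epoch from a function you supply. The halving-every-10-epochs rule below is just an arbitrary example.
def scheduler(epoch, lr):
    # Halve the learning rate every 10 epochs (arbitrary illustrative rule)
    if epoch > 0 and epoch % 10 == 0:
        return lr * 0.5
    return lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(scheduler, verbose=1)
# Pass it alongside (or instead of) reduce_lr:
# model.fit(x_train, y_train, epochs=50, callbacks=[lr_callback])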
Adaptive Learning Rate Optimizers
TensorFlow provides several optimizers with built-in adaptive learning rate mechanisms:
Adam (Adaptive Moment Estimation)
One of the most widely used optimizers in deep learning; it adapts the learning rate for each parameter using running estimates of the first and second moments of the gradients:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
RMSprop
Maintains per-parameter learning rates that are scaled by a moving average of recent squared gradients:
optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=0.001,
    rho=0.9,
    momentum=0.0,
    epsilon=1e-07
)
Adagrad
Adapts the learning rate for each parameter based on the accumulated sum of squared gradients, so the effective rate only decreases over time:
optimizer = tf.keras.optimizers.Adagrad(
    learning_rate=0.01,
    initial_accumulator_value=0.1,
    epsilon=1e-07
)
Learning Rate Finder
A common technique for finding a good learning rate is to train briefly with an exponentially increasing learning rate and plot the loss against the rate:
import numpy as np
import matplotlib.pyplot as plt
# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Generate sample data
x_train = np.random.random((1000, 20))
y_train = np.random.randint(0, 2, (1000, 1))
# LR Finder implementation
start_lr = 1e-8
end_lr = 1.0
num_steps = 100
learning_rates = np.geomspace(start_lr, end_lr, num=num_steps)
losses = []
# Compile the model with a placeholder optimizer
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=start_lr),
    loss='binary_crossentropy'
)
# Train for one batch with increasing LR
batch_size = 32
for lr in learning_rates:
    # Update learning rate
    tf.keras.backend.set_value(model.optimizer.learning_rate, lr)
    # Train for one batch
    indices = np.random.randint(0, len(x_train), batch_size)
    x_batch = x_train[indices]
    y_batch = y_train[indices]
    loss = model.train_on_batch(x_batch, y_batch)
    losses.append(loss)
# Plot the learning rate vs. loss
plt.figure(figsize=(10, 6))
plt.plot(learning_rates, losses)
plt.xscale('log')
plt.xlabel('Learning Rate (log scale)')
plt.ylabel('Loss')
plt.title('Learning Rate vs. Loss')
plt.grid(True)
plt.savefig('lr_finder.png')
plt.show()
# The optimal learning rate is typically just before the loss starts to increase rapidly
The optimal learning rate is typically found at the point where the loss is decreasing most rapidly, just before it starts to diverge.
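One simple way to automate that read-off, assuming the learning_rates and losses arrays produced above, is to pick the rate where the (smoothed) loss is falling fastest. This is a rough heuristic, not an exact rule.
# Smooth the loss curve a little, then find the steepest downward slope
smoothed = np.convolve(losses, np.ones(5) / 5, mode='valid')
slopes = np.gradient(smoothed)
best_idx = int(np.argmin(slopes))  # most negative slope = fastest decrease
suggested_lr = learning_rates[best_idx + 2]  # +2 centers the smoothing window
print(f"Suggested learning rate: {suggested_lr:.2e}")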
Real-World Example: MNIST Classification
Let's put everything together with a real example using the MNIST dataset:
import tensorflow as tf
import matplotlib.pyplot as plt
# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Preprocess the data
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = x_train.reshape(-1, 28*28)
x_test = x_test.reshape(-1, 28*28)
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)
# Create a learning rate schedule with warmup and decay
initial_learning_rate = 0.001
lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[1000, 2000, 3000],
    values=[
        initial_learning_rate * 0.3,  # Warm up
        initial_learning_rate,        # Full learning rate
        initial_learning_rate * 0.5,  # Reduced learning rate
        initial_learning_rate * 0.1   # Final learning rate
    ]
)
# Create the model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model with our learning rate schedule
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
# Add a learning rate callback to monitor the learning rate
class LearningRateMonitor(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        lr = self.model.optimizer.lr
        if hasattr(lr, '__call__'):
            # The optimizer was given a schedule: evaluate it at the current step
            current_lr = lr(self.model.optimizer.iterations).numpy()
        else:
            # Plain fixed learning rate stored as a variable
            current_lr = tf.keras.backend.get_value(self.model.optimizer.lr)
        print(f"\nEpoch {epoch+1}: Current learning rate: {current_lr:.7f}")
lr_monitor = LearningRateMonitor()
# Train the model
history = model.fit(
    x_train, y_train,
    epochs=10,
    batch_size=128,
    validation_split=0.1,
    callbacks=[lr_monitor]
)
# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"\nTest accuracy: {test_acc:.4f}")
# Plot the training history
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss over Time')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Accuracy over Time')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.savefig('mnist_training.png')
plt.show()
This example demonstrates:
- Using a piecewise constant learning rate schedule with warmup and decay
- Monitoring the learning rate during training
- Visualizing how the learning rate affects training and validation metrics
Summary
In this tutorial, we've covered:
- The concept of learning rate and its importance in neural network training
- How to set fixed learning rates in TensorFlow
- Various learning rate schedules and decay strategies
- Adaptive learning rate optimizers
- Techniques for finding optimal learning rates
- A real-world example implementing these concepts
Choosing the right learning rate and schedule is more art than science. While there are good starting points and heuristics, it often requires experimentation to find what works best for your specific problem.
Additional Resources
- TensorFlow Learning Rate Schedules Documentation
- Practical Deep Learning Course - Learning Rate Selection
- Research Paper: Cyclical Learning Rates for Training Neural Networks
Exercises
- Experiment with different learning rate schedules on the MNIST dataset and compare the results.
- Implement cyclical learning rates and observe how they affect training.
- Create a custom learning rate schedule that combines warmup, constant learning, and cosine decay.
- Try the learning rate finder technique on a different dataset and visualize the results.
- Compare the performance of different adaptive optimizers (Adam, RMSprop, Adagrad) using the same learning rate.
Remember that finding the optimal learning rate strategy can significantly improve your model's performance and training time!