TensorFlow Learning Rate
Introduction
The learning rate is one of the most critical hyperparameters when training neural networks with TensorFlow. It controls how much we adjust our model weights in response to the estimated error each time the model weights are updated. If the learning rate is too small, training will take too long or might get stuck; if it's too large, training might diverge or oscillate without reaching the optimal solution.
In this tutorial, we'll explore:
- What learning rate is and why it matters
- How to set the learning rate in TensorFlow
- Learning rate schedules and decay strategies
- Adaptive learning rate optimizers
- Practical tips for selecting the best learning rate for your models
Understanding Learning Rate
What is Learning Rate?
The learning rate (often denoted as α or lr) is a small positive value, typically between 0.0001 and 0.1, that controls the step size during optimization. During backpropagation, the gradients indicate the direction in which the loss increases, so the weights are moved in the opposite direction; the learning rate determines how large a step to take.
Mathematically, for a weight parameter w, the update rule is:
w_new = w_old - learning_rate * gradient
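To make this update rule concrete, here is a minimal sketch of a single gradient-descent step on one trainable variable using tf.GradientTape; the variable w and the toy loss are made up purely for illustration.
import tensorflow as tf

learning_rate = 0.1
w = tf.Variable(3.0)  # a single trainable weight (illustrative)

with tf.GradientTape() as tape:
    loss = (w - 1.0) ** 2  # toy loss with its minimum at w = 1

gradient = tape.gradient(loss, w)
w.assign_sub(learning_rate * gradient)  # w_new = w_old - learning_rate * gradient
print(w.numpy())  # 3.0 - 0.1 * 2 * (3.0 - 1.0) = 2.6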
Why is Learning Rate Important?
The learning rate directly impacts:
- Training speed: higher rates can speed up convergence, but only up to a certain point
- Training stability: rates that are too high cause the loss to oscillate or diverge (see the small sketch after this list)
- Final model performance: a well-chosen rate schedule can lead to better generalization
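As a quick illustration of the stability point above, here is a minimal sketch, assuming nothing beyond plain gradient descent on the toy loss f(w) = w²; for this loss, any learning rate above 1.0 makes the iterates grow instead of shrink. The helper run_gd is hypothetical and exists only for this demonstration.
import tensorflow as tf

def run_gd(learning_rate, steps=5):
    # Plain gradient descent on f(w) = w**2, starting from w = 1
    w = tf.Variable(1.0)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = w ** 2
        grad = tape.gradient(loss, w)
        w.assign_sub(learning_rate * grad)
    return w.numpy()

print(run_gd(0.1))  # converges towards 0
print(run_gd(1.5))  # diverges: |w| doubles every step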
Setting Learning Rate in TensorFlow
Basic Usage with Optimizers
In TensorFlow, you typically set the learning rate when creating an optimizer:
import tensorflow as tf
# Creating an optimizer with a fixed learning rate
learning_rate = 0.01
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
# Use the optimizer when compiling a model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(
    optimizer=optimizer,
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
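As a quick check, the configured rate can be read back from the optimizer, and compiling with an optimizer name instead of an object falls back to that optimizer's default rate (0.01 for SGD, 0.001 for Adam). This is a minimal sketch reusing the model and optimizer defined above.
# Confirm the configured learning rate from the optimizer's config
print(optimizer.get_config()['learning_rate'])  # ~0.01

# Shorthand: pass the optimizer by name and accept its default learning rate
model.compile(
    optimizer='adam',  # equivalent to tf.keras.optimizers.Adam(learning_rate=0.001)
    loss='categorical_crossentropy',
    metrics=['accuracy']
)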
Monitoring the Effect of Different Learning Rates
Let's see how different learning rates affect model training:
import matplotlib.pyplot as plt
histories = {}
learning_rates = [0.1, 0.01, 0.001, 0.0001]
# Generate some sample data
import numpy as np
x_train = np.random.random((1000, 20))
y_train = np.random.randint(0, 2, (1000, 1))
x_val = np.random.random((200, 20))
y_val = np.random.randint(0, 2, (200, 1))
for lr in learning_rates:
    print(f"Training with learning rate: {lr}")
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    history = model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        epochs=10,
        verbose=0
    )
    histories[lr] = history.history
# Plot the training curves
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
for lr, history in histories.items():
    plt.plot(history['loss'], label=f'LR = {lr}')
plt.title('Training Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.subplot(1, 2, 2)
for lr, history in histories.items():
    plt.plot(history['val_loss'], label=f'LR = {lr}')
plt.title('Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.tight_layout()
plt.savefig('learning_rate_comparison.png') # Save the figure
plt.show()
This code would generate two plots showing how different learning rates affect training and validation loss over time. Typically, you'll see that:
- Very high learning rates (0.1) might cause unstable training
- Very low learning rates (0.0001) might learn too slowly
- Moderate learning rates (0.01, 0.001) often perform best
Learning Rate Schedules
In practice, it's often beneficial to change the learning rate during training. TensorFlow provides several learning rate schedules:
Step Decay
Reduces the learning rate by a fixed factor at regular intervals; with staircase=True, ExponentialDecay applies the drop in discrete steps every decay_steps optimizer steps rather than continuously:
initial_learning_rate = 0.1
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=100,
    decay_rate=0.96,
    staircase=True
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
Time-Based Decay
Gradually reduces the learning rate over time; here the same ExponentialDecay schedule is used with staircase=False, so the rate decays smoothly at every step rather than in discrete drops:
initial_learning_rate = 0.1
decay_rate = 0.96
decay_steps = 100
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=decay_steps,
    decay_rate=decay_rate,
    staircase=False
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
Cosine Decay
Uses a cosine function to gradually reduce the learning rate:
initial_learning_rate = 0.1
decay_steps = 1000
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate, decay_steps=decay_steps
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
Custom Learning Rate Schedule
You can also create custom schedules by subclassing LearningRateSchedule:
class CustomLearningRateSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, initial_learning_rate):
        self.initial_learning_rate = initial_learning_rate

    def __call__(self, step):
        # Custom logic to adjust the learning rate based on the step
        # (cast the step, which arrives as an integer tensor, to float)
        step = tf.cast(step, tf.float32)
        return self.initial_learning_rate / (1 + 0.1 * step)

    def get_config(self):
        return {"initial_learning_rate": self.initial_learning_rate}
# Use the custom schedule
lr_schedule = CustomLearningRateSchedule(0.01)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
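Since a LearningRateSchedule is just a callable that maps a step number to a rate, you can sanity-check it directly before training. This is a small sketch using the custom schedule defined above; the step values are arbitrary.
# Evaluate the schedule at a few steps to confirm the decay behaves as intended
for step in [0, 10, 100, 1000]:
    print(step, float(lr_schedule(step)))
# Expected pattern: 0.01 at step 0, 0.005 at step 10,
# ~0.00091 at step 100, and ~0.000099 at step 1000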
Using Callbacks to Adjust Learning Rate
TensorFlow also provides callbacks to adjust learning rates during training:
# Reduce learning rate when a metric has stopped improving
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,    # multiply the learning rate by 0.2 (reduce by 80%)
    patience=3,    # number of epochs with no improvement after which the learning rate is reduced
    min_lr=0.0001  # lower bound on the learning rate
)

model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=50,
    callbacks=[reduce_lr]
)
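Another built-in option is tf.keras.callbacks.LearningRateScheduler, which sets the learning rate at the start of every epoch from a function you supply. The halving-every-10-epochs rule below is just an arbitrary example.
def scheduler(epoch, lr):
    # Halve the learning rate every 10 epochs (arbitrary illustrative rule)
    if epoch > 0 and epoch % 10 == 0:
        return lr * 0.5
    return lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(scheduler, verbose=1)
# Pass it alongside (or instead of) reduce_lr:
# model.fit(x_train, y_train, epochs=50, callbacks=[lr_callback])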
Adaptive Learning Rate Optimizers
TensorFlow provides several optimizers with built-in adaptive learning rate mechanisms:
Adam (Adaptive Moment Estimation)
One of the most widely used optimizers in deep learning; it adapts the learning rate for each parameter using running estimates of the first and second moments of the gradients:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
RMSprop
Maintains per-parameter learning rates that are scaled by a moving average of recent squared gradients:
optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=0.001,
    rho=0.9,
    momentum=0.0,
    epsilon=1e-07
)
Adagrad
Adapts the learning rate for each parameter based on the accumulated sum of squared gradients, so the effective rate only decreases over time:
optimizer = tf.keras.optimizers.Adagrad(
    learning_rate=0.01,
    initial_accumulator_value=0.1,
    epsilon=1e-07
)
Learning Rate Finder
A common technique for finding a good learning rate is to train briefly with an exponentially increasing learning rate and plot the loss against the rate:
import numpy as np
import matplotlib.pyplot as plt
# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Generate sample data
x_train = np.random.random((1000, 20))
y_train = np.random.randint(0, 2, (1000, 1))
# LR Finder implementation
start_lr = 1e-8
end_lr = 1.0
num_steps = 100
learning_rates = np.geomspace(start_lr, end_lr, num=num_steps)
losses = []
# Compile the model with a placeholder optimizer
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=start_lr),
    loss='binary_crossentropy'
)
# Train for one batch with increasing LR
batch_size = 32
for lr in learning_rates:
    # Update learning rate
    tf.keras.backend.set_value(model.optimizer.learning_rate, lr)
    # Train for one batch
    indices = np.random.randint(0, len(x_train), batch_size)
    x_batch = x_train[indices]
    y_batch = y_train[indices]
    loss = model.train_on_batch(x_batch, y_batch)
    losses.append(loss)
# Plot the learning rate vs. loss
plt.figure(figsize=(10, 6))
plt.plot(learning_rates, losses)
plt.xscale('log')
plt.xlabel('Learning Rate (log scale)')
plt.ylabel('Loss')
plt.title('Learning Rate vs. Loss')
plt.grid(True)
plt.savefig('lr_finder.png')
plt.show()
# The optimal learning rate is typically just before the loss starts to increase rapidly
The optimal learning rate is typically found at the point where the loss is decreasing most rapidly, just before it starts to diverge.
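One simple way to automate that read-off, assuming the learning_rates and losses arrays produced above, is to pick the rate where the (smoothed) loss is falling fastest. This is a rough heuristic, not an exact rule.
# Smooth the loss curve a little, then find the steepest downward slope
smoothed = np.convolve(losses, np.ones(5) / 5, mode='valid')
slopes = np.gradient(smoothed)
best_idx = int(np.argmin(slopes))  # most negative slope = fastest decrease
suggested_lr = learning_rates[best_idx + 2]  # +2 centers the smoothing window
print(f"Suggested learning rate: {suggested_lr:.2e}")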
Real-World Example: MNIST Classification
Let's put everything together with a real example using the MNIST dataset:
import tensorflow as tf
import matplotlib.pyplot as plt
# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Preprocess the data
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = x_train.reshape(-1, 28*28)
x_test = x_test.reshape(-1, 28*28)
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)
# Create a learning rate schedule with warmup and decay
initial_learning_rate = 0.001
lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[1000, 2000, 3000],
    values=[
        initial_learning_rate * 0.3,  # Warm up
        initial_learning_rate,        # Full learning rate
        initial_learning_rate * 0.5,  # Reduced learning rate
        initial_learning_rate * 0.1   # Final learning rate
    ]
)
# Create the model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model with our learning rate schedule
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
# Add a learning rate callback to monitor the learning rate
class LearningRateMonitor(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        lr = self.model.optimizer.lr
        if hasattr(lr, '__call__'):
            # The optimizer was given a schedule: evaluate it at the current step
            current_lr = lr(self.model.optimizer.iterations).numpy()
        else:
            # Plain fixed learning rate stored as a variable
            current_lr = tf.keras.backend.get_value(self.model.optimizer.lr)
        print(f"\nEpoch {epoch+1}: Current learning rate: {current_lr:.7f}")
lr_monitor = LearningRateMonitor()
# Train the model
history = model.fit(
    x_train, y_train,
    epochs=10,
    batch_size=128,
    validation_split=0.1,
    callbacks=[lr_monitor]
)
# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"\nTest accuracy: {test_acc:.4f}")
# Plot the training history
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss over Time')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Accuracy over Time')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.savefig('mnist_training.png')
plt.show()
This example demonstrates:
- Using a piecewise constant learning rate schedule with warmup and decay
- Monitoring the learning rate during training
- Visualizing how the learning rate affects training and validation metrics
Summary
In this tutorial, we've covered:
- The concept of learning rate and its importance in neural network training
- How to set fixed learning rates in TensorFlow
- Various learning rate schedules and decay strategies
- Adaptive learning rate optimizers
- Techniques for finding optimal learning rates
- A real-world example implementing these concepts
Choosing the right learning rate and schedule is more art than science. While there are good starting points and heuristics, it often requires experimentation to find what works best for your specific problem.
Additional Resources
- TensorFlow Learning Rate Schedules Documentation
- Practical Deep Learning Course - Learning Rate Selection
- Research Paper: Cyclical Learning Rates for Training Neural Networks
Exercises
- Experiment with different learning rate schedules on the MNIST dataset and compare the results.
- Implement cyclical learning rates and observe how they affect training.
- Create a custom learning rate schedule that combines warmup, constant learning, and cosine decay.
- Try the learning rate finder technique on a different dataset and visualize the results.
- Compare the performance of different adaptive optimizers (Adam, RMSprop, Adagrad) using the same learning rate.
Remember that finding the optimal learning rate strategy can significantly improve your model's performance and training time!