PyTorch Learning Rate Scheduling
Introduction
Learning rate is one of the most important hyperparameters in deep learning. It controls how much we adjust our model weights during training. If the learning rate is too large, the model might overshoot the optimal solution. If it's too small, training might take too long or get stuck in local minima.
Learning rate scheduling is a technique where we change the learning rate during training to improve model performance and convergence. PyTorch provides several built-in schedulers that help us implement different strategies for adjusting the learning rate over time.
In this tutorial, you'll learn:
- Why learning rate scheduling is important
- Different types of learning rate schedulers in PyTorch
- How to implement and use various schedulers
- Best practices for learning rate scheduling
Why Use Learning Rate Scheduling?
When training neural networks, a common challenge is finding the perfect learning rate:
- Too high: The model may oscillate around the optimal point or diverge entirely
- Too low: Training progresses slowly and may get trapped in suboptimal minima
Learning rate scheduling addresses this by typically starting with a higher learning rate and gradually reducing it according to a predefined strategy. This approach has several benefits:
- Faster initial progress: Higher learning rates in early epochs allow rapid movement toward better weights
- Fine-tuning: Smaller learning rates later enable more precise weight adjustments
- Better convergence: Often results in better final model performance
- Escaping local minima: Can help the model escape poor local minima early in training
Common Learning Rate Schedulers in PyTorch
PyTorch provides several learning rate schedulers through the torch.optim.lr_scheduler module. Let's explore the most commonly used ones:
1. Step Learning Rate Scheduler
The StepLR scheduler decays the learning rate by a factor (gamma) every fixed number of epochs (step_size).
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
# Create an optimizer
optimizer = SGD(model.parameters(), lr=0.1)
# Create a scheduler: reduce LR by a factor of 0.1 every 30 epochs
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# Training loop
for epoch in range(100):
    train_one_epoch(model, optimizer)  # Your training function

    # Print the learning rate used during this epoch
    print(f"Epoch {epoch}, Learning Rate: {optimizer.param_groups[0]['lr']}")

    # Step the scheduler once per epoch, after the epoch's training
    scheduler.step()
Output:
Epoch 0, Learning Rate: 0.1
...
Epoch 29, Learning Rate: 0.1
Epoch 30, Learning Rate: 0.01
...
Epoch 59, Learning Rate: 0.01
Epoch 60, Learning Rate: 0.001
...
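In other words, the learning rate StepLR produces at any epoch can be written in closed form. The tiny standalone sketch below (the helper name step_lr is just for illustration, not part of the PyTorch API) mirrors the schedule above:

def step_lr(epoch, initial_lr=0.1, step_size=30, gamma=0.1):
    # The LR is multiplied by gamma once every step_size epochs
    return initial_lr * gamma ** (epoch // step_size)

print(step_lr(0), step_lr(29), step_lr(30), step_lr(60))
# 0.1, 0.1, then 0.01, then 0.001 (up to floating-point rounding)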
2. Multi-Step Learning Rate Scheduler
The MultiStepLR scheduler is similar to StepLR but allows you to specify the exact epochs (milestones) at which the learning rate changes:
from torch.optim.lr_scheduler import MultiStepLR
optimizer = SGD(model.parameters(), lr=0.1)
# Reduce LR by a factor of 0.1 at epochs 30, 60, and 90
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# Training loop
for epoch in range(100):
    train_one_epoch(model, optimizer)

    # Print the learning rate used during this epoch
    print(f"Epoch {epoch}, Learning Rate: {optimizer.param_groups[0]['lr']}")

    scheduler.step()
Output:
Epoch 0, Learning Rate: 0.1
...
Epoch 29, Learning Rate: 0.1
Epoch 30, Learning Rate: 0.01
...
Epoch 59, Learning Rate: 0.01
Epoch 60, Learning Rate: 0.001
...
3. Exponential Learning Rate Scheduler
The ExponentialLR scheduler decays the learning rate by multiplying it by gamma after every epoch:
from torch.optim.lr_scheduler import ExponentialLR
optimizer = SGD(model.parameters(), lr=0.1)
# Multiply LR by 0.95 after each epoch
scheduler = ExponentialLR(optimizer, gamma=0.95)
# Training loop
for epoch in range(10):  # Showing just 10 epochs for brevity
    train_one_epoch(model, optimizer)

    # Print the learning rate used during this epoch
    print(f"Epoch {epoch}, Learning Rate: {optimizer.param_groups[0]['lr']:.5f}")

    scheduler.step()
Output:
Epoch 0, Learning Rate: 0.10000
Epoch 1, Learning Rate: 0.09500
Epoch 2, Learning Rate: 0.09025
...
Epoch 9, Learning Rate: 0.06302
4. Cosine Annealing Learning Rate Scheduler
The CosineAnnealingLR scheduler gradually reduces the learning rate from its initial value to a minimum value (eta_min) by following half of a cosine curve over T_max epochs:
from torch.optim.lr_scheduler import CosineAnnealingLR
optimizer = SGD(model.parameters(), lr=0.1)
# Cosine annealing from 0.1 to 0.001 over 50 epochs
scheduler = CosineAnnealingLR(optimizer, T_max=50, eta_min=0.001)
# Training loop
for epoch in range(100):
    train_one_epoch(model, optimizer)

    if epoch % 10 == 0:  # Print every 10 epochs for brevity
        print(f"Epoch {epoch}, Learning Rate: {optimizer.param_groups[0]['lr']:.5f}")

    scheduler.step()
Output:
Epoch 0, Learning Rate: 0.10000
Epoch 10, Learning Rate: 0.09055
Epoch 20, Learning Rate: 0.06580
Epoch 30, Learning Rate: 0.03520
Epoch 40, Learning Rate: 0.01045
Epoch 50, Learning Rate: 0.00100
Epoch 60, Learning Rate: 0.01045  # past T_max the LR climbs back up along the cosine curve
...
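Under the hood, CosineAnnealingLR follows the standard cosine annealing formula. The short standalone sketch below (the helper name cosine_annealing_lr is just for illustration, not part of the PyTorch API) reproduces the values printed above:

import math

def cosine_annealing_lr(epoch, base_lr=0.1, eta_min=0.001, T_max=50):
    # Closed-form cosine annealing: decay from base_lr to eta_min over T_max epochs
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / T_max))

for epoch in (0, 10, 20, 30, 40, 50):
    print(f"Epoch {epoch}: {cosine_annealing_lr(epoch):.5f}")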
5. ReduceLROnPlateau
The ReduceLROnPlateau scheduler reduces the learning rate when a metric has stopped improving:
from torch.optim.lr_scheduler import ReduceLROnPlateau
optimizer = SGD(model.parameters(), lr=0.1)
# Reduce LR by a factor of 0.1 when validation loss doesn't improve for 5 epochs
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)
# Training loop
for epoch in range(100):
    train_loss = train_one_epoch(model, optimizer)
    val_loss = validate(model)  # Your validation function

    # Pass the validation loss to the scheduler
    scheduler.step(val_loss)

    print(f"Epoch {epoch}, Learning Rate: {optimizer.param_groups[0]['lr']}, Val Loss: {val_loss}")
Output:
Epoch 0, Learning Rate: 0.1, Val Loss: 2.3
...
Epoch 10, Learning Rate: 0.1, Val Loss: 1.2
Epoch 11, Learning Rate: 0.1, Val Loss: 1.19
...
Epoch 17, Learning Rate: 0.01, Val Loss: 1.19 # LR reduced once the validation loss has failed to improve for more than 5 epochs
...
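ReduceLROnPlateau also exposes a few more knobs that are often worth tuning; the values below are purely illustrative:

scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',
    factor=0.1,
    patience=5,
    threshold=1e-3,  # how much of a change counts as an improvement
    cooldown=2,      # epochs to wait after a reduction before counting bad epochs again
    min_lr=1e-5      # lower bound on the learning rate
)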
Creating a Complete Training Loop with Learning Rate Scheduling
Let's put everything together in a complete training example using the CIFAR-10 dataset and a simple CNN model:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import OneCycleLR
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Data transformations
transform = transforms.Compose([
transforms.RandomHorizontalFlip(),
transforms.RandomCrop(32, padding=4),
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
# Load CIFAR-10 dataset
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
]))
testloader = torch.utils.data.DataLoader(testset, batch_size=128, shuffle=False, num_workers=2)
# Define a simple CNN model
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        # Three 2x2 poolings reduce 32x32 inputs to 4x4 feature maps with 128 channels
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = self.pool(self.relu(self.conv3(x)))
        x = x.view(-1, 128 * 4 * 4)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x
# Create model, loss function, and optimizer
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Use OneCycleLR scheduler - a more advanced scheduler that's often very effective
scheduler = OneCycleLR(
optimizer,
max_lr=0.1,
steps_per_epoch=len(trainloader),
epochs=30,
pct_start=0.3, # 30% of training spent increasing LR
anneal_strategy='cos'
)
# Lists to store metrics
train_losses = []
test_accuracies = []
learning_rates = []
# Training loop
def train_model(epochs=30):
    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        epoch_loss = 0.0
        for i, data in enumerate(trainloader):
            inputs, labels = data[0].to(device), data[1].to(device)

            # Zero the parameter gradients
            optimizer.zero_grad()

            # Forward + backward + optimize
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # Update learning rate (OneCycleLR is stepped once per batch)
            scheduler.step()

            # Record learning rate
            learning_rates.append(optimizer.param_groups[0]['lr'])

            running_loss += loss.item()
            epoch_loss += loss.item()
            if i % 100 == 99:
                print(f'[{epoch + 1}, {i + 1}] loss: {running_loss / 100:.3f}')
                running_loss = 0.0

        # Save average training loss for this epoch
        train_losses.append(epoch_loss / len(trainloader))

        # Evaluate on test set
        correct = 0
        total = 0
        model.eval()
        with torch.no_grad():
            for data in testloader:
                images, labels = data[0].to(device), data[1].to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        accuracy = 100 * correct / total
        test_accuracies.append(accuracy)
        print(f'Epoch {epoch+1}: Test Accuracy: {accuracy:.2f}%')

    print('Finished Training')

# Train the model
train_model(epochs=30)
# Plot learning rate over iterations
plt.figure(figsize=(10, 4))
plt.plot(learning_rates)
plt.title('Learning Rate Schedule')
plt.xlabel('Iterations')
plt.ylabel('Learning Rate')
plt.grid(True)
plt.savefig('learning_rate_schedule.png')
# Plot training loss and test accuracy
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(train_losses)
plt.title('Training Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.grid(True)
plt.subplot(1, 2, 2)
plt.plot(test_accuracies)
plt.title('Test Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy (%)')
plt.grid(True)
plt.savefig('training_metrics.png')
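One practical detail when using any scheduler: if you checkpoint and resume training, save and restore the scheduler's state alongside the model and optimizer. A minimal sketch (the file name is arbitrary):

# Save a checkpoint
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}, "checkpoint.pth")

# Resume later
checkpoint = torch.load("checkpoint.pth")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scheduler.load_state_dict(checkpoint["scheduler"])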
Choosing the Right Learning Rate Scheduler
Different schedulers work better for different tasks. Here are some guidelines:
- Step-based schedulers (StepLR, MultiStepLR):
  - Good for tasks where you have a solid idea of when the learning rate should be reduced
  - Simple to implement and interpret
  - Common practice: reduce the LR by 10x after plateaus in validation metrics
- Cosine annealing schedulers:
  - Generally provide smooth transitions
  - Often work well in practice for many vision tasks
  - CosineAnnealingWarmRestarts can help escape local minima (see the sketch after this list)
- ReduceLROnPlateau:
  - Adaptive approach that responds to model performance
  - Great when you're unsure when to reduce the learning rate
  - Typically more robust across different datasets/models
- OneCycleLR:
  - Often provides faster convergence and better results
  - Follows research showing the benefits of super-convergence
  - A good default choice for many modern deep learning tasks
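CosineAnnealingWarmRestarts is referenced above but not shown earlier, so here is a minimal sketch of how it is typically wired up (T_0, T_mult, and the epoch count are illustrative, and train_one_epoch is the same placeholder training function used in the earlier examples):

from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

optimizer = SGD(model.parameters(), lr=0.1)

# First cosine cycle lasts T_0=10 epochs; each following cycle is twice as long (T_mult=2)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=0.001)

for epoch in range(70):
    train_one_epoch(model, optimizer)
    scheduler.step()  # the LR jumps back up to 0.1 at the start of each new cycle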
Learning Rate Finder
Before setting up a scheduler, it's often helpful to find a good initial learning rate. PyTorch doesn't have a built-in learning rate finder, but libraries like fastai and pytorch-lightning provide this functionality.
Here's a simplified implementation:
import numpy as np  # used below to check for non-finite losses

def find_learning_rate(model, train_loader, optimizer, criterion, device, start_lr=1e-7, end_lr=10, num_iter=100):
    # Save original parameters to restore them later
    original_params = {name: param.clone() for name, param in model.named_parameters()}

    # Initialize lists to store learning rates and losses
    lrs = []
    losses = []

    # Set the initial learning rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = start_lr

    # Calculate the multiplication factor applied after each iteration
    lr_factor = (end_lr / start_lr) ** (1 / num_iter)

    model.train()
    for i, (inputs, targets) in enumerate(train_loader):
        if i >= num_iter:
            break

        inputs, targets = inputs.to(device), targets.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Store learning rate and loss
        lr = optimizer.param_groups[0]['lr']
        lrs.append(lr)
        losses.append(loss.item())

        # Increase learning rate for the next iteration
        for param_group in optimizer.param_groups:
            param_group['lr'] *= lr_factor

        # Break if the loss explodes
        if not np.isfinite(loss.item()) or loss.item() > 4:
            break

    # Restore original parameters
    for name, param in model.named_parameters():
        param.data = original_params[name]

    # Plot the learning rate vs loss
    plt.figure(figsize=(10, 6))
    plt.plot(lrs, losses)
    plt.xscale('log')
    plt.xlabel('Learning Rate')
    plt.ylabel('Loss')
    plt.title('Learning Rate Finder')
    plt.savefig('lr_finder.png')

    return lrs, losses
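A possible way to use it, building on the CIFAR-10 example above: run the finder on a fresh model and optimizer so the brief LR sweep doesn't disturb your real training run. A common heuristic is then to pick a learning rate somewhat below the point where the loss is lowest on the plot.

finder_model = SimpleCNN().to(device)
finder_optimizer = optim.SGD(finder_model.parameters(), lr=1e-7, momentum=0.9)
lrs, losses = find_learning_rate(finder_model, trainloader, finder_optimizer, criterion, device)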
Best Practices for Learning Rate Scheduling
- Warm-up period: Start with a smaller learning rate and gradually increase it over the first few epochs. This helps stabilize early training (see the sketch after this list).
- Cyclical learning rates: Consider cyclical schedules (like CosineAnnealingWarmRestarts) that periodically increase the learning rate, helping the model escape local minima.
- Monitor validation metrics: Learning rate changes should correlate with improvements in validation metrics.
- Use a learning rate finder: Start your project by finding a good initial learning rate range.
- Combine with proper initialization: Proper weight initialization techniques complement learning rate scheduling.
- Different parameter groups: Consider using different learning rates for different layers (often lower for early layers, higher for later layers), as also shown in the sketch below.
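To illustrate the first and last points, here is a minimal sketch (the layer sizes, factors, and epoch counts are only examples, not recommendations): a linear warm-up chained into cosine annealing with SequentialLR, applied to an optimizer with per-layer parameter groups.

import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# A toy two-part model so each part can get its own learning rate
backbone = nn.Linear(128, 64)
head = nn.Linear(64, 10)

# Different parameter groups: the default lr (0.01) for the backbone, a higher lr for the head
optimizer = SGD([
    {"params": backbone.parameters()},          # uses the default lr below
    {"params": head.parameters(), "lr": 0.1},   # later layers get a higher lr
], lr=0.01, momentum=0.9)

# Warm up linearly for 5 epochs, then cosine-anneal over the remaining 45
warmup = LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=45)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])

for epoch in range(50):
    optimizer.step()   # stand-in for one epoch of real training
    scheduler.step()
    print(epoch, [round(group["lr"], 4) for group in optimizer.param_groups])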
Summary
Learning rate scheduling is a powerful technique to improve training dynamics and model performance. PyTorch provides a variety of schedulers that allow you to implement different strategies for adjusting the learning rate during training.
Key takeaways:
- Learning rate scheduling can lead to faster convergence and better final performance
- PyTorch offers various built-in schedulers like StepLR, CosineAnnealingLR, and ReduceLROnPlateau
- The choice of scheduler depends on your specific task and dataset
- Using a learning rate finder can help determine good initial learning rates
- Modern techniques like OneCycleLR often provide excellent results
Additional Resources
- PyTorch Learning Rate Scheduler Documentation
- Cyclical Learning Rates for Training Neural Networks
- Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
- fastai's Learning Rate Finder Implementation
Exercises
- Implement different learning rate schedulers on the MNIST dataset and compare their performance.
- Create a custom learning rate scheduler that combines features from existing schedulers.
- Experiment with the OneCycleLR scheduler using different values for pct_start and observe the effects.
- Implement a learning rate finder and use it to find the optimal learning rate for a custom dataset.
- Compare the performance of a model trained with constant learning rate versus one trained with a scheduler.