PyTorch Learning Rate Scheduling
Introduction
Learning rate is one of the most important hyperparameters in deep learning. It controls how much we adjust our model weights during training. If the learning rate is too large, the model might overshoot the optimal solution. If it's too small, training might take too long or get stuck in local minima.
Learning rate scheduling is a technique where we change the learning rate during training to improve model performance and convergence. PyTorch provides several built-in schedulers that help us implement different strategies for adjusting the learning rate over time.
In this tutorial, you'll learn:
- Why learning rate scheduling is important
- Different types of learning rate schedulers in PyTorch
- How to implement and use various schedulers
- Best practices for learning rate scheduling
Why Use Learning Rate Scheduling?
When training neural networks, a common challenge is finding the perfect learning rate:
- Too high: The model may oscillate around the optimal point or diverge entirely
- Too low: Training progresses slowly and may get trapped in suboptimal minima
Learning rate scheduling addresses this by typically starting with a higher learning rate and gradually reducing it according to a predefined strategy. This approach has several benefits:
- Faster initial progress: Higher learning rates in early epochs allow rapid movement toward better weights
- Fine-tuning: Smaller learning rates later enable more precise weight adjustments
- Better convergence: Often results in better final model performance
- Escaping local minima: Can help the model escape poor local minima early in training
Common Learning Rate Schedulers in PyTorch
PyTorch provides several learning rate schedulers through the torch.optim.lr_scheduler module. Let's explore the most commonly used ones:
1. Step Learning Rate Scheduler
The StepLR scheduler decays the learning rate by a factor (gamma) every fixed number of epochs (step_size).
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
# Create an optimizer
optimizer = SGD(model.parameters(), lr=0.1)
# Create a scheduler: reduce LR by a factor of 0.1 every 30 epochs
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# Training loop
for epoch in range(100):
    train_one_epoch(model, optimizer)  # Your training function

    # Print the learning rate used during this epoch
    print(f"Epoch {epoch}, Learning Rate: {optimizer.param_groups[0]['lr']}")

    # Step the scheduler once per epoch, after the epoch's training
    scheduler.step()
Output:
Epoch 0, Learning Rate: 0.1
...
Epoch 29, Learning Rate: 0.1
Epoch 30, Learning Rate: 0.01
...
Epoch 59, Learning Rate: 0.01
Epoch 60, Learning Rate: 0.001
...
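In other words, the learning rate StepLR produces at any epoch can be written in closed form. The tiny standalone sketch below (the helper name step_lr is just for illustration, not part of the PyTorch API) mirrors the schedule above:

def step_lr(epoch, initial_lr=0.1, step_size=30, gamma=0.1):
    # The LR is multiplied by gamma once every step_size epochs
    return initial_lr * gamma ** (epoch // step_size)

print(step_lr(0), step_lr(29), step_lr(30), step_lr(60))
# 0.1, 0.1, then 0.01, then 0.001 (up to floating-point rounding)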
2. Multi-Step Learning Rate Scheduler
The MultiStepLR scheduler is similar to StepLR but allows you to specify the exact epochs (milestones) at which the learning rate changes:
from torch.optim.lr_scheduler import MultiStepLR
optimizer = SGD(model.parameters(), lr=0.1)
# Reduce LR by a factor of 0.1 at epochs 30, 60, and 90
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
# Training loop
for epoch in range(100):
    train_one_epoch(model, optimizer)

    # Print the learning rate used during this epoch
    print(f"Epoch {epoch}, Learning Rate: {optimizer.param_groups[0]['lr']}")

    scheduler.step()
Output:
Epoch 0, Learning Rate: 0.1
...
Epoch 29, Learning Rate: 0.1
Epoch 30, Learning Rate: 0.01
...
Epoch 59, Learning Rate: 0.01
Epoch 60, Learning Rate: 0.001
...
3. Exponential Learning Rate Scheduler
The ExponentialLR scheduler decays the learning rate by multiplying it by gamma after every epoch:
from torch.optim.lr_scheduler import ExponentialLR
optimizer = SGD(model.parameters(), lr=0.1)
# Multiply LR by 0.95 after each epoch
scheduler = ExponentialLR(optimizer, gamma=0.95)
# Training loop
for epoch in range(10):  # Showing just 10 epochs for brevity
    train_one_epoch(model, optimizer)

    # Print the learning rate used during this epoch
    print(f"Epoch {epoch}, Learning Rate: {optimizer.param_groups[0]['lr']:.5f}")

    scheduler.step()
Output:
Epoch 0, Learning Rate: 0.10000
Epoch 1, Learning Rate: 0.09500
Epoch 2, Learning Rate: 0.09025
...
Epoch 9, Learning Rate: 0.06302
4. Cosine Annealing Learning Rate Scheduler
The CosineAnnealingLR scheduler gradually reduces the learning rate from its initial value to a minimum value (eta_min) by following half of a cosine curve over T_max epochs:
from torch.optim.lr_scheduler import CosineAnnealingLR
optimizer = SGD(model.parameters(), lr=0.1)
# Cosine annealing from 0.1 to 0.001 over 50 epochs
scheduler = CosineAnnealingLR(optimizer, T_max=50, eta_min=0.001)
# Training loop
for epoch in range(100):
    train_one_epoch(model, optimizer)

    if epoch % 10 == 0:  # Print every 10 epochs for brevity
        print(f"Epoch {epoch}, Learning Rate: {optimizer.param_groups[0]['lr']:.5f}")

    scheduler.step()
Output:
Epoch 0, Learning Rate: 0.10000
Epoch 10, Learning Rate: 0.09055
Epoch 20, Learning Rate: 0.06580
Epoch 30, Learning Rate: 0.03520
Epoch 40, Learning Rate: 0.01045
Epoch 50, Learning Rate: 0.00100
Epoch 60, Learning Rate: 0.01045  # past T_max the LR climbs back up along the cosine curve
...
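Under the hood, CosineAnnealingLR follows the standard cosine annealing formula. The short standalone sketch below (the helper name cosine_annealing_lr is just for illustration, not part of the PyTorch API) reproduces the values printed above:

import math

def cosine_annealing_lr(epoch, base_lr=0.1, eta_min=0.001, T_max=50):
    # Closed-form cosine annealing: decay from base_lr to eta_min over T_max epochs
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / T_max))

for epoch in (0, 10, 20, 30, 40, 50):
    print(f"Epoch {epoch}: {cosine_annealing_lr(epoch):.5f}")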
5. ReduceLROnPlateau
The ReduceLROnPlateau scheduler reduces the learning rate when a metric has stopped improving:
from torch.optim.lr_scheduler import ReduceLROnPlateau
optimizer = SGD(model.parameters(), lr=0.1)
# Reduce LR by a factor of 0.1 when validation loss doesn't improve for 5 epochs
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)
# Training loop
for epoch in range(100):
    train_loss = train_one_epoch(model, optimizer)
    val_loss = validate(model)  # Your validation function

    # Pass the validation loss to the scheduler
    scheduler.step(val_loss)

    print(f"Epoch {epoch}, Learning Rate: {optimizer.param_groups[0]['lr']}, Val Loss: {val_loss}")
Output:
Epoch 0, Learning Rate: 0.1, Val Loss: 2.3
...
Epoch 10, Learning Rate: 0.1, Val Loss: 1.2
Epoch 11, Learning Rate: 0.1, Val Loss: 1.19
...
Epoch 17, Learning Rate: 0.01, Val Loss: 1.19 # LR reduced once the validation loss has failed to improve for more than 5 epochs
...
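ReduceLROnPlateau also exposes a few more knobs that are often worth tuning; the values below are purely illustrative:

scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',
    factor=0.1,
    patience=5,
    threshold=1e-3,  # how much of a change counts as an improvement
    cooldown=2,      # epochs to wait after a reduction before counting bad epochs again
    min_lr=1e-5      # lower bound on the learning rate
)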
Creating a Complete Training Loop with Learning Rate Scheduling
Let's put everything together in a complete training example using the CIFAR-10 dataset and a simple CNN model:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import OneCycleLR
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Data transformations
transform = transforms.Compose([
transforms.RandomHorizontalFlip(),
transforms.RandomCrop(32, padding=4),
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
# Load CIFAR-10 dataset
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
]))
testloader = torch.utils.data.DataLoader(testset, batch_size=128, shuffle=False, num_workers=2)
# Define a simple CNN model
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        # Three 2x2 poolings reduce 32x32 inputs to 4x4 feature maps with 128 channels
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = self.pool(self.relu(self.conv3(x)))
        x = x.view(-1, 128 * 4 * 4)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x
# Create model, loss function, and optimizer
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Use OneCycleLR scheduler - a more advanced scheduler that's often very effective
scheduler = OneCycleLR(
optimizer,
max_lr=0.1,
steps_per_epoch=len(trainloader),
epochs=30,
pct_start=0.3, # 30% of training spent increasing LR
anneal_strategy='cos'
)
# Lists to store metrics
train_losses = []
test_accuracies = []
learning_rates = []
# Training loop
def train_model(epochs=30):
    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        epoch_loss = 0.0
        for i, data in enumerate(trainloader):
            inputs, labels = data[0].to(device), data[1].to(device)

            # Zero the parameter gradients
            optimizer.zero_grad()

            # Forward + backward + optimize
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # Update learning rate (OneCycleLR is stepped once per batch)
            scheduler.step()

            # Record learning rate
            learning_rates.append(optimizer.param_groups[0]['lr'])

            running_loss += loss.item()
            epoch_loss += loss.item()
            if i % 100 == 99:
                print(f'[{epoch + 1}, {i + 1}] loss: {running_loss / 100:.3f}')
                running_loss = 0.0

        # Save average training loss for this epoch
        train_losses.append(epoch_loss / len(trainloader))

        # Evaluate on test set
        correct = 0
        total = 0
        model.eval()
        with torch.no_grad():
            for data in testloader:
                images, labels = data[0].to(device), data[1].to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        accuracy = 100 * correct / total
        test_accuracies.append(accuracy)
        print(f'Epoch {epoch+1}: Test Accuracy: {accuracy:.2f}%')

    print('Finished Training')

# Train the model
train_model(epochs=30)
# Plot learning rate over iterations
plt.figure(figsize=(10, 4))
plt.plot(learning_rates)
plt.title('Learning Rate Schedule')
plt.xlabel('Iterations')
plt.ylabel('Learning Rate')
plt.grid(True)
plt.savefig('learning_rate_schedule.png')
# Plot training loss and test accuracy
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(train_losses)
plt.title('Training Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.grid(True)
plt.subplot(1, 2, 2)
plt.plot(test_accuracies)
plt.title('Test Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy (%)')
plt.grid(True)
plt.savefig('training_metrics.png')
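One practical detail when using any scheduler: if you checkpoint and resume training, save and restore the scheduler's state alongside the model and optimizer. A minimal sketch (the file name is arbitrary):

# Save a checkpoint
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}, "checkpoint.pth")

# Resume later
checkpoint = torch.load("checkpoint.pth")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scheduler.load_state_dict(checkpoint["scheduler"])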
Choosing the Right Learning Rate Scheduler
Different schedulers work better for different tasks. Here are some guidelines:
- Step-based schedulers (StepLR, MultiStepLR):
  - Good for tasks where you have a solid idea of when the learning rate should be reduced
  - Simple to implement and interpret
  - Common practice: reduce the LR by 10x after plateaus in validation metrics
- Cosine annealing schedulers:
  - Generally provide smooth transitions
  - Often work well in practice for many vision tasks
  - CosineAnnealingWarmRestarts can help escape local minima (see the sketch after this list)
- ReduceLROnPlateau:
  - Adaptive approach that responds to model performance
  - Great when you're unsure when to reduce the learning rate
  - Typically more robust across different datasets/models
- OneCycleLR:
  - Often provides faster convergence and better results
  - Follows research showing the benefits of super-convergence
  - A good default choice for many modern deep learning tasks
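CosineAnnealingWarmRestarts is referenced above but not shown earlier, so here is a minimal sketch of how it is typically wired up (T_0, T_mult, and the epoch count are illustrative, and train_one_epoch is the same placeholder training function used in the earlier examples):

from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

optimizer = SGD(model.parameters(), lr=0.1)

# First cosine cycle lasts T_0=10 epochs; each following cycle is twice as long (T_mult=2)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=0.001)

for epoch in range(70):
    train_one_epoch(model, optimizer)
    scheduler.step()  # the LR jumps back up to 0.1 at the start of each new cycle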
Learning Rate Finder
Before setting up a scheduler, it's often helpful to find a good initial learning rate. PyTorch doesn't have a built-in learning rate finder, but libraries like fastai and pytorch-lightning provide this functionality.
Here's a simplified implementation:
import numpy as np  # used below to check for non-finite losses

def find_learning_rate(model, train_loader, optimizer, criterion, device, start_lr=1e-7, end_lr=10, num_iter=100):
    # Save original parameters to restore them later
    original_params = {name: param.clone() for name, param in model.named_parameters()}

    # Initialize lists to store learning rates and losses
    lrs = []
    losses = []

    # Set the initial learning rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = start_lr

    # Calculate the multiplication factor applied after each iteration
    lr_factor = (end_lr / start_lr) ** (1 / num_iter)

    model.train()
    for i, (inputs, targets) in enumerate(train_loader):
        if i >= num_iter:
            break

        inputs, targets = inputs.to(device), targets.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Store learning rate and loss
        lr = optimizer.param_groups[0]['lr']
        lrs.append(lr)
        losses.append(loss.item())

        # Increase learning rate for the next iteration
        for param_group in optimizer.param_groups:
            param_group['lr'] *= lr_factor

        # Break if the loss explodes
        if not np.isfinite(loss.item()) or loss.item() > 4:
            break

    # Restore original parameters
    for name, param in model.named_parameters():
        param.data = original_params[name]

    # Plot the learning rate vs loss
    plt.figure(figsize=(10, 6))
    plt.plot(lrs, losses)
    plt.xscale('log')
    plt.xlabel('Learning Rate')
    plt.ylabel('Loss')
    plt.title('Learning Rate Finder')
    plt.savefig('lr_finder.png')

    return lrs, losses
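A possible way to use it, building on the CIFAR-10 example above: run the finder on a fresh model and optimizer so the brief LR sweep doesn't disturb your real training run. A common heuristic is then to pick a learning rate somewhat below the point where the loss is lowest on the plot.

finder_model = SimpleCNN().to(device)
finder_optimizer = optim.SGD(finder_model.parameters(), lr=1e-7, momentum=0.9)
lrs, losses = find_learning_rate(finder_model, trainloader, finder_optimizer, criterion, device)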
Best Practices for Learning Rate Scheduling
- Warm-up period: Start with a smaller learning rate and gradually increase it over the first few epochs. This helps stabilize early training (see the sketch after this list).
- Cyclical learning rates: Consider cyclical schedules (like CosineAnnealingWarmRestarts) that periodically increase the learning rate, helping the model escape local minima.
- Monitor validation metrics: Learning rate changes should correlate with improvements in validation metrics.
- Use a learning rate finder: Start your project by finding a good initial learning rate range.
- Combine with proper initialization: Proper weight initialization techniques complement learning rate scheduling.
- Different parameter groups: Consider using different learning rates for different layers (often lower for early layers, higher for later layers), as also shown in the sketch below.
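To illustrate the first and last points, here is a minimal sketch (the layer sizes, factors, and epoch counts are only examples, not recommendations): a linear warm-up chained into cosine annealing with SequentialLR, applied to an optimizer with per-layer parameter groups.

import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# A toy two-part model so each part can get its own learning rate
backbone = nn.Linear(128, 64)
head = nn.Linear(64, 10)

# Different parameter groups: the default lr (0.01) for the backbone, a higher lr for the head
optimizer = SGD([
    {"params": backbone.parameters()},          # uses the default lr below
    {"params": head.parameters(), "lr": 0.1},   # later layers get a higher lr
], lr=0.01, momentum=0.9)

# Warm up linearly for 5 epochs, then cosine-anneal over the remaining 45
warmup = LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=45)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])

for epoch in range(50):
    optimizer.step()   # stand-in for one epoch of real training
    scheduler.step()
    print(epoch, [round(group["lr"], 4) for group in optimizer.param_groups])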
Summary
Learning rate scheduling is a powerful technique to improve training dynamics and model performance. PyTorch provides a variety of schedulers that allow you to implement different strategies for adjusting the learning rate during training.
Key takeaways:
- Learning rate scheduling can lead to faster convergence and better final performance
- PyTorch offers various built-in schedulers like StepLR, CosineAnnealingLR, and ReduceLROnPlateau
- The choice of scheduler depends on your specific task and dataset
- Using a learning rate finder can help determine good initial learning rates
- Modern techniques like OneCycleLR often provide excellent results
Additional Resources
- PyTorch Learning Rate Scheduler Documentation
- Cyclical Learning Rates for Training Neural Networks
- Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
- fastai's Learning Rate Finder Implementation
Exercises
- Implement different learning rate schedulers on the MNIST dataset and compare their performance.
- Create a custom learning rate scheduler that combines features from existing schedulers.
- Experiment with the OneCycleLR scheduler using different values for pct_start and observe the effects.
- Implement a learning rate finder and use it to find the optimal learning rate for a custom dataset.
- Compare the performance of a model trained with constant learning rate versus one trained with a scheduler.