PyTorch Mixed Precision Training
Introduction
Mixed precision training is a technique that combines different numerical precision types (typically FP32 and FP16) during model training to improve computational efficiency. By using lower precision where possible, you can achieve:
- Faster computation: Half-precision operations are significantly faster on modern GPUs, especially on NVIDIA GPUs with Tensor Cores
- Reduced memory usage: FP16 values take up half the memory of FP32 values
- Similar model accuracy: When implemented correctly, mixed precision maintains model quality
In this tutorial, we'll explore how to implement mixed precision training in PyTorch using the torch.cuda.amp package (Automatic Mixed Precision), which makes the process straightforward and automatic.
Understanding Floating Point Precision
Before diving into mixed precision training, let's understand the different floating point formats:
- FP32 (32-bit floating point): Standard precision used in most deep learning training
- FP16 (16-bit floating point): Half precision, requires half the memory but has limited range
- BF16 (Brain Floating Point): Alternative 16-bit format with better numerical properties than FP16
The challenge with FP16 is its limited dynamic range, which can cause numerical instability in training. This is why we use "mixed" precision rather than pure FP16 training.
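To make the range limitation concrete, the short snippet below (the values are purely illustrative) prints each format's range with torch.finfo and shows a value that overflows in FP16 but not in BF16:

import torch

# Compare the representable range of the three formats
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.3e}, smallest normal={info.tiny:.3e}")

# A value like 70000 exceeds FP16's maximum of ~65504 and overflows to inf,
# while BF16 keeps FP32's exponent range and can still represent it (coarsely)
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf
print(x.to(torch.bfloat16))  # finite, but rounded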
PyTorch's Automatic Mixed Precision (AMP)
PyTorch provides the torch.cuda.amp module, which simplifies mixed precision training. The key components are:
- GradScaler: Helps prevent underflow in gradients
- autocast context: Automatically casts operations to the appropriate precision
Let's see how to implement mixed precision training step by step.
Basic Implementation
Here's a simple implementation of mixed precision training:
import torch
from torch.cuda.amp import autocast, GradScaler

# Initialize model, optimizer, data loader, etc.
model = YourModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

# Initialize the GradScaler
scaler = GradScaler()

# Training loop
for epoch in range(num_epochs):
    for inputs, labels in dataloader:
        # Move data to GPU
        inputs = inputs.cuda()
        labels = labels.cuda()

        # Clear gradients
        optimizer.zero_grad()

        # Forward pass with autocast
        with autocast():
            outputs = model(inputs)
            loss = criterion(outputs, labels)

        # Backward pass with gradient scaling
        scaler.scale(loss).backward()

        # Optimizer step with unscaling
        scaler.step(optimizer)

        # Update scaler for next iteration
        scaler.update()
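As an aside, newer PyTorch releases are consolidating these utilities under torch.amp, and torch.cuda.amp may emit deprecation warnings there. If your installed version supports it (this is an assumption to verify against your version's documentation), the equivalent spelling looks like:

import torch
from torch.amp import autocast, GradScaler  # device-agnostic entry points in newer releases

scaler = GradScaler("cuda")

with autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(inputs)
    loss = criterion(outputs, labels)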
How Mixed Precision Works
Let's break down what's happening in the code above:
1. The autocast() Context Manager
with autocast():
    outputs = model(inputs)
    loss = criterion(outputs, labels)
The autocast() context automatically runs operations in FP16 where that is safe, and keeps precision-sensitive operations in FP32, following PyTorch's internal per-operation rules for numerical stability.
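You can see this selection at work by inspecting output dtypes inside the context; a small sketch, run on a CUDA device, where matrix multiplications are cast to FP16 while reductions such as sums run in FP32:

import torch
from torch.cuda.amp import autocast

a = torch.randn(8, 8, device="cuda")   # created in FP32
b = torch.randn(8, 8, device="cuda")

with autocast():
    c = a @ b        # matmul is autocast to FP16
    s = c.sum()      # reductions run in FP32 for stability
    print(c.dtype)   # torch.float16
    print(s.dtype)   # torch.float32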
2. The GradScaler Object
scaler = GradScaler()
The gradient scaler helps prevent gradient underflow. Since gradients in FP16 can become too small to be represented, the scaler multiplies the loss by a scale factor before backpropagation.
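As a toy illustration of why this matters (the numbers here are arbitrary), a tiny gradient value that underflows to zero in FP16 survives once multiplied by GradScaler's default initial scale of 2**16:

import torch

grad = torch.tensor(1e-8)                    # a very small FP32 gradient value
print(grad.to(torch.float16))                # 0.0 -- underflows in FP16

scaled = (grad * 65536.0).to(torch.float16)  # 65536 = 2**16, GradScaler's default init_scale
print(scaled)                                # ~6.55e-4, representable in FP16
print(scaled.float() / 65536.0)              # unscaling in FP32 recovers ~1e-8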
3. Scaling the Loss
scaler.scale(loss).backward()
This scales the loss value before backpropagation, which effectively scales gradients by the same factor.
4. Optimizer Step with Unscaling
scaler.step(optimizer)
Before applying gradients, the scaler unscales them (divides by the same scale factor). It also checks for gradient overflow and skips the optimization step if detected.
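If you need the true (unscaled) gradients before the step, for example for gradient clipping, you can call scaler.unscale_(optimizer) yourself; scaler.step() then knows not to unscale a second time. A minimal sketch of that pattern:

scaler.scale(loss).backward()

# Bring the gradients back to their true range before touching them
scaler.unscale_(optimizer)

# Now it is safe to clip (or inspect) the real gradient values
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

scaler.step(optimizer)  # detects that gradients are already unscaled
scaler.update()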
5. Updating the Scaler
scaler.update()
This adjusts the scale factor based on whether overflow was detected. If no overflow occurred, the scale factor may be increased; if overflow occurred, it's decreased.
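This growth/backoff behaviour can be tuned through the constructor; the defaults shown below match recent PyTorch releases, but verify them against your version's documentation:

scaler = GradScaler(
    init_scale=65536.0,    # starting scale factor (2**16)
    growth_factor=2.0,     # multiply the scale by this after enough overflow-free steps
    backoff_factor=0.5,    # multiply the scale by this when inf/NaN gradients are found
    growth_interval=2000,  # number of consecutive overflow-free steps before growing
)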
Complete Training Loop Example
Here's a more complete example showing mixed precision training with additional training loop components:
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler
from torchvision import models, datasets, transforms
from torch.utils.data import DataLoader

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Set up a simple model
model = models.resnet18(pretrained=False, num_classes=10).to(device)

# Set up data loaders (simplified)
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Initialize the GradScaler
scaler = GradScaler()

# Training function
def train_one_epoch(model, train_loader, optimizer, criterion, scaler, epoch):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for i, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass with autocast
        with autocast():
            outputs = model(inputs)
            loss = criterion(outputs, targets)

        # Backward and optimize with gradient scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        # Statistics
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

        if i % 100 == 99:
            print(f'Epoch: {epoch+1}, Batch: {i+1}, Loss: {running_loss/100:.3f}, '
                  f'Accuracy: {100.*correct/total:.2f}%')
            running_loss = 0.0

    return 100.*correct/total

# Train the model
num_epochs = 5
for epoch in range(num_epochs):
    train_accuracy = train_one_epoch(model, train_loader, optimizer, criterion, scaler, epoch)
    print(f'Epoch {epoch+1} completed. Training accuracy: {train_accuracy:.2f}%')

print('Finished Training')
Sample output might look like:
Epoch: 1, Batch: 100, Loss: 1.856, Accuracy: 32.55%
Epoch: 1, Batch: 200, Loss: 1.683, Accuracy: 38.78%
Epoch: 1, Batch: 300, Loss: 1.509, Accuracy: 46.12%
Epoch: 1, Batch: 400, Loss: 1.471, Accuracy: 47.84%
Epoch: 1, Batch: 500, Loss: 1.328, Accuracy: 52.58%
Epoch: 1, Batch: 600, Loss: 1.257, Accuracy: 55.44%
Epoch: 1, Batch: 700, Loss: 1.182, Accuracy: 58.32%
Epoch 1 completed. Training accuracy: 58.85%
...
Comparing Performance: Mixed Precision vs. Standard Training
Let's create a simple benchmark to compare the performance of mixed precision training against standard FP32 training:
import time
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler
from torchvision import models

def benchmark_training(use_amp=False):
    # Create a large model
    model = models.resnet50().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    # Create synthetic data
    batch_size = 64
    inputs = torch.randn(batch_size, 3, 224, 224).cuda()
    targets = torch.randint(0, 1000, (batch_size,)).cuda()

    # Initialize GradScaler for AMP
    scaler = GradScaler() if use_amp else None

    # Warmup
    for _ in range(10):
        if use_amp:
            with autocast():
                outputs = model(inputs)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
        optimizer.zero_grad()

    # Benchmark
    torch.cuda.synchronize()
    start_time = time.time()

    num_iterations = 100
    for _ in range(num_iterations):
        if use_amp:
            with autocast():
                outputs = model(inputs)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
        optimizer.zero_grad()

    torch.cuda.synchronize()
    end_time = time.time()

    return (end_time - start_time) / num_iterations

# Run benchmarks
fp32_time = benchmark_training(use_amp=False)
amp_time = benchmark_training(use_amp=True)

print(f"FP32 training time per iteration: {fp32_time*1000:.2f} ms")
print(f"Mixed precision training time per iteration: {amp_time*1000:.2f} ms")
print(f"Speedup: {fp32_time/amp_time:.2f}x")
Typical output on an NVIDIA GPU with Tensor Cores might look like:
FP32 training time per iteration: 87.56 ms
Mixed precision training time per iteration: 42.31 ms
Speedup: 2.07x
Memory Usage Benefits
Mixed precision training not only speeds up computation but also reduces memory usage. Here's a simple example to demonstrate the memory savings:
import torch
import gc

def measure_memory_usage(use_fp16=False):
    # Clear cache
    torch.cuda.empty_cache()
    gc.collect()

    # Get initial memory usage
    initial_memory = torch.cuda.memory_allocated()

    # Create a large tensor. Note that autocast only changes the precision of
    # selected operations (e.g. matmuls and convolutions), not of tensor
    # creation, so we allocate the half-precision tensor explicitly to show
    # the storage saving.
    size = (5000, 5000)
    dtype = torch.float16 if use_fp16 else torch.float32
    tensor = torch.randn(size, device="cuda", dtype=dtype)

    # Get memory usage after tensor creation
    final_memory = torch.cuda.memory_allocated()
    memory_used = final_memory - initial_memory

    return memory_used / (1024 * 1024)  # Convert to MB

# Measure memory usage
fp32_memory = measure_memory_usage(use_fp16=False)
fp16_memory = measure_memory_usage(use_fp16=True)

print(f"FP32 memory usage: {fp32_memory:.2f} MB")
print(f"FP16 memory usage: {fp16_memory:.2f} MB")
print(f"Memory reduction: {fp32_memory/fp16_memory:.2f}x")
Expected output:
FP32 memory usage: 95.37 MB
FP16 memory usage: 47.68 MB
Memory reduction: 2.00x
Best Practices for Mixed Precision Training
When implementing mixed precision training, keep these best practices in mind:
- Verify hardware support: Ensure your GPU supports mixed precision operations (NVIDIA GPUs with Tensor Cores work best)
- Monitor for numerical instability: Watch for NaN or inf values in your losses and outputs during training
- Consider loss scaling: In some cases, you might need to manually adjust the initial loss scale if the default doesn't work well
- Know which operations are precision-sensitive: Some operations benefit more from higher precision than others. PyTorch's autocast handles this automatically, but it's good to be aware
- Store master weights in FP32: Always keep the model weights in FP32 (PyTorch does this by default)
- Model validation: Always perform validation in FP32 for more accurate evaluation (see the sketch after this list)
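Following that last point, a minimal FP32 validation loop might look like the sketch below (val_loader is an assumed validation DataLoader, reusing model and device from the earlier example); the forward pass simply runs outside of autocast:

model.eval()
correct = 0
total = 0

with torch.no_grad():                   # no gradients needed during validation
    for inputs, targets in val_loader:  # val_loader: your validation DataLoader
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = model(inputs)         # no autocast context -> plain FP32 forward pass
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

print(f'Validation accuracy: {100.*correct/total:.2f}%')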
Common Issues and Solutions
1. Overflow/Underflow Issues
If you're experiencing training instability:
# Customize the GradScaler with a lower initial scale
scaler = GradScaler(init_scale=2**10) # Default is 2**16
# You can also disable gradient scaling entirely if needed
scaler = GradScaler(enabled=False)
2. Checking for NaN Values
def check_nan(model, value_name="weight"):
    has_nan = False
    for name, param in model.named_parameters():
        if torch.isnan(param).any():
            print(f"NaN found in {name} {value_name}")
            has_nan = True
    return has_nan

# Check model parameters for NaN
has_nan = check_nan(model)
if has_nan:
    print("Model contains NaN values!")
3. Monitoring Loss Scale
# Inside your training loop
if i % 100 == 0:
    current_scale = scaler.get_scale()
    print(f"Current loss scale: {current_scale}")
Using Mixed Precision with PyTorch Lightning
If you're using PyTorch Lightning, enabling mixed precision is even simpler:
import pytorch_lightning as pl

# Define your model as a LightningModule
class LitModel(pl.LightningModule):
    # Your model implementation
    pass

# Create the trainer with mixed precision enabled
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16  # Use mixed precision (16-bit)
)

# Train your model
model = LitModel()
trainer.fit(model)
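If you are on Lightning 2.x, the precision argument is typically passed as a string instead; to the best of my knowledge the mixed-precision equivalents are the following, but check your installed version's docs:

trainer = pl.Trainer(accelerator="gpu", devices=1, precision="16-mixed")      # FP16 mixed precision
# trainer = pl.Trainer(accelerator="gpu", devices=1, precision="bf16-mixed")  # BF16 mixed precision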
Summary
Mixed precision training is a powerful technique to accelerate deep learning training while reducing memory usage. With PyTorch's torch.cuda.amp package, implementing mixed precision is straightforward:
- Use the autocast() context manager to automatically use lower precision where appropriate
- Use GradScaler to prevent gradient underflow issues
- Follow the modified training loop pattern: scale the loss during the backward pass and unscale gradients before the optimizer step
The benefits of mixed precision training include:
- Faster training (up to 3x on hardware with Tensor Core support)
- Reduced memory usage (approximately 2x savings)
- Ability to train larger models or use larger batch sizes
By implementing mixed precision training, you can significantly improve your PyTorch training pipelines with minimal changes to your code.
Additional Resources
- PyTorch Documentation on Automatic Mixed Precision
- NVIDIA Deep Learning Performance Documentation
- PyTorch AMP Examples Repository
Exercises
- Implement mixed precision training for a simple CNN on the MNIST dataset and compare the training speed with standard FP32 training.
- Experiment with different initial loss scale values and observe how they affect training stability and performance.
- Modify the memory benchmark code to measure how much larger a batch size you can use with mixed precision compared to FP32 training.
- Implement a callback that monitors for NaN values during training and automatically adjusts the loss scale if they occur.
- Apply mixed precision training to a pre-existing project and measure the performance improvements in terms of speed and memory usage.