PyTorch Mixed Precision Training
Introduction
Mixed precision training is a technique that combines different numerical precision types (typically FP32 and FP16) during model training to improve computational efficiency. By using lower precision where possible, you can achieve:
- Faster computation: Half-precision operations are significantly faster on modern GPUs, especially on NVIDIA GPUs with Tensor Cores
- Reduced memory usage: FP16 values take up half the memory of FP32 values
- Similar model accuracy: When implemented correctly, mixed precision maintains model quality
In this tutorial, we'll explore how to implement mixed precision training in PyTorch using the torch.cuda.amp package (Automatic Mixed Precision), which makes the process straightforward and automatic.
Understanding Floating Point Precision
Before diving into mixed precision training, let's understand the different floating point formats:
- FP32 (32-bit floating point): Standard precision used in most deep learning training
- FP16 (16-bit floating point): Half precision, requires half the memory but has limited range
- BF16 (Brain Floating Point): Alternative 16-bit format with better numerical properties than FP16
The challenge with FP16 is its limited dynamic range, which can cause numerical instability in training. This is why we use "mixed" precision rather than pure FP16 training.
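To make the range limitation concrete, the short snippet below (the values are purely illustrative) prints each format's range with torch.finfo and shows a value that overflows in FP16 but not in BF16:

import torch

# Compare the representable range of the three formats
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.3e}, smallest normal={info.tiny:.3e}")

# A value like 70000 exceeds FP16's maximum of ~65504 and overflows to inf,
# while BF16 keeps FP32's exponent range and can still represent it (coarsely)
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf
print(x.to(torch.bfloat16))  # finite, but rounded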
PyTorch's Automatic Mixed Precision (AMP)
PyTorch provides the torch.cuda.amp module, which simplifies mixed precision training. The key components are:
- GradScaler: Helps prevent underflow in gradients
- autocast context: Automatically casts operations to the appropriate precision
Let's see how to implement mixed precision training step by step.
Basic Implementation
Here's a simple implementation of mixed precision training:
import torch
from torch.cuda.amp import autocast, GradScaler

# Initialize model, optimizer, data loader, etc.
model = YourModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

# Initialize the GradScaler
scaler = GradScaler()

# Training loop
for epoch in range(num_epochs):
    for inputs, labels in dataloader:
        # Move data to GPU
        inputs = inputs.cuda()
        labels = labels.cuda()

        # Clear gradients
        optimizer.zero_grad()

        # Forward pass with autocast
        with autocast():
            outputs = model(inputs)
            loss = criterion(outputs, labels)

        # Backward pass with gradient scaling
        scaler.scale(loss).backward()

        # Optimizer step with unscaling
        scaler.step(optimizer)

        # Update scaler for next iteration
        scaler.update()
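As an aside, newer PyTorch releases are consolidating these utilities under torch.amp, and torch.cuda.amp may emit deprecation warnings there. If your installed version supports it (this is an assumption to verify against your version's documentation), the equivalent spelling looks like:

import torch
from torch.amp import autocast, GradScaler  # device-agnostic entry points in newer releases

scaler = GradScaler("cuda")

with autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(inputs)
    loss = criterion(outputs, labels)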
How Mixed Precision Works
Let's break down what's happening in the code above:
1. The autocast() Context Manager
with autocast():
    outputs = model(inputs)
    loss = criterion(outputs, labels)
The autocast() context automatically runs operations in FP16 where that is safe, and keeps precision-sensitive operations in FP32, following PyTorch's internal per-operation rules for numerical stability.
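You can see this selection at work by inspecting output dtypes inside the context; a small sketch, run on a CUDA device, where matrix multiplications are cast to FP16 while reductions such as sums run in FP32:

import torch
from torch.cuda.amp import autocast

a = torch.randn(8, 8, device="cuda")   # created in FP32
b = torch.randn(8, 8, device="cuda")

with autocast():
    c = a @ b        # matmul is autocast to FP16
    s = c.sum()      # reductions run in FP32 for stability
    print(c.dtype)   # torch.float16
    print(s.dtype)   # torch.float32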
2. The GradScaler Object
scaler = GradScaler()
The gradient scaler helps prevent gradient underflow. Since gradients in FP16 can become too small to be represented, the scaler multiplies the loss by a scale factor before backpropagation.
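As a toy illustration of why this matters (the numbers here are arbitrary), a tiny gradient value that underflows to zero in FP16 survives once multiplied by GradScaler's default initial scale of 2**16:

import torch

grad = torch.tensor(1e-8)                    # a very small FP32 gradient value
print(grad.to(torch.float16))                # 0.0 -- underflows in FP16

scaled = (grad * 65536.0).to(torch.float16)  # 65536 = 2**16, GradScaler's default init_scale
print(scaled)                                # ~6.55e-4, representable in FP16
print(scaled.float() / 65536.0)              # unscaling in FP32 recovers ~1e-8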
3. Scaling the Loss
scaler.scale(loss).backward()
This scales the loss value before backpropagation, which effectively scales gradients by the same factor.
4. Optimizer Step with Unscaling
scaler.step(optimizer)
Before applying gradients, the scaler unscales them (divides by the same scale factor). It also checks for gradient overflow and skips the optimization step if detected.
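If you need the true (unscaled) gradients before the step, for example for gradient clipping, you can call scaler.unscale_(optimizer) yourself; scaler.step() then knows not to unscale a second time. A minimal sketch of that pattern:

scaler.scale(loss).backward()

# Bring the gradients back to their true range before touching them
scaler.unscale_(optimizer)

# Now it is safe to clip (or inspect) the real gradient values
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

scaler.step(optimizer)  # detects that gradients are already unscaled
scaler.update()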
5. Updating the Scaler
scaler.update()
This adjusts the scale factor based on whether overflow was detected. If no overflow occurred, the scale factor may be increased; if overflow occurred, it's decreased.
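This growth/backoff behaviour can be tuned through the constructor; the defaults shown below match recent PyTorch releases, but verify them against your version's documentation:

scaler = GradScaler(
    init_scale=65536.0,    # starting scale factor (2**16)
    growth_factor=2.0,     # multiply the scale by this after enough overflow-free steps
    backoff_factor=0.5,    # multiply the scale by this when inf/NaN gradients are found
    growth_interval=2000,  # number of consecutive overflow-free steps before growing
)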
Complete Training Loop Example
Here's a more complete example showing mixed precision training with additional training loop components:
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler
from torchvision import models, datasets, transforms
from torch.utils.data import DataLoader

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Set up a simple model
model = models.resnet18(pretrained=False, num_classes=10).to(device)

# Set up data loaders (simplified)
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Initialize the GradScaler
scaler = GradScaler()

# Training function
def train_one_epoch(model, train_loader, optimizer, criterion, scaler, epoch):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for i, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass with autocast
        with autocast():
            outputs = model(inputs)
            loss = criterion(outputs, targets)

        # Backward and optimize with gradient scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        # Statistics
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

        if i % 100 == 99:
            print(f'Epoch: {epoch+1}, Batch: {i+1}, Loss: {running_loss/100:.3f}, '
                  f'Accuracy: {100.*correct/total:.2f}%')
            running_loss = 0.0

    return 100.*correct/total

# Train the model
num_epochs = 5
for epoch in range(num_epochs):
    train_accuracy = train_one_epoch(model, train_loader, optimizer, criterion, scaler, epoch)
    print(f'Epoch {epoch+1} completed. Training accuracy: {train_accuracy:.2f}%')

print('Finished Training')
Sample output might look like:
Epoch: 1, Batch: 100, Loss: 1.856, Accuracy: 32.55%
Epoch: 1, Batch: 200, Loss: 1.683, Accuracy: 38.78%
Epoch: 1, Batch: 300, Loss: 1.509, Accuracy: 46.12%
Epoch: 1, Batch: 400, Loss: 1.471, Accuracy: 47.84%
Epoch: 1, Batch: 500, Loss: 1.328, Accuracy: 52.58%
Epoch: 1, Batch: 600, Loss: 1.257, Accuracy: 55.44%
Epoch: 1, Batch: 700, Loss: 1.182, Accuracy: 58.32%
Epoch 1 completed. Training accuracy: 58.85%
...
Comparing Performance: Mixed Precision vs. Standard Training
Let's create a simple benchmark to compare the performance of mixed precision training against standard FP32 training:
import time
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler
from torchvision import models

def benchmark_training(use_amp=False):
    # Create a large model
    model = models.resnet50().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    # Create synthetic data
    batch_size = 64
    inputs = torch.randn(batch_size, 3, 224, 224).cuda()
    targets = torch.randint(0, 1000, (batch_size,)).cuda()

    # Initialize GradScaler for AMP
    scaler = GradScaler() if use_amp else None

    # Warmup
    for _ in range(10):
        if use_amp:
            with autocast():
                outputs = model(inputs)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
        optimizer.zero_grad()

    # Benchmark
    torch.cuda.synchronize()
    start_time = time.time()

    num_iterations = 100
    for _ in range(num_iterations):
        if use_amp:
            with autocast():
                outputs = model(inputs)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
        optimizer.zero_grad()

    torch.cuda.synchronize()
    end_time = time.time()

    return (end_time - start_time) / num_iterations

# Run benchmarks
fp32_time = benchmark_training(use_amp=False)
amp_time = benchmark_training(use_amp=True)

print(f"FP32 training time per iteration: {fp32_time*1000:.2f} ms")
print(f"Mixed precision training time per iteration: {amp_time*1000:.2f} ms")
print(f"Speedup: {fp32_time/amp_time:.2f}x")
Typical output on an NVIDIA GPU with Tensor Cores might look like:
FP32 training time per iteration: 87.56 ms
Mixed precision training time per iteration: 42.31 ms
Speedup: 2.07x
Memory Usage Benefits
Mixed precision training not only speeds up computation but also reduces memory usage. Here's a simple example to demonstrate the memory savings:
import torch
import gc

def measure_memory_usage(use_fp16=False):
    # Clear cache
    torch.cuda.empty_cache()
    gc.collect()

    # Get initial memory usage
    initial_memory = torch.cuda.memory_allocated()

    # Create a large tensor. Note that autocast only changes the precision of
    # selected operations (e.g. matmuls and convolutions), not of tensor
    # creation, so we allocate the half-precision tensor explicitly to show
    # the storage saving.
    size = (5000, 5000)
    dtype = torch.float16 if use_fp16 else torch.float32
    tensor = torch.randn(size, device="cuda", dtype=dtype)

    # Get memory usage after tensor creation
    final_memory = torch.cuda.memory_allocated()
    memory_used = final_memory - initial_memory

    return memory_used / (1024 * 1024)  # Convert to MB

# Measure memory usage
fp32_memory = measure_memory_usage(use_fp16=False)
fp16_memory = measure_memory_usage(use_fp16=True)

print(f"FP32 memory usage: {fp32_memory:.2f} MB")
print(f"FP16 memory usage: {fp16_memory:.2f} MB")
print(f"Memory reduction: {fp32_memory/fp16_memory:.2f}x")
Expected output:
FP32 memory usage: 95.37 MB
FP16 memory usage: 47.68 MB
Memory reduction: 2.00x
Best Practices for Mixed Precision Training
When implementing mixed precision training, keep these best practices in mind:
- Verify hardware support: Ensure your GPU supports mixed precision operations (NVIDIA GPUs with Tensor Cores work best)
- Monitor for numerical instability: Watch for NaN or inf values in your losses and outputs during training
- Consider loss scaling: In some cases, you might need to manually adjust the initial loss scale if the default doesn't work well
- Know which operations are precision-sensitive: Some operations benefit more from higher precision than others. PyTorch's autocast handles this automatically, but it's good to be aware
- Store master weights in FP32: Always keep the model weights in FP32 (PyTorch does this by default)
- Model validation: Always perform validation in FP32 for more accurate evaluation (see the sketch after this list)
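Following that last point, a minimal FP32 validation loop might look like the sketch below (val_loader is an assumed validation DataLoader, reusing model and device from the earlier example); the forward pass simply runs outside of autocast:

model.eval()
correct = 0
total = 0

with torch.no_grad():                   # no gradients needed during validation
    for inputs, targets in val_loader:  # val_loader: your validation DataLoader
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = model(inputs)         # no autocast context -> plain FP32 forward pass
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

print(f'Validation accuracy: {100.*correct/total:.2f}%')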
Common Issues and Solutions
1. Overflow/Underflow Issues
If you're experiencing training instability:
# Customize the GradScaler with a lower initial scale
scaler = GradScaler(init_scale=2**10) # Default is 2**16
# You can also disable gradient scaling entirely if needed
scaler = GradScaler(enabled=False)
2. Checking for NaN Values
def check_nan(model, value_name="weight"):
    has_nan = False
    for name, param in model.named_parameters():
        if torch.isnan(param).any():
            print(f"NaN found in {name} {value_name}")
            has_nan = True
    return has_nan

# Check model parameters for NaN
has_nan = check_nan(model)
if has_nan:
    print("Model contains NaN values!")
3. Monitoring Loss Scale
# Inside your training loop
if i % 100 == 0:
    current_scale = scaler.get_scale()
    print(f"Current loss scale: {current_scale}")
Using Mixed Precision with PyTorch Lightning
If you're using PyTorch Lightning, enabling mixed precision is even simpler:
import pytorch_lightning as pl

# Define your model as a LightningModule
class LitModel(pl.LightningModule):
    # Your model implementation
    pass

# Create the trainer with mixed precision enabled
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16  # Use mixed precision (16-bit)
)

# Train your model
model = LitModel()
trainer.fit(model)
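If you are on Lightning 2.x, the precision argument is typically passed as a string instead; to the best of my knowledge the mixed-precision equivalents are the following, but check your installed version's docs:

trainer = pl.Trainer(accelerator="gpu", devices=1, precision="16-mixed")      # FP16 mixed precision
# trainer = pl.Trainer(accelerator="gpu", devices=1, precision="bf16-mixed")  # BF16 mixed precision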
Summary
Mixed precision training is a powerful technique to accelerate deep learning training while reducing memory usage. With PyTorch's torch.cuda.amp package, implementing mixed precision is straightforward:
- Use the autocast() context manager to automatically use lower precision where appropriate
- Use GradScaler to prevent gradient underflow issues
- Follow the modified training loop pattern: scale the loss during the backward pass and unscale gradients before the optimizer step
The benefits of mixed precision training include:
- Faster training (up to 3x on hardware with Tensor Core support)
- Reduced memory usage (approximately 2x savings)
- Ability to train larger models or use larger batch sizes
By implementing mixed precision training, you can significantly improve your PyTorch training pipelines with minimal changes to your code.
Additional Resources
- PyTorch Documentation on Automatic Mixed Precision
- NVIDIA Deep Learning Performance Documentation
- PyTorch AMP Examples Repository
Exercises
- Implement mixed precision training for a simple CNN on the MNIST dataset and compare the training speed with standard FP32 training.
- Experiment with different initial loss scale values and observe how they affect training stability and performance.
- Modify the memory benchmark code to measure how much larger a batch size you can use with mixed precision compared to FP32 training.
- Implement a callback that monitors for NaN values during training and automatically adjusts the loss scale if they occur.
- Apply mixed precision training to a pre-existing project and measure the performance improvements in terms of speed and memory usage.