
PyTorch Half Precision

Introduction

Half precision (also known as FP16) is a numerical format that uses 16 bits instead of the standard 32 bits (FP32) to represent floating-point numbers. Using half precision in deep learning can provide two significant advantages:

  1. Memory Efficiency: FP16 requires half the memory of FP32, allowing you to fit larger models or batch sizes on your GPU.
  2. Computational Speed: Modern GPUs (particularly NVIDIA's with Tensor Cores) can perform FP16 operations much faster than FP32 operations.

In this tutorial, we'll learn how to implement half precision in PyTorch for both training and inference, understand its limitations, and explore best practices for achieving optimal performance.

Understanding Floating Point Precision

Before diving into implementation, let's briefly understand the different precision formats:

  • FP32 (Single Precision): 32-bit floating point, standard precision in most deep learning frameworks
  • FP16 (Half Precision): 16-bit floating point, less precision but faster computation
  • BF16 (Brain Float 16): Alternative 16-bit format with different bit allocation than FP16
  • Mixed Precision: Using different precision for different operations

FP16 has a much smaller dynamic range and precision than FP32 (its largest finite value is about 65,504, and it carries roughly three decimal digits of precision), which can cause overflow, underflow, and numerical instability if not handled properly.
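
To make the range limitation concrete, here is a small illustration (the values are chosen purely for demonstration) of how FP16 overflows and loses precision that FP32 retains:

python
import torch

# FP16's largest finite value is about 65,504; anything bigger overflows to inf
big = torch.tensor(70000.0)
print(big.half())           # tensor(inf, dtype=torch.float16)

# FP16 carries only ~3 decimal digits of precision, so nearby values collapse
almost_one = torch.tensor(1.0001)
print(almost_one.half())    # prints 1.0 in FP16; the small difference is lost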

Basic Half Precision Conversion

Let's start by converting a tensor to half precision:

python
import torch

# Create a tensor
x = torch.randn(5, 5)
print(f"Original tensor dtype: {x.dtype}")

# Convert to half precision
x_half = x.half() # Alternatively: x_half = x.to(torch.float16)
print(f"Half precision tensor dtype: {x_half.dtype}")

# Check memory usage
print(f"FP32 tensor size: {x.element_size() * x.nelement()} bytes")
print(f"FP16 tensor size: {x_half.element_size() * x_half.nelement()} bytes")

Output:

Original tensor dtype: torch.float32
Half precision tensor dtype: torch.float16
FP32 tensor size: 100 bytes
FP16 tensor size: 50 bytes

As you can see, the half precision tensor uses exactly half the memory of the full precision tensor.
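
The same accounting applies to whole models, which we convert in the next section. As a quick sketch (the param_bytes helper below is our own, not a PyTorch function), you can total the memory held by a model's parameters before and after conversion:

python
import torch
import torch.nn as nn

def param_bytes(model):
    # Total bytes occupied by the model's parameters
    return sum(p.element_size() * p.nelement() for p in model.parameters())

layer = nn.Linear(1000, 1000)
print(f"FP32 parameters: {param_bytes(layer)} bytes")   # roughly 4 MB

layer = layer.half()
print(f"FP16 parameters: {param_bytes(layer)} bytes")   # roughly 2 MB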

Converting Models to Half Precision

Converting an entire model to half precision is straightforward:

python
import torch
import torch.nn as nn

# Create a simple model
model = nn.Sequential(
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 10)
)

# Convert model to half precision
model = model.half()

# Now all model parameters are in half precision
for name, param in model.named_parameters():
    print(f"{name}: {param.dtype}")

# Ensure inputs to the model are also half precision
input_data = torch.randn(32, 100).half() # batch size of 32
output = model(input_data)
print(f"Output dtype: {output.dtype}")

Output:

0.weight: torch.float16
0.bias: torch.float16
2.weight: torch.float16
2.bias: torch.float16
Output dtype: torch.float16
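
If you forget to convert the inputs, PyTorch does not silently cast them; the mismatched matrix multiply raises a runtime error instead. A quick sketch (the exact message depends on the layer, device, and PyTorch version):

python
# Feeding an FP32 tensor to the FP16 model raises a dtype mismatch error
fp32_input = torch.randn(32, 100)    # still torch.float32
try:
    model(fp32_input)
except RuntimeError as e:
    print(f"RuntimeError: {e}")      # e.g. a "Half ... Float" dtype mismatch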

Mixed Precision Training with torch.cuda.amp

While half precision can speed up computation, it can also lead to numerical instability during training. PyTorch provides the Automatic Mixed Precision (AMP) package to address this issue.

AMP automatically chooses the precision for each operation: operations that are fast and numerically safe in FP16 run in FP16, while numerically sensitive ones stay in FP32.

Here's how to use it:

python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import autocast, GradScaler

# Create model, loss function and optimizer
model = nn.Sequential(nn.Linear(100, 10)).cuda()
criterion = nn.CrossEntropyLoss().cuda()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Create gradient scaler for AMP
scaler = GradScaler()

# Training loop with AMP
def train(model, criterion, optimizer, x, y):
    # Forward pass with autocast
    with autocast():
        output = model(x)
        loss = criterion(output, y)

    # Backward pass with gradient scaling
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

    return loss.item()

# Generate dummy data
x = torch.randn(32, 100).cuda()
y = torch.randint(0, 10, (32,)).cuda()

# Train for a few steps
for i in range(5):
    loss = train(model, criterion, optimizer, x, y)
    print(f"Step {i+1}, Loss: {loss:.4f}")

Output:

Step 1, Loss: 2.3547
Step 2, Loss: 2.3127
Step 3, Loss: 2.2711
Step 4, Loss: 2.2301
Step 5, Loss: 2.1896
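
Note that newer PyTorch releases also expose a device-agnostic torch.autocast context manager, which recent versions steer users toward instead of torch.cuda.amp.autocast. Under that API the training step above could be written as follows (a sketch; check the AMP documentation for your installed version):

python
# Same training step, using the device-agnostic autocast API
with torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(x)
    loss = criterion(output, y)

optimizer.zero_grad()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()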

Key Components in Mixed Precision Training

  1. autocast Context Manager: Automatically casts operations to the appropriate precision.
  2. GradScaler: Helps prevent underflow in gradients by scaling loss values before backpropagation (a short demonstration follows below).
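
To see why the scaler matters, note that gradient values below FP16's smallest representable magnitudes flush to zero and are lost. A minimal illustration (the numbers are arbitrary, chosen only to trigger underflow):

python
import torch

# A value this small cannot be represented in FP16 and underflows to zero
small_grad = torch.tensor(1e-8)
print(small_grad.half())              # tensor(0., dtype=torch.float16)

# Scaling first (as GradScaler does with the loss) keeps it representable
scale = 2.0 ** 16
print((small_grad * scale).half())    # non-zero in FP16
# GradScaler later unscales the gradients before optimizer.step()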

Practical Example: Image Classification with Mixed Precision

Let's implement a more complete example using a real dataset (CIFAR-10) and ResNet model:

python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.cuda.amp import autocast, GradScaler
from time import time

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Data loading and preprocessing
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform
)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=128, shuffle=True, num_workers=2
)

# Define model - use a pre-trained ResNet18
model = torchvision.models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 10) # CIFAR-10 has 10 classes
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scaler = GradScaler()

# Training function (with AMP)
def train_epoch(model, train_loader, criterion, optimizer, scaler, use_amp=True):
    model.train()
    running_loss = 0.0
    start_time = time()

    for i, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()

        if use_amp:
            # Forward pass with autocast
            with autocast():
                outputs = model(images)
                loss = criterion(outputs, labels)

            # Backward and optimize with scaler
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            # Standard FP32 forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Standard backward pass
            loss.backward()
            optimizer.step()

        running_loss += loss.item()
        if (i+1) % 100 == 0:
            print(f'Batch [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}')

    epoch_time = time() - start_time
    return running_loss / len(train_loader), epoch_time

# Compare FP32 vs Mixed Precision performance
for precision in ["FP32", "Mixed Precision"]:
    use_amp = precision == "Mixed Precision"
    print(f"\nTraining with {precision}:")
    avg_loss, time_taken = train_epoch(
        model, train_loader, criterion, optimizer, scaler, use_amp
    )
    print(f"Average Loss: {avg_loss:.4f}, Time: {time_taken:.2f} seconds")

The output will show something like this (exact values will vary based on hardware; note also that the second pass continues training the same model, so its loss starts lower):

Training with FP32:
Batch [100/391], Loss: 1.5961
Batch [200/391], Loss: 1.4876
Batch [300/391], Loss: 1.3782
Average Loss: 1.5487, Time: 78.32 seconds

Training with Mixed Precision:
Batch [100/391], Loss: 1.4523
Batch [200/391], Loss: 1.3654
Batch [300/391], Loss: 1.2987
Average Loss: 1.4423, Time: 53.17 seconds

Notice how mixed precision training is significantly faster (often 30-50% faster on modern GPUs with Tensor Cores).
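
How much speedup you see depends on the GPU. As a rough check (the compute-capability threshold is a simplification), NVIDIA GPUs with compute capability 7.0 or higher (Volta and newer) include Tensor Cores that accelerate FP16 matrix math:

python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    has_tensor_cores = major >= 7    # Volta (7.0) and newer ship FP16 Tensor Cores
    print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}, "
          f"Tensor Cores: {'yes' if has_tensor_cores else 'no'}")
else:
    print("CUDA not available; FP16 is unlikely to be faster on CPU")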

Half Precision for Inference

For inference, the process is even simpler since we don't need to worry about gradient stability:

python
import torch
import torchvision.models as models
import time

# Load a pre-trained model
model = models.resnet50(pretrained=True).to("cuda")

# Generate dummy input
input_fp32 = torch.randn(16, 3, 224, 224).to("cuda")

# Measure FP32 inference time
model.eval()
start = time.time()
with torch.no_grad():
    for _ in range(100):
        _ = model(input_fp32)
torch.cuda.synchronize()
fp32_time = time.time() - start
print(f"FP32 inference time: {fp32_time:.3f} seconds")

# Convert to half precision for faster inference
# (nn.Module.half() converts the module in place and returns it)
model_half = model.half()
input_fp16 = input_fp32.half()

# Measure FP16 inference time
start = time.time()
with torch.no_grad():
    for _ in range(100):
        _ = model_half(input_fp16)
torch.cuda.synchronize()
fp16_time = time.time() - start
print(f"FP16 inference time: {fp16_time:.3f} seconds")
print(f"Speedup: {fp32_time / fp16_time:.2f}x")

Output:

FP32 inference time: 5.273 seconds
FP16 inference time: 2.814 seconds
Speedup: 1.87x
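
Calling .half() on the whole model also converts layers such as BatchNorm, whose running statistics are usually safer in FP32. An alternative sketch is to leave the model in FP32 and run inference under autocast, which downcasts only the operations that benefit:

python
# Inference with autocast instead of converting the model itself
model_fp32 = models.resnet50(pretrained=True).to("cuda").eval()

with torch.no_grad():
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = model_fp32(input_fp32)

print(f"Output dtype under autocast: {output.dtype}")  # typically torch.float16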

Best Practices and Limitations

Best Practices

  1. Use Modern Hardware: Half precision benefits most from GPUs with dedicated FP16 support (NVIDIA Volta, Turing, Ampere architectures and newer).

  2. Loss Scaling: For training, use GradScaler to prevent gradient underflow.

  3. Selective Precision: Keep certain operations (such as softmax and normalization layers) in FP32 for stability; see the sketch after this list.

  4. Model Structure: Some operations might need special handling:

    python
    # Before batchnorm layers, convert back to float32
    x = x.float()
    x = self.batch_norm(x)
    x = x.half() # Convert back to half precision if needed
  5. Check for NaNs: Monitor your training for NaN values, which might indicate numerical instability:

    python
    if torch.isnan(loss):
        print("Warning: NaN loss detected!")
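
Returning to the selective-precision point (item 3), here is a minimal sketch of wrapping a single layer so it runs in FP32 inside an otherwise FP16 model; FP32Wrapper is our own illustration, not a PyTorch class:

python
import torch
import torch.nn as nn

class FP32Wrapper(nn.Module):
    """Run the wrapped module in FP32 even when surrounding layers are FP16."""
    def __init__(self, module):
        super().__init__()
        self.module = module    # parameters stay in FP32

    def forward(self, x):
        # Upcast the input, compute in FP32, then downcast the result
        return self.module(x.float()).half()

model = nn.Sequential(
    nn.Linear(100, 100).half(),
    FP32Wrapper(nn.LayerNorm(100)),   # normalization kept in FP32
    nn.Linear(100, 10).half(),
)

x = torch.randn(8, 100).half()
print(model(x).dtype)    # torch.float16 (requires FP16 Linear support on your device)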

Limitations

  1. Reduced Precision: Not all models work well with half precision due to numerical precision requirements.

  2. Model Compatibility: Some custom operations might not support half precision; a simple fallback pattern is sketched after this list.

  3. Hardware Dependency: Older GPUs may not see significant speedups or might even slow down with half precision.
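
For the compatibility point (item 2), a pragmatic pattern is to attempt the operation in half precision and fall back to FP32 if it raises; the helper below and the choice of torch.linalg.inv are only for illustration:

python
import torch

def run_with_fp16_fallback(fn, tensor):
    # Try the op in FP16; fall back to FP32 if it is unsupported
    # on this device or PyTorch build.
    try:
        return fn(tensor.half())
    except RuntimeError as err:
        print(f"FP16 failed ({err}); retrying in FP32")
        return fn(tensor.float())

x = torch.randn(8, 8)
result = run_with_fp16_fallback(torch.linalg.inv, x)
print(result.dtype)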

Summary

Half precision is a powerful technique to improve the performance of PyTorch models:

  • Memory Usage: Reduces memory footprint by approximately 50%
  • Computational Speed: Can provide up to a 2-3x speedup on hardware with native FP16 support (e.g. Tensor Cores)
  • Implementation Options:
    • Simple .half() or .to(torch.float16) for basic conversion
    • Automatic Mixed Precision (AMP) with torch.cuda.amp for stable training

By implementing half precision correctly, you can train larger models, use larger batch sizes, and accelerate both training and inference without sacrificing model quality.

Exercises

  1. Compare the memory usage and performance of FP32 vs FP16 for different model architectures (ResNet, Transformers, etc.).

  2. Implement mixed precision training on a custom dataset and analyze how it affects convergence and training speed.

  3. Experiment with different batch sizes to find the optimal configuration for your hardware when using half precision.

  4. Modify an existing model to use half precision for some layers and full precision for others based on numerical stability requirements.


