PyTorch CUDA Optimization
Introduction
Graphics Processing Units (GPUs) have revolutionized deep learning by enabling massive parallel computation. PyTorch offers seamless integration with NVIDIA's CUDA platform, allowing models to train significantly faster than on CPUs. However, simply running your code on a GPU doesn't guarantee optimal performance. This guide will teach you how to effectively optimize your PyTorch code for CUDA to achieve maximum efficiency and speed.
CUDA Basics in PyTorch
Checking CUDA Availability
Before diving into optimization techniques, let's ensure CUDA is available in your environment:
import torch

# Check if CUDA is available
cuda_available = torch.cuda.is_available()
print(f"CUDA available: {cuda_available}")

# Get the number of CUDA devices
if cuda_available:
    num_devices = torch.cuda.device_count()
    print(f"Number of CUDA devices: {num_devices}")
    # Print each device's name
    for i in range(num_devices):
        print(f"Device {i}: {torch.cuda.get_device_name(i)}")
Sample output:
CUDA available: True
Number of CUDA devices: 1
Device 0: NVIDIA GeForce RTX 3080
Basic Device Management
Moving tensors and models between devices is fundamental in PyTorch:
# Create a device object
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Create a tensor on CPU
cpu_tensor = torch.rand(3, 3)
print(f"Tensor device: {cpu_tensor.device}")
# Move tensor to GPU
gpu_tensor = cpu_tensor.to(device)
print(f"Tensor device: {gpu_tensor.device}")
# Creating a tensor directly on GPU
direct_gpu_tensor = torch.rand(3, 3, device=device)
print(f"Direct GPU tensor device: {direct_gpu_tensor.device}")
Sample output:
Using device: cuda
Tensor device: cpu
Tensor device: cuda:0
Direct GPU tensor device: cuda:0
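One subtlety worth calling out here: Tensor.to() returns a new tensor and leaves the original where it was, while Module.to() moves a model's parameters in place. A minimal sketch of the difference:

# Tensor.to() is out-of-place: assign the result, or the tensor stays on CPU
t = torch.rand(3)
t.to(device)       # no effect on t itself
print(t.device)    # still cpu
t = t.to(device)   # correct: rebind the name to the GPU copy

# Module.to() moves parameters in place (and returns self for chaining)
layer = torch.nn.Linear(3, 3)
layer.to(device)   # layer's weights are now on the GPU
print(next(layer.parameters()).device)  # cuda:0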
Common CUDA Pitfalls and Optimizations
1. Avoiding CPU-GPU Synchronization
One of the most common performance bottlenecks is unnecessary data transfer between CPU and GPU: every .cpu() or .numpy() call forces the GPU to finish its queued work before the copy can happen:
import time

# Bad practice: unnecessary CPU-GPU transfers
def inefficient_calculation(tensor1, tensor2, iterations=1000):
    results = []
    for _ in range(iterations):
        # Moving back to CPU every iteration forces a synchronization
        result = (tensor1 * tensor2).sum().cpu().numpy()
        results.append(result)
    return results

# Good practice: keep computation on GPU
def efficient_calculation(tensor1, tensor2, iterations=1000):
    results = torch.zeros(iterations, device=tensor1.device)
    for i in range(iterations):
        # Stay on GPU until the end
        results[i] = (tensor1 * tensor2).sum()
    # Transfer only once at the end
    return results.cpu().numpy()
# Benchmark
a = torch.rand(1000, 1000, device=device)
b = torch.rand(1000, 1000, device=device)
# Time inefficient approach
start = time.time()
inefficient_calculation(a, b, 100)
print(f"Inefficient time: {time.time() - start:.4f} seconds")
# Time efficient approach
start = time.time()
efficient_calculation(a, b, 100)
print(f"Efficient time: {time.time() - start:.4f} seconds")
Sample output:
Inefficient time: 0.5832 seconds
Efficient time: 0.0214 seconds
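A caveat when timing GPU code: CUDA kernels launch asynchronously, so time.time() alone can under-report unless something forces a synchronization (in the benchmark above, the final .cpu() call inside each function does). For standalone measurements, CUDA events give reliable numbers. A minimal sketch, using an illustrative helper:

# Illustrative helper: time a GPU operation accurately with CUDA events
def cuda_time(fn, iterations=100):
    # Warm up so one-time costs don't skew the measurement
    for _ in range(10):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()  # make sure prior work has finished
    start.record()
    for _ in range(iterations):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait for the timed work to finish
    return start.elapsed_time(end) / 1000  # elapsed_time is in milliseconds

print(f"GPU time: {cuda_time(lambda: (a * b).sum()):.4f} seconds")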
2. Batch Processing
Always process data in batches rather than individual samples:
# Inefficient: processing one sample at a time
def process_individually(data, model):
    results = []
    for sample in data:
        # This adds overhead for each sample
        sample = sample.unsqueeze(0).to(device)  # Add batch dimension
        result = model(sample)
        results.append(result.cpu())
    return torch.cat(results)

# Efficient: processing in batches
def process_in_batches(data, model, batch_size=64):
    results = []
    for i in range(0, len(data), batch_size):
        # Process multiple samples at once
        batch = data[i:i+batch_size].to(device)
        batch_results = model(batch)
        results.append(batch_results.cpu())
    return torch.cat(results)
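If these functions are used for inference only, wrapping the call in torch.no_grad() also skips autograd bookkeeping, saving both time and memory. A short usage sketch, assuming data and model are defined as above:

# Inference-only usage: no gradient tracking needed
with torch.no_grad():
    predictions = process_in_batches(data, model)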
3. Using Pinned Memory
Pinned memory can speed up CPU-GPU transfers:
# Set up a data loader with pinned memory
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,  # Enables faster CPU-to-GPU transfers
    num_workers=4
)
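Under the hood, pin_memory=True places batches in page-locked host memory, which is what allows the copy to the GPU to be issued asynchronously. The same effect can be achieved manually; a small sketch:

# Manually pin a CPU tensor, then issue a non-blocking copy
cpu_batch = torch.rand(64, 3, 32, 32).pin_memory()    # page-locked host memory
gpu_batch = cpu_batch.to(device, non_blocking=True)   # copy can overlap with GPU compute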
4. Asynchronous Data Loading
Overlap data loading and model computation:
# Prefetcher that loads the next batch on a side stream while the current
# batch is being processed. Note: the loader should use pin_memory=True,
# or the non_blocking copies will silently fall back to synchronous ones.
class DataPrefetcher:
    def __init__(self, loader, device):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self.preload()

    def preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input = None
            self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            self.next_input = self.next_input.to(self.device, non_blocking=True)
            self.next_target = self.next_target.to(self.device, non_blocking=True)

    def next(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        input = self.next_input
        target = self.next_target
        self.preload()
        return input, target

# Usage in a training loop
prefetcher = DataPrefetcher(train_loader, device)
data, target = prefetcher.next()
while data is not None:
    # Use the prefetched data for training
    output = model(data)
    loss = criterion(output, target)
    # Backward pass and optimization...
    # Get the next batch
    data, target = prefetcher.next()
Advanced CUDA Optimizations
1. Optimizing Memory Usage
Using CUDA memory efficiently is crucial for large models:
# Check memory usage
def print_gpu_memory():
    if torch.cuda.is_available():
        print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
        print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Example usage
print_gpu_memory()
large_tensor = torch.rand(10000, 10000, device=device)
print_gpu_memory()
del large_tensor
torch.cuda.empty_cache()  # Free cached memory
print_gpu_memory()
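Besides current usage, the caching allocator also tracks peak usage, which is often more useful when hunting for a safe batch size. A short sketch, assuming a recent PyTorch with the peak-stats API:

# Track peak memory around a region of interest
torch.cuda.reset_peak_memory_stats()
big = torch.rand(8000, 8000, device=device) @ torch.rand(8000, 8000, device=device)
print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
del big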
2. Mixed Precision Training
Using mixed precision (FP16) can significantly speed up training on GPUs with Tensor Cores (Volta and newer):
import torch.cuda.amp as amp

# Create model and optimizer
model = MyModel().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Create a GradScaler for mixed precision
scaler = amp.GradScaler()

# Training loop with mixed precision
for inputs, targets in dataloader:
    inputs = inputs.to(device)
    targets = targets.to(device)

    # Forward pass with autocast
    with amp.autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Backward and optimize with gradient scaling
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
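On Ampere and newer GPUs, bfloat16 is an alternative autocast dtype; because bf16 keeps FP32's exponent range, training usually needs no GradScaler. A sketch, assuming PyTorch 1.10+ for the torch.autocast entry point:

# bfloat16 autocast: no gradient scaling required in the common case
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()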
3. JIT Compilation for CUDA Code
Using TorchScript to compile models:
# Define a simple model
class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(100, 10)

    def forward(self, x):
        return self.linear(x)

# Create and trace the model
model = MyModel().to(device)
example_input = torch.rand(32, 100, device=device)
traced_model = torch.jit.trace(model, example_input)

# Save the compiled model
traced_model.save("traced_model.pt")

# Compare performance
def benchmark_model(model, input_tensor, iterations=1000):
    torch.cuda.synchronize()  # Start timing from a clean point
    start = time.time()
    for _ in range(iterations):
        _ = model(input_tensor)
    torch.cuda.synchronize()  # Wait for all CUDA operations to finish
    return time.time() - start

regular_time = benchmark_model(model, example_input)
jit_time = benchmark_model(traced_model, example_input)
print(f"Regular model time: {regular_time:.4f} seconds")
print(f"JIT model time: {jit_time:.4f} seconds")
print(f"Speedup: {regular_time/jit_time:.2f}x")
Real-World Application: Optimizing a CNN for Image Classification
Let's put everything together in a real-world example: optimizing a convolutional neural network for image classification.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import time
from torch.cuda.amp import autocast, GradScaler

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define transforms with augmentation and normalization
transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Load the CIFAR-10 dataset
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                          shuffle=True, num_workers=2,
                                          pin_memory=True)  # pin_memory for faster GPU transfers

# Define a simple CNN model
class OptimizedCNN(nn.Module):
    def __init__(self):
        super(OptimizedCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = self.pool(self.relu(self.conv3(x)))
        x = x.view(-1, 128 * 4 * 4)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Create the model and move it to the GPU
model = OptimizedCNN().to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Create a gradient scaler for mixed precision training
scaler = GradScaler()

# Training function with optimizations
def train_optimized(epochs=5):
    model.train()
    start_time = time.time()
    for epoch in range(epochs):
        running_loss = 0.0
        epoch_start = time.time()
        for i, data in enumerate(trainloader):
            inputs = data[0].to(device, non_blocking=True)
            labels = data[1].to(device, non_blocking=True)

            # Zero the parameter gradients
            optimizer.zero_grad()

            # Forward pass with mixed precision
            with autocast():
                outputs = model(inputs)
                loss = criterion(outputs, labels)

            # Backward and optimize with gradient scaling
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

            running_loss += loss.item()
            if i % 100 == 99:
                print(f'Epoch {epoch+1}, Batch {i+1}: Loss: {running_loss/100:.3f}')
                running_loss = 0.0
        epoch_time = time.time() - epoch_start
        print(f'Epoch {epoch+1} completed in {epoch_time:.2f} seconds')
    total_time = time.time() - start_time
    print(f'Training completed in {total_time:.2f} seconds')

# Train the model
train_optimized(epochs=3)

# Save the optimized model
torch.save(model.state_dict(), "optimized_cnn.pth")

# Create a JIT-compiled version for inference
model.eval()  # switch to eval mode before tracing for inference
example_input = torch.rand(1, 3, 32, 32, device=device)
traced_model = torch.jit.trace(model, example_input)
traced_model.save("optimized_cnn_jit.pth")
Memory Management Best Practices
Proper memory management is crucial for training large models:
# Free unused memory
torch.cuda.empty_cache()

# Check memory usage before and after operations
print_gpu_memory()
large_computation = torch.randn(10000, 10000, device=device) @ torch.randn(10000, 10000, device=device)
print_gpu_memory()
del large_computation
torch.cuda.empty_cache()
print_gpu_memory()

# Use gradient checkpointing for large models: activations are recomputed
# during the backward pass instead of being stored
from torch.utils.checkpoint import checkpoint

def run_model_with_checkpointing(model, input_tensor):
    # Break computation into smaller pieces to reduce memory usage
    def custom_forward(x):
        return model(x)
    return checkpoint(custom_forward, input_tensor)
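For models built as an nn.Sequential, checkpoint_sequential offers a ready-made variant that splits the model into segments and checkpoints each one. A minimal sketch with a hypothetical toy model:

from torch.utils.checkpoint import checkpoint_sequential

# Toy sequential model for illustration
seq_model = nn.Sequential(
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 10),
).to(device)
x = torch.rand(32, 100, device=device, requires_grad=True)
out = checkpoint_sequential(seq_model, 2, x)  # 2 segments; activations recomputed in backward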
Profiling Your CUDA Code
PyTorch provides tools to profile and analyze your code:
# Basic profiling
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    model(example_input)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# More detailed profiling with a schedule and Chrome trace export
from torch.profiler import profile, record_function, ProfilerActivity

def trace_handler(p):
    output = p.key_averages().table(sort_by="cuda_time_total", row_limit=10)
    print(output)
    p.export_chrome_trace("trace.json")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=2),
    on_trace_ready=trace_handler
) as p:
    for idx, (inputs, labels) in enumerate(trainloader):
        if idx >= 4:
            break
        inputs, labels = inputs.to(device), labels.to(device)
        with record_function("model_inference"):
            model(inputs)
        p.step()
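The same profiler can also attribute memory to individual operators, which pairs nicely with the memory tools above. A sketch:

# Profile per-operator memory usage
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as p:
    model(example_input)
print(p.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=5))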
Multi-GPU Training
For even faster training, you can use multiple GPUs:
# DataParallel training (simplest approach, single process)
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model = nn.DataParallel(model)
# DistributedDataParallel (more advanced, one process per GPU)
import os
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel

def setup(rank, world_size):
    # Tell each process where to find the rendezvous point
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    # Initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train_ddp(rank, world_size):
    setup(rank, world_size)
    # Create the model and move it to the GPU with id `rank`
    model = OptimizedCNN().to(rank)
    ddp_model = DistributedDataParallel(model, device_ids=[rank])
    # Training code here
    # ...
    cleanup()

# Start one process per GPU
if torch.cuda.device_count() > 1:
    world_size = torch.cuda.device_count()
    mp.spawn(train_ddp, args=(world_size,), nprocs=world_size, join=True)
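One missing piece in the skeleton above ("Training code here") is data sharding: with DDP each process should see a different slice of the dataset, which DistributedSampler provides. A sketch of what could go inside train_ddp:

from torch.utils.data.distributed import DistributedSampler

# Inside train_ddp: give each rank its own shard of the data
sampler = DistributedSampler(trainset, num_replicas=world_size, rank=rank)
loader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                     sampler=sampler, num_workers=2,
                                     pin_memory=True)
for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for inputs, labels in loader:
        ...  # forward/backward with ddp_model as usual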
Summary
Optimizing PyTorch code for CUDA can drastically improve the performance of your deep learning models. In this guide, we've covered key aspects:
- Basic CUDA operations and device management
- Avoiding common pitfalls like excessive CPU-GPU transfers
- Advanced optimization techniques, including:
  - Mixed precision training
  - Asynchronous data loading
  - Memory management
  - JIT compilation
- Multi-GPU training
- Profiling and performance analysis
By implementing these optimizations, you can significantly speed up both training and inference time for your models, allowing you to iterate faster and work with larger, more complex architectures.
Additional Resources
- PyTorch Performance Tuning Guide
- NVIDIA CUDA Programming Guide
- PyTorch Profiler Documentation
- Distributed Training with PyTorch
Exercises
- Profile a simple neural network training loop on your GPU and identify the top 3 operations that consume the most time.
- Implement mixed precision training for a model of your choice and measure the speedup.
- Compare the performance of a CNN model with and without pinned memory for the data loader.
- Implement gradient checkpointing for a large model and measure the memory savings.
- If you have access to multiple GPUs, modify the example CNN to train using DistributedDataParallel and measure the scaling efficiency.