PyTorch CUDA Optimization
Introduction
Graphics Processing Units (GPUs) have revolutionized deep learning by enabling massive parallel computation. PyTorch offers seamless integration with NVIDIA's CUDA platform, allowing models to train significantly faster than on CPUs. However, simply running your code on a GPU doesn't guarantee optimal performance. This guide will teach you how to effectively optimize your PyTorch code for CUDA to achieve maximum efficiency and speed.
CUDA Basics in PyTorch
Checking CUDA Availability
Before diving into optimization techniques, let's ensure CUDA is available in your environment:
import torch

# Check if CUDA is available
cuda_available = torch.cuda.is_available()
print(f"CUDA available: {cuda_available}")

# Get the number of CUDA devices
if cuda_available:
    num_devices = torch.cuda.device_count()
    print(f"Number of CUDA devices: {num_devices}")
    # Print each device's name
    for i in range(num_devices):
        print(f"Device {i}: {torch.cuda.get_device_name(i)}")
Sample output:
CUDA available: True
Number of CUDA devices: 1
Device 0: NVIDIA GeForce RTX 3080
Basic Device Management
Moving tensors and models between devices is fundamental in PyTorch:
# Create a device object
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Create a tensor on CPU
cpu_tensor = torch.rand(3, 3)
print(f"Tensor device: {cpu_tensor.device}")
# Move tensor to GPU
gpu_tensor = cpu_tensor.to(device)
print(f"Tensor device: {gpu_tensor.device}")
# Creating a tensor directly on GPU
direct_gpu_tensor = torch.rand(3, 3, device=device)
print(f"Direct GPU tensor device: {direct_gpu_tensor.device}")
Sample output:
Using device: cuda
Tensor device: cpu
Tensor device: cuda:0
Direct GPU tensor device: cuda:0
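One subtlety worth calling out here: Tensor.to() returns a new tensor and leaves the original where it was, while Module.to() moves a model's parameters in place. A minimal sketch of the difference:

# Tensor.to() is out-of-place: assign the result, or the tensor stays on CPU
t = torch.rand(3)
t.to(device)       # no effect on t itself
print(t.device)    # still cpu
t = t.to(device)   # correct: rebind the name to the GPU copy

# Module.to() moves parameters in place (and returns self for chaining)
layer = torch.nn.Linear(3, 3)
layer.to(device)   # layer's weights are now on the GPU
print(next(layer.parameters()).device)  # cuda:0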
Common CUDA Pitfalls and Optimizations
1. Avoiding CPU-GPU Synchronization
One of the most common performance bottlenecks is unnecessary data transfer between CPU and GPU: every .cpu() or .numpy() call forces the GPU to finish its queued work before the copy can happen:
import time

# Bad practice: unnecessary CPU-GPU transfers
def inefficient_calculation(tensor1, tensor2, iterations=1000):
    results = []
    for _ in range(iterations):
        # Moving back to CPU every iteration forces a synchronization
        result = (tensor1 * tensor2).sum().cpu().numpy()
        results.append(result)
    return results

# Good practice: keep computation on GPU
def efficient_calculation(tensor1, tensor2, iterations=1000):
    results = torch.zeros(iterations, device=tensor1.device)
    for i in range(iterations):
        # Stay on GPU until the end
        results[i] = (tensor1 * tensor2).sum()
    # Transfer only once at the end
    return results.cpu().numpy()
# Benchmark
a = torch.rand(1000, 1000, device=device)
b = torch.rand(1000, 1000, device=device)
# Time inefficient approach
start = time.time()
inefficient_calculation(a, b, 100)
print(f"Inefficient time: {time.time() - start:.4f} seconds")
# Time efficient approach
start = time.time()
efficient_calculation(a, b, 100)
print(f"Efficient time: {time.time() - start:.4f} seconds")
Sample output:
Inefficient time: 0.5832 seconds
Efficient time: 0.0214 seconds
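A caveat when timing GPU code: CUDA kernels launch asynchronously, so time.time() alone can under-report unless something forces a synchronization (in the benchmark above, the final .cpu() call inside each function does). For standalone measurements, CUDA events give reliable numbers. A minimal sketch, using an illustrative helper:

# Illustrative helper: time a GPU operation accurately with CUDA events
def cuda_time(fn, iterations=100):
    # Warm up so one-time costs don't skew the measurement
    for _ in range(10):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()  # make sure prior work has finished
    start.record()
    for _ in range(iterations):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait for the timed work to finish
    return start.elapsed_time(end) / 1000  # elapsed_time is in milliseconds

print(f"GPU time: {cuda_time(lambda: (a * b).sum()):.4f} seconds")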
2. Batch Processing
Always process data in batches rather than individual samples:
# Inefficient: processing one sample at a time
def process_individually(data, model):
    results = []
    for sample in data:
        # This adds overhead for each sample
        sample = sample.unsqueeze(0).to(device)  # Add batch dimension
        result = model(sample)
        results.append(result.cpu())
    return torch.cat(results)

# Efficient: processing in batches
def process_in_batches(data, model, batch_size=64):
    results = []
    for i in range(0, len(data), batch_size):
        # Process multiple samples at once
        batch = data[i:i+batch_size].to(device)
        batch_results = model(batch)
        results.append(batch_results.cpu())
    return torch.cat(results)
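If these functions are used for inference only, wrapping the call in torch.no_grad() also skips autograd bookkeeping, saving both time and memory. A short usage sketch, assuming data and model are defined as above:

# Inference-only usage: no gradient tracking needed
with torch.no_grad():
    predictions = process_in_batches(data, model)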
3. Using Pinned Memory
Pinned memory can speed up CPU-GPU transfers:
# Set up a data loader with pinned memory
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,  # Enables faster CPU-to-GPU transfers
    num_workers=4
)
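Under the hood, pin_memory=True places batches in page-locked host memory, which is what allows the copy to the GPU to be issued asynchronously. The same effect can be achieved manually; a small sketch:

# Manually pin a CPU tensor, then issue a non-blocking copy
cpu_batch = torch.rand(64, 3, 32, 32).pin_memory()    # page-locked host memory
gpu_batch = cpu_batch.to(device, non_blocking=True)   # copy can overlap with GPU compute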
4. Asynchronous Data Loading
Overlap data loading and model computation:
# Prefetcher that loads the next batch on a side stream while the current
# batch is being processed. Note: the loader should use pin_memory=True,
# or the non_blocking copies will silently fall back to synchronous ones.
class DataPrefetcher:
    def __init__(self, loader, device):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self.preload()

    def preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input = None
            self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            self.next_input = self.next_input.to(self.device, non_blocking=True)
            self.next_target = self.next_target.to(self.device, non_blocking=True)

    def next(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        input = self.next_input
        target = self.next_target
        self.preload()
        return input, target

# Usage in a training loop
prefetcher = DataPrefetcher(train_loader, device)
data, target = prefetcher.next()
while data is not None:
    # Use the prefetched data for training
    output = model(data)
    loss = criterion(output, target)
    # Backward pass and optimization...
    # Get the next batch
    data, target = prefetcher.next()
Advanced CUDA Optimizations
1. Optimizing Memory Usage
Using CUDA memory efficiently is crucial for large models:
# Check memory usage
def print_gpu_memory():
    if torch.cuda.is_available():
        print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
        print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Example usage
print_gpu_memory()
large_tensor = torch.rand(10000, 10000, device=device)
print_gpu_memory()
del large_tensor
torch.cuda.empty_cache()  # Free cached memory
print_gpu_memory()
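Besides current usage, the caching allocator also tracks peak usage, which is often more useful when hunting for a safe batch size. A short sketch, assuming a recent PyTorch with the peak-stats API:

# Track peak memory around a region of interest
torch.cuda.reset_peak_memory_stats()
big = torch.rand(8000, 8000, device=device) @ torch.rand(8000, 8000, device=device)
print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
del big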
2. Mixed Precision Training
Using mixed precision (FP16) can significantly speed up training on GPUs with Tensor Cores (Volta and newer):
import torch.cuda.amp as amp

# Create model and optimizer
model = MyModel().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Create a GradScaler for mixed precision
scaler = amp.GradScaler()

# Training loop with mixed precision
for inputs, targets in dataloader:
    inputs = inputs.to(device)
    targets = targets.to(device)

    # Forward pass with autocast
    with amp.autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Backward and optimize with gradient scaling
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
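On Ampere and newer GPUs, bfloat16 is an alternative autocast dtype; because bf16 keeps FP32's exponent range, training usually needs no GradScaler. A sketch, assuming PyTorch 1.10+ for the torch.autocast entry point:

# bfloat16 autocast: no gradient scaling required in the common case
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()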
3. JIT Compilation for CUDA Code
Using TorchScript to compile models:
# Define a simple model
class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(100, 10)

    def forward(self, x):
        return self.linear(x)

# Create and trace the model
model = MyModel().to(device)
example_input = torch.rand(32, 100, device=device)
traced_model = torch.jit.trace(model, example_input)

# Save the compiled model
traced_model.save("traced_model.pt")

# Compare performance
def benchmark_model(model, input_tensor, iterations=1000):
    torch.cuda.synchronize()  # Start timing from a clean point
    start = time.time()
    for _ in range(iterations):
        _ = model(input_tensor)
    torch.cuda.synchronize()  # Wait for all CUDA operations to finish
    return time.time() - start

regular_time = benchmark_model(model, example_input)
jit_time = benchmark_model(traced_model, example_input)
print(f"Regular model time: {regular_time:.4f} seconds")
print(f"JIT model time: {jit_time:.4f} seconds")
print(f"Speedup: {regular_time/jit_time:.2f}x")
Real-World Application: Optimizing a CNN for Image Classification
Let's put everything together in a real-world example: optimizing a convolutional neural network for image classification.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import time
from torch.cuda.amp import autocast, GradScaler

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define transforms with augmentation and normalization
transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Load the CIFAR-10 dataset
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                          shuffle=True, num_workers=2,
                                          pin_memory=True)  # pin_memory for faster GPU transfers

# Define a simple CNN model
class OptimizedCNN(nn.Module):
    def __init__(self):
        super(OptimizedCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = self.pool(self.relu(self.conv3(x)))
        x = x.view(-1, 128 * 4 * 4)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Create the model and move it to the GPU
model = OptimizedCNN().to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Create a gradient scaler for mixed precision training
scaler = GradScaler()

# Training function with optimizations
def train_optimized(epochs=5):
    model.train()
    start_time = time.time()
    for epoch in range(epochs):
        running_loss = 0.0
        epoch_start = time.time()
        for i, data in enumerate(trainloader):
            inputs = data[0].to(device, non_blocking=True)
            labels = data[1].to(device, non_blocking=True)

            # Zero the parameter gradients
            optimizer.zero_grad()

            # Forward pass with mixed precision
            with autocast():
                outputs = model(inputs)
                loss = criterion(outputs, labels)

            # Backward and optimize with gradient scaling
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

            running_loss += loss.item()
            if i % 100 == 99:
                print(f'Epoch {epoch+1}, Batch {i+1}: Loss: {running_loss/100:.3f}')
                running_loss = 0.0
        epoch_time = time.time() - epoch_start
        print(f'Epoch {epoch+1} completed in {epoch_time:.2f} seconds')
    total_time = time.time() - start_time
    print(f'Training completed in {total_time:.2f} seconds')

# Train the model
train_optimized(epochs=3)

# Save the optimized model
torch.save(model.state_dict(), "optimized_cnn.pth")

# Create a JIT-compiled version for inference
model.eval()  # switch to eval mode before tracing for inference
example_input = torch.rand(1, 3, 32, 32, device=device)
traced_model = torch.jit.trace(model, example_input)
traced_model.save("optimized_cnn_jit.pth")
Memory Management Best Practices
Proper memory management is crucial for training large models:
# Free unused memory
torch.cuda.empty_cache()

# Check memory usage before and after operations
print_gpu_memory()
large_computation = torch.randn(10000, 10000, device=device) @ torch.randn(10000, 10000, device=device)
print_gpu_memory()
del large_computation
torch.cuda.empty_cache()
print_gpu_memory()

# Use gradient checkpointing for large models: activations are recomputed
# during the backward pass instead of being stored
from torch.utils.checkpoint import checkpoint

def run_model_with_checkpointing(model, input_tensor):
    # Break computation into smaller pieces to reduce memory usage
    def custom_forward(x):
        return model(x)
    return checkpoint(custom_forward, input_tensor)
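For models built as an nn.Sequential, checkpoint_sequential offers a ready-made variant that splits the model into segments and checkpoints each one. A minimal sketch with a hypothetical toy model:

from torch.utils.checkpoint import checkpoint_sequential

# Toy sequential model for illustration
seq_model = nn.Sequential(
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 10),
).to(device)
x = torch.rand(32, 100, device=device, requires_grad=True)
out = checkpoint_sequential(seq_model, 2, x)  # 2 segments; activations recomputed in backward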
Profiling Your CUDA Code
PyTorch provides tools to profile and analyze your code:
# Basic profiling
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    model(example_input)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# More detailed profiling with a schedule and Chrome trace export
from torch.profiler import profile, record_function, ProfilerActivity

def trace_handler(p):
    output = p.key_averages().table(sort_by="cuda_time_total", row_limit=10)
    print(output)
    p.export_chrome_trace("trace.json")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=2),
    on_trace_ready=trace_handler
) as p:
    for idx, (inputs, labels) in enumerate(trainloader):
        if idx >= 4:
            break
        inputs, labels = inputs.to(device), labels.to(device)
        with record_function("model_inference"):
            model(inputs)
        p.step()
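The same profiler can also attribute memory to individual operators, which pairs nicely with the memory tools above. A sketch:

# Profile per-operator memory usage
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as p:
    model(example_input)
print(p.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=5))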
Multi-GPU Training
For even faster training, you can use multiple GPUs:
# DataParallel training (simplest approach, single process)
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model = nn.DataParallel(model)
# DistributedDataParallel (more advanced, one process per GPU)
import os
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel

def setup(rank, world_size):
    # Tell each process where to find the rendezvous point
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    # Initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train_ddp(rank, world_size):
    setup(rank, world_size)
    # Create the model and move it to the GPU with id `rank`
    model = OptimizedCNN().to(rank)
    ddp_model = DistributedDataParallel(model, device_ids=[rank])
    # Training code here
    # ...
    cleanup()

# Start one process per GPU
if torch.cuda.device_count() > 1:
    world_size = torch.cuda.device_count()
    mp.spawn(train_ddp, args=(world_size,), nprocs=world_size, join=True)
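One missing piece in the skeleton above ("Training code here") is data sharding: with DDP each process should see a different slice of the dataset, which DistributedSampler provides. A sketch of what could go inside train_ddp:

from torch.utils.data.distributed import DistributedSampler

# Inside train_ddp: give each rank its own shard of the data
sampler = DistributedSampler(trainset, num_replicas=world_size, rank=rank)
loader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                     sampler=sampler, num_workers=2,
                                     pin_memory=True)
for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for inputs, labels in loader:
        ...  # forward/backward with ddp_model as usual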
Summary
Optimizing PyTorch code for CUDA can drastically improve the performance of your deep learning models. In this guide, we've covered key aspects:
- Basic CUDA operations and device management
- Avoiding common pitfalls like excessive CPU-GPU transfers
- Advanced optimization techniques, including:
  - Mixed precision training
  - Asynchronous data loading
  - Memory management
  - JIT compilation
- Multi-GPU training
- Profiling and performance analysis
By implementing these optimizations, you can significantly speed up both training and inference time for your models, allowing you to iterate faster and work with larger, more complex architectures.
Additional Resources
- PyTorch Performance Tuning Guide
- NVIDIA CUDA Programming Guide
- PyTorch Profiler Documentation
- Distributed Training with PyTorch
Exercises
- Profile a simple neural network training loop on your GPU and identify the top 3 operations that consume the most time.
- Implement mixed precision training for a model of your choice and measure the speedup.
- Compare the performance of a CNN model with and without pinned memory for the data loader.
- Implement gradient checkpointing for a large model and measure the memory savings.
- If you have access to multiple GPUs, modify the example CNN to train using DistributedDataParallel and measure the scaling efficiency.