PyTorch Multi-GPU Training
Introduction
Training deep learning models can be computationally intensive and time-consuming. As models grow larger and datasets expand, the need for accelerated training becomes crucial. One effective way to speed up training is by leveraging multiple GPUs.
In this tutorial, we'll explore how to use PyTorch's capabilities for multi-GPU training on a single machine. This approach can significantly reduce training time by distributing the workload across multiple graphics processing units.
Prerequisites
Before diving into multi-GPU training, make sure you have:
- PyTorch installed (version 1.6 or later; the mixed-precision example below relies on torch.cuda.amp, which was added in 1.6)
- A machine with multiple CUDA-compatible GPUs
- Basic knowledge of PyTorch and neural networks
You can check your GPU availability with:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Number of GPUs: {torch.cuda.device_count()}")
Output:
PyTorch version: 1.9.0
CUDA available: True
Number of GPUs: 4
Multi-GPU Training Approaches
PyTorch offers two main approaches for multi-GPU training:
- DataParallel (DP): Simpler to implement but less efficient
- DistributedDataParallel (DDP): More complex but offers better performance
Let's explore both methods.
Method 1: Using DataParallel
DataParallel is the simplest way to run your model on multiple GPUs. It works by:
- Splitting the input batch across GPUs
- Replicating the model on each GPU
- Gathering the outputs (and reducing gradients) back on the primary GPU
Basic Implementation
Here's how to implement a model with DataParallel:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.layers(x)

# Create model instance
model = SimpleModel()

# Check if multiple GPUs are available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    # Wrap the model with DataParallel
    model = nn.DataParallel(model)

# Move model to GPU
model.to('cuda')
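Once the wrapped model is on the GPU, a single forward pass is enough to see DataParallel in action: the batch is scattered across the available GPUs, each replica processes its slice, and the outputs are gathered back on the default device. A minimal sanity-check sketch (the batch size of 64 and the random input are just for illustration):

# Hypothetical sanity check: one forward pass through the wrapped model
inputs = torch.randn(64, 784).to('cuda')   # dummy batch of flattened 28x28 images
outputs = model(inputs)                    # DataParallel splits the batch across GPUs

# Outputs are gathered back on the default device (usually cuda:0)
print(outputs.shape)    # torch.Size([64, 10])
print(outputs.device)   # cuda:0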
Complete Training Example
Let's implement a complete training loop using DataParallel:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load dataset (MNIST as an example)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

# Create data loader with multiple workers for efficient data loading
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=256,   # Large batch size for multi-GPU
    shuffle=True,
    num_workers=4     # Parallelize data loading
)

# Create model
model = SimpleModel()
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model = nn.DataParallel(model)
model.to(device)
# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
def train(epochs):
    for epoch in range(epochs):
        running_loss = 0.0
        for i, (images, labels) in enumerate(train_loader):
            images = images.view(images.shape[0], -1).to(device)  # Flatten images
            labels = labels.to(device)

            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if (i+1) % 100 == 0:
                print(f'Epoch [{epoch+1}/{epochs}], Step [{i+1}/{len(train_loader)}], Loss: {running_loss/100:.4f}')
                running_loss = 0.0

# Train the model
train(epochs=5)
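One practical detail worth knowing: wrapping a model in nn.DataParallel stores the original model under the .module attribute, so checkpoints are usually saved from that attribute to keep the state dict keys free of a "module." prefix. A minimal sketch (the file name is arbitrary):

# Save the underlying model, not the DataParallel wrapper,
# so the checkpoint can later be loaded into a plain, unwrapped SimpleModel
if isinstance(model, nn.DataParallel):
    torch.save(model.module.state_dict(), "simple_model.pth")
else:
    torch.save(model.state_dict(), "simple_model.pth")

# Loading into an unwrapped model (map_location keeps the load device explicit)
loaded = SimpleModel()
loaded.load_state_dict(torch.load("simple_model.pth", map_location="cpu"))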
Limitations of DataParallel
While DataParallel is easy to use, it has several limitations:
- Performance Bottleneck: All gradient synchronization happens on a single GPU
- GIL Bottleneck: Python's Global Interpreter Lock limits true parallelism
- Uneven GPU Memory Consumption: The first GPU typically uses more memory because outputs are gathered there (see the snippet below)
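To observe this imbalance on your own machine, you can query PyTorch's per-device memory counters during or after a few training steps. A minimal sketch using torch.cuda.memory_allocated and torch.cuda.memory_reserved (the exact numbers will depend on your model and batch size):

# Print how much memory each GPU is currently using for tensors
for gpu_id in range(torch.cuda.device_count()):
    allocated_mb = torch.cuda.memory_allocated(gpu_id) / 1024**2
    reserved_mb = torch.cuda.memory_reserved(gpu_id) / 1024**2
    print(f"GPU {gpu_id}: {allocated_mb:.1f} MB allocated, {reserved_mb:.1f} MB reserved")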
Method 2: Using DistributedDataParallel (DDP)
DistributedDataParallel is more efficient than DataParallel because it creates a separate process for each GPU, avoiding the GIL bottleneck. It also balances memory usage and workload more evenly across GPUs.
Basic DDP Implementation
Here's a script demonstrating DDP:
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import torchvision.transforms as transforms
import torchvision.datasets as datasets
def setup(rank, world_size):
    """Initialize the distributed environment."""
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # Bind this process to its own GPU before creating the process group
    torch.cuda.set_device(rank)

    # Initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    """Clean up the distributed environment."""
    dist.destroy_process_group()

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.layers(x)
def train(rank, world_size):
    # Set up distributed training
    setup(rank, world_size)

    # Create model and move it to this process's GPU
    model = SimpleModel().to(rank)

    # Wrap model with DDP
    ddp_model = DDP(model, device_ids=[rank])

    # Data loading
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)

    # Use DistributedSampler to partition the dataset across processes
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=64, sampler=sampler)

    # Define loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(ddp_model.parameters(), lr=0.001)

    # Training loop
    for epoch in range(5):
        # Important: set the epoch for the sampler so shuffling differs each epoch
        sampler.set_epoch(epoch)

        running_loss = 0.0
        for i, (images, labels) in enumerate(dataloader):
            images = images.view(images.shape[0], -1).to(rank)  # Flatten
            labels = labels.to(rank)

            # Forward pass
            outputs = ddp_model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if (i+1) % 100 == 0 and rank == 0:
                print(f'Rank {rank}, Epoch {epoch+1}, Batch {i+1}, Loss: {running_loss/100:.4f}')
                running_loss = 0.0

    cleanup()
def main():
    # Number of GPUs available
    world_size = torch.cuda.device_count()

    # Use multiprocessing to launch one training process per GPU
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()
To run this script, you would execute:
python ddp_script.py
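Because the script uses mp.spawn, a plain python invocation is enough to launch one process per GPU. An alternative is to let PyTorch's launcher create the processes for you; in that style the script reads its rank from environment variables instead of calling mp.spawn. A hedged sketch of the changes, assuming the torchrun launcher available in recent PyTorch versions (the file name is hypothetical):

# ddp_script_torchrun.py -- hypothetical variant, launched with:
#   torchrun --nproc_per_node=4 ddp_script_torchrun.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # init_process_group picks up MASTER_ADDR/PORT, RANK and WORLD_SIZE from the environment
    dist.init_process_group("nccl")

    # ... build the model, wrap it in DDP(model, device_ids=[local_rank]), train as above ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()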
Real-World Example: Training ResNet on ImageNet
Let's use DDP to train a ResNet model on a subset of ImageNet. This example demonstrates a more practical use case for multi-GPU training.
import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import torchvision.transforms as transforms
import torchvision.models as models
import torchvision.datasets as datasets
def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()
def train_resnet(rank, world_size):
    setup(rank, world_size)

    # Create model
    model = models.resnet50(pretrained=False)
    model = model.to(rank)
    model = DDP(model, device_ids=[rank])

    # Data transforms
    transform = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])

    # Use a small subset of ImageNet for demonstration
    # In a real scenario, you would use the full ImageNet dataset
    dataset = datasets.ImageFolder(
        '/path/to/imagenet/train',  # Replace with your dataset path
        transform=transform
    )

    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=4)

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    # Training loop
    num_epochs = 2  # Just for demonstration
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Important for proper shuffling
        running_loss = 0.0
        model.train()

        for i, (images, labels) in enumerate(dataloader):
            images = images.to(rank)
            labels = labels.to(rank)

            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if i % 20 == 19 and rank == 0:
                print(f'Epoch {epoch+1}, Batch {i+1}, Loss: {running_loss/20:.4f}')
                running_loss = 0.0

    # Save model (only on rank 0)
    if rank == 0:
        torch.save(model.module.state_dict(), "resnet50_distributed.pth")

    cleanup()
def main():
    world_size = torch.cuda.device_count()
    mp.spawn(train_resnet, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()
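Because the checkpoint is saved from model.module, it can be reloaded into a plain (non-DDP) ResNet-50 for evaluation, or into each rank's replica when resuming distributed training. A minimal sketch of the resume path, assuming it runs inside a function that already has rank available; the map_location argument keeps every process from loading rank 0's tensors onto GPU 0:

# Recreate the model on this rank and load the saved weights
model = models.resnet50(pretrained=False).to(rank)
state_dict = torch.load("resnet50_distributed.pth",
                        map_location=f"cuda:{rank}")  # load directly onto this rank's GPU
model.load_state_dict(state_dict)

# Re-wrap with DDP before continuing training
model = DDP(model, device_ids=[rank])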
Performance Comparison: DataParallel vs. DistributedDataParallel
Let's compare the performance of both methods:
| Method | Pros | Cons | Use Cases |
|---|---|---|---|
| DataParallel | Easy to implement; minimal code changes | Less efficient; GIL bottleneck; uneven memory usage | Quick prototyping; small models |
| DistributedDataParallel | Better performance; scales better; evenly distributed workload | More complex setup; requires process management | Production training; large-scale models |
Best Practices for Multi-GPU Training
- Use DistributedDataParallel when possible - It's more efficient and scales better
- Increase batch size - With multiple GPUs, you can often increase the batch size proportionally
- Adjust learning rate - Larger batch sizes may require learning rate adjustments; a common starting point is scaling the rate linearly with the batch size
- Use gradient accumulation - If memory is a constraint, accumulate gradients over several smaller batches (see the sketch after this list)
- Mixed precision training - Consider using FP16 training with PyTorch AMP for even more speedup
- Monitor GPU utilization - Use nvidia-smi to make sure your GPUs are being properly utilized
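Gradient accumulation lets each process use a small per-step batch while optimizing as if the batch were larger: the loss is scaled down and gradients are summed over several forward/backward passes before a single optimizer step. A minimal sketch layered on top of the earlier DDP training loop (accum_steps is an illustrative value):

accum_steps = 4  # effective batch size = batch_size * accum_steps * world_size

optimizer.zero_grad()
for i, (images, labels) in enumerate(dataloader):
    images = images.view(images.shape[0], -1).to(rank)
    labels = labels.to(rank)

    outputs = ddp_model(images)
    # Scale the loss so the accumulated gradient matches one large batch
    loss = criterion(outputs, labels) / accum_steps
    loss.backward()

    # Step only every accum_steps mini-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Note that DDP still synchronizes gradients on every backward call; the ddp_model.no_sync() context manager can skip synchronization on the intermediate steps if that overhead matters.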
Implementing Mixed Precision with Multi-GPU Training
For even faster training, you can combine multi-GPU with mixed precision:
import torch.cuda.amp as amp

def train_with_amp(rank, world_size):
    setup(rank, world_size)

    model = SimpleModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Set up dataloader, sampler, criterion, and optimizer as in the earlier DDP example

    # Create GradScaler for mixed precision training
    scaler = amp.GradScaler()

    # Training loop
    for epoch in range(5):
        sampler.set_epoch(epoch)
        for images, labels in dataloader:
            images = images.view(images.shape[0], -1).to(rank)
            labels = labels.to(rank)

            # Forward pass with autocast (eligible ops run in FP16)
            with amp.autocast():
                outputs = ddp_model(images)
                loss = criterion(outputs, labels)

            # Backward and optimize with the scaler to avoid FP16 gradient underflow
            optimizer.zero_grad()
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

    cleanup()
Summary
In this tutorial, we've learned:
- Basic multi-GPU approaches in PyTorch: DataParallel and DistributedDataParallel
- How to implement DataParallel for simple multi-GPU training
- Advanced DistributedDataParallel setup for more efficient training
- Best practices for multi-GPU training
- Real-world examples of training on multiple GPUs
Multi-GPU training is essential for scaling up deep learning models effectively. While DataParallel is easier to implement, DistributedDataParallel offers better performance and is the recommended approach for serious deep learning work.
Exercises
- Modify the DDP example to train a ResNet-18 model on the CIFAR-10 dataset
- Implement a custom distributed training script that logs GPU memory usage during training
- Experiment with different batch sizes and compare training times
- Implement a distributed validation function that evaluates the model across multiple GPUs
- Try implementing both approaches (DP and DDP) and benchmark their performance on your specific hardware
Happy distributed training!