
PyTorch Multi-GPU Training

Introduction

Training deep learning models can be computationally intensive and time-consuming. As models grow larger and datasets expand, the need for accelerated training becomes crucial. One effective way to speed up training is by leveraging multiple GPUs.

In this tutorial, we'll explore how to use PyTorch's capabilities for multi-GPU training on a single machine. This approach can significantly reduce training time by distributing the workload across multiple graphics processing units.

Prerequisites

Before diving into multi-GPU training, make sure you have:

  • PyTorch installed (version 1.6 or later; the mixed-precision example relies on torch.cuda.amp)
  • A machine with multiple CUDA-compatible GPUs
  • Basic knowledge of PyTorch and neural networks

You can check your GPU availability with:

python
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

Output:

PyTorch version: 1.9.0
CUDA available: True
Number of GPUs: 4

Multi-GPU Training Approaches

PyTorch offers two main approaches for multi-GPU training:

  1. DataParallel (DP): Simpler to implement but less efficient
  2. DistributedDataParallel (DDP): More complex but offers better performance

Let's explore both methods.

Method 1: Using DataParallel

DataParallel is the simplest way to run your model on multiple GPUs. In each forward pass it works by:

  • Splitting the input batch across the GPUs
  • Replicating the model on each GPU
  • Running the replicas in parallel and gathering the outputs on the primary GPU (gradients are likewise reduced there during the backward pass)

Basic Implementation

Here's how to implement a model with DataParallel:

python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.layers(x)

# Create model instance
model = SimpleModel()

# Check if multiple GPUs are available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    # Wrap the model with DataParallel
    model = nn.DataParallel(model)

# Move model to GPU
model.to('cuda')
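
Once wrapped, the model is used exactly like a regular module. A quick sanity check (the random batch below is purely illustrative):

python
# A batch of 64 flattened 28x28 images; DataParallel splits it across the GPUs
x = torch.randn(64, 784).to('cuda')
out = model(x)
print(out.shape)  # torch.Size([64, 10]) - outputs gathered on the primary GPU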

Complete Training Example

Let's implement a complete training loop using DataParallel:

python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load dataset (MNIST as an example)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

# Create data loader with multiple workers for efficient data loading
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=256,  # Large batch size for multi-GPU
    shuffle=True,
    num_workers=4  # Parallelize data loading
)

# Create model
model = SimpleModel()
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model = nn.DataParallel(model)
model.to(device)

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
def train(epochs):
    for epoch in range(epochs):
        running_loss = 0.0
        for i, (images, labels) in enumerate(train_loader):
            images = images.view(images.shape[0], -1).to(device)  # Flatten images
            labels = labels.to(device)

            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

            if (i + 1) % 100 == 0:
                print(f'Epoch [{epoch+1}/{epochs}], Step [{i+1}/{len(train_loader)}], Loss: {running_loss/100:.4f}')
                running_loss = 0.0

# Train the model
train(epochs=5)
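
One practical note: when saving a DataParallel-wrapped model, it is usually cleaner to save the underlying module so the checkpoint can later be loaded without the wrapper (the filename below is arbitrary):

python
# Unwrap DataParallel before saving so the state_dict keys have no 'module.' prefix
state_dict = model.module.state_dict() if isinstance(model, nn.DataParallel) else model.state_dict()
torch.save(state_dict, "simple_model_dp.pth")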

Limitations of DataParallel

While DataParallel is easy to use, it has several limitations:

  1. Performance bottleneck: Outputs are gathered and gradients are reduced on the primary GPU, which becomes a choke point
  2. GIL bottleneck: DataParallel runs in a single process with multiple threads, so Python's Global Interpreter Lock limits true parallelism
  3. Uneven GPU memory consumption: The first GPU typically uses more memory than the others
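
You can observe the uneven memory usage directly. A minimal sketch, assuming the DataParallel model above has already run a forward/backward pass:

python
# Print per-GPU memory to see the imbalance (GPU 0 is typically the largest)
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**2
    print(f"GPU {i}: {allocated:.1f} MiB allocated")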

Method 2: Using DistributedDataParallel (DDP)

DistributedDataParallel is more efficient than DataParallel because it launches a separate process for each GPU, avoiding the GIL bottleneck. Each process keeps its own model replica and gradients are synchronized with an all-reduce, so memory usage and workload are spread evenly across the GPUs.

Basic DDP Implementation

Here's a script demonstrating DDP:

python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import torchvision.transforms as transforms
import torchvision.datasets as datasets

def setup(rank, world_size):
    """Initialize the distributed environment."""
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # Initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    """Clean up the distributed environment."""
    dist.destroy_process_group()

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.layers(x)

def train(rank, world_size):
    # Set up distributed training
    setup(rank, world_size)

    # Create model and move it to this process's GPU
    model = SimpleModel().to(rank)
    # Wrap model with DDP
    ddp_model = DDP(model, device_ids=[rank])

    # Data loading
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])

    dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)

    # Use DistributedSampler to partition the dataset across processes
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)

    dataloader = DataLoader(dataset, batch_size=64, sampler=sampler)

    # Define loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(ddp_model.parameters(), lr=0.001)

    # Training loop
    for epoch in range(5):
        # Important: set the epoch for the sampler so shuffling differs each epoch
        sampler.set_epoch(epoch)

        running_loss = 0.0
        for i, (images, labels) in enumerate(dataloader):
            images = images.view(images.shape[0], -1).to(rank)  # Flatten
            labels = labels.to(rank)

            # Forward pass
            outputs = ddp_model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

            if (i + 1) % 100 == 0 and rank == 0:
                print(f'Rank {rank}, Epoch {epoch+1}, Batch {i+1}, Loss: {running_loss/100:.4f}')
                running_loss = 0.0

    cleanup()

def main():
    # Number of GPUs available
    world_size = torch.cuda.device_count()
    # Use multiprocessing to launch one training process per GPU
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()

To run this script, you would execute:

bash
python ddp_script.py
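
While the script is running, you can confirm that every GPU is busy with nvidia-smi (see the best practices below):

bash
watch -n 1 nvidia-smi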

Real-World Example: Training ResNet on ImageNet

Let's use DDP to train a ResNet model on a subset of ImageNet. This example demonstrates a more practical use case for multi-GPU training.

python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import torchvision.transforms as transforms
import torchvision.models as models
import torchvision.datasets as datasets

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train_resnet(rank, world_size):
    setup(rank, world_size)

    # Create model
    model = models.resnet50(pretrained=False)
    model = model.to(rank)
    model = DDP(model, device_ids=[rank])

    # Data transforms
    transform = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])

    # Use a small subset of ImageNet for demonstration
    # In a real scenario, you would use the full ImageNet dataset
    dataset = datasets.ImageFolder(
        '/path/to/imagenet/train',  # Replace with your dataset path
        transform=transform
    )

    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=4)

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    # Training loop
    num_epochs = 2  # Just for demonstration
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Important for proper shuffling

        running_loss = 0.0
        model.train()

        for i, (images, labels) in enumerate(dataloader):
            images = images.to(rank)
            labels = labels.to(rank)

            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

            if i % 20 == 19 and rank == 0:
                print(f'Epoch {epoch+1}, Batch {i+1}, Loss: {running_loss/20:.4f}')
                running_loss = 0.0

    # Save model (only on rank 0)
    if rank == 0:
        torch.save(model.module.state_dict(), "resnet50_distributed.pth")

    cleanup()

def main():
    world_size = torch.cuda.device_count()
    mp.spawn(train_resnet, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()

Performance Comparison: DataParallel vs. DistributedDataParallel

Let's compare the performance of both methods:

DataParallel
  • Pros: Easy to implement; minimal code changes
  • Cons: Less efficient; GIL bottleneck; uneven memory usage
  • Use cases: Quick prototyping; small models

DistributedDataParallel
  • Pros: Better performance; scales better; evenly distributed workload
  • Cons: More complex setup; requires process management
  • Use cases: Production training; large-scale models

Best Practices for Multi-GPU Training

  1. Use DistributedDataParallel when possible - It's more efficient and scales better
  2. Increase batch size - With multiple GPUs, you can often increase the batch size proportionally
  3. Adjust learning rate - Larger batch sizes may require learning rate adjustments (a common heuristic is to scale the learning rate linearly with the batch size)
  4. Use gradient accumulation - If memory is a constraint (see the sketch after this list)
  5. Mixed precision training - Consider using FP16 training with PyTorch AMP for even more speedup
  6. Monitor GPU utilization - Use nvidia-smi to make sure your GPUs are being properly utilized
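
Here is the gradient-accumulation sketch referenced in item 4. It assumes the DDP training loop from earlier (ddp_model, criterion, optimizer, dataloader, and rank); accumulation_steps is an illustrative name:

python
accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps

optimizer.zero_grad()
for i, (images, labels) in enumerate(dataloader):
    images = images.view(images.shape[0], -1).to(rank)
    labels = labels.to(rank)

    outputs = ddp_model(images)
    # Scale the loss so the accumulated gradients match a single large batch
    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()  # gradients accumulate across iterations

    # Step the optimizer only every accumulation_steps batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

With DDP you can additionally wrap the intermediate backward passes in ddp_model.no_sync() to skip redundant gradient synchronization; the simple version above is still correct, just less efficient.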

Implementing Mixed Precision with Multi-GPU Training

For even faster training, you can combine multi-GPU with mixed precision:

python
import torch.cuda.amp as amp

def train_with_amp(rank, world_size):
    setup(rank, world_size)

    model = SimpleModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Set up the sampler, dataloader, criterion, and optimizer as in the DDP example

    # Create GradScaler for mixed precision training
    scaler = amp.GradScaler()

    # Training loop
    for epoch in range(5):
        sampler.set_epoch(epoch)

        for images, labels in dataloader:
            images = images.view(images.shape[0], -1).to(rank)
            labels = labels.to(rank)

            # Forward pass with autocast (runs eligible ops in FP16)
            with amp.autocast():
                outputs = ddp_model(images)
                loss = criterion(outputs, labels)

            # Backward and optimize with the scaler to avoid FP16 gradient underflow
            optimizer.zero_grad()
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

    cleanup()
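
As with the earlier DDP script, this function is launched with one process per GPU; a minimal launcher, mirroring the main() used above:

python
def main():
    world_size = torch.cuda.device_count()
    mp.spawn(train_with_amp, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()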

Summary

In this tutorial, we've learned:

  1. Basic multi-GPU approaches in PyTorch: DataParallel and DistributedDataParallel
  2. How to implement DataParallel for simple multi-GPU training
  3. Advanced DistributedDataParallel setup for more efficient training
  4. Best practices for multi-GPU training
  5. Real-world examples of training on multiple GPUs

Multi-GPU training is essential for scaling up deep learning models effectively. While DataParallel is easier to implement, DistributedDataParallel offers better performance and is the recommended approach for serious deep learning work.

Exercises

  1. Modify the DDP example to train a ResNet-18 model on the CIFAR-10 dataset
  2. Implement a custom distributed training script that logs GPU memory usage during training
  3. Experiment with different batch sizes and compare training times
  4. Implement a distributed validation function that evaluates the model across multiple GPUs
  5. Try implementing both approaches (DP and DDP) and benchmark their performance on your specific hardware

Happy distributed training!


