
PyTorch Multi-GPU Training

Introduction

Training deep learning models can be computationally intensive and time-consuming. As models grow larger and datasets expand, the need for accelerated training becomes crucial. One effective way to speed up training is by leveraging multiple GPUs.

In this tutorial, we'll explore how to use PyTorch's capabilities for multi-GPU training on a single machine. This approach can significantly reduce training time by distributing the workload across multiple graphics processing units.

Prerequisites

Before diving into multi-GPU training, make sure you have:

  • PyTorch installed (version 1.6 or later; the mixed-precision example relies on torch.cuda.amp)
  • A machine with multiple CUDA-compatible GPUs
  • Basic knowledge of PyTorch and neural networks

You can check your GPU availability with:

python
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

Output:

PyTorch version: 1.9.0
CUDA available: True
Number of GPUs: 4

Multi-GPU Training Approaches

PyTorch offers two main approaches for multi-GPU training:

  1. DataParallel (DP): Simpler to implement but less efficient
  2. DistributedDataParallel (DDP): More complex but offers better performance

Let's explore both methods.

Method 1: Using DataParallel

DataParallel is the simplest way to run your model on multiple GPUs. In each forward pass it works by:

  • Splitting the input batch across the GPUs
  • Replicating the model on each GPU
  • Running the replicas in parallel and gathering the outputs on the primary GPU (gradients are likewise reduced there during the backward pass)

Basic Implementation

Here's how to implement a model with DataParallel:

python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.layers(x)

# Create model instance
model = SimpleModel()

# Check if multiple GPUs are available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    # Wrap the model with DataParallel
    model = nn.DataParallel(model)

# Move model to GPU
model.to('cuda')
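
Once wrapped, the model is used exactly like a regular module. A quick sanity check (the random batch below is purely illustrative):

python
# A batch of 64 flattened 28x28 images; DataParallel splits it across the GPUs
x = torch.randn(64, 784).to('cuda')
out = model(x)
print(out.shape)  # torch.Size([64, 10]) - outputs gathered on the primary GPU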

Complete Training Example

Let's implement a complete training loop using DataParallel:

python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load dataset (MNIST as an example)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

# Create data loader with multiple workers for efficient data loading
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=256,  # Large batch size for multi-GPU
    shuffle=True,
    num_workers=4  # Parallelize data loading
)

# Create model
model = SimpleModel()
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model = nn.DataParallel(model)
model.to(device)

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
def train(epochs):
    for epoch in range(epochs):
        running_loss = 0.0
        for i, (images, labels) in enumerate(train_loader):
            images = images.view(images.shape[0], -1).to(device)  # Flatten images
            labels = labels.to(device)

            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

            if (i + 1) % 100 == 0:
                print(f'Epoch [{epoch+1}/{epochs}], Step [{i+1}/{len(train_loader)}], Loss: {running_loss/100:.4f}')
                running_loss = 0.0

# Train the model
train(epochs=5)
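
One practical note: when saving a DataParallel-wrapped model, it is usually cleaner to save the underlying module so the checkpoint can later be loaded without the wrapper (the filename below is arbitrary):

python
# Unwrap DataParallel before saving so the state_dict keys have no 'module.' prefix
state_dict = model.module.state_dict() if isinstance(model, nn.DataParallel) else model.state_dict()
torch.save(state_dict, "simple_model_dp.pth")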

Limitations of DataParallel

While DataParallel is easy to use, it has several limitations:

  1. Performance bottleneck: Outputs are gathered and gradients are reduced on the primary GPU, which becomes a choke point
  2. GIL bottleneck: DataParallel runs in a single process with multiple threads, so Python's Global Interpreter Lock limits true parallelism
  3. Uneven GPU memory consumption: The first GPU typically uses more memory than the others
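
You can observe the uneven memory usage directly. A minimal sketch, assuming the DataParallel model above has already run a forward/backward pass:

python
# Print per-GPU memory to see the imbalance (GPU 0 is typically the largest)
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**2
    print(f"GPU {i}: {allocated:.1f} MiB allocated")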

Method 2: Using DistributedDataParallel (DDP)

DistributedDataParallel is more efficient than DataParallel because it launches a separate process for each GPU, avoiding the GIL bottleneck. Each process keeps its own model replica and gradients are synchronized with an all-reduce, so memory usage and workload are spread evenly across the GPUs.

Basic DDP Implementation

Here's a script demonstrating DDP:

python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import torchvision.transforms as transforms
import torchvision.datasets as datasets

def setup(rank, world_size):
    """Initialize the distributed environment."""
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # Initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    """Clean up the distributed environment."""
    dist.destroy_process_group()

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.layers(x)

def train(rank, world_size):
    # Set up distributed training
    setup(rank, world_size)

    # Create model and move it to this process's GPU
    model = SimpleModel().to(rank)
    # Wrap model with DDP
    ddp_model = DDP(model, device_ids=[rank])

    # Data loading
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])

    dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)

    # Use DistributedSampler to partition the dataset across processes
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)

    dataloader = DataLoader(dataset, batch_size=64, sampler=sampler)

    # Define loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(ddp_model.parameters(), lr=0.001)

    # Training loop
    for epoch in range(5):
        # Important: set the epoch for the sampler so shuffling differs each epoch
        sampler.set_epoch(epoch)

        running_loss = 0.0
        for i, (images, labels) in enumerate(dataloader):
            images = images.view(images.shape[0], -1).to(rank)  # Flatten
            labels = labels.to(rank)

            # Forward pass
            outputs = ddp_model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

            if (i + 1) % 100 == 0 and rank == 0:
                print(f'Rank {rank}, Epoch {epoch+1}, Batch {i+1}, Loss: {running_loss/100:.4f}')
                running_loss = 0.0

    cleanup()

def main():
    # Number of GPUs available
    world_size = torch.cuda.device_count()
    # Use multiprocessing to launch one training process per GPU
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()

To run this script, you would execute:

bash
python ddp_script.py
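
While the script is running, you can confirm that every GPU is busy with nvidia-smi (see the best practices below):

bash
watch -n 1 nvidia-smi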

Real-World Example: Training ResNet on ImageNet

Let's use DDP to train a ResNet model on a subset of ImageNet. This example demonstrates a more practical use case for multi-GPU training.

python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import torchvision.transforms as transforms
import torchvision.models as models
import torchvision.datasets as datasets

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train_resnet(rank, world_size):
    setup(rank, world_size)

    # Create model
    model = models.resnet50(pretrained=False)
    model = model.to(rank)
    model = DDP(model, device_ids=[rank])

    # Data transforms
    transform = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])

    # Use a small subset of ImageNet for demonstration
    # In a real scenario, you would use the full ImageNet dataset
    dataset = datasets.ImageFolder(
        '/path/to/imagenet/train',  # Replace with your dataset path
        transform=transform
    )

    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=4)

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    # Training loop
    num_epochs = 2  # Just for demonstration
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Important for proper shuffling

        running_loss = 0.0
        model.train()

        for i, (images, labels) in enumerate(dataloader):
            images = images.to(rank)
            labels = labels.to(rank)

            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

            if i % 20 == 19 and rank == 0:
                print(f'Epoch {epoch+1}, Batch {i+1}, Loss: {running_loss/20:.4f}')
                running_loss = 0.0

    # Save model (only on rank 0)
    if rank == 0:
        torch.save(model.module.state_dict(), "resnet50_distributed.pth")

    cleanup()

def main():
    world_size = torch.cuda.device_count()
    mp.spawn(train_resnet, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()

Performance Comparison: DataParallel vs. DistributedDataParallel

Let's compare the performance of both methods:

DataParallel
  • Pros: Easy to implement; minimal code changes
  • Cons: Less efficient; GIL bottleneck; uneven memory usage
  • Use cases: Quick prototyping; small models

DistributedDataParallel
  • Pros: Better performance; scales better; evenly distributed workload
  • Cons: More complex setup; requires process management
  • Use cases: Production training; large-scale models

Best Practices for Multi-GPU Training

  1. Use DistributedDataParallel when possible - It's more efficient and scales better
  2. Increase batch size - With multiple GPUs, you can often increase the batch size proportionally
  3. Adjust learning rate - Larger batch sizes may require learning rate adjustments (a common heuristic is to scale the learning rate linearly with the batch size)
  4. Use gradient accumulation - If memory is a constraint (see the sketch after this list)
  5. Mixed precision training - Consider using FP16 training with PyTorch AMP for even more speedup
  6. Monitor GPU utilization - Use nvidia-smi to make sure your GPUs are being properly utilized
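
Here is the gradient-accumulation sketch referenced in item 4. It assumes the DDP training loop from earlier (ddp_model, criterion, optimizer, dataloader, and rank); accumulation_steps is an illustrative name:

python
accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps

optimizer.zero_grad()
for i, (images, labels) in enumerate(dataloader):
    images = images.view(images.shape[0], -1).to(rank)
    labels = labels.to(rank)

    outputs = ddp_model(images)
    # Scale the loss so the accumulated gradients match a single large batch
    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()  # gradients accumulate across iterations

    # Step the optimizer only every accumulation_steps batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

With DDP you can additionally wrap the intermediate backward passes in ddp_model.no_sync() to skip redundant gradient synchronization; the simple version above is still correct, just less efficient.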

Implementing Mixed Precision with Multi-GPU Training

For even faster training, you can combine multi-GPU with mixed precision:

python
import torch.cuda.amp as amp

def train_with_amp(rank, world_size):
    setup(rank, world_size)

    model = SimpleModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Set up the sampler, dataloader, criterion, and optimizer as in the DDP example

    # Create GradScaler for mixed precision training
    scaler = amp.GradScaler()

    # Training loop
    for epoch in range(5):
        sampler.set_epoch(epoch)

        for images, labels in dataloader:
            images = images.view(images.shape[0], -1).to(rank)
            labels = labels.to(rank)

            # Forward pass with autocast (runs eligible ops in FP16)
            with amp.autocast():
                outputs = ddp_model(images)
                loss = criterion(outputs, labels)

            # Backward and optimize with the scaler to avoid FP16 gradient underflow
            optimizer.zero_grad()
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

    cleanup()
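
As with the earlier DDP script, this function is launched with one process per GPU; a minimal launcher, mirroring the main() used above:

python
def main():
    world_size = torch.cuda.device_count()
    mp.spawn(train_with_amp, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()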

Summary

In this tutorial, we've learned:

  1. Basic multi-GPU approaches in PyTorch: DataParallel and DistributedDataParallel
  2. How to implement DataParallel for simple multi-GPU training
  3. Advanced DistributedDataParallel setup for more efficient training
  4. Best practices for multi-GPU training
  5. Real-world examples of training on multiple GPUs

Multi-GPU training is essential for scaling up deep learning models effectively. While DataParallel is easier to implement, DistributedDataParallel offers better performance and is the recommended approach for serious deep learning work.

Exercises

  1. Modify the DDP example to train a ResNet-18 model on the CIFAR-10 dataset
  2. Implement a custom distributed training script that logs GPU memory usage during training
  3. Experiment with different batch sizes and compare training times
  4. Implement a distributed validation function that evaluates the model across multiple GPUs
  5. Try implementing both approaches (DP and DDP) and benchmark their performance on your specific hardware

Happy distributed training!


