PyTorch Memory Optimization

Introduction

When training deep learning models with PyTorch, you'll often encounter memory limitations, especially when working with large models or limited hardware resources. Efficient memory management is crucial for:

  • Training larger models on your existing hardware
  • Increasing batch sizes for better training performance
  • Preventing out-of-memory (OOM) errors during training
  • Improving overall training speed and efficiency

In this guide, we'll explore various techniques to optimize memory usage in PyTorch, ranging from basic approaches to advanced strategies. Whether you're training on a laptop or cloud GPUs, these memory optimization techniques will help you make the most of your available resources.

Understanding PyTorch Memory Usage

Before diving into optimization techniques, it's important to understand how PyTorch manages memory.

Key Memory Consumers in PyTorch

  1. Model Parameters: Weights and biases of your neural network
  2. Activations: Intermediate outputs from each layer (stored for backward pass)
  3. Gradients: Computed during backpropagation
  4. Optimizer States: Additional memory used by optimizers like Adam
  5. Data Batches: Input data and targets loaded in memory
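
As a rough rule of thumb, you can estimate most of the static footprint from the parameter count alone: each float32 parameter takes 4 bytes, its gradient another 4, and Adam keeps two additional float32 state tensors per parameter. Below is a minimal back-of-the-envelope sketch; the helper function and the placeholder model are illustrative, not part of PyTorch, and activations and data batches are deliberately left out:

python
import torch
import torch.nn as nn

def estimate_training_memory_gb(model, optimizer_states_per_param=2):
    # Rough float32 estimate: weights + gradients + optimizer states.
    # optimizer_states_per_param=2 matches Adam (exp_avg and exp_avg_sq).
    n_params = sum(p.numel() for p in model.parameters())
    bytes_per_param = 4 * (1 + 1 + optimizer_states_per_param)
    return n_params * bytes_per_param / 1e9

# Placeholder model, purely for illustration
model = nn.Linear(4096, 4096)
print(f"Estimated parameter/gradient/optimizer memory: {estimate_training_memory_gb(model):.3f} GB")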

Let's start by examining how to monitor memory usage:

python
import torch
import gc

# Check if CUDA is available
if torch.cuda.is_available():
    # Get current GPU memory usage
    print(f"Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"Memory reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

    # Get maximum GPU memory usage
    print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

Output (example):

Memory allocated: 0.25 GB
Memory reserved: 0.50 GB
Max memory allocated: 1.75 GB
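
For a more detailed breakdown of what the caching allocator is holding (active blocks, reserved segments, and so on), torch.cuda.memory_summary() prints a full report:

python
# Print a detailed report from the CUDA caching allocator
if torch.cuda.is_available():
    print(torch.cuda.memory_summary(abbreviated=True))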

Basic Memory Optimization Techniques

1. Release Unused Tensors

Python's garbage collector may not release GPU memory right away, and PyTorch's caching allocator holds on to freed blocks for reuse. Explicitly delete tensors you no longer need, run garbage collection, and then release the cached blocks:

python
# Create a large tensor
large_tensor = torch.randn(10000, 10000, device='cuda')
print(f"Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

# Delete the tensor, collect garbage, then release cached blocks back to the driver
del large_tensor
gc.collect()
torch.cuda.empty_cache()
print(f"Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

Output:

Memory allocated: 0.40 GB
Memory allocated: 0.01 GB

2. Use Appropriate Tensor Types

PyTorch's default tensor type is torch.float32 (32-bit floating point). For many applications, you can use lower precision:

python
# Default 32-bit float (4 bytes per element)
tensor_f32 = torch.randn(1000, 1000, device='cuda')
print(f"Float32 tensor size: {tensor_f32.element_size() * tensor_f32.nelement() / 1e6:.2f} MB")

# 16-bit float (2 bytes per element)
tensor_f16 = torch.randn(1000, 1000, device='cuda', dtype=torch.float16)
print(f"Float16 tensor size: {tensor_f16.element_size() * tensor_f16.nelement() / 1e6:.2f} MB")

# 8-bit integer (1 byte per element)
tensor_i8 = torch.randint(0, 256, (1000, 1000), device='cuda', dtype=torch.uint8)  # high bound is exclusive
print(f"Int8 tensor size: {tensor_i8.element_size() * tensor_i8.nelement() / 1e6:.2f} MB")

Output:

Float32 tensor size: 4.00 MB
Float16 tensor size: 2.00 MB
Int8 tensor size: 1.00 MB

3. Move Operations to CPU When Appropriate

When an intermediate computation would exceed GPU memory, you can stage it on the CPU (accepting slower compute and transfer overhead) and move only the result back:

python
# Large operation on GPU might cause OOM
large_tensor = torch.randn(20000, 20000, device='cuda')

# Move to CPU, process, then move back
cpu_tensor = large_tensor.cpu()
result = cpu_tensor @ cpu_tensor # Matrix multiplication on CPU
result_gpu = result.cuda() # Move back to GPU if needed

# Clean up
del large_tensor, cpu_tensor
torch.cuda.empty_cache()

Intermediate Optimization Techniques

1. Gradient Checkpointing

Gradient checkpointing is a technique that trades computation for memory. Instead of storing all activations, the model recomputes them during backpropagation.

python
import torch.nn as nn
import torch.utils.checkpoint as checkpoint

class CheckpointedModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Create a large model
        self.layers = nn.Sequential(
            *[nn.Sequential(
                nn.Linear(512, 512),
                nn.ReLU()
            ) for _ in range(10)]
        )

    def forward(self, x):
        # Run the stack in 3 checkpointed segments instead of storing every activation
        return checkpoint.checkpoint_sequential(self.layers, 3, x)

# Create model and input
model = CheckpointedModel().cuda()
# With the default checkpoint implementation, at least one input must require grad,
# otherwise gradients will not flow through the checkpointed segments
input_data = torch.randn(128, 512, device='cuda', requires_grad=True)

# Forward and backward pass with checkpointing
output = model(input_data)
output.sum().backward()

This approach can significantly reduce memory usage during training of deep networks, though it will increase computation time.
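
To quantify the trade-off on your own hardware, here is a quick sketch that reuses the checkpointed model above and compares it against a plain stack of the same layers, measuring peak memory for one forward/backward pass:

python
def peak_training_memory_mb(model, x):
    # Reset peak statistics, run one forward/backward pass, and report the peak
    torch.cuda.reset_peak_memory_stats()
    model(x).sum().backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1e6

# Plain (non-checkpointed) stack with the same layer sizes, for comparison
plain_model = nn.Sequential(
    *[nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(10)]
).cuda()

x = torch.randn(128, 512, device='cuda', requires_grad=True)
print(f"Checkpointed peak: {peak_training_memory_mb(model, x):.1f} MB")
print(f"Plain peak:        {peak_training_memory_mb(plain_model, x):.1f} MB")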

2. Mixed Precision Training

Mixed precision training uses a mix of float16 and float32 precision to reduce memory usage while maintaining training stability.

python
import torch.cuda.amp as amp

# Create model and optimizer
model = nn.Sequential(nn.Linear(1000, 1000), nn.ReLU(), nn.Linear(1000, 10)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Create a GradScaler for mixed precision training
scaler = amp.GradScaler()

# Training loop with mixed precision (assumes a dataloader yielding (data, target) batches)
for epoch in range(3):
    for batch_idx, (data, target) in enumerate(dataloader):
        data, target = data.cuda(), target.cuda()

        # Forward pass with autocast
        with amp.autocast():
            output = model(data)
            loss = nn.functional.cross_entropy(output, target)

        # Backward pass with scaled gradients
        optimizer.zero_grad()
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        if batch_idx % 100 == 0:
            print(f"Epoch: {epoch}, Batch: {batch_idx}, Loss: {loss.item()}")

3. Efficient Data Loading

Optimize your data loading pipeline to reduce memory pressure:

python
from torch.utils.data import DataLoader, Dataset

class EfficientDataset(Dataset):
    def __init__(self, data_path):
        self.data_path = data_path
        # Store only metadata, not actual data
        self.metadata = [f"{data_path}/file_{i}.pt" for i in range(1000)]

    def __len__(self):
        return len(self.metadata)

    def __getitem__(self, idx):
        # Load data on-the-fly
        data = torch.load(self.metadata[idx])
        return data

# Create an efficient dataloader
dataset = EfficientDataset("./data_directory")
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,      # Multiprocessing for data loading
    pin_memory=True,    # Faster data transfer to GPU
    prefetch_factor=2   # Prefetch batches
)

Advanced Memory Optimization Techniques

1. Model Sharding and Distributed Training

For extremely large models, you can distribute training across multiple GPUs. DistributedDataParallel (DDP) replicates the model on each GPU and splits the data between them; it is also the foundation that sharded approaches build on (see the sketch after the code below):

python
import os
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # Create model and move it to the current device (assumes LargeModel is defined)
    model = LargeModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Training loop
    # ...

    cleanup()

# Run training on 4 GPUs (the guard is required when using multiprocessing spawn;
# spawn passes the rank as the first argument to train)
if __name__ == "__main__":
    world_size = 4
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
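
Note that DDP replicates the full set of parameters on every GPU, so on its own it does not reduce per-GPU memory. For true sharding of parameters, gradients, and optimizer state, PyTorch's FullyShardedDataParallel (FSDP) can be dropped into the same setup. A minimal sketch, assuming the same placeholder LargeModel and the setup/cleanup helpers above:

python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def train_sharded(rank, world_size):
    setup(rank, world_size)
    torch.cuda.set_device(rank)

    # Parameters, gradients, and optimizer states are sharded across ranks
    model = FSDP(LargeModel().to(rank))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Training loop
    # ...

    cleanup()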

2. Offloading Model Parameters

Offload large model weights to CPU and move them to GPU only when needed:

python
class OffloadModule(nn.Module):
    def __init__(self, layer):
        super().__init__()
        self.layer = layer.cpu()  # Store on CPU

    def forward(self, x):
        # Move layer to the same device as input
        device = x.device
        self.layer = self.layer.to(device)

        # Compute output
        output = self.layer(x)

        # Move layer back to CPU
        self.layer = self.layer.cpu()

        return output

# Example usage
large_layer = nn.Linear(10000, 10000)
efficient_layer = OffloadModule(large_layer)

# Forward pass
input_data = torch.randn(32, 10000, device='cuda')
output = efficient_layer(input_data)  # Layer temporarily moves to GPU

3. Using 16-bit Model Parameters

Convert your entire model to half precision. This halves parameter memory, but pure float16 can be numerically unstable, which is why mixed precision training (above) is usually the safer default:

python
# Convert model to half precision (fp16); assumes MyModel is defined elsewhere
model = MyModel().cuda().half()

# Ensure inputs are also half precision
input_data = torch.randn(32, 512, device='cuda').half()

# Forward pass in half precision
output = model(input_data)

4. Gradient Accumulation

Reduce memory usage by running forward and backward passes on smaller micro-batches and stepping the optimizer only after several of them. This gives a larger effective batch size without the activation memory cost of a large batch:

python
model = LargeModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
accumulation_steps = 4  # Simulate a batch size 4x larger

for epoch in range(epochs):
    for i, (data, target) in enumerate(dataloader):
        data, target = data.cuda(), target.cuda()

        # Forward pass
        output = model(data)
        loss = criterion(output, target) / accumulation_steps  # Normalize loss

        # Backward pass
        loss.backward()

        # Update weights after several backward passes
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

Real-World Application: Training a Large Language Model

Let's combine several memory optimization techniques to train a large transformer model:

python
import torch
import torch.nn as nn
import torch.cuda.amp as amp
import torch.utils.checkpoint as checkpoint

# Define a transformer layer with checkpointing
class MemoryEfficientTransformerLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, nhead)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.GELU(),
            nn.Linear(dim_feedforward, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def _attention_block(self, x):
        return self.attention(x, x, x)[0]

    def _ff_block(self, x):
        return self.feed_forward(x)

    def forward(self, x):
        # Use checkpointing for attention and feedforward
        x = x + checkpoint.checkpoint(self._attention_block, x)
        x = self.norm1(x)
        x = x + checkpoint.checkpoint(self._ff_block, x)
        x = self.norm2(x)
        return x

# Create a memory-efficient transformer model
class MemoryEfficientTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([
            MemoryEfficientTransformerLayer(d_model, nhead, d_model * 4)
            for _ in range(num_layers)
        ])
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        x = self.embedding(x).transpose(0, 1)  # [batch, seq] -> [seq, batch, dim]

        for layer in self.layers:
            x = layer(x)

        x = x.transpose(0, 1)  # [seq, batch, dim] -> [batch, seq, dim]
        return self.classifier(x)

# Training function combining multiple optimization techniques
def train_large_model(model, train_data, epochs=1):
    # Setup for mixed precision training
    scaler = amp.GradScaler()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

    # Gradient accumulation steps
    accumulation_steps = 8

    # Training loop
    model.train()
    for epoch in range(epochs):
        for i, (input_ids, labels) in enumerate(train_data):
            input_ids = input_ids.cuda()
            labels = labels.cuda()

            # Mixed precision forward pass
            with amp.autocast():
                outputs = model(input_ids)
                loss = nn.functional.cross_entropy(
                    outputs.reshape(-1, outputs.size(-1)),
                    labels.reshape(-1)
                ) / accumulation_steps

            # Scale and accumulate gradients
            scaler.scale(loss).backward()

            if (i + 1) % accumulation_steps == 0:
                # Unscale gradients for potential gradient clipping
                scaler.unscale_(optimizer)

                # Gradient clipping
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

                # Update parameters with scaled gradients
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()

            if i % 100 == 0:
                print(f"Epoch {epoch}, Step {i}, Loss: {loss.item() * accumulation_steps}")

# Instantiate model and move to GPU
model = MemoryEfficientTransformer(vocab_size=50000)
model = model.cuda()

# Example usage (assuming train_data is a properly constructed DataLoader)
# train_large_model(model, train_data)

Summary and Best Practices

In this guide, we've covered a wide range of memory optimization techniques for PyTorch:

  1. Basic techniques:

    • Releasing unused tensors with del and torch.cuda.empty_cache()
    • Using appropriate tensor types (float16, int8)
    • Moving operations between CPU and GPU strategically
  2. Intermediate techniques:

    • Gradient checkpointing to trade computation for memory
    • Mixed precision training with torch.cuda.amp
    • Efficient data loading with PyTorch DataLoaders
  3. Advanced techniques:

    • Model sharding across multiple GPUs
    • Parameter offloading between CPU and GPU
    • 16-bit model parameters
    • Gradient accumulation for larger effective batch sizes

Best Practices Checklist:

  • ✅ Monitor your memory usage with torch.cuda.memory_allocated()
  • ✅ Use mixed precision training when possible
  • ✅ Enable gradient checkpointing for deep models
  • ✅ Implement gradient accumulation for larger effective batch sizes
  • ✅ Use proper tensor types based on your needs
  • ✅ Release and empty cache for unused tensors
  • ✅ Consider model parallelism for extremely large models

Additional Resources

For further exploration of PyTorch memory optimization:

  1. PyTorch Memory Management Documentation
  2. Gradient Checkpointing Tutorial
  3. Mixed Precision Training Tutorial
  4. PyTorch Profiler for Memory Analysis

Exercises

  1. Memory Monitoring: Write a context manager that tracks memory usage before and after a particular operation.

  2. Mixed Precision Implementation: Convert an existing model training loop to use mixed precision.

  3. Gradient Accumulation: Implement gradient accumulation in a training pipeline and experiment with different accumulation steps.

  4. Optimized DataLoader: Create a memory-efficient dataset class that streams data from disk rather than loading it all at once.

  5. Memory Profiling: Use PyTorch's profiler to identify memory bottlenecks in a given model and apply appropriate optimizations.

By mastering these memory optimization techniques, you'll be able to train larger, more complex models even with limited hardware resources.


