PyTorch Common Bugs
When working with PyTorch, encountering bugs and errors is part of the learning process. This guide will help you identify, understand, and fix common PyTorch bugs that beginners often face.
Introduction
PyTorch is a powerful library for deep learning, but like any complex tool, it comes with its share of pitfalls. Understanding common bugs will help you:
- Debug your code more efficiently
- Prevent issues before they happen
- Gain deeper insights into how PyTorch works
Let's explore the most common PyTorch bugs and their solutions.
1. Tensor Shape Mismatches
Shape mismatches are perhaps the most common errors in PyTorch.
Example:
import torch
# Attempting to multiply matrices with incompatible shapes
matrix1 = torch.randn(3, 4)
matrix2 = torch.randn(5, 6)
try:
    result = torch.matmul(matrix1, matrix2)
except RuntimeError as e:
    print(f"Error: {e}")
Output:
Error: mat1 and mat2 shapes cannot be multiplied (3x4 and 5x6)
Solution:
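Always check tensor shapes before operations. For matrix multiplication, the number of columns in the first matrix must equal the number of rows in the second. Use .shape to inspect tensors: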
import torch
matrix1 = torch.randn(3, 4)
matrix2 = torch.randn(4, 6) # Corrected shape
print(f"Matrix 1 shape: {matrix1.shape}")
print(f"Matrix 2 shape: {matrix2.shape}")
# Now matrices can be multiplied
result = torch.matmul(matrix1, matrix2)
print(f"Result shape: {result.shape}")
Output:
Matrix 1 shape: torch.Size([3, 4])
Matrix 2 shape: torch.Size([4, 6])
Result shape: torch.Size([3, 6])
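One way to make this check routine is to validate shapes before the operation and raise an error that names both shapes. The sketch below assumes 2-D inputs only; checked_matmul is a hypothetical helper written for illustration, not part of PyTorch.

```python
import torch

def checked_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Multiply two 2-D tensors after validating their inner dimensions.

    Hypothetical helper for illustration; it is not a PyTorch API.
    """
    if a.dim() != 2 or b.dim() != 2:
        raise ValueError(f"expected 2-D tensors, got {tuple(a.shape)} and {tuple(b.shape)}")
    if a.shape[1] != b.shape[0]:
        raise ValueError(
            f"inner dimensions differ: {tuple(a.shape)} @ {tuple(b.shape)} "
            f"({a.shape[1]} != {b.shape[0]})"
        )
    return torch.matmul(a, b)

result = checked_matmul(torch.randn(3, 4), torch.randn(4, 6))
print(result.shape)
```

Failing early with a message that includes both shapes can be easier to trace than an error raised deep inside a model's forward pass.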
2. CUDA Out of Memory Errors
GPU memory management is critical when training deep learning models.
Example:
import torch
# This might cause memory issues with large dimensions
try:
    # Create an extremely large tensor
    huge_tensor = torch.randn(50000, 50000, device="cuda" if torch.cuda.is_available() else "cpu")
    print("Tensor created successfully")
except RuntimeError as e:
    print(f"Error: {e}")
Output (on most GPUs):
Error: CUDA out of memory. Tried to allocate 9.31 GiB (GPU 0; 8.00 GiB total capacity; 2.43 GiB already allocated; 5.53 GiB free; 2.44 GiB reserved in total by PyTorch)
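The 9.31 GiB figure in the message follows from simple arithmetic: 50000 × 50000 elements at 4 bytes each (float32). As a rough sketch, a tensor's size can be estimated before allocating it. The function name estimate_tensor_gib is an assumption made for this example, and the estimate covers only the dense tensor data, not allocator overhead.

```python
import math
import torch

def estimate_tensor_gib(shape, dtype=torch.float32):
    """Return the approximate size in GiB of a dense tensor with this shape and dtype."""
    # Ask PyTorch for the element width instead of hard-coding it
    bytes_per_element = torch.empty((), dtype=dtype).element_size()
    return math.prod(shape) * bytes_per_element / 2**30

size_gib = estimate_tensor_gib((50000, 50000))
print(f"Estimated size: {size_gib:.2f} GiB")  # 50000 * 50000 * 4 bytes / 2**30
```

Comparing such an estimate with the GPU's total capacity shows in advance whether an allocation has any chance of fitting.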
Solutions:
- Reduce batch size:
# Instead of batch_size = 128, try:
batch_size = 32
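- Free unused memory after dropping references:
import torch
import gc
# After heavy computations, delete references to tensors you no longer need (del tensor),
# then collect garbage first so the freed blocks can be released
gc.collect()
torch.cuda.empty_cache() # Releases only unused cached memory; live tensors stay allocated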
- Use gradient checkpointing for large models:
from torch.utils.checkpoint import checkpoint
# Instead of:
# output = model(inputs)
# Use (activations are recomputed during the backward pass to save memory):
output = checkpoint(model, inputs, use_reentrant=False)
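For a runnable picture of gradient checkpointing, here is a minimal sketch on a toy block. The layer sizes and the name block are illustrative assumptions, and the use_reentrant=False argument assumes a reasonably recent PyTorch release that accepts it.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy block standing in for an expensive part of a larger model (illustrative only)
block = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
inputs = torch.randn(8, 16)

# Activations inside `block` are not stored; they are recomputed during backward
outputs = checkpoint(block, inputs, use_reentrant=False)
loss = outputs.sum()
loss.backward()

# Parameters still receive gradients as usual
print(all(p.grad is not None for p in block.parameters()))
```

The trade-off is extra computation in the backward pass in exchange for lower peak memory, so it tends to pay off only for large models.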
3. Gradient Issues
3.1 Forgetting to Zero Gradients
import torch
# Create a simple model and optimizer
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()
# Training loop with a bug
for _ in range(3):
    # Missing optimizer.zero_grad() here
    inputs = torch.randn(5, 10)
    targets = torch.randn(5, 1)
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    print(f"Gradients for weight: {model.weight.grad[0][:3]}") # Gradients accumulate
    optimizer.step()
Output:
Gradients for weight: tensor([0.2946, 0.1635, 0.1798])
Gradients for weight: tensor([0.4821, 0.3716, 0.5601]) # Notice these accumulate
Gradients for weight: tensor([0.7501, 0.4890, 0.9320]) # And keep growing
Solution:
Always zero the gradients before each backward pass:
import torch
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()
# Correct training loop
for _ in range(3):
    optimizer.zero_grad() # Zero gradients before backward pass
    inputs = torch.randn(5, 10)
    targets = torch.randn(5, 1)
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    print(f"Gradients for weight: {model.weight.grad[0][:3]}") # Fresh gradients each time
    optimizer.step()
Output:
Gradients for weight: tensor([0.2946, 0.1635, 0.1798])
Gradients for weight: tensor([0.1875, 0.2081, 0.3803]) # Different values each time
Gradients for weight: tensor([0.2680, 0.1174, 0.3719])
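Because the loops above draw new random data on every iteration, the accumulation is hard to see in the numbers. The following sketch holds the data fixed and skips the optimizer step (both simplifications made for this example), so a second backward pass without zeroing should give exactly twice the first gradient.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
criterion = torch.nn.MSELoss()
inputs = torch.randn(5, 10)
targets = torch.randn(5, 1)

# First backward pass: gradients start empty
criterion(model(inputs), targets).backward()
first_grad = model.weight.grad.clone()

# Second backward pass on the same data without zeroing: gradients are added
criterion(model(inputs), targets).backward()
accumulated_grad = model.weight.grad.clone()
print(torch.allclose(accumulated_grad, 2 * first_grad))

# After zeroing, the same backward pass reproduces the first gradient
model.zero_grad()
criterion(model(inputs), targets).backward()
print(torch.allclose(model.weight.grad, first_grad))
```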
3.2 Forgetting requires_grad=True
import torch
# Create trainable and non-trainable tensors
trainable = torch.randn(3, requires_grad=True)
non_trainable = torch.randn(3) # Missing requires_grad=True
print(f"Trainable requires grad: {trainable.requires_grad}")
print(f"Non-trainable requires grad: {non_trainable.requires_grad}")
# Try to compute gradients for both
result1 = trainable.sum()
result1.backward()
print(f"Gradient exists for trainable: {trainable.grad is not None}")
try:
    result2 = non_trainable.sum()
    result2.backward()
    print(f"Gradient exists for non-trainable: {non_trainable.grad is not None}")
except RuntimeError as e:
    print(f"Error: {e}")
Output:
Trainable requires grad: True
Non-trainable requires grad: False
Gradient exists for trainable: True
Error: element 0 of tensors does not require grad and does not have a grad_fn
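A possible fix, sketched below, is to enable gradient tracking on the existing tensor with requires_grad_() before building the computation. It assumes the tensor is a leaf tensor that you created yourself, as in the example above.

```python
import torch

non_trainable = torch.randn(3)  # requires_grad defaults to False

# Enable gradient tracking in place on the existing leaf tensor
non_trainable.requires_grad_(True)

result = non_trainable.sum()
result.backward()
print(non_trainable.grad)  # the gradient of sum() with respect to each element is 1
```

Passing requires_grad=True when the tensor is created works equally well. Parameters of nn.Module layers already have it set.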
4. Data Type Mismatches
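import torch
# Create tensors with different dtypes
float_tensor = torch.randn(3, 3) # Default is float32
double_tensor = torch.randn(3, 3, dtype=torch.float64)
# Element-wise operations such as float_tensor + double_tensor do not fail:
# type promotion silently returns float64. Matrix multiplication requires matching dtypes.
try:
    result = torch.matmul(float_tensor, double_tensor)
except RuntimeError as e:
    print(f"Error: {e}")
# Solution
result = torch.matmul(float_tensor, double_tensor.float()) # Convert to matching dtype
print(f"Result dtype: {result.dtype}")
Output (the exact error wording varies by PyTorch version):
Error: expected scalar type Float but found Double
Result dtype: torch.float32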
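In practice this mismatch often appears when float64 data is passed to a model whose parameters are float32. As a hedged sketch, the input can be cast to the dtype of the model's parameters before the forward pass. The layer sizes and variable names here are illustrative assumptions.

```python
import torch

model = torch.nn.Linear(4, 2)                   # parameters are float32 by default
batch = torch.randn(8, 4, dtype=torch.float64)  # e.g. data that was loaded as float64

# Cast the input to the dtype of the model's parameters before the forward pass
param_dtype = next(model.parameters()).dtype
output = model(batch.to(param_dtype))
print(output.dtype)
```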
5. Memory Leaks
Memory leaks are subtle but can cause your application to crash after running for a while.
Common causes and solutions:
Keeping references to intermediate tensors:
import sys
import torch
# Potential leak
def train_with_leak(iterations):
    history = []
    for i in range(iterations):
        # Large intermediate tensor
        tensor = torch.randn(1000, 1000)
        result = tensor.sum()
        history.append(tensor) # Keeping reference to large tensor
    return history
# Better approach
def train_without_leak(iterations):
    history = []
    for i in range(iterations):
        tensor = torch.randn(1000, 1000)
        result = tensor.sum()
        history.append(result.item()) # Only store the scalar value
    return history
# Compare the memory retained by each history list.
# sys.getsizeof() does not count tensor storage, so measure tensors with
# element_size() * nelement() instead.
result_leak = train_with_leak(10)
leak_bytes = sum(t.element_size() * t.nelement() for t in result_leak)
print(f"Tensor data retained with leak: {leak_bytes} bytes")
result_no_leak = train_without_leak(10)
no_leak_bytes = sys.getsizeof(result_no_leak) + sum(sys.getsizeof(v) for v in result_no_leak)
print(f"Memory used without leak: {no_leak_bytes} bytes")
Output (the second figure is approximate and depends on the Python version):
Tensor data retained with leak: 40000000 bytes
Memory used without leak: 368 bytes
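A related leak comes from accumulating the loss tensor itself across iterations, which keeps each iteration's computation graph reachable. The sketch below shows the usual remedy of storing loss.item() instead. The model, data, and the name running_loss are assumptions made for this example.

```python
import torch

model = torch.nn.Linear(10, 1)
criterion = torch.nn.MSELoss()

running_loss = 0.0
for _ in range(5):
    inputs = torch.randn(4, 10)
    targets = torch.randn(4, 1)
    loss = criterion(model(inputs), targets)
    # loss.item() returns a Python float, so no graph is kept alive.
    # Writing `running_loss += loss` would instead chain every iteration's graph.
    running_loss += loss.item()

print(type(running_loss).__name__, loss.grad_fn is not None)
```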
6. Model Evaluation Mode Issues
Forgetting to set your model to evaluation mode during inference can lead to inconsistent results.
import torch
import torch.nn as nn
# Create a simple model with dropout and batch normalization
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 10)
        self.dropout = nn.Dropout(0.5)
        self.bn = nn.BatchNorm1d(10)
    def forward(self, x):
        x = self.linear(x)
        x = self.dropout(x)
        x = self.bn(x)
        return x
model = SimpleModel()
input_tensor = torch.randn(5, 10)
# Training mode (default)
output1 = model(input_tensor)
# Same input, still in training mode
output2 = model(input_tensor)
# Check if outputs are the same
print(f"Same output in training mode: {torch.all(output1 == output2).item()}")
# Set to evaluation mode
model.eval()
output3 = model(input_tensor)
output4 = model(input_tensor)
# Check if outputs are the same in eval mode
print(f"Same output in eval mode: {torch.all(output3 == output4).item()}")
Output:
Same output in training mode: False
Same output in eval mode: True
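Solution:
Call model.eval() before inference and wrap the forward pass in torch.no_grad(). Note that model.eval() is a regular method call, not a context manager, so the model stays in evaluation mode until you call model.train():
# For inference
model.eval()
with torch.no_grad():
    output = model(input_tensor)
# Back to training
model.train()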
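If you would rather have the mode switch undone automatically, one option is a small context manager that remembers the previous mode and restores it afterwards. The helper name evaluating is hypothetical and not part of PyTorch; this is a sketch under that assumption.

```python
import contextlib
import torch
import torch.nn as nn

@contextlib.contextmanager
def evaluating(model: nn.Module):
    """Temporarily switch `model` to eval mode with gradients disabled.

    Hypothetical helper for illustration; it is not a PyTorch API.
    """
    was_training = model.training
    model.eval()
    try:
        with torch.no_grad():
            yield model
    finally:
        # Restore whichever mode the model was in before
        model.train(was_training)

net = nn.Sequential(nn.Linear(10, 10), nn.Dropout(0.5))
with evaluating(net):
    mode_inside = net.training
    out = net(torch.randn(5, 10))
print(mode_inside, net.training, out.requires_grad)
```

Restoring the saved mode, rather than always calling model.train(), keeps the helper safe to use on a model that was already in evaluation mode.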
7. NaN and Inf Values
NaN (Not a Number) and Inf (Infinity) values can break your training.
import torch
import torch.nn as nn
# Create a simple network
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e6) # Extremely high learning rate
criterion = nn.MSELoss()
# Training data
x = torch.randn(5, 10)
y = torch.randn(5, 1)
def all_finite(module):
    return all(torch.isfinite(p).all().item() for p in module.parameters())
# Check parameters before training
print(f"All parameters finite before: {all_finite(model)}")
# A single step only makes the weights very large; after several steps they overflow to Inf or NaN
for _ in range(20):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
# Check parameters after training
print(f"All parameters finite after: {all_finite(model)}")
if not all_finite(model):
    print("Found NaN or Inf values in model parameters!")
Output:
All parameters finite before: True
All parameters finite after: False
Found NaN or Inf values in model parameters!
Solutions:
- Use gradient clipping:
# Before optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
- Check for NaNs during training:
def detect_anomaly_in_tensor(tensor, tensor_name):
    if torch.isnan(tensor).any():
        print(f"NaN detected in {tensor_name}")
        return True
    if torch.isinf(tensor).any():
        print(f"Inf detected in {tensor_name}")
        return True
    return False
# During training
for name, param in model.named_parameters():
    detect_anomaly_in_tensor(param.data, f"Parameter {name}")
    if param.grad is not None:
        detect_anomaly_in_tensor(param.grad, f"Gradient {name}")
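To see what gradient clipping does to the numbers, the sketch below compares the total gradient norm before and after the call. The large targets are an assumption chosen only to provoke large gradients. clip_grad_norm_ returns the norm measured before clipping.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x = torch.randn(5, 10)
y = 1000.0 * torch.randn(5, 1)  # large targets to provoke large gradients

loss = nn.MSELoss()(model(x), y)
loss.backward()

# clip_grad_norm_ rescales gradients in place and returns the norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
print(f"before: {norm_before.item():.2f}, after: {norm_after.item():.2f}")
```

Logging the returned norm during training is a cheap way to notice exploding gradients before they turn into NaN values.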
8. CPU/GPU Device Mismatches
import torch
# Create tensors on different devices
cpu_tensor = torch.randn(3, 3)
gpu_tensor = torch.randn(3, 3).cuda() if torch.cuda.is_available() else torch.randn(3, 3)
try:
    result = cpu_tensor + gpu_tensor
except RuntimeError as e:
    print(f"Error: {e}")
# Solution
result = cpu_tensor.to(gpu_tensor.device) + gpu_tensor
print(f"Result device: {result.device}")
Output (with CUDA available):
Error: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Result device: cuda:0
Best practice:
Always specify the device explicitly and move all tensors to that device:
# Define device once
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Create tensors directly on the target device
tensor1 = torch.randn(3, 3, device=device)
tensor2 = torch.randn(3, 3, device=device)
# Move existing tensors to device
tensor3 = torch.randn(3, 3).to(device)
# Model to device
model = torch.nn.Linear(3, 2).to(device)
# All operations work smoothly
result = model(tensor1 + tensor2 + tensor3)
print(f"Result shape: {result.shape}, device: {result.device}")
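Batches from a data loader are often dictionaries or tuples rather than single tensors. A rough sketch of a recursive helper that moves every tensor in such a structure to the chosen device is shown below. The name to_device and the sample batch layout are assumptions for illustration, not a PyTorch API.

```python
import torch

def to_device(batch, device):
    """Recursively move tensors in nested dicts, lists, and tuples to `device`.

    Hypothetical helper for illustration; it is not a PyTorch API.
    """
    if isinstance(batch, torch.Tensor):
        return batch.to(device)
    if isinstance(batch, dict):
        return {key: to_device(value, device) for key, value in batch.items()}
    if isinstance(batch, (list, tuple)):
        return type(batch)(to_device(item, device) for item in batch)
    return batch  # leave non-tensor values (ints, strings, ...) unchanged

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {"inputs": torch.randn(2, 3), "labels": [torch.tensor([0, 1])], "id": 7}
moved = to_device(batch, device)
print(moved["inputs"].device.type == device.type)
```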
Summary
In this guide, we've covered the most common PyTorch bugs and their solutions:
- Tensor shape mismatches - Always check tensor dimensions
- CUDA memory errors - Manage GPU memory carefully
- Gradient issues - Zero gradients and check requires_grad
- Data type mismatches - Ensure consistent dtypes
- Memory leaks - Avoid keeping unnecessary references
- Model evaluation mode - Use model.eval() for inference
- NaN and Inf values - Use gradient clipping and monitoring
- CPU/GPU device mismatches - Keep tensors on the same device
Debugging PyTorch code becomes easier with practice. Remember to print tensor shapes, devices, and gradients when encountering errors, and you'll be able to solve most issues quickly.
Additional Resources
- PyTorch Official Debugging Documentation
- PyTorch Forum - Great for getting help with specific bugs
- Stack Overflow PyTorch Tag
Exercises
- Create a script that intentionally causes each of the bugs mentioned in this guide, then fix them.
- Build a function that checks a PyTorch model for potential issues (NaN values, device mismatches, etc.).
- Debug a network that is not learning (loss not decreasing) by identifying which of the above issues might be causing the problem.
- Create a custom PyTorch error handler that provides more informative messages for common errors.