PyTorch Common Bugs

When working with PyTorch, encountering bugs and errors is part of the learning process. This guide will help you identify, understand, and fix common PyTorch bugs that beginners often face.

Introduction

PyTorch is a powerful library for deep learning, but like any complex tool, it comes with its share of pitfalls. Understanding common bugs will help you:

Debug your code more efficiently
Prevent issues before they happen
Gain deeper insights into how PyTorch works

Let's explore the most common PyTorch bugs and their solutions.

1. Tensor Shape Mismatches

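Shape mismatches are among the most common errors in PyTorch. Matrix multiplication, for example, requires the inner dimensions to agree: a (3, 4) matrix can only be multiplied by a matrix with 4 rows.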

Example:

python
import torch

# Attempting to multiply matrices with incompatible shapes
matrix1 = torch.randn(3, 4)
matrix2 = torch.randn(5, 6)

try:
    result = torch.matmul(matrix1, matrix2)
except RuntimeError as e:
    print(f"Error: {e}")

Output:

Error: mat1 and mat2 shapes cannot be multiplied (3x4 and 5x6)

Solution:

Always check tensor shapes before operations. Use .shape to inspect tensors:

python
import torch

matrix1 = torch.randn(3, 4)
matrix2 = torch.randn(4, 6)  # Corrected shape

print(f"Matrix 1 shape: {matrix1.shape}")
print(f"Matrix 2 shape: {matrix2.shape}")

# Now matrices can be multiplied
result = torch.matmul(matrix1, matrix2)
print(f"Result shape: {result.shape}")

Output:

Matrix 1 shape: torch.Size([3, 4])
Matrix 2 shape: torch.Size([4, 6])
Result shape: torch.Size([3, 6])
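
A small guard can make shape errors easier to read than the default message. The sketch below assumes a hypothetical helper named `checked_matmul` (it is not part of PyTorch) that validates the inner dimensions of two 2-D tensors before calling `torch.matmul`:

```python
import torch

def checked_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: multiply two 2-D tensors after validating shapes."""
    if a.dim() != 2 or b.dim() != 2:
        raise ValueError(f"expected 2-D tensors, got {a.dim()}-D and {b.dim()}-D")
    if a.shape[1] != b.shape[0]:
        raise ValueError(
            f"inner dimensions differ: {tuple(a.shape)} vs {tuple(b.shape)}"
        )
    return torch.matmul(a, b)

result = checked_matmul(torch.randn(3, 4), torch.randn(4, 6))
print(result.shape)  # torch.Size([3, 6])
```

Raising early with both shapes in the message is a design choice for readability; the built-in error carries the same information.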

2. CUDA Out of Memory Errors

GPU memory management is critical when training deep learning models.

Example:

python
import torch

# This might cause memory issues with large dimensions
try:
    # Create an extremely large tensor
    huge_tensor = torch.randn(50000, 50000, device="cuda" if torch.cuda.is_available() else "cpu")
    print("Tensor created successfully")
except RuntimeError as e:
    print(f"Error: {e}")

Output (on a GPU with 8 GiB of memory; the exact wording varies between PyTorch versions):

Error: CUDA out of memory. Tried to allocate 9.31 GiB (GPU 0; 8.00 GiB total capacity; 2.43 GiB already allocated; 5.53 GiB free; 2.44 GiB reserved in total by PyTorch)
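
The 9.31 GiB in the message follows directly from the tensor's size: 50000 × 50000 float32 elements at 4 bytes each. A rough estimate like the one below can be made before allocating. The helper name `tensor_gib` is an illustrative assumption, and the calculation ignores allocator overhead:

```python
def tensor_gib(shape, bytes_per_element=4):
    """Estimate the memory a dense tensor needs, in GiB (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / 2**30

print(f"{tensor_gib((50000, 50000)):.2f} GiB")  # 9.31 GiB
```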

Solutions:

Reduce batch size:

python
# Instead of batch_size = 128, try:
batch_size = 32

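Release cached memory that is no longer referenced:

python
import gc
import torch

# After heavy computations: collect unreachable Python objects first,
# then return the now-unused cached blocks to the GPU driver.
# Memory held by tensors that are still referenced is not freed.
gc.collect()
torch.cuda.empty_cache()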
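
To see whether memory is actually being released, it can help to print the allocator's counters before and after a suspicious step. This is a minimal sketch using `torch.cuda.memory_allocated` and `torch.cuda.memory_reserved`; the helper name `gpu_memory_report` is an assumption made for illustration:

```python
import torch

def gpu_memory_report():
    """Return allocated and reserved GPU memory in MiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    return {
        "allocated_mib": torch.cuda.memory_allocated() / 2**20,  # Held by live tensors
        "reserved_mib": torch.cuda.memory_reserved() / 2**20,    # Held by the caching allocator
    }

print(gpu_memory_report())
```

If the allocated figure keeps growing across iterations, something is probably still holding references to old tensors.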

Use gradient checkpointing for large models:

python
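from torch.utils.checkpoint import checkpoint

# Instead of:
# output = model(inputs)

# Use (activations are recomputed during backward instead of being stored;
# use_reentrant=False is the variant recommended in recent PyTorch releases):
output = checkpoint(model, inputs, use_reentrant=False)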
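
A slightly fuller sketch shows that checkpointing changes memory behavior, not results. The two-block model and its sizes are arbitrary assumptions chosen for illustration, and `use_reentrant=False` assumes a reasonably recent PyTorch release:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Illustrative two-stage model; the sizes are arbitrary.
block1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
block2 = nn.Linear(32, 1)
inputs = torch.randn(4, 16)

# Regular forward pass: block1's activations are stored for backward.
plain = block2(block1(inputs))

# Checkpointed forward pass: block1's activations are recomputed during backward.
hidden = checkpoint(block1, inputs, use_reentrant=False)
checkpointed = block2(hidden)
checkpointed.sum().backward()

print(torch.allclose(plain, checkpointed))  # True
print(block1[0].weight.grad is not None)    # True
```

The trade-off is extra compute during the backward pass in exchange for lower peak memory.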

3. Gradient Issues

3.1 Forgetting to Zero Gradients

By default, each call to backward() adds the new gradients to whatever is already stored in .grad, so skipping the reset mixes gradients from different batches.

python
import torch

# Create a simple model and optimizer
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

# Training loop with a bug
for _ in range(3):
    # Missing optimizer.zero_grad() here
    inputs = torch.randn(5, 10)
    targets = torch.randn(5, 1)
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    print(f"Gradients for weight: {model.weight.grad[0][:3]}")  # Gradients accumulate
    optimizer.step()

Output:

Gradients for weight: tensor([0.2946, 0.1635, 0.1798])
Gradients for weight: tensor([0.4821, 0.3716, 0.5601])  # Notice these accumulate
Gradients for weight: tensor([0.7501, 0.4890, 0.9320])  # And keep growing

Solution:

Zero the gradients at the start of every iteration, before the backward pass:

python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

# Correct training loop
for _ in range(3):
    optimizer.zero_grad()  # Zero gradients before backward pass
    inputs = torch.randn(5, 10)
    targets = torch.randn(5, 1)
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    print(f"Gradients for weight: {model.weight.grad[0][:3]}")  # Fresh gradients each time
    optimizer.step()

Output:

Gradients for weight: tensor([0.2946, 0.1635, 0.1798])
Gradients for weight: tensor([0.1875, 0.2081, 0.3803])  # Different values each time
Gradients for weight: tensor([0.2680, 0.1174, 0.3719])
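
Because the loops above draw new random data on every iteration, the accumulation is hard to see in the printed numbers. The following sketch reuses one fixed batch so the effect can be checked exactly; the seed and tensor sizes are arbitrary:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
inputs = torch.randn(5, 10)
targets = torch.randn(5, 1)
criterion = torch.nn.MSELoss()

# First backward pass on a fixed batch
criterion(model(inputs), targets).backward()
first = model.weight.grad.clone()

# Second backward pass on the same batch without zeroing: gradients add up
criterion(model(inputs), targets).backward()
accumulated = model.weight.grad.clone()
print(torch.allclose(accumulated, 2 * first))  # True

# After zeroing, the same batch reproduces the original gradient
model.zero_grad()
criterion(model(inputs), targets).backward()
print(torch.allclose(model.weight.grad, first))  # True
```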

3.2 Forgetting `requires_grad=True`

python
import torch

# Create trainable and non-trainable tensors
trainable = torch.randn(3, requires_grad=True)
non_trainable = torch.randn(3)  # Missing requires_grad=True

print(f"Trainable requires grad: {trainable.requires_grad}")
print(f"Non-trainable requires grad: {non_trainable.requires_grad}")

# Try to compute gradients for both
result1 = trainable.sum()
result1.backward()
print(f"Gradient exists for trainable: {trainable.grad is not None}")

try:
    result2 = non_trainable.sum()
    result2.backward()
    print(f"Gradient exists for non-trainable: {non_trainable.grad is not None}")
except RuntimeError as e:
    print(f"Error: {e}")

Output:

Trainable requires grad: True
Non-trainable requires grad: False
Gradient exists for trainable: True
Error: element 0 of tensors does not require grad and does not have a grad_fn
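
One way to fix this is to enable gradient tracking on the existing tensor with `requires_grad_()`, which modifies the tensor in place. A minimal sketch:

```python
import torch

non_trainable = torch.randn(3)       # requires_grad defaults to False
non_trainable.requires_grad_(True)   # Enable gradient tracking in place

non_trainable.sum().backward()
print(non_trainable.grad)  # tensor([1., 1., 1.])
```

Parameters created through `torch.nn` layers already have `requires_grad=True`, so this mostly matters for tensors created by hand.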

4. Data Type Mismatches

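Element-wise operations such as addition promote mixed dtypes automatically, but matrix multiplication and most layers require both operands to share a dtype.

python
import torch

# Create tensors with different dtypes
float_tensor = torch.randn(3, 3)  # Default is float32
double_tensor = torch.randn(3, 3, dtype=torch.float64)

try:
    # matmul does not promote dtypes, so this raises a RuntimeError
    result = torch.matmul(float_tensor, double_tensor)
except RuntimeError as e:
    print(f"Error: {e}")

# Solution
result = torch.matmul(float_tensor, double_tensor.float())  # Convert to matching dtype
print(f"Result dtype: {result.dtype}")

Output (the exact error wording varies between PyTorch versions):

Error: expected scalar type Float but found Double
Result dtype: torch.float32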
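
In practice this mismatch often shows up when float64 data is fed to a layer whose parameters are float32. The sketch below assumes an arbitrary `nn.Linear` layer and casts the input to the dtype of the layer's weights:

```python
import torch

layer = torch.nn.Linear(4, 2)                    # Parameters are float32
batch = torch.randn(8, 4, dtype=torch.float64)   # e.g. data loaded as float64

try:
    layer(batch)
    mismatch_raised = False
except RuntimeError as e:
    mismatch_raised = True
    print(f"Error: {e}")

# Cast the input to the dtype of the layer's parameters
output = layer(batch.to(layer.weight.dtype))
print(output.dtype)  # torch.float32
```

Casting the data rather than calling `layer.double()` keeps the model in float32, which is usually the cheaper option.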

5. Memory Leaks

Memory leaks are subtle but can cause your application to crash after running for a while.

Common causes and solutions:

Keeping references to intermediate tensors:

python
import torch

# Potential leak
def train_with_leak(iterations):
    history = []
    for i in range(iterations):
        # Large intermediate tensor
        tensor = torch.randn(1000, 1000)
        result = tensor.sum()
        history.append(tensor)  # Keeping reference to large tensor
    return history

# Better approach
def train_without_leak(iterations):
    history = []
    for i in range(iterations):
        tensor = torch.randn(1000, 1000)
        result = tensor.sum()
        history.append(result.item())  # Only store the scalar value
    return history

def tensor_bytes(items):
    # sys.getsizeof does not see tensor storage, so count element bytes directly
    return sum(t.element_size() * t.nelement() for t in items if torch.is_tensor(t))

# Compare how much tensor data each history keeps alive
result_leak = train_with_leak(10)
print(f"Tensor data kept alive with leak: {tensor_bytes(result_leak)} bytes")

result_no_leak = train_without_leak(10)
print(f"Tensor data kept alive without leak: {tensor_bytes(result_no_leak)} bytes")

Output:

Tensor data kept alive with leak: 40000000 bytes
Tensor data kept alive without leak: 0 bytes
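
A closely related pattern in training loops is summing loss tensors for logging. Each loss tensor stays attached to its computation graph, so the running total keeps those graphs reachable. The sketch below, with an arbitrary small model, contrasts that with accumulating plain Python floats via `.item()`:

```python
import torch

model = torch.nn.Linear(10, 1)
criterion = torch.nn.MSELoss()

graph_total = 0.0    # Accumulates tensors: each one keeps its graph alive
scalar_total = 0.0   # Accumulates plain Python floats

for _ in range(3):
    loss = criterion(model(torch.randn(5, 10)), torch.randn(5, 1))
    graph_total = graph_total + loss   # Still attached to autograd
    scalar_total += loss.item()        # Detached Python number

print(type(graph_total).__name__, graph_total.requires_grad)  # Tensor True
print(type(scalar_total).__name__)                            # float
```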

6. Model Evaluation Mode Issues

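Forgetting to set your model to evaluation mode during inference can lead to inconsistent results, because some layers behave differently in training mode: dropout randomly zeroes activations, and batch normalization normalizes with the statistics of the current batch.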

python
import torch
import torch.nn as nn

# Create a simple model with dropout and batch normalization
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 10)
        self.dropout = nn.Dropout(0.5)
        self.bn = nn.BatchNorm1d(10)
        
    def forward(self, x):
        x = self.linear(x)
        x = self.dropout(x)
        x = self.bn(x)
        return x

model = SimpleModel()
input_tensor = torch.randn(5, 10)

# Training mode (default)
output1 = model(input_tensor)

# Same input, still in training mode
output2 = model(input_tensor)

# Check if outputs are the same
print(f"Same output in training mode: {torch.all(output1 == output2).item()}")

# Set to evaluation mode
model.eval()
output3 = model(input_tensor)
output4 = model(input_tensor)

# Check if outputs are the same in eval mode
print(f"Same output in eval mode: {torch.all(output3 == output4).item()}")

Output:

Same output in training mode: False
Same output in eval mode: True

Solution:

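Call model.eval() before inference and wrap the forward pass in torch.no_grad(), then switch back with model.train() before resuming training:

python
# For inference
model.eval()  # Dropout is disabled, batch norm uses its running statistics
with torch.no_grad():  # Skip gradient tracking to save memory
    output = model(input_tensor)

# Back to training
model.train()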
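
Since `model.eval()` is a plain method call and not a context manager, it is easy to forget the switch back to training mode. One option is to wrap both steps in a small context manager. The helper name `evaluating` below is hypothetical and not part of PyTorch:

```python
import contextlib
import torch
import torch.nn as nn

@contextlib.contextmanager
def evaluating(model: nn.Module):
    """Hypothetical helper: run a block in eval mode without gradients,
    then restore whatever mode the model was in before."""
    was_training = model.training
    model.eval()
    try:
        with torch.no_grad():
            yield model
    finally:
        model.train(was_training)

model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(0.5))
with evaluating(model) as m:
    output = m(torch.randn(5, 10))
    inside_training = m.training

print(inside_training, model.training, output.requires_grad)  # False True False
```

Restoring the previous mode in a `finally` block means an exception during inference does not leave the model stuck in eval mode.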

7. NaN and Inf Values

NaN (Not a Number) and Inf (Infinity) values can break your training.

python
import torch
import torch.nn as nn

# Create a simple network
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e6)  # Extremely high learning rate
criterion = nn.MSELoss()

# Training data
x = torch.randn(5, 10)
y = torch.randn(5, 1)

def has_non_finite(model):
    # True if any parameter contains NaN or Inf
    return any(not torch.isfinite(p).all() for p in model.parameters())

# Check before training
print(f"NaN/Inf in model parameters before: {has_non_finite(model)}")

# A single step only makes the weights very large; repeated steps overflow float32
for _ in range(20):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# Check after training
print(f"NaN/Inf in model parameters after: {has_non_finite(model)}")

Output:

NaN/Inf in model parameters before: False
NaN/Inf in model parameters after: True

Solutions:

Use gradient clipping:

python
# Before optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
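
To check that clipping has the intended effect, compare the gradient norm before and after the call. In this sketch the loss is scaled up on purpose so the gradients are large; the model, seed and scale factor are arbitrary assumptions:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)

# Scale the loss up on purpose so the gradients are large
loss = 1000.0 * model(torch.randn(5, 10)).pow(2).mean()
loss.backward()

# clip_grad_norm_ rescales gradients in place and returns the norm before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
)

print(f"before: {norm_before.item():.2f}, after: {norm_after.item():.2f}")
```

Logging the returned norm during training is a cheap way to spot exploding gradients before they turn into NaN values.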

Check for NaNs during training:

python
def detect_anomaly_in_tensor(tensor, tensor_name):
    if torch.isnan(tensor).any():
        print(f"NaN detected in {tensor_name}")
        return True
    if torch.isinf(tensor).any():
        print(f"Inf detected in {tensor_name}")
        return True
    return False

# During training
for name, param in model.named_parameters():
    detect_anomaly_in_tensor(param.data, f"Parameter {name}")
    if param.grad is not None:
        detect_anomaly_in_tensor(param.grad, f"Gradient {name}")
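
PyTorch also ships an anomaly detection mode that raises an error when a backward function produces NaN, which helps locate the operation responsible. The sketch below triggers it deliberately with the square root of a negative number; anomaly mode slows execution, so it is usually enabled only while debugging:

```python
import torch

x = torch.tensor([-1.0], requires_grad=True)

try:
    with torch.autograd.detect_anomaly():
        y = torch.sqrt(x).sum()  # sqrt of a negative number yields NaN
        y.backward()             # Anomaly mode raises when a backward function returns NaN
    anomaly_raised = False
except RuntimeError as e:
    anomaly_raised = True
    print(f"Anomaly detected: {e}")
```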

8. CPU/GPU Device Mismatches

python
import torch

# Create tensors on different devices
cpu_tensor = torch.randn(3, 3)
gpu_tensor = torch.randn(3, 3).cuda() if torch.cuda.is_available() else torch.randn(3, 3)

try:
    result = cpu_tensor + gpu_tensor
except RuntimeError as e:
    print(f"Error: {e}")
    
    # Solution
    result = cpu_tensor.to(gpu_tensor.device) + gpu_tensor
    print(f"Result device: {result.device}")

Output (with CUDA available):

Error: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Result device: cuda:0

Best practice:

Always specify the device explicitly and move all tensors to that device:

python
# Define device once
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Create tensors directly on the target device
tensor1 = torch.randn(3, 3, device=device)
tensor2 = torch.randn(3, 3, device=device)

# Move existing tensors to device
tensor3 = torch.randn(3, 3).to(device)

# Model to device
model = torch.nn.Linear(3, 2).to(device)

# All operations work smoothly
result = model(tensor1 + tensor2 + tensor3)
print(f"Result shape: {result.shape}, device: {result.device}")
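
Batches from a data loader are often dictionaries or tuples and not single tensors. The sketch below assumes a hypothetical helper, `to_device`, that moves every tensor in such a nested structure to the chosen device and leaves other values untouched:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def to_device(batch, device):
    """Hypothetical helper: recursively move tensors in lists, tuples and dicts."""
    if torch.is_tensor(batch):
        return batch.to(device)
    if isinstance(batch, dict):
        return {key: to_device(value, device) for key, value in batch.items()}
    if isinstance(batch, (list, tuple)):
        return type(batch)(to_device(item, device) for item in batch)
    return batch  # Leave non-tensor values (ints, strings, ...) untouched

batch = {"inputs": torch.randn(4, 3), "targets": [torch.randn(4, 2)], "id": 7}
moved = to_device(batch, device)
model = torch.nn.Linear(3, 2).to(device)

print(model(moved["inputs"]).device.type == device.type)  # True
```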

Summary

In this guide, we've covered the most common PyTorch bugs and their solutions:

Tensor shape mismatches - Always check tensor dimensions
CUDA memory errors - Manage GPU memory carefully
Gradient issues - Zero gradients and check requires_grad
Data type mismatches - Ensure consistent dtypes
Memory leaks - Avoid keeping unnecessary references
Model evaluation mode - Use model.eval() for inference
NaN and Inf values - Use gradient clipping and monitoring
CPU/GPU device mismatches - Keep tensors on the same device

Debugging PyTorch code becomes easier with practice. Remember to print tensor shapes, devices, and gradients when encountering errors, and you'll be able to solve most issues quickly.

Additional Resources

PyTorch Official Debugging Documentation
PyTorch Forum - Great for getting help with specific bugs
Stack Overflow PyTorch Tag

Exercises

Create a script that intentionally causes each of the bugs mentioned in this guide, then fix them.
Build a function that checks a PyTorch model for potential issues (NaN values, device mismatches, etc.).
Debug a network that is not learning (loss not decreasing) by identifying which of the above issues might be causing the problem.
Create a custom PyTorch error handler that provides more informative messages for common errors.
