PyTorch Tensor Debugging

Debugging is a crucial skill for any PyTorch developer. When working with neural networks, you'll often encounter issues related to tensors - PyTorch's fundamental data structure. This guide will help you understand common tensor problems and provide effective debugging techniques.

Introduction to Tensor Debugging

Tensors are the core data structure in PyTorch, similar to NumPy arrays but with GPU acceleration capabilities. When your deep learning models misbehave, the issue often stems from tensor-related problems:

  • Incorrect shapes or dimensions
  • NaN or infinity values
  • Wrong device placement (CPU vs. GPU)
  • Type mismatches
  • Memory leaks

Learning how to diagnose and fix these issues will save you countless hours of frustration.
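
Of the issues above, type mismatches are the only one not revisited later in this guide, so here is a minimal sketch: a module whose float32 parameters receive a float64 input (the exact error text varies between PyTorch versions).

python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)                      # parameters are float32 by default
x = torch.randn(3, 4, dtype=torch.float64)   # double-precision input

try:
    layer(x)  # float64 input vs. float32 weights raises a RuntimeError
except RuntimeError as e:
    print(f"Error: {e}")

# Fix: cast the input to match the layer's parameters
print(layer(x.float()).dtype)  # torch.float32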

Basic Tensor Inspection

Printing Tensor Properties

The first step in debugging tensors is to examine their basic properties:

python
import torch

# Create a sample tensor
x = torch.randn(3, 4)

# Basic inspection
print(f"Tensor: {x}")
print(f"Shape: {x.shape}")
print(f"Datatype: {x.dtype}")
print(f"Device: {x.device}")

Output:

Tensor: tensor([[ 0.0562, -0.1928,  0.4994,  0.0335],
        [ 1.0492, -0.7965, -0.8460,  0.3251],
        [ 1.3163,  0.6596, -1.5771, -0.0929]])
Shape: torch.Size([3, 4])
Datatype: torch.float32
Device: cpu
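
A few more attributes are often worth printing alongside these; this short sketch continues from the snippet above and uses the same tensor x:

python
# Additional properties that frequently matter when debugging
# (continues from the snippet above: x = torch.randn(3, 4))
print(f"Number of dimensions: {x.ndim}")
print(f"Number of elements: {x.numel()}")
print(f"Requires grad: {x.requires_grad}")
print(f"Contiguous in memory: {x.is_contiguous()}")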

Viewing Tensor Content

For larger tensors, you might want to see only parts of the data:

python
# Create a larger tensor
large_tensor = torch.randn(10, 10)

# View first few elements
print("First 2 rows:")
print(large_tensor[:2, :])

# View specific slice
print("\nCenter of tensor:")
print(large_tensor[4:6, 4:6])

Common Tensor Debugging Issues

1. Shape Mismatch Errors

One of the most common tensor errors occurs when operations expect tensors of specific shapes:

python
# Example of shape mismatch
a = torch.randn(3, 4)
b = torch.randn(5, 4)

try:
    c = a + b  # This will fail
except RuntimeError as e:
    print(f"Error: {e}")

# Fix the shape mismatch
b_resized = torch.randn(3, 4) # Create properly sized tensor
c = a + b_resized
print(f"Fixed shapes: a {a.shape}, b_resized {b_resized.shape}, c {c.shape}")

Output:

Error: The size of tensor a (3) must match the size of tensor b (5) at non-singleton dimension 0
Fixed shapes: a torch.Size([3, 4]), b_resized torch.Size([3, 4]), c torch.Size([3, 4])
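
In real code you usually cannot just allocate a fresh tensor of the right size; the fix is more often a reshape, slicing, or an added singleton dimension so that broadcasting applies. A minimal sketch of the broadcasting case:

python
# Broadcasting a per-row value of shape (3,) across a (3, 4) tensor
a = torch.randn(3, 4)
bias = torch.randn(3)

# a + bias fails: trailing dimensions 4 and 3 do not broadcast
c = a + bias.unsqueeze(1)   # bias becomes shape (3, 1), which broadcasts to (3, 4)
print(f"Broadcast result shape: {c.shape}")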

2. NaN and Infinity Values

NaN (Not a Number) values can silently propagate through your network:

python
# Create a tensor with NaN
x = torch.tensor([1.0, float('nan'), 3.0, float('inf')])
print(f"Tensor with NaN and Inf: {x}")

# Check for NaN or Inf
print(f"Contains NaN: {torch.isnan(x).any()}")
print(f"Contains Inf: {torch.isinf(x).any()}")

# Find positions of NaN values
print(f"NaN positions: {torch.where(torch.isnan(x))}")
print(f"Inf positions: {torch.where(torch.isinf(x))}")

Output:

Tensor with NaN and Inf: tensor([1., nan, 3., inf])
Contains NaN: tensor(True)
Contains Inf: tensor(True)
NaN positions: (tensor([1]),)
Inf positions: (tensor([3]),)
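
Once you have located them, torch.nan_to_num can replace NaN and infinite values in a single call; a minimal sketch, continuing from the tensor x above:

python
# Replace NaN with 0.0 and clamp +/- infinity to chosen finite values
cleaned = torch.nan_to_num(x, nan=0.0, posinf=1e6, neginf=-1e6)
print(cleaned)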

3. Device Placement Issues

When working with both CPU and GPU, tensors must be on the same device for operations:

python
# Check if CUDA is available
cuda_available = torch.cuda.is_available()
print(f"CUDA available: {cuda_available}")

# Create tensors on different devices
a = torch.randn(2, 2)
if cuda_available:
    b = torch.randn(2, 2).cuda()

    try:
        # This will fail - tensors on different devices
        c = a + b
    except RuntimeError as e:
        print(f"Error: {e}")

    # Fix by moving a to the GPU as well
    a_cuda = a.cuda()
    c = a_cuda + b
    print(f"Fixed operation with both tensors on: {a_cuda.device}")

Advanced Debugging Techniques

Using torch.set_printoptions

Control how tensors are displayed for better debugging:

python
# Default printing can truncate large tensors
large_tensor = torch.randn(10, 10)
print("Default print options:")
print(large_tensor)

# Customize print options
torch.set_printoptions(precision=2, sci_mode=False, linewidth=120, edgeitems=3)
print("\nCustomized print options:")
print(large_tensor)
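
Note that print options are global, so it is worth restoring the defaults once you are done debugging (assuming you have not customised them elsewhere):

python
# Restore the default tensor printing behaviour
torch.set_printoptions(profile="default")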

Tracking Gradient Flow

When debugging backward passes in neural networks, check if your gradients are flowing correctly:

python
def debug_gradients():
    x = torch.randn(3, requires_grad=True)
    y = x * 2
    z = y.mean()

    # Before backward
    print("Before backward:")
    print(f"x.grad: {x.grad}")

    # Compute gradients
    z.backward()

    # After backward
    print("After backward:")
    print(f"x.grad: {x.grad}")

    # Check for small gradients that might indicate vanishing gradients
    if x.grad.abs().max() < 1e-5:
        print("Warning: Very small gradients detected!")

debug_gradients()

Output:

Before backward:
x.grad: None
After backward:
x.grad: tensor([0.6667, 0.6667, 0.6667])
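
If a backward pass produces NaN gradients and you cannot tell where, autograd's anomaly mode can point at the operation responsible, at the cost of slower execution. A minimal sketch (the exact error wording varies by PyTorch version):

python
# Anomaly mode makes backward raise as soon as a NaN gradient appears
# and reports which autograd function produced it - use it only while debugging
with torch.autograd.detect_anomaly():
    x = torch.tensor([4.0, -1.0], requires_grad=True)
    y = torch.sqrt(x)        # sqrt of the negative entry is NaN
    try:
        y.sum().backward()   # the sqrt backward produces a NaN gradient here
    except RuntimeError as e:
        print(f"Anomaly detected: {e}")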

Using Hooks for Debugging

Hooks can be attached to tensors to monitor operations during forward or backward passes:

python
def hook_example():
    # Define a hook function
    def print_grad(grad):
        print(f"Gradient in hook: {grad}")

    x = torch.randn(2, 2, requires_grad=True)

    # Register hook
    handle = x.register_hook(print_grad)

    # Forward and backward
    y = x.pow(2).sum()
    y.backward()

    # Remove hook after use
    handle.remove()

hook_example()
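
Tensor hooks like the one above only fire during the backward pass. To watch activations during the forward pass you can attach a forward hook to a module instead; a minimal sketch using a single linear layer:

python
import torch
import torch.nn as nn

def forward_hook_example():
    layer = nn.Linear(4, 2)

    # The hook receives the module, its input tuple, and its output tensor
    def inspect_activation(module, inputs, output):
        print(f"{module.__class__.__name__} output shape: {output.shape}, "
              f"mean: {output.mean().item():.4f}")

    handle = layer.register_forward_hook(inspect_activation)

    layer(torch.randn(5, 4))   # the hook fires here

    handle.remove()            # detach the hook when finished

forward_hook_example()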

Real-World Debugging Examples

Example 1: Fixing a Training Loop

Here's how to debug a common training-loop mistake: forgetting to zero gradients.

python
import torch.nn as nn
import torch.optim as optim

def debug_training_loop():
    # Create a simple model
    model = nn.Linear(10, 1)
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.MSELoss()

    # Sample data
    inputs = torch.randn(5, 10)
    targets = torch.randn(5, 1)

    print("Gradients before any backward pass:")
    for name, param in model.named_parameters():
        print(f"{name} grad: {param.grad}")

    # Problem: not zeroing gradients, so they accumulate across epochs
    for epoch in range(2):
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        # Backward pass without zeroing gradients
        loss.backward()

        print(f"\nEpoch {epoch+1}, gradients before optimizer step:")
        for name, param in model.named_parameters():
            print(f"{name} grad: {param.grad}")

        optimizer.step()

    print("\nFixed version with proper gradient zeroing:")
    # Correct training loop
    for epoch in range(2):
        # Zero gradients first
        optimizer.zero_grad()

        # Forward and backward passes
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()

        # Now gradients won't accumulate
        print(f"Epoch {epoch+1}, gradients with zeroing:")
        for name, param in model.named_parameters():
            print(f"{name} grad magnitude: {param.grad.norm()}")

        optimizer.step()

debug_training_loop()

Example 2: Diagnosing Dimension Issues

Let's debug a common issue when preparing data for a CNN:

python
def debug_cnn_dimensions():
    # Create a mini-batch of images (batch_size, channels, height, width)
    images = torch.randn(4, 3, 32, 32)

    # Attempt to pass to a network that expects different dimensions
    try:
        # Simulating a network expecting (batch_size, height, width, channels) format
        print("Original shape:", images.shape)

        # This would cause dimension errors in many CNNs
        # Purposely create incorrect transposition
        incorrect_format = images.permute(0, 2, 3, 1)
        print("Incorrect format:", incorrect_format.shape)

        # Fix by permuting correctly
        correct_format = incorrect_format.permute(0, 3, 1, 2)
        print("Fixed format:", correct_format.shape)
        print("Matches original:", (correct_format.shape == images.shape))

    except Exception as e:
        print(f"Error: {e}")

debug_cnn_dimensions()

Output:

Original shape: torch.Size([4, 3, 32, 32])
Incorrect format: torch.Size([4, 32, 32, 3])
Fixed format: torch.Size([4, 3, 32, 32])
Matches original: True

Tracking Tensor Memory Usage

Memory issues are common in deep learning. Here's how to track tensor memory:

python
def memory_debugging():
    # Only works if CUDA is available
    if torch.cuda.is_available():
        # Check memory before
        before = torch.cuda.memory_allocated()
        print(f"Memory before: {before / 1e6:.2f} MB")

        # Create a large tensor
        large_tensor = torch.randn(1000, 1000, device='cuda')

        # Check memory after
        after = torch.cuda.memory_allocated()
        print(f"Memory after: {after / 1e6:.2f} MB")
        print(f"Tensor size: {large_tensor.element_size() * large_tensor.nelement() / 1e6:.2f} MB")

        # Clean up to free memory
        del large_tensor
        torch.cuda.empty_cache()

        # Check memory after cleanup
        final = torch.cuda.memory_allocated()
        print(f"Memory after cleanup: {final / 1e6:.2f} MB")
    else:
        print("CUDA not available, cannot demonstrate GPU memory tracking")

    # CPU memory tracking is more complex and requires external packages
    print("For CPU memory tracking, consider using the 'psutil' library")

memory_debugging()
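
As the snippet notes, CPU-side memory can be watched with psutil. A minimal sketch, assuming psutil is installed (pip install psutil):

python
import psutil
import torch

def cpu_memory_snapshot(label):
    # Resident set size of the current Python process, in MB
    rss = psutil.Process().memory_info().rss
    print(f"{label}: {rss / 1e6:.2f} MB")

cpu_memory_snapshot("Before allocation")
big = torch.randn(1000, 1000)           # roughly 4 MB of float32 data
cpu_memory_snapshot("After allocation")
del big
cpu_memory_snapshot("After deletion")   # the allocator may not return memory to the OS immediately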

Debugging Tools and Utilities

Creating a Simple Tensor Debugger

Let's build a utility function for tensor debugging:

python
def tensor_debug(tensor, name="tensor", full_info=False):
    """Utility for comprehensive tensor debugging"""
    print(f"\n--- Debug: {name} ---")
    print(f"Shape: {tensor.shape}")
    print(f"Dtype: {tensor.dtype}")
    print(f"Device: {tensor.device}")

    # Check for NaN and Inf
    has_nan = torch.isnan(tensor).any().item()
    has_inf = torch.isinf(tensor).any().item()
    print(f"Contains NaN: {has_nan}")
    print(f"Contains Inf: {has_inf}")

    # Basic statistics
    if tensor.numel() > 0 and tensor.dtype in [torch.float16, torch.float32, torch.float64]:
        print(f"Min: {tensor.min().item():.6f}")
        print(f"Max: {tensor.max().item():.6f}")
        print(f"Mean: {tensor.mean().item():.6f}")
        print(f"Std: {tensor.std().item():.6f}")

    if has_nan or has_inf:
        nan_count = torch.isnan(tensor).sum().item()
        inf_count = torch.isinf(tensor).sum().item()
        print(f"Number of NaN values: {nan_count}")
        print(f"Number of Inf values: {inf_count}")

    # Print tensor values
    if full_info or tensor.numel() < 100:
        print(f"Values: {tensor}")
    else:
        print("Values: (tensor too large to display, use full_info=True to override)")

    print("-" * 30)
    return tensor

# Example usage
x = torch.randn(3, 4)
x[1, 2] = float('nan')  # Introduce a NaN
tensor_debug(x, "sample tensor")

Integrating with Python Debugger (pdb)

Using Python's built-in debugger with PyTorch:

python
def debugging_with_pdb():
    print("Example of using pdb with PyTorch")
    print("In your actual code, you would do:")
    print("import pdb; pdb.set_trace()")
    print()
    print("Common pdb commands:")
    print("- n: next line")
    print("- c: continue execution")
    print("- p expression: print value of expression")
    print("- pp tensor: pretty-print a tensor")

    # Example code that would use pdb
    def problematic_function():
        x = torch.randn(3, 3)
        y = torch.randn(3, 3)

        # Insert breakpoint in actual debugging scenario
        # import pdb; pdb.set_trace()

        z = x * y  # Inspect tensors at this point
        return z.sum()

    result = problematic_function()
    print(f"\nFunction result: {result}")

debugging_with_pdb()
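
On Python 3.7 and later, the built-in breakpoint() is equivalent to import pdb; pdb.set_trace() and is usually the more convenient spelling:

python
import torch

def problematic_function_with_breakpoint():
    x = torch.randn(3, 3)
    y = torch.randn(3, 3)
    # breakpoint()  # uncomment to drop into the debugger at this point
    return (x * y).sum()

print(problematic_function_with_breakpoint())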

Summary

Debugging tensors in PyTorch is an essential skill for any deep learning practitioner. In this guide, we've covered:

  • Basic tensor inspection techniques
  • Identifying and fixing common tensor issues (shape mismatches, NaN values, device placement)
  • Advanced debugging with hooks and custom utilities
  • Memory management and tracking
  • Real-world examples of tensor debugging

Remember that effective debugging is often about being systematic and checking assumptions at each step. The most common tensor issues relate to shapes, data types, and device placement, so those are good places to start your investigation.

Additional Resources and Exercises

Resources

  1. PyTorch Documentation on Tensors
  2. PyTorch Forums - great for asking specific debugging questions
  3. PyTorch GitHub Issues - may contain solutions to known bugs

Exercises

  1. Debugging Challenge: Create a tensor with shape (3, 4, 5) and introduce NaN values at specific indices. Write code to identify and replace these NaN values with the mean of their respective feature vectors.

  2. Memory Optimization: Write a function that processes a large dataset tensor by tensor, monitoring memory usage and ensuring it stays below a specific threshold.

  3. Custom Hook: Create a custom hook that monitors for exploding gradients (values larger than a threshold) during training and prints a warning with the layer name when detected.

  4. Tensor Visualization: Use matplotlib to create a visualization tool for PyTorch tensors that can help you debug activation patterns in neural networks.

  5. Gradient Flow Analysis: Build a utility to track gradient magnitudes throughout a neural network and identify layers where gradients might be vanishing.

Happy debugging!


