PyTorch Tensor Debugging

Debugging is a crucial skill for any PyTorch developer. When working with neural networks, you'll often encounter issues related to tensors - PyTorch's fundamental data structure. This guide will help you understand common tensor problems and provide effective debugging techniques.

Introduction to Tensor Debugging

Tensors are the core data structure in PyTorch, similar to NumPy arrays but with GPU acceleration capabilities. When your deep learning models misbehave, the issue often stems from tensor-related problems:

  • Incorrect shapes or dimensions
  • NaN or infinity values
  • Wrong device placement (CPU vs. GPU)
  • Type mismatches
  • Memory leaks

Learning how to diagnose and fix these issues will save you countless hours of frustration.
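
Of the issues above, type mismatches are the only one not revisited later in this guide, so here is a minimal sketch: a module whose float32 parameters receive a float64 input (the exact error text varies between PyTorch versions).

python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)                      # parameters are float32 by default
x = torch.randn(3, 4, dtype=torch.float64)   # double-precision input

try:
    layer(x)  # float64 input vs. float32 weights raises a RuntimeError
except RuntimeError as e:
    print(f"Error: {e}")

# Fix: cast the input to match the layer's parameters
print(layer(x.float()).dtype)  # torch.float32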

Basic Tensor Inspection

Printing Tensor Properties

The first step in debugging tensors is to examine their basic properties:

python
import torch

# Create a sample tensor
x = torch.randn(3, 4)

# Basic inspection
print(f"Tensor: {x}")
print(f"Shape: {x.shape}")
print(f"Datatype: {x.dtype}")
print(f"Device: {x.device}")

Output:

Tensor: tensor([[ 0.0562, -0.1928,  0.4994,  0.0335],
        [ 1.0492, -0.7965, -0.8460,  0.3251],
        [ 1.3163,  0.6596, -1.5771, -0.0929]])
Shape: torch.Size([3, 4])
Datatype: torch.float32
Device: cpu
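
A few more attributes are often worth printing alongside these; this short sketch continues from the snippet above and uses the same tensor x:

python
# Additional properties that frequently matter when debugging
# (continues from the snippet above: x = torch.randn(3, 4))
print(f"Number of dimensions: {x.ndim}")
print(f"Number of elements: {x.numel()}")
print(f"Requires grad: {x.requires_grad}")
print(f"Contiguous in memory: {x.is_contiguous()}")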

Viewing Tensor Content

For larger tensors, you might want to see only parts of the data:

python
# Create a larger tensor
large_tensor = torch.randn(10, 10)

# View first few elements
print("First 2 rows:")
print(large_tensor[:2, :])

# View specific slice
print("\nCenter of tensor:")
print(large_tensor[4:6, 4:6])

Common Tensor Debugging Issues

1. Shape Mismatch Errors

One of the most common tensor errors occurs when operations expect tensors of specific shapes:

python
# Example of shape mismatch
a = torch.randn(3, 4)
b = torch.randn(5, 4)

try:
    c = a + b  # This will fail
except RuntimeError as e:
    print(f"Error: {e}")

# Fix the shape mismatch
b_resized = torch.randn(3, 4) # Create properly sized tensor
c = a + b_resized
print(f"Fixed shapes: a {a.shape}, b_resized {b_resized.shape}, c {c.shape}")

Output:

Error: The size of tensor a (3) must match the size of tensor b (5) at non-singleton dimension 0
Fixed shapes: a torch.Size([3, 4]), b_resized torch.Size([3, 4]), c torch.Size([3, 4])
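
In real code you usually cannot just allocate a fresh tensor of the right size; the fix is more often a reshape, slicing, or an added singleton dimension so that broadcasting applies. A minimal sketch of the broadcasting case:

python
# Broadcasting a per-row value of shape (3,) across a (3, 4) tensor
a = torch.randn(3, 4)
bias = torch.randn(3)

# a + bias fails: trailing dimensions 4 and 3 do not broadcast
c = a + bias.unsqueeze(1)   # bias becomes shape (3, 1), which broadcasts to (3, 4)
print(f"Broadcast result shape: {c.shape}")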

2. NaN and Infinity Values

NaN (Not a Number) values can silently propagate through your network:

python
# Create a tensor with NaN
x = torch.tensor([1.0, float('nan'), 3.0, float('inf')])
print(f"Tensor with NaN and Inf: {x}")

# Check for NaN or Inf
print(f"Contains NaN: {torch.isnan(x).any()}")
print(f"Contains Inf: {torch.isinf(x).any()}")

# Find positions of NaN values
print(f"NaN positions: {torch.where(torch.isnan(x))}")
print(f"Inf positions: {torch.where(torch.isinf(x))}")

Output:

Tensor with NaN and Inf: tensor([1., nan, 3., inf])
Contains NaN: tensor(True)
Contains Inf: tensor(True)
NaN positions: (tensor([1]),)
Inf positions: (tensor([3]),)
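
Once you have located them, torch.nan_to_num can replace NaN and infinite values in a single call; a minimal sketch, continuing from the tensor x above:

python
# Replace NaN with 0.0 and clamp +/- infinity to chosen finite values
cleaned = torch.nan_to_num(x, nan=0.0, posinf=1e6, neginf=-1e6)
print(cleaned)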

3. Device Placement Issues

When working with both CPU and GPU, tensors must be on the same device for operations:

python
# Check if CUDA is available
cuda_available = torch.cuda.is_available()
print(f"CUDA available: {cuda_available}")

# Create tensors on different devices
a = torch.randn(2, 2)
if cuda_available:
    b = torch.randn(2, 2).cuda()

    try:
        # This will fail - tensors on different devices
        c = a + b
    except RuntimeError as e:
        print(f"Error: {e}")

    # Fix by moving a to the GPU as well
    a_cuda = a.cuda()
    c = a_cuda + b
    print(f"Fixed operation with both tensors on: {a_cuda.device}")

Advanced Debugging Techniques

Using torch.set_printoptions

Control how tensors are displayed for better debugging:

python
# Default printing can truncate large tensors
large_tensor = torch.randn(10, 10)
print("Default print options:")
print(large_tensor)

# Customize print options
torch.set_printoptions(precision=2, sci_mode=False, linewidth=120, edgeitems=3)
print("\nCustomized print options:")
print(large_tensor)
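
Note that print options are global, so it is worth restoring the defaults once you are done debugging (assuming you have not customised them elsewhere):

python
# Restore the default tensor printing behaviour
torch.set_printoptions(profile="default")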

Tracking Gradient Flow

When debugging backward passes in neural networks, check if your gradients are flowing correctly:

python
def debug_gradients():
    x = torch.randn(3, requires_grad=True)
    y = x * 2
    z = y.mean()

    # Before backward
    print("Before backward:")
    print(f"x.grad: {x.grad}")

    # Compute gradients
    z.backward()

    # After backward
    print("After backward:")
    print(f"x.grad: {x.grad}")

    # Check for small gradients that might indicate vanishing gradients
    if x.grad.abs().max() < 1e-5:
        print("Warning: Very small gradients detected!")

debug_gradients()

Output:

Before backward:
x.grad: None
After backward:
x.grad: tensor([0.6667, 0.6667, 0.6667])
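
If a backward pass produces NaN gradients and you cannot tell where, autograd's anomaly mode can point at the operation responsible, at the cost of slower execution. A minimal sketch (the exact error wording varies by PyTorch version):

python
# Anomaly mode makes backward raise as soon as a NaN gradient appears
# and reports which autograd function produced it - use it only while debugging
with torch.autograd.detect_anomaly():
    x = torch.tensor([4.0, -1.0], requires_grad=True)
    y = torch.sqrt(x)        # sqrt of the negative entry is NaN
    try:
        y.sum().backward()   # the sqrt backward produces a NaN gradient here
    except RuntimeError as e:
        print(f"Anomaly detected: {e}")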

Using Hooks for Debugging

Hooks can be attached to tensors to monitor operations during forward or backward passes:

python
def hook_example():
    # Define a hook function
    def print_grad(grad):
        print(f"Gradient in hook: {grad}")

    x = torch.randn(2, 2, requires_grad=True)

    # Register hook
    handle = x.register_hook(print_grad)

    # Forward and backward
    y = x.pow(2).sum()
    y.backward()

    # Remove hook after use
    handle.remove()

hook_example()
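
Tensor hooks like the one above only fire during the backward pass. To watch activations during the forward pass you can attach a forward hook to a module instead; a minimal sketch using a single linear layer:

python
import torch
import torch.nn as nn

def forward_hook_example():
    layer = nn.Linear(4, 2)

    # The hook receives the module, its input tuple, and its output tensor
    def inspect_activation(module, inputs, output):
        print(f"{module.__class__.__name__} output shape: {output.shape}, "
              f"mean: {output.mean().item():.4f}")

    handle = layer.register_forward_hook(inspect_activation)

    layer(torch.randn(5, 4))   # the hook fires here

    handle.remove()            # detach the hook when finished

forward_hook_example()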

Real-World Debugging Examples

Example 1: Fixing a Training Loop

Here's how to debug a common training-loop mistake: forgetting to zero gradients.

python
import torch.nn as nn
import torch.optim as optim

def debug_training_loop():
    # Create a simple model
    model = nn.Linear(10, 1)
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.MSELoss()

    # Sample data
    inputs = torch.randn(5, 10)
    targets = torch.randn(5, 1)

    print("Gradients before any backward pass:")
    for name, param in model.named_parameters():
        print(f"{name} grad: {param.grad}")

    # Problem: not zeroing gradients, so they accumulate across epochs
    for epoch in range(2):
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        # Backward pass without zeroing gradients
        loss.backward()

        print(f"\nEpoch {epoch+1}, gradients before optimizer step:")
        for name, param in model.named_parameters():
            print(f"{name} grad: {param.grad}")

        optimizer.step()

    print("\nFixed version with proper gradient zeroing:")
    # Correct training loop
    for epoch in range(2):
        # Zero gradients first
        optimizer.zero_grad()

        # Forward and backward passes
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()

        # Now gradients won't accumulate
        print(f"Epoch {epoch+1}, gradients with zeroing:")
        for name, param in model.named_parameters():
            print(f"{name} grad magnitude: {param.grad.norm()}")

        optimizer.step()

debug_training_loop()

Example 2: Diagnosing Dimension Issues

Let's debug a common issue when preparing data for a CNN:

python
def debug_cnn_dimensions():
    # Create a mini-batch of images (batch_size, channels, height, width)
    images = torch.randn(4, 3, 32, 32)

    # Attempt to pass to a network that expects different dimensions
    try:
        # Simulating a network expecting (batch_size, height, width, channels) format
        print("Original shape:", images.shape)

        # This would cause dimension errors in many CNNs
        # Purposely create incorrect transposition
        incorrect_format = images.permute(0, 2, 3, 1)
        print("Incorrect format:", incorrect_format.shape)

        # Fix by permuting correctly
        correct_format = incorrect_format.permute(0, 3, 1, 2)
        print("Fixed format:", correct_format.shape)
        print("Matches original:", (correct_format.shape == images.shape))

    except Exception as e:
        print(f"Error: {e}")

debug_cnn_dimensions()

Output:

Original shape: torch.Size([4, 3, 32, 32])
Incorrect format: torch.Size([4, 32, 32, 3])
Fixed format: torch.Size([4, 3, 32, 32])
Matches original: True

Tracking Tensor Memory Usage

Memory issues are common in deep learning. Here's how to track tensor memory:

python
def memory_debugging():
    # Only works if CUDA is available
    if torch.cuda.is_available():
        # Check memory before
        before = torch.cuda.memory_allocated()
        print(f"Memory before: {before / 1e6:.2f} MB")

        # Create a large tensor
        large_tensor = torch.randn(1000, 1000, device='cuda')

        # Check memory after
        after = torch.cuda.memory_allocated()
        print(f"Memory after: {after / 1e6:.2f} MB")
        print(f"Tensor size: {large_tensor.element_size() * large_tensor.nelement() / 1e6:.2f} MB")

        # Clean up to free memory
        del large_tensor
        torch.cuda.empty_cache()

        # Check memory after cleanup
        final = torch.cuda.memory_allocated()
        print(f"Memory after cleanup: {final / 1e6:.2f} MB")
    else:
        print("CUDA not available, cannot demonstrate GPU memory tracking")

    # CPU memory tracking is more complex and requires external packages
    print("For CPU memory tracking, consider using the 'psutil' library")

memory_debugging()
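
As the snippet notes, CPU-side memory can be watched with psutil. A minimal sketch, assuming psutil is installed (pip install psutil):

python
import psutil
import torch

def cpu_memory_snapshot(label):
    # Resident set size of the current Python process, in MB
    rss = psutil.Process().memory_info().rss
    print(f"{label}: {rss / 1e6:.2f} MB")

cpu_memory_snapshot("Before allocation")
big = torch.randn(1000, 1000)           # roughly 4 MB of float32 data
cpu_memory_snapshot("After allocation")
del big
cpu_memory_snapshot("After deletion")   # the allocator may not return memory to the OS immediately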

Debugging Tools and Utilities

Creating a Simple Tensor Debugger

Let's build a utility function for tensor debugging:

python
def tensor_debug(tensor, name="tensor", full_info=False):
    """Utility for comprehensive tensor debugging"""
    print(f"\n--- Debug: {name} ---")
    print(f"Shape: {tensor.shape}")
    print(f"Dtype: {tensor.dtype}")
    print(f"Device: {tensor.device}")

    # Check for NaN and Inf
    has_nan = torch.isnan(tensor).any().item()
    has_inf = torch.isinf(tensor).any().item()
    print(f"Contains NaN: {has_nan}")
    print(f"Contains Inf: {has_inf}")

    # Basic statistics
    if tensor.numel() > 0 and tensor.dtype in [torch.float16, torch.float32, torch.float64]:
        print(f"Min: {tensor.min().item():.6f}")
        print(f"Max: {tensor.max().item():.6f}")
        print(f"Mean: {tensor.mean().item():.6f}")
        print(f"Std: {tensor.std().item():.6f}")

    if has_nan or has_inf:
        nan_count = torch.isnan(tensor).sum().item()
        inf_count = torch.isinf(tensor).sum().item()
        print(f"Number of NaN values: {nan_count}")
        print(f"Number of Inf values: {inf_count}")

    # Print tensor values
    if full_info or tensor.numel() < 100:
        print(f"Values: {tensor}")
    else:
        print("Values: (tensor too large to display, use full_info=True to override)")

    print("-" * 30)
    return tensor

# Example usage
x = torch.randn(3, 4)
x[1, 2] = float('nan')  # Introduce a NaN
tensor_debug(x, "sample tensor")

Integrating with Python Debugger (pdb)

Using Python's built-in debugger with PyTorch:

python
def debugging_with_pdb():
    print("Example of using pdb with PyTorch")
    print("In your actual code, you would do:")
    print("import pdb; pdb.set_trace()")
    print()
    print("Common pdb commands:")
    print("- n: next line")
    print("- c: continue execution")
    print("- p expression: print value of expression")
    print("- pp tensor: pretty-print a tensor")

    # Example code that would use pdb
    def problematic_function():
        x = torch.randn(3, 3)
        y = torch.randn(3, 3)

        # Insert breakpoint in actual debugging scenario
        # import pdb; pdb.set_trace()

        z = x * y  # Inspect tensors at this point
        return z.sum()

    result = problematic_function()
    print(f"\nFunction result: {result}")

debugging_with_pdb()
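
On Python 3.7 and later, the built-in breakpoint() is equivalent to import pdb; pdb.set_trace() and is usually the more convenient spelling:

python
import torch

def problematic_function_with_breakpoint():
    x = torch.randn(3, 3)
    y = torch.randn(3, 3)
    # breakpoint()  # uncomment to drop into the debugger at this point
    return (x * y).sum()

print(problematic_function_with_breakpoint())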

Summary

Debugging tensors in PyTorch is an essential skill for any deep learning practitioner. In this guide, we've covered:

  • Basic tensor inspection techniques
  • Identifying and fixing common tensor issues (shape mismatches, NaN values, device placement)
  • Advanced debugging with hooks and custom utilities
  • Memory management and tracking
  • Real-world examples of tensor debugging

Remember that effective debugging is often about being systematic and checking assumptions at each step. The most common tensor issues relate to shapes, data types, and device placement, so those are good places to start your investigation.

Additional Resources and Exercises

Resources

  1. PyTorch Documentation on Tensors
  2. PyTorch Forums - great for asking specific debugging questions
  3. PyTorch GitHub Issues - may contain solutions to known bugs

Exercises

  1. Debugging Challenge: Create a tensor with shape (3, 4, 5) and introduce NaN values at specific indices. Write code to identify and replace these NaN values with the mean of their respective feature vectors.

  2. Memory Optimization: Write a function that processes a large dataset tensor by tensor, monitoring memory usage and ensuring it stays below a specific threshold.

  3. Custom Hook: Create a custom hook that monitors for exploding gradients (values larger than a threshold) during training and prints a warning with the layer name when detected.

  4. Tensor Visualization: Use matplotlib to create a visualization tool for PyTorch tensors that can help you debug activation patterns in neural networks.

  5. Gradient Flow Analysis: Build a utility to track gradient magnitudes throughout a neural network and identify layers where gradients might be vanishing.

Happy debugging!


