
PyTorch Autograd Profiling

When developing deep learning models with PyTorch, understanding performance bottlenecks is crucial for optimization. PyTorch provides powerful profiling tools that help you analyze where your model spends most of its time during both forward and backward passes. In this tutorial, you'll learn how to use PyTorch's autograd profiling tools to identify performance issues and optimize your models.

Introduction to PyTorch Profiling

PyTorch's profiling utilities help you measure:

  • The execution time of operations
  • Memory consumption
  • CUDA kernel launches
  • Stack traces of operations

These insights can help you pinpoint which parts of your model or training pipeline need optimization. The main tools we'll explore include:

  1. torch.autograd.profiler.profile
  2. torch.autograd.profiler.record_function
  3. torch.profiler (the newer, more comprehensive API)

Let's dive into each of these tools with practical examples.

Basic Profiling with torch.autograd.profiler.profile

The profile context manager is the simplest way to start profiling your PyTorch code:

python
import torch
import torch.nn as nn
from torch.autograd import profiler

# Define a simple model
model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# Create random input data
inputs = torch.randn(32, 100)

# Profile both forward and backward pass
with profiler.profile(with_stack=True, profile_memory=True) as prof:
    outputs = model(inputs)
    loss = outputs.sum()
    loss.backward()

# Print the report
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

Output:

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
Name                                 Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
aten::linear_1                            3.21%     162.675us        54.12%       2.740ms       2.740ms       0.000us         0.00%       0.000us        0.000us             1
aten::add_                                3.86%     195.418us        35.32%       1.787ms     101.399us       0.000us         0.00%       0.000us        0.000us            18
aten::matmul                              3.12%     158.053us        31.48%       1.594ms     531.238us       0.000us         0.00%       0.000us        0.000us             3
aten::_to_copy_                           1.82%      91.957us        28.38%       1.437ms      68.407us       0.000us         0.00%       0.000us        0.000us            21
aten::empty_like                          1.21%      61.223us        26.56%       1.344ms      74.681us       0.000us         0.00%       0.000us        0.000us            18
aten::addmm                              28.09%       1.422ms        28.09%       1.422ms     473.865us       0.000us         0.00%       0.000us        0.000us             3
aten::t_strided                           2.22%     112.224us        25.71%       1.301ms     100.088us       0.000us         0.00%       0.000us        0.000us            13
aten::empty_like_                         5.33%     269.540us        25.35%       1.283ms     107.634us       0.000us         0.00%       0.000us        0.000us            12
aten::resize_                            17.19%     869.910us        17.19%     869.910us      21.748us       0.000us         0.00%       0.000us        0.000us            40
aten::zero_                               4.25%     214.839us        14.94%     756.142us      31.506us       0.000us         0.00%       0.000us        0.000us            24
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------

This report shows:

  • Name: Operation name
  • Self CPU %: Percentage of time spent in this operation (excluding child operations)
  • CPU total %: Percentage of time including child operations
  • # of Calls: How many times this operation was called
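
You can also group the averaged statistics by input shape, which helps reveal which tensor sizes dominate the runtime. This requires passing record_shapes=True when profiling; here's a minimal sketch reusing the imports, model, and inputs from the example above:

python
# Re-run profiling with shape recording enabled
with profiler.profile(record_shapes=True) as prof:
    outputs = model(inputs)
    outputs.sum().backward()

# Group the averages by (operator, input shapes)
print(prof.key_averages(group_by_input_shape=True).table(
    sort_by="cpu_time_total", row_limit=10))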

Profiling with User Annotations

You can add custom annotations to mark specific sections of your code using record_function:

python
import torch
import torch.nn as nn
from torch.autograd import profiler

model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

inputs = torch.randn(32, 100)

with profiler.profile() as prof:
    with profiler.record_function("model_forward"):
        outputs = model(inputs)

    with profiler.record_function("loss_calculation"):
        loss = outputs.sum()

    with profiler.record_function("backward_pass"):
        loss.backward()

# Print events with our custom labels
print(prof.key_averages().table(sort_by="cpu_time_total"))

Output:

---------------------------------  ------------  ------------  ------------  ------------  ------------
Name                                 Self CPU %      Self CPU   CPU total %     CPU total    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------
backward_pass                            19.11%     149.511us       100.00%     782.461us             1
model_forward                             2.34%      18.283us        58.71%     459.466us             1
loss_calculation                         22.19%     173.615us        22.19%     173.615us             1
...
---------------------------------  ------------  ------------  ------------  ------------  ------------

Profiling GPU Operations

If you're using CUDA, you can also profile GPU operations by setting use_cuda=True (note that recent PyTorch releases deprecate this flag in favor of the torch.profiler API covered below):

python
import torch
import torch.nn as nn
from torch.autograd import profiler

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a model and move it to the device
model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
).to(device)

# Create input data on the device
inputs = torch.randn(32, 100, device=device)

# Profile with CUDA support
with profiler.profile(use_cuda=True) as prof:
    outputs = model(inputs)
    loss = outputs.sum()
    loss.backward()

# Print the report
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Output (when run on a CUDA-enabled device):

---------------------------------  ------------  ------------  ------------  ------------  -------------  ------------
Name                                Self CUDA %     Self CUDA  CUDA total %    CUDA total  CUDA time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  -------------  ------------
aten::addmm                              84.73%       3.464ms        84.73%       3.464ms        1.155ms             3
aten::relu                                8.26%     337.728us         8.26%     337.728us      168.864us             2
aten::copy_                               5.41%     221.184us         5.41%     221.184us       73.728us             3
aten::sum                                 0.89%      36.352us         0.89%      36.352us       36.352us             1
...
---------------------------------  ------------  ------------  ------------  ------------  -------------  ------------

Advanced Profiling with torch.profiler

PyTorch provides a newer, more comprehensive profiling API, torch.profiler (introduced in PyTorch 1.8), with enhanced capabilities:

python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Define a model
model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# Move to CUDA if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = torch.randn(32, 100, device=device)

# Profile both CPU and CUDA activities
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    with record_function("model_inference"):
        outputs = model(inputs)
        loss = outputs.sum()
        loss.backward()

# Print a summary
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# Export trace to Chrome trace format
prof.export_chrome_trace("trace.json")
print("Trace exported to trace.json - view it at chrome://tracing/")

Profiling a Training Loop

Let's see how to profile a real training loop:

python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, record_function, ProfilerActivity, schedule

# Define a model
model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# Move to CUDA if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Define profiler schedule: skip 1 batch, warm up for 1, record 3 active batches, and repeat the cycle twice
profiler_schedule = schedule(wait=1, warmup=1, active=3, repeat=2)

# Activities to profile
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

# Start profiler
with profile(
    activities=activities,
    schedule=profiler_schedule,
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/profiler'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    # Train for 10 batches
    for step in range(10):
        # Generate synthetic data
        inputs = torch.randn(32, 100, device=device)
        targets = torch.randn(32, 10, device=device)

        with record_function("forward_pass"):
            outputs = model(inputs)
            loss = nn.functional.mse_loss(outputs, targets)

        with record_function("backward_pass"):
            optimizer.zero_grad()
            loss.backward()

        with record_function("optimizer_step"):
            optimizer.step()

        # Step the profiler so the schedule advances
        prof.step()

# Print summary of the collected traces
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
print("\nTo visualize profiler data in TensorBoard, run: tensorboard --logdir=./log")

Once you've collected the profiling data, you can visualize it in TensorBoard:

bash
pip install tensorboard torch_tb_profiler
tensorboard --logdir=./log

Then navigate to http://localhost:6006/#pytorch_profiler to see detailed profiling visualizations.

Identifying and Resolving Bottlenecks

After profiling your model, look for:

  1. Operations with high CPU/GPU time: These are candidates for optimization
  2. Excessive memory usage: May cause out-of-memory errors on larger datasets
  3. Operations with many calls: Consider if they can be batched
  4. CPU-GPU synchronization points: These can create processing bottlenecks (see the sketch below)
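
On the last point: a common hidden synchronization is reading a CUDA tensor's value on the CPU, e.g. via .item(), which blocks until all queued GPU work finishes. Here's a minimal sketch (CUDA-only; the tensor size is arbitrary) that makes such a stall visible in the profile:

python
import torch
from torch.profiler import profile, ProfilerActivity

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        y = x @ x               # queued asynchronously on the GPU
        total = y.sum().item()  # forces the CPU to wait for the GPU
    # The stall shows up as CPU time on the value-reading op and the device-to-host copy
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))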

Common optimization techniques include:

  • Using torch.compile() (in PyTorch 2.0+) for automatic optimization (see the sketch after this list)
  • Increasing batch sizes (if memory allows)
  • Reducing precision (e.g., using torch.float16 instead of torch.float32)
  • Optimizing data loading with proper prefetching
  • Using more efficient model architectures
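
As an example of the first technique, here is a minimal sketch (assuming PyTorch 2.0+) that profiles a compiled model. The first calls trigger compilation, so warm up before measuring steady-state performance:

python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(100, 200), nn.ReLU(), nn.Linear(200, 10))
inputs = torch.randn(32, 100)

compiled_model = torch.compile(model)  # PyTorch 2.0+

# Warm up: the first calls pay the one-time compilation cost
for _ in range(3):
    compiled_model(inputs)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    compiled_model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))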

Profiling Memory Usage

Memory profiling is critical for large models:

python
import torch
import torch.nn as nn
from torch.autograd import profiler

model = nn.Sequential(
    nn.Linear(1000, 2000),
    nn.ReLU(),
    nn.Linear(2000, 500),
    nn.ReLU(),
    nn.Linear(500, 100)
)

inputs = torch.randn(128, 1000)

with profiler.profile(profile_memory=True) as prof:
    outputs = model(inputs)
    loss = outputs.sum()
    loss.backward()

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

Output:

---------------------------------  ------------------  ----------------  ------------  ------------  -------------
Name                               Self CPU Mem Alloc  Self CPU Mem Ret   CPU Mem Avg    # of Calls  Memory Format
---------------------------------  ------------------  ----------------  ------------  ------------  -------------
aten::empty                                 205.33 Mb        -205.33 Mb       2.15 Mb            96        Strided
aten::empty_like                             25.87 Mb         -25.87 Mb       1.44 Mb            18        Strided
aten::zeros                                   7.63 Mb          -7.63 Mb       2.54 Mb             3        Strided
...
---------------------------------  ------------------  ----------------  ------------  ------------  -------------
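
On CUDA devices, the profiler's memory columns can be cross-checked against PyTorch's allocator statistics. A small sketch, assuming the model and inputs above have been moved to the GPU:

python
if torch.cuda.is_available():
    model_gpu = model.to("cuda")
    inputs_gpu = inputs.to("cuda")

    torch.cuda.reset_peak_memory_stats()
    loss = model_gpu(inputs_gpu).sum()
    loss.backward()

    # Peak memory used by tensors vs. reserved by the caching allocator
    print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
    print(f"Peak reserved:  {torch.cuda.max_memory_reserved() / 1e6:.1f} MB")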

Summary

PyTorch's profiling tools provide powerful insights into model performance:

  • Basic profiling with torch.autograd.profiler.profile helps identify time-consuming operations
  • Custom annotations with record_function let you mark specific code sections for analysis
  • Advanced profiling with torch.profiler offers comprehensive analysis with TensorBoard integration
  • Memory profiling helps identify memory bottlenecks

By systematically analyzing the execution patterns of your models, you can identify bottlenecks and apply targeted optimizations to make your PyTorch code faster and more memory-efficient.

Exercises

  1. Exercise 1: Profile a CNN model on a vision dataset and identify the most time-consuming operations.

  2. Exercise 2: Compare the performance of the same model running with different batch sizes and analyze how execution time scales.

  3. Exercise 3: Profile a model with different data types (float32 vs float16) and measure the performance difference.

  4. Exercise 4: Use the profiler to analyze the performance impact of different optimization strategies like frozen layers or model pruning.

  5. Challenge Exercise: Profile a transformer model like BERT and identify optimization opportunities to reduce its memory footprint or inference time.

By mastering PyTorch's profiling tools, you'll be able to build more efficient models that train faster and use resources more effectively!


