PyTorch Autograd Profiling
When developing deep learning models with PyTorch, understanding performance bottlenecks is crucial for optimization. PyTorch provides powerful profiling tools that help you analyze where your model spends most of its time during both forward and backward passes. In this tutorial, you'll learn how to use PyTorch's autograd profiling tools to identify performance issues and optimize your models.
Introduction to PyTorch Profiling
PyTorch's profiling utilities help you measure:
- The execution time of operations
- Memory consumption
- CUDA kernel launches
- Stack traces of operations
These insights can help you pinpoint which parts of your model or training pipeline need optimization. The main tools we'll explore include:
- torch.autograd.profiler.profile
- torch.autograd.profiler.record_function
- torch.profiler (the newer, more comprehensive API)
Let's dive into each of these tools with practical examples.
Basic Profiling with torch.autograd.profiler.profile
The profile context manager is the simplest way to start profiling your PyTorch code:
import torch
import torch.nn as nn
from torch.autograd import profiler
# Define a simple model
model = nn.Sequential(
nn.Linear(100, 200),
nn.ReLU(),
nn.Linear(200, 50),
nn.ReLU(),
nn.Linear(50, 10)
)
# Create random input data
inputs = torch.randn(32, 100)
# Profile both forward and backward pass
with profiler.profile(with_stack=True, profile_memory=True) as prof:
outputs = model(inputs)
loss = outputs.sum()
loss.backward()
# Print the report
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
Output:
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::linear 3.21% 162.675us 54.12% 2.740ms 2.740ms 0.000us 0.00% 0.000us 0.000us 1
aten::add_ 3.86% 195.418us 35.32% 1.787ms 101.399us 0.000us 0.00% 0.000us 0.000us 18
aten::matmul 3.12% 158.053us 31.48% 1.594ms 531.238us 0.000us 0.00% 0.000us 0.000us 3
aten::_to_copy 1.82% 91.957us 28.38% 1.437ms 68.407us 0.000us 0.00% 0.000us 0.000us 21
aten::empty_like 1.21% 61.223us 26.56% 1.344ms 74.681us 0.000us 0.00% 0.000us 0.000us 18
aten::addmm 28.09% 1.422ms 28.09% 1.422ms 473.865us 0.000us 0.00% 0.000us 0.000us 3
aten::t 2.22% 112.224us 25.71% 1.301ms 100.088us 0.000us 0.00% 0.000us 0.000us 13
aten::empty_strided 5.33% 269.540us 25.35% 1.283ms 107.634us 0.000us 0.00% 0.000us 0.000us 12
aten::resize_ 17.19% 869.910us 17.19% 869.910us 21.748us 0.000us 0.00% 0.000us 0.000us 40
aten::zero_ 4.25% 214.839us 14.94% 756.142us 31.506us 0.000us 0.00% 0.000us 0.000us 24
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
This report shows:
- Name: Operation name
- Self CPU %: Percentage of time spent in this operation (excluding child operations)
- CPU total %: Percentage of time including child operations
- # of Calls: How many times this operation was called
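Because the first example passed with_stack=True, each event also carries a Python stack trace. As a small follow-up, reusing the prof object from above, you can group time by the top stack frames so that hot spots map back to source lines (group_by_stack_n is a parameter of key_averages):
# Group events by their top 5 stack frames (requires with_stack=True)
# so each row attributes time to the source location that launched it
print(prof.key_averages(group_by_stack_n=5).table(sort_by="self_cpu_time_total", row_limit=5))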
Profiling with User Annotations
You can add custom annotations to mark specific sections of your code using record_function:
import torch
import torch.nn as nn
from torch.autograd import profiler
model = nn.Sequential(
nn.Linear(100, 200),
nn.ReLU(),
nn.Linear(200, 50),
nn.ReLU(),
nn.Linear(50, 10)
)
inputs = torch.randn(32, 100)
with profiler.profile() as prof:
with profiler.record_function("model_forward"):
outputs = model(inputs)
with profiler.record_function("loss_calculation"):
loss = outputs.sum()
with profiler.record_function("backward_pass"):
loss.backward()
# Print events with our custom labels
print(prof.key_averages().table(sort_by="cpu_time_total"))
Output:
--------------------------------- ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------
backward_pass 19.11% 149.511us 100.00% 782.461us 1
model_forward 2.34% 18.283us 58.71% 459.466us 1
loss_calculation 22.19% 173.615us 22.19% 173.615us 1
...
--------------------------------- ------------ ------------ ------------ ------------ ------------
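record_function annotations nest and also work inside a module's forward method, which is handy for attributing time to sub-blocks of a larger model. A minimal sketch (the Block class and the "block_fc" label are illustrative, not part of any PyTorch API):
import torch
import torch.nn as nn
from torch.autograd import profiler

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(100, 100)

    def forward(self, x):
        # Label just this sub-block; it shows up as its own row in the report
        with profiler.record_function("block_fc"):
            return self.fc(x)

with profiler.profile() as prof:
    Block()(torch.randn(8, 100))
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))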
Profiling GPU Operations
If you're using CUDA, you can also profile GPU operations by passing use_cuda=True (note that newer PyTorch releases deprecate this flag in favor of the activities argument of torch.profiler, covered below):
import torch
import torch.nn as nn
from torch.autograd import profiler
# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Define a model and move it to the device
model = nn.Sequential(
nn.Linear(100, 200),
nn.ReLU(),
nn.Linear(200, 50),
nn.ReLU(),
nn.Linear(50, 10)
).to(device)
# Create input data on the device
inputs = torch.randn(32, 100, device=device)
# Profile with CUDA support
with profiler.profile(use_cuda=True) as prof:
outputs = model(inputs)
loss = outputs.sum()
loss.backward()
# Print the report
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
Output (when run on a CUDA-enabled device):
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CUDA % Self CUDA CUDA total % CUDA total CUDA time avg # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::addmm 84.73% 3.464ms 84.73% 3.464ms 1.155ms 3
aten::relu 8.26% 337.728us 8.26% 337.728us 168.864us 2
aten::copy_ 5.41% 221.184us 5.41% 221.184us 73.728us 3
aten::sum 0.89% 36.352us 0.89% 36.352us 36.352us 1
...
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
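One caveat: CUDA kernels launch asynchronously, so naive wall-clock timing of a GPU region mostly measures kernel launch overhead. The profiler accounts for this internally, but if you ever hand-time a region as a sanity check, synchronize first. A sketch, assuming a CUDA device and reusing model and inputs from the example above:
import time

torch.cuda.synchronize()  # flush pending GPU work before starting the clock
start = time.perf_counter()
outputs = model(inputs)
torch.cuda.synchronize()  # wait for the forward pass to actually finish
print(f"Forward took {(time.perf_counter() - start) * 1e3:.2f} ms")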
Advanced Profiling with torch.profiler
PyTorch introduced a newer, more comprehensive profiling API, torch.profiler, that provides enhanced profiling capabilities:
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity
# Define a model
model = nn.Sequential(
nn.Linear(100, 200),
nn.ReLU(),
nn.Linear(200, 50),
nn.ReLU(),
nn.Linear(50, 10)
)
# Move to CUDA if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = torch.randn(32, 100, device=device)
# Profile both CPU and CUDA activities
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
activities.append(ProfilerActivity.CUDA)
with profile(activities=activities, record_shapes=True) as prof:
with record_function("model_inference"):
outputs = model(inputs)
loss = outputs.sum()
loss.backward()
# Print a summary
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
# Export trace to Chrome trace format
prof.export_chrome_trace("trace.json")
print("Trace exported to trace.json - view it at chrome://tracing/")
Profiling a Training Loop
Let's see how to profile a real training loop:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, record_function, ProfilerActivity, schedule
# Define a model
model = nn.Sequential(
nn.Linear(100, 200),
nn.ReLU(),
nn.Linear(200, 50),
nn.ReLU(),
nn.Linear(50, 10)
)
# Move to CUDA if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01)
# Create dummy training data
inputs = torch.randn(32, 100, device=device)
targets = torch.randn(32, 10, device=device)
# Define profiler schedule: skip 1 batch, warm up for 1 batch, record 3 active batches, and repeat the cycle twice
profiler_schedule = schedule(wait=1, warmup=1, active=3, repeat=2)
# Activities to profile
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
activities.append(ProfilerActivity.CUDA)
# Start profiler
with profile(
activities=activities,
schedule=profiler_schedule,
on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/profiler'),
record_shapes=True,
profile_memory=True,
with_stack=True
) as prof:
# Train for 10 batches
for step in range(10):
# Generate synthetic data
inputs = torch.randn(32, 100, device=device)
targets = torch.randn(32, 10, device=device)
with record_function("forward_pass"):
outputs = model(inputs)
loss = nn.functional.mse_loss(outputs, targets)
with record_function("backward_pass"):
optimizer.zero_grad()
loss.backward()
with record_function("optimizer_step"):
optimizer.step()
# Step the profiler
prof.step()
# Print summary of the collected traces
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
print("\nTo visualize profiler data in TensorBoard, run: tensorboard --logdir=./log")
Once you've collected the profiling data, you can visualize it in TensorBoard (the PyTorch profiler tab requires the torch-tb-profiler plugin):
pip install tensorboard torch-tb-profiler
tensorboard --logdir=./log
Then navigate to http://localhost:6006/#pytorch_profiler to see detailed profiling visualizations.
Identifying and Resolving Bottlenecks
After profiling your model, look for:
- Operations with high CPU/GPU time: These are candidates for optimization
- Excessive memory usage: May cause out-of-memory errors on larger datasets
- Operations with many calls: Consider if they can be batched
- CPU-GPU synchronization points: These can create processing bottlenecks
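You can also scan for these signals programmatically instead of eyeballing the printed table. A minimal sketch over a finished prof object (both thresholds below are arbitrary illustrations):
# Flag expensive ops and chatty ops from the averaged events
for evt in prof.key_averages():
    if evt.self_cpu_time_total > 500:  # self CPU time, in microseconds
        print(f"hot op: {evt.key} ({evt.self_cpu_time_total:.0f}us self CPU)")
    if evt.count > 100:  # called many times; a batching candidate
        print(f"chatty op: {evt.key} called {evt.count} times")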
Common optimization techniques include:
- Using torch.compile() (in PyTorch 2.0+) for automatic optimization
- Increasing batch sizes (if memory allows)
- Reducing precision (e.g., using torch.float16 instead of torch.float32)
- Optimizing data loading with proper prefetching
- Using more efficient model architectures
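As a concrete taste of the first and third techniques, here is a minimal sketch assuming PyTorch 2.0+ and a CUDA device (the model here is just a stand-in):
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 200), nn.ReLU(), nn.Linear(200, 10)).cuda()
inputs = torch.randn(32, 100, device="cuda")

compiled = torch.compile(model)  # PyTorch 2.0+: compiles and optimizes the model graph
with torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = compiled(inputs)  # eligible ops run in float16 under autocast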
Profiling Memory Usage
Memory profiling is critical for large models:
import torch
import torch.nn as nn
from torch.autograd import profiler
model = nn.Sequential(
nn.Linear(1000, 2000),
nn.ReLU(),
nn.Linear(2000, 500),
nn.ReLU(),
nn.Linear(500, 100)
)
inputs = torch.randn(128, 1000)
with profiler.profile(profile_memory=True) as prof:
outputs = model(inputs)
loss = outputs.sum()
loss.backward()
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
Output:
---------------------------------  --------------------  ------------------  ------------  ------------  -------------
Name                               Self CPU Mem (Alloc)  Self CPU Mem (Ret)  CPU Mem Avg   # of Calls    Memory Format
---------------------------------  --------------------  ------------------  ------------  ------------  -------------
aten::empty                        205.33 Mb             -205.33 Mb          2.15 Mb       96            Strided
aten::empty_like                   25.87 Mb              -25.87 Mb           1.44 Mb       18            Strided
aten::zeros                        7.63 Mb               -7.63 Mb            2.54 Mb       3             Strided
...
---------------------------------  --------------------  ------------------  ------------  ------------  -------------
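The profiler's numbers can be cross-checked against the CUDA caching allocator's own counters when running on a GPU. A small sketch using the standard torch.cuda memory APIs:
import torch

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    # ... run the forward and backward pass here ...
    peak_mb = torch.cuda.max_memory_allocated() / 1024**2
    print(f"Peak CUDA memory allocated: {peak_mb:.2f} MB")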
Summary
PyTorch's profiling tools provide powerful insights into model performance:
- Basic profiling with torch.autograd.profiler.profile helps identify time-consuming operations
- Custom annotations with record_function let you mark specific code sections for analysis
- Advanced profiling with torch.profiler offers comprehensive analysis with TensorBoard integration
- Memory profiling helps identify memory bottlenecks
By systematically analyzing the execution patterns of your models, you can identify bottlenecks and apply targeted optimizations to make your PyTorch code faster and more memory-efficient.
Additional Exercises
- Exercise 1: Profile a CNN model on a vision dataset and identify the most time-consuming operations.
- Exercise 2: Compare the performance of the same model running with different batch sizes and analyze how execution time scales.
- Exercise 3: Profile a model with different data types (float32 vs float16) and measure the performance difference.
- Exercise 4: Use the profiler to analyze the performance impact of different optimization strategies like frozen layers or model pruning.
- Challenge Exercise: Profile a transformer model like BERT and identify optimization opportunities to reduce its memory footprint or inference time.
By mastering PyTorch's profiling tools, you'll be able to build more efficient models that train faster and use resources more effectively!