PyTorch Profiling
Introduction
Understanding the performance characteristics of your deep learning models is crucial for optimization. PyTorch provides powerful built-in profiling tools that help identify performance bottlenecks in your code. In this tutorial, we'll explore these capabilities: measuring execution time, memory usage, and operator-level statistics.
Profiling is especially important when working with large-scale deep learning models, as inefficiencies that might go unnoticed in smaller models can lead to significant slowdowns in production environments.
The Basics of PyTorch Profiling
PyTorch offers a profiler module, torch.profiler, that provides comprehensive profiling capabilities. It allows you to:
- Track execution time of operations
- Analyze CPU and GPU utilization
- Visualize the model's execution trace
- Identify bottlenecks in your data loading, model execution, and backward pass
Let's start with the basic usage of the PyTorch profiler:
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Define a simple model
model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)
model.eval()  # Set to evaluation mode

# Create random input data
inputs = torch.randn(32, 100)

# Basic profiling
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        model(inputs)

# Print results
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
Output (will vary depending on your system):
---------------------------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg
---------------------------------  ------------  ------------  ------------  ------------  ------------
                  model_inference        18.85%     152.192us       100.00%     807.266us     807.266us
                   aten::linear_1        12.03%      97.085us        27.31%     220.480us     220.480us
                     aten::linear        10.08%      81.394us        30.98%     250.101us     250.101us
                   aten::linear_2         9.89%      79.845us        19.73%     159.231us     159.231us
              aten::empty_strided         8.11%      65.454us         8.11%      65.454us      21.818us
                       aten::relu         5.64%      45.520us        10.66%      86.081us      43.040us
                     aten::matmul         5.63%      45.489us        13.84%     111.702us      37.234us
                      aten::addmm         5.34%      43.152us        13.18%     106.429us      35.476us
                      aten::empty         3.75%      30.267us         3.75%      30.267us       3.784us
                        aten::bmm         3.05%      24.653us         3.05%      24.653us      24.653us
---------------------------------  ------------  ------------  ------------  ------------  ------------
This table shows the breakdown of time spent in different operations during model inference.
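The key_averages() view can also be grouped and sorted differently. For example, grouping by input shape (possible here because record_shapes=True was set) helps distinguish calls to the same operator on different tensor sizes. A short sketch, reusing the prof object from above:

# Group results by operator input shape (requires record_shapes=True)
print(prof.key_averages(group_by_input_shape=True).table(
    sort_by="cpu_time_total", row_limit=10))

# Sort by self CPU time to surface operators that are expensive themselves,
# not just those that call expensive children
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))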
Advanced Profiling with Trace Export
For more detailed analysis, PyTorch allows you to export trace information that can be visualized in tools like Chrome Tracing or TensorBoard:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, record_function, ProfilerActivity

# Define a more realistic training scenario
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Create model, optimizer and loss function
model = SimpleModel()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Create random inputs and targets
inputs = torch.randn(64, 3, 32, 32)
targets = torch.randint(0, 10, (64,))

# Profile with trace export (CUDA events appear only if the model runs on a GPU)
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True,
    profile_memory=True
) as prof:
    with record_function("training_batch"):
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Print summary
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# Export chrome trace
prof.export_chrome_trace("pytorch_trace.json")
The generated trace file (pytorch_trace.json) can be loaded in Chrome by navigating to chrome://tracing/ and loading the file (newer Chrome versions replace this page with the Perfetto UI at https://ui.perfetto.dev, which reads the same JSON format), providing a visual representation of your model's execution timeline.
Profiling with TensorBoard Integration
PyTorch profiler also integrates with TensorBoard for a more interactive visualization experience:
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Define model
model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# Define fake input
inputs = torch.randn(32, 100)

# Profile with TensorBoard integration; the trace handler writes the
# log files directly, so no SummaryWriter is needed
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler('logs/profiler_example')
) as prof:
    model(inputs)
After running this code, you can start TensorBoard with:
tensorboard --logdir=logs/profiler_example
Then navigate to the "PyTorch Profiler" tab in the TensorBoard interface in your browser to see a detailed breakdown of your model's performance. Note that viewing profiler traces requires the TensorBoard profiler plugin, installable with pip install torch-tb-profiler.
Profiling in Training Loops
For profiling a complete training workflow, use the schedule parameter to collect traces at specific intervals:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, ProfilerActivity, schedule

# Define model
model = nn.Linear(100, 10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Trace handler, called at the end of each "active" recording window
def trace_handler(p):
    output = p.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
    print(output)
    p.export_chrome_trace(f"trace_{p.step_num}.json")

# Schedule: wait 1, warmup 1, active 3, repeat 2
my_schedule = schedule(
    wait=1,
    warmup=1,
    active=3,
    repeat=2)

# Create a profiler
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=my_schedule,
    on_trace_ready=trace_handler
) as prof:
    for step in range(10):
        # Generate random data
        inputs = torch.randn(32, 100)
        targets = torch.randint(0, 10, (32,))

        # Train step
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        # Step the profiler
        prof.step()
The profiler will collect data according to the specified schedule, allowing you to profile specific parts of your training loop without the overhead of profiling every iteration.
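The phases consume profiler steps in a fixed order: wait, then warmup, then active, after which the cycle repeats. The schedule also accepts a skip_first argument for ignoring initial steps (useful when the first iterations are dominated by one-time setup). A sketch of how the 10 steps above map to phases under this schedule:

from torch.profiler import schedule

# With wait=1, warmup=1, active=3, repeat=2, the 10 steps map to:
#   step 0      -> wait   (nothing recorded)
#   step 1      -> warmup (profiler runs, results discarded)
#   steps 2-4   -> active (recorded; trace_handler fires after step 4)
#   steps 5-9   -> second cycle: wait, warmup, active x3
# An optional skip_first=N ignores N steps before the first cycle begins
my_schedule = schedule(skip_first=0, wait=1, warmup=1, active=3, repeat=2)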
Profiling Memory Usage
Memory issues are common in deep learning, especially when dealing with large models or datasets. PyTorch profiler can help identify memory bottlenecks:
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Create a model that uses significant memory
class LargeModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(1000, 5000)
        self.layer2 = nn.Linear(5000, 5000)
        self.layer3 = nn.Linear(5000, 1000)
        self.relu = nn.ReLU()

    def forward(self, x):
        out1 = self.relu(self.layer1(x))
        out2 = self.relu(self.layer2(out1))
        return self.layer3(out2)

model = LargeModel()
inputs = torch.randn(128, 1000)

# Profile with memory tracking
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True
) as prof:
    with record_function("model_inference"):
        model(inputs)

# Print memory usage stats
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
This will show which operations are consuming the most memory during your model's execution.
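For GPU workloads, the profiler's numbers can be cross-checked against PyTorch's CUDA memory statistics. A minimal sketch, assuming a CUDA device is available and reusing the model and inputs from above:

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    gpu_model = model.cuda()
    gpu_inputs = inputs.cuda()
    with torch.no_grad():
        gpu_model(gpu_inputs)
    # memory_allocated: tensors currently held; max_memory_allocated: peak
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
    print(f"Peak:      {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")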
Real-World Application: Profiling a Dataloader
Let's look at a real-world example of profiling a complete deep learning pipeline, including data loading:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from torch.profiler import profile, record_function, ProfilerActivity, schedule

# Define transformations and data loading
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Use a small fake dataset for demonstration (alternatively, download CIFAR10)
trainset = torchvision.datasets.FakeData(
    size=1000,
    image_size=(3, 224, 224),
    num_classes=10,
    transform=transform
)
trainloader = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)

# Define a simple CNN model
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.fc1 = nn.Linear(32 * 56 * 56, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(-1, 32 * 56 * 56)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Trace handler: print a summary for each recorded window
def trace_handler(prof):
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
    # Export a Chrome trace if needed
    # prof.export_chrome_trace(f"trace_{prof.step_num}.json")

# Schedule: wait 1 iter, warmup 1 iter, active 3 iters
prof_schedule = schedule(wait=1, warmup=1, active=3, repeat=1)

# Profile the training loop
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=prof_schedule,
    on_trace_ready=trace_handler,
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    # Training loop
    for i, data in enumerate(trainloader):
        # Get inputs and labels
        inputs, labels = data

        # Forward + backward + optimize
        with record_function("training_batch"):
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            with record_function("backward"):
                loss.backward()
            with record_function("optimizer_step"):
                optimizer.step()

        prof.step()

        # Break after a few iterations for demonstration
        if i >= 5:
            break
This example shows how to profile a complete deep learning pipeline including data loading, forward pass, backward pass, and optimization steps.
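If such a profile shows the training step repeatedly waiting on the DataLoader (large gaps between batches in the trace, or data-fetching calls dominating the table), the loader itself is the bottleneck. A sketch of commonly adjusted DataLoader settings; the best values are hardware-dependent and worth re-profiling:

trainloader = DataLoader(
    trainset,
    batch_size=32,
    shuffle=True,
    num_workers=4,            # more worker processes for decoding/augmentation
    pin_memory=True,          # page-locked memory speeds up CPU-to-GPU copies
    prefetch_factor=2,        # batches each worker prepares in advance
    persistent_workers=True,  # avoid respawning workers every epoch
)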
Understanding Profiler Output and Optimization Tips
When analyzing profiler output, look for:
- Long-running operations: Operations that take a significant amount of total CPU or GPU time
- Excessive memory usage: Operations that allocate large amounts of memory
- Data transfer bottlenecks: Operations that move data between CPU and GPU
- Underutilization: Low GPU utilization can indicate that your code is CPU-bound
Common optimization strategies based on profiler results include:
- Batch size optimization: Adjust batch size to maximize GPU utilization
- Model parallelism: Distribute model layers across multiple GPUs
- Mixed precision training: Use float16 operations where possible (see the sketch after this list)
- Data loading optimization: Use more workers or prefetching in DataLoader
- Custom CUDA kernels: Replace inefficient operations with optimized versions
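As an illustration of the mixed-precision strategy, here is a minimal training-step sketch using torch.cuda.amp; it assumes the model, optimizer, criterion, and data from the earlier examples have been moved to a CUDA device:

scaler = torch.cuda.amp.GradScaler()

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    outputs = model(inputs)              # eligible ops run in float16
    loss = criterion(outputs, targets)
scaler.scale(loss).backward()            # scale the loss to avoid underflow
scaler.step(optimizer)                   # unscales gradients, then steps
scaler.update()                          # adjusts the scale factor for next step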
Summary
PyTorch's profiling tools provide valuable insights into the performance characteristics of your deep learning models. In this tutorial, we've covered:
- Basic profiling with torch.profiler
- Exporting traces for visualization
- TensorBoard integration for interactive analysis
- Profiling training loops with scheduling
- Memory usage tracking
- Real-world application profiling
By systematically profiling your PyTorch code, you can identify and eliminate performance bottlenecks, leading to faster training and inference times and more efficient resource utilization.
Additional Resources and Exercises
Resources
- Official PyTorch Profiler Documentation
- PyTorch Performance Tuning Guide
- TensorBoard Profiler Tutorial
Exercises
- Profile a Custom Model: Profile a model you've been working on and identify the top 3 operations consuming the most time.
- Benchmark Different Batch Sizes: Use the profiler to compare the performance of your model with different batch sizes (8, 16, 32, 64, 128) and determine the optimal size for your hardware.
- Optimize Data Loading: Profile your data loading pipeline and implement improvements (such as adding more workers, using pin_memory=True, or implementing prefetching).
- Memory Optimization: Use the profiler to identify memory-intensive operations in your model and implement techniques to reduce memory usage while maintaining model accuracy.
- Advanced Visualization: Export profiler traces and explore them in Chrome Tracing or TensorBoard to get a deeper understanding of your model's execution flow.