PyTorch Profiling
Introduction
Understanding the performance characteristics of your deep learning models is crucial for optimization. PyTorch provides powerful built-in profiling tools that help identify performance bottlenecks in your code. In this tutorial, we'll explore these capabilities: measuring execution time, memory usage, and operator-level statistics.
Profiling is especially important when working with large-scale deep learning models, as inefficiencies that might go unnoticed in smaller models can lead to significant slowdowns in production environments.
The Basics of PyTorch Profiling
PyTorch offers a profiler module, torch.profiler, that provides comprehensive profiling capabilities. It allows you to:
- Track execution time of operations
- Analyze CPU and GPU utilization
- Visualize the model's execution trace
- Identify bottlenecks in your data loading, model execution, and backward pass
Let's start with the basic usage of the PyTorch profiler:
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Define a simple model
model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)
model.eval()  # Set to evaluation mode

# Create random input data
inputs = torch.randn(32, 100)

# Basic profiling
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        model(inputs)

# Print results
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
Output (will vary depending on your system):
---------------------------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg
---------------------------------  ------------  ------------  ------------  ------------  ------------
                  model_inference        18.85%     152.192us       100.00%     807.266us     807.266us
                   aten::linear_1        12.03%      97.085us        27.31%     220.480us     220.480us
                     aten::linear        10.08%      81.394us        30.98%     250.101us     250.101us
                   aten::linear_2         9.89%      79.845us        19.73%     159.231us     159.231us
              aten::empty_strided         8.11%      65.454us         8.11%      65.454us      21.818us
                       aten::relu         5.64%      45.520us        10.66%      86.081us      43.040us
                     aten::matmul         5.63%      45.489us        13.84%     111.702us      37.234us
                      aten::addmm         5.34%      43.152us        13.18%     106.429us      35.476us
                      aten::empty         3.75%      30.267us         3.75%      30.267us       3.784us
                        aten::bmm         3.05%      24.653us         3.05%      24.653us      24.653us
---------------------------------  ------------  ------------  ------------  ------------  ------------
This table shows the breakdown of time spent in different operations during model inference.
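The key_averages() view can also be grouped and sorted differently. For example, grouping by input shape (possible here because record_shapes=True was set) helps distinguish calls to the same operator on different tensor sizes. A short sketch, reusing the prof object from above:

# Group results by operator input shape (requires record_shapes=True)
print(prof.key_averages(group_by_input_shape=True).table(
    sort_by="cpu_time_total", row_limit=10))

# Sort by self CPU time to surface operators that are expensive themselves,
# not just those that call expensive children
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))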
Advanced Profiling with Trace Export
For more detailed analysis, PyTorch allows you to export trace information that can be visualized in tools like Chrome Tracing or TensorBoard:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, record_function, ProfilerActivity

# Define a more realistic training scenario
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Create model, optimizer and loss function
model = SimpleModel()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Create random inputs and targets
inputs = torch.randn(64, 3, 32, 32)
targets = torch.randint(0, 10, (64,))

# Profile with trace export (CUDA events appear only if the model runs on a GPU)
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True,
    profile_memory=True
) as prof:
    with record_function("training_batch"):
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Print summary
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# Export chrome trace
prof.export_chrome_trace("pytorch_trace.json")
The generated trace file (pytorch_trace.json) can be loaded in Chrome by navigating to chrome://tracing/ and loading the file (newer Chrome versions replace this page with the Perfetto UI at https://ui.perfetto.dev, which reads the same JSON format), providing a visual representation of your model's execution timeline.
Profiling with TensorBoard Integration
PyTorch profiler also integrates with TensorBoard for a more interactive visualization experience:
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Define model
model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# Define fake input
inputs = torch.randn(32, 100)

# Profile with TensorBoard integration; the trace handler writes the
# log files directly, so no SummaryWriter is needed
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler('logs/profiler_example')
) as prof:
    model(inputs)
After running this code, you can start TensorBoard with:
tensorboard --logdir=logs/profiler_example
Then navigate to the "PyTorch Profiler" tab in the TensorBoard interface in your browser to see a detailed breakdown of your model's performance. Note that viewing profiler traces requires the TensorBoard profiler plugin, installable with pip install torch-tb-profiler.
Profiling in Training Loops
For profiling a complete training workflow, use the schedule parameter to collect traces at specific intervals:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, ProfilerActivity, schedule

# Define model
model = nn.Linear(100, 10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Trace handler, called at the end of each "active" recording window
def trace_handler(p):
    output = p.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
    print(output)
    p.export_chrome_trace(f"trace_{p.step_num}.json")

# Schedule: wait 1, warmup 1, active 3, repeat 2
my_schedule = schedule(
    wait=1,
    warmup=1,
    active=3,
    repeat=2)

# Create a profiler
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=my_schedule,
    on_trace_ready=trace_handler
) as prof:
    for step in range(10):
        # Generate random data
        inputs = torch.randn(32, 100)
        targets = torch.randint(0, 10, (32,))

        # Train step
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        # Step the profiler
        prof.step()
The profiler will collect data according to the specified schedule, allowing you to profile specific parts of your training loop without the overhead of profiling every iteration.
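The phases consume profiler steps in a fixed order: wait, then warmup, then active, after which the cycle repeats. The schedule also accepts a skip_first argument for ignoring initial steps (useful when the first iterations are dominated by one-time setup). A sketch of how the 10 steps above map to phases under this schedule:

from torch.profiler import schedule

# With wait=1, warmup=1, active=3, repeat=2, the 10 steps map to:
#   step 0      -> wait   (nothing recorded)
#   step 1      -> warmup (profiler runs, results discarded)
#   steps 2-4   -> active (recorded; trace_handler fires after step 4)
#   steps 5-9   -> second cycle: wait, warmup, active x3
# An optional skip_first=N ignores N steps before the first cycle begins
my_schedule = schedule(skip_first=0, wait=1, warmup=1, active=3, repeat=2)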
Profiling Memory Usage
Memory issues are common in deep learning, especially when dealing with large models or datasets. PyTorch profiler can help identify memory bottlenecks:
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Create a model that uses significant memory
class LargeModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(1000, 5000)
        self.layer2 = nn.Linear(5000, 5000)
        self.layer3 = nn.Linear(5000, 1000)
        self.relu = nn.ReLU()

    def forward(self, x):
        out1 = self.relu(self.layer1(x))
        out2 = self.relu(self.layer2(out1))
        return self.layer3(out2)

model = LargeModel()
inputs = torch.randn(128, 1000)

# Profile with memory tracking
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True
) as prof:
    with record_function("model_inference"):
        model(inputs)

# Print memory usage stats
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
This will show which operations are consuming the most memory during your model's execution.
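For GPU workloads, the profiler's numbers can be cross-checked against PyTorch's CUDA memory statistics. A minimal sketch, assuming a CUDA device is available and reusing the model and inputs from above:

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    gpu_model = model.cuda()
    gpu_inputs = inputs.cuda()
    with torch.no_grad():
        gpu_model(gpu_inputs)
    # memory_allocated: tensors currently held; max_memory_allocated: peak
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
    print(f"Peak:      {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")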
Real-World Application: Profiling a Dataloader
Let's look at a real-world example of profiling a complete deep learning pipeline, including data loading:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from torch.profiler import profile, record_function, ProfilerActivity, schedule

# Define transformations and data loading
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Use a small fake dataset for demonstration (alternatively, download CIFAR10)
trainset = torchvision.datasets.FakeData(
    size=1000,
    image_size=(3, 224, 224),
    num_classes=10,
    transform=transform
)
trainloader = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)

# Define a simple CNN model
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.fc1 = nn.Linear(32 * 56 * 56, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(-1, 32 * 56 * 56)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Trace handler: print a summary for each recorded window
def trace_handler(prof):
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
    # Export a Chrome trace if needed
    # prof.export_chrome_trace(f"trace_{prof.step_num}.json")

# Schedule: wait 1 iter, warmup 1 iter, active 3 iters
prof_schedule = schedule(wait=1, warmup=1, active=3, repeat=1)

# Profile the training loop
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=prof_schedule,
    on_trace_ready=trace_handler,
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    # Training loop
    for i, data in enumerate(trainloader):
        # Get inputs and labels
        inputs, labels = data

        # Forward + backward + optimize
        with record_function("training_batch"):
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            with record_function("backward"):
                loss.backward()
            with record_function("optimizer_step"):
                optimizer.step()

        prof.step()

        # Break after a few iterations for demonstration
        if i >= 5:
            break
This example shows how to profile a complete deep learning pipeline including data loading, forward pass, backward pass, and optimization steps.
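If such a profile shows the training step repeatedly waiting on the DataLoader (large gaps between batches in the trace, or data-fetching calls dominating the table), the loader itself is the bottleneck. A sketch of commonly adjusted DataLoader settings; the best values are hardware-dependent and worth re-profiling:

trainloader = DataLoader(
    trainset,
    batch_size=32,
    shuffle=True,
    num_workers=4,            # more worker processes for decoding/augmentation
    pin_memory=True,          # page-locked memory speeds up CPU-to-GPU copies
    prefetch_factor=2,        # batches each worker prepares in advance
    persistent_workers=True,  # avoid respawning workers every epoch
)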
Understanding Profiler Output and Optimization Tips
When analyzing profiler output, look for:
- Long-running operations: Operations that take a significant amount of total CPU or GPU time
- Excessive memory usage: Operations that allocate large amounts of memory
- Data transfer bottlenecks: Operations that move data between CPU and GPU
- Underutilization: Low GPU utilization can indicate that your code is CPU-bound
Common optimization strategies based on profiler results include:
- Batch size optimization: Adjust batch size to maximize GPU utilization
- Model parallelism: Distribute model layers across multiple GPUs
- Mixed precision training: Use float16 operations where possible (see the sketch after this list)
- Data loading optimization: Use more workers or prefetching in DataLoader
- Custom CUDA kernels: Replace inefficient operations with optimized versions
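As an illustration of the mixed-precision strategy, here is a minimal training-step sketch using torch.cuda.amp; it assumes the model, optimizer, criterion, and data from the earlier examples have been moved to a CUDA device:

scaler = torch.cuda.amp.GradScaler()

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    outputs = model(inputs)              # eligible ops run in float16
    loss = criterion(outputs, targets)
scaler.scale(loss).backward()            # scale the loss to avoid underflow
scaler.step(optimizer)                   # unscales gradients, then steps
scaler.update()                          # adjusts the scale factor for next step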
Summary
PyTorch's profiling tools provide valuable insights into the performance characteristics of your deep learning models. In this tutorial, we've covered:
- Basic profiling with torch.profiler
- Exporting traces for visualization
- TensorBoard integration for interactive analysis
- Profiling training loops with scheduling
- Memory usage tracking
- Real-world application profiling
By systematically profiling your PyTorch code, you can identify and eliminate performance bottlenecks, leading to faster training and inference times and more efficient resource utilization.
Additional Resources and Exercises
Resources
- Official PyTorch Profiler Documentation
- PyTorch Performance Tuning Guide
- TensorBoard Profiler Tutorial
Exercises
- Profile a Custom Model: Profile a model you've been working on and identify the top 3 operations consuming the most time.
- Benchmark Different Batch Sizes: Use the profiler to compare the performance of your model with different batch sizes (8, 16, 32, 64, 128) and determine the optimal size for your hardware.
- Optimize Data Loading: Profile your data loading pipeline and implement improvements (such as adding more workers, using pin_memory=True, or implementing prefetching).
- Memory Optimization: Use the profiler to identify memory-intensive operations in your model and implement techniques to reduce memory usage while maintaining model accuracy.
- Advanced Visualization: Export profiler traces and explore them in Chrome Tracing or TensorBoard to get a deeper understanding of your model's execution flow.