
PyTorch Autograd Profiling

When developing deep learning models with PyTorch, understanding performance bottlenecks is crucial for optimization. PyTorch provides powerful profiling tools that help you analyze where your model spends most of its time during both forward and backward passes. In this tutorial, you'll learn how to use PyTorch's autograd profiling tools to identify performance issues and optimize your models.

Introduction to PyTorch Profiling

PyTorch's profiling utilities help you measure:

  • The execution time of operations
  • Memory consumption
  • CUDA kernel launches
  • Stack traces of operations

These insights can help you pinpoint which parts of your model or training pipeline need optimization. The main tools we'll explore include:

  1. torch.autograd.profiler.profile
  2. torch.autograd.profiler.record_function
  3. torch.profiler (the newer, more comprehensive API)

Let's dive into each of these tools with practical examples.

Basic Profiling with torch.autograd.profiler.profile

The profile context manager is the simplest way to start profiling your PyTorch code:

python
import torch
import torch.nn as nn
from torch.autograd import profiler

# Define a simple model
model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# Create random input data
inputs = torch.randn(32, 100)

# Profile both forward and backward pass
with profiler.profile(with_stack=True, profile_memory=True) as prof:
    outputs = model(inputs)
    loss = outputs.sum()
    loss.backward()

# Print the report
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

Output:

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
Name                                 Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
aten::linear_1                            3.21%     162.675us        54.12%       2.740ms       2.740ms       0.000us         0.00%       0.000us        0.000us             1
aten::add_                                3.86%     195.418us        35.32%       1.787ms     101.399us       0.000us         0.00%       0.000us        0.000us            18
aten::matmul                              3.12%     158.053us        31.48%       1.594ms     531.238us       0.000us         0.00%       0.000us        0.000us             3
aten::_to_copy_                           1.82%      91.957us        28.38%       1.437ms      68.407us       0.000us         0.00%       0.000us        0.000us            21
aten::empty_like                          1.21%      61.223us        26.56%       1.344ms      74.681us       0.000us         0.00%       0.000us        0.000us            18
aten::addmm                              28.09%       1.422ms        28.09%       1.422ms     473.865us       0.000us         0.00%       0.000us        0.000us             3
aten::t_strided                           2.22%     112.224us        25.71%       1.301ms     100.088us       0.000us         0.00%       0.000us        0.000us            13
aten::empty_like_                         5.33%     269.540us        25.35%       1.283ms     107.634us       0.000us         0.00%       0.000us        0.000us            12
aten::resize_                            17.19%     869.910us        17.19%     869.910us      21.748us       0.000us         0.00%       0.000us        0.000us            40
aten::zero_                               4.25%     214.839us        14.94%     756.142us      31.506us       0.000us         0.00%       0.000us        0.000us            24
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------

This report shows:

  • Name: Operation name
  • Self CPU %: Percentage of time spent in this operation (excluding child operations)
  • CPU total %: Percentage of time including child operations
  • # of Calls: How many times this operation was called
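
You can also group the averaged statistics by input shape, which helps reveal which tensor sizes dominate the runtime. This requires passing record_shapes=True when profiling; here's a minimal sketch reusing the imports, model, and inputs from the example above:

python
# Re-run profiling with shape recording enabled
with profiler.profile(record_shapes=True) as prof:
    outputs = model(inputs)
    outputs.sum().backward()

# Group the averages by (operator, input shapes)
print(prof.key_averages(group_by_input_shape=True).table(
    sort_by="cpu_time_total", row_limit=10))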

Profiling with User Annotations

You can add custom annotations to mark specific sections of your code using record_function:

python
import torch
import torch.nn as nn
from torch.autograd import profiler

model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

inputs = torch.randn(32, 100)

with profiler.profile() as prof:
    with profiler.record_function("model_forward"):
        outputs = model(inputs)

    with profiler.record_function("loss_calculation"):
        loss = outputs.sum()

    with profiler.record_function("backward_pass"):
        loss.backward()

# Print events with our custom labels
print(prof.key_averages().table(sort_by="cpu_time_total"))

Output:

---------------------------------  ------------  ------------  ------------  ------------  ------------
Name                                 Self CPU %      Self CPU   CPU total %     CPU total    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------
backward_pass                            19.11%     149.511us       100.00%     782.461us             1
model_forward                             2.34%      18.283us        58.71%     459.466us             1
loss_calculation                         22.19%     173.615us        22.19%     173.615us             1
...
---------------------------------  ------------  ------------  ------------  ------------  ------------

Profiling GPU Operations

If you're using CUDA, you can also profile GPU operations by setting use_cuda=True (note that recent PyTorch releases deprecate this flag in favor of the torch.profiler API covered below):

python
import torch
import torch.nn as nn
from torch.autograd import profiler

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a model and move it to the device
model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
).to(device)

# Create input data on the device
inputs = torch.randn(32, 100, device=device)

# Profile with CUDA support
with profiler.profile(use_cuda=True) as prof:
    outputs = model(inputs)
    loss = outputs.sum()
    loss.backward()

# Print the report
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Output (when run on a CUDA-enabled device):

---------------------------------  ------------  ------------  ------------  ------------  -------------  ------------
Name                                Self CUDA %     Self CUDA  CUDA total %    CUDA total  CUDA time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  -------------  ------------
aten::addmm                              84.73%       3.464ms        84.73%       3.464ms        1.155ms             3
aten::relu                                8.26%     337.728us         8.26%     337.728us      168.864us             2
aten::copy_                               5.41%     221.184us         5.41%     221.184us       73.728us             3
aten::sum                                 0.89%      36.352us         0.89%      36.352us       36.352us             1
...
---------------------------------  ------------  ------------  ------------  ------------  -------------  ------------

Advanced Profiling with torch.profiler

PyTorch provides a newer, more comprehensive profiling API, torch.profiler (introduced in PyTorch 1.8), with enhanced capabilities:

python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Define a model
model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# Move to CUDA if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = torch.randn(32, 100, device=device)

# Profile both CPU and CUDA activities
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    with record_function("model_inference"):
        outputs = model(inputs)
        loss = outputs.sum()
        loss.backward()

# Print a summary
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# Export trace to Chrome trace format
prof.export_chrome_trace("trace.json")
print("Trace exported to trace.json - view it at chrome://tracing/")

Profiling a Training Loop

Let's see how to profile a real training loop:

python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, record_function, ProfilerActivity, schedule

# Define a model
model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# Move to CUDA if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Define profiler schedule: skip 1 batch, warm up for 1, record 3 active batches, and repeat the cycle twice
profiler_schedule = schedule(wait=1, warmup=1, active=3, repeat=2)

# Activities to profile
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

# Start profiler
with profile(
    activities=activities,
    schedule=profiler_schedule,
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/profiler'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    # Train for 10 batches
    for step in range(10):
        # Generate synthetic data
        inputs = torch.randn(32, 100, device=device)
        targets = torch.randn(32, 10, device=device)

        with record_function("forward_pass"):
            outputs = model(inputs)
            loss = nn.functional.mse_loss(outputs, targets)

        with record_function("backward_pass"):
            optimizer.zero_grad()
            loss.backward()

        with record_function("optimizer_step"):
            optimizer.step()

        # Step the profiler so the schedule advances
        prof.step()

# Print summary of the collected traces
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
print("\nTo visualize profiler data in TensorBoard, run: tensorboard --logdir=./log")

Once you've collected the profiling data, you can visualize it in TensorBoard:

bash
pip install tensorboard torch_tb_profiler
tensorboard --logdir=./log

Then navigate to http://localhost:6006/#pytorch_profiler to see detailed profiling visualizations.

Identifying and Resolving Bottlenecks

After profiling your model, look for:

  1. Operations with high CPU/GPU time: These are candidates for optimization
  2. Excessive memory usage: May cause out-of-memory errors on larger datasets
  3. Operations with many calls: Consider if they can be batched
  4. CPU-GPU synchronization points: These can create processing bottlenecks (see the sketch below)
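
On the last point: a common hidden synchronization is reading a CUDA tensor's value on the CPU, e.g. via .item(), which blocks until all queued GPU work finishes. Here's a minimal sketch (CUDA-only; the tensor size is arbitrary) that makes such a stall visible in the profile:

python
import torch
from torch.profiler import profile, ProfilerActivity

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        y = x @ x               # queued asynchronously on the GPU
        total = y.sum().item()  # forces the CPU to wait for the GPU
    # The stall shows up as CPU time on the value-reading op and the device-to-host copy
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))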

Common optimization techniques include:

  • Using torch.compile() (in PyTorch 2.0+) for automatic optimization (see the sketch after this list)
  • Increasing batch sizes (if memory allows)
  • Reducing precision (e.g., using torch.float16 instead of torch.float32)
  • Optimizing data loading with proper prefetching
  • Using more efficient model architectures
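
As an example of the first technique, here is a minimal sketch (assuming PyTorch 2.0+) that profiles a compiled model. The first calls trigger compilation, so warm up before measuring steady-state performance:

python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(100, 200), nn.ReLU(), nn.Linear(200, 10))
inputs = torch.randn(32, 100)

compiled_model = torch.compile(model)  # PyTorch 2.0+

# Warm up: the first calls pay the one-time compilation cost
for _ in range(3):
    compiled_model(inputs)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    compiled_model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))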

Profiling Memory Usage

Memory profiling is critical for large models:

python
import torch
import torch.nn as nn
from torch.autograd import profiler

model = nn.Sequential(
    nn.Linear(1000, 2000),
    nn.ReLU(),
    nn.Linear(2000, 500),
    nn.ReLU(),
    nn.Linear(500, 100)
)

inputs = torch.randn(128, 1000)

with profiler.profile(profile_memory=True) as prof:
    outputs = model(inputs)
    loss = outputs.sum()
    loss.backward()

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

Output:

---------------------------------  ------------------  ----------------  ------------  ------------  -------------
Name                               Self CPU Mem Alloc  Self CPU Mem Ret   CPU Mem Avg    # of Calls  Memory Format
---------------------------------  ------------------  ----------------  ------------  ------------  -------------
aten::empty                                 205.33 Mb        -205.33 Mb       2.15 Mb            96        Strided
aten::empty_like                             25.87 Mb         -25.87 Mb       1.44 Mb            18        Strided
aten::zeros                                   7.63 Mb          -7.63 Mb       2.54 Mb             3        Strided
...
---------------------------------  ------------------  ----------------  ------------  ------------  -------------
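
On CUDA devices, the profiler's memory columns can be cross-checked against PyTorch's allocator statistics. A small sketch, assuming the model and inputs above have been moved to the GPU:

python
if torch.cuda.is_available():
    model_gpu = model.to("cuda")
    inputs_gpu = inputs.to("cuda")

    torch.cuda.reset_peak_memory_stats()
    loss = model_gpu(inputs_gpu).sum()
    loss.backward()

    # Peak memory used by tensors vs. reserved by the caching allocator
    print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
    print(f"Peak reserved:  {torch.cuda.max_memory_reserved() / 1e6:.1f} MB")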

Summary

PyTorch's profiling tools provide powerful insights into model performance:

  • Basic profiling with torch.autograd.profiler.profile helps identify time-consuming operations
  • Custom annotations with record_function let you mark specific code sections for analysis
  • Advanced profiling with torch.profiler offers comprehensive analysis with TensorBoard integration
  • Memory profiling helps identify memory bottlenecks

By systematically analyzing the execution patterns of your models, you can identify bottlenecks and apply targeted optimizations to make your PyTorch code faster and more memory-efficient.

Exercises

  1. Exercise 1: Profile a CNN model on a vision dataset and identify the most time-consuming operations.

  2. Exercise 2: Compare the performance of the same model running with different batch sizes and analyze how execution time scales.

  3. Exercise 3: Profile a model with different data types (float32 vs float16) and measure the performance difference.

  4. Exercise 4: Use the profiler to analyze the performance impact of different optimization strategies like frozen layers or model pruning.

  5. Challenge Exercise: Profile a transformer model like BERT and identify optimization opportunities to reduce its memory footprint or inference time.

By mastering PyTorch's profiling tools, you'll be able to build more efficient models that train faster and use resources more effectively!


