PyTorch Benchmarking

Introduction

Benchmarking is a critical step in developing efficient deep learning models with PyTorch. As models grow in complexity, understanding their performance characteristics becomes essential for optimization. This guide will walk you through various techniques for benchmarking your PyTorch code, from simple timing measurements to advanced profiling tools that help identify bottlenecks in your neural networks.

Whether you're training models on a laptop or deploying them in production, benchmarking helps you make informed decisions about architecture choices, batch sizes, and hardware requirements. Let's dive into how you can systematically measure and improve the performance of your PyTorch models.

Why Benchmark Your PyTorch Code?

Before we jump into the how, let's understand the why:

  • Resource planning: Understand the compute and memory requirements of your models
  • Bottleneck identification: Find slow operations that might benefit from optimization
  • Architecture comparison: Compare different model architectures objectively
  • Hardware selection: Make informed decisions about which hardware to use
  • Deployment preparation: Ensure your model meets performance requirements before deployment

Basic Timing Measurements

Using Python's time Module

The simplest way to benchmark your PyTorch code is to use Python's built-in time module:

python
import time
import torch

# Create a simple model
model = torch.nn.Linear(1000, 1000)
input_tensor = torch.randn(100, 1000)

# Time the forward pass
start_time = time.time()
output = model(input_tensor)
end_time = time.time()

print(f"Forward pass took {(end_time - start_time) * 1000:.2f} ms")

Output:

Forward pass took 5.24 ms

Using PyTorch's benchmark Timer

PyTorch provides a more specialized timing utility, torch.utils.benchmark.Timer, which performs warm-up runs and synchronizes CUDA calls for you:

python
import torch
from torch.utils.benchmark import Timer

def benchmark_function():
    model = torch.nn.Linear(1000, 1000)
    input_tensor = torch.randn(100, 1000)
    return model(input_tensor)

timer = Timer(
    stmt="benchmark_function()",
    globals={"benchmark_function": benchmark_function}
)

print(timer.timeit(100)) # Run 100 times

Output:

benchmark_function
5.43 ms
1 measurement, 100 runs per measurement, 1 thread
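
If you would rather not guess the number of runs, Timer also provides blocked_autorange(), which keeps measuring until it has collected enough data and returns a Measurement with summary statistics. The snippet below is a minimal sketch of that usage (the min_run_time value is just an example):

python
import torch
from torch.utils.benchmark import Timer

# Same workload as above, but let Timer decide how many runs it needs
model = torch.nn.Linear(1000, 1000)
input_tensor = torch.randn(100, 1000)

timer = Timer(
    stmt="model(input_tensor)",
    globals={"model": model, "input_tensor": input_tensor}
)

# blocked_autorange() measures for at least min_run_time seconds and
# returns a Measurement object with mean/median statistics
measurement = timer.blocked_autorange(min_run_time=1)
print(measurement)
print(f"Median: {measurement.median * 1e3:.2f} ms")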

Comparing Multiple Implementations

Let's compare different batch sizes to see how they affect performance:

python
import torch
from torch.utils.benchmark import Timer

def benchmark_batch_size(batch_size):
    model = torch.nn.Linear(1000, 1000).cuda()  # assumes a CUDA-capable GPU
    input_tensor = torch.randn(batch_size, 1000).cuda()
    return model(input_tensor)

batch_sizes = [1, 16, 64, 256, 1024]
results = []

for batch_size in batch_sizes:
    timer = Timer(
        stmt="benchmark_batch_size(batch_size)",
        globals={
            "benchmark_batch_size": benchmark_batch_size,
            "batch_size": batch_size
        }
    )
    results.append((batch_size, timer.timeit(50)))

# Print the median time per run for each batch size
for batch_size, measurement in results:
    print(f"Batch size {batch_size}: {measurement.median * 1e3:.2f} ms")

Output:

Batch size 1: 0.34 ms
Batch size 16: 0.47 ms
Batch size 64: 0.96 ms
Batch size 256: 3.28 ms
Batch size 1024: 12.75 ms
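
When you collect several measurements like this, torch.utils.benchmark also ships a Compare helper that renders them as one aligned table. Below is a sketch of how the batch-size sweep above could use it; the label, sub_label, and description strings are purely illustrative:

python
import torch
from torch.utils.benchmark import Timer, Compare

model = torch.nn.Linear(1000, 1000).cuda()

measurements = []
for batch_size in [1, 16, 64, 256, 1024]:
    input_tensor = torch.randn(batch_size, 1000).cuda()
    timer = Timer(
        stmt="model(input_tensor)",
        globals={"model": model, "input_tensor": input_tensor},
        label="Linear forward",
        sub_label=f"batch_size={batch_size}",
        description="fp32",
    )
    measurements.append(timer.blocked_autorange(min_run_time=0.5))

# Compare prints all measurements as a single aligned table
Compare(measurements).print()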

Advanced Benchmarking with PyTorch Profiler

PyTorch comes with a powerful profiler that helps identify bottlenecks in both CPU and GPU operations.

Basic Profiling

python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(100, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10)
).cuda()

input_tensor = torch.randn(32, 100).cuda()

# Use the profiler to measure performance
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("model_inference"):
        model(input_tensor)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Output:

-----------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg   Self CUDA %     Self CUDA  CUDA total %    CUDA total  CUDA time avg    # of Calls
-----------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
  model_inference        15.73%      38.588us        100.0%     245.208us     245.208us         0.00%       0.000us        100.0%       1.153ms        1.153ms             1
     aten::linear        12.87%      31.575us        73.47%     180.157us      60.052us        96.06%       1.108ms        97.47%       1.124ms      374.700us             3
   aten::_to_copy         6.86%      16.833us         6.86%      16.833us      16.833us         0.00%       0.000us         0.00%       0.000us        0.000us             1
  aten::transpose         5.92%      14.522us        10.77%      26.416us       8.805us         0.00%       0.000us         0.00%       0.000us        0.000us             3
        aten::add         3.84%       9.409us         3.84%       9.409us       4.704us         0.09%       1.024us         0.09%       1.024us        0.512us             2
         aten::to         3.48%       8.541us        10.34%      25.374us      25.374us         0.00%       0.000us         0.00%       0.000us        0.000us             1
       aten::relu         1.73%       4.247us         3.97%       9.747us       4.873us         1.41%      16.287us         1.41%      16.287us        8.144us             2
      aten::empty         1.71%       4.205us         1.71%       4.205us       1.052us         0.00%       0.000us         0.00%       0.000us        0.000us             4
-----------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------

Tracing with Chrome DevTools

The PyTorch profiler can export data in a format compatible with Chrome's tracing tool, allowing for detailed visualization:

python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=3),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(kernel_size=2),
    torch.nn.Conv2d(64, 128, kernel_size=3),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(kernel_size=2),
    torch.nn.Flatten(),
    torch.nn.Linear(128 * 5 * 5, 10)
).cuda()

input_tensor = torch.randn(32, 3, 28, 28).cuda()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
    record_shapes=True
) as prof:
    model(input_tensor)

# Export the trace to be viewed in chrome://tracing
prof.export_chrome_trace("pytorch_trace.json")
print("Trace exported to pytorch_trace.json - View in chrome://tracing")

Memory Profiling

Monitoring memory usage is crucial for large models:

python
import torch

# Reset CUDA memory statistics so the peak numbers reflect only this run
torch.cuda.reset_peak_memory_stats()
torch.cuda.empty_cache()

# Create a large model
model = torch.nn.Sequential(
    torch.nn.Linear(1000, 2000),
    torch.nn.ReLU(),
    torch.nn.Linear(2000, 2000),
    torch.nn.ReLU(),
    torch.nn.Linear(2000, 1000)
).cuda()

print(f"Initial memory allocated: {torch.cuda.memory_allocated() / 1e6:.2f} MB")

# Run a forward pass
input_tensor = torch.randn(128, 1000).cuda()
output = model(input_tensor)

print(f"Memory allocated after forward pass: {torch.cuda.memory_allocated() / 1e6:.2f} MB")
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1e6:.2f} MB")

# Run backward pass
loss = output.sum()
loss.backward()

print(f"Memory allocated after backward pass: {torch.cuda.memory_allocated() / 1e6:.2f} MB")
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1e6:.2f} MB")

Output:

Initial memory allocated: 16.78 MB
Memory allocated after forward pass: 33.55 MB
Max memory allocated: 49.21 MB
Memory allocated after backward pass: 49.21 MB
Max memory allocated: 82.35 MB
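
The counters above report totals for the whole process. If you need a per-operator breakdown, the profiler from the previous section also accepts profile_memory=True. The snippet below is a minimal sketch reusing the model and input_tensor defined above:

python
import torch
from torch.profiler import profile, ProfilerActivity

# profile_memory=True records allocations per operator, so you can see
# which operations account for most of the memory traffic
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True
) as prof:
    output = model(input_tensor)
    output.sum().backward()

print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))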

Benchmarking Training Loops

A complete training loop benchmark helps understand real-world performance:

python
import torch
import time

# Create a simple model and dataset
model = torch.nn.Sequential(
    torch.nn.Linear(784, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10)
).cuda()

# Generate synthetic data
batch_size = 64
num_batches = 100
data = torch.randn(batch_size * num_batches, 784).cuda()
targets = torch.randint(0, 10, (batch_size * num_batches,)).cuda()

# Prepare optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()

# Benchmark training loop
torch.cuda.synchronize()  # make sure setup work on the GPU has finished
start_time = time.time()

for i in range(num_batches):
    batch_start = i * batch_size
    batch_end = batch_start + batch_size

    inputs = data[batch_start:batch_end]
    labels = targets[batch_start:batch_end]

    # Zero the gradients
    optimizer.zero_grad()

    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, labels)

    # Backward pass and optimize
    loss.backward()
    optimizer.step()

torch.cuda.synchronize()  # wait for all queued GPU work before stopping the timer
end_time = time.time()
total_time = end_time - start_time
print(f"Total training time: {total_time:.2f} seconds")
print(f"Average time per batch: {total_time / num_batches * 1000:.2f} ms")
print(f"Samples per second: {batch_size * num_batches / total_time:.2f}")

Output:

Total training time: 0.58 seconds
Average time per batch: 5.82 ms
Samples per second: 11004.59

Real-world Benchmarking Example: ResNet on ImageNet

Let's simulate benchmarking a ResNet model on ImageNet-sized data:

python
import torch
import torchvision.models as models
import time

# Instantiate a ResNet-50 (random weights are fine since we only measure speed)
model = models.resnet50(pretrained=False).cuda()
model.eval() # Set to evaluation mode

# Create ImageNet-sized input (batch_size, 3, 224, 224)
batch_size = 64
input_tensor = torch.randn(batch_size, 3, 224, 224).cuda()

# Warm-up runs
for _ in range(10):
    with torch.no_grad():
        _ = model(input_tensor)

# Synchronize before starting timer
torch.cuda.synchronize()

# Benchmark inference
num_runs = 100
start_time = time.time()

with torch.no_grad():
    for _ in range(num_runs):
        output = model(input_tensor)
        torch.cuda.synchronize()  # Wait for GPU to finish

end_time = time.time()

# Calculate statistics
total_time = end_time - start_time
time_per_batch = total_time / num_runs
images_per_second = batch_size / time_per_batch

print(f"ResNet-50 batch inference performance:")
print(f" Batch size: {batch_size}")
print(f" Total time for {num_runs} batches: {total_time:.2f} seconds")
print(f" Average time per batch: {time_per_batch * 1000:.2f} ms")
print(f" Images per second: {images_per_second:.2f}")

Output:

ResNet-50 batch inference performance:
Batch size: 64
Total time for 100 batches: 5.17 seconds
Average time per batch: 51.71 ms
Images per second: 1238.47

Best Practices for Benchmarking

  1. Warm-up runs: Always perform a few warm-up iterations before timing to avoid measuring initialization costs.

  2. Multiple runs: Don't rely on a single measurement; average over multiple runs for statistical significance.

  3. GPU synchronization: Use torch.cuda.synchronize() before and after timed regions so that all queued GPU work has actually finished; the helper sketched after this list combines warm-up, repetition, and synchronization.

  4. Control environment: Run benchmarks on an otherwise idle system to minimize external influences.

  5. Compare apples to apples: When comparing implementations, ensure they're solving the exact same problem.

  6. Real-world data: Use realistic data sizes and distributions for meaningful results.

  7. Consistent hardware: Note the hardware specifications when sharing benchmark results.
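
To make the first three points concrete, here is a small, hypothetical helper (the name time_gpu_fn and its defaults are ours, not part of any PyTorch API) that wraps warm-up, repetition, and synchronization around an arbitrary GPU workload:

python
import time
import torch

def time_gpu_fn(fn, warmup=10, runs=50):
    """Hypothetical helper: time a CUDA workload with warm-up, repetition,
    and explicit synchronization; returns the mean time per run in ms."""
    for _ in range(warmup):       # warm-up: exclude one-time setup costs
        fn()
    torch.cuda.synchronize()      # make sure warm-up work has finished

    start = time.perf_counter()
    for _ in range(runs):
        fn()
    torch.cuda.synchronize()      # wait for all timed work to finish
    end = time.perf_counter()

    return (end - start) / runs * 1000

# Example usage
model = torch.nn.Linear(1000, 1000).cuda()
x = torch.randn(256, 1000).cuda()
print(f"{time_gpu_fn(lambda: model(x)):.2f} ms per forward pass")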

Common Benchmarking Pitfalls

  • Forgetting to synchronize GPU operations: GPU operations are asynchronous, so timing without synchronization gives inaccurate results.
  • Not accounting for data loading: In real-world scenarios, data loading can be a significant bottleneck.
  • Overlooking memory transfers: Moving data between CPU and GPU can be expensive; measure transfers separately from compute, as in the sketch after this list.
  • Benchmarking with small data: Results with small tensors might not scale linearly to larger ones.
  • Ignoring variability: Performance can vary between runs; statistical analysis helps.
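
To see the memory-transfer pitfall concretely, you can time the host-to-device copy and the GPU compute separately with the Timer from earlier. The layer and tensor sizes below are purely illustrative:

python
import torch
from torch.utils.benchmark import Timer

cpu_tensor = torch.randn(64, 3, 224, 224)  # an ImageNet-sized batch on the CPU
model = torch.nn.Conv2d(3, 64, kernel_size=3).cuda()

# Time the host-to-device copy on its own ...
transfer = Timer(
    stmt="cpu_tensor.cuda()",
    globals={"cpu_tensor": cpu_tensor}
)

# ... and the compute on data that is already resident on the GPU
gpu_tensor = cpu_tensor.cuda()
compute = Timer(
    stmt="model(gpu_tensor)",
    globals={"model": model, "gpu_tensor": gpu_tensor}
)

print(transfer.blocked_autorange(min_run_time=0.5))
print(compute.blocked_autorange(min_run_time=0.5))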

Summary

Benchmarking your PyTorch models is essential for optimizing performance and making informed decisions about your deep learning workflows. In this guide, we've covered:

  • Basic timing using Python's time module and PyTorch's Timer
  • Advanced profiling with PyTorch's built-in profiler
  • Memory consumption analysis
  • Complete training loop benchmarking
  • Real-world benchmarking examples
  • Best practices and common pitfalls

By systematically benchmarking your models, you can identify bottlenecks, compare different implementations, and ultimately deliver more efficient deep learning solutions.

Exercises

  1. Benchmark the performance difference between using model.eval() with torch.no_grad() versus regular inference mode.

  2. Compare the training speed of the same model architecture using different optimizers (SGD, Adam, AdamW).

  3. Profile a CNN model and identify which layers consume the most computation time.

  4. Benchmark the impact of different batch sizes on both memory usage and throughput.

  5. Use the PyTorch profiler to compare the performance of your model on CPU versus GPU for different input sizes.


