PyTorch Benchmarking

Introduction

Benchmarking is a critical step in developing efficient deep learning models with PyTorch. As models grow in complexity, understanding their performance characteristics becomes essential for optimization. This guide will walk you through various techniques for benchmarking your PyTorch code, from simple timing measurements to advanced profiling tools that help identify bottlenecks in your neural networks.

Whether you're training models on a laptop or deploying them in production, benchmarking helps you make informed decisions about architecture choices, batch sizes, and hardware requirements. Let's dive into how you can systematically measure and improve the performance of your PyTorch models.

Why Benchmark Your PyTorch Code?

Before we jump into the how, let's understand the why:

  • Resource planning: Understand the compute and memory requirements of your models
  • Bottleneck identification: Find slow operations that might benefit from optimization
  • Architecture comparison: Compare different model architectures objectively
  • Hardware selection: Make informed decisions about which hardware to use
  • Deployment preparation: Ensure your model meets performance requirements before deployment

Basic Timing Measurements

Using Python's time Module

The simplest way to benchmark your PyTorch code is to use Python's built-in time module:

python
import time
import torch

# Create a simple model
model = torch.nn.Linear(1000, 1000)
input_tensor = torch.randn(100, 1000)

# Time the forward pass
start_time = time.time()
output = model(input_tensor)
end_time = time.time()

print(f"Forward pass took {(end_time - start_time) * 1000:.2f} ms")

Output:

Forward pass took 5.24 ms

Using PyTorch's benchmark Timer

PyTorch provides a more specialized timing utility, torch.utils.benchmark.Timer, which performs warm-up runs and synchronizes CUDA calls for you:

python
import torch
from torch.utils.benchmark import Timer

def benchmark_function():
    model = torch.nn.Linear(1000, 1000)
    input_tensor = torch.randn(100, 1000)
    return model(input_tensor)

timer = Timer(
    stmt="benchmark_function()",
    globals={"benchmark_function": benchmark_function}
)

print(timer.timeit(100)) # Run 100 times

Output:

benchmark_function
5.43 ms
1 measurement, 100 runs per measurement, 1 thread
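
If you would rather not guess the number of runs, Timer also provides blocked_autorange(), which keeps measuring until it has collected enough data and returns a Measurement with summary statistics. The snippet below is a minimal sketch of that usage (the min_run_time value is just an example):

python
import torch
from torch.utils.benchmark import Timer

# Same workload as above, but let Timer decide how many runs it needs
model = torch.nn.Linear(1000, 1000)
input_tensor = torch.randn(100, 1000)

timer = Timer(
    stmt="model(input_tensor)",
    globals={"model": model, "input_tensor": input_tensor}
)

# blocked_autorange() measures for at least min_run_time seconds and
# returns a Measurement object with mean/median statistics
measurement = timer.blocked_autorange(min_run_time=1)
print(measurement)
print(f"Median: {measurement.median * 1e3:.2f} ms")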

Comparing Multiple Implementations

Let's compare different batch sizes to see how they affect performance:

python
import torch
from torch.utils.benchmark import Timer

def benchmark_batch_size(batch_size):
    model = torch.nn.Linear(1000, 1000).cuda()  # assumes a CUDA-capable GPU
    input_tensor = torch.randn(batch_size, 1000).cuda()
    return model(input_tensor)

batch_sizes = [1, 16, 64, 256, 1024]
results = []

for batch_size in batch_sizes:
    timer = Timer(
        stmt="benchmark_batch_size(batch_size)",
        globals={
            "benchmark_batch_size": benchmark_batch_size,
            "batch_size": batch_size
        }
    )
    results.append((batch_size, timer.timeit(50)))

# Print the median time per run for each batch size
for batch_size, measurement in results:
    print(f"Batch size {batch_size}: {measurement.median * 1e3:.2f} ms")

Output:

Batch size 1: 0.34 ms
Batch size 16: 0.47 ms
Batch size 64: 0.96 ms
Batch size 256: 3.28 ms
Batch size 1024: 12.75 ms
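
When you collect several measurements like this, torch.utils.benchmark also ships a Compare helper that renders them as one aligned table. Below is a sketch of how the batch-size sweep above could use it; the label, sub_label, and description strings are purely illustrative:

python
import torch
from torch.utils.benchmark import Timer, Compare

model = torch.nn.Linear(1000, 1000).cuda()

measurements = []
for batch_size in [1, 16, 64, 256, 1024]:
    input_tensor = torch.randn(batch_size, 1000).cuda()
    timer = Timer(
        stmt="model(input_tensor)",
        globals={"model": model, "input_tensor": input_tensor},
        label="Linear forward",
        sub_label=f"batch_size={batch_size}",
        description="fp32",
    )
    measurements.append(timer.blocked_autorange(min_run_time=0.5))

# Compare prints all measurements as a single aligned table
Compare(measurements).print()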

Advanced Benchmarking with PyTorch Profiler

PyTorch comes with a powerful profiler that helps identify bottlenecks in both CPU and GPU operations.

Basic Profiling

python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(100, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10)
).cuda()

input_tensor = torch.randn(32, 100).cuda()

# Use the profiler to measure performance
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("model_inference"):
        model(input_tensor)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Output:

-----------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg   Self CUDA %     Self CUDA  CUDA total %    CUDA total  CUDA time avg    # of Calls
-----------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------
  model_inference        15.73%      38.588us        100.0%     245.208us     245.208us         0.00%       0.000us        100.0%       1.153ms        1.153ms             1
     aten::linear        12.87%      31.575us        73.47%     180.157us      60.052us        96.06%       1.108ms        97.47%       1.124ms      374.700us             3
   aten::_to_copy         6.86%      16.833us         6.86%      16.833us      16.833us         0.00%       0.000us         0.00%       0.000us        0.000us             1
  aten::transpose         5.92%      14.522us        10.77%      26.416us       8.805us         0.00%       0.000us         0.00%       0.000us        0.000us             3
        aten::add         3.84%       9.409us         3.84%       9.409us       4.704us         0.09%       1.024us         0.09%       1.024us        0.512us             2
         aten::to         3.48%       8.541us        10.34%      25.374us      25.374us         0.00%       0.000us         0.00%       0.000us        0.000us             1
       aten::relu         1.73%       4.247us         3.97%       9.747us       4.873us         1.41%      16.287us         1.41%      16.287us        8.144us             2
      aten::empty         1.71%       4.205us         1.71%       4.205us       1.052us         0.00%       0.000us         0.00%       0.000us        0.000us             4
-----------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------  ------------

Tracing with Chrome DevTools

The PyTorch profiler can export data in a format compatible with Chrome's tracing tool, allowing for detailed visualization:

python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=3),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(kernel_size=2),
    torch.nn.Conv2d(64, 128, kernel_size=3),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(kernel_size=2),
    torch.nn.Flatten(),
    torch.nn.Linear(128 * 5 * 5, 10)
).cuda()

input_tensor = torch.randn(32, 3, 28, 28).cuda()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
    record_shapes=True
) as prof:
    model(input_tensor)

# Export the trace to be viewed in chrome://tracing
prof.export_chrome_trace("pytorch_trace.json")
print("Trace exported to pytorch_trace.json - View in chrome://tracing")

Memory Profiling

Monitoring memory usage is crucial for large models:

python
import torch

# Reset CUDA memory statistics so the peak numbers reflect only this run
torch.cuda.reset_peak_memory_stats()
torch.cuda.empty_cache()

# Create a large model
model = torch.nn.Sequential(
    torch.nn.Linear(1000, 2000),
    torch.nn.ReLU(),
    torch.nn.Linear(2000, 2000),
    torch.nn.ReLU(),
    torch.nn.Linear(2000, 1000)
).cuda()

print(f"Initial memory allocated: {torch.cuda.memory_allocated() / 1e6:.2f} MB")

# Run a forward pass
input_tensor = torch.randn(128, 1000).cuda()
output = model(input_tensor)

print(f"Memory allocated after forward pass: {torch.cuda.memory_allocated() / 1e6:.2f} MB")
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1e6:.2f} MB")

# Run backward pass
loss = output.sum()
loss.backward()

print(f"Memory allocated after backward pass: {torch.cuda.memory_allocated() / 1e6:.2f} MB")
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1e6:.2f} MB")

Output:

Initial memory allocated: 16.78 MB
Memory allocated after forward pass: 33.55 MB
Max memory allocated: 49.21 MB
Memory allocated after backward pass: 49.21 MB
Max memory allocated: 82.35 MB
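
The counters above report totals for the whole process. If you need a per-operator breakdown, the profiler from the previous section also accepts profile_memory=True. The snippet below is a minimal sketch reusing the model and input_tensor defined above:

python
import torch
from torch.profiler import profile, ProfilerActivity

# profile_memory=True records allocations per operator, so you can see
# which operations account for most of the memory traffic
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True
) as prof:
    output = model(input_tensor)
    output.sum().backward()

print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))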

Benchmarking Training Loops

A complete training loop benchmark helps understand real-world performance:

python
import torch
import time

# Create a simple model and dataset
model = torch.nn.Sequential(
    torch.nn.Linear(784, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10)
).cuda()

# Generate synthetic data
batch_size = 64
num_batches = 100
data = torch.randn(batch_size * num_batches, 784).cuda()
targets = torch.randint(0, 10, (batch_size * num_batches,)).cuda()

# Prepare optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()

# Benchmark training loop
torch.cuda.synchronize()  # make sure setup work on the GPU has finished
start_time = time.time()

for i in range(num_batches):
    batch_start = i * batch_size
    batch_end = batch_start + batch_size

    inputs = data[batch_start:batch_end]
    labels = targets[batch_start:batch_end]

    # Zero the gradients
    optimizer.zero_grad()

    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, labels)

    # Backward pass and optimize
    loss.backward()
    optimizer.step()

torch.cuda.synchronize()  # wait for all queued GPU work before stopping the timer
end_time = time.time()
total_time = end_time - start_time
print(f"Total training time: {total_time:.2f} seconds")
print(f"Average time per batch: {total_time / num_batches * 1000:.2f} ms")
print(f"Samples per second: {batch_size * num_batches / total_time:.2f}")

Output:

Total training time: 0.58 seconds
Average time per batch: 5.82 ms
Samples per second: 11004.59

Real-world Benchmarking Example: ResNet on ImageNet

Let's simulate benchmarking a ResNet model on ImageNet-sized data:

python
import torch
import torchvision.models as models
import time

# Instantiate a ResNet-50 (random weights are fine since we only measure speed)
model = models.resnet50(pretrained=False).cuda()
model.eval() # Set to evaluation mode

# Create ImageNet-sized input (batch_size, 3, 224, 224)
batch_size = 64
input_tensor = torch.randn(batch_size, 3, 224, 224).cuda()

# Warm-up runs
for _ in range(10):
    with torch.no_grad():
        _ = model(input_tensor)

# Synchronize before starting timer
torch.cuda.synchronize()

# Benchmark inference
num_runs = 100
start_time = time.time()

with torch.no_grad():
    for _ in range(num_runs):
        output = model(input_tensor)
        torch.cuda.synchronize()  # Wait for GPU to finish

end_time = time.time()

# Calculate statistics
total_time = end_time - start_time
time_per_batch = total_time / num_runs
images_per_second = batch_size / time_per_batch

print(f"ResNet-50 batch inference performance:")
print(f" Batch size: {batch_size}")
print(f" Total time for {num_runs} batches: {total_time:.2f} seconds")
print(f" Average time per batch: {time_per_batch * 1000:.2f} ms")
print(f" Images per second: {images_per_second:.2f}")

Output:

ResNet-50 batch inference performance:
Batch size: 64
Total time for 100 batches: 5.17 seconds
Average time per batch: 51.71 ms
Images per second: 1238.47

Best Practices for Benchmarking

  1. Warm-up runs: Always perform a few warm-up iterations before timing to avoid measuring initialization costs.

  2. Multiple runs: Don't rely on a single measurement; average over multiple runs for statistical significance.

  3. GPU synchronization: Use torch.cuda.synchronize() before and after timed regions so that all queued GPU work has actually finished; the helper sketched after this list combines warm-up, repetition, and synchronization.

  4. Control environment: Run benchmarks on an otherwise idle system to minimize external influences.

  5. Compare apples to apples: When comparing implementations, ensure they're solving the exact same problem.

  6. Real-world data: Use realistic data sizes and distributions for meaningful results.

  7. Consistent hardware: Note the hardware specifications when sharing benchmark results.
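
To make the first three points concrete, here is a small, hypothetical helper (the name time_gpu_fn and its defaults are ours, not part of any PyTorch API) that wraps warm-up, repetition, and synchronization around an arbitrary GPU workload:

python
import time
import torch

def time_gpu_fn(fn, warmup=10, runs=50):
    """Hypothetical helper: time a CUDA workload with warm-up, repetition,
    and explicit synchronization; returns the mean time per run in ms."""
    for _ in range(warmup):       # warm-up: exclude one-time setup costs
        fn()
    torch.cuda.synchronize()      # make sure warm-up work has finished

    start = time.perf_counter()
    for _ in range(runs):
        fn()
    torch.cuda.synchronize()      # wait for all timed work to finish
    end = time.perf_counter()

    return (end - start) / runs * 1000

# Example usage
model = torch.nn.Linear(1000, 1000).cuda()
x = torch.randn(256, 1000).cuda()
print(f"{time_gpu_fn(lambda: model(x)):.2f} ms per forward pass")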

Common Benchmarking Pitfalls

  • Forgetting to synchronize GPU operations: GPU operations are asynchronous, so timing without synchronization gives inaccurate results.
  • Not accounting for data loading: In real-world scenarios, data loading can be a significant bottleneck.
  • Overlooking memory transfers: Moving data between CPU and GPU can be expensive; measure transfers separately from compute, as in the sketch after this list.
  • Benchmarking with small data: Results with small tensors might not scale linearly to larger ones.
  • Ignoring variability: Performance can vary between runs; statistical analysis helps.
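
To see the memory-transfer pitfall concretely, you can time the host-to-device copy and the GPU compute separately with the Timer from earlier. The layer and tensor sizes below are purely illustrative:

python
import torch
from torch.utils.benchmark import Timer

cpu_tensor = torch.randn(64, 3, 224, 224)  # an ImageNet-sized batch on the CPU
model = torch.nn.Conv2d(3, 64, kernel_size=3).cuda()

# Time the host-to-device copy on its own ...
transfer = Timer(
    stmt="cpu_tensor.cuda()",
    globals={"cpu_tensor": cpu_tensor}
)

# ... and the compute on data that is already resident on the GPU
gpu_tensor = cpu_tensor.cuda()
compute = Timer(
    stmt="model(gpu_tensor)",
    globals={"model": model, "gpu_tensor": gpu_tensor}
)

print(transfer.blocked_autorange(min_run_time=0.5))
print(compute.blocked_autorange(min_run_time=0.5))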

Summary

Benchmarking your PyTorch models is essential for optimizing performance and making informed decisions about your deep learning workflows. In this guide, we've covered:

  • Basic timing using Python's time module and PyTorch's Timer
  • Advanced profiling with PyTorch's built-in profiler
  • Memory consumption analysis
  • Complete training loop benchmarking
  • Real-world benchmarking examples
  • Best practices and common pitfalls

By systematically benchmarking your models, you can identify bottlenecks, compare different implementations, and ultimately deliver more efficient deep learning solutions.

Exercises

  1. Benchmark the performance difference between using model.eval() with torch.no_grad() versus regular inference mode.

  2. Compare the training speed of the same model architecture using different optimizers (SGD, Adam, AdamW).

  3. Profile a CNN model and identify which layers consume the most computation time.

  4. Benchmark the impact of different batch sizes on both memory usage and throughput.

  5. Use the PyTorch profiler to compare the performance of your model on CPU versus GPU for different input sizes.


