PyTorch Differentiation
Introduction
Differentiation is a fundamental concept in deep learning that allows neural networks to learn. PyTorch provides a powerful automatic differentiation engine through its autograd
package that handles the complex calculations of gradients for us. This is essential for implementing gradient-based optimization algorithms like Stochastic Gradient Descent (SGD) that power modern machine learning.
In this tutorial, you'll learn:
- How PyTorch tracks operations for automatic differentiation
- Computing gradients using backward()
- Working with the computational graph
- Practical applications of differentiation in PyTorch
The Basics of PyTorch Differentiation
PyTorch builds a computational graph of operations performed on tensors, which it then uses to calculate derivatives. This is done dynamically (during runtime) rather than statically (before execution), giving PyTorch its flexibility.
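To make this concrete, here is a minimal sketch (the function f is made up for illustration) showing that ordinary Python control flow becomes part of the graph, because the graph is rebuilt on every forward pass:

import torch

def f(x):
    # The graph records whichever branch actually runs
    if x.sum() > 0:
        return (x ** 2).sum()
    return (x ** 3).sum()

x = torch.tensor([1.0, 2.0], requires_grad=True)
f(x).backward()
print(x.grad)  # tensor([2., 4.]) -- the derivative of the branch that was taken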
Tracking Operations with requires_grad
To tell PyTorch to track operations on a tensor, set the requires_grad attribute to True:
import torch
# Create a tensor with requires_grad=True
x = torch.tensor([2.0], requires_grad=True)
print(f"x: {x}")
print(f"requires_grad: {x.requires_grad}")
Output:
x: tensor([2.], requires_grad=True)
requires_grad: True
Now PyTorch will track all operations performed on x.
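For instance, the result of a tracked operation carries a grad_fn attribute recording how it was produced (a small sketch, continuing with x from above):

y = x * 3
print(y.grad_fn)  # <MulBackward0 ...> -- the operation that created y
print(x.grad_fn)  # None -- x is a leaf tensor created directly by the user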
Computing Gradients
Let's create a simple computation and calculate gradients:
import torch
# Create tensors
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
# Perform operations
z = x**2 + y**3
# Compute gradients
z.backward()
# Print results
print(f"x: {x}, x.grad: {x.grad}")
print(f"y: {y}, y.grad: {y.grad}")
Output:
x: tensor(2., requires_grad=True), x.grad: tensor(4.)
y: tensor(3., requires_grad=True), y.grad: tensor(27.)
What Happened?
- We created two tensors x and y with requires_grad=True.
- We performed operations to create z = x^2 + y^3.
- We called z.backward(), which computes the gradient of z with respect to tensors with requires_grad=True.
The gradients are:
- x.grad: the derivative of z with respect to x is 2x = 2*2 = 4
- y.grad: the derivative of z with respect to y is 3y^2 = 3*3^2 = 27
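As a cross-check, torch.autograd.grad computes the same derivatives functionally, returning them as a tuple instead of writing into .grad (a minimal sketch):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x**2 + y**3

dz_dx, dz_dy = torch.autograd.grad(z, (x, y))
print(dz_dx, dz_dy)  # tensor(4.) tensor(27.)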
The Computational Graph
When you perform operations on tensors, PyTorch creates a computational graph. Each node in the graph represents an operation, and edges represent data dependencies.
Here's a visualization of the graph for z = x^2 + y^3:
    z = x^2 + y^3
       /      \
    x^2        y^3
     |          |
     x          y
When backward() is called, PyTorch traverses this graph in reverse, applying the chain rule to compute gradients.
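You can peek at this reverse structure through grad_fn and its next_functions, which link each operation back to the nodes that produced its inputs (a small sketch):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x**2 + y**3

print(z.grad_fn)                 # AddBackward0 -- the final operation
print(z.grad_fn.next_functions)  # the PowBackward0 nodes feeding the add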
Working with Gradients
Gradients for Vectors
For vector-valued functions, you need to provide a gradient argument to backward():
import torch
# Create a vector
x = torch.tensor([2.0, 3.0], requires_grad=True)
# Perform operation
y = x * x
# Since y is a vector, we need to provide a gradient argument to backward()
external_grad = torch.tensor([1.0, 1.0])
y.backward(external_grad)
print(f"x: {x}")
print(f"y: {y}")
print(f"x.grad: {x.grad}")
Output:
x: tensor([2., 3.], requires_grad=True)
y: tensor([4., 9.], grad_fn=<MulBackward0>)
x.grad: tensor([4., 6.])
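Passing a vector of ones this way is equivalent to summing y into a scalar and differentiating that, which is often the cleaner way to write it (a small sketch):

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x * x

y.sum().backward()  # same gradients as y.backward(torch.ones_like(y))
print(x.grad)       # tensor([4., 6.])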
Stopping Gradient Tracking
Sometimes, you may want to prevent PyTorch from tracking operations. Use torch.no_grad() for this:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
# Operations inside torch.no_grad() won't be tracked
with torch.no_grad():
    z = x**2 + y**3

print(f"z requires_grad: {z.requires_grad}")

# Try to call backward()
try:
    z.backward()
except RuntimeError as e:
    print(f"Error: {e}")
Output:
z requires_grad: False
Error: element 0 of tensors does not require grad and does not have a grad_fn
Alternatively, you can use the detach() method to create a new tensor without gradient tracking:
# Detach a tensor from the computation graph
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x.detach()
print(f"x requires_grad: {x.requires_grad}")
print(f"y requires_grad: {y.requires_grad}")
Output:
x requires_grad: True
y requires_grad: False
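detach() is handy for cutting one branch of a computation out of the graph, so gradients flow only through the branches you keep (a minimal sketch):

import torch

x = torch.tensor(2.0, requires_grad=True)
scale = x.detach()  # treated as a constant by autograd

z = scale * x       # gradient flows through x only, not through scale
z.backward()
print(x.grad)       # tensor(2.) -- d(scale*x)/dx = scale = 2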
Practical Applications
Training a Simple Neural Network
Let's use differentiation to train a simple neural network for linear regression:
import torch
import torch.nn as nn
# Generate some synthetic data
x = torch.linspace(-5, 5, 100).reshape(-1, 1)
true_w = torch.tensor([2.0])
true_b = torch.tensor([-1.0])
y = true_w * x + true_b + 0.1 * torch.randn_like(x)
# Define a simple linear model
class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(1))
        self.bias = nn.Parameter(torch.randn(1))

    def forward(self, x):
        return self.weight * x + self.bias
# Initialize model and loss function
model = LinearModel()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD([model.weight, model.bias], lr=0.01)
# Training loop
losses = []
for epoch in range(100):
    # Forward pass
    y_pred = model(x)
    loss = criterion(y_pred, y)
    losses.append(loss.item())

    # Backward pass - this is where differentiation happens
    optimizer.zero_grad()  # Clear previous gradients
    loss.backward()        # Compute gradients
    optimizer.step()       # Update parameters

    if epoch % 10 == 0:
        print(f"Epoch {epoch}: W={model.weight.item():.4f}, b={model.bias.item():.4f}, Loss={loss.item():.4f}")
print(f"True values: W={true_w.item()}, b={true_b.item()}")
print(f"Learned values: W={model.weight.item():.4f}, b={model.bias.item():.4f}")
Output:
Epoch 0: W=0.1495, b=0.9011, Loss=20.3201
Epoch 10: W=1.1635, b=-0.1856, Loss=1.6859
Epoch 20: W=1.5700, b=-0.5432, Loss=0.3504
Epoch 30: W=1.7401, b=-0.6900, Loss=0.1537
Epoch 40: W=1.8264, b=-0.7642, Loss=0.0971
Epoch 50: W=1.8747, b=-0.8065, Loss=0.0739
Epoch 60: W=1.9028, b=-0.8318, Loss=0.0625
Epoch 70: W=1.9197, b=-0.8474, Loss=0.0562
Epoch 80: W=1.9301, b=-0.8570, Loss=0.0528
Epoch 90: W=1.9365, b=-0.8630, Loss=0.0508
True values: W=2.0, b=-1.0
Learned values: W=1.9407, b=-0.8668
In this example:
- We define a linear model with learnable parameters (weight and bias).
- During training, we:
  - Make predictions (forward pass)
  - Compute the loss
  - Call loss.backward() to compute gradients
  - Update parameters using the optimizer
This is how automatic differentiation powers the learning process in neural networks.
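Under the hood, optimizer.step() for plain SGD is just a gradient-descent update on each parameter. A hand-rolled sketch of one update (reusing model, criterion, x, and y from the example above):

lr = 0.01

loss = criterion(model(x), y)
loss.backward()

with torch.no_grad():  # the updates themselves must not be tracked
    for p in model.parameters():
        p -= lr * p.grad
        p.grad.zero_()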
Advanced Example: Gradient Accumulation
When dealing with large models, you might need to accumulate gradients over multiple forward passes before updating parameters:
import torch
import torch.nn as nn
# Setup
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Fake data
inputs = torch.randn(100, 10)
targets = torch.randn(100, 1)
# Split into batches
batch_size = 10
num_batches = 4 # We'll only use 4 batches for this example
# Gradient accumulation steps
accumulation_steps = 2
# Training
model.train()
for i in range(num_batches):
    # Get batch
    batch_inputs = inputs[i*batch_size:(i+1)*batch_size]
    batch_targets = targets[i*batch_size:(i+1)*batch_size]

    # Forward pass
    outputs = model(batch_inputs)
    loss = criterion(outputs, batch_targets)

    # Normalize loss to account for accumulation
    loss = loss / accumulation_steps

    # Backward pass
    loss.backward()

    # Update weights after accumulation_steps
    if (i + 1) % accumulation_steps == 0:
        print(f"Updating weights after batch {i+1}")
        optimizer.step()
        optimizer.zero_grad()
Output:
Updating weights after batch 2
Updating weights after batch 4
This technique helps train large models on hardware with limited memory.
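To see why the loss is divided by accumulation_steps, here is a small sketch (with made-up data) checking that two normalized half-batch backward passes accumulate the same gradient as a single full-batch pass when the halves are equal-sized:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
inputs = torch.randn(20, 10)
targets = torch.randn(20, 1)

# Two normalized half-batch passes, gradients accumulating in .grad
for chunk_in, chunk_tgt in zip(inputs.chunk(2), targets.chunk(2)):
    (criterion(model(chunk_in), chunk_tgt) / 2).backward()
accumulated = model.weight.grad.clone()

# One full-batch pass for comparison
model.zero_grad()
criterion(model(inputs), targets).backward()
print(torch.allclose(accumulated, model.weight.grad))  # True (up to float error)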
Advanced Topics
Higher-Order Derivatives
PyTorch can compute higher-order derivatives by differentiating through a gradient computation, using torch.autograd.grad with create_graph=True:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x**3 + x**2
# First derivative: dy/dx = 3x^2 + 2x = 3*2^2 + 2*2 = 12 + 4 = 16
first_derivative = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"First derivative at x=2: {first_derivative}")
# Second derivative: d²y/dx² = 6x + 2 = 6*2 + 2 = 14
second_derivative = torch.autograd.grad(first_derivative, x)[0]
print(f"Second derivative at x=2: {second_derivative}")
Output:
First derivative at x=2: tensor(16., grad_fn=<AddBackward0>)
Second derivative at x=2: tensor(14.)
Note that we set create_graph=True in the first call to enable computation of higher-order derivatives.
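For convenience, torch.autograd.functional.hessian can compute the matrix of second derivatives directly; a small sketch for the same function:

import torch

def f(x):
    return (x**3 + x**2).sum()

x = torch.tensor([2.0])
print(torch.autograd.functional.hessian(f, x))  # tensor([[14.]]) -- 6x + 2 at x = 2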
Jacobian Products
PyTorch's backward() computes vector-Jacobian products, which is how backpropagation works:
import torch
x = torch.randn(3, requires_grad=True)
y = torch.stack([x[0]**2, x[1]**3, x[2]**4])  # stack keeps the graph; torch.tensor() would detach
# Vector for vector-Jacobian product
v = torch.tensor([1.0, 1.0, 1.0])
# Compute vector-Jacobian product
y.backward(v)
# Analytical gradients: dy_i/dx_i
analytical_grad = torch.tensor([
    2 * x[0].item(),      # d(x[0]^2)/dx[0] = 2*x[0]
    3 * x[1].item()**2,   # d(x[1]^3)/dx[1] = 3*x[1]^2
    4 * x[2].item()**3,   # d(x[2]^4)/dx[2] = 4*x[2]^3
])
print(f"x: {x}")
print(f"Computed gradient: {x.grad}")
print(f"Analytical gradient: {analytical_grad}")
Common Pitfalls and Solutions
1. Gradient Accumulation
By default, gradients accumulate each time you call backward(). Always zero gradients before a new backward pass:
import torch
x = torch.tensor(2.0, requires_grad=True)
# First backward pass
y = x**2
y.backward()
print(f"After first backward: {x.grad}")
# If you don't zero gradients, they'll accumulate
y = x**2
y.backward()
print(f"After second backward (accumulated): {x.grad}")
# Zero gradients
x.grad.zero_()
y = x**2
y.backward()
print(f"After zeroing and backward: {x.grad}")
Output:
After first backward: tensor(4.)
After second backward (accumulated): tensor(8.)
After zeroing and backward: tensor(4.)
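In a full training loop you would normally call optimizer.zero_grad() rather than zeroing each tensor by hand; a small sketch (note that recent PyTorch versions reset gradients to None rather than zero by default):

import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(4, 2)).sum().backward()
optimizer.zero_grad()  # resets gradients for every parameter it manages
print(model.weight.grad)  # None in recent versions (set_to_none=True by default)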
2. In-place Operations
Be careful with in-place operations, as they can lead to incorrect gradient computation:
import torch
# This works fine
x = torch.tensor(2.0, requires_grad=True)
y = x * 2
z = y * y
z.backward()
print(f"Correct gradient: {x.grad}")
# This will cause problems: in-place modification of a leaf tensor
x = torch.tensor(2.0, requires_grad=True)
try:
    x *= 2  # In-place operation on a leaf that requires grad - BAD!
except RuntimeError as e:
    print(f"Error with in-place operation: {e}")
Output:
Correct gradient: tensor(16.)
Error with in-place operation: a leaf Variable that requires grad is being used in an in-place operation.
Summary
In this tutorial, you've learned:
- How PyTorch's automatic differentiation works using the autograd engine
- How to compute gradients using backward() and control gradient tracking
- How computational graphs represent operations for differentiation
- How to apply differentiation to train neural networks
- Advanced techniques like gradient accumulation and higher-order derivatives
- Common pitfalls and best practices when working with gradients
Understanding differentiation is crucial for implementing and debugging deep learning algorithms. PyTorch's automatic differentiation makes deep learning accessible by handling the complex mathematics for us, allowing us to focus on the model architecture and training process.
Exercises
- Create a tensor of shape (3,3) and compute the gradient of the sum of its elements.
- Implement a small neural network using PyTorch's differentiation to solve a classification problem.
- Experiment with different optimization algorithms (SGD, Adam, RMSprop) and observe how they use gradients differently.
- Implement a function that computes both first and second derivatives of f(x) = sin(x) at different points.
- Create a visualization of the gradient flow in a simple neural network during training.