PyTorch Backpropagation
Introduction
Backpropagation is the fundamental algorithm behind training neural networks. It is how a network "learns" from data: by adjusting its weights to reduce the error. PyTorch makes this process straightforward through its autograd package, which calculates gradients automatically. In this tutorial, we'll explore how backpropagation works in PyTorch, starting from basic concepts and advancing to more complex examples.
Understanding Backpropagation
Backpropagation is simply the application of the chain rule from calculus to calculate gradients of a loss function with respect to the model's parameters. These gradients indicate how to adjust the parameters to reduce the error.
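As a minimal sketch of that idea (the one-parameter model, the data point, and the step size are made up for illustration), the snippet below takes a single gradient step and checks that the loss goes down:

```python
import torch

# Hypothetical one-parameter model: prediction = w * x
w = torch.tensor(0.5, requires_grad=True)
x, target = torch.tensor(3.0), torch.tensor(6.0)

loss_before = (w * x - target) ** 2   # (1.5 - 6)^2 = 20.25
loss_before.backward()                # d(loss)/dw = 2 * (w*x - target) * x = -27

# Step against the gradient; no_grad keeps the update out of the graph
with torch.no_grad():
    w -= 0.01 * w.grad                # 0.5 + 0.27 = 0.77

loss_after = (w * x - target) ** 2    # (2.31 - 6)^2, roughly 13.62
print(f"gradient: {w.grad.item():.2f}")
print(f"loss before: {loss_before.item():.2f}, loss after: {loss_after.item():.2f}")
```

The negative gradient says that increasing w lowers the loss, which is exactly what the update does.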
The Chain Rule Refresher
The chain rule states that if y = f(u) and u = g(x), then:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
Neural networks consist of many nested functions, so the chain rule is applied repeatedly to calculate gradients through all layers.
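To make the rule concrete, here is a small check that compares a hand-derived derivative with the one autograd computes. The functions f(u) = sin(u) and g(x) = x² and the point x = 1.5 are arbitrary choices for illustration:

```python
import math
import torch

x = torch.tensor(1.5, requires_grad=True)
u = x ** 2          # inner function g(x) = x^2
y = torch.sin(u)    # outer function f(u) = sin(u)
y.backward()

# Chain rule by hand: dy/dx = dy/du * du/dx = cos(x^2) * 2x
manual = math.cos(1.5 ** 2) * 2 * 1.5

print(f"autograd: {x.grad.item():.6f}")
print(f"by hand:  {manual:.6f}")
```

The two numbers should agree up to floating-point precision.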
PyTorch's Autograd: The Engine Behind Backpropagation
PyTorch's autograd package provides automatic differentiation for all operations on tensors. It builds a computational graph that tracks operations performed on tensors, and then uses this graph to calculate gradients.
Key Components in PyTorch Autograd
- Tensor: The basic data structure in PyTorch
- requires_grad: A flag to indicate if a tensor needs gradient calculation
- backward(): The function that initiates backpropagation
- grad: The attribute that stores gradient values
Let's see how these components work together:
import torch
# Create a tensor with requires_grad=True
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 # Some computation
# Print the computation result
print(f"y = {y}")
# Compute gradients
y.backward()
# Access gradients
print(f"dy/dx = {x.grad}")
Output:
y = tensor([4.], grad_fn=<PowBackward0>)
dy/dx = tensor([4.])
In this example:
- We created a tensor x with requires_grad=True
- We performed a computation y = x²
- We called .backward() on y to compute gradients
- The gradient of y with respect to x (dy/dx) is 4 (the derivative of x² is 2x, and x = 2)
Step-by-Step Backpropagation Process in PyTorch
Let's break down the backpropagation process in PyTorch:
1. Creating the Computational Graph
When you perform operations on tensors with requires_grad=True, PyTorch automatically constructs a computational graph:
import torch
# Create tensors
x = torch.tensor(1.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
# Build a computational graph
y = w * x + b
# Display the graph
print(f"y = {y}")
print(f"Gradient function for y: {y.grad_fn}")
Output:
y = tensor(5., grad_fn=<AddBackward0>)
Gradient function for y: <AddBackward0 object at 0x7f123a4b5f40>
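The graph can also be inspected directly. The sketch below follows each node's next_functions attribute from the output back to the inputs; the helper name walk is made up for this example, and the exact node names printed may differ between PyTorch versions:

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
y = w * x + b

def walk(fn, depth=0):
    """Collect (depth, node type name) pairs by following next_functions."""
    if fn is None:  # inputs that do not require gradients have no node
        return []
    nodes = [(depth, type(fn).__name__)]
    for next_fn, _ in fn.next_functions:
        nodes.extend(walk(next_fn, depth + 1))
    return nodes

nodes = walk(y.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)

# Tensors created by the user are leaves and have no grad_fn
print(x.is_leaf, x.grad_fn)
print(y.is_leaf)
```

The addition node sits at the root, the multiplication node below it, and each of the three leaf tensors ends in a node that accumulates its gradient.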
2. Forward Pass
During the forward pass, PyTorch computes the output of your model and remembers the operations for the backward pass:
# Continuing from previous code
z = y ** 2
print(f"z = {z}")
Output:
z = tensor(25., grad_fn=<PowBackward0>)
3. Backward Pass (Computing Gradients)
When you call .backward(), PyTorch automatically computes all the required gradients:
# Continuing from previous code
z.backward()
# Display the gradients
print(f"dz/dx: {x.grad}") # dz/dx = dz/dy * dy/dx = 2y * w = 2(5) * 2 = 20
print(f"dz/dw: {w.grad}") # dz/dw = dz/dy * dy/dw = 2y * x = 2(5) * 1 = 10
print(f"dz/db: {b.grad}") # dz/db = dz/dy * dy/db = 2y * 1 = 2(5) = 10
Output:
dz/dx: tensor(20.)
dz/dw: tensor(10.)
dz/db: tensor(10.)
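The intermediate factor dz/dy used in the comments above is not stored by default, because y is not a leaf tensor. As a small sketch, calling retain_grad() on y before the backward pass keeps it so the chain-rule factors can be checked one by one:

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

y = w * x + b
y.retain_grad()  # ask autograd to keep the gradient of this non-leaf tensor
z = y ** 2
z.backward()

print(f"dz/dy: {y.grad}")  # 2y = 10
print(f"dz/dx: {x.grad}")  # dz/dy * w = 10 * 2 = 20
print(f"dz/dw: {w.grad}")  # dz/dy * x = 10 * 1 = 10
```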
Managing Gradients
Accumulating Gradients
By default, PyTorch accumulates gradients. This means gradients are added to existing values rather than replacing them:
import torch
x = torch.tensor(1.0, requires_grad=True)
# First backward pass
y = x * 2
y.backward()
print(f"Gradient after first backward: {x.grad}")
# Second backward pass (gradient accumulates)
y = x * 2
y.backward()
print(f"Gradient after second backward: {x.grad}")
Output:
Gradient after first backward: tensor(2.)
Gradient after second backward: tensor(4.)
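Accumulation is useful as well as a pitfall. The sketch below (random data, and a helper named mse defined only for this example) shows that summing the gradients of two half-batches reproduces the gradient of the full batch, provided each half's loss is scaled by 1/2:

```python
import torch

torch.manual_seed(0)
data = torch.randn(8, 3)
target = torch.randn(8)

def mse(w, inputs, targets):
    """Mean squared error of a linear model without bias."""
    return ((inputs @ w - targets) ** 2).mean()

# Gradient from one pass over the full batch
w_full = torch.zeros(3, requires_grad=True)
mse(w_full, data, target).backward()

# The same gradient, accumulated over two half-batches
w_acc = torch.zeros(3, requires_grad=True)
for inputs, targets in zip(data.chunk(2), target.chunk(2)):
    # Scale each half by 1/2 so the accumulated sum matches the full-batch mean
    (mse(w_acc, inputs, targets) / 2).backward()

print(torch.allclose(w_full.grad, w_acc.grad, atol=1e-6))
```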
Zeroing Gradients
To prevent gradient accumulation, you need to zero gradients before each backward pass:
import torch
x = torch.tensor(1.0, requires_grad=True)
# First backward pass
y = x * 2
y.backward()
print(f"Gradient after first backward: {x.grad}")
# Zero gradients
x.grad.zero_()
# Second backward pass
y = x * 2
y.backward()
print(f"Gradient after zeroing and second backward: {x.grad}")
Output:
Gradient after first backward: tensor(2.)
Gradient after zeroing and second backward: tensor(2.)
Real-World Example: Linear Regression
Let's apply backpropagation to train a simple linear regression model:
import torch
import matplotlib.pyplot as plt
# Set random seed for reproducibility
torch.manual_seed(42)
# Generate synthetic data
x = torch.linspace(0, 10, 100)
y_true = 2*x + 1 + torch.randn(100) * 1.5
# Initialize parameters with requires_grad=True
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
# Hyperparameters
learning_rate = 0.01
epochs = 100
# Lists to store values for plotting
epoch_list = []
loss_list = []
# Training loop
for epoch in range(epochs):
    # Forward pass
    y_pred = w * x + b
    # Compute loss (mean squared error)
    loss = ((y_pred - y_true) ** 2).mean()
    # Store values for plotting
    if epoch % 10 == 0:
        epoch_list.append(epoch)
        loss_list.append(loss.item())
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
    # Backward pass
    loss.backward()
    # Update parameters using gradient descent
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
    # Zero gradients for next iteration
    w.grad.zero_()
    b.grad.zero_()
print(f"Final parameters: w = {w.item():.4f}, b = {b.item():.4f}")
print(f"True parameters: w = 2.0000, b = 1.0000")
# Plot the data and the learned line
plt.figure(figsize=(10, 6))
plt.plot(x.numpy(), y_true.numpy(), 'o', alpha=0.5, label='Data')
plt.plot(x.numpy(), (w * x + b).detach().numpy(), 'r-', label=f'Fit: y = {w.item():.2f}x + {b.item():.2f}')
plt.legend()
plt.title('Linear Regression with PyTorch Backpropagation')
plt.xlabel('x')
plt.ylabel('y')
# Plot loss over epochs
plt.figure(figsize=(10, 6))
plt.plot(epoch_list, loss_list, 'b.-')
plt.title('Loss vs. Epochs')
plt.xlabel('Epoch')
plt.ylabel('Mean Squared Error')
plt.grid(True)
plt.show()
Output (illustrative; the exact values depend on the PyTorch version and its random number generator):
Epoch 0, Loss: 47.7978
Epoch 10, Loss: 13.2450
Epoch 20, Loss: 5.7079
Epoch 30, Loss: 3.9901
...
Epoch 90, Loss: 3.0482
Final parameters: w = 1.9711, b = 1.0510
True parameters: w = 2.0000, b = 1.0000
This example demonstrates the complete backpropagation process:
- We created tensors with requires_grad=True
- We performed forward calculations to get predictions and loss
- We called backward() to compute gradients
- We manually updated parameters using gradient descent
- We zeroed gradients before the next iteration
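In practice, the manual update and zeroing steps are usually delegated to an optimizer. The following sketch repeats the same training loop with torch.optim.SGD and torch.nn.MSELoss, reusing the data and hyperparameters from the example above; the list name losses is introduced only for this illustration:

```python
import torch

torch.manual_seed(42)
x = torch.linspace(0, 10, 100)
y_true = 2 * x + 1 + torch.randn(100) * 1.5

w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

optimizer = torch.optim.SGD([w, b], lr=0.01)
loss_fn = torch.nn.MSELoss()

losses = []
for epoch in range(100):
    optimizer.zero_grad()               # clear gradients from the previous step
    loss = loss_fn(w * x + b, y_true)   # forward pass and mean squared error
    loss.backward()                     # backward pass fills w.grad and b.grad
    optimizer.step()                    # plain SGD: param -= lr * param.grad
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(f"w = {w.item():.4f}, b = {b.item():.4f}")
```

The four calls inside the loop map one-to-one onto the zeroing, forward, backward, and update steps listed above.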
Advanced Topics in PyTorch Backpropagation
Computing Gradients for Non-Scalar Outputs
When the output is not a scalar, you must provide a gradient argument with the same shape as the output to the .backward() method:
import torch
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = x * 2
# We need to provide a gradient argument since y is not a scalar
# The gradient argument should match the shape of y
external_grad = torch.tensor([[1.0, 1.0], [1.0, 1.0]])
y.backward(gradient=external_grad)
print(f"x.grad: {x.grad}")
Output:
x.grad: tensor([[2., 2.],
[2., 2.]])
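The gradient argument acts as a set of weights, one per output element: backward() computes the gradient of the weighted sum of the outputs. The sketch below, with arbitrarily chosen weights v, shows that passing v as gradient gives the same result as reducing to a scalar by hand:

```python
import torch

v = torch.tensor([[1.0, 0.5], [0.0, 2.0]])  # arbitrary weights, one per element of y

# Option 1: pass v as the gradient argument
x1 = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
(x1 ** 2).backward(gradient=v)

# Option 2: reduce to a scalar with the same weights, then call backward()
x2 = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
((x2 ** 2) * v).sum().backward()

print(x1.grad)  # v * 2x = [[2, 2], [0, 16]]
print(torch.allclose(x1.grad, x2.grad))
```

With a tensor of ones, as in the example above, every output element is weighted equally.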
Using retain_graph Parameter
By default, the computational graph is freed after calling .backward(). To backpropagate through it again, set retain_graph=True:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
z = 3 * y  # z is built on top of y, so its graph includes y's graph
# First backward pass
y.backward(retain_graph=True) # Keep the graph so z.backward() can run through it
print(f"dy/dx: {x.grad}")
# Zero gradients before next backward
x.grad.zero_()
# Second backward pass (runs through y's part of the graph again)
z.backward()
print(f"dz/dx: {x.grad}") # dz/dx = 3 * 2x = 12
Output:
dy/dx: tensor(4.)
dz/dx: tensor(12.)
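For comparison, here is a sketch of what happens when the graph is not retained. The flag second_pass_failed is introduced only for this illustration, and the exact error text depends on the PyTorch version:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2

y.backward()  # the graph's saved values are released after this call
try:
    y.backward()  # second pass over the same, already freed graph
    second_pass_failed = False
except RuntimeError as err:
    second_pass_failed = True
    print(f"RuntimeError: {str(err)[:60]}...")

print(f"second backward failed: {second_pass_failed}")
print(f"dy/dx from the first pass: {x.grad}")
```

Note that retaining the graph costs memory, so it is best reserved for cases where a second pass through the same graph is really needed.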
Using create_graph Parameter
Setting create_graph=True allows computing higher-order derivatives:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
# First derivative
first_derivative = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"First derivative (dy/dx at x=2): {first_derivative}")
# Second derivative
second_derivative = torch.autograd.grad(first_derivative, x)[0]
print(f"Second derivative (d²y/dx² at x=2): {second_derivative}")
Output:
First derivative (dy/dx at x=2): tensor(12., grad_fn=<MulBackward0>)
Second derivative (d²y/dx² at x=2): tensor(12.)
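Because the first derivative is itself part of a graph, it can also appear inside another expression that is then differentiated. The sketch below uses a made-up penalty on the squared gradient to show this; the name penalty is an assumption for illustration only:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# dy/dx = 3x^2, kept differentiable by create_graph=True
(grad_y,) = torch.autograd.grad(y, x, create_graph=True)

# Hypothetical penalty on the gradient's magnitude: (3x^2)^2 = 9x^4
penalty = grad_y ** 2
penalty.backward()  # d(9x^4)/dx = 36x^3

print(f"dy/dx: {grad_y.item()}")          # 3 * 2^2 = 12.0
print(f"d(penalty)/dx: {x.grad.item()}")  # 36 * 2^3 = 288.0
```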
Common Pitfalls and Tips
- Forgetting to Zero Gradients: Always call optimizer.zero_grad() or x.grad.zero_() before computing new gradients.
- In-Place Operations: Be careful with in-place operations (operations that modify a tensor directly), as they can cause issues with the autograd graph:
import torch
# This works fine
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x + 2
z = y * y
z.sum().backward()
print(f"Gradient without in-place operations: {x.grad}")
# Reset
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x + 2
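# In-place operation: y now holds x + 3
# This case still works because the backward pass of the addition does not need y's old value.
# An in-place change to a value that autograd saved for the backward pass raises a RuntimeError instead.
y.add_(1)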
z = y * y
z.sum().backward()
print(f"Gradient with in-place operations: {x.grad}")
- detach() and with torch.no_grad(): Use these to stop gradient tracking when you don't need it:
import torch
# Using detach()
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
detached_y = y.detach() # Detached tensor doesn't track gradients
print(f"detached_y.requires_grad: {detached_y.requires_grad}") # False
# Using with torch.no_grad()
with torch.no_grad():
    # No operations here will track gradients
    z = x * 3
print(f"z.requires_grad: {z.requires_grad}") # False
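Returning to the in-place pitfall above, the sketch below shows a case that does fail. It relies on the fact that the backward pass of torch.exp reuses the function's own output, so modifying that output in place invalidates the saved value. The flag inplace_failed is introduced only for this illustration, and the error text may vary between versions:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.exp(x)  # exp's backward pass reuses its own output y
y.add_(1)         # the in-place change overwrites the value autograd saved

try:
    y.sum().backward()
    inplace_failed = False
except RuntimeError as err:
    inplace_failed = True
    print(f"RuntimeError: {str(err)[:70]}...")

# Out-of-place version: y stays intact, so backward works
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.exp(x)
z = y + 1
z.sum().backward()
print(f"x.grad: {x.grad}")  # equals exp(x)
```

When in doubt, prefer the out-of-place form; it costs an extra tensor but keeps the graph valid.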
Summary
In this tutorial, we've covered how backpropagation works in PyTorch using the autograd system:
- PyTorch builds a computational graph to track operations
- The backward() method computes gradients using the chain rule
- Gradients are stored in the .grad attribute of tensors
- Gradients accumulate by default, so they need to be zeroed before each backward pass
- PyTorch handles both simple and complex gradient computations automatically
Understanding backpropagation is crucial for effective deep learning model development. PyTorch's autograd system makes it surprisingly easy to implement even complex models without having to manually derive the gradients.
Additional Resources and Exercises
Exercises
- Basic Backpropagation: Create a tensor x with value 3.0 and compute the gradient of y = x³ - 4x² + 5 with respect to x.
- Neural Network Training: Implement a simple 2-layer neural network from scratch using PyTorch's autograd (no nn.Module) and train it on the XOR problem.
- Gradient Accumulation: Experiment with gradient accumulation by manually accumulating gradients over multiple forward passes before updating parameters.
- Higher-order Derivatives: Compute the third derivative of f(x) = sin(x) at x = 0 using PyTorch's autograd.
By mastering PyTorch's backpropagation mechanism, you'll have a powerful tool for building and training neural networks efficiently.