PyTorch Backpropagation

Introduction

Backpropagation is the fundamental algorithm behind training neural networks. It is how a network "learns" from data: it adjusts its weights to minimize a loss. PyTorch automates this through its autograd package, which computes the gradients for us. In this tutorial, we'll explore how backpropagation works in PyTorch, starting from basic concepts and advancing to more complex examples.

Understanding Backpropagation

Backpropagation is simply the application of the chain rule from calculus to calculate gradients of a loss function with respect to the model's parameters. These gradients indicate how to adjust the parameters to reduce the error.

The Chain Rule Refresher

The chain rule states that if y = f(u) and u = g(x), then:

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

Neural networks consist of many nested functions, so the chain rule is applied repeatedly to calculate gradients through all layers.
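As a quick sanity check of the chain rule, the sketch below compares the gradient computed by autograd with the product of derivatives worked out by hand. The functions f = sin and g(x) = x² and the value x = 1.5 are arbitrary choices for illustration, not part of the original tutorial.

```python
import math
import torch

# Nested functions: u = g(x) = x^2, y = f(u) = sin(u)
x = torch.tensor(1.5, requires_grad=True)
u = x ** 2
y = torch.sin(u)

# Autograd applies the chain rule for us
y.backward()

# Chain rule by hand: dy/dx = dy/du * du/dx = cos(u) * 2x
manual = math.cos(1.5 ** 2) * (2 * 1.5)

print(f"autograd: {x.grad.item():.6f}")
print(f"manual:   {manual:.6f}")
```

The two numbers should agree up to float32 rounding.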

PyTorch's Autograd: The Engine Behind Backpropagation

PyTorch's autograd package provides automatic differentiation for all operations on tensors. It builds a computational graph that tracks operations performed on tensors, and then uses this graph to calculate gradients.

Key Components in PyTorch Autograd

  1. Tensor: The basic data structure in PyTorch
  2. requires_grad: A flag to indicate if a tensor needs gradient calculation
  3. backward(): The function that initiates backpropagation
  4. grad: The attribute that stores gradient values

Let's see how these components work together:

python
import torch

# Create a tensor with requires_grad=True
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 # Some computation

# Print the computation result
print(f"y = {y}")

# Compute gradients
y.backward()

# Access gradients
print(f"dy/dx = {x.grad}")

Output:

y = tensor([4.], grad_fn=<PowBackward0>)
dy/dx = tensor([4.])

In this example:

  • We created a tensor x with requires_grad=True
  • We performed a computation y = x²
  • We called .backward() on y to compute gradients
  • The gradient of y with respect to x (dy/dx) is 4 (derivative of x² is 2x, and x=2)
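The four components can also be inspected directly. The following is a minimal sketch; the tensor names `a` and `c` are made up for illustration. It contrasts a tracked tensor with an untracked one and shows where the gradient ends up.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # tracked leaf tensor
c = torch.tensor([10.0, 20.0, 30.0])                    # plain constant, not tracked

out = (a * c).sum()

# Leaf tensors are created by the user; results of operations are not leaves
print(f"a.is_leaf = {a.is_leaf}, out.is_leaf = {out.is_leaf}")
# A result requires grad if any of its inputs does
print(f"out.requires_grad = {out.requires_grad}")

out.backward()

print(f"a.grad = {a.grad}")  # d(out)/da equals c, element by element
print(f"c.grad = {c.grad}")  # None, because c was never tracked
```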

Step-by-Step Backpropagation Process in PyTorch

Let's break down the backpropagation process in PyTorch:

1. Creating the Computational Graph

When you perform operations on tensors with requires_grad=True, PyTorch automatically constructs a computational graph:

python
import torch

# Create tensors
x = torch.tensor(1.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

# Build a computational graph
y = w * x + b

# Inspect the result and the graph node that produced it
print(f"y = {y}")
print(f"Gradient function for y: {y.grad_fn}")

Output:

y = tensor(5., grad_fn=<AddBackward0>)
Gradient function for y: <AddBackward0 object at 0x7f123a4b5f40>
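To look one level deeper into the graph, each `grad_fn` node exposes its parents through `next_functions`. The sketch below walks two levels of the graph for `y = w * x + b`; the helper `node_names` is a hypothetical name introduced here for illustration only.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
y = w * x + b

def node_names(fn):
    """Return the class names of the graph nodes feeding into `fn` (hypothetical helper)."""
    return [type(parent).__name__ for parent, _ in fn.next_functions if parent is not None]

top = y.grad_fn                      # node for the addition
mul_node = top.next_functions[0][0]  # node for w * x

print(type(top).__name__)    # AddBackward0
print(node_names(top))       # the multiplication node and the leaf b
print(node_names(mul_node))  # the leaves w and x
```

Leaf tensors that require gradients show up as `AccumulateGrad` nodes, which is where `.grad` gets written during the backward pass.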

2. Forward Pass

During the forward pass, PyTorch computes the output of your model and remembers the operations for the backward pass:

python
# Continuing from previous code
z = y ** 2
print(f"z = {z}")

Output:

z = tensor(25., grad_fn=<PowBackward0>)

3. Backward Pass (Computing Gradients)

When you call .backward(), PyTorch automatically computes all the required gradients:

python
# Continuing from previous code
z.backward()

# Display the gradients
print(f"dz/dx: {x.grad}") # dz/dx = dz/dy * dy/dx = 2y * w = 2(5) * 2 = 20
print(f"dz/dw: {w.grad}") # dz/dw = dz/dy * dy/dw = 2y * x = 2(5) * 1 = 10
print(f"dz/db: {b.grad}") # dz/db = dz/dy * dy/db = 2y * 1 = 2(5) = 10

Output:

dz/dx: tensor(20.)
dz/dw: tensor(10.)
dz/db: tensor(10.)
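If you only want the gradient values and do not want them written into `.grad`, `torch.autograd.grad` returns them as a tuple instead. A minimal sketch, rebuilding the same graph as above:

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

z = (w * x + b) ** 2

# Returns one gradient per input, in the same order as the inputs
dz_dx, dz_dw, dz_db = torch.autograd.grad(z, (x, w, b))

print(f"dz/dx: {dz_dx}, dz/dw: {dz_dw}, dz/db: {dz_db}")
print(f"x.grad is still: {x.grad}")  # None, nothing was accumulated
```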

Managing Gradients

Accumulating Gradients

By default, PyTorch accumulates gradients. This means gradients are added to existing values rather than replacing them:

python
import torch

x = torch.tensor(1.0, requires_grad=True)

# First backward pass
y = x * 2
y.backward()
print(f"Gradient after first backward: {x.grad}")

# Second backward pass (gradient accumulates)
y = x * 2
y.backward()
print(f"Gradient after second backward: {x.grad}")

Output:

Gradient after first backward: tensor(2.)
Gradient after second backward: tensor(4.)
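Accumulation is sometimes useful rather than a nuisance. One common use, sketched below under assumed toy data (the data, the split into two micro-batches, and the variable names are all illustrative), is to build up the gradient of a large batch from several smaller ones.

```python
import torch

torch.manual_seed(0)
data = torch.randn(8)
target = 3.0 * data

w = torch.tensor(0.0, requires_grad=True)

# Reference: gradient of the mean squared error over the full batch
loss = ((w * data - target) ** 2).mean()
loss.backward()
full_grad = w.grad.clone()
w.grad.zero_()

# Same gradient, accumulated over two micro-batches of four samples.
# Each micro-batch loss is divided by the number of micro-batches so the
# accumulated sum matches the full-batch mean.
for chunk_x, chunk_t in zip(data.split(4), target.split(4)):
    micro_loss = ((w * chunk_x - chunk_t) ** 2).mean() / 2
    micro_loss.backward()

print(f"full-batch gradient:  {full_grad}")
print(f"accumulated gradient: {w.grad}")
```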

Zeroing Gradients

To prevent gradient accumulation, you need to zero gradients before each backward pass:

python
import torch

x = torch.tensor(1.0, requires_grad=True)

# First backward pass
y = x * 2
y.backward()
print(f"Gradient after first backward: {x.grad}")

# Zero gradients
x.grad.zero_()

# Second backward pass
y = x * 2
y.backward()
print(f"Gradient after zeroing and second backward: {x.grad}")

Output:

Gradient after first backward: tensor(2.)
Gradient after zeroing and second backward: tensor(2.)
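In practice, parameters are usually handed to an optimizer, and `optimizer.zero_grad()` resets all of their gradients at once. The sketch below repeats the example with `torch.optim.SGD`; the learning rate of 0.1 is an arbitrary choice.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([x], lr=0.1)

for step in range(2):
    optimizer.zero_grad()  # clear gradients left over from the previous step
    y = x * 2
    y.backward()
    print(f"step {step}: grad = {x.grad}")
    optimizer.step()       # x <- x - lr * x.grad

print(f"x after two steps: {x.item():.2f}")
```

Depending on the PyTorch version, `zero_grad()` may reset gradients to `None` instead of filling them with zeros. Either way, the next `backward()` starts from a clean state.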

Real-World Example: Linear Regression

Let's apply backpropagation to train a simple linear regression model:

python
import torch
import matplotlib.pyplot as plt

# Set random seed for reproducibility
torch.manual_seed(42)

# Generate synthetic data
x = torch.linspace(0, 10, 100)
y_true = 2*x + 1 + torch.randn(100) * 1.5

# Initialize parameters with requires_grad=True
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Hyperparameters
learning_rate = 0.01
epochs = 100

# Lists to store values for plotting
epoch_list = []
loss_list = []

# Training loop
for epoch in range(epochs):
    # Forward pass
    y_pred = w * x + b

    # Compute loss (mean squared error)
    loss = ((y_pred - y_true) ** 2).mean()

    # Store values for plotting
    if epoch % 10 == 0:
        epoch_list.append(epoch)
        loss_list.append(loss.item())
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

    # Backward pass
    loss.backward()

    # Update parameters using gradient descent
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad

    # Zero gradients for next iteration
    w.grad.zero_()
    b.grad.zero_()

print(f"Final parameters: w = {w.item():.4f}, b = {b.item():.4f}")
print(f"True parameters: w = 2.0000, b = 1.0000")

# Plot the data and the learned line
plt.figure(figsize=(10, 6))
plt.plot(x.numpy(), y_true.numpy(), 'o', alpha=0.5, label='Data')
plt.plot(x.numpy(), (w * x + b).detach().numpy(), 'r-', label=f'Fit: y = {w.item():.2f}x + {b.item():.2f}')
plt.legend()
plt.title('Linear Regression with PyTorch Backpropagation')
plt.xlabel('x')
plt.ylabel('y')

# Plot loss over epochs
plt.figure(figsize=(10, 6))
plt.plot(epoch_list, loss_list, 'b.-')
plt.title('Loss vs. Epochs')
plt.xlabel('Epoch')
plt.ylabel('Mean Squared Error')
plt.grid(True)
plt.show()

Example output (treat the numbers as illustrative; exact values depend on the generated noise and your PyTorch version):

Epoch 0, Loss: 47.7978
Epoch 10, Loss: 13.2450
Epoch 20, Loss: 5.7079
Epoch 30, Loss: 3.9901
...
Epoch 90, Loss: 3.0482
Final parameters: w = 1.9711, b = 1.0510
True parameters: w = 2.0000, b = 1.0000

This example demonstrates the complete backpropagation process:

  1. We created tensors with requires_grad=True
  2. We performed forward calculations to get predictions and loss
  3. We called backward() to compute gradients
  4. We manually updated parameters using gradient descent
  5. We zeroed gradients before the next iteration
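To see what `backward()` computes in step 3, the sketch below compares autograd's gradients with the closed-form derivatives of the mean squared error, dL/dw = 2·mean((ŷ − y)·x) and dL/db = 2·mean(ŷ − y). The starting values for `w` and `b` are arbitrary and chosen only for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.linspace(0, 10, 100)
y_true = 2 * x + 1 + torch.randn(100) * 1.5

w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(-0.5, requires_grad=True)

# Gradients from autograd
loss = ((w * x + b - y_true) ** 2).mean()
loss.backward()

# Gradients derived by hand from the MSE formula
with torch.no_grad():
    residual = w * x + b - y_true
    grad_w_manual = 2 * (residual * x).mean()
    grad_b_manual = 2 * residual.mean()

print(f"dL/dw: autograd {w.grad.item():.4f}, manual {grad_w_manual.item():.4f}")
print(f"dL/db: autograd {b.grad.item():.4f}, manual {grad_b_manual.item():.4f}")
```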

Advanced Topics in PyTorch Backpropagation

Computing Gradients for Non-Scalar Outputs

When the output is not a scalar, you must provide a gradient argument to the .backward() method:

python
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = x * 2

# We need to provide a gradient argument since y is not a scalar
# The gradient argument should match the shape of y
external_grad = torch.tensor([[1.0, 1.0], [1.0, 1.0]])
y.backward(gradient=external_grad)

print(f"x.grad: {x.grad}")

Output:

x.grad: tensor([[2., 2.],
        [2., 2.]])
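One way to read the gradient argument is as a set of weights on the elements of the output: calling `y.backward(gradient=v)` gives the same result as reducing `y` to the scalar `(y * v).sum()` and backpropagating from that. The sketch below checks this with an arbitrary weight tensor `v` (the values are made up for illustration).

```python
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
v = torch.tensor([[1.0, 0.5], [0.0, 2.0]])  # arbitrary weights, same shape as y

# Route 1: pass v as the gradient argument
y = x ** 2
y.backward(gradient=v)
grad_from_argument = x.grad.clone()

# Route 2: reduce to a scalar with the same weights, then call backward()
x.grad.zero_()
y = x ** 2
(y * v).sum().backward()
grad_from_scalar = x.grad.clone()

print(grad_from_argument)  # v * 2x, element by element
print(grad_from_scalar)
```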

Using retain_graph Parameter

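By default, autograd frees the intermediate values saved in the computational graph once .backward() has run, so a second backward pass through the same graph raises an error. To reuse the graph, set retain_graph=True: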

python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
z = y * x # x ** 3, built on top of y so that both share part of the graph

# First backward pass
y.backward(retain_graph=True) # Keep the graph for another backward pass
print(f"dy/dx: {x.grad}")

# Zero gradients before next backward
x.grad.zero_()

# Second backward pass (using the same graph)
z.backward()
print(f"dz/dx: {x.grad}")

Output:

dy/dx: tensor(4.)
dz/dx: tensor(12.)
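The sketch below shows what happens without `retain_graph=True`: the second call on the same graph fails because the saved values are already gone. It then repeats the two calls with the graph retained.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2

# Without retain_graph, the second backward pass through the same graph fails
y.backward()
try:
    y.backward()
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print(f"RuntimeError: {str(err)[:60]}...")

# With retain_graph=True on the first call, the second call succeeds
x.grad.zero_()
y = x ** 2
y.backward(retain_graph=True)
y.backward()
print(f"Accumulated gradient after two passes: {x.grad}")  # 4 + 4
```

Retaining the graph keeps the saved values in memory, so it is best reserved for cases where a second pass is really needed.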

Using create_graph Parameter

Setting create_graph=True allows computing higher-order derivatives:

python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# First derivative
first_derivative = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"First derivative (dy/dx at x=2): {first_derivative}")

# Second derivative
second_derivative = torch.autograd.grad(first_derivative, x)[0]
print(f"Second derivative (d²y/dx² at x=2): {second_derivative}")

Output:

First derivative (dy/dx at x=2): tensor(12., grad_fn=<MulBackward0>)
Second derivative (d²y/dx² at x=2): tensor(12.)
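The same pattern extends to any order by repeating the call in a loop. The helper `nth_derivative` below is a hypothetical function written for this sketch, not a PyTorch API, and f(x) = x⁴ is an arbitrary test function.

```python
import torch

def nth_derivative(f, x, n):
    """Differentiate f at x n times by repeated calls to torch.autograd.grad (hypothetical helper)."""
    result = f(x)
    for _ in range(n):
        # create_graph=True keeps the derivative itself differentiable
        (result,) = torch.autograd.grad(result, x, create_graph=True)
    return result

x = torch.tensor(2.0, requires_grad=True)
f = lambda t: t ** 4

d1 = nth_derivative(f, x, 1)  # 4x^3
d2 = nth_derivative(f, x, 2)  # 12x^2
d3 = nth_derivative(f, x, 3)  # 24x

print(d1.item(), d2.item(), d3.item())
```

Each extra order builds a larger graph, so cost and memory grow with `n`.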

Common Pitfalls and Tips

  1. Forgetting to Zero Gradients: Always call optimizer.zero_grad() or x.grad.zero_() before computing new gradients.

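  2. In-Place Operations: Be careful with in-place operations (operations that modify a tensor directly). They change the values used in later computations, and autograd raises a RuntimeError during backward() if an in-place operation has overwritten a value it saved for the gradient computation: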

python
import torch

# This works fine
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x + 2
z = y * y
z.sum().backward()
print(f"Gradient without in-place operations: {x.grad}")

# Reset
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x + 2
# In-place operation
y.add_(1) # Allowed here (the backward of x + 2 saves no values), but it changes y and therefore the gradient
z = y * y
z.sum().backward()
print(f"Gradient with in-place operations: {x.grad}")

  3. detach() and with torch.no_grad(): Use these to stop gradient tracking when you don't need it:

python
import torch

# Using detach()
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
detached_y = y.detach() # Detached tensor doesn't track gradients
detached_y.requires_grad # Returns False

# Using with torch.no_grad()
with torch.no_grad():
    # No operations here will track gradients
    z = x * 3
print(f"z.requires_grad: {z.requires_grad}") # False
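Returning to the in-place pitfall from item 2: the sketch below shows a case where an in-place operation does break the backward pass. It uses `exp()`, whose backward step reuses the output of the forward step. The choice of `exp()` is just one convenient example of an operation that saves its result.

```python
import torch

# exp() saves its output for the backward pass, so overwriting it in place is an error
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.exp()
y.add_(1)  # modifies the saved output in place
try:
    y.sum().backward()
    inplace_failed = False
except RuntimeError as err:
    inplace_failed = True
    print(f"RuntimeError: {str(err)[:70]}...")

# The out-of-place version creates a new tensor and works as expected
x2 = torch.tensor([1.0, 2.0], requires_grad=True)
y2 = x2.exp() + 1
y2.sum().backward()
print(f"Gradient with out-of-place add: {x2.grad}")  # equals exp(x2)
```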

Summary

In this tutorial, we've covered how backpropagation works in PyTorch using the autograd system:

  1. PyTorch builds a computational graph to track operations
  2. The backward() method computes gradients using the chain rule
  3. Gradients are stored in the .grad attribute of tensors
  4. Gradients accumulate by default, so they need to be zeroed before each backward pass
  5. PyTorch handles both simple and complex gradient computations automatically

Understanding backpropagation is crucial for effective deep learning model development. PyTorch's autograd system lets you implement even complex models without deriving the gradients by hand.

Additional Resources and Exercises

Additional Resources

  1. PyTorch Autograd Documentation
  2. PyTorch Tutorials on Autograd

Exercises

  1. Basic Backpropagation: Create a tensor x with value 3.0 and compute the gradient of y = x³ - 4x² + 5 with respect to x.

  2. Neural Network Training: Implement a simple 2-layer neural network from scratch using PyTorch's autograd (no nn.Module) and train it on the XOR problem.

  3. Gradient Accumulation: Experiment with gradient accumulation by manually accumulating gradients over multiple forward passes before updating parameters.

  4. Higher-order Derivatives: Compute the third derivative of f(x) = sin(x) at x = 0 using PyTorch's autograd.

By mastering PyTorch's backpropagation mechanism, you'll have a powerful tool for building and training neural networks efficiently.


