PyTorch Backpropagation

Introduction

Backpropagation is the fundamental algorithm behind training neural networks. It is how a network "learns" from data: it measures how much each weight contributed to the error so that the weights can be adjusted to reduce it. PyTorch automates this process through its autograd package, which calculates gradients for us. In this tutorial, we'll explore how backpropagation works in PyTorch, starting from basic concepts and advancing to more complex examples.

Understanding Backpropagation

Backpropagation is simply the application of the chain rule from calculus to calculate gradients of a loss function with respect to the model's parameters. These gradients indicate how to adjust the parameters to reduce the error.

The Chain Rule Refresher

The chain rule states that if y = f(u) and u = g(x), then:

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$

Neural networks consist of many nested functions, so the chain rule is applied repeatedly to calculate gradients through all layers.
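
To make the chain rule concrete before bringing in PyTorch, here is a minimal sketch in plain Python. The functions f and g, the evaluation point, and the step size h are illustrative assumptions; the sketch compares the chain-rule result with a numerical estimate.

```python
import math

def g(x):
    return x ** 2        # inner function: u = g(x)

def f(u):
    return math.sin(u)   # outer function: y = f(u)

x = 1.5
u = g(x)

# Chain rule: dy/dx = dy/du * du/dx = cos(u) * 2x
dy_du = math.cos(u)
du_dx = 2 * x
chain_rule_grad = dy_du * du_dx

# Numerical check with a central finite difference
h = 1e-6
numeric_grad = (f(g(x + h)) - f(g(x - h))) / (2 * h)

print(f"chain rule: {chain_rule_grad:.6f}")
print(f"numeric:    {numeric_grad:.6f}")
```

The two printed values should agree to several decimal places, which is the same check you can later apply to gradients produced by autograd.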

PyTorch's Autograd: The Engine Behind Backpropagation

PyTorch's autograd package provides automatic differentiation for differentiable operations on tensors. As operations run, it records them in a computational graph, and it then traverses this graph backward to calculate gradients.

Key Components in PyTorch Autograd

Tensor: The basic data structure in PyTorch
requires_grad: A flag to indicate if a tensor needs gradient calculation
backward(): The function that initiates backpropagation
grad: The attribute where a tensor's accumulated gradient is stored after backward() runs

Let's see how these components work together:

import torch

# Create a tensor with requires_grad=True
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2  # Some computation

# Print the computation result
print(f"y = {y}")

# Compute gradients
y.backward()

# Access gradients
print(f"dy/dx = {x.grad}")

Output:

y = tensor([4.], grad_fn=<PowBackward0>)
dy/dx = tensor([4.])

In this example:

We created a tensor x with requires_grad=True
We performed a computation y = x²
We called .backward() on y to compute gradients
The gradient of y with respect to x (dy/dx) is 4 (derivative of x² is 2x, and x=2)
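
One detail worth seeing in code is where .grad actually gets populated. The sketch below assumes a small two-step computation (the intermediate name u is hypothetical) and uses retain_grad() to keep the gradient of an intermediate result, which autograd does not store by default.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)  # leaf tensor created by the user
u = x ** 2                                   # intermediate (non-leaf) result
u.retain_grad()                              # ask autograd to keep u's gradient too
y = 3 * u

y.backward()

print(f"x.is_leaf = {x.is_leaf}, u.is_leaf = {u.is_leaf}")
print(f"dy/dx = {x.grad}")  # 3 * 2x = 12 at x = 2
print(f"dy/du = {u.grad}")  # 3
```

Without the retain_grad() call, only the leaf tensor x would have its .grad filled in after backward().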

Step-by-Step Backpropagation Process in PyTorch

Let's break down the backpropagation process in PyTorch:

1. Creating the Computational Graph

When you perform operations on tensors with requires_grad=True, PyTorch automatically constructs a computational graph:

import torch

# Create tensors
x = torch.tensor(1.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

# Build a computational graph
y = w * x + b

# Inspect the result and the function that produced it
print(f"y = {y}")
print(f"Gradient function for y: {y.grad_fn}")

Output:

y = tensor(5., grad_fn=<AddBackward0>)
Gradient function for y: <AddBackward0 object at 0x7f123a4b5f40>
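
The grad_fn shown above is only the last node of the graph. As a rough sketch, you can walk the rest of it through the next_functions attribute of each node. The helper name describe is hypothetical and exists only for this illustration.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
y = w * x + b

def describe(node, depth=0):
    """Recursively print the backward nodes reachable from `node`."""
    if node is None:
        return
    print("  " * depth + type(node).__name__)
    for parent, _ in node.next_functions:
        describe(parent, depth + 1)

describe(y.grad_fn)
```

The printout starts at the addition node and descends to the multiplication node and to the nodes that accumulate gradients into the leaf tensors w, x, and b.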

2. Forward Pass

During the forward pass, PyTorch computes the output of your model and remembers the operations for the backward pass:

# Continuing from previous code
z = y ** 2
print(f"z = {z}")

Output:

z = tensor(25., grad_fn=<PowBackward0>)

3. Backward Pass (Computing Gradients)

When you call .backward(), PyTorch automatically computes all the required gradients:

# Continuing from previous code
z.backward()

# Display the gradients
print(f"dz/dx: {x.grad}")    # dz/dx = dz/dy * dy/dx = 2y * w = 2(5) * 2 = 20
print(f"dz/dw: {w.grad}")    # dz/dw = dz/dy * dy/dw = 2y * x = 2(5) * 1 = 10
print(f"dz/db: {b.grad}")    # dz/db = dz/dy * dy/db = 2y * 1 = 2(5) = 10

Output:

dz/dx: tensor(20.)
dz/dw: tensor(10.)
dz/db: tensor(10.)
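
A quick way to convince yourself that these gradients are right is to compare them with finite differences. The following is a minimal sketch under assumptions of my own: the helper z_fn, the step size eps, and the use of float64 to keep rounding error small.

```python
import torch

def z_fn(x, w, b):
    return (w * x + b) ** 2

# float64 keeps the finite-difference error small
x = torch.tensor(1.0, dtype=torch.float64, requires_grad=True)
w = torch.tensor(2.0, dtype=torch.float64, requires_grad=True)
b = torch.tensor(3.0, dtype=torch.float64, requires_grad=True)

z_fn(x, w, b).backward()

eps = 1e-6
with torch.no_grad():
    # Central differences: perturb one input at a time
    numeric_dx = (z_fn(x + eps, w, b) - z_fn(x - eps, w, b)) / (2 * eps)
    numeric_dw = (z_fn(x, w + eps, b) - z_fn(x, w - eps, b)) / (2 * eps)
    numeric_db = (z_fn(x, w, b + eps) - z_fn(x, w, b - eps)) / (2 * eps)

print(f"dz/dx: autograd = {x.grad.item():.4f}, numeric = {numeric_dx.item():.4f}")
print(f"dz/dw: autograd = {w.grad.item():.4f}, numeric = {numeric_dw.item():.4f}")
print(f"dz/db: autograd = {b.grad.item():.4f}, numeric = {numeric_db.item():.4f}")
```

PyTorch also ships torch.autograd.gradcheck for this kind of comparison; the manual version here is only meant to show the idea.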

Managing Gradients

Accumulating Gradients

By default, PyTorch accumulates gradients. This means gradients are added to existing values rather than replacing them:

import torch

x = torch.tensor(1.0, requires_grad=True)

# First backward pass
y = x * 2
y.backward()
print(f"Gradient after first backward: {x.grad}")

# Second backward pass (gradient accumulates)
y = x * 2
y.backward()
print(f"Gradient after second backward: {x.grad}")

Output:

Gradient after first backward: tensor(2.)
Gradient after second backward: tensor(4.)

Zeroing Gradients

To prevent gradient accumulation, you need to zero gradients before each backward pass:

import torch

x = torch.tensor(1.0, requires_grad=True)

# First backward pass
y = x * 2
y.backward()
print(f"Gradient after first backward: {x.grad}")

# Zero gradients
x.grad.zero_()

# Second backward pass
y = x * 2
y.backward()
print(f"Gradient after zeroing and second backward: {x.grad}")

Output:

Gradient after first backward: tensor(2.)
Gradient after zeroing and second backward: tensor(2.)
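
In practice, gradients are usually cleared through an optimizer rather than tensor by tensor. Here is a minimal sketch, assuming torch.optim.SGD and an arbitrary learning rate of 0.1. Depending on the PyTorch version, zero_grad() either fills the gradients with zeros or resets them to None, and both prevent accumulation.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([x], lr=0.1)  # learning rate is an arbitrary choice

for step in range(2):
    optimizer.zero_grad()   # clear gradients left over from the previous step
    y = x * 2
    y.backward()
    print(f"step {step}: grad = {x.grad}")  # 2 each time, not 2 then 4
    optimizer.step()        # x <- x - lr * grad

print(f"x after two steps: {x.item():.2f}")
```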

Real-World Example: Linear Regression

Let's apply backpropagation to train a simple linear regression model:

import torch
import matplotlib.pyplot as plt

# Set random seed for reproducibility
torch.manual_seed(42)

# Generate synthetic data
x = torch.linspace(0, 10, 100)
y_true = 2*x + 1 + torch.randn(100) * 1.5

# Initialize parameters with requires_grad=True
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Hyperparameters
learning_rate = 0.01
epochs = 100

# Lists to store values for plotting
epoch_list = []
loss_list = []

# Training loop
for epoch in range(epochs):
    # Forward pass
    y_pred = w * x + b
    
    # Compute loss (mean squared error)
    loss = ((y_pred - y_true) ** 2).mean()
    
    # Store values for plotting
    if epoch % 10 == 0:
        epoch_list.append(epoch)
        loss_list.append(loss.item())
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
    
    # Backward pass
    loss.backward()
    
    # Update parameters using gradient descent
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
    
    # Zero gradients for next iteration
    w.grad.zero_()
    b.grad.zero_()

print(f"Final parameters: w = {w.item():.4f}, b = {b.item():.4f}")
print("True parameters: w = 2.0000, b = 1.0000")

# Plot the data and the learned line
plt.figure(figsize=(10, 6))
plt.plot(x.numpy(), y_true.numpy(), 'o', alpha=0.5, label='Data')
plt.plot(x.numpy(), (w * x + b).detach().numpy(), 'r-', label=f'Fit: y = {w.item():.2f}x + {b.item():.2f}')
plt.legend()
plt.title('Linear Regression with PyTorch Backpropagation')
plt.xlabel('x')
plt.ylabel('y')

# Plot loss over epochs
plt.figure(figsize=(10, 6))
plt.plot(epoch_list, loss_list, 'b.-')
plt.title('Loss vs. Epochs')
plt.xlabel('Epoch')
plt.ylabel('Mean Squared Error')
plt.grid(True)
plt.show()

Example output (illustrative only; the exact numbers depend on the random noise and your PyTorch version):

Epoch 0, Loss: 47.7978
Epoch 10, Loss: 13.2450
Epoch 20, Loss: 5.7079
Epoch 30, Loss: 3.9901
...
Epoch 90, Loss: 3.0482
Final parameters: w = 1.9711, b = 1.0510
True parameters: w = 2.0000, b = 1.0000

This example demonstrates the complete backpropagation process:

We created tensors with requires_grad=True
We performed forward calculations to get predictions and loss
We called backward() to compute gradients
We manually updated parameters using gradient descent
We zeroed gradients before the next iteration
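
The same five steps can be sketched with PyTorch's built-in building blocks instead of hand-written updates. This version assumes torch.nn.Linear for the model, nn.MSELoss for the loss, and torch.optim.SGD for the update, and it follows the same data-generating recipe as the example above. It is one possible arrangement, not the only one.

```python
import torch
from torch import nn

torch.manual_seed(42)

# Same recipe as above, reshaped to (N, 1) because nn.Linear expects 2-D input
x = torch.linspace(0, 10, 100).unsqueeze(1)
y_true = 2 * x + 1 + torch.randn(100, 1) * 1.5

model = nn.Linear(1, 1)  # holds w (weight) and b (bias) with requires_grad=True
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

losses = []
for epoch in range(100):
    optimizer.zero_grad()             # zero gradients
    loss = loss_fn(model(x), y_true)  # forward pass and loss
    loss.backward()                   # backward pass
    optimizer.step()                  # gradient descent update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(f"w = {model.weight.item():.4f}, b = {model.bias.item():.4f}")
```

Here the optimizer replaces both the torch.no_grad() update block and the per-tensor zero_() calls from the manual loop.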

Advanced Topics in PyTorch Backpropagation

Computing Gradients for Non-Scalar Outputs

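When the output is not a scalar, you must pass a gradient argument to the .backward() method. This tensor has the same shape as the output and weights each output element; PyTorch then computes the corresponding vector-Jacobian product: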

import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = x * 2

# We need to provide a gradient argument since y is not a scalar
# The gradient argument should match the shape of y
external_grad = torch.tensor([[1.0, 1.0], [1.0, 1.0]])
y.backward(gradient=external_grad)

print(f"x.grad: {x.grad}")

Output:

x.grad: tensor([[2., 2.],
                [2., 2.]])
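
To see what the gradient argument does, the sketch below compares it with reducing the output to a scalar by hand. The weight tensor v is an arbitrary illustrative choice; both routes should give the same x.grad.

```python
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
v = torch.tensor([[1.0, 0.5], [0.0, 2.0]])  # arbitrary illustrative weights

# Option 1: pass v as the gradient argument
y = x ** 2
y.backward(gradient=v)
grad_from_argument = x.grad.clone()

# Option 2: reduce to a scalar first, then call backward() with no argument
x.grad.zero_()
y = x ** 2
(y * v).sum().backward()
grad_from_scalar = x.grad.clone()

print(grad_from_argument)
print(grad_from_scalar)
```

Each entry equals v * 2x, so an all-ones gradient argument is the same as calling y.sum().backward().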

Using retain_graph Parameter

By default, PyTorch frees the graph's intermediate buffers after .backward() runs, so a second backward pass through the same graph raises an error. To backpropagate through a graph more than once, set retain_graph=True:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
z = 3 * y  # z is built on top of y, so both share the x ** 2 part of the graph

# First backward pass
y.backward(retain_graph=True)  # Keep the graph for another backward pass
print(f"dy/dx: {x.grad}")

# Zero gradients before next backward
x.grad.zero_()

# Second backward pass (reuses the shared part of the graph)
z.backward()
print(f"dz/dx: {x.grad}")

Output:

dy/dx: tensor(4.)
dz/dx: tensor(12.)
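
For contrast, here is a small sketch of what happens when the graph is not retained. The flag second_pass_failed is a hypothetical name used only to record the outcome.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2

y.backward()  # frees the tensors saved for the backward pass

try:
    y.backward()  # second pass through the same, already-freed graph
    second_pass_failed = False
except RuntimeError as err:
    second_pass_failed = True
    print(f"RuntimeError: {err}")

print(f"second pass failed: {second_pass_failed}")
```

Rebuilding y with a fresh forward pass, or passing retain_graph=True to the first call, avoids the error.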

Using create_graph Parameter

Setting create_graph=True allows computing higher-order derivatives:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# First derivative
first_derivative = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"First derivative (dy/dx at x=2): {first_derivative}")

# Second derivative
second_derivative = torch.autograd.grad(first_derivative, x)[0]
print(f"Second derivative (d²y/dx² at x=2): {second_derivative}")

Output:

First derivative (dy/dx at x=2): tensor(12., grad_fn=<MulBackward0>)
Second derivative (d²y/dx² at x=2): tensor(12.)
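
The same pattern extends to any order by applying torch.autograd.grad repeatedly. The helper nth_derivative and the test function t ** 4 below are hypothetical, chosen so the results are easy to check by hand.

```python
import torch

def nth_derivative(f, x, n):
    """Differentiate f at x n times by repeatedly calling torch.autograd.grad."""
    out = f(x)
    for _ in range(n):
        # create_graph=True keeps each derivative differentiable for the next round
        (out,) = torch.autograd.grad(out, x, create_graph=True)
    return out

def f(t):
    return t ** 4

x = torch.tensor(2.0, requires_grad=True)

# For f(x) = x^4: f' = 4x^3, f'' = 12x^2, f''' = 24x
for order in range(1, 4):
    print(f"order {order}: {nth_derivative(f, x, order).item()}")
```

At x = 2 the three derivatives are 32, 48, and 48, which matches the closed-form expressions in the comment.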

Common Pitfalls and Tips

Forgetting to Zero Gradients: Always call optimizer.zero_grad() or x.grad.zero_() before computing new gradients.
In-Place Operations: Be careful with in-place operations (operations that modify a tensor directly). If autograd has saved a tensor for the backward pass and you then modify it in place, backward() raises a RuntimeError:

import torch

# This works fine
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x + 2
z = y * y
z.sum().backward()
print(f"Gradient without in-place operations: {x.grad}")

# Reset
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x + 2
z = y * y  # autograd saves y here to compute the gradient of the product
# In-place operation on a tensor that the backward pass still needs
y.add_(1)
try:
    z.sum().backward()
except RuntimeError as e:
    print(f"Backward pass failed: {e}")

detach() and with torch.no_grad(): Use these to stop gradient tracking when you don't need it:

import torch

# Using detach()
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
detached_y = y.detach()  # Detached tensor doesn't track gradients
print(f"detached_y.requires_grad: {detached_y.requires_grad}")  # False

# Using with torch.no_grad()
with torch.no_grad():
    # No operations here will track gradients
    z = x * 3
    print(f"z.requires_grad: {z.requires_grad}")  # False
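
Beyond saving memory, detach() also changes the gradients you get, because a detached tensor is treated as a constant. A minimal sketch, with the values chosen only for illustration:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Both factors tracked: y = x * x, so dy/dx = 2x = 6
y = x * x
y.backward()
grad_tracked = x.grad.clone()

# One factor detached: autograd treats it as the constant 3, so dy/dx = 3
x.grad.zero_()
y = x * x.detach()
y.backward()
grad_detached = x.grad.clone()

print(f"gradient with both factors tracked: {grad_tracked}")
print(f"gradient with one factor detached:  {grad_detached}")
```

This is worth keeping in mind whenever detach() is used inside a loss computation, since it silently removes a path that gradients would otherwise flow through.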

Summary

In this tutorial, we've covered how backpropagation works in PyTorch using the autograd system:

PyTorch builds a computational graph to track operations
The backward() method computes gradients using the chain rule
Gradients are stored in the .grad attribute of tensors
Gradients accumulate by default, so they need to be zeroed before each backward pass
PyTorch handles both simple and complex gradient computations automatically

Understanding backpropagation is crucial for effective deep learning model development. PyTorch's autograd system makes it surprisingly easy to implement even complex models without having to manually derive the gradients.

Additional Resources and Exercises

Exercises

Basic Backpropagation: Create a tensor x with value 3.0 and compute the gradient of y = x³ - 4x² + 5 with respect to x.
Neural Network Training: Implement a simple 2-layer neural network from scratch using PyTorch's autograd (no nn.Module) and train it on the XOR problem.
Gradient Accumulation: Experiment with gradient accumulation by manually accumulating gradients over multiple forward passes before updating parameters.
Higher-order Derivatives: Compute the third derivative of f(x) = sin(x) at x = 0 using PyTorch's autograd.

By mastering PyTorch's backpropagation mechanism, you'll have a powerful tool for building and training neural networks efficiently.
