PyTorch Differentiation
Introduction
Differentiation is a fundamental concept in deep learning that allows neural networks to learn. PyTorch provides a powerful automatic differentiation engine through its autograd
package that handles the complex calculations of gradients for us. This is essential for implementing gradient-based optimization algorithms like Stochastic Gradient Descent (SGD) that power modern machine learning.
In this tutorial, you'll learn:
- How PyTorch tracks operations for automatic differentiation
- Computing gradients using backward()
- Working with the computational graph
- Practical applications of differentiation in PyTorch
The Basics of PyTorch Differentiation
PyTorch builds a computational graph of operations performed on tensors, which it then uses to calculate derivatives. This is done dynamically (during runtime) rather than statically (before execution), giving PyTorch its flexibility.
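To make this concrete, here is a minimal sketch (the function f is made up for illustration) showing that ordinary Python control flow becomes part of the graph, because the graph is rebuilt on every forward pass:

import torch

def f(x):
    # The graph records whichever branch actually runs
    if x.sum() > 0:
        return (x ** 2).sum()
    return (x ** 3).sum()

x = torch.tensor([1.0, 2.0], requires_grad=True)
f(x).backward()
print(x.grad)  # tensor([2., 4.]) -- the derivative of the branch that was taken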
Tracking Operations with requires_grad
To tell PyTorch to track operations on a tensor, set the requires_grad attribute to True:
import torch
# Create a tensor with requires_grad=True
x = torch.tensor([2.0], requires_grad=True)
print(f"x: {x}")
print(f"requires_grad: {x.requires_grad}")
Output:
x: tensor([2.], requires_grad=True)
requires_grad: True
Now PyTorch will track all operations performed on x.
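For instance, the result of a tracked operation carries a grad_fn attribute recording how it was produced (a small sketch, continuing with x from above):

y = x * 3
print(y.grad_fn)  # <MulBackward0 ...> -- the operation that created y
print(x.grad_fn)  # None -- x is a leaf tensor created directly by the user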
Computing Gradients
Let's create a simple computation and calculate gradients:
import torch
# Create tensors
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
# Perform operations
z = x**2 + y**3
# Compute gradients
z.backward()
# Print results
print(f"x: {x}, x.grad: {x.grad}")
print(f"y: {y}, y.grad: {y.grad}")
Output:
x: tensor(2., requires_grad=True), x.grad: tensor(4.)
y: tensor(3., requires_grad=True), y.grad: tensor(27.)
What Happened?
- We created two tensors x and y with requires_grad=True.
- We performed operations to create z = x^2 + y^3.
- We called z.backward(), which computes the gradient of z with respect to tensors with requires_grad=True.
The gradients are:
- x.grad: the derivative of z with respect to x is 2x = 2*2 = 4
- y.grad: the derivative of z with respect to y is 3y^2 = 3*3^2 = 27
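As a cross-check, torch.autograd.grad computes the same derivatives functionally, returning them as a tuple instead of writing into .grad (a minimal sketch):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x**2 + y**3

dz_dx, dz_dy = torch.autograd.grad(z, (x, y))
print(dz_dx, dz_dy)  # tensor(4.) tensor(27.)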
The Computational Graph
When you perform operations on tensors, PyTorch creates a computational graph. Each node in the graph represents an operation, and edges represent data dependencies.
Here's a visualization of the graph for z = x^2 + y^3:
    z = x^2 + y^3
       /      \
    x^2        y^3
     |          |
     x          y
When backward() is called, PyTorch traverses this graph in reverse, applying the chain rule to compute gradients.
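You can peek at this reverse structure through grad_fn and its next_functions, which link each operation back to the nodes that produced its inputs (a small sketch):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x**2 + y**3

print(z.grad_fn)                 # AddBackward0 -- the final operation
print(z.grad_fn.next_functions)  # the PowBackward0 nodes feeding the add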
Working with Gradients
Gradients for Vectors
For vector-valued functions, you need to provide a gradient argument to backward():
import torch
# Create a vector
x = torch.tensor([2.0, 3.0], requires_grad=True)
# Perform operation
y = x * x
# Since y is a vector, we need to provide a gradient argument to backward()
external_grad = torch.tensor([1.0, 1.0])
y.backward(external_grad)
print(f"x: {x}")
print(f"y: {y}")
print(f"x.grad: {x.grad}")
Output:
x: tensor([2., 3.], requires_grad=True)
y: tensor([4., 9.], grad_fn=<MulBackward0>)
x.grad: tensor([4., 6.])
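Passing a vector of ones this way is equivalent to summing y into a scalar and differentiating that, which is often the cleaner way to write it (a small sketch):

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x * x

y.sum().backward()  # same gradients as y.backward(torch.ones_like(y))
print(x.grad)       # tensor([4., 6.])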
Stopping Gradient Tracking
Sometimes, you may want to prevent PyTorch from tracking operations. Use torch.no_grad() for this:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
# Operations inside torch.no_grad() won't be tracked
with torch.no_grad():
    z = x**2 + y**3

print(f"z requires_grad: {z.requires_grad}")

# Try to call backward()
try:
    z.backward()
except RuntimeError as e:
    print(f"Error: {e}")
Output:
z requires_grad: False
Error: element 0 of tensors does not require grad and does not have a grad_fn
Alternatively, you can use the detach() method to create a new tensor without gradient tracking:
# Detach a tensor from the computation graph
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x.detach()
print(f"x requires_grad: {x.requires_grad}")
print(f"y requires_grad: {y.requires_grad}")
Output:
x requires_grad: True
y requires_grad: False
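detach() is handy for cutting one branch of a computation out of the graph, so gradients flow only through the branches you keep (a minimal sketch):

import torch

x = torch.tensor(2.0, requires_grad=True)
scale = x.detach()  # treated as a constant by autograd

z = scale * x       # gradient flows through x only, not through scale
z.backward()
print(x.grad)       # tensor(2.) -- d(scale*x)/dx = scale = 2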
Practical Applications
Training a Simple Neural Network
Let's use differentiation to train a simple neural network for linear regression:
import torch
import torch.nn as nn
# Generate some synthetic data
x = torch.linspace(-5, 5, 100).reshape(-1, 1)
true_w = torch.tensor([2.0])
true_b = torch.tensor([-1.0])
y = true_w * x + true_b + 0.1 * torch.randn_like(x)
# Define a simple linear model
class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(1))
        self.bias = nn.Parameter(torch.randn(1))

    def forward(self, x):
        return self.weight * x + self.bias
# Initialize model and loss function
model = LinearModel()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD([model.weight, model.bias], lr=0.01)
# Training loop
losses = []
for epoch in range(100):
    # Forward pass
    y_pred = model(x)
    loss = criterion(y_pred, y)
    losses.append(loss.item())

    # Backward pass - this is where differentiation happens
    optimizer.zero_grad()  # Clear previous gradients
    loss.backward()        # Compute gradients
    optimizer.step()       # Update parameters

    if epoch % 10 == 0:
        print(f"Epoch {epoch}: W={model.weight.item():.4f}, b={model.bias.item():.4f}, Loss={loss.item():.4f}")
print(f"True values: W={true_w.item()}, b={true_b.item()}")
print(f"Learned values: W={model.weight.item():.4f}, b={model.bias.item():.4f}")
Output:
Epoch 0: W=0.1495, b=0.9011, Loss=20.3201
Epoch 10: W=1.1635, b=-0.1856, Loss=1.6859
Epoch 20: W=1.5700, b=-0.5432, Loss=0.3504
Epoch 30: W=1.7401, b=-0.6900, Loss=0.1537
Epoch 40: W=1.8264, b=-0.7642, Loss=0.0971
Epoch 50: W=1.8747, b=-0.8065, Loss=0.0739
Epoch 60: W=1.9028, b=-0.8318, Loss=0.0625
Epoch 70: W=1.9197, b=-0.8474, Loss=0.0562
Epoch 80: W=1.9301, b=-0.8570, Loss=0.0528
Epoch 90: W=1.9365, b=-0.8630, Loss=0.0508
True values: W=2.0, b=-1.0
Learned values: W=1.9407, b=-0.8668
In this example:
- We define a linear model with learnable parameters (weight and bias).
- During training, we:
  - Make predictions (forward pass)
  - Compute the loss
  - Call loss.backward() to compute gradients
  - Update parameters using the optimizer
This is how automatic differentiation powers the learning process in neural networks.
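Under the hood, optimizer.step() for plain SGD is just a gradient-descent update on each parameter. A hand-rolled sketch of one update (reusing model, criterion, x, and y from the example above):

lr = 0.01

loss = criterion(model(x), y)
loss.backward()

with torch.no_grad():  # the updates themselves must not be tracked
    for p in model.parameters():
        p -= lr * p.grad
        p.grad.zero_()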
Advanced Example: Gradient Accumulation
When dealing with large models, you might need to accumulate gradients over multiple forward passes before updating parameters:
import torch
import torch.nn as nn
# Setup
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Fake data
inputs = torch.randn(100, 10)
targets = torch.randn(100, 1)
# Split into batches
batch_size = 10
num_batches = 4 # We'll only use 4 batches for this example
# Gradient accumulation steps
accumulation_steps = 2
# Training
model.train()
for i in range(num_batches):
    # Get batch
    batch_inputs = inputs[i*batch_size:(i+1)*batch_size]
    batch_targets = targets[i*batch_size:(i+1)*batch_size]

    # Forward pass
    outputs = model(batch_inputs)
    loss = criterion(outputs, batch_targets)

    # Normalize loss to account for accumulation
    loss = loss / accumulation_steps

    # Backward pass
    loss.backward()

    # Update weights after accumulation_steps
    if (i + 1) % accumulation_steps == 0:
        print(f"Updating weights after batch {i+1}")
        optimizer.step()
        optimizer.zero_grad()
Output:
Updating weights after batch 2
Updating weights after batch 4
This technique helps train large models on hardware with limited memory.
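To see why the loss is divided by accumulation_steps, here is a small sketch (with made-up data) checking that two normalized half-batch backward passes accumulate the same gradient as a single full-batch pass when the halves are equal-sized:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
inputs = torch.randn(20, 10)
targets = torch.randn(20, 1)

# Two normalized half-batch passes, gradients accumulating in .grad
for chunk_in, chunk_tgt in zip(inputs.chunk(2), targets.chunk(2)):
    (criterion(model(chunk_in), chunk_tgt) / 2).backward()
accumulated = model.weight.grad.clone()

# One full-batch pass for comparison
model.zero_grad()
criterion(model(inputs), targets).backward()
print(torch.allclose(accumulated, model.weight.grad))  # True (up to float error)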
Advanced Topics
Higher-Order Derivatives
PyTorch can compute higher-order derivatives by differentiating through a gradient computation, using torch.autograd.grad with create_graph=True:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x**3 + x**2
# First derivative: dy/dx = 3x^2 + 2x = 3*2^2 + 2*2 = 12 + 4 = 16
first_derivative = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"First derivative at x=2: {first_derivative}")
# Second derivative: d²y/dx² = 6x + 2 = 6*2 + 2 = 14
second_derivative = torch.autograd.grad(first_derivative, x)[0]
print(f"Second derivative at x=2: {second_derivative}")
Output:
First derivative at x=2: tensor(16., grad_fn=<AddBackward0>)
Second derivative at x=2: tensor(14.)
Note that we set create_graph=True in the first call to enable computation of higher-order derivatives.
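For convenience, torch.autograd.functional.hessian can compute the matrix of second derivatives directly; a small sketch for the same function:

import torch

def f(x):
    return (x**3 + x**2).sum()

x = torch.tensor([2.0])
print(torch.autograd.functional.hessian(f, x))  # tensor([[14.]]) -- 6x + 2 at x = 2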
Jacobian Products
PyTorch's backward() computes vector-Jacobian products, which is how backpropagation works:
import torch
x = torch.randn(3, requires_grad=True)
y = torch.stack([x[0]**2, x[1]**3, x[2]**4])  # stack keeps the graph; torch.tensor() would detach
# Vector for vector-Jacobian product
v = torch.tensor([1.0, 1.0, 1.0])
# Compute vector-Jacobian product
y.backward(v)
# Analytical gradients: dy_i/dx_i
analytical_grad = torch.tensor([
    2 * x[0].item(),      # d(x[0]^2)/dx[0] = 2*x[0]
    3 * x[1].item()**2,   # d(x[1]^3)/dx[1] = 3*x[1]^2
    4 * x[2].item()**3,   # d(x[2]^4)/dx[2] = 4*x[2]^3
])
print(f"x: {x}")
print(f"Computed gradient: {x.grad}")
print(f"Analytical gradient: {analytical_grad}")
Common Pitfalls and Solutions
1. Gradient Accumulation
By default, gradients accumulate each time you call backward(). Always zero gradients before a new backward pass:
import torch
x = torch.tensor(2.0, requires_grad=True)
# First backward pass
y = x**2
y.backward()
print(f"After first backward: {x.grad}")
# If you don't zero gradients, they'll accumulate
y = x**2
y.backward()
print(f"After second backward (accumulated): {x.grad}")
# Zero gradients
x.grad.zero_()
y = x**2
y.backward()
print(f"After zeroing and backward: {x.grad}")
Output:
After first backward: tensor(4.)
After second backward (accumulated): tensor(8.)
After zeroing and backward: tensor(4.)
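In a full training loop you would normally call optimizer.zero_grad() rather than zeroing each tensor by hand; a small sketch (note that recent PyTorch versions reset gradients to None rather than zero by default):

import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(4, 2)).sum().backward()
optimizer.zero_grad()  # resets gradients for every parameter it manages
print(model.weight.grad)  # None in recent versions (set_to_none=True by default)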
2. In-place Operations
Be careful with in-place operations, as they can lead to incorrect gradient computation:
import torch
# This works fine
x = torch.tensor(2.0, requires_grad=True)
y = x * 2
z = y * y
z.backward()
print(f"Correct gradient: {x.grad}")
# This will cause problems: in-place modification of a leaf tensor
x = torch.tensor(2.0, requires_grad=True)
try:
    x *= 2  # In-place operation on a leaf that requires grad - BAD!
except RuntimeError as e:
    print(f"Error with in-place operation: {e}")
Output:
Correct gradient: tensor(16.)
Error with in-place operation: a leaf Variable that requires grad is being used in an in-place operation.
Summary
In this tutorial, you've learned:
- How PyTorch's automatic differentiation works using the autograd engine
- How to compute gradients using backward() and control gradient tracking
- How computational graphs represent operations for differentiation
- How to apply differentiation to train neural networks
- Advanced techniques like gradient accumulation and higher-order derivatives
- Common pitfalls and best practices when working with gradients
Understanding differentiation is crucial for implementing and debugging deep learning algorithms. PyTorch's automatic differentiation makes deep learning accessible by handling the complex mathematics for us, allowing us to focus on the model architecture and training process.
Exercises
- Create a tensor of shape (3,3) and compute the gradient of the sum of its elements.
- Implement a small neural network using PyTorch's differentiation to solve a classification problem.
- Experiment with different optimization algorithms (SGD, Adam, RMSprop) and observe how they use gradients differently.
- Implement a function that computes both first and second derivatives of f(x) = sin(x) at different points.
- Create a visualization of the gradient flow in a simple neural network during training.