PyTorch Differentiation

Introduction

Differentiation is what lets neural networks learn: training adjusts each parameter in the direction that reduces a loss, and that direction is given by the gradient of the loss. PyTorch provides an automatic differentiation engine, the autograd package, that computes these gradients for us. This is essential for gradient-based optimization algorithms such as Stochastic Gradient Descent (SGD), which power modern machine learning.

In this tutorial, you'll learn:

  • How PyTorch tracks operations for automatic differentiation
  • Computing gradients using backward()
  • Working with the computational graph
  • Practical applications of differentiation in PyTorch

The Basics of PyTorch Differentiation

PyTorch builds a computational graph of operations performed on tensors, which it then uses to calculate derivatives. This is done dynamically (during runtime) rather than statically (before execution), giving PyTorch its flexibility.
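As a minimal sketch of what "dynamic" means here, the function below (`piecewise`, a hypothetical name chosen for this illustration) takes a different branch depending on the input value, and autograd differentiates whichever branch actually ran:

```python
import torch

def piecewise(x):
    # Ordinary Python control flow decides which operations get recorded.
    if x.item() > 0:
        return x ** 2      # derivative: 2x
    return -x              # derivative: -1

grads = []
for value in (3.0, -3.0):
    x = torch.tensor(value, requires_grad=True)
    piecewise(x).backward()
    grads.append(x.grad.item())

print(grads)  # [6.0, -1.0]
```

The graph is rebuilt on every call, so the two inputs produce two different graphs and two different derivatives.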

Tracking Operations with requires_grad

To tell PyTorch to track operations on a tensor, set the requires_grad attribute to True:

python
import torch

# Create a tensor with requires_grad=True
x = torch.tensor([2.0], requires_grad=True)
print(f"x: {x}")
print(f"requires_grad: {x.requires_grad}")

Output:

x: tensor([2.], requires_grad=True)
requires_grad: True

Now PyTorch will track all operations performed on x.
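One way to see the tracking at work is to inspect the `grad_fn` and `is_leaf` attributes. The following is a small sketch; the tensor `c` and the results `y` and `w` are illustrative names that do not appear elsewhere in this tutorial:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
c = torch.tensor([5.0])          # plain tensor, not tracked

y = x * 3                        # depends on x, so it is tracked
w = c * 3                        # depends only on c, so it is not

print(x.is_leaf, x.grad_fn)                     # True None
print(y.requires_grad, y.grad_fn is not None)   # True True
print(w.requires_grad, w.grad_fn)               # False None
```

A tensor you create directly is a leaf and has no `grad_fn`. A tensor produced by an operation on a tracked tensor records the function that created it.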

Computing Gradients

Let's create a simple computation and calculate gradients:

python
import torch

# Create tensors
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Perform operations
z = x**2 + y**3

# Compute gradients
z.backward()

# Print results
print(f"x: {x}, x.grad: {x.grad}")
print(f"y: {y}, y.grad: {y.grad}")

Output:

x: tensor(2., requires_grad=True), x.grad: tensor(4.)
y: tensor(3., requires_grad=True), y.grad: tensor(27.)

What Happened?

  1. We created two tensors x and y with requires_grad=True.
  2. We performed operations to create z = x^2 + y^3.
  3. We called z.backward(), which computes the gradient of z with respect to every leaf tensor that has requires_grad=True and stores the result in that tensor's .grad attribute.

The gradients are:

  • x.grad: The derivative of z with respect to x is 2x = 2*2 = 4
  • y.grad: The derivative of z with respect to y is 3y^2 = 3*3^2 = 27
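If you want to double-check values like these, a central finite difference is a simple sanity test. This is only a sketch: the helper `f`, the step size `eps`, and the use of float64 are choices made for this illustration.

```python
import torch

def f(x, y):
    return x**2 + y**3

# float64 keeps the rounding error of the finite difference small.
x = torch.tensor(2.0, dtype=torch.float64, requires_grad=True)
y = torch.tensor(3.0, dtype=torch.float64, requires_grad=True)
f(x, y).backward()

eps = 1e-6
with torch.no_grad():
    # Central differences approximate each partial derivative numerically.
    dx_numeric = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dy_numeric = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)

print(f"autograd: {x.grad.item():.6f}, numeric: {dx_numeric.item():.6f}")
print(f"autograd: {y.grad.item():.6f}, numeric: {dy_numeric.item():.6f}")
```

Both pairs should agree to several decimal places, close to 4 and 27.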

The Computational Graph

When you perform operations on tensors, PyTorch creates a computational graph. Each node in the graph represents an operation, and edges represent data dependencies.

Here's a visualization of the graph for z = x^2 + y^3:

      z = x^2 + y^3
        /       \
      x^2       y^3
       |         |
       x         y

When backward() is called, PyTorch traverses this graph in reverse, applying the chain rule to compute gradients.
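The recorded graph can be inspected from Python by following `grad_fn` and its `next_functions` attribute. The sketch below uses a hypothetical helper, `walk`, written for this illustration. The exact node names it prints are an implementation detail and may differ between PyTorch versions.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x**2 + y**3

def walk(fn, depth=0, lines=None):
    """Collect the backward-node names reachable from fn, depth first."""
    if lines is None:
        lines = []
    lines.append("  " * depth + type(fn).__name__)
    for parent, _ in fn.next_functions:
        if parent is not None:
            walk(parent, depth + 1, lines)
    return lines

graph_lines = walk(z.grad_fn)
print("\n".join(graph_lines))
```

The root node corresponds to the addition, its two children to the power operations, and the nodes at the bottom are the ones that accumulate gradients into x.grad and y.grad.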

Working with Gradients

Gradients for Vectors

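When the output is a vector rather than a scalar, backward() needs a gradient argument with the same shape as the output. PyTorch then computes the vector-Jacobian product of that argument with the Jacobian of the output: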

python
import torch

# Create a vector
x = torch.tensor([2.0, 3.0], requires_grad=True)

# Perform operation
y = x * x

# Since y is a vector, we need to provide a gradient argument to backward()
external_grad = torch.tensor([1.0, 1.0])
y.backward(external_grad)

print(f"x: {x}")
print(f"y: {y}")
print(f"x.grad: {x.grad}")

Output:

x: tensor([2., 3.], requires_grad=True)
y: tensor([4., 9.], grad_fn=<MulBackward0>)
x.grad: tensor([4., 6.])
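The gradient argument acts as a set of weights on the output elements. As a sketch (the weight vector `v` is chosen arbitrarily for illustration), passing `v` to backward() gives the same result as reducing the output to the scalar `sum(v * y)` and differentiating that:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5])

# Route 1: pass v as the gradient argument.
(x * x).backward(v)
grad_from_backward = x.grad.clone()

# Route 2: reduce to a scalar with the same weights, then call backward().
x.grad = None
((x * x) * v).sum().backward()
grad_from_sum = x.grad.clone()

print(grad_from_backward)  # tensor([4., 3.])
print(grad_from_sum)       # tensor([4., 3.])
```

With all weights set to 1, as in the example above, this reduces to the gradient of y.sum().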

Stopping Gradient Tracking

Sometimes you want to prevent PyTorch from tracking operations, for example during inference or while updating parameters by hand. Use torch.no_grad() for this:

python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Operations inside torch.no_grad() won't be tracked
with torch.no_grad():
    z = x**2 + y**3

print(f"z requires_grad: {z.requires_grad}")

# Try to call backward()
try:
    z.backward()
except RuntimeError as e:
    print(f"Error: {e}")

Output:

z requires_grad: False
Error: element 0 of tensors does not require grad and does not have a grad_fn

Alternatively, you can use the detach() method to create a new tensor without gradient tracking:

python
# Detach a tensor from the computation graph
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x.detach()

print(f"x requires_grad: {x.requires_grad}")
print(f"y requires_grad: {y.requires_grad}")

Output:

x requires_grad: True
y requires_grad: False
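A common use of detach() is to treat part of an expression as a constant during differentiation. The sketch below is an illustration under that assumption and compares the gradient of x * x with and without detaching one factor:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Both factors tracked: d(x * x)/dx = 2x = 4
(x * x).backward()
full_grad = x.grad.item()

# One factor detached: it is treated as the constant 2, so d(2 * x)/dx = 2
x.grad = None
(x * x.detach()).backward()
partial_grad = x.grad.item()

print(full_grad, partial_grad)  # 4.0 2.0
```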

Practical Applications

Training a Simple Neural Network

Let's use differentiation to train a simple neural network for linear regression:

python
import torch
import torch.nn as nn

# Generate some synthetic data
x = torch.linspace(-5, 5, 100).reshape(-1, 1)
true_w = torch.tensor([2.0])
true_b = torch.tensor([-1.0])
y = true_w * x + true_b + 0.1 * torch.randn_like(x)

# Define a simple linear model
class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(1))
        self.bias = nn.Parameter(torch.randn(1))

    def forward(self, x):
        return self.weight * x + self.bias

# Initialize model and loss function
model = LinearModel()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD([model.weight, model.bias], lr=0.01)

# Training loop
losses = []
for epoch in range(100):
    # Forward pass
    y_pred = model(x)
    loss = criterion(y_pred, y)
    losses.append(loss.item())

    # Backward pass - this is where differentiation happens
    optimizer.zero_grad() # Clear previous gradients
    loss.backward() # Compute gradients
    optimizer.step() # Update parameters

    if epoch % 10 == 0:
        print(f"Epoch {epoch}: W={model.weight.item():.4f}, b={model.bias.item():.4f}, Loss={loss.item():.4f}")

print(f"True values: W={true_w.item()}, b={true_b.item()}")
print(f"Learned values: W={model.weight.item():.4f}, b={model.bias.item():.4f}")

Output (exact values vary between runs, because the parameters and the noise are drawn at random):

Epoch 0: W=0.1495, b=0.9011, Loss=20.3201
Epoch 10: W=1.1635, b=-0.1856, Loss=1.6859
Epoch 20: W=1.5700, b=-0.5432, Loss=0.3504
Epoch 30: W=1.7401, b=-0.6900, Loss=0.1537
Epoch 40: W=1.8264, b=-0.7642, Loss=0.0971
Epoch 50: W=1.8747, b=-0.8065, Loss=0.0739
Epoch 60: W=1.9028, b=-0.8318, Loss=0.0625
Epoch 70: W=1.9197, b=-0.8474, Loss=0.0562
Epoch 80: W=1.9301, b=-0.8570, Loss=0.0528
Epoch 90: W=1.9365, b=-0.8630, Loss=0.0508
True values: W=2.0, b=-1.0
Learned values: W=1.9407, b=-0.8668

In this example:

  1. We define a linear model with learnable parameters (weight and bias).
  2. During training, we:
    • Make predictions (forward pass)
    • Compute loss
    • Call loss.backward() to compute gradients
    • Update parameters using the optimizer

This is how automatic differentiation powers the learning process in neural networks.
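To make the role of the optimizer less opaque, here is a sketch of the same idea with the update written by hand. It assumes plain gradient descent (no momentum or weight decay) and uses noise-free toy data invented for this illustration:

```python
import torch

# Toy data following y = 2x exactly (an assumption made for this sketch).
x = torch.tensor([1.0, 2.0, 3.0])
y = 2.0 * x

w = torch.tensor(0.0, requires_grad=True)
lr = 0.05

for _ in range(100):
    loss = ((w * x - y) ** 2).mean()
    loss.backward()              # fills w.grad with d(loss)/dw
    with torch.no_grad():        # the update itself must not be tracked
        w -= lr * w.grad         # plain gradient descent step
    w.grad.zero_()               # clear the gradient for the next iteration

print(f"Learned w: {w.item():.4f}")
```

The three lines after loss.backward() play the roles of optimizer.step() and optimizer.zero_grad() in the training loop above.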

Advanced Example: Gradient Accumulation

When dealing with large models, you might need to accumulate gradients over multiple forward passes before updating parameters:

python
import torch
import torch.nn as nn

# Setup
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Fake data
inputs = torch.randn(100, 10)
targets = torch.randn(100, 1)

# Split into batches
batch_size = 10
num_batches = 4 # We'll only use 4 batches for this example

# Gradient accumulation steps
accumulation_steps = 2

# Training
model.train()
for i in range(num_batches):
    # Get batch
    batch_inputs = inputs[i*batch_size:(i+1)*batch_size]
    batch_targets = targets[i*batch_size:(i+1)*batch_size]

    # Forward pass
    outputs = model(batch_inputs)
    loss = criterion(outputs, batch_targets)

    # Normalize loss to account for accumulation
    loss = loss / accumulation_steps

    # Backward pass
    loss.backward()

    # Update weights after accumulation_steps
    if (i + 1) % accumulation_steps == 0:
        print(f"Updating weights after batch {i+1}")
        optimizer.step()
        optimizer.zero_grad()

Output:

Updating weights after batch 2
Updating weights after batch 4

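This technique lets you train with a larger effective batch size (here 2 x 10 = 20 samples per update) on hardware with limited memory, because only one mini-batch is processed at a time. Dividing each loss by accumulation_steps keeps the accumulated gradient on the same scale as the gradient of a single large batch.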
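The claim that accumulation reproduces a large-batch gradient can be checked directly. The following sketch assumes equally sized batches and a loss that averages over the batch (the default for nn.MSELoss). The seed and tensor shapes are arbitrary choices for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
inputs = torch.randn(20, 10)
targets = torch.randn(20, 1)

# Reference: one backward pass over all 20 samples.
criterion(model(inputs), targets).backward()
full_grad = model.weight.grad.clone()

# Accumulated: two batches of 10, each loss divided by the number of steps.
model.zero_grad()
accumulation_steps = 2
for chunk_x, chunk_y in zip(inputs.chunk(2), targets.chunk(2)):
    loss = criterion(model(chunk_x), chunk_y) / accumulation_steps
    loss.backward()
accumulated_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accumulated_grad, atol=1e-5))
```

If the batches had different sizes, dividing by accumulation_steps would only approximate the full-batch average.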

Advanced Topics

Higher-Order Derivatives

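PyTorch can compute higher-order derivatives by differentiating a gradient again. The example below uses torch.autograd.grad() with create_graph=True, so that the first derivative is itself part of a graph that can be differentiated: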

python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**3 + x**2

# First derivative: dy/dx = 3x^2 + 2x = 3*2^2 + 2*2 = 12 + 4 = 16
first_derivative = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"First derivative at x=2: {first_derivative}")

# Second derivative: d²y/dx² = 6x + 2 = 6*2 + 2 = 14
second_derivative = torch.autograd.grad(first_derivative, x)[0]
print(f"Second derivative at x=2: {second_derivative}")

Output:

First derivative at x=2: tensor(16., grad_fn=<AddBackward0>)
Second derivative at x=2: tensor(14.)

Note that we set create_graph=True in the first call to enable computation of higher-order derivatives.
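The same pattern extends to any order by repeating the call. The helper below, `nth_derivative`, is a hypothetical function written for this sketch and not part of PyTorch. It keeps create_graph=True on every step so that each result can be differentiated again:

```python
import torch

def nth_derivative(f, x, n):
    """Differentiate f at x a total of n times using torch.autograd.grad."""
    result = f(x)
    for _ in range(n):
        # create_graph=True records the gradient computation itself.
        (result,) = torch.autograd.grad(result, x, create_graph=True)
    return result

def f(x):
    return x**3 + x**2

x = torch.tensor(2.0, requires_grad=True)
d1 = nth_derivative(f, x, 1).item()   # 3x^2 + 2x = 16
d2 = nth_derivative(f, x, 2).item()   # 6x + 2    = 14
d3 = nth_derivative(f, x, 3).item()   # 6

print(d1, d2, d3)  # 16.0 14.0 6.0
```

Keeping the graph at every step costs extra memory, so in practice you would pass create_graph=True only on the steps that will be differentiated again.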

Jacobian Products

PyTorch's backward() computes vector-Jacobian products, which is how backpropagation works:

python
import torch

x = torch.randn(3, requires_grad=True)
# torch.stack keeps y connected to x; torch.tensor([...]) would copy the values and drop the graph
y = torch.stack([x[0]**2, x[1]**3, x[2]**4])

# Vector for vector-Jacobian product
v = torch.tensor([1.0, 1.0, 1.0])

# Compute vector-Jacobian product
y.backward(v)

# Analytical gradients: dy_i/dx_i
analytical_grad = torch.tensor([
    2 * x[0].item(), # d(x[0]^2)/dx[0] = 2*x[0]
    3 * x[1].item()**2, # d(x[1]^3)/dx[1] = 3*x[1]^2
    4 * x[2].item()**3 # d(x[2]^4)/dx[2] = 4*x[2]^3
])

print(f"x: {x}")
print(f"Computed gradient: {x.grad}")
print(f"Analytical gradient: {analytical_grad}")
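To see the "vector times Jacobian" relationship explicitly, you can build the full Jacobian with torch.autograd.functional.jacobian and multiply it by v yourself. This is a sketch with a fixed input chosen for illustration; the function `f` restates the same element-wise powers.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    return torch.stack([x[0]**2, x[1]**3, x[2]**4])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 1.0, 1.0])

J = jacobian(f, x)       # 3x3 matrix, diagonal for this function
f(x).backward(v)         # vector-Jacobian product, stored in x.grad

print(J)
print(torch.allclose(v @ J, x.grad))
```

Building the full Jacobian is fine for a three-element example, but backward() never materializes it, which is what makes backpropagation affordable for large models.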

Common Pitfalls and Solutions

1. Gradient Accumulation

By default, gradients accumulate each time you call backward(): new values are added to .grad instead of replacing it. Unless you are accumulating on purpose, zero the gradients before each new backward pass:

python
import torch

x = torch.tensor(2.0, requires_grad=True)

# First backward pass
y = x**2
y.backward()
print(f"After first backward: {x.grad}")

# If you don't zero gradients, they'll accumulate
y = x**2
y.backward()
print(f"After second backward (accumulated): {x.grad}")

# Zero gradients
x.grad.zero_()
y = x**2
y.backward()
print(f"After zeroing and backward: {x.grad}")

Output:

After first backward: tensor(4.)
After second backward (accumulated): tensor(8.)
After zeroing and backward: tensor(4.)
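If you only need the gradient as a return value, torch.autograd.grad() is one way to avoid the issue, since it returns the gradients instead of adding them to .grad. A minimal sketch:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Each call returns a fresh gradient; nothing is accumulated on x.
(first,) = torch.autograd.grad(x**2, x)
(second,) = torch.autograd.grad(x**2, x)

print(first, second)  # tensor(4.) tensor(4.)
print(x.grad)         # None
```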

2. In-place Operations

Be careful with in-place operations. Autograd raises an error when an in-place operation modifies a leaf tensor that requires gradients, or overwrites a value that the backward pass still needs:

python
import torch

# This works fine
x = torch.tensor(2.0, requires_grad=True)
y = x * 2
z = y * y
z.backward()
print(f"Correct gradient: {x.grad}")  # z = 4x^2, so dz/dx = 8x = 16 at x = 2

# This will cause problems
x = torch.tensor(2.0, requires_grad=True)
try:
    x *= 2 # In-place operation on a leaf tensor that requires grad - BAD!
except RuntimeError as e:
    print(f"Error with in-place operation: {e}")

Output:

Correct gradient: tensor(16.)
Error with in-place operation: a leaf Variable that requires grad is being used in an in-place operation.
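In-place operations on intermediate results can also fail, but only when the overwritten value is needed for the backward pass. The sketch below is an illustration built on exp(), whose backward pass reuses its own output. The exact error text is not shown because it may differ between PyTorch versions.

```python
import torch

x = torch.tensor(0.0, requires_grad=True)

# exp() saves its output for the backward pass; add_() then overwrites it.
y = x.exp()
y.add_(1)
error_message = None
try:
    y.backward()
except RuntimeError as e:
    error_message = str(e)
print(f"Raised an error: {error_message is not None}")

# Out-of-place version: y + 1 creates a new tensor and leaves exp()'s output intact.
x.grad = None
y = x.exp()
y = y + 1
y.backward()
print(x.grad)  # tensor(1.)
```

When in doubt, prefer the out-of-place form; it costs one extra tensor but never invalidates values that autograd has saved.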

Summary

In this tutorial, you've learned:

  • How PyTorch's automatic differentiation works using the autograd engine
  • How to compute gradients using backward() and control gradient tracking
  • How computational graphs represent operations for differentiation
  • How to apply differentiation to train neural networks
  • Advanced techniques like gradient accumulation and higher-order derivatives
  • Common pitfalls and best practices when working with gradients

Understanding differentiation is crucial for implementing and debugging deep learning algorithms. PyTorch's automatic differentiation makes deep learning accessible by handling the complex mathematics for us, allowing us to focus on the model architecture and training process.

Exercises

  1. Create a tensor of shape (3,3) and compute the gradient of the sum of its elements.
  2. Implement a small neural network using PyTorch's differentiation to solve a classification problem.
  3. Experiment with different optimization algorithms (SGD, Adam, RMSprop) and observe how they use gradients differently.
  4. Implement a function that computes both first and second derivatives of f(x) = sin(x) at different points.
  5. Create a visualization of the gradient flow in a simple neural network during training.

