PyTorch Autograd Basics

Introduction

PyTorch's autograd is one of its most powerful features and the backbone of neural network training. It provides automatic differentiation for differentiable tensor operations, so gradients can be computed without deriving them by hand. This is essential for gradient-based optimization algorithms such as stochastic gradient descent, which are at the heart of modern deep learning.

In this tutorial, we'll explore the fundamental concepts of PyTorch's autograd system and learn how to use it to build simple machine learning models.

What is Automatic Differentiation?

Before diving into PyTorch's implementation, let's understand what automatic differentiation is.

Automatic differentiation is a technique for computing derivatives of functions specified by computer programs. Rather than deriving a symbolic expression for the derivative (as in pen-and-paper calculus) or approximating it numerically with finite differences, automatic differentiation records the elementary operations as they execute and applies the chain rule to them. The resulting derivatives are exact up to floating-point rounding.
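
To make the contrast concrete, here is a minimal sketch comparing the three approaches on one function. The function f(x) = x^3 + 2x, the evaluation point, and the step size h are illustrative choices, not part of the tutorial's own examples.

```python
import torch

def f(x):
    # Illustrative function (an assumption for this sketch): f(x) = x^3 + 2x
    return x ** 3 + 2 * x

x0 = 2.0

# Automatic differentiation: record the operations, then apply the chain rule
x = torch.tensor(x0, dtype=torch.float64, requires_grad=True)
f(x).backward()
autograd_derivative = x.grad.item()

# Numerical approximation: central finite difference with a hand-picked step size h
h = 1e-4
numerical_derivative = (f(x0 + h) - f(x0 - h)) / (2 * h)

# Symbolic result worked out by hand: f'(x) = 3x^2 + 2
analytic_derivative = 3 * x0 ** 2 + 2

print(f"autograd:  {autograd_derivative}")
print(f"numerical: {numerical_derivative}")
print(f"analytic:  {analytic_derivative}")
```

The autograd value agrees with the hand-derived one, while the finite-difference value carries a small truncation error that depends on h.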

In deep learning, we need to compute gradients of loss functions with respect to model parameters to optimize these models. Manual calculation of these gradients would be extremely tedious and error-prone, especially for complex models. This is where automatic differentiation becomes essential.

Getting Started with Autograd

Let's begin by importing PyTorch and creating some basic tensors:

python
import torch

# Create a tensor with requires_grad=True to track computations
x = torch.tensor([5.0], requires_grad=True)
y = torch.tensor([3.0])  # requires_grad defaults to False, so y is not tracked

The key parameter here is requires_grad=True, which tells PyTorch to track all operations on this tensor for automatic differentiation.
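
As a quick check of how tracking propagates, the sketch below combines a tracked and an untracked tensor. The names product and shifted are hypothetical and used only for illustration.

```python
import torch

x = torch.tensor([5.0], requires_grad=True)  # tracked
y = torch.tensor([3.0])                      # not tracked (the default)

product = x * y  # depends on a tracked tensor, so the result is tracked too
shifted = y + 1  # depends only on untracked tensors

print(x.requires_grad, y.requires_grad)              # True False
print(product.requires_grad, shifted.requires_grad)  # True False
print(product.grad_fn is not None)                   # True: the multiplication was recorded
```

A result requires gradients as soon as any of its inputs does.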

Forward and Backward Passes

The computational process in PyTorch autograd consists of two main phases:

- Forward Pass: Compute the output (e.g., the loss) by executing operations on tensors
- Backward Pass: Compute gradients of the output with respect to the parameters by traversing the computation graph backward

Let's see a simple example:

python
# Forward pass: compute z = x^2 + 2x + 1
z = x**2 + 2*x + 1

# Print result
print(f"z = {z.item()}")

# Backward pass: Compute gradients
z.backward()

# Get the gradient of z with respect to x
print(f"Gradient of z with respect to x: {x.grad.item()}")

Output:

z = 36.0
Gradient of z with respect to x: 12.0

In this example:

- We computed z = x^2 + 2x + 1 for x = 5
- The result is z = 5^2 + 2*5 + 1 = 25 + 10 + 1 = 36
- We called z.backward() to compute gradients
- The gradient of z with respect to x is dz/dx = 2x + 2 = 2*5 + 2 = 12
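
The same check can be repeated at other points. The sketch below uses torch.autograd.grad, which returns the gradients directly instead of storing them in .grad. The helper poly, the dictionary grads, and the chosen test points are illustrative assumptions.

```python
import torch

def poly(x):
    # Same function as above: z = x^2 + 2x + 1
    return x**2 + 2*x + 1

grads = {}
for value in [-1.0, 0.0, 5.0]:
    x = torch.tensor([value], requires_grad=True)
    z = poly(x)
    # torch.autograd.grad returns a tuple with one gradient per input
    (grad,) = torch.autograd.grad(z, x)
    grads[value] = grad.item()
    print(f"x = {value}: autograd = {grad.item()}, 2x + 2 = {2 * value + 2}")
```

At every point the autograd result matches the hand-derived formula dz/dx = 2x + 2.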

The Computational Graph

PyTorch builds a computational graph dynamically as operations are performed. Each operation creates new tensors that are functions of the input tensors. These tensors keep track of their "creator" operations and the relationships between them.
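
The "creator" links can be inspected directly through the grad_fn attribute. This is a minimal sketch with made-up values. The printed node names are what recent PyTorch versions report and should be read as an implementation detail rather than a stable API.

```python
import torch

a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)
c = a + b
d = c * a

# Tensors created by the user are leaves: no operation created them
print(a.is_leaf, a.grad_fn)      # True None

# Results of operations remember the operation that produced them
print(type(c.grad_fn).__name__)  # AddBackward0
print(type(d.grad_fn).__name__)  # MulBackward0

# next_functions links each node to the nodes that produced its inputs
print(d.grad_fn.next_functions)
```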

Let's visualize this with a slightly more complex example:

python
# Create tensors
a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)

# Build a computational graph
c = a + b
d = a * b
e = c * d

# Print values
print(f"a = {a.item()}, b = {b.item()}")
print(f"c = a + b = {c.item()}")
print(f"d = a * b = {d.item()}")
print(f"e = c * d = {e.item()}")

# Compute gradients
e.backward()

# Print gradients
print(f"Gradient of e with respect to a: {a.grad.item()}")
print(f"Gradient of e with respect to b: {b.grad.item()}")

Output:

a = 2.0, b = 3.0
c = a + b = 5.0
d = a * b = 6.0
e = c * d = 30.0
Gradient of e with respect to a: 21.0
Gradient of e with respect to b: 16.0

Let's analyze why these gradients make sense. Since c = a + b and d = a * b, we have ∂c/∂a = ∂c/∂b = 1, ∂d/∂a = b, and ∂d/∂b = a. Applying the product rule to e = c * d:

For a:
- ∂e/∂a = d*(∂c/∂a) + c*(∂d/∂a) = d*1 + c*b = 6 + 5*3 = 21
For b:
- ∂e/∂b = d*(∂c/∂b) + c*(∂d/∂b) = d*1 + c*a = 6 + 5*2 = 16
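
As a sanity check, expanding e = (a + b)(ab) = a^2*b + a*b^2 gives the closed forms ∂e/∂a = 2ab + b^2 and ∂e/∂b = a^2 + 2ab. The short sketch below compares them with what autograd reports. The variable names expected_grad_a and expected_grad_b are illustrative.

```python
import torch

a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)
e = (a + b) * (a * b)
e.backward()

# Closed forms from expanding e = a^2*b + a*b^2
expected_grad_a = 2 * 2.0 * 3.0 + 3.0**2  # 2ab + b^2 = 21
expected_grad_b = 2.0**2 + 2 * 2.0 * 3.0  # a^2 + 2ab = 16

print(a.grad.item(), expected_grad_a)  # 21.0 21.0
print(b.grad.item(), expected_grad_b)  # 16.0 16.0
```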

Gradient Accumulation

PyTorch accumulates gradients by default: each call to backward() adds to the values already stored in .grad instead of overwriting them. This is useful when a large batch is split into smaller chunks whose gradients are summed before a single parameter update. When accumulation is not what you want, zero the gradients before each backward pass.

python
# Create a tensor
x = torch.tensor([1.0], requires_grad=True)

# Compute multiple backward passes
for i in range(3):
    y = x * 2
    y.backward()
    print(f"Pass {i+1}, x.grad = {x.grad.item()}")
    
# Reset gradients
x.grad.zero_()
print(f"After reset: x.grad = {x.grad.item()}")

# Compute again with gradient zeroing between passes
for i in range(3):
    y = x * 2
    y.backward()
    print(f"Pass {i+1} with zeroing, x.grad = {x.grad.item()}")
    x.grad.zero_()

Output:

Pass 1, x.grad = 2.0
Pass 2, x.grad = 4.0
Pass 3, x.grad = 6.0
After reset: x.grad = 0.0
Pass 1 with zeroing, x.grad = 2.0
Pass 2 with zeroing, x.grad = 2.0
Pass 3 with zeroing, x.grad = 2.0

Notice how the gradients accumulate in the first loop but remain constant in the second loop because we're zeroing them between iterations.
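
Accumulation is what makes it possible to process a batch in smaller pieces. The sketch below, which assumes a toy one-parameter model and random data invented for this illustration, shows that gradients accumulated over two micro-batches match the gradient of the full batch.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 1)
y = torch.randn(4, 1)
w = torch.tensor([0.5], requires_grad=True)

# Gradient of the mean squared error over the full batch of 4 samples
((w * X - y) ** 2).mean().backward()
full_batch_grad = w.grad.clone()
w.grad.zero_()

# Same gradient, accumulated over two micro-batches of 2 samples each.
# Each micro-batch loss is divided by the number of micro-batches (2)
# so that the accumulated sum equals the full-batch mean.
for X_part, y_part in zip(X.split(2), y.split(2)):
    loss = ((w * X_part - y_part) ** 2).mean() / 2
    loss.backward()  # adds into w.grad

print(torch.allclose(w.grad, full_batch_grad))  # True
```

Scaling each partial loss is a design choice here. Without it, the accumulated gradient would be the sum of the per-chunk means rather than the overall mean.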

Detaching from the Computational Graph

Sometimes you want to stop gradient tracking for certain operations. PyTorch provides the detach() method and the torch.no_grad() context manager for these cases.

python
# Create a tensor
x = torch.tensor([2.0], requires_grad=True)

# Using detach()
y = x * 2
z = y.detach() * 3  # y.detach() shares y's data but is cut off from the graph
# z.backward() would raise a RuntimeError here, because z has no grad_fn

print(f"x.grad after using detach(): {x.grad}")  # None: no gradient ever flowed back to x

# Reset and try with torch.no_grad()
if x.grad is not None:
    x.grad.zero_()

with torch.no_grad():
    y = x * 2
    z = y * 3
    
# The following would raise an error since z doesn't have grad_fn
# z.backward()
print(f"z requires_grad: {z.requires_grad}")

Output:

x.grad after using detach(): None
z requires_grad: False
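
A common use of detach() is to treat part of an expression as a constant. In this minimal sketch (the names tracked_grad and detached_grad are illustrative), detaching one factor of x * x halves the gradient, because only the remaining tracked factor contributes.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# Both factors tracked: d(x * x)/dx = 2x = 4
(x * x).backward()
tracked_grad = x.grad.item()
print(tracked_grad)   # 4.0

x.grad.zero_()

# Second factor detached: it acts as the constant 2, so the gradient is 2
(x * x.detach()).backward()
detached_grad = x.grad.item()
print(detached_grad)  # 2.0
```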

Practical Example: Linear Regression

Let's implement a simple linear regression model using autograd to understand how it's used in practice:

python
import torch
import matplotlib.pyplot as plt

# Generate synthetic data
torch.manual_seed(42)
X = torch.rand(100, 1) * 10
y = 2 * X + 1 + torch.randn(100, 1)

# Initialize parameters with gradients
w = torch.tensor([0.0], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

# Hyperparameters
learning_rate = 0.01
n_epochs = 200

# Lists to store losses for plotting
losses = []

# Training loop
for epoch in range(n_epochs):
    # Forward pass
    y_pred = w * X + b
    loss = ((y_pred - y) ** 2).mean()
    losses.append(loss.item())
    
    # Backward pass
    loss.backward()
    
    # Update parameters (gradient descent)
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
        
        # Zero gradients after each update
        w.grad.zero_()
        b.grad.zero_()
    
    # Print progress
    if epoch % 20 == 0:
        print(f'Epoch {epoch}: w = {w.item():.4f}, b = {b.item():.4f}, loss = {loss.item():.4f}')

print(f'Final parameters: w = {w.item():.4f}, b = {b.item():.4f}')

Output (illustrative; the exact numbers you see will differ):

Epoch 0: w = 0.2647, b = 0.3795, loss = 15.4371
Epoch 20: w = 1.6457, b = 1.0078, loss = 1.0903
Epoch 40: w = 1.8851, b = 1.0605, loss = 0.9876
Epoch 60: w = 1.9701, b = 1.0673, loss = 0.9805
Epoch 80: w = 1.9973, b = 1.0673, loss = 0.9799
Epoch 100: w = 2.0069, b = 1.0667, loss = 0.9799
Epoch 120: w = 2.0102, b = 1.0664, loss = 0.9799
Epoch 140: w = 2.0113, b = 1.0663, loss = 0.9799
Epoch 160: w = 2.0116, b = 1.0663, loss = 0.9799
Epoch 180: w = 2.0118, b = 1.0663, loss = 0.9799
Final parameters: w = 2.0118, b = 1.0663

Let's visualize the results:

python
# Plot data and regression line
plt.figure(figsize=(10, 6))

# Plot training data
plt.scatter(X.numpy(), y.numpy(), label='Data')

# Plot regression line
x_range = torch.linspace(0, 10, 100).reshape(-1, 1)
y_pred = w * x_range + b
plt.plot(x_range.numpy(), y_pred.detach().numpy(), 'r-', linewidth=2, label='Fitted line')

plt.title('Linear Regression using PyTorch Autograd')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()

# Plot loss curve
plt.figure(figsize=(10, 6))
plt.plot(losses)
plt.title('Loss vs. Epoch')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.yscale('log')
plt.grid(True)
plt.show()

In this example, we implemented a complete linear regression model without using any high-level PyTorch modules, just relying on autograd for gradient computation. The model learns to approximate y = 2x + 1, which matches our synthetic data generation process.
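
To see what autograd computes in the training loop, the gradients of the mean squared error can also be derived by hand: ∂L/∂w = mean(2 (y_pred - y) X) and ∂L/∂b = mean(2 (y_pred - y)). The sketch below regenerates the same synthetic data and compares both versions for the first step. The names residual, manual_grad_w, and manual_grad_b are assumptions made for this illustration.

```python
import torch

torch.manual_seed(42)
X = torch.rand(100, 1) * 10
y = 2 * X + 1 + torch.randn(100, 1)

w = torch.tensor([0.0], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

# Forward and backward pass, as in the training loop
y_pred = w * X + b
loss = ((y_pred - y) ** 2).mean()
loss.backward()

# Hand-derived gradients of the mean squared error
with torch.no_grad():
    residual = y_pred - y
    manual_grad_w = (2 * residual * X).mean()
    manual_grad_b = (2 * residual).mean()

# A loose tolerance absorbs float32 rounding differences between the two computations
print(torch.allclose(w.grad, manual_grad_w, rtol=1e-4))  # True
print(torch.allclose(b.grad, manual_grad_b, rtol=1e-4))  # True
```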

Common Issues and Best Practices

1. In-place Operations

Be careful with in-place operations (operations that modify a tensor directly, like x.add_(y) or x += y). These can cause issues with the autograd graph:

python
# An in-place update on a leaf tensor that requires grad raises a RuntimeError
x = torch.tensor([1.0], requires_grad=True)
# x += 1  # RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

# Better approach
x = x + 1  # Creates a new tensor
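
The sketch below demonstrates the failure and one way around it. It assumes the goal is to update a parameter's value, as in the linear regression loop above, in which case the in-place update can be wrapped in torch.no_grad(). The flag in_place_failed is a hypothetical name used only to record what happened.

```python
import torch

x = torch.tensor([1.0], requires_grad=True)

# An in-place update on a leaf tensor that requires grad is rejected by autograd
try:
    x += 1
    in_place_failed = False
except RuntimeError as err:
    in_place_failed = True
    print(f"RuntimeError: {err}")

# Inside torch.no_grad() the same update is allowed, and x stays a leaf tensor
with torch.no_grad():
    x += 1

print(in_place_failed, x.item(), x.is_leaf, x.requires_grad)  # True 2.0 True True
```

Unlike x = x + 1, which rebinds the name to a new non-leaf tensor, the no_grad() update keeps x as the same leaf tensor, so later backward passes still populate x.grad.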

2. Setting `requires_grad`

You can set or change the requires_grad attribute after tensor creation:

python
a = torch.tensor([1.0])
a.requires_grad = True  # Enable gradient tracking

b = torch.tensor([2.0], requires_grad=True)
b.requires_grad = False  # Disable gradient tracking
b.requires_grad_(True)   # Alternative way to enable tracking

3. Using `.grad` Safely

Check if .grad exists before using it:

python
x = torch.tensor([1.0], requires_grad=True)
# No backward pass performed yet
if x.grad is not None:
    print(x.grad)
else:
    print("Gradient not computed yet")
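
A related case is intermediate (non-leaf) tensors: by default their .grad is not kept after backward(). If such a gradient is needed, calling retain_grad() on the tensor before the backward pass keeps it. The following is a minimal sketch with made-up values.

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
y = x * 2        # intermediate (non-leaf) tensor
y.retain_grad()  # ask autograd to keep y.grad after backward()
z = y ** 2
z.backward()

print(x.grad.item())  # dz/dx = 2y * 2 = 8.0
print(y.grad.item())  # dz/dy = 2y = 4.0
```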

Summary

In this tutorial, we covered the basics of PyTorch's autograd system:

- Automatic Differentiation: PyTorch's method for computing gradients automatically
- Computational Graph: How PyTorch builds and tracks operations for gradient computation
- Forward and Backward Passes: The two main phases of computation in neural networks
- Gradient Accumulation: How gradients accumulate and how to reset them
- Detaching from the Graph: How to stop gradient tracking when needed
- Practical Application: Using autograd for linear regression

PyTorch's autograd is the foundation for building and training neural networks. It handles the complex calculus of backpropagation automatically, allowing you to focus on model architecture and training procedures.

Additional Resources and Exercises

Exercises

- Implement a simple neural network for binary classification using only autograd (no nn.Module)
- Extend the linear regression example to multiple input features
- Implement a polynomial regression model using autograd
- Create a custom autograd function by extending torch.autograd.Function
- Experiment with different optimizers, such as SGD with momentum or Adam, by implementing their update rules manually

Happy learning and coding with PyTorch autograd!

