PyTorch Gradients

Introduction

Gradients are at the heart of how neural networks learn. They represent the rate of change of a function with respect to its inputs, and in deep learning, they drive the optimization process. PyTorch's autograd system makes computing these gradients almost magical, allowing us to focus on designing our models rather than manually calculating derivatives.

In this guide, we'll explore how PyTorch computes and manages gradients, how to access and use them in your code, and various techniques to handle gradients effectively in neural network training.

What are Gradients?

Before we dive into PyTorch's implementation, let's briefly revisit what gradients are:

A gradient is a vector of partial derivatives that tells us the direction of steepest increase of a function. In machine learning, we use gradients to adjust model parameters during optimization to minimize a loss function.

For a function $f(x, y)$, the gradient is:

$$\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)$$

For example, for $f(x, y) = x^2 + y^3$ the gradient is $\nabla f = (2x,\, 3y^2)$, which is exactly what we'll compute with PyTorch below.

Basic Gradient Computation in PyTorch

Let's start with a simple example to see how PyTorch computes gradients:

python
import torch

# Create a tensor with requires_grad=True
x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)

# Define a simple function z = x^2 + y^3
z = x**2 + y**3

# Compute gradients by backpropagation
z.backward()

# Access the gradients
print(f"dz/dx: {x.grad}") # Should be 2x = 2*2 = 4
print(f"dz/dy: {y.grad}") # Should be 3y^2 = 3*3^2 = 27

Output:

dz/dx: tensor([4.])
dz/dy: tensor([27.])

Here's what happened:

  1. We created tensors x and y with requires_grad=True, telling PyTorch to track operations on these tensors
  2. We defined a computation using these tensors
  3. We called .backward() on the result, which computes gradients
  4. The gradients were stored in the .grad attribute of each input tensor
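
Incidentally, if you want gradients without storing them in .grad, torch.autograd.grad offers a functional alternative. Here's a minimal sketch repeating the computation above:

python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)
z = x**2 + y**3

# Returns the gradients as a tuple instead of writing them to .grad
dz_dx, dz_dy = torch.autograd.grad(z, (x, y))
print(dz_dx, dz_dy)  # tensor([4.]) tensor([27.])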

Gradient Accumulation

By default, PyTorch accumulates gradients. This means that if you call .backward() multiple times, the gradients will be added together:

python
import torch

# Create a tensor
x = torch.tensor([1.0], requires_grad=True)

# First function: y = x^2
y = x**2
y.backward()

print(f"Gradient after first backward: {x.grad}")

# Second function: z = 2*x
z = 2 * x
z.backward()

print(f"Gradient after second backward: {x.grad}") # This will be 2 + 2 = 4

# Reset gradients
x.grad.zero_()
print(f"Gradient after reset: {x.grad}")

Output:

Gradient after first backward: tensor([2.])
Gradient after second backward: tensor([4.])
Gradient after reset: tensor([0.])

Notice how we use zero_() to reset the gradients. This is a common pattern in training neural networks, where we need to zero the gradients before each optimization step.
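
To see where this fits in practice, here is a minimal sketch of a single optimization step; the model, data, and learning rate are arbitrary placeholders. In real training code you would usually call optimizer.zero_grad(), which zeroes every parameter's gradient for you:

python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs, targets = torch.randn(8, 4), torch.randn(8, 1)

optimizer.zero_grad()  # clear gradients left over from the previous step
loss = torch.nn.functional.mse_loss(model(inputs), targets)
loss.backward()        # compute fresh gradients
optimizer.step()       # update the parameters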

Gradient Flow in Neural Networks

Let's look at a more practical example with a simple neural network:

python
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.linear1 = nn.Linear(2, 3)
        self.linear2 = nn.Linear(3, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.linear1(x))
        x = self.linear2(x)
        return x

# Create a model and sample data
model = SimpleNN()
inputs = torch.tensor([[0.5, 0.3]], dtype=torch.float32)
targets = torch.tensor([[0.7]], dtype=torch.float32)

# Forward pass
outputs = model(inputs)
loss = (outputs - targets).pow(2).sum()  # MSE loss

# Backward pass
loss.backward()

# Inspect gradients of model parameters
for name, param in model.named_parameters():
    print(f"Parameter: {name}, Size: {param.size()}")
    print(f"Gradient: {param.grad}")
    print("-" * 30)

# Zero gradients
optimizer = optim.SGD(model.parameters(), lr=0.01)
optimizer.zero_grad()

This example shows how gradients flow through a neural network:

  1. We perform a forward pass to compute the loss
  2. We call backward() on the loss
  3. The gradients flow backward through the network, accumulating in each parameter's .grad attribute
  4. We can then use these gradients to update the parameters using an optimizer

Controlling Gradient Computation

Detaching Tensors

Sometimes, you want to stop gradient flow at a certain point in your computation. For this, you can use detach():

python
import torch

# Create tensors
x = torch.tensor([2.0], requires_grad=True)
y = x * 3

# Detach y from the computation graph
z = y.detach()
print(f"z.requires_grad: {z.requires_grad}")

# Calling z.backward() here would raise a RuntimeError: the detached
# tensor is cut off from the computation graph, so no gradient can
# flow back to x through it.

# Without detaching, gradients flow back to x as usual:
y = x * 3
w = y * 2
w.backward()
print(f"x.grad after normal backward: {x.grad}")  # dw/dx = 6

Output:

z.requires_grad: False
x.grad after normal backward: tensor([6.])

Using torch.no_grad()

For larger blocks of code where you want to disable gradient tracking, use the torch.no_grad() context manager:

python
import torch

x = torch.tensor([2.0], requires_grad=True)

# Using torch.no_grad() context manager
with torch.no_grad():
    y = x * 3
    z = y * 2
    print(f"Does z require gradients? {z.requires_grad}")

# Outside the context, gradient tracking is re-enabled
y = x * 3
z = y * 2
print(f"Does z require gradients now? {z.requires_grad}")

Output:

Does z require gradients? False
Does z require gradients now? True

This is particularly useful for evaluation phases in training loops, where you don't need gradients and can save memory.
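
For example, an evaluation pass typically combines model.eval() with torch.no_grad(); here is a minimal sketch, with the model and validation batch as placeholders:

python
import torch

model = torch.nn.Linear(10, 1)
val_inputs = torch.randn(16, 10)

model.eval()              # switch layers like dropout/batchnorm to eval behavior
with torch.no_grad():     # skip building the computation graph, saving memory
    predictions = model(val_inputs)
print(predictions.shape)  # torch.Size([16, 1])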

Handling Non-Scalar Outputs

In the examples so far, our final output was a scalar (a single value). But what if we have a vector output and want to compute gradients? Calling backward() on a non-scalar tensor raises an error unless you pass an explicit gradient argument, so PyTorch gives you two ways to handle vector outputs:

python
import torch

# Create a tensor
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)

# Vector function: y = x^2
y = x**2

# Option 1: Sum to scalar
z = y.sum()
z.backward()
print(f"Gradient when using sum: {x.grad}")

# Reset gradients
x.grad.zero_()

# Option 2: Specify gradient for backward
y.backward(torch.ones_like(y))
print(f"Gradient with explicit grad_outputs: {x.grad}")

Output:

Gradient when using sum: tensor([[2., 4.],
        [6., 8.]])
Gradient with explicit grad_outputs: tensor([[2., 4.],
        [6., 8.]])

Passing torch.ones_like(y) tells autograd to backpropagate a vector of ones through the graph, which produces the same result as calling y.sum().backward().
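
More generally, the tensor you pass to backward() is the vector in a vector-Jacobian product, so values other than ones weight each output's contribution. For instance, a one-hot gradient isolates the derivative of a single output element; a small sketch:

python
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = x**2

# Backpropagate only from y[0, 1] by passing a one-hot gradient
selector = torch.zeros_like(y)
selector[0, 1] = 1.0
y.backward(selector)
print(x.grad)  # tensor([[0., 4.], [0., 0.]]) -- only dy[0,1]/dx survives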

Gradient Clipping

Gradient clipping is a technique to prevent exploding gradients, which can destabilize training:

python
import torch
import torch.nn as nn
import torch.nn.utils as utils

# Define a simple model
model = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 1)
)

# Sample input and target
x = torch.randn(3, 10)
target = torch.randn(3, 1)

# Compute loss
output = model(x)
loss = nn.MSELoss()(output, target)

# Backward pass
loss.backward()

# Before clipping: compute the total L2 norm over all gradients
total_norm_before = 0
for p in model.parameters():
    if p.grad is not None:
        total_norm_before += p.grad.norm(2).item() ** 2
total_norm_before = total_norm_before ** 0.5
print(f"Total gradient norm before clipping: {total_norm_before}")

# Clip gradients by max norm
max_norm = 1.0
utils.clip_grad_norm_(model.parameters(), max_norm)

# After clipping
total_norm_after = 0
for p in model.parameters():
    if p.grad is not None:
        total_norm_after += p.grad.norm(2).item() ** 2
total_norm_after = total_norm_after ** 0.5
print(f"Total gradient norm after clipping: {total_norm_after}")

The exact values will vary from run to run, but after clipping the total gradient norm will be at most 1.0. Note that clip_grad_norm_ rescales the gradients only when their combined norm exceeds max_norm; smaller gradients are left untouched.
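
As an alternative to norm-based clipping, torch.nn.utils also provides clip_grad_value_, which clamps each gradient element into a fixed range instead of rescaling the whole gradient vector; a one-line sketch reusing the model above:

python
# Clamp every gradient element into [-0.5, 0.5] (element-wise, not by norm)
utils.clip_grad_value_(model.parameters(), clip_value=0.5)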

Mixed Precision Training and Gradients

For performance optimization, particularly on modern GPUs, you might want to use mixed precision training. This affects how gradients are handled:

python
import torch

# Make sure you have a compatible GPU
if torch.cuda.is_available():
    # Enable automatic mixed precision via a gradient scaler
    scaler = torch.cuda.amp.GradScaler()

    # Define model and optimizer (simplified); the model must live on the GPU
    model = torch.nn.Linear(10, 1).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # In training loop:
    for epoch in range(1):
        optimizer.zero_grad()

        # Forward pass with autocast
        with torch.cuda.amp.autocast():
            x = torch.randn(3, 10, device='cuda')
            output = model(x)
            loss = output.mean()

        # Backward pass with scaled gradients
        scaler.scale(loss).backward()

        # Unscale gradients and clip if needed
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update weights and scaler
        scaler.step(optimizer)
        scaler.update()

    print("Successfully ran mixed precision training example")
else:
    print("CUDA not available, skipping mixed precision example")
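
Note that recent PyTorch releases expose the same functionality under torch.amp (for example, torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda")); the torch.cuda.amp spelling above still works but is deprecated in newer versions.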

Real-world Application: Custom Gradient Manipulation

Let's look at a practical example where we modify gradients during training, a technique often used in advanced model training:

python
import torch
import torch.nn as nn

# Define a simple model
class GradientMonitorModel(nn.Module):
    def __init__(self):
        super(GradientMonitorModel, self).__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

    def register_gradient_hooks(self):
        """Register hooks to monitor/modify gradients"""
        def hook_fn(grad):
            # We could modify gradients here if needed
            print(f"Gradient statistics: min={grad.min().item()}, max={grad.max().item()}")

            # Example: clamp extreme values
            return torch.clamp(grad, min=-1.0, max=1.0)

        # Register hook on first layer weights
        self.fc1.weight.register_hook(hook_fn)
        return self

# Create model and register hooks
model = GradientMonitorModel().register_gradient_hooks()

# Training example
x = torch.randn(5, 10)
target = torch.randn(5, 1)

# Forward and backward pass (the hook fires during backward)
output = model(x)
loss = nn.MSELoss()(output, target)
loss.backward()

# Optimizer step
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer.step()

This example demonstrates how to use hooks to monitor and modify gradients as they flow through specific layers during training.
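
Hooks are one way to get at the gradients of intermediate tensors; another is retain_grad(), which asks autograd to keep a non-leaf tensor's gradient (normally discarded after backward). A small sketch:

python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x * 3          # y is a non-leaf tensor; its .grad is normally not kept
y.retain_grad()    # ask autograd to populate y.grad anyway
z = (y**2).sum()
z.backward()
print(y.grad)      # dz/dy = 2y = tensor([12.])
print(x.grad)      # dz/dx = 2y * 3 = tensor([36.])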

Summary

In this guide, we've covered:

  • Basic gradient computation in PyTorch
  • How to access and interpret gradients
  • Gradient accumulation and resetting
  • Flow of gradients through neural networks
  • Controlling gradient computation with detach() and torch.no_grad()
  • Handling non-scalar outputs for gradient computation
  • Gradient clipping to prevent training instability
  • Mixed precision training and its effects on gradients
  • Custom gradient manipulation using hooks

Understanding gradients is crucial for mastering deep learning. PyTorch's autograd system makes it easy to work with gradients, but knowing these concepts helps you debug training issues and implement advanced techniques.

Exercises

  1. Basic Gradient Computation: Create a tensor x with value 3.0 and compute the gradient of the function f(x) = x^3 - 4x + 2.

  2. Gradient Flow Visualization: Create a small neural network and visualize the magnitude of gradients at each layer during training. Do you notice any patterns?

  3. Custom Autograd Function: Implement a custom autograd function that computes the Huber loss and its gradient.

  4. Gradient Penalty: Implement the gradient penalty term used in Wasserstein GANs, which requires computing gradients of the discriminator output with respect to its inputs.

  5. Gradient Checkpointing: Research and implement gradient checkpointing for a large model to reduce memory usage during training.

By practicing these exercises, you'll gain a deeper understanding of how PyTorch handles gradients and how you can leverage this knowledge to train more complex models efficiently.


