PyTorch Hessian
Introduction
The Hessian matrix is a square matrix of second-order partial derivatives of a function. In the context of machine learning and deep learning, the Hessian provides valuable information about the curvature of the loss landscape, which can be used for optimization algorithms, understanding model behavior, and analyzing convergence properties.
In this tutorial, we'll explore how to compute and use Hessian matrices in PyTorch with its autograd functionality. Computing the complete Hessian of a large model is computationally impractical, so we'll cover exact calculations for small problems as well as techniques, such as Hessian-vector products, that scale to larger ones.
Understanding the Hessian Matrix
Before diving into code, let's understand what a Hessian matrix is:
For a function $f: \mathbb{R}^n \to \mathbb{R}$, where $x \in \mathbb{R}^n$ is a vector, the Hessian $H$ is an $n \times n$ square matrix whose entries are:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
In simpler terms, each element of the Hessian is the second derivative of the function with respect to different combinations of input variables.
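For example, for the two-variable function $f(x) = x_0^2 + x_0 x_1 + x_1^2$ that we'll use throughout this tutorial, the second derivatives can be written out by hand:

$$
H(f) = \begin{bmatrix}
\dfrac{\partial^2 f}{\partial x_0^2} & \dfrac{\partial^2 f}{\partial x_0 \partial x_1} \\[4pt]
\dfrac{\partial^2 f}{\partial x_1 \partial x_0} & \dfrac{\partial^2 f}{\partial x_1^2}
\end{bmatrix}
= \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}
$$

Because the mixed partial derivatives are equal, the Hessian is symmetric; this is exactly the matrix we'll recover with PyTorch below.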
Computing the Hessian in PyTorch
Method 1: Using torch.autograd.functional.hessian
PyTorch provides a built-in function to compute the Hessian matrix in torch.autograd.functional:
```python
import torch
from torch.autograd.functional import hessian

# Define a simple function
def f(x):
    return x[0]**2 + x[0]*x[1] + x[1]**2

# Create input tensor
x = torch.tensor([1.0, 2.0], requires_grad=True)

# Compute the Hessian
H = hessian(f, x)
print(H)
```
Output:
```
tensor([[2., 1.],
        [1., 2.]])
```
This gives us the Hessian matrix of the function `f` at the point `x = [1.0, 2.0]`. The Hessian is a 2×2 matrix since our input has two dimensions.
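The hessian function also accepts a tuple of input tensors; in that case it returns a nested tuple of Hessian blocks, one block for each pair of inputs. A minimal sketch (the function g and its inputs here are made up for illustration):

```python
import torch
from torch.autograd.functional import hessian

# A scalar function of two separate tensor inputs (hypothetical example)
def g(a, b):
    return (a * b).sum() + (a ** 2).sum()

a = torch.tensor([1.0, 2.0])
b = torch.tensor([3.0, 4.0])

# Passing a tuple of inputs returns a nested tuple of Hessian blocks:
# H_blocks[i][j] holds the second derivatives with respect to inputs i and j
H_blocks = hessian(g, (a, b))
print(H_blocks[0][0])  # d²g/da²  : 2 * identity
print(H_blocks[0][1])  # d²g/dadb : identity
```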
Method 2: Manual Computation Using torch.autograd.grad
For educational purposes, let's compute the Hessian manually:
```python
import torch

def compute_hessian(func, inputs):
    """Compute the Hessian matrix of a scalar function with respect to its inputs."""
    inputs = inputs.detach().requires_grad_(True)
    n = inputs.size(0)

    # First-order gradients, keeping the graph so we can differentiate again
    output = func(inputs)
    first_grads = torch.autograd.grad(
        output,
        inputs,
        create_graph=True,
        retain_graph=True
    )[0]

    # Differentiate each component of the gradient to fill one Hessian row at a time
    hessian = torch.zeros(n, n)
    for i in range(n):
        grad_outputs = torch.zeros_like(first_grads)
        grad_outputs[i] = 1.0
        hessian_row = torch.autograd.grad(
            first_grads,
            inputs,
            grad_outputs=grad_outputs,
            retain_graph=True
        )[0]
        hessian[i] = hessian_row

    return hessian

# Test the function
def quadratic_function(x):
    return x[0]**2 + x[0]*x[1] + x[1]**2

x = torch.tensor([1.0, 2.0])
H = compute_hessian(quadratic_function, x)
print(H)
```
Output:
```
tensor([[2., 1.],
        [1., 2.]])
```
This produces the same result as the built-in function, confirming our implementation.
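As a quick sanity check, you can also compare the two results programmatically (a small usage sketch, reusing compute_hessian and quadratic_function from above):

```python
import torch
from torch.autograd.functional import hessian

x = torch.tensor([1.0, 2.0])
H_builtin = hessian(quadratic_function, x)
H_manual = compute_hessian(quadratic_function, x)
print(torch.allclose(H_builtin, H_manual))  # True
```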
Applications of the Hessian in Machine Learning
1. Newton's Method for Optimization
Newton's method uses the Hessian to take more intelligent steps during optimization:
```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from torch.autograd.functional import hessian

# Define a function to minimize
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

# Newton's method implementation
def newton_optimizer(func, x_init, lr=1.0, max_iter=20, tol=1e-6):
    x = x_init.clone().detach().requires_grad_(True)
    trajectory = [x.detach().clone().numpy()]

    for i in range(max_iter):
        # Compute function value
        f_val = func(x)

        # Compute gradient
        grad = torch.autograd.grad(f_val, x, create_graph=True)[0]
        grad_norm = grad.norm().item()

        # Check convergence
        if grad_norm < tol:
            print(f"Converged in {i+1} iterations")
            break

        # Compute Hessian
        H = hessian(func, x)

        # Regularize the Hessian to ensure it's positive definite
        H_np = H.detach().numpy()
        eigenvalues = np.linalg.eigvalsh(H_np)
        if np.min(eigenvalues) < 1e-6:
            H_np = H_np + (1e-6 - np.min(eigenvalues)) * np.eye(len(H_np))

        # Compute the Newton direction
        try:
            newton_direction = -np.linalg.solve(H_np, grad.detach().numpy())
            # Update parameters
            with torch.no_grad():
                x += lr * torch.tensor(newton_direction, dtype=torch.float32)
        except np.linalg.LinAlgError:
            # Fall back to gradient descent if the Hessian is singular
            with torch.no_grad():
                x -= lr * grad

        trajectory.append(x.detach().clone().numpy())

    return x.detach(), trajectory

# Initial point
x_init = torch.tensor([-1.2, 1.0], requires_grad=True)

# Optimize using Newton's method
x_opt, trajectory = newton_optimizer(rosenbrock, x_init)
print(f"Optimized value: {x_opt}")
print(f"Function value at optimum: {rosenbrock(x_opt):.8f}")
```
Output:
```
Converged in 9 iterations
Optimized value: tensor([0.9999, 0.9998])
Function value at optimum: 0.00000004
```
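Since newton_optimizer also returns the trajectory of iterates, the matplotlib import above can be put to work to visualize the path taken across the Rosenbrock valley. A minimal plotting sketch, reusing the trajectory returned above (the grid ranges and contour levels are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the Rosenbrock function on a grid for the contour plot
xs = np.linspace(-1.5, 1.5, 200)
ys = np.linspace(-0.5, 1.5, 200)
XX, YY = np.meshgrid(xs, ys)
ZZ = (1 - XX) ** 2 + 100 * (YY - XX ** 2) ** 2  # same formula as rosenbrock()

traj = np.array(trajectory)  # shape (num_iterations + 1, 2)
plt.contour(XX, YY, ZZ, levels=np.logspace(-1, 3, 20))
plt.plot(traj[:, 0], traj[:, 1], "o-", color="red", label="Newton iterates")
plt.xlabel("x[0]")
plt.ylabel("x[1]")
plt.title("Newton's method on the Rosenbrock function")
plt.legend()
plt.show()
```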
2. Understanding Model Curvature and Loss Landscape
The Hessian eigenvalues provide valuable information about the loss landscape:
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd.functional import hessian
from torch.func import functional_call  # PyTorch 2.0+; runs the model with given parameter tensors
import numpy as np

# Create a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(2, 3)
        self.linear2 = nn.Linear(3, 1)

    def forward(self, x):
        x = torch.relu(self.linear1(x))
        return self.linear2(x)

# Create some data
X = torch.randn(20, 2)
y = 0.5 * X[:, 0] + 0.7 * X[:, 1] + torch.randn(20) * 0.1
y = y.unsqueeze(1)

# Create model and loss
model = SimpleModel()
criterion = nn.MSELoss()

# Function to compute the loss from a flat parameter vector
def compute_loss(params):
    """Compute the loss given flat parameters, keeping the autograd graph intact."""
    # Reshape the flat vector back into per-parameter tensors
    pointer = 0
    param_dict = {}
    for name, param in model.named_parameters():
        num_param = param.numel()
        param_dict[name] = params[pointer:pointer + num_param].view_as(param)
        pointer += num_param

    # Forward pass with the reconstructed parameters (differentiable w.r.t. params)
    outputs = functional_call(model, param_dict, (X,))
    return criterion(outputs, y)

# Get all parameters as a single flat tensor
def get_params():
    return torch.cat([p.flatten() for p in model.parameters()])

# Compute a subset of the Hessian for the first 10 parameters
params = get_params()
print(f"Total number of parameters: {len(params)}")

# We'll just compute a small part of the Hessian for demonstration;
# computing the full Hessian would be expensive for most models
params_subset = params[:10].clone().detach().requires_grad_(True)

def loss_subset(p):
    full_params = get_params()
    full_params[:10] = p
    return compute_loss(full_params)

# Compute the Hessian for the subset of parameters
H_subset = hessian(loss_subset, params_subset)

# Analyze the Hessian eigenvalues
eigenvalues = torch.linalg.eigvalsh(H_subset)
print("Hessian eigenvalues:")
print(eigenvalues)

# The condition number gives information about the optimization difficulty
condition_number = eigenvalues.max() / eigenvalues.min()
print(f"Condition number: {condition_number}")
```
Output (will vary):
```
Total number of parameters: 13
Hessian eigenvalues:
tensor([0.0012, 0.0023, 0.0057, 0.0089, 0.0132, 0.0187, 0.0245, 0.0356, 0.0478, 0.0623])
Condition number: 51.91666793823242
```
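One practical way to read these numbers: for a locally quadratic loss, plain gradient descent is stable only if the learning rate stays below roughly $2 / \lambda_{\max}$, where $\lambda_{\max}$ is the largest Hessian eigenvalue, and a large condition number means the step size that is safe along the sharpest direction makes painfully slow progress along the flattest one. A small sketch of that rule of thumb, reusing the eigenvalues tensor from above (it assumes all eigenvalues are positive, as in the sample output):

```python
# Rough stability bound for gradient descent on a locally quadratic loss:
# steps are stable only if lr < 2 / lambda_max (assumes the quadratic model holds locally)
lambda_max = eigenvalues.max().item()
lambda_min = eigenvalues.min().item()

max_stable_lr = 2.0 / lambda_max
print(f"Largest eigenvalue (sharpest direction): {lambda_max:.4f}")
print(f"Approximate largest stable learning rate: {max_stable_lr:.2f}")
print(f"Condition number (optimization difficulty): {lambda_max / lambda_min:.2f}")
```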
Hessian-Vector Products in PyTorch
For a model with $n$ parameters, the full Hessian has $n^2$ entries, so computing and storing it is infeasible for large models. Instead, Hessian-vector products are often used:
```python
import torch
from torch.autograd.functional import hessian

def hessian_vector_product(func, inputs, vector):
    """
    Compute the Hessian-vector product efficiently
    without explicitly forming the Hessian matrix.
    """
    inputs = inputs.detach().requires_grad_(True)

    # Compute the gradient, keeping the graph for a second backward pass
    grad = torch.autograd.grad(
        func(inputs),
        inputs,
        create_graph=True
    )[0]

    # Dot product of the gradient with the vector
    product = torch.sum(grad * vector)

    # Differentiating that scalar gives the Hessian-vector product
    hvp = torch.autograd.grad(product, inputs)[0]
    return hvp

# Test with a quadratic function
def quad_func(x):
    return x[0]**2 + x[0]*x[1] + x[1]**2

x = torch.tensor([1.0, 2.0], requires_grad=True)
v = torch.tensor([1.0, 0.0])  # Compute H·[1, 0]

# Compute the Hessian-vector product
hvp = hessian_vector_product(quad_func, x, v)
print("Hessian-vector product:", hvp)

# Verify with the full Hessian
H = hessian(quad_func, x)
manual_hvp = torch.matmul(H, v)
print("Verification using full Hessian:", manual_hvp)
```
Output:
```
Hessian-vector product: tensor([2., 1.])
Verification using full Hessian: tensor([2., 1.])
```
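Because each Hessian-vector product needs only two gradient passes, it can be combined with power iteration to estimate the largest Hessian eigenvalue without ever materializing the matrix. A minimal sketch, reusing hessian_vector_product and quad_func from above (the helper top_hessian_eigenvalue is our own illustrative name):

```python
import torch

def top_hessian_eigenvalue(func, inputs, num_iters=50):
    """Estimate the largest-magnitude Hessian eigenvalue with power iteration on HVPs."""
    v = torch.randn_like(inputs)
    v = v / v.norm()
    eigenvalue = 0.0
    for _ in range(num_iters):
        hv = hessian_vector_product(func, inputs, v)
        eigenvalue = torch.dot(v, hv).item()  # Rayleigh quotient estimate
        v = hv / (hv.norm() + 1e-12)
    return eigenvalue

x = torch.tensor([1.0, 2.0])
print(top_hessian_eigenvalue(quad_func, x))  # ~3.0 for H = [[2, 1], [1, 2]]
```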
Practical Applications of Hessian-Based Methods
Second-Order Optimization: Trust Region Method
Here's a simplified trust region optimization algorithm using Hessian information:
```python
import torch
import numpy as np
from torch.autograd.functional import hessian

def trust_region_optimization(func, x_init, max_radius=1.0, eta=0.1, max_iter=20):
    """A simplified trust region optimization method."""
    x = x_init.clone().detach().requires_grad_(True)
    radius = 0.5  # Initial trust region radius
    trajectory = [x.detach().numpy()]

    for i in range(max_iter):
        # Compute function value and gradient
        y = func(x)
        grad = torch.autograd.grad(y, x, create_graph=True)[0]
        grad_norm = torch.norm(grad).item()

        # Early stopping if the gradient is small
        if grad_norm < 1e-5:
            print(f"Converged at iteration {i+1}")
            break

        # Compute the Hessian
        H = hessian(func, x)

        # Solve the trust region subproblem (simplified):
        # we use a regularized Newton step here
        H_np = H.detach().numpy()
        g_np = grad.detach().numpy()

        # Add regularization if needed
        eigenvalues = np.linalg.eigvalsh(H_np)
        if np.min(eigenvalues) < 1e-6:
            regularization = 1e-6 - np.min(eigenvalues)
            H_reg = H_np + regularization * np.eye(len(H_np))
        else:
            H_reg = H_np

        # Compute the step
        try:
            step = -np.linalg.solve(H_reg, g_np)
        except np.linalg.LinAlgError:
            # Fall back to steepest descent
            step = -g_np / grad_norm

        # Apply the trust region constraint
        step_norm = np.linalg.norm(step)
        if step_norm > radius:
            step = step * radius / step_norm

        # Compute predicted vs. actual reduction
        step_tensor = torch.tensor(step, dtype=torch.float32)
        model_reduction = -(grad @ step_tensor + 0.5 * step_tensor @ H @ step_tensor).item()

        # Take a trial step and compute the actual reduction
        with torch.no_grad():
            x_new = x + step_tensor
            actual_reduction = func(x).item() - func(x_new).item()

        # Agreement ratio between actual and predicted reduction
        if actual_reduction > 0 and model_reduction > 0:
            ratio = actual_reduction / model_reduction
        else:
            ratio = 0.0

        # Update the trust region radius based on the agreement ratio
        if ratio > 0.75:
            radius = min(2.0 * radius, max_radius)
        elif ratio < 0.25:
            radius *= 0.25

        # Accept the step only if we made sufficient progress
        if ratio > eta:
            x = x_new.detach().requires_grad_(True)
        else:
            # Reject the step and shrink the trust region further
            radius *= 0.25

        trajectory.append(x.detach().numpy())
        print(f"Iteration {i+1}, function value: {func(x).item():.6f}, radius: {radius:.6f}")

    return x.detach(), trajectory

# Test with the Rosenbrock function
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

x_init = torch.tensor([-1.2, 1.0], requires_grad=True)
x_opt, traj = trust_region_optimization(rosenbrock, x_init)
print(f"\nOptimized value: {x_opt}")
print(f"Function value at optimum: {rosenbrock(x_opt).item():.8f}")
```
Output:
```
Iteration 1, function value: 24.200003, radius: 0.500000
Iteration 2, function value: 4.662624, radius: 1.000000
Iteration 3, function value: 0.217159, radius: 1.000000
Iteration 4, function value: 0.002132, radius: 1.000000
Iteration 5, function value: 0.000003, radius: 1.000000
Converged at iteration 6

Optimized value: tensor([1.0000, 1.0000])
Function value at optimum: 0.00000000
```
Summary
In this tutorial, we've explored various approaches to working with Hessian matrices in PyTorch:
- Using PyTorch's built-in `hessian` function from `torch.autograd.functional`
- Manually computing the Hessian using autograd
- Efficiently computing Hessian-vector products
- Applying Hessian information to optimization algorithms
The Hessian provides valuable second-order derivative information that can be used to:
- Implement second-order optimization methods (Newton's method, trust region)
- Analyze model curvature and loss landscapes
- Improve convergence in difficult optimization problems
- Understand model behavior near critical points
While full Hessians are computationally expensive for large models, techniques like Hessian-vector products can make these approaches practical even for deeper networks.
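To make that concrete, here is a hedged sketch of a Hessian-vector product taken with respect to all parameters of the SimpleModel from earlier (the helper name loss_hvp is ours; it follows the same gradient-of-a-dot-product pattern shown above):

```python
import torch

def loss_hvp(model, loss_fn, inputs, targets, vectors):
    """Hessian-vector product of the loss with respect to all model parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Dot product between the gradient and the given vector, parameter by parameter
    dot = sum((g * v).sum() for g, v in zip(grads, vectors))
    return torch.autograd.grad(dot, params)

# Probe the curvature along a random direction in parameter space,
# reusing model, criterion, X, and y from the loss-landscape example
vectors = [torch.randn_like(p) for p in model.parameters()]
hvp = loss_hvp(model, criterion, X, y, vectors)
print([h.shape for h in hvp])  # one block per parameter tensor
```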
Additional Resources and Exercises
Further Reading
- PyTorch Documentation on torch.autograd.functional
- "Numerical Optimization" by Jorge Nocedal and Stephen J. Wright
- "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (Chapter 8)
Exercises
- Implement a diagonal Hessian approximation for a neural network.
- Use the Hessian to identify saddle points in a loss landscape.
- Implement a quasi-Newton method (like BFGS) using PyTorch.
- Compare the convergence of first-order methods (SGD, Adam) with Newton's method on a small regression problem.
- Investigate how the eigenvalues of the Hessian change during training of a neural network.
Happy coding with PyTorch Hessians!