PyTorch Hessian
Introduction
The Hessian matrix is a square matrix of second-order partial derivatives of a function. In the context of machine learning and deep learning, the Hessian provides valuable information about the curvature of the loss landscape, which can be used for optimization algorithms, understanding model behavior, and analyzing convergence properties.
In this tutorial, we'll explore how to compute and use Hessian matrices in PyTorch with its autograd functionality. Computing the complete Hessian of a large model is computationally impractical, so we'll cover exact calculations for small problems as well as techniques, such as Hessian-vector products, that scale to larger ones.
Understanding the Hessian Matrix
Before diving into code, let's understand what a Hessian matrix is:
For a function $f: \mathbb{R}^n \to \mathbb{R}$, where $x \in \mathbb{R}^n$ is a vector, the Hessian $H$ is an $n \times n$ square matrix whose entries are:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
In simpler terms, each element of the Hessian is the second derivative of the function with respect to different combinations of input variables.
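For example, for the two-variable function $f(x) = x_0^2 + x_0 x_1 + x_1^2$ that we'll use throughout this tutorial, the second derivatives can be written out by hand:

$$
H(f) = \begin{bmatrix}
\dfrac{\partial^2 f}{\partial x_0^2} & \dfrac{\partial^2 f}{\partial x_0 \partial x_1} \\[4pt]
\dfrac{\partial^2 f}{\partial x_1 \partial x_0} & \dfrac{\partial^2 f}{\partial x_1^2}
\end{bmatrix}
= \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}
$$

Because the mixed partial derivatives are equal, the Hessian is symmetric; this is exactly the matrix we'll recover with PyTorch below.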
Computing the Hessian in PyTorch
Method 1: Using torch.autograd.functional.hessian
PyTorch provides a built-in function to compute the Hessian matrix in torch.autograd.functional:
```python
import torch
from torch.autograd.functional import hessian

# Define a simple function
def f(x):
    return x[0]**2 + x[0]*x[1] + x[1]**2

# Create input tensor
x = torch.tensor([1.0, 2.0], requires_grad=True)

# Compute the Hessian
H = hessian(f, x)
print(H)
```
Output:
```
tensor([[2., 1.],
        [1., 2.]])
```
This gives us the Hessian matrix of the function `f` at the point `x = [1.0, 2.0]`. The Hessian is a 2×2 matrix since our input has two dimensions.
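The hessian function also accepts a tuple of input tensors; in that case it returns a nested tuple of Hessian blocks, one block for each pair of inputs. A minimal sketch (the function g and its inputs here are made up for illustration):

```python
import torch
from torch.autograd.functional import hessian

# A scalar function of two separate tensor inputs (hypothetical example)
def g(a, b):
    return (a * b).sum() + (a ** 2).sum()

a = torch.tensor([1.0, 2.0])
b = torch.tensor([3.0, 4.0])

# Passing a tuple of inputs returns a nested tuple of Hessian blocks:
# H_blocks[i][j] holds the second derivatives with respect to inputs i and j
H_blocks = hessian(g, (a, b))
print(H_blocks[0][0])  # d²g/da²  : 2 * identity
print(H_blocks[0][1])  # d²g/dadb : identity
```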
Method 2: Manual Computation Using torch.autograd.grad
For educational purposes, let's compute the Hessian manually:
```python
import torch

def compute_hessian(func, inputs):
    """Compute the Hessian matrix of a scalar function with respect to its inputs."""
    inputs = inputs.detach().requires_grad_(True)
    n = inputs.size(0)

    # First-order gradients, keeping the graph so we can differentiate again
    output = func(inputs)
    first_grads = torch.autograd.grad(
        output,
        inputs,
        create_graph=True,
        retain_graph=True
    )[0]

    # Differentiate each component of the gradient to fill one Hessian row at a time
    hessian = torch.zeros(n, n)
    for i in range(n):
        grad_outputs = torch.zeros_like(first_grads)
        grad_outputs[i] = 1.0
        hessian_row = torch.autograd.grad(
            first_grads,
            inputs,
            grad_outputs=grad_outputs,
            retain_graph=True
        )[0]
        hessian[i] = hessian_row

    return hessian

# Test the function
def quadratic_function(x):
    return x[0]**2 + x[0]*x[1] + x[1]**2

x = torch.tensor([1.0, 2.0])
H = compute_hessian(quadratic_function, x)
print(H)
```
Output:
```
tensor([[2., 1.],
        [1., 2.]])
```
This produces the same result as the built-in function, confirming our implementation.
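As a quick sanity check, you can also compare the two results programmatically (a small usage sketch, reusing compute_hessian and quadratic_function from above):

```python
import torch
from torch.autograd.functional import hessian

x = torch.tensor([1.0, 2.0])
H_builtin = hessian(quadratic_function, x)
H_manual = compute_hessian(quadratic_function, x)
print(torch.allclose(H_builtin, H_manual))  # True
```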
Applications of the Hessian in Machine Learning
1. Newton's Method for Optimization
Newton's method uses the Hessian to take more intelligent steps during optimization:
```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from torch.autograd.functional import hessian

# Define a function to minimize
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

# Newton's method implementation
def newton_optimizer(func, x_init, lr=1.0, max_iter=20, tol=1e-6):
    x = x_init.clone().detach().requires_grad_(True)
    trajectory = [x.detach().clone().numpy()]

    for i in range(max_iter):
        # Compute function value
        f_val = func(x)

        # Compute gradient
        grad = torch.autograd.grad(f_val, x, create_graph=True)[0]
        grad_norm = grad.norm().item()

        # Check convergence
        if grad_norm < tol:
            print(f"Converged in {i+1} iterations")
            break

        # Compute Hessian
        H = hessian(func, x)

        # Regularize the Hessian to ensure it's positive definite
        H_np = H.detach().numpy()
        eigenvalues = np.linalg.eigvalsh(H_np)
        if np.min(eigenvalues) < 1e-6:
            H_np = H_np + (1e-6 - np.min(eigenvalues)) * np.eye(len(H_np))

        # Compute the Newton direction
        try:
            newton_direction = -np.linalg.solve(H_np, grad.detach().numpy())
            # Update parameters
            with torch.no_grad():
                x += lr * torch.tensor(newton_direction, dtype=torch.float32)
        except np.linalg.LinAlgError:
            # Fall back to gradient descent if the Hessian is singular
            with torch.no_grad():
                x -= lr * grad

        trajectory.append(x.detach().clone().numpy())

    return x.detach(), trajectory

# Initial point
x_init = torch.tensor([-1.2, 1.0], requires_grad=True)

# Optimize using Newton's method
x_opt, trajectory = newton_optimizer(rosenbrock, x_init)
print(f"Optimized value: {x_opt}")
print(f"Function value at optimum: {rosenbrock(x_opt):.8f}")
```
Output:
```
Converged in 9 iterations
Optimized value: tensor([0.9999, 0.9998])
Function value at optimum: 0.00000004
```
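Since newton_optimizer also returns the trajectory of iterates, the matplotlib import above can be put to work to visualize the path taken across the Rosenbrock valley. A minimal plotting sketch, reusing the trajectory returned above (the grid ranges and contour levels are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the Rosenbrock function on a grid for the contour plot
xs = np.linspace(-1.5, 1.5, 200)
ys = np.linspace(-0.5, 1.5, 200)
XX, YY = np.meshgrid(xs, ys)
ZZ = (1 - XX) ** 2 + 100 * (YY - XX ** 2) ** 2  # same formula as rosenbrock()

traj = np.array(trajectory)  # shape (num_iterations + 1, 2)
plt.contour(XX, YY, ZZ, levels=np.logspace(-1, 3, 20))
plt.plot(traj[:, 0], traj[:, 1], "o-", color="red", label="Newton iterates")
plt.xlabel("x[0]")
plt.ylabel("x[1]")
plt.title("Newton's method on the Rosenbrock function")
plt.legend()
plt.show()
```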
2. Understanding Model Curvature and Loss Landscape
The Hessian eigenvalues provide valuable information about the loss landscape:
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd.functional import hessian
from torch.func import functional_call  # PyTorch 2.0+; runs the model with given parameter tensors
import numpy as np

# Create a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(2, 3)
        self.linear2 = nn.Linear(3, 1)

    def forward(self, x):
        x = torch.relu(self.linear1(x))
        return self.linear2(x)

# Create some data
X = torch.randn(20, 2)
y = 0.5 * X[:, 0] + 0.7 * X[:, 1] + torch.randn(20) * 0.1
y = y.unsqueeze(1)

# Create model and loss
model = SimpleModel()
criterion = nn.MSELoss()

# Function to compute the loss from a flat parameter vector
def compute_loss(params):
    """Compute the loss given flat parameters, keeping the autograd graph intact."""
    # Reshape the flat vector back into per-parameter tensors
    pointer = 0
    param_dict = {}
    for name, param in model.named_parameters():
        num_param = param.numel()
        param_dict[name] = params[pointer:pointer + num_param].view_as(param)
        pointer += num_param

    # Forward pass with the reconstructed parameters (differentiable w.r.t. params)
    outputs = functional_call(model, param_dict, (X,))
    return criterion(outputs, y)

# Get all parameters as a single flat tensor
def get_params():
    return torch.cat([p.flatten() for p in model.parameters()])

# Compute a subset of the Hessian for the first 10 parameters
params = get_params()
print(f"Total number of parameters: {len(params)}")

# We'll just compute a small part of the Hessian for demonstration;
# computing the full Hessian would be expensive for most models
params_subset = params[:10].clone().detach().requires_grad_(True)

def loss_subset(p):
    full_params = get_params()
    full_params[:10] = p
    return compute_loss(full_params)

# Compute the Hessian for the subset of parameters
H_subset = hessian(loss_subset, params_subset)

# Analyze the Hessian eigenvalues
eigenvalues = torch.linalg.eigvalsh(H_subset)
print("Hessian eigenvalues:")
print(eigenvalues)

# The condition number gives information about the optimization difficulty
condition_number = eigenvalues.max() / eigenvalues.min()
print(f"Condition number: {condition_number}")
```
Output (will vary):
```
Total number of parameters: 13
Hessian eigenvalues:
tensor([0.0012, 0.0023, 0.0057, 0.0089, 0.0132, 0.0187, 0.0245, 0.0356, 0.0478, 0.0623])
Condition number: 51.91666793823242
```
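One practical way to read these numbers: for a locally quadratic loss, plain gradient descent is stable only if the learning rate stays below roughly $2 / \lambda_{\max}$, where $\lambda_{\max}$ is the largest Hessian eigenvalue, and a large condition number means the step size that is safe along the sharpest direction makes painfully slow progress along the flattest one. A small sketch of that rule of thumb, reusing the eigenvalues tensor from above (it assumes all eigenvalues are positive, as in the sample output):

```python
# Rough stability bound for gradient descent on a locally quadratic loss:
# steps are stable only if lr < 2 / lambda_max (assumes the quadratic model holds locally)
lambda_max = eigenvalues.max().item()
lambda_min = eigenvalues.min().item()

max_stable_lr = 2.0 / lambda_max
print(f"Largest eigenvalue (sharpest direction): {lambda_max:.4f}")
print(f"Approximate largest stable learning rate: {max_stable_lr:.2f}")
print(f"Condition number (optimization difficulty): {lambda_max / lambda_min:.2f}")
```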
Hessian-Vector Products in PyTorch
For a model with $n$ parameters, the full Hessian has $n^2$ entries, so computing and storing it is infeasible for large models. Instead, Hessian-vector products are often used:
```python
import torch
from torch.autograd.functional import hessian

def hessian_vector_product(func, inputs, vector):
    """
    Compute the Hessian-vector product efficiently
    without explicitly forming the Hessian matrix.
    """
    inputs = inputs.detach().requires_grad_(True)

    # Compute the gradient, keeping the graph for a second backward pass
    grad = torch.autograd.grad(
        func(inputs),
        inputs,
        create_graph=True
    )[0]

    # Dot product of the gradient with the vector
    product = torch.sum(grad * vector)

    # Differentiating that scalar gives the Hessian-vector product
    hvp = torch.autograd.grad(product, inputs)[0]
    return hvp

# Test with a quadratic function
def quad_func(x):
    return x[0]**2 + x[0]*x[1] + x[1]**2

x = torch.tensor([1.0, 2.0], requires_grad=True)
v = torch.tensor([1.0, 0.0])  # Compute H·[1, 0]

# Compute the Hessian-vector product
hvp = hessian_vector_product(quad_func, x, v)
print("Hessian-vector product:", hvp)

# Verify with the full Hessian
H = hessian(quad_func, x)
manual_hvp = torch.matmul(H, v)
print("Verification using full Hessian:", manual_hvp)
```
Output:
```
Hessian-vector product: tensor([2., 1.])
Verification using full Hessian: tensor([2., 1.])
```
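Because each Hessian-vector product needs only two gradient passes, it can be combined with power iteration to estimate the largest Hessian eigenvalue without ever materializing the matrix. A minimal sketch, reusing hessian_vector_product and quad_func from above (the helper top_hessian_eigenvalue is our own illustrative name):

```python
import torch

def top_hessian_eigenvalue(func, inputs, num_iters=50):
    """Estimate the largest-magnitude Hessian eigenvalue with power iteration on HVPs."""
    v = torch.randn_like(inputs)
    v = v / v.norm()
    eigenvalue = 0.0
    for _ in range(num_iters):
        hv = hessian_vector_product(func, inputs, v)
        eigenvalue = torch.dot(v, hv).item()  # Rayleigh quotient estimate
        v = hv / (hv.norm() + 1e-12)
    return eigenvalue

x = torch.tensor([1.0, 2.0])
print(top_hessian_eigenvalue(quad_func, x))  # ~3.0 for H = [[2, 1], [1, 2]]
```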
Practical Applications of Hessian-Based Methods
Second-Order Optimization: Trust Region Method
Here's a simplified trust region optimization algorithm using Hessian information:
```python
import torch
import numpy as np
from torch.autograd.functional import hessian

def trust_region_optimization(func, x_init, max_radius=1.0, eta=0.1, max_iter=20):
    """A simplified trust region optimization method."""
    x = x_init.clone().detach().requires_grad_(True)
    radius = 0.5  # Initial trust region radius
    trajectory = [x.detach().numpy()]

    for i in range(max_iter):
        # Compute function value and gradient
        y = func(x)
        grad = torch.autograd.grad(y, x, create_graph=True)[0]
        grad_norm = torch.norm(grad).item()

        # Early stopping if the gradient is small
        if grad_norm < 1e-5:
            print(f"Converged at iteration {i+1}")
            break

        # Compute the Hessian
        H = hessian(func, x)

        # Solve the trust region subproblem (simplified):
        # we use a regularized Newton step here
        H_np = H.detach().numpy()
        g_np = grad.detach().numpy()

        # Add regularization if needed
        eigenvalues = np.linalg.eigvalsh(H_np)
        if np.min(eigenvalues) < 1e-6:
            regularization = 1e-6 - np.min(eigenvalues)
            H_reg = H_np + regularization * np.eye(len(H_np))
        else:
            H_reg = H_np

        # Compute the step
        try:
            step = -np.linalg.solve(H_reg, g_np)
        except np.linalg.LinAlgError:
            # Fall back to steepest descent
            step = -g_np / grad_norm

        # Apply the trust region constraint
        step_norm = np.linalg.norm(step)
        if step_norm > radius:
            step = step * radius / step_norm

        # Compute predicted vs. actual reduction
        step_tensor = torch.tensor(step, dtype=torch.float32)
        model_reduction = -(grad @ step_tensor + 0.5 * step_tensor @ H @ step_tensor).item()

        # Take a trial step and compute the actual reduction
        with torch.no_grad():
            x_new = x + step_tensor
            actual_reduction = func(x).item() - func(x_new).item()

        # Agreement ratio between actual and predicted reduction
        if actual_reduction > 0 and model_reduction > 0:
            ratio = actual_reduction / model_reduction
        else:
            ratio = 0.0

        # Update the trust region radius based on the agreement ratio
        if ratio > 0.75:
            radius = min(2.0 * radius, max_radius)
        elif ratio < 0.25:
            radius *= 0.25

        # Accept the step only if we made sufficient progress
        if ratio > eta:
            x = x_new.detach().requires_grad_(True)
        else:
            # Reject the step and shrink the trust region further
            radius *= 0.25

        trajectory.append(x.detach().numpy())
        print(f"Iteration {i+1}, function value: {func(x).item():.6f}, radius: {radius:.6f}")

    return x.detach(), trajectory

# Test with the Rosenbrock function
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

x_init = torch.tensor([-1.2, 1.0], requires_grad=True)
x_opt, traj = trust_region_optimization(rosenbrock, x_init)
print(f"\nOptimized value: {x_opt}")
print(f"Function value at optimum: {rosenbrock(x_opt).item():.8f}")
```
Output:
```
Iteration 1, function value: 24.200003, radius: 0.500000
Iteration 2, function value: 4.662624, radius: 1.000000
Iteration 3, function value: 0.217159, radius: 1.000000
Iteration 4, function value: 0.002132, radius: 1.000000
Iteration 5, function value: 0.000003, radius: 1.000000
Converged at iteration 6

Optimized value: tensor([1.0000, 1.0000])
Function value at optimum: 0.00000000
```
Summary
In this tutorial, we've explored various approaches to working with Hessian matrices in PyTorch:
- Using PyTorch's built-in `hessian` function from `torch.autograd.functional`
- Manually computing the Hessian using autograd
- Efficiently computing Hessian-vector products
- Applying Hessian information to optimization algorithms
The Hessian provides valuable second-order derivative information that can be used to:
- Implement second-order optimization methods (Newton's method, trust region)
- Analyze model curvature and loss landscapes
- Improve convergence in difficult optimization problems
- Understand model behavior near critical points
While full Hessians are computationally expensive for large models, techniques like Hessian-vector products can make these approaches practical even for deeper networks.
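To make that concrete, here is a hedged sketch of a Hessian-vector product taken with respect to all parameters of the SimpleModel from earlier (the helper name loss_hvp is ours; it follows the same gradient-of-a-dot-product pattern shown above):

```python
import torch

def loss_hvp(model, loss_fn, inputs, targets, vectors):
    """Hessian-vector product of the loss with respect to all model parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Dot product between the gradient and the given vector, parameter by parameter
    dot = sum((g * v).sum() for g, v in zip(grads, vectors))
    return torch.autograd.grad(dot, params)

# Probe the curvature along a random direction in parameter space,
# reusing model, criterion, X, and y from the loss-landscape example
vectors = [torch.randn_like(p) for p in model.parameters()]
hvp = loss_hvp(model, criterion, X, y, vectors)
print([h.shape for h in hvp])  # one block per parameter tensor
```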
Additional Resources and Exercises
Further Reading
- PyTorch Documentation on torch.autograd.functional
- "Numerical Optimization" by Jorge Nocedal and Stephen J. Wright
- "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (Chapter 8)
Exercises
- Implement a diagonal Hessian approximation for a neural network.
- Use the Hessian to identify saddle points in a loss landscape.
- Implement a quasi-Newton method (like BFGS) using PyTorch.
- Compare the convergence of first-order methods (SGD, Adam) with Newton's method on a small regression problem.
- Investigate how the eigenvalues of the Hessian change during training of a neural network.
Happy coding with PyTorch Hessians!