Skip to main content

TensorFlow Variational Inference

Variational inference is a powerful technique in Bayesian machine learning that allows us to approximate complex posterior distributions. In this tutorial, we'll explore how to implement variational inference using TensorFlow Probability (TFP) and understand its applications in probabilistic modeling.

Introduction to Variational Inference

Bayesian modeling often requires computing posterior distributions, which can be mathematically intractable for complex models. Variational inference (VI) provides a deterministic alternative to sampling-based methods like Markov Chain Monte Carlo (MCMC) by posing posterior approximation as an optimization problem.

The core idea behind variational inference is to:

  1. Choose a family of simpler distributions (the variational family)
  2. Find the member of this family that is closest to the true posterior
  3. Use this approximation instead of the exact posterior for inference tasks

This approach trades some accuracy for significant computational advantages, making Bayesian inference feasible for large-scale problems.

Understanding the Mathematics

At its heart, variational inference minimizes the Kullback-Leibler (KL) divergence between the approximate distribution q(z) and the true posterior p(z|x):

KL(q(z)p(zx))\text{KL}(q(z) || p(z|x))

Since we don't know p(z|x) directly, VI instead maximizes the Evidence Lower Bound (ELBO):

ELBO(q)=Eq(z)[logp(x,z)]Eq(z)[logq(z)]\text{ELBO}(q) = \mathbb{E}_{q(z)}[\log p(x,z)] - \mathbb{E}_{q(z)}[\log q(z)]

This is equivalent to minimizing the KL divergence plus a constant.

Setting Up the Environment

Let's start by installing and importing the required libraries:

python
# Install TensorFlow Probability if needed
# !pip install tensorflow-probability

import tensorflow as tf
import tensorflow_probability as tfp
import matplotlib.pyplot as plt
import numpy as np

tfd = tfp.distributions
tfb = tfp.bijectors

Simple Example: Approximating a Gaussian Mixture

Let's start with a basic example of approximating a mixture of Gaussians with a single Gaussian using variational inference.

1. Define the Target Distribution

First, we'll create a mixture of two Gaussians as our target distribution:

python
# Define the target distribution (a mixture of two Gaussians)
mixture = tfd.Mixture(
cat=tfd.Categorical(probs=[0.3, 0.7]),
components=[
tfd.Normal(loc=-3.0, scale=1.0),
tfd.Normal(loc=2.0, scale=1.5)
]
)

# Visualize the target distribution
x = np.linspace(-8, 8, 1000)
plt.figure(figsize=(10, 6))
plt.plot(x, mixture.prob(x).numpy())
plt.title("Target Distribution: Mixture of Gaussians")
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.grid(True)
plt.show()

Output would show a bimodal distribution with peaks at approximately -3 and 2.

2. Define the Variational Distribution

Now we'll define our variational distribution - a simple Gaussian whose parameters we'll optimize:

python
# Create trainable variables for the variational distribution parameters
loc = tf.Variable(0.0, name="loc")
scale = tf.Variable(1.0, name="scale", constraint=lambda x: tf.nn.softplus(x))

# Define the variational distribution
variational_dist = tfd.Normal(loc=loc, scale=scale)

3. Define the Loss Function (Negative ELBO)

We need to maximize the ELBO, which is equivalent to minimizing the negative ELBO:

python
def negative_elbo():
# Sample from the variational distribution
samples = variational_dist.sample(1000)

# Compute log probabilities
q_log_prob = variational_dist.log_prob(samples)
p_log_prob = mixture.log_prob(samples)

# Compute ELBO
elbo = tf.reduce_mean(p_log_prob - q_log_prob)

# Return negative ELBO (to minimize)
return -elbo

4. Optimize the Variational Parameters

Now we'll optimize the parameters of our variational distribution:

python
# Create an optimizer
optimizer = tf.optimizers.Adam(learning_rate=0.1)

# Training loop
for step in range(1000):
optimizer.minimize(negative_elbo, var_list=[loc, scale])

if step % 100 == 0:
print(f"Step {step}, loc: {loc.numpy():.4f}, scale: {scale.numpy():.4f}, loss: {negative_elbo().numpy():.4f}")

The output might look something like:

Step 0, loc: 0.0930, scale: 1.0230, loss: 2.9831
Step 100, loc: 0.4325, scale: 2.5167, loss: 2.2145
Step 200, loc: 0.4763, scale: 2.7239, loss: 2.1897
...
Step 900, loc: 0.5020, scale: 2.8336, loss: 2.1830

5. Visualize the Results

Let's visualize how well our variational distribution approximates the target:

python
# Visualize the results
x = np.linspace(-8, 8, 1000)
plt.figure(figsize=(10, 6))
plt.plot(x, mixture.prob(x).numpy(), label="Target Distribution")
plt.plot(x, variational_dist.prob(x).numpy(), label=f"Variational Approximation (μ={loc.numpy():.2f}, σ={scale.numpy():.2f})")
plt.title("Variational Inference Result")
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.legend()
plt.grid(True)
plt.show()

The output would show how a single Gaussian attempts to approximate the bimodal mixture distribution.

Advanced Example: Bayesian Neural Network

Now let's implement a more practical example: a Bayesian Neural Network (BNN) using variational inference. BNNs provide uncertainty estimates for their predictions, which is valuable in many applications.

1. Generate Synthetic Data

python
# Generate synthetic data
np.random.seed(42)
N = 100
X = np.linspace(-3, 3, N).reshape(-1, 1).astype(np.float32)
y = np.sin(X) + 0.3 * np.random.randn(N, 1).astype(np.float32)

# Plot the data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.7)
plt.title("Synthetic Data")
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True)
plt.show()

2. Define the Bayesian Neural Network Model

We'll create a simple BNN with one hidden layer:

python
class BayesianDenseLayer(tf.Module):
def __init__(self, input_dim, output_dim, name=None):
super().__init__(name=name)

# Prior distributions
self.weight_prior = tfd.Normal(loc=0., scale=1.)
self.bias_prior = tfd.Normal(loc=0., scale=1.)

# Initial parameters for the variational distributions
self.weight_loc = tf.Variable(tf.random.normal([input_dim, output_dim], 0.0, 0.1))
self.weight_scale = tf.Variable(tf.ones([input_dim, output_dim]) * -3.0)
self.bias_loc = tf.Variable(tf.zeros([output_dim]))
self.bias_scale = tf.Variable(tf.ones([output_dim]) * -3.0)

def __call__(self, inputs, sampling=True):
# Define variational distributions
q_weight = tfd.Normal(loc=self.weight_loc, scale=tf.nn.softplus(self.weight_scale))
q_bias = tfd.Normal(loc=self.bias_loc, scale=tf.nn.softplus(self.bias_scale))

if sampling:
# Sample weights and biases
weights = q_weight.sample()
bias = q_bias.sample()
else:
# Use mean of the variational distribution
weights = self.weight_loc
bias = self.bias_loc

# Calculate outputs
outputs = tf.matmul(inputs, weights) + bias

# KL divergence for this layer
kl_weights = tf.reduce_sum(tfd.kl_divergence(q_weight, self.weight_prior))
kl_bias = tf.reduce_sum(tfd.kl_divergence(q_bias, self.bias_prior))
kl = kl_weights + kl_bias

return outputs, kl

class BayesianNeuralNetwork(tf.Module):
def __init__(self, name=None):
super().__init__(name=name)
# Define the architecture
self.hidden_layer = BayesianDenseLayer(1, 20, name="hidden")
self.output_layer = BayesianDenseLayer(20, 1, name="output")

def __call__(self, inputs, sampling=True):
hidden_output, kl_hidden = self.hidden_layer(inputs, sampling)
hidden_output = tf.nn.relu(hidden_output)
final_output, kl_output = self.output_layer(hidden_output, sampling)

# Total KL divergence
kl = kl_hidden + kl_output

return final_output, kl

3. Define the Loss Function

python
def negative_elbo_loss(model, x, y, kl_weight=1.0):
# Forward pass through the model
predictions, kl = model(x)

# Likelihood - assuming Gaussian observation model with fixed variance
log_likelihood = -0.5 * tf.reduce_sum(tf.square(y - predictions))

# ELBO = log_likelihood - kl_divergence
elbo = log_likelihood - kl_weight * kl

return -elbo # Negative ELBO as we're minimizing

4. Train the Model

python
# Initialize the model
bnn = BayesianNeuralNetwork()

# Create an optimizer
optimizer = tf.optimizers.Adam(learning_rate=0.01)

# Convert data to tensors
X_tensor = tf.convert_to_tensor(X)
y_tensor = tf.convert_to_tensor(y)

# Training loop
num_epochs = 1000
for epoch in range(num_epochs):
with tf.GradientTape() as tape:
loss = negative_elbo_loss(bnn, X_tensor, y_tensor, kl_weight=1.0/len(X))

# Get gradients and update model parameters
gradients = tape.gradient(loss, bnn.trainable_variables)
optimizer.apply_gradients(zip(gradients, bnn.trainable_variables))

if epoch % 100 == 0:
print(f"Epoch {epoch}, Loss: {loss.numpy():.4f}")

5. Make Predictions with Uncertainty

python
# Generate test points
X_test = np.linspace(-5, 5, 200).reshape(-1, 1).astype(np.float32)
X_test_tensor = tf.convert_to_tensor(X_test)

# Make multiple predictions to capture uncertainty
num_samples = 100
predictions = []

for _ in range(num_samples):
y_pred, _ = bnn(X_test_tensor)
predictions.append(y_pred)

# Stack predictions
predictions = np.hstack(predictions)

# Calculate mean and standard deviation
mean_prediction = np.mean(predictions, axis=1)
std_prediction = np.std(predictions, axis=1)

# Plot results with uncertainty
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.7, label="Training data")
plt.plot(X_test, mean_prediction, color='red', label="Mean prediction")
plt.fill_between(X_test.ravel(),
mean_prediction - 2 * std_prediction,
mean_prediction + 2 * std_prediction,
color='red', alpha=0.3, label="95% confidence")
plt.title("Bayesian Neural Network with Uncertainty")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.grid(True)
plt.show()

The output will show a plot of our model's predictions with uncertainty bands, which will be wider in regions with less data.

Variational Autoencoders (VAE)

Variational autoencoders are a popular application of variational inference to learn latent representations of data. Let's implement a simple VAE for MNIST digits.

1. Prepare the Data

python
# Load MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, _), (x_test, _) = mnist.load_data()

# Normalize and reshape
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = x_train.reshape(-1, 784)
x_test = x_test.reshape(-1, 784)

# Create dataset
train_dataset = tf.data.Dataset.from_tensor_slices(x_train)
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(100)

2. Define the VAE Model

python
class VAE(tf.keras.Model):
def __init__(self, latent_dim=2):
super(VAE, self).__init__()
self.latent_dim = latent_dim

# Encoder network
self.encoder = tf.keras.Sequential([
tf.keras.layers.InputLayer(input_shape=(784,)),
tf.keras.layers.Dense(512, activation='relu'),
tf.keras.layers.Dense(256, activation='relu'),
tf.keras.layers.Dense(latent_dim * 2) # Mean and log_var
])

# Decoder network
self.decoder = tf.keras.Sequential([
tf.keras.layers.InputLayer(input_shape=(latent_dim,)),
tf.keras.layers.Dense(256, activation='relu'),
tf.keras.layers.Dense(512, activation='relu'),
tf.keras.layers.Dense(784, activation='sigmoid')
])

def encode(self, x):
mean_logvar = self.encoder(x)
mean, logvar = tf.split(mean_logvar, num_or_size_splits=2, axis=1)
return mean, logvar

def reparameterize(self, mean, logvar):
eps = tf.random.normal(shape=tf.shape(mean))
return mean + tf.exp(logvar * 0.5) * eps

def decode(self, z):
return self.decoder(z)

def call(self, x):
mean, logvar = self.encode(x)
z = self.reparameterize(mean, logvar)
x_recon = self.decode(z)
return x_recon, mean, logvar

3. Define the Loss Function

python
def compute_vae_loss(model, x):
# Forward pass
x_recon, mean, logvar = model(x)

# Reconstruction loss
recon_loss = tf.reduce_sum(
tf.keras.losses.binary_crossentropy(x, x_recon), axis=1
)

# KL divergence
kl_loss = -0.5 * tf.reduce_sum(
1 + logvar - tf.square(mean) - tf.exp(logvar), axis=1
)

# Total loss
total_loss = tf.reduce_mean(recon_loss + kl_loss)

return total_loss

4. Train the VAE

python
# Create VAE model
vae = VAE(latent_dim=2)

# Create optimizer
optimizer = tf.keras.optimizers.Adam(1e-3)

# Train the model
epochs = 10
for epoch in range(epochs):
epoch_loss = 0
num_batches = 0

for batch in train_dataset:
with tf.GradientTape() as tape:
loss = compute_vae_loss(vae, batch)

gradients = tape.gradient(loss, vae.trainable_variables)
optimizer.apply_gradients(zip(gradients, vae.trainable_variables))

epoch_loss += loss
num_batches += 1

print(f"Epoch {epoch+1}, Loss: {epoch_loss/num_batches:.4f}")

5. Visualize the Latent Space

python
# Sample from latent space
def plot_latent_space(vae, n=30, figsize=15):
# Create a grid of points in latent space
digit_size = 28
scale = 2.0
figure = np.zeros((digit_size * n, digit_size * n))

# Linear space coordinates
grid_x = np.linspace(-scale, scale, n)
grid_y = np.linspace(-scale, scale, n)[::-1]

for i, yi in enumerate(grid_y):
for j, xi in enumerate(grid_x):
z = np.array([[xi, yi]])
x_decoded = vae.decode(z)
digit = x_decoded[0].numpy().reshape(digit_size, digit_size)
figure[i * digit_size:(i + 1) * digit_size,
j * digit_size:(j + 1) * digit_size] = digit

plt.figure(figsize=(figsize, figsize))
plt.imshow(figure, cmap='Greys_r')
plt.title("Latent Space Visualization")
plt.axis('off')
plt.tight_layout()
plt.show()

# Visualize the latent space
plot_latent_space(vae)

This will display a grid of generated digits that shows how the latent space is organized.

Summary

In this tutorial, we explored variational inference in TensorFlow Probability:

  1. We learned the fundamentals of variational inference as a method for approximating complex posterior distributions
  2. Implemented a simple example to approximate a mixture of Gaussians
  3. Built a Bayesian Neural Network that provides uncertainty estimates
  4. Created a Variational Autoencoder (VAE) for unsupervised learning

Variational inference offers a powerful alternative to sampling-based methods for Bayesian inference, making it possible to scale probabilistic modeling to large datasets and complex models.

Additional Resources

To learn more about variational inference and TensorFlow Probability:

  1. TensorFlow Probability Documentation
  2. Variational Inference: A Review for Statisticians
  3. Auto-Encoding Variational Bayes
  4. Practical Bayesian Deep Learning course

Exercises

  1. Modify the mixture approximation example to use a mixture of Gaussians as the variational distribution instead of a single Gaussian.
  2. Extend the Bayesian Neural Network to include a higher number of layers and apply it to a real-world regression dataset.
  3. Implement a conditional VAE that can generate digits conditioned on a specific class.
  4. Try implementing normalizing flows to create more flexible variational distributions.
  5. Apply variational inference to a time-series prediction model and visualize the uncertainty in forecasts.

By working through these exercises, you'll gain a deeper understanding of how to apply variational inference to various probabilistic modeling tasks.



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)