TensorFlow Variational Inference

Variational inference is a powerful technique in Bayesian machine learning that allows us to approximate complex posterior distributions. In this tutorial, we'll explore how to implement variational inference using TensorFlow Probability (TFP) and understand its applications in probabilistic modeling.

Introduction to Variational Inference

Bayesian modeling often requires computing posterior distributions, which can be mathematically intractable for complex models. Variational inference (VI) provides a deterministic alternative to sampling-based methods like Markov Chain Monte Carlo (MCMC) by posing posterior approximation as an optimization problem.

The core idea behind variational inference is to:

Choose a family of simpler distributions (the variational family)
Find the member of this family that is closest to the true posterior
Use this approximation instead of the exact posterior for inference tasks

This approach trades some accuracy for significant computational advantages, making Bayesian inference feasible for large-scale problems.

Understanding the Mathematics

At its heart, variational inference minimizes the Kullback-Leibler (KL) divergence between the approximate distribution q(z) and the true posterior p(z|x):

$\text{KL}(q(z) || p(z|x))$

Since we don't know p(z|x) directly, VI instead maximizes the Evidence Lower Bound (ELBO):

$\text{ELBO}(q) = \mathbb{E}_{q(z)}[\log p(x,z)] - \mathbb{E}_{q(z)}[\log q(z)]$

This is equivalent to minimizing the KL divergence plus a constant.

Setting Up the Environment

Let's start by installing and importing the required libraries:

python
# Install TensorFlow Probability if needed
# !pip install tensorflow-probability

import tensorflow as tf
import tensorflow_probability as tfp
import matplotlib.pyplot as plt
import numpy as np

tfd = tfp.distributions
tfb = tfp.bijectors

Simple Example: Approximating a Gaussian Mixture

Let's start with a basic example of approximating a mixture of Gaussians with a single Gaussian using variational inference.

1. Define the Target Distribution

First, we'll create a mixture of two Gaussians as our target distribution:

python
# Define the target distribution (a mixture of two Gaussians)
mixture = tfd.Mixture(
    cat=tfd.Categorical(probs=[0.3, 0.7]),
    components=[
        tfd.Normal(loc=-3.0, scale=1.0),
        tfd.Normal(loc=2.0, scale=1.5)
    ]
)

# Visualize the target distribution
x = np.linspace(-8, 8, 1000)
plt.figure(figsize=(10, 6))
plt.plot(x, mixture.prob(x).numpy())
plt.title("Target Distribution: Mixture of Gaussians")
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.grid(True)
plt.show()

Output would show a bimodal distribution with peaks at approximately -3 and 2.

2. Define the Variational Distribution

Now we'll define our variational distribution - a simple Gaussian whose parameters we'll optimize:

python
# Create trainable variables for the variational distribution parameters
loc = tf.Variable(0.0, name="loc")
scale = tf.Variable(1.0, name="scale", constraint=lambda x: tf.nn.softplus(x))

# Define the variational distribution
variational_dist = tfd.Normal(loc=loc, scale=scale)

3. Define the Loss Function (Negative ELBO)

We need to maximize the ELBO, which is equivalent to minimizing the negative ELBO:

python
def negative_elbo():
    # Sample from the variational distribution
    samples = variational_dist.sample(1000)
    
    # Compute log probabilities
    q_log_prob = variational_dist.log_prob(samples)
    p_log_prob = mixture.log_prob(samples)
    
    # Compute ELBO
    elbo = tf.reduce_mean(p_log_prob - q_log_prob)
    
    # Return negative ELBO (to minimize)
    return -elbo

4. Optimize the Variational Parameters

Now we'll optimize the parameters of our variational distribution:

python
# Create an optimizer
optimizer = tf.optimizers.Adam(learning_rate=0.1)

# Training loop
for step in range(1000):
    optimizer.minimize(negative_elbo, var_list=[loc, scale])
    
    if step % 100 == 0:
        print(f"Step {step}, loc: {loc.numpy():.4f}, scale: {scale.numpy():.4f}, loss: {negative_elbo().numpy():.4f}")

The output might look something like:

Step 0, loc: 0.0930, scale: 1.0230, loss: 2.9831
Step 100, loc: 0.4325, scale: 2.5167, loss: 2.2145
Step 200, loc: 0.4763, scale: 2.7239, loss: 2.1897
...
Step 900, loc: 0.5020, scale: 2.8336, loss: 2.1830

5. Visualize the Results

Let's visualize how well our variational distribution approximates the target:

python
# Visualize the results
x = np.linspace(-8, 8, 1000)
plt.figure(figsize=(10, 6))
plt.plot(x, mixture.prob(x).numpy(), label="Target Distribution")
plt.plot(x, variational_dist.prob(x).numpy(), label=f"Variational Approximation (μ={loc.numpy():.2f}, σ={scale.numpy():.2f})")
plt.title("Variational Inference Result")
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.legend()
plt.grid(True)
plt.show()

The output would show how a single Gaussian attempts to approximate the bimodal mixture distribution.

Advanced Example: Bayesian Neural Network

Now let's implement a more practical example: a Bayesian Neural Network (BNN) using variational inference. BNNs provide uncertainty estimates for their predictions, which is valuable in many applications.

1. Generate Synthetic Data

python
# Generate synthetic data
np.random.seed(42)
N = 100
X = np.linspace(-3, 3, N).reshape(-1, 1).astype(np.float32)
y = np.sin(X) + 0.3 * np.random.randn(N, 1).astype(np.float32)

# Plot the data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.7)
plt.title("Synthetic Data")
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True)
plt.show()

2. Define the Bayesian Neural Network Model

We'll create a simple BNN with one hidden layer:

python
class BayesianDenseLayer(tf.Module):
    def __init__(self, input_dim, output_dim, name=None):
        super().__init__(name=name)
        
        # Prior distributions
        self.weight_prior = tfd.Normal(loc=0., scale=1.)
        self.bias_prior = tfd.Normal(loc=0., scale=1.)
        
        # Initial parameters for the variational distributions
        self.weight_loc = tf.Variable(tf.random.normal([input_dim, output_dim], 0.0, 0.1))
        self.weight_scale = tf.Variable(tf.ones([input_dim, output_dim]) * -3.0)
        self.bias_loc = tf.Variable(tf.zeros([output_dim]))
        self.bias_scale = tf.Variable(tf.ones([output_dim]) * -3.0)
        
    def __call__(self, inputs, sampling=True):
        # Define variational distributions
        q_weight = tfd.Normal(loc=self.weight_loc, scale=tf.nn.softplus(self.weight_scale))
        q_bias = tfd.Normal(loc=self.bias_loc, scale=tf.nn.softplus(self.bias_scale))
        
        if sampling:
            # Sample weights and biases
            weights = q_weight.sample()
            bias = q_bias.sample()
        else:
            # Use mean of the variational distribution
            weights = self.weight_loc
            bias = self.bias_loc
            
        # Calculate outputs
        outputs = tf.matmul(inputs, weights) + bias
        
        # KL divergence for this layer
        kl_weights = tf.reduce_sum(tfd.kl_divergence(q_weight, self.weight_prior))
        kl_bias = tf.reduce_sum(tfd.kl_divergence(q_bias, self.bias_prior))
        kl = kl_weights + kl_bias
        
        return outputs, kl

class BayesianNeuralNetwork(tf.Module):
    def __init__(self, name=None):
        super().__init__(name=name)
        # Define the architecture
        self.hidden_layer = BayesianDenseLayer(1, 20, name="hidden")
        self.output_layer = BayesianDenseLayer(20, 1, name="output")
        
    def __call__(self, inputs, sampling=True):
        hidden_output, kl_hidden = self.hidden_layer(inputs, sampling)
        hidden_output = tf.nn.relu(hidden_output)
        final_output, kl_output = self.output_layer(hidden_output, sampling)
        
        # Total KL divergence
        kl = kl_hidden + kl_output
        
        return final_output, kl

3. Define the Loss Function

python
def negative_elbo_loss(model, x, y, kl_weight=1.0):
    # Forward pass through the model
    predictions, kl = model(x)
    
    # Likelihood - assuming Gaussian observation model with fixed variance
    log_likelihood = -0.5 * tf.reduce_sum(tf.square(y - predictions))
    
    # ELBO = log_likelihood - kl_divergence
    elbo = log_likelihood - kl_weight * kl
    
    return -elbo  # Negative ELBO as we're minimizing

4. Train the Model

python
# Initialize the model
bnn = BayesianNeuralNetwork()

# Create an optimizer
optimizer = tf.optimizers.Adam(learning_rate=0.01)

# Convert data to tensors
X_tensor = tf.convert_to_tensor(X)
y_tensor = tf.convert_to_tensor(y)

# Training loop
num_epochs = 1000
for epoch in range(num_epochs):
    with tf.GradientTape() as tape:
        loss = negative_elbo_loss(bnn, X_tensor, y_tensor, kl_weight=1.0/len(X))
    
    # Get gradients and update model parameters
    gradients = tape.gradient(loss, bnn.trainable_variables)
    optimizer.apply_gradients(zip(gradients, bnn.trainable_variables))
    
    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss.numpy():.4f}")

5. Make Predictions with Uncertainty

python
# Generate test points
X_test = np.linspace(-5, 5, 200).reshape(-1, 1).astype(np.float32)
X_test_tensor = tf.convert_to_tensor(X_test)

# Make multiple predictions to capture uncertainty
num_samples = 100
predictions = []

for _ in range(num_samples):
    y_pred, _ = bnn(X_test_tensor)
    predictions.append(y_pred)

# Stack predictions
predictions = np.hstack(predictions)

# Calculate mean and standard deviation
mean_prediction = np.mean(predictions, axis=1)
std_prediction = np.std(predictions, axis=1)

# Plot results with uncertainty
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.7, label="Training data")
plt.plot(X_test, mean_prediction, color='red', label="Mean prediction")
plt.fill_between(X_test.ravel(), 
                 mean_prediction - 2 * std_prediction, 
                 mean_prediction + 2 * std_prediction,
                 color='red', alpha=0.3, label="95% confidence")
plt.title("Bayesian Neural Network with Uncertainty")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.grid(True)
plt.show()

The output will show a plot of our model's predictions with uncertainty bands, which will be wider in regions with less data.

Variational Autoencoders (VAE)

Variational autoencoders are a popular application of variational inference to learn latent representations of data. Let's implement a simple VAE for MNIST digits.

1. Prepare the Data

python
# Load MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, _), (x_test, _) = mnist.load_data()

# Normalize and reshape
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = x_train.reshape(-1, 784)
x_test = x_test.reshape(-1, 784)

# Create dataset
train_dataset = tf.data.Dataset.from_tensor_slices(x_train)
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(100)

2. Define the VAE Model

python
class VAE(tf.keras.Model):
    def __init__(self, latent_dim=2):
        super(VAE, self).__init__()
        self.latent_dim = latent_dim
        
        # Encoder network
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.InputLayer(input_shape=(784,)),
            tf.keras.layers.Dense(512, activation='relu'),
            tf.keras.layers.Dense(256, activation='relu'),
            tf.keras.layers.Dense(latent_dim * 2)  # Mean and log_var
        ])
        
        # Decoder network
        self.decoder = tf.keras.Sequential([
            tf.keras.layers.InputLayer(input_shape=(latent_dim,)),
            tf.keras.layers.Dense(256, activation='relu'),
            tf.keras.layers.Dense(512, activation='relu'),
            tf.keras.layers.Dense(784, activation='sigmoid')
        ])
        
    def encode(self, x):
        mean_logvar = self.encoder(x)
        mean, logvar = tf.split(mean_logvar, num_or_size_splits=2, axis=1)
        return mean, logvar
    
    def reparameterize(self, mean, logvar):
        eps = tf.random.normal(shape=tf.shape(mean))
        return mean + tf.exp(logvar * 0.5) * eps
    
    def decode(self, z):
        return self.decoder(z)
    
    def call(self, x):
        mean, logvar = self.encode(x)
        z = self.reparameterize(mean, logvar)
        x_recon = self.decode(z)
        return x_recon, mean, logvar

3. Define the Loss Function

python
def compute_vae_loss(model, x):
    # Forward pass
    x_recon, mean, logvar = model(x)
    
    # Reconstruction loss
    recon_loss = tf.reduce_sum(
        tf.keras.losses.binary_crossentropy(x, x_recon), axis=1
    )
    
    # KL divergence
    kl_loss = -0.5 * tf.reduce_sum(
        1 + logvar - tf.square(mean) - tf.exp(logvar), axis=1
    )
    
    # Total loss
    total_loss = tf.reduce_mean(recon_loss + kl_loss)
    
    return total_loss

4. Train the VAE

python
# Create VAE model
vae = VAE(latent_dim=2)

# Create optimizer
optimizer = tf.keras.optimizers.Adam(1e-3)

# Train the model
epochs = 10
for epoch in range(epochs):
    epoch_loss = 0
    num_batches = 0
    
    for batch in train_dataset:
        with tf.GradientTape() as tape:
            loss = compute_vae_loss(vae, batch)
        
        gradients = tape.gradient(loss, vae.trainable_variables)
        optimizer.apply_gradients(zip(gradients, vae.trainable_variables))
        
        epoch_loss += loss
        num_batches += 1
    
    print(f"Epoch {epoch+1}, Loss: {epoch_loss/num_batches:.4f}")

5. Visualize the Latent Space

python
# Sample from latent space
def plot_latent_space(vae, n=30, figsize=15):
    # Create a grid of points in latent space
    digit_size = 28
    scale = 2.0
    figure = np.zeros((digit_size * n, digit_size * n))
    
    # Linear space coordinates
    grid_x = np.linspace(-scale, scale, n)
    grid_y = np.linspace(-scale, scale, n)[::-1]
    
    for i, yi in enumerate(grid_y):
        for j, xi in enumerate(grid_x):
            z = np.array([[xi, yi]])
            x_decoded = vae.decode(z)
            digit = x_decoded[0].numpy().reshape(digit_size, digit_size)
            figure[i * digit_size:(i + 1) * digit_size,
                   j * digit_size:(j + 1) * digit_size] = digit
    
    plt.figure(figsize=(figsize, figsize))
    plt.imshow(figure, cmap='Greys_r')
    plt.title("Latent Space Visualization")
    plt.axis('off')
    plt.tight_layout()
    plt.show()

# Visualize the latent space
plot_latent_space(vae)

This will display a grid of generated digits that shows how the latent space is organized.

Summary

In this tutorial, we explored variational inference in TensorFlow Probability:

We learned the fundamentals of variational inference as a method for approximating complex posterior distributions
Implemented a simple example to approximate a mixture of Gaussians
Built a Bayesian Neural Network that provides uncertainty estimates
Created a Variational Autoencoder (VAE) for unsupervised learning

Variational inference offers a powerful alternative to sampling-based methods for Bayesian inference, making it possible to scale probabilistic modeling to large datasets and complex models.

Additional Resources

To learn more about variational inference and TensorFlow Probability:

Exercises

Modify the mixture approximation example to use a mixture of Gaussians as the variational distribution instead of a single Gaussian.
Extend the Bayesian Neural Network to include a higher number of layers and apply it to a real-world regression dataset.
Implement a conditional VAE that can generate digits conditioned on a specific class.
Try implementing normalizing flows to create more flexible variational distributions.
Apply variational inference to a time-series prediction model and visualize the uncertainty in forecasts.

By working through these exercises, you'll gain a deeper understanding of how to apply variational inference to various probabilistic modeling tasks.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction to Variational Inference​

Understanding the Mathematics​

Setting Up the Environment​

Simple Example: Approximating a Gaussian Mixture​

1. Define the Target Distribution​

2. Define the Variational Distribution​

3. Define the Loss Function (Negative ELBO)​

4. Optimize the Variational Parameters​

5. Visualize the Results​

Advanced Example: Bayesian Neural Network​

1. Generate Synthetic Data​

2. Define the Bayesian Neural Network Model​

3. Define the Loss Function​

4. Train the Model​

5. Make Predictions with Uncertainty​

Variational Autoencoders (VAE)​

1. Prepare the Data​

2. Define the VAE Model​

3. Define the Loss Function​

4. Train the VAE​

5. Visualize the Latent Space​

Summary​

Additional Resources​

Exercises​

Introduction to Variational Inference

Understanding the Mathematics

Setting Up the Environment

Simple Example: Approximating a Gaussian Mixture

1. Define the Target Distribution

2. Define the Variational Distribution

3. Define the Loss Function (Negative ELBO)

4. Optimize the Variational Parameters

5. Visualize the Results

Advanced Example: Bayesian Neural Network

1. Generate Synthetic Data

2. Define the Bayesian Neural Network Model

3. Define the Loss Function

4. Train the Model

5. Make Predictions with Uncertainty

Variational Autoencoders (VAE)

1. Prepare the Data

2. Define the VAE Model

3. Define the Loss Function

4. Train the VAE

5. Visualize the Latent Space

Summary

Additional Resources

Exercises