TensorFlow Graph Mode
Introduction
TensorFlow was originally built around a computational graph paradigm where operations were first defined in a graph and then executed in sessions. While TensorFlow 2.x defaults to the more user-friendly Eager execution, Graph mode remains a powerful feature that offers significant performance benefits for deployment and distributed training. In this tutorial, we'll explore TensorFlow's Graph mode, understand its advantages, and learn how to use it effectively.
What is Graph Mode?
Graph mode is TensorFlow's original execution model where computations are defined as a dataflow graph before they're executed. In contrast to Eager execution (which evaluates operations immediately), Graph mode:
- Defines operations first in a computational graph
- Optimizes the graph for efficiency
- Executes the graph only when requested
This approach allows TensorFlow to analyze your entire computation beforehand, enabling optimizations that wouldn't be possible when executing operations one by one.
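To make this concrete, here is a minimal define-then-run sketch using the TF1 compatibility API that still ships with TensorFlow 2.x:
import tensorflow as tf
# Build a graph first: these lines only add nodes, nothing runs yet
g = tf.Graph()
with g.as_default():
    a = tf.constant(2.0)
    b = tf.constant(3.0)
    c = a * b
# Execute the graph explicitly in a session (the original TF1 workflow)
with tf.compat.v1.Session(graph=g) as sess:
    print(sess.run(c))  # 6.0 - computation happens only here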
Why Use Graph Mode?
Despite TensorFlow 2.x's focus on Eager execution, there are several compelling reasons to use Graph mode:
- Performance: Graphs often execute faster, especially for complex models
- Deployment: TensorFlow Serving and many production environments require graphs
- Portability: Computational graphs can be saved and loaded across different environments
- Optimization: The TensorFlow runtime can apply various optimizations to graphs
- Distributed Execution: Better support for distributed training across multiple devices
Basic Graph Mode with tf.function
In TensorFlow 2.x, the primary way to use Graph mode is through the @tf.function decorator, which automatically converts Python functions into TensorFlow graphs.
Let's see a simple example:
import tensorflow as tf
import time
# Define a function to be converted to graph mode
@tf.function
def graph_computation(x):
    print("Tracing function")  # This runs during tracing, not execution
    return tf.matmul(x, x) + tf.reduce_sum(x)
# Create sample data
x = tf.random.normal((1000, 1000))
# First call - the function is traced
start = time.time()
result1 = graph_computation(x)
first_run = time.time() - start
# Second call - uses the cached graph
start = time.time()
result2 = graph_computation(x)
second_run = time.time() - start
print(f"First run (tracing): {first_run:.5f} seconds")
print(f"Second run (cached): {second_run:.5f} seconds")
Output:
Tracing function
First run (tracing): 0.14523 seconds
Second run (cached): 0.01245 seconds
Notice how "Tracing function" is printed only once, even though we called the function twice. This is because tf.function traces the function once to build the computational graph and then reuses it for subsequent calls with compatible inputs.
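If you're curious which traces a tf.function has accumulated, it can report its concrete signatures (the method below is available in recent TF 2.x releases):
# List the input signatures this function has been traced for
print(graph_computation.pretty_printed_concrete_signatures())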
Understanding Tracing
When you apply the @tf.function decorator, TensorFlow "traces" your function, converting your Python code into a TensorFlow graph. This is a key concept to understand:
@tf.function
def add_and_multiply(a, b):
    print("Tracing with", a, b)
    c = a + b
    return c * b
# Different data types trigger different traces
print("Calling with integers:")
print(add_and_multiply(2, 3))
print(add_and_multiply(5, 7))
print("\nCalling with float:")
print(add_and_multiply(2.0, 3.0))
print("\nCalling with tensors:")
print(add_and_multiply(tf.constant(2), tf.constant(3)))
Output:
Calling with integers:
Tracing with 2 3
tf.Tensor(15, shape=(), dtype=int32)
Tracing with 5 7
tf.Tensor(84, shape=(), dtype=int32)

Calling with float:
Tracing with 2.0 3.0
tf.Tensor(15.0, shape=(), dtype=float32)

Calling with tensors:
Tracing with Tensor("a:0", shape=(), dtype=int32) Tensor("b:0", shape=(), dtype=int32)
tf.Tensor(15, shape=(), dtype=int32)
TensorFlow treats Python scalars as constants, so each distinct Python value (2, 3 versus 5, 7) and each new dtype triggers a fresh trace. Tensor arguments, by contrast, are matched by dtype and shape, so calls with new tensor values of the same dtype and shape reuse the existing trace.
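To avoid accidental retracing, you can pin down the accepted input with an input_signature, so the function is traced exactly once and rejects anything incompatible; a minimal sketch:
@tf.function(input_signature=[tf.TensorSpec(shape=None, dtype=tf.float32)])
def doubled(x):
    print("Tracing doubled")  # printed once, at trace time
    return x * 2.0
doubled(tf.constant(1.0))
doubled(tf.constant([3.0, 4.0]))  # any float32 shape reuses the single trace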
Control Flow in Graph Mode
TensorFlow 2.x can convert most Python control flow statements (like if and while) to their graph equivalents, but there are some differences to be aware of:
@tf.function
def complex_calculation(x, y, training=True):
    if training:
        # `training` is a plain Python bool, so this `if` is resolved
        # during tracing: only the taken branch is encoded in each trace
        result = x * y + tf.reduce_sum(x)
    else:
        result = tf.matmul(x, y)
    for i in tf.range(3):
        # Graph-compatible loop: `tf.range` makes AutoGraph emit tf.while_loop;
        # cast the int32 counter before mixing it with the float32 result
        result = result + tf.cast(tf.square(i), result.dtype)
    return result
# These calls reuse the same trace
a = tf.ones((3, 3))
b = tf.ones((3, 3))
print(complex_calculation(a, b, training=True))
print(complex_calculation(a, b, training=True))
# Changing the Python value of `training` triggers a new trace with the other branch
print(complex_calculation(a, b, training=False))
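Note that the branch on training above is resolved while tracing because training is a plain Python bool. When a condition depends on a tensor value instead, AutoGraph lowers the if to a graph-level tf.cond, and both branches are captured in a single trace; a minimal sketch:
@tf.function
def clip_negative(x):
    # Tensor-dependent condition: AutoGraph emits tf.cond, so both
    # branches live inside the same graph
    if tf.reduce_sum(x) > 0:
        return x
    else:
        return tf.zeros_like(x)
print(clip_negative(tf.constant([1.0, 2.0])))    # [1. 2.]
print(clip_negative(tf.constant([-1.0, -2.0])))  # [0. 0.]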
Performance Comparison: Eager vs. Graph Mode
Let's compare the performance of Eager and Graph modes with a more realistic example:
import tensorflow as tf
import time
# Create large tensors for matrix multiplication
matrix_size = 2000
a = tf.random.normal((matrix_size, matrix_size))
b = tf.random.normal((matrix_size, matrix_size))
# Define operations in both eager and graph modes
def eager_matmul(a, b):
    return tf.matmul(a, b)

@tf.function
def graph_matmul(a, b):
    return tf.matmul(a, b)
# Warm-up
_ = eager_matmul(a, b)
_ = graph_matmul(a, b)
# Benchmarking Eager mode
eager_start = time.time()
for _ in range(10):
    _ = eager_matmul(a, b)
eager_time = time.time() - eager_start
# Benchmarking Graph mode
graph_start = time.time()
for _ in range(10):
    _ = graph_matmul(a, b)
graph_time = time.time() - graph_start
print(f"Eager execution: {eager_time:.4f} seconds")
print(f"Graph execution: {graph_time:.4f} seconds")
print(f"Speedup: {eager_time / graph_time:.2f}x")
Output (results may vary):
Eager execution: 0.8524 seconds
Graph execution: 0.4352 seconds
Speedup: 1.96x
As you can see, Graph mode can be noticeably faster for compute-intensive work. The exact speedup varies: graphs help most when a model contains many small operations the runtime can fuse and schedule together, while a single large matrix multiplication is already dominated by the kernel itself.
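If you want to push further, tf.function also accepts jit_compile=True to compile the traced graph with XLA on supported hardware; treat this as an opt-in experiment, since not every operation is XLA-compatible:
# Sketch: ask TensorFlow to JIT-compile this function with XLA
@tf.function(jit_compile=True)
def xla_matmul(a, b):
    return tf.matmul(a, b)
_ = xla_matmul(a, b)  # first call compiles; later calls reuse the compiled program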
Saving and Loading Graph Models
One of the key advantages of Graph mode is the ability to save and load models for deployment:
import tensorflow as tf
# Create a simple model
class SimpleModel(tf.Module):
    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([3, 1]), name='w')
        self.b = tf.Variable(tf.zeros([1]), name='b')

    @tf.function
    def __call__(self, x):
        return tf.matmul(x, self.w) + self.b
# Instantiate the model
model = SimpleModel()
# Create a concrete function from the model
@tf.function(input_signature=[tf.TensorSpec(shape=[None, 3], dtype=tf.float32)])
def serve_function(x):
    return model(x)
# Save the model
tf.saved_model.save(model, "simple_graph_model", signatures={"serving_default": serve_function})
# Later, we can load the model
loaded_model = tf.saved_model.load("simple_graph_model")
inference_function = loaded_model.signatures["serving_default"]
# Test inference
test_data = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=tf.float32)
result = inference_function(x=test_data)  # signatures return a dict keyed by output name
print("Prediction:", result)
Common Pitfalls in Graph Mode
When working with Graph mode, watch out for these common issues:
1. Non-TensorFlow operations
Graph mode can't trace non-TensorFlow operations. For example:
import numpy as np

@tf.function
def bad_function(x):
    # Fails during tracing: symbolic tensors have no .numpy(),
    # and NumPy calls can't be captured in the graph
    return np.mean(x.numpy())

# Instead, use the TensorFlow equivalent:
@tf.function
def good_function(x):
    return tf.reduce_mean(x)
2. Python side effects
Operations with side effects (like printing or appending to lists) execute during tracing, not during graph execution:
my_list = []

@tf.function
def append_to_list(x):
    my_list.append(x)  # Python side effect: runs only while tracing
    return x

for i in range(3):
    append_to_list(tf.constant(i))
print(my_list)  # A single symbolic tensor, not [0, 1, 2]: the append ran once, during tracing
3. Mutable Python objects
Graph functions don't track changes to Python objects:
counter = {'calls': 0}

@tf.function
def increment_counter(x):
    counter['calls'] += 1  # mutation happens at trace time only
    return x + 1

for i in range(3):
    increment_counter(tf.constant(i))
print(counter['calls'])  # 1 - the graph never re-runs the Python mutation
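When you genuinely need state or logging inside a graph, use TensorFlow-native constructs such as tf.Variable and tf.print; these are captured as graph operations and run on every call. A minimal sketch:
call_count = tf.Variable(0)

@tf.function
def graph_side_effects(x):
    call_count.assign_add(1)              # a graph op: runs on every call
    tf.print("call number:", call_count)  # graph-aware printing: also runs every call
    return x + 1

graph_side_effects(tf.constant(1))
graph_side_effects(tf.constant(2))
print(call_count.numpy())  # 2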
Real-World Application: Training a Model in Graph Mode
Here's a more realistic example showing how to train a model in Graph mode:
import tensorflow as tf
import time
# Load and preprocess MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
x_train = tf.cast(x_train, tf.float32)
y_train = tf.cast(y_train, tf.int64)
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(64)
# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
# Loss function and optimizer
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
# Define training step in graph mode
@tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = loss_fn(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
# Training loop
epochs = 3
start_time = time.time()
for epoch in range(epochs):
    epoch_loss = 0.0
    for step, (images, labels) in enumerate(train_dataset):
        loss = train_step(images, labels)
        if step % 100 == 0:
            print(f"Epoch {epoch+1}, Step {step}, Loss: {float(loss):.4f}")
        epoch_loss += float(loss)
    average_loss = epoch_loss / (step + 1)
    print(f"Epoch {epoch+1} completed, Average Loss: {average_loss:.4f}")
training_time = time.time() - start_time
print(f"Total training time: {training_time:.2f} seconds")
# Evaluate the model
test_loss = tf.keras.metrics.Mean()
test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
@tf.function
def test_step(images, labels):
    predictions = model(images, training=False)
    t_loss = loss_fn(labels, predictions)
    test_loss(t_loss)
    test_accuracy(labels, predictions)

for test_images, test_labels in tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(64):
    test_step(test_images, test_labels)
print(f"Test accuracy: {test_accuracy.result() * 100:.2f}%")
This example demonstrates training a simple neural network using Graph mode. The @tf.function decorator on train_step and test_step ensures that these operations run as optimized TensorFlow graphs.
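As an aside, Keras's built-in training loop already does this for you: model.fit wraps its train step in tf.function by default. If you ever need to debug a Keras model eagerly, you can opt out when compiling:
# Keras compiles its train step to a graph by default;
# run_eagerly=True disables that so you can step through the code
model.compile(optimizer='adam', loss=loss_fn, run_eagerly=True)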
AutoGraph: Converting Python to Graph Code
AutoGraph is the technology that powers tf.function, automatically converting Python code to TensorFlow graph code. To see what's happening under the hood:
import tensorflow as tf
@tf.function
def my_function(x):
    if tf.reduce_sum(x) > 0:
        return x * x
    else:
        return x + x
# See the generated graph code
print(tf.autograph.to_code(my_function.python_function))
This output shows how Python control flow is converted to TensorFlow graph operations.
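You can also trace the function for a specific input spec and list the operations that actually ended up in the graph; a small sketch (the exact op names vary by TensorFlow version):
# Trace for float32 inputs of any shape, then inspect the resulting graph
concrete = my_function.get_concrete_function(tf.TensorSpec(shape=None, dtype=tf.float32))
for op in concrete.graph.get_operations():
    print(op.name, op.type)  # the lowered conditional shows up as an If/StatelessIf op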
Summary
TensorFlow Graph mode remains a powerful feature that offers significant performance benefits, especially for deployment and production environments. Key points to remember:
- Use @tf.function to convert Python functions to TensorFlow graphs
- Graph mode offers better performance for computationally intensive operations
- Understanding tracing is essential for using Graph mode effectively
- Be aware of the differences between Eager and Graph execution
- Graph mode is important for model deployment and distributed training
By mastering Graph mode, you can create TensorFlow models that are not only easier to deploy but also execute more efficiently.
Additional Resources
- TensorFlow Guide: Introduction to Graphs and Functions
- TensorFlow Guide: Better Performance with tf.function
- TensorFlow Graph Mode in Production
Exercises
- Convert a simple neural network training loop to use Graph mode and compare its performance with Eager execution.
- Create a model that uses both Graph mode and Eager execution in different parts. Which operations benefit most from Graph mode?
- Create a custom training loop using Graph mode that includes computing custom metrics and logging.
- Save and load a model created with Graph mode, then deploy it using TensorFlow Serving.
- Experiment with using Graph mode in a distributed training setting across multiple GPUs.