
TensorFlow Serving

Introduction

TensorFlow Serving is a specialized system designed to deploy machine learning models in production environments. It's a key component of TensorFlow's ecosystem that bridges the gap between model development and real-world applications. When you've trained a brilliant model, you need a reliable way to make it available to users or other applications - that's exactly what TensorFlow Serving provides.

In this guide, we'll explore how TensorFlow Serving allows you to:

  • Serve multiple models or versions simultaneously
  • Deploy models without server downtime
  • Scale to handle large numbers of requests
  • Integrate with containerization tools like Docker
  • Serve predictions with high performance and low latency

Whether you're deploying your first simple model or managing complex production systems, understanding TensorFlow Serving is an essential skill for modern machine learning workflows.

What is TensorFlow Serving?

TensorFlow Serving is a production-ready serving system specifically designed for machine learning models. It allows you to deploy new algorithms and experiments while maintaining the same server architecture and APIs.

Key Features

  • Version Management: Maintain multiple versions of your model in parallel
  • High Performance: Optimized for production environments with high throughput requirements
  • Flexible Architecture: Easily extensible to serve different types of models and data
  • Standard APIs: Provides both REST and gRPC interfaces for model serving
  • Easy Integration: Works seamlessly with TensorFlow models and can be containerized

Installing TensorFlow Serving

Let's start by installing TensorFlow Serving. There are several ways to install it, but we'll cover the most common approaches.

Using Docker

Docker is the simplest way to get started with TensorFlow Serving:

bash
# Pull the TensorFlow Serving Docker image
docker pull tensorflow/serving

# Start a TensorFlow Serving container
docker run -p 8501:8501 \
    --name tensorflow_serving \
    --mount type=bind,source=/path/to/your/models/directory,target=/models \
    -e MODEL_NAME=your_model_name \
    tensorflow/serving
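
Once the container is up, you can verify that the model loaded by querying the model status endpoint on the REST port. A minimal sketch, assuming the MODEL_NAME from the command above:

python
import requests

# Query TensorFlow Serving's model status endpoint (REST port 8501).
# Replace "your_model_name" with the MODEL_NAME passed to the container.
response = requests.get("http://localhost:8501/v1/models/your_model_name")
print(response.json())  # Lists loaded versions and their state (e.g. AVAILABLE)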

Using apt (for Ubuntu)

bash
# Add TensorFlow Serving distribution URI as a package source
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list

# Add tensorflow serving repository key
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -

# Update packages and install TensorFlow Serving
sudo apt-get update
sudo apt-get install tensorflow-model-server

Preparing a Model for Serving

Before serving a model, you need to export it in the SavedModel format, which is TensorFlow's recommended format for model deployment.

Exporting a SavedModel

Here's a basic example of training a simple model and saving it in the SavedModel format:

python
import tensorflow as tf
import numpy as np
import os

# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mse')

# Train with dummy data
x_train = np.random.random((100, 10))
y_train = np.random.random((100, 1))
model.fit(x_train, y_train, epochs=5, verbose=1)

# Define the export path with a version number
export_path = os.path.join("models", "simple_model", "1")

# Save the model
tf.keras.models.save_model(
    model,
    export_path,
    overwrite=True,
    include_optimizer=True,
    save_format=None,
    signatures=None,
    options=None
)

print(f"Model saved to: {export_path}")

The output would look something like:

Epoch 1/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2643
Epoch 2/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2513
Epoch 3/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2414
Epoch 4/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2329
Epoch 5/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2247
Model saved to: models/simple_model/1
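
Before pointing TensorFlow Serving at this directory, it can help to confirm what the exported serving signature looks like; the input and output tensor names it reports are the ones you'll need for the gRPC example later. A quick sketch using the SavedModel loading API:

python
import tensorflow as tf

# Load the exported SavedModel and inspect its default serving signature
loaded = tf.saved_model.load("models/simple_model/1")
serving_fn = loaded.signatures["serving_default"]
print(serving_fn.structured_input_signature)  # input tensor names and shapes
print(serving_fn.structured_outputs)          # output tensor names and shapes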

Model Directory Structure

TensorFlow Serving expects a specific directory structure for models:

models/
└── simple_model/              # Model name
    ├── 1/                     # Version 1 of the model
    │   ├── saved_model.pb
    │   └── variables/
    └── 2/                     # Version 2 of the model
        ├── saved_model.pb
        └── variables/

Running TensorFlow Serving

Starting the Server

After exporting your model, you can start TensorFlow Serving to make it available:

bash
# Using the tensorflow_model_server command
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_name=simple_model --model_base_path=/absolute/path/to/models/simple_model

If you're using Docker:

bash
docker run -p 8501:8501 \
    --name tensorflow_serving \
    --mount type=bind,source=/absolute/path/to/models,target=/models \
    -e MODEL_NAME=simple_model \
    -e MODEL_BASE_PATH=/models \
    tensorflow/serving

You should see output indicating that the server has started, and your model has been loaded successfully.

Making Predictions with TensorFlow Serving

TensorFlow Serving provides two APIs for predictions:

  1. REST API: Easier to use, primarily for lower throughput applications
  2. gRPC API: Higher performance, ideal for production systems with high throughput

Using the REST API

Let's make a prediction using the REST API:

python
import json
import numpy as np
import requests

# Create sample data similar to the training data
data = np.random.random((2, 10)).tolist()

# Create the JSON payload
payload = {
    "instances": data
}

# Send the request to the server
response = requests.post(
    "http://localhost:8501/v1/models/simple_model:predict",
    data=json.dumps(payload),
    headers={"content-type": "application/json"}
)

# Process the response
if response.status_code == 200:
    predictions = response.json()["predictions"]
    print(f"Predictions: {predictions}")
else:
    print(f"Error: {response.text}")

Sample output:

Predictions: [[0.4206941], [-0.1234567]]
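
If you're unsure what inputs a served model expects, the REST API also exposes a model metadata endpoint that returns the serving signature, including input and output tensor names. A small sketch:

python
import requests

# Fetch the model's metadata, which includes its signature definition
# (input/output tensor names, dtypes, and shapes).
metadata = requests.get(
    "http://localhost:8501/v1/models/simple_model/metadata"
).json()
print(metadata["metadata"]["signature_def"])

The tensor names reported here are the ones the gRPC client below must use.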

Using the gRPC API

For higher throughput applications, you can use the gRPC API, which requires more setup but offers better performance:

python
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# Create a gRPC channel
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Create prediction request
request = predict_pb2.PredictRequest()
request.model_spec.name = "simple_model"
request.model_spec.signature_name = "serving_default"

# Prepare sample data
data = np.random.random((2, 10)).astype(np.float32)

# Add the input tensor
tensor_proto = tf.make_tensor_proto(data)
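
# "dense_input" and "dense_2" (below) must match the tensor names in the model's
# serving signature; confirm them via the metadata endpoint shown earlier.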
request.inputs["dense_input"].CopyFrom(tensor_proto)

# Send request
response = stub.Predict(request, 10.0) # 10 second timeout

# Process response
output = tf.make_ndarray(response.outputs["dense_2"])
print(f"Predictions: {output}")

Sample output:

Predictions: [[ 0.37128112]
[-0.09876545]]

Model Versioning and Updates

One of TensorFlow Serving's key features is its ability to handle model versions seamlessly. This allows you to deploy new model versions without interrupting service.

Adding a New Model Version

Let's add a new version of our model:

python
import tensorflow as tf
import numpy as np
import os

# Create an improved model (more layers)
improved_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(128, activation='relu'),  # Extra layer
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

improved_model.compile(optimizer='adam', loss='mse')

# Train with the same dummy data
x_train = np.random.random((100, 10))
y_train = np.random.random((100, 1))
improved_model.fit(x_train, y_train, epochs=5, verbose=1)

# Define the export path with a new version number
export_path = os.path.join("models", "simple_model", "2") # Version 2

# Save the model
tf.keras.models.save_model(
    improved_model,
    export_path,
    overwrite=True,
    include_optimizer=True,
    save_format=None,
    signatures=None,
    options=None
)

print(f"Improved model saved to: {export_path}")

TensorFlow Serving polls the model base path for new version directories, so it will automatically pick up and serve version 2 as long as --model_base_path points to the parent directory containing the numbered version folders (models/simple_model in this example).
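
Individual versions also remain addressable while they are loaded: the REST API accepts a version number in the URL. A minimal sketch for querying version 2 explicitly (assuming that version is currently loaded):

python
import json
import numpy as np
import requests

data = np.random.random((2, 10)).tolist()

# Target a specific model version by including it in the URL
response = requests.post(
    "http://localhost:8501/v1/models/simple_model/versions/2:predict",
    data=json.dumps({"instances": data}),
    headers={"content-type": "application/json"}
)
print(response.json())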

Version Policy

You can control which version of the model is served using version policies:

  1. Latest: Always serves the latest version (highest number)
  2. All: Serves all available versions
  3. Specific: Serves specific versions that you define

For example, to serve only the latest version, set the policy in a model config file and point the server at it with the --model_config_file flag:

bash
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_config_file=/path/to/models.config
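
A minimal models.config sketch for this setup (the name and base_path mirror the example above and should be adjusted to your paths):

model_config_list {
  config {
    name: "simple_model"
    base_path: "/path/to/models/simple_model"
    model_platform: "tensorflow"
    model_version_policy {
      latest {
        num_versions: 1
      }
    }
  }
}

To serve every available version, use all {} in place of the latest block, or pin versions explicitly with specific { versions: 1 }.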

Real-world Application: Image Classification API

Let's implement a complete example of serving an image classification model with TensorFlow Serving:

Step 1: Export a Pre-trained MobileNet Model

python
import tensorflow as tf
import os

# Load a pre-trained MobileNetV2 model
base_model = tf.keras.applications.MobileNetV2(
    weights="imagenet",
    input_shape=(224, 224, 3),
    include_top=True
)

# Define a preprocessing function to bake into the serving graph
def preprocess(input_image):
    return tf.keras.applications.mobilenet_v2.preprocess_input(input_image)

# Create a serving model with pre-processing included
inputs = tf.keras.layers.Input(shape=(224, 224, 3), name='image')
x = tf.keras.layers.Lambda(preprocess)(inputs)
outputs = base_model(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Export the model
export_path = os.path.join("models", "mobilenet", "1")
tf.keras.models.save_model(
    model,
    export_path,
    overwrite=True,
    include_optimizer=False
)

print(f"MobileNet model saved to: {export_path}")

Step 2: Start TensorFlow Serving with the MobileNet Model

bash
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_name=mobilenet --model_base_path=/absolute/path/to/models/mobilenet

Step 3: Create a Client Application to Use the Model

python
import json
import numpy as np
import requests
from PIL import Image

def load_and_preprocess_image(image_path):
    # Load the image, force RGB (drops any alpha channel), and resize
    image = Image.open(image_path).convert("RGB").resize((224, 224))
    image_array = np.array(image)
    # Add batch dimension
    image_array = np.expand_dims(image_array, 0)
    return image_array

def get_prediction(image_array):
    # Create the JSON payload
    payload = {
        "instances": image_array.tolist()
    }

    # Send the request to the server
    response = requests.post(
        "http://localhost:8501/v1/models/mobilenet:predict",
        data=json.dumps(payload),
        headers={"content-type": "application/json"}
    )

    # Process the response
    if response.status_code == 200:
        predictions = response.json()["predictions"][0]
        # Get the top 5 predictions
        top5_indices = np.argsort(predictions)[-5:][::-1]

        # Load ImageNet class labels
        with open("imagenet_classes.txt", "r") as f:
            class_names = [line.strip() for line in f]

        # Return top 5 predictions with class names
        return [(class_names[i], predictions[i]) for i in top5_indices]
    else:
        print(f"Error: {response.text}")
        return None

# Example usage
if __name__ == "__main__":
    image_path = "cat.jpg"  # Replace with your test image
    image_array = load_and_preprocess_image(image_path)
    predictions = get_prediction(image_array)

    if predictions:
        print("Top 5 predictions:")
        for class_name, score in predictions:
            print(f"{class_name}: {score:.4f}")

Example output:

Top 5 predictions:
Egyptian cat: 0.8764
tabby, tabby cat: 0.0912
tiger cat: 0.0158
lynx, catamount: 0.0034
Persian cat: 0.0012
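
If you'd rather not maintain an imagenet_classes.txt file, Keras ships the ImageNet label mapping with its applications module. A brief sketch of decoding the same response with it:

python
import numpy as np
import tensorflow as tf

# "predictions" is the raw 1000-class output returned by TensorFlow Serving.
# decode_predictions maps class indices to human-readable ImageNet labels.
def decode_top5(predictions):
    scores = np.array(predictions, dtype=np.float32).reshape(1, 1000)
    return tf.keras.applications.mobilenet_v2.decode_predictions(scores, top=5)[0]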

Advanced TensorFlow Serving Features

Model Batching

TensorFlow Serving can batch incoming requests for better performance. You can configure batching options when starting the server:

bash
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_name=simple_model --model_base_path=/path/to/models/simple_model \
    --enable_batching=true \
    --batching_parameters_file=batch_config.txt

Example batch_config.txt:

max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
max_enqueued_batches { value: 100 }
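
Batching only pays off when requests arrive close together, so a quick way to see its effect is to fire many concurrent REST requests and compare latency with batching enabled and disabled. A rough sketch using only the standard library and requests:

python
import concurrent.futures
import json
import time

import numpy as np
import requests

URL = "http://localhost:8501/v1/models/simple_model:predict"

def single_request(_):
    payload = {"instances": np.random.random((1, 10)).tolist()}
    return requests.post(URL, data=json.dumps(payload)).status_code

# Send 200 requests from 16 threads so the server sees overlapping traffic
start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    statuses = list(pool.map(single_request, range(200)))
print(f"{statuses.count(200)}/200 succeeded in {time.time() - start:.2f}s")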

Monitoring with Prometheus

You can configure TensorFlow Serving to expose metrics for Prometheus:

bash
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_name=simple_model --model_base_path=/path/to/models/simple_model \
    --monitoring_config_file=monitoring_config.txt

Example monitoring_config.txt:

prometheus_config {
  enable: true
  path: "/monitoring/prometheus/metrics"
}
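
With this configuration, the metrics are served over the REST API port at the path you chose, so a quick request confirms the endpoint is live. A small sketch assuming the defaults above:

python
import requests

# Prometheus-format metrics are exposed on the REST API port (8501 here)
# at the path configured in monitoring_config.txt.
metrics = requests.get("http://localhost:8501/monitoring/prometheus/metrics")
print(metrics.text[:500])  # Print the first few metric lines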

Secure Connections with SSL/TLS

For production deployments, you can enable secure connections:

bash
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_name=simple_model --model_base_path=/path/to/models/simple_model \
    --ssl_config_file=ssl_config.txt

Example ssl_config.txt:

server_key: "/path/to/server.key"
server_cert: "/path/to/server.crt"
custom_ca: "/path/to/ca.crt"

Best Practices for TensorFlow Serving

  1. Version your models: Follow semantic versioning for models to track changes
  2. Monitor server performance: Set up logging and metrics to track model performance
  3. Test before deployment: Validate model performance and API integration before production
  4. Use Docker: Containerize your serving environment for consistent deployment
  5. Implement A/B testing: Use model versioning to compare different model versions
  6. Keep input processing consistent: Ensure preprocessing in training matches serving
  7. Set appropriate timeouts: Configure timeouts based on your model's complexity
  8. Use model warmup: Prime your model with sample data for faster initial predictions (see the sketch after this list)
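
TensorFlow Serving reads warmup requests from a TFRecord file of PredictionLog records stored in the version directory's assets.extra/ folder and replays them when the version is loaded. Here is a minimal sketch for the simple_model example; the input name dense_input is an assumption and should match your model's serving signature:

python
import os
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

# The warmup file lives inside the model version directory under assets.extra/
warmup_path = "models/simple_model/1/assets.extra/tf_serving_warmup_requests"
os.makedirs(os.path.dirname(warmup_path), exist_ok=True)

# Build a representative prediction request for the serving signature
request = predict_pb2.PredictRequest(
    model_spec=model_pb2.ModelSpec(name="simple_model",
                                   signature_name="serving_default"),
    inputs={"dense_input": tf.make_tensor_proto(
        np.random.random((1, 10)).astype(np.float32))}
)
log = prediction_log_pb2.PredictionLog(
    predict_log=prediction_log_pb2.PredictLog(request=request)
)

# Write a few warmup records; TensorFlow Serving replays them at load time
with tf.io.TFRecordWriter(warmup_path) as writer:
    for _ in range(10):
        writer.write(log.SerializeToString())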

Troubleshooting Common Issues

Model Not Loading

If your model doesn't load, check:

  • Correct path to the SavedModel directory
  • Model directory structure follows the expected format
  • Model files have appropriate permissions
  • Model format is compatible with your TensorFlow Serving version

High Latency

If you experience high latency:

  • Enable batching
  • Check if your model is complex and might need optimization
  • Consider using TensorRT for optimization
  • Adjust hardware resources (CPU, memory) allocated to the server

Connection Issues

For connection problems:

  • Verify ports are correctly mapped in Docker (if using)
  • Check firewall settings
  • Validate network configuration
  • Verify the correct URL in client applications

Summary

TensorFlow Serving is a powerful tool that bridges the gap between machine learning model development and production deployment. We've covered:

  • How to install and set up TensorFlow Serving
  • Exporting models in the SavedModel format
  • Serving models via REST and gRPC APIs
  • Managing multiple model versions
  • Building a real-world image classification API
  • Advanced features and best practices

This knowledge provides a solid foundation for deploying your TensorFlow models in production environments with scalability, high performance, and version management.

Additional Resources

Exercises

  1. Export and serve a simple regression model you've trained
  2. Implement a text classification API using TensorFlow Serving
  3. Create a client application that can switch between different model versions
  4. Set up TensorFlow Serving with batching and monitor performance improvements
  5. Deploy a TensorFlow Serving model on a cloud provider (AWS, GCP, or Azure)

