
TensorFlow Serving

Introduction

TensorFlow Serving is a specialized system designed to deploy machine learning models in production environments. It's a key component of TensorFlow's ecosystem that bridges the gap between model development and real-world applications. When you've trained a brilliant model, you need a reliable way to make it available to users or other applications - that's exactly what TensorFlow Serving provides.

In this guide, we'll explore how TensorFlow Serving allows you to:

  • Serve multiple models or versions simultaneously
  • Deploy models without server downtime
  • Scale to handle large numbers of requests
  • Integrate with containerization tools like Docker
  • Serve predictions with high performance and low latency

Whether you're deploying your first simple model or managing complex production systems, understanding TensorFlow Serving is an essential skill for modern machine learning workflows.

What is TensorFlow Serving?

TensorFlow Serving is a production-ready serving system specifically designed for machine learning models. It allows you to deploy new algorithms and experiments while maintaining the same server architecture and APIs.

Key Features

  • Version Management: Maintain multiple versions of your model in parallel
  • High Performance: Optimized for production environments with high throughput requirements
  • Flexible Architecture: Easily extensible to serve different types of models and data
  • Standard APIs: Provides both REST and gRPC interfaces for model serving
  • Easy Integration: Works seamlessly with TensorFlow models and can be containerized

Installing TensorFlow Serving

Let's start by installing TensorFlow Serving. There are several ways to install it, but we'll cover the most common approaches.

Using Docker

Docker is the simplest way to get started with TensorFlow Serving:

bash
# Pull the TensorFlow Serving Docker image
docker pull tensorflow/serving

# Start a TensorFlow Serving container
docker run -p 8501:8501 \
    --name tensorflow_serving \
    --mount type=bind,source=/path/to/your/models/directory,target=/models \
    -e MODEL_NAME=your_model_name \
    tensorflow/serving
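
Once the container is up, you can verify that the model loaded by querying the model status endpoint on the REST port. A minimal sketch, assuming the MODEL_NAME from the command above:

python
import requests

# Query TensorFlow Serving's model status endpoint (REST port 8501).
# Replace "your_model_name" with the MODEL_NAME passed to the container.
response = requests.get("http://localhost:8501/v1/models/your_model_name")
print(response.json())  # Lists loaded versions and their state (e.g. AVAILABLE)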

Using apt (for Ubuntu)

bash
# Add TensorFlow Serving distribution URI as a package source
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list

# Add tensorflow serving repository key
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -

# Update packages and install TensorFlow Serving
sudo apt-get update
sudo apt-get install tensorflow-model-server

Preparing a Model for Serving

Before serving a model, you need to export it in the SavedModel format, which is TensorFlow's recommended format for model deployment.

Exporting a SavedModel

Here's a basic example of training a simple model and saving it in the SavedModel format:

python
import tensorflow as tf
import numpy as np
import os

# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mse')

# Train with dummy data
x_train = np.random.random((100, 10))
y_train = np.random.random((100, 1))
model.fit(x_train, y_train, epochs=5, verbose=1)

# Define the export path with a version number
export_path = os.path.join("models", "simple_model", "1")

# Save the model
tf.keras.models.save_model(
    model,
    export_path,
    overwrite=True,
    include_optimizer=True,
    save_format=None,
    signatures=None,
    options=None
)

print(f"Model saved to: {export_path}")

The output would look something like:

Epoch 1/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2643
Epoch 2/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2513
Epoch 3/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2414
Epoch 4/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2329
Epoch 5/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2247
Model saved to: models/simple_model/1
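
Before pointing TensorFlow Serving at this directory, it can help to confirm what the exported serving signature looks like; the input and output tensor names it reports are the ones you'll need for the gRPC example later. A quick sketch using the SavedModel loading API:

python
import tensorflow as tf

# Load the exported SavedModel and inspect its default serving signature
loaded = tf.saved_model.load("models/simple_model/1")
serving_fn = loaded.signatures["serving_default"]
print(serving_fn.structured_input_signature)  # input tensor names and shapes
print(serving_fn.structured_outputs)          # output tensor names and shapes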

Model Directory Structure

TensorFlow Serving expects a specific directory structure for models:

models/
└── simple_model/              # Model name
    ├── 1/                     # Version 1 of the model
    │   ├── saved_model.pb
    │   └── variables/
    └── 2/                     # Version 2 of the model
        ├── saved_model.pb
        └── variables/

Running TensorFlow Serving

Starting the Server

After exporting your model, you can start TensorFlow Serving to make it available:

bash
# Using the tensorflow_model_server command
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_name=simple_model --model_base_path=/absolute/path/to/models/simple_model

If you're using Docker:

bash
docker run -p 8501:8501 \
    --name tensorflow_serving \
    --mount type=bind,source=/absolute/path/to/models,target=/models \
    -e MODEL_NAME=simple_model \
    -e MODEL_BASE_PATH=/models \
    tensorflow/serving

You should see output indicating that the server has started, and your model has been loaded successfully.

Making Predictions with TensorFlow Serving

TensorFlow Serving provides two APIs for predictions:

  1. REST API: Easier to use, primarily for lower throughput applications
  2. gRPC API: Higher performance, ideal for production systems with high throughput

Using the REST API

Let's make a prediction using the REST API:

python
import json
import numpy as np
import requests

# Create sample data similar to the training data
data = np.random.random((2, 10)).tolist()

# Create the JSON payload
payload = {
    "instances": data
}

# Send the request to the server
response = requests.post(
    "http://localhost:8501/v1/models/simple_model:predict",
    data=json.dumps(payload),
    headers={"content-type": "application/json"}
)

# Process the response
if response.status_code == 200:
    predictions = response.json()["predictions"]
    print(f"Predictions: {predictions}")
else:
    print(f"Error: {response.text}")

Sample output:

Predictions: [[0.4206941], [-0.1234567]]
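
If you're unsure what inputs a served model expects, the REST API also exposes a model metadata endpoint that returns the serving signature, including input and output tensor names. A small sketch:

python
import requests

# Fetch the model's metadata, which includes its signature definition
# (input/output tensor names, dtypes, and shapes).
metadata = requests.get(
    "http://localhost:8501/v1/models/simple_model/metadata"
).json()
print(metadata["metadata"]["signature_def"])

The tensor names reported here are the ones the gRPC client below must use.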

Using the gRPC API

For higher throughput applications, you can use the gRPC API, which requires more setup but offers better performance:

python
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# Create a gRPC channel
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Create prediction request
request = predict_pb2.PredictRequest()
request.model_spec.name = "simple_model"
request.model_spec.signature_name = "serving_default"

# Prepare sample data
data = np.random.random((2, 10)).astype(np.float32)

# Add the input tensor
tensor_proto = tf.make_tensor_proto(data)
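
# "dense_input" and "dense_2" (below) must match the tensor names in the model's
# serving signature; confirm them via the metadata endpoint shown earlier.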
request.inputs["dense_input"].CopyFrom(tensor_proto)

# Send request
response = stub.Predict(request, 10.0) # 10 second timeout

# Process response
output = tf.make_ndarray(response.outputs["dense_2"])
print(f"Predictions: {output}")

Sample output:

Predictions: [[ 0.37128112]
[-0.09876545]]

Model Versioning and Updates

One of TensorFlow Serving's key features is its ability to handle model versions seamlessly. This allows you to deploy new model versions without interrupting service.

Adding a New Model Version

Let's add a new version of our model:

python
import tensorflow as tf
import numpy as np
import os

# Create an improved model (more layers)
improved_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(128, activation='relu'),  # Extra layer
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

improved_model.compile(optimizer='adam', loss='mse')

# Train with the same dummy data
x_train = np.random.random((100, 10))
y_train = np.random.random((100, 1))
improved_model.fit(x_train, y_train, epochs=5, verbose=1)

# Define the export path with a new version number
export_path = os.path.join("models", "simple_model", "2") # Version 2

# Save the model
tf.keras.models.save_model(
    improved_model,
    export_path,
    overwrite=True,
    include_optimizer=True,
    save_format=None,
    signatures=None,
    options=None
)

print(f"Improved model saved to: {export_path}")

TensorFlow Serving polls the model base path for new version directories, so it will automatically pick up and serve version 2 as long as --model_base_path points to the parent directory containing the numbered version folders (models/simple_model in this example).
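
Individual versions also remain addressable while they are loaded: the REST API accepts a version number in the URL. A minimal sketch for querying version 2 explicitly (assuming that version is currently loaded):

python
import json
import numpy as np
import requests

data = np.random.random((2, 10)).tolist()

# Target a specific model version by including it in the URL
response = requests.post(
    "http://localhost:8501/v1/models/simple_model/versions/2:predict",
    data=json.dumps({"instances": data}),
    headers={"content-type": "application/json"}
)
print(response.json())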

Version Policy

You can control which version of the model is served using version policies:

  1. Latest: Always serves the latest version (highest number)
  2. All: Serves all available versions
  3. Specific: Serves specific versions that you define

For example, to serve only the latest version, set the policy in a model config file and point the server at it with the --model_config_file flag:

bash
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_config_file=/path/to/models.config
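
A minimal models.config sketch for this setup (the name and base_path mirror the example above and should be adjusted to your paths):

model_config_list {
  config {
    name: "simple_model"
    base_path: "/path/to/models/simple_model"
    model_platform: "tensorflow"
    model_version_policy {
      latest {
        num_versions: 1
      }
    }
  }
}

To serve every available version, use all {} in place of the latest block, or pin versions explicitly with specific { versions: 1 }.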

Real-world Application: Image Classification API

Let's implement a complete example of serving an image classification model with TensorFlow Serving:

Step 1: Export a Pre-trained MobileNet Model

python
import tensorflow as tf
import os

# Load a pre-trained MobileNetV2 model
base_model = tf.keras.applications.MobileNetV2(
    weights="imagenet",
    input_shape=(224, 224, 3),
    include_top=True
)

# Define a preprocessing function to bake into the serving graph
def preprocess(input_image):
    return tf.keras.applications.mobilenet_v2.preprocess_input(input_image)

# Create a serving model with pre-processing included
inputs = tf.keras.layers.Input(shape=(224, 224, 3), name='image')
x = tf.keras.layers.Lambda(preprocess)(inputs)
outputs = base_model(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Export the model
export_path = os.path.join("models", "mobilenet", "1")
tf.keras.models.save_model(
    model,
    export_path,
    overwrite=True,
    include_optimizer=False
)

print(f"MobileNet model saved to: {export_path}")

Step 2: Start TensorFlow Serving with the MobileNet Model

bash
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_name=mobilenet --model_base_path=/absolute/path/to/models/mobilenet

Step 3: Create a Client Application to Use the Model

python
import json
import numpy as np
import requests
from PIL import Image

def load_and_preprocess_image(image_path):
    # Load the image, force RGB (drops any alpha channel), and resize
    image = Image.open(image_path).convert("RGB").resize((224, 224))
    image_array = np.array(image)
    # Add batch dimension
    image_array = np.expand_dims(image_array, 0)
    return image_array

def get_prediction(image_array):
    # Create the JSON payload
    payload = {
        "instances": image_array.tolist()
    }

    # Send the request to the server
    response = requests.post(
        "http://localhost:8501/v1/models/mobilenet:predict",
        data=json.dumps(payload),
        headers={"content-type": "application/json"}
    )

    # Process the response
    if response.status_code == 200:
        predictions = response.json()["predictions"][0]
        # Get the top 5 predictions
        top5_indices = np.argsort(predictions)[-5:][::-1]

        # Load ImageNet class labels
        with open("imagenet_classes.txt", "r") as f:
            class_names = [line.strip() for line in f]

        # Return top 5 predictions with class names
        return [(class_names[i], predictions[i]) for i in top5_indices]
    else:
        print(f"Error: {response.text}")
        return None

# Example usage
if __name__ == "__main__":
    image_path = "cat.jpg"  # Replace with your test image
    image_array = load_and_preprocess_image(image_path)
    predictions = get_prediction(image_array)

    if predictions:
        print("Top 5 predictions:")
        for class_name, score in predictions:
            print(f"{class_name}: {score:.4f}")

Example output:

Top 5 predictions:
Egyptian cat: 0.8764
tabby, tabby cat: 0.0912
tiger cat: 0.0158
lynx, catamount: 0.0034
Persian cat: 0.0012
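
If you'd rather not maintain an imagenet_classes.txt file, Keras ships the ImageNet label mapping with its applications module. A brief sketch of decoding the same response with it:

python
import numpy as np
import tensorflow as tf

# "predictions" is the raw 1000-class output returned by TensorFlow Serving.
# decode_predictions maps class indices to human-readable ImageNet labels.
def decode_top5(predictions):
    scores = np.array(predictions, dtype=np.float32).reshape(1, 1000)
    return tf.keras.applications.mobilenet_v2.decode_predictions(scores, top=5)[0]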

Advanced TensorFlow Serving Features

Model Batching

TensorFlow Serving can batch incoming requests for better performance. You can configure batching options when starting the server:

bash
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_name=simple_model --model_base_path=/path/to/models/simple_model \
    --enable_batching=true \
    --batching_parameters_file=batch_config.txt

Example batch_config.txt:

max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
max_enqueued_batches { value: 100 }
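
Batching only pays off when requests arrive close together, so a quick way to see its effect is to fire many concurrent REST requests and compare latency with batching enabled and disabled. A rough sketch using only the standard library and requests:

python
import concurrent.futures
import json
import time

import numpy as np
import requests

URL = "http://localhost:8501/v1/models/simple_model:predict"

def single_request(_):
    payload = {"instances": np.random.random((1, 10)).tolist()}
    return requests.post(URL, data=json.dumps(payload)).status_code

# Send 200 requests from 16 threads so the server sees overlapping traffic
start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    statuses = list(pool.map(single_request, range(200)))
print(f"{statuses.count(200)}/200 succeeded in {time.time() - start:.2f}s")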

Monitoring with Prometheus

You can configure TensorFlow Serving to expose metrics for Prometheus:

bash
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_name=simple_model --model_base_path=/path/to/models/simple_model \
    --monitoring_config_file=monitoring_config.txt

Example monitoring_config.txt:

prometheus_config {
  enable: true
  path: "/monitoring/prometheus/metrics"
}
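
With this configuration, the metrics are served over the REST API port at the path you chose, so a quick request confirms the endpoint is live. A small sketch assuming the defaults above:

python
import requests

# Prometheus-format metrics are exposed on the REST API port (8501 here)
# at the path configured in monitoring_config.txt.
metrics = requests.get("http://localhost:8501/monitoring/prometheus/metrics")
print(metrics.text[:500])  # Print the first few metric lines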

Secure Connections with SSL/TLS

For production deployments, you can enable secure connections:

bash
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_name=simple_model --model_base_path=/path/to/models/simple_model \
    --ssl_config_file=ssl_config.txt

Example ssl_config.txt:

server_key: "/path/to/server.key"
server_cert: "/path/to/server.crt"
custom_ca: "/path/to/ca.crt"

Best Practices for TensorFlow Serving

  1. Version your models: Follow semantic versioning for models to track changes
  2. Monitor server performance: Set up logging and metrics to track model performance
  3. Test before deployment: Validate model performance and API integration before production
  4. Use Docker: Containerize your serving environment for consistent deployment
  5. Implement A/B testing: Use model versioning to compare different model versions
  6. Keep input processing consistent: Ensure preprocessing in training matches serving
  7. Set appropriate timeouts: Configure timeouts based on your model's complexity
  8. Use model warmup: Prime your model with sample data for faster initial predictions (see the sketch after this list)
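
TensorFlow Serving reads warmup requests from a TFRecord file of PredictionLog records stored in the version directory's assets.extra/ folder and replays them when the version is loaded. Here is a minimal sketch for the simple_model example; the input name dense_input is an assumption and should match your model's serving signature:

python
import os
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

# The warmup file lives inside the model version directory under assets.extra/
warmup_path = "models/simple_model/1/assets.extra/tf_serving_warmup_requests"
os.makedirs(os.path.dirname(warmup_path), exist_ok=True)

# Build a representative prediction request for the serving signature
request = predict_pb2.PredictRequest(
    model_spec=model_pb2.ModelSpec(name="simple_model",
                                   signature_name="serving_default"),
    inputs={"dense_input": tf.make_tensor_proto(
        np.random.random((1, 10)).astype(np.float32))}
)
log = prediction_log_pb2.PredictionLog(
    predict_log=prediction_log_pb2.PredictLog(request=request)
)

# Write a few warmup records; TensorFlow Serving replays them at load time
with tf.io.TFRecordWriter(warmup_path) as writer:
    for _ in range(10):
        writer.write(log.SerializeToString())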

Troubleshooting Common Issues

Model Not Loading

If your model doesn't load, check:

  • Correct path to the SavedModel directory
  • Model directory structure follows the expected format
  • Model files have appropriate permissions
  • Model format is compatible with your TensorFlow Serving version

High Latency

If you experience high latency:

  • Enable batching
  • Check if your model is complex and might need optimization
  • Consider using TensorRT for optimization
  • Adjust hardware resources (CPU, memory) allocated to the server

Connection Issues

For connection problems:

  • Verify ports are correctly mapped in Docker (if using)
  • Check firewall settings
  • Validate network configuration
  • Verify the correct URL in client applications

Summary

TensorFlow Serving is a powerful tool that bridges the gap between machine learning model development and production deployment. We've covered:

  • How to install and set up TensorFlow Serving
  • Exporting models in the SavedModel format
  • Serving models via REST and gRPC APIs
  • Managing multiple model versions
  • Building a real-world image classification API
  • Advanced features and best practices

This knowledge provides a solid foundation for deploying your TensorFlow models in production environments with scalability, high performance, and version management.

Additional Resources

Exercises

  1. Export and serve a simple regression model you've trained
  2. Implement a text classification API using TensorFlow Serving
  3. Create a client application that can switch between different model versions
  4. Set up TensorFlow Serving with batching and monitor performance improvements
  5. Deploy a TensorFlow Serving model on a cloud provider (AWS, GCP, or Azure)

