TensorFlow Serving
Introduction
TensorFlow Serving is a specialized system for deploying machine learning models in production environments. It is a key component of the TensorFlow ecosystem that bridges the gap between model development and real-world applications. Once you have trained a model, you need a reliable way to make it available to users or other applications, and that is exactly what TensorFlow Serving provides.
In this guide, we'll explore how TensorFlow Serving allows you to:
- Serve multiple models or versions simultaneously
- Deploy models without server downtime
- Scale to handle large numbers of requests
- Integrate with containerization tools like Docker
- Serve predictions with high performance and low latency
Whether you're deploying your first simple model or managing complex production systems, understanding TensorFlow Serving is an essential skill for modern machine learning workflows.
What is TensorFlow Serving?
TensorFlow Serving is a production-ready serving system specifically designed for machine learning models. It allows you to deploy new algorithms and experiments while maintaining the same server architecture and APIs.
Key Features
- Version Management: Maintain multiple versions of your model in parallel
- High Performance: Optimized for production environments with high throughput requirements
- Flexible Architecture: Easily extensible to serve different types of models and data
- Standard APIs: Provides both REST and gRPC interfaces for model serving
- Easy Integration: Works seamlessly with TensorFlow models and can be containerized
Installing TensorFlow Serving
Let's start by installing TensorFlow Serving. There are several ways to install it, but we'll cover the most common approaches.
Using Docker (Recommended for Beginners)
Docker is the simplest way to get started with TensorFlow Serving:
# Pull the TensorFlow Serving Docker image
docker pull tensorflow/serving
# Start a TensorFlow Serving container
docker run -p 8501:8501 \
--name tensorflow_serving \
--mount type=bind,source=/path/to/your/models/directory,target=/models \
-e MODEL_NAME=your_model_name \
tensorflow/serving
Using apt (for Ubuntu)
# Add TensorFlow Serving distribution URI as a package source
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list
# Add tensorflow serving repository key
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
# Update packages and install TensorFlow Serving
sudo apt update
sudo apt-get install tensorflow-model-server
Preparing a Model for Serving
Before serving a model, you need to export it in the SavedModel format, which is TensorFlow's recommended format for model deployment.
Exporting a SavedModel
Here's a basic example of training a simple model and saving it in the SavedModel format:
import tensorflow as tf
import numpy as np
import os
# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')
# Train with dummy data
x_train = np.random.random((100, 10))
y_train = np.random.random((100, 1))
model.fit(x_train, y_train, epochs=5, verbose=1)
# Define the export path with a version number
export_path = os.path.join("models", "simple_model", "1")
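# Save the model in the SavedModel format (the default in TF 2.x with Keras 2
# when the path has no file extension). With Keras 3 (TF 2.16+), this call
# expects a .keras or .h5 path; use model.export(export_path) there instead.
tf.keras.models.save_model(
    model,
    export_path,
    overwrite=True,
    include_optimizer=True
)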
print(f"Model saved to: {export_path}")
The output would look something like:
Epoch 1/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2643
Epoch 2/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2513
Epoch 3/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2414
Epoch 4/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2329
Epoch 5/5
4/4 [==============================] - 0s 2ms/step - loss: 0.2247
Model saved to: models/simple_model/1
Model Directory Structure
TensorFlow Serving expects a specific directory structure for models:
models/
└── simple_model/          # Model name
    ├── 1/                 # Version 1 of the model
    │   ├── saved_model.pb
    │   └── variables/
    └── 2/                 # Version 2 of the model
        ├── saved_model.pb
        └── variables/
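To make the layout concrete, here is a minimal sketch of how version directories can be discovered from Python. The helper names `list_model_versions` and `latest_version` are our own, not part of TensorFlow Serving; the sketch only assumes that versions are subdirectories whose names are plain integers, as in the tree above.

```python
from pathlib import Path
import tempfile


def list_model_versions(model_dir):
    """Return the numeric version subdirectories of model_dir, sorted ascending."""
    versions = []
    for child in Path(model_dir).iterdir():
        # Only directories whose names are plain integers count as versions.
        if child.is_dir() and child.name.isdecimal():
            versions.append(int(child.name))
    return sorted(versions)


def latest_version(model_dir):
    """Return the highest version number, or None if there are no versions."""
    versions = list_model_versions(model_dir)
    return versions[-1] if versions else None


# Demo on a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "simple_model"
    for name in ("1", "2", "10", "notes"):
        (demo_dir / name).mkdir(parents=True)
    print(list_model_versions(demo_dir))  # [1, 2, 10]
    print(latest_version(demo_dir))       # 10
```

Note that versions are compared as numbers, so version 10 ranks above version 2 even though "10" sorts before "2" as a string.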
Running TensorFlow Serving
Starting the Server
After exporting your model, you can start TensorFlow Serving to make it available:
# Using the tensorflow_model_server command
tensorflow_model_server --port=8500 --rest_api_port=8501 \
--model_name=simple_model --model_base_path=/absolute/path/to/models/simple_model
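If you're using Docker, publish both ports so that the gRPC (8500) and REST (8501) endpoints are reachable from the host:
docker run -p 8500:8500 -p 8501:8501 \
--name tensorflow_serving \
--mount type=bind,source=/absolute/path/to/models,target=/models \
-e MODEL_NAME=simple_model \
-e MODEL_BASE_PATH=/models \
tensorflow/serving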
You should see output indicating that the server has started, and your model has been loaded successfully.
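Besides reading the logs, we can ask the server for the model's status over REST at `/v1/models/<model_name>`. The sketch below uses only the standard library. The helper names are our own, and the sample reply is a hand-written stand-in shaped like the status JSON, so the demo runs without a server.

```python
import json
import urllib.request


def status_url(model_name, host="localhost", rest_port=8501):
    """Build the REST model-status URL for a served model."""
    return f"http://{host}:{rest_port}/v1/models/{model_name}"


def available_versions(status):
    """Extract the version numbers reported as AVAILABLE from a status reply."""
    return sorted(
        int(entry["version"])
        for entry in status.get("model_version_status", [])
        if entry.get("state") == "AVAILABLE"
    )


def fetch_status(model_name, timeout=5.0):
    """Query a running server; raises URLError if it is unreachable."""
    with urllib.request.urlopen(status_url(model_name), timeout=timeout) as resp:
        return json.load(resp)


# Offline demo with a hand-written reply shaped like the status endpoint's JSON.
sample_status = {
    "model_version_status": [
        {"version": "1", "state": "AVAILABLE",
         "status": {"error_code": "OK", "error_message": ""}}
    ]
}
print(status_url("simple_model"))
print(available_versions(sample_status))
```

With the server running, `available_versions(fetch_status("simple_model"))` should list the versions that are ready to answer requests.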
Making Predictions with TensorFlow Serving
TensorFlow Serving provides two APIs for predictions:
- REST API: Easier to use, primarily for lower throughput applications
- gRPC API: Higher performance, ideal for production systems with high throughput
Using the REST API
Let's make a prediction using the REST API:
import json
import numpy as np
import requests
# Create sample data similar to the training data
data = np.random.random((2, 10)).tolist()
# Create the JSON payload
payload = {
    "instances": data
}
# Send the request to the server
response = requests.post(
    "http://localhost:8501/v1/models/simple_model:predict",
    data=json.dumps(payload)
)
# Process the response
if response.status_code == 200:
    predictions = response.json()["predictions"]
    print(f"Predictions: {predictions}")
else:
    print(f"Error: {response.text}")
Sample output:
Predictions: [[0.4206941], [-0.1234567]]
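The REST API also lets a client target a specific model version and choose between two request layouts. The following is a small sketch of both. The function names are ours, but the URL shape (`/versions/<n>` before `:predict`) and the `"instances"` (row) versus `"inputs"` (columnar) keys follow the REST API.

```python
import json


def predict_url(model_name, version=None, host="localhost", rest_port=8501):
    """Build a predict URL, optionally pinned to one model version."""
    base = f"http://{host}:{rest_port}/v1/models/{model_name}"
    if version is not None:
        base += f"/versions/{version}"
    return base + ":predict"


def row_payload(rows):
    """Row format: one entry per example under "instances"."""
    return json.dumps({"instances": rows})


def column_payload(batch):
    """Columnar format: the whole batch as a single tensor under "inputs"."""
    return json.dumps({"inputs": batch})


print(predict_url("simple_model"))
print(predict_url("simple_model", version=2))
print(row_payload([[0.1, 0.2], [0.3, 0.4]]))
```

A row-format request is answered under the `"predictions"` key, as in the example above, while a columnar request is answered under `"outputs"`.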
Using the gRPC API
For higher throughput applications, you can use the gRPC API, which requires more setup but offers better performance:
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
# Create a gRPC channel
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
# Create prediction request
request = predict_pb2.PredictRequest()
request.model_spec.name = "simple_model"
request.model_spec.signature_name = "serving_default"
# Prepare sample data
data = np.random.random((2, 10)).astype(np.float32)
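# Add the input tensor. The key must match the input name in the model's
# serving signature; "dense_input" is what Keras typically generates here.
# Check yours with: saved_model_cli show --dir models/simple_model/1 --all
tensor_proto = tf.make_tensor_proto(data)
request.inputs["dense_input"].CopyFrom(tensor_proto)
# Send request
response = stub.Predict(request, timeout=10.0) # 10 second timeout
# Process response. The output key ("dense_2" here) also comes from the signature.
output = tf.make_ndarray(response.outputs["dense_2"])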
print(f"Predictions: {output}")
Sample output:
Predictions: [[ 0.37128112]
[-0.09876545]]
Model Versioning and Updates
One of TensorFlow Serving's key features is its ability to handle model versions seamlessly. This allows you to deploy new model versions without interrupting service.
Adding a New Model Version
Let's add a new version of our model:
import tensorflow as tf
import numpy as np
import os
# Create an improved model (more layers)
improved_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(128, activation='relu'), # Extra layer
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
improved_model.compile(optimizer='adam', loss='mse')
# Train with freshly generated dummy data of the same shape
x_train = np.random.random((100, 10))
y_train = np.random.random((100, 1))
improved_model.fit(x_train, y_train, epochs=5, verbose=1)
# Define the export path with a new version number
export_path = os.path.join("models", "simple_model", "2") # Version 2
# Save the model in the SavedModel format.
# On Keras 3 (TF 2.16+), use improved_model.export(export_path) instead.
tf.keras.models.save_model(
    improved_model,
    export_path,
    overwrite=True,
    include_optimizer=True
)
print(f"Improved model saved to: {export_path}")
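TensorFlow Serving polls the directory given by --model_base_path (here models/simple_model, the parent of the numbered version directories) and automatically loads a new version when it appears. With the default policy it then serves the highest version number and unloads the previous one.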
Version Policy
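You can control which versions of the model are served using version policies:
- Latest: Serves the N highest-numbered versions (the default, with N = 1)
- All: Serves all available versions
- Specific: Serves specific versions that you define
The policy is set in a model config file passed with --model_config_file; the older --model_version_policy command-line flag is not available in recent releases. For example, to serve only the latest version, put model_version_policy { latest { num_versions: 1 } } in the model's config entry and start the server with (models.config is a placeholder file name):
tensorflow_model_server --port=8500 --rest_api_port=8501 \
--model_config_file=/path/to/models.config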
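The three policies can also be written out in a model config file, which the server reads via `--model_config_file`. As a sketch, the hypothetical helper below (`render_model_config` is our own name) renders such a file in the protobuf text format. The field names follow TensorFlow Serving's model server config, while the paths are placeholders.

```python
def render_model_config(name, base_path, policy="latest", num_versions=1, versions=()):
    """Render a one-model config file with the requested version policy."""
    if policy == "latest":
        policy_block = f"latest {{ num_versions: {num_versions} }}"
    elif policy == "all":
        policy_block = "all {}"
    elif policy == "specific":
        if not versions:
            raise ValueError("the 'specific' policy needs at least one version")
        listed = " ".join(f"versions: {v}" for v in versions)
        policy_block = f"specific {{ {listed} }}"
    else:
        raise ValueError(f"unknown policy: {policy!r}")
    return (
        "model_config_list {\n"
        "  config {\n"
        f'    name: "{name}"\n'
        f'    base_path: "{base_path}"\n'
        '    model_platform: "tensorflow"\n'
        f"    model_version_policy {{ {policy_block} }}\n"
        "  }\n"
        "}\n"
    )


# Serve versions 1 and 2 side by side (placeholder base path).
print(render_model_config("simple_model", "/path/to/models/simple_model",
                          policy="specific", versions=(1, 2)))
```

Saving the printed text to a file and passing that file to `--model_config_file` replaces the `--model_name` and `--model_base_path` flags for that model.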
Real-world Application: Image Classification API
Let's implement a complete example of serving an image classification model with TensorFlow Serving:
Step 1: Export a Pre-trained MobileNet Model
import tensorflow as tf
import os
# Load a pre-trained MobileNetV2 model
base_model = tf.keras.applications.MobileNetV2(
    weights="imagenet",
    input_shape=(224, 224, 3),
    include_top=True
)
# Define preprocessing and post-processing functions
def preprocess(input_image):
    return tf.keras.applications.mobilenet_v2.preprocess_input(input_image)
# Create a serving model with pre-processing included
inputs = tf.keras.layers.Input(shape=(224, 224, 3), name='image')
x = tf.keras.layers.Lambda(preprocess)(inputs)
outputs = base_model(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
# Export the model in the SavedModel format.
# On Keras 3 (TF 2.16+), use model.export(export_path) instead.
export_path = os.path.join("models", "mobilenet", "1")
tf.keras.models.save_model(
    model,
    export_path,
    overwrite=True,
    include_optimizer=False
)
print(f"MobileNet model saved to: {export_path}")
Step 2: Start TensorFlow Serving with the MobileNet Model
tensorflow_model_server --port=8500 --rest_api_port=8501 \
--model_name=mobilenet --model_base_path=/absolute/path/to/models/mobilenet
Step 3: Create a Client Application to Use the Model
import json
import numpy as np
import requests
from PIL import Image

def load_and_preprocess_image(image_path):
    # Load as 3-channel RGB (handles grayscale and RGBA files) and resize
    image = Image.open(image_path).convert("RGB").resize((224, 224))
    image_array = np.array(image, dtype=np.float32)
    # Add batch dimension
    image_array = np.expand_dims(image_array, 0)
    return image_array

def get_prediction(image_array):
    # Create the JSON payload
    payload = {
        "instances": image_array.tolist()
    }
    # Send the request to the server
    response = requests.post(
        "http://localhost:8501/v1/models/mobilenet:predict",
        data=json.dumps(payload)
    )
    # Process the response
    if response.status_code == 200:
        predictions = response.json()["predictions"][0]
        # Get the top 5 predictions
        top5_indices = np.argsort(predictions)[-5:][::-1]
        # Load ImageNet class labels (one label per line, in class-index order)
        with open("imagenet_classes.txt", "r") as f:
            class_names = [line.strip() for line in f]
        # Return top 5 predictions with class names
        return [(class_names[i], predictions[i]) for i in top5_indices]
    else:
        print(f"Error: {response.text}")
        return None

# Example usage
if __name__ == "__main__":
    image_path = "cat.jpg" # Replace with your test image
    image_array = load_and_preprocess_image(image_path)
    predictions = get_prediction(image_array)
    if predictions:
        print("Top 5 predictions:")
        for class_name, score in predictions:
            print(f"{class_name}: {score:.4f}")
Example output:
Top 5 predictions:
Egyptian cat: 0.8764
tabby, tabby cat: 0.0912
tiger cat: 0.0158
lynx, catamount: 0.0034
Persian cat: 0.0012
Advanced TensorFlow Serving Features
Model Batching
TensorFlow Serving can batch incoming requests for better performance. You can configure batching options when starting the server:
tensorflow_model_server --port=8500 --rest_api_port=8501 \
--model_name=simple_model --model_base_path=/path/to/models/simple_model \
--enable_batching=true \
--batching_parameters_file=batch_config.txt
Example batch_config.txt:
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
max_enqueued_batches { value: 100 }
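To build intuition for how max_batch_size and batch_timeout_micros interact, here is a deliberately simplified simulation. It is not TensorFlow Serving's actual scheduler. It assumes a single queue in which a batch closes once it is full, or once a request arrives more than the timeout after the batch's first request. The function name and the arrival times are made up for illustration.

```python
def simulate_batches(arrival_times_us, max_batch_size, batch_timeout_micros):
    """Group ascending request arrival times (microseconds) into batches.

    Simplified model: the open batch is closed when it is already full, or
    when the next request arrives after the first request's timeout expired.
    """
    batches = []
    current = []
    for t in arrival_times_us:
        full = len(current) >= max_batch_size
        timed_out = bool(current) and t - current[0] > batch_timeout_micros
        if current and (full or timed_out):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


arrivals = [0, 1000, 2000, 9000, 9500, 20000]
print(simulate_batches(arrivals, max_batch_size=32, batch_timeout_micros=5000))
# [[0, 1000, 2000], [9000, 9500], [20000]]
print(simulate_batches(arrivals, max_batch_size=2, batch_timeout_micros=5000))
# [[0, 1000], [2000], [9000, 9500], [20000]]
```

In this simplified model, a larger timeout lets more requests share a batch, at the cost of the first request in each batch waiting longer.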
Monitoring with Prometheus
You can configure TensorFlow Serving to expose metrics for Prometheus:
tensorflow_model_server --port=8500 --rest_api_port=8501 \
--model_name=simple_model --model_base_path=/path/to/models/simple_model \
--monitoring_config_file=monitoring_config.txt
Example monitoring_config.txt:
prometheus_config {
  enable: true
  path: "/monitoring/prometheus/metrics"
}
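Once metrics are enabled, they are served as plain text on the REST port at the configured path. As a sketch, the hypothetical parser below reads that text format into a dictionary. It assumes samples without timestamps, and the metric names in the demo are invented for illustration rather than taken from a real server.

```python
def metrics_url(host="localhost", rest_port=8501,
                path="/monitoring/prometheus/metrics"):
    """Build the URL at which the metrics text is exposed."""
    return f"http://{host}:{rest_port}{path}"


def parse_prometheus_text(text):
    """Parse Prometheus text-format samples into {name_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamps assumed).
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Invented sample text; real metric names will differ.
sample_metrics = """\
# HELP demo_request_count Hypothetical counter used only for this example.
# TYPE demo_request_count counter
demo_request_count{model_name="simple_model",status="OK"} 42
demo_latency_seconds_sum 1.5
"""
for name, value in parse_prometheus_text(sample_metrics).items():
    print(name, value)
```

In practice a Prometheus server scrapes this endpoint for you; parsing it by hand is mainly useful for a quick sanity check that metrics are flowing.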
Secure Connections with SSL/TLS
For production deployments, you can enable secure connections:
tensorflow_model_server --port=8500 --rest_api_port=8501 \
--model_name=simple_model --model_base_path=/path/to/models/simple_model \
--ssl_config_file=ssl_config.txt
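Example ssl_config.txt (the fields hold the PEM-encoded contents of each key or certificate, not file paths, so replace each placeholder with the text of the corresponding file; this setting secures the gRPC port):
server_key: "<contents of server.key>"
server_cert: "<contents of server.crt>"
custom_ca: "<contents of ca.crt>"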
Best Practices for TensorFlow Serving
- Version your models: Use increasing integer version directories (TensorFlow Serving requires numeric version names) and record what changed in each version separately
- Monitor server performance: Set up logging and metrics to track model performance
- Test before deployment: Validate model performance and API integration before production
- Use Docker: Containerize your serving environment for consistent deployment
- Implement A/B testing: Use model versioning to compare different model versions
- Keep input processing consistent: Ensure preprocessing in training matches serving
- Set appropriate timeouts: Configure timeouts based on your model's complexity
- Use model warmup: Prime your model with sample data for faster initial predictions
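For the A/B testing practice above, one common approach is to assign each user to a model version deterministically, so the same user always hits the same version. Here is a minimal sketch, assuming the client picks the version and then calls that version's endpoint. The function name and the 90/10 split are illustrative choices, not something TensorFlow Serving provides.

```python
import hashlib


def choose_version(user_id, split):
    """Pick a model version for user_id from {version: weight}, deterministically."""
    total = sum(split.values())
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a point in [0, total).
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for version, weight in sorted(split.items()):
        cumulative += weight
        if point < cumulative:
            return version
    return max(split)  # Guard against floating-point rounding at the top end.


split = {1: 0.9, 2: 0.1}  # Send roughly 10% of users to version 2.
for user in ("alice", "bob", "carol"):
    print(user, "-> version", choose_version(user, split))
```

Because the assignment depends only on the user ID and the weights, no shared state is needed between client instances, and the result can be used to build a versioned predict URL.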
Troubleshooting Common Issues
Model Not Loading
If your model doesn't load, check:
- Correct path to the SavedModel directory
- Model directory structure follows the expected format
- Model files have appropriate permissions
- Model format is compatible with your TensorFlow Serving version
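The first three checks can be automated before the server is even started. The sketch below defines a hypothetical `diagnose_model_dir` helper (our own name) that reports missing pieces in a model directory, assuming the standard layout of numeric version folders that each contain saved_model.pb and variables/.

```python
import os
import tempfile
from pathlib import Path


def diagnose_model_dir(model_dir):
    """Return a list of human-readable problems found under model_dir."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    version_dirs = sorted(
        (p for p in root.iterdir() if p.is_dir() and p.name.isdecimal()),
        key=lambda p: int(p.name),
    )
    if not version_dirs:
        return ["no numeric version subdirectory found"]
    problems = []
    for vdir in version_dirs:
        graph_file = vdir / "saved_model.pb"
        if not graph_file.is_file():
            problems.append(f"{vdir.name}: saved_model.pb is missing")
        elif not os.access(graph_file, os.R_OK):
            problems.append(f"{vdir.name}: saved_model.pb is not readable")
        if not (vdir / "variables").is_dir():
            problems.append(f"{vdir.name}: variables/ directory is missing")
    return problems


# Demo: version 1 is complete, version 2 lacks its graph file.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "simple_model"
    (demo_dir / "1" / "variables").mkdir(parents=True)
    (demo_dir / "1" / "saved_model.pb").write_bytes(b"")
    (demo_dir / "2" / "variables").mkdir(parents=True)
    print(diagnose_model_dir(demo_dir))  # ['2: saved_model.pb is missing']
```

An empty list means the layout looks right, though it says nothing about whether the model format itself is compatible with your server version.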
High Latency
If you experience high latency:
- Enable batching
- Check if your model is complex and might need optimization
- Consider using TensorRT for optimization
- Adjust hardware resources (CPU, memory) allocated to the server
Connection Issues
For connection problems:
- Verify ports are correctly mapped in Docker (if using)
- Check firewall settings
- Validate network configuration
- Verify the correct URL in client applications
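A quick way to separate network problems from model problems is to test whether the ports accept TCP connections at all. Here is a minimal sketch using only the standard library. The helper name is ours, and the demo opens a temporary local listener so that it runs without a server.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demo against a temporary local listener instead of a real server.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # Port 0 lets the OS pick a free port.
listener.listen(1)
demo_port = listener.getsockname()[1]
print(is_port_open("127.0.0.1", demo_port))  # True
listener.close()
```

Against a running server, `is_port_open("localhost", 8501)` and `is_port_open("localhost", 8500)` tell you whether the REST and gRPC ports used in this guide are reachable from the client machine.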
Summary
TensorFlow Serving is a powerful tool that bridges the gap between machine learning model development and production deployment. We've covered:
- How to install and set up TensorFlow Serving
- Exporting models in the SavedModel format
- Serving models via REST and gRPC APIs
- Managing multiple model versions
- Building a real-world image classification API
- Advanced features and best practices
This knowledge provides a solid foundation for deploying your TensorFlow models in production environments with scalability, high performance, and version management.
Additional Resources
- TensorFlow Serving Official Documentation
- TensorFlow Serving GitHub Repository
- TensorFlow SavedModel Guide
- Docker Documentation for TensorFlow Serving
Exercises
- Export and serve a simple regression model you've trained
- Implement a text classification API using TensorFlow Serving
- Create a client application that can switch between different model versions
- Set up TensorFlow Serving with batching and monitor performance improvements
- Deploy a TensorFlow Serving model on a cloud provider (AWS, GCP, or Azure)