TensorFlow Serving with TFX
Introduction
TensorFlow Serving is a flexible, high-performance serving system designed for deploying machine learning models in production environments. As part of the TensorFlow Extended (TFX) ecosystem, TensorFlow Serving allows you to transition smoothly from training models to deploying them in real-world applications with minimal effort.
In this tutorial, we'll explore:
- What TensorFlow Serving is and its key benefits
- How it integrates with the TFX ecosystem
- Setting up TensorFlow Serving
- Deploying models using different serving methods
- Best practices for model deployment
What is TensorFlow Serving?
TensorFlow Serving is a production-ready system that allows you to serve machine learning models in a scalable and efficient manner. Unlike manual deployment approaches that require custom code, TensorFlow Serving provides a standardized way to deploy models with the following benefits:
- Version management: Serve multiple versions of your model simultaneously
- Easy updates: Hot-swap new model versions without downtime
- High performance: Optimized for production workloads with batching support
- Flexibility: Serve via REST API or gRPC for different use cases
How TensorFlow Serving Works Within TFX
In the TFX ecosystem, TensorFlow Serving typically sits at the end of your machine learning pipeline (a minimal wiring sketch follows this list):
- Your data is ingested and processed through TFX components such as ExampleGen and Transform
- Models are trained with the Trainer component
- Models are validated with the Evaluator component
- Validated models are pushed to a serving directory (or model registry) by the Pusher component
- TensorFlow Serving deploys these models to serve predictions
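To make this flow concrete, here is a minimal wiring sketch of those stages as TFX components. It assumes a CSV dataset under data/ and a hypothetical trainer module my_trainer_module.py that defines run_fn; exact component arguments vary between TFX versions, and Transform plus the Evaluator's eval_config are omitted for brevity:
from tfx.components import CsvExampleGen, Evaluator, Pusher, Trainer
from tfx.proto import pusher_pb2, trainer_pb2
# Ingest raw examples from a (hypothetical) CSV directory
example_gen = CsvExampleGen(input_base='data/')
# Train the model; my_trainer_module.py is a placeholder that defines run_fn()
trainer = Trainer(
    module_file='my_trainer_module.py',
    examples=example_gen.outputs['examples'],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=100))
# Validate ("bless") the trained model
evaluator = Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'])
# Copy blessed models to a directory watched by TensorFlow Serving
pusher = Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=pusher_pb2.PushDestination(
        filesystem=pusher_pb2.PushDestination.Filesystem(
            base_directory='/serving_models/my_model')))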
Setting Up TensorFlow Serving
Let's start with installing the necessary dependencies:
# Install the TensorFlow Serving client API for Python
# (used for the gRPC prediction examples later in this tutorial)
pip install tensorflow-serving-api
# The model server itself is deployed separately, with Docker being
# the recommended route; make sure Docker is installed on your system first
Preparing Your Model for Serving
Before we can serve a model, we need to export it in the SavedModel format, which is TensorFlow's standard serialization format:
import tensorflow as tf
# Assume we have a trained model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mae'])
# Train the model (simplified for this example)
import numpy as np
x_train = np.random.random((1000, 10))
y_train = np.random.random((1000, 1))
model.fit(x_train, y_train, epochs=5, batch_size=32)
# Export the model in SavedModel format
export_path = "./saved_models/model/1" # Version number as subfolder
tf.saved_model.save(model, export_path)
print(f"Model saved to: {export_path}")
The output would look something like:
Model saved to: ./saved_models/model/1
Note the directory structure: we're saving the model in a version-specific subdirectory (1). This enables TensorFlow Serving to manage multiple versions of the same model.
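Before deploying, it is worth confirming what the exported signature actually expects. The saved_model_cli tool that ships with TensorFlow prints the input and output tensor names; you will need the input key again for the gRPC example later:
# Inspect the serving signature of the exported model
saved_model_cli show --dir ./saved_models/model/1 \
    --tag_set serve --signature_def serving_default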
Deployment Options with TensorFlow Serving
Option 1: Docker-based Deployment (Recommended)
The easiest way to deploy TensorFlow Serving is using Docker:
# Pull the TensorFlow Serving Docker image
docker pull tensorflow/serving
# Start a container with your model mounted
# (mount the directory that contains the numbered version folders;
#  port 8501 serves the REST API, port 8500 serves gRPC)
docker run -p 8501:8501 -p 8500:8500 \
    --mount type=bind,source=/path/to/saved_models/model,target=/models/my_model \
    -e MODEL_NAME=my_model \
    tensorflow/serving
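Once the container is running, a quick sanity check is to query the model status endpoint; if the model loaded correctly, its state is reported as AVAILABLE:
# Check that the model is loaded and available
curl http://localhost:8501/v1/models/my_model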
Option 2: Direct Installation
You can also install TensorFlow Serving directly on your system:
# On Ubuntu/Debian
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
sudo apt update
sudo apt-get install tensorflow-model-server
# Start the server (model_base_path must be an absolute path to the
# directory that contains the numbered version folders)
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_name=my_model --model_base_path=/path/to/saved_models/model
Making Predictions with the Served Model
Using REST API
Once your model is being served, you can make predictions using the REST API:
import json
import requests
import numpy as np
# Create sample data (matching the input shape of our model)
data = np.random.random((3, 10)).tolist()
# Create the request JSON
request_data = json.dumps({
    "signature_name": "serving_default",
    "instances": data
})
# Send the request to the server
response = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    data=request_data
)
# Process the response
predictions = json.loads(response.text)["predictions"]
print("Predictions:", predictions)
Output:
Predictions: [[0.4301], [0.5672], [0.3959]] # Example values, yours will differ
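The same endpoint can also be exercised straight from the command line with curl, which is handy for quick debugging (the ten feature values below are arbitrary):
# Equivalent request using curl
curl -X POST http://localhost:8501/v1/models/my_model:predict \
    -d '{"instances": [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]]}'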
Using gRPC (Higher Performance)
For higher throughput applications, you can use gRPC instead:
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
# Create a gRPC channel
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
# Create a request
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'
# Create sample data
data = np.random.random((3, 10)).astype(np.float32)
# Add inputs to the request (the input key must match the serving
# signature; check it with saved_model_cli if you are unsure)
request.inputs['dense_input'].CopyFrom(
    tf.make_tensor_proto(data)
)
# Send request
result = stub.Predict(request, 10.0) # 10 secs timeout
print(result)
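The gRPC response is a PredictResponse protobuf rather than JSON, so the output tensor needs to be converted back to a NumPy array. The output key below ('dense_2', the name of the final Dense layer in this example) is an assumption; confirm the actual key with saved_model_cli as shown earlier:
# Convert the response tensor back to a NumPy array
output_key = 'dense_2'  # Assumed output key; check yours with saved_model_cli
predictions = tf.make_ndarray(result.outputs[output_key])
print("Predictions:", predictions)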
Integrating with TFX Pusher
In a full TFX pipeline, you can automate the deployment process using the Pusher component:
from tfx.components import Pusher
from tfx.proto import pusher_pb2
# Define the Pusher component to export the model
pusher = Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=pusher_pb2.PushDestination(
        filesystem=pusher_pb2.PushDestination.Filesystem(
            base_directory='/serving_models/my_model'
        )
    )
)
When this component runs, it will:
- Check if the model has been "blessed" (passed quality thresholds)
- If blessed, copy the model to the destination directory with proper versioning
- Your TensorFlow Serving instances can then pick up the new model
Advanced Features
Model Versioning
TensorFlow Serving automatically manages model versions:
# Structure your models like this:
/serving_models/
    my_model/
        1/                  # Model version 1
            saved_model.pb
            variables/
        2/                  # Model version 2
            saved_model.pb
            variables/
By default, TensorFlow Serving serves the highest-numbered version. You can customize this behavior with the --model_config_file flag and a model configuration file, as shown below.
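For example, a sketch of launching the server against a config file (the path is a placeholder; the file shown in the next section would work here):
# Serve models according to a configuration file instead of a single
# --model_name/--model_base_path pair
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_config_file=/path/to/models.config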
A/B Testing with Model Versions
You can serve multiple specific versions simultaneously and give them stable labels (for example, production and experiment); the traffic split itself is decided by your clients or by a load balancer in front of the server. The model configuration file uses TensorFlow Serving's protobuf text format:
model_config_list {
  config {
    name: 'my_model'
    base_path: '/serving_models/my_model'
    model_platform: 'tensorflow'
    model_version_policy {
      specific {
        versions: 1
        versions: 2
      }
    }
    version_labels {
      key: 'production'
      value: 1
    }
    version_labels {
      key: 'experiment'
      value: 2
    }
  }
}
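With that configuration loaded, clients choose which version they hit. Over REST you can address a version number directly (the request body below reuses the arbitrary features from earlier); over gRPC you can instead set request.model_spec.version_label = 'experiment' on the PredictRequest from the previous example. Note that version labels can only be assigned to versions that are already loaded, unless the server is started with --allow_version_labels_for_unavailable_models:
# Send a request to a specific model version over REST
curl -X POST http://localhost:8501/v1/models/my_model/versions/2:predict \
    -d '{"instances": [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]]}'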
Monitoring
For production deployments, you'll want to monitor your serving infrastructure:
# Example: exposing Prometheus metrics
# Start TensorFlow Serving with a monitoring configuration file:
# tensorflow_model_server ... --rest_api_port=8501 \
#     --monitoring_config_file=/path/to/monitoring.config
# Metrics are then exposed on the REST port at /monitoring/prometheus/metrics
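A minimal sketch of that monitoring configuration file, plus a matching Prometheus scrape job (the job name and target are placeholders for your environment):
# monitoring.config (protobuf text format)
prometheus_config {
  enable: true
  path: "/monitoring/prometheus/metrics"
}
# prometheus.yml (scrape job pointing at the REST port)
scrape_configs:
  - job_name: 'tensorflow-serving'
    metrics_path: '/monitoring/prometheus/metrics'
    static_configs:
      - targets: ['localhost:8501']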
Real-World Example: Sentiment Analysis API
Let's deploy a sentiment analysis model that classifies text as positive or negative:
import tensorflow as tf
# Create a simple sentiment model (normally you'd train it properly and
# adapt the TextVectorization layer on a real corpus before exporting)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=[], dtype=tf.string, name='text'),
    tf.keras.layers.TextVectorization(max_tokens=10000, output_mode='int'),
    tf.keras.layers.Embedding(10000, 64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid', name='sentiment')
])
# Export with an explicit signature to make the API clearer
@tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.string)])
def serving_fn(inputs):
    return {'sentiment_score': model(inputs)}
export_path = "./sentiment_model/1"
tf.saved_model.save(
    model,
    export_path,
    signatures={'serving_default': serving_fn}
)
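One way to stand this model up, using the same Docker approach as before (the mount source assumes you run the command from the directory containing sentiment_model/):
# Serve the sentiment model with Docker
docker run -p 8501:8501 \
    --mount type=bind,source=$(pwd)/sentiment_model,target=/models/sentiment_model \
    -e MODEL_NAME=sentiment_model \
    tensorflow/serving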
After deploying this model with TensorFlow Serving, you can use it like this:
import json
import requests
# Sample text to analyze
texts = ["I absolutely loved this product!",
"The service was terrible and disappointing."]
# Create the request
request_data = json.dumps({
"instances": texts
})
# Send request to the model
response = requests.post(
"http://localhost:8501/v1/models/sentiment_model:predict",
data=request_data
)
# Parse results
results = json.loads(response.text)
scores = results["predictions"]
# Print predictions
for text, score in zip(texts, scores):
sentiment = "Positive" if score[0] > 0.5 else "Negative"
print(f"Text: {text}")
print(f"Sentiment: {sentiment} (Score: {score[0]:.4f})")
print()
Sample output:
Text: I absolutely loved this product!
Sentiment: Positive (Score: 0.8734)
Text: The service was terrible and disappointing.
Sentiment: Negative (Score: 0.1243)
Performance Optimization
For high-traffic applications, consider these optimizations (a batching configuration sketch follows this list):
- Batching: TensorFlow Serving can batch incoming requests to maximize throughput; enable it with the --enable_batching flag and tune it with a --batching_parameters_file.
- Model optimization: for edge deployment, convert the SavedModel to TensorFlow Lite:
# Convert to TensorFlow Lite for edge deployment
converter = tf.lite.TFLiteConverter.from_saved_model(export_path)
tflite_model = converter.convert()
# Save the TF Lite model
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
- Hardware acceleration: to serve on GPUs, use the GPU build of the server (for example the tensorflow/serving:latest-gpu Docker image together with the NVIDIA container runtime).
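As referenced in the batching item above, here is a sketch of the relevant flags and a batching parameters file; the values shown are illustrative starting points rather than tuned recommendations:
# Enable server-side batching
# tensorflow_model_server ... --enable_batching=true \
#     --batching_parameters_file=/path/to/batching.config
# batching.config (protobuf text format)
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
num_batch_threads { value: 4 }
max_enqueued_batches { value: 100 }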
Summary
In this tutorial, we've covered:
- What TensorFlow Serving is and its key benefits
- How it integrates with the TFX ecosystem
- Multiple deployment options (Docker, direct installation)
- Making predictions with REST API and gRPC
- Advanced features like versioning and A/B testing
- A real-world example of deploying a sentiment analysis model
- Performance optimization techniques
TensorFlow Serving provides a robust solution for deploying machine learning models in production environments. By following the practices outlined in this guide, you can transition from model experimentation to production-ready deployment within the TFX ecosystem.
Additional Resources
- TensorFlow Serving Official Documentation
- TFX Pusher Component Guide
- Model Serving Best Practices
- TensorFlow Serving API Reference
Exercises
- Deploy a pre-trained image classification model using TensorFlow Serving
- Implement A/B testing between two versions of the same model
- Create a simple web application that uses your deployed model via the REST API
- Benchmark the performance difference between REST and gRPC interfaces
- Set up monitoring for your TensorFlow Serving deployment using Prometheus