TensorFlow Serving with TFX
Introduction
TensorFlow Serving is a flexible, high-performance serving system designed for deploying machine learning models in production environments. As part of the TensorFlow Extended (TFX) ecosystem, TensorFlow Serving allows you to transition smoothly from training models to deploying them in real-world applications with minimal effort.
In this tutorial, we'll explore:
- What TensorFlow Serving is and its key benefits
- How it integrates with the TFX ecosystem
- Setting up TensorFlow Serving
- Deploying models using different serving methods
- Best practices for model deployment
What is TensorFlow Serving?
TensorFlow Serving is a production-ready system that allows you to serve machine learning models in a scalable and efficient manner. Unlike manual deployment approaches that require custom code, TensorFlow Serving provides a standardized way to deploy models with the following benefits:
- Version management: Serve multiple versions of your model simultaneously
- Easy updates: Hot-swap new model versions without downtime
- High performance: Optimized for production workloads with batching support
- Flexibility: Serve via REST API or gRPC for different use cases
How TensorFlow Serving Works Within TFX
In the TFX ecosystem, TensorFlow Serving typically sits at the end of your machine learning pipeline (a minimal wiring sketch follows this list):
- Your data is ingested and processed through TFX components such as ExampleGen and Transform
- Models are trained with the Trainer component
- Models are validated with the Evaluator component
- Validated models are pushed to a serving directory (or model registry) by the Pusher component
- TensorFlow Serving deploys these models to serve predictions
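To make this flow concrete, here is a minimal wiring sketch of those stages as TFX components. It assumes a CSV dataset under data/ and a hypothetical trainer module my_trainer_module.py that defines run_fn; exact component arguments vary between TFX versions, and Transform plus the Evaluator's eval_config are omitted for brevity:
from tfx.components import CsvExampleGen, Evaluator, Pusher, Trainer
from tfx.proto import pusher_pb2, trainer_pb2
# Ingest raw examples from a (hypothetical) CSV directory
example_gen = CsvExampleGen(input_base='data/')
# Train the model; my_trainer_module.py is a placeholder that defines run_fn()
trainer = Trainer(
    module_file='my_trainer_module.py',
    examples=example_gen.outputs['examples'],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=100))
# Validate ("bless") the trained model
evaluator = Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'])
# Copy blessed models to a directory watched by TensorFlow Serving
pusher = Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=pusher_pb2.PushDestination(
        filesystem=pusher_pb2.PushDestination.Filesystem(
            base_directory='/serving_models/my_model')))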
Setting Up TensorFlow Serving
Let's start with installing the necessary dependencies:
# Install the TensorFlow Serving client API for Python
# (used for the gRPC prediction examples later in this tutorial)
pip install tensorflow-serving-api
# The model server itself is deployed separately, with Docker being
# the recommended route; make sure Docker is installed on your system first
Preparing Your Model for Serving
Before we can serve a model, we need to export it in the SavedModel format, which is TensorFlow's standard serialization format:
import tensorflow as tf
# Assume we have a trained model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam',
              loss='mse',
              metrics=['mae'])
# Train the model (simplified for this example)
import numpy as np
x_train = np.random.random((1000, 10))
y_train = np.random.random((1000, 1))
model.fit(x_train, y_train, epochs=5, batch_size=32)
# Export the model in SavedModel format
export_path = "./saved_models/model/1" # Version number as subfolder
tf.saved_model.save(model, export_path)
print(f"Model saved to: {export_path}")
The output would look something like:
Model saved to: ./saved_models/model/1
Note the directory structure: we're saving the model in a version-specific subdirectory (1). This enables TensorFlow Serving to manage multiple versions of the same model.
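Before deploying, it is worth confirming what the exported signature actually expects. The saved_model_cli tool that ships with TensorFlow prints the input and output tensor names; you will need the input key again for the gRPC example later:
# Inspect the serving signature of the exported model
saved_model_cli show --dir ./saved_models/model/1 \
    --tag_set serve --signature_def serving_default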
Deployment Options with TensorFlow Serving
Option 1: Docker-based Deployment (Recommended)
The easiest way to deploy TensorFlow Serving is using Docker:
# Pull the TensorFlow Serving Docker image
docker pull tensorflow/serving
# Start a container with your model mounted
# (mount the directory that contains the numbered version folders;
#  port 8501 serves the REST API, port 8500 serves gRPC)
docker run -p 8501:8501 -p 8500:8500 \
    --mount type=bind,source=/path/to/saved_models/model,target=/models/my_model \
    -e MODEL_NAME=my_model \
    tensorflow/serving
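Once the container is running, a quick sanity check is to query the model status endpoint; if the model loaded correctly, its state is reported as AVAILABLE:
# Check that the model is loaded and available
curl http://localhost:8501/v1/models/my_model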
Option 2: Direct Installation
You can also install TensorFlow Serving directly on your system:
# On Ubuntu/Debian
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
sudo apt update
sudo apt-get install tensorflow-model-server
# Start the server (model_base_path must be an absolute path to the
# directory that contains the numbered version folders)
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_name=my_model --model_base_path=/path/to/saved_models/model
Making Predictions with the Served Model
Using REST API
Once your model is being served, you can make predictions using the REST API:
import json
import requests
import numpy as np
# Create sample data (matching the input shape of our model)
data = np.random.random((3, 10)).tolist()
# Create the request JSON
request_data = json.dumps({
    "signature_name": "serving_default",
    "instances": data
})
# Send the request to the server
response = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    data=request_data
)
# Process the response
predictions = json.loads(response.text)["predictions"]
print("Predictions:", predictions)
Output:
Predictions: [[0.4301], [0.5672], [0.3959]] # Example values, yours will differ
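The same endpoint can also be exercised straight from the command line with curl, which is handy for quick debugging (the ten feature values below are arbitrary):
# Equivalent request using curl
curl -X POST http://localhost:8501/v1/models/my_model:predict \
    -d '{"instances": [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]]}'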
Using gRPC (Higher Performance)
For higher throughput applications, you can use gRPC instead:
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
# Create a gRPC channel
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
# Create a request
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'
# Create sample data
data = np.random.random((3, 10)).astype(np.float32)
# Add inputs to the request (the input key must match the serving
# signature; check it with saved_model_cli if you are unsure)
request.inputs['dense_input'].CopyFrom(
    tf.make_tensor_proto(data)
)
# Send request
result = stub.Predict(request, 10.0) # 10 secs timeout
print(result)
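The gRPC response is a PredictResponse protobuf rather than JSON, so the output tensor needs to be converted back to a NumPy array. The output key below ('dense_2', the name of the final Dense layer in this example) is an assumption; confirm the actual key with saved_model_cli as shown earlier:
# Convert the response tensor back to a NumPy array
output_key = 'dense_2'  # Assumed output key; check yours with saved_model_cli
predictions = tf.make_ndarray(result.outputs[output_key])
print("Predictions:", predictions)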
Integrating with TFX Pusher
In a full TFX pipeline, you can automate the deployment process using the Pusher component:
from tfx.components import Pusher
from tfx.proto import pusher_pb2
# Define the Pusher component to export the model
pusher = Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=pusher_pb2.PushDestination(
        filesystem=pusher_pb2.PushDestination.Filesystem(
            base_directory='/serving_models/my_model'
        )
    )
)
When this component runs, it will:
- Check if the model has been "blessed" (passed quality thresholds)
- If blessed, copy the model to the destination directory with proper versioning
- Your TensorFlow Serving instances can then pick up the new model
Advanced Features
Model Versioning
TensorFlow Serving automatically manages model versions:
# Structure your models like this:
/serving_models/
    my_model/
        1/                  # Model version 1
            saved_model.pb
            variables/
        2/                  # Model version 2
            saved_model.pb
            variables/
By default, TensorFlow Serving serves the highest-numbered version. You can customize this behavior with the --model_config_file flag and a model configuration file, as shown below.
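For example, a sketch of launching the server against a config file (the path is a placeholder; the file shown in the next section would work here):
# Serve models according to a configuration file instead of a single
# --model_name/--model_base_path pair
tensorflow_model_server --port=8500 --rest_api_port=8501 \
    --model_config_file=/path/to/models.config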
A/B Testing with Model Versions
You can serve multiple specific versions simultaneously and give them stable labels (for example, production and experiment); the traffic split itself is decided by your clients or by a load balancer in front of the server. The model configuration file uses TensorFlow Serving's protobuf text format:
model_config_list {
  config {
    name: 'my_model'
    base_path: '/serving_models/my_model'
    model_platform: 'tensorflow'
    model_version_policy {
      specific {
        versions: 1
        versions: 2
      }
    }
    version_labels {
      key: 'production'
      value: 1
    }
    version_labels {
      key: 'experiment'
      value: 2
    }
  }
}
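With that configuration loaded, clients choose which version they hit. Over REST you can address a version number directly (the request body below reuses the arbitrary features from earlier); over gRPC you can instead set request.model_spec.version_label = 'experiment' on the PredictRequest from the previous example. Note that version labels can only be assigned to versions that are already loaded, unless the server is started with --allow_version_labels_for_unavailable_models:
# Send a request to a specific model version over REST
curl -X POST http://localhost:8501/v1/models/my_model/versions/2:predict \
    -d '{"instances": [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]]}'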
Monitoring
For production deployments, you'll want to monitor your serving infrastructure:
# Example: exposing Prometheus metrics
# Start TensorFlow Serving with a monitoring configuration file:
# tensorflow_model_server ... --rest_api_port=8501 \
#     --monitoring_config_file=/path/to/monitoring.config
# Metrics are then exposed on the REST port at /monitoring/prometheus/metrics
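A minimal sketch of that monitoring configuration file, plus a matching Prometheus scrape job (the job name and target are placeholders for your environment):
# monitoring.config (protobuf text format)
prometheus_config {
  enable: true
  path: "/monitoring/prometheus/metrics"
}
# prometheus.yml (scrape job pointing at the REST port)
scrape_configs:
  - job_name: 'tensorflow-serving'
    metrics_path: '/monitoring/prometheus/metrics'
    static_configs:
      - targets: ['localhost:8501']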
Real-World Example: Sentiment Analysis API
Let's deploy a sentiment analysis model that classifies text as positive or negative:
import tensorflow as tf
# Create a simple sentiment model (normally you'd train it properly and
# adapt the TextVectorization layer on a real corpus before exporting)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=[], dtype=tf.string, name='text'),
    tf.keras.layers.TextVectorization(max_tokens=10000, output_mode='int'),
    tf.keras.layers.Embedding(10000, 64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid', name='sentiment')
])
# Export with an explicit signature to make the API clearer
@tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.string)])
def serving_fn(inputs):
    return {'sentiment_score': model(inputs)}
export_path = "./sentiment_model/1"
tf.saved_model.save(
    model,
    export_path,
    signatures={'serving_default': serving_fn}
)
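One way to stand this model up, using the same Docker approach as before (the mount source assumes you run the command from the directory containing sentiment_model/):
# Serve the sentiment model with Docker
docker run -p 8501:8501 \
    --mount type=bind,source=$(pwd)/sentiment_model,target=/models/sentiment_model \
    -e MODEL_NAME=sentiment_model \
    tensorflow/serving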
After deploying this model with TensorFlow Serving, you can use it like this:
import json
import requests
# Sample text to analyze
texts = ["I absolutely loved this product!",
"The service was terrible and disappointing."]
# Create the request
request_data = json.dumps({
"instances": texts
})
# Send request to the model
response = requests.post(
"http://localhost:8501/v1/models/sentiment_model:predict",
data=request_data
)
# Parse results
results = json.loads(response.text)
scores = results["predictions"]
# Print predictions
for text, score in zip(texts, scores):
sentiment = "Positive" if score[0] > 0.5 else "Negative"
print(f"Text: {text}")
print(f"Sentiment: {sentiment} (Score: {score[0]:.4f})")
print()
Sample output:
Text: I absolutely loved this product!
Sentiment: Positive (Score: 0.8734)
Text: The service was terrible and disappointing.
Sentiment: Negative (Score: 0.1243)
Performance Optimization
For high-traffic applications, consider these optimizations (a batching configuration sketch follows this list):
- Batching: TensorFlow Serving can batch incoming requests to maximize throughput; enable it with the --enable_batching flag and tune it with a --batching_parameters_file.
- Model optimization: for edge deployment, convert the SavedModel to TensorFlow Lite:
# Convert to TensorFlow Lite for edge deployment
converter = tf.lite.TFLiteConverter.from_saved_model(export_path)
tflite_model = converter.convert()
# Save the TF Lite model
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
- Hardware acceleration: to serve on GPUs, use the GPU build of the server (for example the tensorflow/serving:latest-gpu Docker image together with the NVIDIA container runtime).
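As referenced in the batching item above, here is a sketch of the relevant flags and a batching parameters file; the values shown are illustrative starting points rather than tuned recommendations:
# Enable server-side batching
# tensorflow_model_server ... --enable_batching=true \
#     --batching_parameters_file=/path/to/batching.config
# batching.config (protobuf text format)
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
num_batch_threads { value: 4 }
max_enqueued_batches { value: 100 }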
Summary
In this tutorial, we've covered:
- What TensorFlow Serving is and its key benefits
- How it integrates with the TFX ecosystem
- Multiple deployment options (Docker, direct installation)
- Making predictions with REST API and gRPC
- Advanced features like versioning and A/B testing
- A real-world example of deploying a sentiment analysis model
- Performance optimization techniques
TensorFlow Serving provides a robust solution for deploying machine learning models in production environments. By following the practices outlined in this guide, you can transition from model experimentation to production-ready deployment within the TFX ecosystem.
Additional Resources
- TensorFlow Serving Official Documentation
- TFX Pusher Component Guide
- Model Serving Best Practices
- TensorFlow Serving API Reference
Exercises
- Deploy a pre-trained image classification model using TensorFlow Serving
- Implement A/B testing between two versions of the same model
- Create a simple web application that uses your deployed model via the REST API
- Benchmark the performance difference between REST and gRPC interfaces
- Set up monitoring for your TensorFlow Serving deployment using Prometheus