TensorFlow Transform
Introduction
TensorFlow Transform (tf.Transform) is a library for preprocessing data with TensorFlow. It's a crucial component of the TensorFlow Extended (TFX) ecosystem that helps solve one of the most common challenges in machine learning: maintaining consistent data preprocessing between training and serving.
When building machine learning pipelines, data preprocessing is a critical step. However, inconsistencies can arise when you preprocess data differently during training versus when you deploy your model to production. TensorFlow Transform addresses this problem by allowing you to define preprocessing pipelines that can be applied consistently at both training and serving time.
In this guide, we'll explore:
- What TensorFlow Transform is and why it's important
- How to define preprocessing pipelines
- How to integrate tf.Transform with your ML workflows
- Real-world applications and best practices
Why Use TensorFlow Transform?
Before diving into how to use tf.Transform, let's understand why it's so valuable:
- Consistency between training and serving: Apply the exact same transformations to training data and serving data.
- Full-pass operations: Perform preprocessing that requires a full pass over the dataset, such as normalizing by the dataset mean and variance.
- Graph integration: Embed preprocessing logic directly into the TensorFlow graph, so the exported model accepts raw inputs.
- Scalable execution: Run preprocessing as an Apache Beam pipeline, from a local run up to a distributed runner.
- Pipeline integration: Work seamlessly with other TFX components.
Getting Started with TensorFlow Transform
Installation
First, let's install TensorFlow Transform:
pip install tensorflow-transform
You might also need to install additional dependencies depending on your setup:
pip install apache-beam tensorflow tfx
Basic Concepts
TensorFlow Transform works with a few key concepts:
- Preprocessing functions: Python functions that define your data transformations
- TFT analyzers: Functions that compute global statistics over the entire dataset
- TFT mappers: Functions that apply transformations to individual examples
Let's explore these concepts with examples.
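Before turning to the library itself, the split between analyzers and mappers can be illustrated in plain Python. This is only a conceptual sketch, not tf.Transform code: the function names `analyze` and `apply_z_score` are made up for illustration, and the use of the population standard deviation is an assumption.

```python
import statistics

def analyze(values):
    """Analyzer phase: one full pass over the training data yields constants."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}

def apply_z_score(value, stats):
    """Mapper phase: transforms a single example using the stored constants."""
    return (value - stats["mean"]) / stats["std"]

training_values = [2.0, 4.0, 6.0, 8.0]
stats = analyze(training_values)  # computed once, then frozen

# The same constants are reused for training rows and for a serving request.
train_scaled = [apply_z_score(v, stats) for v in training_values]
serving_scaled = apply_z_score(5.0, stats)
```

The point of the sketch is that the serving request never triggers a new pass over the data. It only reuses the constants computed during analysis, which is what keeps training and serving consistent.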
Creating a Preprocessing Pipeline
A Simple Example
Here's a basic example of how to use tf.Transform to normalize a numerical feature and map a categorical feature to integer vocabulary indices:
import tensorflow as tf
import tensorflow_transform as tft
import apache_beam as beam
# Define the preprocessing function
def preprocessing_fn(inputs):
    """Preprocessing function for tf.Transform."""
    # Extract features
    numerical_feature = inputs['numerical_feature']
    categorical_feature = inputs['categorical_feature']
    # Normalize the numerical feature using the dataset mean and variance
    normalized_numerical = tft.scale_to_z_score(numerical_feature)
    # Build a vocabulary over the whole dataset and replace each value with
    # its integer index (one-hot encoding is covered later in this guide)
    categorical_index = tft.compute_and_apply_vocabulary(
        categorical_feature, vocab_filename='vocab')
    # Return transformed features
    return {
        'normalized_numerical': normalized_numerical,
        'categorical_index': categorical_index,
        'label': inputs['label']
    }
This preprocessing function takes raw input features and returns transformed features. The key insight is that tf.Transform computes the required statistics (such as the mean, the variance, and the vocabulary) once, in a full pass over the training data, stores them as constants in the transform graph, and then applies the same transformations during both training and serving.
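To make the vocabulary step less abstract, here is a plain-Python sketch of the idea: a full pass builds a frequency-ordered vocabulary, and a per-example lookup maps each value to its index. The helper names are hypothetical, the alphabetical tie-breaking is an arbitrary choice for this sketch, and the default of -1 for unseen values reflects the library's behaviour as we understand it when no out-of-vocabulary buckets are configured.

```python
from collections import Counter

def build_vocabulary(values):
    """Full pass: order distinct values by descending frequency."""
    counts = Counter(values)
    # Ties are broken alphabetically here; that choice is arbitrary for this sketch.
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(ordered)}

def apply_vocabulary(value, vocabulary, default_value=-1):
    """Per-example lookup; values not seen during analysis get default_value."""
    return vocabulary.get(value, default_value)

training_values = ["cash", "card", "cash", "cash", "card", "voucher"]
vocabulary = build_vocabulary(training_values)
indices = [apply_vocabulary(v, vocabulary) for v in training_values]
unseen = apply_vocabulary("crypto", vocabulary)
```

Here `vocabulary` maps `cash` to 0, `card` to 1, and `voucher` to 2, while the unseen value `crypto` falls back to -1.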
Running the Transform
To run the transform, we need to set up an Apache Beam pipeline:
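import tempfile
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils
# Schema of the raw data (the feature types are assumptions for this example)
RAW_FEATURE_SPEC = {
    'numerical_feature': tf.io.FixedLenFeature([], tf.float32),
    'categorical_feature': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}
raw_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec(RAW_FEATURE_SPEC))
def _parse_example(serialized):
    """Decodes one serialized tf.train.Example into a dict of NumPy values."""
    # Simple eager parsing; adequate for small datasets
    parsed = tf.io.parse_single_example(serialized, RAW_FEATURE_SPEC)
    return {key: value.numpy() for key, value in parsed.items()}
def run_transform(data_path, output_dir):
    """Runs the tf.Transform pipeline."""
    # tft_beam.Context supplies the temp directory the analyzers need
    with beam.Pipeline() as pipeline, tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        # Read serialized tf.train.Example records and decode them
        raw_data = (
            pipeline
            | 'ReadData' >> beam.io.ReadFromTFRecord(data_path)
            | 'DecodeData' >> beam.Map(_parse_example)
        )
        # Analyze the dataset and apply the transform in one step
        (transformed_data, transformed_metadata), transform_fn = (
            (raw_data, raw_metadata)
            | 'AnalyzeAndTransform' >> tft_beam.AnalyzeAndTransformDataset(
                preprocessing_fn)
        )
        # Encode the transformed rows as tf.train.Example records and write them
        coder = tft.coders.ExampleProtoCoder(transformed_metadata.schema)
        (transformed_data
         | 'EncodeTransformedData' >> beam.Map(coder.encode)
         | 'WriteTransformedData' >> beam.io.WriteToTFRecord(
             output_dir + '/transformed_data'))
        # Write the transform function for use at training and serving time
        transform_fn | 'WriteTransformFn' >> tft_beam.WriteTransformFn(
            output_dir + '/transform_fn')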
Common Transformations with tf.Transform
TensorFlow Transform provides a rich set of transformations. Here are some of the most commonly used ones:
Scaling Numerical Features
# Scale to [0, 1] range
scaled_0_1 = tft.scale_to_0_1(inputs['feature'])
# Scale to z-score (mean=0, std=1)
scaled_z = tft.scale_to_z_score(inputs['feature'])
# Scale by min-max to a custom range
scaled_custom = tft.scale_by_min_max(
    inputs['feature'],
    output_min=-1.0,
    output_max=1.0
)
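The arithmetic behind min-max scaling to a custom range can be checked by hand with a small plain-Python sketch. This is not the library's implementation: the helper names are illustrative, and the sketch assumes the minimum and maximum differ and that no clipping is applied.

```python
def fit_min_max(values):
    """Full pass over the training data: record the observed range."""
    return min(values), max(values)

def scale_by_min_max(value, minimum, maximum, output_min=0.0, output_max=1.0):
    """Maps [minimum, maximum] linearly onto [output_min, output_max]."""
    fraction = (value - minimum) / (maximum - minimum)
    return fraction * (output_max - output_min) + output_min

training_values = [10.0, 20.0, 30.0, 50.0]
minimum, maximum = fit_min_max(training_values)
scaled = [scale_by_min_max(v, minimum, maximum, -1.0, 1.0) for v in training_values]

# A serving value outside the training range lands outside [-1, 1].
out_of_range = scale_by_min_max(70.0, minimum, maximum, -1.0, 1.0)
```

The last line is worth noting when designing a model: because the range is frozen at analysis time, values seen only at serving time can fall outside the target interval.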
Handling Categorical Features
# Convert to vocabulary indices; top_k and num_oov_buckets bound the index range
indices = tft.compute_and_apply_vocabulary(
    inputs['categorical_feature'],
    top_k=1000,
    num_oov_buckets=1,
    vocab_filename='vocab'
)
# One-hot encoding with a static depth (top_k + num_oov_buckets), because
# dense outputs of a preprocessing function need a fully known shape
one_hot = tf.one_hot(indices=indices, depth=1001)
# Hashed feature
hashed = tft.hash_strings(inputs['feature'], hash_buckets=1000)
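Hashing needs no pass over the data at all: each string is mapped to a bucket by a deterministic hash. The sketch below shows the idea in plain Python. SHA-256 is only a stand-in here, since the library uses its own fingerprint function, so the bucket numbers will not match what tf.Transform produces.

```python
import hashlib

def hash_to_bucket(value, hash_buckets):
    """Maps a string to a stable bucket id in [0, hash_buckets)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then fold into the buckets.
    return int.from_bytes(digest[:8], "big") % hash_buckets

values = ["Credit Card", "Cash", "Credit Card"]
buckets = [hash_to_bucket(v, 1000) for v in values]
```

Python's built-in `hash()` is avoided on purpose: it is salted per process for strings, so it would give different buckets at training and serving time.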
Handling Missing Values
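# Optional features typically arrive as a tf.SparseTensor of shape [batch, 1];
# densify with plain TensorFlow ops, filling gaps with a default
sparse = inputs['feature']
sparse = tf.SparseTensor(sparse.indices, sparse.values, [sparse.dense_shape[0], 1])
filled = tf.squeeze(tf.sparse.to_dense(sparse, default_value=0), axis=1)
# Check if value is missing: mark stored positions, then look for empty rows
stored = tf.SparseTensor(sparse.indices, tf.ones_like(sparse.indices[:, 0]), sparse.dense_shape)
is_missing = tf.equal(tf.squeeze(tf.sparse.to_dense(stored), axis=1), 0)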
Integrating with TensorFlow Models
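After preprocessing your data with tf.Transform, the model trains on the transformed features, while the exported model must accept raw examples and apply the same transform itself. A serving input receiver function wires this up. The example below uses the Estimator API, which was removed in TensorFlow 2.16, so it applies to earlier releases: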
def create_serving_input_receiver_fn(transform_output, label_key='label'):
    """Creates a serving input receiver function."""
    def serving_input_receiver_fn():
        # Raw feature spec without the label, which is unavailable at serving
        # time (the default key assumes the 'label' name used earlier)
        raw_feature_spec = transform_output.raw_feature_spec()
        raw_feature_spec.pop(label_key, None)
        # Placeholder for a batch of serialized tf.train.Example protos
        raw_features = tf.compat.v1.placeholder_with_default(
            input=tf.constant([], dtype=tf.string),
            shape=[None],
            name='input_example_tensor')
        # Parse the raw features
        parsed_features = tf.io.parse_example(
            serialized=raw_features, features=raw_feature_spec)
        # Transform features
        transformed_features = transform_output.transform_raw_features(
            parsed_features)
        # Return the receiver
        return tf.estimator.export.ServingInputReceiver(
            transformed_features, {'examples': raw_features})
    return serving_input_receiver_fn
Then, when training your model, load the transform output and create a serving input receiver function:
# Load the transform output
transform_output = tft.TFTransformOutput(transform_output_dir)
# Create an estimator (feature_columns must describe the transformed features
# and is assumed to be defined elsewhere)
estimator = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[100, 50],
    model_dir=model_dir)
# Train the model (input_fn is assumed to read the transformed TFRecord files)
estimator.train(
    input_fn=lambda: input_fn(train_data_file, transform_output),
    steps=1000)
# Export the model for serving
estimator.export_saved_model(
    export_dir_base=serving_model_dir,
    serving_input_receiver_fn=create_serving_input_receiver_fn(transform_output))
Real-World Example: Preprocessing Taxi Data
Let's look at a more comprehensive example with the Chicago Taxi dataset, which is commonly used in TFX tutorials.
import tensorflow as tf
import tensorflow_transform as tft
import apache_beam as beam
# Define feature columns
NUMERIC_FEATURE_KEYS = [
    # 'tips' is deliberately left out: the label is derived from it
    'trip_miles', 'trip_seconds', 'fare'
]
CATEGORICAL_FEATURE_KEYS = [
    'pickup_community_area', 'dropoff_community_area', 'payment_type'
]
LABEL_KEY = 'tips_bin'  # Binary label: whether tips > 20% of fare
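VOCAB_SIZE = 1000  # Assumed cap on vocabulary entries per categorical feature
OOV_SIZE = 10  # Assumed number of buckets for out-of-vocabulary values
def _fill_in_missing(x):
    """Densifies an optional feature (a SparseTensor of shape [batch, 1])."""
    if not isinstance(x, tf.sparse.SparseTensor):
        return x
    default_value = '' if x.dtype == tf.string else 0
    x = tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1])
    return tf.squeeze(tf.sparse.to_dense(x, default_value), axis=1)
def preprocessing_fn(inputs):
    """TensorFlow Transform preprocessing function."""
    outputs = {}
    # Scale numerical features
    for key in NUMERIC_FEATURE_KEYS:
        # Preserve the raw value
        outputs[key + '_raw'] = inputs[key]
        # Handle missing values
        feature_with_defaults = _fill_in_missing(inputs[key])
        # Scale to z-score
        outputs[key + '_scaled'] = tft.scale_to_z_score(feature_with_defaults)
    # Transform categorical features
    for key in CATEGORICAL_FEATURE_KEYS:
        # Fill missing values, then convert to string
        string_value = tf.as_string(_fill_in_missing(inputs[key]))
        # Compute vocabulary and convert to indices in [0, VOCAB_SIZE + OOV_SIZE)
        indices = tft.compute_and_apply_vocabulary(
            string_value, top_k=VOCAB_SIZE, num_oov_buckets=OOV_SIZE,
            vocab_filename=key)
        # Convert to one-hot encoding; the depth is a static constant so the
        # output has a fully known shape
        one_hot = tf.one_hot(
            indices,
            depth=VOCAB_SIZE + OOV_SIZE,
            on_value=1.0,
            off_value=0.0
        )
        outputs[key + '_onehot'] = one_hot
    # Include the label
    outputs[LABEL_KEY] = _fill_in_missing(inputs[LABEL_KEY])
    return outputs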
Running this transform on the Chicago Taxi dataset would give us properly scaled numeric features and one-hot encoded categorical features, ready to be used in our machine learning model.
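The last step of the categorical branch, turning a vocabulary index into a one-hot row, behaves like the following plain-Python sketch. The helper is illustrative rather than library code. It mirrors the `tf.one_hot` convention that an index outside the valid range yields an all-zero row.

```python
def one_hot(index, depth, on_value=1.0, off_value=0.0):
    """Returns a row of length depth with on_value at position index."""
    # Indices outside [0, depth), such as -1, produce a row of off_value only.
    return [on_value if position == index else off_value
            for position in range(depth)]

indices = [0, 2, -1]
rows = [one_hot(i, depth=4) for i in indices]
```

The row width depends only on `depth`, not on the data in the batch, which is why a fixed depth gives every example the same feature shape.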
Best Practices and Tips
When working with TensorFlow Transform, consider these best practices:
- Plan your preprocessing pipeline carefully: Think about which operations need to be computed over the entire dataset.
- Test your preprocessing functions: Verify that your preprocessing function works as expected with small test datasets.
- Use TFRecord format: For large datasets, use TFRecord format for efficient storage and retrieval.
- Monitor resource usage: Some preprocessing operations can be memory-intensive, especially with large vocabularies.
- Use Apache Beam for scalability: If your dataset is large, take advantage of Apache Beam's distributed processing capabilities.
- Version your preprocessing pipelines: Keep track of changes to your preprocessing function to ensure reproducibility.
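One lightweight way to follow the versioning advice is to fingerprint the preprocessing configuration and store that fingerprint next to the transform output. The sketch below is only one possible approach: the config layout and the `preprocessing_fingerprint` helper are hypothetical, not part of tf.Transform.

```python
import hashlib
import json

def preprocessing_fingerprint(config):
    """Returns a short, stable hash of a JSON-serializable config."""
    # sort_keys makes the result independent of dictionary ordering.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

config_v1 = {
    "numeric": ["trip_miles", "trip_seconds", "fare"],
    "categorical": ["payment_type"],
    "scaling": "z_score",
}
config_v2 = dict(config_v1, scaling="min_max")

fingerprint_v1 = preprocessing_fingerprint(config_v1)
fingerprint_v2 = preprocessing_fingerprint(config_v2)
```

Any change to the feature lists or the scaling choice produces a different fingerprint, which makes it easy to tell whether a saved model and a transform output belong together.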
Summary
TensorFlow Transform is a powerful library for ensuring consistent data preprocessing between training and serving in your machine learning pipelines. It allows you to:
- Define preprocessing operations that require a full pass over the dataset
- Embed preprocessing logic directly into your TensorFlow graph
- Maintain consistency between training and inference
- Scale preprocessing to handle large datasets
By using tf.Transform as part of your TFX pipeline, you can ensure that your data preprocessing is reproducible, efficient, and consistent across all stages of your machine learning workflow.
Exercises
- Create a preprocessing pipeline that normalizes numerical features and bucketizes them into quantiles.
- Build a tf.Transform pipeline that handles text data by tokenizing and creating TF-IDF features.
- Extend the Chicago Taxi example to include feature crosses between pickup and dropoff community areas.
- Create a complete TFX pipeline that includes tf.Transform along with other components like ExampleGen and Trainer.
Happy preprocessing with TensorFlow Transform!