TensorFlow Preprocessing
Data preprocessing is a critical step in any machine learning workflow. In this tutorial, we'll explore how TensorFlow provides powerful tools to transform, normalize, and prepare your data for training deep learning models.
Introduction to Data Preprocessing
Raw data rarely comes in a format that's immediately suitable for training machine learning models. Data preprocessing involves cleaning, transforming, and organizing raw data before feeding it into a model. TensorFlow offers a rich set of tools to handle these preprocessing tasks efficiently.
Key benefits of TensorFlow's preprocessing capabilities include:
- Processing data on the GPU/TPU for faster execution
- Consistent preprocessing during both training and inference
- Integration with TensorFlow's data pipelines
- Built-in support for common preprocessing operations
Basic Data Preprocessing with TensorFlow
Let's start with some fundamental preprocessing techniques using TensorFlow.
Importing Required Libraries
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Normalization
Normalization rescales numerical features onto a common scale. The tf.keras.layers.Normalization layer standardizes each feature to zero mean and unit variance (so the output is not confined to [0, 1]); for simple min-max style scaling you would use tf.keras.layers.Rescaling instead.
# Sample data
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 9.0]], dtype=np.float32)
# Create a normalization layer
normalizer = tf.keras.layers.Normalization(axis=-1)
# Adapt the normalizer to the data
normalizer.adapt(data)
# Apply normalization
normalized_data = normalizer(data)
print("Original data:")
print(data)
print("\nNormalized data:")
print(normalized_data.numpy())
Output:
Original data:
[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]
Normalized data:
[[-1.2247448 -1.2247448 -1.2247448]
 [ 0.         0.         0.       ]
 [ 1.2247448  1.2247448  1.2247448]]
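Because Normalization is a regular Keras layer, it can also be adapted once and then placed inside a model, so exactly the same statistics are applied during training and inference. A minimal sketch, with an arbitrary toy architecture:
# Embed the adapted normalizer in a small model (architecture is illustrative only)
demo_model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    normalizer,                                   # Reuses the statistics learned via adapt()
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1)
])
# Raw, unnormalized data can now be fed directly to the model
print(demo_model(data).shape)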
One-Hot Encoding
For categorical data, one-hot encoding is a common preprocessing technique:
# Sample categorical data
categories = tf.constant(["apple", "banana", "apple", "orange", "banana", "apple"])
# Create a StringLookup layer for vocabulary management and index lookup
lookup_layer = tf.keras.layers.StringLookup()
lookup_layer.adapt(categories)
# Convert strings to indices
indices = lookup_layer(categories)
# Create a one-hot encoding layer
onehot_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=lookup_layer.vocabulary_size(),
    output_mode="one_hot"
)
# Apply one-hot encoding
onehot_data = onehot_layer(indices)
print("Categories:")
print(categories.numpy())
print("\nOne-hot encoded:")
print(onehot_data.numpy())
Output:
Categories:
[b'apple' b'banana' b'apple' b'orange' b'banana' b'apple']
One-hot encoded:
[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]]
Note that StringLookup reserves index 0 for out-of-vocabulary tokens by default, so the vectors have four columns even though there are only three categories, and column 0 is never set here.
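If you only need the one-hot vectors, StringLookup can produce them directly by setting output_mode, without a separate CategoryEncoding layer:
# One-hot encode straight from the raw strings
onehot_lookup = tf.keras.layers.StringLookup(output_mode="one_hot")
onehot_lookup.adapt(categories)
print(onehot_lookup(categories).numpy())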
The tf.keras.preprocessing Module
TensorFlow provides the tf.keras.preprocessing module, which includes several utilities for preprocessing various data types. Note that recent TensorFlow releases treat these utilities as legacy; the Keras preprocessing layers shown above (and tf.keras.utils) are the recommended replacements, but the module is still widely used in existing code.
Image Preprocessing
TensorFlow offers several tools for preprocessing images:
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# Create an image data generator with augmentation
datagen = ImageDataGenerator(
    rescale=1./255,          # Normalize pixel values
    rotation_range=20,       # Randomly rotate images
    width_shift_range=0.2,   # Randomly shift image horizontally
    height_shift_range=0.2,  # Randomly shift image vertically
    horizontal_flip=True,    # Randomly flip images horizontally
    zoom_range=0.2           # Randomly zoom in on images
)
# Example of loading and preprocessing a single image
img_path = 'sample_image.jpg' # Replace with your image path
img = image.load_img(img_path, target_size=(224, 224))
img_array = image.img_to_array(img)
img_array = np.expand_dims(img_array, axis=0) # Create batch dimension
# Preprocess the image
processed_img = datagen.standardize(img_array)
print(f"Original shape: {img_array.shape}")
print(f"Processed shape: {processed_img.shape}")
print(f"Value range: min={processed_img.min()}, max={processed_img.max()}")
Text Preprocessing
Text data requires special preprocessing techniques:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Sample text data
texts = [
    'I love TensorFlow',
    'TensorFlow is awesome',
    'Data preprocessing is important',
    'Machine learning requires clean data'
]
# Create a tokenizer
tokenizer = Tokenizer(num_words=100) # Keep top 100 words
tokenizer.fit_on_texts(texts)
# Convert texts to sequences of integers
sequences = tokenizer.texts_to_sequences(texts)
print("Sequences:", sequences)
# Pad sequences to ensure uniform length
padded_sequences = pad_sequences(sequences, padding='post', maxlen=6)
print("\nPadded sequences:")
print(padded_sequences)
# Get the word index
word_index = tokenizer.word_index
print("\nWord index:")
for word, index in list(word_index.items())[:10]:  # Show first 10 words
    print(f"{word}: {index}")
Output:
Sequences: [[4, 5, 1], [1, 2, 6], [3, 7, 2, 8], [9, 10, 11, 12, 3]]
Padded sequences:
[[ 4  5  1  0  0  0]
 [ 1  2  6  0  0  0]
 [ 3  7  2  8  0  0]
 [ 9 10 11 12  3  0]]
Word index:
tensorflow: 1
is: 2
data: 3
i: 4
love: 5
awesome: 6
preprocessing: 7
important: 8
machine: 9
learning: 10
Notice that the Tokenizer assigns indices by descending word frequency, so the three words that occur twice (tensorflow, is, data) receive the smallest indices.
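Tokenizer and pad_sequences are also legacy utilities; current TensorFlow versions recommend the TextVectorization layer, which tokenizes, indexes, and pads in one step. A minimal sketch using the same texts (the index assignment may differ from the Tokenizer above):
# Modern alternative: a single TextVectorization layer
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=100,              # Keep at most 100 tokens in the vocabulary
    output_mode="int",           # Emit integer token indices
    output_sequence_length=6     # Pad/truncate every sequence to length 6
)
vectorizer.adapt(texts)
print(vectorizer(texts).numpy())
print(vectorizer.get_vocabulary()[:10])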
The tf.data.Dataset API for Preprocessing
TensorFlow's tf.data.Dataset API provides powerful tools for building efficient data preprocessing pipelines:
# Create a simple dataset
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
# Apply preprocessing operations
processed_dataset = dataset \
    .map(lambda x: x * 2) \
    .map(lambda x: x + 1) \
    .batch(2)
print("Processed dataset elements:")
for element in processed_dataset:
    print(element.numpy())
Output:
Processed dataset elements:
[3 5]
[7 9]
[11]
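map is only one of many transformations; filter, parallel mapping, and batching compose in the same way. A small sketch (the numbers are arbitrary):
# Keep only even numbers, square them in parallel, then batch
pipeline = (
    tf.data.Dataset.range(10)
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x * x, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(3)
)
for batch in pipeline:
    print(batch.numpy())   # [0 4 16], then [36 64]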
Building a Complete Preprocessing Pipeline
Let's build a more comprehensive pipeline for a real dataset:
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Define preprocessing functions
def normalize_image(image, label):
    """Normalize images: `uint8` -> `float32`."""
    return tf.cast(image, tf.float32) / 255.0, label

def augment_image(image, label):
    """Apply random augmentation to the image."""
    # Add random brightness
    image = tf.image.random_brightness(image, max_delta=0.2)
    # Make sure values are still in [0, 1] range
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label
# Create training and validation datasets
BATCH_SIZE = 64
BUFFER_SIZE = 10000 # For shuffling
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
# Apply preprocessing to training data
# (cache after normalization but before augmentation so the random
#  transforms are re-drawn every epoch instead of being cached)
train_dataset = train_dataset \
    .map(normalize_image) \
    .cache() \
    .map(augment_image) \
    .shuffle(BUFFER_SIZE) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.AUTOTUNE)
# Apply preprocessing to test data (no augmentation for test data)
test_dataset = test_dataset \
    .map(normalize_image) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.AUTOTUNE)
# Visualize a batch of preprocessed images
def visualize_batch(batch):
    images, labels = batch
    plt.figure(figsize=(10, 10))
    for i in range(25):
        plt.subplot(5, 5, i + 1)
        plt.imshow(images[i].numpy().reshape(28, 28), cmap='gray')
        plt.title(f'Label: {labels[i].numpy()}')
        plt.axis('off')
    plt.tight_layout()
    plt.show()
# Get a batch from the train dataset
for batch in train_dataset.take(1):
    visualize_batch(batch)
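The preprocessed datasets can be passed directly to model.fit. A minimal sketch of a classifier consuming this pipeline (the architecture is arbitrary and training is left commented out):
# A small classifier that consumes the preprocessed MNIST pipeline
mnist_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
mnist_model.compile(optimizer='adam',
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
# mnist_model.fit(train_dataset, validation_data=test_dataset, epochs=3)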
Advanced Preprocessing with Feature Columns
TensorFlow's feature columns provide a way to bridge the gap between raw data and the features that you feed into your model:
# Sample data (as a pandas DataFrame)
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 60000, 70000, 80000, 90000],
    'city': ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle']
})
# Create feature columns
numeric_feature_columns = [
    tf.feature_column.numeric_column('age'),
    tf.feature_column.numeric_column('income')
]
city_column = tf.feature_column.categorical_column_with_vocabulary_list(
    'city', ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle']
)
# Convert categorical column to one-hot representation
city_one_hot = tf.feature_column.indicator_column(city_column)
# Bucketize the age feature
age_buckets = tf.feature_column.bucketized_column(
    numeric_feature_columns[0], boundaries=[30, 40]
)
# Combine all features
feature_columns = numeric_feature_columns + [city_one_hot, age_buckets]
# Create a feature layer which can be used in models
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
# Convert pandas DataFrame to a tf.data.Dataset
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    df = dataframe.copy()
    # Use a dummy all-zeros target so 'age' stays available as a feature
    # (popping 'age' would break the age feature columns defined above)
    labels = np.zeros(len(df), dtype=np.float32)
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds
# Create the dataset
batch_size = 2
dataset = df_to_dataset(data, batch_size=batch_size)
# Apply the feature layer
for feature_batch, label_batch in dataset.take(1):
    processed_features = feature_layer(feature_batch)
    print("Processed features shape:", processed_features.shape)
    print("First example processed features:", processed_features[0].numpy())
Real-world Example: Image Classification Pipeline
Let's create a complete image preprocessing pipeline for a simple flower classification task:
import tensorflow as tf
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
# Load the dataset
(train_ds, val_ds, test_ds), metadata = tfds.load(
    'tf_flowers',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True,
)
num_classes = metadata.features['label'].num_classes
print(f"Number of classes: {num_classes}")
# Display class names
class_names = metadata.features['label'].names
print("Class names:", class_names)
# Visualize some examples
plt.figure(figsize=(10, 10))
for i, (image, label) in enumerate(train_ds.take(9)):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image)
    plt.title(class_names[label])
    plt.axis('off')
plt.tight_layout()
# plt.show() # Uncomment to display the figure
# Define preprocessing parameters
IMG_SIZE = 224
BATCH_SIZE = 32
def preprocess_image(image, label):
    """Resize images and normalize pixel values."""
    image = tf.image.resize(image, [IMG_SIZE, IMG_SIZE])
    image = tf.cast(image, tf.float32) / 255.0  # Normalize to [0, 1]
    return image, label

def augment_image(image, label):
    """Apply data augmentation."""
    # Randomly flip the image horizontally
    image = tf.image.random_flip_left_right(image)
    # Randomly adjust brightness
    image = tf.image.random_brightness(image, max_delta=0.1)
    # Randomly adjust contrast
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    # Make sure pixel values are still in [0, 1] range
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label
# Apply preprocessing to datasets
train_ds = train_ds \
    .map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE) \
    .map(augment_image, num_parallel_calls=tf.data.AUTOTUNE) \
    .shuffle(1000) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.AUTOTUNE)
val_ds = val_ds \
    .map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.AUTOTUNE)
test_ds = test_ds \
    .map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.AUTOTUNE)
# Create and train a model using this preprocessing pipeline
# Note: the pretrained MobileNetV2 weights expect inputs scaled to [-1, 1]
# (tf.keras.applications.mobilenet_v2.preprocess_input); the [0, 1] scaling
# used here still works but may slightly reduce transfer-learning accuracy.
model = tf.keras.Sequential([
    tf.keras.applications.MobileNetV2(
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
        include_top=False,
        weights='imagenet'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])
# Freeze the base model
model.layers[0].trainable = False
# Compile the model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
# Model summary
model.summary()
# Train the model (commented out since this is just an example)
# history = model.fit(
#     train_ds,
#     validation_data=val_ds,
#     epochs=5
# )
The TensorFlow Extended (TFX) Data Validation Tool
For production-level pipelines, TensorFlow offers TensorFlow Data Validation (TFDV), part of the TFX ecosystem, which computes dataset statistics, infers a schema, and flags anomalies in new data:
# Install TensorFlow Data Validation (if not already installed)
# !pip install tensorflow-data-validation
import tensorflow_data_validation as tfdv
import pandas as pd
# Sample data
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 60000, 70000, 80000, 90000],
    'city': ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle']
})
# Generate statistics
stats = tfdv.generate_statistics_from_dataframe(data)
# Generate a schema from statistics
schema = tfdv.infer_schema(stats)
# Display statistics and schema info
tfdv.display_schema(schema)
# Validate a new dataset against the schema
new_data = pd.DataFrame({
    'age': [22, 32, -5, 100],                              # Note: contains anomalies (-5 and 100)
    'income': [45000, 65000, 75000, 1000000],              # Note: contains an anomaly (1000000)
    'city': ['New York', 'Unknown', 'Chicago', 'Seattle']  # Note: contains an anomaly ("Unknown")
})
new_stats = tfdv.generate_statistics_from_dataframe(new_data)
anomalies = tfdv.validate_statistics(new_stats, schema)
# Display anomalies
tfdv.display_anomalies(anomalies)
Summary
In this tutorial, we explored TensorFlow's comprehensive data preprocessing capabilities:
- Basic Preprocessing: Normalization, one-hot encoding, and other fundamental techniques
- Image and Text Processing: Specialized tools for different data types
- tf.data.Dataset API: Building efficient data pipelines
- Feature Columns: Bridging raw data to model-ready features
- Real-World Applications: Complete preprocessing pipelines for image classification
- Production Tools: TFX Data Validation for robust data pipelines
Proper data preprocessing is crucial for machine learning success. TensorFlow provides all the tools you need to clean, transform, and prepare your data for model training.
Additional Resources
- TensorFlow Data Preprocessing Guide
- tf.data: Build TensorFlow input pipelines
- TensorFlow Feature Columns
- TensorFlow Extended (TFX)
Exercises
- Build a preprocessing pipeline for the MNIST dataset that includes normalization and data augmentation.
- Create a text preprocessing pipeline using the tf.data.Dataset API for sentiment analysis.
- Implement a preprocessing pipeline for a tabular dataset using feature columns.
- Design an image data augmentation pipeline with at least five different augmentation techniques.
- Use TensorFlow Data Validation to analyze and validate a dataset of your choice.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)