TensorFlow Preprocessing
Data preprocessing is a critical step in any machine learning workflow. In this tutorial, we'll explore how TensorFlow provides powerful tools to transform, normalize, and prepare your data for training deep learning models.
Introduction to Data Preprocessing
Raw data rarely comes in a format that's immediately suitable for training machine learning models. Data preprocessing involves cleaning, transforming, and organizing raw data before feeding it into a model. TensorFlow offers a rich set of tools to handle these preprocessing tasks efficiently.
Key benefits of TensorFlow's preprocessing capabilities include:
- Processing data on the GPU/TPU for faster execution
- Consistent preprocessing during both training and inference
- Integration with TensorFlow's data pipelines
- Built-in support for common preprocessing operations
Basic Data Preprocessing with TensorFlow
Let's start with some fundamental preprocessing techniques using TensorFlow.
Importing Required Libraries
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Normalization
Normalization rescales numerical features onto a common scale. The tf.keras.layers.Normalization layer standardizes each feature to zero mean and unit variance (so the output is not confined to [0, 1]); for simple min-max style scaling you would use tf.keras.layers.Rescaling instead.
# Sample data
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 9.0]], dtype=np.float32)
# Create a normalization layer
normalizer = tf.keras.layers.Normalization(axis=-1)
# Adapt the normalizer to the data
normalizer.adapt(data)
# Apply normalization
normalized_data = normalizer(data)
print("Original data:")
print(data)
print("\nNormalized data:")
print(normalized_data.numpy())
Output:
Original data:
[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]
Normalized data:
[[-1.2247448 -1.2247448 -1.2247448]
 [ 0.         0.         0.       ]
 [ 1.2247448  1.2247448  1.2247448]]
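Because Normalization is a regular Keras layer, it can also be adapted once and then placed inside a model, so exactly the same statistics are applied during training and inference. A minimal sketch, with an arbitrary toy architecture:
# Embed the adapted normalizer in a small model (architecture is illustrative only)
demo_model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    normalizer,                                   # Reuses the statistics learned via adapt()
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1)
])
# Raw, unnormalized data can now be fed directly to the model
print(demo_model(data).shape)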
One-Hot Encoding
For categorical data, one-hot encoding is a common preprocessing technique:
# Sample categorical data
categories = tf.constant(["apple", "banana", "apple", "orange", "banana", "apple"])
# Create a StringLookup layer for vocabulary management and index lookup
lookup_layer = tf.keras.layers.StringLookup()
lookup_layer.adapt(categories)
# Convert strings to indices
indices = lookup_layer(categories)
# Create a one-hot encoding layer
onehot_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=lookup_layer.vocabulary_size(),
    output_mode="one_hot"
)
# Apply one-hot encoding
onehot_data = onehot_layer(indices)
print("Categories:")
print(categories.numpy())
print("\nOne-hot encoded:")
print(onehot_data.numpy())
Output:
Categories:
[b'apple' b'banana' b'apple' b'orange' b'banana' b'apple']
One-hot encoded:
[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]]
Note that StringLookup reserves index 0 for out-of-vocabulary tokens by default, so the vectors have four columns even though there are only three categories, and column 0 is never set here.
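If you only need the one-hot vectors, StringLookup can produce them directly by setting output_mode, without a separate CategoryEncoding layer:
# One-hot encode straight from the raw strings
onehot_lookup = tf.keras.layers.StringLookup(output_mode="one_hot")
onehot_lookup.adapt(categories)
print(onehot_lookup(categories).numpy())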
The tf.keras.preprocessing Module
TensorFlow provides the tf.keras.preprocessing module, which includes several utilities for preprocessing various data types. Note that recent TensorFlow releases treat these utilities as legacy; the Keras preprocessing layers shown above (and tf.keras.utils) are the recommended replacements, but the module is still widely used in existing code.
Image Preprocessing
TensorFlow offers several tools for preprocessing images:
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# Create an image data generator with augmentation
datagen = ImageDataGenerator(
    rescale=1./255,          # Normalize pixel values
    rotation_range=20,       # Randomly rotate images
    width_shift_range=0.2,   # Randomly shift image horizontally
    height_shift_range=0.2,  # Randomly shift image vertically
    horizontal_flip=True,    # Randomly flip images horizontally
    zoom_range=0.2           # Randomly zoom in on images
)
# Example of loading and preprocessing a single image
img_path = 'sample_image.jpg' # Replace with your image path
img = image.load_img(img_path, target_size=(224, 224))
img_array = image.img_to_array(img)
img_array = np.expand_dims(img_array, axis=0) # Create batch dimension
# Preprocess the image
processed_img = datagen.standardize(img_array)
print(f"Original shape: {img_array.shape}")
print(f"Processed shape: {processed_img.shape}")
print(f"Value range: min={processed_img.min()}, max={processed_img.max()}")
Text Preprocessing
Text data requires special preprocessing techniques:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Sample text data
texts = [
    'I love TensorFlow',
    'TensorFlow is awesome',
    'Data preprocessing is important',
    'Machine learning requires clean data'
]
# Create a tokenizer
tokenizer = Tokenizer(num_words=100) # Keep top 100 words
tokenizer.fit_on_texts(texts)
# Convert texts to sequences of integers
sequences = tokenizer.texts_to_sequences(texts)
print("Sequences:", sequences)
# Pad sequences to ensure uniform length
padded_sequences = pad_sequences(sequences, padding='post', maxlen=6)
print("\nPadded sequences:")
print(padded_sequences)
# Get the word index
word_index = tokenizer.word_index
print("\nWord index:")
for word, index in list(word_index.items())[:10]:  # Show first 10 words
    print(f"{word}: {index}")
Output:
Sequences: [[4, 5, 1], [1, 2, 6], [3, 7, 2, 8], [9, 10, 11, 12, 3]]
Padded sequences:
[[ 4  5  1  0  0  0]
 [ 1  2  6  0  0  0]
 [ 3  7  2  8  0  0]
 [ 9 10 11 12  3  0]]
Word index:
tensorflow: 1
is: 2
data: 3
i: 4
love: 5
awesome: 6
preprocessing: 7
important: 8
machine: 9
learning: 10
Notice that the Tokenizer assigns indices by descending word frequency, so the three words that occur twice (tensorflow, is, data) receive the smallest indices.
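Tokenizer and pad_sequences are also legacy utilities; current TensorFlow versions recommend the TextVectorization layer, which tokenizes, indexes, and pads in one step. A minimal sketch using the same texts (the index assignment may differ from the Tokenizer above):
# Modern alternative: a single TextVectorization layer
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=100,              # Keep at most 100 tokens in the vocabulary
    output_mode="int",           # Emit integer token indices
    output_sequence_length=6     # Pad/truncate every sequence to length 6
)
vectorizer.adapt(texts)
print(vectorizer(texts).numpy())
print(vectorizer.get_vocabulary()[:10])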
The tf.data.Dataset API for Preprocessing
TensorFlow's tf.data.Dataset API provides powerful tools for building efficient data preprocessing pipelines:
# Create a simple dataset
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
# Apply preprocessing operations
processed_dataset = dataset \
    .map(lambda x: x * 2) \
    .map(lambda x: x + 1) \
    .batch(2)
print("Processed dataset elements:")
for element in processed_dataset:
    print(element.numpy())
Output:
Processed dataset elements:
[3 5]
[7 9]
[11]
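map is only one of many transformations; filter, parallel mapping, and batching compose in the same way. A small sketch (the numbers are arbitrary):
# Keep only even numbers, square them in parallel, then batch
pipeline = (
    tf.data.Dataset.range(10)
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x * x, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(3)
)
for batch in pipeline:
    print(batch.numpy())   # [0 4 16], then [36 64]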
Building a Complete Preprocessing Pipeline
Let's build a more comprehensive pipeline for a real dataset:
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Define preprocessing functions
def normalize_image(image, label):
    """Normalize images: `uint8` -> `float32`."""
    return tf.cast(image, tf.float32) / 255.0, label

def augment_image(image, label):
    """Apply random augmentation to the image."""
    # Add random brightness
    image = tf.image.random_brightness(image, max_delta=0.2)
    # Make sure values are still in [0, 1] range
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label
# Create training and validation datasets
BATCH_SIZE = 64
BUFFER_SIZE = 10000 # For shuffling
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
# Apply preprocessing to training data
# (cache after normalization but before augmentation so the random
#  transforms are re-drawn every epoch instead of being cached)
train_dataset = train_dataset \
    .map(normalize_image) \
    .cache() \
    .map(augment_image) \
    .shuffle(BUFFER_SIZE) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.AUTOTUNE)
# Apply preprocessing to test data (no augmentation for test data)
test_dataset = test_dataset \
    .map(normalize_image) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.AUTOTUNE)
# Visualize a batch of preprocessed images
def visualize_batch(batch):
    images, labels = batch
    plt.figure(figsize=(10, 10))
    for i in range(25):
        plt.subplot(5, 5, i + 1)
        plt.imshow(images[i].numpy().reshape(28, 28), cmap='gray')
        plt.title(f'Label: {labels[i].numpy()}')
        plt.axis('off')
    plt.tight_layout()
    plt.show()
# Get a batch from the train dataset
for batch in train_dataset.take(1):
    visualize_batch(batch)
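The preprocessed datasets can be passed directly to model.fit. A minimal sketch of a classifier consuming this pipeline (the architecture is arbitrary and training is left commented out):
# A small classifier that consumes the preprocessed MNIST pipeline
mnist_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
mnist_model.compile(optimizer='adam',
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
# mnist_model.fit(train_dataset, validation_data=test_dataset, epochs=3)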
Advanced Preprocessing with Feature Columns
TensorFlow's feature columns provide a way to bridge the gap between raw data and the features that you feed into your model:
# Sample data (as a pandas DataFrame)
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 60000, 70000, 80000, 90000],
    'city': ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle']
})
# Create feature columns
numeric_feature_columns = [
    tf.feature_column.numeric_column('age'),
    tf.feature_column.numeric_column('income')
]
city_column = tf.feature_column.categorical_column_with_vocabulary_list(
    'city', ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle']
)
# Convert categorical column to one-hot representation
city_one_hot = tf.feature_column.indicator_column(city_column)
# Bucketize the age feature
age_buckets = tf.feature_column.bucketized_column(
    numeric_feature_columns[0], boundaries=[30, 40]
)
# Combine all features
feature_columns = numeric_feature_columns + [city_one_hot, age_buckets]
# Create a feature layer which can be used in models
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
# Convert pandas DataFrame to a tf.data.Dataset
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    df = dataframe.copy()
    # Use a dummy all-zeros target so 'age' stays available as a feature
    # (popping 'age' would break the age feature columns defined above)
    labels = np.zeros(len(df), dtype=np.float32)
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds
# Create the dataset
batch_size = 2
dataset = df_to_dataset(data, batch_size=batch_size)
# Apply the feature layer
for feature_batch, label_batch in dataset.take(1):
    processed_features = feature_layer(feature_batch)
    print("Processed features shape:", processed_features.shape)
    print("First example processed features:", processed_features[0].numpy())
Real-world Example: Image Classification Pipeline
Let's create a complete image preprocessing pipeline for a simple flower classification task:
import tensorflow as tf
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
# Load the dataset
(train_ds, val_ds, test_ds), metadata = tfds.load(
    'tf_flowers',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True,
)
num_classes = metadata.features['label'].num_classes
print(f"Number of classes: {num_classes}")
# Display class names
class_names = metadata.features['label'].names
print("Class names:", class_names)
# Visualize some examples
plt.figure(figsize=(10, 10))
for i, (image, label) in enumerate(train_ds.take(9)):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image)
    plt.title(class_names[label])
    plt.axis('off')
plt.tight_layout()
# plt.show() # Uncomment to display the figure
# Define preprocessing parameters
IMG_SIZE = 224
BATCH_SIZE = 32
def preprocess_image(image, label):
    """Resize images and normalize pixel values."""
    image = tf.image.resize(image, [IMG_SIZE, IMG_SIZE])
    image = tf.cast(image, tf.float32) / 255.0  # Normalize to [0, 1]
    return image, label

def augment_image(image, label):
    """Apply data augmentation."""
    # Randomly flip the image horizontally
    image = tf.image.random_flip_left_right(image)
    # Randomly adjust brightness
    image = tf.image.random_brightness(image, max_delta=0.1)
    # Randomly adjust contrast
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    # Make sure pixel values are still in [0, 1] range
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label
# Apply preprocessing to datasets
train_ds = train_ds \
    .map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE) \
    .map(augment_image, num_parallel_calls=tf.data.AUTOTUNE) \
    .shuffle(1000) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.AUTOTUNE)
val_ds = val_ds \
    .map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.AUTOTUNE)
test_ds = test_ds \
    .map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.AUTOTUNE)
# Create and train a model using this preprocessing pipeline
# Note: the pretrained MobileNetV2 weights expect inputs scaled to [-1, 1]
# (tf.keras.applications.mobilenet_v2.preprocess_input); the [0, 1] scaling
# used here still works but may slightly reduce transfer-learning accuracy.
model = tf.keras.Sequential([
    tf.keras.applications.MobileNetV2(
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
        include_top=False,
        weights='imagenet'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])
# Freeze the base model
model.layers[0].trainable = False
# Compile the model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
# Model summary
model.summary()
# Train the model (commented out since this is just an example)
# history = model.fit(
#     train_ds,
#     validation_data=val_ds,
#     epochs=5
# )
The TensorFlow Extended (TFX) Data Validation Tool
For production-level pipelines, TensorFlow offers TensorFlow Data Validation (TFDV), part of the TFX ecosystem, which computes dataset statistics, infers a schema, and flags anomalies in new data:
# Install TensorFlow Data Validation (if not already installed)
# !pip install tensorflow-data-validation
import tensorflow_data_validation as tfdv
import pandas as pd
# Sample data
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 60000, 70000, 80000, 90000],
    'city': ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle']
})
# Generate statistics
stats = tfdv.generate_statistics_from_dataframe(data)
# Generate a schema from statistics
schema = tfdv.infer_schema(stats)
# Display statistics and schema info
tfdv.display_schema(schema)
# Validate a new dataset against the schema
new_data = pd.DataFrame({
    'age': [22, 32, -5, 100],                              # Note: contains anomalies (-5 and 100)
    'income': [45000, 65000, 75000, 1000000],              # Note: contains an anomaly (1000000)
    'city': ['New York', 'Unknown', 'Chicago', 'Seattle']  # Note: contains an anomaly ("Unknown")
})
new_stats = tfdv.generate_statistics_from_dataframe(new_data)
anomalies = tfdv.validate_statistics(new_stats, schema)
# Display anomalies
tfdv.display_anomalies(anomalies)
Summary
In this tutorial, we explored TensorFlow's comprehensive data preprocessing capabilities:
- Basic Preprocessing: Normalization, one-hot encoding, and other fundamental techniques
- Image and Text Processing: Specialized tools for different data types
- tf.data.Dataset API: Building efficient data pipelines
- Feature Columns: Bridging raw data to model-ready features
- Real-World Applications: Complete preprocessing pipelines for image classification
- Production Tools: TFX Data Validation for robust data pipelines
Proper data preprocessing is crucial for machine learning success. TensorFlow provides all the tools you need to clean, transform, and prepare your data for model training.
Additional Resources
- TensorFlow Data Preprocessing Guide
- tf.data: Build TensorFlow input pipelines
- TensorFlow Feature Columns
- TensorFlow Extended (TFX)
Exercises
- Build a preprocessing pipeline for the MNIST dataset that includes normalization and data augmentation.
- Create a text preprocessing pipeline using the tf.data.Dataset API for sentiment analysis.
- Implement a preprocessing pipeline for a tabular dataset using feature columns.
- Design an image data augmentation pipeline with at least five different augmentation techniques.
- Use TensorFlow Data Validation to analyze and validate a dataset of your choice.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)