TensorFlow CSV Processing

Comma-separated values (CSV) files are one of the most common formats for storing structured data. When working with machine learning projects, you'll frequently need to load and process data from CSV files. TensorFlow provides powerful tools through its tf.data API to efficiently handle CSV data, enabling you to build performant input pipelines.

In this tutorial, you'll learn how to:

  • Load data from CSV files using TensorFlow
  • Parse different types of columns
  • Build efficient input pipelines
  • Apply transformations to your CSV data
  • Handle real-world CSV processing scenarios

Understanding CSV Data

CSV files store tabular data in plain text where:

  • Each line represents a record or row
  • Fields/columns are separated by commas (or other delimiters)
  • The first row often contains header information

Here's a simple example of what a CSV file looks like:

name,age,height,is_student
John,28,175.5,true
Mary,32,160.2,false
Alex,21,183.0,true
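
To follow along, you can write this sample to disk first; the examples below assume it is saved as sample_data.csv:

python
# Write the sample CSV used throughout this tutorial
csv_content = """name,age,height,is_student
John,28,175.5,true
Mary,32,160.2,false
Alex,21,183.0,true
"""

with open("sample_data.csv", "w") as f:
    f.write(csv_content)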

Loading CSV Data in TensorFlow

TensorFlow provides several methods to work with CSV data. Let's explore them step by step.

Method 1: Using tf.data.experimental.CsvDataset

The simplest approach is using tf.data.experimental.CsvDataset:

python
import tensorflow as tf

# Define the file path
csv_file = "sample_data.csv"

# Define column types - matching the order in your CSV.
# Note: CSV parsing supports float32, float64, int32, int64, and string,
# so the boolean column is read as a string and converted later if needed.
record_defaults = [tf.string, tf.int32, tf.float32, tf.string]

# Create the dataset
dataset = tf.data.experimental.CsvDataset(
    csv_file,
    record_defaults,
    header=True,  # Skip the header row
    field_delim=",")

# Display the first few elements
for line in dataset.take(2):
    print(f"Name: {line[0].numpy().decode()}, Age: {line[1].numpy()}, "
          f"Height: {line[2].numpy()}, Is Student: {line[3].numpy().decode()}")

Output:

Name: John, Age: 28, Height: 175.5, Is Student: true
Name: Mary, Age: 32, Height: 160.2, Is Student: false

This approach works well for simple use cases but has limitations when dealing with complex CSV files.
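
One feature it does offer is column selection via its select_cols argument. A quick sketch against the sample file, reading only the name and height columns:

python
# Read only name (column 0) and height (column 2); defaults are given
# only for the selected columns
subset = tf.data.experimental.CsvDataset(
    "sample_data.csv",
    [tf.string, tf.float32],
    header=True,
    select_cols=[0, 2])

for name, height in subset.take(1):
    print(name.numpy().decode(), height.numpy())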

Method 2: Using tf.data.TextLineDataset with Parsing

For more flexibility, you can use TextLineDataset and parse each line:

python
import tensorflow as tf

# Define the file path
csv_file = "sample_data.csv"

# Create a dataset from the text file
dataset = tf.data.TextLineDataset(csv_file)

# Skip the header line
dataset = dataset.skip(1)

# CSV parsing function
def parse_csv_line(line):
    # Split by comma
    fields = tf.strings.split(line, ',')

    # Convert to appropriate types
    name = fields[0]
    age = tf.strings.to_number(fields[1], out_type=tf.int32)
    height = tf.strings.to_number(fields[2], out_type=tf.float32)
    # There is no boolean conversion op, so compare against the literal "true"
    is_student = tf.equal(fields[3], "true")

    return name, age, height, is_student

# Apply the parsing function
parsed_dataset = dataset.map(parse_csv_line)

# Display the first few elements
for name, age, height, is_student in parsed_dataset.take(2):
    print(f"Name: {name.numpy().decode()}, Age: {age.numpy()}, "
          f"Height: {height.numpy()}, Is Student: {is_student.numpy()}")

Output:

Name: John, Age: 28, Height: 175.5, Is Student: True
Name: Mary, Age: 32, Height: 160.2, Is Student: False
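
Manual splitting breaks down when fields contain quoted commas. tf.io.decode_csv handles quoting for you and can replace the hand-written split above — a minimal sketch:

python
def parse_with_decode_csv(line):
    # The defaults double as type specs; quoted fields are handled correctly
    name, age, height, is_student = tf.io.decode_csv(
        line, record_defaults=[[""], [0], [0.0], [""]])
    return name, age, height, tf.equal(is_student, "true")

parsed_dataset = dataset.map(parse_with_decode_csv)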

Method 3: Using tf.data.experimental.make_csv_dataset

TensorFlow also provides a high-level API specifically for CSV data that offers more features:

python
import tensorflow as tf

# Define the file path
csv_file = "sample_data.csv"

# Create dataset with advanced features
dataset = tf.data.experimental.make_csv_dataset(
    csv_file,
    batch_size=2,
    # Valid CSV dtypes are float32, float64, int32, int64, and string,
    # so the boolean column is read as a string
    column_defaults=[tf.string, tf.int32, tf.float32, tf.string],
    label_name='is_student',  # Use one column as the label
    num_epochs=1,
    shuffle=True,
    shuffle_buffer_size=10000
)

# Display the first batch
for features, labels in dataset.take(1):
    for name, value in features.items():
        print(f"{name}: {value.numpy()}")
    print(f"Labels (is_student): {labels.numpy()}")

Output:

name: [b'John' b'Alex']
age: [28 21]
height: [175.5 183.0]
Labels (is_student): [b'true' b'true']

This method is particularly useful because it:

  • Automatically handles batching
  • Separates features and labels
  • Returns data as dictionaries keyed by column names
  • Manages shuffling and multiple epochs

Working with Real-World CSV Data

Real-world CSV files often present challenges such as:

  • Missing values
  • Different column types
  • Large file sizes
  • Inconsistent formatting

Let's see how to handle these issues:

Handling Missing Values

python
import tensorflow as tf

# Defaults used when a field is missing
defaults = [
    tf.constant("Unknown", dtype=tf.string),  # name
    tf.constant(0, dtype=tf.int32),           # age
    tf.constant(0.0, dtype=tf.float32),       # height
    tf.constant("false", dtype=tf.string)     # is_student (bool is not a CSV dtype)
]

# Create dataset with handling for missing values
dataset = tf.data.experimental.make_csv_dataset(
    "data_with_missing_values.csv",
    batch_size=5,
    column_defaults=defaults,
    select_columns=['name', 'age', 'height', 'is_student'],  # Select specific columns
    na_value="NA",  # The string that marks a missing value
    num_epochs=1
)

for features in dataset.take(1):
    for name, value in features.items():
        print(f"{name}: {value.numpy()}")
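
Parse-time defaults are simple sentinels; you can also impute after loading with a map step. A minimal sketch, assuming 0.0 marks a missing height and 170.0 is your chosen fill value:

python
def impute_height(features):
    height = features['height']
    # Replace the 0.0 sentinel with a fallback value (170.0 is illustrative)
    features['height'] = tf.where(tf.equal(height, 0.0), 170.0, height)
    return features

dataset = dataset.map(impute_height)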

Creating an Efficient Input Pipeline

For training machine learning models, you need efficient data pipelines:

python
import tensorflow as tf

# Create the dataset
dataset = tf.data.experimental.make_csv_dataset(
    "large_dataset.csv",
    batch_size=32,
    column_defaults=[tf.string, tf.int32, tf.float32, tf.string],
    select_columns=['name', 'age', 'height', 'is_student'],
    label_name='is_student',
    num_epochs=1,  # Read the file once, so it can be cached below
    shuffle=True,
    shuffle_buffer_size=10000
)

# Cache the parsed records in memory. Caching must happen on the finite
# dataset; caching an infinitely repeated dataset never completes a pass.
dataset = dataset.cache()

# Create preprocessing function
def preprocess_data(features, label):
    # Normalize the height value
    features['height'] = features['height'] / 200.0

    # Create an age-range feature (one-hot over decade buckets)
    age_buckets = tf.one_hot(tf.cast(features['age'] // 10, tf.int32), 10)
    features['age_bucket'] = age_buckets

    return features, label

# Apply preprocessing, then repeat indefinitely and prefetch the next
# batch while the current one is being processed
dataset = dataset.map(preprocess_data, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.repeat()
dataset = dataset.prefetch(tf.data.AUTOTUNE)

# Use the dataset for model training
for features_batch, labels_batch in dataset.take(1):
    print("Features:")
    for name, value in features_batch.items():
        print(f"{name}: {value.shape}")
    print(f"Labels: {labels_batch.shape}")
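
To check whether these optimizations pay off, you can time how long it takes to pull a fixed number of batches from the pipeline — a rough benchmark sketch:

python
import time

def benchmark(ds, num_batches=100):
    # Time how long it takes to draw num_batches from the pipeline
    start = time.perf_counter()
    for _ in ds.take(num_batches):
        pass
    elapsed = time.perf_counter() - start
    print(f"{num_batches} batches in {elapsed:.2f}s "
          f"({num_batches / elapsed:.1f} batches/sec)")

benchmark(dataset)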

Practical Example: Student Performance Prediction

Let's walk through a complete example analyzing student data to predict their performance:

python
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Assume we have a CSV with student data:
# student_id,study_hours,sleep_hours,prev_score,final_score
# 1,5.2,7.0,82,88
# 2,3.5,8.0,70,72
# ...

# Step 1: Load the CSV data
csv_file = "student_data.csv"

column_names = ['student_id', 'study_hours', 'sleep_hours', 'prev_score', 'final_score']
column_defaults = [tf.int32, tf.float32, tf.float32, tf.float32, tf.float32]

batch_size = 32
dataset = tf.data.experimental.make_csv_dataset(
    csv_file,
    batch_size,
    column_defaults=column_defaults,
    column_names=column_names,
    label_name='final_score',
    num_epochs=1,  # Finite dataset, so it can be cached and split below
    shuffle=True
)

# Step 2: Data preprocessing
def preprocess(features, label):
    # Drop student_id, as it's not useful for prediction
    features.pop('student_id')

    # Normalize numeric features
    features['study_hours'] = features['study_hours'] / 12.0  # Assuming the max is 12 hours
    features['sleep_hours'] = features['sleep_hours'] / 12.0  # Assuming the max is 12 hours
    features['prev_score'] = features['prev_score'] / 100.0   # Scores are out of 100

    # Normalize the label
    label = label / 100.0

    return features, label

processed_dataset = dataset.map(preprocess)
processed_dataset = processed_dataset.cache().prefetch(tf.data.AUTOTUNE)

# Step 3: Build and train a model
def build_and_compile_model():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(3,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1)
    ])

    model.compile(
        optimizer='adam',
        loss='mse',
        metrics=['mae']
    )

    return model

# Convert the features dict into a single feature tensor
def pack_features(features, labels):
    feature_values = tf.stack([
        features['study_hours'],
        features['sleep_hours'],
        features['prev_score']
    ], axis=1)
    return feature_values, labels

packed_dataset = processed_dataset.map(pack_features)

# Split into training and validation batches. Splitting a shuffled stream
# with take/skip is fine for a demo; for real projects, split the files.
train_data = packed_dataset.take(500)
validation_data = packed_dataset.skip(500).take(100)

model = build_and_compile_model()
history = model.fit(
    train_data,
    epochs=50,
    validation_data=validation_data,
    verbose=0
)

# Step 4: Visualize training results
plt.figure(figsize=(10, 6))
plt.plot(history.history['mae'], label='MAE (Training)')
plt.plot(history.history['val_mae'], label='MAE (Validation)')
plt.title('Mean Absolute Error')
plt.ylabel('MAE Value')
plt.xlabel('Epoch')
plt.legend()
plt.grid(True)
plt.show()

# Step 5: Make predictions with the model
def predict_score(study_hours, sleep_hours, prev_score):
    # Normalize inputs the same way as during training
    study_norm = study_hours / 12.0
    sleep_norm = sleep_hours / 12.0
    prev_norm = prev_score / 100.0

    input_data = tf.constant([[study_norm, sleep_norm, prev_norm]])
    # model.predict returns a NumPy array, so scale it back directly
    prediction = model.predict(input_data, verbose=0)[0][0] * 100

    return float(prediction)

# Example prediction
predicted_score = predict_score(6.0, 7.5, 80.0)
print(f"Predicted final score: {predicted_score:.1f}")

This example showcases a complete workflow:

  1. Loading CSV data
  2. Preprocessing and normalizing features
  3. Building an input pipeline
  4. Training a TensorFlow model
  5. Making predictions with new data

Advanced CSV Processing Techniques

Handling Large CSV Files

For datasets too large to fit in memory:

python
import tensorflow as tf

# Stream the file instead of loading everything into memory
dataset = tf.data.experimental.make_csv_dataset(
    "very_large_file.csv",
    batch_size=1024,
    num_parallel_reads=4,       # Read with multiple threads, interleaving records
    shuffle_buffer_size=10000,  # Buffer for shuffling
)

# Consume the dataset in a memory-efficient way
dataset = dataset.prefetch(tf.data.AUTOTUNE)
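
Because the first argument is a file pattern, the same call can shard reading across many CSV files. A sketch assuming the shards share one schema (the train-*.csv names are illustrative):

python
# A glob pattern pulls in every matching shard
dataset = tf.data.experimental.make_csv_dataset(
    "data/train-*.csv",
    batch_size=1024,
    num_parallel_reads=4,  # Interleave records from several shards at once
)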

Custom CSV Parsing

For complex CSV formats or when you need more control:

python
import tensorflow as tf

# Load the raw text data
raw_dataset = tf.data.TextLineDataset("complex_data.csv")
raw_dataset = raw_dataset.skip(1)  # Skip the header

def custom_parser(line):
    # Strip surrounding quotes, then split by comma
    # (this assumes quoted fields contain no embedded commas)
    line = tf.strings.regex_replace(line, '"', '')
    parts = tf.strings.split(line, ',')

    # Extract fields and convert to appropriate types
    record_id = tf.strings.to_number(parts[0], out_type=tf.int32)
    text = parts[1]
    values = tf.strings.to_number(parts[2:5], out_type=tf.float32)
    date = parts[5]

    # Parse a date like "2022-01-01" into its components
    date_parts = tf.strings.split(date, '-')
    year = tf.strings.to_number(date_parts[0], out_type=tf.int32)
    month = tf.strings.to_number(date_parts[1], out_type=tf.int32)
    day = tf.strings.to_number(date_parts[2], out_type=tf.int32)

    # Create the features dictionary
    features = {
        'id': record_id,
        'text': text,
        'values': values,
        'year': year,
        'month': month,
        'day': day
    }

    return features

# Apply the custom parser
parsed_dataset = raw_dataset.map(custom_parser, num_parallel_calls=tf.data.AUTOTUNE)

Summary

In this tutorial, you've learned how to:

  • Load CSV data using different TensorFlow methods
  • Handle various data types and missing values
  • Create efficient input pipelines for machine learning
  • Preprocess CSV data for model training
  • Work with real-world CSV processing scenarios

TensorFlow's CSV processing capabilities allow you to seamlessly integrate structured data into your machine learning workflows. By using the tf.data API effectively, you can build scalable, efficient input pipelines that handle datasets far larger than memory.

Exercises

  1. Create a CSV dataset with weather information (temperature, humidity, pressure) and build a model to predict rainfall.
  2. Modify the student performance example to include categorical features like "subject" or "study_location".
  3. Benchmark different batch sizes and preprocessing strategies to optimize performance for a large CSV file.
  4. Implement data validation checks to handle corrupted or inconsistent CSV data.
  5. Build a CSV dataset that combines data from multiple files and apply feature engineering.

