TensorFlow CSV Processing

Comma-separated values (CSV) files are one of the most common formats for storing structured data. When working with machine learning projects, you'll frequently need to load and process data from CSV files. TensorFlow provides powerful tools through its tf.data API to efficiently handle CSV data, enabling you to build performant input pipelines.

In this tutorial, you'll learn how to:

  • Load data from CSV files using TensorFlow
  • Parse different types of columns
  • Build efficient input pipelines
  • Apply transformations to your CSV data
  • Handle real-world CSV processing scenarios

Understanding CSV Data

CSV files store tabular data in plain text where:

  • Each line represents a record or row
  • Fields/columns are separated by commas (or other delimiters)
  • The first row often contains header information

Here's a simple example of what a CSV file looks like:

name,age,height,is_student
John,28,175.5,true
Mary,32,160.2,false
Alex,21,183.0,true
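
To follow along, you can write this sample to disk first; the examples below assume it is saved as sample_data.csv:

python
# Write the sample CSV used throughout this tutorial
csv_content = """name,age,height,is_student
John,28,175.5,true
Mary,32,160.2,false
Alex,21,183.0,true
"""

with open("sample_data.csv", "w") as f:
    f.write(csv_content)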

Loading CSV Data in TensorFlow

TensorFlow provides several methods to work with CSV data. Let's explore them step by step.

Method 1: Using tf.data.experimental.CsvDataset

The simplest approach is using tf.data.experimental.CsvDataset:

python
import tensorflow as tf

# Define the file path
csv_file = "sample_data.csv"

# Define column types - matching the order in your CSV.
# Note: CSV parsing supports float32, float64, int32, int64, and string,
# so the boolean column is read as a string and converted later if needed.
record_defaults = [tf.string, tf.int32, tf.float32, tf.string]

# Create the dataset
dataset = tf.data.experimental.CsvDataset(
    csv_file,
    record_defaults,
    header=True,  # Skip the header row
    field_delim=",")

# Display the first few elements
for line in dataset.take(2):
    print(f"Name: {line[0].numpy().decode()}, Age: {line[1].numpy()}, "
          f"Height: {line[2].numpy()}, Is Student: {line[3].numpy().decode()}")

Output:

Name: John, Age: 28, Height: 175.5, Is Student: true
Name: Mary, Age: 32, Height: 160.2, Is Student: false

This approach works well for simple use cases but has limitations when dealing with complex CSV files.
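
One feature it does offer is column selection via its select_cols argument. A quick sketch against the sample file, reading only the name and height columns:

python
# Read only name (column 0) and height (column 2); defaults are given
# only for the selected columns
subset = tf.data.experimental.CsvDataset(
    "sample_data.csv",
    [tf.string, tf.float32],
    header=True,
    select_cols=[0, 2])

for name, height in subset.take(1):
    print(name.numpy().decode(), height.numpy())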

Method 2: Using tf.data.TextLineDataset with Parsing

For more flexibility, you can use TextLineDataset and parse each line:

python
import tensorflow as tf

# Define the file path
csv_file = "sample_data.csv"

# Create a dataset from the text file
dataset = tf.data.TextLineDataset(csv_file)

# Skip the header line
dataset = dataset.skip(1)

# CSV parsing function
def parse_csv_line(line):
    # Split by comma
    fields = tf.strings.split(line, ',')

    # Convert to appropriate types
    name = fields[0]
    age = tf.strings.to_number(fields[1], out_type=tf.int32)
    height = tf.strings.to_number(fields[2], out_type=tf.float32)
    # There is no boolean conversion op, so compare against the literal "true"
    is_student = tf.equal(fields[3], "true")

    return name, age, height, is_student

# Apply the parsing function
parsed_dataset = dataset.map(parse_csv_line)

# Display the first few elements
for name, age, height, is_student in parsed_dataset.take(2):
    print(f"Name: {name.numpy().decode()}, Age: {age.numpy()}, "
          f"Height: {height.numpy()}, Is Student: {is_student.numpy()}")

Output:

Name: John, Age: 28, Height: 175.5, Is Student: True
Name: Mary, Age: 32, Height: 160.2, Is Student: False
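
Manual splitting breaks down when fields contain quoted commas. tf.io.decode_csv handles quoting for you and can replace the hand-written split above — a minimal sketch:

python
def parse_with_decode_csv(line):
    # The defaults double as type specs; quoted fields are handled correctly
    name, age, height, is_student = tf.io.decode_csv(
        line, record_defaults=[[""], [0], [0.0], [""]])
    return name, age, height, tf.equal(is_student, "true")

parsed_dataset = dataset.map(parse_with_decode_csv)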

Method 3: Using tf.data.experimental.make_csv_dataset

TensorFlow also provides a high-level API specifically for CSV data that offers more features:

python
import tensorflow as tf

# Define the file path
csv_file = "sample_data.csv"

# Create dataset with advanced features
dataset = tf.data.experimental.make_csv_dataset(
    csv_file,
    batch_size=2,
    # Valid CSV dtypes are float32, float64, int32, int64, and string,
    # so the boolean column is read as a string
    column_defaults=[tf.string, tf.int32, tf.float32, tf.string],
    label_name='is_student',  # Use one column as the label
    num_epochs=1,
    shuffle=True,
    shuffle_buffer_size=10000
)

# Display the first batch
for features, labels in dataset.take(1):
    for name, value in features.items():
        print(f"{name}: {value.numpy()}")
    print(f"Labels (is_student): {labels.numpy()}")

Output:

name: [b'John' b'Alex']
age: [28 21]
height: [175.5 183.0]
Labels (is_student): [b'true' b'true']

This method is particularly useful because it:

  • Automatically handles batching
  • Separates features and labels
  • Returns data as dictionaries keyed by column names
  • Manages shuffling and multiple epochs

Working with Real-World CSV Data

Real-world CSV files often present challenges such as:

  • Missing values
  • Different column types
  • Large file sizes
  • Inconsistent formatting

Let's see how to handle these issues:

Handling Missing Values

python
import tensorflow as tf

# Defaults used when a field is missing
defaults = [
    tf.constant("Unknown", dtype=tf.string),  # name
    tf.constant(0, dtype=tf.int32),           # age
    tf.constant(0.0, dtype=tf.float32),       # height
    tf.constant("false", dtype=tf.string)     # is_student (bool is not a CSV dtype)
]

# Create dataset with handling for missing values
dataset = tf.data.experimental.make_csv_dataset(
    "data_with_missing_values.csv",
    batch_size=5,
    column_defaults=defaults,
    select_columns=['name', 'age', 'height', 'is_student'],  # Select specific columns
    na_value="NA",  # The string that marks a missing value
    num_epochs=1
)

for features in dataset.take(1):
    for name, value in features.items():
        print(f"{name}: {value.numpy()}")
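
Parse-time defaults are simple sentinels; you can also impute after loading with a map step. A minimal sketch, assuming 0.0 marks a missing height and 170.0 is your chosen fill value:

python
def impute_height(features):
    height = features['height']
    # Replace the 0.0 sentinel with a fallback value (170.0 is illustrative)
    features['height'] = tf.where(tf.equal(height, 0.0), 170.0, height)
    return features

dataset = dataset.map(impute_height)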

Creating an Efficient Input Pipeline

For training machine learning models, you need efficient data pipelines:

python
import tensorflow as tf

# Create the dataset
dataset = tf.data.experimental.make_csv_dataset(
    "large_dataset.csv",
    batch_size=32,
    column_defaults=[tf.string, tf.int32, tf.float32, tf.string],
    select_columns=['name', 'age', 'height', 'is_student'],
    label_name='is_student',
    num_epochs=1,  # Read the file once, so it can be cached below
    shuffle=True,
    shuffle_buffer_size=10000
)

# Cache the parsed records in memory. Caching must happen on the finite
# dataset; caching an infinitely repeated dataset never completes a pass.
dataset = dataset.cache()

# Create preprocessing function
def preprocess_data(features, label):
    # Normalize the height value
    features['height'] = features['height'] / 200.0

    # Create an age-range feature (one-hot over decade buckets)
    age_buckets = tf.one_hot(tf.cast(features['age'] // 10, tf.int32), 10)
    features['age_bucket'] = age_buckets

    return features, label

# Apply preprocessing, then repeat indefinitely and prefetch the next
# batch while the current one is being processed
dataset = dataset.map(preprocess_data, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.repeat()
dataset = dataset.prefetch(tf.data.AUTOTUNE)

# Use the dataset for model training
for features_batch, labels_batch in dataset.take(1):
    print("Features:")
    for name, value in features_batch.items():
        print(f"{name}: {value.shape}")
    print(f"Labels: {labels_batch.shape}")
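
To check whether these optimizations pay off, you can time how long it takes to pull a fixed number of batches from the pipeline — a rough benchmark sketch:

python
import time

def benchmark(ds, num_batches=100):
    # Time how long it takes to draw num_batches from the pipeline
    start = time.perf_counter()
    for _ in ds.take(num_batches):
        pass
    elapsed = time.perf_counter() - start
    print(f"{num_batches} batches in {elapsed:.2f}s "
          f"({num_batches / elapsed:.1f} batches/sec)")

benchmark(dataset)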

Practical Example: Student Performance Prediction

Let's walk through a complete example analyzing student data to predict their performance:

python
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Assume we have a CSV with student data:
# student_id,study_hours,sleep_hours,prev_score,final_score
# 1,5.2,7.0,82,88
# 2,3.5,8.0,70,72
# ...

# Step 1: Load the CSV data
csv_file = "student_data.csv"

column_names = ['student_id', 'study_hours', 'sleep_hours', 'prev_score', 'final_score']
column_defaults = [tf.int32, tf.float32, tf.float32, tf.float32, tf.float32]

batch_size = 32
dataset = tf.data.experimental.make_csv_dataset(
    csv_file,
    batch_size,
    column_defaults=column_defaults,
    column_names=column_names,
    label_name='final_score',
    num_epochs=1,  # Finite dataset, so it can be cached and split below
    shuffle=True
)

# Step 2: Data preprocessing
def preprocess(features, label):
    # Drop student_id, as it's not useful for prediction
    features.pop('student_id')

    # Normalize numeric features
    features['study_hours'] = features['study_hours'] / 12.0  # Assuming the max is 12 hours
    features['sleep_hours'] = features['sleep_hours'] / 12.0  # Assuming the max is 12 hours
    features['prev_score'] = features['prev_score'] / 100.0   # Scores are out of 100

    # Normalize the label
    label = label / 100.0

    return features, label

processed_dataset = dataset.map(preprocess)
processed_dataset = processed_dataset.cache().prefetch(tf.data.AUTOTUNE)

# Step 3: Build and train a model
def build_and_compile_model():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(3,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1)
    ])

    model.compile(
        optimizer='adam',
        loss='mse',
        metrics=['mae']
    )

    return model

# Convert the features dict into a single feature tensor
def pack_features(features, labels):
    feature_values = tf.stack([
        features['study_hours'],
        features['sleep_hours'],
        features['prev_score']
    ], axis=1)
    return feature_values, labels

packed_dataset = processed_dataset.map(pack_features)

# Split into training and validation batches. Splitting a shuffled stream
# with take/skip is fine for a demo; for real projects, split the files.
train_data = packed_dataset.take(500)
validation_data = packed_dataset.skip(500).take(100)

model = build_and_compile_model()
history = model.fit(
    train_data,
    epochs=50,
    validation_data=validation_data,
    verbose=0
)

# Step 4: Visualize training results
plt.figure(figsize=(10, 6))
plt.plot(history.history['mae'], label='MAE (Training)')
plt.plot(history.history['val_mae'], label='MAE (Validation)')
plt.title('Mean Absolute Error')
plt.ylabel('MAE Value')
plt.xlabel('Epoch')
plt.legend()
plt.grid(True)
plt.show()

# Step 5: Make predictions with the model
def predict_score(study_hours, sleep_hours, prev_score):
    # Normalize inputs the same way as during training
    study_norm = study_hours / 12.0
    sleep_norm = sleep_hours / 12.0
    prev_norm = prev_score / 100.0

    input_data = tf.constant([[study_norm, sleep_norm, prev_norm]])
    # model.predict returns a NumPy array, so scale it back directly
    prediction = model.predict(input_data, verbose=0)[0][0] * 100

    return float(prediction)

# Example prediction
predicted_score = predict_score(6.0, 7.5, 80.0)
print(f"Predicted final score: {predicted_score:.1f}")

This example showcases a complete workflow:

  1. Loading CSV data
  2. Preprocessing and normalizing features
  3. Building an input pipeline
  4. Training a TensorFlow model
  5. Making predictions with new data

Advanced CSV Processing Techniques

Handling Large CSV Files

For datasets too large to fit in memory:

python
import tensorflow as tf

# Stream the file instead of loading everything into memory
dataset = tf.data.experimental.make_csv_dataset(
    "very_large_file.csv",
    batch_size=1024,
    num_parallel_reads=4,       # Read with multiple threads, interleaving records
    shuffle_buffer_size=10000,  # Buffer for shuffling
)

# Consume the dataset in a memory-efficient way
dataset = dataset.prefetch(tf.data.AUTOTUNE)
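
Because the first argument is a file pattern, the same call can shard reading across many CSV files. A sketch assuming the shards share one schema (the train-*.csv names are illustrative):

python
# A glob pattern pulls in every matching shard
dataset = tf.data.experimental.make_csv_dataset(
    "data/train-*.csv",
    batch_size=1024,
    num_parallel_reads=4,  # Interleave records from several shards at once
)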

Custom CSV Parsing

For complex CSV formats or when you need more control:

python
import tensorflow as tf

# Load the raw text data
raw_dataset = tf.data.TextLineDataset("complex_data.csv")
raw_dataset = raw_dataset.skip(1)  # Skip the header

def custom_parser(line):
    # Strip surrounding quotes, then split by comma
    # (this assumes quoted fields contain no embedded commas)
    line = tf.strings.regex_replace(line, '"', '')
    parts = tf.strings.split(line, ',')

    # Extract fields and convert to appropriate types
    record_id = tf.strings.to_number(parts[0], out_type=tf.int32)
    text = parts[1]
    values = tf.strings.to_number(parts[2:5], out_type=tf.float32)
    date = parts[5]

    # Parse a date like "2022-01-01" into its components
    date_parts = tf.strings.split(date, '-')
    year = tf.strings.to_number(date_parts[0], out_type=tf.int32)
    month = tf.strings.to_number(date_parts[1], out_type=tf.int32)
    day = tf.strings.to_number(date_parts[2], out_type=tf.int32)

    # Create the features dictionary
    features = {
        'id': record_id,
        'text': text,
        'values': values,
        'year': year,
        'month': month,
        'day': day
    }

    return features

# Apply the custom parser
parsed_dataset = raw_dataset.map(custom_parser, num_parallel_calls=tf.data.AUTOTUNE)

Summary

In this tutorial, you've learned how to:

  • Load CSV data using different TensorFlow methods
  • Handle various data types and missing values
  • Create efficient input pipelines for machine learning
  • Preprocess CSV data for model training
  • Work with real-world CSV processing scenarios

TensorFlow's CSV processing capabilities allow you to seamlessly integrate structured data into your machine learning workflows. By using the tf.data API effectively, you can build scalable, efficient input pipelines that handle datasets far larger than memory.

Exercises

  1. Create a CSV dataset with weather information (temperature, humidity, pressure) and build a model to predict rainfall.
  2. Modify the student performance example to include categorical features like "subject" or "study_location".
  3. Benchmark different batch sizes and preprocessing strategies to optimize performance for a large CSV file.
  4. Implement data validation checks to handle corrupted or inconsistent CSV data.
  5. Build a CSV dataset that combines data from multiple files and apply feature engineering.

