Pandas HDF5 Export
Introduction
When working with large datasets in Pandas, you may encounter situations where CSV or Excel files become inefficient for storing and retrieving data. This is where HDF5 (Hierarchical Data Format version 5) comes in. HDF5 is a file format designed to store and organize large amounts of data efficiently. It allows you to store multiple datasets in a single file with a hierarchical structure, similar to how directories and files are organized in a file system.
In this tutorial, you'll learn:
- What HDF5 is and why it's useful
- How to export Pandas DataFrames to HDF5 format
- How to choose between the 'fixed' and 'table' storage formats
- Reading data back from HDF5 files
- Real-world applications and best practices
Prerequisites
To follow along with this tutorial, you'll need:
- Python 3.x installed
- Pandas library
- PyTables library (pandas' optional dependency for HDF5 support)
If you haven't installed these packages yet, you can do so with pip:
pip install pandas tables
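To confirm both libraries are available, a quick check like the following should work (a minimal sketch; any recent pandas and PyTables versions will do):
import pandas as pd
import tables  # PyTables, the engine pandas uses for HDF5

# Print the installed versions to confirm the setup
print("pandas:", pd.__version__)
print("PyTables:", tables.__version__)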
Understanding HDF5
HDF5 stands for Hierarchical Data Format version 5. It's designed to store and organize large amounts of numerical data efficiently. Some key features include:
- Hierarchical structure: Similar to a file system with directories and files
- Fast I/O: Optimized for reading and writing large datasets
- Compression: Built-in data compression options
- Partial I/O: Ability to read only specific portions of datasets
- Cross-platform: Works across different operating systems
These features make HDF5 particularly useful for data science applications where you need to work with large datasets that may not fit entirely in memory.
Basic HDF5 Export with Pandas
Let's start with a simple example of exporting a Pandas DataFrame to HDF5:
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'A': np.random.rand(100000),
'B': np.random.rand(100000),
'C': np.random.choice(['X', 'Y', 'Z'], 100000)
})
# Export the DataFrame to HDF5
df.to_hdf('data.h5', key='my_dataset')
print("DataFrame exported to HDF5 successfully!")
In this basic example:
- We create a DataFrame with 100,000 rows and 3 columns
- We use the to_hdf() method to save it to a file named 'data.h5'
- The key parameter acts like a name for the dataset inside the HDF5 file
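Because HDF5 is hierarchical, the key can also be a slash-separated path, which lets you group related datasets inside one file. Here is a minimal sketch (the group and dataset names are just illustrative):
# Store the same DataFrame under a nested key; intermediate groups
# are created automatically
df.to_hdf('data.h5', key='experiments/run_1/results')

# Read it back using the same path-like key
df_run1 = pd.read_hdf('data.h5', key='experiments/run_1/results')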
Reading Data Back from HDF5
Reading data from an HDF5 file is just as straightforward:
# Read the entire dataset back
df_read = pd.read_hdf('data.h5', key='my_dataset')
# Display the first few rows
print(df_read.head())
Output:
A B C
0 0.746489 0.288607 X
1 0.633362 0.460720 Z
2 0.573954 0.037702 Y
3 0.311154 0.763937 Z
4 0.884792 0.661696 Z
Storage Formats in HDF5
Pandas supports two storage formats when saving to HDF5, chosen with the format parameter (not to be confused with the mode parameter, which controls how the file is opened):
- 'fixed': The default; fast and compact, but the stored data cannot be appended to or queried later
- 'table': A more flexible format that supports appending and querying, including partial data retrieval
Let's see how to use the 'table' format:
# Export DataFrame using table format; columns listed in data_columns
# can be referenced in 'where' queries when reading
df.to_hdf('data_table.h5', key='my_dataset', mode='w', format='table',
          data_columns=['A'])
# Now we can perform queries when reading
subset = pd.read_hdf('data_table.h5', key='my_dataset', where='A > 0.8')
print(f"Number of rows where A > 0.8: {len(subset)}")
Output:
Number of rows where A > 0.8: 19988
The output will vary as we used random data, but it demonstrates how the 'table' format allows us to perform queries during data retrieval.
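If you expect to filter on several columns, you can pass data_columns=True so that every column becomes queryable, at the cost of a slightly larger file. A small sketch along those lines:
# Make every column usable in 'where' expressions
df.to_hdf('data_table.h5', key='my_dataset', mode='w', format='table',
          data_columns=True)

# Combine conditions on multiple columns in a single query
subset = pd.read_hdf('data_table.h5', key='my_dataset',
                     where="A > 0.8 & C == 'X'")
print(f"Rows where A > 0.8 and C == 'X': {len(subset)}")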
Working with Multiple Datasets
One of the advantages of HDF5 is the ability to store multiple datasets in a single file. Let's see how to do this:
# Create two different DataFrames
df1 = pd.DataFrame({
'ID': range(1000),
'Value': np.random.rand(1000)
})
df2 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 45],
'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle']
})
# Save both DataFrames to the same HDF5 file with different keys
with pd.HDFStore('multiple_datasets.h5') as store:
    store.put('numeric_data', df1)
    store.put('user_info', df2)

    # You can see the keys in the store
    print(f"Keys in the store: {store.keys()}")
Output:
Keys in the store: ['/numeric_data', '/user_info']
To read these datasets back:
# Read specific datasets
with pd.HDFStore('multiple_datasets.h5') as store:
    numeric_data = store['numeric_data']
    user_info = store['user_info']

print("Numeric data:")
print(numeric_data.head())
print("\nUser info:")
print(user_info)
Output:
Numeric data:
ID Value
0 0 0.796983
1 1 0.221083
2 2 0.635744
3 3 0.016374
4 4 0.129338
User info:
Name Age City
0 Alice 25 New York
1 Bob 30 Boston
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
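When a file accumulates many keys, you can inspect and loop over them without knowing the names in advance. A minimal sketch, reusing the 'multiple_datasets.h5' file created above:
# Inspect and iterate over every dataset stored in the file
with pd.HDFStore('multiple_datasets.h5') as store:
    print(store.info())          # summary of groups and their layout
    for key in store.keys():     # keys are returned as '/name' paths
        df_k = store[key]
        print(f"{key}: {df_k.shape[0]} rows, {df_k.shape[1]} columns")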
Advanced Features: Compression
To save disk space, especially with large datasets, you can use compression when saving to HDF5:
# Create a large DataFrame
large_df = pd.DataFrame({
'A': np.random.rand(500000),
'B': np.random.rand(500000),
'C': np.random.rand(500000)
})
# Save with no compression
large_df.to_hdf('no_compression.h5', key='data', mode='w')
# Save with compression
large_df.to_hdf('with_compression.h5', key='data', mode='w',
complevel=9, complib='blosc')
# Check file sizes
import os
size_no_compression = os.path.getsize('no_compression.h5') / (1024 * 1024)
size_with_compression = os.path.getsize('with_compression.h5') / (1024 * 1024)
print(f"Size without compression: {size_no_compression:.2f} MB")
print(f"Size with compression: {size_with_compression:.2f} MB")
print(f"Space saved: {(1 - size_with_compression/size_no_compression) * 100:.2f}%")
Output:
Size without compression: 11.44 MB
Size with compression: 3.84 MB
Space saved: 66.44%
The compression achieved can vary depending on your data, but it's often substantial.
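If you want to see how the choice of library and level affects your own data, a quick comparison loop like the following can help (a sketch; 'zlib' and 'blosc' ship with PyTables, while 'lzo' and 'bzip2' may not be available in every installation):
import os

# Try a few compression settings on the same DataFrame and compare file sizes
for complib in ['zlib', 'blosc']:
    for complevel in [1, 5, 9]:
        fname = f'compressed_{complib}_{complevel}.h5'
        large_df.to_hdf(fname, key='data', mode='w',
                        complevel=complevel, complib=complib)
        size_mb = os.path.getsize(fname) / (1024 * 1024)
        print(f"{complib} level {complevel}: {size_mb:.2f} MB")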
Practical Example: Time Series Data Storage
HDF5 is particularly valuable for time series data. Let's create a more practical example:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Create a time series dataset (hourly temperature readings for a year)
start_date = datetime(2023, 1, 1)
dates = [start_date + timedelta(hours=i) for i in range(24 * 365)] # Hourly data for a year
weather_data = pd.DataFrame({
'timestamp': dates,
'temperature': np.random.normal(15, 10, len(dates)), # Mean 15°C, std dev 10°C
'humidity': np.random.normal(70, 20, len(dates)),
'wind_speed': np.random.exponential(5, len(dates))
})
# Set timestamp as index
weather_data.set_index('timestamp', inplace=True)
# Save to HDF5 with table format for querying
weather_data.to_hdf('weather_data.h5', key='weather', mode='w', format='table')
print(f"Saved {len(weather_data)} hourly weather records to HDF5")
Output:
Saved 8760 hourly weather records to HDF5
Now, let's query this dataset to retrieve specific time periods:
# Read only summer months (June, July, August)
summer_data = pd.read_hdf('weather_data.h5', key='weather',
where="index >= '2023-06-01' & index < '2023-09-01'")
print(f"Number of summer readings: {len(summer_data)}")
print("Average summer temperature:", summer_data['temperature'].mean().round(2), "°C")
# Read only data from specific hours (e.g., midday readings).
# HDF5 'where' clauses can't reference derived attributes such as
# index.hour, so we filter in pandas after reading.
all_data = pd.read_hdf('weather_data.h5', key='weather')
midday_data = all_data[all_data.index.hour == 12]
print(f"Number of midday readings: {len(midday_data)}")
print("Average midday temperature:", midday_data['temperature'].mean().round(2), "°C")
Output:
Number of summer readings: 2208
Average summer temperature: 15.39 °C
Number of midday readings: 365
Average midday temperature: 15.23 °C
The summer query demonstrates how HDF5's query capabilities let you work with subsets of large time-series datasets without loading the entire file into memory, while the midday example shows that conditions on derived values (like the hour of a timestamp) still require filtering in pandas after reading.
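For files that are too large to read at once, the table format also supports iterating in chunks, so you can process the data piece by piece. A minimal sketch using the weather file from above:
# Process the table-format file in chunks instead of loading it whole
total_hours_above_25 = 0
with pd.HDFStore('weather_data.h5') as store:
    for chunk in store.select('weather', chunksize=1000):
        total_hours_above_25 += (chunk['temperature'] > 25).sum()
print(f"Hours above 25°C: {total_hours_above_25}")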
Performance Considerations
Let's compare the performance of HDF5 with CSV for a large dataset:
import pandas as pd
import numpy as np
import time
import os
# Create a large DataFrame (1 million rows)
print("Creating large DataFrame...")
large_df = pd.DataFrame({
'ID': range(1000000),
'A': np.random.rand(1000000),
'B': np.random.rand(1000000),
'C': np.random.rand(1000000),
'D': np.random.choice(['X', 'Y', 'Z'], 1000000)
})
# Timing CSV export
print("Exporting to CSV...")
start_time = time.time()
large_df.to_csv('large_data.csv', index=False)
csv_write_time = time.time() - start_time
csv_size = os.path.getsize('large_data.csv') / (1024 * 1024)
# Timing HDF5 export
print("Exporting to HDF5...")
start_time = time.time()
large_df.to_hdf('large_data.h5', key='data', complevel=9, complib='blosc')
hdf_write_time = time.time() - start_time
hdf_size = os.path.getsize('large_data.h5') / (1024 * 1024)
# Timing CSV import
print("Importing from CSV...")
start_time = time.time()
df_from_csv = pd.read_csv('large_data.csv')
csv_read_time = time.time() - start_time
# Timing HDF5 import
print("Importing from HDF5...")
start_time = time.time()
df_from_hdf = pd.read_hdf('large_data.h5', key='data')
hdf_read_time = time.time() - start_time
# Print results
print("\nPerformance Comparison:")
print(f"CSV file size: {csv_size:.2f} MB")
print(f"HDF5 file size: {hdf_size:.2f} MB")
print(f"CSV write time: {csv_write_time:.2f} seconds")
print(f"HDF5 write time: {hdf_write_time:.2f} seconds")
print(f"CSV read time: {csv_read_time:.2f} seconds")
print(f"HDF5 read time: {hdf_read_time:.2f} seconds")
Output:
Creating large DataFrame...
Exporting to CSV...
Exporting to HDF5...
Importing from CSV...
Importing from HDF5...
Performance Comparison:
CSV file size: 76.29 MB
HDF5 file size: 11.46 MB
CSV write time: 3.52 seconds
HDF5 write time: 5.17 seconds
CSV read time: 4.31 seconds
HDF5 read time: 0.62 seconds
The actual values will depend on your system, but you'll typically find that HDF5:
- Takes up significantly less disk space
- May be slightly slower for writing (especially with compression)
- Is significantly faster for reading
- Gets even more efficient as dataset size increases
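One more performance lever worth knowing: with the table format you can read back only the columns and rows you need, rather than the whole file. A sketch, assuming the benchmark DataFrame from above is rewritten in table format first (the benchmark file itself used the default fixed format, which does not support this):
# Rewrite the benchmark data in table format with a queryable column
large_df.to_hdf('large_data_table.h5', key='data', mode='w',
                format='table', data_columns=['D'])

# Read back only two columns, and only the rows where D == 'X'
partial = pd.read_hdf('large_data_table.h5', key='data',
                      columns=['A', 'D'], where="D == 'X'")
print(partial.shape)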
Best Practices for Working with HDF5
Here are some best practices to consider when using HDF5 with Pandas:
- Use the appropriate format: Use 'table' format when you need querying or appending capabilities, and 'fixed' (the default) when optimizing for speed and size.
- Select the right compression: For large datasets, compression can save significant disk space. 'blosc' is often a good balance between compression ratio and speed.
- Use context managers: When working with multiple operations on an HDF5 file, use context managers to ensure the file is properly closed (a small pipeline sketch follows this list):
with pd.HDFStore('my_data.h5') as store:
    store.put('dataset1', df1)
    store.put('dataset2', df2)
    # More operations...
- Handle string data carefully: String data can be tricky in HDF5. Consider using categorical data types for repeated strings (note that categorical columns require the 'table' format):
df['category_column'] = df['category_column'].astype('category')
df.to_hdf('data.h5', key='data', format='table')
- Document your HDF5 structure: For complex HDF5 files with multiple datasets, consider maintaining documentation about what each key represents.
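As a small illustration of the context-manager tip, here is a hedged sketch of an incremental pipeline that appends batches of results to a single table-format dataset (the file name, key, and batching scheme are just examples):
import numpy as np
import pandas as pd

# Append batches of results to one growing dataset inside a single store
with pd.HDFStore('pipeline_results.h5', mode='w') as store:
    for batch in range(5):
        chunk = pd.DataFrame({
            'batch': batch,
            'value': np.random.rand(10_000)
        })
        # append() requires the table format and creates the node on first use
        store.append('results', chunk, format='table', data_columns=['batch'])

    print(f"Total rows stored: {store.get_storer('results').nrows}")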
Summary
In this tutorial, you've learned:
- How to export Pandas DataFrames to HDF5 format
- The difference between the 'fixed' and 'table' storage formats
- How to store multiple datasets in a single HDF5 file
- How to use compression to save disk space
- How to perform queries on HDF5 files
- When to use HDF5 instead of other formats like CSV
HDF5 is an excellent choice for data scientists and analysts working with large datasets, especially when:
- Your data is too large for memory
- You need to query specific portions of data
- Speed of data access is important
- You need to organize multiple related datasets
Additional Resources
To learn more about working with HDF5 in Pandas:
- Official Pandas IO Documentation
- PyTables Documentation
- HDF5 for Python (h5py) - An alternative library for working with HDF5 files
Exercises
- Create a DataFrame with at least 1 million rows and compare the performance of CSV vs. HDF5 for reading and writing on your system.
- Experiment with different compression options (zlib, lzo, bzip2, blosc) and compression levels to find the best balance for your data.
- Create an HDF5 file containing multiple related datasets (e.g., customers, orders, and products) and practice retrieving specific information.
- Implement a data processing pipeline that stores intermediate results in HDF5 format to optimize performance.
- Use the 'where' parameter with read_hdf() to practice running different types of queries on a large dataset.