Pandas HDF5 Export
Introduction
When working with large datasets in Pandas, you may encounter situations where CSV or Excel files become inefficient for storing and retrieving data. This is where HDF5 (Hierarchical Data Format version 5) comes in. HDF5 is a file format designed to store and organize large amounts of data efficiently. It allows you to store multiple datasets in a single file with a hierarchical structure, similar to how directories and files are organized in a file system.
In this tutorial, you'll learn:
- What HDF5 is and why it's useful
- How to export Pandas DataFrames to HDF5 format
- How to choose between the 'fixed' and 'table' storage formats
- Reading data back from HDF5 files
- Real-world applications and best practices
Prerequisites
To follow along with this tutorial, you'll need:
- Python 3.x installed
- Pandas library
- PyTables library (pandas' optional dependency for HDF5 support)
If you haven't installed these packages yet, you can do so with pip:
pip install pandas tables
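To confirm both libraries are available, a quick check like the following should work (a minimal sketch; any recent pandas and PyTables versions will do):
import pandas as pd
import tables  # PyTables, the engine pandas uses for HDF5

# Print the installed versions to confirm the setup
print("pandas:", pd.__version__)
print("PyTables:", tables.__version__)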
Understanding HDF5
HDF5 stands for Hierarchical Data Format version 5. It's designed to store and organize large amounts of numerical data efficiently. Some key features include:
- Hierarchical structure: Similar to a file system with directories and files
- Fast I/O: Optimized for reading and writing large datasets
- Compression: Built-in data compression options
- Partial I/O: Ability to read only specific portions of datasets
- Cross-platform: Works across different operating systems
These features make HDF5 particularly useful for data science applications where you need to work with large datasets that may not fit entirely in memory.
Basic HDF5 Export with Pandas
Let's start with a simple example of exporting a Pandas DataFrame to HDF5:
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'A': np.random.rand(100000),
'B': np.random.rand(100000),
'C': np.random.choice(['X', 'Y', 'Z'], 100000)
})
# Export the DataFrame to HDF5
df.to_hdf('data.h5', key='my_dataset')
print("DataFrame exported to HDF5 successfully!")
In this basic example:
- We create a DataFrame with 100,000 rows and 3 columns
- We use the to_hdf() method to save it to a file named 'data.h5'
- The key parameter acts like a name for the dataset inside the HDF5 file
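Because HDF5 is hierarchical, the key can also be a slash-separated path, which lets you group related datasets inside one file. Here is a minimal sketch (the group and dataset names are just illustrative):
# Store the same DataFrame under a nested key; intermediate groups
# are created automatically
df.to_hdf('data.h5', key='experiments/run_1/results')

# Read it back using the same path-like key
df_run1 = pd.read_hdf('data.h5', key='experiments/run_1/results')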
Reading Data Back from HDF5
Reading data from an HDF5 file is just as straightforward:
# Read the entire dataset back
df_read = pd.read_hdf('data.h5', key='my_dataset')
# Display the first few rows
print(df_read.head())
Output:
A B C
0 0.746489 0.288607 X
1 0.633362 0.460720 Z
2 0.573954 0.037702 Y
3 0.311154 0.763937 Z
4 0.884792 0.661696 Z
Storage Formats in HDF5
Pandas supports two storage formats when saving to HDF5, chosen with the format parameter (not to be confused with the mode parameter, which controls how the file is opened):
- 'fixed': The default; fast and compact, but the stored data cannot be appended to or queried later
- 'table': A more flexible format that supports appending and querying, including partial data retrieval
Let's see how to use the 'table' format:
# Export DataFrame using table format; columns listed in data_columns
# can be referenced in 'where' queries when reading
df.to_hdf('data_table.h5', key='my_dataset', mode='w', format='table',
          data_columns=['A'])
# Now we can perform queries when reading
subset = pd.read_hdf('data_table.h5', key='my_dataset', where='A > 0.8')
print(f"Number of rows where A > 0.8: {len(subset)}")
Output:
Number of rows where A > 0.8: 19988
The output will vary as we used random data, but it demonstrates how the 'table' format allows us to perform queries during data retrieval.
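If you expect to filter on several columns, you can pass data_columns=True so that every column becomes queryable, at the cost of a slightly larger file. A small sketch along those lines:
# Make every column usable in 'where' expressions
df.to_hdf('data_table.h5', key='my_dataset', mode='w', format='table',
          data_columns=True)

# Combine conditions on multiple columns in a single query
subset = pd.read_hdf('data_table.h5', key='my_dataset',
                     where="A > 0.8 & C == 'X'")
print(f"Rows where A > 0.8 and C == 'X': {len(subset)}")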
Working with Multiple Datasets
One of the advantages of HDF5 is the ability to store multiple datasets in a single file. Let's see how to do this:
# Create two different DataFrames
df1 = pd.DataFrame({
'ID': range(1000),
'Value': np.random.rand(1000)
})
df2 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 45],
'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle']
})
# Save both DataFrames to the same HDF5 file with different keys
with pd.HDFStore('multiple_datasets.h5') as store:
    store.put('numeric_data', df1)
    store.put('user_info', df2)

    # You can see the keys in the store
    print(f"Keys in the store: {store.keys()}")
Output:
Keys in the store: ['/numeric_data', '/user_info']
To read these datasets back:
# Read specific datasets
with pd.HDFStore('multiple_datasets.h5') as store:
    numeric_data = store['numeric_data']
    user_info = store['user_info']

print("Numeric data:")
print(numeric_data.head())
print("\nUser info:")
print(user_info)
Output:
Numeric data:
ID Value
0 0 0.796983
1 1 0.221083
2 2 0.635744
3 3 0.016374
4 4 0.129338
User info:
Name Age City
0 Alice 25 New York
1 Bob 30 Boston
2 Charlie 35 Chicago
3 David 40 Denver
4 Emma 45 Seattle
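When a file accumulates many keys, you can inspect and loop over them without knowing the names in advance. A minimal sketch, reusing the 'multiple_datasets.h5' file created above:
# Inspect and iterate over every dataset stored in the file
with pd.HDFStore('multiple_datasets.h5') as store:
    print(store.info())          # summary of groups and their layout
    for key in store.keys():     # keys are returned as '/name' paths
        df_k = store[key]
        print(f"{key}: {df_k.shape[0]} rows, {df_k.shape[1]} columns")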
Advanced Features: Compression
To save disk space, especially with large datasets, you can use compression when saving to HDF5:
# Create a large DataFrame
large_df = pd.DataFrame({
'A': np.random.rand(500000),
'B': np.random.rand(500000),
'C': np.random.rand(500000)
})
# Save with no compression
large_df.to_hdf('no_compression.h5', key='data', mode='w')
# Save with compression
large_df.to_hdf('with_compression.h5', key='data', mode='w',
complevel=9, complib='blosc')
# Check file sizes
import os
size_no_compression = os.path.getsize('no_compression.h5') / (1024 * 1024)
size_with_compression = os.path.getsize('with_compression.h5') / (1024 * 1024)
print(f"Size without compression: {size_no_compression:.2f} MB")
print(f"Size with compression: {size_with_compression:.2f} MB")
print(f"Space saved: {(1 - size_with_compression/size_no_compression) * 100:.2f}%")
Output:
Size without compression: 11.44 MB
Size with compression: 3.84 MB
Space saved: 66.44%
The compression achieved can vary depending on your data, but it's often substantial.
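If you want to see how the choice of library and level affects your own data, a quick comparison loop like the following can help (a sketch; 'zlib' and 'blosc' ship with PyTables, while 'lzo' and 'bzip2' may not be available in every installation):
import os

# Try a few compression settings on the same DataFrame and compare file sizes
for complib in ['zlib', 'blosc']:
    for complevel in [1, 5, 9]:
        fname = f'compressed_{complib}_{complevel}.h5'
        large_df.to_hdf(fname, key='data', mode='w',
                        complevel=complevel, complib=complib)
        size_mb = os.path.getsize(fname) / (1024 * 1024)
        print(f"{complib} level {complevel}: {size_mb:.2f} MB")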
Practical Example: Time Series Data Storage
HDF5 is particularly valuable for time series data. Let's create a more practical example:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Create a time series dataset (hourly temperature readings for a year)
start_date = datetime(2023, 1, 1)
dates = [start_date + timedelta(hours=i) for i in range(24 * 365)] # Hourly data for a year
weather_data = pd.DataFrame({
'timestamp': dates,
'temperature': np.random.normal(15, 10, len(dates)), # Mean 15°C, std dev 10°C
'humidity': np.random.normal(70, 20, len(dates)),
'wind_speed': np.random.exponential(5, len(dates))
})
# Set timestamp as index
weather_data.set_index('timestamp', inplace=True)
# Save to HDF5 with table format for querying
weather_data.to_hdf('weather_data.h5', key='weather', mode='w', format='table')
print(f"Saved {len(weather_data)} hourly weather records to HDF5")
Output:
Saved 8760 hourly weather records to HDF5
Now, let's query this dataset to retrieve specific time periods:
# Read only summer months (June, July, August)
summer_data = pd.read_hdf('weather_data.h5', key='weather',
where="index >= '2023-06-01' & index < '2023-09-01'")
print(f"Number of summer readings: {len(summer_data)}")
print("Average summer temperature:", summer_data['temperature'].mean().round(2), "°C")
# Read only data from specific hours (e.g., midday readings).
# HDF5 'where' clauses can't reference derived attributes such as
# index.hour, so we filter in pandas after reading.
all_data = pd.read_hdf('weather_data.h5', key='weather')
midday_data = all_data[all_data.index.hour == 12]
print(f"Number of midday readings: {len(midday_data)}")
print("Average midday temperature:", midday_data['temperature'].mean().round(2), "°C")
Output:
Number of summer readings: 2208
Average summer temperature: 15.39 °C
Number of midday readings: 365
Average midday temperature: 15.23 °C
The summer query demonstrates how HDF5's query capabilities let you work with subsets of large time-series datasets without loading the entire file into memory, while the midday example shows that conditions on derived values (like the hour of a timestamp) still require filtering in pandas after reading.
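For files that are too large to read at once, the table format also supports iterating in chunks, so you can process the data piece by piece. A minimal sketch using the weather file from above:
# Process the table-format file in chunks instead of loading it whole
total_hours_above_25 = 0
with pd.HDFStore('weather_data.h5') as store:
    for chunk in store.select('weather', chunksize=1000):
        total_hours_above_25 += (chunk['temperature'] > 25).sum()
print(f"Hours above 25°C: {total_hours_above_25}")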
Performance Considerations
Let's compare the performance of HDF5 with CSV for a large dataset:
import pandas as pd
import numpy as np
import time
import os
# Create a large DataFrame (1 million rows)
print("Creating large DataFrame...")
large_df = pd.DataFrame({
'ID': range(1000000),
'A': np.random.rand(1000000),
'B': np.random.rand(1000000),
'C': np.random.rand(1000000),
'D': np.random.choice(['X', 'Y', 'Z'], 1000000)
})
# Timing CSV export
print("Exporting to CSV...")
start_time = time.time()
large_df.to_csv('large_data.csv', index=False)
csv_write_time = time.time() - start_time
csv_size = os.path.getsize('large_data.csv') / (1024 * 1024)
# Timing HDF5 export
print("Exporting to HDF5...")
start_time = time.time()
large_df.to_hdf('large_data.h5', key='data', complevel=9, complib='blosc')
hdf_write_time = time.time() - start_time
hdf_size = os.path.getsize('large_data.h5') / (1024 * 1024)
# Timing CSV import
print("Importing from CSV...")
start_time = time.time()
df_from_csv = pd.read_csv('large_data.csv')
csv_read_time = time.time() - start_time
# Timing HDF5 import
print("Importing from HDF5...")
start_time = time.time()
df_from_hdf = pd.read_hdf('large_data.h5', key='data')
hdf_read_time = time.time() - start_time
# Print results
print("\nPerformance Comparison:")
print(f"CSV file size: {csv_size:.2f} MB")
print(f"HDF5 file size: {hdf_size:.2f} MB")
print(f"CSV write time: {csv_write_time:.2f} seconds")
print(f"HDF5 write time: {hdf_write_time:.2f} seconds")
print(f"CSV read time: {csv_read_time:.2f} seconds")
print(f"HDF5 read time: {hdf_read_time:.2f} seconds")
Output:
Creating large DataFrame...
Exporting to CSV...
Exporting to HDF5...
Importing from CSV...
Importing from HDF5...
Performance Comparison:
CSV file size: 76.29 MB
HDF5 file size: 11.46 MB
CSV write time: 3.52 seconds
HDF5 write time: 5.17 seconds
CSV read time: 4.31 seconds
HDF5 read time: 0.62 seconds
The actual values will depend on your system, but you'll typically find that HDF5:
- Takes up significantly less disk space
- May be slightly slower for writing (especially with compression)
- Is significantly faster for reading
- Gets even more efficient as dataset size increases
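One more performance lever worth knowing: with the table format you can read back only the columns and rows you need, rather than the whole file. A sketch, assuming the benchmark DataFrame from above is rewritten in table format first (the benchmark file itself used the default fixed format, which does not support this):
# Rewrite the benchmark data in table format with a queryable column
large_df.to_hdf('large_data_table.h5', key='data', mode='w',
                format='table', data_columns=['D'])

# Read back only two columns, and only the rows where D == 'X'
partial = pd.read_hdf('large_data_table.h5', key='data',
                      columns=['A', 'D'], where="D == 'X'")
print(partial.shape)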
Best Practices for Working with HDF5
Here are some best practices to consider when using HDF5 with Pandas:
- Use the appropriate format: Use 'table' format when you need querying or appending capabilities, and 'fixed' (the default) when optimizing for speed and size.
- Select the right compression: For large datasets, compression can save significant disk space. 'blosc' is often a good balance between compression ratio and speed.
- Use context managers: When working with multiple operations on an HDF5 file, use context managers to ensure the file is properly closed (a small pipeline sketch follows this list):
with pd.HDFStore('my_data.h5') as store:
    store.put('dataset1', df1)
    store.put('dataset2', df2)
    # More operations...
- Handle string data carefully: String data can be tricky in HDF5. Consider using categorical data types for repeated strings (note that categorical columns require the 'table' format):
df['category_column'] = df['category_column'].astype('category')
df.to_hdf('data.h5', key='data', format='table')
- Document your HDF5 structure: For complex HDF5 files with multiple datasets, consider maintaining documentation about what each key represents.
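As a small illustration of the context-manager tip, here is a hedged sketch of an incremental pipeline that appends batches of results to a single table-format dataset (the file name, key, and batching scheme are just examples):
import numpy as np
import pandas as pd

# Append batches of results to one growing dataset inside a single store
with pd.HDFStore('pipeline_results.h5', mode='w') as store:
    for batch in range(5):
        chunk = pd.DataFrame({
            'batch': batch,
            'value': np.random.rand(10_000)
        })
        # append() requires the table format and creates the node on first use
        store.append('results', chunk, format='table', data_columns=['batch'])

    print(f"Total rows stored: {store.get_storer('results').nrows}")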
Summary
In this tutorial, you've learned:
- How to export Pandas DataFrames to HDF5 format
- The difference between the 'fixed' and 'table' storage formats
- How to store multiple datasets in a single HDF5 file
- How to use compression to save disk space
- How to perform queries on HDF5 files
- When to use HDF5 instead of other formats like CSV
HDF5 is an excellent choice for data scientists and analysts working with large datasets, especially when:
- Your data is too large for memory
- You need to query specific portions of data
- Speed of data access is important
- You need to organize multiple related datasets
Additional Resources
To learn more about working with HDF5 in Pandas:
- Official Pandas IO Documentation
- PyTables Documentation
- HDF5 for Python (h5py) - An alternative library for working with HDF5 files
Exercises
- Create a DataFrame with at least 1 million rows and compare the performance of CSV vs. HDF5 for reading and writing on your system.
- Experiment with different compression options (zlib, lzo, bzip2, blosc) and compression levels to find the best balance for your data.
- Create an HDF5 file containing multiple related datasets (e.g., customers, orders, and products) and practice retrieving specific information.
- Implement a data processing pipeline that stores intermediate results in HDF5 format to optimize performance.
- Use the 'where' parameter with read_hdf() to practice running different types of queries on a large dataset.