Pandas HDF5 Import
Introduction to HDF5 in Pandas
HDF5 (Hierarchical Data Format version 5) is a file format designed to store and organize large amounts of numerical data. Combined with pandas, it provides an efficient way to store and access large datasets. The pandas library offers built-in functionality for reading from and writing to HDF5 files through the `HDFStore` class and the `read_hdf()` function.
HDF5 stores are particularly useful when you need to:
- Work with datasets too large to fit in memory
- Maintain hierarchical organization of multiple datasets
- Access data quickly, even from very large files
- Store pandas objects with their full information (including data types and indexes)
In this guide, we'll explore how to import data from HDF5 files using pandas, and why this might be your preferred storage format for certain scenarios.
Prerequisites
Before working with HDF5 files in pandas, you'll need:
```bash
pip install pandas tables
```
The PyTables package (installed from PyPI as `tables`) provides the underlying HDF5 functionality that pandas relies on.
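A quick way to confirm the dependency is available (a minimal sanity check, not part of the original setup):

```python
# Verify that PyTables is importable before working with HDF5 files
import tables
print(tables.__version__)
```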
Basic HDF5 Import with Pandas
Let's start with the basics of importing data from an HDF5 file:
```python
import pandas as pd

# Reading an HDF5 file
df = pd.read_hdf('my_data.h5', key='dataset_name')
print(df.head())
```
The output would look something like:
```
   column1  column2  column3
0      1.2      3.4      5.6
1      2.3      4.5      6.7
2      3.4      5.6      7.8
3      4.5      6.7      8.9
4      5.6      7.8      9.0
```
The `key` parameter is essential because HDF5 files can store multiple datasets, so we need to specify which one we want to read.
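If you don't know the available keys ahead of time, you can open the file with `HDFStore` and list them first (a small sketch, reusing the hypothetical `my_data.h5` file from above):

```python
import pandas as pd

# Inspect the store before reading: list every stored key
with pd.HDFStore('my_data.h5', mode='r') as store:
    print(store.keys())  # e.g. ['/dataset_name']
```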
Understanding HDF5 Structure
HDF5 files work like a file system within a file:
- Groups: Similar to directories - they can contain other groups or datasets
- Datasets: The actual data arrays stored in the file
- Attributes: Metadata attached to groups or datasets
In pandas, we typically store DataFrame objects as datasets inside an HDF5 file, using keys as identifiers.
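To make the file-system analogy concrete, keys may contain slashes, which creates nested groups; a minimal sketch (file and key names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.arange(3)})

with pd.HDFStore('hierarchy_demo.h5') as store:
    # Slashes in the key create nested groups, like directories
    store.put('experiments/run1/results', df)
    # walk() yields (group path, subgroups, stored pandas objects)
    for group, subgroups, frames in store.walk():
        print(group, subgroups, frames)
```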
Creating and Reading from an HDF5 Store
Let's see a complete example of creating an HDF5 file and then reading from it:
```python
import pandas as pd
import numpy as np

# Create some sample data
df1 = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(7, 2), columns=['X', 'Y'])

# Write to HDF5 file
with pd.HDFStore('sample_store.h5') as store:
    store.put('dataframe1', df1)
    store.put('dataframe2', df2)

    # Print the contents of the store
    print("Store contents:")
    print(store.info())

# Read from HDF5 file
with pd.HDFStore('sample_store.h5', mode='r') as store:
    # List all keys in the store
    print("\nAvailable keys:", store.keys())

    # Read a specific DataFrame
    read_df1 = store.get('dataframe1')
    print("\nDataFrame 1:")
    print(read_df1)

# Alternative method to read
read_df2 = pd.read_hdf('sample_store.h5', 'dataframe2')
print("\nDataFrame 2:")
print(read_df2)
```
Output:
```
Store contents:
<class 'pandas.io.pytables.HDFStore'>
File path: sample_store.h5
/dataframe1            frame        (shape->[5,3])
/dataframe2            frame        (shape->[7,2])

Available keys: ['/dataframe1', '/dataframe2']

DataFrame 1:
          A         B         C
0  0.454063  0.800193  0.967246
1  0.786799  0.915822  0.393531
2  0.791360  0.528909  0.568332
3  0.925571  0.438178  0.264635
4  0.476055  0.679252  0.750341

DataFrame 2:
          X         Y
0  0.020356  0.437264
1 -1.366606 -1.537211
2 -0.580331  1.301770
3  0.286720  1.609199
4  1.369728 -0.192632
5  0.177820  0.258573
6 -0.385074  1.643550
```
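One detail worth noting: `put()` replaces whatever is already stored under a key, whereas `append()` (shown later) adds rows to an existing table-format key. A short hedged sketch against the store created above:

```python
# put() overwrites the existing key by default
with pd.HDFStore('sample_store.h5') as store:
    store.put('dataframe1', df1.head(2))   # replaces the earlier 5-row frame
    print(store.get('dataframe1').shape)   # (2, 3)
```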
Advanced HDF5 Import Techniques
Specifying Data Columns and Queries
One of the powerful features of HDF5 with pandas is the ability to query data directly from storage without loading the entire dataset:
```python
import pandas as pd
import numpy as np

# Create a larger dataset
data = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D'], 10000),
    'value1': np.random.randn(10000),
    'value2': np.random.randn(10000),
    'value3': np.random.rand(10000)
})

# Store with data_columns so those columns can be used in queries
data.to_hdf('queryable.h5', key='data', format='table', data_columns=True)

# Now we can query without loading the whole dataset
result = pd.read_hdf('queryable.h5', 'data', where='category="A" & value1>0')
print(f"Found {len(result)} rows matching the query")
print(result.head())
```
Output:
```
Found 1258 rows matching the query
   category    value1    value2    value3
2         A  0.030936  0.196930  0.875933
4         A  0.083854  1.408124  0.629321
8         A  1.172823  0.170708  0.933428
10        A  0.297109 -0.083165  0.064870
15        A  1.418179 -2.290428  0.555552
```
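The `where` syntax also supports list membership (a list on the right-hand side means "in") as well as `|`, `&`, and parentheses for grouping; a hedged sketch against the same `queryable.h5` file:

```python
# A list on the right-hand side means "in": category is B or C
subset = pd.read_hdf('queryable.h5', 'data', where='category=["B", "C"]')

# Conditions combine with & (and), | (or), and parentheses
mixed = pd.read_hdf('queryable.h5', 'data',
                    where='(category="B" | category="C") & value3<0.5')
print(len(subset), len(mixed))
```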
Reading Specific Columns
You can also choose to read only specific columns from a large dataset:
```python
# Read only specific columns
partial_df = pd.read_hdf('queryable.h5', 'data', columns=['category', 'value1'])
print(partial_df.head())
```
Output:
```
  category    value1
0        C  0.358381
1        C -0.498432
2        A  0.030936
3        D  1.405254
4        A  0.083854
```
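Row ranges can be restricted too: for table-format stores, `read_hdf()` accepts `start` and `stop` (a minimal sketch):

```python
# Read only rows 100 through 199 of the stored table
window = pd.read_hdf('queryable.h5', 'data', start=100, stop=200)
print(window.shape)  # (100, 4)
```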
Working with Large Datasets
HDF5 really shines when handling datasets too large to fit in memory. Here's how you might work with such data:
```python
import pandas as pd
import numpy as np

# Create a large dataset (let's simulate it being very large)
# In a real scenario, this could be much larger
chunk_size = 100000
total_chunks = 10

# Create the HDF5 file in chunks
with pd.HDFStore('large_data.h5', mode='w') as store:
    for i in range(total_chunks):
        print(f"Creating chunk {i+1}/{total_chunks}")
        chunk = pd.DataFrame({
            'id': range(i*chunk_size, (i+1)*chunk_size),
            'value': np.random.rand(chunk_size),
            'group': np.random.choice(['X', 'Y', 'Z'], chunk_size)
        })
        # Append to the store
        store.append('dataset', chunk, format='table', data_columns=True)

# Process the data in chunks without loading everything
chunk_iterator = pd.read_hdf('large_data.h5', 'dataset', chunksize=50000)

# Process each chunk
total_rows = 0
for i, chunk in enumerate(chunk_iterator):
    # Do some processing on each chunk
    filtered_rows = len(chunk[chunk['value'] > 0.75])
    total_rows += filtered_rows
    print(f"Chunk {i+1}: Found {filtered_rows} rows with value > 0.75")

print(f"Total qualifying rows: {total_rows}")
```
Output:
```
Creating chunk 1/10
Creating chunk 2/10
...
Creating chunk 10/10
Chunk 1: Found 12476 rows with value > 0.75
Chunk 2: Found 12584 rows with value > 0.75
...
Chunk 20: Found 12509 rows with value > 0.75
Total qualifying rows: 250127
```
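Because `value` was stored as a data column, the same filter can also be pushed down into the store, letting PyTables evaluate the predicate on disk instead of looping over chunks in Python; a short hedged alternative:

```python
# Evaluate the predicate on disk and return only the matching rows
qualifying = pd.read_hdf('large_data.h5', 'dataset', where='value > 0.75')
print(f"Total qualifying rows: {len(qualifying)}")
```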
Practical Examples
Example 1: Storing Time Series Data
HDF5 is particularly good for time series data. Let's see how to store and retrieve financial time series:
```python
import pandas as pd
import numpy as np

# Create sample time series data
dates = pd.date_range(start='2022-01-01', periods=1000)
stocks = ['AAPL', 'GOOG', 'MSFT', 'AMZN']

# Build a table of simulated stock prices
stock_data = {}
for stock in stocks:
    # Create simulated price data with some randomness
    base_price = np.random.randint(50, 500)
    prices = base_price + np.cumsum(np.random.normal(0, 2, len(dates)))
    stock_data[stock] = pd.Series(prices, index=dates)

price_df = pd.DataFrame(stock_data)

# Store in HDF5 file
price_df.to_hdf('stock_prices.h5', key='prices', format='table',
                data_columns=True)

# Read a subset of dates
start_date = '2022-03-01'
end_date = '2022-03-31'
march_prices = pd.read_hdf('stock_prices.h5', 'prices',
                           where=f"index >= '{start_date}' & index <= '{end_date}'")
print(f"Retrieved {len(march_prices)} days of price data for March 2022:")
print(march_prices.head())

# Get data for a specific stock
apple_prices = pd.read_hdf('stock_prices.h5', 'prices', columns=['AAPL'])
print("\nApple stock prices (first 5 days):")
print(apple_prices.head())
```
Output:
```
Retrieved 31 days of price data for March 2022:
                  AAPL        GOOG        MSFT        AMZN
2022-03-01  473.986359  429.316047  253.575074  301.859832
2022-03-02  470.506909  428.392252  254.611636  301.935262
2022-03-03  469.842412  429.802378  253.145621  304.654150
2022-03-04  470.875683  432.519478  255.394160  305.058455
2022-03-05  471.295310  433.694654  254.137482  307.631035

Apple stock prices (first 5 days):
                  AAPL
2022-01-01  429.968588
2022-01-02  431.430510
2022-01-03  432.990569
2022-01-04  435.398748
2022-01-05  435.289620
```
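New trading days can later be appended to the same key without rewriting the file; a hedged sketch reusing the `price_df`, `stocks`, and `dates` objects from above:

```python
# Simulate one more day of prices and append it to the existing table
next_day = pd.DataFrame(
    {s: [price_df[s].iloc[-1] + np.random.normal(0, 2)] for s in stocks},
    index=[dates[-1] + pd.Timedelta(days=1)],
)
next_day.to_hdf('stock_prices.h5', key='prices', format='table', append=True)
```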
Example 2: Sensor Data Logging System
Let's simulate a system that logs sensor data and uses HDF5 for storage:
```python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import time

# Function to simulate sensor readings
def generate_sensor_data(num_readings=100, num_sensors=5):
    timestamps = [datetime.now() - timedelta(seconds=i) for i in range(num_readings)]
    data = {
        'timestamp': timestamps,
        'sensor_id': np.random.choice(range(1, num_sensors+1), num_readings),
        'temperature': np.random.normal(25, 3, num_readings),
        'humidity': np.random.normal(50, 10, num_readings),
        'pressure': np.random.normal(1013, 5, num_readings)
    }
    return pd.DataFrame(data)

# Simulate collecting data in batches
def sensor_logging_system():
    with pd.HDFStore('sensor_logs.h5', mode='w') as store:
        # Simulate collecting 5 batches of data
        for batch in range(5):
            print(f"Collecting batch {batch+1}...")
            batch_data = generate_sensor_data(100)

            # Store data with timestamp as index
            batch_data.set_index('timestamp', inplace=True)
            store.append('sensor_readings', batch_data, format='table',
                         data_columns=True)

            # Wait a bit before the next batch (just for simulation)
            if batch < 4:
                time.sleep(1)

    # Now analyze the data
    with pd.HDFStore('sensor_logs.h5', mode='r') as store:
        # Get data for sensor 1
        sensor1_data = pd.read_hdf('sensor_logs.h5', 'sensor_readings',
                                   where='sensor_id=1')

        # Get high temperature readings across all sensors
        high_temp = pd.read_hdf('sensor_logs.h5', 'sensor_readings',
                                where='temperature>28')

        # Print a summary
        print(f"\nCollected {len(store.get('sensor_readings'))} total readings")
        print(f"Sensor 1 has {len(sensor1_data)} readings")
        print(f"Found {len(high_temp)} high temperature readings")

        # Calculate average readings per sensor
        all_data = store.get('sensor_readings')
        sensor_stats = all_data.groupby('sensor_id').agg({
            'temperature': 'mean',
            'humidity': 'mean',
            'pressure': 'mean'
        }).round(2)
        print("\nAverage readings per sensor:")
        print(sensor_stats)

# Run the simulation
sensor_logging_system()
```
Output:
```
Collecting batch 1...
Collecting batch 2...
Collecting batch 3...
Collecting batch 4...
Collecting batch 5...

Collected 500 total readings
Sensor 1 has 103 readings
Found 68 high temperature readings

Average readings per sensor:
           temperature  humidity  pressure
sensor_id
1                25.23     49.65   1013.17
2                25.01     50.20   1012.67
3                24.69     50.27   1012.35
4                24.87     49.46   1012.88
5                24.87     49.97   1013.46
```
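In a long-running logger you would eventually want to prune old rows as well; `HDFStore.remove()` accepts a `where` clause for table-format keys (a hedged sketch; the cutoff is illustrative):

```python
import pandas as pd

# Delete readings older than a cutoff timestamp from the table
with pd.HDFStore('sensor_logs.h5') as store:
    cutoff = pd.Timestamp.now() - pd.Timedelta(minutes=5)
    store.remove('sensor_readings', where=f"index < '{cutoff}'")
```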
Performance Considerations
HDF5 offers excellent performance for certain use cases, but there are some factors to consider:
- Format Types: the `fixed` format (the default) is faster for writing but less flexible, while the `table` format enables queries and appends at the cost of slower writes (see the timing sketch after this list)
- Compression Options: compression can shrink files considerably, for example:

```python
df.to_hdf('compressed.h5', key='data', complevel=9, complib='blosc')
```

- Chunking: HDF5 uses chunks for storage; pandas sets chunk sizes automatically, but you may need to tune this for very large datasets
- Memory Usage: even with HDF5, be aware of operations that might load large amounts of data into memory
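To make the first point concrete, here is a rough, hedged timing sketch comparing write speed for the two formats (file names are illustrative, and absolute numbers depend on your machine):

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000_000, 4), columns=list('ABCD'))

# Write the same DataFrame in both formats and time each write
for fmt in ('fixed', 'table'):
    start = time.perf_counter()
    df.to_hdf(f'perf_{fmt}.h5', key='data', mode='w', format=fmt)
    print(f"{fmt}: wrote 1,000,000 rows in {time.perf_counter() - start:.2f}s")
```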
Summary
HDF5 is a powerful format for working with large datasets in pandas. In this guide, we covered:
- Basic imports from HDF5 files using `pd.read_hdf()` and `HDFStore`
- Advanced queries to filter data directly from storage
- Working with large datasets using chunking
- Practical examples for time series and sensor data
- Performance considerations
HDF5 provides a flexible, hierarchical way to store multiple pandas objects in a single file, while maintaining good performance for both reading and writing operations. It's especially valuable when working with datasets too large to fit in memory and when you need to query subsets of your data efficiently.
Exercises
- Basic Exercise: Create an HDF5 file containing three different DataFrames with various data types. Practice reading them back and exploring the structure.
- Query Performance: Create a large DataFrame (1 million+ rows) and save it in both CSV and HDF5 formats. Compare the time it takes to:
  - Load the entire dataset from both formats
  - Query a subset of rows based on conditions
- Advanced Storage: Create a hierarchical storage structure within an HDF5 file that mimics a database with multiple tables. Store related datasets in groups (e.g., '/sales/2022', '/sales/2023').
- Real-world Project: Download a large public dataset (like NYC Taxi Trips or weather data) and convert it to HDF5 format. Create a data analysis script that efficiently queries and processes specific subsets of the data.
By mastering HDF5 in pandas, you'll have a powerful tool for handling large datasets efficiently in your data science projects!