Pandas HDF5 Import

Introduction to HDF5 in Pandas

HDF5 (Hierarchical Data Format version 5) is a file format designed to store and organize large amounts of numerical data. When combined with pandas, it provides an excellent solution for storing and accessing large datasets efficiently. The pandas library offers built-in functionality to read from and write to HDF5 files using the HDFStore class and the read_hdf() function.

HDF5 stores are particularly useful when you need to:

  • Work with datasets too large to fit in memory
  • Maintain hierarchical organization of multiple datasets
  • Access data quickly, even from very large files
  • Store pandas objects with their full information (including data types and indexes)

In this guide, we'll explore how to import data from HDF5 files using pandas, and why this might be your preferred storage format for certain scenarios.

Prerequisites

Before working with HDF5 files in pandas, you'll need:

bash
pip install pandas tables

The tables package (the PyTables project) provides the underlying HDF5 functionality that pandas leverages.
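To confirm the dependency is available, a quick sanity check is to import it directly:

python
import tables  # the PyTables project installs under the module name "tables"
print(tables.__version__)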

Basic HDF5 Import with Pandas

Let's start with the basics of importing data from an HDF5 file:

python
import pandas as pd

# Reading an HDF5 file
df = pd.read_hdf('my_data.h5', key='dataset_name')
print(df.head())

The output would look something like:

   column1  column2  column3
0      1.2      3.4      5.6
1      2.3      4.5      6.7
2      3.4      5.6      7.8
3      4.5      6.7      8.9
4      5.6      7.8      9.0

The key parameter is essential as HDF5 files can store multiple datasets, and we need to specify which one we want to read.
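If you're not sure which keys a file contains, you can open it read-only and inspect them first. And when a file holds exactly one pandas object, the key can be omitted entirely. A minimal sketch, assuming the my_data.h5 file from above:

python
import pandas as pd

# Inspect the available keys without loading any data
with pd.HDFStore('my_data.h5', mode='r') as store:
    print(store.keys())  # e.g. ['/dataset_name']

# With exactly one object in the file, the key may be omitted
df = pd.read_hdf('my_data.h5')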

Understanding HDF5 Structure

HDF5 files work like a file system within a file:

  1. Groups: Similar to directories - they can contain other groups or datasets
  2. Datasets: The actual data arrays stored in the file
  3. Attributes: Metadata attached to groups or datasets

In pandas, we typically store DataFrame objects as datasets inside an HDF5 file, using keys as identifiers.
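Slashes in a key map directly onto HDF5 groups, so a single store can hold a directory-like hierarchy. A small sketch (the file name hierarchy.h5 and the keys are illustrative):

python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

with pd.HDFStore('hierarchy.h5', mode='w') as store:
    # Slash-separated keys create nested groups, like directories
    # (names such as '2022' trigger a harmless NaturalNameWarning)
    store.put('sales/2022', df)
    store.put('sales/2023', df)

    # walk() yields (group path, subgroups, pandas objects) per group
    for path, subgroups, leaves in store.walk():
        print(path, subgroups, leaves)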

Creating and Reading from an HDF5 Store

Let's see a complete example of creating an HDF5 file and then reading from it:

python
import pandas as pd
import numpy as np

# Create some sample data
df1 = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(7, 2), columns=['X', 'Y'])

# Write to HDF5 file
with pd.HDFStore('sample_store.h5') as store:
    store.put('dataframe1', df1)
    store.put('dataframe2', df2)

    # Print the contents of the store
    print("Store contents:")
    print(store.info())

# Read from HDF5 file
with pd.HDFStore('sample_store.h5', mode='r') as store:
    # List all keys in the store
    print("\nAvailable keys:", store.keys())

    # Read specific dataframe
    read_df1 = store.get('dataframe1')
    print("\nDataFrame 1:")
    print(read_df1)

# Alternative method to read
read_df2 = pd.read_hdf('sample_store.h5', 'dataframe2')
print("\nDataFrame 2:")
print(read_df2)

Output:

Store contents:
<class 'pandas.io.pytables.HDFStore'>
File path: sample_store.h5
/dataframe1            frame        (shape->[5,3])
/dataframe2            frame        (shape->[7,2])

Available keys: ['/dataframe1', '/dataframe2']

DataFrame 1:
          A         B         C
0  0.454063  0.800193  0.967246
1  0.786799  0.915822  0.393531
2  0.791360  0.528909  0.568332
3  0.925571  0.438178  0.264635
4  0.476055  0.679252  0.750341

DataFrame 2:
          X         Y
0  0.020356  0.437264
1 -1.366606 -1.537211
2 -0.580331  1.301770
3  0.286720  1.609199
4  1.369728 -0.192632
5  0.177820  0.258573
6 -0.385074  1.643550

Advanced HDF5 Import Techniques

Specifying Data Columns and Queries

One of the powerful features of HDF5 with pandas is the ability to query data directly from storage without loading the entire dataset:

python
# Create a larger dataset
data = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D'], 10000),
    'value1': np.random.randn(10000),
    'value2': np.random.randn(10000),
    'value3': np.random.rand(10000)
})

# Store with data_columns for querying
data.to_hdf('queryable.h5', key='data', format='table', data_columns=True)

# Now we can query without loading the whole dataset
result = pd.read_hdf('queryable.h5', 'data', where='category="A" & value1>0')
print(f"Found {len(result)} rows matching the query")
print(result.head())

Output:

Found 1258 rows matching the query
   category    value1    value2    value3
2         A  0.030936  0.196930  0.875933
4         A  0.083854  1.408124  0.629321
8         A  1.172823  0.170708  0.933428
10        A  0.297109 -0.083165  0.064870
15        A  1.418179 -2.290428  0.555552
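The where string uses PyTables' query syntax: comparisons on data columns can be combined with & and |, and a column can also be matched against a list of values. A short sketch against the same queryable.h5 file:

python
# Match several categories at once; list membership is written with =
subset = pd.read_hdf('queryable.h5', 'data',
                     where='category=["A", "B"] & value3<0.5')
print(f"{len(subset)} rows in category A or B with value3 below 0.5")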

Reading Specific Columns

You can also choose to read only specific columns from a large dataset:

python
# Read only specific columns
partial_df = pd.read_hdf('queryable.h5', 'data', columns=['category', 'value1'])
print(partial_df.head())

Output:

  category    value1
0        C  0.358381
1        C -0.498432
2        A  0.030936
3        D  1.405254
4        A  0.083854

Working with Large Datasets

HDF5 really shines when handling datasets too large to fit in memory. Here's how you might work with such data:

python
# Create a large dataset (let's simulate it being very large)
# In a real scenario, this could be much larger
chunk_size = 100000
total_chunks = 10

# Create HDF5 file in chunks
with pd.HDFStore('large_data.h5', mode='w') as store:
    for i in range(total_chunks):
        print(f"Creating chunk {i+1}/{total_chunks}")
        chunk = pd.DataFrame({
            'id': range(i*chunk_size, (i+1)*chunk_size),
            'value': np.random.rand(chunk_size),
            'group': np.random.choice(['X', 'Y', 'Z'], chunk_size)
        })
        # Append to the store
        store.append('dataset', chunk, format='table', data_columns=True)

# Process the data in chunks without loading everything
chunk_iterator = pd.read_hdf('large_data.h5', 'dataset', chunksize=50000)

# Process each chunk
total_rows = 0
for i, chunk in enumerate(chunk_iterator):
    # Do some processing on each chunk
    filtered_rows = len(chunk[chunk['value'] > 0.75])
    total_rows += filtered_rows
    print(f"Chunk {i+1}: Found {filtered_rows} rows with value > 0.75")

print(f"Total qualifying rows: {total_rows}")

Output:

Creating chunk 1/10
Creating chunk 2/10
...
Creating chunk 10/10
Chunk 1: Found 12476 rows with value > 0.75
Chunk 2: Found 12584 rows with value > 0.75
...
Chunk 20: Found 12509 rows with value > 0.75
Total qualifying rows: 250127
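Besides chunked iteration, table-format stores let you read a specific row range directly with the start and stop parameters (a small sketch against the same large_data.h5 file):

python
# Read only rows 100000-100009; the rest of the file is never loaded
window = pd.read_hdf('large_data.h5', 'dataset', start=100000, stop=100010)
print(window)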

Practical Examples

Example 1: Storing Time Series Data

HDF5 is particularly good for time series data. Let's see how to store and retrieve financial time series:

python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Create sample time series data
dates = pd.date_range(start='2022-01-01', periods=1000)
stocks = ['AAPL', 'GOOG', 'MSFT', 'AMZN']

# Create a panel of stock prices
stock_data = {}
for stock in stocks:
    # Create simulated price data with some randomness
    base_price = np.random.randint(50, 500)
    prices = base_price + np.cumsum(np.random.normal(0, 2, len(dates)))
    stock_data[stock] = pd.Series(prices, index=dates)

price_df = pd.DataFrame(stock_data)

# Store in HDF5 file
price_df.to_hdf('stock_prices.h5', key='prices', format='table',
                data_columns=True)

# Read a subset of dates
start_date = '2022-03-01'
end_date = '2022-03-31'
march_prices = pd.read_hdf('stock_prices.h5', 'prices',
                           where=f"index >= '{start_date}' & index <= '{end_date}'")

print(f"Retrieved {len(march_prices)} days of price data for March 2022:")
print(march_prices.head())

# Get data for a specific stock
apple_prices = pd.read_hdf('stock_prices.h5', 'prices', columns=['AAPL'])
print("\nApple stock prices (first 5 days):")
print(apple_prices.head())

Output:

Retrieved 31 days of price data for March 2022:
                  AAPL        GOOG        MSFT        AMZN
2022-03-01  473.986359  429.316047  253.575074  301.859832
2022-03-02  470.506909  428.392252  254.611636  301.935262
2022-03-03  469.842412  429.802378  253.145621  304.654150
2022-03-04  470.875683  432.519478  255.394160  305.058455
2022-03-05  471.295310  433.694654  254.137482  307.631035

Apple stock prices (first 5 days):
                  AAPL
2022-01-01  429.968588
2022-01-02  431.430510
2022-01-03  432.990569
2022-01-04  435.398748
2022-01-05  435.289620

Example 2: Sensor Data Logging System

Let's simulate a system that logs sensor data and uses HDF5 for storage:

python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import time

# Function to simulate sensor readings
def generate_sensor_data(num_readings=100, num_sensors=5):
    timestamps = [datetime.now() - timedelta(seconds=i) for i in range(num_readings)]
    data = {
        'timestamp': timestamps,
        'sensor_id': np.random.choice(range(1, num_sensors+1), num_readings),
        'temperature': np.random.normal(25, 3, num_readings),
        'humidity': np.random.normal(50, 10, num_readings),
        'pressure': np.random.normal(1013, 5, num_readings)
    }
    return pd.DataFrame(data)

# Simulate collecting data in batches
def sensor_logging_system():
    with pd.HDFStore('sensor_logs.h5', mode='w') as store:
        # Simulate collecting 5 batches of data
        for batch in range(5):
            print(f"Collecting batch {batch+1}...")
            batch_data = generate_sensor_data(100)

            # Store data with timestamp as index
            batch_data.set_index('timestamp', inplace=True)
            store.append('sensor_readings', batch_data, format='table',
                         data_columns=True)

            # Wait a bit before the next batch (just for simulation)
            if batch < 4:
                time.sleep(1)

    # Now analyze the data
    with pd.HDFStore('sensor_logs.h5', mode='r') as store:
        # Get data for sensor 1 (select queries the open store directly)
        sensor1_data = store.select('sensor_readings', where='sensor_id=1')

        # Get high temperature readings across all sensors
        high_temp = store.select('sensor_readings', where='temperature>28')

        # Print summary
        print(f"\nCollected {len(store.get('sensor_readings'))} total readings")
        print(f"Sensor 1 has {len(sensor1_data)} readings")
        print(f"Found {len(high_temp)} high temperature readings")

        # Calculate average readings per sensor
        all_data = store.get('sensor_readings')
        sensor_stats = all_data.groupby('sensor_id').agg({
            'temperature': 'mean',
            'humidity': 'mean',
            'pressure': 'mean'
        }).round(2)

        print("\nAverage readings per sensor:")
        print(sensor_stats)

# Run the simulation
sensor_logging_system()

Output:

Collecting batch 1...
Collecting batch 2...
Collecting batch 3...
Collecting batch 4...
Collecting batch 5...

Collected 500 total readings
Sensor 1 has 103 readings
Found 68 high temperature readings

Average readings per sensor:
           temperature  humidity  pressure
sensor_id
1                25.23     49.65   1013.17
2                25.01     50.20   1012.67
3                24.69     50.27   1012.35
4                24.87     49.46   1012.88
5                24.87     49.97   1013.46

Performance Considerations

HDF5 offers excellent performance for certain use cases, but there are some factors to consider:

  1. Format Types (see the sketch after this list):

    • fixed format (the default) is faster for writing but supports only whole-object reads
    • table format enables queries and appending but is slower to write
  2. Compression Options:

    python
    df.to_hdf('compressed.h5', key='data', complevel=9, complib='blosc')
  3. Chunking: HDF5 stores data in chunks; pandas sets chunk sizes automatically, but you may need to tune this for very large datasets

  4. Memory Usage: Even with HDF5, be aware of operations that might load large amounts of data into memory
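A minimal sketch of the format trade-off (the file names are illustrative); only the table version accepts where= queries:

python
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': np.random.randn(1000)})

# fixed: fast writes, but the object can only be read back whole
df.to_hdf('fixed.h5', key='data', format='fixed')

# table: slower writes, but supports where= queries and appends
df.to_hdf('table.h5', key='data', format='table', data_columns=True)

subset = pd.read_hdf('table.h5', 'data', where='x>0')  # works
# pd.read_hdf('fixed.h5', 'data', where='x>0')  # raises TypeError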

Summary

HDF5 is a powerful format for working with large datasets in pandas. In this guide, we covered:

  • Basic import from HDF5 files using pd.read_hdf() and HDFStore
  • Advanced queries to filter data directly from storage
  • Working with large datasets using chunking
  • Practical examples for time series and sensor data
  • Performance considerations

HDF5 provides a flexible, hierarchical way to store multiple pandas objects in a single file, while maintaining good performance for both reading and writing operations. It's especially valuable when working with datasets too large to fit in memory and when you need to query subsets of your data efficiently.

Exercises

  1. Basic Exercise: Create an HDF5 file containing three different DataFrames with various data types. Practice reading them back and exploring the structure.

  2. Query Performance: Create a large DataFrame (1 million+ rows) and save it in both CSV and HDF5 formats. Compare the time it takes to:

    • Load the entire dataset from both formats
    • Query a subset of rows based on conditions
  3. Advanced Storage: Create a hierarchical storage structure within an HDF5 file that mimics a database with multiple tables. Store related datasets in groups (e.g., '/sales/2022', '/sales/2023').

  4. Real-world Project: Download a large public dataset (like NYC Taxi Trips or weather data) and convert it to HDF5 format. Create a data analysis script that efficiently queries and processes specific subsets of the data.

By mastering HDF5 in pandas, you'll have a powerful tool for handling large datasets efficiently in your data science projects!


