Pandas HDF5 Import
Introduction to HDF5 in Pandas
HDF5 (Hierarchical Data Format version 5) is a file format designed to store and organize large amounts of numerical data. Combined with pandas, it provides an efficient way to store and access large datasets. The pandas library offers built-in functionality for reading from and writing to HDF5 files through the `HDFStore` class and the `read_hdf()` function.
HDF5 stores are particularly useful when you need to:
- Work with datasets too large to fit in memory
- Maintain hierarchical organization of multiple datasets
- Access data quickly, even from very large files
- Store pandas objects with their full information (including data types and indexes)
In this guide, we'll explore how to import data from HDF5 files using pandas, and why this might be your preferred storage format for certain scenarios.
Prerequisites
Before working with HDF5 files in pandas, you'll need:
```bash
pip install pandas tables
```
The PyTables package (installed from PyPI as `tables`) provides the underlying HDF5 functionality that pandas relies on.
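A quick way to confirm the dependency is available (a minimal sanity check, not part of the original setup):

```python
# Verify that PyTables is importable before working with HDF5 files
import tables
print(tables.__version__)
```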
Basic HDF5 Import with Pandas
Let's start with the basics of importing data from an HDF5 file:
```python
import pandas as pd

# Reading an HDF5 file
df = pd.read_hdf('my_data.h5', key='dataset_name')
print(df.head())
```
The output would look something like:
```
   column1  column2  column3
0      1.2      3.4      5.6
1      2.3      4.5      6.7
2      3.4      5.6      7.8
3      4.5      6.7      8.9
4      5.6      7.8      9.0
```
The `key` parameter is essential because HDF5 files can store multiple datasets, so we need to specify which one we want to read.
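If you don't know the available keys ahead of time, you can open the file with `HDFStore` and list them first (a small sketch, reusing the hypothetical `my_data.h5` file from above):

```python
import pandas as pd

# Inspect the store before reading: list every stored key
with pd.HDFStore('my_data.h5', mode='r') as store:
    print(store.keys())  # e.g. ['/dataset_name']
```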
Understanding HDF5 Structure
HDF5 files work like a file system within a file:
- Groups: Similar to directories - they can contain other groups or datasets
- Datasets: The actual data arrays stored in the file
- Attributes: Metadata attached to groups or datasets
In pandas, we typically store DataFrame objects as datasets inside an HDF5 file, using keys as identifiers.
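To make the file-system analogy concrete, keys may contain slashes, which creates nested groups; a minimal sketch (file and key names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.arange(3)})

with pd.HDFStore('hierarchy_demo.h5') as store:
    # Slashes in the key create nested groups, like directories
    store.put('experiments/run1/results', df)
    # walk() yields (group path, subgroups, stored pandas objects)
    for group, subgroups, frames in store.walk():
        print(group, subgroups, frames)
```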
Creating and Reading from an HDF5 Store
Let's see a complete example of creating an HDF5 file and then reading from it:
```python
import pandas as pd
import numpy as np

# Create some sample data
df1 = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(7, 2), columns=['X', 'Y'])

# Write to HDF5 file
with pd.HDFStore('sample_store.h5') as store:
    store.put('dataframe1', df1)
    store.put('dataframe2', df2)

    # Print the contents of the store
    print("Store contents:")
    print(store.info())

# Read from HDF5 file
with pd.HDFStore('sample_store.h5', mode='r') as store:
    # List all keys in the store
    print("\nAvailable keys:", store.keys())

    # Read a specific DataFrame
    read_df1 = store.get('dataframe1')
    print("\nDataFrame 1:")
    print(read_df1)

# Alternative method to read
read_df2 = pd.read_hdf('sample_store.h5', 'dataframe2')
print("\nDataFrame 2:")
print(read_df2)
```
Output:
```
Store contents:
<class 'pandas.io.pytables.HDFStore'>
File path: sample_store.h5
/dataframe1            frame        (shape->[5,3])
/dataframe2            frame        (shape->[7,2])

Available keys: ['/dataframe1', '/dataframe2']

DataFrame 1:
          A         B         C
0  0.454063  0.800193  0.967246
1  0.786799  0.915822  0.393531
2  0.791360  0.528909  0.568332
3  0.925571  0.438178  0.264635
4  0.476055  0.679252  0.750341

DataFrame 2:
          X         Y
0  0.020356  0.437264
1 -1.366606 -1.537211
2 -0.580331  1.301770
3  0.286720  1.609199
4  1.369728 -0.192632
5  0.177820  0.258573
6 -0.385074  1.643550
```
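One detail worth noting: `put()` replaces whatever is already stored under a key, whereas `append()` (shown later) adds rows to an existing table-format key. A short hedged sketch against the store created above:

```python
# put() overwrites the existing key by default
with pd.HDFStore('sample_store.h5') as store:
    store.put('dataframe1', df1.head(2))   # replaces the earlier 5-row frame
    print(store.get('dataframe1').shape)   # (2, 3)
```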
Advanced HDF5 Import Techniques
Specifying Data Columns and Queries
One of the powerful features of HDF5 with pandas is the ability to query data directly from storage without loading the entire dataset:
```python
import pandas as pd
import numpy as np

# Create a larger dataset
data = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D'], 10000),
    'value1': np.random.randn(10000),
    'value2': np.random.randn(10000),
    'value3': np.random.rand(10000)
})

# Store with data_columns so those columns can be used in queries
data.to_hdf('queryable.h5', key='data', format='table', data_columns=True)

# Now we can query without loading the whole dataset
result = pd.read_hdf('queryable.h5', 'data', where='category="A" & value1>0')
print(f"Found {len(result)} rows matching the query")
print(result.head())
```
Output:
```
Found 1258 rows matching the query
   category    value1    value2    value3
2         A  0.030936  0.196930  0.875933
4         A  0.083854  1.408124  0.629321
8         A  1.172823  0.170708  0.933428
10        A  0.297109 -0.083165  0.064870
15        A  1.418179 -2.290428  0.555552
```
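The `where` syntax also supports list membership (a list on the right-hand side means "in") as well as `|`, `&`, and parentheses for grouping; a hedged sketch against the same `queryable.h5` file:

```python
# A list on the right-hand side means "in": category is B or C
subset = pd.read_hdf('queryable.h5', 'data', where='category=["B", "C"]')

# Conditions combine with & (and), | (or), and parentheses
mixed = pd.read_hdf('queryable.h5', 'data',
                    where='(category="B" | category="C") & value3<0.5')
print(len(subset), len(mixed))
```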
Reading Specific Columns
You can also choose to read only specific columns from a large dataset:
```python
# Read only specific columns
partial_df = pd.read_hdf('queryable.h5', 'data', columns=['category', 'value1'])
print(partial_df.head())
```
Output:
```
  category    value1
0        C  0.358381
1        C -0.498432
2        A  0.030936
3        D  1.405254
4        A  0.083854
```
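Row ranges can be restricted too: for table-format stores, `read_hdf()` accepts `start` and `stop` (a minimal sketch):

```python
# Read only rows 100 through 199 of the stored table
window = pd.read_hdf('queryable.h5', 'data', start=100, stop=200)
print(window.shape)  # (100, 4)
```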
Working with Large Datasets
HDF5 really shines when handling datasets too large to fit in memory. Here's how you might work with such data:
```python
import pandas as pd
import numpy as np

# Create a large dataset (let's simulate it being very large)
# In a real scenario, this could be much larger
chunk_size = 100000
total_chunks = 10

# Create the HDF5 file in chunks
with pd.HDFStore('large_data.h5', mode='w') as store:
    for i in range(total_chunks):
        print(f"Creating chunk {i+1}/{total_chunks}")
        chunk = pd.DataFrame({
            'id': range(i*chunk_size, (i+1)*chunk_size),
            'value': np.random.rand(chunk_size),
            'group': np.random.choice(['X', 'Y', 'Z'], chunk_size)
        })
        # Append to the store
        store.append('dataset', chunk, format='table', data_columns=True)

# Process the data in chunks without loading everything
chunk_iterator = pd.read_hdf('large_data.h5', 'dataset', chunksize=50000)

# Process each chunk
total_rows = 0
for i, chunk in enumerate(chunk_iterator):
    # Do some processing on each chunk
    filtered_rows = len(chunk[chunk['value'] > 0.75])
    total_rows += filtered_rows
    print(f"Chunk {i+1}: Found {filtered_rows} rows with value > 0.75")

print(f"Total qualifying rows: {total_rows}")
```
Output:
```
Creating chunk 1/10
Creating chunk 2/10
...
Creating chunk 10/10
Chunk 1: Found 12476 rows with value > 0.75
Chunk 2: Found 12584 rows with value > 0.75
...
Chunk 20: Found 12509 rows with value > 0.75
Total qualifying rows: 250127
```
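Because `value` was stored as a data column, the same filter can also be pushed down into the store, letting PyTables evaluate the predicate on disk instead of looping over chunks in Python; a short hedged alternative:

```python
# Evaluate the predicate on disk and return only the matching rows
qualifying = pd.read_hdf('large_data.h5', 'dataset', where='value > 0.75')
print(f"Total qualifying rows: {len(qualifying)}")
```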
Practical Examples
Example 1: Storing Time Series Data
HDF5 is particularly good for time series data. Let's see how to store and retrieve financial time series:
```python
import pandas as pd
import numpy as np

# Create sample time series data
dates = pd.date_range(start='2022-01-01', periods=1000)
stocks = ['AAPL', 'GOOG', 'MSFT', 'AMZN']

# Build a table of simulated stock prices
stock_data = {}
for stock in stocks:
    # Create simulated price data with some randomness
    base_price = np.random.randint(50, 500)
    prices = base_price + np.cumsum(np.random.normal(0, 2, len(dates)))
    stock_data[stock] = pd.Series(prices, index=dates)

price_df = pd.DataFrame(stock_data)

# Store in HDF5 file
price_df.to_hdf('stock_prices.h5', key='prices', format='table',
                data_columns=True)

# Read a subset of dates
start_date = '2022-03-01'
end_date = '2022-03-31'
march_prices = pd.read_hdf('stock_prices.h5', 'prices',
                           where=f"index >= '{start_date}' & index <= '{end_date}'")
print(f"Retrieved {len(march_prices)} days of price data for March 2022:")
print(march_prices.head())

# Get data for a specific stock
apple_prices = pd.read_hdf('stock_prices.h5', 'prices', columns=['AAPL'])
print("\nApple stock prices (first 5 days):")
print(apple_prices.head())
```
Output:
```
Retrieved 31 days of price data for March 2022:
                  AAPL        GOOG        MSFT        AMZN
2022-03-01  473.986359  429.316047  253.575074  301.859832
2022-03-02  470.506909  428.392252  254.611636  301.935262
2022-03-03  469.842412  429.802378  253.145621  304.654150
2022-03-04  470.875683  432.519478  255.394160  305.058455
2022-03-05  471.295310  433.694654  254.137482  307.631035

Apple stock prices (first 5 days):
                  AAPL
2022-01-01  429.968588
2022-01-02  431.430510
2022-01-03  432.990569
2022-01-04  435.398748
2022-01-05  435.289620
```
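New trading days can later be appended to the same key without rewriting the file; a hedged sketch reusing the `price_df`, `stocks`, and `dates` objects from above:

```python
# Simulate one more day of prices and append it to the existing table
next_day = pd.DataFrame(
    {s: [price_df[s].iloc[-1] + np.random.normal(0, 2)] for s in stocks},
    index=[dates[-1] + pd.Timedelta(days=1)],
)
next_day.to_hdf('stock_prices.h5', key='prices', format='table', append=True)
```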
Example 2: Sensor Data Logging System
Let's simulate a system that logs sensor data and uses HDF5 for storage:
```python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import time

# Function to simulate sensor readings
def generate_sensor_data(num_readings=100, num_sensors=5):
    timestamps = [datetime.now() - timedelta(seconds=i) for i in range(num_readings)]
    data = {
        'timestamp': timestamps,
        'sensor_id': np.random.choice(range(1, num_sensors+1), num_readings),
        'temperature': np.random.normal(25, 3, num_readings),
        'humidity': np.random.normal(50, 10, num_readings),
        'pressure': np.random.normal(1013, 5, num_readings)
    }
    return pd.DataFrame(data)

# Simulate collecting data in batches
def sensor_logging_system():
    with pd.HDFStore('sensor_logs.h5', mode='w') as store:
        # Simulate collecting 5 batches of data
        for batch in range(5):
            print(f"Collecting batch {batch+1}...")
            batch_data = generate_sensor_data(100)

            # Store data with timestamp as index
            batch_data.set_index('timestamp', inplace=True)
            store.append('sensor_readings', batch_data, format='table',
                         data_columns=True)

            # Wait a bit before the next batch (just for simulation)
            if batch < 4:
                time.sleep(1)

    # Now analyze the data
    with pd.HDFStore('sensor_logs.h5', mode='r') as store:
        # Get data for sensor 1
        sensor1_data = pd.read_hdf('sensor_logs.h5', 'sensor_readings',
                                   where='sensor_id=1')

        # Get high temperature readings across all sensors
        high_temp = pd.read_hdf('sensor_logs.h5', 'sensor_readings',
                                where='temperature>28')

        # Print a summary
        print(f"\nCollected {len(store.get('sensor_readings'))} total readings")
        print(f"Sensor 1 has {len(sensor1_data)} readings")
        print(f"Found {len(high_temp)} high temperature readings")

        # Calculate average readings per sensor
        all_data = store.get('sensor_readings')
        sensor_stats = all_data.groupby('sensor_id').agg({
            'temperature': 'mean',
            'humidity': 'mean',
            'pressure': 'mean'
        }).round(2)
        print("\nAverage readings per sensor:")
        print(sensor_stats)

# Run the simulation
sensor_logging_system()
```
Output:
```
Collecting batch 1...
Collecting batch 2...
Collecting batch 3...
Collecting batch 4...
Collecting batch 5...

Collected 500 total readings
Sensor 1 has 103 readings
Found 68 high temperature readings

Average readings per sensor:
           temperature  humidity  pressure
sensor_id
1                25.23     49.65   1013.17
2                25.01     50.20   1012.67
3                24.69     50.27   1012.35
4                24.87     49.46   1012.88
5                24.87     49.97   1013.46
```
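In a long-running logger you would eventually want to prune old rows as well; `HDFStore.remove()` accepts a `where` clause for table-format keys (a hedged sketch; the cutoff is illustrative):

```python
import pandas as pd

# Delete readings older than a cutoff timestamp from the table
with pd.HDFStore('sensor_logs.h5') as store:
    cutoff = pd.Timestamp.now() - pd.Timedelta(minutes=5)
    store.remove('sensor_readings', where=f"index < '{cutoff}'")
```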
Performance Considerations
HDF5 offers excellent performance for certain use cases, but there are some factors to consider:
- Format Types: the `fixed` format (the default) is faster for writing but less flexible, while the `table` format enables queries and appends at the cost of slower writes (see the timing sketch after this list)
- Compression Options: compression can shrink files considerably, for example:

```python
df.to_hdf('compressed.h5', key='data', complevel=9, complib='blosc')
```

- Chunking: HDF5 uses chunks for storage; pandas sets chunk sizes automatically, but you may need to tune this for very large datasets
- Memory Usage: even with HDF5, be aware of operations that might load large amounts of data into memory
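To make the first point concrete, here is a rough, hedged timing sketch comparing write speed for the two formats (file names are illustrative, and absolute numbers depend on your machine):

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000_000, 4), columns=list('ABCD'))

# Write the same DataFrame in both formats and time each write
for fmt in ('fixed', 'table'):
    start = time.perf_counter()
    df.to_hdf(f'perf_{fmt}.h5', key='data', mode='w', format=fmt)
    print(f"{fmt}: wrote 1,000,000 rows in {time.perf_counter() - start:.2f}s")
```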
Summary
HDF5 is a powerful format for working with large datasets in pandas. In this guide, we covered:
- Basic imports from HDF5 files using `pd.read_hdf()` and `HDFStore`
- Advanced queries to filter data directly from storage
- Working with large datasets using chunking
- Practical examples for time series and sensor data
- Performance considerations
HDF5 provides a flexible, hierarchical way to store multiple pandas objects in a single file, while maintaining good performance for both reading and writing operations. It's especially valuable when working with datasets too large to fit in memory and when you need to query subsets of your data efficiently.
Exercises
- Basic Exercise: Create an HDF5 file containing three different DataFrames with various data types. Practice reading them back and exploring the structure.
- Query Performance: Create a large DataFrame (1 million+ rows) and save it in both CSV and HDF5 formats. Compare the time it takes to:
  - Load the entire dataset from both formats
  - Query a subset of rows based on conditions
- Advanced Storage: Create a hierarchical storage structure within an HDF5 file that mimics a database with multiple tables. Store related datasets in groups (e.g., '/sales/2022', '/sales/2023').
- Real-world Project: Download a large public dataset (like NYC Taxi Trips or weather data) and convert it to HDF5 format. Create a data analysis script that efficiently queries and processes specific subsets of the data.
By mastering HDF5 in pandas, you'll have a powerful tool for handling large datasets efficiently in your data science projects!