Pandas Parquet Export
Introduction
When working with data in pandas, you'll often need to save your DataFrames for later use or share them with others. While CSV files are popular, they're not always the most efficient format, especially for large datasets. This is where Parquet comes in.
Parquet is a columnar storage file format designed for efficiency and performance. It was developed as part of the Apache Hadoop ecosystem but has become widely used in various data processing systems. In this tutorial, we'll explore how to export pandas DataFrames to Parquet files and why you might want to use this format.
Why Use Parquet?
Before diving into the how-to, let's understand why Parquet is often a better choice than other formats:
- Smaller file size: Parquet uses compression and encoding schemes that significantly reduce file sizes
- Faster querying: The columnar format allows for efficient reading of specific columns
- Schema preservation: Parquet stores data types, preserving your DataFrame's structure
- Support for complex data: Handles nested data structures better than CSV
- Splittable files: Enables parallel processing in distributed systems
Prerequisites
To work with Parquet files in pandas, you'll need to install either the pyarrow or fastparquet package. We'll be using pyarrow in our examples:
pip install pyarrow
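pandas picks a Parquet engine automatically ('auto' tries pyarrow first and falls back to fastparquet), but you can also name one explicitly. Here is a quick, hedged sketch to confirm the installation works (the file name is arbitrary):
import pandas as pd

# 'auto' (the default) tries pyarrow first, then fastparquet;
# passing engine='pyarrow' makes the choice explicit.
check_df = pd.DataFrame({'a': [1, 2, 3]})
check_df.to_parquet('engine_check.parquet', engine='pyarrow')
print(pd.read_parquet('engine_check.parquet', engine='pyarrow'))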
Basic Parquet Export
Let's start with a simple example of exporting a pandas DataFrame to a Parquet file:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle'],
    'Salary': [75000, 82000, 67000, 97000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Export DataFrame to Parquet
df.to_parquet('employees.parquet')
# Read back to verify
df_read = pd.read_parquet('employees.parquet')
print("\nDataFrame read from Parquet file:")
print(df_read)
Output:
Original DataFrame:
    Name  Age      City  Salary
0   John   28  New York   75000
1   Anna   34    Boston   82000
2  Peter   29   Chicago   67000
3  Linda   42   Seattle   97000
DataFrame read from Parquet file:
    Name  Age      City  Salary
0   John   28  New York   75000
1   Anna   34    Boston   82000
2  Peter   29   Chicago   67000
3  Linda   42   Seattle   97000
As you can see, the DataFrame is successfully stored and retrieved without losing any information.
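If you want a stricter check than eyeballing the output, you can compare dtypes and contents directly (reusing df and df_read from above):
# Verify the round trip: same values and same dtypes
print(df.dtypes)
print(df_read.dtypes)
print(df.equals(df_read))  # True when values and dtypes match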
Compression Options
Parquet supports different compression algorithms to further reduce file size. Let's explore common options:
# Export with different compression algorithms
df.to_parquet('employees_snappy.parquet', compression='snappy') # Default
df.to_parquet('employees_gzip.parquet', compression='gzip') # Better compression
df.to_parquet('employees_none.parquet', compression=None) # No compression
import os
print("\nFile size comparison:")
print(f"Snappy compression: {os.path.getsize('employees_snappy.parquet')} bytes")
print(f"GZip compression: {os.path.getsize('employees_gzip.parquet')} bytes")
print(f"No compression: {os.path.getsize('employees_none.parquet')} bytes")
Output:
File size comparison:
Snappy compression: 752 bytes
GZip compression: 764 bytes
No compression: 696 bytes
Note: For tiny datasets like our example, the compression overhead can outweigh the savings (here the uncompressed file is actually the smallest), but for large datasets the benefits become obvious.
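If you want to test several codecs in one pass, a small loop does the job. This is just a sketch: support for 'brotli' and 'zstd' depends on how your pyarrow build was compiled, hence the try/except.
import os

# Write the same DataFrame with several codecs and compare sizes
for codec in ['snappy', 'gzip', 'brotli', 'zstd', None]:
    path = f"employees_{codec or 'uncompressed'}.parquet"
    try:
        df.to_parquet(path, compression=codec)
        print(f"{str(codec):>12}: {os.path.getsize(path)} bytes")
    except Exception as exc:  # codec not available in this build
        print(f"{str(codec):>12}: not available ({exc})")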
Exporting Specific Columns
Sometimes you may want to export only specific columns to a Parquet file:
# Export only specific columns
df[['Name', 'Salary']].to_parquet('employees_subset.parquet')
# Read back to verify
df_subset = pd.read_parquet('employees_subset.parquet')
print("\nSubset of DataFrame from Parquet file:")
print(df_subset)
Output:
Subset of DataFrame from Parquet file:
    Name  Salary
0   John   75000
1   Anna   82000
2  Peter   67000
3  Linda   97000
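The flip side of exporting a subset is that, because Parquet is columnar, you can also keep the full file and read back only the columns you need via read_parquet's columns parameter:
# Read just two columns from the full file written earlier
names_and_salaries = pd.read_parquet('employees.parquet', columns=['Name', 'Salary'])
print(names_and_salaries)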
Handling Large DataFrames
One of Parquet's strengths is handling large datasets efficiently. Let's create a larger dataset to demonstrate:
# Create a larger DataFrame
np.random.seed(42)
large_df = pd.DataFrame({
    'id': range(100000),
    'value_1': np.random.randn(100000),
    'value_2': np.random.randn(100000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], size=100000)
})
# Export to CSV and Parquet for comparison
large_df.to_csv('large_data.csv')
large_df.to_parquet('large_data.parquet')
# Compare file sizes
print("\nLarge dataset file size comparison:")
print(f"CSV: {os.path.getsize('large_data.csv') / (1024*1024):.2f} MB")
print(f"Parquet: {os.path.getsize('large_data.parquet') / (1024*1024):.2f} MB")
# Compare read time
import time
start = time.time()
pd.read_csv('large_data.csv')
csv_time = time.time() - start
start = time.time()
pd.read_parquet('large_data.parquet')
parquet_time = time.time() - start
print(f"\nRead time for CSV: {csv_time:.2f} seconds")
print(f"Read time for Parquet: {parquet_time:.2f} seconds")
print(f"Parquet is {csv_time/parquet_time:.1f}x faster")
Output:
Large dataset file size comparison:
CSV: 4.33 MB
Parquet: 1.12 MB
Read time for CSV: 0.38 seconds
Read time for Parquet: 0.05 seconds
Parquet is 7.6x faster
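The same comparison can be extended to write times. Here is a minimal sketch that simply re-writes the files from above; the absolute numbers will vary by machine and dataset:
import time

# Write-time comparison (re-writes the files created above)
start = time.time()
large_df.to_csv('large_data.csv')
print(f"CSV write time:     {time.time() - start:.2f} seconds")

start = time.time()
large_df.to_parquet('large_data.parquet')
print(f"Parquet write time: {time.time() - start:.2f} seconds")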
Partitioned Parquet Files
For very large datasets, Parquet supports partitioning, which splits data into subdirectories based on column values. This is especially useful in big data environments:
# Create a DataFrame with date data
dates = pd.date_range('20230101', periods=100)
df_dates = pd.DataFrame({
    'date': dates,
    'value': np.random.randn(100),
    'category': np.random.choice(['X', 'Y', 'Z'], size=100)
})
# Extract year and month for partitioning
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month
# Export with partitioning
df_dates.to_parquet('dates_partitioned.parquet',
                    partition_cols=['year', 'month'])
print("\nPartitioned directory structure created:")
print("dates_partitioned.parquet/")
print("├── year=2023")
print("│ ├── month=1")
print("│ │ └── part.0.parquet")
print("│ └── month=2")
print("│ └── part.0.parquet")
# Reading from a specific partition
jan_2023 = pd.read_parquet('dates_partitioned.parquet', filters=[('year', '=', 2023), ('month', '=', 1)])
print(f"\nRecords from January 2023: {len(jan_2023)}")
Output:
Partitioned directory structure created:
dates_partitioned.parquet/
├── year=2023
│   ├── month=1
│   │   └── part.0.parquet
│   └── month=2
│       └── part.0.parquet
Records from January 2023: 31
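The hard-coded tree printed above is abbreviated (the 100-day range actually spans four months). To see what was really written, walk the directory; exact part-file names vary between pyarrow versions:
import os

# List every file in the partitioned dataset
for root, dirs, files in os.walk('dates_partitioned.parquet'):
    for name in files:
        print(os.path.join(root, name))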
Real-World Application: Working with Time Series Data
Let's look at a practical example with time series data. We'll create a dataset simulating stock prices and export it to Parquet format:
# Create a time series dataset
dates = pd.date_range('20220101', periods=365)
stocks = ['AAPL', 'MSFT', 'GOOG', 'AMZN']
# Generate stock price data
np.random.seed(42)
stock_data = []
for stock in stocks:
    start_price = np.random.uniform(100, 500)
    prices = [start_price]
    for _ in range(1, len(dates)):
        # Random daily change between -3% and +3%
        change = np.random.uniform(-0.03, 0.03)
        new_price = prices[-1] * (1 + change)
        prices.append(new_price)
    for i, date in enumerate(dates):
        stock_data.append({
            'date': date,
            'stock': stock,
            'price': prices[i],
            'volume': int(np.random.uniform(1000000, 10000000))
        })
stock_df = pd.DataFrame(stock_data)
# Export to Parquet with partitioning by stock
stock_df.to_parquet('stock_data.parquet',
                    partition_cols=['stock'])
# Read specific stock data
apple_data = pd.read_parquet('stock_data.parquet', filters=[('stock', '=', 'AAPL')])
print("\nFirst 5 rows of Apple stock data:")
print(apple_data.head())
# Calculate monthly average prices for Apple
apple_data['month'] = apple_data['date'].dt.to_period('M')
monthly_avg = apple_data.groupby('month')['price'].mean()
print("\nMonthly average prices for Apple:")
print(monthly_avg.head())
Output:
First 5 rows of Apple stock data:
        date stock       price   volume
0 2022-01-01  AAPL  371.952818  3139389
1 2022-01-02  AAPL  375.447664  9396845
2 2022-01-03  AAPL  373.930933  7981537
3 2022-01-04  AAPL  374.585851  8512653
4 2022-01-05  AAPL  385.347852  3723248
Monthly average prices for Apple:
month
2022-01 379.824855
2022-02 375.780524
2022-03 390.954390
2022-04 393.077910
2022-05 399.294008
Name: price, dtype: float64
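The same aggregation works across every stock at once, without filtering to a single partition first. Note that the partition column usually comes back as a categorical when read with pyarrow, so observed=True keeps only the groups that actually occur:
# Monthly average price per stock over the whole dataset
all_stocks = pd.read_parquet('stock_data.parquet')
all_stocks['month'] = all_stocks['date'].dt.to_period('M')
monthly_by_stock = all_stocks.groupby(['stock', 'month'], observed=True)['price'].mean()
print(monthly_by_stock.head())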
Best Practices for Parquet Export
- Choose appropriate compression:
  - Use snappy (the default) for a good balance between compression and speed
  - Use gzip when storage space is a primary concern
  - Use None when processing speed is critical
- Consider partitioning for large datasets that are frequently filtered on specific columns
- Use appropriate data types before export to maximize compression efficiency (see the sketch below)
- Export only necessary columns to reduce file size and improve read performance
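The point about data types is easy to experiment with. The sketch below tightens the dtypes of large_df from earlier and compares file sizes; the savings depend on your data and can be modest for string columns, since pyarrow already dictionary-encodes them by default.
import os

# Tighter dtypes before export: low-cardinality strings -> category,
# wide integers -> the narrowest integer type that fits
optimized = large_df.copy()
optimized['category'] = optimized['category'].astype('category')
optimized['id'] = pd.to_numeric(optimized['id'], downcast='integer')

optimized.to_parquet('large_data_optimized.parquet')
print(f"Original:  {os.path.getsize('large_data.parquet')} bytes")
print(f"Optimized: {os.path.getsize('large_data_optimized.parquet')} bytes")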
Summary
In this tutorial, we've covered:
- The basics of exporting pandas DataFrames to Parquet files
- Compression options available in Parquet
- How to export specific columns
- Working with large datasets efficiently
- Partitioning data for improved query performance
- A real-world example with time series data
Parquet offers significant advantages over traditional formats like CSV, especially for larger datasets. The columnar storage format, compression capabilities, and schema preservation make it an excellent choice for data storage and exchange in data science and machine learning projects.
Exercises
- Create a DataFrame with at least 5 different data types (strings, integers, floats, dates, and booleans), and export it to Parquet. Check if all data types are preserved when you read it back.
- Generate a large DataFrame (1 million rows) and compare the export time, file size, and read time between CSV, Excel, and Parquet formats.
- Create a dataset with geographical data (country, city, latitude, longitude) and export it as a partitioned Parquet file by country. Write queries to retrieve data for specific countries.
- Experiment with different compression algorithms for a large DataFrame and measure both the compression ratio and read/write performance.