Pandas Parquet Export

Introduction

When working with data in pandas, you'll often need to save your DataFrames for later use or share them with others. While CSV files are popular, they're not always the most efficient format, especially for large datasets. This is where Parquet comes in.

Parquet is a columnar storage file format designed for efficiency and performance. It was developed as part of the Apache Hadoop ecosystem but has become widely used in various data processing systems. In this tutorial, we'll explore how to export pandas DataFrames to Parquet files and why you might want to use this format.

Why Use Parquet?

Before diving into the how-to, let's understand why Parquet is often a better choice than other formats:

  • Smaller file size: Parquet uses compression and encoding schemes that significantly reduce file sizes
  • Faster querying: The columnar format allows for efficient reading of specific columns
  • Schema preservation: Parquet stores data types, preserving your DataFrame's structure
  • Support for complex data: Handles nested data structures better than CSV
  • Splittable files: Enables parallel processing in distributed systems

Prerequisites

To work with Parquet files in pandas, you'll need to install the pyarrow or fastparquet package. We'll be using pyarrow in our examples:

bash
pip install pyarrow
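
If both libraries are installed, pandas chooses one automatically (the default engine='auto' tries pyarrow first and falls back to fastparquet). You can also request an engine explicitly; every to_parquet and read_parquet call in this tutorial accepts the same parameter. A quick sketch (the file name is just illustrative):

python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# engine='auto' is the default; request pyarrow explicitly if you prefer
df.to_parquet('example.parquet', engine='pyarrow')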

Basic Parquet Export

Let's start with a simple example of exporting a pandas DataFrame to a Parquet file:

python
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle'],
    'Salary': [75000, 82000, 67000, 97000]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Export DataFrame to Parquet
df.to_parquet('employees.parquet')

# Read back to verify
df_read = pd.read_parquet('employees.parquet')
print("\nDataFrame read from Parquet file:")
print(df_read)

Output:

Original DataFrame:
    Name  Age      City  Salary
0   John   28  New York   75000
1   Anna   34    Boston   82000
2  Peter   29   Chicago   67000
3  Linda   42   Seattle   97000

DataFrame read from Parquet file:
    Name  Age      City  Salary
0   John   28  New York   75000
1   Anna   34    Boston   82000
2  Peter   29   Chicago   67000
3  Linda   42   Seattle   97000

As you can see, the DataFrame is successfully stored and retrieved without losing any information.
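
One quick way to confirm the schema preservation is to round-trip a column type that CSV cannot store natively, such as datetimes. Here's a short sketch (the employees_typed file names and dates are just for illustration):

python
# Add a datetime column and round-trip it through both formats
df2 = df.copy()
df2['HireDate'] = pd.to_datetime(['2020-01-15', '2019-06-01', '2021-03-10', '2018-11-23'])

df2.to_parquet('employees_typed.parquet')
df2.to_csv('employees_typed.csv', index=False)

print(pd.read_parquet('employees_typed.parquet').dtypes['HireDate'])  # datetime64[ns]
print(pd.read_csv('employees_typed.csv').dtypes['HireDate'])          # object (plain strings)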

Compression Options

Parquet supports different compression algorithms to further reduce file size. Let's explore common options:

python
# Export with different compression algorithms
df.to_parquet('employees_snappy.parquet', compression='snappy')  # Default
df.to_parquet('employees_gzip.parquet', compression='gzip')      # Better compression
df.to_parquet('employees_none.parquet', compression=None)        # No compression

import os
print("\nFile size comparison:")
print(f"Snappy compression: {os.path.getsize('employees_snappy.parquet')} bytes")
print(f"GZip compression: {os.path.getsize('employees_gzip.parquet')} bytes")
print(f"No compression: {os.path.getsize('employees_none.parquet')} bytes")

Output:

File size comparison:
Snappy compression: 752 bytes
GZip compression: 764 bytes
No compression: 696 bytes

Note: For small datasets like our example, compression might not make a significant difference or might even make files slightly larger, but for large datasets, the benefits become obvious.
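
Beyond these three options, the pyarrow engine also supports codecs such as zstd, brotli, and lz4, and recent pyarrow versions accept a compression_level keyword that to_parquet forwards to the engine. A hedged sketch (availability depends on how your pyarrow build was compiled):

python
# Additional codecs available with the pyarrow engine
df.to_parquet('employees_zstd.parquet', compression='zstd')
df.to_parquet('employees_brotli.parquet', compression='brotli')

# Tune the compression level (forwarded to pyarrow via **kwargs)
df.to_parquet('employees_zstd9.parquet', compression='zstd', compression_level=9)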

Exporting Specific Columns

Sometimes you may want to export only specific columns to a Parquet file:

python
# Export only specific columns
df[['Name', 'Salary']].to_parquet('employees_subset.parquet')

# Read back to verify
df_subset = pd.read_parquet('employees_subset.parquet')
print("\nSubset of DataFrame from Parquet file:")
print(df_subset)

Output:

Subset of DataFrame from Parquet file:
    Name  Salary
0   John   75000
1   Anna   82000
2  Peter   67000
3  Linda   97000
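
The reverse also works: because Parquet is columnar, you can read just the columns you need from a file that contains many more, using read_parquet's columns parameter. Reusing employees.parquet from the basic example:

python
# Read only selected columns from an existing Parquet file
names_only = pd.read_parquet('employees.parquet', columns=['Name', 'City'])
print(names_only)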

Handling Large DataFrames

One of Parquet's strengths is handling large datasets efficiently. Let's create a larger dataset to demonstrate:

python
# Create a larger DataFrame
np.random.seed(42)
large_df = pd.DataFrame({
    'id': range(100000),
    'value_1': np.random.randn(100000),
    'value_2': np.random.randn(100000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], size=100000)
})

# Export to CSV and Parquet for comparison
large_df.to_csv('large_data.csv')
large_df.to_parquet('large_data.parquet')

# Compare file sizes
print("\nLarge dataset file size comparison:")
print(f"CSV: {os.path.getsize('large_data.csv') / (1024*1024):.2f} MB")
print(f"Parquet: {os.path.getsize('large_data.parquet') / (1024*1024):.2f} MB")

# Compare read time
import time

start = time.time()
pd.read_csv('large_data.csv')
csv_time = time.time() - start

start = time.time()
pd.read_parquet('large_data.parquet')
parquet_time = time.time() - start

print(f"\nRead time for CSV: {csv_time:.2f} seconds")
print(f"Read time for Parquet: {parquet_time:.2f} seconds")
print(f"Parquet is {csv_time/parquet_time:.1f}x faster")

Output:

Large dataset file size comparison:
CSV: 4.33 MB
Parquet: 1.12 MB

Read time for CSV: 0.38 seconds
Read time for Parquet: 0.05 seconds
Parquet is 7.6x faster
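
to_parquet writes a whole DataFrame in a single call. If your data arrives in pieces (a database cursor, paginated API responses, files read chunk by chunk), one option is to drop down to pyarrow's ParquetWriter and append row groups incrementally instead of concatenating everything in memory first. A minimal sketch, assuming pyarrow is installed and every chunk has the same columns and dtypes; the chunks() generator is just a stand-in for your real data source:

python
import pyarrow as pa
import pyarrow.parquet as pq

def chunks(n_chunks=4, rows=25_000):
    """Stand-in for data arriving in pieces."""
    for i in range(n_chunks):
        yield pd.DataFrame({
            'id': np.arange(i * rows, (i + 1) * rows),
            'value': np.random.randn(rows),
        })

writer = None
for chunk in chunks():
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # Open the writer with the schema of the first chunk
        writer = pq.ParquetWriter('incremental.parquet', table.schema)
    writer.write_table(table)
writer.close()

print(len(pd.read_parquet('incremental.parquet')))  # 100000 rows written in four passes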

Partitioned Parquet Files

For very large datasets, Parquet supports partitioning, which splits data into subdirectories based on column values. This is especially useful in big data environments:

python
# Create a DataFrame with date data
dates = pd.date_range('20230101', periods=100)
df_dates = pd.DataFrame({
    'date': dates,
    'value': np.random.randn(100),
    'category': np.random.choice(['X', 'Y', 'Z'], size=100)
})

# Extract year and month for partitioning
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month

# Export with partitioning
df_dates.to_parquet('dates_partitioned.parquet',
                    partition_cols=['year', 'month'])

print("\nPartitioned directory structure created:")
print("dates_partitioned.parquet/")
print("├── year=2023")
print("│ ├── month=1")
print("│ │ └── part.0.parquet")
print("│ └── month=2")
print("│ └── part.0.parquet")

# Reading from a specific partition
jan_2023 = pd.read_parquet('dates_partitioned.parquet', filters=[('year', '=', 2023), ('month', '=', 1)])
print(f"\nRecords from January 2023: {len(jan_2023)}")

Output:

Partitioned directory structure created:
dates_partitioned.parquet/
└── year=2023
    ├── month=1
    │   └── part.0.parquet
    ├── month=2
    │   └── part.0.parquet
    └── ...

Records from January 2023: 31
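
Because the partition values live in the directory names, you can also point read_parquet at a single partition's directory. Note that in this case the partition columns (year, month) typically won't appear in the result, since they are encoded in the path rather than stored in the files; the filters approach above is usually the safer choice:

python
# Read one partition's directory directly
jan_dir = pd.read_parquet('dates_partitioned.parquet/year=2023/month=1')
print(len(jan_dir))  # 31, the same rows as the filtered read above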

Real-World Application: Working with Time Series Data

Let's look at a practical example with time series data. We'll create a dataset simulating stock prices and export it to Parquet format:

python
# Create a time series dataset
dates = pd.date_range('20220101', periods=365)
stocks = ['AAPL', 'MSFT', 'GOOG', 'AMZN']

# Generate stock price data
np.random.seed(42)
stock_data = []

for stock in stocks:
    start_price = np.random.uniform(100, 500)
    prices = [start_price]

    for _ in range(1, len(dates)):
        # Random daily change between -3% and +3%
        change = np.random.uniform(-0.03, 0.03)
        new_price = prices[-1] * (1 + change)
        prices.append(new_price)

    for i, date in enumerate(dates):
        stock_data.append({
            'date': date,
            'stock': stock,
            'price': prices[i],
            'volume': int(np.random.uniform(1000000, 10000000))
        })

stock_df = pd.DataFrame(stock_data)

# Export to Parquet with partitioning by stock
stock_df.to_parquet('stock_data.parquet',
                    partition_cols=['stock'])

# Read specific stock data
apple_data = pd.read_parquet('stock_data.parquet', filters=[('stock', '=', 'AAPL')])

print("\nFirst 5 rows of Apple stock data:")
print(apple_data.head())

# Calculate monthly average prices for Apple
apple_data['month'] = apple_data['date'].dt.to_period('M')
monthly_avg = apple_data.groupby('month')['price'].mean()

print("\nMonthly average prices for Apple:")
print(monthly_avg.head())

Output:

First 5 rows of Apple stock data:
        date stock       price   volume
0 2022-01-01  AAPL  371.952818  3139389
1 2022-01-02  AAPL  375.447664  9396845
2 2022-01-03  AAPL  373.930933  7981537
3 2022-01-04  AAPL  374.585851  8512653
4 2022-01-05  AAPL  385.347852  3723248

Monthly average prices for Apple:
month
2022-01 379.824855
2022-02 375.780524
2022-03 390.954390
2022-04 393.077910
2022-05 399.294008
Name: price, dtype: float64
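
A partitioned layout also makes it easy to add new data later without rewriting the whole dataset. to_parquet has no append mode, but pyarrow's write_to_dataset writes additional files into the matching partition directories. A hedged sketch with a hypothetical batch of next-day prices (the values are made up, and the exact file layout depends on your pyarrow version):

python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical prices for the day after the simulated year
new_rows = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-01'] * 4),
    'stock': ['AAPL', 'MSFT', 'GOOG', 'AMZN'],
    'price': [380.0, 250.0, 95.0, 88.0],
    'volume': [5_000_000, 4_200_000, 3_100_000, 6_700_000],
})

pq.write_to_dataset(
    pa.Table.from_pandas(new_rows, preserve_index=False),
    root_path='stock_data.parquet',
    partition_cols=['stock'],
)

# The AAPL partition now holds the original rows plus the new one
apple_data = pd.read_parquet('stock_data.parquet', filters=[('stock', '=', 'AAPL')])
print(len(apple_data))  # expect 366: the original 365 rows plus the appended one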

Best Practices for Parquet Export

  1. Choose appropriate compression:

    • Use snappy (default) for a good balance between compression and speed
    • Use gzip when storage space is a primary concern
    • Use None when raw write speed matters more than file size
  2. Consider partitioning for large datasets that are frequently filtered on specific columns

  3. Use appropriate data types before export to maximize compression efficiency (see the sketch after this list)

  4. Export only necessary columns to reduce file size and improve read performance
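
For point 3, "appropriate data types" usually means converting low-cardinality string columns to category (the categorical dtype round-trips through Parquet with the pyarrow engine) and downcasting oversized numeric types. A small sketch reusing large_df from the earlier example:

python
# Tighten dtypes before export
optimized = large_df.copy()
optimized['category'] = optimized['category'].astype('category')  # preserved on read-back
optimized['id'] = optimized['id'].astype('int32')                  # 100,000 ids fit easily in int32

optimized.to_parquet('large_data_optimized.parquet')
print(pd.read_parquet('large_data_optimized.parquet').dtypes)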

Summary

In this tutorial, we've covered:

  • The basics of exporting pandas DataFrames to Parquet files
  • Compression options available in Parquet
  • How to export specific columns
  • Working with large datasets efficiently
  • Partitioning data for improved query performance
  • A real-world example with time series data

Parquet offers significant advantages over traditional formats like CSV, especially for larger datasets. The columnar storage format, compression capabilities, and schema preservation make it an excellent choice for data storage and exchange in data science and machine learning projects.

Exercises

  1. Create a DataFrame with at least 5 different data types (strings, integers, floats, dates, and booleans), and export it to Parquet. Check if all data types are preserved when you read it back.

  2. Generate a large DataFrame (1 million rows) and compare the export time, file size, and read time between CSV, Excel, and Parquet formats.

  3. Create a dataset with geographical data (country, city, latitude, longitude) and export it as a partitioned Parquet file by country. Write queries to retrieve data for specific countries.

  4. Experiment with different compression algorithms for a large DataFrame and measure both the compression ratio and read/write performance.


