Pandas Parquet Export
Introduction
When working with data in pandas, you'll often need to save your DataFrames for later use or share them with others. While CSV files are popular, they're not always the most efficient format, especially for large datasets. This is where Parquet comes in.
Parquet is a columnar storage file format designed for efficiency and performance. It was developed as part of the Apache Hadoop ecosystem but has become widely used in various data processing systems. In this tutorial, we'll explore how to export pandas DataFrames to Parquet files and why you might want to use this format.
Why Use Parquet?
Before diving into the how-to, let's understand why Parquet is often a better choice than other formats:
- Smaller file size: Parquet uses compression and encoding schemes that significantly reduce file sizes
- Faster querying: The columnar format allows for efficient reading of specific columns
- Schema preservation: Parquet stores data types, preserving your DataFrame's structure
- Support for complex data: Handles nested data structures better than CSV
- Splittable files: Enables parallel processing in distributed systems
Prerequisites
To work with Parquet files in pandas, you'll need to install either the pyarrow or fastparquet package. We'll be using pyarrow in our examples:
pip install pyarrow
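pandas picks a Parquet engine automatically ('auto' tries pyarrow first and falls back to fastparquet), but you can also name one explicitly. Here is a quick, hedged sketch to confirm the installation works (the file name is arbitrary):
import pandas as pd

# 'auto' (the default) tries pyarrow first, then fastparquet;
# passing engine='pyarrow' makes the choice explicit.
check_df = pd.DataFrame({'a': [1, 2, 3]})
check_df.to_parquet('engine_check.parquet', engine='pyarrow')
print(pd.read_parquet('engine_check.parquet', engine='pyarrow'))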
Basic Parquet Export
Let's start with a simple example of exporting a pandas DataFrame to a Parquet file:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle'],
    'Salary': [75000, 82000, 67000, 97000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Export DataFrame to Parquet
df.to_parquet('employees.parquet')
# Read back to verify
df_read = pd.read_parquet('employees.parquet')
print("\nDataFrame read from Parquet file:")
print(df_read)
Output:
Original DataFrame:
    Name  Age      City  Salary
0   John   28  New York   75000
1   Anna   34    Boston   82000
2  Peter   29   Chicago   67000
3  Linda   42   Seattle   97000
DataFrame read from Parquet file:
    Name  Age      City  Salary
0   John   28  New York   75000
1   Anna   34    Boston   82000
2  Peter   29   Chicago   67000
3  Linda   42   Seattle   97000
As you can see, the DataFrame is successfully stored and retrieved without losing any information.
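If you want a stricter check than eyeballing the output, you can compare dtypes and contents directly (reusing df and df_read from above):
# Verify the round trip: same values and same dtypes
print(df.dtypes)
print(df_read.dtypes)
print(df.equals(df_read))  # True when values and dtypes match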
Compression Options
Parquet supports different compression algorithms to further reduce file size. Let's explore common options:
# Export with different compression algorithms
df.to_parquet('employees_snappy.parquet', compression='snappy') # Default
df.to_parquet('employees_gzip.parquet', compression='gzip') # Better compression
df.to_parquet('employees_none.parquet', compression=None) # No compression
import os
print("\nFile size comparison:")
print(f"Snappy compression: {os.path.getsize('employees_snappy.parquet')} bytes")
print(f"GZip compression: {os.path.getsize('employees_gzip.parquet')} bytes")
print(f"No compression: {os.path.getsize('employees_none.parquet')} bytes")
Output:
File size comparison:
Snappy compression: 752 bytes
GZip compression: 764 bytes
No compression: 696 bytes
Note: For tiny datasets like our example, the compression overhead can outweigh the savings (here the uncompressed file is actually the smallest), but for large datasets the benefits become obvious.
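If you want to test several codecs in one pass, a small loop does the job. This is just a sketch: support for 'brotli' and 'zstd' depends on how your pyarrow build was compiled, hence the try/except.
import os

# Write the same DataFrame with several codecs and compare sizes
for codec in ['snappy', 'gzip', 'brotli', 'zstd', None]:
    path = f"employees_{codec or 'uncompressed'}.parquet"
    try:
        df.to_parquet(path, compression=codec)
        print(f"{str(codec):>12}: {os.path.getsize(path)} bytes")
    except Exception as exc:  # codec not available in this build
        print(f"{str(codec):>12}: not available ({exc})")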
Exporting Specific Columns
Sometimes you may want to export only specific columns to a Parquet file:
# Export only specific columns
df[['Name', 'Salary']].to_parquet('employees_subset.parquet')
# Read back to verify
df_subset = pd.read_parquet('employees_subset.parquet')
print("\nSubset of DataFrame from Parquet file:")
print(df_subset)
Output:
Subset of DataFrame from Parquet file:
    Name  Salary
0   John   75000
1   Anna   82000
2  Peter   67000
3  Linda   97000
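The flip side of exporting a subset is that, because Parquet is columnar, you can also keep the full file and read back only the columns you need via read_parquet's columns parameter:
# Read just two columns from the full file written earlier
names_and_salaries = pd.read_parquet('employees.parquet', columns=['Name', 'Salary'])
print(names_and_salaries)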
Handling Large DataFrames
One of Parquet's strengths is handling large datasets efficiently. Let's create a larger dataset to demonstrate:
# Create a larger DataFrame
np.random.seed(42)
large_df = pd.DataFrame({
    'id': range(100000),
    'value_1': np.random.randn(100000),
    'value_2': np.random.randn(100000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], size=100000)
})
# Export to CSV and Parquet for comparison
large_df.to_csv('large_data.csv')
large_df.to_parquet('large_data.parquet')
# Compare file sizes
print("\nLarge dataset file size comparison:")
print(f"CSV: {os.path.getsize('large_data.csv') / (1024*1024):.2f} MB")
print(f"Parquet: {os.path.getsize('large_data.parquet') / (1024*1024):.2f} MB")
# Compare read time
import time
start = time.time()
pd.read_csv('large_data.csv')
csv_time = time.time() - start
start = time.time()
pd.read_parquet('large_data.parquet')
parquet_time = time.time() - start
print(f"\nRead time for CSV: {csv_time:.2f} seconds")
print(f"Read time for Parquet: {parquet_time:.2f} seconds")
print(f"Parquet is {csv_time/parquet_time:.1f}x faster")
Output:
Large dataset file size comparison:
CSV: 4.33 MB
Parquet: 1.12 MB
Read time for CSV: 0.38 seconds
Read time for Parquet: 0.05 seconds
Parquet is 7.6x faster
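The same comparison can be extended to write times. Here is a minimal sketch that simply re-writes the files from above; the absolute numbers will vary by machine and dataset:
import time

# Write-time comparison (re-writes the files created above)
start = time.time()
large_df.to_csv('large_data.csv')
print(f"CSV write time:     {time.time() - start:.2f} seconds")

start = time.time()
large_df.to_parquet('large_data.parquet')
print(f"Parquet write time: {time.time() - start:.2f} seconds")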
Partitioned Parquet Files
For very large datasets, Parquet supports partitioning, which splits data into subdirectories based on column values. This is especially useful in big data environments:
# Create a DataFrame with date data
dates = pd.date_range('20230101', periods=100)
df_dates = pd.DataFrame({
    'date': dates,
    'value': np.random.randn(100),
    'category': np.random.choice(['X', 'Y', 'Z'], size=100)
})
# Extract year and month for partitioning
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month
# Export with partitioning
df_dates.to_parquet('dates_partitioned.parquet',
                    partition_cols=['year', 'month'])
print("\nPartitioned directory structure created:")
print("dates_partitioned.parquet/")
print("├── year=2023")
print("│ ├── month=1")
print("│ │ └── part.0.parquet")
print("│ └── month=2")
print("│ └── part.0.parquet")
# Reading from a specific partition
jan_2023 = pd.read_parquet('dates_partitioned.parquet', filters=[('year', '=', 2023), ('month', '=', 1)])
print(f"\nRecords from January 2023: {len(jan_2023)}")
Output:
Partitioned directory structure created:
dates_partitioned.parquet/
├── year=2023
│   ├── month=1
│   │   └── part.0.parquet
│   └── month=2
│       └── part.0.parquet
Records from January 2023: 31
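The hard-coded tree printed above is abbreviated (the 100-day range actually spans four months). To see what was really written, walk the directory; exact part-file names vary between pyarrow versions:
import os

# List every file in the partitioned dataset
for root, dirs, files in os.walk('dates_partitioned.parquet'):
    for name in files:
        print(os.path.join(root, name))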
Real-World Application: Working with Time Series Data
Let's look at a practical example with time series data. We'll create a dataset simulating stock prices and export it to Parquet format:
# Create a time series dataset
dates = pd.date_range('20220101', periods=365)
stocks = ['AAPL', 'MSFT', 'GOOG', 'AMZN']
# Generate stock price data
np.random.seed(42)
stock_data = []
for stock in stocks:
    start_price = np.random.uniform(100, 500)
    prices = [start_price]
    for _ in range(1, len(dates)):
        # Random daily change between -3% and +3%
        change = np.random.uniform(-0.03, 0.03)
        new_price = prices[-1] * (1 + change)
        prices.append(new_price)
    for i, date in enumerate(dates):
        stock_data.append({
            'date': date,
            'stock': stock,
            'price': prices[i],
            'volume': int(np.random.uniform(1000000, 10000000))
        })
stock_df = pd.DataFrame(stock_data)
# Export to Parquet with partitioning by stock
stock_df.to_parquet('stock_data.parquet',
                    partition_cols=['stock'])
# Read specific stock data
apple_data = pd.read_parquet('stock_data.parquet', filters=[('stock', '=', 'AAPL')])
print("\nFirst 5 rows of Apple stock data:")
print(apple_data.head())
# Calculate monthly average prices for Apple
apple_data['month'] = apple_data['date'].dt.to_period('M')
monthly_avg = apple_data.groupby('month')['price'].mean()
print("\nMonthly average prices for Apple:")
print(monthly_avg.head())
Output:
First 5 rows of Apple stock data:
        date stock       price   volume
0 2022-01-01  AAPL  371.952818  3139389
1 2022-01-02  AAPL  375.447664  9396845
2 2022-01-03  AAPL  373.930933  7981537
3 2022-01-04  AAPL  374.585851  8512653
4 2022-01-05  AAPL  385.347852  3723248
Monthly average prices for Apple:
month
2022-01 379.824855
2022-02 375.780524
2022-03 390.954390
2022-04 393.077910
2022-05 399.294008
Name: price, dtype: float64
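The same aggregation works across every stock at once, without filtering to a single partition first. Note that the partition column usually comes back as a categorical when read with pyarrow, so observed=True keeps only the groups that actually occur:
# Monthly average price per stock over the whole dataset
all_stocks = pd.read_parquet('stock_data.parquet')
all_stocks['month'] = all_stocks['date'].dt.to_period('M')
monthly_by_stock = all_stocks.groupby(['stock', 'month'], observed=True)['price'].mean()
print(monthly_by_stock.head())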
Best Practices for Parquet Export
- Choose appropriate compression:
  - Use snappy (the default) for a good balance between compression and speed
  - Use gzip when storage space is a primary concern
  - Use None when processing speed is critical
- Consider partitioning for large datasets that are frequently filtered on specific columns
- Use appropriate data types before export to maximize compression efficiency (see the sketch below)
- Export only necessary columns to reduce file size and improve read performance
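The point about data types is easy to experiment with. The sketch below tightens the dtypes of large_df from earlier and compares file sizes; the savings depend on your data and can be modest for string columns, since pyarrow already dictionary-encodes them by default.
import os

# Tighter dtypes before export: low-cardinality strings -> category,
# wide integers -> the narrowest integer type that fits
optimized = large_df.copy()
optimized['category'] = optimized['category'].astype('category')
optimized['id'] = pd.to_numeric(optimized['id'], downcast='integer')

optimized.to_parquet('large_data_optimized.parquet')
print(f"Original:  {os.path.getsize('large_data.parquet')} bytes")
print(f"Optimized: {os.path.getsize('large_data_optimized.parquet')} bytes")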
Summary
In this tutorial, we've covered:
- The basics of exporting pandas DataFrames to Parquet files
- Compression options available in Parquet
- How to export specific columns
- Working with large datasets efficiently
- Partitioning data for improved query performance
- A real-world example with time series data
Parquet offers significant advantages over traditional formats like CSV, especially for larger datasets. The columnar storage format, compression capabilities, and schema preservation make it an excellent choice for data storage and exchange in data science and machine learning projects.
Exercises
- Create a DataFrame with at least 5 different data types (strings, integers, floats, dates, and booleans), and export it to Parquet. Check if all data types are preserved when you read it back.
- Generate a large DataFrame (1 million rows) and compare the export time, file size, and read time between CSV, Excel, and Parquet formats.
- Create a dataset with geographical data (country, city, latitude, longitude) and export it as a partitioned Parquet file by country. Write queries to retrieve data for specific countries.
- Experiment with different compression algorithms for a large DataFrame and measure both the compression ratio and read/write performance.