Pandas Parquet Import
Introduction
The Parquet format is a columnar storage file format designed for efficient data processing and storage. It's becoming increasingly popular in the data science ecosystem due to its performance benefits and compatibility with big data frameworks. In this tutorial, we'll learn how to import Parquet files into pandas DataFrames and explore the advantages of using this format for data analysis tasks.
Parquet offers several benefits over other file formats:
- Columnar storage: Data is stored column-wise rather than row-wise, enabling more efficient querying and compression
- Schema preservation: Data types and column names are stored with the data
- Compression: Built-in compression reduces file size significantly
- Performance: Much faster read/write operations than CSV or JSON files
- Big data compatibility: Works seamlessly with tools like Apache Spark, Hadoop, and Dask
Prerequisites
To work with Parquet files, you'll need to install either the pyarrow or fastparquet engine. We'll use pyarrow in this tutorial:
pip install pandas pyarrow
Basic Parquet Import
Let's start with the simplest way to read a Parquet file into a pandas DataFrame:
import pandas as pd
# Read a Parquet file
df = pd.read_parquet('data.parquet')
# Display the DataFrame
print(df.head())
Output:
   id     name  age      city
0   1    Alice   25  New York
1   2      Bob   30   Chicago
2   3  Charlie   35    Boston
3   4    David   40   Seattle
4   5     Emma   28    Denver
The read_parquet() function automatically handles importing the data with the appropriate schema and data types.
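Because the schema is stored in the file itself, the column dtypes come back exactly as they were written; a quick check on the DataFrame we just loaded:
# Parquet preserves dtypes, so no type inference or string parsing is needed
print(df.dtypes)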
Customizing Parquet Import
Specifying the Engine
You can explicitly specify which engine to use:
# Using pyarrow engine
df = pd.read_parquet('data.parquet', engine='pyarrow')
# Or using fastparquet engine (requires fastparquet package)
# df = pd.read_parquet('data.parquet', engine='fastparquet')
Reading Specific Columns
For large datasets, you might want to read only specific columns:
# Read only specific columns
df = pd.read_parquet('data.parquet', columns=['name', 'age'])
print(df.head())
Output:
      name  age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40
4     Emma   28
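When the pyarrow engine is used, you can also push simple row filters down to the reader with the filters parameter, so only matching rows (or row groups, depending on the pyarrow version) are loaded into memory. A minimal sketch against the same example file:
# Load only rows where age is greater than 30 (pyarrow engine)
filtered_df = pd.read_parquet('data.parquet', columns=['name', 'age'],
                              filters=[('age', '>', 30)])
print(filtered_df)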
Reading From Different Sources
Pandas can read Parquet files from various sources:
# From a local file
df1 = pd.read_parquet('data.parquet')
# From a URL
df2 = pd.read_parquet('https://example.com/data.parquet')
# From S3 (requires s3fs package)
# df3 = pd.read_parquet('s3://bucket/data.parquet')
# From a BytesIO object
import io
with open('data.parquet', 'rb') as f:
    content = f.read()
df4 = pd.read_parquet(io.BytesIO(content))
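For remote sources, read_parquet also accepts a storage_options dictionary that is passed to the underlying filesystem library (s3fs in the S3 case); the bucket name below is just a placeholder:
# Pass filesystem options, e.g. anonymous access to a public S3 bucket
# df5 = pd.read_parquet('s3://bucket/data.parquet',
#                       storage_options={'anon': True})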
Exploring Parquet Metadata
Parquet files store metadata that provides information about the schema:
import pyarrow.parquet as pq
# Get metadata from a Parquet file
parquet_file = pq.ParquetFile('data.parquet')
metadata = parquet_file.metadata
# Print basic file information
print(f"Number of rows: {metadata.num_rows}")
print(f"Number of columns: {metadata.num_columns}")
print(f"Created by: {metadata.created_by}")
# Print schema information
print("\nSchema:")
schema = parquet_file.schema
print(schema)
Output:
Number of rows: 1000
Number of columns: 4
Created by: parquet-cpp version 1.5.1-SNAPSHOT
Schema:
id: INT64
name: UTF8
age: INT64
city: UTF8
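The metadata also exposes per-row-group and per-column statistics (min, max, null counts), which query engines use to skip data they don't need. A short sketch continuing with the same metadata object:
# Inspect the first row group and its first column chunk
rg = metadata.row_group(0)
col = rg.column(0)
print(f"Row group 0: {rg.num_rows} rows, {rg.total_byte_size:,} bytes")
print(f"Column '{col.path_in_schema}' statistics: {col.statistics}")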
Handling Partitioned Parquet Data
Parquet supports partitioning data across multiple files based on column values, which is common in big data systems:
# Read a partitioned dataset
# In this example, data is partitioned by 'year' and 'month' columns
df = pd.read_parquet('partitioned_data/')
# The partitioning columns are included automatically
print(df.head())
Output:
   id     name  age      city  year month
0   1    Alice   25  New York  2022    01
1   2      Bob   30   Chicago  2022    01
2   3  Charlie   35    Boston  2022    01
3   4    David   40   Seattle  2022    02
4   5     Emma   28    Denver  2022    02
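To produce a layout like this from pandas, pass partition_cols to to_parquet and it will write one subdirectory per value combination (a sketch, assuming a DataFrame df with year and month columns; the output path is a placeholder):
# Write a directory-based dataset partitioned by year and month
df.to_parquet('partitioned_output/', partition_cols=['year', 'month'])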
Converting Between Formats
CSV to Parquet
Converting a CSV file to Parquet can significantly improve performance:
# Read CSV file
csv_df = pd.read_csv('data.csv')
# Write to Parquet format
csv_df.to_parquet('data.parquet')
# Check file sizes
import os
csv_size = os.path.getsize('data.csv')
parquet_size = os.path.getsize('data.parquet')
print(f"CSV file size: {csv_size:,} bytes")
print(f"Parquet file size: {parquet_size:,} bytes")
print(f"Compression ratio: {csv_size / parquet_size:.2f}x")
Output:
CSV file size: 1,234,567 bytes
Parquet file size: 246,913 bytes
Compression ratio: 5.00x
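If the CSV is too large to load at once, you can convert it in chunks with pyarrow's ParquetWriter, writing one row group per chunk (a sketch, assuming every chunk has the same schema):
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
writer = None
for chunk in pd.read_csv('data.csv', chunksize=100_000):
    table = pa.Table.from_pandas(chunk)
    if writer is None:
        # Open the output file using the schema of the first chunk
        writer = pq.ParquetWriter('data.parquet', table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()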
Real-World Example: Analyzing Large Datasets
Let's see a practical example of working with a large dataset in Parquet format. In this example, we'll analyze a dataset of flight delays:
import pandas as pd
import matplotlib.pyplot as plt
# Read flight data from Parquet (much faster than CSV for large data)
flights = pd.read_parquet('flights.parquet')
# Show dataset information
flights.info()
# Calculate average delay by airline
avg_delays = flights.groupby('airline')['arrival_delay'].mean().sort_values()
# Create a simple visualization
plt.figure(figsize=(12, 6))
avg_delays.plot(kind='bar')
plt.title('Average Arrival Delay by Airline')
plt.xlabel('Airline')
plt.ylabel('Average Delay (minutes)')
plt.tight_layout()
plt.savefig('airline_delays.png')
plt.close()
print("Analysis complete! Check 'airline_delays.png' for the visualization.")
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5819079 entries, 0 to 5819078
Data columns (total 31 columns):
# Column Dtype
--- ------ -----
0 year int64
1 month int64
2 day int64
3 day_of_week int64
4 airline object
5 flight_number int64
6 origin object
7 destination object
8 departure_time float64
9 departure_delay float64
10 arrival_time float64
11 arrival_delay float64
...
dtypes: float64(7), int64(17), object(7)
memory usage: 1.3+ GB
Analysis complete! Check 'airline_delays.png' for the visualization.
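Since the analysis only touches two of the 31 columns, you can cut both the load time and the memory footprint further by requesting just those columns from the same file:
# Load only the columns the aggregation actually needs
flights = pd.read_parquet('flights.parquet', columns=['airline', 'arrival_delay'])
avg_delays = flights.groupby('airline')['arrival_delay'].mean().sort_values()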
Performance Comparison: CSV vs. Parquet
Let's compare the performance of reading CSV versus Parquet:
import pandas as pd
import time
import numpy as np
# Create a sample DataFrame
large_df = pd.DataFrame({
    'id': range(1000000),
    'value1': np.random.randn(1000000),
    'value2': np.random.randn(1000000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 1000000)
})
# Save as CSV and Parquet
large_df.to_csv('large_data.csv', index=False)
large_df.to_parquet('large_data.parquet', index=False)
# Test reading from CSV
start_time = time.time()
csv_df = pd.read_csv('large_data.csv')
csv_time = time.time() - start_time
print(f"CSV read time: {csv_time:.2f} seconds")
# Test reading from Parquet
start_time = time.time()
parquet_df = pd.read_parquet('large_data.parquet')
parquet_time = time.time() - start_time
print(f"Parquet read time: {parquet_time:.2f} seconds")
# Calculate speedup
speedup = csv_time / parquet_time
print(f"Parquet is {speedup:.2f}x faster than CSV")
Output:
CSV read time: 4.23 seconds
Parquet read time: 0.31 seconds
Parquet is 13.65x faster than CSV
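The gap widens further when you only need a subset of columns, because Parquet decodes only the columns you ask for. A quick follow-up timing on the same generated file:
# Time a column-subset read -- only the requested columns are decoded
start_time = time.time()
subset_df = pd.read_parquet('large_data.parquet', columns=['id', 'category'])
subset_time = time.time() - start_time
print(f"Parquet column-subset read time: {subset_time:.2f} seconds")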
Advanced Parquet Options
Compression Options
You can specify different compression codecs when saving to Parquet:
# Available compression codecs: 'snappy', 'gzip', 'brotli', 'lz4', 'zstd', None (uncompressed)
df.to_parquet('data_snappy.parquet', compression='snappy') # Default, good balance
df.to_parquet('data_gzip.parquet', compression='gzip') # Better compression, slower
df.to_parquet('data_uncompressed.parquet', compression=None) # No compression
# Check sizes
import os
for file in ['data_snappy.parquet', 'data_gzip.parquet', 'data_uncompressed.parquet']:
    size = os.path.getsize(file)
    print(f"{file}: {size:,} bytes")
Output:
data_snappy.parquet: 2,345,678 bytes
data_gzip.parquet: 2,123,456 bytes
data_uncompressed.parquet: 5,678,901 bytes
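Compression trades file size against CPU time, so it is worth timing reads of each variant as well (continuing with the files written above):
import time
for file in ['data_snappy.parquet', 'data_gzip.parquet', 'data_uncompressed.parquet']:
    start = time.time()
    pd.read_parquet(file)
    print(f"{file}: read in {time.time() - start:.2f} seconds")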
Row Group Size
You can optimize for your use case by adjusting the row group size:
# For large files that will be queried partially, smaller row groups can be better
df.to_parquet('data_small_groups.parquet', row_group_size=10000)
# For files that will typically be read in full, larger row groups can be more efficient
df.to_parquet('data_large_groups.parquet', row_group_size=100000)
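You can confirm how many row groups each file ended up with by checking the Parquet metadata (a quick check with pyarrow on the files written above):
import pyarrow.parquet as pq
for file in ['data_small_groups.parquet', 'data_large_groups.parquet']:
    num_groups = pq.ParquetFile(file).metadata.num_row_groups
    print(f"{file}: {num_groups} row groups")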
Summary
In this tutorial, we've covered:
- How to import Parquet files into pandas DataFrames
- The benefits of using Parquet format over other formats
- Reading specific columns and from different sources
- Working with partitioned Parquet data
- Converting between formats
- Performance comparison between CSV and Parquet
- Advanced options for Parquet files
Parquet is an excellent choice for data analysis workflows due to its performance benefits, especially when working with large datasets. The columnar storage format allows for more efficient querying, filtering, and compression compared to row-based formats like CSV.
Additional Resources
Exercises
- Download a CSV dataset from a public source (like Kaggle), convert it to Parquet format, and compare the file sizes and read times.
- Create a partitioned Parquet dataset based on a categorical column in your data.
- Use read_parquet() with the filters parameter to load only specific rows from a large Parquet file.
- Compare the performance of different compression codecs (snappy, gzip, zstd) for your specific dataset.
- Create a data pipeline that reads data from CSV, performs transformations, and saves the results in Parquet format for later analysis.