Pandas Parquet Import

Introduction

The Parquet format is a columnar storage file format designed for efficient data processing and storage. It's becoming increasingly popular in the data science ecosystem due to its performance benefits and compatibility with big data frameworks. In this tutorial, we'll learn how to import Parquet files into pandas DataFrames and explore the advantages of using this format for data analysis tasks.

Parquet offers several benefits over other file formats:

  • Columnar storage: Data is stored column-wise rather than row-wise, enabling more efficient querying and compression
  • Schema preservation: Data types and column names are stored with the data
  • Compression: Built-in compression reduces file size significantly
  • Performance: Much faster read/write operations than CSV or JSON files
  • Big data compatibility: Works seamlessly with tools like Apache Spark, Hadoop, and Dask

Prerequisites

To work with Parquet files, you'll need to install the pyarrow or fastparquet engine. We'll use pyarrow in this tutorial:

bash
pip install pandas pyarrow

Basic Parquet Import

Let's start with the simplest way to read a Parquet file into a pandas DataFrame:

python
import pandas as pd

# Read a Parquet file
df = pd.read_parquet('data.parquet')

# Display the DataFrame
print(df.head())

Output:

   id     name  age      city
0   1    Alice   25  New York
1   2      Bob   30   Chicago
2   3  Charlie   35    Boston
3   4    David   40   Seattle
4   5     Emma   28    Denver

The read_parquet() function automatically applies the stored schema, so column names and data types are restored exactly as they were written.
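
To see this schema preservation in practice, you can round-trip a small DataFrame through Parquet and compare dtypes before and after. The snippet below is a minimal sketch using made-up example data and the hypothetical file name roundtrip.parquet:

python
import pandas as pd

# Build a small DataFrame with mixed dtypes (hypothetical example data)
sample = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'signup': pd.to_datetime(['2022-01-01', '2022-02-01', '2022-03-01'])
})

# Write to Parquet and read it back
sample.to_parquet('roundtrip.parquet')
restored = pd.read_parquet('roundtrip.parquet')

# The int64, object, and datetime64[ns] dtypes survive the round trip
print(restored.dtypes)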

Customizing Parquet Import

Specifying the Engine

You can explicitly specify which engine to use:

python
# Using pyarrow engine
df = pd.read_parquet('data.parquet', engine='pyarrow')

# Or using fastparquet engine (requires fastparquet package)
# df = pd.read_parquet('data.parquet', engine='fastparquet')

Reading Specific Columns

For large datasets, you might want to read only specific columns:

python
# Read only specific columns
df = pd.read_parquet('data.parquet', columns=['name', 'age'])
print(df.head())

Output:

      name  age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40
4     Emma   28

Reading From Different Sources

Pandas can read Parquet files from various sources:

python
# From a local file
df1 = pd.read_parquet('data.parquet')

# From a URL
df2 = pd.read_parquet('https://example.com/data.parquet')

# From S3 (requires s3fs package)
# df3 = pd.read_parquet('s3://bucket/data.parquet')

# From a BytesIO object
import io
with open('data.parquet', 'rb') as f:
    content = f.read()
df4 = pd.read_parquet(io.BytesIO(content))

Exploring Parquet Metadata

Parquet files store metadata that provides information about the schema:

python
import pyarrow.parquet as pq

# Get metadata from a Parquet file
parquet_file = pq.ParquetFile('data.parquet')
metadata = parquet_file.metadata

# Print basic file information
print(f"Number of rows: {metadata.num_rows}")
print(f"Number of columns: {metadata.num_columns}")
print(f"Created by: {metadata.created_by}")

# Print schema information
print("\nSchema:")
schema = parquet_file.schema
print(schema)

Output:

Number of rows: 1000
Number of columns: 4
Created by: parquet-cpp version 1.5.1-SNAPSHOT

Schema:
id: INT64
name: UTF8
age: INT64
city: UTF8

Handling Partitioned Parquet Data

Parquet supports partitioning data across multiple files based on column values, which is common in big data systems:

python
# Read a partitioned dataset
# In this example, data is partitioned by 'year' and 'month' columns
df = pd.read_parquet('partitioned_data/')

# The partitioning columns are included automatically
print(df.head())

Output:

   id     name  age      city  year  month
0   1    Alice   25  New York  2022     01
1   2      Bob   30   Chicago  2022     01
2   3  Charlie   35    Boston  2022     01
3   4    David   40   Seattle  2022     02
4   5     Emma   28    Denver  2022     02
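
If you want to create a partitioned dataset yourself, to_parquet() accepts a partition_cols argument that writes one subdirectory per combination of values. Here is a minimal sketch, assuming a DataFrame df that already contains 'year' and 'month' columns:

python
# Write one subdirectory per year/month combination,
# e.g. partitioned_data/year=2022/month=1/
# (assumes df already has 'year' and 'month' columns)
df.to_parquet('partitioned_data/', partition_cols=['year', 'month'])

# Reading the directory back restores the partition columns as regular columns
df_back = pd.read_parquet('partitioned_data/')
print(df_back.columns.tolist())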

Converting Between Formats

CSV to Parquet

Converting a CSV file to Parquet can significantly improve performance:

python
# Read CSV file
csv_df = pd.read_csv('data.csv')

# Write to Parquet format
csv_df.to_parquet('data.parquet')

# Check file sizes
import os
csv_size = os.path.getsize('data.csv')
parquet_size = os.path.getsize('data.parquet')

print(f"CSV file size: {csv_size:,} bytes")
print(f"Parquet file size: {parquet_size:,} bytes")
print(f"Compression ratio: {csv_size / parquet_size:.2f}x")

Output:

CSV file size: 1,234,567 bytes
Parquet file size: 246,913 bytes
Compression ratio: 5.00x

Real-World Example: Analyzing Large Datasets

Let's look at a practical example of working with a large dataset in Parquet format: analyzing a dataset of flight delays:

python
import pandas as pd
import matplotlib.pyplot as plt

# Read flight data from Parquet (much faster than CSV for large data)
flights = pd.read_parquet('flights.parquet')

# Show dataset information
flights.info()

# Calculate average delay by airline
avg_delays = flights.groupby('airline')['arrival_delay'].mean().sort_values()

# Create a simple visualization
plt.figure(figsize=(12, 6))
avg_delays.plot(kind='bar')
plt.title('Average Arrival Delay by Airline')
plt.xlabel('Airline')
plt.ylabel('Average Delay (minutes)')
plt.tight_layout()
plt.savefig('airline_delays.png')
plt.close()

print("Analysis complete! Check 'airline_delays.png' for the visualization.")

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5819079 entries, 0 to 5819078
Data columns (total 31 columns):
 #   Column           Dtype
---  ------           -----
 0   year             int64
 1   month            int64
 2   day              int64
 3   day_of_week      int64
 4   airline          object
 5   flight_number    int64
 6   origin           object
 7   destination      object
 8   departure_time   float64
 9   departure_delay  float64
 10  arrival_time     float64
 11  arrival_delay    float64
...
dtypes: float64(7), int64(17), object(7)
memory usage: 1.3+ GB

Analysis complete! Check 'airline_delays.png' for the visualization.

Performance Comparison: CSV vs. Parquet

Let's compare the performance of reading CSV versus Parquet:

python
import pandas as pd
import time

# Create a sample DataFrame
import numpy as np
large_df = pd.DataFrame({
    'id': range(1000000),
    'value1': np.random.randn(1000000),
    'value2': np.random.randn(1000000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 1000000)
})

# Save as CSV and Parquet
large_df.to_csv('large_data.csv', index=False)
large_df.to_parquet('large_data.parquet', index=False)

# Test reading from CSV
start_time = time.time()
csv_df = pd.read_csv('large_data.csv')
csv_time = time.time() - start_time
print(f"CSV read time: {csv_time:.2f} seconds")

# Test reading from Parquet
start_time = time.time()
parquet_df = pd.read_parquet('large_data.parquet')
parquet_time = time.time() - start_time
print(f"Parquet read time: {parquet_time:.2f} seconds")

# Calculate speedup
speedup = csv_time / parquet_time
print(f"Parquet is {speedup:.2f}x faster than CSV")

Output:

CSV read time: 4.23 seconds
Parquet read time: 0.31 seconds
Parquet is 13.65x faster than CSV

Advanced Parquet Options

Compression Options

You can specify different compression codecs when saving to Parquet:

python
# Available compression codecs: 'snappy', 'gzip', 'brotli', 'lz4', 'zstd', None (uncompressed)
df.to_parquet('data_snappy.parquet', compression='snappy') # Default, good balance
df.to_parquet('data_gzip.parquet', compression='gzip') # Better compression, slower
df.to_parquet('data_uncompressed.parquet', compression=None) # No compression

# Check sizes
import os
for file in ['data_snappy.parquet', 'data_gzip.parquet', 'data_uncompressed.parquet']:
    size = os.path.getsize(file)
    print(f"{file}: {size:,} bytes")

Output:

data_snappy.parquet: 2,345,678 bytes
data_gzip.parquet: 2,123,456 bytes
data_uncompressed.parquet: 5,678,901 bytes

Row Group Size

You can optimize for your use case by adjusting the row group size:

python
# For large files that will be queried partially, smaller row groups can be better
df.to_parquet('data_small_groups.parquet', row_group_size=10000)

# For files that will typically be read in full, larger row groups can be more efficient
df.to_parquet('data_large_groups.parquet', row_group_size=100000)
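
To check how a file was actually laid out, you can inspect its row groups through pyarrow. A short sketch, assuming the two files written above exist:

python
import pyarrow.parquet as pq

# Compare how many row groups each file ended up with
for path in ['data_small_groups.parquet', 'data_large_groups.parquet']:
    meta = pq.ParquetFile(path).metadata
    print(f"{path}: {meta.num_row_groups} row groups, "
          f"{meta.row_group(0).num_rows} rows in the first group")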

Summary

In this tutorial, we've covered:

  • How to import Parquet files into pandas DataFrames
  • The benefits of using Parquet format over other formats
  • Reading specific columns and from different sources
  • Working with partitioned Parquet data
  • Converting between formats
  • Performance comparison between CSV and Parquet
  • Advanced options for Parquet files

Parquet is an excellent choice for data analysis workflows due to its performance benefits, especially when working with large datasets. The columnar storage format allows for more efficient querying, filtering, and compression compared to row-based formats like CSV.

Exercises

  1. Download a CSV dataset from a public source (like Kaggle), convert it to Parquet format, and compare the file sizes and read times.
  2. Create a partitioned Parquet dataset based on a categorical column in your data.
  3. Use read_parquet() with the filters parameter to load only specific rows from a large Parquet file.
  4. Compare the performance of different compression codecs (snappy, gzip, zstd) for your specific dataset.
  5. Create a data pipeline that reads data from CSV, performs transformations, and saves the results in Parquet format for later analysis.
