Pandas Parquet Import

Introduction

The Parquet format is a columnar storage file format designed for efficient data processing and storage. It's becoming increasingly popular in the data science ecosystem due to its performance benefits and compatibility with big data frameworks. In this tutorial, we'll learn how to import Parquet files into pandas DataFrames and explore the advantages of using this format for data analysis tasks.

Parquet offers several benefits over other file formats:

  • Columnar storage: Data is stored column-wise rather than row-wise, enabling more efficient querying and compression
  • Schema preservation: Data types and column names are stored with the data
  • Compression: Built-in compression reduces file size significantly
  • Performance: Much faster read/write operations than CSV or JSON files
  • Big data compatibility: Works seamlessly with tools like Apache Spark, Hadoop, and Dask

Prerequisites

To work with Parquet files, you'll need to install the pyarrow or fastparquet engine. We'll use pyarrow in this tutorial:

bash
pip install pandas pyarrow

Basic Parquet Import

Let's start with the simplest way to read a Parquet file into a pandas DataFrame:

python
import pandas as pd

# Read a Parquet file
df = pd.read_parquet('data.parquet')

# Display the DataFrame
print(df.head())

Output:

   id     name  age      city
0   1    Alice   25  New York
1   2      Bob   30   Chicago
2   3  Charlie   35    Boston
3   4    David   40   Seattle
4   5     Emma   28    Denver

The read_parquet() function automatically applies the stored schema, so column names and data types are restored exactly as they were written.
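
To see this schema preservation in practice, you can round-trip a small DataFrame through Parquet and compare dtypes before and after. The snippet below is a minimal sketch using made-up example data and the hypothetical file name roundtrip.parquet:

python
import pandas as pd

# Build a small DataFrame with mixed dtypes (hypothetical example data)
sample = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'signup': pd.to_datetime(['2022-01-01', '2022-02-01', '2022-03-01'])
})

# Write to Parquet and read it back
sample.to_parquet('roundtrip.parquet')
restored = pd.read_parquet('roundtrip.parquet')

# The int64, object, and datetime64[ns] dtypes survive the round trip
print(restored.dtypes)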

Customizing Parquet Import

Specifying the Engine

You can explicitly specify which engine to use:

python
# Using pyarrow engine
df = pd.read_parquet('data.parquet', engine='pyarrow')

# Or using fastparquet engine (requires fastparquet package)
# df = pd.read_parquet('data.parquet', engine='fastparquet')

Reading Specific Columns

For large datasets, you might want to read only specific columns:

python
# Read only specific columns
df = pd.read_parquet('data.parquet', columns=['name', 'age'])
print(df.head())

Output:

      name  age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40
4     Emma   28

Reading From Different Sources

Pandas can read Parquet files from various sources:

python
# From a local file
df1 = pd.read_parquet('data.parquet')

# From a URL
df2 = pd.read_parquet('https://example.com/data.parquet')

# From S3 (requires s3fs package)
# df3 = pd.read_parquet('s3://bucket/data.parquet')

# From a BytesIO object
import io
with open('data.parquet', 'rb') as f:
    content = f.read()
df4 = pd.read_parquet(io.BytesIO(content))

Exploring Parquet Metadata

Parquet files store metadata that provides information about the schema:

python
import pyarrow.parquet as pq

# Get metadata from a Parquet file
parquet_file = pq.ParquetFile('data.parquet')
metadata = parquet_file.metadata

# Print basic file information
print(f"Number of rows: {metadata.num_rows}")
print(f"Number of columns: {metadata.num_columns}")
print(f"Created by: {metadata.created_by}")

# Print schema information
print("\nSchema:")
schema = parquet_file.schema
print(schema)

Output:

Number of rows: 1000
Number of columns: 4
Created by: parquet-cpp version 1.5.1-SNAPSHOT

Schema:
id: INT64
name: UTF8
age: INT64
city: UTF8

Handling Partitioned Parquet Data

Parquet supports partitioning data across multiple files based on column values, which is common in big data systems:

python
# Read a partitioned dataset
# In this example, data is partitioned by 'year' and 'month' columns
df = pd.read_parquet('partitioned_data/')

# The partitioning columns are included automatically
print(df.head())

Output:

   id     name  age      city  year  month
0   1    Alice   25  New York  2022     01
1   2      Bob   30   Chicago  2022     01
2   3  Charlie   35    Boston  2022     01
3   4    David   40   Seattle  2022     02
4   5     Emma   28    Denver  2022     02
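
If you want to create a partitioned dataset yourself, to_parquet() accepts a partition_cols argument that writes one subdirectory per combination of values. Here is a minimal sketch, assuming a DataFrame df that already contains 'year' and 'month' columns:

python
# Write one subdirectory per year/month combination,
# e.g. partitioned_data/year=2022/month=1/
# (assumes df already has 'year' and 'month' columns)
df.to_parquet('partitioned_data/', partition_cols=['year', 'month'])

# Reading the directory back restores the partition columns as regular columns
df_back = pd.read_parquet('partitioned_data/')
print(df_back.columns.tolist())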

Converting Between Formats

CSV to Parquet

Converting a CSV file to Parquet can significantly improve performance:

python
# Read CSV file
csv_df = pd.read_csv('data.csv')

# Write to Parquet format
csv_df.to_parquet('data.parquet')

# Check file sizes
import os
csv_size = os.path.getsize('data.csv')
parquet_size = os.path.getsize('data.parquet')

print(f"CSV file size: {csv_size:,} bytes")
print(f"Parquet file size: {parquet_size:,} bytes")
print(f"Compression ratio: {csv_size / parquet_size:.2f}x")

Output:

CSV file size: 1,234,567 bytes
Parquet file size: 246,913 bytes
Compression ratio: 5.00x

Real-World Example: Analyzing Large Datasets

Let's look at a practical example of working with a large dataset in Parquet format: analyzing a dataset of flight delays:

python
import pandas as pd
import matplotlib.pyplot as plt

# Read flight data from Parquet (much faster than CSV for large data)
flights = pd.read_parquet('flights.parquet')

# Show dataset information
flights.info()

# Calculate average delay by airline
avg_delays = flights.groupby('airline')['arrival_delay'].mean().sort_values()

# Create a simple visualization
plt.figure(figsize=(12, 6))
avg_delays.plot(kind='bar')
plt.title('Average Arrival Delay by Airline')
plt.xlabel('Airline')
plt.ylabel('Average Delay (minutes)')
plt.tight_layout()
plt.savefig('airline_delays.png')
plt.close()

print("Analysis complete! Check 'airline_delays.png' for the visualization.")

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5819079 entries, 0 to 5819078
Data columns (total 31 columns):
 #   Column           Dtype
---  ------           -----
 0   year             int64
 1   month            int64
 2   day              int64
 3   day_of_week      int64
 4   airline          object
 5   flight_number    int64
 6   origin           object
 7   destination      object
 8   departure_time   float64
 9   departure_delay  float64
 10  arrival_time     float64
 11  arrival_delay    float64
...
dtypes: float64(7), int64(17), object(7)
memory usage: 1.3+ GB

Analysis complete! Check 'airline_delays.png' for the visualization.

Performance Comparison: CSV vs. Parquet

Let's compare the performance of reading CSV versus Parquet:

python
import pandas as pd
import time

# Create a sample DataFrame
import numpy as np
large_df = pd.DataFrame({
    'id': range(1000000),
    'value1': np.random.randn(1000000),
    'value2': np.random.randn(1000000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 1000000)
})

# Save as CSV and Parquet
large_df.to_csv('large_data.csv', index=False)
large_df.to_parquet('large_data.parquet', index=False)

# Test reading from CSV
start_time = time.time()
csv_df = pd.read_csv('large_data.csv')
csv_time = time.time() - start_time
print(f"CSV read time: {csv_time:.2f} seconds")

# Test reading from Parquet
start_time = time.time()
parquet_df = pd.read_parquet('large_data.parquet')
parquet_time = time.time() - start_time
print(f"Parquet read time: {parquet_time:.2f} seconds")

# Calculate speedup
speedup = csv_time / parquet_time
print(f"Parquet is {speedup:.2f}x faster than CSV")

Output:

CSV read time: 4.23 seconds
Parquet read time: 0.31 seconds
Parquet is 13.65x faster than CSV

Advanced Parquet Options

Compression Options

You can specify different compression codecs when saving to Parquet:

python
# Available compression codecs: 'snappy', 'gzip', 'brotli', 'lz4', 'zstd', None (uncompressed)
df.to_parquet('data_snappy.parquet', compression='snappy') # Default, good balance
df.to_parquet('data_gzip.parquet', compression='gzip') # Better compression, slower
df.to_parquet('data_uncompressed.parquet', compression=None) # No compression

# Check sizes
import os
for file in ['data_snappy.parquet', 'data_gzip.parquet', 'data_uncompressed.parquet']:
    size = os.path.getsize(file)
    print(f"{file}: {size:,} bytes")

Output:

data_snappy.parquet: 2,345,678 bytes
data_gzip.parquet: 2,123,456 bytes
data_uncompressed.parquet: 5,678,901 bytes

Row Group Size

You can optimize for your use case by adjusting the row group size:

python
# For large files that will be queried partially, smaller row groups can be better
df.to_parquet('data_small_groups.parquet', row_group_size=10000)

# For files that will typically be read in full, larger row groups can be more efficient
df.to_parquet('data_large_groups.parquet', row_group_size=100000)
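
To check how a file was actually laid out, you can inspect its row groups through pyarrow. A short sketch, assuming the two files written above exist:

python
import pyarrow.parquet as pq

# Compare how many row groups each file ended up with
for path in ['data_small_groups.parquet', 'data_large_groups.parquet']:
    meta = pq.ParquetFile(path).metadata
    print(f"{path}: {meta.num_row_groups} row groups, "
          f"{meta.row_group(0).num_rows} rows in the first group")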

Summary

In this tutorial, we've covered:

  • How to import Parquet files into pandas DataFrames
  • The benefits of using Parquet format over other formats
  • Reading specific columns and from different sources
  • Working with partitioned Parquet data
  • Converting between formats
  • Performance comparison between CSV and Parquet
  • Advanced options for Parquet files

Parquet is an excellent choice for data analysis workflows due to its performance benefits, especially when working with large datasets. The columnar storage format allows for more efficient querying, filtering, and compression compared to row-based formats like CSV.

Exercises

  1. Download a CSV dataset from a public source (like Kaggle), convert it to Parquet format, and compare the file sizes and read times.
  2. Create a partitioned Parquet dataset based on a categorical column in your data.
  3. Use read_parquet() with the filters parameter to load only specific rows from a large Parquet file.
  4. Compare the performance of different compression codecs (snappy, gzip, zstd) for your specific dataset.
  5. Create a data pipeline that reads data from CSV, performs transformations, and saves the results in Parquet format for later analysis.
