Pandas Export Compression
Introduction
When working with large datasets in pandas, file sizes can quickly become unwieldy. Compression techniques help reduce file sizes, save disk space, and sometimes even improve read/write performance. In this tutorial, you'll learn how to apply different compression methods when exporting pandas DataFrames to various file formats.
Compression is especially useful when:
- You're working with large datasets
- You need to transfer files over networks
- You want to optimize storage space
- You're archiving data for long-term storage
Supported Compression Formats
Pandas supports several compression formats depending on the file type you're exporting to:
- gzip: General-purpose compression, a good balance of speed and compression
- bz2: Higher compression ratio but slower than gzip
- zip: Common format, widely compatible
- xz: Very high compression ratio but slower
- zstd: Fast compression with a good compression ratio (newer option)
- snappy: Very fast compression, lower compression ratio
Let's explore how to use these with different file formats.
Compressing CSV Exports
Basic CSV Compression
To export a DataFrame to a compressed CSV file, use the compression parameter in the to_csv() method:
import pandas as pd
import numpy as np
# Create sample data
df = pd.DataFrame({
'A': np.random.rand(100000),
'B': np.random.rand(100000),
'C': np.random.choice(['X', 'Y', 'Z'], 100000)
})
# Export to gzip-compressed CSV
df.to_csv('data_compressed.csv.gz', compression='gzip')
# Export to bz2-compressed CSV
df.to_csv('data_compressed.csv.bz2', compression='bz2')
# Export without compression for comparison
df.to_csv('data_uncompressed.csv')
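The other supported methods follow the same pattern. For zip, you can also pass a dict to set method-specific options such as the name of the CSV stored inside the archive; a short sketch, assuming a recent pandas version (zstd additionally requires the optional zstandard package):
# Export to a zip archive, naming the CSV stored inside it
df.to_csv('data_compressed.zip',
          compression={'method': 'zip', 'archive_name': 'data.csv'})
# Export with zstd (needs the optional zstandard package)
df.to_csv('data_compressed.csv.zst', compression='zstd')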
Comparing File Sizes
Let's compare the sizes of the files we just created:
import os
# Function to get file size in KB
def get_size_kb(filename):
return os.path.getsize(filename) / 1024
# Compare sizes
print(f"Uncompressed: {get_size_kb('data_uncompressed.csv'):.2f} KB")
print(f"Gzip compressed: {get_size_kb('data_compressed.csv.gz'):.2f} KB")
print(f"BZ2 compressed: {get_size_kb('data_compressed.csv.bz2'):.2f} KB")
Example output:
Uncompressed: 4883.12 KB
Gzip compressed: 1208.45 KB
BZ2 compressed: 1036.22 KB
Adjusting Compression Level
Some compression methods allow you to specify a compression level, trading off speed versus compression ratio:
# Highest compression with gzip (9 is maximum)
df.to_csv('data_max_compressed.csv.gz', compression={'method': 'gzip', 'compresslevel': 9})
# Faster compression with gzip (1 is minimum)
df.to_csv('data_fast_compressed.csv.gz', compression={'method': 'gzip', 'compresslevel': 1})
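Reading these files back needs no extra work: read_csv defaults to compression='infer', so the method is picked up from the file extension:
# Compression is inferred from the .gz extension
df_back = pd.read_csv('data_compressed.csv.gz', index_col=0)
print(df_back.shape)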
Compressing Excel Files
Excel with ZIP Compression
When exporting to Excel, pandas uses the openpyxl or xlsxwriter engine, and the .xlsx format is itself a ZIP archive, so the output is already compressed. You can still pass engine-specific options, such as xlsxwriter's constant_memory mode, which reduces memory usage when writing large workbooks:
# Export to Excel with default compression
df.to_excel('data.xlsx')
# With xlsxwriter, workbook options are passed through ExcelWriter
with pd.ExcelWriter('data_compressed.xlsx', engine='xlsxwriter',
                    engine_kwargs={'options': {'constant_memory': True}}) as writer:
    df.to_excel(writer)
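You can confirm that the resulting workbook really is a ZIP archive with the standard library; a quick check, assuming the files above were written:
import zipfile
# .xlsx files are ZIP containers holding XML parts
print(zipfile.is_zipfile('data.xlsx'))
print(zipfile.ZipFile('data.xlsx').namelist()[:3])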
Compressing to Parquet Format
Parquet is already an efficient columnar storage format, but it supports additional compression:
# Export to Parquet with snappy compression (default)
df.to_parquet('data.parquet')
# With gzip compression (better ratio, slower)
df.to_parquet('data_gzip.parquet', compression='gzip')
# With different compression per column (supported by the pyarrow engine)
df.to_parquet('data_mixed.parquet',
compression={'A': 'snappy', 'B': 'gzip', 'C': None})
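As a quick sanity check, you can compare the Parquet files with the CSV exports from earlier, reusing the get_size_kb helper defined above (this assumes pyarrow is installed and the files have been written):
# Parquet is compact even before extra compression is applied
print(f"Parquet (snappy): {get_size_kb('data.parquet'):.2f} KB")
print(f"Parquet (gzip):   {get_size_kb('data_gzip.parquet'):.2f} KB")
print(f"Parquet (mixed):  {get_size_kb('data_mixed.parquet'):.2f} KB")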
Reading Compressed Parquet Files
Pandas automatically detects and handles compression when reading:
# Read compressed parquet file (compression detected automatically)
df_read = pd.read_parquet('data_gzip.parquet')
print(df_read.head())
Compressing to HDF5 Format
HDF5 is a hierarchical data format that also supports compression:
# Export to HDF5 (no compression is applied unless complevel is set)
df.to_hdf('data.h5', key='df')
# With specific compression method and level
df.to_hdf('data_compressed.h5', key='df', complevel=9, complib='zlib')
# Different compression options
compression_options = {
'zlib': 'Good general purpose compression',
'lzo': 'Very fast compression',
'bzip2': 'Higher compression ratio',
'blosc': 'Fast multi-threaded compression'
}
print("Available HDF5 compression libraries:")
for lib, desc in compression_options.items():
print(f"- {lib}: {desc}")
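Reading a compressed store works exactly like reading an uncompressed one, since PyTables decompresses transparently; a short sketch, again reusing get_size_kb and assuming the files above were written:
# Compression is handled transparently on read
df_h5 = pd.read_hdf('data_compressed.h5', key='df')
print(f"HDF5 (default):       {get_size_kb('data.h5'):.2f} KB")
print(f"HDF5 (zlib, level 9): {get_size_kb('data_compressed.h5'):.2f} KB")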
Real-World Application: Batch Processing Large Files
Here's a practical example of processing a large dataset in chunks and saving with compression:
import pandas as pd
import numpy as np
import os
# Simulate a large dataset creator
def create_large_csv(filename, rows=1000000):
chunk_size = 100000
for i in range(0, rows, chunk_size):
df_chunk = pd.DataFrame({
'id': range(i, i + chunk_size),
'value': np.random.randn(chunk_size),
'category': np.random.choice(['A', 'B', 'C', 'D'], chunk_size)
})
mode = 'w' if i == 0 else 'a'
header = i == 0
df_chunk.to_csv(filename, mode=mode, header=header, index=False)
print(f"Created {filename} with {rows} rows")
# Create a large uncompressed file
create_large_csv('large_dataset.csv')
# Process in chunks with compression
def process_and_compress(input_file, output_file, compression, chunk_size=100000):
total_chunks = 0
# Process file in chunks
for chunk in pd.read_csv(input_file, chunksize=chunk_size):
# Perform some operation (example: calculate moving average)
chunk['moving_avg'] = chunk['value'].rolling(window=3, min_periods=1).mean()
# Append to compressed output
mode = 'w' if total_chunks == 0 else 'a'
header = total_chunks == 0
chunk.to_csv(output_file,
mode=mode,
header=header,
index=False,
compression=compression)
total_chunks += 1
print(f"Processed {total_chunks} chunks and saved to {output_file}")
# Process and compress with different methods
process_and_compress('large_dataset.csv', 'processed_gzip.csv.gz', 'gzip')
process_and_compress('large_dataset.csv', 'processed_bz2.csv.bz2', 'bz2')
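To confirm that nothing was lost along the way, you can read the compressed outputs back in chunks and compare row counts. count_rows below is just an ad-hoc helper for this check:
# Count rows in a (possibly compressed) CSV without loading it all at once
def count_rows(filename, chunk_size=100000):
    return sum(len(chunk) for chunk in pd.read_csv(filename, chunksize=chunk_size))

print(count_rows('large_dataset.csv'))
print(count_rows('processed_gzip.csv.gz'))
print(count_rows('processed_bz2.csv.bz2'))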
Compression Selection Guidelines
When choosing a compression method, consider these factors:
- File size priority: Use bz2 or xz for the best compression ratio
- Speed priority: Use snappy or zstd for the fastest compression/decompression
- Balanced approach: Use gzip for a good balance between speed and size
- Compatibility concerns: Use zip or gzip for the widest compatibility
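One simple way to encode these rules is a small lookup helper; choose_compression below is purely illustrative and not part of the pandas API:
# Hypothetical helper mapping a priority to a compression method
def choose_compression(priority='balanced'):
    return {
        'size': 'xz',
        'speed': 'zstd',
        'balanced': 'gzip',
        'compatibility': 'zip',
    }.get(priority, 'gzip')

df.to_csv('guided_export.csv.gz', compression=choose_compression('balanced'))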
Performance Benchmark
Here's a simple benchmark to compare different compression methods:
import pandas as pd
import numpy as np
import time
import os
# Create larger sample data
df_large = pd.DataFrame({
'A': np.random.rand(500000),
'B': np.random.rand(500000),
'C': np.random.choice(['X', 'Y', 'Z'], 500000),
'D': np.random.randint(1, 100, 500000),
'E': pd.date_range('2020-01-01', periods=500000)
})
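# Note: 'zstd' requires the optional zstandard package; drop it from the list if it's not installed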
compression_methods = ['gzip', 'bz2', 'zip', 'xz', 'zstd']
results = []
for method in compression_methods:
filename = f'benchmark_{method}.csv.{method}'
# Measure write time
start_time = time.time()
df_large.to_csv(filename, compression=method)
write_time = time.time() - start_time
# Get file size
file_size = os.path.getsize(filename) / (1024 * 1024) # Size in MB
# Measure read time
start_time = time.time()
df_read = pd.read_csv(filename, compression=method)
read_time = time.time() - start_time
results.append({
'Method': method,
'File Size (MB)': round(file_size, 2),
'Write Time (s)': round(write_time, 2),
'Read Time (s)': round(read_time, 2)
})
# Display results
benchmark_df = pd.DataFrame(results)
print(benchmark_df)
Example output:
Method File Size (MB) Write Time (s) Read Time (s)
0 gzip 10.42 3.21 1.85
1 bz2 8.76 12.53 5.32
2 zip 11.05 2.74 1.92
3 xz 7.91 18.37 2.45
4 zstd 9.87 1.47 1.21
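If you don't need the benchmark files afterwards, you can remove them:
# Clean up the files created by the benchmark
for method in compression_methods:
    os.remove(f'benchmark_{method}.csv.{method}')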
Summary
In this tutorial, you've learned how to:
- Export pandas DataFrames to various file formats with compression
- Choose the right compression algorithm for your needs
- Apply compression options to CSV, Excel, Parquet, and HDF5 files
- Process large files in chunks with compression
- Benchmark different compression methods to make informed decisions
Using compression when exporting data can significantly reduce file sizes while maintaining all your data. This is particularly valuable when dealing with large datasets or when storage or bandwidth limitations are a concern.
Exercises
- Create a DataFrame with 100,000 rows and compare file sizes when exporting to CSV with different compression methods.
- Benchmark the read and write speeds for Parquet files using different compression options.
- Process a large CSV file in chunks and save it as a compressed Parquet file.
- Create a utility function that automatically selects the best compression method based on a sample of your data and desired priorities (size vs. speed).
- Experiment with column-specific compression in Parquet files and measure the impact on file size and performance.