Pandas Export Compression

Introduction

When working with large datasets in pandas, file sizes can quickly become unwieldy. Compression techniques help reduce file sizes, save disk space, and sometimes even improve read/write performance. In this tutorial, you'll learn how to apply different compression methods when exporting pandas DataFrames to various file formats.

Compression is especially useful when:

  • You're working with large datasets
  • You need to transfer files over networks
  • You want to optimize storage space
  • You're archiving data for long-term storage

Supported Compression Formats

Pandas supports several compression formats depending on the file type you're exporting to:

  • gzip: General-purpose compression, good balance of speed and compression
  • bz2: Higher compression ratio but slower than gzip
  • zip: Common format, widely compatible
  • xz: Very high compression ratio but slower
  • zstd: Fast compression with a good compression ratio (newer option; requires the optional zstandard package)
  • snappy: Very fast compression with a lower ratio (the default for Parquet; not available for CSV exports)

Let's explore how to use these with different file formats.
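
One convenience worth knowing up front: to_csv() and read_csv() default to compression='infer', which picks the compression method from the file extension, so you often don't need to name it explicitly. A minimal sketch (the file names are just examples):

python
import pandas as pd

df = pd.DataFrame({'x': range(5)})

# No compression argument needed: the method is inferred from the extension
df.to_csv('tiny.csv.gz')    # gzip
df.to_csv('tiny.csv.bz2')   # bz2
df.to_csv('tiny.csv.xz')    # xz (LZMA)
df.to_csv('tiny.csv.zip')   # zip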

Compressing CSV Exports

Basic CSV Compression

To export a DataFrame to a compressed CSV file, use the compression parameter in the to_csv() method:

python
import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({
    'A': np.random.rand(100000),
    'B': np.random.rand(100000),
    'C': np.random.choice(['X', 'Y', 'Z'], 100000)
})

# Export to gzip-compressed CSV
df.to_csv('data_compressed.csv.gz', compression='gzip')

# Export to bz2-compressed CSV
df.to_csv('data_compressed.csv.bz2', compression='bz2')

# Export without compression for comparison
df.to_csv('data_uncompressed.csv')
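
The compressed files can be read straight back with read_csv(); the compression is inferred from the extension, or you can name it explicitly. A short sketch reusing the files written above:

python
# Compression inferred from the '.gz' extension
df_gz = pd.read_csv('data_compressed.csv.gz', index_col=0)

# Or name the method explicitly
df_bz2 = pd.read_csv('data_compressed.csv.bz2', compression='bz2', index_col=0)

print(df_gz.shape, df_bz2.shape)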

Comparing File Sizes

Let's compare the sizes of the files we just created:

python
import os

# Function to get file size in KB
def get_size_kb(filename):
    return os.path.getsize(filename) / 1024

# Compare sizes
print(f"Uncompressed: {get_size_kb('data_uncompressed.csv'):.2f} KB")
print(f"Gzip compressed: {get_size_kb('data_compressed.csv.gz'):.2f} KB")
print(f"BZ2 compressed: {get_size_kb('data_compressed.csv.bz2'):.2f} KB")

Example output:

Uncompressed: 4883.12 KB
Gzip compressed: 1208.45 KB
BZ2 compressed: 1036.22 KB

Adjusting Compression Level

Some compression methods allow you to specify a compression level, trading off speed versus compression ratio:

python
# Highest compression with gzip (9 is maximum)
df.to_csv('data_max_compressed.csv.gz', compression={'method': 'gzip', 'compresslevel': 9})

# Faster compression with gzip (1 is minimum)
df.to_csv('data_fast_compressed.csv.gz', compression={'method': 'gzip', 'compresslevel': 1})
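
To see the trade-off concretely, you can time both levels and compare the resulting sizes with the get_size_kb() helper defined above; a rough sketch (exact numbers depend on your data and machine):

python
import time

for level in (1, 9):
    filename = f'data_level{level}.csv.gz'
    start = time.time()
    df.to_csv(filename, compression={'method': 'gzip', 'compresslevel': level})
    elapsed = time.time() - start
    print(f"Level {level}: {get_size_kb(filename):.2f} KB in {elapsed:.2f} s")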

Compressing Excel Files

Excel with ZIP Compression

When exporting to Excel, pandas uses the openpyxl or xlsxwriter engine. The .xlsx format is itself a ZIP archive, so the output is already compressed and to_excel() has no compression parameter. You can, however, pass engine-specific options through ExcelWriter, such as xlsxwriter's constant_memory mode, which reduces memory usage when writing large files:

python
# Export to Excel with default compression
df.to_excel('data.xlsx')

# With xlsxwriter, engine options are passed via ExcelWriter's engine_kwargs
with pd.ExcelWriter('data_compressed.xlsx', engine='xlsxwriter',
                    engine_kwargs={'options': {'constant_memory': True}}) as writer:
    df.to_excel(writer)

Compressing to Parquet Format

Parquet is already an efficient columnar storage format (pandas writes it via the pyarrow or fastparquet engine), but it also supports several compression codecs:

python
# Export to Parquet with snappy compression (default)
df.to_parquet('data.parquet')

# With gzip compression (better ratio, slower)
df.to_parquet('data_gzip.parquet', compression='gzip')

# Different compression per column (the dict is passed through to the pyarrow engine)
df.to_parquet('data_mixed.parquet',
              compression={'A': 'snappy', 'B': 'gzip', 'C': None})
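
As with CSV, a quick file-size check shows what each codec buys you; a small sketch comparing the three Parquet files written above:

python
import os

for name in ('data.parquet', 'data_gzip.parquet', 'data_mixed.parquet'):
    print(f"{name}: {os.path.getsize(name) / 1024:.2f} KB")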

Reading Compressed Parquet Files

Pandas automatically detects and handles compression when reading:

python
# Read compressed parquet file (compression detected automatically)
df_read = pd.read_parquet('data_gzip.parquet')
print(df_read.head())

Compressing to HDF5 Format

HDF5 is a hierarchical data format (pandas writes it via the PyTables package) that also supports compression:

python
# Export to HDF5 (no compression is applied by default)
df.to_hdf('data.h5', key='df')

# With specific compression method and level
df.to_hdf('data_compressed.h5', key='df', complevel=9, complib='zlib')

# Different compression options
compression_options = {
    'zlib': 'Good general purpose compression',
    'lzo': 'Very fast compression',
    'bzip2': 'Higher compression ratio',
    'blosc': 'Fast multi-threaded compression'
}

print("Available HDF5 compression libraries:")
for lib, desc in compression_options.items():
    print(f"- {lib}: {desc}")

Real-World Application: Batch Processing Large Files

Here's a practical example of processing a large dataset in chunks and saving with compression:

python
import pandas as pd
import numpy as np
import os

# Simulate a large dataset creator
def create_large_csv(filename, rows=1000000):
    chunk_size = 100000
    for i in range(0, rows, chunk_size):
        df_chunk = pd.DataFrame({
            'id': range(i, i + chunk_size),
            'value': np.random.randn(chunk_size),
            'category': np.random.choice(['A', 'B', 'C', 'D'], chunk_size)
        })

        mode = 'w' if i == 0 else 'a'
        header = i == 0

        df_chunk.to_csv(filename, mode=mode, header=header, index=False)

    print(f"Created {filename} with {rows} rows")

# Create a large uncompressed file
create_large_csv('large_dataset.csv')

# Process in chunks with compression
def process_and_compress(input_file, output_file, compression, chunk_size=100000):
    total_chunks = 0

    # Process file in chunks
    for chunk in pd.read_csv(input_file, chunksize=chunk_size):
        # Perform some operation (example: calculate moving average)
        chunk['moving_avg'] = chunk['value'].rolling(window=3, min_periods=1).mean()

        # Append to the compressed output (gzip and bz2 streams can be
        # appended to; a zip archive could not be built this way)
        mode = 'w' if total_chunks == 0 else 'a'
        header = total_chunks == 0

        chunk.to_csv(output_file,
                     mode=mode,
                     header=header,
                     index=False,
                     compression=compression)

        total_chunks += 1

    print(f"Processed {total_chunks} chunks and saved to {output_file}")

# Process and compress with different methods
process_and_compress('large_dataset.csv', 'processed_gzip.csv.gz', 'gzip')
process_and_compress('large_dataset.csv', 'processed_bz2.csv.bz2', 'bz2')
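
To confirm the pay-off, compare the original file with the two compressed outputs; a short sketch using os.path.getsize():

python
for name in ('large_dataset.csv', 'processed_gzip.csv.gz', 'processed_bz2.csv.bz2'):
    size_mb = os.path.getsize(name) / (1024 * 1024)
    print(f"{name}: {size_mb:.2f} MB")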

Compression Selection Guidelines

When choosing a compression method, consider these factors (a small helper sketch follows the list):

  1. File size priority: Use bz2 or xz for best compression ratio
  2. Speed priority: Use snappy or zstd for fastest compression/decompression
  3. Balanced approach: Use gzip for a good balance between speed and size
  4. Compatibility concerns: Use zip or gzip for the widest compatibility
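
If you make this choice often, a tiny helper that maps a priority to a method keeps scripts consistent. The pick_compression() function below is a hypothetical sketch that simply mirrors the guidelines above; adjust the mapping to your own needs:

python
def pick_compression(priority='balanced'):
    """Map a priority to a to_csv-compatible compression method (illustrative only)."""
    mapping = {
        'size': 'xz',          # best ratio, slowest
        'speed': 'zstd',       # fastest; requires the zstandard package
        'balanced': 'gzip',    # good middle ground
        'compatible': 'zip',   # widest tool support
    }
    return mapping[priority]

# Example: reuse the df from the earlier sections
df.to_csv('guideline_example.csv.gz', compression=pick_compression('balanced'))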

Performance Benchmark

Here's a simple benchmark to compare different compression methods:

python
import pandas as pd
import numpy as np
import time
import os

# Create larger sample data
df_large = pd.DataFrame({
    'A': np.random.rand(500000),
    'B': np.random.rand(500000),
    'C': np.random.choice(['X', 'Y', 'Z'], 500000),
    'D': np.random.randint(1, 100, 500000),
    'E': pd.date_range('2020-01-01', periods=500000)
})

compression_methods = ['gzip', 'bz2', 'zip', 'xz', 'zstd']  # 'zstd' requires the optional zstandard package
results = []

for method in compression_methods:
    filename = f'benchmark_{method}.csv.{method}'

    # Measure write time
    start_time = time.time()
    df_large.to_csv(filename, compression=method)
    write_time = time.time() - start_time

    # Get file size
    file_size = os.path.getsize(filename) / (1024 * 1024)  # Size in MB

    # Measure read time
    start_time = time.time()
    df_read = pd.read_csv(filename, compression=method)
    read_time = time.time() - start_time

    results.append({
        'Method': method,
        'File Size (MB)': round(file_size, 2),
        'Write Time (s)': round(write_time, 2),
        'Read Time (s)': round(read_time, 2)
    })

# Display results
benchmark_df = pd.DataFrame(results)
print(benchmark_df)

Example output:

  Method  File Size (MB)  Write Time (s)  Read Time (s)
0   gzip           10.42            3.21           1.85
1    bz2            8.76           12.53           5.32
2    zip           11.05            2.74           1.92
3     xz            7.91           18.37           2.45
4   zstd            9.87            1.47           1.21

Summary

In this tutorial, you've learned how to:

  • Export pandas DataFrames to various file formats with compression
  • Choose the right compression algorithm for your needs
  • Apply compression options to CSV, Excel, Parquet, and HDF5 files
  • Process large files in chunks with compression
  • Benchmark different compression methods to make informed decisions

Using compression when exporting data can significantly reduce file sizes while maintaining all your data. This is particularly valuable when dealing with large datasets or when storage or bandwidth limitations are a concern.

Exercises

  1. Create a DataFrame with 100,000 rows and compare file sizes when exporting to CSV with different compression methods.
  2. Benchmark the read and write speeds for Parquet files using different compression options.
  3. Process a large CSV file in chunks and save it as a compressed Parquet file.
  4. Create a utility function that automatically selects the best compression method based on a sample of your data and desired priorities (size vs. speed).
  5. Experiment with column-specific compression in Parquet files and measure the impact on file size and performance.

