
Pandas Chunking

Introduction

When working with large datasets in pandas, you might run into memory errors or sluggish performance, because pandas normally loads the entire dataset into memory at once. This is where chunking comes in: a technique that lets you process a large dataset in smaller, manageable pieces, or "chunks." Chunking makes it possible to work with data that would otherwise be too large to fit into your computer's memory.

In this tutorial, you'll learn:

  • Why chunking is important for performance
  • How to read large files in chunks
  • How to process data chunk by chunk
  • Best practices for working with chunked data

Why Use Chunking?

Before diving into the how, let's understand why chunking is valuable:

  1. Memory efficiency: Process datasets larger than your available RAM
  2. Faster initial loading: Start working with data without waiting for the entire dataset to load
  3. Performance optimization: Process data incrementally, reducing peak memory usage
  4. Streaming capabilities: Handle continuous data streams without loading everything at once

Reading Data in Chunks

Basic Chunking with CSV Files

The most common way to use chunking in pandas is with the read_csv() function's chunksize parameter:

python
import pandas as pd
import numpy as np

# Create a sample large CSV file for demonstration
sample_data = pd.DataFrame({
    'id': range(1000000),
    'value': np.random.randn(1000000)
})
sample_data.to_csv('large_file.csv', index=False)

# Read the file in chunks of 100,000 rows
chunk_size = 100000
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)

# chunks is an iterable object that yields DataFrames

The chunks variable is now an iterator of DataFrame objects (a TextFileReader), each containing up to chunk_size rows from the CSV file; the last chunk may be smaller if the total row count isn't an exact multiple of chunk_size.
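Because the reader is a plain Python iterator, you can also peek at the first chunk with next() before committing to a full loop. A minimal sketch (note that this consumes that chunk, so re-create the reader before a full pass):

python
# Peek at the first chunk; this advances the iterator,
# so re-create the reader before iterating over everything
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
first_chunk = next(chunks)
print(first_chunk.shape)
print(first_chunk.head())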

Processing Chunks with a Loop

Let's process each chunk individually:

python
# Initialize variables to collect results
total_rows = 0
value_sum = 0

# Process each chunk
for chunk in chunks:
    # Count rows
    total_rows += len(chunk)

    # Sum the 'value' column
    value_sum += chunk['value'].sum()

    # Print progress
    print(f"Processed chunk with {len(chunk)} rows")

# Calculate the average
avg_value = value_sum / total_rows
print(f"Average value: {avg_value:.4f}")

Output:

Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Average value: -0.0012

Combining Results from Chunks

Often, you'll want to combine results from each chunk. Here's how to build a summary DataFrame:

python
# Initialize an empty list to store results from each chunk
chunk_results = []

# Reset the chunks iterator
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)

# Process each chunk
for i, chunk in enumerate(chunks):
    # Calculate summary statistics for this chunk
    summary = {
        'chunk_id': i,
        'count': len(chunk),
        'mean': chunk['value'].mean(),
        'min': chunk['value'].min(),
        'max': chunk['value'].max()
    }

    # Append to our results list
    chunk_results.append(summary)

# Create a summary DataFrame
summary_df = pd.DataFrame(chunk_results)
print(summary_df)

Output:

   chunk_id   count      mean       min       max
0         0  100000 -0.005647 -4.325872  4.274522
1         1  100000  0.003319 -4.465057  4.400602
2         2  100000 -0.003596 -4.546604  4.635003
3         3  100000 -0.001271 -4.700172  4.366974
4         4  100000 -0.001551 -4.718578  4.815745
5         5  100000  0.001256 -4.315734  4.332689
6         6  100000  0.000634 -4.587037  4.241242
7         7  100000 -0.000645 -4.265969  4.347517
8         8  100000 -0.001788 -4.423326  4.494396
9         9  100000 -0.001442 -4.600982  4.528704

Advanced Chunking Techniques

Using the get_chunk() Method

If you need finer control over when each chunk is read, you can call the TextFileReader object's get_chunk() method directly instead of looping:

python
# Create a TextFileReader object
chunked_reader = pd.read_csv('large_file.csv', chunksize=chunk_size)

# Read specific chunks
first_chunk = chunked_reader.get_chunk()
print(f"First chunk has {len(first_chunk)} rows")

second_chunk = chunked_reader.get_chunk()
print(f"Second chunk has {len(second_chunk)} rows")

Output:

First chunk has 100000 rows
Second chunk has 100000 rows
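get_chunk() also accepts an explicit row count, and in recent pandas versions the reader can be used as a context manager so the underlying file handle is closed automatically. A small sketch:

python
# Pull an arbitrary number of rows and let the context manager
# close the file when the block ends
with pd.read_csv('large_file.csv', chunksize=chunk_size) as reader:
    small_chunk = reader.get_chunk(500)  # just the next 500 rows
    print(f"Custom chunk has {len(small_chunk)} rows")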

Memory-Efficient Aggregations

For global statistics, you can accumulate per-chunk partial results (such as counts and sums) and combine them into a single figure at the end:

python
# Initialize variables for computing mean
total_count = 0
total_sum = 0

# Reset the chunks iterator
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)

# Process each chunk
for chunk in chunks:
    # Update count and sum
    total_count += len(chunk)
    total_sum += chunk['value'].sum()

# Calculate global mean
global_mean = total_sum / total_count
print(f"Global mean: {global_mean:.6f}")

Output:

Global mean: -0.001204
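The same pattern extends to any statistic that decomposes into per-chunk sums. For example, here is a sketch of the global standard deviation built from a running sum and sum of squares, using Var(X) = E[X^2] - E[X]^2; for very large or ill-conditioned data, a numerically stabler streaming method such as Welford's algorithm may be preferable:

python
import numpy as np

# Accumulate count, sum, and sum of squares across chunks
total_count = 0
total_sum = 0.0
total_sq_sum = 0.0

chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)

for chunk in chunks:
    values = chunk['value']
    total_count += len(values)
    total_sum += values.sum()
    total_sq_sum += (values ** 2).sum()

# Population variance: E[X^2] - E[X]^2
global_mean = total_sum / total_count
global_var = total_sq_sum / total_count - global_mean ** 2
global_std = np.sqrt(global_var)
print(f"Global std: {global_std:.6f}")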

Filtering Data in Chunks

You can also apply filters to each chunk and combine the results:

python
# Collect the filtered piece of each chunk in a list
filtered_chunks = []

# Reset the chunks iterator
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)

# Process each chunk
for chunk in chunks:
    # Filter rows where value > 3
    filtered_chunk = chunk[chunk['value'] > 3]

    # Keep the filtered piece for later
    filtered_chunks.append(filtered_chunk)

# Combine all filtered pieces with a single concat call
# (concatenating once at the end avoids repeated copying inside the loop)
filtered_data = pd.concat(filtered_chunks)

print(f"Found {len(filtered_data)} rows with value > 3")
print(filtered_data.head())

Output:

Found 1350 rows with value > 3
          id     value
23617  23617  3.001002
35278  35278  3.041998
36874  36874  3.041887
37920  37920  3.077788
45607  45607  3.136437
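If the filtered subset is itself too large to keep in memory, a common alternative is to append each filtered chunk to an output file as you go. A sketch using to_csv in append mode (the filename filtered_file.csv is just an example; the header is written only for the first chunk):

python
import os

output_path = 'filtered_file.csv'  # example output file for this sketch
if os.path.exists(output_path):
    os.remove(output_path)         # start from a clean file

chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)

for i, chunk in enumerate(chunks):
    filtered_chunk = chunk[chunk['value'] > 3]
    # Append each filtered piece; write the header only once
    filtered_chunk.to_csv(output_path, mode='a', header=(i == 0), index=False)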

Real-World Application: Processing Log Files

Let's look at a practical example of processing a web server log file in chunks:

python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Create a sample log file for demonstration
def create_sample_log(filename, n_records=1000000):
    # Generate sample timestamps
    start_date = datetime.now() - timedelta(days=7)
    timestamps = [start_date + timedelta(seconds=i) for i in range(n_records)]

    # Sample paths
    paths = ['/home', '/about', '/products', '/contact', '/blog']

    # Sample status codes
    status_codes = [200, 200, 200, 200, 301, 301, 404, 500]

    # Create DataFrame
    logs = pd.DataFrame({
        'timestamp': timestamps,
        'path': np.random.choice(paths, n_records),
        'status_code': np.random.choice(status_codes, n_records),
        'response_time_ms': np.random.randint(10, 500, n_records)
    })

    logs.to_csv(filename, index=False)
    return f"Created {n_records} log records in {filename}"

# Create our sample log file
create_sample_log('server_logs.csv')

# Process logs in chunks to calculate statistics by status code
chunk_size = 200000
chunks = pd.read_csv('server_logs.csv', chunksize=chunk_size)

# Dictionary to hold aggregated data
status_stats = {}

# Process each chunk
for chunk in chunks:
    # Convert timestamp column
    chunk['timestamp'] = pd.to_datetime(chunk['timestamp'])

    # Group by status_code
    for status, group in chunk.groupby('status_code'):
        if status not in status_stats:
            status_stats[status] = {
                'count': 0,
                'avg_response_time': 0,
                'total_response_time': 0
            }

        # Update statistics
        current_count = status_stats[status]['count']
        new_count = current_count + len(group)

        # Update average using weighted approach
        current_total = status_stats[status]['total_response_time']
        new_total = current_total + group['response_time_ms'].sum()

        status_stats[status]['count'] = new_count
        status_stats[status]['total_response_time'] = new_total
        status_stats[status]['avg_response_time'] = new_total / new_count

# Convert to DataFrame for easy viewing
results = pd.DataFrame.from_dict(status_stats, orient='index')
results = results.sort_values('count', ascending=False)
print(results)

Output:

              count  avg_response_time  total_response_time
status_code
200          499743         254.950845          127408755.0
301          125068         254.613316           31843743.0
500           62661         255.323553           15999006.0
404           62528         255.635315           15983176.0
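An alternative way to structure the same computation is to collect a small per-chunk aggregate for every chunk and combine them once at the end. This keeps the loop body simpler at the cost of holding one tiny summary frame per chunk; a sketch:

python
# Collect per-chunk groupby aggregates, then combine them in one pass
chunks = pd.read_csv('server_logs.csv', chunksize=chunk_size)

partials = []
for chunk in chunks:
    partials.append(
        chunk.groupby('status_code')['response_time_ms'].agg(['count', 'sum'])
    )

combined = pd.concat(partials).groupby(level=0).sum()
combined['avg_response_time'] = combined['sum'] / combined['count']
print(combined.sort_values('count', ascending=False))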

Best Practices for Chunking

  1. Choose an appropriate chunk size: Too small = overhead from many operations; too large = memory issues
  2. Release memory: Use del to remove chunks when you're done with them (see the sketch after this list)
  3. Monitor memory usage: Use libraries like psutil to track memory consumption
  4. Pre-allocate result structures: If possible, create your output structure in advance
  5. Consider parallel processing: For CPU-bound operations, use libraries like dask or multiprocessing
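For point 2, the pattern can look like the following sketch. In CPython the chunk's memory is normally reclaimed as soon as the name is rebound on the next iteration, so del and an explicit gc.collect() mostly matter when extra references linger or individual chunks are very large:

python
import gc
import pandas as pd

total = 0.0
chunks = pd.read_csv('large_file.csv', chunksize=100000)

for chunk in chunks:
    total += chunk['value'].sum()  # stand-in for real per-chunk work

    # Drop the reference and optionally force a collection before the
    # next chunk is read; usually only worthwhile for very large chunks
    del chunk
    gc.collect()

print(f"Sum of 'value': {total:.4f}")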

Finding the Optimal Chunk Size

The optimal chunk size depends on your available memory and data complexity:

python
import pandas as pd
import psutil
import time

def test_chunk_size(file_path, chunk_sizes):
    results = []

    for size in chunk_sizes:
        # Measure start time and memory
        start_time = time.time()
        start_mem = psutil.Process().memory_info().rss / 1024 / 1024  # MB

        # Process in chunks
        chunks = pd.read_csv(file_path, chunksize=size)
        row_count = 0

        for chunk in chunks:
            # Do some processing
            row_count += len(chunk)

        # Measure end time and memory
        end_time = time.time()
        end_mem = psutil.Process().memory_info().rss / 1024 / 1024  # MB

        results.append({
            'chunk_size': size,
            'time_taken': end_time - start_time,
            'memory_used_mb': end_mem - start_mem
        })

    return pd.DataFrame(results)

# Test with different chunk sizes
chunk_sizes = [10000, 50000, 100000, 500000]
performance = test_chunk_size('large_file.csv', chunk_sizes)
print(performance)

Output:

   chunk_size  time_taken  memory_used_mb
0       10000    1.672943        23.89844
1       50000    0.524781        24.17969
2      100000    0.351122        24.37500
3      500000    0.264382        28.49219

Summary

Chunking is a powerful technique for handling large datasets in pandas that would otherwise cause memory errors. By processing data in smaller, manageable pieces, you can work with datasets many times larger than your available RAM.

Key points to remember:

  • Use the chunksize parameter with read_csv() and similar readers (see the sketch below)
  • Process each chunk in a loop or using get_chunk()
  • Combine results appropriately for your analysis needs
  • Choose chunk sizes that balance performance and memory usage
  • Consider the specific aggregation technique needed for your task
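The chunksize parameter is not limited to read_csv(). As a sketch, other readers support the same pattern (the filenames records.jsonl and example.db and the logs table below are hypothetical):

python
import sqlite3
import pandas as pd

# Line-delimited JSON can be streamed in chunks
for chunk in pd.read_json('records.jsonl', lines=True, chunksize=100000):
    print(len(chunk))  # process each chunk here

# SQL query results can also be pulled incrementally
conn = sqlite3.connect('example.db')
for chunk in pd.read_sql('SELECT * FROM logs', conn, chunksize=100000):
    print(len(chunk))  # process each chunk here
conn.close()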

Exercises

  1. Create a script that reads a large CSV file in chunks and calculates the median value (hint: this requires a different approach than mean)
  2. Implement a chunk-based function that identifies and removes duplicate rows from a large dataset
  3. Write a function that uses chunking to perform a complex transformation on a dataset too large to fit in memory
  4. Create a chunked data pipeline that reads from one CSV file, transforms the data, and writes to another CSV file
  5. Compare the performance of chunk processing versus reading the entire dataset for different file sizes

