Pandas Chunking
Introduction
When working with large datasets in pandas, you might encounter memory errors or performance issues as pandas loads the entire dataset into memory. This is where chunking comes in - a technique that allows you to process large datasets in smaller, manageable pieces or "chunks." Chunking helps you work with data that would otherwise be too large to fit into your computer's memory.
In this tutorial, you'll learn:
- Why chunking is important for performance
- How to read large files in chunks
- How to process data chunk by chunk
- Best practices for working with chunked data
Why Use Chunking?
Before diving into the how, let's understand why chunking is valuable:
- Memory efficiency: Process datasets larger than your available RAM
- Faster initial loading: Start working with data without waiting for the entire dataset to load
- Performance optimization: Process data incrementally, reducing peak memory usage
- Streaming capabilities: Handle continuous data streams without loading everything at once
Reading Data in Chunks
Basic Chunking with CSV Files
The most common way to use chunking in pandas is with the read_csv() function's chunksize parameter:
import pandas as pd
import numpy as np
# Create a sample large CSV file for demonstration
sample_data = pd.DataFrame({
    'id': range(1000000),
    'value': np.random.randn(1000000)
})
sample_data.to_csv('large_file.csv', index=False)
# Read the file in chunks of 100,000 rows
chunk_size = 100000
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
# chunks is an iterable object that yields DataFrames
The chunks variable is now an iterable of DataFrame objects, each containing up to chunk_size rows from the CSV file (the final chunk may be smaller).
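The object returned when chunksize is set is pandas' TextFileReader. In recent pandas versions (1.2 and later) it also works as a context manager, which closes the underlying file handle once you are finished. A minimal sketch, reusing the large_file.csv created above:
# The chunked reader can be used as a context manager (pandas 1.2+),
# which makes sure the file handle is closed after iteration.
with pd.read_csv('large_file.csv', chunksize=100000) as reader:
    for chunk in reader:
        # Each chunk is an ordinary DataFrame
        print(chunk.shape)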
Processing Chunks with a Loop
Let's process each chunk individually:
# Initialize variables to collect results
total_rows = 0
value_sum = 0
# Process each chunk
for chunk in chunks:
    # Count rows
    total_rows += len(chunk)
    # Sum the 'value' column
    value_sum += chunk['value'].sum()
    # Print progress
    print(f"Processed chunk with {len(chunk)} rows")
# Calculate the average
avg_value = value_sum / total_rows
print(f"Average value: {avg_value:.4f}")
Output:
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Average value: -0.0012
Combining Results from Chunks
Often, you'll want to combine results from each chunk. Here's how to build a summary DataFrame:
# Initialize an empty list to store results from each chunk
chunk_results = []
# Reset the chunks iterator
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
# Process each chunk
for i, chunk in enumerate(chunks):
    # Calculate summary statistics for this chunk
    summary = {
        'chunk_id': i,
        'count': len(chunk),
        'mean': chunk['value'].mean(),
        'min': chunk['value'].min(),
        'max': chunk['value'].max()
    }
    # Append to our results list
    chunk_results.append(summary)
# Create a summary DataFrame
summary_df = pd.DataFrame(chunk_results)
print(summary_df)
Output:
chunk_id count mean min max
0 0 100000 -0.005647 -4.325872 4.274522
1 1 100000 0.003319 -4.465057 4.400602
2 2 100000 -0.003596 -4.546604 4.635003
3 3 100000 -0.001271 -4.700172 4.366974
4 4 100000 -0.001551 -4.718578 4.815745
5 5 100000 0.001256 -4.315734 4.332689
6 6 100000 0.000634 -4.587037 4.241242
7 7 100000 -0.000645 -4.265969 4.347517
8 8 100000 -0.001788 -4.423326 4.494396
9 9 100000 -0.001442 -4.600982 4.528704
Advanced Chunking Techniques
Using the get_chunk() Method
If you need more control over chunk processing, you can use the TextFileReader object directly:
# Create a TextFileReader object
chunked_reader = pd.read_csv('large_file.csv', chunksize=chunk_size)
# Read specific chunks
first_chunk = chunked_reader.get_chunk()
print(f"First chunk has {len(first_chunk)} rows")
second_chunk = chunked_reader.get_chunk()
print(f"Second chunk has {len(second_chunk)} rows")
Output:
First chunk has 100000 rows
Second chunk has 100000 rows
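get_chunk() also accepts an optional number of rows, so you can read uneven steps, for example a small preview followed by a larger block. A quick sketch, again reusing large_file.csv and chunk_size from above:
# Pass an explicit row count to get_chunk() to control how much is read
reader = pd.read_csv('large_file.csv', chunksize=chunk_size)
preview = reader.get_chunk(500)        # read just 500 rows
print(f"Preview has {len(preview)} rows")
next_block = reader.get_chunk(250000)  # then a larger block
print(f"Next block has {len(next_block)} rows")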
Memory-Efficient Aggregations
For global statistics, you can combine per-chunk partial results using their mathematical properties. For example, a global mean follows directly from per-chunk sums and row counts:
# Initialize variables for computing mean
total_count = 0
total_sum = 0
# Reset the chunks iterator
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
# Process each chunk
for chunk in chunks:
    # Update count and sum
    total_count += len(chunk)
    total_sum += chunk['value'].sum()
# Calculate global mean
global_mean = total_sum / total_count
print(f"Global mean: {global_mean:.6f}")
Output:
Global mean: -0.001204
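The same pattern works for any statistic that can be rebuilt from per-chunk totals. As an illustrative sketch (not part of the original example), a global standard deviation can be computed from running sums and sums of squares:
# Keep running totals of the count, the sum, and the sum of squares,
# then combine them at the end using Var(X) = E[X^2] - E[X]^2.
total_count = 0
total_sum = 0.0
total_sq_sum = 0.0
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    values = chunk['value']
    total_count += len(values)
    total_sum += values.sum()
    total_sq_sum += (values ** 2).sum()
global_mean = total_sum / total_count
global_var = total_sq_sum / total_count - global_mean ** 2
print(f"Global std: {global_var ** 0.5:.6f}")
Note that this naive formula can lose precision when the mean is large relative to the spread of the data; a streaming method such as Welford's algorithm is more robust in that case.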
Filtering Data in Chunks
You can also apply filters to each chunk and combine the results:
# Collect filtered pieces from each chunk in a list
filtered_chunks = []
# Reset the chunks iterator
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
# Process each chunk
for chunk in chunks:
    # Filter rows where value > 3
    filtered_chunk = chunk[chunk['value'] > 3]
    # Collect the piece; concatenating once at the end avoids
    # repeatedly copying a growing DataFrame
    filtered_chunks.append(filtered_chunk)
# Combine all filtered pieces into a single DataFrame
filtered_data = pd.concat(filtered_chunks)
print(f"Found {len(filtered_data)} rows with value > 3")
print(filtered_data.head())
Output:
Found 1350 rows with value > 3
id value
23617 23617 3.001002
35278 35278 3.041998
36874 36874 3.041887
37920 37920 3.077788
45607 45607 3.136437
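If the filtered rows are themselves too numerous to keep in memory, a common variation is to stream them to a new file instead of concatenating. A minimal sketch; the output file name filtered_output.csv is just an example:
# Append each filtered chunk to an output CSV, writing the header only once
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
for i, chunk in enumerate(chunks):
    filtered_chunk = chunk[chunk['value'] > 3]
    filtered_chunk.to_csv(
        'filtered_output.csv',
        mode='w' if i == 0 else 'a',  # create the file, then append
        header=(i == 0),              # header only on the first chunk
        index=False
    )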
Real-World Application: Processing Log Files
Let's look at a practical example of processing a web server log file in chunks:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Create a sample log file for demonstration
def create_sample_log(filename, n_records=1000000):
    # Generate sample timestamps
    start_date = datetime.now() - timedelta(days=7)
    timestamps = [start_date + timedelta(seconds=i) for i in range(n_records)]
    # Sample paths
    paths = ['/home', '/about', '/products', '/contact', '/blog']
    # Sample status codes
    status_codes = [200, 200, 200, 200, 301, 301, 404, 500]
    # Create DataFrame
    logs = pd.DataFrame({
        'timestamp': timestamps,
        'path': np.random.choice(paths, n_records),
        'status_code': np.random.choice(status_codes, n_records),
        'response_time_ms': np.random.randint(10, 500, n_records)
    })
    logs.to_csv(filename, index=False)
    return f"Created {n_records} log records in {filename}"
# Create our sample log file
create_sample_log('server_logs.csv')
# Process logs in chunks to calculate statistics by status code
chunk_size = 200000
chunks = pd.read_csv('server_logs.csv', chunksize=chunk_size)
# Dictionary to hold aggregated data
status_stats = {}
# Process each chunk
for chunk in chunks:
    # Convert timestamp column
    chunk['timestamp'] = pd.to_datetime(chunk['timestamp'])
    # Group by status_code
    for status, group in chunk.groupby('status_code'):
        if status not in status_stats:
            status_stats[status] = {
                'count': 0,
                'avg_response_time': 0,
                'total_response_time': 0
            }
        # Update statistics
        current_count = status_stats[status]['count']
        new_count = current_count + len(group)
        # Update average using weighted approach
        current_total = status_stats[status]['total_response_time']
        new_total = current_total + group['response_time_ms'].sum()
        status_stats[status]['count'] = new_count
        status_stats[status]['total_response_time'] = new_total
        status_stats[status]['avg_response_time'] = new_total / new_count
# Convert to DataFrame for easy viewing
results = pd.DataFrame.from_dict(status_stats, orient='index')
results = results.sort_values('count', ascending=False)
print(results)
Output:
count avg_response_time total_response_time
status_code
200 499743 254.950845 127408755.0
301 125068 254.613316 31843743.0
404 62528 255.635315 15983176.0
500 62661 255.323553 15999006.0
Best Practices for Chunking
- Choose an appropriate chunk size: Too small = overhead from many operations; too large = memory issues
- Release memory: Use del to remove chunks when you're done with them (see the sketch after this list)
- Monitor memory usage: Use libraries like psutil to track memory consumption
- Pre-allocate result structures: If possible, create your output structure in advance
- Consider parallel processing: For CPU-bound operations, use libraries like dask or multiprocessing
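To make the memory-related tips concrete, here is a small sketch that drops each chunk with del and tracks the process's memory with psutil (the same library used in the benchmark below); treat it as an illustration rather than a required pattern:
import gc
import pandas as pd
import psutil
process = psutil.Process()
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    row_count = len(chunk)  # stand-in for real processing
    del chunk               # release the reference once processed
    gc.collect()            # optionally force garbage collection
    mem_mb = process.memory_info().rss / 1024 / 1024
    print(f"Processed {row_count} rows, memory in use: {mem_mb:.1f} MB")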
Finding the Optimal Chunk Size
The optimal chunk size depends on your available memory and data complexity:
import pandas as pd
import psutil
import time
def test_chunk_size(file_path, chunk_sizes):
    results = []
    for size in chunk_sizes:
        # Measure start time and memory
        start_time = time.time()
        start_mem = psutil.Process().memory_info().rss / 1024 / 1024  # MB
        # Process in chunks
        chunks = pd.read_csv(file_path, chunksize=size)
        row_count = 0
        for chunk in chunks:
            # Do some processing
            row_count += len(chunk)
        # Measure end time and memory
        end_time = time.time()
        end_mem = psutil.Process().memory_info().rss / 1024 / 1024  # MB
        results.append({
            'chunk_size': size,
            'time_taken': end_time - start_time,
            'memory_used_mb': end_mem - start_mem
        })
    return pd.DataFrame(results)
# Test with different chunk sizes
chunk_sizes = [10000, 50000, 100000, 500000]
performance = test_chunk_size('large_file.csv', chunk_sizes)
print(performance)
Output:
chunk_size time_taken memory_used_mb
0 10000 1.672943 23.89844
1 50000 0.524781 24.17969
2 100000 0.351122 24.37500
3 500000 0.264382 28.49219
Summary
Chunking is a powerful technique for handling large datasets in pandas that would otherwise cause memory errors. By processing data in smaller, manageable pieces, you can work with datasets many times larger than your available RAM.
Key points to remember:
- Use the chunksize parameter with read_csv() and similar functions
- Process each chunk in a loop or using get_chunk()
- Combine results appropriately for your analysis needs
- Choose chunk sizes that balance performance and memory usage
- Consider the specific aggregation technique needed for your task
Additional Resources
- Pandas Documentation on IO Tools
- Dask Library - For parallel computing with pandas-like API
- Memory Profiler - For more detailed memory usage analysis
Exercises
- Create a script that reads a large CSV file in chunks and calculates the median value (hint: this requires a different approach than mean)
- Implement a chunk-based function that identifies and removes duplicate rows from a large dataset
- Write a function that uses chunking to perform a complex transformation on a dataset too large to fit in memory
- Create a chunked data pipeline that reads from one CSV file, transforms the data, and writes to another CSV file
- Compare the performance of chunk processing versus reading the entire dataset for different file sizes