Pandas Chunking
Introduction
When working with large datasets in pandas, you might encounter memory errors or performance issues as pandas loads the entire dataset into memory. This is where chunking comes in - a technique that allows you to process large datasets in smaller, manageable pieces or "chunks." Chunking helps you work with data that would otherwise be too large to fit into your computer's memory.
In this tutorial, you'll learn:
- Why chunking is important for performance
- How to read large files in chunks
- How to process data chunk by chunk
- Best practices for working with chunked data
Why Use Chunking?
Before diving into the how, let's understand why chunking is valuable:
- Memory efficiency: Process datasets larger than your available RAM
- Faster initial loading: Start working with data without waiting for the entire dataset to load
- Performance optimization: Process data incrementally, reducing peak memory usage
- Streaming capabilities: Handle continuous data streams without loading everything at once
Reading Data in Chunks
Basic Chunking with CSV Files
The most common way to use chunking in pandas is with the read_csv() function's chunksize parameter:
import pandas as pd
import numpy as np
# Create a sample large CSV file for demonstration
sample_data = pd.DataFrame({
    'id': range(1000000),
    'value': np.random.randn(1000000)
})
sample_data.to_csv('large_file.csv', index=False)
# Read the file in chunks of 100,000 rows
chunk_size = 100000
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
# chunks is an iterable object that yields DataFrames
The chunks variable is now an iterable of DataFrame objects, each containing up to chunk_size rows from the CSV file (the final chunk may be smaller).
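The object returned when chunksize is set is pandas' TextFileReader. In recent pandas versions (1.2 and later) it also works as a context manager, which closes the underlying file handle once you are finished. A minimal sketch, reusing the large_file.csv created above:
# The chunked reader can be used as a context manager (pandas 1.2+),
# which makes sure the file handle is closed after iteration.
with pd.read_csv('large_file.csv', chunksize=100000) as reader:
    for chunk in reader:
        # Each chunk is an ordinary DataFrame
        print(chunk.shape)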
Processing Chunks with a Loop
Let's process each chunk individually:
# Initialize variables to collect results
total_rows = 0
value_sum = 0
# Process each chunk
for chunk in chunks:
    # Count rows
    total_rows += len(chunk)
    # Sum the 'value' column
    value_sum += chunk['value'].sum()
    # Print progress
    print(f"Processed chunk with {len(chunk)} rows")
# Calculate the average
avg_value = value_sum / total_rows
print(f"Average value: {avg_value:.4f}")
Output:
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Processed chunk with 100000 rows
Average value: -0.0012
Combining Results from Chunks
Often, you'll want to combine results from each chunk. Here's how to build a summary DataFrame:
# Initialize an empty list to store results from each chunk
chunk_results = []
# Reset the chunks iterator
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
# Process each chunk
for i, chunk in enumerate(chunks):
    # Calculate summary statistics for this chunk
    summary = {
        'chunk_id': i,
        'count': len(chunk),
        'mean': chunk['value'].mean(),
        'min': chunk['value'].min(),
        'max': chunk['value'].max()
    }
    # Append to our results list
    chunk_results.append(summary)
# Create a summary DataFrame
summary_df = pd.DataFrame(chunk_results)
print(summary_df)
Output:
chunk_id count mean min max
0 0 100000 -0.005647 -4.325872 4.274522
1 1 100000 0.003319 -4.465057 4.400602
2 2 100000 -0.003596 -4.546604 4.635003
3 3 100000 -0.001271 -4.700172 4.366974
4 4 100000 -0.001551 -4.718578 4.815745
5 5 100000 0.001256 -4.315734 4.332689
6 6 100000 0.000634 -4.587037 4.241242
7 7 100000 -0.000645 -4.265969 4.347517
8 8 100000 -0.001788 -4.423326 4.494396
9 9 100000 -0.001442 -4.600982 4.528704
Advanced Chunking Techniques
Using the get_chunk() Method
If you need more control over chunk processing, you can use the TextFileReader object directly:
# Create a TextFileReader object
chunked_reader = pd.read_csv('large_file.csv', chunksize=chunk_size)
# Read specific chunks
first_chunk = chunked_reader.get_chunk()
print(f"First chunk has {len(first_chunk)} rows")
second_chunk = chunked_reader.get_chunk()
print(f"Second chunk has {len(second_chunk)} rows")
Output:
First chunk has 100000 rows
Second chunk has 100000 rows
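get_chunk() also accepts an optional number of rows, so you can read uneven steps, for example a small preview followed by a larger block. A quick sketch, again reusing large_file.csv and chunk_size from above:
# Pass an explicit row count to get_chunk() to control how much is read
reader = pd.read_csv('large_file.csv', chunksize=chunk_size)
preview = reader.get_chunk(500)        # read just 500 rows
print(f"Preview has {len(preview)} rows")
next_block = reader.get_chunk(250000)  # then a larger block
print(f"Next block has {len(next_block)} rows")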
Memory-Efficient Aggregations
For global statistics, you can combine per-chunk partial results using their mathematical properties. For example, a global mean follows directly from per-chunk sums and row counts:
# Initialize variables for computing mean
total_count = 0
total_sum = 0
# Reset the chunks iterator
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
# Process each chunk
for chunk in chunks:
    # Update count and sum
    total_count += len(chunk)
    total_sum += chunk['value'].sum()
# Calculate global mean
global_mean = total_sum / total_count
print(f"Global mean: {global_mean:.6f}")
Output:
Global mean: -0.001204
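The same pattern works for any statistic that can be rebuilt from per-chunk totals. As an illustrative sketch (not part of the original example), a global standard deviation can be computed from running sums and sums of squares:
# Keep running totals of the count, the sum, and the sum of squares,
# then combine them at the end using Var(X) = E[X^2] - E[X]^2.
total_count = 0
total_sum = 0.0
total_sq_sum = 0.0
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    values = chunk['value']
    total_count += len(values)
    total_sum += values.sum()
    total_sq_sum += (values ** 2).sum()
global_mean = total_sum / total_count
global_var = total_sq_sum / total_count - global_mean ** 2
print(f"Global std: {global_var ** 0.5:.6f}")
Note that this naive formula can lose precision when the mean is large relative to the spread of the data; a streaming method such as Welford's algorithm is more robust in that case.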
Filtering Data in Chunks
You can also apply filters to each chunk and combine the results:
# Collect filtered pieces from each chunk in a list
filtered_chunks = []
# Reset the chunks iterator
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
# Process each chunk
for chunk in chunks:
    # Filter rows where value > 3
    filtered_chunk = chunk[chunk['value'] > 3]
    # Collect the piece; concatenating once at the end avoids
    # repeatedly copying a growing DataFrame
    filtered_chunks.append(filtered_chunk)
# Combine all filtered pieces into a single DataFrame
filtered_data = pd.concat(filtered_chunks)
print(f"Found {len(filtered_data)} rows with value > 3")
print(filtered_data.head())
Output:
Found 1350 rows with value > 3
id value
23617 23617 3.001002
35278 35278 3.041998
36874 36874 3.041887
37920 37920 3.077788
45607 45607 3.136437
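If the filtered rows are themselves too numerous to keep in memory, a common variation is to stream them to a new file instead of concatenating. A minimal sketch; the output file name filtered_output.csv is just an example:
# Append each filtered chunk to an output CSV, writing the header only once
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
for i, chunk in enumerate(chunks):
    filtered_chunk = chunk[chunk['value'] > 3]
    filtered_chunk.to_csv(
        'filtered_output.csv',
        mode='w' if i == 0 else 'a',  # create the file, then append
        header=(i == 0),              # header only on the first chunk
        index=False
    )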
Real-World Application: Processing Log Files
Let's look at a practical example of processing a web server log file in chunks:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Create a sample log file for demonstration
def create_sample_log(filename, n_records=1000000):
    # Generate sample timestamps
    start_date = datetime.now() - timedelta(days=7)
    timestamps = [start_date + timedelta(seconds=i) for i in range(n_records)]
    # Sample paths
    paths = ['/home', '/about', '/products', '/contact', '/blog']
    # Sample status codes
    status_codes = [200, 200, 200, 200, 301, 301, 404, 500]
    # Create DataFrame
    logs = pd.DataFrame({
        'timestamp': timestamps,
        'path': np.random.choice(paths, n_records),
        'status_code': np.random.choice(status_codes, n_records),
        'response_time_ms': np.random.randint(10, 500, n_records)
    })
    logs.to_csv(filename, index=False)
    return f"Created {n_records} log records in {filename}"
# Create our sample log file
create_sample_log('server_logs.csv')
# Process logs in chunks to calculate statistics by status code
chunk_size = 200000
chunks = pd.read_csv('server_logs.csv', chunksize=chunk_size)
# Dictionary to hold aggregated data
status_stats = {}
# Process each chunk
for chunk in chunks:
    # Convert timestamp column
    chunk['timestamp'] = pd.to_datetime(chunk['timestamp'])
    # Group by status_code
    for status, group in chunk.groupby('status_code'):
        if status not in status_stats:
            status_stats[status] = {
                'count': 0,
                'avg_response_time': 0,
                'total_response_time': 0
            }
        # Update statistics
        current_count = status_stats[status]['count']
        new_count = current_count + len(group)
        # Update average using weighted approach
        current_total = status_stats[status]['total_response_time']
        new_total = current_total + group['response_time_ms'].sum()
        status_stats[status]['count'] = new_count
        status_stats[status]['total_response_time'] = new_total
        status_stats[status]['avg_response_time'] = new_total / new_count
# Convert to DataFrame for easy viewing
results = pd.DataFrame.from_dict(status_stats, orient='index')
results = results.sort_values('count', ascending=False)
print(results)
Output:
count avg_response_time total_response_time
status_code
200 499743 254.950845 127408755.0
301 125068 254.613316 31843743.0
404 62528 255.635315 15983176.0
500 62661 255.323553 15999006.0
Best Practices for Chunking
- Choose an appropriate chunk size: Too small = overhead from many operations; too large = memory issues
- Release memory: Use del to remove chunks when you're done with them (see the sketch after this list)
- Monitor memory usage: Use libraries like psutil to track memory consumption
- Pre-allocate result structures: If possible, create your output structure in advance
- Consider parallel processing: For CPU-bound operations, use libraries like dask or multiprocessing
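To make the memory-related tips concrete, here is a small sketch that drops each chunk with del and tracks the process's memory with psutil (the same library used in the benchmark below); treat it as an illustration rather than a required pattern:
import gc
import pandas as pd
import psutil
process = psutil.Process()
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    row_count = len(chunk)  # stand-in for real processing
    del chunk               # release the reference once processed
    gc.collect()            # optionally force garbage collection
    mem_mb = process.memory_info().rss / 1024 / 1024
    print(f"Processed {row_count} rows, memory in use: {mem_mb:.1f} MB")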
Finding the Optimal Chunk Size
The optimal chunk size depends on your available memory and data complexity:
import pandas as pd
import psutil
import time
def test_chunk_size(file_path, chunk_sizes):
    results = []
    for size in chunk_sizes:
        # Measure start time and memory
        start_time = time.time()
        start_mem = psutil.Process().memory_info().rss / 1024 / 1024  # MB
        # Process in chunks
        chunks = pd.read_csv(file_path, chunksize=size)
        row_count = 0
        for chunk in chunks:
            # Do some processing
            row_count += len(chunk)
        # Measure end time and memory
        end_time = time.time()
        end_mem = psutil.Process().memory_info().rss / 1024 / 1024  # MB
        results.append({
            'chunk_size': size,
            'time_taken': end_time - start_time,
            'memory_used_mb': end_mem - start_mem
        })
    return pd.DataFrame(results)
# Test with different chunk sizes
chunk_sizes = [10000, 50000, 100000, 500000]
performance = test_chunk_size('large_file.csv', chunk_sizes)
print(performance)
Output:
chunk_size time_taken memory_used_mb
0 10000 1.672943 23.89844
1 50000 0.524781 24.17969
2 100000 0.351122 24.37500
3 500000 0.264382 28.49219
Summary
Chunking is a powerful technique for handling large datasets in pandas that would otherwise cause memory errors. By processing data in smaller, manageable pieces, you can work with datasets many times larger than your available RAM.
Key points to remember:
- Use the chunksize parameter with read_csv() and similar functions
- Process each chunk in a loop or using get_chunk()
- Combine results appropriately for your analysis needs
- Choose chunk sizes that balance performance and memory usage
- Consider the specific aggregation technique needed for your task
Additional Resources
- Pandas Documentation on IO Tools
- Dask Library - For parallel computing with pandas-like API
- Memory Profiler - For more detailed memory usage analysis
Exercises
- Create a script that reads a large CSV file in chunks and calculates the median value (hint: this requires a different approach than mean)
- Implement a chunk-based function that identifies and removes duplicate rows from a large dataset
- Write a function that uses chunking to perform a complex transformation on a dataset too large to fit in memory
- Create a chunked data pipeline that reads from one CSV file, transforms the data, and writes to another CSV file
- Compare the performance of chunk processing versus reading the entire dataset for different file sizes