Pandas Memory Usage
When working with large datasets in pandas, understanding and managing memory usage becomes crucial. Efficient memory management can significantly improve performance and help you work with bigger datasets on limited hardware. This guide will explain how to analyze and optimize memory usage in pandas objects.
Introduction to Memory Management in Pandas
Pandas is powerful for data analysis, but it can be memory-intensive, especially with large datasets. Each pandas object (DataFrame or Series) consumes memory based on:
- The number of rows and columns
- The data types of each column
- Additional memory overhead for indexes and internal structures
Understanding how pandas allocates memory will help you work with larger datasets and avoid running into memory errors.
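As a quick illustration, the minimal sketch below stores the same one million integers under two different dtypes; the byte counts reported by memory_usage() differ by roughly a factor of eight (exact values vary slightly by pandas version):
import pandas as pd
import numpy as np
# The same one million values stored as int64 (8 bytes each) vs. int8 (1 byte each)
values = np.random.randint(0, 100, size=1_000_000)
print(pd.Series(values, dtype='int64').memory_usage(deep=True))  # ~8,000,128 bytes
print(pd.Series(values, dtype='int8').memory_usage(deep=True))   # ~1,000,128 bytes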
Checking Memory Usage
Basic Memory Usage
Let's start by exploring how to check the memory footprint of pandas objects.
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
    'A': np.random.rand(100000),
    'B': np.random.randint(0, 100, size=100000),
    'C': ['string' + str(i) for i in range(100000)],
    'D': pd.date_range('20210101', periods=100000)
})
# Check memory usage
df.info(memory_usage='deep')
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   A       100000 non-null  float64
 1   B       100000 non-null  int64
 2   C       100000 non-null  object
 3   D       100000 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 8.4 MB
Detailed Memory Analysis
For more detailed information, you can use the memory_usage() method:
# Get memory usage per column
print(df.memory_usage(deep=True))
# Get total memory usage
print(f"Total memory usage: {df.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
Output:
Index         128
A          800000
B          800000
C         6400044
D          800000
dtype: int64
Total memory usage: 8.39 MB
The deep=True parameter is important as it accounts for the actual memory used by object dtypes (like strings). Without it, pandas only reports the shallow memory usage, which can underestimate the true memory footprint.
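To see the difference, compare the shallow and deep figures for the object column C from the DataFrame above:
# Shallow: only the 8-byte pointers to the Python string objects are counted
print(df['C'].memory_usage())
# Deep: the string objects themselves are included, which is several times larger
print(df['C'].memory_usage(deep=True))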
Understanding Data Types and Memory
Different data types consume different amounts of memory:
| Data Type | Typical Memory Usage |
|-----------|----------------------|
| int64 | 8 bytes per value |
| float64 | 8 bytes per value |
| bool | 1 byte per value |
| datetime64 | 8 bytes per value |
| object | Variable (typically more) |
The object dtype (used for strings and mixed types) is usually the biggest memory consumer.
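The overhead comes from every value being a full Python object. You can see the per-value cost directly with sys.getsizeof (exact numbers vary by Python version):
import sys
# A short Python string carries roughly 50 bytes of object overhead on top of its characters
print(sys.getsizeof('string0'))
# A fixed-width numeric value occupies exactly its dtype's item size
print(np.dtype('int64').itemsize)  # 8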
Optimizing Memory Usage
Now let's look at strategies to reduce memory consumption:
1. Using Appropriate Data Types
Converting columns to more memory-efficient types can save significant space:
# Original memory usage
print(f"Original memory: {df.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
# Convert integer column to smaller dtype
df['B'] = df['B'].astype('int8')  # Values 0-99 fit comfortably in int8 (-128 to 127)
# Check new memory usage
print(f"After int8 conversion: {df.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
Output:
Original memory: 8.39 MB
After int8 conversion: 7.72 MB
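Before downcasting by hand, it is worth confirming that the column's actual range fits the target type; np.iinfo exposes the limits of each integer dtype:
# Bounds of the target dtype vs. the actual range of the data
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
print(df['B'].min(), df['B'].max())                  # 0 and 99 in this example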
2. Categorical Data for Strings
For columns with repeated string values, using categorical data types can save substantial memory:
# Create a DataFrame with repeated values
df_cat = pd.DataFrame({
    'city': np.random.choice(['New York', 'London', 'Tokyo', 'Paris', 'Beijing'], 100000)
})
# Check memory before conversion
print(f"Before categorical: {df_cat.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
# Convert to categorical
df_cat['city'] = df_cat['city'].astype('category')
# Check memory after conversion
print(f"After categorical: {df_cat.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
Output:
Before categorical: 3.96 MB
After categorical: 0.10 MB
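The savings come from how categoricals are stored: one small integer code per row plus a single copy of each unique label. You can inspect both pieces:
# One int8 code per row plus the five unique city names
print(df_cat['city'].cat.codes.dtype)  # int8, since there are only five categories
print(df_cat['city'].cat.categories)   # the unique labels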
3. Optimizing Types When Loading with pd.read_csv()
When loading data from CSV files, you can use the dtype and parse_dates parameters to optimize memory usage right from the start:
# Sample code for optimized CSV loading
optimized_df = pd.read_csv(
    'large_file.csv',
    dtype={
        'id': 'int32',
        'small_numbers': 'int8',
        'category_column': 'category'
    },
    parse_dates=['date_column']
)
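If you are not sure which dtypes are safe for a file, one common approach is to read a small sample first, inspect the ranges, and then load the full file with explicit dtypes (the file and column names below are the same placeholders used above):
# Peek at the first few thousand rows to decide which dtypes are safe
sample = pd.read_csv('large_file.csv', nrows=5000)
print(sample.dtypes)
print(sample.memory_usage(deep=True))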
4. Using the downcast Option
The downcast parameter of pd.to_numeric() can automatically convert values to the smallest dtype that can hold them:
# Create a DataFrame with numbers
df_nums = pd.DataFrame({
    'A': np.random.randint(0, 100, size=100000),
    'B': np.random.rand(100000)
})
# Check memory usage
print(f"Before downcast: {df_nums.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
# Downcast to smaller dtypes
df_nums_small = df_nums.copy()
df_nums_small['A'] = pd.to_numeric(df_nums_small['A'], downcast='integer')
df_nums_small['B'] = pd.to_numeric(df_nums_small['B'], downcast='float')
# Check memory usage after downcast
print(f"After downcast: {df_nums_small.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
Output:
Before downcast: 1.53 MB
After downcast: 0.48 MB
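The same idea can be applied across a whole DataFrame by checking each column's kind and choosing the matching downcast target; a minimal sketch:
# Downcast every numeric column of a copy in one pass
df_auto = df_nums.copy()
for col in df_auto.columns:
    if pd.api.types.is_integer_dtype(df_auto[col]):
        df_auto[col] = pd.to_numeric(df_auto[col], downcast='integer')
    elif pd.api.types.is_float_dtype(df_auto[col]):
        df_auto[col] = pd.to_numeric(df_auto[col], downcast='float')
print(f"After downcasting all columns: {df_auto.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")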
Automatic Memory Optimization
For a more automated approach to memory optimization, let's create a utility function:
def reduce_memory_usage(df, verbose=True):
    """
    Iterate through the columns, downcasting numeric dtypes and converting
    low-cardinality object columns to categorical to reduce memory usage.
    """
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtype
        if pd.api.types.is_integer_dtype(col_type):
            c_min = df[col].min()
            c_max = df[col].max()
            if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
            else:
                df[col] = df[col].astype(np.int64)
        elif pd.api.types.is_float_dtype(col_type):
            c_min = df[col].min()
            c_max = df[col].max()
            # Note: float16 gives the biggest savings but has limited precision
            if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)
            elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
            else:
                df[col] = df[col].astype(np.float64)
        elif col_type == object:
            # For object columns, convert to categorical if fewer than 50% unique values
            if df[col].nunique() / len(df) < 0.5:
                df[col] = df[col].astype('category')
        # Other dtypes (datetime64, bool, category, ...) are left unchanged
    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    reduction = 100 * (start_mem - end_mem) / start_mem
    if verbose:
        print(f'Memory usage decreased from {start_mem:.2f} MB to {end_mem:.2f} MB '
              f'({reduction:.2f}% reduction)')
    return df
Let's use this function on a larger DataFrame:
# Create a larger DataFrame
large_df = pd.DataFrame({
    'id': range(500000),
    'int_col': np.random.randint(0, 100, size=500000),
    'float_col': np.random.rand(500000),
    'category_col': np.random.choice(['A', 'B', 'C', 'D', 'E'], 500000)
})
# Optimize memory usage
optimized_df = reduce_memory_usage(large_df)
Output:
Memory usage decreased from 17.17 MB to 2.86 MB (83.31% reduction)
Real-World Application: Processing Large CSV Files
Here's a practical example of how to process a large CSV file in chunks to manage memory usage:
def process_large_csv(filename, chunksize=100000):
    # Process in chunks to avoid loading the entire file into memory
    chunks = []
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        # Optimize memory usage for this chunk
        chunk = reduce_memory_usage(chunk, verbose=False)
        # Process the chunk (example: filter rows)
        processed_chunk = chunk[chunk['value'] > 0]
        chunks.append(processed_chunk)
    # Combine processed chunks
    result = pd.concat(chunks)
    return result
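A hedged usage sketch, assuming large_file.csv exists and contains the value column filtered on above:
result = process_large_csv('large_file.csv')
print(f"Kept {len(result)} rows using "
      f"{result.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")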
Best Practices for Memory Management in Pandas
- Check memory usage regularly with df.info(memory_usage='deep') and df.memory_usage(deep=True)
- Use appropriate data types:
  - Use smaller integer types when possible (int8, int16 instead of int64)
  - Convert string columns with limited unique values to categorical
  - Use datetime64 for date columns instead of strings
- Free memory when possible (see the sketch after this list):
  - Delete intermediate DataFrames you no longer need
  - Use del df and gc.collect() to release memory
- Process large files in chunks instead of loading all at once
- Consider using more specialized tools like Dask or Vaex for extremely large datasets that don't fit in memory
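A minimal sketch of the "free memory" point above, assuming temp_df is a hypothetical intermediate result you no longer need:
import gc
temp_df = large_df[large_df['int_col'] > 50]  # hypothetical intermediate result
# ... work with temp_df ...
del temp_df   # drop the last reference to the DataFrame
gc.collect()  # ask Python to reclaim the memory promptly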
Summary
Managing memory in pandas is essential when working with large datasets. By:
- Understanding how pandas allocates memory
- Choosing appropriate data types
- Using techniques like categorical conversion and downcasting
- Processing large files in chunks
You can significantly reduce memory consumption and work with larger datasets even on hardware with limited RAM.
Exercises
- Create a DataFrame with 1 million rows containing different data types and optimize its memory usage.
- Write a function that reads a large CSV file and reports the memory that could be saved by optimizing each column.
- Compare the performance (both memory and speed) of string columns vs. categorical columns for different operations like filtering and groupby.
- Implement a function that processes a large dataset in chunks, applying a memory optimization to each chunk.
Additional Resources
- Pandas Documentation on Memory Usage
- Tips for Optimizing Pandas Code
- Working with Large Data in Python
- Dask Documentation - For when datasets are too large for pandas