
Pandas Memory Usage

When working with large datasets in pandas, understanding and managing memory usage becomes crucial. Efficient memory management can significantly improve performance and help you work with bigger datasets on limited hardware. This guide will explain how to analyze and optimize memory usage in pandas objects.

Introduction to Memory Management in Pandas

Pandas is powerful for data analysis, but it can be memory-intensive, especially with large datasets. Each pandas object (DataFrame or Series) consumes memory based on:

  1. The number of rows and columns
  2. The data types of each column
  3. Additional memory overhead for indexes and internal structures

Understanding how pandas allocates memory will help you work with larger datasets and avoid running into memory errors.
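
For example, the same one million small integers take roughly eight times more space stored as int64 than as int8 (a minimal sketch; exact numbers vary slightly by platform):

python
import numpy as np
import pandas as pd

# One million small integers stored at two different integer widths
values = np.random.randint(0, 100, size=1_000_000)
s_int64 = pd.Series(values, dtype='int64')
s_int8 = s_int64.astype('int8')

print(s_int64.nbytes)  # ~8,000,000 bytes (8 bytes per value)
print(s_int8.nbytes)   # ~1,000,000 bytes (1 byte per value)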

Checking Memory Usage

Basic Memory Usage

Let's start by exploring how to check the memory footprint of pandas objects.

python
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'A': np.random.rand(100000),
    'B': np.random.randint(0, 100, size=100000),
    'C': ['string' + str(i) for i in range(100000)],
    'D': pd.date_range('20210101', periods=100000)
})

# Check memory usage
print(df.info(memory_usage='deep'))

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   A       100000 non-null  float64
 1   B       100000 non-null  int64
 2   C       100000 non-null  object
 3   D       100000 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 12.2+ MB
None

Detailed Memory Analysis

For more detailed information, you can use the memory_usage() method:

python
# Get memory usage per column
print(df.memory_usage(deep=True))

# Get total memory usage
print(f"Total memory usage: {df.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")

Output:

Index      800000
A          800000
B          800000
C         6400044
D          800000
dtype: int64
Total memory usage: 8.39 MB

The deep=True parameter is important because it accounts for the actual memory used by object dtypes (such as strings). Without it, pandas reports only the shallow memory usage, which can badly underestimate the true footprint.
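
To see the difference on the string column from the example above (exact numbers will vary):

python
# Shallow: counts the 8-byte object pointers (plus the index), not the strings
print(df['C'].memory_usage())

# Deep: also counts the Python string objects themselves, several MB more
print(df['C'].memory_usage(deep=True))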

Understanding Data Types and Memory

Different data types consume different amounts of memory:

Data Type       Typical Memory Usage
int64           8 bytes per value
float64         8 bytes per value
bool            1 byte per value
datetime64      8 bytes per value
object          Variable (typically more)

The object dtype (used for strings and mixed types) is usually the biggest memory consumer.
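
The gap comes from how object columns are stored: the column itself holds only 8-byte pointers, and each string is a separate Python object with its own overhead. A quick illustration (sizes shown are typical for CPython and may differ on your system):

python
import sys
import numpy as np

print(sys.getsizeof('string42'))   # a short str is typically ~57 bytes
print(np.dtype('int64').itemsize)  # a fixed-width int64 is exactly 8 bytes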

Optimizing Memory Usage

Now let's look at strategies to reduce memory consumption:

1. Using Appropriate Data Types

Converting columns to more memory-efficient types can save significant space:

python
# Original memory usage
print(f"Original memory: {df.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")

# Convert integer column to smaller dtype
df['B'] = df['B'].astype('int8')  # values 0-99 fit comfortably in int8 (-128 to 127)

# Check new memory usage
print(f"After int8 conversion: {df.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")

Output:

Original memory: 8.39 MB
After int8 conversion: 7.64 MB
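
Downcasting is only safe when every value fits in the smaller type; np.iinfo reports each integer type's range, so you can guard the conversion (a small sketch):

python
# The representable ranges of the candidate types
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)    # -128, 127
print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)  # -32768, 32767

# Only downcast when the data actually fits
if df['B'].min() >= np.iinfo(np.int8).min and df['B'].max() <= np.iinfo(np.int8).max:
    df['B'] = df['B'].astype('int8')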

2. Categorical Data for Strings

For columns with repeated string values, using categorical data types can save substantial memory:

python
# Create a DataFrame with repeated values
df_cat = pd.DataFrame({
    'city': np.random.choice(['New York', 'London', 'Tokyo', 'Paris', 'Beijing'], 100000)
})

# Check memory before conversion
print(f"Before categorical: {df_cat.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")

# Convert to categorical
df_cat['city'] = df_cat['city'].astype('category')

# Check memory after conversion
print(f"After categorical: {df_cat.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")

Output:

Before categorical: 3.96 MB
After categorical: 0.10 MB
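
The savings come from the categorical representation itself: each unique label is stored once, and every row holds only a small integer code pointing to it:

python
print(df_cat['city'].cat.categories)    # the 5 unique city names, stored once
print(df_cat['city'].cat.codes.head())  # per-row integer codes
print(df_cat['city'].cat.codes.dtype)   # a small integer type (int8 for 5 categories)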

3. Automatic Type Optimization with pd.read_csv()

When loading data from CSV files, you can use the dtype and parse_dates parameters to optimize memory usage right from the start:

python
# Sample code for optimized CSV loading
optimized_df = pd.read_csv('large_file.csv',
                           dtype={
                               'id': 'int32',
                               'small_numbers': 'int8',
                               'category_column': 'category'
                           },
                           parse_dates=['date_column'])

4. Using downcast Option

The downcast parameter of pd.to_numeric() can automatically pick the smallest dtype that holds your data:

python
# Create a DataFrame with numbers
df_nums = pd.DataFrame({
    'A': np.random.randint(0, 100, size=100000),
    'B': np.random.rand(100000)
})

# Check memory usage
print(f"Before downcast: {df_nums.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")

# Downcast to smaller dtypes
df_nums_small = df_nums.copy()
df_nums_small['A'] = pd.to_numeric(df_nums_small['A'], downcast='integer')
df_nums_small['B'] = pd.to_numeric(df_nums_small['B'], downcast='float')

# Check memory usage after downcast
print(f"After downcast: {df_nums_small.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")

Output:

Before downcast: 1.53 MB
After downcast: 0.39 MB
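
It is worth checking which dtypes downcast settled on; note that downcast='float' goes no smaller than float32:

python
# Inspect the chosen dtypes (expect something like A -> int8, B -> float32)
print(df_nums_small.dtypes)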

Automatic Memory Optimization

For a more automated approach to memory optimization, let's create a utility function:

python
def reduce_memory_usage(df, verbose=True):
    """
    Downcast numeric columns and convert low-cardinality object columns
    to categorical in order to reduce memory usage.
    """
    start_mem = df.memory_usage(deep=True).sum() / 1024**2

    for col in df.columns:
        col_type = df[col].dtype

        if pd.api.types.is_integer_dtype(col_type):
            c_min = df[col].min()
            c_max = df[col].max()

            if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)

        elif pd.api.types.is_float_dtype(col_type):
            c_min = df[col].min()
            c_max = df[col].max()

            # float16 saves the most space but loses precision; float32 is a safer default
            if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)
            elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)

        elif col_type == object:
            # Convert object columns to categorical if fewer than 50% of values are unique
            if df[col].nunique() / len(df) < 0.5:
                df[col] = df[col].astype('category')

    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    reduction = 100 * (start_mem - end_mem) / start_mem

    if verbose:
        print(f'Memory usage decreased from {start_mem:.2f} MB to {end_mem:.2f} MB ({reduction:.2f}% reduction)')

    return df

Let's use this function on a larger DataFrame:

python
# Create a larger DataFrame
large_df = pd.DataFrame({
    'id': range(500000),
    'int_col': np.random.randint(0, 100, size=500000),
    'float_col': np.random.rand(500000),
    'category_col': np.random.choice(['A', 'B', 'C', 'D', 'E'], 500000)
})

# Optimize memory usage
optimized_df = reduce_memory_usage(large_df)

Output:

Memory usage decreased from 17.17 MB to 2.86 MB (83.31% reduction)

Real-World Application: Processing Large CSV Files

Here's a practical example of how to process a large CSV file in chunks to manage memory usage:

python
def process_large_csv(filename, chunksize=100000):
    # Process in chunks to avoid loading the entire file into memory
    chunks = []

    for chunk in pd.read_csv(filename, chunksize=chunksize):
        # Optimize memory usage for this chunk
        chunk = reduce_memory_usage(chunk, verbose=False)

        # Process the chunk (example: keep rows where the 'value' column is positive)
        processed_chunk = chunk[chunk['value'] > 0]

        chunks.append(processed_chunk)

    # Combine processed chunks
    result = pd.concat(chunks)
    return result
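
A hypothetical call might look like this (the file name and the 'value' column are placeholders carried over from the example above):

python
filtered = process_large_csv('large_file.csv', chunksize=200000)
print(f"Kept {len(filtered)} rows, "
      f"{filtered.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB in memory")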

Best Practices for Memory Management in Pandas

  1. Check memory usage regularly with df.info(memory_usage='deep') and df.memory_usage(deep=True)

  2. Use appropriate data types:

    • Use smaller integer types when possible (int8, int16 instead of int64)
    • Convert string columns with limited unique values to categorical
    • Use datetime64 for date columns instead of strings
  3. Free memory when possible:

    • Delete intermediate DataFrames you no longer need
    • Use del df and gc.collect() to release memory (see the sketch after this list)
  4. Process large files in chunks instead of loading all at once

  5. Consider using more specialized tools like Dask or Vaex for extremely large datasets that don't fit in memory
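
A minimal sketch of point 3 (temp_df here stands for any intermediate DataFrame you no longer need):

python
import gc

# Hypothetical intermediate result that is no longer needed
temp_df = large_df.copy()

# Remove the reference and ask the garbage collector to run;
# the OS may not hand back all freed memory immediately
del temp_df
gc.collect()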

Summary

Managing memory in pandas is essential when working with large datasets. By:

  • Understanding how pandas allocates memory
  • Choosing appropriate data types
  • Using techniques like categorical conversion and downcasting
  • Processing large files in chunks

you can significantly reduce memory consumption and work with larger datasets, even on hardware with limited RAM.

Exercises

  1. Create a DataFrame with 1 million rows containing different data types and optimize its memory usage.

  2. Write a function that reads a large CSV file and reports the memory that could be saved by optimizing each column.

  3. Compare the performance (both memory and speed) of string columns vs. categorical columns for different operations like filtering and groupby.

  4. Implement a function that processes a large dataset in chunks, applying a memory optimization to each chunk.
