Pandas Memory Usage
When working with large datasets in pandas, understanding and managing memory usage becomes crucial. Efficient memory management can significantly improve performance and help you work with bigger datasets on limited hardware. This guide will explain how to analyze and optimize memory usage in pandas objects.
Introduction to Memory Management in Pandas
Pandas is powerful for data analysis, but it can be memory-intensive, especially with large datasets. Each pandas object (DataFrame or Series) consumes memory based on:
- The number of rows and columns
- The data types of each column
- Additional memory overhead for indexes and internal structures
Understanding how pandas allocates memory will help you work with larger datasets and avoid running into memory errors.
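As a quick illustration, the minimal sketch below stores the same one million integers under two different dtypes; the byte counts reported by memory_usage() differ by roughly a factor of eight (exact values vary slightly by pandas version):
import pandas as pd
import numpy as np
# The same one million values stored as int64 (8 bytes each) vs. int8 (1 byte each)
values = np.random.randint(0, 100, size=1_000_000)
print(pd.Series(values, dtype='int64').memory_usage(deep=True))  # ~8,000,128 bytes
print(pd.Series(values, dtype='int8').memory_usage(deep=True))   # ~1,000,128 bytes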
Checking Memory Usage
Basic Memory Usage
Let's start by exploring how to check the memory footprint of pandas objects.
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
    'A': np.random.rand(100000),
    'B': np.random.randint(0, 100, size=100000),
    'C': ['string' + str(i) for i in range(100000)],
    'D': pd.date_range('20210101', periods=100000)
})
# Check memory usage
df.info(memory_usage='deep')
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   A       100000 non-null  float64
 1   B       100000 non-null  int64
 2   C       100000 non-null  object
 3   D       100000 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 8.4 MB
Detailed Memory Analysis
For more detailed information, you can use the memory_usage() method:
# Get memory usage per column
print(df.memory_usage(deep=True))
# Get total memory usage
print(f"Total memory usage: {df.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
Output:
Index         128
A          800000
B          800000
C         6400044
D          800000
dtype: int64
Total memory usage: 8.39 MB
The deep=True parameter is important as it accounts for the actual memory used by object dtypes (like strings). Without it, pandas only reports the shallow memory usage, which can underestimate the true memory footprint.
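To see the difference, compare the shallow and deep figures for the object column C from the DataFrame above:
# Shallow: only the 8-byte pointers to the Python string objects are counted
print(df['C'].memory_usage())
# Deep: the string objects themselves are included, which is several times larger
print(df['C'].memory_usage(deep=True))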
Understanding Data Types and Memory
Different data types consume different amounts of memory:
| Data Type | Typical Memory Usage |
|-----------|----------------------|
| int64 | 8 bytes per value |
| float64 | 8 bytes per value |
| bool | 1 byte per value |
| datetime64 | 8 bytes per value |
| object | Variable (typically more) |
The object dtype (used for strings and mixed types) is usually the biggest memory consumer.
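The overhead comes from every value being a full Python object. You can see the per-value cost directly with sys.getsizeof (exact numbers vary by Python version):
import sys
# A short Python string carries roughly 50 bytes of object overhead on top of its characters
print(sys.getsizeof('string0'))
# A fixed-width numeric value occupies exactly its dtype's item size
print(np.dtype('int64').itemsize)  # 8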
Optimizing Memory Usage
Now let's look at strategies to reduce memory consumption:
1. Using Appropriate Data Types
Converting columns to more memory-efficient types can save significant space:
# Original memory usage
print(f"Original memory: {df.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
# Convert integer column to smaller dtype
df['B'] = df['B'].astype('int8')  # Values 0-99 fit comfortably in int8 (-128 to 127)
# Check new memory usage
print(f"After int8 conversion: {df.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
Output:
Original memory: 8.39 MB
After int8 conversion: 7.72 MB
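Before downcasting by hand, it is worth confirming that the column's actual range fits the target type; np.iinfo exposes the limits of each integer dtype:
# Bounds of the target dtype vs. the actual range of the data
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
print(df['B'].min(), df['B'].max())                  # 0 and 99 in this example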
2. Categorical Data for Strings
For columns with repeated string values, using categorical data types can save substantial memory:
# Create a DataFrame with repeated values
df_cat = pd.DataFrame({
    'city': np.random.choice(['New York', 'London', 'Tokyo', 'Paris', 'Beijing'], 100000)
})
# Check memory before conversion
print(f"Before categorical: {df_cat.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
# Convert to categorical
df_cat['city'] = df_cat['city'].astype('category')
# Check memory after conversion
print(f"After categorical: {df_cat.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
Output:
Before categorical: 3.96 MB
After categorical: 0.10 MB
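The savings come from how categoricals are stored: one small integer code per row plus a single copy of each unique label. You can inspect both pieces:
# One int8 code per row plus the five unique city names
print(df_cat['city'].cat.codes.dtype)  # int8, since there are only five categories
print(df_cat['city'].cat.categories)   # the unique labels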
3. Optimizing Types When Loading with pd.read_csv()
When loading data from CSV files, you can use the dtype and parse_dates parameters to optimize memory usage right from the start:
# Sample code for optimized CSV loading
optimized_df = pd.read_csv(
    'large_file.csv',
    dtype={
        'id': 'int32',
        'small_numbers': 'int8',
        'category_column': 'category'
    },
    parse_dates=['date_column']
)
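If you are not sure which dtypes are safe for a file, one common approach is to read a small sample first, inspect the ranges, and then load the full file with explicit dtypes (the file and column names below are the same placeholders used above):
# Peek at the first few thousand rows to decide which dtypes are safe
sample = pd.read_csv('large_file.csv', nrows=5000)
print(sample.dtypes)
print(sample.memory_usage(deep=True))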
4. Using the downcast Option
The downcast parameter of pd.to_numeric() can automatically convert values to the smallest dtype that can hold them:
# Create a DataFrame with numbers
df_nums = pd.DataFrame({
    'A': np.random.randint(0, 100, size=100000),
    'B': np.random.rand(100000)
})
# Check memory usage
print(f"Before downcast: {df_nums.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
# Downcast to smaller dtypes
df_nums_small = df_nums.copy()
df_nums_small['A'] = pd.to_numeric(df_nums_small['A'], downcast='integer')
df_nums_small['B'] = pd.to_numeric(df_nums_small['B'], downcast='float')
# Check memory usage after downcast
print(f"After downcast: {df_nums_small.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
Output:
Before downcast: 1.53 MB
After downcast: 0.48 MB
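The same idea can be applied across a whole DataFrame by checking each column's kind and choosing the matching downcast target; a minimal sketch:
# Downcast every numeric column of a copy in one pass
df_auto = df_nums.copy()
for col in df_auto.columns:
    if pd.api.types.is_integer_dtype(df_auto[col]):
        df_auto[col] = pd.to_numeric(df_auto[col], downcast='integer')
    elif pd.api.types.is_float_dtype(df_auto[col]):
        df_auto[col] = pd.to_numeric(df_auto[col], downcast='float')
print(f"After downcasting all columns: {df_auto.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")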
Automatic Memory Optimization
For a more automated approach to memory optimization, let's create a utility function:
def reduce_memory_usage(df, verbose=True):
    """
    Iterate through the columns, downcasting numeric dtypes and converting
    low-cardinality object columns to categorical to reduce memory usage.
    """
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtype
        if pd.api.types.is_integer_dtype(col_type):
            c_min = df[col].min()
            c_max = df[col].max()
            if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
            else:
                df[col] = df[col].astype(np.int64)
        elif pd.api.types.is_float_dtype(col_type):
            c_min = df[col].min()
            c_max = df[col].max()
            # Note: float16 gives the biggest savings but has limited precision
            if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)
            elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
            else:
                df[col] = df[col].astype(np.float64)
        elif col_type == object:
            # For object columns, convert to categorical if fewer than 50% unique values
            if df[col].nunique() / len(df) < 0.5:
                df[col] = df[col].astype('category')
        # Other dtypes (datetime64, bool, category, ...) are left unchanged
    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    reduction = 100 * (start_mem - end_mem) / start_mem
    if verbose:
        print(f'Memory usage decreased from {start_mem:.2f} MB to {end_mem:.2f} MB '
              f'({reduction:.2f}% reduction)')
    return df
Let's use this function on a larger DataFrame:
# Create a larger DataFrame
large_df = pd.DataFrame({
    'id': range(500000),
    'int_col': np.random.randint(0, 100, size=500000),
    'float_col': np.random.rand(500000),
    'category_col': np.random.choice(['A', 'B', 'C', 'D', 'E'], 500000)
})
# Optimize memory usage
optimized_df = reduce_memory_usage(large_df)
Output:
Memory usage decreased from 17.17 MB to 2.86 MB (83.31% reduction)
Real-World Application: Processing Large CSV Files
Here's a practical example of how to process a large CSV file in chunks to manage memory usage:
def process_large_csv(filename, chunksize=100000):
    # Process in chunks to avoid loading the entire file into memory
    chunks = []
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        # Optimize memory usage for this chunk
        chunk = reduce_memory_usage(chunk, verbose=False)
        # Process the chunk (example: filter rows)
        processed_chunk = chunk[chunk['value'] > 0]
        chunks.append(processed_chunk)
    # Combine processed chunks
    result = pd.concat(chunks)
    return result
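A hedged usage sketch, assuming large_file.csv exists and contains the value column filtered on above:
result = process_large_csv('large_file.csv')
print(f"Kept {len(result)} rows using "
      f"{result.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")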
Best Practices for Memory Management in Pandas
- Check memory usage regularly with df.info(memory_usage='deep') and df.memory_usage(deep=True)
- Use appropriate data types:
  - Use smaller integer types when possible (int8, int16 instead of int64)
  - Convert string columns with limited unique values to categorical
  - Use datetime64 for date columns instead of strings
- Free memory when possible (see the sketch after this list):
  - Delete intermediate DataFrames you no longer need
  - Use del df and gc.collect() to release memory
- Process large files in chunks instead of loading all at once
- Consider using more specialized tools like Dask or Vaex for extremely large datasets that don't fit in memory
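A minimal sketch of the "free memory" point above, assuming temp_df is a hypothetical intermediate result you no longer need:
import gc
temp_df = large_df[large_df['int_col'] > 50]  # hypothetical intermediate result
# ... work with temp_df ...
del temp_df   # drop the last reference to the DataFrame
gc.collect()  # ask Python to reclaim the memory promptly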
Summary
Managing memory in pandas is essential when working with large datasets. By:
- Understanding how pandas allocates memory
- Choosing appropriate data types
- Using techniques like categorical conversion and downcasting
- Processing large files in chunks
You can significantly reduce memory consumption and work with larger datasets even on hardware with limited RAM.
Exercises
- Create a DataFrame with 1 million rows containing different data types and optimize its memory usage.
- Write a function that reads a large CSV file and reports the memory that could be saved by optimizing each column.
- Compare the performance (both memory and speed) of string columns vs. categorical columns for different operations like filtering and groupby.
- Implement a function that processes a large dataset in chunks, applying a memory optimization to each chunk.
Additional Resources
- Pandas Documentation on Memory Usage
- Tips for Optimizing Pandas Code
- Working with Large Data in Python
- Dask Documentation - For when datasets are too large for pandas