Pandas Apply Functions
Introduction
When working with data in Pandas, you'll often need to transform values or perform calculations across your DataFrame or Series objects. While basic operations and built-in methods can handle many tasks, sometimes you need to apply custom logic to your data. This is where Pandas' apply functions come in.
In this tutorial, we'll explore how to use the apply(), applymap(), and map() functions in Pandas to transform data efficiently. These functions allow you to apply custom operations to your data without writing explicit loops, making your code more concise and often more performant.
The Apply Family of Functions
Pandas provides several functions for applying operations to your data:
- apply() - Works on both DataFrame and Series objects to apply a function along an axis
- applymap() - Works on DataFrame objects to apply a function to every element
- map() - Works on Series objects to map values to other values
Let's explore each of these functions with examples.
Using apply() with Series
The apply() method for Series allows you to apply a function to each element in the Series. This is perfect when you need to transform each value independently.
Basic Example
import pandas as pd
# Create a simple Series
numbers = pd.Series([1, 2, 3, 4, 5])
# Apply a square function to each element
squared = numbers.apply(lambda x: x**2)
print("Original Series:")
print(numbers)
print("\nAfter applying square function:")
print(squared)
Output:
Original Series:
0 1
1 2
2 3
3 4
4 5
dtype: int64
After applying square function:
0 1
1 4
2 9
3 16
4 25
dtype: int64
Using Named Functions
You can also use named functions instead of lambda functions:
import pandas as pd
import numpy as np
# Create a Series with some missing values
data = pd.Series([10, 20, np.nan, 30, np.nan, 40])
def replace_missing(x):
    return 0 if pd.isna(x) else x
# Apply the function to replace missing values
cleaned_data = data.apply(replace_missing)
print("Original Series:")
print(data)
print("\nAfter replacing missing values:")
print(cleaned_data)
Output:
Original Series:
0 10.0
1 20.0
2 NaN
3 30.0
4 NaN
5 40.0
dtype: float64
After replacing missing values:
0 10.0
1 20.0
2 0.0
3 30.0
4 0.0
5 40.0
dtype: float64
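As an aside, this particular transformation also has a built-in, vectorized equivalent: Series.fillna(). A minimal sketch using the same Series:
import pandas as pd
import numpy as np
data = pd.Series([10, 20, np.nan, 30, np.nan, 40])
# fillna() replaces every missing value in one vectorized call,
# giving the same result as the apply() version above
cleaned_data = data.fillna(0)
print(cleaned_data)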
Using apply() with DataFrames
When working with DataFrames, apply() operates on entire rows or columns at once, depending on the axis parameter.
Applying to Columns (Default: axis=0)
import pandas as pd
import numpy as np
# Create a DataFrame with student scores
data = {
    'Math': [85, 90, 70, 95, 80],
    'Science': [90, 85, 95, 88, 92],
    'English': [75, 85, 80, 90, 85]
}
df = pd.DataFrame(data)
# Calculate the average score for each subject
avg_scores = df.apply(np.mean)
print("Student Scores DataFrame:")
print(df)
print("\nAverage score for each subject:")
print(avg_scores)
Output:
Student Scores DataFrame:
Math Science English
0 85 90 75
1 90 85 85
2 70 95 80
3 95 88 90
4 80 92 85
Average score for each subject:
Math 84.0
Science 90.0
English 83.0
dtype: float64
Applying to Rows (axis=1)
import pandas as pd
# Create a DataFrame with student scores
data = {
    'Math': [85, 90, 70, 95, 80],
    'Science': [90, 85, 95, 88, 92],
    'English': [75, 85, 80, 90, 85]
}
df = pd.DataFrame(data)
# Calculate the average score for each student
df['Average'] = df.apply(lambda row: row.mean(), axis=1)
# Determine if the student passed (average >= 80)
df['Passed'] = df['Average'].apply(lambda x: 'Yes' if x >= 80 else 'No')
print("Student Scores with Average and Pass Status:")
print(df)
Output:
Student Scores with Average and Pass Status:
Math Science English Average Passed
0 85 90 75 83.33 Yes
1 90 85 85 86.67 Yes
2 70 95 80 81.67 Yes
3 95 88 90 91.00 Yes
4 80 92 85 85.67 Yes
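For comparison, the same two columns can be built without apply() at all, using vectorized operations. A brief sketch with the same data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Math': [85, 90, 70, 95, 80],
    'Science': [90, 85, 95, 88, 92],
    'English': [75, 85, 80, 90, 85]
})
# mean(axis=1) computes the row-wise average directly
df['Average'] = df.mean(axis=1)
# np.where turns a vectorized comparison into the Yes/No column
df['Passed'] = np.where(df['Average'] >= 80, 'Yes', 'No')
print(df)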
Using applymap() for Element-wise Operations
The applymap() function applies a function to each element in a DataFrame, making it ideal for element-wise transformations.
import pandas as pd
import numpy as np
# Create a DataFrame with some float values
data = {
    'A': [1.23456, 2.34567, 3.45678],
    'B': [4.56789, 5.67890, 6.78901],
    'C': [7.89012, 8.90123, 9.01234]
}
df = pd.DataFrame(data)
# Round all values to 2 decimal places
rounded_df = df.applymap(lambda x: round(x, 2))
print("Original DataFrame:")
print(df)
print("\nAfter rounding all values:")
print(rounded_df)
Output:
Original DataFrame:
A B C
0 1.23456 4.56789 7.89012
1 2.34567 5.67890 8.90123
2 3.45678 6.78901 9.01234
After rounding all values:
A B C
0 1.23 4.57 7.89
1 2.35 5.68 8.90
2 3.46 6.79 9.01
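Note that in Pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map(), which performs the same element-wise operation. If you are on a recent version, the example above can be written as:
# Pandas 2.1+ spelling of the same element-wise transformation
rounded_df = df.map(lambda x: round(x, 2))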
Using map() for Value Substitution
The map() function works on Series and is perfect for value substitution or mapping values from one domain to another.
Basic Mapping
import pandas as pd
# Create a Series with fruit names
fruits = pd.Series(['apple', 'banana', 'orange', 'grape', 'apple', 'orange'])
# Create a mapping dictionary
fruit_prices = {
    'apple': 1.2,
    'banana': 0.5,
    'orange': 0.8,
    'grape': 2.5
}
# Map the fruits to their prices
fruit_prices_series = fruits.map(fruit_prices)
print("Fruits:")
print(fruits)
print("\nMapped Prices:")
print(fruit_prices_series)
Output:
Fruits:
0 apple
1 banana
2 orange
3 grape
4 apple
5 orange
dtype: object
Mapped Prices:
0 1.2
1 0.5
2 0.8
3 2.5
4 1.2
5 0.8
dtype: float64
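One detail worth remembering: any value that does not appear in the mapping dictionary becomes NaN in the result, which makes unexpected entries easy to spot. A tiny sketch:
import pandas as pd
# 'kiwi' has no entry in the price dictionary
fruits = pd.Series(['apple', 'kiwi'])
prices = fruits.map({'apple': 1.2})
print(prices)
# 0    1.2
# 1    NaN
# dtype: float64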
Mapping with a Function
You can also use a function with map():
import pandas as pd
# Create a Series with string values
data = pd.Series(['PYTHON', 'pandas', 'DATA', 'analysis'])
# Apply a function to standardize the strings
standardized = data.map(lambda x: x.capitalize())
print("Original Series:")
print(data)
print("\nAfter standardizing:")
print(standardized)
Output:
Original Series:
0 PYTHON
1 pandas
2 DATA
3 analysis
dtype: object
After standardizing:
0 Python
1 Pandas
2 Data
3 Analysis
dtype: object
Real-world Examples
Example 1: Cleaning and Transforming Customer Data
import pandas as pd
import numpy as np
# Sample customer data
data = {
    'customer_id': [101, 102, 103, 104, 105],
    'name': ['John Smith', 'JANE DOE', 'robert johnson', 'Sarah Williams', 'mike brown'],
    'email': ['[email protected]', 'jane@example', '[email protected]', '', '[email protected]'],
    'purchase_amount': [125.50, 200.75, np.nan, 350.25, 175.00],
    'purchase_date': ['2023-01-15', '2023-01-20', '2023-01-25', '2023-02-01', '2023-02-10']
}
df = pd.DataFrame(data)
# Data cleaning and transformation
# 1. Standardize names (first letter capitalized)
df['name'] = df['name'].apply(lambda x: ' '.join([word.capitalize() for word in x.split()]))
# 2. Validate emails
def validate_email(email):
    if not email or '@' not in email or '.' not in email.split('@')[1]:
        return 'Invalid Email'
    return email
df['email'] = df['email'].apply(validate_email)
# 3. Fill missing purchase amounts with average
avg_purchase = df['purchase_amount'].mean()
df['purchase_amount'] = df['purchase_amount'].apply(lambda x: avg_purchase if pd.isna(x) else x)
# 4. Convert dates to datetime and extract month
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df['purchase_month'] = df['purchase_date'].apply(lambda x: x.strftime('%B'))
print("Cleaned and transformed customer data:")
print(df)
Output:
Cleaned and transformed customer data:
customer_id name email purchase_amount purchase_date purchase_month
0 101 John Smith [email protected] 125.50 2023-01-15 January
1 102 Jane Doe Invalid Email 200.75 2023-01-20 January
2 103 Robert Johnson [email protected] 212.88 2023-01-25 January
3 104 Sarah Williams Invalid Email 350.25 2023-02-01 February
4 105 Mike Brown [email protected] 175.00 2023-02-10 February
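As an aside, several of these cleanup steps have vectorized counterparts. A brief sketch of what steps 1, 3, and 4 could look like if written without apply() (using the same df, run in place of the apply() versions above):
# Vectorized alternatives to steps 1, 3 and 4 above
df['name'] = df['name'].str.title()                                   # string accessor instead of apply()
df['purchase_amount'] = df['purchase_amount'].fillna(df['purchase_amount'].mean())
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df['purchase_month'] = df['purchase_date'].dt.month_name()            # datetime accessor instead of strftime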
Example 2: Analyzing Financial Data
import pandas as pd
import numpy as np
# Sample stock data
data = {
    'date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'stock_a': [100, 102, 104, 103, 105, 107, 108, 106, 104, 105],
    'stock_b': [50, 52, 51, 53, 54, 52, 51, 50, 51, 52],
    'stock_c': [200, 198, 195, 197, 201, 203, 205, 202, 200, 205]
}
df = pd.DataFrame(data)
df.set_index('date', inplace=True)
# Calculate daily returns
def calculate_return(column):
    return column.pct_change() * 100
daily_returns = df.apply(calculate_return)
# Calculate volatility (standard deviation of returns)
volatility = daily_returns.apply(np.std)
# Calculate cumulative returns
def calculate_cumulative_return(column):
    return ((column.iloc[-1] - column.iloc[0]) / column.iloc[0]) * 100
cumulative_returns = df.apply(calculate_cumulative_return)
# Create a summary DataFrame
summary = pd.DataFrame({
    'Starting Price': df.iloc[0],
    'Ending Price': df.iloc[-1],
    'Cumulative Return (%)': cumulative_returns,
    'Volatility (%)': volatility
})
print("Stock Price Summary:")
print(summary)
Output:
Stock Price Summary:
Starting Price Ending Price Cumulative Return (%) Volatility (%)
stock_a 100 105 5.00 1.25
stock_b 50 52 4.00 1.37
stock_c 200 205 2.50 1.30
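Since pct_change() and std() are themselves DataFrame methods, the same statistics can also be computed without wrapping them in apply(). A condensed sketch using the same df:
# pct_change() works column-wise on the whole DataFrame
daily_returns = df.pct_change() * 100
# std(ddof=0) gives the population standard deviation (np.std's default)
volatility = daily_returns.std(ddof=0)
# cumulative return from the first and last rows of each column
cumulative_returns = (df.iloc[-1] - df.iloc[0]) / df.iloc[0] * 100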
Tips and Best Practices
- Use vectorized operations when possible: Before reaching for apply(), check if there's a built-in Pandas function that can do the job more efficiently.
- Consider performance: For large DataFrames, apply() can be slower than vectorized operations. If performance is critical, consider alternatives like NumPy operations.
- Choose the right function:
  - Use apply() when working with rows or columns as a whole
  - Use applymap() for element-wise operations on DataFrames
  - Use map() for simple value substitutions on Series
- Pass additional arguments to your function using partial from functools (apply() can also forward extra arguments itself; see the short example after this list):
from functools import partial

def custom_function(x, multiplier):
    return x * multiplier

# Apply with a specific multiplier
df.apply(partial(custom_function, multiplier=2))
- Combine with method chaining for cleaner code:
result = (df
          .dropna()
          .apply(custom_function)
          .sort_values())
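As a complement to the partial tip above, apply() can also forward extra positional and keyword arguments directly to your function via args and **kwargs. A small sketch with a hypothetical Series s:
import pandas as pd

def custom_function(x, multiplier):
    return x * multiplier

s = pd.Series([1, 2, 3])

# keyword arguments after the function are passed straight through
print(s.apply(custom_function, multiplier=2))

# positional arguments can be supplied with args=
print(s.apply(custom_function, args=(3,)))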
Summary
Pandas apply functions provide a powerful way to transform data in DataFrames and Series:
- apply() lets you apply functions to entire rows or columns
- applymap() applies a function to every element in a DataFrame
- map() is great for value substitution in a Series
These functions help you avoid explicit loops and make your data transformation code more concise and readable. While they may not always be the most performant option for large datasets, they strike a good balance between readability and efficiency for most data analysis tasks.
Exercises
- Create a DataFrame with employee data (name, department, salary) and use apply() to calculate a bonus for each employee based on their department and salary.
- Given a Series of dates, use apply() to extract the day of the week for each date.
- Create a DataFrame with product information and use applymap() to format all string columns to title case and all numeric columns to two decimal places.
- Use map() to convert a Series of country codes to full country names using a dictionary.
- Analyze a dataset of your choice using the apply functions to transform and extract meaningful information.