Pandas Apply Functions

Introduction

When working with data in Pandas, you'll often need to transform values or perform calculations across your DataFrame or Series objects. While basic operations and built-in methods can handle many tasks, sometimes you need to apply custom logic to your data. This is where Pandas' apply functions come in.

In this tutorial, we'll explore how to use the apply(), applymap(), and map() functions in Pandas to transform data efficiently. These functions let you apply custom operations to your data without writing explicit loops, keeping your code concise and readable.

The Apply Family of Functions

Pandas provides several functions for applying operations to your data:

  1. apply() - Works on both DataFrame and Series objects: on a DataFrame it applies a function along an axis (column by column or row by row), and on a Series it applies the function to each element
  2. applymap() - Works on DataFrame objects to apply a function to every element (deprecated since pandas 2.1 in favor of the equivalent DataFrame.map())
  3. map() - Works on Series objects to map values to other values using a function, dictionary, or Series

Let's explore each of these functions with examples.

Using apply() with Series

The apply() method for Series allows you to apply a function to each element in the Series. This is perfect when you need to transform each value independently.

Basic Example

python
import pandas as pd

# Create a simple Series
numbers = pd.Series([1, 2, 3, 4, 5])

# Apply a square function to each element
squared = numbers.apply(lambda x: x**2)

print("Original Series:")
print(numbers)
print("\nAfter applying square function:")
print(squared)

Output:

Original Series:
0    1
1    2
2    3
3    4
4    5
dtype: int64

After applying square function:
0     1
1     4
2     9
3    16
4    25
dtype: int64
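
For a simple arithmetic transformation like squaring, the same result can also be computed with a vectorized expression, which avoids calling a Python function once per element and is usually faster. A minimal comparison:

python
import pandas as pd

numbers = pd.Series([1, 2, 3, 4, 5])

# Vectorized: one operation on the whole Series
squared_vec = numbers ** 2

# Same values as numbers.apply(lambda x: x**2), without a Python-level loop
print(squared_vec.equals(numbers.apply(lambda x: x**2)))  # True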

Using Named Functions

You can also use named functions instead of lambda functions:

python
import pandas as pd
import numpy as np

# Create a Series with some missing values
data = pd.Series([10, 20, np.nan, 30, np.nan, 40])

def replace_missing(x):
    return 0 if pd.isna(x) else x

# Apply the function to replace missing values
cleaned_data = data.apply(replace_missing)

print("Original Series:")
print(data)
print("\nAfter replacing missing values:")
print(cleaned_data)

Output:

Original Series:
0    10.0
1    20.0
2     NaN
3    30.0
4     NaN
5    40.0
dtype: float64

After replacing missing values:
0    10.0
1    20.0
2     0.0
3    30.0
4     0.0
5    40.0
dtype: float64
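
For this particular task, Pandas also has a built-in method, fillna(), that replaces missing values without a custom function:

python
import pandas as pd
import numpy as np

data = pd.Series([10, 20, np.nan, 30, np.nan, 40])

# Built-in equivalent of the replace_missing apply above
cleaned_data = data.fillna(0)
print(cleaned_data)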

Using apply() with DataFrames

When working with DataFrames, apply() operates on entire rows or columns at once, depending on the axis parameter.

Applying to Columns (Default: axis=0)

python
import pandas as pd
import numpy as np

# Create a DataFrame with student scores
data = {
    'Math': [85, 90, 70, 95, 80],
    'Science': [90, 85, 95, 88, 92],
    'English': [75, 85, 80, 90, 85]
}

df = pd.DataFrame(data)

# Calculate the average score for each subject (the default axis=0 works column by column)
avg_scores = df.apply(np.mean)

print("Student Scores DataFrame:")
print(df)
print("\nAverage score for each subject:")
print(avg_scores)

Output:

Student Scores DataFrame:
   Math  Science  English
0    85       90       75
1    90       85       85
2    70       95       80
3    95       88       90
4    80       92       85

Average score for each subject:
Math       84.0
Science    90.0
English    83.0
dtype: float64
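
For a plain mean you could simply call df.mean(); column-wise apply() earns its keep with custom logic. A small, self-contained sketch computing the score range of each subject:

python
import pandas as pd

df = pd.DataFrame({
    'Math': [85, 90, 70, 95, 80],
    'Science': [90, 85, 95, 88, 92],
    'English': [75, 85, 80, 90, 85]
})

# Built-in shortcut for the averages computed above
print(df.mean())

# Custom column-wise statistic: score range (max minus min) per subject
print(df.apply(lambda col: col.max() - col.min()))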

Applying to Rows (axis=1)

python
import pandas as pd

# Create a DataFrame with student scores
data = {
    'Math': [85, 90, 70, 95, 80],
    'Science': [90, 85, 95, 88, 92],
    'English': [75, 85, 80, 90, 85]
}

df = pd.DataFrame(data)

# Calculate the average score for each student
df['Average'] = df.apply(lambda row: row.mean(), axis=1)

# Determine if the student passed (average >= 80)
df['Passed'] = df['Average'].apply(lambda x: 'Yes' if x >= 80 else 'No')

print("Student Scores with Average and Pass Status:")
print(df)

Output:

Student Scores with Average and Pass Status:
   Math  Science  English  Average Passed
0    85       90       75    83.33    Yes
1    90       85       85    86.67    Yes
2    70       95       80    81.67    Yes
3    95       88       90    91.00    Yes
4    80       92       85    85.67    Yes
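
A row-wise function can also return a pd.Series, in which case Pandas expands the result into multiple columns. A small sketch using the same scores data (the 'Best' and 'Worst' column names are just illustrative):

python
import pandas as pd

df = pd.DataFrame({
    'Math': [85, 90, 70, 95, 80],
    'Science': [90, 85, 95, 88, 92],
    'English': [75, 85, 80, 90, 85]
})

def summarize(row):
    # Each key of the returned Series becomes a new column
    return pd.Series({'Best': row.max(), 'Worst': row.min()})

summary_cols = df.apply(summarize, axis=1)
print(df.join(summary_cols))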

Using applymap() for Element-wise Operations

The applymap() function applies a function to each element in a DataFrame, making it ideal for element-wise transformations. Note that applymap() was deprecated in pandas 2.1 in favor of the equivalent DataFrame.map(); the call below still works on pandas 2.x but emits a FutureWarning there.

python
import pandas as pd
import numpy as np

# Create a DataFrame with some float values
data = {
    'A': [1.23456, 2.34567, 3.45678],
    'B': [4.56789, 5.67890, 6.78901],
    'C': [7.89012, 8.90123, 9.01234]
}

df = pd.DataFrame(data)

# Round all values to 2 decimal places
rounded_df = df.applymap(lambda x: round(x, 2))

print("Original DataFrame:")
print(df)
print("\nAfter rounding all values:")
print(rounded_df)

Output:

Original DataFrame:
         A        B        C
0  1.23456  4.56789  7.89012
1  2.34567  5.67890  8.90123
2  3.45678  6.78901  9.01234

After rounding all values:
      A     B     C
0  1.23  4.57  7.89
1  2.35  5.68  8.90
2  3.46  6.79  9.01
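
On pandas 2.1 or newer, the same element-wise transformation can be written with DataFrame.map(), the replacement for the deprecated applymap(). A short sketch with a small throwaway DataFrame:

python
import pandas as pd

df = pd.DataFrame({'A': [1.23456, 2.34567], 'B': [4.56789, 5.67890]})

# Equivalent of applymap() on pandas >= 2.1
rounded_map = df.map(lambda x: round(x, 2))

# For rounding specifically, the vectorized built-in does the same job
rounded_builtin = df.round(2)
print(rounded_builtin)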

Using map() for Value Substitution

The map() function works on Series and is perfect for value substitution or mapping values from one domain to another.

Basic Mapping

python
import pandas as pd

# Create a Series with fruit names
fruits = pd.Series(['apple', 'banana', 'orange', 'grape', 'apple', 'orange'])

# Create a mapping dictionary
fruit_prices = {
    'apple': 1.2,
    'banana': 0.5,
    'orange': 0.8,
    'grape': 2.5
}

# Map the fruits to their prices
fruit_prices_series = fruits.map(fruit_prices)

print("Fruits:")
print(fruits)
print("\nMapped Prices:")
print(fruit_prices_series)

Output:

Fruits:
0     apple
1    banana
2    orange
3     grape
4     apple
5    orange
dtype: object

Mapped Prices:
0    1.2
1    0.5
2    0.8
3    2.5
4    1.2
5    0.8
dtype: float64
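
One detail worth remembering: values that do not appear in the mapping dictionary become NaN. If you want a fallback instead, chain fillna() after map(). A small sketch with a hypothetical fruit that is missing from the dictionary:

python
import pandas as pd

fruits = pd.Series(['apple', 'kiwi'])           # 'kiwi' has no entry in the dictionary
fruit_prices = {'apple': 1.2, 'banana': 0.5}

print(fruits.map(fruit_prices))                 # 'kiwi' maps to NaN
print(fruits.map(fruit_prices).fillna(0.0))     # fall back to a default price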

Mapping with a Function

You can also use a function with map():

python
import pandas as pd

# Create a Series with string values
data = pd.Series(['PYTHON', 'pandas', 'DATA', 'analysis'])

# Apply a function to standardize the strings
standardized = data.map(lambda x: x.capitalize())

print("Original Series:")
print(data)
print("\nAfter standardizing:")
print(standardized)

Output:

Original Series:
0      PYTHON
1      pandas
2        DATA
3    analysis
dtype: object

After standardizing:
0      Python
1      Pandas
2        Data
3    Analysis
dtype: object
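
For string transformations like this, the vectorized str accessor is a handy alternative that also handles missing values gracefully:

python
import pandas as pd

data = pd.Series(['PYTHON', 'pandas', 'DATA', 'analysis'])

# Same result as data.map(lambda x: x.capitalize())
standardized = data.str.capitalize()
print(standardized)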

Real-world Examples

Example 1: Cleaning and Transforming Customer Data

python
import pandas as pd
import numpy as np

# Sample customer data
data = {
    'customer_id': [101, 102, 103, 104, 105],
    'name': ['John Smith', 'JANE DOE', 'robert johnson', 'Sarah Williams', 'mike brown'],
    'email': ['[email protected]', 'jane@example', '[email protected]', '', '[email protected]'],
    'purchase_amount': [125.50, 200.75, np.nan, 350.25, 175.00],
    'purchase_date': ['2023-01-15', '2023-01-20', '2023-01-25', '2023-02-01', '2023-02-10']
}

df = pd.DataFrame(data)

# Data cleaning and transformation
# 1. Standardize names (first letter capitalized)
df['name'] = df['name'].apply(lambda x: ' '.join([word.capitalize() for word in x.split()]))

# 2. Validate emails
def validate_email(email):
    if not email or '@' not in email or '.' not in email.split('@')[1]:
        return 'Invalid Email'
    return email

df['email'] = df['email'].apply(validate_email)

# 3. Fill missing purchase amounts with average
avg_purchase = df['purchase_amount'].mean()
df['purchase_amount'] = df['purchase_amount'].apply(lambda x: avg_purchase if pd.isna(x) else x)

# 4. Convert dates to datetime and extract month
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df['purchase_month'] = df['purchase_date'].apply(lambda x: x.strftime('%B'))

print("Cleaned and transformed customer data:")
print(df)

Output:

Cleaned and transformed customer data:
   customer_id            name            email  purchase_amount purchase_date purchase_month
0          101      John Smith  [email protected]           125.50    2023-01-15        January
1          102        Jane Doe    Invalid Email           200.75    2023-01-20        January
2          103  Robert Johnson  [email protected]           212.88    2023-01-25        January
3          104  Sarah Williams    Invalid Email           350.25    2023-02-01       February
4          105      Mike Brown  [email protected]           175.00    2023-02-10       February
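
Several of the steps above also have vectorized or built-in shortcuts worth knowing. A sketch of the same cleaning using those shortcuts, assuming the df from this example is already defined:

python
# Vectorized title-casing instead of apply() with a capitalize loop
df['name'] = df['name'].str.title()

# Built-in missing-value handling instead of apply() with pd.isna
df['purchase_amount'] = df['purchase_amount'].fillna(df['purchase_amount'].mean())

# Datetime accessor instead of apply() with strftime
df['purchase_month'] = df['purchase_date'].dt.month_name()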

Example 2: Analyzing Financial Data

python
import pandas as pd
import numpy as np

# Sample stock data
data = {
    'date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'stock_a': [100, 102, 104, 103, 105, 107, 108, 106, 104, 105],
    'stock_b': [50, 52, 51, 53, 54, 52, 51, 50, 51, 52],
    'stock_c': [200, 198, 195, 197, 201, 203, 205, 202, 200, 205]
}

df = pd.DataFrame(data)
df.set_index('date', inplace=True)

# Calculate daily returns
def calculate_return(column):
    return column.pct_change() * 100

daily_returns = df.apply(calculate_return)

# Calculate volatility (standard deviation of returns)
volatility = daily_returns.apply(np.std)

# Calculate cumulative returns
def calculate_cumulative_return(column):
    return ((column.iloc[-1] - column.iloc[0]) / column.iloc[0]) * 100

cumulative_returns = df.apply(calculate_cumulative_return)

# Create a summary DataFrame
summary = pd.DataFrame({
    'Starting Price': df.iloc[0],
    'Ending Price': df.iloc[-1],
    'Cumulative Return (%)': cumulative_returns,
    'Volatility (%)': volatility
})

print("Stock Price Summary:")
print(summary)

Output:

Stock Price Summary:
         Starting Price  Ending Price  Cumulative Return (%)  Volatility (%)
stock_a             100           105                   5.00            1.25
stock_b              50            52                   4.00            1.37
stock_c             200           205                   2.50            1.30
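
One subtlety in the volatility step: NumPy's std() defaults to the population standard deviation (ddof=0), Pandas' std() defaults to the sample standard deviation (ddof=1), and some pandas versions translate np.std passed to apply() into Pandas' own method. To keep the calculation unambiguous, call std() directly with an explicit ddof (continuing from the daily_returns computed above):

python
# Be explicit about the degrees of freedom instead of relying on np.std via apply()
volatility_population = daily_returns.std(ddof=0)  # NumPy's default convention
volatility_sample = daily_returns.std(ddof=1)      # Pandas' default convention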

Tips and Best Practices

  1. Use vectorized operations when possible: Before reaching for apply(), check whether a built-in Pandas method or vectorized expression can do the job more efficiently (see the sketch after this list).

  2. Consider performance: For large DataFrames, apply() can be slower than vectorized operations. If performance is critical, consider alternatives like NumPy operations.

  3. Choose the right function:

    • Use apply() when working with rows or columns as a whole
    • Use applymap() for element-wise operations on DataFrames (or DataFrame.map() on pandas 2.1+)
    • Use map() for simple value substitutions on Series
  4. Pass additional arguments to your function, either with partial from functools (shown below) or directly through apply()'s args parameter and keyword arguments:

python
from functools import partial

def custom_function(x, multiplier):
    return x * multiplier

# Apply with a specific multiplier
df.apply(partial(custom_function, multiplier=2))
  5. Combine with method chaining for cleaner code, for example on a single column of the scores DataFrame:
python
# Chaining on a Series means sort_values() needs no by= argument
result = (df['Math']
          .dropna()
          .apply(partial(custom_function, multiplier=2))
          .sort_values())
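
As referenced in tip 1, here is a minimal sketch comparing apply() with a vectorized equivalent on a made-up numeric column; exact timings vary, but the vectorized version is typically much faster on large data:

python
import pandas as pd
import numpy as np

df = pd.DataFrame({'value': np.random.rand(1_000_000)})

# Element-wise Python function via apply()
scaled_apply = df['value'].apply(lambda x: x * 100)

# Vectorized equivalent: one operation on the whole column
scaled_vec = df['value'] * 100

print(scaled_apply.equals(scaled_vec))  # True: same result, very different speed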

Summary

Pandas apply functions provide a powerful way to transform data in DataFrames and Series:

  • apply() lets you apply functions to entire rows or columns
  • applymap() applies functions to every element in a DataFrame (superseded by DataFrame.map() since pandas 2.1)
  • map() is great for value substitution in a Series

These functions help you avoid explicit loops and make your data transformation code more concise and readable. While they may not always be the most performant option for large datasets, they strike a good balance between readability and efficiency for most data analysis tasks.

Exercises

  1. Create a DataFrame with employee data (name, department, salary) and use apply() to calculate a bonus for each employee based on their department and salary.

  2. Given a Series of dates, use apply() to extract the day of the week for each date.

  3. Create a DataFrame with product information and use applymap() to format all string columns to be title case and all numeric columns to have two decimal places.

  4. Use map() to convert a Series of country codes to full country names using a dictionary.

  5. Analyze a dataset of your choice using the apply functions to transform and extract meaningful information.
