Skip to main content

Pandas Value Replacement

Introduction

When working with real-world data, you'll often encounter values that need to be replaced - whether they're missing values, incorrect entries, outliers, or values that need standardization. Pandas provides several powerful methods for replacing values in DataFrames and Series that can help clean your data and prepare it for analysis.

In this tutorial, we'll learn how to replace values in pandas using different techniques. We'll cover replacing specific values, using dictionaries for mapping replacements, using regular expressions, and handling missing values.

Basic Value Replacement

Let's start with the most basic form of value replacement using the replace() method.

Replace Specific Values

To replace specific values in a DataFrame or Series, use the replace() method:

python
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
'product': ['apple', 'banana', 'orange', 'apple', 'banana'],
'price': [1.2, 0.5, 0.8, 1.2, 0.6],
'quantity': [10, 20, 15, 8, np.nan]
})

print("Original DataFrame:")
print(df)

# Replace 'apple' with 'Apple' in the 'product' column
df['product'] = df['product'].replace('apple', 'Apple')

print("\nAfter replacing 'apple' with 'Apple':")
print(df)

Output:

Original DataFrame:
product price quantity
0 apple 1.2 10.0
1 banana 0.5 20.0
2 orange 0.8 15.0
3 apple 1.2 8.0
4 banana 0.6 NaN

After replacing 'apple' with 'Apple':
product price quantity
0 Apple 1.2 10.0
1 banana 0.5 20.0
2 orange 0.8 15.0
3 Apple 1.2 8.0
4 banana 0.6 NaN

Replace Multiple Values at Once

You can replace multiple values at once by passing lists of values to be replaced and their replacements:

python
# Create a sample Series
s = pd.Series([0, 1, 2, 3, 4])

# Replace multiple values
s_replaced = s.replace([0, 1, 2], [100, 200, 300])
print("Original Series:")
print(s)
print("\nAfter replacing multiple values:")
print(s_replaced)

Output:

Original Series:
0 0
1 1
2 2
3 3
4 4
dtype: int64

After replacing multiple values:
0 100
1 200
2 300
3 3
4 4
dtype: int64

Using Dictionaries for Replacement

A more flexible way to replace values is using dictionaries, where keys are the values to be replaced and values are the replacements:

python
# Create a sample DataFrame
data = {
'fruits': ['apple', 'banana', 'orange', 'strawberry', 'kiwi'],
'color': ['red', 'yellow', 'orange', 'red', 'green']
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Replace values using a dictionary
replacement_dict = {'apple': 'Apple', 'banana': 'Banana', 'red': 'Red'}
df_replaced = df.replace(replacement_dict)

print("\nAfter replacing with dictionary:")
print(df_replaced)

Output:

Original DataFrame:
fruits color
0 apple red
1 banana yellow
2 orange orange
3 strawberry red
4 kiwi green

After replacing with dictionary:
fruits color
0 Apple Red
1 Banana yellow
2 orange orange
3 strawberry Red
4 kiwi green

Column-Specific Replacements

You can also apply different replacements to different columns:

python
# Create a sample DataFrame
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]
})

print("Original DataFrame:")
print(df)

# Apply different replacements to different columns
replacements = {'A': {1: 100, 2: 200}, 'B': {5: 500, 7: 700}}
df_replaced = df.replace(replacements)

print("\nAfter column-specific replacements:")
print(df_replaced)

Output:

Original DataFrame:
A B
0 1 5
1 2 6
2 3 7
3 4 8

After column-specific replacements:
A B
0 100 500
1 200 6
2 3 700
3 4 8

Replacing Values Using Regular Expressions

Pandas also allows you to replace values using regular expressions, which is powerful for pattern-based replacements:

python
import pandas as pd

# Create a DataFrame with text data
df = pd.DataFrame({
'text': ['Phone: 123-456-7890', 'Email: [email protected]',
'Date: 2022-01-15', 'Phone: 987-654-3210']
})

print("Original DataFrame:")
print(df)

# Replace phone number pattern with "PHONE_HIDDEN"
df_replaced = df.replace(to_replace=r'Phone: \d{3}-\d{3}-\d{4}', value='Phone: HIDDEN', regex=True)

print("\nAfter regex replacement:")
print(df_replaced)

Output:

Original DataFrame:
text
0 Phone: 123-456-7890
1 Email: [email protected]
2 Date: 2022-01-15
3 Phone: 987-654-3210

After regex replacement:
text
0 Phone: HIDDEN
1 Email: [email protected]
2 Date: 2022-01-15
3 Phone: HIDDEN

Replacing Missing Values

Pandas provides several ways to replace missing values (NaN):

Using fillna()

The most common way to replace missing values is using the fillna() method:

python
import pandas as pd
import numpy as np

# Create a DataFrame with missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [9, 10, 11, 12]
})

print("DataFrame with missing values:")
print(df)

# Replace NaN with a specific value
df_filled = df.fillna(0)

print("\nAfter replacing NaN with 0:")
print(df_filled)

# Replace NaN with different values for each column
df_column_filled = df.fillna({'A': 0, 'B': -1})

print("\nAfter replacing NaN with column-specific values:")
print(df_column_filled)

Output:

DataFrame with missing values:
A B C
0 1.0 5.0 9
1 2.0 NaN 10
2 NaN NaN 11
3 4.0 8.0 12

After replacing NaN with 0:
A B C
0 1.0 5.0 9
1 2.0 0.0 10
2 0.0 0.0 11
3 4.0 8.0 12

After replacing NaN with column-specific values:
A B C
0 1.0 5.0 9
1 2.0 -1.0 10
2 0.0 -1.0 11
3 4.0 8.0 12

Fill Methods

The fillna() method supports different strategies for filling missing values:

python
# Create a Series with missing values
s = pd.Series([1, np.nan, np.nan, 4, 5, np.nan])

print("Original Series with NaN values:")
print(s)

# Fill forward (use previous valid value)
print("\nFill forward (ffill):")
print(s.fillna(method='ffill'))

# Fill backward (use next valid value)
print("\nFill backward (bfill):")
print(s.fillna(method='bfill'))

# Fill with mean
print("\nFill with mean:")
print(s.fillna(s.mean()))

Output:

Original Series with NaN values:
0 1.0
1 NaN
2 NaN
3 4.0
4 5.0
5 NaN
dtype: float64

Fill forward (ffill):
0 1.0
1 1.0
2 1.0
3 4.0
4 5.0
5 5.0
dtype: float64

Fill backward (bfill):
0 1.0
1 4.0
2 4.0
3 4.0
4 5.0
5 NaN
dtype: float64

Fill with mean:
0 1.0
1 3.3333
2 3.3333
3 4.0
4 5.0
5 3.3333
dtype: float64

Real-World Application: Data Cleaning

Let's apply what we've learned to a real-world data cleaning scenario:

python
import pandas as pd
import numpy as np

# Create a dataset with common issues
data = {
'customer_id': [1, 2, 3, 4, 5],
'name': ['John Smith', 'Jane doe', np.nan, 'Robert JOHNSON', 'Susan Brown'],
'age': [34, -1, 45, 999, 28],
'status': ['active', 'ACTIVE', 'inactive', 'Active', 'unknown'],
'joining_date': ['2022-01-15', '01/15/2022', 'Jan 15, 2022', '2022/01/15', np.nan]
}

df = pd.DataFrame(data)

print("Original DataFrame with issues:")
print(df)

# 1. Fix the names (standardize capitalization, fix extra spaces)
df['name'] = df['name'].str.title().str.strip().replace({np.nan: 'Unknown'})

# 2. Fix age values (replace impossible values)
df['age'] = df['age'].replace({-1: np.nan, 999: np.nan}).fillna(df['age'].median())

# 3. Standardize status values
df['status'] = df['status'].str.lower().replace({'unknown': 'inactive'})

# 4. Standardize dates (in a real scenario, you'd use pd.to_datetime)
# For simplicity, we'll just replace with a consistent format
date_replacements = {
'01/15/2022': '2022-01-15',
'Jan 15, 2022': '2022-01-15',
'2022/01/15': '2022-01-15'
}
df['joining_date'] = df['joining_date'].replace(date_replacements).fillna('Unknown')

print("\nCleaned DataFrame:")
print(df)

Output:

Original DataFrame with issues:
customer_id name age status joining_date
0 1 John Smith 34 active 2022-01-15
1 2 Jane doe -1 ACTIVE 01/15/2022
2 3 NaN 45 inactive Jan 15, 2022
3 4 Robert JOHNSON 999 Active 2022/01/15
4 5 Susan Brown 28 unknown NaN

Cleaned DataFrame:
customer_id name age status joining_date
0 1 John Smith 34.00 active 2022-01-15
1 2 Jane Doe 34.00 active 2022-01-15
2 3 Unknown 45.00 inactive 2022-01-15
3 4 Robert Johnson 34.00 active 2022-01-15
4 5 Susan Brown 28.00 inactive Unknown

Summary

In this tutorial, we've learned how to replace values in pandas DataFrames and Series using various techniques:

  • Basic value replacement using replace()
  • Replacing multiple values at once
  • Using dictionaries for flexible replacements
  • Column-specific replacements
  • Pattern-based replacements using regular expressions
  • Handling missing values with fillna()
  • Different strategies for filling missing values

Value replacement is an essential part of the data cleaning process that helps to:

  • Standardize data formats
  • Handle missing or invalid values
  • Fix errors in the data
  • Prepare data for analysis

By mastering these techniques, you'll be able to clean and transform your data more effectively for analysis and modeling.

Exercises

  1. Create a DataFrame with a 'rating' column that contains values from 1 to 5, but also has some values as strings ('good', 'bad', 'average'). Replace these string values with appropriate numeric ratings.

  2. Create a DataFrame with customer information that includes inconsistent state abbreviations (e.g., 'New York', 'NY', 'N.Y.'). Standardize all state abbreviations to two-letter codes.

  3. You have a temperature dataset with values in both Celsius and Fahrenheit, indicated by a 'C' or 'F' suffix. Replace all temperatures to be in Celsius only.

Additional Resources



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)