Pandas Duplicates
When working with real-world datasets, duplicate data is a common issue that can skew your analysis and lead to incorrect conclusions. Pandas provides powerful tools to identify and handle duplicate entries in your DataFrames. This guide will walk you through the process of detecting and managing duplicates effectively.
Introduction to Duplicates in Data
Duplicate data occurs when the same information appears multiple times in your dataset. This can happen due to:
- Data entry errors
- System glitches during data collection
- Improper merging of multiple data sources
- Repeated submissions in forms or surveys
Cleaning duplicates is an essential step in the data preparation process that helps maintain data integrity and improve the accuracy of your analysis.
Identifying Duplicates
Using the duplicated() Method
The duplicated() method helps you identify rows that are duplicates of previously occurring rows in your DataFrame.
import pandas as pd
# Sample DataFrame with duplicates
data = {
'Name': ['John', 'Anna', 'Peter', 'John', 'Linda'],
'Age': [28, 24, 35, 28, 32],
'City': ['New York', 'Paris', 'Berlin', 'New York', 'Rome']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Check for duplicate rows
duplicates = df.duplicated()
print("\nDuplicate rows (True indicates duplicate):")
print(duplicates)
Output:
Original DataFrame:
Name Age City
0 John 28 New York
1 Anna 24 Paris
2 Peter 35 Berlin
3 John 28 New York
4 Linda 32 Rome
Duplicate rows (True indicates duplicate):
0 False
1 False
2 False
3 True
4 False
dtype: bool
By default, duplicated() marks the second and subsequent occurrences of a row as True. The first occurrence is marked as False.
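Because duplicated() returns a boolean Series, you can count duplicate rows directly with sum(). A minimal sketch using the df defined above:
# Count duplicate rows (True values count as 1 when summed)
num_dupes = df.duplicated().sum()
print(f"Number of duplicate rows: {num_dupes}")  # prints 1 for this sample data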
Customizing Duplicate Detection
You can customize how duplicates are identified with these parameters:
- subset: Check for duplicates in specific columns only
- keep: Control which duplicates are marked ('first', 'last', or False)
# Check duplicates based only on the 'Name' column
name_dupes = df.duplicated(subset=['Name'])
print("\nDuplicates based on 'Name' column only:")
print(name_dupes)
# Mark all duplicates including first occurrence
all_dupes = df.duplicated(keep=False)
print("\nAll duplicates (including first occurrence):")
print(all_dupes)
# Mark all but the last occurrence as duplicates
last_dupes = df.duplicated(keep='last')
print("\nDuplicates (keeping last occurrence):")
print(last_dupes)
Output:
Duplicates based on 'Name' column only:
0 False
1 False
2 False
3 True
4 False
dtype: bool
All duplicates (including first occurrence):
0 True
1 False
2 False
3 True
4 False
dtype: bool
Duplicates (keeping last occurrence):
0 True
1 False
2 False
3 False
4 False
dtype: bool
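The subset and keep parameters can also be combined. As a quick sketch using the same df, this flags every row whose 'Name' value appears more than once:
# Flag all rows sharing a 'Name' with any other row
name_dupes_all = df.duplicated(subset=['Name'], keep=False)
print(df[name_dupes_all])  # shows both 'John' rows (index 0 and 3)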
Finding and Displaying Duplicate Rows
To see the actual duplicate records, you can combine duplicated() with boolean indexing:
# Display all duplicate rows
duplicate_rows = df[df.duplicated(keep=False)]
print("\nAll duplicate rows:")
print(duplicate_rows)
# Display only the first occurrences of duplicates
first_dupes = df[df.duplicated(keep='last')]
print("\nFirst occurrences of duplicates:")
print(first_dupes)
# Display only the second+ occurrences of duplicates
second_dupes = df[df.duplicated(keep='first')]
print("\nSecond+ occurrences of duplicates:")
print(second_dupes)
Output:
All duplicate rows:
Name Age City
0 John 28 New York
3 John 28 New York
First occurrences of duplicates:
Name Age City
0 John 28 New York
Second+ occurrences of duplicates:
Name Age City
3 John 28 New York
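On larger DataFrames, it helps to sort the flagged rows so that each group of duplicates sits together for inspection. A minimal sketch:
# Group duplicate rows next to each other for easier visual comparison
duplicate_rows = df[df.duplicated(keep=False)].sort_values(by=list(df.columns))
print(duplicate_rows)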
Removing Duplicates with drop_duplicates()
The drop_duplicates() method removes duplicate rows from your DataFrame:
# Remove duplicate rows (keeps first occurrence by default)
df_no_dupes = df.drop_duplicates()
print("\nDataFrame after removing duplicates (keeping first occurrence):")
print(df_no_dupes)
# Remove duplicates keeping last occurrence
df_last = df.drop_duplicates(keep='last')
print("\nDataFrame after removing duplicates (keeping last occurrence):")
print(df_last)
# Remove duplicates based on specific columns
df_subset = df.drop_duplicates(subset=['Name'])
print("\nDataFrame after removing duplicates based on 'Name' column:")
print(df_subset)
Output:
DataFrame after removing duplicates (keeping first occurrence):
Name Age City
0 John 28 New York
1 Anna 24 Paris
2 Peter 35 Berlin
4 Linda 32 Rome
DataFrame after removing duplicates (keeping last occurrence):
Name Age City
1 Anna 24 Paris
2 Peter 35 Berlin
3 John 28 New York
4 Linda 32 Rome
DataFrame after removing duplicates based on 'Name' column:
Name Age City
0 John 28 New York
1 Anna 24 Paris
2 Peter 35 Berlin
4 Linda 32 Rome
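Notice that drop_duplicates() preserves the original index, which is why index 3 is missing above. If you want a clean sequential index, pandas 1.0+ accepts ignore_index; a quick sketch:
# Drop duplicates and renumber the index from 0
df_reset = df.drop_duplicates(ignore_index=True)
print(df_reset)
# Equivalent for older pandas versions:
# df_reset = df.drop_duplicates().reset_index(drop=True)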
In-place Modification
To modify the original DataFrame rather than creating a new one, use the inplace parameter:
# Remove duplicates in-place
df.drop_duplicates(inplace=True)
print("\nOriginal DataFrame after in-place duplicate removal:")
print(df)
Output:
Original DataFrame after in-place duplicate removal:
Name Age City
0 John 28 New York
1 Anna 24 Paris
2 Peter 35 Berlin
4 Linda 32 Rome
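An equivalent pattern, often preferred in modern pandas code because it works with method chaining, is to reassign the result instead of using inplace=True:
# Reassignment achieves the same effect as inplace=True
df = df.drop_duplicates()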
Real-World Example: Cleaning Customer Data
Let's walk through a more practical example of handling duplicates in a customer database:
# Customer data with potential duplicates
customer_data = {
'customer_id': [101, 102, 101, 103, 104, 105, 103],
'name': ['Alice Smith', 'Bob Jones', 'Alice Smith', 'Charlie Brown', 'Dana White', 'Eve Black', 'Charlie Brown'],
'email': ['[email protected]', '[email protected]', '[email protected]',
'[email protected]', '[email protected]', '[email protected]', '[email protected]'],
'purchase_amount': [150.50, 200.75, 150.50, 50.25, 300.00, 125.75, 75.50]
}
customers_df = pd.DataFrame(customer_data)
print("Customer Database:")
print(customers_df)
# Count total duplicates
total_dupes = customers_df.duplicated().sum()
print(f"\nTotal number of duplicate rows: {total_dupes}")
# Find duplicates based on customer_id
id_dupes = customers_df.duplicated(subset=['customer_id'], keep=False)
print("\nDuplicates based on customer_id:")
print(customers_df[id_dupes])
# Find exact duplicates (all columns are the same)
exact_dupes = customers_df.duplicated(keep=False)
print("\nExact duplicates (all columns):")
print(customers_df[exact_dupes])
# Remove duplicates while keeping the first occurrence for each customer_id
clean_customers = customers_df.drop_duplicates(subset=['customer_id'])
print("\nCleaned customer database (unique customer_ids):")
print(clean_customers)
Output:
Customer Database:
customer_id name email purchase_amount
0 101 Alice Smith [email protected] 150.50
1 102 Bob Jones [email protected] 200.75
2 101 Alice Smith [email protected] 150.50
3 103 Charlie Brown [email protected] 50.25
4 104 Dana White [email protected] 300.00
5 105 Eve Black [email protected] 125.75
6 103 Charlie Brown [email protected] 75.50
Total number of duplicate rows: 1
Duplicates based on customer_id:
customer_id name email purchase_amount
0 101 Alice Smith [email protected] 150.50
2 101 Alice Smith [email protected] 150.50
3 103 Charlie Brown [email protected] 50.25
6 103 Charlie Brown [email protected] 75.50
Exact duplicates (all columns):
customer_id name email purchase_amount
0 101 Alice Smith [email protected] 150.50
2 101 Alice Smith [email protected] 150.50
Cleaned customer database (unique customer_ids):
customer_id name email purchase_amount
0 101 Alice Smith [email protected] 150.50
1 102 Bob Jones [email protected] 200.75
3 103 Charlie Brown [email protected] 50.25
4 104 Dana White [email protected] 300.00
5 105 Eve Black [email protected] 125.75
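Another quick way to spot duplicated IDs is value_counts(), which shows how many times each customer_id appears. A minimal sketch against the same customers_df:
# Count occurrences of each customer_id; counts above 1 indicate duplicates
id_counts = customers_df['customer_id'].value_counts()
print(id_counts[id_counts > 1])  # 101 and 103 each appear twice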
Handling Special Cases
In the real world, you might need more nuanced approaches to duplicate handling:
# Keep row with maximum purchase amount when duplicates exist by customer_id
best_customers = customers_df.sort_values('purchase_amount', ascending=False).drop_duplicates(subset=['customer_id'])
print("\nCustomers with highest purchase amounts:")
print(best_customers)
# Aggregate data for duplicate customers (sum purchase amounts)
# Group by customer_id and take first value for name and email while summing purchase_amount
aggregated_customers = customers_df.groupby('customer_id').agg({
'name': 'first',
'email': 'first',
'purchase_amount': 'sum'
}).reset_index()
print("\nAggregated customer data (summed purchases):")
print(aggregated_customers)
Output:
Customers with highest purchase amounts:
customer_id name email purchase_amount
4 104 Dana White [email protected] 300.00
1 102 Bob Jones [email protected] 200.75
0 101 Alice Smith [email protected] 150.50
5 105 Eve Black [email protected] 125.75
6 103 Charlie Brown [email protected] 75.50
Aggregated customer data (summed purchases):
customer_id name email purchase_amount
0 101 Alice Smith [email protected] 301.00
1 102 Bob Jones [email protected] 200.75
2 103 Charlie Brown [email protected] 125.75
3 104 Dana White [email protected] 300.00
4 105 Eve Black [email protected] 125.75
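As an alternative to sorting and then dropping, you can select the row with the maximum purchase_amount per customer_id directly with idxmax(). This sketch produces the same rows as best_customers above:
# For each customer_id, locate the index of the row with the largest purchase
best_idx = customers_df.groupby('customer_id')['purchase_amount'].idxmax()
best_customers_alt = customers_df.loc[best_idx]
print(best_customers_alt)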
Summary
Handling duplicates is a crucial aspect of data cleaning with pandas. In this guide, we covered:
- How to identify duplicates using duplicated()
- Customizing duplicate detection with parameters like subset and keep
- Removing duplicates with drop_duplicates()
- Practical examples of handling duplicates in real-world data
By properly managing duplicates in your datasets, you can ensure more accurate analyses and reliable results.
Exercises
- Create a DataFrame with student records that contain some duplicate entries. Remove duplicates based on student ID while keeping the record with the highest score.
- Import a CSV file of your choice and identify what percentage of the data consists of duplicates.
- Given a DataFrame with customer transactions, identify customers who have made multiple purchases on the same day and consolidate their transactions.
- Create a function that takes a DataFrame and returns a report of duplicate counts by different columns.
- Use the subset parameter to find partial duplicates in a complex dataset, such as records with the same name but different contact information.