Pandas Duplicates

When working with real-world datasets, duplicate data is a common issue that can skew your analysis and lead to incorrect conclusions. Pandas provides powerful tools to identify and handle duplicate entries in your DataFrames. This guide will walk you through the process of detecting and managing duplicates effectively.

Introduction to Duplicates in Data

Duplicate data occurs when the same information appears multiple times in your dataset. This can happen due to:

  • Data entry errors
  • System glitches during data collection
  • Improper merging of multiple data sources
  • Repeated submissions in forms or surveys

Cleaning duplicates is an essential step in the data preparation process that helps maintain data integrity and improve the accuracy of your analysis.

Identifying Duplicates

Using duplicated() Method

The duplicated() method helps you identify rows that are duplicates of previously occurring rows in your DataFrame.

python
import pandas as pd

# Sample DataFrame with duplicates
data = {
    'Name': ['John', 'Anna', 'Peter', 'John', 'Linda'],
    'Age': [28, 24, 35, 28, 32],
    'City': ['New York', 'Paris', 'Berlin', 'New York', 'Rome']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Check for duplicate rows
duplicates = df.duplicated()
print("\nDuplicate rows (True indicates duplicate):")
print(duplicates)

Output:

Original DataFrame:
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3   John   28  New York
4  Linda   32      Rome

Duplicate rows (True indicates duplicate):
0    False
1    False
2    False
3     True
4    False
dtype: bool

By default, duplicated() marks the second and subsequent occurrences of a row as True. The first occurrence is marked as False.
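
Since duplicated() returns a boolean Series, you can sum it for a quick count of flagged rows (a small check using the df above):

python
# True counts as 1, so the sum is the number of duplicate rows
print(df.duplicated().sum())  # 1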

Customizing Duplicate Detection

You can customize how duplicates are identified with these parameters:

  1. subset: Check for duplicates in specific columns only
  2. keep: Control which occurrence goes unmarked: 'first' (the default) leaves the first occurrence unmarked, 'last' leaves the last, and False marks every occurrence
python
# Check duplicates based only on the 'Name' column
name_dupes = df.duplicated(subset=['Name'])
print("\nDuplicates based on 'Name' column only:")
print(name_dupes)

# Mark all duplicates including first occurrence
all_dupes = df.duplicated(keep=False)
print("\nAll duplicates (including first occurrence):")
print(all_dupes)

# Mark earlier occurrences as duplicates, keeping the last unmarked
last_dupes = df.duplicated(keep='last')
print("\nDuplicates (keeping last occurrence):")
print(last_dupes)

Output:

Duplicates based on 'Name' column only:
0    False
1    False
2    False
3     True
4    False
dtype: bool

All duplicates (including first occurrence):
0     True
1    False
2    False
3     True
4    False
dtype: bool

Duplicates (keeping last occurrence):
0     True
1    False
2    False
3    False
4    False
dtype: bool
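
The two parameters can also be combined. For instance, to flag every row whose 'Name' appears more than once, including first occurrences (a short sketch reusing the df above):

python
# Combine subset and keep: flag all rows that share a 'Name'
name_dupes_all = df.duplicated(subset=['Name'], keep=False)
print(name_dupes_all)
# 0     True
# 1    False
# 2    False
# 3     True
# 4    False
# dtype: bool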

Finding and Displaying Duplicate Rows

To see the actual duplicate records, you can combine duplicated() with boolean indexing:

python
# Display all duplicate rows
duplicate_rows = df[df.duplicated(keep=False)]
print("\nAll duplicate rows:")
print(duplicate_rows)

# Display only the first occurrences of duplicates
first_dupes = df[df.duplicated(keep='last')]
print("\nFirst occurrences of duplicates:")
print(first_dupes)

# Display only the second+ occurrences of duplicates
second_dupes = df[df.duplicated(keep='first')]
print("\nSecond+ occurrences of duplicates:")
print(second_dupes)

Output:

All duplicate rows:
   Name  Age      City
0  John   28  New York
3  John   28  New York

First occurrences of duplicates:
   Name  Age      City
0  John   28  New York

Second+ occurrences of duplicates:
   Name  Age      City
3  John   28  New York
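
On larger datasets it often helps to sort the flagged rows so that copies of the same record appear side by side. A small sketch with the df above (only one name repeats here, but the pattern scales):

python
# Sort flagged rows so duplicates of the same 'Name' sit together
grouped_dupes = df[df.duplicated(subset=['Name'], keep=False)].sort_values('Name')
print(grouped_dupes)
#    Name  Age      City
# 0  John   28  New York
# 3  John   28  New York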

Removing Duplicates with drop_duplicates()

The drop_duplicates() method removes duplicate rows from your DataFrame:

python
# Remove duplicate rows (keeps first occurrence by default)
df_no_dupes = df.drop_duplicates()
print("\nDataFrame after removing duplicates (keeping first occurrence):")
print(df_no_dupes)

# Remove duplicates keeping last occurrence
df_last = df.drop_duplicates(keep='last')
print("\nDataFrame after removing duplicates (keeping last occurrence):")
print(df_last)

# Remove duplicates based on specific columns
df_subset = df.drop_duplicates(subset=['Name'])
print("\nDataFrame after removing duplicates based on 'Name' column:")
print(df_subset)

Output:

DataFrame after removing duplicates (keeping first occurrence):
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
4  Linda   32      Rome

DataFrame after removing duplicates (keeping last occurrence):
    Name  Age      City
1   Anna   24     Paris
2  Peter   35    Berlin
3   John   28  New York
4  Linda   32      Rome

DataFrame after removing duplicates based on 'Name' column:
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
4  Linda   32      Rome
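
Like duplicated(), drop_duplicates() also accepts keep=False, which removes every copy of a duplicated row rather than retaining one:

python
# Drop all copies of duplicated rows, keeping none of them
df_unique_only = df.drop_duplicates(keep=False)
print(df_unique_only)
#     Name  Age    City
# 1   Anna   24   Paris
# 2  Peter   35  Berlin
# 4  Linda   32    Rome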

In-place Modification

To modify the original DataFrame rather than creating a new one, pass inplace=True. Note that the method then returns None and alters df directly:

python
# Remove duplicates in-place
df.drop_duplicates(inplace=True)
print("\nOriginal DataFrame after in-place duplicate removal:")
print(df)

Output:

Original DataFrame after in-place duplicate removal:
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
4  Linda   32      Rome
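
Notice the gap in the index above (0, 1, 2, 4). If you want a clean sequential index after removing rows, reset it:

python
# Rebuild a sequential 0..n-1 index; drop=True discards the old one
df = df.reset_index(drop=True)
print(df)
#     Name  Age      City
# 0   John   28  New York
# 1   Anna   24     Paris
# 2  Peter   35    Berlin
# 3  Linda   32      Rome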

Real-World Example: Cleaning Customer Data

Let's walk through a more practical example of handling duplicates in a customer database:

python
# Customer data with potential duplicates
customer_data = {
    'customer_id': [101, 102, 101, 103, 104, 105, 103],
    'name': ['Alice Smith', 'Bob Jones', 'Alice Smith', 'Charlie Brown', 'Dana White', 'Eve Black', 'Charlie Brown'],
    'email': ['[email protected]', '[email protected]', '[email protected]',
              '[email protected]', '[email protected]', '[email protected]', '[email protected]'],
    'purchase_amount': [150.50, 200.75, 150.50, 50.25, 300.00, 125.75, 75.50]
}

customers_df = pd.DataFrame(customer_data)
print("Customer Database:")
print(customers_df)

# Count total duplicates
total_dupes = customers_df.duplicated().sum()
print(f"\nTotal number of duplicate rows: {total_dupes}")

# Find duplicates based on customer_id
id_dupes = customers_df.duplicated(subset=['customer_id'], keep=False)
print("\nDuplicates based on customer_id:")
print(customers_df[id_dupes])

# Find exact duplicates (all columns are the same)
exact_dupes = customers_df.duplicated(keep=False)
print("\nExact duplicates (all columns):")
print(customers_df[exact_dupes])

# Remove duplicates while keeping the first occurrence for each customer_id
clean_customers = customers_df.drop_duplicates(subset=['customer_id'])
print("\nCleaned customer database (unique customer_ids):")
print(clean_customers)

Output:

Customer Database:
   customer_id           name              email  purchase_amount
0          101    Alice Smith  [email protected]           150.50
1          102      Bob Jones  [email protected]           200.75
2          101    Alice Smith  [email protected]           150.50
3          103  Charlie Brown  [email protected]            50.25
4          104     Dana White  [email protected]           300.00
5          105      Eve Black  [email protected]           125.75
6          103  Charlie Brown  [email protected]            75.50

Total number of duplicate rows: 1

Duplicates based on customer_id:
   customer_id           name              email  purchase_amount
0          101    Alice Smith  [email protected]           150.50
2          101    Alice Smith  [email protected]           150.50
3          103  Charlie Brown  [email protected]            50.25
6          103  Charlie Brown  [email protected]            75.50

Exact duplicates (all columns):
   customer_id           name              email  purchase_amount
0          101    Alice Smith  [email protected]           150.50
2          101    Alice Smith  [email protected]           150.50

Cleaned customer database (unique customer_ids):
   customer_id           name              email  purchase_amount
0          101    Alice Smith  [email protected]           150.50
1          102      Bob Jones  [email protected]           200.75
3          103  Charlie Brown  [email protected]            50.25
4          104     Dana White  [email protected]           300.00
5          105      Eve Black  [email protected]           125.75
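
As a quick data-quality check, the mean of the boolean mask gives the fraction of rows that are exact duplicates (a small sketch reusing customers_df):

python
# Percentage of rows that exactly duplicate an earlier row
dup_pct = customers_df.duplicated().mean() * 100
print(f"{dup_pct:.1f}% of rows are exact duplicates")  # 14.3% (1 of 7)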

Handling Special Cases

In the real world, you might need more nuanced approaches to duplicate handling:

python
# Keep the row with the highest purchase amount for each customer_id
best_customers = (customers_df
                  .sort_values('purchase_amount', ascending=False)
                  .drop_duplicates(subset=['customer_id']))
print("\nCustomers with highest purchase amounts:")
print(best_customers)

# Aggregate data for duplicate customers (sum purchase amounts)
# Group by customer_id and take first value for name and email while summing purchase_amount
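# Caution: the exact duplicate row for customer 101 is summed twice here;
# see the note after the output below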
aggregated_customers = customers_df.groupby('customer_id').agg({
    'name': 'first',
    'email': 'first',
    'purchase_amount': 'sum'
}).reset_index()

print("\nAggregated customer data (summed purchases):")
print(aggregated_customers)

Output:

Customers with highest purchase amounts:
   customer_id           name              email  purchase_amount
4          104     Dana White  [email protected]           300.00
1          102      Bob Jones  [email protected]           200.75
0          101    Alice Smith  [email protected]           150.50
5          105      Eve Black  [email protected]           125.75
6          103  Charlie Brown  [email protected]            75.50

Aggregated customer data (summed purchases):
   customer_id           name              email  purchase_amount
0          101    Alice Smith  [email protected]           301.00
1          102      Bob Jones  [email protected]           200.75
2          103  Charlie Brown  [email protected]           125.75
3          104     Dana White  [email protected]           300.00
4          105      Eve Black  [email protected]           125.75
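
Note that customer 101's total above is 301.00 because the exact duplicate row was summed twice. If identical rows are data-entry errors rather than genuine repeat transactions, drop them before aggregating:

python
# Remove exact duplicates first, then sum per customer
deduped_totals = customers_df.drop_duplicates().groupby('customer_id').agg({
    'name': 'first',
    'email': 'first',
    'purchase_amount': 'sum'
}).reset_index()
print(deduped_totals)  # customer 101 now shows 150.50 instead of 301.00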

Summary

Handling duplicates is a crucial aspect of data cleaning with pandas. In this guide, we covered:

  • How to identify duplicates using duplicated()
  • Customizing duplicate detection with parameters like subset and keep
  • Removing duplicates with drop_duplicates()
  • Practical examples of handling duplicates in real-world data

By properly managing duplicates in your datasets, you can ensure more accurate analyses and reliable results.

Exercises

  1. Create a DataFrame with student records that contain some duplicate entries. Remove duplicates based on student ID while keeping the record with the highest score.

  2. Import a CSV file of your choice and identify what percentage of the data consists of duplicates.

  3. Given a DataFrame with customer transactions, identify customers who have made multiple purchases on the same day and consolidate their transactions.

  4. Create a function that takes a DataFrame and returns a report of duplicate counts by different columns.

  5. Use the subset parameter to find partial duplicates in a complex dataset, such as records with the same name but different contact information.
